The Metropolis-Hastings algorithm is one of the most popular Markov Chain Monte Carlo (MCMC) algorithms.
Like other MCMC methods, the Metropolis-Hastings algorithm is used to generate serially correlated draws from a sequence of probability distributions that converges to a given target distribution.
Before reading this lecture, you should review the basics of Markov chains and MCMC.
In particular, you should keep in mind that an MCMC algorithm generates a random sequence having the following properties:
it is a Markov chain (given $x_n$, the subsequent observations $x_{n+1}, x_{n+2}, \ldots$ are conditionally independent of the previous observations $x_{n-j}, \ldots, x_{n-1}$, for any positive integers $n$ and $j$);
two observations $x_n$ and $x_{n+k}$ are in general not independent, but they become closer and closer to being independent as $k$ increases;
the sequence converges to the target distribution (the larger $n$ is, the more the distribution of $x_n$ is similar to the target distribution).
When we run an MCMC algorithm for $T$ periods, we obtain an MCMC sample made up of the realizations of the first $T$ terms of the chain: $$x_1, x_2, \ldots, x_T.$$
Then, we use the empirical distribution of the sample to approximate the target distribution.
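For example, the expected value of any function $h$ of the random variable of interest can be approximated by the corresponding sample mean: $$\operatorname{E}[h(X)] \approx \frac{1}{T}\sum_{t=1}^{T} h(x_t).$$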
Let $f(x)$ be the probability density (or probability mass) function of the distribution from which we wish to extract a sample of draws. We call $f$ the target distribution.
Denote by $q(y \mid x)$ a family of conditional distributions of our choice, from which it is easy to generate draws.
The vectors $x$ and $y$, and the argument of $f$, all have the same dimension.
The Metropolis-Hastings algorithm starts from any value $x_1$ belonging to the support of the target distribution. The value $x_1$ can be user-defined or extracted from a given distribution.
Then, the subsequent values $x_2, x_3, \ldots$ are generated recursively.
In particular, the value $x_t$ at time step $t$ is generated as follows:
draw $y_t$ from the distribution with density $q(y \mid x_{t-1})$;
set $$\alpha_t = \min\left(1, \frac{f(y_t)\, q(x_{t-1} \mid y_t)}{f(x_{t-1})\, q(y_t \mid x_{t-1})}\right);$$
draw $u_t$ from a uniform distribution on $[0,1]$;
if $u_t \leq \alpha_t$, set $x_t = y_t$; otherwise, set $x_t = x_{t-1}$.
Since $u_t$ is uniform on $[0,1]$, $$\Pr(u_t \leq \alpha_t) = \alpha_t,$$ that is, the probability of accepting the proposal $y_t$ as the new draw $x_t$ is equal to $\alpha_t$.
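To make the steps above concrete, here is a minimal Python sketch of the algorithm. The names (`metropolis_hastings`, `q_sample`, `q_density`) are our own illustrative choices, not part of any standard library.

```python
import numpy as np

def metropolis_hastings(f, q_sample, q_density, x1, T, rng=None):
    """Generate T serially correlated draws from the target density f.

    f         : target density (it may be known only up to a constant)
    q_sample  : q_sample(x) draws a proposal y from q(. | x)
    q_density : q_density(y, x) evaluates the proposal density q(y | x)
    x1        : starting value, in the support of f
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x1
    sample = [x]
    for _ in range(T - 1):
        y = q_sample(x)                      # step 1: draw the proposal
        ratio = f(y) * q_density(x, y) / (f(x) * q_density(y, x))
        alpha = min(1.0, ratio)              # step 2: acceptance probability
        u = rng.uniform()                    # step 3: uniform draw on [0, 1]
        x = y if u <= alpha else x           # step 4: accept or reject
        sample.append(x)
    return np.asarray(sample)
```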
The following terminology is used:
the distribution $q(y \mid x_{t-1})$ is called the proposal distribution;
the draw $y_t$ is called the proposal;
the probability $\alpha_t$ is called the acceptance probability;
when $u_t \leq \alpha_t$ and $x_t = y_t$, we say that the proposal is accepted;
when $u_t > \alpha_t$ and $x_t = x_{t-1}$, we say that the proposal is rejected.
If $q(y \mid x) = q(x \mid y)$ for any values of $x$ and $y$, then we say that the proposal distribution is symmetric.
In this special case, the acceptance probability is $$\alpha_t = \min\left(1, \frac{f(y_t)}{f(x_{t-1})}\right)$$ and the algorithm is called the Metropolis algorithm.
This simpler version of the algorithm was invented by Nicholas Metropolis.
We do not give a fully rigorous proof of convergence, but we provide the main intuitions.
The crucial step is to prove that the so-called detailed balance condition holds (see Markov chains), which implies that the target distribution is the stationary distribution of the Markov chain generated by the Metropolis-Hastings algorithm.
Denote by $k(x_t \mid x_{t-1})$ the transition kernel of the Markov chain, that is, the probability density of a transition from $x_{t-1}$ to $x_t$.
When $x_t \neq x_{t-1}$, then $x_t = y_t$ (the transition can happen only if the proposal is accepted) and we have $$f(x_{t-1})\, k(x_t \mid x_{t-1}) = f(x_{t-1})\, q(x_t \mid x_{t-1}) \min\left(1, \frac{f(x_t)\, q(x_{t-1} \mid x_t)}{f(x_{t-1})\, q(x_t \mid x_{t-1})}\right) = \min\Big(f(x_{t-1})\, q(x_t \mid x_{t-1}),\; f(x_t)\, q(x_{t-1} \mid x_t)\Big).$$
Symmetrically, when the transition is from $x_t$ to $x_{t-1}$ and $x_t \neq x_{t-1}$, we have $$f(x_t)\, k(x_{t-1} \mid x_t) = \min\Big(f(x_t)\, q(x_{t-1} \mid x_t),\; f(x_{t-1})\, q(x_t \mid x_{t-1})\Big).$$
Therefore, when $x_t \neq x_{t-1}$, the target density $f$ and the transition kernel $k$ of the chain satisfy the detailed balance condition $$f(x_{t-1})\, k(x_t \mid x_{t-1}) = f(x_t)\, k(x_{t-1} \mid x_t).$$
When $x_t = x_{t-1}$, the detailed balance condition is trivially satisfied because $f(x_t) = f(x_{t-1})$ and $k(x_t \mid x_{t-1}) = k(x_{t-1} \mid x_t)$.
By putting together the two cases ($x_t \neq x_{t-1}$ and $x_t = x_{t-1}$), we find that the detailed balance condition is always satisfied. As a consequence, $f$ is the stationary distribution of the chain.
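The detailed balance condition can also be checked numerically on a small discrete example. The following sketch (with an arbitrary three-state target and proposal, chosen purely for illustration) builds the Metropolis-Hastings transition kernel and verifies both detailed balance and stationarity.

```python
import numpy as np

# Arbitrary three-state target f and proposal matrix q (rows sum to 1);
# q[x, y] is the probability of proposing y when the current state is x.
f = np.array([0.2, 0.5, 0.3])
q = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.2, 0.4],
              [0.1, 0.6, 0.3]])

n = len(f)
k = np.zeros((n, n))                 # transition kernel of the MH chain
for x in range(n):
    for y in range(n):
        if x != y:
            alpha = min(1.0, f[y] * q[y, x] / (f[x] * q[x, y]))
            k[x, y] = q[x, y] * alpha
    k[x, x] = 1.0 - k[x].sum()       # rejected proposals keep the chain at x

m = f[:, None] * k                   # m[x, y] = f(x) k(y | x)
assert np.allclose(m, m.T)           # detailed balance: f(x)k(y|x) = f(y)k(x|y)
assert np.allclose(f @ k, f)         # hence f is the stationary distribution
```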
When we proved the detailed balance condition, we omitted an important detail: the proposal distribution must be able to generate all the values belonging to the support of the target distribution.
This is intuitive: if the proposal distribution never generates a value to which the target distribution assigns a positive probability, then certainly the stationary distribution of the chain cannot be equal to the target distribution.
From a technical viewpoint, if a value $y$ belongs to the support of the target distribution but is never extracted from the proposal distribution, then $q(y \mid x) = 0$, which causes division-by-zero problems in the proof of the detailed balance condition and makes it invalid.
Denote by $S$ the support of the target density $f$.
A simple condition that guarantees that all the values belonging to $S$ can be generated as proposals is that $q(y \mid x) > 0$ for any two values $x, y \in S$.
Not only is this condition sufficient to prove that the target distribution is the stationary distribution of the chain, but it is basically all we need to prove that the chain is ergodic (i.e., we can rely on the Law of Large Numbers for dependent sequences) and converges to the stationary distribution.
For a discussion of this condition (and other minor technical conditions that are always satisfied in practice) see Robert and Casella (2013).
As already shown in the introductory lecture on MCMC methods, one of the main advantages of the Metropolis-Hastings algorithm is that we need to know the target distribution only up to a multiplicative constant.
This is very important in Bayesian inference, where the posterior distribution is often known up to a multiplicative constant (because the likelihood and the prior are known but the marginal distribution is not).
Suppose that $$f(x) = c\, g(x),$$ where $g$ is a known function but the constant $c$ is not.
Then, the acceptance probability is $$\alpha_t = \min\left(1, \frac{c\, g(y_t)\, q(x_{t-1} \mid y_t)}{c\, g(x_{t-1})\, q(y_t \mid x_{t-1})}\right) = \min\left(1, \frac{g(y_t)\, q(x_{t-1} \mid y_t)}{g(x_{t-1})\, q(y_t \mid x_{t-1})}\right).$$
So, we do not need to know the constant $c$ to compute the acceptance probability, and the latter is the only quantity in the algorithm that depends on the target distribution $f$.
Therefore, we can use the Metropolis-Hastings algorithm to generate draws from a distribution even if we do not fully know the probability density (or mass) of that distribution!
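For instance (reusing the `metropolis_hastings` sketch above), we can sample from a standard normal distribution while passing in only its unnormalized kernel $e^{-x^2/2}$, without the constant $1/\sqrt{2\pi}$:

```python
import numpy as np

g = lambda x: np.exp(-0.5 * x ** 2)   # standard normal, up to a constant

rng = np.random.default_rng(0)
draws = metropolis_hastings(
    f=g,                                          # unnormalized target
    q_sample=lambda x: x + rng.normal(0.0, 1.0),  # random-walk proposal
    q_density=lambda y, x: np.exp(-0.5 * (y - x) ** 2),
    x1=0.0, T=50_000, rng=rng,
)
print(draws.mean(), draws.std())      # should be close to 0 and 1
```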
If the proposal $y_t$ is independent of the previous state $x_{t-1}$, that is, $$q(y_t \mid x_{t-1}) = q(y_t),$$ then the algorithm is called Independent Metropolis-Hastings or Independence chain Metropolis-Hastings.
For example, a common choice is to extract proposals from a multivariate normal distribution (each draw is independent of the previous ones).
When $y_t$ is drawn independently of the previous state, the acceptance probability is $$\alpha_t = \min\left(1, \frac{f(y_t)\, q(x_{t-1})}{f(x_{t-1})\, q(y_t)}\right).$$
It is easy to see that $\alpha_t = 1$ when $q(x) = f(x)$ for any $x$.
In other words, if the proposal distribution is equal to the target distribution, the acceptance probability is $1$ and the proposal is always accepted. Furthermore, since the proposals are independent, we obtain a sample of independent draws from the target distribution.
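Continuing with the `metropolis_hastings` sketch above, an independence-chain variant only requires a proposal that ignores the current state. In this hypothetical example we target a Student's t density with a wide normal proposal (both via SciPy):

```python
import numpy as np
from scipy import stats

target = lambda x: stats.t.pdf(x, df=4)      # target: Student's t, 4 d.o.f.
proposal = stats.norm(loc=0.0, scale=2.0)    # wide normal, independent of x

rng = np.random.default_rng(1)
draws = metropolis_hastings(
    f=target,
    q_sample=lambda x: proposal.rvs(random_state=rng),  # ignores x
    q_density=lambda y, x: proposal.pdf(y),             # q(y | x) = q(y)
    x1=0.0, T=20_000, rng=rng,
)
```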
On the contrary, the more the proposal distribution differs from the target distribution, the lower the acceptance probabilities are, and the more often the proposals are rejected.
The consequence is that the chain remains stuck at particular points for long periods of time, giving rise to a highly serially correlated MCMC sample.
As discussed in the lecture on MCMC diagnostics, chains that produce highly correlated samples can be problematic because they need to be run for very long periods of time, which can be prohibitively costly.
Therefore, the proposal distribution should be as similar as possible to the target distribution, so as to get high acceptance probabilities and samples with low serial correlation.
In the following plot, we show two examples:
on the left, the proposal is close to the target (upper panel); as a consequence, the samples are only mildly correlated (lower panel);
on the right, the proposal is very different from the target (upper panel); as a consequence, the samples are highly correlated (lower panel).
Unfortunately, there is no simple and universal method to select good proposal distributions. In fact, this is a topic that is still actively researched by statisticians.
However, a generally valid strategy is to:
start with a simple proposal distribution (e.g., a multivariate normal);
generate a first MCMC sample (which might be too correlated) and use it to infer characteristics of the target distribution (e.g., its mean and covariance matrix);
use the things learned from the first sample to improve the proposal distribution (e.g., by tuning the mean and covariance of the multivariate normal used as proposal).
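A sketch of this two-stage strategy, again assuming the `metropolis_hastings` function above (the two-dimensional target used here is an arbitrary example):

```python
import numpy as np

f = lambda x: np.exp(-0.5 * x @ x - 0.1 * (x[0] * x[1]) ** 2)  # example target
d = 2
rng = np.random.default_rng(2)

# Stage 1: pilot run with a simple multivariate normal random-walk proposal.
# For symmetric proposals the q terms cancel, so a constant density suffices.
pilot = metropolis_hastings(
    f=f,
    q_sample=lambda x: x + rng.normal(0.0, 1.0, size=d),
    q_density=lambda y, x: 1.0,
    x1=np.zeros(d), T=10_000, rng=rng,
)

# Stage 2: use the pilot sample's covariance to shape the proposal increments.
chol = np.linalg.cholesky(np.cov(pilot.T))
tuned = metropolis_hastings(
    f=f,
    q_sample=lambda x: x + chol @ rng.normal(size=d),
    q_density=lambda y, x: 1.0,        # still symmetric
    x1=pilot[-1], T=50_000, rng=rng,
)
```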
If the proposals are formed as $$y_t = x_{t-1} + \varepsilon_t,$$ where $\varepsilon_t$ is a sequence of independent draws from a known probability distribution with density $h$ (e.g., multivariate normal), then the algorithm is called Random walk Metropolis-Hastings.
We have that $$q(y_t \mid x_{t-1}) = h(y_t - x_{t-1})$$ and $$q(x_{t-1} \mid y_t) = h(x_{t-1} - y_t).$$
Therefore, the acceptance probability is $$\alpha_t = \min\left(1, \frac{f(y_t)\, h(x_{t-1} - y_t)}{f(x_{t-1})\, h(y_t - x_{t-1})}\right).$$
To understand the properties of this algorithm, let us consider the special (but very common in practice) case in which the distribution of the random walk increment is symmetric.
In this case, $h(x_{t-1} - y_t) = h(y_t - x_{t-1})$, and the acceptance probability is $$\alpha_t = \min\left(1, \frac{f(y_t)}{f(x_{t-1})}\right).$$
Suppose also that the target density has the following two characteristics (which are very common in practice too):
points that are close to each other have similar densities;
there is a high probability region (e.g., around the mode of the distribution); the points that are far from this region have a much lower density than the points that belong to it.
Now, consider two extreme cases:
if the random walk increments are on average very small (their variance is small), then the ratio $$\frac{f(y_t)}{f(x_{t-1})}$$ and the acceptance probability $\alpha_t$ are always close to $1$ because, by assumption, points that are close to each other have similar densities; the proposals are almost always accepted, but they are very close to each other; the resulting MCMC sample is highly serially correlated;
if the random walk increments are on average very large (their variance is large), then the ratio $$\frac{f(y_t)}{f(x_{t-1})}$$ is often close to zero when $x_{t-1}$ lies in the high-probability region, because the proposals tend to be far from this region and to have low density; the proposals are often rejected and the chain remains stuck at high-density points for long periods of time; we obtain a highly serially correlated MCMC sample.
The bottom line is: we get a highly correlated sample both if the variance of the increments is too large and if it is too small.
To avoid these two extreme cases, it is very important to accurately tune the variance of the random walk increments.
We can do this, for example, with a trial-and-error process:
we start with a guess of a good value for the variance, and we generate a first MCMC sample;
then, we look at the trace plots of the sample (see MCMC diagnostics); if we see points where the chain gets stuck (flat lines in the trace plot), we decrease the variance, while if we see that the trace plot moves very slowly, we increase the variance;
finally, we re-run the chain.
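The following sketch illustrates this trade-off (assuming the `metropolis_hastings` function above) by running a random-walk chain on a standard normal target with three different increment standard deviations and comparing acceptance rates and first-order serial correlations:

```python
import numpy as np

f = lambda x: np.exp(-0.5 * x ** 2)   # standard normal target, up to a constant

for sigma in (0.1, 2.5, 50.0):        # too small, reasonable, too large
    rng = np.random.default_rng(3)
    draws = metropolis_hastings(
        f=f,
        q_sample=lambda x: x + rng.normal(0.0, sigma),
        q_density=lambda y, x: 1.0,   # symmetric increments: q cancels out
        x1=0.0, T=20_000, rng=rng,
    )
    accept = np.mean(draws[1:] != draws[:-1])
    corr = np.corrcoef(draws[1:], draws[:-1])[0, 1]
    print(f"sigma = {sigma:5.1f}  acceptance = {accept:.2f}  lag-1 corr = {corr:.2f}")
```

Both the very small and the very large sigma should produce lag-1 correlations close to $1$, for the two different reasons described above.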
An alternative and very effective method is to automatically tune the variance while running the chain.
Methods to do so are called adaptive Metropolis-Hastings methods. We do not discuss them here, but you can read this nice introduction if you are interested in the topic.
Robert, C. P. and G. Casella (2013) Monte Carlo Statistical Methods, Springer Verlag.