# Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is one of the most popular Markov Chain Monte Carlo (MCMC) algorithms.

Like other MCMC methods, the Metropolis-Hastings algorithm is used to generate serially correlated draws from a sequence of probability distributions that converge to a given target distribution.

## Preliminaries

Before reading this lecture, you should review the basics of Markov chains and MCMC.

In particular, you should keep in mind that an MCMC algorithm generates a random sequence having the following properties:

• it is a Markov chain (i.e., given , subsequent observations are conditionally independent of previous observations , for any positive integers and ).

• two observations and are in general not independent, but they become closer and closer to being independent as increases;

• the sequence converges to the target distribution (the larger is, the more the distribution of is similar to the target distribution).

When we run an MCMC algorithm for periods, we obtain an MCMC sample composed of the realizations of the first terms of the chain:

Then, we use the empirical distribution of the sample thus obtained to approximate the target distribution.

## The algorithm

Let be the probability density (or mass) function of the distribution from which we wish to extract a sample of draws. We call the target distribution.

Denote by a family of conditional distributions of our choice, from which it is easy to generate draws. We require that , and the argument of all have the same dimension.

The Metropolis-Hastings algorithm starts from any value belonging to the support of the target distribution. The value can be user-defined or extracted from a given distribution.

Then, the subsequent values are generated recursively. In particular, a generic value is generated as follows:

1. draw from the distribution with density ;

2. set

3. draw from a uniform distribution on ;

4. if , set ; otherwise, set .

Since is uniform, that is, the probability of accepting the proposal as the new draw is equal to .

## Terminology

The following terminology is used:

• the distribution is called proposal distribution;

• the draw is called proposal;

• the probability is called acceptance probability;

• when and , we say that the proposal is accepted;

• when and , we say that the proposal is rejected.

In the special case in which the proposal distribution is symmetric (i.e., for any value of and ), then the acceptance probability isand the algorithm is called Metropolis algorithm (because this simpler version was invented by Nicholas Metropolis).

## Proof of convergence

We do not give a fully rigorous proof of convergence, but we provide the main intuitions.

### Detailed balance

The crucial step is to prove that the target distribution and the Markov chain generated by the Metropolis-Hastings algorithm satisfy the detailed balance condition (see the lecture on Markov chains), and, as a consequence, the target distribution is the stationary distribution of the chain.

Denote by the transition kernel of the Markov chain, that is, the probability density of a transition from to .

When , then and we have

Symmetrically, when the transition is from to and , we have

Therefore, when , the target density and the transition kernel of the chain satisfy the detailed balance condition

When , then the detailed balance condition is trivially satisfied because and

By putting together the two cases ( and ), we find that the detailed balance condition is always satisfied. As a consequence, is the stationary distribution of the chain.

### Technical conditions

While proving the detailed balance condition, we have omitted an important detail: the proposal distribution must be able to generate all the values belonging to the support of the target distribution. This is intuitive: if the proposal distribution never generates a value to which the target distribution assigns a positive probability, then certainly the stationary distribution of the chain cannot be equal to the target distribution. From a technical viewpoint, if a value belongs to the support of the target distribution but is never extracted from the proposal distribution, then , which causes division by zero problems in the proof of the detailed balance condition and makes it invalid.

Denote by the support of the target density . A simple condition that guarantees that all the values belonging to can be generated as proposals is that for any two values .

Not only is this condition sufficient to prove that the target distribution is the stationary distribution of the chain, but it is basically all we need to prove that the chain is ergodic (i.e., we can rely on Law of Large Numbers for dependent sequences) and converges to the stationary distribution. For a discussion of this condition (and other minor technical conditions that are always satisfied in practice) see Robert and Casella (2013).

## Multiplicative constants

As already shown in the introductory lecture on MCMC methods, one of the main advantages of the Metropolis-Hastings algorithm is that we need to know the target distribution only up to a multiplicative constant. This is very important in Bayesian inference, where the posterior distribution is often known up to a multiplicative constant (when the likelihood and the prior are known but the marginal distribution is not).

Supposewhere is a known function but the constant is not.

Then, the acceptance probability is

So, we do not need to know the constant to compute the acceptance probability, and the latter is the only quantity in the algorithm that depends on the target distribution . Therefore, the Metropolis-Hastings algorithm allows us to generate draws from a distribution even if we do not fully know the probability density (or mass) of that distribution!

## Independent Metropolis-Hastings

If the proposal is independent of the previous state , that is,then the algorithm is called Independent Metropolis-Hastings or Independence chain Metropolis-Hastings.

For example, a common choice is to extract proposals from a multivariate normal distribution (each draw is independent from the previous ones).

When is drawn independently, the acceptance probability is

It is easy to see that when for any . In other words, if the proposal distribution is equal to the target distribution, the acceptance probability is and the proposal is always accepted. Furthermore, since the proposals are independent, we obtain a sample of independent draws from the target distribution.

On the contrary, the more the proposal distribution differs from the target distribution, the lower the acceptance probabilities are, and the more often the proposals are rejected and the chain remains stuck at particular points for long periods of time, giving rise to a highly serially correlated MCMC sample. As discussed in the lecture on MCMC diagnostics, chains that produce highly correlated samples can be problematic because they need to be run for very long periods of time, which can be prohibitively costly. Therefore, one should aim at having a proposal distribution that is as similar as possible to the target distribution, so as to get high acceptance probabilities and samples with low serial correlation. Unfortunately, there is no simple and universal method to do so, and how to select good proposal distributions is a topic that is still actively researched by statisticians. However, a generally valid strategy is to start with a simple proposal distribution (e.g., a multivariate normal), generate a first MCMC sample (which might be too correlated) and use it to infer characteristics of the target distribution (e.g., its mean and covariance matrix) that can be useful to improve the proposal distribution (e.g., by tuning the mean and covariance of the multivariate normal used as proposal).

## Random walk Metropolis-Hastings

If the proposals are formed aswhere is a sequence of independent draws from a known probability distribution (e.g., multivariate normal), then the algorithm is called Random walk Metropolis-Hastings.

We have thatand

Therefore, the acceptance probability is

To understand the properties of this algorithm, let us consider the special (but very common in practice) case in which the distribution of the random walk increment is symmetric. In such a case, we have that and the acceptance probability simplifies to

Suppose also that the target density has the following two characteristics (which are very common in practice too):

• points that are close to each other have similar densities;

• there is a high probability region (e.g., around the mode of the distribution); the points that are far from this region have a much lower density than the points that belong to it.

Now, consider two extreme cases:

1. if the random walk increments are on average very small (their variance is small), then the ratioand the acceptance probability are always close to (remember that we have assumed that points that are close to each other have similar densities); the proposals are almost always accepted, but they are very close to each other, and the resulting MCMC sample is highly serially correlated;

2. if the random walk increments are on average very large (their variance is large), then the ratiois often close to zero when lies in the high probability region (the proposals tend to be far from this region and to have low density); the proposals are often rejected and the chain remains stuck at high-density points for long periods of time, giving rise to a highly serially correlated MCMC sample.

So, we get a highly correlated sample both if the variance of the increments is too large and if it is too small. Therefore, it is very important to accurately tune the variance of the random walk increments, so as to avoid these two extreme cases. We can do this, for example, with a trial-and-error process: we start with a guess of a good value for the variance, and we generate a first MCMC sample; then, we look at the trace plots of the sample (see MCMC diagnostics): if we see points where the chain gets stuck (flat lines in trace plot) we decrease the variance, while if we see that the trace plot moves very slowly, we increase the variance. Then we re-run the chain. An alternative possibility is to automatically tune the variance while running the chain. Methods to do so are called adaptive Metropolis-Hastings methods. We do not discuss them here, but you can read this nice introduction if you are interested in the topic.

## References

Robert, C. P. and G. Casella (2013) Monte Carlo Statistical Methods, Springer Verlag.

The book

Most of the learning materials found on this website are now available in a traditional textbook format.

Glossary entries
Share