
Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is one of the most popular Markov Chain Monte Carlo (MCMC) algorithms.

Like other MCMC methods, the Metropolis-Hastings algorithm is used to generate serially correlated draws from a sequence of probability distributions that converge to a given target distribution.


Preliminaries

Before reading this lecture, you should review the basics of Markov chains and MCMC.

In particular, you should keep in mind that an MCMC algorithm generates a random sequence $x_{1},x_{2},\ldots,x_{T},\ldots$ having the following properties:

  - the terms of the sequence are serially correlated;

  - the distribution of $x_{t}$ converges to the target distribution as $t$ grows.

When we run an MCMC algorithm for $T$ periods, we obtain an MCMC sample composed of the realizations of the first $T$ terms of the chain: $$x_{1},x_{2},\ldots,x_{T}$$

Then, we use the empirical distribution of the sample thus obtained to approximate the target distribution.

The algorithm

Let $f(x)$ be the probability density (or mass) function of the distribution from which we wish to extract a sample of draws. We call $f(x)$ the target distribution.

Denote by $q(x^{\prime}\mid x)$ a family of conditional distributions of our choice, from which it is easy to generate draws. We require that $x$, $x^{\prime}$ and the argument of $f(x)$ all have the same dimension.

The Metropolis-Hastings algorithm starts from any value $x_{1}$ belonging to the support of the target distribution. The value $x_{1}$ can be user-defined or extracted from a given distribution.

Then, the subsequent values $x_{2},\ldots,x_{T}$ are generated recursively. In particular, a generic value $x_{t}$ is generated as follows:

  1. draw $y_{t}$ from the distribution with density $q(y_{t}\mid x_{t-1})$;

  2. set $$p_{t}=\min\left\{1,\frac{f(y_{t})\,q(x_{t-1}\mid y_{t})}{f(x_{t-1})\,q(y_{t}\mid x_{t-1})}\right\};$$

  3. draw $u_{t}$ from a uniform distribution on $\left[0,1\right]$;

  4. if $u_{t}\leq p_{t}$, set $x_{t}=y_{t}$; otherwise, set $x_{t}=x_{t-1}$.

Since $u_{t}$ is uniform, $$\Pr(x_{t}=y_{t}\mid x_{t-1},y_{t})=\Pr(u_{t}\leq p_{t})=p_{t},$$ that is, the probability of accepting the proposal $y_{t}$ as the new draw $x_{t}$ is equal to $p_{t}$.
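The four steps above translate directly into code. The following is a minimal sketch in Python/NumPy; the standard normal target and the autoregressive proposal $y_{t}\sim N(0.5\,x_{t-1},1)$ are illustrative assumptions, chosen so that the full (asymmetric) Hastings ratio is exercised.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Target density (standard normal), known pointwise."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def q_density(x_new, x_old):
    """Proposal density q(x_new | x_old): here N(0.5 * x_old, 1)."""
    return np.exp(-0.5 * (x_new - 0.5 * x_old)**2) / np.sqrt(2.0 * np.pi)

def q_draw(x_old):
    """Draw a proposal y ~ q(. | x_old)."""
    return rng.normal(0.5 * x_old, 1.0)

def metropolis_hastings(T, x1=0.0):
    x = np.empty(T)
    x[0] = x1                                # starting value in the support of f
    for t in range(1, T):
        y = q_draw(x[t - 1])                 # step 1: draw the proposal y_t
        # step 2: acceptance probability p_t
        p = min(1.0, f(y) * q_density(x[t - 1], y)
                / (f(x[t - 1]) * q_density(y, x[t - 1])))
        u = rng.uniform()                    # step 3: uniform draw u_t
        x[t] = y if u <= p else x[t - 1]     # step 4: accept or reject
    return x

sample = metropolis_hastings(50_000)
print(sample.mean(), sample.var())           # should be close to 0 and 1
```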

Terminology

The following terminology is used:

  - the family of distributions $q(x^{\prime}\mid x)$ is called the proposal distribution;

  - the draw $y_{t}$ is called the proposal;

  - the probability $p_{t}$ is called the acceptance probability.

In the special case in which the proposal distribution is symmetric (i.e., $q(y_{t}\mid x_{t-1})=q(x_{t-1}\mid y_{t})$ for any value of $x_{t-1}$ and $y_{t}$), the acceptance probability is $$p_{t}=\min\left\{1,\frac{f(y_{t})}{f(x_{t-1})}\right\}$$ and the algorithm is called the Metropolis algorithm (because this simpler version was invented by Nicholas Metropolis).

Proof of convergence

We do not give a fully rigorous proof of convergence, but we provide the main intuitions.

Detailed balance

The crucial step is to prove that the target distribution and the Markov chain generated by the Metropolis-Hastings algorithm satisfy the detailed balance condition (see the lecture on Markov chains), and, as a consequence, the target distribution is the stationary distribution of the chain.

Denote by $\kappa(x_{t}\mid x_{t-1})$ the transition kernel of the Markov chain, that is, the probability density of a transition from $x_{t-1}$ to $x_{t}$.

When $x_{t}\neq x_{t-1}$, then $x_{t}=y_{t}$ and we have $$\kappa(x_{t}\mid x_{t-1})=q(x_{t}\mid x_{t-1})\min\left\{1,\frac{f(x_{t})\,q(x_{t-1}\mid x_{t})}{f(x_{t-1})\,q(x_{t}\mid x_{t-1})}\right\}$$

Symmetrically, when the transition is from $x_{t}$ to $x_{t-1}$ and $x_{t-1}\neq x_{t}$, we have $$\kappa(x_{t-1}\mid x_{t})=q(x_{t-1}\mid x_{t})\min\left\{1,\frac{f(x_{t-1})\,q(x_{t}\mid x_{t-1})}{f(x_{t})\,q(x_{t-1}\mid x_{t})}\right\}$$

Therefore, when $x_{t}\neq x_{t-1}$, the target density $f$ and the transition kernel of the chain satisfy the detailed balance condition $$f(x_{t-1})\,\kappa(x_{t}\mid x_{t-1})=f(x_{t})\,\kappa(x_{t-1}\mid x_{t})$$ (both sides equal $\min\left\{f(x_{t-1})\,q(x_{t}\mid x_{t-1}),\,f(x_{t})\,q(x_{t-1}\mid x_{t})\right\}$, which is symmetric in $x_{t-1}$ and $x_{t}$).

When $x_{t}=x_{t-1}$, the detailed balance condition is trivially satisfied because $$f(x_{t-1})=f(x_{t})$$ and $$\kappa(x_{t}\mid x_{t-1})=\kappa(x_{t-1}\mid x_{t})$$

By putting together the two cases ($x_{t}=x_{t-1}$ and $x_{t}\neq x_{t-1}$), we find that the detailed balance condition is always satisfied. As a consequence, $f$ is the stationary distribution of the chain.
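On a finite state space the detailed balance argument can be checked numerically. The sketch below builds the full Metropolis-Hastings transition matrix for an illustrative three-state target with a uniform proposal (both are assumptions made up for this check) and verifies that $f(i)\,\kappa(j\mid i)=f(j)\,\kappa(i\mid j)$ and that $f$ is stationary.

```python
import numpy as np

# Illustrative discrete target on states {0, 1, 2} and uniform proposal.
f = np.array([0.2, 0.5, 0.3])
n = len(f)
q = np.full((n, n), 1.0 / n)       # q[i, j] = probability of proposing j from i

# Metropolis-Hastings kernel: kappa[i, j] = P(x_t = j | x_{t-1} = i)
kappa = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            p_accept = min(1.0, f[j] * q[j, i] / (f[i] * q[i, j]))
            kappa[i, j] = q[i, j] * p_accept
    kappa[i, i] = 1.0 - kappa[i].sum()   # rejected proposals keep the chain at i

# Detailed balance: f(i) * kappa(i -> j) == f(j) * kappa(j -> i)
flux = f[:, None] * kappa
assert np.allclose(flux, flux.T)

# Hence f is the stationary distribution: f @ kappa == f
assert np.allclose(f @ kappa, f)
print("detailed balance and stationarity verified")
```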

Technical conditions

While proving the detailed balance condition, we have omitted an important detail: the proposal distribution must be able to generate all the values belonging to the support of the target distribution. This is intuitive: if the proposal distribution never generates a value $x_{t}$ to which the target distribution assigns a positive probability, then the stationary distribution of the chain certainly cannot be equal to the target distribution. From a technical viewpoint, if a value $x_{t}$ belongs to the support of the target distribution but is never extracted from the proposal distribution, then $q(x_{t}\mid x_{t-1})=0$, which causes division-by-zero problems in the proof of the detailed balance condition and makes it invalid.

Denote by $R_{f}$ the support of the target density $f(x)$. A simple condition that guarantees that all the values belonging to $R_{f}$ can be generated as proposals is that $$q(y\mid x)>0$$ for any two values $x,y\in R_{f}$.

Not only is this condition sufficient to prove that the target distribution is the stationary distribution of the chain, but it is basically all we need to prove that the chain is ergodic (i.e., we can rely on the Law of Large Numbers for dependent sequences) and converges to the stationary distribution. For a discussion of this condition (and of other minor technical conditions that are always satisfied in practice), see Robert and Casella (2013).

Multiplicative constants

As already shown in the introductory lecture on MCMC methods, one of the main advantages of the Metropolis-Hastings algorithm is that we need to know the target distribution $f$ only up to a multiplicative constant. This is very important in Bayesian inference, where the posterior distribution is often known up to a multiplicative constant (when the likelihood and the prior are known but the marginal distribution is not).

Suppose that $$f(x)=c\,h(x)$$ where $h(x)$ is a known function but the constant $c$ is not.

Then, the acceptance probability is $$p_{t}=\min\left\{1,\frac{c\,h(y_{t})\,q(x_{t-1}\mid y_{t})}{c\,h(x_{t-1})\,q(y_{t}\mid x_{t-1})}\right\}=\min\left\{1,\frac{h(y_{t})\,q(x_{t-1}\mid y_{t})}{h(x_{t-1})\,q(y_{t}\mid x_{t-1})}\right\}$$

So, we do not need to know the constant $c$ to compute the acceptance probability, and the latter is the only quantity in the algorithm that depends on the target distribution $f$. Therefore, the Metropolis-Hastings algorithm allows us to generate draws from a distribution even if we do not fully know the probability density (or mass) of that distribution!
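As an illustration, the following sketch samples from the density proportional to $h(x)=\exp(-x^{4})$ without ever computing its normalizing constant; the target, the symmetric normal proposal, and the step size are all assumptions made for this example (with a symmetric proposal the acceptance ratio reduces to $h(y_{t})/h(x_{t-1})$, as shown above).

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    """Unnormalized target: f(x) = c * h(x), with c unknown to the sampler."""
    return np.exp(-x**4)

def mh_unnormalized(T, step=1.0, x1=0.0):
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        y = x[t - 1] + rng.normal(0.0, step)   # symmetric proposal
        # the acceptance probability uses only h: the constant c cancels
        p = min(1.0, h(y) / h(x[t - 1]))
        x[t] = y if rng.uniform() <= p else x[t - 1]
    return x

sample = mh_unnormalized(100_000)
# for this target E[X] = 0 and E[X^2] = Gamma(3/4) / Gamma(1/4) ≈ 0.338
print(sample.mean(), (sample**2).mean())
```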

Independent Metropolis-Hastings

If the proposal $y_{t}$ is independent of the previous state $x_{t-1}$, that is, $$q(y_{t}\mid x_{t-1})=q(y_{t}),$$ then the algorithm is called Independent Metropolis-Hastings or Independence chain Metropolis-Hastings.

For example, a common choice is to extract the proposals $y_{t}$ from a multivariate normal distribution (each draw is independent of the previous ones).

When $y_{t}$ is drawn independently, the acceptance probability is $$p_{t}=\min\left\{1,\frac{f(y_{t})\,q(x_{t-1})}{f(x_{t-1})\,q(y_{t})}\right\}$$

It is easy to see that $p_{t}=1$ when $q(x)=f(x)$ for any $x$. In other words, if the proposal distribution is equal to the target distribution, the acceptance probability is 1 and the proposal is always accepted. Furthermore, since the proposals are independent, we obtain a sample of independent draws from the target distribution.

Conversely, the more the proposal distribution differs from the target distribution, the lower the acceptance probabilities are, the more often the proposals are rejected, and the longer the chain remains stuck at particular points, giving rise to a highly serially correlated MCMC sample. As discussed in the lecture on MCMC diagnostics, chains that produce highly correlated samples can be problematic because they need to be run for very long periods of time, which can be prohibitively costly.

Therefore, one should aim for a proposal distribution that is as similar as possible to the target distribution, so as to obtain high acceptance probabilities and samples with low serial correlation. Unfortunately, there is no simple and universal method to achieve this, and the selection of good proposal distributions is still an active research topic. However, a generally valid strategy is to start with a simple proposal distribution (e.g., a multivariate normal), generate a first MCMC sample (which might be too correlated), and use it to infer characteristics of the target distribution (e.g., its mean and covariance matrix) that can help to improve the proposal distribution (e.g., by tuning the mean and covariance of the multivariate normal used as a proposal).
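The sketch below implements an independence sampler along these lines; the Student's t target and the normal proposal (whose scale is deliberately not matched to the target) are illustrative assumptions, and the printed acceptance rate gives a rough measure of how close the proposal is to the target.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

target = stats.t(df=4)                      # illustrative target: Student's t
proposal = stats.norm(loc=0.0, scale=2.0)   # independent normal proposal

def independent_mh(T, x1=0.0):
    x = np.empty(T)
    x[0] = x1
    accepted = 0
    for t in range(1, T):
        y = proposal.rvs(random_state=rng)      # y_t independent of x_{t-1}
        # p_t = min{1, f(y) q(x_{t-1}) / (f(x_{t-1}) q(y))}
        p = min(1.0, target.pdf(y) * proposal.pdf(x[t - 1])
                / (target.pdf(x[t - 1]) * proposal.pdf(y)))
        if rng.uniform() <= p:
            x[t] = y
            accepted += 1
        else:
            x[t] = x[t - 1]
    return x, accepted / (T - 1)

sample, rate = independent_mh(50_000)
print(f"acceptance rate: {rate:.2f}")    # closer to 1 when proposal is closer to target
```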

Random walk Metropolis-Hastings

If the proposals are formed as $$y_{t}=x_{t-1}+\varepsilon_{t},$$ where $\left\{\varepsilon_{t}\right\}$ is a sequence of independent draws from a known probability distribution (e.g., multivariate normal), then the algorithm is called Random walk Metropolis-Hastings.

Denote by $g$ the probability density of the increments $\varepsilon_{t}$. We have that $$q(y_{t}\mid x_{t-1})=g(y_{t}-x_{t-1})$$ and $$q(x_{t-1}\mid y_{t})=g(x_{t-1}-y_{t})$$

Therefore, the acceptance probability is $$p_{t}=\min\left\{1,\frac{f(y_{t})\,g(x_{t-1}-y_{t})}{f(x_{t-1})\,g(y_{t}-x_{t-1})}\right\}$$

To understand the properties of this algorithm, let us consider the special (but very common in practice) case in which the distribution of the random walk increment $\varepsilon_{t}$ is symmetric. In such a case, we have that $g(x_{t-1}-y_{t})=g(y_{t}-x_{t-1})$ and the acceptance probability simplifies to $$p_{t}=\min\left\{1,\frac{f(y_{t})}{f(x_{t-1})}\right\}$$

Suppose also that the target density has the following two characteristics (which are very common in practice too):

  - it is smooth, so that points that are close to each other have similar densities;

  - most of its probability mass is concentrated in a relatively small region (the high-probability region).

Now, consider two extreme cases:

  1. if the random walk increments $\varepsilon_{t}$ are on average very small (their variance is small), then the ratio $f(y_{t})/f(x_{t-1})$ and the acceptance probability $p_{t}$ are always close to 1 (remember that we have assumed that points that are close to each other have similar densities); the proposals are almost always accepted, but they are very close to each other, and the resulting MCMC sample is highly serially correlated;

  2. if the random walk increments $\varepsilon_{t}$ are on average very large (their variance is large), then the ratio $f(y_{t})/f(x_{t-1})$ is often close to zero when $x_{t-1}$ lies in the high-probability region (the proposals tend to be far from this region and to have low density); the proposals are often rejected and the chain remains stuck at high-density points for long periods of time, giving rise to a highly serially correlated MCMC sample.

So, we get a highly correlated sample both if the variance of the increments is too large and if it is too small. Therefore, it is very important to tune the variance of the random walk increments carefully, so as to avoid these two extreme cases.

We can do this, for example, with a trial-and-error process: we start with a guess of a good value for the variance and generate a first MCMC sample; then we look at the trace plots of the sample (see MCMC diagnostics). If we see points where the chain gets stuck (flat lines in the trace plot), we decrease the variance; if the trace plot moves very slowly, we increase the variance. Then we re-run the chain.

An alternative possibility is to tune the variance automatically while running the chain. Methods that do so are called adaptive Metropolis-Hastings methods; we do not discuss them here.
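A minimal sketch of such a trial-and-error comparison follows; the standard normal target and the three candidate step sizes are illustrative assumptions, and the lag-1 autocorrelation is used as a crude stand-in for inspecting trace plots. Working with log densities, as done here, avoids numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_h(x):
    """Log of the (possibly unnormalized) target: standard normal here."""
    return -0.5 * x**2

def random_walk_mh(T, step, x1=0.0):
    """Random walk Metropolis: y_t = x_{t-1} + eps_t, eps_t ~ N(0, step^2)."""
    x = np.empty(T)
    x[0] = x1
    accepted = 0
    for t in range(1, T):
        y = x[t - 1] + rng.normal(0.0, step)
        # symmetric increments: accept with probability min{1, f(y_t)/f(x_{t-1})}
        if np.log(rng.uniform()) <= log_h(y) - log_h(x[t - 1]):
            x[t] = y
            accepted += 1
        else:
            x[t] = x[t - 1]
    return x, accepted / (T - 1)

# Trial runs with three step sizes: too small, moderate, too large.
for step in (0.05, 2.5, 50.0):
    sample, rate = random_walk_mh(20_000, step)
    acf1 = np.corrcoef(sample[:-1], sample[1:])[0, 1]   # lag-1 autocorrelation
    print(f"step={step:6.2f}  acceptance={rate:.2f}  lag-1 autocorr={acf1:.2f}")
```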

References

Robert, C. P. and G. Casella (2013) Monte Carlo Statistical Methods, Springer Verlag.
