
Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is one of the most popular Markov Chain Monte Carlo (MCMC) algorithms.

Like other MCMC methods, the Metropolis-Hastings algorithm is used to generate serially correlated draws from a sequence of probability distributions that converge to a given target distribution.


Preliminaries

Before reading this lecture, you should review the basics of Markov chains and MCMC.

In particular, you should keep in mind that an MCMC algorithm generates a random sequence $x_{1},x_{2},\ldots,x_{T},\ldots$ having the following properties:

  - the terms of the sequence are serially correlated;

  - the distribution of $x_{t}$ converges to the target distribution as $t$ grows.

When we run an MCMC algorithm for $T$ periods, we obtain an MCMC sample composed of the realizations of the first $T$ terms of the chain: $$x_{1},x_{2},\ldots,x_{T}$$

Then, we use the empirical distribution of the sample thus obtained to approximate the target distribution.

The algorithm

Let $f(x)$ be the probability density (or mass) function of the distribution from which we wish to extract a sample of draws. We call $f(x)$ the target distribution.

Denote by $q(x^{\prime}\mid x)$ a family of conditional distributions of our choice, from which it is easy to generate draws. We require that $x$, $x^{\prime}$ and the argument of $f(x)$ all have the same dimension.

The Metropolis-Hastings algorithm starts from any value $x_{1}$ belonging to the support of the target distribution. The value $x_{1}$ can be user-defined or extracted from a given distribution.

Then, the subsequent values $x_{2},\ldots,x_{T}$ are generated recursively. In particular, a generic value $x_{t}$ is generated as follows:

  1. draw $y_{t}$ from the distribution with density $q(y_{t}\mid x_{t-1})$;

  2. set $$p_{t}=\min\left\{1,\frac{f(y_{t})\,q(x_{t-1}\mid y_{t})}{f(x_{t-1})\,q(y_{t}\mid x_{t-1})}\right\};$$

  3. draw $u_{t}$ from a uniform distribution on $\left[0,1\right]$;

  4. if $u_{t}\leq p_{t}$, set $x_{t}=y_{t}$; otherwise, set $x_{t}=x_{t-1}$.

Since $u_{t}$ is uniform, $$\Pr(x_{t}=y_{t}\mid x_{t-1},y_{t})=\Pr(u_{t}\leq p_{t})=p_{t},$$ that is, the probability of accepting the proposal $y_{t}$ as the new draw $x_{t}$ is equal to $p_{t}$.
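The four steps above translate directly into code. The following is a minimal sketch in Python/NumPy; the standard normal target and the autoregressive proposal $y_{t}\sim N(0.5\,x_{t-1},1)$ are illustrative assumptions, chosen so that the full (asymmetric) Hastings ratio is exercised.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Target density (standard normal), known pointwise."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def q_density(x_new, x_old):
    """Proposal density q(x_new | x_old): here N(0.5 * x_old, 1)."""
    return np.exp(-0.5 * (x_new - 0.5 * x_old)**2) / np.sqrt(2.0 * np.pi)

def q_draw(x_old):
    """Draw a proposal y ~ q(. | x_old)."""
    return rng.normal(0.5 * x_old, 1.0)

def metropolis_hastings(T, x1=0.0):
    x = np.empty(T)
    x[0] = x1                                # starting value in the support of f
    for t in range(1, T):
        y = q_draw(x[t - 1])                 # step 1: draw the proposal y_t
        # step 2: acceptance probability p_t
        p = min(1.0, f(y) * q_density(x[t - 1], y)
                / (f(x[t - 1]) * q_density(y, x[t - 1])))
        u = rng.uniform()                    # step 3: uniform draw u_t
        x[t] = y if u <= p else x[t - 1]     # step 4: accept or reject
    return x

sample = metropolis_hastings(50_000)
print(sample.mean(), sample.var())           # should be close to 0 and 1
```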

Terminology

The following terminology is used:

  - the family of distributions $q(x^{\prime}\mid x)$ is called the proposal distribution;

  - the draw $y_{t}$ is called the proposal;

  - the probability $p_{t}$ is called the acceptance probability.

In the special case in which the proposal distribution is symmetric (i.e., $q(y_{t}\mid x_{t-1})=q(x_{t-1}\mid y_{t})$ for any value of $x_{t-1}$ and $y_{t}$), the acceptance probability is $$p_{t}=\min\left\{1,\frac{f(y_{t})}{f(x_{t-1})}\right\}$$ and the algorithm is called the Metropolis algorithm (because this simpler version was invented by Nicholas Metropolis).

Proof of convergence

We do not give a fully rigorous proof of convergence, but we provide the main intuitions.

Detailed balance

The crucial step is to prove that the target distribution and the Markov chain generated by the Metropolis-Hastings algorithm satisfy the detailed balance condition (see the lecture on Markov chains), and, as a consequence, the target distribution is the stationary distribution of the chain.

Denote by $\kappa(x_{t}\mid x_{t-1})$ the transition kernel of the Markov chain, that is, the probability density of a transition from $x_{t-1}$ to $x_{t}$.

When $x_{t}\neq x_{t-1}$, then $x_{t}=y_{t}$ and we have $$\kappa(x_{t}\mid x_{t-1})=q(x_{t}\mid x_{t-1})\min\left\{1,\frac{f(x_{t})\,q(x_{t-1}\mid x_{t})}{f(x_{t-1})\,q(x_{t}\mid x_{t-1})}\right\}$$

Symmetrically, when the transition is from $x_{t}$ to $x_{t-1}$ and $x_{t-1}\neq x_{t}$, we have $$\kappa(x_{t-1}\mid x_{t})=q(x_{t-1}\mid x_{t})\min\left\{1,\frac{f(x_{t-1})\,q(x_{t}\mid x_{t-1})}{f(x_{t})\,q(x_{t-1}\mid x_{t})}\right\}$$

Therefore, when $x_{t}\neq x_{t-1}$, the target density $f$ and the transition kernel of the chain satisfy the detailed balance condition $$f(x_{t-1})\,\kappa(x_{t}\mid x_{t-1})=f(x_{t})\,\kappa(x_{t-1}\mid x_{t})$$ (both sides equal $\min\left\{f(x_{t-1})\,q(x_{t}\mid x_{t-1}),\,f(x_{t})\,q(x_{t-1}\mid x_{t})\right\}$, which is symmetric in $x_{t-1}$ and $x_{t}$).

When $x_{t}=x_{t-1}$, the detailed balance condition is trivially satisfied because $$f(x_{t-1})=f(x_{t})$$ and $$\kappa(x_{t}\mid x_{t-1})=\kappa(x_{t-1}\mid x_{t})$$

By putting together the two cases ($x_{t}=x_{t-1}$ and $x_{t}\neq x_{t-1}$), we find that the detailed balance condition is always satisfied. As a consequence, $f$ is the stationary distribution of the chain.
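On a finite state space the detailed balance argument can be checked numerically. The sketch below builds the full Metropolis-Hastings transition matrix for an illustrative three-state target with a uniform proposal (both are assumptions made up for this check) and verifies that $f(i)\,\kappa(j\mid i)=f(j)\,\kappa(i\mid j)$ and that $f$ is stationary.

```python
import numpy as np

# Illustrative discrete target on states {0, 1, 2} and uniform proposal.
f = np.array([0.2, 0.5, 0.3])
n = len(f)
q = np.full((n, n), 1.0 / n)       # q[i, j] = probability of proposing j from i

# Metropolis-Hastings kernel: kappa[i, j] = P(x_t = j | x_{t-1} = i)
kappa = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            p_accept = min(1.0, f[j] * q[j, i] / (f[i] * q[i, j]))
            kappa[i, j] = q[i, j] * p_accept
    kappa[i, i] = 1.0 - kappa[i].sum()   # rejected proposals keep the chain at i

# Detailed balance: f(i) * kappa(i -> j) == f(j) * kappa(j -> i)
flux = f[:, None] * kappa
assert np.allclose(flux, flux.T)

# Hence f is the stationary distribution: f @ kappa == f
assert np.allclose(f @ kappa, f)
print("detailed balance and stationarity verified")
```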

Technical conditions

While proving the detailed balance condition, we have omitted an important detail: the proposal distribution must be able to generate all the values belonging to the support of the target distribution. This is intuitive: if the proposal distribution never generates a value $x_{t}$ to which the target distribution assigns a positive probability, then the stationary distribution of the chain certainly cannot be equal to the target distribution. From a technical viewpoint, if a value $x_{t}$ belongs to the support of the target distribution but is never extracted from the proposal distribution, then $q(x_{t}\mid x_{t-1})=0$, which causes division-by-zero problems in the proof of the detailed balance condition and makes it invalid.

Denote by $R_{f}$ the support of the target density $f(x)$. A simple condition that guarantees that all the values belonging to $R_{f}$ can be generated as proposals is that $$q(y\mid x)>0$$ for any two values $x,y\in R_{f}$.

Not only is this condition sufficient to prove that the target distribution is the stationary distribution of the chain, but it is basically all we need to prove that the chain is ergodic (i.e., we can rely on the Law of Large Numbers for dependent sequences) and converges to the stationary distribution. For a discussion of this condition (and of other minor technical conditions that are always satisfied in practice), see Robert and Casella (2013).

Multiplicative constants

As already shown in the introductory lecture on MCMC methods, one of the main advantages of the Metropolis-Hastings algorithm is that we need to know the target distribution $f$ only up to a multiplicative constant. This is very important in Bayesian inference, where the posterior distribution is often known up to a multiplicative constant (when the likelihood and the prior are known but the marginal distribution is not).

Suppose that $$f(x)=c\,h(x)$$ where $h(x)$ is a known function but the constant $c$ is not.

Then, the acceptance probability is $$p_{t}=\min\left\{1,\frac{c\,h(y_{t})\,q(x_{t-1}\mid y_{t})}{c\,h(x_{t-1})\,q(y_{t}\mid x_{t-1})}\right\}=\min\left\{1,\frac{h(y_{t})\,q(x_{t-1}\mid y_{t})}{h(x_{t-1})\,q(y_{t}\mid x_{t-1})}\right\}$$

So, we do not need to know the constant $c$ to compute the acceptance probability, and the latter is the only quantity in the algorithm that depends on the target distribution $f$. Therefore, the Metropolis-Hastings algorithm allows us to generate draws from a distribution even if we do not fully know the probability density (or mass) of that distribution!
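As an illustration, the following sketch samples from the density proportional to $h(x)=\exp(-x^{4})$ without ever computing its normalizing constant; the target, the symmetric normal proposal, and the step size are all assumptions made for this example (with a symmetric proposal the acceptance ratio reduces to $h(y_{t})/h(x_{t-1})$, as shown above).

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    """Unnormalized target: f(x) = c * h(x), with c unknown to the sampler."""
    return np.exp(-x**4)

def mh_unnormalized(T, step=1.0, x1=0.0):
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        y = x[t - 1] + rng.normal(0.0, step)   # symmetric proposal
        # the acceptance probability uses only h: the constant c cancels
        p = min(1.0, h(y) / h(x[t - 1]))
        x[t] = y if rng.uniform() <= p else x[t - 1]
    return x

sample = mh_unnormalized(100_000)
# for this target E[X] = 0 and E[X^2] = Gamma(3/4) / Gamma(1/4) ≈ 0.338
print(sample.mean(), (sample**2).mean())
```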

Independent Metropolis-Hastings

If the proposal $y_{t}$ is independent of the previous state $x_{t-1}$, that is, $$q(y_{t}\mid x_{t-1})=q(y_{t}),$$ then the algorithm is called Independent Metropolis-Hastings or Independence chain Metropolis-Hastings.

For example, a common choice is to extract the proposals $y_{t}$ from a multivariate normal distribution (each draw is independent of the previous ones).

When $y_{t}$ is drawn independently, the acceptance probability is $$p_{t}=\min\left\{1,\frac{f(y_{t})\,q(x_{t-1})}{f(x_{t-1})\,q(y_{t})}\right\}$$

It is easy to see that $p_{t}=1$ when $q(x)=f(x)$ for any $x$. In other words, if the proposal distribution is equal to the target distribution, the acceptance probability is 1 and the proposal is always accepted. Furthermore, since the proposals are independent, we obtain a sample of independent draws from the target distribution.

Conversely, the more the proposal distribution differs from the target distribution, the lower the acceptance probabilities are, the more often the proposals are rejected, and the longer the chain remains stuck at particular points, giving rise to a highly serially correlated MCMC sample. As discussed in the lecture on MCMC diagnostics, chains that produce highly correlated samples can be problematic because they need to be run for very long periods of time, which can be prohibitively costly.

Therefore, one should aim for a proposal distribution that is as similar as possible to the target distribution, so as to obtain high acceptance probabilities and samples with low serial correlation. Unfortunately, there is no simple and universal method to achieve this, and the selection of good proposal distributions is still an active research topic. However, a generally valid strategy is to start with a simple proposal distribution (e.g., a multivariate normal), generate a first MCMC sample (which might be too correlated), and use it to infer characteristics of the target distribution (e.g., its mean and covariance matrix) that can help to improve the proposal distribution (e.g., by tuning the mean and covariance of the multivariate normal used as a proposal).
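The sketch below implements an independence sampler along these lines; the Student's t target and the normal proposal (whose scale is deliberately not matched to the target) are illustrative assumptions, and the printed acceptance rate gives a rough measure of how close the proposal is to the target.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

target = stats.t(df=4)                      # illustrative target: Student's t
proposal = stats.norm(loc=0.0, scale=2.0)   # independent normal proposal

def independent_mh(T, x1=0.0):
    x = np.empty(T)
    x[0] = x1
    accepted = 0
    for t in range(1, T):
        y = proposal.rvs(random_state=rng)      # y_t independent of x_{t-1}
        # p_t = min{1, f(y) q(x_{t-1}) / (f(x_{t-1}) q(y))}
        p = min(1.0, target.pdf(y) * proposal.pdf(x[t - 1])
                / (target.pdf(x[t - 1]) * proposal.pdf(y)))
        if rng.uniform() <= p:
            x[t] = y
            accepted += 1
        else:
            x[t] = x[t - 1]
    return x, accepted / (T - 1)

sample, rate = independent_mh(50_000)
print(f"acceptance rate: {rate:.2f}")    # closer to 1 when proposal is closer to target
```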

Random walk Metropolis-Hastings

If the proposals are formed as $$y_{t}=x_{t-1}+\varepsilon_{t},$$ where $\left\{\varepsilon_{t}\right\}$ is a sequence of independent draws from a known probability distribution (e.g., multivariate normal), then the algorithm is called Random walk Metropolis-Hastings.

Denote by $g$ the probability density of the increments $\varepsilon_{t}$. We have that $$q(y_{t}\mid x_{t-1})=g(y_{t}-x_{t-1})$$ and $$q(x_{t-1}\mid y_{t})=g(x_{t-1}-y_{t})$$

Therefore, the acceptance probability is $$p_{t}=\min\left\{1,\frac{f(y_{t})\,g(x_{t-1}-y_{t})}{f(x_{t-1})\,g(y_{t}-x_{t-1})}\right\}$$

To understand the properties of this algorithm, let us consider the special (but very common in practice) case in which the distribution of the random walk increment $\varepsilon_{t}$ is symmetric. In such a case, we have that $g(x_{t-1}-y_{t})=g(y_{t}-x_{t-1})$ and the acceptance probability simplifies to $$p_{t}=\min\left\{1,\frac{f(y_{t})}{f(x_{t-1})}\right\}$$

Suppose also that the target density has the following two characteristics (which are very common in practice too):

  - it is smooth, so that points that are close to each other have similar densities;

  - most of its probability mass is concentrated in a relatively small region (the high-probability region).

Now, consider two extreme cases:

  1. if the random walk increments $\varepsilon_{t}$ are on average very small (their variance is small), then the ratio $f(y_{t})/f(x_{t-1})$ and the acceptance probability $p_{t}$ are always close to 1 (remember that we have assumed that points that are close to each other have similar densities); the proposals are almost always accepted, but they are very close to each other, and the resulting MCMC sample is highly serially correlated;

  2. if the random walk increments $\varepsilon_{t}$ are on average very large (their variance is large), then the ratio $f(y_{t})/f(x_{t-1})$ is often close to zero when $x_{t-1}$ lies in the high-probability region (the proposals tend to be far from this region and to have low density); the proposals are often rejected and the chain remains stuck at high-density points for long periods of time, giving rise to a highly serially correlated MCMC sample.

So, we get a highly correlated sample both if the variance of the increments is too large and if it is too small. Therefore, it is very important to tune the variance of the random walk increments carefully, so as to avoid these two extreme cases.

We can do this, for example, with a trial-and-error process: we start with a guess of a good value for the variance and generate a first MCMC sample; then we look at the trace plots of the sample (see MCMC diagnostics). If we see points where the chain gets stuck (flat lines in the trace plot), we decrease the variance; if the trace plot moves very slowly, we increase the variance. Then we re-run the chain.

An alternative possibility is to tune the variance automatically while running the chain. Methods that do so are called adaptive Metropolis-Hastings methods; we do not discuss them here.
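A minimal sketch of such a trial-and-error comparison follows; the standard normal target and the three candidate step sizes are illustrative assumptions, and the lag-1 autocorrelation is used as a crude stand-in for inspecting trace plots. Working with log densities, as done here, avoids numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_h(x):
    """Log of the (possibly unnormalized) target: standard normal here."""
    return -0.5 * x**2

def random_walk_mh(T, step, x1=0.0):
    """Random walk Metropolis: y_t = x_{t-1} + eps_t, eps_t ~ N(0, step^2)."""
    x = np.empty(T)
    x[0] = x1
    accepted = 0
    for t in range(1, T):
        y = x[t - 1] + rng.normal(0.0, step)
        # symmetric increments: accept with probability min{1, f(y_t)/f(x_{t-1})}
        if np.log(rng.uniform()) <= log_h(y) - log_h(x[t - 1]):
            x[t] = y
            accepted += 1
        else:
            x[t] = x[t - 1]
    return x, accepted / (T - 1)

# Trial runs with three step sizes: too small, moderate, too large.
for step in (0.05, 2.5, 50.0):
    sample, rate = random_walk_mh(20_000, step)
    acf1 = np.corrcoef(sample[:-1], sample[1:])[0, 1]   # lag-1 autocorrelation
    print(f"step={step:6.2f}  acceptance={rate:.2f}  lag-1 autocorr={acf1:.2f}")
```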

References

Robert, C. P. and G. Casella (2013) Monte Carlo Statistical Methods, Springer Verlag.
