
Markov Chain Monte Carlo (MCMC) methods

Markov Chain Monte Carlo (MCMC) methods are very powerful Monte Carlo methods that are often used in Bayesian inference.

While "classical" Monte Carlo methods rely on computer generated samples made up of independent observations, MCMC methods are based on techniques that allow to generate sequences of dependent observations (these sequences are Markov chains, hence the name of the methods).

Throughout this lecture we are going to assume that you are familiar with the basics of Monte Carlo methods and with Markov chains.


How MCMC works

The purpose of any Monte Carlo method is to approximate some feature (e.g., the mean) of a given probability distribution. This is accomplished by using a computer-generated sample of draws from the given distribution to compute a plug-in estimate of the feature to be approximated. In particular, suppose we are given a random vector $X$ with joint distribution function $F_{X}$, and we want to approximate a feature $T\left( F_{X}\right)$ of the distribution $F_{X}$.

In a Markov Chain Monte Carlo method, we use a computer algorithm to generate a sample $$\xi _{n}=\left[ x_{1},x_{2},\ldots ,x_{n}\right]$$ of realizations of $n$ random variables $X_{1}$, ..., $X_{n}$.

The algorithm is devised in such a way that the sequence $\left\{ X_{n}\right\}$ is a Markov Chain converging to the stationary distribution $F_{X}$.

The sample is used, as in standard Monte Carlo methods, to produce a plug-in estimate $T\left( F_{n}\right)$ of the feature $T\left( F_{X}\right)$, where $F_{n}$ is the empirical distribution of the sample $\xi _{n}$ (i.e., a probability distribution that assigns probability $1/n$ to each one of the values $x_{1}$, ..., $x_{n}$).

Unlike in standard Monte Carlo methods, the variables $X_{1}$, ..., $X_{n}$ are in general not independent. We need to take this fact into account when we assess the convergence of $T\left( F_{n}\right)$ to $T\left( F_{X}\right)$. For example, if $T\left( F_{X}\right)$ is the expected value $\mathrm{E}\left[ X\right]$ and $T\left( F_{n}\right)$ is the sample mean $$\overline{x}_{n}=\frac{1}{n}\sum_{i=1}^{n}x_{i},$$ we are usually able to show that an ergodic theorem (a Law of Large Numbers for dependent observations) applies to the sample mean, so that it converges to the expected value of $X$.
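To see what an ergodic average looks like in practice, here is a minimal Python illustration (not an MCMC algorithm itself): we simulate a stationary AR(1) process, whose draws are strongly dependent, and check that the sample mean still converges to the expected value of the stationary distribution. All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy Markov chain: a stationary AR(1) process x_i = phi * x_{i-1} + e_i.
# Its stationary distribution is N(0, 1 / (1 - phi**2)), so E[X] = 0.
phi, n = 0.9, 100_000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

# The draws are dependent, but an ergodic theorem still applies:
# the sample mean (the plug-in estimate of E[X]) is close to 0.
print(x.mean())
```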

Examples

Two popular examples of MCMC algorithms are provided below:

Example 1 - Metropolis-Hastings

One of the most popular MCMC algorithms is the Metropolis-Hastings (M-H) algorithm.

Denote by $f\left( x\right)$ the density (or mass) function of the target distribution, that is, the distribution from which we wish to extract a sequence of draws (e.g., $f$ could be the posterior density in Bayesian inference).

Denote by $q\left( x^{\prime }\mid x\right)$ a conditional distribution from which we are able to extract samples of IID draws ($x$, $x^{\prime }$ and the argument of the target distribution all have the same dimension).

The M-H algorithm starts from any value $x_{1}$ belonging to the support of the target distribution and generates the subsequent values $x_{i}$ as follows:

  1. generate a proposal $y_{i}$ from the proposal distribution $q\left( y\mid x_{i-1}\right)$;

  2. compute the acceptance probability $$p_{i}=\min \left\{ 1,\frac{f\left( y_{i}\right) q\left( x_{i-1}\mid y_{i}\right) }{f\left( x_{i-1}\right) q\left( y_{i}\mid x_{i-1}\right) }\right\} ;$$

  3. draw a uniform random variable $u_{i}$ (on $\left[ 0,1\right]$);

  4. if $u_{i}\leq p_{i}$, accept the proposal and set $x_{i}=y_{i}$; otherwise, reject the proposal and set $x_{i}=x_{i-1}$.
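The following Python sketch implements these steps under simplifying assumptions: a symmetric Gaussian random-walk proposal (so the $q$ terms cancel in the acceptance ratio) and computations on the log scale for numerical stability. The function and parameter names are ours, not a standard library API.

```python
import numpy as np

def metropolis_hastings(log_f, x0, n_draws, proposal_scale=1.0, seed=0):
    """Random-walk Metropolis-Hastings sampler for a scalar target.

    log_f: log of the target density, known up to an additive constant.
    x0: starting value in the support of the target.
    """
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    x, log_fx = x0, log_f(x0)
    for i in range(n_draws):
        # step 1: propose y_i from a symmetric Gaussian random walk
        y = x + proposal_scale * rng.normal()
        log_fy = log_f(y)
        # step 2: log acceptance probability; the q terms cancel by symmetry
        log_p = min(0.0, log_fy - log_fx)
        # steps 3-4: accept or reject by comparing with a uniform draw
        if np.log(rng.uniform()) <= log_p:
            x, log_fx = y, log_fy
        draws[i] = x
    return draws

# Example: sample from a standard normal density known only up to a constant.
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_draws=50_000)
print(draws.mean(), draws.std())  # roughly 0 and 1
```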

It can be proved that, provided some technical conditions are met, the sequence $\left\{ x_{i}\right\}$ thus generated is the realization of a Markov Chain that converges to the stationary distribution $f\left( x\right)$. Furthermore, $$\frac{1}{n}\sum_{i=1}^{n}g\left( x_{i}\right) \overset{a.s.}{\rightarrow }\mathrm{E}\left[ g\left( X\right) \right]$$ for any function $g$ such that the expected value $\mathrm{E}\left[ g\left( X\right) \right]$ exists and is finite, where $\overset{a.s.}{\rightarrow }$ denotes almost sure convergence as $n$ tends to infinity. In other words, a Strong Law of Large Numbers (an ergodic theorem) holds for the sample mean $\frac{1}{n}\sum_{i=1}^{n}g\left( x_{i}\right)$.

The power of this algorithm lies in the fact that you need to know the function $f$ only up to a multiplicative constant. For example, in Bayesian inference it is very common to know the posterior distribution up to a multiplicative constant, because the likelihood and the prior are known but the marginal distribution of the data is not. Suppose $$f\left( x\right) =c\cdot h\left( x\right),$$ where we know $h\left( x\right)$ but not $c$. Then, the acceptance probability in the M-H algorithm is $$p_{i}=\min \left\{ 1,\frac{h\left( y_{i}\right) q\left( x_{i-1}\mid y_{i}\right) }{h\left( x_{i-1}\right) q\left( y_{i}\mid x_{i-1}\right) }\right\}.$$

As a consequence, the acceptance probability, which is the only quantity that depends on $f$, can be computed without knowing the constant $c$. This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know the density of that distribution!

For more details see the lecture on the Metropolis-Hastings algorithm.

Example 2 - Gibbs sampling

Another popular MCMC algorithm is the so-called Gibbs sampling algorithm.

Suppose you want to generate draws of a random vector $x_{\bullet }$ having joint density $$f\left( x_{\bullet ,1},\ldots ,x_{\bullet ,B}\right),$$ where $x_{\bullet ,1}$, ..., $x_{\bullet ,B}$ are $B$ blocks of entries (or single entries) of $x_{\bullet }$.

Given a block $x_{\bullet ,b}$, denote by $x_{\bullet ,-b}$ the vector comprising all blocks except $x_{\bullet ,b}$.

Suppose you are able to generate draws from the $B$ conditional distributions $f\left( x_{\bullet ,1}\mid x_{\bullet ,-1}\right)$, ..., $f\left( x_{\bullet ,B}\mid x_{\bullet ,-B}\right)$. In MCMC jargon, these are called the full conditional distributions.

The Gibbs sampling algorithm starts from any vector $x_{1}$ belonging to the support of the target distribution and generates the subsequent values $x_{i}$ as follows:

  1. for $b=1,\ldots ,B$, generate $x_{i,b}$ from the conditional distribution with density $$f\left( x_{\bullet ,b}\mid x_{i,1},\ldots ,x_{i,b-1},x_{i-1,b+1},\ldots ,x_{i-1,B}\right).$$

In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.

Note that at iteration $i$, when you extract the $b$-th block, the latest draws of blocks 1 to $b-1$ are those already extracted in iteration $i$, while the latest draws of blocks $b+1$ to $B$ are those extracted in the previous iteration ($i-1$).
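As a concrete sketch (with illustrative names and parameters), here is a Gibbs sampler in Python for a bivariate standard normal with correlation $\rho$, whose two full conditional distributions are available in closed form:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws, seed=0):
    """Gibbs sampler for a bivariate standard normal with correlation rho.

    The two full conditionals are known in closed form:
    x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
    """
    rng = np.random.default_rng(seed)
    draws = np.empty((n_draws, 2))
    x1, x2 = 0.0, 0.0  # starting point in the support
    sd = np.sqrt(1.0 - rho**2)
    for i in range(n_draws):
        # block 1: condition on the latest draw of block 2 (iteration i-1)
        x1 = rng.normal(rho * x2, sd)
        # block 2: condition on the draw of block 1 just obtained (iteration i)
        x2 = rng.normal(rho * x1, sd)
        draws[i] = (x1, x2)
    return draws

draws = gibbs_bivariate_normal(rho=0.8, n_draws=50_000)
print(np.corrcoef(draws.T))  # sample correlation close to 0.8
```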

It can be proved that Gibbs sampling is a special case of Metropolis-Hastings. Therefore, as in M-H, the sequence $\left\{ x_{i}\right\}$ generated by the algorithm is the realization of a Markov Chain that converges to the stationary distribution $f\left( x\right)$. Furthermore, an ergodic theorem for the sample means $\frac{1}{n}\sum_{i=1}^{n}g\left( x_{i}\right)$ holds.

Burn-in sample

A common practice is to discard the first draws of an MCMC sample. For example, if we have 110,000 draws, we discard the first 10,000 and keep the remaining 100,000.

The set of draws that are discarded is called the burn-in sample.

Why do we do this? If the starting value $x_{1}$ is extracted from a distribution that is very different from the target distribution $F_{X}$, then the distributions of the subsequent draws $x_{2}$, $x_{3}$, ... will also be very different from $F_{X}$. However, the differences become smaller and smaller as the chain converges to the target distribution. By discarding a burn-in sample, we eliminate draws whose distributions are far from the target distribution and keep draws whose distributions are closer to it. This reduces the bias of any Monte Carlo approximation performed with the MCMC sample.
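Continuing the hypothetical Metropolis-Hastings sampler sketched above, discarding a burn-in sample is a simple slicing operation:

```python
# Generate 110,000 draws, discard the first 10,000 as burn-in, keep 100,000.
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_draws=110_000)
kept = draws[10_000:]
estimate = kept.mean()  # plug-in estimate based on the post-burn-in sample
```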
