
Markov Chain Monte Carlo (MCMC) methods

by Marco Taboga, PhD

Markov Chain Monte Carlo (MCMC) methods are very powerful Monte Carlo methods that are often used in Bayesian inference.

While "classical" Monte Carlo methods rely on computer-generated samples made up of independent observations, MCMC methods are used to generate sequences of dependent observations. These sequences are Markov chains, which explains the name of the methods.

MCMC vs standard Monte Carlo: the main difference is that MCMC produces a sequence of correlated draws, while in standard Monte Carlo the draws are independent of each other.


Monte Carlo

To understand MCMC, we need to be familiar with the basics of the Monte Carlo method.

We use the Monte Carlo method to approximate a feature of the probability distribution of a random variable X (e.g., its expected value), when we are not able to work it out analytically.

With a computer, we generate a sample $x_{1},\ldots ,x_{T}$ of independent draws from the distribution of X.

Then, we use the empirical distribution of the sample to carry out our calculations. The empirical distribution is the discrete probability distribution that assigns probability $1/T$ to each one of the values $x_{1},\ldots ,x_{T}$.

Example If we need to compute the expected value of X, we approximate it with the expected value of the empirical distribution, which is just the sample mean $\frac{1}{T}\sum_{t=1}^{T}x_{t}$.

Example If we need to calculate the probability that X will be below some threshold, we approximate it with the fraction of observations in the computer-generated sample that are below the threshold.

Example To approximate the variance of X, we compute the sample variance of $x_{1},\ldots ,x_{T}$.
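The three approximations above are one-liners in code. Here is a minimal Python sketch, assuming for illustration that X is a standard normal variable (any distribution we can sample from would do):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000

# Independent computer-generated draws from the distribution of X
# (standard normal here, purely as an illustrative choice).
x = rng.standard_normal(T)

# Expected value: approximated by the sample mean of x_1, ..., x_T.
print("E[X]    ~", x.mean())

# P(X < threshold): approximated by the fraction of draws below it.
threshold = 1.0
print("P(X<1)  ~", (x < threshold).mean())

# Variance: approximated by the sample variance of x_1, ..., x_T.
print("Var[X]  ~", x.var(ddof=1))
```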

Plug-in principle

In general, whatever calculation we would like to perform on the true distribution of X is approximated by the corresponding calculation performed on the empirical distribution. The latter is easy to perform on a computer because it involves a discrete distribution with finite support.

This kind of procedure usually works well, in the sense that the approximation converges to the true quantity as the sample size $T$ increases. This is known as the plug-in principle.

According to the plug-in principle, a feature of the true distribution can be approximated by the same feature of the empirical distribution.

Basics of Markov Chains

MCMC methods work like standard Monte Carlo methods, but with a twist: the computer-generated draws $x_{1},\ldots ,x_{T}$ are not independent; they are serially correlated.

In particular, they are the realizations of $T$ random variables $X_{1},\ldots ,X_{T}$ that form a Markov chain.

You do not need to be an expert on Markov chains to use MCMC methods. A basic understanding of the material in the following sections suffices, although we recommend reading our introductory lecture on Markov chains.

Markov property

A random sequence $X_{1},X_{2},X_{3},\ldots$ is a Markov chain if and only if, given the current value $X_{t}$, the future values $X_{t+n}$ are conditionally independent of the past values $X_{t-k}$, for any positive integers $k$ and $n$.

This property, known as the Markov property, says that the process is memoryless: the probability distribution of the future values of the chain depends only on its current value $X_{t}$, regardless of how the value was reached (i.e., regardless of the path followed by the chain in the past).

In more formal terms, the Markov property requires that $f\left( x_{t+n} \mid x_{t},x_{t-1},\ldots ,x_{1}\right) =f\left( x_{t+n} \mid x_{t}\right)$, that is, the conditional distribution of the chain at time $t+n$, given its values at time $t$ and at all previous times, coincides with the conditional distribution given the value at time $t$ alone.
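To make the definition concrete, here is a minimal sketch of a Markov chain, a Gaussian random walk (an illustrative choice): each new value is generated from the current value alone, so the earlier path is irrelevant.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_chain(x1, T, sigma=1.0):
    """Gaussian random walk: X_t = X_{t-1} + Gaussian noise.

    The draw at time t uses only the value at time t-1, never the
    earlier path -- exactly the memorylessness stated above.
    """
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        x[t] = x[t - 1] + sigma * rng.standard_normal()
    return x

chain = simulate_chain(x1=0.0, T=1_000)
```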

Conditional and unconditional distributions

Thanks to the Markov property, we basically need only two things to analyze a chain:

  1. the conditional probabilities (or densities) that $X_{t+n}=x_{t+n}$, given that $X_{t}=x_{t}$, denoted by $f\left( x_{t+n} \mid x_{t}\right)$;

  2. the unconditional probabilities that $X_{t}=x_{t}$, denoted by $f\left( x_{t}\right)$.

Asymptotic independence

Although this is not true of Markov chains in general, the chains generated by MCMC methods are asymptotically independent: the dependence between $X_{t}$ and $X_{t+n}$ vanishes as $n$ grows.

This property implies that the conditional distribution $f\left( x_{t+n} \mid x_{t}\right)$ converges to the unconditional distribution $f\left( x_{t+n}\right)$ as $n$ becomes large.

This is very important. We usually generate the first value of the chain ($x_{1}$) in a somewhat arbitrary manner, but we know that the distribution of the terms of the chain is less and less affected by this initial choice as $t$ becomes large. In other words, $f\left( x_{t} \mid x_{1}\right)$ converges to the same distribution $f\left( x_{t}\right)$ irrespective of how we choose $x_{1}$.
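We can check this numerically. The sketch below uses a stationary AR(1) chain (an illustrative choice, not an MCMC algorithm): two chains started very far apart quickly forget their initial values and exhibit the same long-run behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_chain(x1, T, phi=0.8):
    """AR(1) chain X_t = phi * X_{t-1} + noise; for |phi| < 1 the
    distribution of X_t forgets the initial value x1 as t grows."""
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

# Two chains started at opposite extremes...
a = ar1_chain(x1=-100.0, T=50_000)
b = ar1_chain(x1=+100.0, T=50_000)

# ...behave identically in the long run (compare means after a warm-up).
print(a[1_000:].mean(), b[1_000:].mean())   # both close to 0
```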

Target distribution

In a chain generated by MCMC methods, not only does $f\left( x_{t} \mid x_{1}\right)$ converge to $f\left( x_{t}\right)$, but, as $t$ grows, the distributions $f\left( x_{t}\right)$ also become almost identical to each other: they converge to the so-called stationary distribution of the chain.

Moreover, the stationary distribution coincides with the target distribution, that is, the distribution from which we would like to extract a sample.

In other words, the larger $t$ is, the more similar $f\left( x_{t} \mid x_{1}\right)$ and $f\left( x_{t}\right)$ are to the target distribution.

The target distribution is the distribution from which we would like to extract computer-generated draws.

A black-box approach

Let us summarize what we have said thus far.

We can imagine an MCMC algorithm as a black box that takes two inputs:

  1. an initial value $x_{1}$;

  2. a target distribution.

The output is a sequence of values $x_{1},\ldots ,x_{T}$.

Although the first values of the sequence are extracted from distributions that can be quite different from the target distribution, the similarity between these distributions and the target progressively increases: when $t$ is large, $x_{t}$ is drawn from a distribution that is very similar to the target.

An MCMC algorithm takes as inputs the target distribution and the initial value. Its output is a correlated sample whose last part is made up of draws from the target.

Burn-in sample

Because of the discrepancy between the target distribution and the distributions of the first terms of the chain, a common practice is to discard the first draws of an MCMC sample.

For example, if we have 110,000 draws, we discard the first 10,000 and keep the remaining 100,000.

The set of draws that are discarded is called the burn-in sample.

By discarding the burn-in sample, we eliminate draws whose distributions are far from the target distribution and we keep draws whose distributions are closer to the target.

This reduces the bias of any Monte Carlo approximation performed with the MCMC sample.
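In code, discarding the burn-in sample is a simple slice. A sketch with the numbers of the example above (draws stands for a hypothetical array of MCMC output):

```python
import numpy as np

draws = np.zeros(110_000)   # placeholder for 110,000 MCMC draws
burn_in = 10_000

kept = draws[burn_in:]      # discard the first 10,000 draws
print(kept.shape)           # (100000,) -- the 100,000 we keep
```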

How to choose the size of the burn-in sample is discussed in the lecture on MCMC diagnostics.

A trace plot (the time series of the draws) of an MCMC sample typically makes this visible: the first part of the sample has a clearly different distribution and needs to be discarded (burn-in).

Correlation and effective sample size

After discarding a burn-in sample, we have a sample of draws from distributions that are very similar to the target.

However, we still have a problem: our draws are not independent.

What are the implications of the lack of independence?

To understand them, we need to go back to the standard Monte Carlo method.

The accuracy of a standard Monte Carlo simulation depends on the sample size $T$: the larger $T$ is, the better the approximation.

In the case of an MCMC simulation, we need to use the concept of effective sample size: $T$ dependent observations are equivalent to a smaller number of independent observations.

For example, 1000 dependent observations could be equivalent to 100 independent observations. In this case, we say that the effective sample size is equal to 100.

Roughly speaking, the higher the correlation between adjacent observations, the smaller the effective sample size, and the less accurate the MCMC approximation is.

This is why in an MCMC simulation, most of the efforts are devoted to reducing the correlation as much as possible, so as to achieve a satisfactory effective sample size.
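One common estimate of the effective sample size is $T/(1+2\sum_{k}\rho _{k})$, where $\rho _{k}$ is the lag-$k$ autocorrelation. The sketch below implements this estimate, truncating the sum at the first non-positive autocorrelation (a simple convention; others exist):

```python
import numpy as np

def effective_sample_size(x):
    """Estimate ESS as T / (1 + 2 * sum of positive-lag autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    x = x - x.mean()
    var = x.var()
    denom = 1.0
    for k in range(1, T):
        rho = np.dot(x[:-k], x[k:]) / ((T - k) * var)
        if rho <= 0:
            break
        denom += 2.0 * rho
    return T / denom

# Strongly correlated AR(1) draws: ESS is far below the nominal size.
rng = np.random.default_rng(0)
draws = np.empty(10_000)
draws[0] = 0.0
for t in range(1, len(draws)):
    draws[t] = 0.9 * draws[t - 1] + rng.standard_normal()
print(len(draws), effective_sample_size(draws))
```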

Our lecture on MCMC diagnostics discusses this point in detail.

Metropolis-Hastings

One of the most popular MCMC algorithms is the Metropolis-Hastings (M-H) algorithm.

Denote by $f\left( x\right)$ the density (or mass) function of the target distribution, that is, the distribution from which we wish to extract a sequence of draws.

For example, $f$ could be the posterior density in Bayesian inference.

Denote by $q\left( x \mid x^{\prime }\right)$ a conditional distribution from which we are able to extract computer-generated samples (e.g., a multivariate normal distribution with mean $x^{\prime }$).

The generated sample x, the value $x^{\prime }$ on which we condition, and the argument of the target distribution $f$ all have the same dimension.

The algorithm

The M-H algorithm starts from any value $x_{1}$ belonging to the support of the target distribution and generates the subsequent values $x_{t}$ as follows (a Python sketch of these steps appears after the list):

  1. generate a proposal $y_{t}$ from the distribution $q\left( y \mid x_{t-1}\right)$;

  2. compute the acceptance probability $p_{t}=\min \left\{ 1,\dfrac{f\left( y_{t}\right) q\left( x_{t-1} \mid y_{t}\right) }{f\left( x_{t-1}\right) q\left( y_{t} \mid x_{t-1}\right) }\right\}$;

  3. draw a uniform random variable $u_{t}$ (on $\left[ 0,1\right]$);

  4. if $u_{t}\leq p_{t}$, accept the proposal and set $x_{t}=y_{t}$; otherwise, reject the proposal and set $x_{t}=x_{t-1}$.
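Here is a sketch of these four steps in Python. For illustration it assumes a Gaussian random-walk proposal, which is symmetric, so the $q$ terms in the acceptance probability cancel, and a standard normal target; it works with the log of the target density, a standard trick for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Log-density of the target, up to an additive constant
    (standard normal here, purely as an illustration)."""
    return -0.5 * x**2

def metropolis_hastings(x1, T, step=1.0):
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        # 1. proposal from q(y | x[t-1]): Gaussian centered at x[t-1].
        y = x[t - 1] + step * rng.standard_normal()
        # 2.-4. since q is symmetric, p_t = min(1, f(y)/f(x[t-1]));
        # drawing u_t and testing u_t <= p_t is done on the log scale.
        if np.log(rng.uniform()) <= log_target(y) - log_target(x[t - 1]):
            x[t] = y            # accept the proposal
        else:
            x[t] = x[t - 1]     # reject: repeat the previous value
    return x

draws = metropolis_hastings(x1=0.0, T=50_000)
```

Note that log_target only needs to be known up to an additive constant, which is the logarithmic counterpart of the multiplicative constant discussed below.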

Convergence

If some technical conditions are met, it can be proved that the distributions of the draws $x_{t}$ converge to the target distribution $f\left( x\right)$.

Furthermore, $\frac{1}{T}\sum_{t=1}^{T}g\left( x_{t}\right) \overset{\text{a.s.}}{\longrightarrow }\mathrm{E}\left[ g\left( X\right) \right]$, where $g$ is any function such that the expected value $\mathrm{E}\left[ g\left( X\right) \right]$ exists and is finite, and $\overset{\text{a.s.}}{\longrightarrow }$ denotes almost sure convergence as $T$ tends to infinity.

In other words, a Strong Law of Large Numbers (an ergodic theorem) holds for the sample mean $\frac{1}{T}\sum_{t=1}^{T}g\left( x_{t}\right)$.

The main advantage of the algorithm

The power of the Metropolis-Hastings algorithm lies in the fact that you need to know the function $f$ only up to a multiplicative constant.

For example, in Bayesian inference it is very common to know the posterior distribution up to a multiplicative constant because the likelihood and the prior are known but the marginal distribution is not.

Suppose that $f\left( x\right) =c\,h\left( x\right)$, where $c$ is an unknown constant, and we know $h\left( x\right)$ but not $c$.

Then, the acceptance probability in the M-H algorithm is $p_{t}=\min \left\{ 1,\dfrac{c\,h\left( y_{t}\right) q\left( x_{t-1} \mid y_{t}\right) }{c\,h\left( x_{t-1}\right) q\left( y_{t} \mid x_{t-1}\right) }\right\} =\min \left\{ 1,\dfrac{h\left( y_{t}\right) q\left( x_{t-1} \mid y_{t}\right) }{h\left( x_{t-1}\right) q\left( y_{t} \mid x_{t-1}\right) }\right\}$.

As a consequence, the acceptance probability, which is the only quantity that depends on $f$, can be computed without knowing the constant $c$.

This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know the density of that distribution!

Learn more

For more details, see the lecture on the Metropolis-Hastings algorithm.

Gibbs sampling

Another popular MCMC algorithm is the so-called Gibbs sampling algorithm.

Full conditionals

Suppose that we want to generate draws of a random vector $x_{\bullet }$ having joint density $f\left( x_{\bullet ,1},\ldots ,x_{\bullet ,B}\right)$, where $x_{\bullet ,1},\ldots ,x_{\bullet ,B}$ are $B$ blocks of entries (or single entries) of $x_{\bullet }$.

Given a block $x_{\bullet ,b}$, denote by $x_{\bullet ,-b}$ the vector comprising all blocks except $x_{\bullet ,b}$.

Suppose that we are able to generate draws from the $B$ conditional distributions $f\left( x_{\bullet ,1} \mid x_{\bullet ,-1}\right) ,\ldots ,f\left( x_{\bullet ,B} \mid x_{\bullet ,-B}\right)$. In MCMC jargon, these are called the full conditional distributions.

The algorithm

The Gibbs sampling algorithm starts from any vector $x_{1}=\left( x_{1,1},\ldots ,x_{1,B}\right)$ belonging to the support of the target distribution and generates the subsequent values $x_{t}$ as follows:

  1. for $b=1,\ldots ,B$, generate $x_{t,b}$ from the conditional distribution with density $f\left( x_{t,b} \mid x_{t,1},\ldots ,x_{t,b-1},x_{t-1,b+1},\ldots ,x_{t-1,B}\right)$.

In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.

Note that at iteration $t$, when you extract the $b$-th block, the latest draws of blocks 1 to $b-1$ are those already extracted in iteration $t$, while the latest draws of blocks $b+1$ to $B$ are those extracted in the previous iteration ($t-1$).
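A sketch for a bivariate standard normal target with correlation rho (an illustrative choice whose two full conditionals are known in closed form): at each iteration the first block is drawn conditioning on the previous iteration's second block, and the second block conditioning on the current iteration's first block, exactly the ordering described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(T, rho=0.9):
    """Gibbs sampler for a bivariate standard normal with correlation rho.

    Full conditionals: X_a | X_b = v  ~  N(rho * v, 1 - rho**2).
    """
    x = np.zeros((T, 2))                 # x[0] = (0, 0): arbitrary start
    sd = np.sqrt(1.0 - rho**2)
    for t in range(1, T):
        # Block 1: condition on the latest draw of block 2 (iteration t-1).
        x[t, 0] = rho * x[t - 1, 1] + sd * rng.standard_normal()
        # Block 2: condition on the latest draw of block 1 (iteration t).
        x[t, 1] = rho * x[t, 0] + sd * rng.standard_normal()
    return x

draws = gibbs_bivariate_normal(T=50_000)
print(np.corrcoef(draws[1_000:].T))      # off-diagonal close to 0.9
```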

Convergence

It can be proved that Gibbs sampling is a special case of Metropolis-Hastings. Therefore, as in M-H, the distributions of the draws $x_{t}$ converge to the target distribution $f\left( x\right)$. Furthermore, an ergodic theorem holds for the sample means $\frac{1}{T}\sum_{t=1}^{T}g\left( x_{t}\right)$.

Learn more about MCMC methods

If you liked this lecture and you want to learn more about Markov Chain Monte Carlo methods, then we suggest that you read the lectures on the Metropolis-Hastings algorithm and on MCMC diagnostics.

How to cite

Please cite as:

Taboga, Marco (2021). "Markov Chain Monte Carlo (MCMC) methods", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo.
