Markov Chain Monte Carlo (MCMC) methods are a powerful class of Monte Carlo techniques that are often used in Bayesian inference.
While "classical" Monte Carlo methods rely on computer-generated samples made up of independent observations, MCMC methods are used to generate sequences of dependent observations. These sequences are Markov chains, which explains the name of the methods.
To understand MCMC, we need to be familiar with the basics of the Monte Carlo method.
We use the Monte Carlo method to approximate a feature of the probability distribution of a random variable $X$ (e.g., its expected value) when we are not able to work it out analytically. With a computer, we generate a sample $x_1, \ldots, x_n$ of independent draws from the distribution of $X$. Then, we use the empirical distribution of the sample to carry out our calculations. The empirical distribution is the discrete probability distribution that assigns probability $1/n$ to each one of the values $x_1, \ldots, x_n$.
Example
If we need to compute the expected value of $X$, we approximate it with the expected value of the empirical distribution, which is just the sample mean of $x_1, \ldots, x_n$.
Example
If we need to calculate the probability that $X$ will be below some threshold, we approximate it with the fraction of observations in the computer-generated sample that are below the threshold.
Example
To approximate the variance of $X$, we compute the sample variance of $x_1, \ldots, x_n$.
In general, whatever calculation we would like to perform on the true distribution of $X$ is approximated by the corresponding calculation performed on the empirical distribution. The latter is easy to perform on a computer because it involves a discrete distribution with finite support.
This kind of procedure usually works well, in the sense that the approximation converges to the true quantity as the sample size $n$ increases. This is known as the plug-in principle.
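The three example calculations above can be sketched in a few lines of Python. For concreteness, the distribution of $X$ is assumed here, purely for illustration, to be exponential with rate 1, so that the true mean, variance, and threshold probability are known exactly:

```python
import random

random.seed(0)

# Generate a sample of n independent draws from the distribution of X.
# For illustration, X is assumed to be exponential with rate 1, so the
# true values are E[X] = 1, Var[X] = 1 and P(X < 1) = 1 - e^(-1) ~ 0.632.
n = 100_000
sample = [random.expovariate(1.0) for _ in range(n)]

# Plug-in approximations: the same calculations, performed on the
# empirical distribution (which puts probability 1/n on each draw).
mean_hat = sum(sample) / n
var_hat = sum((v - mean_hat) ** 2 for v in sample) / n
prob_hat = sum(v < 1.0 for v in sample) / n

print(mean_hat, var_hat, prob_hat)
```

Each estimate approaches its true value as $n$ grows, which is exactly the plug-in principle at work.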
MCMC methods work like standard Monte Carlo methods, although with a twist: the computer-generated draws $x_1, \ldots, x_n$ are not independent; they are serially correlated. In particular, they are the realizations of $n$ random variables $X_1, \ldots, X_n$ that form a Markov chain.
You do not need to be an expert on Markov chains to use MCMC methods. A basic understanding of the material in the following sections suffices, although we recommend reading our introductory lecture on Markov chains.
A random sequence $X_1, X_2, X_3, \ldots$ is a Markov chain if and only if, given the current value $X_t$, the future observations $X_{t+k}$ are conditionally independent of the past values $X_{t-j}$, for any positive integers $k$ and $j$.
This property, known as the Markov property, says that the process is memoryless: the probability distribution of the future values of the chain depends only on its current value $X_t$, regardless of how that value was reached (i.e., regardless of the path followed by the chain in the past).
Thanks to the Markov property, we basically need only two things to analyze a chain: the conditional probabilities (or densities) that $X_{t+1} = x$, given that $X_t = y$, denoted by $p(x \mid y)$; and the unconditional probabilities (or densities) that $X_t = x$, denoted by $p_t(x)$.
Although this is not true in general of any Markov chain, the chains generated by MCMC methods have the following property: two variables $X_t$ and $X_{t+k}$ are not independent, but they become closer and closer to being independent as $k$ increases. This property implies that $p_t$ converges to a fixed distribution as $t$ becomes large.
This is very important. We usually generate the first value of the chain ($x_1$) in a somewhat arbitrary manner, but we know that the distribution of the terms of the chain is less and less affected by our initial choice as $t$ becomes large. In other words, $p_t$ converges to the same distribution irrespective of how we choose $x_1$.
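This convergence can be illustrated with a minimal sketch: a two-state Markov chain with an assumed, illustrative transition matrix, started from two very different initial distributions. After a few dozen transitions, the distributions of the two chains are indistinguishable:

```python
# A two-state Markov chain with an assumed, illustrative transition matrix.
# Row i gives the conditional probabilities of the next state given state i.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(dist):
    """One transition: the distribution of X_{t+1} from that of X_t."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

# Two very different choices for the distribution of the first term.
dist_a = [1.0, 0.0]   # chain started in state 0 with certainty
dist_b = [0.0, 1.0]   # chain started in state 1 with certainty

for t in range(50):
    dist_a = step(dist_a)
    dist_b = step(dist_b)

# Both sequences of distributions approach the same limit, [0.8, 0.2],
# regardless of the initial choice.
print(dist_a, dist_b)
```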
In a chain generated by MCMC methods, not only does $p_t$ converge to a limit, but, as $t$ varies, the distributions $p_t, p_{t+1}, p_{t+2}, \ldots$ become almost identical to each other. They converge to the so-called stationary distribution of the chain.
Moreover, the stationary distribution coincides with the target distribution, that is, the distribution from which we would like to extract a sample.
In other words, the larger $t$ is, the more $p_t$ and $p_{t+1}$ are similar to the target distribution.
Let us summarize what we have said thus far.
We can imagine an MCMC algorithm as a black box that takes two inputs: an initial value $x_1$ and a target distribution. The output is a sequence of values $x_1, x_2, \ldots, x_n$.
Although the first values of the sequence are extracted from distributions that can be quite different from the target distribution, the similarity between these distributions and the target progressively increases: when $t$ is large, $x_t$ is drawn from a distribution that is very similar to the target.
Because of the discrepancy between the target distribution and the distributions of the first terms of the chain, a common practice is to discard the first draws of an MCMC sample.
For example, if we have 110,000 draws, we discard the first 10,000 and keep the remaining 100,000.
The set of draws that are discarded is called the burn-in sample.
By discarding the burn-in sample, we eliminate draws whose distributions are far from the target distribution and we keep draws whose distributions are closer to the target.
This reduces the bias of any Monte Carlo approximation performed with the MCMC sample.
How to choose the size of the burn-in sample is discussed in the lecture on MCMC diagnostics.
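In code, discarding the burn-in sample is a simple slicing operation. In this sketch, the draws are simulated stand-ins for real MCMC output:

```python
import random

random.seed(0)

# Stand-in for MCMC output: 110,000 draws (simulated here for illustration).
draws = [random.gauss(0, 1) for _ in range(110_000)]

burn_in = 10_000          # size of the burn-in sample
kept = draws[burn_in:]    # discard the first 10,000 draws...

print(len(kept))          # ...and keep the remaining 100,000
```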
After discarding a burn-in sample, we have a sample of draws from distributions that are very similar to the target.
However, we still have a problem: our draws are not independent.
What are the implications of the lack of independence?
To understand them, we need to go back to the standard Monte Carlo method.
The accuracy of a standard Monte Carlo simulation depends on the sample size $n$: the larger $n$ is, the better the approximation.
In the case of an MCMC simulation, we need to use the concept of
effective sample size:
dependent observations are equivalent to a smaller number of independent
observations.
For example, 1000 dependent observations could be equivalent to 100 independent observations. In this case, we say that the effective sample size is equal to 100.
Roughly speaking, the higher the correlation between adjacent observations, the smaller the effective sample size, and the less accurate the MCMC approximation is.
This is why in an MCMC simulation, most of the efforts are devoted to reducing the correlation as much as possible, so as to achieve a satisfactory effective sample size.
Our lecture on MCMC diagnostics discusses this point in detail.
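As a rough sketch, the effective sample size can be estimated from the sample autocorrelations of the chain with the formula $n / (1 + 2\sum_k \rho_k)$, truncating the sum at the first non-positive autocorrelation (a common rule of thumb). The chain below is an AR(1) process, used only as an assumed stand-in for autocorrelated MCMC output:

```python
import random

random.seed(0)

# An AR(1) process used as an assumed stand-in for autocorrelated MCMC output.
n, rho = 10_000, 0.9
x = [random.gauss(0, 1)]
for _ in range(n - 1):
    x.append(rho * x[-1] + ((1 - rho**2) ** 0.5) * random.gauss(0, 1))

mean = sum(x) / n
den = sum((v - mean) ** 2 for v in x)

def autocorr(k):
    """Lag-k sample autocorrelation of the chain."""
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / den

# ESS = n / (1 + 2 * sum of autocorrelations), truncating the sum at the
# first non-positive autocorrelation.
s, k = 0.0, 1
while (r := autocorr(k)) > 0:
    s += r
    k += 1
ess = n / (1 + 2 * s)

print(round(ess))  # far fewer than n = 10,000 effective draws
```

With this level of correlation, roughly 10,000 dependent draws carry about as much information as a few hundred independent ones.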
One of the most popular MCMC algorithms is the Metropolis-Hastings (M-H) algorithm.
Denote by $f(x)$ the density (or mass) function of the target distribution, that is, the distribution from which we wish to extract a sequence of draws. For example, $f$ could be the posterior density in Bayesian inference.
Denote by $q(x \mid y)$ a conditional distribution from which we are able to extract computer-generated samples (e.g., a multivariate normal distribution with mean $y$). The generated sample $x$, the value $y$ on which we condition, and the argument of the target distribution all have the same dimension.
The M-H algorithm starts from any value $x_1$ belonging to the support of the target distribution and generates the subsequent values $x_2, x_3, \ldots$ as follows:
1. generate a proposal $y_t$ from the distribution $q(\cdot \mid x_{t-1})$;
2. compute the acceptance probability $$a_t = \min\left(1, \frac{f(y_t)\, q(x_{t-1} \mid y_t)}{f(x_{t-1})\, q(y_t \mid x_{t-1})}\right);$$
3. draw a uniform random variable $u_t$ (on $[0,1]$);
4. if $u_t \le a_t$, accept the proposal and set $x_t = y_t$;
5. otherwise, reject the proposal and set $x_t = x_{t-1}$.
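The steps above can be sketched in Python. For simplicity, the proposal $q(\cdot \mid x)$ is assumed here to be a normal distribution centered at the current value; since this proposal is symmetric, the $q$ terms cancel from the acceptance probability:

```python
import math
import random

random.seed(0)

def f(x):
    """Target density (here, for illustration: a standard normal kernel)."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(f, x1, n, scale=1.0):
    """Random-walk M-H sketch. The proposal q(. | x) is assumed to be
    normal with mean x; because this q is symmetric, the q terms cancel
    from the acceptance probability."""
    draws = [x1]
    for _ in range(n - 1):
        x = draws[-1]
        y = random.gauss(x, scale)        # 1. proposal from q(. | x)
        a = min(1.0, f(y) / f(x))         # 2. acceptance probability
        u = random.random()               # 3. uniform draw on [0, 1]
        draws.append(y if u <= a else x)  # 4./5. accept or reject
    return draws

draws = metropolis_hastings(f, x1=0.0, n=50_000)
kept = draws[5_000:]                      # discard a burn-in sample
mean = sum(kept) / len(kept)
print(mean)                               # close to 0, the target mean
```

Note that `f` above is an unnormalized density; as discussed below, the algorithm works all the same.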
If some technical conditions are met, it can be proved that the distributions of the draws $x_t$ converge to the target distribution $f$.
Furthermore, $$\frac{1}{n}\sum_{t=1}^{n} g(x_t) \xrightarrow{\text{a.s.}} \operatorname{E}[g(X)]$$ where $g$ is any function such that the expected value $\operatorname{E}[g(X)]$ exists and is finite, and $\xrightarrow{\text{a.s.}}$ denotes almost sure convergence as $n$ tends to infinity.
In other words, a Strong Law of Large Numbers (an ergodic theorem) holds for the sample mean $\frac{1}{n}\sum_{t=1}^{n} g(x_t)$.
The power of the Metropolis-Hastings algorithm lies in the fact that you need to know the function $f$ only up to a multiplicative constant. For example, in Bayesian inference it is very common to know the posterior distribution up to a multiplicative constant because the likelihood and the prior are known but the marginal distribution is not.
Suppose that $$f(x) = c\, \bar{f}(x)$$ and we know $\bar{f}$ but not the constant $c$. Then, the acceptance probability in the M-H algorithm is $$a_t = \min\left(1, \frac{\bar{f}(y_t)\, q(x_{t-1} \mid y_t)}{\bar{f}(x_{t-1})\, q(y_t \mid x_{t-1})}\right)$$ because $c$ cancels from the ratio. As a consequence, the acceptance probability, which is the only quantity that depends on $f$, can be computed without knowing the constant $c$.
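A minimal numerical check of this cancellation, assuming for illustration an unnormalized standard normal density and a symmetric proposal (so the $q$ terms drop out of the ratio):

```python
import math

def f_bar(x):
    """Density known only up to a constant: an unnormalized standard normal."""
    return math.exp(-0.5 * x * x)

c = 1.0 / math.sqrt(2.0 * math.pi)  # the (pretend-unknown) normalizing constant

def f(x):
    """The true, normalized density."""
    return c * f_bar(x)

# With a symmetric proposal, the acceptance probability is min(1, f(y)/f(x)).
# The constant c cancels from the ratio, so f_bar works just as well as f.
x, y = 0.3, 1.2
ratio_full = f(y) / f(x)
ratio_bar = f_bar(y) / f_bar(x)
print(abs(ratio_full - ratio_bar) < 1e-12)  # True
```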
This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know the density of that distribution!
For more details, see the lecture on the Metropolis-Hastings algorithm.
Another popular MCMC algorithm is the so-called Gibbs sampling algorithm.
Suppose that we want to generate draws of a random vector $X$ having joint density $$f(x_1, \ldots, x_B)$$ where $x_1, \ldots, x_B$ are blocks of entries (or single entries) of $X$. Given a block $x_b$, denote by $x_{-b}$ the vector comprising all blocks except $x_b$.
Suppose that we are able to generate draws from the conditional distributions $f(x_1 \mid x_{-1})$, ..., $f(x_B \mid x_{-B})$. In MCMC jargon, these are called the full conditional distributions.
The Gibbs sampling algorithm starts from any vector $x^{(1)}$ belonging to the support of the target distribution and generates the subsequent values $x^{(2)}, x^{(3)}, \ldots$ as follows: for $b = 1, \ldots, B$, generate $x_b^{(t)}$ from the conditional distribution with density $$f\left(x_b \mid x_1^{(t)}, \ldots, x_{b-1}^{(t)}, x_{b+1}^{(t-1)}, \ldots, x_B^{(t-1)}\right).$$
In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.
Note that at iteration $t$, when you extract the $b$-th block, the latest draws of blocks $1$ to $b-1$ are those already extracted in iteration $t$, while the latest draws of blocks $b+1$ to $B$ are those extracted in the previous iteration ($t-1$).
It can be proved that Gibbs sampling is a special case of Metropolis-Hastings. Therefore, as in M-H, the distributions of the draws $x^{(t)}$ converge to the target distribution $f$. Furthermore, an ergodic theorem for the sample means $\frac{1}{n}\sum_{t=1}^{n} g\left(x^{(t)}\right)$ holds.
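As a sketch, here is a Gibbs sampler for an assumed, illustrative target with $B = 2$ blocks: a bivariate normal with zero means, unit variances, and correlation `rho`, whose two full conditional distributions are known normals:

```python
import random

random.seed(0)

# Gibbs sampling sketch for an assumed, illustrative target: a bivariate
# normal with zero means, unit variances and correlation rho. Its full
# conditionals are known normals:
#   X1 | X2 = x2  ~  N(rho * x2, 1 - rho^2)
#   X2 | X1 = x1  ~  N(rho * x1, 1 - rho^2)
rho = 0.8
sd = (1 - rho**2) ** 0.5

x1, x2 = 0.0, 0.0                    # an arbitrary starting point
draws = []
for t in range(50_000):
    x1 = random.gauss(rho * x2, sd)  # draw block 1 from its full conditional
    x2 = random.gauss(rho * x1, sd)  # draw block 2, using the NEW x1
    draws.append((x1, x2))

kept = draws[5_000:]                 # discard a burn-in sample
mean1 = sum(d[0] for d in kept) / len(kept)
cross = sum(d[0] * d[1] for d in kept) / len(kept)
print(mean1, cross)                  # mean1 close to 0, cross close to rho
```

Note how block 2 conditions on the draw of block 1 already extracted in the same iteration, exactly as described above.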
If you liked this lecture and you want to learn more about Markov Chain Monte Carlo methods, then we suggest that you read these two lectures:
Please cite as:
Taboga, Marco (2021). "Markov Chain Monte Carlo (MCMC) methods", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/Markov-Chain-Monte-Carlo.