
Bayesian inference


Bayesian inference is a way of making statistical inferences in which the statistician assigns subjective probabilities to the distributions that could generate the data. These subjective probabilities form the so-called prior distribution. After the data is observed, Bayes' rule is used to update the prior, that is, to revise the probabilities assigned to the possible data generating distributions. These revised probabilities form the so-called posterior distribution.

Remember that the main elements of a statistical inference problem are the following:

  1. we observe some data (a sample);

  2. we write the sample as a vector x;

  3. we regard x as the realization of a random vector X;

  4. we do not know the probability distribution of X (i.e., the distribution that generated our sample);

  5. we define a statistical model, that is, a set $\Phi$ of probability distributions that could have generated the data;

  6. optionally, we parametrize the model, that is, we put the elements of $\Phi$ in correspondence with a set of real vectors called parameters;

  7. we use the sample and the statistical model to make a statement (an inference) about the unknown data generating distribution (or about the parameter that corresponds to it).

In Bayesian inference, we assign a subjective distribution to the elements of $\Phi$, and then we use the data to derive a posterior distribution.

In parametric Bayesian inference, the subjective distribution is assigned to the parameters that are put into correspondence with the elements of $\Phi$.

The likelihood

The first building block of a parametric Bayesian model is the likelihood
$$p(x \mid \theta)$$
which is equal to the probability density of x when the parameter of the true data generating distribution is equal to $\theta$.

Note that, for the time being, we are assuming that x and $\theta$ are continuous. Later, we will discuss how to relax this assumption.

Example Suppose the sample $x = (x_1, \ldots, x_n)$ is a vector of n independent and identically distributed draws $x_1, \ldots, x_n$ from a normal distribution. The mean $\mu$ of the distribution is unknown, while its variance $\sigma^2$ is known. These are the two parameters of the model. The probability density function of a generic draw $x_i$ is
$$p(x_i \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
where we use the notation $p(x_i \mid \mu)$ to highlight the fact that $\mu$ is unknown and the density of $x_i$ depends on this unknown parameter. Because the observations $x_1, \ldots, x_n$ are independent, we can write the likelihood as
$$p(x \mid \mu) = \prod_{i=1}^{n} p(x_i \mid \mu) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)$$
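The likelihood in the example is easy to evaluate numerically. A minimal sketch in Python (the function name and the sample values are mine, chosen for illustration):

```python
import math

def normal_likelihood(x, mu, sigma2):
    """Likelihood p(x | mu) of n iid draws from N(mu, sigma2), with sigma2 known."""
    n = len(x)
    ss = sum((xi - mu) ** 2 for xi in x)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-ss / (2 * sigma2))

sample = [1.2, 0.8, 1.5]
# The likelihood, viewed as a function of mu, peaks at the sample mean.
print(normal_likelihood(sample, mu=sum(sample) / len(sample), sigma2=1.0))
```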

The prior

The second building block of a Bayesian model is the prior
$$p(\theta)$$
which is equal to the subjective probability density assigned to the parameter $\theta$ of the data generating distribution.

Example Let us continue the previous example. The statistician believes that the parameter $\mu$ is most likely equal to $\mu_0$ and that values of $\mu$ very far from $\mu_0$ are quite unlikely. She expresses this belief about the parameter $\mu$ by assigning to it a normal distribution with mean $\mu_0$ and variance $\tau^2$. So the prior is
$$p(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(\mu - \mu_0)^2}{2\tau^2}\right)$$

The prior predictive distribution

Having specified the prior and the likelihood, we can derive the marginal density of x:
$$p(x) \overset{A}{=} \int p(x, \theta)\, d\theta \overset{B}{=} \int p(x \mid \theta)\, p(\theta)\, d\theta$$
where: in step A we have performed the so-called marginalization (see the lecture on random vectors); in step B we have used the fact that a joint density can be written as the product of a conditional and a marginal density (see the lecture on conditional probability distributions).

The notation
$$\int \cdot \; d\theta$$
is a shorthand for the multiple integral
$$\int \cdots \int \cdot \; d\theta_1 \cdots d\theta_K$$
where K is the dimension of the parameter vector $\theta$.

The marginal density of x, derived in the manner above, is often called the prior predictive distribution. Roughly speaking, it is the probability distribution we assign to the data x before observing it.

Example Given the prior and the likelihood specified in the previous two examples, it can be proved that the prior predictive distribution is
$$x \sim N\left(\mu_0 i,\; \sigma^2 I + \tau^2 i i^{\top}\right)$$
where i is an $n \times 1$ vector of ones, and I is the $n \times n$ identity matrix. Thus, the prior predictive distribution of x is multivariate normal with mean $\mu_0 i$ and covariance matrix $\sigma^2 I + \tau^2 i i^{\top}$. Hence, under the prior predictive distribution, a draw $x_i$ has mean $\mu_0$, variance $\sigma^2 + \tau^2$ and covariance with the other draws equal to $\tau^2$. The covariance is induced by the fact that the mean parameter $\mu$, which is stochastic, is the same for all draws.
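The moments stated in the example can be checked by simulation: draw $\mu$ from the prior, then draw the sample given $\mu$, and look at the empirical mean, variance, and covariance. A sketch with illustrative parameter values:

```python
import random

random.seed(0)
mu0, tau2, sigma2, n = 2.0, 1.0, 0.5, 5   # illustrative values

# One draw from the prior predictive: first mu ~ N(mu0, tau2), then n draws
# x_i ~ N(mu, sigma2) that all share the same realized mu.
def prior_predictive_draw():
    mu = random.gauss(mu0, tau2 ** 0.5)
    return [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]

draws = [prior_predictive_draw() for _ in range(200_000)]
m = sum(d[0] for d in draws) / len(draws)
var = sum((d[0] - m) ** 2 for d in draws) / len(draws)
cov = sum((d[0] - m) * (d[1] - m) for d in draws) / len(draws)
print(m, var, cov)  # approximately mu0, sigma2 + tau2, tau2
```

The positive covariance between draws appears only because the shared mean $\mu$ is itself random, exactly as the example argues.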

The posterior

After observing the data x, the statistician can use Bayes' rule to update the prior about the parameter $\theta$:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

The conditional density $p(\theta \mid x)$ is called the posterior distribution of the parameter.

By using the formula for the marginal density derived above, we obtain
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}$$
which makes clear that the posterior depends on the two distributions specified by the statistician: the prior $p(\theta)$ and the likelihood $p(x \mid \theta)$.

Example In the normal model of the previous examples, it can be proved that the posterior is
$$p(\mu \mid x) = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left(-\frac{(\mu - \mu_n)^2}{2\sigma_n^2}\right)$$
where
$$\mu_n = \frac{\tau^2}{\tau^2 + \sigma^2/n}\, \bar{x} + \frac{\sigma^2/n}{\tau^2 + \sigma^2/n}\, \mu_0, \qquad \sigma_n^2 = \left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)^{-1}$$
Thus, the posterior distribution of $\mu$ is normal with mean $\mu_n$ and variance $\sigma_n^2$. Note that the posterior mean $\mu_n$ is a weighted average of the sample mean $\bar{x}$ and the prior mean $\mu_0$. The weights are inversely proportional to the variances of the two means: if the prior variance $\tau^2$ is high, then the prior mean $\mu_0$ receives little weight; by the same token, if the variance of the sample mean (which is equal to $\sigma^2/n$) is high, then the sample mean receives little weight and more weight is assigned to the prior. Both the sample mean and the prior mean provide information about $\mu$. They are combined, but more weight is given to the signal with higher precision (smaller variance). Note also that when the sample size n becomes very large (goes to infinity), all the weight goes to the information coming from the sample (the sample mean) and none to the prior. This is typical of Bayesian inference.
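The conjugate update in the example amounts to a few lines of arithmetic. A sketch (the function name is mine):

```python
def posterior_normal(x, mu0, tau2, sigma2):
    """Posterior mean and variance of mu in the normal model with known variance.

    mu_n is a precision-weighted average of the sample mean and the prior mean;
    sigma2_n is the inverse of the total precision.
    """
    n = len(x)
    xbar = sum(x) / n
    w = tau2 / (tau2 + sigma2 / n)        # weight on the sample mean
    mu_n = w * xbar + (1 - w) * mu0
    sigma2_n = 1 / (1 / tau2 + n / sigma2)
    return mu_n, sigma2_n

# As data accumulates, the posterior mean moves toward the sample mean
# and the posterior variance shrinks.
print(posterior_normal([1.0, 1.2, 0.8], mu0=0.0, tau2=1.0, sigma2=1.0))
```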


Given any posterior density $p(\theta \mid x)$, we can take any function $q(x)$ of the data that does not depend on $\theta$ and use it to build another function
$$g(\theta \mid x) = q(x)\, p(\theta \mid x)$$
We write
$$p(\theta \mid x) \propto g(\theta \mid x)$$
that is, $p(\theta \mid x)$ is proportional to $g(\theta \mid x)$, in order to highlight that $g(\theta \mid x)$ is equal to $p(\theta \mid x)$ times a constant ($q(x)$; remember that the data is a constant after being observed).

The posterior can be recovered from $g(\theta \mid x) = q(x)\, p(\theta \mid x)$ as follows:
$$\frac{g(\theta \mid x)}{\int g(\theta \mid x)\, d\theta} \overset{A}{=} \frac{q(x)\, p(\theta \mid x)}{q(x) \int p(\theta \mid x)\, d\theta} \overset{B}{=} \frac{q(x)\, p(\theta \mid x)}{q(x)} = p(\theta \mid x)$$
where: in step A we have used the fact that $q(x)$ does not depend on $\theta$ and, as a consequence, can be brought out of the integral; in step B we have used the fact that the integral of a density (over the whole support) is equal to 1.

In summary, by multiplying the posterior by any constant (that does not depend on $\theta$, but may depend on x), we obtain a function proportional to the posterior. If we divide this new function by its integral over the whole parameter space, we recover the posterior.

The posterior is proportional to the prior times the likelihood

Note that in the posterior formula
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$
the marginal density $p(x)$ does not depend on $\theta$ (because $\theta$ is "integrated out"). Thus, by using the notation introduced in the previous section, we can write
$$p(\theta \mid x) \propto p(\theta)\, p(x \mid \theta)$$
that is, the posterior $p(\theta \mid x)$ is proportional to the prior $p(\theta)$ times the likelihood $p(x \mid \theta)$.

Note that both $p(x \mid \theta)$ and $p(\theta)$ are known (they are specified by the statistician), so we are saying that the posterior (which we want to compute) is proportional to two known quantities. This proportionality is extremely important: there are various methods that exploit it to compute the posterior when the marginal density $p(x)$ cannot be computed, and hence Bayes' rule cannot be applied directly.
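The crudest of these methods is a grid approximation: evaluate prior times likelihood at many points and normalize by the sum. A sketch for the normal example (the grid bounds, step, data, and parameter values are illustrative; normalizing constants are dropped because only proportionality matters):

```python
import math

mu0, tau2, sigma2 = 0.0, 1.0, 1.0   # illustrative values
x = [0.9, 1.1, 1.3]                 # made-up data

def unnormalized_posterior(mu):
    # prior(mu) * likelihood(x | mu), with constant factors dropped
    prior = math.exp(-(mu - mu0) ** 2 / (2 * tau2))
    lik = math.exp(-sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))
    return prior * lik

grid = [i / 1000 for i in range(-5000, 5001)]   # mu in [-5, 5], step 0.001
weights = [unnormalized_posterior(m) for m in grid]
total = sum(weights)
post_mean = sum(m * w for m, w in zip(grid, weights)) / total
print(post_mean)  # close to the analytical posterior mean (0.825 here)
```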

The posterior predictive distribution

Suppose a second sample of data, denoted by $y$, is observed after observing the sample x and updating the prior about the parameter $\theta$, that is, after computing the posterior $p(\theta \mid x)$.

Suppose that the distribution of $y$ also depends on $\theta$, but is independent of x conditional on $\theta$:
$$p(y \mid \theta, x) = p(y \mid \theta)$$

Then the distribution of $y$ given x is
$$p(y \mid x) = \int p(y \mid \theta, x)\, p(\theta \mid x)\, d\theta = \int p(y \mid \theta)\, p(\theta \mid x)\, d\theta$$

The distribution of $y$ given x, derived in the manner above, is often called the posterior predictive distribution.

Example In the normal model of the previous examples, the prior is updated with n draws $x_1, \ldots, x_n$. Consider a new draw $x_{n+1}$ from the same normal distribution. It can be proved that the posterior predictive distribution of $x_{n+1}$ is a normal distribution with mean $\mu_n$ (the posterior mean of $\mu$) and variance $\sigma^2 + \sigma_n^2$, where $\sigma_n^2$ is the posterior variance of $\mu$.
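In code, the predictive moments follow directly from the posterior moments. A sketch (the function name is mine):

```python
def posterior_predictive(x, mu0, tau2, sigma2):
    """Mean and variance of the next draw x_{n+1} given the observed sample x."""
    n = len(x)
    xbar = sum(x) / n
    sigma2_n = 1 / (1 / tau2 + n / sigma2)               # posterior variance of mu
    mu_n = sigma2_n * (mu0 / tau2 + n * xbar / sigma2)   # posterior mean of mu
    # The predictive variance adds the noise variance sigma2
    # to the remaining uncertainty about mu.
    return mu_n, sigma2 + sigma2_n

print(posterior_predictive([1.0], mu0=0.0, tau2=1.0, sigma2=1.0))
```

Note that the predictive variance never falls below $\sigma^2$: even if $\mu$ were known exactly, the next draw would still be noisy.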

Quantities of interest

After having updated the prior, we can make statements (inferences) about the parameter $\theta$ by using its posterior distribution; more generally, we can make statements about quantities that depend on $\theta$ by using the posterior predictive distribution introduced in the previous section. These quantities, about which we want to make a statement, are often called quantities of interest.

The Bayesian approach provides us with a posterior probability distribution of our quantity of interest. We are free to summarize that distribution in any way that we deem convenient or that fits our purposes. For example, we can:

  1. use its mean, median or mode as a point estimate of the quantity of interest;

  2. report an interval to which the quantity belongs with high posterior probability (a so-called credible interval);

  3. compute the posterior probability that the quantity belongs to a given set.

The discrete case

Up to now we have assumed that x and $\theta$ are continuous. When they are discrete, there are no substantial changes, but probability density functions are replaced with probability mass functions and integrals are replaced with summations.

For example, if $\theta$ is discrete and x is continuous, Bayes' rule becomes
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\sum_{\theta} p(x \mid \theta)\, p(\theta)}$$
where the sum runs over the support of the prior, $p(\theta)$ is a probability mass function and $p(x \mid \theta)$ is a density in x.

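A sketch of this mixed case, with a discrete parameter and continuous data (the candidate values of $\mu$, the prior weights, and the data are made up for illustration):

```python
import math

# Discrete parameter, continuous data: the mean mu of a normal distribution
# is known to be one of three values; x is a sample of iid N(mu, sigma2) draws.
mus = [0.0, 1.0, 2.0]
prior = {0.0: 0.25, 1.0: 0.5, 2.0: 0.25}
sigma2 = 1.0

def likelihood(x, mu):
    n = len(x)
    ss = sum((xi - mu) ** 2 for xi in x)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-ss / (2 * sigma2))

def posterior(x):
    """Bayes' rule where the integral over theta becomes a sum."""
    unnorm = {m: prior[m] * likelihood(x, m) for m in mus}
    z = sum(unnorm.values())          # the marginal p(x), now a sum
    return {m: u / z for m, u in unnorm.items()}

post = posterior([0.9, 1.2, 1.1])
print(post)  # mass concentrates on mu = 1.0, the value nearest the data
```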
It often happens that we are not able to apply Bayes' rule
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$
because we cannot derive the marginal distribution $p(x)$ analytically.

However, given the joint distribution
$$p(x, \theta) = p(x \mid \theta)\, p(\theta)$$
if we are able to express it as
$$p(x, \theta) = h(x)\, g(\theta \mid x)$$
where $h(x)$ is a function that depends only on x, and $g(\theta \mid x)$ is a probability density (or mass) function of $\theta$ (for any fixed x), then
$$p(\theta \mid x) = g(\theta \mid x)$$

See the lecture on the factorization of probability density functions for a proof of this fact (and a detailed exposition with examples).


Markov Chain Monte Carlo methods

There are several Bayesian models for which the posterior distribution of the parameters can be computed analytically. However, this is often not possible. When an analytical solution is not available, the methods most commonly employed to derive the posterior distribution numerically are the so-called Markov Chain Monte Carlo (MCMC) methods. These are Monte Carlo methods that generate a large sample of correlated draws from the posterior distribution of the parameter vector by exploiting the proportionality
$$p(\theta \mid x) \propto p(\theta)\, p(x \mid \theta)$$
The empirical distribution of the generated sample can then be used to produce plug-in estimates of the quantities of interest.
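A minimal random-walk Metropolis sampler for the normal example illustrates the idea: the algorithm only ever evaluates the unnormalized posterior, prior times likelihood, so the intractable marginal $p(x)$ is never needed. The proposal scale, chain length, burn-in, and data below are illustrative choices, not recommendations:

```python
import math
import random

random.seed(0)
mu0, tau2, sigma2 = 0.0, 1.0, 1.0   # illustrative values
x = [0.9, 1.1, 1.3]                 # made-up data

def log_unnorm_posterior(mu):
    # log prior + log likelihood, with additive constants dropped
    log_prior = -(mu - mu0) ** 2 / (2 * tau2)
    log_lik = -sum((xi - mu) ** 2 for xi in x) / (2 * sigma2)
    return log_prior + log_lik

draws = []
current = 0.0
for _ in range(50_000):
    proposal = current + random.gauss(0, 0.5)   # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio); work in logs for stability.
    if math.log(random.random()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(current):
        current = proposal
    draws.append(current)

burned = draws[5_000:]   # discard an initial burn-in period
post_mean_mcmc = sum(burned) / len(burned)
print(post_mean_mcmc)    # approximately the analytical posterior mean (0.825 here)
```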

See the lecture on MCMC methods for more details.
