StatLect

Bayesian estimation of the parameters of the normal distribution

This lecture shows how to apply the basic principles of Bayesian inference to the problem of estimating the parameters (mean and variance) of a normal distribution.

Unknown mean and known variance

The observed sample used to carry out inferences is a vector $x=\left[ x_{1},\ldots ,x_{n}\right] $ whose entries are $n$ independent and identically distributed draws $x_{1},\ldots ,x_{n}$ from a normal distribution.

In this section, we assume that the mean $\mu $ of the distribution is unknown, while its variance $\sigma ^{2}$ is known.

In the next section, $\sigma ^{2}$ will also be treated as unknown.

The likelihood

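The probability density function of a generic draw $x_{i}$ is $p\left( x_{i}\mid \mu \right) =\left( 2\pi \sigma ^{2}\right) ^{-1/2}\exp \left( -\frac{\left( x_{i}-\mu \right) ^{2}}{2\sigma ^{2}}\right) $, where we use the notation $p\left( x_{i}\mid \mu \right) $ to highlight the fact that the density depends on the unknown parameter $\mu $.

Since $x_{1},\ldots ,x_{n}$ are independent, the likelihood is $p\left( x\mid \mu \right) =\prod_{i=1}^{n}p\left( x_{i}\mid \mu \right) =\left( 2\pi \sigma ^{2}\right) ^{-n/2}\exp \left( -\frac{1}{2\sigma ^{2}}\sum_{i=1}^{n}\left( x_{i}-\mu \right) ^{2}\right) $.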
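As a small illustration, the likelihood of $n$ independent normal draws with known variance can be evaluated on the log scale, which avoids numerical underflow when $n$ is large. This is only a sketch: the function name `normal_log_likelihood` and its argument names are hypothetical and do not come from the lecture.

```python
import math


def normal_log_likelihood(x, mu, sigma2):
    """Log-likelihood of IID normal draws x with mean mu and known variance sigma2.

    Hypothetical helper: sums the logs of the n normal densities.
    """
    n = len(x)
    sum_sq = sum((xi - mu) ** 2 for xi in x)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sum_sq / (2.0 * sigma2)


sample = [1.0, 2.0, 3.0]
# The log-likelihood is largest when mu equals the sample mean (2.0 here).
print(normal_log_likelihood(sample, 2.0, 1.0))
print(normal_log_likelihood(sample, 0.0, 1.0))
```

Seen as a function of $\mu $ for a fixed sample, the log-likelihood is a downward-opening parabola with its maximum at the sample mean.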

The prior

The prior is $p\left( \mu \right) =\left( 2\pi \tau _{0}^{2}\right) ^{-1/2}\exp \left( -\frac{\left( \mu -\mu _{0}\right) ^{2}}{2\tau _{0}^{2}}\right) $, that is, $\mu $ has a normal distribution with mean $\mu _{0}$ and variance $\tau _{0}^{2}$.

This prior is used to express the statistician's belief that the unknown parameter $\mu $ is most likely equal to $\mu _{0}$ and that values of $\mu $ very far from $\mu _{0}$ are quite unlikely (how unlikely depends on the variance $\tau _{0}^{2}$).

The posterior

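Given the prior and the likelihood specified above, the posterior is $p\left( \mu \mid x\right) =\left( 2\pi \tau _{n}^{2}\right) ^{-1/2}\exp \left( -\frac{\left( \mu -\mu _{n}\right) ^{2}}{2\tau _{n}^{2}}\right) $, where $\mu _{n}=\left( \frac{n}{\sigma ^{2}}+\frac{1}{\tau _{0}^{2}}\right) ^{-1}\left( \frac{n}{\sigma ^{2}}\bar{x}+\frac{1}{\tau _{0}^{2}}\mu _{0}\right) $, $\tau _{n}^{2}=\left( \frac{n}{\sigma ^{2}}+\frac{1}{\tau _{0}^{2}}\right) ^{-1}$ and $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ is the sample mean.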

Proof

Write the joint distribution as [eq10], where we have defined [eq11]. Note that [eq12], where in step $\frame{A}$ we added and subtracted the sample mean [eq13] and in step $\frame{B}$ we used the fact that [eq14]. We can use this result to write [eq15], where in step $\frame{A}$ we have defined [eq16]. We can put together the results obtained so far and get [eq17], where [eq18] is a function that depends on $x$ but not on $\mu $, and [eq19] is a probability density function if considered as a function of $\mu $ for any given $x$ (note that $g$ depends on $x$ through $\mu _{n}$). In fact, [eq20] is the density of a normal distribution with mean $\mu _{n}$ and variance $\tau _{n}^{2}$. By a standard result on the factorization of probability density functions (see also the introduction to Bayesian inference), we have that [eq21]. Therefore, the posterior distribution [eq22] is a normal distribution with mean $\mu _{n}$ and variance $\tau _{n}^{2}$. We have yet to figure out what $p\left( x\right) $ is. This will be done in the next proof.

Thus, the posterior distribution of $\mu $ is a normal distribution with mean $\mu _{n}$ and variance $\tau _{n}^{2}$.

Note that the posterior mean $\mu _{n}$ is the weighted average of two signals:

  1. the sample mean $\bar{x}$ of the observed data;

  2. the prior mean $\mu _{0}$.

Both the prior and the sample mean convey some information (a signal) about $\mu $. The signals are combined linearly, and the greater the precision of a signal (that is, the smaller its variance), the higher its weight is.

The weight given to the sample mean increases with n, while the weight given to the prior mean does not. As a consequence, when the sample size n becomes large, more and more weight is given to the sample mean. In the limit, all weight is given to the information coming from the sample and no weight is given to the prior.
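The precision-weighted average described above can be sketched in a few lines of Python. The function name `posterior_known_variance` and its arguments are hypothetical; the sketch assumes the usual conjugate update in which the sample mean has precision $n/\sigma ^{2}$ and the prior mean has precision $1/\tau _{0}^{2}$.

```python
def posterior_known_variance(x, sigma2, mu0, tau0_2):
    """Posterior mean and variance of mu when the data variance sigma2 is known.

    Hypothetical helper. Returns (mu_n, tau_n2, weight_on_sample_mean).
    """
    n = len(x)
    xbar = sum(x) / n
    data_precision = n / sigma2      # precision of the sample mean
    prior_precision = 1.0 / tau0_2   # precision of the prior mean
    tau_n2 = 1.0 / (data_precision + prior_precision)
    w = data_precision * tau_n2      # weight given to the sample mean
    mu_n = w * xbar + (1.0 - w) * mu0
    return mu_n, tau_n2, w


# With more data, the weight on the sample mean moves towards 1.
for size in (1, 10, 100):
    draws = [2.0] * size
    print(size, posterior_known_variance(draws, sigma2=1.0, mu0=0.0, tau0_2=1.0))
```

Running the loop shows the weight on the sample mean rising with the sample size while the posterior variance shrinks.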

The prior predictive distribution

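The prior predictive distribution is $p\left( x\right) =\left( 2\pi \right) ^{-n/2}\left\vert \det \left( \sigma ^{2}I+\tau _{0}^{2}ii^{\top }\right) \right\vert ^{-1/2}\exp \left( -\frac{1}{2}\left( x-\mu _{0}i\right) ^{\top }\left( \sigma ^{2}I+\tau _{0}^{2}ii^{\top }\right) ^{-1}\left( x-\mu _{0}i\right) \right) $, where $i$ is an $n\times 1$ vector of ones and $I$ is the $n\times n$ identity matrix.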

Proof

From the previous proof we know that [eq25], where we have defined [eq26]. By defining [eq27], we can write [eq28], where in step $\frame{A}$ we have used the facts that [eq29] and [eq30], and in step $\frame{B}$ we have used the fact that [eq31], so that [eq32]. Now, note that [eq33], where in step $\frame{A}$ we have used the matrix determinant lemma [eq34]. Now, putting together all the pieces, we have [eq35].

Thus, the prior predictive distribution of $x$ is multivariate normal with mean $\mu _{0}i$ and covariance matrix $\sigma ^{2}I+\tau _{0}^{2}ii^{\top }$.

Under this distribution, a draw $x_{i}$ has prior mean $\mu _{0}$, variance $\sigma ^{2}+\tau _{0}^{2}$ and covariance with the other draws equal to $\tau _{0}^{2}$. The covariance is positive because the draws $x_{i}$, despite being independent conditional on $\mu $, all share the same mean parameter $\mu $, which is random.
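Because the covariance matrix of the prior predictive distribution is a multiple of the identity plus a rank-one term, its determinant and inverse have closed forms (the matrix determinant lemma used in the proof, and the Sherman-Morrison formula, which is brought in here as an assumption of this sketch). The hypothetical function `prior_predictive_logpdf` below evaluates the log-density without building any matrix.

```python
import math


def prior_predictive_logpdf(x, sigma2, mu0, tau0_2):
    """Log-density of N(mu0 * i, sigma2 * I + tau0_2 * i i') evaluated at x.

    Hypothetical helper based on two closed-form identities:
      det(sigma2*I + tau0_2*i i') = sigma2**n * (1 + n*tau0_2/sigma2)
      inverse = (1/sigma2) * (I - tau0_2/(sigma2 + n*tau0_2) * i i')
    """
    n = len(x)
    d = [xi - mu0 for xi in x]
    sum_d = sum(d)
    sum_d2 = sum(di * di for di in d)
    log_det = n * math.log(sigma2) + math.log(1.0 + n * tau0_2 / sigma2)
    quad = (sum_d2 - tau0_2 / (sigma2 + n * tau0_2) * sum_d ** 2) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


print(prior_predictive_logpdf([1.0, 0.0], sigma2=1.0, mu0=0.0, tau0_2=1.0))
```

For a single observation the expression reduces to the log-density of a normal distribution with mean $\mu _{0}$ and variance $\sigma ^{2}+\tau _{0}^{2}$, which is a convenient sanity check.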

The posterior predictive distribution

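Assume that $m$ new observations $x_{n+1},\ldots ,x_{n+m}$ are drawn independently from the same normal distribution from which $x_{1},\ldots ,x_{n}$ have been extracted.

The posterior predictive distribution of the vector $\widetilde{x}=\left[ x_{n+1},\ldots ,x_{n+m}\right] $ is $p\left( \widetilde{x}\mid x\right) =\left( 2\pi \right) ^{-m/2}\left\vert \det \left( \sigma ^{2}I+\tau _{n}^{2}ii^{\top }\right) \right\vert ^{-1/2}\exp \left( -\frac{1}{2}\left( \widetilde{x}-\mu _{n}i\right) ^{\top }\left( \sigma ^{2}I+\tau _{n}^{2}ii^{\top }\right) ^{-1}\left( \widetilde{x}-\mu _{n}i\right) \right) $, where $I$ is the $m\times m$ identity matrix and $i$ is an $m\times 1$ vector of ones.

So, $\widetilde{x}$ has a multivariate normal distribution with mean $\mu _{n}i$ (where $\mu _{n}$ is the posterior mean of $\mu $) and covariance matrix $\sigma ^{2}I+\tau _{n}^{2}ii^{\top }$ (where $\tau _{n}^{2}$ is the posterior variance of $\mu $).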

Proof

The derivation is almost identical to the derivation of the prior predictive distribution of $x$. The posterior $p\left( \mu \mid x\right) $ is used as a new prior. The likelihood $p\left( \widetilde{x}\mid \mu ,x\right) $ is the same as $p\left( \widetilde{x}\mid \mu \right) $ because $\widetilde{x}$ is independent of $x$ conditional on $\mu $. Therefore, we can perform the factorization [eq46] and derive $p\left( \widetilde{x}\mid x\right) $ by following the same procedure we followed to derive $p\left( x\right) $. The main difference is that we need to replace the prior mean $\mu _{0}$ with the posterior mean $\mu _{n}$ and the prior variance $\tau _{0}^{2}$ with the posterior variance $\tau _{n}^{2}$.
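The replacement rule in the proof (posterior mean in place of prior mean, posterior variance in place of prior variance) can be sketched as follows. The function `posterior_predictive_moments` is a hypothetical helper that returns the mean vector and covariance matrix of the $m$ new observations as plain Python lists.

```python
def posterior_predictive_moments(x, m, sigma2, mu0, tau0_2):
    """Mean vector and covariance matrix of m future draws given the sample x.

    Hypothetical helper: it first updates the prior on mu, then reuses the
    structure of the prior predictive with (mu_n, tau_n2) in place of
    (mu0, tau0_2).
    """
    n = len(x)
    xbar = sum(x) / n
    tau_n2 = 1.0 / (n / sigma2 + 1.0 / tau0_2)
    mu_n = tau_n2 * (n * xbar / sigma2 + mu0 / tau0_2)
    mean = [mu_n] * m
    # sigma2 on the diagonal (noise of each draw) plus tau_n2 everywhere
    # (shared uncertainty about mu).
    cov = [[tau_n2 + (sigma2 if r == c else 0.0) for c in range(m)]
           for r in range(m)]
    return mean, cov


mean, cov = posterior_predictive_moments([1.0, 2.0, 3.0], m=2,
                                         sigma2=1.0, mu0=0.0, tau0_2=1.0)
print(mean)
print(cov)
```

The off-diagonal entries equal the posterior variance of $\mu $, so future draws become less correlated as the sample grows and the uncertainty about $\mu $ vanishes.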

Unknown mean and unknown variance

As in the previous section, the sample $x=\left[ x_{1},\ldots ,x_{n}\right] $ is assumed to be a vector of IID draws from a normal distribution.

However, we now assume that not only the mean $\mu $ but also the variance $\sigma ^{2}$ is unknown.

The likelihood

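The probability density function of a generic draw $x_{i}$ is $p\left( x_{i}\mid \mu ,\sigma ^{2}\right) =\left( 2\pi \sigma ^{2}\right) ^{-1/2}\exp \left( -\frac{\left( x_{i}-\mu \right) ^{2}}{2\sigma ^{2}}\right) $. The notation $p\left( x_{i}\mid \mu ,\sigma ^{2}\right) $ highlights the fact that the density depends on the two unknown parameters $\mu $ and $\sigma ^{2}$.

Since $x_{1},\ldots ,x_{n}$ are independent, the likelihood is $p\left( x\mid \mu ,\sigma ^{2}\right) =\prod_{i=1}^{n}p\left( x_{i}\mid \mu ,\sigma ^{2}\right) =\left( 2\pi \sigma ^{2}\right) ^{-n/2}\exp \left( -\frac{1}{2\sigma ^{2}}\sum_{i=1}^{n}\left( x_{i}-\mu \right) ^{2}\right) $.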

The prior

The prior is hierarchical.

First, we assign the following prior to the mean, conditional on the variance: [eq53], that is, $\mu $ has a normal distribution with mean $\mu _{0}$ and variance [eq54].

Note that the variance of the parameter $\mu $ is assumed to be proportional to the unknown variance $\sigma ^{2}$ of the data points. The constant of proportionality $\nu $ determines how tight the prior is, that is, how probable we deem it that $\mu $ is very close to the prior mean $\mu _{0}$.

Then, we assign the following prior to the variance: [eq55], that is, $\sigma ^{2}$ has an inverse-Gamma distribution with parameters $k$ and $1/\sigma _{0}^{2}$ (i.e., the precision $1/\sigma ^{2}$ has a Gamma distribution with parameters $k$ and $1/\sigma _{0}^{2}$).

By the properties of the Gamma distribution, the prior mean of the precision is [eq57] and its variance is [eq58].

We can think of $1/\sigma _{0}^{2}$ as our best guess of the precision of the data-generating distribution. The parameter $k$ expresses our degree of confidence in that guess. The greater $k$ is, the tighter our prior on $1/\sigma ^{2}$ is, and the more probable we deem it that $1/\sigma ^{2}$ is close to $1/\sigma _{0}^{2}$.
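The role of $k$ can be made concrete with a short sketch. It rests on an assumption that is not visible in this text: that a Gamma distribution "with parameters $k$ and $h$" means the degrees-of-freedom and mean parameterization, with density proportional to $z^{k/2-1}\exp \left( -\frac{k}{2h}z\right) $. Under that assumption the mean is $h$, the variance is $2h^{2}/k$, and the equivalent shape and scale are $k/2$ and $2h/k$. The function name is hypothetical.

```python
def precision_prior_moments(k, sigma0_2):
    """Moments of the Gamma prior on the precision 1/sigma^2.

    ASSUMPTION: Gamma(k, h) has density proportional to
    z**(k/2 - 1) * exp(-k*z / (2*h)), with h = 1/sigma0_2.
    Returns (mean, variance, shape, scale) in the shape-scale convention.
    """
    h = 1.0 / sigma0_2
    mean = h
    variance = 2.0 * h * h / k
    shape = k / 2.0
    scale = 2.0 * h / k
    return mean, variance, shape, scale


# The prior mean stays at 1/sigma0_2 while the variance shrinks as k grows.
for k in (2, 20, 200):
    print(k, precision_prior_moments(k, sigma0_2=2.0))
```

The shape and scale values are convenient when a library expects that convention, for example `random.gammavariate(shape, scale)` in the Python standard library.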

The posterior distribution of the mean conditional on the variance

Conditional on $\sigma ^{2}$, the posterior distribution of $\mu $ is [eq59], where [eq60].

Proof

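This can be derived from the case where $\sigma ^{2}$ is known (see above). In that case, $\mu _{n}=\left( \frac{n}{\sigma ^{2}}+\frac{1}{\tau _{0}^{2}}\right) ^{-1}\left( \frac{n}{\sigma ^{2}}\bar{x}+\frac{1}{\tau _{0}^{2}}\mu _{0}\right) $ and $\tau _{n}^{2}=\left( \frac{n}{\sigma ^{2}}+\frac{1}{\tau _{0}^{2}}\right) ^{-1}$. Now, [eq62]. So, [eq63] and [eq64].

Thus, conditional on $\sigma ^{2}$ and $x$, $\mu $ is normal with mean $\mu _{n}$ and variance $\tau _{n}^{2}$.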
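The substitution in the proof can be written out under one explicit assumption, since the prior variance formula is not reproduced in this text: suppose the conditional prior variance of $\mu $ is $\tau _{0}^{2}=\sigma ^{2}/\nu $ (if it were $\nu \sigma ^{2}$ instead, replace $\nu $ with $1/\nu $ throughout). Plugging this into the known-variance formulas gives the following sketch.

```latex
% Assumption: \tau_0^2 = \sigma^2 / \nu, so the prior precision is \nu / \sigma^2.
\begin{align*}
\tau_n^2 &= \left( \frac{n}{\sigma^2} + \frac{\nu}{\sigma^2} \right)^{-1}
          = \frac{\sigma^2}{n + \nu}, \\
\mu_n    &= \tau_n^2 \left( \frac{n}{\sigma^2}\,\bar{x} + \frac{\nu}{\sigma^2}\,\mu_0 \right)
          = \frac{n\,\bar{x} + \nu\,\mu_0}{n + \nu}.
\end{align*}
```

Under this assumption $\sigma ^{2}$ cancels out of the posterior mean, and $\nu $ acts like the number of pseudo-observations carried by the prior.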

The prior predictive distribution conditional on the variance

Conditional on $\sigma ^{2}$, the prior predictive distribution of $x$ is [eq65], where $i$ is an $n\times 1$ vector of ones and $I$ is the $n\times n$ identity matrix.

Proof

This can be derived from the case where $\sigma ^{2}$ is known (see above). In that case, [eq66], where [eq62]. So, [eq68].

The posterior distribution of the variance

The posterior distribution of the variance is [eq69], where [eq70].

Proof

Consider the joint distribution [eq71], where we have defined [eq70]. We can write [eq73], where [eq74] is a function that depends on $x$ (via $\sigma _{n}^{2}$) but not on $\sigma ^{2}$, and [eq75] is a probability density function if considered as a function of $\sigma ^{2}$ for any given $x$ (note that $g$ depends on $x$ through $\sigma _{n}^{2}$). In particular, [eq76] is the density of an inverse-Gamma distribution with parameters $n+k$ and $1/\sigma _{n}^{2}$. Thus, by a well-known result on the factorization of joint probability density functions, we have that [eq77]. Therefore, the posterior distribution [eq78] is inverse-Gamma with parameters $n+k$ and $1/\sigma _{n}^{2}$. The form of $p\left( x\right) $ will be shown in the next proof.

Thus, conditional on $x$, the precision $1/\sigma ^{2}$ has a Gamma distribution with parameters $n+k$ and $1/\sigma _{n}^{2}$.
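The definition of $\sigma _{n}^{2}$ is not reproduced in this text, so the sketch below rests on two labeled assumptions: the conditional prior variance of $\mu $ is $\sigma ^{2}/\nu $, and the Gamma distribution uses the degrees-of-freedom and mean parameterization (density proportional to $z^{k/2-1}\exp \left( -\frac{k}{2h}z\right) $). Under these assumptions, multiplying the prior on the precision by the conditional prior predictive gives $\sigma _{n}^{2}=\frac{1}{n+k}\left( k\sigma _{0}^{2}+\sum_{i}\left( x_{i}-\bar{x}\right) ^{2}+\frac{n\nu }{n+\nu }\left( \bar{x}-\mu _{0}\right) ^{2}\right) $. The function name is hypothetical.

```python
def posterior_variance_parameters(x, mu0, nu, k, sigma0_2):
    """Parameters (n + k, sigma_n2) of the posterior on the variance.

    ASSUMPTIONS (not confirmed by the text): prior variance of mu is
    sigma^2 / nu, and Gamma(k, h) has density proportional to
    z**(k/2 - 1) * exp(-k*z / (2*h)).
    """
    n = len(x)
    xbar = sum(x) / n
    within = sum((xi - xbar) ** 2 for xi in x)  # spread around the sample mean
    # Quadratic form (x - mu0*i)' (I + i i'/nu)^(-1) (x - mu0*i), simplified
    # with the Sherman-Morrison formula.
    quad = within + (n * nu / (n + nu)) * (xbar - mu0) ** 2
    sigma_n2 = (k * sigma0_2 + quad) / (n + k)
    return n + k, sigma_n2


print(posterior_variance_parameters([1.0, 2.0, 3.0], mu0=0.0, nu=1.0,
                                    k=2, sigma0_2=1.0))
```

In this form $\sigma _{n}^{2}$ is a weighted average of the prior guess $\sigma _{0}^{2}$ and the information in the sample, with the sample receiving more weight as $n$ grows.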

The prior predictive distribution

The prior predictive distribution of $x$ is [eq79], that is, a multivariate Student's t distribution with mean $\mu _{0}i$, scale matrix [eq80] and $k$ degrees of freedom.

Proof

The prior predictive distribution has already been derived in the previous proof. We just need a little algebra to show clearly that it is a multivariate Student's t distribution with mean $\mu _{0}i$, scale matrix [eq81] and $k$ degrees of freedom: [eq82].

The posterior distribution of the mean

The posterior distribution of the mean is [eq83], where $B\left( \cdot \right) $ is the Beta function.

Proof

We have already proved that, conditional on $\sigma ^{2}$ and $x$, $\mu $ is normal with mean $\mu _{n}$ and variance [eq84]. We have also proved that, conditional on $x$, $1/\sigma ^{2}$ has a Gamma distribution with parameters $n+k$ and $1/\sigma _{n}^{2}$. Thus, we can write [eq85], where $Z$ is standard normal conditional on $x$ and $\sigma ^{2}$, and $\Gamma _{1}$ has a Gamma distribution with parameters $n+k$ and $1/\sigma _{n}^{2}$. Now, note that, by the properties of the Gamma distribution, [eq86] has a Gamma distribution with parameters $n+k$ and $1$. We can write [eq87]. But [eq88] has a standard Student's t distribution with $n+k$ degrees of freedom (see the lecture on the t distribution). As a consequence, $\mu $ has a Student's t distribution with mean $\mu _{n}$, scale parameter [eq89] and $n+k$ degrees of freedom. Thus, its density is [eq83], where $B\left( \cdot \right) $ is the Beta function.

In other words, $\mu $ has a t distribution with mean $\mu _{n}$, scale parameter [eq91] and $n+k$ degrees of freedom.
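The construction in the proof (a normal draw divided by the square root of a scaled Gamma draw) can be mimicked numerically. The sketch below is hedged in the same way as the derivation it follows: it assumes the conditional prior variance of $\mu $ is $\sigma ^{2}/\nu $ and the degrees-of-freedom and mean parameterization of the Gamma distribution, under which the scale parameter works out to $\sigma _{n}^{2}/\left( n+\nu \right) $. Both function names are hypothetical.

```python
import math
import random


def posterior_t_parameters(x, mu0, nu, k, sigma0_2):
    """Location, squared scale and degrees of freedom of the posterior of mu.

    ASSUMPTIONS (not confirmed by the text): prior variance of mu is
    sigma^2 / nu; Gamma(k, h) has mean h and k degrees of freedom.
    """
    n = len(x)
    xbar = sum(x) / n
    within = sum((xi - xbar) ** 2 for xi in x)
    mu_n = (n * xbar + nu * mu0) / (n + nu)
    quad = within + (n * nu / (n + nu)) * (xbar - mu0) ** 2
    sigma_n2 = (k * sigma0_2 + quad) / (n + k)
    return mu_n, sigma_n2 / (n + nu), n + k


def sample_mu(params, rng):
    """Draw mu as location + scale * Z / sqrt(chi2 / dof), a Student's t variate."""
    mu_n, scale2, dof = params
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with dof degrees of freedom
    return mu_n + math.sqrt(scale2) * z / math.sqrt(chi2 / dof)


params = posterior_t_parameters([1.0, 2.0, 3.0], mu0=0.0, nu=1.0, k=2, sigma0_2=1.0)
rng = random.Random(0)
draws = [sample_mu(params, rng) for _ in range(5)]
print(params)
print(draws)
```

Sampling this way avoids evaluating the t density directly and makes the hierarchical structure explicit: uncertainty about $\sigma ^{2}$ is what turns the normal posterior into a heavier-tailed t distribution.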
