This lecture shows how to apply the basic principles of Bayesian inference to the problem of estimating the parameters (mean and variance) of a normal distribution.
The observed sample used to carry out inferences is a vector
$$x = (x_1, \ldots, x_n)$$
whose $n$ entries $x_1, \ldots, x_n$ are independent and identically distributed draws from a normal distribution.
In this section, we are going to assume that the mean $\mu$ of the distribution is unknown, while its variance $\sigma^2$ is known. In the next section, $\sigma^2$ also will be treated as unknown.
The probability density function of a generic draw $x_i$ is
$$p(x_i \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right),$$
where we use the notation $p(x_i \mid \mu)$ to highlight the fact that the density depends on the unknown parameter $\mu$. Since the draws $x_1, \ldots, x_n$ are independent, the likelihood is
$$p(x \mid \mu) = \prod_{i=1}^n p(x_i \mid \mu) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right).$$
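For readers who want to experiment with these formulas numerically, here is a minimal Python sketch of the log-likelihood (the function name and the example data are ours; `sigma2` denotes the known variance $\sigma^2$):

```python
import numpy as np

def log_likelihood(x, mu, sigma2):
    """Log of the normal likelihood p(x | mu) when the variance sigma2 is known."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Example: for a known variance, the log-likelihood is maximized at the sample mean.
x = np.array([1.2, 0.7, 1.9, 1.4])
print(log_likelihood(x, mu=x.mean(), sigma2=1.0))
print(log_likelihood(x, mu=0.0, sigma2=1.0))   # smaller
```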
The prior is
$$p(\mu) = \frac{1}{\sqrt{2\pi\tau_0^2}} \exp\left( -\frac{(\mu - \mu_0)^2}{2\tau_0^2} \right),$$
that is, $\mu$ has a normal distribution with mean $\mu_0$ and variance $\tau_0^2$. This prior is used to express the statistician's belief that the unknown parameter $\mu$ is most likely equal to $\mu_0$ and that values of $\mu$ very far from $\mu_0$ are quite unlikely (how unlikely depends on the variance $\tau_0^2$).
Given the prior and the likelihood specified above, the posterior is
$$p(\mu \mid x) = \frac{1}{\sqrt{2\pi\tau_n^2}} \exp\left( -\frac{(\mu - \mu_n)^2}{2\tau_n^2} \right),$$
where
$$\tau_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1}, \qquad \mu_n = \tau_n^2 \left( \frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau_0^2} \mu_0 \right), \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.$$
Proof. Write the joint distribution as
$$p(x, \mu) = p(x \mid \mu)\, p(\mu) = (2\pi\sigma^2)^{-n/2} (2\pi\tau_0^2)^{-1/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{(\mu - \mu_0)^2}{2\tau_0^2} \right),$$
where we have defined the sample mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.$$
Note that
$$\sum_{i=1}^n (x_i - \mu)^2 \overset{(A)}{=} \sum_{i=1}^n \left[ (x_i - \bar{x}) + (\bar{x} - \mu) \right]^2 \overset{(B)}{=} \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2,$$
where in step (A) we added and subtracted the sample mean $\bar{x}$ and in step (B) we used the fact that
$$\sum_{i=1}^n (x_i - \bar{x}) = 0.$$
We can use this result to write
$$p(x \mid \mu) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x})^2 \right) \exp\left( -\frac{n}{2\sigma^2} (\bar{x} - \mu)^2 \right).$$
Moreover, completing the square in $\mu$,
$$\frac{n}{\sigma^2}(\bar{x} - \mu)^2 + \frac{1}{\tau_0^2}(\mu - \mu_0)^2 \overset{(A)}{=} \frac{1}{\tau_n^2}(\mu - \mu_n)^2 + \frac{(\bar{x} - \mu_0)^2}{\sigma^2/n + \tau_0^2},$$
where in step (A) we have defined
$$\tau_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1}, \qquad \mu_n = \tau_n^2 \left( \frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau_0^2} \mu_0 \right).$$
We can put together the results obtained so far and get
$$p(x, \mu) = g(x)\, h(\mu \mid x),$$
where
$$g(x) = (2\pi\sigma^2)^{-n/2} (2\pi\tau_0^2)^{-1/2} (2\pi\tau_n^2)^{1/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x})^2 - \frac{(\bar{x} - \mu_0)^2}{2(\sigma^2/n + \tau_0^2)} \right)$$
is a function that depends on $x$ but not on $\mu$, and
$$h(\mu \mid x) = (2\pi\tau_n^2)^{-1/2} \exp\left( -\frac{(\mu - \mu_n)^2}{2\tau_n^2} \right)$$
is a probability density function if considered as a function of $\mu$ for any given $x$ (note that $h(\mu \mid x)$ depends on $x$ through $\mu_n$, which is a function of the sample mean $\bar{x}$). In fact, $h(\mu \mid x)$ is the density of a normal distribution with mean $\mu_n$ and variance $\tau_n^2$.
By a standard result on the factorization of probability density functions (see also the introduction to Bayesian inference), we have that
$$p(\mu \mid x) = h(\mu \mid x).$$
Therefore, the posterior distribution is a normal distribution with mean $\mu_n$ and variance $\tau_n^2$. We have yet to figure out what $g(x)$ is. This will be done in the next proof.
Thus, the posterior distribution of $\mu$ is a normal distribution with mean
$$\mu_n = \tau_n^2 \left( \frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau_0^2} \mu_0 \right)$$
and variance
$$\tau_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1}.$$
Note that the posterior mean $\mu_n$ is the weighted average of two signals: the sample mean $\bar{x}$ of the observed data and the prior mean $\mu_0$. In fact,
$$\mu_n = \frac{n/\sigma^2}{n/\sigma^2 + 1/\tau_0^2}\, \bar{x} + \frac{1/\tau_0^2}{n/\sigma^2 + 1/\tau_0^2}\, \mu_0.$$
The greater the precision of a signal, the higher its weight is. Both the prior and the sample mean convey some information (a signal) about $\mu$. The signals are combined (linearly), but more weight is given to the signal that has higher precision (smaller variance).
The weight given to the sample mean increases with the sample size $n$, while the weight given to the prior mean does not. As a consequence, when the sample size $n$ becomes large, more and more weight is given to the sample mean. In the limit, all weight is given to the information coming from the sample and no weight is given to the prior.
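A minimal Python sketch of this posterior update (the function name and the example values are ours; `mu0` and `tau0_2` are the prior hyperparameters $\mu_0$ and $\tau_0^2$ used above):

```python
import numpy as np

def posterior_mu_known_variance(x, sigma2, mu0, tau0_2):
    """Posterior mean and variance of mu when the data variance sigma2 is known.

    Prior: mu ~ N(mu0, tau0_2).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    tau_n_2 = 1.0 / (n / sigma2 + 1.0 / tau0_2)            # posterior variance
    mu_n = tau_n_2 * (n / sigma2 * xbar + mu0 / tau0_2)    # precision-weighted average
    return mu_n, tau_n_2

# The posterior mean lies between the prior mean and the sample mean,
# and moves towards the sample mean as the sample size grows.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)                # true sigma2 = 1, known
print(posterior_mu_known_variance(x, sigma2=1.0, mu0=0.0, tau0_2=4.0))
```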
The prior predictive distribution is
$$x \sim N\!\left( \mu_0 \mathbf{1}_n,\; \sigma^2 I_n + \tau_0^2 \mathbf{1}_n \mathbf{1}_n^\top \right),$$
where $\mathbf{1}_n$ is an $n \times 1$ vector of ones, and $I_n$ is the $n \times n$ identity matrix.
Proof. From the previous proof we know that
$$p(x) = g(x) = (2\pi\sigma^2)^{-n/2} (2\pi\tau_0^2)^{-1/2} (2\pi\tau_n^2)^{1/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \bar{x})^2 - \frac{(\bar{x} - \mu_0)^2}{2(\sigma^2/n + \tau_0^2)} \right),$$
where we have defined
$$\tau_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1} = \frac{\sigma^2 \tau_0^2}{\sigma^2 + n\tau_0^2}.$$
By defining
$$\Sigma = \sigma^2 I_n + \tau_0^2 \mathbf{1}_n \mathbf{1}_n^\top,$$
we can write
$$(x - \mu_0 \mathbf{1}_n)^\top \Sigma^{-1} (x - \mu_0 \mathbf{1}_n) \overset{(A)}{=} \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu_0)^2 - \frac{\tau_0^2/\sigma^2}{\sigma^2 + n\tau_0^2}\, n^2 (\bar{x} - \mu_0)^2 \overset{(B)}{=} \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{n(\bar{x} - \mu_0)^2}{\sigma^2 + n\tau_0^2},$$
where in step (A) we have used the facts that
$$\Sigma^{-1} = \frac{1}{\sigma^2} I_n - \frac{\tau_0^2/\sigma^2}{\sigma^2 + n\tau_0^2} \mathbf{1}_n \mathbf{1}_n^\top$$
(Sherman-Morrison formula) and
$$\mathbf{1}_n^\top (x - \mu_0 \mathbf{1}_n) = n(\bar{x} - \mu_0),$$
and in step (B) we have used the fact that
$$\sum_{i=1}^n (x_i - \mu_0)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2,$$
so that
$$\frac{n(\bar{x} - \mu_0)^2}{\sigma^2} - \frac{n^2 \tau_0^2 (\bar{x} - \mu_0)^2}{\sigma^2(\sigma^2 + n\tau_0^2)} = \frac{n(\bar{x} - \mu_0)^2}{\sigma^2 + n\tau_0^2} = \frac{(\bar{x} - \mu_0)^2}{\sigma^2/n + \tau_0^2}.$$
Now, note that
$$\det(\Sigma) \overset{(A)}{=} \sigma^{2n} \left( 1 + \frac{n\tau_0^2}{\sigma^2} \right) = \sigma^{2(n-1)} (\sigma^2 + n\tau_0^2),$$
where in step (A) we have used the matrix determinant lemma, and that
$$(2\pi\sigma^2)^{-n/2} (2\pi\tau_0^2)^{-1/2} (2\pi\tau_n^2)^{1/2} = (2\pi)^{-n/2} \sigma^{-n} \sqrt{\frac{\sigma^2}{\sigma^2 + n\tau_0^2}} = (2\pi)^{-n/2} \det(\Sigma)^{-1/2}.$$
Now, putting together all the pieces, we have
$$p(x) = (2\pi)^{-n/2} \det(\Sigma)^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_0 \mathbf{1}_n)^\top \Sigma^{-1} (x - \mu_0 \mathbf{1}_n) \right).$$
Thus, the prior predictive distribution of $x$ is multivariate normal with mean $\mu_0 \mathbf{1}_n$ and covariance matrix
$$\Sigma = \sigma^2 I_n + \tau_0^2 \mathbf{1}_n \mathbf{1}_n^\top.$$
Under this distribution, a draw $x_i$ has prior mean $\mu_0$, variance $\sigma^2 + \tau_0^2$ and covariance with the other draws equal to $\tau_0^2$. The covariance is positive because the draws $x_1, \ldots, x_n$, despite being independent conditional on $\mu$, all share the same mean parameter $\mu$, which is random.
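These moments can also be checked by simulation. The following sketch (with arbitrary parameter values chosen for illustration) draws $\mu$ from the prior and then the sample conditional on $\mu$; the empirical covariance matrix should be close to $\sigma^2 I_n + \tau_0^2 \mathbf{1}_n \mathbf{1}_n^\top$:

```python
import numpy as np

# Monte Carlo check of the prior predictive moments: draw mu from the prior,
# then draw the sample conditional on mu.  Parameter values are arbitrary.
rng = np.random.default_rng(1)
mu0, tau0_2, sigma2, n = 1.0, 0.5, 2.0, 3
reps = 200_000

mu = rng.normal(mu0, np.sqrt(tau0_2), size=reps)                # mu ~ N(mu0, tau0_2)
x = rng.normal(mu[:, None], np.sqrt(sigma2), size=(reps, n))    # x_i | mu ~ N(mu, sigma2)

print(x.mean(axis=0))             # every entry close to mu0
print(np.cov(x, rowvar=False))    # close to sigma2 * I + tau0_2 * ones((n, n))
```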
Assume that $m$ new observations
$$\widetilde{x} = (x_{n+1}, \ldots, x_{n+m})$$
are drawn independently from the same normal distribution from which $x_1, \ldots, x_n$ have been extracted. The posterior predictive distribution of the vector $\widetilde{x}$ is
$$\widetilde{x} \mid x \sim N\!\left( \mu_n \mathbf{1}_m,\; \sigma^2 I_m + \tau_n^2 \mathbf{1}_m \mathbf{1}_m^\top \right),$$
where $I_m$ is the $m \times m$ identity matrix and $\mathbf{1}_m$ is an $m \times 1$ vector of ones. So, $\widetilde{x}$ has a multivariate normal distribution with mean $\mu_n \mathbf{1}_m$ (where $\mu_n$ is the posterior mean of $\mu$) and covariance matrix $\sigma^2 I_m + \tau_n^2 \mathbf{1}_m \mathbf{1}_m^\top$ (where $\tau_n^2$ is the posterior variance of $\mu$).
Proof. The derivation is almost identical to the derivation of the prior predictive distribution of $x$. The posterior
$$p(\mu \mid x)$$
is used as a new prior. The likelihood
$$p(\widetilde{x} \mid \mu, x)$$
is the same as $p(\widetilde{x} \mid \mu)$ because $\widetilde{x}$ is independent of $x$ conditional on $\mu$. Therefore, we can perform the factorization
$$p(\widetilde{x}, \mu \mid x) = p(\widetilde{x} \mid \mu)\, p(\mu \mid x)$$
and derive $p(\widetilde{x} \mid x)$ by following the same procedure we followed to derive $p(x)$. The main difference is that we need to replace the prior mean $\mu_0$ with the posterior mean $\mu_n$ and the prior variance $\tau_0^2$ with the posterior variance $\tau_n^2$.
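Putting the pieces together, here is a minimal sketch of the posterior predictive moments (the function name and hyperparameter names are ours and mirror the notation above):

```python
import numpy as np

def posterior_predictive_known_variance(x, sigma2, mu0, tau0_2, m):
    """Mean vector and covariance matrix of m future draws, given the data x.

    Same normal likelihood (known sigma2) and normal prior N(mu0, tau0_2) as above.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    tau_n_2 = 1.0 / (n / sigma2 + 1.0 / tau0_2)                 # posterior variance of mu
    mu_n = tau_n_2 * (n / sigma2 * x.mean() + mu0 / tau0_2)     # posterior mean of mu
    mean = np.full(m, mu_n)
    cov = sigma2 * np.eye(m) + tau_n_2 * np.ones((m, m))
    return mean, cov

mean, cov = posterior_predictive_known_variance([1.8, 2.3, 2.1], sigma2=1.0,
                                                mu0=0.0, tau0_2=4.0, m=2)
print(mean)
print(cov)
```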
As in the previous section, the sample $x = (x_1, \ldots, x_n)$ is assumed to be a vector of IID draws from a normal distribution. However, we now assume that not only the mean $\mu$, but also the variance $\sigma^2$ is unknown. The probability density function of a generic draw $x_i$ is
$$p(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right).$$
The notation $p(x_i \mid \mu, \sigma^2)$ highlights the fact that the density depends on the two unknown parameters $\mu$ and $\sigma^2$. Since the draws $x_1, \ldots, x_n$ are independent, the likelihood is
$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^n p(x_i \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right).$$
The prior is hierarchical.
First, we assign the following prior to the mean, conditional on the variance:
$$p(\mu \mid \sigma^2) = \sqrt{\frac{\nu_0}{2\pi\sigma^2}} \exp\left( -\frac{\nu_0 (\mu - \mu_0)^2}{2\sigma^2} \right),$$
that is, conditional on $\sigma^2$, $\mu$ has a normal distribution with mean $\mu_0$ and variance $\sigma^2 / \nu_0$.
Note that the variance of the parameter $\mu$ is assumed to be proportional to the unknown variance $\sigma^2$ of the data points. The constant of proportionality $1/\nu_0$ determines how tight the prior is, that is, how probable we deem that $\mu$ is very close to the prior mean $\mu_0$.
Then, we assign the following prior to the variance:
$$p(\sigma^2) = \frac{(k_0 \sigma_0^2 / 2)^{k_0/2}}{\Gamma(k_0/2)} (\sigma^2)^{-k_0/2 - 1} \exp\left( -\frac{k_0 \sigma_0^2}{2\sigma^2} \right),$$
that is, $\sigma^2$ has an inverse-Gamma distribution with parameters $k_0/2$ and $k_0 \sigma_0^2 / 2$ (i.e., the precision $1/\sigma^2$ has a Gamma distribution with shape parameter $k_0/2$ and rate parameter $k_0 \sigma_0^2 / 2$).
By the properties of the Gamma distribution, the prior mean of the precision is
$$\operatorname{E}\!\left[ \frac{1}{\sigma^2} \right] = \frac{k_0/2}{k_0 \sigma_0^2 / 2} = \frac{1}{\sigma_0^2}$$
and its variance is
$$\operatorname{Var}\!\left[ \frac{1}{\sigma^2} \right] = \frac{k_0/2}{(k_0 \sigma_0^2 / 2)^2} = \frac{2}{k_0 \sigma_0^4}.$$
We can think of $1/\sigma_0^2$ as our best guess of the precision of the data generating distribution. $k_0$ is the parameter that we use to express our degree of confidence in our guess about the precision. The greater $k_0$, the tighter our prior about $1/\sigma^2$ is, and the more we deem probable that $1/\sigma^2$ is close to $1/\sigma_0^2$.
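Under the parametrization used here, these two moments can be verified with scipy (the numerical values are arbitrary and only illustrate how the variance of the precision shrinks as $k_0$ grows):

```python
from scipy.stats import gamma

# The precision 1/sigma^2 has a Gamma distribution with shape k0/2 and
# rate k0*sigma0_2/2 (scipy's `scale` is 1/rate).  Its mean is 1/sigma0_2.
sigma0_2 = 2.0
for k0 in (2, 10, 100):
    precision = gamma(a=k0 / 2, scale=2.0 / (k0 * sigma0_2))
    print(k0, precision.mean(), precision.var())   # mean = 0.5, var = 2/(k0*sigma0_2**2)
```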
Conditional on $\sigma^2$, the posterior distribution of the mean $\mu$ is
$$\mu \mid \sigma^2, x \sim N\!\left( \mu_n,\; \frac{\sigma^2}{\nu_n} \right),$$
where
$$\nu_n = \nu_0 + n, \qquad \mu_n = \frac{\nu_0 \mu_0 + n\bar{x}}{\nu_0 + n}.$$
Proof. This can be derived from the case where $\sigma^2$ is known (see above). In that case
$$\tau_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1}, \qquad \mu_n = \tau_n^2 \left( \frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau_0^2} \mu_0 \right).$$
Now, $\tau_0^2 = \sigma^2 / \nu_0$. So,
$$\tau_n^2 = \left( \frac{n}{\sigma^2} + \frac{\nu_0}{\sigma^2} \right)^{-1} = \frac{\sigma^2}{\nu_0 + n} = \frac{\sigma^2}{\nu_n}$$
and
$$\mu_n = \frac{\sigma^2}{\nu_0 + n} \left( \frac{n}{\sigma^2} \bar{x} + \frac{\nu_0}{\sigma^2} \mu_0 \right) = \frac{\nu_0 \mu_0 + n\bar{x}}{\nu_0 + n}.$$
Thus, conditional on $x$ and $\sigma^2$, $\mu$ is normal with mean $\mu_n$ and variance $\sigma^2 / \nu_n$.
Conditional on $\sigma^2$, the prior predictive distribution of $x$ is
$$x \mid \sigma^2 \sim N\!\left( \mu_0 \mathbf{1}_n,\; \sigma^2 \left( I_n + \frac{1}{\nu_0} \mathbf{1}_n \mathbf{1}_n^\top \right) \right),$$
where $\mathbf{1}_n$ is an $n \times 1$ vector of ones, and $I_n$ is the $n \times n$ identity matrix.
Proof. This can be derived from the case where $\sigma^2$ is known (see above). In that case
$$x \sim N\!\left( \mu_0 \mathbf{1}_n,\; \sigma^2 I_n + \tau_0^2 \mathbf{1}_n \mathbf{1}_n^\top \right),$$
where $\tau_0^2 = \sigma^2 / \nu_0$. So,
$$x \mid \sigma^2 \sim N\!\left( \mu_0 \mathbf{1}_n,\; \sigma^2 I_n + \frac{\sigma^2}{\nu_0} \mathbf{1}_n \mathbf{1}_n^\top \right) = N\!\left( \mu_0 \mathbf{1}_n,\; \sigma^2 \left( I_n + \frac{1}{\nu_0} \mathbf{1}_n \mathbf{1}_n^\top \right) \right).$$
The posterior distribution of the variance $\sigma^2$ is
$$p(\sigma^2 \mid x) = \frac{(k_n \sigma_n^2 / 2)^{k_n/2}}{\Gamma(k_n/2)} (\sigma^2)^{-k_n/2 - 1} \exp\left( -\frac{k_n \sigma_n^2}{2\sigma^2} \right),$$
where
$$k_n = k_0 + n, \qquad k_n \sigma_n^2 = k_0 \sigma_0^2 + \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{n\nu_0}{\nu_0 + n} (\bar{x} - \mu_0)^2.$$
Proof. Consider the joint distribution
$$p(x, \sigma^2) = p(x \mid \sigma^2)\, p(\sigma^2) = (2\pi)^{-n/2} \left( \frac{\nu_0}{\nu_0 + n} \right)^{1/2} (\sigma^2)^{-n/2} \exp\left( -\frac{q(x)}{2\sigma^2} \right) \frac{(k_0 \sigma_0^2 / 2)^{k_0/2}}{\Gamma(k_0/2)} (\sigma^2)^{-k_0/2 - 1} \exp\left( -\frac{k_0 \sigma_0^2}{2\sigma^2} \right),$$
where $p(x \mid \sigma^2)$ is the conditional prior predictive distribution derived above and we have defined
$$q(x) = (x - \mu_0 \mathbf{1}_n)^\top \left( I_n + \tfrac{1}{\nu_0} \mathbf{1}_n \mathbf{1}_n^\top \right)^{-1} (x - \mu_0 \mathbf{1}_n) = \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{n\nu_0}{\nu_0 + n} (\bar{x} - \mu_0)^2.$$
We can write
$$p(x, \sigma^2) = g(x)\, h(\sigma^2 \mid x),$$
where
$$g(x) = (2\pi)^{-n/2} \left( \frac{\nu_0}{\nu_0 + n} \right)^{1/2} \frac{(k_0 \sigma_0^2 / 2)^{k_0/2}}{\Gamma(k_0/2)} \frac{\Gamma(k_n/2)}{(k_n \sigma_n^2 / 2)^{k_n/2}}$$
is a function that depends on $x$ (via $\sigma_n^2$) but not on $\sigma^2$, and
$$h(\sigma^2 \mid x) = \frac{(k_n \sigma_n^2 / 2)^{k_n/2}}{\Gamma(k_n/2)} (\sigma^2)^{-k_n/2 - 1} \exp\left( -\frac{k_n \sigma_n^2}{2\sigma^2} \right)$$
is a probability density function if considered as a function of $\sigma^2$ for any given $x$ (note that $h(\sigma^2 \mid x)$ depends on $x$ through $\sigma_n^2$). In particular, $h(\sigma^2 \mid x)$ is the density of an inverse-Gamma distribution with parameters $k_n/2$ and $k_n \sigma_n^2 / 2$.
Thus, by a well-known result on the factorization of joint probability density functions, we have that
$$p(\sigma^2 \mid x) = h(\sigma^2 \mid x).$$
Therefore, the posterior distribution is inverse-Gamma with parameters $k_n/2$ and $k_n \sigma_n^2 / 2$. What distribution $g(x)$ is will be shown in the next proof.
Thus, conditional on $x$, the precision $1/\sigma^2$ has a Gamma distribution with shape parameter $k_n/2$ and rate parameter $k_n \sigma_n^2 / 2$.
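All the posterior hyperparameters of this section can be computed with a few lines of Python. The sketch below (function and variable names are ours) implements the updates for $\mu_n$, $\nu_n$, $k_n$ and $\sigma_n^2$ derived above:

```python
import numpy as np

def nig_posterior(x, mu0, nu0, k0, sigma0_2):
    """Conjugate update for the normal model with unknown mean and variance.

    Prior: mu | sigma^2 ~ N(mu0, sigma^2 / nu0),
           sigma^2 ~ inverse-Gamma(k0/2, k0*sigma0_2/2).
    Returns the posterior hyperparameters (mu_n, nu_n, k_n, sigma_n_2).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)                    # sum of squared deviations
    nu_n = nu0 + n
    mu_n = (nu0 * mu0 + n * xbar) / nu_n
    k_n = k0 + n
    k_n_sigma_n_2 = k0 * sigma0_2 + ss + (n * nu0 / nu_n) * (xbar - mu0) ** 2
    return mu_n, nu_n, k_n, k_n_sigma_n_2 / k_n

print(nig_posterior([1.8, 2.3, 2.1, 2.6], mu0=0.0, nu0=1.0, k0=2.0, sigma0_2=1.0))
```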
The prior predictive distribution of $x$ is
$$p(x) = \frac{\Gamma\!\left( \frac{k_0 + n}{2} \right)}{\Gamma\!\left( \frac{k_0}{2} \right) (k_0 \pi)^{n/2} \det(V)^{1/2}} \left( 1 + \frac{1}{k_0} (x - \mu_0 \mathbf{1}_n)^\top V^{-1} (x - \mu_0 \mathbf{1}_n) \right)^{-(k_0 + n)/2}, \qquad V = \sigma_0^2 \left( I_n + \frac{1}{\nu_0} \mathbf{1}_n \mathbf{1}_n^\top \right),$$
that is, a multivariate Student's t distribution with mean $\mu_0 \mathbf{1}_n$, scale matrix $V$ and $k_0$ degrees of freedom.
Proof. The prior predictive distribution has already been derived in the previous proof: it is the function $g(x)$. We just need to do a little bit of algebra to clearly show that it is a multivariate Student's t distribution with mean $\mu_0 \mathbf{1}_n$, scale matrix $V$ and $k_0$ degrees of freedom:
$$g(x) = (2\pi)^{-n/2} \left( \frac{\nu_0}{\nu_0 + n} \right)^{1/2} \frac{\Gamma(k_n/2)}{\Gamma(k_0/2)} \frac{(k_0 \sigma_0^2 / 2)^{k_0/2}}{(k_n \sigma_n^2 / 2)^{k_n/2}} = \frac{\Gamma\!\left( \frac{k_0 + n}{2} \right)}{\Gamma\!\left( \frac{k_0}{2} \right) (k_0 \pi \sigma_0^2)^{n/2}} \left( \frac{\nu_0}{\nu_0 + n} \right)^{1/2} \left( 1 + \frac{q(x)}{k_0 \sigma_0^2} \right)^{-(k_0 + n)/2},$$
where we have used $k_n = k_0 + n$ and $k_n \sigma_n^2 = k_0 \sigma_0^2 + q(x)$. Since
$$\det(V) = (\sigma_0^2)^n \, \frac{\nu_0 + n}{\nu_0} \qquad \text{and} \qquad (x - \mu_0 \mathbf{1}_n)^\top V^{-1} (x - \mu_0 \mathbf{1}_n) = \frac{q(x)}{\sigma_0^2},$$
this is exactly the density of a multivariate Student's t distribution with mean $\mu_0 \mathbf{1}_n$, scale matrix $V$ and $k_0$ degrees of freedom.
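As a sanity check, one can simulate from the hierarchical prior and compare the empirical quantiles of a single draw with those of the Student's t marginal implied by the formulas above (parameter values are arbitrary; this is only an illustrative sketch):

```python
import numpy as np
from scipy.stats import t

# Simulate from the hierarchical prior and compare a single draw's quantiles
# with the Student's t marginal derived above.  Parameter values are arbitrary.
rng = np.random.default_rng(3)
mu0, nu0, k0, sigma0_2 = 0.0, 2.0, 5.0, 1.5
reps = 200_000

# 1/sigma^2 ~ Gamma(shape k0/2, rate k0*sigma0_2/2); numpy's scale = 1/rate.
precision = rng.gamma(shape=k0 / 2, scale=2.0 / (k0 * sigma0_2), size=reps)
sigma2 = 1.0 / precision
mu = rng.normal(mu0, np.sqrt(sigma2 / nu0))      # mu | sigma^2 ~ N(mu0, sigma^2/nu0)
x1 = rng.normal(mu, np.sqrt(sigma2))             # one observation given mu, sigma^2

# Marginal of a single draw: Student's t, df k0, scale sigma0_2*(1 + 1/nu0).
marginal = t(df=k0, loc=mu0, scale=np.sqrt(sigma0_2 * (1 + 1 / nu0)))
print(np.quantile(x1, [0.1, 0.5, 0.9]))
print(marginal.ppf([0.1, 0.5, 0.9]))             # should roughly agree
```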
The posterior distribution of the mean $\mu$ is
$$p(\mu \mid x) = \frac{1}{\sqrt{k_n}\, B\!\left( \frac{k_n}{2}, \frac{1}{2} \right)} \left( \frac{\nu_n}{\sigma_n^2} \right)^{1/2} \left( 1 + \frac{\nu_n (\mu - \mu_n)^2}{k_n \sigma_n^2} \right)^{-(k_n + 1)/2},$$
where $B(\cdot, \cdot)$ is the Beta function.
Proof. We have already proved that, conditional on $x$ and $\sigma^2$, $\mu$ is normal with mean $\mu_n$ and variance
$$\frac{\sigma^2}{\nu_n}.$$
We have also proved that, conditional on $x$, the precision $1/\sigma^2$ has a Gamma distribution with shape parameter $k_n/2$ and rate parameter $k_n \sigma_n^2 / 2$.
Thus, we can write
$$\mu = \mu_n + \frac{\sigma}{\sqrt{\nu_n}} Z,$$
where $Z$ is standard normal conditional on $x$ and $\sigma^2$, and $1/\sigma^2$ has a Gamma distribution with shape parameter $k_n/2$ and rate parameter $k_n \sigma_n^2 / 2$.
Now, note that, by the properties of the Gamma distribution,
$$W = \frac{k_n \sigma_n^2}{\sigma^2}$$
has a Gamma distribution with shape parameter $k_n/2$ and rate parameter $1/2$, that is, a Chi-square distribution with $k_n$ degrees of freedom. We can write
$$\mu = \mu_n + \frac{\sigma_n}{\sqrt{\nu_n}} \cdot \frac{Z}{\sqrt{W / k_n}}.$$
But
$$T = \frac{Z}{\sqrt{W / k_n}}$$
has a standard Student's t distribution with $k_n$ degrees of freedom (see the lecture on the t distribution). As a consequence, $\mu$ has a Student's t distribution with mean $\mu_n$, scale parameter $\sigma_n^2 / \nu_n$ and $k_n$ degrees of freedom. Thus, its density is
$$p(\mu \mid x) = \frac{1}{\sqrt{k_n}\, B\!\left( \frac{k_n}{2}, \frac{1}{2} \right)} \left( \frac{\nu_n}{\sigma_n^2} \right)^{1/2} \left( 1 + \frac{\nu_n (\mu - \mu_n)^2}{k_n \sigma_n^2} \right)^{-(k_n + 1)/2},$$
where $B(\cdot, \cdot)$ is the Beta function.
In other words, $\mu$ has a t distribution with mean $\mu_n$, scale parameter $\sigma_n^2 / \nu_n$ and $k_n$ degrees of freedom.
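In practice, the marginal posterior of $\mu$ can be used directly to build credible intervals. A minimal sketch with scipy (the data and hyperparameter values are arbitrary; note that scipy's `scale` argument is the square root of the scale parameter $\sigma_n^2 / \nu_n$ used above):

```python
import numpy as np
from scipy.stats import t

# Equal-tailed 95% credible interval for mu from its marginal posterior,
# which is Student's t with mean mu_n, scale sigma_n_2/nu_n and k_n d.o.f.
rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=30)

mu0, nu0, k0, sigma0_2 = 0.0, 1.0, 2.0, 1.0
n, xbar = x.size, x.mean()
ss = np.sum((x - xbar) ** 2)
nu_n = nu0 + n
mu_n = (nu0 * mu0 + n * xbar) / nu_n
k_n = k0 + n
sigma_n_2 = (k0 * sigma0_2 + ss + n * nu0 / nu_n * (xbar - mu0) ** 2) / k_n

post_mu = t(df=k_n, loc=mu_n, scale=np.sqrt(sigma_n_2 / nu_n))
print(post_mu.ppf([0.025, 0.975]))   # 95% credible interval for mu
```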
Please cite as:
Taboga, Marco (2021). "Bayesian estimation of the parameters of the normal distribution", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/normal-distribution-Bayesian-estimation.