Bayesian linear regression

This lecture provides an introduction to Bayesian estimation of the parameters of a linear regression model.

The model

The model is the normal linear regression model
$$y = X\beta + \varepsilon$$
where:

• $y$ is the $N \times 1$ vector of observations of the dependent variable;

• $X$ is the $N \times K$ matrix of regressors, which is assumed to have full rank;

• $\beta$ is the $K \times 1$ vector of regression coefficients;

• $\varepsilon$ is the $N \times 1$ vector of errors, which is assumed to have a multivariate normal distribution conditional on $X$, with mean $0$ and covariance matrix $\sigma^2 I$, where $\sigma^2$ is a positive constant and $I$ is the $N \times N$ identity matrix.

The assumption that the covariance matrix of $\varepsilon$ is equal to $\sigma^2 I$ implies that

1. the entries of $\varepsilon$ are mutually independent (i.e., $\varepsilon_i$ is independent of $\varepsilon_j$ for $i \neq j$);

2. all the entries of $\varepsilon$ have the same variance (i.e., $\operatorname{Var}\left[\varepsilon_i\right] = \sigma^2$ for any $i$).
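
To make the setup concrete, here is a minimal simulation sketch in Python (assuming NumPy is available; the sample size N = 50, the number of regressors K = 3, the error variance and the "true" coefficient values are all arbitrary illustrative choices) that generates data from the model above.

    import numpy as np

    # Dimensions and error variance (arbitrary illustrative choices).
    N, K = 50, 3
    sigma2 = 0.5  # variance of the error terms (treated as known below)

    rng = np.random.default_rng(0)

    # Regressors: a constant plus K - 1 random columns (full rank with probability 1).
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])

    # "True" regression coefficients, used only to simulate the data.
    beta_true = np.array([1.0, -2.0, 0.5])

    # Errors: multivariate normal with mean 0 and covariance sigma2 * I.
    eps = rng.normal(scale=np.sqrt(sigma2), size=N)

    # Vector of observations of the dependent variable.
    y = X @ beta_true + eps

The snippets in the rest of this lecture reuse the variables defined here.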

Unknown regression coefficients and known variance

In this section, we are going to assume that the vector of regression coefficients $\beta$ is unknown, while the variance $\sigma^2$ of the error terms is known.

In the next section, $\sigma^2$ will also be treated as unknown.

The likelihood

Conditional on $\beta$ (and on $X$), $y$ is multivariate normal (being a linear transformation of the normal vector $\varepsilon$). Its likelihood is
$$p\left(y \mid \beta\right) = \left(2\pi\sigma^2\right)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}\left(y - X\beta\right)^{\top}\left(y - X\beta\right)\right).$$
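
As a check, the likelihood can be evaluated numerically; the sketch below (continuing the simulation above and assuming SciPy) computes the log-likelihood at the true coefficient vector.

    from scipy.stats import multivariate_normal

    def log_likelihood(beta, y, X, sigma2):
        # Log of p(y | beta), i.e. of a N(X beta, sigma2 * I) density.
        return multivariate_normal.logpdf(y, mean=X @ beta, cov=sigma2 * np.eye(len(y)))

    print(log_likelihood(beta_true, y, X, sigma2))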

The prior

The prior on $\beta$ is assumed to be multivariate normal:
$$\beta \sim N\left(\beta_0, V_0\right)$$
that is, $\beta$ has a multivariate normal distribution with mean $\beta_0$ and covariance matrix $V_0$, where $V_0$ is a symmetric positive definite matrix.

This prior is used to express the belief that $\beta$ is most likely equal to $\beta_0$. The dispersion of the belief is given by the covariance matrix $V_0$.

The posterior

Given the prior and the likelihood shown above, the posterior is
$$\beta \mid y \sim N\left(\beta_N, V_N\right)$$
where
$$V_N = \left(V_0^{-1} + \frac{1}{\sigma^2}X^{\top}X\right)^{-1}, \qquad \beta_N = V_N\left(V_0^{-1}\beta_0 + \frac{1}{\sigma^2}X^{\top}y\right);$$
that is, the posterior distribution of $\beta$ is a multivariate normal distribution with mean $\beta_N$ and covariance matrix $V_N$.
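
These formulas translate directly into code. The sketch below (continuing the example above; the prior hyperparameters beta0 and V0 are illustrative choices) computes the posterior mean and covariance matrix.

    # Prior hyperparameters (illustrative choices).
    beta0 = np.zeros(K)      # prior mean of beta
    V0 = 10.0 * np.eye(K)    # prior covariance matrix of beta

    # Posterior covariance: V_N = (V0^{-1} + X'X / sigma2)^{-1}
    VN = np.linalg.inv(np.linalg.inv(V0) + X.T @ X / sigma2)

    # Posterior mean: beta_N = V_N (V0^{-1} beta0 + X'y / sigma2)
    betaN = VN @ (np.linalg.inv(V0) @ beta0 + X.T @ y / sigma2)

    print("posterior mean:", betaN)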

Proof

The joint distribution of $y$ and $\beta$ (conditional on $X$) is
$$p\left(y, \beta\right) = p\left(y \mid \beta\right) p\left(\beta\right) = \left(2\pi\sigma^2\right)^{-N/2}\left(2\pi\right)^{-K/2}\left|V_0\right|^{-1/2}\exp\left(-\frac{1}{2} q\left(y, \beta\right)\right)$$
where
$$q\left(y, \beta\right) = \frac{1}{\sigma^2}\left(y - X\beta\right)^{\top}\left(y - X\beta\right) + \left(\beta - \beta_0\right)^{\top}V_0^{-1}\left(\beta - \beta_0\right)$$
is a quadratic form in $y$ and $\beta$. By rearranging the quadratic form (completing the square with respect to $\beta$), we obtain
$$q\left(y, \beta\right) = \left(\beta - \beta_N\right)^{\top}V_N^{-1}\left(\beta - \beta_N\right) + \frac{1}{\sigma^2}y^{\top}y + \beta_0^{\top}V_0^{-1}\beta_0 - \beta_N^{\top}V_N^{-1}\beta_N$$
where $\beta_N$ and $V_N$ are defined as above and the last three terms do not depend on $\beta$. Thus, we have factorized the joint density as
$$p\left(y, \beta\right) = g\left(y\right) f\left(\beta \mid y\right)$$
where $g\left(y\right)$ is a function that depends on $y$ but not on $\beta$, and $f\left(\beta \mid y\right)$ is a probability density function if considered as a function of $\beta$ for any given $y$ (note that $f$ depends on $y$ through $\beta_N$). More specifically, $f\left(\beta \mid y\right)$ is the density of a multivariate normal distribution with mean $\beta_N$ and covariance matrix $V_N$. By a standard result on the factorization of probability density functions (see also the introduction to Bayesian inference), we have that
$$p\left(\beta \mid y\right) = f\left(\beta \mid y\right).$$
Therefore, the posterior distribution is a multivariate normal distribution with mean $\beta_N$ and covariance matrix $V_N$. Moreover, $g\left(y\right)$ is the marginal (prior predictive) density of $y$; since $y = X\beta + \varepsilon$ is a linear combination of the independent normal vectors $\beta$ and $\varepsilon$, this marginal is multivariate normal with mean $X\beta_0$ and covariance matrix $X V_0 X^{\top} + \sigma^2 I$.

Note that the posterior mean can be written as
$$\beta_N = V_N\left(V_0^{-1}\beta_0 + \frac{1}{\sigma^2}X^{\top}X\,\widehat{\beta}\right)$$
where
$$\widehat{\beta} = \left(X^{\top}X\right)^{-1}X^{\top}y$$
is the ordinary least squares (OLS) estimator of the regression coefficients. Thus, the posterior mean of $\beta$ is a weighted average of

1. the OLS estimate $\widehat{\beta}$ derived from the observed data;

2. the prior mean $\beta_0$.

Remember that the covariance matrix of the OLS estimator in the normal linear regression model is
$$\operatorname{Var}\left[\widehat{\beta}\right] = \sigma^2\left(X^{\top}X\right)^{-1}$$
while the covariance matrix of the prior is
$$\operatorname{Var}\left[\beta\right] = V_0.$$

Therefore, we can write
$$\beta_N = V_N\left(\operatorname{Var}\left[\beta\right]^{-1}\beta_0 + \operatorname{Var}\left[\widehat{\beta}\right]^{-1}\widehat{\beta}\right), \qquad V_N = \left(\operatorname{Var}\left[\beta\right]^{-1} + \operatorname{Var}\left[\widehat{\beta}\right]^{-1}\right)^{-1}.$$

Both the prior mean $\beta_0$ and the OLS estimator $\widehat{\beta}$ derived from the data convey some information about $\beta$. The two sources of information are combined linearly in order to produce the posterior mean $\beta_N$, but more weight is given to the signal that has higher precision (smaller covariance matrix).

Note also that the precision of the OLS estimator, $\operatorname{Var}\left[\widehat{\beta}\right]^{-1} = \frac{1}{\sigma^2}X^{\top}X$, typically tends to infinity (in the sense that its eigenvalues diverge) when the sample size $N$ goes to infinity. As a consequence, the weight given to $\widehat{\beta}$ increases with $N$. In other words, the larger the sample size becomes, the more weight is given to the OLS estimate, that is, to the information coming from the observed sample. In the limit, all weight is given to the latter and no weight is given to the prior. Roughly speaking, Bayesian regression and frequentist (OLS) regression provide almost the same results when the sample size is large.
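
The weighted-average interpretation is easy to check numerically. The sketch below (continuing the example) recomputes the posterior mean from the OLS estimate and the two covariance matrices and verifies that it coincides with the direct formula.

    # OLS estimate and its covariance matrix.
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    cov_ols = sigma2 * np.linalg.inv(X.T @ X)

    # Precision-weighted combination of the prior mean and the OLS estimate.
    betaN_weighted = VN @ (np.linalg.inv(V0) @ beta0 + np.linalg.inv(cov_ols) @ beta_ols)

    print(np.allclose(betaN, betaN_weighted))  # True: same posterior mean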

The prior predictive distribution

The prior predictive distribution is
$$y \sim N\left(X\beta_0,\; X V_0 X^{\top} + \sigma^2 I\right)$$
where $I$ is the $N \times N$ identity matrix.

Thus, the prior predictive distribution of the vector $y$ of observations of the dependent variable is multivariate normal with mean $X\beta_0$ and covariance matrix $X V_0 X^{\top} + \sigma^2 I$.
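
In code (continuing the example), the prior predictive moments are obtained directly from the prior hyperparameters, and a hypothetical sample of the dependent variable can be drawn before seeing any data.

    # Prior predictive mean and covariance of y.
    prior_pred_mean = X @ beta0
    prior_pred_cov = X @ V0 @ X.T + sigma2 * np.eye(N)

    # One draw of y from the prior predictive distribution.
    y_prior_sim = rng.multivariate_normal(prior_pred_mean, prior_pred_cov)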

Proof

See the previous proof.

The posterior predictive distribution

Suppose that, after observing the sample $y$ and updating the prior, a new sample $\tilde{y} = \tilde{X}\beta + \tilde{\varepsilon}$ of size $M$ is generated by the same regression. Suppose further that we observe only the regressors $\tilde{X}$ and we want to predict $\tilde{y}$. This is done through the posterior predictive distribution
$$\tilde{y} \mid y \sim N\left(\tilde{X}\beta_N,\; \tilde{X} V_N \tilde{X}^{\top} + \sigma^2 I_M\right)$$
where $I_M$ is the $M \times M$ identity matrix.

So, $\tilde{y}$ has a multivariate normal distribution with mean $\tilde{X}\beta_N$ (where $\beta_N$ is the posterior mean of $\beta$) and covariance matrix $\tilde{X} V_N \tilde{X}^{\top} + \sigma^2 I_M$ (where $V_N$ is the posterior covariance matrix of $\beta$).
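
A sketch of the corresponding computation (continuing the example; the new regressor matrix X_new and the number of new observations M are illustrative choices):

    # New regressors for M out-of-sample observations (illustrative).
    M = 5
    X_new = np.column_stack([np.ones(M), rng.normal(size=(M, K - 1))])

    # Posterior predictive mean and covariance of the new observations.
    post_pred_mean = X_new @ betaN
    post_pred_cov = X_new @ VN @ X_new.T + sigma2 * np.eye(M)

    # One draw from the posterior predictive distribution.
    y_new_sim = rng.multivariate_normal(post_pred_mean, post_pred_cov)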

Proof

The derivation is almost identical to that of the prior predictive distribution of $y$ (see above). The posterior of $\beta$ is used as a new prior. The likelihood of $\tilde{y}$ is the same as that of $y$ because $\tilde{y}$ is independent of $y$ conditional on $\beta$ and $\tilde{X}$. So, we can perform the factorization
$$p\left(\tilde{y}, \beta \mid y\right) = p\left(\tilde{y} \mid y\right) f\left(\beta \mid \tilde{y}, y\right)$$
and derive $p\left(\tilde{y} \mid y\right)$ by using the same procedure used to find $p\left(y\right)$. The main difference is that we need to replace the prior mean $\beta_0$ with the posterior mean $\beta_N$ and the prior covariance matrix $V_0$ with the posterior covariance matrix $V_N$.

Unknown regression coefficients and unknown variance

Everything is as in the previous section, except for the fact that not only the vector of regression coefficients $\beta$, but also the variance $\sigma^2$ of the error terms is assumed to be unknown.

The likelihood

The likelihood of $y$ is
$$p\left(y \mid \beta, \sigma^2\right) = \left(2\pi\sigma^2\right)^{-N/2}\exp\left(-\frac{1}{2\sigma^2}\left(y - X\beta\right)^{\top}\left(y - X\beta\right)\right).$$
Note that we now highlight the fact that we are conditioning on both of the unknown parameters $\beta$ and $\sigma^2$.

The prior

The prior is hierarchical.

As in the previous section, we assign a multivariate normal prior to the regression coefficients, conditional on $\sigma^2$:
$$\beta \mid \sigma^2 \sim N\left(\beta_0,\; \sigma^2 V_0\right)$$
where the covariance matrix of $\beta$ is assumed to be proportional to $\sigma^2$.

Then, we assign the following prior to the variance:
$$\sigma^2 \sim \text{Inverse-Gamma}\left(a_0, b_0\right)$$
that is, $\sigma^2$ has an inverse-Gamma distribution with shape parameter $a_0$ and scale parameter $b_0$ (i.e., the precision $1/\sigma^2$ has a Gamma distribution with shape $a_0$ and rate $b_0$).

By the properties of the Gamma distribution, the prior mean of the precision is
$$\mathrm{E}\left[\frac{1}{\sigma^2}\right] = \frac{a_0}{b_0}$$
and its variance is
$$\operatorname{Var}\left[\frac{1}{\sigma^2}\right] = \frac{a_0}{b_0^2}.$$

We can think of the ratio $a_0 / b_0$ as our best guess of the precision of the regression (i.e., of its error terms). We use the parameter $a_0$ to express our degree of confidence in that guess: the larger $a_0$ is (holding the ratio $a_0 / b_0$ fixed), the tighter the prior about $1/\sigma^2$ is, and the more we consider it likely that $1/\sigma^2$ is close to $a_0 / b_0$.
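
The sketch below (continuing the example and assuming SciPy; the hyperparameter values a0 and b0 are illustrative) draws one pair (sigma^2, beta) from this hierarchical prior, using the shape/rate convention adopted above.

    from scipy.stats import invgamma

    # Hyperparameters of the inverse-Gamma prior on sigma^2 (illustrative):
    # prior mean of the precision is a0 / b0 = 2.
    a0, b0 = 3.0, 1.5

    # Draw the variance first, then the coefficients given the variance.
    sigma2_draw = invgamma.rvs(a=a0, scale=b0, random_state=rng)
    beta_draw = rng.multivariate_normal(beta0, sigma2_draw * V0)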

The posterior distribution of the regression coefficients conditional on the variance

Conditional on $\sigma^2$, the posterior distribution of $\beta$ is
$$\beta \mid \sigma^2, y \sim N\left(\beta_N,\; \sigma^2 V_N\right)$$
where
$$V_N = \left(V_0^{-1} + X^{\top}X\right)^{-1}, \qquad \beta_N = V_N\left(V_0^{-1}\beta_0 + X^{\top}y\right)$$
(note that, because the prior covariance matrix is now $\sigma^2 V_0$, the factor $1/\sigma^2$ cancels from these expressions).

Proof

This has been proved in the previous section (known variance). As a matter of fact, conditional on $\sigma^2$, we can treat $\sigma^2$ as if it were known.

The prior predictive distribution conditional on the variance

Conditional on $\sigma^2$, the prior predictive distribution of $y$ is
$$y \mid \sigma^2 \sim N\left(X\beta_0,\; \sigma^2\left(X V_0 X^{\top} + I\right)\right)$$
where $I$ is the $N \times N$ identity matrix.

Proof

See the previous section.

The posterior distribution of the variance

The posterior distribution of the variance is
$$\sigma^2 \mid y \sim \text{Inverse-Gamma}\left(a_N, b_N\right)$$
where
$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\left(y - X\beta_0\right)^{\top}\left(X V_0 X^{\top} + I\right)^{-1}\left(y - X\beta_0\right).$$

In other words, the precision $1/\sigma^2$ has a Gamma distribution with shape $a_N$ and rate $b_N$.
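
In code (continuing the example), the posterior parameters of the variance are obtained in a few lines.

    # Posterior hyperparameters of the inverse-Gamma distribution of sigma^2.
    S = X @ V0 @ X.T + np.eye(N)        # scale matrix of y given sigma^2
    resid0 = y - X @ beta0              # deviations from the prior predictive mean
    aN = a0 + N / 2
    bN = b0 + 0.5 * resid0 @ np.linalg.solve(S, resid0)

    # Posterior mean of the precision 1 / sigma^2.
    print("E[1 / sigma^2 | y] =", aN / bN)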

Proof

Consider the joint distribution
$$p\left(y, \sigma^2\right) = p\left(y \mid \sigma^2\right) p\left(\sigma^2\right)$$
where $p\left(y \mid \sigma^2\right)$ is the prior predictive distribution conditional on the variance derived above and $p\left(\sigma^2\right)$ is the inverse-Gamma prior. We can write
$$p\left(y, \sigma^2\right) = g\left(y\right) f\left(\sigma^2 \mid y\right)$$
where $g\left(y\right)$ is a function that depends on $y$ (via $b_N$) but not on $\sigma^2$, and $f\left(\sigma^2 \mid y\right)$ is a probability density function if considered as a function of $\sigma^2$ for any given $y$ (note that $f$ depends on $y$ through $b_N$). In particular, $f\left(\sigma^2 \mid y\right)$ is the density of an inverse-Gamma distribution with parameters $a_N$ and $b_N$. Thus, by a well-known result on the factorization of joint probability density functions, we have that
$$p\left(\sigma^2 \mid y\right) = f\left(\sigma^2 \mid y\right).$$
Therefore, the posterior distribution of $\sigma^2$ is inverse-Gamma with parameters $a_N$ and $b_N$. What distribution $g\left(y\right)$ is will be shown in the next proof.

The prior predictive distribution

The prior predictive distribution of the dependent variable is
$$y \sim t_{2 a_0}\left(X\beta_0,\; \frac{b_0}{a_0}\left(X V_0 X^{\top} + I\right)\right)$$
that is, a multivariate Student's t distribution with mean $X\beta_0$, scale matrix $\frac{b_0}{a_0}\left(X V_0 X^{\top} + I\right)$ and $2 a_0$ degrees of freedom.

Proof

The prior predictive distribution has already been derived in the previous proof: it is the function $g\left(y\right)$. We just need to perform some algebraic manipulations in order to clearly show that it is a multivariate Student's t distribution with mean $X\beta_0$, scale matrix $\frac{b_0}{a_0}\left(X V_0 X^{\top} + I\right)$ and $2 a_0$ degrees of freedom: integrating $\sigma^2$ out of the joint density (equivalently, collecting the terms of $g\left(y\right)$) yields the multivariate t density with these parameters.

The posterior distribution of the regression coefficients

The posterior distribution of the regression coefficients is
$$\beta \mid y \sim t_{2 a_N}\left(\beta_N,\; \frac{b_N}{a_N} V_N\right)$$
that is, $\beta$ has a multivariate t distribution with mean $\beta_N$, scale matrix $\frac{b_N}{a_N} V_N$ and $2 a_N$ degrees of freedom.

Proof

As proved above, we have that: 1) conditional on $\sigma^2$ and the data $y$, $\beta$ is multivariate normal with mean $\beta_N$ and covariance matrix $\sigma^2 V_N$; 2) conditional on the data $y$, the precision $1/\sigma^2$ has a Gamma distribution with shape $a_N$ and rate $b_N$. Thus, we can write
$$\beta = \beta_N + \sigma L z$$
where $z$ is a standard multivariate normal vector, independent of $\sigma^2$, and $L$ is a matrix such that $L L^{\top} = V_N$. Now define a new variable
$$w = \frac{2 b_N}{\sigma^2}$$
which, by the properties of the Gamma distribution, has a Gamma distribution with shape $a_N$ and scale $2$, that is, a Chi-square distribution with $2 a_N$ degrees of freedom. We can now write
$$\beta = \beta_N + \sqrt{\frac{b_N}{a_N}}\, L\, \frac{z}{\sqrt{w / \left(2 a_N\right)}}.$$
But
$$\frac{z}{\sqrt{w / \left(2 a_N\right)}}$$
has a standard multivariate Student's t distribution with $2 a_N$ degrees of freedom (see the lecture on the multivariate t distribution). As a consequence, $\beta$ has a multivariate Student's t distribution with mean $\beta_N$, scale matrix $\frac{b_N}{a_N} V_N$ and $2 a_N$ degrees of freedom. Thus, its density is
$$p\left(\beta \mid y\right) = \frac{\Gamma\left(\frac{2 a_N + K}{2}\right)}{\Gamma\left(a_N\right)\left(2 a_N \pi\right)^{K/2}\left|\frac{b_N}{a_N} V_N\right|^{1/2}}\left[1 + \frac{1}{2 a_N}\left(\beta - \beta_N\right)^{\top}\left(\frac{b_N}{a_N} V_N\right)^{-1}\left(\beta - \beta_N\right)\right]^{-\frac{2 a_N + K}{2}}.$$
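
The representation used in the proof also suggests a simple way to sample from the posterior of $\beta$ by composition: draw $\sigma^2$ from its inverse-Gamma posterior, then $\beta$ from the conditional normal. The sketch below (continuing the example; recall that in this section $V_N$ and $\beta_N$ do not contain the factor $1/\sigma^2$) checks that the Monte Carlo mean matches $\beta_N$.

    # Conditional posterior of beta given sigma^2 (unknown-variance case).
    VN_u = np.linalg.inv(np.linalg.inv(V0) + X.T @ X)
    betaN_u = VN_u @ (np.linalg.inv(V0) @ beta0 + X.T @ y)

    # Sample the marginal posterior of beta by composition.
    draws = []
    for _ in range(5000):
        s2 = invgamma.rvs(a=aN, scale=bN, random_state=rng)
        draws.append(rng.multivariate_normal(betaN_u, s2 * VN_u))
    draws = np.asarray(draws)

    # The sample mean approximates beta_N, the mean of the multivariate t posterior.
    print("posterior mean (Monte Carlo):", draws.mean(axis=0))
    print("posterior mean (exact):      ", betaN_u)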

