
Bayesian linear regression

This lecture provides an introduction to Bayesian estimation of the parameters of a linear regression model.


The model

The model is the normal linear regression model

$$y = X\beta + \varepsilon$$

where:

  1. $y$ is the $N\times 1$ vector of observations of the dependent variable;

  2. $X$ is the $N\times K$ matrix of regressors;

  3. $\beta$ is the $K\times 1$ vector of regression coefficients;

  4. $\varepsilon$ is the $N\times 1$ vector of errors, which, conditional on $X$, has a multivariate normal distribution with mean $0$ and covariance matrix $\sigma^2 I$.

The assumption that the covariance matrix of $\varepsilon$ is equal to $\sigma^2 I$ implies that

  1. the entries of $\varepsilon$ are mutually independent (i.e., $\varepsilon_i$ is independent of $\varepsilon_j$ for $i \neq j$);

  2. all the entries of $\varepsilon$ have the same variance (i.e., $\mathrm{Var}\left[\varepsilon_i\right] = \sigma^2$ for any $i$).

Unknown regression coefficients and known variance

In this section, we assume that the vector of regression coefficients $\beta$ is unknown, while the variance of the error terms $\sigma^2$ is known.

In the next section, $\sigma^2$ will also be treated as unknown.

The likelihood

Conditional on $X$, $y$ is multivariate normal (being a linear transformation of the normal vector $\varepsilon$). Its likelihood is

$$p\left(y \mid X, \beta\right) = \left(2\pi\sigma^2\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\left(y - X\beta\right)^{\top}\left(y - X\beta\right)\right)$$

The prior

The prior on $\beta$ is assumed to be multivariate normal:

$$\beta \sim N\left(\beta_0, \sigma^2 V_0\right)$$

that is, $\beta$ has a multivariate normal distribution with mean $\beta_0$ and covariance matrix $\sigma^2 V_0$, where $V_0$ is a $K\times K$ symmetric positive definite matrix.

This prior is used to express the belief that $\beta$ is most likely equal to $\beta_0$. The dispersion of the belief is given by the covariance matrix $\sigma^2 V_0$.

The posterior

Given the prior and the likelihood shown above, the posterior is

$$\beta \mid y, X \sim N\left(\beta_N, \sigma^2 V_N\right)$$

where

$$V_N = \left(V_0^{-1} + X^{\top}X\right)^{-1}, \qquad \beta_N = V_N\left(V_0^{-1}\beta_0 + X^{\top}y\right)$$

that is, the posterior distribution of $\beta$ is a multivariate normal distribution with mean $\beta_N$ and covariance matrix $\sigma^2 V_N$.
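As a concrete illustration, the posterior can be computed with a few lines of linear algebra. The sketch below uses NumPy, simulated data, and hypothetical prior values ($\beta_0 = 0$, $V_0 = I$); it evaluates $V_N = (V_0^{-1} + X^{\top}X)^{-1}$ and $\beta_N = V_N(V_0^{-1}\beta_0 + X^{\top}y)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, K regressors (hypothetical values)
N, K = 50, 3
sigma2 = 0.5                                  # known error variance
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Prior: beta ~ N(beta_0, sigma2 * V_0)
beta_0 = np.zeros(K)
V_0 = np.eye(K)

# Posterior: beta | y, X ~ N(beta_N, sigma2 * V_N)
V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)

print(beta_N)  # posterior mean; close to beta_true with N = 50 observations
```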

Proof

The joint distribution of $\beta$ and $y$ is

[eq8]

We can write it as [eq9] where [eq10] and [eq11], and where $I$ is the $N\times N$ identity matrix.

Now define a new matrix $A$ (called a rotation matrix) as follows: [eq12] We have that [eq13] Furthermore, [eq14] and [eq15] Moreover, the determinant of $A$ is [eq16] where we have used the formula for the determinant of a block matrix [eq17]

We can now use these results about $A$ to rewrite the joint density: [eq18] We have shown above that [eq19] is block-diagonal. Therefore, by using the expressions derived above for the blocks of $A$ and [eq19], we obtain [eq21]

Thus, we have factorized the joint density as [eq22] where [eq23] is a function that depends on $y$ but not on $\beta$, and [eq24] is a probability density function if considered as a function of $\beta$ for any given $y$ (note that $g$ depends on $y$ through $\beta_N$). More specifically, [eq25] is the density of a multivariate normal distribution with mean $\beta_N$ and covariance matrix $\sigma^2 V_N$.

By a standard result on the factorization of probability density functions (see also the introduction to Bayesian inference), we have that [eq26] Therefore, the posterior distribution [eq27] is a multivariate normal distribution with mean $\beta_N$ and covariance matrix $\sigma^2 V_N$.

Note that the posterior mean can be written as

$$\beta_N = V_N\left(X^{\top}X \beta_{OLS} + V_0^{-1}\beta_0\right)$$

where

$$\beta_{OLS} = \left(X^{\top}X\right)^{-1}X^{\top}y$$

is the ordinary least squares (OLS) estimator of the regression coefficients. Thus, the posterior mean of $\beta$ is a weighted average of

  1. the OLS estimate derived from the observed data;

  2. the prior mean $\beta_0$.

Remember that the covariance matrix of the OLS estimator in the normal linear regression model is

$$\mathrm{Var}\left[\beta_{OLS}\right] = \sigma^2\left(X^{\top}X\right)^{-1}$$

while the covariance matrix of the prior is

$$\mathrm{Var}\left[\beta\right] = \sigma^2 V_0$$

Therefore, we can write

$$\beta_N = \left(\left(\sigma^2 V_0\right)^{-1} + \left(\sigma^2\left(X^{\top}X\right)^{-1}\right)^{-1}\right)^{-1}\left(\left(\sigma^2 V_0\right)^{-1}\beta_0 + \left(\sigma^2\left(X^{\top}X\right)^{-1}\right)^{-1}\beta_{OLS}\right)$$

Both the prior mean and the OLS estimator derived from the data convey some information about $\beta$. The two sources of information are combined linearly to produce the posterior mean $\beta_N$, with more weight given to the signal that has the higher precision (the smaller covariance matrix).

Note also that $X^{\top}X$ tends to infinity as the sample size $N$ goes to infinity. As a consequence, the weight given to $\beta_{OLS}$ increases with $N$. In other words, the larger the sample size becomes, the more weight is given to the OLS estimate, that is, to the information coming from the observed sample. In the limit, all the weight is given to the latter and none to the prior. Roughly speaking, Bayesian regression and frequentist (OLS) regression provide almost the same results when the sample size is large.
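The growing weight of the data can be checked numerically. This sketch (simulated data, hypothetical prior deliberately centered away from the truth) recomputes the posterior mean for increasing $N$; it moves from the prior mean $\beta_0$ toward the true coefficients recovered by OLS:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2
beta_true = np.array([2.0, -1.0])
beta_0, V_0 = np.zeros(K), np.eye(K)  # prior centered at zero, away from beta_true

def posterior_mean(N):
    """Posterior mean beta_N for a freshly simulated sample of size N."""
    X = rng.normal(size=(N, K))
    y = X @ beta_true + rng.normal(size=N)
    V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
    return V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)

# As N grows, the posterior mean approaches beta_true (the OLS limit)
for N in (5, 50, 5000):
    print(N, posterior_mean(N))
```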

The prior predictive distribution

The prior predictive distribution is

$$y \mid X \sim N\left(X\beta_0, \; \sigma^2\left(I + X V_0 X^{\top}\right)\right)$$

where $I$ is the $N\times N$ identity matrix.

Thus, the prior predictive distribution of the vector of observations of the dependent variable $y$ is multivariate normal with mean $X\beta_0$ and covariance matrix $\sigma^2\left(I + X V_0 X^{\top}\right)$.
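The covariance formula can be sanity-checked by Monte Carlo: draw $\beta$ from the prior, draw $y$ given $\beta$, and compare the sample moments with $X\beta_0$ and $\sigma^2(I + X V_0 X^{\top})$. A sketch with simulated regressors and hypothetical prior values:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 4, 2
sigma2 = 1.0
X = rng.normal(size=(N, K))
beta_0, V_0 = np.zeros(K), np.eye(K)

# Prior predictive: y | X ~ N(X beta_0, sigma2 * (I + X V_0 X'))
mean_y = X @ beta_0
cov_y = sigma2 * (np.eye(N) + X @ V_0 @ X.T)

# Monte Carlo: draw beta from the prior, then y = X beta + epsilon
betas = rng.multivariate_normal(beta_0, sigma2 * V_0, size=100_000)
draws = betas @ X.T + rng.normal(scale=np.sqrt(sigma2), size=(100_000, N))

print(np.round(np.cov(draws.T), 2))  # approximately equals cov_y
```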

Proof

See the previous proof.

The posterior predictive distribution

Suppose that, after observing the sample $\left(y, X\right)$ and updating the prior, a new sample $\left(\widetilde{y}, \widetilde{X}\right)$ of size $M$ is generated by the same regression. Suppose further that we observe only the regressors $\widetilde{X}$ and we want to predict $\widetilde{y}$. This is done through the posterior predictive distribution

$$\widetilde{y} \mid y, X, \widetilde{X} \sim N\left(\widetilde{X}\beta_N, \; \sigma^2\left(I + \widetilde{X} V_N \widetilde{X}^{\top}\right)\right)$$

where $I$ is the $M\times M$ identity matrix.

So, $\widetilde{y}$ has a multivariate normal distribution with mean $\widetilde{X}\beta_N$ (where $\beta_N$ is the posterior mean of $\beta$) and covariance matrix $\sigma^2\left(I + \widetilde{X} V_N \widetilde{X}^{\top}\right)$ (where $V_N$ is the posterior covariance matrix of $\beta$).
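In code, prediction only requires plugging the posterior moments into this formula. A sketch (simulated data, hypothetical prior values) that computes the posterior predictive mean and covariance for $M$ new observations:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 50, 3, 2
sigma2 = 0.5
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -1.0])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)
beta_0, V_0 = np.zeros(K), np.eye(K)

# Posterior of beta given the observed sample (y, X)
V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)

# Posterior predictive for M new observations with regressors X_tilde
X_tilde = rng.normal(size=(M, K))
pred_mean = X_tilde @ beta_N
pred_cov = sigma2 * (np.eye(M) + X_tilde @ V_N @ X_tilde.T)

print(pred_mean)           # predictive mean X_tilde beta_N
print(np.diag(pred_cov))   # predictive variances, each larger than sigma2
```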

Proof

The derivation is almost identical to that of the prior predictive distribution of $y$ (see above). The posterior [eq39] is used as a new prior. The likelihood [eq40] is the same as [eq41] because $\widetilde{y}$ is independent of $y$ and $X$ conditional on $\beta$. So, we can perform the factorization [eq42] and derive [eq43] by using the same procedure used to find [eq44] The main difference is that we need to replace the prior mean $\beta_0$ with the posterior mean $\beta_N$ and the prior covariance matrix $\sigma^2 V_0$ with the posterior covariance matrix $\sigma^2 V_N$.

Unknown regression coefficients and unknown variance

Everything is as in the previous section, except that not only the vector of regression coefficients $\beta$, but also the variance of the error terms $\sigma^2$, is assumed to be unknown.

The likelihood

The likelihood of $y$ is

$$p\left(y \mid X, \beta, \sigma^2\right) = \left(2\pi\sigma^2\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\left(y - X\beta\right)^{\top}\left(y - X\beta\right)\right)$$

Note that we now highlight the fact that we are conditioning on both of the unknown parameters $\beta$ and $\sigma^2$.

The prior

The prior is hierarchical.

As in the previous section, we assign a multivariate normal prior to the regression coefficients, conditional on $\sigma^2$:

$$\beta \mid \sigma^2 \sim N\left(\beta_0, \sigma^2 V_0\right)$$

where the covariance matrix of $\beta$ is assumed to be proportional to $\sigma^2$.

Then, we assign the following prior to the variance:

$$\frac{1}{\sigma^2} \sim \text{Gamma}\left(L, \frac{1}{\sigma_0^2}\right)$$

that is, $\sigma^2$ has an inverse-Gamma distribution with parameters $L$ and $\sigma_0^2$ (i.e., the precision $1/\sigma^2$ has a Gamma distribution with parameters $L$ and $1/\sigma_0^2$).

By the properties of the Gamma distribution, the prior mean of the precision is

$$\mathrm{E}\left[\frac{1}{\sigma^2}\right] = \frac{1}{\sigma_0^2}$$

and its variance is

$$\mathrm{Var}\left[\frac{1}{\sigma^2}\right] = \frac{2}{L\sigma_0^4}$$

We can think of $1/\sigma_0^2$ as our best guess of the precision of the regression (i.e., of its error terms). We use the parameter $L$ to express our degree of confidence in this guess. The larger $L$ is, the tighter the prior on $1/\sigma^2$, and the more likely we consider it that $1/\sigma^2$ is close to $1/\sigma_0^2$.
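The effect of $L$ can be illustrated by sampling. The sketch below uses the same parametrization as the text (a Gamma distribution with parameters $n$ and $h$ has mean $h$ and variance $2h^2/n$), which corresponds to a standard Gamma with shape $n/2$ and scale $2h/n$; the values of $\sigma_0^2$ and $L$ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_0 = 2.0        # prior guess for the error variance
h = 1.0 / sigma2_0    # prior mean of the precision 1/sigma^2

# Precision ~ Gamma(L, h): mean h, variance 2*h^2/L.
# Sampled as a standard Gamma with shape L/2 and scale 2*h/L.
for L in (2, 20, 200):
    prec = rng.gamma(shape=L / 2, scale=2 * h / L, size=100_000)
    # Mean stays near h while the variance shrinks as L grows
    print(L, round(prec.mean(), 3), round(prec.var(), 4))
```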

The posterior distribution of the regression coefficients conditional on the variance

Conditional on $\sigma^2$, the posterior distribution of $\beta$ is

$$\beta \mid \sigma^2, y, X \sim N\left(\beta_N, \sigma^2 V_N\right)$$

where

$$V_N = \left(V_0^{-1} + X^{\top}X\right)^{-1}, \qquad \beta_N = V_N\left(V_0^{-1}\beta_0 + X^{\top}y\right)$$

Proof

This has been proved in the previous section (known variance). As a matter of fact, by conditioning on $\sigma^2$, we can treat it as if it were known.

The prior predictive distribution conditional on the variance

Conditional on $\sigma^2$, the prior predictive distribution of $y$ is

$$y \mid X, \sigma^2 \sim N\left(X\beta_0, \; \sigma^2\left(I + X V_0 X^{\top}\right)\right)$$

where $I$ is the $N\times N$ identity matrix.

Proof

See previous section.

The posterior distribution of the variance

The posterior distribution of the variance is

$$\sigma^2 \mid y, X \sim \text{Inverse-Gamma}\left(N + L, \sigma_N^2\right)$$

where

$$\sigma_N^2 = \frac{L\sigma_0^2 + \left(y - X\beta_0\right)^{\top}\left(I + X V_0 X^{\top}\right)^{-1}\left(y - X\beta_0\right)}{N + L}$$

In other words, $1/\sigma^2$ has a Gamma distribution with parameters $N+L$ and $1/\sigma_N^2$.

Proof

Consider the joint distribution [eq56] where we have defined

$$\sigma_N^2 = \frac{L\sigma_0^2 + \left(y - X\beta_0\right)^{\top}\left(I + X V_0 X^{\top}\right)^{-1}\left(y - X\beta_0\right)}{N + L}$$

We can write [eq58] where [eq59] is a function that depends on $y$ (via $\sigma_N^2$) but not on $\sigma^2$, and [eq60] is a probability density function if considered as a function of $\sigma^2$ for any given $y$ (note that $g$ depends on $y$ through $\sigma_N^2$). In particular, [eq61] is the density of an inverse-Gamma distribution with parameters $N+L$ and $\sigma_N^2$. Thus, by a well-known result on the factorization of joint probability density functions, we have that [eq62] Therefore, the posterior distribution [eq63] is inverse-Gamma with parameters $N+L$ and $\sigma_N^2$. What distribution [eq64] is will be shown in the next proof.
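A sketch of the computation of $\sigma_N^2$ (simulated data, hypothetical prior values): the prior guess $\sigma_0^2$, weighted by $L$, is combined with the quadratic form of the prior-predictive residual $y - X\beta_0$:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 40, 2
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)  # true error variance is 1

beta_0, V_0 = np.zeros(K), np.eye(K)
L, sigma2_0 = 4, 2.0    # prior: 1/sigma^2 ~ Gamma(L, 1/sigma2_0)

# sigma_N^2 = (L*sigma2_0 + r' S^{-1} r) / (N + L), with S = I + X V_0 X'
S = np.eye(N) + X @ V_0 @ X.T
r = y - X @ beta_0
q = r @ np.linalg.solve(S, r)
sigma2_N = (L * sigma2_0 + q) / (N + L)

# Posterior: 1/sigma^2 | y, X ~ Gamma(N + L, 1/sigma2_N)
print(round(sigma2_N, 3))  # posterior estimate of the variance, pulled toward the data
```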

The prior predictive distribution

The prior predictive distribution of the dependent variable $y$ is

$$y \mid X \sim t\left(X\beta_0, \; \sigma_0^2\left(I + X V_0 X^{\top}\right), \; L\right)$$

that is, a multivariate Student's t distribution with mean $X\beta_0$, scale matrix $\sigma_0^2\left(I + X V_0 X^{\top}\right)$ and $L$ degrees of freedom.

Proof

The prior predictive distribution has already been derived in the previous proof. We just need to perform some algebraic manipulations in order to show clearly that it is a multivariate Student's t distribution with mean $X\beta_0$, scale matrix $\sigma_0^2\left(I + X V_0 X^{\top}\right)$ and $L$ degrees of freedom: [eq68]

The posterior distribution of the regression coefficients

The posterior distribution of the regression coefficients is

$$\beta \mid y, X \sim t\left(\beta_N, \; \sigma_N^2 V_N, \; N + L\right)$$

that is, $\beta$ has a multivariate t distribution with mean $\beta_N$, scale matrix $\sigma_N^2 V_N$ and $N+L$ degrees of freedom.

Proof

As proved above, we have that: 1) conditional on $\sigma^2$ and the data $\left(y, X\right)$, $\beta$ is multivariate normal with mean $\beta_N$ and covariance matrix $\sigma^2 V_N$; 2) conditional on the data $\left(y, X\right)$, $1/\sigma^2$ has a Gamma distribution with parameters $N+L$ and $1/\sigma_N^2$. Thus, we can write [eq71] where $Z$ is a standard multivariate normal vector and [eq72] Now define a new variable [eq73] which, by the properties of the Gamma distribution, has a Gamma distribution with parameters $N+L$ and $1$. We can now write [eq74] But [eq75] has a standard multivariate Student's t distribution with $N+L$ degrees of freedom (see the lecture on the multivariate t distribution). As a consequence, $\beta$ has a multivariate Student's t distribution with mean $\beta_N$, scale matrix $\sigma_N^2 V_N$ and $N+L$ degrees of freedom. Thus, its density is [eq69]
