
Bayesian linear regression

by Marco Taboga, PhD

This lecture provides an introduction to Bayesian estimation of the parameters of a linear regression model.


The model

The model is the normal linear regression model: $$y = X\beta + \varepsilon$$

where:

  * $y$ is the $N \times 1$ vector of observations of the dependent variable;

  * $X$ is the $N \times K$ matrix of regressors;

  * $\beta$ is the $K \times 1$ vector of regression coefficients;

  * $\varepsilon$ is the $N \times 1$ vector of error terms, which, conditional on $X$, has a multivariate normal distribution with mean $0$ and covariance matrix $\sigma^{2}I$.

The assumption that the covariance matrix of $\varepsilon$ is equal to $\sigma^{2}I$ implies that

  1. the entries of $\varepsilon$ are mutually independent (i.e., $\varepsilon_{i}$ is independent of $\varepsilon_{j}$ for $i \neq j$);

  2. all the entries of $\varepsilon$ have the same variance (i.e., $\mathrm{Var}[\varepsilon_{i}] = \sigma^{2}$ for any $i$).
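
To fix ideas, here is a minimal Python sketch that simulates one sample from this model. The dimensions, coefficient values, and error variance are arbitrary assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 50, 3                              # sample size and number of regressors (assumed)
X = rng.normal(size=(N, K))               # N x K matrix of regressors
beta = np.array([1.0, -2.0, 0.5])         # K x 1 vector of regression coefficients (assumed)
sigma2 = 0.25                             # common error variance (assumed known here)

# Errors are mutually independent with identical variance: covariance matrix sigma2 * I
epsilon = rng.normal(scale=np.sqrt(sigma2), size=N)
y = X @ beta + epsilon                    # vector of observations of the dependent variable
```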

Unknown regression coefficients and known variance

In this section, we are going to assume that the vector of regression coefficients $\beta$ is unknown, while the variance of the error terms $\sigma^{2}$ is known.

In the next section, $\sigma^{2}$ will also be treated as unknown.

The likelihood

Conditional on $X$, $y$ is multivariate normal (being a linear transformation of the normal vector $\varepsilon$). Its likelihood is $$p(y \mid X, \beta) = \left(2\pi\sigma^{2}\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^{2}}\left(y - X\beta\right)^{\top}\left(y - X\beta\right)\right)$$

The prior

The prior on $\beta$ is assumed to be multivariate normal: $$\beta \sim N\left(\beta_{0}, \sigma^{2}V_{0}\right)$$ that is, $\beta$ has a multivariate normal distribution with mean $\beta_{0}$ and covariance matrix $\sigma^{2}V_{0}$, where $V_{0}$ is a $K \times K$ symmetric positive definite matrix.

This prior is used to express the belief that $\beta$ is most likely equal to $\beta_{0}$. The dispersion of the belief is given by the covariance matrix $\sigma^{2}V_{0}$.

The posterior

Given the prior and the likelihood shown above, the posterior is $$p(\beta \mid y, X) \propto \exp\left(-\frac{1}{2\sigma^{2}}\left(\beta - \beta_{N}\right)^{\top}V_{N}^{-1}\left(\beta - \beta_{N}\right)\right)$$ where $$V_{N} = \left(V_{0}^{-1} + X^{\top}X\right)^{-1}, \qquad \beta_{N} = V_{N}\left(V_{0}^{-1}\beta_{0} + X^{\top}y\right)$$ that is, the posterior distribution of $\beta$ is a multivariate normal distribution with mean $\beta_{N}$ and covariance matrix $\sigma^{2}V_{N}$.
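
Before turning to the proof, here is a small numerical sketch of these formulas in Python, reusing the simulated data from the sketch above; the prior hyperparameters ($\beta_{0}$, $V_{0}$) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
sigma2 = 0.25
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=N)

# Prior hyperparameters (illustrative assumptions)
beta_0 = np.zeros(K)          # prior mean of beta
V_0 = 10.0 * np.eye(K)        # prior covariance of beta is sigma2 * V_0

# Posterior: beta | y, X ~ N(beta_N, sigma2 * V_N)
V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)

print("posterior mean:", beta_N)
print("posterior covariance:\n", sigma2 * V_N)
```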

Proof

The joint distribution is

[eq8] We can write it as [eq9] where [eq10] and [eq11] where $I$ is the $N \times N$ identity matrix. Now define a new matrix $A$ (called a rotation matrix) as follows: [eq12] We have that [eq13] Furthermore, [eq14] and [eq15] Moreover, the determinant of $A$ is [eq16] where we have used the formula for the determinant of a block matrix [eq17]. We can now use these results about $A$ to rewrite the joint density: [eq18] We have shown above that [eq19] is block-diagonal. Therefore, by using the expressions derived above for the blocks of $A$ and [eq19], we obtain [eq21]. Thus, we have factorized the joint density as [eq22] where [eq23] is a function that depends on $y$ but not on $\beta$, and [eq24] is a probability density function if considered as a function of $\beta$ for any given $y$ (note that $g$ depends on $y$ through $\beta_{N}$). More specifically, [eq25] is the density of a multivariate normal distribution with mean $\beta_{N}$ and covariance matrix $\sigma^{2}V_{N}$. By a standard result on the factorization of probability density functions (see also the introduction to Bayesian inference), we have that [eq26]. Therefore, the posterior distribution [eq27] is a multivariate normal distribution with mean $\beta_{N}$ and covariance matrix $\sigma^{2}V_{N}$.

Note that the posterior mean can be written as $$\beta_{N} = V_{N}\left(V_{0}^{-1}\beta_{0} + X^{\top}X\,\beta_{OLS}\right)$$ where $$\beta_{OLS} = \left(X^{\top}X\right)^{-1}X^{\top}y$$ is the ordinary least squares (OLS) estimator of the regression coefficients. Thus, the posterior mean of $\beta$ is a weighted average of

  1. the OLS estimate derived from the observed data;

  2. the prior mean $\beta_{0}$.

Remember that the covariance matrix of the OLS estimator in the normal linear regression model is $$\sigma^{2}\left(X^{\top}X\right)^{-1}$$ while the covariance matrix of the prior is $$\sigma^{2}V_{0}$$

Therefore, we can write $$\beta_{N} = \left(\sigma^{2}V_{N}\right)\left[\left(\sigma^{2}V_{0}\right)^{-1}\beta_{0} + \left(\sigma^{2}\left(X^{\top}X\right)^{-1}\right)^{-1}\beta_{OLS}\right]$$

Both the prior mean and the OLS estimator derived from the data convey some information about $\beta$. The two sources of information are combined linearly in order to produce the posterior mean $\beta_{N}$, but more weight is given to the signal that has higher precision (smaller covariance matrix).

Note also that $X^{\top}X$ grows without bound as the sample size $N$ goes to infinity. As a consequence, the weight given to $\beta_{OLS}$ increases with $N$. In other words, the larger the sample size becomes, the more weight is given to the OLS estimate, that is, to the information coming from the observed sample. In the limit, all weight is given to the latter and none to the prior. Roughly speaking, Bayesian regression and frequentist (OLS) regression provide almost the same results when the sample size is large.
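
The weighted-average representation of the posterior mean can be checked numerically. The sketch below (same assumed data and prior as above) recomputes $\beta_{N}$ from the OLS estimate and the prior mean and verifies that the two expressions coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
sigma2 = 0.25
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=N)
beta_0, V_0 = np.zeros(K), 10.0 * np.eye(K)

V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)

# OLS estimate and the weighted-average form of the posterior mean
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_N_weighted = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ X @ beta_ols)

print(np.allclose(beta_N, beta_N_weighted))   # True: the two expressions coincide
```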

The prior predictive distribution

The prior predictive distribution is $$y \mid X \sim N\left(X\beta_{0},\; \sigma^{2}\left(I + XV_{0}X^{\top}\right)\right)$$ where $I$ is the $N \times N$ identity matrix.

Thus, the prior predictive distribution of the vector of observations of the dependent variable $y$ is multivariate normal with mean $X\beta_{0}$ and covariance matrix $\sigma^{2}\left(I + XV_{0}X^{\top}\right)$.

Proof

See the previous proof.
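
As a quick illustration (same assumed setup as before), the prior predictive mean and covariance can be formed directly and used to simulate a draw of $y$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
sigma2 = 0.25
beta_0, V_0 = np.zeros(K), 10.0 * np.eye(K)

# Prior predictive: y | X ~ N(X beta_0, sigma2 * (I + X V_0 X'))
prior_pred_mean = X @ beta_0
prior_pred_cov = sigma2 * (np.eye(N) + X @ V_0 @ X.T)

y_draw = rng.multivariate_normal(prior_pred_mean, prior_pred_cov)
```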

The posterior predictive distribution

Suppose that, after observing the sample $(y, X)$ and updating the prior, a new sample $(\widetilde{y}, \widetilde{X})$ of size $M$ is generated by the same regression. Suppose further that we observe only the regressors $\widetilde{X}$ and we want to predict $\widetilde{y}$. This is done through the posterior predictive distribution $$\widetilde{y} \mid \widetilde{X}, y, X \sim N\left(\widetilde{X}\beta_{N},\; \sigma^{2}\left(I + \widetilde{X}V_{N}\widetilde{X}^{\top}\right)\right)$$ where $I$ is the $M \times M$ identity matrix.

So, $\widetilde{y}$ has a multivariate normal distribution with mean $\widetilde{X}\beta_{N}$ (where $\beta_{N}$ is the posterior mean of $\beta$) and covariance matrix $\sigma^{2}\left(I + \widetilde{X}V_{N}\widetilde{X}^{\top}\right)$ (where $V_{N}$ is the posterior covariance matrix of $\beta$).

Proof

The derivation is almost identical to that of the prior predictive distribution of $y$ (see above). The posterior [eq39] is used as a new prior. The likelihood [eq40] is the same as [eq41] because $\widetilde{y}$ is independent of $y$ and $X$ conditional on $\beta$. So, we can perform the factorization [eq42] and derive [eq43] by using the same procedure used to find [eq44]. The main difference is that we need to replace the prior mean $\beta_{0}$ with the posterior mean $\beta_{N}$ and the prior covariance matrix $\sigma^{2}V_{0}$ with the posterior covariance matrix $\sigma^{2}V_{N}$.
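
A minimal sketch of the posterior predictive computation, with the same assumed data and prior; the new regressor matrix $\widetilde{X}$ is also generated arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 50, 3, 5
X = rng.normal(size=(N, K))
sigma2 = 0.25
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=N)
beta_0, V_0 = np.zeros(K), 10.0 * np.eye(K)

# Posterior of beta given the first sample
V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)

# Posterior predictive for a new sample with regressors X_tilde (M x K)
X_tilde = rng.normal(size=(M, K))
pred_mean = X_tilde @ beta_N
pred_cov = sigma2 * (np.eye(M) + X_tilde @ V_N @ X_tilde.T)

y_tilde_draw = rng.multivariate_normal(pred_mean, pred_cov)
```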

Unknown regression coefficients and unknown variance

Everything is as in the previous section, except that not only the vector of regression coefficients $\beta$, but also the variance of the error terms $\sigma^{2}$, is assumed to be unknown.

The likelihood

The likelihood of $y$ is $$p(y \mid X, \beta, \sigma^{2}) = \left(2\pi\sigma^{2}\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^{2}}\left(y - X\beta\right)^{\top}\left(y - X\beta\right)\right)$$ Note that we now highlight the fact that we are conditioning on both of the unknown parameters $\beta$ and $\sigma^{2}$.

The prior

The prior is hierarchical: we specify a prior on $\beta$ conditional on $\sigma^{2}$, together with a marginal prior on $\sigma^{2}$.

As in the previous section, we assign a multivariate normal prior to the regression coefficients, conditional on $\sigma^{2}$: $$\beta \mid \sigma^{2} \sim N\left(\beta_{0}, \sigma^{2}V_{0}\right)$$ where the covariance matrix of $\beta$ is assumed to be proportional to $\sigma^{2}$.

Then, we assign an inverse-Gamma prior to the variance: $\sigma^{2}$ has an inverse-Gamma distribution with parameters $L$ and $\sigma_{0}^{2}$ (i.e., the precision $1/\sigma^{2}$ has a Gamma distribution with parameters $L$ and $1/\sigma_{0}^{2}$).

By the properties of the Gamma distribution, the prior mean of the precision is $$\mathrm{E}\left[\frac{1}{\sigma^{2}}\right] = \frac{1}{\sigma_{0}^{2}}$$ and its variance is $$\mathrm{Var}\left[\frac{1}{\sigma^{2}}\right] = \frac{2}{L\,\sigma_{0}^{4}}$$

We can think of $1/\sigma_{0}^{2}$ as our best guess of the precision of the regression (i.e., of its error terms). We use the parameter $L$ to express our degree of confidence in this guess. The larger $L$ is, the tighter the prior on $1/\sigma^{2}$ becomes, and the more likely we consider it that $1/\sigma^{2}$ is close to $1/\sigma_{0}^{2}$.
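
The following sketch illustrates these prior moments numerically. It uses NumPy's shape/scale Gamma parameterization; the mapping shape $= L/2$, scale $= 2/(L\sigma_{0}^{2})$ is an assumption chosen so that the simulated draws of the precision match the mean and variance stated above, not a notation used by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2_0 = 0.25       # prior guess of the error variance (assumed)
L = 20                # confidence in the guess (assumed)

prior_mean_precision = 1.0 / sigma2_0
prior_var_precision = 2.0 / (L * sigma2_0 ** 2)

# Draws of the precision 1/sigma^2 under an assumed shape/scale mapping
# consistent with the prior mean and variance stated above.
precision_draws = rng.gamma(shape=L / 2.0, scale=2.0 / (L * sigma2_0), size=100_000)

print(precision_draws.mean(), prior_mean_precision)   # simulated mean close to 1/sigma2_0
print(precision_draws.var(), prior_var_precision)     # simulated variance close to 2/(L*sigma2_0^2)
```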

The posterior distribution of the regression coefficients conditional on the variance

Conditional on $\sigma^{2}$, the posterior distribution of $\beta$ is $$\beta \mid \sigma^{2}, y, X \sim N\left(\beta_{N}, \sigma^{2}V_{N}\right)$$ where $$V_{N} = \left(V_{0}^{-1} + X^{\top}X\right)^{-1}, \qquad \beta_{N} = V_{N}\left(V_{0}^{-1}\beta_{0} + X^{\top}y\right)$$

Proof

This has been proved in the previous section (known variance). Indeed, conditional on $\sigma^{2}$, we can treat $\sigma^{2}$ as if it were known.

The prior predictive distribution conditional on the variance

Conditional on $\sigma^{2}$, the prior predictive distribution of $y$ is $$y \mid \sigma^{2}, X \sim N\left(X\beta_{0},\; \sigma^{2}\left(I + XV_{0}X^{\top}\right)\right)$$ where $I$ is the $N \times N$ identity matrix.

Proof

See previous section.

The posterior distribution of the variance

The posterior distribution of the variance is an inverse-Gamma distribution with parameters $N+L$ and $\sigma_{N}^{2}$, where $$\sigma_{N}^{2} = \frac{L\sigma_{0}^{2} + \left(y - X\beta_{0}\right)^{\top}\left(I + XV_{0}X^{\top}\right)^{-1}\left(y - X\beta_{0}\right)}{N + L}$$

In other words, $1/\sigma^{2}$ has a Gamma distribution with parameters $N+L$ and $1/\sigma_{N}^{2}$.

Proof

Consider the joint distribution [eq56] where we have defined $$\sigma_{N}^{2} = \frac{L\sigma_{0}^{2} + \left(y - X\beta_{0}\right)^{\top}\left(I + XV_{0}X^{\top}\right)^{-1}\left(y - X\beta_{0}\right)}{N + L}$$ We can write [eq58] where [eq59] is a function that depends on $y$ (via $\sigma_{N}^{2}$) but not on $\sigma^{2}$, and [eq60] is a probability density function if considered as a function of $\sigma^{2}$ for any given $y$ (note that $g$ depends on $y$ through $\sigma_{N}^{2}$). In particular, [eq61] is the density of an inverse-Gamma distribution with parameters $N+L$ and $\sigma_{N}^{2}$. Thus, by a well-known result on the factorization of joint probability density functions, we have that [eq62]. Therefore, the posterior distribution [eq63] is inverse-Gamma with parameters $N+L$ and $\sigma_{N}^{2}$. The distribution of $y$ (the prior predictive) will be derived in the next proof.
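
A short sketch that computes $\sigma_{N}^{2}$ and the parameters of the posterior of the variance, following the expression for $\sigma_{N}^{2}$ given above (the data and hyperparameters are the same arbitrary assumptions as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)
beta_0, V_0 = np.zeros(K), 10.0 * np.eye(K)
sigma2_0, L = 0.25, 20     # hyperparameters of the inverse-Gamma prior (assumed)

# sigma_N^2 combines the prior guess and the prediction errors of the prior mean
resid = y - X @ beta_0
S = np.eye(N) + X @ V_0 @ X.T
sigma2_N = (L * sigma2_0 + resid @ np.linalg.solve(S, resid)) / (N + L)

# Posterior of sigma^2: inverse-Gamma with parameters N + L and sigma2_N
print("posterior parameters:", N + L, sigma2_N)
```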

The prior predictive distribution

The prior predictive distribution of the dependent variable $y$ is a multivariate Student's t distribution with mean $X\beta_{0}$, scale matrix $\sigma_{0}^{2}\left(I + XV_{0}X^{\top}\right)$ and $L$ degrees of freedom.

Proof

The prior predictive distribution has already been derived in the previous proof. We just need to perform some algebraic manipulations in order to show clearly that it is a multivariate Student's t distribution with mean $X\beta_{0}$, scale matrix $\sigma_{0}^{2}\left(I + XV_{0}X^{\top}\right)$ and $L$ degrees of freedom: [eq68]

The posterior distribution of the regression coefficients

The posterior distribution of the regression coefficients is a multivariate Student's t distribution with mean $\beta_{N}$, scale matrix $\sigma_{N}^{2}V_{N}$ and $N+L$ degrees of freedom.

Proof

As proved above: 1) conditional on $\sigma^{2}$ and the data $(y, X)$, $\beta$ is multivariate normal with mean $\beta_{N}$ and covariance matrix $\sigma^{2}V_{N}$; 2) conditional on the data $(y, X)$, $1/\sigma^{2}$ has a Gamma distribution with parameters $N+L$ and $1/\sigma_{N}^{2}$. Thus, we can write [eq71] where $Z$ is a standard multivariate normal vector and [eq72]. Now define a new variable [eq73] which, by the properties of the Gamma distribution, has a Gamma distribution with parameters $N+L$ and $1$. We can now write [eq74]. But [eq75] has a standard multivariate Student's t distribution with $N+L$ degrees of freedom (see the lecture on the multivariate t distribution). As a consequence, $\beta$ has a multivariate Student's t distribution with mean $\beta_{N}$, scale matrix $\sigma_{N}^{2}V_{N}$ and $N+L$ degrees of freedom. Thus, its density is [eq69].
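
The proof above also suggests a direct way to simulate from this multivariate t posterior: draw the precision from its Gamma posterior and then $\beta$ from the conditional normal. The sketch below does exactly that, under the same assumed data, hyperparameters, and Gamma shape/scale mapping used in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)
beta_0, V_0 = np.zeros(K), 10.0 * np.eye(K)
sigma2_0, L = 0.25, 20

# Conditional posterior of beta and posterior of the variance (as derived above)
V_N = np.linalg.inv(np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ (np.linalg.inv(V_0) @ beta_0 + X.T @ y)
resid = y - X @ beta_0
S = np.eye(N) + X @ V_0 @ X.T
sigma2_N = (L * sigma2_0 + resid @ np.linalg.solve(S, resid)) / (N + L)

# Draw sigma^2 from its posterior, then beta | sigma^2 from the conditional normal;
# the resulting draws of beta follow the marginal (multivariate t) posterior.
draws = []
for _ in range(5000):
    precision = rng.gamma(shape=(N + L) / 2.0, scale=2.0 / ((N + L) * sigma2_N))
    sigma2_draw = 1.0 / precision
    draws.append(rng.multivariate_normal(beta_N, sigma2_draw * V_N))
draws = np.array(draws)

print("empirical posterior mean:", draws.mean(axis=0))   # close to beta_N
```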

