Linear regression model

A linear regression model is a conditional model in which the output variable is a linear function of the input variables and of an unobservable error term that adds noise to the relationship between inputs and outputs.

This lecture introduces the main mathematical assumptions, the matrix notation and the terminology used in linear regression models.

Table of contents

Dependent and independent variables
Regression coefficients and errors
Example
Matrix notation
Intercept
Zero-mean errors
OLS estimator
Residuals
Formula for the OLS estimator
Models and assumptions
The normal linear regression model
More realistic models
Learn about the mathematics of linear regression

Dependent and independent variables

We assume that the statistician observes a sample of realizations $(y_{i},x_{i})$ for $i=1,\ldots ,N$, where:

$y_{i}$ is a scalar output variable, also called dependent variable or regressand;
$x_{i}$ is a $1\times K$ vector of input variables, also called independent variables or regressors;
$N$ is the sample size.

Regression coefficients and errors

Inputs and outputs are assumed to have a linear relationship: $y_{i}=x_{i}\beta +\varepsilon _{i}$ where:

$\beta $ is a $K\times 1$ vector of constants, called regression coefficients;
$\varepsilon _{i}$ is an unobservable error term which encompasses the sources of variability in $y_{i}$ that are not included in the vector of inputs $x_{i}$; for example, $\varepsilon _{i}$ could include measurement errors and input variables that are not observed by the statistician.

The linear relationship is assumed to hold for each $i=1,\ldots ,N$, with the same $\beta $.

Example

Let us consider an example.

Suppose that we have a sample of individuals for which weight, height and age are observed.

We want to set up a linear regression model to predict weight based on height and age.

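Then, we could postulate that $w_{i}=\beta _{1}+\beta _{2}h_{i}+\beta _{3}\alpha _{i}+\varepsilon _{i}$ where:

$w_{i}$, $h_{i}$ and $\alpha _{i}$ denote the weight, height and age of the $i$-th individual in the sample, respectively;
$\beta _{1}$, $\beta _{2}$ and $\beta _{3}$ are regression coefficients;
$\varepsilon _{i}$ is an error term.

The regression equation can be written in vector notation as $w_{i}=x_{i}\beta +\varepsilon _{i}$ by defining $x_{i}=\begin{bmatrix}1 & h_{i} & \alpha _{i}\end{bmatrix}$ and $\beta =\begin{bmatrix}\beta _{1} & \beta _{2} & \beta _{3}\end{bmatrix}^{\top }$, where $x_{i}$ is a $1\times 3$ vector and $\beta $ is a $3\times 1$ vector.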


Matrix notation

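Denote by $y$ the $N\times 1$ vector of outputs $y=\begin{bmatrix}y_{1} & \ldots & y_{N}\end{bmatrix}^{\top }$, by $X$ the $N\times K$ matrix of inputs whose $i$-th row is $x_{i}$, that is, $X=\begin{bmatrix}x_{1} \\ \vdots \\ x_{N}\end{bmatrix}$, and by $\varepsilon =\begin{bmatrix}\varepsilon _{1} & \ldots & \varepsilon _{N}\end{bmatrix}^{\top }$ the $N\times 1$ vector of error terms.

Then, the linear relationship can be expressed in matrix form as $y=X\beta +\varepsilon $.

The matrix $X$ is called the design matrix.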
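The matrix form is just the $N$ row-by-row equations stacked on top of each other. The following minimal sketch, assuming NumPy and purely hypothetical values for the sample size, the coefficients and the simulated data, checks that the two ways of writing the model give the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible

N, K = 5, 3                         # hypothetical sample size and number of regressors
X = rng.normal(size=(N, K))         # design matrix: row i is the 1 x K vector x_i
beta = np.array([1.0, -2.0, 0.5])   # hypothetical K x 1 vector of coefficients
eps = rng.normal(size=N)            # N x 1 vector of error terms

# Matrix form: all N equations at once, y = X beta + eps.
y = X @ beta + eps

# Row-by-row form: y_i = x_i beta + eps_i for each i.
y_rowwise = np.array([X[i] @ beta + eps[i] for i in range(N)])

print(np.allclose(y, y_rowwise))
```

In a real application only `y` and `X` would be observed; `beta` and `eps` are visible here only because the data are simulated.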

Intercept

The vector of regressors $x_{i}$ usually contains a constant variable equal to $1$.

Without loss of generality, we can assume that the constant is the first entry of $x_{i}$.

Therefore, the first column of the design matrix $X$ is a column of $1$s.

The regression coefficient corresponding to the constant variable is called the intercept.

Example Suppose that the number of regressors is $K=3$ and the regression includes a constant equal to $1$. Then, we have that $y_{i}=\beta _{1}+\beta _{2}x_{i2}+\beta _{3}x_{i3}+\varepsilon _{i}$. The coefficient $\beta _{1}$ is the intercept of the regression.
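In practice, including an intercept amounts to placing a column of ones in front of the other regressors. The sketch below assumes NumPy; the data values and coefficients are hypothetical and only serve to show that multiplying a row of the design matrix by the coefficient vector reproduces the expanded equation with an intercept.

```python
import numpy as np

# Hypothetical observations of the two non-constant regressors.
x2 = np.array([1.70, 1.82, 1.65, 1.75])
x3 = np.array([34.0, 41.0, 29.0, 52.0])
N = x2.shape[0]

# The constant regressor goes in the first column of the design matrix.
X = np.column_stack([np.ones(N), x2, x3])

beta = np.array([10.0, 2.0, 0.5])   # hypothetical; beta[0] plays the role of the intercept

# Leaving the error term aside, x_i beta = beta_1 + beta_2 x_i2 + beta_3 x_i3.
fitted = X @ beta
expanded = beta[0] + beta[1] * x2 + beta[2] * x3

print(X)
print(np.allclose(fitted, expanded))
```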

Zero-mean errors

When an intercept is included in the regression, we can assume without loss of generality that the expected value of the error term is equal to $0$.

Consider, for instance, the previous example.

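If we had $\mathrm{E}[\varepsilon _{i}]=\mu \neq 0$, then we could write $y_{i}=\beta _{1}+\beta _{2}x_{i2}+\beta _{3}x_{i3}+\varepsilon _{i}=(\beta _{1}+\mu )+\beta _{2}x_{i2}+\beta _{3}x_{i3}+(\varepsilon _{i}-\mu )$.

We could then define a new regression equation $y_{i}=\gamma _{1}+\beta _{2}x_{i2}+\beta _{3}x_{i3}+\eta _{i}$ where $\gamma _{1}=\beta _{1}+\mu $ and $\eta _{i}=\varepsilon _{i}-\mu $.

The expected value of the new error would be zero because $\mathrm{E}[\eta _{i}]=\mathrm{E}[\varepsilon _{i}]-\mu =0$.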
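The argument can be illustrated numerically. In the sketch below (NumPy assumed), `mu` is a hypothetical non-zero mean of the original errors, `gamma` is the coefficient vector with the mean absorbed into the intercept, and `eta` is the re-centred error; these names are chosen for the illustration. The outputs are unchanged, while the new errors are centred on zero.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
mu = 2.0                                    # hypothetical non-zero error mean

X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 0.5, -0.3])           # hypothetical coefficients
eps = mu + rng.normal(size=N)               # errors with expected value mu

y = X @ beta + eps

# Move mu out of the error term and into the intercept.
gamma = beta.copy()
gamma[0] = beta[0] + mu
eta = eps - mu

y_new = X @ gamma + eta

print(np.allclose(y, y_new))                # same outputs under both equations
print(eps.mean(), eta.mean())               # sample means of old and new errors
```

Only the intercept changes; the coefficients on the non-constant regressors are the same in both equations.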

OLS estimator

Usually, the vector of regression coefficients $\beta $ is unknown and needs to be estimated.

The most commonly used estimator of $\beta $ is the Ordinary Least Squares (OLS) estimator.

The OLS estimator is not only computationally convenient, but it enjoys good statistical properties under different sets of mathematical assumptions on the joint distribution of $X$ and $\varepsilon $.

The following is a formal definition of the OLS estimator.

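Definition An estimator $\widehat{\beta }$ is an OLS estimator of $\beta $ if and only if $\widehat{\beta }$ satisfies $\widehat{\beta }=\arg \min_{b}\sum_{i=1}^{N}(y_{i}-x_{i}b)^{2}$.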

The OLS estimator is the vector of estimated regression coefficients that minimizes the sum of the squared distances between predicted values $x_{i}b$ and observed values $y_{i}$ .

In other words, the OLS estimator makes the predicted values as close as possible to the actual output values.
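The definition can be checked numerically. The sketch below assumes NumPy and simulated data with hypothetical coefficients; it uses `np.linalg.lstsq` to obtain a least-squares solution and then verifies that randomly perturbed candidate vectors never achieve a smaller sum of squared distances. The helper `ssr` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)   # hypothetical data

def ssr(b):
    """Sum of squared distances between observed y_i and predicted x_i b."""
    return float(np.sum((y - X @ b) ** 2))

# np.linalg.lstsq returns a least-squares solution of X b = y.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturbed candidates should give a sum of squares at least as large.
candidates = [b_ols + rng.normal(scale=0.1, size=3) for _ in range(100)]
print(all(ssr(b) >= ssr(b_ols) for b in candidates))
```

A random search like this is not a proof of optimality; it only illustrates what the minimization in the definition means.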

Residuals

A residual $y_{i}-x_{i}b$ is the difference between the observed output $y_{i}$ and its predicted value $x_{i}b$.

Thus, the OLS estimator is the estimator that minimizes the sum of squared residuals.

Formula for the OLS estimator

If the design matrix has full rank, the OLS minimization problem has a solution that is both unique and explicit.

Proposition If the design matrix $X$ has full rank, then the OLS estimator is $\widehat{\beta }=(X^{\top }X)^{-1}X^{\top }y$.

Proof

First of all, observe that the sum of squared residuals, henceforth indicated by $SSR$, can be written in matrix form as follows: $SSR=\sum_{i=1}^{N}(y_{i}-x_{i}b)^{2}=(y-Xb)^{\top }(y-Xb)$.

The first order condition for a minimum is that the gradient of $SSR$ with respect to $b$ should be equal to zero: $\nabla _{b}SSR=-2X^{\top }(y-Xb)=0$, that is, $X^{\top }(y-Xb)=0$ or $X^{\top }Xb=X^{\top }y$.

Now, if $X$ has full rank (i.e., rank equal to $K$), then the matrix $X^{\top }X$ is invertible. As a consequence, the first order condition is satisfied by $b=\widehat{\beta }=(X^{\top }X)^{-1}X^{\top }y$.

We now need to check that this is indeed a global minimum. Note that the Hessian matrix, that is, the matrix of second derivatives of $SSR$, is $2X^{\top }X$. But $X^{\top }X$ is a positive definite matrix because, for any $K\times 1$ vector $a\neq 0$, we have $a^{\top }X^{\top }Xa=(Xa)^{\top }(Xa)=\sum_{i=1}^{N}(x_{i}a)^{2}>0$, where the last inequality follows from the fact that $X$ has full rank (and, as a consequence, $a\neq 0$ implies that $x_{i}a$ cannot be equal to $0$ for every $i$). Thus, $SSR$ is strictly convex in $b$, which implies that $\widehat{\beta }$ is indeed a global minimum.
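The steps of the proof can be traced numerically. The sketch below assumes NumPy and simulated data with hypothetical coefficients: it checks the rank condition, solves the normal equations, evaluates the gradient at the solution, and inspects the eigenvalues of the Hessian. Solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience, not part of the proposition.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=N)   # hypothetical data

# Full rank means rank equal to K, which makes X'X invertible.
rank = np.linalg.matrix_rank(X)

XtX = X.T @ X
# Solve the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(XtX, X.T @ y)

# First order condition: the gradient -2 X'(y - X b) vanishes at beta_hat.
gradient = -2 * X.T @ (y - X @ beta_hat)

# The Hessian 2 X'X is positive definite: all eigenvalues are strictly positive.
hessian_eigenvalues = np.linalg.eigvalsh(2 * XtX)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(rank == K)
print(np.allclose(gradient, 0.0, atol=1e-8))
print(hessian_eigenvalues.min() > 0)
print(np.allclose(beta_hat, beta_lstsq))
```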

Models and assumptions

The linearity assumption $y=X\beta +\varepsilon $ is not per se sufficient to determine the mathematical properties of the OLS estimator of $\beta $ (or of any other estimator).

In order to be able to establish any property (e.g., unbiasedness, consistency and asymptotic normality), we need to make further assumptions about the joint distribution of the regressors $X$ and the error terms $\varepsilon $.

These further assumptions, together with the linearity assumption, form a linear regression model.

The next section provides an example.

The normal linear regression model

A popular linear regression model is the so-called Normal Linear Regression Model (NLRM).

In the NLRM it is assumed that:

the vector of errors $\varepsilon $ has a multivariate normal distribution conditional on the design matrix $X$;
the covariance matrix of $\varepsilon $ is diagonal and all the diagonal entries are equal (in other words, the entries of $\varepsilon $ are mutually independent and have constant variance).

Under these hypotheses, the OLS estimator has a multivariate normal distribution conditional on $X$. Furthermore, the distributions of several test statistics can be derived analytically.

More details about the NLRM can be found in the lecture on the Normal Linear Regression Model.
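A small Monte Carlo experiment can make the NLRM assumptions concrete. The sketch below assumes NumPy; the sample size, number of replications, error standard deviation `sigma` and coefficients are all hypothetical. The design matrix is held fixed, errors are drawn as independent zero-mean normals with constant variance, and the OLS estimate is recomputed for each draw. The comparison relies on the standard result, not derived in this lecture, that under these assumptions the estimator has mean $\beta $ and covariance matrix $\sigma ^{2}(X^{\top }X)^{-1}$ conditional on $X$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, R = 30, 5000                 # hypothetical sample size and number of replications
sigma = 1.5                     # hypothetical error standard deviation
beta = np.array([1.0, 2.0])     # hypothetical true coefficients

# The design matrix is held fixed, mirroring "conditional on X".
X = np.column_stack([np.ones(N), rng.normal(size=N)])
XtX = X.T @ X

# Each row of eps is one draw of the N errors: independent, normal,
# zero mean, constant variance sigma**2 (diagonal covariance matrix).
eps = sigma * rng.normal(size=(R, N))
Y = X @ beta + eps              # shape (R, N): one simulated sample per row

# OLS estimate for every replication: beta_hat = (X'X)^{-1} X'y.
A = np.linalg.solve(XtX, X.T)   # shape (K, N)
beta_hats = Y @ A.T             # shape (R, K)

empirical_mean = beta_hats.mean(axis=0)
empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(XtX)

print(empirical_mean)
print(empirical_cov)
print(theoretical_cov)
```

The empirical moments should be close to, but not exactly equal to, their theoretical counterparts, because the number of replications is finite.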

More realistic models

The NLRM has several appealing properties, but its assumptions are unrealistic in many practical cases of interest.

For this reason, we often prefer to make weaker assumptions, under which it is possible to prove that the OLS estimators are consistent and asymptotically normal.

These assumptions are discussed in the lecture on the properties of the OLS estimator.

Learn about the mathematics of linear regression

If you want to learn more about the mathematics of linear regression, you can read the following lectures:

How to cite

Please cite as:

Taboga, Marco (2021). "Linear regression model", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/linear-regression.