# Ridge regression

Ridge regression is a linear regression model whose coefficients are estimated not by ordinary least squares (OLS), but by an estimator, called the ridge estimator, that, albeit biased, has lower variance than the OLS estimator.

In certain cases, the mean squared error of the ridge estimator (which is the sum of its variance and the square of its bias) is smaller than that of the OLS estimator.

## Linear regression

Ridge estimation is carried out on the linear regression model

$$y = X\beta + \varepsilon$$

where:

• $y$ is the $N \times 1$ vector of observations of the dependent variable;

• $X$ is the $N \times K$ matrix of regressors (there are $K$ regressors);

• $\beta$ is the $K \times 1$ vector of regression coefficients;

• $\varepsilon$ is the $N \times 1$ vector of errors.

## Ridge estimator

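Remember that the OLS estimator $\hat\beta$ solves the minimization problem

$$\hat\beta = \arg\min_{b} \sum_{i=1}^{N} (y_i - x_i b)^2$$

where $x_i$ is the $i$-th row of $X$ and $b$ and $\hat\beta$ are $K \times 1$ column vectors.

When $X$ has full rank, the solution to the OLS problem is

$$\hat\beta = (X^\top X)^{-1} X^\top y.$$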

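The ridge estimator $\hat\beta_\lambda$ solves the slightly modified minimization problem

$$\hat\beta_\lambda = \arg\min_{b} \left[ \sum_{i=1}^{N} (y_i - x_i b)^2 + \lambda \sum_{k=1}^{K} b_k^2 \right]$$

where $\lambda$ is a positive constant.

Thus, in ridge estimation we add a penalty to the least squares criterion: we minimize the sum of squared residuals

$$\sum_{i=1}^{N} (y_i - x_i b)^2$$

plus $\lambda$ times the squared norm of the vector of coefficients

$$\lVert b \rVert^2 = \sum_{k=1}^{K} b_k^2.$$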

The ridge problem penalizes large regression coefficients, and the larger the parameter $\lambda$ is, the larger the penalty.
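The effect of the penalty can be seen numerically. The sketch below is only an illustration: the simulated data and the list of penalty values are assumptions, and the coefficients are computed with the standard closed-form solution $(X^\top X + \lambda I)^{-1} X^\top y$ of the ridge problem. It records the norm of the coefficient vector for increasing values of $\lambda$.

```python
import numpy as np

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(8)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]   # lam = 0 corresponds to OLS
norms = []
for lam in lams:
    # Closed-form ridge solution, computed by solving a linear system
    b = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
    norms.append(np.linalg.norm(b))

for lam, norm in zip(lams, norms):
    print(f"lambda = {lam:7.1f}   norm of coefficients = {norm:.4f}")
```

The printed norms decrease as the penalty grows, which is the shrinkage that the penalty term is designed to produce.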

We will discuss below how to choose the penalty parameter $\lambda$.

The solution to the minimization problem is

$$\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$

where $I$ is the $K \times K$ identity matrix.

Proof

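The objective function to minimize can be written in matrix form as follows:

$$L(b) = (y - Xb)^\top (y - Xb) + \lambda b^\top b.$$

The first order condition for a minimum is that the gradient of $L$ with respect to $b$ should be equal to zero:

$$\nabla_b L(b) = -2 X^\top (y - Xb) + 2 \lambda b = 0,$$

that is,

$$X^\top X b + \lambda b = X^\top y$$

or

$$(X^\top X + \lambda I) b = X^\top y.$$

The matrix

$$X^\top X + \lambda I$$

is positive definite for any $\lambda > 0$ because, for any $K \times 1$ vector $a \neq 0$, we have

$$a^\top (X^\top X + \lambda I) a = (Xa)^\top (Xa) + \lambda a^\top a = \sum_{i=1}^{N} (x_i a)^2 + \lambda \sum_{k=1}^{K} a_k^2 > 0,$$

where the last inequality follows from the fact that even if $x_i a$ is equal to $0$ for every $i$, $a_k^2$ is strictly positive for at least one $k$. Therefore, the matrix has full rank and it is invertible. As a consequence, the first order condition is satisfied by

$$\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y.$$

We now need to check that this is indeed a global minimum. Note that the Hessian matrix, that is, the matrix of second derivatives of $L$, is

$$\nabla_b^2 L(b) = 2 (X^\top X + \lambda I).$$

Thus, the Hessian is positive definite (it is a positive multiple of a matrix that we have just proved to be positive definite). Hence, $L$ is strictly convex in $b$, which implies that $\hat\beta_\lambda$ is a global minimum.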
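The closed-form solution and the first order condition from the proof can be checked numerically. This is a minimal sketch under stated assumptions: the function name `ridge_estimator`, the simulated data and the value `lam = 2.0` are hypothetical choices made for illustration. The linear system is solved directly instead of forming the inverse.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Return (X'X + lam*I)^{-1} X'y by solving the linear system."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=N)

lam = 2.0
b_ridge = ridge_estimator(X, y, lam)

# Gradient of the ridge objective at the solution: -2 X'(y - Xb) + 2 lam b
gradient = -2 * X.T @ (y - X @ b_ridge) + 2 * lam * b_ridge
print("ridge estimate:", b_ridge)
print("gradient is numerically zero:", np.allclose(gradient, 0.0))
```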

If you read the proof above, you will notice that, unlike in OLS estimation, we do not need to assume that the design matrix $X$ is full-rank. Therefore, the ridge estimator exists also when $X$ does not have full rank.
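As an illustration of this point, the following sketch builds a design matrix with an exactly collinear column, so that $X^\top X$ is singular and the OLS formula cannot be applied, and shows that the ridge system can still be solved. The data and the penalty value are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
# The third column is an exact linear combination of the first two,
# so X has rank 2 and X'X is singular.
X = np.column_stack([x1, x2, x1 + x2])
y = x1 - x2 + rng.normal(size=N)

rank_X = np.linalg.matrix_rank(X)

lam = 0.1
A = X.T @ X + lam * np.eye(3)
smallest_eig = np.linalg.eigvalsh(A).min()   # approximately lam, hence positive
b_ridge = np.linalg.solve(A, X.T @ y)

print("rank of X:", rank_X)
print("smallest eigenvalue of X'X + lam*I:", smallest_eig)
print("ridge estimate:", b_ridge)
```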

## Bias and variance of the ridge estimator

In this section we derive the bias and variance of the ridge estimator under the commonly made assumption (e.g., in the normal linear regression model) that, conditional on $X$, the errors of the regression have zero mean and constant variance $\sigma^2$ and are uncorrelated:

$$\mathrm{E}[\varepsilon \mid X] = 0, \qquad \mathrm{Var}[\varepsilon \mid X] = \sigma^2 I,$$

where $\sigma^2$ is a positive constant and $I$ is the $N \times N$ identity matrix.

### Bias

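The conditional expected value of the ridge estimator is

$$\mathrm{E}[\hat\beta_\lambda \mid X] = (X^\top X + \lambda I)^{-1} X^\top X \beta,$$

which is different from $\beta$ unless $\lambda = 0$ (the OLS case).

The bias of the estimator is

$$\mathrm{E}[\hat\beta_\lambda \mid X] - \beta = \left[ (X^\top X + \lambda I)^{-1} X^\top X - I \right] \beta = -\lambda (X^\top X + \lambda I)^{-1} \beta.$$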

Proof

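We can write the ridge estimator as

$$\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y = (X^\top X + \lambda I)^{-1} X^\top (X\beta + \varepsilon) = (X^\top X + \lambda I)^{-1} X^\top X \beta + (X^\top X + \lambda I)^{-1} X^\top \varepsilon.$$

Therefore,

$$\mathrm{E}[\hat\beta_\lambda \mid X] = (X^\top X + \lambda I)^{-1} X^\top X \beta + (X^\top X + \lambda I)^{-1} X^\top \mathrm{E}[\varepsilon \mid X] = (X^\top X + \lambda I)^{-1} X^\top X \beta.$$

The ridge estimator is unbiased for every $\beta$, that is,

$$\mathrm{E}[\hat\beta_\lambda \mid X] = \beta,$$

if and only if

$$(X^\top X + \lambda I)^{-1} X^\top X = I,$$

where $I$ is the $K \times K$ identity matrix. But this is possible if and only if $\lambda = 0$, that is, if the ridge estimator coincides with the OLS estimator. The bias is

$$\mathrm{E}[\hat\beta_\lambda \mid X] - \beta = \left[ (X^\top X + \lambda I)^{-1} X^\top X - I \right] \beta.$$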
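The bias formula can be checked with a small simulation. The sketch below is illustrative only: the design matrix, the true coefficients `beta`, the penalty `lam`, the unit error variance and the number of replications are all assumptions. It compares the bias obtained from the conditional mean, the equivalent expression $-\lambda (X^\top X + \lambda I)^{-1} \beta$, and a Monte Carlo average over repeated error draws with $X$ held fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 40, 3
X = rng.normal(size=(N, K))
beta = np.array([1.0, -2.0, 0.5])   # hypothetical true coefficients
lam = 5.0

A = X.T @ X
I = np.eye(K)

# Conditional mean of the ridge estimator: (X'X + lam I)^{-1} X'X beta
mean_ridge = np.linalg.solve(A + lam * I, A @ beta)

# Two equivalent expressions for the bias
bias_direct = mean_ridge - beta
bias_closed = -lam * np.linalg.solve(A + lam * I, beta)

# Monte Carlo check: average the estimator over many error draws (sigma = 1)
reps = 20000
eps = rng.normal(size=(reps, N))
Y = X @ beta + eps                                      # each row is one simulated y
estimates = np.linalg.solve(A + lam * I, X.T @ Y.T).T   # shape (reps, K)
bias_mc = estimates.mean(axis=0) - beta

print("bias (formula):    ", bias_closed)
print("bias (Monte Carlo):", bias_mc)
```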

### Variance

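The covariance matrix of the ridge estimator is

$$\mathrm{Var}[\hat\beta_\lambda \mid X] = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}.$$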

Proof

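Remember that the OLS estimator has conditional variance

$$\mathrm{Var}[\hat\beta \mid X] = \sigma^2 (X^\top X)^{-1}.$$

We can write the ridge estimator as a function of the OLS estimator:

$$\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y = (X^\top X + \lambda I)^{-1} X^\top X (X^\top X)^{-1} X^\top y = (X^\top X + \lambda I)^{-1} X^\top X \hat\beta.$$

Therefore,

$$\mathrm{Var}[\hat\beta_\lambda \mid X] = (X^\top X + \lambda I)^{-1} X^\top X \, \mathrm{Var}[\hat\beta \mid X] \, X^\top X (X^\top X + \lambda I)^{-1} = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}.$$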

Importantly, for any $\lambda > 0$, the variance of the ridge estimator is always smaller than the variance of the OLS estimator.

More precisely, the difference between the covariance matrix of the OLS estimator and that of the ridge estimator is positive definite (remember from the lecture on the Gauss-Markov theorem that the covariance matrices of two estimators are compared by checking whether their difference is positive definite).

Proof

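In order to make a comparison, the OLS estimator must exist. As a consequence, $X$ must be full-rank. With this assumption in place, the conditional variance of the OLS estimator is

$$\mathrm{Var}[\hat\beta \mid X] = \sigma^2 (X^\top X)^{-1}.$$

Now, define the matrix

$$W = (X^\top X + \lambda I)^{-1} X^\top X,$$

which is invertible. Then, we can rewrite the covariance matrix of the ridge estimator as follows:

$$\mathrm{Var}[\hat\beta_\lambda \mid X] = \sigma^2 W (X^\top X)^{-1} W^\top.$$

The difference between the two covariance matrices is

$$\begin{aligned} \mathrm{Var}[\hat\beta \mid X] - \mathrm{Var}[\hat\beta_\lambda \mid X] &= \sigma^2 W \left[ W^{-1} (X^\top X)^{-1} (W^\top)^{-1} - (X^\top X)^{-1} \right] W^\top \\ &= \sigma^2 W \left[ 2\lambda (X^\top X)^{-2} + \lambda^2 (X^\top X)^{-3} \right] W^\top \\ &= \sigma^2 (X^\top X + \lambda I)^{-1} \left[ 2\lambda I + \lambda^2 (X^\top X)^{-1} \right] (X^\top X + \lambda I)^{-1}, \end{aligned}$$

where the second equality uses $W^{-1} = (W^\top)^{-1} = I + \lambda (X^\top X)^{-1}$. If $\lambda > 0$, the latter matrix is positive definite because for any $v \neq 0$, we have

$$v^\top v > 0$$

and

$$v^\top (X^\top X)^{-1} v > 0$$

because $X^\top X$ and its inverse are positive definite. Hence the matrix in square brackets is positive definite, and it remains positive definite when it is pre- and post-multiplied by the symmetric invertible matrix $(X^\top X + \lambda I)^{-1}$.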
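The comparison can be verified numerically. In the sketch below, the design matrix, the error variance `sigma2` and the penalty `lam` are illustrative assumptions. It computes the two covariance matrices, checks that all eigenvalues of their difference are positive, and compares the difference with the closed form $\sigma^2 (X^\top X + \lambda I)^{-1} [2\lambda I + \lambda^2 (X^\top X)^{-1}] (X^\top X + \lambda I)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 30, 4
X = rng.normal(size=(N, K))
sigma2 = 1.5   # hypothetical error variance
lam = 3.0      # hypothetical penalty

A = X.T @ X
A_inv = np.linalg.inv(A)
M = np.linalg.inv(A + lam * np.eye(K))

var_ols = sigma2 * A_inv
var_ridge = sigma2 * (M @ A @ M)

diff = var_ols - var_ridge
# Symmetrize before eigvalsh to remove rounding asymmetry
eigenvalues = np.linalg.eigvalsh((diff + diff.T) / 2)

# Closed form of the difference
closed = sigma2 * (M @ (2 * lam * np.eye(K) + lam**2 * A_inv) @ M)

print("eigenvalues of the difference:", eigenvalues)
print("matches closed form:", np.allclose(diff, closed))
```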

## Mean squared error

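The mean squared error (MSE) of the ridge estimator is equal to the trace of its covariance matrix plus the squared norm of its bias (the so-called bias-variance decomposition):

$$\mathrm{MSE}(\hat\beta_\lambda) = \mathrm{E}\left[ \lVert \hat\beta_\lambda - \beta \rVert^2 \mid X \right] = \mathrm{tr}\left( \mathrm{Var}[\hat\beta_\lambda \mid X] \right) + \lVert \mathrm{E}[\hat\beta_\lambda \mid X] - \beta \rVert^2.$$

The OLS estimator has zero bias, so its MSE is

$$\mathrm{MSE}(\hat\beta) = \mathrm{tr}\left( \mathrm{Var}[\hat\beta \mid X] \right).$$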

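The difference between the two MSEs is

$$\mathrm{MSE}(\hat\beta) - \mathrm{MSE}(\hat\beta_\lambda) = \underbrace{\mathrm{tr}\left( \mathrm{Var}[\hat\beta \mid X] - \mathrm{Var}[\hat\beta_\lambda \mid X] \right)}_{A} - \underbrace{\lVert \mathrm{E}[\hat\beta_\lambda \mid X] - \beta \rVert^2}_{B},$$

where we have used the fact that the sum of the traces of two matrices is equal to the trace of their sum.

We have a difference between two terms ($A$ and $B$). We have already proved that the matrix

$$\mathrm{Var}[\hat\beta \mid X] - \mathrm{Var}[\hat\beta_\lambda \mid X]$$

is positive definite. As a consequence, its trace (term $A$) is strictly positive.

The square of the bias (term $B$) is also strictly positive when $\lambda > 0$ and $\beta \neq 0$. Therefore, the difference between $A$ and $B$ could in principle be either positive or negative.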

It is possible to prove (see Theobald 1974 and Farebrother 1976) that whether the difference is positive or negative depends on the penalty parameter $\lambda$, and it is always possible to find a value for $\lambda$ such that the difference is positive.

Thus, there always exists a value of the penalty parameter such that the ridge estimator has lower mean squared error than the OLS estimator.

This result is very important from both a practical and a theoretical standpoint. Although, by the Gauss-Markov theorem, the OLS estimator has the lowest variance (and the lowest MSE) among the linear estimators that are unbiased, there exists a biased estimator (a ridge estimator) whose MSE is lower than that of OLS.
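To make the trade-off concrete, the following sketch evaluates the MSE formula on a grid of penalty values. It is an illustration under strong assumptions: the true coefficients `beta` and the error variance `sigma2` are treated as known (in practice they are not), and the helper `mse_ridge`, the data and the grid are hypothetical.

```python
import numpy as np

def mse_ridge(X, beta, sigma2, lam):
    """Trace of the covariance matrix plus squared norm of the bias."""
    K = X.shape[1]
    A = X.T @ X
    M = np.linalg.inv(A + lam * np.eye(K))
    variance_term = sigma2 * np.trace(M @ A @ M)
    bias = -lam * (M @ beta)
    return variance_term + bias @ bias

rng = np.random.default_rng(4)
N, K = 30, 3
X = rng.normal(size=(N, K))
beta = np.array([1.0, -2.0, 0.5])   # hypothetical true coefficients
sigma2 = 4.0                        # hypothetical error variance

mse_ols = mse_ridge(X, beta, sigma2, 0.0)   # lam = 0 gives the OLS estimator
grid = np.linspace(0.0, 20.0, 201)
mse_grid = np.array([mse_ridge(X, beta, sigma2, lam) for lam in grid])
best_lam = grid[mse_grid.argmin()]

print("MSE of OLS:", mse_ols)
print("lowest ridge MSE on the grid:", mse_grid.min(), "at lambda =", best_lam)
```

Because the calculation requires the unknown $\beta$ and $\sigma^2$, it cannot be used directly to pick the penalty with real data.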

## How to choose the penalty parameter

We have just seen that there exists a $\lambda$ such that the ridge estimator is better (in the MSE sense) than the OLS one.

The question is: how do we find the optimal $\lambda$?

The most common way to find the best $\lambda$ is by so-called leave-one-out cross-validation:

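1. we choose a grid of $P$ possible values $\lambda_1, \ldots, \lambda_P$ for the penalty parameter;

2. for $i = 1, \ldots, N$, we exclude the $i$-th observation $(y_i, x_i)$ from the sample and we:

   1. use the remaining $N-1$ observations to compute $P$ ridge estimates of $\beta$, denoted by $\hat\beta_{\lambda_p, -i}$, where the subscripts indicate that the penalty parameter is set equal to $\lambda_p$ ($p = 1, \ldots, P$) and the $i$-th observation has been excluded;

   2. compute $P$ out-of-sample predictions of the excluded observation

      $$\hat y_{\lambda_p, i} = x_i \hat\beta_{\lambda_p, -i}$$

      for $p = 1, \ldots, P$.

3. we compute the MSE of the predictions

   $$\mathrm{MSE}_{\lambda_p} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_{\lambda_p, i})^2$$

   for $p = 1, \ldots, P$.

4. we choose as the optimal penalty parameter $\lambda^*$ the one that minimizes the MSE of the predictions:

   $$\lambda^* = \arg\min_{\lambda_p} \mathrm{MSE}_{\lambda_p}.$$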

In other words, we set $\lambda$ equal to the value that generates the lowest MSE in the leave-one-out cross-validation exercise.
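The leave-one-out procedure can be sketched in a few lines of NumPy. This is a direct, unoptimized implementation written for illustration; the function names `ridge_estimator` and `loocv_mse`, the simulated data and the grid of penalties are assumptions, not part of the lecture.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

def loocv_mse(X, y, grid):
    """Leave-one-out MSE of the ridge predictions for each penalty in grid."""
    N = X.shape[0]
    sq_errors = np.zeros((len(grid), N))
    for i in range(N):
        keep = np.arange(N) != i                  # exclude the i-th observation
        for p, lam in enumerate(grid):
            b = ridge_estimator(X[keep], y[keep], lam)
            sq_errors[p, i] = (y[i] - X[i] @ b) ** 2   # out-of-sample error
    return sq_errors.mean(axis=1)

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(5)
N, K = 40, 5
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=2.0, size=N)

grid = np.array([0.01, 0.1, 1.0, 10.0, 100.0])    # hypothetical grid of penalties
mse = loocv_mse(X, y, grid)
lam_star = grid[mse.argmin()]

print("cross-validated MSE for each penalty:", mse)
print("selected penalty:", lam_star)
```

The double loop solves one linear system per observation and per penalty value, which is fine for small samples; faster approaches exist but are not needed to illustrate the steps.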

### Python example

Our lecture on the choice of regularization parameters provides an example (with Python code) of how to choose $\lambda$ by using a cross-validation method called hold-out cross-validation.

## The ridge estimator is not scale invariant

A nice property of the OLS estimator is that it is scale invariant: if we post-multiply the design matrix by an invertible matrix $R$, then the OLS estimate we obtain is equal to the previous estimate multiplied by $R^{-1}$.

For example, if we multiply a regressor by 2, then the OLS estimate of the coefficient of that regressor is divided by 2.

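In more formal terms, consider the OLS estimate

$$\hat\beta = (X^\top X)^{-1} X^\top y$$

and the rescaled design matrix

$$\tilde X = X R.$$

The OLS estimate associated to the new design matrix is

$$\tilde\beta = (\tilde X^\top \tilde X)^{-1} \tilde X^\top y = (R^\top X^\top X R)^{-1} R^\top X^\top y = R^{-1} (X^\top X)^{-1} (R^\top)^{-1} R^\top X^\top y = R^{-1} \hat\beta.$$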

Thus, no matter how we rescale the regressors, we always obtain the same fitted values, because $\tilde X \tilde\beta = X R R^{-1} \hat\beta = X \hat\beta$.

This is a nice property of the OLS estimator that is unfortunately not possessed by the ridge estimator.

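Consider the estimate

$$\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y.$$

Then, the ridge estimate associated to the rescaled matrix $\tilde X = X R$ is

$$\tilde\beta_\lambda = (R^\top X^\top X R + \lambda I)^{-1} R^\top X^\top y = \left( R^\top \left[ X^\top X + \lambda (R^\top)^{-1} R^{-1} \right] R \right)^{-1} R^\top X^\top y = R^{-1} \left( X^\top X + \lambda (R^\top)^{-1} R^{-1} \right)^{-1} X^\top y,$$

which is equal to $R^{-1} \hat\beta_\lambda$ only if

$$(R^\top)^{-1} R^{-1} = I,$$

that is, only if

$$R^\top = R^{-1}.$$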

In other words, the ridge estimator is scale-invariant only in the special case in which the scale matrix $R$ is orthogonal (that is, $R^\top R = I$, so that its columns are orthonormal).
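The contrast between the two estimators can be checked numerically. In the sketch below, the helper names `ols` and `ridge`, the simulated data, the penalty and the two transformation matrices (a rescaling that doubles the first regressor and a rotation, which is orthogonal) are illustrative assumptions.

```python
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(6)
N, K = 50, 2
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=N)
lam = 10.0

R_scale = np.diag([2.0, 1.0])                         # doubles the first regressor
theta = 0.7
R_rot = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # orthogonal matrix

# Invariance means: estimate on X R equals R^{-1} times the estimate on X
ols_invariant = np.allclose(
    ols(X @ R_scale, y), np.linalg.solve(R_scale, ols(X, y)))
ridge_invariant_scale = np.allclose(
    ridge(X @ R_scale, y, lam), np.linalg.solve(R_scale, ridge(X, y, lam)))
ridge_invariant_rot = np.allclose(
    ridge(X @ R_rot, y, lam), np.linalg.solve(R_rot, ridge(X, y, lam)))

print("OLS invariant to rescaling:  ", ols_invariant)
print("ridge invariant to rescaling:", ridge_invariant_scale)
print("ridge invariant to rotation: ", ridge_invariant_rot)
```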

## Always use standardized variables

The general absence of scale-invariance implies that any choice we make about the scaling of variables (e.g., expressing a regressor in centimeters vs meters or thousands vs millions of dollars) affects the coefficient estimates.

Since this is highly undesirable, what we usually do is to standardize all the variables in our regression, that is, we subtract from each variable its mean and we divide it by its standard deviation. By doing so, the coefficient estimates are not affected by arbitrary choices of the scaling of variables.
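A minimal sketch of this practice follows. The helper names `standardize` and `ridge`, the simulated variables and the penalty are assumptions made for illustration. The same regressor is expressed in meters and in centimeters, and after standardization the two ridge estimates coincide.

```python
import numpy as np

def standardize(Z):
    """Subtract the mean and divide by the standard deviation, column by column."""
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(7)
N = 60
height_m = rng.normal(1.7, 0.1, size=N)     # hypothetical regressor, in meters
income = rng.normal(50.0, 10.0, size=N)     # hypothetical second regressor
y = 3.0 * height_m + 0.05 * income + rng.normal(size=N)

X_m = np.column_stack([height_m, income])
X_cm = np.column_stack([100.0 * height_m, income])   # same data, in centimeters

lam = 5.0
y_std = standardize(y)
b_m = ridge(standardize(X_m), y_std, lam)
b_cm = ridge(standardize(X_cm), y_std, lam)

print("estimates with height in meters:     ", b_m)
print("estimates with height in centimeters:", b_cm)
```

Standardizing removes the unit of measurement from each column, so the penalty acts on coefficients that are directly comparable across regressors.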

## References

Farebrother, R. W. (1976) "Further results on the mean square error of ridge regression", Journal of the Royal Statistical Society, Series B (Methodological), 38, 248-250.

Theobald, C. M. (1974) "Generalizations of mean square error applied to ridge regression", Journal of the Royal Statistical Society, Series B (Methodological), 36, 103-106.