
Ridge regression

by Marco Taboga, PhD

Ridge regression is a term used to refer to a linear regression model whose coefficients are estimated not by ordinary least squares (OLS), but by an estimator, called the ridge estimator, that is biased but has lower variance than the OLS estimator.

In certain cases, the mean squared error of the ridge estimator (which is the sum of its variance and the square of its bias) is smaller than that of the OLS estimator.


Linear regression

Ridge estimation is carried out on the linear regression model
$$y = X\beta + \varepsilon$$
where $y$ is the $N\times 1$ vector of observations of the dependent variable, $X$ is the $N\times K$ matrix of regressors (the design matrix), $\beta$ is the $K\times 1$ vector of regression coefficients, and $\varepsilon$ is the $N\times 1$ vector of errors.

Ridge estimator

Remember that the OLS estimator $\widehat{\beta}$ solves the minimization problem
$$\widehat{\beta} = \arg\min_{b} \sum_{i=1}^{N} \left(y_i - x_i b\right)^2$$
where $x_i$ is the $i$-th row of $X$ and $b$ and $\widehat{\beta}$ are $K\times 1$ column vectors.

When $X$ has full rank, the solution to the OLS problem is
$$\widehat{\beta} = \left(X^{\top} X\right)^{-1} X^{\top} y$$

The ridge estimator $\widehat{\beta}_{\lambda}$ solves the slightly modified minimization problem
$$\widehat{\beta}_{\lambda} = \arg\min_{b} \left[ \sum_{i=1}^{N} \left(y_i - x_i b\right)^2 + \lambda \sum_{k=1}^{K} b_k^2 \right]$$
where $\lambda$ is a positive constant.

Thus, in ridge estimation we add a penalty to the least squares criterion: we minimize the sum of squared residuals
$$\sum_{i=1}^{N} \left(y_i - x_i b\right)^2$$
plus the squared norm of the vector of coefficients, multiplied by the penalty parameter:
$$\lambda \left\Vert b\right\Vert^2 = \lambda \sum_{k=1}^{K} b_k^2$$

The ridge problem penalizes large regression coefficients, and the larger the parameter $\lambda$ is, the larger the penalty.

We will discuss below how to choose the penalty parameter $\lambda$.

The solution to the minimization problem is
$$\widehat{\beta}_{\lambda} = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y$$
where $I$ is the $K\times K$ identity matrix.

Proof

The objective function to minimize can be written in matrix form as follows:
$$\mathrm{RIDGE}(b) = \left(y - Xb\right)^{\top}\left(y - Xb\right) + \lambda b^{\top} b$$
The first order condition for a minimum is that the gradient of $\mathrm{RIDGE}$ with respect to $b$ should be equal to zero:
$$\nabla_{b}\,\mathrm{RIDGE}(b) = -2X^{\top}\left(y - Xb\right) + 2\lambda b = 0$$
that is,
$$-X^{\top} y + X^{\top} X b + \lambda b = 0$$
or
$$\left(X^{\top} X + \lambda I\right) b = X^{\top} y$$
The matrix
$$X^{\top} X + \lambda I$$
is positive definite for any $\lambda > 0$ because, for any $K\times 1$ vector $a \neq 0$, we have
$$a^{\top}\left(X^{\top} X + \lambda I\right) a = \left(Xa\right)^{\top}\left(Xa\right) + \lambda a^{\top} a = \sum_{i=1}^{N}\left(x_i a\right)^2 + \lambda \sum_{k=1}^{K} a_k^2 > 0$$
where the last inequality follows from the fact that even if $x_i a$ is equal to $0$ for every $i$, $a_k^2$ is strictly positive for at least one $k$. Therefore, the matrix has full rank and it is invertible. As a consequence, the first order condition is satisfied by
$$\widehat{\beta}_{\lambda} = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y$$
We now need to check that this is indeed a global minimum. Note that the Hessian matrix, that is, the matrix of second derivatives of $\mathrm{RIDGE}$, is
$$\nabla^2_{b}\,\mathrm{RIDGE}(b) = 2\left(X^{\top} X + \lambda I\right)$$
Thus, the Hessian is positive definite (it is a positive multiple of a matrix that we have just proved to be positive definite). Hence, $\mathrm{RIDGE}$ is strictly convex in $b$, which implies that $\widehat{\beta}_{\lambda}$ is a global minimum.

If you read the proof above, you will notice that, unlike in OLS estimation, we do not need to assume that the design matrix $X$ is full-rank. Therefore, the ridge estimator exists even when $X$ does not have full rank.
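To make the formula concrete, here is a minimal NumPy sketch (the data, the function name and the numbers are made up for illustration, not part of the lecture) that computes the closed-form ridge solution and shows that it remains well defined even when the design matrix is rank-deficient:

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Compute (X'X + lam*I)^{-1} X'y by solving a linear system."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# Example with a rank-deficient design matrix: the third column duplicates
# the first, so X'X is singular and the OLS estimator does not exist,
# but the ridge estimator is still well defined for lam > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.column_stack([X, X[:, 0]])          # rank-deficient design
y = X[:, :2] @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=50)

print(ridge_estimator(X, y, lam=1.0))
```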

The ridge estimator is biased, but it has a lower variance than the OLS estimator.

Bias and variance of the ridge estimator

In this section we derive the bias and variance of the ridge estimator under the commonly made assumption (e.g., in the normal linear regression model) that, conditional on $X$, the errors of the regression have zero mean and constant variance $\sigma^2$ and are uncorrelated:
$$E\left[\varepsilon \mid X\right] = 0, \qquad \operatorname{Var}\left[\varepsilon \mid X\right] = \sigma^2 I$$
where $\sigma^2$ is a positive constant and $I$ is the $N\times N$ identity matrix.

Bias

The conditional expected value of the ridge estimator $\widehat{\beta}_{\lambda}$ is
$$E\left[\widehat{\beta}_{\lambda} \mid X\right] = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \beta$$
which is different from $\beta$ unless $\lambda = 0$ (the OLS case).

The bias of the estimator is
$$\operatorname{Bias}\left[\widehat{\beta}_{\lambda} \mid X\right] = E\left[\widehat{\beta}_{\lambda} \mid X\right] - \beta = \left[\left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X - I\right]\beta$$

Proof

We can write the ridge estimator as
$$\widehat{\beta}_{\lambda} = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top}\left(X\beta + \varepsilon\right)$$
Therefore,
$$E\left[\widehat{\beta}_{\lambda} \mid X\right] = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \beta + \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} E\left[\varepsilon \mid X\right] = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \beta$$
The ridge estimator is unbiased, that is,
$$E\left[\widehat{\beta}_{\lambda} \mid X\right] = \beta$$
if and only if
$$\left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X = I$$
where $I$ is the $K\times K$ identity matrix. But this is possible only if $\lambda = 0$, that is, if the ridge estimator coincides with the OLS estimator. The bias is
$$\operatorname{Bias}\left[\widehat{\beta}_{\lambda} \mid X\right] = E\left[\widehat{\beta}_{\lambda} \mid X\right] - \beta = \left[\left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X - I\right]\beta$$
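As a sanity check, the following Monte Carlo sketch (illustrative data; the true $\beta$, $\sigma$, penalty and seed are arbitrary assumptions made for the example) compares the formula for the conditional mean of the ridge estimator with the average of many simulated estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, lam, sigma = 200, 3, 5.0, 1.0
X = rng.normal(size=(N, K))
beta = np.array([1.0, -1.0, 0.5])
XtX = X.T @ X

# Theoretical conditional mean: (X'X + lam*I)^{-1} X'X beta
A = np.linalg.solve(XtX + lam * np.eye(K), XtX)
theoretical_mean = A @ beta

# Monte Carlo average of ridge estimates over many simulated samples
draws = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=N)
    draws.append(np.linalg.solve(XtX + lam * np.eye(K), X.T @ y))
simulated_mean = np.mean(draws, axis=0)

print(theoretical_mean)   # differs from beta: the estimator is biased
print(simulated_mean)     # close to the theoretical conditional mean
```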

Variance

The covariance matrix of the ridge estimator is
$$\operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right] = \sigma^2 \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \left(X^{\top} X + \lambda I\right)^{-1}$$

Proof

Remember that the OLS estimator $\widehat{\beta}$ has conditional variance
$$\operatorname{Var}\left[\widehat{\beta} \mid X\right] = \sigma^2 \left(X^{\top} X\right)^{-1}$$
We can write the ridge estimator as a function of the OLS estimator:
$$\widehat{\beta}_{\lambda} = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \widehat{\beta}$$
Therefore,
$$\operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right] = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \operatorname{Var}\left[\widehat{\beta} \mid X\right] X^{\top} X \left(X^{\top} X + \lambda I\right)^{-1} = \sigma^2 \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} X \left(X^{\top} X + \lambda I\right)^{-1}$$

Importantly, the variance of the ridge estimator is always smaller than the variance of the OLS estimator.

More precisely, the difference between the covariance matrix of the OLS estimator and that of the ridge estimator
$$\operatorname{Var}\left[\widehat{\beta} \mid X\right] - \operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right]$$
is positive definite (remember from the lecture on the Gauss-Markov theorem that the covariance matrices of two estimators are compared by checking whether their difference is positive definite).

Proof

In order to make a comparison, the OLS estimator must exist. As a consequence, $X$ must be full-rank. With this assumption in place, the conditional variance of the OLS estimator is
$$\operatorname{Var}\left[\widehat{\beta} \mid X\right] = \sigma^2 \left(X^{\top} X\right)^{-1}$$
Now, define the matrix
$$\Delta = I + \lambda \left(X^{\top} X\right)^{-1}$$
which is invertible. Since $X^{\top} X + \lambda I = \Delta X^{\top} X = X^{\top} X \Delta$, we can rewrite the covariance matrix of the ridge estimator as follows:
$$\operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right] = \sigma^2 \Delta^{-1}\left(X^{\top} X\right)^{-1}\Delta^{-1}$$
The difference between the two covariance matrices is
$$\operatorname{Var}\left[\widehat{\beta} \mid X\right] - \operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right] = \sigma^2 \Delta^{-1}\left[\Delta\left(X^{\top} X\right)^{-1}\Delta - \left(X^{\top} X\right)^{-1}\right]\Delta^{-1} = \sigma^2 \Delta^{-1}\left[2\lambda\left(X^{\top} X\right)^{-2} + \lambda^2\left(X^{\top} X\right)^{-3}\right]\Delta^{-1}$$
If $\lambda > 0$, the latter matrix is positive definite because for any $v \neq 0$, we have
$$w = \Delta^{-1} v \neq 0$$
and
$$v^{\top}\left(\operatorname{Var}\left[\widehat{\beta} \mid X\right] - \operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right]\right) v = \sigma^2\left(2\lambda\, w^{\top}\left(X^{\top} X\right)^{-2} w + \lambda^2\, w^{\top}\left(X^{\top} X\right)^{-3} w\right) > 0$$
because $X^{\top} X$ and its inverse (and hence their powers) are positive definite.
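The positive definiteness of the difference can also be checked numerically. In the sketch below (made-up design matrix, arbitrary values for $\sigma^2$ and $\lambda$), all the eigenvalues of the difference between the two covariance matrices turn out to be strictly positive:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, lam, sigma2 = 100, 4, 2.0, 1.5
X = rng.normal(size=(N, K))

XtX = X.T @ X
var_ols = sigma2 * np.linalg.inv(XtX)                  # sigma^2 (X'X)^{-1}
M = np.linalg.inv(XtX + lam * np.eye(K))
var_ridge = sigma2 * M @ XtX @ M                        # ridge covariance matrix

# All eigenvalues of the (symmetric) difference should be strictly positive.
eigenvalues = np.linalg.eigvalsh(var_ols - var_ridge)
print(eigenvalues)
```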

Mean squared error

The mean squared error (MSE) of the ridge estimator is equal to the trace of its covariance matrix plus the squared norm of its bias (the so-called bias-variance decomposition):
$$\operatorname{MSE}\left[\widehat{\beta}_{\lambda}\right] = \operatorname{tr}\left(\operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right]\right) + \left\Vert \operatorname{Bias}\left[\widehat{\beta}_{\lambda} \mid X\right]\right\Vert^2$$

The OLS estimator has zero bias, so its MSE is
$$\operatorname{MSE}\left[\widehat{\beta}\right] = \operatorname{tr}\left(\operatorname{Var}\left[\widehat{\beta} \mid X\right]\right) = \sigma^2 \operatorname{tr}\left(\left(X^{\top} X\right)^{-1}\right)$$

The difference between the two MSEs is
$$\operatorname{MSE}\left[\widehat{\beta}\right] - \operatorname{MSE}\left[\widehat{\beta}_{\lambda}\right] = \operatorname{tr}\left(\operatorname{Var}\left[\widehat{\beta} \mid X\right] - \operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right]\right) - \left\Vert \operatorname{Bias}\left[\widehat{\beta}_{\lambda} \mid X\right]\right\Vert^2$$
where we have used the linearity of the trace (the difference between the traces of two matrices is equal to the trace of their difference).

We have a difference between two terms: the trace and the squared norm of the bias. We have already proved that the matrix
$$\operatorname{Var}\left[\widehat{\beta} \mid X\right] - \operatorname{Var}\left[\widehat{\beta}_{\lambda} \mid X\right]$$
is positive definite. As a consequence, its trace (the first term) is strictly positive.

The squared norm of the bias (the second term) is also strictly positive. Therefore, the difference between the two terms could in principle be either positive or negative.

It is possible to prove (see Theobald 1974 and Farebrother 1976) that whether the difference is positive or negative depends on the penalty parameter $\lambda$, and it is always possible to find a value for $\lambda$ such that the difference is positive.

Thus, there always exists a value of the penalty parameter such that the ridge estimator has lower mean squared error than the OLS estimator.

This result is very important from both a practical and a theoretical standpoint. Although, by the Gauss-Markov theorem, the OLS estimator has the lowest variance (and the lowest MSE) among the linear estimators that are unbiased, there exists a biased estimator (a ridge estimator) whose MSE is lower than that of the OLS estimator.
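As an illustration, the following sketch (with made-up data and a known $\beta$ and $\sigma^2$, which are assumptions for the example) evaluates the bias-variance decomposition of the ridge MSE on a small grid of penalties and compares it with the OLS MSE:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, sigma2 = 100, 4, 1.0
X = rng.normal(size=(N, K))
beta = np.array([1.0, -0.5, 0.0, 2.0])
XtX = X.T @ X

def ridge_mse(lam):
    M = np.linalg.inv(XtX + lam * np.eye(K))
    variance = sigma2 * np.trace(M @ XtX @ M)   # trace of the ridge covariance matrix
    bias = M @ XtX @ beta - beta                # conditional bias
    return variance + bias @ bias               # bias-variance decomposition

print("OLS MSE:", sigma2 * np.trace(np.linalg.inv(XtX)))
for lam in [0.01, 0.1, 1.0, 10.0]:
    print("lambda =", lam, "ridge MSE:", ridge_mse(lam))
# For small positive penalties the ridge MSE typically drops below the OLS MSE,
# in line with the Theobald/Farebrother result cited above.
```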

How to choose the penalty parameter

We have just proved that there exists a $\lambda$ such that the ridge estimator is better (in the MSE sense) than the OLS one.

The question is: how do we find the optimal $\lambda$?

The most common way to find the best $\lambda$ is by so-called leave-one-out cross-validation:

  1. we choose a grid of $P$ possible values $\lambda_1, \ldots, \lambda_P$ for the penalty parameter;

  2. for $i=1,\ldots,N$, we exclude the $i$-th observation $(y_i, x_i)$ from the sample and we:

    1. use the remaining $N-1$ observations to compute $P$ ridge estimates of $\beta$, denoted by $\widehat{\beta}_{\lambda_1,i}, \ldots, \widehat{\beta}_{\lambda_P,i}$, where the subscripts $\lambda_p, i$ indicate that the penalty parameter is set equal to $\lambda_p$ ($p=1,\ldots,P$) and the $i$-th observation has been excluded;

    2. compute $P$ out-of-sample predictions of the excluded observation $\widehat{y}_{\lambda_p,i} = x_i \widehat{\beta}_{\lambda_p,i}$ for $p=1,\ldots,P$.

  3. we compute the MSE of the predictions $\operatorname{MSE}(\lambda_p) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \widehat{y}_{\lambda_p,i}\right)^2$ for $p=1,\ldots,P$.

  4. we choose as the optimal penalty parameter $\lambda^{\ast}$ the one that minimizes the MSE of the predictions: $\lambda^{\ast} = \arg\min_{\lambda \in \{\lambda_1,\ldots,\lambda_P\}} \operatorname{MSE}(\lambda)$

In other words, we set $\lambda$ equal to the value that generates the lowest MSE in the leave-one-out cross-validation exercise.
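A plain NumPy sketch of this procedure might look as follows (the grid of penalties and the simulated data are arbitrary choices made for the example, not part of the lecture):

```python
import numpy as np

def loocv_ridge(X, y, lambda_grid):
    """Leave-one-out cross-validation for the ridge penalty parameter."""
    N, K = X.shape
    mse = np.zeros(len(lambda_grid))
    for p, lam in enumerate(lambda_grid):
        errors = np.zeros(N)
        for i in range(N):
            mask = np.arange(N) != i                 # exclude observation i
            X_i, y_i = X[mask], y[mask]
            b = np.linalg.solve(X_i.T @ X_i + lam * np.eye(K), X_i.T @ y_i)
            errors[i] = y[i] - X[i] @ b              # out-of-sample prediction error
        mse[p] = np.mean(errors ** 2)
    return lambda_grid[np.argmin(mse)], mse

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 0.25]) + rng.normal(size=60)
best_lambda, mse = loocv_ridge(X, y, np.array([0.01, 0.1, 1.0, 10.0]))
print(best_lambda, mse)
```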

Python example

Our lecture on the choice of regularization parameters provides an example (with Python code) of how to choose $\lambda$ by using a cross-validation method called hold-out cross-validation.

The ridge estimator is not scale invariant

A nice property of the OLS estimator is that it is scale invariant: if we post-multiply the design matrix by an invertible matrix $R$, then the OLS estimate we obtain is equal to the previous estimate multiplied by $R^{-1}$.

For example, if we multiply a regressor by 2, then the OLS estimate of the coefficient of that regressor is divided by 2.

In more formal terms, consider the OLS estimate
$$\widehat{\beta} = \left(X^{\top} X\right)^{-1} X^{\top} y$$
and the rescaled design matrix
$$\widetilde{X} = XR$$

The OLS estimate associated with the new design matrix is
$$\widetilde{\beta} = \left(\widetilde{X}^{\top}\widetilde{X}\right)^{-1}\widetilde{X}^{\top} y = \left(R^{\top} X^{\top} X R\right)^{-1} R^{\top} X^{\top} y = R^{-1}\left(X^{\top} X\right)^{-1}\left(R^{\top}\right)^{-1} R^{\top} X^{\top} y = R^{-1}\widehat{\beta}$$

Thus, no matter how we rescale the regressors, the estimated coefficients simply rescale accordingly and the fitted values remain the same.

This is a nice property of the OLS estimator that is unfortunately not possessed by the ridge estimator.

Consider the ridge estimate
$$\widehat{\beta}_{\lambda} = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y$$

Then, the ridge estimate associated with the rescaled matrix $\widetilde{X} = XR$ is
$$\widetilde{\beta}_{\lambda} = \left(R^{\top} X^{\top} X R + \lambda I\right)^{-1} R^{\top} X^{\top} y$$
which is equal to $R^{-1}\widehat{\beta}_{\lambda}$ only if
$$\left(R^{\top} X^{\top} X R + \lambda I\right)^{-1} R^{\top} X^{\top} y = R^{-1}\left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y$$
that is, only if
$$R^{\top} R = I$$

In other words, the ridge estimator is scale-invariant only in the special case in which the scale matrix $R$ is orthonormal.
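The lack of scale invariance is easy to verify numerically. In the sketch below (made-up data; the matrix $R$ simply rescales the first regressor by 2), the ridge estimate obtained from the rescaled design matrix differs from $R^{-1}$ times the original estimate:

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=80)
lam = 1.0

R = np.diag([2.0, 1.0, 1.0])        # multiply the first regressor by 2
XR = X @ R

b = ridge(X, y, lam)
b_rescaled = ridge(XR, y, lam)
print(np.linalg.inv(R) @ b)          # what scale invariance would predict
print(b_rescaled)                    # differs: ridge is not scale invariant
```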

Always use standardized variables

The general absence of scale-invariance implies that any choice we make about the scaling of variables (e.g., expressing a regressor in centimeters vs meters or thousands vs millions of dollars) affects the coefficient estimates.

Since this is highly undesirable, what we usually do is to standardize all the variables in our regression, that is, we subtract from each variable its mean and we divide it by its standard deviation. By doing so, the coefficient estimates are not affected by arbitrary choices of the scaling of variables.

Do not forget to standardize the regressors before running a ridge regression.
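As a minimal sketch of this recommendation (the data and the value of $\lambda$ are made up; centering $y$ is an extra assumption used here to dispense with the intercept), one can standardize the columns of $X$ before computing the ridge estimator:

```python
import numpy as np

def standardize_and_ridge(X, y, lam):
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit standard deviation
    y_c = y - y.mean()                               # centering removes the intercept
    K = X_std.shape[1]
    return np.linalg.solve(X_std.T @ X_std + lam * np.eye(K), X_std.T @ y_c)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])   # very different scales
y = X @ np.array([1.0, 0.02, 50.0]) + rng.normal(size=100)
print(standardize_and_ridge(X, y, lam=1.0))
```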

References

Farebrother, R. W. (1976) "Further results on the mean square error of ridge regression", Journal of the Royal Statistical Society, Series B (Methodological), 38, 248-250.

Theobald, C. M. (1974) "Generalizations of mean square error applied to ridge regression", Journal of the Royal Statistical Society, Series B (Methodological), 36, 103-106.

How to cite

Please cite as:

Taboga, Marco (2021). "Ridge regression", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/ridge-regression.
