
Linear regression - Model selection criteria

by Marco Taboga, PhD

How do we choose among different linear regression models? How do we decide whether to use a more parsimonious model or one that includes several regressors?

This kind of choice is often performed by using so-called information criteria, which we briefly discuss in this lecture.

Information criteria

Information criteria are used to attribute scores to different regression models.

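A score is decreasing in the fit of the model (the better the model fits the data, the lower the score) and increasing in the complexity of the model (the more regressors it includes, the higher the score).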

The best model is the one with the lowest score.


Generating a trade-off between fit and complexity discourages overfitting, that is, the tendency of complex models to fit the sample data very well and make poor predictions out of sample.


In what follows, $N$ is the sample size, $K$ is the number of regressors and $SSR$ is the sum of squared residuals:

$$SSR = \sum_{i=1}^{N} \left( y_i - x_i \widehat{\beta} \right)^2$$

where $y_i$ is the dependent variable, $x_i$ is the $1\times K$ vector of regressors, and $\widehat{\beta}$ is the OLS estimate of the $K\times 1$ vector of regression coefficients.

The sum of squared residuals

The product $x_i \widehat{\beta}$ is the prediction of $y_i$ and the difference $y_i - x_i \widehat{\beta}$ is the prediction error or residual.

By squaring the residuals and summing them up, we obtain the sum of squared residuals $SSR$.

The larger $SSR$ is, the worse the fit of the model.
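As a minimal sketch of the definition, the Python snippet below computes $SSR$ for a small data set. The data and the function name `sum_of_squared_residuals` are made up for illustration. The first coefficient vector is the OLS estimate for these four points, so any other vector, such as the second one, can only give a larger $SSR$.

```python
def sum_of_squared_residuals(y, X, beta):
    """Return SSR = sum over i of (y_i - x_i * beta)^2."""
    ssr = 0.0
    for y_i, x_i in zip(y, X):
        # x_i * beta is the prediction of y_i
        prediction = sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        residual = y_i - prediction
        ssr += residual ** 2
    return ssr


# Hypothetical toy data: N = 4 observations, K = 2 regressors
# (a constant and one explanatory variable).
y = [1.0, 2.0, 2.5, 4.0]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]

beta_ols = [0.95, 0.95]   # OLS estimate for this toy data set
beta_other = [1.0, 1.0]   # an arbitrary alternative, for comparison

ssr_ols = sum_of_squared_residuals(y, X, beta_ols)
ssr_other = sum_of_squared_residuals(y, X, beta_other)
print(ssr_ols, ssr_other)
```

The OLS coefficients give an $SSR$ of about 0.175, against 0.25 for the alternative vector.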

Popular information criteria

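We now list some popular information criteria:

Akaike Information Criterion (AIC): $$AIC = N\ln\left(\frac{SSR}{N}\right) + 2K$$

Hannan-Quinn Information Criterion (HQC): $$HQC = N\ln\left(\frac{SSR}{N}\right) + 2K\ln(\ln N)$$

Bayesian Information Criterion (BIC): $$BIC = N\ln\left(\frac{SSR}{N}\right) + K\ln N$$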

How the criteria work

All of the criteria are increasing in $SSR$: the larger $SSR$, the higher the score.

They are also increasing in $K$: the larger the number of parameters (and the more complex the model), the higher the score.

However, while an increase in $SSR$ always has the same effect on the score, an increase in $K$ has different effects, depending on the criterion.

The criteria are ordered based on the strength of the penalty for model complexity: the AIC imposes the mildest penalty, while the BIC has the strongest one.
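The difference in penalties can be checked numerically. The sketch below assumes the common forms $N\ln(SSR/N) + 2K$ for the AIC, $N\ln(SSR/N) + 2K\ln(\ln N)$ for the HQC and $N\ln(SSR/N) + K\ln N$ for the BIC. The function names and the helper `penalty_per_regressor` are illustrative, not part of any library.

```python
import math


def aic(ssr, n, k):
    """Akaike criterion, assumed form N*ln(SSR/N) + 2K."""
    return n * math.log(ssr / n) + 2 * k


def hqc(ssr, n, k):
    """Hannan-Quinn criterion, assumed form N*ln(SSR/N) + 2K*ln(ln N)."""
    return n * math.log(ssr / n) + 2 * k * math.log(math.log(n))


def bic(ssr, n, k):
    """Bayesian criterion, assumed form N*ln(SSR/N) + K*ln(N)."""
    return n * math.log(ssr / n) + k * math.log(n)


def penalty_per_regressor(criterion, n, ssr=1.0):
    """Increase in the score when K grows by one and SSR stays fixed."""
    return criterion(ssr, n, 2) - criterion(ssr, n, 1)


n = 20
penalties = {f.__name__: penalty_per_regressor(f, n) for f in (aic, hqc, bic)}
print(penalties)
```

With $N = 20$, one extra regressor costs 2 under the AIC, about 2.19 under the HQC and about 3.00 under the BIC. Under these forms the ordering depends on the sample size: $2\ln(\ln N)$ exceeds 2 only when $N > e^{e} \approx 15.2$.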


Given 20 observations, we estimate a regression model with 2 regressors and we obtain a sum of squared residuals equal to 10.

Then, we find a new regressor. We add it to our regression and the sum of squared residuals decreases to 9.5.

Which of the two models is better according to the Akaike Information Criterion?

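The score of the first model (2 regressors) is

$$AIC_1 = N\ln\left(\frac{SSR}{N}\right) + 2K = 20\ln\left(\frac{10}{20}\right) + 2\cdot 2 \approx -9.863$$

The score of the second model (3 regressors) is

$$AIC_2 = N\ln\left(\frac{SSR}{N}\right) + 2K = 20\ln\left(\frac{9.5}{20}\right) + 2\cdot 3 \approx -8.889$$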

The best model is the one that has the lowest score.

Therefore, the best model according to the Akaike criterion is the model with two regressors.
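The comparison can be reproduced with a few lines of Python. This is a sketch that assumes the AIC takes the form $N\ln(SSR/N) + 2K$; the function name `aic` and the variable names are illustrative. It also computes the value of $SSR$ below which the third regressor would have been worth adding.

```python
import math


def aic(ssr, n, k):
    """AIC for linear regression, assuming the form N*ln(SSR/N) + 2K."""
    return n * math.log(ssr / n) + 2 * k


n = 20
aic_two = aic(10.0, n, 2)    # model with 2 regressors, SSR = 10
aic_three = aic(9.5, n, 3)   # model with 3 regressors, SSR = 9.5
best = "two regressors" if aic_two < aic_three else "three regressors"

# The third regressor lowers the AIC only if
# N*ln(SSR/N) + 6 < N*ln(10/N) + 4, that is, SSR < 10*exp(-2/N).
break_even_ssr = 10.0 * math.exp(-2 / n)

print(round(aic_two, 3), round(aic_three, 3), best, round(break_even_ssr, 3))
```

Under this assumption, the new regressor would need to bring the sum of squared residuals below roughly 9.05 to be selected; a decrease to 9.5 is not enough.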

How the criteria are derived

The information criteria above are used not only for linear regression, but for any statistical model estimated by maximum likelihood (ML).

The general formulae involve the log-likelihood of the model, evaluated at the ML parameter estimate.

Denote the log-likelihood by $l$.

The general formulae are:

$$AIC = -2l + 2K, \qquad HQC = -2l + 2K\ln(\ln N), \qquad BIC = -2l + K\ln N$$

The formulae for linear regression (reported previously) are obtained by making the substitution

$$l = -\frac{N}{2}\ln\left(\frac{SSR}{N}\right)$$

Here is a proof that the latter is the log-likelihood of a linear regression model.


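In the normal linear regression model (a model with normally distributed errors), the log-likelihood function is

$$l = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\left(\widehat{\sigma}^2\right) - \frac{1}{2\widehat{\sigma}^2}\sum_{i=1}^{N}\left(y_i - x_i\widehat{\beta}\right)^2$$

where $\widehat{\beta}$ is the OLS estimate of the vector of regression coefficients (which coincides with the ML estimate) and

$$\widehat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - x_i\widehat{\beta}\right)^2 = \frac{SSR}{N}$$

is the ML estimate of the variance of the error terms.

By substituting the formula for $\widehat{\sigma}^2$ in the expression for the log-likelihood, we get

$$l = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\left(\frac{SSR}{N}\right) - \frac{N}{2}$$

We can add or subtract a constant to the scores provided by an information criterion without changing the ranking of the models. Therefore, we can drop the constant and write

$$l = -\frac{N}{2}\ln\left(\frac{SSR}{N}\right)$$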
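The substitution can be checked numerically. The sketch below takes a made-up set of residuals, sets the error variance to $SSR/N$, and compares the normal log-likelihood summed observation by observation with the closed form $-\frac{N}{2}\left(\ln(2\pi) + 1\right) - \frac{N}{2}\ln\left(\frac{SSR}{N}\right)$. The function names are illustrative.

```python
import math


def normal_loglik(residuals, sigma2):
    """Sum of log N(0, sigma2) densities evaluated at the residuals."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - e ** 2 / (2 * sigma2)
        for e in residuals
    )


def concentrated_loglik(ssr, n):
    """Closed form obtained after substituting sigma2 = SSR / N."""
    return -0.5 * n * (math.log(2 * math.pi) + 1) - 0.5 * n * math.log(ssr / n)


# Hypothetical residuals from a fitted regression.
residuals = [0.05, 0.10, -0.35, 0.20]
n = len(residuals)
ssr = sum(e ** 2 for e in residuals)

direct = normal_loglik(residuals, ssr / n)
closed = concentrated_loglik(ssr, n)

# The dropped constant is the same for every model fitted to the same sample.
dropped_constant = -0.5 * n * (math.log(2 * math.pi) + 1)
print(direct, closed, closed - dropped_constant)
```

The two values agree up to floating-point error, and the remainder after removing the constant equals $-\frac{N}{2}\ln\left(\frac{SSR}{N}\right)$.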

Which criterion to use

Is there a preferred criterion? For example, is Hannan-Quinn better than Akaike?

The simple answer is: no.

There are many papers that compare the various criteria. What they find is that their performance in selecting the best model is very much dependent on the specific application.

Therefore, analysts and researchers tend to use many criteria simultaneously and report all of them.

If all the criteria select the same model, then there is little room for doubt.

On the contrary, if different criteria select different models, the interpretation is that there is no clear winner. Then, the choice can be made on other grounds, for example:


An alternative to using information criteria is to check the out-of-sample predictive ability of different models.

This is usually done with cross-validation techniques (e.g., holdout, k-fold and leave-one-out).
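A minimal sketch of k-fold cross-validation, in plain Python, is given below. It compares a model with a constant only against a model with a constant and one regressor. The data, the function names and the way observations are assigned to folds (every k-th observation is held out) are assumptions made for illustration.

```python
def fit_mean(xs, ys):
    """Model with a constant only: predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Model with a constant and one regressor, estimated by OLS."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_mse(xs, ys, fit, k):
    """Average squared out-of-sample prediction error over k folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th observation is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return total / n


# Hypothetical data: a linear trend plus small fixed disturbances.
xs = [float(i) for i in range(12)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1, 0.2, -0.2]
ys = [1.0 + 0.5 * x + e for x, e in zip(xs, noise)]

mse_mean = k_fold_mse(xs, ys, fit_mean, k=4)
mse_line = k_fold_mse(xs, ys, fit_line, k=4)
loo_line = k_fold_mse(xs, ys, fit_line, k=len(xs))  # leave-one-out
print(mse_mean, mse_line, loo_line)
```

The model with the lower out-of-sample mean squared error is preferred. Setting `k` equal to the sample size gives leave-one-out cross-validation.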

How to cite

Please cite as:

Taboga, Marco (2021). "Linear regression - Model selection criteria", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
