In regression analysis, the variance inflation factor (VIF) is a measure of the degree of multicollinearity of one regressor with the other regressors.
Multicollinearity arises when a regressor is very similar to a linear combination of other regressors.
Multicollinearity has the effect of markedly increasing the variance of regression coefficient estimates. Therefore, we usually try to avoid it as much as possible.
To detect and measure multicollinearity, we use the so-called variance inflation factors.
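Consider the linear regression
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i$$
where:
$y_i$ is the dependent variable;
$x_{i1}, \ldots, x_{iK}$ are regressors;
$\beta_1, \ldots, \beta_K$ are regression coefficients;
$\varepsilon_i$ is the error term;
the observations are indexed by $i = 1, \ldots, N$.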
The linear regression can be written in matrix form as:
$$y = X\beta + \varepsilon$$
where:
$y$ and $\varepsilon$ are $N \times 1$ vectors;
$X$ is an $N \times K$ matrix;
$\beta$ is a $K \times 1$ vector.
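If the design matrix $X$ has full rank, then we can compute the ordinary least squares (OLS) estimator of the vector of regression coefficients $\beta$ as follows:
$$\widehat{\beta} = (X^\top X)^{-1} X^\top y$$
Under certain assumptions (see, e.g., the lecture on the Gauss-Markov theorem), the covariance matrix of the OLS estimator is
$$\operatorname{Var}[\widehat{\beta} \mid X] = \sigma^2 (X^\top X)^{-1}$$
where $\sigma^2$ is the variance of the error term.
Therefore, the variance of the OLS estimator of a single coefficient is
$$\operatorname{Var}[\widehat{\beta}_k \mid X] = \sigma^2 [(X^\top X)^{-1}]_{kk}$$
where $[(X^\top X)^{-1}]_{kk}$ is the $k$-th entry on the main diagonal of $(X^\top X)^{-1}$.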
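The formulas above can be sketched numerically. The snippet below is only an illustration on simulated data: the sample size, the coefficient values, the seed and the "known" error variance `sigma2` are all assumptions made for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed, for reproducibility only
N, K = 50, 3

# Simulated design matrix and dependent variable (illustrative values).
X = rng.normal(size=(N, K))
beta = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25                    # error variance, assumed known here
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# OLS estimator: (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Covariance matrix of the OLS estimator: sigma^2 (X'X)^{-1}.
cov_beta_hat = sigma2 * XtX_inv

# Variance of each single coefficient estimator: the diagonal entries.
var_beta_hat = np.diag(cov_beta_hat)

print(beta_hat)
print(var_beta_hat)
```

In practice the error variance is unknown and has to be estimated; it is treated as known here only to keep the sketch short.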
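If the $k$-th regressor has zero mean, we can write the variance of its estimated coefficient as
$$\operatorname{Var}[\widehat{\beta}_k \mid X] = \frac{\sigma^2}{X_k^\top X_k} \cdot \frac{1}{1 - R_k^2}$$
where $X_k$ is the $k$-th column of $X$ and $R_k^2$ is the R squared obtained by regressing the $k$-th regressor on all the other regressors.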
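Without loss of generality, suppose that $k = 1$ (otherwise, change the order of the regressors). We can write the design matrix as a block matrix:
$$X = \begin{bmatrix} X_1 & X_{-1} \end{bmatrix}$$
where $X_1$ is the first column of $X$ and the block $X_{-1}$ contains all the other columns. Then, we have
$$X^\top X = \begin{bmatrix} X_1^\top X_1 & X_1^\top X_{-1} \\ X_{-1}^\top X_1 & X_{-1}^\top X_{-1} \end{bmatrix}$$
We use Schur complements, and in particular the formula for the upper-left block of the inverse of a block matrix,
$$\left( A - B D^{-1} C \right)^{-1} \quad \text{for the matrix} \quad \begin{bmatrix} A & B \\ C & D \end{bmatrix},$$
to write the first entry of the inverse of $X^\top X$ as:
$$[(X^\top X)^{-1}]_{11} = \left( X_1^\top X_1 - X_1^\top X_{-1} (X_{-1}^\top X_{-1})^{-1} X_{-1}^\top X_1 \right)^{-1} = \left( X_1^\top M_{-1} X_1 \right)^{-1}$$
where
$$M_{-1} = I - X_{-1} (X_{-1}^\top X_{-1})^{-1} X_{-1}^\top$$
As proved in the lecture on partitioned regressions, the matrix $M_{-1}$ is idempotent and symmetric; moreover, when it is post-multiplied by $X_1$, it gives as a result the residuals of a regression of $X_1$ on $X_{-1}$. The vector of these residuals is denoted by
$$e_1 = M_{-1} X_1$$
Therefore,
$$[(X^\top X)^{-1}]_{11} = \left( X_1^\top M_{-1}^\top M_{-1} X_1 \right)^{-1} = \left( e_1^\top e_1 \right)^{-1}$$
If $X_1$ has zero mean, the R squared of the regression of $X_1$ on $X_{-1}$ is
$$R_1^2 = 1 - \frac{e_1^\top e_1}{X_1^\top X_1}$$
Note that this formula for the R squared is correct only if $X_1$ has zero mean. Then, we can write
$$e_1^\top e_1 = (1 - R_1^2) \, X_1^\top X_1$$
Therefore,
$$[(X^\top X)^{-1}]_{11} = \frac{1}{(1 - R_1^2) \, X_1^\top X_1}$$
and
$$\operatorname{Var}[\widehat{\beta}_1 \mid X] = \sigma^2 [(X^\top X)^{-1}]_{11} = \frac{\sigma^2}{X_1^\top X_1} \cdot \frac{1}{1 - R_1^2}$$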
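The key identity in the proof, namely that the diagonal entry of the inverse of X'X equals the reciprocal of (1 - R squared) times the inner product of the regressor with itself, can be checked numerically. The following is a minimal sketch on simulated, demeaned data; the seed, the dimensions and the way correlation is injected are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
N, K = 200, 3

# Simulated regressors, with some correlation between columns 0 and 1.
X = rng.normal(size=(N, K))
X[:, 1] += 0.8 * X[:, 0]
X -= X.mean(axis=0)              # demean, so every regressor has zero mean

k = 0
X_k = X[:, k]
X_other = np.delete(X, k, axis=1)

# Residuals from regressing X_k on all the other regressors.
coef, *_ = np.linalg.lstsq(X_other, X_k, rcond=None)
e_k = X_k - X_other @ coef

# R squared of the auxiliary regression (valid because X_k has zero mean).
r2_k = 1.0 - (e_k @ e_k) / (X_k @ X_k)

# Left-hand side: k-th diagonal entry of (X'X)^{-1}.
lhs = np.linalg.inv(X.T @ X)[k, k]
# Right-hand side: 1 / ((1 - R_k^2) * X_k'X_k).
rhs = 1.0 / ((1.0 - r2_k) * (X_k @ X_k))

print(lhs, rhs)                  # the two numbers should agree up to rounding
```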
If the $k$-th regressor is orthogonal to all the other regressors, we can write the variance of its estimated coefficient as
$$\operatorname{Var}[\widehat{\beta}_k \mid X] = \frac{\sigma^2}{X_k^\top X_k}$$
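As in the previous proof, we assume without loss of generality that $k = 1$. In that proof, we have demonstrated that
$$[(X^\top X)^{-1}]_{11} = \left( X_1^\top X_1 - X_1^\top X_{-1} (X_{-1}^\top X_{-1})^{-1} X_{-1}^\top X_1 \right)^{-1}$$
If $X_1$ is orthogonal to all the columns in $X_{-1}$, then
$$X_1^\top X_{-1} = 0$$
Therefore,
$$\operatorname{Var}[\widehat{\beta}_1 \mid X] = \sigma^2 [(X^\top X)^{-1}]_{11} = \frac{\sigma^2}{X_1^\top X_1}$$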
Thus, the variance of $\widehat{\beta}_k$ is the product of two terms:
the variance $\sigma^2 / (X_k^\top X_k)$ that $\widehat{\beta}_k$ would have if the $k$-th regressor were orthogonal to all the other regressors;
the term $1 / (1 - R_k^2)$, where $R_k^2$ is the R squared in a regression of the $k$-th regressor on all the other regressors.
The second term is called the variance inflation factor because it inflates the variance of $\widehat{\beta}_k$ with respect to the base case of orthogonality.
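As a rough sketch of this definition, the function below computes each VIF as the reciprocal of one minus the R squared of an auxiliary regression. The function name `vif_by_auxiliary_regressions` is hypothetical. The code assumes that the input holds at least two non-constant regressors and demeans them internally, so that the zero-mean condition holds.

```python
import numpy as np

def vif_by_auxiliary_regressions(X):
    """Return the VIF of each column of X, computed as 1 / (1 - R_k^2).

    X is assumed to contain only the non-constant regressors (at least two
    columns). The columns are demeaned before the auxiliary regressions.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    vifs = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        x_k = X[:, k]
        others = np.delete(X, k, axis=1)
        # Auxiliary regression of the k-th regressor on all the others.
        coef, *_ = np.linalg.lstsq(others, x_k, rcond=None)
        resid = x_k - others @ coef
        r2 = 1.0 - (resid @ resid) / (x_k @ x_k)
        vifs[k] = 1.0 / (1.0 - r2)
    return vifs

# Two orthogonal, zero-mean regressors: both VIFs are equal to 1.
X_orth = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
print(vif_by_auxiliary_regressions(X_orth))
```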
In order to derive the VIF, we have made the important assumption that the $k$-th regressor has zero mean.
If this assumption is not met, then it is incorrect to compute the VIF as $1 / (1 - R_k^2)$ because the latter is no longer a factor in the formula that relates the actual variance of $\widehat{\beta}_k$ to its hypothetical variance under the assumption of orthogonality.
One way to make sure that the zero-mean assumption is met is to run a demeaned regression: before computing the OLS coefficient estimates, we demean all the variables.
As explained in the lecture on partitioned regression, demeaning does not change the estimates of the slope coefficients, provided that the regression includes a constant.
Note that standardized variables also have zero mean, so a standardized regression meets the zero-mean assumption as well. Therefore, we can run a standardized regression before computing variance inflation factors.
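The claim that demeaning leaves the slope estimates unchanged when a constant is included can be checked with a small simulation. This is only a sketch: the data-generating process, the seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed, for reproducibility only
N = 100

# Regressors with non-zero means and a dependent variable with an intercept.
Z = rng.normal(loc=5.0, size=(N, 2))
y = 1.0 + Z @ np.array([2.0, -3.0]) + rng.normal(size=N)

# Regression on the raw data, including a constant.
X_const = np.column_stack([np.ones(N), Z])
beta_raw, *_ = np.linalg.lstsq(X_const, y, rcond=None)

# Demeaned regression: demean all the variables and drop the constant.
Z_dm = Z - Z.mean(axis=0)
y_dm = y - y.mean()
beta_dm, *_ = np.linalg.lstsq(Z_dm, y_dm, rcond=None)

# The slope estimates of the two regressions should coincide.
print(beta_raw[1:], beta_dm)
```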
We have explained above that the VIF provides a comparison between the actual variance of a coefficient estimator and its hypothetical variance (under the assumption of orthogonality).
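By definition, the $k$-th regressor is orthogonal to all the other regressors if and only if
$$X_k^\top X_j = 0$$
for all $j \neq k$.
If the $k$-th regressor has zero mean, then the orthogonality condition is equivalent to saying that the $k$-th regressor is uncorrelated with all the other regressors.
Denote the sample means of $X_k$ and $X_j$ by $\bar{x}_k$ and $\bar{x}_j$. We assume that $\bar{x}_k = 0$. Then, the sample covariance between $X_k$ and $X_j$ is
$$\frac{1}{N} \sum_{i=1}^{N} (x_{ik} - \bar{x}_k)(x_{ij} - \bar{x}_j) = \frac{1}{N} \sum_{i=1}^{N} x_{ik} x_{ij} - \bar{x}_k \bar{x}_j = \frac{1}{N} X_k^\top X_j = 0$$
where the last equality holds because $X_k$ and $X_j$ are orthogonal. Therefore, $X_k$ and $X_j$ are uncorrelated.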
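A tiny numerical illustration may help here. The vectors below are made-up values chosen so that the arithmetic is easy to follow: the first has zero mean, the second does not. The last lines show what happens once the zero-mean condition is dropped.

```python
import numpy as np

x_k = np.array([1.0, -1.0, 1.0, -1.0])   # zero sample mean
x_j = np.array([3.0, 3.0, 1.0, 1.0])     # sample mean equal to 2

# Orthogonality: the inner product is zero.
inner_product = x_k @ x_j

# Sample covariance (dividing by N, as in the formula above).
sample_cov = np.mean((x_k - x_k.mean()) * (x_j - x_j.mean()))

print(inner_product, sample_cov)          # both are zero

# If we shift x_k so that its mean is no longer zero, the covariance is
# unchanged but the two vectors are no longer orthogonal.
x_k_shifted = x_k + 1.0
inner_product_shifted = x_k_shifted @ x_j
sample_cov_shifted = np.mean(
    (x_k_shifted - x_k_shifted.mean()) * (x_j - x_j.mean())
)

print(inner_product_shifted, sample_cov_shifted)
```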
This is why, if the $k$-th regressor has zero mean, the VIF provides a comparison between:
the actual variance of a coefficient estimator;
the variance that the estimator would have if the corresponding variable were uncorrelated with all the other regressors.
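We usually compute the VIF for all the regressors. If there are many regressors and the sample size is large, computing the VIF as
$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$
can be quite burdensome because we need to run many large regressions (one for each $k$) in order to compute $K$ different R squareds.
A better alternative is to use the equivalent formula
$$\mathrm{VIF}_k = \frac{[(X^\top X)^{-1}]_{kk}}{(X_k^\top X_k)^{-1}}$$
which can be easily derived from the formulae given above.
We have proved that
$$[(X^\top X)^{-1}]_{kk} = \frac{1}{(1 - R_k^2) \, X_k^\top X_k}$$
which implies that
$$\frac{1}{1 - R_k^2} = X_k^\top X_k \, [(X^\top X)^{-1}]_{kk} = \frac{[(X^\top X)^{-1}]_{kk}}{(X_k^\top X_k)^{-1}}$$
When we use the latter formula, we compute $(X^\top X)^{-1}$ only once. Then, we use its $K$ diagonal entries to compute the $K$ VIFs.
The numbers $(X_k^\top X_k)^{-1}$ in the denominator are easy to calculate because each of them is the reciprocal of the inner product of a vector with itself.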
Here is the final recipe for computing the variance inflation factors:
Make sure that your regression includes a constant (otherwise this recipe cannot be used).
Demean all the variables and drop the constant.
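Compute $(X^\top X)^{-1}$.
For each $k$ compute $(X_k^\top X_k)^{-1}$.
The VIF for the $k$-th regressor is
$$\mathrm{VIF}_k = \frac{[(X^\top X)^{-1}]_{kk}}{(X_k^\top X_k)^{-1}}$$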
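The recipe above can be sketched in a few lines of NumPy. The function name `vif_from_inverse` is hypothetical, and the code assumes that the input matrix contains only the non-constant regressors of a regression that includes a constant (the constant column itself is not passed in).

```python
import numpy as np

def vif_from_inverse(X):
    """VIFs from the diagonal of (X'X)^{-1}, following the recipe above.

    X is assumed to hold the non-constant regressors of a regression
    that includes a constant.
    """
    X = np.asarray(X, dtype=float)
    # Demean all the regressors (the constant has already been dropped).
    X_dm = X - X.mean(axis=0)
    # Compute (X'X)^{-1} only once.
    inv_gram = np.linalg.inv(X_dm.T @ X_dm)
    # For each k, compute (X_k'X_k)^{-1}: reciprocal of a column's inner
    # product with itself.
    inv_norms = 1.0 / np.einsum("ij,ij->j", X_dm, X_dm)
    # VIF_k = [(X'X)^{-1}]_{kk} / (X_k'X_k)^{-1}.
    return np.diag(inv_gram) / inv_norms

# Usage on simulated data (seed and coefficients are arbitrary).
rng = np.random.default_rng(3)
X_sim = rng.normal(size=(300, 3))
X_sim[:, 2] += 0.9 * X_sim[:, 0]   # make columns 0 and 2 correlated
print(vif_from_inverse(X_sim))
```

Because the columns are demeaned inside the function, adding a constant shift to any regressor leaves the result unchanged. Explicitly inverting the matrix is kept here to mirror the recipe; when the regressors are nearly collinear, the inversion can be numerically unstable.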
The VIF is equal to 1 if the regressor is uncorrelated with the other regressors, and greater than 1 in case of non-zero correlation.
The greater the VIF, the higher the degree of multicollinearity.
In the limit, when multicollinearity is perfect (i.e., the regressor is equal to a linear combination of other regressors), the VIF tends to infinity.
There is no precise rule for deciding when a VIF is too high (O'Brien 2007), but values above 10 are often considered a strong hint that trying to reduce the multicollinearity of the regression might be worthwhile.
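To get a feeling for these magnitudes, consider the special case of only two demeaned regressors, in which the R squared of the auxiliary regression equals the squared sample correlation between them. The sketch below (with the hypothetical helper `vif_two_regressors`) tabulates the VIF for a few illustrative correlation values.

```python
import numpy as np

def vif_two_regressors(r):
    """VIF when there are only two regressors with sample correlation r."""
    return 1.0 / (1.0 - r ** 2)

# The VIF is 1 when the correlation is zero and grows without bound
# as the correlation approaches 1 in absolute value.
for r in (0.0, 0.5, 0.9, 0.99, 0.999):
    print(f"correlation = {r:<6} VIF = {vif_two_regressors(r):.2f}")

# Correlation at which the VIF reaches the value 10 mentioned above.
r_at_10 = np.sqrt(0.9)
print(r_at_10, vif_two_regressors(r_at_10))
```

In this two-regressor case, a VIF of 10 corresponds to a correlation of roughly 0.95 in absolute value.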
In the lecture on Multicollinearity, we discuss in more detail the interpretation of the variance inflation factor, and we explain how to deal with multicollinearity.
O'Brien, R. (2007) A Caution Regarding Rules of Thumb for Variance Inflation Factors, Quality & Quantity, 41, 673-690.
Please cite as:
Taboga, Marco (2021). "Variance inflation factor", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.