This lecture deals with standardized linear regressions, that is, regression models in which the variables are standardized.
A variable is standardized by subtracting its sample mean from it and dividing it by its sample standard deviation. After being standardized, the variable has zero mean and unit standard deviation.
We are going to deal with linear regressions
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \varepsilon_i$$
where $i = 1, \ldots, N$ are the observations in the sample, there are $K$ regressors $x_{i1}, \ldots, x_{iK}$ and regression coefficients $\beta_1, \ldots, \beta_K$, $y_i$ is the dependent variable and $\varepsilon_i$ is the error term.
In a standardized regression all the variables have zero mean and unit standard deviation or, equivalently, unit variance. More precisely,
$$\frac{1}{N}\sum_{i=1}^{N} x_{ij} = 0 \quad \text{and} \quad \frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 = 1$$
for $j = 1, \ldots, K$.
Furthermore, we assume that the dependent variable is also standardized:
$$\frac{1}{N}\sum_{i=1}^{N} y_i = 0 \quad \text{and} \quad \frac{1}{N}\sum_{i=1}^{N} y_i^2 = 1$$
In general, a variable to be included in a regression model does not have zero mean and unit variance. Denote such a variable by $x_{ij}^u$ (where the superscript $u$ indicates that the variable is unstandardized). Then, we standardize it before including it in the regression.
We compute the sample mean and variance of $x_{ij}^u$:
$$m_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}^u \quad \text{and} \quad s_j^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij}^u - m_j\right)^2$$
Then, we compute the standardized variable to be used in the regression:
$$x_{ij} = \frac{x_{ij}^u - m_j}{s_j}$$
for $i = 1, \ldots, N$ and $j = 1, \ldots, K$.
The same process is performed on the dependent variable if it does not have zero mean and unit variance.
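The whole procedure can be summarized in a few lines of code. The following is a minimal sketch (the simulated data and variable names, such as X_u and y_u, are illustrative assumptions, not taken from the lecture) that standardizes both the regressors and the dependent variable using the 1/N formulas above.

```python
import numpy as np

# Minimal sketch of the standardization step (illustrative data and names).
rng = np.random.default_rng(0)
N, K = 100, 3

X_u = rng.normal(loc=5.0, scale=2.0, size=(N, K))              # unstandardized regressors
y_u = X_u @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=N)    # unstandardized dependent variable

# Sample means and standard deviations (1/N formulas, i.e. ddof=0)
m = X_u.mean(axis=0)
s = X_u.std(axis=0)

# Standardized variables: zero mean and unit variance
X = (X_u - m) / s
y = (y_u - y_u.mean()) / y_u.std()

print(np.allclose(X.mean(axis=0), 0), np.allclose(X.std(axis=0), 1))  # True True
```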
Particular care needs to be taken if the regression includes an intercept, that is, if one of the regressors is constant and equal to 1.
Clearly, the constant cannot be standardized because it has zero variance and division by zero is not allowed.
We have two possibilities:
we leave the constant as it is, that is, we do not standardize it;
we drop the constant from the regression.
If all the variables, including the dependent variable $y$, are standardized, as we have assumed above, then there is no need to include a constant in the regression because the OLS estimate of its coefficient would in any case be equal to zero (proof below). Therefore, in what follows we are always going to drop the constant.
Write the regression in matrix form
$$y = X\beta + \varepsilon$$
where $y$ is the $N\times 1$ vector of observations of the dependent variable, $X$ is the $N\times K$ matrix of regressors, $\beta$ is the $K\times 1$ vector of regression coefficients and $\varepsilon$ is the $N\times 1$ vector of error terms.
The OLS estimator of $\beta$ is
$$\widehat{\beta} = \left(X^\top X\right)^{-1} X^\top y$$
Suppose the first regressor is constant and equal to 1, and all the other regressors are standardized. Denote by $X_{-1}$ the matrix obtained by deleting the first column of $X$ (i.e., the column containing the constant), and by $\iota$ that first column (a vector of ones). Then, $X^\top X$ is block diagonal:
$$X^\top X = \begin{bmatrix} \iota^\top \iota & \iota^\top X_{-1} \\ X_{-1}^\top \iota & X_{-1}^\top X_{-1} \end{bmatrix} = \begin{bmatrix} N & 0 \\ 0 & X_{-1}^\top X_{-1} \end{bmatrix}$$
where the off-diagonal blocks are zero because the variables in $X_{-1}$ are standardized: each of their columns has zero mean and therefore sums to zero.
As a consequence, $\left(X^\top X\right)^{-1}$ is block diagonal:
$$\left(X^\top X\right)^{-1} = \begin{bmatrix} 1/N & 0 \\ 0 & \left(X_{-1}^\top X_{-1}\right)^{-1} \end{bmatrix}$$
Furthermore,
$$X^\top y = \begin{bmatrix} \iota^\top y \\ X_{-1}^\top y \end{bmatrix} = \begin{bmatrix} 0 \\ X_{-1}^\top y \end{bmatrix}$$
where $\iota^\top y = 0$ because $y$ is standardized and therefore has zero mean.
Thus, by carrying out the multiplication of the two block matrices $\left(X^\top X\right)^{-1}$ and $X^\top y$, we get
$$\widehat{\beta} = \left(X^\top X\right)^{-1} X^\top y = \begin{bmatrix} 0 \\ \left(X_{-1}^\top X_{-1}\right)^{-1} X_{-1}^\top y \end{bmatrix}$$
In other words, when we add an intercept, the OLS estimates of the coefficients of the other regressors do not change and the estimated intercept is always equal to zero.
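A quick numerical check of this result can be carried out as follows; the data-generating choices and variable names below are purely illustrative assumptions.

```python
import numpy as np

# Check: with standardized variables, prepending a column of ones leaves the
# slope estimates unchanged and yields a zero intercept (illustrative data).
rng = np.random.default_rng(1)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([0.8, -0.4, 0.2]) + rng.normal(size=N)

# Standardize everything (1/N formulas)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

beta_no_const = np.linalg.solve(X.T @ X, X.T @ y)          # OLS without a constant

Xc = np.column_stack([np.ones(N), X])                      # add a constant column
beta_with_const = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

print(np.isclose(beta_with_const[0], 0.0))                 # intercept is (numerically) zero
print(np.allclose(beta_with_const[1:], beta_no_const))     # slopes unchanged
```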
Standardizing the variables in the regression greatly simplifies the computation of their sample covariances and correlations.
The sample covariance between two regressors $x_j$ and $x_l$ is
$$\widehat{\operatorname{Cov}}\left[x_j, x_l\right] = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \bar{x}_j\right)\left(x_{il} - \bar{x}_l\right) = \frac{1}{N}\sum_{i=1}^{N} x_{ij} x_{il}$$
where the sample means $\bar{x}_j$ and $\bar{x}_l$ are zero because the two regressors are standardized.
For the same reason, the sample covariance between $x_j$ and $y$ is
$$\widehat{\operatorname{Cov}}\left[x_j, y\right] = \frac{1}{N}\sum_{i=1}^{N} x_{ij} y_i$$
The sample correlation between $x_j$ and $x_l$ is
$$\widehat{\operatorname{Corr}}\left[x_j, x_l\right] = \frac{\widehat{\operatorname{Cov}}\left[x_j, x_l\right]}{\sqrt{\widehat{\operatorname{Var}}\left[x_j\right]}\sqrt{\widehat{\operatorname{Var}}\left[x_l\right]}} = \widehat{\operatorname{Cov}}\left[x_j, x_l\right]$$
where the sample variances $\widehat{\operatorname{Var}}\left[x_j\right]$ and $\widehat{\operatorname{Var}}\left[x_l\right]$ are equal to 1 because the two regressors are standardized.
By the same token, the sample correlation between $x_j$ and $y$ is
$$\widehat{\operatorname{Corr}}\left[x_j, y\right] = \widehat{\operatorname{Cov}}\left[x_j, y\right]$$
Thus, in a standardized regression, sample correlations and sample covariances coincide.
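The following short check illustrates this point numerically; the simulated data are an illustrative assumption.

```python
import numpy as np

# After standardization, the 1/N sample covariance matrix coincides with the
# sample correlation matrix (illustrative data).
rng = np.random.default_rng(2)
N = 500
X = rng.normal(size=(N, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize both columns

cov = (X.T @ X) / N                          # 1/N covariances (means are zero)
corr = np.corrcoef(X, rowvar=False)          # sample correlation matrix

print(np.allclose(cov, corr))                # True
```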
Denote by $y$ the $N\times 1$ vector of observations of the dependent variable and by $X$ the $N\times K$ matrix of regressors, so that the regression equation can be written in matrix form as
$$y = X\beta + \varepsilon$$
where $\beta$ is the $K\times 1$ vector of regression coefficients and $\varepsilon$ is the $N\times 1$ vector of error terms.
The OLS estimator of $\beta$ is
$$\widehat{\beta} = \left(X^\top X\right)^{-1} X^\top y$$
When all the variables are standardized, the OLS estimator can be written as a function of their sample correlations.
Denote by $x_j$ the $j$-th column of $X$, that is, the vector of the $N$ observations of the $j$-th regressor. Note that the $(j,l)$-th element of $X^\top X$ is
$$\left[X^\top X\right]_{jl} = \sum_{i=1}^{N} x_{ij} x_{il} = N\,\widehat{\operatorname{Cov}}\left[x_j, x_l\right] = N\,\widehat{\operatorname{Corr}}\left[x_j, x_l\right]$$
Furthermore, the $j$-th element of $X^\top y$ is
$$\left[X^\top y\right]_{j} = \sum_{i=1}^{N} x_{ij} y_i = N\,\widehat{\operatorname{Cov}}\left[x_j, y\right] = N\,\widehat{\operatorname{Corr}}\left[x_j, y\right]$$
Denote by $\widehat{\operatorname{Corr}}\left[X, X\right]$ the $K\times K$ sample correlation matrix of $X$, that is, the matrix whose $(j,l)$-th entry is equal to $\widehat{\operatorname{Corr}}\left[x_j, x_l\right]$. Then,
$$X^\top X = N\,\widehat{\operatorname{Corr}}\left[X, X\right]$$
Similarly, denote by $\widehat{\operatorname{Corr}}\left[X, y\right]$ the $K\times 1$ vector whose $j$-th entry is equal to $\widehat{\operatorname{Corr}}\left[x_j, y\right]$, so that
$$X^\top y = N\,\widehat{\operatorname{Corr}}\left[X, y\right]$$
Thus, we can write the OLS estimator as a function of the sample correlation matrices:
$$\widehat{\beta} = \left(X^\top X\right)^{-1} X^\top y = \left(N\,\widehat{\operatorname{Corr}}\left[X, X\right]\right)^{-1} N\,\widehat{\operatorname{Corr}}\left[X, y\right] = \widehat{\operatorname{Corr}}\left[X, X\right]^{-1} \widehat{\operatorname{Corr}}\left[X, y\right]$$
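As a sketch of this identity in code (with illustrative simulated data and my own variable names), the estimator computed from the sample correlations matches the usual OLS formula.

```python
import numpy as np

# With standardized data, Corr[X,X]^{-1} Corr[X,y] equals (X'X)^{-1} X'y
# (illustrative data and coefficient values).
rng = np.random.default_rng(3)
N, K = 300, 3
X = rng.normal(size=(N, K))
y = X @ np.array([0.5, 0.2, -0.7]) + rng.normal(size=N)

X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # standard OLS formula

corr_XX = (X.T @ X) / N                          # correlation matrix of the regressors
corr_Xy = (X.T @ y) / N                          # correlations between regressors and y
beta_corr = np.linalg.solve(corr_XX, corr_Xy)    # OLS from correlations

print(np.allclose(beta_ols, beta_corr))          # True
```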
The estimated coefficients of a linear regression model with standardized variables are called standardized coefficients. They are sometimes deemed easier to interpret than the coefficients of an unstandardized regression.
In general, a regression coefficient $\beta_j$ is interpreted as the effect produced on the dependent variable when the $j$-th regressor is increased by one unit.
Sometimes, for example, when we read the output of a regression estimated by someone else, we are unable to tell whether a unit increase in a regressor is a lot or a little, or we are uncertain about the relevance of the effect on the dependent variable. In these situations, standardized coefficients are easier to interpret.
In a standardized regression, a unit increase in a variable corresponds to an increase of one standard deviation in the original, unstandardized variable. Roughly speaking, the standard deviation is the average deviation of a random variable from its mean. So, when a variable differs from its mean by one standard deviation, that is in a sense a "typical" deviation. Then, a standardized coefficient tells you what multiple or fraction of a typical deviation in $y$ is caused by a typical deviation in the $j$-th regressor.
Another benefit of standardization is that it is easier to make comparisons among regressors. In particular, if we ask what regressor has the largest impact on the dependent variable, then we have an easy answer: it is the regressor whose coefficient is the highest in absolute value. In fact, a typical deviation of that regressor from its mean will produce the largest effect, as compared to the effects produced by typical deviations of the other regressors from their mean.
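A related practical point, shown in the hedged sketch below (the data and names are illustrative, not from the lecture), is that standardized coefficients can also be recovered from an unstandardized regression by rescaling each slope by the ratio of the regressor's standard deviation to that of the dependent variable.

```python
import numpy as np

# Standardized slopes equal unstandardized slopes times sd(x_j)/sd(y)
# (illustrative data; consistent 1/N standard deviations throughout).
rng = np.random.default_rng(4)
N, K = 400, 2
X_u = rng.normal(loc=[10.0, -3.0], scale=[4.0, 0.5], size=(N, K))
y_u = 2.0 + X_u @ np.array([1.5, -8.0]) + rng.normal(size=N)

# Unstandardized regression with an intercept
Xc = np.column_stack([np.ones(N), X_u])
beta_u = np.linalg.solve(Xc.T @ Xc, Xc.T @ y_u)            # [intercept, slopes]

# Standardized regression without an intercept
X = (X_u - X_u.mean(axis=0)) / X_u.std(axis=0)
y = (y_u - y_u.mean()) / y_u.std()
beta_s = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_s, beta_u[1:] * X_u.std(axis=0) / y_u.std()))  # True
```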