
Linear regression with standardized variables


This lecture deals with standardized linear regressions, that is, regression models in which the variables are standardized.

A variable is standardized by subtracting its sample mean and dividing by its sample standard deviation. After being standardized, the variable has zero mean and unit standard deviation.


Standardization

We are going to deal with linear regressions
$$y_{i} = \beta_{1} x_{i1} + \beta_{2} x_{i2} + \ldots + \beta_{K} x_{iK} + \varepsilon_{i}$$
where $i=1,\ldots,N$ indexes the observations in the sample, there are $K$ regressors $x_{i1},\ldots,x_{iK}$ and $K$ regression coefficients $\beta_{1},\ldots,\beta_{K}$, $y_{i}$ is the dependent variable and $\varepsilon_{i}$ is the error term.

In a standardized regression all the variables have zero mean and unit standard deviation or, equivalently, unit variance. More precisely,
$$\frac{1}{N}\sum_{i=1}^{N} x_{ik} = 0 \quad \text{and} \quad \frac{1}{N}\sum_{i=1}^{N} x_{ik}^{2} = 1$$
for $k=1,\ldots,K$.

Furthermore, we assume that the dependent variable is also standardized:
$$\frac{1}{N}\sum_{i=1}^{N} y_{i} = 0 \quad \text{and} \quad \frac{1}{N}\sum_{i=1}^{N} y_{i}^{2} = 1$$

How to obtain standardized variables

In general, a variable to be included in a regression model does not have zero mean and unit variance. Denote by $x_{ik}^{u}$ such a variable (where the superscript $u$ indicates that the variable is unstandardized). Then, we standardize it before including it in the regression.

We compute the sample mean and variance of $x_{ik}^{u}$:
$$\overline{x}_{k}^{u} = \frac{1}{N}\sum_{i=1}^{N} x_{ik}^{u} \quad \text{and} \quad s_{k}^{2} = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ik}^{u} - \overline{x}_{k}^{u}\right)^{2}$$

Then, we compute the standardized variable $x_{ik}$ to be used in the regression:
$$x_{ik} = \frac{x_{ik}^{u} - \overline{x}_{k}^{u}}{s_{k}}$$
for $i=1,\ldots,N$ and $k=1,\ldots,K$.

The same process is performed on the dependent variable $y_{i}^{u}$ if it does not have zero mean and unit variance.
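As a sketch of the procedure above, the following Python snippet (using NumPy, with hypothetical simulated data) standardizes each column of a regressor matrix and the dependent variable, using the $1/N$ convention for the sample variance adopted in the text:

```python
import numpy as np

# Hypothetical unstandardized regressors (N x K) and dependent variable.
rng = np.random.default_rng(0)
Xu = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
yu = rng.normal(loc=10.0, scale=4.0, size=100)

def standardize(v):
    """Subtract the sample mean, divide by the sample standard deviation.

    np.std with the default ddof=0 uses the 1/N convention of the text.
    """
    return (v - v.mean(axis=0)) / v.std(axis=0)

X = standardize(Xu)
y = standardize(yu)

# After standardization every column has zero mean and unit standard deviation.
print(np.allclose(X.mean(axis=0), 0.0))  # zero means
print(np.allclose(X.std(axis=0), 1.0))   # unit standard deviations
```

Note that using `np.std(..., ddof=1)` (the $1/(N-1)$ convention) would rescale the variables slightly differently; either choice still produces zero means and, under its own convention, unit variances.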

No intercept

Particular care needs to be taken if the regression includes an intercept, that is, if one of the regressors is constant and equal to 1.

Clearly, the constant cannot be standardized because it has zero variance and division by zero is not allowed.

We have two possibilities:

  1. we leave the constant as it is, that is, we do not standardize it;

  2. we drop the constant from the regression.

If all the variables, including the dependent variable $y_{i}$, are standardized, as we have assumed above, then there is no need to include a constant in the regression because the OLS estimate of its coefficient would anyway be equal to zero (proof below). Therefore, in what follows we are always going to drop the constant.

Proof

Write the regression in matrix form
$$y = X\beta + \varepsilon$$
where $y$ is the $N\times 1$ vector of observations of the dependent variable, $X$ the $N\times K$ matrix of regressors, $\beta$ is the $K\times 1$ vector of regression coefficients and $\varepsilon$ the $N\times 1$ vector of error terms.

The OLS estimator of $\beta$ is
$$\widehat{\beta} = \left(X^{\top}X\right)^{-1} X^{\top} y$$

Suppose the first regressor is constant and equal to 1, and all the other regressors are standardized. Denote by $X_{-1}$ the matrix obtained by deleting the first column of $X$ (i.e., the column containing the constant). Then, $X^{\top}X$ is block diagonal:
$$X^{\top}X = \begin{bmatrix} N & 0 \\ 0 & X_{-1}^{\top}X_{-1} \end{bmatrix}$$
where the off-diagonal blocks are zero because the standardized regressors have zero mean, so each column of $X_{-1}$ sums to zero.

As a consequence, $\left(X^{\top}X\right)^{-1}$ is block diagonal:
$$\left(X^{\top}X\right)^{-1} = \begin{bmatrix} 1/N & 0 \\ 0 & \left(X_{-1}^{\top}X_{-1}\right)^{-1} \end{bmatrix}$$

Furthermore,
$$X^{\top}y = \begin{bmatrix} N\overline{y} \\ X_{-1}^{\top}y \end{bmatrix} = \begin{bmatrix} 0 \\ X_{-1}^{\top}y \end{bmatrix}$$
where $N\overline{y} = 0$ because $y_{i}$ is standardized.

Thus, by carrying out the multiplication of the two block matrices $\left(X^{\top}X\right)^{-1}$ and $X^{\top}y$, we get
$$\widehat{\beta} = \begin{bmatrix} 0 \\ \left(X_{-1}^{\top}X_{-1}\right)^{-1} X_{-1}^{\top}y \end{bmatrix}$$

In other words, when we add an intercept, the OLS estimator of the other regressors does not change and the estimated intercept is always equal to zero.
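The result can be checked numerically. In this sketch (hypothetical simulated data), running OLS on standardized variables with and without a constant yields a zero estimated intercept and identical slope estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 2
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized regressors
y = X @ np.array([0.5, -0.3]) + rng.normal(size=N)
y = (y - y.mean()) / y.std()                      # standardized dependent variable

# OLS without the constant
beta_no_const = np.linalg.lstsq(X, y, rcond=None)[0]

# OLS with a constant prepended as the first column
Xc = np.column_stack([np.ones(N), X])
beta_const = np.linalg.lstsq(Xc, y, rcond=None)[0]

print(beta_const[0])                               # estimated intercept: numerically zero
print(np.allclose(beta_const[1:], beta_no_const))  # slopes unchanged
```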

Sample covariances

Standardizing the variables in the regression greatly simplifies the computation of their sample covariances and correlations.

The sample covariance between two regressors $x_{ik}$ and $x_{il}$ is
$$s_{kl} = \frac{1}{N}\sum_{i=1}^{N} \left(x_{ik} - \overline{x}_{k}\right)\left(x_{il} - \overline{x}_{l}\right) = \frac{1}{N}\sum_{i=1}^{N} x_{ik} x_{il}$$
where the sample means $\overline{x}_{k}$ and $\overline{x}_{l}$ are zero because the two regressors are standardized.

For the same reason, the sample covariance between $y_{i}$ and $x_{ik}$ is
$$s_{ky} = \frac{1}{N}\sum_{i=1}^{N} x_{ik} y_{i}$$

Sample correlations

The sample correlation between $x_{ik}$ and $x_{il}$ is
$$r_{kl} = \frac{s_{kl}}{\sqrt{s_{k}^{2} s_{l}^{2}}} = s_{kl}$$
where the sample variances $s_{k}^{2}$ and $s_{l}^{2}$ are equal to 1 because the two regressors are standardized.

By the same token, the sample correlation between $y_{i}$ and $x_{ik}$ is
$$r_{ky} = \frac{s_{ky}}{\sqrt{s_{k}^{2} s_{y}^{2}}} = s_{ky}$$

Thus, in a standardized regression, sample correlations and sample covariances coincide.
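A quick numerical check of this fact (hypothetical data; `bias=True` matches the $1/N$ convention used above):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(500, 2))
x = (x - x.mean(axis=0)) / x.std(axis=0)   # standardize both columns

# Sample covariance with the 1/N convention (bias=True)
cov = np.cov(x[:, 0], x[:, 1], bias=True)[0, 1]
# Sample correlation (the ddof convention cancels in the ratio)
corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]

print(np.isclose(cov, corr))   # covariance and correlation coincide
```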

OLS estimator

Denote by $y$ the $N\times 1$ vector of observations of the dependent variable and by $X$ the $N\times K$ matrix of regressors, so that the regression equation can be written in matrix form as
$$y = X\beta + \varepsilon$$
where $\beta$ is the $K\times 1$ vector of regression coefficients and $\varepsilon$ is the $N\times 1$ vector of error terms.

The OLS estimator of $\beta$ is
$$\widehat{\beta} = \left(X^{\top}X\right)^{-1} X^{\top} y$$

When all the variables are standardized, the OLS estimator can be written as a function of their sample correlations.

Denote by $x_{i\bullet}$ the $i$-th row of $X$. Note that the $(k,l)$-th element of $X^{\top}X$ is
$$\left(X^{\top}X\right)_{kl} = \sum_{i=1}^{N} x_{ik} x_{il} = N s_{kl} = N r_{kl}$$

Furthermore, the $k$-th element of $X^{\top}y$ is
$$\left(X^{\top}y\right)_{k} = \sum_{i=1}^{N} x_{ik} y_{i} = N s_{ky} = N r_{ky}$$

Denote by $r_{xx}$ the sample correlation matrix of $X$, that is, the $K\times K$ matrix whose $(k,l)$-th entry is equal to $r_{kl}$. Then,
$$X^{\top}X = N r_{xx}$$

Similarly, denote by $r_{xy}$ the $K\times 1$ vector whose $k$-th entry is equal to $r_{ky}$, so that
$$X^{\top}y = N r_{xy}$$

Thus, we can write the OLS estimator as a function of the sample correlation matrices:
$$\widehat{\beta} = \left(N r_{xx}\right)^{-1} N r_{xy} = r_{xx}^{-1} r_{xy}$$
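The identity $\widehat{\beta} = r_{xx}^{-1} r_{xy}$ can be verified numerically. In this sketch with hypothetical simulated data, the ordinary OLS estimate coincides with the one computed from sample correlations:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 300, 3
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardized regressors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
y = (y - y.mean()) / y.std()                        # standardized dependent variable

# Ordinary OLS formula: solve the normal equations (X'X) beta = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate from sample correlations: beta = r_xx^{-1} r_xy
r_xx = (X.T @ X) / N   # sample correlation matrix of the regressors
r_xy = (X.T @ y) / N   # correlations between each regressor and y
beta_corr = np.linalg.solve(r_xx, r_xy)

print(np.allclose(beta_ols, beta_corr))
```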

Standardized coefficients

The estimated coefficients of a linear regression model with standardized variables are called standardized coefficients. They are sometimes deemed easier to interpret than the coefficients of an unstandardized regression.

Interpretation

In general, a regression coefficient $\beta_{k}$ is interpreted as the effect produced on the dependent variable when the $k$-th regressor is increased by one unit.

Sometimes, for example when we read the output of a regression estimated by someone else, we cannot tell whether a unit increase in a regressor is a lot or a little, or we are uncertain about the practical relevance of the effect $\beta_{k}$ on the dependent variable. In these situations, standardized coefficients are easier to interpret.

In a standardized regression, a unit increase in a variable corresponds to an increase of one standard deviation of the original, unstandardized variable. Roughly speaking, the standard deviation is the average deviation of a random variable from its mean. So, when a variable differs from its mean by one standard deviation, that is in a sense a "typical" deviation. Then, a standardized coefficient $\beta_{k}$ tells you what multiple or fraction of a typical deviation in $y_{i}$ is caused by a typical deviation in the $k$-th regressor.
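As an illustration (hypothetical simulated data), the standardized coefficient is the unstandardized one rescaled by the ratio of the standard deviations of the regressor and of the dependent variable, which is one way to see that it measures effects in units of "typical" deviations:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400
xu = rng.normal(loc=50.0, scale=10.0, size=(N, 1))         # unstandardized regressor
yu = 3.0 * xu[:, 0] + rng.normal(scale=5.0, size=N)        # unstandardized dependent variable

# Unstandardized regression (with intercept)
Xc = np.column_stack([np.ones(N), xu])
b_u = np.linalg.lstsq(Xc, yu, rcond=None)[0]               # b_u[1] is the slope

# Standardized regression (no intercept needed)
x = (xu - xu.mean(axis=0)) / xu.std(axis=0)
y = (yu - yu.mean()) / yu.std()
b_s = np.linalg.lstsq(x, y, rcond=None)[0]

# The standardized slope equals the unstandardized slope times s_x / s_y
print(np.isclose(b_s[0], b_u[1] * xu.std() / yu.std()))
```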

Comparisons among standardized coefficients

Another benefit of standardization is that it makes comparisons among regressors easier. In particular, if we ask which regressor has the largest impact on the dependent variable, we have an easy answer: it is the regressor whose coefficient is the largest in absolute value. In fact, a typical deviation of that regressor from its mean produces the largest effect, as compared to the effects produced by typical deviations of the other regressors from their means.
