In regression analysis, a dummy variable is a regressor that can take only two values: either 1 or 0.
Dummy variables are typically used to encode categorical features.
Suppose that we want to analyze how personal income is affected by:
years of work experience;
postgraduate education.
To do so, we can specify a linear regression model as follows:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \varepsilon_i$$
where:
the subscript $i$ denotes an individual in our sample;
the dependent variable $y_i$ is a measure of income;
$x_i$ is the number of years of work experience;
$D_i$ is a dummy variable, equal to 1 if the individual has a higher degree or other postgraduate qualification and 0 otherwise;
$\varepsilon_i$ is the error of the regression;
$\beta_0$ is the intercept of the regression;
$\beta_1$ and $\beta_2$ are the regression coefficients of the two variables.
In the previous example, $\beta_2$ is the regression coefficient of the dummy variable. It measures by how much postgraduate education raises income on average.
In general, the regression coefficient on a dummy variable gives us the average increase in the dependent variable observed when the dummy is equal to 1 (with respect to the base case in which the dummy is equal to 0), holding the other regressors fixed.
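One way to see this is the short derivation sketched below. It assumes that the model is written as $y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \varepsilon_i$ (the symbols are notation chosen for this illustration) and that the error has zero mean conditional on the regressors. Under these assumptions, comparing two individuals with the same experience $x_i$ but different values of the dummy isolates the coefficient on the dummy:

```latex
\begin{aligned}
E[y_i \mid x_i, D_i = 1] &= \beta_0 + \beta_1 x_i + \beta_2, \\
E[y_i \mid x_i, D_i = 0] &= \beta_0 + \beta_1 x_i, \\
E[y_i \mid x_i, D_i = 1] - E[y_i \mid x_i, D_i = 0] &= \beta_2 .
\end{aligned}
```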
Let us continue with the previous example, to see what a dummy variable looks like when the data is gathered in a matrix or table.
Suppose that our sample is as follows.
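After encoding the categorical variable with a dummy, the vector of dependent variables $y$ and the matrix of regressors $X$ (the so-called design matrix) will have the form
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 & D_1 \\ \vdots & \vdots & \vdots \\ 1 & x_N & D_N \end{bmatrix}$$
where $N$ is the sample size, $x_i$ is the number of years of work experience and $D_i$ is the dummy for individual $i$.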
Note that the first column contains all 1s because we have included an intercept in the regression.
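The construction can be sketched in a few lines of Python with NumPy. The sample values and the variable names below are hypothetical, made up for illustration only; they are not the data of the example above.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
income = np.array([30.0, 42.0, 55.0, 38.0, 61.0])    # dependent variable
experience = np.array([2.0, 5.0, 10.0, 7.0, 12.0])   # years of work experience
education = np.array(["none", "postgrad", "postgrad", "none", "postgrad"])

# Encode the two-category variable with a single 0/1 dummy
dummy = (education == "postgrad").astype(float)

# Design matrix: column of 1s (intercept), experience, dummy
X = np.column_stack([np.ones_like(experience), experience, dummy])
y = income

print(X)
```

Each row of `X` corresponds to one individual, and the last column is 1 only for the individuals with a postgraduate qualification.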
We might be tempted to include two dummies in our regression:
a first dummy that is equal to 1 if the individual has a higher degree and 0 otherwise;
a second dummy that is equal to 1 if the individual does not have a higher degree and 0 otherwise.
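In our previous example, the design matrix would become
$$X = \begin{bmatrix} 1 & x_1 & D_1 & 1 - D_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & D_N & 1 - D_N \end{bmatrix}$$
where $x_i$ is the number of years of work experience, $D_i$ is the first dummy and $1 - D_i$ is the second dummy.
The problem with this double encoding is that our regressors become perfectly multicollinear, that is, one of the columns of $X$ is equal to a linear combination of the other columns.
In our example, the first column is equal to the sum of the last two:
$$X_{\bullet 1} = X_{\bullet 3} + X_{\bullet 4}$$
With perfect multicollinearity, the design matrix $X$ is rank-deficient and the matrix $X^\top X$ is singular, which implies that we cannot compute a unique vector of regression coefficients with Ordinary Least Squares (OLS).
In fact, we can compute the OLS estimator $(X^\top X)^{-1} X^\top y$ only if $X$ is full-rank. We can still compute estimators such as Ridge, which do not require $X$ to be full-rank.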
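The rank deficiency can be checked numerically. The sketch below uses hypothetical data and a hypothetical penalty value `lam`; for simplicity, the Ridge penalty is applied to all coefficients, including the intercept, which is not necessarily how Ridge would be set up in practice.

```python
import numpy as np

# Hypothetical data (illustrative values only)
experience = np.array([2.0, 5.0, 10.0, 7.0, 12.0])
has_degree = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # first dummy
no_degree = 1.0 - has_degree                        # second dummy
y = np.array([30.0, 42.0, 55.0, 38.0, 61.0])

# Intercept plus BOTH dummies: the dummy trap
X = np.column_stack([np.ones(5), experience, has_degree, no_degree])

# The intercept column is the sum of the two dummy columns,
# so X has 4 columns but fewer than 4 linearly independent ones
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Ridge adds lam * I to X'X, which makes the system solvable
lam = 1.0  # hypothetical penalty value
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)
```

Because the rank is lower than the number of columns, $X^\top X$ cannot be inverted, while the penalized matrix used by Ridge is positive definite for any `lam > 0`.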
Thus, when we have an intercept in the regression model and we want to avoid perfect multicollinearity, we create only one dummy to encode a categorical variable that has two categories.
Similarly, we create $K-1$ dummies to encode a categorical variable that has $K$ categories.
The category that is not encoded into a dummy becomes the base category.
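As a minimal sketch of this rule, the hypothetical helper `encode_dummies` below builds one 0/1 column for every category except a chosen base category. The function name, its arguments and the sample labels are assumptions made for illustration.

```python
import numpy as np

def encode_dummies(categories, base):
    """Return (matrix, labels): one 0/1 column per category except `base`."""
    values = np.asarray(categories)
    # Keep every distinct category except the base one, in sorted order
    labels = [c for c in sorted(set(values.tolist())) if c != base]
    columns = [(values == c).astype(float) for c in labels]
    return np.column_stack(columns), labels

# Hypothetical sample with three categories
education = ["L", "M", "H", "M", "L", "H"]
D, labels = encode_dummies(education, base="L")

print(labels)  # order of the dummy columns
print(D)       # rows of all zeros correspond to the base category
```

Libraries such as pandas provide similar functionality (for instance `pandas.get_dummies` with its `drop_first` option), although there the dropped category is the first level rather than one chosen by name.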
Let us look at an example with three categories.
Suppose that our sample is similar to the previous one, but individuals have been divided into three groups (H, M and L) based on their education.
If we choose L as the base category, then we create two dummies:
the first dummy is 1 if the category is M and 0 otherwise;
the second dummy is 1 if the category is H and 0 otherwise.
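The vector of dependent variables $y$ and the matrix of regressors $X$ (the so-called design matrix) have the form
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 & D_{M,1} & D_{H,1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_N & D_{M,N} & D_{H,N} \end{bmatrix}$$
where $x_i$ is the number of years of work experience, and $D_{M,i}$ and $D_{H,i}$ are the two dummies for individual $i$.
The regression equation is
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 D_{M,i} + \beta_3 D_{H,i} + \varepsilon_i$$
The regression coefficients of the two dummies are interpreted as follows:
$\beta_2$ is the average increase in $y_i$ observed when the category is M (with respect to the base case in which the category is L);
$\beta_3$ is the average increase in $y_i$ observed when the category is H (with respect to the base case in which the category is L).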
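The interpretation can be illustrated with a small sketch. The data are hypothetical and are generated without noise from coefficients chosen for illustration (`true_beta`), so that least squares recovers them and the coefficients on the M and H dummies can be read as shifts relative to the base category L.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
education = np.array(["L", "L", "M", "M", "H", "H"])
experience = np.array([1.0, 4.0, 2.0, 6.0, 3.0, 8.0])

# L is the base category, so only M and H get a dummy
d_m = (education == "M").astype(float)
d_h = (education == "H").astype(float)
X = np.column_stack([np.ones(6), experience, d_m, d_h])

# Assumed coefficients: intercept, experience, M dummy, H dummy
true_beta = np.array([20.0, 2.0, 5.0, 12.0])
y = X @ true_beta  # noise-free income, for illustration only

# OLS via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

In this construction, an M individual earns 5 more than an L individual with the same experience, and an H individual earns 12 more, which is what the two dummy coefficients report.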
An alternative to encoding only $K-1$ of the $K$ categories as dummies is to drop the intercept and encode all the categories.
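With the data in the previous example, we could have used a design matrix of the form
$$X = \begin{bmatrix} x_1 & D_{L,1} & D_{M,1} & D_{H,1} \\ \vdots & \vdots & \vdots & \vdots \\ x_N & D_{L,N} & D_{M,N} & D_{H,N} \end{bmatrix}$$
where $x_i$ is the number of years of work experience and we have encoded L, M and H into three dummies $D_{L,i}$, $D_{M,i}$ and $D_{H,i}$.
The regression equation is
$$y_i = \beta_1 x_i + \gamma_L D_{L,i} + \gamma_M D_{M,i} + \gamma_H D_{H,i} + \varepsilon_i$$
The interpretation of regression coefficients changes:
$\gamma_L$ is the intercept of the regression when the category is L (i.e., the average income of an L individual with no work experience);
$\gamma_M$ is the intercept when the category is M;
$\gamma_H$ is the intercept for H individuals.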
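The relation between the two encodings can be sketched numerically. The data below are hypothetical and noise-free, generated from an assumed model with an intercept of 20, an experience coefficient of 2, and shifts of 5 and 12 for M and H relative to L. Fitting the model without an intercept and with all three dummies should then return one intercept per category.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
education = np.array(["L", "L", "M", "M", "H", "H"])
experience = np.array([1.0, 4.0, 2.0, 6.0, 3.0, 8.0])

d_l = (education == "L").astype(float)
d_m = (education == "M").astype(float)
d_h = (education == "H").astype(float)

# Noise-free income from the assumed model with an intercept
y = 20.0 + 2.0 * experience + 5.0 * d_m + 12.0 * d_h

# No column of 1s: all three categories are encoded instead
X_all = np.column_stack([experience, d_l, d_m, d_h])
coef, *_ = np.linalg.lstsq(X_all, y, rcond=None)

# coef[0] is the experience coefficient; coef[1:] are the
# category-specific intercepts for L, M and H
print(coef)
```

The three dummy coefficients equal the base intercept plus the corresponding shift (20, 20 + 5 and 20 + 12), so the two encodings describe the same fitted model with different parametrizations.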
In machine learning, the practice of encoding categories into dummies is often called one-hot encoding.
Please cite as:
Taboga, Marco (2021). "Dummy variable", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/dummy-variable.