Dummy variable

by Marco Taboga, PhD

In regression analysis, a dummy variable is a regressor that can take only two values: either 1 or 0.

Dummy variables are typically used to encode categorical features.

Table of contents

Example
Interpretation
Matrix form
Collinearity
More than two categories
Example with three categories
Dropping the intercept
One-hot encoding

Example

Suppose that we want to analyze how personal income is affected by:

years of work experience;
postgraduate education.

To do so, we can specify a linear regression model as follows:

$$y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}d_{i}+\varepsilon_{i}$$

where:

the subscript $i$ denotes an individual in our sample;
the dependent variable $y_{i}$ is a measure of income;
$x_{i}$ is the number of years of work experience;
$d_{i}$ is a dummy variable, equal to 1 if the individual has a higher degree or other postgraduate qualification and 0 otherwise;
$\varepsilon_{i}$ is the error of the regression;
$\beta_{0}$ is the intercept of the regression;
$\beta_{1}$ and $\beta_{2}$ are the regression coefficients of the two variables.

Interpretation

In the previous example, $\beta_{2}$ is the regression coefficient of the dummy variable. It measures how much higher income is, on average, for individuals with postgraduate education, holding years of work experience constant.

In general, the regression coefficient on a dummy variable gives us the average increase in $y_{i}$ observed when the dummy is equal to 1 (with respect to the base case in which the dummy is equal to 0).
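
To see where this interpretation comes from in the previous example, write the model as $y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}d_{i}+\varepsilon_{i}$ and assume (an additional assumption, not stated above) that the error has zero mean conditional on the regressors. Then, for a given number of years of experience $x_{i}$,

$$\operatorname{E}[y_{i}\mid x_{i},d_{i}=1]-\operatorname{E}[y_{i}\mid x_{i},d_{i}=0]=(\beta_{0}+\beta_{1}x_{i}+\beta_{2})-(\beta_{0}+\beta_{1}x_{i})=\beta_{2}$$

Under that assumption, $\beta_{2}$ is the difference in expected income between two individuals who have the same work experience and differ only in the value of the dummy.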

Matrix form

Let us continue with the previous example, to see what a dummy variable looks like when the data is gathered in a matrix or table.

Suppose that our sample is as follows.

[eq2]

After encoding the categorical variable with a dummy, the vector of observations of the dependent variable $y$ and the matrix of regressors $X$ (the so-called design matrix) will be [eq3]

Note that the first column contains all 1s because we have included an intercept in the regression.
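
As a minimal sketch of this construction, the following Python snippet builds a design matrix with NumPy and fits the regression by least squares. The sample values and the variable names (`income`, `experience`, `postgraduate`) are hypothetical and are not the data shown above.

```python
import numpy as np

# Hypothetical sample (assumed for illustration; not the data in the text).
income = np.array([30.0, 42.0, 38.0, 55.0, 47.0])    # y_i
experience = np.array([2.0, 5.0, 7.0, 10.0, 12.0])   # x_i
postgraduate = ["no", "yes", "no", "yes", "no"]      # categorical feature

# Encode the categorical feature as a dummy: 1 if "yes", 0 otherwise.
d = np.array([1.0 if p == "yes" else 0.0 for p in postgraduate])

# Design matrix: a column of 1s (intercept), then experience, then the dummy.
X = np.column_stack([np.ones(len(income)), experience, d])
y = income

# OLS estimate of (beta_0, beta_1, beta_2) via a least-squares solver.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X)
print(beta_hat)
```

The estimate is computed here with `np.linalg.lstsq`; any other least-squares routine would serve the same purpose.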

Collinearity

We might be tempted to include two dummies in our regression:

a first dummy that is equal to 1 if the individual has a higher degree and 0 otherwise;
a second dummy that is equal to 1 if the individual does not have a higher degree and 0 otherwise.

In our previous example, the design matrix would become [eq4]

The problem with this double encoding is that our regressors become perfectly multicollinear, that is, one of the columns of the design matrix $X$ is equal to a linear combination of the other columns.

In our example, we have [eq5]

With perfect multicollinearity, the design matrix $X$ does not have full column rank, so the matrix $X^{\top}X$ is singular and we cannot compute the Ordinary Least Squares (OLS) estimator of the regression coefficients.

In fact, we can compute the OLS estimator only if $X$ is full-rank. We can still compute estimators such as Ridge, which do not require $X$ to be full-rank.
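
The rank deficiency can be checked numerically. The sketch below uses hypothetical data and names (`X_bad`, `lam`), and the Ridge formula shown penalizes all coefficients, intercept included, which is a simplification made only for illustration.

```python
import numpy as np

# Hypothetical sample (assumed for illustration; not the data in the text).
income = np.array([30.0, 42.0, 38.0, 55.0, 47.0])
experience = np.array([2.0, 5.0, 7.0, 10.0, 12.0])
d_yes = np.array([0.0, 1.0, 0.0, 1.0, 0.0])  # 1 if higher degree
d_no = 1.0 - d_yes                           # 1 if no higher degree

# Double encoding: intercept, experience, and BOTH dummies.
X_bad = np.column_stack([np.ones(5), experience, d_yes, d_no])

# The intercept column equals the sum of the two dummy columns ...
intercept_is_sum = np.allclose(X_bad[:, 0], X_bad[:, 2] + X_bad[:, 3])

# ... so the matrix has 4 columns but its rank is lower than 4.
rank = np.linalg.matrix_rank(X_bad)

# Ridge with penalty lam > 0: X'X + lam*I is positive definite,
# so the linear system has a unique solution even though X'X is singular.
lam = 1.0
beta_ridge = np.linalg.solve(X_bad.T @ X_bad + lam * np.eye(4), X_bad.T @ income)
print(intercept_is_sum, rank, beta_ridge)
```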

More than two categories

Thus, when we have an intercept in the regression model and we want to avoid perfect multicollinearity, we create only one dummy to encode a categorical variable that has two categories.

Similarly, we create $K-1$ dummies to encode a categorical variable that has $K$ categories.

The category that is not encoded into a dummy becomes the base category.

Example with three categories

Let us work through an example with three categories.

Suppose that our sample is similar to the previous one, but individuals have been divided into three groups (H, M and L) based on their education.

[eq6]

If we choose L as the base category, then we create two dummies:

the first dummy $d_{1,i}$ is 1 if the category is M and 0 otherwise;
the second dummy $d_{2,i}$ is 1 if the category is H and 0 otherwise.

The vector of dependent variables and the matrix of regressors (so-called design matrix) are [eq7]

The regression equation is

$$y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}d_{1,i}+\beta_{3}d_{2,i}+\varepsilon_{i}$$

The regression coefficients of the two dummies are interpreted as follows:

$\beta_{2}$ is the average increase in $y_{i}$ observed when the category is M (with respect to the base case in which the category is L);
$\beta_{3}$ is the average increase in $y_{i}$ observed when the category is H (with respect to the base case in which the category is L).
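
The encoding with L as the base category can be sketched in Python as follows. The sample and the names (`education`, `levels`) are hypothetical; only the encoding rule comes from the text.

```python
import numpy as np

# Hypothetical sample (assumed for illustration; not the data in the text).
education = ["L", "M", "H", "M", "L", "H"]
experience = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0])

# L is the base category, so only M and H get a dummy (K - 1 = 2 dummies).
levels = ["M", "H"]  # column order: d_1 (M), d_2 (H)
dummies = np.array(
    [[1.0 if e == lev else 0.0 for lev in levels] for e in education]
)

# Design matrix: intercept, experience, d_1, d_2.
X = np.column_stack([np.ones(len(education)), experience, dummies])
print(X)
```

Individuals in the base category L have both dummies equal to 0, and the resulting matrix has full column rank for this sample.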

Dropping the intercept

An alternative to encoding only $K-1$ of the $K$ categories as dummies is to drop the intercept and encode all $K$ categories.

With the data in the previous example, we could have done: [eq9] where we have encoded L, M and H into three dummies $d_{1,i}$, $d_{2,i}$ and $d_{3,i}$.

The regression equation is

$$y_{i}=\beta_{1}d_{1,i}+\beta_{2}d_{2,i}+\beta_{3}d_{3,i}+\beta_{4}x_{i}+\varepsilon_{i}$$

The interpretation of regression coefficients changes:

$\beta_{1}$ is the intercept of the regression when the category is L (i.e., the average income of an L individual with no work experience);
$\beta_{2}$ is the intercept when the category is M;
$\beta_{3}$ is the intercept for H individuals.
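
The two parametrizations describe the same fitted model, which can be verified numerically. The following sketch uses hypothetical data and names (`X_int`, `X_all`) and compares the OLS estimates obtained with an intercept and L as base category against those obtained with no intercept and all three dummies.

```python
import numpy as np

# Hypothetical sample (assumed for illustration; not the data in the text).
education = ["L", "M", "H", "M", "L", "H"]
experience = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0])
income = np.array([20.0, 31.0, 45.0, 36.0, 27.0, 52.0])

# One dummy per category.
d = {c: np.array([1.0 if e == c else 0.0 for e in education]) for c in "LMH"}

# With intercept, L as base: columns are 1, x, d_M, d_H.
X_int = np.column_stack([np.ones(6), experience, d["M"], d["H"]])
# Without intercept, all categories: columns are d_L, d_M, d_H, x.
X_all = np.column_stack([d["L"], d["M"], d["H"], experience])

b_int, *_ = np.linalg.lstsq(X_int, income, rcond=None)
b_all, *_ = np.linalg.lstsq(X_all, income, rcond=None)

# Category-specific intercepts implied by the model with an intercept.
implied = np.array([b_int[0], b_int[0] + b_int[2], b_int[0] + b_int[3]])
print(b_all[:3], implied)
```

In this sketch, each coefficient on a dummy in the model without intercept equals the intercept plus the corresponding dummy coefficient in the model with intercept, and the slope on experience is the same in both.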

One-hot encoding

In machine learning, the practice of encoding categories into dummies is often called one-hot encoding.
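
A minimal sketch of one-hot encoding in plain Python is shown below. The helper `one_hot` is hypothetical, written only for illustration; it is not a function from any library.

```python
def one_hot(values, categories=None):
    """Encode each value as a list with a single 1 in its category's position."""
    if categories is None:
        # Assumed default: categories in sorted order.
        categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


cats, encoded = one_hot(["L", "M", "H", "M"], categories=["L", "M", "H"])
for value, row in zip(["L", "M", "H", "M"], encoded):
    print(value, row)
```

Every row contains exactly one 1, so the columns sum to a column of 1s. This is the reason why, in a regression with an intercept, one of the columns needs to be dropped to avoid perfect multicollinearity.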

How to cite

Please cite as:

Taboga, Marco (2021). "Dummy variable", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/dummy-variable.