A design matrix is a matrix containing data about multiple characteristics of several individuals or objects. Each row corresponds to an individual and each column to a characteristic.
The design matrix is a fundamental mathematical object in regression analysis, for example, in linear regression models and in logit models. It is often denoted by the capital letter $X$.
We provide here some examples of design matrices.
Example. If we measure the height and weight of five individuals, we can collect the measurements in a design matrix having five rows and two columns. Each row corresponds to one of the five individuals, the first column contains the height measurements and the second one reports the weights:
$$X = \begin{bmatrix} h_1 & w_1 \\ h_2 & w_2 \\ h_3 & w_3 \\ h_4 & w_4 \\ h_5 & w_5 \end{bmatrix}$$
where $h_i$ denotes the height of the $i$-th individual and $w_i$ her weight.
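The height and weight example can be sketched in a few lines of NumPy. This is only an illustration: the measurements below are made-up values, and the names `heights_cm` and `weights_kg` are assumptions introduced here, not part of the original example.

```python
import numpy as np

# Hypothetical measurements for five individuals (illustrative values only).
heights_cm = np.array([170.0, 182.5, 165.0, 158.0, 176.0])
weights_kg = np.array([68.0, 81.0, 59.5, 52.0, 74.0])

# Each array becomes one column, so each row describes one individual.
X = np.column_stack([heights_cm, weights_kg])

print(X.shape)  # number of rows (individuals) and columns (characteristics)
print(X[2])     # height and weight of the third individual
```

Stacking the characteristics as columns keeps the convention stated above: rows index individuals and columns index characteristics.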
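Example. If we collect the data about the gross domestic product (GDP) of four countries in three consecutive years, then the design matrix is the $4 \times 3$ matrix
$$X = \begin{bmatrix} G_{11} & G_{12} & G_{13} \\ G_{21} & G_{22} & G_{23} \\ G_{31} & G_{32} & G_{33} \\ G_{41} & G_{42} & G_{43} \end{bmatrix}$$
where, for example, $G_{32}$ is the GDP of the third country in the second year.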
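When such a matrix is stored in code, the row and column positions usually shift by one because array indices start at zero. The sketch below uses invented GDP figures in arbitrary units, and the array name `G` is an assumption made for illustration.

```python
import numpy as np

# Hypothetical GDP figures (arbitrary units).
# Rows are countries, columns are consecutive years.
G = np.array([
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.3],
    [3.0, 3.2, 3.3],
    [4.0, 4.1, 4.4],
])

# NumPy indices are zero-based, so the third country in the
# second year sits at row index 2, column index 1.
gdp_third_country_second_year = G[2, 1]

print(G.shape)
print(gdp_third_country_second_year)
```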
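Consider the linear regression
$$y_i = x_i \beta + \varepsilon_i$$
where $y_i$ is the dependent variable, $x_i$ is a $1 \times K$ vector containing the explanatory variables (regressors), $\beta$ is a $K \times 1$ vector of regression coefficients, $\varepsilon_i$ is the error term and there are $N$ observations ($i = 1, \ldots, N$).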
Thus, we observe $K$ characteristics, contained in the vector of regressors $x_i$, for each of the $N$ observations.
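All the observations can be collected in the $N \times K$ design matrix
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NK} \end{bmatrix}$$
where $x_{ik}$ denotes the $k$-th entry of the vector $x_i$, that is, the $k$-th regressor.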
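We can similarly stack the observations of the dependent variable and the error terms into two $N \times 1$ vectors:
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}$$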
Having defined the design matrix $X$ and the two vectors $y$ and $\varepsilon$, we can write the regression equations in matrix form:
$$y = X \beta + \varepsilon$$
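As a quick check that the stacked form says the same thing as the individual regression equations, the following sketch builds both on simulated data and compares them. The sizes, coefficient values, and variable names (`N`, `K`, `beta`, `eps`) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3  # hypothetical number of observations and regressors

X = rng.normal(size=(N, K))          # design matrix, one row per observation
beta = np.array([0.5, -1.0, 2.0])    # illustrative regression coefficients
eps = rng.normal(scale=0.1, size=N)  # illustrative error terms

# One regression equation per observation: row i of X times beta, plus error i.
y_rowwise = np.array([X[i] @ beta + eps[i] for i in range(N)])

# The same N equations written as a single matrix equation.
y = X @ beta + eps

print(np.allclose(y_rowwise, y))
```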
This allows us to use matrix algebra to find an estimator of the regression coefficients (the lecture on linear regression shows how).
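One standard estimator obtained this way is ordinary least squares (OLS), which solves the normal equations $X^\top X \hat{\beta} = X^\top y$. The text above does not specify which estimator it has in mind, so OLS is assumed here; the data are simulated and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 3  # hypothetical sample size and number of regressors

X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -2.0, 0.5])  # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Solve the normal equations (X'X) b = X'y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent least-squares solve, usually preferred for numerical stability.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both calls should return nearly identical estimates when the design matrix has full rank.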
In most statistical models the design matrix is required to have full rank, that is, its columns must be linearly independent (see, e.g., the normal linear regression model). When this requirement is not met, we say that the design matrix suffers from multicollinearity (see the lecture on multicollinearity for details).
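The full-rank requirement can be checked numerically by comparing the rank of the design matrix with its number of columns. The sketch below is a minimal illustration with made-up columns; the helper `has_full_column_rank` is a hypothetical name introduced here, not a library function.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Two linearly independent columns.
X_full = np.column_stack([x1, x2])

# The third column is the sum of the first two, so the columns are
# linearly dependent (perfect multicollinearity).
X_deficient = np.column_stack([x1, x2, x1 + x2])


def has_full_column_rank(X):
    """Return True if the columns of X are linearly independent."""
    return bool(np.linalg.matrix_rank(X) == X.shape[1])


print(has_full_column_rank(X_full))
print(has_full_column_rank(X_deficient))
```

Note that `np.linalg.matrix_rank` relies on a numerical tolerance, so columns that are almost (but not exactly) dependent may still be reported as independent.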
However, there are also regression models where the design matrix can be rank-deficient (i.e., not full-rank), for example the Ridge regression model.
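To see why Ridge regression tolerates a rank-deficient design matrix, recall that its estimator is $(X^\top X + \lambda I)^{-1} X^\top y$ with $\lambda > 0$: adding $\lambda I$ makes the matrix to be inverted nonsingular even when $X^\top X$ is singular. The sketch below uses made-up data and an arbitrary penalty `lam`, both assumptions for illustration.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Rank-deficient design matrix: the third column equals x1 + x2.
X = np.column_stack([x1, x2, x1 + x2])
y = np.array([1.0, 0.5, 2.0, 1.5, 3.0])  # illustrative responses

lam = 0.1  # hypothetical penalty parameter, must be strictly positive

# X'X is singular here, but X'X + lam * I is positive definite.
ridge_matrix = X.T @ X + lam * np.eye(X.shape[1])
beta_ridge = np.linalg.solve(ridge_matrix, X.T @ y)

print(np.linalg.matrix_rank(X))  # smaller than the number of columns
print(beta_ridge)
```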
See the lecture on linear regression models for more details.
Please cite as:
Taboga, Marco (2021). "Design matrix", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/glossary/design-matrix.