This lecture explains how to perform maximum likelihood estimation of the coefficients of a probit model (also called probit regression).
Before reading this lecture, it may be helpful to read the introductory lectures about maximum likelihood estimation and about the probit model.
In a probit model, the output variable $y_i$ is a Bernoulli random variable (i.e., a discrete variable that can take only two values, either $0$ or $1$).
Conditional on a $1\times K$ vector of inputs $x_i$, we have that
$$P(y_i = 1 \mid x_i) = \Phi(x_i\beta),$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\beta$ is a $K\times 1$ vector of coefficients.
We assume that a sample of $N$ independently and identically distributed input-output couples $(x_i, y_i)$, for $i = 1, \ldots, N$, is observed and used to estimate the vector $\beta$.
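As an illustration, the conditional probabilities $\Phi(x_i\beta)$ can be computed numerically. The following is a minimal sketch in Python, using NumPy and SciPy; the inputs and coefficients are made up for the example.

```python
# Minimal sketch: compute P(y_i = 1 | x_i) = Phi(x_i beta) for hypothetical data.
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 0.5],      # each row x_i contains a constant and one regressor
              [1.0, -1.2],
              [1.0, 2.0]])
beta = np.array([0.3, 0.8])    # hypothetical coefficient vector

p = norm.cdf(X @ beta)          # Phi(x_i beta) for each observation
print(p)
```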
The likelihood of a single observation $(x_i, y_i)$ is
$$L(\beta; y_i, x_i) = \left[\Phi(x_i\beta)\right]^{y_i}\left[1-\Phi(x_i\beta)\right]^{1-y_i}.$$
In fact, note that
when $y_i = 1$, then $1 - y_i = 0$ and $L(\beta; y_i, x_i) = \Phi(x_i\beta) = P(y_i = 1 \mid x_i)$;
when $y_i = 0$, then $1 - y_i = 1$ and $L(\beta; y_i, x_i) = 1 - \Phi(x_i\beta) = P(y_i = 0 \mid x_i)$.
Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta; y, X) = \prod_{i=1}^{N}\left[\Phi(x_i\beta)\right]^{y_i}\left[1-\Phi(x_i\beta)\right]^{1-y_i},$$
where $y$ is the $N\times 1$ vector of all outputs and $X$ is the $N\times K$ matrix of all inputs.
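For illustration, the sample likelihood can be evaluated as a product over observations. A minimal sketch, assuming the same hypothetical data as above:

```python
# Sketch: L(beta; y, X) = prod_i Phi(x_i beta)^y_i * (1 - Phi(x_i beta))^(1 - y_i).
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
beta = np.array([0.3, 0.8])

p = norm.cdf(X @ beta)                     # Phi(x_i beta)
likelihood = np.prod(p**y * (1 - p)**(1 - y))
print(likelihood)
```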
Now, define
$$q_i = 2y_i - 1,$$
so that
$q_i = 1$ if $y_i = 1$;
$q_i = -1$ if $y_i = 0$.
By using the newly defined variables $q_i$, we can also write the likelihood in the following more compact form:
$$L(\beta; y, X) = \prod_{i=1}^{N}\Phi(q_i x_i\beta).$$
First note that when $y_i = 1$, then $1 - y_i = 0$ and
$$L(\beta; y_i, x_i) = \left[\Phi(x_i\beta)\right]^{y_i}\left[1-\Phi(x_i\beta)\right]^{1-y_i} = \Phi(x_i\beta). \qquad (a)$$
Furthermore, when $y_i = 0$, then $1 - y_i = 1$ and
$$L(\beta; y_i, x_i) = 1 - \Phi(x_i\beta). \qquad (b)$$
Since $y_i$ can take only two values ($0$ and $1$), (a) and (b) imply that
$$L(\beta; y_i, x_i) = \begin{cases}\Phi(x_i\beta) & \text{if } y_i = 1\\ 1-\Phi(x_i\beta) & \text{if } y_i = 0\end{cases}$$
for all $i$. Moreover, the symmetry of the standard normal distribution around $0$ implies that
$$\Phi(-x_i\beta) = 1 - \Phi(x_i\beta).$$
So, when $y_i = 1$, then $q_i = 1$ and
$$\Phi(q_i x_i\beta) = \Phi(x_i\beta). \qquad (c)$$
When $y_i = 0$, then $q_i = -1$ and
$$\Phi(q_i x_i\beta) = \Phi(-x_i\beta) = 1 - \Phi(x_i\beta). \qquad (d)$$
Thus, it descends from (c) and (d) that
$$L(\beta; y_i, x_i) = \Phi(q_i x_i\beta)$$
for all $i$. Thanks to these facts, we can write the likelihood as
$$L(\beta; y, X) = \prod_{i=1}^{N}\Phi(q_i x_i\beta).$$
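A quick numerical check of the equivalence between the two forms of the likelihood, with hypothetical data and illustrative variable names:

```python
# Sketch: verify that prod_i Phi(q_i x_i beta) equals the likelihood written
# with exponents y_i and 1 - y_i, where q_i = 2 y_i - 1.
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
beta = np.array([0.3, 0.8])
q = 2 * y - 1

p = norm.cdf(X @ beta)
lik_standard = np.prod(p**y * (1 - p)**(1 - y))
lik_compact = np.prod(norm.cdf(q * (X @ beta)))
print(np.isclose(lik_standard, lik_compact))   # True
```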
The log-likelihood is
$$l(\beta; y, X) = \sum_{i=1}^{N}\left[y_i\ln\Phi(x_i\beta)+(1-y_i)\ln\left(1-\Phi(x_i\beta)\right)\right].$$
It is computed as follows:
$$\begin{aligned}
l(\beta; y, X) &= \ln L(\beta; y, X)\\
&= \ln\prod_{i=1}^{N}\left[\Phi(x_i\beta)\right]^{y_i}\left[1-\Phi(x_i\beta)\right]^{1-y_i}\\
&= \sum_{i=1}^{N}\ln\left(\left[\Phi(x_i\beta)\right]^{y_i}\left[1-\Phi(x_i\beta)\right]^{1-y_i}\right)\\
&= \sum_{i=1}^{N}\left[y_i\ln\Phi(x_i\beta)+(1-y_i)\ln\left(1-\Phi(x_i\beta)\right)\right].
\end{aligned}$$
By using the $q_i$ variables, the log-likelihood can also be written as
$$l(\beta; y, X) = \sum_{i=1}^{N}\ln\Phi(q_i x_i\beta).$$
This is derived from the compact form of the likelihood:
$$l(\beta; y, X) = \ln\prod_{i=1}^{N}\Phi(q_i x_i\beta) = \sum_{i=1}^{N}\ln\Phi(q_i x_i\beta).$$
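A minimal sketch of the compact log-likelihood, computed with SciPy's logcdf for numerical stability (data and names are illustrative):

```python
# Sketch: l(beta; y, X) = sum_i ln Phi(q_i x_i beta).
import numpy as np
from scipy.stats import norm

def probit_loglik(beta, y, X):
    q = 2 * y - 1
    return np.sum(norm.logcdf(q * (X @ beta)))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(probit_loglik(np.array([0.3, 0.8]), y, X))
```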
The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is
$$\nabla_{\beta} l(\beta; y, X) = \sum_{i=1}^{N}\frac{\phi(x_i\beta)}{\Phi(x_i\beta)\left[1-\Phi(x_i\beta)\right]}\left[y_i-\Phi(x_i\beta)\right]x_i^{\top},$$
where $\phi$ is the probability density function of the standard normal distribution.
This is obtained as follows:
$$\begin{aligned}
\nabla_{\beta} l(\beta; y, X) &= \nabla_{\beta}\sum_{i=1}^{N}\left[y_i\ln\Phi(x_i\beta)+(1-y_i)\ln\left(1-\Phi(x_i\beta)\right)\right]\\
&= \sum_{i=1}^{N}\left[y_i\frac{\phi(x_i\beta)}{\Phi(x_i\beta)}-(1-y_i)\frac{\phi(x_i\beta)}{1-\Phi(x_i\beta)}\right]x_i^{\top}\\
&= \sum_{i=1}^{N}\frac{\phi(x_i\beta)}{\Phi(x_i\beta)\left[1-\Phi(x_i\beta)\right]}\left[y_i\left(1-\Phi(x_i\beta)\right)-(1-y_i)\Phi(x_i\beta)\right]x_i^{\top}\\
&= \sum_{i=1}^{N}\frac{\phi(x_i\beta)}{\Phi(x_i\beta)\left[1-\Phi(x_i\beta)\right]}\left[y_i-\Phi(x_i\beta)\right]x_i^{\top},
\end{aligned}$$
where in the second step we have used the fact that the probability density function is the derivative of the cumulative distribution function, that is,
$$\phi(t)=\frac{d\Phi(t)}{dt}.$$
By using the $q_i$ variables, the score can also be written as
$$\nabla_{\beta} l(\beta; y, X) = \sum_{i=1}^{N}q_i\,\lambda(q_i x_i\beta)\,x_i^{\top},$$
where
$$\lambda(t)=\frac{\phi(t)}{\Phi(t)}.$$
This is demonstrated as follows:
$$\nabla_{\beta} l(\beta; y, X) = \nabla_{\beta}\sum_{i=1}^{N}\ln\Phi(q_i x_i\beta) = \sum_{i=1}^{N}\frac{\phi(q_i x_i\beta)}{\Phi(q_i x_i\beta)}\,q_i\,x_i^{\top} = \sum_{i=1}^{N}q_i\,\lambda(q_i x_i\beta)\,x_i^{\top}.$$
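A minimal sketch of the score in its compact form, with $\lambda(t)=\phi(t)/\Phi(t)$ computed from SciPy's pdf and cdf (illustrative data and names):

```python
# Sketch: score = sum_i q_i * lambda(q_i x_i beta) * x_i'.
import numpy as np
from scipy.stats import norm

def probit_score(beta, y, X):
    q = 2 * y - 1
    t = q * (X @ beta)
    lam = norm.pdf(t) / norm.cdf(t)        # lambda(q_i x_i beta)
    return X.T @ (q * lam)                 # sum_i q_i lambda_i x_i'

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(probit_score(np.array([0.3, 0.8]), y, X))
```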
The Hessian, that is, the matrix of second derivatives of the log-likelihood, is
$$\nabla_{\beta\beta} l(\beta; y, X) = -\sum_{i=1}^{N}\lambda(q_i x_i\beta)\left[\lambda(q_i x_i\beta)+q_i x_i\beta\right]x_i^{\top}x_i.$$
It can be proved as follows. Differentiating the compact form of the score and using the fact that $q_i^2 = 1$, we obtain
$$\nabla_{\beta\beta} l(\beta; y, X) = \nabla_{\beta}\sum_{i=1}^{N}q_i\,\lambda(q_i x_i\beta)\,x_i^{\top} = \sum_{i=1}^{N}q_i^2\,\lambda'(q_i x_i\beta)\,x_i^{\top}x_i = \sum_{i=1}^{N}\lambda'(q_i x_i\beta)\,x_i^{\top}x_i.$$
Moreover, since $\phi'(t)=-t\,\phi(t)$, the derivative of $\lambda$ is
$$\lambda'(t)=\frac{\phi'(t)\Phi(t)-\phi(t)^2}{\Phi(t)^2}=-t\,\lambda(t)-\lambda(t)^2=-\lambda(t)\left[\lambda(t)+t\right],$$
so that
$$\nabla_{\beta\beta} l(\beta; y, X) = -\sum_{i=1}^{N}\lambda(q_i x_i\beta)\left[\lambda(q_i x_i\beta)+q_i x_i\beta\right]x_i^{\top}x_i.$$
It can be proved (see, e.g., Amemiya 1985) that the quantity
$$\lambda(q_i x_i\beta)\left[\lambda(q_i x_i\beta)+q_i x_i\beta\right]$$
is always positive.
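A minimal sketch of the Hessian, which also checks numerically that the weights above are positive and that the resulting matrix is negative definite (hypothetical data):

```python
# Sketch: Hessian = -sum_i lambda_i (lambda_i + q_i x_i beta) x_i' x_i.
import numpy as np
from scipy.stats import norm

def probit_hessian(beta, y, X):
    q = 2 * y - 1
    t = q * (X @ beta)
    lam = norm.pdf(t) / norm.cdf(t)
    w = lam * (lam + t)                    # always positive (Amemiya 1985)
    return -(X.T * w) @ X                  # -sum_i w_i x_i' x_i

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
H = probit_hessian(np.array([0.3, 0.8]), y, X)
print(H)
print(np.all(np.linalg.eigvalsh(H) < 0))   # negative definite when X has full rank
```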
The maximum likelihood estimator $\widehat{\beta}$ of the parameter $\beta$ is obtained as a solution of the following maximization problem:
$$\widehat{\beta}=\operatorname*{arg\,max}_{\beta}\;l(\beta; y, X).$$
As in the case of the logit model, the maximization problem is not guaranteed to have a solution for the probit model either, but when a solution exists, the score vector satisfies the first order condition at the maximum:
$$\nabla_{\beta} l(\widehat{\beta}; y, X)=0,$$
that is,
$$\sum_{i=1}^{N}\frac{\phi(x_i\widehat{\beta})}{\Phi(x_i\widehat{\beta})\left[1-\Phi(x_i\widehat{\beta})\right]}\left[y_i-\Phi(x_i\widehat{\beta})\right]x_i^{\top}=0.$$
The quantity $y_i-\Phi(x_i\widehat{\beta})$ is the residual, that is, the forecasting error committed by using $\Phi(x_i\widehat{\beta})$ to predict $y_i$. Note the difference with respect to the logit model:
in the logit model, residuals need to be orthogonal to the predictors $x_i$;
in the probit model, the orthogonality condition holds for weighted residuals; the weight assigned to each residual is
$$\frac{\phi(x_i\widehat{\beta})}{\Phi(x_i\widehat{\beta})\left[1-\Phi(x_i\widehat{\beta})\right]}.$$
By using the $q_i$ variables and the second expression for the score derived above, the first order condition can also be written as
$$\sum_{i=1}^{N}q_i\,\lambda(q_i x_i\widehat{\beta})\,x_i^{\top}=0,$$
where, as above,
$$\lambda(t)=\frac{\phi(t)}{\Phi(t)}.$$
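For illustration, the first order condition can be checked numerically: at a maximizer found with a generic optimizer, the weighted residuals are approximately orthogonal to the regressors. The data below are simulated and the use of scipy.optimize is an assumption made for the example, not part of the lecture:

```python
# Sketch: verify X' * (w * (y - Phi(X beta_hat))) is approximately zero at the MLE.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, -1.0])
y = (rng.normal(size=N) < X @ beta_true).astype(float)    # simulated probit data

negloglik = lambda b: -np.sum(norm.logcdf((2 * y - 1) * (X @ b)))
beta_hat = minimize(negloglik, np.zeros(2)).x

p = norm.cdf(X @ beta_hat)
w = norm.pdf(X @ beta_hat) / (p * (1 - p))                 # weights
print(X.T @ (w * (y - p)))                                  # approximately zero
```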
There is no analytical solution of the first order condition. One of the most common ways of solving it numerically is the Newton-Raphson method, which is an iterative method. Starting from an initial guess $\widehat{\beta}_0$ of the solution (e.g., $\widehat{\beta}_0=0$), we generate a sequence of guesses
$$\widehat{\beta}_{t+1}=\widehat{\beta}_t-\left[\nabla_{\beta\beta} l(\widehat{\beta}_t; y, X)\right]^{-1}\nabla_{\beta} l(\widehat{\beta}_t; y, X),$$
and we stop when numerical convergence is achieved (see Maximum likelihood algorithm for an introduction to numerical optimization methods and numerical convergence).
Define, for $i=1,\ldots,N$, the scalars
$$w_{i,t}=\lambda(q_i x_i\widehat{\beta}_t)\left[\lambda(q_i x_i\widehat{\beta}_t)+q_i x_i\widehat{\beta}_t\right]$$
and the $N\times 1$ vector
$$u_t=\begin{pmatrix}q_1\,\lambda(q_1 x_1\widehat{\beta}_t)\\ \vdots\\ q_N\,\lambda(q_N x_N\widehat{\beta}_t)\end{pmatrix}.$$
Denote by $W_t$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $w_{1,t}$, ..., $w_{N,t}$:
$$W_t=\begin{pmatrix}w_{1,t} & 0 & \cdots & 0\\ 0 & w_{2,t} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & w_{N,t}\end{pmatrix}.$$
The matrix $W_t$ is positive definite because all its diagonal entries are positive (see the comments about the Hessian above).
Finally, the matrix of inputs $X$ (the design matrix), defined by
$$X=\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_N\end{pmatrix},$$
is assumed to have full rank.
With the notation just introduced, we can write the score as
$$\nabla_{\beta} l(\widehat{\beta}_t; y, X)=X^{\top}u_t$$
and the Hessian as
$$\nabla_{\beta\beta} l(\widehat{\beta}_t; y, X)=-X^{\top}W_tX.$$
Therefore, the Newton-Raphson recursive formula becomes
$$\widehat{\beta}_{t+1}=\widehat{\beta}_t+\left(X^{\top}W_tX\right)^{-1}X^{\top}u_t.$$
The assumption that $X$ has full rank guarantees the existence of the inverse $\left(X^{\top}W_tX\right)^{-1}$. Furthermore, it ensures that the Hessian is negative definite, so that the log-likelihood is concave.
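A minimal sketch of these Newton-Raphson iterations in the matrix notation above, written in Python with NumPy and SciPy on simulated data (the variable names and convergence tolerance are illustrative assumptions):

```python
# Sketch: beta_{t+1} = beta_t + (X' W_t X)^{-1} X' u_t, starting from beta_0 = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = (rng.normal(size=N) < X @ np.array([0.5, -1.0])).astype(float)
q = 2 * y - 1

beta = np.zeros(X.shape[1])                # initial guess beta_0 = 0
for _ in range(25):
    t = q * (X @ beta)
    lam = norm.pdf(t) / norm.cdf(t)
    u = q * lam                            # entries of the vector u_t
    w = lam * (lam + t)                    # diagonal entries of W_t
    step = np.linalg.solve((X.T * w) @ X, X.T @ u)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:       # numerical convergence
        break
print(beta)
```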
As in the case of the logit classification model, it is straightforward to prove that the Newton-Raphson iterations for the probit model are equivalent to Iteratively Reweighted Least Squares (IRLS) iterations:
$$\widehat{\beta}_{t+1}=\left(X^{\top}W_tX\right)^{-1}X^{\top}W_tz_t,$$
where we perform a Weighted Least Squares (WLS) estimation, with weights $W_t$, of a linear regression of the dependent variables
$$z_t=X\widehat{\beta}_t+W_t^{-1}u_t$$
on the regressors $X$.
Write $X^{\top}u_t$ as
$$X^{\top}u_t=X^{\top}W_tW_t^{-1}u_t.$$
Then, the Newton-Raphson formula can be written as
$$\begin{aligned}
\widehat{\beta}_{t+1}&=\widehat{\beta}_t+\left(X^{\top}W_tX\right)^{-1}X^{\top}W_tW_t^{-1}u_t\\
&=\left(X^{\top}W_tX\right)^{-1}\left(X^{\top}W_tX\right)\widehat{\beta}_t+\left(X^{\top}W_tX\right)^{-1}X^{\top}W_tW_t^{-1}u_t\\
&=\left(X^{\top}W_tX\right)^{-1}X^{\top}W_t\left(X\widehat{\beta}_t+W_t^{-1}u_t\right)\\
&=\left(X^{\top}W_tX\right)^{-1}X^{\top}W_tz_t.
\end{aligned}$$
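A corresponding sketch of the IRLS iterations, in which each step is a weighted least squares regression of the working dependent variable $z_t$ on the regressors (same simulated setup as above; names are illustrative):

```python
# Sketch: IRLS update beta_{t+1} = (X' W_t X)^{-1} X' W_t z_t with z_t = X beta_t + W_t^{-1} u_t.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = (rng.normal(size=N) < X @ np.array([0.5, -1.0])).astype(float)
q = 2 * y - 1

beta = np.zeros(X.shape[1])
for _ in range(25):
    t = q * (X @ beta)
    lam = norm.pdf(t) / norm.cdf(t)
    w = lam * (lam + t)                    # weights W_t
    z = X @ beta + q * lam / w             # working dependent variable z_t
    beta_new = np.linalg.solve((X.T * w) @ X, (X.T * w) @ z)   # WLS estimate
    converged = np.max(np.abs(beta_new - beta)) < 1e-10
    beta = beta_new
    if converged:
        break
print(beta)                                 # same limit as the Newton-Raphson sketch
```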
The Hessian matrix derived above is usually employed to estimate the asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$:
$$\widehat{V}=\left[-\frac{1}{N}\nabla_{\beta\beta} l(\widehat{\beta}; y, X)\right]^{-1}=N\left(X^{\top}\widehat{W}X\right)^{-1},$$
where $\widehat{\beta}=\widehat{\beta}_T$ and $\widehat{W}=W_T$ ($T$ is the last step of the iterative procedure used to maximize the likelihood).
A proof of the fact that the inverse of the negative Hessian divided by the sample size converges to the asymptotic covariance matrix can be found in the lecture on estimating the covariance matrix of MLE estimators.
Given the above estimate of the asymptotic covariance matrix, the distribution of $\widehat{\beta}$ can be approximated by a normal distribution having mean equal to the true parameter $\beta$ and covariance matrix
$$\frac{\widehat{V}}{N}=\left(X^{\top}\widehat{W}X\right)^{-1}.$$
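For illustration, approximate standard errors for the coefficients can be read off the square roots of the diagonal of this matrix. A minimal sketch on simulated data (the generic optimizer and all names are assumptions made for the example):

```python
# Sketch: covariance estimate (X' W_hat X)^{-1} and standard errors at the MLE.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = (rng.normal(size=N) < X @ np.array([0.5, -1.0])).astype(float)
q = 2 * y - 1

negloglik = lambda b: -np.sum(norm.logcdf(q * (X @ b)))
beta_hat = minimize(negloglik, np.zeros(2)).x

t = q * (X @ beta_hat)
lam = norm.pdf(t) / norm.cdf(t)
w = lam * (lam + t)                         # diagonal entries of W_hat
cov = np.linalg.inv((X.T * w) @ X)          # approximate covariance of beta_hat
print(beta_hat, np.sqrt(np.diag(cov)))      # estimates and standard errors
```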
Amemiya, T. (1985) Advanced econometrics, Harvard University Press.