Probit classification model - Maximum likelihood

This lecture explains how to perform maximum likelihood estimation of the coefficients of a probit model (also called probit regression).

Before reading this lecture, it may be helpful to read the introductory lectures about maximum likelihood estimation and about the probit model.

Table of contents

Main assumptions and notation
The likelihood
The log-likelihood
The score
The Hessian
The first-order condition
Newton-Raphson method
Iteratively reweighted least squares
Covariance matrix of the estimator
References

Main assumptions and notation

In a probit model, the output variable $y_{i}$ is a Bernoulli random variable (i.e., a discrete variable that can take only two values, either $1$ or $0$).

Conditional on a $1\times K$ vector of inputs $x_{i}$, we have that

$$\mathrm{P}(y_{i}=1\mid x_{i})=F(x_{i}\beta)$$

where $F(t)$ is the cumulative distribution function of the standard normal distribution and $\beta$ is a $K\times 1$ vector of coefficients.

We assume that a sample of independently and identically distributed input-output couples $(y_{i},x_{i})$, for $i=1,\ldots ,N$, is observed and used to estimate the vector $\beta$.

The likelihood

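The likelihood of a single observation $(y_{i},x_{i})$ is

$$L(\beta ;y_{i},x_{i})=[F(x_{i}\beta)]^{y_{i}}[1-F(x_{i}\beta)]^{1-y_{i}}$$

In fact, note that

when $y_{i}=1$ , then $1-y_{i}=0$ and $L(\beta ;y_{i},x_{i})=F(x_{i}\beta)=\mathrm{P}(y_{i}=1\mid x_{i})$;
when $y_{i}=0$ , then $1-y_{i}=1$ and $L(\beta ;y_{i},x_{i})=1-F(x_{i}\beta)=\mathrm{P}(y_{i}=0\mid x_{i})$.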

Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:

$$L(\beta ;y,X)=\prod_{i=1}^{N}[F(x_{i}\beta)]^{y_{i}}[1-F(x_{i}\beta)]^{1-y_{i}}$$

where $y$ is the $N\times 1$ vector of all outputs and $X$ is the $N\times K$ matrix of all inputs.

Now, define $q_{i}=2y_{i}-1$ so that

$q_{i}=1$ if $y_{i}=1$ ;
$q_{i}=-1$ if $y_{i}=0$ .

By using the newly defined variables $q_{i}$ , we can also write the likelihood in the following more compact form:

$$L(\beta ;y,X)=\prod_{i=1}^{N}F(q_{i}x_{i}\beta)$$

Proof

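First note that when $y_{i}=1$ , then $q_{i}=1$ and

$$[F(x_{i}\beta)]^{y_{i}}=[F(q_{i}x_{i}\beta)]^{y_{i}}\qquad \text{(a)}$$

Furthermore, when $y_{i}=0$ , then $q_{i}=-1$ and

$$[F(x_{i}\beta)]^{y_{i}}=1=[F(q_{i}x_{i}\beta)]^{y_{i}}\qquad \text{(b)}$$

Since $y_{i}$ can take only two values ($0$ and $1$), (a) and (b) imply that

$$[F(x_{i}\beta)]^{y_{i}}=[F(q_{i}x_{i}\beta)]^{y_{i}}$$

for all $y_{i}$ . Moreover, the symmetry of the standard normal distribution around $0$ implies that

$$1-F(t)=F(-t)$$

So, when $y_{i}=0$ , then $q_{i}=-1$ and

$$[1-F(x_{i}\beta)]^{1-y_{i}}=[F(-x_{i}\beta)]^{1-y_{i}}=[F(q_{i}x_{i}\beta)]^{1-y_{i}}\qquad \text{(c)}$$

When $y_{i}=1$ , then $q_{i}=1$ and

$$[1-F(x_{i}\beta)]^{1-y_{i}}=1=[F(q_{i}x_{i}\beta)]^{1-y_{i}}\qquad \text{(d)}$$

Thus, it follows from (c) and (d) that

$$[1-F(x_{i}\beta)]^{1-y_{i}}=[F(q_{i}x_{i}\beta)]^{1-y_{i}}$$

for all $y_{i}$ . Thanks to these facts, we can write the likelihood as

$$L(\beta ;y,X)=\prod_{i=1}^{N}[F(x_{i}\beta)]^{y_{i}}[1-F(x_{i}\beta)]^{1-y_{i}}=\prod_{i=1}^{N}[F(q_{i}x_{i}\beta)]^{y_{i}}[F(q_{i}x_{i}\beta)]^{1-y_{i}}=\prod_{i=1}^{N}F(q_{i}x_{i}\beta)$$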
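The equivalence of the two forms of the likelihood can also be checked numerically. The following is a minimal sketch, not part of the original lecture: it uses only the Python standard library (`statistics.NormalDist` supplies the standard normal cumulative distribution function), the function names are illustrative, and the data and coefficient values are made up.

```python
from math import isclose, prod
from statistics import NormalDist

F = NormalDist().cdf  # standard normal CDF


def dot(x_i, beta):
    """Inner product x_i * beta for one observation."""
    return sum(x * b for x, b in zip(x_i, beta))


def likelihood_standard(beta, y, X):
    """Product over i of F(x_i beta)^y_i * (1 - F(x_i beta))^(1 - y_i)."""
    return prod(
        F(dot(x_i, beta)) ** y_i * (1.0 - F(dot(x_i, beta))) ** (1 - y_i)
        for y_i, x_i in zip(y, X)
    )


def likelihood_compact(beta, y, X):
    """Product over i of F(q_i x_i beta), with q_i = 2 y_i - 1."""
    return prod(F((2 * y_i - 1) * dot(x_i, beta)) for y_i, x_i in zip(y, X))


# Hypothetical toy data: an intercept column plus one regressor.
y = [1, 0, 1, 1, 0]
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1], [1.0, -0.3]]
beta = [0.2, 0.7]

lik_standard = likelihood_standard(beta, y, X)
lik_compact = likelihood_compact(beta, y, X)
print(isclose(lik_standard, lik_compact, rel_tol=1e-9))
```

The two values agree up to floating-point rounding, because $1-F(t)$ and $F(-t)$ are the same number.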

The log-likelihood

The log-likelihood is

$$l(\beta ;y,X)=\sum_{i=1}^{N}\left[ y_{i}\ln F(x_{i}\beta)+(1-y_{i})\ln (1-F(x_{i}\beta))\right]$$

Proof

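It is computed as follows:

$$\begin{aligned} l(\beta ;y,X) &=\ln L(\beta ;y,X) \\ &=\ln \prod_{i=1}^{N}[F(x_{i}\beta)]^{y_{i}}[1-F(x_{i}\beta)]^{1-y_{i}} \\ &=\sum_{i=1}^{N}\ln \left( [F(x_{i}\beta)]^{y_{i}}[1-F(x_{i}\beta)]^{1-y_{i}}\right) \\ &=\sum_{i=1}^{N}\left[ y_{i}\ln F(x_{i}\beta)+(1-y_{i})\ln (1-F(x_{i}\beta))\right] \end{aligned}$$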

By using the $q_{i}$ variables, the log-likelihood can also be written as

$$l(\beta ;y,X)=\sum_{i=1}^{N}\ln F(q_{i}x_{i}\beta)$$

Proof

This is derived from the compact form of the likelihood:

$$l(\beta ;y,X)=\ln \prod_{i=1}^{N}F(q_{i}x_{i}\beta)=\sum_{i=1}^{N}\ln F(q_{i}x_{i}\beta)$$
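A practical reason for working with the log-likelihood rather than the likelihood is that a product of many probabilities quickly underflows in floating-point arithmetic, while the sum of their logarithms does not. The sketch below illustrates this under stated assumptions: the data are made up, the function name is illustrative, and the coefficient vector is set to zero so that every observation contributes exactly $F(0)=0.5$.

```python
from math import log, prod
from statistics import NormalDist

F = NormalDist().cdf  # standard normal CDF


def log_likelihood(beta, y, X):
    """Sum over i of ln F(q_i x_i beta), with q_i = 2 y_i - 1."""
    total = 0.0
    for y_i, x_i in zip(y, X):
        q_i = 2 * y_i - 1
        total += log(F(q_i * sum(x * b for x, b in zip(x_i, beta))))
    return total


# Hypothetical data: N = 2000 observations, evaluated at beta = 0,
# where every observation contributes F(0) = 0.5 to the likelihood.
N = 2000
y = [i % 2 for i in range(N)]
X = [[1.0, float(i % 7 - 3)] for i in range(N)]
beta = [0.0, 0.0]

loglik = log_likelihood(beta, y, X)   # N * ln(0.5), about -1386.29
lik = prod(F(0.0) for _ in range(N))  # 0.5 ** N underflows to 0.0
print(loglik, lik)
```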

The score

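The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is

$$\nabla_{\beta}l(\beta ;y,X)=\sum_{i=1}^{N}\frac{f(x_{i}\beta)}{F(x_{i}\beta)[1-F(x_{i}\beta)]}[y_{i}-F(x_{i}\beta)]x_{i}^{\top}$$

where $f(t)$ is the probability density function of the standard normal distribution.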

Proof

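This is obtained as follows:

$$\begin{aligned} \nabla_{\beta}l(\beta ;y,X) &=\sum_{i=1}^{N}\left[ y_{i}\nabla_{\beta}\ln F(x_{i}\beta)+(1-y_{i})\nabla_{\beta}\ln (1-F(x_{i}\beta))\right] \\ &\overset{(A)}{=}\sum_{i=1}^{N}\left[ y_{i}\frac{f(x_{i}\beta)}{F(x_{i}\beta)}-(1-y_{i})\frac{f(x_{i}\beta)}{1-F(x_{i}\beta)}\right] x_{i}^{\top} \\ &=\sum_{i=1}^{N}\frac{f(x_{i}\beta)}{F(x_{i}\beta)[1-F(x_{i}\beta)]}\left[ y_{i}(1-F(x_{i}\beta))-(1-y_{i})F(x_{i}\beta)\right] x_{i}^{\top} \\ &=\sum_{i=1}^{N}\frac{f(x_{i}\beta)}{F(x_{i}\beta)[1-F(x_{i}\beta)]}[y_{i}-F(x_{i}\beta)]x_{i}^{\top} \end{aligned}$$

where in step $(A)$ we have used the fact that the probability density function is the derivative of the cumulative distribution function, that is,

$$\frac{dF(t)}{dt}=f(t)$$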

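By using the $q_{i}$ variables, the score can also be written as

$$\nabla_{\beta}l(\beta ;y,X)=\sum_{i=1}^{N}\lambda_{i}x_{i}^{\top}$$

where

$$\lambda_{i}=\frac{q_{i}f(q_{i}x_{i}\beta)}{F(q_{i}x_{i}\beta)}$$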

Proof

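This is demonstrated as follows:

$$\nabla_{\beta}l(\beta ;y,X)=\nabla_{\beta}\sum_{i=1}^{N}\ln F(q_{i}x_{i}\beta)=\sum_{i=1}^{N}\frac{f(q_{i}x_{i}\beta)}{F(q_{i}x_{i}\beta)}q_{i}x_{i}^{\top}=\sum_{i=1}^{N}\lambda_{i}x_{i}^{\top}$$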
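The two expressions for the score can be compared with each other and with a finite-difference approximation of the gradient of the log-likelihood. The following sketch is only an illustration: it relies on the Python standard library, all function names are hypothetical, and the data and coefficients are made up.

```python
from math import log
from statistics import NormalDist

_std = NormalDist()
F, f = _std.cdf, _std.pdf  # standard normal CDF and PDF


def dot(x_i, beta):
    return sum(x * b for x, b in zip(x_i, beta))


def log_likelihood(beta, y, X):
    return sum(log(F((2 * y_i - 1) * dot(x_i, beta))) for y_i, x_i in zip(y, X))


def score_standard(beta, y, X):
    """Sum of f / (F (1 - F)) * (y_i - F) * x_i, all evaluated at x_i beta."""
    g = [0.0] * len(beta)
    for y_i, x_i in zip(y, X):
        t = dot(x_i, beta)
        w = f(t) / (F(t) * (1.0 - F(t))) * (y_i - F(t))
        g = [g_k + w * x for g_k, x in zip(g, x_i)]
    return g


def score_compact(beta, y, X):
    """Sum of lambda_i * x_i, lambda_i = q_i f(q_i x_i beta) / F(q_i x_i beta)."""
    g = [0.0] * len(beta)
    for y_i, x_i in zip(y, X):
        q_i = 2 * y_i - 1
        t = q_i * dot(x_i, beta)
        lam = q_i * f(t) / F(t)
        g = [g_k + lam * x for g_k, x in zip(g, x_i)]
    return g


def score_numeric(beta, y, X, h=1e-6):
    """Central finite differences of the log-likelihood."""
    g = []
    for k in range(len(beta)):
        up = [b + h * (j == k) for j, b in enumerate(beta)]
        dn = [b - h * (j == k) for j, b in enumerate(beta)]
        g.append((log_likelihood(up, y, X) - log_likelihood(dn, y, X)) / (2 * h))
    return g


# Hypothetical toy data: an intercept column plus one regressor.
y = [1, 0, 1, 1, 0]
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1], [1.0, -0.3]]
beta = [0.2, 0.7]

g_standard = score_standard(beta, y, X)
g_compact = score_compact(beta, y, X)
g_numeric = score_numeric(beta, y, X)
print(g_standard, g_compact, g_numeric)
```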

The Hessian

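The Hessian, that is, the matrix of second derivatives, is

$$\nabla_{\beta \beta}l(\beta ;y,X)=-\sum_{i=1}^{N}\lambda_{i}(x_{i}\beta +\lambda_{i})x_{i}^{\top}x_{i}$$

where $\lambda_{i}=q_{i}f(q_{i}x_{i}\beta)/F(q_{i}x_{i}\beta)$.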

Proof

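It can be proved as follows:

$$\begin{aligned} \nabla_{\beta \beta}l(\beta ;y,X) &=\sum_{i=1}^{N}x_{i}^{\top}\frac{\partial \lambda_{i}}{\partial \beta^{\top}} \\ &=\sum_{i=1}^{N}x_{i}^{\top}\,q_{i}\frac{f^{\prime}(q_{i}x_{i}\beta)F(q_{i}x_{i}\beta)-[f(q_{i}x_{i}\beta)]^{2}}{[F(q_{i}x_{i}\beta)]^{2}}\,q_{i}x_{i} \\ &=\sum_{i=1}^{N}\left[ -q_{i}x_{i}\beta \frac{f(q_{i}x_{i}\beta)}{F(q_{i}x_{i}\beta)}-\frac{[f(q_{i}x_{i}\beta)]^{2}}{[F(q_{i}x_{i}\beta)]^{2}}\right] x_{i}^{\top}x_{i} \\ &=-\sum_{i=1}^{N}\lambda_{i}(x_{i}\beta +\lambda_{i})x_{i}^{\top}x_{i} \end{aligned}$$

where we have used the facts that $f^{\prime}(t)=-tf(t)$ for the standard normal density, that $q_{i}^{2}=1$, and that $\lambda_{i}=q_{i}f(q_{i}x_{i}\beta)/F(q_{i}x_{i}\beta)$.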

It can be proved (see, e.g., Amemiya 1985) that the quantity $\lambda_{i}(x_{i}\beta +\lambda_{i})$ is always positive.
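The positivity of the Hessian weights can be illustrated numerically. The sketch below is not a proof: it evaluates the weight $\lambda_{i}(x_{i}\beta +\lambda_{i})$, with $\lambda_{i}=q_{i}f(q_{i}x_{i}\beta)/F(q_{i}x_{i}\beta)$, on a grid of values of $x_{i}\beta$ restricted to $[-4,4]$ (an arbitrary range chosen to stay clear of floating-point trouble in the tails), and then assembles the Hessian for made-up data. Function names are illustrative.

```python
from statistics import NormalDist

_std = NormalDist()
F, f = _std.cdf, _std.pdf  # standard normal CDF and PDF


def hessian_weight(q_i, xb):
    """lambda_i * (x_i beta + lambda_i), lambda_i = q_i f(q_i xb) / F(q_i xb)."""
    lam = q_i * f(q_i * xb) / F(q_i * xb)
    return lam * (xb + lam)


def hessian(beta, y, X):
    """Minus the sum over i of weight_i * x_i' x_i, as a K x K list of lists."""
    K = len(beta)
    H = [[0.0] * K for _ in range(K)]
    for y_i, x_i in zip(y, X):
        xb = sum(x * b for x, b in zip(x_i, beta))
        w = hessian_weight(2 * y_i - 1, xb)
        for j in range(K):
            for k in range(K):
                H[j][k] -= w * x_i[j] * x_i[k]
    return H


# Weights on a grid of values of x_i beta, for both possible values of q_i.
grid = [k / 10 for k in range(-40, 41)]
weights = [hessian_weight(q, t) for q in (1, -1) for t in grid]
print(min(weights) > 0)

# Hypothetical toy data: an intercept column plus one regressor.
y = [1, 0, 1, 1, 0]
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1], [1.0, -0.3]]
beta = [0.2, 0.7]
H = hessian(beta, y, X)
print(H)
```

For a $2\times 2$ symmetric matrix, a negative top-left entry together with a positive determinant is enough to conclude negative definiteness, which is what one expects here since all weights are positive and the two columns of the toy input matrix are linearly independent.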

The first-order condition

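The maximum likelihood estimator $\widehat{\beta}$ of the parameter $\beta$ is obtained as a solution of the following maximization problem:

$$\widehat{\beta}=\arg \max_{\beta}\,l(\beta ;y,X)$$

As for the logit model, the maximization problem for the probit model is not guaranteed to have a solution, but when it has one, at the maximum the score vector satisfies the first order condition

$$\nabla_{\beta}l(\widehat{\beta};y,X)=0$$

that is,

$$\sum_{i=1}^{N}\frac{f(x_{i}\widehat{\beta})}{F(x_{i}\widehat{\beta})[1-F(x_{i}\widehat{\beta})]}[y_{i}-F(x_{i}\widehat{\beta})]x_{i}^{\top}=0$$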

The quantity $y_{i}-F(x_{i}\widehat{\beta})$ is the residual, that is, the forecasting error committed by using $F(x_{i}\widehat{\beta})$ to predict $y_{i}$ . Note the difference with respect to the logit model:

in the logit model, residuals need to be orthogonal to the predictors $x_{i}$ ;
in the probit model, the orthogonality condition holds for weighted residuals; the weight assigned to each residual is $\dfrac{f(x_{i}\widehat{\beta})}{F(x_{i}\widehat{\beta})[1-F(x_{i}\widehat{\beta})]}$.

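By using the $q_{i}$ variables and the second expression for the score derived above, the first order condition can also be written as

$$\sum_{i=1}^{N}\lambda_{i}x_{i}^{\top}=0$$

where

$$\lambda_{i}=\frac{q_{i}f(q_{i}x_{i}\widehat{\beta})}{F(q_{i}x_{i}\widehat{\beta})}$$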

Newton-Raphson method

There is no analytical solution of the first order condition. One of the most common ways of solving it numerically is the Newton-Raphson method, which is iterative. Starting from an initial guess of the solution (e.g., $\beta_{0}=0$), we generate a sequence of guesses

$$\beta_{t}=\beta_{t-1}-\left[ \nabla_{\beta \beta}l(\beta_{t-1};y,X)\right]^{-1}\nabla_{\beta}l(\beta_{t-1};y,X)$$

and we stop when numerical convergence is achieved (see Maximum likelihood algorithm for an introduction to numerical optimization methods and numerical convergence).

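Define

$$\lambda_{it}=\frac{q_{i}f(q_{i}x_{i}\beta_{t})}{F(q_{i}x_{i}\beta_{t})}$$

and the $N\times 1$ vector

$$\lambda_{t}=\begin{bmatrix} \lambda_{1t} & \lambda_{2t} & \cdots & \lambda_{Nt} \end{bmatrix}^{\top}$$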

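Denote by $W_{t}$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $\lambda_{1t}(x_{1}\beta_{t}+\lambda_{1t})$, ..., $\lambda_{Nt}(x_{N}\beta_{t}+\lambda_{Nt})$:

$$W_{t}=\begin{bmatrix} \lambda_{1t}(x_{1}\beta_{t}+\lambda_{1t}) & 0 & \cdots & 0 \\ 0 & \lambda_{2t}(x_{2}\beta_{t}+\lambda_{2t}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{Nt}(x_{N}\beta_{t}+\lambda_{Nt}) \end{bmatrix}$$

The matrix $W_{t}$ is positive definite because all its diagonal entries are positive (see the comments about the Hessian above).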

Finally, the $N\times K$ matrix of inputs (the design matrix) defined by

$$X=\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{N} \end{bmatrix}$$

is assumed to have full rank.

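With the notation just introduced, we can write the score as

$$\nabla_{\beta}l(\beta_{t-1};y,X)=X^{\top}\lambda_{t-1}$$

and the Hessian as

$$\nabla_{\beta \beta}l(\beta_{t-1};y,X)=-X^{\top}W_{t-1}X$$

Therefore, the Newton-Raphson recursive formula becomes

$$\beta_{t}=\beta_{t-1}+(X^{\top}W_{t-1}X)^{-1}X^{\top}\lambda_{t-1}$$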

The assumption that $X$ has full rank guarantees the existence of the inverse $(X^{\top}W_{t-1}X)^{-1}$. Furthermore, it ensures that the Hessian is negative definite, so that the log-likelihood is concave.
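The recursion can be sketched in a few lines of code. What follows is a minimal illustration under stated assumptions, not a production implementation: it uses only the Python standard library, solves the linear system with a small hand-written Gaussian elimination routine, starts from a zero coefficient vector, and stops when the largest component of the step falls below a tolerance (one of several possible convergence criteria). The function names and the data are made up; the data are chosen so that the two classes overlap and a finite maximizer exists.

```python
from statistics import NormalDist

_std = NormalDist()
F, f = _std.cdf, _std.pdf  # standard normal CDF and PDF


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def score_and_information(beta, y, X):
    """Return (X' lambda, X' W X), both evaluated at beta."""
    K = len(beta)
    score = [0.0] * K
    info = [[0.0] * K for _ in range(K)]
    for y_i, x_i in zip(y, X):
        q_i = 2 * y_i - 1
        xb = sum(x * b for x, b in zip(x_i, beta))
        lam = q_i * f(q_i * xb) / F(q_i * xb)
        w = lam * (xb + lam)  # diagonal entry of W
        for j in range(K):
            score[j] += lam * x_i[j]
            for k in range(K):
                info[j][k] += w * x_i[j] * x_i[k]
    return score, info


def newton_raphson_probit(y, X, tol=1e-10, max_iter=50):
    beta = [0.0] * len(X[0])  # initial guess: all coefficients zero
    for _ in range(max_iter):
        score, info = score_and_information(beta, y, X)
        step = solve(info, score)  # (X' W X)^{-1} X' lambda
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Hypothetical data: intercept plus one regressor, classes overlapping.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
X = [[1.0, x] for x in xs]

beta_hat = newton_raphson_probit(y, X)
print(beta_hat)
```

At the returned vector the score should be numerically zero, which is the first order condition.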

Iteratively reweighted least squares

As for the logit classification model, it is straightforward to prove that, for the probit model, the Newton-Raphson iterations are equivalent to Iteratively Reweighted Least Squares (IRLS) iterations:

$$\beta_{t}=(X^{\top}W_{t-1}X)^{-1}X^{\top}W_{t-1}\left( X\beta_{t-1}+W_{t-1}^{-1}\lambda_{t-1}\right)$$

where we perform a Weighted Least Squares (WLS) estimation with weights $W_{t-1}$ of a linear regression of the dependent variables $X\beta_{t-1}+W_{t-1}^{-1}\lambda_{t-1}$ on the regressors $X$.

Proof

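Write $\beta_{t-1}$ as

$$\beta_{t-1}=(X^{\top}W_{t-1}X)^{-1}X^{\top}W_{t-1}X\beta_{t-1}$$

Then, the Newton-Raphson formula can be written as

$$\begin{aligned} \beta_{t} &=\beta_{t-1}+(X^{\top}W_{t-1}X)^{-1}X^{\top}\lambda_{t-1} \\ &=(X^{\top}W_{t-1}X)^{-1}X^{\top}W_{t-1}X\beta_{t-1}+(X^{\top}W_{t-1}X)^{-1}X^{\top}W_{t-1}W_{t-1}^{-1}\lambda_{t-1} \\ &=(X^{\top}W_{t-1}X)^{-1}X^{\top}W_{t-1}\left( X\beta_{t-1}+W_{t-1}^{-1}\lambda_{t-1}\right) \end{aligned}$$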
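The equivalence can be checked on a single iteration: starting from the same coefficient vector, a Newton-Raphson update and a WLS regression of the working dependent variable on the regressors should return the same result up to rounding. The sketch below assumes made-up data and an arbitrary starting vector, uses only the Python standard library, and all function names are illustrative.

```python
from statistics import NormalDist

_std = NormalDist()
F, f = _std.cdf, _std.pdf  # standard normal CDF and PDF


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def pieces(beta, y_i, x_i):
    """Return (x_i beta, lambda_i, w_i) for one observation."""
    q_i = 2 * y_i - 1
    xb = sum(x * b for x, b in zip(x_i, beta))
    lam = q_i * f(q_i * xb) / F(q_i * xb)
    return xb, lam, lam * (xb + lam)


def weighted_normal_equations(beta, y, X, rhs):
    """Build X' W X and X' v, where v_i = rhs(xb, lam, w) for each observation."""
    K = len(beta)
    XtWX = [[0.0] * K for _ in range(K)]
    Xtv = [0.0] * K
    for y_i, x_i in zip(y, X):
        xb, lam, w = pieces(beta, y_i, x_i)
        v = rhs(xb, lam, w)
        for j in range(K):
            Xtv[j] += x_i[j] * v
            for k in range(K):
                XtWX[j][k] += w * x_i[j] * x_i[k]
    return XtWX, Xtv


def newton_step(beta, y, X):
    """beta + (X' W X)^{-1} X' lambda."""
    XtWX, Xtlam = weighted_normal_equations(beta, y, X, lambda xb, lam, w: lam)
    return [b + s for b, s in zip(beta, solve(XtWX, Xtlam))]


def irls_step(beta, y, X):
    """WLS of z = X beta + W^{-1} lambda on X with weights W."""
    XtWX, XtWz = weighted_normal_equations(
        beta, y, X, lambda xb, lam, w: w * (xb + lam / w)
    )
    return solve(XtWX, XtWz)


# Hypothetical data and starting point.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
X = [[1.0, x] for x in xs]
beta_prev = [0.1, 0.3]

beta_nr = newton_step(beta_prev, y, X)
beta_irls = irls_step(beta_prev, y, X)
print(beta_nr, beta_irls)
```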

Covariance matrix of the estimator

The Hessian matrix derived above is usually employed to estimate the asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$:

$$\widehat{V}=N(X^{\top}WX)^{-1}$$

where $\widehat{\beta}=\beta_{T}$ and $W=W_{T}$ ($T$ is the last step of the iterative procedure used to maximize the likelihood).

A proof of the fact that the inverse of the negative Hessian, divided by the sample size, converges to the asymptotic covariance matrix can be found in the lecture on estimating the covariance matrix of MLE estimators.

Given the above estimate of the asymptotic covariance matrix, the distribution of $\widehat{\beta}$ can be approximated by a normal distribution having mean equal to the true parameter and covariance matrix

$$\frac{1}{N}\widehat{V}=(X^{\top}WX)^{-1}$$
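In practice, the approximate covariance matrix $(X^{\top}WX)^{-1}$, with $W$ evaluated at the final iterate, is a by-product of the Newton-Raphson iterations, and the square roots of its diagonal entries serve as standard errors of the coefficients. The following sketch illustrates the computation under stated assumptions: the data are made up, the function names are illustrative, the stopping rule is an arbitrary choice, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

_std = NormalDist()
F, f = _std.cdf, _std.pdf  # standard normal CDF and PDF


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def inverse(A):
    """Invert A column by column by solving A x = e_k."""
    n = len(A)
    cols = [solve(A, [float(j == k) for j in range(n)]) for k in range(n)]
    return [[cols[k][j] for k in range(n)] for j in range(n)]


def score_and_information(beta, y, X):
    """Return (X' lambda, X' W X), both evaluated at beta."""
    K = len(beta)
    score = [0.0] * K
    info = [[0.0] * K for _ in range(K)]
    for y_i, x_i in zip(y, X):
        q_i = 2 * y_i - 1
        xb = sum(x * b for x, b in zip(x_i, beta))
        lam = q_i * f(q_i * xb) / F(q_i * xb)
        w = lam * (xb + lam)
        for j in range(K):
            score[j] += lam * x_i[j]
            for k in range(K):
                info[j][k] += w * x_i[j] * x_i[k]
    return score, info


def fit_probit(y, X, tol=1e-10, max_iter=50):
    """Newton-Raphson from a zero start; returns the final iterate."""
    beta = [0.0] * len(X[0])
    for _ in range(max_iter):
        score, info = score_and_information(beta, y, X)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Hypothetical data: intercept plus one regressor, classes overlapping.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
X = [[1.0, x] for x in xs]

beta_hat = fit_probit(y, X)
_, info_hat = score_and_information(beta_hat, y, X)  # X' W X at beta_hat
cov = inverse(info_hat)                              # (X' W X)^{-1}
std_errors = [sqrt(cov[k][k]) for k in range(len(beta_hat))]
print(beta_hat, std_errors)
```

With only ten observations, as in this toy data set, the normal approximation is rough; the example is meant to show the mechanics rather than to support inference.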

References

Amemiya, T. (1985) Advanced econometrics, Harvard University Press.

How to cite

Please cite as:

Taboga, Marco (2021). "Probit classification model - Maximum likelihood", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/probit-model-maximum-likelihood.