
Probit classification model - Maximum likelihood

by Marco Taboga, PhD

This lecture explains how to perform maximum likelihood estimation of the coefficients of a probit model (also called probit regression).

Before reading this lecture, it may be helpful to read the introductory lectures about maximum likelihood estimation and about the probit model.


Main assumptions and notation

In a probit model, the output variable $y_{i}$ is a Bernoulli random variable (i.e., a discrete variable that can take only two values, either 1 or 0).

Conditional on a $1\times K$ vector of inputs $x_{i}$, we have that
$$\mathrm{P}(y_{i}=1\mid x_{i})=F(x_{i}\beta )$$
where $F(t)$ is the cumulative distribution function of the standard normal distribution and $\beta $ is a $K\times 1$ vector of coefficients.

We assume that a sample of independently and identically distributed input-output couples $(x_{i},y_{i})$, for $i=1,\ldots ,N$, is observed and used to estimate the vector $\beta $.
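To make the assumptions concrete, here is a minimal sketch that simulates a sample from a probit model. The sample size, the coefficient values, the seed, and the names `norm_cdf`, `beta_true`, `X`, `y` are illustrative assumptions, not part of the lecture; NumPy is simply one convenient choice.

```python
import math
import numpy as np

def norm_cdf(t):
    """Standard normal CDF F(t), computed via the complementary error function."""
    erfc = np.vectorize(math.erfc)
    return 0.5 * erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

rng = np.random.default_rng(0)            # fixed seed, only for reproducibility
N, K = 500, 2                             # hypothetical sample size and number of inputs

# N x K matrix of inputs: row i is the 1 x K vector x_i (a constant and one regressor)
X = np.column_stack([np.ones(N), rng.normal(size=N)])

beta_true = np.array([0.5, -1.0])         # hypothetical K x 1 coefficient vector
p = norm_cdf(X @ beta_true)               # P(y_i = 1 | x_i) = F(x_i beta)
y = (rng.uniform(size=N) < p).astype(int) # independent Bernoulli outputs (0 or 1)
```

Each `y[i]` equals 1 with probability `p[i]` and 0 otherwise, so the couples `(X[i], y[i])` are IID draws that satisfy the model's conditional probability.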

The likelihood

The likelihood of a single observation $(y_{i},x_{i})$ is
$$L(\beta ;y_{i},x_{i})=[F(x_{i}\beta )]^{y_{i}}[1-F(x_{i}\beta )]^{1-y_{i}}$$

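In fact, note that when $y_{i}=1$ the expression above reduces to $F(x_{i}\beta )$, which is the probability that $y_{i}=1$, and when $y_{i}=0$ it reduces to $1-F(x_{i}\beta )$, which is the probability that $y_{i}=0$.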

Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta ;y,X)=\prod_{i=1}^{N}[F(x_{i}\beta )]^{y_{i}}[1-F(x_{i}\beta )]^{1-y_{i}}$$
where $y$ is the $N\times 1$ vector of all outputs and $X$ is the $N\times K$ matrix of all inputs.

Now, define $q_{i}=2y_{i}-1$ so that $q_{i}=1$ when $y_{i}=1$ and $q_{i}=-1$ when $y_{i}=0$.

By using the newly defined variables $q_{i}$, we can also write the likelihood in the following more compact form:
$$L(\beta ;y,X)=\prod_{i=1}^{N}F(q_{i}x_{i}\beta )$$


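First note that when $y_{i}=1$, then $q_{i}=1$ and
$$[F(x_{i}\beta )]^{y_{i}}=F(x_{i}\beta )=F(q_{i}x_{i}\beta )=[F(q_{i}x_{i}\beta )]^{y_{i}}\qquad \text{(a)}$$
Furthermore, when $y_{i}=0$, then $q_{i}=-1$ and
$$[F(x_{i}\beta )]^{y_{i}}=1=[F(q_{i}x_{i}\beta )]^{y_{i}}\qquad \text{(b)}$$
Since $y_{i}$ can take only two values (0 and 1), (a) and (b) imply that
$$[F(x_{i}\beta )]^{y_{i}}=[F(q_{i}x_{i}\beta )]^{y_{i}}$$
for all $y_{i}$.

Moreover, the symmetry of the standard normal distribution around 0 implies that
$$1-F(t)=F(-t)$$
So, when $y_{i}=0$, then $q_{i}=-1$ and
$$[1-F(x_{i}\beta )]^{1-y_{i}}=1-F(x_{i}\beta )=F(-x_{i}\beta )=F(q_{i}x_{i}\beta )=[F(q_{i}x_{i}\beta )]^{1-y_{i}}\qquad \text{(c)}$$
When $y_{i}=1$, then $q_{i}=1$ and
$$[1-F(x_{i}\beta )]^{1-y_{i}}=1=[F(q_{i}x_{i}\beta )]^{1-y_{i}}\qquad \text{(d)}$$
Thus, it follows from (c) and (d) that
$$[1-F(x_{i}\beta )]^{1-y_{i}}=[F(q_{i}x_{i}\beta )]^{1-y_{i}}$$
for all $y_{i}$.

Thanks to these facts, we can write the likelihood as
$$L(\beta ;y,X)=\prod_{i=1}^{N}[F(x_{i}\beta )]^{y_{i}}[1-F(x_{i}\beta )]^{1-y_{i}}=\prod_{i=1}^{N}[F(q_{i}x_{i}\beta )]^{y_{i}}[F(q_{i}x_{i}\beta )]^{1-y_{i}}=\prod_{i=1}^{N}F(q_{i}x_{i}\beta )$$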

The log-likelihood

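The log-likelihood is
$$l(\beta ;y,X)=\sum_{i=1}^{N}\left[ y_{i}\ln F(x_{i}\beta )+(1-y_{i})\ln \left( 1-F(x_{i}\beta )\right) \right]$$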


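It is computed as follows:
$$l(\beta ;y,X)=\ln L(\beta ;y,X)=\ln \prod_{i=1}^{N}[F(x_{i}\beta )]^{y_{i}}[1-F(x_{i}\beta )]^{1-y_{i}}=\sum_{i=1}^{N}\left[ y_{i}\ln F(x_{i}\beta )+(1-y_{i})\ln \left( 1-F(x_{i}\beta )\right) \right]$$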


By using the $q_{i}$ variables, the log-likelihood can also be written as
$$l(\beta ;y,X)=\sum_{i=1}^{N}\ln F(q_{i}x_{i}\beta )$$


This is derived from the compact form of the likelihood:
$$l(\beta ;y,X)=\ln \prod_{i=1}^{N}F(q_{i}x_{i}\beta )=\sum_{i=1}^{N}\ln F(q_{i}x_{i}\beta )$$
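As a numerical sanity check, the sketch below evaluates the probit log-likelihood in both ways, as the sum of $y_{i}\ln F(x_{i}\beta )+(1-y_{i})\ln (1-F(x_{i}\beta ))$ and as the sum of $\ln F(q_{i}x_{i}\beta )$ with $q_{i}=2y_{i}-1$, and confirms that the two agree. The tiny data set, the value of `beta`, and the function names `loglik` and `loglik_q` are made up for illustration.

```python
import math
import numpy as np

def norm_cdf(t):
    """Standard normal CDF F(t)."""
    erfc = np.vectorize(math.erfc)
    return 0.5 * erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

def loglik(beta, y, X):
    """Log-likelihood written with y_i and 1 - y_i."""
    F = norm_cdf(X @ beta)
    return float(np.sum(y * np.log(F) + (1 - y) * np.log(1 - F)))

def loglik_q(beta, y, X):
    """Compact form of the log-likelihood, using q_i = 2 y_i - 1."""
    q = 2 * y - 1
    return float(np.sum(np.log(norm_cdf(q * (X @ beta)))))

# Hypothetical toy data: four observations, a constant and one regressor
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 2.0]])
y = np.array([1, 0, 0, 1])
beta = np.array([0.2, -0.4])

print(loglik(beta, y, X), loglik_q(beta, y, X))  # the two values coincide up to rounding
```

The compact form needs a single evaluation of $F$ per observation and avoids computing $1-F(x_{i}\beta )$, which loses precision when $F(x_{i}\beta )$ is close to 1.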

The score

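The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta $, is
$$\nabla _{\beta }l(\beta ;y,X)=\sum_{i=1}^{N}\frac{f(x_{i}\beta )}{F(x_{i}\beta )[1-F(x_{i}\beta )]}\left[ y_{i}-F(x_{i}\beta )\right] x_{i}^{\top }$$
where $f(t)$ is the probability density function of the standard normal distribution.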


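This is obtained as follows:
$$\begin{aligned}
\nabla _{\beta }l(\beta ;y,X) &=\nabla _{\beta }\sum_{i=1}^{N}\left[ y_{i}\ln F(x_{i}\beta )+(1-y_{i})\ln \left( 1-F(x_{i}\beta )\right) \right] \\
&\overset{(A)}{=}\sum_{i=1}^{N}\left[ y_{i}\frac{f(x_{i}\beta )}{F(x_{i}\beta )}-(1-y_{i})\frac{f(x_{i}\beta )}{1-F(x_{i}\beta )}\right] x_{i}^{\top } \\
&=\sum_{i=1}^{N}\frac{y_{i}[1-F(x_{i}\beta )]-(1-y_{i})F(x_{i}\beta )}{F(x_{i}\beta )[1-F(x_{i}\beta )]}f(x_{i}\beta )x_{i}^{\top } \\
&=\sum_{i=1}^{N}\frac{f(x_{i}\beta )}{F(x_{i}\beta )[1-F(x_{i}\beta )]}\left[ y_{i}-F(x_{i}\beta )\right] x_{i}^{\top }
\end{aligned}$$
where in step $(A)$ we have used the fact that the probability density function is the derivative of the cumulative distribution function, that is,
$$\frac{dF(t)}{dt}=f(t)$$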

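By using the $q_{i}$ variables, the score can also be written as
$$\nabla _{\beta }l(\beta ;y,X)=\sum_{i=1}^{N}\lambda _{i}x_{i}^{\top }$$
where
$$\lambda _{i}=\frac{q_{i}f(q_{i}x_{i}\beta )}{F(q_{i}x_{i}\beta )}$$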


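This is demonstrated as follows:
$$\nabla _{\beta }l(\beta ;y,X)=\nabla _{\beta }\sum_{i=1}^{N}\ln F(q_{i}x_{i}\beta )=\sum_{i=1}^{N}\frac{f(q_{i}x_{i}\beta )}{F(q_{i}x_{i}\beta )}q_{i}x_{i}^{\top }=\sum_{i=1}^{N}\lambda _{i}x_{i}^{\top }$$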
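The following sketch computes the score in the two equivalent ways, with the weights $f/(F(1-F))$ applied to the residuals $y_{i}-F(x_{i}\beta )$ and with $\lambda _{i}=q_{i}f(q_{i}x_{i}\beta )/F(q_{i}x_{i}\beta )$, and compares both with a central finite-difference gradient of the log-likelihood. The toy data, the step size `h`, and all function names are assumptions made for illustration.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def norm_cdf(t):
    """Standard normal CDF F(t)."""
    return 0.5 * _erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

def norm_pdf(t):
    """Standard normal PDF f(t)."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

def loglik(beta, y, X):
    q = 2 * y - 1
    return float(np.sum(np.log(norm_cdf(q * (X @ beta)))))

def score(beta, y, X):
    """Score as a weighted sum of the residuals y_i - F(x_i beta)."""
    t = X @ beta
    F, f = norm_cdf(t), norm_pdf(t)
    return X.T @ (f / (F * (1 - F)) * (y - F))

def score_q(beta, y, X):
    """Score as the sum of lambda_i times x_i, with q_i = 2 y_i - 1."""
    q = 2 * y - 1
    t = X @ beta
    lam = q * norm_pdf(q * t) / norm_cdf(q * t)
    return X.T @ lam

# Hypothetical toy data
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 2.0]])
y = np.array([1, 0, 0, 1])
beta = np.array([0.2, -0.4])

# Central finite differences of the log-likelihood, one coordinate at a time
h = 1e-6
numeric = np.array([(loglik(beta + h * e, y, X) - loglik(beta - h * e, y, X)) / (2 * h)
                    for e in np.eye(len(beta))])
print(score(beta, y, X), score_q(beta, y, X), numeric)
```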

The Hessian

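The Hessian, that is, the matrix of second derivatives, is
$$\nabla _{\beta \beta }l(\beta ;y,X)=-\sum_{i=1}^{N}\lambda _{i}\left( x_{i}\beta +\lambda _{i}\right) x_{i}^{\top }x_{i}$$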


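It can be proved as follows. Since $f^{\prime }(t)=-tf(t)$ and $q_{i}^{2}=1$, we have
$$\begin{aligned}
\frac{\partial \lambda _{i}}{\partial \beta ^{\top }} &=q_{i}\frac{f^{\prime }(q_{i}x_{i}\beta )F(q_{i}x_{i}\beta )-[f(q_{i}x_{i}\beta )]^{2}}{[F(q_{i}x_{i}\beta )]^{2}}q_{i}x_{i} \\
&=-\left[ x_{i}\beta \frac{q_{i}f(q_{i}x_{i}\beta )}{F(q_{i}x_{i}\beta )}+\left( \frac{f(q_{i}x_{i}\beta )}{F(q_{i}x_{i}\beta )}\right) ^{2}\right] x_{i} \\
&=-\lambda _{i}\left( x_{i}\beta +\lambda _{i}\right) x_{i}
\end{aligned}$$
so that
$$\nabla _{\beta \beta }l(\beta ;y,X)=\sum_{i=1}^{N}x_{i}^{\top }\frac{\partial \lambda _{i}}{\partial \beta ^{\top }}=-\sum_{i=1}^{N}\lambda _{i}\left( x_{i}\beta +\lambda _{i}\right) x_{i}^{\top }x_{i}$$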

It can be proved (see, e.g., Amemiya 1985) that the quantity
$$\lambda _{i}\left( x_{i}\beta +\lambda _{i}\right)$$
is always positive.
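A short sketch of the Hessian computation, assuming the standard probit expressions $\lambda _{i}=q_{i}f(q_{i}x_{i}\beta )/F(q_{i}x_{i}\beta )$ and Hessian $-\sum_{i}\lambda _{i}(x_{i}\beta +\lambda _{i})x_{i}^{\top }x_{i}$. It also checks numerically, on made-up data, that the weights $\lambda _{i}(x_{i}\beta +\lambda _{i})$ are positive and that the Hessian is negative definite. The function names and the data are hypothetical.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def norm_cdf(t):
    """Standard normal CDF F(t)."""
    return 0.5 * _erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

def norm_pdf(t):
    """Standard normal PDF f(t)."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

def lam_vec(beta, y, X):
    """Vector of lambda_i = q_i f(q_i x_i beta) / F(q_i x_i beta)."""
    q = 2 * y - 1
    t = X @ beta
    return q * norm_pdf(q * t) / norm_cdf(q * t)

def hessian(beta, y, X):
    """Return the K x K Hessian and the N weights lambda_i (x_i beta + lambda_i)."""
    lam = lam_vec(beta, y, X)
    w = lam * (X @ beta + lam)
    # (X.T * w) scales column i of X.T, i.e. observation i, by w_i
    return -(X.T * w) @ X, w

# Hypothetical toy data
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 2.0], [1.0, -0.7]])
y = np.array([1, 0, 0, 1, 1])
beta = np.array([0.2, -0.4])

H, w = hessian(beta, y, X)
print(w)                       # all entries are positive
print(np.linalg.eigvalsh(H))   # all eigenvalues are negative
```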

The first-order condition

The maximum likelihood estimator $\widehat{\beta }$ of the parameter $\beta $ is obtained as a solution of the following maximization problem:
$$\widehat{\beta }=\arg \max_{\beta }\;l(\beta ;y,X)$$

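As in the logit model, the maximization problem is not guaranteed to have a solution, but when it has one, at the maximum the score vector satisfies the first order condition
$$\nabla _{\beta }l(\widehat{\beta };y,X)=0$$
that is,
$$\sum_{i=1}^{N}\frac{f(x_{i}\widehat{\beta })}{F(x_{i}\widehat{\beta })[1-F(x_{i}\widehat{\beta })]}\left[ y_{i}-F(x_{i}\widehat{\beta })\right] x_{i}^{\top }=0$$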

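The quantity $y_{i}-F(x_{i}\widehat{\beta })$ is the residual, that is, the forecasting error committed by using $F(x_{i}\widehat{\beta })$ to predict $y_{i}$. Note the difference with respect to the logit model: in the logit model, the first order condition requires the residuals themselves to be orthogonal to the predictors $x_{i}$; in the probit model, it is the weighted residuals, each weighted by $f(x_{i}\widehat{\beta })/\left( F(x_{i}\widehat{\beta })[1-F(x_{i}\widehat{\beta })]\right) $, that must be orthogonal to the predictors.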

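By using the $q_{i}$ variables and the second expression for the score derived above, the first order condition can also be written as
$$\sum_{i=1}^{N}\lambda _{i}x_{i}^{\top }=0$$
where
$$\lambda _{i}=\frac{q_{i}f(q_{i}x_{i}\widehat{\beta })}{F(q_{i}x_{i}\widehat{\beta })}$$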

Newton-Raphson method

There is no analytical solution of the first order condition. One of the most common ways of solving it numerically is the Newton-Raphson method, which is iterative. Starting from an initial guess of the solution $\widehat{\beta }_{0}$ (e.g., $\widehat{\beta }_{0}=0$), we generate a sequence of guesses
$$\widehat{\beta }_{t}=\widehat{\beta }_{t-1}-\left[ \nabla _{\beta \beta }l(\widehat{\beta }_{t-1};y,X)\right] ^{-1}\nabla _{\beta }l(\widehat{\beta }_{t-1};y,X)$$
and we stop when numerical convergence is achieved (see Maximum likelihood algorithm for an introduction to numerical optimization methods and numerical convergence).


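Define
$$\lambda _{it}=\frac{q_{i}f(q_{i}x_{i}\widehat{\beta }_{t})}{F(q_{i}x_{i}\widehat{\beta }_{t})}$$
and the $N\times 1$ vector
$$\lambda _{t}=\begin{bmatrix} \lambda _{1t} & \lambda _{2t} & \cdots & \lambda _{Nt} \end{bmatrix}^{\top }$$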

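Denote by $W_{t}$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to 0) such that the elements on its diagonal are $\lambda _{1t}(x_{1}\widehat{\beta }_{t}+\lambda _{1t})$, ..., $\lambda _{Nt}(x_{N}\widehat{\beta }_{t}+\lambda _{Nt})$:
$$W_{t}=\begin{bmatrix}
\lambda _{1t}(x_{1}\widehat{\beta }_{t}+\lambda _{1t}) & 0 & \cdots & 0 \\
0 & \lambda _{2t}(x_{2}\widehat{\beta }_{t}+\lambda _{2t}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda _{Nt}(x_{N}\widehat{\beta }_{t}+\lambda _{Nt})
\end{bmatrix}$$
The matrix $W_{t}$ is positive definite because all its diagonal entries are positive (see the comments about the Hessian above).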

Finally, the $N\times K$ matrix of inputs (the design matrix) defined by
$$X=\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{N} \end{bmatrix}$$
is assumed to have full rank.

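With the notation just introduced, we can write the score as
$$\nabla _{\beta }l(\widehat{\beta }_{t-1};y,X)=X^{\top }\lambda _{t-1}$$
and the Hessian as
$$\nabla _{\beta \beta }l(\widehat{\beta }_{t-1};y,X)=-X^{\top }W_{t-1}X$$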

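Therefore, the Newton-Raphson recursive formula becomes
$$\widehat{\beta }_{t}=\widehat{\beta }_{t-1}+\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }\lambda _{t-1}$$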

The assumption that $X$ has full rank guarantees the existence of the inverse $\left( X^{\top }W_{t-1}X\right) ^{-1}$. Furthermore, it ensures that the Hessian is negative definite, so that the log-likelihood is concave.
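The Newton-Raphson iterations can be sketched as follows, assuming the score $X^{\top }\lambda $ and the Hessian $-X^{\top }WX$, where $W$ is diagonal with entries $\lambda _{i}(x_{i}\beta +\lambda _{i})$. The simulated data, the starting value of zero, the stopping rule (largest absolute change in the coefficients below `tol`), and the function names are all illustrative assumptions; this is not a production implementation, since it has no step-size control and no safeguards against perfect separation.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def norm_cdf(t):
    """Standard normal CDF F(t)."""
    return 0.5 * _erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

def norm_pdf(t):
    """Standard normal PDF f(t)."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

def lam_and_w(beta, y, X):
    """Return lambda (N,) and the diagonal of W (N,) evaluated at beta."""
    q = 2 * y - 1
    t = X @ beta
    lam = q * norm_pdf(q * t) / norm_cdf(q * t)
    return lam, lam * (t + lam)

def probit_newton(y, X, tol=1e-10, max_iter=50):
    """Newton-Raphson: beta_t = beta_{t-1} + (X' W X)^{-1} X' lambda."""
    beta = np.zeros(X.shape[1])                 # initial guess beta_0 = 0
    for _ in range(max_iter):
        lam, w = lam_and_w(beta, y, X)
        # Solve (X' W X) step = X' lambda instead of forming the inverse
        step = np.linalg.solve((X.T * w) @ X, X.T @ lam)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # assumed convergence criterion
            break
    return beta

# Simulated data from a probit model with hypothetical coefficients
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, -1.0])
y = (rng.uniform(size=N) < norm_cdf(X @ beta_true)).astype(int)

beta_hat = probit_newton(y, X)
print(beta_hat)   # should be in the neighborhood of beta_true
```

Solving the linear system with `np.linalg.solve` is generally preferable to computing the inverse explicitly, although the two are algebraically the same update.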

Iteratively reweighted least squares

As in the logit classification model, it is straightforward to prove that the Newton-Raphson iterations are equivalent to Iteratively Reweighted Least Squares (IRLS) iterations:
$$\widehat{\beta }_{t}=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}\left( X\widehat{\beta }_{t-1}+W_{t-1}^{-1}\lambda _{t-1}\right)$$
where we perform a Weighted Least Squares (WLS) estimation with weights $W_{t-1}$ of a linear regression of the dependent variables $X\widehat{\beta }_{t-1}+W_{t-1}^{-1}\lambda _{t-1}$ on the regressors $X$.


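Write $\widehat{\beta }_{t-1}$ as
$$\widehat{\beta }_{t-1}=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}X\widehat{\beta }_{t-1}$$
Then, the Newton-Raphson formula can be written as
$$\begin{aligned}
\widehat{\beta }_{t} &=\widehat{\beta }_{t-1}+\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }\lambda _{t-1} \\
&=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}X\widehat{\beta }_{t-1}+\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}W_{t-1}^{-1}\lambda _{t-1} \\
&=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}\left( X\widehat{\beta }_{t-1}+W_{t-1}^{-1}\lambda _{t-1}\right)
\end{aligned}$$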
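The equivalence can be checked numerically. The sketch below performs one Newton-Raphson step and one WLS step from the same current guess and confirms that they give the same new guess. It assumes the working dependent variable $X\beta +W^{-1}\lambda $ and weights $W$ with diagonal entries $\lambda _{i}(x_{i}\beta +\lambda _{i})$; the data and the function names are made up.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def norm_cdf(t):
    """Standard normal CDF F(t)."""
    return 0.5 * _erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

def norm_pdf(t):
    """Standard normal PDF f(t)."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

def lam_and_w(beta, y, X):
    """Return lambda (N,) and the diagonal of W (N,) evaluated at beta."""
    q = 2 * y - 1
    t = X @ beta
    lam = q * norm_pdf(q * t) / norm_cdf(q * t)
    return lam, lam * (t + lam)

def newton_step(beta, y, X):
    """beta + (X' W X)^{-1} X' lambda."""
    lam, w = lam_and_w(beta, y, X)
    return beta + np.linalg.solve((X.T * w) @ X, X.T @ lam)

def irls_step(beta, y, X):
    """WLS regression of z = X beta + W^{-1} lambda on X with weights W."""
    lam, w = lam_and_w(beta, y, X)
    z = X @ beta + lam / w                     # working dependent variable
    return np.linalg.solve((X.T * w) @ X, (X.T * w) @ z)

# Hypothetical toy data and current guess
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 2.0], [1.0, -0.7]])
y = np.array([1, 0, 0, 1, 1])
beta = np.array([0.1, -0.2])

print(newton_step(beta, y, X), irls_step(beta, y, X))  # identical up to rounding
```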

Covariance matrix of the estimator

The Hessian matrix derived above is usually employed to estimate the asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta }$:
$$\widehat{V}=N\left( X^{\top }WX\right) ^{-1}$$
where $\widehat{\beta }=\widehat{\beta }_{T}$ and $W=W_{T}$ ($T$ is the last step of the iterative procedure used to maximize the likelihood).

A proof of the fact that the inverse of the negative Hessian, divided by the sample size, converges to the asymptotic covariance matrix can be found in the lecture on estimating the covariance matrix of MLE estimators.

Given the above estimate of the asymptotic covariance matrix, the distribution of $\widehat{\beta }$ can be approximated by a normal distribution having mean equal to the true parameter and covariance matrix
$$\frac{1}{N}\widehat{V}=\left( X^{\top }WX\right) ^{-1}$$
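In practice, the approximate covariance matrix of $\widehat{\beta }$ is the inverse of the negative Hessian evaluated at the final iterate, and its diagonal gives squared standard errors. The sketch below assumes the Hessian $-X^{\top }WX$ with diagonal weights $\lambda _{i}(x_{i}\beta +\lambda _{i})$; the simulated data, the fixed number of Newton-Raphson steps, and the function names are illustrative assumptions.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def norm_cdf(t):
    """Standard normal CDF F(t)."""
    return 0.5 * _erfc(-np.asarray(t, dtype=float) / math.sqrt(2.0))

def norm_pdf(t):
    """Standard normal PDF f(t)."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

def lam_and_w(beta, y, X):
    """Return lambda (N,) and the diagonal of W (N,) evaluated at beta."""
    q = 2 * y - 1
    t = X @ beta
    lam = q * norm_pdf(q * t) / norm_cdf(q * t)
    return lam, lam * (t + lam)

def probit_fit(y, X, n_steps=25):
    """Plain Newton-Raphson with a fixed number of steps (a simplification)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        lam, w = lam_and_w(beta, y, X)
        beta = beta + np.linalg.solve((X.T * w) @ X, X.T @ lam)
    return beta

# Simulated data with hypothetical coefficients
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, -1.0])
y = (rng.uniform(size=N) < norm_cdf(X @ beta_true)).astype(int)

beta_hat = probit_fit(y, X)
_, w = lam_and_w(beta_hat, y, X)
cov_beta = np.linalg.inv((X.T * w) @ X)   # (X' W X)^{-1}, approximate covariance of beta_hat
V_hat = N * cov_beta                      # estimate of the asymptotic covariance matrix
std_err = np.sqrt(np.diag(cov_beta))      # standard errors of the coefficients
print(beta_hat, std_err)
```

The standard errors can then be used to build approximate normal confidence intervals, such as `beta_hat` plus or minus 1.96 times `std_err` for a 95% level.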


References

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.

How to cite

Please cite as:

Taboga, Marco (2021). "Probit classification model - Maximum likelihood", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
