
Logistic classification model - Maximum likelihood estimation

This lecture deals with maximum likelihood estimation of the logistic classification model (also called logit model or logistic regression).

Before reading this lecture, you might want to revise the lectures on maximum likelihood estimation and on the logit model.

Table of Contents

Model and notation
The likelihood
The log-likelihood
The score
The Hessian
The first-order condition
Newton-Raphson method
Iteratively reweighted least squares
Covariance matrix of the estimator

Model and notation

Remember that in the logit model the output variable $y_{i}$ is a Bernoulli random variable (it can take only two values, either 1 or 0) and
$$P(y_{i}=1\mid x_{i})=S(x_{i}\beta )$$
where
$$S(t)=\frac{1}{1+\exp (-t)}$$
is the logistic function, $x_{i}$ is a $1\times K$ vector of inputs and $\beta $ is a $K\times 1$ vector of coefficients.

Furthermore,
$$P(y_{i}=0\mid x_{i})=1-S(x_{i}\beta )$$

The vector of coefficients $\beta $ is the parameter to be estimated by maximum likelihood.

We assume that the estimation is carried out with an IID sample comprising $N$ data points
$$(y_{1},x_{1}),\ (y_{2},x_{2}),\ \ldots ,\ (y_{N},x_{N})$$
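As a minimal numerical sketch of the model, assuming the NumPy library (the function name logistic and the input values are purely illustrative), the two conditional probabilities for a single observation can be evaluated as follows:

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative data: x_i is a 1 x K input vector, beta a K x 1 coefficient vector (K = 3).
x_i = np.array([1.0, 0.5, -2.0])
beta = np.array([0.3, -1.0, 0.2])

p1 = logistic(x_i @ beta)   # P(y_i = 1 | x_i) = S(x_i beta)
p0 = 1.0 - p1               # P(y_i = 0 | x_i) = 1 - S(x_i beta)
```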

The likelihood

The likelihood of an observation $(y_{i},x_{i})$ can be written as
$$L(\beta ;y_{i},x_{i})=[S(x_{i}\beta )]^{y_{i}}[1-S(x_{i}\beta )]^{1-y_{i}}$$

If you are wondering about the exponents $y_{i}$ and $1-y_{i}$ or, more generally, about this formula for the likelihood, you are advised to revise the lecture on Classification models and their maximum likelihood estimation.

Denote the $N\times 1$ vector of all outputs by $y$ and the $N\times K$ matrix of all inputs by $X$. Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta ;y,X)=\prod_{i=1}^{N}[S(x_{i}\beta )]^{y_{i}}[1-S(x_{i}\beta )]^{1-y_{i}}$$
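A minimal sketch of these two likelihoods, assuming NumPy (the names observation_likelihood and sample_likelihood are illustrative):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def observation_likelihood(beta, y_i, x_i):
    """L(beta; y_i, x_i) = S(x_i beta)^y_i * (1 - S(x_i beta))^(1 - y_i)."""
    p = logistic(x_i @ beta)
    return p**y_i * (1.0 - p)**(1 - y_i)

def sample_likelihood(beta, y, X):
    """Product over the IID sample of the observation likelihoods."""
    p = logistic(X @ beta)
    return np.prod(p**y * (1.0 - p)**(1 - y))
```

Note that the product underflows quickly as $N$ grows, which is one practical reason for working with the log-likelihood introduced next.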

The log-likelihood

The log-likelihood of the logistic model is
$$l(\beta ;y,X)=\sum_{i=1}^{N}\left[ y_{i}\ln S(x_{i}\beta )+(1-y_{i})\ln (1-S(x_{i}\beta ))\right]$$

Proof

It is computed as follows:

$$\begin{aligned}
l(\beta ;y,X) &=\ln L(\beta ;y,X)\\
&=\ln \prod_{i=1}^{N}[S(x_{i}\beta )]^{y_{i}}[1-S(x_{i}\beta )]^{1-y_{i}}\\
&=\sum_{i=1}^{N}\ln \left( [S(x_{i}\beta )]^{y_{i}}[1-S(x_{i}\beta )]^{1-y_{i}}\right) \\
&=\sum_{i=1}^{N}\left[ y_{i}\ln S(x_{i}\beta )+(1-y_{i})\ln (1-S(x_{i}\beta ))\right]
\end{aligned}$$
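In code, assuming NumPy, the log-likelihood can be evaluated directly from the formula above (the name log_likelihood is illustrative):

```python
import numpy as np

def log_likelihood(beta, y, X):
    """Sum over i of y_i * ln S(x_i beta) + (1 - y_i) * ln(1 - S(x_i beta))."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # conditional probabilities S(x_i beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```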

The score

The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta $, is
$$\nabla _{\beta }l(\beta ;y,X)=\sum_{i=1}^{N}\left[ y_{i}-S(x_{i}\beta )\right] x_{i}^{\top }$$

Proof

This is obtained as follows:
$$\begin{aligned}
\nabla _{\beta }l(\beta ;y,X) &=\frac{\partial }{\partial \beta }\sum_{i=1}^{N}\left[ y_{i}\ln S(x_{i}\beta )+(1-y_{i})\ln (1-S(x_{i}\beta ))\right] \\
&=\sum_{i=1}^{N}\left[ \frac{y_{i}}{S(x_{i}\beta )}-\frac{1-y_{i}}{1-S(x_{i}\beta )}\right] S(x_{i}\beta )\left[ 1-S(x_{i}\beta )\right] x_{i}^{\top }\\
&=\sum_{i=1}^{N}\left[ y_{i}\left( 1-S(x_{i}\beta )\right) -(1-y_{i})S(x_{i}\beta )\right] x_{i}^{\top }\\
&=\sum_{i=1}^{N}\left[ y_{i}-S(x_{i}\beta )\right] x_{i}^{\top }
\end{aligned}$$
where we have used the chain rule and the derivative of the logistic function, reported below in the proof concerning the Hessian.
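A sketch of the score, assuming NumPy; in matrix notation it coincides with $X^{\top }(y-\widehat{y})$, as shown later in the lecture:

```python
import numpy as np

def score(beta, y, X):
    """Score: sum over i of (y_i - S(x_i beta)) * x_i', i.e. X' (y - p), a K x 1 array."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)
```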

The Hessian

The Hessian, that is, the matrix of second derivatives of the log-likelihood, is
$$\nabla _{\beta \beta }l(\beta ;y,X)=-\sum_{i=1}^{N}x_{i}^{\top }x_{i}\,S(x_{i}\beta )\left[ 1-S(x_{i}\beta )\right]$$

Proof

It can be proved as follows:
$$\begin{aligned}
\nabla _{\beta \beta }l(\beta ;y,X) &=\frac{\partial }{\partial \beta ^{\top }}\sum_{i=1}^{N}\left[ y_{i}-S(x_{i}\beta )\right] x_{i}^{\top }\\
&=-\sum_{i=1}^{N}x_{i}^{\top }\frac{\partial S(x_{i}\beta )}{\partial \beta ^{\top }}\\
&=-\sum_{i=1}^{N}x_{i}^{\top }x_{i}\,S(x_{i}\beta )\left[ 1-S(x_{i}\beta )\right]
\end{aligned}$$
where we have used the fact that the derivative of the logistic function $S(t)$ is
$$\frac{dS(t)}{dt}=S(t)\left[ 1-S(t)\right]$$
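A corresponding sketch of the Hessian, assuming NumPy; note that it depends on $\beta $ and on the inputs, but not on the observed outputs:

```python
import numpy as np

def hessian(beta, X):
    """Hessian: -sum over i of S(x_i beta)(1 - S(x_i beta)) x_i' x_i, i.e. -X' W X."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    W = np.diag(p * (1.0 - p))      # N x N diagonal weighting matrix
    return -(X.T @ W @ X)
```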

The first-order condition

The maximum likelihood estimator $\widehat{\beta }$ of the parameter $\beta $ solves
$$\widehat{\beta }=\arg \max_{\beta }\,l(\beta ;y,X)$$

In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for an introduction to the numerical maximization of the likelihood).

Moreover, this maximization problem is not guaranteed to have a solution, because pathological situations can arise in which the log-likelihood is an unbounded function of the parameters. In these situations the log-likelihood can be made as large as desired by appropriately choosing $\beta $. This happens when the residuals can be made as small as desired, that is, when the model can perfectly fit the observed classes (so-called perfect separation of classes); it is not a common situation. In all other situations the maximization problem has a solution, and at the maximum the score vector satisfies the first-order condition
$$\nabla _{\beta }l(\widehat{\beta };y,X)=0$$
that is,
$$\sum_{i=1}^{N}\left[ y_{i}-S(x_{i}\widehat{\beta })\right] x_{i}^{\top }=0$$

Note that $y_{i}-S(x_{i}\widehat{\beta })$ is the error committed by using $S(x_{i}\widehat{\beta })$ as a predictor of $y_{i}$. It is similar to a regression residual (see Linear regression). Furthermore, the first-order condition above is similar to the first-order condition found when estimating a linear regression model by ordinary least squares: it says that the residuals must be orthogonal to the predictors $x_{i}$.

Newton-Raphson method

The first-order condition above has no explicit solution. In most statistical software packages it is solved by using the Newton-Raphson method. The method is pretty simple: we start from a guess $\beta _{0}$ of the solution (e.g., $\beta _{0}=0$), and then we recursively update the guess with the equation
$$\beta _{t}=\beta _{t-1}-\left[ \nabla _{\beta \beta }l(\beta _{t-1};y,X)\right] ^{-1}\nabla _{\beta }l(\beta _{t-1};y,X)$$
until numerical convergence (of $\beta _{t}$ to the solution $\widehat{\beta }$).

Denote by $\widehat{y}_{t}$ the $N\times 1$ vector of conditional probabilities of the outputs computed by using $\beta _{t}$ as parameter:
$$\widehat{y}_{t}=\begin{bmatrix} S(x_{1}\beta _{t})\\ \vdots \\ S(x_{N}\beta _{t}) \end{bmatrix}$$

Denote by $W_{t}$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to 0) such that the elements on its diagonal are $S(x_{1}\beta _{t})[1-S(x_{1}\beta _{t})]$, ..., $S(x_{N}\beta _{t})[1-S(x_{N}\beta _{t})]$:
$$W_{t}=\begin{bmatrix} S(x_{1}\beta _{t})[1-S(x_{1}\beta _{t})] & \cdots & 0\\ \vdots & \ddots & \vdots \\ 0 & \cdots & S(x_{N}\beta _{t})[1-S(x_{N}\beta _{t})] \end{bmatrix}$$

The $N\times K$ matrix of inputs
$$X=\begin{bmatrix} x_{1}\\ \vdots \\ x_{N} \end{bmatrix}$$
which is called the design matrix (as in linear regression), is assumed to be a full-rank matrix.

By using this notation, the score in the Newton-Raphson recursive formula can be written as
$$\nabla _{\beta }l(\beta _{t};y,X)=X^{\top }(y-\widehat{y}_{t})$$
and the Hessian as
$$\nabla _{\beta \beta }l(\beta _{t};y,X)=-X^{\top }W_{t}X$$

Therefore, the Newton-Raphson formula becomes
$$\beta _{t}=\beta _{t-1}+\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }(y-\widehat{y}_{t-1})$$
where the existence of the inverse $\left( X^{\top }W_{t-1}X\right) ^{-1}$ is guaranteed by the assumption that $X$ has full rank (the assumption also guarantees that the log-likelihood is concave and the maximum likelihood problem has a unique solution).
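The following sketch, assuming NumPy, implements the Newton-Raphson recursion in the matrix form just derived; the function name newton_raphson_logit, the tolerance and the iteration cap are illustrative choices:

```python
import numpy as np

def newton_raphson_logit(y, X, tol=1e-10, max_iter=100):
    """Maximize the logit log-likelihood with the update
    beta_t = beta_{t-1} + (X' W_{t-1} X)^{-1} X' (y - y_hat_{t-1})."""
    beta = np.zeros(X.shape[1])                      # starting guess beta_0 = 0
    for _ in range(max_iter):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))    # conditional probabilities
        W = np.diag(y_hat * (1.0 - y_hat))           # diagonal weighting matrix
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - y_hat))
        beta = beta + step
        if np.max(np.abs(step)) < tol:               # numerical convergence
            break
    return beta
```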

Iteratively reweighted least squares

If you deal with logit models, you will often read that they can be estimated by Iteratively Reweighted Least Squares (IRLS). The Newton-Raphson formula above is equivalent to the IRLS formula
$$\beta _{t}=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}z_{t-1}$$
which is obtained by performing a Weighted Least Squares (WLS) estimation, with weights $W_{t-1}$, of a linear regression of the dependent variables
$$z_{t-1}=X\beta _{t-1}+W_{t-1}^{-1}(y-\widehat{y}_{t-1})$$
on the regressors $X$.

Proof

Write $\beta _{t-1}$ as
$$\beta _{t-1}=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}X\beta _{t-1}$$
Then, we can re-write the Newton-Raphson formula as follows:
$$\begin{aligned}
\beta _{t} &=\beta _{t-1}+\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }(y-\widehat{y}_{t-1})\\
&=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}X\beta _{t-1}+\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}W_{t-1}^{-1}(y-\widehat{y}_{t-1})\\
&=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}\left[ X\beta _{t-1}+W_{t-1}^{-1}(y-\widehat{y}_{t-1})\right] \\
&=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}z_{t-1}
\end{aligned}$$

The IRLS formula can alternatively be written with the dependent variable spelled out:
$$\beta _{t}=\left( X^{\top }W_{t-1}X\right) ^{-1}X^{\top }W_{t-1}\left[ X\beta _{t-1}+W_{t-1}^{-1}(y-\widehat{y}_{t-1})\right]$$
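A sketch of the IRLS recursion under the same assumptions (NumPy; the name irls_logit is illustrative); each step is a weighted least-squares regression of the adjusted dependent variable on $X$:

```python
import numpy as np

def irls_logit(y, X, tol=1e-10, max_iter=100):
    """At each step, run a WLS regression, with weights W_{t-1}, of
    z = X beta + W^{-1} (y - y_hat) on the regressors X."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = y_hat * (1.0 - y_hat)                    # diagonal of W_{t-1}
        z = X @ beta + (y - y_hat) / w               # adjusted dependent variable
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta
```

Up to numerical round-off, this produces the same iterates as the Newton-Raphson sketch above, in line with the equivalence just proved.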

Covariance matrix of the estimator

The asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta }$ is usually estimated with the Hessian (see the lecture on the covariance matrix of MLE estimators), as follows:
$$\widehat{\mathrm{Var}}\left[ \widehat{\beta }\right] =-\left[ \nabla _{\beta \beta }l(\widehat{\beta };y,X)\right] ^{-1}=\left( X^{\top }WX\right) ^{-1}$$
where $\widehat{\beta }=\beta _{T}$ and $W=W_{T}$ ($T$ is the last step of the iterative procedure used to maximize the likelihood). As a consequence, the distribution of $\widehat{\beta }$ can be approximated by a normal distribution with mean equal to the true parameter value and variance equal to
$$\left( X^{\top }WX\right) ^{-1}$$
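As a sketch, assuming NumPy, the estimated covariance matrix and the coefficient standard errors can be computed from the weights evaluated at $\widehat{\beta }$ (the name covariance_matrix is illustrative):

```python
import numpy as np

def covariance_matrix(beta_hat, X):
    """Estimated asymptotic covariance of the MLE: (X' W X)^{-1}, with W evaluated at beta_hat."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    W = np.diag(p * (1.0 - p))
    return np.linalg.inv(X.T @ W @ X)

# Standard errors are the square roots of the diagonal entries:
# np.sqrt(np.diag(covariance_matrix(beta_hat, X)))
```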
