
Logistic regression - Maximum Likelihood Estimation

by Marco Taboga, PhD

This lecture deals with maximum likelihood estimation of the logistic classification model (also called logit model or logistic regression).

Before proceeding, you might want to revise the introductions to maximum likelihood estimation (MLE) and to the logit model.


Model and notation

In the logit model, the output variable $y_{i}$ is a Bernoulli random variable (it can take only two values, either 1 or 0) and
$$P(y_{i}=1 \mid x_{i}) = S(x_{i}\beta)$$
where
$$S(t) = \frac{1}{1+\exp(-t)}$$
is the logistic function, $x_{i}$ is a $1\times K$ vector of inputs and $\beta$ is a $K\times 1$ vector of coefficients.


The vector of coefficients $\beta$ is the parameter to be estimated by maximum likelihood.

We assume that the estimation is carried out with an IID sample comprising $N$ data points $(y_{i}, x_{i})$ for $i=1,\ldots,N$.

The likelihood

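The likelihood of an observation $(y_{i}, x_{i})$ can be written as
$$L(\beta; y_{i}, x_{i}) = \left[S(x_{i}\beta)\right]^{y_{i}}\left[1-S(x_{i}\beta)\right]^{1-y_{i}}$$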

If you are wondering about the exponents $y_{i}$ and $1-y_{i}$ or, more in general, about this formula for the likelihood, you are advised to revise the lecture on Classification models and their maximum likelihood estimation.

Denote the $N\times 1$ vector of all outputs by $y$ and the $N\times K$ matrix of all inputs by $X$.

Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\beta; y, X) = \prod_{i=1}^{N}\left[S(x_{i}\beta)\right]^{y_{i}}\left[1-S(x_{i}\beta)\right]^{1-y_{i}}$$

The log-likelihood

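The log-likelihood of the logistic model is
$$l(\beta; y, X) = \sum_{i=1}^{N}\left[-\ln\left(1+\exp(x_{i}\beta)\right) + y_{i}x_{i}\beta\right]$$

It is computed as follows:
$$\begin{aligned}
l(\beta; y, X) &= \ln L(\beta; y, X) \\
&= \sum_{i=1}^{N}\left[y_{i}\ln S(x_{i}\beta) + (1-y_{i})\ln\left(1-S(x_{i}\beta)\right)\right] \\
&= \sum_{i=1}^{N}\left[\ln\left(1-S(x_{i}\beta)\right) + y_{i}\ln\frac{S(x_{i}\beta)}{1-S(x_{i}\beta)}\right] \\
&= \sum_{i=1}^{N}\left[-\ln\left(1+\exp(x_{i}\beta)\right) + y_{i}x_{i}\beta\right]
\end{aligned}$$
where we have used the facts that $1-S(t) = 1/(1+\exp(t))$ and $S(t)/[1-S(t)] = \exp(t)$.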
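As a numerical check of the formula, here is a minimal NumPy sketch that evaluates the log-likelihood both in the closed form and directly from the Bernoulli probabilities. The function names and the toy data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, y, X):
    """Closed form: sum_i [ y_i * x_i beta - ln(1 + exp(x_i beta)) ]."""
    eta = X @ beta
    # np.logaddexp(0, eta) evaluates ln(1 + exp(eta)) without overflow
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def log_likelihood_direct(beta, y, X):
    """Direct form: sum_i [ y_i ln S_i + (1 - y_i) ln(1 - S_i) ]."""
    p = logistic(X @ beta)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical toy data: N = 6 observations, K = 2 (a constant and one input)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
beta = np.array([0.3, -0.7])  # an arbitrary parameter value

ll_closed = log_likelihood(beta, y, X)
ll_direct = log_likelihood_direct(beta, y, X)
print(ll_closed, ll_direct)  # the two values should agree up to rounding
```

The closed form is usually preferable in code because it avoids taking the logarithm of a probability that has been rounded to 0 or 1.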


The score

The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is
$$\nabla_{\beta}\, l(\beta; y, X) = \sum_{i=1}^{N}\left[y_{i}-S(x_{i}\beta)\right]x_{i}^{\top}$$


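This is obtained as follows:
$$\begin{aligned}
\nabla_{\beta}\, l(\beta; y, X) &= \sum_{i=1}^{N}\left[-\frac{\exp(x_{i}\beta)}{1+\exp(x_{i}\beta)}\,x_{i}^{\top} + y_{i}x_{i}^{\top}\right] \\
&= \sum_{i=1}^{N}\left[y_{i}-S(x_{i}\beta)\right]x_{i}^{\top}
\end{aligned}$$
where we have used the fact that $\exp(t)/(1+\exp(t)) = S(t)$.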
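The expression for the score can be checked numerically. The sketch below, which assumes NumPy and uses illustrative names and toy data of our own choosing, compares the analytical score with a central finite-difference gradient of the log-likelihood.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, y, X):
    """sum_i [ y_i * x_i beta - ln(1 + exp(x_i beta)) ]."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def score(beta, y, X):
    """Score as a length-K vector: sum_i x_i^T (y_i - S(x_i beta))."""
    return X.T @ (y - logistic(X @ beta))

def numerical_gradient(f, beta, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    grad = np.zeros_like(beta)
    for k in range(beta.size):
        step = np.zeros_like(beta)
        step[k] = h
        grad[k] = (f(beta + step) - f(beta - step)) / (2.0 * h)
    return grad

# Hypothetical toy data: N = 6 observations, K = 2 (a constant and one input)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
beta = np.array([0.3, -0.7])  # an arbitrary parameter value

analytic = score(beta, y, X)
numeric = numerical_gradient(lambda b: log_likelihood(b, y, X), beta)
print(analytic, numeric)  # should agree up to finite-difference error
```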

The Hessian

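The Hessian, that is, the matrix of second derivatives, is
$$\nabla_{\beta\beta}\, l(\beta; y, X) = -\sum_{i=1}^{N}x_{i}^{\top}x_{i}\,S(x_{i}\beta)\left[1-S(x_{i}\beta)\right]$$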


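It can be proved as follows:
$$\begin{aligned}
\nabla_{\beta\beta}\, l(\beta; y, X) &= \frac{\partial}{\partial\beta^{\top}}\sum_{i=1}^{N}\left[y_{i}-S(x_{i}\beta)\right]x_{i}^{\top} \\
&= -\sum_{i=1}^{N}x_{i}^{\top}\,\frac{\partial S(x_{i}\beta)}{\partial\beta^{\top}} \\
&= -\sum_{i=1}^{N}x_{i}^{\top}x_{i}\,S(x_{i}\beta)\left[1-S(x_{i}\beta)\right]
\end{aligned}$$
where we have used the fact that the derivative of the logistic function $S(t)$ is
$$\frac{dS(t)}{dt} = S(t)\left[1-S(t)\right]$$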
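A minimal NumPy sketch of the Hessian follows, under the assumption that the score is stored as a length-$K$ vector. It checks the formula against a finite-difference Jacobian of the score and inspects the eigenvalues, which should all be negative when the inputs have full rank. The names and toy data are illustrative.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def score(beta, y, X):
    """sum_i x_i^T (y_i - S(x_i beta)) as a length-K vector."""
    return X.T @ (y - logistic(X @ beta))

def hessian(beta, X):
    """- sum_i x_i^T x_i S(x_i beta) [1 - S(x_i beta)] as a K x K matrix."""
    p = logistic(X @ beta)
    w = p * (1.0 - p)
    # (X.T * w) rescales the i-th column of X^T by w_i
    return -(X.T * w) @ X

# Hypothetical toy data: N = 6 observations, K = 2 (a constant and one input)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
beta = np.array([0.3, -0.7])  # an arbitrary parameter value

H = hessian(beta, X)

# Finite-difference Jacobian of the score, one column per coefficient
h = 1e-6
H_num = np.column_stack([
    (score(beta + h * e, y, X) - score(beta - h * e, y, X)) / (2.0 * h)
    for e in np.eye(beta.size)
])

eigenvalues = np.linalg.eigvalsh(H)
print(H, eigenvalues)
```

Note that the Hessian does not depend on the outputs $y$, which is why `hessian` takes only `beta` and `X`.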

Absence of analytical solutions

The maximum likelihood estimator $\widehat{\beta}$ of the parameter $\beta$ solves
$$\widehat{\beta} = \arg\max_{\beta}\, l(\beta; y, X)$$

In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for an introduction to the numerical maximization of the likelihood).

Perfect separation of classes

The maximization problem is not guaranteed to have a solution because some pathological situations can arise in which the log-likelihood has no maximizer.

In these situations the log-likelihood, which can never exceed zero, can be brought as close to zero as desired by making the norm of $\beta$ large enough, but zero is never attained for any finite $\beta$. This happens when the residuals can be made as small as desired (so-called perfect separation of classes).

Perfect separation is not a common situation. It means that the model can perfectly fit the observed classes.
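To illustrate, here is a small NumPy sketch on a hypothetical sample in which the output equals 1 exactly when the input is positive. Scaling up a separating coefficient vector keeps increasing the log-likelihood, so no finite maximizer exists. The data, the chosen direction and the scales are assumptions made only for this example.

```python
import numpy as np

def log_likelihood(beta, y, X):
    """sum_i [ y_i * x_i beta - ln(1 + exp(x_i beta)) ], evaluated stably."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Hypothetical perfectly separated sample: y = 1 exactly when the input is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

direction = np.array([0.0, 1.0])  # a coefficient vector that separates the classes
scales = [1.0, 10.0, 100.0]
values = [log_likelihood(c * direction, y, X) for c in scales]

for c, v in zip(scales, values):
    print(f"scale {c:6.1f}: log-likelihood = {v:.3e}")
```

In practice, an iterative optimizer run on such a sample tends to produce coefficients that keep growing from one iteration to the next instead of settling down.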

The first-order condition

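In all other situations, the maximization problem has a solution, and at the maximum the score vector satisfies the first order condition
$$\nabla_{\beta}\, l(\widehat{\beta}; y, X) = 0$$
that is,
$$\sum_{i=1}^{N}\left[y_{i}-S(x_{i}\widehat{\beta})\right]x_{i}^{\top} = 0$$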

Note that $y_{i}-S(x_{i}\widehat{\beta})$ is the error committed by using $S(x_{i}\widehat{\beta})$ as a predictor of $y_{i}$. It is similar to a regression residual (see Linear regression).

Furthermore, the first order condition above is similar to the first order condition that is found when estimating a linear regression model by ordinary least squares: it says that the residuals need to be orthogonal to the predictors $x_{i}$.

Newton-Raphson method

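The first order condition above has no explicit solution. In most statistical software packages it is solved by using the Newton-Raphson method. The method is pretty simple: we start from a guess of the solution $\widehat{\beta}_{0}$ (e.g., $\widehat{\beta}_{0}=0$), and then we recursively update the guess with the equation
$$\widehat{\beta}_{t} = \widehat{\beta}_{t-1} - \left[\nabla_{\beta\beta}\, l(\widehat{\beta}_{t-1}; y, X)\right]^{-1}\nabla_{\beta}\, l(\widehat{\beta}_{t-1}; y, X)$$
until numerical convergence (of $\widehat{\beta}_{t}$ to the solution $\widehat{\beta}$).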

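Denote by $\widehat{y}_{t}$ the $N\times 1$ vector of conditional probabilities of the outputs computed by using $\widehat{\beta}_{t}$ as parameter:
$$\widehat{y}_{t} = \begin{bmatrix} S(x_{1}\widehat{\beta}_{t}) \\ \vdots \\ S(x_{N}\widehat{\beta}_{t}) \end{bmatrix}$$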

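Denote by $W_{t}$ the $N\times N$ diagonal matrix (i.e., having all off-diagonal elements equal to 0) such that the elements on its diagonal are $S(x_{1}\widehat{\beta}_{t})[1-S(x_{1}\widehat{\beta}_{t})]$, ..., $S(x_{N}\widehat{\beta}_{t})[1-S(x_{N}\widehat{\beta}_{t})]$:
$$W_{t} = \operatorname{diag}\left(S(x_{1}\widehat{\beta}_{t})\left[1-S(x_{1}\widehat{\beta}_{t})\right],\ \ldots,\ S(x_{N}\widehat{\beta}_{t})\left[1-S(x_{N}\widehat{\beta}_{t})\right]\right)$$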

The $N\times K$ matrix of inputs
$$X = \begin{bmatrix} x_{1} \\ \vdots \\ x_{N} \end{bmatrix}$$
which is called the design matrix (as in linear regression), is assumed to be a full-rank matrix.

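By using this notation, the score in the Newton-Raphson recursive formula can be written as
$$\nabla_{\beta}\, l(\widehat{\beta}_{t-1}; y, X) = X^{\top}\left(y-\widehat{y}_{t-1}\right)$$
and the Hessian as
$$\nabla_{\beta\beta}\, l(\widehat{\beta}_{t-1}; y, X) = -X^{\top}W_{t-1}X$$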

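Therefore, the Newton-Raphson formula becomes
$$\widehat{\beta}_{t} = \widehat{\beta}_{t-1} + \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}\left(y-\widehat{y}_{t-1}\right)$$
where the existence of the inverse $\left(X^{\top}W_{t-1}X\right)^{-1}$ is guaranteed by the assumption that $X$ has full rank (the assumption also guarantees that the log-likelihood is strictly concave, so that a solution of the maximum likelihood problem, when it exists, is unique).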
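The recursion can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production routine: it starts from a zero vector, stops when the update becomes smaller than a tolerance of our own choosing, and does not guard against perfect separation. The function names, tolerance and toy data are hypothetical.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def newton_raphson_logit(X, y, tol=1e-10, max_iter=50):
    """Iterate beta_t = beta_{t-1} + (X^T W X)^{-1} X^T (y - y_hat) from beta_0 = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        y_hat = logistic(X @ beta)      # conditional probabilities
        w = y_hat * (1.0 - y_hat)       # diagonal of W
        # Solve (X^T W X) step = X^T (y - y_hat) instead of forming the inverse
        step = np.linalg.solve((X.T * w) @ X, X.T @ (y - y_hat))
        beta = beta + step
        if np.max(np.abs(step)) < tol:  # numerical convergence
            break
    return beta

# Hypothetical toy data with overlapping classes, so that a maximum exists
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

beta_hat = newton_raphson_logit(X, y)
residuals = y - logistic(X @ beta_hat)
print(beta_hat)
print(X.T @ residuals)  # first order condition: should be numerically zero
```

The last line checks the first order condition: at convergence the residuals are orthogonal to the columns of the design matrix.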

Iteratively reweighted least squares

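If you deal with logit models, you will often read that they can be estimated by Iteratively Reweighted Least Squares (IRLS). The Newton-Raphson formula above is equivalent to the IRLS formula
$$\widehat{\beta}_{t} = \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}W_{t-1}\left[X\widehat{\beta}_{t-1} + W_{t-1}^{-1}\left(y-\widehat{y}_{t-1}\right)\right]$$
that is obtained by performing a Weighted Least Squares (WLS) estimation with weights $W_{t-1}$ of a linear regression of the dependent variables $X\widehat{\beta}_{t-1} + W_{t-1}^{-1}\left(y-\widehat{y}_{t-1}\right)$ on the regressors $X$.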


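Write $\widehat{\beta}_{t-1}$ as
$$\widehat{\beta}_{t-1} = \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}W_{t-1}X\widehat{\beta}_{t-1}$$
Then, we can rewrite the Newton-Raphson formula as follows:
$$\begin{aligned}
\widehat{\beta}_{t} &= \widehat{\beta}_{t-1} + \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}\left(y-\widehat{y}_{t-1}\right) \\
&= \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}W_{t-1}X\widehat{\beta}_{t-1} + \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}W_{t-1}W_{t-1}^{-1}\left(y-\widehat{y}_{t-1}\right) \\
&= \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}W_{t-1}\left[X\widehat{\beta}_{t-1} + W_{t-1}^{-1}\left(y-\widehat{y}_{t-1}\right)\right]
\end{aligned}$$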

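The IRLS formula can alternatively be written as
$$\widehat{\beta}_{t} = \left(X^{\top}W_{t-1}X\right)^{-1}X^{\top}\left(W_{t-1}X\widehat{\beta}_{t-1} + y - \widehat{y}_{t-1}\right)$$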
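The equivalence between the two updates can be verified numerically. The sketch below, which assumes NumPy and uses hypothetical names and toy data, computes one Newton-Raphson step and one weighted least squares step from the same starting value and compares them.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def newton_step(beta, X, y):
    """beta + (X^T W X)^{-1} X^T (y - y_hat)."""
    y_hat = logistic(X @ beta)
    w = y_hat * (1.0 - y_hat)
    return beta + np.linalg.solve((X.T * w) @ X, X.T @ (y - y_hat))

def irls_step(beta, X, y):
    """WLS regression of z = X beta + W^{-1} (y - y_hat) on X with weights W."""
    y_hat = logistic(X @ beta)
    w = y_hat * (1.0 - y_hat)
    z = X @ beta + (y - y_hat) / w  # dependent variable of the WLS regression
    return np.linalg.solve((X.T * w) @ X, X.T @ (w * z))

# Hypothetical toy data: N = 6 observations, K = 2 (a constant and one input)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
beta_start = np.array([0.3, -0.7])  # an arbitrary current guess

beta_newton = newton_step(beta_start, X, y)
beta_irls = irls_step(beta_start, X, y)
print(beta_newton, beta_irls)  # the two updates should coincide
```

The division by `w` in `irls_step` can be numerically fragile when fitted probabilities are very close to 0 or 1, which is one reason the alternative form without the inverse of the weight matrix can be convenient.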

Covariance matrix of the estimator

The asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$ is usually estimated with the Hessian (see the lecture on the covariance matrix of MLE estimators), as follows:
$$\widehat{V} = N\left(X^{\top}WX\right)^{-1}$$
where $\widehat{\beta}=\widehat{\beta}_{T}$ and $W=W_{T}$ ($T$ is the last step of the iterative procedure used to maximize the likelihood). As a consequence, the distribution of $\widehat{\beta}$ can be approximated by a normal distribution with mean equal to the true parameter value and variance equal to
$$\left(X^{\top}WX\right)^{-1}$$
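As an illustration, the following NumPy sketch fits the model by Newton-Raphson and then computes the approximate variance of the estimator as the inverse of $X^{\top}WX$ evaluated at the final iterate, together with the implied standard errors. The function names, stopping rule and toy data are assumptions made for this example only.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson from a zero start; returns the estimate and the final weights."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = logistic(X @ beta)
        w = p * (1.0 - p)
        step = np.linalg.solve((X.T * w) @ X, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = logistic(X @ beta)
    return beta, p * (1.0 - p)  # weights evaluated at the final estimate

# Hypothetical toy data with overlapping classes, so that a maximum exists
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

beta_hat, w = fit_logit(X, y)
information = (X.T * w) @ X              # X^T W X
cov_beta = np.linalg.inv(information)    # approximate variance of beta_hat
std_errors = np.sqrt(np.diag(cov_beta))  # standard errors of the coefficients
print(beta_hat, std_errors)
```

With only six observations the normal approximation is of course rough; the sample is kept tiny so that the example stays readable.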

Other examples

StatLect has several MLE examples. Learn how to find the estimators of the parameters of the following models and distributions.

| Model | Type | Solution |
|---|---|---|
| Exponential distribution | Univariate distribution | Analytical |
| Normal distribution | Univariate distribution | Analytical |
| Poisson distribution | Univariate distribution | Analytical |
| T distribution | Univariate distribution | Numerical |
| Multivariate normal distribution | Multivariate distribution | Analytical |
| Normal linear regression model | Regression model | Analytical |
| Probit classification model | Classification model | Numerical |

How to cite

Please cite as:

Taboga, Marco (2021). "Logistic regression - Maximum Likelihood Estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
