
# Logistic regression - Maximum Likelihood Estimation

This lecture deals with maximum likelihood estimation of the logistic classification model (also called logit model or logistic regression).

Before proceeding, you might want to revise the introductions to maximum likelihood estimation (MLE) and to the logit model.

## Model and notation

In the logit model, the output variable $y_i$ is a Bernoulli random variable (it can take only two values, either 1 or 0) and

$$P(y_i = 1 \mid x_i) = S(x_i \beta)$$

where

$$S(t) = \frac{\exp(t)}{1 + \exp(t)}$$

is the logistic function, $x_i$ is a $1 \times K$ vector of inputs and $\beta$ is a $K \times 1$ vector of coefficients.
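As a quick illustration, here is a minimal sketch of the logistic function in Python (NumPy and the function name `S` are my choices; the lecture itself contains no code):

```python
import numpy as np

def S(t):
    """Logistic function S(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + np.exp(-t))

# S maps any real number into the interval (0, 1),
# so S(x_i @ beta) can be interpreted as P(y_i = 1 | x_i).
print(S(0.0))                    # 0.5
print(S(np.array([-5.0, 5.0])))  # values close to 0 and 1
```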

Furthermore, the vector of coefficients $\beta$ is the parameter to be estimated by maximum likelihood.

We assume that the estimation is carried out with an IID sample comprising $N$ data points

$$(y_1, x_1), \ldots, (y_N, x_N)$$

## The likelihood

The likelihood of an observation $(y_i, x_i)$ can be written as

$$L(\beta; y_i, x_i) = [S(x_i \beta)]^{y_i} [1 - S(x_i \beta)]^{1 - y_i}$$

If you are wondering about the exponents $y_i$ and $1 - y_i$ or, more generally, about this formula for the likelihood, you are advised to revise the lecture on Classification models and their maximum likelihood estimation.

Denote the $N \times 1$ vector of all outputs by $y$ and the $N \times K$ matrix of all inputs by $X$.

Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:

$$L(\beta; y, X) = \prod_{i=1}^{N} [S(x_i \beta)]^{y_i} [1 - S(x_i \beta)]^{1 - y_i}$$

## The log-likelihood

The log-likelihood of the logistic model is

$$l(\beta; y, X) = \sum_{i=1}^{N} \left[ y_i x_i \beta - \ln(1 + \exp(x_i \beta)) \right]$$

**Proof**

It is computed as follows:

$$\begin{aligned}
l(\beta; y, X) &= \ln L(\beta; y, X) \\
&= \sum_{i=1}^{N} \left[ y_i \ln S(x_i \beta) + (1 - y_i) \ln(1 - S(x_i \beta)) \right] \\
&= \sum_{i=1}^{N} \left[ y_i \ln \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)} + (1 - y_i) \ln \frac{1}{1 + \exp(x_i \beta)} \right] \\
&= \sum_{i=1}^{N} \left[ y_i x_i \beta - y_i \ln(1 + \exp(x_i \beta)) - (1 - y_i) \ln(1 + \exp(x_i \beta)) \right] \\
&= \sum_{i=1}^{N} \left[ y_i x_i \beta - \ln(1 + \exp(x_i \beta)) \right]
\end{aligned}$$

## The score

The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is

$$\nabla_{\beta} l(\beta; y, X) = \sum_{i=1}^{N} x_i^{\top} (y_i - S(x_i \beta))$$

**Proof**

This is obtained as follows:

$$\begin{aligned}
\nabla_{\beta} l(\beta; y, X) &= \frac{\partial}{\partial \beta} \sum_{i=1}^{N} \left[ y_i x_i \beta - \ln(1 + \exp(x_i \beta)) \right] \\
&= \sum_{i=1}^{N} \left[ y_i x_i^{\top} - \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)} x_i^{\top} \right] \\
&= \sum_{i=1}^{N} x_i^{\top} (y_i - S(x_i \beta))
\end{aligned}$$

## The Hessian

The Hessian, that is, the matrix of second derivatives, is

$$\nabla^2_{\beta} l(\beta; y, X) = -\sum_{i=1}^{N} x_i^{\top} x_i \, S(x_i \beta) (1 - S(x_i \beta))$$

**Proof**

It can be proved as follows:

$$\begin{aligned}
\nabla^2_{\beta} l(\beta; y, X) &= \frac{\partial}{\partial \beta^{\top}} \sum_{i=1}^{N} x_i^{\top} (y_i - S(x_i \beta)) \\
&= -\sum_{i=1}^{N} x_i^{\top} \frac{dS}{dt}(x_i \beta) \, x_i \\
&= -\sum_{i=1}^{N} x_i^{\top} x_i \, S(x_i \beta) (1 - S(x_i \beta))
\end{aligned}$$

where we have used the fact that the derivative of the logistic function is

$$\frac{dS(t)}{dt} = S(t)(1 - S(t))$$

## Absence of analytical solutions

The maximum likelihood estimator $\hat{\beta}$ of the parameter $\beta$ solves

$$\hat{\beta} = \arg\max_{\beta} l(\beta; y, X)$$

In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for an introduction to the numerical maximization of the likelihood).
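The formulas for the log-likelihood, the score and the Hessian derived above can be checked numerically. The following sketch (simulated data and NumPy are my assumptions, not part of the lecture) compares the analytical score with a finite-difference approximation of the gradient of the log-likelihood:

```python
import numpy as np

def S(t):
    return 1.0 / (1.0 + np.exp(-t))

def loglik(beta, y, X):
    # l(beta) = sum_i [ y_i x_i beta - ln(1 + exp(x_i beta)) ]
    t = X @ beta
    return np.sum(y * t - np.log1p(np.exp(t)))

def score(beta, y, X):
    # sum_i x_i' (y_i - S(x_i beta))  =  X' (y - p)
    return X.T @ (y - S(X @ beta))

def hessian(beta, y, X):
    # -sum_i x_i' x_i S(x_i beta)(1 - S(x_i beta))  =  -X' W X
    w = S(X @ beta) * (1 - S(X @ beta))
    return -(X.T * w) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (rng.random(50) < 0.5).astype(float)
b = np.array([0.3, -0.7])

# Central finite-difference approximation of the gradient.
eps = 1e-6
num_grad = np.array([(loglik(b + eps * e, y, X) - loglik(b - eps * e, y, X)) / (2 * eps)
                     for e in np.eye(2)])

print(np.allclose(score(b, y, X), num_grad, atol=1e-4))  # True
# The Hessian is negative definite, so the log-likelihood is concave.
print(np.all(np.linalg.eigvalsh(hessian(b, y, X)) < 0))  # True
```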

## Perfect separation of classes

The maximization problem is not guaranteed to have a solution because some pathological situations can arise in which the log-likelihood is an unbounded function of the parameters.

In these situations the log-likelihood can be made as large as desired by appropriately choosing $\beta$. This happens when the residuals $y_i - S(x_i \beta)$ can be made as small as desired (so-called perfect separation of classes).

It is not a common situation. It means that the model can perfectly fit the observed classes.
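The following sketch (a hypothetical one-dimensional example, not from the lecture) shows the phenomenon: when the data are perfectly separated, scaling the coefficient up always increases the log-likelihood, which approaches its supremum of zero without ever attaining a maximum:

```python
import numpy as np

# Perfectly separated data: y_i = 1 exactly when x_i > 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def loglik(b):
    t = b * x
    return np.sum(y * t - np.log1p(np.exp(t)))

# The log-likelihood keeps increasing toward zero as the
# coefficient grows, so the maximization problem has no solution.
for b in (1.0, 5.0, 10.0):
    print(b, loglik(b))
```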

## The first-order condition

In all other situations, the maximization problem has a solution, and at the maximum the score vector satisfies the first-order condition

$$\nabla_{\beta} l(\hat{\beta}; y, X) = 0$$

that is,

$$\sum_{i=1}^{N} x_i^{\top} (y_i - S(x_i \hat{\beta})) = 0$$

Note that $y_i - S(x_i \hat{\beta})$ is the error committed by using $S(x_i \hat{\beta})$ as a predictor of $y_i$. It is similar to a regression residual (see Linear regression).

Furthermore, the first-order condition above is similar to the first-order condition found when estimating a linear regression model by ordinary least squares: it says that the residuals $y_i - S(x_i \hat{\beta})$ need to be orthogonal to the predictors $x_i$.
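The orthogonality property can be verified numerically. In this sketch (simulated data; SciPy's general-purpose minimizer is used as an off-the-shelf way to locate the maximum), the residuals at the fitted parameter are orthogonal to each column of the design matrix up to numerical tolerance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
true_beta = np.array([1.0, -0.5, 0.25])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

def negloglik(beta):
    # Minimizing the negative log-likelihood maximizes the likelihood.
    t = X @ beta
    return -np.sum(y * t - np.log1p(np.exp(t)))

beta_hat = minimize(negloglik, np.zeros(K)).x

# At the maximum, the residuals y_i - S(x_i beta_hat) are orthogonal
# to the predictors: X' (y - p) = 0, as with OLS residuals.
residuals = y - 1.0 / (1.0 + np.exp(-X @ beta_hat))
print(np.abs(X.T @ residuals).max())  # close to zero
```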

## Newton-Raphson method

The first-order condition above has no explicit solution. In most statistical software packages it is solved by using the Newton-Raphson method. The method is pretty simple: we start from a guess $\beta_0$ of the solution (e.g., $\beta_0 = 0$), and then we recursively update the guess with the equation

$$\beta_{t+1} = \beta_t - \left[ \nabla^2_{\beta} l(\beta_t; y, X) \right]^{-1} \nabla_{\beta} l(\beta_t; y, X)$$

until numerical convergence (of $\beta_t$ to the solution $\hat{\beta}$).

Denote by $\hat{y}_t$ the $N \times 1$ vector of conditional probabilities of the outputs computed by using $\beta_t$ as parameter:

$$\hat{y}_t = \begin{bmatrix} S(x_1 \beta_t) \\ \vdots \\ S(x_N \beta_t) \end{bmatrix}$$

Denote by $W_t$ the $N \times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $S(x_1 \beta_t)(1 - S(x_1 \beta_t))$, ..., $S(x_N \beta_t)(1 - S(x_N \beta_t))$:

$$W_t = \begin{bmatrix} S(x_1 \beta_t)(1 - S(x_1 \beta_t)) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & S(x_N \beta_t)(1 - S(x_N \beta_t)) \end{bmatrix}$$

The matrix of inputs $X$, which is called the design matrix (as in linear regression), is assumed to be a full-rank matrix.

By using this notation, the score in the Newton-Raphson recursive formula can be written as

$$\nabla_{\beta} l(\beta_t; y, X) = X^{\top} (y - \hat{y}_t)$$

and the Hessian as

$$\nabla^2_{\beta} l(\beta_t; y, X) = -X^{\top} W_t X$$

Therefore, the Newton-Raphson formula becomes

$$\beta_{t+1} = \beta_t + (X^{\top} W_t X)^{-1} X^{\top} (y - \hat{y}_t)$$

where the existence of the inverse $(X^{\top} W_t X)^{-1}$ is guaranteed by the assumption that $X$ has full rank (the assumption also guarantees that the log-likelihood is concave and that the maximum likelihood problem has a unique solution).
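The iteration above can be sketched in a few lines of NumPy (a minimal illustration with simulated data; production statistical packages add safeguards such as step-halving and convergence diagnostics):

```python
import numpy as np

def logit_newton_raphson(y, X, tol=1e-10, max_iter=100):
    """Maximum likelihood for the logit model via Newton-Raphson."""
    beta = np.zeros(X.shape[1])              # starting guess beta_0 = 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # hat y_t
        w = p * (1 - p)                      # diagonal of W_t
        # beta_{t+1} = beta_t + (X' W_t X)^{-1} X' (y - hat y_t)
        step = np.linalg.solve((X.T * w) @ X, X.T @ (y - p))
        beta = beta + step
        if np.abs(step).max() < tol:         # numerical convergence
            break
    return beta

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([0.5, -1.0, 2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = logit_newton_raphson(y, X)
print(beta_hat)  # estimates close to true_beta for a sample this large
```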

## Iteratively reweighted least squares

If you deal with logit models, you will often read that they can be estimated by Iteratively Reweighted Least Squares (IRLS). The Newton-Raphson formula above is equivalent to the IRLS formula

$$\beta_{t+1} = (X^{\top} W_t X)^{-1} X^{\top} W_t z_t$$

that is obtained by performing a Weighted Least Squares (WLS) estimation, with weights $W_t$, of a linear regression of the dependent variables

$$z_t = X \beta_t + W_t^{-1} (y - \hat{y}_t)$$

on the regressors $X$.
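A one-step numerical comparison (a sketch; the data and the current guess $\beta_t$ are arbitrary choices of mine) confirms that the two updates coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.5).astype(float)
beta_t = np.array([0.1, 0.2, -0.3])

p = 1.0 / (1.0 + np.exp(-X @ beta_t))   # hat y_t
w = p * (1 - p)                          # diagonal of W_t

# Newton-Raphson update.
newton = beta_t + np.linalg.solve((X.T * w) @ X, X.T @ (y - p))

# IRLS update: WLS regression of z_t = X beta_t + W_t^{-1}(y - hat y_t) on X.
z = X @ beta_t + (y - p) / w
irls = np.linalg.solve((X.T * w) @ X, (X.T * w) @ z)

print(np.allclose(newton, irls))  # True
```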

**Proof**

Write $\beta_t$ as

$$\beta_t = (X^{\top} W_t X)^{-1} X^{\top} W_t X \beta_t$$

Then, we can re-write the Newton-Raphson formula as follows:

$$\begin{aligned}
\beta_{t+1} &= \beta_t + (X^{\top} W_t X)^{-1} X^{\top} (y - \hat{y}_t) \\
&= (X^{\top} W_t X)^{-1} X^{\top} W_t X \beta_t + (X^{\top} W_t X)^{-1} X^{\top} W_t W_t^{-1} (y - \hat{y}_t) \\
&= (X^{\top} W_t X)^{-1} X^{\top} W_t \left[ X \beta_t + W_t^{-1} (y - \hat{y}_t) \right] \\
&= (X^{\top} W_t X)^{-1} X^{\top} W_t z_t
\end{aligned}$$

The IRLS formula can alternatively be written as

$$\beta_{t+1} = \beta_t + (X^{\top} W_t X)^{-1} X^{\top} (y - \hat{y}_t)$$

which coincides with the Newton-Raphson formula and avoids computing $W_t^{-1}$ explicitly.

## Covariance matrix of the estimator

The asymptotic covariance matrix of the maximum likelihood estimator $\hat{\beta}$ is usually estimated with the Hessian (see the lecture on the covariance matrix of MLE estimators), as follows:

$$\widehat{\mathrm{Var}}[\hat{\beta}] = (X^{\top} \hat{W} X)^{-1}$$

where $\hat{W} = W_T$ and $\hat{y} = \hat{y}_T$ ($T$ is the last step of the iterative procedure used to maximize the likelihood). As a consequence, the distribution of $\hat{\beta}$ can be approximated by a normal distribution with mean equal to the true parameter value and covariance matrix equal to $(X^{\top} \hat{W} X)^{-1}$.

## Other examples

StatLect has several MLE examples. Learn how to find the estimators of the parameters of the following models and distributions.

| | Type | Solution |
|---|---|---|
| Exponential distribution | Univariate distribution | Analytical |
| Normal distribution | Univariate distribution | Analytical |
| Poisson distribution | Univariate distribution | Analytical |
| T distribution | Univariate distribution | Numerical |
| Multivariate normal distribution | Multivariate distribution | Analytical |
| Normal linear regression model | Regression model | Analytical |
| Probit classification model | Classification model | Numerical |