This lecture deals with maximum likelihood estimation of the logistic classification model (also called logit model or logistic regression).

Before reading this lecture, you might want to revise the lectures on maximum likelihood estimation and on the logit model.

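Remember that in the logit model the output variable $y_i$ is a Bernoulli random variable (it can take only two values, either 1 or 0) and

$$P(y_i = 1 \mid x_i) = S(x_i \beta)$$

where

$$S(t) = \frac{1}{1 + \exp(-t)}$$

is the logistic function, $x_i$ is a $1 \times K$ vector of inputs and $\beta$ is a $K \times 1$ vector of coefficients.

Furthermore,

$$P(y_i = 0 \mid x_i) = 1 - S(x_i \beta).$$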
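As a minimal sketch of the model just described, the snippet below evaluates the logistic function and the two conditional probabilities for a single observation. The input row `x_i` and the coefficients `beta` are made-up illustrative values, and the `tanh` form is an implementation choice (it is algebraically equal to $1/(1+\exp(-t))$ and avoids overflow for large $|t|$), not something prescribed by the lecture.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t)), evaluated elementwise.

    Uses the identity S(t) = (1 + tanh(t / 2)) / 2, which never overflows.
    """
    t = np.asarray(t, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * t))

# Hypothetical values: one input row x_i (1 x K) and coefficients beta (K x 1)
x_i = np.array([1.0, 0.5, -2.0])
beta = np.array([0.2, 1.0, 0.3])

p1 = logistic(x_i @ beta)   # P(y_i = 1 | x_i) = S(x_i beta)
p0 = 1.0 - p1               # P(y_i = 0 | x_i) = 1 - S(x_i beta)
print(float(p1), float(p0))
```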

The vector of coefficients $\beta$ is the parameter to be estimated by maximum likelihood.

We assume that the estimation is carried out with an IID sample comprising $N$ data points $(y_i, x_i)$ for $i = 1, \ldots, N$.

The likelihood of an observation $(y_i, x_i)$ can be written as

$$L(\beta; y_i, x_i) = [S(x_i \beta)]^{y_i} [1 - S(x_i \beta)]^{1 - y_i}.$$

If you are wondering about the exponents $y_i$ and $1 - y_i$ or, more generally, about this formula for the likelihood, you are advised to revise the lecture on Classification models and their maximum likelihood estimation.

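Denote the $N \times 1$ vector of all outputs by $y$ and the $N \times K$ matrix of all inputs by $X$. Since the observations are IID, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:

$$L(\beta; y, X) = \prod_{i=1}^{N} [S(x_i \beta)]^{y_i} [1 - S(x_i \beta)]^{1 - y_i}.$$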

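The log-likelihood of the logistic model is

$$l(\beta; y, X) = \sum_{i=1}^{N} \left[ -\ln\left(1 + \exp(x_i \beta)\right) + y_i x_i \beta \right].$$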

Proof

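It is computed as follows:

$$\begin{aligned}
l(\beta; y, X) &= \ln L(\beta; y, X) \\
&= \sum_{i=1}^{N} \left[ y_i \ln S(x_i \beta) + (1 - y_i) \ln\left(1 - S(x_i \beta)\right) \right] \\
&= \sum_{i=1}^{N} \left[ y_i \left( x_i \beta - \ln\left(1 + \exp(x_i \beta)\right) \right) - (1 - y_i) \ln\left(1 + \exp(x_i \beta)\right) \right] \\
&= \sum_{i=1}^{N} \left[ -\ln\left(1 + \exp(x_i \beta)\right) + y_i x_i \beta \right]
\end{aligned}$$

where we have used the facts that $\ln S(t) = t - \ln(1 + \exp(t))$ and $\ln(1 - S(t)) = -\ln(1 + \exp(t))$.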
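The equivalence between the simplified log-likelihood and the direct Bernoulli form can be checked numerically. The sketch below is only an illustration on simulated data (the sample size, seed and coefficient values are arbitrary assumptions); `np.logaddexp(0, eta)` is used because it computes $\ln(1 + \exp(\eta))$ without overflow.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Simplified form: sum_i [ y_i * x_i beta - ln(1 + exp(x_i beta)) ]."""
    eta = X @ beta                                  # linear indices x_i beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def log_likelihood_direct(beta, X, y):
    """Direct form: sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ], p_i = S(x_i beta)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Simulated sample (all values below are illustrative assumptions)
rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([0.5, -1.0, 2.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(X @ beta)))).astype(float)

ll_simplified = log_likelihood(beta, X, y)
ll_direct = log_likelihood_direct(beta, X, y)
print(ll_simplified, ll_direct)
```

The direct form is fine for moderate values of $x_i \beta$ but can produce `log(0)` when a fitted probability rounds to 0 or 1, which is one reason to prefer the simplified form in practice.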

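The score vector, that is, the vector of first derivatives of the log-likelihood with respect to the parameter $\beta$, is

$$\nabla_{\beta} l(\beta; y, X) = \sum_{i=1}^{N} x_i^{\top} \left[ y_i - S(x_i \beta) \right].$$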

Proof

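This is obtained as follows:

$$\begin{aligned}
\nabla_{\beta} l(\beta; y, X) &= \nabla_{\beta} \sum_{i=1}^{N} \left[ -\ln\left(1 + \exp(x_i \beta)\right) + y_i x_i \beta \right] \\
&= \sum_{i=1}^{N} \left[ -\frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)} x_i^{\top} + y_i x_i^{\top} \right] \\
&= \sum_{i=1}^{N} x_i^{\top} \left[ y_i - S(x_i \beta) \right]
\end{aligned}$$

where the last step uses $\exp(t) / (1 + \exp(t)) = 1 / (1 + \exp(-t)) = S(t)$.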
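A common sanity check for an analytical score is to compare it with a finite-difference gradient of the log-likelihood. The sketch below does this on simulated data; the helper `numerical_gradient`, the step size `h` and the data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def log_likelihood(beta, X, y):
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def score(beta, X, y):
    """Analytical score: sum_i x_i' (y_i - S(x_i beta)) = X' (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

def numerical_gradient(f, beta, h=1e-6):
    """Central finite differences, one coordinate at a time (hypothetical helper)."""
    grad = np.zeros_like(beta)
    for k in range(beta.size):
        step = np.zeros_like(beta)
        step[k] = h
        grad[k] = (f(beta + step) - f(beta - step)) / (2.0 * h)
    return grad

# Simulated sample (illustrative values)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.integers(0, 2, size=40).astype(float)
beta = np.array([0.3, -0.7, 1.2])

analytical = score(beta, X, y)
numerical = numerical_gradient(lambda b: log_likelihood(b, X, y), beta)
print(analytical)
print(numerical)
```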

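The Hessian, that is, the matrix of second derivatives, is

$$\nabla_{\beta\beta} l(\beta; y, X) = -\sum_{i=1}^{N} x_i^{\top} x_i \, S(x_i \beta) \left[ 1 - S(x_i \beta) \right].$$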

Proof

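It can be proved as follows:

$$\begin{aligned}
\nabla_{\beta\beta} l(\beta; y, X) &= \frac{\partial}{\partial \beta^{\top}} \sum_{i=1}^{N} x_i^{\top} \left[ y_i - S(x_i \beta) \right] \\
&= -\sum_{i=1}^{N} x_i^{\top} \frac{\partial S(x_i \beta)}{\partial \beta^{\top}} \\
&= -\sum_{i=1}^{N} x_i^{\top} x_i \, S(x_i \beta) \left[ 1 - S(x_i \beta) \right]
\end{aligned}$$

where we have used the fact that the derivative of the logistic function is

$$\frac{dS(t)}{dt} = S(t) \left[ 1 - S(t) \right].$$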
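The Hessian can be assembled in a couple of lines once the fitted probabilities are available. The following sketch (simulated data, arbitrary coefficient values) builds it as $-X^{\top} W X$ with $W$ diagonal, and inspects its eigenvalues: with a full-rank input matrix they should all be negative, which is what makes the log-likelihood concave.

```python
import numpy as np

def hessian(beta, X):
    """Hessian: -sum_i x_i' x_i S(x_i beta) [1 - S(x_i beta)] = -X' W X."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)              # diagonal entries of W, each in (0, 1/4]
    return -(X.T * w) @ X          # X.T * w scales column i of X.T by w_i

# Simulated inputs (illustrative values); the Hessian does not depend on y
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
beta = np.array([0.3, -0.7, 1.2])

H = hessian(beta, X)
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)
```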

The maximum likelihood estimator $\hat{\beta}$ of the parameter $\beta$ solves

$$\hat{\beta} = \arg\max_{\beta} \, l(\beta; y, X).$$

In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for an introduction to the numerical maximization of the likelihood).

Moreover, this maximization problem is not guaranteed to have a solution, because some pathological situations can arise in which the log-likelihood has no maximizer. The log-likelihood is never positive (it is a sum of logarithms of probabilities), and in these situations it can be made as close to zero as desired by letting the norm of $\beta$ grow without bound, so that its supremum is not attained by any finite $\beta$. This happens when the residuals $y_i - S(x_i \beta)$ can all be made as small as desired (so-called perfect separation of classes), which means that the model can perfectly fit the observed classes. It is not a common situation. Outside of such separation problems, the maximization problem has a solution, and at the maximum the score vector satisfies the first order condition

$$\nabla_{\beta} l(\hat{\beta}; y, X) = 0,$$

that is,

$$\sum_{i=1}^{N} x_i^{\top} \left[ y_i - S(x_i \hat{\beta}) \right] = 0.$$
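Perfect separation is easy to reproduce on a toy data set. In the sketch below (four made-up observations in which every negative input has class 0 and every positive input has class 1), the log-likelihood is evaluated along a fixed direction while the coefficients are scaled up: it keeps increasing towards zero without ever reaching it, so no finite coefficient vector is a maximizer.

```python
import numpy as np

def log_likelihood(beta, X, y):
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Hypothetical toy data: an intercept column plus one input that separates the classes
X = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
y = np.array([0.0, 0.0, 1.0, 1.0])

direction = np.array([0.0, 1.0])     # zero intercept, positive slope
scales = [1.0, 5.0, 10.0, 20.0]
values = [log_likelihood(c * direction, X, y) for c in scales]

for c, v in zip(scales, values):
    print(f"scale = {c:5.1f}   log-likelihood = {v:.3e}")
```

In practice, this shows up as coefficient estimates that keep growing from one iteration to the next while the fitted probabilities approach 0 and 1.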

Note that $y_i - S(x_i \hat{\beta})$ is the error committed by using $S(x_i \hat{\beta})$ as a predictor of $y_i$. It is similar to a regression residual (see Linear regression). Furthermore, the first order condition above is similar to the first order condition that is found when estimating a linear regression model by ordinary least squares: it says that the residuals need to be orthogonal to the predictors $x_i$.

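The first order condition above has no explicit solution. In most statistical software packages it is solved by using the Newton-Raphson method. The method is pretty simple: we start from a guess $\hat{\beta}_0$ of the solution (e.g., $\hat{\beta}_0 = 0$), and then we recursively update the guess with the equation

$$\hat{\beta}_t = \hat{\beta}_{t-1} - \left[ \nabla_{\beta\beta} l(\hat{\beta}_{t-1}; y, X) \right]^{-1} \nabla_{\beta} l(\hat{\beta}_{t-1}; y, X)$$

until numerical convergence (of $\hat{\beta}_t$ to the solution $\hat{\beta}$).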

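Denote by $\lambda_t$ the $N \times 1$ vector of conditional probabilities of the outputs computed by using $\hat{\beta}_t$ as parameter:

$$\lambda_t = \begin{bmatrix} S(x_1 \hat{\beta}_t) \\ \vdots \\ S(x_N \hat{\beta}_t) \end{bmatrix}$$

Denote by $W_t$ the $N \times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $S(x_1 \hat{\beta}_t)[1 - S(x_1 \hat{\beta}_t)]$, ..., $S(x_N \hat{\beta}_t)[1 - S(x_N \hat{\beta}_t)]$:

$$W_t = \begin{bmatrix} S(x_1 \hat{\beta}_t)[1 - S(x_1 \hat{\beta}_t)] & & 0 \\ & \ddots & \\ 0 & & S(x_N \hat{\beta}_t)[1 - S(x_N \hat{\beta}_t)] \end{bmatrix}$$

The $N \times K$ matrix of inputs

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}$$

which is called design matrix (as in linear regression), is assumed to be a full-rank matrix (i.e., to have rank $K$).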

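By using this notation, the score in the Newton-Raphson recursive formula can be written as

$$\nabla_{\beta} l(\hat{\beta}_{t-1}; y, X) = X^{\top} (y - \lambda_{t-1})$$

and the Hessian as

$$\nabla_{\beta\beta} l(\hat{\beta}_{t-1}; y, X) = -X^{\top} W_{t-1} X.$$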

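Therefore, the Newton-Raphson formula becomes

$$\hat{\beta}_t = \hat{\beta}_{t-1} + \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} (y - \lambda_{t-1})$$

where the existence of the inverse is guaranteed by the assumption that $X$ has full rank, together with the fact that the diagonal elements of $W_{t-1}$ are strictly positive (the assumption also guarantees that the log-likelihood is strictly concave, so that the solution of the maximum likelihood problem, when it exists, is unique).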
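The recursion above translates almost line by line into code. What follows is a minimal sketch on simulated data, not a production routine: the function name `fit_logit_newton`, the stopping rule (largest absolute change in the coefficients below `tol`), the iteration cap and the data-generating values are all assumptions made for illustration. The linear system is solved with `np.linalg.solve` instead of forming the inverse explicitly, which is a standard numerical choice.

```python
import numpy as np

def fit_logit_newton(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson: beta_t = beta_{t-1} + (X' W X)^{-1} X' (y - lambda)."""
    beta = np.zeros(X.shape[1])                      # initial guess beta_0 = 0
    for iteration in range(1, max_iter + 1):
        lam = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities lambda_{t-1}
        w = lam * (1.0 - lam)                        # diagonal of W_{t-1}
        step = np.linalg.solve((X.T * w) @ X, X.T @ (y - lam))
        beta = beta + step
        if np.max(np.abs(step)) < tol:               # assumed convergence criterion
            break
    return beta, iteration

# Simulated sample (illustrative values, classes overlap so a solution exists)
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_hat, n_iter = fit_logit_newton(X, y)
score_at_solution = X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ beta_hat))))
print(beta_hat, n_iter)
print(score_at_solution)   # should be numerically zero (first order condition)
```

With perfectly separated data this loop would not converge: the coefficients would keep growing until `max_iter` is reached or the linear system becomes numerically singular.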

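If you deal with logit models, you will often read that they can be estimated by Iteratively Reweighted Least Squares (IRLS). The Newton-Raphson formula above is equivalent to the IRLS formula

$$\hat{\beta}_t = \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} W_{t-1} \left[ X \hat{\beta}_{t-1} + W_{t-1}^{-1} (y - \lambda_{t-1}) \right]$$

that is obtained by performing a Weighted Least Squares (WLS) estimation, with weights $W_{t-1}$, of a linear regression of the dependent variables $X \hat{\beta}_{t-1} + W_{t-1}^{-1} (y - \lambda_{t-1})$ on the regressors $X$.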

Proof

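Write $\hat{\beta}_{t-1}$ as

$$\hat{\beta}_{t-1} = \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} W_{t-1} X \hat{\beta}_{t-1}.$$

Then, we can re-write the Newton-Raphson formula as follows:

$$\begin{aligned}
\hat{\beta}_t &= \hat{\beta}_{t-1} + \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} (y - \lambda_{t-1}) \\
&= \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} W_{t-1} X \hat{\beta}_{t-1} + \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} W_{t-1} W_{t-1}^{-1} (y - \lambda_{t-1}) \\
&= \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} W_{t-1} \left[ X \hat{\beta}_{t-1} + W_{t-1}^{-1} (y - \lambda_{t-1}) \right]
\end{aligned}$$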

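The IRLS formula can alternatively be written as

$$\hat{\beta}_t = \left( X^{\top} W_{t-1} X \right)^{-1} X^{\top} W_{t-1} z_{t-1}$$

where $z_{t-1} = X \hat{\beta}_{t-1} + W_{t-1}^{-1} (y - \lambda_{t-1})$ is the vector of dependent variables used in the WLS regression.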
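The equivalence between a Newton-Raphson step and a weighted least squares step can also be verified numerically. In the sketch below (simulated data, arbitrary starting coefficients), the WLS step is computed as an ordinary least squares regression on variables rescaled by the square roots of the weights, which is one standard way of carrying out WLS; the function names and the vector `z` of transformed dependent variables are illustrative choices.

```python
import numpy as np

def newton_step(beta, X, y):
    """One Newton-Raphson update: beta + (X' W X)^{-1} X' (y - lambda)."""
    lam = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = lam * (1.0 - lam)
    return beta + np.linalg.solve((X.T * w) @ X, X.T @ (y - lam))

def irls_step(beta, X, y):
    """One IRLS update: WLS regression of z on X with weights w."""
    lam = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = lam * (1.0 - lam)
    z = X @ beta + (y - lam) / w            # dependent variable X beta + W^{-1}(y - lambda)
    sqrt_w = np.sqrt(w)
    # WLS = OLS on rows rescaled by the square roots of the weights
    solution, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * z, rcond=None)
    return solution

# Simulated sample (illustrative values)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = rng.integers(0, 2, size=100).astype(float)
beta_start = np.array([0.1, -0.2, 0.3])

beta_newton = newton_step(beta_start, X, y)
beta_irls = irls_step(beta_start, X, y)
print(beta_newton)
print(beta_irls)
```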

The asymptotic covariance matrix of the maximum likelihood estimator is usually estimated with the Hessian (see the lecture on the covariance matrix of MLE estimators), as follows:

$$\hat{V} = N \left( X^{\top} W X \right)^{-1}$$

where $W = W_T$ and $\hat{\beta} = \hat{\beta}_T$ ($T$ is the last step of the iterative procedure used to maximize the likelihood). As a consequence, the distribution of $\hat{\beta}$ can be approximated by a normal distribution with mean equal to the true parameter value and variance equal to

$$\frac{1}{N} \hat{V} = \left( X^{\top} W X \right)^{-1}.$$
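In practice, the approximate variance $(X^{\top} W X)^{-1}$ is what is used to compute standard errors. The sketch below fits the model on simulated data with a plain Newton-Raphson loop and then takes the square roots of the diagonal of that matrix; the fixed number of iterations, the data-generating values and the variable names are assumptions made for illustration.

```python
import numpy as np

# Simulated sample (illustrative values)
rng = np.random.default_rng(3)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([0.5, -1.0, 0.8])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

# Newton-Raphson with a fixed, generous number of iterations (assumed sufficient here)
beta_hat = np.zeros(X.shape[1])
for _ in range(25):
    lam = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    w = lam * (1.0 - lam)
    beta_hat = beta_hat + np.linalg.solve((X.T * w) @ X, X.T @ (y - lam))

# Weights evaluated at the final estimate, then the approximate covariance matrix
lam = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
w = lam * (1.0 - lam)
cov_beta_hat = np.linalg.inv((X.T * w) @ X)       # (X' W X)^{-1}
std_errors = np.sqrt(np.diag(cov_beta_hat))

for b, se in zip(beta_hat, std_errors):
    print(f"estimate = {b: .4f}   standard error = {se:.4f}")
```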
