# Logistic classification model (logit or logistic regression)

The logistic model (or logit) is a classification model used to predict variables that can take only two values.

## Overview

The logistic classification model has the following characteristics:

• the output variable can be equal to either 0 or 1;

• the predicted output is a number between 0 and 1;

• as in linear regression, we use a vector of estimated coefficients $\widehat{\beta}$ to compute $x\widehat{\beta}$, a linear combination of the input variables $x$;

• unlike in linear regression, we transform $x\widehat{\beta}$ using a nonlinear function $S$, to make sure that the predictions are between 0 and 1.
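As a quick illustration, the two steps above (linear combination, then nonlinear squashing) can be sketched in Python; the coefficient and input values below are hypothetical:

```python
import numpy as np

def logistic(t):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical estimated coefficients and one input vector (illustrative values).
beta_hat = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])

linear_combination = x @ beta_hat          # can be any real number
prediction = logistic(linear_combination)  # forced into (0, 1)

print(prediction)
```

Whatever value the linear combination takes, the transformed prediction always lies strictly between 0 and 1.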

## Interpretation of the predicted output

In a logit model, the predicted output has two interpretations:

• the estimated probability that the output $y$ will be equal to 1;

• our best guess of the value of the output variable $y$.

## Classification vs regression

A logit model is often called a logistic regression model.

However, we prefer to stick to the convention (widespread in the machine learning community) of using the term regression only for models in which the output variable is continuous.

Therefore, we use the term classification here because in a logit model the output is discrete.

## Sample

Suppose that we observe a sample of data $(y_i, x_i)$ for $i = 1, \ldots, N$.

Each observation has:

• an output variable, denoted by $y_i$;

• a $1 \times K$ vector of inputs, denoted by $x_i$.

## Conditional probabilities

The output can take only two values, either 0 or 1 (it is a Bernoulli random variable).

The probability that the output $y_i$ is equal to 1, conditional on the inputs $x_i$, is assumed to be

$$P(y_i = 1 \mid x_i) = S(x_i \beta)$$

where

$$S(t) = \frac{1}{1 + \exp(-t)}$$

is the logistic function and $\beta$ is a $K \times 1$ vector of coefficients.

The probability that $y_i$ is equal to 0 is

$$P(y_i = 0 \mid x_i) = 1 - S(x_i \beta)$$

## The logistic function

It is immediate to see that the logistic function

$$S(t) = \frac{1}{1 + \exp(-t)}$$

is always positive.

Furthermore, it is increasing and satisfies

$$\lim_{t \to -\infty} S(t) = 0 \qquad \text{and} \qquad \lim_{t \to +\infty} S(t) = 1$$

Thus, $S(x_i \beta)$ is a well-defined probability because it lies between 0 and 1.
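These properties are easy to check numerically. The following sketch evaluates the logistic function on a grid of points and asserts positivity, monotonicity, and the two limits:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-20.0, 20.0, 2001)
values = logistic(t)

assert np.all(values > 0.0)            # always positive
assert np.all(np.diff(values) >= 0.0)  # increasing
assert logistic(-20.0) < 1e-8          # tends to 0 as t -> -infinity
assert logistic(20.0) > 1.0 - 1e-8     # tends to 1 as t -> +infinity

print(values.min(), values.max())
```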

## Explanation

Why is the logistic classification model specified in this manner?

Why is the logistic function used to transform the linear combination of inputs $x_i \beta$?

The simple answer is that we would like to do something similar to what we do in a linear regression model: use a linear combination of the inputs as our prediction of the output.

However, our prediction needs to be a probability, and there is no guarantee that the linear combination $x_i \beta$ is between 0 and 1.

Thus, we use the logistic function because it provides a convenient way of transforming $x_i \beta$ and forcing it to lie in the interval between 0 and 1.

## Alternatives

We could have used other functions that enjoy properties similar to the logistic function.

As a matter of fact, other popular classification models can be obtained by simply substituting the logistic function with another function and leaving everything else in the model unchanged.

For example, by substituting the logistic function with the cumulative distribution function of a standard normal distribution, we obtain the so-called probit model.
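The swap is a one-line change. Below is a minimal sketch in which the standard normal CDF (written via the error function) replaces the logistic function; the value of the linear combination is illustrative:

```python
import math

def logistic(t):
    # link function of the logit model
    return 1.0 / (1.0 + math.exp(-t))

def standard_normal_cdf(t):
    # link function of the probit model, via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Both map the same linear combination (illustrative value below)
# to a probability strictly between 0 and 1.
xb = 0.7
print(logistic(xb), standard_normal_cdf(xb))
```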

## The logit model as a latent variable model

Another way of thinking about the logit model is to define a latent variable (i.e., an unobserved variable)

$$y_i^* = x_i \beta + \varepsilon_i \tag{1}$$

where $\varepsilon_i$ is a random error term that adds noise to the relationship between the inputs and the latent variable.

The latent variable is then assumed to determine the output as follows:

$$y_i = \begin{cases} 1 & \text{if } y_i^* \geq 0 \\ 0 & \text{if } y_i^* < 0 \end{cases} \tag{2}$$

From these assumptions and the additional assumption that $\varepsilon_i$ has a symmetric distribution around 0, it follows that

$$P(y_i = 1 \mid x_i) = P(x_i \beta + \varepsilon_i \geq 0) = P(\varepsilon_i \geq -x_i \beta) = P(\varepsilon_i \leq x_i \beta) = F(x_i \beta)$$

where $F$ is the cumulative distribution function of the error $\varepsilon_i$.

It turns out that the logistic function used to define the logit model is the cumulative distribution function of a symmetric probability distribution called standard logistic distribution.

Therefore, the logit model can be written as a latent variable model, specified by equations (1) and (2) above, in which the error $\varepsilon_i$ has a logistic distribution.

By choosing different distributions for the error $\varepsilon_i$, we obtain other binary classification models.

For example, if we assume that $\varepsilon_i$ has a standard normal distribution, then we obtain the probit model.
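The latent-variable mechanism is easy to verify by simulation: drawing logistic errors, building the latent variable, and thresholding it at zero should reproduce the logistic conditional probability. A minimal sketch (the value of the linear combination is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

xb = 0.8                # illustrative value of the linear combination x*beta
n = 200_000
eps = rng.logistic(loc=0.0, scale=1.0, size=n)  # standard logistic errors

y_star = xb + eps               # latent variable y* = x*beta + epsilon
y = (y_star >= 0).astype(int)   # output: 1 if y* >= 0, 0 otherwise

empirical = y.mean()            # fraction of simulated outputs equal to 1
theoretical = logistic(xb)      # P(y = 1 | x) = S(x*beta)
print(empirical, theoretical)
```

The empirical frequency of ones matches $S(x\beta)$ up to simulation noise.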

## Estimation by maximum likelihood

The vector of coefficients $\beta$ is often estimated by maximum likelihood methods.

Assume that the observations in the sample are IID. Denote the $N \times 1$ vector of all outputs by $y$ and the $N \times K$ matrix of all inputs by $X$; the latter is assumed to have full rank.

It is possible to prove (see the lecture on Maximum likelihood estimation of the logit model) that the maximum likelihood estimator (when it exists) can be obtained by performing simple Newton-Raphson iterations as follows:

• start from a guess $\beta_0$ (e.g., $\beta_0 = 0$);

• recursively update the guess:

$$\beta_{t+1} = \beta_t + (X^\top W_t X)^{-1} X^\top [y - S(X\beta_t)]$$

where:

$$S(X\beta_t) = \begin{bmatrix} S(x_1 \beta_t) \\ \vdots \\ S(x_N \beta_t) \end{bmatrix}$$

and $W_t$ is an $N \times N$ diagonal matrix (i.e., having all off-diagonal entries equal to 0) such that the elements on its diagonal are

$$S(x_1 \beta_t)\left[1 - S(x_1 \beta_t)\right], \; \ldots, \; S(x_N \beta_t)\left[1 - S(x_N \beta_t)\right]$$

• stop when numerical convergence is achieved, that is, when the difference between $\beta_{t+1}$ and $\beta_t$ is so small as to be negligible;

• set the maximum likelihood estimator equal to the last update: $\widehat{\beta} = \beta_{T+1}$, where $T$ denotes the last iteration.

The asymptotic covariance matrix of the maximum likelihood estimator can be consistently estimated by

$$\widehat{V} = (X^\top W_T X)^{-1}$$

so that the distribution of the estimator $\widehat{\beta}$ is approximately normal with mean equal to $\beta$ and covariance matrix $\widehat{V}$.
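A minimal sketch of the Newton-Raphson iterations in Python, applied to simulated data with known coefficients. The function `fit_logit` and all numerical values are illustrative, not a reference implementation; for efficiency, the diagonal weight matrix is stored as a vector rather than a full $N \times N$ matrix:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit(X, y, tol=1e-10, max_iter=100):
    """Maximum likelihood estimation of a logit model by Newton-Raphson."""
    beta = np.zeros(X.shape[1])           # start from the guess beta_0 = 0
    for _ in range(max_iter):
        p = logistic(X @ beta)            # S(X beta_t)
        w = p * (1.0 - p)                 # diagonal of W_t
        XWX = X.T @ (X * w[:, None])      # X' W_t X without forming W_t
        update = np.linalg.solve(XWX, X.T @ (y - p))
        beta = beta + update              # beta_{t+1}
        if np.max(np.abs(update)) < tol:  # stop at numerical convergence
            break
    p = logistic(X @ beta)
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))  # estimate of (X' W X)^{-1}
    return beta, cov

# Simulated sample with known coefficients (illustrative).
rng = np.random.default_rng(42)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.5, -1.0])
y = (rng.uniform(size=n) < logistic(X @ true_beta)).astype(float)

beta_hat, cov_hat = fit_logit(X, y)
print(beta_hat)
```

On simulated data like this, the estimates typically land close to the true coefficients, with the diagonal of `cov_hat` giving the squared standard errors.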

## Hypothesis testing

If the logit model is estimated with the maximum likelihood procedure illustrated above, any one of the classical tests based on maximum likelihood procedures (e.g., Wald, Likelihood Ratio, Lagrange Multiplier) can be used to test a hypothesis about the vector of coefficients $\beta$.

Other tests can be constructed by exploiting the asymptotic normality of the maximum likelihood estimator.

For example, we can perform a z test of the null hypothesis

$$H_0: \beta_k = q$$

where $\beta_k$ is the $k$-th entry of the vector of coefficients $\beta$ and $q$ is a given real number.

The test statistic is

$$z = \frac{\widehat{\beta}_k - q}{\sqrt{\widehat{V}_{kk}}}$$

where $\widehat{\beta}_k$ is the $k$-th entry of $\widehat{\beta}$ and $\widehat{V}_{kk}$ is the $k$-th entry on the diagonal of the matrix $\widehat{V}$.

As the sample size increases, $z$ converges in distribution to a standard normal distribution. The latter distribution can be used to derive critical values and perform the test.

Proof

We have

$$z = \frac{\widehat{\beta}_k - q}{\sqrt{\widehat{V}_{kk}}}$$

By the asymptotic normality of the maximum likelihood estimator, the numerator converges in distribution to a normal random variable with mean 0. Furthermore, the consistency of our estimator of the asymptotic covariance matrix implies that

$$\widehat{V}_{kk} \overset{P}{\longrightarrow} V_{kk}$$

where $\overset{P}{\longrightarrow}$ denotes convergence in probability. By the Continuous Mapping theorem,

$$\sqrt{\widehat{V}_{kk}} \overset{P}{\longrightarrow} \sqrt{V_{kk}}$$

and, by Slutsky's theorem, $z$ converges in distribution to a standard normal random variable.
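As a worked example, the z statistic and its two-sided p-value can be computed from hypothetical estimates (the numbers below are illustrative, not taken from a real estimation):

```python
import math

# Illustrative numbers: the k-th coefficient estimate, the k-th diagonal
# entry of the estimated covariance matrix, and the hypothesized value q.
beta_k_hat = 0.47
V_kk = 0.0025          # standard error = sqrt(V_kk) = 0.05
q = 0.0                # null hypothesis H0: beta_k = 0

z = (beta_k_hat - q) / math.sqrt(V_kk)

# Two-sided p-value from the standard normal distribution,
# using the error function to evaluate the normal CDF.
p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
print(z, p_value)
```

With these illustrative numbers, the statistic is far in the tail of the standard normal distribution, so the null hypothesis would be rejected at any conventional significance level.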