Logistic classification model (logit or logistic regression)

by Marco Taboga, PhD

The logistic model (or logit) is a classification model used to predict variables that can take only two values.

Table of contents

Overview
Interpretation of the predicted output
Classification vs regression
Sample
Conditional probabilities
The logistic function
Explanation
Alternatives
The logit model as a latent variable model
Estimation by maximum likelihood
Hypothesis testing

Overview

The logistic classification model has the following characteristics:

the output variable $y_{i}$ can be equal to either 0 or 1;
the predicted output $\widehat{y}_{i}$ is a number between 0 and 1;
as in linear regression, we use a vector of estimated coefficients $\widehat{\beta}$ to compute $x_{i}\widehat{\beta}$, a linear combination of the input variables $x_{i}$;
unlike in linear regression, we transform $x_{i}\widehat{\beta}$ using a nonlinear function $S$, so that the prediction $\widehat{y}_{i}=S(x_{i}\widehat{\beta})$ lies between 0 and 1.

Interpretation of the predicted output

In a logit model, the predicted output $\widehat{y}_{i}$ has two interpretations:

the estimated probability that $y_{i}$ will be equal to 1;
our best guess of the value of the output variable $y_{i}$ .

Classification vs regression

A logit model is often called a logistic regression model.

However, we prefer to stick to the convention (widespread in the machine learning community) of using the term regression only for models in which the output variable is continuous.

Therefore, we use the term classification here because in a logit model the output is discrete.

Sample

Suppose that we observe a sample of data $(y_{i},x_{i})$ for $i=1,\ldots ,N$.

Each observation has:

an output variable denoted by $y_{i}$ ;
a vector of inputs, denoted by $x_{i}$ .

Conditional probabilities

The output $y_{i}$ can take only two values, either 0 or 1 (it is a Bernoulli random variable).

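The probability that the output $y_{i}$ is equal to 1, conditional on the inputs $x_{i}$, is assumed to be

$$P(y_{i}=1\mid x_{i})=S(x_{i}\beta)$$

where $S(t)=\frac{1}{1+\exp(-t)}$ is the logistic function and $\beta$ is a vector of coefficients.

The probability that $y_{i}$ is equal to 0 is

$$P(y_{i}=0\mid x_{i})=1-S(x_{i}\beta).$$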
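As a minimal sketch of these two conditional probabilities, the snippet below evaluates the standard logistic function $S(t)=1/(1+\exp(-t))$ at a linear combination of the inputs. The function names, the input vector and the coefficient values are illustrative assumptions and do not come from the lecture.

```python
import math

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def prob_one(x, beta):
    """P(y = 1 | x) = S(x beta) for a row of inputs x and coefficients beta."""
    return logistic(sum(xj * bj for xj, bj in zip(x, beta)))

# Hypothetical inputs (the first entry plays the role of a constant) and coefficients.
x_i = [1.0, 2.0, -1.0]
beta = [0.5, 0.25, 1.0]

p1 = prob_one(x_i, beta)   # probability that y_i = 1
p0 = 1.0 - p1              # probability that y_i = 0
print(p1, p0)
```

With these particular numbers the linear combination is zero, so both probabilities are one half.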

The logistic function

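The logistic function $S(t)=1/(1+\exp(-t))$ is always positive, because its denominator is positive for every $t$.

Furthermore, it is increasing and

$$\lim_{t\rightarrow -\infty }S(t)=0,\qquad \lim_{t\rightarrow \infty }S(t)=1$$

so that it satisfies

$$0<S(t)<1.$$

Thus, $S(x_{i}\beta)$ is a well-defined probability because it lies between 0 and 1.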
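These properties can be checked numerically on a grid of points. The sketch below assumes the standard form $S(t)=1/(1+\exp(-t))$; the grid and the variable names are arbitrary choices made for illustration.

```python
import math

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

grid = [k / 10.0 for k in range(-100, 101)]   # t from -10 to 10 in steps of 0.1
values = [logistic(t) for t in grid]

strictly_inside = all(0.0 < v < 1.0 for v in values)          # 0 < S(t) < 1
increasing = all(a < b for a, b in zip(values, values[1:]))   # S is increasing
left_tail, right_tail = logistic(-30.0), logistic(30.0)       # near 0 and near 1

print(strictly_inside, increasing, left_tail, right_tail)
```

Note that the strict inequalities are properties of the mathematical function: in floating-point arithmetic the computed value rounds to exactly 1 when $t$ is large enough.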

Explanation

Why is the logistic classification model specified in this manner?

Why is the logistic function used to transform the linear combination of inputs $x_{i}\beta$?

The simple answer is that we would like to do something similar to what we do in a linear regression model: use a linear combination of the inputs as our prediction of the output.

However, our prediction needs to be a probability and there is no guarantee that the linear combination $x_{i}\beta$ is between 0 and 1.

Thus, we use the logistic function because it provides a convenient way of transforming $x_{i}\beta$ and forcing it to lie in the interval between 0 and 1.

Alternatives

We could have used other functions that enjoy properties similar to the logistic function.

As a matter of fact, other popular classification models can be obtained by simply substituting the logistic function with another function and leaving everything else in the model unchanged.

For example, by replacing the logistic function with the cumulative distribution function of a standard normal distribution, we obtain the so-called probit model.
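The substitution can be illustrated with a short sketch in which the transforming function is passed as an argument, so that the rest of the computation is left unchanged. The names `prob_one`, `logistic_cdf` and `normal_cdf`, as well as the numerical values, are assumptions made for this example.

```python
import math

def logistic_cdf(t):
    """Logistic function, used by the logit model."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def normal_cdf(t):
    """Standard normal CDF (via the error function), used by the probit model."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def prob_one(x, beta, link):
    """P(y = 1 | x) obtained by applying `link` to the linear combination x beta."""
    return link(sum(xj * bj for xj, bj in zip(x, beta)))

# Hypothetical inputs and coefficients; here x beta = 1.
x_i = [1.0, 2.0]
beta = [-1.0, 1.0]

p_logit = prob_one(x_i, beta, logistic_cdf)
p_probit = prob_one(x_i, beta, normal_cdf)
print(p_logit, p_probit)
```

For the same coefficients the two models give different probabilities, which is one reason why estimated logit and probit coefficients are not directly comparable.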

The logit model as a latent variable model

Another way of thinking about the logit model is to define a latent variable (i.e., an unobserved variable)

$$z_{i}=x_{i}\beta +\varepsilon _{i}\tag{1}$$

where $\varepsilon _{i}$ is a random error term that adds noise to the relationship between the inputs $x_{i}$ and the variable $z_{i}$.

The latent variable $z_{i}$ is then assumed to determine the output $y_{i}$ as follows:

$$y_{i}=\begin{cases}1 & \text{if }z_{i}\geq 0\\ 0 & \text{if }z_{i}<0\end{cases}\tag{2}$$

From these assumptions and the additional assumption that $\varepsilon _{i}$ has a symmetric distribution around 0, it follows that

$$P(y_{i}=1\mid x_{i})=P(\varepsilon _{i}\geq -x_{i}\beta \mid x_{i})=P(\varepsilon _{i}\leq x_{i}\beta \mid x_{i})=F(x_{i}\beta)$$

where $F$ is the cumulative distribution function of the error $\varepsilon _{i}$.

It turns out that the logistic function used to define the logit model is the cumulative distribution function of a symmetric probability distribution called the standard logistic distribution.

Therefore, the logit model can be written as a latent variable model, specified by equations (1) and (2) above, in which the error $\varepsilon _{i}$ has a standard logistic distribution.

By choosing different distributions for the error $\varepsilon _{i}$, we obtain other binary classification models.

For example, if we assume that $\varepsilon _{i}$ has a standard normal distribution, then we obtain the probit model.
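The latent variable representation can be checked by simulation. The sketch below assumes that the output equals 1 when the latent variable is non-negative, draws standard logistic errors by inverting their cumulative distribution function, and compares the share of ones with the logistic function evaluated at the linear combination. The value of the linear combination, the seed and the number of draws are arbitrary illustrative choices.

```python
import math
import random

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def simulate_share_of_ones(xb, n_draws, rng):
    """Share of draws with y = 1 when z = xb + eps and eps is standard logistic."""
    ones = 0
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:                  # avoid log(0) in the inverse CDF
            u = rng.random()
        eps = math.log(u / (1.0 - u))    # inverse-CDF draw from the standard logistic
        z = xb + eps                     # latent variable
        if z >= 0:                       # assumed rule: y = 1 when z is non-negative
            ones += 1
    return ones / n_draws

rng = random.Random(0)
xb = 0.7                                 # hypothetical value of x_i beta
share = simulate_share_of_ones(xb, 200_000, rng)
print(share, logistic(xb))
```

The two printed numbers should be close; replacing the logistic draw with a standard normal draw (for instance `rng.gauss(0.0, 1.0)`) would instead reproduce the probit probability.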

Estimation by maximum likelihood

The vector of coefficients $\beta$ is often estimated by maximum likelihood methods.

Assume that the observations in the sample are IID and denote the $N\times 1$ vector of all outputs by $y$ and the $N\times K$ matrix of all inputs by $X$. The latter is assumed to have full rank.

It is possible to prove (see the lecture on Maximum likelihood estimation of the logit model) that the maximum likelihood estimator (when it exists) can be obtained by performing simple Newton-Raphson iterations as follows:

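start from a guess $\widehat{\beta}_{0}$ (e.g., $\widehat{\beta}_{0}=0$);
recursively update the guess: $\widehat{\beta}_{t}=\widehat{\beta}_{t-1}+(X^{\top }W_{t-1}X)^{-1}X^{\top }(y-\widehat{y}_{t-1})$, where $\widehat{y}_{t-1}$ is the $N\times 1$ vector whose $i$-th entry is $S(x_{i}\widehat{\beta}_{t-1})$, with $S$ the logistic function, and $W_{t-1}$ is an $N\times N$ diagonal matrix (i.e., having all off-diagonal entries equal to 0) such that the elements on its diagonal are $S(x_{i}\widehat{\beta}_{t-1})[1-S(x_{i}\widehat{\beta}_{t-1})]$ for $i=1,\ldots ,N$;
stop when numerical convergence is achieved, that is, when the difference between $\widehat{\beta}_{t}$ and $\widehat{\beta}_{t-1}$ is so small as to be negligible;
set the maximum likelihood estimator $\widehat{\beta}$ equal to the last update $\widehat{\beta}_{T}$ (denote the last iteration by $T$).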

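The asymptotic covariance matrix of the maximum likelihood estimator can be consistently estimated by

$$\widehat{V}=N(X^{\top }W_{T}X)^{-1}$$

where $W_{T}$ is the diagonal weight matrix evaluated at the last update, so that the distribution of the estimator $\widehat{\beta}$ is approximately normal with mean equal to $\beta$ and covariance matrix $\frac{1}{N}\widehat{V}=(X^{\top }W_{T}X)^{-1}$.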
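The iterations can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `fit_logit`, the tolerance, the cap on the number of iterations and the simulated data (with made-up true coefficients) are not part of the lecture. The function returns the estimate together with $(X^{\top }WX)^{-1}$ evaluated at the last iterate, which is the usual approximation to the covariance matrix of the estimator.

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations for the logit maximum likelihood estimator."""
    n, k = X.shape
    beta = np.zeros(k)                        # initial guess: all coefficients zero
    for _ in range(max_iter):
        p = logistic(X @ beta)                # predicted probabilities
        w = p * (1.0 - p)                     # diagonal entries of W
        xtwx = X.T @ (X * w[:, None])         # X' W X without building the N x N matrix
        step = np.linalg.solve(xtwx, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:        # numerical convergence
            break
    p = logistic(X @ beta)
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))   # (X' W X)^{-1} at the last iterate
    return beta, cov

# Simulated data with hypothetical true coefficients (constant plus one input).
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.random(n) < logistic(X @ beta_true)).astype(float)

beta_hat, cov_hat = fit_logit(X, y)
print(beta_hat)
print(np.sqrt(np.diag(cov_hat)))              # approximate standard errors
```

Solving the linear system with `np.linalg.solve` instead of inverting $X^{\top }WX$ at every iteration is a design choice made for numerical stability; the update is the same. The sketch does not handle the cases in which the maximum likelihood estimator does not exist (e.g., perfectly separated data).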

Hypothesis testing

If the logit model is estimated with the maximum likelihood procedure illustrated above, any one of the classical tests based on maximum likelihood procedures (e.g., Wald, Likelihood Ratio, Lagrange Multiplier) can be used to test a hypothesis about the vector of coefficients $\beta$.

Other tests can be constructed by exploiting the asymptotic normality of the maximum likelihood estimator.

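For example, we can perform a z test to test the null hypothesis

$$H_{0}:\beta _{k}=q$$

where $\beta _{k}$ is the $k$-th entry of the vector of coefficients $\beta$ and $q\in \mathbb{R}$.

The test statistic is

$$z=\frac{\widehat{\beta}_{k}-q}{\sqrt{\widehat{V}_{kk}/N}}$$

where $\widehat{\beta}_{k}$ is the $k$-th entry of $\widehat{\beta}$ and $\widehat{V}_{kk}$ is the $k$-th entry on the diagonal of the matrix $\widehat{V}$.

As the sample size $N$ increases, under the null hypothesis $z$ converges in distribution to a standard normal distribution. The latter distribution can be used to derive critical values and perform the test.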

Proof

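We have

$$z=\frac{\widehat{\beta}_{k}-q}{\sqrt{\widehat{V}_{kk}/N}}=\frac{\sqrt{N}(\widehat{\beta}_{k}-q)}{\sqrt{\widehat{V}_{kk}}}.$$

By the asymptotic normality of the maximum likelihood estimator, under the null hypothesis the numerator converges in distribution to a normal random variable with mean 0 and variance $V_{kk}$, the $k$-th diagonal entry of the asymptotic covariance matrix $V$. Furthermore, the consistency of our estimator of the asymptotic covariance matrix implies that

$$\widehat{V}_{kk}\overset{p}{\longrightarrow }V_{kk}$$

where $\overset{p}{\longrightarrow }$ denotes convergence in probability. By the Continuous Mapping theorem,

$$\sqrt{\widehat{V}_{kk}}\overset{p}{\longrightarrow }\sqrt{V_{kk}}$$

and, by Slutsky's theorem, $z$ converges in distribution to a standard normal random variable.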
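The z test described above can be sketched as follows. The function name `z_test`, the choice of a two-sided alternative and the numerical inputs are assumptions made for illustration; `var_k` stands for the estimated variance of the $k$-th coefficient estimate, that is, the $k$-th diagonal entry of the estimated covariance matrix of the estimator.

```python
import math

def z_test(beta_hat_k, var_k, q=0.0):
    """Two-sided z test of the null hypothesis beta_k = q.

    Returns the z statistic and the p-value computed from the
    standard normal distribution.
    """
    z = (beta_hat_k - q) / math.sqrt(var_k)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical estimate and variance (standard error 0.25).
z, p_value = z_test(beta_hat_k=0.5, var_k=0.0625, q=0.0)
reject_at_5_percent = p_value < 0.05
print(z, p_value, reject_at_5_percent)
```

With these made-up numbers the statistic equals 2, slightly above the two-sided 5% critical value of about 1.96, so the null hypothesis is rejected at that level.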

How to cite

Please cite as:

Taboga, Marco (2021). "Logistic classification model (logit or logistic regression)", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/logistic-classification-model.