The logistic model (or logit) is a classification model used to predict variables that can take only two values.
The logistic classification model has the following characteristics:
- the output variable $y_i$ can be equal to either 0 or 1;
- the predicted output $\widehat{y}_i$ is a number between 0 and 1;
- as in linear regression, we use a vector of estimated coefficients $\widehat{\beta}$ to compute $x_i \widehat{\beta}$, a linear combination of the input variables $x_i$;
- unlike in linear regression, we transform $x_i \widehat{\beta}$ using a nonlinear function to make sure that the predictions $\widehat{y}_i$ lie between 0 and 1.
In a logit model, the predicted output $\widehat{y}_i$ has two interpretations:
- the estimated probability that $y_i$ will be equal to 1;
- our best guess of the value of the output variable $y_i$.
A logit model is often called a logistic regression model.
However, we prefer to stick to the convention (widespread in the machine learning community) of using the term regression only for models in which the output variable is continuous.
Therefore, we use the term classification here because in a logit model the output is discrete.
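Suppose that we observe a sample of data $(y_i, x_i)$ for $i = 1, \ldots, N$.
Each observation has:
- an output variable denoted by $y_i$;
- a $1 \times K$ vector of inputs, denoted by $x_i$.
The output $y_i$ can take only two values, either 0 or 1 (it is a Bernoulli random variable).
The probability that the output $y_i$ is equal to 1, conditional on the inputs $x_i$, is assumed to be
$$P(y_i = 1 \mid x_i) = S(x_i \beta),$$
where
$$S(t) = \frac{1}{1 + \exp(-t)}$$
is the logistic function and $\beta$ is a $K \times 1$ vector of coefficients.
The probability that $y_i$ is equal to 0 is
$$P(y_i = 0 \mid x_i) = 1 - S(x_i \beta).$$
It is immediate to see that the logistic function $S(t)$ is always positive.
Furthermore, it is increasing and
$$\lim_{t \to -\infty} S(t) = 0, \qquad \lim_{t \to \infty} S(t) = 1,$$
so that it satisfies $0 < S(t) < 1$, and $S(x_i \beta)$ is a well-defined probability because it lies between 0 and 1.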
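As a minimal sketch of the model above, the following Python snippet evaluates the logistic function and the two conditional probabilities for one observation. The input vector and the coefficients are hypothetical values chosen only for illustration, and the function names are our own.

```python
import math

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t)), evaluated without overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t, use the equivalent form exp(t) / (1 + exp(t)).
    e = math.exp(t)
    return e / (1.0 + e)

def prob_one(x, beta):
    """P(y = 1 | x) = S(x beta) for a single observation."""
    return logistic(sum(xk * bk for xk, bk in zip(x, beta)))

# Hypothetical inputs and coefficients (not taken from any data set).
x_i = [1.0, 2.0, -0.5]
beta = [0.3, -0.8, 1.2]

p1 = prob_one(x_i, beta)   # probability that y_i equals 1
p0 = 1.0 - p1              # probability that y_i equals 0
print(p1, p0)
```

The two branches in `logistic` are algebraically identical; the split only avoids calling `math.exp` on a large positive argument.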
Why is the logistic classification model specified in this manner?
Why is the logistic function used to transform the linear combination of inputs $x_i \beta$?
The simple answer is that we would like to do something similar to what we do in a linear regression model: use a linear combination of the inputs as our prediction of the output.
However, our prediction needs to be a probability and there is no guarantee that the linear combination $x_i \beta$ is between 0 and 1.
Thus, we use the logistic function because it provides a convenient way of transforming $x_i \beta$ and forcing it to lie in the interval between 0 and 1.
We could have used other functions that enjoy properties similar to the logistic function.
As a matter of fact, other popular classification models can be obtained by simply substituting the logistic function with another function and leaving everything else in the model unchanged.
For example, by substituting the logistic function with the cumulative distribution function of a standard normal distribution, we obtain the so-called probit model.
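To make the substitution concrete, here is a small sketch that evaluates both transformations at a few values of the linear combination. The function names and the grid of values are our own choices; the standard normal cumulative distribution function is computed from `math.erf`.

```python
import math

def logistic_cdf(t):
    """Transformation used by the logit model."""
    return 1.0 / (1.0 + math.exp(-t))

def std_normal_cdf(t):
    """Transformation used by the probit model."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Arbitrary grid of values for the linear combination of inputs.
grid = (-3.0, -1.0, 0.0, 1.0, 3.0)
for t in grid:
    print(f"t={t:+.1f}  logit: {logistic_cdf(t):.4f}  probit: {std_normal_cdf(t):.4f}")
```

Both functions are increasing, map the real line into the interval between 0 and 1, and equal 0.5 at zero, which is why either can turn a linear combination into a probability.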
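Another way of thinking about the logit model is to define a latent variable (i.e., an unobserved variable)
$$z_i = x_i \beta + \varepsilon_i, \tag{1}$$
where $\varepsilon_i$ is a random error term that adds noise to the relationship between the inputs $x_i$ and the variable $z_i$.
The latent variable $z_i$ is then assumed to determine the output $y_i$ as follows:
$$y_i = \begin{cases} 1 & \text{if } z_i \geq 0, \\ 0 & \text{if } z_i < 0. \end{cases} \tag{2}$$
From these assumptions and the additional assumption that $\varepsilon_i$ has a symmetric distribution around 0, it follows that
$$P(y_i = 1 \mid x_i) = P(z_i \geq 0 \mid x_i) = P(\varepsilon_i \geq -x_i \beta \mid x_i) = P(\varepsilon_i \leq x_i \beta \mid x_i) = F(x_i \beta),$$
where $F$ is the cumulative distribution function of the error $\varepsilon_i$.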
It turns out that the logistic function used to define the logit model is the cumulative distribution function of a symmetric probability distribution called the standard logistic distribution.
Therefore, the logit model can be written as a latent variable model, specified by equations (1) and (2) above, in which the error $\varepsilon_i$ has a standard logistic distribution.
By choosing different distributions for the error $\varepsilon_i$, we obtain other binary classification models.
For example, if we assume that $\varepsilon_i$ has a standard normal distribution, then we obtain the probit model.
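The latent variable representation can be checked by simulation. The sketch below fixes a hypothetical value for the linear combination of inputs, draws errors from the standard logistic distribution by inverting its cumulative distribution function, and compares the frequency of outputs equal to 1 with the value of the logistic function. The seed, the sample size and the value of the linear combination are arbitrary assumptions.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def draw_standard_logistic(rng):
    """Inverse-CDF draw: if U is uniform on (0, 1), log(U / (1 - U)) is standard logistic."""
    u = rng.random()
    while u == 0.0:          # rng.random() can return 0.0, which would break the log
        u = rng.random()
    return math.log(u / (1.0 - u))

rng = random.Random(12345)   # arbitrary seed for reproducibility
index = 0.7                  # hypothetical value of the linear combination of inputs
n_draws = 100_000

n_ones = 0
for _ in range(n_draws):
    z = index + draw_standard_logistic(rng)   # latent variable
    if z >= 0:                                # the output equals 1 when z is non-negative
        n_ones += 1

empirical = n_ones / n_draws
theoretical = logistic(index)
print(empirical, theoretical)
```

Replacing `draw_standard_logistic` with draws from a standard normal distribution (for example `rng.gauss(0.0, 1.0)`) would simulate the probit model instead.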
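The vector of coefficients $\beta$ is often estimated by maximum likelihood methods.
Assume that the observations $(y_i, x_i)$ in the sample are IID and denote the $N \times 1$ vector of all outputs by $y$ and the $N \times K$ matrix of all inputs by $X$.
The latter is assumed to have full rank.
It is possible to prove (see the lecture on maximum likelihood estimation of the logit model) that the maximum likelihood estimator $\widehat{\beta}$ (when it exists) can be obtained by performing simple Newton-Raphson iterations as follows:
- start from a guess $\beta_0$ (e.g., $\beta_0 = 0$);
- recursively update the guess:
$$\beta_t = \beta_{t-1} + (X^\top W_{t-1} X)^{-1} X^\top (y - \widehat{y}_{t-1}),$$
where $\widehat{y}_{t-1}$ is the $N \times 1$ vector of predictions with entries $S(x_i \beta_{t-1})$, $S$ is the logistic function, and $W_{t-1}$ is an $N \times N$ diagonal matrix (i.e., having all off-diagonal entries equal to 0) such that the elements on its diagonal are $S(x_i \beta_{t-1})[1 - S(x_i \beta_{t-1})]$;
- stop when numerical convergence is achieved, that is, when the difference $\beta_t - \beta_{t-1}$ is so small as to be negligible;
- set the maximum likelihood estimator $\widehat{\beta}$ equal to the last update $\beta_T$ (denote the last iteration by $T$).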
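The iterations listed above can be sketched in a few lines of NumPy. This is an illustrative implementation under our own assumptions, not a reference one: the function name, the tolerance, the iteration cap and the simulated data set (including its "true" coefficients) are all hypothetical.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit_newton(X, y, tol=1e-10, max_iter=100):
    """Newton-Raphson iterations for the logit maximum likelihood estimator."""
    beta = np.zeros(X.shape[1])              # starting guess: all coefficients zero
    for _ in range(max_iter):
        p = logistic(X @ beta)               # predictions at the current guess
        w = p * (1.0 - p)                    # diagonal of the weight matrix W
        xtwx = X.T @ (X * w[:, None])        # X' W X without building the N x N matrix
        step = np.linalg.solve(xtwx, X.T @ (y - p))
        beta = beta + step                   # update the guess
        if np.max(np.abs(step)) < tol:       # stop when the update is negligible
            break
    return beta

# Simulated data with hypothetical coefficients, used only to exercise the code.
rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = (rng.random(N) < logistic(X @ beta_true)).astype(float)

beta_hat = fit_logit_newton(X, y)
print(beta_hat)
```

Solving the linear system with `np.linalg.solve` instead of inverting the matrix explicitly is a numerical design choice; the update it produces is the same. When the estimator does not exist (for example, with perfectly separated data), the loop simply stops after `max_iter` iterations without converging.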
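The asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$ can be consistently estimated by
$$\widehat{V} = N (X^\top W_T X)^{-1},$$
where $W_T$ is the diagonal matrix defined above, evaluated at the last update $\beta_T = \widehat{\beta}$, so that the distribution of the estimator $\widehat{\beta}$ is approximately normal with mean equal to $\beta$ and covariance matrix
$$\frac{1}{N} \widehat{V} = (X^\top W_T X)^{-1}.$$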
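As a rough sketch, the covariance estimate can be computed as follows. The helper name `logit_covariance` is our own, and the input matrix and the coefficient vector below are placeholders standing in for real data and for an estimate obtained by maximum likelihood.

```python
import numpy as np

def logit_covariance(X, beta_hat):
    """Return (V_hat, cov_beta): the estimated asymptotic covariance matrix
    and the approximate covariance matrix of the estimator, V_hat / N."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))   # fitted probabilities
    w = p * (1.0 - p)                           # diagonal of the weight matrix
    cov_beta = np.linalg.inv(X.T @ (X * w[:, None]))
    V_hat = X.shape[0] * cov_beta
    return V_hat, cov_beta

# Placeholder inputs: in practice beta_hat would be the maximum likelihood estimate.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
beta_hat = np.array([0.2, -1.0, 0.5])

V_hat, cov_beta = logit_covariance(X, beta_hat)
std_errors = np.sqrt(np.diag(cov_beta))         # approximate standard errors
print(std_errors)
```

The square roots of the diagonal entries of `cov_beta` are the usual standard errors reported next to the estimated coefficients.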
If the logit model is estimated with the maximum likelihood procedure illustrated above, any one of the classical tests based on maximum likelihood procedures (e.g., Wald, Likelihood Ratio, Lagrange Multiplier) can be used to test a hypothesis about the vector of coefficients $\beta$.
Other tests can be constructed by exploiting the asymptotic normality of the maximum likelihood estimator.
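For example, we can perform a z test to test the null hypothesis
$$H_0: \beta_k = q,$$
where $\beta_k$ is the $k$-th entry of the vector of coefficients $\beta$ and $q \in \mathbb{R}$.
The test statistic is
$$z = \frac{\sqrt{N}\,(\widehat{\beta}_k - q)}{\sqrt{\widehat{V}_{kk}}},$$
where $\widehat{\beta}_k$ is the $k$-th entry of $\widehat{\beta}$ and $\widehat{V}_{kk}$ is the $k$-th entry on the diagonal of the matrix $\widehat{V}$.
As the sample size $N$ increases, $z$ converges in distribution to a standard normal distribution under the null hypothesis. The latter distribution can be used to derive critical values and perform the test.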
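To see why, note that, by the asymptotic normality of the maximum likelihood estimator, under the null hypothesis the numerator $\sqrt{N}(\widehat{\beta}_k - q)$ converges in distribution to a normal random variable with mean 0 and variance $V_{kk}$, the $k$-th diagonal entry of the asymptotic covariance matrix $V$.
Furthermore, the consistency of our estimator of the asymptotic covariance matrix implies that
$$\widehat{V}_{kk} \xrightarrow{p} V_{kk},$$
where $\xrightarrow{p}$ denotes convergence in probability. By the Continuous Mapping theorem,
$$\sqrt{\widehat{V}_{kk}} \xrightarrow{p} \sqrt{V_{kk}}.$$
Thus, by Slutsky's theorem, $z$ converges in distribution to a standard normal random variable.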
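The z test described above can be sketched as follows. The function name and all numerical inputs are hypothetical; the two-sided p-value is obtained from the standard normal distribution via `math.erfc`, and 1.96 is the usual two-sided critical value at the 5% level.

```python
import math

def z_test(beta_hat_k, q, V_hat_kk, N):
    """z statistic and two-sided p-value for the null hypothesis beta_k = q.

    V_hat_kk is the k-th diagonal entry of the estimated asymptotic
    covariance matrix and N is the sample size.
    """
    z = math.sqrt(N) * (beta_hat_k - q) / math.sqrt(V_hat_kk)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimate, covariance entry and sample size.
z, p_value = z_test(beta_hat_k=0.8, q=0.0, V_hat_kk=40.0, N=1000)
reject_at_5_percent = abs(z) > 1.96
print(z, p_value, reject_at_5_percent)
```

Testing $q = 0$, as in the example, is the common check of whether the $k$-th input has any effect on the probability that the output equals 1.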
Please cite as:
Taboga, Marco (2021). "Logistic classification model (logit or logistic regression)", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.