Classification models belong to the class of conditional models, that is, probabilistic models that specify the conditional probability distributions of the output variables given the inputs. What distinguishes classification models is that their output has a discrete probability distribution (as opposed to regression models, where the output variable is continuous).
There are two different flavors of classification models:
binary classification models, where the output variable has a Bernoulli distribution conditional on the inputs;
multinomial classification models, where the output has a Multinoulli distribution conditional on the inputs.
Remember that a Bernoulli random variable can take only two values, either 1 or 0. So, a binary model is used when the output can take only two values.
The Multinoulli distribution is more general. It can be used to model outputs that can take two or more values. If the output variable can take $J$ different values, then it is represented as a $J \times 1$ Multinoulli random vector, that is, a random vector whose realizations have all entries equal to 0, except for the entry corresponding to the realized output value, which is equal to 1.
Example
If the output variable is gender (male or female), then it can be represented as a Bernoulli random variable that takes value 1 for males and 0 for females. It can also be represented as a Multinoulli random vector that takes value
$$\begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
for males and
$$\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
for females.
The previous example also shows that a binary classification model (Bernoulli distribution) can always be written as a multinomial model (Multinoulli distribution).
Example
If the output variable can belong to one of three classes (red, green or blue), then it can be represented as a Multinoulli random vector whose realizations are
$$\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \text{ (red)}, \qquad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \text{ (green)}, \qquad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \text{ (blue)}$$
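As a quick illustration, the following Python sketch (with made-up label names, mapped to vector positions in the order listed above) builds these one-hot Multinoulli realizations from a list of class labels:

```python
import numpy as np

# Hypothetical labels drawn from the three classes of the example above.
classes = ["red", "green", "blue"]
labels = ["green", "red", "blue", "green"]

# Each label becomes a vector with a single 1 in the slot of its class.
one_hot = np.zeros((len(labels), len(classes)))
for i, label in enumerate(labels):
    one_hot[i, classes.index(label)] = 1.0

print(one_hot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```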
We now introduce the main assumptions, the notation and the terminology we are going to use to present the basics of classification models.
We assume that a sample of data $(y_i, x_i)$ for $i = 1, \ldots, N$ is observed by the statistician. The output variables are denoted by $y_i$, and the associated inputs, which are $1 \times K$ vectors, are denoted by $x_i$. The output can take $J$ values $c_1, \ldots, c_J$. In the case of a binary model, $J = 2$, $c_1 = 1$ and $c_2 = 0$. In the case of a multinomial model, $J \geq 2$ and, for $j = 1, \ldots, J$, $c_j$ is a $J \times 1$ vector whose entries are all equal to zero except for the $j$-th entry, which is equal to 1.
We assume that there are $J$ functions $f_1, \ldots, f_J$ such that
$$P(y_i = c_j \mid x_i) = f_j(x_i, \theta)$$
for $j = 1, \ldots, J$ and $i = 1, \ldots, N$. The conditional probability depends not only on the observed output but also on a vector of parameters $\theta$. Probabilities need to be non-negative and sum up to 1 (see Probability and its properties). As a consequence, the functions $f_j$ must be defined in such a way that
$$f_j(x, \theta) \geq 0 \quad \text{and} \quad \sum_{j=1}^{J} f_j(x, \theta) = 1$$
for any couple $(x, \theta)$.
Example
The logistic classification model is a binary model in which the conditional probability mass function of the output $y_i$ is a non-linear function of the inputs $x_i$:
$$P(y_i = 1 \mid x_i) = S(x_i \beta)$$
where $\beta$ is a $K \times 1$ vector of coefficients and $S$ is the logistic function defined by
$$S(t) = \frac{1}{1 + \exp(-t)}$$
Thus, conditional on $x_i$, the output $y_i$ has a Bernoulli distribution with probability $S(x_i \beta)$. Using the general notation proposed above and defining $\theta = \beta$, we have:
$$f_1(x_i, \theta) = S(x_i \beta), \qquad f_2(x_i, \theta) = 1 - S(x_i \beta)$$
It can easily be checked that the probabilities sum up to 1 for any $x_i$ and any $\theta$.
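A minimal numerical sketch of these formulas, with made-up inputs and coefficients, computes $f_1$ and $f_2$ and checks that they sum to 1:

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical input row x_i (1 x K) and coefficient vector beta (K x 1).
x_i = np.array([1.0, 0.5, -2.0])    # K = 3 inputs
beta = np.array([0.3, -1.2, 0.8])

p1 = logistic(x_i @ beta)   # f_1(x_i, theta) = P(y_i = 1 | x_i)
p2 = 1.0 - p1               # f_2(x_i, theta) = P(y_i = 0 | x_i)
print(p1, p2, p1 + p2)      # the two probabilities sum to 1
```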
Example
The multinomial logistic classification model (also called softmax model) is a multinomial model in which the conditional probabilities of the outputs are defined for $j = 1, \ldots, J$ as
$$P(y_i = c_j \mid x_i) = f_j(x_i, \theta) = \frac{\exp(x_i \beta_j)}{\sum_{l=1}^{J} \exp(x_i \beta_l)}$$
where to each class $c_j$ there corresponds a $K \times 1$ vector of coefficients $\beta_j$. The vector of parameters $\theta$ is
$$\theta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_J \end{bmatrix}$$
Thus, conditional on $x_i$, the output $y_i$ has a Multinoulli distribution with probabilities $f_1(x_i, \theta), \ldots, f_J(x_i, \theta)$ given by the formula above.
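The following sketch (again with made-up inputs and coefficients) computes the Multinoulli probabilities of the softmax model and verifies that they sum to 1:

```python
import numpy as np

def softmax_probs(x_i, B):
    """f_j(x_i, theta) = exp(x_i beta_j) / sum_l exp(x_i beta_l).

    x_i: (K,) input vector; B: (K, J) matrix whose j-th column is beta_j.
    The maximum score is subtracted before exponentiating for numerical
    stability; this leaves the probabilities unchanged.
    """
    scores = x_i @ B
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Hypothetical K = 3 inputs and J = 3 classes.
x_i = np.array([1.0, 0.5, -2.0])
B = np.array([[0.3, -0.1, 0.0],
              [-1.2, 0.4, 0.0],
              [0.8, 0.2, 0.0]])

p = softmax_probs(x_i, B)
print(p, p.sum())   # probabilities for the J classes; they sum to 1
```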
The parameters of a multinomial classification model can be estimated by maximum likelihood. The likelihood of an observation $(y_i, x_i)$ can be written as
$$L(\theta; y_i, x_i) = \prod_{j=1}^{J} f_j(x_i, \theta)^{y_{ij}}$$
where $y_{ij}$ is the $j$-th component of the Multinoulli vector $y_i$. Note that $y_{ij}$ takes value 1 when the output variable belongs to the $j$-th class and 0 otherwise. As a consequence, only one term in the product (the term corresponding to the observed class) can be different from 1. The latter fact is illustrated by the following example.
Example
When there are two classes ($J = 2$) and the output variable belongs to the second class, we have that the realization of the Multinoulli random vector $y_i$ is
$$y_i = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
The two components of the vector are $y_{i1} = 0$ and $y_{i2} = 1$, and the likelihood is
$$L(\theta; y_i, x_i) = f_1(x_i, \theta)^{y_{i1}} \, f_2(x_i, \theta)^{y_{i2}} = f_1(x_i, \theta)^{0} \, f_2(x_i, \theta)^{1} = f_2(x_i, \theta)$$
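Numerically, the same selection effect can be reproduced by raising each probability to the corresponding component of $y_i$, as in this small sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical class probabilities f_j(x_i, theta) and a one-hot outcome y_i
# indicating that the second class was observed (as in the example above).
f = np.array([0.3, 0.7])
y_i = np.array([0, 1])

likelihood = np.prod(f ** y_i)   # f[0]**0 * f[1]**1 = f[1]
print(likelihood)                # 0.7
```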
Denote the vector of all outputs by $y$ and the $N \times K$ matrix of all inputs by $X$. If we assume that the observations $(y_i, x_i)$ in the sample are IID, then the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:
$$L(\theta; y, X) = \prod_{i=1}^{N} L(\theta; y_i, x_i) = \prod_{i=1}^{N} \prod_{j=1}^{J} f_j(x_i, \theta)^{y_{ij}}$$
and the log-likelihood is
$$l(\theta; y, X) = \ln L(\theta; y, X) = \sum_{i=1}^{N} \sum_{j=1}^{J} y_{ij} \ln f_j(x_i, \theta)$$
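A sketch of the sample log-likelihood for the softmax model, using the `softmax_probs` helper from the earlier example (repeated here so the snippet is self-contained) and made-up data:

```python
import numpy as np

def softmax_probs(x_i, B):
    """Conditional class probabilities of the softmax model (see above)."""
    scores = x_i @ B
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

def log_likelihood(y, X, B):
    """l(theta; y, X) = sum_i sum_j y_ij * ln f_j(x_i, theta)."""
    return sum(np.sum(y_i * np.log(softmax_probs(x_i, B)))
               for y_i, x_i in zip(y, X))

# Hypothetical N = 2 observations (one-hot outputs), K = 3 inputs, J = 3 classes.
X = np.array([[1.0, 0.5, -2.0],
              [1.0, -0.3, 0.7]])
y = np.array([[0, 1, 0],
              [1, 0, 0]])
B = np.zeros((3, 3))             # made-up coefficients (beta_j as columns)
print(log_likelihood(y, X, B))   # with all-zero coefficients: 2 * ln(1/3)
```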
The maximum likelihood estimator $\widehat{\theta}$ of the parameter $\theta$ solves
$$\widehat{\theta} = \operatorname*{arg\,max}_{\theta} \; l(\theta; y, X)$$
In general, there is no analytical solution of this maximization problem, and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for a detailed explanation of how this can be done). Often, derivative-based algorithms are used (see the aforementioned lecture for an explanation). For several classification models (e.g., the multinomial logistic model introduced in the example above), the use of derivative-based algorithms is facilitated by the fact that the gradient (i.e., the vector of derivatives) of the functions $f_j$ with respect to $\theta$ can be computed analytically, which allows us to also compute the gradient of the log-likelihood function analytically by using the chain rule:
$$\nabla_{\theta} \, l(\theta; y, X) = \sum_{i=1}^{N} \sum_{j=1}^{J} \frac{y_{ij}}{f_j(x_i, \theta)} \, \nabla_{\theta} f_j(x_i, \theta)$$
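To make the numerical approach concrete, here is a minimal sketch (not the lecture's own algorithm) that maximizes the softmax log-likelihood with a generic optimizer, `scipy.optimize.minimize`, applied to the negative log-likelihood on made-up data; supplying the analytical gradient above would typically speed this up:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up sample: N = 20 observations, K = 2 inputs, J = 3 one-hot classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.eye(3)[rng.integers(0, 3, size=20)]

def neg_log_likelihood(theta, y, X):
    """Negative log-likelihood of the softmax model.

    theta stacks beta_1, ..., beta_{J-1}; beta_J is normalized to zero,
    a standard restriction that makes the parameters identifiable.
    """
    K, J = X.shape[1], y.shape[1]
    B = np.column_stack([theta.reshape(K, J - 1), np.zeros(K)])
    scores = X @ B
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.sum(y * log_p)

theta0 = np.zeros(2 * 2)          # K * (J - 1) free parameters
result = minimize(neg_log_likelihood, theta0, args=(y, X), method="BFGS")
print(result.x.reshape(2, 2))     # estimated coefficients beta_1, beta_2 (columns)
```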
Please cite as:
Taboga, Marco (2021). "Classification models", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/classification-models.