In this lecture we show how to perform maximum likelihood estimation of a Gaussian mixture model with the Expectation-Maximization (EM) algorithm.
At the end of the lecture we discuss practically relevant aspects of the algorithm such as the initialization of parameters and the stopping criterion.
The sample $x_1, \ldots, x_n$ is made of $n$ independently and identically distributed draws from a mixture of $K$ $D$-dimensional multivariate normal distributions.

The joint probability density function of the $j$-th observation $x_j$ is
$$f(x_j; \theta) = \sum_{k=1}^{K} \pi_k \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_k)}} \exp\left( -\frac{1}{2} (x_j - \mu_k)^\top \Sigma_k^{-1} (x_j - \mu_k) \right)$$
where:
$\pi_k$ is the probability of the $k$-th component of the mixture;
the $D \times 1$ vector $\mu_k$ is the mean vector of the $k$-th component;
the $D \times D$ matrix $\Sigma_k$ is the covariance matrix of the $k$-th component.

The probabilities of the $K$ components of the mixture are non-negative and sum up to $1$:
$$\pi_k \geq 0 \text{ for } k = 1, \ldots, K, \qquad \sum_{k=1}^{K} \pi_k = 1.$$

The covariance matrices $\Sigma_1, \ldots, \Sigma_K$ are assumed to be positive definite, so that their determinants $\det(\Sigma_1), \ldots, \det(\Sigma_K)$ are strictly positive.

We will denote by $\theta$ the vector that gathers all the parameters of the model, that is,
$$\theta = \left( \pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K \right).$$
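To fix ideas, here is a minimal Python sketch that evaluates the mixture density $f(x_j; \theta)$ defined above. NumPy and SciPy are only an implementation choice, and the parameter values in the example are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Evaluate the Gaussian mixture density f(x; theta) at a single point x.

    weights: array of shape (K,)      -- the mixing probabilities pi_k
    means:   array of shape (K, D)    -- the mean vectors mu_k
    covs:    array of shape (K, D, D) -- the covariance matrices Sigma_k
    """
    return sum(
        w * multivariate_normal.pdf(x, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    )

# Example: a two-component bivariate mixture (numbers are illustrative only).
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), 2.0 * np.eye(2)])
print(mixture_pdf(np.array([1.0, 1.0]), weights, means, covs))
```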
We can write the Gaussian mixture model as a latent-variable model:
$$x_j \mid z_j = k \;\sim\; N(\mu_k, \Sigma_k), \qquad \mathrm{P}(z_j = k) = \pi_k,$$
where:
the observable variables $x_j$ are conditionally multivariate normal with mean $\mu_k$ and variance $\Sigma_k$:
$$f(x_j \mid z_j = k; \theta) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_k)}} \exp\left( -\frac{1}{2} (x_j - \mu_k)^\top \Sigma_k^{-1} (x_j - \mu_k) \right);$$
the latent variables $z_j$ have the discrete distribution
$$p(z_j = k; \theta) = \pi_k$$
for $k = 1, \ldots, K$.

In the formulae above we have explicitly written the value of the latent variable on which we are conditioning ($z_j = k$). However, in what follows we will also use the notations $f(x_j \mid z_j; \theta)$ and $p(z_j; \theta)$ if the value taken by $z_j$ is implicitly given by the context.
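The latent-variable representation also tells us how to simulate the model: first draw the label $z_j$ from the discrete distribution $(\pi_1, \ldots, \pi_K)$, then draw $x_j$ from the normal distribution of the corresponding component. A minimal sketch along these lines (the function name and the use of NumPy are our own choices, not part of the lecture):

```python
import numpy as np

def sample_mixture(n, weights, means, covs, seed=0):
    """Draw n observations (x_j, z_j) from the latent-variable model."""
    rng = np.random.default_rng(seed)
    K, D = means.shape
    # Draw the latent labels z_j with probabilities pi_1, ..., pi_K.
    z = rng.choice(K, size=n, p=weights)
    # Draw x_j | z_j = k from N(mu_k, Sigma_k).
    X = np.empty((n, D))
    for k in range(K):
        idx = np.where(z == k)[0]
        X[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.size)
    return X, z
```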
Since we are able to write the Gaussian mixture model as a latent-variable model, we can use the EM algorithm to find the maximum likelihood estimators of its parameters.
Starting from an initial guess of the parameter vector $\theta_0$, the algorithm produces a new estimate of the parameter vector $\theta_t$ at each iteration $t$.

The $t$-th iteration consists of two steps:
the Expectation step, where we compute the conditional probabilities of the latent variables using the vector $\theta_{t-1}$ from the previous iteration;
the Maximization step, where we maximize the expectation of the complete-data log-likelihood, computed with respect to the conditional probabilities found in the Expectation step. The result of the maximization is a new parameter vector $\theta_t$.
The iterations end when a stopping criterion is met (e.g., when the increases in the log-likelihood or the changes in the parameter vector become smaller than a certain threshold). See below for more details.
We denote by $x$ and $z$ the vectors obtained by stacking all the observed and latent variables $x_1, \ldots, x_n$ and $z_1, \ldots, z_n$ respectively.

Moreover, we use double subscripts for the various parameters. The first subscript denotes the mixture component $k$, and the second one the iteration number $t$. For example, at the $t$-th iteration, the parameter vector $\theta_t$ contains an estimate of the covariance matrix of the $k$-th component of the mixture, which we denote by $\Sigma_{k,t}$.
In the E step, the conditional probabilities of the components of the mixture are calculated separately for each observation:
$$p(z_j = k \mid x_j; \theta_{t-1}) = \frac{\pi_{k,t-1}\, f(x_j \mid z_j = k; \theta_{t-1})}{\sum_{l=1}^{K} \pi_{l,t-1}\, f(x_j \mid z_j = l; \theta_{t-1})}$$
where
$$f(x_j \mid z_j = k; \theta_{t-1}) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_{k,t-1})}} \exp\left( -\frac{1}{2} (x_j - \mu_{k,t-1})^\top \Sigma_{k,t-1}^{-1} (x_j - \mu_{k,t-1}) \right).$$

The general formula for the calculation of conditional probabilities in the E step is
$$p(z \mid x; \theta_{t-1}) = \frac{f(x \mid z; \theta_{t-1})\, p(z; \theta_{t-1})}{\sum_{z' \in Z} f(x \mid z'; \theta_{t-1})\, p(z'; \theta_{t-1})}$$
where $Z$ is the set of all possible values that $z$ can take. In our case, $Z = \{1, \ldots, K\}^n$.

Since the observations are independent, we have
$$f(x \mid z; \theta_{t-1}) = \prod_{j=1}^{n} f(x_j \mid z_j; \theta_{t-1})$$
and
$$p(z; \theta_{t-1}) = \prod_{j=1}^{n} p(z_j; \theta_{t-1}).$$
Therefore,
$$p(z \mid x; \theta_{t-1}) = \prod_{j=1}^{n} p(z_j \mid x_j; \theta_{t-1})$$
where
$$p(z_j = k \mid x_j; \theta_{t-1}) = \frac{\pi_{k,t-1}\, f(x_j \mid z_j = k; \theta_{t-1})}{\sum_{l=1}^{K} \pi_{l,t-1}\, f(x_j \mid z_j = l; \theta_{t-1})}.$$
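As an illustration, the E-step formula for the conditional probabilities $p(z_j = k \mid x_j; \theta_{t-1})$ can be translated into a short Python sketch that computes them for all observations at once; the array conventions follow the earlier sketches and are an assumption, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Return the (n, K) matrix of conditional probabilities p(z_j = k | x_j; theta)."""
    n = X.shape[0]
    K = weights.shape[0]
    resp = np.empty((n, K))
    for k in range(K):
        # Numerator: pi_k * f(x_j | z_j = k; theta) for every observation j.
        resp[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    # Denominator: sum over the K components, so that each row sums to one.
    resp /= resp.sum(axis=1, keepdims=True)
    return resp
```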
We then use the conditional probabilities to compute the expected value of the complete log-likelihood:
$$Q(\theta; \theta_{t-1}) = \mathrm{E}_{z \mid x; \theta_{t-1}}\left[ \ln f(x, z; \theta) \right].$$

The expected value can be computed as follows:
$$\begin{aligned}
Q(\theta; \theta_{t-1}) &= \sum_{z \in Z} p(z \mid x; \theta_{t-1}) \ln f(x, z; \theta) && \text{(A)} \\
&= \sum_{z \in Z} p(z \mid x; \theta_{t-1}) \sum_{j=1}^{n} \ln f(x_j, z_j; \theta) && \text{(B)} \\
&= \sum_{j=1}^{n} \sum_{k=1}^{K} p(z_j = k \mid x_j; \theta_{t-1}) \ln f(x_j, z_j = k; \theta) && \text{(C)}
\end{aligned}$$
where: in step (A) we have used the standard E-step formula for computing the expectation of the complete log-likelihood (as above, $Z$ is the set of all possible values that the vector of unobservable variables $z$ can take); in step (B) we have exploited the independence of the observations; in step (C) we have used the fact that
$$\sum_{z \in Z} p(z \mid x; \theta_{t-1}) \ln f(x_j, z_j; \theta)$$
is the expected value of $\ln f(x_j, z_j; \theta)$ under the joint distribution of $z$ given $x$ and $\theta_{t-1}$; therefore, we can use the marginal distribution of $z_j$ given $x_j$ and $\theta_{t-1}$ to compute the expected value of each term in the sum.

Moreover, we can write
$$\ln f(x_j, z_j = k; \theta) = \ln \pi_k - \frac{D}{2} \ln(2\pi) - \frac{1}{2} \ln \det(\Sigma_k) - \frac{1}{2} (x_j - \mu_k)^\top \Sigma_k^{-1} (x_j - \mu_k).$$
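For completeness, here is a sketch of how the expected complete log-likelihood $Q(\theta; \theta_{t-1})$ could be evaluated from the conditional probabilities computed in the E step; NumPy/SciPy and the function name are our own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_loglik(X, resp, weights, means, covs):
    """Evaluate Q(theta; theta_{t-1}) given responsibilities computed with theta_{t-1}."""
    K = weights.shape[0]
    Q = 0.0
    for k in range(K):
        # ln f(x_j, z_j = k; theta) = ln pi_k + ln f(x_j | z_j = k; theta)
        log_joint = np.log(weights[k]) + multivariate_normal.logpdf(X, mean=means[k], cov=covs[k])
        Q += np.sum(resp[:, k] * log_joint)
    return Q
```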
In the M step, we solve the maximization problem
$$\theta_t = \arg\max_{\theta} Q(\theta; \theta_{t-1}).$$
The solution is
$$\pi_{k,t} = \frac{1}{n} \sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}),$$
$$\mu_{k,t} = \frac{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) \, x_j}{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1})},$$
$$\Sigma_{k,t} = \frac{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) \, (x_j - \mu_{k,t})(x_j - \mu_{k,t})^\top}{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1})},$$
for $k = 1, \ldots, K$.

In order to solve the problem, we need to equate to zero the gradients of $Q(\theta; \theta_{t-1})$ with respect to the various components of the vector $\theta$.

We start with the probabilities of the components of the mixture, which need to satisfy the constraint
$$\sum_{k=1}^{K} \pi_k = 1.$$
We take care of the constraint by performing the substitution
$$\pi_K = 1 - \sum_{k=1}^{K-1} \pi_k.$$
Therefore,
$$\frac{\partial \pi_K}{\partial \pi_k} = -1$$
for $k = 1, \ldots, K-1$. We can write
$$\frac{\partial Q(\theta; \theta_{t-1})}{\partial \pi_k} = \frac{1}{\pi_k} \sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) - \frac{1}{\pi_K} \sum_{j=1}^{n} p(z_j = K \mid x_j; \theta_{t-1}) = 0, \qquad k = 1, \ldots, K-1,$$
which is solved by
$$\pi_{k,t} = \frac{1}{n} \sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}), \qquad k = 1, \ldots, K.$$

The first-order conditions for the mean vectors are
$$\frac{\partial Q(\theta; \theta_{t-1})}{\partial \mu_k} = \sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) \, \Sigma_k^{-1} (x_j - \mu_k) = 0,$$
which are solved by
$$\mu_{k,t} = \frac{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) \, x_j}{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1})}.$$

The first-order conditions for the inverse covariance matrices (see this lecture for more details) are
$$\frac{\partial Q(\theta; \theta_{t-1})}{\partial \Sigma_k^{-1}} = \sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} (x_j - \mu_k)(x_j - \mu_k)^\top \right] = 0,$$
which are solved by
$$\Sigma_{k,t} = \frac{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1}) \, (x_j - \mu_{k,t})(x_j - \mu_{k,t})^\top}{\sum_{j=1}^{n} p(z_j = k \mid x_j; \theta_{t-1})}.$$
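The closed-form solutions of the M step map directly into code. Here is a sketch under the same array conventions as the previous examples.

```python
import numpy as np

def m_step(X, resp):
    """Return updated (weights, means, covs) from the responsibilities of the E step."""
    n, D = X.shape
    K = resp.shape[1]
    nk = resp.sum(axis=0)                      # sum_j p(z_j = k | x_j), shape (K,)
    weights = nk / n                           # pi_{k,t}
    means = (resp.T @ X) / nk[:, None]         # mu_{k,t}
    covs = np.empty((K, D, D))
    for k in range(K):
        diff = X - means[k]                    # x_j - mu_{k,t}
        covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k]   # Sigma_{k,t}
    return weights, means, covs
```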
As we explained in the lecture on the EM algorithm, while the likelihood is guaranteed to increase at each iteration, there is no guarantee that the algorithm converges to a global maximum of the likelihood.
For this reason, we often use the multiple-starts approach:
we run the EM algorithm several times with different random initializations of the parameter vector $\theta_0$;
our final estimate of the parameter vector is the one that achieves the highest value of the incomplete log-likelihood (a code sketch of this selection rule follows)
$$\ln L(\theta; x) = \sum_{j=1}^{n} \ln \left( \sum_{k=1}^{K} \pi_k f(x_j \mid z_j = k; \theta) \right).$$
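As a sketch of the multiple-starts approach, the incomplete log-likelihood can be computed as below and used to compare runs. Here `run_em` is a hypothetical function (not defined in this lecture) that performs one complete run of the EM algorithm from a random initialization and returns the estimated parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def incomplete_loglik(X, weights, means, covs):
    """ln L(theta; x) = sum_j ln( sum_k pi_k f(x_j | z_j = k; theta) )."""
    dens = np.zeros(X.shape[0])
    for w, m, S in zip(weights, means, covs):
        dens += w * multivariate_normal.pdf(X, mean=m, cov=S)
    return np.sum(np.log(dens))

def best_of_multiple_starts(X, K, n_runs, run_em):
    """Keep the run that achieves the highest incomplete log-likelihood."""
    best, best_ll = None, -np.inf
    for r in range(n_runs):
        params = run_em(X, K, seed=r)    # hypothetical: one EM run from a random start
        ll = incomplete_loglik(X, *params)
        if ll > best_ll:
            best, best_ll = params, ll
    return best, best_ll
```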
Several papers discuss how to choose the starting value $\theta_0$ (see the references below).

According to our experience, when we use the multiple-starts approach, the best way to draw random initializations of $\theta_0$ is as follows.

At the beginning of each run of the EM algorithm, we repeat the following steps for $k = 1, \ldots, K$ (a code sketch follows the list):
we draw a small random sub-sample from our sample of observations (the numerosity of the sub-sample could be, e.g., a small multiple of the dimension $D$);
we set the starting values $\mu_{k,0}$ and $\Sigma_{k,0}$ equal to the sample mean and the sample covariance matrix of the observations in the sub-sample;
we set $\pi_{k,0} = 1/K$.
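A possible implementation of this initialization scheme is sketched below; the sub-sample size is left as an argument because the lecture does not prescribe a specific value.

```python
import numpy as np

def random_init(X, K, subsample_size, seed=0):
    """Draw starting values (weights_0, means_0, covs_0) from random sub-samples."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    means = np.empty((K, D))
    covs = np.empty((K, D, D))
    for k in range(K):
        # A small random sub-sample of the observations.
        idx = rng.choice(n, size=subsample_size, replace=False)
        sub = X[idx]
        # Starting mean and covariance: sample moments on the sub-sample.
        means[k] = sub.mean(axis=0)
        covs[k] = np.cov(sub, rowvar=False)
    # Equal starting probabilities for the K components.
    weights = np.full(K, 1.0 / K)
    return weights, means, covs
```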
There is a vast literature on the stopping criteria used to decide when to end the EM iterations (see below).
However, in our experience, the most robust method is to stop the iterations when
$$\max_{j,k} \left| p(z_j = k \mid x_j; \theta_t) - p(z_j = k \mid x_j; \theta_{t-1}) \right| < \varepsilon$$
where $\varepsilon$ is a pre-specified threshold (a small value generally works very well).

In other words, we stop the algorithm when none of the conditional probabilities computed in the E step changes significantly with respect to the previous iteration (note the iteration subscripts $t$ and $t-1$).
If you check the formulae in the M step, you will realize that the parameter estimates are a function of the observations $x_1, \ldots, x_n$ and of the conditional probabilities $p(z_j = k \mid x_j; \theta_{t-1})$.

Therefore, insignificant changes in the latter imply that the parameter estimates are also stable.
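In code, the criterion amounts to comparing the responsibility matrices produced by two consecutive E steps. The sketch below assumes that the `e_step` and `m_step` functions sketched earlier are in scope; the default values of `eps` and `max_iter` are illustrative assumptions, not values prescribed by the lecture.

```python
import numpy as np

def has_converged(resp_new, resp_old, eps=1e-4):
    """True when no conditional probability changed by more than eps since the last iteration."""
    return np.max(np.abs(resp_new - resp_old)) < eps

def em_iterations(X, weights, means, covs, max_iter=1000, eps=1e-4):
    """Alternate E and M steps until the stopping criterion is met."""
    resp_old = e_step(X, weights, means, covs)          # E step with the starting values
    for _ in range(max_iter):
        weights, means, covs = m_step(X, resp_old)      # M step
        resp_new = e_step(X, weights, means, covs)      # E step with the updated parameters
        if has_converged(resp_new, resp_old, eps):
            break
        resp_old = resp_new
    return weights, means, covs
```

Combined with the `random_init` sketch above, a function of this kind could play the role of the hypothetical `run_em` used in the multiple-starts example.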
The EM algorithm can sometimes converge to degenerate solutions in which the covariance matrix of one of the components of the mixture is singular and the log-likelihood is infinite (most likely resulting in a NaN on computers).
In our experience, imposing constraints in the M step to avoid such singularities can seriously harm the convergence properties of the EM algorithm.
Our advice is to simply terminate any run of the algorithm that gives rise to singularities or NaNs and proceed to the next run (with a new random initialization, according to the multiple-starts approach described above).
If we find that we often run into singularities, we simply increase the number of runs of the algorithm.
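A simple way to implement this advice is to check, after each iteration, that the current estimates are finite and that no covariance matrix is numerically singular, and to abort the run otherwise; the tolerance below is an illustrative assumption.

```python
import numpy as np

def run_is_degenerate(weights, means, covs, min_det=1e-12):
    """Return True if the current parameter estimates signal a degenerate solution."""
    if not (np.all(np.isfinite(weights)) and np.all(np.isfinite(means)) and np.all(np.isfinite(covs))):
        return True  # NaNs or infinities: the log-likelihood has blown up
    # Nearly singular covariance matrices also indicate a degenerate component.
    return any(np.linalg.det(S) < min_det for S in covs)
```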
We have provided our own view about the best initialization method and stopping criterion. However, you are invited to consult the references below about these two topics.
Biernacki, C., Celeux, G. and Govaert, G., 2003. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4), pp.561-575.
Blömer, J. and Bujna, K., 2013. Simple methods for initializing the EM algorithm for Gaussian mixture models. CoRR.
Kwedlo, W., 2013. A new method for random initialization of the EM algorithm for multivariate Gaussian mixture learning. In Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013 (pp. 81-90). Springer, Heidelberg.
McKenzie, P. and Alder, M., 1994. Initializing the EM algorithm for use in Gaussian mixture modelling. Pattern recognition in practice IV, pp.91-105.
Paclík, P. and Novovičová, J., 2001. Number of components and initialization in Gaussian mixture model for pattern recognition. In Artificial Neural Nets and Genetic Algorithms (pp. 406-409). Springer, Vienna.
Shireman, E., Steinley, D. and Brusco, M.J., 2017. Examining the effect of initialization strategies on the performance of Gaussian mixture modeling. Behavior research methods, 49(1), pp.282-293.
Abbi, R., El-Darzi, E., Vasilakis, C. and Millard, P., 2008, September. Analysis of stopping criteria for the EM algorithm in the context of patient grouping according to length of stay. In 2008 4th International IEEE Conference Intelligent Systems (Vol. 1, pp. 3-9). IEEE.
Kontaxakis, G. and Tzanakos, G., 1992, October. Study of the convergence properties of the EM algorithm-a new stopping rule. In IEEE Conference on Nuclear Science Symposium and Medical Imaging (pp. 1163-1165). IEEE.
Kontaxakis, G. and Tzanakos, G., 1993, March. Further study of a stopping rule for the EM algorithm. In 1993 IEEE Annual Northeast Bioengineering Conference (pp. 52-53). IEEE.
Please cite as:
Taboga, Marco (2021). "Gaussian mixture - Maximum likelihood estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/Gaussian-mixture-maximum-likelihood.