
Gaussian mixture - Maximum likelihood estimation

by Marco Taboga, PhD

In this lecture we show how to perform maximum likelihood estimation of a Gaussian mixture model with the Expectation-Maximization (EM) algorithm.

At the end of the lecture we discuss practically relevant aspects of the algorithm such as the initialization of parameters and the stopping criterion.

The sample

The sample $(x_1, x_2, \ldots, x_N)$ is made of $N$ independently and identically distributed draws from a mixture of $K$-dimensional multivariate normal distributions.

The mixture

The joint probability density function of the $n$-th observation is
$$f(x_n; \theta) = \sum_{d=1}^{D} p_d \, \frac{1}{(2\pi)^{K/2} \det(V_d)^{1/2}} \exp\left( -\frac{1}{2} (x_n - \mu_d)^{\top} V_d^{-1} (x_n - \mu_d) \right)$$
where:

- $D$ is the number of components of the mixture;

- $p_1, \ldots, p_D$ are the probabilities of the components;

- $\mu_1, \ldots, \mu_D$ are the $K \times 1$ mean vectors of the components;

- $V_1, \ldots, V_D$ are the $K \times K$ covariance matrices of the components.

The probabilities of the $D$ components of the mixture are non-negative and sum up to 1:
$$p_d \geq 0 \text{ for } d = 1, \ldots, D, \qquad \sum_{d=1}^{D} p_d = 1$$

The covariance matrices $V_d$ are assumed to be positive definite, so that their determinants $\det(V_d)$ are strictly positive.

The vector of parameters

We will denote by $\theta$ the vector that gathers all the parameters of the model, that is,
$$\theta = (p_1, \ldots, p_D, \mu_1, \ldots, \mu_D, V_1, \ldots, V_D)$$
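As a quick illustration (not part of the original lecture), the mixture density above can be evaluated numerically once $\theta$ is stored as a tuple of NumPy arrays. The sketch below assumes SciPy is available and uses the names p, mu, V for the parameters, mirroring the notation above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x_n, theta):
    """Evaluate the mixture density f(x_n; theta) defined above at a single point x_n.

    theta = (p, mu, V): p is (D,), mu is (D, K), V is (D, K, K).
    """
    p, mu, V = theta
    # sum over the D components of p_d times the multivariate normal density
    return sum(p[d] * multivariate_normal.pdf(x_n, mean=mu[d], cov=V[d])
               for d in range(len(p)))
```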

The mixture as a latent variable model

We can write the Gaussian mixture model as a latent-variable model:
$$f(x_n \mid z_n = d; \theta) = \frac{1}{(2\pi)^{K/2} \det(V_d)^{1/2}} \exp\left( -\frac{1}{2} (x_n - \mu_d)^{\top} V_d^{-1} (x_n - \mu_d) \right)$$
$$P(z_n = d; \theta) = p_d$$
where:

- $z_n$ is an unobservable (latent) random variable that indicates which component of the mixture generated the $n$-th observation;

- $z_n$ can take the values $1, \ldots, D$.

In the formulae above we have explicitly written the value of the latent variable on which we are conditioning ($z_{n}=d$). However, in what follows we will also use the notations $f(x_n \mid z_n; \theta)$ and $P(z_n; \theta)$ if the value taken by $z_{n}$ is implicitly given by the context.
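The latent-variable representation also suggests a simple way to simulate the model: draw the label $z_n$ first and then the observation from the corresponding normal component. Here is a minimal Python sketch, assuming NumPy (the function name and the seed argument are illustrative, not from the lecture).

```python
import numpy as np

def sample_mixture(N, p, mu, V, seed=0):
    """Generate N draws from the latent-variable representation above:
    first draw z_n with P(z_n = d) = p[d], then x_n | z_n = d ~ N(mu[d], V[d])."""
    rng = np.random.default_rng(seed)
    D = len(p)
    z = rng.choice(D, size=N, p=p)                                   # latent component labels
    x = np.stack([rng.multivariate_normal(mu[d], V[d]) for d in z])  # observed vectors
    return x, z
```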

The EM algorithm

Since we are able to write the Gaussian mixture model as a latent-variable model, we can use the EM algorithm to find the maximum likelihood estimators of its parameters.

Starting from an initial guess of the parameter vector $\theta_0$, the algorithm produces a new estimate of the parameter vector $\theta_j$ at each iteration $j$.

The $j$-th iteration consists of two steps:

  1. the Expectation step, where we compute the conditional probabilities of the latent variables using the vector $\theta_{j-1}$ from the previous iteration;

  2. the Maximization step, where we maximize the expectation of the complete-data log-likelihood, computed with respect to the conditional probabilities found in the Expectation step. The result of the maximization is a new parameter vector $\theta_j$.

The iterations end when a stopping criterion is met (e.g., when the increases in the log-likelihood or the changes in the parameter vector become smaller than a certain threshold). See below for more details.
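Schematically, the iterations can be organized as in the following Python sketch (an illustration, not the lecture's own code); e_step, m_step and has_converged are supplied by the caller and are made concrete in the sections below.

```python
def em(x, theta0, e_step, m_step, has_converged, max_iter=1000):
    """Generic EM loop: alternate the two steps until the stopping rule fires.

    e_step, m_step and has_converged are caller-supplied functions; concrete
    versions for the Gaussian mixture are sketched in the following sections."""
    theta = theta0
    prev = None
    for _ in range(max_iter):
        gamma = e_step(x, theta)      # E step: conditional probabilities of the z_n
        theta = m_step(x, gamma)      # M step: maximize the expected complete log-likelihood
        if prev is not None and has_converged(gamma, prev):
            break                     # stopping criterion met
        prev = gamma
    return theta
```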

Some notation

We denote by $x$ and $z$ the vectors obtained by stacking all the observed and latent variables $x_n$ and $z_n$, respectively.

Moreover, we use double subscripts for the various parameters. The first subscript denotes the mixture component $d$, and the second one the iteration number $j$.

For example, at the $j$-th iteration, the parameter vector $\theta_j$ contains an estimate of the covariance matrix of the $d$-th component of the mixture, which we denote by $V_{d,j}$.

The E step

In the E step, the conditional probabilities of the components of the mixture are calculated separately for each observation:
$$P(z_n = d \mid x_n; \theta_{j-1}) = \frac{p_{d,j-1} \, f(x_n \mid z_n = d; \theta_{j-1})}{\sum_{c=1}^{D} p_{c,j-1} \, f(x_n \mid z_n = c; \theta_{j-1})}$$
where
$$f(x_n \mid z_n = d; \theta_{j-1}) = \frac{1}{(2\pi)^{K/2} \det(V_{d,j-1})^{1/2}} \exp\left( -\frac{1}{2} (x_n - \mu_{d,j-1})^{\top} V_{d,j-1}^{-1} (x_n - \mu_{d,j-1}) \right)$$

Proof

The general formula for the calculation of conditional probabilities in the E step is
$$P(z \mid x; \theta_{j-1}) = \frac{f(x \mid z; \theta_{j-1}) \, P(z; \theta_{j-1})}{\sum_{z' \in R_Z} f(x \mid z'; \theta_{j-1}) \, P(z'; \theta_{j-1})}$$
where $R_Z$ is the set of all possible values that $z$ can take. In our case, $R_Z = \{1, \ldots, D\}^N$. Since the observations are independent, we have
$$f(x \mid z; \theta_{j-1}) = \prod_{n=1}^{N} f(x_n \mid z_n; \theta_{j-1})$$
and
$$P(z; \theta_{j-1}) = \prod_{n=1}^{N} P(z_n; \theta_{j-1})$$
Therefore, the joint conditional probability factorizes into the product of the marginal ones,
$$P(z \mid x; \theta_{j-1}) = \prod_{n=1}^{N} P(z_n \mid x_n; \theta_{j-1})$$
where
$$P(z_n = d \mid x_n; \theta_{j-1}) = \frac{p_{d,j-1} \, f(x_n \mid z_n = d; \theta_{j-1})}{\sum_{c=1}^{D} p_{c,j-1} \, f(x_n \mid z_n = c; \theta_{j-1})}$$

We then use the conditional probabilities to compute the expected value of the complete log-likelihood:
$$Q(\theta; \theta_{j-1}) = \sum_{n=1}^{N} \sum_{d=1}^{D} P(z_n = d \mid x_n; \theta_{j-1}) \left[ \ln p_d + \ln f(x_n \mid z_n = d; \theta) \right]$$

Proof

The expected value can be computed as follows:
$$\begin{aligned}
Q(\theta; \theta_{j-1}) &\overset{(A)}{=} \sum_{z \in R_Z} P(z \mid x; \theta_{j-1}) \, \ln f(x, z; \theta) \\
&\overset{(B)}{=} \sum_{z \in R_Z} P(z \mid x; \theta_{j-1}) \sum_{n=1}^{N} \ln f(x_n, z_n; \theta) \\
&\overset{(C)}{=} \sum_{n=1}^{N} \sum_{z \in R_Z} P(z \mid x; \theta_{j-1}) \, \ln f(x_n, z_n; \theta) \\
&\overset{(D)}{=} \sum_{n=1}^{N} \sum_{d=1}^{D} P(z_n = d \mid x_n; \theta_{j-1}) \, \ln f(x_n, z_n = d; \theta)
\end{aligned}$$
where: in step (A) we have used the standard E-step formula for computing the expectation of the complete log-likelihood (as above, $R_Z$ is the set of all possible values that the vector of unobservable variables $z$ can take); in step (B) we have exploited the independence of the observations; in step (D) we have used the fact that
$$\sum_{z \in R_Z} P(z \mid x; \theta_{j-1}) \, \ln f(x_n, z_n; \theta)$$
is the expected value of $\ln f(x_n, z_n; \theta)$ under the joint distribution of $z$ given $x$ and $\theta_{j-1}$; therefore, we can use the marginal distribution of $z_n$ given $x_n$ and $\theta_{j-1}$ to compute the expected value of each term in the sum. Moreover, we can write
$$\ln f(x_n, z_n = d; \theta) = \ln P(z_n = d; \theta) + \ln f(x_n \mid z_n = d; \theta) = \ln p_d + \ln f(x_n \mid z_n = d; \theta)$$
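To make the E step concrete, here is a minimal Python sketch (not part of the original lecture) that computes the matrix of conditional probabilities for all observations at once. It assumes NumPy and SciPy, and represents the current parameter vector ($\theta_{j-1}$ in the notation above) as a tuple (p, mu, V).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, theta):
    """E step: compute gamma[n, d] = P(z_n = d | x_n; theta).

    x     : (N, K) array of observations
    theta : tuple (p, mu, V) with p (D,), mu (D, K), V (D, K, K)
    Returns gamma : (N, D) array whose rows sum to 1.
    """
    p, mu, V = theta
    D = len(p)
    gamma = np.empty((x.shape[0], D))
    for d in range(D):
        # numerator of the formula above: p_d times the normal density of each x_n
        gamma[:, d] = p[d] * multivariate_normal.pdf(x, mean=mu[d], cov=V[d])
    gamma /= gamma.sum(axis=1, keepdims=True)   # divide by the sum over components
    return gamma
```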

The M step

In the M step, we solve the maximization problem
$$\theta_j = \operatorname*{arg\,max}_{\theta} \, Q(\theta; \theta_{j-1})$$

The solution is
$$p_{d,j} = \frac{1}{N} \sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1})$$
$$\mu_{d,j} = \frac{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) \, x_n}{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1})}$$
$$V_{d,j} = \frac{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) \, (x_n - \mu_{d,j})(x_n - \mu_{d,j})^{\top}}{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1})}$$
for $d = 1, \ldots, D$.

Proof

In order to solve the problem, we need to equate to zero the gradients of $Q(\theta; \theta_{j-1})$ with respect to the various components of the vector $\theta$. We start with the probabilities of the components of the mixture, which need to satisfy the constraint
$$\sum_{d=1}^{D} p_d = 1$$
We take care of the constraint by performing the substitution
$$p_D = 1 - \sum_{d=1}^{D-1} p_d$$
Therefore,
$$\frac{\partial Q(\theta; \theta_{j-1})}{\partial p_d} = \sum_{n=1}^{N} \left[ \frac{P(z_n = d \mid x_n; \theta_{j-1})}{p_d} - \frac{P(z_n = D \mid x_n; \theta_{j-1})}{p_D} \right] = 0$$
for $d = 1, \ldots, D-1$. We can write
$$p_D \sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) = p_d \sum_{n=1}^{N} P(z_n = D \mid x_n; \theta_{j-1})$$
which, after summing over $d = 1, \ldots, D$ and using the facts that the probabilities $p_d$ sum to 1 and that the conditional probabilities sum to 1 over $d$ for each $n$, is solved by
$$p_{d,j} = \frac{1}{N} \sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1})$$
The first-order conditions for the mean vectors are
$$\frac{\partial Q(\theta; \theta_{j-1})}{\partial \mu_d} = \sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) \, V_d^{-1} (x_n - \mu_d) = 0$$
which are solved by
$$\mu_{d,j} = \frac{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) \, x_n}{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1})}$$
The first-order conditions for the inverse covariance matrices (see this lecture for more details) are
$$\frac{\partial Q(\theta; \theta_{j-1})}{\partial V_d^{-1}} = \frac{1}{2} \sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) \left[ V_d - (x_n - \mu_{d,j})(x_n - \mu_{d,j})^{\top} \right] = 0$$
which are solved by
$$V_{d,j} = \frac{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1}) \, (x_n - \mu_{d,j})(x_n - \mu_{d,j})^{\top}}{\sum_{n=1}^{N} P(z_n = d \mid x_n; \theta_{j-1})}$$
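A corresponding Python sketch of the M step (again an illustration, not the lecture's own code) computes the closed-form updates above from the matrix of conditional probabilities returned by the e_step sketch.

```python
import numpy as np

def m_step(x, gamma):
    """M step: update p, mu, V using the closed-form solutions above.

    x     : (N, K) observations
    gamma : (N, D) conditional probabilities from the E step
    Returns the new parameter tuple (p, mu, V).
    """
    N, K = x.shape
    D = gamma.shape[1]
    Nd = gamma.sum(axis=0)                 # denominator: sum of probabilities per component
    p = Nd / N                             # updated mixing probabilities
    mu = (gamma.T @ x) / Nd[:, None]       # weighted means
    V = np.empty((D, K, K))
    for d in range(D):
        xc = x - mu[d]                     # center the observations at the new mean
        V[d] = (gamma[:, d, None] * xc).T @ xc / Nd[d]   # weighted covariance matrix
    return p, mu, V
```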

Convergence and multiple starts

As we explained in the lecture on the EM algorithm, while the likelihood is guaranteed to increase at each iteration, there is no guarantee that the algorithm converges to a global maximum of the likelihood.

For this reason, we often use the multiple-starts approach:

  1. we run the EM algorithm several times, starting each run from a different, randomly drawn initial guess $\theta_0$;

  2. among the completed runs, we keep the one that attains the highest value of the log-likelihood.

Initialization

Several papers discuss how to choose the starting value $\theta_0$ (see the references below).

In our experience, when we use the multiple-starts approach, the best way to draw random initializations of $\theta_0$ is as follows.

At the beginning of each run of the EM algorithm, we repeat the following steps for $d = 1, \ldots, D$:

Stopping criterion

There is a vast literature on the stopping criteria used to decide when to end the EM iterations (see below).

However, in our experience, the most robust method is to stop the iterations when
$$\max_{n,d} \left\vert P(z_n = d \mid x_n; \theta_{j-1}) - P(z_n = d \mid x_n; \theta_{j-2}) \right\vert < \varepsilon$$
where $\varepsilon$ is a pre-specified threshold ($\varepsilon = 0.001$ generally works very well).

In other words, we stop the algorithm when none of the conditional probabilities computed in the E step changes significantly with respect to the previous iteration (note the iteration subscripts $j-1$ and $j-2$).

If you check the formulae in the M step, you will realize that the parameter estimates are a function of the observations $x_n$ and of the conditional probabilities $P(z_n = d \mid x_n; \theta_{j-1})$. Therefore, insignificant changes in the latter imply that the parameter estimates are also stable.
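As an illustration, this stopping rule can be implemented by comparing the matrices of conditional probabilities produced by two consecutive E steps; the sketch below assumes they are stored as NumPy arrays named gamma and gamma_prev (illustrative names).

```python
import numpy as np

def has_converged(gamma, gamma_prev, eps=1e-3):
    """Stop when no conditional probability changes by more than eps
    between two consecutive E steps (the criterion described above)."""
    return np.max(np.abs(gamma - gamma_prev)) < eps
```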

Degenerate solutions and numerical problems

The EM algorithm can sometimes converge to degenerate solutions in which the covariance matrix of one of the components of the mixture is singular and the log-likelihood diverges to infinity (which typically shows up as a NaN in computer implementations).

In our experience, imposing constraints in the M step to avoid such singularities can seriously harm the convergence properties of the EM algorithm.

Our advice is to simply terminate any run of the algorithm that gives rise to singularities or NaNs and proceed to the next run (with a new random initialization, according to the multiple-starts approach described above).

If we find that we often run into singularities, we simply increase the number of runs of the algorithm.
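To tie the practical advice together, here is a hedged Python sketch (not taken from the lecture) of a multiple-starts wrapper that discards runs producing NaNs or singular covariance matrices and keeps the run with the highest log-likelihood. It reuses the e_step, m_step and has_converged sketches above; random_theta0 is a hypothetical user-supplied function returning a random initial (p, mu, V) tuple.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, theta):
    """Observed-data log-likelihood of the Gaussian mixture (used to rank the runs)."""
    p, mu, V = theta
    dens = sum(p[d] * multivariate_normal.pdf(x, mean=mu[d], cov=V[d])
               for d in range(len(p)))
    return float(np.sum(np.log(dens)))

def best_of_runs(x, random_theta0, n_runs=20, max_iter=1000):
    """Run EM n_runs times from random initial values, skip degenerate runs,
    and return the parameters of the run with the highest log-likelihood."""
    best_theta, best_ll = None, -np.inf
    for _ in range(n_runs):
        theta, prev = random_theta0(), None
        try:
            for _ in range(max_iter):
                gamma = e_step(x, theta)
                if not np.all(np.isfinite(gamma)):
                    raise FloatingPointError("NaN in the E step: degenerate run")
                theta = m_step(x, gamma)
                if prev is not None and has_converged(gamma, prev):
                    break
                prev = gamma
            ll = log_likelihood(x, theta)
            if np.isfinite(ll) and ll > best_ll:
                best_theta, best_ll = theta, ll
        except (np.linalg.LinAlgError, ValueError, FloatingPointError):
            continue   # singular covariance or NaN: terminate this run, move to the next
    return best_theta, best_ll
```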

References

We have provided our own view about the best initialization method and stopping criterion. However, you are invited to consult the references below about these two topics.

Initialization methods

Biernacki, C., Celeux, G. and Govaert, G., 2003. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4), pp.561-575.

Blömer, J. and Bujna, K., 2013. Simple methods for initializing the EM algorithm for Gaussian mixture models. CoRR.

Kwedlo, W., 2013. A new method for random initialization of the EM algorithm for multivariate Gaussian mixture learning. In Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013 (pp. 81-90). Springer, Heidelberg.

McKenzie, P. and Alder, M., 1994. Initializing the EM algorithm for use in Gaussian mixture modelling. Pattern Recognition in Practice IV, pp.91-105.

Paclík, P. and Novovičová, J., 2001. Number of components and initialization in Gaussian mixture model for pattern recognition. In Artificial Neural Nets and Genetic Algorithms (pp. 406-409). Springer, Vienna.

Shireman, E., Steinley, D. and Brusco, M.J., 2017. Examining the effect of initialization strategies on the performance of Gaussian mixture modeling. Behavior Research Methods, 49(1), pp.282-293.

Stopping criteria

Abbi, R., El-Darzi, E., Vasilakis, C. and Millard, P., 2008, September. Analysis of stopping criteria for the EM algorithm in the context of patient grouping according to length of stay. In 2008 4th International IEEE Conference Intelligent Systems (Vol. 1, pp. 3-9). IEEE.

Kontaxakis, G. and Tzanakos, G., 1992, October. Study of the convergence properties of the EM algorithm-a new stopping rule. In IEEE Conference on Nuclear Science Symposium and Medical Imaging (pp. 1163-1165). IEEE.

Kontaxakis, G. and Tzanakos, G., 1993, March. Further study of a stopping rule for the EM algorithm. In 1993 IEEE Annual Northeast Bioengineering Conference (pp. 52-53). IEEE.

How to cite

Please cite as:

Taboga, Marco (2021). "Gaussian mixture - Maximum likelihood estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/Gaussian-mixture-maximum-likelihood.
