Search for probability and statistics terms on Statlect

EM algorithm

by , PhD

The Expectation-Maximization (EM) algorithm is a recursive algorithm that can be used to search for the maximum likelihood estimators of model parameters when the model includes some unobservable variables (also called latent variables).

Table of Contents

Latent-variable model

In a latent-variable model, there are two vectors:

We denote their joint probability by[eq1]where $	heta $ is a vector of parameters to be estimated.

Model specification

The joint distribution of x and $z$ is usually obtained by separately specifying the conditional distribution of the observed variables[eq2]and the marginal distribution of the latent variables[eq3]

As a consequence, the joint distribution is[eq4]

In the context of latent-variable models, the joint distribution is often called complete likelihood.

Derived quantities

Given the model specification above, if $z$ is a discrete random vector, we can derive the marginal distribution of x as[eq5]where the sum is over the set $R_{z}$ of all the values that $z$ can possibly take (its support).

We can also derive the conditional distribution of $z$ as[eq6]

If $z$ is continuous the sums above are replaced by integrals.

Maximum likelihood problem

The maximum likelihood estimator (MLE) of the parameter $	heta $ is the vector $widehat{	heta }$ that solves the maximization problem[eq7]

This problem does not usually have an analytical solution. As a consequence, we need to solve it with an iterative procedure that starts from an initial guess $	heta _{0}$ of the solution and produces a sequence[eq8]that hopefully converges to the solution.

We say hopefully because we are often unable to derive theoretical guarantees about the numerical convergence of likelihood maximization algorithms, especially for latent-variable models.

The EM algorithm is one of the iterative procedures that can be used to search for a solution when we are dealing with a latent-variable model specified as above.

The algorithm

Starting from an initial guess $	heta _{0}$, the $j$-th iteration of the EM algorithm consists of the following steps:

  1. use the parameter value $	heta _{j-1}$ found in the previous iteration to compute the conditional probabilities[eq9]for each $zin R_{Z}$;

  2. use the conditional probabilities derived in step 1 to compute the expected value of the complete log-likelihood:[eq10]

  3. solve the maximization problem[eq11]

  4. if the parameter update is smaller than a pre-specified threshold epsilon, that is, if [eq12]stop the algorithm, else return to step 1.

Steps 1 and 2 are collectively called the Expectation step, while step 3 is called the Maximization step. Hence the name of the algorithm (Expectation-Maximization).

Step 4 is a stopping criterion: we stop the algorithm when there are no significant changes in the parameter.

Theoretical guarantees

In general, the algorithm is not guaranteed to converge to a global maximum of the likelihood.

However, it has the important property that the likelihood is guaranteed to increase at each iteration:[eq13]for every $j$.


The conditional probability formula[eq14]implies that[eq15]If we multiply both sides by [eq16], we obtain[eq17]Then, we can sum over the support of $z$:[eq18]Since the log-likelihood [eq19] does not depend on $z$ and [eq20]we have[eq21]Moreover, [eq22]Thus,[eq23]Setting [eq24], we obtain[eq25]Subtracting (2) from (1), we get[eq26]By Jensen's inequality, we have[eq27]Therefore,[eq28]which, for [eq29], becomes [eq30]But [eq31]because[eq11]Therefore[eq33]or[eq34]

Advantages of the EM algorithm

The EM algorithm is particularly advantageous when the maximization problem in the Maximization step[eq11] has a closed-form solution.

This happens, for example, when the latent-variable model is a mixture of multivariate normal distributions.


As we said, there is no guarantee that the EM algorithm converges to a global maximum of the likelihood.

If we suspect that the likelihood may have multiple local minima, we should use the multiple starts approach. In other words, we should run the EM algorithm several times with different starting values $	heta _{0}$.

Example: estimation of Gaussian mixtures

The EM algorithm is often used to estimate the parameters of a mixture of normal distributions (also called Gaussian mixture). See this lecture for details.

How to cite

Please cite as:

Taboga, Marco (2021). "EM algorithm", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.

The books

Most of the learning materials found on this website are now available in a traditional textbook format.