Kullback-Leibler divergence

The Kullback-Leibler divergence is a measure of the dissimilarity between two probability distributions.

Table of contents

Definition
The KL divergence is non-negative
Asymmetry
Why the KL divergence is a measure of dissimilarity
References

Definition

We are going to give two separate definitions of Kullback-Leibler (KL) divergence, one for discrete random variables and one for continuous variables.

Definition Let and be two discrete random variables with supports and $R_{Y}$ and probability mass functions and . Let , so thatThen the KL divergence of from is [eq7]

Note that the summation is over the support of , so that we always have and , and, as a consequence, the natural logarithm [eq10] is always well-defined.

The KL divergence measures how much the distribution defined by is dissimilar from the reference distribution defined by .

The definition for continuous random variables is analogous.

Definition Let and be two continuous random variables with supports and $R_{Y}$ and probability density functions and such that [eq16] for any set $Asubseteq R_{X}$ . Then the KL divergence of from is [eq19]

In order to be entirely rigorous the above definition should also specify that the sets $Asubseteq R_{X}$ must be measurable (see the lecture on probability for a definition of measurable set).

Property (1), which is called absolute continuity, requires that if the distribution associated to the density assigns a non-zero probability to a set , then also the distribution must assign a non-zero probability to that set. This requirement is analogous to that for discrete variables and ensures that [eq22] is well-defined on all sets that have non-zero probability.

The KL divergence is non-negative

The next proposition states a fundamental property of the Kullback-Leibler divergence.

Proposition Let and be two probability mass functions and . If the two probability mass functions coincide, that is, if for all $xin R_{X}$ , then Otherwise, if they do not coincide, then

Proof

Let us first prove the equality part. If the two probability mass functions coincide, then [eq29] for $xin R_{X}$ and [eq30] When they do not coincide, then we have [eq31] where: in step we have written the summation as an expected value with respect to the probability distribution of ; in step , we have used Jensen's inequality (the function is strictly convex in and the random variable is not constant because the two probability mass function do not coincide); in step we have used the fact that a sum of probabilities cannot be greater than 1.

A similar result holds for continuous variables.

Proposition Let and be two probability density functions such that their KL divergence is well-defined. If the two probability density function coincide almost surely, that is, if [eq36] for all measurable sets $Asubseteq R_{X}$ , then Otherwise, if they do not coincide almost surely, then

Proof

The proof is analogous to that for discrete variables.

Asymmetry

An often cited property of the KL divergence is that it is not symmetric, that is, in general there is no guarantee that

In fact, it is even possible that exists when is not well-defined: as you can check by looking at the definition of KL divergence, this happens when the support of is strictly included in the support of :

Why the KL divergence is a measure of dissimilarity

Since the Kullback-Leibler divergence is an information-theoretic concept and most of the students of probability and statistics are not familiar with information theory, they struggle to get an intuitive understanding of the reason why the KL divergence measures the dissimilarity of a probability distribution from a reference distribution. We provide an explanation that is entirely based on probabilistic concepts.

Suppose that and are two probability mass functions such that the KL divergence is well-defined.

Take a convex combination of the two distributionswhere .

By increasing we can make more and more similar to until, when , and coincide.

It is possible to prove that the KL divergence is convex (see Cover and Thomas 2006) and, as a consequence,

Thus, the higher is, the smaller becomes. In other words, the more is similar to , the smaller the Kullback-Leibler divergence becomes.

References

Cover, T. M., and J. A. Thomas (2006) " Elements of information theory", Wiley-Interscience.

How to cite

Please cite as:

Taboga, Marco (2021). "Kullback-Leibler divergence", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-probability/Kullback-Leibler-divergence.