
Kullback-Leibler divergence

by Marco Taboga, PhD

The Kullback-Leibler divergence is a measure of the dissimilarity between two probability distributions.


We are going to give two separate definitions of Kullback-Leibler (KL) divergence, one for discrete random variables and one for continuous variables.

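Definition Let $X$ and $Y$ be two discrete random variables with supports $R_{X}$ and $R_{Y}$ and probability mass functions $p_{X}(x)$ and $p_{Y}(x)$. Let $R_{X}\subseteq R_{Y}$, so that
$$p_{Y}(x)>0\quad \text{for all } x\in R_{X}$$
Then the KL divergence of $p_{Y}(x)$ from $p_{X}(x)$ is
$$D_{KL}(p_{X}\|p_{Y})=-\sum_{x\in R_{X}}p_{X}(x)\ln \left( \frac{p_{Y}(x)}{p_{X}(x)}\right)$$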

Note that the summation is over the support of $X$, so that we always have $p_{X}(x)>0$ and $p_{Y}(x)>0$, and, as a consequence, the natural logarithm
$$\ln \left( \frac{p_{Y}(x)}{p_{X}(x)}\right)$$
is always well-defined.

The KL divergence $D_{KL}(p_{X}\|p_{Y})$ measures how much the distribution defined by $p_{Y}(x)$ is dissimilar from the reference distribution defined by $p_{X}(x)$.
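The discrete definition can be illustrated with a minimal sketch. It assumes the standard form of the discrete KL divergence (minus the sum, over the support of X, of the reference probability times the log ratio) and represents each probability mass function as a dictionary from outcomes to probabilities. The function name `kl_divergence` and the example distributions are hypothetical, chosen only for illustration.

```python
import math


def kl_divergence(p_x, p_y):
    """KL divergence of p_y from the reference p_x.

    Both arguments are dicts mapping outcomes to probabilities
    (an illustrative representation, not taken from the lecture).
    """
    total = 0.0
    for x, px in p_x.items():
        if px == 0.0:
            continue  # x is outside the support of X, so it is not summed
        py = p_y.get(x, 0.0)
        if py == 0.0:
            # The support of X is not included in the support of Y.
            raise ValueError(f"divergence not well-defined at outcome {x!r}")
        total -= px * math.log(py / px)
    return total


# Hypothetical example: the support of X is {a, b}, the support of Y is {a, b, c}.
p_x = {"a": 0.5, "b": 0.5}
p_y = {"a": 0.25, "b": 0.25, "c": 0.5}

print(kl_divergence(p_x, p_y))  # approximately 0.6931, that is, ln(2)
```

Raising an error when the support condition fails is one possible design choice; it mirrors the requirement that the logarithm be well-defined on the whole support of X.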

The definition for continuous random variables is analogous.

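Definition Let $X$ and $Y$ be two continuous random variables with supports $R_{X}$ and $R_{Y}$ and probability density functions $f_{X}(x)$ and $f_{Y}(x)$ such that
$$\int_{A}f_{X}(x)dx>0\Longrightarrow \int_{A}f_{Y}(x)dx>0\qquad (1)$$
for any set $A\subseteq R_{X}$. Then the KL divergence of $f_{Y}(x)$ from $f_{X}(x)$ is
$$D_{KL}(f_{X}\|f_{Y})=-\int_{R_{X}}f_{X}(x)\ln \left( \frac{f_{Y}(x)}{f_{X}(x)}\right) dx$$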

In order to be entirely rigorous, the above definition should also specify that the sets $A\subseteq R_{X}$ must be measurable (see the lecture on probability for a definition of measurable set).

Property (1), which is called absolute continuity, requires that if the distribution associated with the density $f_{X}(x)$ assigns a non-zero probability to a set $A$, then the distribution associated with $f_{Y}(x)$ must also assign a non-zero probability to that set. This requirement is analogous to that for discrete variables and ensures that
$$\ln \left( \frac{f_{Y}(x)}{f_{X}(x)}\right)$$
is well-defined on all sets that have non-zero probability.
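As a rough numerical illustration of the continuous case, the sketch below approximates the integral with a midpoint rule for two normal densities and compares the result with the well-known closed form for the KL divergence between two normal distributions. The helper names (`normal_pdf`, `kl_continuous`), the truncation of the integration range to [-12, 12] and the grid size are all assumptions made for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kl_continuous(f_x, f_y, lower, upper, n=100_000):
    """Midpoint-rule approximation of -integral of f_x * ln(f_y / f_x) on [lower, upper]."""
    h = (upper - lower) / n
    total = 0.0
    for i in range(n):
        x = lower + (i + 0.5) * h
        fx = f_x(x)
        if fx > 0.0:  # only points where the reference density is positive contribute
            total -= fx * math.log(f_y(x) / fx) * h
    return total


# Hypothetical example: reference X ~ N(0, 1), other distribution Y ~ N(1, 2^2).
mu_x, sigma_x = 0.0, 1.0
mu_y, sigma_y = 1.0, 2.0

approx = kl_continuous(
    lambda x: normal_pdf(x, mu_x, sigma_x),
    lambda x: normal_pdf(x, mu_y, sigma_y),
    -12.0,
    12.0,
)

# Closed form for two normal distributions, used here only as a cross-check.
closed_form = (
    math.log(sigma_y / sigma_x)
    + (sigma_x**2 + (mu_x - mu_y) ** 2) / (2.0 * sigma_y**2)
    - 0.5
)

print(approx, closed_form)  # the two values should agree to several decimals
```

Both normal densities are strictly positive on the whole real line, so the absolute continuity requirement holds in this example.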

The KL divergence is non-negative

The next proposition states a fundamental property of the Kullback-Leibler divergence.

Proposition Let $p_{X}(x)$ and $p_{Y}(x)$ be two probability mass functions and $R_{X}\subseteq R_{Y}$. If the two probability mass functions coincide, that is, if
$$p_{X}(x)=p_{Y}(x)$$
for all $x\in R_{X}$, then
$$D_{KL}(p_{X}\|p_{Y})=0$$
Otherwise, if they do not coincide, then
$$D_{KL}(p_{X}\|p_{Y})>0$$


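Let us first prove the equality part. If the two probability mass functions coincide, then $p_{Y}(x)/p_{X}(x)=1$ for $x\in R_{X}$ and
$$D_{KL}(p_{X}\|p_{Y})=-\sum_{x\in R_{X}}p_{X}(x)\ln (1)=0$$
When they do not coincide, we have
$$\begin{aligned}
D_{KL}(p_{X}\|p_{Y}) &=-\sum_{x\in R_{X}}p_{X}(x)\ln \left( \frac{p_{Y}(x)}{p_{X}(x)}\right) \\
&\overset{\boxed{A}}{=}\mathrm{E}\left[ -\ln \left( \frac{p_{Y}(X)}{p_{X}(X)}\right) \right] \\
&\overset{\boxed{B}}{>}-\ln \left( \mathrm{E}\left[ \frac{p_{Y}(X)}{p_{X}(X)}\right] \right) \\
&=-\ln \left( \sum_{x\in R_{X}}p_{X}(x)\frac{p_{Y}(x)}{p_{X}(x)}\right) =-\ln \left( \sum_{x\in R_{X}}p_{Y}(x)\right) \\
&\overset{\boxed{C}}{\geq }-\ln (1)=0
\end{aligned}$$
where: in step $\boxed{A}$, we have written the summation as an expected value with respect to the probability distribution of $X$; in step $\boxed{B}$, we have used Jensen's inequality (the function $-\ln (x)$ is strictly convex in $x$ and the random variable $p_{Y}(X)/p_{X}(X)$ is not constant because the two probability mass functions do not coincide); in step $\boxed{C}$, we have used the fact that a sum of probabilities cannot be greater than 1.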
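The chain of inequalities used in the proof can be checked numerically on a small example. The sketch below is only an illustration under assumed inputs: the two probability mass functions are hypothetical, and the variable names (`kl`, `expected_ratio`, `jensen_bound`) are introduced here and do not come from the lecture.

```python
import math

# Hypothetical probability mass functions; the support of X is {a, b}.
p_x = {"a": 0.7, "b": 0.3}
p_y = {"a": 0.2, "b": 0.3, "c": 0.5}

# The divergence written as an expected value with respect to the
# distribution of X: E[-ln(p_Y(X) / p_X(X))].
kl = sum(-px * math.log(p_y[x] / px) for x, px in p_x.items())

# Expected value of the ratio p_Y(X) / p_X(X); it equals the sum of p_Y(x)
# over the support of X, which cannot exceed 1.
expected_ratio = sum(px * (p_y[x] / px) for x, px in p_x.items())

# Right-hand side of Jensen's inequality: -ln(E[p_Y(X) / p_X(X)]).
jensen_bound = -math.log(expected_ratio)

print(kl, jensen_bound)
print(kl > jensen_bound >= 0.0)  # the ordering stated in the proof
```

In this example the ratio takes two different values on the support of X, so the Jensen step is strict.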

A similar result holds for continuous variables.

Proposition Let $f_{X}(x)$ and $f_{Y}(x)$ be two probability density functions such that their KL divergence is well-defined. If the two probability density functions coincide almost surely, that is, if
$$\int_{A}f_{X}(x)dx=\int_{A}f_{Y}(x)dx$$
for all measurable sets $A\subseteq R_{X}$, then
$$D_{KL}(f_{X}\|f_{Y})=0$$
Otherwise, if they do not coincide almost surely, then
$$D_{KL}(f_{X}\|f_{Y})>0$$


The proof is analogous to that for discrete variables.


An often cited property of the KL divergence is that it is not symmetric, that is, in general there is no guarantee that
$$D_{KL}(p_{X}\|p_{Y})=D_{KL}(p_{Y}\|p_{X})$$

In fact, it is even possible that $D_{KL}(p_{X}\|p_{Y})$ exists when $D_{KL}(p_{Y}\|p_{X})$ is not well-defined: as you can check by looking at the definition of KL divergence, this happens when the support of $X$ is strictly included in the support of $Y$:
$$R_{X}\subset R_{Y}$$
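Both points about asymmetry can be seen on small examples. The following sketch assumes the standard discrete formula and a dictionary representation of the probability mass functions; the function name `kl_divergence` and the distributions `p`, `q` and `r` are hypothetical. The function raises an error when the divergence is not well-defined, which is one possible convention among several.

```python
import math


def kl_divergence(p_ref, p_other):
    """KL divergence of p_other from the reference p_ref (dicts outcome -> probability)."""
    total = 0.0
    for x, p in p_ref.items():
        if p == 0.0:
            continue  # outside the support of the reference distribution
        q = p_other.get(x, 0.0)
        if q == 0.0:
            raise ValueError("support of the reference is not included in the other support")
        total -= p * math.log(q / p)
    return total


# Same support, different values in the two directions.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)  # the two numbers differ

# Support of the reference strictly included in the other support:
# one direction exists, the other is not well-defined.
r = {"a": 0.25, "b": 0.25, "c": 0.5}
d_pr = kl_divergence(p, r)
try:
    kl_divergence(r, p)
    reverse_defined = True
except ValueError:
    reverse_defined = False
print(d_pr, reverse_defined)
```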

Why the KL divergence is a measure of dissimilarity

The Kullback-Leibler divergence is an information-theoretic concept, and most students of probability and statistics are not familiar with information theory. As a result, they often struggle to see intuitively why the KL divergence measures the dissimilarity of a probability distribution from a reference distribution. We provide an explanation that is entirely based on probabilistic concepts.

Suppose that $p_{X}(x)$ and $p_{Y}(x)$ are two probability mass functions such that the KL divergence $D_{KL}(p_{X}\|p_{Y})$ is well-defined.

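Take a convex combination of the two distributions
$$p_{\lambda }(x)=\lambda p_{X}(x)+(1-\lambda )p_{Y}(x)$$
where $0\leq \lambda \leq 1$.

By increasing $\lambda $ we can make $p_{\lambda }(x)$ more and more similar to $p_{X}(x)$ until, when $\lambda =1$, $p_{\lambda }(x)$ and $p_{X}(x)$ coincide.

It is possible to prove that the KL divergence is convex (see Cover and Thomas 2006) and, as a consequence,
$$D_{KL}(p_{X}\|p_{\lambda })\leq \lambda D_{KL}(p_{X}\|p_{X})+(1-\lambda )D_{KL}(p_{X}\|p_{Y})=(1-\lambda )D_{KL}(p_{X}\|p_{Y})$$

Thus, the higher $\lambda $ is, the smaller $D_{KL}(p_{X}\|p_{\lambda })$ becomes. In other words, the more $p_{\lambda }(x)$ is similar to $p_{X}(x)$, the smaller the Kullback-Leibler divergence becomes.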
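The behaviour of the divergence along the convex combination can be tabulated with a short sketch. Everything in it is an assumption made for illustration: the probability mass functions are stored as lists over a common set of three outcomes, the distributions are hypothetical, and `kl_divergence` and `mixture` are names introduced here. The last column is the upper bound implied by convexity, namely one minus the weight times the divergence between the two original distributions.

```python
import math


def kl_divergence(p_ref, p_other):
    """Standard discrete KL divergence, summed over outcomes where p_ref is positive."""
    return sum(-p * math.log(q / p) for p, q in zip(p_ref, p_other) if p > 0.0)


def mixture(p, q, lam):
    """Convex combination lam * p + (1 - lam) * q, outcome by outcome."""
    return [lam * pi + (1.0 - lam) * qi for pi, qi in zip(p, q)]


# Hypothetical distributions over three outcomes; the reference has support {0, 1}.
p_x = [0.7, 0.3, 0.0]
p_y = [0.2, 0.3, 0.5]

d_xy = kl_divergence(p_x, p_y)
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
divergences = [kl_divergence(p_x, mixture(p_x, p_y, lam)) for lam in lambdas]

for lam, d in zip(lambdas, divergences):
    bound = (1.0 - lam) * d_xy
    print(f"lambda={lam:.2f}  divergence={d:.4f}  bound={bound:.4f}")
```

In this example the divergence shrinks as the weight on the reference distribution grows, and it reaches zero when the weight equals 1.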


Cover, T. M., and J. A. Thomas (2006) "Elements of information theory", Wiley-Interscience.

How to cite

Please cite as:

Taboga, Marco (2021). "Kullback-Leibler divergence", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
