StatLect

Empirical distribution

by Marco Taboga, PhD

The empirical distribution, or empirical distribution function, can be used to describe a sample of observations of a given variable. Its value at a given point is equal to the proportion of observations from the sample that are less than or equal to that point.

Definition

The following is a formal definition.

Definition Let $$\xi_{n}=(x_{1},\ldots,x_{n})$$ be a sample of size n, where $x_{1}$,...,$x_{n}$ are the n observations from the sample. The empirical distribution function of the sample $\xi_{n}$ is the function $F_{n}:\mathbb{R}\rightarrow[0,1]$ defined as $$F_{n}(x)=\frac{1}{n}\sum_{i=1}^{n}1_{\{x_{i}\leq x\}}$$ where $1_{\{x_{i}\leq x\}}$ is an indicator function that is equal to 1 if $x_{i}\leq x$ and 0 otherwise.

In other words, the value of the empirical distribution function at a given point x is obtained by:

  1. counting the number of observations that are less than or equal to x;

  2. dividing the number thus obtained by the total number of observations, so as to obtain the proportion of observations that is less than or equal to x.
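The two steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `ecdf` and the five-observation sample are made up for the example and do not come from the lecture.

```python
from typing import Sequence


def ecdf(sample: Sequence[float], x: float) -> float:
    """Proportion of observations in `sample` that are <= x."""
    if len(sample) == 0:
        raise ValueError("sample must contain at least one observation")
    # Step 1: count the observations that are less than or equal to x.
    count = sum(1 for x_i in sample if x_i <= x)
    # Step 2: divide the count by the total number of observations.
    return count / len(sample)


# Hypothetical sample, chosen only for illustration.
sample = [1.5, 0.2, 3.1, 2.4, 0.9]
print(ecdf(sample, 2.0))  # 3 of the 5 observations are <= 2.0, so this prints 0.6
```

The function evaluates the definition directly, so each call costs one pass over the sample.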

An example follows.

Example Suppose we observe a sample made of four observations: $$\xi_{4}=(x_{1},x_{2},x_{3},x_{4})$$ where [eq6] What is the value of the empirical distribution function of the sample $\xi_{4}$ at the point $x=4$? According to the definition above, it is $$F_{4}(4)=\frac{1}{4}\sum_{i=1}^{4}1_{\{x_{i}\leq 4\}}=\frac{3}{4}$$ In other words, the proportion of observations that are less than or equal to $4$ is $3/4$.

The empirical distribution is the distribution function of a discrete variable

Let $x_{(1)}$,...,$x_{(n)}$ be the sample observations ordered from the smallest to the largest (in technical terms, the order statistics of the sample).

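Then it is easy to see that the empirical distribution function can be written as $$F_{n}(x)=\begin{cases}0 & \text{if } x<x_{(1)}\\ \frac{k}{n} & \text{if } x_{(k)}\leq x<x_{(k+1)},\ k=1,\ldots,n-1\\ 1 & \text{if } x\geq x_{(n)}\end{cases}$$ This is a function that is everywhere flat except at sample points, where it jumps by $\frac{1}{n}$ (or by a multiple of $\frac{1}{n}$ if the same value is observed more than once). When the observations are all distinct, it is the distribution function of a discrete random variable $Y_{n}$ that can take any one of the values $x_{1}$,...,$x_{n}$ with probability $1/n$. In other words, it is the distribution function of a discrete variable $Y_{n}$ having probability mass function $$p_{Y_{n}}(y)=\begin{cases}\frac{1}{n} & \text{if } y\in\{x_{1},\ldots,x_{n}\}\\ 0 & \text{otherwise}\end{cases}$$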
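The step-function form can be illustrated with a short Python sketch that sorts the sample once and then locates each evaluation point among the order statistics. The helper name `make_ecdf` and the sample values are hypothetical; `bisect_right` from the standard library is used here as one convenient way to count the ordered values that do not exceed x.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the empirical distribution function of `sample` as a callable."""
    ordered = sorted(sample)  # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must contain at least one observation")

    def f_n(x):
        # bisect_right gives the number k of ordered values that are <= x,
        # so the function returns k / n and is flat between sample points.
        return bisect_right(ordered, x) / n

    return f_n


# Hypothetical sample, chosen only for illustration.
f_n = make_ecdf([1.5, 0.2, 3.1, 2.4, 0.9])
for x in [0.0, 0.2, 1.0, 1.5, 3.1]:
    print(x, f_n(x))
```

Sorting once costs O(n log n), after which each evaluation takes O(log n), which may be preferable to recounting when the function is evaluated at many points.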

Finite sample properties

When the n observations from the sample $x_{1}$,...,$x_{n}$ are the realizations of n random variables $X_{1}$,...,$X_{n}$, then the value $F_{n}(x)$ taken by the empirical distribution at a given point x can also be regarded as a random variable. Under the hypothesis that all the random variables $X_{1}$,...,$X_{n}$ have the same distribution, the expected value and the variance of $F_{n}(x)$ can be easily computed, as shown in the following proposition.

Proposition If the n observations in the sample $$\xi_{n}=(x_{1},\ldots,x_{n})$$ are the realizations of n random variables $X_{1}$,...,$X_{n}$ having the same distribution function $F_{X}(x)$, then $$\mathrm{E}[F_{n}(x)]=F_{X}(x)$$ for any $x\in\mathbb{R}$. Furthermore, if $X_{1}$,...,$X_{n}$ are mutually independent, then $$\mathrm{Var}[F_{n}(x)]=\frac{1}{n}F_{X}(x)[1-F_{X}(x)]$$ for any $x\in\mathbb{R}$.

Proof

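The result about the expected value is proved by using the definition of distribution function and the properties of indicator functions (in particular, the fact that the expected value of an indicator is equal to the probability of the event it indicates):
$$\begin{aligned}\mathrm{E}[F_{n}(x)] &= \mathrm{E}\left[\frac{1}{n}\sum_{i=1}^{n}1_{\{X_{i}\leq x\}}\right]\\ &= \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}\left[1_{\{X_{i}\leq x\}}\right]\\ &= \frac{1}{n}\sum_{i=1}^{n}\mathrm{P}(X_{i}\leq x)\\ &= \frac{1}{n}\sum_{i=1}^{n}F_{X}(x)\\ &= F_{X}(x)\end{aligned}$$
The result about the variance is proved as follows, where the second equality uses mutual independence and the fourth uses the fact that the square of an indicator is equal to the indicator itself:
$$\begin{aligned}\mathrm{Var}[F_{n}(x)] &= \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n}1_{\{X_{i}\leq x\}}\right]\\ &= \frac{1}{n^{2}}\sum_{i=1}^{n}\mathrm{Var}\left[1_{\{X_{i}\leq x\}}\right]\\ &= \frac{1}{n^{2}}\sum_{i=1}^{n}\left(\mathrm{E}\left[1_{\{X_{i}\leq x\}}^{2}\right]-\mathrm{E}\left[1_{\{X_{i}\leq x\}}\right]^{2}\right)\\ &= \frac{1}{n^{2}}\sum_{i=1}^{n}\left(\mathrm{E}\left[1_{\{X_{i}\leq x\}}\right]-\mathrm{E}\left[1_{\{X_{i}\leq x\}}\right]^{2}\right)\\ &= \frac{1}{n^{2}}\sum_{i=1}^{n}\left(F_{X}(x)-F_{X}(x)^{2}\right)\\ &= \frac{1}{n}F_{X}(x)[1-F_{X}(x)]\end{aligned}$$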

Thus, for any given point, the empirical distribution function is an unbiased estimator of the true distribution function. Furthermore, its variance tends to zero as the sample size becomes large (as n tends to infinity).
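Unbiasedness and the variance formula can be checked numerically with a small Monte Carlo experiment. The sketch below rests on assumptions that are not in the lecture: the observations are independent draws from a uniform distribution on [0, 1] (so that the true distribution function at x is x itself), and the sample size, evaluation point, number of replications and seed are arbitrary illustrative choices.

```python
import random


def ecdf_at(sample, x):
    """Value at x of the empirical distribution function of `sample`."""
    return sum(1 for x_i in sample if x_i <= x) / len(sample)


def simulate(n, x, replications, seed=0):
    """Simulated mean and variance of F_n(x) for i.i.d. Uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    values = []
    for _ in range(replications):
        sample = [rng.random() for _ in range(n)]
        values.append(ecdf_at(sample, x))
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


n, x = 20, 0.3
mean_hat, var_hat = simulate(n, x, replications=20000)
true_cdf = x  # assumption: Uniform(0, 1), whose distribution function is x on [0, 1]

print("simulated mean:    ", mean_hat, " theoretical:", true_cdf)
print("simulated variance:", var_hat, " theoretical:", true_cdf * (1 - true_cdf) / n)
```

With these settings the simulated mean and variance should be close to 0.3 and 0.3 × 0.7 / 20 = 0.0105 respectively, up to simulation noise.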

Large sample properties

An immediate consequence of the previous result is that the empirical distribution converges in mean-square to the true one.

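Proposition If the n observations in the sample $$\xi_{n}=(x_{1},\ldots,x_{n})$$ are the realizations of n mutually independent random variables $X_{1}$,...,$X_{n}$ having the same distribution function $F_{X}(x)$, then $$\lim_{n\rightarrow\infty}\mathrm{E}\left[\left(F_{n}(x)-F_{X}(x)\right)^{2}\right]=0$$ for any $x\in\mathbb{R}$.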

Proof

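We have that
$$\begin{aligned}\lim_{n\rightarrow\infty}\mathrm{E}\left[\left(F_{n}(x)-F_{X}(x)\right)^{2}\right] &= \lim_{n\rightarrow\infty}\mathrm{E}\left[\left(F_{n}(x)-\mathrm{E}[F_{n}(x)]\right)^{2}\right]\\ &= \lim_{n\rightarrow\infty}\mathrm{Var}[F_{n}(x)]\\ &= \lim_{n\rightarrow\infty}\frac{1}{n}F_{X}(x)[1-F_{X}(x)]\\ &= 0\end{aligned}$$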

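As a matter of fact, it is possible to prove a much stronger result, called the Glivenko-Cantelli theorem, which states that not only does $F_{n}(x)$ converge almost surely to $F_{X}(x)$ for each x, but the convergence is also uniform in x, that is, $$\sup_{x\in\mathbb{R}}\left\vert F_{n}(x)-F_{X}(x)\right\vert \overset{\text{a.s.}}{\longrightarrow}0$$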
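The uniform convergence can be illustrated, though of course not proved, by computing the largest gap between the empirical and the true distribution function for samples of increasing size. The sketch below assumes independent draws from a uniform distribution on [0, 1], whose distribution function is x on that interval; the function name, sample sizes and seed are illustrative choices.

```python
import random


def sup_distance_uniform(sample):
    """sup over x of |F_n(x) - x| for a sample with values in [0, 1]."""
    ordered = sorted(sample)
    n = len(ordered)
    # F_n is flat between sample points while the uniform distribution
    # function increases, so the largest gap occurs at a jump of F_n:
    # either at x_(i), where the gap is i/n - x_(i), or just before x_(i),
    # where the gap approaches x_(i) - (i-1)/n.
    return max(
        max(i / n - u, u - (i - 1) / n)
        for i, u in enumerate(ordered, start=1)
    )


rng = random.Random(0)  # fixed seed so that the run is reproducible
distances = {}
for n in [10, 100, 1000, 10000]:
    sample = [rng.random() for _ in range(n)]
    distances[n] = sup_distance_uniform(sample)
    print(n, distances[n])
```

The printed distances should tend to shrink as n grows, although for a single simulated sequence the decrease need not be monotone.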

Furthermore, the assumption that the random variables X_1,...,X_n be mutually independent can be relaxed (see, e.g., Borokov 1999) to allow for some dependence among the observations (similarly to what can be done for the Law of Large Numbers; see Chebyshev's Weak Law of Large Numbers for correlated sequences).

References

Borokov, A. A. (1999) Mathematical statistics, CRC Press.

How to cite

Please cite as:

Taboga, Marco (2021). "Empirical distribution", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/asymptotic-theory/empirical-distribution.
