
Maximum likelihood estimation

by Marco Taboga, PhD

Maximum likelihood estimation (MLE) is an estimation method that allows us to use a sample to estimate the parameters of the probability distribution that generated the sample.

This lecture provides an introduction to the theory of maximum likelihood, focusing on its mathematical aspects, in particular on the definition of the maximum likelihood estimator, the assumptions under which it is consistent and asymptotically normal, and the main steps needed to prove these asymptotic properties.

At the end of the lecture, we provide links to pages that contain examples and that treat practically relevant aspects of the theory, such as numerical optimization and hypothesis testing.


The sample and its likelihood

The main elements of a maximum likelihood estimation problem are the following:

  1. a sample $\xi$, used to make inferences about the distribution that generated it;

  2. a family of probability distributions, indexed by a parameter $\theta$ belonging to a parameter space $\Theta$;

  3. the unknown parameter $\theta_{0}$, associated with the distribution that actually generated the sample;

  4. the likelihood function $L(\theta ;\xi )$, that is, the joint probability density of the sample, viewed as a function of the parameter $\theta$.

Maximum likelihood estimator

A maximum likelihood estimator $\widehat{\theta}$ of $\theta_{0}$ is obtained as a solution of a maximization problem:
$$\widehat{\theta} = \arg\max_{\theta \in \Theta} L(\theta ;\xi )$$
In other words, $\widehat{\theta}$ is the parameter that maximizes the likelihood of the sample $\xi$. $\widehat{\theta}$ is called the maximum likelihood estimator of $\theta_{0}$.

In what follows, the symbol $\widehat{\theta}$ will be used to denote both a maximum likelihood estimator (a random variable) and a maximum likelihood estimate (a realization of a random variable): the meaning will be clear from the context.

The same estimator $\widehat{\theta}$ is obtained as a solution of
$$\widehat{\theta} = \arg\max_{\theta \in \Theta} \ln L(\theta ;\xi )$$
i.e., by maximizing the natural logarithm of the likelihood function. Solving this problem is equivalent to solving the original one, because the logarithm is a strictly increasing function. The logarithm of the likelihood is called the log-likelihood and it is denoted by
$$l(\theta ;\xi ) = \ln L(\theta ;\xi )$$
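To make the definitions concrete, here is a minimal numerical sketch (not part of the original lecture) in Python. It assumes an exponential model, for which the log-likelihood has the closed-form maximizer $\widehat{\lambda} = 1/\bar{x}$; the sample size, seed, and the use of SciPy's scalar optimizer are illustrative choices.

```python
# A minimal sketch: maximum likelihood for an exponential model,
# f(x; lambda) = lambda * exp(-lambda * x), whose log-likelihood is
# l(lambda) = n*log(lambda) - lambda*sum(x), maximized at 1/mean(x).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=500)  # true rate lambda_0 = 2

def log_likelihood(lam):
    return x.size * np.log(lam) - lam * x.sum()

# Maximize the log-likelihood numerically by minimizing its negative.
res = minimize_scalar(lambda lam: -log_likelihood(lam), bounds=(1e-6, 50), method="bounded")

print("analytic MLE :", 1 / x.mean())
print("numerical MLE:", res.x)  # the two values should agree closely
```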

Asymptotic properties

To derive the (asymptotic) properties of maximum likelihood estimators, one needs to specify a set of assumptions about the sample $\xi$ and the parameter space $\Theta$.

The next section presents a set of assumptions that allows us to easily derive the asymptotic properties of the maximum likelihood estimator. Some of the assumptions are quite restrictive, while others are very generic. Therefore, the subsequent sections discuss how the most restrictive assumptions can be weakened and how the most generic ones can be made more specific.

Note: the presentation in this section does not aim at being one hundred per cent rigorous. Its aim is rather to introduce the reader to the main steps that are necessary to derive the asymptotic properties of maximum likelihood estimators. Therefore, some technical details are either skipped or de-emphasized. After getting a grasp of the main issues related to the asymptotic properties of MLE, the interested reader can refer to other sources (e.g., Newey and McFadden - 1994, Ruud - 2000) for a fully rigorous presentation of MLE theory.

Assumptions

Let $\{X_{j}\}$ be a sequence of $K\times 1$ random vectors. Denote by $\xi_{n}$ the sample comprising the first $n$ realizations of the sequence,
$$\xi_{n} = (x_{1}, x_{2}, \ldots, x_{n}),$$
which is a realization of the random vector
$$(X_{1}, X_{2}, \ldots, X_{n}).$$

We assume that:

  1. IID. $\{X_{j}\}$ is an IID sequence.

  2. Continuous variables. A generic term $X_{j}$ of the sequence $\{X_{j}\}$ is a continuous random vector, whose joint probability density function $f_{X}(x_{j};\theta_{0})$ belongs to a set of joint probability density functions $\{f_{X}(x;\theta) : \theta \in \Theta\}$ indexed by a $K\times 1$ parameter $\theta \in \Theta$ (where we have dropped the subscript $j$ to highlight the fact that the terms of the sequence are identically distributed).

  3. Identification. If $\theta \neq \theta_{0}$, then the ratio $f_{X}(X;\theta) / f_{X}(X;\theta_{0})$ (where $X$ denotes a generic term of the sequence) is not almost surely constant. This also implies that the parametric family is identifiable: there does not exist another parameter $\theta \neq \theta_{0}$ such that $f_{X}(x;\theta)$ is the true probability density function of $X$.

  4. Integrable log-likelihood. The log-likelihood is integrable: $E\left[\left\vert \ln f_{X}(X;\theta)\right\vert\right] < \infty$ for all $\theta \in \Theta$.

  5. Maximum. The density functions $f_{X}(x;\theta)$ and the parameter space $\Theta$ are such that there always exists a unique solution $\widehat{\theta}_{n}$ of the maximization problem $\widehat{\theta}_{n} = \arg\max_{\theta\in\Theta} L(\theta;\xi_{n}) = \arg\max_{\theta\in\Theta} \prod_{j=1}^{n} f_{X}(x_{j};\theta)$, where the rightmost equality is a consequence of independence (see the IID assumption above). Of course, this is the same as $\widehat{\theta}_{n} = \arg\max_{\theta\in\Theta} l(\theta;\xi_{n})$, where $l(\theta;\xi_{n}) = \sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)$ is the log-likelihood and the terms $\ln f_{X}(x_{j};\theta)$ are the contributions of the individual observations to the log-likelihood. It is also the same as $\widehat{\theta}_{n} = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)$.

  6. Exchangeability of limit. The density functions $f_{X}(x;\theta)$ and the parameter space $\Theta$ are such that $\operatorname*{plim}_{n\rightarrow\infty}\left(\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)\right) = \arg\max_{\theta\in\Theta}\left(\operatorname*{plim}_{n\rightarrow\infty}\frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)\right)$, where $\operatorname{plim}$ denotes a limit in probability. Roughly speaking, the probability limit can be brought inside the $\arg\max$ operator.

  7. Differentiability. The log-likelihood $l(\theta;\xi_{n})$ is two times continuously differentiable with respect to $\theta$ in a neighborhood of $\theta_{0}$.

  8. Other technical conditions. The derivatives of the log-likelihood $l(\theta;\xi_{n})$ are well-behaved, so that integration and differentiation can be exchanged, their first and second moments exist, and probability limits involving their entries are well-behaved.

Information inequality

Given the assumptions made above, we can derive an important fact about the expected value of the log-likelihood: if $\theta \neq \theta_{0}$, then
$$E\left[\ln f_{X}(X;\theta_{0})\right] > E\left[\ln f_{X}(X;\theta)\right]$$
where the expectations are taken with respect to the true density $f_{X}(x;\theta_{0})$.

Proof

First of all,
$$E\left[\ln f_{X}(X;\theta_{0})\right] - E\left[\ln f_{X}(X;\theta)\right] = -E\left[\ln \frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}\right]$$
Therefore, the inequality
$$E\left[\ln f_{X}(X;\theta_{0})\right] > E\left[\ln f_{X}(X;\theta)\right]$$
is satisfied if and only if
$$E\left[\ln f_{X}(X;\theta_{0})\right] - E\left[\ln f_{X}(X;\theta)\right] > 0$$
which can also be written as
$$E\left[\ln \frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}\right] < 0$$
(note that everything we have done so far is legitimate because we have assumed that the log-likelihoods are integrable). Thus, proving our claim is equivalent to demonstrating that this last inequality holds. In order to do this, we need to use Jensen's inequality. Since the logarithm is a strictly concave function and, by our assumptions, the ratio
$$\frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}$$
is not almost surely constant, by Jensen's inequality we have
$$E\left[\ln \frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}\right] < \ln E\left[\frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}\right]$$
But
$$E\left[\frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}\right] = \int \frac{f_{X}(x;\theta)}{f_{X}(x;\theta_{0})}\, f_{X}(x;\theta_{0})\, dx = \int f_{X}(x;\theta)\, dx = 1$$
Therefore,
$$E\left[\ln \frac{f_{X}(X;\theta)}{f_{X}(X;\theta_{0})}\right] < \ln 1 = 0$$
which is exactly what we needed to prove.

This inequality, called information inequality by many authors, is essential for proving the consistency of the maximum likelihood estimator.
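The inequality can be checked numerically. The sketch below (not from the lecture) assumes a normal model $X\sim N(\theta_{0},1)$ and approximates $E\left[\ln f_{X}(X;\theta)\right]$ by a Monte Carlo average; the grid of parameter values, seed, and sample size are illustrative choices.

```python
# Illustrative Monte Carlo check of the information inequality for X ~ N(theta_0, 1):
# the average of ln f(X; theta) over a large sample approximates E[ln f(X; theta)],
# which should be largest at theta = theta_0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_0 = 1.5
x = rng.normal(loc=theta_0, scale=1.0, size=100_000)

for theta in [0.5, 1.0, 1.5, 2.0, 2.5]:
    expected_loglik = norm.logpdf(x, loc=theta, scale=1.0).mean()
    print(f"theta = {theta:3.1f}   E[ln f(X; theta)] ~ {expected_loglik:.4f}")
# The printed values peak at theta = 1.5 (= theta_0), as the inequality predicts.
```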

Consistency

Given the assumptions above, the maximum likelihood estimator $\widehat{\theta}_{n}$ is a consistent estimator of the true parameter $\theta_{0}$:
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n} = \theta_{0}$$
where $\operatorname{plim}$ denotes a limit in probability.

Proof

We have assumed that the density functions $f_{X}(x;\theta)$ and the parameter space $\Theta$ are such that
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n} = \operatorname*{plim}_{n\rightarrow\infty}\left(\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)\right) = \arg\max_{\theta\in\Theta}\left(\operatorname*{plim}_{n\rightarrow\infty}\frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)\right)$$
But
$$\operatorname*{plim}_{n\rightarrow\infty}\frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta) = E\left[\ln f_{X}(X;\theta)\right]$$
The last equality is true because, by Kolmogorov's Strong Law of Large Numbers (we have an IID sequence with finite mean), the sample average $\frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta)$ converges almost surely to $E\left[\ln f_{X}(X;\theta)\right]$ and, therefore, it also converges in probability (almost sure convergence implies convergence in probability). Thus, putting things together, we obtain
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n} = \arg\max_{\theta\in\Theta} E\left[\ln f_{X}(X;\theta)\right]$$
In the proof of the information inequality (see above), we have seen that
$$\theta \neq \theta_{0} \;\Longrightarrow\; E\left[\ln f_{X}(X;\theta_{0})\right] > E\left[\ln f_{X}(X;\theta)\right]$$
which, obviously, implies
$$\theta_{0} = \arg\max_{\theta\in\Theta} E\left[\ln f_{X}(X;\theta)\right]$$
Thus,
$$\operatorname*{plim}_{n\rightarrow\infty}\widehat{\theta}_{n} = \theta_{0}$$
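Consistency can also be visualized by simulation. The following sketch (an illustration added here, not part of the lecture) assumes an exponential model with true rate $\lambda_{0}=2$, whose MLE is the reciprocal of the sample mean, and shows that the estimates concentrate around $\lambda_{0}$ as the sample size grows.

```python
# Illustrative consistency check: exponential model with rate lambda_0 = 2,
# MLE = 1 / sample mean. The spread of the estimates shrinks as n grows.
import numpy as np

rng = np.random.default_rng(2)
lambda_0 = 2.0

for n in [10, 100, 1_000, 10_000]:
    estimates = np.array([1 / rng.exponential(scale=1 / lambda_0, size=n).mean()
                          for _ in range(2_000)])
    print(f"n = {n:>6}   mean = {estimates.mean():.4f}   std = {estimates.std():.4f}")
# The mean of the estimates approaches 2 and their standard deviation shrinks
# roughly like 1/sqrt(n), in line with consistency.
```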

Score vector

Denote by $\nabla_{\theta}l(\theta;\xi_{n})$ the gradient of the log-likelihood, that is, the vector of first derivatives of the log-likelihood, evaluated at the point $\theta$. This vector is often called the score vector.

Given the assumptions above, the score has zero expected value when it is evaluated at the true parameter:
$$E\left[\nabla_{\theta}l(\theta_{0};\xi_{n})\right] = 0$$

Proof

First of all, note that
$$\int f_{X}(x;\theta_{0})\, dx = 1$$
because probability density functions integrate to 1. Now, taking the first derivative of both sides with respect to any component $\theta_{0,j}$ of $\theta_{0}$ and bringing the derivative inside the integral, we get
$$\int \frac{\partial f_{X}(x;\theta_{0})}{\partial\theta_{0,j}}\, dx = 0$$
Now, multiply and divide the integrand function by $f_{X}(x;\theta_{0})$:
$$\int \frac{1}{f_{X}(x;\theta_{0})}\frac{\partial f_{X}(x;\theta_{0})}{\partial\theta_{0,j}}\, f_{X}(x;\theta_{0})\, dx = 0$$
Since
$$\frac{\partial \ln f_{X}(x;\theta_{0})}{\partial\theta_{0,j}} = \frac{1}{f_{X}(x;\theta_{0})}\frac{\partial f_{X}(x;\theta_{0})}{\partial\theta_{0,j}}$$
we can write
$$\int \frac{\partial \ln f_{X}(x;\theta_{0})}{\partial\theta_{0,j}}\, f_{X}(x;\theta_{0})\, dx = 0$$
or, using the definition of expected value,
$$E\left[\frac{\partial \ln f_{X}(X;\theta_{0})}{\partial\theta_{0,j}}\right] = 0$$
which can be written in vector form using the gradient notation as
$$E\left[\nabla_{\theta}\ln f_{X}(X;\theta_{0})\right] = 0$$
This result can be used to derive the expected value of the score as follows:
$$E\left[\nabla_{\theta}l(\theta_{0};\xi_{n})\right] = E\left[\sum_{i=1}^{n}\nabla_{\theta}\ln f_{X}(X_{i};\theta_{0})\right] = \sum_{i=1}^{n}E\left[\nabla_{\theta}\ln f_{X}(X_{i};\theta_{0})\right] = 0$$
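The zero-mean property of the score is easy to verify by simulation. The sketch below (not from the lecture) assumes the model $X\sim N(\theta,1)$, for which the per-observation score is simply $X-\theta$; the sample size and parameter values are illustrative.

```python
# Illustrative check that the per-observation score has zero mean at theta_0:
# for X ~ N(theta, 1), d/dtheta ln f(X; theta) = X - theta.
import numpy as np

rng = np.random.default_rng(3)
theta_0 = 1.5
x = rng.normal(loc=theta_0, scale=1.0, size=1_000_000)

print("mean score at theta_0    :", np.mean(x - theta_0))  # close to 0
print("mean score at theta = 2.0:", np.mean(x - 2.0))       # clearly nonzero
```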

Information matrix

Given the assumptions above, the covariance matrix of the score (called information matrix or Fisher information matrix) is
$$\operatorname{Var}\left[\nabla_{\theta}l(\theta_{0};\xi_{n})\right] = -E\left[\nabla_{\theta\theta}^{2}l(\theta_{0};\xi_{n})\right]$$
where $\nabla_{\theta\theta}^{2}l(\theta;\xi_{n})$ is the Hessian of the log-likelihood, that is, the matrix of second derivatives of the log-likelihood, evaluated at the point $\theta$.

Proof

From the previous proof, we know that
$$\int \frac{\partial \ln f_{X}(x;\theta_{0})}{\partial\theta_{0,j}}\, f_{X}(x;\theta_{0})\, dx = 0$$
Now, taking the first derivative of both sides with respect to any component $\theta_{0,k}$ of $\theta_{0}$, we obtain
$$\int \frac{\partial^{2}\ln f_{X}(x;\theta_{0})}{\partial\theta_{0,j}\,\partial\theta_{0,k}}\, f_{X}(x;\theta_{0})\, dx + \int \frac{\partial \ln f_{X}(x;\theta_{0})}{\partial\theta_{0,j}}\,\frac{\partial f_{X}(x;\theta_{0})}{\partial\theta_{0,k}}\, dx = 0$$
Rearranging, and using again the fact that the derivative of the logarithm of the density equals the derivative of the density divided by the density itself, we get
$$E\left[\frac{\partial \ln f_{X}(X;\theta_{0})}{\partial\theta_{0,j}}\,\frac{\partial \ln f_{X}(X;\theta_{0})}{\partial\theta_{0,k}}\right] = -E\left[\frac{\partial^{2}\ln f_{X}(X;\theta_{0})}{\partial\theta_{0,j}\,\partial\theta_{0,k}}\right]$$
Since this is true for any $j$ and any $k$, we can express it in matrix form as
$$E\left[\nabla_{\theta}\ln f_{X}(X;\theta_{0})\,\nabla_{\theta}\ln f_{X}(X;\theta_{0})^{\top}\right] = -E\left[\nabla_{\theta\theta}^{2}\ln f_{X}(X;\theta_{0})\right]$$
where the left-hand side is the covariance matrix of the gradient (the gradient has zero mean, as shown in the previous proof, so the expected value of its outer product is its covariance matrix). This result is equivalent to the result we need to prove because, by independence,
$$\operatorname{Var}\left[\nabla_{\theta}l(\theta_{0};\xi_{n})\right] = \sum_{i=1}^{n}\operatorname{Var}\left[\nabla_{\theta}\ln f_{X}(X_{i};\theta_{0})\right] = -\sum_{i=1}^{n}E\left[\nabla_{\theta\theta}^{2}\ln f_{X}(X_{i};\theta_{0})\right] = -E\left[\nabla_{\theta\theta}^{2}l(\theta_{0};\xi_{n})\right]$$

The latter equality is often called information equality.
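The information equality can also be checked numerically for a specific model. The sketch below (an added illustration, not part of the lecture) assumes an exponential model with rate $\lambda_{0}$, for which the per-observation score is $1/\lambda - X$ and the per-observation Hessian is $-1/\lambda^{2}$, so that both sides of the equality equal $1/\lambda_{0}^{2}$.

```python
# Illustrative check of the information equality, per observation, for an
# exponential model: Var[score] and -E[Hessian] should both equal 1/lambda_0**2.
import numpy as np

rng = np.random.default_rng(4)
lambda_0 = 2.0
x = rng.exponential(scale=1 / lambda_0, size=1_000_000)

score = 1 / lambda_0 - x                     # per-observation scores at lambda_0
hessian = np.full_like(x, -1 / lambda_0**2)  # per-observation Hessians at lambda_0

print("Var[score]  :", score.var())      # ~ 0.25
print("-E[Hessian] :", -hessian.mean())  # ~ 0.25
```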

Asymptotic normality

The maximum likelihood estimator is asymptotically normal:
$$\sqrt{n}\left(\widehat{\theta}_{n} - \theta_{0}\right) \overset{d}{\longrightarrow} N\left(0, V\right), \qquad V = \left(E\left[\nabla_{\theta}\ln f_{X}(X;\theta_{0})\,\nabla_{\theta}\ln f_{X}(X;\theta_{0})^{\top}\right]\right)^{-1}$$
In other words, the distribution of the maximum likelihood estimator $\widehat{\theta}_{n}$ can be approximated by a multivariate normal distribution with mean $\theta_{0}$ and covariance matrix
$$\frac{1}{n}V$$

Proof

Denote by
$$\nabla_{\theta}l(\theta;\xi_{n})$$
the gradient of the log-likelihood, i.e., the vector of first derivatives of the log-likelihood. Denote by
$$\nabla_{\theta\theta}^{2}l(\theta;\xi_{n})$$
the Hessian of the log-likelihood, i.e., the matrix of second derivatives of the log-likelihood. Since the maximum likelihood estimator $\widehat{\theta}_{n}$ maximizes the log-likelihood, it satisfies the first order condition
$$\nabla_{\theta}l(\widehat{\theta}_{n};\xi_{n}) = 0$$
Furthermore, by the Mean Value Theorem, we have
$$\nabla_{\theta}l(\widehat{\theta}_{n};\xi_{n}) = \nabla_{\theta}l(\theta_{0};\xi_{n}) + \nabla_{\theta\theta}^{2}l(\bar{\theta};\xi_{n})\left(\widehat{\theta}_{n} - \theta_{0}\right)$$
where, for each $j=1,\ldots,K$, the intermediate points $\bar{\theta}_{j}$ satisfy
$$\left\Vert \bar{\theta}_{j} - \theta_{0}\right\Vert \leq \left\Vert \widehat{\theta}_{n} - \theta_{0}\right\Vert$$
and the notation $\nabla_{\theta\theta}^{2}l(\bar{\theta};\xi_{n})$ indicates that each row of the Hessian is evaluated at a different point (row $j$ is evaluated at the point $\bar{\theta}_{j}$). Substituting the first order condition in the mean value equation, we obtain
$$0 = \nabla_{\theta}l(\theta_{0};\xi_{n}) + \nabla_{\theta\theta}^{2}l(\bar{\theta};\xi_{n})\left(\widehat{\theta}_{n} - \theta_{0}\right)$$
which, by solving for $\widehat{\theta}_{n} - \theta_{0}$, becomes
$$\widehat{\theta}_{n} - \theta_{0} = -\left[\nabla_{\theta\theta}^{2}l(\bar{\theta};\xi_{n})\right]^{-1}\nabla_{\theta}l(\theta_{0};\xi_{n})$$
which can be rewritten as
$$\sqrt{n}\left(\widehat{\theta}_{n} - \theta_{0}\right) = \left[-\frac{1}{n}\nabla_{\theta\theta}^{2}l(\bar{\theta};\xi_{n})\right]^{-1}\left[\frac{1}{\sqrt{n}}\nabla_{\theta}l(\theta_{0};\xi_{n})\right]$$
We will show that the term in the first pair of square brackets converges in probability to a constant, invertible matrix and that the term in the second pair of square brackets converges in distribution to a normal distribution. The consequence will be that their product also converges in distribution to a normal distribution (by Slutsky's theorem).

As far as the first term is concerned, note that the intermediate points $\bar{\theta}_{j}$ converge in probability to $\theta_{0}$ (they are squeezed between $\theta_{0}$ and the consistent estimator $\widehat{\theta}_{n}$):
$$\operatorname*{plim}_{n\rightarrow\infty}\bar{\theta}_{j} = \theta_{0}$$
Therefore, skipping some technical details, we get
$$\operatorname*{plim}_{n\rightarrow\infty}\left[-\frac{1}{n}\nabla_{\theta\theta}^{2}l(\bar{\theta};\xi_{n})\right] = -E\left[\nabla_{\theta\theta}^{2}\ln f_{X}(X;\theta_{0})\right]$$
As far as the second term is concerned, note that $\frac{1}{\sqrt{n}}\nabla_{\theta}l(\theta_{0};\xi_{n})$ is $\sqrt{n}$ times the sample average of the IID, zero-mean terms $\nabla_{\theta}\ln f_{X}(X_{i};\theta_{0})$, so that, by the Central Limit Theorem,
$$\frac{1}{\sqrt{n}}\nabla_{\theta}l(\theta_{0};\xi_{n}) \overset{d}{\longrightarrow} N\left(0,\, E\left[\nabla_{\theta}\ln f_{X}(X;\theta_{0})\,\nabla_{\theta}\ln f_{X}(X;\theta_{0})^{\top}\right]\right)$$
By putting things together and using the Continuous Mapping Theorem and Slutsky's theorem (see also the exercises in the lecture on Slutsky's theorem), we obtain
$$\sqrt{n}\left(\widehat{\theta}_{n} - \theta_{0}\right) \overset{d}{\longrightarrow} N\left(0,\; \left(-E\left[\nabla_{\theta\theta}^{2}\ln f_{X}(X;\theta_{0})\right]\right)^{-1} E\left[\nabla_{\theta}\ln f_{X}(X;\theta_{0})\,\nabla_{\theta}\ln f_{X}(X;\theta_{0})^{\top}\right] \left(-E\left[\nabla_{\theta\theta}^{2}\ln f_{X}(X;\theta_{0})\right]\right)^{-1}\right)$$

By the information equality (see its proof), the expected value of the outer product of the score of a single observation coincides with the negative of the expected value of its Hessian. The asymptotic covariance matrix therefore simplifies to the inverse of the negative of the expected value of the Hessian matrix:
$$V = \left(-E\left[\nabla_{\theta\theta}^{2}\ln f_{X}(X;\theta_{0})\right]\right)^{-1} = \left(E\left[\nabla_{\theta}\ln f_{X}(X;\theta_{0})\,\nabla_{\theta}\ln f_{X}(X;\theta_{0})^{\top}\right]\right)^{-1}$$
which is the expression given in the statement above.
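A Monte Carlo experiment can illustrate the normal approximation. The sketch below (not from the lecture) assumes an exponential model with rate $\lambda_{0}=2$; its Fisher information per observation is $1/\lambda^{2}$, so $\sqrt{n}(\widehat{\lambda}_{n}-\lambda_{0})$ should be approximately $N(0,\lambda_{0}^{2})$. Sample size, number of replications, and seed are illustrative.

```python
# Illustrative check of asymptotic normality for the exponential model:
# sqrt(n) * (MLE - lambda_0) should have standard deviation close to lambda_0.
import numpy as np

rng = np.random.default_rng(5)
lambda_0, n, reps = 2.0, 5_000, 10_000

mles = np.array([1 / rng.exponential(scale=1 / lambda_0, size=n).mean()
                 for _ in range(reps)])
z = np.sqrt(n) * (mles - lambda_0)

print("sample std of sqrt(n)*(MLE - lambda_0)  :", z.std())  # ~ 2
print("asymptotic std (inverse information)^0.5:", np.sqrt(lambda_0**2))
```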

Different assumptions

As previously mentioned, some of the assumptions made above are quite restrictive, while others are very generic. We now discuss how the former can be weakened and how the latter can be made more specific.

Assumption 1 (IID). It is possible to relax the assumption that $\{X_{j}\}$ is IID and allow for some dependence among the terms of the sequence (see, e.g., Bierens - 2004 for a discussion). In case dependence is present, the formula for the asymptotic covariance matrix of the MLE given above is no longer valid and needs to be replaced by a formula that takes serial correlation into account.

Assumption 2 (continuous variables). It is possible to prove consistency and asymptotic normality also when the terms of the sequence $\{X_{j}\}$ are extracted from a discrete distribution, or from a distribution that is neither discrete nor continuous (see, e.g., Newey and McFadden - 1994).

Assumption 3 (identification). Typically, different identification conditions are needed when the IID assumption is relaxed (e.g., Bierens - 2004).

Assumption 5 (maximum). To ensure the existence of a maximum, requirements are typically imposed both on the parameter space and on the log-likelihood function. For example, it can be required that the parameter space be compact (closed and bounded) and the log-likelihood function be continuous. Also, the parameter space can be required to be convex and the log-likelihood function strictly concave (e.g.: Newey and McFadden - 1994).

Assumption 6 (exchangeability of limit). To ensure the exchangeability of the limit and the $\arg\max$ operator, the following uniform convergence condition is often imposed:
$$\sup_{\theta\in\Theta}\left\vert \frac{1}{n}\sum_{j=1}^{n}\ln f_{X}(x_{j};\theta) - E\left[\ln f_{X}(X;\theta)\right]\right\vert \overset{p}{\longrightarrow} 0$$
A numerical illustration of this condition is sketched after this list of assumptions.

Assumption 8 (other technical conditions). See, for example, Newey and McFadden (1994) for a discussion of these technical conditions.
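As mentioned under Assumption 6, the uniform convergence condition can be illustrated numerically. The sketch below (an added illustration, not part of the lecture) assumes the model $X\sim N(\theta,1)$ with $\theta_{0}=1.5$ and approximates the supremum over a finite grid of parameter values; for this model $E\left[\ln f_{X}(X;\theta)\right] = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\left(1+(\theta_{0}-\theta)^{2}\right)$.

```python
# Illustrative check of uniform convergence of the average log-likelihood to its
# expectation for X ~ N(theta_0, 1), with the sup approximated over a grid.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
theta_0, grid = 1.5, np.linspace(-2, 5, 200)

for n in [100, 1_000, 10_000, 100_000]:
    x = rng.normal(loc=theta_0, scale=1.0, size=n)
    sample_avg = np.array([norm.logpdf(x, loc=t).mean() for t in grid])
    expectation = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + (theta_0 - grid) ** 2)
    print(f"n = {n:>7}   sup |difference| = {np.max(np.abs(sample_avg - expectation)):.5f}")
# The supremum over the grid shrinks toward zero as n grows.
```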

Numerical optimization

In some cases, the maximum likelihood problem has an analytical solution. That is, it is possible to write the maximum likelihood estimator $widehat{	heta }$ explicitly as a function of the data.

However, in many cases there is no explicit solution. In these cases, numerical optimization algorithms are used to maximize the log-likelihood. The lecture entitled Maximum likelihood - Algorithm discusses these algorithms.
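As an illustration of the numerical approach, the sketch below (not taken from the lectures linked above) fits a gamma model, whose shape parameter has no closed-form MLE, by minimizing the negative log-likelihood with SciPy; the model, starting values, and optimizer are illustrative choices.

```python
# Illustrative numerical MLE: gamma model with unknown shape and rate, fitted by
# minimizing the negative log-likelihood, since no closed-form solution exists.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(6)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=2_000)  # true shape 3, rate 2

def neg_log_likelihood(params):
    shape, rate = params
    return -np.sum(gamma.logpdf(x, a=shape, scale=1 / rate))

res = minimize(neg_log_likelihood, x0=np.array([1.0, 1.0]),
               method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
print("estimated (shape, rate):", res.x)  # close to (3, 2)
```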

Examples

The following lectures provide detailed examples of how to derive analytically the maximum likelihood (ML) estimators and their asymptotic variance:

The following lectures provide examples of how to perform maximum likelihood estimation numerically:

More details

The following sections contain more details about the theory of maximum likelihood estimation.

Estimation of the asymptotic covariance matrix

Methods to estimate the asymptotic covariance matrix of maximum likelihood estimators, including OPG, Hessian and Sandwich estimators, are discussed in the lecture entitled Maximum likelihood - Covariance matrix estimation.
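As a preview of those methods, the sketch below (an added illustration, not the lecture's code) computes the three estimators for the simple exponential model used earlier, where the per-observation score and Hessian at the MLE have closed forms.

```python
# Illustrative OPG, Hessian and sandwich estimators of the variance of the MLE
# for an exponential model (score = 1/lambda - x, Hessian = -1/lambda**2).
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / 2.0, size=5_000)
lam_hat = 1 / x.mean()                      # MLE of the rate

g = 1 / lam_hat - x                         # per-observation scores at the MLE
h = np.full_like(x, -1 / lam_hat**2)        # per-observation Hessians at the MLE

opg = 1 / np.sum(g * g)                     # (sum of squared scores)^(-1)
hess = 1 / (-np.sum(h))                     # (-sum of Hessians)^(-1)
sandwich = np.sum(g * g) / np.sum(h) ** 2   # H^(-1) * sum(g g') * H^(-1)

print("OPG      :", opg)
print("Hessian  :", hess)
print("Sandwich :", sandwich)
# All three are close to the asymptotic variance lam_hat**2 / n.
```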

Hypothesis testing

Tests of hypotheses on parameters estimated by maximum likelihood are discussed in the lecture entitled Maximum likelihood - Hypothesis testing, as well as in the lectures on the three classical tests:

  1. Wald test;

  2. score test;

  3. likelihood ratio test.
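To fix ideas, here is a minimal sketch of the third test in the list above (not taken from those lectures): a likelihood ratio test of a simple null hypothesis in an exponential model, with the statistic compared to a chi-square distribution with one degree of freedom. The model, null value, and sample size are illustrative.

```python
# Illustrative likelihood ratio test for an exponential model, H0: lambda = 2.
# The statistic 2*(l(lambda_hat) - l(lambda_0)) is asymptotically chi-square(1).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
x = rng.exponential(scale=1 / 2.0, size=1_000)  # data generated under the null

def log_likelihood(lam):
    return x.size * np.log(lam) - lam * x.sum()

lam_hat, lam_null = 1 / x.mean(), 2.0
lr_stat = 2 * (log_likelihood(lam_hat) - log_likelihood(lam_null))
p_value = chi2.sf(lr_stat, df=1)

print("LR statistic:", lr_stat)
print("p-value     :", p_value)  # a large p-value means the null is not rejected
```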

References

Bierens, H. J. (2004) Introduction to the mathematical and statistical foundations of econometrics, Cambridge University Press.

Newey, W. K. and D. McFadden (1994) "Chapter 36: Large sample estimation and hypothesis testing", in Handbook of Econometrics, Elsevier.

Ruud, P. A. (2000) An introduction to classical econometric theory, Oxford University Press.

How to cite

Please cite as:

Taboga, Marco (2021). "Maximum likelihood estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/maximum-likelihood.
