
Hierarchical Bayesian models

A hierarchical Bayesian model is a model in which the prior distribution of some of the model parameters depends on other parameters, which are also assigned a prior.


Definition

Given the observed data $x$, in a hierarchical Bayesian model the likelihood depends on two parameter vectors $\theta$ and $\varphi$, $$p(x\mid\theta,\varphi),$$ and the prior $$p(\theta,\varphi)=p(\theta\mid\varphi)\,p(\varphi)$$ is specified by separately specifying the conditional distribution $p(\theta\mid\varphi)$ and the marginal distribution $p(\varphi)$.

In the literature it is often required that the likelihood does not depend on $\varphi$, that is, $$p(x\mid\theta,\varphi)=p(x\mid\theta). \qquad (1)$$

In this special case, the parameter $\varphi$ is called a hyper-parameter and the prior $p(\varphi)$ is called a hyper-prior.

We use a broader definition of hierarchical model, one that does not necessarily include assumption (1), because it allows for a unified treatment of several interesting models.

Examples

The following examples illustrate two popular models that fall within our definition.

Example 1 - Random means

Suppose the sample $x=(x_{1},\ldots,x_{n})$ is a vector of draws $x_{1},\ldots,x_{n}$ from $n$ normal distributions having different unknown means $\mu_{i}$ and a known common variance $\sigma^{2}$: $$x_{i}\mid\mu_{i}\sim N(\mu_{i},\sigma^{2}),\quad i=1,\ldots,n.$$

Denote by $\mu$ the vector of means: $$\mu=(\mu_{1},\ldots,\mu_{n}).$$

Conditional on $\mu$, the observations are assumed to be independent. As a consequence, the likelihood of the whole sample, conditional on $\mu$, can be written as $$p(x\mid\mu)=\prod_{i=1}^{n}p(x_{i}\mid\mu_{i}).$$

Now, assume the means $\mu_{i}$ are a sample of IID draws from a normal distribution with unknown mean $m$ and known variance $\tau^{2}$, so that $$\mu_{i}\mid m\sim N(m,\tau^{2}),\quad i=1,\ldots,n.$$

Finally, we assign a normal prior (with known mean $m_{0}$ and variance $u^{2}$) to the hyper-parameter $m$: $$m\sim N(m_{0},u^{2}).$$

The model just described is a hierarchical model. With the notation used in the definition, we have $\theta=\mu$, $\varphi=m$, and the added assumption that $$p(x\mid\mu,m)=p(x\mid\mu).$$
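The generative structure of this model can be sketched by sampling top-down through the hierarchy. The following Python snippet is a minimal illustration; the numerical values of $n$, $\sigma$, $\tau$, $m_0$ and $u$ are hypothetical choices, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical known quantities for illustration only
n = 5              # number of observations
sigma = 1.0        # known common standard deviation of the data
tau = 2.0          # known standard deviation of the means mu_i
m0, u = 0.0, 3.0   # mean and standard deviation of the hyper-prior on m

# Sample top-down through the hierarchy:
m = rng.normal(m0, u)             # hyper-parameter: m ~ N(m0, u^2)
mu = rng.normal(m, tau, size=n)   # means: mu_i | m ~ N(m, tau^2), IID
x = rng.normal(mu, sigma)         # data: x_i | mu_i ~ N(mu_i, sigma^2)

print(m, mu, x)
```

Each level conditions only on the level immediately above it, which is exactly what the factorization of the prior into $p(\mu\mid m)$ and $p(m)$ expresses.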

Example 2 - Normal mean and Gamma precision

Suppose that the sample $x=(x_{1},\ldots,x_{n})$ is a vector of IID draws $x_{1},\ldots,x_{n}$ from a normal distribution having unknown mean $\mu$ and unknown variance $\sigma^{2}$.

The likelihood of the whole sample, conditional on $\mu$ and $\sigma^{2}$, is $$p(x\mid\mu,\sigma^{2})=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}\right).$$

Now, assume that the mean $\mu$ is itself normal with known mean $m$ and variance $\sigma^{2}/\nu$, where $\nu$ is a known parameter: $$\mu\mid\sigma^{2}\sim N(m,\sigma^{2}/\nu).$$

Finally, we assign an inverse-Gamma prior to the parameter $\sigma^{2}$ (i.e., a Gamma distribution to the precision $1/\sigma^{2}$): $$1/\sigma^{2}\sim\text{Gamma}(k,h),$$ where $k$ and $h$ are the two parameters of the Gamma distribution.

This is a very popular model, known as the normal-inverse-Gamma model.

It fits the above definition of a hierarchical model with $\theta=\mu$ and $\varphi=\sigma^{2}$.
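As with the previous example, the model can be sketched generatively: first draw the precision, then the mean conditional on the variance, then the data. The parameter values below are hypothetical, and $k$ is treated as a shape parameter and $h$ as a rate parameter (adjust the scale if your Gamma parameterization differs).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical known quantities for illustration only
n = 10
m, nu = 0.0, 2.0   # known prior mean and precision-scaling parameter
k, h = 3.0, 2.0    # Gamma parameters (k = shape, h = rate, by assumption)

precision = rng.gamma(shape=k, scale=1.0 / h)  # 1/sigma^2 ~ Gamma(k, h)
sigma2 = 1.0 / precision                       # sigma^2 ~ inverse-Gamma
mu = rng.normal(m, np.sqrt(sigma2 / nu))       # mu | sigma^2 ~ N(m, sigma^2/nu)
x = rng.normal(mu, np.sqrt(sigma2), size=n)    # IID observations

print(sigma2, mu, x)
```

Note that here the likelihood depends on both $\mu$ and $\sigma^{2}$, so assumption (1) does not hold; this is why the broader definition of a hierarchical model is convenient.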

Computations

The computation of the posterior distribution is usually performed in steps: first, $\varphi$ is taken as given and a conditional distribution for $\theta$ is derived; then, a posterior for $\varphi$ is computed.

The steps are as follows.

  1. Conditional on $\varphi$ (i.e., by keeping it fixed), compute:

    1. the prior predictive distribution of $x$: $$p(x\mid\varphi)=\int p(x\mid\theta,\varphi)\,p(\theta\mid\varphi)\,d\theta;$$

    2. the posterior distribution of $\theta$: $$p(\theta\mid x,\varphi)=\frac{p(x\mid\theta,\varphi)\,p(\theta\mid\varphi)}{p(x\mid\varphi)}.$$

  2. By using $p(x\mid\varphi)$ from step 1, compute:

    1. the prior predictive distribution of $x$: $$p(x)=\int p(x\mid\varphi)\,p(\varphi)\,d\varphi;$$

    2. the posterior marginal distribution of $\varphi$: $$p(\varphi\mid x)=\frac{p(x\mid\varphi)\,p(\varphi)}{p(x)}.$$

  3. Compute the posterior joint distribution of $\varphi$ and $\theta$: $$p(\theta,\varphi\mid x)=p(\theta\mid x,\varphi)\,p(\varphi\mid x).$$

  4. Compute the posterior marginal distribution of $\theta$: $$p(\theta\mid x)=\int p(\theta,\varphi\mid x)\,d\varphi.$$
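The four steps above can be carried out numerically on a discrete grid. The sketch below uses a toy normal hierarchy with hypothetical values ($x\mid\theta\sim N(\theta,1)$, $\theta\mid\varphi\sim N(\varphi,1)$, $\varphi\sim N(0,2^{2})$, with the likelihood not depending on $\varphi$), replacing each integral with a Riemann sum.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Toy model (hypothetical values): x | theta ~ N(theta, 1),
# theta | phi ~ N(phi, 1), phi ~ N(0, 4); p(x | theta, phi) = p(x | theta).
x = 1.5
theta = np.linspace(-10, 10, 801)   # grid for theta
phi = np.linspace(-10, 10, 801)     # grid for phi
dt = theta[1] - theta[0]
dp = phi[1] - phi[0]

lik = normal_pdf(x, theta, 1.0)                          # p(x | theta)
prior_t = normal_pdf(theta[:, None], phi[None, :], 1.0)  # p(theta | phi)
prior_p = normal_pdf(phi, 0.0, 2.0)                      # p(phi)

# Step 1a: prior predictive given phi: p(x|phi) = ∫ p(x|theta) p(theta|phi) dtheta
px_given_phi = (lik[:, None] * prior_t).sum(axis=0) * dt
# Step 1b: conditional posterior p(theta | x, phi) via Bayes' rule
post_t_given_phi = lik[:, None] * prior_t / px_given_phi[None, :]
# Step 2a: prior predictive p(x) = ∫ p(x|phi) p(phi) dphi
px = (px_given_phi * prior_p).sum() * dp
# Step 2b: marginal posterior p(phi | x)
post_phi = px_given_phi * prior_p / px
# Step 3: joint posterior p(theta, phi | x)
post_joint = post_t_given_phi * post_phi[None, :]
# Step 4: marginal posterior p(theta | x) = ∫ p(theta, phi | x) dphi
post_theta = post_joint.sum(axis=1) * dp

print(post_theta.sum() * dt)              # should be close to 1
print((post_theta * theta).sum() * dt)    # posterior mean of theta
```

In this conjugate toy model the result can be checked analytically: marginally $\theta\sim N(0,5)$ and $x\mid\theta\sim N(\theta,1)$, so $p(\theta\mid x)$ is normal with mean $5x/6=1.25$, which the grid computation reproduces.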

When we are not able to carry out the integrations required to derive the predictive distributions, or when we cannot compute posteriors with Bayes' rule, then we can use other computational methods (e.g., the factorization method illustrated in the lecture on Bayesian inference). In these cases, the steps of the above procedure remain valid: we first derive posterior and predictive distributions given $\varphi$, by using whatever method is available to us; then, we use the conditional distributions thus derived to compute the posterior of $\varphi$.

More than two levels

In the definition above, there were only two levels: a parameter $\theta$ and a hyper-parameter $\varphi$.

The definition can be generalized to more than two levels. For example, we could have a third parameter $\zeta$, the likelihood $$p(x\mid\theta,\varphi,\zeta),$$ and the prior $$p(\theta,\varphi,\zeta)=p(\theta\mid\varphi,\zeta)\,p(\varphi\mid\zeta)\,p(\zeta),$$ which is specified by separately specifying the conditional distributions $p(\theta\mid\varphi,\zeta)$ and $p(\varphi\mid\zeta)$ and the marginal distribution $p(\zeta)$.

With more than two levels, the computation strategy is similar to that illustrated in the previous section. First, we take all parameters but one as given, and we derive the prior predictive distribution of x, conditional on the parameters that have been kept fixed. Then, we use the predictive distribution thus obtained as likelihood, and we use it to obtain another prior predictive distribution for x, conditional on a smaller number of parameters than in the previous step. And so on.
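A three-level hierarchy can be simulated in the same top-down fashion as the two-level examples. The chain of normal distributions below is a hypothetical illustration, not a model from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical three-level normal hierarchy:
# zeta ~ N(0, 1), phi | zeta ~ N(zeta, 1),
# theta | phi ~ N(phi, 1), x | theta ~ N(theta, 1)
zeta = rng.normal(0.0, 1.0)              # top-level parameter
phi = rng.normal(zeta, 1.0)              # middle level, given zeta
theta = rng.normal(phi, 1.0)             # bottom parameter, given phi
x = rng.normal(theta, 1.0, size=20)      # IID observations, given theta

print(zeta, phi, theta, x.mean())
```

The computation then marginalizes one level at a time: integrate out $\theta$ to get $p(x\mid\varphi,\zeta)$, then integrate out $\varphi$ to get $p(x\mid\zeta)$, and finally integrate out $\zeta$ to get $p(x)$.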
