Finite Mixture Models
Bayesian Statistics, Statistics 4224/5224, Spring 2021
March 16, 2021
The following notes summarize Section 22.1 (pages 519–524) of Bayesian Data Analysis, by Andrew Gelman et al.
We also include, as an example, Exercise 6.2 (page 237) from A First Course in Bayesian Statistical Methods, by Peter D. Hoff.
Mixture distributions arise in practical problems when the measurements of a random variable are taken under two or more different conditions.
For greater flexibility, we construct such distributions as mixtures of simpler forms.
For example, it is better to model male and female heights as separate univariate, perhaps normal, distributions than as a single bimodal distribution.
Setting up and interpreting mixture models
Finite mixtures
Suppose that it is considered desirable to model the distribution of y = (y1,…,yn) as a mixture of H components.
It is assumed that it is not known which mixture component underlies each particular observation.
For h = 1, . . . , H, the h-th component distribution, fh(yi|θh), is assumed to depend on a parameter vector θh; the parameter denoting the proportion of the population from component h is λh, with λ1 + · · · + λH = 1.
It is common to assume that the mixture components are all from the same parametric family, such as the normal, with different parameter vectors.
The sampling distribution (likelihood) of y in that case is p(yi|θ, λ) = λ1f(yi|θ1) + · · · + λHf(yi|θH) .
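As a concrete illustration of this likelihood, the following R sketch evaluates a two-component normal mixture density on a grid of y values; the weights, means, and standard deviations used here are arbitrary values chosen only for illustration, not estimates from any data.

# Sketch: evaluate p(y|theta, lambda) = lambda_1 f(y|theta_1) + ... + lambda_H f(y|theta_H)
# for normal components; all parameter values below are illustrative only.
lambda <- c(0.6, 0.4)    # mixture proportions, summing to 1
mu     <- c(100, 150)    # component means
sigma  <- c(15, 25)      # component standard deviations

mixture_density <- function(y, lambda, mu, sigma) {
  rowSums(sapply(seq_along(lambda),
                 function(h) lambda[h] * dnorm(y, mu[h], sigma[h])))
}

y_grid <- seq(50, 250, length.out = 200)
plot(y_grid, mixture_density(y_grid, lambda, mu, sigma),
     type = "l", xlab = "y", ylab = "density")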
For an alternative representation of this model, introduce the unobserved indicator variables
zih = 1 if unit i is drawn from mixture component h, and zih = 0 otherwise.
Given λ, the distribution of each vector zi = (zi1, . . . , ziH) is Multinomial(1; λ1, . . . , λH).
The joint distribution of the observed data y and unobserved indicators z, conditional on the model parameters, can be written
p(y, z|θ, λ) = p(z|λ) p(y|z, θ) = ∏_{i=1}^{n} ∏_{h=1}^{H} [λh f(yi|θh)]^{zih},
with exactly one of the zih equaling 1 for each i.
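The indicator representation also gives a direct way to simulate data from a finite mixture: draw each zi from Multinomial(1; λ), then draw yi from the indicated component. A minimal R sketch, with made-up normal components, is:

set.seed(1)
n      <- 500
lambda <- c(0.6, 0.4)     # illustrative mixture proportions
mu     <- c(100, 150)     # illustrative component means
sigma  <- c(15, 25)       # illustrative component standard deviations

z <- sample(1:2, size = n, replace = TRUE, prob = lambda)  # latent component labels
y <- rnorm(n, mean = mu[z], sd = sigma[z])                 # y_i | z_i = h ~ Normal(mu_h, sigma_h^2)
hist(y, breaks = 30, main = "", xlab = "y")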
At this point H, the number of mixture components, is assumed to be known and fixed.
Continuous mixtures
The finite mixture is a special case of the more general model specification
p(yi|θ, λ) = ∫ f(yi|θ, z) λ(z) dz.
Identifiability of the mixture likelihood
Parameters in a model are not identified if the same likelihood function is obtained for more than one choice of the model parameters.
All finite mixture models are nonidentifiable in one sense; the distribution is unchanged if the group labels are permuted.
When possible, the parameter space should be defined to clear up any ambiguity, for example by specifying that the means of the mixture components are in nondecreasing order, or that the mixture proportions λh are nondecreasing.
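As one way to implement such a constraint, posterior draws can be relabeled after the fact so that the component means are in nondecreasing order within each draw. The R sketch below assumes hypothetical T x H matrices theta_draws and lambda_draws holding posterior draws of the component means and proportions.

# Sketch: relabel each posterior draw so the component means are nondecreasing.
relabel <- function(theta_draws, lambda_draws) {
  for (t in seq_len(nrow(theta_draws))) {
    ord <- order(theta_draws[t, ])          # ordering that sorts the means
    theta_draws[t, ]  <- theta_draws[t, ord]
    lambda_draws[t, ] <- lambda_draws[t, ord]
  }
  list(theta = theta_draws, lambda = lambda_draws)
}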
For many problems, an informative prior distribution has the effect of identifying specific components with specific subpopulations.
Prior distribution
The prior distribution for the finite mixture model parameters (θ, λ) is taken in most applications to be a product of independent priors on θ and λ.
If the vector of mixture indicators zi = (zi1, . . . , ziH ) is modeled as multinomial with parameter λ, then the natural conjugate prior distribution is the Dirichlet, λ ∼ Dirichlet(α1, . . . , αH ).
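Base R has no built-in Dirichlet sampler, but a draw from Dirichlet(α1, . . . , αH) can be obtained by normalizing independent Gamma(αh, 1) variates, as in this small sketch (the helper name rdirichlet1 is just a placeholder):

rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)  # independent Gamma(alpha_h, 1)
  g / sum(g)                                           # normalized draw lies on the simplex
}

alpha  <- c(1, 1, 1)        # all alpha_h = 1 gives a uniform prior over the simplex
lambda <- rdirichlet1(alpha)
sum(lambda)                 # equals 1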
We use θ to represent the vector consisting of all the parameters in the mixture components, θ = (θ1, . . . , θH ).
Some parameters may be common to all components and other parameters specific to a single component; for example, in a mixture of normals we might assume the variance is the same for each component but the means differ.
Number of mixture components
For finite mixture models there is often uncertainty concerning the number of mixture components H to include in the model.
It is often appropriate to begin with a small model and assess the adequacy of the fit.
Posterior predictive checks can be used to determine whether the current number of components describes the range of observed data.
An alternative approach is to view H as a hierarchical parameter that can attain the values 1, 2, 3, . . . , and average inferences about y over the posterior distribution of mixture models.
More general formulation
Finite mixture models arise by supposing that each of the n items in the sample belongs to one of H subpopulations, with each latent subpopulation, or latent class, having a different value of one or more parameters in a parametric model.
Let zi ∈ {1,…,H} denote the subpopulation index for item i, commonly referred to as the latent class structure.
Then, the response yi for item i conditionally on zi has the distribution,
yi|zi ∼ f(yi|θzi,φ)
with f(y|θ,φ) a parametric sampling distribution with parameters φ that do not vary across the subpopulations and parameters θh corresponding to subpopulation h, for h = 1, . . . , H.
Supposing that Pr(zi = h) = πh, the following likelihood is obtained by marginalizing out the latent class status:
g(yi|π, θ, φ) = ∑_{h=1}^{H} πh f(yi|θh, φ),
which corresponds to a finite mixture with H components, with component h assigned probability weight πh.
In general, g can approximate a much broader variety of true
likelihoods than can f.
To illustrate this, consider the simple example in which
f(yi|θh, φ) = dnorm(yi|μh, σ²),
so that in subpopulation h the data follow the normal likelihood,
yi|zi = h ∼ Normal(μh, σ²).
The normal assumption is restrictive in implying that the distribution is unimodal and symmetric, with a particular fixed shape.
However, by allowing the mean of the normal distribution to vary across subpopulations, we obtain a flexible mixture of normals model upon marginalizing out the latent subpopulation status:
p(y|π, θ, φ) = ∏_{i=1}^{n} ∑_{h=1}^{H} πh × dnorm(yi|μh, σ²).
Any density can be accurately approximated using a mixture of sufficiently many normals.
By allowing the variance in the normal kernel to also vary across components (placing an h subscript on σ²), one instead obtains a location-scale mixture of normals. Location-scale mixtures can often achieve better accuracy using fewer mixture components than location mixtures.
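To see the added flexibility, the sketch below (with arbitrary parameter values) plots a two-component location-scale mixture of normals together with a single normal matched to the mixture's mean and variance; the mixture is skewed and heavier in one tail even though both curves share the same first two moments.

pi_h <- c(0.7, 0.3)                                       # illustrative weights
mu_h <- c(0, 3)                                           # illustrative means
sd_h <- c(1, 2.5)                                         # illustrative standard deviations

m <- sum(pi_h * mu_h)                                     # mixture mean
v <- sum(pi_h * (sd_h^2 + mu_h^2)) - m^2                  # mixture variance

y   <- seq(-5, 12, length.out = 400)
mix <- pi_h[1] * dnorm(y, mu_h[1], sd_h[1]) +
       pi_h[2] * dnorm(y, mu_h[2], sd_h[2])

plot(y, mix, type = "l", ylab = "density")                # skewed, heavy right tail
lines(y, dnorm(y, m, sqrt(v)), lty = 2)                   # moment-matched single normal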
Mixtures as true models or approximating distributions
There are two schools of thought in considering finite mixture models.
The first viewpoint is that the incorporation of latent subpopulations is a realistic characterization of the true data-generating mechanism, and that such subpopulations really exist.
The second school of thought is that finite mixture models are useful in providing a highly flexible class of probability models that can be used to build more realistic hierarchical models that better account for one’s true uncertainty about parametric choices.
Posterior simulation using the Gibbs sampler
For mixture models, the Gibbs sampler alternates between two major steps: obtaining draws from the distribution of the indicators given the model parameters, and obtaining draws of the model parameters given the indicators.
Given the indicators, the mixture model reduces to an ordinary model (the use of conjugate families as prior distributions can be helpful).
Obtaining draws from the distribution of the indicators is usually straightforward: these are multinomial draws in finite mixture models.
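For a two-component normal mixture, the indicator step reduces to independent Bernoulli-type draws, one per observation. The following R sketch of that step uses hypothetical names (pi, theta, sigma) for the current state of the sampler.

# Sketch: draw each z_i from its full conditional, given the current parameters.
draw_indicators <- function(y, pi, theta, sigma) {
  p1    <- pi * dnorm(y, theta[1], sigma[1])        # unnormalized Pr(z_i = 1 | ...)
  p2    <- (1 - pi) * dnorm(y, theta[2], sigma[2])  # unnormalized Pr(z_i = 2 | ...)
  prob1 <- p1 / (p1 + p2)
  ifelse(runif(length(y)) < prob1, 1, 2)            # one draw per observation
}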
Posterior inference
When the Gibbs sampler has reached approximate convergence, posterior inferences about model parameters are obtained by ignoring the drawn indicators.
The fit of the model can be assessed by posterior predictive checks.
Example
Exercise 6.2 from A First Course in Bayesian Statistical Methods, by Peter D. Hoff.
See Courseworks → Files → Examples → Example22 R.
The problem
The data file glucose contains the plasma glucose concentration of 532 females from a study on diabetes.
1. Make a histogram or kernel density estimate of the data. Describe how this empirical distribution deviates from the shape of a normal distribution.
2. Consider the following mixture model for these data:
(a) For each study participant there is an unobserved group membership variable zi which is equal to 1 or 2 with probability π or 1 − π.
(b) If zi = 1 then yi ∼ Normal(θ1, σ1²), and if zi = 2 then yi ∼ Normal(θ2, σ2²).
(c) Let π ∼ Beta(a, b), θj ∼ Normal(μ0, τ0²), and σj² ∼ Inv-χ²(ν0, σ0²) for both j = 1 and j = 2.
Obtain the full conditional distributions of (z1, . . . , zn), π, θ1, θ2, σ1² and σ2².
3. Setting a = b = 1, μ0 = 120, τ0² = 200, σ0² = 1000, and ν0 = 10, implement the Gibbs sampler for at least 10,000 iterations. At each iteration t, relabel, if necessary, so that θ1^(t) < θ2^(t). Compute and plot the autocorrelation functions of θ1^(t) and θ2^(t), as well as their effective sample sizes.

4. For each iteration t of the Gibbs sampler, sample a value z^(t) ∼ Bernoulli(π^(t)) indicating the mixture component, then sample ỹ^(t) ∼ Normal(θ^(t)_{z}, (σ^(t)_{z})²). Plot a histogram or kernel density estimate of the empirical distribution of ỹ^(1), . . . , ỹ^(T), and compare it to the distribution in item 1 above. Discuss the adequacy of this two-component mixture model for the glucose data.

Solution

See Courseworks → Files → Examples → Example22 R for the R code.

An answer to item 2 is given here. Let

nj = ∑_{i=1}^{n} I(zi = j) and ȳj = (1/nj) ∑_{i=1}^{n} yi I(zi = j), for j = 1, 2.

The full conditionals for the parameters π, θ1, θ2, σ1² and σ2², given yi and zi for i = 1, 2, . . . , n, are

π|··· ∼ Beta(a + n1, b + n2),

θj|··· ∼ Normal( (μ0/τ0² + nj ȳj/σj²) / (1/τ0² + nj/σj²), 1 / (1/τ0² + nj/σj²) ),

and

σj²|··· ∼ Inv-χ²( ν0 + nj, [ν0 σ0² + ∑_{i=1}^{n} (yi − θj)² I(zi = j)] / (ν0 + nj) ),

for j = 1, 2.

Conditional on the parameters (π, θ1, θ2, σ1², σ2²) and the data (y1, . . . , yn), the latent variables z1, . . . , zn are independent with

Pr(zi = 1|···) = π × N(yi|θ1, σ1²) / [π × N(yi|θ1, σ1²) + (1 − π) × N(yi|θ2, σ2²)]

and Pr(zi = 2|···) = 1 − Pr(zi = 1|···), for i = 1, 2, . . . , n.

Solutions have been posted based on writing our own Gibbs sampler using the above results, as well as using Stan; a minimal R sketch of such a sampler is also included at the end of these notes. For your homework problem, I suggest using Stan. See Courseworks → Files → Examples → Example22 Stan. Note that the model fit by the Stan code is not exactly the same as that fit by the R code (see below).

Summary of results

Using R, we fit a location-scale mixture of two normals. Based on the posterior means, we estimate that the population of glucose measurements is a 62/38 mixture of a Normal(104, 18²) distribution and a Normal(149, 27²) distribution.

Using Stan, we fit a location-only mixture (separate means but equal variance) of H = 3 normal distributions. Based on the posterior means, we estimate that the glucose values are drawn from a population that is a 57/27/16 mix of a Normal(101, 16²), a Normal(134, 16²), and a Normal(173, 16²).

We can compare the empirical distributions (histograms of sample data) to estimated posterior predictive distributions in side-by-side plots, to visually assess the fit of the two mixture models considered.

[Figure: left panel, empirical distribution (histogram of the glucose data, n = 532); right panel, posterior predictive distribution for the location-scale mixture of two normal distributions (H = 2).]

[Figure: left panel, empirical distribution (histogram of the glucose data, n = 532); right panel, posterior predictive distribution for the location-only mixture of H = 3 normal distributions.]
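Finally, as promised above, here is a minimal R sketch of the Gibbs sampler for the two-component model, built directly from the full conditional distributions in the solution. It assumes the glucose measurements are in a numeric vector y; it is an illustration under those assumptions, not the posted Example22 R code.

gibbs_mixture <- function(y, T = 10000, a = 1, b = 1,
                          mu0 = 120, tau20 = 200, sigma20 = 1000, nu0 = 10) {
  n      <- length(y)
  pi_cur <- 0.5                                    # initial values
  theta  <- as.numeric(quantile(y, c(0.25, 0.75)))
  sigma2 <- rep(var(y), 2)
  out    <- matrix(NA, T, 5,
                   dimnames = list(NULL, c("pi", "theta1", "theta2", "sigma2_1", "sigma2_2")))

  for (t in 1:T) {
    # 1. indicators: Pr(z_i = 1 | ...) proportional to pi * N(y_i | theta_1, sigma2_1)
    p1 <- pi_cur * dnorm(y, theta[1], sqrt(sigma2[1]))
    p2 <- (1 - pi_cur) * dnorm(y, theta[2], sqrt(sigma2[2]))
    z  <- ifelse(runif(n) < p1 / (p1 + p2), 1, 2)
    nj <- c(sum(z == 1), sum(z == 2))

    # 2. pi | ... ~ Beta(a + n_1, b + n_2)
    pi_cur <- rbeta(1, a + nj[1], b + nj[2])

    # 3. theta_j and sigma2_j from their full conditionals, j = 1, 2
    for (j in 1:2) {
      yj   <- y[z == j]
      ybar <- if (nj[j] > 0) mean(yj) else 0
      prec <- 1 / tau20 + nj[j] / sigma2[j]
      mj   <- (mu0 / tau20 + nj[j] * ybar / sigma2[j]) / prec
      theta[j] <- rnorm(1, mj, sqrt(1 / prec))
      # scaled inverse chi-square full conditional, drawn as an inverse-gamma
      sigma2[j] <- 1 / rgamma(1, shape = (nu0 + nj[j]) / 2,
                              rate = (nu0 * sigma20 + sum((yj - theta[j])^2)) / 2)
    }

    # 4. relabel so that theta_1 < theta_2 at every iteration
    if (theta[1] > theta[2]) {
      theta  <- rev(theta)
      sigma2 <- rev(sigma2)
      pi_cur <- 1 - pi_cur
    }
    out[t, ] <- c(pi_cur, theta, sigma2)
  }
  out
}

Autocorrelations and effective sample sizes for the theta1 and theta2 columns of the output can then be examined with acf() and, for example, coda::effectiveSize(), and posterior predictive draws generated as described in item 4.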