A beta-binomial hierarchical model
Bayesian Statistics, Statistics 4224/5224, Spring 2021
February 4, 2021
Please first read Sections 5.1–5.3 of Gelman et al., and review the summary notes given in Ch05a Hierarchical1.pdf.
The model
The observed data y_1, …, y_J have a sampling distribution given by

y_j | θ ∼ indep Binomial(n_j, θ_j) .

As a prior on θ = (θ_1, …, θ_J) we assume that

θ_1, …, θ_J | α, β ∼ iid Beta(α, β) .
Model specification is completed by a (generally noninformative) prior on the hyperparameters, which we will take to be

p(α, β) ∝ (α + β)^{−5/2} ,

an improper prior.
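As a quick illustration (not part of the model specification), here is a minimal R sketch that simulates data from this hierarchy; the values of J, the n_j, α, and β are arbitrary placeholders chosen only for illustration:

J <- 10                                 # number of experiments
n <- rep(20, J)                         # trials per experiment (placeholder)
alpha <- 2; beta <- 8                   # placeholder hyperparameter values
theta <- rbeta(J, alpha, beta)          # experiment-level success probabilities
y <- rbinom(J, size = n, prob = theta)  # observed success counts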
Where might this model be applicable?
Suppose we have J related experiments, each consisting of n_j Bernoulli trials, and let θ_j denote the success probability in the jth experiment.

Let y_ij ∈ {0,1} denote the outcome of the ith trial in the jth experiment, for i = 1, …, n_j and j = 1, …, J.

Then it is reasonable to model y_ij | θ ∼ indep Bernoulli(θ_j).

Letting y_j = Σ_{i=1}^{n_j} y_ij, for j = 1, …, J, gives the above likelihood.
This model may be applicable, for example, if a study of a heart disease treatment is being conducted at J different testing centers across the country. Let y_ij = 1 indicate that the ith subject in the jth location had a favorable outcome.

It would generally not be reasonable to assume the success probabilities θ_j are equal across all J different studies. On the other hand, it would be foolish to treat the J locations as being unrelated to each other.

The hierarchical model provides the desired compromise between these two extremes, by viewing the location-specific success probabilities θ_j as being sampled from a 'superpopulation' with a conjugate Beta distribution.
What’s up with that hyperprior?
Other than the constraints α > 0 and β > 0, it is difficult to quantify what sort of prior information might be available.
Recall that μ = α/(α+β) and ψ = α+β represent the 'prior expected value' and 'prior sample size' for the Beta conjugate prior with a Binomial likelihood.
We can take independent uniform priors

μ ∼ Uniform(0, 1) and p(ψ^{−1/2}) ∝ 1 ,

which gives

p(μ, ψ) ∝ ψ^{−3/2} for 0 < μ < 1, 0 < ψ < ∞.
Back-transforming to α = μψ and β = (1 − μ)ψ gives, after rescaling by the Jacobian,

p(α, β) ∝ (α + β)^{−5/2} for α > 0, β > 0.
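In case the Jacobian step is not obvious, here is the calculation. The inverse map is μ = α/(α+β), ψ = α+β, so

∂μ/∂α = β/(α+β)^2 , ∂μ/∂β = −α/(α+β)^2 , ∂ψ/∂α = ∂ψ/∂β = 1 ,

and the Jacobian determinant is β/(α+β)^2 + α/(α+β)^2 = 1/(α+β). Hence

p(α, β) ∝ ψ^{−3/2} × (α+β)^{−1} = (α+β)^{−5/2} .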
How do we fit this model?

We construct a Monte Carlo approximation to the posterior distribution by simulating random draws (α^s, β^s, θ^s) from the joint posterior
p(α, β, θ|y) = p(α, β|y)p(θ|α, β, y)
by

sample (α^s, β^s) ∼ p(α, β|y) , then sample θ^s ∼ p(θ|α^s, β^s, y) ,

for s = 1, …, S.
The second step is easy, by the conjugacy of the Beta prior with the Binomial likelihood:

θ_j | α, β, y ∼ indep Beta(α + y_j, β + n_j − y_j) .
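In R, this step is a single vectorized call (a sketch; alpha.s and beta.s stand for one draw (α^s, β^s) from the first step, and y and n are the data vectors):

# theta.s[j] is a draw from Beta(alpha.s + y_j, beta.s + n_j - y_j)
theta.s <- rbeta(J, alpha.s + y, beta.s + n - y)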
To sample from p(α, β|y) we must first derive an expression for this density (at least an unnormalized version of it), which we can get from

p(α, β|y) = p(α, β, θ|y) / p(θ|α, β, y) ∝ p(α, β) p(θ|α, β) p(y|θ) / p(θ|α, β, y) ,

which reduces to

p(α, β|y) ∝ p(α, β) ∏_{j=1}^{J} [ Γ(α+β) / (Γ(α)Γ(β)) ] × [ Γ(α+y_j) Γ(β+n_j−y_j) / Γ(α+β+n_j) ] .
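For numerical work it helps to have this density on the log scale; a minimal sketch in R (the function name log.post.ab is ours; y and n are the data vectors):

# Unnormalized log posterior log p(alpha, beta | y),
# using lgamma to avoid overflow in the Gamma functions
log.post.ab <- function(alpha, beta, y, n) {
  -2.5 * log(alpha + beta) +                      # log p(alpha, beta)
    sum(lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta) +
          lgamma(alpha + y) + lgamma(beta + n - y) -
          lgamma(alpha + beta + n))
}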
See page 110 of Gelman et al. for the details we are skipping over here.
Drawing random samples (α^s, β^s) from the discrete grid-based approximation to p(α, β|y) does not work well, however, because of the skewness of the distribution, and strong posterior correlation between α and β.
Instead, consider the reparameterization to

logit(μ) = logit(α/(α+β)) = log(α/β) , log(ψ) = log(α+β) .
The posterior density

p(log(α/β), log(α+β) | y) ∝ αβ · p(α, β|y)
has a much more ‘regular’ form. As an exercise, you can check that the appropriate Jacobian for this transformation is indeed given by αβ.
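Here is one way the grid computation might look in R, building on the log.post.ab sketch above; the grid limits are read off the contour plot below, and S, the grid resolution, and the variable names are our choices:

# Unnormalized log posterior on the (logit(mu), log(psi)) scale,
# including the log(alpha * beta) Jacobian term
log.post.trans <- function(logit.mu, log.psi, y, n) {
  mu <- 1 / (1 + exp(-logit.mu))      # inverse logit
  psi <- exp(log.psi)
  alpha <- mu * psi
  beta <- (1 - mu) * psi
  log(alpha) + log(beta) + log.post.ab(alpha, beta, y, n)
}

# Evaluate on a grid and sample S draws with replacement
S <- 1000
grid <- expand.grid(logit.mu = seq(-2.4, -1.2, length.out = 100),
                    log.psi  = seq(1, 5, length.out = 100))
lp <- mapply(log.post.trans, grid$logit.mu, grid$log.psi,
             MoreArgs = list(y = y, n = n))
w <- exp(lp - max(lp))                # subtract max for numerical stability
s <- sample(nrow(grid), size = S, replace = TRUE, prob = w)
logit.mu.sim <- grid$logit.mu[s]
log.psi.sim  <- grid$log.psi[s]

In practice one might also add a small uniform jitter to each sampled grid point, to smooth out the discreteness of the grid.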
[Figure: two contour plots of the posterior. Left panel, "p(alpha,beta|y)": alpha on the horizontal axis (0 to 10), beta on the vertical axis (0 to 50). Right panel, "Posterior": the same posterior on the transformed scale, with logit(mu) on the horizontal axis (−2.4 to −1.2) and log(psi) on the vertical axis (1 to 5).]
So rather than try to sample directly from the joint distribution in the left-hand contour plot, we simulate S draws from that in the right-hand plot, then back-transform by
> mu.sim <- inv.logit(logit.mu.sim)
> psi.sim <- exp(log.psi.sim)
> alpha.sim <- mu.sim * psi.sim
> beta.sim <- (1-mu.sim) * psi.sim

where inv.logit is given by logit^{−1}(u) = e^u / (1 + e^u).

We then proceed with

θ_j^s ∼ Beta(α^s + y_j, β^s + n_j − y_j) for j = 1, …, J and s = 1, …, S.

See Courseworks → Files → Examples → Example05a Hierarchical.

What sort of predictions might be of interest?

At least three types:

1. ỹ_j, the number of successes in ñ_j additional trials of the jth experiment, j = 1, …, J. The posterior predictive density is

p(ỹ_j|y) = ∫ p(ỹ_j|θ_j) p(θ_j|y) dθ_j ,

so if θ_j^s is a draw from p(θ_j|y), we can sample ỹ_j^s ∼ Binomial(ñ_j, θ_j^s).

2. θ̃_{J+1}, the true long-run success probability for a (J+1)st experiment. The posterior predictive density is

p(θ̃|y) = ∫∫ p(θ̃|α, β) p(α, β|y) dβ dα ,

so if (α^s, β^s) is a random draw from p(α, β|y), we can sample θ̃^s ∼ Beta(α^s, β^s).

3. ỹ_{J+1}, the observed number of successes in ñ trials of a (J+1)st experiment. Write the posterior predictive distribution as

p(ỹ|y) = ∫ p(ỹ|θ̃) p(θ̃|y) dθ̃ ,

so if θ̃^s is a draw from the posterior predictive distribution in item 2 above, we can sample ỹ^s ∼ Binomial(ñ, θ̃^s).
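A sketch of these three predictive simulations in R, reusing the draws above (here theta.sim is assumed to be an S × J matrix whose (s, j) entry is θ_j^s, and n.new is a hypothetical number of additional trials):

n.new <- 20                                   # hypothetical new sample size

# 1. Additional successes in an existing experiment (shown for j = 1)
y1.pred <- rbinom(S, size = n.new, prob = theta.sim[, 1])

# 2. Success probability for a new, (J+1)st experiment
theta.new <- rbeta(S, alpha.sim, beta.sim)

# 3. Number of successes in n.new trials of the new experiment
y.new <- rbinom(S, size = n.new, prob = theta.new)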