
Stat314/461 Term 4: Gibbs sampling

October, 2021
1/26

A reminder about our computation problem
Let θ = (θ1, θ2, ..., θK). We are interested in the posterior distribution

p(θ | data) = p(data | θ) p(θ) / ∫ p(data | θ) p(θ) dθ.   (1)

We should always be able to write down the numerator of (1), but the integral in the denominator may not be friendly (it is also K-dimensional). We also need to be able to integrate p(θ | data) to compute useful quantities, e.g. the posterior mean, variance, marginal tail probabilities, etc.
2/26

Some methods for sampling from the posterior in multi-parameter problems
θ = (θ1,θ2,θ3,…,θK)
We have studied the Metropolis-Hastings algorithm for sampling from p(θ|data) by jumping through the K-dimensional parameter space in a manner guided by the posterior – sampling more points from areas of high posterior density.
For low-dimensional problems we may also be able to apply rejection or importance sampling.
3/26

Sampling p(θ|data) by breaking θ into components

Recall that
p(θ|data) = p(θ1|data) p(θ2|θ1, data) × ... × p(θK|θK−1, ..., θ1, data)   (2)
In complex problems some components of this decomposition may be difficult to obtain or simulate. Metropolis-Hastings or Rejection sampling may help with some components.
Gibbs sampling is another method of posterior simulation that breaks θ into components, but instead of sampling from the sequence of conditionals in (2) it proceeds over a series of iterations, sampling each component from its posterior conditional on the most recently sampled values of all other components.
4/26

Gibbs sampler – background
Gelfand et al. (1990), in two JASA articles, introduced the Gibbs sampler to a general statistical audience.
Tanner and Wong (1987): missing-data problems.
Geman and Geman (1984): image analysis.
A special case of Metropolis-Hastings
5/26

The Gibbs sampler: General Statement
The Gibbs sampler proceeds by
1 assigning (K−1) components of θ = (θ1, ..., θK) an initial value;
2 alternately sampling from the "full conditional posterior" distribution of each component given not only the data but also all other components of θ;
3 repeating step (2) for some number of draws until the sampling process converges to the desired joint distribution ("burn-in");
4 repeating step (2) a further M times to obtain M simulations from the desired joint distribution.
If each of the "full conditionals" is easy to sample from we have a readily implemented algorithm. Sometimes the full conditionals correspond to conjugate models, so sampling from them is straightforward.
6/26

Simple Example: Posterior for mean and precision of a normal distribution
Yi | μ, τ ~ Normal(μ, τ) independently, i = 1, ..., n,

where τ is the precision (inverse of the variance) and Y = (Y1, ..., Yn). We adopt a prior of the form

p(μ, τ) = p(μ) p(τ)   (3)

where

μ ~ Normal(m_prior, κ_prior)
τ ~ Gamma(a, b)   (4)

A Gibbs sampler for this problem alternates between the following steps:
(i) draw μ from p(μ | τ, Y)
(ii) draw τ from p(τ | μ, Y)
7/26

Gibbs for Normal cont’d; full conditionals
From Elena's notes we know that, given our prior (Normal for μ, Gamma for τ, with p(μ, τ) = p(μ)p(τ)), the full conditionals are

[μ | τ, Y] ~ Normal( (τ n Ȳ + κ_prior m_prior) / (τ n + κ_prior),  τ n + κ_prior )

[τ | μ, Y] ~ Gamma( a + n/2,  b + 0.5 Σ_{i=1}^{n} (Yi − μ)² )   (5)

where Ȳ = (1/n) Σ_{i=1}^{n} Yi and the Normal is parameterised by mean and precision.
8/26

Gibbs for Normal cont’d – The sampling algorithm
1 set τ = τ(0) (usually by drawing from an approximation)
2 For t in 1 : T {
  2.1 draw μ(t) from Normal( (τ(t−1) n Ȳ + κ_prior m_prior) / (τ(t−1) n + κ_prior),  τ(t−1) n + κ_prior )
  2.2 draw τ(t) from Gamma( a + n/2,  b + 0.5 Σ_{i=1}^{n} (Yi − μ(t))² )   (6)
  }
3 discard first L draws (burn-in)
Example: see Gibbs_example1_normalmodel.Rmd; a minimal R sketch is given below.
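The following is a minimal R sketch of this sampler. The function name, the simulated data and the default hyperparameter values (m_prior = 0, κ_prior = 0.01, a = b = 0.1) are illustrative assumptions, not taken from the course code.

gibbs_normal <- function(y, T = 5000, m_prior = 0, kappa_prior = 0.01,
                         a = 0.1, b = 0.1, tau0 = 1) {
  # Gibbs sampler for the Normal(mu, tau) model with a Normal (mean/precision)
  # prior on mu and a Gamma(a, b) prior on the precision tau
  n    <- length(y)
  ybar <- mean(y)
  mu   <- numeric(T)
  tau  <- numeric(T)
  tau_curr <- tau0                                   # initial value tau^(0)
  for (t in 1:T) {
    # 2.1 draw mu from its full conditional (Normal, in mean/precision form)
    prec_mu <- tau_curr * n + kappa_prior
    mean_mu <- (tau_curr * n * ybar + kappa_prior * m_prior) / prec_mu
    mu[t]   <- rnorm(1, mean = mean_mu, sd = 1 / sqrt(prec_mu))
    # 2.2 draw tau from its full conditional (Gamma)
    tau_curr <- rgamma(1, shape = a + n / 2,
                       rate  = b + 0.5 * sum((y - mu[t])^2))
    tau[t] <- tau_curr
  }
  data.frame(mu = mu, tau = tau)
}

# Example use with simulated data; discard the first L = 1000 draws as burn-in
draws <- gibbs_normal(y = rnorm(50, mean = 2, sd = 1.5))
colMeans(draws[-(1:1000), ])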
9/26

Aside about the Normal distribution (1)
Term 3: Normal density parameterised with mean μ and precision (as in WinBugs)
p(Y = y | μ, τ) = √(τ / (2π)) exp( −τ (y − μ)² / 2 )   (7)
Term 4: (Mostly) Normal density parameterised with mean and standard deviation (as in R and Gelman et al BDA)
p(Y = y | μ, σ) = (1 / √(2πσ²)) exp( −(y − μ)² / (2σ²) )   (8)
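As a quick numerical check (with illustrative values only) that (7) and (8) describe the same density once we set τ = 1/σ²:

tau   <- 4                    # precision
sigma <- 1 / sqrt(tau)        # corresponding standard deviation
y <- 1.3; mu <- 0
dnorm(y, mean = mu, sd = sigma)                      # (8): mean / sd, as in R
sqrt(tau / (2 * pi)) * exp(-tau * (y - mu)^2 / 2)    # (7): mean / precision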
Conditionally conjugate prior for this parameterisation is:
p(μ, σ) = p(μ)p(σ)
μ ~ Normal(m0, s0²)   (9)
σ² ~ Inv-χ²(c, d)   (10)
No time (or need) to derive the corresponding "full conditionals" here – see Gelman et al., BDA.
10/26

Asides about the normal distribution (2)
It is probably mildly annoying that two parameterisations of the Normal are used in the course. But do not worry about this: just think of the normal distribution parameterised by mean and precision as one model, and the normal distribution parameterised by mean and standard deviation (or variance) as another model.
We try to make clear which parameterisation we are using.
11/26

Asides about the normal distribution (3)
Thinking in terms of the normal distribution parameterised by mean and precision: no prior of the form p(μ, τ) = p(μ)p(τ) can be conjugate for the normal model – consider the form of the normal density.
Similarly, if working with the normal distribution parameterised by mean and standard deviation σ, no prior of the form p(μ, σ) = p(μ)p(σ) can be conjugate.
The priors p(μ, τ) = p(μ)p(τ) and p(μ, σ) = p(μ)p(σ), where p(μ) is Normal, p(τ) is Gamma and p(σ²) is inverse-Gamma, are only conditionally conjugate.
The Gibbs sampler takes advantage of conditional conjugacy.
However, in more complex problems conditional conjugacy is not guaranteed, and more advanced methods (e.g. Metropolis-Hastings) are required to simulate from the full conditionals.
12/26

More general statement of the Gibbs sampler
1 initialise θ1, θ2, ..., θK to θ1(0), θ2(0), ..., θK(0)
2 for (t in 1 : T) {
  draw θ1(t) from p(θ1 | θ2(t−1), ..., θK(t−1), data)
  draw θ2(t) from p(θ2 | θ1(t), θ3(t−1), ..., θK(t−1), data)
  ...
  draw θK(t) from p(θK | θ1(t), θ2(t), ..., θK−1(t), data)
  }
3 discard the first L iterations
13/26

General comments on the Gibbs sampler
It is an MCMC procedure; in fact it can be shown to be a special case of Metropolis-Hastings.
Usual MCMC good practices therefore apply
1 Discard burn-in sample
2 run multiple chains from over-dispersed starting points
3 Use the Gelman-Rubin statistic (Rhat) to check for convergence; increase the burn-in period if Rhat is too big (e.g. Rhat > 1.1 for important analyses). A minimal check is sketched below.
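A minimal version of such a check, assuming the coda package is installed and using the gibbs_normal() sketch from the Normal-model slide (the data and starting values are illustrative):

library(coda)

set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)                 # illustrative data

# two chains started from over-dispersed initial values of tau
chain1 <- mcmc(as.matrix(gibbs_normal(y, tau0 = 0.1)))
chain2 <- mcmc(as.matrix(gibbs_normal(y, tau0 = 10)))

# Gelman-Rubin diagnostic: Rhat close to 1 (e.g. below 1.1) suggests the
# chains have mixed; otherwise increase the burn-in period or run longer
gelman.diag(mcmc.list(chain1, chain2))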
14/26

Gibbs sampler for problems of the “missing data type”
The Gibbs sampler deals easily with problems of the "missing data" type, e.g.
non-response
mis-measured variables
latent variables – a relevant variable is not directly observable
The general idea is to alternate between
i simulating the “missing” data from their conditional posterior (predictive) distribution given the current value of the model parameters
ii drawing from the conditional posterior of the parameters given the observed data and most recent imputations of the missing data.
This is the idea behind the “data-augmentation” approach developed by Tanner and Wong (1987).
15/26

Gibbs for missing data: Theory (1)
Suppose θ is the model parameter of interest, Dobs the data actually observed and Dmis the missing data. We define the full data, which we would have liked to observe, by data = (Dobs, Dmis).
Assume we know how to compute p(θ|data), the posterior given the full data – a standard Bayesian inference problem.
Since we only observe Dobs the posterior we need to compute is p(θ|Dobs) – we can only condition on the data actually observed.
However,

p(θ | Dobs) = ∫ p(θ, Dmis | Dobs) dDmis   (11)
16/26

Gibbs for missing data: Theory (2)
Want: p(θ | Dobs) = ∫ p(θ, Dmis | Dobs) dDmis
We can use the Gibbs sampler to, effectively, do the integration for us, simulating p(θ, Dmis | Dobs) by sampling alternately from:
i p(θ | Dmis, Dobs) = p(θ | data) (standard)
ii p(Dmis | θ, Dobs)
For inference we can ignore the generated Dmis values and treat the generated θ values as a sample from p(θ|Dobs)
17/26

Gibbs sampler for missing data problems: Zero-inflated models
See Gibbs_example2_ZIPmodel.Rmd.
18/26

Gibbs sampler: Example 3 random rounding
For example, suppose Y = (Y1, ..., Yn) is a vector of counts and we adopt the model

Yi ~ Poisson(θ) independently, i = 1, ..., n
θ ~ Gamma(α, β)   (12)

for fixed α, β. Instead of observing the counts we see only a randomly rounded version of them, R = (R1, R2, ..., Rn). Our inferential task is then to compute p(θ|R), since R is the observed data. Since

p(θ | R) = ∫ p(θ, Y | R) dY,   (13)

if we can compute the joint posterior p(θ, Y|R) we are done. Given a sample from the joint posterior we can just focus on the generated θ values for inference.
19/26

Gibbs sampler for inference under random rounding
1 initialise θ to θ(0)
2 for t in 1 to T
  i draw Y(t) from
    p(Y | R, θ(t−1)) ∝ p(R | Y) p(Y | θ = θ(t−1))   (14)
                     = ∏_i p(Ri | Yi) ∏_i Poisson(Yi | θ = θ(t−1))   (15)
                     = ∏_i p(Ri | Yi) Poisson(Yi | θ = θ(t−1))   (16)
  ii draw θ(t) ~ Gamma(α + Σ_i Yi(t), β + n)   (17)
(16) can be simulated easily by direct simulation (or rejection sampling); (17) follows from the conjugate Poisson-Gamma model. A minimal sketch is given below.
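A minimal R sketch of this sampler. The rounding mechanism p(Ri | Yi) is not specified on the slide, so purely for illustration the code assumes random rounding to base 3 (as the file name RR3 suggests): a count that is already a multiple of 3 is left unchanged, and any other count moves to the nearer multiple of 3 with probability 2/3 and to the farther one with probability 1/3. The defaults α = β = 1 and the example data are also illustrative.

p_R_given_Y <- function(r, y) {
  # P(R = r | Y = y) under the assumed random rounding to base 3 (RR3)
  d <- y %% 3
  if (d == 0) return(as.numeric(r == y))                   # multiples of 3 unchanged
  if (r == y - d)       return(ifelse(d == 1, 2/3, 1/3))   # round down
  if (r == y + (3 - d)) return(ifelse(d == 1, 1/3, 2/3))   # round up
  0
}

gibbs_rr3 <- function(R, alpha = 1, beta = 1, T = 5000, theta0 = mean(R) + 0.5) {
  n <- length(R)
  theta <- numeric(T)
  theta_curr <- theta0
  for (t in 1:T) {
    # (i) draw each Y_i from p(Y_i | R_i, theta) by enumerating the few
    #     counts consistent with the rounded value R_i, as in (16)
    Y <- vapply(R, function(r) {
      cand <- max(0, r - 2):(r + 2)
      w <- vapply(cand, function(y) p_R_given_Y(r, y), numeric(1)) *
        dpois(cand, lambda = theta_curr)
      sample(cand, 1, prob = w)
    }, numeric(1))
    # (ii) draw theta from the conjugate Gamma full conditional (17)
    theta_curr <- rgamma(1, shape = alpha + sum(Y), rate = beta + n)
    theta[t] <- theta_curr
  }
  theta
}

# Example use with a small vector of RR3-rounded counts
theta_draws <- gibbs_rr3(R = c(0, 3, 3, 6, 9, 3), T = 2000)
mean(theta_draws[-(1:500)])                    # posterior mean after burn-in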
20/26

Random Rounding
For details see: Gibbs_example3_RR3.Rmd
21/26

The Gibbs sampler is a special case of the Metropolis-Hastings algorithm
recognise the full conditional posterior distributions as a special jumping distribution in which the only jumps allowed are to values of θ that match the current value of θ with respect to all elements except the one currently being updated.
J_{j,t}^{Gibbs}(θ^new | θ^{(t−1)}) = p(θ_j^new | θ_{−j}^{(t−1)∗}, data)   if θ_{−j}^new = θ_{−j}^{(t−1)∗}
                                   = 0                                    otherwise   (18)

Plug J_{j,t}^{Gibbs} into the M-H acceptance ratio formula:

r_{MH,j}(θ^new, θ^{(t−1)∗}) = [ p(θ^new | data) / J_{j,t}^{Gibbs}(θ^new | θ^{(t−1)∗}) ] / [ p(θ^{(t−1)∗} | data) / J_{j,t}^{Gibbs}(θ^{(t−1)∗} | θ^new) ]

See the notes for an explanation of the (t−1)∗ superscript (all elements except θj are set to their most recently updated value).
22/26

Acceptance probabilities for Gibbs sampler viewed as Metropolis-Hastings
The acceptance ratio at the jth step of the tth iteration is therefore

r_{MH,j}(θ^new, θ^{(t−1)∗}) = [ p(θ^new | data) / J_{j,t}(θ^new | θ^{(t−1)∗}) ] / [ p(θ^{(t−1)∗} | data) / J_{j,t}(θ^{(t−1)∗} | θ^new) ]   (19)
                            = [ p(θ^new | data) / p(θ_j^new | θ_{−j}^{(t−1)∗}, data) ] / [ p(θ^{(t−1)∗} | data) / p(θ_j^{(t−1)∗} | θ_{−j}^{(t−1)∗}, data) ]   (20)
                            = p(θ_{−j}^{(t−1)∗} | data) / p(θ_{−j}^{(t−1)∗} | data)   (21)
                            = 1.   (22)

See the notes for an explanation of the steps.
23/26

Gibbs sampler when not all full conditionals can be directly simulated
For difficult full conditionals we can use a Metropolis-Hastings step
Suppose the difficult full conditional is for component j, and that this is the last component to be updated on each Gibbs iteration
(i) draw a proposal θ_j^(t) from J_{j,t}(θ_j | θ^{(t−1)∗})
(ii) evaluate

r_{MH,j}(θ_j^(t), θ_j^{(t−1)}) = [ q(θ_j^(t) | data, θ_{−j}^{(t−1)∗}) / J_{j,t}(θ_j^(t) | θ^{(t−1)∗}) ] / [ q(θ_j^{(t−1)∗} | data, θ_{−j}^{(t−1)∗}) / J_{j,t}(θ_j^{(t−1)∗} | θ_j^(t)) ]

(iii) accept θ_j^(t) with probability min(1, r_{MH,j}^(t)).
If θ_j^(t) is not accepted set θ_j^(t) = θ_j^{(t−1)∗}, i.e. stay at θ_j^{(t−1)∗}. A minimal sketch follows.
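A minimal R sketch of steps (i)-(iii) for a single scalar component. It assumes a symmetric random-walk proposal, so the J_{j,t} terms cancel in the ratio; log_fc is a placeholder for whatever function evaluates the log of the full conditional q(θj | data, θ−j) up to an additive constant (the names and the example target are illustrative).

mh_step <- function(theta_j, log_fc, step = 0.1) {
  # one Metropolis-Hastings update of a component whose full conditional
  # cannot be sampled directly; log_fc returns log q(theta_j | data, theta_-j)
  # up to an additive constant
  proposal <- rnorm(1, mean = theta_j, sd = step)     # symmetric jump J_{j,t}
  log_r <- log_fc(proposal) - log_fc(theta_j)         # log acceptance ratio
  if (log(runif(1)) < log_r) proposal else theta_j    # otherwise stay put
}

# Example: a full conditional proportional to a Normal(2, 1) density
mh_step(theta_j = 0, log_fc = function(x) dnorm(x, mean = 2, sd = 1, log = TRUE))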
24/26

Hybrid Gibbs/Metropolis-Hastings samplers arise frequently in practice
Reconsider our fishing example: suppose that instead of just focusing on the expected number of fish caught (given that a party fished) we were interested in a Poisson regression relating catch to covariates.
[Yi | Zi = 1, Xi, β] ~ Poisson(θi) independently, i = 1, ..., n
log(θi) = β0 + X1i β1 + X2i β2 + ..., i = 1, ..., n
There is no obvious conditionally conjugate prior for β = (β0,β1,β2,…).
A hybrid Gibbs / Metropolis-Hastings sampler would alternate between the following steps (step (ii) is sketched in code after the list):
i simulating Z from p(Z|Y, X, φ, β)
ii updating β using a Metropolis-Hastings step
iii simulating φ from p(φ|Z)
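A sketch of the Metropolis-Hastings update for β in step (ii). For illustration it assumes a vague Normal(0, 10²) prior on each coefficient and a symmetric random-walk proposal, neither of which is prescribed in the notes; y and X stand for the catches and design matrix (including an intercept column) for the parties with Zi = 1.

log_post_beta <- function(beta, y, X, prior_sd = 10) {
  # log posterior of beta for the Poisson regression part of the model,
  # using only the observations with Z_i = 1
  eta <- as.vector(X %*% beta)                           # linear predictor
  sum(dpois(y, lambda = exp(eta), log = TRUE)) +
    sum(dnorm(beta, mean = 0, sd = prior_sd, log = TRUE))
}

update_beta <- function(beta, y, X, step = 0.05) {
  # one random-walk Metropolis-Hastings update of the whole beta vector
  proposal <- beta + rnorm(length(beta), sd = step)      # symmetric jump
  log_r <- log_post_beta(proposal, y, X) - log_post_beta(beta, y, X)
  if (log(runif(1)) < log_r) proposal else beta          # accept or stay
}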
25/26

Comment on Gibbs sampler with Metropolis-Hastings steps
These used to be referred to as "Metropolis-Hastings within Gibbs" samplers.
Given that Gibbs itself is a special case of Metropolis-Hastings, it is possibly easier to think of hybrid Gibbs / Metropolis-Hastings procedures simply as versions of Metropolis-Hastings.
In problems with many parameters it is difficult to apply Metropolis-Hastings to the full parameter vector; some chunking of parameters into sub-groups seems inevitable.
Hybrid Gibbs / Metropolis-Hastings algorithms provide a practical way to apply MCMC in problems without conditional conjugacy and/or many parameters.
26/26