Monte Carlo approximation
Bayesian Statistics
Statistics 4224/5224, Spring 2021
January 28, 2021
Summarizing posterior inference
In addition to point summaries (posterior mean, median, or mode), it is nearly always important to report posterior uncertainty.
Our usual approach is to present quantiles of the posterior distributions of estimands of interest.
A 100(1 − α)% posterior credible interval for θ is an interval [l(y),u(y)] such that
Pr{l(y) < θ < u(y) | y} = 1 − α .

Example: Comparing two proportions

Suppose we observe independent binomial outcomes y1 and y2 for two groups with success probabilities θ1 and θ2, and we want to know whether θ2 > θ1.
Here yi ∼ Binomial(ni, θi) where θi ∼ Beta(1, 1), and thus θi|yi ∼ Beta(yi + 1, ni − yi + 1)
for i = 1, 2, with θ1 and θ2 independent. Data: y1 = 5 and y2 = 8.
Evaluate Pr(θ2 > θ1|y1, y2).
Use Monte Carlo!
Generate random samples θi(s) ∼ Beta(yi + 1, ni − yi + 1), i = 1, 2, for s = 1, . . . , S.
Then
Pr(θ2 > θ1 | y1, y2) ≈ (1/S) Σ_{s=1}^{S} I(θ2(s) > θ1(s)) .
Find the R code in
Courseworks → Files → Examples → Example01 MC .
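A minimal sketch of how this computation looks in R (this is not the posted Example01 script; S is illustrative, and the sample sizes n1, n2 below are placeholders to be replaced by the values from the example):

S <- 10000                                      # number of Monte Carlo draws (illustrative)
y1 <- 5; y2 <- 8                                # observed counts from the slides
n1 <- 10; n2 <- 10                              # placeholder sample sizes; substitute the example's values
theta1.sim <- rbeta(S, y1 + 1, n1 - y1 + 1)     # draws from p(theta1 | y1)
theta2.sim <- rbeta(S, y2 + 1, n2 - y2 + 1)     # draws from p(theta2 | y2)
mean(theta2.sim > theta1.sim)                   # approximates Pr(theta2 > theta1 | y1, y2)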
Example: Posterior prediction

Last spring we planted n = 10 seedlings, and y = 2 survived the winter; if we plant another 10 this year, what is the probability at least one will survive the winter?
Sampling model: y, ỹ | θ ∼ iid Binomial(n, θ).
Prior: θ ∼ Uniform(0, 1) = Beta(1, 1).
The posterior predictive distribution satisfies
p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ .
We can approximate Pr(ỹ > 0|y) by Monte Carlo sampling:
First sample θ1, . . . , θS ∼ p(θ|y).
Then sample ỹs ∼ p(ỹ|θs) for s = 1, . . . , S.
Then Pr(ỹ > 0|y) ≈ (1/S) Σ_{s=1}^{S} I(ỹs > 0).
Find the R code in
Courseworks → Files → Examples → Example02 PPD .
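A minimal sketch of that calculation in R (not the posted Example02 script; variable names and S are illustrative):

S <- 10000                                             # number of Monte Carlo draws
y <- 2; n <- 10                                        # last spring's data
theta.sim <- rbeta(S, y + 1, n - y + 1)                # draws from p(theta | y) = Beta(3, 9)
ytilde.sim <- rbinom(S, size = 10, prob = theta.sim)   # draws of next year's count, ytilde | theta
mean(ytilde.sim > 0)                                   # approximates Pr(ytilde > 0 | y)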
The Monte Carlo method
Suppose the posterior distribution θ|y has the following density:
[Figure: the posterior density p(θ|y) plotted against θ over (0, 1).]
And we wish to use the Monte Carlo method to approximate
• posterior probabilities, such as Pr(θ > .50|y);
• posterior quantiles (median, a 95% posterior interval);
• posterior moments, such as E(θ|y) = ∫ θ p(θ|y) dθ.
Simulating a sequence of random draws, θ1, . . . , θS ∼ iid p(θ|y), can be visualized as randomly selecting S points in the area under the curve, and taking their θ-coordinates as our simulated sample.
[Figure: the same posterior density, with the S simulated points scattered under the curve.]
Posterior probabilities
We can approximate
Pr(θ > 0.5|y) = ∫_{0.5}^{∞} p(θ|y) dθ
by (1/S) Σ_{s=1}^{S} I(θs > 0.5), the proportion of simulated draws that are greater than 0.5.
[Figure: the simulated draws under the posterior density p(θ|y).]
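In R, assuming theta.sim is a vector holding the S simulated draws from p(θ|y) (the name is illustrative):

mean(theta.sim > 0.5)    # proportion of simulated draws above 0.5; approximates Pr(theta > 0.5 | y)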
Posterior quantiles
We approximate the posterior q-quantile, 0 < q < 1, by the value below which 100q% of the simulated θs fall.
[Figure: the simulated draws under p(θ|y), with the 90% interval marked.]
The lower 5% and upper 5% of the simulated θs-values are marked by red dots. We are 90% certain that the true value of θ falls between the two dashed blue lines.
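In R, again with theta.sim holding the simulated draws:

quantile(theta.sim, 0.5)              # approximate posterior median
quantile(theta.sim, c(0.05, 0.95))    # endpoints of an approximate 90% posterior interval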
Posterior moments
We approximate the posterior expectation of θ by the sample average of the simulated random draws,
E(θ|y) = ∫ θ p(θ|y) dθ ≈ (1/S) Σ_{s=1}^{S} θs .
We can similarly approximate posterior expectations for various functions of the unknown parameter,
E[h(θ)|y] = ∫ h(θ) p(θ|y) dθ ≈ (1/S) Σ_{s=1}^{S} h(θs) .
The LLN guarantees that the Monte Carlo approximation converges to the posterior moment as S → ∞, and the CLT can be used to establish asymptotic bounds on the Monte Carlo error.
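In R, with theta.sim the simulated draws (illustrative names; h(θ) = θ² is used only as an example of a function of interest):

mean(theta.sim)                              # approximates E(theta | y)
mean(theta.sim^2)                            # approximates E(theta^2 | y), i.e. h(theta) = theta^2
sd(theta.sim) / sqrt(length(theta.sim))      # CLT-based Monte Carlo standard error of the posterior-mean estimate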
Sampling from mixture distributions
Finite mixtures
Suppose the posterior density can be expressed as a weighted average of k different density functions
p(θ|y) = λ1 p1(θ|y) + λ2 p2(θ|y) + · · · + λk pk(θ|y) ,
where λi ≥ 0 and λ1 + λ2 + · · · + λk = 1.
Simulating a random draw θs from a finite mixture distribution is accomplished in two steps:
1. Randomly select zs ∈ {1,2,...,k}, where Pr(z = i) = λi, for i = 1,2,...,k; then
2. draw the sample θs ∼ pzs(θ|y).
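A sketch of the two-step draw in R for a hypothetical two-component mixture (the weights and Beta components below are made up purely for illustration):

S <- 10000
lambda <- c(0.3, 0.7)                                     # hypothetical mixture weights
z.sim <- sample(1:2, S, replace = TRUE, prob = lambda)    # step 1: pick a component with Pr(z = i) = lambda_i
theta.sim <- ifelse(z.sim == 1,                           # step 2: draw from the selected component
                    rbeta(S, 2, 8),                       #   p1 = Beta(2, 8), illustrative
                    rbeta(S, 8, 2))                       #   p2 = Beta(8, 2), illustrative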
Continuous mixtures
If, instead of k distinct component distributions, there were a continuum of them, and, instead of probabilities λi = Pr(z = i), there were a probability density λ(z), we would have a continuous mixture distribution of the form
p(θ|y) = ∫ p(θ|y; z) λ(z) dz .
The two-step algorithm for drawing a random sample from this distribution generalizes the algorithm given above:
1. Randomly generate zs ∼ λ(z); then
2. draw the sample θs ∼ p(θ|y; zs).
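As a generic illustration (not from the slides), the t-distribution is a continuous scale mixture of normals, and the two-step draw in R looks like:

S <- 10000
nu <- 5                                              # degrees of freedom, chosen only for illustration
z.sim <- nu / rchisq(S, df = nu)                     # step 1: z ~ scaled inverse-chi^2 mixing density
theta.sim <- rnorm(S, mean = 0, sd = sqrt(z.sim))    # step 2: theta | z ~ Normal(0, z)
# theta.sim is then an iid sample from the t distribution with nu degrees of freedom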
The PPD is a mixture distribution
If the two-step process described above seemed familiar, it's because we've encountered this idea already, in our discussion of simulating data from the posterior predictive distribution.
The posterior predictive distribution is a mixture of sampling distributions p(ỹ|θ), weighted according to the posterior density p(θ|y).
Thus we simulate a random draw from
p(ỹ|y) = ∫ p(ỹ|θ) p(θ|y) dθ
as follows:
1. First draw a parameter value from the posterior, θs ∼ p(θ|y);
2. then draw the predictive quantity ỹs ∼ p(ỹ|θs).
The two-parameter normal model
Building on the results we developed for estimation of the mean and variance separately, assuming the other parameter was known, we now consider the model
y1, . . . , yn|μ, σ2 ∼ iid Normal(μ, σ2)
and suppose we wish to do Bayesian inference about the two-dimensional model parameter θ = (μ, σ2). The likelihood for this model can be written
p(y|μ, σ2) ∝ (σ2)−n/2 exp{ −(1/(2σ2)) Σ_{i=1}^{n} (yi − μ)2 }
= (σ2)−n/2 exp{ −(1/(2σ2)) [ (n − 1)s2 + n(ȳ − μ)2 ] }
where s2 = (1/(n−1)) Σ_{i=1}^{n} (yi − ȳ)2 is the sample variance of the yi's.
Conjugate prior
The conjugate prior is given by
μ|σ2 ∼ Normal(μ0, σ2/κ0)
σ2 ∼ Inv-χ2(ν0, σ02) .
The resulting posterior is given by
μ|σ2, y ∼ Normal(μn, σ2/κn)
σ2|y ∼ Inv-χ2(νn, σn2)
where
κn = κ0 + n
μn = (κ0 μ0 + n ȳ) / κn
νn = ν0 + n
σn2 = [ ν0 σ02 + (n − 1) s2 + (κ0 n / κn)(ȳ − μ0)2 ] / νn
Simulated draws from this posterior can be produced by σ2(s) ∼ Inv-χ2(νn, σn2) , then μ(s) ∼ Normal(μn, σ2(s)/κn)
for s = 1,...,S.
This Monte Carlo procedure can be implemented in R with only
two lines of code:
sigma2.sim <- nu.n * sigma2.n / rchisq(S, df=nu.n)           # sigma^2 | y ~ Inv-chi^2(nu.n, sigma2.n)
mu.sim <- rnorm(S, mean=mu.n, sd=sqrt(sigma2.sim/kappa.n))   # mu | sigma^2, y ~ Normal(mu.n, sigma^2/kappa.n)
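Posterior summaries then follow directly from the simulated draws, for example (a sketch):

quantile(mu.sim, c(0.025, 0.975))              # approximate 95% posterior interval for mu
quantile(sqrt(sigma2.sim), c(0.025, 0.975))    # approximate 95% posterior interval for sigma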
Noninformative prior
A sensible vague prior density for μ and σ is uniform on (μ, log σ) or, equivalently,
p(μ, σ2) ∝ (σ2)−1 .
The resulting posterior can be specified by
μ|σ2, y ∼ Normal(ȳ, σ2/n)
and
σ2|y ∼ Inv-χ2(n − 1, s2) .
The marginal posterior of μ, integrating over σ2, is given by
(μ − ȳ) / (s/√n) | y ∼ tn−1 .
That is, the marginal posterior of μ is Student's t-distribution with location ȳ, scale s/√n, and n − 1 degrees of freedom.
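Under this noninformative prior the Monte Carlo recipe is the same two-step simulation as before; a minimal sketch (assuming ybar, s2, and n hold the sample mean, sample variance, and sample size, and S is the number of draws):

sigma2.sim <- (n - 1) * s2 / rchisq(S, df = n - 1)            # sigma^2 | y ~ Inv-chi^2(n - 1, s^2)
mu.sim <- rnorm(S, mean = ybar, sd = sqrt(sigma2.sim / n))    # mu | sigma^2, y ~ Normal(ybar, sigma^2/n)
# (mu.sim - ybar) / (sqrt(s2) / sqrt(n)) is then a sample from the t distribution with n - 1 df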