Stat314/461 Term 4
Rejection sampling: A simple simulation method for non-standard (posterior) densities
September, 2021
Rejection sampling: Problem 1 – non-conjugate Binomial model

\[
Y_i \overset{\text{indep}}{\sim} \text{Binomial}(10, \theta), \qquad
\text{logit}(\theta) \sim \text{Normal}(\mu, \sigma^2);
\quad \left[\text{logit}(\theta) = \log\left(\frac{\theta}{1-\theta}\right)\right] \tag{1}
\]

\[
\Rightarrow \; p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma\,\theta(1-\theta)}
\exp\left\{-\frac{1}{2\sigma^2}\left[\log\left(\frac{\theta}{1-\theta}\right) - \mu\right]^2\right\}
\]

Y = (6, 7, 5, 5, 4, 8)

\[
p(\theta|\mathbf{Y}) =
\frac{\prod_{i=1}^{6} \text{Binomial}(Y_i|10,\theta)\,\text{logitN}(\theta|\mu,\sigma^2)}
{\int \prod_{i=1}^{6} \text{Binomial}(Y_i|10,\theta)\,\text{logitN}(\theta|\mu,\sigma^2)\,d\theta}
\propto \prod_{i=1}^{6} \theta^{Y_i}(1-\theta)^{10-Y_i} \times \text{logitN}(\theta|\mu,\sigma^2)
\]

How can we simulate from p(θ|Y)?
Rejection sampling (or acceptance sampling)
Simulate from an approximation; then accept with a probability that depends on the ratio of the true density to the approximating density.

This idea re-appears in more advanced Bayesian computation methods, e.g. Metropolis-Hastings.
Rejection sampling algorithm
(stated in a general way, assuming some parameter of interest θ and data Y)

Suppose
\[
p(\theta|\mathbf{Y}) = \frac{p(\mathbf{Y}|\theta)\,p(\theta)}{\int p(\mathbf{Y}|\theta)\,p(\theta)\,d\theta}
= \frac{q(\theta|\mathbf{Y})}{\int p(\mathbf{Y}|\theta)\,p(\theta)\,d\theta}
\]
where q(θ|Y) = p(Y|θ)p(θ) denotes the unnormalised posterior.

Let g(θ) be an approximation to p(θ|Y) that can easily be simulated from.
Set r(θ) = q(θ|Y)/g(θ).
Find M such that r(θ) ≤ M for all θ.

1. Sample θ* from g(θ).
2. Accept θ* with probability r(θ*)/M.
3. Repeat until the desired number of draws is obtained.

To implement step 2: draw u ~ Uniform(0, 1) and accept if u ≤ r(θ*)/M.
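The three steps above can be sketched generically. This is a minimal Python illustration (the course's R script is not shown here); the toy target and proposal, a standard Normal approximated by a Normal(0, 2²), are assumptions for demonstration, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sampler(log_q, g_sampler, log_g, log_M, n_draws):
    """Generic rejection sampler.

    log_q:     log of the unnormalised target q(theta | Y)
    g_sampler: draws one value from the approximating density g
    log_g:     log density of g
    log_M:     log of the bound M with q(theta)/g(theta) <= M for all theta
    """
    draws = []
    while len(draws) < n_draws:
        theta_star = g_sampler()                       # step 1: propose from g
        log_r = log_q(theta_star) - log_g(theta_star)  # log r(theta*)
        if np.log(rng.uniform()) <= log_r - log_M:     # step 2: accept w.p. r(theta*)/M
            draws.append(theta_star)
    return np.array(draws)

# Toy check: target is standard Normal, proposal is Normal(0, 2^2).
# Here r(x) = q(x)/g(x) is maximised at x = 0, so M = r(0).
log_q = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
log_g = lambda x: -0.5 * (x / 2)**2 - np.log(2) - 0.5 * np.log(2 * np.pi)
log_M = log_q(0.0) - log_g(0.0)
samples = rejection_sampler(log_q, lambda: rng.normal(0, 2), log_g, log_M, 5000)
```

Working on the log scale, as the later slides recommend, avoids underflow when the densities are tiny.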
Some characteristics of rejection sampling (i)

A good approximation g(θ) is roughly proportional to q(θ|Y), which is, of course, proportional to p(θ|data).

If proportionality holds exactly, g(θ) = K q(θ|Y), then
\[
r(\theta) = \frac{q(\theta|\mathbf{Y})}{K\,q(\theta|\mathbf{Y})} = 1/K \tag{2}
\]
so if we set M = 1/K, all draws from g(θ) will be accepted.
Some characteristics of rejection sampling (ii): Consequences of non-proportionality

If g(θ) is not approximately proportional to q(θ|data):
- There will be θ such that g(θ) ≪ q(θ|data).
- Hence M must be set large to ensure r(θ) = q(θ|Y)/g(θ) ≤ M for all θ.
- But g(θ) ≫ q(θ|data) for some θ, hence r(θ)/M will be low in these high-g-density regions.
- So most draws are rejected.
- The worse the approximation, the greater the expected number of draws required before acceptance.
- But the method is self-diagnosing: many rejections imply a poor approximating density.
A poor approximating density
(consider (i) red approximating black; (ii) black approximating red)
[figure omitted]

A good approximating density
[figure omitted]
Insight: what if the approximating density is the prior?

This will often be inefficient because the prior is usually more diffuse than the posterior; hence many draws are likely to be rejected (wasted). Nevertheless it is interesting to think this through:

if g(θ) = p(θ) then the importance sampling ratio r(θ) is
\[
r(\theta) = \frac{q(\theta|\mathbf{Y})}{p(\theta)}
= \frac{p(\mathbf{Y}|\theta)\,p(\theta)}{p(\theta)}
= \underbrace{p(\mathbf{Y}|\theta)}_{\text{Likelihood}} \tag{3}
\]
Implications of using the prior as the approximation, cont'd

Setting the approximating density g(θ) to the prior p(θ) implies the importance sampling ratio is equal to the likelihood: r(θ) = p(Y|θ).

Hence M can be found by maximizing the likelihood p(Y|θ) with respect to θ, i.e. set M to the value of the likelihood at the maximising value, p(Y|θ̂), where θ̂ is the MLE.
Rejection algorithm when the prior is used as the approximating density
(M is the maximum value of the likelihood)

1. Sample θ* from the prior p(θ).
2. Accept θ* with probability p(Y|θ*)/M.

So values of θ close to the value that maximises the likelihood (i.e. the MLE) have a high probability of being accepted; values far from the MLE have a low probability of being accepted.

Not a bad insight into Bayesian inference: start with the prior; values more strongly supported by the data (high likelihood) are more likely to make it into the posterior sample. It is like giving more weight to values more strongly supported by the data.
Rejection sampling for Binomial – logitNormal problem: Using the prior as the approximator

\[
Y_i \overset{\text{indep}}{\sim} \text{Binomial}(10, \theta), \qquad
\text{logit}(\theta) \sim \text{Normal}(\mu, \sigma^2) \tag{4}
\]
\[
p(\theta|\mathbf{Y}, N) \propto \prod_i \text{Binomial}(Y_i|10,\theta) \times \text{logitN}(\theta|\mu,\sigma^2) \tag{5}
\]

Y = (6, 7, 5, 5, 4, 8), N = 10

If we use the prior as the approximating density:
\[
r(\theta) = \frac{\prod_i \text{Binomial}(Y_i|10,\theta) \times \text{logitN}(\theta|\mu,\sigma^2)}{\text{logitN}(\theta|\mu,\sigma^2)}
= \prod_i \text{Binomial}(Y_i|10,\theta)
\propto \theta^{\sum_i Y_i}(1-\theta)^{\sum_i (10-Y_i)} \tag{6}
\]
Rejection sampler using prior as the approximator, cont'd

\(\sum_i Y_i = 35\); \(\sum_i N_i = 60\)

- Set \(\theta_{MLE} = \sum_i Y_i / \sum_i N_i = 35/60\).
- Set \(M = \prod_i \text{Binomial}(Y_i|10, \theta = \theta_{MLE})\).
- Draw θ* ~ logitN(µ, σ²).
- Accept with probability \(\prod_i \text{Binomial}(Y_i|10, \theta = \theta^*)/M\).
- Repeat until the desired simulation size is reached.

For the acceptance step it is easiest to work on the log scale:
- draw U ~ Uniform(0, 1)
- accept θ* if log U < log r(θ*) − log M
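The steps above can be sketched as follows. This is a Python sketch standing in for the course's R script (which is not shown here); the prior settings µ = logit(0.5) = 0, σ = 1 are taken from the performance slide, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

Y = np.array([6, 7, 5, 5, 4, 8])   # the data from the slides
n_trials = 10
mu, sigma = 0.0, 1.0               # logit(0.5) = 0; the sigma = 1 case from the performance slide

def log_lik(theta):
    # log of prod_i Binomial(Y_i | 10, theta)
    return stats.binom.logpmf(Y, n_trials, theta).sum()

# M = likelihood at the MLE, theta_hat = sum_i Y_i / sum_i N_i = 35/60
theta_hat = Y.sum() / (Y.size * n_trials)
log_M = log_lik(theta_hat)

draws, n_proposed = [], 0
while len(draws) < 2000:
    theta_star = 1 / (1 + np.exp(-rng.normal(mu, sigma)))    # draw from the logitN prior
    n_proposed += 1
    if np.log(rng.uniform()) < log_lik(theta_star) - log_M:  # accept on the log scale
        draws.append(theta_star)
acc_rate = len(draws) / n_proposed
```

With σ = 1 the acceptance rate lands near the 24.7% reported on the next slide.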
Performance of the rejection sampler with prior as the approximator
(see R script for code)

p(θ) = logitN(logit(0.5), 5): acceptance rate 5.4%
p(θ) = logitN(logit(0.5), 1): acceptance rate 24.7%
Rejection sampling for Binomial – logitNormal problem: Using a beta approximation to the posterior as the approximator

- The conjugate prior for the Binomial likelihood is the beta distribution.
- If we can construct a beta approximation to the logit-normal prior, we can obtain a beta approximation to the posterior using the standard updating results for the Binomial–beta model.
- A beta approximation to the logitN(µ, σ²) distribution can be obtained by matching the mean and variance of the two densities.
Rejection sampling using a beta approximation to the posterior, cont'd

If θ ~ Beta(a, b),
\[
E(\theta) = \frac{a}{a+b} \tag{7}
\]
\[
V(\theta) = \frac{ab}{(a+b)^2(a+b+1)} \tag{8}
\]

- Obtain E(θ) and V(θ) for θ ~ logitN(µ, σ²) by simulation or numerical integration (see the logitnorm package).
- Use (7) and (8) to solve for a and b in terms of E(θ) and V(θ).
- Beta(a + Σᵢ Yᵢ, b + Σᵢ(N − Yᵢ)) is then an approximation to the posterior.
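Inverting (7) and (8) gives a + b = E(θ)(1 − E(θ))/V(θ) − 1, then a = E(θ)(a + b) and b = (1 − E(θ))(a + b). A small Python sketch (plain Monte Carlo for the logitN moments, standing in for the R logitnorm package the slides mention; µ, σ and the seed are illustrative):

```python
import numpy as np

def beta_moment_match(mean, var):
    """Invert (7)-(8): given E(theta) and V(theta), return Beta(a, b) parameters."""
    common = mean * (1 - mean) / var - 1    # equals a + b
    return mean * common, (1 - mean) * common

# Moments of the logitN(mu, sigma^2) prior by plain Monte Carlo
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
theta = 1 / (1 + np.exp(-rng.normal(mu, sigma, size=200_000)))
a, b = beta_moment_match(theta.mean(), theta.var())
```

As a sanity check, a mean of 1/2 and variance of 1/12 recovers Beta(1, 1), the uniform distribution.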
Rejection sampling using a beta approximation to the posterior, cont'd

\[
r(\theta) = \frac{p(\theta|\mathbf{Y}, N)}{g(\theta)}
\propto \frac{\theta^{\sum_i Y_i}(1-\theta)^{\sum_i (N-Y_i)} \times \text{logitN}(\theta|\mu,\sigma^2)}
{\text{beta}\!\left(\theta \,\middle|\, a+\textstyle\sum_i Y_i,\; b+\textstyle\sum_i (N-Y_i)\right)}
= r_A(\theta), \text{ say.}
\]

- Find log(M) = maxθ log(r_A(θ)).
- Draw θ* from Beta(a + Σᵢ Yᵢ, b + Σᵢ(N − Yᵢ)).
- Compute log(r_A(θ*)).
- Draw U ~ Uniform(0, 1).
- Accept θ* if log(U) ≤ log(r_A(θ*)) − log(M).
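Putting the pieces together, a Python sketch of this sampler (again standing in for the R script; the logitN moments are obtained by Monte Carlo rather than the logitnorm package, and log M is found by numerical optimisation, which assumes r_A is well behaved on (0, 1)):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)

Y = np.array([6, 7, 5, 5, 4, 8])   # data from the slides
N = 10
mu, sigma = 0.0, 1.0               # the logitN(logit(0.5), 1) prior

# Moment-matched Beta(a, b) approximation to the logitN prior
theta_sim = 1 / (1 + np.exp(-rng.normal(mu, sigma, size=200_000)))
m, v = theta_sim.mean(), theta_sim.var()
ab = m * (1 - m) / v - 1
a, b = m * ab, (1 - m) * ab

# Conjugate update gives the beta approximation to the posterior
a_post, b_post = a + Y.sum(), b + (N - Y).sum()

def log_rA(t):
    """log r_A(t): unnormalised posterior over the beta approximating density."""
    log_num = (stats.binom.logpmf(Y, N, t).sum()
               + stats.norm.logpdf(np.log(t / (1 - t)), mu, sigma)
               - np.log(t * (1 - t)))                   # logitN density via change of variables
    return log_num - stats.beta.logpdf(t, a_post, b_post)

# log M = max_t log r_A(t), found numerically
res = optimize.minimize_scalar(lambda t: -log_rA(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
log_M = -res.fun

accepted, n_prop = [], 0
while len(accepted) < 2000:
    t = rng.beta(a_post, b_post)                        # propose from the beta approximation
    n_prop += 1
    if np.log(rng.uniform()) <= log_rA(t) - log_M:      # accept on the log scale
        accepted.append(t)
acc_rate = len(accepted) / n_prop
```

The acceptance rate comes out around 70%, consistent with the next slide.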
Performance of the rejection sampler using a beta approximation to the posterior as the approximator
(see R script for code)

p(θ) = logitN(logit(0.5), 5): acceptance rate 71.5%
p(θ) = logitN(logit(0.5), 1): acceptance rate 70.7%
Proof that rejection sampling works
see separate notes
Rejection sampler Example 2: Population estimation with known capture probability

Consider the following problem: X "successes" are observed from a sequence of Bernoulli trials with known success probability θ but unknown number of trials, N. We need to estimate N.

E.g. N is the unknown size of a population, X is the observed number of "captures" from the population, and θ is the capture probability, assumed known here but in practice estimated by clever methods (e.g. census coverage).

The likelihood for this problem is
\[
p(X|N, \theta) = \binom{N}{X} \theta^X (1-\theta)^{N-X}
\]
i.e. a Binomial likelihood but with N as the unknown.
Notes on the likelihood

\[
p(X|N, \theta) = \binom{N}{X} \theta^X (1-\theta)^{N-X} \propto \binom{N}{X} (1-\theta)^{N-X} \tag{9}
\]

1. The "data" in this example is just X, a single count, hence not boldfaced.
2. The likelihood is zero for values of N < X, so there is no support from the data for values of the population size less than the observed count.
The posterior for the unknown number of trials

We want p(N|X, θ), which is shorthand for Pr(N = n|X, θ), for n = 0, 1, . . .. Combining the binomial likelihood (9) with a prior for N, p(N), (keep in mind N is discrete) leads to
\[
\Pr(N = n|X, \theta) =
\frac{\binom{n}{X} \theta^X (1-\theta)^{n-X} \Pr(N = n)}
{\sum_m \binom{m}{X} \theta^X (1-\theta)^{m-X} \Pr(N = m)} \tag{10}
\]
and we need to compute this for every possible n. How can we simulate a sample of values from this posterior pmf?
The inverse distribution method

If our prior for N is discrete uniform or some other bounded distribution, we can tabulate the entire posterior distribution. This means we can also tabulate the cumulative distribution function, which is just
\[
\Pr(N \le q|X, \theta) = \sum_{n=n_{\min}}^{q} \Pr(N = n|X, \theta) \tag{11}
\]
In this case (bounded prior) we could use the inverse distribution method to simulate the posterior for N. If the cdf F is known and U ~ Uniform(0, 1), then F⁻¹(U) is a draw from the target distribution (i.e. the distribution for which F is the cdf).

But if there is no a priori bound, or the bound is very large, this could be computationally inefficient.
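For a bounded prior the tabulate-and-invert approach is only a few lines. This Python sketch borrows X = 4350 and θ = 0.87 from the later example; the prior bounds L1, L2 are assumptions chosen to comfortably cover the posterior, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

X, theta = 4350, 0.87          # values from the later example slide
L1, L2 = 4350, 6320            # assumed prior bounds, chosen for illustration

# Tabulate the posterior pmf: proportional to the binomial likelihood in n,
# and zero for n < X
n = np.arange(max(L1, X), L2 + 1)
log_w = stats.binom.logpmf(X, n, theta)
pmf = np.exp(log_w - log_w.max())
pmf /= pmf.sum()

# Inverse distribution method: F^{-1}(U) by searching the tabulated cdf
rng = np.random.default_rng(3)
cdf = np.cumsum(pmf)
u = rng.uniform(size=5000)
idx = np.minimum(np.searchsorted(cdf, u), n.size - 1)  # guard against cdf[-1] < 1 in floating point
posterior_draws = n[idx]
```

The cost is one likelihood evaluation per support point, which is what becomes expensive when the prior bound is very large.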
Back to the unknown N, known p (aka θ) problem

We have
\[
\Pr(N = n|X, \theta) =
\frac{\binom{n}{X} \theta^X (1-\theta)^{n-X} \Pr(N = n)}
{\sum_m \binom{m}{X} \theta^X (1-\theta)^{m-X} \Pr(N = m)},
\quad n = 0, 1, 2, \ldots \tag{12}
\]
Suppose p(N) is defined to be discrete uniform on some suitably chosen interval, N ~ Uniform(L₁, L₂); then
\[
\Pr(N = n|X, \theta) =
\frac{\binom{n}{X} \theta^X (1-\theta)^{n-X}}
{\sum_m \binom{m}{X} \theta^X (1-\theta)^{m-X}};
\quad L_1 \le n \le L_2 \tag{13}
\]
\[
\propto \binom{n}{X} \theta^X (1-\theta)^{n-X}; \quad L_1 \le n \le L_2 \tag{14}
\]
\[
= 0; \quad n < L_1,\; n > L_2 \tag{15}
\]
i.e. the posterior pmf is just the likelihood "normalised" to sum to 1.
Rejection sampling for N assuming a uniform prior

For given success probability θ and observed count X, the value of N that maximises the likelihood is N̂ = X/θ.

If g(N) is also uniform on (L₁, L₂), then the previous argument suggests the following rejection sampling algorithm:

1. Set \(M = p(X|N = \hat{N}, \theta) = \binom{\hat{N}}{X} \theta^X (1-\theta)^{\hat{N}-X}\).
2. Draw N* uniformly from U(L₁, L₂).
3. Accept with probability \(\binom{N^*}{X} \theta^X (1-\theta)^{N^*-X} \big/ M\).
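A Python sketch of this algorithm, standing in for the R script (the prior bounds L1, L2 are assumptions; X and θ come from the next slide, where X/θ = 5000 is an integer so no rounding issue arises for N̂). Proposals are drawn in batches for speed, since a uniform proposal over a wide range is accepted only a few percent of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
X, theta = 4350, 0.87      # the values used on the next slide
L1, L2 = 4350, 6320        # assumed bounds for the discrete-uniform prior

# Step 1: M is the likelihood at N_hat = X / theta (here 4350 / 0.87 = 5000 exactly)
N_hat = round(X / theta)
log_M = stats.binom.logpmf(X, N_hat, theta)

def rejection_sample_N(n_draws, batch=50_000):
    draws = np.empty(0, dtype=np.int64)
    while draws.size < n_draws:
        N_star = rng.integers(L1, L2 + 1, size=batch)         # step 2: uniform proposals
        log_r = stats.binom.logpmf(X, N_star, theta) - log_M  # step 3: log acceptance prob
        keep = np.log(rng.uniform(size=batch)) <= log_r
        draws = np.concatenate([draws, N_star[keep]])
    return draws[:n_draws]

possamp = rejection_sample_N(5000)
```

The sample quantiles of `possamp` should be close to the rejection-sampler quantiles reported on the quantiles slide (median 5000, 2.5% near 4945, 97.5% near 5053).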
Posterior for N given X = 4350, θ = 0.87
[figure omitted]
Posterior quantiles

probs <- c(0.025, 0.05, 0.5, 0.95, 0.975)
quant_rejection <- quantile(possamp, probs = probs)

                  2.5%       5%        50%       95%       97.5%
quant_rejection   4945       4956      5000      5044      5053
true_quantiles    4947       4956      5000      5045      5054
prior_quantiles   4400.000   4444.000  5335.000  6243.050  6288.125

Concluding remarks on rejection sampling

- Potentially inefficient if the approximating density is a poor approximation.
- A poor approximation is immediately revealed by low acceptance rates.
- An "adaptive" variant of rejection sampling is discussed in the literature; the idea is to improve the approximation as the simulation progresses.
- Rejection sampling struggles as the dimension of the problem increases.