Rejection sampling, importance sampling, and an introduction to Markov chain Monte Carlo
Bayesian Statistics, Statistics 4224/5224, Spring 2021
February 25, 2021
The following notes summarize Sections 10.3–10.4 and the introduction to Chapter 11 of Bayesian Data Analysis, by Andrew Gelman et al.
Direct approximation by calculating at a grid of points
For the simplest discrete approximation, compute the target density, p(θ|y), at a set of evenly spaced values θ1, . . . , θN that cover a broad range of the parameter space for θ, then approximate the continuous p(θ|y) by the discrete density at θ1, . . . , θN, with probabilities

$$p(\theta_i|y) \Big/ \sum_{j=1}^{N} p(\theta_j|y).$$
Because the approximate density must be normalized anyway, this method will work just as well using an unnormalized density function, q(θ|y), in place of p(θ|y).
Once the grid of density values is computed, a set of independent random draws θ1, . . . , θS from p(θ|y) is obtained by invoking the R function sample (a code sketch follows this list), with arguments
• x set to the vector of values θ1,…,θN,
• size set to the number S of samples desired,
• replace=T and
• prob set to the posterior density values p(θi|y) for i = 1, . . . , N .
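For instance, a minimal sketch in R, assuming an unnormalized toy target q(θ) = θ^2(1 − θ)^3 on (0, 1) and a grid of N = 100 points (all choices purely illustrative):

theta.grid <- seq(.005, .995, by=.01)          # evenly spaced grid points
q <- function(theta) theta^2 * (1 - theta)^3   # unnormalized target density
p <- q(theta.grid) / sum(q(theta.grid))        # normalized grid probabilities
draws <- sample(x=theta.grid, size=1000, replace=TRUE, prob=p)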
This method works well when the grid points θi are spaced closely enough and their range misses nothing important beyond its boundaries.
The discrete approximation is more difficult to use in higher-dimensional multivariate problems, when computing at every point in a dense multidimensional grid becomes prohibitively expensive.
Simulating from predictive distributions
Once we have sampled from the posterior distribution, p(θ|y), it is typically easy to draw from the predictive distribution of unobserved or future data, ỹ.
For each draw θs from the posterior distribution, s = 1, . . . , S, just draw ỹs from the predictive distribution p(ỹ|θs).

The set of simulated draws ỹ1, . . . , ỹS characterizes the posterior predictive distribution.
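For instance, a minimal sketch in R, assuming a Normal(θ, 1) sampling model and stand-in posterior draws of θ (both purely illustrative):

theta <- rnorm(1000, mean=5, sd=0.2)       # stand-in posterior draws theta^1, ..., theta^S
y.tilde <- rnorm(1000, mean=theta, sd=1)   # one y~^s drawn from p(y~|theta^s) for each s
hist(y.tilde)                              # approximates the posterior predictive distribution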
Posterior predictive distributions are crucial to the model-checking step of a complete Bayesian data analysis.
Rejection sampling
Suppose we want to obtain a single random draw from a density p(θ|y), or perhaps an unnormalized density q(θ|y), where
$$p(\theta|y) = \frac{q(\theta|y)}{\int q(\theta|y)\,d\theta}.$$
In the following description we use p(θ|y) to represent the target distribution, but we could just as well work with the unnormalized form q(θ|y) instead.
To perform rejection sampling, we require a positive function g(θ) defined for all θ for which p(θ|y) > 0 that has the following properties:
• We can draw from the probability density proportional to g(θ).
It is not required that g(θ) integrate to 1, but g(θ) must have a finite integral.
• The importance ratio p(θ|y)/g(θ) must have a known bound. That is, there must be some known constant M for which

$$\frac{p(\theta|y)}{g(\theta)} \le M \quad \text{for all } \theta.$$
The rejection sampling algorithm proceeds in two steps.
1. Sample θ at random from the probability density proportional to g(θ).
2. With probability p(θ|y)/[Mg(θ)], accept θ as a draw from p(θ|y). If the drawn θ is rejected, return to step 1.
An accepted θ has the correct distribution, p(θ|y).
That is, the distribution of drawn θ, conditional on it being accepted, is p(θ|y).
(The boundedness condition is necessary so that the probability in step 2 is not greater than 1.)
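For instance, a minimal sketch in R, assuming an unnormalized toy target q(θ) = θ^2(1 − θ)^3 on (0, 1) and a Uniform(0,1) proposal, so that g(θ) = 1 (all choices purely illustrative):

q <- function(theta) theta^2 * (1 - theta)^3
M <- optimize(q, c(0, 1), maximum=TRUE)$objective   # known bound on q/g
draws <- replicate(1000, {
  repeat {
    theta <- runif(1)                    # step 1: draw theta from g
    if (runif(1) < q(theta) / M) break   # step 2: accept with probability q/(M g)
  }
  theta
})

Because g is constant here, the acceptance probability q(θ)/M is largest near the mode of q, and draws far from the mode are usually rejected.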
Figure 10.1 on page 264 of Gelman et al illustrates rejection sampling.
A good proposal density g(θ) for rejection sampling should be roughly proportional to p(θ|y).
The ideal situation is g ∝ p, in which case, with a suitable value of M, we can accept every draw with probability 1.
When g is not nearly proportional to p, the bound M must be set so large that almost all draws obtained in step 1 will be rejected in step 2.
Importance sampling
Importance sampling is a method, related to rejection sampling and a precursor to the Metropolis algorithm, that is used for computing expectations using a random sample drawn from an approximation to the target distribution.
Suppose we are interested in
$$E[h(\theta)|y] = \int h(\theta)\,p(\theta|y)\,d\theta,$$
but we cannot generate random draws of θ from p(θ|y), and thus cannot evaluate the integral by a simple average of simulated values.
If g(θ) is a probability density from which we can generate random draws, then we can write
$$E[h(\theta)|y] = \frac{\int h(\theta)\,q(\theta|y)\,d\theta}{\int q(\theta|y)\,d\theta} = \frac{\int \big[h(\theta)\,q(\theta|y)/g(\theta)\big]\,g(\theta)\,d\theta}{\int \big[q(\theta|y)/g(\theta)\big]\,g(\theta)\,d\theta},$$
which can be estimated using S draws θ1, . . . , θS from g(θ) by the expression

$$\frac{\frac{1}{S}\sum_{s=1}^{S} h(\theta^s)\,w(\theta^s)}{\frac{1}{S}\sum_{s=1}^{S} w(\theta^s)},$$

where the factors

$$w(\theta^s) = \frac{q(\theta^s|y)}{g(\theta^s)}$$

are called importance ratios or importance weights.
It is generally advisable to use the same set of random draws for both the numerator and denominator in order to reduce the sampling error in the estimate.
If g(θ) can be chosen so that q/g is roughly constant, then fairly precise estimates of the integral can be obtained.
Importance sampling is not a useful method if the importance ratios vary substantially.
The worst possible scenario occurs when the importance ratios are small with high probability but huge with low probability, which happens, for example, if q has wide tails compared with g.
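For instance, a minimal sketch in R of the self-normalized estimate above, assuming an unnormalized standard normal target q(θ) = exp(−θ^2/2) and a heavier-tailed t3 proposal g (both purely illustrative):

q <- function(theta) exp(-theta^2 / 2)   # unnormalized target
S <- 1000
theta <- rt(S, df=3)                     # draws theta^1, ..., theta^S from g
w <- q(theta) / dt(theta, df=3)          # importance ratios w(theta^s) = q/g
sum(theta * w) / sum(w)                  # estimate of E(theta)   (true value 0)
sum(theta^2 * w) / sum(w)                # estimate of E(theta^2) (true value 1)

Here g has wider tails than q, so the importance ratios stay bounded and the estimates are stable.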
Efficiency of importance sampling
The effective sample size can be estimated by the approximation

$$S_{\text{eff}} = \frac{1}{\sum_{s=1}^{S} \big(\tilde{w}(\theta^s)\big)^2},$$

where the w̃(θs) are normalized weights, that is,

$$\tilde{w}(\theta^s) = \frac{w(\theta^s)}{\sum_{s'=1}^{S} w(\theta^{s'})}.$$
The effective sample size Seff is small if there are a few extremely high weights, which would unduly influence the distribution.
If the distribution has occasional very large weights, however, this estimate is itself noisy; it can thus be taken as no more than a rough guide.
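Continuing the importance sampling sketch above, the estimated effective sample size is computed from the normalized weights:

w.tilde <- w / sum(w)        # normalized weights
S.eff <- 1 / sum(w.tilde^2)  # estimated effective sample size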
Example
Consider the bivariate normal data from Problem 2 of the first homework assignment,
> x <- c(-3.3, +0.1, -1.1, +2.7, +2.0, -0.4)
> y <- c(-2.6, -0.2, -1.5, +1.5, +1.9, -0.3)
> n <- length(y)
where X and Y are known to have marginal means 0 and variances 1, and their covariance is known to be positive.
See Courseworks → Files → Examples → Example10b.
Assuming a Uniform(0,1) prior on the correlation coefficient θ, an unnormalized version of the posterior density is
$$q(\theta|x,y) = (1-\theta^2)^{-n/2} \exp\left\{-\frac{1}{2(1-\theta^2)} \sum_{i=1}^{n} \left(x_i^2 - 2\theta x_i y_i + y_i^2\right)\right\}$$

for 0 < θ < 1.

(a) Use a numerical optimization routine to find the MAP estimator of θ.
(b) Approximate E(θ|x,y) and E(θ^2|x,y) by quadrature, evaluating the integrand on the discrete grid of values θ ∈ {.005, .015, .025, . . . , .985, .995}.
(c) Approximate E(θ|x,y) and E(θ^2|x,y) by Monte Carlo integration, sampling from the discrete grid of values θ ∈ {.005, .015, .025, . . . , .985, .995}, adding random jitter to each draw to better approximate sampling from a continuous distribution.
(d) Approximate E(θ|x,y) and E(θ^2|x,y) by rejection sampling, using a Uniform(0,1) proposal distribution.
(e) Approximate E(θ|x,y) and E(θ^2|x,y) by importance sampling, using 1000 draws from the Uniform(0,1) proposal distribution.
(f) Compute an estimate of the effective sample size for importance sampling.
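A minimal sketch in R of parts (a), (b), (d), (e), and (f), using the data above; part (c) is analogous to the grid sampler shown earlier, Example10b on Courseworks contains the full worked solution, and this code is only one plausible implementation:

q <- function(theta)   # unnormalized posterior, scalar theta
  (1 - theta^2)^(-n/2) *
    exp(-sum(x^2 - 2*theta*x*y + y^2) / (2*(1 - theta^2)))

# (a) MAP estimator by numerical optimization
theta.map <- optimize(q, c(0, 1), maximum=TRUE)$maximum

# (b) quadrature on the grid
theta.grid <- seq(.005, .995, by=.01)
p <- sapply(theta.grid, q)
p <- p / sum(p)
c(sum(theta.grid * p), sum(theta.grid^2 * p))   # E(theta|x,y), E(theta^2|x,y)

# (d) rejection sampling with a Uniform(0,1) proposal (g = 1)
M <- q(theta.map)   # bound on q/g
theta.rs <- replicate(1000, {
  repeat { th <- runif(1); if (runif(1) < q(th) / M) break }
  th
})
c(mean(theta.rs), mean(theta.rs^2))

# (e) importance sampling with a Uniform(0,1) proposal
theta.is <- runif(1000)
w <- sapply(theta.is, q)   # importance ratios q/g with g = 1
c(sum(theta.is * w) / sum(w), sum(theta.is^2 * w) / sum(w))

# (f) estimated effective sample size
w.tilde <- w / sum(w)
1 / sum(w.tilde^2)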
Basics of Markov chain Monte Carlo
Markov chain simulation, also called Markov chain Monte Carlo or MCMC, is a general method based on drawing values of θ from approximate distributions and then correcting those draws to better approximate the target posterior distribution.
Sampling is done sequentially, with the distribution of the sampled draws depending on the last value drawn.
Markov chain simulation is used when it is not possible (or not computationally efficient) to sample θ from p(θ|y); instead we sample iteratively in such a way that at each step of the process we expect to draw from a distribution that becomes closer to p(θ|y).
The key to MCMC is to create a Markov process whose stationary distribution is the specified p(θ|y) and to run the simulation long enough that the distribution of the current draws is close enough to the stationary distribution.
For any specific p(θ|y), or unnormalized density q(θ|y), a variety of Markov chains with the desired property can be constructed (Sections 11.1–11.3 of Gelman et al).
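As a preview of those constructions, here is a minimal random-walk Metropolis sketch in R; the toy target kernel q(θ) = exp(−θ^2/2), the proposal scale, and the chain length are all purely illustrative:

q <- function(theta) exp(-theta^2 / 2)   # unnormalized target
S <- 5000
theta <- numeric(S)
theta[1] <- 0                            # starting value
for (s in 2:S) {
  prop <- rnorm(1, mean=theta[s-1], sd=1)   # propose a local move
  r <- q(prop) / q(theta[s-1])              # acceptance ratio
  theta[s] <- if (runif(1) < r) prop else theta[s-1]
}

The stationary distribution of this chain is the normalized version of q; early draws depend on the starting value and are discarded in practice.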
Once the simulation algorithm has been implemented and the simulations drawn, it is absolutely necessary to check the convergence of the simulated sequences, and estimate the effective number of simulation draws (Sections 11.4–11.5).