Stat314 Term 4: Metropolis and Metropolis-Hastings algorithms for posterior computation
September, 2021
The basic idea of Metropolis type algorithms for sampling
the posterior: Informal description
i Start at some point in the parameter space.
ii Propose a jump to another point by drawing a proposed value from some "jumping density."
iii If the proposed jump is to a point with higher posterior density than the current point, move to the proposed value (accept the proposal). If the proposed jump is to a point with lower posterior density than the current value, accept the proposed jump with a probability related to the ratio of the unnormalised posterior densities at the proposed and current values. If the proposed jump is rejected, stay at the current value.
iv Repeat steps (ii) to (iii) until the desired simulation size is reached.
Summary of the basic idea of Metropolis sampling
The sampler "explores the parameter space", but in a manner guided by the (unnormalized) posterior density: always move to a point with higher posterior density; sometimes move to a point with lower density.
As this process evolves over a large number of iterations we end up
with a large number of points from areas of high posterior density and
a small number of points from areas of low posterior density.
Exactly what we want!
Animation: https://chi-feng.github.io/mcmc-demo/
Background on the Metropolis and Metropolis-Hastings algorithm (i)
The Metropolis algorithm is named after Nicholas Metropolis, a pioneer of Monte Carlo methods in physics along with Stanislaw Ulam and John von Neumann. The ideas were developed in the 1940s, but their general application to Bayesian computation was not recognised until the 1990s.
This algorithm assumes a symmetric jumping density, i.e. a density satisfying
$$J_t(\theta^{(a)} \mid \theta^{(b)}) = J_t(\theta^{(b)} \mid \theta^{(a)}) \quad \forall\, \theta^{(a)}, \theta^{(b)},$$
where by $\theta^{(a)}, \theta^{(b)}$ we mean two particular values of $\theta$ (i.e. two points in the parameter space).
Background on the Metropolis and Metropolis-Hastings
algorithm (ii)
The normal density, with fixed precision, is the most commonly used
example of a symmetric jumping density.
$$[\theta^{(a)} \mid \theta^{(b)}] \sim N(\theta^{(b)}, \tau) \tag{1}$$
(where $\tau$ is the precision) implies
$$p(\theta^{(a)} \mid \theta^{(b)}) = \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\!\left(-\frac{\tau(\theta^{(a)} - \theta^{(b)})^2}{2}\right) \tag{2}$$
$$= \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\!\left(-\frac{\tau(\theta^{(b)} - \theta^{(a)})^2}{2}\right) \tag{3}$$
so $p(\theta^{(a)} \mid \theta^{(b)})$ is symmetric in $\theta^{(a)}$ and $\theta^{(b)}$.
The Metropolis-Hastings algorithm allows asymmetric jumping densities. Introduced by Hastings (1970).
The Metropolis-Hastings algorithm
1 Draw $\theta^{(0)}$ from an initial density $g_0(\theta)$; we require $p(\theta^{(0)} \mid \text{data}) > 0$ (the posterior has positive density at the generated initial point).
2 For $t = 1, 2, \ldots$, sample a proposal $\theta^{*}$ from a jumping (or proposal) distribution $J_t(\theta^{*} \mid \theta^{(t-1)})$.
3 Calculate the Metropolis-Hastings acceptance ratio
$$r_{MH}(\theta^{*}, \theta^{(t-1)}) = \frac{q(\theta^{*} \mid \text{data}) \,/\, J_t(\theta^{*} \mid \theta^{(t-1)})}{q(\theta^{(t-1)} \mid \text{data}) \,/\, J_t(\theta^{(t-1)} \mid \theta^{*})} \tag{4}$$
4 If $r_{MH}(\theta^{*}, \theta^{(t-1)}) > 1$ set $\theta^{(t)} = \theta^{*}$; else set
$$\theta^{(t)} = \begin{cases} \theta^{*} & \text{with probability } r_{MH}(\theta^{*}, \theta^{(t-1)}) \\ \theta^{(t-1)} & \text{otherwise} \end{cases}$$
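To make the steps concrete, here is a minimal R sketch of the algorithm for a scalar parameter. The function names (mh_sample, log_q, r_jump, d_jump) are illustrative, not from the course code; log_q is assumed to return the log of the unnormalised posterior $q(\theta \mid \text{data})$.

```r
## Minimal Metropolis-Hastings sketch for a one-dimensional theta.
## log_q(theta): log unnormalised posterior log q(theta | data).
## r_jump(from): draws one proposal given the current value.
## d_jump(to, from): log density of the jumping distribution J_t(to | from).
mh_sample <- function(log_q, r_jump, d_jump, theta0, n_iter = 5000) {
  draws <- numeric(n_iter)
  theta_cur <- theta0
  for (t in 1:n_iter) {
    theta_prop <- r_jump(theta_cur)                           # step 2: propose
    log_r <- (log_q(theta_prop) - d_jump(theta_prop, theta_cur)) -
             (log_q(theta_cur)  - d_jump(theta_cur,  theta_prop))  # step 3, eqn (4), on the log scale
    if (log(runif(1)) < log_r) theta_cur <- theta_prop        # step 4: accept ...
    draws[t] <- theta_cur                                     # ... otherwise stay put
  }
  draws
}
```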
Simplification of $r_{MH}$ for a symmetric jumping density
For a symmetric $J_t(\theta \mid \theta^{(t-1)})$,
$$r_{MH}(\theta^{*}, \theta^{(t-1)}) = \frac{q(\theta^{*} \mid \text{data}) \,/\, J_t(\theta^{*} \mid \theta^{(t-1)})}{q(\theta^{(t-1)} \mid \text{data}) \,/\, J_t(\theta^{(t-1)} \mid \theta^{*})} \tag{5}$$
$$= \frac{q(\theta^{*} \mid \text{data})}{q(\theta^{(t-1)} \mid \text{data})} \tag{6}$$
So the M-H acceptance ratio depends only on the ratio of the values of the unnormalized posterior density at the proposed new point and the current point.
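Continuing the hypothetical mh_sample() sketch above: with a normal jump centred on the current value, the two $J_t$ terms cancel, so only the log unnormalised posterior is needed; working on the log scale also avoids numerical underflow of very small density values. A small illustrative usage:

```r
## Random-walk Metropolis as a special case of mh_sample() above:
## a normal jump centred on the current value is symmetric, so the two
## d_jump() terms cancel in the acceptance ratio.
jump_sd <- 0.5                                    # illustrative value; tuning is discussed later
r_jump  <- function(from) rnorm(1, mean = from, sd = jump_sd)
d_jump  <- function(to, from) dnorm(to, mean = from, sd = jump_sd, log = TRUE)
log_q   <- function(theta) -theta^2 / 2           # toy target: unnormalised N(0, 1)
draws   <- mh_sample(log_q, r_jump, d_jump, theta0 = 0, n_iter = 2000)
```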
Notes on implementation of the Metropolis algorithm
To determine whether a proposal is accepted, draw $U \sim \text{Uniform}(0, 1)$; if $U \leq r_{MH}(\theta^{*}, \theta^{(t-1)})$ then accept, setting $\theta^{(t)} = \theta^{*}$; else set $\theta^{(t)} = \theta^{(t-1)}$.
Early draws in the sequence will not be from the posterior distribution. Run the algorithm until convergence; discard early draws; subsequent draws can be regarded as draws from the target posterior.
The draws are correlated – this means the effective Monte Carlo sample size is less than the nominal size.
Detecting convergence is difficult. The best approach is to run multiple chains from different starting points; inspect trace plots; use the Gelman-Rubin statistic ($\hat{R}$).
Diagnosing convergence of MCMC procedures
Run multiple chains from different starting values; discard early draws (e.g. the first 50%); compare within- and between-chain variation, formalised in the Gelman-Rubin statistic ($\hat{R}$) – see next slide.
Inspect trace plots (see below for examples).
If only a single chain is run, look at the trace plot; experiment with different starting values; experiment with different burn-in periods.
Illustration of trace plots (1) [figure not reproduced]
Illustration of trace plots (2) [figure not reproduced]
Convergence problems: Mixing and stationarity
See Figure 11.3 of Gelman et al., BDA, 3rd edition, p. 283.
Gelman-Rubin diagnostic for detecting MCMC sampler
convergence
Suppose we have m chains of length K after discarding the burn-in
period.
Split each chain in two, so now we have 2m chains of length K/2.
We work with these split chains.
If the chains have reached their stationary distribution by the end of the burn-in, the two halves of each post-burn-in chain should be similar.
Let $W$ be the within-chain variance (compute the variance of the draws for each chain and average over chains), and $B$ the between-chain variance ($K/2$ times the variance of the chain means).
$$V^{+} = \frac{K/2 - 1}{K/2}\, W + \frac{1}{K/2}\, B$$
$$\hat{R} = \sqrt{\frac{V^{+}}{W}} \to 1 \quad \text{as } K \to \infty$$
Values close to 1 suggest convergence.
Supported by asymptotic arguments (see Gelman et al., BDA).
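As a concrete illustration of the recipe above, here is a small R sketch of the split-$\hat{R}$ computation for a matrix of post-burn-in draws with one column per chain; the function name and layout are my own, not from the course materials.

```r
## Split R-hat for a matrix of post-burn-in draws: K rows (iterations),
## m columns (chains). Follows the recipe on this slide.
split_rhat <- function(draws) {
  K <- nrow(draws)
  half <- K %/% 2
  ## split each chain in two, giving 2m chains of length K/2
  split_draws <- cbind(draws[1:half, , drop = FALSE],
                       draws[(half + 1):(2 * half), , drop = FALSE])
  n <- nrow(split_draws)                  # length of each split chain (K/2)
  chain_means <- colMeans(split_draws)
  chain_vars  <- apply(split_draws, 2, var)
  W <- mean(chain_vars)                   # within-chain variance
  B <- n * var(chain_means)               # between-chain variance
  V_plus <- (n - 1) / n * W + B / n
  sqrt(V_plus / W)
}

## Illustrative use with 4 chains of independent N(0,1) "draws";
## the result should be close to 1.
set.seed(1)
draws <- matrix(rnorm(4000), ncol = 4)
split_rhat(draws)
```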
Choice of initial values
Not important in terms of the final posterior sample and hence the posterior inferences. The sampler will move away from points with low support in the posterior; we discard early draws.
However, initial values should be plausible, to avoid numerical issues in the early stages of sampling.
Ideally, in the multiple-chain setting, starting values should be generated from an overdispersed approximation to the posterior. The theory behind $\hat{R}$ is based on this idea. Sometimes it is hard to find an approximation to the posterior.
Choice of initial values cont’d
If we can maximize the unnormalized posterior density with respect to the parameters to find the mode (the same as the mode of the normalized posterior), an approximation can often be based on a normal or t distribution centred on the mode, with variance determined by the shape of the posterior near the mode – the inverse of the negative second-derivative matrix of the log posterior at the mode. See the R function laplace() in the LearnBayes package.
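A minimal sketch of this mode-plus-curvature approximation using base R's optim() with a numerically computed Hessian; LearnBayes::laplace() provides similar functionality. The function and argument names here are illustrative.

```r
## Normal approximation at the posterior mode, assuming log_q(theta)
## returns the log unnormalised posterior.
normal_approx <- function(log_q, start) {
  neg_log_q <- function(theta) -log_q(theta)          # minimise the negative log posterior
  fit <- optim(start, neg_log_q, method = "BFGS", hessian = TRUE)
  list(mode = fit$par,                                 # posterior mode
       var  = solve(fit$hessian))                      # inverse of the negative second-derivative matrix
}

## Hypothetical use: overdispersed starting values for several chains
## could then be drawn from, e.g., N(mode, 4 * var).
```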
Binomial-logit-normal example
Six Binomial experiments, each of size 10. The number of successes is $Y = (6, 7, 5, 5, 4, 8)$.
$$Y_i \mid N, \theta \overset{\text{indep}}{\sim} \text{Binomial}(\theta, N)$$
$$\log\!\left(\frac{\theta}{1-\theta}\right) \sim \text{Normal}(\mu, \sigma^2)$$
Equivalently, we could write this as an intercept-only logistic regression model:
$$\log\!\left(\frac{\theta}{1-\theta}\right) = \beta_0 \tag{7}$$
$$\beta_0 \sim \text{Normal}(\mu, \sigma^2) \tag{8}$$
Note
$$\eta = \log\!\left(\frac{\theta}{1-\theta}\right) \iff \theta = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(-\eta)} \tag{9}$$
We write $\text{logit}(\theta)$ for $\log\!\left(\frac{\theta}{1-\theta}\right)$ and $\text{invlogit}(\eta)$ for $\frac{\exp(\eta)}{1+\exp(\eta)} = \frac{1}{1+\exp(-\eta)}$.
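In R these transformations can be written directly (base R also provides them as qlogis() and plogis()); a small sketch:

```r
## logit and inverse-logit transformations used below.
## Base R equivalents: qlogis() and plogis().
logit    <- function(theta) log(theta / (1 - theta))
invlogit <- function(eta)   1 / (1 + exp(-eta))

logit(0.6)            # eta corresponding to theta = 0.6
invlogit(logit(0.6))  # recovers 0.6
```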
Likelihood and posterior for Binomial parameterised in
terms of the logit (η)
Suppose
$$Y_i \mid \theta, N_i \overset{\text{indep}}{\sim} \text{Binomial}(\theta, N_i), \quad i = 1, \ldots, K$$
$$\eta = \text{logit}(\theta); \quad \eta \sim \text{Normal}(\mu, \sigma^2) \tag{10}$$
$$p(\mathbf{Y} \mid \eta, \mathbf{N}) = \prod_i \binom{N_i}{Y_i} \text{invlogit}(\eta)^{Y_i} \left(1 - \text{invlogit}(\eta)\right)^{N_i - Y_i}$$
$$\propto \prod_i \text{invlogit}(\eta)^{Y_i} \left(1 - \text{invlogit}(\eta)\right)^{N_i - Y_i}$$
so
$$p(\eta \mid \mathbf{Y}, \mathbf{N}) \propto \left[\prod_i \text{invlogit}(\eta)^{Y_i} \left(1 - \text{invlogit}(\eta)\right)^{N_i - Y_i}\right] \times \text{Normal}(\eta \mid \mu, \sigma^2)$$
Not a conjugate model.
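A minimal R sketch of the log of this unnormalised posterior, using the data above; the hyperparameter values mu = 0 and sigma = 1.5 are illustrative choices, not values given on the slide.

```r
## Data from the six binomial experiments on the previous slide.
Y <- c(6, 7, 5, 5, 4, 8)
N <- rep(10, 6)

## Log unnormalised posterior q(eta | Y, N) for the binomial-logit-normal model.
## mu and sigma are illustrative hyperparameter values.
log_q <- function(eta, mu = 0, sigma = 1.5) {
  theta <- plogis(eta)  # invlogit
  sum(dbinom(Y, size = N, prob = theta, log = TRUE)) +
    dnorm(eta, mean = mu, sd = sigma, log = TRUE)
}
```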
Setting up the Metropolis sampler for the Binomial-logit-normal model
We will draw initial values from a normal approximation based on the posterior mode and the inverse of the negative second derivative of the log-posterior, $V_{\text{approx}}$.
The jumping distribution will be a normal centred on the current value (hence symmetric),
$$J_t(\eta^{*} \mid \eta^{(t-1)}) = \text{Normal}(\eta^{*} \mid \eta^{(t-1)}, v)$$
for some suitably chosen variance $v$ (part of the art of setting up Metropolis samplers). We will use $v = 2.4^2 V_{\text{approx}}$.
Accept jumps with probability
$$\min\left(1, \frac{q(\eta^{*} \mid \mathbf{Y}, \mathbf{N})}{q(\eta^{(t-1)} \mid \mathbf{Y}, \mathbf{N})}\right)$$
See the R code MH_example1_binomlogit_2021.r for an example of the computations.
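The course file named above contains the actual implementation; what follows is only a hedged sketch of the whole procedure, reusing the illustrative log_q() and normal_approx() functions from the earlier sketches.

```r
## Metropolis sampler for the binomial-logit-normal example, reusing the
## illustrative log_q() and normal_approx() sketches above. Names and
## settings are my own; MH_example1_binomlogit_2021.r is the reference code.
approx   <- normal_approx(log_q, start = 0)
v_approx <- as.numeric(approx$var)
v_jump   <- 2.4^2 * v_approx                         # jumping variance 2.4^2 * V_approx
eta_cur  <- rnorm(1, mean = approx$mode, sd = sqrt(v_approx))  # initial value from the approximation

n_iter    <- 10000
eta_draws <- numeric(n_iter)
n_accept  <- 0
for (t in 1:n_iter) {
  eta_prop <- rnorm(1, mean = eta_cur, sd = sqrt(v_jump))   # symmetric normal jump
  if (log(runif(1)) < log_q(eta_prop) - log_q(eta_cur)) {   # simplified M-H ratio, eqns (5)-(6)
    eta_cur  <- eta_prop
    n_accept <- n_accept + 1
  }
  eta_draws[t] <- eta_cur
}
n_accept / n_iter                                    # acceptance rate
theta_draws <- plogis(eta_draws[-(1:2000)])          # discard burn-in, back-transform to theta
```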
Tuning Metropolis-Hastings algorithms (i)
If jumps are too small the chain will move too slowly through the space; it may not traverse the full space; convergence may be too slow. Depending on the circumstances, high acceptance rates can be an indicator that the jumps are too small.
If jumps are too big, acceptance rates will be low and the sampler inefficient.
For approximately normal posteriors, Gelman et al. (BDA) suggest a (multivariate) normal jumping density centred on the current value with variance given by $V_{\text{jump}} \approx (2.4^2/d)\,\Sigma$, where $d$ is the dimension of $\theta$ and $\Sigma$ is the variance of the approximate normal posterior.
This provides another reason to try to construct an approximation based on the mode and shape of the posterior near the mode.
Tuning Metropolis-Hastings algorithms (ii)
Gelman et al’s suggested jumping density is not a hard and fast rule,
however.
It may be necessary to change the jumping density as the sampler evolves – reduce the jumping density variance if acceptance rates are too low; increase the jumping density variance if acceptance rates are too high.
One possibility is to:
1 start the sampler with Gelman et al’s recommended jumping density
2 after some number of draws change the variance of the jumping density
to be proportional to the variance of the draws generated so far;
3 increase or decrease the jumping density variance depending on acceptance rates.
If tuning is necessary, think in terms of an adaptive phase where the jumping density is tuned, followed by a fixed phase where the tuned jumping density is used and the sampler is run for a long period – only draws from the fixed phase should be used for inference.
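A rough sketch of what such an adaptive phase could look like for a scalar parameter, continuing the earlier illustrative functions; the target acceptance rate of about 0.44 is a common rule of thumb for one-dimensional random-walk samplers, not something stated on these slides.

```r
## Illustrative adaptive phase: adjust the jump sd every 'block' iterations
## based on the acceptance rate in that block, then freeze it.
adapt_jump_sd <- function(log_q, eta0, jump_sd, n_adapt = 2000, block = 100,
                          target = 0.44) {
  eta_cur <- eta0
  accepts <- 0
  for (t in 1:n_adapt) {
    eta_prop <- rnorm(1, eta_cur, jump_sd)
    if (log(runif(1)) < log_q(eta_prop) - log_q(eta_cur)) {
      eta_cur <- eta_prop
      accepts <- accepts + 1
    }
    if (t %% block == 0) {
      rate <- accepts / block
      ## shrink the jumps if acceptance is too low, grow them if too high
      jump_sd <- jump_sd * ifelse(rate < target, 0.8, 1.2)
      accepts <- 0
    }
  }
  jump_sd  # use this fixed value for the inference (fixed-phase) run
}
```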
Final notes on Metropolis-Hastings sampling
The posterior sample obtained comprises correlated draws from the posterior; the effective simulation sample size is therefore less than the nominal simulation sample size, sometimes dramatically so.
Computation of the effective sample size for correlated draws from a distribution is tricky, but some time-series concepts can be used; we won't discuss this here, but the coda package contains an effectiveSize() function which adjusts the nominal size for correlation (see the sketch after this list).
The chains can be thinned to get closer to independent draws if needed.
Two broad classes of jumping densities:
1 Random-walk style – mostly what we have been considering; often these are (multivariate) normal densities centred on the current value.
2 Draws from an approximate posterior density – the Metropolis-Hastings acceptance probabilities then act to correct the approximation. In this case we would like a high acceptance rate.
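A minimal illustration with the coda package, applied to the eta_draws vector from the earlier sketch:

```r
## Effective sample size and thinning with the coda package,
## applied to the illustrative draws from the earlier sketch.
library(coda)
chain <- mcmc(eta_draws)              # wrap the draws as an mcmc object
effectiveSize(chain)                  # nominal size adjusted for autocorrelation
thinned <- window(chain, thin = 10)   # keep every 10th draw if thinning is wanted
```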
Outline proof of the convergence of the Metropolis
algorithm to the posterior distribution
(based on Gelman et al., BDA, 3rd edn., p. 279)
1 Recognise the sequence of draws as a Markov chain; the states of the chain are points in the parameter space.
2 If the Markov chain is aperiodic, not transient and irreducible, then it converges to a unique stationary distribution. In different ways these conditions are all getting at the idea that each state (point) should be eligible to be visited at each iteration. The first two conditions are virtually guaranteed for a random walk on a proper distribution (assuming the jumping distribution is also proper); irreducibility captures the idea that each state should be able to be reached (eventually) from any other state. This amounts to a condition on the jumping distribution – it must eventually be able to jump to all states with positive probability.
Outline proof continued
3 Assuming the sequence of draws constitutes an aperiodic, non-transient and irreducible Markov chain, if we can show that the target posterior distribution is the stationary distribution of the chain, then, in view of the uniqueness of the stationary distribution of the chain, we are done.
Outline proof continued: Definition of stationarity
If $g(\theta)$ is some density such that, if $\theta^{(t-1)}, \theta^{(t)}$ are successive values from the Markov chain $\tilde{\theta}$,
$$p(\theta^{(t-1)} = \theta_a) = g(\theta_a) \Rightarrow p(\theta^{(t)} = \theta_a) = g(\theta_a) \quad \forall\, \theta_a \tag{11}$$
then $g(\cdot)$ is a stationary distribution of the chain. That is, if $\theta^{(t-1)}$ being a draw from the density $g(\cdot)$ implies $\theta^{(t)}$ is also a draw from the density $g(\cdot)$, then $g(\cdot)$ is a stationary distribution of the Markov chain.
Outline proof of convergence of Metropolis algorithm –
cont’d
Since the marginal distributions of exchangeable random variables are identical (see Assignment 4, question 1), if we can show that $\theta^{(t-1)}$ being a draw from $p(\theta \mid \text{data})$ implies $\theta^{(t-1)}, \theta^{(t)}$ are exchangeable, then we are done, since this implies $\theta^{(t)}$ is also a draw from $p(\theta \mid \text{data})$.
See separate notes for details