1 Outline proof of convergence of the Metropolis sampler to the target posterior distribution (based on Gelman et al., BDA, 3rd edn, pp. 279-80)
1.1 Preliminaries
Suppose we are interested in the posterior distribution

p(θ|data) = p(data|θ)p(θ) / ∫ p(data|θ)p(θ) dθ ∝ p(data|θ)p(θ) = q(θ|data)

and propose to approximate this distribution using a Metropolis (or Metropolis-Hastings) algorithm.
Let Jt(θnew|θ(t−1)) denote a symmetric jumping or proposal distribution
from which a jump from the value of θ at the (t − 1)th iteration of
the chain is generated. In this context symmetry means Jt(θnew|θ(t−1)) =
Jt(θ(t−1)|θnew) for all values of θnew and θ(t−1). For the Metropolis-Hastings
algorithm the requirement that Jt() be symmetric is relaxed.
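As an illustrative aside, the standard random-walk proposal θnew = θ(t−1) + ε with ε ∼ N(0, σ²) is symmetric, since the normal density depends only on |θnew − θ(t−1)|. A minimal numerical check (the value of σ is an arbitrary choice for illustration):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def J(theta_new, theta_old, sd=1.0):
    """Random-walk proposal density J(theta_new | theta_old)."""
    return normal_pdf(theta_new, theta_old, sd)

# Symmetry: J(b|a) == J(a|b) for arbitrary points a, b.
a, b = 0.3, -1.7
assert abs(J(b, a) - J(a, b)) < 1e-12
```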
1. Assume the sequence of parameter values θ̃ = (θ(0),θ(1), . . . ,θ(N)) constitutes
a Markov chain, i.e.

p(θ(t)|θ(t−1), . . . ,θ(0)) = p(θ(t)|θ(t−1)) (1)

This holds by construction. (Here superscripts are iteration indices,
not powers.)
2. Assume the chain is
(a) aperiodic (no deterministic movement);
(b) not transient (positive recurrent: some chance of returning to the
initial value);
(c) irreducible (the jumping distribution Jt() is able eventually to
jump to any state).
3. The above conditions imply the Markov Chain has a unique stationary
distribution.
4. Hence the marginal distributions from which the θ values are drawn
converge to a unique stationary distribution.
5. If we can show that the stationary distribution of the Markov Chain is
the target posterior distribution p(θ|data) then we are done.
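The algorithm these conditions refer to can be sketched in a few lines. Below is a minimal random-walk Metropolis sampler for a one-dimensional θ; the standard-normal unnormalized target q and the jump scale are illustrative choices, not part of the text:

```python
import math
import random

def q(theta):
    """Unnormalized target density q(theta|data). A standard normal is
    assumed here purely for illustration; any positive q would do."""
    return math.exp(-0.5 * theta ** 2)

def metropolis(q, theta0, n_iter, jump_sd=1.0, seed=0):
    """Random-walk Metropolis with symmetric N(theta, jump_sd^2) proposals."""
    rng = random.Random(seed)
    theta = theta0
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, jump_sd)  # symmetric jump
        r = q(proposal) / q(theta)                  # acceptance ratio r_M
        if rng.random() < min(1.0, r):              # accept w.p. min(1, r_M)
            theta = proposal
        draws.append(theta)                         # on rejection, repeat old value
    return draws

draws = metropolis(q, theta0=0.0, n_iter=20000)
```

Note that the Markov property (1) holds by construction: each proposal depends only on the current value of the chain, and a Gaussian random walk on the real line is aperiodic and irreducible.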
Definition of stationarity
If g(θ) is some density such that, whenever θ(t−1), θ(t) are successive values from the
Markov Chain θ̃,

p(θ(t−1) = θa) = g(θa) ⇒ p(θ(t) = θa) = g(θa) for all θa (2)

then g is a stationary distribution of the Markov Chain. That is, if θ(t−1) being
a draw from the density g() implies θ(t) is also a draw from the density g(),
then g() is a stationary distribution of the Markov Chain. Here, I am using
notation such as p(θ(t) = θa) to refer to the density function of θ(t) evaluated
at the specific point θa (so p(θ(t) = θa) ∀θa refers to the full density function).
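For a finite state space, the definition in (2) can be checked directly: g is stationary when the marginal distribution is unchanged by one step of the chain. A toy illustration with a hypothetical 3-state transition matrix (all numbers are made up for illustration):

```python
def step_distribution(g, P):
    """One step of the chain at the level of marginal distributions:
    returns the distribution of theta^(t) given theta^(t-1) ~ g."""
    n = len(P)
    return [sum(g[i] * P[i][j] for i in range(n)) for j in range(n)]

# A hypothetical 3-state transition matrix (rows sum to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

# Iterating from any starting distribution converges to the unique
# stationary g, which then satisfies the implication in (2).
g = [1.0, 0.0, 0.0]
for _ in range(200):
    g = step_distribution(g, P)

g_next = step_distribution(g, P)
assert all(abs(a - b) < 1e-10 for a, b in zip(g, g_next))
```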
1.2 Outline Proof that the posterior distribution is the
stationary distribution of the chain
Suppose the chain is aperiodic, not transient and irreducible and has been
running for some time. Consider the chain at the (t − 1)th iteration and
suppose that the marginal distribution of θ(t−1) is the target distribution
p(θ|data), that is p(θ(t−1) = θa) = p(θ = θa|data), for all θa. If we can
show that this implies the marginal distribution of θ(t) is also p(θ|data),
meaning p(θ(t) = θa) = p(θ = θa|data) ∀θa, then we are done. In fact we
need only show that if the (t − 1)th draw in the chain is a draw from the
target posterior distribution then (θ(t−1),θ(t)) are exchangeable, since the
marginals of exchangeable r.v.'s are identical (proved in Assignment 4). Recall
that exchangeability means

p(θ(t) = θa, θ(t−1) = θb) = p(θ(t) = θb, θ(t−1) = θa), for all θa, θb.
We will sometimes just write p(θa|data) to refer to the value of the posterior
density function for θ evaluated at the point θa, i.e. p(θa|data) is shorthand
for p(θ = θa|data).
Let θa, θb be arbitrary points drawn from the target posterior density but
labelled so that

p(θb|data) ≥ p(θa|data) (3)

implying that the Metropolis acceptance ratio for a jump from θa to θb is

rM(θb,θa) = p(θb|data) / p(θa|data) = q(θb|data) / q(θa|data) ≥ 1. (4)
Consequently, a jump from θa to θb will always be accepted, since the acceptance
probability for a proposed jump in that direction is min(1, rM(θb,θa)).
Note in (4), the second equality follows because the normalising constant
cancels from the numerator and denominator of the ratio of posterior densities.
As usual, we need only be able to compute the unnormalized posterior
densities.
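In practice the ratio in (4) is computed from unnormalized (and usually log) densities, precisely because the normalising constant cancels. A small sketch; the hypothetical log_q (a standard-normal log density up to a constant) is an illustrative assumption:

```python
import math
import random

def log_q(theta):
    """Hypothetical unnormalized log posterior (standard normal here);
    the normalising constant never needs to be computed."""
    return -0.5 * theta ** 2

def accept(theta_new, theta_old, rng=random):
    """Metropolis accept/reject on the log scale:
    log r_M = log q(new) - log q(old); accept with prob min(1, r_M)."""
    log_r = log_q(theta_new) - log_q(theta_old)
    return rng.random() < math.exp(min(0.0, log_r))

# A jump to a point of higher density is always accepted (r_M >= 1):
assert accept(0.0, 2.0)
```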
On the other hand, the Metropolis acceptance ratio for a proposed jump
from θb to θa is

rM(θa,θb) = p(θa|data) / p(θb|data) = q(θa|data) / q(θb|data) < 1, (5)

so a jump from θb to θa will be accepted with probability rM(θa,θb).

The joint probability density for observing θa at the (t − 1)th iteration and
θb at the tth iteration is

p(θ(t−1) = θa, θ(t) = θb) = p(θ(t−1) = θa) p(θ(t) = θb|θ(t−1) = θa)
                          = p(θ = θa|data) Jt(θb|θa) × min(1, rM(θb,θa)) (6)
                          = p(θ = θa|data) Jt(θb|θa) (7)
                          = p(θa|data) Jt(θb|θa) (8)

since rM(θb,θa) ≥ 1 because of the way we have arranged θa and θb. In (6)
I have used the assumption that θ(t−1) is a draw from p(θ|data).

The joint probability density for observing θb at iteration (t − 1) and θa at
iteration t is

p(θ(t−1) = θb, θ(t) = θa) = p(θb|data) p(θ(t) = θa|θ(t−1) = θb)
                          = p(θb|data) Jt(θa|θb) × min(1, rM(θa,θb))
                          = p(θb|data) Jt(θa|θb) p(θa|data) / p(θb|data)
                          = p(θa|data) Jt(θa|θb)
                          = p(θa|data) Jt(θb|θa) (9)

where (9) follows because of the symmetry of Jt(). Thus, comparing (8) and
(9), it is clear that

p(θ(t−1) = θa, θ(t) = θb) = p(θ(t−1) = θb, θ(t) = θa)

and this holds for any θa, θb. That is, (θ(t−1),θ(t)) are exchangeable, hence
p(θ(t)) = p(θ(t−1)) = p(θ|data), since we assumed θ(t−1) is a draw from the
posterior distribution. Thus we have shown that p(θ(t−1)) = p(θ|data) implies
p(θ(t)) = p(θ|data). In other words, the posterior distribution is a stationary
distribution of the Markov Chain. In view of the uniqueness of the stationary
distribution, the chain converges to the posterior distribution.

In the case of the Metropolis-Hastings algorithm the proof is very similar
because the acceptance probabilities adjust for the asymmetry in Jt().
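The exchangeability argument above can be verified numerically on a small discrete state space: build the Metropolis transition probabilities from an unnormalized target q and a symmetric proposal, then check that p(θa) P(θa → θb) = p(θb) P(θb → θa), which is exactly the equality of (8) and (9). The 4-state target and the uniform proposal below are illustrative choices:

```python
# Verify p(a) P(a->b) == p(b) P(b->a) for a discrete Metropolis chain.
q = [1.0, 3.0, 2.0, 4.0]             # unnormalized target q(theta|data)
n = len(q)
Z = sum(q)
p = [qi / Z for qi in q]             # normalized target p(theta|data)

# Symmetric proposal: J(b|a) = 1/n for every pair (uniform).
J = 1.0 / n

# Metropolis transition matrix: off-diagonal P[a][b] = J * min(1, q_b/q_a);
# the diagonal absorbs the rejected proposals.
P = [[J * min(1.0, q[b] / q[a]) if b != a else 0.0 for b in range(n)]
     for a in range(n)]
for a in range(n):
    P[a][a] = 1.0 - sum(P[a])

# Exchangeability of (theta^(t-1), theta^(t)): the joint density is
# symmetric in its two arguments, as in the comparison of (8) and (9).
for a in range(n):
    for b in range(n):
        assert abs(p[a] * P[a][b] - p[b] * P[b][a]) < 1e-12

# Hence p is stationary: one step of the chain leaves p unchanged.
p_next = [sum(p[a] * P[a][b] for a in range(n)) for b in range(n)]
assert all(abs(x - y) < 1e-12 for x, y in zip(p, p_next))
```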