Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outline
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Motivation
Sampling from the Uniform Distribution
Sampling from Standard Distributions
Rejection Sampling
Adaptive Rejection Sampling
Importance Sampling
Markov Chain Monte Carlo
Gibbs Sampling
Part XVI
Sampling
Approximating π – Buffon’s Needle (1777)
Drop a needle of length c onto parallel lines a distance d apart (c ≤ d).
Needle falls perpendicular to the lines:
Probability of crossing a line is c/d.
Needle falls at an arbitrary angle a:
Probability of crossing a line is c sin(a)/d.
Every angle is equally probable, so average over the angle:
$$p(\text{crossing}) = \frac{c}{d}\int_0^\pi \sin(a)\,dp(a) = \frac{1}{\pi}\,\frac{c}{d}\int_0^\pi \sin(a)\,da = \frac{2}{\pi}\,\frac{c}{d}$$
Observing n crossings in N experiments gives
$$\frac{n}{N} \approx \frac{2}{\pi}\,\frac{c}{d}, \qquad\text{so}\qquad \pi \approx \frac{2\,c\,N}{d\,n}.$$
[Figures: (i) needle falling perpendicular (a = π/2); (ii) needle falling at an arbitrary angle a.]
Motivation
For most probabilistic models of practical interest, exact inference is intractable, so we need approximations.
Numerical sampling (Monte Carlo methods) is one such family of approximations.
Fundamental problem: find the expectation of some function f(z) w.r.t. a probability distribution p(z):
$$\mathbb{E}[f] = \int f(\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}$$
Key idea: draw independent samples z^{(l)}, l = 1, …, L, from p(z) and approximate the expectation by
$$\hat{f} = \frac{1}{L}\sum_{l=1}^{L} f(\mathbf{z}^{(l)})$$
Problem: How to obtain independent samples from p(z) ?
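The estimator f̂ can be sketched in a few lines of NumPy (names such as `mc_expectation` are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_expectation(f, sampler, L=100_000):
    """Approximate E[f(z)] by the sample mean of f over L independent draws from p(z)."""
    z = sampler(L)
    return np.mean(f(z))

# example: E[z^2] under a standard normal, which equals 1
est = mc_expectation(lambda z: z**2, lambda L: rng.standard_normal(L))
print(est)
```

The rest of this part is about supplying the `sampler` argument: how to actually draw from p(z) when no library routine exists for it.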
Approximating the Expectation of f (z)
Samples must be independent, otherwise the effective sample size is much smaller than the apparent sample size.
For dependent sample sequences especially: if f(z) is small in regions where p(z) is large (or vice versa), large sample sizes are needed to capture contributions from all regions.
[Figure: a function f(z) and a density p(z) over z whose high-value regions do not coincide.]
Sampling from the Uniform Distribution
On a computer this is usually done via a pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers.
Use a mathematically well-crafted pseudorandom number generator.
From now on we assume that a good pseudorandom number generator for uniformly distributed data is available.
If you don’t trust any algorithm: three carefully adjusted radio receivers picking up atmospheric noise provide real random numbers at http://www.random.org/
Sampling from Standard Distributions
Goal: Sample from p(y) which is given in analytical form.
Suppose uniformly distributed samples of z in the interval
(0, 1) are available.
Calculate the cumulative distribution function
$$h(y) = \int_{-\infty}^{y} p(x)\,dx$$
Transform the samples from U(z | 0, 1) by
$$y = h^{-1}(z)$$
to obtain samples y distributed according to p(y).
It is easy to check that h is then the CDF of y:
$$P(y < x) = P(h^{-1}(z) < x) = P(z < h(x)) = h(x).$$
Sampling from Standard Distributions
Goal: Sample from p(y) which is given in analytical form.
If a uniformly distributed random variable z is transformed
using y = h−1(z) then y will be distributed according to p(y).
[Figure: the CDF h(y) maps uniform samples on (0, 1) to samples from the density p(y).]
Sampling from the Exponential Distribution
Goal: Sample from the exponential distribution
$$p(y) = \begin{cases} \lambda e^{-\lambda y} & y \ge 0 \\ 0 & y < 0 \end{cases}$$
with rate parameter λ > 0.
Suppose uniformly distributed samples of z in the interval (0, 1) are available.
Calculate the cumulative distribution function
$$h(y) = \int_{-\infty}^{y} p(x)\,dx = \int_0^y \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda y}$$
Transform the samples from U(z | 0, 1) by
$$y = h^{-1}(z) = -\frac{1}{\lambda}\ln(1 - z)$$
to obtain samples y distributed according to the exponential distribution.
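The inverse-CDF transformation above can be checked numerically; a minimal sketch (the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_exponential(lam, L):
    """Inverse-CDF sampling: y = h^{-1}(z) = -ln(1 - z)/lam for z ~ U(0, 1)."""
    z = rng.uniform(0.0, 1.0, size=L)
    return -np.log(1.0 - z) / lam

y = sample_exponential(lam=2.0, L=100_000)
print(y.mean())   # should be close to 1/lam = 0.5
```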
Sampling from Multivariate Distributions
Generalisation to multiple variables is straightforward.
Consider a change of variables via the Jacobian:
$$p(y_1, \dots, y_M) = p(z_1, \dots, z_M)\left|\frac{\partial(z_1, \dots, z_M)}{\partial(y_1, \dots, y_M)}\right|$$
Technical challenge: multiple integrals; inverting nonlinear functions of multiple variables.
Sampling the Gaussian Distribution – Box-Muller
1 Generate pairs of uniformly distributed random numbers z₁, z₂ ∈ (−1, 1) (e.g. zᵢ = 2z − 1 for z from U(z | 0, 1)).
2 Discard any pair (z₁, z₂) unless z₁² + z₂² ≤ 1. This leaves a uniform distribution inside the unit circle, p(z₁, z₂) = 1/π.
3 Evaluate r² = z₁² + z₂² and
$$y_1 = z_1\left(\frac{-2\ln r^2}{r^2}\right)^{1/2}, \qquad y_2 = z_2\left(\frac{-2\ln r^2}{r^2}\right)^{1/2}$$
4 Then y₁ and y₂ are independent with joint distribution
$$p(y_1, y_2) = p(z_1, z_2)\left|\frac{\partial(z_1, z_2)}{\partial(y_1, y_2)}\right| = \frac{1}{\sqrt{2\pi}}e^{-y_1^2/2}\,\frac{1}{\sqrt{2\pi}}e^{-y_2^2/2}$$
Sampling the Multivariate Gaussian
Generate x ∼ N(0, I) and let y = Ax + b. Then
$$\mathbb{E}[\mathbf{y}] = \mathbf{b}$$
$$\mathbb{E}\big[(\mathbf{y} - \mathbb{E}\mathbf{y})(\mathbf{y} - \mathbb{E}\mathbf{y})^\top\big] = A\,\mathbb{E}[\mathbf{x}\mathbf{x}^\top]\,A^\top = AA^\top$$
We know y is Gaussian (an affine transformation of a Gaussian is Gaussian). So to sample y ∼ N(m, Σ), set
$$\mathbf{b} = \mathbf{m}, \qquad AA^\top = \Sigma,$$
where A is the Cholesky factor of Σ.
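A minimal sketch of this recipe using `numpy.linalg.cholesky`, which returns the lower-triangular factor A with A Aᵀ = Σ (the function name and example parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_mvn(m, Sigma, L):
    """Sample y ~ N(m, Sigma) via y = A x + m with A the Cholesky factor, x ~ N(0, I)."""
    A = np.linalg.cholesky(Sigma)          # lower-triangular, A @ A.T == Sigma
    x = rng.standard_normal((len(m), L))   # standard normal columns
    return (A @ x).T + m                   # shape (L, dim)

m = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
y = sample_mvn(m, Sigma, 200_000)
print(y.mean(axis=0))
print(np.cov(y.T))
```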
Rejection Sampling
Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p(z) up to some unknown normalisation constant Z_p:
$$p(z) = \frac{1}{Z_p}\,\tilde{p}(z)$$
Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k,
$$k\,q(z) \ge \tilde{p}(z) \quad \text{for all } z.$$
[Figure: the envelope k q(z) lying above p̃(z); a proposed point (z₀, u₀) with u₀ uniform on [0, k q(z₀)].]
Rejection Sampling
1 Generate a random number z₀ from the distribution q(z).
2 Generate a number u₀ from the uniform distribution over [0, k q(z₀)].
3 If u₀ > p̃(z₀), reject the pair (z₀, u₀).
4 The remaining pairs are uniformly distributed under the curve p̃(z).
5 The z values are then distributed according to p(z).
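The five steps can be sketched as follows, here with an unnormalised Beta(2, 2) target p̃(z) = z(1 − z) under a uniform proposal with envelope constant k = 1/4 (this example target is an illustrative choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)

def rejection_sample(p_tilde, q_sample, q_pdf, k, L):
    """Draw L samples from p(z) proportional to p_tilde(z), given k*q(z) >= p_tilde(z)."""
    out = []
    while len(out) < L:
        z0 = q_sample(L)                        # step 1: propose from q
        u0 = rng.uniform(0.0, k * q_pdf(z0))    # step 2: u0 ~ U[0, k q(z0)]
        out.extend(z0[u0 <= p_tilde(z0)])       # step 3: keep unless u0 > p_tilde(z0)
    return np.array(out[:L])

# unnormalised Beta(2,2): p_tilde(z) = z(1-z) on (0,1); envelope q = U(0,1), k = 1/4
p_tilde = lambda z: z * (1 - z)
z = rejection_sample(p_tilde, lambda L: rng.uniform(0, 1, L),
                     lambda z: np.ones_like(z), k=0.25, L=100_000)
print(z.mean())   # Beta(2,2) has mean 1/2
```

With this envelope the acceptance rate is (1/6)/(1/4) = 2/3, illustrating the cost of a loose bound: the slack between k q(z) and p̃(z) is exactly the rejected mass.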
Rejection Sampling – Example
Consider the gamma distribution for a > 1:
$$\mathrm{Gam}(z \mid a, b) = \frac{b^a z^{a-1}\exp(-bz)}{\Gamma(a)}$$
A suitable q(z) could be (similar to the Cauchy distribution)
$$q(z) = \frac{k}{1 + (z - c)^2/b^2}$$
Samples z from q(z) are obtained from uniformly distributed y via the transformation z = b tan y + c, with c = a − 1, b² = 2a − 1, and k as small as possible subject to k q(z) ≥ p̃(z).
[Figure: the gamma density p(z) beneath its Cauchy-like envelope k q(z), for z from 0 to 30.]
Adaptive Rejection Sampling
A suitable form for the proposal distribution q(z) might be difficult to find.
If p(z) is log-concave (ln p(z) has nonincreasing derivatives), use the derivatives to construct an envelope:
1 Start with an initial grid of points z₁, …, z_M and construct the envelope from the tangents to ln p(z) at the zᵢ, i = 1, …, M.
2 Draw a sample from the envelope function; if it is accepted, use it as a sample from p(z). Otherwise, use it to refine the grid.
[Figure: piecewise-linear envelope formed by tangents to ln p(z) at grid points z₁, z₂, z₃.]
Adaptive Rejection Sampling – Example
The piecewise exponential distribution is defined as
$$p(z) = k_m \lambda_m e^{-\lambda_m (z - z_{m-1})}, \qquad \hat{z}_{m-1,m} < z \le \hat{z}_{m,m+1},$$
where ẑ_{m−1,m} is the point of intersection of the tangent lines at z_{m−1} and z_m, λ_m is the slope of the tangent at z_m, and k_m accounts for the corresponding offset.
Rejection Sampling - Problems
Need to find a proposal distribution q(z) which is a close
upper bound to p(z); otherwise many samples are rejected.
Curse of dimensionality for multivariate distributions.
Importance Sampling
Directly approximate E_p[f(z)] w.r.t. some p(z), without sampling from p(z) as an intermediate step.
Again use samples z^{(l)} from some proposal q(z):
$$\mathbb{E}[f] = \int f(\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z} = \int f(\mathbf{z})\,\frac{p(\mathbf{z})}{q(\mathbf{z})}\,q(\mathbf{z})\,d\mathbf{z} \approx \frac{1}{L}\sum_{l=1}^{L}\frac{p(\mathbf{z}^{(l)})}{q(\mathbf{z}^{(l)})}\,f(\mathbf{z}^{(l)})$$
Importance Sampling - Unnormalised
Consider unnormalised densities p(z) = p̃(z)/Z_p and q(z) = q̃(z)/Z_q.
It follows that, with r̃_l = p̃(z^{(l)})/q̃(z^{(l)}),
$$\mathbb{E}[f] = \int f(\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z} = \frac{Z_q}{Z_p}\int f(\mathbf{z})\,\frac{\tilde{p}(\mathbf{z})}{\tilde{q}(\mathbf{z})}\,q(\mathbf{z})\,d\mathbf{z} \approx \frac{Z_q}{Z_p}\,\frac{1}{L}\sum_{l=1}^{L}\tilde{r}_l\,f(\mathbf{z}^{(l)}).$$
Using the same samples (and f(z) = 1) gives
$$\frac{Z_p}{Z_q} \approx \frac{1}{L}\sum_{l=1}^{L}\tilde{r}_l \qquad\Rightarrow\qquad \mathbb{E}[f] \approx \sum_{l=1}^{L}\frac{\tilde{r}_l}{\sum_{m=1}^{L}\tilde{r}_m}\,f(\mathbf{z}^{(l)})$$
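The self-normalised estimator can be sketched as follows; the Gaussian target/proposal pair is an illustrative choice, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)

def self_normalised_is(f, p_tilde, q_tilde, q_sample, L):
    """Self-normalised importance sampling: E[f] ~ sum_l w_l f(z_l), w_l = r_l / sum_m r_m."""
    z = q_sample(L)
    r = p_tilde(z) / q_tilde(z)    # unnormalised importance weights r_l
    w = r / r.sum()                # normalising cancels the unknown Z_p / Z_q
    return np.sum(w * f(z))

# target: N(0,1) up to a constant; proposal: N(0, 2^2), also unnormalised
p_tilde = lambda z: np.exp(-0.5 * z**2)
q_tilde = lambda z: np.exp(-0.5 * (z / 2.0)**2)
est = self_normalised_is(lambda z: z**2, p_tilde, q_tilde,
                         lambda L: 2.0 * rng.standard_normal(L), 200_000)
print(est)   # E[z^2] under N(0,1) is 1
```

Note that neither density needs its normalisation constant: the ratio r̃_l and the final normalisation by Σ r̃_m absorb both.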
Importance Sampling - Key Points
Importance weights r_l correct the bias introduced by sampling from the proposal distribution q(z) instead of the wanted distribution p(z).
Success depends on how well q(z) approximates p(z).
Wherever p(z) > 0, we also need q(z) > 0.
Markov Chain Monte Carlo
Goal: sample from p(z) by generating a sequence using a Markov chain.
1 Generate a new sample z* ∼ q(z | z^{(l)}), conditional on the previous sample z^{(l)}.
2 Accept or reject the new sample according to an appropriate criterion:
$$z^{(l+1)} = \begin{cases} z^* & \text{if accepted} \\ z^{(l)} & \text{if rejected} \end{cases}$$
3 For an appropriate proposal and corresponding acceptance criterion, as l → ∞, z^{(l)} approaches an independent sample from p(z).
Metropolis Algorithm
1 Choose a symmetric proposal distribution: q(z_A | z_B) = q(z_B | z_A).
2 Accept the new sample z* with probability
$$A(z^*, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})}\right)$$
e.g., draw u ∼ Uniform(0, 1) and accept if p̃(z*)/p̃(z^{(τ)}) > u.
3 Unlike rejection sampling, we keep the previous sample when the proposal is rejected:
$$z^{(\tau+1)} = \begin{cases} z^* & \text{if accepted} \\ z^{(\tau)} & \text{if rejected} \end{cases}$$
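A minimal sketch of the algorithm for a one-dimensional unnormalised Gaussian target, with a symmetric Gaussian random-walk proposal (function name and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def metropolis(p_tilde, z0, sigma, n_steps):
    """Metropolis with symmetric Gaussian proposal q(z*|z) = N(z, sigma^2)."""
    z = z0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        z_star = z + sigma * rng.standard_normal()         # propose
        if p_tilde(z_star) / p_tilde(z) > rng.uniform():   # accept w.p. min(1, ratio)
            z = z_star                                     # accepted: move
        chain[t] = z                                       # rejected: keep old state
    return chain

# target: N(0,1) up to a constant
chain = metropolis(lambda z: np.exp(-0.5 * z**2), z0=0.0, sigma=1.0, n_steps=50_000)
print(chain[5_000:].mean(), chain[5_000:].std())   # after burn-in: roughly 0 and 1
```

Note the crucial difference from rejection sampling in the last comment: a rejected proposal still contributes a (repeated) sample, which is why consecutive samples are correlated.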
Metropolis Algorithm – Illustration
Sampling from a Gaussian distribution (the black contour shows one standard deviation).
Proposal: q(z | z^{(l)}) = N(z | z^{(l)}, σ²I) with σ = 0.2.
150 candidates generated; 43 rejected.
[Figure: the Metropolis random walk over the Gaussian; accepted steps and rejected steps shown.]
Markov Chain Monte Carlo – Why it works
A first-order Markov chain is a series of random variables z^{(1)}, …, z^{(M)} such that the following property holds:
$$p(z^{(m+1)} \mid z^{(1)}, \dots, z^{(m)}) = p(z^{(m+1)} \mid z^{(m)})$$
Marginal probability:
$$p(z^{(m+1)}) = \sum_{z^{(m)}} p(z^{(m+1)} \mid z^{(m)})\,p(z^{(m)}) = \sum_{z^{(m)}} T_m(z^{(m)}, z^{(m+1)})\,p(z^{(m)})$$
where T_m(z^{(m)}, z^{(m+1)}) ≡ p(z^{(m+1)} | z^{(m)}) are the transition probabilities.
[Figure: graphical model of the chain z^{(1)} → z^{(2)} → ⋯ → z^{(m)} → z^{(m+1)}.]
Markov Chain Monte Carlo – Why it works
Marginal probability:
$$p(z^{(m+1)}) = \sum_{z^{(m)}} T_m(z^{(m)}, z^{(m+1)})\,p(z^{(m)})$$
A Markov chain is called homogeneous if the transition probabilities are the same for all m, denoted by T(z′, z).
A distribution is invariant, or stationary, with respect to a Markov chain if each step leaves the distribution unchanged. For a homogeneous Markov chain, the distribution p*(z) is invariant if
$$p^\star(z) = \sum_{z'} T(z', z)\,p^\star(z'). \qquad (1)$$
(Note: there can be many invariant distributions. If T is the identity, every distribution is invariant.)
Markov Chain Monte Carlo – Why it works
Detailed balance
$$p^\star(z)\,T(z, z') = p^\star(z')\,T(z', z) \qquad (2)$$
is sufficient (but not necessary) for p*(z) to be invariant (to check, substitute (2) into (1)). A Markov chain that respects detailed balance is called reversible.
A Markov chain is ergodic if it converges to the invariant distribution irrespective of the choice of initial conditions; the invariant distribution is then called the equilibrium distribution. An ergodic Markov chain can have only one equilibrium distribution.
Why does this work? Choose the transition probabilities T to satisfy detailed balance for our goal distribution p(z).
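The link between detailed balance and invariance can be verified numerically for a small discrete chain; the Metropolis-style construction of T below (symmetric base proposal, acceptance ratio of target probabilities) is an illustrative choice:

```python
import numpy as np

# a 3-state target distribution
p_star = np.array([0.5, 0.3, 0.2])

# build T from a symmetric base proposal Q via Metropolis acceptance,
# so that p*(z) T(z, z') = p*(z') T(z', z) by construction
Q = np.full((3, 3), 1.0 / 3.0)
T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            T[i, j] = Q[i, j] * min(1.0, p_star[j] / p_star[i])
    T[i, i] = 1.0 - T[i].sum()          # remaining probability mass stays put

# detailed balance: the matrix of flows p*(z) T(z, z') is symmetric
flows = p_star[:, None] * T
assert np.allclose(flows, flows.T)
# invariance follows: one step of the chain maps p* to itself
print(p_star @ T)
```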
Markov Chain Monte Carlo – Metropolis-Hastings
Generalisation of the Metropolis algorithm to nonsymmetric proposal distributions q_k.
At step τ, draw a sample z* from the distribution q_k(z | z^{(τ)}), where k labels the set of possible transitions. Accept with probability
$$A_k(z^*, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^*)\,q_k(z^{(\tau)} \mid z^*)}{\tilde{p}(z^{(\tau)})\,q_k(z^* \mid z^{(\tau)})}\right)$$
The choice of proposal distribution is critical. A common choice is a Gaussian centred on the current state:
small variance → high acceptance rate, but a slow walk through the state space; samples are not independent
large variance → high rejection rate
Markov Chain Monte Carlo – Metropolis-Hastings
The transition probability of this Markov chain is
$$T(z, z') = q_k(z' \mid z)\,A_k(z', z)$$
To prove that p(z) is the invariant distribution, show that detailed balance holds:
$$p(z)\,T(z, z') = T(z', z)\,p(z').$$
Using the symmetry min(a, b) = min(b, a):
$$\begin{aligned}
p(z)\,q_k(z' \mid z)\,A_k(z', z) &= \min\big(p(z)\,q_k(z' \mid z),\; p(z')\,q_k(z \mid z')\big) \\
&= \min\big(p(z')\,q_k(z \mid z'),\; p(z)\,q_k(z' \mid z)\big) \\
&= p(z')\,q_k(z \mid z')\,A_k(z, z').
\end{aligned}$$
Markov Chain Monte Carlo – Metropolis-Hastings
Isotropic Gaussian proposal distribution (blue).
To keep the rejection rate low, the proposal scale should be the smallest standard deviation σ_min of the target multivariate Gaussian (red).
This leads to random-walk behaviour → slow exploration of the state space.
The number of steps separating approximately independent states is of order (σ_max/σ_min)².
[Figure: a strongly correlated Gaussian with length scales σ_min and σ_max, explored by an isotropic proposal of scale ρ.]
Gibbs Sampling
Goal: sample from a joint distribution p(z) = p(z₁, …, z_M).
How? Sample each variable in turn from its distribution conditioned on all the other variables.
Example: given p(z₁, z₂, z₃), at step τ we have samples z₁^{(τ)}, z₂^{(τ)}, z₃^{(τ)}. Obtain the samples for step τ + 1 via
$$\begin{aligned}
z_1^{(\tau+1)} &\sim p(z_1 \mid z_2^{(\tau)}, z_3^{(\tau)}) \\
z_2^{(\tau+1)} &\sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)}) \\
z_3^{(\tau+1)} &\sim p(z_3 \mid z_1^{(\tau+1)}, z_2^{(\tau+1)})
\end{aligned}$$
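A minimal sketch for a bivariate Gaussian with correlation ρ, whose full conditionals are the standard results z₁ | z₂ ∼ N(ρ z₂, 1 − ρ²) and symmetrically for z₂ (the example target and names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(8)

def gibbs_bivariate_gaussian(rho, n_steps):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian, corr = rho."""
    z1, z2 = 0.0, 0.0
    s = np.sqrt(1.0 - rho**2)        # conditional standard deviation
    chain = np.empty((n_steps, 2))
    for t in range(n_steps):
        z1 = rho * z2 + s * rng.standard_normal()   # z1 ~ p(z1 | z2)
        z2 = rho * z1 + s * rng.standard_normal()   # z2 ~ p(z2 | new z1)
        chain[t] = z1, z2
    return chain

chain = gibbs_bivariate_gaussian(rho=0.8, n_steps=100_000)
print(np.corrcoef(chain[10_000:].T)[0, 1])   # roughly rho = 0.8
```

Each update uses the most recent value of the other coordinate, exactly as in the three-variable scheme above.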
Gibbs Sampling – Why does it work?
1 p(z) is invariant under each of the Gibbs sampling steps, and hence under the whole Markov chain: a) p(z_i | z_{\i}) leaves p(z) invariant because the marginal distribution p(z_{\i}) does not change, and b) by definition each step samples from the correct conditional p(z_i | z_{\i}).
2 Ergodicity: it is sufficient that none of the conditional distributions is zero anywhere.
3 Gibbs sampling is Metropolis-Hastings sampling in which every step is accepted (acceptance probability 1).
[Figure: Gibbs sampling path over a correlated bivariate Gaussian with conditional scale l and marginal scale L.]
Decision Theory and Modelling
Inference and Decision
Probabilistic Modelling in the Abstract
Model Selection