Statistical Machine Learning
© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outline
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XVII
Probabilistic Graphical Models 1
Motivation via a picture
Image Segmentation
Cluster the grey-value representation of an image.
Neighbourhood information is lost.
Need to use the structure of the image.
Motivation via Independence (1/4)
Why is the grass wet?
[Graph: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
Introduce four Boolean variables:
C(loudy), S(prinkler), R(ain), W(etGrass) ∈ {F(alse), T(rue)}.
Motivation via Independence (2/4)
Model the conditional probabilities
p(C = F)   p(C = T)
  0.2        0.8

C    p(S = F)   p(S = T)
F      0.5        0.5
T      0.9        0.1

C    p(R = F)   p(R = T)
F      0.8        0.2
T      0.2        0.8

S  R   p(W = F)   p(W = T)
F  F     1.0        0.0
T  F     0.1        0.9
F  T     0.1        0.9
T  T     0.01       0.99
Motivation via Independence (3/4)
If everything depends on everything else, we must specify the full joint distribution:

C  S  R  W   p(C, S, R, W)
F  F  F  F       . . .
F  F  F  T       . . .
. . .
T  T  T  F       . . .
T  T  T  T       . . .

p(W, S, R, C) = p(W | S, R, C) p(S, R, C)
             = p(W | S, R, C) p(S | R, C) p(R, C)
             = p(W | S, R, C) p(S | R, C) p(R | C) p(C)
Motivation via Independence (4/4)
p(W) = ∑_{S,R,C} p(W | S,R,C) p(S | R,C) p(R | C) p(C)

p(W) = ∑_{S,R} p(W | S,R) ∑_{C} p(S | C) p(R | C) p(C)
[Figures: the graph for the general factorisation over Cloudy, Sprinkler, Rain, WetGrass, and the sparser sprinkler network with edges Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
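The sprinkler model lends itself to a quick numerical check. The sketch below (not part of the original slides; the dictionary encoding is just one convenient choice) stores the four conditional probability tables from the earlier slide and computes p(W) both from the full sum over S, R, C and from the nested sum that pulls the C-sum inside, as in the second equation above.

```python
# Sprinkler network CPTs as plain Python dicts (values from the earlier slide).
p_C = {True: 0.8, False: 0.2}
p_S_given_C = {False: {True: 0.5, False: 0.5}, True: {True: 0.1, False: 0.9}}
p_R_given_C = {False: {True: 0.2, False: 0.8}, True: {True: 0.8, False: 0.2}}
p_W_given_SR = {(False, False): {True: 0.0, False: 1.0},
                (True, False):  {True: 0.9, False: 0.1},
                (False, True):  {True: 0.9, False: 0.1},
                (True, True):   {True: 0.99, False: 0.01}}

B = (False, True)

# p(W) = sum_{S,R,C} p(W|S,R) p(S|C) p(R|C) p(C)  -- one flat sum over all 8 configurations
def p_W_full(w):
    return sum(p_W_given_SR[(s, r)][w] * p_S_given_C[c][s] * p_R_given_C[c][r] * p_C[c]
               for s in B for r in B for c in B)

# p(W) = sum_{S,R} p(W|S,R) sum_C p(S|C) p(R|C) p(C)  -- inner sum over C factored out
def p_W_factored(w):
    return sum(p_W_given_SR[(s, r)][w]
               * sum(p_S_given_C[c][s] * p_R_given_C[c][r] * p_C[c] for c in B)
               for s in B for r in B)

print(p_W_full(True), p_W_factored(True))   # both print the same p(W = True)
```

Both calls return the same value, but the nested form touches fewer terms per outer configuration; the next slide quantifies this saving via the distributive law.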
Motivation via Distributive Law
Two key observations when dealing with probabilities
1 Distributive Law can save operations
a(b + c)  [2 operations]   =   ab + ac  [3 operations]
2 If some probabilities do not depend on all random variables,
we might be able to factor them out. For example, assume
p(x1, x3 | x2) = p(x1 | x2) p(x3 | x2),

then (using ∑_{x3} p(x3 | x2) = 1)

p(x1) = ∑_{x2,x3} p(x1, x2, x3) = ∑_{x2,x3} p(x1, x3 | x2) p(x2)
      = ∑_{x2,x3} p(x1 | x2) p(x3 | x2) p(x2)    [O(|X1| |X2| |X3|) operations]
      = ∑_{x2} p(x1 | x2) p(x2)                  [O(|X1| |X2|) operations]
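As a concrete illustration of the second observation, the sketch below (array sizes and CPT values are arbitrary, made-up examples) compares the naive sum over (x2, x3) with the factored sum over x2 only; both yield the same p(x1).

```python
# Compare summing over (x2, x3) directly with factoring the x3-sum out,
# exploiting sum_{x3} p(x3 | x2) = 1.
import numpy as np

rng = np.random.default_rng(0)
K1, K2, K3 = 4, 5, 6                                   # |X1|, |X2|, |X3|

p_x2 = rng.random(K2); p_x2 /= p_x2.sum()                                          # p(x2)
p_x1_given_x2 = rng.random((K1, K2)); p_x1_given_x2 /= p_x1_given_x2.sum(axis=0)   # p(x1 | x2)
p_x3_given_x2 = rng.random((K3, K2)); p_x3_given_x2 /= p_x3_given_x2.sum(axis=0)   # p(x3 | x2)

# Naive: sum over x2 and x3 of p(x1|x2) p(x3|x2) p(x2)  -> O(K1*K2*K3) terms
p_x1_naive = np.einsum('ij,kj,j->i', p_x1_given_x2, p_x3_given_x2, p_x2)

# Factored: the x3-sum is 1, so p(x1) = sum_{x2} p(x1|x2) p(x2)  -> O(K1*K2) terms
p_x1_factored = p_x1_given_x2 @ p_x2

print(np.allclose(p_x1_naive, p_x1_factored))   # True
```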
Motivation via Complexity Reduction
How to deal with more complex expressions?
p(x1) p(x2) p(x3) p(x4|x1, x2, x3) p(x5|x1, x3) p(x6|x4) p(x7|x4, x5)
Graphical models
[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]
Probabilistic Graphical Models: Aims
Graphical models
Visualise the structure of a probabilistic model
Complex computations with formulas → manipulations with graphs
Obtain insights into model properties by inspection
Develop and motivate new models
Probabilistic Graphical Models: Main Flavours
Graph
1 Nodes (vertices): each node represents a random variable
2 Edges (links, arcs; directed or undirected): each edge represents a probabilistic relationship
Directed Graph: Bayesian Network (also called Directed Graphical Model), expressing causal relationships between variables
Undirected Graph: Markov Random Field, expressing soft constraints between variables
Factor Graph: convenient for solving inference problems (derived from Bayesian Networks or Markov Random Fields).
[Figures: a Bayesian network over x1, . . . , x7; a factor graph with variables x1, x2, x3 and factors fa, fb, fc, fd; a Markov random field over variables xi, yi]
Bayesian Network: Arbitrary Joint Distribution
p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)
1 Draw a node for each conditional distribution associated
with a random variable.
2 Draw an edge from each conditional distribution associated with a random variable to all other conditional distributions which are conditioned on this variable.
[Graph: a → b, a → c, b → c]
We have chosen a particular ordering of the variables!
Bayesian Network: Arbitrary Joint Distribution
General case for K variables
p(x1, . . . , xK) = p(xK | x1, . . . , xK−1) · · · p(x2 | x1) p(x1)
The graph of this distribution is fully connected. (Prove it.)
What happens if we deal with a distribution represented by
a graph which is not fully connected?
It can no longer be the most general distribution.
The absence of edges carries important information.
Joint Factorisation → Graph
p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)
1 Draw a node for each conditional distribution associated with a random variable.
2 Draw an edge from each conditional distribution associated with a random variable to all other conditional distributions which are conditioned on this variable.
[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]
Graph → Joint Factorisation
[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]
Can we get the expression from the graph?
1 Write a product of probability distributions, one for each associated random variable. ↔ Draw a node for each conditional distribution associated with a random variable.
2 Add all random variables associated with parent nodes to the list of conditioning variables. ↔ Draw an edge from each conditional distribution associated with a random variable to all other conditional distributions which are conditioned on this variable.
p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)
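The read-off can be automated for small examples. The following sketch (my own helper; the dictionary-of-parents representation is an assumption, not notation from the slides) encodes the seven-node graph above and prints its factorisation.

```python
# Read the factorisation off the graph by conditioning each variable on its parents.
parents = {
    'x1': [], 'x2': [], 'x3': [],
    'x4': ['x1', 'x2', 'x3'],
    'x5': ['x1', 'x3'],
    'x6': ['x4'],
    'x7': ['x4', 'x5'],
}

def factorisation(parents):
    """Return the joint p(x) as a product of p(x_k | pa(x_k)) factors."""
    factors = []
    for node, pa in parents.items():
        factors.append(f"p({node}|{','.join(pa)})" if pa else f"p({node})")
    return ' '.join(factors)

print(factorisation(parents))
# p(x1) p(x2) p(x3) p(x4|x1,x2,x3) p(x5|x1,x3) p(x6|x4) p(x7|x4,x5)
```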
Joint Factorisation ↔ Graph
The joint distribution defined by a graph is given by the
product, over all of the nodes of the graph, of a conditional
distribution for each node conditioned on the variables
corresponding to the parents of the node in the graph.
p(x) = ∏_{k=1}^{K} p(xk | pa(xk))

where pa(xk) denotes the set of parents of xk and x = (x1, . . . , xK).
The absence of edges (compared to the fully connected
case) is the point.
Joint Factorisation ↔ Graph
Restriction: Graph must be a directed acyclic graph (DAG).
There are no closed paths in the graph when moving
along the directed edges.
Or equivalently: There exists an ordering of the nodes
such that there are no edges that go from any node to any
lower numbered node.
[Graph: the seven-node example x1, . . . , x7 (a DAG)]
Extension: Can also have sets of variables, or vectors at a
node.
Bayesian Network: Normalisation
Given

p(x) = ∏_{k=1}^{K} p(xk | pa(xk)),

is p(x) normalised, i.e. ∑_x p(x) = 1?
As the graph is a DAG, there always exists a node with no outgoing edges, say xi.
Because xi has no outgoing edges, it appears in no other factor's conditioning set, so the sum over xi can be taken innermost:

∑_x p(x) = ∑_{x1,...,xi−1,xi+1,...,xK} [ ∏_{k=1, k≠i}^{K} p(xk | pa(xk)) ] ∑_{xi} p(xi | pa(xi)),

and the last sum equals 1 because

∑_{xi} p(xi | pa(xi)) = ∑_{xi} p(xi, pa(xi)) / p(pa(xi)) = p(pa(xi)) / p(pa(xi)) = 1.

Repeat until no node is left.
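A brute-force check of this argument on the sprinkler network (a sketch reusing the CPTs introduced earlier in this lecture): the product of the CPTs, summed over every joint configuration, is exactly 1.

```python
from itertools import product

p_C = {True: 0.8, False: 0.2}
p_S_given_C = {False: {True: 0.5, False: 0.5}, True: {True: 0.1, False: 0.9}}
p_R_given_C = {False: {True: 0.2, False: 0.8}, True: {True: 0.8, False: 0.2}}
p_W_given_SR = {(False, False): {True: 0.0, False: 1.0},
                (True, False):  {True: 0.9, False: 0.1},
                (False, True):  {True: 0.9, False: 0.1},
                (True, True):   {True: 0.99, False: 0.01}}

# Sum the product of all factors over every configuration of (C, S, R, W).
total = sum(p_C[c] * p_S_given_C[c][s] * p_R_given_C[c][r] * p_W_given_SR[(s, r)][w]
            for c, s, r, w in product((False, True), repeat=4))
print(total)   # 1.0 (up to floating point)
```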
Bayesian Network – Plate Notation
Bayesian polynomial regression: observed inputs x, observed targets t, noise variance σ², hyperparameter α controlling the prior over w.
Focusing on t and w only:

p(t, w) = p(w) ∏_{n=1}^{N} p(tn | w)
[Figures: the network drawn with w and t1, . . . , tN as individual nodes, and the equivalent compact notation in which tn sits inside a box labelled N — a plate.]
Bayesian Network – Plate Notation
Also include the parameters in the graphical model:

p(t, w | x, α, σ²) = p(w | α) ∏_{n=1}^{N} p(tn | w, xn, σ²)
Random variables = open circles
Deterministic variables = smaller solid circles
[Figure: plate notation with tn and xn inside the plate of size N; w, α and σ² outside the plate; xn, α and σ² drawn as small solid circles (deterministic), tn and w as open circles (random)]
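A generative reading of this plate diagram can be sketched in a few lines. The values of M, N, α, σ² and the inputs x below are illustrative assumptions, not taken from the slides; the sketch simply draws w from p(w | α) and then each tn from p(tn | w, xn, σ²).

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 20                       # polynomial degree, number of data points (assumed)
alpha, sigma2 = 2.0, 0.1           # prior precision, noise variance (assumed)
x = np.linspace(0.0, 1.0, N)       # deterministic inputs x_n

Phi = np.vander(x, M + 1, increasing=True)         # polynomial features (1, x, x^2, x^3)
w = rng.normal(0.0, np.sqrt(1.0 / alpha), M + 1)   # w ~ N(0, alpha^{-1} I)
t = rng.normal(Phi @ w, np.sqrt(sigma2))           # t_n ~ N(w^T phi(x_n), sigma2)

print(w)
print(t[:5])
```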
Bayesian Network – Plate Notation
Random variables
Observed random variables, e.g. t
Unobserved random variables, e.g. w,
(latent random variables, hidden random variables)
Shade the observed random variables in the graphical
model.
[Figure: the same plate model, now with the observed nodes tn shaded]
Bayesian Network – Plate Notation
Prediction: for a new input x̂ we want to predict t̂.

p(t̂, t, w | x̂, x, α, σ²) = [ ∏_{n=1}^{N} p(tn | xn, w, σ²) ] p(w | α) p(t̂ | x̂, w, σ²)
[Figure: the plate model extended with the new input x̂ and the prediction t̂, where t̂ depends on x̂, w and σ²]
Polynomial regression model including prediction.
Bayesian Network – Conditional Independence
Any p(x) may be written as

p(x) = ∏_{k=1}^{K} p(xk | x1, x2, . . . , xk−1) = ∏_{k=1}^{K} p(xk | pa(xk))

What does it mean if pa(xk) ≠ {x1, x2, . . . , xk−1} in the above equation?
Sparse graphical model.
p(x) is no longer fully general.
Assumption, e.g. p(W | C, S, R) = p(W | S, R).
Conditional independence.
Bayesian Network – Conditional Independence
Definition (Conditional Independence)
If for three random variables a, b, and c the following holds
p(a | b, c) = p(a | c)
then a is conditionally independent of b given c.
Notation: a ⊥⊥ b | c.
The above equation must hold for all possible values of c.
Consequence:

p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c)

Conditional independence simplifies
the structure of the model, and
the computations needed to perform inference/learning.
NB: conditional dependence is a related but different
concept we are not concerned with in this course.
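For small discrete distributions, the definition can be checked numerically. The helper below is a sketch (the function name and the array-based representation of p(a, b, c) are my own); it tests p(a, b | c) = p(a | c) p(b | c) for every value of c.

```python
import numpy as np

def is_cond_indep(p_abc, tol=1e-12):
    """True iff a ⊥⊥ b | c for a joint p(a, b, c) stored as a 3-D array."""
    p_c = p_abc.sum(axis=(0, 1))                     # p(c)
    p_ab_given_c = p_abc / p_c                       # p(a, b | c), broadcasting over c
    p_a_given_c = p_abc.sum(axis=1) / p_c            # p(a | c)
    p_b_given_c = p_abc.sum(axis=0) / p_c            # p(b | c)
    outer = p_a_given_c[:, None, :] * p_b_given_c[None, :, :]
    return np.allclose(p_ab_given_c, outer, atol=tol)

# Example: a joint of the form p(a | c) p(b | c) p(c) satisfies a ⊥⊥ b | c by construction.
rng = np.random.default_rng(0)
p_c = rng.dirichlet(np.ones(2))
p_a_given_c = rng.dirichlet(np.ones(3), size=2).T    # shape (3, 2), columns sum to 1
p_b_given_c = rng.dirichlet(np.ones(4), size=2).T    # shape (4, 2)
p_abc = p_a_given_c[:, None, :] * p_b_given_c[None, :, :] * p_c
print(is_cond_indep(p_abc))   # True
```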
Bayesian Network – Conditional Independence
Rules for Conditional Independence
Symmetry: X ⊥⊥ Y | Z =⇒ Y ⊥⊥ X | Z
Decomposition: Y, W ⊥⊥ X | Z =⇒ Y ⊥⊥ X | Z and W ⊥⊥ X | Z
Weak Union: X ⊥⊥ Y, W | Z =⇒ X ⊥⊥ Y | Z, W
Contraction: X ⊥⊥ W | Z, Y and X ⊥⊥ Y | Z =⇒ X ⊥⊥ W, Y | Z
Intersection: X ⊥⊥ Y | Z, W and X ⊥⊥ W | Z, Y =⇒ X ⊥⊥ Y, W | Z
Note: Intersection is only valid for p(X), p(Y), p(Z), p(W) > 0.
Conditional Independence: TT (1/3)
Can we work with the graphical model directly?
Check the simplest examples containing only three nodes.
First example has joint distribution
p(a, b, c) = p(a | c) p(b | c) p(c)
Marginalise both sides over c:

p(a, b) = ∑_{c} p(a | c) p(b | c) p(c) ≠ p(a) p(b).

Does not hold: a ⊥⊥ b | ∅ (where ∅ is the empty set).
[Graph: c → a, c → b]
Conditional Independence: TT (2/3)
Now condition on c.
p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c)

Therefore a ⊥⊥ b | c.
[Graph: c → a, c → b, with c observed]
Conditional Independence: TT (3/3)
Graphical interpretation
In both graphical models there is a path from a to b.
The node c is called tail-to-tail (TT) with respect to this
path because the node c is connected to the tails of the
arrows in the path.
The presence of the TT-node c on the path (left figure) renders a dependent on b (and b dependent on a).
Conditioning on c (right figure) blocks the path from a to b and renders a and b conditionally independent given c.
[Figures: left, c → a and c → b with c unobserved: not a ⊥⊥ b | ∅; right, the same graph with c observed: a ⊥⊥ b | c]
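A numeric illustration of the TT case (a sketch with arbitrary made-up CPTs of the form p(a | c) p(b | c) p(c)): marginally a and b are dependent, but conditioning on c makes them independent, matching the two figures above.

```python
import numpy as np

p_c = np.array([0.3, 0.7])
p_a_given_c = np.array([[0.9, 0.2],     # p(a | c): rows index a, columns index c
                        [0.1, 0.8]])
p_b_given_c = np.array([[0.6, 0.1],     # p(b | c)
                        [0.4, 0.9]])

p_abc = p_a_given_c[:, None, :] * p_b_given_c[None, :, :] * p_c   # p(a, b, c)

p_ab = p_abc.sum(axis=2)                                   # p(a, b)
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
print(np.allclose(p_ab, np.outer(p_a, p_b)))               # False: not a ⊥⊥ b | ∅

p_ab_given_c = p_abc / p_c                                 # p(a, b | c)
indep_given_c = np.allclose(
    p_ab_given_c,
    (p_abc.sum(axis=1) / p_c)[:, None, :] * (p_abc.sum(axis=0) / p_c)[None, :, :])
print(indep_given_c)                                       # True: a ⊥⊥ b | c
```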
Conditional Independence: HT (1/3)
Second example.
p(a, b, c) = p(a) p(c | a) p(b | c)
Marginalise over c to test for independence.
p(a, b) = p(a) ∑_{c} p(c | a) p(b | c) = p(a) p(b | a) ≠ p(a) p(b)

Does not hold: a ⊥⊥ b | ∅.
[Graph: a → c → b]
Conditional Independence: HT (2/3)
Now condition on c.
p(a, b | c) = p(a, b, c) / p(c)
            = p(a) p(c | a) p(b | c) / p(c)
            = p(a | c) p(b | c)

where we used Bayes’ theorem p(c | a) = p(a | c) p(c) / p(a).
Therefore a ⊥⊥ b | c.
[Graph: a → c → b, with c observed]
Conditional Independence: HT (3/3)
[Figures: left, a → c → b with c unobserved: not a ⊥⊥ b | ∅; right, a → c → b with c observed: a ⊥⊥ b | c]
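The HT (chain) case can also be checked numerically. The sketch below uses made-up CPTs for a → c → b: a and b are marginally dependent, but independent once c is observed.

```python
import numpy as np

p_a = np.array([0.4, 0.6])
p_c_given_a = np.array([[0.9, 0.3],     # p(c | a): rows index c, columns index a
                        [0.1, 0.7]])
p_b_given_c = np.array([[0.8, 0.2],     # p(b | c): rows index b, columns index c
                        [0.2, 0.8]])

# p(a, c, b) = p(a) p(c | a) p(b | c), indexed [a, c, b]
p_acb = p_a[:, None, None] * p_c_given_a.T[:, :, None] * p_b_given_c.T[None, :, :]

p_ab = p_acb.sum(axis=1)                                           # p(a, b)
print(np.allclose(p_ab, np.outer(p_ab.sum(1), p_ab.sum(0))))       # False: not a ⊥⊥ b | ∅

p_c = p_acb.sum(axis=(0, 2))
p_acb_given_c = p_acb / p_c[None, :, None]                         # p(a, b | c), indexed [a, c, b]
p_a_given_c = p_acb.sum(axis=2) / p_c                              # p(a | c), indexed [a, c]
p_b_given_c_chk = p_acb.sum(axis=0) / p_c[:, None]                 # p(b | c), indexed [c, b]
print(np.allclose(p_acb_given_c,
                  p_a_given_c[:, :, None] * p_b_given_c_chk[None, :, :]))   # True: a ⊥⊥ b | c
```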
Conditional Independence: HH (1/4)
Third example. (A little bit more subtle.)
p(a, b, c) = p(a) p(b) p(c | a, b)
Marginalise over c to test for independence:

p(a, b) = ∑_{c} p(a) p(b) p(c | a, b) = p(a) p(b) ∑_{c} p(c | a, b) = p(a) p(b)

a and b are independent if NO variable is observed:
a ⊥⊥ b | ∅.
[Graph: a → c ← b]
Conditional Independence: HH (2/4)
Now condition on c.
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(b) p(c | a, b) / p(c) ≠ p(a | c) p(b | c).

Does not hold: a ⊥⊥ b | c.
[Graph: a → c ← b, with c observed]
Conditional Independence: HH (3/4)
Graphical interpretation
In both graphical models there is a path from a to b.
The node c is called head-to-head (HH) with respect to
this path because the node c is connected to the heads of
the arrows in the path.
The presence of the HH-node c on the path (left figure) makes a independent of b (and b independent of a). The unobserved c blocks the path from a to b.
[Figures: left, a → c ← b with c unobserved: a ⊥⊥ b | ∅; right, the same graph with c observed: not a ⊥⊥ b | c]
Conditional Independence: HH (4/4)
Graphical interpretation
Conditioning on c unblocks the path from a to b, and
renders a conditionally dependent on b given c.
Some more terminology: Node y is a descendant of node
x if there is a path from x to y in which each step follows
the directions of the arrows.
A path through an HH-node will become unblocked if either the node itself, or any of its descendants, is observed.
[Figures: left, a → c ← b with c observed: not a ⊥⊥ b | c; right, the graph a → e ← f, f → b, e → c with the descendant c observed: not a ⊥⊥ f | c]
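This behaviour of HH nodes is often called explaining away. The toy sketch below (the XOR construction is my own illustration, not from the slides) shows a and b independent a priori but dependent once the HH node c = a XOR b is observed.

```python
import numpy as np

p_a = np.array([0.5, 0.5])
p_b = np.array([0.5, 0.5])
# p(c | a, b): c is a deterministic XOR of a and b
p_c_given_ab = np.zeros((2, 2, 2))          # indexed [a, b, c]
for a in (0, 1):
    for b in (0, 1):
        p_c_given_ab[a, b, a ^ b] = 1.0

p_abc = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab   # p(a, b, c)

p_ab = p_abc.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))    # True: a ⊥⊥ b | ∅

p_c = p_abc.sum(axis=(0, 1))
p_ab_given_c = p_abc / p_c                      # p(a, b | c)
p_a_given_c = p_abc.sum(axis=1) / p_c
p_b_given_c = p_abc.sum(axis=0) / p_c
print(np.allclose(p_ab_given_c,
                  p_a_given_c[:, None, :] * p_b_given_c[None, :, :]))   # False: not a ⊥⊥ b | c
```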
Conditional Independence – General Graph
Conditional (in)dependence has been established for
all configurations of three variables: (HH, HT, TT) × (observed, unobserved), and
the subtle case of an HH junction with an observed descendant.
We can generalise these results to arbitrary Bayesian
Networks.
Roughly: the graph connectivity implies conditional
independence for those sets of nodes which are
directionally separated.
Bayesian Network – D-Separation
Definition (Blocked Path)
A blocked path is a path which contains
an observed TT- or HT-node, or
an unobserved HH-node whose descendants are all
unobserved.
[Figures: the three junction types: TT (a ← c → b), HT (a → c → b), and HH (a → c ← b)]
Bayesian Network – D-Separation
Consider a general directed graph in which A, B, and C are
arbitrary non-intersecting sets of nodes. (There may be
other nodes in the graph which are not contained in the
union of A, B, and C.)
Consider all possible paths from any node in A to any
node in B.
Any such path is blocked, if it includes a node such that
either
the node is HT or TT, and the node is in set C, or
the node is HH, and neither the node, nor any of its descendants, is in set C.
If all paths are blocked, then A is d-separated from B by C, and the joint distribution over all the variables in the graph will satisfy A ⊥⊥ B | C.
(Note: D-separation stands for 'directional' separation.)
Bayesian Network – D-separation
Example
The path from a to b is not blocked by f because f is a
TT-node and unobserved.
The path from a to b is not blocked by e because e is an HH-node, and although unobserved itself, one of its descendants (node c) is observed.
Hence the path is unblocked, and a ⊥⊥ b | c does not follow from this graph.
[Graph: a → e ← f, f → b, e → c; node c is observed]
Bayesian Network – D-separation
Another example
The path from a to b is blocked by f because f is a
TT-node and observed. Therefore, a ⊥⊥ b | f .
Furthermore, the path from a to b is also blocked by e, because e is an HH-node and neither it nor its descendants are observed. Therefore a ⊥⊥ b | f.
[Graph: a → e ← f, f → b, e → c; node f is observed]
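For small graphs, the d-separation criterion can be implemented directly from the definition. The sketch below (my own implementation, intended only for small examples; names are illustrative) enumerates the undirected paths between two sets of nodes and applies the blocking rules, reproducing both statements about the example graph above.

```python
from itertools import product

def descendants(graph, node):
    """All nodes reachable from `node` along directed edges. `graph` maps node -> children."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def undirected_paths(graph, start, goal):
    """All simple paths between start and goal in the undirected skeleton."""
    neigh = {n: set() for n in graph}
    for u, children in graph.items():
        for v in children:
            neigh[u].add(v)
            neigh.setdefault(v, set()).add(u)
    def dfs(node, path):
        if node == goal:
            yield path
            return
        for nxt in neigh.get(node, ()):
            if nxt not in path:
                yield from dfs(nxt, path + [nxt])
    yield from dfs(start, [start])

def path_blocked(graph, path, observed):
    """A path is blocked by an observed TT/HT node, or by an HH node with no observed descendant."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        into_from_prev = node in graph.get(prev, [])   # prev -> node ?
        into_from_next = node in graph.get(nxt, [])    # nxt  -> node ?
        if into_from_prev and into_from_next:          # head-to-head
            if node not in observed and not (descendants(graph, node) & observed):
                return True
        else:                                          # tail-to-tail or head-to-tail
            if node in observed:
                return True
    return False

def d_separated(graph, A, B, C):
    """True iff every path from A to B is blocked given the observed set C."""
    return all(path_blocked(graph, path, set(C))
               for a, b in product(A, B)
               for path in undirected_paths(graph, a, b))

# The example graph from these slides: a -> e <- f, f -> b, e -> c.
graph = {'a': ['e'], 'f': ['e', 'b'], 'e': ['c'], 'b': [], 'c': []}
print(d_separated(graph, {'a'}, {'b'}, {'c'}))   # False: observing c unblocks the HH node e
print(d_separated(graph, {'a'}, {'b'}, {'f'}))   # True:  observing the TT node f blocks the path
```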
Conditional Independence ⇔ Factorisation (1/4)
Theorem (Factorisation ⇒ Conditional Independence)
If a probability distribution factorises according to a directed
acyclic graph, and if A, B and C are disjoint subsets of nodes
such that A is d-separated from B by C in the graph, then the
distribution satisfies A ⊥⊥ B | C.
Theorem (Conditional Independence ⇒ Factorisation)
If a probability distribution satisfies the conditional
independence statements implied by d-separation over a
particular directed graph, then it also factorises according to
the graph.
Conditional Independence ⇔ Factorisation (2/4)
Why is Conditional Independence ⇔ Factorisation relevant?
Conditional Independence statements are usually what a domain expert knows about the problem at hand.
What is needed is a model p(x) for computation.
The Conditional Independence ⇒ Factorisation direction provides p(x) from the conditional independence statements.
One can build a global model for computation from local
conditional independence statements.
Conditional Independence ⇔ Factorisation (3/4)
Given a set of Conditional Independence statements, adding another statement will in general produce other statements.
All statements can be read as d-separation in a DAG.
However, there are sets of Conditional Independence
statements which cannot be satisfied by any Bayesian
Network.
Conditional Independence ⇔ Factorisation (4/4)
The broader picture
A directed graphical model can be viewed as a filter accepting probability distributions p(x) and only letting through those which satisfy the factorisation property. The set of all distributions p(x) which pass through the filter is denoted DF.
Alternatively, the filter (graph) passes exactly those distributions p(x) which respect the conditional independencies implied by the d-separation properties of the graph.
The d-separation theorem says that the resulting set DF is
the same in both cases.
[Figure: the graph acting as a filter on distributions p(x), admitting the set DF]