
Statistical Machine Learning


Outlines

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XVII

Probabilistic Graphical Models 1


Motivation via a Picture

Image segmentation
Clustering the gray-value representation of an image discards neighbourhood information.
We need to use the structure of the image.


Motivation via Independence (1/4)

Why is the grass wet?

[Graph: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]

Introduce four Boolean variables:
C(loudy), S(prinkler), R(ain), W(etGrass) ∈ {F(alse), T(rue)}.


Motivation via Independence (2/4)

Model the conditional probabilities:

p(C = F)  p(C = T)
  0.2       0.8

C | p(S = F)  p(S = T)
F |   0.5       0.5
T |   0.9       0.1

C | p(R = F)  p(R = T)
F |   0.8       0.2
T |   0.2       0.8

S  R | p(W = F)  p(W = T)
F  F |   1.0       0.0
T  F |   0.1       0.9
F  T |   0.1       0.9
T  T |   0.01      0.99

[Graph: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
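These tables can be encoded directly, which makes the marginalisation on the following slides easy to verify numerically. A minimal Python sketch (the variable names and table layout are my own, not from the slides):

```python
# CPTs of the wet-grass network, copied from the tables above.
p_C = {False: 0.2, True: 0.8}                      # p(C)
p_S = {False: {False: 0.5, True: 0.5},             # p(S | C); outer key is C
       True:  {False: 0.9, True: 0.1}}
p_R = {False: {False: 0.8, True: 0.2},             # p(R | C); outer key is C
       True:  {False: 0.2, True: 0.8}}
p_W = {(False, False): {False: 1.0,  True: 0.0},   # p(W | S, R); key is (S, R)
       (True,  False): {False: 0.1,  True: 0.9},
       (False, True):  {False: 0.1,  True: 0.9},
       (True,  True):  {False: 0.01, True: 0.99}}

def joint(c, s, r, w):
    """p(C=c, S=s, R=r, W=w) via the factorisation p(C) p(S|C) p(R|C) p(W|S,R)."""
    return p_C[c] * p_S[c][s] * p_R[c][r] * p_W[(s, r)][w]
```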


Motivation via Independence (3/4)

If everything depends on everything
C  S  R  W | p(C, S, R, W)
F  F  F  F |    . . .
F  F  F  T |    . . .
  . . .    |    . . .
T  T  T  F |    . . .
T  T  T  T |    . . .

p(W, S, R, C) = p(W | S, R, C) p(S, R, C)
             = p(W | S, R, C) p(S | R, C) p(R, C)
             = p(W | S, R, C) p(S | R, C) p(R | C) p(C)

[Graph: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]


Motivation via Independence (4/4)

Without independence assumptions:

p(W) = ∑_{S,R,C} p(W | S, R, C) p(S | R, C) p(R | C) p(C)

With the independence assumptions expressed by the graph:

p(W) = ∑_{S,R} p(W | S, R) ∑_C p(S | C) p(R | C) p(C)

[Graphs: left, the fully connected model over Cloudy, Sprinkler, Rain, WetGrass; right, the sparse model Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
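To see that the two expressions agree, one can evaluate both with the CPTs coded earlier (a sketch reusing joint, p_C, p_S, p_R and p_W from the previous snippet):

```python
from itertools import product

# Brute force: marginalise the full joint over C, S and R.
p_w_full = sum(joint(c, s, r, True) for c, s, r in product([False, True], repeat=3))

# Factored form: p(W) = sum_{S,R} p(W | S,R) sum_C p(S | C) p(R | C) p(C).
p_w_factored = sum(
    p_W[(s, r)][True] * sum(p_S[c][s] * p_R[c][r] * p_C[c] for c in [False, True])
    for s, r in product([False, True], repeat=2)
)

assert abs(p_w_full - p_w_factored) < 1e-12
```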


Motivation via Distributive Law

Two key observations when dealing with probabilities:

1 The distributive law can save operations:

  a(b + c)   [2 operations]   =   ab + ac   [3 operations]

2 If some probabilities do not depend on all random variables, we might be able to factor them out. For example, assume

  p(x1, x3 | x2) = p(x1 | x2) p(x3 | x2),

then (using ∑_{x3} p(x3 | x2) = 1)

  p(x1) = ∑_{x2,x3} p(x1, x2, x3) = ∑_{x2,x3} p(x1, x3 | x2) p(x2)
        = ∑_{x2,x3} p(x1 | x2) p(x3 | x2) p(x2)   [O(|X1| |X2| |X3|) operations]
        = ∑_{x2} p(x1 | x2) p(x2)                 [O(|X1| |X2|) operations]
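Numerically, the saving is easy to demonstrate. A small sketch with made-up conditional tables (p1, p3, p2 and the shapes are my own choices, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 5, 6

# Random conditional tables: p1[i, j] = p(x1 = i | x2 = j), etc.
p1 = rng.random((n1, n2)); p1 /= p1.sum(axis=0, keepdims=True)   # p(x1 | x2)
p3 = rng.random((n3, n2)); p3 /= p3.sum(axis=0, keepdims=True)   # p(x3 | x2)
p2 = rng.random(n2); p2 /= p2.sum()                              # p(x2)

# Naive: sum over x2 AND x3 of p(x1|x2) p(x3|x2) p(x2) -- O(|X1||X2||X3|).
naive = np.einsum('ij,kj,j->i', p1, p3, p2)

# Factored: since sum_{x3} p(x3|x2) = 1, p(x1) = sum_{x2} p(x1|x2) p(x2) -- O(|X1||X2|).
factored = p1 @ p2

assert np.allclose(naive, factored)
```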


Motivation via Complexity Reduction

How do we deal with a more complex expression?

p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)

Graphical models:

[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]




Probabilistic Graphical Models: Aims

Graphical models
Visualise the structure of a probabilistic model
Complex computations with formulas → manipulations with graphs
Obtain insights into model properties by inspection
Develop and motivate new models


Probabilistic Graphical Models: Main Flavours

Graph
1 Nodes (vertices): each node represents a random variable
2 Edges (links, arcs; directed or undirected): each edge represents a probabilistic relationship

Directed graph: Bayesian Network (also called Directed Graphical Model), expressing causal relationships between variables
Undirected graph: Markov Random Field, expressing soft constraints between variables
Factor Graph: convenient for solving inference problems (derived from Bayesian Networks or Markov Random Fields)

[Figure: a DAG over x1, ..., x7 — a Bayesian Network]
[Figure: variables x1, x2, x3 linked to factors fa, fb, fc, fd — a Factor Graph]
[Figure: nodes xi and yi — a Markov Random Field]


Bayesian Network: Arbitrary Joint Distribution

p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)

1 Draw a node for each conditional distribution associated with a random variable.

2 Draw an edge from each conditional distribution associated with a random variable to all other conditional distributions which are conditioned on this variable.

[Graph: a → b, a → c, b → c]

We have chosen a particular ordering of the variables!




Bayesian Network: Arbitrary Joint Distribution

General case for K variables:

p(x1, . . . , xK) = p(xK | x1, . . . , xK−1) · · · p(x2 | x1) p(x1)

The graph of this distribution is fully connected. (Prove it.)
What happens if we deal with a distribution represented by a graph which is not fully connected?
It can no longer be the most general distribution.
The absence of edges carries important information.


Joint Factorisation → Graph

p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)

1 Draw a node for each conditional distribution associated with a random variable.

2 Draw an edge from each conditional distribution associated with a random variable to all other conditional distributions which are conditioned on this variable.

[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]






Graph → Joint Factorisation

[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]

Can we get the expression from the graph?

1 Write a product of probability distributions, one for each associated random variable. ↔ Draw a node for each conditional distribution associated with a random variable.

2 Add all random variables associated with parent nodes to the list of conditioning variables. ↔ Draw an edge from each conditional distribution associated with a random variable to all other conditional distributions which are conditioned on this variable.

p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)




Joint Factorisation ↔ Graph

The joint distribution defined by a graph is given by the product, over all of the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of the node in the graph:

p(x) = ∏_{k=1}^{K} p(xk | pa(xk))

where pa(xk) denotes the set of parents of xk and x = (x1, . . . , xK).
The absence of edges (compared to the fully connected case) is the point.
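The same product can be written generically: given a parent map and one conditional table per node, the joint is a loop over nodes. A sketch (my own structure, reusing the wet-grass tables p_C, p_S, p_R and p_W from earlier):

```python
# Parent map of the wet-grass network; () means no parents.
parents = {'C': (), 'S': ('C',), 'R': ('C',), 'W': ('S', 'R')}

# One lookup per node: (value, tuple of parent values) -> probability.
cpt = {
    'C': lambda v, pa: p_C[v],
    'S': lambda v, pa: p_S[pa[0]][v],
    'R': lambda v, pa: p_R[pa[0]][v],
    'W': lambda v, pa: p_W[pa][v],
}

def joint_from_graph(assignment):
    """p(x) = prod_k p(x_k | pa(x_k)) for a full assignment {node: value}."""
    prob = 1.0
    for var, pa_vars in parents.items():
        pa_vals = tuple(assignment[p] for p in pa_vars)
        prob *= cpt[var](assignment[var], pa_vals)
    return prob

assert joint_from_graph({'C': True, 'S': False, 'R': True, 'W': True}) \
    == joint(True, False, True, True)
```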


Joint Factorisation ↔ Graph

Restriction: the graph must be a directed acyclic graph (DAG).
There are no closed paths in the graph when moving along the directed edges.
Or equivalently: there exists an ordering of the nodes such that there are no edges that go from any node to any lower-numbered node.

[Graph: x1, x2, x3 → x4; x1, x3 → x5; x4 → x6; x4, x5 → x7]

Extension: we can also have sets of variables, or vectors, at a node.
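The equivalence between "no directed cycles" and "such an ordering exists" is exactly what a topological sort establishes. A small sketch (an assumed helper, not from the slides) using Kahn's algorithm on the parent map from the previous snippet:

```python
def is_dag(parents):
    """True iff the graph given by the parent map has no directed cycle."""
    children = {v: [] for v in parents}
    indegree = {v: len(ps) for v, ps in parents.items()}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)
    frontier = [v for v, d in indegree.items() if d == 0]  # nodes with no parents
    visited = 0
    while frontier:
        v = frontier.pop()
        visited += 1
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                frontier.append(c)
    return visited == len(parents)  # all nodes ordered iff no cycle

assert is_dag(parents)  # the wet-grass graph from earlier is a DAG
```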


Bayesian Network: Normalisation

Given

p(x) = ∏_{k=1}^{K} p(xk | pa(xk)).

Is p(x) normalised, ∑_x p(x) = 1?
As the graph is a DAG, there always exists a node with no outgoing edges, say xi. Then xi is not a parent of any other node, so it appears only in its own factor and its sum can be pulled inside:

∑_x p(x) = ∑_{x1,...,xi−1,xi+1,...,xK} ( ∏_{k=1, k≠i}^{K} p(xk | pa(xk)) ) ∑_{xi} p(xi | pa(xi))

where the last sum equals 1, because

∑_{xi} p(xi | pa(xi)) = ∑_{xi} p(xi, pa(xi)) / p(pa(xi)) = p(pa(xi)) / p(pa(xi)) = 1.

Repeat, until no node is left.
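A quick numeric confirmation for the wet-grass network, reusing joint from the earlier sketch:

```python
# The sixteen joint probabilities must sum to one.
total = sum(joint(c, s, r, w)
            for c in [False, True] for s in [False, True]
            for r in [False, True] for w in [False, True])
assert abs(total - 1.0) < 1e-12
```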




Bayesian Network – Plate Notation

Bayesian polynomial regression: observed inputs x, observed targets t, noise variance σ2, hyperparameter α controlling the priors for w.
Focusing on t and w only:

p(t, w) = p(w) ∏_{n=1}^{N} p(tn | w)

[Figure: left, the explicit graph w → t1, . . . , tN; right, the equivalent compact notation with a single node tn inside a box labelled N — a plate.]




Bayesian Network – Plate Notation

Include also the parameters in the graphical model:

p(t, w | x, α, σ2) = p(w | α) ∏_{n=1}^{N} p(tn | w, xn, σ2)

Random variables = open circles
Deterministic variables = smaller solid circles

[Figure: plate over n = 1, . . . , N containing xn and tn, with w, α and σ2 outside the plate.]


Bayesian Network – Plate Notation

Random variables:
Observed random variables, e.g. t
Unobserved random variables, e.g. w (also called latent or hidden random variables)

Shade the observed random variables in the graphical model.

[Figure: the same plate model, now with the observed tn shaded.]


Bayesian Network – Plate Notation

Prediction: new data point x̂. We want to predict t̂.

p(t̂, t, w | x̂, x, α, σ2) = [ ∏_{n=1}^{N} p(tn | xn, w, σ2) ] p(w | α) p(t̂ | x̂, w, σ2)

[Figure: the polynomial regression model including prediction — the plate model extended with the new input x̂ and target t̂.]


Bayesian Network – Conditional Independence

Any p(x) may be written

p(x) = ∏_{k=1}^{K} p(xk | x1, x2, . . . , xk−1) = ∏_{k=1}^{K} p(xk | pa(xk))

What does it mean if pa(xk) ≠ {x1, x2, . . . , xk−1} in the above equation?

Sparse graphical model.
p(x) no longer general.
Assumption, e.g. p(W | C, S, R) = p(W | S, R).
Conditional independence.


Bayesian Network – Conditional Independence

Definition (Conditional Independence)

If for three random variables a, b, and c the following holds

p(a | b, c) = p(a | c)

then a is conditionally independent of b given c.
Notation: a ⊥⊥ b | c.

The above equation must hold for all possible values of c.
Consequence:

p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c)

Conditional independence simplifies
the structure of the model, and
the computations needed to perform inference/learning.

NB: conditional dependence is a related but different concept we are not concerned with in this course.
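As a concrete check of the definition, the wet-grass network satisfies W ⊥⊥ C | S, R by construction: its factorisation contains p(W | S, R), not p(W | C, S, R). A sketch verifying that p(w | c, s, r) does not depend on c, reusing joint from earlier:

```python
from itertools import product

def conditional_w(w, c, s, r):
    """p(W=w | C=c, S=s, R=r), computed from the joint."""
    return joint(c, s, r, w) / sum(joint(c, s, r, wv) for wv in [False, True])

# p(W | C, S, R) is the same for C = True and C = False: W is
# conditionally independent of C given S and R.
for s, r, w in product([False, True], repeat=3):
    assert abs(conditional_w(w, True, s, r) - conditional_w(w, False, s, r)) < 1e-12
```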


Bayesian Network – Conditional Independence

Rules for Conditional Independence

Symmetry: X ⊥⊥ Y | Z  =⇒  Y ⊥⊥ X | Z
Decomposition: Y, W ⊥⊥ X | Z  =⇒  Y ⊥⊥ X | Z and W ⊥⊥ X | Z
Weak Union: X ⊥⊥ Y, W | Z  =⇒  X ⊥⊥ Y | Z, W
Contraction: X ⊥⊥ W | Z, Y and X ⊥⊥ Y | Z  =⇒  X ⊥⊥ W, Y | Z
Intersection: X ⊥⊥ Y | Z, W and X ⊥⊥ W | Z, Y  =⇒  X ⊥⊥ Y, W | Z

Note: Intersection is only valid for p(X), p(Y), p(Z), p(W) > 0.


Conditional Independence: TT (1/3)

Can we work with the graphical model directly?
Check the simplest examples containing only three nodes.
The first example has joint distribution

p(a, b, c) = p(a | c) p(b | c) p(c)

Marginalise both sides over c:

p(a, b) = ∑_c p(a | c) p(b | c) p(c) ≠ p(a) p(b).

Does not hold: a ⊥⊥ b | ∅ (where ∅ is the empty set).

[Graph: c → a, c → b]


Conditional Independence: TT (2/3)

Now condition on c:

p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c)

Therefore a ⊥⊥ b | c.

[Graph: c → a, c → b, with c observed]


Conditional Independence: TT (3/3)

Graphical interpretation
In both graphical models there is a path from a to b.
The node c is called tail-to-tail (TT) with respect to this path because it is connected to the tails of the arrows in the path.
The presence of the unobserved TT-node c (left) renders a dependent on b (and b dependent on a).
Conditioning on c (right) blocks the path from a to b and causes a and b to become conditionally independent given c.

[Graph: c → a, c → b, c unobserved — not a ⊥⊥ b | ∅]
[Graph: c → a, c → b, c observed — a ⊥⊥ b | c]


Conditional Independence: HT (1/3)

Second example:

p(a, b, c) = p(a) p(c | a) p(b | c)

Marginalise over c to test for independence:

p(a, b) = p(a) ∑_c p(c | a) p(b | c) = p(a) p(b | a) ≠ p(a) p(b)

Does not hold: a ⊥⊥ b | ∅.

[Graph: a → c → b]


Conditional Independence: HT (2/3)

Now condition on c:

p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c) = p(a | c) p(b | c)

where we used Bayes’ theorem p(c | a) = p(a | c) p(c) / p(a).
Therefore a ⊥⊥ b | c.

[Graph: a → c → b, with c observed]


Conditional Independence: HT (3/3)

The node c is head-to-tail (HT) with respect to the path from a to b. As in the TT case, observing c blocks the path.

[Graph: a → c → b, c unobserved — not a ⊥⊥ b | ∅]
[Graph: a → c → b, c observed — a ⊥⊥ b | c]


Conditional Independence: HH (1/4)

Third example. (A little bit more subtle.)

p(a, b, c) = p(a) p(b) p(c | a, b)

Marginalise over c to test for independence:

p(a, b) = ∑_c p(a) p(b) p(c | a, b) = p(a) p(b) ∑_c p(c | a, b) = p(a) p(b)

a and b are independent if NO variable is observed: a ⊥⊥ b | ∅.

[Graph: a → c ← b]


Conditional Independence: HH (2/4)

Now condition on c:

p(a, b | c) = p(a, b, c) / p(c) = p(a) p(b) p(c | a, b) / p(c) ≠ p(a | c) p(b | c)

Does not hold: a ⊥⊥ b | c.

[Graph: a → c ← b, with c observed]


Conditional Independence: HH (3/4)

Graphical interpretation
In both graphical models there is a path from a to b.
The node c is called head-to-head (HH) with respect to this path because it is connected to the heads of the arrows in the path.
The presence of the unobserved HH-node c (left) makes a independent of b (and b independent of a): the unobserved c blocks the path from a to b.

[Graph: a → c ← b, c unobserved — a ⊥⊥ b | ∅]
[Graph: a → c ← b, c observed — not a ⊥⊥ b | c]


Conditional Independence: HH (4/4)

Graphical interpretation
Conditioning on c unblocks the path from a to b, and renders a conditionally dependent on b given c.
Some more terminology: node y is a descendant of node x if there is a path from x to y in which each step follows the direction of the arrows.
A HH-path will become unblocked if either the HH-node itself, or any of its descendants, is observed.

[Graph: a → c ← b, c observed — not a ⊥⊥ b | c]
[Graph: a → e ← f with f → b and e → c, c observed — not a ⊥⊥ f | c]
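The wet-grass network makes this effect ("explaining away") concrete: S → W ← R is a HH junction, so observing W couples S and R. A sketch reusing joint from earlier shows that additionally observing rain lowers the posterior probability that the sprinkler ran:

```python
def p_sprinkler_given(w_obs, r_obs=None):
    """p(S=True | W=w_obs [, R=r_obs]) by enumerating the joint."""
    def mass(s):
        return sum(joint(c, s, r, w_obs)
                   for c in [False, True]
                   for r in ([r_obs] if r_obs is not None else [False, True]))
    return mass(True) / (mass(True) + mass(False))

print(p_sprinkler_given(True))        # p(S=T | W=T)      ~ 0.240
print(p_sprinkler_given(True, True))  # p(S=T | W=T, R=T) ~ 0.134: rain explains the wet grass away
```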


Conditional Independence – General Graph

Conditional independence has now been established for:
all configurations of three variables, (HH, HT, TT) × (observed, unobserved), and
the subtle case of a HH junction with an observed descendant.

We can generalise these results to arbitrary Bayesian Networks.
Roughly: the graph connectivity implies conditional independence for those sets of nodes which are directionally separated.


Bayesian Network – D-Separation

Definition (Blocked Path)

A blocked path is a path which contains
an observed TT- or HT-node, or
an unobserved HH-node whose descendants are all
unobserved.

[Graphs: an observed TT-node c between a and b; an observed HT-node c on a → c → b; an unobserved HH-node c on a → c ← b]


Bayesian Network – D-Separation

Consider a general directed graph in which A, B, and C are arbitrary non-intersecting sets of nodes. (There may be other nodes in the graph which are not contained in the union of A, B, and C.)
Consider all possible paths from any node in A to any node in B.
Any such path is blocked if it includes a node such that either
the node is HT or TT, and the node is in set C, or
the node is HH, and neither the node, nor any of its descendants, is in set C.
If all paths are blocked, then A is d-separated from B by C, and the joint distribution over all the variables in the graph will satisfy A ⊥⊥ B | C.

(Note: d-separation stands for ‘directional’ separation.)
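The definition translates almost directly into a reachability test (often called the "Bayes ball" algorithm). The sketch below is my own implementation, not from the slides; it reuses the parents map defined earlier and assumes A, B, C are disjoint sets of node names:

```python
def d_separated(parents, A, B, C):
    """True iff A ⊥⊥ B | C is implied by the graph (A, B, C disjoint sets)."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # Nodes in C or with a descendant in C: these activate HH (collider) nodes.
    anc_or_c, stack = set(C), list(C)
    while stack:
        v = stack.pop()
        for p in parents[v]:
            if p not in anc_or_c:
                anc_or_c.add(p)
                stack.append(p)

    # Traverse (node, arrival direction): 'down' = along an arrow (from a
    # parent), 'up' = against an arrow (from a child, or a start node).
    visited, stack = set(), [(a, 'up') for a in A]
    while stack:
        y, d = stack.pop()
        if (y, d) in visited:
            continue
        visited.add((y, d))
        if y in B:
            return False                        # an unblocked path reaches B
        if d == 'up' and y not in C:            # HT/TT junctions need y unobserved
            stack += [(p, 'up') for p in parents[y]]
            stack += [(c, 'down') for c in children[y]]
        elif d == 'down':
            if y not in C:                      # head-to-tail chain stays active
                stack += [(c, 'down') for c in children[y]]
            if y in anc_or_c:                   # head-to-head junction is active
                stack += [(p, 'up') for p in parents[y]]
    return True

# Wet-grass graph: S ⊥⊥ R | C holds, but not given the collider W.
assert d_separated(parents, {'S'}, {'R'}, {'C'})
assert not d_separated(parents, {'S'}, {'R'}, {'W'})
```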


Bayesian Network – D-separation

Example
The path from a to b is not blocked by f because f is a
TT-node and unobserved.
The path from a to b is not blocked by e because e is a
HH-node, and although unobserved itself, one of its
descendants (node c) is observed.

[Graph: a → e ← f, f → b, e → c, with c observed]


Bayesian Network – D-separation

Another example
The path from a to b is blocked by f because f is a
TT-node and observed. Therefore, a ⊥⊥ b | f .
Furthermore, the path from a to b is also blocked by e
because e is a HH-node, and neither it nor its descendants
are observed. Therefore a ⊥⊥ b | f .

[Graph: a → e ← f, f → b, e → c, with f observed]


Conditional Independence ⇔ Factorisation (1/4)

Theorem (Factorisation ⇒ Conditional Independence)
If a probability distribution factorises according to a directed acyclic graph, and if A, B and C are disjoint subsets of nodes such that A is d-separated from B by C in the graph, then the distribution satisfies A ⊥⊥ B | C.

Theorem (Conditional Independence ⇒ Factorisation)
If a probability distribution satisfies the conditional independence statements implied by d-separation over a particular directed graph, then it also factorises according to the graph.


Conditional Independence ⇔ Factorisation (2/4)

Why is Conditional Independence ⇔ Factorisation relevant?
Conditional independence statements are usually what a domain expert knows about the problem at hand.
What is needed is a model p(x) for computation.
The direction Conditional Independence ⇒ Factorisation provides p(x) from the conditional independence statements.
One can thus build a global model for computation from local conditional independence statements.


Conditional Independence ⇔ Factorisation (3/4)

Given a set of conditional independence statements:
Adding another statement will in general produce further statements.
All such statements can be read off as d-separation in a DAG.
However, there are sets of conditional independence statements which cannot be satisfied by any Bayesian Network.


Conditional Independence ⇔ Factorisation (4/4)

The broader picture
A directed graphical model can be viewed as a filter that accepts probability distributions p(x) and only lets through those which satisfy the factorisation property. The set of all possible distributions p(x) which pass through the filter is denoted DF.
Alternatively, only those probability distributions p(x) which respect the conditional independencies implied by the d-separation properties of the graph pass through the filter (graph).
The d-separation theorem says that the resulting set DF is the same in both cases.

[Figure: distributions p(x) passing through the graph, viewed as a filter, into the set DF]
