Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outline
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Finding p(a, b | f ) – ‘Analytical’ method

1 Start from the graph.
  [Figure: a directed graph over a, b, c, e, f with edges a → e, f → e, f → b and e → c]
2 Write down the conditional probability with the given observable(s): p(a, b, c, e | f)
3 Rewrite it as a joint distribution over all variables:
      p(a, b, c, e | f) = p(a, b, c, e, f) / p(f)
4 Factorise the joint probability according to the graph:
      p(a, b, c, e | f) = p(a, b, c, e, f) / p(f)
                        = p(a) p(f) p(e | a, f) p(b | f) p(c | e) / p(f)
5 Marginalise over all variables we don’t care about:
      p(a, b | f) = ∑_{c,e} p(a) p(f) p(e | a, f) p(b | f) p(c | e) / p(f)
                  = p(a) p(b | f)
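The marginalisation in step 5 can be checked numerically. The following sketch is not from the lecture; the helper random_cpt and the random seed are arbitrary choices. It builds random conditional probability tables for the factorisation p(a) p(f) p(e | a, f) p(b | f) p(c | e) over binary variables and verifies that summing out c and e and dividing by p(f) gives p(a) p(b | f).

    # Sketch (not from the lecture): numeric check that p(a, b | f) = p(a) p(b | f)
    # for the graph a -> e <- f -> b, e -> c, with arbitrary binary CPTs.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)

    def random_cpt(n_parents):
        """Random conditional distribution over a binary child given binary parents."""
        t = rng.random((2,) * n_parents + (2,))
        return t / t.sum(axis=-1, keepdims=True)

    p_a = random_cpt(0)          # p(a)
    p_f = random_cpt(0)          # p(f)
    p_e_af = random_cpt(2)       # p(e | a, f)
    p_b_f = random_cpt(1)        # p(b | f)
    p_c_e = random_cpt(1)        # p(c | e)

    def joint(a, b, c, e, f):
        return p_a[a] * p_f[f] * p_e_af[a, f, e] * p_b_f[f, b] * p_c_e[e, c]

    f = 1                        # condition on an arbitrary observed value of f
    p_f_marg = sum(joint(a, b, c, e, f)
                   for a, b, c, e in itertools.product([0, 1], repeat=4))
    for a, b in itertools.product([0, 1], repeat=2):
        lhs = sum(joint(a, b, c, e, f)
                  for c, e in itertools.product([0, 1], repeat=2)) / p_f_marg
        rhs = p_a[a] * p_b_f[f, b]
        assert np.isclose(lhs, rhs)   # p(a, b | f) = p(a) p(b | f)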
Finding p(a, b | f ) – ‘Graphical’ method

1 Start from the same graph.
  [Figure: the directed graph with edges a → e, f → e, f → b and e → c]
2 Check whether a ⊥⊥ b | f holds or not.
  Result: a ⊥⊥ b | f holds.
  Reason: the path from a to b (a – e – f – b) is blocked by f, because f is a TT-node and observed. Therefore a ⊥⊥ b | f.
3 Write down the factorisation:
      p(a, b | f) = p(a | f) p(b | f) = p(a) p(b | f)
Bayesian Network – D-separation: a third example

Is x3 d-separated from x6 given x1 and x5?
Mark x1 and x5 as observed.

[Figure: a directed graph over x1, . . . , x6]
Bayesian Network – D-separation

Is x3 d-separated from x6 given x1 and x5?
There are four junctions of interest between x3 and x6:
  {x3, x1, x2} is blocked, because x1 is a TT-node and observed.
  {x3, x5, x6} is blocked, because x5 is a HT-node and observed.
  {x3, x5, x2} is not blocked, because x5 is a HH-node and observed.
  {x5, x2, x6} is not blocked, because x2 is a TT-node and unobserved.
The path {x3, x5, x2, x6} therefore prevents d-separation.
So x3 is not d-separated from x6 given x1 and x5, as not all paths between x3 and x6 are blocked.

[Figure: the same graph, with x1 and x5 shaded as observed]
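For larger graphs it helps to automate this check. The sketch below is not part of the lecture: the d_separated helper uses the standard ancestral-moral-graph criterion (restrict to ancestors, marry parents, drop directions, delete the conditioning set, test reachability), and the edge list is a hypothetical reconstruction consistent with the junction analysis above (the connections of x4 are not reproduced here and are omitted).

    # Sketch (assumed helper): d-separation via the ancestral moral graph, using networkx.
    import networkx as nx

    def d_separated(dag, A, B, C):
        """True iff A and B are d-separated by C in the DAG `dag`."""
        # 1. Restrict to the ancestral subgraph of A ∪ B ∪ C.
        keep = set().union(A, B, C)
        for v in list(keep):
            keep |= nx.ancestors(dag, v)
        sub = dag.subgraph(keep)
        # 2. Moralise: connect all pairs of parents of each node, drop directions.
        moral = nx.Graph(sub.to_undirected())
        for v in sub.nodes:
            parents = list(sub.predecessors(v))
            moral.add_edges_from((p, q) for i, p in enumerate(parents) for q in parents[i + 1:])
        # 3. Remove the conditioning set and test plain graph separation.
        moral.remove_nodes_from(C)
        return not any(nx.has_path(moral, a, b)
                       for a in A for b in B if a in moral and b in moral)

    dag = nx.DiGraph([("x1", "x3"), ("x1", "x2"), ("x3", "x5"),
                      ("x2", "x5"), ("x2", "x6"), ("x5", "x6")])
    print(d_separated(dag, {"x3"}, {"x6"}, {"x1", "x5"}))   # False: not d-separated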
Part XVIII
Probabilistic Graphical Models 2
Markov Random Fields (MRFs)

Markov Random Fields (also called Markov networks or undirected graphical models) are defined over a graph with undirected edges.
MRFs allow for different conditional independence statements than Bayesian networks.
In a Bayesian network, the definition of a blocked path was subtle for a HH-node: the node blocks a path only if neither it nor any of its descendants is observed.
Is there an alternative graphical semantics for probability distributions such that conditional independence is determined by simple graph separation?
Yes: removing the direction from the edges removes the asymmetry between parent and child nodes, and with it the subtleties associated with the HH-node.
Graph Separation

Definition (Graph Separation)
In an undirected graph G with disjoint subsets of nodes A, B and C, if every path from A to B includes at least one node from C, then C is said to separate A from B in G.

[Figure: node sets A and B separated by the set C]
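Graph separation is just reachability after deleting C. A minimal sketch, assuming networkx and a made-up five-node example graph:

    # Sketch (assumed helper, not from the slides): does C separate A from B in an
    # undirected graph?  Delete C and check whether any A-B path survives.
    import networkx as nx

    def separates(graph, A, B, C):
        g = graph.copy()
        g.remove_nodes_from(C)
        return not any(nx.has_path(g, a, b) for a in A for b in B)

    g = nx.Graph([("a1", "c1"), ("a2", "c1"), ("c1", "b1"), ("c2", "b1"), ("a1", "c2")])
    print(separates(g, {"a1", "a2"}, {"b1"}, {"c1", "c2"}))   # True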
Conditional Independence in MRFs

Definition (Conditional Independence in a Markov Random Field)
In an undirected graph G with disjoint subsets of nodes A, B and C, A is conditionally independent of B given C,
    A ⊥⊥ B | C,
iff C separates A from B in G.

[Figure: node sets A and B separated by the set C]
Markov Random Field

Definition (Markov Random Field)
A Markov Random Field is a set of probability distributions { p(x) | p(x) > 0 ∀ x } for which there exists an undirected graph G such that, for all disjoint subsets of nodes A, B and C, whenever C separates A from B in G,
    A ⊥⊥ B | C.

Although we sometimes say “the MRF is such an undirected graph”, we mean “the MRF represents the set of all probability distributions whose conditional independence statements are precisely those given by graph separation in the graph”.
Factorisation in MRFs
Assume two nodes xi and xj that are not connected by an
edge.
Given all other nodes in the graph, xi and xj must be
conditionally independent as all paths between xi and xj
are blocked by observed nodes.
p(xi, xj | x\{i,j}) = p(xi | x\{i,j}) p(xj | x\{i,j})
where x\{i,j} denotes the set of all variables x with xi and xj
removed.
This suggests that sets of fully connected nodes (cliques) will be important. Can we use cliques to factorise the distribution over the graph?
Cliques in a graph

A clique is a subset of nodes in a graph such that there exists an edge between all pairs of nodes in the subset. (The nodes in a clique are fully connected.)
A maximal clique of a graph is a clique which is not a proper subset of another clique. (No other nodes of the graph can be added to a maximal clique without destroying the property of full connectedness.)

[Figure: a four-node undirected graph over x1, . . . , x4 illustrating its cliques and maximal cliques]
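Cliques and maximal cliques are easy to enumerate with standard graph libraries. A small sketch, assuming networkx and an example four-node edge set (not necessarily the graph in the figure):

    # Sketch (assumed example): enumerating cliques of a small undirected graph.
    import networkx as nx

    g = nx.Graph([("x1", "x2"), ("x1", "x3"), ("x2", "x3"), ("x2", "x4"), ("x3", "x4")])
    print(list(nx.find_cliques(g)))           # maximal cliques: {x1,x2,x3} and {x2,x3,x4}
    print(list(nx.enumerate_all_cliques(g)))  # all cliques, including non-maximal ones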
Factorisation using Cliques

We can express the factorisation with the help of the cliques.
In fact, we only need the maximal cliques, as any function over the variables of a subset of a maximal clique can be expressed as a function of the members of the maximal clique.
Denote by C the set of maximal cliques of a graph.
For a maximal clique C ∈ C, denote by xC the subset of the variables x which belong to C.

[Figure: the same four-node graph, with its maximal cliques highlighted]
Factorisation using Cliques

A probability distribution p(x) factorises with respect to a given undirected graph G if it can be written as
    p(x) = (1/Z) ∏_{C ∈ C} ψC(xC)
where C is the set of maximal cliques of G and the potential functions satisfy ψC(xC) ≥ 0. The constant Z = ∑_x ∏_{C ∈ C} ψC(xC) ensures the correct normalisation of p(x).
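A brute-force sketch of this factorisation (an assumed example, not from the slides): two random non-negative potentials over the maximal cliques {x1, x2, x3} and {x2, x3, x4}, with Z computed by summing the product of potentials over all joint states of the binary variables.

    # Sketch (assumed example): normalising a small binary MRF by enumeration.
    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    psi_123 = rng.random((2, 2, 2))    # non-negative potential over (x1, x2, x3)
    psi_234 = rng.random((2, 2, 2))    # non-negative potential over (x2, x3, x4)

    def unnormalised(x1, x2, x3, x4):
        return psi_123[x1, x2, x3] * psi_234[x2, x3, x4]

    Z = sum(unnormalised(*x) for x in itertools.product([0, 1], repeat=4))
    p = lambda *x: unnormalised(*x) / Z
    total = sum(p(*x) for x in itertools.product([0, 1], repeat=4))
    print(Z, total)                    # total is 1.0 up to floating point error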
Conditional Independence ⇔ Factorisation

Theorem (Factorisation ⇒ Conditional Independence)
If a probability distribution factorises according to an undirected graph, and if A, B and C are disjoint subsets of nodes such that C separates A from B in the graph, then the distribution satisfies A ⊥⊥ B | C.

Theorem (Conditional Independence ⇒ Factorisation; Hammersley–Clifford Theorem)
If a strictly positive probability distribution p(x) > 0 ∀ x satisfies the conditional independence statements implied by graph separation over a particular undirected graph, then it also factorises according to the graph.
Factorisation with strictly positive potential functions

As the potential functions ψC(xC) are strictly positive, one can express them as the exponential
    ψC(xC) = exp{−E(xC)}
of an energy function E(xC).
The resulting distribution is called the Boltzmann distribution.
The joint distribution is defined as the product of the potentials, so the total energy is obtained by adding the energies of the maximal cliques:
    p(x) = (1/Z) ∏_{C ∈ C} ψC(xC) = (1/Z) exp{ −∑_{C ∈ C} E(xC) }
Example of Image Denoising

Given an unknown and noise-free image described by binary pixels xi ∈ {−1, +1} for i = 1, . . . , D.
Randomly flip the sign of some pixels with some small probability, and denote the pixels of the noisy image by yi for i = 1, . . . , D.
Goal: recover the original noise-free image.

[Figure: the original image (left) and the image after randomly changing 10% of the pixels (right)]
Example of Image Denoising

Prior knowledge 1: We know the noise level is small. Therefore there is a strong correlation between original pixels xi and noisy pixels yi.
Prior knowledge 2: Neighbouring pixels xi and xj in an image are strongly correlated (given a decent resolution).
This prior knowledge can be captured in a Markov Random Field.

[Figure: the denoising MRF over latent pixels xi and observed pixels yi]
Example of Image Denoising

Cliques {xi, yi}: choose an energy function of the form −η xi yi with η > 0, giving the potential function exp{η xi yi}. This favours equal signs for xi and yi.
Cliques {xi, xj}, where i and j are indices of neighbouring pixels: choose an energy function of the form −β xi xj with β > 0, again favouring equal signs for xi and xj.
We need to accommodate the fact that one kind of pixel may occur more often than the other. This can be done by adding a term h xi for each pixel in the noise-free image. (A potential function is an arbitrary, non-negative function over a maximal clique, so we are allowed to multiply it by any non-negative function of subsets of the clique.)
Example of Image Denoising

Energy function
    E(x, y) = h ∑_i xi − β ∑_{i,j} xi xj − η ∑_i xi yi
(where the sum over {i, j} runs over pairs of neighbouring pixels), which defines a joint distribution over x and y:
    p(x, y) = (1/Z) exp{−E(x, y)}

[Figure: the denoising MRF over latent pixels xi and observed pixels yi]
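A minimal sketch of this energy for a binary image on a 4-neighbour grid (an assumed implementation, not the lecture's code; the parameter values are only examples):

    # Sketch (assumed implementation): E(x, y) = h*sum_i x_i - beta*sum_{i,j} x_i x_j - eta*sum_i x_i y_i
    # for binary images x, y in {-1, +1}^(H x W), with {i, j} ranging over 4-neighbour pairs.
    import numpy as np

    def energy(x, y, h=0.0, beta=1.0, eta=2.1):
        pairwise = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
        return h * np.sum(x) - beta * pairwise - eta * np.sum(x * y)

    # usage: E = energy(x, y) for arrays x, y with entries in {-1, +1}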
Example – Iterated Conditional Modes (ICM)

1 Fix the elements of y as we have observed them. (This implicitly defines p(x | y).)
2 Initialise xi = yi for i = 1, . . . , D.
3 Take one node xj and evaluate the total energy for both possible states xj ∈ {−1, +1}, keeping all other variables fixed. Set xj to the state having the lower energy.
4 Repeat for another node until a stopping criterion is satisfied.

[Figure: denoising result at a local minimum (ICM) vs. the global minimum (graph-cut)]
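A sketch of ICM for this model (an assumed implementation, not the lecture's code). Instead of re-evaluating the total energy, it compares the two states of one pixel through the terms of E(x, y) that involve that pixel, which yields the same decision; beta, eta and h are example values.

    # Sketch (assumed implementation): Iterated Conditional Modes for the denoising MRF.
    import numpy as np

    def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, max_sweeps=10):
        x = y.copy()                                   # step 2: initialise x_i = y_i
        H, W = x.shape
        for _ in range(max_sweeps):
            changed = False
            for i in range(H):
                for j in range(W):
                    # sum of the 4-neighbourhood of pixel (i, j)
                    nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                             if 0 <= a < H and 0 <= b < W)
                    # local energy of x_ij = s is  h*s - beta*s*nb - eta*s*y_ij;
                    # pick the sign s in {-1, +1} with the lower value
                    best = min((-1, +1),
                               key=lambda s: h * s - beta * s * nb - eta * s * y[i, j])
                    if best != x[i, j]:
                        x[i, j] = best
                        changed = True
            if not changed:                            # stopping criterion: no pixel changed
                break
        return x

    # usage: x_hat = icm_denoise(noisy_image), where noisy_image has entries in {-1, +1}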
Bayesian Networks vs. Markov Random Fields

Two frameworks for graphical models; can we convert between them?

Bayesian Network (a directed chain x1 → x2 → · · · → xN):
    p(x) = p(x1) p(x2 | x1) p(x3 | x2) . . . p(xN | xN−1)

Markov Random Field (an undirected chain x1 − x2 − · · · − xN):
    p(x) = (1/Z) ψ1,2(x1, x2) ψ2,3(x2, x3) . . . ψN−1,N(xN−1, xN)
Chains are the same

Bayesian Network:
    p(x) = p(x1) p(x2 | x1) p(x3 | x2) . . . p(xN | xN−1)
Markov Random Field:
    p(x) = (1/Z) ψ1,2(x1, x2) ψ2,3(x2, x3) . . . ψN−1,N(xN−1, xN)
Find corresponding terms:
    ψ1,2(x1, x2) = p(x1) p(x2 | x1)
    ψ2,3(x2, x3) = p(x3 | x2)
    …
    ψN−1,N(xN−1, xN) = p(xN | xN−1)
and note that Z = 1 in this case.
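A numeric sketch of this correspondence (an assumed example, not from the slides): a four-node binary chain BN with random CPTs is converted to potentials as above, and the product of potentials is checked to equal the BN joint with Z = 1.

    # Sketch (assumed example): chain BN -> chain MRF with Z = 1.
    import itertools
    import numpy as np

    rng = np.random.default_rng(2)
    p_x1 = rng.dirichlet(np.ones(2))                              # p(x1)
    cpts = [rng.dirichlet(np.ones(2), size=2) for _ in range(3)]  # p(x_{k+1} | x_k), rows indexed by x_k

    def bn_joint(x):
        p = p_x1[x[0]]
        for k, cpt in enumerate(cpts):
            p *= cpt[x[k], x[k + 1]]
        return p

    # psi_{1,2}(x1, x2) = p(x1) p(x2 | x1);  psi_{k,k+1}(x_k, x_{k+1}) = p(x_{k+1} | x_k)
    psis = [p_x1[:, None] * cpts[0]] + cpts[1:]

    def mrf_unnormalised(x):
        return np.prod([psi[x[k], x[k + 1]] for k, psi in enumerate(psis)])

    Z = sum(mrf_unnormalised(x) for x in itertools.product([0, 1], repeat=4))
    print(np.isclose(Z, 1.0))                                     # Z = 1 for a chain built this way
    assert all(np.isclose(bn_joint(x), mrf_unnormalised(x))
               for x in itertools.product([0, 1], repeat=4))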
Moralisation

For other kinds of Bayesian Networks (BNs), create the cliques of an MRF by adding undirected edges between all pairs of parents of each node in the graph.
This process of ‘marrying the parents’ is called moralisation, and the result is a moral graph.
BUT the resulting MRF may represent different conditional independence statements than the original BN.
Example: the resulting MRF is fully connected and exhibits NO conditional independence properties, in contrast to the original directed graph.

[Figure: a directed graph over x1, . . . , x4 (left) and its fully connected moral graph (right)]
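A sketch of moralisation with networkx (an assumed helper; recent networkx releases also provide nx.moral_graph). The example DAG below, in which x4 has parents x1, x2 and x3, is an assumed reconstruction of such a case: moralisation marries the three parents and the moral graph becomes fully connected.

    # Sketch (assumed helper): moralise a DAG — marry parents, then drop edge directions.
    import itertools
    import networkx as nx

    def moralise(dag):
        moral = nx.Graph(dag.to_undirected())
        for v in dag.nodes:
            for p, q in itertools.combinations(dag.predecessors(v), 2):
                moral.add_edge(p, q)
        return moral

    dag = nx.DiGraph([("x1", "x4"), ("x2", "x4"), ("x3", "x4")])
    print(sorted(moralise(dag).edges))    # all six undirected edges: the moral graph is fully connected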
Distribution vs Graph
Definition (D-map)
A graph is a D-map (dependency map) of a distribution if every
conditional independence statement satisfied by the
distribution is reflected in the graph.
Definition (I-map)
A graph is an I-map (independence map) of a distribution if
every conditional independence statement implied by the graph
is satisfied in the distribution.
Definition (P-map)
A graph is a P-map (perfect map) of a distribution if it is both a
D-map and an I-map for the distribution.
Bayesian Networks vs. Markov Random Fields

Consider probability distributions depending on N variables:
  P: the set of all probability distributions
  D: the set of all probability distributions that can be represented as a perfect map by a directed graph
  U: the set of all probability distributions that can be represented as a perfect map by an undirected graph

[Figure: Venn diagram with D and U as partially overlapping subsets of P]
Only as Bayesian Network

[Figure: a directed graph over the three variables A, B and C containing a head-to-head (collider) node]

A directed graph whose conditional independence properties cannot be expressed using an undirected graph over the same three variables.
Only as Markov Random Field

[Figure: an undirected graph over the four variables A, B, C and D arranged in a cycle]

An undirected graph whose conditional independence properties cannot be expressed using a directed graph over the same four variables.
BNs vs. MRFs: Similarities

In both types of graphical models:
  A relationship between the conditional independence statements satisfied by a distribution and the associated simplified algebraic structure of the distribution is made in terms of graphical objects.
  The conditional independence statements are related to notions of separation between variables in the graph.
  The simplified algebraic structure (the factorisation of p(x)) is related to ‘local pieces’ of the graph (a child together with its parents in BNs, cliques in MRFs).
BNs vs. MRFs: Differences

Differences:
  The set of probability distributions that can be represented as MRFs is different from the set that can be represented as BNs.
  Although both MRFs and BNs are expressed as a factorisation of local functions on the graph, the MRF has a normalisation constant Z = ∑_x ∏_{C ∈ C} ψC(xC) that couples all factors, whereas the BN does not.
  The local ‘pieces’ of the BN are probability distributions themselves, whereas in MRFs they need only be non-negative functions (i.e. they need not have range [0, 1] as probabilities do).