Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Factor Graphs
The Sum-Product Algorithm
Similar Algorithms
Learning the Graph Structure
Part XIX
Probabilistic Graphical Models 3
Factor Graphs
Write p(x) in the form of a product of factors
    p(x) = ∏_s fs(xs)
where xs denotes a subset of variables.
Example
    p(x) = fa(x1, x2) fb(x1, x2) fc(x2, x3) fd(x3)
[Figure: factor graph with variable nodes x1, x2, x3 and factor nodes fa, fb, fc, fd]
A factor graph carries more information than the corresponding MRF, where fa(x1, x2) fb(x1, x2) would be merged into a single potential function.
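The factorisation p(x) = ∏_s fs(xs) is easy to sketch in code. A minimal illustration, assuming binary variables; the factor functions and their numeric values below are made up for the example:

```python
from itertools import product

# Hypothetical factors for p(x) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3);
# each factor sees only its own subset of variables.
factors = {
    "fa": (("x1", "x2"), lambda a: 1.0 + a["x1"] * a["x2"]),
    "fb": (("x1", "x2"), lambda a: 2.0 - a["x2"]),
    "fc": (("x2", "x3"), lambda a: 1.0 + a["x3"]),
    "fd": (("x3",),      lambda a: 0.5 + a["x3"]),
}

def p_unnormalised(assignment):
    """Product over all factors fs(xs), each evaluated on its subset xs."""
    value = 1.0
    for variables, f in factors.values():
        value *= f({v: assignment[v] for v in variables})
    return value

# Evaluate the joint on every assignment of (x1, x2, x3).
joint = {xs: p_unnormalised(dict(zip(("x1", "x2", "x3"), xs)))
         for xs in product((0, 1), repeat=3)}
```

For example, at (x1, x2, x3) = (1, 1, 1) the four factor values 2, 1, 2 and 1.5 multiply to 6.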
Factor Graphs – MRF example
Example of factor graphs representing the same distribution:
[Figure: three graphs over x1, x2, x3]
Undirected graph: single clique potential ψ(x1, x2, x3).
Factor graph with one factor node f: f(x1, x2, x3) = ψ(x1, x2, x3).
Factor graph with two factor nodes fa, fb whose factors satisfy fa(x1, x2, x3) fb(x2, x3) = ψ(x1, x2, x3).
Factor Graphs – BN example
Example of factor graphs representing the same distribution:
[Figure: three graphs over x1, x2, x3]
Directed graph: p(x1) p(x2) p(x3 | x1, x2).
Factor graph with one factor node f: f(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2).
Factor graph with three factor nodes whose factors satisfy fa(x1) = p(x1), fb(x2) = p(x2), fc(x1, x2, x3) = p(x3 | x1, x2).
Factor Graph – Bipartite Graph
Factor Graphs are bipartite graphs.
Definition (Bipartite Graph)
A bipartite graph (or bigraph) is a graph whose vertices can be
divided into two disjoint sets U and V such that every edge
connects a vertex in U to one in V.
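The bipartite structure is straightforward to verify programmatically. A small sketch (the helper function and edge list are illustrative, not from the slides):

```python
def is_bipartite_factor_graph(edges, variable_nodes, factor_nodes):
    """Every edge must join a variable node (set U) to a factor node (set V)."""
    return all(
        (u in variable_nodes and v in factor_nodes)
        or (u in factor_nodes and v in variable_nodes)
        for u, v in edges
    )

# The earlier example p(x) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3):
edges = [("fa", "x1"), ("fa", "x2"), ("fb", "x1"), ("fb", "x2"),
         ("fc", "x2"), ("fc", "x3"), ("fd", "x3")]
ok = is_bipartite_factor_graph(edges, {"x1", "x2", "x3"},
                               {"fa", "fb", "fc", "fd"})

# An edge between two variable nodes violates bipartiteness.
bad = is_bipartite_factor_graph([("x1", "x2")], {"x1", "x2"}, {"fa"})
```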
Markov Random Field → Factor Graph
1 Create variable nodes for each node in the original graph
2 Create factor nodes for each maximal clique xs
3 Set the factors fs(xs) to the clique potentials
Note: There may be several different factor graphs
corresponding to the same undirected graph.
Bayesian Network → Factor Graph
1 Create variable nodes for each node in the original graph.
2 Create factor nodes corresponding to the conditional
distributions.
3 Add appropriate links.
Note: There may be several different factor graphs
corresponding to the same directed graph.
The Sum-Product Algorithm – Motivation
    p(x2) = ∑_{x1,x3,x4} fa(x1, x2) fb(x2, x3) fc(x2, x4)
          = (∑_{x1} fa(x1, x2)) (∑_{x3} fb(x2, x3)) (∑_{x4} fc(x2, x4))
[Figure: factor graph with variable nodes x1, x2, x3, x4 and factor nodes fa, fb, fc]
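The saving from pushing the sums inside the product can be checked numerically. A minimal sketch with hypothetical binary factors (the functions and values are made up; the factorisation matches the example graph):

```python
from itertools import product

# Hypothetical binary factors for the example graph.
fa = lambda x1, x2: 1.0 + x1 + 2 * x2
fb = lambda x2, x3: 2.0 - x2 * x3
fc = lambda x2, x4: 1.0 + x4

def p2_brute(x2):
    """Naive marginal: sum the full product over x1, x3, x4 (2^3 terms)."""
    return sum(fa(x1, x2) * fb(x2, x3) * fc(x2, x4)
               for x1, x3, x4 in product((0, 1), repeat=3))

def p2_factored(x2):
    """Sums pushed inside the product: three small sums instead of one big one."""
    return (sum(fa(x1, x2) for x1 in (0, 1))
            * sum(fb(x2, x3) for x3 in (0, 1))
            * sum(fc(x2, x4) for x4 in (0, 1)))
```

Both routes give the same value; the factored form does three sums of two terms each instead of one sum of eight products.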
The Sum-Product Algorithm in a Nutshell
Kschischang & Frey, 2001.
The message sent from a node v on an edge e is the product of
the local function at v (or the unit function if v is a variable
node) with all messages received at v on edges other than e,
summarized for the variable associated with e.
    µxm→fs(xm) = ∏_{l ∈ ne(xm)\fs} µfl→xm(xm)
    µfs→x(x) = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} µxm→fs(xm)
The marginals are then
    p(xm) = ∏_{l ∈ ne(xm)} µfl→xm(xm)
The Sum-Product Algorithm – Example
Nodes send the following on each adjacent edge after receiving on all other edges (leaves → arbitrary root → leaves):
factor nodes: marginalised own factor times product of incoming messages
variable nodes: product of incoming messages
    µx1→fa(x1) = 1                       µx4→fc(x4) = 1
    µfa→x2(x2) = ∑_{x1} fa(x1, x2)       µfc→x2(x2) = ∑_{x4} fc(x2, x4)
    µx2→fb(x2) = µfa→x2(x2) µfc→x2(x2)
    µfb→x3(x3) = ∑_{x2} fb(x2, x3) µx2→fb(x2)
[Figure: factor graph with variable nodes x1, x2, x3, x4 and factor nodes fa, fb, fc]
The Sum-Product Algorithm – Example
Propagating back from root
    µx3→fb(x3) = 1        µfb→x2(x2) = ∑_{x3} fb(x2, x3)
Marginals are products of messages at a variable node:
    p(x2) = µfa→x2(x2) µfb→x2(x2) µfc→x2(x2)
          = (∑_{x1} fa(x1, x2)) (∑_{x3} fb(x2, x3)) (∑_{x4} fc(x2, x4))
[Figure: factor graph with variable nodes x1, x2, x3, x4 and factor nodes fa, fb, fc]
This works for any tree-structured factor graph.
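The message schedule on this example can be sketched numerically: a minimal check that the leaves-to-root pass (with x3 as root) reproduces the brute-force marginal p(x3). The factor values are hypothetical:

```python
from itertools import product

# Hypothetical factors for the four-variable tree.
fa = lambda x1, x2: 1.0 + x1 + 2 * x2
fb = lambda x2, x3: 2.0 - x2 * x3
fc = lambda x2, x4: 1.0 + x4

# Leaves-to-root pass with x3 as root (leaf variable nodes send 1).
mu_fa_to_x2 = lambda x2: sum(1 * fa(x1, x2) for x1 in (0, 1))
mu_fc_to_x2 = lambda x2: sum(1 * fc(x2, x4) for x4 in (0, 1))
mu_x2_to_fb = lambda x2: mu_fa_to_x2(x2) * mu_fc_to_x2(x2)
mu_fb_to_x3 = lambda x3: sum(fb(x2, x3) * mu_x2_to_fb(x2) for x2 in (0, 1))

def p3_brute(x3):
    """Direct marginalisation of the joint, for comparison."""
    return sum(fa(x1, x2) * fb(x2, x3) * fc(x2, x4)
               for x1, x2, x4 in product((0, 1), repeat=3))
```

The message arriving at the root equals the (un-normalised) marginal, since the root receives exactly one message.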
The Sum-Product Algorithm – Overview
Find some marginal p(x) = ∑_{x\x} p(x), summing the joint over all variables except x.
Idea: interchange summations and products:
Push sums as far into products as possible
View partial sums as messages
Partial sums are coupled by factors
Deal with coupling by propagating messages
Messages are functions of some variable
Exploit the tree structure of the factor graph to divide and
conquer
The tree structure is crucial.
Generally incorrect for loopy graphs.
The approach is also known as:
Belief propagation
Message passing
Sum-product algorithm
The Sum-Product Algorithm – In Detail
Partition the factors in the joint distribution into groups, with one group associated with each of the factor nodes that is a neighbour of the variable node x.
    p(x) = ∏_{s ∈ ne(x)} Fs(x, Xs)
ne(x) is the set of factor nodes which are neighbours of x
Xs is the set of all variables in the sub-tree connected to the
variable node x via the factor node fs
Fs(x,Xs) is the product of all the factors in the group
associated with factor fs.
[Figure: variable node x receiving message µfs→x(x) from factor node fs, which roots the subtree with product Fs(x, Xs)]
The Sum-Product Algorithm – In Detail
Goal: marginal distribution p(x) = ∑_{x\x} p(x)
via joint distribution
    p(x) = ∏_{s ∈ ne(x)} Fs(x, Xs)
resulting in
    p(x) = ∑_{x\x} ∏_{s ∈ ne(x)} Fs(x, Xs) = ∏_{s ∈ ne(x)} ∑_{Xs} Fs(x, Xs)
with µfs→x(x) := ∑_{Xs} Fs(x, Xs).
Example for Exchanging Sum and Product 1/2
Goal: marginal distribution p(x2) = ∑_{x\x2} p(x)
via joint distribution
    p(x) = ∏_{s ∈ ne(x)} Fs(x, Xs)
e.g.
    p(x1, x2, x3, x4) = ∏_{s ∈ ne(x2)} Fs(x2, Xs) = fa(x1, x2) fb(x2, x3) fc(x2, x4)
Example for Exchanging Sum and Product 2/2
Goal: Marginal distribution p(x)
    p(x) = ∑_{x\x} ∏_{s ∈ ne(x)} Fs(x, Xs) = ∏_{s ∈ ne(x)} ∑_{Xs} Fs(x, Xs)
e.g.
    p(x2) = ∑_{x1,x3,x4} fa(x1, x2) fb(x2, x3) fc(x2, x4)
          = (1 · ∑_{x1} fa(x1, x2)) (1 · ∑_{x3} fb(x2, x3)) (1 · ∑_{x4} fc(x2, x4))
where the three bracketed terms are µfa→x2(x2), µfb→x2(x2) and µfc→x2(x2) respectively.
The Sum-Product Algorithm – In Detail (recap)
    p(x) = ∑_{x\x} ∏_{s ∈ ne(x)} Fs(x, Xs) = ∏_{s ∈ ne(x)} ∑_{Xs} Fs(x, Xs)
         = ∏_{s ∈ ne(x)} µfs→x(x)
Messages are functions of the variable they are passed to:
    µfs→x(x) = ∑_{Xs} Fs(x, Xs)
How to evaluate the messages µfs→x(x)?
Each factor Fs(x, Xs) is described by a subgraph and can therefore itself be factorised as
    Fs(x, Xs) = fs(x, x1, . . . , xM) G1(x1, Xs1) · · · GM(xM, XsM)
[Figure: factor node fs connected to variable x and to variables x1, . . . , xM, each xm rooting a subtree Gm(xm, Xsm)]
{x, x1, . . . , xM} is the set of variables on which fs depends
Xsm are the variables in the subtree connected to fs via xm
beware the variable notation (here and in the textbook)
How to evaluate the messages µfs→x(x)?
Push the sums over Xsm into the product:
    µfs→x(x) = ∑_{Xs} Fs(x, Xs)
             = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} [∑_{Xsm} Gm(xm, Xsm)]
             = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} µxm→fs(xm)
Two kinds of messages
factor nodes → variable nodes:
    µfs→x(x) = ∑_{Xs} Fs(x, Xs)
variable nodes → factor nodes:
    µxm→fs(xm) = ∑_{Xsm} Gm(xm, Xsm)
factor nodes → variable nodes
We already have a formula to calculate messages from factor nodes to variable nodes:
    µfs→x(x) = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} µxm→fs(xm)
Take the product of all incoming messages, multiply with
the factor associated with the node and marginalise over
all variables associated with the incoming messages.
variable nodes → factor nodes (1/2)
Gm(xm, Xsm) is a product of terms Fl(xm, Xml):
    Gm(xm, Xsm) = ∏_{l ∈ ne(xm)\fs} Fl(xm, Xml)
where the product is taken over all neighbours of node xm
except for node fs.
[Figure: variable node xm connected to factor nodes fl, . . . , fL and fs; each fl roots a subtree Fl(xm, Xml)]
variable nodes → factor nodes (2/2)
    µxm→fs(xm) = ∑_{Xsm} Gm(xm, Xsm)
               = ∏_{l ∈ ne(xm)\fs} [∑_{Xml} Fl(xm, Xml)]
               = ∏_{l ∈ ne(xm)\fs} µfl→xm(xm)
Evaluate the message sent by a variable to an adjacent
factor by taking the product of incoming messages.
How to start? How to normalise?
Consider node x as the root node of the factor graph.
Start at the leaf nodes.
If the leaf node is a variable node x, send µx→f(x) = 1.
If the leaf node is a factor node f, send µf→x(x) = f(x).
These initialisation rules follow from the general formulae.
Normalisation (if the factors are un-normalised):
Calculate p̃(x) by message passing as before.
Then Z = ∑_x p̃(x) and p = p̃ / Z.
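The normalisation step can be sketched as follows. A minimal example with a single hypothetical un-normalised factor, where the "message passing" collapses to one sum:

```python
# A single hypothetical un-normalised pairwise factor over binary x1, x2.
f = lambda x1, x2: 1.0 + x1 + x2

def p_tilde(x1):
    """Un-normalised marginal of x1, obtained here by one trivial message sum."""
    return sum(f(x1, x2) for x2 in (0, 1))

Z = sum(p_tilde(x1) for x1 in (0, 1))          # partition function
p = {x1: p_tilde(x1) / Z for x1 in (0, 1)}     # normalised marginal
```

One pass gives p̃ for every value of x1; a single extra sum over x1 then yields Z.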
Marginals for ALL variable nodes in the graph
Brute force: run the above algorithm for each node.
More efficient
1 Arbitrarily choose one root in the graph.
2 Propagate all messages from leaves to root.
3 Propagate all messages from root to leaves.
4 Local messages give marginals.
Only doubles the number of computations.
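The two-pass schedule can be sketched on a small chain: one forward pass (leaves to root), one backward pass (root to leaves), and every marginal is the product of the incoming messages at that node. The pairwise factors below are hypothetical:

```python
from itertools import product

# Chain x1 -- f12 -- x2 -- f23 -- x3 with hypothetical pairwise factors.
f12 = lambda a, b: 1.0 + a * b
f23 = lambda a, b: 2.0 - a + b
vals = (0, 1)

# Pass 1 (leaves -> root x3): forward factor-to-variable messages.
alpha2 = {b: sum(f12(a, b) for a in vals) for b in vals}              # mu_f12->x2
alpha3 = {c: sum(f23(b, c) * alpha2[b] for b in vals) for c in vals}  # mu_f23->x3

# Pass 2 (root -> leaves): backward messages.
beta2 = {b: sum(f23(b, c) for c in vals) for b in vals}               # mu_f23->x2
beta1 = {a: sum(f12(a, b) * beta2[b] for b in vals) for a in vals}    # mu_f12->x1

# Marginals at every node: product of all incoming messages.
p1 = beta1
p2 = {b: alpha2[b] * beta2[b] for b in vals}
p3 = alpha3

def brute(i, v):
    """Direct marginal of variable i at value v, for comparison."""
    return sum(f12(a, b) * f23(b, c)
               for a, b, c in product(vals, repeat=3) if (a, b, c)[i] == v)
```

Two passes produce all three marginals, rather than rerunning the algorithm once per node.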
Generalising Sum-Product: Distributive Property
For the sum product algorithm, we used the property that
multiplication is distributive over addition:
ab + ac = a(b + c)
Abstract theory
A set with two associative operations ’+’ and ’×’ satisfying
the distributive property is called a semi-ring.
Generalized Distributive Law
In principle, one can replace ’sum’ and ’product’ in the
previous algorithm with other operations.
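A sketch of this generalisation: the same message computation, parameterised by the two semiring operations. The helper function and factor values are illustrative, not from the slides:

```python
from functools import reduce

def message(f, incoming, add, mul, vals=(0, 1)):
    """Generic factor-to-variable message: 'add' over x1 of f(x1, x2)
    'mul' the incoming message, for any semiring (add, mul)."""
    return {x2: reduce(add, (mul(f(x1, x2), incoming[x1]) for x1 in vals))
            for x2 in vals}

f = lambda x1, x2: 1.0 + 2 * x1 + x2   # hypothetical binary factor
unit = {0: 1.0, 1: 1.0}                # unit message from a leaf variable

# (sum, product): ordinary sum-product message (marginalisation).
sum_msg = message(f, unit, lambda a, b: a + b, lambda a, b: a * b)
# (max, product): max-product message (maximisation).
max_msg = message(f, unit, max, lambda a, b: a * b)
```

Swapping only the `add` argument turns marginalisation into maximisation; the propagation schedule is unchanged.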
Generalising Sum-Product: Max-Product
    ab + ac = a(b + c)  →  max(ab, ac) = a max(b, c)
Replace '+' by 'max' (the '×' stays): the Max-Product algorithm computes argmax_x p(x).
For Hidden Markov Models, this is the Viterbi algorithm.
Similar Algorithms
Junction Tree Algorithm: exact inference in general undirected graphs. (For directed graphs, first convert to the moral graph.) Computational cost is determined by the number of variables in the largest clique of the graph, and grows exponentially with this number for discrete variables.
Loopy Belief Propagation: apply sum-product to graphs which are not tree-structured. The graph has cycles, so information flows several times through the graph. Initialise by assuming a unit message has been sent over each link in each direction. Convergence is no longer guaranteed in general.
Learning the Graph Structure
Define a space of possible graph structures.
Define a measure to score each of the graph structures.
Bayesian viewpoint: Compute the posterior distribution
over graph structures.
If we have a prior p(m) over graphs indexed by m, the
posterior is given via Bayes’ theorem as
p(m | D) ∝ p(m) p(D |m)
where D is the observed data set, and the model evidence
p(D |m) provides the score for each model.
Challenge 1: Evaluation of the model evidence involves
marginalisation over the latent variables and is
computationally very demanding.
Challenge 2: The number of different graph structures
grows exponentially with the number of nodes. Need to
resort to heuristics to find good candidates.