Statistical Machine Learning
© 2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Factor Graphs
The Sum-Product Algorithm
Similar Algorithms
Learning the Graph Structure
Part XIX
Probabilistic Graphical Models 3
Factor Graphs
Write p(x) in the form of a product of factors
    p(x) = ∏_s fs(xs)
where xs denotes a subset of variables.
Example
    p(x) = fa(x1, x2) fb(x1, x2) fc(x2, x3) fd(x3)
[Figure: factor graph with variable nodes x1, x2, x3 and factor nodes fa, fb, fc, fd]
A factor graph carries more information than the corresponding MRF, where fa(x1, x2) fb(x1, x2) would be merged into a single potential function.
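The factorisation p(x) = ∏_s fs(xs) is easy to sketch in code. A minimal illustration, assuming binary variables; the factor functions and their numeric values below are made up for the example:

```python
from itertools import product

# Hypothetical factors for p(x) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3);
# each factor sees only its own subset of variables.
factors = {
    "fa": (("x1", "x2"), lambda a: 1.0 + a["x1"] * a["x2"]),
    "fb": (("x1", "x2"), lambda a: 2.0 - a["x2"]),
    "fc": (("x2", "x3"), lambda a: 1.0 + a["x3"]),
    "fd": (("x3",),      lambda a: 0.5 + a["x3"]),
}

def p_unnormalised(assignment):
    """Product over all factors fs(xs), each evaluated on its subset xs."""
    value = 1.0
    for variables, f in factors.values():
        value *= f({v: assignment[v] for v in variables})
    return value

# Evaluate the joint on every assignment of (x1, x2, x3).
joint = {xs: p_unnormalised(dict(zip(("x1", "x2", "x3"), xs)))
         for xs in product((0, 1), repeat=3)}
```

For example, at (x1, x2, x3) = (1, 1, 1) the four factor values 2, 1, 2 and 1.5 multiply to 6.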
Factor Graphs – MRF example
Example of factor graphs representing the same distribution:
[Figure: three graphs over x1, x2, x3]
Undirected graph: single clique potential ψ(x1, x2, x3).
Factor graph with one factor node f: f(x1, x2, x3) = ψ(x1, x2, x3).
Factor graph with two factor nodes fa, fb whose factors satisfy fa(x1, x2, x3) fb(x2, x3) = ψ(x1, x2, x3).
Factor Graphs – BN example
Example of factor graphs representing the same distribution:
[Figure: three graphs over x1, x2, x3]
Directed graph: p(x1) p(x2) p(x3 | x1, x2).
Factor graph with one factor node f: f(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2).
Factor graph with three factor nodes whose factors satisfy fa(x1) = p(x1), fb(x2) = p(x2), fc(x1, x2, x3) = p(x3 | x1, x2).
Factor Graph – Bipartite Graph
Factor Graphs are bipartite graphs.
Definition (Bipartite Graph)
A bipartite graph (or bigraph) is a graph whose vertices can be
divided into two disjoint sets U and V such that every edge
connects a vertex in U to one in V.
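The bipartite structure is straightforward to verify programmatically. A small sketch (the helper function and edge list are illustrative, not from the slides):

```python
def is_bipartite_factor_graph(edges, variable_nodes, factor_nodes):
    """Every edge must join a variable node (set U) to a factor node (set V)."""
    return all(
        (u in variable_nodes and v in factor_nodes)
        or (u in factor_nodes and v in variable_nodes)
        for u, v in edges
    )

# The earlier example p(x) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3):
edges = [("fa", "x1"), ("fa", "x2"), ("fb", "x1"), ("fb", "x2"),
         ("fc", "x2"), ("fc", "x3"), ("fd", "x3")]
ok = is_bipartite_factor_graph(edges, {"x1", "x2", "x3"},
                               {"fa", "fb", "fc", "fd"})

# An edge between two variable nodes violates bipartiteness.
bad = is_bipartite_factor_graph([("x1", "x2")], {"x1", "x2"}, {"fa"})
```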
Markov Random Field → Factor Graph
1 Create variable nodes for each node in the original graph
2 Create factor nodes for each maximal clique xs
3 Set the factors fs(xs) to the clique potentials
Note: There may be several different factor graphs
corresponding to the same undirected graph.
Bayesian Network → Factor Graph
1 Create variable nodes for each node in the original graph.
2 Create factor nodes corresponding to the conditional
distributions.
3 Add appropriate links.
Note: There may be several different factor graphs
corresponding to the same directed graph.
The Sum-Product Algorithm – Motivation
    p(x2) = ∑_{x1,x3,x4} fa(x1, x2) fb(x2, x3) fc(x2, x4)
          = (∑_{x1} fa(x1, x2)) (∑_{x3} fb(x2, x3)) (∑_{x4} fc(x2, x4))
[Figure: factor graph with variable nodes x1, x2, x3, x4 and factor nodes fa, fb, fc]
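The saving from pushing the sums inside the product can be checked numerically. A minimal sketch with hypothetical binary factors (the functions and values are made up; the factorisation matches the example graph):

```python
from itertools import product

# Hypothetical binary factors for the example graph.
fa = lambda x1, x2: 1.0 + x1 + 2 * x2
fb = lambda x2, x3: 2.0 - x2 * x3
fc = lambda x2, x4: 1.0 + x4

def p2_brute(x2):
    """Naive marginal: sum the full product over x1, x3, x4 (2^3 terms)."""
    return sum(fa(x1, x2) * fb(x2, x3) * fc(x2, x4)
               for x1, x3, x4 in product((0, 1), repeat=3))

def p2_factored(x2):
    """Sums pushed inside the product: three small sums instead of one big one."""
    return (sum(fa(x1, x2) for x1 in (0, 1))
            * sum(fb(x2, x3) for x3 in (0, 1))
            * sum(fc(x2, x4) for x4 in (0, 1)))
```

Both routes give the same value; the factored form does three sums of two terms each instead of one sum of eight products.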
The Sum-Product Algorithm in a Nutshell
Kschischang & Frey, 2001.
The message sent from a node v on an edge e is the product of
the local function at v (or the unit function if v is a variable
node) with all messages received at v on edges other than e,
summarized for the variable associated with e.
    µxm→fs(xm) = ∏_{l ∈ ne(xm)\fs} µfl→xm(xm)
    µfs→x(x) = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} µxm→fs(xm)
The marginals are then
    p(xm) = ∏_{l ∈ ne(xm)} µfl→xm(xm)
The Sum-Product Algorithm – Example
Nodes send the following on each adjacent edge after receiving on all other edges (leaves → arbitrary root → leaves):
factor nodes: marginalised own factor times product of incoming messages
variable nodes: product of incoming messages
    µx1→fa(x1) = 1                       µx4→fc(x4) = 1
    µfa→x2(x2) = ∑_{x1} fa(x1, x2)       µfc→x2(x2) = ∑_{x4} fc(x2, x4)
    µx2→fb(x2) = µfa→x2(x2) µfc→x2(x2)
    µfb→x3(x3) = ∑_{x2} fb(x2, x3) µx2→fb(x2)
[Figure: factor graph with variable nodes x1, x2, x3, x4 and factor nodes fa, fb, fc]
The Sum-Product Algorithm – Example
Propagating back from root
    µx3→fb(x3) = 1        µfb→x2(x2) = ∑_{x3} fb(x2, x3)
Marginals are products of messages at a variable node:
    p(x2) = µfa→x2(x2) µfb→x2(x2) µfc→x2(x2)
          = (∑_{x1} fa(x1, x2)) (∑_{x3} fb(x2, x3)) (∑_{x4} fc(x2, x4))
[Figure: factor graph with variable nodes x1, x2, x3, x4 and factor nodes fa, fb, fc]
This works for any tree-structured factor graph.
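The message schedule on this example can be sketched numerically: a minimal check that the leaves-to-root pass (with x3 as root) reproduces the brute-force marginal p(x3). The factor values are hypothetical:

```python
from itertools import product

# Hypothetical factors for the four-variable tree.
fa = lambda x1, x2: 1.0 + x1 + 2 * x2
fb = lambda x2, x3: 2.0 - x2 * x3
fc = lambda x2, x4: 1.0 + x4

# Leaves-to-root pass with x3 as root (leaf variable nodes send 1).
mu_fa_to_x2 = lambda x2: sum(1 * fa(x1, x2) for x1 in (0, 1))
mu_fc_to_x2 = lambda x2: sum(1 * fc(x2, x4) for x4 in (0, 1))
mu_x2_to_fb = lambda x2: mu_fa_to_x2(x2) * mu_fc_to_x2(x2)
mu_fb_to_x3 = lambda x3: sum(fb(x2, x3) * mu_x2_to_fb(x2) for x2 in (0, 1))

def p3_brute(x3):
    """Direct marginalisation of the joint, for comparison."""
    return sum(fa(x1, x2) * fb(x2, x3) * fc(x2, x4)
               for x1, x2, x4 in product((0, 1), repeat=3))
```

The message arriving at the root equals the (un-normalised) marginal, since the root receives exactly one message.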
The Sum-Product Algorithm – Overview
Find some marginal p(x) = ∑_{x\x} p(x), summing the joint over all variables except x.
Idea: interchange summations and products:
Push sums as far into products as possible
View partial sums as messages
Partial sums are coupled by factors
Deal with coupling by propagating messages
Messages are functions of some variable
Exploit the tree structure of the factor graph to divide and
conquer
The tree structure is crucial.
Generally incorrect for loopy graphs.
The approach is also known as:
Belief propagation
Message passing
Sum-product algorithm
The Sum-Product Algorithm – In Detail
Partition the factors in the joint distribution into groups, with one group associated with each of the factor nodes that is a neighbour of the variable node x.
    p(x) = ∏_{s ∈ ne(x)} Fs(x, Xs)
ne(x) is the set of factor nodes which are neighbours of x
Xs is the set of all variables in the sub-tree connected to the
variable node x via the factor node fs
Fs(x,Xs) is the product of all the factors in the group
associated with factor fs.
[Figure: variable node x receiving message µfs→x(x) from factor node fs, which roots the subtree with product Fs(x, Xs)]
The Sum-Product Algorithm – In Detail
Goal: marginal distribution p(x) = ∑_{x\x} p(x)
via joint distribution
    p(x) = ∏_{s ∈ ne(x)} Fs(x, Xs)
resulting in
    p(x) = ∑_{x\x} ∏_{s ∈ ne(x)} Fs(x, Xs) = ∏_{s ∈ ne(x)} ∑_{Xs} Fs(x, Xs)
with µfs→x(x) := ∑_{Xs} Fs(x, Xs).
Example for Exchanging Sum and Product 1/2
Goal: marginal distribution p(x2) = ∑_{x\x2} p(x)
via joint distribution
    p(x) = ∏_{s ∈ ne(x)} Fs(x, Xs)
e.g.
    p(x1, x2, x3, x4) = ∏_{s ∈ ne(x2)} Fs(x2, Xs) = fa(x1, x2) fb(x2, x3) fc(x2, x4)
Example for Exchanging Sum and Product 2/2
Goal: Marginal distribution p(x)
    p(x) = ∑_{x\x} ∏_{s ∈ ne(x)} Fs(x, Xs) = ∏_{s ∈ ne(x)} ∑_{Xs} Fs(x, Xs)
e.g.
    p(x2) = ∑_{x1,x3,x4} fa(x1, x2) fb(x2, x3) fc(x2, x4)
          = (1 · ∑_{x1} fa(x1, x2)) (1 · ∑_{x3} fb(x2, x3)) (1 · ∑_{x4} fc(x2, x4))
where the three bracketed terms are µfa→x2(x2), µfb→x2(x2) and µfc→x2(x2) respectively.
The Sum-Product Algorithm – In Detail (recap)
    p(x) = ∑_{x\x} ∏_{s ∈ ne(x)} Fs(x, Xs) = ∏_{s ∈ ne(x)} ∑_{Xs} Fs(x, Xs)
         = ∏_{s ∈ ne(x)} µfs→x(x)
Messages are functions of the variable they are passed to:
    µfs→x(x) = ∑_{Xs} Fs(x, Xs)
How to evaluate the messages µfs→x(x)?
Each factor Fs(x, Xs) is described by a subgraph and can therefore itself be factorised as
    Fs(x, Xs) = fs(x, x1, . . . , xM) G1(x1, Xs1) · · · GM(xM, XsM)
[Figure: factor node fs connected to variable x and to variables x1, . . . , xM, each xm rooting a subtree Gm(xm, Xsm)]
{x, x1, . . . , xM} is the set of variables on which fs depends
Xsm are the variables in the subtree connected to fs via xm
beware the variable notation (here and in the textbook)
How to evaluate the messages µfs→x(x)?
Push the sums over Xsm into the product:
    µfs→x(x) = ∑_{Xs} Fs(x, Xs)
             = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} [∑_{Xsm} Gm(xm, Xsm)]
             = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} µxm→fs(xm)
Two kinds of messages
factor nodes → variable nodes:
    µfs→x(x) = ∑_{Xs} Fs(x, Xs)
variable nodes → factor nodes:
    µxm→fs(xm) = ∑_{Xsm} Gm(xm, Xsm)
factor nodes → variable nodes
We already have a formula to calculate messages from factor nodes to variable nodes:
    µfs→x(x) = ∑_{x1} · · · ∑_{xM} fs(x, x1, . . . , xM) ∏_{m ∈ ne(fs)\x} µxm→fs(xm)
Take the product of all incoming messages, multiply with
the factor associated with the node and marginalise over
all variables associated with the incoming messages.
variable nodes → factor nodes (1/2)
Gm(xm, Xsm) is a product of terms Fl(xm, Xml):
    Gm(xm, Xsm) = ∏_{l ∈ ne(xm)\fs} Fl(xm, Xml)
where the product is taken over all neighbours of node xm
except for node fs.
[Figure: variable node xm connected to factor nodes fl, . . . , fL and fs; each fl roots a subtree Fl(xm, Xml)]
variable nodes → factor nodes (2/2)
    µxm→fs(xm) = ∑_{Xsm} Gm(xm, Xsm)
               = ∏_{l ∈ ne(xm)\fs} [∑_{Xml} Fl(xm, Xml)]
               = ∏_{l ∈ ne(xm)\fs} µfl→xm(xm)
Evaluate the message sent by a variable to an adjacent
factor by taking the product of incoming messages.
How to start? How to normalise?
Consider node x as the root node of the factor graph.
Start at the leaf nodes.
If the leaf node is a variable node x, send µx→f(x) = 1.
If the leaf node is a factor node f, send µf→x(x) = f(x).
These initialisation rules follow from the general formulae.
Normalisation (if the factors are un-normalised):
Calculate p̃(x) by message passing as before.
Then Z = ∑_x p̃(x) and p = p̃ / Z.
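The normalisation step can be sketched as follows. A minimal example with a single hypothetical un-normalised factor, where the "message passing" collapses to one sum:

```python
# A single hypothetical un-normalised pairwise factor over binary x1, x2.
f = lambda x1, x2: 1.0 + x1 + x2

def p_tilde(x1):
    """Un-normalised marginal of x1, obtained here by one trivial message sum."""
    return sum(f(x1, x2) for x2 in (0, 1))

Z = sum(p_tilde(x1) for x1 in (0, 1))          # partition function
p = {x1: p_tilde(x1) / Z for x1 in (0, 1)}     # normalised marginal
```

One pass gives p̃ for every value of x1; a single extra sum over x1 then yields Z.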
Marginals for ALL variable nodes in the graph
Brute force: run the above algorithm for each node.
More efficient
1 Arbitrarily choose one root in the graph.
2 Propagate all messages from leaves to root.
3 Propagate all messages from root to leaves.
4 Local messages give marginals.
Only doubles the number of computations.
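The two-pass schedule can be sketched on a small chain: one forward pass (leaves to root), one backward pass (root to leaves), and every marginal is the product of the incoming messages at that node. The pairwise factors below are hypothetical:

```python
from itertools import product

# Chain x1 -- f12 -- x2 -- f23 -- x3 with hypothetical pairwise factors.
f12 = lambda a, b: 1.0 + a * b
f23 = lambda a, b: 2.0 - a + b
vals = (0, 1)

# Pass 1 (leaves -> root x3): forward factor-to-variable messages.
alpha2 = {b: sum(f12(a, b) for a in vals) for b in vals}              # mu_f12->x2
alpha3 = {c: sum(f23(b, c) * alpha2[b] for b in vals) for c in vals}  # mu_f23->x3

# Pass 2 (root -> leaves): backward messages.
beta2 = {b: sum(f23(b, c) for c in vals) for b in vals}               # mu_f23->x2
beta1 = {a: sum(f12(a, b) * beta2[b] for b in vals) for a in vals}    # mu_f12->x1

# Marginals at every node: product of all incoming messages.
p1 = beta1
p2 = {b: alpha2[b] * beta2[b] for b in vals}
p3 = alpha3

def brute(i, v):
    """Direct marginal of variable i at value v, for comparison."""
    return sum(f12(a, b) * f23(b, c)
               for a, b, c in product(vals, repeat=3) if (a, b, c)[i] == v)
```

Two passes produce all three marginals, rather than rerunning the algorithm once per node.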
Generalising Sum-Product: Distributive Property
For the sum product algorithm, we used the property that
multiplication is distributive over addition:
ab + ac = a(b + c)
Abstract theory
A set with two associative operations ’+’ and ’×’ satisfying
the distributive property is called a semi-ring.
Generalized Distributive Law
In principle, one can replace ’sum’ and ’product’ in the
previous algorithm with other operations.
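A sketch of this generalisation: the same message computation, parameterised by the two semiring operations. The helper function and factor values are illustrative, not from the slides:

```python
from functools import reduce

def message(f, incoming, add, mul, vals=(0, 1)):
    """Generic factor-to-variable message: 'add' over x1 of f(x1, x2)
    'mul' the incoming message, for any semiring (add, mul)."""
    return {x2: reduce(add, (mul(f(x1, x2), incoming[x1]) for x1 in vals))
            for x2 in vals}

f = lambda x1, x2: 1.0 + 2 * x1 + x2   # hypothetical binary factor
unit = {0: 1.0, 1: 1.0}                # unit message from a leaf variable

# (sum, product): ordinary sum-product message (marginalisation).
sum_msg = message(f, unit, lambda a, b: a + b, lambda a, b: a * b)
# (max, product): max-product message (maximisation).
max_msg = message(f, unit, max, lambda a, b: a * b)
```

Swapping only the `add` argument turns marginalisation into maximisation; the propagation schedule is unchanged.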
Generalising Sum-Product: Max-Product
    ab + ac = a(b + c)  →  max(ab, ac) = a max(b, c)
Replace '+' by 'max' (the '×' stays): the Max-Product algorithm computes argmax_x p(x).
For Hidden Markov Models, this is the Viterbi algorithm.
Similar Algorithms
Junction Tree Algorithm: exact inference in general undirected graphs. (For directed graphs, first convert to the moral graph.) Computational cost is determined by the number of variables in the largest clique of the graph, and grows exponentially with this number for discrete variables.
Loopy Belief Propagation: apply sum-product to graphs which are not tree-structured. The graph has cycles, so information flows several times through the graph. Initialise by assuming a unit message has been sent over each link in each direction. Convergence is no longer guaranteed in general.
Learning the Graph Structure
Define a space of possible graph structures.
Define a measure to score each of the graph structures.
Bayesian viewpoint: Compute the posterior distribution
over graph structures.
If we have a prior p(m) over graphs indexed by m, the
posterior is given via Bayes’ theorem as
p(m | D) ∝ p(m) p(D |m)
where D is the observed data set, and the model evidence
p(D |m) provides the score for each model.
Challenge 1: Evaluation of the model evidence involves
marginalisation over the latent variables and is
computationally very demanding.
Challenge 2: The number of different graph structures
grows exponentially with the number of nodes. Need to
resort to heuristics to find good candidates.