
Statistical Machine Learning

©2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University

Outlines

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Finding p(a, b | f ) – ‘Analytical’ method

1 [Figure: the Bayesian network over a, b, c, e, f with edges a → e, f → e, f → b, e → c; the node f is observed.]

2 Write down the conditional probability with the given observable(s):

p(a, b, c, e | f)

3 Rewrite it in terms of the joint distribution over all variables

p(a, b, c, e | f) = p(a, b, c, e, f) / p(f)

4 Factorise the joint probability according to the graph

p(a, b, c, e | f) = p(a, b, c, e, f) / p(f) = p(a) p(f) p(e | a, f) p(b | f) p(c | e) / p(f)

5 Marginalise over all variables we don’t care about

p(a, b | f) = Σ_{c,e} p(a) p(f) p(e | a, f) p(b | f) p(c | e) / p(f) = p(a) p(b | f),

since Σ_e p(e | a, f) = 1, Σ_c p(c | e) = 1, and p(f) cancels.
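A quick numerical check of the five steps, assuming hypothetical CPTs for the binary variables (the CPT values below are made up; only the factorisation p(a) p(f) p(e | a, f) p(b | f) p(c | e) comes from the graph):

```python
import itertools

# Hypothetical CPTs for binary a, b, c, e, f (the numbers are illustrative only).
p_a = {0: 0.3, 1: 0.7}
p_f = {0: 0.6, 1: 0.4}
p_e_af = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.4, 1: 0.6},
          (1, 0): {0: 0.7, 1: 0.3}, (1, 1): {0: 0.2, 1: 0.8}}
p_b_f = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}
p_c_e = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

def joint(a, b, c, e, f):
    """p(a, b, c, e, f) factorised according to the graph (step 4)."""
    return p_a[a] * p_f[f] * p_e_af[(a, f)][e] * p_b_f[f][b] * p_c_e[e][c]

f_obs = 1
p_f_marg = sum(joint(a, b, c, e, f_obs)                      # p(f = f_obs)
               for a, b, c, e in itertools.product([0, 1], repeat=4))

for a, b in itertools.product([0, 1], repeat=2):
    # Step 5: marginalise over c and e, then divide by p(f).
    lhs = sum(joint(a, b, c, e, f_obs)
              for c, e in itertools.product([0, 1], repeat=2)) / p_f_marg
    rhs = p_a[a] * p_b_f[f_obs][b]                           # claimed result p(a) p(b | f)
    print(a, b, round(lhs, 10), round(rhs, 10))              # the two columns agree
```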

Finding p(a, b | f ) – ‘Graphical’ method

1 [Figure: the same Bayesian network over a, b, c, e, f (edges a → e, f → e, f → b, e → c), with f observed.]

2 Check whether a ⊥⊥ b | f holds or not.
Result : a ⊥⊥ b | f holds.

Reason : The path from a to b is blocked at f because f is a
TT-node (tail-to-tail) and observed. Therefore, a ⊥⊥ b | f.

3 Write down the factorisation

p(a, b | f ) = p(a | f ) p(b | f ) = p(a) p(b | f )
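The 'graphical' route can also be mechanised. Below is a minimal sketch of a d-separation test based on the standard reachable-nodes (Bayes-ball style) algorithm; this is not code from the course, and the graph is the one from this example, encoded by its parent sets:

```python
from collections import defaultdict

def reachable(parents, start, observed):
    """Nodes reachable from `start` along active (unblocked) trails given `observed`."""
    children = defaultdict(set)
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    # Ancestors of the observed set (including the observed nodes themselves),
    # needed for the head-to-head rule.
    ancestors, frontier = set(), set(observed)
    while frontier:
        v = frontier.pop()
        if v not in ancestors:
            ancestors.add(v)
            frontier |= set(parents.get(v, ()))
    # Search over (node, direction of arrival) pairs.
    to_visit, visited, result = [(start, 'up')], set(), set()
    while to_visit:
        v, direction = to_visit.pop()
        if (v, direction) in visited:
            continue
        visited.add((v, direction))
        if v not in observed:
            result.add(v)
        if direction == 'up' and v not in observed:
            to_visit += [(p, 'up') for p in parents.get(v, ())]
            to_visit += [(c, 'down') for c in children[v]]
        elif direction == 'down':
            if v not in observed:              # the trail continues through an unobserved node
                to_visit += [(c, 'down') for c in children[v]]
            if v in ancestors:                 # HH-node: open if it or a descendant is observed
                to_visit += [(p, 'up') for p in parents.get(v, ())]
    return result

def d_separated(parents, x, y, observed):
    return y not in reachable(parents, x, set(observed))

# Graph from this example: a -> e <- f, f -> b, e -> c.
parents = {'a': [], 'f': [], 'e': ['a', 'f'], 'b': ['f'], 'c': ['e']}
print(d_separated(parents, 'a', 'b', {'f'}))   # True:  a is d-separated from b given f
print(d_separated(parents, 'a', 'b', {'c'}))   # False: observing c opens the HH-node e
```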


Bayesian Network – D-separation

A third example
Is x3 d-separated from x6 given x1 and x5 ?
Mark x1 and x5 as observed.

[Figure: directed graph over x1, . . . , x6.]


4 junctions of interest between x3 and x6:

{x3, x1, x2} is blocked because x1 is a TT-node (tail-to-tail) and observed.
{x3, x5, x6} is blocked because x5 is a HT-node (head-to-tail) and observed.
{x3, x5, x2} is not blocked because x5 is a HH-node (head-to-head) and observed.
{x5, x2, x6} is not blocked because x2 is a TT-node and unobserved.

The path {x3, x5, x2, x6} prevents D-separation.
Therefore, x3 is not d-separated from x6 given x1 and x5 as
not all paths between x3 and x6 are blocked.

[Figure: the same graph with x1 and x5 shaded as observed.]
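The conclusion can be reproduced with the d-separation sketch given after the 'Graphical' method slide. The parent sets below are reconstructed from the junction analysis above (x1 → x2, x1 → x3, x2 → x5, x3 → x5, x2 → x6, x5 → x6); how x4 attaches is not stated in the text, so it is omitted:

```python
# Assumes the `d_separated` function from the earlier sketch; x4 omitted (its edges are not given).
parents6 = {'x1': [], 'x2': ['x1'], 'x3': ['x1'],
            'x5': ['x2', 'x3'], 'x6': ['x2', 'x5']}
print(d_separated(parents6, 'x3', 'x6', {'x1', 'x5'}))   # False: the path x3 - x5 - x2 - x6 is active
```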


Part XVIII

Probabilistic Graphical Models 2


Markov Random Fields (MRFs)

Markov Random Fields (also called Markov networks or
undirected graphical models) are defined over a graph with
undirected edges.
MRFs allow for different conditional independence
statements than Bayesian networks.
In a Bayesian network, the definition of a blocked path was
subtle for a HH-node: such a node blocks a path only if
neither it nor any of its descendants is observed.
Is there an alternative graphical semantics for probability
distributions such that conditional independence is
determined by simple graph separation?
Yes: removing the direction from the edges removes the
asymmetry between parent and child nodes, and with it the
subtleties associated with the HH-node.


Graph Separation

Definition (Graph Separation)

Let A, B and C be disjoint subsets of nodes of an undirected
graph G. If every path from A to B includes at least one node
from C, then C is said to separate A from B in G.

[Figure: a node set C separating node sets A and B in an undirected graph.]
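In the undirected case the test is plain graph reachability: refuse to pass through the nodes in C and ask whether A can still reach B. A minimal sketch (the three-node chain used as an example is hypothetical):

```python
def separates(adj, A, B, C):
    """True iff every path from A to B in the undirected graph `adj`
    passes through at least one node of C."""
    frontier, seen = list(A), set(A)
    while frontier:
        v = frontier.pop()
        if v in C:
            continue                     # paths may not pass through C
        if v in B:
            return False                 # reached B while avoiding C
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return True

# Hypothetical chain a - c - b: c separates {a} from {b}.
adj = {'a': ['c'], 'c': ['a', 'b'], 'b': ['c']}
print(separates(adj, {'a'}, {'b'}, {'c'}))   # True
print(separates(adj, {'a'}, {'b'}, set()))   # False
```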


Conditional Independence in MRFs

Definition (Conditional Independence in Markov Random Field)

Let A, B and C be disjoint subsets of nodes of an undirected
graph G. Then A is conditionally independent of B given C,

A ⊥⊥ B | C,

iff C separates A from B in G.



Markov Random Field

Definition (Markov Random Field)

A Markov Random Field is a set of probability distributions
{ p(x) | p(x) > 0 for all x } for which there exists an undirected
graph G such that, for all disjoint subsets of nodes A, B and C,
whenever C separates A from B in G,

A ⊥⊥ B | C.

Although we sometimes say "the MRF is such an
undirected graph", we mean "the MRF represents the set
of all probability distributions whose conditional
independence statements are precisely those given by
graph separation in the graph".


Factorisation in MRFs

Assume two nodes xi and xj that are not connected by an
edge.
Given all other nodes in the graph, xi and xj must be
conditionally independent as all paths between xi and xj
are blocked by observed nodes.

p(xi, xj | x\{i,j}) = p(xi | x\{i,j}) p(xj | x\{i,j})

where x\{i,j} denotes the set of all variables x with xi and xj
removed.
This suggests the importance of considering fully connected
sets of nodes (cliques). Can we use cliques to factorise the
joint distribution according to the graph?


Cliques in a graph

A clique is a subset of nodes in a graph such that there
exists an edge between all pairs of nodes in the subset.
(The nodes in a clique are fully connected.)
A maximal clique of a graph is a clique which is not a
proper subset of another clique. (No other nodes of the
graph can be added to a maximal clique without
destroying the property of full connectedness.)

[Figure: example undirected graph over x1, x2, x3, x4.]
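For small graphs the maximal cliques can be listed by brute force. The sketch below uses a hypothetical four-node edge set (not necessarily the graph in the figure) whose maximal cliques are {x1, x2, x3} and {x2, x3, x4}:

```python
from itertools import combinations

def maximal_cliques(nodes, edges):
    """All maximal cliques of a small undirected graph, by brute force
    (exponential in the number of nodes; for illustration only)."""
    edge_set = {frozenset(e) for e in edges}
    def is_clique(subset):
        return all(frozenset(pair) in edge_set for pair in combinations(subset, 2))
    cliques = [set(subset) for r in range(1, len(nodes) + 1)
               for subset in combinations(nodes, r) if is_clique(subset)]
    return [c for c in cliques if not any(c < d for d in cliques)]  # keep only maximal ones

# Hypothetical example graph.
nodes = ['x1', 'x2', 'x3', 'x4']
edges = [('x1', 'x2'), ('x1', 'x3'), ('x2', 'x3'), ('x2', 'x4'), ('x3', 'x4')]
print(maximal_cliques(nodes, edges))   # the two maximal cliques listed above
```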


Factorisation using Cliques

We can express the factorisation with the help of the
cliques.
In fact, we only need the maximal cliques, as any function
over the variables of a subset of a maximal clique can be
expressed as a function of the members of the maximal
clique.
Denote by C the set of maximal cliques of a graph.
For a maximal clique C ∈ C, denote by xC the subset of the
variables x which belong to C.



Factorisation using Cliques

A probability distribution p(x) factorises with respect to a
given undirected graph G if it can be written as

p(x) = (1/Z) ∏_{C∈C} ψC(xC)

where C is the set of maximal cliques of G and the potential
functions satisfy ψC(xC) ≥ 0. The constant
Z = Σ_x ∏_{C∈C} ψC(xC) ensures the correct normalisation of p(x).
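A tiny numerical illustration of this definition, assuming a hypothetical chain MRF over three binary variables with maximal cliques {x1, x2} and {x2, x3} and made-up potential tables:

```python
import itertools

# Hypothetical (nonnegative) potentials over the maximal cliques {x1, x2} and {x2, x3}.
psi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}

def unnorm(x1, x2, x3):
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]     # product over maximal cliques

# Z sums the product of clique potentials over all joint states.
Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnorm(x1, x2, x3) / Z

print(Z, sum(p(*x) for x in itertools.product([0, 1], repeat=3)))   # probabilities sum to 1
```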


Conditional Independence ⇔ Factorisation

Theorem (Factorisation ⇒ Conditional Independence)
If a probability distribution factorises according to an undirected
graph, and if A, B and C are disjoint subsets of nodes such that
C separates A from B in the graph, then the distribution
satisfies A ⊥⊥ B | C.

Theorem (Conditional Independence ⇒ Factorisation
(Hammersley-Clifford Theorem))

If a strictly positive probability distribution p(x) > 0 for all x
satisfies the conditional independence statements implied by
graph separation over a particular undirected graph, then it also
factorises according to the graph.


Factorisation with strictly positive potential functions

As the potential functions ψC(xC) are strictly positive, one
can express them as the exponential

ψC(xC) = exp{−E(xC)}

of an energy function E(xC).
A distribution of this exponential form is called a Boltzmann
distribution.
The joint distribution is defined as the product of the
potentials, so the total energy is obtained by adding the
energies of the maximal cliques:

p(x) = (1/Z) ∏_{C∈C} ψC(xC) = (1/Z) exp{ −Σ_{C∈C} E(xC) }
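A short check of the energy form, again with hypothetical potentials on two cliques of binary variables: setting E(xC) = −log ψC(xC) and summing the clique energies reproduces the same distribution.

```python
import math, itertools

# Hypothetical potentials on two cliques of binary variables.
psi_A = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_B = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}
E_A = {k: -math.log(v) for k, v in psi_A.items()}   # E(x_C) = -log psi_C(x_C)
E_B = {k: -math.log(v) for k, v in psi_B.items()}

states = list(itertools.product([0, 1], repeat=3))
Z = sum(psi_A[(x1, x2)] * psi_B[(x2, x3)] for x1, x2, x3 in states)

for x1, x2, x3 in states:
    via_potentials = psi_A[(x1, x2)] * psi_B[(x2, x3)] / Z
    via_energies = math.exp(-(E_A[(x1, x2)] + E_B[(x2, x3)])) / Z
    assert abs(via_potentials - via_energies) < 1e-12   # same Boltzmann distribution
```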


Example of Image Denoising

Given an unknown and noise-free image described by
binary pixels xi ∈ {−1,+1} for i = 1, . . . ,D.
Randomly flip the sign of some pixels with some small
probability and denote the pixels of the noisy image by yi
for i = 1, . . . ,D.
Goal : Recover the original noise-free image.

[Figures: the original image, and the image after randomly changing 10% of the pixels.]


Example of Image Denoising

Prior knowledge 1 : We know the noise level is small.
Therefore: Strong correlation between original pixels xi
and noisy pixels yi.
Prior knowledge 2 : Neighbouring pixels xi and xj in an
image are strongly correlated (given a decent resolution).
Prior knowledge can be captured in a Markov Random
Field.

[Figure: MRF in which each noise-free pixel xi is connected to its observed pixel yi and to the neighbouring noise-free pixels.]


Example of Image Denoising

Cliques {xi, yi} : choose an energy function of the form
−η xi yi for η > 0, giving the potential function exp{η xi yi}.
This favours equal signs for xi and yi.
Cliques {xi, xj}, where i and j index neighbouring
pixels : choose an energy function of the form −β xi xj for
β > 0, again favouring equal signs for xi and xj.
We also need to accommodate the fact that one pixel value
may occur more often than the other. This can be done by
adding a term h xi for each pixel in the noise-free image.
(The potential function is an arbitrary, nonnegative
function over a maximal clique, so we are allowed to
multiply it by any nonnegative function of subsets of the
clique.)


Example of Image Denoising

Energy function

E(x, y) = h Σ_i xi − β Σ_{i,j} xi xj − η Σ_i xi yi

(the sum over {i, j} runs over pairs of neighbouring pixels)

which defines a joint distribution over x and y

p(x, y) = (1/Z) exp{−E(x, y)}.



Example – Iterated Conditional Modes (ICM)

1 Fix the elements of y to their observed values.
(This implicitly defines p(x | y).)
2 Initialise xi = yi for i = 1, . . . ,D.
3 Take one node xj and evaluate the total energy for both
possible states xj ∈ {−1,+1}, keeping all other variables
fixed. Set xj to the state with the lower energy.
4 Repeat for another node until a stopping criterion is satisfied.

[Figures: result converged to a local minimum (ICM) and to the global minimum (graph-cut).]
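A minimal ICM sketch for this model (NumPy; the parameter values and the toy image below are illustrative assumptions, not material from the course):

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
    """Iterated Conditional Modes for E(x, y) = h*sum(xi) - beta*sum(xi*xj) - eta*sum(xi*yi).
    y is the noisy image with entries in {-1, +1}."""
    x = y.copy()
    rows, cols = x.shape
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                # Only the terms involving pixel (i, j) change with its state.
                neigh = sum(x[a, b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                            if 0 <= a < rows and 0 <= b < cols)
                def local_energy(s):            # energy contribution of x_ij = s
                    return h * s - beta * s * neigh - eta * s * y[i, j]
                x[i, j] = min((-1, +1), key=local_energy)   # pick the lower-energy state
    return x

# Toy usage: a 16x16 image of -1s with a +1 square, then 10% of pixels flipped.
rng = np.random.default_rng(0)
clean = -np.ones((16, 16), dtype=int)
clean[4:12, 4:12] = 1
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)
denoised = icm_denoise(noisy)
print((noisy != clean).mean(), (denoised != clean).mean())   # the error rate should drop
```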


Bayesian Networks vs. Markov Random Fields

Two frameworks for graphical models: can we convert
between them?
Bayesian Network (directed chain x1 → x2 → · · · → xN−1 → xN)

p(x) = p(x1) p(x2 | x1) p(x3 | x2) . . . p(xN | xN−1)

Markov Random Field (undirected chain x1 − x2 − · · · − xN−1 − xN)

p(x) = (1/Z) ψ1,2(x1, x2) ψ2,3(x2, x3) . . . ψN−1,N(xN−1, xN)


Chains are the same

Bayesian Network

p(x) = p(x1) p(x2 | x1) p(x3 | x2) . . . p(xN | xN−1)

Markov Random Field

p(x) = (1/Z) ψ1,2(x1, x2) ψ2,3(x2, x3) . . . ψN−1,N(xN−1, xN)

Find the corresponding terms

ψ1,2(x1, x2) = p(x1) p(x2 | x1)
ψ2,3(x2, x3) = p(x3 | x2)
. . .
ψN−1,N(xN−1, xN) = p(xN | xN−1)

and note that Z = 1 in this case.
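A quick numerical sanity check of this correspondence for a hypothetical three-node binary chain (the CPT values are made up): building the potentials from the conditionals as above reproduces the joint distribution, with Z = 1.

```python
import itertools

# Hypothetical CPTs for the chain x1 -> x2 -> x3 over binary variables.
p1 = {0: 0.4, 1: 0.6}
p2_1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p3_2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

def p_bn(x1, x2, x3):                                # p(x1) p(x2|x1) p(x3|x2)
    return p1[x1] * p2_1[x1][x2] * p3_2[x2][x3]

psi_12 = lambda x1, x2: p1[x1] * p2_1[x1][x2]        # psi_{1,2}(x1, x2) = p(x1) p(x2|x1)
psi_23 = lambda x2, x3: p3_2[x2][x3]                 # psi_{2,3}(x2, x3) = p(x3|x2)

Z = sum(psi_12(x1, x2) * psi_23(x2, x3)
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))
print(Z)                                             # 1.0: no extra normalisation needed
assert all(abs(p_bn(*x) - psi_12(x[0], x[1]) * psi_23(x[1], x[2])) < 1e-12
           for x in itertools.product([0, 1], repeat=3))
```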


Moralisation

For other kinds of Bayesian Networks (BNs), create the
cliques of an MRF by adding undirected edges between all
pairs of parents of each node in the graph, and then dropping
the directions of all edges.
This process of 'marrying the parents' is called
moralisation, and the result is a moral graph.
BUT the resulting MRF may represent different conditional
independence statements than the original BN.
Example: the moral graph below is fully connected and exhibits
NO conditional independence properties, in contrast to the
original directed graph.

[Figures: a directed graph over x1, x2, x3, x4 and its fully connected moral graph.]
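A small sketch of moralisation as a function from parent sets to an undirected adjacency structure. The example DAG at the end (x4 with parents x1, x2, x3) is an assumption in the spirit of the figure; moralising it yields a fully connected undirected graph.

```python
from collections import defaultdict
from itertools import combinations

def moralise(parents):
    """Moralise a DAG given as {node: [parents]}: marry all parents of each node,
    then drop the edge directions."""
    adj = defaultdict(set)
    for child, ps in parents.items():
        adj[child]                                   # ensure isolated nodes appear
        for p in ps:                                 # keep every original edge, now undirected
            adj[child].add(p)
            adj[p].add(child)
        for u, v in combinations(ps, 2):             # marry the parents
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

# Assumed example: x1, x2, x3 are all parents of x4; the moral graph is fully connected.
print(moralise({'x1': [], 'x2': [], 'x3': [], 'x4': ['x1', 'x2', 'x3']}))
```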


Distribution vs Graph

Definition (D-map)

A graph is a D-map (dependency map) of a distribution if every
conditional independence statement satisfied by the
distribution is reflected in the graph.

Definition (I-map)

A graph is an I-map (independence map) of a distribution if
every conditional independence statement implied by the graph
is satisfied in the distribution.

Definition (P-map)

A graph is a P-map (perfect map) of a distribution if it is both a
D-map and an I-map for the distribution.


Bayesian Networks vs. Markov Random Fields

Consider probability distributions over N variables:
P : the set of all probability distributions
D : the set of all probability distributions that can be
represented as a perfect map by a directed graph
U : the set of all probability distributions that can be
represented as a perfect map by an undirected graph

[Figure: Venn diagram of D and U as subsets of P.]


Only as Bayesian Network

[Figure: directed graph over A, B and C (the collider A → C ← B).]

A directed graph whose conditional independence properties
cannot be expressed using an undirected graph over the same
three variables.


Only as Markov Random Field

[Figure: undirected graph over the four variables A, B, C and D.]
An undirected graph whose conditional independence
properties cannot be expressed using a directed graph over
the same four variables.


BNs vs. MRFs: Similarities

In both types of Graphical Models
A relationship between the conditional independence
statements satisfied by a distribution and the associated
simplified algebraic structure of the distribution is made in
terms of graphical objects.
The conditional independence statements are related to
concepts of separation between variables in the graph.
The simplified algebraic structure (factorisation of p(x)) is
related to 'local pieces' of the graph (a child plus its parents
in BNs, cliques in MRFs).


BNs vs. MRFs: Differences

Differences
The set of probability distributions that can be represented
as MRFs is different from the set that can be represented
as BNs.
Although both MRFs and BNs are expressed as a
factorisation of local functions on the graph, the MRF has
a normalisation constant Z = Σ_x ∏_{C∈C} ψC(xC) that
couples all factors, whereas the BN does not.
The local 'pieces' of the BN are probability distributions
themselves, whereas in MRFs they need only be
non-negative functions (i.e. they need not lie in [0, 1] as
probabilities do).
