
Statistical Machine Learning

© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University

Outline

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XV

Decision Theory and Modelling


Lecture Overview

Decision theory
Families of approaches
(non-probabilistic, discriminative probabilistic, and
generative probabilistic)
Minimising expected misclassification rate
Minimising expected L2 error (regression)

Probabilistic Modelling in the abstract
Model formulation
Parameter estimation
(point estimate or full Bayesian posterior)
Why the full Bayesian posterior is challenging to work with:
integrating hypotheses

Overview of computational approaches in this course

Model selection
ML-II: Maximum (marginal) likelihood
Cross validation


Three Models for Decision Problems

For simplicity, consider classification given dataset D.

Discriminant functions use D to determine some function f(x) which maps each input directly onto a class label (non-probabilistic).

Discriminative Models
1 Solve the inference problem of determining the posterior predictive probability p(Ck | x, D).
2 Combine with decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the predictive probabilities p(x | Ck, D) and p(Ck | D) (or simply model the joint in x and Ck directly).
2 Compute p(Ck | x, D).
3 Use decision theory to assign each new x to one of the classes.

Each has pros and cons.


What is Probability?

You observe the results of a coin toss x ∈ {H,T}.
Frequentist interpretation

p(x = H) = ρ, where ρ = lim_{N→∞} N_heads / N

ρ is an unknown constant; requires repeated experiments
Bayesian interpretation

p(x = H) = 1 ⇔ I’m certain it will be heads
p(x = H) = 1/2 ⇔ I believe the coin is fair
Non-repeatable outcomes

my charging station is at location (x, y, z)
that stormtrooper is hostile
Marc Marquez will win the next MotoGP season

Cox’s axioms ⇒ beliefs obey the laws of probability
Calculus is to bridge building as probability is to building intelligent agents
Graphical models are an intuitive language for probability


Decision Theory – Key Ideas

Two classes C1 and C2
joint distribution p(x, Ck)
using Bayes’ theorem

p(Ck | x) = p(x | Ck) p(Ck) / p(x)

Example: cancer treatment (K = 2 classes)
data x : an X-ray image
C1 : patient has cancer (C2 : patient has no cancer)
p(C1) is the prior probability of a person having cancer
p(C1 | x) is the posterior probability of a person having
cancer after having seen the X-ray data


Decision Theory – Key Ideas

Need a rule which assigns each value of the input x to one
of the available classes.
The input space is partitioned into decision regions Rk.
Leads to decision boundaries or decision surfaces
probability of a mistake

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
           = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

[Figure: the input space partitioned into decision regions R1 and R2.]


Decision Theory – Key Ideas

probability of a mistake

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
           = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

goal: minimize p(mistake)

[Figure: joint densities p(x, C1) and p(x, C2) over x, with a decision boundary x̂ (defining R1 and R2) and the optimal boundary x0.]


Decision Theory – Key Ideas

multiple classes
it is more convenient to maximise the probability of correct classification

p(correct) = Σ_{k=1}^{K} p(x ∈ Rk, Ck) = Σ_{k=1}^{K} ∫_{Rk} p(x, Ck) dx


Minimising the Expected Loss

Not all mistakes are equally costly.
Weight the assignment of an x whose true class is Ck to the wrong class Cj by a factor Lkj.
The expected loss is now

E[L] = Σ_k Σ_j ∫_{Rj} Lkj p(x, Ck) dx

Goal: minimise the expected loss E[L]. Since the Rj are disjoint, the optimal decision rule assigns x to the class

j* = arg min_j Σ_k Lkj p(x, Ck) = arg min_j Σ_k Lkj p(Ck | x).
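A minimal numerical sketch of this rule (the 2-class loss matrix and posterior values below are made up for illustration, loosely echoing the cancer-screening example):

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: cost of deciding class j when the truth is class k.
# A false negative (deciding "no cancer" when the patient has cancer) is very costly here.
L = np.array([[0.0, 100.0],   # true class C1 = cancer
              [1.0,   0.0]])  # true class C2 = no cancer

# Made-up posterior probabilities p(Ck | x) for one particular input x.
posterior = np.array([0.3, 0.7])  # p(C1 | x), p(C2 | x)

# Expected loss of each decision j: sum_k L[k, j] * p(Ck | x).
expected_loss = posterior @ L
j_star = np.argmin(expected_loss)

print("expected loss per decision:", expected_loss)  # [0.7, 30.0]
print("optimal decision: class", j_star + 1)         # class 1, i.e. treat as cancer
```

Even though C2 is the more probable class, the asymmetric loss makes deciding C1 optimal.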


The Reject Region

Avoid making automated decisions on difficult cases.
Difficult cases:

the largest of the posterior probabilities p(Ck | x) is significantly less than 1
the joint distributions p(x, Ck) have comparable values

[Figure: posteriors p(C1 | x) and p(C2 | x) over x; inputs whose largest posterior falls below a threshold θ lie in the reject region.]


Recall: Loss Function for Regression

Expected loss for squared loss function

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.

Minimise E [L] by choosing the regression function

y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t | x) dt = E_t[t | x]

(use calculus of variations to derive this result; alternatively work point-wise by fixing an x and using stationarity to solve for y(x)).
Other losses are possible — the minimiser of the expected absolute error (the conditional median) is also a simple exercise:

E[L] = ∫∫ |y(x) − t| p(x, t) dx dt.
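As a quick sanity check of both results, the sketch below (assuming an arbitrary skewed conditional p(t | x), here a log-normal, so that mean and median differ) confirms numerically that the conditional mean minimises the expected squared error and the conditional median minimises the expected absolute error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume a fixed x and a skewed conditional distribution p(t | x), here a log-normal.
t = rng.lognormal(mean=0.0, sigma=0.75, size=200_000)

candidates = np.linspace(0.1, 3.0, 300)
sq_loss = [np.mean((y - t) ** 2) for y in candidates]
abs_loss = [np.mean(np.abs(y - t)) for y in candidates]

print("E[t | x]            :", t.mean())
print("argmin squared loss :", candidates[np.argmin(sq_loss)])   # ~ conditional mean
print("median of t | x     :", np.median(t))
print("argmin absolute loss:", candidates[np.argmin(abs_loss)])  # ~ conditional median
```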


A General Probabilistic Model

Acting optimally in the face of the decision problems above is relatively straightforward given the appropriate posterior predictive distribution.
This is a key advantage of probabilistic methods. Let’s
now review the mechanics of probabilistic inference.
Treat both parameters θ and data X as random variables.
The joint factorises as

p(X , θ) = p(X|θ)p(θ)

On observing X , we update our prior on θ to obtain the
posterior

p(θ | X) = p(X | θ) p(θ) / p(X)

p(X) = ∫ p(X, θ) dθ = ∫ p(X | θ) p(θ) dθ


Optimisation vs. Integration

Two main approaches for handling the posterior:
1 Employ a heuristic rule to choose a point estimate

yields a single “best” θ*
rules include maximum likelihood and maximum a posteriori
often involves optimisation, which may be difficult in practice
prediction is simple

2 Maintain the full Bayesian posterior
keep the full posterior distribution
generally involves integration or summation, which may be
(very) difficult in practice
prediction may also be difficult in practice, as we integrate or
sum over all possible hypotheses
typically better captures uncertainty, which is useful for
subsequent decision problems


Example Likelihood Functions

Regression, e.g. x ∈ R^D, t ∈ R:

p(t | x, θ, σ²) = N(t | y(x, θ), σ²)

σ² = noise variance hyper-parameter

Binary Classification, e.g. x ∈ R^D, t ∈ {−1, +1}:

p(t | x, θ) = Bernoulli(t | sigmoid(y(x)))

sigmoid(y) = 1/(1 + exp(−y))

Gaussian density estimation:

p(x | θ) = N(x | µ(θ), Σ(θ))

All may be combined with the i.i.d. assumption

p(X | θ) = ∏_{x∈X} p(x | θ)
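A sketch of these likelihoods in code, under assumptions: a linear y(x, θ), made-up data, and the {−1, +1} encoding written in the equivalent form p(t | x, θ) = sigmoid(t · y(x, θ)); all function names are illustrative:

```python
import numpy as np

def y(x, theta):
    # Illustrative model: linear in the inputs.
    return x @ theta

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_logpdf(t, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (t - mean) ** 2 / var)

# Regression: p(t | x, theta, sigma^2) = N(t | y(x, theta), sigma^2), i.i.d. over the data.
def log_lik_regression(X, t, theta, sigma2):
    return np.sum(gaussian_logpdf(t, y(X, theta), sigma2))

# Binary classification with t in {-1, +1}: p(t | x, theta) = sigmoid(t * y(x, theta)).
def log_lik_classification(X, t, theta):
    return np.sum(np.log(sigmoid(t * y(X, theta))))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                 # made-up inputs
theta = np.array([0.5, -1.0])               # made-up parameters
t_reg = y(X, theta) + 0.1 * rng.normal(size=5)
t_cls = np.sign(y(X, theta))                # labels in {-1, +1}
print(log_lik_regression(X, t_reg, theta, sigma2=0.25))
print(log_lik_classification(X, t_cls, theta))
```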


Maximum Likelihood to MAP

Posterior ∝ Prior × Likelihood

The ML heuristic is simply:

θ_ML ≡ arg max_θ p(X | θ)

The MAP heuristic seems more intuitive:

θ_MAP ≡ arg max_θ p(θ | X) = arg max_θ p(X | θ) p(θ)

But both are merely heuristic point estimates.
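A concrete contrast between the two heuristics, assuming a Bernoulli (coin-toss) likelihood with a Beta(a, b) prior, for which both point estimates are available in closed form:

```python
import numpy as np

# Observed coin tosses (1 = heads), a deliberately small sample.
x = np.array([1, 1, 1, 0, 1])
n_heads, n = x.sum(), len(x)

# Maximum likelihood: argmax_rho p(X | rho)
rho_ml = n_heads / n

# MAP with a Beta(a, b) prior on rho: argmax_rho p(X | rho) p(rho)
a, b = 2.0, 2.0                               # assumed prior pseudo-counts
rho_map = (n_heads + a - 1) / (n + a + b - 2) # mode of the Beta posterior

print("ML estimate :", rho_ml)   # 0.8
print("MAP estimate:", rho_map)  # ~0.714, pulled toward the prior mean 0.5
```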


Bayesian Inference and Integration

Even more intuitive is to maintain multiple hypotheses with
various degrees of certainty.
Captured by the Bayesian posterior:

p(θ | X) = p(X | θ) p(θ) / p(X)

p(X) = ∫ p(X, θ) dθ = ∫ p(X | θ) p(θ) dθ

Integration (or summation) arises in the following ways:
1 Calculating the denominator (or normalising constant) p(X )
2 Marginalising parameters in prediction of some quantity of

interest z:

p(z | X) = ∫ p(z, θ | X) dθ          (marginalisation)
         = ∫ p(z | θ) p(θ | X) dθ     (factorisation and conditional independence)
         = E_{θ|X}[p(z | θ)].

In the Bayesian view we treat all parameters, data, and
latent variables in the same way (as random variables).


Bayesian Inference and Integration:
Computational Strategies

Generally, Bayesian integrals are intractable. In this course we
consider the following approaches:

Tractable conjugate priors (recall e.g. Bayesian linear
regression); posterior in the same family as the prior.
Exact simplification of sums — sum-product algorithm.
Laplace approximation (a slight Bayesian twist on a
maximum a posteriori estimator).
Variational inference (typically optimise some distributional
distance such as the Kullback-Leibler divergence; broadly
speaking more global than the Laplace approximation).
Sampling (approximate expectations empirically on a finite sample — generating the sample is typically itself intractable and requires approximation...).


Conjugate Priors

A prior and likelihood are conjugate to one another if the
posterior is in the same family as the prior.
For example, for a Gaussian mean µ with known observation variance σx²,

p(µ | µ0, σ0) = N(µ | µ0, σ0²)
p(x | µ) = N(x | µ, σx²)

p(µ | X, µ0, σ0) = N( µ | (µ0/σ0² + Σ_{i=1}^{N} xi/σx²) / (1/σ0² + N/σx²),  1 / (1/σ0² + N/σx²) )

Often interpretable as pseudo-observations in the likelihood function:

p(θ) ∝ p(X_pseudo | θ)
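A numerical sketch of the Gaussian conjugate update above, with assumed toy values for the prior and a known noise variance σx²:

```python
import numpy as np

rng = np.random.default_rng(2)

mu0, sigma0_sq = 0.0, 4.0        # prior N(mu | mu0, sigma0^2)
sigma_x_sq = 1.0                 # known observation noise variance
mu_true = 1.5
x = rng.normal(mu_true, np.sqrt(sigma_x_sq), size=20)
N = len(x)

# Posterior precision, variance and mean, as in the conjugate update above.
post_prec = 1.0 / sigma0_sq + N / sigma_x_sq
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / sigma0_sq + x.sum() / sigma_x_sq)

print("posterior mean:", post_mean)  # close to the sample mean for N = 20
print("posterior var :", post_var)   # shrinks roughly like sigma_x^2 / N
```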


Exact simplification of sums

Distributive Law can save operations

a(b + c)  [2 operations]  =  ab + ac  [3 operations]

If some probabilities do not depend on all random
variables, we might be able to factor them out. For
example, assume

p(x1, x3 | x2) = p(x1 | x2) p(x3 | x2),

then (using Σ_{x3} p(x3 | x2) = 1)

p(x1) = Σ_{x2,x3} p(x1, x2, x3) = Σ_{x2,x3} p(x1, x3 | x2) p(x2)
      = Σ_{x2,x3} p(x1 | x2) p(x3 | x2) p(x2)    [O(|X1||X2||X3|) operations]
      = Σ_{x2} p(x1 | x2) p(x2)                  [O(|X1||X2|) operations]

The sum-product algorithm extends this idea.
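A small numerical check of this saving, using assumed random conditional probability tables: the naive triple sum and the factorised double sum agree, while the latter touches far fewer terms:

```python
import numpy as np

rng = np.random.default_rng(3)
K1, K2, K3 = 4, 5, 6  # |X1|, |X2|, |X3|

def normalise(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

p_x2 = normalise(rng.random(K2), axis=0)                  # p(x2)
p_x1_given_x2 = normalise(rng.random((K1, K2)), axis=0)   # p(x1 | x2)
p_x3_given_x2 = normalise(rng.random((K3, K2)), axis=0)   # p(x3 | x2)

# Naive marginal: sum over x2 and x3 of p(x1|x2) p(x3|x2) p(x2) -- O(K1*K2*K3) terms.
joint = (p_x1_given_x2[:, :, None]
         * p_x3_given_x2.T[None, :, :]
         * p_x2[None, :, None])
p_x1_naive = joint.sum(axis=(1, 2))

# Factorised marginal: sum_{x3} p(x3 | x2) = 1, so only sum over x2 -- O(K1*K2) terms.
p_x1_fast = p_x1_given_x2 @ p_x2

print(np.allclose(p_x1_naive, p_x1_fast))  # True
```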


Laplace Approximation

Approximate p(x) = (1/Z) f(x) for x ∈ R^M

Taylor expand ln f(x) at a local maximum x* of f:

ln f(x) ≈ (1/2) (x − x*)ᵀ H(x*) (x − x*) + const.

(no linear term: the gradient vanishes at x*)
where the Hessian at the mode is

H(x*) = ∇∇ ln f(x) |_{x=x*} ≡ −A.

The Laplace approximation is quadratic in log q and therefore Gaussian:

q(x) = |A|^{1/2} / (2π)^{M/2} exp{ −(1/2) (x − x*)ᵀ A (x − x*) } = N(x | x*, A⁻¹)
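A one-dimensional sketch under an assumed unnormalised target f(x) ∝ x⁴ exp(−2x): locate the mode numerically, take A as the negative second derivative of ln f at the mode, and report N(x*, A⁻¹):

```python
import numpy as np

# Assumed unnormalised, non-Gaussian target: f(x) ∝ x^4 * exp(-2x) on x > 0.
def log_f(x):
    return 4.0 * np.log(x) - 2.0 * x

# Find the mode on a grid (the exact mode is x* = 2 for this target).
xs = np.linspace(0.01, 10.0, 100_000)
x_star = xs[np.argmax(log_f(xs))]

# Precision A = -d^2/dx^2 log f at the mode, via a central finite difference.
h = 1e-4
A = -(log_f(x_star + h) - 2 * log_f(x_star) + log_f(x_star - h)) / h**2

print("mode x*     :", x_star)   # ~2.0
print("precision A :", A)        # ~1.0 (exact value: 4 / x*^2)
print("Laplace approximation: N(x | %.3f, %.3f)" % (x_star, 1.0 / A))
```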


Laplace Approximation

[Figure, left: a non-Gaussian density (yellow) and its Gaussian Laplace approximation (red). Right: the negative logs of the non-Gaussian density (yellow) and the Gaussian approximation (red).]


Variational Inference

Write the approximate posterior q(z | X) ≈ p(z | X)
Optimise a variational objective for some divergence D:

q(z | X) = arg min_q D[p(z | X), q(z | X)]

Various flavours, e.g.
choose a factorising q and then do variational optimisation over functions for the factors
choose a simplifying family q(z | X, φ) and choose the parameters φ which minimise the divergence to the true posterior

A typical choice of divergence is the Kullback-Leibler or KL-divergence:

D_KL[p, q] ≡ ∫ p(z) (log p(z) − log q(z)) dz
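A toy illustration of the recipe, under assumptions: the target "posterior" p(z) is a fixed two-component Gaussian mixture, the variational family is q(z) = N(µ, σ²), and the divergence is the D_KL[p, q] defined above, minimised here by a crude grid search over (µ, σ):

```python
import numpy as np

def gauss(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed target "posterior": a two-component Gaussian mixture.
def p(z):
    return 0.3 * gauss(z, -2.0, 0.7) + 0.7 * gauss(z, 1.5, 1.0)

z = np.linspace(-8, 8, 4001)
dz = z[1] - z[0]

def kl_p_q(mu, sigma):
    # D_KL[p, q] = ∫ p(z) (log p(z) - log q(z)) dz, approximated on the grid.
    pz, qz = p(z), gauss(z, mu, sigma)
    return np.sum(pz * (np.log(pz + 1e-300) - np.log(qz + 1e-300))) * dz

mus = np.linspace(-3, 3, 61)
sigmas = np.linspace(0.3, 4.0, 38)
best = min((kl_p_q(m, s), m, s) for m in mus for s in sigmas)
print("best (KL, mu, sigma):", best)  # mu near the mixture mean, sigma wide enough to cover both modes
```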


Sampling

Monte-Carlo methods use random sampling.
Example: draw samples x(i) ∼ X, i = 1, 2, . . . ,N from some
random variable X and approximate E[ f ] by

E_{x∼X}[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x^{(i)})

For independent x^{(i)} the variance of the estimate is ∝ 1/N.
Sampling x(i) ∼ X may itself be intractable, and
approximate sampling methods are used.
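A minimal example: estimating E[f(x)] for f(x) = x² with x ∼ N(0, 1), whose true value is 1:

```python
import numpy as np

rng = np.random.default_rng(4)

f = lambda x: x ** 2          # quantity of interest; E[f(x)] = 1 for x ~ N(0, 1)
for N in (10, 1_000, 100_000):
    x = rng.normal(size=N)    # draw N independent samples x^(i)
    print(N, f(x).mean())     # the estimate tightens as N grows (variance ~ 1/N)
```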


Review of this lecture so far

1 Given the relevant posterior predictive distribution, solving
decision problems is relatively straightforward.

2 Given a model, finding the relevant posterior predictive
distribution is straightforward in principle (using the rules
of probability, in particular Bayes’ rule), but usually difficult
in practice and requires approximations.

3 But how do we come up with the model?


Modelling: A Generative View

A powerful paradigm for designing models is to think of a data
generating procedure. The parameters are subsequently
inferred from a dataset by inverting this procedure using the
rules of probability.

For example, in the Gaussian mixture model we (see the sketch below):
1 Sample parameters from a prior (mixture proportions + mean and covariance parameters for each mixture component).
2 Sample a cluster index given the mixture proportions.
3 Sample a datapoint given the mean and covariance associated with the cluster index sampled in the previous step.
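A sketch of this generative procedure for a hypothetical 2-D mixture (the prior choices, e.g. a Dirichlet on the proportions and fixed unit covariances, are illustrative simplifications):

```python
import numpy as np

rng = np.random.default_rng(5)
K, D, N = 3, 2, 500

# 1. Sample parameters from a prior (illustrative choices).
pi = rng.dirichlet(alpha=np.ones(K))             # mixture proportions
means = rng.normal(scale=5.0, size=(K, D))       # component means
covs = np.stack([np.eye(D) for _ in range(K)])   # covariances fixed, for simplicity

# 2. Sample a cluster index for each datapoint given the proportions.
z = rng.choice(K, size=N, p=pi)

# 3. Sample each datapoint from the Gaussian of its cluster.
X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])

print(X.shape, np.bincount(z, minlength=K) / N, pi)
```

Inference would then invert this procedure: given only X, recover a posterior (or point estimate) over pi, means and z.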


Model Selection

The prior also has parameters, called hyper-parameters, which may be selected by:

Choosing a mathematically convenient form.
Using our domain knowledge and intuition.

Why not add another layer of indirection in our
model?

We can place a hyper-prior on the
hyper-parameters.
This is an infinite regress — Turtles all the way
down!

In practice we introduce model selection tools to select hyper-parameters, including:

Maximum marginal likelihood
(a.k.a evidence maximisation or ML-II)
Cross validation


Maximum Marginal Likelihood
(ML-II or Evidence Maximisation)

Apply the maximum likelihood heuristic to choose the
hyper-parameters, marginalising over the parameters.
Given K models M = {M1, M2, . . . , MK},

Mk ∼ p(M)
θk ∼ p(θ|Mk)
X ∼ p(X|θk,Mk),

we pick

M* = arg max_{Mk} p(X | Mk) = arg max_{Mk} ∫ p(X, θk | Mk) dθk

The criterion is the normalisation constant of the Bayesian
posterior — p(X ) in the earlier notation.
That is, we treat one of the turtles (the hyper-parameter level) as fixed rather than as a random variable.


ML-II Example: Bayesian Linear Regression

The Gaussian prior and likelihood

p(w | α) = N(w | 0, α⁻¹ I)
p(t | w, β) = N(t | Φw, β⁻¹ I)

with N training data pairs X = {(x_n, t_n)}_{n=1}^{N} imply the posterior

p(w | t, α, β) = N(w | m_N, S_N)
m_N = β S_N Φᵀ t
S_N⁻¹ = α I + β ΦᵀΦ.

The log marginal likelihood of (α, β) is then (Bishop 3.86)

log p(t | α, β) = log ∫ p(t, w | α, β) dw = log ∫ p(t | w, β) p(w | α) dw
                = (M/2) log α + (N/2) log β − E(m_N) − (1/2) log|A| − (N/2) log(2π)

where A = S_N⁻¹ and

E(m_N) = (β/2) ‖t − Φ m_N‖² + (α/2) m_Nᵀ m_N.
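A sketch evaluating this expression on made-up data, so that different hyper-parameter values can be compared by their log marginal likelihood:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data: t = 2x - 1 + noise, with a linear feature map Phi = [1, x].
N, beta_true = 30, 25.0
x = rng.uniform(-1, 1, size=N)
t = 2.0 * x - 1.0 + rng.normal(scale=1 / np.sqrt(beta_true), size=N)
Phi = np.column_stack([np.ones(N), x])
M = Phi.shape[1]

def log_marginal_likelihood(alpha, beta):
    A = alpha * np.eye(M) + beta * Phi.T @ Phi         # A = S_N^{-1}
    m_N = beta * np.linalg.solve(A, Phi.T @ t)         # posterior mean
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Compare candidate alpha values at a fixed beta; ML-II would pick the maximiser.
for alpha in (1e-2, 1e-1, 1.0, 10.0):
    print(alpha, log_marginal_likelihood(alpha, beta=25.0))
```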


Cross validation

Take the model with the best S-fold cross validation score:
For each of S (= 4, say) runs, train on the white portion (the remaining folds) and test on the pink portion (the held-out fold).
Average the test score over the runs.

Cross-validation is extremely effective.

Trivial to implement for any machine learning algorithm
Trivial to specialise to any loss function (decision problem)
Applicable to discriminative, generative and non-probabilistic algorithms alike; we may even compare between them.
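A bare-bones S-fold cross-validation loop (the model being scored here, ridge regression with weight lam, and the toy data are assumptions made just to keep the scoring concrete):

```python
import numpy as np

def cv_score(X, t, lam, S=4, seed=0):
    """Mean held-out squared error of ridge regression over S folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), S)
    scores = []
    for s in range(S):
        test = folds[s]
        train = np.concatenate([folds[r] for r in range(S) if r != s])
        # "Train on the white portion": ridge solution on the training folds.
        A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[train].T @ t[train])
        # "Test on the pink portion": error on the held-out fold.
        scores.append(np.mean((X[test] @ w - t[test]) ** 2))
    return np.mean(scores)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
t = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)
for lam in (1e-3, 1e-1, 10.0):
    print(lam, cv_score(X, t, lam))
```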


Nested Cross Validation

Problem — the following will be overly optimistic:
compute the cross validation score for each of K models
report the result for the best model.

Solution — nested cross-validation (a code sketch follows below):
For each of the four folds in the LHS diagram:
For each of the K models:
Compute an inner S′-fold cross validation score (as per the RHS figure) on the given white portion of the LHS figure.
Take the best model and
re-train it on the entire white portion for that (outer) fold
compute the loss on the pink portion for that fold.
Report the average loss on the pink portions of the LHS figure (which may be the result of different models).
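A self-contained sketch of the nested procedure, again with ridge regression and a grid of regularisation weights standing in for the K models:

```python
import numpy as np

def ridge_fit(X, t, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

def kfold_indices(n, S, rng):
    return np.array_split(rng.permutation(n), S)

def inner_cv_mse(X, t, lam, S, rng):
    folds = kfold_indices(len(t), S, rng)
    mses = []
    for s in range(S):
        test = folds[s]
        train = np.concatenate([folds[r] for r in range(S) if r != s])
        w = ridge_fit(X[train], t[train], lam)
        mses.append(np.mean((X[test] @ w - t[test]) ** 2))
    return np.mean(mses)

def nested_cv(X, t, lams, S_outer=4, S_inner=3, seed=0):
    rng = np.random.default_rng(seed)
    outer = kfold_indices(len(t), S_outer, rng)
    losses = []
    for s in range(S_outer):
        test = outer[s]
        train = np.concatenate([outer[r] for r in range(S_outer) if r != s])
        # Inner loop: model selection uses the outer-training data only.
        best_lam = min(lams, key=lambda lam: inner_cv_mse(X[train], t[train], lam, S_inner, rng))
        # Re-train the selected model on the full outer-training portion ...
        w = ridge_fit(X[train], t[train], best_lam)
        # ... and score it on the untouched outer test fold.
        losses.append(np.mean((X[test] @ w - t[test]) ** 2))
    return np.mean(losses)

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
t = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)
print(nested_cv(X, t, lams=(1e-3, 1e-1, 10.0)))
```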
