Statistical Machine Learning
Outline
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XV
Decision Theory and Modelling
Lecture Overview
Decision theory
Families of approaches
(non-probabilistic, discriminative probabilistic, and
generative probabilistic)
Minimising the expected misclassification rate
Minimising expected L2 error (regression)
Probabilistic Modelling in the abstract
Model formulation
Parameter estimation
(point estimate or full Bayesian posterior)
Why the full Bayesian posterior is challenging to work with:
integrating hypotheses
Overview of computational approaches in this course
Model selection
ML-II: Maximum (marginal) likelihood
Cross validation
Three Models for Decision Problems
For simplicity, consider classification given dataset D.
Discriminant functions use D to determine some function
f (x) which maps each input directly onto a class label
(non-probabilistic).
Discriminative Models
1 Solve the inference problem of determining the posterior
predictive probability p(Ck | x,D).
2 Combine with decision theory to assign each new x to one
of the classes.
Generative Models
1 Solve the inference problem of determining the posterior
predictive probabilities p(x | Ck,D) and p(Ck | D) (or simply
model the joint in x and Ck directly).
2 Compute p(Ck | x,D).
3 Use decision theory to assign each new x to one of the
classes.
Each approach has its pros and cons.
What is Probability?
You observe the results of a coin toss x ∈ {H,T}.
Frequentist interpretation
p(x = H) = ρ
ρ = lim_{N→∞} N_heads / N
ρ is an unknown constant; estimating it requires repeated experiments
Bayesian interpretation
p(x = H) = 1 ⇔ I am certain it will be heads
p(x = H) = 1/2 ⇔ I believe the coin is fair
Non-repeatable outcomes
my charging station is at location (x, y, z)
that stormtrooper is hostile
Marc Marquez will win the next MotoGP season
Cox axioms ⇒ beliefs obey the laws of probability
Probability is to an intelligent agent what calculus is to a bridge: the mathematical tool for the design
Graphical models are an intuitive language for probability
Decision Theory – Key Ideas
Two classes C1 and C2
joint distribution p(x, Ck)
using Bayes' theorem
p(Ck | x) = p(x | Ck) p(Ck) / p(x)
Example: cancer treatment (K = 2)
data x : an X-ray image
C1 : patient has cancer (C2 : patient has no cancer)
p(C1) is the prior probability of a person having cancer
p(C1 | x) is the posterior probability of a person having
cancer after having seen the X-ray data
Decision Theory – Key Ideas
Need a rule which assigns each value of the input x to one
of the available classes.
The input space is partitioned into decision regions Rk.
Leads to decision boundaries or decision surfaces
probability of a mistake
p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
           = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
[Figure: the input space partitioned into decision regions R1 and R2.]
Decision Theory – Key Ideas
probability of a mistake
p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
           = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
goal: minimize p(mistake)
[Figure: the joint densities p(x, C1) and p(x, C2) as functions of x, showing a decision boundary x̂ and the optimal boundary x0 where the two curves cross.]
Decision Theory – Key Ideas
multiple classes
it is more convenient to maximise the probability of correct
classification
p(correct) = Σ_{k=1}^{K} p(x ∈ Rk, Ck)
           = Σ_{k=1}^{K} ∫_{Rk} p(x, Ck) dx
Minimising the Expected Loss
Not all mistakes are equally costly.
Weight the cost of assigning an input x whose true class is Ck
to the wrong class Cj by a factor Lkj.
The expected loss is now
E[L] = Σ_k Σ_j ∫_{Rj} Lkj p(x, Ck) dx
Goal: minimize the expected loss E [L]. Since the Rj are
disjoint, the optimal decision rule assigns x to class
j* = arg min_j Σ_k Lkj p(x, Ck)
   = arg min_j Σ_k Lkj p(Ck | x).
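Below is a minimal numpy sketch of this decision rule, with an illustrative posterior and loss matrix of my own choosing; with the 0-1 loss (Lkj = 1 − δkj) it reduces to picking the most probable class, recovering the minimum-misclassification rule above.

```python
import numpy as np

# Hypothetical posteriors p(Ck | x) for a single input x, and a loss matrix L
# where L[k, j] is the cost of deciding class j when the true class is Ck.
posterior = np.array([0.3, 0.6, 0.1])
L = np.array([[0.0, 1.0, 1.0],
              [10.0, 0.0, 1.0],   # missing class 1 is made very costly
              [1.0, 1.0, 0.0]])

expected_loss = posterior @ L        # entry j is sum_k L[k, j] * p(Ck | x)
decision = int(np.argmin(expected_loss))
print(expected_loss, decision)
```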
The Reject Region
Avoid making automated decisions on difficult cases.
Difficult cases:
the largest posterior probability p(Ck | x) is significantly less than 1
joint distributions p(x, Ck) have comparable values
[Figure: posteriors p(C1|x) and p(C2|x) versus x; inputs for which the larger posterior falls below a threshold θ form the reject region.]
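As a small illustration (my own, not from the slides), a reject rule that thresholds the largest posterior at θ:

```python
import numpy as np

def decide_with_reject(posterior, theta=0.8):
    """Return the arg-max class, or None (reject) when the largest
    posterior probability falls below the threshold theta."""
    k = int(np.argmax(posterior))
    return k if posterior[k] >= theta else None

print(decide_with_reject(np.array([0.55, 0.45])))   # None: refer to a human expert
print(decide_with_reject(np.array([0.95, 0.05])))   # 0: confident enough to decide
```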
Recall: Loss Function for Regression
Expected loss for squared loss function
E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.
Minimise E [L] by choosing the regression function
y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t | x) dt = E_t[t | x]
(use calculus of variations to derive this result; alternatively,
work point-wise by fixing an x and using stationarity to solve
for y(x)).
Other losses are possible — the minimiser of the expected
absolute error (the conditional median of t given x) is a
simple exercise:
E[L] = ∫∫ |y(x) − t| p(x, t) dx dt.
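A quick Monte Carlo check of both claims (my own sketch, assuming a skewed Gamma distribution for t given a fixed x): the squared-loss minimiser lands on the conditional mean, the absolute-loss minimiser on the conditional median.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # samples of t | x for one fixed x

candidates = np.linspace(0.0, 10.0, 1001)           # candidate predictions y(x)
sq_risk = [np.mean((c - t) ** 2) for c in candidates]
abs_risk = [np.mean(np.abs(c - t)) for c in candidates]

print("argmin squared loss :", candidates[np.argmin(sq_risk)], " mean  :", t.mean())
print("argmin absolute loss:", candidates[np.argmin(abs_risk)], " median:", np.median(t))
```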
A General Probabilistic Model
Acting optimally in the face of the decision problems above
is relatively straightforward given the appropriate posterior
predictive distribution.
This is a key advantage of probabilistic methods. Let’s
now review the mechanics of probabilistic inference.
Treat both parameters θ and data X as random variables.
The joint factorises as
p(X, θ) = p(X | θ) p(θ)
On observing X, we update our prior on θ to obtain the
posterior
p(θ | X) = p(X | θ) p(θ) / p(X)
p(X) = ∫ p(X, θ) dθ = ∫ p(X | θ) p(θ) dθ
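As a concrete (hypothetical) instance of this update, consider the coin toss from earlier with a Beta prior on θ = p(heads); the Beta is conjugate to the Bernoulli likelihood, so the posterior follows by simply updating two counts.

```python
import numpy as np
from scipy import stats

a0, b0 = 2.0, 2.0                       # Beta prior: pseudo-counts of heads and tails
tosses = np.array([1, 1, 0, 1, 0, 1])   # observed data X (1 = heads)

a_post = a0 + tosses.sum()              # posterior p(theta | X) = Beta(a_post, b_post)
b_post = b0 + len(tosses) - tosses.sum()
posterior = stats.beta(a_post, b_post)
print("posterior mean of theta:", posterior.mean())
```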
Optimisation vs. Integration
Two main approaches for handling the posterior:
1 Employ a heuristic rule to choose a point estimate
yields a single “best” θ*
rules include maximum likelihood and maximum a posteriori
often involves optimisation, which may be difficult in practice
prediction is simple
2 Maintain the full Bayesian posterior
keep the full posterior distribution
generally involves integration or summation, which may be
(very) difficult in practice
prediction may also be difficult in practice, as we integrate or
sum over all possible hypotheses
typically better captures uncertainty, which is useful for
subsequent decision problems
Example Likelihood Functions
Regression, e.g. x ∈ RD, t ∈ R:
p(t | x, θ, σ²) = N(t | y(x, θ), σ²)
σ² = noise variance hyper-parameter
Binary Classification, e.g. x ∈ RD, t ∈ {−1,+1}:
p(t | x, θ) = Bernoulli(t | sigmoid(y(x, θ)))
sigmoid(a) = 1/(1 + exp(−a))
Gaussian density estimation:
p(x | θ) = N(x | µ(θ), Σ(θ))
All may be combined with the i.i.d. assumption
p(X | θ) = ∏_{x∈X} p(x | θ)
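A minimal sketch (my own numbers) of the i.i.d. log-likelihood for the Gaussian density estimation case, log p(X | θ) = Σ_n log N(xn | µ, σ²):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=2.0, size=500)   # synthetic dataset

def log_likelihood(X, mu, sigma):
    # i.i.d. assumption: the log-likelihood is a sum over datapoints.
    return stats.norm.logpdf(X, loc=mu, scale=sigma).sum()

print(log_likelihood(X, mu=3.0, sigma=2.0))    # near the data-generating parameters
print(log_likelihood(X, mu=0.0, sigma=2.0))    # a worse hypothesis scores lower
```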
Maximum Likelihood to MAP
Posterior ∝ Prior × Likelihood
The ML heuristic is simply:
θ_ML ≡ arg max_θ p(X | θ)
The MAP heuristic seems more intuitive:
θ_MAP ≡ arg max_θ p(θ | X) = arg max_θ p(X | θ) p(θ)
But both are merely heuristic point estimates.
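To make the two heuristics concrete, a small coin-toss sketch of my own: a Bernoulli likelihood for ρ = p(heads) with a Beta(a, b) prior, where both estimates are available in closed form.

```python
import numpy as np

tosses = np.array([1, 1, 1, 0])   # a small dataset: 3 heads, 1 tail
a, b = 2.0, 2.0                   # Beta prior pseudo-counts

rho_ml = tosses.mean()                                          # arg max of p(X | rho)
rho_map = (tosses.sum() + a - 1) / (len(tosses) + a + b - 2)    # mode of the Beta posterior
print("ML :", rho_ml)    # 0.75
print("MAP:", rho_map)   # about 0.667, pulled towards 0.5 by the prior
```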
Bayesian Inference and Integration
Even more intuitive is to maintain multiple hypotheses with
various degrees of certainty.
Captured by the Bayesian posterior:
p(θ | X) = p(X | θ) p(θ) / p(X)
p(X) = ∫ p(X, θ) dθ = ∫ p(X | θ) p(θ) dθ
Integration (or summation) arises in the following ways:
1 Calculating the denominator (or normalising constant) p(X )
2 Marginalising parameters in prediction of some quantity of
interest z:
p(z | X) = ∫ p(z, θ | X) dθ          (marginalisation)
         = ∫ p(z | θ) p(θ | X) dθ    (factorisation and conditional independence)
         = E_{θ|X}[p(z | θ)].
In the Bayesian view we treat all parameters, data, and
latent variables in the same way (as random variables).
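Continuing the hypothetical coin example, the integral E_{θ|X}[p(z|θ)] can be approximated with samples from the posterior; here the exact answer is also available, which makes the approximation easy to check.

```python
import numpy as np
from scipy import stats

a_post, b_post = 6.0, 4.0            # a Beta posterior over theta = p(heads)
theta_samples = stats.beta(a_post, b_post).rvs(size=100_000, random_state=0)

# Posterior predictive p(next toss is heads | X) = E_{theta | X}[theta].
p_heads_mc = theta_samples.mean()
p_heads_exact = a_post / (a_post + b_post)
print(p_heads_mc, p_heads_exact)
```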
Bayesian Inference and Integration:
Computational Strategies
Generally, Bayesian integrals are intractable. In this course we
consider the following approaches:
Tractable conjugate priors (recall e.g. Bayesian linear
regression); posterior in the same family as the prior.
Exact simplification of sums — sum-product algorithm.
Laplace approximation (a slight Bayesian twist on a
maximum a posteriori estimator).
Variational inference (typically optimise some distributional
distance such as the Kullback-Leibler divergence; broadly
speaking more global than the Laplace approximation).
Sampling (approximate expectations empirically on a finite
sample — generating the sample is typically itself
intractable and requires approximation...).
Conjugate Priors
A prior and likelihood are conjugate to one another if the
posterior is in the same family as the prior.
For example
p(µ | µ0, σ0) = N(µ | µ0, σ0²)
p(x | µ) = N(x | µ, σx²)
p(µ | X, µ0, σ0) = N(µ | m, s²),   where
s² = 1 / (1/σ0² + N/σx²)
m = s² ( µ0/σ0² + (Σ_{i=1}^{N} xi)/σx² )
Often interpretable as pseudo-observations in the
likelihood function:
p(θ) ∝ p(Xpseudo|θ)
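A numerical sketch of the update above (my own example values), which also shows the pseudo-observation reading of the prior:

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, sigma0, sigma_x = 0.0, 1.0, 2.0
X = rng.normal(loc=1.5, scale=sigma_x, size=50)    # data with true mean 1.5

post_var = 1.0 / (1.0 / sigma0**2 + len(X) / sigma_x**2)
post_mean = post_var * (mu0 / sigma0**2 + X.sum() / sigma_x**2)
print("posterior over mu:", post_mean, "+/-", np.sqrt(post_var))
# The prior contributes like sigma_x**2 / sigma0**2 = 4 pseudo-observations at mu0.
```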
Exact simplification of sums
The distributive law can save operations:
a(b + c)  [2 operations]  =  ab + ac  [3 operations]
If some probabilities do not depend on all random
variables, we might be able to factor them out. For
example, assume
p(x1, x3 | x2) = p(x1 | x2) p(x3 | x2),
then (using Σ_{x3} p(x3 | x2) = 1)
p(x1) = Σ_{x2,x3} p(x1, x2, x3) = Σ_{x2,x3} p(x1, x3 | x2) p(x2)
      = Σ_{x2,x3} p(x1 | x2) p(x3 | x2) p(x2)    [O(|X1||X2||X3|) operations]
      = Σ_{x2} p(x1 | x2) p(x2)                  [O(|X1||X2|) operations]
The sum-product algorithm extends this idea.
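A small numerical check of the saving (my own example): both orderings of the sum give the same marginal p(x1), but the factored form touches only O(|X1||X2|) terms.

```python
import numpy as np

# Factorisation p(x1, x2, x3) = p(x1 | x2) p(x3 | x2) p(x2), with random tables.
rng = np.random.default_rng(5)
K1, K2, K3 = 4, 5, 6
p_x2 = rng.dirichlet(np.ones(K2))                       # p(x2)
p_x1_given_x2 = rng.dirichlet(np.ones(K1), size=K2).T   # shape [x1, x2]
p_x3_given_x2 = rng.dirichlet(np.ones(K3), size=K2).T   # shape [x3, x2]

# Naive: build the full joint, then sum over x2 and x3 (K1*K2*K3 terms).
joint = np.einsum('ij,kj,j->ijk', p_x1_given_x2, p_x3_given_x2, p_x2)
p_x1_naive = joint.sum(axis=(1, 2))

# Factored: the sum over x3 equals 1, leaving only K1*K2 terms.
p_x1_fast = p_x1_given_x2 @ p_x2

print(np.allclose(p_x1_naive, p_x1_fast))   # True
```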
Laplace Approximation
Approximate p(x) = (1/Z) f(x) for x ∈ R^M
Taylor expand ln f(x) around a local maximum x* of f:
ln f(x) ≃ ln f(x*) − (1/2) (x − x*)ᵀ A (x − x*)
(no linear term: the gradient vanishes at x*),
where A is the negative Hessian at the mode,
A ≡ −∇∇ ln f(x) |_{x=x*}.
The Laplace approximation is quadratic in ln q(x) and
therefore Gaussian:
q(x) = |A|^{1/2} / (2π)^{M/2} exp{ −(1/2) (x − x*)ᵀ A (x − x*) } = N(x | x*, A⁻¹)
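A one-dimensional numerical sketch (my own example): Laplace-approximating an unnormalised Gamma density f(x) = x^(a−1) e^(−bx) by a Gaussian centred at its mode, with precision given by the negative second derivative of ln f.

```python
import numpy as np
from scipy import optimize, stats

a, b = 5.0, 1.0

def log_f(x):                      # log of the unnormalised density x**(a-1) * exp(-b*x)
    return (a - 1.0) * np.log(x) - b * x

# Find the mode x* numerically (in closed form it is (a - 1) / b = 4).
res = optimize.minimize_scalar(lambda x: -log_f(x), bounds=(1e-6, 50.0), method='bounded')
x_star = res.x

# A = -(d^2/dx^2) log f at x*, here via a central finite difference.
h = 1e-4
A = -(log_f(x_star + h) - 2.0 * log_f(x_star) + log_f(x_star - h)) / h**2

laplace = stats.norm(loc=x_star, scale=np.sqrt(1.0 / A))
exact = stats.gamma(a, scale=1.0 / b)
print("mode:", x_star, " Laplace std:", laplace.std(), " exact std:", exact.std())
```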
Laplace Approximation
[Figures: a non-Gaussian distribution (yellow) with its Gaussian Laplace approximation (red); and the negative logs of the same two functions.]
Variational Inference
Write the approximate posterior q(θ|X ) ≈ p(θ|X )
Optimise a variational objective for some divergence D:
q(θ | X) = arg min_q D[p(θ | X), q(θ | X)]
Various flavours, e.g.
choose a factorising q and then do variational optimisation
over functions for the factors
choose a simplifying family q(z|X , φ) and choose the
parameters φ which minimise the divergence to the true
posterior
A typical choice of divergence is the Kullback-Leibler or
KL-divergence:
D_KL[p, q] ≡ ∫ p(θ) (log p(θ) − log q(θ)) dθ
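A small sketch of the second flavour (my own example): restrict q to Gaussians and pick the parameters (µ, σ) that minimise the forward KL D_KL[p, q] to a bimodal p, here by a crude grid search over the data-dependent cross-entropy term. For this direction of the KL and a Gaussian q, moment matching gives roughly the same answer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 50_000
comp = rng.random(n) < 0.3                       # samples from a bimodal p(theta)
samples = np.where(comp, rng.normal(-2.0, 0.5, n), rng.normal(1.0, 1.0, n))

def objective(mu, sigma):
    # D_KL[p, q] = E_p[log p] - E_p[log q]; only the second term depends on q.
    return -stats.norm.logpdf(samples, loc=mu, scale=sigma).mean()

mus, sigmas = np.linspace(-3, 3, 61), np.linspace(0.3, 3.0, 28)
scores = np.array([[objective(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmin(scores), scores.shape)
print("grid optimum   :", mus[i], sigmas[j])
print("moment matching:", samples.mean(), samples.std())
```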
Sampling
Monte-Carlo methods use random sampling.
Example: draw samples x(i) ∼ X, i = 1, 2, . . . ,N from some
random variable X and approximate E[ f ] by
E_{x∼X}[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x(i))
For independent x(i), the variance of the estimate is ∝ 1/N.
Sampling x(i) ∼ X may itself be intractable, and
approximate sampling methods are used.
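A quick empirical confirmation of the 1/N behaviour (my own example, with f = cos and x ~ N(0, 1)):

```python
import numpy as np

rng = np.random.default_rng(7)

for N in (100, 1_000, 10_000):
    estimates = [np.cos(rng.normal(size=N)).mean() for _ in range(2_000)]
    print(N, np.var(estimates))   # variance drops by roughly 10x when N grows 10x
```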
Review of this lecture so far
1 Given the relevant posterior predictive distribution, solving
decision problems is relatively straightforward.
2 Given a model, finding the relevant posterior predictive
distribution is straightforward in principle (using the rules
of probability, in particular Bayes’ rule), but usually difficult
in practice and requires approximations.
3 But how do we come up with the model?
Modelling: A Generative View
A powerful paradigm for designing models is to think of a data
generating procedure. The parameters are subsequently
inferred from a dataset by inverting this procedure using the
rules of probability.
For example, in the Gaussian mixture model (sketched in code after this list) we
1 Sample parameters from a prior:
(mixture proportions + mean and covariance parameters for
each mixture component).
2 Sample a cluster index given mixture proportions.
3 Sample a datapoint given the mean and covariance
associated with the cluster index sampled in the previous
step.
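A hedged sketch of that procedure for a 2-D mixture; the particular priors and sizes below are illustrative choices of mine, not from the lecture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
K, D, N = 3, 2, 500

# 1. Sample parameters from a prior.
pi = rng.dirichlet(np.ones(K))                   # mixture proportions
mus = rng.normal(scale=5.0, size=(K, D))         # component means
covs = [stats.wishart(df=D + 2, scale=np.eye(D)).rvs(random_state=rng) for _ in range(K)]

# 2. Sample a cluster index for each datapoint, 3. then the datapoint itself.
z = rng.choice(K, size=N, p=pi)
X = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in z])
print(X.shape, np.bincount(z, minlength=K))
```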
Model Selection
The prior also has parameters, called hyper-parameters,
which may be selected by
Choosing a mathematically convenient form.
Using our domain knowledge and intuition.
Why not add another layer of indirection in our
model?
We can place a hyper-prior on the
hyper-parameters.
This is an infinite regress — Turtles all the way
down!
In practice we introduce model selection tools to select
hyper-parameters, including:
Maximum marginal likelihood
(a.k.a. evidence maximisation or ML-II)
Cross validation
Maximum Marginal Likelihood
(ML-II or Evidence Maximisation)
Apply the maximum likelihood heuristic to choose the
hyper-parameters, marginalising over the parameters.
Given K models M = {M1, M2, . . . , MK},
Mk ∼ p(M)
θk ∼ p(θ | Mk)
X ∼ p(X | θk, Mk),
we pick
M* = arg max_{Mk} p(X | Mk) = arg max_{Mk} ∫ p(X, θk | Mk) dθk
The criterion is the normalisation constant of the Bayesian
posterior — p(X) in the earlier notation.
That is, we treat one “turtle” (the model index) as fixed rather than as a random variable.
ML-II Example: Bayesian Linear Regression
The Gaussian prior and likelihood
p(w | α) = N(w | 0, α⁻¹I)
p(t | w, β) = N(t | Φw, β⁻¹I)
with N training data pairs X = {(xn, tn)}_{n=1}^{N} imply the posterior
p(w | t, α, β) = N(w | mN, SN)
mN = β SN Φᵀ t
SN⁻¹ = αI + βΦᵀΦ.
The log marginal likelihood of (α, β) is then (Bishop 3.86)
log p(t | α, β) = log ∫ p(t, w | α, β) dw = log ∫ p(t | w, β) p(w | α) dw
= (M/2) log α + (N/2) log β − E(mN) − (1/2) log|A| − (N/2) log(2π)
where A = SN⁻¹ = αI + βΦᵀΦ and
E(mN) = (β/2) ‖t − ΦmN‖² + (α/2) mNᵀmN.
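A sketch of ML-II in this setting (my own implementation of the formula above, using synthetic data and a polynomial basis): evaluate log p(t | α, β) on a grid and keep the maximiser.

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=6, increasing=True)          # degree-5 polynomial features
N, M = Phi.shape

def log_marginal(alpha, beta):
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ mN) ** 2) + 0.5 * alpha * mN @ mN
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

grid = [(a, b) for a in np.logspace(-4, 2, 25) for b in np.logspace(-1, 3, 25)]
alpha_star, beta_star = max(grid, key=lambda ab: log_marginal(*ab))
print("ML-II choice:", alpha_star, beta_star)     # beta* should be near 1 / 0.2**2 = 25
```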
Cross validation
Take the model with the best S-fold cross validation score:
For each of S (e.g. S = 4) runs, train on S − 1 of the folds
and test on the held-out fold.
Average the test score over the S runs.
Cross-validation is extremely effective.
Trivial to implement for any machine learning algorithm
Trivial to specialise to any loss function (decision problem)
Applicable to discriminative, generative, and non-probabilistic
algorithms alike; we may even compare between them.
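A from-scratch S-fold cross validation sketch (my own example), comparing ridge-regression models that differ only in the regularisation strength λ:

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(0, 1, 80)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=8, increasing=True)

def cv_score(lam, S=4):
    idx = rng.permutation(len(t))
    folds = np.array_split(idx, S)
    losses = []
    for s in range(S):
        test = folds[s]
        train = np.concatenate([folds[r] for r in range(S) if r != s])
        A = Phi[train].T @ Phi[train] + lam * np.eye(Phi.shape[1])
        w = np.linalg.solve(A, Phi[train].T @ t[train])
        losses.append(np.mean((Phi[test] @ w - t[test]) ** 2))
    return np.mean(losses)

for lam in (1e-6, 1e-3, 1e0):
    print(lam, cv_score(lam))
```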
Nested Cross Validation
Problem — the following will be overly-optimistic:
compute the cross validation score for each of K models
report the result for the best model.
Solution — nested cross-validation:
For each of the S outer folds:
For each of the K models:
Compute an inner S′-fold cross validation score on the
training portion of that outer fold.
Take the best model and
re-train it on the entire training portion of that (outer) fold
compute the loss on the held-out portion of that fold.
Report the average held-out loss over the outer folds (which
may come from different models).
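The same idea as a code sketch (my own example, reusing the ridge setup): the inner loop selects λ, the outer loop reports an honest estimate of the generalisation error.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(0, 1, 120)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=8, increasing=True)
lambdas = (1e-6, 1e-3, 1e0)            # the K candidate "models"

def fit(idx, lam):
    A = Phi[idx].T @ Phi[idx] + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi[idx].T @ t[idx])

def mse(idx, w):
    return np.mean((Phi[idx] @ w - t[idx]) ** 2)

def cv(idx, lam, S):
    folds = np.array_split(idx, S)
    return np.mean([mse(folds[s], fit(np.concatenate(folds[:s] + folds[s+1:]), lam))
                    for s in range(S)])

outer = np.array_split(rng.permutation(len(t)), 4)
outer_losses = []
for s in range(4):
    train = np.concatenate(outer[:s] + outer[s+1:])
    best = min(lambdas, key=lambda lam: cv(train, lam, S=3))   # inner model selection
    outer_losses.append(mse(outer[s], fit(train, best)))       # outer evaluation
print("nested CV estimate:", np.mean(outer_losses))
```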