Statistical Machine Learning
©2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outline
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part II
Introduction
Flavour of this course
Formalise intuitions about problems
Use language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices when designing machine learning methods
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.
Polynomial Curve Fitting
Some artificial data, created from the function sin(2πx) plus random noise, for x ∈ [0, 1].

[Figure: the noisy data points t plotted against x ∈ [0, 1].]
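A minimal sketch of how such a dataset could be generated (the noise scale and random seed are assumptions; the slide only specifies sin(2πx) plus noise):

```python
import numpy as np

rng = np.random.default_rng(0)               # seed chosen arbitrarily
N = 10
x = np.sort(rng.uniform(0.0, 1.0, size=N))   # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)  # assumed noise scale
```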
Polynomial Curve Fitting – Input Specification
N = 10
x ≡ (x_1, . . . , x_N)^T
t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, i = 1, . . . , N
t_i ∈ R, i = 1, . . . , N
Polynomial Curve Fitting – Model Specification
M : order of the polynomial

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{m=0}^{M} w_m x^m

a nonlinear function of x
a linear function of the unknown model parameters w
How can we find good parameters w = (w_0, . . . , w_M)^T?
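As a sketch in numpy (the function name y and the vectorised evaluation are illustrative choices, not from the slides):

```python
import numpy as np

def y(x, w):
    """Evaluate y(x, w) = sum_{m=0}^{M} w[m] * x**m for scalar or array x."""
    powers = np.arange(len(w))                  # exponents 0, 1, ..., M
    return np.asarray(x)[..., None] ** powers @ np.asarray(w)
```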
Learning is Improving Performance
[Figure: a data point (x_n, t_n) and the model prediction y(x_n, w); the error is the vertical displacement between them.]

Performance measure: the error between the targets and the predictions of the model on the training data,

E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2

E(w) has a unique minimum at an argument w⋆ under certain conditions (what are they?)
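The error function translates directly into code; a sketch reusing y (and the numpy import) from the previous snippet:

```python
def E(w, x, t):
    """Sum-of-squares error E(w) = 0.5 * sum_n (y(x_n, w) - t_n)**2."""
    return 0.5 * np.sum((y(x, w) - t) ** 2)
```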
Model Comparison or Model Selection
y(x, w) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=0} = w_0

[Figure: the constant fit M = 0 plotted against the data.]
Model Comparison or Model Selection
y(x, w) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=1} = w_0 + w_1 x

[Figure: the straight-line fit M = 1 plotted against the data.]
Model Comparison or Model Selection
y(x, w) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3

[Figure: the cubic fit M = 3 plotted against the data.]
Model Comparison or Model Selection
y(x, w) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=9} = w_0 + w_1 x + \cdots + w_8 x^8 + w_9 x^9

overfitting

[Figure: the M = 9 fit plotted against the data.]
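To reproduce fits like these for a given order M, one possible least-squares sketch (polyvander builds the design matrix of powers x^0, . . . , x^M; the function name fit is an illustrative choice):

```python
from numpy.polynomial import polynomial as P

def fit(x, t, M):
    """Least-squares estimate w* for a degree-M polynomial fit."""
    Phi = P.polyvander(x, M)                   # design matrix, shape (N, M+1)
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_star
```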
Testing the Model
Train the model and get w⋆
Get 100 new data points
Root-mean-square (RMS) error:

E_{RMS} = \sqrt{2 E(w⋆) / N}

[Figure: training and test E_{RMS} as functions of the polynomial order M ∈ {0, . . . , 9}.]
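A sketch of the RMS error, reusing E from the earlier snippet; dividing by N makes training and test sets of different sizes comparable:

```python
def E_RMS(w, x, t):
    """Root-mean-square error sqrt(2 E(w) / N)."""
    return np.sqrt(2 * E(w, x, t) / len(x))
```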
Testing the Model
        M = 0    M = 1     M = 3          M = 9
w⋆_0     0.19     0.82      0.31           0.35
w⋆_1             -1.27      7.99         232.37
w⋆_2                      -25.43       -5321.83
w⋆_3                       17.37       48568.31
w⋆_4                                 -231639.30
w⋆_5                                  640042.26
w⋆_6                                -1061800.52
w⋆_7                                 1042400.18
w⋆_8                                 -557682.99
w⋆_9                                  125201.43

Table: Coefficients w⋆ for polynomials of various order.
More Data
N = 15

[Figure: the M = 9 fit with N = 15 data points.]
More Data
N = 100
Heuristic: have no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.

[Figure: the M = 9 fit with N = 100 data points.]
Regularisation
How can we constrain the growth of the coefficients w?
Add a regularisation term to the error function:

\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \frac{\lambda}{2} \|w\|^2

Squared norm of the parameter vector w:

\|w\|^2 \equiv w^T w = w_0^2 + w_1^2 + \cdots + w_M^2

\tilde{E}(w) has a unique minimum at an argument w⋆ under certain conditions (what are they for λ = 0? for λ > 0?)
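Minimising Ẽ(w) over w has the standard closed-form solution w⋆ = (Φ^T Φ + λI)^{-1} Φ^T t, where Φ is the design matrix; a sketch reusing polyvander from the earlier snippet (which, like the slide's ‖w‖², also regularises w_0):

```python
def fit_regularised(x, t, M, lam):
    """Minimiser of the regularised sum-of-squares error."""
    Phi = P.polyvander(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)
```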
Regularisation
M = 9

[Figure: the regularised M = 9 fit with ln λ = −18.]
Regularisation
M = 9

[Figure: the regularised M = 9 fit with ln λ = 0.]
Regularisation
M = 9

[Figure: training and test E_{RMS} as functions of ln λ, for ln λ between −35 and −20.]
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.
Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!
Probability Theory
[Figure: samples from a joint distribution p(X, Y) of two random variables, with Y taking the two values Y = 1 and Y = 2.]
Probability Theory
Counts of joint occurrences of X ∈ {a, . . . , i} and Y ∈ {1, 2}:

Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60

[Figure: the same counts shown as histograms of the joint distribution p(X, Y) for Y = 1 and Y = 2.]
Sum Rule
Using the same table of counts as above:

p(X = d, Y = 1) = 8/60

p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60

p(X = d) = \sum_{Y} p(X = d, Y)

p(X) = \sum_{Y} p(X, Y)
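The table and the sum rule can be checked numerically; a sketch in numpy (the array layout is an assumed encoding of the counts above):

```python
import numpy as np

# Joint counts: rows are Y = 1 and Y = 2, columns are X = a, ..., i.
counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
p_XY = counts / counts.sum()      # joint distribution p(X, Y)
p_X = p_XY.sum(axis=0)            # sum rule: marginalise over Y
print(p_X[3])                     # p(X = d) = 9/60 = 0.15
```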
Sum Rule
Using the same table of counts as above:

p(X) = \sum_{Y} p(X, Y)

p(Y) = \sum_{X} p(X, Y)

[Figures: the marginals p(X) and p(Y) shown as histograms.]
Product Rule
Using the same table of counts as above:

Conditional probability: p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

p(Y = 1) = \sum_{X} p(X, Y = 1) = 34/60

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)

p(X, Y) = p(X | Y) p(Y)

Another intuitive view is the renormalisation of relative frequencies:

p(X | Y) = \frac{p(X, Y)}{p(Y)}
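Continuing the numpy sketch from the sum-rule slide: the conditional is the renormalised row, and the product rule recovers the joint:

```python
p_Y = p_XY.sum(axis=1)            # p(Y) = [34/60, 26/60]
p_X_given_Y1 = p_XY[0] / p_Y[0]   # p(X | Y = 1): row renormalised
print(p_X_given_Y1[3])            # p(X = d | Y = 1) = 8/34
# Product rule: p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1)
assert np.isclose(p_X_given_Y1[3] * p_Y[0], p_XY[0, 3])
```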
Sum and Product Rules
Using the same table of counts as above:

p(X) = \sum_{Y} p(X, Y)

p(X | Y) = \frac{p(X, Y)}{p(Y)}

[Figures: the marginal p(X) and the conditional p(X | Y = 1), each shown as a histogram over X.]
Sum Rule and Product Rule
Sum Rule

p(X) = \sum_{Y} p(X, Y)

Product Rule

p(X, Y) = p(X | Y) p(Y)

These rules form the basis of Bayesian machine learning, and of this course!
Bayes Theorem
Use the product rule:

p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)

Bayes' Theorem:

p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)}

only defined for p(X) > 0, and

p(X) = \sum_{Y} p(X, Y)    (sum rule)
     = \sum_{Y} p(X | Y) p(Y)    (product rule)
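On the same table, Bayes' theorem inverts the conditioning; continuing the earlier numpy sketch:

```python
# p(Y = 1 | X = d) = p(X = d | Y = 1) p(Y = 1) / p(X = d) = (8/60)/(9/60) = 8/9
p_Y1_given_Xd = p_X_given_Y1[3] * p_Y[0] / p_X[3]
print(p_Y1_given_Xd)              # 0.888...
```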
Probability Densities
Real-valued variable x ∈ R.
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.

p(x \in (a, b)) = \int_a^b p(x) \, dx

[Figure: a density p(x) with an interval of width δx marked, and the corresponding cumulative distribution P(x).]
Constraints on p(x)
Nonnegativity:

p(x) ≥ 0

Normalisation:

\int_{-\infty}^{\infty} p(x) \, dx = 1

[Figure: a density p(x) and its cumulative distribution P(x).]
Cumulative distribution function P(x)
P(x) = \int_{-\infty}^{x} p(z) \, dz

or, equivalently,

\frac{d}{dx} P(x) = p(x)

[Figure: a density p(x) and its cumulative distribution P(x).]
Multivariate Probability Density
Vector x ≡ (x_1, . . . , x_D)^T
Nonnegativity:

p(x) ≥ 0

Normalisation:

\int_{-\infty}^{\infty} p(x) \, dx = 1

This means

\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x) \, dx_1 \dots dx_D = 1
Sum and Product Rule for Probability Densities
Sum Rule

p(x) = \int_{-\infty}^{\infty} p(x, y) \, dy

Product Rule

p(x, y) = p(y | x) p(x)
Expectations
The weighted average of a function f(x) under the probability distribution p(x):

E[f] = \sum_{x} p(x) f(x)    (discrete distribution p(x))

E[f] = \int p(x) f(x) \, dx    (probability density p(x))
How to approximate E[f]
Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:

E[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)

How do we draw points from a probability distribution p(x)?
A lecture on “Sampling” is coming.
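A sketch of this Monte Carlo approximation for a distribution we can already sample from (a standard normal, where E[x²] = 1 is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100_000)     # x_n drawn from p(x) = N(0, 1)
print(np.mean(xs ** 2))           # approximates E[x^2] = 1
```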
Expectation of a function of several variables
Arbitrary function f(x, y):

E_x[f(x, y)] = \sum_{x} p(x) f(x, y)    (discrete distribution p(x))

E_x[f(x, y)] = \int p(x) f(x, y) \, dx    (probability density p(x))

Note that E_x[f(x, y)] is a function of y.
Conditional Expectation
Arbitrary function f(x):

E_x[f | y] = \sum_{x} p(x | y) f(x)    (discrete distribution p(x))

E_x[f | y] = \int p(x | y) f(x) \, dx    (probability density p(x))

Note that E_x[f | y] is a function of y.
Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
It must mean E_y[E_x[f(x) | y]]. (Why?)

E_y[E_x[f(x) | y]] = \sum_{y} p(y) E_x[f | y] = \sum_{y} p(y) \sum_{x} p(x | y) f(x)
                   = \sum_{x,y} f(x) p(x, y) = \sum_{x} f(x) p(x)
                   = E_x[f(x)]
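The identity E_y[E_x[f(x) | y]] = E_x[f(x)] can be verified numerically; a sketch with a small joint table (the values of f and p(x, y) are made up for illustration):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])                         # f(x) for x in {0, 1, 2}
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.1],
                 [0.1, 0.2]])                         # p(x, y), sums to 1
p_y = p_xy.sum(axis=0)                                # marginal p(y)
p_x_given_y = p_xy / p_y                              # columns renormalised
inner = f @ p_x_given_y                               # E_x[f | y], one value per y
assert np.isclose(inner @ p_y, f @ p_xy.sum(axis=1))  # equals E_x[f(x)]
```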
Variance
Arbitrary function f(x):

var[f] = E\left[ (f(x) − E[f(x)])^2 \right] = E\left[ f(x)^2 \right] − E[f(x)]^2

Special case f(x) = x:

var[x] = E\left[ (x − E[x])^2 \right] = E\left[ x^2 \right] − E[x]^2
Covariance
Two random variables x ∈ R and y ∈ R:

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]

With E[x] = a and E[y] = b:

cov[x, y] = E_{x,y}[(x − a)(y − b)]
          = E_{x,y}[x y] − E_{x,y}[x b] − E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] − b E_{x,y}[x] − a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] − a b − a b + a b = E_{x,y}[x y] − a b
          = E_{x,y}[x y] − E[x] E[y]

(using E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1)

The covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
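A quick numerical check of cov[x, y] = E[xy] − E[x]E[y] on sampled data (the linear dependence y = 2x + noise is an arbitrary example, giving cov[x, y] ≈ 2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)                # y depends linearly on x
cov_manual = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_manual, np.cov(x, y, bias=True)[0, 1])    # both approximately 2
```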
Covariance for Vector Valued Variables
Two random variables x ∈ R^D and y ∈ R^D:

cov[x, y] = E_{x,y}\left[ (x − E[x])(y^T − E[y^T]) \right] = E_{x,y}\left[ x y^T \right] − E[x] E[y^T]
The Gaussian Distribution
x ∈ R
Gaussian distribution with mean µ and variance σ²:

\mathcal{N}(x \,|\, \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}

[Figure: the Gaussian density N(x | µ, σ²), centred at µ with width 2σ.]
The Gaussian Distribution
\mathcal{N}(x \,|\, \mu, \sigma^2) > 0

\int_{-\infty}^{\infty} \mathcal{N}(x \,|\, \mu, \sigma^2) \, dx = 1

Expectation over x:

E[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \,|\, \mu, \sigma^2) \, x \, dx = \mu

Expectation over x²:

E[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \,|\, \mu, \sigma^2) \, x^2 \, dx = \mu^2 + \sigma^2

Variance of x:

var[x] = E[x^2] − E[x]^2 = \sigma^2
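These moments can be checked by numerical integration; a sketch using scipy (an extra dependency; the values of µ and σ are arbitrary):

```python
import numpy as np
from scipy import integrate

mu, sigma = 1.5, 0.8   # arbitrary example values

def gauss(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

for f, expected in [(lambda x: 1.0, 1.0),                       # normalisation
                    (lambda x: x, mu),                          # E[x]
                    (lambda x: x ** 2, mu ** 2 + sigma ** 2)]:  # E[x^2]
    value, _ = integrate.quad(lambda x: gauss(x) * f(x), -np.inf, np.inf)
    print(value, expected)                                      # agree numerically
```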
Strategy in this course
Estimate the best predictor (= training = learning):
Given data (x_1, y_1), . . . , (x_n, y_n), find a predictor f_w(·).
1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter w
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm, in Python)
7. Interpret and diagnose results