Statistical Machine Learning
Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Part II
Introduction
Polynomial Curve Fitting
Probability Theory
Probability Densities
Expectations and Covariances
Flavour of this course
Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, and linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices when designing machine learning methods
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Polynomial Curve Fitting
Some artificial data, created from the function sin(2πx) plus random noise, for x ∈ [0, 1].

[Figure: N = 10 noisy observations (x, t) of sin(2πx) on [0, 1]]
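A minimal Python sketch for generating such data follows; the noise level and random seed are illustrative assumptions, not values given in the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, noise_std=0.3):
        """Sample n points of t = sin(2*pi*x) + Gaussian noise, x in [0, 1]."""
        x = rng.uniform(0.0, 1.0, size=n)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
        return x, t

    x_train, t_train = make_data(10)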
Polynomial Curve Fitting – Input Specification
N = 10
x ≡ (x1, …, xN)ᵀ
t ≡ (t1, …, tN)ᵀ
xi ∈ R,  i = 1, …, N
ti ∈ R,  i = 1, …, N
Polynomial Curve Fitting – Model Specification
M : order of the polynomial

y(x, w) = w0 + w1 x + w2 x² + · · · + wM x^M = Σ_{m=0}^{M} wm x^m

a nonlinear function of x
a linear function of the unknown model parameters w
How can we find good parameters w = (w0, w1, …, wM)ᵀ?
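As a concrete sketch, the model can be written with a Vandermonde design matrix in Python; the helper names here are illustrative, not from the lecture code.

    def design_matrix(x, M):
        """Design matrix Phi with Phi[n, m] = x_n**m, for m = 0..M."""
        return np.vander(x, M + 1, increasing=True)

    def predict(x, w):
        """Evaluate y(x, w) = sum_m w_m x^m at all inputs x."""
        return design_matrix(x, len(w) - 1) @ w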
Learning is Improving Performance
[Figure: for one training point xn, the error is the vertical displacement between the prediction y(xn, w) and the target tn]
Performance measure: the error between the targets and the model's predictions on the training data

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)²

E(w) has a unique minimum at the argument w⋆ under certain conditions (what are they?)
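Since y(x, w) is linear in w, minimising E(w) is a linear least-squares problem; a sketch, assuming the design_matrix helper above:

    def fit(x, t, M):
        """Minimise E(w) = 0.5 * sum_n (y(x_n, w) - t_n)**2 in closed form."""
        Phi = design_matrix(x, M)
        w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w_star

    w3 = fit(x_train, t_train, M=3)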
Model Comparison or Model Selection
M = 0:

y(x, w) = Σ_{m=0}^{M} wm x^m = w0

[Figure: the M = 0 fit to the training data]
M = 1:

y(x, w) = Σ_{m=0}^{M} wm x^m = w0 + w1 x

[Figure: the M = 1 fit to the training data]
M = 3:

y(x, w) = Σ_{m=0}^{M} wm x^m = w0 + w1 x + w2 x² + w3 x³

[Figure: the M = 3 fit to the training data]
M = 9:

y(x, w) = Σ_{m=0}^{M} wm x^m = w0 + w1 x + · · · + w8 x⁸ + w9 x⁹

With M = 9 and N = 10 the polynomial can pass through every training point, yet it oscillates wildly between them: overfitting.

[Figure: the M = 9 fit to the training data]
Testing the Model
Train the model and get w⋆
Get 100 new data points
Root-mean-square (RMS) error:

E_RMS = √(2 E(w⋆) / N)

[Figure: training and test E_RMS versus model order M = 0, …, 9]
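A sketch of this experiment, reusing make_data, fit, and predict from above; the test-set size of 100 follows the slide, the rest is illustrative:

    def rms_error(x, t, w):
        """E_RMS = sqrt(2 E(w) / N), i.e. the root mean squared residual."""
        return np.sqrt(np.mean((predict(x, w) - t) ** 2))

    x_test, t_test = make_data(100)
    for M in range(10):
        w_star = fit(x_train, t_train, M)
        print(M, rms_error(x_train, t_train, w_star),
                 rms_error(x_test, t_test, w_star))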
          M = 0    M = 1     M = 3          M = 9
    w⋆0    0.19     0.82      0.31           0.35
    w⋆1            -1.27      7.99         232.37
    w⋆2                      -25.43       -5321.83
    w⋆3                       17.37       48568.31
    w⋆4                                 -231639.30
    w⋆5                                  640042.26
    w⋆6                                -1061800.52
    w⋆7                                 1042400.18
    w⋆8                                 -557682.99
    w⋆9                                  125201.43

Table: Coefficients w⋆ for polynomials of various order.
More Data
N = 15

[Figure: the M = 9 fit with N = 15 data points]
N = 100
Heuristic: use no fewer than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.

[Figure: the M = 9 fit with N = 100 data points]
Regularisation
How can we constrain the growth of the coefficients w? Add a regularisation term to the error function:

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)² + (λ/2) ∥w∥²

where the squared norm of the parameter vector w is

∥w∥² ≡ wᵀw = w0² + w1² + · · · + wM²

E(w) has a unique minimum at the argument w⋆ under certain conditions (what are they for λ = 0? for λ > 0?)
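The regularised objective still has a closed-form minimiser, w⋆ = (λI + ΦᵀΦ)⁻¹ Φᵀ t; a sketch reusing design_matrix from above (whether to also penalise w0 is a modelling choice glossed over here):

    def fit_regularised(x, t, M, lam):
        """Minimise 0.5*sum_n (y(x_n, w) - t_n)**2 + 0.5*lam*||w||**2."""
        Phi = design_matrix(x, M)
        A = lam * np.eye(M + 1) + Phi.T @ Phi
        return np.linalg.solve(A, Phi.T @ t)

    w_reg = fit_regularised(x_train, t_train, M=9, lam=np.exp(-18))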
Regularisation
M = 9, ln λ = −18

[Figure: the regularised M = 9 fit with ln λ = −18]
M = 9, ln λ = 0

[Figure: the regularised M = 9 fit with ln λ = 0]
M = 9

[Figure: training and test E_RMS versus ln λ, for ln λ between −35 and −20]
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!
Probability Theory
[Figure: histogram of the joint distribution p(X, Y), with X taking nine values and Y taking the values 1 and 2]
    Y vs. X    a   b   c   d   e   f   g   h   i   sum
    Y = 2      0   0   0   1   4   5   8   6   2    26
    Y = 1      3   6   8   8   5   3   1   0   0    34
    sum        3   6   8   9   9   8   9   6   2    60

[Figure: the same counts shown as a histogram of p(X, Y)]
Sum Rule
From the table above (counts out of 60):

p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60

In general:

p(X = d) = Σ_Y p(X = d, Y)
p(X) = Σ_Y p(X, Y)
Sum Rule
From the table above, the marginal distributions are

p(X) = Σ_Y p(X, Y)        p(Y) = Σ_X p(X, Y)

[Figure: histograms of the marginals p(X) and p(Y)]
Product Rule
Conditional probability:

p(X = d | Y = 1) = 8/34

Calculate p(Y = 1):

p(Y = 1) = Σ_X p(X, Y = 1) = 34/60

Then

p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1) = (8/34)(34/60) = 8/60

and in general

p(X, Y) = p(X | Y) p(Y)

Another intuitive view is renormalisation of relative frequencies:

p(X | Y) = p(X, Y) / p(Y)
Sum and Product Rules
p(X) = Σ_Y p(X, Y)
p(X | Y) = p(X, Y) / p(Y)

[Figure: histograms of the marginal p(X) and the conditional p(X | Y = 1)]
Sum Rule and Product Rule
Sum rule:

p(X) = Σ_Y p(X, Y)

Product rule:

p(X, Y) = p(X | Y) p(Y)
These rules form the basis of Bayesian machine learning, and this course!
Bayes Theorem
Use the product rule:

p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)

Bayes theorem:

p(Y | X) = p(X | Y) p(Y) / p(X)

only defined for p(X) > 0, and

p(X) = Σ_Y p(X, Y)            (sum rule)
     = Σ_Y p(X | Y) p(Y)      (product rule)
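These rules are easy to verify numerically on the count table above; a small sketch (the counts come from the slides, the code itself is illustrative):

    import numpy as np

    # Joint counts from the table: rows Y = 1, Y = 2; columns X = a..i.
    counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                       [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
    p_xy = counts / counts.sum()        # joint p(X, Y)

    p_x = p_xy.sum(axis=0)              # sum rule: p(X) = sum_Y p(X, Y)
    p_y = p_xy.sum(axis=1)              # p(Y) = sum_X p(X, Y)
    p_x_given_y1 = p_xy[0] / p_y[0]     # conditional p(X | Y = 1)

    # Product rule recovers the joint; Bayes theorem inverts the conditioning.
    assert np.allclose(p_x_given_y1 * p_y[0], p_xy[0])
    p_y1_given_x = p_x_given_y1 * p_y[0] / p_x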
Probability Densities
Real-valued variable x ∈ R
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx:

p(x ∈ (a, b)) = ∫_a^b p(x) dx

[Figure: a density p(x) and its cumulative distribution function P(x), with an interval of width δx marked]
Constraints on p(x)
Nonnegativity:

p(x) ≥ 0

Normalisation:

∫_{−∞}^{∞} p(x) dx = 1
Cumulative distribution function P(x)
P(x) = ∫_{−∞}^{x} p(z) dz

or

(d/dx) P(x) = p(x)
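As a quick numerical illustration of this relationship, a sketch assuming a standard Gaussian density (the grid and tolerance are arbitrary choices):

    import numpy as np

    x = np.linspace(-5, 5, 2001)
    p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # standard Gaussian density

    # P(x): integrate p up to each x with the trapezoidal rule.
    P = np.concatenate([[0.0], np.cumsum((p[1:] + p[:-1]) / 2 * np.diff(x))])

    # Differentiating P numerically recovers p, up to discretisation error.
    assert np.allclose(np.gradient(P, x), p, atol=1e-4)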
Multivariate Probability Density
Vector x ≡ (x1, …, xD)ᵀ

Nonnegativity:

p(x) ≥ 0

Normalisation:

∫ p(x) dx = 1

This means

∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} p(x) dx1 … dxD = 1
Sum and Product Rule for Probability Densities
Sum rule:

p(x) = ∫_{−∞}^{∞} p(x, y) dy

Product rule:

p(x, y) = p(y | x) p(x)
Expectations
Weighted average of a function f(x) under the probability distribution p(x):

E[f] = Σ_x p(x) f(x)        (discrete distribution p(x))
E[f] = ∫ p(x) f(x) dx       (probability density p(x))
How to approximate E [f ]
Given a finite number N of points xn drawn from the probability distribution p(x), approximate the expectation by a finite sum:

E[f] ≃ (1/N) Σ_{n=1}^{N} f(xn)

How to draw points from a probability distribution p(x)? See the upcoming lecture on “Sampling”.
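A minimal sketch of this Monte Carlo approximation, assuming p(x) is a standard Gaussian and f(x) = x², so that the true value E[f] = 1 is known:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)    # N points drawn from p(x) = N(0, 1)
    estimate = np.mean(x ** 2)      # (1/N) sum_n f(x_n) with f(x) = x**2
    print(estimate)                 # close to E[x**2] = 1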
Expectation of a function of several variables
Arbitrary function f(x, y):

Ex[f(x, y)] = Σ_x p(x) f(x, y)        (discrete distribution p(x))
Ex[f(x, y)] = ∫ p(x) f(x, y) dx       (probability density p(x))

Note that Ex[f(x, y)] is a function of y.
Conditional Expectation
Arbitrary function f(x):

Ex[f | y] = Σ_x p(x | y) f(x)         (discrete distribution p(x))
Ex[f | y] = ∫ p(x | y) f(x) dx        (probability density p(x))

Note that Ex[f | y] is a function of y.
Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it? This must mean Ey[Ex[f(x) | y]]. (Why?)

Ey[Ex[f(x) | y]] = Σ_y p(y) Ex[f | y] = Σ_y p(y) Σ_x p(x | y) f(x)
                = Σ_{x,y} f(x) p(x, y) = Σ_x f(x) p(x)
                = Ex[f(x)]
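This tower rule can be sanity-checked on the discrete joint table from earlier; a sketch reusing the p_xy, p_x, p_y arrays defined above, with an arbitrary f:

    f = np.arange(9.0)                   # an arbitrary f(x) over X = a..i
    E_f = p_x @ f                        # E_x[f(x)]

    # Conditional expectations E_x[f | y], one entry per value of Y.
    p_x_given_y = p_xy / p_y[:, None]
    E_f_given_y = p_x_given_y @ f

    # Averaging them under p(y) recovers the unconditional expectation.
    assert np.isclose(p_y @ E_f_given_y, E_f)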
Variance
Arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²

Special case f(x) = x:

var[x] = E[(x − E[x])²] = E[x²] − E[x]²
Covariance
Two random variables x ∈ R and y ∈ R:

cov[x, y] = Ex,y[(x − E[x])(y − E[y])] = Ex,y[x y] − E[x] E[y]

With E[x] = a and E[y] = b:

cov[x, y] = Ex,y[(x − a)(y − b)]
          = Ex,y[x y] − Ex,y[x b] − Ex,y[a y] + Ex,y[a b]
          = Ex,y[x y] − b Ex,y[x] − a Ex,y[y] + a b Ex,y[1]
          = Ex,y[x y] − b a − a b + a b        (using Ex,y[x] = Ex[x] = a, Ex,y[y] = Ey[y] = b, Ex,y[1] = 1)
          = Ex,y[x y] − a b
          = Ex,y[x y] − E[x] E[y]

The covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
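A small numerical sketch of the last claim, assuming two independent streams of Gaussian samples (the sample covariance is only approximately zero, shrinking as the number of samples grows):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)
    y = rng.normal(size=100_000)                  # drawn independently of x

    # cov[x, y] = E[xy] - E[x]E[y], estimated from samples.
    print(np.mean(x * y) - x.mean() * y.mean())   # near 0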
Covariance for Vector Valued Variables
Two random variables x ∈ R^D and y ∈ R^D:

cov[x, y] = Ex,y[(x − E[x])(yᵀ − E[yᵀ])] = Ex,y[x yᵀ] − E[x] E[yᵀ]
The Gaussian Distribution
x ∈ R
Gaussian distribution with mean μ and variance σ²:

N(x | μ, σ²) = (2πσ²)^{−1/2} exp{ −(1/(2σ²)) (x − μ)² }

[Figure: the density N(x | μ, σ²), centred at μ, with width indicated by 2σ]
N(x | μ, σ²) > 0

∫_{−∞}^{∞} N(x | μ, σ²) dx = 1

Expectation over x:

E[x] = ∫_{−∞}^{∞} N(x | μ, σ²) x dx = μ

Expectation over x²:

E[x²] = ∫_{−∞}^{∞} N(x | μ, σ²) x² dx = μ² + σ²

Variance of x:

var[x] = E[x²] − E[x]² = σ²
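These moments can be checked by sampling; a sketch with assumed values μ = 2 and σ = 0.5:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 0.5
    x = rng.normal(mu, sigma, size=1_000_000)

    print(x.mean())            # approximately mu
    print((x ** 2).mean())     # approximately mu**2 + sigma**2
    print(x.var())             # approximately sigma**2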
Strategy in this course
Estimate the best predictor = training = learning
Given data (x1, y1), …, (xn, yn), find a predictor fw(·).
1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for fw
3. Design an objective function or likelihood
4. Calculate the optimal parameter w
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm, in Python; see the sketch below)
7. Interpret and diagnose results
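To close the loop, a minimal end-to-end sketch of steps 1-4, 6, and 7 on the polynomial curve-fitting example; step 5, the Bayesian treatment, comes later in the course, and all constants here are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)

    # 1. Data: real-valued inputs x and targets t (a regression task).
    x = rng.uniform(0, 1, size=10)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=10)

    # 2. Model: y(x, w) = sum_m w_m x^m, linear in the parameters w.
    Phi = np.vander(x, 10, increasing=True)          # order M = 9

    # 3. Objective: regularised sum-of-squares error, with lambda fixed.
    lam = np.exp(-18)

    # 4. Optimal parameters in closed form.
    w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)

    # 7. Diagnose: report the training RMS error.
    print(np.sqrt(np.mean((Phi @ w - t) ** 2)))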