Statistical Machine Learning
Christian Walder + Lexing Xie
Machine Learning Research Group, CSIRO Data61 ANU Computer Science
Canberra Semester One, 2021.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part I
Linear Regression 2
Linear Regression
Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias-variance decomposition
Training and Testing: (Non-Bayesian) Point Estimate
Training Phase
training data x
training targets t
model with adjustable parameter w
Test Phase
fix the most appropriate w*
test data x
test target t
model with fixed parameter w*
Bayesian Regression
Bayes Theorem
posterior = (likelihood × prior) / normalisation:

    p(w | t) = p(t | w) p(w) / p(t)

where we have left out the conditioning on x (always assumed) and on β, which is assumed constant.

The i.i.d. regression likelihood for additive Gaussian noise is

    p(t | w) = ∏_{n=1}^N N(t_n | y(x_n, w), β^{-1})
             = ∏_{n=1}^N N(t_n | w^⊤ φ(x_n), β^{-1})
             = const × exp{ −(β/2) (t − Φw)^⊤ (t − Φw) }
             = N(t | Φw, β^{-1} I)
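As a sanity check on this factorisation, here is a minimal numpy/scipy sketch; the polynomial basis, sample size and parameter values are illustrative assumptions, not taken from the slides. It confirms numerically that the product of per-point Gaussian likelihoods equals the single multivariate Gaussian N(t | Φw, β^{-1} I).

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Illustrative polynomial basis: phi_j(x) = x^j for j = 0, ..., M-1
def design_matrix(x, M=4):
    return np.vander(x, M, increasing=True)      # N x M design matrix Phi

N, beta = 10, 25.0                               # noise precision beta (variance 1/beta)
x = rng.uniform(-1, 1, size=N)
w = rng.normal(size=4)                           # an arbitrary fixed weight vector
Phi = design_matrix(x)
t = Phi @ w + rng.normal(scale=beta ** -0.5, size=N)

# Product over n of N(t_n | w^T phi(x_n), beta^{-1})
log_lik_pointwise = norm.logpdf(t, loc=Phi @ w, scale=beta ** -0.5).sum()

# Single multivariate Gaussian N(t | Phi w, beta^{-1} I)
log_lik_vector = multivariate_normal.logpdf(t, mean=Phi @ w, cov=np.eye(N) / beta)

print(np.allclose(log_lik_pointwise, log_lik_vector))   # True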
How to choose a prior?
The choice of prior affords intuitive control over our inductive bias.
All inference schemes have such biases, which often arise more opaquely than the prior in Bayes' rule.
Can we find a prior for the given likelihood which
  makes sense for the problem at hand, and
  allows us to find a posterior in a 'nice' form?
An answer to the second question:
Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
Examples of Conjugate Prior Distributions
Table: Discrete likelihood distributions

    Likelihood      Conjugate Prior
    Bernoulli       Beta
    Binomial        Beta
    Poisson         Gamma
    Multinomial     Dirichlet

Table: Continuous likelihood distributions

    Likelihood                              Conjugate Prior
    Uniform                                 Pareto
    Exponential                             Gamma
    Normal (mean parameter)                 Normal
    Multivariate normal (mean parameter)    Multivariate normal
Conjugate Prior to a Gaussian Distribution
Example: if the likelihood function is Gaussian, choosing a Gaussian prior for the mean ensures that the posterior distribution is also Gaussian.

Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x of the form

    p(x) = N(x | μ, Λ^{-1})
    p(y | x) = N(y | Ax + b, L^{-1})

we get

    p(y) = N(y | Aμ + b, L^{-1} + A Λ^{-1} A^⊤)
    p(x | y) = N(x | Σ{A^⊤ L (y − b) + Λμ}, Σ)

where Σ = (Λ + A^⊤ L A)^{-1}.

Note that the covariance Σ does not involve y.
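These identities are easy to check numerically. The sketch below (the dimensions, A, b and the precision matrices are arbitrary assumptions) samples from p(x) and p(y | x) and compares the empirical mean and covariance of y with the closed-form marginal above.

import numpy as np

rng = np.random.default_rng(1)
d_x, d_y = 3, 2

# Arbitrary illustrative parameters
mu = rng.normal(size=d_x)
Lam = 2.0 * np.eye(d_x)                  # precision of p(x)
A = rng.normal(size=(d_y, d_x))
b = rng.normal(size=d_y)
L = 4.0 * np.eye(d_y)                    # precision of p(y | x)

# Sample x ~ N(mu, Lam^{-1}), then y ~ N(Ax + b, L^{-1})
n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(d_y), np.linalg.inv(L), size=n)

# Closed-form marginal: p(y) = N(y | A mu + b, L^{-1} + A Lam^{-1} A^T)
mean_y = A @ mu + b
cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

print(np.allclose(y.mean(axis=0), mean_y, atol=5e-2))    # True (up to Monte Carlo error)
print(np.allclose(np.cov(y.T), cov_y, atol=5e-2))        # True (up to Monte Carlo error)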
Conjugate Prior to a Gaussian Distribution (intuition)
Given

    p(x) = N(x | μ, Λ^{-1})
    p(y | x) = N(y | Ax + b, L^{-1})   ⇔   y = Ax + b + ε,   ε ∼ N(0, L^{-1})

we have E[y] = E[Ax + b] = Aμ + b, and by the easily proven Bienaymé formula for the variance of a sum of uncorrelated variables,

    cov[y] = cov[Ax + b] + cov[ε] = A Λ^{-1} A^⊤ + L^{-1},

using cov[Ax + b] = A cov[x] A^⊤ = A Λ^{-1} A^⊤ and cov[ε] = L^{-1}. So y is Gaussian with

    p(y) = N(y | Aμ + b, L^{-1} + A Λ^{-1} A^⊤).

Then letting Σ = (Λ + A^⊤ L A)^{-1} and

    p(x | y) = N(x | Σ{A^⊤ L (y − b) + Λμ}, Σ)   ⇔   x = Σ{A^⊤ L (y − b) + Λμ} + ε′,   ε′ ∼ N(0, Σ)

yields the correct moments for x, since

    E[x] = E[Σ{A^⊤ L (y − b) + Λμ}] = Σ{A^⊤ L (Aμ + b − b) + Λμ}
         = Σ{A^⊤ L A μ + Λμ} = (Λ + A^⊤ L A)^{-1} (A^⊤ L A + Λ) μ = μ,

and a similar (but tedious; don't do it) calculation recovers cov[x] = Λ^{-1}.
Bayesian Regression
Choose a Gaussian prior with mean m_0 and covariance S_0:

    p(w) = N(w | m_0, S_0)

Same likelihood as before (here written in vector form):

    p(t | w, β) = N(t | Φw, β^{-1} I)

Given N data pairs (x_n, t_n), the posterior is

    p(w | t) = N(w | m_N, S_N)

where

    m_N = S_N (S_0^{-1} m_0 + β Φ^⊤ t)
    S_N^{-1} = S_0^{-1} + β Φ^⊤ Φ

(derive this with the identities on the previous slides)
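A minimal numpy sketch of this update; the linear basis φ(x) = (1, x) and the data-generating parameters below are assumptions chosen to match the toy example used later in these slides, not part of the derivation.

import numpy as np

def posterior(Phi, t, beta, m0, S0):
    """Posterior N(w | m_N, S_N) of the linear-Gaussian model above."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi           # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)     # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
    return mN, SN

# Toy usage with an assumed basis phi(x) = (1, x)
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=20)
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=20)
Phi = np.column_stack([np.ones_like(x), x])
alpha, beta = 2.0, 1 / 0.2 ** 2
mN, SN = posterior(Phi, t, beta, m0=np.zeros(2), S0=np.eye(2) / alpha)
print(mN)   # close to the true weights (-0.3, 0.5) once enough data are seen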
Bayesian Regression: Zero Mean, Isotropic Prior
For simplicity we proceed with m_0 = 0 and S_0 = α^{-1} I, so

    p(w | α) = N(w | 0, α^{-1} I)

The posterior becomes p(w | t) = N(w | m_N, S_N) with

    m_N = β S_N Φ^⊤ t
    S_N^{-1} = α I + β Φ^⊤ Φ

For α ≪ β we get

    m_N → w_ML = (Φ^⊤ Φ)^{-1} Φ^⊤ t

The log posterior is the sum of the log likelihood and the log prior:

    ln p(w | t) = −(β/2) (t − Φw)^⊤ (t − Φw) − (α/2) w^⊤ w + const
Bayesian Regression
The log posterior is the sum of the log likelihood and the log prior:

    ln p(w | t) = −(β/2) ‖t − Φw‖² − (α/2) ‖w‖² + const,

i.e. a sum-of-squares error term plus a quadratic regulariser.

The maximum a posteriori (MAP) estimator

    w_MAP = argmax_w p(w | t)

corresponds to minimising the sum-of-squares error function with quadratic regularisation coefficient λ = α/β.

The posterior is Gaussian, so mode = mean: w_MAP = m_N.

For α ≪ β we recover unregularised least squares (equivalently, MAP approaches maximum likelihood), for example in the case of
  an infinitely broad prior with α → 0
  an infinitely precise likelihood with β → ∞
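A small numerical check of this equivalence, assuming a linear basis φ(x) = (1, x) and arbitrary toy data: the posterior mean m_N coincides with the regularised least-squares (ridge) solution with λ = α/β.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=30)
Phi = np.column_stack([np.ones_like(x), x])      # assumed basis phi(x) = (1, x)
alpha, beta = 2.0, 1 / 0.2 ** 2

# Bayesian posterior mean with the zero-mean isotropic prior
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularised least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2), Phi.T @ t)

print(np.allclose(mN, w_ridge))   # True: w_MAP = m_N is the ridge solution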
Bayesian Inference in General: Sequential Update of Belief
If we have not yet seen any data point (N = 0), the posterior is equal to the prior.
Sequential arrival of data points: the posterior given the data observed so far acts as the prior for future data.
Nicely fits a sequential learning framework.
Sequential Update of the Posterior
Example of a linear basis function model: single input x, single output t.
Linear model: y(x, w) = w_0 + w_1 x.
True data-generating procedure (see the sketch below):
1. Choose x_n from the uniform distribution U(x | −1, +1).
2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = −0.3, a_1 = 0.5.
3. Add Gaussian noise with standard deviation σ = 0.2, so that

    t_n ∼ N(f(x_n, a), 0.04)

Set the precision of the prior to α = 2.0.
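A hedged sketch of this sequential procedure (the number of observations and the random seed are arbitrary): after every data point, the current posterior is used as the prior for the next update, exactly as described on the previous slide.

import numpy as np

rng = np.random.default_rng(4)
a0, a1, sigma, alpha = -0.3, 0.5, 0.2, 2.0       # true parameters and prior precision from this slide
beta = 1 / sigma ** 2

def phi(x):                                       # linear model y(x, w) = w0 + w1 * x
    return np.array([1.0, x])

# Start from the prior N(w | 0, alpha^{-1} I); each observation turns the
# current posterior into the prior for the next update.
m, S = np.zeros(2), np.eye(2) / alpha
for _ in range(20):
    xn = rng.uniform(-1, 1)                       # 1. draw x_n ~ U(-1, +1)
    tn = a0 + a1 * xn + rng.normal(scale=sigma)   # 2.-3. f(x_n, a) plus Gaussian noise
    S_old_inv = np.linalg.inv(S)
    S = np.linalg.inv(S_old_inv + beta * np.outer(phi(xn), phi(xn)))
    m = S @ (S_old_inv @ m + beta * tn * phi(xn))

print(m)   # posterior mean approaches (a0, a1) = (-0.3, 0.5) as data accumulate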
Predictive Distribution
In the training phase, data x and targets t are provided.
In the test phase, a new data value x is given and the corresponding target value t is asked for.
Bayesian approach: find the probability of the test target t given the test data x, the training data x and the training targets t:

    p(t | x, x, t)

This is the predictive distribution (c.f. the posterior distribution, which is over the parameters).
How to calculate the Predictive Distribution?
Introduce the model parameter w via the sum rule:

    p(t | x, x, t) = ∫ p(t, w | x, x, t) dw = ∫ p(t | w, x, x, t) p(w | x, x, t) dw

The test target t depends only on the test data x and the model parameter w, not on the training data and the training targets:

    p(t | w, x, x, t) = p(t | w, x)

The model parameter w is learned from the training data x and the training targets t only:

    p(w | x, x, t) = p(w | x, t)

Predictive Distribution

    p(t | x, x, t) = ∫ p(t | w, x) p(w | x, t) dw
Proof of the Predictive Distribution
The predictive distribution is

    p(t | x, x, t) = ∫ p(t | w, x, x, t) p(w | x, x, t) dw

because

    ∫ p(t | w, x, x, t) p(w | x, x, t) dw = ∫ [ p(t, w, x, x, t) / p(w, x, x, t) ] [ p(w, x, x, t) / p(x, x, t) ] dw
                                          = ∫ p(t, w, x, x, t) / p(x, x, t) dw
                                          = p(t, x, x, t) / p(x, x, t)
                                          = p(t | x, x, t),

or simply

    ∫ p(t | w, x, x, t) p(w | x, x, t) dw = ∫ p(t, w | x, x, t) dw = p(t | x, x, t).
Predictive Distribution with Simplified Prior
Find the predictive distribution

    p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw

(remember: conditioning on x is often suppressed to simplify the notation.)

Now we know (neglecting as usual to notate conditioning on x)

    p(t | w, β) = N(t | w^⊤ φ(x), β^{-1})

and the posterior was

    p(w | t, α, β) = N(w | m_N, S_N)

where

    m_N = β S_N Φ^⊤ t
    S_N^{-1} = α I + β Φ^⊤ Φ
Predictive Distribution with Simplified Prior
If we do the integral (it turns out to be the convolution of the two Gaussians), we get the predictive distribution

    p(t | x, t, α, β) = N(t | m_N^⊤ φ(x), σ_N²(x))

where the variance σ_N²(x) is given by

    σ_N²(x) = 1/β + φ(x)^⊤ S_N φ(x).

This is more easily shown using an approach similar to the earlier "intuition" slide, again with the Bienaymé formula, now using

    t = w^⊤ φ(x) + ε,   ε ∼ N(0, β^{-1}).

However, this is a trick specific to the linear-Gaussian case; in general we need to integrate out the parameters.
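A minimal sketch of the predictive mean and variance under the zero-mean isotropic prior; the cubic polynomial basis, sample size, and the α, β values are illustrative assumptions, not the basis used to produce the figures that follow.

import numpy as np

def predictive(x_star, Phi, t, alpha, beta, phi):
    """Predictive mean m_N^T phi(x*) and variance 1/beta + phi(x*)^T S_N phi(x*)."""
    M = Phi.shape[1]
    SN = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t
    ph = phi(x_star)
    return mN @ ph, 1 / beta + ph @ SN @ ph

# Assumed usage on sinusoidal toy data with an illustrative cubic basis
rng = np.random.default_rng(5)
phi = lambda x: np.array([1.0, x, x ** 2, x ** 3])
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
Phi = np.stack([phi(xi) for xi in x])
mean, var = predictive(0.5, Phi, t, alpha=2.0, beta=25.0, phi=phi)
print(mean, var ** 0.5)    # predictive mean and one standard deviation at x* = 0.5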
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 1.
Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 2.
Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 4.
Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 25.
Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 1.
Predictive Distribution with Simplified Prior
Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 2.
Predictive Distribution with Simplified Prior
Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 4.
Predictive Distribution with Simplified Prior
Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 25.
Limitations of Linear Basis Function Models
Basis functions φ_j(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
  Data lie close to a nonlinear manifold whose intrinsic dimensionality is much smaller than D. We need algorithms which place basis functions only where the data are (e.g. kernel methods / Gaussian processes).
  Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (e.g. linear methods or shallow neural networks).
Curse of Dimensionality
Linear algebra allows us to operate in n-dimensional vector spaces using the intuition gained from our 3-dimensional world as a vector space. There are no surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What fraction of the volume of the sphere in a D-dimensional space lies between radius r = 1 and r = 1 − ε?
Volume scales like r^D, so the volume of a D-dimensional sphere of radius r is V_D(r) = K_D r^D, giving

    (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D
Curse of Dimensionality
Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε:

    (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D

[Figure: volume fraction 1 − (1 − ε)^D versus ε, for D = 1, 2, 5, 20]
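The formula is easy to evaluate directly; a tiny sketch (the choice of ε and of the dimensions D is arbitrary):

# Fraction of a unit sphere's volume within eps of its surface: 1 - (1 - eps)^D
eps = 0.1
for D in (1, 2, 5, 20, 100):
    print(D, 1 - (1 - eps) ** D)
# Even for eps = 0.1, nearly all of the volume lies in the thin outer shell once D is large.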
Curse of Dimensionality
Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.
[Figure: p(r) versus r for D = 1, 2, 20]
Curse of Dimensionality
Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.
Example: D = 2; assume μ = 0, Σ = I:

    N(x | 0, I) = (1/2π) exp(−(1/2) x^⊤ x) = (1/2π) exp(−(1/2)(x_1² + x_2²))

Coordinate transformation:

    x_1 = r cos(φ),   x_2 = r sin(φ)

The probability density in the new coordinates is

    p(r, φ | 0, I) = N(x(r, φ) | 0, I) |J|

where |J| = r is the determinant of the Jacobian of the given coordinate transformation, so

    p(r, φ | 0, I) = (1/2π) r exp(−(1/2) r²)
Curse of Dimensionality
Probability density with respect to radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

    p(r, φ | 0, I) = (1/2π) r exp(−(1/2) r²)

Integrating over all angles φ:

    p(r | 0, I) = ∫_0^{2π} (1/2π) r exp(−(1/2) r²) dφ = r exp(−(1/2) r²)

[Figure: p(r) versus r for D = 1, 2, 20]
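A quick Monte Carlo check of this behaviour (the sample size and choice of dimensions are arbitrary assumptions): for D = 2 the radius has density r exp(−r²/2), and for large D the probability mass concentrates in a thin shell around r ≈ √D.

import numpy as np

# Radii of standard Gaussian samples in D dimensions
rng = np.random.default_rng(6)
for D in (1, 2, 20):
    r = np.linalg.norm(rng.normal(size=(100_000, D)), axis=1)
    print(D, round(float(r.mean()), 3), round(float(r.std()), 3))
# The mean radius grows like sqrt(D) while the spread stays roughly constant, so for
# large D almost no probability mass lies near the origin, where the density itself is largest.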
Summary: Linear Regression
Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias-variance decomposition
Conjugate prior
Bayesian linear regression
Sequential update of the posterior
Predictive distribution
Curse of dimensionality