Statistical Machine Learning
Christian Walder + Lexing Xie
Machine Learning Research Group, CSIRO Data61
ANU Computer Science
Canberra
Semester One, 2021.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part I
Linear Regression 2
Linear Regression
Basis functions
Maximum Likelihood with Gaussian Noise
Regularisation
Bias variance decomposition
Training and Testing:
(Non-Bayesian) Point Estimate
[Diagram] Training phase: a model with adjustable parameter w is fit to the training data x and training targets t, fixing the most appropriate w*. Test phase: the model with the fixed parameter w* maps test data x to a predicted test target t.
Bayesian Regression
Bayes Theorem
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalisation}}, \qquad p(w \mid t) = \frac{p(t \mid w)\, p(w)}{p(t)}$$
where we left out the conditioning on x (always assumed),
and β, which is assumed to be constant.
I.i.d. regression likelihood for additive Gaussian noise is
$$p(t \mid w) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^\top \phi(x_n), \beta^{-1}) = \text{const} \times \exp\Big\{ -\frac{\beta}{2} (t - \Phi w)^\top (t - \Phi w) \Big\} = \mathcal{N}(t \mid \Phi w, \beta^{-1} I)$$
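As a quick companion to the formula above (a minimal sketch, not from the slides), the log of this likelihood can be evaluated directly:

```python
# Sketch: log of the regression likelihood ln N(t | Phi w, beta^{-1} I)
# for a linear basis function model; Phi is the N x M design matrix.
import numpy as np

def log_likelihood(w, Phi, t, beta):
    N = len(t)
    resid = t - Phi @ w
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid @ resid
```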
How to choose a prior?
The choice of prior affords intuitive control over our inductive bias.
All inference schemes have such biases, and they often arise more opaquely than through the prior in Bayes' rule.
Can we find a prior for the given likelihood which
makes sense for the problem at hand, and
allows us to find a posterior in a 'nice' form?
An answer to the second question:
Definition (Conjugate Prior)
A class of prior probability distributions p(w) is conjugate to a
class of likelihood functions p(x |w) if the resulting posterior
distributions p(w | x) are in the same family as p(w).
Examples of Conjugate Prior Distributions
Table: Discrete likelihood distributions

    Likelihood    | Conjugate Prior
    --------------|----------------
    Bernoulli     | Beta
    Binomial      | Beta
    Poisson       | Gamma
    Multinomial   | Dirichlet

Table: Continuous likelihood distributions

    Likelihood          | Conjugate Prior
    --------------------|-------------------------------------
    Uniform             | Pareto
    Exponential         | Gamma
    Normal              | Normal (mean parameter)
    Multivariate normal | Multivariate normal (mean parameter)
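As a concrete illustration of the first row of the table (my own example, not part of the slides): with a Beta(a, b) prior on a Bernoulli parameter, the posterior is again a Beta distribution, with the observed counts simply added to the prior parameters.

```python
# Sketch of Beta-Bernoulli conjugacy: the posterior after observing binary
# outcomes is again a Beta distribution, with counts added to the prior parameters.
import numpy as np

def beta_bernoulli_update(a, b, outcomes):
    outcomes = np.asarray(outcomes)
    ones = int(outcomes.sum())
    zeros = len(outcomes) - ones
    return a + ones, b + zeros   # parameters of the Beta posterior

print(beta_bernoulli_update(2.0, 2.0, [1, 1, 0, 1]))   # (5.0, 3.0)
```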
Conjugate Prior to a Gaussian Distribution
Example: If the likelihood function is Gaussian, choosing a Gaussian prior for the mean will ensure that the posterior distribution is also Gaussian.
Given a marginal distribution for x and a conditional Gaussian distribution for y given x in the form
$$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$$
$$p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1})$$
we get
$$p(y) = \mathcal{N}(y \mid A\mu + b, L^{-1} + A \Lambda^{-1} A^\top)$$
$$p(x \mid y) = \mathcal{N}(x \mid \Sigma\{A^\top L (y - b) + \Lambda \mu\}, \Sigma)$$
where $\Sigma = (\Lambda + A^\top L A)^{-1}$.
Note that the covariance Σ does not involve y.
Conjugate Prior to a Gaussian Distribution
(intuition)
Given
$$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1}) \;\Leftrightarrow\; y = A x + b + \mathcal{N}(0, L^{-1}),$$
we have $\mathbb{E}[y] = \mathbb{E}[A x + b] = A\mu + b$, and by the easily proven Bienaymé formula for the variance of a sum of uncorrelated variables,
$$\operatorname{cov}[y] = \underbrace{\operatorname{cov}[A x + b]}_{= A \operatorname{cov}[x] A^\top = A \Lambda^{-1} A^\top} + \underbrace{\operatorname{cov}[\mathcal{N}(0, L^{-1})]}_{= L^{-1}}.$$
So y is Gaussian with
$$p(y) = \mathcal{N}(y \mid A\mu + b, L^{-1} + A \Lambda^{-1} A^\top).$$
Then letting $\Sigma = (\Lambda + A^\top L A)^{-1}$ and
$$p(x \mid y) = \mathcal{N}(x \mid \Sigma\{A^\top L (y - b) + \Lambda\mu\}, \Sigma) \;\Leftrightarrow\; x = \Sigma\{A^\top L (y - b) + \Lambda\mu\} + \mathcal{N}(0, \Sigma)$$
yields the correct moments for x, since
$$\mathbb{E}[x] = \mathbb{E}[\Sigma\{A^\top L (y - b) + \Lambda\mu\}] = \Sigma\{A^\top L (A\mu + b - b) + \Lambda\mu\} = \Sigma\{A^\top L A\, \mu + \Lambda\mu\} = (\Lambda + A^\top L A)^{-1}\{A^\top L A + \Lambda\}\mu = \mu,$$
and it is similar (but tedious; don't do it) to recover $\operatorname{cov}[x] = \Lambda^{-1}$.
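These identities are easy to check numerically; the following sketch (my own, with arbitrary example matrices) compares the empirical mean and covariance of y against $A\mu + b$ and $L^{-1} + A\Lambda^{-1}A^\top$:

```python
# Sketch: Monte-Carlo check of p(y) = N(y | A mu + b, L^{-1} + A Lambda^{-1} A^T).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])            # precision of x
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([0.5, -1.0])
L = np.diag([4.0, 4.0])                             # precision of the noise on y

n = 500_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)

print(np.abs(y.mean(axis=0) - (A @ mu + b)).max())                                     # ~0
print(np.abs(np.cov(y.T) - (np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)).max())   # ~0
```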
Bayesian Regression
Choose a Gaussian prior with mean m0 and covariance S0
$$p(w) = \mathcal{N}(w \mid m_0, S_0)$$
Same likelihood as before (here written in vector form):
$$p(t \mid w, \beta) = \mathcal{N}(t \mid \Phi w, \beta^{-1} I)$$
Given N data pairs $(x_n, t_n)$, the posterior is
$$p(w \mid t) = \mathcal{N}(w \mid m_N, S_N)$$
where
$$m_N = S_N (S_0^{-1} m_0 + \beta \Phi^\top t), \qquad S_N^{-1} = S_0^{-1} + \beta \Phi^\top \Phi$$
(derive this with the identities on the previous slides)
Bayesian Regression: Zero Mean, Isotropic Prior
For simplicity we proceed with $m_0 = 0$ and $S_0 = \alpha^{-1} I$, so
$$p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$$
The posterior becomes $p(w \mid t) = \mathcal{N}(w \mid m_N, S_N)$ with
$$m_N = \beta S_N \Phi^\top t, \qquad S_N^{-1} = \alpha I + \beta \Phi^\top \Phi$$
For $\alpha \ll \beta$ we get
$$m_N \to w_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top t$$
Log of posterior is sum of log likelihood and log of prior
$$\ln p(w \mid t) = -\frac{\beta}{2} (t - \Phi w)^\top (t - \Phi w) - \frac{\alpha}{2} w^\top w + \text{const}$$
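A minimal NumPy sketch of the posterior mean and covariance above (my own code; the polynomial features are an arbitrary choice for illustration):

```python
# Sketch: posterior over w for the zero-mean isotropic prior,
#   S_N^{-1} = alpha I + beta Phi^T Phi,   m_N = beta S_N Phi^T t.
import numpy as np

def design_matrix(x, degree=3):
    # polynomial basis functions 1, x, x^2, ... (an arbitrary choice of phi)
    return np.vander(x, degree + 1, increasing=True)

def posterior(x, t, alpha, beta, degree=3):
    Phi = design_matrix(x, degree)
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```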
Bayesian Regression
Log of posterior is sum of log likelihood and log of prior
$$\ln p(w \mid t) = -\frac{\beta}{2} \underbrace{\| t - \Phi w \|^2}_{\text{sum-of-squares error}} - \frac{\alpha}{2} \underbrace{\| w \|^2}_{\text{regulariser}} + \text{const.}$$
The maximum a posteriori estimator
$$w_{\mathrm{m.a.p.}} = \arg\max_w\, p(w \mid t)$$
corresponds to minimising the sum-of-squares error function with quadratic regularisation coefficient $\lambda = \alpha / \beta$.
The posterior is Gaussian, so mode = mean: $w_{\mathrm{m.a.p.}} = m_N$.
For $\alpha \ll \beta$ we recover unregularised least squares (equivalently, m.a.p. approaches maximum likelihood), for example in the case of
an infinitely broad prior with $\alpha \to 0$
an infinitely precise likelihood with $\beta \to \infty$
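A small check of this correspondence (my own sketch with random data): the posterior mean $m_N$ coincides with the regularised least-squares solution for $\lambda = \alpha/\beta$.

```python
# Sketch: m_N equals the regularised least-squares solution with lambda = alpha / beta.
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))
t = rng.normal(size=50)
alpha, beta = 2.0, 25.0

S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)
print(np.allclose(m_N, w_ridge))   # True
```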
Bayesian Inference in General:
Sequential Update of Belief
If we have not yet seen any data point (N = 0), the
posterior is equal to the prior.
Sequential arrival of data points: the posterior given the data observed so far acts as the prior for future data.
This nicely fits a sequential learning framework.
Sequential Update of the Posterior
Example of a linear basis function model
Single input x, single output t
Linear model y(x,w) = w0 + w1x.
True data distribution sampling procedure:
1. Choose $x_n$ from the uniform distribution $U(x \mid -1, +1)$.
2. Calculate $f(x_n, a) = a_0 + a_1 x_n$, where $a_0 = -0.3$, $a_1 = 0.5$.
3. Add Gaussian noise with standard deviation $\sigma = 0.2$, i.e. $t_n \sim \mathcal{N}(t_n \mid f(x_n, a), 0.04)$.
Set the precision of the (zero-mean isotropic Gaussian) prior over w to $\alpha = 2.0$ (a sketch of this procedure follows below).
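A sketch of this data-generating procedure together with the sequential posterior update (my own code; it assumes the noise precision $\beta = 1/\sigma^2 = 25$ implied by $\sigma = 0.2$):

```python
# Sketch: generate (x_n, t_n) as described above and update the posterior over
# (w0, w1) one observation at a time, using the previous posterior as the prior.
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5
alpha, beta = 2.0, 1.0 / 0.2**2        # prior precision, noise precision

m = np.zeros(2)                        # prior mean
S = np.eye(2) / alpha                  # prior covariance alpha^{-1} I
for _ in range(20):
    x = rng.uniform(-1.0, 1.0)
    t = a0 + a1 * x + rng.normal(0.0, 0.2)
    phi = np.array([1.0, x])           # features of y(x, w) = w0 + w1 x
    S_inv_old = np.linalg.inv(S)
    S = np.linalg.inv(S_inv_old + beta * np.outer(phi, phi))
    m = S @ (S_inv_old @ m + beta * t * phi)

print(m)                               # approaches (a0, a1) = (-0.3, 0.5)
```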
Sequential Update of the Posterior
[Two figure-only slides: sequential update of the posterior as data points arrive; figures not reproduced in this extraction.]
Predictive Distribution
In the training phase, training data $\mathbf{x}$ and targets $\mathbf{t}$ are provided.
In the test phase, a new data value $x$ is given and the corresponding target value $t$ is asked for.
Bayesian approach: find the probability of the test target $t$ given the test data $x$, the training data $\mathbf{x}$ and the training targets $\mathbf{t}$,
$$p(t \mid x, \mathbf{x}, \mathbf{t})$$
This is the Predictive Distribution (c.f. the posterior distribution, which is over the parameters).
How to calculate the Predictive Distribution?
Introduce the model parameter $w$ via the sum rule
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t, w \mid x, \mathbf{x}, \mathbf{t})\, dw = \int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw$$
The test target $t$ depends only on the test data $x$ and the model parameter $w$, but not on the training data and the training targets:
$$p(t \mid w, x, \mathbf{x}, \mathbf{t}) = p(t \mid w, x)$$
The model parameter $w$ is learned from the training data $\mathbf{x}$ and the training targets $\mathbf{t}$ only:
$$p(w \mid x, \mathbf{x}, \mathbf{t}) = p(w \mid \mathbf{x}, \mathbf{t})$$
Predictive Distribution:
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw$$
Proof of the Predictive Distribution
The predictive distribution is
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw$$
because
$$\int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw
= \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(w, x, \mathbf{x}, \mathbf{t})}\, \frac{p(w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}\, dw
= \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}\, dw
= \frac{p(t, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}
= p(t \mid x, \mathbf{x}, \mathbf{t}),$$
or simply
$$\int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw = \int p(t, w \mid x, \mathbf{x}, \mathbf{t})\, dw = p(t \mid x, \mathbf{x}, \mathbf{t}).$$
Predictive Distribution with Simplified Prior
Find the predictive distribution
$$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta)\, p(w \mid \mathbf{t}, \alpha, \beta)\, dw$$
(remember: conditioning on $x$ is often suppressed to simplify the notation).
Now we know (neglecting as usual to notate conditioning on $x$)
$$p(t \mid w, \beta) = \mathcal{N}(t \mid w^\top \phi(x), \beta^{-1})$$
and the posterior was
$$p(w \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N)$$
where
$$m_N = \beta S_N \Phi^\top \mathbf{t}, \qquad S_N^{-1} = \alpha I + \beta \Phi^\top \Phi$$
Predictive Distribution with Simplified Prior
If we do the integral (it turns out to be the convolution of the two Gaussians), we get for the predictive distribution
$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid m_N^\top \phi(x), \sigma_N^2(x))$$
where the variance $\sigma_N^2(x)$ is given by
$$\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^\top S_N\, \phi(x).$$
This is more easily shown using a similar approach to the earlier "intuition" slide, again with the Bienaymé formula, now using
$$t = w^\top \phi(x) + \mathcal{N}(0, \beta^{-1}).$$
However, this is a linear-Gaussian specific trick; in general we need to integrate out the parameters.
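A sketch of these two formulas in code (my own; the default basis $\phi(x) = [1, x]$ is purely for illustration):

```python
# Sketch: predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x),
# given a posterior (m_N, S_N) such as the one computed in the earlier sketch.
import numpy as np

def predictive(x_new, m_N, S_N, beta, phi=lambda x: np.array([1.0, x])):
    phi_x = phi(x_new)
    mean = float(m_N @ phi_x)
    var = 1.0 / beta + float(phi_x @ S_N @ phi_x)
    return mean, var
```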
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 1.
Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 2.
Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 4.
Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 25.
Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).
Predictive Distribution with Simplified Prior
Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 1.
Predictive Distribution with Simplified Prior
Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 2.
Predictive Distribution with Simplified Prior
Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 4.
Predictive Distribution with Simplified Prior
Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 25.
Limitations of Linear Basis Function Models
Basis functions $\phi_j(x)$ are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:
Data lie close to a nonlinear manifold whose intrinsic dimension is much smaller than D. We need algorithms which place basis functions only where the data are (e.g. kernel methods / Gaussian processes).
Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (e.g. linear methods or shallow neural networks).
Curse of Dimensionality
Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. There are no surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius $r = 1$ and $r = 1 - \epsilon$?
The volume scales like $r^D$, therefore the formula for the volume of a sphere is $V_D(r) = K_D r^D$, and
$$\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D$$
Curse of Dimensionality
Fraction of the volume of the sphere in a D-dimensional space which lies between radius $r = 1$ and $r = 1 - \epsilon$:
$$\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D$$
[Plot: volume fraction $1 - (1 - \epsilon)^D$ versus $\epsilon$ for D = 1, 2, 5, 20.]
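The shell fraction above is a one-liner to evaluate (my own sketch):

```python
# Sketch: fraction of a unit D-ball's volume in the thin shell [1 - eps, 1].
eps = 0.01
for D in (1, 2, 5, 20, 200):
    print(D, 1.0 - (1.0 - eps) ** D)
# already at D = 200, roughly 87% of the volume lies in the outer 1% shell
```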
Curse of Dimensionality
Probability density with respect to radius r of a Gaussian
distribution for various values of the dimensionality D.
[Plot: $p(r)$ versus $r$ for D = 1, 2, 20.]
Curse of Dimensionality
Probability density with respect to radius r of a Gaussian
distribution for various values of the dimensionality D.
Example: D = 2; assume $\mu = 0$, $\Sigma = I$.
$$\mathcal{N}(x \mid 0, I) = \frac{1}{2\pi} \exp\Big\{ -\frac{1}{2} x^\top x \Big\} = \frac{1}{2\pi} \exp\Big\{ -\frac{1}{2} (x_1^2 + x_2^2) \Big\}$$
Coordinate transformation:
$$x_1 = r \cos(\phi), \qquad x_2 = r \sin(\phi)$$
Probability density in the new coordinates:
$$p(r, \phi \mid 0, I) = \mathcal{N}(x(r, \phi) \mid 0, I)\, |J|$$
where $|J| = r$ is the determinant of the Jacobian of the given coordinate transformation. Hence
$$p(r, \phi \mid 0, I) = \frac{1}{2\pi}\, r \exp\Big\{ -\frac{1}{2} r^2 \Big\}$$
Curse of Dimensionality
Probability density with respect to radius r of a Gaussian
distribution for D = 2 (and µ = 0,Σ = I)
$$p(r, \phi \mid 0, I) = \frac{1}{2\pi}\, r \exp\Big\{ -\frac{1}{2} r^2 \Big\}$$
Integrate over all angles $\phi$:
$$p(r \mid 0, I) = \int_0^{2\pi} \frac{1}{2\pi}\, r \exp\Big\{ -\frac{1}{2} r^2 \Big\}\, d\phi = r \exp\Big\{ -\frac{1}{2} r^2 \Big\}$$
[Plot (repeated): $p(r)$ versus $r$ for D = 1, 2, 20.]
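For general D (my own remark, not on the slides), the radius of a standard D-dimensional Gaussian follows a chi distribution, $p(r) = r^{D-1} e^{-r^2/2} / (2^{D/2-1}\,\Gamma(D/2))$, which reduces to $r\,e^{-r^2/2}$ for D = 2 as derived above. A small sketch reproduces the curves:

```python
# Sketch: radial density of a standard D-dimensional Gaussian (a chi distribution);
# for D = 2 this is r * exp(-r^2 / 2), matching the derivation above.
import numpy as np
from scipy.special import gammaln

def radial_density(r, D):
    log_p = (D - 1) * np.log(r) - 0.5 * r**2 - (D / 2 - 1) * np.log(2.0) - gammaln(D / 2)
    return np.exp(log_p)

r = np.linspace(1e-6, 6.0, 600)
for D in (1, 2, 20):
    p = radial_density(r, D)
    print(D, round(r[np.argmax(p)], 2))   # the mode moves outwards roughly like sqrt(D - 1)
```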
Summary: Linear Regression
Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias variance decomposition
Conjugate prior
Bayesian linear regression
Sequential update of the posterior
Predictive distribution
Curse of dimensionality