Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part III
Linear Regression 1
Linear Regression
N = 10
x \equiv (x_1, \ldots, x_N)^T
t \equiv (t_1, \ldots, t_N)^T
x_i \in \mathbb{R}, \quad i = 1, \ldots, N
t_i \in \mathbb{R}, \quad i = 1, \ldots, N
[Figure: scatter plot of the N = 10 training pairs (x_i, t_i).]
Predictor y(x,w)?
Performance measure?
Optimal solution w∗?
Recall: projection, inverse
Probabilities, Losses
Gaussian Distribution
Bayes Rule
Expected Loss
Linear Curve Fitting – Least Squares
N = 10
x \equiv (x_1, \ldots, x_N)^T
t \equiv (t_1, \ldots, t_N)^T
x_i \in \mathbb{R}, \quad i = 1, \ldots, N
t_i \in \mathbb{R}, \quad i = 1, \ldots, N
y(x, w) = w_1 x + w_0
X \equiv [x \; 1]
w^* = (X^T X)^{-1} X^T t
[Figure: least-squares straight-line fit to the N = 10 training pairs (x_i, t_i).]
We assume
t = \underbrace{y(x, w)}_{\text{deterministic}} + \underbrace{\epsilon}_{\text{Gaussian noise}}
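A minimal numerical sketch of this fit in Python with NumPy; the data are synthetic, and the generating weights and noise level below are illustrative assumptions rather than values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 10.0, N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)  # t = y(x, w) + Gaussian noise (assumed weights)

X = np.column_stack([x, np.ones(N)])               # X = [x 1]
w_star = np.linalg.solve(X.T @ X, X.T @ t)         # w* = (X^T X)^{-1} X^T t
print("w1, w0 =", w_star)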
Curve fitting – revisited
a priori belief about the parameter w captured in the prior
probability p(w)
observed data D = \{ t_1, \ldots, t_N \}
calculate the belief in w after the data D have been
observed
p(w \mid D) = \frac{p(D \mid w) \, p(w)}{p(D)}
p(D |w) as a function of w : likelihood function
likelihood expresses how probable the observed data are for different values of w. It is not a probability density with respect to w (but it is with respect to D; prove it)
Maximum Likelihood
Consider the linear regression problem, where we have
random variables xn and tn.
We assume a conditional model t_n | x_n.
We propose a distribution, parameterized by θ:
t_n | x_n ∼ density(θ)
For a given θ the density defines the probability of observing t_n | x_n.
We are interested in finding θ that maximises the
probability (called the likelihood) of the data.
Likelihood Function – Frequentist versus Bayesian
Likelihood function p(D |w)
Frequentist Approach
w considered a fixed parameter
its value is defined by some 'estimator'
error bars on the estimated w are obtained from the distribution of possible data sets D
Bayesian Approach
there is only one single data set D
uncertainty in the parameters comes from a probability distribution over w
Frequentist Estimator – Maximum Likelihood
choose w for which the likelihood p(D |w) (the probability
of the observed data) is maximal
the most common heuristic for learning a single fixed w
equivalently: error function is negative log of likelihood
function, to be minimised
log is a monotonic function
maximising the likelihood ⇐⇒ minimising the error
Example: a fair-looking coin is tossed three times, always landing on heads.
The maximum likelihood estimate of the probability of landing heads is then 1.
Bayesian Approach
including prior knowledge is easy (via the prior over w)
subjective choice of prior; allows better results by incorporating domain knowledge
sometimes the choice of prior is motivated by a convenient mathematical form
the prior becomes irrelevant as N → ∞, but helps for small N
need to sum/integrate over the whole parameter space:
advances in sampling (Markov Chain Monte Carlo methods)
advances in approximation schemes (Variational Bayes, Expectation Propagation)
there is no true w.
Regression
Given a training data set of N observations {xn} and target
values tn.
Goal: Learn to predict the value of one or more target values t given a new value of the input x.
Example: Polynomial curve fitting (see Introduction).
[Figure: training points in the (x, t) plane, with a question mark at a new input where the target must be predicted.]
Supervised Learning:
(non-Bayesian) Point Estimate
[Diagram: Training Phase - training data x and training targets t are fed to a model with adjustable parameter w, and the most appropriate w* is fixed. Test Phase - the model with fixed parameter w* maps test data x to a predicted test target t.]
Why Linear Regression?
Analytic solution when minimising sum of squared errors
Well understood statistical behaviour
Efficient algorithms exist for convex losses and
regularizers
But what if the relationship is non-linear?
[Figure: example data whose input-target relationship is non-linear.]
Linear Basis Function Models
Linear combination of fixed nonlinear basis functions
y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
parameters w = (w_0, \ldots, w_{M-1})^T
basis functions \phi(x) = (\phi_0(x), \ldots, \phi_{M-1}(x))^T
convention: \phi_0(x) = 1
w_0 is the bias parameter
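A small sketch of evaluating such a model in NumPy; the particular basis functions and weights below are arbitrary illustrative choices, not ones prescribed by the lecture.

import numpy as np

# A fixed set of nonlinear basis functions, with phi_0(x) = 1 by convention.
basis = [lambda x: np.ones_like(x),     # phi_0: bias
         lambda x: x,                   # phi_1
         lambda x: x ** 2,              # phi_2
         lambda x: np.sin(x)]           # phi_3

def design_matrix(x):
    """Stack phi_j(x) columnwise: Phi[n, j] = phi_j(x_n)."""
    return np.column_stack([phi(x) for phi in basis])

w = np.array([0.5, -1.0, 0.1, 2.0])     # illustrative parameters w_0 .. w_{M-1}
x = np.linspace(0.0, 5.0, 6)
y = design_matrix(x) @ w                # y(x, w) = w^T phi(x) for each input
print(y)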
Polynomial Basis Functions
Scalar input variable x
\phi_j(x) = x^j
Limitation : Polynomials are global functions of the input
variable x so the learned function will extrapolate poorly
[Figure: polynomial basis functions x^j plotted on the interval [-1, 1].]
’Gaussian’ Basis Functions
Scalar input variable x
\phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}
Not a probability distribution.
No normalisation required, taken care of by the model
parameters w.
Well behaved away from the data (though pulled to zero)
[Figure: Gaussian basis functions with different centres μ_j plotted on the interval [-1, 1].]
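A short sketch of building these features in NumPy; the centres μ_j and the width s below are arbitrary illustrative choices.

import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); returns Phi[n, j]."""
    x = np.asarray(x)[:, None]                 # shape (N, 1)
    mu = np.asarray(mu)[None, :]               # shape (1, M)
    return np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))

mu = np.linspace(-1.0, 1.0, 5)                 # centres spread over [-1, 1]
Phi = gaussian_basis(np.linspace(-1.0, 1.0, 11), mu, s=0.25)
print(Phi.shape)                               # (11, 5); each column peaks at 1 at its own centre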
Sigmoidal Basis Functions
Scalar input variable x
\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)
where σ(a) is the logistic sigmoid function defined by
\sigma(a) = \frac{1}{1 + \exp(-a)}
σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1.
[Figure: sigmoidal basis functions with different centres μ_j plotted on the interval [-1, 1].]
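A quick numerical check of the tanh relation above, as a NumPy sketch:

import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))            # logistic sigmoid

a = np.linspace(-5.0, 5.0, 101)
print(np.allclose(np.tanh(a), 2.0 * sigma(2.0 * a) - 1.0))   # True: tanh(a) = 2 sigma(2a) - 1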
Other Basis Functions
Fourier Basis : each basis function represents a specific
frequency and has infinite spatial extent.
Wavelets : localised in both space and frequency (also mutually orthogonal to simplify application).
Splines (piecewise polynomials restricted to regions of the input space; additional constraints where pieces meet, e.g. smoothness constraints → conditions on the derivatives).
[Figure: linear, quadratic, cubic, and quartic spline fits to the example points given below.]
Approximate the points
{(0, 0), (1, 1), (2,−1), (3, 0), (4,−2), (5, 1)} by different
splines.
Maximum Likelihood and Least Squares
No special assumption about the basis functions φ_j(x). In the simplest case, one can think of φ_j(x) = x_j, or φ(x) = x.
Assume target t is given by
t = \underbrace{y(x, w)}_{\text{deterministic}} + \underbrace{\epsilon}_{\text{noise}}
where ε is a zero-mean Gaussian random variable with precision (inverse variance) β.
Thus
p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
[Figure: the Gaussian conditional density p(t | x_0) centred on the regression function y(x) at a single input x_0.]
Maximum Likelihood and Least Squares
Likelihood of one target t given the data x was
p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
Now consider a set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), i.e. drawn independently from the same distribution. The likelihood of the targets t is then
p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
From now on drop the conditioning variable X from the
notation, as with supervised learning we do not seek to
model the distribution of the input data.
Maximum Likelihood and Least Squares
Consider the logarithm of the likelihood p(t |w, β) (the
logarithm is a monotone function! )
\ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
                       = \sum_{n=1}^{N} \ln\left( \sqrt{\frac{\beta}{2\pi}} \exp\left\{ -\frac{\beta}{2} (t_n - w^T \phi(x_n))^2 \right\} \right)
                       = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)
where the sum-of-squares error function is
E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 .
\arg\max_w \ln p(t \mid w, \beta) \;\rightarrow\; \arg\min_w E_D(w)
Maximum Likelihood and Least Squares
Goal: Find a more compact representation.
Rewrite the error function
E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 = \frac{1}{2} (t - \Phi w)^T (t - \Phi w)
where t = (t_1, \ldots, t_N)^T, and
\Phi = \begin{pmatrix}
  \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\
  \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\
  \vdots      & \vdots      & \ddots & \vdots          \\
  \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N)
\end{pmatrix}
Maximum Likelihood and Least Squares
The log likelihood is now
\ln p(t \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)
                       = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \frac{\beta}{2} (t - \Phi w)^T (t - \Phi w)
Find critical points of ln p(t |w, β).
The gradient with respect to w is
\nabla_w \ln p(t \mid w, \beta) = \beta \Phi^T (t - \Phi w).
Setting the gradient to zero gives
0 = \Phi^T t - \Phi^T \Phi w,
which results in
w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t = \Phi^{\dagger} t
where \Phi^{\dagger} is the Moore-Penrose pseudo-inverse of the matrix \Phi.
Maximum Likelihood and Least Squares
The log likelihood with the optimal wML is now
\ln p(t \mid w_{ML}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \frac{\beta}{2} (t - \Phi w_{ML})^T (t - \Phi w_{ML})
Find critical points of ln p(t |w, β) wrt β,
\frac{\partial \ln p(t \mid w_{ML}, \beta)}{\partial \beta} = 0
results in
\frac{1}{\beta_{ML}} = \frac{1}{N} (t - \Phi w_{ML})^T (t - \Phi w_{ML})
Note: We can first find the maximum likelihood for w as
this does not depend on β. Then we can use wML to find
the maximum likelihood solution for β.
Could we have optimised wrt β first, and then wrt w?
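A compact NumPy sketch of this two-step maximum likelihood fit with a polynomial basis; the generating function, noise level, and model size are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)   # assumed generating process

# Polynomial design matrix Phi[n, j] = x_n^j, j = 0 .. M-1
M = 6
Phi = np.vander(x, M, increasing=True)

w_ml = np.linalg.pinv(Phi) @ t                  # w_ML = (Phi^T Phi)^{-1} Phi^T t = Phi^+ t
residual = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residual ** 2)          # 1/beta_ML = (1/N) ||t - Phi w_ML||^2
print(w_ml, beta_ml)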
Sequential Learning – Stochastic Gradient Descent
For large data sets, calculating the maximum likelihood
parameters wML and βML may be costly.
For online applications, the full data set is never in memory.
Use a sequential algorithm (online algorithm).
If the error function is a sum over data points, E = \sum_n E_n, then
1 initialise w^{(0)} to some starting value
2 update the parameter vector at iteration τ + 1 by
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n,
where E_n is the error function after presenting the nth data point, and η is the learning rate.
Sequential Learning – Stochastic Gradient Descent
For the sum-of-squares error function, stochastic gradient descent results in
w^{(\tau+1)} = w^{(\tau)} + \eta \left( t_n - w^{(\tau)T} \phi(x_n) \right) \phi(x_n)
The value of the learning rate must be chosen carefully. Too large a learning rate may prevent the algorithm from converging; too small a learning rate makes the algorithm follow the data only very slowly.
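A minimal sketch of this update rule in NumPy; the data, basis, learning rate, and number of passes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
N, M = 200, 4
x = rng.uniform(-1.0, 1.0, size=N)
t = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=N)   # assumed data
Phi = np.vander(x, M, increasing=True)           # phi(x_n) = (1, x_n, x_n^2, x_n^3)

w = np.zeros(M)                                  # w^(0)
eta = 0.1                                        # learning rate (chosen by hand)
for epoch in range(20):
    for n in rng.permutation(N):                 # present one data point at a time
        phi_n = Phi[n]
        w += eta * (t[n] - w @ phi_n) * phi_n    # w <- w + eta (t_n - w^T phi(x_n)) phi(x_n)

print(w)                                         # close to the batch least-squares solution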
Regularized Least Squares
Add regularisation in order to prevent overfitting
E_D(w) + \lambda E_W(w)
with regularisation coefficient λ.
Simple quadratic regulariser
E_W(w) = \frac{1}{2} w^T w
Maximum likelihood solution
w = \left( \lambda I + \Phi^T \Phi \right)^{-1} \Phi^T t
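A short NumPy sketch of this closed-form regularised solution; the data and the value of λ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
N, M = 25, 10
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)   # assumed data
Phi = np.vander(x, M, increasing=True)

lam = 1e-3
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)   # (lambda I + Phi^T Phi)^{-1} Phi^T t
print(w_reg)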
Regularized Least Squares
More general regulariser
E_W(w) = \frac{1}{2} \sum_{j=1}^{M} |w_j|^q
q = 1 (the lasso) leads to a sparse model if λ is large enough.
[Figure: contours of the regularisation term for q = 0.5, q = 1, q = 2, and q = 4.]
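For q = 1 there are standard solvers; a hedged sketch using scikit-learn's Lasso on synthetic data (alpha plays the role of λ, up to the library's scaling convention).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
N, M = 100, 20
Phi = rng.normal(size=(N, M))                    # synthetic design matrix
w_true = np.zeros(M)
w_true[:3] = [2.0, -1.0, 0.5]                    # only 3 relevant features (assumed)
t = Phi @ w_true + rng.normal(scale=0.1, size=N)

lasso = Lasso(alpha=0.1).fit(Phi, t)
print(np.sum(lasso.coef_ != 0), "non-zero weights out of", M)   # most weights driven exactly to zero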
Lagrangian Dual View of the Regulariser
By the Lagrange multiplier method, minimization of the
regularized error function
\frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q ,
is equivalent to minimizing the unregularized
sum-of-squares error,
\frac{1}{2} \sum_{n=1}^{N} (t_n - w^T \phi(x_n))^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j|^q \leq \eta .
This yields the figures on the next slide.
Comparison of Quadratic and Lasso Regulariser
Quadratic regulariser
\frac{1}{2} \sum_{j=1}^{M} w_j^2
[Figure: the quadratic constraint region in the (w_1, w_2) plane with the constrained optimum w*.]
Lasso regulariser
\frac{1}{2} \sum_{j=1}^{M} |w_j|
[Figure: the lasso constraint region (a diamond) in the (w_1, w_2) plane; the constrained optimum w* lies at a corner, giving a sparse solution.]
Multiple Outputs
More than 1 target variable per data point.
y becomes a vector instead of a scalar. Each dimension
can be treated with a different set of basis functions (and
that may be necessary if the data in the different target
dimensions represent very different types of information.)
Here we restrict ourselves to the SAME basis functions
y(x, w) = W^T \phi(x)
where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and
\phi(x) = (\phi_0(x), \ldots, \phi_{M-1}(x)), with \phi_0(x) = 1, as before.
Define the target matrix T containing the target vector t_n^T in the nth row.
Multiple Outputs
Suppose the conditional distribution of the target vector is
an isotropic Gaussian of the form
p(t \mid x, W, \beta) = \mathcal{N}(t \mid W^T \phi(x), \beta^{-1} I).
The log likelihood is then
\ln p(T \mid X, W, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid W^T \phi(x_n), \beta^{-1} I)
                          = \frac{NK}{2} \ln\left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^{N} \| t_n - W^T \phi(x_n) \|^2
Multiple Outputs
Maximisation with respect to W results in
W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T.
For each target variable tk, we get
w_k = (\Phi^T \Phi)^{-1} \Phi^T t_k = \Phi^{\dagger} t_k.
The solution decouples between the different target variables.
This also holds for a general Gaussian noise distribution with an arbitrary covariance matrix.
Why? W defines the mean of the Gaussian noise
distribution. And the maximum likelihood solution for the
mean of a multivariate Gaussian is independent of the
covariance.
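A short sketch checking this decoupling numerically in NumPy, with synthetic data and K = 2 targets sharing the same basis functions.

import numpy as np

rng = np.random.default_rng(5)
N, M, K = 40, 5, 2
Phi = rng.normal(size=(N, M))                    # shared design matrix
T = rng.normal(size=(N, K))                      # target matrix, t_n^T in row n

W_ml = np.linalg.pinv(Phi) @ T                   # W_ML = (Phi^T Phi)^{-1} Phi^T T, shape (M, K)

# Column k of W_ML equals the single-output solution for target column t_k.
for k in range(K):
    w_k = np.linalg.pinv(Phi) @ T[:, k]
    assert np.allclose(W_ml[:, k], w_k)
print("per-output solutions decouple:", W_ml.shape)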
Loss Function for Regression
Over-fitting results from a large number of basis functions
and a relatively small training set.
Regularisation can prevent overfitting, but how to find the
correct value for the regularisation constant λ ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.
Loss Function for Regression
Choose an estimator y(x) to estimate the target value t for
each input x.
Choose a loss function L(t, y(x)) which measures the
difference between the target t and the estimate y(x).
The expected loss is then
E[L] = \iint L(t, y(x)) \, p(x, t) \, dx \, dt
Common choice: Squared Loss
L(t, y(x)) = \{ y(x) - t \}^2 .
Expected loss for squared loss function
E[L] = \iint \{ y(x) - t \}^2 \, p(x, t) \, dx \, dt .
Loss Function for Regression
Expected loss for squared loss function
E[L] = \iint \{ y(x) - t \}^2 \, p(x, t) \, dx \, dt .
Minimise E [L] by choosing the regression function
y(x) = \frac{\int t \, p(x, t) \, dt}{p(x)} = \int t \, p(t \mid x) \, dt = E_t[t \mid x]
(calculus of variations is not required to derive this result ;
we may work point-wise by fixing an x and using
stationarity to solve for y(x) — why is that sufficient?).
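A tiny Monte Carlo check of this result at a single fixed x, as a NumPy sketch; the conditional distribution below is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(6)
# For a fixed x, suppose t | x ~ N(mean=1.5, std=0.7) (assumed conditional distribution).
t = rng.normal(loc=1.5, scale=0.7, size=100_000)

for y in [1.0, 1.5, 2.0]:                        # candidate predictions at this x
    print(y, np.mean((y - t) ** 2))              # expected squared loss is smallest at y = E[t | x] = 1.5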
Optimal Predictor for Squared Loss
The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x).
[Figure: the regression function y(x) and the conditional density p(t | x_0) at a single input x_0, whose mean is y(x_0).]
Analysing the Squared Loss (1)
Analyse the expected loss
E[L] = \iint \{ y(x) - t \}^2 \, p(x, t) \, dx \, dt .
Rewrite the squared loss
\{ y(x) - t \}^2 = \{ y(x) - E[t \mid x] + E[t \mid x] - t \}^2
                 = \{ y(x) - E[t \mid x] \}^2 + \{ E[t \mid x] - t \}^2 + 2 \{ y(x) - E[t \mid x] \} \{ E[t \mid x] - t \}
Claim:
\iint \{ y(x) - E[t \mid x] \} \{ E[t \mid x] - t \} \, p(x, t) \, dx \, dt = 0 .
Analysing the Squared Loss (2)
Claim:
\iint \{ y(x) - E[t \mid x] \} \{ E[t \mid x] - t \} \, p(x, t) \, dx \, dt = 0 .
Separate the factor depending on x from the integral over t:
\int \{ y(x) - E[t \mid x] \} \left( \int \{ E[t \mid x] - t \} \, p(x, t) \, dt \right) dx
Calculate the integral over t:
\int \{ E[t \mid x] - t \} \, p(x, t) \, dt = E[t \mid x] \, p(x) - p(x) \int t \, \frac{p(x, t)}{p(x)} \, dt
                                            = E[t \mid x] \, p(x) - p(x) \, E[t \mid x]
                                            = 0
Analysing the Squared Loss (3)
The expected loss is now
E[L] = \int \{ y(x) - E[t \mid x] \}^2 \, p(x) \, dx + \int \mathrm{var}[t \mid x] \, p(x) \, dx     (1)
Minimise the first term by choosing y(x) = E[t | x] (as we saw already).
The second term represents the intrinsic variability of the target data (it can be regarded as noise). It is independent of the choice of y(x) and cannot be reduced by learning a better y(x).
The Bias-Variance Decomposition (1)
Consider again squared loss for which the optimal
prediction is given by the conditional expectation h(x)
h(x) = E[t \mid x] = \int t \, p(t \mid x) \, dt .
Since h(x) is unavailable to us, it must be estimated from a
(finite) dataset D.
D is a finite sample from the unknown joint p(x, t)
Denote the dependence of the learned function on the data by y(x; D).
Evaluate the performance of the algorithm by taking the expectation E_D[L] over all data sets D.
The Bias-Variance Decomposition (2)
Taking the expectation over data sets D, using Eqn 1, and
interchanging the order of expectations for the first term:
E_D[E[L]] = \int E_D\left[ \{ y(x; D) - h(x) \}^2 \right] p(x) \, dx + \iint \{ h(x) - t \}^2 \, p(x, t) \, dx \, dt
Again, add and subtract the expectation E_D[y(x; D)]:
\{ y(x; D) - h(x) \}^2 = \{ y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x) \}^2
and show that the mixed term vanishes under the
expectation ED.
The Bias-Variance Decomposition (3)
Expected loss E_D[L] over all data sets D:
expected loss = (bias)^2 + variance + noise
where
(bias)^2 = \int \{ E_D[y(x; D)] - h(x) \}^2 \, p(x) \, dx
variance = \int E_D\left[ \{ y(x; D) - E_D[y(x; D)] \}^2 \right] p(x) \, dx
noise = \iint \{ h(x) - t \}^2 \, p(x, t) \, dx \, dt .
(bias)^2 : How accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
variance : How sensitive is the model to small changes in the training set? (How much do solutions for individual data sets vary around their average?)
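A sketch that estimates these three terms by simulation, mirroring the 100-data-set experiment on the following slides; the generating sinusoid, noise level, and regularised polynomial model are assumptions in the spirit of the figures, not the lecture's exact setup.

import numpy as np

rng = np.random.default_rng(7)
h = lambda x: np.sin(2.0 * np.pi * x)            # true regression function h(x)
noise_std, N, M, lam, L = 0.3, 25, 10, 1e-3, 100

x_grid = np.linspace(0.0, 1.0, 200)              # evaluate bias/variance on a grid (uniform p(x))
Phi_grid = np.vander(x_grid, M, increasing=True)

preds = []
for _ in range(L):                               # L independent data sets D
    x = rng.uniform(0.0, 1.0, size=N)
    t = h(x) + rng.normal(scale=noise_std, size=N)
    Phi = np.vander(x, M, increasing=True)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)   # regularised fit y(x; D)
    preds.append(Phi_grid @ w)
preds = np.array(preds)                          # shape (L, len(x_grid))

bias2 = np.mean((preds.mean(axis=0) - h(x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
noise = noise_std ** 2
print(bias2, variance, noise)                    # expected loss is approximately bias2 + variance + noise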
The Bias-Variance Decomposition
Simple models have low variance and high bias.
[Figure: model fits with ln λ = 2.6; axes are x and t.]
Left: Result of fitting the model to 100 data sets, only 25 shown.
Right: Average of the 100 fits in red, with the sinusoidal function from which the data were generated in green.
The Bias-Variance Decomposition
Complex models have high variance and low bias.
[Figure: model fits with ln λ = −2.4; axes are x and t.]
Left: Result of fitting the model to 100 data sets, only 25 shown.
Right: Average of the 100 fits in red, with the sinusoidal function from which the data were generated in green.
The Bias-Variance Decomposition
Dependence of bias and variance on the model complexity
Squared bias, variance, their sum, and test data
The minimum of (bias)^2 + variance occurs close to the value of ln λ that gives the minimum test error.
[Figure: (bias)^2, variance, (bias)^2 + variance, and test error plotted against ln λ.]
Unbiased Estimators
You may have encountered unbiased estimators
Why guarantee zero bias? To quote the pioneer of
Bayesian inference, Edwin Jaynes, from his book
Probability Theory: The Logic of Science (2003):
The Bias-Variance Decomposition
Tradeoff between bias and variance
simple models have low variance and high bias
complex models have high variance and low bias
The sum of bias and variance has a minimum at a certain
model complexity.
Expected loss E_D[L] over all data sets D:
expected loss = (bias)^2 + variance + noise.
The noise comes from the data, and can not be removed
from the expected loss.
To analyse the bias-variance decomposition, many data sets are needed, which are not always available.
Linear Regression 1
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares
Sequential Learning
Regularized Least Squares
Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition