
Statistical Machine Learning


Outline

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part III

Linear Regression 1


Linear Regression

N = 10

x ≡ (x1, . . . , xN)T

t ≡ (t1, . . . , tN)T
xi ∈ R, i = 1, . . . , N
ti ∈ R, i = 1, . . . , N

[Figure: scatter plot of the N = 10 training points (x, t).]
Predictor y(x,w)?
Performance measure?
Optimal solution w∗?
Recall: projection, inverse


Probabilities, Losses

Gaussian Distribution
Bayes Rule
Expected Loss


Linear Curve Fitting – Least Squares

N = 10

x ≡ (x1, . . . , xN)T

t ≡ (t1, . . . , tN)T
xi ∈ R, i = 1, . . . , N
ti ∈ R, i = 1, . . . , N

y(x,w) = w1x + w0
X ≡ [x 1]

w∗ = (XTX)−1XT t

[Figure: the N = 10 data points (x, t) with the least-squares straight-line fit.]

We assume

t = y(x,w) + ε

where y(x,w) is deterministic and ε is Gaussian noise.
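As a minimal sketch of the closed-form solution above (not from the slides; the data values are made up for illustration), NumPy can compute w∗ = (XTX)−1XT t directly:

import numpy as np

# Made-up data standing in for the N = 10 points in the figure.
x = np.linspace(0.0, 10.0, 10)
t = 0.7 * x + 0.5 + np.random.default_rng(0).normal(scale=0.5, size=10)

# Design matrix X = [x 1]: one column for the slope w1, one for the bias w0.
X = np.column_stack([x, np.ones_like(x)])

# Closed-form least-squares solution w* = (X^T X)^{-1} X^T t.
w_star = np.linalg.solve(X.T @ X, X.T @ t)

# Fitted line y(x, w) = w1 * x + w0 at the training inputs.
y_hat = X @ w_star
print(w_star)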


Curve fitting – revisited

A priori belief about the parameter w is captured in the prior probability p(w).
Observed data D = {t1, . . . , tN}.
Calculate the belief in w after the data D have been observed:

p(w | D) = p(D |w) p(w) / p(D)

p(D |w), viewed as a function of w, is the likelihood function.
The likelihood expresses how probable the observed data are for different values of w; it is not a probability density with respect to w (but it is with respect to D; prove it).


Maximum Likelihood

Consider the linear regression problem, where we have random variables xn and tn.
We assume a conditional model for tn given xn.
We propose a distribution, parameterized by θ:

tn | xn ∼ density(θ)

For a given θ, the density defines the probability of observing tn given xn.
We are interested in finding the θ that maximises the probability (called the likelihood) of the data.


Likelihood Function – Frequentist versus Bayesian

Likelihood function p(D |w)

Frequentist Approach
  w is considered a fixed parameter
  its value is defined by some 'estimator'
  error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian Approach
  there is only a single data set D (the one actually observed)
  uncertainty in the parameters comes from a probability distribution over w


Frequentist Estimator – Maximum Likelihood

Choose the w for which the likelihood p(D |w) (the probability of the observed data) is maximal.
This is the most common heuristic for learning a single fixed w.
Equivalently: the error function is the negative log of the likelihood function, to be minimised.
Since log is a monotonic function, maximising the likelihood ⇐⇒ minimising the error.
Example: a fair-looking coin is tossed three times, always landing on heads.
The maximum likelihood estimate of the probability of landing heads is 1.


Bayesian Approach

Including prior knowledge is easy (via the prior p(w)).
The choice of prior is subjective; it allows better results by incorporating domain knowledge.
Sometimes the choice of prior is motivated by a convenient mathematical form.
The prior becomes irrelevant as N → ∞, but it helps for small N.
Need to sum/integrate over the whole parameter space:
  advances in sampling (Markov Chain Monte Carlo methods)
  advances in approximation schemes (Variational Bayes, Expectation Propagation)
There is no single true w; uncertainty about w is expressed by a distribution over w.


Regression

Given a training data set of N observations {xn} and corresponding target values tn.
Goal: learn to predict the value of one or more target values t given a new value of the input x.
Example: polynomial curve fitting (see Introduction).

[Figure: training points in the (x, t) plane, with a query input whose target value is unknown.]


Supervised Learning: (non-Bayesian) Point Estimate

Training phase: a model with an adjustable parameter w is fit to the training data x and training targets t, fixing the most appropriate w*.
Test phase: the model with the fixed parameter w* maps test data x to a predicted test target t.


Why Linear Regression?

Analytic solution when minimising the sum of squared errors
Well understood statistical behaviour
Efficient algorithms exist for convex losses and regularisers
But what if the relationship is non-linear?

[Figure: data exhibiting a nonlinear relationship between x and t.]


Linear Basis Function Models

Linear combination of fixed nonlinear basis functions:

y(x,w) = ∑_{j=0}^{M−1} wj φj(x) = wTφ(x)

parameters w = (w0, . . . ,wM−1)T
basis functions φ(x) = (φ0(x), . . . , φM−1(x))T
convention: φ0(x) = 1
w0 is the bias parameter
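A minimal sketch (not part of the slides) of evaluating such a model: the prediction is linear in w even though the basis functions φj may be nonlinear in x. The polynomial basis and the weight values below are illustrative assumptions.

import numpy as np

def polynomial_design_matrix(x, M):
    # Columns phi_j(x_n) = x_n**j for j = 0, ..., M-1; the j = 0 column is the
    # constant 1, matching the convention phi_0(x) = 1.
    return np.vander(np.asarray(x, dtype=float), N=M, increasing=True)

def predict(x, w):
    # y(x, w) = w^T phi(x), evaluated for every input in x at once.
    return polynomial_design_matrix(x, M=len(w)) @ w

# Illustrative weight vector for M = 4 basis functions.
w = np.array([0.5, -1.0, 0.3, 0.05])
print(predict(np.linspace(-1.0, 1.0, 5), w))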


Polynomial Basis Functions

Scalar input variable x

φj(x) = x^j

Limitation: polynomials are global functions of the input variable x, so the learned function will extrapolate poorly.

[Figure: polynomial basis functions x^j on the interval [−1, 1].]


’Gaussian’ Basis Functions

Scalar input variable x

φj(x) = exp{ −(x − µj)² / (2s²) }

Not a probability distribution.
No normalisation required, taken care of by the model parameters w.
Well behaved away from the data (though pulled to zero).

[Figure: Gaussian basis functions with means µj spread over [−1, 1].]
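A small sketch of Gaussian basis functions (the uniform grid of means µj and the shared width s are assumptions made for the example, not prescribed by the slide):

import numpy as np

def gaussian_basis(x, centres, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); returns shape (len(x), len(centres)).
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))

# Means spread uniformly over [-1, 1]; width s chosen by hand.
Phi = gaussian_basis(np.linspace(-1.0, 1.0, 200), centres=np.linspace(-1.0, 1.0, 9), s=0.2)
print(Phi.shape)  # (200, 9)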


Sigmoidal Basis Functions

Scalar input variable x

φj(x) = σ( (x − µj)/s )

where σ(a) is the logistic sigmoid function defined by

σ(a) = 1 / (1 + exp(−a))

σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(a) − 1.

[Figure: sigmoidal basis functions on the interval [−1, 1].]


Other Basis Functions

Fourier basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (also mutually orthogonal to simplify application).
Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where the pieces meet (e.g. smoothness constraints → conditions on the derivatives). A small spline-basis sketch follows the figure below.

[Figure: linear, quadratic, cubic, and quartic splines approximating the points {(0, 0), (1, 1), (2,−1), (3, 0), (4,−2), (5, 1)}.]
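As an illustrative sketch (not from the slides), a cubic spline fit to the points in the caption can be written in the same linear-in-w form using a truncated-power basis; the interior knot positions are an assumption made for this example.

import numpy as np

# The points from the figure caption.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
t = np.array([0.0, 1.0, -1.0, 0.0, -2.0, 1.0])

def cubic_spline_basis(x, knots):
    # Truncated-power cubic basis: 1, x, x^2, x^3, and (x - k)_+^3 for each knot k.
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

Phi = cubic_spline_basis(x, knots=[1.5, 3.5])      # knots chosen by hand
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)        # ordinary least-squares fit
print(Phi @ w)                                     # fitted values at the data points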


Maximum Likelihood and Least Squares

No special assumption about the basis functions φj(x). In the simplest case, one can think of φj(x) = xj, or φ(x) = x.
Assume the target t is given by

t = y(x,w) + ε

where y(x,w) is deterministic and ε is a zero-mean Gaussian random variable with precision (inverse variance) β.
Thus

p(t | x,w, β) = N(t | y(x,w), β−1)

[Figure: the conditional density p(t | x0) is a Gaussian centred on y(x0), the value of the regression function y(x) at x0.]


Maximum Likelihood and Least Squares

The likelihood of one target t given the input x was

p(t | x,w, β) = N(t | y(x,w), β−1)

Now consider a set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), meaning the data are drawn independently from the same distribution. The likelihood of the targets t is then

p(t |X,w, β) = ∏_{n=1}^{N} N(tn | y(xn,w), β−1)

             = ∏_{n=1}^{N} N(tn |wTφ(xn), β−1)

From now on we drop the conditioning variable X from the notation, as with supervised learning we do not seek to model the distribution of the input data.


Maximum Likelihood and Least Squares

Consider the logarithm of the likelihood p(t |w, β) (the logarithm is a monotone function!)

ln p(t |w, β) = ∑_{n=1}^{N} ln N(tn |wTφ(xn), β−1)

             = ∑_{n=1}^{N} ln ( √(β/2π) exp{ −(β/2) (tn − wTφ(xn))² } )

             = (N/2) ln β − (N/2) ln(2π) − β ED(w)

where the sum-of-squares error function is

ED(w) = (1/2) ∑_{n=1}^{N} {tn − wTφ(xn)}².

argmax_w ln p(t |w, β)  →  argmin_w ED(w)


Maximum Likelihood and Least Squares

Goal: find a more compact representation.
Rewrite the error function

ED(w) = (1/2) ∑_{n=1}^{N} {tn − wTφ(xn)}² = (1/2) (t − Φw)T(t − Φw)

where t = (t1, . . . , tN)T and Φ is the N × M design matrix

Φ = [ φ0(x1)  φ1(x1)  . . .  φM−1(x1)
      φ0(x2)  φ1(x2)  . . .  φM−1(x2)
        ...
      φ0(xN)  φ1(xN)  . . .  φM−1(xN) ]


Maximum Likelihood and Least Squares

The log likelihood is now

ln p(t |w, β) = (N/2) ln β − (N/2) ln(2π) − β ED(w)
             = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φw)T(t − Φw)

Find the critical points of ln p(t |w, β).
The gradient with respect to w is

∇w ln p(t |w, β) = β ΦT(t − Φw).

Setting the gradient to zero gives

0 = ΦT t − ΦTΦw,

which results in

wML = (ΦTΦ)−1ΦT t = Φ†t

where Φ† is the Moore–Penrose pseudo-inverse of the matrix Φ.
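A minimal numerical check of this solution (the sinusoidal data and polynomial basis are placeholder assumptions): np.linalg.pinv gives the Moore–Penrose pseudo-inverse, and np.linalg.lstsq solves the same least-squares problem in a numerically preferable way.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)     # design matrix, phi_j(x_n) = x_n**j

# Maximum likelihood weights: w_ML = pinv(Phi) @ t = (Phi^T Phi)^{-1} Phi^T t.
w_ml = np.linalg.pinv(Phi) @ t

# Equivalent solution via least squares (better conditioned in practice).
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_lstsq))            # True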


Maximum Likelihood and Least Squares

The log likelihood with the optimal wML is now

ln p(t |wML, β) = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − ΦwML)T(t − ΦwML)

Find the critical points of ln p(t |wML, β) with respect to β:

∂ ln p(t |wML, β) / ∂β = 0

results in

1/βML = (1/N) (t − ΦwML)T(t − ΦwML)

Note: we can first find the maximum likelihood solution for w, as it does not depend on β. Then we can use wML to find the maximum likelihood solution for β.
Could we have chosen to optimise with respect to β first, and then with respect to w?
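Continuing the same sketch (same placeholder data and basis), the noise precision estimate is just the reciprocal of the mean squared residual:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)

w_ml = np.linalg.pinv(Phi) @ t               # maximum likelihood weights as before

# 1 / beta_ML = (1/N) (t - Phi w_ML)^T (t - Phi w_ML), i.e. the mean squared residual.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
print(beta_ml)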


Sequential Learning – Stochastic Gradient Descent

For large data sets, calculating the maximum likelihood parameters wML and βML may be costly.
For online applications, the data are never all in memory at once.
Use a sequential (online) algorithm.
If the error function is a sum over data points, E = ∑_n En, then
  1. initialise w(0) to some starting value
  2. update the parameter vector at iteration τ + 1 by

     w(τ+1) = w(τ) − η∇En,

where En is the error function for the nth data point, and η is the learning rate.


Sequential Learning – Stochastic Gradient Descent

For the sum-of-squares error function, stochastic gradient descent results in

w(τ+1) = w(τ) + η ( tn − w(τ)Tφ(xn) ) φ(xn)

The value of the learning rate must be chosen carefully: too large a learning rate may prevent the algorithm from converging, while too small a learning rate makes learning unnecessarily slow.
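A small sketch of this update rule (the polynomial basis, learning rate, and number of passes are assumptions chosen for the example):

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)     # assumed polynomial basis, M = 4

w = np.zeros(Phi.shape[1])                   # w^(0)
eta = 0.1                                    # learning rate, chosen by hand

# Stochastic gradient descent on E_n = (1/2) (t_n - w^T phi(x_n))^2.
for _ in range(50):                          # a few passes over the data
    for n in rng.permutation(len(x)):        # present the points in random order
        phi_n = Phi[n]
        w = w + eta * (t[n] - w @ phi_n) * phi_n

print(w)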


Regularized Least Squares

Add regularisation in order to prevent overfitting:

ED(w) + λEW(w)

with regularisation coefficient λ.
Simple quadratic regulariser:

EW(w) = (1/2) wTw

The solution minimising the regularised error is

w = (λI + ΦTΦ)−1 ΦT t
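A minimal sketch of the regularised solution (data and basis are again placeholder assumptions); note that the identity matrix is scaled by λ:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
Phi = np.vander(x, N=10, increasing=True)    # deliberately flexible basis

lam = 1e-3                                   # regularisation coefficient lambda
M = Phi.shape[1]

# Regularised least squares: w = (lambda I + Phi^T Phi)^{-1} Phi^T t.
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)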


Regularized Least Squares

More general regulariser:

EW(w) = (1/2) ∑_{j=1}^{M} |wj|^q

q = 1 (the lasso) leads to a sparse model if λ is large enough.

[Figure: contours of the regulariser |w|^q for q = 0.5, 1, 2, 4.]
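As an illustration of the sparsity induced by q = 1 (not from the slides), scikit-learn's Lasso can be used; its alpha parameter plays the role of λ up to scaling conventions, and it fits the bias term separately.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 50)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
Phi = np.vander(x, N=10, increasing=True)[:, 1:]   # drop the constant column; Lasso fits an intercept itself

# Larger alpha (i.e. larger lambda) drives more coefficients exactly to zero.
model = Lasso(alpha=0.01, max_iter=10000).fit(Phi, t)
print(model.coef_)                                 # several entries are exactly 0
print(np.count_nonzero(model.coef_))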


Lagrangian Dual View of the Regulariser

By the Lagrange multiplier method, minimization of the regularized error function

(1/2) ∑_{n=1}^{N} (tn − wTφ(xn))² + (λ/2) ∑_{j=1}^{M} |wj|^q

is equivalent to minimizing the unregularized sum-of-squares error

(1/2) ∑_{n=1}^{N} (tn − wTφ(xn))²   subject to   ∑_{j=1}^{M} |wj|^q ≤ η.

This yields the figures on the next slide.


Comparison of Quadratic and Lasso Regulariser

Quadratic regulariser:

(1/2) ∑_{j=1}^{M} wj²

Lasso regulariser:

(1/2) ∑_{j=1}^{M} |wj|

[Figure: contours of the unregularised error together with the constraint region ∑j |wj|^q ≤ η in the (w1, w2) plane, for the quadratic (circular region) and lasso (diamond-shaped region) regularisers; the constrained optimum w⋆ for the lasso can lie at a corner where w1 = 0, giving a sparse solution.]


Multiple Outputs

More than one target variable per data point.
y becomes a vector instead of a scalar. Each dimension can be treated with a different set of basis functions (and that may be necessary if the data in the different target dimensions represent very different types of information).
Here we restrict ourselves to the SAME basis functions for all outputs:

y(x,W) = WTφ(x)

where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ0(x), . . . , φM−1(x))T, with φ0(x) = 1, as before.
Define the target matrix T whose nth row is the target vector tnT.


Multiple Outputs

Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form

p(t | x,W, β) = N(t |WTφ(x), β−1I).

The log likelihood is then

ln p(T |X,W, β) = ∑_{n=1}^{N} ln N(tn |WTφ(xn), β−1I)

                = (NK/2) ln(β/2π) − (β/2) ∑_{n=1}^{N} ‖tn −WTφ(xn)‖²


Multiple Outputs

Maximisation with respect to W results in

WML = (ΦTΦ)−1ΦTT.

For each target variable tk, we get

wk = (ΦTΦ)−1ΦT tk = Φ†tk.

The solution decouples between the different target variables.
This also holds for a general Gaussian noise distribution with an arbitrary covariance matrix.
Why? W defines only the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
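A quick numerical check (with made-up two-output data) that the joint solution equals the per-output solutions stacked as columns:

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 30)
Phi = np.vander(x, N=4, increasing=True)                 # shared basis, M = 4
T = np.column_stack([np.sin(2.0 * np.pi * x),            # K = 2 target columns
                     np.cos(2.0 * np.pi * x)])
T = T + rng.normal(scale=0.1, size=T.shape)

# Joint solution: W_ML = (Phi^T Phi)^{-1} Phi^T T, an M x K matrix.
W_ml = np.linalg.pinv(Phi) @ T

# Solve each output column separately and stack: w_k = pinv(Phi) @ t_k.
W_cols = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(T.shape[1])])
print(np.allclose(W_ml, W_cols))                         # True: the outputs decouple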


Loss Function for Regression

Over-fitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how do we find the correct value for the regularisation constant λ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.


Loss Function for Regression

Choose an estimator y(x) to estimate the target value t for each input x.
Choose a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x).
The expected loss is then

E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

Common choice: squared loss

L(t, y(x)) = {y(x) − t}².

Expected loss for the squared loss function:

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.


Loss Function for Regression

Expected loss for the squared loss function:

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.

Minimise E[L] by choosing the regression function

y(x) = ( ∫ t p(x, t) dt ) / p(x) = ∫ t p(t | x) dt = Et[t | x]

(Calculus of variations is not required to derive this result; we may work point-wise by fixing an x and using stationarity to solve for y(x). Why is that sufficient?)


Optimal Predictor for Squared Loss

The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x).

[Figure: the optimal prediction at x0 is y(x0) = E[t | x0], the mean of the conditional density p(t | x0).]


Analysing the Squared Loss (1)

Analyse the expected loss

E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.

Rewrite the squared loss:

{y(x) − t}² = {y(x) − E[t | x] + E[t | x] − t}²
            = {y(x) − E[t | x]}² + {E[t | x] − t}² + 2 {y(x) − E[t | x]} {E[t | x] − t}

Claim:

∫∫ {y(x) − E[t | x]} {E[t | x] − t} p(x, t) dx dt = 0.


Analysing the Squared Loss (2)

Claim:

∫∫ {y(x) − E[t | x]} {E[t | x] − t} p(x, t) dx dt = 0.

Separate the factor that depends only on x from the integral over t:

∫ {y(x) − E[t | x]} ( ∫ {E[t | x] − t} p(x, t) dt ) dx

Calculate the inner integral over t:

∫ {E[t | x] − t} p(x, t) dt = E[t | x] p(x) − p(x) ∫ t p(x, t)/p(x) dt
                            = E[t | x] p(x) − p(x) E[t | x]
                            = 0


Analysing the Squared Loss (3)

The expected loss is now

E[L] = ∫ {y(x) − E[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx          (1)

Minimise the first term by choosing y(x) = E[t | x] (as we saw already).
The second term represents the intrinsic variability of the target data and can be regarded as noise. It is independent of the choice of y(x), so it cannot be reduced by learning a better y(x).


The Bias-Variance Decomposition (1)

Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x):

h(x) = E[t | x] = ∫ t p(t | x) dt.

Since h(x) is unavailable to us, it must be estimated from a (finite) data set D.
D is a finite sample from the unknown joint distribution p(x, t).
Denote the dependency of the learned function on the data by y(x;D).
Evaluate the performance of the algorithm by taking the expectation ED[L] over all data sets D.


The Bias-Variance Decomposition (2)

Taking the expectation over data sets D, using Eqn (1), and interchanging the order of expectations for the first term:

ED[E[L]] = ∫ ED[{y(x;D) − h(x)}²] p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt

Again, add and subtract the expectation ED[y(x;D)]:

{y(x;D) − h(x)}² = { y(x;D) − ED[y(x;D)] + ED[y(x;D)] − h(x) }²

and show that the mixed term vanishes under the expectation ED.


The Bias-Variance Decomposition (3)

Expected loss ED[L] over all data sets D:

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ {ED[y(x;D)] − h(x)}² p(x) dx

variance = ∫ ED[{y(x;D) − ED[y(x;D)]}²] p(x) dx

noise = ∫∫ {h(x) − t}² p(x, t) dx dt.

(bias)²: how accurate is a model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
variance: how sensitive is the model to small changes in the training set? (How much do solutions for individual data sets vary around their average?)
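The decomposition can be estimated empirically by fitting many data sets drawn from the same source, as in the sketch below; the sinusoidal generator, the regularised polynomial model, and all constants are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(6)
n_datasets, n_points, lam = 100, 25, 1e-3
x_grid = np.linspace(0.0, 1.0, 200)
h = np.sin(2.0 * np.pi * x_grid)                 # the regression function h(x), known here by construction

def design(x, M=10):
    return np.vander(x, N=M, increasing=True)

predictions = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x = rng.uniform(0.0, 1.0, size=n_points)
    t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n_points)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    predictions[d] = design(x_grid) @ w          # y(x; D) for this data set D

avg = predictions.mean(axis=0)                   # estimate of E_D[y(x; D)]
bias_sq = np.mean((avg - h) ** 2)                # (bias)^2 averaged over a uniform p(x)
variance = np.mean(predictions.var(axis=0))      # variance term averaged over x
print(bias_sq, variance)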


The Bias-Variance Decomposition

Simple models have low variance and high bias.

[Figure: fits with ln λ = 2.6. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red, and the sinusoidal function from which the data were generated in green.]


The Bias-Variance Decomposition

Complex models have high variance and low bias.

[Figure: fits with ln λ = −2.4. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red, and the sinusoidal function from which the data were generated in green.]


The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity.
Squared bias, variance, their sum, and the test error.
The minimum of (bias)² + variance occurs close to the value of ln λ that gives the minimum test error.

[Figure: (bias)², variance, (bias)² + variance, and test error plotted against ln λ.]


Unbiased Estimators

You may have encountered unbiased estimators.
Why guarantee zero bias? To quote the pioneer of Bayesian inference, Edwin Jaynes, from his book Probability Theory: The Logic of Science (2003):

[Quotation reproduced on the original slide.]


The Bias-Variance Decomposition

Tradeoff between bias and variance:
  simple models have low variance and high bias
  complex models have high variance and low bias
The sum of bias and variance has a minimum at a certain model complexity.
Expected loss ED[L] over all data sets D:

expected loss = (bias)² + variance + noise.

The noise comes from the data and cannot be removed from the expected loss.
To analyse the bias-variance decomposition, many data sets are needed, which are not always available.

Linear Regression 1
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares
Sequential Learning
Regularized Least Squares
Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition