
Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part III
Linear Regression 1
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares
Sequential Learning
Regularized Least Squares
Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition

Linear Regression
N = 10
x ≡ (x_1, …, x_N)^T, t ≡ (t_1, …, t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, …, N
Predictor y(x, w)? Performance measure? Optimal solution w∗?
Recall: projection, inverse.
(Figure: scatter plot of the N = 10 training points, t versus x.)

Probabilities, Losses
Gaussian Distribution
Bayes Rule
Expected Loss

Linear Curve Fitting – Least Squares
N = 10
x ≡ (x_1, …, x_N)^T, t ≡ (t_1, …, t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, …, N
Model: y(x, w) = w_1 x + w_0, with design matrix X ≡ [x 1]
Least-squares solution: w∗ = (X^T X)^{-1} X^T t
We assume t = y(x, w) + ε, where y(x, w) is the deterministic part and ε is Gaussian noise.
(Figure: the N = 10 training points, t versus x, with the fitted line.)
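To make this concrete, here is a minimal NumPy sketch of the least-squares fit; the synthetic data below simply stand in for the N = 10 points of the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the N = 10 training points in the figure.
N = 10
x = rng.uniform(0, 10, size=N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)   # t = y(x, w) + Gaussian noise

# Design matrix X = [x 1] and least-squares solution w* = (X^T X)^{-1} X^T t.
X = np.column_stack([x, np.ones(N)])
w_star = np.linalg.solve(X.T @ X, X.T @ t)

print("w1, w0 =", w_star)                        # slope and intercept
print("prediction at x = 5:", np.array([5.0, 1.0]) @ w_star)
```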

Curve fitting – revisited
a priori belief about the parameter w captured in the prior probability p(w)
observed data D = {t1,…,tN}
calculate the belief in w after the data D have been observed:
p(w | D) = p(D | w) p(w) / p(D)
p(D | w) as a function of w: the likelihood function
the likelihood expresses how probable the data are for different values of w; it is not a probability density with respect to w (but it is with respect to D; prove it)

Maximum Likelihood
Consider the linear regression problem, where we have random variables xn and tn.
We assume a conditional model tn|xn
We propose a distribution, parameterized by θ
tn|xn ∼ density(θ)
For a given θ, the density defines the probability of observing tn | xn.
We are interested in finding θ that maximises the probability (called the likelihood) of the data.

Likelihood Function – Frequentist versus Bayesian
Frequentist Approach
w considered a fixed parameter
value defined by some 'estimator'
error bars on the estimated w obtained from the distribution of possible data sets D
Bayesian Approach
only one single data set D
uncertainty in the parameters comes from a probability distribution over w
The likelihood function p(D | w) plays a central role in both approaches.

Frequentist Estimator – Maximum Likelihood
choose w for which the likelihood p(D | w) (the probability of the observed data) is maximal
the most common heuristic for learning a single fixed w
equivalently: the error function is the negative log of the likelihood function, to be minimised
log is a monotonic function, so maximising the likelihood ⇐⇒ minimising the error
Example: a fair-looking coin is tossed three times, always landing on heads.
The maximum likelihood estimate of the probability of landing heads is 1.
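A small sketch of this coin example: evaluating the Bernoulli likelihood of three heads in three tosses on a grid of candidate probabilities shows the maximiser sitting at 1.

```python
import numpy as np

heads, tosses = 3, 3                        # three tosses, all heads
theta = np.linspace(1e-6, 1 - 1e-6, 1001)   # candidate values for P(heads)

# Bernoulli likelihood of the observed data as a function of theta.
likelihood = theta**heads * (1 - theta)**(tosses - heads)

theta_ml = theta[np.argmax(likelihood)]
print("maximum likelihood estimate:", theta_ml)   # ~1.0: heads is judged certain
```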

Bayesian Approach
including prior knowledge is easy (via the prior p(w))
subjective choice of prior allows better results by incorporating domain knowledge
sometimes the choice of prior is motivated by a convenient mathematical form
the prior becomes irrelevant as N → ∞, but helps for small N
need to sum/integrate over the whole parameter space
advances in sampling (Markov Chain Monte Carlo methods)
advances in approximation schemes (Variational Bayes, Expectation Propagation)
there is no true w

Regression
Given a training data set of N observations {x_n} and target values t_n.
Goal: learn to predict the value of one or more target values t given a new value of the input x.
Example: polynomial curve fitting (see Introduction).
(Figure: training points t versus x, with a new input location marked "?".)

Supervised Learning: (non-Bayesian) Point Estimate
Training Phase: the training data x and training targets t are used to fit a model with adjustable parameter w.
Test Phase: fix the most appropriate w*; the model with fixed parameter w* predicts the test target t for new test data x.

Why Linear Regression?
Analytic solution when minimising the sum of squared errors
Well understood statistical behaviour
Efficient algorithms exist for convex losses and regularizers
But what if the relationship is non-linear?
(Figure: example data whose relationship between x and t is clearly non-linear.)

Linear Basis Function Models
Linear combination of fixed nonlinear basis functions
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
parameters w = (w_0, …, w_{M−1})^T
basis functions φ(x) = (φ_0(x), …, φ_{M−1}(x))^T, with the convention φ_0(x) = 1
w_0 is the bias parameter
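A minimal sketch of evaluating y(x, w) = w^T φ(x) with a polynomial basis; the basis choice, weights and inputs below are illustrative:

```python
import numpy as np

def poly_basis(x, M):
    """phi(x) = (1, x, x^2, ..., x^(M-1)) for each scalar input x."""
    x = np.atleast_1d(x)
    return np.column_stack([x**j for j in range(M)])  # phi_0(x) = 1 is the first column

M = 4
w = np.array([0.5, -1.0, 0.3, 0.1])      # illustrative weights; w[0] is the bias

x = np.linspace(-1, 1, 5)
Phi = poly_basis(x, M)                    # rows are phi(x_n)^T
y = Phi @ w                               # y(x_n, w) = w^T phi(x_n)
print(y)
```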

Polynomial Basis Functions
Scalar input variable x:
φ_j(x) = x^j
Limitation: polynomials are global functions of the input variable x, so the learned function will extrapolate poorly.
(Figure: polynomial basis functions x^j plotted on [−1, 1].)

’Gaussian’ Basis Functions
Scalar input variable x:
φ_j(x) = exp( −(x − μ_j)^2 / (2 s^2) )
Not a probability distribution.
No normalisation required; it is taken care of by the model parameters w.
Well behaved away from the data (though pulled to zero).
(Figure: Gaussian basis functions with centres μ_j spread over [−1, 1].)
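A minimal NumPy sketch of these basis functions; the centres μ_j and width s below are illustrative choices:

```python
import numpy as np

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre."""
    x = np.atleast_1d(x)[:, None]
    return np.exp(-(x - centres[None, :])**2 / (2.0 * s**2))

centres = np.linspace(-1, 1, 9)           # mu_j spread over the input range
s = 0.25                                  # common width parameter

x = np.linspace(-1, 1, 5)
Phi = gaussian_basis(x, centres, s)
print(Phi.shape)                          # (5, 9): one row per input, one column per basis function
```

In a full model a constant feature φ_0(x) = 1 would typically be included alongside these columns for the bias term.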

Sigmoidal Basis Functions
Scalar input variable x:
φ_j(x) = σ( (x − μ_j) / s )
where σ(a) is the logistic sigmoid function defined by
σ(a) = 1 / (1 + exp(−a))
σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(a) − 1.
(Figure: sigmoidal basis functions with centres μ_j.)

Other Basis Functions
Fourier basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (also mutually orthogonal to simplify application).
Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where pieces meet (e.g. smoothness constraints → conditions on the derivatives).
(Figure: approximating the points {(0, 0), (1, 1), (2, −1), (3, 0), (4, −2), (5, 1)} by linear, quadratic, cubic and quartic splines.)

Maximum Likelihood and Least Squares
No special assumption about the basis functions φ_j(x). In the simplest case, one can think of φ_j(x) = x^j, or φ(x) = x.
Assume the target t is given by
t = y(x, w) + ε
where y(x, w) is the deterministic part and ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus
p(t | x, w, β) = N(t | y(x, w), β^{-1})
(Figure: the conditional density p(t | x_0), centred on y(x_0), for the regression function y(x).)

Maximum Likelihood and Least Squares
The likelihood of one target t given the input x was
p(t | x, w, β) = N(t | y(x, w), β^{-1})
Now consider a set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets t is then
p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})
From now on we drop the conditioning variable X from the notation, as in supervised learning we do not seek to model the distribution of the input data.
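As a sketch, this likelihood can be evaluated numerically for a given w and β (on the log scale, to avoid underflow of the product); the data-generating line and the parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data from a straight line with Gaussian noise of precision beta.
N, beta_true = 20, 25.0
x = rng.uniform(-1, 1, size=N)
t = 1.5 * x - 0.3 + rng.normal(scale=1 / np.sqrt(beta_true), size=N)

Phi = np.column_stack([np.ones(N), x])    # phi(x) = (1, x)^T

def log_likelihood(w, beta):
    """ln p(t | w, beta) = sum_n ln N(t_n | w^T phi(x_n), beta^-1)."""
    resid = t - Phi @ w
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi) - 0.5 * beta * resid @ resid

print(log_likelihood(np.array([-0.3, 1.5]), beta_true))   # near the generating parameters
print(log_likelihood(np.array([0.0, 0.0]), beta_true))    # a worse fit gives a lower value
```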

Maximum Likelihood and Least Squares
Consider the logarithm of the likelihood p(t | w, β) (the logarithm is a monotone function!)
ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{-1})
              = Σ_{n=1}^{N} ln [ √(β/(2π)) exp( −(β/2) (t_n − w^T φ(x_n))^2 ) ]
              = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
where the sum-of-squares error function is
E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}^2.
argmax_w ln p(t | w, β) → argmin_w E_D(w)

Maximum Likelihood and Least Squares
Goal: find a more compact representation. Rewrite the error function
E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}^2 = (1/2) (t − Φw)^T (t − Φw)
where t = (t_1, …, t_N)^T and Φ is the N × M design matrix
Φ = [ φ_0(x_1)  φ_1(x_1)  …  φ_{M−1}(x_1)
      φ_0(x_2)  φ_1(x_2)  …  φ_{M−1}(x_2)
      ⋮          ⋮          ⋱  ⋮
      φ_0(x_N)  φ_1(x_N)  …  φ_{M−1}(x_N) ]

Maximum Likelihood and Least Squares
The log likelihood is now
ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
              = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φw)^T (t − Φw)
Find the critical points of ln p(t | w, β). The gradient with respect to w is
∇_w ln p(t | w, β) = β Φ^T (t − Φw).
Setting the gradient to zero gives
0 = Φ^T t − Φ^T Φ w,
which results in
w_ML = (Φ^T Φ)^{-1} Φ^T t = Φ† t
where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
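A minimal sketch of computing w_ML on synthetic data; np.linalg.lstsq solves the same least-squares problem as the pseudo-inverse formula, but more stably. The noisy sinusoid and the Gaussian basis below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a noisy sinusoid, a common illustrative example.
N = 25
x = rng.uniform(0, 1, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with a bias column and Gaussian basis functions.
centres = np.linspace(0, 1, 8)
s = 0.15
Phi = np.column_stack([np.ones(N)] +
                      [np.exp(-(x - mu)**2 / (2 * s**2)) for mu in centres])

# w_ML = pinv(Phi) @ t; lstsq solves the same problem via a stable factorisation.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)
```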

Maximum Likelihood and Least Squares
The log likelihood with the optimal w_ML is now
ln p(t | w_ML, β) = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φ w_ML)^T (t − Φ w_ML)
Finding the critical points of ln p(t | w_ML, β) with respect to β,
∂ ln p(t | w_ML, β) / ∂β = 0
results in
1/β_ML = (1/N) (t − Φ w_ML)^T (t − Φ w_ML)
Note: we can first find the maximum likelihood solution for w, as this does not depend on β. Then we can use w_ML to find the maximum likelihood solution for β.
Could we have chosen to optimise with respect to β first, and then with respect to w?

Sequential Learning – Stochastic Gradient Descent
For large data sets, calculating the maximum likelihood parameters w_ML and β_ML may be costly.
For online applications, the data are never all in memory, so a sequential (online) algorithm is used.
If the error function is a sum over data points, E = Σ_n E_n, then:
1 initialise w^(0) to some starting value
2 update the parameter vector at iteration τ + 1 by
  w^(τ+1) = w^(τ) − η ∇E_n,
where E_n is the error function after presenting the nth data point, and η is the learning rate.

Sequential Learning – Stochastic Gradient Descent
For the sum-of-squares error function, stochastic gradient descent results in
w^(τ+1) = w^(τ) + η (t_n − w^(τ)T φ(x_n)) φ(x_n)
The value of the learning rate must be chosen carefully. Too large a learning rate may prevent the algorithm from converging; too small a learning rate makes the algorithm follow the data too slowly.
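A minimal sketch of this update rule on synthetic data; the basis, learning rate and number of passes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear data; phi(x) = (1, x)^T.
N = 200
x = rng.uniform(-1, 1, size=N)
t = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=N)
Phi = np.column_stack([np.ones(N), x])

w = np.zeros(2)        # w^(0)
eta = 0.05             # learning rate

# Several passes over the data, one point per update:
# w^(tau+1) = w^(tau) + eta * (t_n - w^(tau)^T phi(x_n)) phi(x_n)
for epoch in range(20):
    for n in rng.permutation(N):
        w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]

print(w)               # close to (0.5, 2.0), the generating intercept and slope
```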

Regularized Least Squares
Add regularisation in order to prevent overfitting:
E_D(w) + λ E_W(w)
with regularisation coefficient λ.
Simple quadratic regulariser
E_W(w) = (1/2) w^T w
Minimising the regularised error gives
w = (λI + Φ^T Φ)^{-1} Φ^T t
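A minimal sketch of the regularised solution w = (λI + Φ^T Φ)^{-1} Φ^T t; the polynomial basis and the value of λ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy sinusoid fitted with a fairly flexible polynomial basis, where regularisation helps.
N, M, lam = 15, 9, 1e-3
x = rng.uniform(0, 1, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
Phi = np.column_stack([x**j for j in range(M)])

# Regularised least-squares (ridge) solution.
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```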

Regularized Least Squares
More general regulariser
E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|^q
q = 1 (the lasso) leads to a sparse model if λ is large enough.
(Figure: contours of the regularisation term for q = 0.5, q = 1, q = 2 and q = 4.)

Lagrangian Dual View of the Regulariser
By the Lagrange multiplier method, minimization of the regularized error function
(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q
is equivalent to minimizing the unregularized sum-of-squares error
(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))^2   subject to   Σ_{j=1}^{M} |w_j|^q ≤ η.
This yields the figures on the next slide.

Comparison of Quadratic and Lasso Regulariser
Quadratic regulariser
E_W(w) = (1/2) Σ_{j=1}^{M} w_j^2
Lasso regulariser
E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|
(Figure: contours of the unregularised error over (w_1, w_2) together with the constraint region of each regulariser; the constrained optimum w⋆ for the lasso lies on a corner of its region, so some weights are exactly zero.)

Multiple Outputs
More than one target variable per data point.
y becomes a vector instead of a scalar. Each dimension can be treated with a different set of basis functions (and that may be necessary if the data in the different target dimensions represent very different types of information).
Here we restrict ourselves to the SAME basis functions for all dimensions:
y(x, W) = W^T φ(x)
where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ_0(x), …, φ_{M−1}(x))^T, with φ_0(x) = 1, as before.
Define the target matrix T containing the target vector t_n^T in its nth row.

Multiple Outputs
Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form
p(t | x, W, β) = N(t | W^T φ(x), β^{-1} I).
The log likelihood is then
ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β^{-1} I)
                  = (NK/2) ln(β/(2π)) − (β/2) Σ_{n=1}^{N} ∥t_n − W^T φ(x_n)∥^2

Multiple Outputs
Maximisation with respect to W results in
W_ML = (Φ^T Φ)^{-1} Φ^T T.
For each target variable t_k, we get
w_k = (Φ^T Φ)^{-1} Φ^T t_k = Φ† t_k.
The solutions for the different target variables decouple.
This also holds for a general Gaussian noise distribution with an arbitrary covariance matrix.
Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
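A short sketch illustrating the decoupling on synthetic data: solving for all K target columns at once gives the same W as solving each column separately.

```python
import numpy as np

rng = np.random.default_rng(5)

N, M, K = 30, 4, 3
Phi = rng.normal(size=(N, M))              # design matrix
T = rng.normal(size=(N, K))                # one target vector t_n^T per row

# Joint solution W_ML = pinv(Phi) @ T ...
W_joint = np.linalg.pinv(Phi) @ T

# ... equals the column-by-column solutions w_k = pinv(Phi) @ t_k.
W_columns = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(K)])

print(np.allclose(W_joint, W_columns))     # True: the targets decouple
```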

Loss Function for Regression
Over-fitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how do we find the correct value for the regularisation constant λ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.

Loss Function for Regression
Choose an estimator y(x) to estimate the target value t for each input x.
Choose a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x).
The expected loss is then
E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
Common choice: the squared loss
L(t, y(x)) = {y(x) − t}^2.
Expected loss for the squared loss function:
E[L] = ∫∫ {y(x) − t}^2 p(x, t) dx dt.

Loss Function for Regression
Expected loss for the squared loss function:
E[L] = ∫∫ {y(x) − t}^2 p(x, t) dx dt.
Minimise E[L] by choosing the regression function
y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t | x) dt = E_t[t | x]
(Calculus of variations is not required to derive this result; we may work point-wise by fixing an x and using stationarity to solve for y(x). Why is that sufficient?)
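A quick Monte Carlo sketch of this result under an illustrative joint distribution p(x, t): the conditional mean achieves a lower average squared loss than the other predictors tried.

```python
import numpy as np

rng = np.random.default_rng(6)

# Joint p(x, t): x ~ Uniform(0, 1), t | x ~ N(sin(2 pi x), 0.3^2),
# so the conditional mean is E[t | x] = sin(2 pi x).
S = 200_000
x = rng.uniform(0, 1, size=S)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=S)

def expected_loss(y):
    """Monte Carlo estimate of E[{y(x) - t}^2]."""
    return np.mean((y(x) - t) ** 2)

print(expected_loss(lambda x: np.sin(2 * np.pi * x)))       # conditional mean: ~0.09 (the noise variance)
print(expected_loss(lambda x: 0.8 * np.sin(2 * np.pi * x))) # any other predictor does worse
print(expected_loss(lambda x: np.zeros_like(x)))
```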

Optimal Predictor for Squared Loss
The regression function which minimises the expected squared loss, is given by the mean of the conditional distribution p(t | x).
(Figure: the conditional density p(t | x_0), whose mean y(x_0) lies on the regression function y(x).)

Analysing the Squared Loss (1)
Analyse the expected loss
E[L] = ∫∫ {y(x) − t}^2 p(x, t) dx dt.
Rewrite the squared loss:
{y(x) − t}^2 = {y(x) − E[t | x] + E[t | x] − t}^2
             = {y(x) − E[t | x]}^2 + {E[t | x] − t}^2 + 2 {y(x) − E[t | x]} {E[t | x] − t}
Claim:
∫∫ {y(x) − E[t | x]} {E[t | x] − t} p(x, t) dx dt = 0.

Analysing the Squared Loss (2)
Claim:
∫∫ {y(x) − E[t | x]} {E[t | x] − t} p(x, t) dx dt = 0.
Separate the factor depending only on x from the integral over t:
∫ {y(x) − E[t | x]} ( ∫ {E[t | x] − t} p(x, t) dt ) dx
Calculate the inner integral over t:
∫ {E[t | x] − t} p(x, t) dt = E[t | x] p(x) − p(x) ∫ t p(x, t)/p(x) dt
                            = E[t | x] p(x) − p(x) E[t | x]
                            = 0

Analysing the Squared Loss (3)
The expected loss is now
E[L] = ∫ {y(x) − E[t | x]}^2 p(x) dx + ∫ var[t | x] p(x) dx    (1)
Minimise the first term by choosing y(x) = E[t | x] (as we saw already).
The second term represents the intrinsic variability of the target data (it can be regarded as noise). It is independent of the choice of y(x) and cannot be reduced by learning a better y(x).

The Bias-Variance Decomposition (1)
Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x):
h(x) = E[t | x] = ∫ t p(t | x) dt.
Since h(x) is unavailable to us, it must be estimated from a (finite) dataset D.
D is a finite sample from the unknown joint distribution p(x, t).
Denote the dependency of the learned function on the data by y(x; D).
Evaluate the performance of the algorithm by taking the expectation E_D[L] over all data sets D.

The Bias-Variance Decomposition (2)
Taking the expectation over data sets D, using Eqn 1, and interchanging the order of expectations for the first term:
E_D[E[L]] = ∫ E_D[ {y(x; D) − h(x)}^2 ] p(x) dx + ∫∫ {h(x) − t}^2 p(x, t) dx dt
Again, add and subtract the expectation E_D[y(x; D)]:
{y(x; D) − h(x)}^2 = { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }^2
and show that the mixed term vanishes under the expectation E_D.

The Bias-Variance Decomposition (3)
Expected loss E_D[L] over all data sets D:
expected loss = (bias)^2 + variance + noise
where
(bias)^2 = ∫ {E_D[y(x; D)] − h(x)}^2 p(x) dx
variance = ∫ E_D[ {y(x; D) − E_D[y(x; D)]}^2 ] p(x) dx
noise = ∫∫ {h(x) − t}^2 p(x, t) dx dt.
(bias)^2: how accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
variance: how sensitive is the model to small changes in the training set? (How much do the solutions for individual data sets vary around their average?)
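A sketch of estimating these three terms empirically for a regularised polynomial model, averaging over many synthetic data sets drawn from a noisy sinusoid; all settings below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def h(x):                                   # the true regression function h(x) = E[t | x]
    return np.sin(2 * np.pi * x)

N, M, lam, noise_sd, L_sets = 25, 10, 1e-2, 0.3, 100
x_test = np.linspace(0, 1, 200)
Phi_test = np.column_stack([x_test**j for j in range(M)])

preds = []
for _ in range(L_sets):                     # one fit y(x; D) per data set D
    x = rng.uniform(0, 1, size=N)
    t = h(x) + rng.normal(scale=noise_sd, size=N)
    Phi = np.column_stack([x**j for j in range(M)])
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    preds.append(Phi_test @ w)
preds = np.array(preds)                     # shape (L_sets, len(x_test))

avg_pred = preds.mean(axis=0)               # E_D[y(x; D)]
bias2 = np.mean((avg_pred - h(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
noise = noise_sd ** 2                       # known here because we generated the data

print(bias2, variance, noise)               # expected loss ~ bias^2 + variance + noise
```

Varying λ in this sketch traces out the trade-off shown on the following slides: large λ increases the squared bias, small λ increases the variance.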

The Bias-Variance Decomposition
Simple models have low variance and high bias.
(Figure: ln λ = 2.6. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red; the sinusoidal function from which the data were generated in green.)

The Bias-Variance Decomposition
Complex models have high variance and low bias.
(Figure: ln λ = −2.4. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red; the sinusoidal function from which the data were generated in green.)

The Bias-Variance Decomposition
Dependence of bias and variance on the model complexity.
Squared bias, variance, their sum, and the test error are shown as functions of ln λ.
The minimum of (bias)^2 + variance occurs close to the value of ln λ that gives the minimum test error.
(Figure: (bias)^2, variance, (bias)^2 + variance, and test error plotted against ln λ.)

Unbiased Estimators
You may have encountered unbiased estimators
Why guarantee zero bias? To quote the pioneer of Bayesian inference, Edwin Jaynes, from his book Probability Theory: The Logic of Science (2003):

The Bias-Variance Decomposition
Trade-off between bias and variance:
simple models have low variance and high bias
complex models have high variance and low bias
The sum (bias)^2 + variance has a minimum at a certain model complexity.
Expected loss E_D[L] over all data sets D:
expected loss = (bias)^2 + variance + noise.
The noise comes from the data and cannot be removed from the expected loss.
To analyse the bias-variance decomposition, many data sets are needed, which are not always available.