Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part VIII
Neural Networks 2
Neural Networks 2: Overview
Recall: we would like gradients w.r.t. parameters so that
we can optimise.
Today: gradients of neural network parameters via the
backpropagation of gradients algorithm.
Regularisation/model selection.
Incorporating invariances/domain knowledge.
Bayesian neural network (Laplace’s method).
Good News
We study backpropagation for pedagogical reasons: in practice one uses automatic differentiation, which is far more general and efficient (see e.g. the especially easy to use PyTorch).
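As a minimal illustration (added here, not from the slides) of automatic differentiation, the following PyTorch sketch computes the gradient of a squared error through a tiny two-layer network; the shapes and names (W1, W2) are arbitrary toy choices.

```python
# Minimal sketch: automatic differentiation with PyTorch autograd.
# The network and data here are toy examples.
import torch

torch.manual_seed(0)
x = torch.randn(3)                            # input
t = torch.randn(2)                            # target
W1 = torch.randn(4, 3, requires_grad=True)    # first-layer weights
W2 = torch.randn(2, 4, requires_grad=True)    # second-layer weights

y = W2 @ torch.tanh(W1 @ x)                   # forward pass: two-layer network
E = 0.5 * torch.sum((y - t) ** 2)             # squared error

E.backward()                                  # reverse-mode autodiff (backpropagation)
print(W1.grad.shape, W2.grad.shape)           # gradients w.r.t. all weights
```

Calling E.backward() populates W1.grad and W2.grad with exactly the gradients that the backpropagation derivation below computes by hand.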
Chain Rule / Total Derivative
The composition of two functions is given by
$$ (f \circ g)(x) = f(g(x)). $$
Let $f$ and $g$ be differentiable functions with derivatives $f'$ and $g'$ respectively.
Chain rule:
$$ \frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x). $$
If we write $u = g(x)$ and $y = f(u)$,
$$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}. $$
In the multivariate case we also need the total derivative, e.g.
$$ \frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}. $$
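As a quick numerical check (an added illustration, not part of the slides), the total derivative formula can be verified against a central finite difference for a toy choice of f, x(t) and y(t):

```python
# Minimal sketch: numerically verifying the total derivative
# d/dt f(x(t), y(t)) = df/dx * dx/dt + df/dy * dy/dt
# for the toy example f(x, y) = x * y^2, x(t) = sin(t), y(t) = t^2.
import math

def f(x, y): return x * y**2
def x_of(t): return math.sin(t)
def y_of(t): return t**2

t, eps = 0.7, 1e-6

# left-hand side: central difference of the composition
lhs = (f(x_of(t + eps), y_of(t + eps)) - f(x_of(t - eps), y_of(t - eps))) / (2 * eps)

# right-hand side: partial derivatives times the time derivatives
df_dx = y_of(t) ** 2             # df/dx = y^2
df_dy = 2 * x_of(t) * y_of(t)    # df/dy = 2xy
dx_dt = math.cos(t)
dy_dt = 2 * t
rhs = df_dx * dx_dt + df_dy * dy_dt

print(lhs, rhs)   # the two values agree up to finite-difference error
```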
Error Backpropagation
Goal: efficiently update the weights in order to find a local minimum of some error function E(w), using the gradient of the error function.
Core ideas:
1 Propagate the errors backwards through the network to efficiently calculate the gradient.
2 Update the weights using the calculated gradient.
Sequential procedure: calculate the gradient and update the weights for each data/target pair.
Batch procedure: collect gradient information for all data/target pairs at the same weight setting, then adjust the weights.
Main question in both cases: How to calculate the gradient
of E(w) given one data/target pair?
Error Backpropagation
Assume the error is a sum over errors for each data/target pair
$$ E(w) = \sum_{n=1}^{N} E_n(w). $$
After applying input $x_n$ to the network, we get the output $y_n$ and calculate the error $E_n(w)$.
What is the gradient for one such term $E_n(w)$?
Note: in the following, we will drop the subscript $n$ in order to unclutter the equations.
Notation: the input pattern is $x$; the scalar $x_i$ is the $i$-th component of the input pattern $x$.
Backprop – One Layer – Scalar View
Simple linear model without hidden layers
One layer only, identity function as activation function!
$$ y_k = \sum_l w_{kl}\, x_l, $$
and error after applying input $x_n$
$$ E_n(w) = \frac{1}{2} \sum_k (y_k - t_k)^2. $$
The gradient with respect to $w_{ji}$ is now
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \sum_k (y_k - t_k) \frac{\partial y_k}{\partial w_{ji}} = \sum_k (y_k - t_k) \frac{\partial}{\partial w_{ji}} \sum_l w_{kl}\, x_l = \sum_k (y_k - t_k) \sum_l x_l\, \delta_{jk}\delta_{il} = (y_j - t_j)\, x_i, $$
where $\delta_{jk}$ denotes the Kronecker delta.
Backprop – One Layer – Vector Calculus
Vector setup:
$$ y = W x, \qquad W \in \mathbb{R}^{D_2 \times D_1}, \quad x \in \mathbb{R}^{D_1}, \quad y \in \mathbb{R}^{D_2}. $$
Error after applying input training pair $(x, t)$:
$$ E_n(W) = \frac{1}{2} \| y - t \|^2. $$
Using the vector calculus rules gives
$$ \nabla_W E_n(W) = \nabla_W \tfrac{1}{2} \| y - t \|^2 = (y - t)\, \nabla_W y = (y - t)\, x^\top. $$
FYI Only:
Backprop – One Layer – Directional Derivative
Do the same using the directional derivative:
$$ y = W x, \qquad W \in \mathbb{R}^{D_2 \times D_1}, $$
and error after applying input training pair $(x, t)$
$$ E_n(W) = \frac{1}{2} (y - t)^\top (y - t) = \frac{1}{2} (Wx - t)^\top (Wx - t). $$
Define $\nabla_d f(x) = \frac{d}{dh}\big|_{h=0} f(x + h d)$.
Relate this to the gradient by $\nabla_d f(x) = \langle \nabla f(x), d \rangle$.
The directional derivative with respect to $W$ in direction $\xi$ is now
$$ \nabla_\xi E_n(W) = \frac{1}{2} \left( (\xi x)^\top (y - t) + (y - t)^\top \xi x \right) = x^\top \xi^\top (y - t). $$
With the canonical inner product $\langle A, B \rangle = \mathrm{tr}\{A^\top B\}$, the directional derivative of $E_n(W)$ is
$$ D E_n(W)(\xi) = \mathrm{tr}\{\underbrace{x^\top \xi^\top (y - t)}_{\text{scalar}}\} = \mathrm{tr}\{\xi^\top \underbrace{(y - t)\, x^\top}_{\text{gradient}}\}, $$
so the gradient is again $(y - t)\, x^\top$.
Error Backpropagation
The gradient
$$ \nabla_W E_n(W) = (y - t)\, x^\top, $$
or in components
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = (y_j - t_j)\, x_i, $$
looks like the product of the output error $(y_j - t_j)$ with the input $x_i$ associated with the edge for $w_{ji}$ in the network diagram.
Can we generalise this idea to nonlinear activation
functions?
Error Backpropagation
Now consider a network with nonlinear activation functions $h(\cdot)$: the units $z_i$ in one layer are connected by edges with weights $w_{ji}$ to the units $z_j$ in the next layer via
$$ a_j = \sum_i w_{ji}\, z_i, \qquad z_j = h(a_j). $$
Use the chain rule to calculate the gradient
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \frac{\partial E_n(w)}{\partial a_j}\, \frac{\partial a_j}{\partial w_{ji}} = \delta_j\, z_i, $$
where we defined the error (a slight misnomer hailing from the derivative of the squared error)
$$ \delta_j = \frac{\partial E_n(w)}{\partial a_j}. $$
Same intuition as before: gradient is output error times the
input associated with the edge for wji.
Error Backpropagation
Need to calculate the errors δ in every layer.
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \delta_j\, z_i, \qquad \delta_j = \frac{\partial E_n(w)}{\partial a_j}. $$
Start the recursion; for output units with squared error:
$$ \delta_k = y_k - t_k. $$
For the hidden units we use the total derivative, e.g.
$$ \frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}, $$
to calculate
$$ \delta_j = \frac{\partial E_n(w)}{\partial a_j} = \sum_k \frac{\partial E_n(w)}{\partial a_k}\, \frac{\partial a_k}{\partial a_j} = \sum_k \delta_k\, \frac{\partial a_k}{\partial a_j}, $$
using the definition of $\delta_k$.
Error Backpropagation
Express $a_k$ as a function of the incoming $a_j$:
$$ a_k = \sum_j w_{kj}\, z_j = \sum_j w_{kj}\, h(a_j), $$
and differentiate
$$ \frac{\partial a_k}{\partial a_j} = w_{kj}\, \frac{\partial h(a_j)}{\partial a_j} = w_{kj} \left. \frac{\partial h(s)}{\partial s} \right|_{s = a_j} = w_{kj}\, h'(a_j). $$
Finally, we get for the error in the previous layer
$$ \delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k. $$
Error Backpropagation
The backpropagation formula:
$$ \delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k. $$
The form of $h'(\cdot)$ is available, because we choose the activation function $h(\cdot)$ to be a fixed function such as tanh.
[Figure: part of a network showing units $z_i$ and $z_j$ connected by weight $w_{ji}$; the errors $\delta_k$ of the following layer are propagated backwards along the weights $w_{kj}$ to form $\delta_j$.]
Error Backpropagation Algorithms
1 Apply the input vector x to the network and forward
propagate through the network to calculate all activations
and outputs of each unit.
2 Compute the gradients of the error at the output.
3 Backpropagate the gradients backwards through the
network using the backpropagation formula.
4 Calculate all components of $\nabla E_n$ by
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \delta_j\, z_i. $$
5 Update the weights $w$ using $\frac{\partial E_n(w)}{\partial w_{ji}}$ (a code sketch of these steps follows below).
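The following NumPy sketch (added for illustration; the layer sizes, data and learning rate are arbitrary toy choices) walks through the five steps above for a network with one tanh hidden layer, linear outputs and squared error.

```python
# Minimal sketch: one backpropagation step for a network with one tanh
# hidden layer and linear outputs, following steps 1-5 above. Toy example.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # input pattern
t = rng.normal(size=2)                 # target
W1 = rng.normal(size=(3, 5)) * 0.1     # input -> hidden weights
W2 = rng.normal(size=(2, 3)) * 0.1     # hidden -> output weights
eta = 0.1                              # learning rate

# 1. Forward propagate: activations a and unit outputs z.
a1 = W1 @ x
z1 = np.tanh(a1)
y = W2 @ z1                            # identity activation at the output

# 2. Gradient of the error at the output: delta_k = y_k - t_k.
delta2 = y - t

# 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h' = 1 - tanh^2.
delta1 = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta2)

# 4. Components of the gradient: dE/dw_ji = delta_j * z_i.
grad_W2 = np.outer(delta2, z1)
grad_W1 = np.outer(delta1, x)

# 5. Update the weights (gradient descent).
W2 -= eta * grad_W2
W1 -= eta * grad_W1
```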
Error Backpropagation
For batch processing, we repeat backpropagation for each
pattern in the training set and then sum over all patterns
$$ \frac{\partial E(w)}{\partial w_{ji}} = \sum_{n=1}^{N} \frac{\partial E_n(w)}{\partial w_{ji}}. $$
Backpropagation can be generalised by assuming that
each node has a different activation function h(·).
Easy Backprop
Let
$$ z^{(0)} = x \ \text{(the input)}, \qquad a^{(l)} = W^{(l)} z^{(l-1)}, \qquad z^{(l)} = h(a^{(l)}), $$
$$ E = L(a^{(L)}) = L(y), \quad \text{e.g.} \quad L(y) = \frac{1}{2} \| y - t \|^2. $$
The gradients of $E$ w.r.t. the parameters are (total derivative)
$$ \frac{\partial E}{\partial W^{(l)}} = \frac{\partial E}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial W^{(l)}} = \delta^{(l)}\, z^{(l-1)\top}, $$
where (neglecting transposes; assume conformant shapes)
$$ \delta^{(l)} \equiv \frac{\partial E}{\partial a^{(l)}} = \frac{\partial a^{(l+1)}}{\partial a^{(l)}}\, \frac{\partial a^{(l+2)}}{\partial a^{(l+1)}} \cdots \frac{\partial a^{(L)}}{\partial a^{(L-1)}}\, \frac{\partial L(a^{(L)})}{\partial a^{(L)}} $$
has the recursion $\delta^{(L)} = \frac{\partial L(a^{(L)})}{\partial a^{(L)}}$ along with
$$ \delta^{(l-1)} = \frac{\partial a^{(l)}}{\partial a^{(l-1)}}\, \delta^{(l)}, \qquad \frac{\partial a^{(l)}}{\partial a^{(l-1)}} = \frac{\partial\, W^{(l)} h(a^{(l-1)})}{\partial a^{(l-1)}} = \mathrm{diag}\{h'(a^{(l-1)})\}\, W^{(l)\top}. $$
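A vectorised sketch of this recursion for a general stack of layers (added here for illustration; layer widths, data and the tanh/linear/squared-error choices are toy assumptions):

```python
# Minimal sketch: forward pass and the delta recursion
# delta^(l-1) = diag{h'(a^(l-1))} W^(l)^T delta^(l) for a stack of layers
# with tanh hidden units, linear output and squared error. Toy sizes.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 6, 5, 2]                          # layer widths, input -> output
Ws = [rng.normal(size=(m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])
t = rng.normal(size=sizes[-1])

# Forward pass: store activations a^(l) and unit outputs z^(l).
z = [x]
a = []
for l, W in enumerate(Ws):
    a.append(W @ z[-1])
    last = (l == len(Ws) - 1)
    z.append(a[-1] if last else np.tanh(a[-1]))   # linear output layer

# Backward pass: delta^(L) = dL/da^(L) = y - t for squared error.
delta = z[-1] - t
grads = [None] * len(Ws)
for l in reversed(range(len(Ws))):
    grads[l] = np.outer(delta, z[l])              # dE/dW^(l) = delta^(l) z^(l-1)^T
    if l > 0:
        delta = (1.0 - np.tanh(a[l - 1]) ** 2) * (Ws[l].T @ delta)

print([g.shape for g in grads])                   # one gradient per weight matrix
```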
Efficiency of Backpropagation
For dense weight matrices, the complexity of calculating the gradient $\frac{\partial E_n(w)}{\partial w_{ji}}$ via backpropagation is $O(W)$, where $W$ is the number of weights.
Compare this to numerical differentiation using e.g.
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2), $$
which needs $O(W^2)$ operations and is less accurate.
FYI only — as in the previous lecture: In general we have the
“cheap gradient principle”. See (Griewank, A., 2000.
Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation, Section 5.1).
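For illustration (not from the slides), the following sketch computes the full numerical gradient of a one-layer model by looping over all W weights, two forward passes each, and compares it with the single-pass backpropagation gradient (y − t)xᵀ derived earlier:

```python
# Minimal sketch: central-difference gradient for every weight of a linear
# model needs two forward passes per weight (O(W) passes of O(W) cost each),
# whereas backpropagation delivers the whole gradient in one pass.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
t = rng.normal(size=3)
eps = 1e-6

def error(W):                          # one forward pass
    return 0.5 * np.sum((W @ x - t) ** 2)

# Numerical gradient: loop over all weights -> O(W^2) total cost.
num_grad = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    num_grad[idx] = (error(Wp) - error(Wm)) / (2 * eps)

# Backpropagation gradient for the same model: a single pass.
bp_grad = np.outer(W @ x - t, x)

print(np.max(np.abs(num_grad - bp_grad)))   # small finite-difference discrepancy
```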
Regularisation in Neural Networks
Number of input and output nodes determined by the
application.
Number of hidden nodes is a free parameter.
[Figure: training a two-layer network with 1 hidden node (M = 1).]
[Figure: training a two-layer network with 3 hidden nodes (M = 3).]
[Figure: training a two-layer network with 10 hidden nodes (M = 10).]
Regularisation in Neural Networks
Model complexity matters again.
[Figure: the fits obtained with M = 1, M = 3, and M = 10 hidden units, shown side by side.]
As before, we can use the regularised error
$$ \tilde{E}(w) = E(w) + \frac{\lambda}{2}\, w^\top w. $$
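In terms of the gradient, this quadratic regulariser simply adds λw to the unregularised gradient; a minimal sketch (added illustration, with made-up numbers):

```python
# Minimal sketch: the gradient of the regularised error
# E~(w) = E(w) + (lambda/2) w^T w  is  grad E(w) + lambda * w.
import numpy as np

def regularised_grad(grad_E, w, lam):
    """Add the weight-decay term to an unregularised gradient."""
    return grad_E + lam * w

# toy usage with a made-up gradient and weight vector
w = np.array([0.5, -1.0, 2.0])
grad_E = np.array([0.1, 0.0, -0.2])
print(regularised_grad(grad_E, w, lam=0.01))
```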
Regularisation via Early Stopping
Stop training at the minimum of the validation set error.
[Figure: training set error (left) and validation set error (right) as a function of the training iteration.]
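A hedged sketch of an early-stopping loop (added for illustration; train_epoch and validation_error are hypothetical helpers, and max_epochs/patience are arbitrary):

```python
# Minimal sketch of early stopping: keep the weights that achieve the lowest
# validation error and stop once it has not improved for `patience` epochs.
# `train_epoch(w)` and `validation_error(w)` are assumed/hypothetical helpers.
import copy

def early_stopping_train(w, train_epoch, validation_error,
                         max_epochs=200, patience=10):
    best_w = copy.deepcopy(w)
    best_err = validation_error(w)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        w = train_epoch(w)                  # one pass of gradient updates
        err = validation_error(w)
        if err < best_err:
            best_err, best_w = err, copy.deepcopy(w)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                       # validation error stopped improving
    return best_w

# toy usage: "training" moves w towards 1.5, validation prefers w near 1.0
w_best = early_stopping_train(
    w=0.0,
    train_epoch=lambda w: w + 0.1 * (1.5 - w),
    validation_error=lambda w: (w - 1.0) ** 2)
print(w_best)   # close to 1.0, not the training optimum 1.5
```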
Invariances
If input data should be invariant with respect to some
transformations, we can utilise this for training.
Use training patterns including these transformations (e.g.
handwritten digits translated in the input space).
Or create extra artificial input data by applying several transformations to the original input data (a sketch follows below).
Alternatively, preprocess the input data to remove the
transformation.
Or use convolutional neural networks (e.g. in image
processing where close pixels are more correlated than far
away pixels; therefore extract local features first and later
feed into a network extracting higher-order features).
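A minimal sketch of the second option, creating extra artificial training data by random translations (added illustration; the image sizes and shift range are toy assumptions, and np.roll wraps pixels around the border rather than padding):

```python
# Minimal sketch: augmenting image data with random translations, so the
# network sees (approximately) translation-invariant training examples.
import numpy as np

def random_translate(image, max_shift=2, rng=None):
    """Return a copy of `image` shifted by a random number of pixels.
    np.roll wraps pixels around the border (a crude form of translation)."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

# toy usage: augment a batch of 28x28 "digits" with 3 shifted copies each
rng = np.random.default_rng(0)
batch = rng.random(size=(8, 28, 28))
augmented = np.stack([random_translate(img, rng=rng)
                      for img in batch for _ in range(3)])
print(augmented.shape)   # (24, 28, 28)
```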
Synthetic Warping
Create synthetic data by warping handwritten digits.
[Figure. Left: original digitised image. Right: examples of warped images (above) and their corresponding displacement fields (below).]
Bayesian Neural Networks
Predict a single target $t$ from a vector of inputs $x$.
Assume the conditional distribution to be Gaussian with precision $\beta$:
$$ p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1}). $$
The prior distribution over the weights $w$ is also assumed to be Gaussian:
$$ p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I). $$
For an i.i.d. training data set $\{x_n, t_n\}_{n=1}^{N}$, the likelihood of the targets $\mathcal{D} = \{t_1, \ldots, t_N\}$ is
$$ p(\mathcal{D} \mid w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}). $$
Posterior distribution:
$$ p(w \mid \mathcal{D}, \alpha, \beta) \propto p(w \mid \alpha)\, p(\mathcal{D} \mid w, \beta). $$
Bayesian Neural Networks
But y(x,w) is nonlinear, and therefore we can no longer
calculate the posterior in closed form.
Use the Laplace approximation:
1 Find a (local) maximum $w_{\text{MAP}}$ of the posterior via numerical optimisation.
2 Evaluate the matrix of second derivatives of the negative log posterior distribution.
Find a (local) maximum using the log posterior
$$ \ln p(w \mid \mathcal{D}, \alpha, \beta) = -\frac{\alpha}{2}\, w^\top w - \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \text{const}. $$
Find the matrix of second derivatives of the negative log posterior distribution:
$$ A = -\nabla \nabla \ln p(w \mid \mathcal{D}, \alpha, \beta) = \alpha I + \beta H, $$
where $H$ is the Hessian matrix of the sum-of-squares error function with respect to the components of $w$.
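A hedged sketch of these two steps for a tiny single-hidden-layer network (added illustration; the data, network and the finite-difference Hessian are toy assumptions, not the lecture's implementation):

```python
# Minimal sketch of the Laplace approximation: (1) find w_MAP by minimising
# the negative log posterior, (2) form A = alpha*I + beta*H, here obtained
# directly as the second-derivative matrix of the negative log posterior,
# estimated by central finite differences. Toy data and a tiny tanh network.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
T = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)
alpha, beta = 1.0, 25.0
n_hidden = 3
n_w = n_hidden * 1 + n_hidden            # hidden weights (3x1) + output weights (3)

def predict(w, X):
    W1 = w[:n_hidden].reshape(n_hidden, 1)
    w2 = w[n_hidden:]
    return np.tanh(X @ W1.T) @ w2

def neg_log_posterior(w):
    err = predict(w, X) - T
    return 0.5 * alpha * w @ w + 0.5 * beta * err @ err

# (1) w_MAP: maximum of the posterior = minimum of the negative log posterior
w_map = minimize(neg_log_posterior, rng.normal(size=n_w) * 0.1).x

# (2) A: second derivatives of the negative log posterior at w_MAP
eps = 1e-3
A = np.zeros((n_w, n_w))
for i in range(n_w):
    for j in range(n_w):
        w_pp = w_map.copy(); w_pp[i] += eps; w_pp[j] += eps
        w_pm = w_map.copy(); w_pm[i] += eps; w_pm[j] -= eps
        w_mp = w_map.copy(); w_mp[i] -= eps; w_mp[j] += eps
        w_mm = w_map.copy(); w_mm[i] -= eps; w_mm[j] -= eps
        A[i, j] = (neg_log_posterior(w_pp) - neg_log_posterior(w_pm)
                   - neg_log_posterior(w_mp) + neg_log_posterior(w_mm)) / (4 * eps**2)

print(w_map.round(2), np.linalg.eigvalsh(A).min())   # A should be positive definite
```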
Bayesian Neural Networks
Having $w_{\text{MAP}}$ and $A$, we can approximate the posterior by a Gaussian
$$ q(w \mid \mathcal{D}, \alpha, \beta) = \mathcal{N}(w \mid w_{\text{MAP}}, A^{-1}). $$
For the predictive distribution, further linearly approximate
$$ y(x, w) \approx y(x, w_{\text{MAP}}) + g^\top (w - w_{\text{MAP}}), $$
where $g = \nabla_w y(x, w)\big|_{w = w_{\text{MAP}}}$.
Then
$$ p(t \mid x, \mathcal{D}, \alpha, \beta) = \mathcal{N}(t \mid y(x, w_{\text{MAP}}), \sigma^2(x)), $$
where
$$ \sigma^2(x) = \beta^{-1} + g^\top A^{-1} g. $$
(Recall the multivariate normal conditionals.)
Variance due to the intrinsic noise on the target: $\beta^{-1}$.
Variance due to the model parameters $w$: $g^\top A^{-1} g$.
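A short sketch of the predictive-variance formula (added illustration; A, g and β here are placeholder values standing in for the quantities defined above):

```python
# Minimal sketch: predictive variance sigma^2(x) = 1/beta + g^T A^{-1} g.
# A, g and beta are toy placeholders for the quantities defined above.
import numpy as np

beta = 25.0
A = np.array([[4.0, 0.5],
              [0.5, 3.0]])          # alpha*I + beta*H at w_MAP (toy values)
g = np.array([0.2, -0.1])           # gradient of y(x, w) at w_MAP (toy values)

sigma2 = 1.0 / beta + g @ np.linalg.solve(A, g)
print(sigma2)
```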