Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Part VIII
Neural Networks 2
Review
Error Backpropagation
Regularisation in Neural Networks
Bayesian Neural Networks
Neural Networks 2: Overview
Recall: we would like gradients w.r.t. parameters so that we can optimise.
Today: gradients of neural network parameters via the backpropagation of gradients algorithm.
Regularisation/model selection.
Incorporating invariances/domain knowledge.
Bayesian neural networks (Laplace's method).
Good News
We study backpropagation for pedagogical reasons: in practice one uses automatic differentiation, which is far more general and efficient (see e.g. PyTorch, which is especially easy to use).
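To illustrate the point, here is a minimal sketch (assuming PyTorch is installed; the layer sizes, input and target below are made up) of reverse-mode automatic differentiation computing exactly the gradients that backpropagation would give:

    import torch

    # A tiny two-layer network: 2 inputs -> 3 hidden (tanh) -> 1 output.
    torch.manual_seed(0)
    W1 = torch.randn(3, 2, requires_grad=True)
    W2 = torch.randn(1, 3, requires_grad=True)

    x = torch.tensor([0.5, -1.0])   # made-up input
    t = torch.tensor([1.0])         # made-up target

    # Forward pass and squared error E = 1/2 ||y - t||^2.
    y = W2 @ torch.tanh(W1 @ x)
    E = 0.5 * torch.sum((y - t) ** 2)

    # Reverse-mode automatic differentiation (backpropagation under the hood).
    E.backward()
    print(W1.grad)  # dE/dW1
    print(W2.grad)  # dE/dW2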
Chain Rule / Total Derivative
The composition of two functions is given by
(f ∘ g)(x) = f(g(x)).
Let f and g be differentiable functions with derivatives f′ and g′ respectively.
Chain rule:
d/dx f(g(x)) = f′(g(x)) g′(x).
If we write u = g(x) and y = f(u),
dy/dx = dy/du · du/dx.
In the multivariate case we also need the total derivative, e.g.
d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt).
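As a quick sanity check, a small sketch in plain Python (the functions x(t) = t² and y(t) = sin t are made up for illustration) comparing the total-derivative formula with a finite-difference estimate:

    import math

    f = lambda x, y: x * y + y ** 2            # f(x, y)
    x, dx = lambda t: t ** 2, lambda t: 2 * t  # x(t) and dx/dt
    y, dy = math.sin, math.cos                 # y(t) and dy/dt

    t = 0.7
    # Total derivative: df/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt)
    df_dx = y(t)              # partial of x*y + y^2 w.r.t. x
    df_dy = x(t) + 2 * y(t)   # partial w.r.t. y
    total = df_dx * dx(t) + df_dy * dy(t)

    # Finite-difference check of d/dt f(x(t), y(t)).
    h = 1e-6
    numeric = (f(x(t + h), y(t + h)) - f(x(t - h), y(t - h))) / (2 * h)
    print(total, numeric)  # the two values agree to high precision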
Error Backpropagation
Goal: Efficiently update the weights in order to find a local minimum of some error function E(w), utilising the gradient of the error function.
Core ideas:
1 Propagate the errors backwards through the network to efficiently calculate the gradient.
2 Update the weights using the calculated gradient.
Sequential procedure: Calculate the gradient and update the weights for each data/target pair.
Batch procedure: Collect gradient information for all data/target pairs for the same weight setting. Then adjust the weights.
Main question in both cases: How to calculate the gradient of E(w) given one data/target pair?
Error Backpropagation
Assume the error is a sum over errors for each data/target pair
E(w) = ∑_{n=1}^{N} E_n(w).
After applying input xn to the network, we get the output yn
and calculate the error En(w).
What is the gradient for one such term En(w)?
Note : In the following, we will drop the subscript n in order to unclutter the equations.
Notation: Input pattern is x.
Scalar xi is the ith component of the input pattern x.
Backprop – One Layer – Scalar View
Simple linear model without hidden layers
One layer only, identity function as activation function!
y_k = ∑_l w_kl x_l,
and error after applying input x_n
E_n(w) = ½ ∑_k (y_k − t_k)².
The gradient with respect to w_ji is now
∂E_n(w)/∂w_ji = ∑_k (y_k − t_k) ∂y_k/∂w_ji = ∑_k (y_k − t_k) ∂/∂w_ji ∑_l w_kl x_l
= ∑_{kl} (y_k − t_k) x_l δ_jk δ_il
= (y_j − t_j) x_i.
Backprop – One Layer – Vector Calculus
Vector setup: y = Wx, with W ∈ R^{D2×D1}, x ∈ R^{D1}, y ∈ R^{D2}.
Error after applying input training pair (x, t):
E_n(W) = ½ ‖y − t‖².
Using the vector calculus rules gives
∇_W E_n(W) = ∇_W ½ ‖y − t‖² = (y − t) ∇_W y = (y − t) x^⊤.
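A small NumPy sketch (the sizes and data are made up) confirming that the componentwise result (y_j − t_j) x_i and the matrix form (y − t) x^⊤ are the same object:

    import numpy as np

    rng = np.random.default_rng(0)
    D1, D2 = 4, 3
    W = rng.normal(size=(D2, D1))   # weights
    x = rng.normal(size=D1)         # input
    t = rng.normal(size=D2)         # target

    y = W @ x                       # linear model, identity activation
    grad_matrix = np.outer(y - t, x)          # (y - t) x^T

    # Componentwise: dE/dw_ji = (y_j - t_j) x_i
    grad_scalar = np.array([[(y[j] - t[j]) * x[i] for i in range(D1)]
                            for j in range(D2)])

    print(np.allclose(grad_matrix, grad_scalar))  # True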
Backprop – One Layer – Directional Derivative (FYI only)
Do the same using the directional derivative:
y = Wx with W ∈ R^{D2×D1}, and error after applying input training pair (x, t)
E_n(W) = ½ (y − t)^⊤ (y − t) = ½ (Wx − t)^⊤ (Wx − t).
Define ∇_d f(x) = d/dh f(x + hd) |_{h=0}.
Relate this to the gradient by ∇_d f(x) = ⟨∇f(x), d⟩.
The directional derivative with respect to W is now
∇_ξ E_n(W) = ½ [ (ξx)^⊤ (y − t) + (y − t)^⊤ ξx ] = x^⊤ ξ^⊤ (y − t), a scalar.
With the canonical inner product ⟨A, B⟩ = tr(A^⊤ B), the directional derivative of E_n(W) is
D E_n(W)(ξ) = tr( x^⊤ ξ^⊤ (y − t) ) = tr( ξ^⊤ (y − t) x^⊤ ),
so we can read off the gradient ∇_W E_n(W) = (y − t) x^⊤.
Error Backpropagation
The gradient
∇_W E_n(W) = (y − t) x^⊤,
or in components
∂E_n(w)/∂w_ji = (y_j − t_j) x_i,
looks like the product of the output error (y_j − t_j) with the input x_i associated with an edge for w_ji in the network diagram.
Can we generalise this idea to nonlinear activation functions?
Error Backpropagation
Now consider a network with nonlinear activation functions h(·), where the outputs z_i of one layer are connected to unit j in the next layer by edges with weights w_ji:
a_j = ∑_i w_ji z_i
z_j = h(a_j).
Use the chain rule to calculate the gradient
∂E_n(w)/∂w_ji = (∂E_n(w)/∂a_j)(∂a_j/∂w_ji) = δ_j z_i,
where we defined the error (a slight misnomer hailing from the derivative of the squared error)
δ_j ≡ ∂E_n(w)/∂a_j.
Same intuition as before: gradient is output error times the input associated with the edge for wji.
Error Backpropagation
Need to calculate the errors δ in every layer:
∂E_n(w)/∂w_ji = δ_j z_i,   δ_j = ∂E_n(w)/∂a_j.
Start the recursion; for output units with squared error:
δ_k = y_k − t_k.
For the hidden units we use the total derivative, e.g.
d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt),
to calculate
δ_j = ∂E_n(w)/∂a_j = ∑_k (∂E_n(w)/∂a_k)(∂a_k/∂a_j) = ∑_k δ_k ∂a_k/∂a_j,
using the definition of δ_k.
Error Backpropagation
Express a_k as a function of the incoming a_j
a_k = ∑_j w_kj z_j = ∑_j w_kj h(a_j),
and differentiate
∂a_k/∂a_j = w_kj ∂h(a_j)/∂a_j = w_kj ∂h(s)/∂s |_{s=a_j} = w_kj h′(a_j).
Finally, we get for the error in the previous layer
δ_j = h′(a_j) ∑_k w_kj δ_k.
Error Backpropagation
The backpropagation formula
δ_j = h′(a_j) ∑_k w_kj δ_k.
The form of h′(·) is available because we choose the activation function h(·) to be a fixed, known function such as tanh.
[Figure: unit j receives input z_i via weight w_ji and sends z_j to units k via weights w_kj; the errors δ_k are propagated backwards to form δ_j.]
Error Backpropagation Algorithms
1 Apply the input vector x to the network and forward propagate through the network to calculate all activations and outputs of each unit.
2 Compute the output errors δ_k (the gradients of the error at the output).
3 Propagate the errors backwards through the network using the backpropagation formula.
4 Calculate all components of ∇E_n by
∂E_n(w)/∂w_ji = δ_j z_i.
5 Update the weights w using ∂E_n(w)/∂w_ji.
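A minimal NumPy sketch of these five steps for a two-layer network with tanh hidden units, linear outputs and squared error (the layer sizes, data and learning rate are made-up illustration values; biases are omitted for brevity):

    import numpy as np

    def backprop_step(W1, W2, x, t, lr=0.1):
        # 1. Forward propagate: activations a and unit outputs z.
        a1 = W1 @ x           # hidden pre-activations
        z1 = np.tanh(a1)      # hidden outputs, z_j = h(a_j)
        y = W2 @ z1           # linear output units

        # 2. Errors at the output: delta_k = y_k - t_k (squared error).
        delta2 = y - t

        # 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k.
        delta1 = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta2)

        # 4. Gradients: dE/dw_ji = delta_j z_i (z_0 = x for the first layer).
        grad_W2 = np.outer(delta2, z1)
        grad_W1 = np.outer(delta1, x)

        # 5. Update the weights (simple gradient descent step).
        return W1 - lr * grad_W1, W2 - lr * grad_W2

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(5, 2))    # 2 inputs -> 5 hidden units
    W2 = rng.normal(size=(1, 5))    # 5 hidden units -> 1 output
    x, t = np.array([0.3, -0.7]), np.array([0.5])
    W1, W2 = backprop_step(W1, W2, x, t)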
Error Backpropagation
For batch processing, we repeat backpropagation for each pattern in the training set and then sum over all patterns
∂E(w)/∂w_ji = ∑_{n=1}^{N} ∂E_n(w)/∂w_ji.
Backpropagation can be generalised by assuming that each node has a different activation function h(·).
Easy Backprop
Let
z^(0) = x = input,
a^(l) = W^(l) z^(l−1),
z^(l) = h(a^(l)),
E = L(a^(L)) = L(y), e.g. L(y) = ½ ‖y − t‖².
The gradients of E w.r.t. the parameters are (total derivative)
∂E/∂W^(l) = ∂E/∂a^(l) · ∂a^(l)/∂W^(l) = δ^(l) z^(l−1)⊤,
where (neglecting transposes; assume conformant shapes)
δ^(l) ≡ ∂E/∂a^(l) = ∂a^(l+1)/∂a^(l) · ∂a^(l+2)/∂a^(l+1) · · · ∂a^(L)/∂a^(L−1) · ∂L(a^(L))/∂a^(L)
has the recursion δ^(L) = ∂L(a^(L))/∂a^(L) along with
δ^(l−1) = ∂a^(l)/∂a^(l−1) · δ^(l),
∂a^(l)/∂a^(l−1) = ∂W^(l) h(a^(l−1))/∂a^(l−1) = diag{h′(a^(l−1))} W^(l)⊤.
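The same recursion written as a loop over layers; a sketch assuming tanh activations for the hidden layers and a linear final layer with L(y) = ½‖y − t‖² (the function name and sizes below are made up):

    import numpy as np

    def forward_backward(Ws, x, t):
        """Return dE/dW^(l) for all layers of a tanh network with linear output."""
        # Forward: z^(0) = x, a^(l) = W^(l) z^(l-1), z^(l) = h(a^(l)).
        zs, As = [x], []
        for l, W in enumerate(Ws):
            a = W @ zs[-1]
            As.append(a)
            zs.append(a if l == len(Ws) - 1 else np.tanh(a))  # linear last layer

        # Backward: delta^(L) = dL/da^(L) = y - t, then
        # delta^(l-1) = diag{h'(a^(l-1))} W^(l)^T delta^(l).
        delta = zs[-1] - t
        grads = [None] * len(Ws)
        for l in reversed(range(len(Ws))):
            grads[l] = np.outer(delta, zs[l])          # delta^(l) z^(l-1)^T
            if l > 0:
                delta = (1.0 - np.tanh(As[l - 1]) ** 2) * (Ws[l].T @ delta)
        return grads

    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))]
    grads = forward_backward(Ws, x=rng.normal(size=3), t=np.zeros(2))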
Efficiency of Backpropagation
For dense weight matrices, the complexity of calculating the gradient ∂E_n(w)/∂w_ji for all weights via backpropagation is O(W), where W is the number of weights.
Compare this to numerical differentiation using e.g. the central difference
∂E_n(w)/∂w_ji = [E_n(w_ji + ε) − E_n(w_ji − ε)] / (2ε) + O(ε²),
which needs O(W²) operations, and is less accurate.
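A small sketch of such a central-difference check against the analytic one-layer gradient from earlier (sizes made up); note it needs two evaluations of E_n per weight, hence O(W²) work, versus a single O(W) backward pass:

    import numpy as np

    def E(W, x, t):
        return 0.5 * np.sum((W @ x - t) ** 2)

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))
    x, t = rng.normal(size=4), rng.normal(size=3)

    # Analytic gradient from backprop: (y - t) x^T -- O(W) work.
    grad_bp = np.outer(W @ x - t, x)

    # Central differences: 2 evaluations of E per weight -- O(W^2) work.
    eps = 1e-6
    grad_fd = np.zeros_like(W)
    for j in range(W.shape[0]):
        for i in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[j, i] += eps
            Wm[j, i] -= eps
            grad_fd[j, i] = (E(Wp, x, t) - E(Wm, x, t)) / (2 * eps)

    print(np.max(np.abs(grad_bp - grad_fd)))  # small, e.g. ~1e-9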
FYI only — as in the previous lecture: In general we have the “cheap gradient principle”. See (Griewank, A., 2000. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Section 5.1).
Regularisation in Neural Networks
Number of input and output nodes determined by the application.
Number of hidden nodes is a free parameter.
[Figure: training a two-layer network with 1 hidden node (M = 1).]
Regularisation in Neural Networks
[Figure: training a two-layer network with 3 hidden nodes (M = 3).]
Regularisation in Neural Networks
[Figure: training a two-layer network with 10 hidden nodes (M = 10).]
Regularisation in Neural Networks
Model complexity matters again.
[Figure: the fits for M = 1, M = 3 and M = 10 hidden nodes.]
As before, we can use the regularised error
Ẽ(w) = E(w) + (λ/2) w^⊤ w.
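In terms of computation, the only change is that the gradient picks up an extra λw term; a small sketch (the quadratic toy error and its gradient below are made up just to show the call):

    import numpy as np

    def regularised_error_and_grad(w, error_grad_fn, lam):
        """E~(w) = E(w) + (lam/2) w^T w, and its gradient dE/dw + lam * w."""
        E, grad = error_grad_fn(w)        # unregularised error and gradient
        return E + 0.5 * lam * (w @ w), grad + lam * w

    # Made-up quadratic "error" just to illustrate the call.
    toy = lambda w: (0.5 * np.sum(w ** 2 - w), w - 0.5)
    E_reg, g_reg = regularised_error_and_grad(np.array([1.0, -2.0]), toy, lam=0.1)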
Regularisation via Early Stopping
Stop training at the minimum of the validation set error.
[Figure: training set error vs. iteration.]
[Figure: validation set error vs. iteration.]
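A sketch of the usual early-stopping loop (train_step and val_error are hypothetical placeholders for whatever training and validation code is used):

    import copy

    def train_with_early_stopping(w, train_step, val_error, patience=5, max_iters=1000):
        """Keep the weights with the lowest validation error seen so far."""
        best_w, best_err, bad_iters = copy.deepcopy(w), float("inf"), 0
        for it in range(max_iters):
            w = train_step(w)             # one gradient update on training data
            err = val_error(w)            # error on the held-out validation set
            if err < best_err:
                best_w, best_err, bad_iters = copy.deepcopy(w), err, 0
            else:
                bad_iters += 1
                if bad_iters >= patience: # stop once validation error stops improving
                    break
        return best_w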
Invariances
If the predictions should be invariant with respect to some transformations of the input data, we can utilise this for training.
Use training patterns including these transformations (e.g. handwritten digits translated in the input space).
Or create extra artificial input data by applying several transformations to the original input data (see the sketch after this list).
Alternatively, preprocess the input data to remove the transformation.
Or use convolutional neural networks (e.g. in image processing where close pixels are more correlated than far away pixels; therefore extract local features first and later feed into a network extracting higher-order features).
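A minimal sketch of that data-augmentation option, appending translated copies of image inputs (NumPy only; the batch shape and shift sizes are made up):

    import numpy as np

    def augment_with_translations(images, labels, shifts=(-1, 1)):
        """Append copies of each image shifted by a few pixels (labels unchanged)."""
        aug_x, aug_y = [images], [labels]
        for s in shifts:
            aug_x.append(np.roll(images, shift=s, axis=2))  # shift along width
            aug_y.append(labels)
        return np.concatenate(aug_x), np.concatenate(aug_y)

    # Made-up batch of 8 "digit" images of size 28x28.
    rng = np.random.default_rng(0)
    x_aug, y_aug = augment_with_translations(rng.random((8, 28, 28)), np.arange(8))
    print(x_aug.shape)  # (24, 28, 28): original plus two shifted copies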
Synthetic Warping
Create synthetic data by warping handwritten digits.
[Figure. Left: original digitised image. Right: examples of warped images (above) and their corresponding displacement fields (below).]
Bayesian Neural Networks
Predict a single target t from a vector of inputs x.
Assume the conditional distribution to be Gaussian with precision β:
p(t | x, w, β) = N(t | y(x, w), β⁻¹).
The prior distribution over the weights w is also assumed to be Gaussian:
p(w | α) = N(w | 0, α⁻¹ I).
For an i.i.d. training data set {x_n, t_n}_{n=1}^N, the likelihood of the targets D = {t_1, ..., t_N} is
p(D | w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹).
Posterior distribution
p(w|D,α,β) ∝ p(w|α)p(D|w,β)
Bayesian Neural Networks
But y(x, w) is nonlinear, and therefore we can no longer calculate the posterior in closed form.
Use Laplace approximation
1 Find a (local) maximum wMAP of the posterior via numerical optimisation.
2 Evaluate the matrix of second derivatives of the negative log posterior distribution.
Find a (local) maximum using the log-posterior
ln p(w | D, α, β) = −(α/2) w^⊤ w − (β/2) ∑_{n=1}^{N} (y(x_n, w) − t_n)² + const.
Find the matrix of second derivatives of the negative log posterior distribution
A = −∇∇ ln p(w | D, α, β) = αI + βH,
where H is the Hessian matrix of the sum-of-squares error function with respect to the components of w.
Bayesian Neural Networks
Having w_MAP and A, we can approximate the posterior by a Gaussian
q(w | D, α, β) = N(w | w_MAP, A⁻¹).
For the predictive distribution, further linearly approximate
y(x, w) ≈ y(x, w_MAP) + g^⊤ (w − w_MAP), where g = ∇_w y(x, w) |_{w = w_MAP}.
Then
p(t | x, D, α, β) = N(t | y(x, w_MAP), σ²(x)),
where
σ²(x) = β⁻¹ + g^⊤ A⁻¹ g.
(Recall the multivariate normal conditionals.)
variance due to the intrinsic noise on the target: β⁻¹
variance due to the uncertainty in the model parameters w: g^⊤ A⁻¹ g
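A rough numerical sketch of these two variance terms for a tiny one-hidden-unit model. Everything here is illustrative: the data, hyperparameters and "MAP" weights are made up, g is computed by finite differences, and the exact Hessian H is replaced by the standard outer-product (Gauss-Newton) approximation rather than the full second derivatives used above:

    import numpy as np

    def y_net(w, x):
        """Tiny 1-hidden-unit 'network': w = (w1, w2), y = w2 * tanh(w1 * x)."""
        return w[1] * np.tanh(w[0] * x)

    def grad_fd(f, w, eps=1e-6):
        """Finite-difference gradient of a scalar function f at w."""
        g = np.zeros_like(w)
        for i in range(w.size):
            wp, wm = w.copy(), w.copy()
            wp[i] += eps
            wm[i] -= eps
            g[i] = (f(wp) - f(wm)) / (2 * eps)
        return g

    # Made-up data, hyperparameters and (pretend) MAP weights.
    alpha, beta = 1e-2, 25.0
    xs = np.array([-1.0, 0.0, 1.0])
    w_map = np.array([1.2, 0.7])

    # Outer-product (Gauss-Newton) approximation to the Hessian H of the
    # sum-of-squares error: H ~= sum_n g_n g_n^T with g_n = grad_w y(x_n, w_MAP).
    gs = [grad_fd(lambda v: y_net(v, x), w_map) for x in xs]
    H = sum(np.outer(g, g) for g in gs)
    A = alpha * np.eye(2) + beta * H

    # Predictive variance at a new input: sigma^2(x) = 1/beta + g^T A^{-1} g.
    g = grad_fd(lambda v: y_net(v, 0.5), w_map)
    sigma2 = 1.0 / beta + g @ np.linalg.solve(A, g)
    print(sigma2)  # noise variance plus parameter-uncertainty contribution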