Lecture 7. Multilayer Perceptron. Backpropagation
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries
• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation
Animals in the zoo
[Diagram: a taxonomy of Artificial Neural Networks (ANNs): recurrent neural networks and feed-forward networks, where feed-forward networks include perceptrons, multilayer perceptrons and convolutional neural networks]
• Recurrent neural networks are not covered in this subject
• An autoencoder is an ANN trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent neural network can be trained as an autoencoder.
art: OpenClipartVectors at pixabay.com (CC0)
Multilayer Perceptron
Modelling non-linearity via function composition
Limitations of linear models
Some problems are linearly separable, but many are not
[Plots: AND and OR are linearly separable; XOR is not]
Possible solution: composition
$x_1 \text{ XOR } x_2 = (x_1 \text{ OR } x_2) \text{ AND not}(x_1 \text{ AND } x_2)$
We are going to compose perceptrons …
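To make the composition concrete, here is a minimal sketch (in Python, with hand-picked rather than learned weights, purely for illustration) of three step-activation perceptrons composed exactly as in the identity above: an OR unit and a NOT-AND unit feeding an AND unit.

```python
# A step-activation perceptron; the weights and biases below are hand-picked
# for illustration (not learned), to compose XOR out of linear units.
def perceptron(inputs, weights, bias):
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s >= 0 else 0

def xor(x1, x2):
    h_or   = perceptron([x1, x2], [1, 1], -0.5)      # x1 OR x2
    h_nand = perceptron([x1, x2], [-1, -1], 1.5)     # not (x1 AND x2)
    return perceptron([h_or, h_nand], [1, 1], -1.5)  # AND of the two hidden units

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```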
• The perceptron is a basic building block for ANNs
• ANNs are not restricted to binary classification
• Nodes in an ANN can have various activation functions
Step function: $f(s) = 1$ if $s \geq 0$; $0$ if $s < 0$
Sign function: $f(s) = 1$ if $s \geq 0$; $-1$ if $s < 0$
Logistic function: $f(s) = \dfrac{1}{1 + e^{-s}}$
Many others: $\tanh$, rectifier, etc.
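As a minimal sketch, these activation functions can be written in a few lines of NumPy (the helper names below are illustrative, not taken from any particular library):

```python
import numpy as np

def step(s):     return np.where(s >= 0, 1.0, 0.0)   # step function
def sign_fn(s):  return np.where(s >= 0, 1.0, -1.0)  # sign function
def logistic(s): return 1.0 / (1.0 + np.exp(-s))     # logistic (sigmoid)
def relu(s):     return np.maximum(0.0, s)           # rectifier
# tanh is available directly as np.tanh

s = np.array([-2.0, 0.0, 2.0])
print(step(s), sign_fn(s), logistic(s), np.tanh(s), relu(s))
```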
Feed-forward Artificial Neural Network
[Network diagram, computation flowing left to right: input layer $x_1, \dots, x_m$ → hidden layer $u_1, \dots, u_p$ via weights $v_{ij}$ → output layer $z_1, \dots, z_q$ via weights $w_{jk}$]
• $x_i$ are inputs, i.e., attributes. Note: here the $x_i$ are components of a single training instance $\boldsymbol{x}$; a training dataset is a set of instances
• $z_k$ are outputs, i.e., predicted labels. Note: ANNs naturally handle multidimensional output; e.g., for handwritten digits recognition, each output node can represent the probability of a digit
• ANNs can have more than one hidden layer
ANN as function composition
[Network diagram: inputs $x_1, \dots, x_m$, hidden units $u_1, \dots, u_p$ via weights $v_{ij}$, outputs $z_1, \dots, z_q$ via weights $w_{jk}$]
Hidden layer: $r_j = v_{0j} + \sum_{i=1}^{m} x_i v_{ij}$ and $u_j = g(r_j)$
Output layer: $s_k = w_{0k} + \sum_{j=1}^{p} u_j w_{jk}$ and $z_k = h(s_k)$
Note that each $z_k$ is a function composition (a function applied to the result of another function, etc.)
Here $g$, $h$ are activation functions. These can be either the same (e.g., both sigmoid) or different
You can add a bias node $x_0 = 1$ to simplify equations: $r_j = \sum_{i=0}^{m} x_i v_{ij}$; similarly you can add a bias node $u_0 = 1$ to simplify equations: $s_k = \sum_{j=0}^{p} u_j w_{jk}$
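A minimal NumPy sketch of this forward computation, assuming the bias-node convention above ($x_0 = 1$, $u_0 = 1$); the function name and example weights are illustrative only:

```python
import numpy as np

def forward(x, V, W, g, h):
    """One forward pass through a single-hidden-layer network.

    x : input with bias prepended, shape (m+1,)     [x[0] = 1]
    V : input->hidden weights v_ij, shape (m+1, p)
    W : hidden->output weights w_jk, shape (p+1, q)
    g, h : hidden and output activation functions
    """
    r = x @ V                            # r_j = sum_i x_i v_ij
    u = np.concatenate(([1.0], g(r)))    # u_j = g(r_j), with bias unit u_0 = 1
    s = u @ W                            # s_k = sum_j u_j w_jk
    return h(s)                          # z_k = h(s_k)

# Example: 2 inputs, 3 hidden units, 1 output, logistic activations throughout
logistic = lambda a: 1.0 / (1.0 + np.exp(-a))
V = np.full((3, 3), 0.1)
W = np.full((4, 1), 0.1)
print(forward(np.array([1.0, 0.5, -0.5]), V, W, logistic, logistic))
```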
ANN in supervised learning
• ANNs can be naturally adapted to various supervised learning setups, such as univariate and multivariate regression, as well as binary and multilabel classification
• Univariate regression $y = f(\boldsymbol{x})$
∗ e.g., linear regression earlier in the course
• Multivariate regression $\boldsymbol{y} = f(\boldsymbol{x})$
∗ predicting values for multiple continuous outcomes
• Binary classification
∗ e.g., predict whether a patient has type II diabetes
• Multivariate classification
∗ e.g., handwritten digits recognition with labels “1”, “2”, etc.
The power of ANN as a non-linear model
• ANNs are capable of approximating a plethora of non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$
• For example, consider the following network, with a single input $x$, hidden units $u_1, u_2, u_3$, and output $z$. In this example, the hidden unit activation functions are tanh
[Network diagram; plot by Geek3, Wikimedia Commons (CC3)]
The power of ANN as a non-linear model
• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$
[Plot of $z(x)$ against $x$, adapted from Bishop 2007: blue points are the function values evaluated at different $x$; red lines are the predictions from the ANN; dashed lines are outputs of the hidden units]
• Universal approximation theorem (Cybenko 1989): An ANN with a hidden layer with a finite number of units, and mild assumptions on the activation function, can approximate continuous functions on compact subsets of $\mathbb{R}^n$ arbitrarily well
How to train your dragon network?
• You know the drill: Define the loss function and find parameters that minimise the loss on training data
Adapted from Movie Poster from Flickr user jdxyw (CC BY-SA 2.0)
• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one
Training setup: univariate regression
• Consider regression, with inputs $x_1, \dots, x_m$, a hidden layer with units $u_1, \dots, u_p$ (weights $v_{ij}$), and a single output $z$ (weights $w_j$)
• Moreover, we'll use the identity output activation function: $z = h(s) = s = \sum_{j=0}^{p} u_j w_j$
• This will simplify the description of backpropagation. In other settings, the training procedure is similar
Loss function for ANN training
• Need a loss between the training example $\{\boldsymbol{x}, y\}$ and the prediction $\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) = z$, where $\boldsymbol{\theta}$ is the vector of all parameters $v_{ij}$ and $w_j$
• For regression, we can use squared error
$L = \frac{1}{2}\left(\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) - y\right)^2 = \frac{1}{2}(z - y)^2$
(the constant $\frac{1}{2}$ is used for mathematical convenience, see later)
• Decision-theoretic training: minimise $L$ w.r.t. $\boldsymbol{\theta}$
∗ Fortunately $L(\boldsymbol{\theta})$ is differentiable
∗ Unfortunately no analytic solution in general
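For concreteness, a two-line sketch of this loss and its derivative with respect to the prediction, showing why the constant 1/2 is convenient (it cancels the exponent so the gradient is just the residual):

```python
def squared_error(z, y):
    return 0.5 * (z - y) ** 2   # L = 1/2 (z - y)^2

def squared_error_grad(z, y):
    return z - y                # dL/dz = z - y (the 1/2 cancels the 2)
```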
Stochastic gradient descent for ANN
Choose initial guess $\boldsymbol{\theta}^{(0)}$, $k = 0$
(here $\boldsymbol{\theta}$ is the set of all weights from all layers)
For $i$ from 1 to $T$ (epochs)
For $j$ from 1 to $N$ (training examples)
Consider example $(\boldsymbol{x}_j, y_j)$
Update: $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta \nabla L(\boldsymbol{\theta}^{(k)})$; $k \leftarrow k + 1$
For each example, $L = \frac{1}{2}(z_j - y_j)^2$, so we need to compute the partial derivatives $\frac{\partial L}{\partial v_{ij}}$ and $\frac{\partial L}{\partial w_j}$
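A minimal sketch of this procedure with batch size one, assuming a user-supplied per-example gradient function `grad_fn` (which for an ANN would be computed by backpropagation, derived next); the toy linear-regression usage at the end is purely illustrative:

```python
import numpy as np

def sgd(theta0, data, grad_fn, eta=0.1, epochs=10):
    """Stochastic gradient descent, processing training examples one at a time.

    theta0  : initial parameter vector theta^(0)
    data    : list of (x, y) training examples
    grad_fn : (theta, x, y) -> gradient of the per-example loss
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):                                  # "for i from 1 to T"
        for x, y in data:                                    # "for j from 1 to N"
            theta = theta - eta * grad_fn(theta, x, y)       # theta^(k+1) = theta^(k) - eta * grad
    return theta

# Toy usage: linear model z = theta[0] + theta[1] * x with L = 1/2 (z - y)^2,
# so the per-example gradient is (z - y) * [1, x].
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]                  # points on y = 1 + 2x
grad = lambda th, x, y: (th[0] + th[1] * x - y) * np.array([1.0, x])
print(sgd(np.zeros(2), data, grad, eta=0.1, epochs=200))     # roughly [1., 2.]
```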
Backpropagation = “backward propagation of errors”
Calculating the gradient of loss of a composition
Backpropagation: start with the chain rule
• Recall that the output $z$ of an ANN is a function composition, and hence $L(z)$ is also a composition
∗ $L = 0.5(z - y)^2 = 0.5(h(s) - y)^2 = 0.5(s - y)^2$
∗ $= 0.5\left(\sum_{j=0}^{p} u_j w_j - y\right)^2 = 0.5\left(\sum_{j=0}^{p} g(r_j) w_j - y\right)^2 = \cdots$
• Backpropagation makes use of this fact by applying the chain rule for derivatives
• $\dfrac{\partial L}{\partial w_j} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial w_j}$
• $\dfrac{\partial L}{\partial v_{ij}} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial u_j}\dfrac{\partial u_j}{\partial r_j}\dfrac{\partial r_j}{\partial v_{ij}}$
[Network diagram: inputs $x_1, \dots, x_m$, hidden units $u_1, \dots, u_p$, weights $v_{ij}$ and $w_j$, output $z$]
Backpropagation: intermediate step
• Apply the chain rule
∗ $\dfrac{\partial L}{\partial w_j} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial w_j}$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial u_j}\dfrac{\partial u_j}{\partial r_j}\dfrac{\partial r_j}{\partial v_{ij}}$
• Now define
∗ $\delta \equiv \dfrac{\partial L}{\partial s} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}$
∗ $\varepsilon_j \equiv \dfrac{\partial L}{\partial r_j} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial s}\dfrac{\partial s}{\partial u_j}\dfrac{\partial u_j}{\partial r_j}$
• Here $L = 0.5(z - y)^2$ and $z = s$, thus $\delta = z - y$
• Here $s = \sum_{j=0}^{p} u_j w_j$ and $u_j = g(r_j)$, thus $\varepsilon_j = \delta w_j g'(r_j)$
[Network diagram: inputs $x_1, \dots, x_m$, hidden units $u_1, \dots, u_p$, weights $v_{ij}$ and $w_j$, output $z$]
Backpropagation equations
• We have
∗ $\dfrac{\partial L}{\partial w_j} = \delta \dfrac{\partial s}{\partial w_j}$, where $\delta = \dfrac{\partial L}{\partial s} = z - y$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j \dfrac{\partial r_j}{\partial v_{ij}}$, where $\varepsilon_j = \dfrac{\partial L}{\partial r_j} = \delta w_j g'(r_j)$
• Recall that
∗ $s = \sum_{j=0}^{p} u_j w_j$
∗ $r_j = \sum_{i=0}^{m} x_i v_{ij}$
• So $\dfrac{\partial s}{\partial w_j} = u_j$ and $\dfrac{\partial r_j}{\partial v_{ij}} = x_i$
• We have
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$
[Network diagram: inputs $x_1, \dots, x_m$, hidden units $u_1, \dots, u_p$, weights $v_{ij}$ and $w_j$, output $z$]
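Putting the forward pass and these equations together, here is a sketch of the per-example gradients for the univariate-regression setup (identity output activation, bias nodes $x_0 = u_0 = 1$); names such as `backprop_grads` are assumptions made for this illustration:

```python
import numpy as np

def backprop_grads(x, y, V, w, g, g_prime):
    """Per-example gradients for a one-hidden-layer univariate regressor.

    x : input with bias prepended, shape (m+1,)     [x[0] = 1]
    y : scalar target
    V : input->hidden weights v_ij, shape (m+1, p)
    w : hidden->output weights w_j, shape (p+1,)    [w[0] multiplies u_0 = 1]
    g, g_prime : hidden activation function and its derivative
    """
    # Forward pass
    r = x @ V                               # r_j = sum_i x_i v_ij
    u = np.concatenate(([1.0], g(r)))       # u_j = g(r_j), plus bias u_0 = 1
    z = u @ w                               # identity output: z = s

    # Backpropagation equations
    delta = z - y                           # delta    = dL/ds    = z - y
    eps = delta * w[1:] * g_prime(r)        # eps_j    = delta * w_j * g'(r_j)
    dL_dw = delta * u                       # dL/dw_j  = delta * u_j
    dL_dV = np.outer(x, eps)                # dL/dv_ij = eps_j * x_i
    return z, dL_dV, dL_dw

# Example call with tanh hidden units (g'(r) = 1 - tanh(r)^2)
tanh, dtanh = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2
z, dV, dw = backprop_grads(np.array([1.0, 0.3]), 0.5,
                           np.zeros((2, 3)), np.ones(4), tanh, dtanh)
```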
Forward propagation
• Use current estimates of $v_{ij}$ and $w_j$
• Calculate $r_j$, $u_j$, $s$ and $z$
[Network diagram: inputs $x_1, \dots, x_m$, hidden units $u_1, \dots, u_p$, weights $v_{ij}$ and $w_j$, output $z$]
• Backpropagation equations:
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$
Backward propagation of errors
$\delta = (z - y)$, $\quad \varepsilon_j = \delta w_j g'(r_j)$, $\quad \dfrac{\partial L}{\partial w_j} = \delta u_j$, $\quad \dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i$
[Network diagram: the output error $\delta$ is propagated backwards through the weights $w_j$ to give $\varepsilon_j$ at each hidden unit, and from there to the input weights $v_{ij}$]
• Backpropagation equations:
∗ $\dfrac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\dfrac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$
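An end-to-end sketch that alternates forward propagation, backward propagation of errors and an SGD update to fit $z(x) = \sin x$ with three tanh hidden units (cf. the earlier approximation example). The learning rate, number of epochs and initialisation scale are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-3.0, 3.0, 50)         # training inputs
ys = np.sin(xs)                         # targets: z(x) = sin(x)

p = 3                                   # three tanh hidden units
V = rng.normal(scale=0.5, size=(2, p))  # input->hidden weights (row 0: biases v_0j)
w = rng.normal(scale=0.5, size=p + 1)   # hidden->output weights (w[0]: bias)
eta = 0.01

for epoch in range(5000):
    for x_raw, y in zip(xs, ys):
        x = np.array([1.0, x_raw])                      # bias node x_0 = 1
        # Forward propagation: r_j, u_j, s, z
        r = x @ V
        u = np.concatenate(([1.0], np.tanh(r)))
        z = u @ w                                       # identity output activation
        # Backward propagation of errors
        delta = z - y                                   # delta = z - y
        eps = delta * w[1:] * (1.0 - np.tanh(r) ** 2)   # eps_j = delta * w_j * g'(r_j)
        # SGD updates
        w -= eta * delta * u                            # dL/dw_j  = delta * u_j
        V -= eta * np.outer(x, eps)                     # dL/dv_ij = eps_j * x_i

preds = np.array([np.concatenate(([1.0], np.tanh(np.array([1.0, xv]) @ V))) @ w for xv in xs])
print(np.mean((preds - ys) ** 2))  # mean squared error; should be reasonably small
```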
Some further notes on ANN training
• ANNs are flexible (recall the universal approximation theorem), but the flipside is over-parameterisation, hence a tendency to overfit
• Starting weights are usually small random values distributed about zero
• Implicit regularisation: early stopping
∗ With some activation functions, this shrinks the ANN towards a linear model (why?)
Plot by Geek3 Wikimedia Commons (CC3)
Explicit regularisation
• Alternatively, explicit regularisation can be used, much like in ridge regression
• Instead of minimising the loss $L$, minimise the regularised objective $L + \lambda\left(\sum_{i=0}^{m}\sum_{j=1}^{p} v_{ij}^2 + \sum_{j=0}^{p} w_j^2\right)$
• This will simply add $2\lambda v_{ij}$ and $2\lambda w_j$ terms to the partial derivatives
• With some activation functions, this also shrinks the ANN towards a linear model
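A minimal sketch of how the penalty changes the per-example gradients (the helper name and `lam` are illustrative; `dL_dV` and `dL_dw` are the unregularised partial derivatives, e.g., as returned by backpropagation):

```python
def regularised_grads(dL_dV, dL_dw, V, w, lam):
    """Add the penalty's gradient terms 2*lam*v_ij and 2*lam*w_j."""
    return dL_dV + 2.0 * lam * V, dL_dw + 2.0 * lam * w
```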
This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries
• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation
• Next lecture: DNNs, CNNs, autoencoders