
Lecture 7. Multilayer Perceptron. Backpropagation
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne

COMP90051 Statistical Machine Learning
This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries
• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation
2

COMP90051 Statistical Machine Learning
Animals in the zoo
[Diagram: Artificial Neural Networks (ANNs) split into recurrent neural networks and feed-forward networks; feed-forward networks include perceptrons, multilayer perceptrons and convolutional neural networks]
• Recurrent neural networks are not covered in this subject
• An autoencoder is an ANN trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a recurrent neural network can be trained as an autoencoder.
3
art: OpenClipartVectors at pixabay.com (CC0)

COMP90051 Statistical Machine Learning
Multilayer Perceptron
Modelling non-linearity via function composition
4

COMP90051 Statistical Machine Learning
Limitations of linear models
Some problems are linearly separable, but many are not
[Plots: AND and OR are linearly separable; XOR is not]
Possible solution: composition
$x_1 \,\text{XOR}\, x_2 = (x_1 \,\text{OR}\, x_2) \,\text{AND}\, \text{not}(x_1 \,\text{AND}\, x_2)$
We are going to compose perceptrons …
5
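To make the composition concrete, here is a minimal sketch (not from the slides; weights chosen by hand) of XOR built from step-activation perceptrons, exactly following the decomposition above:

```python
import numpy as np

# A single perceptron with a step activation: f(s) = 1 if s >= 0 else 0
def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b >= 0 else 0

def xor(x1, x2):
    # OR and AND are each linearly separable, so each is one perceptron
    h_or  = perceptron([x1, x2], w=[1, 1], b=-0.5)   # x1 OR x2
    h_and = perceptron([x1, x2], w=[1, 1], b=-1.5)   # x1 AND x2
    # Compose: XOR = (x1 OR x2) AND NOT (x1 AND x2)
    return perceptron([h_or, h_and], w=[1, -1], b=-0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # prints the XOR truth table
```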

COMP90051 Statistical Machine Learning
Perceptron is sort of a building block for ANN
• ANNs are not restricted to binary classification
• Nodes in ANN can have various activation functions
Step function: $f(s) = 1$ if $s \ge 0$; $0$ if $s < 0$
Sign function: $f(s) = 1$ if $s \ge 0$; $-1$ if $s < 0$
Logistic function: $f(s) = \dfrac{1}{1 + e^{-s}}$
Many others: $\tanh$, rectifier, etc.
6

COMP90051 Statistical Machine Learning
Feed-forward Artificial Neural Network
[Diagram: input layer $x_1, \dots, x_m$; hidden layer $u_1, \dots, u_p$ (ANNs can have more than one hidden layer); output layer $z_1, \dots, z_q$; weights $v_{ij}$ between input and hidden layers and $w_{jk}$ between hidden and output layers; computation flows from inputs to outputs]
• $x_i$ are inputs, i.e., attributes of instances
∗ note: here $x_i$ are components of a single training instance $\boldsymbol{x}$; a training dataset is a set of instances
• $z_k$ are outputs, i.e., predicted labels
∗ note: ANNs naturally handle multidimensional output; e.g., for handwritten digit recognition, each output node can represent the probability of a digit
7

COMP90051 Statistical Machine Learning
ANN as function composition
$r_j = v_{0j} + \sum_{i=1}^{m} x_i v_{ij}$,  $u_j = g(r_j)$
$s_k = w_{0k} + \sum_{j=1}^{p} u_j w_{jk}$,  $z_k = h(s_k)$
• Note that $z_k$ is a function composition (a function applied to the result of another function, etc.)
• Here $g$, $h$ are activation functions. These can be either the same (e.g., both sigmoid) or different
• You can add bias node $x_0 = 1$ to simplify equations: $r_j = \sum_{i=0}^{m} x_i v_{ij}$
• Similarly you can add bias node $u_0 = 1$ to simplify equations: $s_k = \sum_{j=0}^{p} u_j w_{jk}$
8

COMP90051 Statistical Machine Learning
ANN in supervised learning
• ANNs can be naturally adapted to various supervised learning setups, such as univariate and multivariate regression, as well as binary and multilabel classification
• Univariate regression $y = f(\boldsymbol{x})$
∗ e.g., linear regression earlier in the course
• Multivariate regression $\boldsymbol{y} = f(\boldsymbol{x})$
∗ predicting values for multiple continuous outcomes
• Binary classification
∗ e.g., predict whether a patient has type II diabetes
• Multivariate classification
∗ e.g., handwritten digit recognition with labels “1”, “2”, etc.
9

COMP90051 Statistical Machine Learning
The power of ANN as a non-linear model
• ANNs are capable of approximating a plethora of non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$
• For example, consider the following network. In this example, hidden unit activation functions are $\tanh$
[Diagram: single input $x$ feeding hidden units $u_1, u_2, u_3$, which feed output $z$. Plot by Geek3, Wikimedia Commons (CC3)]
10

COMP90051 Statistical Machine Learning
The power of ANN as a non-linear model
• ANNs are capable of approximating various non-linear functions, e.g., $z(x) = x^2$ and $z(x) = \sin x$
[Plots of $z(x)$ against $x$, adapted from Bishop 2007: blue points are the function values evaluated at different $x$; red lines are the predictions from the ANN; dashed lines are outputs of the hidden units]
• Universal approximation theorem (Cybenko 1989): an ANN with a hidden layer with a finite number of units, and mild assumptions on the activation function, can approximate continuous functions on compact subsets of $\boldsymbol{R}^n$ arbitrarily well
11

COMP90051 Statistical Machine Learning
How to train your dragon network?
• You know the drill: define the loss function and find parameters that minimise the loss on training data
(Adapted from movie poster, Flickr user jdxyw, CC BY-SA 2.0)
• In the following, we are going to use stochastic gradient descent with a batch size of one. That is, we will process training examples one by one
12

COMP90051 Statistical Machine Learning
Training setup: univariate regression
• Consider regression
• Moreover, we’ll use the identity output activation function: $z = h(s) = s = \sum_{j=0}^{p} u_j w_j$
• This will simplify the description of backpropagation. In other settings, the training procedure is similar
[Diagram: inputs $x_1, \dots, x_m$, input-to-hidden weights $v_{ij}$, hidden units $u_1, \dots, u_p$, hidden-to-output weights $w_j$, single output $z$]
13
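As a concrete illustration of this setup, the following is a minimal sketch assuming numpy, tanh hidden activations and the bias conventions above ($x_0 = 1$, $u_0 = 1$); the function name and the weight shapes are illustrative, not from the slides:

```python
import numpy as np

def forward(x, V, w):
    """Forward pass for a one-hidden-layer univariate-regression network.

    x: inputs, shape (m,); V: input-to-hidden weights, shape (m+1, p);
    w: hidden-to-output weights, shape (p+1,). Biases via constant-1 nodes.
    """
    x = np.concatenate(([1.0], x))      # add bias node x_0 = 1
    r = x @ V                           # r_j = sum_i x_i v_ij
    u = np.tanh(r)                      # u_j = g(r_j), here g = tanh
    u = np.concatenate(([1.0], u))      # add bias node u_0 = 1
    s = u @ w                           # s = sum_j u_j w_j
    z = s                               # identity output activation
    return z, r, u                      # keep r, u for backpropagation later

# Example: small random network with m = 3 inputs and p = 4 hidden units
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(4, 4))  # (m+1, p)
w = rng.normal(scale=0.1, size=5)       # (p+1,)
z, r, u = forward(np.array([0.2, -1.0, 0.5]), V, w)
```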
Statistical Machine Learning (S2 2018)   Deck 7
Training setup: univariate regression
[Diagram: inputs $x_1, x_2, \dots, x_m$, input-to-hidden weights $v_{ij}$, hidden units $u_1, \dots, u_p$, hidden-to-output weights $w_j$, single output $z$]
14

COMP90051 Statistical Machine Learning
Loss function for ANN training
• Need loss between training example $\{\boldsymbol{x}, y\}$ & prediction $\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) = z$, where $\boldsymbol{\theta}$ is the parameter vector of $v_{ij}$ and $w_j$
• As regression, can use squared error
$L = \frac{1}{2}\left(\hat{f}(\boldsymbol{x}, \boldsymbol{\theta}) - y\right)^2 = \frac{1}{2}(z - y)^2$
(the constant is used for mathematical convenience, see later)
• Decision-theoretic training: minimise $L$ w.r.t. $\boldsymbol{\theta}$
∗ Fortunately $L(\boldsymbol{\theta})$ is differentiable
∗ Unfortunately no analytic solution in general
15

COMP90051 Statistical Machine Learning
Stochastic gradient descent for ANN
Choose initial guess $\boldsymbol{\theta}^{(0)}$, $k = 0$
(here $\boldsymbol{\theta}$ is the set of all weights from all layers)
For $i$ from 1 to $T$ (epochs)
    For $j$ from 1 to $N$ (training examples)
        Consider example $(\boldsymbol{x}_j, y_j)$
        Update: $\boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta \nabla L(\boldsymbol{\theta}^{(k)})$; $k \leftarrow k + 1$
$L = \frac{1}{2}(z_j - y_j)^2$; need to compute partial derivatives $\frac{\partial L}{\partial v_{ij}}$ and $\frac{\partial L}{\partial w_j}$
16

COMP90051 Statistical Machine Learning
Backpropagation = “backward propagation of errors”
Calculating the gradient of loss of a composition
17

COMP90051 Statistical Machine Learning
Backpropagation: start with the chain rule
• Recall that the output $z$ of an ANN is a function composition, and hence $L(z)$ is also a composition
∗ $L = 0.5(z - y)^2 = 0.5(h(s) - y)^2 = 0.5(s - y)^2$
∗ $= 0.5\left(\sum_{j=0}^{p} u_j w_j - y\right)^2 = 0.5\left(\sum_{j=0}^{p} g(r_j) w_j - y\right)^2 = \cdots$
• Backpropagation makes use of this fact by applying the chain rule for derivatives
• $\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial w_j}$
• $\frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}\frac{\partial r_j}{\partial v_{ij}}$
18

COMP90051 Statistical Machine Learning
Backpropagation: intermediate step
• Apply the chain rule
$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial w_j}$
$\frac{\partial L}{\partial v_{ij}} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}\frac{\partial r_j}{\partial v_{ij}}$
• Now define
$\delta \equiv \frac{\partial L}{\partial s} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}$
$\varepsilon_j \equiv \frac{\partial L}{\partial r_j} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial s}\frac{\partial s}{\partial u_j}\frac{\partial u_j}{\partial r_j}$
• Here $L = 0.5(z - y)^2$ and $z = s$, thus $\delta = z - y$
• Here $s = \sum_{j=0}^{p} u_j w_j$ and $u_j = g(r_j)$, thus $\varepsilon_j = \delta w_j g'(r_j)$
19

COMP90051 Statistical Machine Learning
Backpropagation equations
• We have
∗ $\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial s}\frac{\partial s}{\partial w_j} = \delta \frac{\partial s}{\partial w_j}$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j \frac{\partial r_j}{\partial v_{ij}}$
• where
∗ $\delta = \frac{\partial L}{\partial s} = z - y$
∗ $\varepsilon_j = \frac{\partial L}{\partial r_j} = \delta w_j g'(r_j)$
• Recall that $s = \sum_{j=0}^{p} u_j w_j$ and $r_j = \sum_{i=0}^{m} x_i v_{ij}$
• So $\frac{\partial s}{\partial w_j} = u_j$ and $\frac{\partial r_j}{\partial v_{ij}} = x_i$
• We have
∗ $\frac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$
20

COMP90051 Statistical Machine Learning
Forward propagation
• Use current estimates of $v_{ij}$ and $w_j$
• Calculate $r_j$, $u_j$, $s$ and $z$
[Diagram: inputs $x_1, x_2, \dots, x_m$, weights $v_{ij}$, hidden units $u_1, \dots, u_p$, weights $w_j$, output $z$]
• Backpropagation equations
∗ $\frac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$
21

COMP90051 Statistical Machine Learning
Backward propagation of errors
$\delta = (z - y)$
$\frac{\partial L}{\partial w_j} = \delta u_j$
$\varepsilon_j = \delta w_j g'(r_j)$
$\frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i$
[Diagram: the same network, with the error $\delta$ propagated backwards from the output through $\varepsilon_j$ to the input-to-hidden weights]
• Backpropagation equations
∗ $\frac{\partial L}{\partial w_j} = \delta u_j = (z - y) u_j$
∗ $\frac{\partial L}{\partial v_{ij}} = \varepsilon_j x_i = \delta w_j g'(r_j) x_i$
22
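Putting the forward pass and the backpropagation equations together, here is a minimal sketch of a single SGD update on one training example, again assuming numpy, tanh hidden units (so $g'(r) = 1 - \tanh^2 r$), identity output and squared-error loss; function and variable names are illustrative only, not from the slides:

```python
import numpy as np

def sgd_step(x, y, V, w, eta=0.1):
    """One stochastic gradient descent step on a single example (x, y),
    using the backpropagation equations above. Returns updated (V, w)."""
    # Forward propagation: compute r_j, u_j, s and z with current weights
    x1 = np.concatenate(([1.0], x))          # bias node x_0 = 1
    r = x1 @ V                               # r_j = sum_i x_i v_ij
    u = np.concatenate(([1.0], np.tanh(r)))  # bias node u_0 = 1, u_j = g(r_j)
    z = u @ w                                # identity output: z = s

    # Backward propagation of errors
    delta = z - y                                 # delta = dL/ds
    grad_w = delta * u                            # dL/dw_j = delta * u_j
    eps = delta * w[1:] * (1.0 - np.tanh(r)**2)   # eps_j = delta * w_j * g'(r_j)
    grad_V = np.outer(x1, eps)                    # dL/dv_ij = eps_j * x_i

    # Gradient descent update
    return V - eta * grad_V, w - eta * grad_w

# One pass over a toy dataset, processing examples one by one:
# for epoch in range(T):
#     for x_n, y_n in zip(X, Y):
#         V, w = sgd_step(x_n, y_n, V, w, eta=0.01)
```

Looping this update over all training examples for $T$ epochs reproduces the SGD procedure sketched on the earlier slide.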
COMP90051 Statistical Machine Learning
Some further notes on ANN training
• ANNs are flexible (recall the universal approximation theorem), but the flipside is over-parameterisation, hence a tendency to overfitting
• Starting weights are usually random, distributed about zero
• Implicit regularisation: early stopping
∗ With some activation functions, this shrinks the ANN towards a linear model (why?)
[Plot by Geek3, Wikimedia Commons (CC3)]
23

COMP90051 Statistical Machine Learning
Explicit regularisation
• Alternatively, explicit regularisation can be used, much like in ridge regression
• Instead of minimising the loss $L$, minimise the regularised function $L + \lambda\left(\sum_{i=0}^{m}\sum_{j=1}^{p} v_{ij}^2 + \sum_{j=0}^{p} w_j^2\right)$
• This will simply add $2\lambda v_{ij}$ and $2\lambda w_j$ terms to the partial derivatives
• With some activation functions this also shrinks the ANN towards a linear model
24

COMP90051 Statistical Machine Learning
This lecture
• Multilayer perceptron
∗ Model structure
∗ Universal approximation
∗ Training preliminaries
• Backpropagation
∗ Step-by-step derivation
∗ Notes on regularisation
• Next lecture: DNNs, CNNs, autoencoders
25