Lecture 16: Learning Parameters of Multi-layer Perceptrons with Backpropagation
Introduction to Machine Learning Semester 1, 2022
Copyright © University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Last lecture
• From perceptrons to neural networks
• multilayer perceptron
• some examples
• features and limitations
• Learning parameters of neural networks
• The Backpropagation algorithm
Recap: Multi-layer perceptrons
[Figure: a single perceptron with bias input 1 and inputs x₁, x₂, …, input weights θ, an activation function f, and output y = f(θᵀx)]
• Linearly separable data
• Perceptron learning rule
y = f(θᵀx)
Recap: Multi-layer perceptrons
• Linearly separable data
• Perceptron learning rule
[Figure: a fully connected MLP with inputs x₁, x₂, x₃; hidden units a²ⱼ = g(zⱼ) with zⱼ = Σₖ θ¹ⱼₖ xₖ; and output ŷ = g(Σⱼ θ²ⱼ a²ⱼ)]
Recall: Supervised learning
[Figure: the same MLP, with inputs x₁, x₂, x₃; hidden units a²ⱼ = g(zⱼ) with zⱼ = Σₖ θ¹ⱼₖ xₖ; and output ŷ]
1. Forward propagate an input x from the training set
2. Compute the output ŷ with the MLP
3. Compare predicted output ŷ against true output y; compute the error
4. Modify each weight such that the error decreases in future predictions (e.g., by applying gradient descent)
5. Repeat.
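A minimal NumPy sketch of this loop for a single sigmoid unit ŷ = g(θᵀx); the toy data, learning rate and epoch count are illustrative assumptions, not part of the lecture:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                     # 20 training inputs, 3 features
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0   # toy binary targets
    theta = np.zeros(3)
    eta = 0.5                                        # learning rate

    for epoch in range(100):
        for x_i, y_i in zip(X, y):
            z = theta @ x_i                          # 1. forward propagate the input
            y_hat = 1.0 / (1.0 + np.exp(-z))         # 2. compute the output y_hat
            err = y_i - y_hat                        # 3. compare y_hat against the true y
            grad = -err * y_hat * (1 - y_hat) * x_i  # dE/dtheta for E = 1/2 (y - y_hat)^2
            theta = theta - eta * grad               # 4. gradient descent weight update
                                                     # 5. repeat over instances and epochs

The rest of the lecture develops how to obtain the corresponding gradients when hidden layers sit between the input and the output.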
Recall: Optimization with Gradient Descent
[Figure: the same MLP, with inputs x₁, x₂, x₃; hidden units a²ⱼ = g(zⱼ); and output ŷ]
We want to
1. Find the best parameters, which lead to the smallest error E
2. Optimize each model parameter θˡᵢⱼ
3. We will use gradient descent to achieve that
4. θˡᵢⱼ⁽ᵗ⁺¹⁾ ← θˡᵢⱼ⁽ᵗ⁾ + ∆θˡᵢⱼ
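As a sketch of what this update does, here is gradient descent on a made-up one-dimensional error E(θ) = (θ − 3)²; the learning rate and iteration count are illustrative:

    # toy example: E(theta) = (theta - 3)^2 has its minimum at theta = 3
    eta = 0.1
    theta = 0.0
    for t in range(50):
        grad = 2.0 * (theta - 3.0)    # dE/dtheta
        delta_theta = -eta * grad     # delta theta = -eta * dE/dtheta
        theta = theta + delta_theta   # theta^(t+1) <- theta^(t) + delta theta
    # theta ends up close to 3.0

For the MLP, the difficult part is computing ∂E/∂θˡᵢⱼ for every weight; that is what backpropagation provides.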
Towards Backpropagation
Recall Perceptron learning:
• Pass an input x through and compute ŷ
[Figure: a single perceptron with inputs x₁, x₂, …; weighted sum z = Σₖ θₖ xₖ; output ŷ = a = g(z)]
• Compare ŷ against y
• Weight update θᵢ ← θᵢ + η(y − ŷ)xᵢ
Towards Backpropagation
Compare against the MLP:
[Figure: the MLP, with inputs x₁, x₂, x₃; hidden units a²ⱼ = g(Σₖ θ¹ⱼₖ xₖ); and output ŷ = g(Σⱼ θ²ⱼ a²ⱼ)]
Towards Backpropagation
Recall the perceptron weight update: θᵢ ← θᵢ + η(y − ŷ)xᵢ
• This update rule depends on true target outputs y
• We only have access to true outputs for the final layer
• We do not know the true activations for the hidden layers. Can we generalize the above rule to also update the hidden layers?
Backpropagation provides us with an efficient way of computing partial derivatives of the error of an MLP wrt. each individual weight.
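To see why efficiency matters, consider the naive alternative: estimating every partial derivative by finite differences, which costs one extra forward pass per weight. A small NumPy sketch under assumed shapes and data (not from the lecture):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_error(Theta1, Theta2, x, y):
        # forward pass of a small two-layer MLP, followed by the squared error
        a2 = sigmoid(Theta1 @ x)         # hidden activations
        y_hat = sigmoid(Theta2 @ a2)     # output activation
        return 0.5 * np.sum((y - y_hat) ** 2)

    rng = np.random.default_rng(0)
    Theta1 = rng.normal(size=(4, 3))     # 3 inputs to 4 hidden units
    Theta2 = rng.normal(size=(1, 4))     # 4 hidden units to 1 output
    x = np.array([0.2, -0.5, 1.0])
    y = np.array([1.0])

    # one extra forward pass per weight: exactly the cost backpropagation avoids
    eps = 1e-6
    grad1 = np.zeros_like(Theta1)
    base = forward_error(Theta1, Theta2, x, y)
    for i in range(Theta1.shape[0]):
        for j in range(Theta1.shape[1]):
            bumped = Theta1.copy()
            bumped[i, j] += eps
            grad1[i, j] = (forward_error(bumped, Theta2, x, y) - base) / eps

Backpropagation computes the same quantities with a single forward and a single backward pass.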
Backpropagation: Demo
[Figure: small MLP with input layer, hidden layer, and output layer with two output units a³₁ = ŷ₁ and a³₂ = ŷ₂]
• Receive input
• Forward pass: propagate activations through the network
• Compute error: compare output ŷ against true y
• Backward pass: propagate error terms through the network
• Calculate ∆θˡᵢⱼ for all θˡᵢⱼ
• Update weights θˡᵢⱼ ← θˡᵢⱼ + ∆θˡᵢⱼ
Interim Summary
• We recall what an MLP is
• We recall that we want to learn its parameters such that our prediction error is minimized
• We recall that gradient descent gives us a rule for updating the weights: θᵢ ← θᵢ + ∆θᵢ with ∆θᵢ = −η ∂E/∂θᵢ
• But how do we compute ∂E/∂θᵢ?
• Backpropagation provides us with an efficient way of computing partial derivatives of the error of an MLP wrt. each individual weight.
The (Generalized) Delta Rule
Backpropagation 1: Model definition
[Figure: network fragment with an output node i receiving activations aⱼ via weights θ²ᵢⱼ]
• Assuming a sigmoid activation function, the output of neuron i (or its activation aᵢ) is
aᵢ = g(zᵢ) = 1 / (1 + e^(−zᵢ))
• And zᵢ is the weighted sum of all incoming activations into neuron i:
zᵢ = Σⱼ θᵢⱼ aⱼ
• And Mean Squared Error (MSE) as the error function E:
E = Σᵢ ½ (yᵢ − ŷᵢ)²
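These definitions translate directly into NumPy; a small sketch (the function names are mine, not from the lecture):

    import numpy as np

    def sigmoid(z):
        # a = g(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_deriv(z):
        # for the sigmoid, g'(z) = g(z) * (1 - g(z)); used repeatedly below
        g = sigmoid(z)
        return g * (1.0 - g)

    def error(y, y_hat):
        # E = sum_i 1/2 * (y_i - y_hat_i)^2
        return 0.5 * np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)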
Backpropagation 2: Error of the final layer
[Figure: network fragment with weight θ²ᵢⱼ connecting node j to output node i]
• Apply gradient descent for input p and weight θ²ᵢⱼ connecting node j with node i:
∆θ²ᵢⱼ = −η ∂E/∂θ²ᵢⱼ = η (yₚ − ŷₚ) g′(zᵢ) aⱼ
• The weight update corresponds to an error term (δᵢ) scaled by the incoming activation aⱼ
• We attach a δᵢ to node i
Backpropagation: The Generalized Delta Rule
• The Generalized Delta Rule
∆θ²ᵢⱼ = −η ∂E/∂θ²ᵢⱼ = η (yₚ − ŷₚ) g′(zᵢ) aⱼ = η δᵢ aⱼ
δᵢ = (yₚ − ŷₚ) g′(zᵢ)
• The above δᵢ can only be applied to output units, because it relies on the target outputs yₚ.
• We do not have target outputs y for the intermediate layers
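A small NumPy sketch of this output-layer update on made-up numbers (sigmoid activation, so g′(z) = g(z)(1 − g(z)); all values and names are illustrative):

    import numpy as np

    eta = 0.1
    a_hidden = np.array([0.3, 0.7, 0.9])       # incoming activations a_j
    theta2_i = np.array([0.5, -0.2, 0.1])      # weights theta^2_ij into output node i
    z_i = theta2_i @ a_hidden
    y_hat = 1.0 / (1.0 + np.exp(-z_i))         # a_i = g(z_i)
    y = 1.0                                    # true target for instance p

    delta_i = (y - y_hat) * y_hat * (1.0 - y_hat)   # delta_i = (y_p - y_hat_p) * g'(z_i)
    delta_theta2 = eta * delta_i * a_hidden         # Delta theta^2_ij = eta * delta_i * a_j
    theta2_i = theta2_i + delta_theta2              # weight update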
Backpropagation: The Generalized Delta Rule
• Instead, we backpropagate the errors (δs) from right to left through the network
∆θ¹ⱼₖ = η δⱼ aₖ
δⱼ = (Σᵢ θ²ᵢⱼ δᵢ) g′(zⱼ)
where the sum runs over all nodes i that node j feeds into
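Continuing that sketch, the backpropagated hidden-layer deltas and the corresponding updates could look like this (shapes and numbers are again illustrative):

    import numpy as np

    eta = 0.1
    a_in = np.array([1.0, 0.2, -0.4])          # activations a_k feeding the hidden layer
    z_hidden = np.array([0.6, -0.3])           # pre-activations z_j of two hidden nodes
    g = 1.0 / (1.0 + np.exp(-z_hidden))
    g_prime = g * (1.0 - g)                    # sigmoid derivative g'(z_j)

    Theta2 = np.array([[0.5, -0.2]])           # theta^2_ij: 1 output node i, 2 hidden nodes j
    delta_out = np.array([0.08])               # delta_i from the output layer

    delta_hidden = (Theta2.T @ delta_out) * g_prime     # delta_j = (sum_i theta^2_ij delta_i) g'(z_j)
    Delta_Theta1 = eta * np.outer(delta_hidden, a_in)   # Delta theta^1_jk = eta * delta_j * a_k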
Backpropagation: Demo
[Figure: small MLP with input layer, hidden layer, and output layer with two output units a³₁ = ŷ₁ and a³₂ = ŷ₂]
• Receive input
• Forward pass: propagate activations through the network
• Compute error: compare output ŷ against true y
• Backward pass: propagate error terms through the network
• Calculate ∂E/∂θˡᵢⱼ for all θˡᵢⱼ
• Update weights θˡᵢⱼ ← θˡᵢⱼ + ∆θˡᵢⱼ
Backpropagation Algorithm
Design your neural network
Initialize parameters θ
repeat
  for training instance xᵢ do
    1. Forward pass the instance through the network, compute activations, determine output
    2. Compute the error
    3. Propagate the error back through the network, and compute for all weights between nodes i, j in all layers l
       ∆θˡᵢⱼ = −η ∂E/∂θˡᵢⱼ = η δᵢ aⱼ
    4. Update all parameters at once
       θˡᵢⱼ ← θˡᵢⱼ + ∆θˡᵢⱼ
until stopping criterion reached.
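A compact end-to-end sketch of this algorithm: a 2-2-1 sigmoid MLP with bias units trained on XOR using per-instance updates. The architecture, random seed, learning rate and epoch count are illustrative choices, not from the lecture, and with an unlucky initialization training may need more epochs or a restart:

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([0, 1, 1, 0], dtype=float)

    Theta1 = rng.normal(size=(2, 3))             # (2 inputs + bias) to 2 hidden units
    Theta2 = rng.normal(size=(1, 3))             # (2 hidden units + bias) to 1 output
    eta = 0.5

    for epoch in range(10000):                   # repeat until stopping criterion
        for x, y in zip(X, Y):
            a1 = np.append(x, 1.0)               # 1. forward pass (with bias unit)
            z2 = Theta1 @ a1
            a2 = np.append(g(z2), 1.0)           # hidden activations plus bias unit
            z3 = Theta2 @ a2
            a3 = g(z3)                           # network output y_hat
            delta3 = (y - a3) * a3 * (1 - a3)    # 2./3. output error term (y - y_hat) g'(z3)
            delta2 = (Theta2[:, :2].T @ delta3) * a2[:2] * (1 - a2[:2])  # hidden deltas (none for bias)
            Theta2 = Theta2 + eta * np.outer(delta3, a2)   # 4. update all parameters
            Theta1 = Theta1 + eta * np.outer(delta2, a1)

    for x, y in zip(X, Y):
        a2 = np.append(g(Theta1 @ np.append(x, 1.0)), 1.0)
        print(x, y, g(Theta2 @ a2))              # compare predictions with targets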
Derivation of the update rules
… optional slides after the next (summary) slide, for those who are interested!
After this lecture, you should be able to understand
• Why estimation of the MLP parameters is difficult
• How and why we use Gradient Descent to optimize the parameters
• How Backpropagation is a special instance of gradient descent, which allows us to efficiently compute the gradients of all weights wrt. the error
• The mechanism behind gradient descent
• The mathematical justification of gradient descent
Good job, everyone!
• You now know what (feed forward) neural networks are
• You now know what to consider when designing neural networks
• You now know how to estimate their parameters
• That’s more than the average ‘data scientist’ out there!
Backpropagation: Derivation
[Figure: a minimal network with input layer a¹ₖ = xₖ; one hidden unit with z² = Σₖ θ¹ₖ₁ a¹ₖ and a²₁ = g(z²); and an output unit with z³ = Σⱼ θ²ⱼ₁ a²ⱼ and a³₁ = g(z³) = ŷ]
Backpropagation: Derivation
Chain of reactions in the forward pass, focussing on the output layer
• varying a²₁ causes a change in z³
• varying z³ causes a change in a³₁ = g(z³)
• varying a³₁ = ŷ causes a change in E(y, ŷ)
Backpropagation: Derivation
We can use the chain rule to capture the behavior of θ²₁₁
∆θ²₁₁ = −η ∂E/∂θ²₁₁ = −η (∂E/∂a³₁)(∂a³₁/∂z³)(∂z³/∂θ²₁₁)
Backpropagation: Derivation
∆θ²₁₁ = −η ∂E/∂θ²₁₁ = −η (∂E/∂a³₁)(∂a³₁/∂z³)(∂z³/∂θ²₁₁)
Let's look at each term individually
• ∂E/∂aᵢ = −(yᵢ − aᵢ)   (recall that E = Σᵢ ½ (yᵢ − aᵢ)²)
• ∂a/∂z = ∂g(z)/∂z = g′(z)
• ∂z/∂θᵢⱼ = ∂(Σᵢ′ θᵢ′ⱼ aᵢ′)/∂θᵢⱼ = aᵢ
Putting the three terms together
∆θ²₁₁ = −η ∂E/∂θ²₁₁ = η (y − a³₁) g′(z³) a²₁ = η δ³₁ a²₁
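A quick sanity check of this result against a finite-difference estimate, on made-up numbers for the small network in the figure (a sketch, not part of the lecture):

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    a1 = np.array([1.0, 0.5, -0.3])      # input activations a^1_k
    theta1 = np.array([0.2, -0.4, 0.7])  # weights theta^1_k1 into the hidden unit
    theta2_11 = 0.6                      # the weight theta^2_11 we differentiate with respect to
    y = 1.0

    def E(theta2_11):
        a2 = g(theta1 @ a1)
        a3 = g(theta2_11 * a2)           # = y_hat
        return 0.5 * (y - a3) ** 2

    a2 = g(theta1 @ a1)
    a3 = g(theta2_11 * a2)
    analytic = -(y - a3) * a3 * (1 - a3) * a2        # -(y - a3_1) g'(z3) a2_1

    eps = 1e-6
    numeric = (E(theta2_11 + eps) - E(theta2_11 - eps)) / (2 * eps)
    print(analytic, numeric)             # the two agree up to finite-difference error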
Backpropagation: Derivation
We have another chain reaction. Let's consider layer 2
• varying any θ¹ₖ₁ causes a change in z²
• varying z² causes a change in a²₁ = g(z²)
• varying a²₁ causes a change in z³ (we consider θ² fixed for the moment)
• varying z³ causes a change in a³₁ = g(z³)
• varying a³₁ = ŷ causes a change in E(y, ŷ)
Backpropagation: Derivation
Formulating this again as the chain rule
∆θ¹ₖ₁ = −η ∂E/∂θ¹ₖ₁ = −η (∂E/∂a³₁)(∂a³₁/∂z³)(∂z³/∂a²₁)(∂a²₁/∂z²)(∂z²/∂θ¹ₖ₁)
Backpropagation: Derivation
∆θ¹ₖ₁ = −η ∂E/∂θ¹ₖ₁ = −η (∂E/∂a³₁)(∂a³₁/∂z³)(∂z³/∂a²₁)(∂a²₁/∂z²)(∂z²/∂θ¹ₖ₁)
We already know that
• ∂E/∂a³₁ = −(y − a³₁)
• ∂a³₁/∂z³ = g′(z³)
And following the previous logic, we can calculate that
• ∂z³/∂a²₁ = ∂(θ²₁₁ a²₁ + …)/∂a²₁ = θ²₁₁
• ∂a²₁/∂z² = ∂g(z²)/∂z² = g′(z²)
• ∂z²/∂θ¹ₖ₁ = a¹ₖ
Plugging these into the above we get
−η ∂E/∂θ¹ₖ₁ = −η (−(y − a³₁)) g′(z³) θ²₁₁ g′(z²) aₖ
            = η (y − a³₁) g′(z³) θ²₁₁ g′(z²) aₖ
            = η [(y − a³₁) g′(z³)] θ²₁₁ g′(z²) aₖ
            = η δ³₁ θ²₁₁ g′(z²) aₖ = η δ²₁ aₖ
with δ²₁ = δ³₁ θ²₁₁ g′(z²), which is exactly the generalized delta rule for a hidden-layer weight.
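The same kind of sanity check for a hidden-layer weight θ¹ₖ₁, using the five-term chain rule above (all numbers are illustrative):

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    a1 = np.array([1.0, 0.5, -0.3])      # input activations a^1_k
    theta1 = np.array([0.2, -0.4, 0.7])  # weights theta^1_k1 into the hidden unit
    theta2_11 = 0.6
    y = 1.0
    k = 2                                # which input weight to check

    def E(theta1):
        a2 = g(theta1 @ a1)
        a3 = g(theta2_11 * a2)
        return 0.5 * (y - a3) ** 2

    a2 = g(theta1 @ a1)
    a3 = g(theta2_11 * a2)
    # dE/dtheta^1_k1 = -(y - a3_1) g'(z3) theta^2_11 g'(z2) a^1_k
    analytic = -(y - a3) * a3 * (1 - a3) * theta2_11 * a2 * (1 - a2) * a1[k]

    eps = 1e-6
    bumped = theta1.copy()
    bumped[k] += eps
    numeric = (E(bumped) - E(theta1)) / eps
    print(analytic, numeric)             # agreement confirms the chain-rule derivation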