Learning Parameters of Multi-layer Perceptrons with Backpropagation
COMP90049
Introduction to Machine Learning Semester 1, 2020
Lea Frermann, CIS
Roadmap

Last lecture
• From perceptrons to neural networks
  • multilayer perceptron
  • some examples
  • features and limitations

Today
• Learning parameters of neural networks
• The Backpropagation algorithm
Recap: Multi-layer perceptrons

• Linearly separable data
• Perceptron learning rule

[Figure: a single perceptron with inputs 1, x1, …, xF, weights θ0, θ1, …, θF and output y = f(θᵀx), where f is the activation function; it is then extended to a four-layer multi-layer perceptron with inputs x1, x2, x3, two hidden layers, and output ŷ, in which every unit computes z = Σ_k θ_k x_k and an activation a = g(z)]
Recall: Supervised learning

[Figure: the four-layer MLP from before: input layer (x1, x2, x3), two hidden layers, and an output layer producing ŷ; every unit computes z = Σ_k θ_k x_k and a = g(z)]

Recipe
1. Forward propagate an input x from the training set
2. Compute the output ŷ with the MLP
3. Compare predicted output ŷ against true output y; compute the error
4. Modify each weight such that the error decreases in future predictions (e.g., by applying gradient descent)
5. Repeat.
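To make the recipe concrete, here is a minimal sketch that applies exactly these five steps to the simplest possible "network", a single sigmoid neuron trained with squared error; the toy data, learning rate and epoch count are invented for illustration and are not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (invented): inputs with a leading 1 for the bias weight.
X = np.array([[1, 0.0, 0.0], [1, 0.0, 1.0], [1, 1.0, 0.0], [1, 1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

theta = np.zeros(3)   # one weight per input (incl. bias)
eta = 0.5             # learning rate (arbitrary)

for epoch in range(1000):
    for x_i, y_i in zip(X, y):
        z = theta @ x_i                 # 1. forward propagate the input
        y_hat = sigmoid(z)              # 2. compute the output
        error = y_i - y_hat             # 3. compare against the true output
        # 4. gradient-descent step on E = 1/2 (y - y_hat)^2:
        #    dE/dtheta = -(y - y_hat) * g'(z) * x, with g'(z) = g(z)(1 - g(z))
        theta += eta * error * y_hat * (1 - y_hat) * x_i
# 5. repeat (the outer loop) until the error is small enough
```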
Recall: Optimization with Gradient Descent

[Figure: the same four-layer MLP]

We want to
1. Find the best parameters, which lead to the smallest error E
2. Optimize each model parameter θ^l_ij
3. We will use gradient descent to achieve that
4. θ^(l,(t+1))_ij ← θ^(l,(t))_ij + ∆θ^l_ij
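To see the update rule in isolation, the following sketch minimises a one-dimensional error function E(θ) = (θ − 3)² with exactly this iteration; the function, starting point and learning rate are arbitrary choices for illustration, not from the slides.

```python
# Minimise E(theta) = (theta - 3)^2 with the update theta <- theta + delta,
# where delta = -eta * dE/dtheta. Purely illustrative values.
def E(theta):
    return (theta - 3.0) ** 2

def dE_dtheta(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate

for t in range(100):
    delta = -eta * dE_dtheta(theta)   # ∆θ = -η ∂E/∂θ
    theta = theta + delta             # θ^(t+1) ← θ^(t) + ∆θ

print(theta)  # converges towards 3.0, the minimiser of E
```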
Towards Backpropagation

[Figure: a perceptron with inputs 1, x1, x2, x3, weights θ^1, a single unit computing z = Σ_k θ_k x_k and a = g(z), and output ŷ]

Recall Perceptron learning:
• Pass an input x through and compute ŷ
• Compare ŷ against y
• Weight update θi ← θi + η(y − ŷ)xi
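For contrast with the gradient-based updates that follow, here is a minimal sketch of this perceptron rule on a toy, linearly separable problem (OR); the data, step activation and learning rate are assumptions made for the example.

```python
import numpy as np

# Toy linearly separable data (OR function), with a leading 1 for the bias.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 1, 1, 1])

theta = np.zeros(3)
eta = 0.1

for epoch in range(20):
    for x_i, y_i in zip(X, y):
        y_hat = 1 if theta @ x_i > 0 else 0        # step activation
        theta = theta + eta * (y_i - y_hat) * x_i  # θ ← θ + η(y − ŷ)x

print(theta)  # weights that separate the positive from the negative class
```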
Compare against the MLP:

[Figure: the four-layer MLP from before, with two hidden layers between the inputs x and the output ŷ]
Problems
• This update rule depends on the true target outputs y
• We only have access to true outputs for the final layer
• We do not know the true activations for the hidden layers. Can we generalize the above rule to also update the hidden layers?

Backpropagation provides us with an efficient way of computing partial derivatives of the error of an MLP wrt. each individual weight.
Backpropagation: Demo

[Figure: a small network with an input layer (x1 = a^1_1, x2 = a^1_2), a hidden layer of three units (a^2_1, a^2_2, a^2_3) and an output layer of two units (a^3_1 = ŷ1, a^3_2 = ŷ2); error terms δ^2 and δ^3 are attached to the hidden and output units]

• Receive input
• Forward pass: propagate activations through the network
• Compute Error: compare output ŷ against true y
• Backward pass: propagate error terms through the network
• Calculate ∆θ^l_ij for all θ^l_ij
• Update weights θ^l_ij ← θ^l_ij + ∆θ^l_ij
Interim Summary

• We recall what an MLP is
• We recall that we want to learn its parameters such that our prediction error is minimized
• We recall that gradient descent gives us a rule for updating the weights: θi ← θi + ∆θi with ∆θi = −η ∂E/∂θi
• But how do we compute ∂E/∂θi?
• Backpropagation provides us with an efficient way of computing partial derivatives of the error of an MLP wrt. each individual weight.
The (Generalized) Delta Rule

Backpropagation 1: Model definition

[Figure: three consecutive nodes k → j → i, each computing a weighted sum Σ followed by g; θ^2_ij connects node j to node i]

• Assuming a sigmoid activation function, the output of neuron i (or its activation ai) is
  ai = g(zi) = 1 / (1 + e^(−zi))
• And zi is the weighted sum of all incoming activations into neuron i:
  zi = Σ_j θij aj
• And Mean Squared Error (MSE) as error function E:
  E = 1/(2N) Σ_{i=1}^N (yi − ŷi)^2
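A small sketch of these three ingredients in code; the function names are mine, not from the slides, and the derivative of the sigmoid, g′(z) = g(z)(1 − g(z)), is included because the update rules below rely on it.

```python
import numpy as np

def g(z):
    """Sigmoid activation: a = g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    return g(z) * (1.0 - g(z))

def weighted_input(theta_i, a):
    """z_i = sum_j theta_ij * a_j over the incoming activations a."""
    return theta_i @ a

def mse(y, y_hat):
    """Mean squared error E = 1/(2N) * sum_i (y_i - y_hat_i)^2."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sum((y - y_hat) ** 2) / (2 * len(y))
```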
Backpropagation 2: Error of the final layer

[Figure: nodes k → j → i; θ^2_ij connects node j to the output node i]

• Apply gradient descent for input p and weight θ^2_ij connecting node j with node i:
  ∆θ^2_ij = −η ∂E/∂θ^2_ij = η (y_p − ŷ_p) g′(zi) aj = η δi aj
• The weight update corresponds to an error term (δi) scaled by the incoming activation
• We attach a δ to node i
Backpropagation: The Generalized Delta Rule

[Figure: nodes k → j → i; a δi is attached to the output node i]

• The Generalized Delta Rule:
  ∆θ^2_ij = −η ∂E/∂θ^2_ij = η (y_p − ŷ_p) g′(zi) aj = η δi aj
  with δi = (y_p − ŷ_p) g′(zi)
• The above δi can only be applied to output units, because it relies on the target outputs y_p.
• We do not have target outputs y for the intermediate layers
Backpropagation: The Generalized Delta Rule

[Figure: nodes k → j → i; θ^1_jk connects node k to node j, and the error terms δ flow backwards from node i]

• Instead, we backpropagate the errors (δs) from right to left through the network:
  ∆θ^1_jk = η δj ak
  with δj = (Σ_i θij δi) g′(zj)
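Continuing the sketch above (it reuses g, eta, theta2 and delta_out from the previous block), the hidden-layer deltas and the "update all at once" step would look like this; again, the sizes and numbers are invented.

```python
# Hypothetical hidden layer: 2 inputs a_k feeding the 3 hidden nodes j.
a_in = np.array([1.0, 0.5])                    # a_k, activations entering the hidden layer
theta1 = np.array([[ 0.3, -0.1],               # theta^1_jk, one row per hidden node j
                   [-0.2,  0.6],
                   [ 0.5,  0.1]])

z_hidden = theta1 @ a_in                       # z_j = sum_k theta^1_jk a_k

# Backpropagate: δ_j = (sum_i theta_ij * δ_i) * g'(z_j),
# where theta2 and delta_out come from the output-layer sketch above.
delta_hidden = (theta2.T @ delta_out) * g(z_hidden) * (1 - g(z_hidden))

delta_theta1 = eta * np.outer(delta_hidden, a_in)   # ∆θ^1_jk = η δ_j a_k

# update all parameters at once, as in the algorithm below
theta1 = theta1 + delta_theta1
theta2 = theta2 + delta_theta2
```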
Backpropagation: Demo

[Figure: the same small network as before, with δ^2 attached to the hidden units and δ^3 to the output units]

• Receive input
• Forward pass: propagate activations through the network
• Compute Error: compare output ŷ against true y
• Backward pass: propagate error terms through the network
• Calculate ∂E/∂θ^l_ij for all θ^l_ij
• Update weights θ^l_ij ← θ^l_ij + ∆θ^l_ij
Backpropagation Algorithm

Design your neural network
Initialize parameters θ
repeat
  for training instance xi do
    1. Forward pass the instance through the network, compute activations, determine output
    2. Compute the error
    3. Propagate the error back through the network, and compute for all weights between nodes i, j in all layers l:
       ∆θ^l_ij = −η ∂E/∂θ^l_ij = η δi aj
    4. Update all parameters at once:
       θ^l_ij ← θ^l_ij + ∆θ^l_ij
until stopping criteria reached.
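Below is a hedged, end-to-end sketch of this algorithm for the 2–3–2 network from the demo, reusing the sigmoid setup from earlier; the layer sizes, toy data and hyperparameters are invented, and bias units are omitted to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    return g(z) * (1.0 - g(z))

# Design the network: 2 inputs -> 3 hidden units -> 2 outputs (no biases here).
theta1 = rng.normal(scale=0.5, size=(3, 2))   # theta^1_jk
theta2 = rng.normal(scale=0.5, size=(2, 3))   # theta^2_ij
eta = 0.5

# Toy training data (invented): two inputs, two target outputs per instance.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

for epoch in range(2000):                     # repeat ...
    for x, y in zip(X, Y):                    # for training instance x_i do
        # 1. forward pass: compute activations, determine output
        z2 = theta1 @ x;   a2 = g(z2)
        z3 = theta2 @ a2;  a3 = g(z3)         # a3 = y_hat
        # 2. compute the error (useful to monitor, not needed for the update)
        E = 0.5 * np.sum((y - a3) ** 2)
        # 3. propagate the error back and compute all gradients
        delta3 = (y - a3) * g_prime(z3)               # output-layer deltas
        delta2 = (theta2.T @ delta3) * g_prime(z2)    # hidden-layer deltas
        d_theta2 = eta * np.outer(delta3, a2)         # ∆θ^2_ij = η δ_i a_j
        d_theta1 = eta * np.outer(delta2, x)          # ∆θ^1_jk = η δ_j a_k
        # 4. update all parameters at once
        theta1 = theta1 + d_theta1
        theta2 = theta2 + d_theta2
# ... until stopping criteria reached
```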
Derivation of the update rules
… optional slides after the next (summary) slide, for those who are interested!
Summary

After this lecture, you should be able to understand
• Why estimation of the MLP parameters is difficult
• How and why we use Gradient Descent to optimize the parameters
• How Backpropagation is a special instance of gradient descent, which allows us to efficiently compute the gradients of all weights wrt. the error
• The mechanism behind gradient descent
• The mathematical justification of gradient descent

Good job, everyone!
• You now know what (feed forward) neural networks are
• You now know what to consider when designing neural networks
• You now know how to estimate their parameters
• That's more than the average 'data scientist' out there!
Backpropagation: Derivation

[Figure: a 2–1–1 network: inputs x1 = a^1_1 and x2 = a^1_2 connect via θ^1_11, θ^1_21 to a hidden unit with z^2 = Σ_k θ^1_k1 a^1_k and a^2 = g(z^2), which connects via θ^2_11 to the output unit with z^3 = Σ_k θ^2_k1 a^2_k and a^3 = g(z^3) = ŷ]

Chain of reactions in the forward pass, focussing on the output layer
• varying a^2 causes a change in z^3
• varying z^3 causes a change in a^3_1 = g(z^3)
• varying a^3_1 = ŷ causes a change in E(y, ŷ)

We can use the chain rule to capture the behavior of θ^2_11 wrt E

∆θ^2_11 = −η ∂E/∂θ^2_11 = −η (∂E/∂a^3_1) (∂a^3_1/∂z^3) (∂z^3/∂θ^2_11)

Let's look at each term individually

∂E/∂ai = −(yi − ai)   (recall that E = Σ_i ½ (yi − ai)^2)
∂a/∂z = ∂g(z)/∂z = g′(z)
∂z/∂θij = ∂(Σ_{i′} θ_{i′j} a_{i′})/∂θij = ai

Putting these together,

∆θ^2_11 = −η ∂E/∂θ^2_11 = η (y − a^3_1) g′(z^3) a^2_1 = η δ^3_1 a^2_1,  with δ^3_1 = (y − a^3_1) g′(z^3)
Backpropagation: Derivation

[Figure: the same 2–1–1 network, with error terms δ^3_1 attached to the output unit and δ^2_1 to the hidden unit]

We have another chain reaction. Let's consider layer 2
• varying any θ^1_k1 causes a change in z^2
• varying z^2 causes a change in a^2_1 = g(z^2)
• varying a^2_1 causes a change in z^3 (we consider θ^2 fixed for the moment)
• varying z^3 causes a change in a^3_1 = g(z^3)
• varying a^3_1 = ŷ causes a change in E(y, ŷ)

Formulating this again as the chain rule

∆θ^1_k1 = −η ∂E/∂θ^1_k1 = −η (∂E/∂a^3_1) (∂a^3_1/∂z^3) (∂z^3/∂a^2_1) (∂a^2_1/∂z^2) (∂z^2/∂θ^1_k1)

We already know that

∂E/∂a^3_1 = −(y − a^3_1)
∂a^3_1/∂z^3 = g′(z^3)

And following the previous logic, we can calculate that

∂z^3/∂a^2_1 = ∂(θ^2 a^2_1)/∂a^2_1 = θ^2
∂a^2_1/∂z^2 = ∂g(z^2)/∂z^2 = g′(z^2)
∂z^2/∂θ^1_k1 = ak

Plugging these into the above we get

−η ∂E/∂θ^1_k1 = −η [−(y − a^3_1) g′(z^3) θ^2 g′(z^2) ak]
             = η (y − a^3_1) g′(z^3) θ^2 g′(z^2) ak     [where (y − a^3_1) g′(z^3) = δ^3_1]
             = η δ^3_1 θ^2 g′(z^2) ak                   [where δ^3_1 θ^2 g′(z^2) = δ^2_1]
             = η δ^2_1 ak
Backpropagation: Derivation

[Figure: the same network, with δ^2_1 attached to the hidden unit]

Formulating this again as the chain rule

−η ∂E/∂θ^1_k1 = −η (∂E/∂a^3_1) (∂a^3_1/∂z^3) (∂z^3/∂a^2_1) (∂a^2_1/∂z^2) (∂z^2/∂θ^1_k1)

If we had more than one output weight θ^2_1j,

−η ∂E/∂θ^1_k1 = η Σ_j (yj − a^3_j) g′(z^3_j) θ^2_1j g′(z^2) ak     [where (yj − a^3_j) g′(z^3_j) = δ^3_j]
             = η (Σ_j δ^3_j θ^2_1j) g′(z^2) ak = η δ^2_1 ak        [where (Σ_j δ^3_j θ^2_1j) g′(z^2) = δ^2_1]
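As a sanity check on this derivation, the sketch below compares the analytic gradient −δ^2_1 a^1_k of the 2–1–1 network against a finite-difference estimate of ∂E/∂θ^1_k1; all numbers are made up for illustration.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    return g(z) * (1.0 - g(z))

# The 2-1-1 network from the derivation: two inputs, one hidden unit, one output.
x = np.array([0.4, -0.7])        # a^1 (invented input)
theta1 = np.array([0.3, 0.9])    # theta^1_11, theta^1_21
theta2 = 0.6                     # theta^2_11
y = 1.0                          # true output (invented)

def error(theta1, theta2):
    a2 = g(theta1 @ x)           # hidden activation
    a3 = g(theta2 * a2)          # output = y_hat
    return 0.5 * (y - a3) ** 2

# Analytic gradient via the deltas derived above.
z2 = theta1 @ x;  a2 = g(z2)
z3 = theta2 * a2; a3 = g(z3)
delta3 = (y - a3) * g_prime(z3)
delta2 = delta3 * theta2 * g_prime(z2)
analytic = -delta2 * x           # dE/dtheta^1_k1 = -delta^2_1 * a^1_k

# Finite-difference check of dE/dtheta^1_k1 (central differences).
eps = 1e-6
numeric = np.array([
    (error(theta1 + eps * np.eye(2)[k], theta2)
     - error(theta1 - eps * np.eye(2)[k], theta2)) / (2 * eps)
    for k in range(2)
])

print(analytic, numeric)         # the two vectors should agree closely
```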