Learning Parameters of Multi-layer Perceptrons with
Backpropagation
COMP90049
Introduction to Machine Learning
Semester 1, 2020
Lea Frermann, CIS
1
Roadmap
Last lecture
• From perceptrons to neural networks
• multilayer perceptron
• some examples
• features and limitations
Today
• Learning parameters of neural networks
• The Backpropagation algorithm
2
Recap: Multi-layer perceptrons
[Figure: a single perceptron with inputs x1, x2, …, xF, weights θ0, θ1, …, θF, and an activation f(x; θ) producing the output y = f(θ^T x)]
• Linearly separable data
• Perceptron learning rule
[Figure: a four-layer MLP. Layer 1 holds the inputs x1 = a^1_1, x2 = a^1_2, x3 = a^1_3; layers 2 and 3 are hidden layers whose units each compute z = Σ_k θk xk followed by a = g(z); layer 4 produces the outputs ŷ_0 and ŷ_1. The weights θ^l_ij connect the units of layer l to those of layer l+1.]
3
Recall: Supervised learning
[Figure: the four-layer MLP from the previous slide]
Recipe
1. Forward propagate an input x from the training set
2. Compute the output ŷ with the MLP
3. Compare the predicted output ŷ against the true output y; compute the error
4. Modify each weight such that the error decreases in future predictions (e.g., by applying gradient descent)
5. Repeat.
(a minimal code sketch of this loop follows below)
4
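To make the recipe concrete, below is a runnable toy instance in Python/NumPy. As an assumption purely to keep the sketch short, it uses a single sigmoid unit in place of a full MLP; the data, learning rate and number of epochs are likewise made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                              # 20 training inputs with 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # toy binary targets

g = lambda z: 1.0 / (1.0 + np.exp(-z))                    # sigmoid activation
theta = np.zeros(3)
eta = 0.5                                                 # learning rate

for epoch in range(100):                                  # 5. Repeat
    for x, t in zip(X, y):
        y_hat = g(theta @ x)                              # 1.-2. forward propagate, compute output
        err = t - y_hat                                   # 3. compare prediction against true output
        grad = -err * y_hat * (1 - y_hat) * x             # 4. dE/dθ for squared error + sigmoid
        theta = theta - eta * grad                        #    gradient descent step on the weights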
Recall: Optimization with Gradient Descent
[Figure: the four-layer MLP from the previous slides]
We want to
1. Find the best parameters, which lead to the smallest error E
2. Optimize each model parameter θ^l_ij
3. We will use gradient descent to achieve that
4. θ^{l,(t+1)}_ij ← θ^{l,(t)}_ij + ∆θ^l_ij
(a one-dimensional worked example follows below)
5
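As a minimal illustration of this update rule, here is gradient descent in Python on a one-dimensional toy error surface; the surface E(θ) = (θ − 3)² and the learning rate are assumptions for illustration, not quantities from the slides.

eta = 0.1                            # learning rate η
theta = 0.0                          # initial parameter value
for t in range(50):
    grad = 2 * (theta - 3)           # ∂E/∂θ for E(θ) = (θ - 3)^2
    theta = theta + (-eta * grad)    # θ^(t+1) ← θ^(t) + ∆θ with ∆θ = −η ∂E/∂θ
print(theta)                         # ≈ 3, the minimiser of E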
Towards Backpropagation
Recall Perceptron learning:
• Pass an input through and compute ŷ
• Compare ŷ against y
• Weight update θi ← θi + η(y − ŷ)xi
Compare against the MLP:
[Figure: the four-layer MLP from the previous slides, with its weights θ^l_ij]
Problems
• This update rule depends on true target outputs y
• We only have access to true outputs for the final layer
• We do not know the true activations for the hidden layers. Can we
generalize the above rule to also update the hidden layers?
Backpropagation provides us with an efficient way of computing
partial derivatives of the error of an MLP wrt. each individual weight.
6
Backpropagation: Demo
[Figure: a small MLP with inputs x1 = a^1_1 and x2 = a^1_2, a hidden layer of three units, and an output layer of two units producing ŷ_1 and ŷ_2; every unit computes z = Σ_k θk xk followed by a = g(z). Successive builds of this slide show the activations a^2_i and a^3_i computed in the forward pass and the error terms δ^3_i and δ^2_i propagated in the backward pass.]
• Receive input
• Forward pass: propagate activations through the network
• Compute Error : compare output ŷ against true y
• Backward pass: propagate error terms through the network
• Calculate ∆θ^l_ij for all θ^l_ij
• Update weights θ^l_ij ← θ^l_ij + ∆θ^l_ij
7
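To make the forward pass concrete, here is a minimal sketch in Python/NumPy for a network of the shape drawn above (2 inputs, 3 hidden units, 2 outputs); all weight and input values are assumptions for illustration.

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))     # activation function

x = np.array([0.5, -1.0])                  # input activations a^1 = (x1, x2)
theta_1 = np.array([[0.10,  0.20],         # theta_1[j, k]: weight from input k to hidden unit j
                    [-0.40, 0.30],
                    [0.05, -0.10]])
theta_2 = np.array([[0.20, -0.50, 0.30],   # theta_2[i, j]: weight from hidden unit j to output i
                    [0.70,  0.10, -0.20]])

# Forward pass: every unit computes z = Σ_k θ_k x_k, then a = g(z).
a2 = g(theta_1 @ x)                        # hidden activations a^2_1, a^2_2, a^2_3
a3 = g(theta_2 @ a2)                       # output activations ŷ_1, ŷ_2
print(a3)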
Interim Summary
• We recall what an MLP is
• We recall that we want to learn its parameters such that our prediction error is minimized
• We recall that gradient descent gives us a rule for updating the weights: θi ← θi + ∆θi with ∆θi = −η ∂E/∂θi
• But how do we compute ∂E/∂θi ?
• Backpropagation provides us with an efficient way of computing
partial derivatives of the error of an MLP wrt. each individual weight.
8
The (Generalized) Delta Rule
Backpropagation 1: Model definition
[Figure: a chain of three units k → j → i, each computing a weighted sum Σ followed by an activation g; the weight θ^2_ij connects unit j to unit i]
• Assuming a sigmoid activation function, the output of neuron i (or its activation ai) is
  ai = g(zi) = 1 / (1 + e^(−zi))
• And zi is the weighted sum of all activations coming into neuron i
  zi = Σ_j θij aj
• And we use the Mean Squared Error (MSE) as the error function E
  E = 1/(2N) Σ_{i=1}^{N} (y^i − ŷ^i)²
9
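These three definitions translate directly into code. Below is a minimal sketch in Python/NumPy; the toy values at the end are assumptions for illustration.

import numpy as np

def g(z):
    # Sigmoid activation: g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(theta_i, a_in):
    # Activation of neuron i: a_i = g(z_i), with z_i = Σ_j θ_ij a_j
    z_i = theta_i @ a_in
    return g(z_i)

def mse(y, y_hat):
    # Mean Squared Error over N outputs: E = 1/(2N) Σ_i (y^i − ŷ^i)²
    return np.mean((y - y_hat) ** 2) / 2.0

a_in = np.array([0.5, -1.0, 2.0])          # incoming activations a_j (made up)
theta_i = np.array([0.1, 0.4, -0.2])       # weights θ_ij into neuron i (made up)
print(unit_output(theta_i, a_in))          # activation a_i
print(mse(np.array([1.0, 0.0]), np.array([0.8, 0.1])))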
Backpropagation 2: Error of the final layer
[Figure: the same chain of units, highlighting the weight θ^2_ij into output node i]
• Apply gradient descent for input p and weight θ^2_ij connecting node j with node i
  ∆θ^2_ij = −η ∂E/∂θ^2_ij = η (y^p − ŷ^p) g′(zi) aj = η δi aj
• The weight update corresponds to an error term (δi ) scaled by the
incoming activation
• We attach a δ to node i
10
Backpropagation: The Generalized Delta Rule
[Figure: the same chain of units, with the error term δi attached to output node i]
• The Generalized Delta Rule
  ∆θ^2_ij = −η ∂E/∂θ^2_ij = η (y^p − ŷ^p) g′(zi) aj = η δi aj
  with δi = (y^p − ŷ^p) g′(zi)
• The above δi can only be applied to output units, because it relies on the target outputs y^p.
• We do not have target outputs y for the intermediate layers
11
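A minimal sketch of this output-layer update in Python/NumPy, for one training instance and one sigmoid output unit; the activations, weights, target and learning rate are assumptions for illustration.

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoid
g_prime = lambda z: g(z) * (1.0 - g(z))    # its derivative

a = np.array([0.2, 0.7, 0.5])              # incoming activations a_j (made up)
theta_2 = np.array([0.1, -0.3, 0.8])       # weights θ^2_ij into output node i (made up)
y = 1.0                                    # true target y^p for this output node
eta = 0.1                                  # learning rate

z_i = theta_2 @ a                          # z_i = Σ_j θ_ij a_j
y_hat = g(z_i)                             # predicted output ŷ^p
delta_i = (y - y_hat) * g_prime(z_i)       # δ_i = (y^p − ŷ^p) g′(z_i)
theta_2 = theta_2 + eta * delta_i * a      # ∆θ^2_ij = η δ_i a_j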
Backpropagation: The Generalized Delta Rule
[Figure: the same chain of units, now highlighting the weight θ^1_jk into hidden node j; the error term δi of the output node is already known]
• Instead, we backpropagate the errors (δs) from right to left through the network
  ∆θ^1_jk = η δj ak
  with δj = (Σ_i θ^2_ij δi) g′(zj)
12
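A minimal sketch of backpropagating the δs one layer and updating the first-layer weights, in Python/NumPy; the network shape (2 inputs, 3 hidden units, 2 outputs), the already-computed output δs and all numeric values are assumptions for illustration.

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: g(z) * (1.0 - g(z))

a_in = np.array([0.5, -1.0])                       # input activations a_k (made up)
theta_1 = np.array([[0.10,  0.20],                 # theta_1[j, k]: weight into hidden node j
                    [-0.40, 0.30],
                    [0.05, -0.10]])
theta_2 = np.array([[0.20, -0.50, 0.30],           # theta_2[i, j]: weight from hidden j to output i
                    [0.70,  0.10, -0.20]])
delta_out = np.array([0.08, -0.03])                # δ_i of the output nodes (assumed already computed)
eta = 0.1

z_hidden = theta_1 @ a_in                          # z_j of each hidden node
delta_hidden = (theta_2.T @ delta_out) * g_prime(z_hidden)    # δ_j = (Σ_i θ_ij δ_i) g′(z_j)
theta_1 = theta_1 + eta * np.outer(delta_hidden, a_in)        # ∆θ^1_jk = η δ_j a_k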
Backpropagation: Demo
[Figure: the same small MLP as in the earlier demo; successive builds show the forward-pass activations a^2_i and a^3_i and the backward-pass error terms δ^3_i and δ^2_i]
• Receive input
• Forward pass: propagate activations through the network
• Compute Error : compare output ŷ against true y
• Backward pass: propagate error terms through the network
• Calculate ∂E/∂θ^l_ij for all θ^l_ij
• Update weights θ^l_ij ← θ^l_ij + ∆θ^l_ij
13
Backpropagation Algorithm
Design your neural network
Initialize parameters θ
repeat
for training instance xi do
1. Forward pass the instance through the network, compute activations, determine output
2. Compute the error
3. Propagate the error back through the network, and compute for all weights between nodes i, j in all layers l:
   ∆θ^l_ij = −η ∂E/∂θ^l_ij = η δi aj
4. Update all parameters at once:
   θ^l_ij ← θ^l_ij + ∆θ^l_ij
until stopping criteria reached.
14
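A compact, runnable sketch of this algorithm in Python/NumPy for a network with one hidden layer. The layer sizes, toy data, learning rate and stopping criterion are assumptions for illustration, and bias weights are omitted to stay close to the slides.

import numpy as np

rng = np.random.default_rng(0)
g = lambda z: 1.0 / (1.0 + np.exp(-z))              # sigmoid activation
g_prime = lambda z: g(z) * (1.0 - g(z))

# Toy data: 2 inputs -> 3 hidden units -> 1 output (XOR-like targets, illustrative only).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

theta_1 = rng.normal(scale=0.5, size=(3, 2))        # initialize weights into the hidden layer
theta_2 = rng.normal(scale=0.5, size=(1, 3))        # initialize weights into the output layer
eta = 0.5

for epoch in range(5000):                           # repeat until the stopping criterion
    for x, y in zip(X, Y):
        # 1. Forward pass: compute activations layer by layer
        z2 = theta_1 @ x
        a2 = g(z2)
        z3 = theta_2 @ a2
        a3 = g(z3)                                  # a3 = ŷ
        # 2.-3. Compute the error terms δ and the weight updates
        delta_3 = (y - a3) * g_prime(z3)            # output-layer δ
        delta_2 = (theta_2.T @ delta_3) * g_prime(z2)   # backpropagated hidden-layer δ
        d_theta_2 = eta * np.outer(delta_3, a2)     # ∆θ^2_ij = η δ_i a_j
        d_theta_1 = eta * np.outer(delta_2, x)      # ∆θ^1_jk = η δ_j a_k
        # 4. Update all parameters at once
        theta_2 = theta_2 + d_theta_2
        theta_1 = theta_1 + d_theta_1

Updating after each training instance, as in the pseudo-code above, is the stochastic variant; accumulating the ∆θ over all training instances before updating gives batch gradient descent.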
Derivation of the update rules
… optional slides after the next (summary) slide, for those who are interested!
15
Summary
After this lecture, you should be able to understand
• Why estimation of the MLP parameters is difficult
• How and why we use Gradient Descent to optimize the parameters
• How Backpropagation is a special instance of gradient descent, which
allows us to efficiently compute the gradients of all weights wrt. the error
• The mechanism behind gradient descent
• The mathematical justification of gradient descent
Good job, everyone!
• You now know what (feed forward) neural networks are
• You now know what to consider when designing neural networks
• You now know how to estimate their parameters
• That’s more than the average ‘data scientist’ out there!
16
Backpropagation: Derivation
[Figure: a minimal three-layer network. Input layer: x1 = a^1_1, x2 = a^1_2. Hidden layer: a single unit computing z^2 = Σ_k θ^1_k1 a^1_k and a^2_1 = g(z^2). Output layer: a single unit computing z^3 = Σ_k θ^2_k1 a^2_k and a^3_1 = g(z^3) = ŷ_1. The weights are θ^1_11, θ^1_21 and θ^2_11.]
Chain of reactions in the forward pass, focussing on the output layer
• varying a^2 causes a change in z^3
• varying z^3 causes a change in a^3_1 = g(z^3)
• varying a^3_1 = ŷ causes a change in E(y, ŷ)
We can use the chain rule to capture the behavior of θ^2_11 wrt. E
∆θ^2 = −η ∂E/∂θ^2 = −η (∂E/∂a^3_1)(∂a^3_1/∂z^3)(∂z^3/∂θ^2)
     = η (y − a^3_1) g′(z^3) a^2_1 = η δ^3_1 a^2_1,   with δ^3_1 = (y − a^3_1) g′(z^3)
Let’s look at each term individually
∂E/∂ai = −(yi − ai)        (recall that E = Σ_i^N ½ (yi − ai)²)
∂a/∂z = ∂g(z)/∂z = g′(z)
∂z/∂θij = ∂/∂θij Σ_i′ θi′j ai′ = Σ_i′ ∂/∂θij θi′j ai′ = ai
We have another chain reaction. Let’s consider layer 2
• varying any θ^1_k1 causes a change in z^2
• varying z^2 causes a change in a^2_1 = g(z^2)
• varying a^2_1 causes a change in z^3 (we consider θ^2 fixed for the moment)
• varying z^3 causes a change in a^3_1 = g(z^3)
• varying a^3_1 = ŷ causes a change in E(y, ŷ)
Formulating this again as the chain rule
∆θ^1_k1 = −η ∂E/∂θ^1_k1 = −η ((∂E/∂a^3_1)(∂a^3_1/∂z^3)(∂z^3/∂a^2_1)) (∂a^2_1/∂z^2)(∂z^2/∂θ^1_k1)
We already know that
∂E/∂a^3_1 = −(y − a^3_1)        ∂a^3_1/∂z^3 = g′(z^3)
And following the previous logic, we can calculate that
∂z^3/∂a^2_1 = ∂(θ^2 a^2_1)/∂a^2_1 = θ^2        ∂a^2_1/∂z^2 = ∂g(z^2)/∂z^2 = g′(z^2)        ∂z^2/∂θ^1_k1 = ak
Plugging these into the above we get
−η ∂E/∂θ^1_k1 = −η (−(y − a^3_1) g′(z^3) θ^2) g′(z^2) ak
              = η ((y − a^3_1) g′(z^3) θ^2) g′(z^2) ak        with (y − a^3_1) g′(z^3) = δ^3_1
              = η (δ^3_1 θ^2) g′(z^2) ak = η δ^2_1 ak          with (δ^3_1 θ^2) g′(z^2) = δ^2_1
17
If we had more than one weight θ^2
−η ∂E/∂θ^1_k1 = η (Σ_j (yj − a^3_j) g′(z^3_j) θ^2_1j) g′(z^2) ak        with (yj − a^3_j) g′(z^3_j) = δ^3_j
              = η (Σ_j δ^3_j θ^2_1j) g′(z^2) ak = η δ^2_1 ak            with (Σ_j δ^3_j θ^2_1j) g′(z^2) = δ^2_1
18
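One way to sanity-check the derivation is to compare the δ-based gradient with a finite-difference estimate of ∂E/∂θ^1_k1. Below is a minimal sketch in Python/NumPy for the small network used in this derivation; all numeric values are assumptions for illustration.

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: g(z) * (1.0 - g(z))

def error(theta_1, theta_2, x, y):
    # Forward pass of the 2-input / 1-hidden-unit / 1-output network, then squared error.
    a2 = g(theta_1 @ x)              # hidden activation a^2_1
    a3 = g(theta_2 * a2)             # output activation a^3_1 = ŷ
    return 0.5 * (y - a3) ** 2

x = np.array([0.4, -0.7])            # inputs a^1_k (made up)
y = 1.0                              # target (made up)
theta_1 = np.array([0.3, -0.2])      # weights θ^1_k1 (made up)
theta_2 = 0.5                        # weight θ^2_11 (made up)

# Backprop gradient: ∂E/∂θ^1_k1 = −δ^2_1 a^1_k with δ^2_1 = δ^3_1 θ^2 g′(z^2)
z2 = theta_1 @ x
a2 = g(z2)
z3 = theta_2 * a2
a3 = g(z3)
delta_3 = (y - a3) * g_prime(z3)
delta_2 = delta_3 * theta_2 * g_prime(z2)
grad_backprop = -delta_2 * x

# Finite-difference estimate: ∂E/∂θ ≈ (E(θ + ε) − E(θ − ε)) / (2ε)
eps = 1e-6
grad_numeric = np.zeros_like(theta_1)
for k in range(len(theta_1)):
    tp, tm = theta_1.copy(), theta_1.copy()
    tp[k] += eps
    tm[k] -= eps
    grad_numeric[k] = (error(tp, theta_2, x, y) - error(tm, theta_2, x, y)) / (2 * eps)

print(np.allclose(grad_backprop, grad_numeric))    # expected: True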