
Learning Parameters of Multi-layer Perceptrons with
Backpropagation

COMP90049
Introduction to Machine Learning
Semester 1, 2020

Lea Frermann, CIS

1

Roadmap

Last lecture

• From perceptrons to neural networks

• multilayer perceptron

• some examples

• features and limitations

Today

• Learning parameters of neural networks

• The Backpropagation algorithm

2

Recap: Multi-layer perceptrons

[Figure: a single perceptron with inputs x1, ..., xF plus a bias input 1, weights θ0, θ1, ..., θF, an activation function f, and output y = f(θ^T x). Labels: input, weights, activation, output.]

• Linearly separable data

• Perceptron learning rule

3

[Figure: a feed-forward MLP with an input layer (Layer 1: x1, x2, x3 plus a bias unit), two hidden layers (Layer 2: a2_1, a2_2, a2_3; Layer 3: a3_1, ..., a3_4) and an output layer (Layer 4: a4_1, a4_2). Each unit computes z = Σ_k θ_k x_k followed by a = g(z); the edges carry weights θ^l_ij.]

3

Recall: Supervised learning

[Figure: the four-layer MLP from the recap, with each unit computing z = Σ_k θ_k x_k and a = g(z).]

Recipe

1. Forward propagate an input x from the training set

2. Compute the output ŷ with the MLP

3. Compare predicted output ŷ against true output y ; compute the error

4. Modify each weight such that the error decreases in future
predictions (e.g., by applying gradient descent)

5. Repeat.

4


Recall: Optimization with Gradient Descent

[Figure: the four-layer MLP from the recap, with weights θ^l_ij between layers.]

We want to

1. Find the best parameters, which lead to the smallest error E

2. Optimize each model parameter θ^l_ij

3. We will use gradient descent to achieve that

4. θ^(l,(t+1))_ij ← θ^(l,(t))_ij + ∆θ^l_ij

5
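The update in step 4 can be sketched on a one-dimensional toy problem. The quadratic error function, learning rate and iteration count below are illustrative assumptions, not from the slides:

```python
# Gradient descent on a toy error function E(theta) = (theta - 3)^2,
# whose gradient is dE/dtheta = 2 * (theta - 3).
eta = 0.1        # learning rate (illustrative)
theta = 0.0      # initial parameter

for _ in range(100):
    grad = 2 * (theta - 3)   # dE/dtheta
    delta = -eta * grad      # delta_theta = -eta * dE/dtheta
    theta = theta + delta    # theta <- theta + delta_theta

print(theta)                 # approaches the minimiser theta = 3
```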

Towards Backpropagation

Recall Perceptron learning:

• Pass an input x through the perceptron and compute the output ŷ

• Compare ŷ against y

• Weight update θi ← θi + η(y − ŷ)xi

6

Compare against the MLP:

[Figure: the four-layer MLP from the recap.]

Problems

• This update rule depends on true target outputs y

• We only have access to true outputs for the final layer

• We do not know the true activations for the hidden layers. Can we
generalize the above rule to also update the hidden layers?

Backpropagation provides us with an efficient way of computing
partial derivatives of the error of an MLP wrt. each individual weight.

6

Backpropagation: Demo

[Figure: a small feed-forward network with two inputs x1 = a1_1 and x2 = a1_2, a hidden layer of three Σ/g units, and an output layer of two Σ/g units producing ŷ1 and ŷ2. Labels: input layer, hidden layer, output layer.]

• Receive input

• Forward pass: propagate activations through the network

• Compute error: compare output ŷ against true y

• Backward pass: propagate error terms through the network

• Calculate ∆θ^l_ij for all θ^l_ij

• Update weights θ^l_ij ← θ^l_ij + ∆θ^l_ij

7

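The forward pass and error computation from the demo can be sketched numerically. The network shape (2 inputs, 3 hidden units, 2 outputs) follows the demo figure, but the weight values, the input and the labels below are random or illustrative placeholders:

```python
import numpy as np

def g(z):
    """Sigmoid activation a = g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

# Random placeholder weights with the demo network's shapes
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 2))   # input -> hidden weights
theta2 = rng.normal(size=(2, 3))   # hidden -> output weights

x = np.array([0.5, -1.0])          # input (x1, x2) = (a1_1, a1_2), illustrative
a2 = g(theta1 @ x)                 # forward pass: hidden activations a2_1..a2_3
y_hat = g(theta2 @ a2)             # output activations (y_hat_1, y_hat_2)

y = np.array([1.0, 0.0])           # assumed true labels for this instance
error = 0.5 * np.sum((y - y_hat) ** 2)   # squared error of the prediction
```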

Interim Summary

• We recall what an MLP is

• We recall that we want to learn its parameters such that our prediction
error is minimized

• We recall that gradient descent gives us a rule for updating the weights

θ_i ← θ_i + ∆θ_i with ∆θ_i = −η ∂E/∂θ_i

• But how do we compute ∂E/∂θ_i?

• Backpropagation provides us with an efficient way of computing
partial derivatives of the error of an MLP wrt. each individual weight.

8

The (Generalized) Delta Rule

Backpropagation 1: Model definition

[Figure: a final-layer weight θ^2_ij connecting a hidden Σ/g unit j to an output Σ/g unit i.]

• Assuming a sigmoid activation function, the output of neuron i (or its
activation a_i) is

a_i = g(z_i) = 1 / (1 + e^(−z_i))

• And z_i is the weighted sum of all incoming activations into neuron i

z_i = Σ_j θ_ij a_j

• And we use Mean Squared Error (MSE) as the error function E

E = 1/(2N) Σ_{i=1}^{N} (y^i − ŷ^i)²

9
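The three definitions above translate directly into code. A minimal sketch (the function names are our own, not from the slides):

```python
import numpy as np

def g(z):
    """Sigmoid: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def activation(theta_i, a):
    """a_i = g(z_i) with z_i = sum_j theta_ij * a_j."""
    z_i = np.dot(theta_i, a)
    return g(z_i)

def mse(y, y_hat):
    """E = 1/(2N) * sum_{i=1}^{N} (y_i - y_hat_i)^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sum((y - y_hat) ** 2) / (2 * len(y))
```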

Backpropagation 2: Error of the final layer

[Figure: the weight θ^2_ij between hidden unit j and output unit i.]

• Apply gradient descent for input p and weight θ^2_ij connecting node j with
node i

∆θ^2_ij = −η ∂E/∂θ^2_ij = η (y^p − ŷ^p) g′(z_i) a_j = η δ_i a_j

• The weight update corresponds to an error term (δ_i) scaled by the
incoming activation

• We attach a δ to node i

10

Backpropagation: The Generalized Delta Rule

[Figure: the output-layer weight θ^2_ij, with the error term δ_i attached to node i.]

• The Generalized Delta Rule

∆θ^2_ij = −η ∂E/∂θ^2_ij = η (y^p − ŷ^p) g′(z_i) a_j = η δ_i a_j

δ_i = (y^p − ŷ^p) g′(z_i)

• The above δ_i can only be applied to output units, because it relies on the
target outputs y^p.

• We do not have target outputs y for the intermediate layers

11
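A numeric sketch of the delta rule for one output unit; all values below are illustrative assumptions:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    s = g(z)
    return s * (1.0 - s)

# Illustrative numbers for a single output unit i fed by one hidden unit j
a_j = 0.8        # incoming activation
theta_ij = 0.5   # output-layer weight theta^2_ij
eta = 0.1        # learning rate
y_p = 1.0        # target output for instance p

z_i = theta_ij * a_j                        # net input of the output unit
y_hat_p = g(z_i)                            # prediction

delta_i = (y_p - y_hat_p) * g_prime(z_i)    # delta_i = (y^p - y_hat^p) g'(z_i)
d_theta = eta * delta_i * a_j               # weight update eta * delta_i * a_j
```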

Backpropagation: The Generalized Delta Rule

[Figure: a hidden-layer weight θ^1_jk into node j; node j's outgoing weights lead to nodes i, each carrying an error term δ_i.]

• Instead, we backpropagate the errors (δs) from right to left through the
network

∆θ^1_jk = η δ_j a_k

δ_j = (Σ_i θ^2_ij δ_i) g′(z_j)

12
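The backpropagated hidden-layer delta can be sketched the same way. The deltas, weights and activations below are made-up numbers:

```python
import numpy as np

# Node j feeds two downstream nodes i whose deltas are already known.
delta_next = np.array([0.12, -0.05])  # deltas delta_i of the downstream nodes
theta2_ij = np.array([0.4, -0.3])     # weights theta^2_ij from node j to them
a_j = 0.7                             # activation of node j
a_k = 0.9                             # activation feeding into node j
eta = 0.1                             # learning rate

# delta_j = (sum_i theta^2_ij * delta_i) * g'(z_j),
# using g'(z_j) = a_j * (1 - a_j) for the sigmoid
delta_j = np.dot(theta2_ij, delta_next) * (a_j * (1.0 - a_j))

# Weight update: Delta theta^1_jk = eta * delta_j * a_k
d_theta1_jk = eta * delta_j * a_k
```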

Backpropagation: Demo

[Figure: the small demo network again — two inputs x1 = a1_1, x2 = a1_2, three hidden Σ/g units, two output units producing ŷ1 and ŷ2.]

• Receive input

• Forward pass: propagate activations through the network

• Compute error: compare output ŷ against true y

• Backward pass: propagate error terms through the network

• Calculate ∂E/∂θ^l_ij for all θ^l_ij

• Update weights θ^l_ij ← θ^l_ij + ∆θ^l_ij

13


Backpropagation Algorithm

Design your neural network
Initialize parameters θ
repeat

for training instance xi do
1. Forward pass the instance through the network, compute activations,
determine output
2. Compute the error
3. Propagate error back through the network, and compute for all
weights between nodes ij in all layers l

∆θ^l_ij = −η ∂E/∂θ^l_ij = η δ_i a_j

4. Update all parameters at once

θ^l_ij ← θ^l_ij + ∆θ^l_ij

until stopping criteria reached.

14

Derivation of the update rules

… optional slides after the next (summary) slide, for those who are interested!

15

Summary

After this lecture, you should be able to understand

• Why estimation of the MLP parameters is difficult

• How and why we use Gradient Descent to optimize the parameters

• How Backpropagation is a special instance of gradient descent, which
allows us to efficiently compute the gradients of all weights wrt. the error

• The mechanism behind gradient descent

• The mathematical justification of gradient descent

Good job, everyone!

• You now know what (feed forward) neural networks are

• You now know what to consider when designing neural networks

• You now know how to estimate their parameters

• That’s more than the average ‘data scientist’ out there!

16

Backpropagation: Derivation

[Figure: a 2-1-1 network. Input layer: x1 = a1_1, x2 = a1_2. Hidden layer: z^2 = Σ_k θ^1_k1 a^1_k, a^2_1 = g(z^2). Output layer: z^3 = Σ_k θ^2_k1 a^2_k, a^3_1 = g(z^3) = ŷ1. Weights: θ^1_11, θ^1_21, θ^2_11.]

17

Chain of reactions in the forward pass, focussing on the output layer

• varying a^2 causes a change in z^3

• varying z^3 causes a change in a^3_1 = g(z^3)

• varying a^3_1 = ŷ causes a change in E(y, ŷ)

We can use the chain rule to capture the behavior of θ^2_11 wrt E

∆θ^2 = −η ∂E/∂θ^2
     = −η (∂E/∂a^3_1) (∂a^3_1/∂z^3) (∂z^3/∂θ^2)
     = η (y − a^3_1) g′(z^3) a^2_1
     = η δ^3_1 a^2_1,    where δ^3_1 = (y − a^3_1) g′(z^3)

Let's look at each term individually

∂E/∂a_i = −(y_i − a_i)    (recall that E = Σ_i ½ (y_i − a_i)²)

∂a/∂z = ∂g(z)/∂z = g′(z)

∂z/∂θ_ij = ∂/∂θ_ij Σ_{i′} θ_{i′j} a_{i′} = a_i
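These partial derivatives can be checked numerically: the chain-rule gradient for the single output weight θ^2 should agree with a central finite-difference approximation of ∂E/∂θ^2. All scalar values below are assumed:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

theta2, a2, y = 0.7, 0.6, 1.0    # illustrative scalars

def E_of(t):
    """Error as a function of the single output weight theta^2."""
    y_hat = g(t * a2)            # a3 = g(z3), z3 = theta2 * a2
    return 0.5 * (y - y_hat) ** 2

z3 = theta2 * a2
a3 = g(z3)
# Chain rule: dE/dtheta2 = (dE/da3)(da3/dz3)(dz3/dtheta2) = -(y - a3) g'(z3) a2
analytic = -(y - a3) * (a3 * (1 - a3)) * a2

# Central finite-difference approximation of the same derivative
eps = 1e-6
numeric = (E_of(theta2 + eps) - E_of(theta2 - eps)) / (2 * eps)
```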

We have another chain reaction. Let's consider layer 2

• varying any θ^1_k1 causes a change in z^2

• varying z^2 causes a change in a^2_1 = g(z^2)

• varying a^2_1 causes a change in z^3 (we consider θ^2 fixed for the moment)

• varying z^3 causes a change in a^3_1 = g(z^3)

• varying a^3_1 = ŷ causes a change in E(y, ŷ)

Formulating this again as the chain rule

∆θ^1_k1 = −η ∂E/∂θ^1_k1
        = −η ((∂E/∂a^3_1) (∂a^3_1/∂z^3) (∂z^3/∂a^2_1)) (∂a^2_1/∂z^2) (∂z^2/∂θ^1_k1)

We already know that

∂E/∂a^3_1 = −(y − a^3_1)

∂a^3_1/∂z^3 = g′(z^3)

And following the previous logic, we can calculate that

∂z^3/∂a^2_1 = ∂(θ^2 a^2_1)/∂a^2_1 = θ^2

∂a^2_1/∂z^2 = ∂g(z^2)/∂z^2 = g′(z^2)

∂z^2/∂θ^1_k1 = a_k

Plugging these into the above we get

−η ∂E/∂θ^1_k1 = −η (−(y − a^3_1) g′(z^3) θ^2) g′(z^2) a_k
             = η ((y − a^3_1) g′(z^3)) θ^2 g′(z^2) a_k,    where (y − a^3_1) g′(z^3) = δ^3_1
             = η (δ^3_1 θ^2) g′(z^2) a_k,                  where (δ^3_1 θ^2) g′(z^2) = δ^2_1
             = η δ^2_1 a_k

17

If we had more than one weight θ^2

−η ∂E/∂θ^1_k1 = η (Σ_j (y_j − a^3_j) g′(z^3_j) θ^2_1j) g′(z^2) a_k
             = η (Σ_j δ^3_j θ^2_1j) g′(z^2) a_k,    where δ^3_j = (y_j − a^3_j) g′(z^3_j)
             = η δ^2_1 a_k,                         where δ^2_1 = (Σ_j δ^3_j θ^2_1j) g′(z^2)

18
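As a sanity check, this multi-output version of δ^2_1 can be compared against finite differences on a tiny 2-1-2 network slice (all numbers below are illustrative):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two inputs, one hidden unit, two output units j; placeholder values.
a1 = np.array([0.2, 0.9])        # input activations a1_k
th1 = np.array([0.3, -0.4])      # weights theta^1_k1 into the hidden unit
th2 = np.array([0.5, -0.2])      # weights theta^2_1j to the two outputs
y = np.array([1.0, 0.0])         # targets y_j

def error(th1_var):
    z2 = np.dot(th1_var, a1)
    a2 = g(z2)
    a3 = g(th2 * a2)             # z3_j = theta^2_1j * a2_1
    return 0.5 * np.sum((y - a3) ** 2)

# Backpropagated gradient via delta^2_1 = (sum_j delta^3_j theta^2_1j) g'(z2)
z2 = np.dot(th1, a1)
a2 = g(z2)
a3 = g(th2 * a2)
delta3 = (y - a3) * a3 * (1 - a3)
delta2 = np.dot(delta3, th2) * a2 * (1 - a2)
grad = -delta2 * a1              # dE/dtheta^1_k1 = -delta^2_1 * a1_k

# Central finite differences for both components of th1
eps = 1e-6
numeric = np.array([
    (error(th1 + eps * e) - error(th1 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
```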
