Today: Outline
• Neural networks cont’d: learning via gradient descent; chain rule review; gradient computation using the backpropagation algorithm
Neural Networks II
Learning
Artificial Neural Network
[Figure: a layered network that takes an input and produces "dog" vs. "cat" outputs]
• Artificial neural networks consist of many interconnected neurons organized in layers
• Neurons: each neuron receives inputs from neurons in the previous layer and passes its output to the next layer
• Activation: a neuron’s output, ranging from 0 (not excited) to 1 (excited)
Artificial Neuron: Activation
[Figure: a single neuron. The input values are multiplied by weights, summed, and passed through a squashing function; the neuron activates on certain input patterns.]
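A minimal sketch of that pipeline in code, with made-up inputs and weights (none of the numbers come from the slide): multiply the inputs by the weights, sum, then squash with a sigmoid so the activation lies between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    # squashing function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -2.0, 0.5])      # inputs from the previous layer (illustrative)
w = np.array([0.4, 0.3, -0.8])      # one weight per input (illustrative)

activation = sigmoid(np.dot(w, x))  # multiply by weights, sum, squash
print(activation)                   # a value between 0 (not excited) and 1 (excited)
```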
Artificial Neural Network: notation
input 𝑥 = (𝑥₁, …, 𝑥₅)
Input Layer → Hidden Layer → Output Layer
[Figure: input units 𝑥₁ … 𝑥₅ fully connected to hidden units h₁, h₂, h₃, which feed the output layer]
hidden layer activations: h₁, h₂, h₃
Artificial Neural Network: notation
input 𝑥 = (𝑥₁, …, 𝑥₅)
Input Layer → Hidden Layer → Output Layer

hidden layer activations:
h = 𝑔(Θ⁽¹⁾𝑥),   where 𝑔(𝑧) = 1 / (1 + exp(−𝑧))
[Plot: the sigmoid 𝑔(𝑧), increasing from 0 to 1 and crossing 0.5 at 𝑧 = 0]

output:
hΘ(𝑥) = 𝑔(Θ⁽²⁾h)

weights:
Θ⁽¹⁾ = [ θ₁₁ ⋯ θ₁₅ ; ⋮ ⋱ ⋮ ; θ₃₁ ⋯ θ₃₅ ],   Θ⁽²⁾ = [ θ₁₁ ⋯ θ₁₃ ; ⋮ ⋱ ⋮ ; θ₃₁ ⋯ θ₃₃ ]
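A short sketch of this forward computation with the shapes above (5 inputs, 3 hidden units); the random weights are placeholders, not values from the lecture.

```python
import numpy as np

def g(z):
    # sigmoid: g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

x      = np.random.randn(5)      # input vector x = (x1, ..., x5)
Theta1 = np.random.randn(3, 5)   # hidden-layer weights Theta^(1)
Theta2 = np.random.randn(3, 3)   # output-layer weights Theta^(2)

h       = g(Theta1 @ x)          # hidden layer activations h = g(Theta^(1) x)
h_Theta = g(Theta2 @ h)          # network output h_Theta(x) = g(Theta^(2) h)
print(h, h_Theta)
```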
Cost function
• Neural network cost J(Θ) = training error term + regularization term
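For reference, a standard choice for such a cost is cross-entropy training error plus L2 weight regularization; treat this exact form as an assumption rather than the slide's own equation:

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k}
      \Bigl[ y_k^{(i)} \log h_\Theta(x^{(i)})_k
           + \bigl(1 - y_k^{(i)}\bigr) \log\bigl(1 - h_\Theta(x^{(i)})_k\bigr) \Bigr]
      \;+\; \frac{\lambda}{2m} \sum_{l} \sum_{i,j} \bigl(\theta_{ij}^{(l)}\bigr)^{2}
```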
Gradient computation
Need code to compute:
– the cost J(Θ)
– the partial derivatives ∂J(Θ)/∂θ
Use the “Backpropagation algorithm”
– An efficient way to compute the gradient
– Computes the gradient incrementally by “propagating” backwards through the network
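Whatever gradient code we end up writing can be sanity-checked against a finite-difference approximation; a generic sketch (the quadratic J is just a stand-in cost):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    # finite-difference approximation of dJ/dtheta, useful for checking backprop
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

theta = np.array([1.0, -2.0, 3.0])
J = lambda t: np.sum(t ** 2)             # stand-in cost with known gradient 2*theta
print(numerical_gradient(J, theta))      # approx. [2, -4, 6]
```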
Neural Networks II
Backpropagation
Chain Rule
• Need to compute the gradient of
  log hΘ(x) = log( 𝑔(Θ⁽²⁾ 𝑔(Θ⁽¹⁾ 𝑥)) )   with respect to θ
• How can we compute the gradient of several chained functions?
  f(θ) = f₁(f₂(θ))   ⇒   f′(θ) = f₁′(f₂(θ)) ∗ f₂′(θ)
  f(θ) = f₁(f₂(f₃(θ)))   ⇒   f′(θ) = ?
• What about functions of multiple variables?
  f(θ₁, θ₂) = f₁(f₂(θ₁, θ₂))
  ∂f/∂θ₁ = ?    ∂f/∂θ₂ = ?
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/multivariable-chain-rule/v/multivariable-chain-rule
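For the two-variable case above, the multivariable chain rule gives (assuming f₂ is scalar-valued; if f₂ has several components, the corresponding terms are summed):

```latex
\frac{\partial f}{\partial \theta_1}
  = f_1'\bigl(f_2(\theta_1, \theta_2)\bigr)\,\frac{\partial f_2}{\partial \theta_1},
\qquad
\frac{\partial f}{\partial \theta_2}
  = f_1'\bigl(f_2(\theta_1, \theta_2)\bigr)\,\frac{\partial f_2}{\partial \theta_2}
```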
Backpropagation: Efficient Chain Rule
• Partial gradient computation via chain rule:
∂f/∂θ₁ = ∂f₁/∂f₂( f₂(f₃(θ)) ) ∗ ∂f₂/∂f₃( f₃(θ) ) ∗ ∂f₃/∂θ₁( θ )
∂f/∂θ₂ = ∂f₁/∂f₂( f₂(f₃(θ)) ) ∗ ∂f₂/∂f₃( f₃(θ) ) ∗ ∂f₃/∂θ₂( θ )
∂f/∂θ₃ = ∂f₁/∂f₂( f₂(f₃(θ)) ) ∗ ∂f₂/∂f₃( f₃(θ) ) ∗ ∂f₃/∂θ₃( θ )
• Computed naively, each partial derivative re-evaluates the same inner functions many times
• Very inefficient, e.g. with 100,000-dimensional parameters
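A small sketch of why this is wasteful and what backpropagation does instead: the three partials above share the factors ∂f₁/∂f₂ and ∂f₂/∂f₃, so they can be computed once and reused. The functions below are made up purely for illustration.

```python
import numpy as np

f3 = lambda th: th ** 2       # elementwise square
f2 = lambda u: np.sum(u)      # sum to a scalar
f1 = lambda v: np.tanh(v)     # squash; overall f(theta) = tanh(sum(theta^2))

theta = np.array([0.5, -1.0, 2.0])

def naive_gradient(theta):
    # re-evaluates the shared factors once per parameter
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        v = f2(f3(theta))                     # recomputed every iteration
        grad[i] = (1 - np.tanh(v) ** 2) * 1.0 * (2 * theta[i])
    return grad

def backprop_gradient(theta):
    # forward pass caches intermediates; the shared "upstream" factor is computed once
    u = f3(theta)
    v = f2(u)
    upstream = (1 - np.tanh(v) ** 2) * 1.0    # (df1/df2) * (df2/df3)
    return upstream * 2 * theta               # times the local df3/dtheta

print(naive_gradient(theta))
print(backprop_gradient(theta))               # same result, no repeated work
```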
Chain Rule with a Computational Graph
e.g. x = -2, y = 5, z = -4
Want: the gradients of the output with respect to x, y, and z
[Figure (from Fei-Fei Li, Andrej Karpathy & Justin Johnson, CS231n Lecture 4): a small computational graph over x, y, z, evaluated at the values above]
Computation Graph: Forward
e.g. x = -2, y = 5, z = -4
Forward pass: compute the value at each node, from the inputs to the output
Want: the gradients of the output with respect to x, y, and z
[Figure: the same graph with the computed node values filled in]
Computation Graph: Backward
e.g. x = -2, y = 5, z = -4
Backward pass: compute the gradients, starting at the output and applying the chain rule at each node back toward the inputs
Want: the gradients of the output with respect to x, y, and z
[Figure, built up over several slides: the graph annotated with the gradient flowing backward through each node]
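The slides work this example on a concrete graph taken from the CS231n Lecture 4 figures, which compute f(x, y, z) = (x + y)·z. Assuming that function, the forward and backward passes look like this:

```python
# forward pass: evaluate every node of the graph
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node q = x + y   ->   3
f = q * z            # output node      f = q * z    ->  -12

# backward pass: chain rule, node by node, from the output back to the inputs
df_df = 1.0          # gradient of f with respect to itself
df_dq = z * df_df    # f = q*z  =>  df/dq = z         ->  -4
df_dz = q * df_df    # f = q*z  =>  df/dz = q         ->   3
df_dx = 1.0 * df_dq  # q = x+y  =>  dq/dx = 1         ->  -4
df_dy = 1.0 * df_dq  # q = x+y  =>  dq/dy = 1         ->  -4
print(df_dx, df_dy, df_dz)
```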
[Figure, built up over several slides (Fei-Fei Li, Karpathy & Johnson, CS231n Lecture 4): a single gate f inside a larger graph, with its input activations, its “local gradient”, and the gradients flowing backward. In the forward pass the gate receives activations and computes its output; it can also compute the local gradient of its output with respect to each input. In the backward pass the gate receives the gradient flowing in from above and multiplies it by each local gradient to produce the gradients for its inputs.]
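This pattern translates almost directly into code: a gate caches what it needs during the forward pass and, during the backward pass, multiplies the incoming gradient by its local gradients. A generic sketch (a multiply gate is chosen as the example; none of this is lecture code):

```python
class MultiplyGate:
    # one gate in a computational graph: out = a * b

    def forward(self, a, b):
        # cache the input activations; for a multiply gate they are the local gradients
        self.a, self.b = a, b
        return a * b

    def backward(self, grad_out):
        # [local gradient] x [gradient flowing in from above]
        grad_a = self.b * grad_out
        grad_b = self.a * grad_out
        return grad_a, grad_b

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)   # forward: -12.0
print(gate.backward(1.0))       # backward: (-4.0, 3.0)
```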
Another example:
[Figure (CS231n Lecture 4): backpropagation through a larger circuit, worked gate by gate over several slides]
• at each gate: [local gradient] x [its gradient], e.g. (-1) * (-0.20) = 0.20
• an add gate has local gradient 1 for each input, so it hands the same gradient to both: [1] x [0.2] = 0.2 (both inputs!)
• a multiply gate’s local gradient for one input is the value of the other input: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
sigmoid function
σ(x) = 1 / (1 + e^(−x)), with derivative dσ/dx = (1 − σ(x)) σ(x)
sigmoid gate
Because of that derivative, the entire sigmoid gate can be backpropagated in a single step:
(0.73) * (1 – 0.73) = 0.2
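A quick numeric check of that shortcut; the gate input of 1.0 is an assumption chosen so the output matches the 0.73 on the slide:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.0                      # assumed input to the sigmoid gate
s = sigmoid(x)               # forward pass: about 0.73
local_grad = s * (1.0 - s)   # backward shortcut: (0.73) * (1 - 0.73), about 0.2
print(round(s, 2), round(local_grad, 2))
```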
Patterns in backward flow
• add gate: gradient distributor
• max gate: gradient router
• mul gate: gradient… “switcher”?
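A sketch of the three patterns with made-up numbers (the upstream gradient is whatever arrives from above):

```python
upstream = 2.0          # gradient flowing in from above
a, b = 3.0, -1.0        # the gate's two inputs

# add gate (out = a + b): distributes the same gradient to every input
add_grad_a, add_grad_b = upstream, upstream

# max gate (out = max(a, b)): routes the gradient to the winning input only
max_grad_a = upstream if a >= b else 0.0   # 2.0 (a was the max)
max_grad_b = upstream if b > a else 0.0    # 0.0

# mul gate (out = a * b): "switches" the inputs -- each is scaled by the other
mul_grad_a = b * upstream                  # -2.0
mul_grad_b = a * upstream                  #  6.0

print(add_grad_a, max_grad_a, mul_grad_a)
```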
Gradients add at branches
[Figure: a variable whose output feeds two branches (+); in the backward pass the gradients arriving from the branches are added together at that variable]
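Stated as a formula: if a variable x feeds several nodes qᵢ of the graph, the gradients from those branches add (the multivariable chain rule again):

```latex
\frac{\partial f}{\partial x} \;=\; \sum_i \frac{\partial f}{\partial q_i}\,\frac{\partial q_i}{\partial x}
```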
Neural Networks II
Vectorized Backpropagation
Forward Pass
[Figure: a three-layer network followed by a loss, with weights and biases W⁽¹⁾, b⁽¹⁾, W⁽²⁾, b⁽²⁾, … . Reading the diagram in the standard way: h⁽⁰⁾ = x, each layer computes a pre-activation a⁽ˡ⁾ = W⁽ˡ⁾ h⁽ˡ⁻¹⁾ + b⁽ˡ⁾ and an activation h⁽ˡ⁾ = f(a⁽ˡ⁾), the final activation is the prediction ŷ, and the loss L(y, ŷ) scores it.]
Layer 1 → Layer 2 → Layer 3 → Loss
Backward Pass
[Figure: the same network traversed in reverse. Starting from the loss L(y, ŷ), the gradient flows back through Layer 3, Layer 2, and Layer 1: each layer takes the gradient with respect to its output h⁽ˡ⁾, uses it to form the gradients with respect to a⁽ˡ⁾, W⁽ˡ⁾, and b⁽ˡ⁾, and passes the gradient with respect to h⁽ˡ⁻¹⁾ on to the previous layer.]
Layer 1 ← Layer 2 ← Layer 3 ← Loss
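A compact sketch of a vectorized forward and backward pass for a two-layer sigmoid network with a squared-error loss; all sizes and the choice of loss are assumptions for illustration, not taken from the slides:

```python
import numpy as np

def f(a):
    # activation function (sigmoid)
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x, y   = rng.normal(size=(5, 1)), np.array([[1.0]])   # one example, 5 features
W1, b1 = rng.normal(size=(3, 5)), np.zeros((3, 1))    # layer 1 weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))    # layer 2 weights and biases

# forward pass: cache every intermediate value for the backward pass
h0 = x
a1 = W1 @ h0 + b1
h1 = f(a1)
a2 = W2 @ h1 + b2
y_hat = f(a2)
loss = 0.5 * np.sum((y_hat - y) ** 2)        # L(y, y_hat)

# backward pass: chain rule, layer by layer, reusing the cached values
dL_dyhat = y_hat - y
dL_da2 = dL_dyhat * y_hat * (1 - y_hat)      # through the sigmoid: f'(a) = f(a)(1 - f(a))
dL_dW2 = dL_da2 @ h1.T
dL_db2 = dL_da2
dL_dh1 = W2.T @ dL_da2                       # gradient handed back to layer 1
dL_da1 = dL_dh1 * h1 * (1 - h1)
dL_dW1 = dL_da1 @ h0.T
dL_db1 = dL_da1
print(loss, dL_dW1.shape, dL_dW2.shape)
```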