
Announcements
• Reminder: self-grading forms for ps1 and ps2 due 10/5 at midnight (Boston)
• ps3 out today, due 10/8 (1 week)
• Midterm practice questions out next week

Today: Outline
• Neural networks cont’d: learning via gradient descent; chain rule review; gradient computation using the backpropagation algorithm
Machine Learning 2017, Kate Saenko

Neural Networks II
Learning

Artificial Neural Network
[Figure: a layered network mapping an input image to the output classes “dog” / “cat”.]
• Artificial neural networks consist of many interconnected neurons organized in layers
• Neurons: each neuron receives inputs from neurons in the previous layer and passes its output to the next layer
• Activation: a neuron’s output, between 0 (not excited) and 1 (excited)

Artificial Neuron: Activation
[Figure: a neuron takes its inputs, multiplies each by a weight, sums the products, and squashes the sum. The neuron activates only on certain input patterns: one example pattern gives a weighted sum of −8 and output 0; another gives +8 and output 1.]

Artificial Neural Network: notation
[Figure: Input Layer $x = (x_1, \dots, x_5)^\top$ → Hidden Layer activations $h^{(1)}_1, h^{(1)}_2, h^{(1)}_3$ → Output Layer $h$.]

Artificial Neural Network: notation
[Figure: Input Layer → Hidden Layer → Output Layer, as on the previous slide.]

input: $x = (x_1, \dots, x_5)^\top$

hidden layer activations: $h^{(1)} = g(\Theta^{(1)} x)$, where $g(z) = \dfrac{1}{1 + \exp(-z)}$ is the sigmoid, applied elementwise
[Plot: sigmoid curve rising from 0 through 0.5 to 1.]

output: $h_\Theta(x) = g(\Theta^{(2)} h^{(1)})$

weights:
$\Theta^{(1)} = \begin{pmatrix} \theta_{11} & \cdots & \theta_{15} \\ \vdots & \ddots & \vdots \\ \theta_{31} & \cdots & \theta_{35} \end{pmatrix}, \qquad \Theta^{(2)} = \begin{pmatrix} \theta_{11} & \cdots & \theta_{13} \\ \vdots & \ddots & \vdots \\ \theta_{31} & \cdots & \theta_{33} \end{pmatrix}$

Machine Learning 2017, Kate Saenko
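To make the notation concrete, here is a minimal NumPy sketch of this two-layer forward computation. The layer sizes 5 → 3 → 3 match the matrices above; the random weights and variable names are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z)), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

# illustrative weights: Theta1 is 3x5, Theta2 is 3x3, as on the slide
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 5))
Theta2 = rng.normal(size=(3, 3))

x = rng.normal(size=5)      # input x = (x1, ..., x5)
h1 = sigmoid(Theta1 @ x)    # hidden activations h^(1) = g(Theta^(1) x)
out = sigmoid(Theta2 @ h1)  # output h_Theta(x) = g(Theta^(2) h^(1))
print(out)                  # three activations in (0, 1)
```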

Cost function
Neural network: the cost is a training-error term plus a regularization term.
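The formula itself did not survive extraction; for a network with $K$ sigmoid outputs and $m$ training examples, the standard form (cross-entropy training error plus L2 regularization, matching this course’s logistic-regression notation) is:

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}
  \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
      + \big(1 - y_k^{(i)}\big)\log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big]
  \;+\; \frac{\lambda}{2m}\sum_{l}\sum_{i}\sum_{j}\big(\Theta_{ij}^{(l)}\big)^2
```

The first term is the training error; the second is the regularization.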

Gradient computation
Need code to compute:
– the cost $J(\Theta)$
– the partial derivatives $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
Use the “backpropagation algorithm”:
– an efficient way to compute these derivatives
– computes the gradient incrementally by “propagating” backwards through the network

Neural Networks II
Backpropagation

Chain Rule
• Need to compute the gradient of
  $\log(h_\Theta(x)) = \log\big(g(\Theta^{(2)}\, g(\Theta^{(1)} x))\big)$ w.r.t. $\Theta$
• How can we compute the gradient of several chained functions?
  $f(\theta) = f_1(f_2(\theta)) \;\Rightarrow\; f'(\theta) = f_1'\big(f_2(\theta)\big) \cdot f_2'(\theta)$
  $f(\theta) = f_1(f_2(f_3(\theta))) \;\Rightarrow\; f'(\theta) = \;?$
• What about functions of multiple variables?
  $f(\theta_1, \theta_2) = f_1(f_2(\theta_1, \theta_2)) \;\Rightarrow\; \dfrac{\partial f}{\partial \theta_1} = \;?, \quad \dfrac{\partial f}{\partial \theta_2} = \;?$
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/multivariable-chain-rule/v/multivariable-chain-rule

Backpropagation: Efficient Chain Rule
• Partial gradient computation via the chain rule, done naively for $f(\theta) = f_1(f_2(f_3(\theta)))$ with $\theta = (\theta_1, \theta_2, \theta_3)$:
  $\dfrac{\partial f}{\partial \theta_1} = \dfrac{\partial f_1}{\partial f_2}\big(f_2(f_3(\theta))\big) \cdot \dfrac{\partial f_2}{\partial f_3}\big(f_3(\theta)\big) \cdot \dfrac{\partial f_3}{\partial \theta_1}(\theta)$
  $\dfrac{\partial f}{\partial \theta_2} = \dfrac{\partial f_1}{\partial f_2}\big(f_2(f_3(\theta))\big) \cdot \dfrac{\partial f_2}{\partial f_3}\big(f_3(\theta)\big) \cdot \dfrac{\partial f_3}{\partial \theta_2}(\theta)$
  $\dfrac{\partial f}{\partial \theta_3} = \dfrac{\partial f_1}{\partial f_2}\big(f_2(f_3(\theta))\big) \cdot \dfrac{\partial f_2}{\partial f_3}\big(f_3(\theta)\big) \cdot \dfrac{\partial f_3}{\partial \theta_3}(\theta)$
• Computing each partial separately means re-evaluating the inner functions many times
• Very inefficient! E.g. with 100,000-dimensional parameters
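The fix is to evaluate the shared factors once and reuse them. A small Python sketch of the three-parameter example above (the concrete choices of $f_1, f_2, f_3$ are placeholders of my own):

```python
import math

# Placeholder chained functions: f(theta) = f1(f2(f3(theta)))
f3 = lambda t: t[0] + 2.0 * t[1] - t[2]   # f3: R^3 -> R
f2 = math.tanh
f1 = lambda u: u * u

theta = (0.5, -1.0, 2.0)

# Forward pass: evaluate the inner functions once and keep the values
v3 = f3(theta)
v2 = f2(v3)

# Backward pass: the common prefix f1'(v2) * f2'(v3) is also computed once
common = (2.0 * v2) * (1.0 - math.tanh(v3) ** 2)
grad = (common * 1.0, common * 2.0, common * -1.0)  # times df3/dtheta_i
print(grad)
```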

Chain Rule with a Computational Graph
e.g. $f(x, y, z) = (x + y)\,z$, with x = −2, y = 5, z = −4
[Figure: computational graph with an add node computing $q = x + y$ feeding a multiply node computing $f = qz$.]
(Computational-graph slides adapted from Fei-Fei Li, Andrej Karpathy & Justin Johnson, CS231n Lecture 4, 13 Jan 2016.)


Computation Graph: Forward
e.g. $f(x, y, z) = (x + y)\,z$, with x = −2, y = 5, z = −4
Compute values forward through the graph: $q = x + y = 3$, then $f = qz = 3 \cdot (-4) = -12$
Want: $\dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y}, \dfrac{\partial f}{\partial z}$

Computation Graph: Backward
e.g. x = −2, y = 5, z = −4
Compute gradients backward through the graph, starting from the output:
$\dfrac{\partial f}{\partial f} = 1$
$\dfrac{\partial f}{\partial z} = q = 3, \qquad \dfrac{\partial f}{\partial q} = z = -4$
Chain rule: $\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial x} = -4 \cdot 1 = -4, \qquad \dfrac{\partial f}{\partial y} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial y} = -4 \cdot 1 = -4$

[Figure: a single gate $f$ embedded in a larger circuit, with inputs $x, y$ and output $z$.]
• activations: on the forward pass the gate receives inputs $x, y$ and computes $z = f(x, y)$
• “local gradient”: the gate can compute $\dfrac{\partial z}{\partial x}$ and $\dfrac{\partial z}{\partial y}$ from its own inputs alone
• gradients: on the backward pass the gate receives the upstream gradient $\dfrac{\partial L}{\partial z}$ and sends back $\dfrac{\partial L}{\partial x} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial x}$ and $\dfrac{\partial L}{\partial y} = \dfrac{\partial L}{\partial z}\dfrac{\partial z}{\partial y}$
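This local-gradient picture maps directly onto code. An illustrative multiply-gate object (the class and method names are my own, not from the slides):

```python
class MultiplyGate:
    """z = x * y, with gradients computed from cached forward inputs."""
    def forward(self, x, y):
        self.x, self.y = x, y    # cache activations for the backward pass
        return x * y

    def backward(self, dz):
        # [local gradient] x [upstream gradient]
        dx = self.y * dz         # dz/dx = y
        dy = self.x * dz         # dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # q = 3, z = -4 from the earlier example
dx, dy = gate.backward(1.0)      # upstream gradient 1 at the output
print(z, dx, dy)                 # -12.0 -4.0 3.0
```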

Another example:
$f(w, x) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
[Figure: computational graph built from multiply, add, ×(−1), exp, +1 and 1/x gates, evaluated at w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3; the forward value at the output is 0.73.]
Backward pass, gate by gate, multiplying each local gradient by the gradient flowing in from above:
• ×(−1) gate: (−1) × (−0.20) = 0.20
• add gates: [local gradient] × [its gradient] = [1] × [0.2] = 0.2 (the same gradient flows to both inputs!)
• multiply gates: x0: [2] × [0.2] = 0.4;  w0: [−1] × [0.2] = −0.2

sigmoid function
$\sigma(x) = \dfrac{1}{1 + e^{-x}}, \qquad \dfrac{d\sigma(x)}{dx} = \big(1 - \sigma(x)\big)\,\sigma(x)$
sigmoid gate: the four gates ×(−1), exp, +1, 1/x together form a single sigmoid gate whose local gradient has this simple closed form: (0.73) × (1 − 0.73) = 0.2

Patterns in backward flow
• add gate: gradient distributor (passes the upstream gradient to all inputs unchanged)
• max gate: gradient router (routes the full gradient to the input that was the max, zero to the rest)
• mul gate: gradient… “switcher”? (each input’s gradient is the upstream gradient scaled by the other input)
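A quick numeric check of the three patterns, with a hypothetical upstream gradient of 2.0 at each gate’s output:

```python
g = 2.0  # upstream gradient arriving at each gate's output

# add gate (z = x + y): distributor -- both inputs get g unchanged
dx_add, dy_add = g * 1.0, g * 1.0                  # (2.0, 2.0)

# max gate (z = max(x, y)), say x = 4, y = 1: router -- winner takes all
x, y = 4.0, 1.0
dx_max, dy_max = (g, 0.0) if x > y else (0.0, g)   # (2.0, 0.0)

# mul gate (z = x * y), say x = 3, y = -5: each input scaled by the other
x, y = 3.0, -5.0
dx_mul, dy_mul = y * g, x * g                      # (-10.0, 6.0)
```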

Gradients add at branches
• When a forward value feeds into more than one gate, the gradients flowing back along the branches are summed at the branch point: $\dfrac{\partial L}{\partial x} = \sum_{i} \dfrac{\partial L}{\partial z_i}\dfrac{\partial z_i}{\partial x}$
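For instance, $f = x \cdot x$ can be viewed as a mul gate with both inputs wired to the same branch point $x$; the two branch gradients add:

```python
# f(x) = x * x, a mul gate whose two inputs both branch from x
x = 3.0
g = 1.0                   # upstream gradient at f

# each mul-gate input gets [other input] * g; the branches then add at x
dx = (x * g) + (x * g)    # 3.0 + 3.0 = 6.0, matching d(x^2)/dx = 2x
```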

Neural Networks II
Vectorized Backpropagation

Forward Pass
[Figure: a network with weights $W^{(1)}, b^{(1)}$ and $W^{(2)}, b^{(2)}$. Layer 1: input $h^{(0)} = x$ gives pre-activations $a^{(1)} = W^{(1)} h^{(0)} + b^{(1)}$ and activations $h^{(1)} = f(a^{(1)})$. Layer 2: $a^{(2)} = W^{(2)} h^{(1)} + b^{(2)}$ and output $\hat{y} = h^{(2)} = f(a^{(2)})$. Layer 3: the loss $L(y, \hat{y})$.]
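A vectorized forward pass matching this figure, as a NumPy sketch; the layer sizes, the choice of sigmoid for $f$, and the squared-error loss are illustrative assumptions:

```python
import numpy as np

def f(a):                            # elementwise nonlinearity (sigmoid here)
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)

h0 = rng.normal(size=3)              # input h^(0) = x
a1 = W1 @ h0 + b1                    # layer 1 pre-activation
h1 = f(a1)                           # layer 1 activation
a2 = W2 @ h1 + b2                    # layer 2 pre-activation
y_hat = f(a2)                        # network output
y = np.array([1.0, 0.0, 0.0])        # a target, for the loss
L = 0.5 * np.sum((y_hat - y) ** 2)   # e.g. squared-error loss L(y, y_hat)
```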

Backward Pass
[Figure: the same network, with gradients flowing right to left through Layers 3, 2, 1. Starting from $\frac{\partial L}{\partial \hat{y}}$, the backward pass computes $\frac{\partial L}{\partial a^{(2)}}$, then $\frac{\partial L}{\partial W^{(2)}}, \frac{\partial L}{\partial b^{(2)}}$ and $\frac{\partial L}{\partial h^{(1)}}$, then $\frac{\partial L}{\partial a^{(1)}}$, and finally $\frac{\partial L}{\partial W^{(1)}}, \frac{\partial L}{\partial b^{(1)}}$.]
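Continuing the sketch above, the corresponding backward pass (same assumed sigmoid and squared-error loss):

```python
# Backward pass for the forward sketch above
dy_hat = y_hat - y                  # dL/dy_hat for L = 0.5 * ||y_hat - y||^2
da2 = dy_hat * y_hat * (1 - y_hat)  # through sigmoid: dL/da2
dW2 = np.outer(da2, h1)             # dL/dW2
db2 = da2                           # dL/db2
dh1 = W2.T @ da2                    # dL/dh1
da1 = dh1 * h1 * (1 - h1)           # through sigmoid: dL/da1
dW1 = np.outer(da1, h0)             # dL/dW1
db1 = da1                           # dL/db1
```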

Backpropagation example with vectorized gradients:
https://web.stanford.edu/class/cs224n/readings/gradient-notes.pdf

Next Class
Neural Networks III: Convolutional networks
Reading: Bishop Ch. 5.5