
Neural Networks IV
Recurrent Networks

Today: Outline
• Convolutional networks: finish last lecture…
• Recurrent networks: forward pass, backward pass
• NN training strategies: loss functions, dropout, etc.
Deep Learning 2017, Brian Kulis & Kate Saenko

Network architectures
[Figure: three architecture families: feed-forward / fully connected (Layer 1 → Layer 2 → Layer 3 → Layer 4), convolutional, and recurrent (unrolled over time); each maps input → hidden → output.]

Neural Networks IV
Recurrent Architectures

Recurrent Networks for Sequences of Data
• Sequences in our world:
– Audio
– Text
– Video
– Weather
– Stock market
• Sequential data is why we build RNN architectures.
• RNNs are tools for making predictions about sequences.
slide by Sarah Bargal 5

Limitations of Feed-Fwd Networks
• Limitations of feed-forward networks:
– Fixed length: inputs and outputs are of fixed lengths
– Independence: data points (e.g., images) are treated as independent of one another
Nervana, Tutorial, https://www.youtube.com/watch?v=Ukgii7Yd_cU
slide by Sarah Bargal 6

Advantages of RNN Models
• What feed-forward networks cannot do:
– Variable length
“We would like to accommodate temporal sequences of various lengths.”
– Temporal dependence
“Predicting where a pedestrian will be at the next point in time depends on where he/she was at previous time steps.”
Nervana, Tutorial, https://www.youtube.com/watch?v=Ukgii7Yd_cU
slide by Sarah Bargal 7

Vanilla Neural Network (NN)
x_t : input/event
h_t : output/prediction
A : a chunk of neural network
Every input is treated independently.
Chris Olah, Blog, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
slide by Sarah Bargal 8

Recurrent Neural Network (RNN)
• RNN
The loop allows information to be passed from one time step to the next.
Now we are modeling the dynamics.
Chris Olah, Blog, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
slide by Sarah Bargal 9

 Applications  RNNs  RNN BP  LSTMs
Recurrent Neural Network (RNN)
• A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
Chris Olah, Blog, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
slide by Sarah Bargal 10

RNN Architectures
[Figure: vanilla NN (one-to-one) vs. recurrent NNs (one-to-many, many-to-one, many-to-many input/output configurations).]
Andrej Karpathy, Blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
slide by Sarah Bargal 12

Recurrent Neural Network
The state consists of a single “hidden” vector h, updated at every time step from the input x_t using parameters W:
h_t = tanh(W_hh h_{t-1} + W_xh x_t)   (tanh is the activation function, applied elementwise)
y_t = W_hy h_t
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

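A minimal numpy sketch of this recurrence (the sizes and the random initialization are illustrative, not from the slides):

import numpy as np

# One step of the vanilla RNN above: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t.
hidden_size, input_size, output_size = 8, 4, 4           # illustrative sizes
W_xh = np.random.randn(hidden_size, input_size) * 0.01   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
W_hy = np.random.randn(output_size, hidden_size) * 0.01  # hidden -> output

def rnn_step(x, h_prev):
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h
    return h, y

# Unroll over a short sequence of one-hot inputs, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x in np.eye(input_size):
    h, y = rnn_step(x, h)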

Neural Networks IV
Example: Character RNN

Character-level language model example
Vocabulary: [h,e,l,o]
Example training sequence: “hello”
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

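The setup in this example can be made concrete in a few lines of Python (the one-hot encoding is the standard choice here; the code itself is a sketch, not from the lecture):

import numpy as np

# Character-level setup: vocabulary [h, e, l, o], training sequence "hello".
# Each input character is used to predict the next character.
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

seq = "hello"
inputs  = [one_hot(ch) for ch in seq[:-1]]    # h, e, l, l
targets = [char_to_ix[ch] for ch in seq[1:]]  # e, l, l, o (indices of the next characters)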

min-char-rnn.py gist: 112 lines of Python
(https://gist.github.com/karpathy/d4dee566867f8291f086)
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Training text
[Figure: the RNN is trained on a large text corpus, reading input characters x and predicting the next character y.]
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016


[Figure: text sampled from the character RNN during training; at first the samples are close to random, and they improve with each round of “train more”.]
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016


Train on C code
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Generated C code
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Neural Networks IV
Learning in RNNs

Forward pass
• Forward pass through time:
h_t = W φ(h_{t-1}) + W_x x_t
y_t = W_y φ(h_t)
(diagram: the input x_t enters through W_x, the hidden state recurs through W, the output y_t is read out through W_y, and an error E_t is measured at each y_t)
Nando de Freitas, lecture, https://www.youtube.com/watch?v=56TYLaQN4N8; slide by Sarah Bargal
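A minimal numpy sketch of this forward pass through time (the choice φ = tanh and the weight shapes are assumptions for illustration; the slides leave φ generic):

import numpy as np

# Forward pass through time for h_t = W φ(h_{t-1}) + W_x x_t, y_t = W_y φ(h_t).
H, D, O = 16, 8, 4                      # hidden, input, output sizes (illustrative)
W   = np.random.randn(H, H) * 0.01      # hidden-to-hidden
W_x = np.random.randn(H, D) * 0.01      # input-to-hidden
W_y = np.random.randn(O, H) * 0.01      # hidden-to-output
phi = np.tanh

def forward(xs, h0=None):
    """Run the recurrence over a list of input vectors xs; return all states and outputs."""
    h = np.zeros(H) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        h = W @ phi(h) + W_x @ x
        y = W_y @ phi(h)
        hs.append(h)
        ys.append(y)
    return hs, ys

hs, ys = forward([np.random.randn(D) for _ in range(5)])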

Recurrent Neural Network (RNN)
Aside: forward pass
• An error or cost E_t is computed for each prediction y_t:
h_t = W φ(h_{t-1}) + W_x x_t
y_t = W_y φ(h_t)
(unrolled: errors E_0, E_1, E_2, …, E_t at the outputs y_0, y_1, y_2, …, y_t)
Chris Olah, Blog, http://colah.github.io/posts/2015-08-Understanding-LSTMs/; slide by Sarah Bargal

Backprop Through Time
• Backpropagation through time: the total gradient sums the per-time-step gradients,
∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W
and each per-step gradient sums contributions that flow back through every earlier hidden state:
∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
Nando de Freitas, lecture, https://www.youtube.com/watch?v=56TYLaQN4N8; slide by Sarah Bargal


Backprop Through Time (continued)
(unrolled network: errors E_0, E_1, E_2, …, E_t at outputs y_0, y_1, y_2, …, y_t over hidden states h_0, h_1, h_2, …, h_t)
∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
The factor ∂h_t/∂h_k is itself a product of one-step Jacobians:
∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1}
For example, at t = 2:
∂h_2/∂h_0 = Π_{i=1}^{2} ∂h_i/∂h_{i-1} = (∂h_1/∂h_0)(∂h_2/∂h_1)
Nando de Freitas, lecture, https://www.youtube.com/watch?v=56TYLaQN4N8; slide by Sarah Bargal
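A small numpy sketch of accumulating ∂h_t/∂h_k as this product of one-step Jacobians, using the convention from the next slide, ∂h_i/∂h_{i-1} = W^T diag[φ'(h_{i-1})], with φ = tanh and random stand-in weights and states:

import numpy as np

H = 8
W = np.random.randn(H, H) * 0.1                 # stand-in recurrent weights
hs = [np.random.randn(H) for _ in range(6)]     # stand-in hidden states h_0 .. h_5

def dphi(h):
    return 1.0 - np.tanh(h) ** 2                # φ'(h) for φ = tanh

def dht_dhk(t, k):
    """Product of one-step Jacobians; each new factor multiplies in on the left."""
    J = np.eye(H)
    for i in range(k + 1, t + 1):               # i = k+1, ..., t
        J = (W.T @ np.diag(dphi(hs[i - 1]))) @ J
    return J

print(np.linalg.norm(dht_dhk(5, 0)))            # shrinks quickly here because W is small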

Vanishing (and Exploding) Gradients
∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
Aside, the forward pass: h_t = W φ(h_{t-1}) + W_x x_t,  y_t = W_y φ(h_t)
∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1} = Π_{i=k+1}^{t} W^T diag[φ'(h_{i-1})]
Each factor is bounded: ||∂h_i/∂h_{i-1}|| ≤ ||W^T|| ||diag[φ'(h_{i-1})]|| ≤ γ_W γ_φ
so ||Π_{i=k+1}^{t} ∂h_i/∂h_{i-1}|| ≤ (γ_W γ_φ)^{t-k}, which vanishes when γ_W γ_φ < 1 and explodes when γ_W γ_φ > 1.
Nando de Freitas, lecture, https://www.youtube.com/watch?v=56TYLaQN4N8

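A tiny numpy check of the bound above, under the assumption φ = tanh so that γ_φ = 1 (weight scales and sizes are arbitrary stand-ins):

import numpy as np

# (γ_W γ_φ)^(t-k): the bound on ||∂h_t/∂h_k|| decays or blows up exponentially in t - k.
def bound(W, t_minus_k, gamma_phi=1.0):
    gamma_W = np.linalg.norm(W, 2)       # spectral norm of W, one valid choice of bound
    return (gamma_W * gamma_phi) ** t_minus_k

H = 8
W_small = np.random.randn(H, H) * 0.01   # γ_W γ_φ < 1  -> vanishing
W_large = np.random.randn(H, H) * 2.0    # γ_W γ_φ > 1  -> exploding
for steps in (1, 10, 50):
    print(steps, bound(W_small, steps), bound(W_large, steps))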

Vanishing (and Exploding) Gradients
• Exploding gradients
– Easy to detect
– Clip the gradient at a threshold (see the sketch below)
• Vanishing gradients
– More difficult to detect
– Use architectures designed to combat vanishing gradients, e.g. LSTMs (Hochreiter & Schmidhuber)
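A minimal sketch of gradient clipping; clipping by the global norm is one common scheme, and the function name and threshold here are illustrative:

import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale a list of gradient arrays so their combined norm is at most threshold."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads

# Usage: grads = clip_by_global_norm([dW, dW_x, dW_y]) before the parameter update.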

Neural Networks IV
Training strategies

Universality
• Why study neural networks in general?
• A neural network can approximate any continuous function, even with a single hidden layer!
• http://neuralnetworksanddeeplearning.com/chap4.html

Why Study Deep Networks?

Efficiency of convnets

But… Watch Out for Vanishing Gradients
• Consider a simple network, and perform backpropagation
• For simplicity, just a single neuron per layer
• Sigmoid at every layer
• Cost function C
• Gradient is a product of terms:
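The formula itself did not survive extraction; the standard form of this product for an illustrative four-layer chain (weights w_l, biases b_l, pre-activations z_l, final activation a_4) is:

\frac{\partial C}{\partial b_1} = \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial a_4}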

Vanishing Gradients
• Gradient of sigmoid is in (0,1/4)
• Weights are also typically initialized in (0,1)
• Products of small numbers → small gradients (a small numeric check follows below)
• Backprop does not change weights in earlier layers by much!
• This is an issue with backprop, not with the model itself
RNNs: vanishing and exploding gradients
• Exploding: easy to fix, clip the gradient at a threshold
• Vanishing: more difficult to detect
• Use architectures designed to combat vanishing gradients, e.g. LSTMs (Hochreiter & Schmidhuber)
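A quick numeric check of that shrinkage: even in the most favorable case, sigmoid' ≤ 1/4, so a chain of 10 such factors (with weights of size 1) already scales the gradient by (1/4)^10:

# Each earlier layer picks up an extra w_l * sigmoid'(z_l) factor, and sigmoid' <= 1/4.
print(0.25 ** 10)   # 9.5367431640625e-07: the gradient reaching layer 1 is ~10^6 times smaller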

Rectified Linear Units (RELU)
• Alternative non-linearity:
• Gradient of this function?
• Note: need subgradient descent here.
• https://cs224d.stanford.edu/notebooks/vanishing_grad_example.html
• Increasing the number of layers can result in requiring exponentially fewer hidden units per layer (see “Understanding Deep Neural Networks with Rectified Linear Units”)
• Biological considerations
• On some inputs, biological neurons have no activation
• On some inputs, neurons have activation proportional to input
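A minimal sketch of the ReLU non-linearity and one subgradient choice for it (taking 0 at x = 0 is a common convention):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # max(0, x), elementwise

def relu_subgrad(x):
    # ReLU is not differentiable at 0; 0 is a valid subgradient there.
    return (x > 0).astype(float)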

Other Activation Functions
• Leaky ReLU:
• Tanh:
• Radial Basis Functions:
• Softplus:
• Hard Tanh:
• Maxout:
• ….
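Sketches of a few of the activations listed above (the leaky-ReLU slope alpha is a common but arbitrary choice; maxout is shown with two linear pieces):

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    return np.logaddexp(0.0, x)            # log(1 + e^x), computed stably

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def maxout(x, W1, b1, W2, b2):
    # Maxout over two affine pieces; more pieces follow the same pattern.
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

# tanh is np.tanh; a radial basis activation is exp(-||x - c||^2) for a center c.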

Architecture Design and Training Issues
• How many layers? How many hidden units per layer? How to connect layers together? How to optimize?
• Cost functions
• L2/L1 regularization
• Data Set Augmentation
• Early Stopping
• Dropout
• Minibatch Training
• Momentum
• Initialization
• Batch Normalization
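Of the strategies listed above, dropout is easy to show concretely; a minimal sketch of inverted dropout (keep_prob = 0.5 is just a common choice):

import numpy as np

def dropout_forward(h, keep_prob=0.5, train=True):
    """Zero each unit with probability 1 - keep_prob during training and rescale,
    so that no change is needed at test time."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
    return h * mask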