Deep Learning – COSC2779
Modelling Sequential (Time Series) Data
Dr. Ruwan Tennakoon
Sep 6, 2021
Reference: Chapter 10 of Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.
MAIA – AI for music creation
https://edwardtky.wixsite.com/maia
NeuralTalk Sentence Generation Results
https://cs.stanford.edu/people/karpathy/deepimagesent/generationdemo/
Outline
1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN
Revision
A look back at what we have learned:
Deep neural network building blocks:
  Week 2: Feed-forward NN models and cost functions.
  Week 3: Optimising deep models: challenges and solutions.
  Week 4: Convolutional neural networks: for data with spatial structure.
  Weeks 7-8: Recurrent neural networks: for data with sequential structure.
Case study:
  Week 5: Famous networks for computer vision applications.
Putting things together:
  Week 6: Practical methodology.
Which type of NN should we use?
  Supervised learning with fixed-size vectors: deep feed-forward models.
  Input has topological structure: CNN.
  Input or output is a sequence: LSTM or GRU (discussed in this and the coming lectures).
Objectives of this lecture
Understand the main building blocks of RNNs designed to handle sequential data.
Understand the improvements to the basic structure, and the intuition behind them.
Why do we need another model type?
Sentiment analysis:
  “My experience so far has been fantastic” → Positive
  “Your support team is useless” → Negative
How can we represent this data (for a NN)?
  Character level: each character is mapped to a number.
  Word level: each word is mapped to an index in a vocabulary (assume a vocabulary of 20,000 words):

    Word:   My     experience  so     far   has   been  fantastic
    Index:  10277  512         12011  611   854   325   625

  $x^{(i)} = \left[x^{(i)\langle 1\rangle}, x^{(i)\langle 2\rangle}, x^{(i)\langle 3\rangle}, x^{(i)\langle 4\rangle}, x^{(i)\langle 5\rangle}, x^{(i)\langle 6\rangle}, x^{(i)\langle 7\rangle}\right]$

Vocabulary excerpt (20,000 words in total):
  Index   Word
  1       a
  2       ability
  3       able
  ...     ...
  325     been
  ...     ...
  512     experience
  ...     ...
  611     far
  ...     ...
  625     fantastic
  ...     ...
  854     has
  ...     ...
  10277   My
  ...     ...
  12011   So
  ...     ...
One-Hot Representation
    Word:   My     experience  so     far   has   been  fantastic
    Index:  10277  512         12011  611   854   325   625

  $x^{(i)} = \left[x^{(i)\langle 1\rangle}, x^{(i)\langle 2\rangle}, \ldots, x^{(i)\langle 7\rangle}\right]$

Each word is represented as a one-hot column vector of length 20,000 (the vocabulary size): all entries are 0 except a single 1 at the word's index. For example, $x^{\langle 1\rangle}$ has its 1 at position 10277 (“My”) and $x^{\langle 2\rangle}$ has its 1 at position 512 (“experience”).

$x^{(i)}$ is therefore a matrix with dimensions 20,000 × 7.
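To make the representation concrete, here is a minimal NumPy sketch of word-level one-hot encoding as described above. The tiny vocabulary and the helper names are made up for illustration; the lecture assumes a 20,000-word vocabulary built from a corpus.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a one-hot column vector with a 1 at `index` (1-based, as in the slides)."""
    v = np.zeros((vocab_size, 1))
    v[index - 1] = 1.0
    return v

# Hypothetical toy vocabulary for illustration only (the lecture assumes ~20k words).
word_to_index = {"my": 1, "experience": 2, "so": 3, "far": 4,
                 "has": 5, "been": 6, "fantastic": 7}
vocab_size = len(word_to_index)

sentence = "My experience so far has been fantastic".lower().split()
X = np.hstack([one_hot(word_to_index[w], vocab_size) for w in sentence])
print(X.shape)  # (vocab_size, 7): one column per word, as on the slide
```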
Why do we need another model type?
Sentiment analysis:
  “My experience so far has been fantastic” → Positive
  “Your support team is useless” → Negative
My experience so far has been fantastic
$x^{(i)} = \left[x^{\langle 1\rangle}, x^{\langle 2\rangle}, x^{\langle 3\rangle}, x^{\langle 4\rangle}, x^{\langle 5\rangle}, x^{\langle 6\rangle}, x^{\langle 7\rangle}\right]$
We have now represented a sentence as a sequence of vectors. Can we use a FF-NN?
  The input has variable length, while a FF-NN expects a fixed-size input.
  The context of each element matters:
    “You were right.”
    “Make a right turn at the light.”
Sequential Data: Examples
  Speech recognition: audio waveform → “How are you”
  DNA sequence analysis
  Machine translation
  Video activity recognition
Sequential Data
  Data points have variable length.
  The order of the data matters. The order can be:
    Time related: video analysis.
    Not related to time: DNA sequences.
  Sharing features across time steps is useful – it captures context.
Revision Questions
  Why is a word-based representation preferred over a character-based representation in NLP?
  For a simple NLP problem, what will be the one-hot representation of a word with index 3, if the vocabulary size is 10?
  What are the main properties of sequential data?
Named Entity Recognition (NER)
“Automatically find information units like names, including person, organization
and location names, and numeric expressions including time, date, money and
percent expressions from unstructured text.”
Named Entity Recognition (NER)
Simplified example: person names only.
  The   retired  English  test  cricketer  Mark  Butcher  is
  x〈1〉  x〈2〉     x〈3〉     x〈4〉  x〈5〉       x〈6〉  x〈7〉     x〈8〉
  0     0        0        0     0          1     1        0
  y〈1〉  y〈2〉     y〈3〉     y〈4〉  y〈5〉       y〈6〉  y〈7〉     y〈8〉
Dictionary size: 20k.
Simple solution: represent each word as a one-hot vector and predict name/not-name with a FF-NN. This does not work for words like “Mark” and “Butcher”, which can be ordinary words or names depending on the surrounding words.
Recurrent Neural Networks
Recurrent neural networks have loops in them, allowing information to persist.
[Figure: an RNN cell that takes x〈t〉 and produces ŷ〈t〉, with a loop feeding its state back into itself; unrolled in time it becomes a chain of copies of the same cell processing x〈1〉, x〈2〉, ..., x〈T〉 and producing ŷ〈1〉, ŷ〈2〉, ..., ŷ〈T〉.]
It is the same cell repeated at every time step, not multiple different cells.
  The   retired  English  Test  cricketer  Mark  Butcher  is
  x〈1〉  x〈2〉     x〈3〉     x〈4〉  x〈5〉       x〈6〉  x〈7〉     x〈8〉
  0     0        0        0     0          1     1        0
Recurrent Neural Network: Cell
[Figure: a single RNN cell. The previous state a〈t−1〉 and the input x〈t〉 are combined through a tanh layer (weights Waa, Wax) to give the new state a〈t〉; an output layer (weights Wya, here σ) applied to a〈t〉 gives ŷ〈t〉.]

$a^{\langle t\rangle} = g_1\!\left(W_{aa}\, a^{\langle t-1\rangle} + W_{ax}\, x^{\langle t\rangle} + b_a\right)$
$\hat{y}^{\langle t\rangle} = g_2\!\left(W_{ya}\, a^{\langle t\rangle} + b_y\right)$

Assume:
  x〈t〉 is 20,000-dimensional (vocabulary size)
  a〈t〉 is 100-dimensional (user defined)
  y〈t〉 is 1-dimensional (output dimension)
Then the learned weight matrices have shapes:
  Wax: 100 × 20,000
  Waa: 100 × 100
  Wya: 1 × 100
  ba: 100 × 1
  by: 1 × 1
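A minimal NumPy sketch of a single forward step of this cell, written directly from the two equations above (taking g1 = tanh and g2 = σ for a binary output). All names here are illustrative, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y<t> = sigmoid(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_t = sigmoid(Wya @ a_t + by)
    return a_t, y_t

# Shapes from the slide: 20,000-dim input, 100-dim state, 1-dim output.
n_x, n_a, n_y = 20_000, 100, 1
rng = np.random.default_rng(0)
Wax = rng.normal(scale=0.01, size=(n_a, n_x))
Waa = rng.normal(scale=0.01, size=(n_a, n_a))
Wya = rng.normal(scale=0.01, size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

x_t = np.zeros((n_x, 1)); x_t[10276] = 1.0   # one-hot for word index 10277 ("My")
a_prev = np.zeros((n_a, 1))                  # a<0> initialised to zeros
a_t, y_t = rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by)
print(a_t.shape, y_t.shape)                  # (100, 1) (1, 1)
```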
Recurrent Neural Networks: Forward Propagation
[Figure: the unrolled network. The initial state a〈0〉 feeds into the first cell together with x〈1〉; each cell passes its state a〈t〉 to the next cell and produces ŷ〈t〉, using the weights Wax, Waa and Wya.]

All weights Waa, Wax, Wya are shared across the RNN cells.

$a^{\langle t\rangle} = g_1\!\left(W_{aa}\, a^{\langle t-1\rangle} + W_{ax}\, x^{\langle t\rangle} + b_a\right)$
$\hat{y}^{\langle t\rangle} = g_2\!\left(W_{ya}\, a^{\langle t\rangle} + b_y\right)$

For a data point (i):
  Tx: number of elements in the input sequence.
  Ty: number of elements in the output sequence.
  a〈0〉: the initial state, which can be the all-zero vector.
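Forward propagation just applies the same cell, with the same shared weights, at every time step while carrying the state along. A sketch that reuses the hypothetical rnn_cell_forward from the previous snippet (here Ty = Tx, one output per input):

```python
def rnn_forward(X, a0, Waa, Wax, Wya, ba, by):
    """X has shape (n_x, Tx): one one-hot column per time step.
    Returns all outputs and the final state."""
    a_t, outputs = a0, []
    for t in range(X.shape[1]):          # loop over the Tx time steps
        x_t = X[:, t:t + 1]              # keep the column shape (n_x, 1)
        a_t, y_t = rnn_cell_forward(x_t, a_t, Waa, Wax, Wya, ba, by)
        outputs.append(y_t)
    return np.hstack(outputs), a_t       # shapes (n_y, Tx) and (n_a, 1)
```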
Recurrent Neural Network: Back-propagation
[Figure: the unrolled network as before, now with a per-time-step loss L〈t〉 comparing ŷ〈t〉 with the target y〈t〉; the per-step losses are combined into the total loss L(W).]

All weights Waa, Wax, Wya are shared across the RNN cells.

$\mathcal{L}(W) = \sum_{t=1}^{T_y} \mathcal{L}^{\langle t\rangle}(W)$

Gradients of this total loss are propagated backwards through the unrolled network, a procedure known as back-propagation through time.
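In practice the unrolling and the differentiation of the summed loss are handled by a framework. A hedged Keras sketch of a many-to-many tagger in the spirit of the NER example; the layer sizes, the Embedding layer (a common substitute for explicit one-hot inputs) and the sequence length are illustrative assumptions, not prescribed by the lecture:

```python
import tensorflow as tf

vocab_size, state_dim, Tx = 20_000, 100, 8   # assumed sizes, matching the running example

tagger = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(Tx,), dtype="int32"),            # word indices
    tf.keras.layers.Embedding(vocab_size, 64),                    # learned stand-in for one-hot input
    tf.keras.layers.SimpleRNN(state_dim, return_sequences=True),  # one state / output per time step
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid")),
])
# The per-step binary cross-entropy losses L<t> are combined over t (i.e. L(W) = sum_t L<t>(W),
# up to averaging) and model.fit back-propagates through the unrolled network.
tagger.compile(optimizer="adam", loss="binary_crossentropy")
```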
Revision Questions
  What values are usually used for a〈0〉?
  What would be an appropriate output activation function (g2) for an RNN if the task is regression?
Variants of the Simple RNN Structure
There are several variants of the basic RNN, depending on the shapes of the inputs and outputs:
  Many-to-Many, Many-to-One, One-to-Many, One-to-One, and Many-to-Many with Ty ≠ Tx.
Image: “The Unreasonable Effectiveness of Recurrent Neural Networks” – Andrej Karpathy
Many-to-Many (Ty = Tx): we have already discussed this, e.g. Named Entity Recognition.
Many-to-One
Sentiment analysis:
  “My experience so far has been fantastic” → Positive
  “Your support team is useless” → Negative
Video action classification.
[Figure: the unrolled network where every time step receives an input x〈1〉, ..., x〈Tx〉, but only the final cell produces an output ŷ〈Ty〉, which is compared with y〈Ty〉 to give the loss L(W).]
All weights Waa, Wax, Wya are shared across the RNN cells.
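A hedged Keras sketch of a many-to-one sentiment classifier; with return_sequences left at its default (False) the recurrent layer exposes only its final state, so the whole sequence yields a single prediction. Sizes and layer choices are illustrative assumptions:

```python
import tensorflow as tf

sentiment_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None,), dtype="int32"),  # variable-length sequence of word indices
    tf.keras.layers.Embedding(20_000, 64),
    tf.keras.layers.SimpleRNN(100),                        # only the final state a<Tx> is returned
    tf.keras.layers.Dense(1, activation="sigmoid"),        # positive / negative
])
sentiment_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```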
One-to-Many
Generating text/music.
E.g. generating text similar to Shakespeare’s writing.
Image: Sonnet 18 in the 1609 Quarto of Shakespeare’s sonnets.
Given some text from Shakespeare’s writing, generate novel sentences that look similar.
One-to-Many
Generating Shakespeare-like writing:
  x〈t〉 is a one-hot vector with size equal to the number of characters.
  ŷ〈t〉 is a softmax output with size equal to the number of characters.
[Figure: the unrolled network where only the first cell receives an external input x〈1〉; each subsequent input x〈t+1〉 comes from the previous step's output, and every cell produces an output ŷ〈t〉.]
All weights Waa, Wax, Wya are shared across the RNN cells.
Inference: convert ŷ〈t〉 to a one-hot vector by sampling from it, and feed it in as the next input x〈t+1〉.
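A minimal NumPy sketch of this sampling loop at inference time, following the cell equations with a softmax output over characters; n_chars, the weight shapes and the function name are all illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(x1, a0, Waa, Wax, Wya, ba, by, length, rng):
    """Generate `length` character indices, feeding each sampled character back in as x<t+1>."""
    x_t, a_t, indices = x1, a0, []
    n_chars = Wax.shape[1]                               # input and output are both over characters
    for _ in range(length):
        a_t = np.tanh(Waa @ a_t + Wax @ x_t + ba)        # a<t>
        p = softmax(Wya @ a_t + by)                      # y<t>: distribution over characters
        idx = rng.choice(n_chars, p=p.ravel())           # sample one character
        indices.append(int(idx))
        x_t = np.zeros((n_chars, 1)); x_t[idx] = 1.0     # its one-hot becomes the next input
    return indices
```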
Many-to-Many (Ty ≠ Tx)
There are many situations where X and Y do not have a one-to-one relationship, e.g. French-to-English translation, video captioning.
  The input and output sequence lengths are not the same.
  Word-to-word translation does not work.
Revision Questions
  You are given the task of predicting the genre of short music clips (not of fixed length). What type of RNN is suitable for this task (many-to-many, one-to-many, many-to-one, one-to-one)?
  You are given the task of generating image captions. What type of RNN is suitable for this task?
  You are given the task of designing an automated exam-marking system. How would RNNs help you with this task?
Vanishing Gradients
[Figure: a simple feed-forward neural network (inputs x1, x2; hidden layers h(1), h(2), ..., h(L); output ŷ) shown next to an unrolled RNN with shared weights Wax, Waa, Wya.]
An unrolled RNN is effectively a very deep network in time, so it suffers from the same gradient problems as deep feed-forward networks:
  Vanishing gradients: mitigated with skip connections, BN, ReLU, etc.
  Exploding gradients: mitigated with gradient clipping.
Long Range Dependencies
[Figure: an unrolled RNN processing x〈1〉, x〈2〉, x〈3〉, ..., producing ŷ〈1〉, ŷ〈2〉, ŷ〈3〉, ..., ŷ〈Ty〉.]
  The cat, who ate . . . , was full.
  The cats, who ate . . . , were full.
In theory, RNNs are absolutely capable of handling such “long-term dependencies”. In practice, RNNs don’t seem to be able to learn them.
Gated Recurrent Units (GRU)
Change the structure of the unit. For comparison, the simple RNN cell:
$a^{\langle t\rangle} = g_1\!\left(W_{aa}\, a^{\langle t-1\rangle} + W_{ax}\, x^{\langle t\rangle} + b_a\right)$
$\hat{y}^{\langle t\rangle} = g_2\!\left(W_{ya}\, a^{\langle t\rangle} + b_y\right)$
[Figure: the simple RNN cell (left) next to the GRU cell (right); the GRU cell contains two σ gates and a tanh layer acting on the cell state c〈t−1〉 → c〈t〉, from which the output ŷ〈t〉 is produced.]
References:
  “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches” – https://arxiv.org/pdf/1409.1259.pdf
  “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling” – https://arxiv.org/pdf/1412.3555.pdf
Gated Recurrent Units (GRU)
$\tilde{c}^{\langle t\rangle} = \tanh\!\left(W_c\left[\Gamma_r \odot c^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_c\right)$   (candidate cell state)
$c^{\langle t\rangle} = (1-\Gamma_u) \odot c^{\langle t-1\rangle} + \Gamma_u \odot \tilde{c}^{\langle t\rangle}$   (cell state)
$\Gamma_u = \sigma\!\left(W_u\left[c^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_u\right)$   (update gate)
$\Gamma_r = \sigma\!\left(W_r\left[c^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_r\right)$   (relevance gate)
When Γu ≈ 1 the old state is replaced by the candidate (c〈t〉 ≈ c̃〈t〉); when Γu ≈ 0 the old state is carried forward unchanged, which lets information persist over many time steps.
[Figure: the GRU cell, with the gates Γr and Γu and the candidate c̃〈t〉 computed from c〈t−1〉 and x〈t〉, and the output ŷ〈t〉 produced from c〈t〉.]
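A minimal NumPy sketch of one GRU step, written directly from the four equations above, where [·, ·] denotes concatenation and ⊙ element-wise multiplication; the names and shapes are illustrative, not a library API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wu, Wr, Wc, bu, br, bc):
    """One GRU step. Each W* has shape (n_c, n_c + n_x); * below is element-wise multiplication."""
    concat = np.vstack([c_prev, x_t])                     # [c<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                   # update gate
    gamma_r = sigmoid(Wr @ concat + br)                   # relevance (reset) gate
    concat_r = np.vstack([gamma_r * c_prev, x_t])         # [Γr ⊙ c<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ concat_r + bc)                 # candidate cell state
    c_t = (1.0 - gamma_u) * c_prev + gamma_u * c_tilde    # interpolate old state and candidate
    return c_t
```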
GRU Network
An unrolled GRU network is identical to the simple RNN, except for the cell type.
Note the horizontal line running through the top of the cell diagram: like a conveyor belt, the cell state runs straight along the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged (recall ResNet and skip connections).
Gates are a way to optionally let information through.
Long Short-Term Memory (LSTM) Cell
$\Gamma_u = \sigma\!\left(W_u\left[a^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_u\right)$   (update gate)
$\Gamma_f = \sigma\!\left(W_f\left[a^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_f\right)$   (forget gate)
$\Gamma_o = \sigma\!\left(W_o\left[a^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_o\right)$   (output gate)
$\tilde{c}^{\langle t\rangle} = \tanh\!\left(W_c\left[a^{\langle t-1\rangle},\, x^{\langle t\rangle}\right] + b_c\right)$   (candidate cell state)
$c^{\langle t\rangle} = \Gamma_f \odot c^{\langle t-1\rangle} + \Gamma_u \odot \tilde{c}^{\langle t\rangle}$
$a^{\langle t\rangle} = \Gamma_o \odot \tanh\!\left(c^{\langle t\rangle}\right)$
The LSTM uses two gates (Γu, Γf) to make the update, compared to the single Γu in the GRU.
[Figure: the LSTM cell, with inputs a〈t−1〉, c〈t−1〉 and x〈t〉, three σ gates and two tanh layers, and outputs a〈t〉, c〈t〉 and ŷ〈t〉.]
Reference: “Long Short-Term Memory” – https://www.bioinf.jku.at/publications/older/2604.pdf
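The corresponding minimal NumPy sketch of one LSTM step, again taken straight from the equations above (illustrative names and shapes, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wu, Wf, Wo, Wc, bu, bf, bo, bc):
    """One LSTM step. Each W* has shape (n_a, n_a + n_x); * below is element-wise multiplication."""
    concat = np.vstack([a_prev, x_t])                  # [a<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                # update gate
    gamma_f = sigmoid(Wf @ concat + bf)                # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)                # output gate
    c_tilde = np.tanh(Wc @ concat + bc)                # candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde         # new cell state
    a_t = gamma_o * np.tanh(c_t)                       # new hidden state a<t>
    return a_t, c_t
```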
LSTM vs GRU
The LSTM was developed first; the GRU is a simplification of the LSTM cell.
How to decide between GRU and LSTM:
  There is no universally applicable rule for selecting one over the other for a given application.
  GRU: simpler, and allows much larger models.
  LSTM: more powerful and flexible; much older than the GRU and historically proven.
Either GRU or LSTM will enable the model to capture long-range dependencies.
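In a framework such as Keras the two cells are drop-in replacements for each other (and for SimpleRNN), so comparing them is usually a one-line change. A hedged sketch with illustrative sizes:

```python
import tensorflow as tf

def make_classifier(cell="gru", vocab_size=20_000, units=100):
    """Same many-to-one architecture with either cell type; only the recurrent layer changes."""
    Recurrent = tf.keras.layers.GRU if cell == "gru" else tf.keras.layers.LSTM
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None,), dtype="int32"),
        tf.keras.layers.Embedding(vocab_size, 64),
        Recurrent(units),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

gru_model = make_classifier("gru")
lstm_model = make_classifier("lstm")
```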
Bi-Directional RNN
A (uni-directional) RNN can only use information from the past to make the prediction at time t. In some cases past information is not enough, and future information can be helpful:
  Mark  Butcher  is    a     retired  English  Test  cricketer
  x〈1〉  x〈2〉     x〈3〉  x〈4〉  x〈5〉     x〈6〉     x〈7〉  x〈8〉
  1     1        0     0     0        0        0     0
Having seen only “Mark”, the model cannot tell whether it is a person’s name or the start of a phrase such as “Mark my word . . . ”.
Bi-Directional RNN
[Figure: an unrolled bi-directional RNN in which each output ŷ〈t〉 depends on a forward pass over x〈1〉, ..., x〈t〉 and a backward pass over x〈T〉, ..., x〈t〉.]
  Two RNNs: one runs from the first word to the last, the other from the last word to the first.
  Not suitable for real-time systems, as the future inputs are not yet available in such systems.
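A hedged Keras sketch of a bi-directional tagger: the Bidirectional wrapper runs one copy of the recurrent layer forwards and one backwards over the sequence and concatenates their outputs at every time step. Sizes are illustrative assumptions:

```python
import tensorflow as tf

bi_tagger = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None,), dtype="int32"),
    tf.keras.layers.Embedding(20_000, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid")),
])
```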
Deep RNN
[Figure: an unrolled RNN with several recurrent layers stacked on top of each other; at each time step the outputs of one layer are the inputs of the layer above.]
  Stack RNN cells on top of each other.
  Can be expensive in terms of computation and memory, so deep RNNs typically use far fewer layers than CNNs.
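A hedged Keras sketch of a deep (stacked) RNN: every recurrent layer except the top one must return its full output sequence so that the layer above receives one input per time step. Sizes and the many-to-one head are illustrative assumptions:

```python
import tensorflow as tf

deep_rnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None,), dtype="int32"),
    tf.keras.layers.Embedding(20_000, 64),
    tf.keras.layers.GRU(100, return_sequences=True),   # layer 1: outputs at every time step
    tf.keras.layers.GRU(100, return_sequences=True),   # layer 2
    tf.keras.layers.GRU(100),                          # top layer: final state only
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```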
Summary
  Sequential data: variable length, order matters, context is important.
  Recurrent neural networks have loops in them, allowing information to persist.
  Modifications to the simple RNN:
    GRU/LSTM: capture long-range dependencies.
    Bi-directional: use both past and future information.
    Deep: more complex models.
Next week: applications of sequential data modelling.
Lab: getting started with RNNs – sentiment analysis.