
Deep Learning – COSC2779
Modelling Sequential (Time Series) Data

Dr. Ruwan Tennakoon

Sep 6, 2021

Reference: Chapter 10, Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.


MAIA – AI for music creation


https://edwardtky.wixsite.com/maia


NeuralTalk Sentence Generation Results


https://cs.stanford.edu/people/karpathy/deepimagesent/generationdemo/

Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN



Revision

A look back at what we have learned:

Deep neural network building blocks:

Week 2: Feed-forward NN models and cost functions.
Week 3: Optimising deep models: challenges and solutions.
Week 4: Convolutional neural networks: for data with spatial structure.
Weeks 7-8: Recurrent neural networks: for data with sequential structure.

Case study:

Week 5: Famous networks for computer vision applications.

Putting things together:

Week 6: Practical methodology.

Which type of NN to use?
Supervised learning with fixed-size vectors: deep feed-forward models.
Input has topological structure: CNN.
Input or output is a sequence: LSTM or GRU (discussed in the coming lectures).


Objectives of this lecture

Understand the main building blocks of RNNs designed to handle sequential data.
Understand improvements to the basic RNN structure and the intuition behind them.


Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN



Why do we need another model type?

Sentiment analysis:
“My experience so far has been fantastic” → Positive

“Your support team is useless” → Negative

How can we represent this data for a NN?
Character level: each character is mapped to a number.
Word level: each word is mapped to a number via a vocabulary index.

My experience so far has been fantastic
10277  512  12011  611  854  325  625

x(i) = [x(i)〈1〉, x(i)〈2〉, x(i)〈3〉, x(i)〈4〉, x(i)〈5〉, x(i)〈6〉, x(i)〈7〉]

Index    Word
1        a
2        ability
3        able
...      ...
325      been
512      experience
611      far
625      fantastic
854      has
10277    My
12011    So
...      ...

Vocabulary: assume 20k words.


One-Hot Representation

My experience so far has been fantastic
10277  512  12011  611  854  325  625

x(i) = [x(i)〈1〉, x(i)〈2〉, x(i)〈3〉, x(i)〈4〉, x(i)〈5〉, x(i)〈6〉, x(i)〈7〉]




x〈1〉 is a 20,000-dimensional column vector of zeros with a single 1 at position 10277 (“My”); x〈2〉 has its single 1 at position 512 (“experience”), and so on for the remaining words.

x(i) is then a 20,000 × 7 matrix: one one-hot column per word.
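As a concrete illustration (not from the lecture), here is a minimal Python/NumPy sketch that builds the 20,000 × 7 one-hot matrix for the example sentence; word_to_index is a hypothetical lookup using the indices assumed above.

import numpy as np

# Hypothetical lookup from word to vocabulary index (indices as assumed on this slide).
word_to_index = {"my": 10277, "experience": 512, "so": 12011, "far": 611,
                 "has": 854, "been": 325, "fantastic": 625}
vocab_size = 20000

def one_hot_sentence(words, word_to_index, vocab_size):
    # Returns a (vocab_size, T) matrix whose t-th column is the one-hot vector of word t.
    x = np.zeros((vocab_size, len(words)))
    for t, w in enumerate(words):
        x[word_to_index[w.lower()], t] = 1.0  # vocabulary index used directly as the row position
    return x

x_i = one_hot_sentence("My experience so far has been fantastic".split(),
                       word_to_index, vocab_size)
print(x_i.shape)                  # (20000, 7)
print(int(np.argmax(x_i[:, 0])))  # 10277, the row that is 1 for "My"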



Why do we need another model type?

Sentiment analysis:
“My experience so far has been fantastic” → Positive

“Your support team is useless” → Negative

My experience so far has been fantastic
x(i) = [x〈1〉, x〈2〉, x〈3〉, x〈4〉, x〈5〉, x〈6〉, x〈7〉]

We have now represented the sentence as a sequence of vectors. Can we use a FF-NN?
The input is a variable-length sequence.
The context of the data (elements) matters:

You were right.
Make a right turn at the light.


Sequential Data: Examples

Speech recognition: audio signal → “How are you”
DNA sequence analysis
Machine translation
Video activity recognition


Sequential Data

Data points have variable length.
The order of the data matters. The order can be:

Time related: video analysis.
Not related to time: DNA sequences.

Features shared across time steps are useful: they provide context.


Revision Questions

Why is a word-based representation preferred over a character-based representation in NLP?
For a simple NLP problem, what is the one-hot representation of a word with index 3, if the vocabulary size is 10?
What are the main properties of sequential data?


Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN


Named Entity Recognition (NER)

“Automatically find information units like names, including person, organization
and location names, and numeric expressions including time, date, money and
percent expressions from unstructured text.”


Named Entity Recognition (NER)

Simplified example: Only person names.

The retired English test cricketer Mark Butcher is
x 〈1〉 x 〈2〉 x 〈3〉 x 〈4〉 x 〈5〉 x 〈6〉 x 〈7〉 x 〈8〉

0 0 0 0 0 1 1 0
y 〈1〉 y 〈2〉 y 〈3〉 y 〈4〉 y 〈5〉 y 〈6〉 y 〈7〉 y 〈8〉

Dictionary size: 20k

Simple solution: represent each word as a one-hot vector and predict name/not-name with a FF-NN. This does not work well for words like “Mark” and “Butcher”, which are names only in some contexts.


Recurrent Neural Networks

Recurrent neural networks have loops in them, allowing information to persist.

Figure: an RNN cell takes x〈t〉 and produces ŷ〈t〉, with a loop feeding its state back to itself; unrolled in time it becomes a chain of identical cells over x〈1〉, x〈2〉, x〈3〉, ..., x〈T〉 producing ŷ〈1〉, ŷ〈2〉, ŷ〈3〉, ..., ŷ〈T〉 (Unrolled Recurrent Neural Network).

Same cell repeated. Not multiple cells.

The retired English Test cricketer Mark Butcher is
x 〈1〉 x 〈2〉 x 〈3〉 x 〈4〉 x 〈5〉 x 〈6〉 x 〈7〉 x 〈8〉

0 0 0 0 0 1 1 0



Recurrent Neural Network: Cell

Figure: RNN cell. The input x〈t〉 (through Wax) and the previous state a〈t−1〉 (through Waa) feed a tanh unit that produces a〈t〉; a σ unit with weights Wya produces the output ŷ〈t〉.

a〈t〉 = g1(Waa a〈t−1〉 + Wax x〈t〉 + ba)
ŷ〈t〉 = g2(Wya a〈t〉 + by)

Assume:
x〈t〉 is 20,000-dimensional (vocabulary size)
a〈t〉 is 100-dimensional (user defined)
y〈t〉 is 1-dimensional (output dimension)

Then the weight matrices (learned) are:
Wax is 100 × 20,000
Waa is 100 × 100
Wya is 1 × 100
ba is 100 × 1
by is 1 × 1
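A minimal NumPy sketch of one forward step of this cell under the dimensions assumed above, taking g1 = tanh and g2 = sigmoid (a reasonable choice for a single binary output); the random weights are purely illustrative.

import numpy as np

n_x, n_a, n_y = 20000, 100, 1   # input, state and output dimensions assumed above
rng = np.random.default_rng(0)

Wax = rng.normal(scale=0.01, size=(n_a, n_x))
Waa = rng.normal(scale=0.01, size=(n_a, n_a))
Wya = rng.normal(scale=0.01, size=(n_y, n_a))
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_cell_step(x_t, a_prev):
    # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba); y_hat<t> = sigmoid(Wya a<t> + by)
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = sigmoid(Wya @ a_t + by)
    return a_t, y_hat_t

x_t = np.zeros((n_x, 1))
x_t[10277] = 1.0                 # one-hot input, e.g. "My"
a_prev = np.zeros((n_a, 1))      # a<0> initialised to zeros
a_t, y_hat_t = rnn_cell_step(x_t, a_prev)
print(a_t.shape, y_hat_t.shape)  # (100, 1) (1, 1)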


Recurrent Neural Networks: Forward Propagation

Figure: unrolled RNN for forward propagation. The inputs x〈1〉, x〈2〉, ..., x〈Tx〉 and the states a〈0〉, a〈1〉, a〈2〉, ... feed successive cells, which produce ŷ〈1〉, ŷ〈2〉, ..., ŷ〈Ty〉; the same Wax, Waa, Wya appear at every step.

All Waa, Wax, Wya are shared across RNN cells.

a〈t〉 = g1(Waa a〈t−1〉 + Wax x〈t〉 + ba)
ŷ〈t〉 = g2(Wya a〈t〉 + by)

For a data point (i):
Tx: the number of elements in the input.
Ty: the number of elements in the output.
a〈0〉: the initial state; it can be the all-zero vector.
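A sketch of the full forward pass over a sequence, reusing the same Waa, Wax, Wya at every step (here Ty = Tx); shapes follow the slide's assumptions and the weights are random placeholders.

import numpy as np

n_x, n_a, n_y, T_x = 20000, 100, 1, 7
rng = np.random.default_rng(0)
Wax = rng.normal(scale=0.01, size=(n_a, n_x))
Waa = rng.normal(scale=0.01, size=(n_a, n_a))
Wya = rng.normal(scale=0.01, size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

def rnn_forward(x_seq):
    # x_seq: (n_x, T_x). Returns all hidden states (n_a, T_x) and outputs (n_y, T_x).
    a = np.zeros((n_a, 1))                               # a<0>: all-zero initial state
    a_all, y_all = [], []
    for t in range(x_seq.shape[1]):
        a = np.tanh(Waa @ a + Wax @ x_seq[:, [t]] + ba)  # same Waa, Wax at every step
        y = 1.0 / (1.0 + np.exp(-(Wya @ a + by)))        # same Wya at every step
        a_all.append(a)
        y_all.append(y)
    return np.hstack(a_all), np.hstack(y_all)

x_seq = np.zeros((n_x, T_x))
for t, idx in enumerate([10277, 512, 12011, 611, 854, 325, 625]):  # example sentence indices
    x_seq[idx, t] = 1.0
a_seq, y_seq = rnn_forward(x_seq)
print(a_seq.shape, y_seq.shape)   # (100, 7) (1, 7)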



Recurrent Neural Network: Back-propagation

Figure: unrolled RNN for back-propagation. The forward pass produces ŷ〈1〉, ŷ〈2〉, ..., ŷ〈Ty〉; each prediction is compared with its target y〈t〉 to give a per-step loss L〈t〉, and the total loss L(W) is back-propagated through the shared weights Waa, Wax, Wya.

All Waa, Wax, Wya are shared across RNN cells.

L(W) = Σ_{t=1}^{Ty} L〈t〉(W)
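A minimal sketch of this summed loss for a binary labelling task such as name/not-name, using per-time-step binary cross-entropy; in practice an autodiff framework would back-propagate this sum through time.

import numpy as np

def sequence_loss(y_hat_seq, y_seq, eps=1e-9):
    # y_hat_seq, y_seq: shape (1, Ty). Sum of per-step binary cross-entropy losses.
    per_step = -(y_seq * np.log(y_hat_seq + eps) +
                 (1 - y_seq) * np.log(1 - y_hat_seq + eps))
    return per_step.sum()   # L(W) = sum over t of L<t>(W)

y_hat = np.array([[0.1, 0.2, 0.1, 0.1, 0.1, 0.9, 0.8, 0.2]])   # predictions for 8 words
y_true = np.array([[0, 0, 0, 0, 0, 1, 1, 0]])                  # name / not-name labels
print(sequence_loss(y_hat, y_true))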


Revision Questions

What values are usually used for a〈0〉?
What would be an appropriate output activation function (g2) for an RNN if the task is regression?


Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN


Variants of the Simple RNN Structure

There are several variants of the basic RNN depending on the shapes of inputs and outputs.

Many-to-Many, Many-to-One, One-to-Many, One-to-One, Many-to-Many (Ty ≠ Tx)
Image: The Unreasonable Effectiveness of Recurrent Neural Networks – Andrej Karpathy

Many-to-Many (Ty = Tx ): We have already discussed this – e.g. Named Entity
Recognition.


Many-to-One

Sentiment analysis:
“My experience so far has been fantastic” → Positive

“Your support team is useless” → Negative
Video Action Classification:

Figure: unrolled many-to-one RNN. The inputs x〈1〉, x〈2〉, ..., x〈Tx〉 update the state at every step, and a single output ŷ〈Ty〉 is produced from the final state via Wya; the loss L(W) compares it with y〈Ty〉.

All Waa, Wax, Wya are shared across RNN cells.
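A sketch of a many-to-one classifier in tf.keras, assuming TensorFlow is available: an Embedding layer replaces explicit one-hot inputs, the recurrent layer keeps only its final state, and a sigmoid unit gives the sentiment. The layer sizes are illustrative assumptions, not values from the lecture.

import tensorflow as tf

vocab_size, embed_dim, state_dim = 20000, 64, 100   # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),  # word indices -> dense vectors
    tf.keras.layers.SimpleRNN(state_dim),              # returns only the final state (many-to-one)
    tf.keras.layers.Dense(1, activation="sigmoid")     # positive / negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_word_index_sequences, labels, ...) would train it on padded sequences.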


One-to-Many

Generating text/music, e.g. generating text similar to Shakespeare’s writing.

Image: Sonnet 18 in the 1609 Quarto of Shakespeare’s sonnets.

Given some text from Shakespeare’s writing, generate novel sentences that look similar.


One-to-Many

Generating Shakespeare-style writing:
x〈t〉 is a one-hot vector with size equal to the number of characters.
ŷ〈t〉 is a softmax output with size equal to the number of characters.

Figure: unrolled one-to-many RNN. Only x〈1〉 is given; each cell produces ŷ〈t〉 through the shared Wya, and the generated output is fed back as the next input x〈t+1〉.

All Waa, Wax, Wya are shared across RNN cells.


Inference: sample from ŷ〈t〉, convert the sample to a one-hot vector, and feed it in as x〈t+1〉.
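A NumPy sketch of that sampling loop. The weights here are random, so the generated indices are meaningless; the point is only how ŷ〈t〉 is sampled and fed back as x〈t+1〉.

import numpy as np

n_chars, n_a, T = 65, 100, 20   # assumed character-set size, state size, length to generate
rng = np.random.default_rng(0)
Wax = rng.normal(scale=0.01, size=(n_a, n_chars))
Waa = rng.normal(scale=0.01, size=(n_a, n_a))
Wya = rng.normal(scale=0.01, size=(n_chars, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_chars, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a = np.zeros((n_a, 1))
x = np.zeros((n_chars, 1))
x[0] = 1.0                       # x<1>: some start-of-sequence character
generated = []
for t in range(T):
    a = np.tanh(Waa @ a + Wax @ x + ba)
    y_hat = softmax(Wya @ a + by)                  # distribution over the next character
    idx = rng.choice(n_chars, p=y_hat.ravel())     # sample a character index from y_hat<t>
    x = np.zeros((n_chars, 1))
    x[idx] = 1.0                                   # its one-hot vector becomes x<t+1>
    generated.append(int(idx))
print(generated)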


Many-to-Many (Ty ≠ Tx)

There are many situations where X and Y do not have a one-to-one relationship, e.g. French-to-English translation, video captioning.

The sequence lengths are not the same.
Word-to-word translation does not work.


Revision Questions

You are given the task of predicting the genre of short music clips (not fixed length). What type of RNN is suitable for this task (many-to-many, one-to-many, many-to-one, one-to-one)?
You are given the task of generating image captions. What type of RNN is suitable for this task?
You are given the task of designing an automated exam-marking system. How would an RNN help you in this task?


Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN



Vanishing Gradients

Figure: a deep feed-forward network (inputs x1, x2 and hidden layers h(1), h(2), ..., h(L)) alongside an unrolled many-to-one RNN; in both, gradients must flow back through many layers or time steps.

Vanishing gradients: use skip connections, batch normalisation, ReLU, etc.
Exploding gradients: use gradient clipping.
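For exploding gradients, clipping is usually done inside the optimizer; for example, tf.keras optimizers accept a clipnorm (or clipvalue) argument. A sketch, assuming TensorFlow/Keras:

import tensorflow as tf

# Clip each gradient so that its norm is at most 1.0 before the update is applied.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Alternative: clip every gradient element to the range [-0.5, 0.5].
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.5)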



Long Range Dependencies

Figure: unrolled RNN over x〈1〉, x〈2〉, x〈3〉, ..., with outputs ŷ〈1〉, ŷ〈2〉, ŷ〈3〉, ..., ŷ〈Ty〉; the word that determines the later output appears many time steps earlier.

The cat, who ate . . . , was full.

The cats, who ate . . . , were full.

In theory, RNNs are absolutely capable of handling such “long-term dependencies”. In practice,
RNNs don’t seem to be able to learn them.


Gated Recurrent Units (GRU)

Change the structure of the units

Figure: the simple RNN cell, which computes

a〈t〉 = g1(Waa a〈t−1〉 + Wax x〈t〉 + ba)
ŷ〈t〉 = g2(Wya a〈t〉 + by)

compared with the GRU cell, which replaces this with a gated update of a cell state c〈t〉 built from σ and tanh units and element-wise multiply and add operations.

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling


https://arxiv.org/pdf/1409.1259.pdf
https://arxiv.org/pdf/1412.3555.pdf


Gated Recurrent Units (GRU)

c̃〈t〉 = tanh(Wc [Γr ⊙ c〈t−1〉, x〈t〉] + bc)    (candidate cell state)
c〈t〉 = (1 − Γu) ⊙ c〈t−1〉 + Γu ⊙ c̃〈t〉
Γu = σ(Wu [c〈t−1〉, x〈t〉] + bu)    (update gate)
Γr = σ(Wr [c〈t−1〉, x〈t〉] + br)    (relevance gate)
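A minimal NumPy sketch of one GRU step implementing these equations, where [c〈t−1〉, x〈t〉] is the concatenation of the previous state and the current input and ⊙ is element-wise multiplication; the sizes and random weights are illustrative assumptions.

import numpy as np

n_x, n_c = 20, 8                 # illustrative input and state sizes
rng = np.random.default_rng(0)
Wu, Wr, Wc = (rng.normal(scale=0.1, size=(n_c, n_c + n_x)) for _ in range(3))
bu, br, bc = (np.zeros((n_c, 1)) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t):
    concat = np.vstack([c_prev, x_t])                                 # [c<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                               # update gate
    gamma_r = sigmoid(Wr @ concat + br)                               # relevance (reset) gate
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)   # candidate state
    return (1 - gamma_u) * c_prev + gamma_u * c_tilde                 # mix of old and candidate state

c = gru_step(np.zeros((n_c, 1)), rng.normal(size=(n_x, 1)))
print(c.shape)                   # (8, 1)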

Figure: GRU cell. The previous state c〈t−1〉 runs along the top and is updated through element-wise multiply and add operations controlled by the gates Γr and Γu and the candidate state c̃〈t〉; the output is ŷ〈t〉.


GRU Network

Figure: unrolled GRU network.

The unrolled GRU network is identical to the simple RNN except for the cell type.

Note the horizontal line running through the top of the cell diagram. Like a conveyor belt, it runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged (recall ResNet and skip connections).

Gates are a way to optionally let information through.

Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN


Long Short Term Memory (LSTM) Cell

Γu = σ(Wu [a〈t−1〉, x〈t〉] + bu)    (update gate)
Γf = σ(Wf [a〈t−1〉, x〈t〉] + bf)    (forget gate)
Γo = σ(Wo [a〈t−1〉, x〈t〉] + bo)    (output gate)

c̃〈t〉 = tanh(Wc [a〈t−1〉, x〈t〉] + bc)    (candidate cell state)
c〈t〉 = Γf ⊙ c〈t−1〉 + Γu ⊙ c̃〈t〉
a〈t〉 = Γo ⊙ tanh(c〈t〉)
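A matching NumPy sketch of one LSTM step for these equations, with the same conventions as the GRU sketch above; sizes and weights are again illustrative.

import numpy as np

n_x, n_a = 20, 8                 # illustrative input and state sizes
rng = np.random.default_rng(0)
Wu, Wf, Wo, Wc = (rng.normal(scale=0.1, size=(n_a, n_a + n_x)) for _ in range(4))
bu, bf, bo, bc = (np.zeros((n_a, 1)) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t):
    concat = np.vstack([a_prev, x_t])            # [a<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde   # separate forget and update gates
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

a, c = lstm_step(np.zeros((n_a, 1)), np.zeros((n_a, 1)), rng.normal(size=(n_x, 1)))
print(a.shape, c.shape)          # (8, 1) (8, 1)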

The LSTM uses two gates (Γu, Γf) to make the update, compared to the single Γu in the GRU.

Figure: LSTM cell. The cell state c〈t−1〉 runs along the top and is updated via the forget and update gates; the output gate applied to tanh(c〈t〉) gives the new state a〈t〉 and the output ŷ〈t〉.

Long Short-Term Memory (Hochreiter & Schmidhuber, 1997):


https://www.bioinf.jku.at/publications/older/2604.pdf

LSTM vs GRU

The LSTM was developed first; the GRU is a simplification of the LSTM cell.

How to decide between GRU and LSTM:
There is no universally applicable rule for selecting one over the other for a given application.
GRU: simpler, and allows much larger models.
LSTM: more powerful and flexible; much older than the GRU and historically proven.

Either GRU or LSTM will enable the model to capture long-range dependencies.


Outline

1 Sequential Data
2 Recurrent Neural Networks (RNN)
3 Variants of RNN
4 Gated Recurrent Units Networks
5 Long Short Term Memory networks
6 Bi-Directional and Deep RNN


Bi-Directional RNN

RNNs can only use information from the past to make a prediction at time t.

In some cases, past information is not enough to make the prediction; future information can be helpful.

Mark Butcher is a retired English Test cricketer
x 〈1〉 x 〈2〉 x 〈3〉 x 〈4〉 x 〈5〉 x 〈6〉 x 〈7〉 x 〈8〉

1 1 0 0 0 0 0 0

“Mark my word . . . ”


Bi-Directional RNN

Figure: unrolled bi-directional RNN over x〈1〉, x〈2〉, x〈3〉, ..., x〈T〉, with outputs ŷ〈1〉, ŷ〈2〉, ŷ〈3〉, ..., ŷ〈T〉 (Unrolled Bi-Directional Recurrent Neural Network).

Two RNNs: one processes the sequence from the first word to the last, the other from the last word to the first.

Not suitable for real-time systems, where future inputs are not yet available.
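A sketch of a bi-directional tagger in tf.keras for a NER-style task, with assumed sizes: the Bidirectional wrapper runs one LSTM forward and one backward and concatenates their outputs at every time step, so each word's prediction can use both past and future context.

import tensorflow as tf

vocab_size, embed_dim, state_dim = 20000, 64, 100   # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(state_dim, return_sequences=True)),  # forward and backward passes, concatenated
    tf.keras.layers.Dense(1, activation="sigmoid")                 # per-word name / not-name score
])
model.compile(optimizer="adam", loss="binary_crossentropy")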


Deep RNN

Figure: unrolled deep RNN over x〈1〉, x〈2〉, x〈3〉, ..., x〈T〉, with outputs ŷ〈1〉, ŷ〈2〉, ŷ〈3〉, ..., ŷ〈T〉 (Unrolled Deep Recurrent Neural Network).

Stack RNN cells on top of each other.
Can be expensive in terms of memory; deep RNNs typically use far fewer layers than CNNs.
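A sketch of a stacked (deep) recurrent model in tf.keras with illustrative sizes: every recurrent layer except the last returns its full output sequence so that the next layer receives an input at each time step.

import tensorflow as tf

vocab_size, embed_dim = 20000, 64   # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.LSTM(64, return_sequences=True),   # layer 1: passes a full sequence upwards
    tf.keras.layers.LSTM(64, return_sequences=True),   # layer 2: passes a full sequence upwards
    tf.keras.layers.LSTM(32),                          # layer 3: keeps only the final state
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy")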

Summary

Sequential data: variable length, order matters, context is important.
Recurrent neural networks have loops in them, allowing information to persist.
Modifications to the simple RNN:

GRU/LSTM: capture long-range dependencies.
Bi-directional: use both past and future information.
Deep: more complex models.

Next week: Applications of sequential data modelling

Lab: Getting started with RNN – Sentiment analysis

