
Deep Learning – COSC2779
Sequential Data – Applications

Dr. Ruwan Tennakoon

Sep 13, 2021

Reference: Chapter 10 of Ian Goodfellow et al., “Deep Learning”, MIT Press, 2016.


Outline

1 Text Classification

Word Embedding

Sentiment Classification
2 Machine Translation

Sequence to Sequence

(Advanced) Sampling in Decoder

Attention
3 Speech Recognition


Revision
Sequential Data:

Data points have variable length.
The order of the data matters.
Shared features across time are useful.

Simple RNN Cell (diagram)

Improvements:
Capture long-range dependencies (gated cells such as LSTM/GRU).
Bi-directional: use both historical and future information.
Deep RNN: increase model capacity.



Objectives of the lecture

Understand how RNNs are used in key application areas:
Text Classification.
Machine Translation.
Speech Recognition.

This lecture covers the fundamental concepts that will enable you to explore the state of the art.



One-Hot Representation

My experience so far has been fantastic
10277 512 12011 611 854 325 625

x(i) = [ x〈1〉 x〈2〉 x〈3〉 x〈4〉 x〈5〉 x〈6〉 x〈7〉 ]

x〈1〉 is a 20,000-dimensional vector of zeros with a single 1 at position 10277 (“My”).
x〈2〉 is a 20,000-dimensional vector of zeros with a single 1 at position 512 (“experience”), and so on.

x(i) is a matrix with dimensions 20,000 × 7.

Index   Word
1       a
2       ability
3       able
...     ...
325     been
512     experience
611     far
625     fantastic
854     has
10277   My
12011   So
...     ...

Vocabulary: assume 20,000 words.
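A minimal Python sketch of this representation, assuming a toy word-to-index mapping in place of the full 20,000-word vocabulary:

import numpy as np

# Toy vocabulary: word -> index (1-based, as in the table above).
vocab = {"my": 10277, "experience": 512, "so": 12011, "far": 611,
         "has": 854, "been": 325, "fantastic": 625}
vocab_size = 20000

def one_hot(word):
    o = np.zeros(vocab_size)
    o[vocab[word] - 1] = 1.0          # single 1 at the word's index
    return o

sentence = "my experience so far has been fantastic".split()
x = np.stack([one_hot(w) for w in sentence], axis=1)
print(x.shape)                        # (20000, 7): one column per word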



A toy five-word vocabulary:

apple:   [1, 0, 0, 0, 0]
orange:  [0, 1, 0, 0, 0]
cricket: [0, 0, 1, 0, 0]
kabaddi: [0, 0, 0, 1, 0]
man:     [0, 0, 0, 0, 1]

|apple − orange| = 2
|apple − cricket| = 2

Every pair of distinct one-hot vectors is the same distance apart, so the encoding carries no information about which words are related.

One-Hot to feature Representation

One-Hot does not capture similarity between words.

Seen During Training: Joe Root was in the English cricket team.

During inference: Dan Murphy was in the Australian Kabaddi squad.

If the representations for ‘cricket’ ↔ ‘kabaddi’ and ‘team’ ↔ ‘squad’ were close, then it would be clear that “Dan Murphy” in this context is a person’s name, not an organization.

A representation that captures relationships between words enables inference on sequences that were not seen during training.

Hand-Crafted Word Embedding

          cricket   kabaddi   orange   apple   man
Gender      0.1       0.01      0        0      -1
Age          0          0       0        0       1
Food         0          0       1        1       0
Sport        1          1       0        0       0
Alive        0          0       0        0       1
Noun         1          1       1        1       1

          e_cricket  e_kabaddi  e_orange  e_apple  e_man

Assume we have 300 features (rows in the table above).

The vocabulary size is 20k words.

Feature values can be floating-point numbers (not binary).

oi ∈ R^20,000 → one-hot encoding of word i in the vocabulary.
ei ∈ R^300 → feature encoding of word i in the vocabulary.
E ∈ R^(300×20,000) → feature encodings of all words in the vocabulary (the embedding matrix).

ei = E · oi
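A small numpy sketch of the relation ei = E · oi; it also shows why, in practice, the product with a one-hot vector is implemented as a simple column lookup. The random E is a placeholder for a learned embedding matrix:

import numpy as np

vocab_size, emb_dim = 20000, 300
E = np.random.randn(emb_dim, vocab_size)   # embedding matrix (300 x 20,000), placeholder values

i = 511                                    # 0-based index of word 512 ("experience")
o = np.zeros(vocab_size)
o[i] = 1.0                                 # one-hot encoding o_i

e_via_product = E @ o                      # e_i = E . o_i
e_via_lookup = E[:, i]                     # equivalent: just read column i of E
assert np.allclose(e_via_product, e_via_lookup)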

|e_cricket − e_kabaddi| < |e_cricket − e_orange|

Properties of Word Embedding

Diagram: word embeddings projected from 300-D down to 2-D using a method like t-SNE.

Learning Word Embeddings

Learn by language modelling, e.g. predict the next word: Joe Root was in the English cricket ???.

We can create a supervised learning task where x is a fixed-length sequence (a window selected by the user) and y is the next word in the sequence.

Joe Root was in the English Cricket   Team
x〈1〉 x〈2〉 x〈3〉 x〈4〉 x〈5〉          y

Paper: A Neural Probabilistic Language Model
https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Learning Word Embeddings

Diagram: each one-hot vector o〈j〉 is multiplied by E to give the embedding e〈j〉; the embeddings are averaged (Mean(e)) and fed to an MLP (W) with a softmax output over the vocabulary.

Train for the parameters E and the weights of the MLP (W) via back-propagation.

Learning Word Embeddings

Joe Root was in the English Cricket   Team
x〈1〉 x〈2〉 x〈3〉 x〈4〉 x〈5〉          y
(context)                             (target)

The context can be:
The last 5 words.
The 5 words before and 5 words after.
The last word.
A nearby word.

Word2Vec and GloVe are more recent, practical word-embedding learning techniques.
http://jalammar.github.io/illustrated-word2vec/

Word2Vec

Skip-grams:
Randomly pick a word as the context, c.
Randomly pick another word within a window of c as the target, t.

“Joe Root was in the English Cricket Team”

Context     Target
‘English’   ‘Team’
‘Team’      ‘the’

A large number of (context, target) pairs is picked from a large text corpus (e.g. Wikipedia).

Paper: Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/pdf/1301.3781.pdf

Word2Vec

“Joe Root was in the English Cricket Team”

Context     Target
‘English’   ‘Team’
‘Team’      ‘cricket’
...         ...

Diagram: the context one-hot o_c is multiplied by E to give e_c, which is passed through an MLP (W) with a softmax over the vocabulary.

p(t | c) = exp(W_t⊤ e_c) / Σ_{j=1}^{Nv} exp(W_j⊤ e_c), where W_j is the output-layer weight vector for word j.

Nv is the vocabulary size, which is very large.

Word2Vec employs a hierarchical softmax to make this efficient.
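A minimal sketch of the skip-gram sampling described above, assuming a single tokenised sentence and a fixed window size (real systems draw pairs from a large corpus):

import random

sentence = "joe root was in the english cricket team".split()
window = 2                                     # assumed window size

def sample_pairs(tokens, window, n_pairs):
    pairs = []
    for _ in range(n_pairs):
        c = random.randrange(len(tokens))      # pick a context position at random
        lo, hi = max(0, c - window), min(len(tokens), c + window + 1)
        t = random.choice([j for j in range(lo, hi) if j != c])   # target within the window
        pairs.append((tokens[c], tokens[t]))
    return pairs

print(sample_pairs(sentence, window, 5))       # e.g. [('english', 'team'), ...]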

Word2Vec: Negative sampling

“Joe Root was in the English Cricket Team”

Context     Word      Target
‘English’   ‘Team’    1
‘English’   ‘apple’   0
‘English’   ‘bus’     0

Diagram: the context one-hot o_c is multiplied by E to give e_c, which is passed to a network (W) with sigmoid outputs.

p(y = 1 | c, t) = σ(W_t⊤ e_c)

Train a set of sigmoid (binary) outputs instead of the full softmax.

A more recent model:
Paper: GloVe: Global Vectors for Word Representation


https://nlp.stanford.edu/pubs/glove.pdf
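One common way to implement negative sampling in Keras is sketched below. Note this variant scores a (context, target) pair with a dot product of two learned embeddings rather than the slide's small network on e_c; the vocabulary size and embedding dimension are assumptions:

import tensorflow as tf

vocab_size, emb_dim = 20000, 300

context = tf.keras.Input(shape=(1,), dtype="int32")
target = tf.keras.Input(shape=(1,), dtype="int32")

e_c = tf.keras.layers.Embedding(vocab_size, emb_dim)(context)   # context embedding E
w_t = tf.keras.layers.Embedding(vocab_size, emb_dim)(target)    # output embedding (W_t analogue)

score = tf.keras.layers.Dot(axes=-1)([e_c, w_t])                # W_t^T e_c
score = tf.keras.layers.Flatten()(score)
prob = tf.keras.layers.Activation("sigmoid")(score)             # p(y = 1 | c, t)

model = tf.keras.Model([context, target], prob)
model.compile(optimizer="adam", loss="binary_crossentropy")     # label 1 for true pairs, 0 for negatives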


Sentiment Classification
“The process of computationally identifying and categorizing opinions expressed in
a piece of text”

Sentiment analysis models focus on:
Polarity: positive, negative, neutral.
Feelings/Emotions: angry, happy, sad, etc.
Intentions: interested vs. not interested.

x y
“My experience so far has been fantastic” Positive
“Your support team is useless” Negative


Usually the datasets are NOT very large.

Sentiment Classification

Your support team was useless Negative
x 〈1〉 x 〈2〉 x 〈3〉 x 〈4〉 x 〈5〉 y

Diagram: each one-hot vector o〈j〉 is multiplied by E to give the embedding e〈j〉; the embeddings are averaged (Mean(e)) and passed to an MLP (W) to produce ŷ.

Train for parameters E and the weights of the MLP (W) via
back-propagation.

The dataset can be too small to learn a good word embedding.
Averaging does not take the order of the words into account.
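A minimal Keras sketch of the mean-embedding model above (embedding layer, average over time, small MLP); the vocabulary size, embedding dimension and layer sizes are assumptions:

import tensorflow as tf

vocab_size, emb_dim = 20000, 300

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim),      # E
    tf.keras.layers.GlobalAveragePooling1D(),            # Mean(e)
    tf.keras.layers.Dense(64, activation="relu"),        # MLP (W)
    tf.keras.layers.Dense(1, activation="sigmoid"),      # positive / negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])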

Transfer learning with word embedding

Dataset can be too small to learn a good word embedding.

Transfer learning with word embedding:
1. Learn a word embedding using a large corpus (or download a pre-trained word embedding).
2. Transfer the embedding to the model (set E to the pre-learned word embedding) and learn the model parameters of the task (W).
3. (Optional) Fine-tune the word embedding.
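A sketch of step 2, assuming a matrix pretrained_E of shape (vocabulary size × embedding dimension) has already been loaded from a pre-trained embedding such as GloVe (random values are used here as a stand-in):

import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 20000, 300
pretrained_E = np.random.randn(vocab_size, emb_dim)   # placeholder for real GloVe vectors

embedding = tf.keras.layers.Embedding(
    vocab_size, emb_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_E),
    trainable=False,   # step 3: set to True later to fine-tune the embedding
)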

Order Matters

Does not take the order of the words into account.

Diagram: the same mean-embedding model as above (one-hot → E → Mean(e) → MLP (W) → ŷ).

“Completely lacks good food or good service.” → ?


Many-to-One RNN

“Completely lacks good food or good service.” → ?

Diagram: a many-to-one RNN. Each word is embedded via E and fed to the recurrent cell (weights Wax, Waa); the final activation is mapped through Wya to a single output ŷ〈Ty〉, which is compared with y〈Ty〉 through the loss L(W).

The RNN cell can be an LSTM or GRU.

Bi-directional layers capture future information.

Deep (stacked) models increase capacity.
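A minimal Keras sketch of a many-to-one sentiment model along these lines; the choice of a bidirectional LSTM and the layer sizes are assumptions, not fixed by the slide:

import tensorflow as tf

vocab_size, emb_dim = 20000, 300

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim, mask_zero=True),  # E (padding masked)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),         # many-to-one: only the final output is kept
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # ŷ
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])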



Sequence to Sequence

Both input and output are sequences.

English Sinhala
“How are you today” ⇔ “ada obata kohomada?”

The two sequences are not of the same length, and a word-by-word (one-to-one) translation does not work:

ada     obata   kohomada
↓       ↓       ↓
today   you     how


Encoder-Decoder RNN

How are you today ⇔ ada obata kohomada
x 〈1〉 x 〈2〉 x 〈3〉 x 〈4〉 y 〈1〉 y 〈2〉 y 〈3〉

Diagram: an encoder RNN reads x〈1〉 … x〈Tx〉 and passes its final activation to a decoder RNN, which emits ŷ〈1〉, ŷ〈2〉, … one word at a time, feeding each predicted word back in as the next decoder input (x̄〈2〉, x̄〈3〉, …).

Sequence to Sequence Learning with Neural Networks


https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
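A minimal Keras sketch of an encoder-decoder translation model trained with teacher forcing: the encoder LSTM summarises the source sentence into its final states, which initialise the decoder LSTM. Vocabulary sizes and dimensions are assumptions:

import tensorflow as tf

src_vocab, tgt_vocab, emb_dim, units = 20000, 18000, 256, 512

# Encoder: read the source sentence and keep only the final LSTM states.
enc_in = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = tf.keras.layers.Embedding(src_vocab, emb_dim)(enc_in)
_, h, c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: consume the previous target word at each step, starting from the encoder states.
dec_in = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = tf.keras.layers.Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _, _ = tf.keras.layers.LSTM(units, return_sequences=True,
                                     return_state=True)(dec_emb, initial_state=[h, c])
dec_pred = tf.keras.layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = tf.keras.Model([enc_in, dec_in], dec_pred)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")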


Machine Translation

The decoder we use in machine translation is similar to the one used for
language modelling (e.g. text generation).

In language modelling we are interested in learning:

p( y〈1〉, y〈2〉, …, y〈Ty〉 )

In machine translation we are interested in learning a conditional model:

p( y〈1〉, y〈2〉, …, y〈Ty〉 | x〈1〉, x〈2〉, …, x〈Tx〉 )

Image Captioning
We can use the same encoder-decoder architecture for image captioning. The main difference is that the encoder in this task is a CNN (not an RNN).

Image: Python based Project – Learn to Build Image Caption Generator with CNN & LSTM

If interested, a nice worked example at: Python based Project – Learn to Build Image
Caption Generator with CNN & LSTM
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)


https://data-flair.training/blogs/python-based-project-image-caption-generator-cnn/
https://arxiv.org/pdf/1412.6632.pdf


Machine Translation

In machine translation we are interested in finding the most likely translation under the learned conditional model:

arg max_{y〈1〉, y〈2〉, …} p( y〈1〉, y〈2〉, …, y〈Ty〉 | x )

Option 1: Greedy search.
Pick the word with the maximum probability as ŷ〈1〉.
Use it as the input to the next step.
Repeat until the end of the sentence.

Diagram: the decoder, feeding each predicted word back in as the next input.

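A sketch of greedy decoding, assuming a hypothetical decoder_step(prev_word, state) function that returns a probability vector over the target vocabulary together with the next decoder state:

import numpy as np

def greedy_decode(decoder_step, start_id, end_id, max_len=50):
    word, state, output = start_id, None, []
    for _ in range(max_len):
        probs, state = decoder_step(word, state)
        word = int(np.argmax(probs))          # always pick the single most likely word
        if word == end_id:
            break
        output.append(word)
    return output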

Problem with Greedy Search
Sinhalese sentence: “Ruwan November mase lankavata yanwa”
Possible English translations:

Ruwan is visiting Sri Lanka in November.
Ruwan is going to be visiting Sri Lanka in November.

The first sentence is a better translation, but greedy search might pick the second one because “going” is more common than “visiting”.


Beam Search

Instead of picking a single candidate at each time step, you can keep the best k candidates. Here k is known as the beam width.


Beam Search
At each time step you need k copies of the network (k = 2 in the example).

Ruwan is visiting Sri Lanka in
November.
Ruwan is going to be visiting Sri Lanka
in November.

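A sketch of beam search using the same hypothetical decoder_step function as in the greedy example; each hypothesis carries its running log-probability, and only the k best hypotheses survive each step:

import numpy as np

def beam_search(decoder_step, start_id, end_id, k=2, max_len=50):
    beams = [([start_id], None, 0.0)]                      # (words, state, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, state, logp in beams:
            if words[-1] == end_id:                        # finished hypothesis: carry it forward unchanged
                candidates.append((words, state, logp))
                continue
            probs, new_state = decoder_step(words[-1], state)
            for w in np.argsort(probs)[-k:]:               # expand only the k most likely next words
                candidates.append((words + [int(w)], new_state, logp + np.log(probs[w])))
        beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:k]   # keep the k best overall
    return beams[0][0]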


Why Attention?

Assume you are given the following paragraph to translate into another language:

“Seq2seq is a family of machine learning approaches used for language processing. Applications include language translation, image captioning, conversational models and text summarization.”

Would you:
Read the complete paragraph and then start translating, or
Translate one small part at a time?

How do you decide which part to look at when making each decision?


Attention

Use a small neural network to decide where to attend when making each decision.

Diagram: machine translation with a regular encoder-decoder RNN (no attention). The encoder reads x〈1〉 … x〈Tx〉 and the decoder generates ŷ〈1〉, ŷ〈2〉, … from the final encoder state alone.

Attention

Diagram: at decoder time step t, attention weights α_t〈1〉 … α_t〈Tx〉 over the encoder activations a〈1〉 … a〈Tx〉 are combined into a context vector c〈t〉, which is fed to the decoder cell that produces ŷ〈t〉.

c〈t〉 = Σ_j α_t〈j〉 a〈j〉

e_t〈j〉 = MLP(a〈j〉, b〈t−1〉)   (a small network scores each encoder activation)

α_t〈j〉 = exp(e_t〈j〉) / Σ_j exp(e_t〈j〉)

b〈t−1〉 is the cell state from the previous time step of the decoder.

Paper: Neural Machine Translation by Jointly Learning to Align and Translate
https://arxiv.org/pdf/1409.0473.pdf
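A numpy sketch of one attention step for decoder time t: a small network scores each encoder activation a〈j〉 against the previous decoder state b〈t−1〉, the scores are softmaxed into weights α, and the context c〈t〉 is their weighted sum. The random weights and dimensions are placeholders:

import numpy as np

Tx, enc_dim, dec_dim, hidden = 5, 8, 8, 16
a = np.random.randn(Tx, enc_dim)           # encoder activations a<1> ... a<Tx>
b_prev = np.random.randn(dec_dim)          # previous decoder state b<t-1>

W1 = np.random.randn(enc_dim + dec_dim, hidden)   # placeholder weights of the small scoring network
w2 = np.random.randn(hidden)

def attention_step(a, b_prev):
    inputs = np.concatenate([a, np.tile(b_prev, (len(a), 1))], axis=1)
    e = np.tanh(inputs @ W1) @ w2                       # scores e_t<j>, one per encoder step
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                         # softmax over j
    c = alpha @ a                                       # c<t> = sum_j alpha_t<j> a<j>
    return c, alpha

c_t, alpha_t = attention_step(a, b_prev)
print(alpha_t.round(3), c_t.shape)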



Visualize Attention

Diagram: attention weights (α) for an English-to-French translation.



Speech Recognition

Speech signals are usually represented as a spectrogram.

x〈t〉 is a vector whose number of elements equals the number of frequency bins in the spectrogram.
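A small sketch of computing such a spectrogram with scipy; the one-second 440 Hz sine wave is a stand-in for a real speech recording, and the frame and hop lengths are typical but assumed values:

import numpy as np
from scipy.signal import spectrogram

fs = 16000                                  # sampling rate (Hz)
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 440 * t)          # placeholder audio signal

freqs, times, Sxx = spectrogram(wave, fs=fs, nperseg=400, noverlap=240)   # 25 ms frames, 10 ms hop
print(Sxx.shape)                            # (n_frequency_bins, n_time_steps); x<t> = Sxx[:, t]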

Speech Recognition

We can use a many-to-many RNN for speech recognition (e.g. Attention
model).

Diagram: a many-to-many RNN mapping the spectrogram frames x〈1〉 … x〈Tx〉 to output characters ŷ〈1〉 … ŷ〈Ty〉.

Problem: the x vectors are generated at fixed time intervals at a high rate (e.g. 100 Hz), whereas the characters in the output text occur at a much lower rate, so Ty ≪ Tx.

Speech Recognition

Allow multiple outputs for the same text-character.

Diagram: a many-to-many RNN producing one output ŷ〈t〉 for every input frame x〈t〉 (T outputs for T inputs).

Output of the RNN: he ll l oo ∼wo rr l d
Collapse repeated characters that are not separated by the blank symbol “ ”, then remove the blanks.
CTC output: “hello world”
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks


https://www.cs.toronto.edu/~graves/icml_2006.pdf
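A sketch of the CTC collapse rule: merge runs of repeated characters, then drop the blank symbol. The underscore is used here as the blank purely for readability:

import itertools

def ctc_collapse(chars, blank="_"):
    merged = [ch for ch, _ in itertools.groupby(chars)]     # collapse runs of repeated characters
    return "".join(ch for ch in merged if ch != blank)      # then remove the blanks

rnn_output = list("hheell_lloo_ _wwoorrlldd")
print(ctc_collapse(rnn_output))                             # -> "hello world"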

Summary

How to represent the input: text and speech.
Main considerations when applying RNNs to real-world applications.
Encoder-decoder networks for sequence to sequence models.

Next week: Representation learning.

Lab: Sequence to Sequence model

