

COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

COMP90042
Natural Language Processing

Lecture 8
Semester 1 2021 Week 4

Jey Han Lau

Deep Learning for NLP:
Recurrent Networks


Outline

• Recurrent Networks

• Long Short-term Memory Networks

• Applications



N-gram Language Models

• Can be implemented using counts (with
smoothing)

• Can be implemented using feed-forward neural
networks

• Generates sentences like (trigram model):

‣ I saw a table is round and about

• Problem: limited context


Recurrent Neural Networks


Recurrent Neural Network (RNN)

• Allow representation of arbitrarily sized inputs

• Core idea: process the input sequence one element at a time, by applying a recurrence formula

• Uses a state vector to represent contexts that
have been previously processed


Recurrent Neural Network (RNN)

[Diagram: a recurring function f takes the previous state and the current input x, and produces a new state (and an output y)]

s_{i+1} = f(s_i, x_{i+1})
s_i = f(s_{i-1}, x_i)

The same recurring function f is applied at every time step.


Recurrent Neural Network (RNN)

[Diagram: the recurring function combines the previous state and the current input into a new state]

s_i = tanh(W_s s_{i-1} + W_x x_i + b)


RNN Unrolled

• The same parameters (W_s, W_x, b and W_y) are used across all time steps

[Diagram: the RNN unrolled over four time steps: s_0 and x_1 produce s_1 and y_1, s_1 and x_2 produce s_2 and y_2, and so on up to s_4 and y_4]

“Simple RNN”:
s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = σ(W_y s_i)
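As a rough illustration (not from the slides), the “Simple RNN” equations above can be coded directly; the dimensions, random parameters and variable names below are assumptions for the sketch, in Python/NumPy:

import numpy as np

# A minimal sketch of the unrolled simple RNN with toy dimensions
# and random parameters (all names and sizes here are illustrative).
rng = np.random.default_rng(0)
hidden_dim, input_dim, output_dim = 3, 2, 4
W_s = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = [rng.normal(size=input_dim) for _ in range(4)]   # x_1 .. x_4
s = np.zeros(hidden_dim)                              # s_0
for x in xs:
    s = np.tanh(W_s @ s + W_x @ x + b)                # s_i = tanh(W_s s_{i-1} + W_x x_i + b)
    y = sigmoid(W_y @ s)                              # y_i = σ(W_y s_i)
    # the same W_s, W_x, b, W_y are reused at every time step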


RNN Training

• An unrolled RNN is just a very deep neural network

• But parameters are shared across all time steps

• To train an RNN, we just need to create the unrolled computation graph given an input sequence

• Then use the backpropagation algorithm to compute gradients as usual

• This procedure is called backpropagation through time (see the sketch below)
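A minimal sketch of backpropagation through time, assuming PyTorch and a toy 4-word vocabulary; all dimensions, parameter names and values below are illustrative assumptions, not from the slides:

import torch
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim = 4, 8, 16        # toy sizes
emb = torch.nn.Embedding(vocab_size, embed_dim)
W_s = torch.nn.Parameter(0.1 * torch.randn(hidden_dim, hidden_dim))
W_x = torch.nn.Parameter(0.1 * torch.randn(hidden_dim, embed_dim))
b   = torch.nn.Parameter(torch.zeros(hidden_dim))
W_y = torch.nn.Parameter(0.1 * torch.randn(vocab_size, hidden_dim))

tokens  = torch.tensor([0, 1, 2])    # "a cow eats"
targets = torch.tensor([1, 2, 3])    # "cow eats grass"

# Unroll the computation graph over the input sequence,
# sharing the same parameters at every time step.
embedded = emb(tokens)               # (3, embed_dim), one row per word
s = torch.zeros(hidden_dim)          # s_0
loss = torch.tensor(0.0)
for x_i, y_i in zip(embedded, targets):
    s = torch.tanh(W_s @ s + W_x @ x_i + b)
    logits = W_y @ s
    loss = loss + F.cross_entropy(logits.unsqueeze(0), y_i.unsqueeze(0))

loss.backward()   # gradients flow back through every unrolled step (BPTT)

Calling backward() on the summed loss propagates gradients through every unrolled step, which is exactly backpropagation through time.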


(Simple) RNN for Language Model

• x_i is the current word (e.g. “eats”), mapped to an embedding

• s_{i-1} contains information about the previous words (“a”, “cow”)

• y_i is the next word (e.g. “grass”)

[Diagram: the unrolled RNN reads “a cow eats grass” and predicts “cow eats grass .” one word ahead at each step]

s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = softmax(W_y s_i)


RNN Language Model: Training
• Vocabulary: [a, cow, eats, grass]

• Training example: a cow eats grass

Step 1: input "a" = [1, 0, 0, 0], hidden = [1.5, 0.8, -0.2], output = [0.40, 0.30, 0.15, 0.15], target = "cow"
Step 2: input "cow" = [0, 1, 0, 0], hidden = [-1.2, -0.3, 0.4], output = [0.30, 0.10, 0.50, 0.10], target = "eats"
Step 3: input "eats" = [0, 0, 1, 0], hidden = [0.1, -1.4, 0.6], output = [0.30, 0.25, 0.25, 0.20], target = "grass"

L_1 = -log(0.30) (probability assigned to "cow")
L_2 = -log(0.50) (probability assigned to "eats")
L_3 = -log(0.20) (probability assigned to "grass")
L_total = L_1 + L_2 + L_3

s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = softmax(W_y s_i)

(see the worked computation of the total loss below)
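For concreteness, the total loss on this training example can be computed from the probabilities shown above (a small Python check, nothing beyond the slide's numbers):

import math

L1 = -math.log(0.30)    # probability assigned to "cow",   ≈ 1.20
L2 = -math.log(0.50)    # probability assigned to "eats",  ≈ 0.69
L3 = -math.log(0.20)    # probability assigned to "grass", ≈ 1.61
L_total = L1 + L2 + L3  # ≈ 3.51
print(L_total)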


RNN Language Model: Generation

Step 1: input "a" = [1, 0, 0, 0], hidden = [0.4, -0.6, 0.1], output = [0.02, 0.90, 0.03, 0.05] → generate "cow"
Step 2: input "cow" = [0, 1, 0, 0], hidden = [0.2, 0.8, -0.5], output = [0.07, 0.06, 0.95, 0.02] → generate "eats"
Step 3: input "eats" = [0, 0, 1, 0], hidden = [0.9, 0.8, 0.3], output = [0.10, 0.05, 0.13, 0.82] → generate "grass"

At each step the most probable word is generated and fed back in as the input for the next step.
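A minimal greedy-decoding sketch of this generation procedure, assuming trained parameters W_s, W_x, b, W_y and an embedding matrix E as in the earlier sketches (the names and the argmax choice are illustrative assumptions):

import numpy as np

vocab = ["a", "cow", "eats", "grass"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(E, W_s, W_x, b, W_y, start_id=0, steps=3):
    s = np.zeros(W_s.shape[0])        # s_0
    word = start_id                   # start from "a"
    out = [vocab[word]]
    for _ in range(steps):
        s = np.tanh(W_s @ s + W_x @ E[word] + b)
        probs = softmax(W_y @ s)
        word = int(np.argmax(probs))  # pick the most probable next word...
        out.append(vocab[word])       # ...and feed it back in at the next step
    return " ".join(out)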


What Are Some Potential Problems with This Generation Approach?

• Mismatch between training and decoding

• Error propagation: unable to recover from errors in intermediate steps

• Low diversity in generated language

• Tends to generate “bland” or “generic” language

PollEv.com/jeyhanlau569


Long Short-term Memory
Networks


Language Model… Solved?

• RNN has the capability to model infinite context

• But can it actually capture long-range
dependencies in practice?

• No… due to “vanishing gradients”

• Gradients diminish quickly as they are backpropagated through many time steps

• So earlier inputs do not get much update


Long Short-term Memory (LSTM)

• LSTM was introduced to solve the vanishing gradient problem

• Core idea: have “memory cells” that preserve
gradients across time

• Access to the memory cells is controlled by “gates”

• For each input, a gate decides:

‣ how much the new input should be written to the
memory cell

‣ and how much content of the current memory cell
should be forgotten


Gating Vector
• A gate g is a vector

‣ each element has a value between 0 and 1

• g is multiplied component-wise with a vector v, to determine how much information to keep from v

• A sigmoid function is used to produce g:

‣ values between 0 and 1

Example: g = [0.9, 0.1, 0.0], v = [2.5, 5.3, 1.2], g * v = [2.3, 0.5, 0.0]
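The gating example from the slide, written out in NumPy (a tiny check, nothing beyond the slide's numbers):

import numpy as np

g = np.array([0.9, 0.1, 0.0])   # gate values, each in (0, 1), produced by a sigmoid
v = np.array([2.5, 5.3, 1.2])
print(g * v)                    # [2.25 0.53 0.  ]  ≈ the rounded values on the slide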


Simple RNN vs. LSTM

[Figure: side-by-side comparison of a simple RNN cell and an LSTM cell]

https://colah.github.io/posts/2015-08-Understanding-LSTMs/


LSTM: Forget Gate

• Controls how much information to “forget” in the memory cell (C_{t-1})

• Example: The cats that the boy likes

• The memory cell was storing noun information (cats)

• The cell should now forget cats and store boy to correctly predict the singular verb likes

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
(the previous state h_{t-1} and the current input x_t are concatenated; the sigmoid produces values between 0 and 1)


LSTM: Input Gate

• The input gate controls how much new information to put into the memory cell

• C̃_t = new distilled information to be added

‣ e.g. information about boy

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)


LSTM: Update Memory Cell

• Use the forget and input gates to update the memory cell

C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
(f_t: forget gate, i_t: input gate, C̃_t: new information)


LSTM: Output Gate

• The output gate controls how much of the memory cell content to distill when creating the next state (h_t)

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)


LSTM: Summary

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
h_t = o_t ∗ tanh(C_t)
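A minimal NumPy sketch of one LSTM step implementing the six equations above; the dimensions and random parameters are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 3, 2
concat_dim = hidden_dim + input_dim            # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_f, W_i, W_o, W_C = (rng.normal(scale=0.1, size=(hidden_dim, concat_dim)) for _ in range(4))
b_f = b_i = b_o = b_C = np.zeros(hidden_dim)   # zero biases for the sketch

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forget gate
    i_t = sigmoid(W_i @ z + b_i)               # input gate
    o_t = sigmoid(W_o @ z + b_o)               # output gate
    C_tilde = np.tanh(W_C @ z + b_C)           # new candidate information
    C_t = f_t * C_prev + i_t * C_tilde         # update the memory cell
    h_t = o_t * np.tanh(C_t)                   # next state
    return h_t, C_t

# run over a toy sequence of 4 random inputs
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in [rng.normal(size=input_dim) for _ in range(4)]:
    h, C = lstm_step(h, C, x)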


What Are The Disadvantages of LSTM?

• Introduces a lot of new parameters

• Still unable to capture very long range dependencies

• Much slower than simple RNNs

• Produces inferior word embeddings

PollEv.com/jeyhanlau569


Applications


Shakespeare Generator

• Training data = all works of Shakespeare

• Model: character RNN, hidden dimension = 512

http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Wikipedia Generator

• Training data = 100MB of Wikipedia raw data

http://karpathy.github.io/2015/05/21/rnn-effectiveness/



Code Generator

http://karpathy.github.io/2015/05/21/rnn-effectiveness/



Deep-Speare

• Generates Shakespearean sonnets

https://github.com/jhlau/deepspeare



Text Classification

• RNNs can be used in a variety of NLP tasks

• Particularly suited for tasks where order of words
matter, e.g. sentiment classification

[Diagram: an RNN processes “the movie isn’t great” word by word; the final state is used for sentiment classification]
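A minimal sketch of this setup, assuming PyTorch's built-in RNN module; the dimensions, 2-class output and variable names are illustrative assumptions:

import torch

embed_dim, hidden_dim, num_classes = 50, 64, 2
rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)
classifier = torch.nn.Linear(hidden_dim, num_classes)

x = torch.randn(1, 4, embed_dim)         # embeddings for "the movie isn't great"
outputs, s_final = rnn(x)                # s_final: (1, 1, hidden_dim), the last state
logits = classifier(s_final.squeeze(0))  # classify the whole sequence from the final state
probs = torch.softmax(logits, dim=-1)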


Sequence Labeling

• Also good for sequence labelling problems, e.g.
POS tagging

[Diagram: an RNN reads “a cow eats grass” and outputs a tag at each step: DET, NOUN, VERB, NOUN]
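A minimal tagging sketch, again assuming PyTorch's built-in RNN; the dimensions and 4-tag set are illustrative assumptions:

import torch

embed_dim, hidden_dim, num_tags = 50, 64, 4   # e.g. a toy tag set containing DET, NOUN, VERB, ...
rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)
tagger = torch.nn.Linear(hidden_dim, num_tags)

x = torch.randn(1, 4, embed_dim)   # embeddings for "a cow eats grass"
states, _ = rnn(x)                 # one state per word: (1, 4, hidden_dim)
tag_logits = tagger(states)        # one tag distribution per word: (1, 4, num_tags)
predicted_tags = tag_logits.argmax(dim=-1)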


Variants
• Peephole connections

‣ allow gates to look at the cell state:
  f_t = σ(W_f · [C_{t-1}, h_{t-1}, x_t] + b_f)

• Gated recurrent unit (GRU)

‣ simplified variant with only 2 gates and no memory cell
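A GRU drop-in sketch, assuming PyTorch's built-in GRU module with illustrative dimensions; it exposes the same interface as an LSTM but keeps no separate memory cell:

import torch

gru = torch.nn.GRU(input_size=50, hidden_size=64, batch_first=True)
x = torch.randn(1, 4, 50)   # embeddings for a 4-word sequence
outputs, h_n = gru(x)       # h_n: final hidden state, shape (1, 1, 64)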


Multi-layer LSTM

[Diagram: three stacked LSTM layers: the bottom layer reads “a cow eats grass”, each layer’s states feed into the layer above, and the top layer predicts “cow eats grass .”]
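A stacked LSTM can be expressed with the num_layers argument of PyTorch's LSTM module (a sketch with illustrative dimensions, not from the slides):

import torch

lstm = torch.nn.LSTM(input_size=50, hidden_size=64, num_layers=3, batch_first=True)
x = torch.randn(1, 4, 50)      # embeddings for "a cow eats grass"
outputs, (h_n, c_n) = lstm(x)  # outputs come from the top layer
print(h_n.shape)               # (3, 1, 64): one final state per layer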


Bidirectional LSTM

[Diagram: a forward LSTM (states s_1..s_4) and a backward LSTM (states u_1..u_4) both read “a cow eats grass”; their states are combined to predict the tags DET, NN, VB, NN]

y_i = softmax(W_s [s_i, u_i])

• Uses information from both the forward and backward LSTM
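A minimal bidirectional-LSTM tagger sketch, assuming PyTorch and illustrative dimensions/tag set; the forward and backward states come out concatenated per word, matching [s_i, u_i] above:

import torch

embed_dim, hidden_dim, num_tags = 50, 64, 4
bilstm = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
tagger = torch.nn.Linear(2 * hidden_dim, num_tags)   # [s_i, u_i] has size 2 * hidden_dim

x = torch.randn(1, 4, embed_dim)    # embeddings for "a cow eats grass"
states, _ = bilstm(x)               # (1, 4, 2 * hidden_dim): forward ++ backward per word
tag_probs = torch.softmax(tagger(states), dim=-1)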


Final Words

• Pros
‣ Has the ability to capture long-range contexts
‣ Just like feedforward networks: flexible

• Cons
‣ Slower than FF networks due to sequential processing
‣ In practice doesn’t capture long-range dependencies very well (evident when generating very long text)
‣ In practice also doesn’t stack well (multi-layer LSTM)
‣ Less popular nowadays due to the emergence of more advanced architectures (Transformer; lecture 11!)


Readings

• G15, Section 10 & 11

• JM3 Ch. 9.2-9.3