COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042 Natural Language Processing
Lecture 8, Semester 1 2021, Week 4
Jey Han Lau
Deep Learning for NLP: Recurrent Networks
Outline
• Recurrent Networks
• Long Short-term Memory Networks
• Applications
N-gram Language Models
• Can be implemented using counts (with smoothing)
• Can be implemented using feed-forward neural networks
• Generates sentences like (trigram model):
‣ I saw a table is round and about
• Problem: limited context
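As a rough illustration of the limited context, a trigram model conditions on only the previous two words when choosing the next one. The sketch below is not from the lecture; the toy corpus and counts are purely hypothetical, but sampling from them wanders in the same way as the example sentence above.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus and trigram counts: (w1, w2) -> {w3: count}
corpus = "i saw a table . the table is round and about a metre wide .".split()
counts = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def generate(w1, w2, n=8):
    """Sample forward: every choice depends only on the previous two words."""
    out = [w1, w2]
    for _ in range(n):
        nxt = counts[(out[-2], out[-1])]
        if not nxt:
            break
        words, weights = zip(*nxt.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("i", "saw"))   # e.g. "i saw a table . the table is round and"
```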
Recurrent Neural Networks
Recurrent Neural Network (RNN)
• Allows representation of arbitrarily sized inputs
• Core idea: process the input sequence one element at a time, by applying a recurrence formula
• Uses a state vector to represent contexts that have been previously processed
Recurrent Neural Network (RNN)
[Diagram: a recurrent cell applies a function f to the current input x and the previous state, producing a new state and an output y]
$s_i = f(s_{i-1}, x_i)$
$s_{i+1} = f(s_i, x_{i+1})$
The same recurring function f is applied at every step.
Recurrent Neural Network (RNN)
[Diagram: the same recurrent cell, with a concrete choice of recurrence]
$s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
RNN Unrolled
• The same parameters ($W_s$, $W_x$, $b$ and $W_y$) are used across all time steps
[Diagram: the RNN unrolled over time, with states $s_0, s_1, s_2, s_3, s_4$, inputs $x_1, \dots, x_4$ and outputs $y_1, \dots, y_4$]
“Simple RNN”:
$s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
$y_i = \sigma(W_y s_i)$
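As a concrete (unofficial) sketch of the simple RNN above, the forward pass can be written in a few lines of NumPy; the dimensions and random weights below are hypothetical, chosen only to make the recurrence runnable.

```python
import numpy as np

def simple_rnn_forward(xs, W_s, W_x, b, W_y, s0):
    """Run the simple RNN over a list of input vectors xs."""
    s, states, outputs = s0, [], []
    for x in xs:
        s = np.tanh(W_s @ s + W_x @ x + b)       # s_i = tanh(W_s s_{i-1} + W_x x_i + b)
        y = 1.0 / (1.0 + np.exp(-(W_y @ s)))     # y_i = sigma(W_y s_i)
        states.append(s)
        outputs.append(y)
    return states, outputs

# Hypothetical sizes: 4-dimensional inputs, 3-dimensional hidden state, 4 outputs
rng = np.random.default_rng(0)
W_s, W_x, W_y = rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=(4, 3))
b = np.zeros(3)
xs = [rng.normal(size=4) for _ in range(5)]      # a toy sequence of 5 inputs
states, outputs = simple_rnn_forward(xs, W_s, W_x, b, W_y, s0=np.zeros(3))
```

Note that the same W_s, W_x, b and W_y are reused at every step, exactly as in the unrolled diagram.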
RNN Training
• An unrolled RNN is just a very deep neural network
• But parameters are shared across all time steps
• To train an RNN, we just need to create the unrolled computation graph given an input sequence
• And use the backpropagation algorithm to compute gradients as usual
• This procedure is called backpropagation through time
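A minimal sketch of backpropagation through time in PyTorch (not from the lecture; the sizes and toy regression targets are hypothetical): the framework builds the unrolled computation graph during the forward pass, and backward() computes gradients through every time step.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 4-dimensional inputs, 8-dimensional hidden state
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 4)                       # maps each hidden state to an output
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(1, 5, 4)                     # toy sequence: batch of 1, 5 time steps
target = torch.randn(1, 5, 4)                # toy per-step targets

states, _ = rnn(x)                           # forward pass over the whole (unrolled) sequence
loss = ((head(states) - target) ** 2).mean()
loss.backward()                              # backpropagation through time
opt.step()
```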
(Simple) RNN for Language Model
• $x_i$ is the current word (e.g. “eats”); mapped to an embedding
• $s_{i-1}$ contains information about the previous words (“a”, “cow”)
• $y_i$ is the next word (e.g. “grass”)
[Diagram: the unrolled RNN reading “a cow eats grass” and predicting the next word (“cow eats grass .”) at each step]
$s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
$y_i = \mathrm{softmax}(W_y s_i)$
RNN Language Model: Training
• Vocabulary: [a, cow, eats, grass]
• Training example: a cow eats grass
• At each time step, the one-hot input, hidden state and output distribution over [a, cow, eats, grass] are:
‣ Step 1: input a = [1, 0, 0, 0], hidden = [1.5, 0.8, -0.2], output = [0.40, 0.30, 0.15, 0.15], target = cow
‣ Step 2: input cow = [0, 1, 0, 0], hidden = [-1.2, -0.3, 0.4], output = [0.30, 0.10, 0.50, 0.10], target = eats
‣ Step 3: input eats = [0, 0, 1, 0], hidden = [0.1, -1.4, 0.6], output = [0.30, 0.25, 0.25, 0.20], target = grass
• The loss at each step is the negative log probability of the correct next word:
$L_1 = -\log 0.30$, $L_2 = -\log 0.50$, $L_3 = -\log 0.20$, $L_{total} = L_1 + L_2 + L_3$
$s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
$y_i = \mathrm{softmax}(W_y s_i)$
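The total loss in this worked example can be checked directly; a small sketch using only the numbers from the slide:

```python
import numpy as np

vocab = ["a", "cow", "eats", "grass"]

# Output distributions from the worked example, one row per time step
outputs = np.array([
    [0.40, 0.30, 0.15, 0.15],   # after reading "a"
    [0.30, 0.10, 0.50, 0.10],   # after reading "a cow"
    [0.30, 0.25, 0.25, 0.20],   # after reading "a cow eats"
])
targets = ["cow", "eats", "grass"]   # the correct next word at each step

# Per-step cross-entropy: negative log probability of the correct next word
losses = [-np.log(outputs[i, vocab.index(w)]) for i, w in enumerate(targets)]
print(losses, sum(losses))           # L1 = -log 0.30, L2 = -log 0.50, L3 = -log 0.20
```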
RNN Language Model: Generation
• Generate by feeding the word predicted at the previous step back in as the next input, picking the most probable word each time:
‣ Step 1: input a = [1, 0, 0, 0], hidden = [0.4, -0.6, 0.1], output = [0.02, 0.90, 0.03, 0.05] → cow
‣ Step 2: input cow = [0, 1, 0, 0], hidden = [0.2, 0.8, -0.5], output = [0.07, 0.06, 0.95, 0.02] → eats
‣ Step 3: input eats = [0, 0, 1, 0], hidden = [0.9, 0.8, 0.3], output = [0.10, 0.05, 0.13, 0.82] → grass
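A minimal sketch of this greedy decoding loop; `step` is a hypothetical function (not defined in the lecture) that returns the next-word distribution and the updated state for one RNN step.

```python
import numpy as np

def greedy_generate(step, vocab, start_word, s0, max_len=10):
    """Greedy decoding: feed the most probable word back in as the next input."""
    word, state, output = vocab.index(start_word), s0, [start_word]
    for _ in range(max_len):
        probs, state = step(word, state)    # step(word_id, state) -> (distribution, new state)
        word = int(np.argmax(probs))        # pick the most probable next word
        output.append(vocab[word])
        if vocab[word] == ".":              # stop at the end-of-sentence symbol
            break
    return output
```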
PollEv.com/jeyhanlau569
What Are Some Potential Problems with This Generation Approach?
• Mismatch between training and decoding
• Error propagation: unable to recover from errors in intermediate steps
• Low diversity in generated language
• Tends to generate “bland” or “generic” language
Long Short-term Memory Networks
Language Model… Solved?
• RNN has the capability to model infinite context
• But can it actually capture long-range dependencies in practice?
• No… due to “vanishing gradients”
• Gradients in later steps diminish quickly during backpropagation
• Earlier inputs do not get much update
Long Short-term Memory (LSTM)
• LSTM was introduced to solve vanishing gradients
• Core idea: have “memory cells” that preserve gradients across time
• Access to the memory cells is controlled by “gates”
• For each input, a gate decides:
‣ how much of the new input should be written to the memory cell
‣ and how much content of the current memory cell should be forgotten
Gating Vector
• A gate g is a vector
‣ each element has a value between 0 and 1
• g is multiplied component-wise with vector v, to determine how much information to keep for v
• Use the sigmoid function to produce g:
‣ values between 0 and 1
• Example: g = [0.9, 0.1, 0.0], v = [2.5, 5.3, 1.2], so g ∗ v ≈ [2.3, 0.5, 0.0]
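The slide’s numbers in one line of NumPy, showing the component-wise gating:

```python
import numpy as np

g = np.array([0.9, 0.1, 0.0])   # gate values, each between 0 and 1
v = np.array([2.5, 5.3, 1.2])   # content vector
print(g * v)                    # [2.25, 0.53, 0.0], i.e. roughly [2.3, 0.5, 0.0]
```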
Simple RNN vs. LSTM
[Diagram: a simple RNN cell compared with an LSTM cell; figures from https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
LSTM: Forget Gate
• Controls how much information to “forget” in the memory cell ($C_{t-1}$)
• Example: The cats that the boy likes
• The memory cell was storing noun information (cats)
• The cell should now forget cats and store boy to correctly predict the singular verb likes
[Diagram: the forget gate concatenates the previous state $h_{t-1}$ and the current input $x_t$, then applies a sigmoid to produce values between 0 and 1]
LSTM: Input Gate
• Input gate controls how much new information to put into the memory cell
• $\tilde{C}_t$ = new distilled information to be added
‣ e.g. information about boy
LSTM: Update Memory Cell
• Use the forget and input gates to update the memory cell:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (forget gate $f_t$, input gate $i_t$, new information $\tilde{C}_t$)
LSTM: Output Gate
• Output gate controls how much to distill the content of the memory cell to create the next state ($h_t$)
LSTM: Summary
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
$h_t = o_t * \tanh(C_t)$
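These equations translate almost line for line into code. Below is a minimal NumPy sketch of a single LSTM step (not the lecture’s implementation; the weight shapes are hypothetical).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C):
    """One LSTM step, following the summary equations above."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate new information
    C_t = f_t * C_prev + i_t * C_tilde     # update the memory cell
    h_t = o_t * np.tanh(C_t)               # distill the cell into the next state
    return h_t, C_t

# Hypothetical sizes: input 4, hidden 3 (so each weight matrix is 3 x 7)
rng = np.random.default_rng(0)
W_f, W_i, W_o, W_C = (rng.normal(size=(3, 7)) for _ in range(4))
b_f = b_i = b_o = b_C = np.zeros(3)
h, C = np.zeros(3), np.zeros(3)
h, C = lstm_step(rng.normal(size=4), h, C, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C)
```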
PollEv.com/jeyhanlau569
What Are The Disadvantages of LSTM?
• Introduces a lot of new parameters
• Still unable to capture very long-range dependencies
• Much slower than simple RNNs
• Produces inferior word embeddings
Applications
Shakespeare Generator
• Training data = all works of Shakespeare
• Model: character RNN, hidden dimension = 512
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Wikipedia Generator
• Training data = 100MB of Wikipedia raw data
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Code Generator
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Deep-Speare
• Generates Shakespearean sonnets
https://github.com/jhlau/deepspeare
Text Classification
• RNNs can be used in a variety of NLP tasks
• Particularly suited for tasks where the order of words matters, e.g. sentiment classification (see the sketch below)
[Diagram: an RNN with states $s_0, \dots, s_3$ processing the sequence “the movie isn’t great”]
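One common setup (a sketch, not necessarily the exact configuration in the diagram) runs the RNN over the word embeddings and feeds the final state into a linear classifier; all sizes and word ids below are hypothetical.

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Sentiment classifier: final RNN state -> linear layer -> class scores."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_ids):
        _, h_n = self.rnn(self.embed(word_ids))   # h_n: final state, shape (1, batch, hidden)
        return self.out(h_n.squeeze(0))           # one score per class, per sentence

# Toy usage: a batch with one 4-word sentence, ids from a hypothetical vocabulary
model = RNNClassifier(vocab_size=1000)
scores = model(torch.tensor([[12, 7, 301, 45]]))
```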
Sequence Labeling
• Also good for sequence labelling problems, e.g. POS tagging (a sketch follows below)
[Diagram: an RNN with states $s_0, \dots, s_4$ reading “a cow eats grass” and predicting the tags DET NOUN VERB NOUN]
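For tagging, the per-step states are used rather than only the final one; a sketch along the same lines (hypothetical tag set and sizes):

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """POS tagger: one tag-score vector per time step, from each RNN state."""
    def __init__(self, vocab_size, num_tags, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, word_ids):
        states, _ = self.rnn(self.embed(word_ids))   # states: (batch, seq_len, hidden)
        return self.out(states)                      # tag scores at every position

# Toy usage: "a cow eats grass" as hypothetical word ids, 4 candidate tags
tagger = RNNTagger(vocab_size=1000, num_tags=4)
tag_scores = tagger(torch.tensor([[3, 17, 42, 99]]))   # shape (1, 4, 4)
```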
Variants
• Peephole connections
‣ Allow gates to look at the cell state: $f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$
• Gated recurrent unit (GRU)
‣ Simplified variant with only 2 gates and no memory cell (see the equations below)
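For reference (not shown on the slide), the commonly used GRU formulation, written in the same notation as the LSTM summary, replaces the memory cell with an update gate $z_t$ and a reset gate $r_t$:
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$
$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$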
Multi-layer LSTM
[Diagram: three stacked LSTM layers with states $s$, $t$ and $u$, reading “a cow eats grass” and predicting “cow eats grass .”; each layer’s states feed into the layer above]
Bidirectional LSTM
[Diagram: a forward LSTM (states $s_1, \dots, s_4$) and a backward LSTM (states $u_1, \dots, u_4$) over “a cow eats grass”, predicting the tags DET NN VB NN]
$y_i = \mathrm{softmax}(W_s [s_i, u_i])$
• Uses information from both the forward and backward LSTM
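In frameworks this is typically a one-flag change; a minimal sketch (hypothetical sizes), where each position’s representation is the concatenation $[s_i, u_i]$ of forward and backward states:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 50-dim embeddings, 64-dim hidden states, 4 POS tags
embed = nn.Embedding(1000, 50)
bilstm = nn.LSTM(50, 64, batch_first=True, bidirectional=True)
out = nn.Linear(2 * 64, 4)                  # [s_i, u_i] -> tag scores

word_ids = torch.tensor([[3, 17, 42, 99]])  # "a cow eats grass" as toy ids
states, _ = bilstm(embed(word_ids))         # forward and backward states concatenated per position
tag_scores = out(states)                    # y_i = softmax(W_s [s_i, u_i]), up to the softmax
```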
Final Words
• Pros
‣ Has the ability to capture long-range contexts
‣ Just like feedforward networks: flexible
• Cons
‣ Slower than FF networks due to sequential processing
‣ In practice doesn’t capture long-range dependency very well (evident when generating very long text)
‣ In practice also doesn’t stack well (multi-layer LSTM)
‣ Less popular nowadays due to the emergence of more advanced architectures (Transformer; lecture 11!)
Readings
• G15, Section 10 & 11
• JM3 Ch. 9.2-9.3