Deep Learning for NLP: Recurrent Networks
COMP90042 Natural Language Processing, Lecture 8
Copyright 2020, The University of Melbourne
N-gram Language Models
• Can be implemented using counts (with smoothing)
• Can be implemented using feed-forward neural networks
• Generates sentences like (trigram model):
  ‣ I saw a table is round and about
• Problem: limited context
Recurrent Neural Network (RNN)
• RNNs allow representing arbitrarily sized inputs
• Core idea: process the input sequence one element at a time, by applying a recurrence formula
• Uses a state vector to represent contexts that have been previously processed
Recurrent Neural Network (RNN)
[Figure: RNN block mapping input x and the previous state to a new state and output y]
s_i = f(s_{i-1}, x_i)
  ‣ s_i: new state; s_{i-1}: previous state; x_i: input; f: a function with parameters
• One instantiation:
s_i = tanh(W_s s_{i-1} + W_x x_i + b)
RNN Unrolled
[Figure: RNN unrolled over four time steps: states s_0 → s_1 → s_2 → s_3 → s_4, inputs x_1 … x_4, outputs y_1 … y_4]
s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = σ(W_y s_i)
"Simple RNN"
• Same parameters are shared across all time steps
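To make the recurrence concrete, here is a minimal NumPy sketch of the simple RNN forward pass (not the lecture's code; the parameter names W_s, W_x, W_y, b follow the equations above, and the dimensions are illustrative):

```python
# Simple RNN forward pass: the same parameters are reused at every time step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_rnn_forward(xs, W_s, W_x, W_y, b, s0):
    """Run s_i = tanh(W_s s_{i-1} + W_x x_i + b) and y_i = sigmoid(W_y s_i) over a sequence xs."""
    s, states, outputs = s0, [], []
    for x in xs:
        s = np.tanh(W_s @ s + W_x @ x + b)   # new state from previous state and current input
        states.append(s)
        outputs.append(sigmoid(W_y @ s))     # per-step output
    return states, outputs

# Toy usage with random parameters (hidden size 3, input size 4, output size 2)
rng = np.random.default_rng(0)
W_s, W_x = rng.normal(size=(3, 3)), rng.normal(size=(3, 4))
W_y, b, s0 = rng.normal(size=(2, 3)), np.zeros(3), np.zeros(3)
states, outputs = simple_rnn_forward([rng.normal(size=4) for _ in range(4)], W_s, W_x, W_y, b, s0)
```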
RNN Training
• An unrolled RNN is just a very deep neural network
• But the same parameters are shared across many time steps
• To train an RNN, we just need to create the unrolled computation graph given an input sequence
• And use the backpropagation algorithm to compute gradients as usual
• This procedure is called backpropagation through time
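A sketch of backpropagation through time with PyTorch autograd (illustrative, not the lecture's code): unrolling the cell step by step builds the computation graph, and a single backward call computes gradients for the shared parameters across all time steps.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=8)   # simple RNN cell (tanh recurrence)
head = nn.Linear(8, 4)                           # maps a state to per-step scores
xs = torch.randn(5, 1, 4)                        # a toy sequence of 5 input vectors
targets = torch.randint(0, 4, (5, 1))            # a toy target at each step

s = torch.zeros(1, 8)
loss = 0.0
for t in range(xs.size(0)):                      # unroll: one graph node per time step
    s = cell(xs[t], s)
    loss = loss + nn.functional.cross_entropy(head(s), targets[t])

loss.backward()                                  # backpropagation through time
print(cell.weight_hh.grad.shape)                 # gradient for the shared recurrent weights
```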
(Simple) RNN for Language Model
[Figure: RNN unrolled over the words "a cow eats grass", predicting "cow eats grass ." — each state is computed from the previous state and the current word, and each output is the next word]
s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = softmax(W_y s_i)
• Input words are mapped to embeddings
• Output = next word
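A minimal RNN language model in PyTorch mirroring this slide (embedding → simple RNN → softmax over the next word); the class name and dimensions are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # input words -> embeddings
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # tanh recurrence
        self.out = nn.Linear(hidden_dim, vocab_size)                # scores for the next word

    def forward(self, word_ids, state=None):
        emb = self.embed(word_ids)              # (batch, seq_len, embed_dim)
        states, state = self.rnn(emb, state)    # (batch, seq_len, hidden_dim)
        return self.out(states), state          # logits; softmax is applied in the loss

model = RNNLanguageModel(vocab_size=4)          # e.g. vocabulary [a, cow, eats, grass]
logits, _ = model(torch.tensor([[0, 1, 2]]))    # "a cow eats" -> a prediction at each step
```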
RNN Language Model – Training
• Vocabulary: [a, cow, eats, grass]
• Training example: a cow eats grass
s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = softmax(W_y s_i)
[Figure: each input word ("a", "cow", "eats") is fed as a one-hot vector, producing a hidden state and an output distribution over the vocabulary; the target outputs are the next words "cow", "eats", "grass"]
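A sketch of one training step for the example "a cow eats grass" (illustrative; it reuses the hypothetical RNNLanguageModel class from the sketch above): the loss compares each step's predicted distribution with the actual next word.

```python
import torch
import torch.nn as nn

vocab = {"a": 0, "cow": 1, "eats": 2, "grass": 3}
inputs = torch.tensor([[vocab["a"], vocab["cow"], vocab["eats"]]])       # a cow eats
targets = torch.tensor([[vocab["cow"], vocab["eats"], vocab["grass"]]])  # cow eats grass

model = RNNLanguageModel(vocab_size=len(vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

optimizer.zero_grad()
logits, _ = model(inputs)                                   # (1, 3, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
loss.backward()                                             # backpropagation through time
optimizer.step()
```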
RNN Language Model – Generation
[Figure: generation step by step — feed "a" as a one-hot input and the output distribution puts most of its mass (0.90) on "cow"; feed the generated "cow" back in and the distribution favours "eats" (0.95); feed "eats" and the distribution favours "grass" (0.82)]
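A sketch of greedy generation with the hypothetical model above: feed a word, pick the most probable next word, and feed that word back in as the next input.

```python
import torch

id2word = ["a", "cow", "eats", "grass"]
word = torch.tensor([[0]])                        # start from "a"
state, generated = None, ["a"]

model.eval()
with torch.no_grad():
    for _ in range(3):
        logits, state = model(word, state)        # distribution over the next word
        next_id = logits[0, -1].argmax().item()   # greedy: take the argmax
        generated.append(id2word[next_id])
        word = torch.tensor([[next_id]])          # feed the generated word back in

print(" ".join(generated))                        # e.g. "a cow eats grass"
```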
Language Model… Solved?
• RNN has the capability to model infinite context
• But can it actually capture long-range dependencies in practice?
• No… due to "vanishing gradients"
• Gradients in later steps diminish quickly during backpropagation
• Earlier inputs do not get much update
Long Short-term Memory (LSTM)
• LSTM is introduced to solve vanishing gradients
• Core idea: have "memory cells" that preserve gradients across time
• Access to the memory cells is controlled by "gates"
• For each input, a gate decides:
  ‣ how much the new input should be written to the memory cell
  ‣ and how much content of the current memory cell should be forgotten
Long Short-term Memory (LSTM)
• A gate g is a vector
  ‣ each element has a value between 0 and 1
• g is multiplied component-wise with a vector v, to determine how much information to keep for v
• Use the sigmoid function to keep the values of g close to either 0 or 1
• Example: g = [0.9, 0.1, 0.0], v = [2.5, 5.3, 1.2], g ∗ v = [2.3, 0.5, 0.0]
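The gating example from this slide in NumPy: the gate keeps most of the first component of v, a little of the second, and none of the third.

```python
import numpy as np

g = np.array([0.9, 0.1, 0.0])   # gate values in [0, 1], e.g. produced by a sigmoid
v = np.array([2.5, 5.3, 1.2])   # information vector
print(g * v)                    # [2.25 0.53 0.  ], shown rounded on the slide as [2.3, 0.5, 0.0]
```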
LSTM vs. Simple RNN
[Figure: side-by-side diagrams of the simple RNN cell and the LSTM cell]
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM: Forget Gate
[Figure: the forget gate f_t is computed from the previous state h_{t-1} and the current input x_t]
• Controls how much information to "forget" in the memory cell (C_{t-1})
• Example: The cats that the boy likes
  ‣ the memory cell was storing information about the subject noun (cats)
  ‣ the cell should now forget cats and store boy to correctly predict the singular verb likes
LSTM: Input Gate
• Input gate controls how much new information to put into the memory cell
• C̃_t = new distilled information to be added
  ‣ e.g. information about boy
LSTM: Update Memory Cell
• Use the forget gate and input gate to update the memory cell:
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
LSTM: Output Gate
• Output gate controls how much to distill the content of the memory cell to create the next state (h_t)
LSTM: Summary
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
h_t = o_t ∗ tanh(C_t)
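A direct NumPy transcription of the equations above (a sketch, not the lecture's code): one step maps (h_prev, C_prev, x) to (h, C), with each W acting on the concatenation [h_{t-1}, x_t].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C):
    hx = np.concatenate([h_prev, x])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)           # forget gate
    i = sigmoid(W_i @ hx + b_i)           # input gate
    o = sigmoid(W_o @ hx + b_o)           # output gate
    C_tilde = np.tanh(W_C @ hx + b_C)     # candidate ("distilled") content
    C = f * C_prev + i * C_tilde          # update memory cell
    h = o * np.tanh(C)                    # next state
    return h, C
```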
Variants
• Peephole connections
  ‣ allow gates to look at the cell state
• Gated recurrent unit (GRU)
  ‣ simplified variant with only 2 gates
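For reference, a NumPy sketch of a GRU cell under its standard formulation (the GRU equations are not given on the slide, and biases are omitted for brevity): its two gates are the update gate z and the reset gate r, and there is no separate memory cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, W_z, W_r, W_h):
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                                      # update gate
    r = sigmoid(W_r @ hx)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                      # interpolate old and new state
```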
Multi-layer LSTM
[Figure: three stacked LSTM layers unrolled over "a cow eats grass" — the hidden states of each layer (s, t, u) are the inputs to the layer above, and the top layer predicts the next word ("cow eats grass .")]
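A sketch of a stacked (multi-layer) LSTM with PyTorch: num_layers=3 feeds each layer's hidden states into the layer above, as in the figure; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=3, batch_first=True)
emb = torch.randn(1, 4, 32)           # embeddings for a 4-word sequence, e.g. "a cow eats grass"
top_states, (h_n, c_n) = lstm(emb)    # top_states: the top layer's state at every step
print(top_states.shape, h_n.shape)    # (1, 4, 64) and (3, 1, 64): one final state per layer
```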
Bidirectional LSTM
[Figure: a forward LSTM (states s_1 … s_4) and a backward LSTM (states u_1 … u_4) run over "a cow eats grass"; the two states at each position are concatenated to predict the POS tags DET NN VB NN]
y_i = softmax(W_s [s_i, u_i])
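A sketch of the bidirectional tagger in PyTorch: bidirectional=True runs a forward and a backward LSTM, and their states at each position are concatenated before the softmax; the tag set and dimensions are illustrative.

```python
import torch
import torch.nn as nn

num_tags = 4                                         # e.g. DET, NN, VB, ...
bilstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)
tagger = nn.Linear(2 * 64, num_tags)                 # acts on the concatenation [s_i, u_i]

emb = torch.randn(1, 4, 32)                          # embeddings for "a cow eats grass"
states, _ = bilstm(emb)                              # (1, 4, 128): forward and backward states concatenated
tag_scores = tagger(states)                          # one tag distribution per word (softmax in the loss)
```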
Shakespeare Generator
• Training data = all works of Shakespeare
• Model: 3-layer character RNN, hidden dimension = 512
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Wikipedia Generator
• Training data = 100MB of raw Wikipedia data
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Code Generator
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Deep-Speare
• Generates Shakespearean sonnets
https://github.com/jhlau/deepspeare
Text Classification
• Recurrent networks can be used in a variety of NLP tasks
• Particularly suited for tasks where the order of words matters, e.g. sentiment classification
[Figure: RNN run over "the movie isn't great"; the final state is used to classify the sentence]
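A sketch of RNN-based text classification (illustrative): run an LSTM over the sentence and classify from the final state, e.g. positive vs. negative sentiment.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
classifier = nn.Linear(64, 2)                # two classes: negative / positive

emb = torch.randn(1, 4, 32)                  # embeddings for "the movie isn't great"
_, (h_final, _) = rnn(emb)                   # the final state summarises the whole sequence
logits = classifier(h_final[-1])             # (1, 2); softmax applied in the loss
```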
Sequence Labeling
• RNNs work particularly well for sequence labelling problems, e.g. POS tagging
[Figure: RNN run over "a cow eats grass", with one tag predicted from each state — DET NOUN VERB NOUN]
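A sketch of sequence labelling with an RNN (illustrative): unlike text classification, a tag is predicted from the state at every position.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
tag_layer = nn.Linear(64, 4)                 # e.g. tags DET, NOUN, VERB, ...

emb = torch.randn(1, 4, 32)                  # embeddings for "a cow eats grass"
states, _ = rnn(emb)                         # one state per word
tag_scores = tag_layer(states)               # (1, 4, num_tags): one prediction per word
```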
Final Words
• Pros
  ‣ Has the ability to capture long-range contexts
  ‣ Excellent generalisation
  ‣ Just like feedforward networks: flexible, so it can be used for all sorts of tasks
  ‣ Common component in a number of NLP tasks
• Cons
  ‣ Slower than feedforward networks due to sequential processing
  ‣ In practice still doesn't capture long-range dependencies very well (evident when generating long text)
Readings
• G15, Sections 10 & 11