Deep Learning for NLP: Recurrent Networks
COMP90042 Natural Language Processing
Lecture 8, Semester 1 2021, Week 4
Jey Han Lau
Copyright 2021, The University of Melbourne

Outline
• Recurrent Networks
• Long Short-term Memory Networks
• Applications

N-gram Language Models
• Can be implemented using counts (with smoothing)
• Can be implemented using feed-forward neural networks
• Generate sentences one word at a time, e.g. (trigram model):
  ‣ I saw a table is round and about
• Problem: limited context

Recurrent Neural Networks

Recurrent Neural Network (RNN)
• Core idea: process the input sequence one element at a time, by applying a recurrence formula
• Uses a state vector to represent contexts that have been previously processed
• Allows representation of arbitrarily sized inputs

Recurrent Neural Network (RNN)
• A recurring function f combines the previous state and the current input to produce the new state:
  si = f(si−1, xi)
  si+1 = f(si, xi+1)
• The same recurring function f is applied at every step

Recurrent Neural Network (RNN)
• A simple instantiation: the new state is a non-linear function of the previous state and the current input
  si = tanh(Ws si−1 + Wx xi + b)
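As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this recurrence; the dimensions and variable names are assumptions chosen for the example.

import numpy as np

def rnn_step(s_prev, x, Ws, Wx, b):
    """One step of the simple RNN recurrence: s_i = tanh(Ws s_{i-1} + Wx x_i + b)."""
    return np.tanh(Ws @ s_prev + Wx @ x + b)

# assumed toy sizes: hidden dimension 4, input (embedding) dimension 3
rng = np.random.default_rng(0)
Ws = rng.normal(size=(4, 4))
Wx = rng.normal(size=(4, 3))
b = np.zeros(4)

s = np.zeros(4)                      # initial state s0
for x in rng.normal(size=(5, 3)):    # a sequence of 5 input vectors
    s = rnn_step(s, x, Ws, Wx, b)    # the same parameters are reused at every step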

RNN Unrolled
[Figure: the RNN unrolled over four time steps; inputs x1…x4 update states s0…s4 and produce outputs y1…y4]
  si = tanh(Ws si−1 + Wx xi + b)
  yi = σ(Wy si)
• The same parameters (Ws, Wx, b and Wy) are used across all time steps
• This model is often called a “simple RNN”

RNN Training
• An unrolled RNN is just a very deep neural network
• But parameters are shared across all time steps
• To train an RNN, we just need to create the unrolled computation graph given an input sequence
• And use the backpropagation algorithm to compute gradients as usual
• This procedure is called backpropagation through time
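A minimal PyTorch-style sketch of backpropagation through time (an illustrative assumption, not code from the lecture): unroll the recurrence in a Python loop, accumulate a loss, and call backward() once; autograd then propagates gradients through every unrolled step. The toy regression targets exist only to give us a loss to differentiate.

import torch

hidden, emb = 4, 3
Ws = torch.randn(hidden, hidden, requires_grad=True)
Wx = torch.randn(hidden, emb, requires_grad=True)
b = torch.zeros(hidden, requires_grad=True)

xs = torch.randn(5, emb)          # a toy input sequence of 5 embeddings
targets = torch.randn(5, hidden)  # toy targets, just so there is a loss

s = torch.zeros(hidden)
loss = 0.0
for x, t in zip(xs, targets):     # build the unrolled computation graph over time
    s = torch.tanh(Ws @ s + Wx @ x + b)
    loss = loss + ((s - t) ** 2).mean()

loss.backward()                   # backpropagation through time: gradients flow back
                                  # through every step into the shared Ws, Wx, b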

(Simple) RNN for Language Model
[Figure: the RNN unrolled over the input words “a cow eats grass”; at each step the output is the predicted next word (“cow”, “eats”, “grass”, “.”)]
  si = tanh(Ws si−1 + Wx xi + b)
  yi = softmax(Wy si)
• xi is the current word (e.g. “eats”), mapped to an embedding
• si−1 contains information about the previous words (“a”, “cow”)
• yi is a distribution over the next word (e.g. “grass”)

RNN Language Model: Training
• Vocabulary: [a, cow, eats, grass]
• Training example: “a cow eats grass”
[Figure: the network unrolled over the inputs “a”, “cow”, “eats”; each input word is a one-hot vector over the vocabulary, mapped through the hidden layer to a softmax output distribution, from which the probability of the correct next word (“cow”, “eats”, “grass”) is read off]
  si = tanh(Ws si−1 + Wx xi + b)
  yi = softmax(Wy si)
• The loss at each step is the negative log probability assigned to the correct next word:
  L1 = −log P(cow | a), L2 = −log P(eats | a cow), L3 = −log P(grass | a cow eats)
• Total loss: Ltotal = L1 + L2 + L3
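A hedged sketch of how this training loss might be computed for the toy vocabulary; the module choices and sizes are assumptions for illustration, not the lecture's code.

import torch
import torch.nn as nn

vocab = ["a", "cow", "eats", "grass"]
emb_dim, hidden_dim = 8, 16

embedding = nn.Embedding(len(vocab), emb_dim)
rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)   # simple (tanh) RNN
out = nn.Linear(hidden_dim, len(vocab))                # Wy
loss_fn = nn.CrossEntropyLoss()                        # softmax + negative log likelihood

# training example "a cow eats grass": inputs are the first three words,
# targets are the same words shifted by one position
inputs = torch.tensor([[0, 1, 2]])    # a cow eats
targets = torch.tensor([[1, 2, 3]])   # cow eats grass

states, _ = rnn(embedding(inputs))            # one hidden state per input word
logits = out(states)                          # one score per vocabulary word per step
loss = loss_fn(logits.view(-1, len(vocab)), targets.view(-1))  # L1 + L2 + L3 (averaged)
loss.backward()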

RNN Language Model: Generation
[Figure: generation with the trained model; starting from “a”, the word with the highest output probability is selected at each step (“cow”, then “eats”, then “grass”) and fed back in as the input for the next step]
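A minimal greedy generation loop under the same assumed PyTorch setup as the training sketch above (illustrative only): the previous prediction is fed back in as the next input.

import torch

def generate(embedding, rnn, out, start_id, max_len=10):
    """Greedy decoding: pick the argmax word at each step and feed it back in."""
    word = torch.tensor([[start_id]])   # shape (1, 1): batch of one, one time step
    state = None
    generated = [start_id]
    for _ in range(max_len):
        output, state = rnn(embedding(word), state)   # carry the hidden state forward
        next_id = out(output[:, -1]).argmax(dim=-1)   # most probable next word
        generated.append(next_id.item())
        word = next_id.unsqueeze(0)                   # becomes the next input
    return generated

# e.g. generate(embedding, rnn, out, start_id=0) with the modules defined above

Sampling from the softmax distribution instead of taking the argmax is one common way to make the generated text more diverse.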

What Are Some Potential Problems with This Generation Approach?
• Mismatch between training and decoding
• Error propagation: unable to recover from errors in intermediate steps
• Low diversity in generated language
• Tends to generate “bland” or “generic” language
PollEv.com/jeyhanlau569


Long Short-term Memory Networks

Language Model… Solved?
• RNN has the capability to model infinite context
• But can it actually capture long-range dependencies in practice?
• No… due to “vanishing gradients”
• Gradients in later steps diminish quickly during backpropagation
• Earlier inputs do not get much update

Long Short-term Memory (LSTM)
• LSTM is introduced to solve vanishing gradients
• Core idea: have “memory cells” that preserve gradients across time
• Access to the memory cells is controlled by “gates”
• For each input, a gate decides:
  ‣ how much of the new input should be written to the memory cell
  ‣ and how much content of the current memory cell should be forgotten

Gating Vector
• A gate g is a vector
  ‣ each element has a value between 0 and 1
• g is multiplied component-wise with a vector v, to determine how much information to keep from v
• Use the sigmoid function to produce g:
  ‣ values between 0 and 1
• Example: g = [0.9, 0.1, 0.0], v = [2.5, 5.3, 1.2], so g ∗ v = [2.3, 0.5, 0.0] (rounded)
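The same idea in a few lines of NumPy (an illustrative sketch; the vectors are the toy values from the slide, and the pre-sigmoid inputs are chosen so the gate comes out roughly as [0.9, 0.1, 0.0]):

import numpy as np

v = np.array([2.5, 5.3, 1.2])                        # candidate information
g = 1 / (1 + np.exp(-np.array([2.2, -2.2, -9.0])))   # sigmoid gives values in (0, 1): ~[0.9, 0.1, 0.0]
kept = g * v                                         # component-wise gating: ~[2.3, 0.5, 0.0]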

Simple RNN vs. LSTM
[Figure: diagrams of the repeating module in a simple RNN vs. in an LSTM]
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTM: Forget Gate
[Figure: the forget gate ft; the previous state ht−1 and the current input xt are concatenated and passed through a sigmoid to produce values between 0 and 1]
• Controls how much information to “forget” in the memory cell (Ct−1)
• Example: “The cats that the boy likes …”
  ‣ The memory cell was storing noun information (cats)
  ‣ The cell should now forget cats and store boy to correctly predict the singular verb likes

LSTM: Input Gate
• The input gate controls how much new information to put into the memory cell
• C̃t = new distilled information to be added
  ‣ e.g. information about boy

LSTM: Update Memory Cell
• Use the forget and input gates to update the memory cell:
  Ct = ft ∗ Ct−1 + it ∗ C̃t
  (the forget gate scales the old cell content; the input gate scales the new information)

LSTM: Output Gate
• The output gate controls how much of the memory cell content to distill into the next state (ht)

LSTM: Summary
  ft = σ(Wf · [ht−1, xt] + bf)
  it = σ(Wi · [ht−1, xt] + bi)
  ot = σ(Wo · [ht−1, xt] + bo)
  C̃t = tanh(WC · [ht−1, xt] + bC)
  Ct = ft ∗ Ct−1 + it ∗ C̃t
  ht = ot ∗ tanh(Ct)
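A direct NumPy transcription of these equations for a single time step (a sketch with assumed variable names; initialisation and batching are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(h_prev, C_prev, x, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C):
    """One LSTM step, following the equations above."""
    z = np.concatenate([h_prev, x])         # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)              # forget gate
    i = sigmoid(W_i @ z + b_i)              # input gate
    o = sigmoid(W_o @ z + b_o)              # output gate
    C_tilde = np.tanh(W_C @ z + b_C)        # new candidate information
    C = f * C_prev + i * C_tilde            # update memory cell
    h = o * np.tanh(C)                      # new state
    return h, C

Each weight matrix here has shape (hidden_size, hidden_size + input_size), since it acts on the concatenation [ht−1, xt].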

What Are The Disadvantages of LSTM?
• Introduces a lot of new parameters
• Still unable to capture very long-range dependencies
• Much slower than simple RNNs
• Produces inferior word embeddings
PollEv.com/jeyhanlau569


Applications

Shakespeare Generator
• Training data = all works of Shakespeare
• Model: character-level RNN, hidden dimension = 512
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Wikipedia Generator
• Training data = 100MB of raw Wikipedia data
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Code Generator
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Deep-Speare
• Generates Shakespearean sonnets
https://github.com/jhlau/deepspeare


Text Classification
• RNNs can be used in a variety of NLP tasks
• Particularly suited for tasks where the order of words matters, e.g. sentiment classification
[Figure: an RNN run over “the movie isn’t great”; the final state feeds the classifier]
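A hedged PyTorch sketch of an LSTM text classifier that predicts from the final state (the module and size choices are assumptions for illustration, not the lecture's code):

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_ids):                  # word_ids: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embedding(word_ids))
        return self.classifier(h_n[-1])           # classify from the final state

model = LSTMClassifier(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 5)))   # two toy sentences of 5 word ids each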

Sequence Labeling
• Also good for sequence labelling problems, e.g. POS tagging
[Figure: an RNN over “a cow eats grass”, producing a tag at every step: DET, NOUN, VERB, NOUN]
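For tagging, the per-step outputs are used rather than only the final state; a minimal sketch under the same assumptions as the classifier above:

import torch
import torch.nn as nn

vocab_size, num_tags, emb_dim, hidden_dim = 10000, 4, 50, 64
embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
tagger = nn.Linear(hidden_dim, num_tags)

word_ids = torch.randint(0, vocab_size, (1, 4))   # e.g. "a cow eats grass"
states, _ = lstm(embedding(word_ids))             # one state per word
tag_scores = tagger(states)                       # one tag distribution per word
predicted_tags = tag_scores.argmax(dim=-1)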

Variants
• Peephole connections
  ‣ Allow gates to look at the cell state: ft = σ(Wf · [Ct−1, ht−1, xt] + bf)
• Gated recurrent unit (GRU)
  ‣ Simplified variant with only 2 gates and no memory cell

Multi-layer LSTM
[Figure: three stacked LSTM layers over the input “a cow eats grass”; the states of the first layer (s0…s4) feed the second layer (t0…t4), whose states feed the third layer (u0…u4), which predicts the next word at each step]
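In common toolkits, stacking layers like this is usually a one-line configuration change; for example, in the assumed PyTorch setup used above:

import torch.nn as nn

# three stacked LSTM layers: the hidden states of each layer are the inputs to the next
stacked_lstm = nn.LSTM(input_size=50, hidden_size=64, num_layers=3, batch_first=True)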

Bidirectional LSTM
[Figure: a forward LSTM (states s0…s4) and a backward LSTM (states u0…u4) run over “a cow eats grass”; their states at each word are combined to predict the tags DET, NN, VB, NN]
  yi = softmax(Ws [si, ui])
• Uses information from both the forward and backward LSTM
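A hedged sketch of a bidirectional tagger (again assuming PyTorch): the forward and backward states are concatenated in the per-word output, so the projection layer takes twice the hidden size.

import torch
import torch.nn as nn

vocab_size, num_tags, emb_dim, hidden_dim = 10000, 4, 50, 64
embedding = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
tagger = nn.Linear(2 * hidden_dim, num_tags)      # acts on [s_i, u_i], forward + backward states

word_ids = torch.randint(0, vocab_size, (1, 4))   # e.g. "a cow eats grass"
states, _ = bilstm(embedding(word_ids))           # per-word states from both directions
tag_scores = tagger(states)                       # per-word tag scores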

Final Words
• Pros
  ‣ Has the ability to capture long-range contexts
  ‣ Just like feed-forward networks: flexible
• Cons
  ‣ Slower than feed-forward networks due to sequential processing
  ‣ In practice doesn’t capture long-range dependencies very well (evident when generating very long text)
  ‣ In practice also doesn’t stack well (multi-layer LSTM)
  ‣ Less popular nowadays due to the emergence of more advanced architectures (Transformer; lecture 11!)

Readings
• G15, Section 10 & 11
• JM3 Ch. 9.2-9.3