Contextual Representation
COMP90042
Natural Language Processing Lecture 11
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
Word Vectors/Embeddings
• Each word type has one representation
  ‣ Word2Vec
• Always the same representation regardless of the context of the word
• Does not capture multiple senses of words
• Contextual representation = representation of words based on context
• Pre-trained contextual representations work really well for downstream applications!
RNN Language Model
[Figure: an RNN language model unrolled over the input "a cow eats grass", where each hidden state $s_i$ is used to predict the next word ("cow", "eats", "grass", ".").]
RNN Language Model
[Figure: the same RNN language model, showing the vectors at each stage for the input "a cow eats". Each input word is a one-hot vector mapped to a word embedding; the RNN hidden state $s_i$ is a contextual word embedding; the output $y_i$ is a probability distribution over the next word.]

$s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
$y_i = \mathrm{softmax}(s_i)$
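To make the recurrence concrete, here is a minimal PyTorch sketch of an RNN language model in the spirit of the equations above. The class name, dimensions and the explicit output projection are illustrative choices, not the lecture's exact model (the slide applies softmax to $s_i$ directly).

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Sketch of an RNN LM: s_i = tanh(W_s s_{i-1} + W_x x_i + b)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # one-hot index -> word embedding
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True, nonlinearity="tanh")
        self.out = nn.Linear(hidden_dim, vocab_size)       # hidden state -> next-word scores

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq, embed_dim)
        states, _ = self.rnn(x)          # states[:, i] is the contextual embedding s_i
        logits = self.out(states)        # softmax over these gives y_i
        return logits, states
```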
Solved?
• Almost, but the contextual representation only captures context to the left
• Solution: use a bidirectional RNN instead!
Bidirectional RNN

[Figure: a bidirectional RNN over "a cow eats grass .": a forward RNN (RNN1, states $s_0 \dots s_4$) reads left to right and a backward RNN (RNN2, states $u_0 \dots u_4$) reads right to left; the pair $[s_i, u_i]$ serves as the contextual word embedding.]

$y_i = \mathrm{softmax}([s_i, u_i])$
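A hedged sketch of the bidirectional variant, using PyTorch's bidirectional RNN so each position's representation is the concatenation $[s_i, u_i]$; names and sizes are again illustrative, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Sketch: concatenate forward state s_i and backward state u_i per token."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.RNN(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)   # y_i from [s_i ; u_i]

    def forward(self, token_ids):
        x = self.embed(token_ids)
        states, _ = self.birnn(x)          # (batch, seq, 2*hidden_dim): [s_i ; u_i]
        return self.out(states), states    # states serve as contextual word embeddings
```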
ELMo
ELMo: Embeddings from Language Models
• Peters et al. (2018): https://arxiv.org/abs/1802.05365v2
• Trains a bidirectional, multi-layer LSTM language model over a 1B word corpus
• Combines hidden states from multiple layers of the LSTM for downstream tasks (see the sketch below)
  ‣ Prior studies use only top layer information
• Improves task performance significantly!
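The layer-combination idea can be sketched as a small "scalar mix" module: softmax-normalised weights over the LSTM layers plus a global scale. This is an illustrative reimplementation of the idea, not ELMo's actual code; the shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Sketch of ELMo-style layer mixing: a task-specific weighted sum of layer states."""
    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # softmax-normalised weights
        self.gamma = nn.Parameter(torch.ones(1))                    # overall scaling factor

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq, dim) hidden states from the biLSTM LM
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed    # contextual embedding fed to the downstream model
```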
ELMo
• Number of LSTM layers = 2
• LSTM hidden dimension = 4096
• Character convolutional networks (CNN) to create word embeddings
  ‣ No unknown words
  ‣ https://www.aclweb.org/anthology/P16-1101.pdf
Extracting Contextual Representation
http://jalammar.github.io/illustrated-bert/
Downstream Task: POS Tagging

[Figure: an RNN tagger over "let's stick to improvisation", predicting VERB for "stick"; the ELMo contextual embedding for "stick" is concatenated (⊕) with its word embedding at the input.]

$s_i = \tanh(W_s s_{i-1} + W_x (x_i \oplus e_i) + b)$
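One possible way to wire this up in PyTorch, assuming the ELMo vectors $e_i$ have already been computed (e.g. by a scalar mix as above); the module and argument names are made up for illustration.

```python
import torch
import torch.nn as nn

class ELMoTagger(nn.Module):
    """Sketch: RNN POS tagger whose input is [word embedding ; ELMo embedding]."""
    def __init__(self, vocab_size, embed_dim, elmo_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim + elmo_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids, elmo_embeddings):
        # elmo_embeddings: (batch, seq, elmo_dim), precomputed contextual vectors e_i
        x = torch.cat([self.embed(token_ids), elmo_embeddings], dim=-1)   # x_i ⊕ e_i
        states, _ = self.rnn(x)
        return self.out(states)   # tag scores per token (softmax gives tag probabilities)
```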
How Good is ELMo?
• SQuAD: QA
• SNLI: textual entailment
• SRL: semantic role labelling
• Coref: coreference resolution
• NER: named entity recognition
• SST-5: sentiment analysis
Other Findings
• Lower layer representation = captures syntax
  ‣ good for POS tagging, NER
• Higher layer representation = captures semantics
  ‣ good for QA, textual entailment, sentiment analysis
Contextual vs. Non-contextual
Disadvantages of RNNs
• Sequential processing: difficult to scale to very large corpora or models
• RNN language models run left to right (captures only one side of the context)
  ‣ Produces well-formed sentence probabilities
• Bidirectional RNNs help, but they only capture surface bidirectional representations
BERT
BERT: Bidirectional Encoder Representations from Transformers
• Devlin et al. (2019): https://arxiv.org/abs/1810.04805
• Uses self-attention networks (aka Transformers) to capture dependencies between words
  ‣ No sequential processing
• Masked language model objective to capture deep bidirectional representations
• Loses the ability to generate language
  ‣ Not an issue if the goal is to learn contextual representations
Objective 1: Masked Language Model
• "Mask" out k% of tokens at random
• Objective: predict the masked words

Example: "Today we have a [MASK] on contextual representations [MASK] it's interesting"
(masked words: lecture, and)
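A toy Python sketch of the masking step described above; it implements only the basic "mask k% at random" idea from the slide (BERT's full recipe also sometimes keeps or substitutes the token, which is omitted here).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly mask a fraction of tokens; the model must predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)      # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)     # no loss computed here
    return masked, targets

# Example
tokens = "today we have a lecture on contextual representations and it 's interesting".split()
print(mask_tokens(tokens))
```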
Objective 2: Next Sentence Prediction
• Learn relationships between sentences
• Predicts whether sentence B follows sentence A
• Useful pre-training objective for downstream applications that analyse sentence pairs (e.g. textual entailment)

Sentence A: Today we have a lecture on NLP.
Sentence B: It is an interesting lecture.
Label: IsNextSentence

Sentence A: Today we have a lecture on NLP.
Sentence B: Polar bears are white.
Label: NotNextSentence
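One plausible way to sample training pairs for this objective, assuming sentences are available in document order; the 50/50 positive/negative split is an assumption about the usual setup, not a detail given on the slide.

```python
import random

def make_nsp_example(sentences, index):
    """Build one (sentence A, sentence B, label) pair for next-sentence prediction.
    Assumes `sentences` is a list of sentences in document order."""
    sent_a = sentences[index]
    if index + 1 < len(sentences) and random.random() < 0.5:
        return sent_a, sentences[index + 1], "IsNextSentence"
    # negative example: pair A with a sentence that is not its true successor
    candidates = [s for i, s in enumerate(sentences) if i != index + 1]
    sent_b = random.choice(candidates)
    return sent_a, sent_b, "NotNextSentence"
```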
Training/Model Details
• WordPiece (subword) tokenisation
• Multiple layers of Transformers to learn contextual representations
• Models are trained on Wikipedia + BookCorpus
• Training takes multiple GPUs over several days
Fine-Tuning for BERT
http://jalammar.github.io/illustrated-bert/
Example: Spam Detection
http://jalammar.github.io/illustrated-bert/
During fine-tuning, parameters of the whole network are updated!
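As an illustration (not part of the lecture), here is one way this could look using the Hugging Face transformers library: a pre-trained BERT encoder with a fresh classification head, where the optimiser covers every parameter, so gradients update the whole network and not just the classifier.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT encoder + a new classification layer on top of [CLS]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # ALL parameters are fine-tuned

batch = tokenizer(["win a free prize now!!!", "meeting moved to 3pm"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])                  # 1 = spam, 0 = not spam

outputs = model(**batch, labels=labels)
outputs.loss.backward()                        # gradients reach every layer of BERT
optimizer.step()
```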
BERT vs. ELMo
• ELMo provides only the contextual representations
• Downstream applications have their own network architecture
• ELMo parameters are fixed when applied to downstream applications
  ‣ Only the weights to combine states from different LSTM layers are learned
BERT vs. ELMo
• BERT adds a classification layer for downstream tasks
  ‣ No task-specific model needed
• BERT updates all parameters during fine-tuning
How Good is BERT?
• MNLI, RTE: textual entailment
• QQP, STS-B, MRPC: sentence similarity
• QNLI: answerability prediction
• SST: sentiment analysis
• CoLA: sentence acceptability prediction
Transformers
• What are transformers, and how do they work?
Attention is All You Need
• Vaswani et al. (2017): https://arxiv.org/abs/1706.03762
• Uses attention instead of RNNs (or CNNs) to capture dependencies between words
[Figure: the contextual representation for "made" is computed as a weighted sum over the words of "I made her duck".]
Self-Attention: Implementation
• Input:
  ‣ query q (e.g. made)
  ‣ key k and value v (e.g. her)
• Query, key and value are all vectors
  ‣ linear projections from embeddings

$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} \, v_i$

$c_{made} = 0.1\, v_{I} + 0.6\, v_{her} + 0.3\, v_{duck}$
Self-Attention: Implementation

$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} \, v_i$

• Multiple queries: stack them in a matrix

$A(Q, K, V) = \mathrm{softmax}(QK^\top)V$

• Uses scaled dot-product to prevent values from growing too large

$A(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of the query and key vectors.
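A direct PyTorch sketch of the scaled dot-product formula above; the tensor shapes in the toy example are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., num_queries, num_keys)
    weights = torch.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ V, weights

# Toy example: 4 words ("I made her duck"), 8-dimensional projections
Q = torch.randn(4, 8); K = torch.randn(4, 8); V = torch.randn(4, 8)
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)   # torch.Size([4, 8]) torch.Size([4, 4])
```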
• Only one attention for each word pair
• Uses multi-head attention to allow multiple interactions
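A minimal sketch of multi-head attention, assuming the standard formulation (per-head linear projections, scaled dot-product attention within each head, then concatenation and an output projection); names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch: several attention heads run in parallel over the same sequence."""
    def __init__(self, model_dim, num_heads):
        super().__init__()
        assert model_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = model_dim // num_heads
        self.q_proj = nn.Linear(model_dim, model_dim)
        self.k_proj = nn.Linear(model_dim, model_dim)
        self.v_proj = nn.Linear(model_dim, model_dim)
        self.out_proj = nn.Linear(model_dim, model_dim)

    def forward(self, x):
        batch, seq, _ = x.shape
        # split each projection into (batch, heads, seq, head_dim)
        def split(t):
            return t.view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # scaled dot-product attention within each head
        scores = Q @ K.transpose(-2, -1) / self.head_dim ** 0.5
        context = torch.softmax(scores, dim=-1) @ V
        # concatenate the heads and mix them with a final projection
        context = context.transpose(1, 2).reshape(batch, seq, -1)
        return self.out_proj(context)

# Usage: mha = MultiHeadAttention(64, 8); out = mha(torch.randn(2, 4, 64))
```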
Transformer Block
A Final Word
• Contextual representations are very useful
• Pre-trained on a very, very large corpus
  ‣ Builds up some knowledge about language
  ‣ Uses unsupervised objectives
• When we use them for downstream tasks, we are no longer starting from "scratch"
Further Reading
• ELMo: https://arxiv.org/abs/1802.05365v2
• BERT: https://arxiv.org/abs/1810.04805
• Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html