Contextual Representation
COMP90042
Natural Language Processing
Lecture 11
Semester 1 2021, Week 6. Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Word Vectors/Embeddings
• Each word type has one representation
  ‣ e.g. Word2Vec
• Always the same representation, regardless of the context of the word
• Does not capture multiple senses of words
• Contextual representation = representation of words based on context
• Pretrained contextual representations work really well for downstream applications!
RNN Language Model
[Figure: an RNN language model unrolled over "a cow eats grass .": at each step the state s_i is updated from the previous state and the current word, and is used to predict the next word (a → cow, cow → eats, eats → grass, grass → .).]
RNN Language Model
[Figure: one step of the RNN language model. A one-hot input vector (e.g. for "cow") selects a row of the word embedding matrix; the hidden state combines this embedding with the previous state, and the output layer gives a probability distribution over the next word ("eats"). The hidden state is the contextual representation!]

s_i = tanh(W_s s_{i-1} + W_x x_i + b)
y_i = softmax(s_i)
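A minimal sketch of one step of this RNN language model in NumPy. The sizes, random weights, and the explicit output projection W_y are assumptions for illustration (the slide writes y_i = softmax(s_i), folding the projection into the state itself):

```python
import numpy as np

# One RNN language-model step: s_i = tanh(W_s s_{i-1} + W_x x_i + b),
# followed by a softmax over the vocabulary.
vocab_size, embed_dim, hidden_dim = 5, 3, 4
rng = np.random.default_rng(0)

E   = rng.normal(size=(vocab_size, embed_dim))   # word embedding matrix
W_s = rng.normal(size=(hidden_dim, hidden_dim))  # recurrent weights
W_x = rng.normal(size=(hidden_dim, embed_dim))   # input weights
W_y = rng.normal(size=(vocab_size, hidden_dim))  # output projection (assumption)
b   = np.zeros(hidden_dim)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(s_prev, word_id):
    x = E[word_id]                             # one-hot input selects an embedding row
    s = np.tanh(W_s @ s_prev + W_x @ x + b)    # contextual representation s_i
    y = softmax(W_y @ s)                       # distribution over the next word
    return s, y

s = np.zeros(hidden_dim)
for word_id in [0, 1, 2]:                      # e.g. word ids for "a cow eats"
    s, y = rnn_step(s, word_id)                # s now summarises the left context
```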
Solved?
• Almost, but the contextual representation only captures context to the left
• Solution: use a bidirectional RNN instead!
Bidirectional RNN
[Figure: a forward RNN (states s_i) processes "a cow eats grass ." left to right while a backward RNN (states u_i) processes it right to left; each position's output is computed from the two states together. This gives a bidirectional contextual representation!]

y_i = softmax([s_i, u_i])
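A rough sketch of the bidirectional idea, assuming made-up sizes and random weights: run two independent RNNs in opposite directions and concatenate their states per position.

```python
import numpy as np

# Bidirectional RNN encoder sketch: forward states s_i see the left context,
# backward states u_i see the right context; [s_i, u_i] is the representation.
hidden_dim, embed_dim, seq_len = 4, 3, 5
rng = np.random.default_rng(1)
X = rng.normal(size=(seq_len, embed_dim))            # embeddings for "a cow eats grass ."

def make_cell():
    W_s = rng.normal(size=(hidden_dim, hidden_dim))
    W_x = rng.normal(size=(hidden_dim, embed_dim))
    b = np.zeros(hidden_dim)
    return lambda s, x: np.tanh(W_s @ s + W_x @ x + b)

fwd_cell, bwd_cell = make_cell(), make_cell()        # two independent RNNs

def run(cell, inputs):
    s, states = np.zeros(hidden_dim), []
    for x in inputs:
        s = cell(s, x)
        states.append(s)
    return states

fwd = run(fwd_cell, X)                               # s_1 ... s_n (left to right)
bwd = run(bwd_cell, X[::-1])[::-1]                   # u_1 ... u_n (right to left)
contextual = [np.concatenate([s, u]) for s, u in zip(fwd, bwd)]  # [s_i, u_i]
```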
Outline
• ELMo
• BERT
• Transformers
ELMo
ELMo: Embeddings from Language Models
• Peters et al. (2018): https://arxiv.org/abs/1802.05365v2
• Trains a bidirectional, multi-layer LSTM language model over a 1B word corpus
• Combines hidden states from multiple layers of the LSTM for downstream tasks
  ‣ Prior studies use only top-layer information
• Improves task performance significantly!
ELMo
• Number of LSTM layers = 2
• LSTM hidden dimension = 4096
• Character convolutional networks (CNN) to create word embeddings
  ‣ No unknown words
(Character CNN: https://www.aclweb.org/anthology/P16-1101.pdf)
Extracting Contextual Representation
[Figure from http://jalammar.github.io/illustrated-bert/]
Downstream Task: POS Tagging
[Figure: an RNN POS tagger over "let's stick to improvisation". At each step, the ELMo contextual embedding for the word (e.g. "stick") is concatenated (⊕) with the word's input embedding before the state update; the tagger predicts tags such as VERB, TO, NOUN.]

s_i = tanh(W_s s_{i-1} + (W_x x_i ⊕ e_i) + b), where e_i is the ELMo contextual embedding for word i
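A rough sketch of combining a (frozen) ELMo embedding with the tagger's own word embedding. Here get_elmo_embeddings is a hypothetical stand-in for a real ELMo encoder, the sizes are made up, and for simplicity the sketch concatenates x_i with e_i before a single projection W_x, which is one reading of the formula above:

```python
import numpy as np

# Feed ELMo into a downstream tagger by concatenating the contextual embedding
# e_i with the task's word embedding x_i at every step.
embed_dim, elmo_dim, hidden_dim = 50, 1024, 64
rng = np.random.default_rng(2)

def get_elmo_embeddings(tokens):                 # hypothetical helper, not a real API
    return rng.normal(size=(len(tokens), elmo_dim))

tokens = ["let's", "stick", "to", "improvisation"]
X = rng.normal(size=(len(tokens), embed_dim))    # task word embeddings x_i
E = get_elmo_embeddings(tokens)                  # ELMo embeddings e_i (kept fixed)

W_s = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_x = rng.normal(size=(hidden_dim, embed_dim + elmo_dim)) * 0.1
b = np.zeros(hidden_dim)

s = np.zeros(hidden_dim)
for x_i, e_i in zip(X, E):
    s = np.tanh(W_s @ s + W_x @ np.concatenate([x_i, e_i]) + b)  # uses x_i ⊕ e_i
```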
How Good is ELMo?
• SQuAD: QA
• SNLI: textual entailment
• SRL: semantic role labelling
• Coref: coreference resolution
• NER: named entity recognition
• SST-5: sentiment analysis
Other Findings
• Lower layer representation = captures syntax
  ‣ good for POS tagging, NER
• Higher layer representation = captures semantics
  ‣ good for QA, textual entailment, sentiment analysis
Contextual vs. Non-contextual
What are the disadvantages of contextual embeddings?
• Difficult to do intrinsic evaluation (e.g. word similarity, analogy)
• Interpretability
• Computationally expensive to train large-scale contextual embeddings
• Only works for certain languages

PollEv.com/jeyhanlau569
BERT
Disadvantages of RNNs
• RNN language models run left to right (capturing only one side of the context)
• Bidirectional RNNs help, but they only capture surface bidirectional representations
• Sequential processing: difficult to scale to very large corpora or models
Extracting Contextual Representation
These two RNNs are run independently! Information is aggregated after they have separately produced their hidden representations.
BERT: Bidirectional Encoder Representations from Transformers
• Devlin et al. (2019): https://arxiv.org/abs/1810.04805
• Uses self-attention networks (aka Transformers) to capture dependencies between words
  ‣ No sequential processing
• Masked language model objective to capture deep bidirectional representations
• Loses the ability to generate language
  ‣ Not an issue if the goal is to learn contextual representations
We'll come back to describe Transformers in the last part of the lecture.
Objective 1: Masked Language Model
• 'Mask' out k% of tokens at random
• Objective: predict the masked words

  Today we have a [MASK] on contextual representations [MASK] it's interesting
  (masked words: lecture, and)
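A minimal sketch of the masking step. This is not BERT's exact recipe (the real procedure also sometimes keeps or randomly replaces selected tokens); the helper and data here are illustrative assumptions:

```python
import random

# Mask out roughly k% of tokens at random; the model must predict the originals.
def mask_tokens(tokens, k=0.15, mask_token="[MASK]"):
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < k:
            targets[i] = tok          # word the model has to predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "today we have a lecture on contextual representations and it's interesting".split()
masked, targets = mask_tokens(tokens)
# e.g. masked = [..., "[MASK]", ...], targets = {4: "lecture", 8: "and"}
```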
Objective 2: Next Sentence Prediction
• Learn relationships between sentences
• Predicts whether sentence B follows sentence A
• Useful pre-training objective for downstream applications that analyse sentence pairs (e.g. textual entailment)

  Sentence A: Today we have a lecture on NLP.
  Sentence B: It is an interesting lecture.
  Label: IsNextSentence

  Sentence A: Today we have a lecture on NLP.
  Sentence B: Polar bears are white.
  Label: NotNextSentence
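A sketch of how such training pairs could be constructed from a document collection; the 50/50 split and the random-sentence sampling are assumptions, not BERT's exact procedure:

```python
import random

# Build next-sentence-prediction pairs: roughly half are true consecutive
# sentence pairs, half pair a sentence with a random one from the collection.
def make_nsp_pairs(documents):
    pairs = []
    all_sentences = [s for doc in documents for s in doc]
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            if random.random() < 0.5:
                pairs.append((a, b, "IsNextSentence"))
            else:
                pairs.append((a, random.choice(all_sentences), "NotNextSentence"))
    return pairs

docs = [["Today we have a lecture on NLP.", "It is an interesting lecture."],
        ["Polar bears are white.", "They live in the Arctic."]]
print(make_nsp_pairs(docs))
```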
Training/Model Details
• WordPiece (subword) tokenisation
• Multiple layers of transformers to learn contextual representations
• BERT is pretrained on Wikipedia + BookCorpus
• Training takes multiple GPUs over several days
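A quick illustration of WordPiece tokenisation, assuming the Hugging Face transformers library and the bert-base-uncased model (neither is named in the lecture). Rare words are split into subword pieces, so there are effectively no unknown words:

```python
from transformers import BertTokenizer

# Tokenise a sentence with BERT's WordPiece vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Here is the sentence I want embeddings for."))
# Expected output (roughly): ['here', 'is', 'the', 'sentence', 'i', 'want',
#                             'em', '##bed', '##ding', '##s', 'for', '.']
```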
How To Use BERT?
• Given a pretrained BERT, continue training (i.e. fine-tune) it on downstream tasks
• But how to adapt it to the downstream task?
• Add a classification layer on top of the contextual representations
Training and Fine-Tuning
[Figure from http://jalammar.github.io/illustrated-bert/]
Example: Spam Detection
[Figure: a special token is prepended to the start of every sentence; BERT produces contextual representations of each word, and the representation of the special token captures information of the whole sentence and is the input to the new classification layer we're adding.]
(Figure from http://jalammar.github.io/illustrated-bert/)
Example: Spam Detection
[Figure: the classification layer for the downstream task is initialised randomly and takes the contextual representation of [CLS] as input. During fine-tuning, parameters of the whole network are updated!]
(Figure from http://jalammar.github.io/illustrated-bert/)
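A hedged sketch of fine-tuning BERT for spam detection, assuming the Hugging Face transformers library, the bert-base-uncased model, and toy data (none of which are part of the lecture). A randomly initialised classification head sits on top of the [CLS] representation, and all parameters are updated:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pretrained BERT plus a fresh (randomly initialised) classification layer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["Win a free prize now!!!", "Are we still meeting for the lecture tomorrow?"]
labels = torch.tensor([1, 0])                        # 1 = spam, 0 = not spam
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # whole network is updated
model.train()
outputs = model(**batch, labels=labels)              # classification over [CLS]
outputs.loss.backward()
optimizer.step()
```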
BERT vs. ELMo
• ELMo provides only the contextual representations
• Downstream applications have their own network architecture
• ELMo parameters are fixed when applied to downstream applications
  ‣ Only the weights to combine states from different LSTM layers are learned
BERT vs. ELMo
• BERT adds a classification layer for downstream tasks
  ‣ No task-specific model needed
• BERT updates all parameters during fine-tuning
How Good is BERT?
• MNLI, RTE: textual entailment
• QQP, STS-B, MRPC: sentence similarity
• QNLI: answerability prediction
• SST-2: sentiment analysis
• CoLA: sentence acceptability prediction
Transformers
• What are transformers, and how do they work?
Attention is All You Need
• Vaswani et al. (2017): https://arxiv.org/abs/1706.03762
• Uses attention instead of RNNs (or CNNs) to capture dependencies between words
[Figure: the contextual representation for "made" in "I made her duck" is a weighted sum over representations of all the words in the sentence.]
Self-Attention via Query, Key, Value
• Input:
  ‣ query q (e.g. made)
  ‣ key k and value v (e.g. her)
• Query, key and value are all vectors
  ‣ linear projections from embeddings
• Comparison between the query vector of the target word (made) and the key vectors of context words to compute weights
• Contextual representation of the target word = weighted sum of value vectors of the context words and the target word

  c_made = 0.1 v_I + 0.5 v_made + 0.2 v_her + 0.3 v_duck
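A small sketch of attention for a single query, with made-up dimensions and random projection weights: the query of the target word ("made") is compared against every key to get weights, and the contextual representation is the weighted sum of the value vectors.

```python
import numpy as np

# Single-query self-attention over "I made her duck".
d = 4
rng = np.random.default_rng(3)
words = ["I", "made", "her", "duck"]
E = rng.normal(size=(len(words), d))       # input embeddings

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # linear projections
Q, K, V = E @ W_q.T, E @ W_k.T, E @ W_v.T

q = Q[1]                                   # query vector for the target word "made"
scores = K @ q                             # compare the query with every key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
c_made = weights @ V                       # weighted sum of value vectors
```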
Self-Attention

  A(q, K, V) = Σ_i [ e^{q·k_i} / Σ_j e^{q·k_j} ] v_i
  (query–key comparisons, softmax, then a weighted sum of value vectors)

• Multiple queries: stack them in a matrix
  A(Q, K, V) = softmax(QK^T) V
• Uses scaled dot-product to prevent the values from growing too large
  A(Q, K, V) = softmax(QK^T / √d_k) V
  where d_k is the dimension of the query and key vectors
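The matrix form of the slide's scaled dot-product formula as a short NumPy sketch; the shapes and random inputs are assumptions for illustration.

```python
import numpy as np

# A(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # all query-key comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sums of values

rng = np.random.default_rng(4)
n, d_k, d_v = 4, 8, 8                                        # e.g. 4 words: "I made her duck"
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
contextual = scaled_dot_product_attention(Q, K, V)           # one representation per word
```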
• Only one attention for each word pair:
  A(Q, K, V) = softmax(QK^T / √d_k) V
• Uses multi-head attention to allow multiple interactions:
  MultiHead(Q, K, V) = concat(head_1, ..., head_h) W^O
  head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
  where W_i^Q, W_i^K, W_i^V are linear projections of Q/K/V for each head
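A sketch of multi-head attention with made-up head count, sizes, and random weights: each head projects Q/K/V, attends independently, and the heads are concatenated and projected by W^O.

```python
import numpy as np

# Scaled dot-product attention for one head.
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, n_heads=2, seed=5):
    d_model, d_head = Q.shape[-1], Q.shape[-1] // n_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):                           # head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.normal(size=(n_heads * d_head, d_model)) # output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.default_rng(6).normal(size=(4, 8))       # 4 words, d_model = 8
out = multi_head_attention(X, X, X)                    # self-attention: Q = K = V = X
```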
Transformer Block
A Final Word
• Contextual representations are very useful
• Pre-trained on a very large corpus
  ‣ Learned some knowledge about language
  ‣ Uses unsupervised objectives
• When we use them for downstream tasks, we are no longer starting from "scratch"
Further Reading
• ELMo: https://arxiv.org/abs/1802.05365
• BERT: https://arxiv.org/abs/1810.04805
• Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html