COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 11
Semester 1 2021 Week 6
Jey Han Lau
Contextual Representation
Word Vectors/Embeddings
• Each word type has one representation
‣ Word2Vec
• Always the same representation regardless of the context of the word
• Does not capture multiple senses of words
• Contextual representation = representation of words based on context
• Pretrained contextual representations work really well for downstream applications!
RNN Language Model
[Figure: an RNN language model unrolled over the input "a cow eats grass", with hidden states s1 to s4 used to predict the next words "cow eats grass ."]
RNN Language Model
[Figure: one RNN step per word. Each input word ("a", "cow", "eats") is a one-hot vector, mapped through the word embedding matrix into the hidden layer; the hidden state is the contextual representation, and the output layer gives a probability distribution over the next word ("cow", "eats", "grass").]

$s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$

$y_i = \mathrm{softmax}(s_i)$
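To make the step concrete, here is a minimal NumPy sketch of a single RNN language-model step. The sizes, random weights and the explicit output projection W_y (the slide folds the projection into the softmax) are illustrative assumptions, not the lecture's actual model.

```python
import numpy as np

# Minimal sketch of one RNN language-model step (toy sizes, random weights).
V, H = 4, 3                       # toy vocabulary size and hidden dimension
rng = np.random.default_rng(0)
W_s = rng.normal(size=(H, H))     # recurrent weights
W_x = rng.normal(size=(H, V))     # input weights (act on the one-hot word vector)
W_y = rng.normal(size=(V, H))     # output projection to the vocabulary (assumed)
b = np.zeros(H)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

s_prev = np.zeros(H)              # s_0
x_i = np.eye(V)[1]                # one-hot vector for the current word, e.g. "cow"
s_i = np.tanh(W_s @ s_prev + W_x @ x_i + b)   # contextual representation
y_i = softmax(W_y @ s_i)          # probability distribution over the next word
```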
Solved?
• Almost, but the contextual representation only captures context to the left
• Solution: use a bidirectional RNN instead!
Bidirectional RNN
[Figure: a forward RNN over "a cow eats grass" produces states s1 to s4, and a backward RNN over the reversed input produces states u1 to u4; concatenating the two gives a bidirectional contextual representation for each word.]

$y_i = \mathrm{softmax}([s_i, u_i])$
Outline
• ELMo
• BERT
• Transformers
ELMo
ELMo: Embeddings from Language Models
• Peters et al. (2018): https://arxiv.org/abs/1802.05365v2
• Trains a bidirectional, multi-layer LSTM language model over a 1B-word corpus
• Combines hidden states from multiple layers of the LSTM for downstream tasks
‣ Prior studies use only top-layer information
• Improves task performance significantly!
ELMo
• Number of LSTM layers = 2
• LSTM hidden dimension = 4096
• Character convolutional networks (CNN) to create word embeddings
‣ No unknown words
https://www.aclweb.org/anthology/P16-1101.pdf
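As a usage illustration only (not part of the lecture), here is a sketch of extracting ELMo contextual representations with the AllenNLP library; the character-id input is why there are no unknown words. The options/weights file paths are placeholders for the published ELMo files.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute the published ELMo options/weights files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# One output representation = a learned weighted combination of the LSTM layers.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

# Words are converted to character ids, so there are no unknown words.
character_ids = batch_to_ids([["a", "cow", "eats", "grass"]])
embeddings = elmo(character_ids)["elmo_representations"][0]  # (batch, words, dim)
```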
Extracting Contextual Representation
http://jalammar.github.io/illustrated-bert/
Downstream Task: POS Tagging
[Figure: an RNN POS tagger over "let's stick to improvisation", predicting the tag sequence VERB ? TO NOUN; the ELMo contextual embedding for "stick" is concatenated (⊕) with the word's input representation.]

$s_i = \tanh(W_s s_{i-1} + (W_x x_i \oplus e_i) + b)$, where $e_i$ is the ELMo contextual embedding
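A rough PyTorch sketch of the tagger step above, reading the equation as the RNN cell's input weights acting on the concatenation of the word embedding x_i and the ELMo embedding e_i; all dimensions and the random stand-in vectors are hypothetical.

```python
import torch
import torch.nn as nn

# Sketch: concatenate the word embedding with its ELMo contextual embedding and
# feed the result to a tanh RNN cell, then score POS tags from the hidden state.
word_dim, elmo_dim, hidden_dim, n_tags = 100, 1024, 128, 17
rnn_cell = nn.RNNCell(word_dim + elmo_dim, hidden_dim, nonlinearity="tanh")
tag_scorer = nn.Linear(hidden_dim, n_tags)

x_i = torch.randn(1, word_dim)       # word embedding for "stick" (random stand-in)
e_i = torch.randn(1, elmo_dim)       # ELMo contextual embedding (random stand-in)
s_prev = torch.zeros(1, hidden_dim)  # previous hidden state s_{i-1}

s_i = rnn_cell(torch.cat([x_i, e_i], dim=-1), s_prev)
tag_scores = tag_scorer(s_i)         # scores over the POS tags for this position
```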
How Good is ELMo?
• SQuAD: QA
• SNLI: textual entailment
• SRL: semantic role labelling
• Coref: coreference resolution
• NER: named entity recognition
• SST-5: sentiment analysis
Other Findings
• Lower layer representation = captures syntax
‣ good for POS tagging, NER
• Higher layer representation = captures semantics
‣ good for QA, textual entailment, sentiment analysis
Contextual vs. Non-contextual
What are the disadvantages of contextual embeddings?
PollEv.com/jeyhanlau569
• Difficult to do intrinsic evaluation (e.g. word similarity, analogy)
• Interpretability
• Computationally expensive to train large-scale contextual embeddings
• Only works for certain languages
BERT
Disadvantages of RNNs
• Sequential processing: difficult to scale to very large corpora or models
• RNN language models run left to right (capturing only one side of the context)
• Bidirectional RNNs help, but they only capture surface bidirectional representations
Extracting Contextual Representation
These two RNNs are run independently! Information is aggregated only after they have separately produced their hidden representations.
BERT: Bidirectional Encoder Representations from Transformers
• Devlin et al. (2019): https://arxiv.org/abs/1810.04805
• Uses self-attention networks (aka Transformers) to capture dependencies between words
‣ No sequential processing
• Masked language model objective to capture deep bidirectional representations
• Loses the ability to generate language
• Not an issue if the goal is to learn contextual representations
We'll come back to describe Transformers in the last part of the lecture.
Objective 1: Masked Language Model
• ‘Mask’ out k% of tokens at random
• Objective: predict the masked words
Example: "Today we have a [MASK] on contextual representations [MASK] it's interesting"; the model must predict the masked words "lecture" and "and".
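For intuition, a short sketch of querying BERT's pretrained masked-LM head with the HuggingFace transformers library (the library and model name are assumptions, not part of the lecture); this simple form fills one [MASK] at a time.

```python
from transformers import pipeline

# Ask BERT's masked-language-model head for the most likely fillers of [MASK].
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("Today we have a [MASK] on contextual representations."):
    print(pred["token_str"], round(pred["score"], 3))  # top candidates and their scores
```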
Objective 2: Next Sentence Prediction
• Learn relationships between sentences
• Predicts whether sentence B follows sentence A
• Useful pre-training objective for downstream applications that analyse sentence pairs (e.g. textual entailment)
Sentence A: Today we have a lecture on NLP.
Sentence B: It is an interesting lecture.
Label: IsNextSentence
Sentence A: Today we have a lecture on NLP.
Sentence B: Polar bears are white.
Label: NotNextSentence
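The same objective is exposed through HuggingFace's BertForNextSentencePrediction head; in this sketch (library usage assumed), label 0 means "sentence B follows A" and label 1 means "B is a random sentence".

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Encode the sentence pair; the tokenizer adds [CLS] and [SEP] automatically.
enc = tokenizer("Today we have a lecture on NLP.",
                "Polar bears are white.",
                return_tensors="pt")
probs = torch.softmax(model(**enc).logits, dim=-1)
print(probs)  # expect most of the mass on index 1 (NotNextSentence) for this pair
```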
Training/Model Details
• WordPiece (subword) tokenisation (see the sketch below)
• Multiple layers of transformers to learn contextual representations
• BERT is pretrained on Wikipedia+BookCorpus
• Training takes multiple GPUs over several days
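A quick illustration of WordPiece tokenisation through the HuggingFace tokenizer (assumed library; the exact splits depend on the learned vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Common words stay whole; rare or unseen words are split into '##'-prefixed
# subword pieces, so the model never faces an out-of-vocabulary word.
print(tokenizer.tokenize("Contextual representations are interesting."))
print(tokenizer.tokenize("unbelievabilities"))
```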
How To Use BERT?
• Given a pretrained BERT, continue training (i.e. fine-tune) it on downstream tasks
• But how do we adapt it to a downstream task?
• Add a classification layer on top of the contextual representations (see the sketch below)
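A minimal sketch of that recipe with HuggingFace transformers (assumed library): load pretrained BERT with a freshly initialised classification head, then fine-tune the whole model on labelled task data.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Loads pretrained BERT and adds a randomly initialised 2-class classifier on top.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["Win a free iPhone now!!!"], return_tensors="pt")
labels = torch.tensor([1])                 # hypothetical label: 1 = spam
loss = model(**batch, labels=labels).loss  # cross-entropy over the classifier logits
loss.backward()                            # gradients reach all of BERT's parameters
```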
Training and Fine-Tuning
http://jalammar.github.io/illustrated-bert/
Example: Spam Detection
http://jalammar.github.io/illustrated-bert/
[Figure: BERT applied to a spam-detection input. Callouts: the model produces contextual representations of each word; [CLS] is a special token prepended to the start of every sentence; its representation captures information about the whole sentence and is the input to the new classification layer we're adding.]
Example: Spam Detection
http://jalammar.github.io/illustrated-bert/
During fine-tuning, the parameters of the whole network are updated!

[Figure: the contextual representation of [CLS] is fed to the classification layer for the downstream task, which is initialised randomly.]
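To see what the downstream classifier actually reads, a sketch (again assuming HuggingFace transformers) that extracts the contextual representation of [CLS]:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("Win a free iPhone now!!!", return_tensors="pt")  # [CLS] added automatically
with torch.no_grad():
    out = bert(**enc)
cls_vector = out.last_hidden_state[:, 0, :]  # [CLS] representation fed to the classifier
```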
BERT vs. ELMo
• ELMo provides only the contextual representations
• Downstream applications have their own network architecture
• ELMo parameters are fixed when applied to downstream applications
‣ Only the weights to combine states from different LSTM layers are learned
BERT vs. ELMo
• BERT adds a classification layer for downstream tasks
‣ No task-specific model needed
• BERT updates all parameters during fine-tuning
How Good is BERT?
• MNLI, RTE: textual entailment
• QQP, STS-B, MRPC: sentence similarity
• QNLI: answerability prediction
• SST-2: sentiment analysis
• CoLA: sentence acceptability prediction
Transformers
Transformers
• What are transformers, and how do they work?
Attention is All You Need
• Vaswani et al. (2017): https://arxiv.org/abs/1706.03762
• Uses attention instead of RNNs (or CNNs) to capture dependencies between words
[Figure: for the sentence "I made her duck", the contextual representation for "made" is computed as a weighted sum over the representations of all words in the sentence.]
Self-Attention via Query, Key, Value
• Input:
‣ query q (e.g. made)
‣ key k and value v (e.g. her)
• Query, key and value are all vectors
‣ linear projections from embeddings
• Comparison between the query vector of the target word (made) and the key vectors of context words to compute weights
• Contextual representation of target word = weighted sum of value vectors of context words and the target word

$c_{made} = 0.1\, v_I + 0.5\, v_{made} + 0.2\, v_{her} + 0.3\, v_{duck}$
Self-Attention
• Multiple queries, stack them in a matrix
• Uses scaled dot-product to prevent values from growing too large

$A(q, K, V) = \sum_i \frac{e^{q \cdot k_i}}{\sum_j e^{q \cdot k_j}} \, v_i$

In matrix form, with the query-key comparisons passed through a softmax:

$A(Q, K, V) = \mathrm{softmax}(QK^T)\,V$

Scaled by $\sqrt{d_k}$, where $d_k$ is the dimension of the query and key vectors:

$A(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
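A self-contained NumPy sketch of the scaled dot-product formula above, using made-up sizes for a four-word input:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key comparisons, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # one weight distribution per query
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
n_words, d_k, d_v = 4, 8, 8              # e.g. "I made her duck": 4 positions
Q = rng.normal(size=(n_words, d_k))      # queries (linear projections of embeddings)
K = rng.normal(size=(n_words, d_k))      # keys
V = rng.normal(size=(n_words, d_v))      # values
contextual = attention(Q, K, V)          # one contextual vector per word
```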
• Only one attention for each word pair
• Uses multi-head attention to allow multiple interactions, with a separate linear projection of Q/K/V for each head:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$

$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
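And a self-contained sketch of multi-head attention in the same style (illustrative sizes; the per-head projection matrices and W^O are random stand-ins):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head(X, heads, W_O):
    # Each head projects the inputs with its own W_Q, W_K, W_V, runs attention,
    # and the concatenated head outputs are projected by W_O.
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(1)
n_words, d_model, h = 4, 16, 2
d_head = d_model // h
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

X = rng.normal(size=(n_words, d_model))  # one vector per input word
out = multi_head(X, heads, W_O)          # shape (n_words, d_model)
```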
Transformer Block
A Final Word
• Contextual representations are very useful
• Pre-trained on a very large corpus
‣ Learned some knowledge about language
‣ Uses unsupervised objectives
• When we use them for downstream tasks, we are no longer starting from "scratch"
Further Reading
• ELMo: https://arxiv.org/abs/1802.05365
• BERT: https://arxiv.org/abs/1810.04805
• Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html