
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

COMP90042
Natural Language Processing

Lecture 11
Semester 1 2021 Week 6

Jey Han Lau

Contextual Representation


Word Vectors/Embeddings

• Each word type has one representation

‣ Word2Vec

• Always the same representation regardless of the
context of the word

• Does not capture multiple senses of words

• Contextual representation = representation of words
based on context

• Pretrained contextual representations work really well
for downstream applications!


RNN Language Model

[Figure: an RNN language model unrolled over the input "a cow eats grass", with hidden states s_0 … s_4; at each step the model predicts the next word ("cow", "eats", "grass", ".").]


RNN Language Model

[Figure: the same RNN language model shown with vectors. Each input word ("a", "cow", "eats") enters as a one-hot vector, is mapped through the word embedding matrix into the hidden layer, and a softmax over the hidden state gives the output distribution over the next word ("cow", "eats", "grass").]

s_i = tanh(W_s s_{i−1} + W_x x_i + b)    ← s_i is a contextual representation! (W_x acts as the word embedding matrix)

y_i = softmax(s_i)
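As a minimal sketch of this recurrence (toy sizes, random weights; the output projection U below is an assumption of the sketch, since the slide folds it into the softmax):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d = 4, 3                                  # toy vocab size and hidden size
rng = np.random.default_rng(0)
W_s = rng.normal(0, 0.1, (d, d))             # recurrent weights
W_x = rng.normal(0, 0.1, (d, V))             # word embedding / input weights
U   = rng.normal(0, 0.1, (V, d))             # output projection (assumed, see above)
b   = np.zeros(d)

vocab = ["a", "cow", "eats", "grass"]
s = np.zeros(d)                              # s_0
for w in ["a", "cow", "eats"]:
    x = np.eye(V)[vocab.index(w)]            # one-hot input x_i
    s = np.tanh(W_s @ s + W_x @ x + b)       # s_i: contextual representation of w
    y = softmax(U @ s)                       # distribution over the next word
    print(w, "->", dict(zip(vocab, y.round(2))))
```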


Solved?

• Almost, but the contextual representation only
captures context to the left

• Solution: use a bidirectional RNN instead!


Bidirectional RNN

[Figure: a bidirectional RNN language model over "a cow eats grass .". A forward RNN (RNN_1) reads left to right and produces states s_1 … s_4; a backward RNN (RNN_2) reads right to left and produces states u_1 … u_4. Concatenating s_i and u_i gives a bidirectional contextual representation for each word.]

y_i = softmax([s_i, u_i])
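A minimal sketch (PyTorch, not prescribed by the slides): a bidirectional RNN whose output at position i is the concatenation [s_i, u_i] of the forward and backward hidden states, i.e. a bidirectional contextual representation.

```python
import torch
import torch.nn as nn

emb_dim, hidden = 8, 16
vocab = {"a": 0, "cow": 1, "eats": 2, "grass": 3, ".": 4}

embed = nn.Embedding(len(vocab), emb_dim)
birnn = nn.RNN(emb_dim, hidden, batch_first=True, bidirectional=True)

ids = torch.tensor([[vocab[w] for w in ["a", "cow", "eats", "grass", "."]]])
out, _ = birnn(embed(ids))     # shape: (1, 5, 2 * hidden)
rep_cow = out[0, 1]            # bidirectional contextual representation of "cow"
print(rep_cow.shape)           # torch.Size([32])
```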


Outline

• ELMo

• BERT

• Transformers


ELMo


ELMo:
Embeddings from Language Models

• Peters et al. (2018): https://arxiv.org/abs/1802.05365v2

• Trains a bidirectional, multi-layer LSTM language model over a 1B-word corpus

• Combines hidden states from multiple LSTM layers for downstream tasks (see the weighted combination below)

‣ Prior studies used only the top-layer information

• Improves task performance significantly!
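As a sketch of this layer combination (following the formulation in Peters et al. 2018: the s_j are softmax-normalised layer weights and γ is a task-specific scale, both learned for the downstream task):

```latex
\mathrm{ELMo}_k^{task} \;=\; \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```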


ELMo
• Number of LSTM layers = 2

• LSTM hidden dimension = 4096

• Character convolutional networks (CNNs) are used to create the word embeddings

‣ No unknown words

https://www.aclweb.org/anthology/P16-1101.pdf

Extracting Contextual Representation

[Figure: extracting contextual representations from the pretrained bidirectional LM (illustration from http://jalammar.github.io/illustrated-bert/).]

Downstream Task: POS Tagging

[Figure: an RNN POS tagger over "let's stick to improvisation", predicting the tags VERB, ?, TO, NOUN; the ELMo contextual embedding for "stick" is fed in alongside its word embedding.]

s_i = tanh(W_s s_{i−1} + (W_x x_i ⊕ e_i) + b),  where e_i is the ELMo contextual embedding of word i
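A minimal sketch of this setup (PyTorch; sizes and names are illustrative assumptions): the pretrained ELMo vector e_i is concatenated with the task's own word embedding before the tagger RNN, matching the equation above.

```python
import torch
import torch.nn as nn

word_dim, elmo_dim, hidden, n_tags = 50, 1024, 128, 17

word_emb = nn.Embedding(10_000, word_dim)
tagger_rnn = nn.RNN(word_dim + elmo_dim, hidden, batch_first=True)
tag_head = nn.Linear(hidden, n_tags)

word_ids = torch.randint(0, 10_000, (1, 4))   # "let's stick to improvisation"
elmo_vecs = torch.randn(1, 4, elmo_dim)       # stand-in for pretrained (frozen) ELMo vectors

x = torch.cat([word_emb(word_ids), elmo_vecs], dim=-1)   # W_x x_i ⊕ e_i
states, _ = tagger_rnn(x)
tag_logits = tag_head(states)                 # one tag distribution per word
print(tag_logits.shape)                       # torch.Size([1, 4, 17])
```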


How Good is ELMo?

• SQuAD: QA

• SNLI: textual entailment

• SRL: semantic role
labelling

• Coref: coreference
resolution

• NER: named entity
recognition

• SST-5: sentiment analysis


Other Findings

• Lower layer representation = captures syntax

‣ good for POS tagging, NER

• Higher layer representation = captures semantics

‣ good for QA, textual entailment, sentiment
analysis


Contextual vs. Non-contextual


PollEv.com/jeyhanlau569

What are the disadvantages of contextual embeddings?

• Difficult to do intrinsic evaluation (e.g. word similarity, analogy)

• Poor interpretability

• Computationally expensive to train large-scale contextual embeddings

• Only works for certain languages


BERT


Disadvantages of RNNs

• Sequential processing: difficult to scale to very large corpora or models

• RNN language models run left to right (capturing only one side of the context)

• Bidirectional RNNs help, but they only capture
surface bidirectional representations


Extracting Contextual Representation

These two RNNs are run independently!
Information is aggregated only after they have separately produced their hidden representations.


BERT: Bidirectional Encoder
Representations from Transformers

• Devlin et al. (2019): https://arxiv.org/abs/1810.04805

• Uses self-attention networks (aka Transformers) to
capture dependencies between words

‣ No sequential processing

• Masked language model objective to capture deep
bidirectional representations

• Loses the ability to generate language

• Not an issue if the goal is to learn contextual
representations


We'll come back to describe Transformers in the last part of the lecture.


Objective 1: Masked Language Model
• ‘Mask’ out k% of tokens at random

• Objective: predict the masked words

Today we have a [MASK] on contextual representations [MASK] it's interesting

(the model must predict the masked words: "lecture", "and")
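A minimal sketch of the masking step (simplified: BERT also sometimes keeps or randomly replaces the selected tokens, which is omitted here; the slide leaves the rate as k%):

```python
import random

def mask_tokens(tokens, k=0.15, mask_token="[MASK]"):
    """Randomly select ~k of the tokens, replace them with [MASK],
    and return the masked sequence plus the prediction targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < k:
            targets[i] = tok          # the model must predict this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

random.seed(1)
sent = "today we have a lecture on contextual representations and it's interesting".split()
print(mask_tokens(sent))
```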


Objective 2: Next Sentence Prediction

• Learn relationships between sentences

• Predicts whether sentence B follows sentence A

• Useful pre-training objective for downstream
applications that analyse sentence pairs (e.g.
textual entailment)

Sentence A: Today we have a lecture on NLP.
Sentence B: It is an interesting lecture.
Label: IsNextSentence

Sentence A: Today we have a lecture on NLP.
Sentence B: Polar bears are white.
Label: NotNextSentence
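A rough sketch of how such sentence-pair training examples can be built (the 50/50 split between positive and negative pairs follows the BERT paper; the tiny corpus here is just an illustration):

```python
import random

def make_nsp_examples(sentences, n=4, seed=0):
    """Build (sentence A, sentence B, label) pairs: half with the true next
    sentence, half with a randomly chosen sentence (a simplification: a real
    implementation would avoid accidentally sampling the true next sentence)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        i = rng.randrange(len(sentences) - 1)
        if rng.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], "IsNextSentence"))
        else:
            examples.append((sentences[i], rng.choice(sentences), "NotNextSentence"))
    return examples

corpus = [
    "Today we have a lecture on NLP.",
    "It is an interesting lecture.",
    "Polar bears are white.",
    "They live in the Arctic.",
]
for a, b, label in make_nsp_examples(corpus):
    print(label, "|", a, "->", b)
```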


Training/Model Details

• WordPiece (subword) tokenisation (see the sketch after this list)

• Multiple layers of transformers to learn contextual
representations

• BERT is pretrained on Wikipedia+BookCorpus

• Training takes multiple GPUs over several days
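To see what WordPiece tokenisation looks like in practice, here is a small sketch using the HuggingFace `transformers` package (not part of the slides; it requires installing the package and downloads the pretrained vocabulary on first use):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Contextual representations are unbelievably useful"))
# Rare words are split into subword pieces, e.g. something like
# ['contextual', 'representations', 'are', 'un', '##believ', '##ably', 'useful']
```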


How To Use BERT?

• Given a pretrained BERT, continue training (i.e.
fine-tune) it on downstream tasks

• But how do we adapt it to the downstream task?

• Add a classification layer on top of the contextual
representations
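A minimal sketch of this (HuggingFace `transformers` + PyTorch, not prescribed by the slides): a randomly initialised classification layer on top of the [CLS] contextual representation, e.g. for the spam-detection example that follows. During fine-tuning, the BERT parameters and this new layer are all updated.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)     # e.g. spam vs. not spam

enc = tokenizer("you have won a huge cash prize, click here", return_tensors="pt")
cls_vec = bert(**enc).last_hidden_state[:, 0]          # representation of [CLS]
logits = classifier(cls_vec)                           # scores for the two classes
print(logits.shape)                                    # torch.Size([1, 2])
```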


Training and Fine-Tuning

[Figure: pretraining followed by fine-tuning (illustration from http://jalammar.github.io/illustrated-bert/).]


Example: Spam Detection

[Figure (from http://jalammar.github.io/illustrated-bert/): BERT produces contextual representations of each word. A special token ([CLS]) is prepended to the start of every sentence; its representation captures information about the whole sentence and is the input to the new classification layer we're adding.]


Example: Spam Detection

[Figure (from http://jalammar.github.io/illustrated-bert/): the contextual representation of [CLS] feeds a classification layer for the downstream task, which is initialised randomly.]

During fine-tuning, the parameters of the whole network are updated!


BERT vs. ELMo

• ELMo provides only the contextual representations

• Downstream applications have their own network architecture

• ELMo parameters are fixed when applied to downstream applications

‣ Only the weights used to combine the states from the different LSTM layers are learned


BERT vs. ELMo

• BERT adds a classification layer for downstream
tasks

‣ No task-specific model needed

• BERT updates all parameters during fine-tuning


How Good is BERT?

• MNLI, RTE: textual entailment

• QQP, STS-B, MRPC: sentence similarity

• QNLI: answerability prediction

• SST-2: sentiment analysis

• CoLA: sentence acceptability prediction


Transformers


Transformers

• What are transformers, and how do they work?



Attention is All You Need

• Vaswani et al. (2017): https://arxiv.org/abs/1706.03762

• Uses attention instead of RNNs (or CNNs) to capture dependencies between words


[Figure: for the sentence "I made her duck", the contextual representation for "made" is a weighted sum over vectors of all the words in the sentence.]


Self-Attention via Query, Key, Value
• Input:

‣ query q (e.g. made)

‣ key k and value v (e.g. her)

• Query, key and value are all vectors

‣ linear projections of the word embeddings

• The query vector of the target word (made) is compared against the key vectors of the context words to compute attention weights

• The contextual representation of the target word is the weighted sum of the value vectors of the context words and the target word itself:

c_made = 0.1 v_I + 0.5 v_made + 0.2 v_her + 0.3 v_duck


Self-Attention

• Multiple queries: stack them into a matrix Q

• Uses a scaled dot product to prevent the attention scores from growing too large

A(q, K, V) = Σ_i [ e^{q·k_i} / Σ_j e^{q·k_j} ] v_i        (attention for a single query: a softmax over the query–key comparisons, weighting the value vectors)

A(Q, K, V) = softmax(QK^T) V                              (queries stacked as a matrix)

A(Q, K, V) = softmax(QK^T / √d_k) V                       (scaled dot product; d_k is the dimension of the query and key vectors)
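A minimal NumPy sketch of scaled dot-product attention (toy sizes; the sentence and weights are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (row-wise softmax)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query–key comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # weighted sum of value vectors

# Toy example: 4 words ("I made her duck"), d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # word embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                                    # (4, 8): one contextual vector per word
```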


• With a single attention function there is only one interaction for each word pair

• Multi-head attention is used to allow multiple interactions

A(Q, K, V) = softmax(QK^T / √d_k) V

MultiHead(Q, K, V) = concat(head_1, …, head_h) W^O

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

(Each head uses its own linear projection of Q, K and V.)
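A compact way to see the shapes is PyTorch's built-in multi-head attention (a sketch, not required by the slides): 8 heads, each with its own learned projection of Q, K and V, concatenated and projected by W^O.

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 8, 4          # toy sizes
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)          # embeddings of "I made her duck"
out, weights = mha(x, x, x)                   # self-attention: Q = K = V = x
print(out.shape)                              # torch.Size([1, 4, 64])
print(weights.shape)                          # torch.Size([1, 4, 4]), averaged over heads
```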


Transformer Block
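The figure on this slide is not reproduced in this text version. As a rough sketch of what a standard transformer encoder block (of the kind stacked in BERT) contains, not a transcription of the slide: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and layer normalisation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard transformer encoder block (sketch): self-attention + feed-forward,
    each sub-layer with a residual connection and layer normalisation."""
    def __init__(self, d_model=64, n_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])   # self-attention sub-layer
        return self.norm2(x + self.ff(x))           # feed-forward sub-layer

x = torch.randn(1, 4, 64)
print(TransformerBlock()(x).shape)                  # torch.Size([1, 4, 64])
```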


A Final Word

• Contextual representations are very useful

• Pre-trained on very large corpora

‣ Learned some knowledge about language

‣ Uses unsupervised objectives

• When we use them for downstream tasks, we are
no longer starting from “scratch”


Further Reading

• ELMo: https://arxiv.org/abs/1802.05365

• BERT: https://arxiv.org/abs/1810.04805

• Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
