
Contextual Representation
COMP90042 Natural Language Processing, Lecture 11
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE

Word Vectors/Embeddings

• Each word type has one representation
  ‣ Word2Vec
• Always the same representation regardless of the context of the word
• Does not capture multiple senses of words
• Contextual representation = representation of words based on context
• Pre-trained contextual representations work really well for downstream applications!

RNN Language Model

[Figure: an RNN language model unrolled over "a cow eats grass .", with hidden states s0–s4; at each step the RNN predicts the next word (cow, eats, grass, .).]

RNN Language Model

[Figure: the RNN unrolled over "a cow eats ...". Each input word is a one-hot vector, mapped to a word embedding; the RNN hidden state at each step is the contextual word embedding; the output layer gives a probability distribution over the next word (e.g. cow, eats, grass).]

si = tanh(Ws si−1 + Wx xi + b)
yi = softmax(si)

(A small NumPy sketch of this step follows below.)
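As a rough illustration of the equations above, here is a minimal NumPy sketch of one step of this RNN language model. The toy vocabulary, sizes and random weights are all made up for the example, and, following the slide's simplified formulation, the softmax is applied directly to the hidden state (so the hidden size equals the vocabulary size; real models normally add a separate output projection).

import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy sizes (invented): the slide applies softmax directly to the hidden
# state, so here hidden size == vocabulary size.
vocab = ["a", "cow", "eats", "grass", "."]
V = len(vocab)
d = V                          # hidden state dimension

rng = np.random.default_rng(0)
Ws = rng.normal(scale=0.1, size=(d, d))   # recurrent weights
Wx = rng.normal(scale=0.1, size=(d, V))   # input weights
b  = np.zeros(d)

def rnn_step(s_prev, x_onehot):
    """One step: s_i = tanh(Ws s_{i-1} + Wx x_i + b), y_i = softmax(s_i)."""
    s_i = np.tanh(Ws @ s_prev + Wx @ x_onehot + b)
    y_i = softmax(s_i)         # distribution over the next word
    return s_i, y_i

# Run over "a cow eats"; each hidden state s_i is a contextual embedding
# of the word at position i.
s = np.zeros(d)
for w in ["a", "cow", "eats"]:
    x = np.eye(V)[vocab.index(w)]
    s, y = rnn_step(s, x)
print(dict(zip(vocab, y.round(2))))       # predicted next-word probabilities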

Solved?

• Almost, but the contextual representation only captures context to the left
• Solution: use a bidirectional RNN instead!

Bidirectional RNN

[Figure: two RNNs over "a cow eats grass .": a forward RNN (RNN1) producing states s0–s4 left to right, and a backward RNN (RNN2) producing states u0–u4 right to left. The forward and backward states at each position together form the contextual word embedding.]

yi = softmax([si, ui])
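A minimal sketch of the combination step above, with random vectors standing in for the actual forward and backward RNN states (all sizes are invented for the example):

import numpy as np

d = 4                                   # per-direction hidden size (assumed)
T = 5                                   # sentence length: "a cow eats grass ."

rng = np.random.default_rng(1)
fwd_states = [rng.normal(size=d) for _ in range(T)]   # stand-ins for s_0 ... s_4
bwd_states = [rng.normal(size=d) for _ in range(T)]   # stand-ins for u_0 ... u_4

# Contextual embedding for position i is the concatenation [s_i, u_i];
# a softmax layer over it would give y_i = softmax([s_i, u_i]).
contextual = [np.concatenate([s, u]) for s, u in zip(fwd_states, bwd_states)]
print(contextual[1].shape)              # (8,) -- the embedding for "cow" sees both sides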

ELMo

ELMo: Embeddings from Language Models

• Peters et al. (2018): https://arxiv.org/abs/1802.05365v2
• Trains a bidirectional, multi-layer LSTM language model over a 1B-word corpus
• Combines hidden states from multiple layers of the LSTM for downstream tasks
  ‣ Prior studies use only top layer information
• Improves task performance significantly!
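The layer combination in the ELMo paper is a softmax-normalised weighted sum of the layer states with a task-specific scaling factor. The sketch below illustrates that computation for a single token; the dimensions and the random layer states are stand-ins for the real model's outputs.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-ins for ELMo's per-layer representations of one token:
# layer 0 = character-CNN word embedding, layers 1-2 = biLSTM states.
L, d = 3, 8                              # 2 LSTM layers + embedding layer; toy dimension
rng = np.random.default_rng(2)
layer_states = rng.normal(size=(L, d))

# Task-specific parameters learned alongside the downstream model
# (the values here are arbitrary): one scalar per layer plus a global scale.
layer_logits = np.array([0.2, 1.0, 0.5])
gamma = 1.0

weights = softmax(layer_logits)
elmo_embedding = gamma * np.sum(weights[:, None] * layer_states, axis=0)
print(elmo_embedding.shape)              # (8,) -- one contextual vector per token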

ELMo

• Number of LSTM layers = 2
• LSTM hidden dimension = 4096
• Character convolutional networks (CNN) to create word embeddings
  ‣ No unknown words
  ‣ https://www.aclweb.org/anthology/P16-1101.pdf
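A very rough sketch of how a character CNN can produce a word embedding: embed the characters, slide 1D filters over character windows, and max-pool over positions. The character vocabulary, filter count, filter width and dimensions are all invented here, and ELMo's actual character encoder is considerably larger and more elaborate.

import numpy as np

chars = "abcdefghijklmnopqrstuvwxyz"
char_dim, n_filters, width = 4, 6, 3

rng = np.random.default_rng(3)
char_emb = rng.normal(size=(len(chars), char_dim))       # character embeddings
filters  = rng.normal(size=(n_filters, width, char_dim)) # 1D convolution filters

def word_embedding(word):
    """Embed characters, convolve over character windows, max-pool over time."""
    x = np.stack([char_emb[chars.index(c)] for c in word])        # (len, char_dim)
    conv = np.array([
        [np.sum(x[i:i + width] * f) for i in range(len(word) - width + 1)]
        for f in filters
    ])                                                            # (n_filters, positions)
    return conv.max(axis=1)                                       # max-pool -> (n_filters,)

# Any character string gets a vector, so there are no unknown words.
print(word_embedding("grass"), word_embedding("grassy"))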

Extracting Contextual Representation

[Figure from http://jalammar.github.io/illustrated-bert/]


Downstream Task: POS Tagging

[Figure: an RNN tagger over "let's stick to improvisation", predicting a tag (e.g. VERB) for each word; the ELMo contextual embedding ei for "stick" is concatenated with its word embedding at the input.]

si = tanh(Ws si−1 + (Wx xi ⊕ ei) + b)
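A minimal sketch of this concatenation step (⊕ is concatenation; all dimensions, weights and vectors below are stand-ins invented for the example):

import numpy as np

d_word, d_elmo = 4, 6
d_hidden = d_word + d_elmo               # so the sum inside tanh is well-defined here

rng = np.random.default_rng(4)
Ws = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
Wx = rng.normal(scale=0.1, size=(d_word, d_word))
b  = np.zeros(d_hidden)

def tagger_step(s_prev, x_i, e_i):
    combined = np.concatenate([Wx @ x_i, e_i])   # (Wx x_i) ⊕ e_i
    return np.tanh(Ws @ s_prev + combined + b)

s_prev  = np.zeros(d_hidden)
x_stick = rng.normal(size=d_word)        # word embedding for "stick" (stand-in)
e_stick = rng.normal(size=d_elmo)        # ELMo contextual embedding (stand-in)
s_i = tagger_step(s_prev, x_stick, e_stick)
print(s_i.shape)                          # (10,) -- hidden state used to predict the tag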

How Good is ELMo?

• SQuAD: question answering
• SNLI: textual entailment
• SRL: semantic role labelling
• Coref: coreference resolution
• NER: named entity recognition
• SST-5: sentiment analysis

Other Findings

• Lower layer representation = captures syntax
  ‣ good for POS tagging, NER
• Higher layer representation = captures semantics
  ‣ good for QA, textual entailment, sentiment analysis

Contextual vs. Non-contextual

[Figure]

Disadvantages of RNNs

• Sequential processing: difficult to scale to very large corpora or models
• RNN language models run left to right (capture only one side of context)
  ‣ Produces well-formed sentence probability
• Bidirectional RNNs help, but they only capture surface bidirectional representations

BERT

BERT: Bidirectional Encoder Representations from Transformers

• Devlin et al. (2019): https://arxiv.org/abs/1810.04805
• Uses self-attention networks (aka Transformers) to capture dependencies between words
  ‣ No sequential processing
• Masked language model objective to capture deep bidirectional representations
• Loses the ability to generate language
  ‣ Not an issue if the goal is to learn contextual representations


Objective 1: Masked Language Model

• "Mask" out k% of tokens at random
• Objective: predict the masked words (a small sketch of the masking step follows below)

  Today we have a [MASK] on contextual representations [MASK] it's interesting
  (masked words: "lecture", "and")
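A minimal sketch of the random masking step. BERT's published masking rate is 15%, used as the default here; the paper's 80/10/10 mask/replace/keep rule and the subword tokenisation are omitted to keep the sketch short.

import random

def mask_tokens(tokens, k=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask a fraction k of tokens; return the masked sequence and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < k:
            targets[i] = tok            # the model must predict this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "today we have a lecture on contextual representations and it 's interesting".split()
masked, targets = mask_tokens(tokens, k=0.15)
print(masked)
print(targets)                           # position -> original word to predict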

Objective 2: Next Sentence Prediction

• Learn relationships between sentences
• Predicts whether sentence B follows sentence A
• Useful pre-training objective for downstream applications that analyse sentence pairs (e.g. textual entailment)

  Sentence A: Today we have a lecture on NLP.
  Sentence B: It is an interesting lecture.
  Label: IsNextSentence

  Sentence A: Today we have a lecture on NLP.
  Sentence B: Polar bears are white.
  Label: NotNextSentence
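A sketch of how such training pairs can be built from a document collection. The 50/50 split between true and random "next" sentences follows the BERT paper; everything else (e.g. making sure the random sentence really comes from a different document) is simplified.

import random

def make_nsp_examples(documents, seed=0):
    """Build (sentence A, sentence B, label) pairs for next sentence prediction."""
    rng = random.Random(seed)
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:                      # true next sentence
                a, b, label = doc[i], doc[i + 1], "IsNextSentence"
            else:                                       # random sentence from the corpus
                other = rng.choice(documents)
                a, b, label = doc[i], rng.choice(other), "NotNextSentence"
            examples.append((a, b, label))
    return examples

docs = [
    ["Today we have a lecture on NLP.", "It is an interesting lecture."],
    ["Polar bears are white.", "They live in the Arctic."],
]
for a, b, label in make_nsp_examples(docs):
    print(label, "|", a, "||", b)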

Training/Model Details

• WordPiece (subword) tokenisation (illustrated in the sketch below)
• Multiple layers of transformers to learn contextual representations
• Models trained on Wikipedia + BookCorpus
• Training takes multiple GPUs over several days
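WordPiece splits rare or unseen words into known subword units. The greedy longest-match-first segmentation below illustrates the idea with a made-up vocabulary; real tokenisers also learn the vocabulary itself and handle casing, punctuation and special tokens.

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation (illustration only)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:                       # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate     # continuation pieces are prefixed with ##
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                         # cannot segment the word
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
vocab = {"contextual", "represent", "##ation", "##s", "lecture"}
print(wordpiece_tokenize("representations", vocab))   # ['represent', '##ation', '##s']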

Fine-Tuning for BERT

[Figure from http://jalammar.github.io/illustrated-bert/]


Example: Spam Detection

[Figure from http://jalammar.github.io/illustrated-bert/]

Example: Spam Detection

• During fine-tuning, the parameters of the whole network are updated!

[Figure from http://jalammar.github.io/illustrated-bert/]
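As a rough sketch of what the added classification layer does, the code below takes a stand-in [CLS] vector from the final BERT layer and maps it to spam/not-spam probabilities. The weights and the vector are invented (768 is BERT-base's hidden size); in a real setup the loss would be backpropagated through the whole encoder during fine-tuning.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_model, n_classes = 768, 2              # BERT-base hidden size, binary labels
rng = np.random.default_rng(5)

# Stand-in for the final-layer [CLS] representation of an email.
cls_vector = rng.normal(size=d_model)

# The added classification head: a single linear layer + softmax.
W = rng.normal(scale=0.02, size=(n_classes, d_model))
b = np.zeros(n_classes)
probs = softmax(W @ cls_vector + b)
print(dict(zip(["not_spam", "spam"], probs.round(3))))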

BERT vs. ELMo

• ELMo provides only the contextual representations
• Downstream applications have their own network architecture
• ELMo parameters are fixed when applied to downstream applications
  ‣ Only the weights to combine states from different LSTM layers are learned

BERT vs. ELMo

• BERT adds a classification layer for downstream tasks
  ‣ No task-specific model needed
• BERT updates all parameters during fine-tuning

How Good is BERT?

• MNLI, RTE: textual entailment
• QQP, STS-B, MRPC: sentence similarity
• QNLI: answerability prediction
• SST: sentiment analysis
• CoLA: sentence acceptability prediction

Transformers

• What are transformers, and how do they work?

Attention is All You Need

• Vaswani et al. (2017): https://arxiv.org/abs/1706.03762
• Uses attention instead of RNNs (or CNNs) to capture dependencies between words

[Figure: the contextual representation for "made" in "I made her duck" is computed as a weighted sum over the representations of all words in the sentence.]

Self-Attention: Implementation

• Input:
  ‣ query q (e.g. "made")
  ‣ key k and value v (e.g. "her")
• Query, key and value are all vectors
  ‣ linear projections from embeddings

A(q, K, V) = Σi [ exp(q·ki) / Σj exp(q·kj) ] × vi

e.g. c_made = 0.1 v_I + 0.6 v_her + 0.3 v_duck

Self-Attention: Implementation

A(q, K, V) = Σi [ exp(q·ki) / Σj exp(q·kj) ] × vi

• Multiple queries: stack them in a matrix
  A(Q, K, V) = softmax(QK⊤)V
• Uses scaled dot-product to prevent values from growing too large
  A(Q, K, V) = softmax(QK⊤ / √dk)V
  where dk is the dimension of the query and key vectors

(A NumPy sketch of the scaled version follows below.)
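A small NumPy sketch of the scaled dot-product attention formula above, with made-up toy dimensions; in a real Transformer, Q, K and V come from learned linear projections of the token representations.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                      # contextual vectors + attention weights

# Toy example: 4 tokens ("I made her duck"), d_k = d_v = 3 (invented values).
rng = np.random.default_rng(6)
X = rng.normal(size=(4, 3))
Wq, Wk, Wv = (rng.normal(size=(3, 3)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

context, attn = scaled_dot_product_attention(Q, K, V)
print(attn[1].round(2))       # how much "made" attends to each of the four words
print(context[1].round(2))    # contextual representation for "made"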

• Only one attention for each word pair
• Uses multi-head attention to allow multiple interactions (see the sketch below)
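A sketch of the multi-head idea under the usual formulation: split the model dimension across several heads, run attention in each head, concatenate the results, and apply an output projection. All sizes here are toy values and the projection details are simplified (each head uses a slice of one shared projection matrix).

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Attend per head on a slice of the projections, concatenate, then project."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)     # this head's slice of each projection
        heads.append(attention(X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo       # (n, d_model)

# Toy sizes: 4 tokens, model dimension 8, 2 heads (all invented for the example).
rng = np.random.default_rng(7)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(scale=0.3, size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2).shape)   # (4, 8)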

Transformer Block

[Figure]

A Final Word

• Contextual representations are very useful
• Pre-trained on very, very large corpora
  ‣ Builds up some knowledge about language
  ‣ Uses unsupervised objectives
• When we use them for downstream tasks, we are no longer starting from "scratch"

Further Reading

• ELMo: https://arxiv.org/abs/1802.05365v2
• BERT: https://arxiv.org/abs/1810.04805
• Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html