
Machine Translation
COMP90042
Natural Language Processing
Lecture 17
Semester 1 2021 Week 9 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

Introduction
• Machine translation (MT) is the task of translating text from a source language to a target language

Example (Chinese → English):
虽然北风呼啸,但天空依然十分清澈
However, the sky remained clear under the strong north wind

Why?
• Removes the language barrier
• Makes information in any language accessible to anyone
• But translation is a classic "AI-hard" challenge
  ‣ Difficult to preserve the meaning and fluency of the text after translation

How good do you find machine translation (e.g. Google Translate)?
• I’ve never used it before
• Bad, it’s not really usable
• Usable, but it needs a lot of manual correction
• Quite good, I use it quite often
• Perfect, machine translation is a solved task
PollEv.com/jeyhanlau569

MT is Difficult
• Not just simple word-for-word translation
• Structural changes, e.g., syntax and semantics
• Multiple-word translations, idioms
• Inflections for gender, case, etc.
• Missing information (e.g., determiners)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈
although north wind howls , but sky still extremely limpid

Outline
• Statistical Machine Translation
• Neural Machine Translation
• Attention Mechanism
• Evaluation

Statistical MT

Early Machine Translation
• Started in the early 1950s
• Motivated by the Cold War: translate Russian to English
• Goal: translate 1-2 million words an hour within 5 years
• Rule-based systems
  ‣ Use a bilingual dictionary to map Russian words to English words

Statistical MT
• Given a French sentence f, the aim is to find the best English sentence e
  ‣ argmax_e P(e|f)
• Use Bayes' rule to decompose this into two components
  ‣ argmax_e P(f|e)P(e)
  ‣ (P(e|f) ∝ P(f|e)P(e), since the denominator P(f) does not depend on e)

Language vs Translation Model
• argmax_e P(f|e)P(e)
• P(e): language model
  ‣ learns how to write fluent English text
• P(f|e): translation model
  ‣ learns how to translate words and phrases from English to French
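To make the roles of the two models concrete, here is a toy sketch of noisy-channel scoring: candidate English sentences are ranked by log P(f|e) + log P(e). The candidate list and the two stand-in scoring functions are invented purely for illustration; real systems estimate these models from monolingual and parallel corpora.

```python
# Toy sketch of argmax_e P(f|e) P(e) in log space.
# lm_log_prob and tm_log_prob are made-up stand-ins, not learned models.
def lm_log_prob(e):            # stand-in for log P(e), the language model
    return {"the cat eats": -2.0, "cat the eats": -6.0}.get(e, -10.0)

def tm_log_prob(f, e):         # stand-in for log P(f|e), the translation model
    return -1.0 if set(e.split()) >= {"cat", "eats"} else -8.0

f = "le chat mange"
candidates = ["the cat eats", "cat the eats", "the dog sleeps"]
best = max(candidates, key=lambda e: tm_log_prob(f, e) + lm_log_prob(e))
print(best)  # "the cat eats": adequate (high TM score) and fluent (high LM score)
```

With a reasonable language model, the fluent word order wins even though the toy translation model cannot distinguish the two orderings on its own.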

How to Learn LM and TM?
• Language model:
  ‣ Text statistics in large monolingual corpora
• Translation model:
  ‣ Word co-occurrences in parallel corpora
  ‣ i.e. English-French sentence pairs

Parallel Corpora
• One text in multiple languages
• Produced by human translation
  ‣ Bible, news articles, legal transcripts, literature, subtitles
  ‣ Open parallel corpus: http://opus.nlpl.eu/

[Image: the Rosetta Stone, which carries the same text in Ancient Egyptian, Demotic, and Ancient Greek]

Models of Translation
• How to learn P(f|e) from parallel text?
• We only have sentence pairs; words are not aligned in the parallel text
  ‣ i.e. we don't have word-to-word translations

Zhang, Liu, Li, Zhou, Zong (2016)

Alignment
• Idea: introduce word alignment as a latent variable into the model
  ‣ P(f, a | e)
• Use algorithms such as expectation maximisation (EM) to learn it (e.g. GIZA++)
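The lecture leaves the EM details to GIZA++, but a minimal sketch of IBM Model 1-style EM (treating the alignment of each French word as a latent variable) might look as follows. The three-sentence toy corpus and the fixed number of iterations are illustrative only; real alignment toolkits use far larger corpora and richer models.

```python
# Minimal sketch of IBM Model 1 EM for word translation probabilities
# t(f|e) from sentence-aligned parallel text (toy corpus, no NULL word).
from collections import defaultdict

# (English sentence, French sentence) pairs, tokenised
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # initialise t(f|e) uniformly

for _ in range(10):                           # EM iterations
    count = defaultdict(float)                # expected counts c(f, e)
    total = defaultdict(float)                # expected counts c(e)
    for es, fs in corpus:
        for f in fs:
            # E-step: distribute each French word over the English words
            # in proportion to the current t(f|e)
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate t(f|e) from the expected counts
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("livre", "book")], 3))  # higher than t[("le", "book")] after EM
```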

Complexity of Alignment
• Some words are dropped and have no alignment
• One-to-many alignment
• Many-to-one alignment
• Many-to-many alignment
(Examples from Brown, Della Pietra, Della Pietra, Mercer, 1993)

Statistical MT: Summary
• A very popular field of research in NLP prior to the 2010s
• Lots of feature engineering
• State-of-the-art systems are very complex
  ‣ Difficult to maintain
  ‣ Significant effort needed for new language pairs

Neural Machine Translation

Introduction
• Neural machine translation is a newer approach to machine translation
• Uses a single neural model to translate directly from source to target
• Requires parallel text
• Architecture: encoder-decoder model
  ‣ 1st RNN to encode the source sentence
  ‣ 2nd RNN to decode the target sentence

[Diagram: the encoder RNN (RNN1) reads the source sentence "牛 吃 草"; its final hidden state encodes the whole source sentence and is used as the initial state for the decoder RNN (RNN2), which generates the target sentence "cow eats grass".]

Encoder: h_i = RNN1(h_{i-1}, x_i)
Decoder: s_t = RNN2(s_{t-1}, y_t)
Decoder initial state: s_1 = h_{|x|}
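A minimal sketch of this encoder-decoder architecture, assuming PyTorch GRUs; the vocabulary sizes, hidden dimension, and variable names are illustrative rather than the lecture's exact setup.

```python
# Minimal encoder-decoder sketch: RNN1 encodes the source, RNN2 decodes
# the target starting from the encoder's final hidden state (s_1 = h_|x|).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # RNN1
        self.decoder = nn.GRU(dim, dim, batch_first=True)   # RNN2
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        # Encode the source sentence; h_n is the final hidden state
        _, h_n = self.encoder(self.src_emb(src_ids))
        # Decode, starting from the encoder's final state
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in_ids), h_n)
        # Distribution over the target vocabulary at each step
        return self.out(dec_states)

model = Seq2Seq(src_vocab=5000, tgt_vocab=5000)
src = torch.randint(0, 5000, (1, 3))      # e.g. "牛 吃 草"
tgt_in = torch.randint(0, 5000, (1, 4))   # e.g. "<s> cow eats grass"
logits = model(src, tgt_in)               # shape (1, 4, tgt_vocab)
```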

Neural MT
• The decoder RNN can be interpreted as a conditional language model
  ‣ Language model: predicts the next word given previous words in the target sentence y
  ‣ Conditional: prediction is also conditioned on the source sentence x

P(y|x) = P(y_1|x) P(y_2|y_1, x) … P(y_t|y_1, …, y_{t-1}, x)

Training Neural MT
• Requires a parallel corpus, just like statistical MT
• Trains with next word prediction, just like a language model!

Language Model Training Loss

[Diagram: at each step the RNN (RNN2) predicts the next word; the per-step losses are the negative log probabilities of the correct words, e.g. L1 = −log P(cow), L2 = −log P(eats), L3 = −log P(grass), L4 = −log P(end-of-sentence token), and the total loss is L_total = L1 + L2 + L3 + L4.]

Neural MT Training Loss

[Diagram: the same next-word prediction loss as the language model, but the decoder (RNN2) is now conditioned on the source sentence "牛 吃 草" encoded by RNN1; L_total = L1 + L2 + L3 + L4, the sum of the negative log probabilities of "cow", "eats", "grass" and the end-of-sentence token.]

Training

[Diagram: the encoder reads "牛 吃 草"; the decoder is fed the correct previous target word at each step while being trained to predict the next word of "cow eats grass".]

• During training, we have the target sentence
• We can therefore feed the right word from the target sentence, one step at a time (often called teacher forcing; see the training sketch below)
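A minimal sketch of one training step with teacher forcing, assuming PyTorch: the decoder input is the gold target shifted right by one position, and the loss is the (averaged) negative log probability of the correct next words. The model sizes and random token ids are placeholders, not the lecture's setup.

```python
# One NMT training step with teacher forcing and cross-entropy loss.
import torch
import torch.nn as nn

V = 1000                                   # toy vocab size (shared here)
emb = nn.Embedding(V, 64)
enc = nn.GRU(64, 64, batch_first=True)     # RNN1
dec = nn.GRU(64, 64, batch_first=True)     # RNN2
out = nn.Linear(64, V)
params = list(emb.parameters()) + list(enc.parameters()) + \
         list(dec.parameters()) + list(out.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()            # mean negative log probability

src = torch.randint(0, V, (8, 5))          # batch of source sentences
tgt = torch.randint(0, V, (8, 6))          # gold targets, incl. <s> / </s>

_, h = enc(emb(src))                       # encode the source
dec_in, dec_gold = tgt[:, :-1], tgt[:, 1:] # teacher forcing: shift by one
states, _ = dec(emb(dec_in), h)
logits = out(states)                       # (batch, steps, V)
loss = loss_fn(logits.reshape(-1, V), dec_gold.reshape(-1))
loss.backward()
opt.step()
```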

Decoding at Test Time

[Diagram: at test time, the word predicted by argmax at each decoder step is fed back in as the input to the next step, generating "cow eats grass" one word at a time.]

• But at test time, we don't have the target sentence (that's what we're trying to predict!)
• argmax: take the word with the highest probability at every step
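A minimal sketch of this argmax decoding loop. The step_log_probs(source, prefix) function is a hypothetical stand-in for one decoder step returning a dictionary of next-word log probabilities; it is not part of any real library.

```python
# Greedy (argmax) decoding: pick the most probable word at each step
# and feed it back in as the next input.
def greedy_decode(source, step_log_probs, max_len=50):
    words = ["<s>"]
    while len(words) < max_len:
        probs = step_log_probs(source, words)
        next_word = max(probs, key=probs.get)   # argmax at this step
        words.append(next_word)                 # feed prediction back in
        if next_word == "</s>":
            break
    return words[1:]
```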

Exposure Bias

[Diagram: at test time the decoder feeds back its own (possibly wrong) predictions, drifting to e.g. "cow drinks water" instead of "cow eats grass".]

• Describes the discrepancy between training and testing
• Training: always have the ground truth tokens at each step
• Test: uses its own prediction at each step
• Outcome: model is unable to recover from its own error

Greedy Decoding
• argmax decoding is also called greedy decoding
• Issue: does not guarantee optimal probability P(y|x)

Exhaustive Search Decoding
• To find the optimal P(y|x), we need to consider every word at every step to compute the probability of all possible sequences
• O(V^n); V = vocab size, n = sentence length
• Far too expensive to be feasible

Beam Search Decoding
• Instead of considering all possible words at every step, consider the k best words
• That is, we keep track of the top-k words that produce the best partial translations (hypotheses) thus far
• k = beam width (typically 5 to 10)
  ‣ k = 1 is greedy decoding
  ‣ k = V is exhaustive search decoding

Beam Search Decoding Example (k = 2)

[Worked example over several slides: starting from <s>, keep the k = 2 best hypotheses at each step, scored by cumulative log probability. Step 1: log P(cow | x, <s>) = -0.5 and log P(tiger | x, <s>) = -0.7. Step 2: expand each hypothesis, e.g. "cow eats" scores log P(eats | x, cow) + (-0.5) = -1.3 and "tiger bites" scores -1.5, while "cow chews" (-1.9) and "tiger kills" (-1.8) fall outside the beam. Later steps keep expanding only the k best partial translations until hypotheses terminate; the completed hypothesis with the best total log probability is selected, and the target sentence is recovered by following back-pointers from its final word.]

When to Stop?
• When decoding, we stop when we generate the end-of-sentence token (e.g. </s>)
• But multiple hypotheses may terminate their sentences at different time steps
• We store hypotheses that have terminated, and continue to explore those that haven't
• Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
• A minimal beam search sketch incorporating these stopping rules follows below
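Putting the beam search and stopping rules together, here is a minimal sketch; step_log_probs(source, prefix) is the same hypothetical stand-in for a decoder step as in the greedy decoding sketch above. Setting k = 1 recovers greedy decoding, and k = V (the whole vocabulary) recovers exhaustive search.

```python
# Minimal beam search: keep the k best partial hypotheses, store the ones
# that generate </s>, and stop at a maximum length.
def beam_search(source, step_log_probs, k=2, max_len=50):
    beams = [(0.0, ["<s>"])]                  # (cumulative log prob, words)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, lp in step_log_probs(source, words).items():
                candidates.append((score + lp, words + [word]))
        # Keep only the k best partial translations
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, words in candidates[:k]:
            if words[-1] == "</s>":
                finished.append((score, words))   # hypothesis terminated
            else:
                beams.append((score, words))      # keep exploring
        if not beams:
            break
    finished.extend(beams)                        # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])      # best total log probability
```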

What are some issues of NMT?
• Information of the whole source sentence is represented by a single vector
• NMT can generate new details not in source sentence
• NMT tend to generate not very fluent sentences
• Black-box model; difficult to explain when it doesn’t work
PollEv.com/jeyhanlau569

Neural MT: Summary
• Single end-to-end model
  ‣ Statistical MT systems have multiple sub-components
• Less feature engineering
• Can produce new details that are not in the source sentence (hallucination)
(Lee, Firat, Agarwal, Fannjiang, Sussillo, 2018)

Attention Mechanism

[Diagram: in the plain encoder-decoder model, a single vector (the encoder's final hidden state) encodes the whole source sentence "牛 吃 草" and is all the decoder sees.]

• With a long source sentence, the encoded vector is unlikely to capture all the information in the sentence
• This creates an information bottleneck

Attention
• For the decoder, at every time step allow it to 'attend' to words in the source sentence

[Diagram, repeated over several decoding steps: at each decoder time step, the decoder state is compared with every encoder hidden state (dot product), a softmax turns these scores into an attention distribution, and the weighted sum of the encoder states is used to predict the next word. The first time step attends to "牛" to predict "cow", the second time step attends to "吃" to predict "eats", and so on for the remaining words.]

Encoder-Decoder with Attention
• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)
• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)
• For each time step t in the decoder, attend to each of the encoder hidden states to produce the attention weights:
  ‣ e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_{|x|}]
• Apply softmax to the attention weights to get a valid probability distribution:
  ‣ α_t = softmax(e_t)
• Compute the weighted sum of the encoder hidden states to get the context vector:
  ‣ c_t = Σ_{i=1}^{|x|} α_{ti} h_i
• Concatenate c_t and s_t to predict the next word
• A short code sketch of this computation follows below
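A minimal sketch of the attention computation above for a single decoder step, assuming PyTorch; the tensor sizes are illustrative.

```python
# Dot-product attention over encoder states for one decoder time step.
import torch
import torch.nn.functional as F

hidden = 64
enc_states = torch.randn(3, hidden)    # h_1..h_|x|, one per source word
s_t = torch.randn(hidden)              # current decoder state

e_t = enc_states @ s_t                 # attention weights: s_t^T h_i, shape (3,)
alpha_t = F.softmax(e_t, dim=0)        # attention distribution
c_t = alpha_t @ enc_states             # context vector: weighted sum of h_i

# Concatenate the context vector and decoder state to predict the next word
combined = torch.cat([c_t, s_t])       # shape (2 * hidden,)
# logits = output_layer(combined)      # a Linear(2*hidden, vocab) would go here
```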

Variants
• Attention:
  ‣ Dot product: s_t^T h_i
  ‣ Bilinear: s_t^T W h_i
  ‣ Additive: v^T tanh(W_s s_t + W_h h_i)
• c_t can be injected into the current state (s_t), or into the input word (y_t)
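A minimal sketch of the three scoring variants, assuming PyTorch; the weight matrices W, W_s, W_h and the vector v are illustrative parameters that would normally be learned.

```python
# The three attention scoring variants for one decoder state s_t and
# one encoder state h_i; each produces a scalar score.
import torch

dim = 64
s_t, h_i = torch.randn(dim), torch.randn(dim)
W, W_s, W_h = torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim)
v = torch.randn(dim)

dot_score = s_t @ h_i                                     # s_t^T h_i
bilinear_score = s_t @ W @ h_i                            # s_t^T W h_i
additive_score = v @ torch.tanh(W_s @ s_t + W_h @ h_i)    # v^T tanh(W_s s_t + W_h h_i)
```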

Attention: Summary
• Solves the information bottleneck issue by allowing the decoder to have access to the source sentence words directly
• Most state-of-the-art NMT systems use attention
  ‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)
• Provides some form of interpretability
  ‣ Attention weights can be seen as word alignments

Evaluation

MT Evaluation
• BLEU: compute n-gram overlap between the "reference" translation and the generated translation
• Typically computed for 1 to 4-grams

BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )

p_n = (# correct n-grams) / (# predicted n-grams)

BP = min(1, output length / reference length)   ("brevity penalty" to penalise short outputs)
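A minimal sketch of BLEU as defined above (n-gram precision for n = 1..4, with correct counts clipped against the reference as in standard BLEU, plus the brevity penalty). Real evaluations typically use a standard tool such as sacreBLEU; this toy version is for illustration only.

```python
# Sentence-level BLEU following the formula above.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, output, N=4):
    log_p = 0.0
    for n in range(1, N + 1):
        ref, out = ngrams(reference, n), ngrams(output, n)
        correct = sum(min(count, ref[g]) for g, count in out.items())  # clipped
        predicted = max(sum(out.values()), 1)
        # avoid log(0); real BLEU uses smoothing or reports 0 instead
        log_p += math.log(max(correct, 1e-9) / predicted)
    bp = min(1.0, len(output) / len(reference))    # brevity penalty
    return bp * math.exp(log_p / N)

ref = "the cat is on the mat".split()
out = "the cat sat on the mat".split()
print(round(bleu(ref, out), 3))
```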

A Final Word
• Statistical MT
• Neural MT
  ‣ Nowadays uses Transformers rather than RNNs
• The encoder-decoder with attention architecture is a general architecture that can be used for other tasks
  ‣ Summarisation (lecture 21)
  ‣ Other generation tasks such as dialogue generation

Reading
• JM3 Ch. 11.2-11.5
• Optional: GIZA++ word alignment – https://www.aclweb.org/anthology/W08-0509.pdf