Machine Translation
COMP90042 Natural Language Processing, Lecture 17
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE

Introduction

Machine translation (MT) is the task of translating text from a source language into a target language.

虽然北风呼啸，但天空依然十分清澈
However, the sky remained clear under the strong north wind

Why?
• Removes the language barrier
• Makes information in any language accessible to anyone
• But translation is a classic "AI-hard" challenge
  ‣ Difficult to preserve the meaning and the fluency of the text after translation

MT is Difficult
• Not just simple word-for-word translation
• Structural changes, e.g., syntax and semantics
• Multiple word translations, idioms
• Inflections for gender, case, etc.
• Missing information (e.g., determiners)

虽然     北    风    呼啸   ，  但   天空  依然   十分       清澈
although north wind  howls  ,   but  sky   still  extremely  limpid

Outline
• Statistical Machine Translation
  ‣ Word-based approach
• Neural Machine Translation
  ‣ Encoder-decoder architecture
• Attention Mechanism
  ‣ Improvement to neural MT
• Machine Translation Evaluation

Statistical MT

Early Machine Translation
• Started in the early 1950s
• Motivated by the Cold War: translate Russian to English
• Goal: translate 1–2 million words an hour within 5 years
• Rule-based systems
  ‣ Use a bilingual dictionary to map Russian words to English words

Statistical MT
• Given a French sentence f, the aim is to find the best English sentence e:

  argmax_e P(e | f)

• Use Bayes' rule to decompose this into two components:

  argmax_e P(f | e) P(e)

Language vs Translation Model
• argmax_e P(f | e) P(e)
• P(e): language model
  ‣ learns how to write fluent English text
• P(f | e): translation model
  ‣ learns how to translate words and phrases from English to French
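
The slides stop at the formula, but it may help to see the decomposition as code. Below is a minimal, hypothetical sketch of scoring candidate English translations with the noisy-channel objective log P(f|e) + log P(e); the translation_model and language_model callables and the candidate set are illustrative assumptions, not part of the lecture.

```python
import math

def noisy_channel_score(f, e, translation_model, language_model):
    """Score an English candidate e for a French sentence f as
    log P(f | e) + log P(e). Both models are hypothetical callables
    that return probabilities."""
    return math.log(translation_model(f, e)) + math.log(language_model(e))

def best_translation(f, candidates, translation_model, language_model):
    # argmax_e P(f | e) P(e), taken over some candidate set of English sentences
    return max(candidates,
               key=lambda e: noisy_channel_score(f, e, translation_model, language_model))
```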

How to Learn LM and TM?
• Language model:
  ‣ Based on text frequency in large monolingual corpora
• Translation model:
  ‣ Based on word co-occurrences in parallel corpora
  ‣ i.e. English–French sentence pairs

Parallel Corpora
• One text in multiple languages
• Produced by human translation
  ‣ Bible, news articles, legal transcripts, literature, subtitles
  ‣ Open parallel corpus: http://opus.nlpl.eu/

[Image: the Rosetta Stone, inscribed in Ancient Egyptian hieroglyphs, Demotic and Ancient Greek]

Models of Translation
• How do we learn P(f | e) from parallel text?
• We only have sentence pairs; words are not aligned in the parallel text
• It is not word-to-word translation:
  ‣ rearrangement of words

Zhang, Liu, Li, Zhou, Zong (2016)

Alignment
• Idea: introduce word alignment a as a latent variable into the model

  P(f, a | e)

https://en.wikipedia.org/wiki/Bitext_word_alignment#/media/File:Word_alignment.svg

Complexity of Alignment
• Some words are dropped and have no alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993

Complexity of Alignment
• One-to-many alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993

Complexity of Alignment
• Many-to-one alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993

Complexity of Alignment
• Many-to-many alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993

Learning the Alignment
• Word alignments are rarely provided in parallel corpora
  ‣ As such, the alignment a is introduced as a latent variable
• Use algorithms such as expectation maximisation (EM) to learn it (a minimal sketch follows below)
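
The lecture leaves EM at the conceptual level; below is a minimal sketch of the EM loop for a word-based translation model in the spirit of IBM Model 1 (no NULL word, no alignment priors). The toy corpus and function name are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

# Toy parallel corpus (illustrative only): (foreign sentence, English sentence)
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]

def train_ibm_model1(corpus, iterations=10):
    """EM for an IBM Model 1-style model: learn word translation probabilities t(f | e)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialisation of t(f | e)
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: collect expected alignment counts given the current t
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f | e) from the expected counts
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

t = train_ibm_model1(corpus)
print(t[("maison", "house")])   # should approach 1.0 as EM converges
```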

Statistical MT: Summary
• A very popular field of research in NLP prior to the 2010s
• Lots of feature engineering
• State-of-the-art systems are very complex
  ‣ Difficult to maintain
  ‣ Significant effort needed for new language pairs

Neural Machine Translation

Introduction
• Neural machine translation is a newer approach to machine translation
• Uses a single neural model to translate directly from source to target
• Architecture: encoder-decoder model
  ‣ 1st RNN encodes the source sentence
  ‣ 2nd RNN decodes the target sentence

[Figure: encoder RNN (RNN1) reads the source sentence "牛 吃 草"; its final hidden state encodes the whole source sentence and is used as the initial state of the decoder RNN (RNN2), which generates the target sentence "cow eats grass <end>"]

  h_i = RNN1(h_{i-1}, x_i)
  s_t = RNN2(s_{t-1}, y_t)
  s_1 = h_{|x|}
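
As a concrete illustration of the encoder-decoder idea, here is a minimal PyTorch sketch (an assumption of this write-up, not code from the lecture): RNN1 encodes the source, its final hidden state initialises RNN2, and RNN2 produces a distribution over the target vocabulary at each step. Vocabulary sizes, layer sizes and the choice of GRUs are illustrative.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: RNN1 encodes the source, RNN2 decodes the target."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.rnn1 = nn.GRU(emb, hidden, batch_first=True)   # encoder RNN
        self.rnn2 = nn.GRU(emb, hidden, batch_first=True)   # decoder RNN
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        _, h = self.rnn1(self.src_emb(src))                 # h encodes the whole source sentence
        dec_states, _ = self.rnn2(self.tgt_emb(tgt_in), h)  # decoder state s_1 initialised with h_|x|
        return self.out(dec_states)                         # logits over the target vocabulary at each step

# Usage sketch with toy sizes: batch of 2, source length 3, target length 4
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 3))
tgt_in = torch.randint(0, 1000, (2, 4))
print(model(src, tgt_in).shape)   # torch.Size([2, 4, 1000])
```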

Neural MT
• The decoder RNN can be interpreted as a conditional language model
  ‣ Language model: predicts the next word given previous words in the target sentence y
  ‣ Conditional: the prediction is also conditioned on the source sentence x

  P(y|x) = P(y_1|x) P(y_2|y_1, x) ... P(y_t|y_1, ..., y_{t-1}, x)

Training Neural MT
• Requires a parallel corpus, just like statistical MT
• Trained with next-word prediction, just like a language model!

Language Model Training Loss

[Figure: at each decoder step, the loss is the negative log-probability of the correct next word ("cow", "eats", "grass", "<end>"); the total loss is the sum L_total = L1 + L2 + L3 + L4]

Neural MT Training Loss

[Figure: the same per-step negative log-probability losses as for the language model, but the decoder RNN2 is now conditioned on the encoder RNN1 run over the source sentence "牛 吃 草"; L_total = L1 + L2 + L3 + L4]

Training

[Figure: encoder RNN1 over "牛 吃 草"; decoder RNN2 is fed "<start> cow eats grass" and predicts "cow eats grass <end>"]

• During training, we have the target sentence
• We can therefore feed the right word from the target sentence, one step at a time (often called teacher forcing; see the sketch below)
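
Continuing the hypothetical Seq2Seq sketch above (so not fully self-contained: it assumes that class is in scope), one teacher-forced training step could look like this. The gold words tgt[:, :-1] are fed as decoder inputs and the model is penalised with cross-entropy (negative log-probability) for each next word tgt[:, 1:]; shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# Assumes the hypothetical Seq2Seq class sketched earlier is already defined.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()   # negative log-probability of the gold next word

src = torch.randint(0, 1000, (2, 3))   # toy source batch: 2 sentences of length 3
tgt = torch.randint(0, 1000, (2, 5))   # toy target batch: <start> w1 w2 w3 <end>

# Teacher forcing: decoder inputs are the gold target words shifted right by one.
logits = model(src, tgt[:, :-1])                    # (batch, tgt_len - 1, vocab)
loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(loss.item())   # mean of the per-step negative log-probabilities
```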

Decoding at Test Time

[Figure: the decoder RNN2 feeds each predicted word back in as the input for the next step, generating "cow eats grass <end>" via an argmax at every step]

• But at test time, we don't have the target sentence (that's what we're trying to predict!)
• argmax: take the word with the highest probability at every step (a small sketch follows below)
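
A minimal sketch of argmax decoding, assuming a hypothetical model interface next_word_logprobs(src, prefix) that returns log-probabilities for the next word given the source and the words generated so far:

```python
def greedy_decode(src, next_word_logprobs, max_len=50):
    """Greedy (argmax) decoding: always take the most probable next word."""
    prefix = ["<start>"]
    for _ in range(max_len):
        logprobs = next_word_logprobs(src, prefix)   # {word: log P(word | prefix, src)}
        word = max(logprobs, key=logprobs.get)       # argmax at this step
        prefix.append(word)
        if word == "<end>":
            break
    return prefix[1:]   # generated words, including <end> if it was produced
```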

Greedy Decoding
• argmax decoding is also called greedy decoding
• Issue: it does not guarantee an output with optimal probability P(y|x)

Exhaustive Search Decoding
• To find the optimal P(y|x), we need to consider every word at every step to compute the probability of all possible sequences
• O(V^n), where V = vocabulary size and n = sentence length
• Far too expensive to be feasible

Beam Search Decoding
• Instead of considering all possible words at every step, consider the k best words
• That is, we keep track of the top-k hypotheses (partial translations) with the best scores thus far
• k = beam width (typically 5 to 10)
• k = 1 recovers greedy decoding
• k = V recovers exhaustive search decoding

Beam search decoding with k = 2 (worked example)

[Figure: a search tree grown step by step from <start>. At each step, every hypothesis on the beam is expanded, each expansion is scored by adding the log-probability of the new word to the running score, and only the k = 2 best hypotheses are kept:

  <start> → cow    -0.5 = log P(cow | x, <start>)
          → tiger  -0.7 = log P(tiger | x, <start>)

  cow   → eats   -1.3 = log P(eats | x, cow) + (-0.5)
        → chews  -1.9 = log P(chews | x, cow) + (-0.5)
  tiger → bites  -1.5 = log P(bites | x, tiger) + (-0.7)
        → kills  -1.8 = log P(kills | x, tiger) + (-0.7)

  cow eats    → grass   -2.3        tiger bites → hard  -2.9
              → carrot  -2.5                    → man   -2.8

  The hypothesis "cow eats grass" is eventually completed with the best translation probability; following the back-pointers from the completed hypothesis recovers the target sentence.]

When to Stop?
• When decoding, we stop when we generate the <end> token
• But multiple hypotheses may terminate their sentences at different time steps
• We store hypotheses that have terminated, and continue expanding those that haven't
• Typically we also set a maximum sentence length that can be generated (e.g. 50 words); a minimal beam search sketch incorporating these points is given below
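
Putting the last few slides together, here is a minimal beam search sketch. It assumes the same hypothetical next_word_logprobs(src, prefix) interface as the greedy example, keeps the k best partial hypotheses, stores hypotheses that generate <end>, and caps the generated length.

```python
def beam_search(src, next_word_logprobs, k=5, max_len=50):
    """Beam search decoding sketch: each hypothesis is (log-prob score, word list)."""
    beam = [(0.0, ["<start>"])]
    finished = []                                    # terminated hypotheses
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for word, logp in next_word_logprobs(src, prefix).items():
                candidates.append((score + logp, prefix + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, prefix in candidates[:k]:         # keep only the k best expansions
            if prefix[-1] == "<end>":
                finished.append((score, prefix))     # store and stop expanding this one
            else:
                beam.append((score, prefix))
        if not beam:                                 # everything on the beam has terminated
            break
    finished.extend(beam)                            # hypotheses cut off at max_len
    best_score, best_words = max(finished, key=lambda c: c[0])
    return best_words                                # includes <start> and, if generated, <end>
```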

Neural MT: Summary
• Single end-to-end model
  ‣ Statistical MT systems have multiple sub-components
• Less feature engineering
• Generated translation is more fluent
• But it can produce new details that are not in the source sentence (hallucination)

Lee, Firat, Agarwal, Fannjiang, Sussillo, 2018

Attention (!)

[Figure: the encoder-decoder model again — the single vector passed from encoder RNN1 ("牛 吃 草") to decoder RNN2 ("cow eats grass") encodes the whole source sentence]

• With a long source sentence, this encoded vector is unlikely to capture all the information in the sentence
• This creates an information bottleneck

Attention
• At every time step, allow the decoder to 'attend' to words in the source sentence

[Figure: attention in action, built up over four decoding steps. At each decoder time step, the decoder state is compared with every encoder hidden state (dot product), a softmax turns the scores into an attention distribution, and a weighted sum of the encoder states is used to predict the next word. The first time step attends to "牛" to predict "cow", the second attends to "吃" to predict "eats", and so on for "grass" and <end>]

Encoder-Decoder with Attention
• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)
• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)
• For each time step t in the decoder, attend to each of the encoder hidden states to produce the attention weights:

  e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_{|x|}]

• Apply softmax to the attention weights to get a valid probability distribution:

  α_t = softmax(e_t)

• Compute a weighted sum of the encoder hidden states (the context vector):

  c_t = Σ_{i=1}^{|x|} α_{t,i} h_i

• Concatenate c_t and s_t to predict the next word (a minimal code sketch follows below)
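
The equations above translate almost line-for-line into code. Below is a toy numpy sketch of one decoder time step of dot-product attention; the random vectors stand in for real encoder/decoder hidden states.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def dot_product_attention(s_t, H):
    """One decoder step of dot-product attention.
    s_t: decoder state at time t, shape (d,)
    H:   encoder hidden states h_1..h_|x| stacked as rows, shape (|x|, d)"""
    e_t = H @ s_t             # e_t[i] = s_t^T h_i
    alpha_t = softmax(e_t)    # attention distribution over source words
    c_t = alpha_t @ H         # context vector: weighted sum of encoder states
    return c_t, alpha_t

# Toy usage: 3 source words, hidden size 4
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
s_t = rng.normal(size=4)
c_t, alpha_t = dot_product_attention(s_t, H)
print(alpha_t.sum())          # sums to 1: a valid probability distribution
# [c_t; s_t] would then be concatenated and fed to the output layer to predict the next word
```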

Variants
• Attention score functions:
  ‣ Dot product: s_t^T h_i
  ‣ Bilinear: s_t^T W h_i
  ‣ Additive: v^T tanh(W_s s_t + W_h h_i)
• c_t can be injected into the current state (s_t), or into the input word (y_t)
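
For completeness, the three score functions as a toy numpy sketch; W, W_s, W_h and v are learned parameters in a real model and are random placeholders here.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
s_t, h_i = rng.normal(size=d), rng.normal(size=d)       # decoder / encoder states
W = rng.normal(size=(d, d))
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

dot_score = s_t @ h_i                                    # dot product: s_t^T h_i
bilinear_score = s_t @ W @ h_i                           # bilinear: s_t^T W h_i
additive_score = v @ np.tanh(W_s @ s_t + W_h @ h_i)      # additive: v^T tanh(W_s s_t + W_h h_i)
print(dot_score, bilinear_score, additive_score)
```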

Attention: Summary
• Solves the information bottleneck issue by allowing the decoder to have direct access to the source sentence words
• Provides some form of interpretability
  ‣ Attention weights can be seen as word alignments
• Most state-of-the-art MT systems are based on this architecture
  ‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)

Evaluation

MT Evaluation
• BLEU: compute n-gram overlap between the "reference" translation and the generated translation
• Typically computed for 1- to 4-grams (N = 4)

  BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )

  p_n = (# correct n-grams) / (# predicted n-grams)

  BP = min(1, output length / reference length)   ("brevity penalty" to penalise short outputs)
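
A small sketch of BLEU as defined on the slide: the geometric mean of 1- to 4-gram precisions times the brevity penalty, for a single reference and with no smoothing (real implementations such as sacreBLEU add further details, e.g. a different brevity penalty formula).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, output, N=4):
    """BLEU following the slide's formula (single reference, unsmoothed)."""
    log_p = 0.0
    for n in range(1, N + 1):
        ref_counts = Counter(ngrams(reference, n))
        out_counts = Counter(ngrams(output, n))
        correct = sum((out_counts & ref_counts).values())   # predicted n-grams also in the reference
        predicted = max(sum(out_counts.values()), 1)
        if correct == 0:
            return 0.0              # any zero precision makes unsmoothed BLEU zero
        log_p += math.log(correct / predicted)
    bp = min(1.0, len(output) / len(reference))             # brevity penalty
    return bp * math.exp(log_p / N)

ref = "the cow eats green grass in the field".split()
out = "the cow eats green grass".split()
print(round(bleu(ref, out), 3))   # 0.625: all n-gram precisions are 1, but BP = 5/8
```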

A Final Word
• Statistical MT
• Neural MT
• The encoder-decoder with attention architecture is a general architecture that can be used for other tasks
  ‣ Summarisation (lecture 21)
  ‣ Dialogue generation

Reading
• https://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes06-NMT_seq2seq_attention.pdf (Sections 1-5)