Machine Translation
COMP90042
Natural Language Processing Lecture 17
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
Introduction
• Machine translation (MT) is the task of translating text from a source language into a target language
  ‣ Source (Chinese): 虽然北风呼啸,但天空依然十分清澈
  ‣ Target (English): However, the sky remained clear under the strong north wind
Why?
• Removes the language barrier
• Makes information in any language accessible to anyone
• But translation is a classic "AI-hard" challenge
  ‣ Difficult to preserve both the meaning and the fluency of the text after translation
MT is Difficult
• Not just simple word-for-word translation
• Structural changes, e.g., syntax and semantics
• Multiple-word translations, idioms
• Inflections for gender, case, etc.
• Missing information (e.g., determiners)

  虽然 北 风 呼啸 , 但 天空 依然 十分 清澈
  although north wind howls , but sky still extremely limpid
Outline
• Statistical Machine Translation
  ‣ Word-based approach
• Neural Machine Translation
  ‣ Encoder-decoder architecture
• Attention Mechanism
  ‣ Improvement to Neural MT
• Machine Translation Evaluation
Statistical MT
Early Machine Translation
• Started in the early 1950s
• Motivated by the Cold War to translate Russian to English
• Goal: translate 1-2 million words an hour within 5 years
• Rule-based systems
  ‣ Use a bilingual dictionary to map Russian words to English words
Statistical MT
• Given a French sentence f, the aim is to find the best English sentence e
  ‣ argmax_e P(e | f)
• Use Bayes' rule to decompose this into two components
  ‣ argmax_e P(f | e) P(e)
Language vs Translation Model
• argmax_e P(f | e) P(e)
• P(e): language model
  ‣ learns how to write fluent English text
• P(f | e): translation model
  ‣ learns how to translate words and phrases from English to French
(A minimal code sketch of this scoring follows below.)
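To make the decomposition concrete, here is a minimal, hypothetical Python sketch of noisy-channel scoring: candidate English sentences e are ranked by log P(f | e) + log P(e). The toy `lm_logprob` and `tm_logprob` tables are invented stand-ins for a real language model and translation model.

```python
# A minimal, hypothetical sketch of noisy-channel scoring: rank candidate
# English sentences e by log P(f | e) + log P(e). The toy score tables below
# are invented stand-ins for a real translation model and language model.

def lm_logprob(e):
    # log P(e): a real system would use an LM trained on monolingual English text
    toy_lm = {"the cat sits": -2.0, "cat the sits": -8.0}
    return toy_lm.get(e, -20.0)

def tm_logprob(f, e):
    # log P(f | e): a real system would learn this from parallel text
    toy_tm = {
        ("le chat est assis", "the cat sits"): -1.5,
        ("le chat est assis", "cat the sits"): -1.5,
    }
    return toy_tm.get((f, e), -20.0)

def best_translation(f, candidates):
    # argmax_e P(f | e) P(e), computed in log space
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))

print(best_translation("le chat est assis", ["the cat sits", "cat the sits"]))
# -> "the cat sits": both candidates explain f equally well, but the
#    language model prefers the fluent English word order
```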
How to Learn LM and TM?
• Language model:
  ‣ Based on text frequencies in large monolingual corpora
• Translation model:
  ‣ Based on word co-occurrences in parallel corpora
  ‣ i.e., English-French sentence pairs
Parallel Corpora
• One text in multiple languages
• Produced by human translation
  ‣ Bible, news articles, legal transcripts, literature, subtitles
  ‣ Open parallel corpus: http://opus.nlpl.eu/
[Figure: The Rosetta Stone, with the same text in Ancient Egyptian, Demotic, and Ancient Greek]
Models of Translation
• How to learn P(f | e) from parallel text?
• We only have sentence pairs; words are not aligned in the parallel text
• Not word-to-word translation:
  ‣ rearrangement of words
Zhang, Liu, Li, Zhou, Zong (2016)
Alignment
• Idea: introduce word alignment as a latent variable into the model
  ‣ P(f, a | e)
[Figure: example of bitext word alignment; https://en.wikipedia.org/wiki/Bitext_word_alignment#/media/File:Word_alignment.svg]
Complexity of Alignment
• Some words are dropped and have no alignment
Brown, Della Pietra, Della Pietra, Mercer (1993)
Complexity of Alignment
• One-to-many alignment
Brown, Della Pietra, Della Pietra, Mercer (1993)
Complexity of Alignment
• Many-to-one alignment
Brown, Della Pietra, Della Pietra, Mercer (1993)
Complexity of Alignment
• Many-to-many alignment
Brown, Della Pietra, Della Pietra, Mercer (1993)
Learning the Alignment
• Word alignments are rarely provided in parallel corpora
  ‣ As such, the alignment a is introduced as a latent variable
• Use algorithms such as expectation maximisation (EM) to learn it (a minimal sketch follows below)
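The slide does not spell out the algorithm; the following is a minimal sketch of EM for word-level translation probabilities, roughly in the spirit of IBM Model 1. The three-sentence toy corpus, the uniform initialisation, and the 10 iterations are assumptions for illustration only.

```python
from collections import defaultdict

# A toy parallel corpus (an assumption for illustration only)
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Initialise translation probabilities t(f | e) uniformly
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(10):
    count = defaultdict(float)   # expected count of (f, e) alignments
    total = defaultdict(float)   # expected count of e being aligned to anything
    # E-step: distribute each f's alignment probability over the e words
    # in the same sentence pair, according to the current t
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate t(f | e) from the expected counts
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 2))   # should move towards 1.0
```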
Statistical MT: Summary
• A very popular field of research in NLP prior to the 2010s
• Lots of feature engineering
• State-of-the-art systems are very complex
  ‣ Difficult to maintain
  ‣ Significant effort needed for new language pairs
Neural Machine Translation
Introduction
• Neural machine translation is a newer approach to machine translation
• It uses a single neural model to translate directly from source to target
• Architecture: encoder-decoder model
  ‣ 1st RNN encodes the source sentence
  ‣ 2nd RNN decodes the target sentence
This vector encodes the whole source sentence; it is used as the initial state for the decoder RNN.
[Figure: encoder RNN (RNN1) reads the source sentence 牛 吃 草; its final hidden state initialises the decoder RNN (RNN2), which generates the target sentence "cow eats grass"]
  h_i = RNN1(h_{i-1}, x_i)
  s_t = RNN2(s_{t-1}, y_t)
  s_1 = h_{|x|}
(A minimal code sketch of this architecture follows below.)
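A minimal sketch of this two-RNN encoder-decoder setup, assuming PyTorch; the GRU cells, vocabulary sizes, and dimensions are arbitrary choices for illustration, not the lecture's exact model.

```python
import torch
import torch.nn as nn

# Vocabulary sizes and dimensions are arbitrary placeholders
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn1 = nn.GRU(EMB, HID, batch_first=True)   # encoder RNN
        self.rnn2 = nn.GRU(EMB, HID, batch_first=True)   # decoder RNN
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        # h_i = RNN1(h_{i-1}, x_i); keep the final hidden state
        _, h_final = self.rnn1(self.src_emb(src_ids))
        # s_1 = h_{|x|}: the source encoding initialises the decoder
        dec_states, _ = self.rnn2(self.tgt_emb(tgt_ids), h_final)
        return self.out(dec_states)   # logits over the target vocabulary

model = EncoderDecoder()
src = torch.randint(0, SRC_VOCAB, (1, 3))   # e.g. the word ids of 牛 吃 草
tgt = torch.randint(0, TGT_VOCAB, (1, 3))   # e.g. <start> cow eats
print(model(src, tgt).shape)                # torch.Size([1, 3, 1000])
```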
Neural MT
• The decoder RNN can be interpreted as a conditional language model
  ‣ Language model: predicts the next word given previous words in the target sentence y
  ‣ Conditional: the prediction is also conditioned on the source sentence x
  P(y | x) = P(y_1 | x) P(y_2 | y_1, x) ... P(y_t | y_1, ..., y_{t-1}, x)
Training Neural MT
• Requires a parallel corpus, just like statistical MT
• Trains with next-word prediction, just like a language model!
Language Model Training Loss
[Figure: the decoder RNN2 predicts "cow", "eats", "grass", then the end-of-sentence token; each step's loss is the negative log probability of the correct word, and the total loss is the sum: L_total = L_1 + L_2 + L_3 + L_4]
Neural MT Training Loss
[Figure: the same per-word losses as the language model, but the decoder RNN2 is initialised from the encoder RNN1, which reads the source sentence 牛 吃 草; L_total = L_1 + L_2 + L_3 + L_4]
Training
[Figure: encoder RNN1 reads 牛 吃 草; the decoder RNN2 is fed the gold target words "cow eats grass", one per step]
• During training, we have the target sentence
• We can therefore feed the right word from the target sentence, one step at a time (a minimal sketch of the resulting loss follows below)
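A minimal sketch of the training loss under this scheme (feeding the gold word at each step is often called teacher forcing), assuming PyTorch. The toy vocabulary and the random logits stand in for a real decoder's output.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary; in practice this is the model's target vocabulary
vocab = ["<start>", "<end>", "cow", "eats", "grass"]

# Decoder input (the gold words, one step at a time): <start> cow  eats  grass
# Gold next words the decoder must predict:           cow     eats grass <end>
gold_next = torch.tensor([vocab.index(w) for w in ["cow", "eats", "grass", "<end>"]])

# Stand-in for the decoder's output: one row of logits per time step
logits = torch.randn(4, len(vocab), requires_grad=True)

# L_total = L_1 + L_2 + L_3 + L_4, each the negative log probability of the
# correct next word (cross_entropy applies log-softmax internally)
loss = F.cross_entropy(logits, gold_next, reduction="sum")
loss.backward()   # gradients flow back into the decoder (and encoder) parameters
print(loss.item())
```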
Decoding at Test Time
[Figure: the decoder RNN2 takes the argmax word from each step and feeds it back in as the input for the next step]
• But at test time, we don't have the target sentence (that's what we're trying to predict!)
• argmax: take the word with the highest probability at every step
Greedy Decoding
• argmax decoding is also called greedy decoding (a minimal sketch follows below)
• Issue: it does not guarantee an optimal P(y | x)
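A minimal sketch of greedy decoding; the `step` function below is an invented toy stand-in for one decoder step (a real decoder would also condition on the source sentence and a recurrent state).

```python
# A toy stand-in for one decoder step: returns a next-word distribution given
# only the previous word.
vocab = ["<start>", "cow", "eats", "grass", "<end>"]

def step(prev_word):
    table = {"<start>": "cow", "cow": "eats", "eats": "grass", "grass": "<end>"}
    probs = {w: 0.01 for w in vocab}
    probs[table[prev_word]] = 0.96
    return probs

def greedy_decode(max_len=50):
    word, output = "<start>", []
    while len(output) < max_len:
        probs = step(word)
        word = max(probs, key=probs.get)   # argmax: single best word each step
        if word == "<end>":
            break
        output.append(word)
    return output

print(greedy_decode())   # ['cow', 'eats', 'grass']
```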
Exhaustive Search Decoding
• To find the optimal P(y | x), we need to consider every word at every step to compute the probability of all possible sequences
• O(V^n), where V = vocabulary size and n = sentence length
• Far too expensive to be feasible
Beam Search Decoding
• Instead of considering all possible words at every step, consider the k best words
• That is, we keep track of the top-k words that produce the best partial translations (hypotheses) thus far
• k = beam width (typically 5 to 10)
• k = 1 is greedy decoding
• k = V is exhaustive search decoding
(A minimal sketch of beam search follows below.)
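A minimal sketch of beam search with k = 2 over an invented toy model; `next_logprobs` stands in for the decoder, and its scores are made up, chosen to roughly echo the worked example on the following slides.

```python
# A toy stand-in for the decoder: maps a partial translation (a tuple of words)
# to log probabilities of candidate next words. The scores are invented.
def next_logprobs(prefix):
    toy = {
        (): {"cow": -0.5, "tiger": -0.7},
        ("cow",): {"eats": -0.8, "chews": -1.4},
        ("tiger",): {"bites": -0.8, "kills": -1.1},
        ("cow", "eats"): {"grass": -1.0, "<end>": -3.0},
        ("cow", "chews"): {"carrot": -1.1, "<end>": -3.0},
        ("tiger", "bites"): {"man": -1.4, "<end>": -3.0},
        ("tiger", "kills"): {"man": -1.7, "<end>": -3.0},
    }
    return toy.get(prefix, {"<end>": 0.0})

def beam_search(k=2, max_len=10):
    beam = [((), 0.0)]           # (partial translation, total log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for word, lp in next_logprobs(prefix).items():
                if word == "<end>":
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + (word,), score + lp))
        if not candidates:
            break
        # keep only the k best partial translations (hypotheses)
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return max(finished, key=lambda c: c[1])

print(beam_search())   # (('cow', 'eats', 'grass'), -2.3) with these toy scores
```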
Beam search decoding with k=2
[Figure, spanning several slides: starting from the start token, the two best first words are kept (e.g. "cow" with log probability -0.5 and "tiger" with -0.7); each hypothesis is expanded with candidate next words ("eats", "chews", "bites", "kills", ...), and only the top-2 partial translations overall are retained at each step; the completed hypothesis with the best total log probability is selected]
Follow pointers to get the target sentence.
When to Stop?
• When decoding, we stop when we generate the end-of-sentence token
• But multiple hypotheses may terminate their sentences at different time steps
• We store hypotheses that have terminated, and continue exploring those that haven't
• Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
Neural MT: Summary
• Single end-to-end model
  ‣ Statistical MT systems have multiple sub-components
• Less feature engineering
• Generated translation is more fluent
• But it can produce new details that are not in the source sentence (hallucination)
Lee, Firat, Agarwal, Fannjiang, Sussillo (2018)
Attention (!)
This vector encodes the whole source sentence.
[Figure: the same encoder-decoder as before; the single encoder vector for 牛 吃 草 is all the decoder sees when generating "cow eats grass"]
• With a long source sentence, the encoded vector is unlikely to capture all the information in the sentence
• This creates an information bottleneck
Attention
• For the decoder, at every time step, allow it to 'attend' to words in the source sentence
[Figure, spanning several slides: at each decoding step the decoder state is compared to every encoder state (dot product), a softmax turns the scores into an attention distribution, and the weighted sum of the encoder states is combined with the decoder state to predict the next word. The first time step attends to "牛" to predict "cow", the second attends to "吃" to predict "eats", and the third attends to "草" to predict "grass"]
Encoder-Decoder with Attention
• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)
• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)
• For each time step t in the decoder, attend to each of the encoder hidden states to produce the attention weights:
  e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_{|x|}]
• Apply softmax to the attention weights to get a valid probability distribution:
  α_t = softmax(e_t)
• Compute the weighted sum of the encoder hidden states to get the context vector:
  c_t = Σ_{i=1}^{|x|} α_{t,i} h_i
• Concatenate c_t and s_t to predict the next word
(A minimal code sketch of these equations follows below.)
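A minimal sketch of these equations for a single decoder time step, assuming PyTorch and dot-product attention; the random vectors stand in for real encoder and decoder hidden states.

```python
import torch

HID, SRC_LEN = 128, 3
h = torch.randn(SRC_LEN, HID)   # encoder hidden states h_1 .. h_|x|
s_t = torch.randn(HID)          # current decoder hidden state s_t

e_t = h @ s_t                          # e_t = [s_t^T h_1, ..., s_t^T h_|x|]
alpha_t = torch.softmax(e_t, dim=-1)   # attention distribution
c_t = alpha_t @ h                      # context vector: weighted sum of the h_i

# Concatenate c_t and s_t; a linear layer + softmax over the target vocabulary
# would then predict the next word
combined = torch.cat([c_t, s_t], dim=-1)
print(alpha_t.shape, combined.shape)   # torch.Size([3]) torch.Size([256])
```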
Variants
• Attention:
  ‣ Dot product: s_t^T h_i
  ‣ Bilinear: s_t^T W h_i
  ‣ Additive: v^T tanh(W_s s_t + W_h h_i)
• c_t can be injected into the current state (s_t), or into the input word (y_t)
(A minimal sketch of the three scoring functions follows below.)
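A minimal sketch of the three scoring variants for one (s_t, h_i) pair, assuming PyTorch; W, W_s, W_h and v are random placeholders for what would be learned parameters.

```python
import torch

HID = 128
s_t, h_i = torch.randn(HID), torch.randn(HID)          # decoder / encoder states
W = torch.randn(HID, HID)                               # bilinear parameter
W_s, W_h, v = torch.randn(HID, HID), torch.randn(HID, HID), torch.randn(HID)

dot_score = s_t @ h_i                                    # s_t^T h_i
bilinear_score = s_t @ W @ h_i                           # s_t^T W h_i
additive_score = v @ torch.tanh(W_s @ s_t + W_h @ h_i)   # v^T tanh(W_s s_t + W_h h_i)
print(dot_score.item(), bilinear_score.item(), additive_score.item())
```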
Attention: Summary
• Solves the information bottleneck issue by allowing the decoder to have direct access to the source sentence words
• Provides some form of interpretability
  ‣ Attention weights can be seen as word alignments
• Most state-of-the-art MT systems are based on this architecture
  ‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)
Evaluation
MT Evaluation
• BLEU: compute the n-gram overlap between the "reference" translation and the generated translation
• Typically computed for 1- to 4-grams (N = 4)
  BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )
  p_n = (# correct n-grams) / (# predicted n-grams)
  BP = min(1, output length / reference length)   ← "brevity penalty" to penalise short outputs
(A minimal sketch of this computation follows below.)
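A minimal sketch of the BLEU formula above, assuming a single reference translation; the clipping of n-gram counts against the reference and the small smoothing constant are standard details not spelled out on the slide.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(predicted, reference, N=4):
    log_p = 0.0
    for n in range(1, N + 1):
        pred = Counter(ngrams(predicted, n))
        ref = Counter(ngrams(reference, n))
        correct = sum(min(c, ref[g]) for g, c in pred.items())  # clipped matches
        total = max(sum(pred.values()), 1)                      # # predicted n-grams
        log_p += math.log(max(correct, 1e-9) / total)           # p_n (smoothed)
    bp = min(1.0, len(predicted) / len(reference))              # brevity penalty
    return bp * math.exp(log_p / N)

reference = "the cow eats the grass".split()
predicted = "the cow eats grass".split()
print(round(bleu(predicted, reference), 3))   # low: no predicted 4-gram matches
```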
A Final Word
• Statistical MT
• Neural MT
• Encoder-decoder with attention is a general architecture that can be used for other tasks
  ‣ Summarisation (lecture 21)
  ‣ Dialogue generation
Reading
• https://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes06-NMT_seq2seq_attention.pdf (Sections 1-5)