Machine Translation
COMP90042
Natural Language Processing
Lecture 17
Semester 1 2021 Week 9 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Introduction
• Machine translation (MT) is the task of translating text from a source language to a target language, e.g.:

  虽然北风呼啸,但天空依然十分清澈
  However, the sky remained clear under the strong north wind
Why?
• Removes the language barrier
• Makes information in any language accessible to anyone
• But translation is a classic "AI-hard" challenge
  ‣ Difficult to preserve the meaning and the fluency of the text after translation
How good do you find machine translation (e.g. Google Translate)?
• I’ve never used it before
• Bad, it’s not really usable
• Usable, but it needs a lot of manual correction
• Quite good, I use it quite often
• Perfect, machine translation is a solved task
PollEv.com/jeyhanlau569
MT is Difficult
• Not just simple word-for-word translation
• Structural changes, e.g., syntax and semantics
• Multiple word translations, idioms
• Inflections for gender, case, etc.
• Missing information (e.g., determiners)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈
although north wind howls , but sky still extremely limpid
Outline
• Statistical Machine Translation
• Neural Machine Translation
• Attention Mechanism
• Evaluation
Statistical MT
Early Machine Translation
• Started in the early 1950s
• Motivated by the Cold War to translate Russian to English
  ‣ Goal: translate 1-2 million words an hour within 5 years
• Rule-based system
  ‣ Use a bilingual dictionary to map Russian words to English words
Statistical MT
• Given a French sentence f, the aim is to find the best English sentence e
  ‣ argmax_e P(e|f)
• Use Bayes' rule to decompose this into two components (see the sketch below)
  ‣ argmax_e P(f|e) P(e)
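Below is a minimal sketch of how this decomposition could be used to score candidate translations. The functions tm_logprob and lm_logprob and the candidate list are illustrative assumptions; real SMT decoders search over translations rather than enumerating a candidate list.

```python
# A minimal sketch of noisy-channel scoring: pick the English candidate e that
# maximises log P(f|e) + log P(e). The scoring functions and candidate list are
# hypothetical stand-ins, not a real SMT decoder.
def best_translation(f, candidates, tm_logprob, lm_logprob):
    # argmax_e P(f|e) P(e), computed in log space for numerical stability
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))
```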
Language vs Translation Model
• argmax_e P(f|e) P(e)
• P(e): language model
  ‣ learns how to write fluent English text
• P(f|e): translation model
  ‣ learns how to translate words and phrases from English to French
How to Learn LM and TM?
• Language model:
  ‣ Text statistics in large monolingual corpora
• Translation model:
  ‣ Word co-occurrences in parallel corpora
  ‣ i.e., English-French sentence pairs
Parallel Corpora
• One text in multiple languages
• Produced by human translation
  ‣ Bible, news articles, legal transcripts, literature, subtitles
  ‣ Open parallel corpus: http://opus.nlpl.eu/

[Figure: The Rosetta Stone, which carries the same text in Ancient Egyptian, Demotic, and Ancient Greek]
Models of Translation
• How to learn P(f|e) from parallel text?
• We only have sentence pairs; words are not aligned in the parallel text
• I.e., we don't have word-to-word translations
(Figure from Zhang, Liu, Li, Zhou, Zong, 2016)
Alignment
• Idea: introduce word alignment as a latent variable into the model
  ‣ P(f, a | e)
• Use algorithms such as expectation maximisation (EM) to learn the alignments (e.g. GIZA++); a minimal sketch follows below
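The following is a minimal sketch of EM for IBM Model 1, the simplest alignment model, learning word-translation probabilities t(f|e) from sentence pairs. The toy corpus, the omission of the NULL word, and the fixed number of iterations are simplifying assumptions; it is not the GIZA++ pipeline itself.

```python
# A minimal sketch of IBM Model 1 EM (no NULL word, toy data); illustrative
# only, not the GIZA++ implementation referenced in the slides.
from collections import defaultdict

corpus = [  # (English e, foreign f) sentence pairs (toy examples)
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]

e_vocab = {w for e_sent, _ in corpus for w in e_sent}
t = defaultdict(lambda: 1.0 / len(e_vocab))   # t[(f, e)] = P(f | e), uniform init

for _ in range(10):                           # EM iterations
    count = defaultdict(float)                # expected counts c(f, e)
    total = defaultdict(float)                # expected counts c(e)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            # E-step: spread each f word's count over its possible e alignments
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate the translation probabilities
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 2))         # "haus" aligns strongly with "house"
```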
Complexity of Alignment
• Some words are dropped and have no alignment
• One-to-many alignment
• Many-to-one alignment
• Many-to-many alignment
(Alignment examples from Brown, Della Pietra, Della Pietra, Mercer, 1993)
Statistical MT: Summary
• A very popular field of research in NLP prior to the 2010s
• Lots of feature engineering
• State-of-the-art systems are very complex
  ‣ Difficult to maintain
  ‣ Significant effort needed for new language pairs
Neural Machine Translation
Introduction
• Neural machine translation (NMT) is a new approach to machine translation
• Uses a single neural model to translate directly from source to target
• Requires parallel text
• Architecture: encoder-decoder model
  ‣ 1st RNN to encode the source sentence
  ‣ 2nd RNN to decode the target sentence
[Figure: Encoder-decoder model. The encoder RNN (RNN1) reads the source sentence 牛 吃 草; its final hidden state is a single vector that encodes the whole source sentence and is used as the initial state of the decoder RNN (RNN2), which generates the target sentence "cow eats grass".]

• Encoder: h_i = RNN1(h_{i-1}, x_i)
• Decoder: s_t = RNN2(s_{t-1}, y_t)
• s_1 = h_{|x|}
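A minimal PyTorch sketch of this encoder-decoder setup is given below; GRUs stand in for RNN1 and RNN2, and the class name, layer sizes, and batch-first layout are illustrative assumptions rather than the lecture's exact model.

```python
# A minimal sketch of the encoder-decoder model above (GRUs as RNN1/RNN2);
# names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # RNN1
        self.decoder = nn.GRU(dim, dim, batch_first=True)   # RNN2
        self.out = nn.Linear(dim, tgt_vocab)                 # predicts the next target word

    def forward(self, src, tgt_in):
        # Encode the source; h_final is the single vector summarising the sentence
        _, h_final = self.encoder(self.src_emb(src))
        # Decode, initialising the decoder state with the encoder's final state (s_1 = h_|x|)
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), h_final)
        return self.out(dec_states)          # logits over the target vocabulary at each step
```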
Neural MT
• The decoder RNN can be interpreted as a conditional language model
  ‣ Language model: predicts the next word given previous words in the target sentence y
  ‣ Conditional: prediction is also conditioned on the source sentence x

P(y|x) = P(y_1|x) P(y_2|y_1, x) … P(y_t|y_1, …, y_{t-1}, x)
Training Neural MT
• Requires a parallel corpus, just like statistical MT
• Trains with next word prediction, just like a language model!
Language Model Training Loss

[Figure: The unrolled decoder RNN (RNN2) predicting "cow eats grass" followed by the end-of-sentence token. The loss at each step is the negative log probability of the correct next word, and the total loss is their sum: L_total = L1 + L2 + L3 + L4.]
Neural MT Training Loss

[Figure: The same per-step losses as above, but the decoder (RNN2) is now conditioned on the encoder (RNN1) reading the source sentence 牛 吃 草. Total loss: L_total = L1 + L2 + L3 + L4.]
Training

[Figure: During training, the decoder is fed the correct previous target word ("cow", "eats", "grass") at each step.]

• During training, we have the target sentence
• We can therefore feed the right word from the target sentence, one step at a time (see the training sketch below)
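A minimal sketch of one such training step is shown below, assuming the hypothetical Seq2Seq class sketched earlier; the random toy tensors and vocabulary sizes are placeholders.

```python
# A minimal sketch of a training step with next-word prediction: the decoder
# is fed the gold target words shifted by one position, and the loss sums the
# negative log probability of each correct next word. Assumes the Seq2Seq
# sketch from earlier; the toy tensors below are placeholders.
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
criterion = nn.CrossEntropyLoss()                 # negative log-probability loss
optimizer = torch.optim.Adam(model.parameters())

src = torch.randint(0, 1000, (1, 3))              # ids for e.g. 牛 吃 草
tgt = torch.randint(0, 1000, (1, 4))              # ids for e.g. <start> cow eats grass

tgt_in, tgt_gold = tgt[:, :-1], tgt[:, 1:]        # decoder inputs vs. words to predict
logits = model(src, tgt_in)                       # (batch, steps, vocab)
loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_gold.reshape(-1))
loss.backward()
optimizer.step()
```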
Decoding at Test Time

[Figure: At test time, each word the decoder predicts via argmax ("cow", "eats", "grass") is fed back in as the input for the next step.]

• But at test time, we don't have the target sentence (that's what we're trying to predict!)
• argmax: take the word with the highest probability at every step
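A minimal sketch of this argmax decoding loop is given below, again assuming the hypothetical Seq2Seq class from earlier; start_id, end_id, and max_len are assumed parameters.

```python
# A minimal sketch of argmax (greedy) decoding with the hypothetical Seq2Seq
# model from earlier: the most probable word at each step is fed back as the
# next decoder input. start_id/end_id are assumed special-token ids.
import torch

@torch.no_grad()
def greedy_decode(model, src, start_id, end_id, max_len=50):
    _, state = model.encoder(model.src_emb(src))          # encode the source once
    word = torch.tensor([[start_id]])
    output = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_emb(word), state)
        word = model.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)  # highest-probability word
        if word.item() == end_id:                          # stop at the end-of-sentence token
            break
        output.append(word.item())
    return output
```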
Exposure Bias

[Figure: At test time, the decoder's own (possibly wrong) predictions, e.g. "cow drinks water", are fed back as inputs.]

• Describes the discrepancy between training and testing
• Training: always have the ground truth tokens at each step
• Test: uses its own prediction at each step
• Outcome: the model is unable to recover from its own errors
Greedy Decoding
• argmax decoding is also called greedy decoding
• Issue: does not guarantee the optimal probability P(y|x)
Exhaustive Search Decoding
• To find the optimal P(y|x), we need to consider every word at every step to compute the probability of all possible sequences
• Complexity is O(V^n); V = vocab size, n = sentence length
• Far too expensive to be feasible
Beam Search Decoding
• Instead of considering all possible words at every step, consider the k best words
• That is, we keep track of the top-k words that produce the best partial translations (hypotheses) thus far (sketched in code below)
• k = beam width (typically 5 to 10)
• k = 1 is greedy decoding
• k = V is exhaustive search decoding
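Below is a minimal sketch of beam search over per-step log-probabilities; next_logprobs is an assumed function returning log P(word | x, prefix) for every vocabulary word (e.g. from the decoder's softmax), and start_id/end_id are assumed special-token ids.

```python
# A minimal sketch of beam search decoding: keep the k best partial
# translations (hypotheses) at each step, scored by cumulative log probability.
def beam_search(next_logprobs, start_id, end_id, k=2, max_len=50):
    beams = [([start_id], 0.0)]                 # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in enumerate(next_logprobs(prefix)):
                candidates.append((prefix + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:    # keep only the k best hypotheses
            if prefix[-1] == end_id:
                finished.append((prefix, score))   # store terminated hypotheses
            else:
                beams.append((prefix, score))      # continue exploring the rest
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```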
Beam search decoding with k=2

[Figure: A step-by-step beam search example with k = 2. From the start token, the two best first words are kept (e.g. "cow" at log prob -0.5 and "tiger" at -0.7). Each surviving hypothesis is then expanded and only the two best partial translations are kept at every step (e.g. "cow eats" at -1.3 and "tiger bites" at -1.5, later "cow eats grass" at -2.3, and so on). The hypothesis with the best overall probability is selected, and we follow its backpointers to read off the target sentence.]
When to Stop?
• When decoding, we stop when we generate the end-of-sentence token
• But multiple hypotheses may terminate their sentences at different time steps
• We store hypotheses that have terminated, and continue to explore those that haven't
• Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
What are some issues of NMT?
• Information of the whole source sentence is represented by a single vector
• NMT can generate new details not in the source sentence
• NMT tends to generate sentences that are not very fluent
• Black-box model; difficult to explain when it doesn't work
PollEv.com/jeyhanlau569
Neural MT: Summary
• Single end-to-end model
  ‣ Statistical MT systems have multiple sub-components
• Less feature engineering
• Can produce new details that are not in the source sentence (hallucination)
  (Lee, Firat, Agarwal, Fannjiang, Sussillo, 2018)
Attention Mechanism
[Figure: The encoder-decoder model again; the encoder's final vector must encode the whole source sentence.]

• With a long source sentence, this single encoded vector is unlikely to capture all the information in the sentence
• This creates an information bottleneck
Attention
• For the decoder, at every time step allow it to 'attend' to words in the source sentence
[Figure: Attention step by step. At each decoder time step, the decoder state is compared with every encoder hidden state (dot product); the scores are normalised with a softmax into an attention distribution, and a weighted sum of the encoder states is used to help predict the next word. The first time step attends to 牛 to predict "cow", the second to 吃 to predict "eats", and the third to 草 to predict "grass".]
Encoder-Decoder with Attention
• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)
• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)
• For each time step in the decoder, attend to each of the encoder hidden states to produce the attention weights:
  ‣ e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_{|x|}]
• Apply softmax to the attention weights to get a valid probability distribution:
  ‣ α_t = softmax(e_t)
• Compute the weighted sum of the encoder hidden states to get the context vector:
  ‣ c_t = Σ_{i=1}^{|x|} α_{t,i} h_i
• Concatenate c_t and s_t to predict the next word (see the sketch below)
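A minimal PyTorch sketch of these equations for a single decoder step follows; the function name and tensor shapes are illustrative assumptions.

```python
# A minimal sketch of dot-product attention for one decoder step, following
# the equations above; shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def attend(s_t, enc_states):
    """s_t: (dim,) decoder state; enc_states: (src_len, dim) encoder hidden states."""
    e_t = enc_states @ s_t                 # attention weights: e_{t,i} = s_t . h_i
    alpha_t = F.softmax(e_t, dim=-1)       # attention distribution
    c_t = alpha_t @ enc_states             # context vector: weighted sum of the h_i
    return torch.cat([c_t, s_t], dim=-1)   # concatenate c_t and s_t to predict the next word
```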
Variants
• Attention:
  ‣ Dot product: s_t^T h_i
  ‣ Bilinear: s_t^T W h_i
  ‣ Additive: v^T tanh(W_s s_t + W_h h_i)
• c_t can be injected into the current state (s_t), or into the input word (y_t)
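The three scoring variants can be sketched as below for a single (s_t, h_i) pair; the parameter tensors W, W_s, W_h, and v are learned in practice and are created here only as illustrative placeholders.

```python
# A minimal sketch of the three attention scoring variants above; the weight
# matrices and vector are placeholders for learned parameters.
import torch

dim = 256
W = torch.randn(dim, dim)
W_s, W_h = torch.randn(dim, dim), torch.randn(dim, dim)
v = torch.randn(dim)

def dot_score(s_t, h_i):        # s_t^T h_i
    return s_t @ h_i

def bilinear_score(s_t, h_i):   # s_t^T W h_i
    return s_t @ W @ h_i

def additive_score(s_t, h_i):   # v^T tanh(W_s s_t + W_h h_i)
    return v @ torch.tanh(W_s @ s_t + W_h @ h_i)
```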
Attention: Summary
• Solves the information bottleneck issue by allowing the decoder to have access to the source sentence words directly
• Provides some form of interpretability
  ‣ Attention weights can be seen as word alignments
• Most state-of-the-art NMT systems use attention
  ‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)
Evaluation
MT Evaluation
• BLEU: compute n-gram overlap between the "reference" translation and the generated translation
• Typically computed for 1- to 4-grams (N = 4)

  BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )

  p_n = (# correct n-grams) / (# predicted n-grams)

  BP = min(1, output length / reference length)

• BP is the "Brevity Penalty", which penalises short outputs (see the sketch below)
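A minimal sketch of this formula for a single sentence pair is given below (unsmoothed, single reference); standard toolkits such as sacrebleu should be used for real evaluation, and the example sentences are illustrative.

```python
# A minimal sketch of sentence-level BLEU following the formula above
# (unsmoothed, single reference); not a substitute for standard toolkits.
import math
from collections import Counter

def bleu(predicted, reference, N=4):
    log_pns = []
    for n in range(1, N + 1):
        pred_ngrams = Counter(tuple(predicted[i:i + n]) for i in range(len(predicted) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        correct = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())  # clipped matches
        if correct == 0:
            return 0.0                       # unsmoothed BLEU is 0 if any p_n is 0
        log_pns.append(math.log(correct / sum(pred_ngrams.values())))
    bp = min(1.0, len(predicted) / len(reference))   # brevity penalty
    return bp * math.exp(sum(log_pns) / N)

print(round(bleu("a cow eats the grass".split(),
                 "the cow eats the grass".split()), 3))
```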
A Final Word
• Statistical MT
• Neural MT
  ‣ Nowadays uses Transformers rather than RNNs
• Encoder-decoder with attention is a general architecture that can be used for other tasks
  ‣ Summarisation (lecture 21)
  ‣ Other generation tasks such as dialogue generation
Reading
• JM3 Ch. 11.2-11.5
• Optional: GIZA++ word alignment: https://www.aclweb.org/anthology/W08-0509.pdf