COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 17
Semester 1 2021 Week 9
Jey Han Lau
Machine Translation
Introduction
• Machine translation (MT) is the task of translating text from a source language to a target language
虽然北风呼啸,但天空依然十分清澈
However, the sky remained clear under the strong north wind
Why?
• Removes language barrier
• Makes information in any language accessible to anyone
• But translation is a classic “AI-hard” challenge
‣ Difficult to preserve the meaning and the
fluency of the text after translation
PollEv.com/jeyhanlau569
How good do you find machine translation (e.g. Google Translate)?
• I’ve never used it before
• Bad, it’s not really usable
• Usable, but it needs a lot of manual correction
• Quite good, I use it quite often
• Perfect, machine translation is a solved task
MT is Difficult
• Not just simple word for word translation
• Structural changes, e.g., syntax and semantics
• Multiple word translations, idioms
• Inflections for gender, case etc
• Missing information (e.g., determiners)
虽然 北 风 呼啸 , 但 天空 依然 十分 清澈
although north wind howls , but sky still extremely limpid
Outline
• Statistical Machine Translation
• Neural Machine Translation
• Attention Mechanism
• Evaluation
Statistical MT
Early Machine Translation
• Started in early 1950s
• Motivated by the Cold War to translate Russian to
English
• Rule-based system
‣ Use bilingual dictionary to map Russian words
to English words
• Goal: translate 1-2 million words an hour within 5
years
Statistical MT
• Given French sentence f, the aim is to find the best English sentence e
‣ argmax_e P(e | f)
• Use Bayes' rule to decompose this into two components
‣ argmax_e P(f | e) P(e)
Language vs Translation Model
• argmax_e P(f | e) P(e)
• P(e): language model
‣ learns how to write fluent English text
• P(f | e): translation model
‣ learns how to translate words and phrases from English to French
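As a minimal sketch of how this decomposition could be used in practice, assuming hypothetical scoring functions tm_logprob and lm_logprob and a pre-generated set of candidate translations (none of these names come from the lecture):

    def noisy_channel_score(e, f, tm_logprob, lm_logprob):
        # log P(f | e) + log P(e): translation adequacy plus target-language fluency
        return tm_logprob(f, e) + lm_logprob(e)

    def best_translation(f, candidates, tm_logprob, lm_logprob):
        # argmax_e P(f | e) P(e), taken over a candidate set rather than all strings
        return max(candidates,
                   key=lambda e: noisy_channel_score(e, f, tm_logprob, lm_logprob))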
How to Learn LM and TM?
• Language model:
‣ Text statistics in large monolingual corpora
• Translation model:
‣ Word co-occurrences in parallel corpora
‣ i.e. English-French sentence pairs
Parallel Corpora
• One text in multiple languages
• Produced by human translation
‣ Bible, news articles, legal transcripts, literature, subtitles
‣ Open parallel corpus: http://opus.nlpl.eu/
[Figure: The Rosetta Stone, carrying the same text in Ancient Egyptian, Demotic, and Ancient Greek.]
Models of Translation
• How to learn P(f | e) from parallel text?
• We only have sentence pairs; words are not aligned in the parallel text
• I.e. we don’t have word-to-word translations
Zhang, Liu, Li, Zhou, Zong (2016)
Alignment
• Idea: introduce word alignment as a latent variable into the model
‣ P(f, a | e)
• Use algorithms such as expectation maximisation (EM) to learn the alignments (e.g. GIZA++)
https://en.wikipedia.org/wiki/Bitext_word_alignment#/media/File:Word_alignment.svg
Complexity of Alignment
• Some words are dropped and have no alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Complexity of Alignment
• One-to-many alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Complexity of Alignment
• Many-to-one alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Complexity of Alignment
• Many-to-many alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Statistical MT: Summary
• A very popular field of research in NLP prior to the 2010s
• Lots of feature engineering
• State-of-the-art systems are very complex
‣ Difficult to maintain
‣ Significant effort needed for new language pairs
Neural Machine Translation
Introduction
• Neural machine translation is a new approach to machine translation
• Use a single neural model to directly translate
from source to target
• Requires parallel text
• Architecture: encoder-decoder model
‣ 1st RNN to encode the source sentence
‣ 2nd RNN to decode the target sentence
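A minimal PyTorch-style sketch of this encoder-decoder setup, as an illustration only; the GRU cells, dimensions and names are assumptions, not the exact architecture from the lecture:

    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=256):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)  # 1st RNN: encodes the source
            self.decoder = nn.GRU(dim, dim, batch_first=True)  # 2nd RNN: decodes the target
            self.out = nn.Linear(dim, tgt_vocab)                # next-word logits

        def forward(self, src_ids, tgt_ids):
            # Encode the source; h is the final encoder state (one vector per sentence)
            _, h = self.encoder(self.src_emb(src_ids))
            # Decode the target, initialising the decoder with the encoder state
            dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
            return self.out(dec_states)                         # logits at every target position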
[Figure: the encoder RNN reads the source sentence 牛 吃 草; its final state, a single vector that encodes the whole source sentence, is used as the initial state of the decoder RNN, which generates the target sentence "cow eats grass".]
h_i = RNN1(h_{i-1}, x_i)    (encoder)
s_t = RNN2(s_{t-1}, y_t)    (decoder)
s_1 = h_{|x|}    (decoder initial state = final encoder state)
Neural MT
• The decoder RNN can be interpreted as a
conditional language model
‣ Language model: predicts the next word given
previous words in target sentence y
‣ Conditional: prediction is also conditioned on
the source sentence x
• P(y | x) = P(y1 | x) P(y2 | y1, x) ... P(yt | y1, ..., yt−1, x)
Training Neural MT
• Requires parallel corpus just like statistical MT
• Trains with next word prediction, just like a
language model!
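Continuing the hypothetical Seq2Seq sketch above, training reduces to next-word cross-entropy with the gold target words fed to the decoder; the shapes and token conventions here are assumptions for illustration:

    import torch.nn.functional as F

    def nmt_loss(model, src_ids, tgt_ids):
        # Decoder input: target words shifted right (start token, w1 ... w_{T-1});
        # labels: the words to be predicted (w1 ... w_T, ending with the end token).
        logits = model(src_ids, tgt_ids[:, :-1])
        labels = tgt_ids[:, 1:]
        # Sum of negative log probabilities of the correct next words
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), reduction="sum")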
Language Model Training Loss
[Figure: a language model trained on "cow eats grass". At each step the RNN predicts the next word, and the loss L_t is the negative log probability of the correct word (cow, eats, grass, then the end of the sentence). The total loss is the sum: L_total = L1 + L2 + L3 + L4.]
Neural MT Training Loss
[Figure: the same next-word losses, but the decoder is now conditioned on the encoder, which reads the source sentence 牛 吃 草. L_total = L1 + L2 + L3 + L4, the sum of the negative log probabilities of cow, eats, grass and the sentence-final token.]
Training
• During training, we have the target sentence
• We can therefore feed the correct word from the target sentence into the decoder, one step at a time
[Figure: during training, the ground-truth previous word (cow, eats, grass) is fed into the decoder at each step, while the encoder reads 牛 吃 草.]
Decoding at Test Time
• But at test time, we don’t have the target sentence
(that’s what we’re trying to predict!)
• argmax: take the word with the highest probability
at every step
[Figure: at test time, the word chosen by argmax at each step (cow, eats, grass) is fed back into the decoder as the input for the next step.]
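A sketch of this greedy (argmax) decoding loop for the hypothetical Seq2Seq model above; the start/end token ids and re-running the model at every step are simplifications for illustration:

    import torch

    @torch.no_grad()
    def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
        out = [bos_id]
        for _ in range(max_len):
            logits = model(src_ids, torch.tensor([out]))  # feed back our own predictions
            next_id = logits[0, -1].argmax().item()       # argmax at the latest step
            if next_id == eos_id:
                break
            out.append(next_id)
        return out[1:]                                    # drop the start token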
Exposure Bias
• Describes the discrepancy between training and testing
• Training: the decoder always sees the ground-truth tokens at each step
• Test: the decoder uses its own predictions at each step
• Outcome: the model is unable to recover from its own errors
[Figure: at test time the decoder predicts "cow drinks water" instead of "cow eats grass"; once "drinks" is produced by argmax, it is fed back in and the model cannot recover from the error.]
Greedy Decoding
• argmax decoding is also called greedy decoding
• Issue: does not guarantee optimal probability P(y | x)
Exhaustive Search Decoding
• To find the optimal P(y | x), we would need to consider every word at every step to compute the probability of all possible sequences
• Complexity is O(V^n), where V = vocab size and n = sentence length
• Far too expensive to be feasible
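For example (illustrative numbers only, not from the lecture): with a vocabulary of V = 50,000 words and a sentence of n = 20 words, exhaustive search would have to score 50,000^20 ≈ 10^94 possible sequences.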
Beam Search Decoding
• Instead of considering all possible words at every
step, consider k best words
• That is, we keep track of the top-k words that
produce the best partial translations (hypotheses)
thus far
• k = beam width (typically 5 to 10)
• k = 1 is greedy decoding
• k = V is exhaustive search decoding
Beam search decoding with k = 2 (worked example, shown as a figure over several slides)
[At each step, only the 2 best partial translations (hypotheses) are kept. A hypothesis is scored by its cumulative log probability: the score of its parent plus the log probability of the new word. For example, starting from "cow" (-0.5) and "tiger" (-0.7), expanding with the next word gives scores such as -1.3 = -0.5 + log P(eats | x, cow) and -1.5 = -0.7 + log P(bites | x, tiger). The candidate continuations at later steps include eats/chews, bites/kills, grass/carrot, hard/man and and/but. Once the hypothesis with the best overall translation probability is found, back-pointers are followed to read off the target sentence.]
When to Stop?
• When decoding, we stop when we generate an end-of-sentence token
• But multiple hypotheses may terminate their sentences at different time steps
• We store hypotheses that have terminated, and continue exploring those that haven’t
• Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
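A compact sketch of beam search with these stopping rules, again using the hypothetical Seq2Seq interface from above; real implementations batch the hypotheses and usually length-normalise the scores:

    import torch

    @torch.no_grad()
    def beam_search(model, src_ids, bos_id, eos_id, k=5, max_len=50):
        beams = [([bos_id], 0.0)]        # (partial hypothesis, cumulative log prob)
        finished = []
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                logits = model(src_ids, torch.tensor([seq]))
                logprobs = logits[0, -1].log_softmax(dim=-1)
                top_lp, top_id = logprobs.topk(k)            # k expansions per hypothesis
                for lp, wid in zip(top_lp.tolist(), top_id.tolist()):
                    candidates.append((seq + [wid], score + lp))
            # keep the k best partial translations overall
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
            # store hypotheses that generated the end token; keep exploring the rest
            finished += [b for b in beams if b[0][-1] == eos_id]
            beams = [b for b in beams if b[0][-1] != eos_id]
            if not beams:
                break
        best, _ = max(finished + beams, key=lambda c: c[1])
        return best[1:]                  # drop the start token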
PollEv.com/jeyhanlau569
What are some issues of NMT?
• Information of the whole source sentence is represented by a single vector
• NMT can generate new details not in the source sentence
• NMT tends to generate sentences that are not very fluent
• Black-box model; difficult to explain when it doesn’t work
Neural MT: Summary
• Single end-to-end model
‣ Statistical MT systems have multiple sub-
components
• Less feature engineering
• Can produce new details that are not in the source
sentence (hallucination)
Lee, Firat, Agarwal, Fannjiang, Sussillo, 2018
Attention Mechanism
[Figure: the encoder-decoder model again. The single vector passed from the encoder (reading 牛 吃 草) to the decoder (generating "cow eats grass") must encode the whole source sentence.]
• With a long source sentence, the encoded vector is
unlikely to capture all the information in the sentence
• This creates an information bottleneck
Attention
• For the decoder, at every time step allow it to
‘attend’ to words in the source sentence
[Figures: the attention computation, step by step. At each decoder time step, the decoder state is compared with every encoder state using a dot product; a softmax over these scores gives an attention distribution over the source words, and the weighted sum of the encoder states is used together with the decoder state to predict the next word. The first time step attends mainly to 牛 to predict "cow", the second to 吃 to predict "eats", with later steps attending to the remaining source word 草 as "grass" is predicted.]
Encoder-Decoder with Attention
• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)
• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)
• For each time step t in the decoder, attend to each of the encoder hidden states to produce the attention scores:
‣ e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_|x|]
• Apply softmax to the attention scores to get a valid probability distribution:
‣ α_t = softmax(e_t)
• Compute the context vector c_t as the weighted sum of the encoder hidden states:
‣ c_t = Σ_{i=1}^{|x|} α_{t,i} h_i
• Concatenate c_t and s_t to predict the next word
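A sketch of these equations for a single decoder time step, using plain tensors; the names and shapes are assumptions for illustration:

    import torch

    def attend(s_t, H):
        # s_t: decoder state at time t, shape (dim,)
        # H:   encoder states h_1 .. h_|x|, shape (src_len, dim)
        e_t = H @ s_t                        # scores: s_t^T h_i for every source position i
        alpha_t = torch.softmax(e_t, dim=0)  # attention distribution over source words
        c_t = alpha_t @ H                    # context vector: weighted sum of encoder states
        return torch.cat([c_t, s_t])         # concatenated with s_t to predict the next word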
Variants
• Attention score functions:
‣ Dot product: s_t^T h_i
‣ Bilinear: s_t^T W h_i
‣ Additive: v^T tanh(W_s s_t + W_h h_i)
• c_t can be injected into the current state (s_t), or into the input word (y_t)
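The three score functions written out as small helper functions; W, W_s, W_h and v stand for learned parameters whose shapes are assumed here for illustration:

    import torch

    def score_dot(s_t, h_i):
        return s_t @ h_i                               # s_t^T h_i

    def score_bilinear(s_t, h_i, W):
        return s_t @ (W @ h_i)                         # s_t^T W h_i

    def score_additive(s_t, h_i, W_s, W_h, v):
        return v @ torch.tanh(W_s @ s_t + W_h @ h_i)   # v^T tanh(W_s s_t + W_h h_i)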
Attention: Summary
• Solves the information bottleneck issue by allowing the decoder to access the source sentence words directly
• Provides some form of interpretability
‣ Attention weights can be seen as word
alignments
• Most state-of-the-art NMT systems use attention
‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)
Evaluation
MT Evaluation
• BLEU: compute n-gram overlap between
“reference” translation and generated translation
• Typically computed for 1- to 4-grams (i.e. N = 4)
BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )
p_n = (# correct n-grams) / (# predicted n-grams)
BP = min(1, output length / reference length)    ("brevity penalty" to penalise short outputs)
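A self-contained sketch of BLEU as defined on this slide, for a single reference translation; the clipping of repeated n-grams is an assumption taken from the standard metric and goes slightly beyond the simplified p_n above:

    import math
    from collections import Counter

    def ngrams(words, n):
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    def bleu(predicted, reference, N=4):
        # predicted, reference: lists of tokens
        log_p = 0.0
        for n in range(1, N + 1):
            pred = Counter(ngrams(predicted, n))
            ref = Counter(ngrams(reference, n))
            correct = sum((pred & ref).values())        # predicted n-grams also in the reference
            total = max(sum(pred.values()), 1)
            log_p += math.log(correct / total) if correct else float("-inf")
        bp = min(1.0, len(predicted) / len(reference))  # brevity penalty from the slide
        return bp * math.exp(log_p / N)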
A Final Word
• Statistical MT
• Neural MT
‣ Nowadays use Transformers rather than RNNs
• Encoder-decoder with attention is a general architecture that can be used for other tasks
‣ Summarisation (lecture 21)
‣ Other generation tasks such as dialogue
generation
Reading
• JM3 Ch. 11.2-11.5
• Optional: GIZA++ word alignment – https://www.aclweb.org/anthology/W08-0509.pdf