COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 17
Semester 1 2021 Week 9
Jey Han Lau
Machine Translation
Introduction
• Machine translation (MT) is the task of translating text from a source language to a target language
虽然北风呼啸,但天空依然十分清澈
However, the sky remained clear under the strong north wind
Why?
• Removes language barrier
• Makes information in any language accessible to anyone
• But translation is a classic “AI-hard” challenge
‣ Difficult to preserve the meaning and the
fluency of the text after translation
PollEv.com/jeyhanlau569
How good do you find machine translation (e.g. Google Translate)?
• I’ve never used it before
• Bad, it’s not really usable
• Usable, but it needs a lot of manual correction
• Quite good, I use it quite often
• Perfect, machine translation is a solved task
MT is Difficult
• Not just simple word for word translation
• Structural changes, e.g., syntax and semantics
• Multiple word translations, idioms
• Inflections for gender, case etc
• Missing information (e.g., determiners)
虽然 北 风 呼啸 , 但 天空 依然 十分 清澈
although north wind howls , but sky still extremely limpid
Outline
• Statistical Machine Translation
• Neural Machine Translation
• Attention Mechanism
• Evaluation
Statistical MT
Early Machine Translation
• Started in early 1950s
• Motivated by the Cold War to translate Russian to
English
• Rule-based system
‣ Use bilingual dictionary to map Russian words
to English words
• Goal: translate 1-2 million words an hour within 5
years
Statistical MT
• Given French sentence f, the aim is to find the best English sentence e
‣ argmax_e P(e | f)
• Use Bayes' rule to decompose this into two components
‣ argmax_e P(f | e) P(e)
Language vs Translation Model
• argmax_e P(f | e) P(e)
• P(e): language model
‣ learns how to write fluent English text
• P(f | e): translation model
‣ learns how to translate words and phrases from English to French
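As a minimal sketch of how this decomposition could be used in practice, assuming hypothetical scoring functions tm_logprob and lm_logprob and a pre-generated set of candidate translations (none of these names come from the lecture):

    def noisy_channel_score(e, f, tm_logprob, lm_logprob):
        # log P(f | e) + log P(e): translation adequacy plus target-language fluency
        return tm_logprob(f, e) + lm_logprob(e)

    def best_translation(f, candidates, tm_logprob, lm_logprob):
        # argmax_e P(f | e) P(e), taken over a candidate set rather than all strings
        return max(candidates,
                   key=lambda e: noisy_channel_score(e, f, tm_logprob, lm_logprob))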
How to Learn LM and TM?
• Language model:
‣ Text statistics in large monolingual corpora
• Translation model:
‣ Word co-occurrences in parallel corpora
‣ i.e. English-French sentence pairs
Parallel Corpora
• One text in multiple languages
• Produced by human translation
‣ Bible, news articles, legal transcripts, literature, subtitles
‣ Open parallel corpus: http://opus.nlpl.eu/
[Figure: The Rosetta Stone, carrying the same text in Ancient Egyptian, Demotic, and Ancient Greek.]
Models of Translation
• How to learn P(f | e) from parallel text?
• We only have sentence pairs; words are not aligned in the parallel text
• I.e. we don’t have word-to-word translations
Zhang, Liu, Li, Zhou, Zong (2016)
Alignment
• Idea: introduce word alignment as a latent variable into the model
‣ P(f, a | e)
• Use algorithms such as expectation maximisation (EM) to learn the alignments (e.g. GIZA++)
https://en.wikipedia.org/wiki/Bitext_word_alignment#/media/File:Word_alignment.svg
Complexity of Alignment
• Some words are dropped and have no alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Complexity of Alignment
• One-to-many alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Complexity of Alignment
• Many-to-one alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Complexity of Alignment
• Many-to-many alignment
Brown, Della Pietra, Della Pietra, Mercer, 1993
Statistical MT: Summary
• A very popular field of research in NLP prior to the 2010s
• Lots of feature engineering
• State-of-the-art systems are very complex
‣ Difficult to maintain
‣ Significant effort needed for new language pairs
Neural Machine Translation
Introduction
• Neural machine translation is a new approach to machine translation
• Use a single neural model to directly translate
from source to target
• Requires parallel text
• Architecture: encoder-decoder model
‣ 1st RNN to encode the source sentence
‣ 2nd RNN to decode the target sentence
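A minimal PyTorch-style sketch of this encoder-decoder setup, as an illustration only; the GRU cells, dimensions and names are assumptions, not the exact architecture from the lecture:

    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=256):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)  # 1st RNN: encodes the source
            self.decoder = nn.GRU(dim, dim, batch_first=True)  # 2nd RNN: decodes the target
            self.out = nn.Linear(dim, tgt_vocab)                # next-word logits

        def forward(self, src_ids, tgt_ids):
            # Encode the source; h is the final encoder state (one vector per sentence)
            _, h = self.encoder(self.src_emb(src_ids))
            # Decode the target, initialising the decoder with the encoder state
            dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
            return self.out(dec_states)                         # logits at every target position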
[Figure: the encoder RNN reads the source sentence 牛 吃 草; its final state, a single vector that encodes the whole source sentence, is used as the initial state of the decoder RNN, which generates the target sentence "cow eats grass".]
h_i = RNN1(h_{i-1}, x_i)    (encoder)
s_t = RNN2(s_{t-1}, y_t)    (decoder)
s_1 = h_{|x|}    (decoder initial state = final encoder state)
Neural MT
• The decoder RNN can be interpreted as a
conditional language model
‣ Language model: predicts the next word given
previous words in target sentence y
‣ Conditional: prediction is also conditioned on
the source sentence x
• P(y | x) = P(y1 | x) P(y2 | y1, x) ... P(yt | y1, ..., yt−1, x)
Training Neural MT
• Requires parallel corpus just like statistical MT
• Trains with next word prediction, just like a
language model!
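Continuing the hypothetical Seq2Seq sketch above, training reduces to next-word cross-entropy with the gold target words fed to the decoder; the shapes and token conventions here are assumptions for illustration:

    import torch.nn.functional as F

    def nmt_loss(model, src_ids, tgt_ids):
        # Decoder input: target words shifted right (start token, w1 ... w_{T-1});
        # labels: the words to be predicted (w1 ... w_T, ending with the end token).
        logits = model(src_ids, tgt_ids[:, :-1])
        labels = tgt_ids[:, 1:]
        # Sum of negative log probabilities of the correct next words
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), reduction="sum")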
Language Model Training Loss
[Figure: a language model trained on "cow eats grass". At each step the RNN predicts the next word, and the loss L_t is the negative log probability of the correct word (cow, eats, grass, then the end of the sentence). The total loss is the sum: L_total = L1 + L2 + L3 + L4.]
Neural MT Training Loss
[Figure: the same next-word losses, but the decoder is now conditioned on the encoder, which reads the source sentence 牛 吃 草. L_total = L1 + L2 + L3 + L4, the sum of the negative log probabilities of cow, eats, grass and the sentence-final token.]
Training
• During training, we have the target sentence
• We can therefore feed the correct word from the target sentence into the decoder, one step at a time
[Figure: during training, the ground-truth previous word (cow, eats, grass) is fed into the decoder at each step, while the encoder reads 牛 吃 草.]
Decoding at Test Time
• But at test time, we don’t have the target sentence
(that’s what we’re trying to predict!)
• argmax: take the word with the highest probability
at every step
[Figure: at test time, the word chosen by argmax at each step (cow, eats, grass) is fed back into the decoder as the input for the next step.]
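A sketch of this greedy (argmax) decoding loop for the hypothetical Seq2Seq model above; the start/end token ids and re-running the model at every step are simplifications for illustration:

    import torch

    @torch.no_grad()
    def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
        out = [bos_id]
        for _ in range(max_len):
            logits = model(src_ids, torch.tensor([out]))  # feed back our own predictions
            next_id = logits[0, -1].argmax().item()       # argmax at the latest step
            if next_id == eos_id:
                break
            out.append(next_id)
        return out[1:]                                    # drop the start token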
Exposure Bias
• Describes the discrepancy between training and testing
• Training: the decoder always sees the ground-truth tokens at each step
• Test: the decoder uses its own predictions at each step
• Outcome: the model is unable to recover from its own errors
[Figure: at test time the decoder predicts "cow drinks water" instead of "cow eats grass"; once "drinks" is produced by argmax, it is fed back in and the model cannot recover from the error.]
Greedy Decoding
• argmax decoding is also called greedy decoding
• Issue: does not guarantee optimal probability P(y | x)
Exhaustive Search Decoding
• To find the optimal P(y | x), we would need to consider every word at every step to compute the probability of all possible sequences
• Complexity is O(V^n), where V = vocab size and n = sentence length
• Far too expensive to be feasible
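For example (illustrative numbers only, not from the lecture): with a vocabulary of V = 50,000 words and a sentence of n = 20 words, exhaustive search would have to score 50,000^20 ≈ 10^94 possible sequences.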
Beam Search Decoding
• Instead of considering all possible words at every
step, consider k best words
• That is, we keep track of the top-k words that
produce the best partial translations (hypotheses)
thus far
• k = beam width (typically 5 to 10)
• k = 1 is greedy decoding
• k = V is exhaustive search decoding
Beam search decoding with k = 2 (worked example, shown as a figure over several slides)
[At each step, only the 2 best partial translations (hypotheses) are kept. A hypothesis is scored by its cumulative log probability: the score of its parent plus the log probability of the new word. For example, starting from "cow" (-0.5) and "tiger" (-0.7), expanding with the next word gives scores such as -1.3 = -0.5 + log P(eats | x, cow) and -1.5 = -0.7 + log P(bites | x, tiger). The candidate continuations at later steps include eats/chews, bites/kills, grass/carrot, hard/man and and/but. Once the hypothesis with the best overall translation probability is found, back-pointers are followed to read off the target sentence.]
When to Stop?
• When decoding, we stop when we generate an end-of-sentence token
• But multiple hypotheses may terminate their sentences at different time steps
• We store hypotheses that have terminated, and continue exploring those that haven’t
• Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
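A compact sketch of beam search with these stopping rules, again using the hypothetical Seq2Seq interface from above; real implementations batch the hypotheses and usually length-normalise the scores:

    import torch

    @torch.no_grad()
    def beam_search(model, src_ids, bos_id, eos_id, k=5, max_len=50):
        beams = [([bos_id], 0.0)]        # (partial hypothesis, cumulative log prob)
        finished = []
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                logits = model(src_ids, torch.tensor([seq]))
                logprobs = logits[0, -1].log_softmax(dim=-1)
                top_lp, top_id = logprobs.topk(k)            # k expansions per hypothesis
                for lp, wid in zip(top_lp.tolist(), top_id.tolist()):
                    candidates.append((seq + [wid], score + lp))
            # keep the k best partial translations overall
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
            # store hypotheses that generated the end token; keep exploring the rest
            finished += [b for b in beams if b[0][-1] == eos_id]
            beams = [b for b in beams if b[0][-1] != eos_id]
            if not beams:
                break
        best, _ = max(finished + beams, key=lambda c: c[1])
        return best[1:]                  # drop the start token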
PollEv.com/jeyhanlau569
What are some issues of NMT?
• Information of the whole source sentence is represented by a single vector
• NMT can generate new details not in the source sentence
• NMT tends to generate sentences that are not very fluent
• Black-box model; difficult to explain when it doesn’t work
Neural MT: Summary
• Single end-to-end model
‣ Statistical MT systems have multiple sub-
components
• Less feature engineering
• Can produce new details that are not in the source
sentence (hallucination)
Lee, Firat, Agarwal, Fannjiang, Sussillo, 2018
Attention Mechanism
[Figure: the encoder-decoder model again. The single vector passed from the encoder (reading 牛 吃 草) to the decoder (generating "cow eats grass") must encode the whole source sentence.]
• With a long source sentence, the encoded vector is
unlikely to capture all the information in the sentence
• This creates an information bottleneck
Attention
• For the decoder, at every time step allow it to
‘attend’ to words in the source sentence
[Figures: the attention computation, step by step. At each decoder time step, the decoder state is compared with every encoder state using a dot product; a softmax over these scores gives an attention distribution over the source words, and the weighted sum of the encoder states is used together with the decoder state to predict the next word. The first time step attends mainly to 牛 to predict "cow", the second to 吃 to predict "eats", with later steps attending to the remaining source word 草 as "grass" is predicted.]
Encoder-Decoder with Attention
• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)
• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)
• For each time step t in the decoder, attend to each of the encoder hidden states to produce the attention scores:
‣ e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_|x|]
• Apply softmax to the attention scores to get a valid probability distribution:
‣ α_t = softmax(e_t)
• Compute the context vector c_t as the weighted sum of the encoder hidden states:
‣ c_t = Σ_{i=1}^{|x|} α_{t,i} h_i
• Concatenate c_t and s_t to predict the next word
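A sketch of these equations for a single decoder time step, using plain tensors; the names and shapes are assumptions for illustration:

    import torch

    def attend(s_t, H):
        # s_t: decoder state at time t, shape (dim,)
        # H:   encoder states h_1 .. h_|x|, shape (src_len, dim)
        e_t = H @ s_t                        # scores: s_t^T h_i for every source position i
        alpha_t = torch.softmax(e_t, dim=0)  # attention distribution over source words
        c_t = alpha_t @ H                    # context vector: weighted sum of encoder states
        return torch.cat([c_t, s_t])         # concatenated with s_t to predict the next word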
Variants
• Attention score functions:
‣ Dot product: s_t^T h_i
‣ Bilinear: s_t^T W h_i
‣ Additive: v^T tanh(W_s s_t + W_h h_i)
• c_t can be injected into the current state (s_t), or into the input word (y_t)
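The three score functions written out as small helper functions; W, W_s, W_h and v stand for learned parameters whose shapes are assumed here for illustration:

    import torch

    def score_dot(s_t, h_i):
        return s_t @ h_i                               # s_t^T h_i

    def score_bilinear(s_t, h_i, W):
        return s_t @ (W @ h_i)                         # s_t^T W h_i

    def score_additive(s_t, h_i, W_s, W_h, v):
        return v @ torch.tanh(W_s @ s_t + W_h @ h_i)   # v^T tanh(W_s s_t + W_h h_i)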
Attention: Summary
• Solves the information bottleneck issue by allowing the decoder to access the source sentence words directly
• Provides some form of interpretability
‣ Attention weights can be seen as word
alignments
• Most state-of-the-art NMT systems use attention
‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)
Evaluation
MT Evaluation
• BLEU: compute n-gram overlap between
“reference” translation and generated translation
• Typically computed for 1- to 4-grams (i.e. N = 4)
BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )
p_n = (# correct n-grams) / (# predicted n-grams)
BP = min(1, output length / reference length)    ("brevity penalty" to penalise short outputs)
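A self-contained sketch of BLEU as defined on this slide, for a single reference translation; the clipping of repeated n-grams is an assumption taken from the standard metric and goes slightly beyond the simplified p_n above:

    import math
    from collections import Counter

    def ngrams(words, n):
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    def bleu(predicted, reference, N=4):
        # predicted, reference: lists of tokens
        log_p = 0.0
        for n in range(1, N + 1):
            pred = Counter(ngrams(predicted, n))
            ref = Counter(ngrams(reference, n))
            correct = sum((pred & ref).values())        # predicted n-grams also in the reference
            total = max(sum(pred.values()), 1)
            log_p += math.log(correct / total) if correct else float("-inf")
        bp = min(1.0, len(predicted) / len(reference))  # brevity penalty from the slide
        return bp * math.exp(log_p / N)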
A Final Word
• Statistical MT
• Neural MT
‣ Nowadays use Transformers rather than RNNs
• Encoder-decoder with attention is a general architecture that can be used for other tasks
‣ Summarisation (lecture 21)
‣ Other generation tasks such as dialogue
generation
Reading
• JM3 Ch. 11.2-11.5
• Optional: GIZA++ word alignment – https://www.aclweb.org/anthology/W08-0509.pdf