
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

COMP90042
Natural Language Processing

Lecture 17
Semester 1 2021 Week 9

Jey Han Lau

Machine Translation


Introduction

• Machine translation (MT) is the task of translating text from a source language to a target language

虽然北风呼啸,但天空依然十分清澈

However, the sky remained clear under the strong north wind


Why?

• Removes the language barrier

• Makes information in any language accessible to anyone

• But translation is a classic “AI-hard” challenge

‣ Difficult to preserve the meaning and fluency of the text after translation


PollEv.com/jeyhanlau569

How good do you find machine translation (e.g. Google Translate)?

• I’ve never used it before
• Bad, it’s not really usable
• Usable, but it needs a lot of manual correction
• Quite good, I use it quite often
• Perfect, machine translation is a solved task


MT is Difficult

• Not just simple word for word translation

• Structural changes, e.g., syntax and semantics

• Multiple word translations, idioms

• Inflections for gender, case, etc.

• Missing information (e.g., determiners)

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈
although north wind howls , but sky still extremely limpid


Outline

• Statistical Machine Translation

• Neural Machine Translation

• Attention Mechanism

• Evaluation


Statistical MT


Early Machine Translation

• Started in the early 1950s

• Motivated by the Cold War to translate Russian to
English

• Rule-based system

‣ Use a bilingual dictionary to map Russian words to English words

• Goal: translate 1-2 million words an hour within 5
years


Statistical MT

• Given a French sentence f, the aim is to find the best English sentence e:

argmax_e P(e | f)

• Use Bayes' rule to decompose it into two components:

argmax_e P(f | e) P(e)


Language vs Translation Model

• P(e): language model

‣ learns how to write fluent English text

• P(f | e): translation model

‣ learns how to translate words and phrases from English to French

argmax_e P(f | e) P(e)
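To make the noisy-channel decomposition concrete, here is a minimal Python sketch with hypothetical toy log-probabilities (the candidate sentences and scores are made up for illustration, not taken from the lecture): the chosen translation maximises log P(f | e) + log P(e).

```python
# Hypothetical toy log-probabilities for one fixed French input f.
log_p_lm = {                       # log P(e): language model, rewards fluent English
    "the cat is small": -2.1,
    "small the is cat": -9.8,
}
log_p_tm = {                       # log P(f | e): translation model, rewards faithfulness to f
    "the cat is small": -3.0,
    "small the is cat": -2.5,
}

def noisy_channel_score(e):
    # argmax_e P(f | e) P(e)  is the same as  argmax_e [ log P(f | e) + log P(e) ]
    return log_p_tm[e] + log_p_lm[e]

best = max(log_p_lm, key=noisy_channel_score)
print(best)   # "the cat is small": the fluent candidate wins despite a slightly worse TM score
```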


How to Learn LM and TM?

• Language model:

‣ Text statistics in large monolingual corpora

• Translation model:

‣ Word co-occurrences in parallel corpora

‣ i.e. English-French sentence pairs


Parallel Corpora

• One text in multiple languages

• Produced by human translation

‣ Bible, news articles, legal transcripts, literature, subtitles

‣ Open parallel corpus: http://opus.nlpl.eu/

[Image: The Rosetta Stone, bearing the same text in Ancient Egyptian (hieroglyphic), Demotic, and Ancient Greek]


Models of Translation

• How to learn P(f | e) from parallel text?

• We only have sentence pairs; words are not aligned in the parallel text

• i.e. we don’t have word-to-word translations

(Figure: Zhang, Liu, Li, Zhou, Zong, 2016)


Alignment

• Idea: introduce word alignment a as a latent variable into the model: P(f, a | e)

• Use algorithms such as expectation maximisation (EM) to learn it (e.g. GIZA++); a minimal sketch follows below

https://en.wikipedia.org/wiki/Bitext_word_alignment#/media/File:Word_alignment.svg
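As an illustration of how EM can learn word translation probabilities from unaligned sentence pairs, here is a minimal sketch of IBM Model 1 (the simplest alignment model, with no NULL word or distortion); GIZA++ implements this and much richer models, so treat it only as a toy.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """pairs: list of (french_tokens, english_tokens) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))      # t[(f, e)] = P(f | e), uniform at first
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        # E-step: collect expected alignment counts under the current t.
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)        # normalise over possible alignments of f
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate the translation probabilities.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

pairs = [("la maison".split(), "the house".split()),
         ("la fleur".split(), "the flower".split())]
t = ibm_model1(pairs)
print(round(t[("maison", "house")], 2))   # rises towards 1.0 as EM iterates
```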


Complexity of Alignment

• Some words are dropped and have no alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993


Complexity of Alignment

• One-to-many alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993


Complexity of Alignment

• Many-to-one alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993


Complexity of Alignment

• Many-to-many alignment

Brown, Della Pietra, Della Pietra, Mercer, 1993


Statistical MT: Summary

• A very popular field of research in NLP prior to the 2010s

• Lots of feature engineering

• State-of-the-art systems are very complex

‣ Difficult to maintain

‣ Significant effort needed for new language pairs


Neural Machine Translation


Introduction

• Neural machine translation (NMT) is a newer approach to machine translation

• Use a single neural model to directly translate
from source to target

• Requires parallel text

• Architecture: encoder-decoder model

‣ 1st RNN to encode the source sentence

‣ 2nd RNN to decode the target sentence


[Diagram: an encoder RNN (RNN1) reads the source sentence 牛 吃 草; its final hidden state, a single vector that encodes the whole source sentence, is used as the initial state of a decoder RNN (RNN2), which generates the target sentence "cow eats grass".]

h_i = RNN1(h_{i-1}, x_i)
s_t = RNN2(s_{t-1}, y_t)
s_1 = h_{|x|}
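A minimal sketch of this encoder-decoder architecture, assuming PyTorch and GRU cells with toy sizes (the lecture does not prescribe a particular framework or RNN cell):

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128   # toy sizes

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn1 = nn.GRU(EMB, HID, batch_first=True)   # encoder: h_i = RNN1(h_{i-1}, x_i)
        self.rnn2 = nn.GRU(EMB, HID, batch_first=True)   # decoder: s_t = RNN2(s_{t-1}, y_t)
        self.out = nn.Linear(HID, TGT_VOCAB)             # predicts the next target word

    def forward(self, src, tgt_in):
        _, h_last = self.rnn1(self.src_emb(src))                  # h_last encodes the whole source
        dec_states, _ = self.rnn2(self.tgt_emb(tgt_in), h_last)   # decoder initialised with h_{|x|}
        return self.out(dec_states)                               # one set of logits per target step

model = EncoderDecoder()
src = torch.randint(0, SRC_VOCAB, (1, 3))      # e.g. 牛 吃 草 as word ids
tgt_in = torch.randint(0, TGT_VOCAB, (1, 4))   # e.g. <s> cow eats grass
logits = model(src, tgt_in)                    # shape (1, 4, TGT_VOCAB)
```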


Neural MT

• The decoder RNN can be interpreted as a
conditional language model

‣ Language model: predicts the next word given
previous words in target sentence y

‣ Conditional: prediction is also conditioned on
the source sentence x

• P(y | x) = P(y_1 | x) P(y_2 | y_1, x) ... P(y_t | y_1, ..., y_{t-1}, x)


Training Neural MT

• Requires parallel corpus just like statistical MT

• Trains with next word prediction, just like a
language model!


Language Model Training Loss

[Diagram: at each step, RNN2 reads the previous word and predicts the next one; the per-step loss is the negative log-probability of the correct word: L1 = -log P(cow), L2 = -log P(eats), L3 = -log P(grass), L4 = -log P(</s>), and the total loss is their sum, Ltotal = L1 + L2 + L3 + L4.]


Neural MT Training Loss

[Diagram: the same next-word losses, but the decoder RNN2 is now conditioned on the encoder RNN1, which reads the source sentence 牛 吃 草. The per-step losses are the negative log-probabilities of cow, eats, grass and </s>, and Ltotal is their sum.]


Training

• During training, we have the target sentence

• We can therefore feed the correct word from the target sentence, one step at a time

[Diagram: the encoder RNN1 reads the source sentence 牛 吃 草; at each decoding step, RNN2 is fed the correct previous target word (<s>, cow, eats, grass) rather than its own prediction.]
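A minimal sketch of this training setup (assuming the PyTorch model sketched earlier; feeding the correct previous word is commonly called teacher forcing): the loss is the sum of per-step negative log-probabilities.

```python
import torch.nn.functional as F

def nmt_loss(model, src, tgt_in, tgt_out):
    # tgt_in:  <s> cow eats grass     (what the decoder reads at each step)
    # tgt_out: cow eats grass </s>    (what it should predict at each step)
    logits = model(src, tgt_in)                        # (1, T, TGT_VOCAB)
    total = 0.0
    for t in range(tgt_out.size(1)):
        log_probs = F.log_softmax(logits[:, t, :], dim=-1)
        total = total - log_probs[0, tgt_out[0, t]]    # L_t = -log P(correct word at step t)
    return total                                       # L_total = L_1 + L_2 + ... + L_T
```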


Decoding at Test Time

• But at test time, we don’t have the target sentence
(that’s what we’re trying to predict!)

• argmax: take the word with the highest probability
at every step

[Diagram: at test time, the decoder's own argmax prediction at each step (cow, eats, grass) is fed back in as the input for the next step.]
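A minimal sketch of greedy (argmax) decoding with the model sketched earlier (assumed PyTorch; the <s>/</s> token ids are hypothetical):

```python
import torch

def greedy_decode(model, src, bos_id=1, eos_id=2, max_len=50):
    _, h = model.rnn1(model.src_emb(src))              # encode the source sentence once
    y = torch.tensor([[bos_id]])                       # start decoding from <s>
    output = []
    for _ in range(max_len):
        s, h = model.rnn2(model.tgt_emb(y), h)         # one decoder step
        next_id = model.out(s[:, -1, :]).argmax(dim=-1)   # argmax: most probable next word
        if next_id.item() == eos_id:                   # stop at </s>
            break
        output.append(next_id.item())
        y = next_id.unsqueeze(0)                       # feed the prediction back as the next input
    return output
```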


Exposure Bias

• Describes the discrepancy between training and testing

• Training: always have the ground truth tokens at each step

• Test: the model uses its own predictions at each step

• Outcome: the model is unable to recover from its own errors

[Diagram: the decoder makes an early mistake ("drinks" instead of "eats"); that prediction is fed back in, and the decoder continues down the wrong path, producing "cow drinks water".]


Greedy Decoding

• argmax decoding is also called greedy decoding

• Issue: does not guarantee the optimal probability P(y | x)


Exhaustive Search Decoding

• To find the optimal P(y | x), we need to consider every word at every step to compute the probability of all possible sequences

• O(V^n); V = vocab size, n = sentence length

• Far too expensive to be feasible


Beam Search Decoding

• Instead of considering all possible words at every
step, consider k best words

• That is, we keep track of the top-k hypotheses (partial translations) with the best scores thus far (a minimal sketch follows after this list)

• k = beam width (typically 5 to 10)

• k = 1 = greedy decoding

• k = V = exhaustive search decoding
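A minimal Python sketch of the beam-search bookkeeping; step_logprobs is a hypothetical function returning {word: log P(word | x, prefix)} for the current prefix (e.g. a wrapper around one decoder step), so this is model-agnostic and illustrative only.

```python
def beam_search(step_logprobs, k=2, eos="</s>", max_len=50):
    beams = [(["<s>"], 0.0)]                      # (partial hypothesis, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:      # keep only the k best partial hypotheses
            if prefix[-1] == eos:
                finished.append((prefix, score))  # store terminated hypotheses ...
            else:
                beams.append((prefix, score))     # ... and keep exploring the rest
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```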


Beam search decoding with k=2

[Diagrams: a worked example with beam width k = 2. Starting from <s>, the two best first words are kept, e.g. score(<s> cow) = log P(cow | x, <s>) = -0.5 and score(<s> tiger) = -0.7. Each surviving hypothesis is extended by one word and scored with its cumulative log-probability, e.g. score(<s> cow eats) = -0.5 + log P(eats | x, <s> cow) = -1.3 and score(<s> cow eats grass) = -2.3, and only the two best partial hypotheses are kept at each step. The best complete hypothesis ends with </s> at log-probability -2.9 ("Best translation probability!"); following its back-pointers recovers the target sentence, cow eats grass.]


When to Stop?

• When decoding, we stop when we generate the </s> (end of sentence) token

• But multiple hypotheses may terminate their sentences at different time steps

• We store hypotheses that have terminated, and continue exploring those that haven't

• Typically we also set a maximum sentence length that can be generated (e.g. 50 words)


PollEv.com/jeyhanlau569

What are some issues of NMT?

• Information of the whole source sentence is represented by a single vector
• NMT can generate new details not in the source sentence
• NMT tends to generate sentences that are not very fluent
• Black-box model; difficult to explain when it doesn't work


Neural MT: Summary

• Single end-to-end model

‣ Statistical MT systems have multiple sub-
components

• Less feature engineering

• Can produce new details that are not in the source
sentence (hallucination)

Lee, Firat, Agarwal, Fannjiang, Sussillo, 2018


Attention Mechanism

[Diagram: the encoder-decoder model again, translating 牛 吃 草 into "cow eats grass". The only thing passed from the encoder to the decoder is the encoder's final hidden state. This vector encodes the whole source sentence!]

• With a long source sentence, the encoded vector is
unlikely to capture all the information in the sentence

• This creates an information bottleneck


Attention

• At every decoding time step, allow the decoder to 'attend' to words in the source sentence


[Diagrams: at each decoding time step, the decoder state is compared with every encoder hidden state via a dot product; a softmax over these scores gives the attention distribution, and the weighted sum of the encoder states is combined with the decoder state to predict the next word. The first time step attends to "牛" to predict cow, the second attends to "吃" to predict eats, and so on for grass.]


Encoder-Decoder with Attention

• Encoder hidden states: h_i = RNN1(h_{i-1}, x_i)

• Decoder hidden states: s_t = RNN2(s_{t-1}, y_t)

• For each time step t in the decoder, attend to each of the encoder hidden states to produce the attention weights:

e_t = [s_t^T h_1, s_t^T h_2, ..., s_t^T h_{|x|}]

• Apply softmax to the attention weights to get a valid probability distribution:

α_t = softmax(e_t)

• Compute the context vector c_t as the weighted sum of the encoder hidden states:

c_t = Σ_{i=1}^{|x|} α_{ti} h_i

• Concatenate c_t and s_t to predict the next word
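A minimal numpy sketch of one decoder time step with dot-product attention, following the equations above (toy dimensions, illustrative only):

```python
import numpy as np

def attend(s_t, H):
    # s_t: decoder state, shape (d,);  H: encoder hidden states, shape (|x|, d)
    e_t = H @ s_t                                   # e_t[i] = s_t . h_i
    e_t = e_t - e_t.max()                           # numerically stable softmax
    alpha_t = np.exp(e_t) / np.exp(e_t).sum()       # attention distribution over source words
    c_t = alpha_t @ H                               # context vector: weighted sum of encoder states
    return np.concatenate([c_t, s_t])               # [c_t; s_t] is used to predict the next word

H = np.random.randn(3, 4)       # three source words (e.g. 牛 吃 草), hidden size d = 4
s_t = np.random.randn(4)
print(attend(s_t, H).shape)     # (8,)
```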


Variants

• Attention:

‣ Dot product: s_t^T h_i

‣ Bilinear: s_t^T W h_i

‣ Additive: v^T tanh(W_s s_t + W_h h_i)

• c_t can be injected into the current state (s_t), or into the input word (y_t)
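The three score functions written out as a small numpy sketch (the weight shapes are hypothetical; W, W_s, W_h and v are learned parameters):

```python
import numpy as np

d = 4                                # toy hidden size
W   = np.random.randn(d, d)          # bilinear weight matrix
W_s = np.random.randn(d, d)          # additive: transforms the decoder state
W_h = np.random.randn(d, d)          # additive: transforms the encoder state
v   = np.random.randn(d)

def score_dot(s_t, h_i):       return s_t @ h_i
def score_bilinear(s_t, h_i):  return s_t @ W @ h_i
def score_additive(s_t, h_i):  return v @ np.tanh(W_s @ s_t + W_h @ h_i)
```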


Attention: Summary

• Solves the information bottleneck issue by allowing the decoder to access the source sentence words directly

• Provides some form of interpretability

‣ Attention weights can be seen as word
alignments

• Most state-of-the-art NMT systems use attention

‣ Google Translate (https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/)



Evaluation


MT Evaluation
• BLEU: compute n-gram overlap between the "reference" translation and the generated translation

• Typically computed for 1- to 4-grams

BLEU = BP × exp( (1/N) Σ_{n=1}^{N} log p_n )

p_n = (# correct n-grams) / (# predicted n-grams)

BP = min(1, output length / reference length)

("Brevity Penalty" to penalise short outputs)
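A minimal sketch of the formula above for a single sentence pair, with n-gram counts clipped against one reference and no smoothing (real evaluations typically use a corpus-level implementation such as sacreBLEU):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, output, N=4):
    log_p = 0.0
    for n in range(1, N + 1):
        pred = ngrams(output, n)
        correct = sum((pred & ngrams(reference, n)).values())  # overlapping (clipped) n-grams
        if correct == 0:
            return 0.0                  # any p_n = 0 makes unsmoothed BLEU zero
        log_p += math.log(correct / sum(pred.values())) / N    # (1/N) * sum_n log p_n
    bp = min(1.0, len(output) / len(reference))                # brevity penalty
    return bp * math.exp(log_p)

ref = "the cow eats the green grass".split()
out = "the cow eats the grass".split()
print(round(bleu(ref, out), 2))         # 0.59
```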


A Final Word

• Statistical MT

• Neural MT

‣ Nowadays use Transformers rather than RNNs

• The encoder-decoder with attention is a general architecture that can be used for other tasks

‣ Summarisation (lecture 21)

‣ Other generation tasks such as dialogue
generation


Reading

• JM3 Ch. 11.2-11.5

• Optional: GIZA++ word alignment – https://www.aclweb.org/anthology/W08-0509.pdf
