
arXiv:1508.04025v5 [cs.CL] 20 Sep 2015

Effective Approaches to Attention-based Neural Machine Translation

Minh-Thang Luong Hieu Pham Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305

{lmthang,hyhieu,manning}@stanford.edu

Abstract

An attentional mechanism has lately been
used to improve neural machine transla-
tion (NMT) by selectively focusing on
parts of the source sentence during trans-
lation. However, there has been little
work exploring useful architectures for
attention-based NMT. This paper exam-
ines two simple and effective classes of at-
tentional mechanism: a global approach
which always attends to all source words
and a local one that only looks at a subset
of source words at a time. We demonstrate
the effectiveness of both approaches on the
WMT translation tasks between English
and German in both directions. With local
attention, we achieve a significant gain of
5.0 BLEU points over non-attentional sys-
tems that already incorporate known tech-
niques such as dropout. Our ensemble
model using different attention architec-
tures yields a new state-of-the-art result in
the WMT’15 English to German transla-
tion task with 25.9 BLEU points, an im-
provement of 1.0 BLEU points over the
existing best system backed by NMT and
an n-gram reranker.1

1 Introduction

Neural Machine Translation (NMT) has achieved
state-of-the-art performance in large-scale trans-
lation tasks such as English to French
(Luong et al., 2015) and English to German
(Jean et al., 2015). NMT is appealing since it re-
quires minimal domain knowledge and is concep-
tually simple. The model by Luong et al. (2015)
reads through all the source words until the end-of-
sentence symbol <eos> is reached. It then starts

1All our code and models are publicly available at
http://nlp.stanford.edu/projects/nmt.

Figure 1: Neural machine translation – a stack-
ing recurrent architecture for translating a source
sequence A B C D into a target sequence X Y
Z. Here, <eos> marks the end of a sentence.

emitting one target word at a time, as illustrated in
Figure 1. NMT is often a large neural network that
is trained in an end-to-end fashion and has the abil-
ity to generalize well to very long word sequences.
This means the model does not have to explicitly
store gigantic phrase tables and language models
as in the case of standard MT; hence, NMT has
a small memory footprint. Lastly, implementing
NMT decoders is easy unlike the highly intricate
decoders in standard MT (Koehn et al., 2003).

In parallel, the concept of “attention” has
gained popularity recently in training neural net-
works, allowing models to learn alignments be-
tween different modalities, e.g., between image
objects and agent actions in the dynamic con-
trol problem (Mnih et al., 2014), between speech
frames and text in the speech recognition task
(?), or between visual features of a picture and
its text description in the image caption gener-
ation task (Xu et al., 2015). In the context of
NMT, Bahdanau et al. (2015) have successfully ap-
plied such an attentional mechanism to jointly trans-
late and align words. To the best of our knowl-
edge, there has not been any other work exploring
the use of attention-based architectures for NMT.

In this work, we design, with simplicity and ef-

fectiveness in mind, two novel types of attention-
based models: a global approach in which all
source words are attended and a local one whereby
only a subset of source words are considered at a
time. The former approach resembles the model
of (Bahdanau et al., 2015) but is simpler architec-
turally. The latter can be viewed as an interesting
blend between the hard and soft attention models
proposed in (Xu et al., 2015): it is computation-
ally less expensive than the global model or the
soft attention; at the same time, unlike the hard at-
tention, the local attention is differentiable almost
everywhere, making it easier to implement and
train. In addition, we examine various align-
ment functions for our attention-based models.

Experimentally, we demonstrate that both of
our approaches are effective in the WMT trans-
lation tasks between English and German in both
directions. Our attentional models yield a boost
of up to 5.0 BLEU over non-attentional systems
which already incorporate known techniques such
as dropout. For English to German translation,
we achieve new state-of-the-art (SOTA) results
for both WMT’14 and WMT’15, outperforming
previous SOTA systems, backed by NMT mod-
els and n-gram LM rerankers, by more than 1.0
BLEU. We conduct extensive analysis to evaluate
our models in terms of learning, the ability to han-
dle long sentences, choices of attentional architec-
tures, alignment quality, and translation outputs.

2 Neural Machine Translation

A neural machine translation system is a neural
network that directly models the conditional prob-
ability p(y|x) of translating a source sentence,
x1, . . . , xn, to a target sentence, y1, . . . , ym.

A
basic form of NMT consists of two components:
(a) an encoder which computes a representation s
for each source sentence and (b) a decoder which
generates one target word at a time and hence de-
composes the conditional probability as:

\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)

A natural choice to model such a decomposition
in the decoder is to use a recurrent neural net-
work (RNN) architecture, which most of the re-
cent NMT work such as (Kalchbrenner and Blunsom, 2013;
Sutskever et al., 2014; Cho et al., 2014;
Bahdanau et al., 2015; Luong et al., 2015;
Jean et al., 2015) has in common. They, how-
ever, differ in terms of which RNN architectures
are used for the decoder and how the encoder
computes the source sentence representation s.

Kalchbrenner and Blunsom (2013) used an
RNN with the standard hidden unit for the
decoder and a convolutional neural network for
encoding the source sentence representation. On
the other hand, both Sutskever et al. (2014) and
Luong et al. (2015) stacked multiple layers of an
RNN with a Long Short-Term Memory (LSTM)
hidden unit for both the encoder and the decoder.
Cho et al. (2014), Bahdanau et al. (2015), and
Jean et al. (2015) all adopted a different version of
the RNN with an LSTM-inspired hidden unit, the
gated recurrent unit (GRU), for both components.

In more detail, one can parameterize the proba-
bility of decoding each word yj as:

p(y_j \mid y_{<j}, s) = \mathrm{softmax}\big(g(h_j)\big),

where g is a transformation function that outputs a
vocabulary-sized vector and h_j is the decoder RNN
hidden state.
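To make the factorization above concrete, the following is a minimal numpy sketch that scores a target sentence as the sum of per-step log-probabilities, i.e., log p(y|x) = Σ_j log p(y_j | y_{<j}, s). The recurrent step, the output projection standing in for g, the toy dimensions, and the bos_id index are illustrative placeholders, not the paper's MATLAB implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_prob_of_target(src_repr, target_ids, params):
    """Score a target sentence under log p(y|x) = sum_j log p(y_j | y_<j, s).
    src_repr: source representation s (a single vector here).
    target_ids: target word indices y_1..y_m.
    params: toy decoder weights (illustrative only)."""
    W_h, W_emb, W_out = params["W_h"], params["W_emb"], params["W_out"]
    h = np.tanh(W_h @ src_repr)          # initialize the decoder state from s
    total, prev = 0.0, params["bos_id"]  # assumed begin-of-sentence index
    for y in target_ids:
        h = np.tanh(W_h @ h + W_emb[prev])   # simple recurrent step: previous state + previous word
        p = softmax(W_out @ h)               # p(y_j | y_<j, s) = softmax(g(h_j)), g linear here
        total += np.log(p[y])
        prev = y
    return total

# toy usage with random weights and a 20-word vocabulary
rng = np.random.default_rng(0)
d, V = 8, 20
params = {"W_h": rng.normal(scale=0.1, size=(d, d)),
          "W_emb": rng.normal(scale=0.1, size=(V, d)),
          "W_out": rng.normal(scale=0.1, size=(V, d)),
          "bos_id": 0}
print(log_prob_of_target(rng.normal(size=d), [3, 7, 1], params))
```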

Figure 4: Input-feeding approach – Attentional
vectors h̃t are fed as inputs to the next time steps to
inform the model about past alignment decisions.

Comparison to (Gregor et al., 2015) – Gregor et
al. (2015) have proposed a selective attention mech-
anism, very similar to our local attention, for the
image generation task. Their approach allows the
model to select an
image patch of varying location and zoom. We,
instead, use the same “zoom” for all target posi-
tions, which greatly simplifies the formulation and
still achieves good performance.

3.3 Input-feeding Approach

In our proposed global and local approaches,
the attentional decisions are made independently,
which is suboptimal. In standard MT, by contrast,
a coverage set is often maintained during the
translation process to keep track of which source
words have been translated. Likewise, in atten-
tional NMT, alignment decisions should be made
jointly taking into account past alignment infor-
mation. To address that, we propose an input-
feeding approach in which attentional vectors h̃t
are concatenated with inputs at the next time steps
as illustrated in Figure 4.11 The effects of hav-
ing such connections are two-fold: (a) we hope
to make the model fully aware of previous align-
ment choices and (b) we create a very deep net-
work spanning both horizontally and vertically.
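The sketch below illustrates the input-feeding connection under simplifying assumptions: the attentional vector h̃ from the previous step is concatenated with the current target-word embedding, so the first LSTM layer sees an input of size 2n (cf. footnote 11). The single-layer LSTM and the dot-product global attention here are generic stand-ins for the authors' stacked model, not their implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step; W maps [x; h] to the four gates (input, forget, output, cell)."""
    z = W @ np.concatenate([x, h])
    n = h.shape[0]
    i, f, o, g = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n]), np.tanh(z[3*n:])
    c = f * c + i * g
    return o * np.tanh(c), c

def global_attention(h_t, src_states, W_c):
    """Dot-product global attention producing the attentional vector h̃_t."""
    scores = src_states @ h_t                           # score(h_t, h̄_s) = h_t · h̄_s
    a = np.exp(scores - scores.max()); a /= a.sum()     # alignment weights a_t
    c_t = a @ src_states                                # context vector c_t
    return np.tanh(W_c @ np.concatenate([c_t, h_t]))    # attentional vector from [c_t; h_t]

def decode_with_input_feeding(src_states, tgt_embs, n, seed=0):
    """Input feeding: h̃ from the previous step is concatenated with the current
    embedding, so the first-layer input has size 2n."""
    rng = np.random.default_rng(seed)
    W_lstm = rng.normal(scale=0.1, size=(4 * n, 2 * n + n))  # input [emb; h̃] plus recurrent h
    W_c = rng.normal(scale=0.1, size=(n, 2 * n))
    h, c, h_tilde, outputs = np.zeros(n), np.zeros(n), np.zeros(n), []
    for emb in tgt_embs:
        x = np.concatenate([emb, h_tilde])               # the input-feeding connection
        h, c = lstm_step(x, h, c, W_lstm)
        h_tilde = global_attention(h, src_states, W_c)
        outputs.append(h_tilde)                          # h̃_t then feeds the output softmax
    return outputs

# toy usage: 5 source states and 3 target embeddings, all of size n = 8
rng = np.random.default_rng(1)
n = 8
print(len(decode_with_input_feeding(rng.normal(size=(5, n)), rng.normal(size=(3, n)), n)))
```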

Comparison to other work –
Bahdanau et al. (2015) use context vectors,
similar to our ct, in building subsequent hidden
states, which can also achieve the “coverage”
effect. However, there has not been any analysis
of whether such connections are useful as done
in this work. Also, our approach is more general;
as illustrated in Figure 4, it can be applied to
general stacking recurrent architectures, including
non-attentional models.

11If n is the number of LSTM cells, the input size of the
first LSTM layer is 2n; those of subsequent layers are n.

Xu et al. (2015) propose a doubly attentional
approach with an additional constraint added to
the training objective to make sure the model pays
equal attention to all parts of the image during the
caption generation process. Such a constraint can
also be useful to capture the coverage set effect
in NMT that we mentioned earlier. However, we
chose to use the input-feeding approach since it
provides flexibility for the model to decide on any
attentional constraints it deems suitable.

4 Experiments

We evaluate the effectiveness of our models
on the WMT translation tasks between En-
glish and German in both directions. new-
stest2013 (3000 sentences) is used as a develop-
ment set to select our hyperparameters. Transla-
tion performances are reported in case-sensitive
BLEU (Papineni et al., 2002) on newstest2014
(2737 sentences) and newstest2015 (2169 sen-
tences). Following (Luong et al., 2015), we report
translation quality using two types of BLEU: (a)
tokenized12 BLEU to be comparable with existing
NMT work and (b) NIST13 BLEU to be compara-
ble with WMT results.

4.1 Training Details

All our models are trained on the WMT’14 train-
ing data consisting of 4.5M sentence pairs (116M
English words, 110M German words). Similar
to (Jean et al., 2015), we limit our vocabularies to
be the top 50K most frequent words for both lan-
guages. Words not in these shortlisted vocabular-
ies are converted into a universal token <unk>.
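A small sketch of this shortlisting step, assuming whitespace-tokenized text: the most frequent words are kept and every other word maps to <unk>. The helper names are ours, not part of the released code.

```python
from collections import Counter

def build_vocab(sentences, size=50000, unk="<unk>"):
    """Keep the `size` most frequent words; everything else maps to <unk>."""
    counts = Counter(w for sent in sentences for w in sent.split())
    word2id = {unk: 0}
    for w, _ in counts.most_common(size):
        word2id.setdefault(w, len(word2id))
    return word2id

def encode(sentence, word2id, unk="<unk>"):
    return [word2id.get(w, word2id[unk]) for w in sentence.split()]

# toy usage: 'bird' falls outside the shortlist and becomes <unk>
vocab = build_vocab(["the cat sat", "the dog sat"], size=3)
print(encode("the bird sat", vocab))
```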

When training our NMT systems, following
(Bahdanau et al., 2015; Jean et al., 2015), we fil-
ter out sentence pairs whose lengths exceed
50 words and shuffle mini-batches as we pro-
ceed. Our stacking LSTM models have 4 lay-
ers, each with 1000 cells, and 1000-dimensional
embeddings. We follow (Sutskever et al., 2014;
Luong et al., 2015) in training NMT with similar
settings: (a) our parameters are uniformly initial-
ized in [−0.1, 0.1], (b) we train for 10 epochs

12All texts are tokenized with tokenizer.perl and
BLEU scores are computed with multi-bleu.perl.

13With the mteval-v13a script as per WMT guideline.

System | Ppl | BLEU
Winning WMT’14 system – phrase-based + large LM (Buck et al., 2014) | – | 20.7
Existing NMT systems
RNNsearch (Jean et al., 2015) | – | 16.5
RNNsearch + unk replace (Jean et al., 2015) | – | 19.0
RNNsearch + unk replace + large vocab + ensemble 8 models (Jean et al., 2015) | – | 21.6
Our NMT systems
Base | 10.6 | 11.3
Base + reverse | 9.9 | 12.6 (+1.3)
Base + reverse + dropout | 8.1 | 14.0 (+1.4)
Base + reverse + dropout + global attention (location) | 7.3 | 16.8 (+2.8)
Base + reverse + dropout + global attention (location) + feed input | 6.4 | 18.1 (+1.3)
Base + reverse + dropout + local-p attention (general) + feed input | 5.9 | 19.0 (+0.9)
Base + reverse + dropout + local-p attention (general) + feed input + unk replace | – | 20.9 (+1.9)
Ensemble 8 models + unk replace | – | 23.0 (+2.1)

Table 1: WMT’14 English-German results – shown are the perplexities (ppl) and the tokenized BLEU
scores of various systems on newstest2014. We highlight the best system in bold and give progressive
improvements in italics between consecutive systems. local-p refers to the local attention with predictive
alignments. We indicate for each attention model the alignment score function used in parentheses.

using plain SGD, (c) a simple learning rate sched-
ule is employed – we start with a learning rate of
1; after 5 epochs, we begin to halve the learning
rate every epoch, (d) our mini-batch size is 128,
and (e) the normalized gradient is rescaled when-
ever its norm exceeds 5. Additionally, we also
use dropout with probability 0.2 for our LSTMs as
suggested by (Zaremba et al., 2015). For dropout
models, we train for 12 epochs and start halving
the learning rate after 8 epochs. For local atten-
tion models, we empirically set the window size
D = 10.
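A minimal sketch of settings (a)–(e): uniform initialization in [−0.1, 0.1], plain SGD, the halve-after-epoch-5 schedule, and rescaling the gradient whenever its norm exceeds 5. The model and gradient function are placeholders, not the paper's networks.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale the gradient whenever its (global) norm exceeds max_norm."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    return [g * (max_norm / norm) for g in grads] if norm > max_norm else grads

def lr_at_epoch(epoch, base_lr=1.0, start_halving=5):
    """Start at lr = 1; after `start_halving` epochs, halve the rate every epoch."""
    return base_lr * (0.5 ** max(0, epoch - start_halving))

def train(params, grad_fn, batches_per_epoch, epochs=10):
    """Plain SGD loop; grad_fn(params, step) returns one gradient per parameter."""
    for epoch in range(1, epochs + 1):
        lr = lr_at_epoch(epoch)
        for step in range(batches_per_epoch):
            grads = clip_by_global_norm(grad_fn(params, step), max_norm=5.0)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

# toy usage: uniform initialization and a dummy gradient function
rng = np.random.default_rng(0)
params = [rng.uniform(-0.1, 0.1, size=(4, 4))]
dummy_grad = lambda ps, step: [2.0 * p for p in ps]   # placeholder gradient
print(train(params, dummy_grad, batches_per_epoch=3)[0].shape)
```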

Our code is implemented in MATLAB. When
running on a single Tesla K40 GPU, we
achieve a speed of 1K target words per second.
It takes 7–10 days to completely train a model.

4.2 English-German Results

We compare our NMT systems in the English-
German task with various other systems. These
include the winning system in WMT’14
(Buck et al., 2014), a phrase-based system
whose language models were trained on a huge
monolingual text, the Common Crawl corpus.
For end-to-end NMT systems, to the best of
our knowledge, (Jean et al., 2015) is the only
work experimenting with this language pair and
currently the SOTA system. We only present
results for some of our attention models and will
later analyze the rest in Section 5.

As shown in Table 1, we achieve pro-

gressive improvements when (a) reversing the
source sentence, +1.3 BLEU, as proposed in
(Sutskever et al., 2014) and (b) using dropout,
+1.4 BLEU. On top of that, (c) the global atten-
tion approach gives a significant boost of +2.8
BLEU, making our model slightly better than the
base attentional system of Bahdanau et al. (2015)
(row RNNSearch). When (d) using the input-
feeding approach, we seize another notable gain
of +1.3 BLEU and outperform their system. The
local attention model with predictive alignments
(row local-p) proves to be even better, giving
us a further improvement of +0.9 BLEU on top
of the global attention model. It is interest-
ing to observe the trend previously reported in
(Luong et al., 2015) that perplexity strongly corre-
lates with translation quality. In total, we achieve
a significant gain of 5.0 BLEU points over the
non-attentional baseline, which already includes
known techniques such as source reversing and
dropout.

The unknown replacement technique proposed
in (Luong et al., 2015; Jean et al., 2015) yields an-
other nice gain of +1.9 BLEU, demonstrating that
our attentional models do learn useful alignments
for unknown words. Finally, by ensembling 8
different models of various settings, e.g., using
different attention approaches, with and without
dropout etc., we were able to achieve a new SOTA
result of 23.0 BLEU, outperforming the existing

best system (Jean et al., 2015) by +1.4 BLEU.
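One simple way to realize the unknown-word replacement described above is sketched below: each <unk> in the output is replaced by the source word that received the highest attention weight at that decoding step. This raw copy is a simplification; the cited technique can also consult a bilingual dictionary.

```python
import numpy as np

def replace_unks(target_tokens, source_tokens, attn, unk="<unk>"):
    """attn[t, s] is the attention weight on source position s at target step t.
    Each <unk> is replaced by the most-attended source word (raw copy)."""
    return [source_tokens[int(np.argmax(attn[t]))] if tok == unk else tok
            for t, tok in enumerate(target_tokens)]

# toy usage: the second target token is <unk> and attends mostly to source position 1
src = ["Orlando", "Bloom", "and", "Miranda", "Kerr"]
tgt = ["Orlando", "<unk>", "und", "Miranda", "Kerr"]
attn = np.full((5, 5), 0.1)
attn[1, 1] = 0.6
print(replace_unks(tgt, src, attn))   # the <unk> becomes "Bloom"
```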

System | BLEU
Top – NMT + 5-gram rerank (Montreal) | 24.9
Our ensemble 8 models + unk replace | 25.9

Table 2: WMT’15 English-German results –
NIST BLEU scores of the winning entry in
WMT’15 and our best one on newstest2015.

Latest results in WMT’15 – despite the fact that
our models were trained on WMT’14 with slightly
less data, we test them on newstest2015 to demon-
strate that they can generalize well to different test
sets. As shown in Table 2, our best system es-
tablishes a new SOTA performance of 25.9 BLEU,
outperforming the existing best system backed by
NMT and a 5-gram LM reranker by +1.0 BLEU.

4.3 German-English Results

We carry out a similar set of experiments for the
WMT’15 translation task from German to En-
glish. While our systems have not yet matched
the performance of the SOTA system, we never-
theless show the effectiveness of our approaches
with large and progressive gains in terms of BLEU
as illustrated in Table 3. The attentional mech-
anism gives us +2.2 BLEU gain and on top of
that, we obtain another boost of up to +1.0 BLEU
from the input-feeding approach. Using a better
alignment function, the content-based dot product
one, together with dropout yields another gain of
+2.7 BLEU. Lastly, when applying the unknown
word replacement technique, we seize an addi-
tional +2.1 BLEU, demonstrating the usefulness
of attention in aligning rare words.

5 Analysis

We conduct extensive analysis to better understand
our models in terms of learning, the ability to han-
dle long sentences, choices of attentional architec-
tures, and alignment quality. All results reported
here are on English-German newstest2014.

5.1 Learning curves

We compare models built on top of one another as
listed in Table 1. It is pleasant to observe in Fig-
ure 5 a clear separation between non-attentional
and attentional models. The input-feeding ap-
proach and the local attention model also demon-
strate their abilities in driving the test costs lower.
The non-attentional model with dropout (the blue

System | Ppl | BLEU
WMT’15 systems
SOTA – phrase-based (Edinburgh) | – | 29.2
NMT + 5-gram rerank (MILA) | – | 27.6
Our NMT systems
Base (reverse) | 14.3 | 16.9
+ global (location) | 12.7 | 19.1 (+2.2)
+ global (location) + feed | 10.9 | 20.1 (+1.0)
+ global (dot) + drop + feed | 9.7 | 22.8 (+2.7)
+ global (dot) + drop + feed + unk | – | 24.9 (+2.1)

Table 3: WMT’15 German-English results –
performances of various systems (similar to Ta-
ble 1). The base system already includes source
reversing on which we add global attention,
dropout, input feeding, and unk replacement.

Figure 5: Learning curves – test cost (ln perplexity) on newstest2014 for English-German NMTs as
training progresses. (Curves shown: basic, basic+reverse, basic+reverse+dropout,
basic+reverse+dropout+globalAttn, basic+reverse+dropout+globalAttn+feedInput, and
basic+reverse+dropout+pLocalAttn+feedInput; x-axis: mini-batches ×10^5; y-axis: test cost.)

+ curve) learns slower than other non-dropout
models, but as time goes by, it becomes more ro-
bust in terms of minimizing test errors.

5.2 Effects of Translating Long Sentences

We follow (Bahdanau et al., 2015) to group sen-
tences of similar lengths together and compute
a BLEU score per group. Figure 6 shows that
our attentional models are more effective than the
non-attentional one in handling long sentences:
the quality does not degrade as sentences become
longer. Our best model (the blue + curve) outper-
forms all other systems in all length buckets.
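The bucketed evaluation behind this analysis can be sketched as follows: test sentences are grouped by source length and a corpus-level score is computed per bucket. The bucket edges and the scoring callback are assumptions, not the exact setup behind Figure 6.

```python
from collections import defaultdict

def score_by_length(sources, hypotheses, references, score_fn,
                    edges=(10, 20, 30, 40, 50, 60, 70)):
    """Group test sentences into source-length buckets and score each bucket.
    score_fn(hyps, refs) -> float, e.g. a corpus-level BLEU implementation.
    Sentences longer than the last edge fall into the last bucket."""
    buckets = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(sources, hypotheses, references):
        edge = next((e for e in edges if len(src.split()) <= e), edges[-1])
        buckets[edge][0].append(hyp)
        buckets[edge][1].append(ref)
    return {edge: score_fn(hyps, refs) for edge, (hyps, refs) in sorted(buckets.items())}

# toy usage with a trivial scoring callback (fraction of exact matches)
exact = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(hyps)
print(score_by_length(["a b c", "a b c d e f g h i j k l"],
                      ["x", "y"], ["x", "z"], exact))
```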

5.3 Choices of Attentional Architectures

We examine different attention models (global,
local-m, local-p) and different alignment func-
tions (location, dot, general, concat) as described
in Section 3. Due to limited resources, we can-
not run all the possible combinations. However,
results in Table 4 do give us some idea about dif-
ferent choices. The location-based function does

Figure 6: Length Analysis – translation qualities of different systems as sentences become longer.
(BLEU vs. sentence length for: ours, no attn (BLEU 13.9); ours, local-p attn (BLEU 20.9); ours, best
system (BLEU 23.0); WMT’14 best (BLEU 20.7); Jean et al., 2015 (BLEU 21.6).)

System | Ppl | BLEU before unk replace | BLEU after unk replace
global (location) | 6.4 | 18.1 | 19.3 (+1.2)
global (dot) | 6.1 | 18.6 | 20.5 (+1.9)
global (general) | 6.1 | 17.3 | 19.1 (+1.8)
local-m (dot) | >7.0 | x | x
local-m (general) | 6.2 | 18.6 | 20.4 (+1.8)
local-p (dot) | 6.6 | 18.0 | 19.6 (+1.9)
local-p (general) | 5.9 | 19.0 | 20.9 (+1.9)

Table 4: Attentional Architectures – perfor-
mances of different attentional models. We trained
two local-m (dot) models; both have ppl > 7.0.

not learn good alignments: the global (location)
model can only obtain a small gain when per-
forming unknown word replacement compared to
using other alignment functions.14 For content-
based functions, our implementation concat does
not yield good performances and more analysis
should be done to understand the reason.15 It is
interesting to observe that dot works well for the
global attention and general is better for the local
attention. Among the different models, the local
attention model with predictive alignments (local-
p) is best, both in terms of perplexities and BLEU.
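For reference, the sketch below spells out the four alignment score functions compared here, following the definitions in Section 3 (not reproduced in this excerpt): dot uses h_t · h̄_s, general uses h_tᵀ W_a h̄_s, concat uses v_aᵀ tanh(W_a [h_t; h̄_s]), and location derives the weights from the target state alone as softmax(W_a h_t). The dimensions and weights are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def align_weights(h_t, src_states, method, W_a=None, v_a=None, src_len=None):
    """Alignment weights a_t over source positions.
    h_t: target hidden state, shape (n,); src_states: source states h̄_s, shape (S, n)."""
    if method == "dot":                      # score = h_t · h̄_s
        scores = src_states @ h_t
    elif method == "general":                # score = h_tᵀ W_a h̄_s
        scores = src_states @ (W_a.T @ h_t)
    elif method == "concat":                 # score = v_aᵀ tanh(W_a [h_t; h̄_s])
        both = np.concatenate([np.tile(h_t, (len(src_states), 1)), src_states], axis=1)
        scores = np.tanh(both @ W_a.T) @ v_a
    elif method == "location":               # a_t = softmax(W_a h_t), ignores h̄_s
        return softmax((W_a @ h_t)[:src_len])
    else:
        raise ValueError(method)
    return softmax(scores)

# toy usage: 5 source states, hidden size 8
rng = np.random.default_rng(0)
n, S = 8, 5
h_t, src = rng.normal(size=n), rng.normal(size=(S, n))
print(align_weights(h_t, src, "dot").shape)
print(align_weights(h_t, src, "general", W_a=rng.normal(size=(n, n))).shape)
print(align_weights(h_t, src, "concat", W_a=rng.normal(size=(n, 2 * n)),
                    v_a=rng.normal(size=n)).shape)
print(align_weights(h_t, src, "location", W_a=rng.normal(size=(S, n)), src_len=S).shape)
```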

5.4 Alignment Quality

A by-product of attentional models is word align-
ments. While (Bahdanau et al., 2015) visualized

14There is a subtle difference in how we retrieve align-
ments for the different alignment functions. At time step t in
which we receive yt−1 as input and then compute ht,at, ct,
and h̃t before predicting yt, the alignment vector at is used
as alignment weights for (a) the predicted word yt in the
location-based alignment functions and (b) the input word
yt−1 in the content-based functions.

15With concat, the perplexities achieved by different mod-
els are 6.7 (global), 7.1 (local-m), and 7.1 (local-p). Such
high perplexities could be due to the fact that we simplify the
matrix Wa to set the part that corresponds to h̄s to identity.

Method | AER
global (location) | 0.39
local-m (general) | 0.34
local-p (general) | 0.36
ensemble | 0.34
Berkeley Aligner | 0.32

Table 6: AER scores – results of various models
on the RWTH English-German alignment data.

alignments for some sample sentences and ob-
served gains in translation quality as an indica-
tion of a working attention model, no work has as-
sessed the alignments learned as a whole. In con-
trast, we set out to evaluate the alignment quality
using the alignment error rate (AER) metric.

Given the gold alignment data provided by
RWTH for 508 English-German Europarl sen-
tences, we “force” decode our attentional models
to produce translations that match the references.
We extract only one-to-one alignments by select-
ing the source word with the highest alignment
weight per target word. Nevertheless, as shown in
Table 6, we were able to achieve AER scores com-
parable to the one-to-many alignments obtained
by the Berkeley aligner (Liang et al., 2006).16
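The extraction and scoring just described can be sketched as follows: for each target word we keep only the most-attended source position, then compute the standard AER, 1 − (|A∩S| + |A∩P|)/(|A| + |S|), against gold sure (S) and possible (P) links. The (source, target) link format is an assumption about how the gold data is represented.

```python
import numpy as np

def extract_one_to_one(attn):
    """attn[t, s]: attention weight on source position s at target step t.
    Keep, per target word, only the most-attended source word."""
    return {(int(np.argmax(attn[t])), t) for t in range(attn.shape[0])}

def aer(predicted, sure, possible):
    """Alignment error rate: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|).
    All arguments are sets of (source_index, target_index) links;
    `possible` is assumed to include the sure links."""
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) / (len(predicted) + len(sure))

# toy usage: 3 target words attending over 4 source words
attn = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.6, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
pred = extract_one_to_one(attn)             # {(0, 0), (1, 1), (3, 2)}
sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2)}
print(round(aer(pred, sure, possible), 3))  # 0.2
```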

We also found that the alignments produced by
local attention models achieve lower AERs than
those of the global one. The AER obtained by the
ensemble, while good, is not better than the local-
m AER, suggesting the well-known observation
that AER and translation scores are not well cor-
related (Fraser and Marcu, 2007). We show some
alignment visualizations in Appendix A.

5.5 Sample Translations

We show in Table 5 sample translations in both
directions. It is appealing to observe the ef-
fect of attentional models in correctly translating
names such as “Miranda Kerr” and “Roger Dow”.
Non-attentional models, while producing sensi-
ble names from a language model perspective,
lack the direct connections from the source side
to make correct translations. We also observed
an interesting case in the second example, which
requires translating the doubly-negated phrase,
“not incompatible”. The attentional model cor-
rectly produces “nicht . . . unvereinbar”; whereas
the non-attentional model generates “nicht verein-

16We concatenate the 508 sentence pairs with 1M sentence
pairs from WMT and run the Berkeley aligner.

English-German translations
src Orlando Bloom and Miranda Kerr still love each other
ref Orlando Bloom und Miranda Kerr lieben sich noch immer

best Orlando Bloom und Miranda Kerr lieben einander noch immer .
base Orlando Bloom und Lucas Miranda lieben einander noch immer .

src ′′ We ′ re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible
with safety and security , ′′ said Roger Dow , CEO of the U.S. Travel Association .

ref “ Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Wider-
spruch zur Sicherheit steht ” , sagte Roger Dow , CEO der U.S. Travel Association .

best ′′ Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und
Sicherheit unvereinbar ist ′′ , sagte Roger Dow , CEO der US – die .

base ′′ Wir freuen uns über die , dass ein mit Sicherheit nicht vereinbar ist mit
Sicherheit und Sicherheit ′′ , sagte Roger Cameron , CEO der US – .

German-English translations
src In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
ref However , in an interview , Bloom has said that he and Kerr still love each other .

best In an interview , however , Bloom said that he and Kerr still love .
base However , in an interview , Bloom said that he and Tina were still .

src Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in
Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhal-
ten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt
Europa sei zu weit gegangen

ref The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket
imposed on national economies through adherence to the common currency , has led many people
to think Project Europe has gone too far .

best Because of the strict austerity measures imposed by Berlin and the European Central Bank in
connection with the straitjacket in which the respective national economy is forced to adhere to
the common currency , many people believe that the European project has gone too far .

base Because of the pressure imposed by the European Central Bank and the Federal Central Bank
with the strict austerity imposed on the national economy in the face of the single currency ,
many people believe that the European project has gone too far .

Table 5: Sample translations – for each example, we show the source (src), the human translation (ref),
the translation from our best model (best), and the translation of a non-attentional model (base). We
italicize some correct translation segments and highlight a few wrong ones in bold.

bar”, meaning “not compatible”.17 The attentional
model also demonstrates its superiority in translat-
ing long sentences as in the last example.

6 Conclusion

In this paper, we propose two simple and effective
attentional mechanisms for neural machine trans-
lation: the global approach which always looks
at all source positions and the local one that only
attends to a subset of source positions at a time.
We test the effectiveness of our models in the
WMT translation tasks between English and Ger-
man in both directions. Our local attention yields
large gains of up to 5.0 BLEU over non-attentional

17The reference uses a more fancy translation of “incom-
patible”, which is “im Widerspruch zu etwas stehen”. Both
models, however, failed to translate “passenger experience”.

models which already incorporate known tech-
niques such as dropout. For the English to Ger-
man translation direction, our ensemble model has
established new state-of-the-art results for both
WMT’14 and WMT’15, outperforming existing
best systems, backed by NMT models and n-gram
LM rerankers, by more than 1.0 BLEU.

We have compared various alignment functions
and shed light on which functions are best for
which attentional models. Our analysis shows that
attention-based NMT models are superior to non-
attentional ones in many cases, for example in
translating names and handling long sentences.

Acknowledgment

We gratefully acknowledge support from a gift
from Bloomberg L.P. and the support of NVIDIA

Corporation with the donation of Tesla K40 GPUs.
We thank Andrew Ng and his group as well as
the Stanford Research Computing for letting us
use their computing resources. We thank Rus-
sell Stewart for helpful discussions on the models.
Lastly, we thank Quoc Le, Ilya Sutskever, Oriol
Vinyals, Richard Socher, Michael Kayser, Jiwei
Li, Panupong Pasupat, Kelvin Guu, members of
the Stanford NLP Group and the anonymous re-
viewers for their valuable comments and feedback.

References

[Bahdanau et al.2015] D. Bahdanau, K. Cho, and
Y. Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In ICLR.

[Buck et al.2014] Christian Buck, Kenneth Heafield,
and Bas van Ooyen. 2014. N-gram counts and lan-
guage models from the common crawl. In LREC.

[Cho et al.2014] Kyunghyun Cho, Bart van Merrien-
boer, Caglar Gulcehre, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. 2014. Learning
phrase representations using RNN encoder-decoder
for statistical machine translation. In EMNLP.

[Fraser and Marcu2007] Alexander Fraser and Daniel
Marcu. 2007. Measuring word alignment quality
for statistical machine translation. Computational
Linguistics, 33(3):293–303.

[Gregor et al.2015] Karol Gregor, Ivo Danihelka, Alex
Graves, Danilo Jimenez Rezende, and Daan Wier-
stra. 2015. DRAW: A recurrent neural network for
image generation. In ICML.

[Jean et al.2015] Sébastien Jean, Kyunghyun Cho,
Roland Memisevic, and Yoshua Bengio. 2015. On
using very large target vocabulary for neural ma-
chine translation. In ACL.

[Kalchbrenner and Blunsom2013] N. Kalchbrenner and
P. Blunsom. 2013. Recurrent continuous translation
models. In EMNLP.

[Koehn et al.2003] Philipp Koehn, Franz Josef Och,
and Daniel Marcu. 2003. Statistical phrase-based
translation. In NAACL.

[Liang et al.2006] P. Liang, B. Taskar, and D. Klein.
2006. Alignment by agreement. In NAACL.

[Luong et al.2015] M.-T. Luong, I. Sutskever, Q. V. Le,
O. Vinyals, and W. Zaremba. 2015. Addressing the
rare word problem in neural machine translation. In
ACL.

[Mnih et al.2014] Volodymyr Mnih, Nicolas Heess,
Alex Graves, and Koray Kavukcuoglu. 2014. Re-
current models of visual attention. In NIPS.

[Papineni et al.2002] Kishore Papineni, Salim Roukos,
Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a
method for automatic evaluation of machine trans-
lation. In ACL.

[Sutskever et al.2014] I. Sutskever, O. Vinyals, and
Q. V. Le. 2014. Sequence to sequence learning with
neural networks. In NIPS.

[Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros,
Kyunghyun Cho, Aaron C. Courville, Ruslan
Salakhutdinov, Richard S. Zemel, and Yoshua Ben-
gio. 2015. Show, attend and tell: Neural image cap-
tion generation with visual attention. In ICML.

[Zaremba et al.2015] Wojciech Zaremba, Ilya
Sutskever, and Oriol Vinyals. 2015. Recurrent
neural network regularization. In ICLR.

A Alignment Visualization

We visualize the alignment weights produced by
our different attention models in Figure 7. The vi-
sualization of the local attention model is much
sharper than that of the global one. This contrast
matches our expectation that local attention is de-
signed to only focus on a subset of words each
time. Also, since we translate from English to Ger-
man and reverse the source English sentence, the
white strides at the words “reality” and “.” in the
global attention model reveal an interesting ac-
cess pattern: it tends to refer back to the beginning
of the source sequence.

Compared to the alignment visualizations in
(Bahdanau et al., 2015), our alignment patterns
are not as sharp as theirs. Such difference could
possibly be due to the fact that translating from
English to German is harder than translating into
French as done in (Bahdanau et al., 2015), which
is an interesting point to examine in future work.

Figure 7: Alignment visualizations – shown are images of the attention weights learned by various
models: (top left) global, (top right) local-m, and (bottom left) local-p. The gold alignments are displayed
at the bottom right corner. (All panels show the sentence pair “They do not understand why Europe exists
in theory but not in reality .” / “Sie verstehen nicht , warum Europa theoretisch zwar existiert , aber nicht
in Wirklichkeit .”)
