程序代写代做代考 ER GPU matlab flex Effective Approaches to Attention-based Neural Machine Translation

Effective Approaches to Attention-based Neural Machine Translation

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421,
Lisbon, Portugal, 17-21 September 2015. c©2015 Association for Computational Linguistics.

Effective Approaches to Attention-based Neural Machine Translation

Minh-Thang Luong Hieu Pham Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305

{lmthang,hyhieu,manning}@stanford.edu

Abstract
An attentional mechanism has lately been
used to improve neural machine transla-
tion (NMT) by selectively focusing on
parts of the source sentence during trans-
lation. However, there has been little
work exploring useful architectures for
attention-based NMT. This paper exam-
ines two simple and effective classes of at-
tentional mechanism: a global approach
which always attends to all source words
and a local one that only looks at a subset
of source words at a time. We demonstrate
the effectiveness of both approaches on the
WMT translation tasks between English
and German in both directions. With local
attention, we achieve a significant gain of
5.0 BLEU points over non-attentional sys-
tems that already incorporate known tech-
niques such as dropout. Our ensemble
model using different attention architec-
tures yields a new state-of-the-art result in
the WMT’15 English to German transla-
tion task with 25.9 BLEU points, an im-
provement of 1.0 BLEU points over the
existing best system backed by NMT and
an n-gram reranker.1

1 Introduction

Neural Machine Translation (NMT) achieved
state-of-the-art performances in large-scale trans-
lation tasks such as from English to French (Luong
et al., 2015) and English to German (Jean et al.,
2015). NMT is appealing since it requires minimal
domain knowledge and is conceptually simple.
The model by Luong et al. (2015) reads through all
the source words until the end-of-sentence symbol
is reached. It then starts emitting one tar-
get word at a time, as illustrated in Figure 1. NMT

1All our code and models are publicly available at http:
//nlp.stanford.edu/projects/nmt.

B C D X Y Z

X Y Z

A

Figure 1: Neural machine translation – a stack-
ing recurrent architecture for translating a source
sequence A B C D into a target sequence X Y
Z. Here, marks the end of a sentence.

is often a large neural network that is trained in an
end-to-end fashion and has the ability to general-
ize well to very long word sequences. This means
the model does not have to explicitly store gigantic
phrase tables and language models as in the case
of standard MT; hence, NMT has a small memory
footprint. Lastly, implementing NMT decoders is
easy unlike the highly intricate decoders in stan-
dard MT (Koehn et al., 2003).

In parallel, the concept of “attention” has gained
popularity recently in training neural networks, al-
lowing models to learn alignments between dif-
ferent modalities, e.g., between image objects
and agent actions in the dynamic control problem
(Mnih et al., 2014), between speech frames and
text in the speech recognition task (Chorowski et
al., 2014), or between visual features of a picture
and its text description in the image caption gen-
eration task (Xu et al., 2015). In the context of
NMT, Bahdanau et al. (2015) has successfully ap-
plied such attentional mechanism to jointly trans-
late and align words. To the best of our knowl-
edge, there has not been any other work exploring
the use of attention-based architectures for NMT.

In this work, we design, with simplicity and ef-
fectiveness in mind, two novel types of attention-

1412

based models: a global approach in which all
source words are attended and a local one whereby
only a subset of source words are considered at a
time. The former approach resembles the model
of (Bahdanau et al., 2015) but is simpler architec-
turally. The latter can be viewed as an interesting
blend between the hard and soft attention models
proposed in (Xu et al., 2015): it is computation-
ally less expensive than the global model or the
soft attention; at the same time, unlike the hard at-
tention, the local attention is differentiable, mak-
ing it easier to implement and train.2 Besides, we
also examine various alignment functions for our
attention-based models.

Experimentally, we demonstrate that both of
our approaches are effective in the WMT trans-
lation tasks between English and German in both
directions. Our attentional models yield a boost
of up to 5.0 BLEU over non-attentional systems
which already incorporate known techniques such
as dropout. For English to German translation,
we achieve new state-of-the-art (SOTA) results
for both WMT’14 and WMT’15, outperforming
previous SOTA systems, backed by NMT mod-
els and n-gram LM rerankers, by more than 1.0
BLEU. We conduct extensive analysis to evaluate
our models in terms of learning, the ability to han-
dle long sentences, choices of attentional architec-
tures, alignment quality, and translation outputs.

2 Neural Machine Translation

A neural machine translation system is a neural
network that directly models the conditional prob-
ability p(y|x) of translating a source sentence,
x1, . . . , xn, to a target sentence, y1, . . . , ym.3 A
basic form of NMT consists of two components:
(a) an encoder which computes a representation s
for each source sentence and (b) a decoder which
generates one target word at a time and hence de-
composes the conditional probability as:

log p(y|x) =
∑m

j=1
log p (yj|y.

cent NMT work such as (Kalchbrenner and Blun-
som, 2013; Sutskever et al., 2014; Cho et al., 2014;
Bahdanau et al., 2015; Luong et al., 2015; Jean et
al., 2015) have in common. They, however, dif-
fer in terms of which RNN architectures are used
for the decoder and how the encoder computes the
source sentence representation s.

Kalchbrenner and Blunsom (2013) used an
RNN with the standard hidden unit for the decoder
and a convolutional neural network for encoding
the source sentence representation. On the other
hand, both Sutskever et al. (2014) and Luong et
al. (2015) stacked multiple layers of an RNN with
a Long Short-Term Memory (LSTM) hidden unit
for both the encoder and the decoder. Cho et al.
(2014), Bahdanau et al. (2015), and Jean et al.
(2015) all adopted a different version of the RNN
with an LSTM-inspired hidden unit, the gated re-
current unit (GRU), for both components.4

In more detail, one can parameterize the proba-
bility of decoding each word yj as:

p (yj|y

X Y Z

A

Figure 4: Input-feeding approach – Attentional
vectors h̃t are fed as inputs to the next time steps to
inform the model about past alignment decisions.

3.3 Input-feeding Approach

In our proposed global and local approaches,
the attentional decisions are made independently,
which is suboptimal. Whereas, in standard MT,
a coverage set is often maintained during the
translation process to keep track of which source
words have been translated. Likewise, in atten-
tional NMTs, alignment decisions should be made
jointly taking into account past alignment infor-
mation. To address that, we propose an input-
feeding approach in which attentional vectors h̃t
are concatenated with inputs at the next time steps
as illustrated in Figure 4.11 The effects of hav-
ing such connections are two-fold: (a) we hope
to make the model fully aware of previous align-
ment choices and (b) we create a very deep net-
work spanning both horizontally and vertically.

Comparison to other work – Bahdanau et al.
(2015) use context vectors, similar to our ct, in
building subsequent hidden states, which can also
achieve the “coverage” effect. However, there has
not been any analysis of whether such connections
are useful as done in this work. Also, our approach
is more general; as illustrated in Figure 4, it can be
applied to general stacking recurrent architectures,
including non-attentional models.

Xu et al. (2015) propose a doubly attentional
approach with an additional constraint added to
the training objective to make sure the model pays
equal attention to all parts of the image during the
caption generation process. Such a constraint can

11If n is the number of LSTM cells, the input size of the
first LSTM layer is 2n; those of subsequent layers are n.

also be useful to capture the coverage set effect
in NMT that we mentioned earlier. However, we
chose to use the input-feeding approach since it
provides flexibility for the model to decide on any
attentional constraints it deems suitable.

4 Experiments

We evaluate the effectiveness of our models on the
WMT translation tasks between English and Ger-
man in both directions. newstest2013 (3000 sen-
tences) is used as a development set to select our
hyperparameters. Translation performances are
reported in case-sensitive BLEU (Papineni et al.,
2002) on newstest2014 (2737 sentences) and new-
stest2015 (2169 sentences). Following (Luong et
al., 2015), we report translation quality using two
types of BLEU: (a) tokenized12 BLEU to be com-
parable with existing NMT work and (b) NIST13

BLEU to be comparable with WMT results.

4.1 Training Details

All our models are trained on the WMT’14 train-
ing data consisting of 4.5M sentences pairs (116M
English words, 110M German words). Similar to
(Jean et al., 2015), we limit our vocabularies to
be the top 50K most frequent words for both lan-
guages. Words not in these shortlisted vocabular-
ies are converted into a universal token .

When training our NMT systems, following
(Bahdanau et al., 2015; Jean et al., 2015), we fil-
ter out sentence pairs whose lengths exceed 50
words and shuffle mini-batches as we proceed.
Our stacking LSTM models have 4 layers, each
with 1000 cells, and 1000-dimensional embed-
dings. We follow (Sutskever et al., 2014; Luong
et al., 2015) in training NMT with similar set-
tings: (a) our parameters are uniformly initialized
in [−0.1, 0.1], (b) we train for 10 epochs using
plain SGD, (c) a simple learning rate schedule is
employed – we start with a learning rate of 1; after
5 epochs, we begin to halve the learning rate ev-
ery epoch, (d) our mini-batch size is 128, and (e)
the normalized gradient is rescaled whenever its
norm exceeds 5. Additionally, we also use dropout
for our LSTMs as suggested by (Zaremba et al.,
2015). For dropout models, we train for 12 epochs
and start halving the learning rate after 8 epochs.

Our code is implemented in MATLAB. When

12All texts are tokenized with tokenizer.perl and
BLEU scores are computed with multi-bleu.perl.

13With the mteval-v13a script as per WMT guideline.

1416

System Ppl BLEU
Winning WMT’14 system – phrase-based + large LM (Buck et al., 2014) 20.7
Existing NMT systems
RNNsearch (Jean et al., 2015) 16.5
RNNsearch + unk replace (Jean et al., 2015) 19.0
RNNsearch + unk replace + large vocab + ensemble 8 models (Jean et al., 2015) 21.6
Our NMT systems
Base 10.6 11.3
Base + reverse 9.9 12.6 (+1.3)
Base + reverse + dropout 8.1 14.0 (+1.4)
Base + reverse + dropout + global attention (location) 7.3 16.8 (+2.8)
Base + reverse + dropout + global attention (location) + feed input 6.4 18.1 (+1.3)
Base + reverse + dropout + local-p attention (general) + feed input

5.9
19.0 (+0.9)

Base + reverse + dropout + local-p attention (general) + feed input + unk replace 20.9 (+1.9)
Ensemble 8 models + unk replace 23.0 (+2.1)

Table 1: WMT’14 English-German results – shown are the perplexities (ppl) and the tokenized BLEU
scores of various systems on newstest2014. We highlight the best system in bold and give progressive
improvements in italic between consecutive systems. local-p referes to the local attention with predictive
alignments. We indicate for each attention model the alignment score function used in pararentheses.

running on a single GPU device Tesla K40, we
achieve a speed of 1K target words per second.
It takes 7–10 days to completely train a model.

4.2 English-German Results
We compare our NMT systems in the English-
German task with various other systems. These
include the winning system in WMT’14 (Buck et
al., 2014), a phrase-based system whose language
models were trained on a huge monolingual text,
the Common Crawl corpus. For end-to-end neu-
ral machine translation systems, to the best of our
knowledge, (Jean et al., 2015) is the only work ex-
perimenting with this language pair and currently
the SOTA system. We only present results for
some of our attention models and will later ana-
lyze the rest in Section 5.

As shown in Table 1, we achieve progressive
improvements when (a) reversing the source sen-
tence, +1.3 BLEU, as proposed in (Sutskever et
al., 2014) and (b) using dropout, +1.4 BLEU. On
top of that, (c) the global attention approach gives
a significant boost of +2.8 BLEU, making our
model slightly better than the base attentional sys-
tem of Bahdanau et al. (2015) (row RNNSearch).
When (d) using the input-feeding approach, we
seize another notable gain of +1.3 BLEU and out-
perform their system. The local attention model
with predictive alignments (row local-p) proves
to be even better, giving us a further improve-
ment of +0.9 BLEU on top of the global attention

model. It is interesting to observe the trend pre-
viously reported in (Luong et al., 2015) that per-
plexity strongly correlates with translation quality.
In total, we achieve a significant gain of 5.0 BLEU
points over the non-attentional baseline, which al-
ready includes known techniques such as source
reversing and dropout.

The unknown replacement technique proposed
in (Luong et al., 2015; Jean et al., 2015) yields
another nice gain of +1.9 BLEU, demonstrating
that our attentional models do learn useful align-
ments for unknown works. Finally, by ensembling
8 different models of various settings, e.g., using
different attention approaches, with and without
dropout etc., we were able to achieve a new SOTA
result of 23.0 BLEU, outperforming the existing
best system (Jean et al., 2015) by +1.4 BLEU.

System BLEU
SOTA – NMT + 5-gram rerank (MILA) 24.9
Our ensemble 8 models + unk replace 25.9

Table 2: WMT’15 English-German results –
NIST BLEU scores of the existing WMT’15 SOTA
system and our best one on newstest2015.

Latest results in WMT’15 – despite the fact that
our models were trained on WMT’14 with slightly
less data, we test them on newstest2015 to demon-
strate that they can generalize well to different test
sets. As shown in Table 2, our best system es-

1417

System Ppl. BLEU
WMT’15 systems
SOTA – phrase-based (Edinburgh) 29.2
NMT + 5-gram rerank (MILA) 27.6
Our NMT systems
Base (reverse) 14.3 16.9
+ global (location) 12.7 19.1 (+2.2)
+ global (location) + feed 10.9 20.1 (+1.0)
+ global (dot) + drop + feed

9.7
22.8 (+2.7)

+ global (dot) + drop + feed + unk 24.9 (+2.1)

Table 3: WMT’15 German-English results –
performances of various systems (similar to Ta-
ble 1). The base system already includes source
reversing on which we add global attention,
dropout, input feeding, and unk replacement.

tablishes a new SOTA performance of 25.9 BLEU,
outperforming the existing best system backed by
NMT and a 5-gram LM reranker by +1.0 BLEU.

4.3 German-English Results

We carry out a similar set of experiments for the
WMT’15 translation task from German to En-
glish. While our systems have not yet matched
the performance of the SOTA system, we never-
theless show the effectiveness of our approaches
with large and progressive gains in terms of BLEU
as illustrated in Table 3. The attentional mech-
anism gives us +2.2 BLEU gain and on top of
that, we obtain another boost of up to +1.0 BLEU
from the input-feeding approach. Using a better
alignment function, the content-based dot product
one, together with dropout yields another gain of
+2.7 BLEU. Lastly, when applying the unknown
word replacement technique, we seize an addi-
tional +2.1 BLEU, demonstrating the usefulness
of attention in aligning rare words.

5 Analysis

We conduct extensive analysis to better understand
our models in terms of learning, the ability to han-
dle long sentences, choices of attentional archi-
tectures, and alignment quality. All models con-
sidered here are English-German NMT systems
tested on newstest2014.

5.1 Learning curves

We compare models built on top of one another as
listed in Table 1. It is pleasant to observe in Fig-
ure 5 a clear separation between non-attentional
and attentional models. The input-feeding ap-

0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8
x 10

5

2

3

4

5

6

Mini−batches

T
e

st
c

o
st

basic
basic+reverse
basic+reverse+dropout
basic+reverse+dropout+globalAttn
basic+reverse+dropout+globalAttn+feedInput
basic+reverse+dropout+pLocalAttn+feedInput

Figure 5: Learning curves – test cost (ln perplex-
ity) on newstest2014 for English-German NMTs
as training progresses.

proach and the local attention model also demon-
strate their abilities in driving the test costs lower.
The non-attentional model with dropout (the blue
+ curve) learns slower than other non-dropout
models, but as time goes by, it becomes more ro-
bust in terms of minimizing test errors.

5.2 Effects of Translating Long Sentences

We follow (Bahdanau et al., 2015) to group sen-
tences of similar lengths together and compute a
BLEU score per group. As demonstrated in Fig-
ure 6, our attentional models are more effective
than the other non-attentional model in handling
long sentences: the translation quality does not de-
grade as sentences become longer. Our best model
(the blue + curve) outperforms all other systems in
all length buckets.

10 20 30 40 50 60 70

10

15

20

25

Sent Lengths

B
L

E
U

ours, no attn (BLEU 13.9)
ours, local−p attn (BLEU 20.9)
ours, best system (BLEU 23.0)
WMT’14 best (BLEU 20.7)
Jeans et al., 2015 (BLEU 21.6)

Figure 6: Length Analysis – translation qualities
of different systems as sentences become longer.

5.3 Choices of Attentional Architectures

We examine different attention models (global,
local-m, local-p) and different alignment func-
tions (location, dot, general, concat) as described
in Section 3. Due to limited resources, we can-
not run all the possible combinations. However,

1418

System Ppl
BLEU

Before After unk
global (location) 6.4 18.1 19.3 (+1.2)
global (dot) 6.1 18.6 20.5 (+1.9)
global (general) 6.1 17.3 19.1 (+1.8)
local-m (dot) >7.0 x x
local-m (general) 6.2 18.6 20.4 (+1.8)
local-p (dot) 6.6 18.0 19.6 (+1.9)
local-p (general) 5.9 19 20.9 (+1.9)

Table 4: Attentional Architectures – perfor-
mances of different attentional models. We trained
two local-m (dot) models; both have ppl > 7.0.

results in Table 4 do give us some idea about dif-
ferent choices. The location-based function does
not learn good alignments: the global (location)
model can only obtain a small gain when perform-
ing unknown word replacement compared to using
other alignment functions.14 For content-based
functions, our implementation of concat does not
yield good performances and more analysis should
be done to understand the reason.15 It is interest-
ing to observe that dot works well for the global
attention and general is better for the local atten-
tion. Among the different models, the local atten-
tion model with predictive alignments (local-p) is
best, both in terms of perplexities and BLEU.

5.4 Alignment Quality

A by-product of attentional models are word align-
ments. While (Bahdanau et al., 2015) visualized
alignments for some sample sentences and ob-
served gains in translation quality as an indica-
tion of a working attention model, no work has as-
sessed the alignments learned as a whole. In con-
trast, we set out to evaluate the alignment quality
using the alignment error rate (AER) metric.

Given the gold alignment data provided by
RWTH for 508 English-German Europarl sen-
tences, we “force” decode our attentional models
to produce translations that match the references.
We extract only one-to-one alignments by select-
ing the source word with the highest alignment

14There is a subtle difference in how we retrieve align-
ments for the different alignment functions. At time step t in
which we receive yt−1 as input and then compute ht,at, ct,
and h̃t before predicting yt, the alignment vector at is used
as alignment weights for (a) the predicted word yt in the
location-based alignment functions and (b) the input word
yt−1 in the content-based functions.

15With concat, the perplexities achieved by different mod-
els are 6.7 (global), 7.1 (local-m), and 7.1 (local-p).

Method AER
global (location) 0.39
local-m (general) 0.34
local-p (general) 0.36

ensemble 0.34
Berkeley Aligner 0.32

Table 6: AER scores – results of various models
on the RWTH English-German alignment data.

weight per target word. Nevertheless, as shown in
Table 6, we were able to achieve AER scores com-
parable to the one-to-many alignments obtained
by the Berkeley aligner (Liang et al., 2006).16

We also found that the alignments produced by
local attention models achieve lower AERs than
those of the global one. The AER obtained by
the ensemble, while good, is not better than the
local-m AER, suggesting the well-known observa-
tion that AER and translation scores are not well
correlated (Fraser and Marcu, 2007). Due to space
constraint, we can only show alignment visualiza-
tions in the arXiv version of our paper.17

5.5 Sample Translations
We show in Table 5 sample translations in both
directions. It it appealing to observe the ef-
fect of attentional models in correctly translat-
ing names such as “Miranda Kerr” and “Roger
Dow”. Non-attentional models, while producing
sensible names from a language model perspec-
tive, lack the direct connections from the source
side to make correct translations.

We also observed an interesting case in the
second English-German example, which requires
translating the doubly-negated phrase, “not in-
compatible”. The attentional model correctly
produces “nicht . . . unvereinbar”; whereas the
non-attentional model generates “nicht vereinbar”,
meaning “not compatible”.18 The attentional
model also demonstrates its superiority in trans-
lating long sentences as in the last example.

6 Conclusion

In this paper, we propose two simple and effec-
tive attentional mechanisms for neural machine

16We concatenate the 508 sentence pairs with 1M sentence
pairs from WMT and run the Berkeley aligner.

17http://arxiv.org/abs/1508.04025
18The reference uses a more fancy translation of “incom-

patible”, which is “im Widerspruch zu etwas stehen”. Both
models, however, failed to translate “passenger experience”.

1419

English-German translations
src Orlando Bloom and Miranda Kerr still love each other
ref Orlando Bloom und Miranda Kerr lieben sich noch immer

best Orlando Bloom und Miranda Kerr lieben einander noch immer .
base Orlando Bloom und Lucas Miranda lieben einander noch immer .
src ′′ We ′ re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible

with safety and security , ′′ said Roger Dow , CEO of the U.S. Travel Association .
ref “ Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Wider-

spruch zur Sicherheit steht ” , sagte Roger Dow , CEO der U.S. Travel Association .
best ′′ Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und

Sicherheit unvereinbar ist ′′ , sagte Roger Dow , CEO der US – die .
base ′′ Wir freuen uns über die , dass ein mit Sicherheit nicht vereinbar ist mit

Sicherheit und Sicherheit ′′ , sagte Roger Cameron , CEO der US – .
German-English translations
src In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
ref However , in an interview , Bloom has said that he and Kerr still love each other .

best In an interview , however , Bloom said that he and Kerr still love .
base However , in an interview , Bloom said that he and Tina were still .
src Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in

Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhal-
ten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt
Europa sei zu weit gegangen

ref The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket
imposed on national economies through adherence to the common currency , has led many people
to think Project Europe has gone too far .

best Because of the strict austerity measures imposed by Berlin and the European Central Bank in
connection with the straitjacket in which the respective national economy is forced to adhere to
the common currency , many people believe that the European project has gone too far .

base Because of the pressure imposed by the European Central Bank and the Federal Central Bank
with the strict austerity imposed on the national economy in the face of the single currency ,
many people believe that the European project has gone too far .

Table 5: Sample translations – for each example, we show the source (src), the human translation (ref),
the translation from our best model (best), and the translation of a non-attentional model (base). We
italicize some correct translation segments and highlight a few wrong ones in bold.

translation: the global approach which always
looks at all source positions and the local one
that only attends to a subset of source positions
at a time. We test the effectiveness of our mod-
els in the WMT translation tasks between En-
glish and German in both directions. Our local
attention yields large gains of up to 5.0 BLEU
over non-attentional models that already incorpo-
rate known techniques such as dropout. For the
English to German translation direction, our en-
semble model has established new state-of-the-art
results for both WMT’14 and WMT’15.

We have compared various alignment functions
and shed light on which functions are best for
which attentional models. Our analysis shows that
attention-based NMT models are superior to non-
attentional ones in many cases, for example in

translating names and handling long sentences.

Acknowledgment

We gratefully acknowledge support from a gift
from Bloomberg L.P. and the support of NVIDIA
Corporation with the donation of Tesla K40 GPUs.
We thank Andrew Ng and his group as well as
the Stanford Research Computing for letting us
use their computing resources. We thank Rus-
sell Stewart for helpful discussions on the models.
Lastly, we thank Quoc Le, Ilya Sutskever, Oriol
Vinyals, Richard Socher, Michael Kayser, Jiwei
Li, Panupong Pasupat, Kelvin Gu, members of the
Stanford NLP Group and the annonymous review-
ers for their valuable comments and feedback.

1420

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-

gio. 2015. Neural machine translation by jointly
learning to align and translate. In ICLR.

Christian Buck, Kenneth Heafield, and Bas van Ooyen.
2014. N-gram counts and language models from the
common crawl. In LREC.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-
cehre, Fethi Bougares, Holger Schwenk, and Yoshua
Bengio. 2014. Learning phrase representations
using RNN encoder-decoder for statistical machine
translation. In EMNLP.

Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho,
and Yoshua Bengio. 2014. End-to-end continuous
speech recognition using attention-based recurrent
NN: first results. CoRR, abs/1412.1602.

Alexander Fraser and Daniel Marcu. 2007. Measuring
word alignment quality for statistical machine trans-
lation. Computational Linguistics, 33(3):293–303.

Karol Gregor, Ivo Danihelka, Alex Graves,
Danilo Jimenez Rezende, and Daan Wierstra.
2015. DRAW: A recurrent neural network for
image generation. In ICML.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic,
and Yoshua Bengio. 2015. On using very large tar-
get vocabulary for neural machine translation. In
ACL.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent
continuous translation models. In EMNLP.

Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In
NAACL.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Align-
ment by agreement. In NAACL.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol
Vinyals, and Wojciech Zaremba. 2015. Addressing
the rare word problem in neural machine translation.
In ACL.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Ko-
ray Kavukcuoglu. 2014. Recurrent models of visual
attention. In NIPS.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei
jing Zhu. 2002. Bleu: a method for automatic eval-
uation of machine translation. In ACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
Sequence to sequence learning with neural net-
works. In NIPS.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun
Cho, Aaron C. Courville, Ruslan Salakhutdinov,
Richard S. Zemel, and Yoshua Bengio. 2015. Show,
attend and tell: Neural image caption generation
with visual attention. In ICML.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals.
2015. Recurrent neural network regularization. In
ICLR.

1421