Minimum Risk Training for Neural Machine Translation
Shiqi Shen†, Yong Cheng#, Zhongjun He+, Wei He+, Hua Wu+, Maosong Sun†, Yang Liu†∗
†State Key Laboratory of Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University, Beijing, China
#Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
+Baidu Inc., Beijing, China
{vicapple22, chengyong3001}@gmail.com, {hezhongjun, hewei06, wu hua}@baidu.com, {sms, liuyang2011}@tsinghua.edu.cn
∗Corresponding author: Yang Liu.
Abstract
We propose minimum risk training for
end-to-end neural machine translation.
Unlike conventional maximum likelihood
estimation, minimum risk training is ca-
pable of optimizing model parameters di-
rectly with respect to arbitrary evaluation
metrics, which are not necessarily differ-
entiable. Experiments show that our ap-
proach achieves significant improvements
over maximum likelihood estimation on a
state-of-the-art neural machine translation
system across various language pairs. Being transparent to architectures, our approach can be applied to other neural networks and can potentially benefit other NLP tasks.
1 Introduction
Recently, end-to-end neural machine transla-
tion (NMT) (Kalchbrenner and Blunsom, 2013;
Sutskever et al., 2014; Bahdanau et al., 2015)
has attracted increasing attention from the com-
munity. Providing a new paradigm for machine
translation, NMT aims at training a single, large
neural network that directly transforms a source-
language sentence to a target-language sentence
without explicitly modeling latent structures (e.g.,
word alignment, phrase segmentation, phrase re-
ordering, and SCFG derivation) that are vital in
conventional statistical machine translation (SMT)
(Brown et al., 1993; Koehn et al., 2003; Chiang,
2005).
Current NMT models are based on the encoder-
decoder framework (Cho et al., 2014; Sutskever
et al., 2014), with an encoder to read and encode
a source-language sentence into a vector, from
which a decoder generates a target-language sen-
tence. While early efforts encode the input into a
fixed-length vector, Bahdanau et al. (2015) advo-
cate the attention mechanism to dynamically gen-
erate a context vector for a target word being gen-
erated.
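For reference, given encoder annotations $\mathbf{h}_1, \ldots, \mathbf{h}_M$ of the source sentence, the attention mechanism computes the context vector for the $n$-th target word as a weighted sum of the annotations. This is the standard formulation of Bahdanau et al. (2015), reproduced here for completeness rather than taken from an equation in this paper:

$$\mathbf{c}_n = \sum_{m=1}^{M} \alpha_{nm}\,\mathbf{h}_m, \qquad \alpha_{nm} = \frac{\exp(e_{nm})}{\sum_{m'=1}^{M}\exp(e_{nm'})},$$

where $e_{nm}$ scores how well the decoder state before emitting the $n$-th target word matches the source annotation $\mathbf{h}_m$.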
Although NMT models have achieved results on
par with or better than conventional SMT, they still
suffer from a major drawback: the models are op-
timized to maximize the likelihood of training data
instead of evaluation metrics that actually quantify
translation quality. Ranzato et al. (2015) indicate
two drawbacks of maximum likelihood estimation
(MLE) for NMT. First, the models are only ex-
posed to the training distribution instead of model
predictions. Second, the loss function is defined at
the word level instead of the sentence level.
In this work, we introduce minimum risk train-
ing (MRT) for neural machine translation. The
new training objective is to minimize the expected
loss (i.e., risk) on the training data. MRT has the
following advantages over MLE:
1. Direct optimization with respect to evaluation metrics: MRT introduces evaluation metrics as loss functions and aims to minimize expected loss on the training data (see the formulation sketched after this list).
2. Applicable to arbitrary loss functions: our
approach allows arbitrary sentence-level loss
functions, which are not necessarily differen-
tiable.
3. Transparent to architectures: MRT does not
assume the specific architectures of NMT and
can be applied to any end-to-end NMT sys-
tems.
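For concreteness, the risk that MRT minimizes over a training set $\{(\mathbf{x}^{(s)}, \mathbf{y}^{(s)})\}_{s=1}^{S}$ can be written out as below; this is simply the expected-loss description above made explicit, where $\Delta(\mathbf{y}, \mathbf{y}^{(s)})$ is a sentence-level loss such as negative sentence-level BLEU and $\mathcal{Y}(\mathbf{x}^{(s)})$ is the set of all possible translations of $\mathbf{x}^{(s)}$:

$$R(\boldsymbol{\theta}) = \sum_{s=1}^{S} \mathbb{E}_{\mathbf{y}|\mathbf{x}^{(s)};\boldsymbol{\theta}}\bigl[\Delta(\mathbf{y}, \mathbf{y}^{(s)})\bigr] = \sum_{s=1}^{S} \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(s)})} P(\mathbf{y}|\mathbf{x}^{(s)};\boldsymbol{\theta})\,\Delta(\mathbf{y}, \mathbf{y}^{(s)}).$$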
While MRT has been widely used in conven-
tional SMT (Och, 2003; Smith and Eisner, 2006;
He and Deng, 2012) and deep learning based MT
(Gao et al., 2014), to the best of our knowledge,
this work is the first effort to introduce MRT
into end-to-end NMT. Experiments on a variety of
language pairs (Chinese-English, English-French,
and English-German) show that MRT leads to sig-
nificant improvements over MLE on a state-of-
the-art NMT system (Bahdanau et al., 2015).
2 Background
Given a source sentence $\mathbf{x} = x_1, \ldots, x_m, \ldots, x_M$ and a target sentence $\mathbf{y} = y_1, \ldots, y_n, \ldots, y_N$, end-to-end NMT directly models the translation probability:

$$P(\mathbf{y}|\mathbf{x};\boldsymbol{\theta}) = \prod_{n=1}^{N} P(y_n|\mathbf{x}, \mathbf{y}_{<n};\boldsymbol{\theta}),$$

where $\boldsymbol{\theta}$ is a set of model parameters and $\mathbf{y}_{<n} = y_1, \ldots, y_{n-1}$ denotes a partial translation.
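This factorization makes the word-level nature of the MLE objective (one of the drawbacks noted in Section 1) easy to see: the sentence log-probability is a sum of per-word conditional log-probabilities, and MLE minimizes its negation. A minimal sketch, where `step_logprob` is a hypothetical stand-in for the decoder of a trained model and not an interface defined in this paper:

```python
def sentence_logprob(step_logprob, x, y):
    """log P(y|x; theta) = sum_n log P(y_n | x, y_<n; theta).

    `step_logprob(x, prefix, token)` is a hypothetical callable standing in
    for the decoder of a trained NMT model; it is not part of this paper.
    """
    total = 0.0
    for n in range(len(y)):
        total += step_logprob(x, y[:n], y[n])  # log P(y_n | x, y_<n)
    return total


def mle_loss(step_logprob, x, y):
    """Word-level negative log-likelihood minimized by MLE training."""
    return -sentence_logprob(step_logprob, x, y)
```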
Table 1 shows four models. As model 1 (column 3) ranks the candidates in the reverse order of the gold-standard (i.e., y2 > y3 > y1), it obtains the highest risk of −0.50. Model 2 (column 4), which correlates better with the gold-standard by predicting y3 > y1 > y2, reduces the risk to −0.61. As model 3 (column 5) ranks the candidates in the same order as the gold-standard, the risk goes down to −0.71. The risk can be further reduced by concentrating the probability mass on y1 (column 6). As a result, by minimizing the risk on the training data, we expect to obtain a model that correlates well with the gold-standard.
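The effect described above can be reproduced with a tiny numeric example. The probabilities and losses below are made up for illustration (they are not the values in Table 1); the point is only that shifting probability mass onto the lowest-loss candidate lowers the expected risk:

```python
# Toy illustration of the expected-risk comparison discussed above.
# Losses play the role of Delta(y, y_ref), e.g. negative sentence-level BLEU.
def expected_risk(probs, losses):
    """Risk = sum_y P(y|x) * Delta(y, y_ref) over candidate translations."""
    assert abs(sum(probs) - 1.0) < 1e-6
    return sum(p * l for p, l in zip(probs, losses))


losses = [-1.0, -0.3, -0.5]           # y1 is the best candidate (lowest loss)
spread = [0.3, 0.4, 0.3]              # probability mass spread over candidates
peaked = [0.8, 0.1, 0.1]              # mass concentrated on the best candidate

print(expected_risk(spread, losses))  # -0.57
print(expected_risk(peaked, losses))  # -0.88 (lower risk)
```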
In MRT, the partial derivative with respect to a model parameter $\theta_i$ is given by

$$\frac{\partial R(\boldsymbol{\theta})}{\partial \theta_i} = \sum_{s=1}^{S} \mathbb{E}_{\mathbf{y}|\mathbf{x}^{(s)};\boldsymbol{\theta}}\left[ \Delta(\mathbf{y}, \mathbf{y}^{(s)}) \times \sum_{n=1}^{N^{(s)}} \frac{\partial P(y_n|\mathbf{x}^{(s)}, \mathbf{y}_{<n};\boldsymbol{\theta}) / \partial \theta_i}{P(y_n|\mathbf{x}^{(s)}, \mathbf{y}_{<n};\boldsymbol{\theta})} \right].$$
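In practice the expectation over all possible translations is intractable, and it is approximated by sampling a subset of candidate translations and renormalizing their probabilities, with a sharpness parameter α controlling the smoothness of the resulting distribution (α is mentioned in Section 5; sampling from the full search space is mentioned in Section 6). The sketch below is a reconstruction under those descriptions rather than the paper's exact equation; it assumes the candidate translations have already been sampled and scored:

```python
import math


def smoothed_sampled_risk(logprobs, losses, alpha):
    """Approximate expected risk over a set of sampled candidate translations.

    logprobs : list of log P(y | x; theta), one per sampled candidate y
    losses   : list of Delta(y, y_ref), e.g. negative sentence-level BLEU
    alpha    : sharpness of the renormalized distribution over the sample
    """
    # Q(y | x) is proportional to P(y | x)^alpha, renormalized over the sample.
    scaled = [alpha * lp for lp in logprobs]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    z = sum(weights)
    return sum((w / z) * loss for w, loss in zip(weights, losses))
```

Because the sentence-level loss enters only as a weight on the renormalized probabilities, gradients flow through the model log-probabilities alone, which is why non-differentiable metrics such as BLEU or TER can be plugged in directly.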
Evaluator | MLE < MRT | MLE = MRT | MLE > MRT
evaluator 1 | 54% | 24% | 22%
evaluator 2 | 53% | 22% | 25%
Table 5: Subjective evaluation of MLE and MRT on Chinese-English translation. Columns show how often an MLE translation was judged worse than, equal to, or better than the corresponding MRT translation.
We also compared the BLEU scores of RNNSEARCH with MLE and RNNSEARCH with MRT on the Chinese-English test set with respect to input sentence lengths. While MRT consistently improves over MLE for all lengths, its translation performance degrades for sentences longer than 60 words.
One reason is that RNNSEARCH tends to produce short translations for long sentences. As shown in Figure 5, both MLE and MRT generate much shorter translations than MOSES. This results from the length limit imposed by RNNSEARCH for efficiency reasons: no sentence in the training set is longer than 50 words. This limit degrades translation performance because the sentences in the test set are usually longer than 50 words.
4.6.4 Subjective Evaluation
We also conducted a subjective evaluation to vali-
date the benefit of replacing MLE with MRT. Two
human evaluators were asked to compare MLE
and MRT translations of 100 source sentences ran-
domly sampled from the test sets without know-
ing from which system a candidate translation was
generated.
Table 5 shows the results of the subjective evaluation. The two human evaluators made similar judgements: around 54% of MLE translations are worse than their MRT counterparts, around 23% are equal, and around 23% are better.
4.6.5 Example Translations
Table 6 shows some example translations. We find that MOSES incorrectly translates the Chinese string "yi wei fuze yu pingrang dangju da jiaodao de qian guowuyuan guanyuan", which requires long-distance reordering, a notorious challenge for statistical machine translation. In
contrast, RNNSEARCH-MLE seems to overcome
this problem in this example thanks to the capa-
bility of gated RNNs to capture long-distance de-
pendencies. However, as MLE uses a loss func-
tion defined only at the word level, its translation
lacks sentence-level consistency: “chinese” oc-
curs twice while “two senate” is missing. By opti-
mizing model parameters directly with respect to
sentence-level BLEU, RNNSEARCH-MRT seems
to be able to generate translations more consis-
tently at the sentence level.
4.7 Results on English-French Translation
Table 7 shows the results on English-French trans-
lation. We list existing end-to-end NMT systems
that are comparable to our system. All these sys-
tems use the same subset of the WMT 2014 train-
ing corpus and adopt MLE as the training crite-
rion. They differ in network architectures and vo-
cabulary sizes. Our RNNSEARCH-MLE system
achieves a BLEU score comparable to that of Jean
et al. (2015). RNNSEARCH-MRT achieves the
highest BLEU score in this setting even with a vo-
cabulary size smaller than Luong et al. (2015b)
and Sutskever et al. (2014). Note that our ap-
proach does not assume specific architectures and
can in principle be applied to any NMT systems.
4.8 Results on English-German Translation
Table 8 shows the results on English-German
translation. Our approach still significantly out-
Source meiguo daibiao tuan baokuo laizi shidanfu daxue de yi wei zhongguo
zhuanjia , liang ming canyuan waijiao zhengce zhuli yiji yi wei fuze yu
pingrang dangju da jiaodao de qian guowuyuan guanyuan .
Reference the us delegation consists of a chinese expert from the stanford university
, two senate foreign affairs policy assistants and a former state department
official who was in charge of dealing with pyongyang authority .
MOSES the united states to members of the delegation include representatives from
the stanford university , a chinese expert , two assistant senate foreign policy
and a responsible for dealing with pyongyang before the officials of the state
council .
RNNSEARCH-MLE the us delegation comprises a chinese expert from stanford university , a
chinese foreign office assistant policy assistant and a former official who is
responsible for dealing with the pyongyang authorities .
RNNSEARCH-MRT the us delegation included a chinese expert from the stanford university ,
two senate foreign policy assistants , and a former state department official
who had dealings with the pyongyang authorities .
Table 6: Example Chinese-English translations. “Source” is a romanized Chinese sentence, “Refer-
ence” is a gold-standard translation. “MOSES” and “RNNSEARCH-MLE” are baseline SMT and NMT
systems. “RNNSEARCH-MRT” is our system.
System | Architecture | Training | Vocab | BLEU
Existing end-to-end NMT systems
Bahdanau et al. (2015) | gated RNN with search | MLE | 30K | 28.45
Jean et al. (2015) | gated RNN with search | MLE | 30K | 29.97
Jean et al. (2015) | gated RNN with search + PosUnk | MLE | 30K | 33.08
Luong et al. (2015b) | LSTM with 4 layers | MLE | 40K | 29.50
Luong et al. (2015b) | LSTM with 4 layers + PosUnk | MLE | 40K | 31.80
Luong et al. (2015b) | LSTM with 6 layers | MLE | 40K | 30.40
Luong et al. (2015b) | LSTM with 6 layers + PosUnk | MLE | 40K | 32.70
Sutskever et al. (2014) | LSTM with 4 layers | MLE | 80K | 30.59
Our end-to-end NMT systems
this work | gated RNN with search | MLE | 30K | 29.88
this work | gated RNN with search | MRT | 30K | 31.30
this work | gated RNN with search + PosUnk | MRT | 30K | 34.23
Table 7: Comparison with previous work on English-French translation. The BLEU scores are case-sensitive. "PosUnk" denotes Luong et al. (2015b)'s technique of handling rare words.
System | Architecture | Training | BLEU
Existing end-to-end NMT systems
Jean et al. (2015) | gated RNN with search | MLE | 16.46
Jean et al. (2015) | gated RNN with search + PosUnk | MLE | 18.97
Jean et al. (2015) | gated RNN with search + LV + PosUnk | MLE | 19.40
Luong et al. (2015a) | LSTM with 4 layers + dropout + local att. + PosUnk | MLE | 20.90
Our end-to-end NMT systems
this work | gated RNN with search | MLE | 16.45
this work | gated RNN with search | MRT | 18.02
this work | gated RNN with search + PosUnk | MRT | 20.45
Table 8: Comparison with previous work on English-German translation. The BLEU scores are case-sensitive.
performs MLE and achieves results comparable to state-of-the-art systems, even though Luong et al. (2015a) use a much deeper neural network. We believe that our approach can be easily applied to their architecture.
Despite these significant improvements, the margins on the English-German and English-French datasets are much smaller than on Chinese-English. We conjecture that there are two possible reasons. First, the Chinese-English datasets contain four reference translations for each sentence, while the English-French and English-German datasets only have single references. Second, Chinese and English are more distantly related than English, French, and German, and thus benefit more from MRT, which incorporates evaluation metrics into optimization to capture structural divergence.
5 Related Work
Our work originated from the minimum risk train-
ing algorithms in conventional statistical machine
translation (Och, 2003; Smith and Eisner, 2006;
He and Deng, 2012). Och (2003) describes a
smoothed error count to allow calculating gradi-
ents, which directly inspires us to use a param-
eter α to adjust the smoothness of the objective
function. As neural networks are non-linear, our
approach has to minimize the expected loss on
the sentence level rather than the loss of 1-best
translations on the corpus level. Smith and Eisner (2006) introduce minimum risk annealing for training log-linear models, which gradually anneals the objective to focus on the 1-best hypothesis. He and Deng (2012) apply minimum risk training
to learning phrase translation probabilities. Gao
et al. (2014) leverage MRT for learning continu-
ous phrase representations for statistical machine
translation. The difference is that they use MRT
to optimize a sub-model of SMT while we are in-
terested in directly optimizing end-to-end neural
translation models.
The Mixed Incremental Cross-Entropy Rein-
force (MIXER) algorithm (Ranzato et al., 2015)
is in spirit closest to our work. Building on
the REINFORCE algorithm proposed by Williams
(1992), MIXER allows incremental learning and the use of a hybrid loss function that combines REINFORCE and cross-entropy. The major dif-
ference is that Ranzato et al. (2015) leverage rein-
forcement learning while our work resorts to mini-
mum risk training. In addition, MIXER only sam-
ples one candidate to calculate reinforcement re-
ward while MRT generates multiple samples to
calculate the expected risk. Figure 2 indicates that using multiple samples potentially increases MRT's capability of discriminating between diverse candidates and thus benefits translation quality. Our ex-
periments confirm Ranzato et al. (2015)’s finding
that taking evaluation metrics into account when
optimizing model parameters does help to improve
sentence-level text generation.
More recently, our approach has been suc-
cessfully applied to summarization (Ayana et al.,
2016). They optimize neural networks for head-
line generation with respect to ROUGE (Lin,
2004) and also achieve significant improvements,
confirming the effectiveness and applicability of
our approach.
6 Conclusion
In this paper, we have presented a framework for
minimum risk training in end-to-end neural ma-
chine translation. The basic idea is to minimize
the expected loss in terms of evaluation metrics
on the training data. To improve efficiency, we approximate the posterior distribution by sampling from the full search space. Experiments show that MRT
leads to significant improvements over maximum
likelihood estimation for neural machine trans-
lation, especially for distantly-related languages
such as Chinese and English.
In the future, we plan to test our approach on
more language pairs and more end-to-end neural
MT systems. It is also interesting to extend mini-
mum risk training to minimum risk annealing fol-
lowing Smith and Eisner (2006). As our approach
is transparent to loss functions and architectures,
we believe that it will also benefit more end-to-end
neural architectures for other NLP tasks.
Acknowledgments
This work was done while Shiqi Shen and Yong
Cheng were visiting Baidu. Maosong Sun and
Hua Wu are supported by the 973 Program
(2014CB340501 & 2014CB34505). Yang Liu is
supported by the National Natural Science Foun-
dation of China (No.61522204 and No.61432013)
and the 863 Program (2015AA011808). This re-
search is also supported by the Singapore National
Research Foundation under its International Re-
search Centre@Singapore Funding Initiative and
administered by the IDM Programme.
References
[Ayana et al.2016] Ayana, Shiqi Shen, Zhiyuan Liu,
and Maosong Sun. 2016. Neural headline genera-
tion with minimum risk training. arXiv:1604.01904.
[Bahdanau et al.2015] Dzmitry Bahdanau, KyungHyun
Cho, and Yoshua Bengio. 2015. Neural machine
translation by jointly learning to align and translate.
In Proceedings of ICLR.
[Brown et al.1993] Peter F. Brown, Stephen A.
Della Pietra, Vincent J. Della Pietra, and Robert L.
Mercer. 1993. The mathematics of statisti-
cal machine translation: Parameter estimation.
Computational Linguistics.
[Chiang2005] David Chiang. 2005. A hierarchical
phrase-based model for statistical machine transla-
tion. In Proceedings of ACL.
[Cho et al.2014] Kyunghyun Cho, Bart van Merrien-
boer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio.
2014. Learning phrase representations using rnn
encoder-decoder for statistical machine translation.
In Proceedings of EMNLP.
[Doddington2002] George Doddington. 2002. Auto-
matic evaluation of machine translation quality us-
ing n-gram co-occurrence statistics. In Proceedings
of HLT.
[Gao et al.2014] Jianfeng Gao, Xiaodong He, Wen tao
Yih, and Li Deng. 2014. Learning continuous
phrase representations for translation modeling. In
Proceedings of ACL.
[Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat,
Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-
Chi Lin, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2015. On using mono-
lingual corpora in neural machine translation.
arXiv:1503.03535.
[He and Deng2012] Xiaodong He and Li Deng. 2012.
Maximum expected bleu training of phrase and lex-
icon translation models. In Proceedings of ACL.
[Jean et al.2015] Sebastien Jean, Kyunghyun Cho,
Roland Memisevic, and Yoshua Bengio. 2015. On
using very large target vocabulary for neural ma-
chine translation. In Proceedings of ACL.
[Kalchbrenner and Blunsom2013] Nal Kalchbrenner
and Phil Blunsom. 2013. Recurrent continuous
translation models. In Proceedings of EMNLP.
[Koehn and Hoang2007] Philipp Koehn and Hieu
Hoang. 2007. Factored translation models. In
Proceedings of EMNLP.
[Koehn et al.2003] Philipp Koehn, Franz J. Och, and
Daniel Marcu. 2003. Statistical phrase-based trans-
lation. In Proceedings of HLT-NAACL.
[Lavie and Denkowski2009] Alon Lavie and Michael
Denkowski. 2009. The METEOR metric for automatic
evaluation of machine translation. Machine Trans-
lation.
[Lin2004] Chin-Yew Lin. 2004. Rouge: A package for
automatic evaluation of summaries. In Proceedings
of ACL.
[Luong et al.2015a] Minh-Thang Luong, Hieu Pham,
and Christopher D Manning. 2015a. Effective ap-
proaches to attention-based neural machine transla-
tion. In Proceedings of EMNLP.
[Luong et al.2015b] Minh-Thang Luong, Ilya
Sutskever, Quoc V. Le, Oriol Vinyals, and Wo-
jciech Zaremba. 2015b. Addressing the rare
word problem in neural machine translation. In
Proceedings of ACL.
[Och2003] Franz J. Och. 2003. Minimum error rate
training in statistical machine translation. In Pro-
ceedings of ACL.
[Papineni et al.2002] Kishore Papineni, Salim Roukos,
Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a
method for automatic evaluation of machine trans-
lation. In Proceedings of ACL.
[Ranzato et al.2015] Marc’Aurelio Ranzato, Sumit
Chopra, Michael Auli, and Wojciech Zaremba.
2015. Sequence level training with recurrent neural
networks. arXiv:1511.06732v1.
[Sennrich et al.2015] Rico Sennrich, Barry Haddow,
and Alexandra Birch. 2015. Improving neural
machine translation models with monolingual data.
arXiv:1511.06709.
[Smith and Eisner2006] David A. Smith and Jason Eis-
ner. 2006. Minimum risk annealing for training log-
linear models. In Proceedings of ACL.
[Snover et al.2006] Matthew Snover, Bonnie Dorr,
Richard Schwartz, Linnea Micciulla, and John
Makhoul. 2006. A study of translation edit rate
with targeted human annotation. In Proceedings of
AMTA.
[Stolcke2002] Andreas Stolcke. 2002. SRILM – an ex-
tensible language modeling toolkit. In Proceedings
of ICSLP.
[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals,
and Quoc V. Le. 2014. Sequence to sequence learn-
ing with neural networks. In Proceedings of NIPS.
[Williams1992] Ronald J. Williams. 1992. Simple sta-
tistical gradient-following algorithms for connec-
tionist reinforcement learning. Machine Learning.