Revisiting Low-Resource Neural Machine Translation: A Case Study
Rico Sennrich1,2 Biao Zhang1
1School of Informatics, University of Edinburgh
2Institute of Computational Linguistics, University of Zurich
Abstract
It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions, underperforming phrase-based statistical machine translation (PBSMT) and requiring large amounts of auxiliary data to achieve competitive results. In this paper, we re-assess the validity of these results, arguing that they are the result of a lack of system adaptation to low-resource settings. We discuss some pitfalls to be aware of when training low-resource NMT systems, and recent techniques that have been shown to be especially helpful in low-resource settings, resulting in a set of best practices for low-resource NMT. In our experiments on German–English with different amounts of IWSLT14 training data, we show that, without the use of any auxiliary monolingual or multilingual data, an optimized NMT system can outperform PBSMT with far less data than previously claimed. We also apply these techniques to a low-resource Korean–English dataset, surpassing previously reported results by 4 BLEU.
1 Introduction
While neural machine translation (NMT) has achieved impressive performance in high-resource data conditions, becoming dominant in the field (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), recent research has argued that these models are highly data-inefficient, and underperform phrase-based statistical machine translation (PBSMT) or unsupervised methods in low-data conditions (Koehn and Knowles, 2017; Lample et al., 2018b). In this paper, we re-assess the validity of these results, arguing that they are the result of a lack of system adaptation to low-resource settings. Our main contributions are as follows:
• we explore best practices for low-resource NMT, evaluating their importance with ablation studies.

• we reproduce a comparison of NMT and PBSMT in different data conditions, showing that, when following our best practices, NMT outperforms PBSMT with as little as 100 000 words of parallel training data.

Figure 1: quality of PBSMT and NMT in low-resource conditions according to (Koehn and Knowles, 2017).
2 Related Work
2.1 Low-Resource Translation Quality Compared Across Systems
Figure 1 reproduces a plot by Koehn and Knowles (2017) which shows that their NMT system only outperforms their PBSMT system when more than 100 million words (approx. 5 million sentences) of parallel training data are available. Lample et al. (2018b) report similar results, showing that unsupervised NMT outperforms supervised systems if few parallel resources are available. In both papers, NMT systems are trained with hyperparameters that are typical for high-resource settings, and the authors did not tune hyperparameters, or change network architectures, to optimize NMT for low-resource conditions.
2.2 Improving Low-Resource Neural Machine Translation
The bulk of research on low-resource NMT has focused on exploiting monolingual data, or parallel data involving other language pairs. Methods to improve NMT with monolingual data range from the integration of a separately trained language model (Gülçehre et al., 2015) to the training of parts of the NMT model with additional objectives, including a language modelling objective (Gülçehre et al., 2015; Sennrich et al., 2016b; Ramachandran et al., 2017), an autoencoding objective (Luong et al., 2016; Currey et al., 2017), or a round-trip objective, where the model is trained to predict monolingual (target-side) training data that has been back-translated into the source language (Sennrich et al., 2016b; He et al., 2016; Cheng et al., 2016). As an extreme case, models that rely exclusively on monolingual data have been shown to work (Artetxe et al., 2018b; Lample et al., 2018a; Artetxe et al., 2018a; Lample et al., 2018b). Similarly, parallel data from other language pairs can be used to pre-train the network or jointly learn representations (Zoph et al., 2016; Chen et al., 2017; Nguyen and Chiang, 2017; Neubig and Hu, 2018; Gu et al., 2018a,b; Kocmi and Bojar, 2018).
While semi-supervised and unsupervised approaches have been shown to be very effective for some language pairs, their effectiveness depends on the availability of large amounts of suitable auxiliary data, and on other conditions being met. For example, the effectiveness of unsupervised methods is impaired when languages are morphologically different, or when training domains do not match (Søgaard et al., 2018).
More broadly, this line of research still accepts the premise that NMT models are data-inefficient and require large amounts of auxiliary data to train. In this work, we want to re-visit this point, and will focus on techniques to make more efficient use of small amounts of parallel training data. Low-resource NMT without auxiliary data has received less attention; work in this direction includes (Östling and Tiedemann, 2017; Nguyen and Chiang, 2018).
3 Methods for Low-Resource Neural Machine Translation
3.1 Mainstream Improvements
We consider the hyperparameters used by Koehn and Knowles (2017) to be our baseline. This baseline does not make use of various advances in NMT architectures and training tricks. In contrast to the baseline, we use a BiDeep RNN architecture (Miceli Barone et al., 2017), label smoothing (Szegedy et al., 2016), dropout (Srivastava et al., 2014), word dropout (Sennrich et al., 2016a), layer normalization (Ba et al., 2016) and tied embeddings (Press and Wolf, 2017).
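To make one of these components concrete, the following is a minimal sketch of a label-smoothed cross-entropy loss in PyTorch; the function name and the smoothing value are illustrative assumptions, not the exact configuration used in the experiments reported here.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(logits, target, epsilon=0.1):
    """Cross-entropy with label smoothing (cf. Szegedy et al., 2016).

    logits: (batch, vocab) unnormalized scores
    target: (batch,) gold token indices
    epsilon: probability mass spread uniformly over the vocabulary (assumed value)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # negative log-likelihood of the gold token
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    # cross-entropy against a uniform distribution over all classes
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()
```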
3.2 Language Representation
Subword representations such as BPE (Sennrich et al., 2016c) have become a popular choice to achieve open-vocabulary translation. BPE has one hyperparameter, the number of merge operations, which determines the size of the final vocabulary. For high-resource settings, the effect of vocabulary size on translation quality is relatively small; Haddow et al. (2018) report mixed results when comparing vocabularies of 30k and 90k subwords.

In low-resource settings, large vocabularies result in low-frequency (sub)words being represented as atomic units at training time, and the ability to learn good high-dimensional representations of these is doubtful. Sennrich et al. (2017a) propose a minimum frequency threshold for subword units, and splitting any less frequent subword into smaller units or characters. We expect that such a threshold reduces the need to carefully tune the vocabulary size to the dataset, leading to more aggressive segmentation on smaller datasets.1

1In related work, Cherry et al. (2018) have shown that, given deep encoders and decoders, character-level models can outperform other subword segmentations. In preliminary experiments, a character-level model performed poorly in our low-resource setting.
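As a rough illustration of such a frequency threshold, the sketch below post-processes a BPE-segmented corpus and splits any subword occurring fewer than a chosen number of times into characters; the function name, the example threshold and the handling of BPE's "@@" continuation marker are assumptions for illustration, not the exact procedure of Sennrich et al. (2017a).

```python
from collections import Counter

def enforce_min_frequency(segmented_corpus, min_freq=10):
    """Re-segment BPE output so that no subword occurring fewer than
    `min_freq` times survives as an atomic unit; rare subwords are
    split into single characters instead (simplified sketch).

    segmented_corpus: list of sentences, each a list of subword tokens,
    where word-internal pieces carry the "@@" continuation marker.
    """
    counts = Counter(tok for sent in segmented_corpus for tok in sent)
    resegmented = []
    for sent in segmented_corpus:
        new_sent = []
        for tok in sent:
            if counts[tok] >= min_freq:
                new_sent.append(tok)
            else:
                chars = list(tok.replace("@@", ""))
                # all characters except the last continue the word
                new_sent.extend(c + "@@" for c in chars[:-1])
                # the last character keeps the marker only if the original
                # subword was itself word-internal
                new_sent.append(chars[-1] + ("@@" if tok.endswith("@@") else ""))
        resegmented.append(new_sent)
    return resegmented

# toy usage: subwords seen only once are broken into characters
corpus = [["the", "ex@@", "port", "flau@@", "te"],
          ["the", "ex@@", "port", "grew"]]
print(enforce_min_frequency(corpus, min_freq=2))
```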
3.3 Hyperparameter Tuning
Due to long training times, hyperparameters are hard to optimize by grid search, and are often re-used across experiments. However, best practices differ between high-resource and low-resource settings. While the trend in high-resource settings is towards using larger and deeper models, Nguyen and Chiang (2018) use smaller and fewer layers for smaller datasets. Previous work has argued for larger batch sizes in NMT (Morishita et al., 2017; Neishi et al., 2017), but we find that using smaller batches is beneficial in low-resource settings. More aggressive dropout, including dropping whole words at random (Gal and Ghahramani, 2016), is also likely to be more important. We report results on a narrow hyperparameter search guided by previous work and our own intuition.
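To illustrate the word-level regularization mentioned above, here is a minimal sketch of dropping whole word embeddings at random; the function name, dropout rate and rescaling choice are illustrative assumptions rather than the implementation used in our systems.

```python
import torch

def word_dropout(embedded, p=0.2, training=True):
    """Randomly zero out entire word embeddings (cf. Gal and Ghahramani, 2016).

    embedded: tensor of shape (batch, seq_len, emb_dim)
    p: probability of dropping a whole token's embedding (assumed value)
    """
    if not training or p == 0.0:
        return embedded
    batch, seq_len, _ = embedded.shape
    keep_mask = (torch.rand(batch, seq_len, 1, device=embedded.device) >= p).to(embedded.dtype)
    # rescale so the expected value of the input stays the same
    return embedded * keep_mask / (1.0 - p)
```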
3.4 Lexical Model
Finally, we implement and test the lexical model by Nguyen and Chiang (2018), which has been shown to be beneficial in low-data conditions. The core idea is to train a simple feed-forward network, the lexical model, jointly with the original attentional NMT model. The input of the lexical model at time step $t$ is the weighted average of source embeddings $f$ (the attention weights $a$ are shared with the main model). After a feedforward layer (with skip connection), the lexical model's output $h^l_t$ is combined with the original model's hidden state $h^o_t$ before softmax computation.
$$f^l_t = \tanh\Big(\sum_s a_t(s)\, f_s\Big)$$
$$h^l_t = \tanh(W f^l_t) + f^l_t$$
$$p(y_t \mid y_{<t}, x) = \operatorname{softmax}(W^o h^o_t + b^o + W^l h^l_t + b^l)$$
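A minimal sketch of how the lexical model's contribution to the output logits could be implemented in PyTorch is given below; the module and variable names are assumptions for illustration, and the sketch covers only the extra term added before the softmax, not the full decoder.

```python
import torch
import torch.nn as nn

class LexicalModel(nn.Module):
    """Feed-forward lexical model with skip connection (after Nguyen and
    Chiang, 2018); its output is added to the decoder's logits."""

    def __init__(self, emb_dim, vocab_size):
        super().__init__()
        self.ff = nn.Linear(emb_dim, emb_dim, bias=False)  # W
        self.out = nn.Linear(emb_dim, vocab_size)          # W^l, b^l

    def forward(self, attn_weights, src_embeddings):
        # attn_weights: (batch, src_len); src_embeddings: (batch, src_len, emb_dim)
        # f^l_t = tanh(sum_s a_t(s) f_s)
        f_l = torch.tanh(torch.bmm(attn_weights.unsqueeze(1), src_embeddings)).squeeze(1)
        # h^l_t = tanh(W f^l_t) + f^l_t  (skip connection)
        h_l = torch.tanh(self.ff(f_l)) + f_l
        return self.out(h_l)  # lexical logits W^l h^l_t + b^l

# decoder side (sketch): combine with the main model's logits before softmax
# logits = main_output_layer(h_o) + lexical_model(attn_weights, src_embeddings)
# p = torch.softmax(logits, dim=-1)
```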