arXiv:1508.07909v5 [cs.CL] 10 Jun 2016
ar
X
iv
:1
50
8.
07
90
9v
5
[
cs
.C
L
]
1
0
Ju
n
20
16
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich and Barry Haddow and Alexandra Birch
School of Informatics, University of Edinburgh
{rico.sennrich,a.birch}@ed.ac.uk, .ac.uk
Abstract
Neural machine translation (NMT) mod-
els typically operate with a fixed vocabu-
lary, but translation is an open-vocabulary
problem. Previous work addresses the
translation of out-of-vocabulary words by
backing off to a dictionary. In this pa-
per, we introduce a simpler and more ef-
fective approach, making the NMT model
capable of open-vocabulary translation by
encoding rare and unknown words as se-
quences of subword units. This is based on
the intuition that various word classes are
translatable via smaller units than words,
for instance names (via character copying
or transliteration), compounds (via com-
positional translation), and cognates and
loanwords (via phonological and morpho-
logical transformations). We discuss the
suitability of different word segmentation
techniques, including simple character n-
gram models and a segmentation based on
the byte pair encoding compression algo-
rithm, and empirically show that subword
models improve over a back-off dictionary
baseline for the WMT 15 translation tasks
English→German and English→Russian
by up to 1.1 and 1.3 BLEU, respectively.
1 Introduction
Neural machine translation has recently shown
impressive results (Kalchbrenner and Blunsom,
2013; Sutskever et al., 2014; Bahdanau et al.,
2015). However, the translation of rare words
is an open problem. The vocabulary of neu-
ral models is typically limited to 30 000–50 000
words, but translation is an open-vocabulary prob-
The research presented in this publication was conducted
in cooperation with Samsung Electronics Polska sp. z o.o. –
Samsung R&D Institute Poland.
lem, and especially for languages with produc-
tive word formation processes such as aggluti-
nation and compounding, translation models re-
quire mechanisms that go below the word level.
As an example, consider compounds such as the
German Abwasser|behandlungs|anlange ‘sewage
water treatment plant’, for which a segmented,
variable-length representation is intuitively more
appealing than encoding the word as a fixed-length
vector.
For word-level NMT models, the translation
of out-of-vocabulary words has been addressed
through a back-off to a dictionary look-up (Jean et
al., 2015; Luong et al., 2015b). We note that such
techniques make assumptions that often do not
hold true in practice. For instance, there is not al-
ways a 1-to-1 correspondence between source and
target words because of variance in the degree of
morphological synthesis between languages, like
in our introductory compounding example. Also,
word-level models are unable to translate or gen-
erate unseen words. Copying unknown words into
the target text, as done by (Jean et al., 2015; Luong
et al., 2015b), is a reasonable strategy for names,
but morphological changes and transliteration is
often required, especially if alphabets differ.
We investigate NMT models that operate on the
level of subword units. Our main goal is to model
open-vocabulary translation in the NMT network
itself, without requiring a back-off model for rare
words. In addition to making the translation pro-
cess simpler, we also find that the subword models
achieve better accuracy for the translation of rare
words than large-vocabulary models and back-off
dictionaries, and are able to productively generate
new words that were not seen at training time. Our
analysis shows that the neural networks are able to
learn compounding and transliteration from sub-
word representations.
This paper has two main contributions:
• We show that open-vocabulary neural ma-
http://arxiv.org/abs/1508.07909v5
chine translation is possible by encoding
(rare) words via subword units. We find our
architecture simpler and more effective than
using large vocabularies and back-off dictio-
naries (Jean et al., 2015; Luong et al., 2015b).
• We adapt byte pair encoding (BPE) (Gage,
1994), a compression algorithm, to the task
of word segmentation. BPE allows for the
representation of an open vocabulary through
a fixed-size vocabulary of variable-length
character sequences, making it a very suit-
able word segmentation strategy for neural
network models.
2 Neural Machine Translation
We follow the neural machine translation archi-
tecture by Bahdanau et al. (2015), which we will
briefly summarize here. However, we note that our
approach is not specific to this architecture.
The neural machine translation system is imple-
mented as an encoder-decoder network with recur-
rent neural networks.
The encoder is a bidirectional neural network
with gated recurrent units (Cho et al., 2014)
that reads an input sequence x = (x1, …, xm)
and calculates a forward sequence of hidden
states (
−→
h 1, …,
−→
h m), and a backward sequence
(
←−
h 1, …,
←−
hm). The hidden states
−→
h j and
←−
h j are
concatenated to obtain the annotation vector hj .
The decoder is a recurrent neural network that
predicts a target sequence y = (y1, …, yn). Each
word yi is predicted based on a recurrent hidden
state si, the previously predicted word yi−1, and
a context vector ci. ci is computed as a weighted
sum of the annotations hj . The weight of each
annotation hj is computed through an alignment
model αij , which models the probability that yi is
aligned to xj . The alignment model is a single-
layer feedforward neural network that is learned
jointly with the rest of the network through back-
propagation.
A detailed description can be found in (Bah-
danau et al., 2015). Training is performed on a
parallel corpus with stochastic gradient descent.
For translation, a beam search with small beam
size is employed.
3 Subword Translation
The main motivation behind this paper is that
the translation of some words is transparent in
that they are translatable by a competent transla-
tor even if they are novel to him or her, based
on a translation of known subword units such as
morphemes or phonemes. Word categories whose
translation is potentially transparent include:
• named entities. Between languages that share
an alphabet, names can often be copied from
source to target text. Transcription or translit-
eration may be required, especially if the al-
phabets or syllabaries differ. Example:
Barack Obama (English; German)
Барак Обама (Russian)
バラク・オバマ (ba-ra-ku o-ba-ma) (Japanese)
• cognates and loanwords. Cognates and loan-
words with a common origin can differ in
regular ways between languages, so that
character-level translation rules are sufficient
(Tiedemann, 2012). Example:
claustrophobia (English)
Klaustrophobie (German)
Клаустрофобия (Klaustrofobiâ) (Russian)
• morphologically complex words. Words con-
taining multiple morphemes, for instance
formed via compounding, affixation, or in-
flection, may be translatable by translating
the morphemes separately. Example:
solar system (English)
Sonnensystem (Sonne + System) (German)
Naprendszer (Nap + Rendszer) (Hungarian)
In an analysis of 100 rare tokens (not among
the 50 000 most frequent types) in our German
training data1, the majority of tokens are poten-
tially translatable from English through smaller
units. We find 56 compounds, 21 names,
6 loanwords with a common origin (emanci-
pate→emanzipieren), 5 cases of transparent affix-
ation (sweetish ‘sweet’ + ‘-ish’→ süßlich ‘süß’ +
‘-lich’), 1 number and 1 computer language iden-
tifier.
Our hypothesis is that a segmentation of rare
words into appropriate subword units is suffi-
cient to allow for the neural translation network
to learn transparent translations, and to general-
ize this knowledge to translate and produce unseen
words.2 We provide empirical support for this hy-
1Primarily parliamentary proceedings and web crawl data.
2Not every segmentation we produce is transparent.
While we expect no performance benefit from opaque seg-
mentations, i.e. segmentations where the units cannot be
translated independently, our NMT models show robustness
towards oversplitting.
pothesis in Sections 4 and 5. First, we discuss dif-
ferent subword representations.
3.1 Related Work
For Statistical Machine Translation (SMT), the
translation of unknown words has been the subject
of intensive research.
A large proportion of unknown words are
names, which can just be copied into the tar-
get text if both languages share an alphabet. If
alphabets differ, transliteration is required (Dur-
rani et al., 2014). Character-based translation has
also been investigated with phrase-based models,
which proved especially successful for closely re-
lated languages (Vilar et al., 2007; Tiedemann,
2009; Neubig et al., 2012).
The segmentation of morphologically complex
words such as compounds is widely used for SMT,
and various algorithms for morpheme segmen-
tation have been investigated (Nießen and Ney,
2000; Koehn and Knight, 2003; Virpioja et al.,
2007; Stallard et al., 2012). Segmentation al-
gorithms commonly used for phrase-based SMT
tend to be conservative in their splitting decisions,
whereas we aim for an aggressive segmentation
that allows for open-vocabulary translation with a
compact network vocabulary, and without having
to resort to back-off dictionaries.
The best choice of subword units may be task-
specific. For speech recognition, phone-level lan-
guage models have been used (Bazzi and Glass,
2000). Mikolov et al. (2012) investigate subword
language models, and propose to use syllables.
For multilingual segmentation tasks, multilingual
algorithms have been proposed (Snyder and Barzi-
lay, 2008). We find these intriguing, but inapplica-
ble at test time.
Various techniques have been proposed to pro-
duce fixed-length continuous word vectors based
on characters or morphemes (Luong et al., 2013;
Botha and Blunsom, 2014; Ling et al., 2015a; Kim
et al., 2015). An effort to apply such techniques
to NMT, parallel to ours, has found no significant
improvement over word-based approaches (Ling
et al., 2015b). One technical difference from our
work is that the attention mechanism still oper-
ates on the level of words in the model by Ling
et al. (2015b), and that the representation of each
word is fixed-length. We expect that the attention
mechanism benefits from our variable-length rep-
resentation: the network can learn to place atten-
tion on different subword units at each step. Re-
call our introductory example Abwasserbehand-
lungsanlange, for which a subword segmentation
avoids the information bottleneck of a fixed-length
representation.
Neural machine translation differs from phrase-
based methods in that there are strong incentives to
minimize the vocabulary size of neural models to
increase time and space efficiency, and to allow for
translation without back-off models. At the same
time, we also want a compact representation of the
text itself, since an increase in text length reduces
efficiency and increases the distances over which
neural models need to pass information.
A simple method to manipulate the trade-off be-
tween vocabulary size and text size is to use short-
lists of unsegmented words, using subword units
only for rare words. As an alternative, we pro-
pose a segmentation algorithm based on byte pair
encoding (BPE), which lets us learn a vocabulary
that provides a good compression rate of the text.
3.2 Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) (Gage, 1994) is a sim-
ple data compression technique that iteratively re-
places the most frequent pair of bytes in a se-
quence with a single, unused byte. We adapt this
algorithm for word segmentation. Instead of merg-
ing frequent pairs of bytes, we merge characters or
character sequences.
Firstly, we initialize the symbol vocabulary with
the character vocabulary, and represent each word
as a sequence of characters, plus a special end-of-
word symbol ‘·’, which allows us to restore the
original tokenization after translation. We itera-
tively count all symbol pairs and replace each oc-
currence of the most frequent pair (‘A’, ‘B’) with
a new symbol ‘AB’. Each merge operation pro-
duces a new symbol which represents a charac-
ter n-gram. Frequent character n-grams (or whole
words) are eventually merged into a single sym-
bol, thus BPE requires no shortlist. The final sym-
bol vocabulary size is equal to the size of the initial
vocabulary, plus the number of merge operations
– the latter is the only hyperparameter of the algo-
rithm.
For efficiency, we do not consider pairs that
cross word boundaries. The algorithm can thus be
run on the dictionary extracted from a text, with
each word being weighted by its frequency. A
minimal Python implementation is shown in Al-
Algorithm 1 Learn BPE operations
import re, collections
def get_stats(vocab):
pairs = collections.defaultdict(int)
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols)-1):
pairs[symbols[i],symbols[i+1]] += freq
return pairs
def merge_vocab(pair, v_in):
v_out = {}
bigram = re.escape(‘ ‘.join(pair))
p = re.compile(r'(?‘ : 5, ‘l o w e r ‘ : 2,
‘n e w e s t ‘:6, ‘w i d e s t ‘:3}
num_merges = 10
for i in range(num_merges):
pairs = get_stats(vocab)
best = max(pairs, key=pairs.get)
vocab = merge_vocab(best, vocab)
print(best)
r · → r·
l o → lo
lo w → low
e r· → er·
Figure 1: BPE merge operations learned from dic-
tionary {‘low’, ‘lowest’, ‘newer’, ‘wider’}.
gorithm 1. In practice, we increase efficiency by
indexing all pairs, and updating data structures in-
crementally.
The main difference to other compression al-
gorithms, such as Huffman encoding, which have
been proposed to produce a variable-length en-
coding of words for NMT (Chitnis and DeNero,
2015), is that our symbol sequences are still in-
terpretable as subword units, and that the network
can generalize to translate and produce new words
(unseen at training time) on the basis of these sub-
word units.
Figure 1 shows a toy example of learned BPE
operations. At test time, we first split words into
sequences of characters, then apply the learned op-
erations to merge the characters into larger, known
symbols. This is applicable to any word, and
allows for open-vocabulary networks with fixed
symbol vocabularies.3 In our example, the OOV
‘lower’ would be segmented into ‘low er·’.
3The only symbols that will be unknown at test time are
unknown characters, or symbols of which all occurrences
in the training text have been merged into larger symbols,
like ‘safeguar’, which has all occurrences in our training text
merged into ‘safeguard’. We observed no such symbols at
test time, but the issue could be easily solved by recursively
reversing specific merges until all symbols are known.
We evaluate two methods of applying BPE:
learning two independent encodings, one for the
source, one for the target vocabulary, or learning
the encoding on the union of the two vocabular-
ies (which we call joint BPE).4 The former has the
advantage of being more compact in terms of text
and vocabulary size, and having stronger guaran-
tees that each subword unit has been seen in the
training text of the respective language, whereas
the latter improves consistency between the source
and the target segmentation. If we apply BPE in-
dependently, the same name may be segmented
differently in the two languages, which makes it
harder for the neural models to learn a mapping
between the subword units. To increase the con-
sistency between English and Russian segmenta-
tion despite the differing alphabets, we transliter-
ate the Russian vocabulary into Latin characters
with ISO-9 to learn the joint BPE encoding, then
transliterate the BPE merge operations back into
Cyrillic to apply them to the Russian training text.5
4 Evaluation
We aim to answer the following empirical ques-
tions:
• Can we improve the translation of rare and
unseen words in neural machine translation
by representing them via subword units?
• Which segmentation into subword units per-
forms best in terms of vocabulary size, text
size, and translation quality?
We perform experiments on data from the
shared translation task of WMT 2015. For
English→German, our training set consists of 4.2
million sentence pairs, or approximately 100 mil-
lion tokens. For English→Russian, the training set
consists of 2.6 million sentence pairs, or approx-
imately 50 million tokens. We tokenize and true-
case the data with the scripts provided in Moses
(Koehn et al., 2007). We use newstest2013 as de-
velopment set, and report results on newstest2014
and newstest2015.
We report results with BLEU (mteval-v13a.pl),
and CHRF3 (Popović, 2015), a character n-gram
F3 score which was found to correlate well with
4In practice, we simply concatenate the source and target
side of the training set to learn joint BPE.
5Since the Russian training text also contains words that
use the Latin alphabet, we also apply the Latin BPE opera-
tions.
human judgments, especially for translations out
of English (Stanojević et al., 2015). Since our
main claim is concerned with the translation of
rare and unseen words, we report separate statis-
tics for these. We measure these through unigram
F1, which we calculate as the harmonic mean of
clipped unigram precision and recall.6
We perform all experiments with Groundhog7
(Bahdanau et al., 2015). We generally follow set-
tings by previous work (Bahdanau et al., 2015;
Jean et al., 2015). All networks have a hidden
layer size of 1000, and an embedding layer size
of 620. Following Jean et al. (2015), we only keep
a shortlist of τ = 30000 words in memory.
During training, we use Adadelta (Zeiler, 2012),
a minibatch size of 80, and reshuffle the train-
ing set between epochs. We train a network for
approximately 7 days, then take the last 4 saved
models (models being saved every 12 hours), and
continue training each with a fixed embedding
layer (as suggested by (Jean et al., 2015)) for 12
hours. We perform two independent training runs
for each models, once with cut-off for gradient
clipping (Pascanu et al., 2013) of 5.0, once with
a cut-off of 1.0 – the latter produced better single
models for most settings. We report results of the
system that performed best on our development set
(newstest2013), and of an ensemble of all 8 mod-
els.
We use a beam size of 12 for beam search,
with probabilities normalized by sentence length.
We use a bilingual dictionary based on fast-align
(Dyer et al., 2013). For our baseline, this serves
as back-off dictionary for rare words. We also use
the dictionary to speed up translation for all ex-
periments, only performing the softmax over a fil-
tered list of candidate translations (like Jean et al.
(2015), we use K = 30000; K ′ = 10).
4.1 Subword statistics
Apart from translation quality, which we will ver-
ify empirically, our main objective is to represent
an open vocabulary through a compact fixed-size
subword vocabulary, and allow for efficient train-
ing and decoding.8
Statistics for different segmentations of the Ger-
6Clipped unigram precision is essentially 1-gram BLEU
without brevity penalty.
7github.com/sebastien-j/LV_groundhog
8The time complexity of encoder-decoder architectures is
at least linear to sequence length, and oversplitting harms ef-
ficiency.
man side of the parallel data are shown in Table
1. A simple baseline is the segmentation of words
into character n-grams.9 Character n-grams allow
for different trade-offs between sequence length
(# tokens) and vocabulary size (# types), depend-
ing on the choice of n. The increase in sequence
length is substantial; one way to reduce sequence
length is to leave a shortlist of the k most frequent
word types unsegmented. Only the unigram repre-
sentation is truly open-vocabulary. However, the
unigram representation performed poorly in pre-
liminary experiments, and we report translation re-
sults with a bigram representation, which is empir-
ically better, but unable to produce some tokens in
the test set with the training set vocabulary.
We report statistics for several word segmenta-
tion techniques that have proven useful in previous
SMT research, including frequency-based com-
pound splitting (Koehn and Knight, 2003), rule-
based hyphenation (Liang, 1983), and Morfessor
(Creutz and Lagus, 2002). We find that they only
moderately reduce vocabulary size, and do not
solve the unknown word problem, and we thus find
them unsuitable for our goal of open-vocabulary
translation without back-off dictionary.
BPE meets our goal of being open-vocabulary,
and the learned merge operations can be applied
to the test set to obtain a segmentation with no
unknown symbols.10 Its main difference from
the character-level model is that the more com-
pact representation of BPE allows for shorter se-
quences, and that the attention model operates
on variable-length units.11 Table 1 shows BPE
with 59 500 merge operations, and joint BPE with
89 500 operations.
In practice, we did not include infrequent sub-
word units in the NMT network vocabulary, since
there is noise in the subword symbol sets, e.g.
because of characters from foreign alphabets.
Hence, our network vocabularies in Table 2 are
typically slightly smaller than the number of types
in Table 1.
9Our character n-grams do not cross word boundaries. We
mark whether a subword is word-final or not with a special
character, which allows us to restore the original tokenization.
10Joint BPE can produce segments that are unknown be-
cause they only occur in the English training text, but these
are rare (0.05% of test tokens).
11We highlighted the limitations of word-level attention in
section 3.1. At the other end of the spectrum, the character
level is suboptimal for alignment (Tiedemann, 2009).
vocabulary BLEU CHRF3 unigram F1 (%)
name segmentation shortlist source target single ens-8 single ens-8 all rare OOV
syntax-based (Sennrich and Haddow, 2015) 24.4 – 55.3 – 59.1 46.0 37.7
WUnk – – 300 000 500 000 20.6 22.8 47.2 48.9 56.7 20.4 0.0
WDict – – 300 000 500 000 22.0 24.2 50.5 52.4 58.1 36.8 36.8
C2-50k char-bigram 50 000 60 000 60 000 22.8 25.3 51.9 53.5 58.4 40.5 30.9
BPE-60k BPE – 60 000 60 000 21.5 24.5 52.0 53.9 58.4 40.9 29.3
BPE-J90k BPE (joint) – 90 000 90 000 22.8 24.7 51.7 54.1 58.5 41.8 33.6
Table 2: English→German translation performance (BLEU, CHRF3 and unigram F1) on newstest2015.
Ens-8: ensemble of 8 models. Best NMT system in bold. Unigram F1 (with ensembles) is computed for
all words (n = 44085), rare words (not among top 50 000 in training set; n = 2900), and OOVs (not in
training set; n = 1168).
segmentation # tokens # types # UNK
none 100 m 1 750 000 1079
characters 550 m 3000 0
character bigrams 306 m 20 000 34
character trigrams 214 m 120 000 59
compound splitting△ 102 m 1 100 000 643
morfessor* 109 m 544 000 237
hyphenation⋄ 186 m 404 000 230
BPE 112 m 63 000 0
BPE (joint) 111 m 82 000 32
character bigrams
129 m 69 000 34
(shortlist: 50 000)
Table 1: Corpus statistics for German training
corpus with different word segmentation tech-
niques. #UNK: number of unknown tokens in
newstest2013. △: (Koehn and Knight, 2003); *:
(Creutz and Lagus, 2002); ⋄: (Liang, 1983).
4.2 Translation experiments
English→German translation results are shown in
Table 2; English→Russian results in Table 3.
Our baseline WDict is a word-level model with
a back-off dictionary. It differs from WUnk in that
the latter uses no back-off dictionary, and just rep-
resents out-of-vocabulary words as UNK12. The
back-off dictionary improves unigram F1 for rare
and unseen words, although the improvement is
smaller for English→Russian, since the back-off
dictionary is incapable of transliterating names.
All subword systems operate without a back-off
dictionary. We first focus on unigram F1, where
all systems improve over the baseline, especially
for rare words (36.8%→41.8% for EN→DE;
26.5%→29.7% for EN→RU). For OOVs, the
baseline strategy of copying unknown words
works well for English→German. However, when
alphabets differ, like in English→Russian, the
subword models do much better.
12We use UNK for words that are outside the model vo-
cabulary, and OOV for those that do not occur in the training
text.
Unigram F1 scores indicate that learning the
BPE symbols on the vocabulary union (BPE-
J90k) is more effective than learning them sep-
arately (BPE-60k), and more effective than using
character bigrams with a shortlist of 50 000 unseg-
mented words (C2-50k), but all reported subword
segmentations are viable choices and outperform
the back-off dictionary baseline.
Our subword representations cause big im-
provements in the translation of rare and unseen
words, but these only constitute 9-11% of the test
sets. Since rare words tend to carry central in-
formation in a sentence, we suspect that BLEU
and CHRF3 underestimate their effect on transla-
tion quality. Still, we also see improvements over
the baseline in total unigram F1, as well as BLEU
and CHRF3, and the subword ensembles outper-
form the WDict baseline by 0.3–1.3 BLEU and
0.6–2 CHRF3. There is some inconsistency be-
tween BLEU and CHRF3, which we attribute to the
fact that BLEU has a precision bias, and CHRF3 a
recall bias.
For English→German, we observe the best
BLEU score of 25.3 with C2-50k, but the best
CHRF3 score of 54.1 with BPE-J90k. For com-
parison to the (to our knowledge) best non-neural
MT system on this data set, we report syntax-
based SMT results (Sennrich and Haddow, 2015).
We observe that our best systems outperform the
syntax-based system in terms of BLEU, but not
in terms of CHRF3. Regarding other neural sys-
tems, Luong et al. (2015a) report a BLEU score of
25.9 on newstest2015, but we note that they use an
ensemble of 8 independently trained models, and
also report strong improvements from applying
dropout, which we did not use. We are confident
that our improvements to the translation of rare
words are orthogonal to improvements achievable
through other improvements in the network archi-
tecture, training algorithm, or better ensembles.
For English→Russian, the state of the art is
the phrase-based system by Haddow et al. (2015).
It outperforms our WDict baseline by 1.5 BLEU.
The subword models are a step towards closing
this gap, and BPE-J90k yields an improvement of
1.3 BLEU, and 2.0 CHRF3, over WDict.
As a further comment on our translation results,
we want to emphasize that performance variabil-
ity is still an open problem with NMT. On our de-
velopment set, we observe differences of up to 1
BLEU between different models. For single sys-
tems, we report the results of the model that per-
forms best on dev (out of 8), which has a stabi-
lizing effect, but how to control for randomness
deserves further attention in future research.
5 Analysis
5.1 Unigram accuracy
Our main claims are that the translation of rare and
unknown words is poor in word-level NMT mod-
els, and that subword models improve the trans-
lation of these word types. To further illustrate
the effect of different subword segmentations on
the translation of rare and unseen words, we plot
target-side words sorted by their frequency in the
training set.13 To analyze the effect of vocabulary
size, we also include the system C2-3/500k, which
is a system with the same vocabulary size as the
WDict baseline, and character bigrams to repre-
sent unseen words.
Figure 2 shows results for the English–German
ensemble systems on newstest2015. Unigram
F1 of all systems tends to decrease for lower-
frequency words. The baseline system has a spike
in F1 for OOVs, i.e. words that do not occur in
the training text. This is because a high propor-
tion of OOVs are names, for which a copy from
the source to the target text is a good strategy for
English→German.
The systems with a target vocabulary of 500 000
words mostly differ in how well they translate
words with rank > 500 000. A back-off dictionary
is an obvious improvement over producing UNK,
but the subword system C2-3/500k achieves better
performance. Note that all OOVs that the back-
off dictionary produces are words that are copied
from the source, usually names, while the subword
13We perform binning of words with the same training set
frequency, and apply bezier smoothing to the graph.
systems can productively form new words such as
compounds.
For the 50 000 most frequent words, the repre-
sentation is the same for all neural networks, and
all neural networks achieve comparable unigram
F1 for this category. For the interval between fre-
quency rank 50 000 and 500 000, the comparison
between C2-3/500k and C2-50k unveils an inter-
esting difference. The two systems only differ in
the size of the shortlist, with C2-3/500k represent-
ing words in this interval as single units, and C2-
50k via subword units. We find that the perfor-
mance of C2-3/500k degrades heavily up to fre-
quency rank 500 000, at which point the model
switches to a subword representation and perfor-
mance recovers. The performance of C2-50k re-
mains more stable. We attribute this to the fact
that subword units are less sparse than words. In
our training set, the frequency rank 50 000 corre-
sponds to a frequency of 60 in the training data;
the frequency rank 500 000 to a frequency of 2.
Because subword representations are less sparse,
reducing the size of the network vocabulary, and
representing more words via subword units, can
lead to better performance.
The F1 numbers hide some qualitative differ-
ences between systems. For English→German,
WDict produces few OOVs (26.5% recall), but
with high precision (60.6%) , whereas the subword
systems achieve higher recall, but lower precision.
We note that the character bigram model C2-50k
produces the most OOV words, and achieves rel-
atively low precision of 29.1% for this category.
However, it outperforms the back-off dictionary
in recall (33.0%). BPE-60k, which suffers from
transliteration (or copy) errors due to segmenta-
tion inconsistencies, obtains a slightly better pre-
cision (32.4%), but a worse recall (26.6%). In con-
trast to BPE-60k, the joint BPE encoding of BPE-
J90k improves both precision (38.6%) and recall
(29.8%).
For English→Russian, unknown names can
only rarely be copied, and usually require translit-
eration. Consequently, the WDict baseline per-
forms more poorly for OOVs (9.2% precision;
5.2% recall), and the subword models improve
both precision and recall (21.9% precision and
15.6% recall for BPE-J90k). The full unigram F1
plot is shown in Figure 3.
vocabulary BLEU CHRF3 unigram F1 (%)
name segmentation shortlist source target single ens-8 single ens-8 all rare OOV
phrase-based (Haddow et al., 2015) 24.3 – 53.8 – 56.0 31.3 16.5
WUnk – – 300 000 500 000 18.8 22.4 46.5 49.9 54.2 25.2 0.0
WDict – – 300 000 500 000 19.1 22.8 47.5 51.0 54.8 26.5 6.6
C2-50k char-bigram 50 000 60 000 60 000 20.9 24.1 49.0 51.6 55.2 27.8 17.4
BPE-60k BPE – 60 000 60 000 20.5 23.6 49.8 52.7 55.3 29.7 15.6
BPE-J90k BPE (joint) – 90 000 100 000 20.4 24.1 49.7 53.0 55.8 29.7 18.3
Table 3: English→Russian translation performance (BLEU, CHRF3 and unigram F1) on newstest2015.
Ens-8: ensemble of 8 models. Best NMT system in bold. Unigram F1 (with ensembles) is computed for
all words (n = 55654), rare words (not among top 50 000 in training set; n = 5442), and OOVs (not in
training set; n = 851).
100 101 102 103 104 105 106
0
0.2
0.4
0.6
0.8
1
50 000 500 000
training set frequency rank
un
ig
ra
m
F
1
BPE-J90k
C2-50k
C2-300/500k
WDict
WUnk
Figure 2: English→German unigram F1 on new-
stest2015 plotted by training set frequency rank
for different NMT systems.
100 101 102 103 104 105 106
0
0.2
0.4
0.6
0.8
1
50 000 500 000
training set frequency rank
un
ig
ra
m
F
1
BPE-J90k
C2-50k
WDict
WUnk
Figure 3: English→Russian unigram F1 on new-
stest2015 plotted by training set frequency rank
for different NMT systems.
5.2 Manual Analysis
Table 4 shows two translation examples for
the translation direction English→German, Ta-
ble 5 for English→Russian. The baseline sys-
tem fails for all of the examples, either by delet-
ing content (health), or by copying source words
that should be translated or transliterated. The
subword translations of health research insti-
tutes show that the subword systems are capa-
ble of learning translations when oversplitting (re-
search→Fo|rs|ch|un|g), or when the segmentation
does not match morpheme boundaries: the seg-
mentation Forschungs|instituten would be linguis-
tically more plausible, and simpler to align to the
English research institutes, than the segmentation
Forsch|ungsinstitu|ten in the BPE-60k system, but
still, a correct translation is produced. If the sys-
tems have failed to learn a translation due to data
sparseness, like for asinine, which should be trans-
lated as dumm, we see translations that are wrong,
but could be plausible for (partial) loanwords (asi-
nine Situation→Asinin-Situation).
The English→Russian examples show that
the subword systems are capable of translitera-
tion. However, transliteration errors do occur,
either due to ambiguous transliterations, or be-
cause of non-consistent segmentations between
source and target text which make it hard for
the system to learn a transliteration mapping.
Note that the BPE-60k system encodes Mirza-
yeva inconsistently for the two language pairs
(Mirz|ayeva→Мир|за|ева Mir|za|eva). This ex-
ample is still translated correctly, but we observe
spurious insertions and deletions of characters in
the BPE-60k system. An example is the translit-
eration of rakfisk, where a п is inserted and a к
is deleted. We trace this error back to transla-
tion pairs in the training data with inconsistent
segmentations, such as (p|rak|ri|ti→пра|крит|и
system sentence
source health research institutes
reference Gesundheitsforschungsinstitute
WDict Forschungsinstitute
C2-50k Fo|rs|ch|un|gs|in|st|it|ut|io|ne|n
BPE-60k Gesundheits|forsch|ungsinstitu|ten
BPE-J90k Gesundheits|forsch|ungsin|stitute
source asinine situation
reference dumme Situation
WDict asinine situation → UNK → asinine
C2-50k as|in|in|e situation → As|in|en|si|tu|at|io|n
BPE-60k as|in|ine situation → A|in|line-|Situation
BPE-J90K as|in|ine situation → As|in|in-|Situation
Table 4: English→German translation example.
“|” marks subword boundaries.
system sentence
source Mirzayeva
reference Мирзаева (Mirzaeva)
WDict Mirzayeva → UNK → Mirzayeva
C2-50k Mi|rz|ay|ev|a → Ми|рз|ае|ва (Mi|rz|ae|va)
BPE-60k Mirz|ayeva → Мир|за|ева (Mir|za|eva)
BPE-J90k Mir|za|yeva → Мир|за|ева (Mir|za|eva)
source rakfisk
reference ракфиска (rakfiska)
WDict rakfisk → UNK → rakfisk
C2-50k ra|kf|is|k → ра|кф|ис|к (ra|kf|is|k)
BPE-60k rak|f|isk → пра|ф|иск (pra|f|isk)
BPE-J90k rak|f|isk → рак|ф|иска (rak|f|iska)
Table 5: English→Russian translation examples.
“|” marks subword boundaries.
(pra|krit|i)), from which the translation (rak→пра)
is erroneously learned. The segmentation of the
joint BPE system (BPE-J90k) is more consistent
(pra|krit|i→пра|крит|и (pra|krit|i)).
6 Conclusion
The main contribution of this paper is that we
show that neural machine translation systems are
capable of open-vocabulary translation by repre-
senting rare and unseen words as a sequence of
subword units.14 This is both simpler and more
effective than using a back-off translation model.
We introduce a variant of byte pair encoding for
word segmentation, which is capable of encod-
ing open vocabularies with a compact symbol vo-
cabulary of variable-length subword units. We
show performance gains over the baseline with
both BPE segmentation, and a simple character bi-
gram segmentation.
Our analysis shows that not only out-of-
vocabulary words, but also rare in-vocabulary
words are translated poorly by our baseline NMT
14The source code of the segmentation algorithms
is available at https://github.com/rsennrich/
subword-nmt.
system, and that reducing the vocabulary size
of subword models can actually improve perfor-
mance. In this work, our choice of vocabulary size
is somewhat arbitrary, and mainly motivated by
comparison to prior work. One avenue of future
research is to learn the optimal vocabulary size for
a translation task, which we expect to depend on
the language pair and amount of training data, au-
tomatically. We also believe there is further po-
tential in bilingually informed segmentation algo-
rithms to create more alignable subword units, al-
though the segmentation algorithm cannot rely on
the target text at runtime.
While the relative effectiveness will depend on
language-specific factors such as vocabulary size,
we believe that subword segmentations are suit-
able for most language pairs, eliminating the need
for large NMT vocabularies or back-off models.
Acknowledgments
We thank Maja Popović for her implementa-
tion of CHRF, with which we verified our re-
implementation. The research presented in this
publication was conducted in cooperation with
Samsung Electronics Polska sp. z o.o. – Sam-
sung R&D Institute Poland. This project received
funding from the European Union’s Horizon 2020
research and innovation programme under grant
agreement 645452 (QT21).
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
gio. 2015. Neural Machine Translation by Jointly
Learning to Align and Translate. In Proceedings of
the International Conference on Learning Represen-
tations (ICLR).
Issam Bazzi and James R. Glass. 2000. Modeling out-
of-vocabulary words for robust speech recognition.
In Sixth International Conference on Spoken Lan-
guage Processing, ICSLP 2000 / INTERSPEECH
2000, pages 401–404, Beijing, China.
Jan A. Botha and Phil Blunsom. 2014. Compositional
Morphology for Word Representations and Lan-
guage Modelling. In Proceedings of the 31st Inter-
national Conference on Machine Learning (ICML),
Beijing, China.
Rohan Chitnis and John DeNero. 2015. Variable-
Length Word Encodings for Neural Translation
Models. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing
(EMNLP).
Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-
cehre, Dzmitry Bahdanau, Fethi Bougares, Hol-
ger Schwenk, and Yoshua Bengio. 2014. Learn-
ing Phrase Representations using RNN Encoder–
Decoder for Statistical Machine Translation. In Pro-
ceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 1724–1734, Doha, Qatar. Association for
Computational Linguistics.
Mathias Creutz and Krista Lagus. 2002. Unsupervised
Discovery of Morphemes. In Proceedings of the
ACL-02 Workshop on Morphological and Phonolog-
ical Learning, pages 21–30. Association for Compu-
tational Linguistics.
Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp
Koehn. 2014. Integrating an Unsupervised Translit-
eration Model into Statistical Machine Translation.
In Proceedings of the 14th Conference of the Euro-
pean Chapter of the Association for Computational
Linguistics, EACL 2014, pages 148–153, Gothen-
burg, Sweden.
Chris Dyer, Victor Chahuneau, and Noah A. Smith.
2013. A Simple, Fast, and Effective Reparame-
terization of IBM Model 2. In Proceedings of the
2013 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, pages 644–648, At-
lanta, Georgia. Association for Computational Lin-
guistics.
Philip Gage. 1994. A New Algorithm for Data Com-
pression. C Users J., 12(2):23–38, February.
Barry Haddow, Matthias Huck, Alexandra Birch, Niko-
lay Bogoychev, and Philipp Koehn. 2015. The
Edinburgh/JHU Phrase-based Machine Translation
Systems for WMT 2015. In Proceedings of the
Tenth Workshop on Statistical Machine Translation,
pages 126–133, Lisbon, Portugal. Association for
Computational Linguistics.
Sébastien Jean, Kyunghyun Cho, Roland Memisevic,
and Yoshua Bengio. 2015. On Using Very Large
Target Vocabulary for Neural Machine Translation.
In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the
7th International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers), pages
1–10, Beijing, China. Association for Computa-
tional Linguistics.
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent
Continuous Translation Models. In Proceedings of
the 2013 Conference on Empirical Methods in Nat-
ural Language Processing, Seattle. Association for
Computational Linguistics.
Yoon Kim, Yacine Jernite, David Sontag, and Alexan-
der M. Rush. 2015. Character-Aware Neural Lan-
guage Models. CoRR, abs/1508.06615.
Philipp Koehn and Kevin Knight. 2003. Empirical
Methods for Compound Splitting. In EACL ’03:
Proceedings of the Tenth Conference on European
Chapter of the Association for Computational Lin-
guistics, pages 187–193, Budapest, Hungary. Asso-
ciation for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
Constantin, and Evan Herbst. 2007. Moses: Open
Source Toolkit for Statistical Machine Translation.
In Proceedings of the ACL-2007 Demo and Poster
Sessions, pages 177–180, Prague, Czech Republic.
Association for Computational Linguistics.
Franklin M. Liang. 1983. Word hy-phen-a-tion by
com-put-er. Ph.D. thesis, Stanford University, De-
partment of Linguistics, Stanford, CA.
Wang Ling, Chris Dyer, Alan W. Black, Isabel Tran-
coso, Ramon Fermandez, Silvio Amir, Luis Marujo,
and Tiago Luis. 2015a. Finding Function in Form:
Compositional Character Models for Open Vocab-
ulary Word Representation. In Proceedings of the
2015 Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP), pages 1520–
1530, Lisbon, Portugal. Association for Computa-
tional Linguistics.
Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W.
Black. 2015b. Character-based Neural Machine
Translation. ArXiv e-prints, November.
Thang Luong, Richard Socher, and Christopher D.
Manning. 2013. Better Word Representations
with Recursive Neural Networks for Morphology.
In Proceedings of the Seventeenth Conference on
Computational Natural Language Learning, CoNLL
2013, Sofia, Bulgaria, August 8-9, 2013, pages 104–
113.
Thang Luong, Hieu Pham, and Christopher D. Man-
ning. 2015a. Effective Approaches to Attention-
based Neural Machine Translation. In Proceed-
ings of the 2015 Conference on Empirical Meth-
ods in Natural Language Processing, pages 1412–
1421, Lisbon, Portugal. Association for Computa-
tional Linguistics.
Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals,
and Wojciech Zaremba. 2015b. Addressing the
Rare Word Problem in Neural Machine Translation.
In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the
7th International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers), pages
11–19, Beijing, China. Association for Computa-
tional Linguistics.
Tomas Mikolov, Ilya Sutskever, Anoop Deoras, Hai-
Son Le, Stefan Kombrink, and Jan Cernocký. 2012.
Subword Language Modeling with Neural Net-
works. Unpublished.
Graham Neubig, Taro Watanabe, Shinsuke Mori, and
Tatsuya Kawahara. 2012. Machine Translation
without Words through Substring Alignment. In The
50th Annual Meeting of the Association for Compu-
tational Linguistics, Proceedings of the Conference,
July 8-14, 2012, Jeju Island, Korea – Volume 1: Long
Papers, pages 165–174.
Sonja Nießen and Hermann Ney. 2000. Improving
SMT quality with morpho-syntactic analysis. In
18th Int. Conf. on Computational Linguistics, pages
1081–1085.
Razvan Pascanu, Tomas Mikolov, and Yoshua Ben-
gio. 2013. On the difficulty of training recurrent
neural networks. In Proceedings of the 30th Inter-
national Conference on Machine Learning, ICML
2013, pages 1310–1318, Atlanta, USA.
Maja Popović. 2015. chrF: character n-gram F-score
for automatic MT evaluation. In Proceedings of the
Tenth Workshop on Statistical Machine Translation,
pages 392–395, Lisbon, Portugal. Association for
Computational Linguistics.
Rico Sennrich and Barry Haddow. 2015. A Joint
Dependency Model of Morphological and Syntac-
tic Structure for Statistical Machine Translation. In
Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pages
2081–2087, Lisbon, Portugal. Association for Com-
putational Linguistics.
Benjamin Snyder and Regina Barzilay. 2008. Unsu-
pervised Multilingual Learning for Morphological
Segmentation. In Proceedings of ACL-08: HLT,
pages 737–745, Columbus, Ohio. Association for
Computational Linguistics.
David Stallard, Jacob Devlin, Michael Kayser,
Yoong Keok Lee, and Regina Barzilay. 2012. Unsu-
pervised Morphology Rivals Supervised Morphol-
ogy for Arabic MT. In The 50th Annual Meeting of
the Association for Computational Linguistics, Pro-
ceedings of the Conference, July 8-14, 2012, Jeju
Island, Korea – Volume 2: Short Papers, pages 322–
327.
Miloš Stanojević, Amir Kamran, Philipp Koehn, and
Ondřej Bojar. 2015. Results of the WMT15 Met-
rics Shared Task. In Proceedings of the Tenth Work-
shop on Statistical Machine Translation, pages 256–
273, Lisbon, Portugal. Association for Computa-
tional Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
Sequence to Sequence Learning with Neural Net-
works. In Advances in Neural Information Process-
ing Systems 27: Annual Conference on Neural Infor-
mation Processing Systems 2014, pages 3104–3112,
Montreal, Quebec, Canada.
Jörg Tiedemann. 2009. Character-based PSMT for
Closely Related Languages. In Proceedings of 13th
Annual Conference of the European Association for
Machine Translation (EAMT’09), pages 12–19.
Jörg Tiedemann. 2012. Character-Based Pivot Trans-
lation for Under-Resourced Languages and Do-
mains. In Proceedings of the 13th Conference of the
European Chapter of the Association for Computa-
tional Linguistics, pages 141–151, Avignon, France.
Association for Computational Linguistics.
David Vilar, Jan-Thorsten Peter, and Hermann Ney.
2007. Can We Translate Letters? In Second Work-
shop on Statistical Machine Translation, pages 33–
39, Prague, Czech Republic. Association for Com-
putational Linguistics.
Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz,
and Markus Sadeniemi. 2007. Morphology-Aware
Statistical Machine Translation Based on Morphs
Induced in an Unsupervised Manner. In Proceed-
ings of the Machine Translation Summit XI, pages
491–498, Copenhagen, Denmark.
Matthew D. Zeiler. 2012. ADADELTA: An Adaptive
Learning Rate Method. CoRR, abs/1212.5701.