arXiv:1602.01925v2 [cs.CL] 21 May 2016
Massively Multilingual Word Embeddings
Waleed Ammar♦ George Mulcaire♥ Yulia Tsvetkov♦
Guillaume Lample♦ Chris Dyer♦ Noah A. Smith♥
♦School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
♥Computer Science & Engineering, University of Washington, Seattle, WA, USA
{glample,cdyer}@cs.cmu.edu
Abstract
We introduce new methods for estimat-
ing and evaluating embeddings of words
in more than fifty languages in a sin-
gle shared embedding space. Our esti-
mation methods, multiCluster and mul-
tiCCA, use dictionaries and monolingual
data; they do not require parallel data.
Our new evaluation method, multiQVEC-
CCA, is shown to correlate better than
previous ones with two downstream tasks
(text categorization and parsing). We also
describe a web portal for evaluation that
will facilitate further research in this area,
along with open-source releases of all our
methods.
1 Introduction
Vector-space representations of words are widely
used in statistical models of natural language.
In addition to improving the performance on
standard monolingual NLP tasks, shared rep-
resentation of words across languages offers
intriguing possibilities (Klementiev et al., 2012).
For example, in machine translation, translat-
ing a word never seen in parallel data may be
overcome by seeking its vector-space neighbors,
provided the embeddings are learned from both
plentiful monolingual corpora and more limited
parallel data. A second opportunity comes from
transfer learning, in which models trained in one
language can be deployed in other languages.
While previous work has used hand-engineered
features that are cross-linguistically stable as the
basis for model transfer (Zeman and Resnik, 2008;
McDonald et al., 2011; Tsvetkov et al., 2014),
automatically learned embeddings of-
fer the promise of better generalization
at lower cost (Klementiev et al., 2012;
Hermann and Blunsom, 2014; Guo et al., 2016).
We therefore conjecture that developing estima-
tion methods for massively multilingual word
embeddings (i.e., embeddings for words in a large
number of languages) will play an important role
in the future of multilingual NLP.
This paper builds on previous work in multilin-
gual embeddings and makes the following contri-
butions:
• We propose two dictionary-based methods—
multiCluster and multiCCA—for estimating
multilingual embeddings which only require
monolingual data and pairwise parallel dictio-
naries, and use them to train embeddings in 59
languages for which these resources are avail-
able (§2). Parallel corpora are not required but
can be used when available. We show that the
proposed methods perform well in several settings
and under several evaluation metrics.
• We adapt QVEC (Tsvetkov et al., 2015)1 to eval-
uating multilingual embeddings (multiQVEC).
We also develop a new evaluation method
multiQVEC-CCA which addresses a theoretical
shortcoming of multiQVEC (§3). Compared to
other intrinsic metrics used in the literature, we
show that both multiQVEC and multiQVEC-CCA
achieve better correlations with extrinsic tasks.
• We develop an easy-to-use web portal2 for
evaluating arbitrary multilingual embeddings
using a suite of intrinsic and extrinsic metrics
(§4). Together with the provided benchmarks,
the evaluation portal will substantially facilitate
future research in this area.
1A method for evaluating monolingual word embeddings.
2http://128.2.220.95/multilingual
2 Estimating Multilingual Embeddings
Let L be a set of languages, and let V^m be the
set of surface forms (word types) in m ∈ L. Let
V = ⋃_{m∈L} V^m. Our goal is to estimate a partial
embedding function E : L × V → R^d (allowing
a surface form that appears in two languages to
have different vectors in each). We would like to
estimate this function such that: (i) semantically
similar words in the same language are nearby, (ii)
translationally equivalent words in different lan-
guages are nearby, and (iii) the domain of the func-
tion covers as many words in V as possible.
We use distributional similarity in a monolin-
gual corpus Mm to model semantic similarity be-
tween words in the same language. For cross-
lingual similarity, either a parallel corpus P_{m,n} or a
bilingual dictionary D_{m,n} ⊂ V^m × V^n can be used.
Our methods focus on the latter, in some cases ex-
tracting D_{m,n} from a parallel corpus.3
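Concretely, the thresholded selection described in footnote 3 (keep a pair when the product of its bidirectional translation probabilities exceeds τ) can be sketched as follows; the probability tables would come from a word aligner run in both directions, and the dict-of-pairs layout is an illustrative assumption, not the authors' code.

```python
def extract_dictionary(p_m_given_n, p_n_given_m, tau=0.1):
    """Keep a pair (u, v) when the product of its bidirectional
    word-translation probabilities exceeds the threshold tau."""
    dictionary = set()
    for (u, v), p_fwd in p_m_given_n.items():
        p_bwd = p_n_given_m.get((v, u), 0.0)  # reverse-direction probability
        if p_fwd * p_bwd > tau:
            dictionary.add((u, v))
    return dictionary
```

Raising τ trades dictionary recall for precision, as the footnote notes.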
Most previous work on multilingual embed-
dings only considered the bilingual case, |L| = 2.
We focus on estimating multilingual embeddings
for |L| > 2 and describe two
novel dictionary-based methods (multiCluster
and multiCCA). We then describe our base-
lines: a variant of Coulmance et al. (2015) and
Guo et al. (2016) (henceforth referred to as mul-
tiSkip),4 and the translation-invariance matrix fac-
torization method (Gardner et al., 2015).
2.1 MultiCluster
In this approach, we decompose the problem into
two simpler subproblems: E = E_embed ◦ E_cluster,
where E_cluster : L × V → C deterministically maps
words to multilingual clusters C, and E_embed : C →
R^d assigns a vector to each cluster. We use a bilin-
gual dictionary to find clusters of translationally
equivalent words, then use distributional similari-
ties of the clusters in monolingual corpora from all
languages in L to estimate an embedding for each
cluster. By forcing words from different languages
3To do this, we align the corpus using fast align
(Dyer et al., 2013) in both directions. The es-
timated parameters of the word translation dis-
tributions are used to select pairs: D_{m,n} =
{(u, v) | u ∈ V^m, v ∈ V^n, p_{m|n}(u | v) × p_{n|m}(v | u) > τ},
where the threshold τ trades off dictionary recall and preci-
sion. We fixed τ = 0.1 early on based on manual inspection
of the resulting dictionaries.
4We developed multiSkip independently of
Coulmance et al. (2015) and Guo et al. (2016). One impor-
tant distinction is that multiSkip is only trained on parallel
corpora, while Coulmance et al. (2015) and Guo et al. (2016)
also use monolingual corpora.
in a cluster to share the same embedding, we cre-
ate anchor points in the vector space to bridge lan-
guages.
More specifically, we define the clusters as the
connected components in a graph where nodes are
(language, surface form) pairs and edges corre-
spond to translation entries in Dm,n. We assign ar-
bitrary IDs to the clusters and replace each word
token in each monolingual corpus with the corre-
sponding cluster ID, and concatenate all modified
corpora. The resulting corpus consists of multilin-
gual cluster ID sequences. We can then apply any
monolingual embedding estimator; here, we use
the skipgram model from Mikolov et al. (2013a).
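The clustering step can be sketched with a small union-find over dictionary edges, followed by the corpus rewriting; the data layout (a dict of pairwise dictionaries, token lists per language) is an illustrative assumption.

```python
def build_clusters(dictionaries):
    """Union-find over (language, word) nodes; edges are dictionary
    entries. Returns a map from each (language, word) node to its
    connected-component (cluster) ID."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for (m, n), entries in dictionaries.items():
        for u, v in entries:
            union((m, u), (n, v))
    return {node: find(node) for node in parent}

def rewrite_corpus(tokens, lang, cluster_of):
    """Replace each token with its cluster ID (or itself if unclustered),
    producing the multilingual cluster-ID sequence fed to skipgram."""
    return [cluster_of.get((lang, w), (lang, w)) for w in tokens]
```

Concatenating the rewritten corpora of all languages and running any monolingual skipgram trainer over them yields one vector per cluster, as described above.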
2.2 MultiCCA
Our proposed method (multiCCA) extends the
bilingual embeddings of Faruqui and Dyer (2014).
First, they use monolingual corpora to train mono-
lingual embeddings for each language indepen-
dently (E^m and E^n), capturing semantic similar-
ity within each language separately. Then, us-
ing a bilingual dictionary Dm,n, they use canonical
correlation analysis (CCA) to estimate linear pro-
jections from the ranges of the monolingual em-
beddings E^m and E^n, yielding a bilingual embed-
ding E^{m,n}. The linear projections are defined by
T_{m→m,n} and T_{n→m,n} ∈ R^{d×d}; they are selected to
maximize the correlation between T_{m→m,n} E^m(u)
and T_{n→m,n} E^n(v), where (u, v) ∈ D_{m,n}. The bilin-
gual embedding is then defined as E_CCA(m, u) =
T_{m→m,n} E^m(u) (and likewise for E_CCA(n, v)).
In this work, we use a simple (in hindsight) ex-
tension to construct multilingual embeddings for
more languages. We let the vector space of the
initial (monolingual) English embeddings serve as
the multilingual vector space (since English typ-
ically offers the largest corpora and wide avail-
ability of bilingual dictionaries). We then estimate
projections from the monolingual embeddings of
the other languages into the English space.
We start by estimating, for each m ∈ L \ {en},
the two projection matrices T_{m→m,en} and
T_{en→m,en}; these are guaranteed to be non-singular.
We then define the multilingual embedding as
E_CCA(en, u) = E^en(u) for u ∈ V^en, and
E_CCA(m, v) = T_{en→m,en}^{−1} T_{m→m,en} E^m(v) for
v ∈ V^m, m ∈ L \ {en}.
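A minimal numpy sketch of the two ingredients — classical CCA on dictionary-paired vectors, and the composed projection into the English space — assuming embeddings are stored as matrix rows; this is an illustration of the construction, not the authors' implementation.

```python
import numpy as np

def cca_projections(X, Y, eps=1e-8):
    """Classical CCA. X (N x d) and Y (N x d) hold the monolingual vectors
    of N dictionary-paired words, one pair per row. Returns (A_x, A_y)
    such that corresponding columns of X@A_x and Y@A_y are maximally
    correlated, unit-variance canonical variates."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    def inv_sqrt(C):  # symmetric inverse square root, eigenvalue-clipped
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    Wx, Wy = inv_sqrt(X.T @ X), inv_sqrt(Y.T @ Y)
    U, _, Vt = np.linalg.svd(Wx @ (X.T @ Y) @ Wy)
    return Wx @ U, Wy @ Vt.T

def multicca_embed(e_m, A_m, A_en):
    """E_CCA(m, v): project e_m into the shared CCA space with A_m^T,
    then map back into the English space with the inverse of A_en^T."""
    return np.linalg.solve(A_en.T, A_m.T @ e_m)
```

When the two embedding sets are exact linear images of each other, this mapping recovers the English vector of each dictionary-paired word, which is the sense in which English serves as the shared multilingual space.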
2.3 MultiSkip
Luong et al. (2015b) proposed a method for esti-
mating bilingual embeddings which only makes
use of parallel data; it extends the skipgram model
of Mikolov et al. (2013a). The skipgram model
defines a distribution over words u that occur in
a context window (of size K) of a word v:
p(u | v) = exp(E_skipgram(m, v)^⊤ E_context(m, u)) / Σ_{u′∈V^m} exp(E_skipgram(m, v)^⊤ E_context(m, u′))
In practice, this distribution can be estimated us-
ing a noise contrastive estimation approximation
(Gutmann and Hyvärinen, 2012) while maximiz-
ing the log-likelihood:
Σ_{i∈pos(M^m)} Σ_{k∈{−K,…,−1,1,…,K}} log p(u_{i+k} | u_i)
where pos(Mm) are the indices of words in the
monolingual corpus Mm.
To establish a bilingual embedding, with a
parallel corpus P_{m,n} of source language m and
target language n, Luong et al. (2015b) estimate
conditional models of words in both source and
target positions. The source positions are se-
lected as sentential contexts (similar to monolin-
gual skipgram), and the bilingual contexts come
from aligned words. The bilingual objective is to
maximize:
Σ_{i∈m-pos(P_{m,n})} Σ_{k∈{−K,…,−1,1,…,K}} [ log p(u_{i+k} | u_i) + log p(v_{a(i)+k} | u_i) ]
+ Σ_{j∈n-pos(P_{m,n})} Σ_{k∈{−K,…,−1,1,…,K}} [ log p(v_{j+k} | v_j) + log p(u_{a(j)+k} | v_j) ]
where m-pos(P_{m,n}) and n-pos(P_{m,n}) are the in-
dices of the source and target tokens in the parallel
corpus, respectively, and a(i) and a(j) are the posi-
tions of words that align to i and j in the other language.
This method extends naturally to more than two
languages: we simply sum the bilingual objectives
of all available parallel corpora.
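For concreteness, the skipgram distribution above written as a (deliberately naive) full softmax in pure Python; real implementations approximate the normalizer with noise contrastive estimation, and the dict-of-lists embedding layout is assumed for illustration.

```python
import math

def skipgram_prob(u, v, E_skipgram, E_context, vocab):
    """p(u | v) under the full-softmax skipgram model: the score of
    context word u is the dot product of v's word vector with u's
    context vector, normalized over the whole vocabulary."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(E_skipgram[v], E_context[w])) for w in vocab}
    return scores[u] / sum(scores.values())
```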
2.4 Translation-invariance
Gardner et al. (2015) proposed that multilingual
embeddings should be translation invariant. Con-
sider a matrix X ∈ R^{|V|×|V|} which summarizes
the pointwise mutual information statistics be-
tween pairs of words in monolingual corpora,
and let UV^⊤ be a low-rank decomposition of X,
where U, V ∈ R^{|V|×d}. Now, consider another
matrix A ∈ R^{|V|×|V|} which summarizes bilin-
gual alignment frequencies in a parallel corpus.
Gardner et al. (2015) solve for a low-rank de-
composition UV^⊤ which approximates X as
well as its transformations A^⊤X, XA, and A^⊤XA,
via the following objective:
min_{U,V} ‖X − UV^⊤‖² + ‖XA − UV^⊤‖² + ‖A^⊤X − UV^⊤‖² + ‖A^⊤XA − UV^⊤‖²
The multilingual embeddings are then taken to be
the rows of the matrix U.
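Because all four terms penalize the distance of the same low-rank matrix UV^⊤ to a target, the minimizer is the rank-d truncated SVD of the average of the four targets, (X + XA + A^⊤X + A^⊤XA)/4. A dense numpy sketch of that observation (the real X and A are large and sparse, so this is an illustration only):

```python
import numpy as np

def translation_invariant_embeddings(X, A, d):
    """Rank-d minimizer of ||X-UV^T||^2 + ||XA-UV^T||^2
    + ||A^T X-UV^T||^2 + ||A^T X A-UV^T||^2, i.e. the truncated SVD
    of the average target matrix. Rows of U are the embeddings."""
    M = (X + X @ A + A.T @ X + A.T @ X @ A) / 4.0
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * s[:d], Vt[:d].T  # U and V with UV^T ~ M
```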
3 Evaluating Multilingual Embeddings
One of our contributions is to streamline the eval-
uation of multilingual embeddings. In addition to
assessing goals (i–iii) stated in §2, a good evalua-
tion metric should also (iv) show good correlation
with performance in downstream applications and
(v) be computationally efficient.
It is easy to evaluate the coverage (iii) by count-
ing the number of words covered by an embedding
function in a closed vocabulary. Intrinsic evalua-
tion metrics are generally designed to be computa-
tionally efficient (v) but may or may not meet the
goals (i, ii, iv). Although intrinsic evaluations will
never be perfect, a standard set of evaluation met-
rics will help drive research. By design, standard
(monolingual) word similarity tasks meet (i) while
cross-lingual word similarity tasks and the word
translation tasks meet (ii). We propose another
evaluation method (multiQVEC-CCA), designed to
simultaneously assess goals (i, ii). MultiQVEC-
CCA extends QVEC (Tsvetkov et al., 2015), a re-
cently proposed monolingual evaluation method,
addressing fundamental flaws and extending it to
multiple languages. To assess the degree to which
these evaluation metrics meet (iv), in §5 we per-
form a correlation analysis looking at which intrin-
sic metrics are best correlated with downstream
task performance—i.e., we evaluate the evaluation
metrics.
3.1 Word similarity
Word similarity datasets such as WordSim-353
(Agirre et al., 2009) and MEN (Bruni et al., 2014)
provide human judgments of semantic similarity.
By ranking word pairs both by the cosine similar-
ity of their vectors and by their empirical similar-
ity judgments, a rank correlation can be computed
that assesses how well the estimated vectors cap-
ture human intuitions about semantic relatedness.
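The rank correlation used for these benchmarks is commonly Spearman's ρ; a tie-free pure-Python sketch:

```python
def spearman(xs, ys):
    """Spearman's rank correlation (no tie correction): the Pearson
    correlation of the two rank sequences."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Here `xs` would hold cosine similarities of the test pairs and `ys` the human judgments.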
Some previous work on bilingual and multilin-
gual embeddings focuses on monolingual word
similarity to evaluate embeddings (e.g., Faruqui
and Dyer, 2014). This approach is limited be-
cause it cannot measure the degree to which em-
beddings from different languages are similar (ii).
For this paper, we report results on an English
word similarity task, the Stanford RW dataset
(Luong et al., 2013), as well as a combination
of several cross-lingual word similarity datasets
(Camacho-Collados et al., 2015).
3.2 Word translation
This task directly assesses the degree to which
translationally equivalent words in different lan-
guages are nearby in the embedding space.
The evaluation data consists of word pairs
which are known to be translationally equiva-
lent. The score for one word pair (l1,w1), (l2,w2)
both of which are covered by an embed-
ding E is 1 if cosine(E(l1,w1),E(l2,w2)) ≥
cosine(E(l1,w1),E(l2,w
′
2))∀w
′
2 ∈ G
l2 where Gl2
is the set of words of language l2 in the evalu-
ation dataset, and cosine is the cosine similarity
function. Otherwise, the score for this word pair
is 0. The overall score is the average score for
all word pairs covered by the embedding func-
tion. This is a variant of the method used by
Mikolov et al. (2013b) to evaluate bilingual em-
beddings.
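A sketch of this scoring procedure, assuming embeddings keyed by (language, word) pairs; the layout and names are illustrative:

```python
def word_translation_score(pairs, E, vocab_by_lang):
    """Average over covered pairs of the 0/1 score defined above:
    1 iff the gold translation is the nearest (by cosine) among the
    evaluation words of the target language."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    scores = []
    for (l1, w1), (l2, w2) in pairs:
        if (l1, w1) not in E or (l2, w2) not in E:
            continue  # uncovered pairs are skipped
        gold = cosine(E[(l1, w1)], E[(l2, w2)])
        best = max(cosine(E[(l1, w1)], E[(l2, w)])
                   for w in vocab_by_lang[l2] if (l2, w) in E)
        scores.append(1.0 if gold >= best else 0.0)
    return sum(scores) / len(scores)
```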
3.3 Correlation-based evaluation
We introduce QVEC-CCA—an intrinsic evaluation
measure of the quality of word embeddings. Our
method is an improvement of QVEC—a monolin-
gual evaluation based on alignment of embeddings
to a matrix of features extracted from a linguis-
tic resource (Tsvetkov et al., 2015). We review
QVEC, and then describe QVEC-CCA.
QVEC. The main idea behind QVEC is to quan-
tify the linguistic content of word embeddings
by maximizing the correlation with a manually-
annotated linguistic resource. Let the number of
common words in the vocabulary of the word em-
beddings and the linguistic resource be N. To
quantify the semantic content of embeddings, a se-
mantic linguistic matrix S ∈ R^{P×N} is constructed
from a semantic database, with a column vector
for each word. Each word vector is a distribu-
tion of the word over P linguistic properties, based
on annotations of the word in the database. Let
X ∈ R^{D×N} be an embedding matrix with every
row a dimension vector x ∈ R^{1×N}. D denotes the
dimensionality of word embeddings. Then, S and
X are aligned to maximize the cumulative corre-
lation between the aligned dimensions of the two
matrices. Specifically, let A ∈ {0, 1}D×P be a ma-
trix of alignments such that aij = 1 iff xi is aligned
to sj, otherwise aij = 0. If r(xi, sj) is the Pearson’s
correlation between vectors xi and sj, then QVEC
is defined as:
QVEC = max_{A: Σ_j a_ij ≤ 1} Σ_{i=1}^{D} Σ_{j=1}^{P} r(x_i, s_j) × a_ij
The constraint Σ_j a_ij ≤ 1 ensures that each dis-
tributional dimension is aligned to at most one lin-
guistic dimension.
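Because this constraint decouples across embedding dimensions, the maximization reduces to giving each dimension its best (nonnegative) correlation with any linguistic column; a pure-Python sketch, with rows of X and S as lists over the N shared words:

```python
def qvec(X, S):
    """QVEC under the one-alignment-per-embedding-dimension constraint:
    each row x_i contributes its best Pearson correlation with any
    linguistic column s_j, or 0 if every correlation is negative
    (leaving the dimension unaligned is then optimal)."""
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        va = sum((p - ma) ** 2 for p in a)
        vb = sum((q - mb) ** 2 for q in b)
        return cov / (va * vb) ** 0.5
    return sum(max(0.0, max(pearson(x, s) for s in S)) for x in X)
```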
QVEC has been shown to correlate
strongly with downstream semantic tasks
(Tsvetkov et al., 2015). However, it suffers from
two major weaknesses. First, it is not invariant to
linear transformations of the embeddings’ basis,
whereas the bases in word embeddings are gen-
erally arbitrary (Szegedy et al., 2014). Second,
a sum of correlations produces an unnormalized
score: the more dimensions in the embedding
matrix the higher the score. This precludes
comparison of models of different dimensional-
ity. QVEC-CCA simultaneously addresses both
problems.
QVEC-CCA. To measure correlation between
the embedding matrix X and the linguistic matrix
S, instead of cumulative dimension-wise correla-
tion we employ CCA. CCA finds two sets of basis
vectors, one for X⊤ and the other for S⊤, such that
the correlations between the projections of the ma-
trices onto these basis vectors are maximized. For-
mally, CCA finds a pair of basis vectors v and w
such that
QVEC-CCA = CCA(X^⊤, S^⊤) = max_{v,w} r(X^⊤v, S^⊤w)
Thus, QVEC-CCA ensures invariance to rotations
of the matrices' bases, and, since it is a single cor-
relation, it produces a normalized score in [−1, 1].
Both QVEC and QVEC-CCA rely on a matrix of
linguistic properties constructed from a man-
ually crafted linguistic resource. We extend
both methods to multilingual evaluations—
multiQVEC and multiQVEC-CCA—by construct-
ing the linguistic matrix using supersense tag
annotations for English (Miller et al., 1993),
Danish (Martínez Alonso et al., 2015;
Martínez Alonso et al., 2016) and Italian
(Montemagni et al., 2003).
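A numpy sketch of the score as defined above — the first canonical correlation between the two matrices — computed here via whitening and an SVD of the cross-covariance (one standard route, not necessarily the authors' implementation):

```python
import numpy as np

def qvec_cca(X, S, eps=1e-8):
    """QVEC-CCA: the first canonical correlation between the columns
    of X (D x N embedding matrix) and S (P x N linguistic matrix),
    computed as the top singular value of the whitened cross-covariance."""
    Xt = X.T - X.T.mean(0)   # N x D, centered
    St = S.T - S.T.mean(0)   # N x P, centered
    def inv_sqrt(C):  # symmetric inverse square root, eigenvalue-clipped
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    K = inv_sqrt(Xt.T @ Xt) @ (Xt.T @ St) @ inv_sqrt(St.T @ St)
    return np.linalg.svd(K, compute_uv=False)[0]
```

Note that the result lies in [0, 1] regardless of D and P, which is what makes models of different dimensionality comparable.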
3.4 Extrinsic tasks
In order to evaluate how useful the word embed-
dings are for a downstream task, we use the em-
bedding vector as a dense feature representation
of each word in the input, and deliberately remove
any other feature available for this word (e.g., pre-
fixes, suffixes, part-of-speech). For each task, we
train one model on the aggregate training data
available for several languages, and evaluate on
the aggregate evaluation data in the same set of
languages. We apply this for multilingual doc-
ument classification and multilingual dependency
parsing.
For document classification, we follow
Klementiev et al. (2012) in using the RCV corpus
of newswire text, and train a classifier which
differentiates between four topics. While most
previous work used this data only in a bilingual
setup, we simultaneously train the classifier on
documents in seven languages,5 and evaluate on
the development/test section of those languages.
For this task, we report the average classification
accuracy on the test set.
For dependency parsing, we train the stack-
LSTM parser of Dyer et al. (2015) on a subset of
the languages in the universal dependencies v1.1,6
and test on the same languages, reporting unla-
beled attachment scores. We remove all part-of-
speech and morphology features from the data,
and prevent the model from optimizing the word
embeddings used to represent each word in the
corpus, thereby forcing the parser to rely com-
pletely on the provided (pretrained) embeddings
as the token representation. Although omitting
other features (e.g., parts of speech) hurts the per-
formance of the parser, it emphasizes the contribu-
tion of the word embeddings being studied.
5Danish, German, English, Spanish, French, Italian and
Swedish.
6http://hdl.handle.net/11234/LRT-1478
4 Evaluation Portal
In order to facilitate future research on multilin-
gual word embeddings, we developed a web portal
to enable researchers who develop new estimation
methods to evaluate them using a suite of evalua-
tion tasks. The portal serves the following purposes:
• Download the monolingual and bilingual data
we used to estimate multilingual embeddings in
this paper,
• Download standard development/test data sets
for each of the evaluation metrics to help re-
searchers working in this area report trustwor-
thy and replicable results,7
• Upload arbitrary multilingual embeddings, scan
which languages are covered by the embed-
dings, allow the user to pick among the com-
patible evaluation tasks, and receive evaluation
scores for the selected tasks, and
• Register a new evaluation data set or a new eval-
uation metric via the github repository which
mirrors the backend of the web portal.
5 Experiments
Our experiments are designed to show two pri-
mary sets of results: (i) how well the proposed
intrinsic evaluation metrics correlate with down-
stream tasks (§5.1) and (ii) which estimation meth-
ods work best according to each metric (§5.2). The
data used for training and evaluation are available
for download on the evaluation portal.
5.1 Correlations between intrinsic vs.
extrinsic evaluation metrics
In this experiment, we consider four intrinsic
evaluation metrics (cross-lingual word similar-
ity, word translation, multiQVEC and multiQVEC-
CCA) and two extrinsic evaluation metrics (mul-
tilingual document classification and multilingual
parsing).
Data: For the cross-lingual word similarity task,
we use disjoint subsets of the en-it MWS353
dataset (Leviant and Reichart, 2015) for develop-
ment (308 word pairs) and testing (307 word
pairs). For the word translation task, we use
Wiktionary to extract a development set (647
translations) and a test set (647 translations) of
translationally-equivalent word pairs in en-it, en-
da and da-it. For both multiQVEC and multiQVEC-
CCA, we used disjoint subsets of the multilingual
(en, da, it) supersense tag annotations described
in §3 for development (12,513 types) and testing
(12,512 types).
7Except for the original RCV documents, which are re-
stricted by the Reuters license and cannot be republished. All
other data is available for download.
(→) extrinsic task document dependency
(↓) intrinsic metric classification parsing
word similarity 0.386 0.007
word translation 0.066 -0.292
multiQVEC 0.635 0.444
multiQVEC-CCA 0.896 0.273
Table 1: Correlations between intrinsic evaluation
metrics (rows) and downstream task performance
(columns).
For the document classification task, we use the
multilingual RCV corpus (en, it, da). For the de-
pendency parsing task, we use the universal de-
pendencies v1.1 (Agić et al., 2015) in three lan-
guages (en, da, it).
Setup: To estimate correlations between the
proposed intrinsic evaluation metrics and down-
stream task performance, we train a total of 17
different multilingual embeddings for three lan-
guages (English, Italian and Danish). To com-
pute the correlations, we evaluate each of the 17
embeddings (12 multiCluster embeddings, 1 mul-
tiCCA embedding, 1 multiSkip embedding, 2
translation-invariance embeddings) according to
each of the six evaluation metrics (4 intrinsic, 2
extrinsic).8
Results: Table 1 shows Pearson’s correlation co-
efficients of eight (intrinsic metric, extrinsic met-
ric) pairs. Although each of the two proposed meth-
ods, multiQVEC and multiQVEC-CCA, correlates bet-
ter with a different extrinsic task, we establish (i)
that intrinsic methods previously used in the litera-
ture (cross-lingual word similarity and word trans-
lation) correlate poorly with downstream tasks,
and (ii) that the intrinsic methods proposed in this
paper (multiQVEC and multiQVEC-CCA) correlate
better with both downstream tasks, compared to
cross-lingual word similarity and word transla-
tion.9
8The 102 (17 × 6) values used to compute Pearson’s cor-
relation coefficient are provided in the supplementary mate-
rial.
9Although supersense annotations exist for other lan-
guages, the annotations are inconsistent across languages and
may not be publicly available, which is a disadvantage of the
multiQVEC and multiQVEC-CCA metrics. Therefore, we rec-
ommend that future multilingual supersense annotation ef-
forts use the same set of supersense tags used in other lan-
guages. If the word embeddings are primarily needed for en-
coding syntactic information, one could use tag dictionaries
based on the universal POS tag set (Petrov et al., 2012) in-
stead of supersense tags.
Task multiCluster multiCCA
dependency parsing 48.4 [72.1] 48.8 [69.3]
doc. classification 90.3 [52.3] 91.6 [52.6]
mono. wordsim 14.9 [71.0] 43.0 [71.0]
cross. wordsim 12.8 [78.2] 66.8 [78.2]
word translation 30.0 [38.9] 83.6 [31.8]
mono. QVEC 7.6 [99.6] 10.7 [99.0]
multiQVEC 8.3 [86.4] 8.7 [87.0]
mono. QVEC-CCA 53.8 [99.6] 63.4 [99.0]
multiQVEC-CCA 37.4 [86.4] 42.0 [87.0]
Table 2: Results for multilingual embeddings that
cover 59 languages. Each row corresponds to
one of the embedding evaluation metrics we use
(higher is better). Each column corresponds to
one of the embedding estimation methods we con-
sider; i.e., numbers in the same row are compa-
rable. Numbers in square brackets are coverage
percentages.
5.2 Evaluating multilingual estimation
methods
We now turn to evaluating the four estimation
methods described in §2. We use the proposed
methods (i.e., multiCluster and multiCCA) to
train multilingual embeddings in 59 languages for
which bilingual translation dictionaries are avail-
able.10 In order to compare our methods to base-
lines which use parallel data (i.e., multiSkip and
translation-invariance), we also train multilingual
embeddings in a smaller set of 12 languages for
which high-quality parallel data are available.11
Training data: We use Europarl en-xx parallel
data for the set of 12 languages. We obtain en-xx
bilingual dictionaries from two different sources.
For the set of 12 languages, we extract the bilin-
gual dictionaries from the Europarl parallel cor-
pora. For the remaining 47 languages, dictio-
naries were formed by translating the 20k most
common words in the English monolingual corpus
with Google Translate, ignoring translation pairs
with identical surface forms and multi-word trans-
lations.
Evaluation data: Monolingual word similarity
uses the MEN dataset in Bruni et al. (2014) as
10The 59-language set is { bg, cs, da, de, el, en, es, fi, fr,
hu, it, sv, zh, af, ca, iw, cy, ar, ga, zu, et, gl, id, ru, nl, pt, la, tr,
ne, lv, lt, tg, ro, is, pl, yi, be, hy, hr, jw, ka, ht, fa, mi, bs, ja,
mg, tl, ms, uz, kk, sr, mn, ko, mk, so, uk, sl, sw }.
11The 12-language set is {bg, cs, da, de, el, en, es, fi, fr,
hu, it, sv}.
a development set and Stanford’s Rare Words
dataset in Luong et al. (2013) as a test set. For
the cross-lingual word similarity task, we aggre-
gate the RG-65 datasets in six language pairs (fr-
es, fr-de, en-fr, en-es, en-de, de-es). For the
word translation task, we use Wiktionary to ex-
tract translationally-equivalent word pairs to eval-
uate multilingual embeddings for the set of 12 lan-
guages. Since Wiktionary-based translations do
not cover all 59 languages, we use Google Trans-
late to obtain en-xx bilingual dictionaries to eval-
uate the embeddings of 59 languages. For QVEC
and QVEC-CCA, we split the English supersense
annotations used in Tsvetkov et al. (2015) into a
development set and a test set. For multiQVEC and
multiQVEC-CCA, we use supersense annotations
in English, Italian and Danish. For the document
classification task, we use the multilingual RCV
corpus in seven languages (da, de, en, es, fr, it,
sv). For the dependency parsing task, we use the
universal dependencies v1.1 in twelve languages
(bg, cs, da, de, el, en, es, fi, fr, hu, it, sv).
Setup: All word embeddings in the follow-
ing results are 512-dimensional vectors. Meth-
ods which indirectly use skipgram (i.e., multi-
CCA, multiSkip, and multiCluster) are trained
using 10 epochs of stochastic gradient descent,
and use a context window of size 5. The
translation-invariance method uses a context win-
dow of size 3.12 We only estimate embeddings
for words/clusters which occur 5 times or more in
the monolingual corpora. In a postprocessing step,
all vectors are normalized to unit length. Multi-
Cluster uses a maximum cluster size of 1,000 and
10,000 for the set of 12 and 59 languages, respec-
tively. In the English tasks (monolingual word
similarity, QVEC, QVEC-CCA), skipgram embed-
dings (Mikolov et al., 2013a) and multiCCA em-
beddings give identical results (since we project
words in other languages to the English vector
space, estimated using the skipgram model). The
software used to train all embeddings as well as
the trained embeddings are available for download
on the evaluation portal.13
We note that intrinsic evaluation of word em-
beddings (e.g., word similarity) typically ignores
12Training translation-invariance embeddings with larger
context window sizes using the MATLAB implementation
provided by Gardner et al. (2015) is computationally chal-
lenging.
13URLs to software libraries on Github are redacted to
comply with the double-blind reviewing of CoNLL.
test instances which are not covered by the embed-
dings being studied. When the vocabulary used in
two sets of word embeddings is different, which
is often the case, the intrinsic evaluation score for
each set may be computed based on a different set
of test instances, which may bias the results in un-
expected ways. For instance, if one set of embed-
dings only covers frequent words while the other
set also covers infrequent words, the scores of the
first set may be inflated because frequent words
appear in many different contexts and are there-
fore easier to estimate than infrequent words. To
partially address this problem, we report the cov-
erage of each set of embeddings in square brack-
ets. When the difference in coverage is large, we
repeat the evaluation using only the intersection
of vocabularies covered by all embeddings being
evaluated. Extrinsic evaluations are immune to
this problem because the score is computed based
on all test instances regardless of the coverage.
Results [59 languages]. We train the proposed
dictionary-based estimation methods (multiClus-
ter and multiCCA) for 59 languages, and evalu-
ate the trained embeddings according to nine dif-
ferent metrics in Table 2. The results show that,
when trained on a large number of languages, mul-
tiCCA consistently outperforms multiCluster ac-
cording to all evaluation metrics. Note that most
differences in coverage between multiCluster and
multiCCA are relatively small.
It is worth noting that the mainstream approach
of estimating one vector representation per word
type (rather than word token) ignores the fact that
the same word may have different semantics in dif-
ferent contexts. The multiCluster method exacer-
bates this problem by estimating one vector repre-
sentation per cluster of translationally equivalent
words. The added semantic ambiguity severely
hurts the performance of multiCluster with 59 lan-
guages, but it is still competitive with 12 languages
(see below).
Results [12 languages]. We compare the pro-
posed dictionary-based estimation methods to par-
allel text-based methods in Table 3. The ranking
of the four estimation methods is not consistent
across all evaluation metrics. This is unsurprising
since each metric evaluates different traits of word
embeddings, as detailed in §3. However, some pat-
terns are worth noting in Table 3.
In five of the evaluations (including both ex-
Task multiCluster multiCCA multiSkip invariance
extrinsic
metrics
dependency parsing 61.0 [70.9] 58.7 [69.3] 57.7 [68.9] 59.8 [68.6]
document classification 92.1 [48.1] 92.1 [62.8] 90.4 [45.7] 91.1 [31.3]
intrinsic
metrics
monolingual word similarity 38.0 [57.5] 43.0 [71.0] 33.9 [55.4] 51.0 [23.0]
multilingual word similarity 58.1 [74.1] 66.6 [78.2] 59.5 [67.5] 58.7 [63.0]
word translation 43.7 [45.2] 35.7 [53.2] 46.7 [39.5] 63.9 [30.3]
monolingual QVEC 10.3 [98.6] 10.7 [99.0] 8.4 [98.0] 8.1 [91.7]
multiQVEC 9.3 [82.0] 8.7 [87.0] 8.7 [87.0] 5.3 [74.7]
monolingual QVEC-CCA 62.4 [98.6] 63.4 [99.0] 58.9 [98.0] 65.8 [91.7]
multiQVEC-CCA 43.3 [82.0] 41.5 [87.0] 36.3 [75.6] 46.2 [74.7]
Table 3: Results for multilingual embeddings that cover Bulgarian, Czech, Danish, Greek, English,
Spanish, German, Finnish, French, Hungarian, Italian and Swedish. Each row corresponds to one of
the embedding evaluation metrics we use (higher is better). Each column corresponds to one of the
embedding estimation methods we consider; i.e., numbers in the same row are comparable. Numbers in
square brackets are coverage percentages.
trinsic tasks), the best performing method is a
dictionary-based one proposed in this paper. In
the remaining four (all intrinsic) evaluations, the
best performing method is the translation-invari-
ance method. MultiSkip ranks last in five evalua-
tions and never ranks first. Since our implementa-
tion of multiSkip does not make use of monolin-
gual data, it only learns from monolingual con-
texts observed in parallel corpora and misses the
opportunity to learn from contexts in the much
larger monolingual corpora. Trained for 12 lan-
guages, multiCluster is competitive in four evalu-
ations (and ranks first in three).
We note that multiCCA consistently achieves
better coverage than the translation-invariance
method. For intrinsic measures, this confounds the
performance comparison. A partial solution is to test only on word types for which all four methods have a vector, though this subset is in no sense a representative sample of the vocabulary. In this comparison (provided in the supplementary material), we find a similar pattern of results, though multiCCA outperforms the translation-invariance method on the monolingual word similarity task. Moreover, the gap between multiCCA and the translation-invariance method narrows to 0.7 in monolingual QVEC-CCA and 2.5 in multiQVEC-CCA.
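The intersection-based comparison described above can be sketched as follows. This is a minimal illustration, not the paper's released evaluation code; the function names and data layout (`methods` as a dict of word-to-vector maps, `gold` as human-scored word pairs) are our own assumptions.

```python
import numpy as np

def rankdata(x):
    # Rank transform (no tie handling), sufficient for this sketch.
    return np.argsort(np.argsort(x))

def shared_vocab_similarity(methods, gold):
    """methods: {name: {word: np.ndarray}}; gold: [(w1, w2, human_score)].
    Each method is scored only on pairs covered by *all* methods, so
    coverage differences cannot confound the correlations."""
    shared = set.intersection(*(set(emb) for emb in methods.values()))
    pairs = [(a, b, s) for a, b, s in gold if a in shared and b in shared]
    human = np.array([s for _, _, s in pairs], dtype=float)
    scores = {}
    for name, emb in methods.items():
        sims = np.array([
            emb[a] @ emb[b]
            / (np.linalg.norm(emb[a]) * np.linalg.norm(emb[b]))
            for a, b, _ in pairs])
        # Spearman correlation = Pearson correlation on ranks (no ties).
        scores[name] = np.corrcoef(rankdata(sims), rankdata(human))[0, 1]
    return scores, len(pairs)
```

As the discussion notes, the shared vocabulary is not a representative sample, so scores computed this way complement rather than replace the full-coverage numbers in Table 3.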
6 Related Work
There is a rich body of literature on bilingual em-
beddings, including work on machine translation
(Zou et al., 2013; Hermann and Blunsom, 2014;
Cho et al., 2014; Luong et al., 2015b;
Luong et al., 2015a, inter alia),14 cross-
lingual dependency parsing (Guo et al., 2015;
Guo et al., 2016), and cross-lingual docu-
ment classification (Klementiev et al., 2012;
Gouws et al., 2014; Kociskỳ et al., 2014).
Al-Rfou’ et al. (2013) trained word embeddings
for more than 100 languages, but the embeddings
of each language are trained independently (i.e.,
embeddings of words in different languages do
not share the same vector space). Word clusters are a related form of distributional representation, and cross-lingual clusterings have been proposed as well (Och, 1999; Täckström et al., 2012). Haghighi et al. (2008)
used CCA to learn bilingual lexicons from
monolingual corpora.
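The CCA machinery shared by Haghighi et al.'s approach and our multiCCA method can be illustrated with a minimal plain-numpy sketch. This is not the paper's released code; the function name, the regularization term, and the whitening-plus-SVD formulation are our own assumptions about a standard linear CCA.

```python
import numpy as np

def cca_projections(X, Y, k, reg=1e-8):
    """Linear CCA. Rows of X and Y are embeddings of dictionary-aligned
    word pairs (one language per matrix). Returns projection matrices
    A (d1 x k) and B (d2 x k) that map each space into a shared space
    where the projected pairs are maximally correlated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, _, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k]
```

In a multiCCA-style setting, one would fit such projections between each language's embeddings and a pivot language's embeddings using dictionary pairs, then apply them to the full vocabularies.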
7 Conclusion
We proposed two dictionary-based estimation
methods for multilingual word embeddings, mul-
tiCCA and multiCluster, and used them to train
embeddings for 59 languages. We characterized important shortcomings of QVEC, a metric previously used to evaluate monolingual embeddings, and proposed an improved metric, multiQVEC-CCA.
Both multiQVEC and multiQVEC-CCA obtain bet-
ter correlations with downstream tasks compared
to intrinsic methods previously used in the literature. Finally, to help future research in this area, we created a web portal where users can upload their multilingual embeddings and easily evaluate them on nine evaluation metrics, with two modes of operation (development and test) to encourage sound experimentation practices.

14Hermann and Blunsom (2014) showed that the bicvm method can be extended to more than two languages, but the released software library only supports bilingual embeddings.
Acknowledgments
Waleed Ammar is supported by the Google fellow-
ship in natural language processing. Part of this
material is based upon work supported by a sub-
contract with Raytheon BBN Technologies Corp.
under DARPA Prime Contract No. HR0011-15-
C-0013. This work was supported in part by the
National Science Foundation through award IIS-
1526745. We thank Manaal Faruqui, Wang Ling,
Kazuya Kawakami, Matt Gardner, Benjamin Wil-
son and the anonymous reviewers of the NW-NLP
workshop for helpful comments. We are also
grateful to Héctor Martı́nez Alonso for his help
with Danish resources.
References
[Agić et al.2015] Željko Agić, Maria Jesus Aranzabe,
Aitziber Atutxa, Cristina Bosco, Jinho Choi, Marie-
Catherine de Marneffe, Timothy Dozat, Richárd
Farkas, Jennifer Foster, Filip Ginter, Iakes Goe-
naga, Koldo Gojenola, Yoav Goldberg, Jan Hajič,
Anders Trærup Johannsen, Jenna Kanerva, Juha
Kuokkala, Veronika Laippala, Alessandro Lenci,
Krister Lindén, Nikola Ljubešić, Teresa Lynn,
Christopher Manning, Héctor Martı́nez Alonso,
Ryan McDonald, Anna Missilä, Simonetta Monte-
magni, Joakim Nivre, Hanna Nurmi, Petya Osen-
ova, Slav Petrov, Jussi Piitulainen, Barbara Plank,
Prokopis Prokopidis, Sampo Pyysalo, Wolfgang
Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi,
Kiril Simov, Aaron Smith, Reut Tsarfaty, Veronika
Vincze, and Daniel Zeman. 2015. Universal de-
pendencies 1.1. LINDAT/CLARIN digital library at
Institute of Formal and Applied Linguistics, Charles
University in Prague.
[Agirre et al.2009] Eneko Agirre, Enrique Alfonseca,
Keith Hall, Jana Kravalova, Marius Paşca, and Aitor
Soroa. 2009. A study on similarity and relatedness
using distributional and WordNet-based approaches.
In Proc. of NAACL, pages 19–27.
[Al-Rfou’ et al.2013] Rami Al-Rfou’, Bryan Perozzi,
and Steven Skiena. 2013. Polyglot: Distributed
word representations for multilingual NLP. In
Proc. of CoNLL.
[Bruni et al.2014] Elia Bruni, Nam-Khanh Tran, and
Marco Baroni. 2014. Multimodal distributional se-
mantics. JAIR.
[Camacho-Collados et al.2015] José Camacho-
Collados, Mohammad Taher Pilehvar, and Roberto
Navigli. 2015. A framework for the construction
of monolingual and cross-lingual word similarity
datasets. In Proc. of ACL.
[Cho et al.2014] Kyunghyun Cho, Bart
Van Merriënboer, Caglar Gulcehre, Dzmitry
Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP.
[Coulmance et al.2015] Jocelyn Coulmance, Jean-Marc
Marty, Guillaume Wenzek, and Amine Benhal-
loum. 2015. Trans-gram, fast cross-lingual word-
embeddings. In Proc. of EMNLP.
[Dyer et al.2013] Chris Dyer, Victor Chahuneau, and
Noah A. Smith. 2013. A simple, fast, and effec-
tive reparameterization of IBM Model 2. In Proc. of
NAACL.
[Dyer et al.2015] Chris Dyer, Miguel Ballesteros,
Wang Ling, Austin Matthews, and Noah A Smith.
2015. Transition-based dependency parsing with
stack long short-term memory. In Proc. of ACL.
[Faruqui and Dyer2014] Manaal Faruqui and Chris
Dyer. 2014. Improving vector space word repre-
sentations using multilingual correlation. Proc. of
EACL.
[Gardner et al.2015] Matt Gardner, Kejun Huang,
Evangelos Papalexakis, Xiao Fu, Partha Talukdar,
Christos Faloutsos, Nicholas Sidiropoulos, and
Tom Mitchell. 2015. Translation invariant word
embeddings. In Proc. of EMNLP.
[Gouws et al.2014] Stephan Gouws, Yoshua Bengio,
and Greg Corrado. 2014. BilBOWA: Fast bilingual
distributed representations without word alignments.
arXiv preprint arXiv:1410.2455.
[Guo et al.2015] Jiang Guo, Wanxiang Che, David
Yarowsky, Haifeng Wang, and Ting Liu. 2015.
Cross-lingual dependency parsing based on dis-
tributed representations. In Proc. of ACL.
[Guo et al.2016] Jiang Guo, Wanxiang Che, David
Yarowsky, Haifeng Wang, and Ting Liu. 2016. A
representation learning framework for multi-source
transfer parsing. In Proc. of AAAI.
[Gutmann and Hyvärinen2012] Michael U Gutmann
and Aapo Hyvärinen. 2012. Noise-contrastive esti-
mation of unnormalized statistical models, with ap-
plications to natural image statistics. JMLR.
[Haghighi et al.2008] Aria Haghighi, Percy Liang, Tay-
lor Berg-Kirkpatrick, and Dan Klein. 2008. Learn-
ing bilingual lexicons from monolingual corpora. In
Proc. of ACL.
[Hermann and Blunsom2014] Karl Moritz Hermann
and Phil Blunsom. 2014. Multilingual Models for
Compositional Distributional Semantics. In Proc. of
ACL.
[Klementiev et al.2012] Alexandre Klementiev, Ivan
Titov, and Binod Bhattarai. 2012. Inducing
crosslingual distributed representations of words. In
Proc. of COLING.
[Kociskỳ et al.2014] Tomáš Kociskỳ, Karl Moritz Her-
mann, and Phil Blunsom. 2014. Learning bilingual
word representations by marginalizing alignments.
arXiv preprint arXiv:1405.0947.
[Leviant and Reichart2015] Ira Leviant and Roi Re-
ichart. 2015. Judgment language matters: Towards
judgment language informed vector space modeling.
arXiv preprint arXiv:1508.00106.
[Luong et al.2013] Minh-Thang Luong, Richard
Socher, and Christopher D. Manning. 2013. Better
word representations with recursive neural networks
for morphology. In Proc. of CoNLL.
[Luong et al.2015a] Minh-Thang Luong, Ilya
Sutskever, Quoc V Le, Oriol Vinyals, and Wo-
jciech Zaremba. 2015a. Addressing the rare word
problem in neural machine translation. In Proc. of
ACL.
[Luong et al.2015b] Thang Luong, Hieu Pham, and
Christopher D Manning. 2015b. Bilingual word
representations with monolingual quality in mind.
In Proc. of NAACL.
[Martı́nez Alonso et al.2015] Héctor Martı́nez Alonso,
Anders Johannsen, Sussi Olsen, Sanni Nimb,
Nicolai Hartvig Sørensen, Anna Braasch, Anders
Søgaard, and Bolette Sandford Pedersen. 2015. Su-
persense tagging for Danish. In Proc. of NODAL-
IDA, page 21.
[Martı́nez Alonso et al.2016] Héctor Martı́nez Alonso,
Anders Johannsen, Sussi Olsen, Sanni Nimb, and
Bolette Sandford Pedersen. 2016. An empirically
grounded expansion of the supersense inventory. In
Proc. of the Global Wordnet Conference.
[McDonald et al.2011] Ryan McDonald, Slav Petrov,
and Keith Hall. 2011. Multi-source transfer
of delexicalized dependency parsers. In Proc. of
EMNLP, pages 62–72.
[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg
Corrado, and Jeffrey Dean. 2013a. Efficient esti-
mation of word representations in vector space. In
Proc. of ICLR.
[Mikolov et al.2013b] Tomas Mikolov, Quoc V. Le, and
Ilya Sutskever. 2013b. Exploiting similarities
among languages for machine translation. arXiv
preprint arXiv:1309.4168v1.
[Miller et al.1993] George A. Miller, Claudia Leacock,
Randee Tengi, and Ross T. Bunker. 1993. A seman-
tic concordance. In Proc. of HLT, pages 303–308.
[Montemagni et al.2003] Simonetta Montemagni,
Francesco Barsotti, Marco Battista, Nicoletta
Calzolari, Ornella Corazzari, Alessandro Lenci,
Antonio Zampolli, Francesca Fanciulli, Maria Mas-
setani, Remo Raffaelli, et al. 2003. Building the
Italian syntactic-semantic treebank. In Treebanks,
pages 189–210. Springer.
[Och1999] Franz Josef Och. 1999. An efficient
method for determining bilingual word classes. In
Proc. of EACL.
[Petrov et al.2012] Slav Petrov, Dipanjan Das, and
Ryan McDonald. 2012. A universal part-of-speech
tagset. In Proc. of LREC.
[Szegedy et al.2014] Christian Szegedy, Wojciech
Zaremba, Ilya Sutskever, Joan Bruna, Dumitru
Erhan, Ian Goodfellow, and Rob Fergus. 2014.
Intriguing properties of neural networks. In Proc. of
ICLR.
[Täckström et al.2012] Oscar Täckström, Ryan Mc-
Donald, and Jakob Uszkoreit. 2012. Cross-lingual
word clusters for direct transfer of linguistic struc-
ture. In Proc. of NAACL, pages 477–487.
[Tsvetkov et al.2014] Yulia Tsvetkov, Leonid Boytsov,
Anatole Gershman, Eric Nyberg, and Chris Dyer.
2014. Metaphor detection with cross-lingual model
transfer. In Proc. of ACL.
[Tsvetkov et al.2015] Yulia Tsvetkov, Manaal Faruqui,
Wang Ling, Guillaume Lample, and Chris Dyer.
2015. Evaluation of word vector representations by
subspace alignment. In Proc. of EMNLP.
[Zeman and Resnik2008] Daniel Zeman and Philip
Resnik. 2008. Cross-language parser adaptation be-
tween related languages. In Proc. of IJCNLP, pages
35–42.
[Zou et al.2013] Will Y Zou, Richard Socher, Daniel M
Cer, and Christopher D Manning. 2013. Bilingual
word embeddings for phrase-based machine transla-
tion. In Proc. of EMNLP, pages 1393–1398.