Natural Language Processing with Small Feed-Forward Networks
Jan A. Botha Emily Pitler Ji Ma Anton Bakalov
Alex Salcianu David Weiss Ryan McDonald Slav Petrov
Google Inc.
Mountain View, CA
{jabot,epitler,maji,abakalov,salcianu,djweiss,ryanmcd,slav}@google.com
Abstract
We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.
1 Introduction
Deep and recurrent neural networks with large network capacity have become increasingly accurate for challenging language processing tasks. For example, machine translation models have been able to attain impressive accuracies, with models that use hundreds of millions (Bahdanau et al., 2014; Wu et al., 2016) or billions (Shazeer et al., 2017) of parameters. These models, however, may not be feasible in all computational settings. In particular, models running on mobile devices are often constrained in terms of memory and computation.
Long Short-Term Memory (LSTM) models (Hochreiter and Schmidhuber, 1997) have achieved good results with small memory footprints by using character-based input representations: e.g., the part-of-speech tagging models of Gillick et al. (2016) have only roughly 900,000 parameters. Latency, however, can still be an issue with LSTMs, due to the large number of matrix multiplications they require (eight per LSTM cell): Kim and Rush (2016) report speeds of only 8.8 words/second when running a two-layer LSTM translation system on an Android phone.
Feed-forward neural networks have the potential to be much faster. In this paper, we show that small feed-forward networks can achieve results at or near the state of the art on a variety of natural language processing tasks, with an order of magnitude speedup over an LSTM-based approach.
We begin by introducing the network model structure and the character-based representations we use throughout all tasks (§2). The four tasks that we address are: language identification (Lang-ID), part-of-speech (POS) tagging, word segmentation, and preordering for translation. In order to use feed-forward networks for structured prediction tasks, we use transition systems (Titov and Henderson, 2007, 2010) with feature embeddings as proposed by Chen and Manning (2014), and introduce two novel transition systems for the last two tasks. We focus on budgeted models and ablate four techniques (one on each task) for improving accuracy for a given memory budget:

1. Quantization: Using more dimensions and less precision (Lang-ID: §3.1).
2. Word clusters: Reducing the network size to allow for word clusters and derived features (POS tagging: §3.2).
3. Selected features: Adding explicit feature conjunctions (segmentation: §3.3).
4. Pipelines: Introducing another task in a pipeline and allocating parameters to the auxiliary task instead (preordering: §3.4).

We achieve results at or near state-of-the-art with small (< 3 MB) models on all four tasks.
2 Small Feed-Forward Network Models
The network architectures are designed to limit the memory and runtime of the model. Figure 1 illustrates the model architecture:
1. Discrete features are organized into groups (e.g., E_bigrams), with one embedding matrix E_g ∈ R^{V_g × D_g} per group.
2. The embeddings of the features extracted for each group are reshaped into a single vector and concatenated to define the output of the embedding layer as h_0 = [X_g E_g | ∀g].
3. A single hidden layer, h_1, with M rectified linear units (Nair and Hinton, 2010) is fully connected to h_0.
4. A softmax function models the probability of an output class y: P(y) ∝ exp(β_y^T h_1 + b_y), where β_y ∈ R^M and b_y are the weight vector and bias, respectively.
Memory needs are dominated by the embedding matrix sizes (Σ_g V_g D_g, where V_g and D_g are the vocabulary size and dimensionality, respectively, for each feature group g), while runtime is strongly influenced by the hidden layer dimensions.
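The following is a minimal numpy sketch of this forward pass. The function names, the assumption of a fixed number of features per group, and all dimensions are illustrative, not the authors' implementation.

```python
import numpy as np

def forward(feature_ids, embeddings, W1, b1, beta, b_out):
    """Forward pass sketch: grouped embedding lookup, one ReLU layer, softmax.

    feature_ids: dict mapping group name -> list of feature ids for that group
                 (a fixed number of features per group is assumed here)
    embeddings:  dict mapping group name -> embedding matrix of shape (V_g, D_g)
    """
    # Embedding layer: look up and flatten each group, then concatenate (h_0).
    h0 = np.concatenate([embeddings[g][ids].reshape(-1)
                         for g, ids in feature_ids.items()])

    # Single hidden layer of rectified linear units (h_1).
    h1 = np.maximum(0.0, W1 @ h0 + b1)

    # Softmax over output classes: P(y) proportional to exp(beta_y^T h_1 + b_y).
    logits = beta @ h1 + b_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```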
Hashed Character n-grams  Previous applications of this network structure used (pretrained) word embeddings to represent words (Chen and Manning, 2014; Weiss et al., 2015). However, for word embeddings to be effective, they usually need to cover large vocabularies (100,000+) and dimensions (50+). Inspired by the success of character-based representations (Ling et al., 2015), we use features defined over character n-grams instead of relying on word embeddings, and learn their embeddings from scratch.
We use a distinct feature group g for each n-gram length N, and control the size V_g directly by applying random feature mixing (Ganchev and Dredze, 2008). That is, we define the feature value v for an n-gram string x as v = H(x) mod V_g, where H is a well-behaved hash function. Typical values for V_g are in the 100–5000 range, which is far smaller than the exponential number of unique raw n-grams. A consequence of these small feature vocabularies is that we can also use small feature embeddings, typically D_g = 16.
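As an illustration, the hashed feature ids for one n-gram length can be computed as below; zlib.crc32 stands in for the unspecified hash function H, and omitting any padding or word-boundary markers is a simplification.

```python
import zlib

def ngram_feature_ids(text, n, V_g):
    """Map each character n-gram of `text` to a small hashed vocabulary,
    using random feature mixing: v = H(x) mod V_g."""
    return [zlib.crc32(text[i:i + n].encode("utf-8")) % V_g
            for i in range(len(text) - n + 1)]

# e.g. trigram feature ids for "queue" with a 5000-entry vocabulary
ids = ngram_feature_ids("queue", n=3, V_g=5000)
```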
Quantization  A commonly used strategy for compressing neural networks is quantization: using less precision to store parameters (Han et al., 2015). We compress the embedding weights (the vast majority of the parameters for these shallow models) by storing scale factors for each embedding (details in the supplementary material).
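A minimal sketch of one way such a scheme could look, assuming symmetric 8-bit quantization with one scale factor per embedding row; the paper defers the actual details to its supplementary material, so this is illustrative only.

```python
import numpy as np

def quantize_embeddings(E, bits=8):
    """Per-row linear quantization: int8 codes plus one float scale per row.

    Assumed scheme for illustration; the paper only states that a scale
    factor is stored for each embedding.
    """
    max_code = 2 ** (bits - 1) - 1                       # 127 for 8 bits
    scales = np.abs(E).max(axis=1, keepdims=True) / max_code
    scales[scales == 0] = 1.0                            # guard all-zero rows
    codes = np.round(E / scales).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_row(codes, scales, i):
    """On-the-fly dequantization of embedding row i at inference time."""
    return codes[i].astype(np.float32) * scales[i]
```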
In §3.1, we contrast devoting model size to higher precision and lower dimensionality versus lower precision and more network dimensions.

Figure 1: An example network structure for a model using bigrams of the previous, current and next word, and trigrams of the current word. Does not illustrate hashing.
Training  Our objective function combines the cross-entropy loss for model predictions relative to the ground truth with L2 regularization of the biases and hidden layer weights. For optimization, we use mini-batched averaged stochastic gradient descent with momentum (Bottou, 2010; Hinton, 2012) and exponentially decaying learning rates. The mini-batch size is fixed to 32 and we perform a grid search for the other hyperparameters, tuning against the task-specific evaluation metric on held-out data, with early stopping. Full feature templates and optimal hyperparameter settings are given in the supplementary material.
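For concreteness, the per-example objective can be sketched as below; the regularization coefficient, like the learning rate, momentum, and decay schedule, is a hyperparameter found by grid search, so the value here is a placeholder.

```python
import numpy as np

def example_loss(probs, gold, W1, b1, b_out, l2=1e-4):
    """Cross-entropy for the gold class plus L2 on the biases and
    hidden-layer weights (the l2 coefficient is a placeholder)."""
    xent = -np.log(probs[gold] + 1e-12)
    reg = l2 * (np.sum(W1 ** 2) + np.sum(b1 ** 2) + np.sum(b_out ** 2))
    return xent + reg
```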
3 Experiments
We experiment with small feed-forward networks for four diverse NLP tasks: language identification, part-of-speech tagging, word segmentation, and preordering for statistical machine translation.
Evaluation Metrics  In addition to standard task-specific quality metrics, our evaluations also consider model size and computational cost. We skirt implementation details by calculating size as the number of kilobytes (1 KB = 1024 bytes) needed to represent all model parameters and resources. We approximate the computational cost as the number of floating-point operations (FLOPs) performed for one forward pass through the network given an embedding vector h_0. This cost is dominated by the matrix multiplications to compute (unscaled) activation unit values, hence our metric excludes the non-linearities and softmax normalization, but still accounts for the final layer logits. To ground this metric, we also provide indicative absolute speeds for each task, as measured on a modern workstation CPU (3.50 GHz Intel Xeon E5-1650 v3).
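A small helper makes the FLOPs metric concrete; the exact accounting (for example, whether bias additions are counted) is an assumption on our part, but the two matrix multiplications dominate either way.

```python
def forward_flops(h0_dim, hidden_dim, num_classes):
    """Approximate FLOPs for one forward pass given the embedding vector h_0:
    multiply-adds for the hidden layer and for the final-layer logits.
    Non-linearities and softmax normalization are excluded."""
    hidden_layer = 2 * h0_dim * hidden_dim        # W1 @ h0
    output_logits = 2 * hidden_dim * num_classes  # beta @ h1
    return hidden_layer + output_logits
```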
3.1 Language Identification
Recent shared tasks on code-switching (Molina et al., 2016) and dialects (Malmasi et al., 2016) have generated renewed interest in language identification. We restrict our focus to single language identification across diverse languages, and compare to the work of Baldwin and Lui (2010) on predicting the language of Wikipedia text in 66 languages. For this task, we obtain the input h_0 by separately averaging the embeddings for each n-gram length (N = [1, 4]), as summation did not produce good results.
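A sketch of the Lang-ID input layer under these choices, reusing the hypothetical ngram_feature_ids helper from §2: the embeddings of all n-grams of each length are averaged, and the four averaged vectors are concatenated to form h_0.

```python
import numpy as np

def langid_h0(text, embeddings, vocab_sizes):
    """Average the embeddings of all n-grams of each length N in [1, 4],
    then concatenate the per-length averages to form h_0."""
    parts = []
    for n in range(1, 5):
        ids = ngram_feature_ids(text, n, vocab_sizes[n])  # hashed ids (see §2 sketch)
        parts.append(embeddings[n][ids].mean(axis=0))     # (D_g,)
    return np.concatenate(parts)
```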
Table 1 shows that we outperform the low-memory nearest-prototype model of Baldwin and Lui (2010). Their nearest neighbor model is the most accurate, but its memory scales linearly with the size of the training data.
Moreover, we can apply quantization to the embedding matrix without hurting prediction accuracy: it is better to use less precision for each dimension, but to use more dimensions. Our subsequent models all use quantization. There is no noticeable variation in processing speed when performing dequantization on-the-fly at inference time. Our 16-dim Lang-ID model runs at 4450 documents/second (5.6 MB of text per second) on the preprocessed Wikipedia dataset.
Relationship to Compact Language Detector  These techniques back the open-source Compact Language Detector v3 (CLD3)1 that runs in Google Chrome browsers.2 Our experimental Lang-ID model uses the same overall architecture as CLD3, but uses a simpler feature set, less involved preprocessing, and covers fewer languages.
3.2 POS Tagging
We apply our model as an unstructured classifier to predict a POS tag for each token independently, and compare its performance to that of the byte-to-span (BTS) model (Gillick et al., 2016). BTS is a 4-layer LSTM network that maps a sequence of bytes to a sequence of labeled spans, such as tokens and their POS tags.
1 github.com/google/cld3
2 As of the date of this writing in 2017.
Model Micro F1 Size
Baldwin and Lui (2010): NN 90.2 -
Baldwin and Lui (2010): NP 87.0 -
Small FF, 6 dim 87.3 334 KB
Small FF, 16 dim 88.0 800 KB
Small FF, 16 dim, quantized 88.0 302 KB
Table 1: Language Identification. Quantization allows trading numerical precision for larger embeddings. The two models from Baldwin and Lui (2010) are the nearest neighbor (NN) and nearest prototype (NP) approaches.
Model Acc. Wts. MB Ops.
Gillick et al. (2016) 95.06 900k - 6.63m
Small FF 94.76 241k 0.6 0.27m
+Clusters 95.56 261k 1.0 0.31m
½ Dim. 95.39 143k 0.7 0.18m
Table 2: POS tagging. Embedded word clusters improve accuracy and allow the use of smaller embedding dimensions.
Both approaches limit model size by using small input vocabularies: byte values in the case of BTS, and hashed character n-grams and (optionally) cluster ids in our case.
Bloom Mapped Word Clusters  It is well known that word clusters can be powerful features in linear models for a variety of tasks (Koo et al., 2008; Turian et al., 2010). Here, we show that they can also be useful in neural network models. However, naively introducing word cluster features drastically increases the amount of memory required, as a word-to-cluster mapping file with hundreds of thousands of entries can be several megabytes on its own.3 By representing word clusters with a Bloom map (Talbot and Talbot, 2008), a key-value based generalization of Bloom filters, we can reduce the space required by a factor of ~15 and use 300 KB to (approximately) represent the clusters for 250,000 word types.
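The sketch below is not the Bloom map of Talbot and Talbot (2008); it is a much simpler fingerprint table, offered only to illustrate the underlying trade-off of an approximate, fixed-size word-to-cluster store that can occasionally return a wrong value in exchange for a far smaller memory footprint. All names and sizes are hypothetical.

```python
import zlib

class ApproxClusterMap:
    """Toy approximate word -> cluster-id map (NOT a Bloom map): a small
    fixed-size table of (fingerprint, value) cells keyed by a hash."""

    def __init__(self, num_slots):
        self.fingerprints = [0] * num_slots
        self.values = [0] * num_slots

    def _slot_and_fp(self, word):
        h = zlib.crc32(word.encode("utf-8"))
        return h % len(self.values), 1 + ((h >> 16) & 0xFF)  # non-zero fingerprint

    def put(self, word, cluster_id):
        slot, fp = self._slot_and_fp(word)
        self.fingerprints[slot] = fp      # hash collisions simply overwrite
        self.values[slot] = cluster_id

    def get(self, word, default=-1):
        slot, fp = self._slot_and_fp(word)
        return self.values[slot] if self.fingerprints[slot] == fp else default
```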
In order to compare against the monolingual setting of Gillick et al. (2016), we train models for the same set of 13 languages from the Universal Dependency treebanks v1.1 (Nivre et al., 2016) corpus, using the standard predefined splits.
As shown in Table 2, our best models are 0.3% more accurate on average across all languages than the BTS monolingual models, while using 6x fewer parameters and 36x fewer FLOPs. The cluster features play an important role, providing a 15% relative reduction in error over our vanilla model, but they also increase the overall size.
3 For example, the commonly used English clusters from the BLLIP corpus are over 7 MB: people.csail.mit.edu/maestro/papers/bllip-clusters.gz
Transition
SPLIT ([σ], [i|β]) → ([σ|i], [β])
MERGE ([σ], [i|β]) → ([σ], [β])
Table 3: Segmentation transition system. Initially all characters are on the buffer β and the stack σ is empty: ([], [c_1 c_2 ... c_n]). In the final state the buffer is empty and the stack contains the first character for each word.
Halving all feature embedding dimensions (except for the cluster features) still gives a 12% reduction in error and trims the overall size back to 1.1x the vanilla model, staying well under 1 MB in total. This halved model configuration has a throughput of 46k tokens/second, on average.
Two potential advantages of BTS are that it does not require tokenized input and that it has a more accurate multilingual version, achieving 95.85% accuracy. From a memory perspective, one multilingual BTS model will take less space than separate FF models. However, from a runtime perspective, a pipeline of our models doing language identification, word segmentation, and then POS tagging would still be faster than a single instance of the deep LSTM BTS model, by about 12x in our FLOPs estimate.4
3.3 Segmentation
Word segmentation is critical for processing Asian languages where words are not explicitly separated by spaces. Recently, neural networks have significantly improved segmentation accuracy (Zhang et al., 2016; Cai and Zhao, 2016; Liu et al., 2016; Yang et al., 2017; Kong et al., 2015). We use a structured model based on the transition system in Table 3, similar to the one proposed by Zhang and Clark (2007). We conduct the segmentation experiments on the Chinese Treebank 6.0 with the recommended data splits. No external resources or pretrained embeddings are used. Hashing was detrimental to quality in our preliminary experiments, hence we do not use it for this task. To learn an embedding for unknown characters, we cast characters occurring only once in the training set to a special symbol.
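To make the transition system of Table 3 concrete, here is a minimal sketch of how a transition sequence maps to a segmentation; the action encoding and the example are ours, not from the paper.

```python
def apply_segmentation_transitions(chars, actions):
    """SPLIT starts a new word with the next buffer character; MERGE appends
    the next buffer character to the word in progress (Table 3).
    A valid derivation begins with SPLIT."""
    words = []
    for ch, action in zip(chars, actions):
        if action == "SPLIT":
            words.append(ch)       # the stack gains the first character of a new word
        else:                      # MERGE
            words[-1] += ch        # the character joins the current word
    return words

# e.g. apply_segmentation_transitions(list("ABCD"),
#                                     ["SPLIT", "SPLIT", "SPLIT", "MERGE"])
# -> ["A", "B", "CD"]
```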
Selected Features  Because we are not using hashing here, we need to be careful about the size of the input vocabulary.
4 Our calculation of BTS FLOPs is very conservative and favorable to BTS, as detailed in the supplementary material.
Model Accuracy Size
Zhang et al. (2016) 95.01 −
Zhang et al. (2016)-combo 95.95 −
Small FF, 64 dim 94.24 846KB
Small FF, 256 dim 94.16 3.2MB
Small FF, 64 dim, bigrams 95.18 2.0MB
Table 4: Segmentation results. Explicit bigrams are useful.
Transition Precondition
APPEND ([σ|i|j], [β]) → ([σ|[ij]], [β])
SHIFT ([σ], [i|β]) → ([σ|i], [β])
SWAP ([σ|i|j], [β]) → ([σ|j], [i|β]); i < j
Table 5: Preordering transition system. Initially all words are part of singleton spans on the buffer: ([], [[w_1][w_2]...[w_n]]). In the final state the buffer is empty and the stack contains a single span.
The neural network with its non-linearity is in theory able to learn bigrams by conjoining unigrams, but it has been shown that explicitly using character bigram features leads to better accuracy (Zhang et al., 2016; Pei et al., 2014). Zhang et al. (2016) suggest that embedding manually specified feature conjunctions further improves accuracy ('Zhang et al. (2016)-combo' in Table 4). However, such embeddings could easily lead to a model size explosion and thus are not considered in this work.

The results in Table 4 show that spending our memory budget on small bigram embeddings is more effective than spending it on larger character embeddings, in terms of both accuracy and model size. Our model featuring bigrams runs at 110 KB of text per second, or 39k tokens/second.
3.4 Preordering
Preordering source-side words into the target-side word order is a useful preprocessing task for statistical machine translation (Xia and McCord, 2004; Collins et al., 2005; Nakagawa, 2015; de Gispert et al., 2015). We propose a novel transition system for this task (Table 5), so that we can repeatedly apply a small network to produce these permutations. Inspired by a non-projective parsing transition system (Nivre, 2009), the system uses a SWAP action to permute spans. The system is sound for permutations: any derivation will end with all of the input words in a permuted order. It is also complete: all permutations are reachable (use SHIFT and SWAP operations to perform a bubble sort, then APPEND n − 1 times to form a single span). For training and evaluation, we use the English-Japanese manual word alignments from Nakagawa (2015).
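A minimal sketch of executing the transitions in Table 5; the span representation and the worked example are ours and purely illustrative.

```python
def apply_preordering_transitions(words, actions):
    """SHIFT moves the next span from the buffer to the stack, SWAP returns the
    second-topmost span to the buffer (allowed only if it originally preceded
    the top span), and APPEND merges the top two spans (Table 5)."""
    stack, buffer = [], [[w] for w in words]   # every word starts as a singleton span
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "SWAP":
            j, i = stack.pop(), stack.pop()
            buffer.insert(0, i)
            stack.append(j)
        else:                                  # APPEND
            j, i = stack.pop(), stack.pop()
            stack.append(i + j)
    return stack[0]                            # the single remaining span

# Swap two words into the order ["w2", "w1"]:
# apply_preordering_transitions(["w1", "w2"],
#                               ["SHIFT", "SHIFT", "SWAP", "SHIFT", "APPEND"])
```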
Model FRS Size
Nakagawa (2015) 81.6 -
Small FF 75.2 0.5MB
Small FF + POS tags 81.3 1.3MB
Small FF + Tagger input fts. 76.6 3.7MB
Table 6: Preordering results for English → Japanese. FRS (in [0, 100]) is the fuzzy reordering score (Talbot et al., 2011).
Pipelines  For preordering, we experiment with either spending all of our memory budget on reordering, or spending some of the memory budget on features over predicted POS tags, which also requires an additional neural network to predict these tags. Full feature templates are in the supplementary material. As the POS tagger network uses features based on a three-word window around the token, another possibility is to add all of the features that would have affected the POS tag of a token to the reorderer directly.
Table 6 shows results without POS tags, with features over the predicted tags of a separately trained POS tagger, and with the tagger's input features added directly to the reorderer while training only the downstream task. The preorderer that includes a separate network for POS tagging and then extracts features over the predicted tags is both more accurate and smaller than the model that includes all the features that contribute to a POS tag in the reorderer directly. This pipeline processes 7k tokens/second when taking pretokenized text as input, with the POS tagger accounting for 23% of the computation time.
4 Conclusions
This paper shows that small feed-forward networks are sufficient to achieve useful accuracies on a variety of tasks. In resource-constrained environments, speed and memory are important metrics to optimize alongside accuracy. While large and deep recurrent models are likely to be the most accurate whenever they can be afforded, feed-forward networks can provide better value in terms of runtime and memory, and should be considered a strong baseline.
Acknowledgments
We thank Kuzman Ganchev, Fernando Pereira, and the anonymous reviewers for their useful comments.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-
gio. 2014. Neural machine translation by jointly
learning to align and translate. arXiv preprint
arXiv:1409.0473.
Timothy Baldwin and Marco Lui. 2010. Language
identification: The long and the short of the mat-
ter. In Human Language Technologies: The 2010
Annual Conference of the North American Chap-
ter of the Association for Computational Linguistics,
pages 229–237. Association for Computational Lin-
guistics.
Léon Bottou. 2010. Large-scale machine learning
with stochastic gradient descent. In Proceedings of
COMPSTAT’2010, pages 177–186. Springer.
Deng Cai and Hai Zhao. 2016. Neural word segmen-
tation learning for Chinese. In Proceedings of the
54th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
409–420, Berlin, Germany. Association for Compu-
tational Linguistics.
Danqi Chen and Christopher D. Manning. 2014. A fast
and accurate dependency parser using neural net-
works. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Process-
ing, pages 740–750.
M. Collins, P. Koehn, and I. Kučerová. 2005. Clause
restructuring for statistical machine translation. In
Proceedings of ACL, pages 531–540.
Kuzman Ganchev and Mark Dredze. 2008. Small sta-
tistical models by random feature mixing. In Pro-
ceedings of the ACL-08: HLT Workshop on Mobile
Language Processing, pages 18–19. Association for
Computational Linguistics.
Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amar-
nag Subramanya. 2016. Multilingual language pro-
cessing from bytes. In Proceedings of NAACL-HLT,
pages 1296–1306, San Diego, USA. Association for
Computational Linguistics.
Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne.
2015. Fast and accurate preordering for SMT using
neural networks. In Proceedings of NAACL, pages
1012–1017.
Song Han, Huizi Mao, and William J Dally. 2015.
Deep compression: Compressing deep neural net-
works with pruning, trained quantization and huff-
man coding. arXiv preprint arXiv:1510.00149.
Geoffrey E. Hinton. 2012. A practical guide to train-
ing restricted Boltzmann machines. In Neural Net-
works: Tricks of the Trade (2nd ed.), Lecture Notes
in Computer Science, pages 599–619. Springer.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural computation,
9(8):1735–1780.
Yoon Kim and Alexander M. Rush. 2016. Sequence-
level knowledge distillation. In Proceedings of the
2016 Conference on Empirical Methods in Natu-
ral Language Processing, pages 1317–1327, Austin,
Texas. Association for Computational Linguistics.
Lingpeng Kong, Chris Dyer, and Noah A. Smith.
2015. Segmental recurrent neural networks. CoRR,
abs/1511.06018.
Terry Koo, Xavier Carreras, and Michael Collins. 2008.
Simple semi-supervised dependency parsing. In
Proceedings of ACL-08: HLT, pages 595–603.
Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernan-
dez Astudillo, Silvio Amir, Chris Dyer, Alan W
Black, and Isabel Trancoso. 2015. Finding function
in form: Compositional character models for open
vocabulary word representation. In Proceedings of
the 2015 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1520–1530. Asso-
ciation for Computational Linguistics.
Yijia Liu, Wanxiang Che, Jiang Guo, Bing Qin, and
Ting Liu. 2016. Exploring segment representations
for neural segmentation models. In Proceedings of
the Twenty-Fifth International Joint Conference on
Artificial Intelligence, IJCAI 2016, New York, NY,
USA, 9-15 July 2016, pages 2880–2886.
Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić,
Preslav Nakov, Ahmed Ali, and Jörg Tiedemann.
2016. Discriminating between similar languages
and Arabic dialect identification: A report on the
third DSL shared task. In Proceedings of the Third
Workshop on NLP for Similar Languages, Varieties
and Dialects (VarDial3), pages 1–14, Osaka, Japan.
The COLING 2016 Organizing Committee.
Giovanni Molina, Fahad AlGhamdi, Mahmoud
Ghoneim, Abdelati Hawwari, Nicolas Rey-
Villamizar, Mona Diab, and Thamar Solorio. 2016.
Overview for the second shared task on language
identification in code-switched data. In Proceedings
of the Second Workshop on Computational Ap-
proaches to Code Switching, pages 40–49, Austin,
Texas. Association for Computational Linguistics.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified
linear units improve restricted Boltzmann machines.
In Proceedings of the 27th International Conference
on Machine Learning, pages 807–814.
Tetsuji Nakagawa. 2015. Efficient top-down BTG
parsing for machine translation preordering. In Pro-
ceedings of ACL, pages 208–218.
Joakim Nivre. 2009. Non-projective dependency pars-
ing in expected linear time. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, pages
351–359.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-
ter, Yoav Goldberg, Jan Hajic, Christopher D. Man-
ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo,
Natalia Silveira, Reut Tsarfaty, and Daniel Zeman.
2016. Universal Dependencies v1: A Multilingual
Treebank Collection. In Proceedings of the Tenth In-
ternational Conference on Language Resources and
Evaluation (LREC 2016), Paris, France. European
Language Resources Association (ELRA).
Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-
margin tensor neural network for Chinese word seg-
mentation. In Proceedings of the 52nd Annual Meet-
ing of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 293–303, Bal-
timore, Maryland. Association for Computational
Linguistics.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,
Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff
Dean. 2017. Outrageously large neural networks:
The sparsely-gated mixture-of-experts layer. arXiv
preprint arXiv:1701.06538.
David Talbot, Hideto Kazawa, Hiroshi Ichikawa, Ja-
son Katz-Brown, Masakazu Seno, and Franz J. Och.
2011. A lightweight evaluation framework for ma-
chine translation reordering. In Proceedings of the
Sixth Workshop on Statistical Machine Translation,
WMT ’11, pages 12–21, Stroudsburg, PA, USA. As-
sociation for Computational Linguistics.
David Talbot and John Talbot. 2008. Bloom maps. In
Proceedings of the Meeting on Analytic Algorith-
mics and Combinatorics, pages 203–212. Society
for Industrial and Applied Mathematics.
Ivan Titov and James Henderson. 2007. Fast and robust
multilingual dependency parsing with a generative
latent variable model. In Proceedings of the 2007
Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural
Language Learning, pages 947–951.
Ivan Titov and James Henderson. 2010. A latent vari-
able model for generative dependency parsing. In
Harry Bunt, Paola Merlo, and Joakim Nivre, editors,
Trends in Parsing Technology: Dependency Parsing,
Domain Adaptation, and Deep Parsing, pages 35–
55. Springer Netherlands, Dordrecht.
Joseph Turian, Lev Ratinov, and Yoshua Bengio.
2010. Word Representations: A Simple and Gen-
eral Method for Semi-supervised Learning. In Pro-
ceedings of the 48th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 384–
394, Stroudsburg, PA, USA. Association for Com-
putational Linguistics.
David Weiss, Chris Alberti, Michael Collins, and Slav
Petrov. 2015. Structured training for neural net-
work transition-based parsing. In Proceedings of the
53rd Annual Meeting of the Association for Compu-
tational Linguistics, pages 323–333.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin
Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto
Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason
Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2016. Google’s
neural machine translation system: Bridging the gap
between human and machine translation. CoRR,
abs/1609.08144.
Fei Xia and Michael McCord. 2004. Improving a
statistical MT system with automatically learned
rewrite patterns. In Proceedings of COLING, page
508.
Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural
word segmentation with rich pretraining. CoRR,
abs/1704.08960.
Meishan Zhang, Yue Zhang, and Guohong Fu. 2016.
Transition-based neural word segmentation. In Pro-
ceedings of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Papers), pages 421–431, Berlin, Germany. Associa-
tion for Computational Linguistics.
Yue Zhang and Stephen Clark. 2007. Chinese segmen-
tation with a word-based perceptron algorithm. In
Proceedings of the 45th Annual Meeting of the As-
sociation of Computational Linguistics, pages 840–
847, Prague, Czech Republic. Association for Com-
putational Linguistics.