程序代写代做代考 graph Hidden Markov Mode flex computational biology interpreter html C AI Finite State Automaton Excel compiler go data mining decision tree deep learning kernel distributed system information theory B tree cache chain database Bioinformatics information retrieval Lambda Calculus Hive algorithm data science case study Bayesian game data structure Natural Language Processing

Natural Language Processing
Jacob Eisenstein October 15, 2018

Contents
Contents 1
Preface i
Background ………………………………. i Howtousethisbook………………………….. ii
1 Introduction 1
1.1 Naturallanguageprocessinganditsneighbors . . . . . . . . . . . . . . . . . 1
1.2 Threethemesinnaturallanguageprocessing ……………… 6
1.2.1 1.2.2 1.2.3
I Learning
Learningandknowledge ……………………. 6 Searchandlearning ………………………. 7 Relational, compositional, and distributional perspectives . . . . . . 9
11
2 Linear text classification 13
2.1 Thebagofwords ……………………………. 13
2.2 Na ̈ıveBayes ………………………………. 17 2.2.1 Typesandtokens………………………… 19 2.2.2 Prediction ……………………………. 20 2.2.3 Estimation……………………………. 21 2.2.4 Smoothing……………………………. 22 2.2.5 Settinghyperparameters…………………….. 23
2.3 Discriminativelearning ………………………… 24 2.3.1 Perceptron……………………………. 25 2.3.2 Averagedperceptron………………………. 27
2.4 Lossfunctionsandlarge-marginclassification . . . . . . . . . . . . . . . . . 27 2.4.1 Onlinelargemarginclassification ……………….. 30 2.4.2 *Derivation of the online support vector machine . . . . . . . . . . . 32
2.5 Logisticregression …………………………… 35
1

2
CONTENTS
3
2.5.1 Regularization …………………………. 36
2.5.2 Gradients ……………………………. 37
2.6 Optimization………………………………. 37 2.6.1 Batchoptimization ……………………….. 38 2.6.2 Onlineoptimization ………………………. 39
2.7 *Additionaltopicsinclassification …………………… 41 2.7.1 Featureselectionbyregularization……………….. 41 2.7.2 Otherviewsoflogisticregression………………… 41
2.8 Summaryoflearningalgorithms ……………………. 43
Nonlinear classification 47
3.1 Feedforwardneuralnetworks……………………… 48
3.2 Designingneuralnetworks ………………………. 50 3.2.1 Activationfunctions ………………………. 50 3.2.2 Networkstructure ……………………….. 51 3.2.3 Outputsandlossfunctions …………………… 52 3.2.4 Inputsandlookuplayers ……………………. 53
3.3 Learningneuralnetworks ……………………….. 53 3.3.1 Backpropagation ………………………… 55 3.3.2 Regularizationanddropout…………………… 57 3.3.3 *Learningtheory ………………………… 58 3.3.4 Tricks………………………………. 59
3.4 Convolutionalneuralnetworks…………………….. 62
Linguistic applications of classification 69
4.1 Sentimentandopinionanalysis…………………….. 69 4.1.1 Relatedproblems………………………… 70 4.1.2 Alternativeapproachestosentimentanalysis . . . . . . . . . . . . . . 72
4.2 Wordsensedisambiguation ………………………. 73 4.2.1 Howmanywordsenses? ……………………. 74 4.2.2 Wordsensedisambiguationasclassification . . . . . . . . . . . . . . 75
4.3 Designdecisionsfortextclassification …………………. 76 4.3.1 Whatisaword?…………………………. 76 4.3.2 Howmanywords?……………………….. 79 4.3.3 Countorbinary? ………………………… 80
4.4 Evaluatingclassifiers………………………….. 80
4.4.1 Precision,recall,andF-MEASURE ……………….. 81
4.4.2 Threshold-freemetrics……………………… 83
4.4.3 Classifier comparison and statistical significance . . . . . . . . . . . . 84
4.4.4 *Multiplecomparisons……………………… 87
4.5 Buildingdatasets ……………………………. 88 Jacob Eisenstein. Draft of October 15, 2018.
4

CONTENTS 3 4.5.1 Metadataaslabels ……………………….. 88
4.5.2 Labelingdata ………………………….. 88
5 Learning without supervision 95
5.1 Unsupervisedlearning…………………………. 95 5.1.1 K-meansclustering ………………………. 96 5.1.2 Expectation-Maximization(EM) ………………… 98 5.1.3 EMasanoptimizationalgorithm…………………102 5.1.4 Howmanyclusters? ……………………….103
5.2 Applicationsofexpectation-maximization………………..104 5.2.1 Wordsenseinduction ………………………104 5.2.2 Semi-supervisedlearning …………………….105 5.2.3 Multi-componentmodeling……………………106
5.3 Semi-supervisedlearning ………………………..107 5.3.1 Multi-viewlearning ……………………….108 5.3.2 Graph-basedalgorithms……………………..109
5.4 Domainadaptation……………………………110 5.4.1 Superviseddomainadaptation………………….111 5.4.2 Unsuperviseddomainadaptation ………………..112
5.5 *Otherapproachestolearningwithlatentvariables . . . . . . . . . . . . . . 114 5.5.1 Sampling……………………………..115 5.5.2 Spectrallearning …………………………117
II Sequencesandtrees 123
6 Language models 125
6.1 N-gramlanguagemodels ………………………..126
6.2 Smoothinganddiscounting ……………………….129 6.2.1 Smoothing…………………………….129 6.2.2 Discountingandbackoff……………………..130 6.2.3 *Interpolation…………………………..131 6.2.4 *Kneser-Neysmoothing……………………..133
6.3 Recurrentneuralnetworklanguagemodels……………….133 6.3.1 Backpropagationthroughtime ………………….136 6.3.2 Hyperparameters…………………………137 6.3.3 Gatedrecurrentneuralnetworks…………………137
6.4 Evaluatinglanguagemodels……………………….139 6.4.1 Held-outlikelihood ……………………….139 6.4.2 Perplexity …………………………….140
6.5 Out-of-vocabularywords ………………………..141
Under contract with MIT Press, shared under CC-BY-NC-ND license.

4
CONTENTS
7 Sequence labeling 145
7.1 Sequencelabelingasclassification ……………………145
7.2 Sequencelabelingasstructureprediction ………………..147
7.3 TheViterbialgorithm…………………………..149
7.3.1 Example……………………………..152
7.3.2 Higher-orderfeatures ………………………153
7.4 HiddenMarkovModels …………………………153 7.4.1 Estimation…………………………….155 7.4.2 Inference……………………………..155
7.5 Discriminativesequencelabelingwithfeatures . . . . . . . . . . . . . . . . . 157 7.5.1 Structuredperceptron ………………………160 7.5.2 Structuredsupportvectormachines ……………….160 7.5.3 Conditionalrandomfields…………………….162
7.6 Neuralsequencelabeling…………………………167
7.6.1 Recurrentneuralnetworks ……………………167
7.6.2 Character-levelmodels………………………169
7.6.3 Convolutional Neural Networks for Sequence Labeling . . . . . . . . 170
7.7 *Unsupervisedsequencelabeling…………………….170
7.7.1 Lineardynamicalsystems…………………….172
7.7.2 Alternative unsupervised learning methods . . . . . . . . . . . . . . 172
7.7.3 Semiring notation and the generalized viterbi algorithm . . . . . . . 172
8 Applications of sequence labeling 175
8.1 Part-of-speechtagging ………………………….175 8.1.1 Parts-of-Speech………………………….176 8.1.2 Accuratepart-of-speechtagging …………………180
8.2 MorphosyntacticAttributes ……………………….182
8.3 NamedEntityRecognition………………………..183
8.4 Tokenization……………………………….185
8.5 Codeswitching ……………………………..186
8.6 Dialogueacts ………………………………187
9 Formal language theory 191
9.1 Regularlanguages ……………………………192
9.1.1 Finitestateacceptors……………………….193
9.1.2 Morphologyasaregularlanguage………………..194
9.1.3 Weightedfinitestateacceptors ………………….196
9.1.4 Finitestatetransducers ……………………..201
9.1.5 *Learningweightedfinitestateautomata . . . . . . . . . . . . . . . . 206
9.2 Context-freelanguages………………………….207 9.2.1 Context-freegrammars ……………………..208
Jacob Eisenstein. Draft of October 15, 2018.

CONTENTS 5
9.2.2 Natural language syntax as a context-free language . . . . . . . . . . 211 9.2.3 Aphrase-structuregrammarforEnglish . . . . . . . . . . . . . . . . 213 9.2.4 Grammaticalambiguity ……………………..218
9.3 *Mildlycontext-sensitivelanguages …………………..218 9.3.1 Context-sensitive phenomena in natural language . . . . . . . . . . . 219 9.3.2 Combinatorycategorialgrammar ………………..220
10 Context-free parsing 225
10.1 Deterministicbottom-upparsing . . . . . . . . . . . . . . . . . . . . . . . . . 226 10.1.1 Recoveringtheparsetree …………………….227 10.1.2 Non-binaryproductions……………………..227 10.1.3 Complexity ……………………………229
10.2 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 10.2.1 Parserevaluation…………………………230 10.2.2 Localsolutions ………………………….231
10.3WeightedContext-FreeGrammars ……………………232 10.3.1 Parsing with weighted context-free grammars . . . . . . . . . . . . . 234 10.3.2 Probabilisticcontext-freegrammars ……………….235 10.3.3 *Semiringweightedcontext-freegrammars . . . . . . . . . . . . . . . 237
10.4 Learning weighted context-free grammars . . . . . . . . . . . . . . . . . . . . 238 10.4.1 Probabilisticcontext-freegrammars ……………….238 10.4.2 Feature-basedparsing ………………………239 10.4.3 *Conditionalrandomfieldparsing………………..240 10.4.4 Neuralcontext-freegrammars ………………….242
10.5 Grammarrefinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 10.5.1 Parent annotations and other tree transformations . . . . . . . . . . . 243 10.5.2 Lexicalizedcontext-freegrammars………………..244 10.5.3 *Refinementgrammars ……………………..248
10.6 Beyond context-free parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 10.6.1 Reranking …………………………….250 10.6.2 Transition-basedparsing……………………..251
11 Dependency parsing 257
11.1Dependencygrammar ………………………….257 11.1.1 Headsanddependents………………………258 11.1.2 Labeleddependencies ………………………259 11.1.3 Dependencysubtreesandconstituents . . . . . . . . . . . . . . . . . 260
11.2Graph-baseddependencyparsing ……………………262 11.2.1 Graph-based parsing algorithms . . . . . . . . . . . . . . . . . . . . . 264 11.2.2 Computingscoresfordependencyarcs . . . . . . . . . . . . . . . . . 265 11.2.3 Learning……………………………..267
Under contract with MIT Press, shared under CC-BY-NC-ND license.

6
CONTENTS
11.3Transition-baseddependencyparsing ………………….268 11.3.1 Transitionsystemsfordependencyparsing . . . . . . . . . . . . . . . 269 11.3.2 Scoring functions for transition-based parsers . . . . . . . . . . . . . 273 11.3.3 Learningtoparse…………………………274
11.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
III Meaning 283
12 Logical semantics 285
12.1 Meaning and denotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
12.2 Logical representations of meaning . . . . . . . . . . . . . . . . . . . . . . . . 287 12.2.1 Propositionallogic ………………………..287 12.2.2 First-orderlogic………………………….288
12.3 Semantic parsing and the lambda calculus . . . . . . . . . . . . . . . . . . . . 291 12.3.1 Thelambdacalculus ……………………….292 12.3.2 Quantification…………………………..293
12.4 Learning semantic parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 12.4.1 Learningfromderivations…………………….297 12.4.2 Learningfromlogicalforms……………………299 12.4.3 Learningfromdenotations ……………………301
13 Predicate-argument semantics 305
13.1 Semantic roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 13.1.1 VerbNet ……………………………..308 13.1.2 Proto-rolesandPropBank…………………….309 13.1.3 FrameNet …………………………….310
13.2Semanticrolelabeling ………………………….312 13.2.1 Semanticrolelabelingasclassification . . . . . . . . . . . . . . . . . . 312 13.2.2 Semantic role labeling as constrained optimization . . . . . . . . . . 315 13.2.3 Neuralsemanticrolelabeling…………………..317
13.3 Abstract Meaning Representation . . . . . . . . . . . . . . . . . . . . . . . . . 318 13.3.1 AMRParsing …………………………..321
14 Distributional and distributed semantics 325
14.1Thedistributionalhypothesis ………………………325
14.2 Design decisions for word representations . . . . . . . . . . . . . . . . . . . . 327 14.2.1 Representation ………………………….327 14.2.2 Context………………………………328 14.2.3 Estimation…………………………….329
14.3 Latent semantic analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Jacob Eisenstein. Draft of October 15, 2018.

CONTENTS 7
14.4 Brown clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 14.5Neuralwordembeddings ………………………..334 14.5.1 Continuousbag-of-words(CBOW)………………..334 14.5.2 Skipgrams…………………………….335 14.5.3 Computationalcomplexity ……………………335 14.5.4 Wordembeddingsasmatrixfactorization . . . . . . . . . . . . . . . . 337
14.6 Evaluatingwordembeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . 338 14.6.1 Intrinsicevaluations ……………………….339 14.6.2 Extrinsicevaluations……………………….339 14.6.3 Fairnessandbias …………………………340
14.7 Distributed representations beyond distributional statistics . . . . . . . . . . 341 14.7.1 Word-internalstructure ……………………..341 14.7.2 Lexicalsemanticresources…………………….343
14.8 Distributedrepresentationsofmultiwordunits . . . . . . . . . . . . . . . . . 344 14.8.1 Purelydistributionalmethods ………………….344 14.8.2 Distributional-compositionalhybrids . . . . . . . . . . . . . . . . . . 345 14.8.3 Supervisedcompositionalmethods ……………….346 14.8.4 Hybrid distributed-symbolic representations . . . . . . . . . . . . . . 346
15 Reference Resolution 351
15.1Formsofreferringexpressions ……………………..352 15.1.1 Pronouns …………………………….352 15.1.2 ProperNouns…………………………..357 15.1.3 Nominals …………………………….357
15.2Algorithmsforcoreferenceresolution ………………….358 15.2.1 Mention-pairmodels……………………….359 15.2.2 Mention-rankingmodels …………………….360 15.2.3 Transitiveclosureinmention-basedmodels. . . . . . . . . . . . . . . 361 15.2.4 Entity-basedmodels ……………………….362
15.3 Representationsforcoreferenceresolution. . . . . . . . . . . . . . . . . . . . 367 15.3.1 Features ……………………………..367 15.3.2 Distributed representations of mentions and entities . . . . . . . . . . 370
15.4 Evaluating coreference resolution . . . . . . . . . . . . . . . . . . . . . . . . . 373
16 Discourse 379
16.1 Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 16.1.1 Topicsegmentation………………………..380 16.1.2 Functionalsegmentation……………………..381
16.2 Entities and reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 16.2.1 Centeringtheory …………………………382 16.2.2 Theentitygrid ………………………….383
Under contract with MIT Press, shared under CC-BY-NC-ND license.

8
CONTENTS
16.3
16.2.3 *Formalsemanticsbeyondthesentencelevel . . . . . . . . . . . . . . 384 Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 16.3.1 Shallowdiscourserelations ……………………385 16.3.2 Hierarchicaldiscourserelations………………….389 16.3.3 Argumentation ………………………….392 16.3.4 Applicationsofdiscourserelations………………..393
IV Applications 401
17 Information extraction 403
17.1 Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405 17.1.1 Entitylinkingbylearningtorank ………………..406 17.1.2 Collective entity linking . . . . . . . . . . . . . . . . . . . . . . . . . . 408 17.1.3 *Pairwiserankinglossfunctions …………………409
17.2 Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411 17.2.1 Pattern-based relation extraction . . . . . . . . . . . . . . . . . . . . . 412 17.2.2 Relationextractionasaclassificationtask . . . . . . . . . . . . . . . . 413 17.2.3 Knowledgebasepopulation……………………416 17.2.4 Openinformationextraction …………………..419
17.3Events ………………………………….420
17.4 Hedges, denials, and hypotheticals . . . . . . . . . . . . . . . . . . . . . . . . 422
17.5 Question answering and machine reading . . . . . . . . . . . . . . . . . . . . 424
17.5.1 Formalsemantics…………………………424 17.5.2 Machinereading …………………………425
18 Machine translation 431
18.1Machinetranslationasatask ………………………431 18.1.1 Evaluatingtranslations ……………………..433 18.1.2Data ……………………………….435
18.2 Statistical machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 436 18.2.1 Statisticaltranslationmodeling………………….437 18.2.2 Estimation…………………………….438 18.2.3 Phrase-basedtranslation……………………..439 18.2.4 *Syntax-basedtranslation …………………….441
18.3Neuralmachinetranslation ……………………….442 18.3.1 Neuralattention …………………………444 18.3.2 *Neural machine translation without recurrence . . . . . . . . . . . . 446 18.3.3 Out-of-vocabularywords …………………….448
18.4 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449 18.5Trainingtowardstheevaluationmetric …………………451
Jacob Eisenstein. Draft of October 15, 2018.

CONTENTS 9
19 Text generation 457
19.1 Data-to-text generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 19.1.1 Latentdata-to-textalignment…………………..459 19.1.2 Neuraldata-to-textgeneration ………………….460
19.2Text-to-textgeneration ………………………….464 19.2.1 Neuralabstractivesummarization . . . . . . . . . . . . . . . . . . . . 464 19.2.2 Sentence fusion for multi-document summarization . . . . . . . . . . 466
19.3 Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467 19.3.1 Finite-state and agenda-based dialogue systems . . . . . . . . . . . . 467 19.3.2 Markovdecisionprocesses ……………………468 19.3.3 Neuralchatbots………………………….470
A Probability 475
A.1 Probabilitiesofeventcombinations……………………475 A.1.1 Probabilitiesofdisjointevents ………………….476 A.1.2 Lawoftotalprobability ……………………..477
A.2 ConditionalprobabilityandBayes’rule …………………477
A.3 Independence ………………………………479
A.4 Randomvariables…………………………….480
A.5 Expectations……………………………….481
A.6 Modelingandestimation…………………………482
B Numerical optimization 485
B.1 Gradientdescent …………………………….486 B.2 Constrainedoptimization ………………………..486 B.3 Example:Passive-aggressiveonlinelearning . . . . . . . . . . . . . . . . . . 487
Bibliography 489
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Preface
The goal of this text is focus on a core subset of the natural language processing, unified by the concepts of learning and search. A remarkable number of problems in natural language processing can be solved by a compact set of methods:
Search. Viterbi, CKY, minimum spanning tree, shift-reduce, integer linear programming, beam search.
Learning. Maximum-likelihood estimation, logistic regression, perceptron, expectation- maximization, matrix factorization, backpropagation.
This text explains how these methods work, and how they can be applied to a wide range of tasks: document classification, word sense disambiguation, part-of-speech tagging, named entity recognition, parsing, coreference resolution, relation extraction, discourse analysis, language modeling, and machine translation.
Background
Because natural language processing draws on many different intellectual traditions, al- most everyone who approaches it feels underprepared in one way or another. Here is a summary of what is expected, and where you can learn more:
Mathematics and machine learning. The text assumes a background in multivariate cal- culus and linear algebra: vectors, matrices, derivatives, and partial derivatives. You should also be familiar with probability and statistics. A review of basic proba- bility is found in Appendix A, and a minimal review of numerical optimization is found in Appendix B. For linear algebra, the online course and textbook from Strang (2016) provide an excellent review. Deisenroth et al. (2018) are currently preparing a textbook on Mathematics for Machine Learning, a draft can be found online.1 For an introduction to probabilistic modeling and estimation, see James et al. (2013); for
1 https://mml- book.github.io/
i

ii PREFACE a more advanced and comprehensive discussion of the same material, the classic
reference is Hastie et al. (2009).
Linguistics. This book assumes no formal training in linguistics, aside from elementary concepts likes nouns and verbs, which you have probably encountered in the study of English grammar. Ideas from linguistics are introduced throughout the text as needed, including discussions of morphology and syntax (chapter 9), semantics (chapters 12 and 13), and discourse (chapter 16). Linguistic issues also arise in the application-focused chapters 4, 8, and 18. A short guide to linguistics for students of natural language processing is offered by Bender (2013); you are encouraged to start there, and then pick up a more comprehensive introductory textbook (e.g., Ak- majian et al., 2010; Fromkin et al., 2013).
Computer science. The book is targeted at computer scientists, who are assumed to have taken introductory courses on the analysis of algorithms and complexity theory. In particular, you should be familiar with asymptotic analysis of the time and memory costs of algorithms, and with the basics of dynamic programming. The classic text on algorithms is offered by Cormen et al. (2009); for an introduction to the theory of computation, see Arora and Barak (2009) and Sipser (2012).
How to use this book
After the introduction, the textbook is organized into four main units:
Learning. Thissectionbuildsupasetofmachinelearningtoolsthatwillbeusedthrough- out the other sections. Because the focus is on machine learning, the text represen- tations and linguistic phenomena are mostly simple: “bag-of-words” text classifica- tion is treated as a model example. Chapter 4 describes some of the more linguisti- cally interesting applications of word-based text analysis.
Sequences and trees. This section introduces the treatment of language as a structured phenomena. It describes sequence and tree representations and the algorithms that they facilitate, as well as the limitations that these representations impose. Chap- ter 9 introduces finite state automata and briefly overviews a context-free account of English syntax.
Meaning. This section takes a broad view of efforts to represent and compute meaning from text, ranging from formal logic to neural word embeddings. It also includes two topics that are closely related to semantics: resolution of ambiguous references, and analysis of multi-sentence discourse structure.
Applications. Thefinalsectionofferschapter-lengthtreatmentsonthreeofthemostpromi- nent applications of natural language processing: information extraction, machine
Jacob Eisenstein. Draft of October 15, 2018.

translation, and text generation. Each of these applications merits a textbook length treatment of its own (Koehn, 2009; Grishman, 2012; Reiter and Dale, 2000); the chap- ters here explain some of the most well known systems using the formalisms and methods built up earlier in the book, while introducing methods such as neural at- tention.
Each chapter contains some advanced material, which is marked with an asterisk. This material can be safely omitted without causing misunderstandings later on. But even without these advanced sections, the text is too long for a single semester course, so instructors will have to pick and choose among the chapters.
Chapters 1-3 provide building blocks that will be used throughout the book, and chap- ter 4 describes some critical aspects of the practice of language technology. Language models (chapter 6), sequence labeling (chapter 7), and parsing (chapter 10 and 11) are canonical topics in natural language processing, and distributed word embeddings (chap- ter 14) have become ubiquitous. Of the applications, machine translation (chapter 18) is the best choice: it is more cohesive than information extraction, and more mature than text generation. Many students will benefit from the review of probability in Appendix A.
• A course focusing on machine learning should add the chapter on unsupervised learning (chapter 5). The chapters on predicate-argument semantics (chapter 13), reference resolution (chapter 15), and text generation (chapter 19) are particularly influenced by recent progress in machine learning, including deep neural networks and learning to search.
• A course with a more linguistic orientation should add the chapters on applica- tions of sequence labeling (chapter 8), formal language theory (chapter 9), semantics (chapter 12 and 13), and discourse (chapter 16).
• For a course with a more applied focus, I recommend the chapters on applications of sequence labeling (chapter 8), predicate-argument semantics (chapter 13), infor- mation extraction (chapter 17), and text generation (chapter 19).
Acknowledgments
Several colleagues, students, and friends read early drafts of chapters in their areas of expertise, including Yoav Artzi, Kevin Duh, Heng Ji, Jessy Li, Brendan O’Connor, Yuval Pinter, Shawn Ling Ramirez, Nathan Schneider, Pamela Shapiro, Noah A. Smith, Sandeep Soni, and Luke Zettlemoyer. I also thank the anonymous reviewers, particularly reviewer 4, who provided detailed line-by-line edits and suggestions. The text benefited from high- level discussions with my editor Marie Lufkin Lee, as well as Kevin Murphy, Shawn Ling Ramirez, and Bonnie Webber. In addition, there are many students, colleagues, friends, and family who found mistakes in early drafts, or who recommended key references.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
iii

iv PREFACE
These include: Parminder Bhatia, Kimberly Caras, Jiahao Cai, Justin Chen, Murtaza Dhu- liawala, Yantao Du, Barbara Eisenstein, Luiz C. F. Ribeiro, Chris Gu, Joshua Killingsworth, Jonathan May, Taha Merghani, Gus Monod, Raghavendra Murali, Nidish Nair, Brendan O’Connor, Brandon Peck, Yuval Pinter, Nathan Schneider, Jianhao Shen, Zhewei Sun, Ru- bin Tsui, Ashwin Cunnapakkam Vinjimur, Denny Vrandecˇic ́, William Yang Wang, Clay Washington, Ishan Waykul, Xavier Yao, Yuyu Zhang, and also some anonymous com- menters. Clay Washington tested several of the programming exercises.
Most of the book was written while I was at Georgia Tech’s School of Interactive Com- puting. I thank the School for its support of this project, and I thank my colleagues there for their help and support at the beginning of my faculty career. I also thank (and apol- ogize to) the many students in Georgia Tech’s CS 4650 and 7650 who suffered through early versions of the text. The book is dedicated to my parents.
Jacob Eisenstein. Draft of October 15, 2018.

Notation
As a general rule, words, word counts, and other types of observations are indicated with Roman letters (a, b, c); parameters are indicated with Greek letters (α, β, θ). Vectors are indicated with bold script for both random variables x and parameters θ. Other useful notations are indicated in the table below.
Basics
exp x log x
{xn}Nn=1
i
x(i)
xj:k [x; y] [x, y] en
Diag(x) X−1
the base-2 exponent, 2x
the base-2 logarithm, log2 x the set {x1,x2,…,xN}
xi raised to the power j
xji x(j)
indexing by both i and j Linear algebra
⊤
a column vector of feature counts for instance i, often word counts elements j through k (inclusive) of a vector x
vertical concatenation of two column vectors
horizontal concatenation of two column vectors
a “one-hot” vector with a value of 1 at position n, and zero everywhere else
θ
θ · x(i) X
xi,j
the transpose of􏰀a column vector θ the dot product N θj × x(i)
 x1 0
a matrix with x on the diagonal, e.g.,  0 x2 0 
j=1 j rowi,columnjofmatrixX
a matrix
the inverse of matrix X
v
0 0 x3
0 

vi
PREFACE
Text datasets
wm
N
M
V y(i)
yˆ
Y
K
􏱑
􏱒
y(i)
Y(w) ♦
􏱓
Probabilities
Pr(A) Pr(A | B) pB (b)
pB|A(b | a) A∼p
word token at position m
number of training instances
length of a sequence (of words or tags)
number of words in vocabulary
the true label for instance i
a predicted label
the set of all possible labels
number of possible labels K = |Y|
the start token
the stop token
a structured label for instance i, such as a tag sequence the set of possible labelings for the word sequence w the start tag
the stop tag
probability of event A
probability of event A, conditioned on event B
the marginal probability of random variable B taking value b; written
p(b) when the choice of random variable is clear from context
the probability of random variable B taking value b, conditioned on A
taking value a; written p(b | a) when clear from context
the random variable A is distributed according to distribution p. For
example, X ∼ N (0, 1) states that the random variable X is drawn from
a normal distribution with zero mean and unit variance. 2 conditioned on the random variable B, A is distributed according to p.
A|B∼p Machine learning
Ψ(x(i), y) f(x(i),y)
θ
l(i)
L
L
λ
the score for assigning label y to instance i the feature vector for instance i with label y a (column) vector of weights
loss on an individual instance i
objective function for an entire dataset log-likelihood of a dataset
the amount of regularization
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 1 Introduction
Natural language processing is the set of methods for making human language accessible to computers. In the past decade, natural language processing has become embedded in our daily lives: automatic machine translation is ubiquitous on the web and in social media; text classification keeps emails from collapsing under a deluge of spam; search engines have moved beyond string matching and network analysis to a high degree of linguistic sophistication; dialog systems provide an increasingly common and effective way to get and share information.
These diverse applications are based on a common set of ideas, drawing on algo- rithms, linguistics, logic, statistics, and more. The goal of this text is to provide a survey of these foundations. The technical fun starts in the next chapter; the rest of this current chapter situates natural language processing with respect to other intellectual disciplines, identifies some high-level themes in contemporary natural language processing, and ad- vises the reader on how best to approach the subject.
1.1 Natural language processing and its neighbors
Natural language processing draws on many other intellectual traditions, from formal linguistics to statistical physics. This section briefly situates natural language processing with respect to some of its closest neighbors.
Computational Linguistics Most of the meetings and journals that host natural lan- guage processing research bear the name “computational linguistics”, and the terms may be thought of as essentially synonymous. But while there is substantial overlap, there is an important difference in focus. In linguistics, language is the object of study. Computa- tional methods may be brought to bear, just as in scientific disciplines like computational biology and computational astronomy, but they play only a supporting role. In contrast,
1

2 CHAPTER1. INTRODUCTION
natural language processing is focused on the design and analysis of computational al- gorithms and representations for processing natural human language. The goal of natu- ral language processing is to provide new computational capabilities around human lan- guage: for example, extracting information from texts, translating between languages, an- swering questions, holding a conversation, taking instructions, and so on. Fundamental linguistic insights may be crucial for accomplishing these tasks, but success is ultimately measured by whether and how well the job gets done.
Machine Learning Contemporary approaches to natural language processing rely heav- ily on machine learning, which makes it possible to build complex computer programs from examples. Machine learning provides an array of general techniques for tasks like converting a sequence of discrete tokens in one vocabulary to a sequence of discrete to- kens in another vocabulary — a generalization of what one might informally call “transla- tion.” Much of today’s natural language processing research can be thought of as applied machine learning. However, natural language processing has characteristics that distin- guish it from many of machine learning’s other application domains.
• Unlike images or audio, text data is fundamentally discrete, with meaning created by combinatorial arrangements of symbolic units. This is particularly consequential for applications in which text is the output, such as translation and summarization, because it is not possible to gradually approach an optimal solution.
• Although the set of words is discrete, new words are always being created. Further- more, the distribution over words (and other linguistic elements) resembles that of a power law1 (Zipf, 1949): there will be a few words that are very frequent, and a long tail of words that are rare. A consequence is that natural language processing algo- rithms must be especially robust to observations that do not occur in the training data.
• Language is compositional: units such as words can combine to create phrases, which can combine by the very same principles to create larger phrases. For ex- ample, a noun phrase can be created by combining a smaller noun phrase with a prepositional phrase, as in the whiteness of the whale. The prepositional phrase is created by combining a preposition (in this case, of ) with another noun phrase (the whale). In this way, it is possible to create arbitrarily long phrases, such as,
(1.1) . . . huge globular pieces of the whale of the bigness of a human head.2
The meaning of such a phrase must be analyzed in accord with the underlying hier-
archical structure. In this case, huge globular pieces of the whale acts as a single noun 1Throughout the text, boldface will be used to indicate keywords that appear in the index.
2Throughout the text, this notation will be used to introduce linguistic examples.
Jacob Eisenstein. Draft of October 15, 2018.

1.1. NATURALLANGUAGEPROCESSINGANDITSNEIGHBORS 3
phrase, which is conjoined with the prepositional phrase of the bigness of a human head. The interpretation would be different if instead, huge globular pieces were con- joined with the prepositional phrase of the whale of the bigness of a human head — implying a disappointingly small whale. Even though text appears as a sequence, machine learning methods must account for its implicit recursive structure.
Artificial Intelligence The goal of artificial intelligence is to build software and robots with the same range of abilities as humans (Russell and Norvig, 2009). Natural language processing is relevant to this goal in several ways. On the most basic level, the capacity for language is one of the central features of human intelligence, and is therefore a prerequi- site for artificial intelligence.3 Second, much of artificial intelligence research is dedicated to the development of systems that can reason from premises to a conclusion, but such algorithms are only as good as what they know (Dreyfus, 1992). Natural language pro- cessing is a potential solution to the “knowledge bottleneck”, by acquiring knowledge from texts, and perhaps also from conversations. This idea goes all the way back to Tur- ing’s 1949 paper Computing Machinery and Intelligence, which proposed the Turing test for determining whether artificial intelligence had been achieved (Turing, 2009).
Conversely, reasoning is sometimes essential for basic tasks of language processing, such as resolving a pronoun. Winograd schemas are examples in which a single word changes the likely referent of a pronoun, in a way that seems to require knowledge and reasoning to decode (Levesque et al., 2011). For example,
(1.2) The trophy doesn’t fit into the brown suitcase because it is too [small/large].
When the final word is small, then the pronoun it refers to the suitcase; when the final word is large, then it refers to the trophy. Solving this example requires spatial reasoning; other schemas require reasoning about actions and their effects, emotions and intentions, and social conventions.
Such examples demonstrate that natural language understanding cannot be achieved in isolation from knowledge and reasoning. Yet the history of artificial intelligence has been one of increasing specialization: with the growing volume of research in subdisci- plines such as natural language processing, machine learning, and computer vision, it is
3This view is shared by some, but not all, prominent researchers in artificial intelligence. Michael Jordan, a specialist in machine learning, has said that if he had a billion dollars to spend on any large research project, he would spend it on natural language processing (https://www.reddit.com/r/ MachineLearning/comments/2fxi6v/ama_michael_i_jordan/). On the other hand, in a public dis- cussion about the future of artificial intelligence in February 2018, computer vision researcher Yann Lecun argued that despite its many practical applications, language is perhaps “number 300” in the priority list for artificial intelligence research, and that it would be a great achievement if AI could attain the capa- bilities of an orangutan, which do not include language (http://www.abigailsee.com/2018/02/21/ deep-learning-structure-and-innate-priors.html).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

4 CHAPTER1. INTRODUCTION
difficult for anyone to maintain expertise across the entire field. Still, recent work has demonstrated interesting connections between natural language processing and other ar- eas of AI, including computer vision (e.g., Antol et al., 2015) and game playing (e.g., Branavan et al., 2009). The dominance of machine learning throughout artificial intel- ligence has led to a broad consensus on representations such as graphical models and computation graphs, and on algorithms such as backpropagation and combinatorial opti- mization. Many of the algorithms and representations covered in this text are part of this consensus.
Computer Science The discrete and recursive nature of natural language invites the ap- plication of theoretical ideas from computer science. Linguists such as Chomsky and Montague have shown how formal language theory can help to explain the syntax and semantics of natural language. Theoretical models such as finite-state and pushdown au- tomata are the basis for many practical natural language processing systems. Algorithms for searching the combinatorial space of analyses of natural language utterances can be analyzed in terms of their computational complexity, and theoretically motivated approx- imations can sometimes be applied.
The study of computer systems is also relevant to natural language processing. Large datasets of unlabeled text can be processed more quickly by parallelization techniques like MapReduce (Dean and Ghemawat, 2008; Lin and Dyer, 2010); high-volume data sources such as social media can be summarized efficiently by approximate streaming and sketching techniques (Goyal et al., 2009). When deep neural networks are imple- mented in production systems, it is possible to eke out speed gains using techniques such as reduced-precision arithmetic (Wu et al., 2016). Many classical natural language process- ing algorithms are not naturally suited to graphics processing unit (GPU) parallelization, suggesting directions for further research at the intersection of natural language process- ing and computing hardware (Yi et al., 2011).
Speech Processing Natural language is often communicated in spoken form, and speech recognition is the task of converting an audio signal to text. From one perspective, this is a signal processing problem, which might be viewed as a preprocessing step before nat- ural language processing can be applied. However, context plays a critical role in speech recognition by human listeners: knowledge of the surrounding words influences percep- tion and helps to correct for noise (Miller et al., 1951). For this reason, speech recognition is often integrated with text analysis, particularly with statistical language models, which quantify the probability of a sequence of text (see chapter 6). Beyond speech recognition, the broader field of speech processing includes the study of speech-based dialogue sys- tems, which are briefly discussed in chapter 19. Historically, speech processing has often been pursued in electrical engineering departments, while natural language processing
Jacob Eisenstein. Draft of October 15, 2018.

1.1. NATURALLANGUAGEPROCESSINGANDITSNEIGHBORS 5 has been the purview of computer scientists. For this reason, the extent of interaction
between these two disciplines is less than it might otherwise be.
Ethics As machine learning and artificial intelligence become increasingly ubiquitous, it is crucial to understand how their benefits, costs, and risks are distributed across differ- ent kinds of people. Natural language processing raises some particularly salient issues around ethics, fairness, and accountability:
Access. Who is natural language processing designed to serve? For example, whose lan- guage is translated from, and whose language is translated to?
Bias. Does language technology learn to replicate social biases from text corpora, and does it reinforce these biases as seemingly objective computational conclusions?
Labor. Whose text and speech comprise the datasets that power natural language pro- cessing, and who performs the annotations? Are the benefits of this technology shared with all the people whose work makes it possible?
Privacy and internet freedom. What is the impact of large-scale text processing on the right to free and private communication? What is the potential role of natural lan- guage processing in regimes of censorship or surveillance?
This text lightly touches on issues related to fairness and bias in subsection 14.6.3 and subsection 18.1.1, but these issues are worthy of a book of their own. For more from within the field of computational linguistics, see the papers from the annual workshop on Ethics in Natural Language Processing (Hovy et al., 2017; Alfano et al., 2018). For an outside perspective on ethical issues relating to data science at large, see boyd and Crawford (2012).
Others Natural language processing plays a significant role in emerging interdisciplinary fields like computational social science and the digital humanities. Text classification (chapter 4), clustering (chapter 5), and information extraction (chapter 17) are particularly useful tools; another is probabilistic topic models (Blei, 2012), which are not covered in this text. Information retrieval (Manning et al., 2008) makes use of similar tools, and conversely, techniques such as latent semantic analysis (section 14.3) have roots in infor- mation retrieval. Text mining is sometimes used to refer to the application of data mining techniques, especially classification and clustering, to text. While there is no clear distinc- tion between text mining and natural language processing (nor between data mining and machine learning), text mining is typically less concerned with linguistic structure, and more interested in fast, scalable algorithms.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

6 CHAPTER1. INTRODUCTION 1.2 Three themes in natural language processing
Natural language processing covers a diverse range of tasks, methods, and linguistic phe- nomena. But despite the apparent incommensurability between, say, the summarization of scientific articles (section 16.3.4) and the identification of suffix patterns in Spanish verbs (section 9.1.4), some general themes emerge. The remainder of the introduction fo- cuses on these themes, which will recur in various forms through the text. Each theme can be expressed as an opposition between two extreme viewpoints on how to process natural language. The methods discussed in the text can usually be placed somewhere on the continuum between these two extremes.
1.2.1 Learning and knowledge
A recurring topic of debate is the relative importance of machine learning and linguistic knowledge. On one extreme, advocates of “natural language processing from scratch” (Col- lobert et al., 2011) propose to use machine learning to train end-to-end systems that trans- mute raw text into any desired output structure: e.g., a summary, database, or transla- tion. On the other extreme, the core work of natural language processing is sometimes taken to be transforming text into a stack of general-purpose linguistic structures: from subword units called morphemes, to word-level parts-of-speech, to tree-structured repre- sentations of grammar, and beyond, to logic-based representations of meaning. In theory, these general-purpose structures should then be able to support any desired application.
The end-to-end approach has been buoyed by recent results in computer vision and speech recognition, in which advances in machine learning have swept away expert- engineered representations based on the fundamentals of optics and phonology (Krizhevsky et al., 2012; Graves and Jaitly, 2014). But while machine learning is an element of nearly every contemporary approach to natural language processing, linguistic representations such as syntax trees have not yet gone the way of the visual edge detector or the auditory triphone. Linguists have argued for the existence of a “language faculty” in all human be- ings, which encodes a set of abstractions specially designed to facilitate the understanding and production of language. The argument for the existence of such a language faculty is based on the observation that children learn language faster and from fewer examples than would be possible if language was learned from experience alone.4 From a practi- cal standpoint, linguistic structure seems to be particularly important in scenarios where training data is limited.
There are a number of ways in which knowledge and learning can be combined in natural language processing. Many supervised learning systems make use of carefully engineered features, which transform the data into a representation that can facilitate
4The Language Instinct (Pinker, 2003) articulates these arguments in an engaging and popular style. For arguments against the innateness of language, see Elman et al. (1998).
Jacob Eisenstein. Draft of October 15, 2018.

1.2. THREETHEMESINNATURALLANGUAGEPROCESSING 7
learning. For example, in a task like search, it may be useful to identify each word’s stem, so that a system can more easily generalize across related terms such as whale, whales, whalers, and whaling. (This issue is relatively benign in English, as compared to the many other languages which include much more elaborate systems of prefixed and suffixes.) Such features could be obtained from a hand-crafted resource, like a dictionary that maps each word to a single root form. Alternatively, features can be obtained from the output of a general-purpose language processing system, such as a parser or part-of-speech tagger, which may itself be built on supervised machine learning.
Another synthesis of learning and knowledge is in model structure: building machine learning models whose architectures are inspired by linguistic theories. For example, the organization of sentences is often described as compositional, with meaning of larger units gradually constructed from the meaning of their smaller constituents. This idea can be built into the architecture of a deep neural network, which is then trained using contemporary deep learning techniques (Dyer et al., 2016).
The debate about the relative importance of machine learning and linguistic knowl- edge sometimes becomes heated. No machine learning specialist likes to be told that their engineering methodology is unscientific alchemy;5 nor does a linguist want to hear that the search for general linguistic principles and structures has been made irrelevant by big data. Yet there is clearly room for both types of research: we need to know how far we can go with end-to-end learning alone, while at the same time, we continue the search for linguistic representations that generalize across applications, scenarios, and languages. For more on the history of this debate, see Church (2011); for an optimistic view of the potential symbiosis between computational linguistics and deep learning, see Manning (2015).
1.2.2 Search and learning
Many natural language processing problems can be written mathematically in the form of optimization,6
yˆ = argmax Ψ(x, y; θ), y∈Y(x)
where,
• x is the input, which is an element of a set X;
• y is the output, which is an element of a set Y(x);
[1.1]
5Ali Rahimi argued that much of deep learning research was similar to “alchemy” in a presentation at the 2017 conference on Neural Information Processing Systems. He was advocating for more learning theory, not more linguistics.
6Throughout this text, equations will be numbered by square brackets, and linguistic examples will be numbered by parentheses.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

8
CHAPTER1. INTRODUCTION
• Ψ is a scoring function (also called the model), which maps from the set X × Y to the real numbers;
• θ is a vector of parameters for Ψ;
• yˆ is the predicted output, which is chosen to maximize the scoring function.
This basic structure can be applied to a huge range of problems. For example, the input x might be a social media post, and the output y might be a labeling of the emotional sentiment expressed by the author (chapter 4); or x could be a sentence in French, and the output y could be a sentence in Tamil (chapter 18); or x might be a sentence in English, and y might be a representation of the syntactic structure of the sentence (chapter 10); or x might be a news article and y might be a structured record of the events that the article describes (chapter 17).
This formulation reflects an implicit decision that language processing algorithms will have two distinct modules:
Search. The search module is responsible for computing the argmax of the function Ψ. In other words, it finds the output yˆ that gets the best score with respect to the in- put x. This is easy when the search space Y(x) is small enough to enumerate, or when the scoring function Ψ has a convenient decomposition into parts. In many cases, we will want to work with scoring functions that do not have these proper- ties, motivating the use of more sophisticated search algorithms, such as bottom-up dynamic programming (section 10.1) and beam search (section 11.3.1). Because the outputs are usually discrete in language processing problems, search often relies on the machinery of combinatorial optimization.
Learning. The learning module is responsible for finding the parameters θ. This is typ- ically (but not always) done by processing a large dataset of labeled examples, {(x(i),y(i))}Ni=1. Like search, learning is also approached through the framework of optimization, as we will see in chapter 2. Because the parameters are usually continuous, learning algorithms generally rely on numerical optimization to iden- tify vectors of real-valued parameters that optimize some function of the model and the labeled data. Some basic principles of numerical optimization are reviewed in Appendix B.
The division of natural language processing into separate modules for search and learning makes it possible to reuse generic algorithms across many tasks and models. Much of the work of natural language processing can be focused on the design of the model Ψ — identifying and formalizing the linguistic phenomena that are relevant to the task at hand — while reaping the benefits of decades of progress in search, optimization, and learning. This textbook will describe several classes of scoring functions, and the corresponding algorithms for search and learning.
Jacob Eisenstein. Draft of October 15, 2018.

1.2. THREETHEMESINNATURALLANGUAGEPROCESSING 9
When a model is capable of making subtle linguistic distinctions, it is said to be ex- pressive. Expressiveness is often traded off against efficiency of search and learning. For example, a word-to-word translation model makes search and learning easy, but it is not expressive enough to distinguish good translations from bad ones. Many of the most im- portant problems in natural language processing seem to require expressive models, in which the complexity of search grows exponentially with the size of the input. In these models, exact search is usually impossible. Intractability threatens the neat modular de- composition between search and learning: if search requires a set of heuristic approxima- tions, then it may be advantageous to learn a model that performs well under these spe- cific heuristics. This has motivated some researchers to take a more integrated approach to search and learning, as briefly mentioned in chapters 11 and 15.
1.2.3 Relational, compositional, and distributional perspectives
Any element of language — a word, a phrase, a sentence, or even a sound — can be described from at least three perspectives. Consider the word journalist. A journalist is a subcategory of a profession, and an anchorwoman is a subcategory of journalist; further- more, a journalist performs journalism, which is often, but not always, a subcategory of writing. This relational perspective on meaning is the basis for semantic ontologies such as WORDNET (Fellbaum, 2010), which enumerate the relations that hold between words and other elementary semantic units. The power of the relational perspective is illustrated by the following example:
(1.3) Umashanthi interviewed Ana. She works for the college newspaper.
Who works for the college newspaper? The word journalist, while not stated in the ex- ample, implicitly links the interview to the newspaper, making Umashanthi the most likely referent for the pronoun. (A general discussion of how to resolve pronouns is found in chapter 15.)
Yet despite the inferential power of the relational perspective, it is not easy to formalize computationally. Exactly which elements are to be related? Are journalists and reporters distinct, or should we group them into a single unit? Is the kind of interview performed by a journalist the same as the kind that one undergoes when applying for a job? Ontology designers face many such thorny questions, and the project of ontology design hearkens back to Borges’ (1993) Celestial Emporium of Benevolent Knowledge, which divides animals into:
(a) belonging to the emperor; (b) embalmed; (c) tame; (d) suckling pigs; (e) sirens; (f) fabulous; (g) stray dogs; (h) included in the present classification; (i) frenzied; (j) innumerable; (k) drawn with a very fine camelhair brush; (l) et cetera; (m) having just broken the water pitcher; (n) that from a long way off resemble flies.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

10 CHAPTER1. INTRODUCTION Difficulties in ontology construction have led some linguists to argue that there is no task-
independent way to partition up word meanings (Kilgarriff, 1997).
Some problems are easier. Each member in a group of journalists is a journalist: the -s suffix distinguishes the plural meaning from the singular in most of the nouns in English. Similarly, a journalist can be thought of, perhaps colloquially, as someone who produces or works on a journal. (Taking this approach even further, the word journal derives from the French jour+nal, or day+ly = daily.) In this way, the meaning of a word is constructed from the constituent parts — the principle of compositionality. This principle can be applied to larger units: phrases, sentences, and beyond. Indeed, one of the great strengths of the compositional view of meaning is that it provides a roadmap for understanding entire texts and dialogues through a single analytic lens, grounding out in the smallest parts of individual words.
But alongside journalists and anti-parliamentarians, there are many words that seem to be linguistic atoms: think, for example, of whale, blubber, and Nantucket. Idiomatic phrases like kick the bucket and shoot the breeze have meanings that are quite different from the sum of their parts (Sag et al., 2002). Composition is of little help for such words and expressions, but their meanings can be ascertained — or at least approximated — from the contexts in which they appear. Take, for example, blubber, which appears in such contexts as:
(1.4) a. b. c.
The blubber served them as fuel.
. . . extracting it from the blubber of the large fish . . .
Amongst oily substances, blubber has been employed as a manure.
These contexts form the distributional properties of the word blubber, and they link it to words which can appear in similar constructions: fat, pelts, and barnacles. This distribu- tional perspective makes it possible to learn about meaning from unlabeled data alone; unlike relational and compositional semantics, no manual annotation or expert knowl- edge is required. Distributional semantics is thus capable of covering a huge range of linguistic phenomena. However, it lacks precision: blubber is similar to fat in one sense, to pelts in another sense, and to barnacles in still another. The question of why all these words tend to appear in the same contexts is left unanswered.
The relational, compositional, and distributional perspectives all contribute to our un- derstanding of linguistic meaning, and all three appear to be critical to natural language processing. Yet they are uneasy collaborators, requiring seemingly incompatible represen- tations and algorithmic approaches. This text presents some of the best known and most successful methods for working with each of these representations, but future research may reveal new ways to combine them.
Jacob Eisenstein. Draft of October 15, 2018.

Part I Learning
11

Chapter 2
Linear text classification
We begin with the problem of text classification: given a text document, assign it a dis- crete label y ∈ Y, where Y is the set of possible labels. Text classification has many ap- plications, from spam filtering to the analysis of electronic health records. This chapter describes some of the most well known and effective algorithms for text classification, from a mathematical perspective that should help you understand what they do and why they work. Text classification is also a building block in more elaborate natural language processing tasks. For readers without a background in machine learning or statistics, the material in this chapter will take more time to digest than most of the subsequent chap- ters. But this investment will pay off as the mathematical principles behind these basic classification algorithms reappear in other contexts throughout the book.
2.1 The bag of words
To perform text classification, the first question is how to represent each document, or instance. A common approach is to use a column vector of word counts, e.g., x = [0,1,1,0,0,2,0,1,13,0…]⊤, where xj is the count of word j. The length of x is V 􏱕 |V|, where V is the set of possible words in the vocabulary. In linear classification, the classi- fication decision is based on a weighted sum of individual feature counts, such as word counts.
The object x is a vector, but it is often called a bag of words, because it includes only information about the count of each word, and not the order in which the words appear. With the bag of words representation, we are ignoring grammar, sentence boundaries, paragraphs — everything but the words. Yet the bag of words model is surprisingly effective for text classification. If you see the word whale in a document, is it fiction or non- fiction? What if you see the word molybdenum? For many labeling problems, individual words can be strong predictors.
13

14 CHAPTER2. LINEARTEXTCLASSIFICATION
To predict a label from a bag-of-words, we can assign a score to each word in the vo- cabulary, measuring the compatibility with the label. For example, for the label FICTION, we might assign a positive score to the word whale, and a negative score to the word molybdenum. These scores are called weights, and they are arranged in a column vector θ.
Suppose that you want a multiclass classifier, where K 􏱕 |Y| > 2. For example, you might want to classify news stories about sports, celebrities, music, and business. The goal is to predict a label yˆ, given the bag of words x, using the weights θ. For each label y ∈ Y, we compute a score Ψ(x, y), which is a scalar measure of the compatibility between the bag-of-words x and the label y. In a linear bag-of-words classifier, this score is the vector inner product between the weights θ and the output of a feature function f (x, y),
Ψ(x, y) = θ · f (x, y) = 􏰁 θj fj (x, y). [2.1] j
As the notation suggests, f is a function of two arguments, the word counts x and the label y, and it returns a vector output. For example, given arguments x and y, element j of this feature vector might be,
fj (x, y) = 􏰅xwhale, if y = FICTION [2.2] 0, otherwise
This function returns the count of the word whale if the label is FICTION, and it returns zero otherwise. The index j depends on the position of whale in the vocabulary, and of FICTION in the set of possible labels. The corresponding weight θj then scores the compatibility of the word whale with the label FICTION.1 A positive score means that this word makes the label more likely.
The output of the feature function can be formalized as a vector:
f(x,y = 1) = [x;0;0;…;0] 􏰎 􏰍􏰌 􏰏
[2.3] [2.4] [2.5]
(K −1)×V
where [0; 0; . . . ; 0] is a column vector of (K − 1) × V zeros, and the semicolon indicates
(K −1)×V
vertical concatenation. For each of the K possible labels, the feature function returns a
(K −1)×V
f(x,y = 2) = [0;0;…;0;x;0;0;…;0]
􏰎 􏰍􏰌 􏰏 􏰎 􏰍􏰌 􏰏
􏰎 􏰍􏰌 􏰏
􏰎 􏰍􏰌 􏰏
V (K −2)×V f(x,y = K) = [0;0;…;0;x],
1In practice, both f and θ may be implemented as a dictionary rather than vectors, so that it is not necessary to explicitly identify j. In such an implementation, the tuple (whale, FICTION) acts as a key in both dictionaries; the values in f are feature counts, and the values in θ are weights.
Jacob Eisenstein. Draft of October 15, 2018.

2.1. THEBAGOFWORDS 15
vector that is mostly zeros, with a column vector of word counts x inserted in a location that depends on the specific label y. This arrangement is shown in Figure 2.1. The notation may seem awkward at first, but it generalizes to an impressive range of learning settings, particularly structure prediction, which is the focus of Chapters 7-11.
Given a vector of weights, θ ∈ RV K , we can now compute the score Ψ(x, y) by Equa- tion 2.1. This inner product gives a scalar measure of the compatibility of the observation x with label y.2 For any document x, we predict the label yˆ,
yˆ = argmax Ψ(x, y) [2.6] y∈Y
Ψ(x, y) =θ · f (x, y). [2.7] This inner product notation gives a clean separation between the data (x and y) and the
parameters (θ).
While vector notation is used for presentation and analysis, in code the weights and feature vector can be implemented as dictionaries. The inner product can then be com- puted as a loop. In python:
def compute_score(x,y,weights): total = 0
for feature,count in feature_function(x,y).items(): total += weights[feature] * count
return total
This representation is advantageous because it avoids storing and iterating over the many
features whose counts are zero.
It is common to add an offset feature at the end of the vector of word counts x, which is always 1. We then have to also add an extra zero to each of the zero vectors, to make the vectorlengthsmatch.Thisgivestheentirefeaturevectorf(x,y)alengthof(V +1)×K. The weight associated with this offset feature can be thought of as a bias for or against each label. For example, if we expect most emails to be spam, then the weight for the offset feature for y = SPAM should be larger than the weight for the offset feature for y = NOT-SPAM.
Returning to the weights θ, where do they come from? One possibility is to set them by hand. If we wanted to distinguish, say, English from Spanish, we can use English and Spanish dictionaries, and set the weight to one for each word that appears in the
2 Only V × (K − 1) features and weights are necessary. By stipulating that Ψ(x, y = K ) = 0 regardless of x, it is possible to implement any classification rule that can be achieved with V × K features and weights. This is the approach taken in binary classification rules like y = Sign(β·x+a), where β is a vector of weights, a is an offset, and the label set is Y = {−1, 1}. However, for multiclass classification, it is more concise to write θ · f(x, y) for all y ∈ Y.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

16
CHAPTER2. LINEARTEXTCLASSIFICATION
Original text
Bag of words
Feature vector
y=Fiction
y=News
y=Gossip
y=Sports
0
0
1
0
2
0
2
0
2
0
2
0
2
0
1
0
0
1
aardvark
… best
…
it
…
of
…
the
…
times
…
was
…
worst
…
x
0
x
0
0
It was the best of times, it was the worst of times…
zyxt

f(x ,y=News)
Figure 2.1: The bag-of-words and feature vector representations, for a hypothetical text
classification task.
associated dictionary. For example,3
θ(E,bicycle) =1 θ(E,bicicleta) =0 θ(E,con) =1 θ(E,ordinateur) =0
θ(S,bicycle) =0 θ(S,bicicleta) =1 θ(S,con) =1
θ(S,ordinateur) =0.
Similarly, if we want to distinguish positive and negative sentiment, we could use posi- tive and negative sentiment lexicons (see subsection 4.1.2), which are defined by social psychologists (Tausczik and Pennebaker, 2010).
But it is usually not easy to set classification weights by hand, due to the large number of words and the difficulty of selecting exact numerical weights. Instead, we will learn the weights from data. Email users manually label messages as SPAM; newspapers label their own articles as BUSINESS or STYLE. Using such instance labels, we can automatically acquire weights using supervised machine learning. This chapter will discuss several machine learning approaches for classification. The first is based on probability. For a review of probability, consult Appendix A.
3In this notation, each tuple (language, word) indexes an element in θ, which remains a vector.
Jacob Eisenstein. Draft of October 15, 2018.

2.2. NAI ̈VEBAYES 17 2.2 Na ̈ıve Bayes
The joint probability of a bag of words x and its true label y is written p(x, y). Suppose we have a dataset of N labeled instances, {(x(i), y(i))}Ni=1, which we assume are indepen- dent and identically distributed (IID) (see section􏰧A.3). Then the joint probability of the entire dataset, written p(x(1:N), y(1:N)), is equal to Ni=1 pX,Y (x(i), y(i)).4
What does this have to do with classification? One approach to classification is to set the weights θ so as to maximize the joint probability of a training set of labeled docu- ments. This is known as maximum likelihood estimation:
θˆ = argmax p(x(1:N), y(1:N); θ) θ
􏰙N i=1
􏰁N i=1
[2.8] [2.9]
[2.10]
= argmax
θ
p(x(i), y(i); θ)
= argmax
θ
log p(x(i), y(i); θ).
The notation p(x(i), y(i); θ) indicates that θ is a parameter of the probability function. The product of probabilities can be replaced by a sum of log-probabilities because the log func- tion is monotonically increasing over positive arguments, and so the same θ will maxi- mize both the probability and its logarithm. Working with logarithms is desirable because of numerical stability: on a large dataset, multiplying many probabilities can underflow to zero.5
The probability p(x(i), y(i); θ) is defined through a generative model — an idealized random process that has generated the observed data.6 Algorithm 1 describes the gener- ative model underlying the Na ̈ıve Bayes classifier, with parameters θ = {μ, φ}.
• The first line of this generative model encodes the assumption that the instances are mutually independent: neither the label nor the text of document i affects the label or text of document j.7 Furthermore, the instances are identically distributed: the
4The notation pX,Y (x(i),y(i)) indicates the joint probability that random variables X and Y take the
specific values x(i) and y(i) respectively. The subscript will often be omitted when it is clear from context. For a review of random variables, see Appendix A.
5Throughout this text, you may assume all logarithms and exponents are base 2, unless otherwise indi- cated. Any reasonable base will yield an identical classifier, and base 2 is most convenient for working out examples by hand.
6Generative models will be used throughout this text. They explicitly define the assumptions underlying the form of a probability distribution over observed and latent variables. For a readable introduction to generative models in statistics, see Blei (2014).
7Can you think of any cases in which this assumption is too strong?
Under contract with MIT Press, shared under CC-BY-NC-ND license.

18 CHAPTER2. LINEARTEXTCLASSIFICATION Algorithm 1 Generative process for the Na ̈ıve Bayes classification model
for Instance i ∈ {1,2,…,N} do:
Draw the label y(i) ∼ Categorical(μ);
Draw the word counts x(i) | y(i) ∼ Multinomial(φy(i) ).
distributions over the label y(i) and the text x(i) (conditioned on y(i)) are the same for all instances i. In other words, we make the assumption that every document has the same distribution over labels, and that each document’s distribution over words depends only on the label, and not on anything else about the document. We also assume that the documents don’t affect each other: if the word whale appears in document i = 7, that does not make it any more or less likely that it will appear again in document i = 8.
• The second line of the generative model states that the random variable y(i) is drawn from a categorical distribution with parameter μ. Categorical distributions are like weighted dice: the column vector μ = [μ1;μ2;…;μK] gives the probabilities of each label, so that the probability of drawing label y is equal to μy. For example, if Y􏰀= {POSITIVE, NEGATIVE, NEUTRAL}, we might have μ = [0.1; 0.7; 0.2]. We require
y∈Y μy = 1 and μy ≥ 0, ∀y ∈ Y: each label’s probability is non-negative, and the sum of these probabilities is equal to one. 8
• The third line describes how the bag-of-words counts x(i) are generated. By writing x(i) | y(i), this line indicates that the word counts are conditioned on the label, so that the joint probability is factored using the chain rule,
pX,Y (x(i), y(i)) = pX|Y (x(i) | y(i)) × pY (y(i)). [2.11]
The specific distribution pX|Y is the multinomial, which is a probability distribu- tion over vectors of non-negative counts. The probability mass function for this distribution is:
􏰙V xj pmult(x; φ) =B(x) φj
j=1 􏰛􏰀Vj=1 xj􏰜!
B(x) = 􏰧Vj=1(xj!) .
[2.12]
[2.13]
8Formally, we require μ ∈ ∆K−1, where ∆K−1 is the K − 1 probability simplex, the set of all vectors of K nonnegative numbers that sum to one. Because of the sum-to-one constraint, there are K − 1 degrees of freedom for a vector of size K.
Jacob Eisenstein. Draft of October 15, 2018.

2.2. NAI ̈VEBAYES 19
2.2.1
As in the categorical distribution, the parameter φj can be interpreted as a probabil- ity: specifically, the probability that any given token in the document is the word j. The multinomial distribution involves a product over words, with each term in the product equal to the probability φj , exponentiated by the count xj . Words that have zero count play no role in this product, because φ0j = 1. The term B(x) is called the multinomial coefficient. It doesn’t depend on φ, and can usually be ignored. Can you see why we need this term at all?9
The notation p(x | y;φ) indicates the conditional probability of word counts x given label y, with parameter φ, which is equal to pmult(x;φy). By specifying the multinomial distribution, we describe the multinomial Na ̈ıve Bayes classifier. Why “na ̈ıve”? Because the multinomial distribution treats each word token indepen- dently, conditioned on the class: the probability mass function factorizes across the counts.10
Types and tokens
A slight modification to the generative model of Na ̈ıve Bayes is shown in Algorithm 2. Instead of generating a vector of counts of types, x, this model generates a sequence of tokens, w = (w1, w2, . . . , wM ). The distinction between types and tokens is critical: xj ∈ {0, 1, 2, . . . , M } is the count of word type j in the vocabulary, e.g., the number of times the word cannibal appears; wm ∈ V is the identity of token m in the document, e.g. wm = cannibal.
The probability of the sequence w is a product of categorical probabilities. Algorithm 2 makes a conditional independence assumption: each token w(i) is independent of all other
m
tokens w(i) , conditioned on the label y(i). This is identical to the “na ̈ıve” independence
n̸=m
assumption implied by the multinomial distribution, and as a result, the optimal parame-
ters for this model are identical to those in multinomial Na ̈ıve Bayes. For any instance, the probability assigned by this model is proportional to the probability under multinomial Na ̈ıve Bayes. The constant of proportionality is the multinomial coefficient B(x). Because B(x) ≥ 1, the probability for a vector of counts x is at least as large as the probability for a list of words w that induces the same counts: there can be many word sequences that correspond to a single vector of counts. For example, man bites dog and dog bites man correspond to an identical count vector, {bites : 1, dog : 1, man : 1}, and B(x) is equal to the total number of possible word orderings for count vector x.
9Technically, a multinomial distribution requires a second parameter, the total number of word counts in x. In the bag-of-words representation is equal to the number of words in the document. However, this parameter is irrelevant for classification.
10You can plug in any probability distribution to the generative story and it will still be Na ̈ıve Bayes, as long as you are making the “na ̈ıve” assumption that the features are conditionally independent, given the label. For example, a multivariate Gaussian with diagonal covariance is na ̈ıve in exactly the same sense.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

20 CHAPTER2. LINEARTEXTCLASSIFICATION Algorithm 2 Alternative generative process for the Na ̈ıve Bayes classification model
for Instance i ∈ {1,2,…,N} do:
Draw the label y(i) ∼ Categorical(μ); for Token m ∈ {1,2,…,Mi} do:
Draw the token w(i) | y(i) ∼ Categorical(φ (i) ). my
Sometimes it is useful to think of instances as counts of types, x; other times, it is better to think of them as sequences of tokens, w. If the tokens are generated from a model that assumes conditional independence, then these two views lead to probability models that are identical, except for a scaling factor that does not depend on the label or the parameters.
2.2.2 Prediction
The Na ̈ıve Bayes prediction rule is to choose the label y which maximizes log p(x, y; μ, φ):
yˆ = argmax log p(x, y; μ, φ) y
=argmaxlogp(x | y;φ)+logp(y;μ) y
Now we can plug in the probability distributions from the generative story.
where
[2.14] [2.15]
[2.16]
[2.17] [2.18]
[2.19] [2.20]
logp(x | y;φ)+logp(y;μ) =log B(x) φy,j +logμy j=1
 􏰙V x  􏰁j
V
=logB(x)+ xj logφy,j +logμy
j=1
= log B(x) + θ · f (x, y),
θ = [θ(1);θ(2);…;θ(K)]
θ(y) = [logφy,1;logφy,2;…; logφy,V ;logμy]
The feature function f(x,y) is a vector of V word counts and an offset, padded by zeros for the labels not equal to y (see Equations 2.3-2.5, and Figure 2.1). This construction ensures that the inner product θ · f(x,y) only activates the features whose weights are in θ(y). These features and weights are all we need to compute the joint log-probability log p(x, y) for each y. This is a key point: through this notation, we have converted the problem of computing the log-likelihood for a document-label pair (x, y) into the compu- tation of a vector inner product.
Jacob Eisenstein. Draft of October 15, 2018.

2.2. NAI ̈VEBAYES 21 2.2.3 Estimation
The parameters of the categorical and multinomial distributions have a simple interpre- tation: they are vectors of expected frequencies for each possible event. Based on this interpretation, it is tempting to set the parameters empirically,
count(y, j) 􏰀 (i) x(i)
φ =􏰀 = i:y =y j , [2.21]
􏰁N
logpmult(x(i);φy(i)) = where B(x(i)) is constant with respect to φ.
L(φ) = i=1
􏰁N i=1
logB(x(i))+
􏰁V (i)
xj logφy(i),j, [2.23]
j=1
y,j V ′ 􏰀V 􏰀 (i) j′=1 count(y, j ) j′=1 i:y(i)=y xj′
where count(y, j) refers to the count of word j in documents with label y.
Equation 2.21 defines the relative frequency estimate for φ. It can be justified as a maximum likelihood estimate: the estimate that maximizes the probability p(x(1:N), y(1:N); θ). Based on the generative model in Algorithm 1, the log-likelihood is,
􏰁N i=1
which is now written as a function L of the parameters φ and μ. Let’s continue to focus on the parameters φ. Since p(y) is constant with respect to φ, we can drop it:
L(φ, μ) =
log pmult(x(i); φy(i) ) + log pcat(y(i); μ), [2.22]
Maximum-likelihood estimation chooses φ to maximize the log-likelihood L. How- ever, the solution must obey the following constraints:
􏰁V
φy,j=1 ∀y [2.24] j=1
These constraints can be incorporated by adding a set of Lagrange multipliers to the objec- tive (see Appendix B for more details). To solve for each θy , we maximize the Lagrangian,
􏰁V
log φy,j − λ( φy,j − 1).
i:y(i)=y j=1
Differentiating with respect to the parameter φy,j yields,
∂l(φy) = 􏰁 x(i)/φy,j − λ. ∂φy,j j
􏰁 􏰁V ( i ) l(φy) = xj
j=1
[2.25]
[2.26]
i:y(i)=y
Under contract with MIT Press, shared under CC-BY-NC-ND license.

22 CHAPTER2. LINEARTEXTCLASSIFICATION The solution is obtained by setting each element in this vector of derivatives equal to zero,
λφy,j = 􏰁 x(i) j
[2.27] [2.28]
where δ 􏰗y(i) = y􏰘 is a delta function, also sometimes called an indicator function, which returns one if y(i) = y. The symbol ∝ indicates that φy,j is proportional to the right-hand side of the equation.
Equation 2.28 shows three different notations for the same thing: a sum over the word
(i) 􏰀
counts for all documents i such that the label y = y. This gives a solution for each
φy up to a constant of proportionality. Now recall the constraint Vj=1 φy,j = 1, which arises because φy represents a vector of probabilities for each word in the vocabulary. This constraint leads to an exact solution, which does not depend on λ:
i:y(i)=y
􏰁 ( i ) 􏰁N 􏰛 ( i ) 􏰜 ( i )
φy,j ∝ xj = δ y =y xj =count(y,j), i:y(i)=y i=1
φy,j = 􏰀 count(y,j) . [2.29] V′
j′=1 count(y,j )
This is equal to the relative frequency estimator from Equation 2.21. A similar derivation
gives μy ∝ 􏰀Ni=1 δ 􏰗y(i) = y􏰘. 2.2.4 Smoothing
With text data, there are likely to be pairs of labels and words that never appear in the training set, leaving φy,j = 0. For example, the word molybdenum may have never yet appeared in a work of fiction. But choosing a value of φFICTION,molybdenum = 0 would allow this single feature to completely veto a label, since p(FICTION | x) = 0 if xmolybdenum > 0.
This is undesirable, because it imposes high variance: depending on what data hap- pens to be in the training set, we could get vastly different classification rules. One so- lution is to smooth the probabilities, by adding a “pseudocount” of α to each count, and then normalizing.
α + count(y, j)
φy,j = 􏰀V ′ [2.30]
Vα+ j′=1count(y,j) ThisiscalledLaplacesmoothing.11 Thepseudocountαisahyperparameter,becauseit
controls the form of the log-likelihood function, which in turn drives the estimation of φ.
11Laplace smoothing has a Bayesian justification, in which the generative model is extended to include φ as a random variable. The resulting distribution over φ depends on both the data (x and y) and the prior probability p(φ; α). The corresponding estimate of φ is called maximum a posteriori, or MAP. This is in contrast with maximum likelihood, which depends only on the data.
Jacob Eisenstein. Draft of October 15, 2018.

2.2. NAI ̈VEBAYES 23
Smoothing reduces variance, but moves us away from the maximum likelihood esti- mate: it imposes a bias. In this case, the bias points towards uniform probabilities. Ma- chine learning theory shows that errors on heldout data can be attributed to the sum of bias and variance (Mohri et al., 2012). In general, techniques for reducing variance often increase the bias, leading to a bias-variance tradeoff.
• Unbiased classifiers may overfit the training data, yielding poor performance on unseen data.
• But if the smoothing is too large, the resulting classifier can underfit instead. In the limit of α → ∞, there is zero variance: you get the same classifier, regardless of the data. However, the bias is likely to be large.
Similar issues arise throughout machine learning. Later in this chapter we will encounter regularization, which controls the bias-variance tradeoff for logistic regression and large- margin classifiers (subsection 2.5.1); subsection 3.3.2 describes techniques for controlling variance in deep learning; chapter 6 describes more elaborate methods for smoothing empirical probabilities.
2.2.5 Setting hyperparameters
Returning to Na ̈ıve Bayes, how should we choose the best value of hyperparameters like α? Maximum likelihood will not work: the maximum likelihood estimate of α on the training set will always be α = 0. In many cases, what we really want is accuracy: the number of correct predictions, divided by the total number of predictions. (Other measures of classification performance are discussed in section 4.4.) As we will see, it is hard to optimize for accuracy directly. But for scalar hyperparameters like α, tun- ing can be performed by a simple heuristic called grid search: try a set of values (e.g., α ∈ {0.001, 0.01, 0.1, 1, 10}), compute the accuracy for each value, and choose the setting that maximizes the accuracy.
The goal is to tune α so that the classifier performs well on unseen data. For this reason, the data used for hyperparameter tuning should not overlap the training set, where very small values of α will be preferred. Instead, we hold out a development set (also called a tuning set) for hyperparameter selection. This development set may consist of a small fraction of the labeled data, such as 10%.
We also want to predict the performance of our classifier on unseen data. To do this, we must hold out a separate subset of data, called the test set. It is critical that the test set not overlap with either the training or development sets, or else we will overestimate the performance that the classifier will achieve on unlabeled data in the future. The test set should also not be used when making modeling decisions, such as the form of the feature function, the size of the vocabulary, and so on (these decisions are reviewed in chapter 4.)
Under contract with MIT Press, shared under CC-BY-NC-ND license.

24 CHAPTER2. LINEARTEXTCLASSIFICATION
The ideal practice is to use the test set only once — otherwise, the test set is used to guide the classifier design, and test set accuracy will diverge from accuracy on truly unseen data. Because annotated data is expensive, this ideal can be hard to follow in practice, and many test sets have been used for decades. But in some high-impact applications like machine translation and information extraction, new test sets are released every year.
When only a small amount of labeled data is available, the test set accuracy can be unreliable. K-fold cross-validation is one way to cope with this scenario: the labeled data is divided into K folds, and each fold acts as the test set, while training on the other folds. The test set accuracies are then aggregated. In the extreme, each fold is a single data point; this is called leave-one-out cross-validation. To perform hyperparameter tuning in the context of cross-validation, another fold can be used for grid search. It is important not to repeatedly evaluate the cross-validated accuracy while making design decisions about the classifier, or you will overstate the accuracy on truly unseen data.
2.3 Discriminative learning
Na ̈ıve Bayes is easy to work with: the weights can be estimated in closed form, and the probabilistic interpretation makes it relatively easy to extend. However, the assumption that features are independent can seriously limit its accuracy. Thus far, we have defined the feature function f(x,y) so that it corresponds to bag-of-words features: one feature per word in the vocabulary. In natural language, bag-of-words features violate the as- sumption of conditional independence — for example, the probability that a document will contain the word na ̈ıve is surely higher given that it also contains the word Bayes — but this violation is relatively mild.
However, good performance on text classification often requires features that are richer than the bag-of-words:
• To better handle out-of-vocabulary terms, we want features that apply to multiple words, such as prefixes and suffixes (e.g., anti-, un-, -ing) and capitalization.
• We also want n-gram features that apply to multi-word units: bigrams (e.g., not good, not bad), trigrams (e.g., not so bad, lacking any decency, never before imagined), and beyond.
These features flagrantly violate the Na ̈ıve Bayes independence assumption. Consider what happens if we add a prefix feature. Under the Na ̈ıve Bayes assumption, the joint probability of a word and its prefix are computed with the following approximation:12
Pr(word = unfit, prefix = un- | y) ≈ Pr(prefix = un- | y) × Pr(word = unfit | y).
12The notation Pr(·) refers to the probability of an event, and p(·) refers to the probability density or mass for a random variable (see Appendix A).
Jacob Eisenstein. Draft of October 15, 2018.

2.3. DISCRIMINATIVELEARNING 25 To test the quality of the approximation, we can manipulate the left-hand side by applying
the chain rule,
Pr(word = unfit, prefix = un- | y) = Pr(prefix = un- | word = unfit, y) [2.31]
× Pr(word = unfit | y) [2.32] But Pr(prefix = un- | word = unfit, y) = 1, since un- is guaranteed to be the prefix for the
word unfit. Therefore,
Pr(word = unfit, prefix = un- | y) =1 × Pr(word = unfit | y) [2.33]
≫ Pr(prefix = un- | y) × Pr(word = unfit | y), [2.34]
because the probability of any given word starting with the prefix un- is much less than one. Na ̈ıve Bayes will systematically underestimate the true probabilities of conjunctions of positively correlated features. To use such features, we need learning algorithms that do not rely on an independence assumption.
The origin of the Na ̈ıve Bayes independence assumption is the learning objective, p(x(1:N),y(1:N)), which requires modeling the probability of the observed text. In clas- sification problems, we are always given x, and are only interested in predicting the label y. In this setting, modeling the probability of the text x seems like a difficult and unnec- essary task. Discriminative learning algorithms avoid this task, and focus directly on the problem of predicting y.
2.3.1 Perceptron
In Na ̈ıve Bayes, the weights can be interpreted as parameters of a probabilistic model. But this model requires an independence assumption that usually does not hold, and limits our choice of features. Why not forget about probability and learn the weights in an error- driven way? The perceptron algorithm, shown in Algorithm 3, is one way to do this.
The algorithm is simple: if you make a mistake, increase the weights for features that are active with the correct label y(i), and decrease the weights for features that are active with the guessed label yˆ. Perceptron is an online learning algorithm, since the classifier weights change after every example. This is different from Na ̈ıve Bayes, which is a batch learning algorithm: it computes statistics over the entire dataset, and then sets the weights in a single operation. Algorithm 3 is vague about when this online learning procedure terminates. We will return to this issue shortly.
The perceptron algorithm may seem like an unprincipled heuristic: Na ̈ıve Bayes has a solid foundation in probability, but the perceptron is just adding and subtracting constants from the weights every time there is a mistake. Will this really work? In fact, there is some nice theory for the perceptron, based on the concept of linear separability. Informally, a dataset with binary labels (y ∈ {0, 1}) is linearly separable if it is possible to draw a
Under contract with MIT Press, shared under CC-BY-NC-ND license.

26 CHAPTER2. LINEARTEXTCLASSIFICATION Algorithm 3 Perceptron learning algorithm
1: 2: 3: 4: 5: 6: 7: 8: 9:
10: 11: 12: 13:
procedurePERCEPTRON(x(1:N),y(1:N)) t←0
θ(0)←0 repeat
t←t+1
Select an instance i
yˆ ← argmaxy θ(t−1) · f(x(i), y) if yˆ ̸= y(i) then
θ(t) ←θ(t−1) +f(x(i),y(i))−f(x(i),yˆ) else
θ(t) ← θ(t−1) until tired
return θ(t)
hyperplane (a line in many dimensions), such that on each side of the hyperplane, all instances have the same label. This definition can be formalized and extended to multiple labels:
Definition 1 (Linear separability). The dataset D = {(x(i),y(i))}Ni=1 is linearly separable iff (if and only if) there exists some weight vector θ and some margin ρ such that for every instance (x(i),y(i)), the inner product of θ and the feature function for the true label, θ · f(x(i),y(i)), is at least ρ greater than inner product of θ and the feature function for every other possible label, θ · f(x(i), y′).
∃θ,ρ>0:∀(x(i),y(i))∈D, θ·f(x(i),y(i))≥ρ+ max θ·f(x(i),y′). [2.35] y′̸=y(i)
Linear separability is important because of the following guarantee: if your data is linearly separable, then the perceptron algorithm will find a separator (Novikoff, 1962).13 So while the perceptron may seem heuristic, it is guaranteed to succeed, if the learning problem is easy enough.
How useful is this proof? Minsky and Papert (1969) famously proved that the simple logical function of exclusive-or is not separable, and that a perceptron is therefore inca- pable of learning this function. But this is not just an issue for the perceptron: any linear classification algorithm, including Na ̈ıve Bayes, will fail on this task. Text classification problems usually involve high dimensional feature spaces, with thousands or millions of
13It is also possible to prove an upper bound on the number of training iterations required to find the separator. Proofs like this are part of the field of machine learning theory (Mohri et al., 2012).
Jacob Eisenstein. Draft of October 15, 2018.

2.4. LOSSFUNCTIONSANDLARGE-MARGINCLASSIFICATION 27
features. For these problems, it is very likely that the training data is indeed separable. And even if the dataset is not separable, it is still possible to place an upper bound on the number of errors that the perceptron algorithm will make (Freund and Schapire, 1999).
2.3.2 Averaged perceptron
The perceptron iterates over the data repeatedly — until “tired”, as described in Algo- rithm 3. If the data is linearly separable, the perceptron will eventually find a separator, and we can stop once all training instances are classified correctly. But if the data is not linearly separable, the perceptron can thrash between two or more weight settings, never converging. In this case, how do we know that we can stop training, and how should we choose the final weights? An effective practical solution is to average the perceptron weights across all iterations.
This procedure is shown in Algorithm 4. The learning algorithm is nearly identical, but we also maintain a vector of the sum of the weights, m. At the end of the learning procedure, we divide this sum by the total number of updates t, to compute the average weights, θ. These average weights are then used for prediction. In the algorithm sketch, the average is computed from a running sum, m ← m + θ. However, this is inefficient, because it requires |θ| operations to update the running sum. When f(x,y) is sparse, |θ| ≫ |f (x, y)| for any individual (x, y). This means that computing the running sum will be much more expensive than computing of the update to θ itself, which requires only 2 × |f(x,y)| operations. One of the exercises is to sketch a more efficient algorithm for computing the averaged weights.
Even if the dataset is not separable, the averaged weights will eventually converge. One possible stopping criterion is to check the difference between the average weight vectors after each pass through the data: if the norm of the difference falls below some predefined threshold, we can stop training. Another stopping criterion is to hold out some data, and to measure the predictive accuracy on this heldout data. When the accuracy on the heldout data starts to decrease, the learning algorithm has begun to overfit the training set. At this point, it is probably best to stop; this stopping criterion is known as early stopping.
Generalization is the ability to make good predictions on instances that are not in the training data. Averaging can be proven to improve generalization, by computing an upper bound on the generalization error (Freund and Schapire, 1999; Collins, 2002).
2.4 Loss functions and large-margin classification
Na ̈ıve Bayes chooses the weights θ by maximizing the joint log-likelihood log p(x(1:N), y(1:N)). By convention, optimization problems are generally formulated as minimization of a loss function. The input to a loss function is the vector of weights θ, and the output is a
Under contract with MIT Press, shared under CC-BY-NC-ND license.

28 CHAPTER2. LINEARTEXTCLASSIFICATION Algorithm 4 Averaged perceptron learning algorithm
1: procedureAVG-PERCEPTRON(x(1:N),y(1:N))
2: t←0
3: θ(0)←0
4: repeat
5: t←t+1
6: Select an instance i
7: yˆ ← argmaxy θ(t−1) · f(x(i), y)
8: if yˆ ̸= y(i) then
9: θ(t) ←θ(t−1) +f(x(i),y(i))−f(x(i),yˆ)
10: else
11: θ(t) ← θ(t−1)
12: m ← m + θ(t)
13: until tired
14: θ←1m t
15: return θ
non-negative number, measuring the performance of the classifier on a training instance. Formally, the loss l(θ; x(i), y(i)) is then a measure of the performance of the weights θ on the instance (x(i), y(i)). The goal of learning is to minimize the sum of the losses across all instances in the training set.
We can trivially reformulate maximum likelihood as a loss function, by defining the loss function to be the negative log-likelihood:
[2.36] [2.37] [2.38]
log p(x(1:N), y(1:N); θ) =
lNB(θ; x(i), y(i)) = − log p(x(i), y(i); θ)
􏰁N i=1
ˆ 􏰁N θ = argmin
log p(x(i), y(i); θ).
The problem of minimizing lNB is thus identical to maximum-likelihood estimation.
Loss functions provide a general framework for comparing learning objectives. For example, an alternative loss function is the zero-one loss,
l0-1(θ;x(i),y(i))=􏰅0, y(i) =argmaxyθ·f(x(i),y) [2.40] 1, otherwise
Jacob Eisenstein. Draft of October 15, 2018.
= argmax
[2.39]
θ
lNB(θ; x(i), y(i))
i=1 􏰁N
log p(x(i), y(i); θ)
θ
i=1

2.4. LOSSFUNCTIONSANDLARGE-MARGINCLASSIFICATION 29
The zero-one loss is zero if the instance is correctly classified, and one otherwise. The sum of zero-one losses is proportional to the error rate of the classifier on the training data. Since a low error rate is often the ultimate goal of classification, this may seem ideal. But the zero-one loss has several problems. One is that it is non-convex,14 which means that there is no guarantee that gradient-based optimization will be effective. A more serious problem is that the derivatives are useless: the partial derivative with respect to any parameter is zero everywhere, except at the points where θ·f(x(i), y) = θ·f(x(i), yˆ) for some yˆ. At those points, the loss is discontinuous, and the derivative is undefined.
The perceptron optimizes a loss function that has better properties for learning: lPERCEPTRON(θ; x(i), y(i)) = max θ · f(x(i), y) − θ · f(x(i), y(i)), [2.41]
y∈Y
When yˆ = y(i), the loss is zero; otherwise, it increases linearly with the gap between the score for the predicted label yˆ and the score for the true label y(i). Plotting this loss against theinputmaxy∈Y θ·f(x(i),y)−θ·f(x(i),y(i))givesahingeshape,motivatingthename hinge loss.
To see why this is the loss function optimized by the perceptron, take the derivative with respect to θ,
∂ lPERCEPTRON(θ;x(i),y(i))=f(x(i),yˆ)−f(x(i),y(i)). [2.42] ∂θ
At each instance, the perceptron algorithm takes a step of magnitude one in the opposite direction of this gradient, ∇θlPERCEPTRON = ∂ lPERCEPTRON(θ;x(i),y(i)). As we will see in
∂θ
section 2.6, this is an example of the optimization algorithm stochastic gradient descent,
applied to the objective in Equation 2.41.
*Breaking ties with subgradient descent 15 Careful readers will notice the tacit assump- tion that there is a unique yˆ that maximizes θ · f(x(i),y). What if there are two or more labels that maximize this function? Consider binary classification: if the maximizer is y(i), then the gradient is zero, and so is the perceptron update; if the maximizer is yˆ ̸= y(i), thentheupdateisthedifferencef(x(i),y(i))−f(x(i),yˆ). Theunderlyingissueisthatthe perceptron loss is not smooth, because the first derivative has a discontinuity at the hinge point, where the score for the true label y(i) is equal to the score for some other label yˆ. At this point, there is no unique gradient; rather, there is a set of subgradients. A vector v is
14Afunctionf isconvexiffαf(xi)+(1−α)f(xj) ≥ f(αxi+(1−α)xj),forallα ∈ [0,1]andforallxi andxj on the domain of the function. In words, any weighted average of the output of f applied to any two points is larger than the output of f when applied to the weighted average of the same two points. Convexity implies that any local minimum is also a global minimum, and there are many effective techniques for optimizing convex functions (Boyd and Vandenberghe, 2004). See Appendix B for a brief review.
15Throughout this text, advanced topics will be marked with an asterisk.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

30 CHAPTER2. LINEARTEXTCLASSIFICATION
a subgradient of the function g at u0 iff g(u) − g(u0) ≥ v · (u − u0) for all u. Graphically, this defines the set of hyperplanes that include g(u0) and do not intersect g at any other point. Asweapproachthehingepointfromtheleft,thegradientisf(x,yˆ)−f(x,y);aswe approach from the right, the gradient is 0. At the hinge point, the subgradients include all vectors that are bounded by these two extremes. In subgradient descent, any subgradient can be used (Bertsekas, 2012). Since both 0 and f (x, yˆ) − f (x, y) are subgradients at the hinge point, either one can be used in the perceptron update. This means that if multiple labels maximize θ · f(x(i), y), any of them can be used in the perceptron update.
PerceptronversusNa ̈ıveBayes Theperceptronlossfunctionhassomeprosandcons with respect to the negative log-likelihood loss implied by Na ̈ıve Bayes.
• Both lNB and lPERCEPTRON are convex, making them relatively easy to optimize. How- ever, lNB can be optimized in closed form, while lPERCEPTRON requires iterating over the dataset multiple times.
• lNB cansufferinfinitelossonasingleexample,sincethelogarithmofzeroprobability is negative infinity. Na ̈ıve Bayes will therefore overemphasize some examples, and underemphasize others.
• The Na ̈ıve Bayes classifier assumes that the observed features are conditionally in- dependent, given the label, and the performance of the classifier depends on the extent to which this assumption holds. The perceptron requires no such assump- tion.
• lPERCEPTRON treats all correct answers equally. Even if θ only gives the correct answer by a tiny margin, the loss is still zero.
2.4.1 Online large margin classification
This last comment suggests a potential problem with the perceptron. Suppose a test ex- ample is very close to a training example, but not identical. If the classifier only gets the correct answer on the training example by a small amount, then it may give a different answer on the nearby test instance. To formalize this intuition, define the margin as,
γ(θ;x(i),y(i))=θ·f(x(i),y(i))− max θ·f(x(i),y). [2.43] y̸=y(i)
The margin represents the difference between the score for the correct label y(i), and the score for the highest-scoring incorrect label. The intuition behind large margin clas- sification is that it is not enough to label the training data correctly — the correct label should be separated from other labels by a comfortable margin. This idea can be encoded
Jacob Eisenstein. Draft of October 15, 2018.

2.4. LOSSFUNCTIONSANDLARGE-MARGINCLASSIFICATION
31
3
2
1
0
−2 −1 0 1 2 θàf x(i),y(i) θàf x(i),y^
à loss margin loss logistic loss
Figure 2.2: Margin, zero-one, and logistic loss functions. into a loss function,
lMARGIN(θ; x(i), y(i)) = 􏰅0, γ(θ; x(i), y(i)) ≥ 1, 1 − γ(θ; x(i), y(i)), otherwise
= 􏰛1 − γ(θ; x(i), y(i))􏰜+ ,
[2.44]
[2.45]
where (x)+ = max(0, x). The loss is zero if there is a margin of at least 1 between the score for the true label and the best-scoring alternative yˆ. This is almost identical to the perceptron loss, but the hinge point is shifted to the right, as shown in Figure 2.2. The margin loss is a convex upper bound on the zero-one loss.
The margin loss can be minimized using an online learning rule that is similar to per- ceptron. We will call this learning rule the online support vector machine, for reasons that will be discussed in the derivation. Let us first generalize the notion of a classifica- tion error with a cost function c(y(i), y). We will focus on the simple cost function,
c(y(i), y) = 􏰅1, y(i) ̸= yˆ [2.46] 0, otherwise,
but it is possible to design specialized cost functions that assign heavier penalties to espe- cially undesirable errors (Tsochantaridis et al., 2004). This idea is revisited in chapter 7.
Using the cost function, we can now define the online support vector machine as the Under contract with MIT Press, shared under CC-BY-NC-ND license.
loss

32 CHAPTER2. LINEARTEXTCLASSIFICATION following classification rule:
yˆ = argmax θ · f(x(i), y) + c(y(i), y) [2.47] y∈Y
θ(t) ←(1−λ)θ(t−1) +f(x(i),y(i))−f(x(i),yˆ) [2.48] This update is similar in form to the perceptron, with two key differences.
•
•
2.4.2
Rather than selecting the label yˆ that maximizes the score of the current classifi- cation model, the argmax searches for labels that are both strong, as measured by θ · f(x(i),y), and wrong, as measured by c(y(i),y). This maximization is known as cost-augmented decoding, because it augments the maximization objective to favor high-cost labels. If the highest-scoring label is y = y(i), then the margin loss for this instance is zero, and no update is needed. If not, then an update is required to reduce the margin loss — even if the current model classifies the instance correctly. Cost augmentation is only done while learning; it is not applied when making pre- dictions on unseen data.
The previous weights θ(t−1) are scaled by (1 − λ), with λ ∈ (0, 1). The effect of this term is to cause the weights to “decay” back towards zero. In the support vector machine, this term arises from the minimization of a specific form of the margin, as described below. However, it can also be viewed as a form of regularization, which can help to prevent overfitting (see subsection 2.5.1). In this sense, it plays a role that is similar to smoothing in Na ̈ıve Bayes (see subsection 2.2.4).
*Derivation of the online support vector machine
The derivation of the online support vector machine is somewhat involved, but gives further intuition about why the method works. Begin by returning the idea of linear sep- arability (Definition 1): if a dataset is linearly separable, then there is some hyperplane θ that correctly classifies all training instances with margin ρ. This margin can be increased to any desired value by multiplying the weights by a constant.
Now, for any datapoint (x(i),y(i)), the geometric distance to the separating hyper- p􏰠la􏰀ne is given by γ(θ;x(i),y(i)), where the denominator is the norm of the weights, ||θ||2 =
||θ||2
j θj2. The geometric distance is sometimes called the geometric margin, in contrast to
the functional margin γ(θ; x(i), y(i)). Both are shown in Figure 2.3. The geometric margin is a good measure of the robustness of the separator: if the functional margin is large, but the norm ||θ||2 is also large, then a small change in x(i) could cause it to be misclassified. We therefore seek to maximize the minimum geometric margin across the dataset, subject
Jacob Eisenstein. Draft of October 15, 2018.

2.4. LOSSFUNCTIONSANDLARGE-MARGINCLASSIFICATION 33
geometric
margin functional
margin
Figure 2.3: Functional and geometric margins for a binary classification problem. All separators that satisfy the margin constraint are shown. The separator with the largest geometric margin is shown in bold.
to the constraint that the margin loss is always zero:
max
θ
s.t.
This is a constrained optimization problem, where the second line describes constraints on the space of possible solutions θ. In this case, the constraint is that the functional margin always be at least one, and the objective is that the minimum geometric margin be as large as possible.
Constrained optimization is reviewed in Appendix B. In this case, further manipula- tion yields an unconstrained optimization problem. First, note that the norm ||θ||2 scales linearly: ||aθ||2 = a||θ||2. Furthermore, the functional margin γ is a linear function of θ, so that γ(aθ, x(i), y(i)) = aγ(θ, x(i), y(i)). As a result, any scaling factor on θ will cancel in the numerator and denominator of the geometric margin. If the data is linearly separable at any ρ > 0, it is always possible to rescale the functional margin to 1 by multiplying θ by a scalar constant. We therefore need only minimize the denominator ||θ||2, subject to the constrai􏰀nt on the functional margin. The minimizer of ||θ||2 is also the minimizer of
1||θ||2 = 1 θ2, which is easier to work with. This yields a simpler optimization prob- 222j
Under contract with MIT Press, shared under CC-BY-NC-ND license.
γ(θ; x(i), y(i)) i=1,2,…N ||θ||2
min
γ(θ; x(i), y(i)) ≥ 1, ∀i. [2.49]

34 lem:
CHAPTER2. LINEARTEXTCLASSIFICATION
min . 1 ||θ||2 θ2
s.t. γ(θ; x(i), y(i)) ≥ 1, ∀i. [2.50]
This problem is a quadratic program: the objective is a quadratic function of the pa- rameters, and the constraints are all linear inequalities. One solution to this problem is to incorporate the constraints through Lagrange multipliers αi ≥ 0, i = 1, 2, . . . , N . The instances for which αi > 0 are called support vectors; other instances are irrelevant to the classification boundary. This motivates the name support vector machine.
Thus far we have assumed linear separability, but many datasets of interest are not linearly separable. In this case, there is no θ that satisfies the margin constraint. To add more flexibility, we can introduce a set of slack variables ξi ≥ 0. Instead of requiring that the functional margin be greater than or equal to one, we require that it be greater than or equal to 1 − ξi. Ideally there would not be any slack, so the slack variables are penalized in the objective function:
min
θ,ξ s.t.
1 2 􏰁N ||θ||2 + C ξi
2 i=1
γ(θ; x(i), y(i)) + ξi ≥ 1, ∀i ξi ≥ 0, ∀i.
[2.51]
The hyperparameter C controls the tradeoff between violations of the margin con- straint and the preference for a low norm of θ. As C → ∞, slack is infinitely expensive, and there is only a solution if the data is separable. As C → 0, slack becomes free, and there is a trivial solution at θ = 0. Thus, C plays a similar role to the smoothing parameter in Na ̈ıve Bayes (subsection 2.2.4), trading off between a close fit to the training data and better generalization. Like the smoothing parameter of Na ̈ıve Bayes, C must be set by the user, typically by maximizing performance on a heldout development set.
To solve the constrained optimization problem defined in Equation 2.51, we can first solve for the slack variables,
ξi ≥ (1 − γ(θ; x(i), y(i)))+. [2.52]
The inequality is tight: the optimal solution is to make the slack variables as small as possible, while still satisfying the constraints (Ratliff et al., 2007; Smith, 2011). By plugging in the minimum slack variables back into Equation 2.51, the problem can be transformed into the unconstrained optimization,
λ 􏰁N
min ||θ||2 + (1 − γ(θ; x(i), y(i)))+, [2.53]
Jacob Eisenstein. Draft of October 15, 2018.
θ 2 i=1

2.5. LOGISTICREGRESSION 35 where each ξi has been substituted by the right-hand side of Equation 2.52, and the factor
of C on the slack variables has been replaced by an equivalent factor of λ = 1 on the
norm of the weights.
Equation 2.53 can be rewritten by expanding the margin,
min λ||θ||2 +􏰁N 􏰑max􏰛θ·f(x(i),y)+c(y(i),y)􏰜−θ·f(x(i),y(i))􏰒 , θ 2 i=1 y∈Y +
C
[2.54]
where c(y,y(i)) is the cost function defined in Equation 2.46. We can now differentiate with respect to the weights,
􏰁N i=1
where LSVM refers to minimization objective in Equation 2.54 and yˆ = argmaxy∈Y θ · f(x(i),y) + c(y(i),y). The online support vector machine update arises from the appli- cation of stochastic gradient descent (described in subsection 2.6.2) to this gradient.
2.5 Logistic regression
Thus far, we have seen two broad classes of learning algorithms. Na ̈ıve Bayes is a prob- abilistic method, where learning is equivalent to estimating a joint probability distribu- tion. The perceptron and support vector machine are discriminative, error-driven algo- rithms: the learning objective is closely related to the number of errors on the training data. Probabilistic and error-driven approaches each have advantages: probability makes it possible to quantify uncertainty about the predicted labels, but the probability model of Na ̈ıve Bayes makes unrealistic independence assumptions that limit the features that can be used.
Logistic regression combines advantages of discriminative and probabilistic classi- fiers. Unlike Na ̈ıve Bayes, which starts from the joint probability pX,Y , logistic regression defines the desired conditional probability pY |X directly. Think of θ · f (x, y) as a scoring function for the compatibility of the base features x and the label y. To convert this score into a probability, we first exponentiate, obtaining exp (θ · f (x, y)), which is guaranteed to be non-negative. Next, we normalize, dividing over all possible labels y′ ∈ Y. The resulting conditional probability is defined as,
p(y|x;θ)=􏰀 exp(θ·f(x,y)) . [2.56] y′∈Y exp(θ·f(x,y′))
Under contract with MIT Press, shared under CC-BY-NC-ND license.
∇θLSVM =λθ+
f(x(i),yˆ)−f(x(i),y(i)), [2.55]

36 CHAPTER2. LINEARTEXTCLASSIFICATION Given a dataset D = {(x(i), y(i))}Ni=1, the weights θ are estimated by maximum condi-
tional likelihood,
􏰁N
log p(y(1:N) | x(1:N); θ) =
= θ·f(x(i),y(i))−log exp θ·f(x(i),y′) .
log p(y(i) | x(i); θ)
􏰁N 􏰁􏰛􏰜
[2.57]
i=1
[2.58] The final line is obtained by plugging in Equation 2.56 and taking the logarithm.16 Inside
i=1 y′∈Y the sum, we have the (additive inverse of the) logistic loss,
lLOGREG(θ; x(i), y(i)) = −θ · f(x(i), y(i)) + log 􏰁 exp(θ · f(x(i), y′)) [2.59] y′∈Y
The logistic loss is shown in Figure 2.2 on page 31. A key difference from the zero-one and hinge losses is that logistic loss is never zero. This means that the objective function can always be improved by assigning higher confidence to the correct label.
2.5.1 Regularization
As with the support vector machine, better generalization can be obtained by penalizing the norm of θ. This is done by adding a multiple of the squared norm λ||θ||2 to the
222 minimization objective. This is called L2 regularization, because ||θ||2 is the squared L2
norm of the vector θ. Regularization forces the estimator to trade off performance on the training data against the norm of the weights, and this can help to prevent overfitting. Consider what would happen to the unregularized weight for a base feature j that is active in only one instance x(i): the conditional log-likelihood could always be improved by increasing the weight for this feature, so that θ(j,y(i)) → ∞ and θ(j,y ̸̃=y(i)) → −∞, where
(j,y) is the index of feature associated with x(i) and label y in f(x(i),y). j
In subsection 2.2.4 (footnote 11), we saw that smoothing the probabilities of a Na ̈ıve Bayes classifier can be justified as a form of maximum a posteriori estimation, in which the parameters of the classifier are themselves random variables, drawn from a prior dis- tribution. The same justification applies to L2 regularization. In this case, the prior is a zero-mean Gaussian on each term of θ. The log-likelihood under a zero-mean Gaussian is,
logN(θj;0,σ2)∝− 1 θj2, [2.60] 2σ2
16The log-sum-exp term is a common pattern in machine learning. It is numerically unstable, because it will underflow if the inner product is small, and overflow if the inner product is large. Scientific computing libraries usually contain special functions for computing logsumexp, but with some thought, you should be able to see how to create an implementation that is numerically stable.
Jacob Eisenstein. Draft of October 15, 2018.

2.6. OPTIMIZATION 37 so that the regularization weight λ is equal to the inverse variance of the prior, λ = 1 .
2.5.2 Gradients
Logistic loss is minimized by optimization along the gradient. Specific algorithms are de- scribed in the next section, but first let’s compute the gradient with respect to the logistic loss of a single example:
lLOGREG =−θ·f(x(i),y(i))+log 􏰁 exp􏰛θ·f(x(i),y′)􏰜􏰁 [2.61]
y′∈Y 􏰛􏰜
∂l =−f(x(i),y(i))+ 􏰀 􏰗1 􏰘 × exp θ·f(x(i),y′) ×f(x(i),y′)
∂θ y′′∈Y exp θ·f(x(i),y′′) y′∈Y (i) (i) 􏰁 exp 􏰗θ · f(x(i), y′)􏰘
y′∈Y y ∈Y =−f(x(i),y(i))+ p(y′ |x(i);θ)×f(x(i),y′)
[2.62] [2.63]
[2.64]
[2.65]
(i) ′ =−f(x ,y )+􏰁􏰀′′ exp􏰗θ·f(x(i),y′′)􏰘×f(x ,y)
σ2
y′∈Y
= − f(x(i),y(i)) + EY |X[f(x(i),y)].
The final step employs the definition of a conditional expectation (section A.5). The gra- dient of the logistic loss is equal to the difference between the expected counts under the current model, EY |X[f(x(i),y)], and the observed feature counts f(x(i),y(i)). When these two vectors are equal for a single instance, there is nothing more to learn from it; when they are equal in sum over the entire dataset, there is nothing more to learn from the dataset as a whole. The gradient of the hinge loss is nearly identical, but it involves the featuresofthepredictedlabelunderthecurrentmodel,f(x(i),yˆ),ratherthantheexpected features EY |X[f(x(i),y)] under the conditional distribution p(y | x;θ).
The regularizer contributes λθ to the overall gradient: LLOGREG=λ||θ||2−􏰁N θ·f(x(i),y(i))−log􏰁expθ·f(x(i),y′)
􏰁N􏰛 􏰜 ∇θLLOGREG =λθ− f(x(i),y(i))−Ey|x[f(x(i),y)] .
i=1
2.6 Optimization
Each of the classification algorithms in this chapter can be viewed as an optimization problem:
Under contract with MIT Press, shared under CC-BY-NC-ND license.
2 i=1 y′∈Y
[2.66] [2.67]

38
CHAPTER2. LINEARTEXTCLASSIFICATION
• •
•
In Na ̈ıve Bayes, the objective is the joint likelihood log p(x(1:N), y(1:N)). Maximum likelihood estimation yields a closed-form solution for θ.
In the support vector machine, the objective is the regularized margin loss,
λ 􏰁N
LSVM = ||θ||2 + (max(θ·f(x(i),y)+c(y(i),y))−θ·f(x(i),y(i)))+, [2.68]
2 i=1 y∈Y
There is no closed-form solution, but the objective is convex. The perceptron algo-
rithm minimizes a similar objective.
In logistic regression, the objective is the regularized negative log-likelihood,
LLOGREG = λ||θ||2 −􏰁N θ·f(x(i),y(i))−log􏰁exp􏰛θ·f(x(i),y)􏰜 [2.69] 2 i=1 y∈Y
Again, there is no closed-form solution, but the objective is convex.
These learning algorithms are distinguished by what is being optimized, rather than how the optimal weights are found. This decomposition is an essential feature of con- temporary machine learning. The domain expert’s job is to design an objective function — or more generally, a model of the problem. If the model has certain characteristics, then generic optimization algorithms can be used to find the solution. In particular, if an objective function is differentiable, then gradient-based optimization can be employed; if it is also convex, then gradient-based optimization is guaranteed to find the globally optimal solution. The support vector machine and logistic regression have both of these properties, and so are amenable to generic convex optimization techniques (Boyd and Vandenberghe, 2004).
2.6.1 Batch optimization
In batch optimization, each update to the weights is based on a computation involving the entire dataset. One such algorithm is gradient descent, which iteratively updates the weights,
θ(t+1) ← θ(t) − η(t)∇θL, [2.70]
where ∇θL is the gradient computed over the entire training set, and η(t) is the learning
rate at iteration t. If the objective L is a convex function of θ, then this procedure is
guaranteed to terminate at the global optimum, for appropriate schedule of learning rates, η(t).17
17Convergence proofs typically require the learning rate to satisfy the following conditions: 􏰀∞t=1 η(t) = ∞ and 􏰀∞t=1(η(t))2 < ∞ (Bottou et al., 2016). These properties are satisfied by any learning rate schedule η(t) = η(0)t−α for α ∈ [1, 2]. Jacob Eisenstein. Draft of October 15, 2018. 2.6. OPTIMIZATION 39 In practice, gradient descent can be slow to converge, as the gradient can become infinitesimally small. Faster convergence can be obtained by second-order Newton opti- mization, which incorporates the inverse of the Hessian matrix, Hi,j = ∂2L [2.71] ∂θi∂θj The size of the Hessian matrix is quadratic in the number of features. In the bag-of-words representation, this is usually too big to store, let alone invert. Quasi-Network optimiza- tion techniques maintain a low-rank approximation to the inverse of the Hessian matrix. Such techniques usually converge more quickly than gradient descent, while remaining computationally tractable even for large feature sets. A popular quasi-Newton algorithm is L-BFGS (Liu and Nocedal, 1989), which is implemented in many scientific computing environments, such as SCIPY and MATLAB. For any gradient-based technique, the user must set the learning rates η(t). While con- vergence proofs usually employ a decreasing learning rate, in practice, it is common to fix η(t) to a small constant, like 10−3. The specific constant can be chosen by experimentation, although there is research on determining the learning rate automatically (Schaul et al., 2013; Wu et al., 2018). 2.6.2 Online optimization Batch optimization computes the objective on the entire training set before making an up- date. This may be inefficient, because at early stages of training, a small number of train- ing examples could point the learner in the correct direction. Online learning algorithms make updates to the weights while iterating through the training data. The theoretical basis for this approach is a stochastic approximation to the true objective function, 􏰁N l(θ; x(i), y(i)) ≈ N × l(θ; x(j), y(j)), (x(j), y(j)) ∼ {(x(i), y(i))}Ni=1, [2.72] i=1 where the instance (x(j), y(j)) is sampled at random from the full dataset. In stochastic gradient descent, the approximate gradient is computed by randomly sampling a single instance, and an update is made immediately. This is similar to the perceptron algorithm, which also updates the weights one instance at a time. In mini- batch stochastic gradient descent, the gradient is computed over a small set of instances. A typical approach is to set the minibatch size so that the entire batch fits in memory on a graphics processing unit (GPU; Neubig et al., 2017). It is then possible to speed up learn- ing by parallelizing the computation of the gradient over each instance in the minibatch. Algorithm 5 offers a generalized view of gradient descent. In standard gradient de- scent, the batcher returns a single batch with all the instances. In stochastic gradient de- Under contract with MIT Press, shared under CC-BY-NC-ND license. 40 CHAPTER2. LINEARTEXTCLASSIFICATION Algorithm 5 Generalized gradient descent. The function BATCHER partitions the train- ing set into B batches such that each instance appears in exactly one batch. In gradient descent, B = 1; in stochastic gradient descent, B = N; in minibatch stochastic gradient descent, 1 < B < N. 1: procedureGRADIENT-DESCENT(x(1:N),y(1:N),L,η(1...∞),BATCHER,Tmax) 2: θ←0 3: t←0 4: repeat 5: (b(1), b(2), . . . , b(B)) ← BATCHER(N) 6: for n ∈ {1,2,...,B} do 7: t←t+1 (n) (n) ,...), y(b1 ,b2 8: 9: 10: 11: 12: (n) (n) θ(t) ← θ(t−1) − η(t)∇θL(θ(t−1); x(b1 ,b2 if Converged(θ(1,2,...,t)) then return θ(t) until t ≥ Tmax return θ(t) ,...)) scent, it returns N batches with one instance each. In mini-batch settings, the batcher returns B minibatches, 1 < B < N . There are many other techniques for online learning, and research in this area is on- going (Bottou et al., 2016). Some algorithms use an adaptive learning rate, which can be different for every feature (Duchi et al., 2011). Features that occur frequently are likely to be updated frequently, so it is best to use a small learning rate; rare features will be updated infrequently, so it is better to take larger steps. The AdaGrad (adaptive gradient) algorithm achieves this behavior by storing the sum of the squares of the gradients for each feature, and rescaling the learning rate by its inverse: [2.73] [2.74] gt =∇θL(θ(t);x(i),y(i)) (t+1) (t) η(t) θj ←θj −􏰠􏰀t′ g2 gt,j, t=1 t,j where j iterates over features in f (x, y). In most cases, the number of active features for any instance is much smaller than the number of weights. If so, the computation cost of online optimization will be dominated by the update from the regularization term, λθ. The solution is to be “lazy”, updating each θj only as it is used. To implement lazy updating, store an additional parameter τj , which is the iteration at which θj was last updated. If θj is needed at time t, the t − τ regularization updates can be performed all at once. This strategy is described in detail by Kummerfeld et al. (2015). Jacob Eisenstein. Draft of October 15, 2018. 2.7. *ADDITIONALTOPICSINCLASSIFICATION 41 2.7 *Additional topics in classification This section presents some additional topics in classification that are particularly relevant for natural language processing, especially for understanding the research literature. 2.7.1 Feature selection by regularization In logistic regression and large-margin classification, generalization can be improved by regularizing the weights towards 0, using the L2 norm. But rather than encouraging weights to be small, it might be better for the model to be sparse: it should assign weights of exactly zero to most features, and only assign non-zero weights to feat􏰀ures that are clearly necessary. This idea can be formalized by the L0 norm, L0 = ||θ||0 = j δ (θj ̸= 0), which applies a constant penalty for each non-zero weight. This norm can be thought of as a form of feature selection: optimizing the L0-regularized conditional likelihood is equivalent to trading off the log-likelihood against the number of active features. Reduc- ing the number of active features is desirable because the resulting model will be fast, low-memory, and should generalize well, since irrelevant features will be pruned away. Unfortunately, the L0 norm is non-convex and non-differentiable. Optimization under L0 regularization is NP-hard, meaning that it can be solved efficiently only if P=NP (Ge et al., 2011). A useful alternative􏰀is the L1 norm, which is equal to the sum of the absolute values of the weights, ||θ||1 = j |θj |. The L1 norm is convex, and can be used as an approxima- tion to L0 (Tibshirani, 1996). Conveniently, the L1 norm also performs feature selection, by driving many of the coefficients to zero; it is therefore known as a sparsity inducing regularizer. The L1 norm does not have a gradient at θj = 0, so we must instead optimize the L1-regularized objective using subgradient methods. The associated stochastic sub- gradient descent algorithms are only somewhat more complex than conventional SGD; Sra et al. (2012) survey approaches for estimation under L1 and other regularizers. Gao et al. (2007) compare L1 and L2 regularization on a suite of NLP problems, finding that L1 regularization generally gives similar accuracy to L2 regularization, but that L1 regularization produces models that are between ten and fifty times smaller, because more than 90% of the feature weights are set to zero. 2.7.2 Other views of logistic regression In binary classification, we can dispense with the feature function, and choose y based on the inner product of θ · x. The conditional probability pY |X is obtained by passing this Under contract with MIT Press, shared under CC-BY-NC-ND license. 42 CHAPTER2. LINEARTEXTCLASSIFICATION inner product through a logistic function, σ(a) 􏱕 exp(a) = (1 + exp(−a))−1 [2.75] 1 + exp(a) p(y | x; θ) =σ(θ · x). [2.76] This is the origin of the name “logistic regression.” Logistic regression can be viewed as part of a larger family of generalized linear models (GLMs), in which various other link functions convert between the inner product θ · x and the parameter of a conditional probability distribution. Logistic regression and related models are sometimes referred to as log-linear, be- cause the log-probability is a linear function of the features. But in the early NLP liter- ature, logistic regression was often called maximum entropy classification (Berger et al., 1996). This name refers to an alternative formulation, in which the goal is to find the max- imum entropy probability function that satisfies moment-matching constraints. These constraints specify that the empirical counts of each feature should match the expected counts under the induced probability distribution pY |X;θ, 􏰁N i=1 fj (x(i), y(i)) = 􏰁N 􏰁 i=1 y∈Y p(y | x(i); θ)fj (x(i), y), ∀j [2.77] The moment-matching constraint is satisfied exactly when the derivative of the condi- tional log-likelihood function (Equation 2.65) is equal to zero. However, the constraint can be met by many values of θ, so which should we choose? The entropy of the conditional probability distribution pY |X is, H (pY |X ) = − 􏰁 pX (x) 􏰁 pY |X (y | x) log pY |X (y | x), [2.78] x∈X y∈Y where X is the set of all possible feature vectors, and pX (x) is the probability of observing the base features x. The distribution pX is unknown, but it can be estimated by summing over all the instances in the training set, H ̃(pY|X)=−1 􏰁N 􏰁pY|X(y|x(i))logpY|X(y|x(i)). [2.79] N i=1 y∈Y If the entropy is large, the likelihood function is smooth across possible values of y; if it is small, the likelihood function is sharply peaked at some preferred value; in the limiting case, the entropy is zero if p(y | x) = 1 for some y. The maximum-entropy cri- terion chooses to make the weakest commitments possible, while satisfying the moment- matching constraints from Equation 2.77. The solution to this constrained optimization problem is identical to the maximum conditional likelihood (logistic-loss) formulation that was presented in section 2.5. Jacob Eisenstein. Draft of October 15, 2018. 2.8. SUMMARYOFLEARNINGALGORITHMS 43 2.8 Summary of learning algorithms It is natural to ask which learning algorithm is best, but the answer depends on what characteristics are important to the problem you are trying to solve. Na ̈ıve Bayes Pros: easy to implement; estimation is fast, requiring only a single pass over the data; assigns probabilities to predicted labels; controls overfitting with smooth- ing parameter. Cons: often has poor accuracy, especially with correlated features. Perceptron Pros: easy to implement; online; error-driven learning means that accuracy is typically high, especially after averaging. Cons: not probabilistic; hard to know when to stop learning; lack of margin can lead to overfitting. Support vector machine Pros: optimizes an error-based metric, usually resulting in high accuracy; overfitting is controlled by a regularization parameter. Cons: not proba- bilistic. Logistic regression Pros: error-driven and probabilistic; overfitting is controlled by a reg- ularization parameter. Cons: batch learning requires black-box optimization; logistic loss can “overtrain” on correctly labeled examples. One of the main distinctions is whether the learning algorithm offers a probability over labels. This is useful in modular architectures, where the output of one classifier is the input for some other system. In cases where probability is not necessary, the sup- port vector machine is usually the right choice, since it is no more difficult to implement than the perceptron, and is often more accurate. When probability is necessary, logistic regression is usually more accurate than Na ̈ıve Bayes. Additional resources A machine learning textbook will offer more classifiers and more details (e.g., Murphy, 2012), although the notation will differ slightly from what is typical in natural language processing. Probabilistic methods are surveyed by Hastie et al. (2009), and Mohri et al. (2012) emphasize theoretical considerations. Bottou et al. (2016) surveys the rapidly mov- ing field of online learning, and Kummerfeld et al. (2015) empirically review several opti- mization algorithms for large-margin learning. The python toolkit SCIKIT-LEARN includes implementations of all of the algorithms described in this chapter (Pedregosa et al., 2011). Appendix B describes an alternative large-margin classifier, called passive-aggressive. Passive-aggressive is an online learner that seeks to make the smallest update that satisfies the margin constraint at the current instance. It is closely related to MIRA, which was used widely in NLP in the 2000s (Crammer and Singer, 2003). Under contract with MIT Press, shared under CC-BY-NC-ND license. 44 CHAPTER2. LINEARTEXTCLASSIFICATION Exercises There will be exercises at the end of each chapter. In this chapter, the exercises are mostly mathematical, matching the subject material. In other chapters, the exercises will empha- size linguistics or programming. 1. Let x be a bag-of-words vector such that 􏰀Vj=1 xj = 1. Verify that the multinomial probability pmult(x; φ), as defined in Equation 2.12, is identical to the probability of the same document under a categorical distribution, pcat(w; φ). 2. Suppose you have a single feature x, with the following conditional distribution:     α , X = 0 , Y = 0 1−α, X=1,Y=0 p(x|y)=1−β, X =0,Y =1 [2.80] β, X=1,Y =1. Further suppose that the prior is uniform, Pr(Y = 0) = Pr(Y = 1) = 1 , and that 112 both α > 2 and β > 2 . Given a Na ̈ıve Bayes classifier with accurate parameters,
what is the probability of making an error?
3. Derive the maximum-likelihood estimate for the parameter μ in Na ̈ıve Bayes.
4. The classification models in the text have a vector of weights for each possible label. While this is notationally convenient, it is overdetermined: for any linear classifier that can be obtained with K × V weights, an equivalent classifier can be constructed using (K − 1) × V weights.
a) Describe how to construct this classifier. Specifically, if given a set of weights θ and a feature function f(x,y), explain how to construct alternative weights and feature function θ′ and f′(x,y), such that,
∀y,y′ ∈Y,θ·f(x,y)−θ·f(x,y′)=θ′ ·f′(x,y)−θ′ ·f′(x,y′). [2.81]
b) Explain how your construction justifies the well-known alternative form for
binary logistic regression, Pr(Y = 1 | x; θ) = 1 ′ = σ(θ′ · x), where σ 1+exp(−θ ·x)
is the sigmoid function.
5. Suppose you have two labeled datasets D1 and D2, with the same features and la- bels.
• Let θ(1) be the unregularized logistic regression (LR) coefficients from training on dataset D1.
Jacob Eisenstein. Draft of October 15, 2018.

2.8. SUMMARYOFLEARNINGALGORITHMS 45 • Let θ(2) be the unregularized LR coefficients (same model) from training on
dataset D2.
• Let θ∗ be the unregularized LR coefficients from training on the combined
dataset D1 ∪ D2.
Under these conditions, prove that for any feature j,
θ∗ ≥min(θ(1),θ(2)) jjj
θ∗ ≤max(θ(1),θ(2)). jjj
6. Let θˆ be the solution to an unregularized logistic regression problem, and let θ∗ be the solution to the same problem, with L2 regularization. Prove that ||θ∗||2 ≤ ||θˆ||2.
7. As noted in the discussion of averaged perceptron in subsection 2.3.2, the computa- tion of the running sum m ← m + θ is unnecessarily expensive, requiring K ×V op- erations. Give an alternative way to compute the averaged weigh􏰀ts θ, with complex- itythatisindependentofV andlinearinthesumoffeaturesizes Ni=1 |f(x(i),y(i))|.
8. Consider a dataset that is comprised of two identical instances x(1) = x(2) with distinct labels y(1) ̸= y(2). Assume all features are binary, xj ∈ {0, 1} for all j.
Now suppose that the averaged perceptron always trains on the instance (xi(t), yi(t)), where i(t) = 2 − (t mod 2), which is 1 when the training iteration t is odd, and 2 when t is even. Further suppose that learning terminates under the following con- dition:
ε ≥ max􏰽􏰽􏰽􏰽1 􏰁θ(t) − 1 􏰁θ(t−1)􏰽􏰽􏰽􏰽. [2.82] j􏰽tjt−1j􏰽
tt
In words, the algorithm stops when the largest change in the averaged weights is less than or equal to ε. Compute the number of iterations before the averaged per- ceptron terminates.
9. Prove that the margin loss is convex in θ. Use this definition of the margin loss: L(θ) = −θ · f(x, y∗) + max θ · f(x, y) + c(y∗, y), [2.83]
y
where y∗ is the gold label. As a reminder, a function f is convex iff,
f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2), [2.84]
for any x1,x2 and α ∈ [0,1].
Under contract with MIT Press, shared under CC-BY-NC-ND license.

46 10.
CHAPTER2. LINEARTEXTCLASSIFICATION
If a function f is m-strongly convex, then for some m > 0, the following inequality holds for all x and x′ on the domain of the function:
f(x′) ≤ f(x) + (∇xf) · (x′ − x) + m||x′ − x||2. [2.85] 2
Let f(x) = L(θ(t)), representing the loss of the classifier at iteration t of gradient descent; let f(x′) = L(θ(t+1)). Assuming the loss function is m-convex, prove that L(θ(t+1)) ≤ L(θ(t)) for an appropriate constant learning rate η, which will depend on m. Explain why this implies that gradient descent converges when applied to an m-strongly convex loss function with a unique minimum.
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 3
Nonlinear classification
Linear classification may seem like all we need for natural language processing. The bag- of-words representation is inherently high dimensional, and the number of features is often larger than the number of labeled training instances. This means that it is usually possible to find a linear classifier that perfectly fits the training data, or even to fit any ar- bitrary labeling of the training instances! Moving to nonlinear classification may therefore only increase the risk of overfitting. Furthermore, for many tasks, lexical features (words) are meaningful in isolation, and can offer independent evidence about the instance label — unlike computer vision, where individual pixels are rarely informative, and must be evaluated holistically to make sense of an image. For these reasons, natural language processing has historically focused on linear classification.
But in recent years, nonlinear classifiers have swept through natural language pro- cessing, and are now the default approach for many tasks (Manning, 2015). There are at least three reasons for this change.
• There have been rapid advances in deep learning, a family of nonlinear meth- ods that learn complex functions of the input through multiple layers of compu- tation (Goodfellow et al., 2016).
• Deep learning facilitates the incorporation of word embeddings, which are dense vector representations of words. Word embeddings can be learned from large amounts of unlabeled data, and enable generalization to words that do not appear in the an- notated training data (word embeddings are discussed in detail in chapter 14).
• While CPU speeds have plateaued, there have been rapid advances in specialized hardware called graphics processing units (GPUs), which have become faster, cheaper, and easier to program. Many deep learning models can be implemented efficiently on GPUs, offering substantial performance improvements over CPU-based comput- ing.
47

48 CHAPTER3. NONLINEARCLASSIFICATION
This chapter focuses on neural networks, which are the dominant approach for non- linearclassificationinnaturallanguageprocessingtoday.1 Historically,afewothernon- linear learning methods have been applied to language data.
• Kernelmethodsaregeneralizationsofthenearest-neighborclassificationrule,which classifies each instance by the label of the most similar example in the training set. The application of the kernel support vector machine to information extraction is described in chapter 17.
• Decision trees classify instances by checking a set of conditions. Scaling decision trees to bag-of-words inputs is difficult, but decision trees have been successful in problems such as coreference resolution (chapter 15), where more compact feature sets can be constructed (Soon et al., 2001).
• Boostingandrelatedensemblemethodsworkbycombiningthepredictionsofsev- eral “weak” classifiers, each of which may consider only a small subset of features. Boosting has been successfully applied to text classification (Schapire and Singer, 2000) and syntactic analysis (Abney et al., 1999), and remains one of the most suc- cessful methods on machine learning competition sites such as Kaggle (Chen and Guestrin, 2016).
Hastie et al. (2009) provide an excellent overview of these techniques.
3.1 Feedforward neural networks
Consider the problem of building a classifier for movie reviews. The goal is to predict a label y ∈ {GOOD, BAD, OKAY} from a representation of the text of each document, x. But what makes a good movie? The story, acting, cinematography, editing, soundtrack, and so on. Now suppose the training set contains labels for each of these additional features, z = [z1, z2, . . . , zKz ]⊤. With a training set of such information, we could build a two-step classifier:
1. Use the text x to predict the features z. Specifically, train a logistic regression clas- sifier to compute p(zk | x), for each k ∈ {1,2,…,Kz}.
2. Use the features z to predict the label y. Again, train a logistic regression classifier to compute p(y | z). On test data, z is unknown, so we will use the probabilities p(z | x) from the first layer as the features.
This setup is shown in Figure 3.1, which describes the proposed classifier in a computa- tion graph: the text features x are connected to the middle layer z, which is connected to the label y.
1I will use “deep learning” and “neural networks” interchangeably.
Jacob Eisenstein. Draft of October 15, 2018.

3.1. FEEDFORWARDNEURALNETWORKS
49
y
z
x
… …
Figure 3.1: A feedforward neural network. Shaded circles indicate observed features, usually words; squares indicate nodes in the computation graph, which are computed from the information carried over the incoming arrows.
If we assume that each zk is binary, zk ∈ {0, 1}, then the probability p(zk | x) can be modeled using binary logistic regression:
Pr(zk = 1 | x; Θ(x→z)) = σ(θ(x→z) · x) = (1 + exp(−θ(x→z) · x))−1, [3.1] kk
where σ is the sigmoid function (shown in Figure 3.2), and the matrix Θ(x→z) ∈ RKz×V is constructed by stacking the weight vectors for each zk,
Θ(x→z) =[θ(x→z),θ(x→z),…,θ(x→z)]⊤. [3.2] 1 2 Kz
We will assume that x contains a term with a constant value of 1, so that a corresponding offset parameter is included in each θ(x→z).
The output layer is computed by the multi-class logistic regression probability,
k
􏰀 exp(θ(z→y)·z+bj)
Pr(y = j | z;Θ(z→y),b) = j , (z→y)
[3.3]
j′∈Y exp(θj′ ·z+bj′)
where bj is an offset for label j, and the output weight matrix Θ(z→y) ∈ RKy×Kz is again
constructed by concatenation,
Θ(z→y) =[θ(z→y),θ(z→y),…,θ(z→y)]⊤. [3.4]
1 2 Ky
The vector of probabilities over each possible value of y is denoted,
p(y | z; Θ(z→y), b) = SoftMax(Θ(z→y)z + b), [3.5] where element j in the output of the SoftMax function is computed as in Equation 3.3.
This set of equations defines a multilayer classifier, which can be summarized as,
p(z | x; Θ(x→z)) =σ(Θ(x→z)x) [3.6]
p(y | z; Θ(z→y), b) = SoftMax(Θ(z→y)z + b), [3.7] Under contract with MIT Press, shared under CC-BY-NC-ND license.

50 CHAPTER3. NONLINEARCLASSIFICATION
values
3 1.0
derivatives
2 1 0
0.8 0.6 0.4 0.2
sigmoid tanh ReLU
1 0.0
3210123 3210123
Figure 3.2: The sigmoid, tanh, and ReLU activation functions where the function σ is now applied elementwise to the vector of inner products,
σ(Θ(x→z)x) = [σ(θ(x→z) · x), σ(θ(x→z) · x), . . . , σ(θ(x→z) · x)]⊤. [3.8] 1 2 Kz
Now suppose that the hidden features z are never observed, even in the training data. We can still construct the architecture in Figure 3.1. Instead of predicting y from a discrete vector of predicted values z, we use the probabilities σ(θk · x). The resulting classifier is barely changed:
z =σ(Θ(x→z)x) [3.9] p(y | x; Θ(z→y), b) = SoftMax(Θ(z→y)z + b). [3.10]
This defines a classification model that predicts the label y ∈ Y from the base features x, through a“hidden layer” z. This is a feedforward neural network.2
3.2 Designing neural networks
There several ways to generalize the feedforward neural network.
3.2.1 Activation functions
If the hidden layer is viewed as a set of latent features, then the sigmoid function in Equa- tion 3.9 represents the extent to which each of these features is “activated” by a given input. However, the hidden layer can be regarded more generally as a nonlinear trans- formation of the input. This opens the door to many other activation functions, some of which are shown in Figure 3.2. At the moment, the choice of activation functions is more art than science, but a few points can be made about the most popular varieties:
2The architecture is sometimes called a multilayer perceptron, but this is misleading, because each layer is not a perceptron as defined in the previous chapter.
Jacob Eisenstein. Draft of October 15, 2018.

3.2. DESIGNINGNEURALNETWORKS 51
• The range of the sigmoid function is (0, 1). The bounded range ensures that a cas- cade of sigmoid functions will not “blow up” to a huge output, and this is impor- tant for deep networks with several hidden layers. The derivative of the sigmoid is
∂ σ(a) = σ(a)(1 − σ(a)). This derivative becomes small at the extremes, which can ∂a
make learning slow; this is called the vanishing gradient problem.
• The range of the tanh activation function is (−1, 1): like the sigmoid, the range
is bounded, but unlike the sigmoid, it includes negative values. The derivative is
∂ tanh(a) = 1 − tanh(a)2, which is steeper than the logistic function near the ori- ∂a
gin (LeCun et al., 1998). The tanh function can also suffer from vanishing gradients at extreme values.
• The rectified linear unit (ReLU) is zero for negative inputs, and linear for positive inputs (Glorot et al., 2011),
ReLU(a) = 􏰅a, a ≥ 0 [3.11] 0, otherwise.
The derivative is a step function, which is 1 if the input is positive, and zero other- wise. Once the activation is zero, the gradient is also zero. This can lead to the prob- lem of “dead neurons”, where some ReLU nodes are zero for all inputs, throughout learning. A solution is the leaky ReLU, which has a small positive slope for negative inputs (Maas et al., 2013),
Leaky-ReLU(a) = 􏰅a, a ≥ 0 [3.12] .0001a, otherwise.
Sigmoid and tanh are sometimes described as squashing functions, because they squash
an unbounded input into a bounded range. Glorot and Bengio (2010) recommend against
the use of the sigmoid activation in deep networks, because its mean value of 1 can cause 2
the next layer of the network to be saturated, leading to small gradients on its own pa- rameters. Several other activation functions are reviewed in the textbook by Goodfellow et al. (2016), who recommend ReLU as the “default option.”
3.2.2 Network structure
Deep networks stack up several hidden layers, with each z(d) acting as the input to the next layer, z(d+1). As the total number of nodes in the network increases, so does its capacity to learn complex functions of the input. Given a fixed number of nodes, one must decide whether to emphasize width (large Kz at each layer) or depth (many layers). At present, this tradeoff is not well understood.3
3With even a single hidden layer, a neural network can approximate any continuous function on a closed andboundedsubsetofRN toanarbitrarilysmallnon-zeroerror;seesection6.4.1ofGoodfellowetal.(2016)
Under contract with MIT Press, shared under CC-BY-NC-ND license.

52 CHAPTER3. NONLINEARCLASSIFICATION
It is also possible to “short circuit” a hidden layer, by propagating information directly from the input to the next higher level of the network. This is the idea behind residual net- works, which propagate information directly from the input to the subsequent layer (He et al., 2016),
z = f(Θ(x→z)x) + x, [3.13]
where f is any nonlinearity, such as sigmoid or ReLU. A more complex architecture is the highway network (Srivastava et al., 2015; Kim et al., 2016), in which an addition gate controls an interpolation between f(Θ(x→z)x) and x,
t =σ(Θ(t)x + b(t)) [3.14] z =t⊙f(Θ(x→z)x)+(1−t)⊙x, [3.15]
where ⊙ refers to an elementwise vector product, and 1 is a column vector of ones. As before, the sigmoid function is applied elementwise to its input; recall that the output of this function is restricted to the range (0, 1). Gating is also used in the long short-term memory (LSTM), which is discussed in chapter 6. Residual and highway connections address a problem with deep architectures: repeated application of a nonlinear activation function can make it difficult to learn the parameters of the lower levels of the network, which are too distant from the supervision signal.
3.2.3 Outputs and loss functions
In the multi-class classification example, a softmax output produces probabilities over each possible label. This aligns with a negative conditional log-likelihood,
[3.16]
[3.17] [3.18]
log p(y(i) | x(i); Θ). This loss can be written alternatively as follows:
y ̃j 􏱕 Pr(y = j | x(i); Θ)
−L = −
where Θ = {Θ(x→z), Θ(z→y), b} is the entire set of parameters.
−L=−
􏰁N i=1
ey(i) ·logy ̃
􏰁N i=1
for a survey of these theoretical results. However, depending on the function to be approximated, the width of the hidden layer may need to be arbitrarily large. Furthermore, the fact that a network has the capacity to approximate any given function does not imply that it is possible to learn the function using gradient-based optimization.
Jacob Eisenstein. Draft of October 15, 2018.

3.3. LEARNINGNEURALNETWORKS 53
whereey(i) isaone-hotvectorofzeroswithavalueof1atpositiony(i).Theinnerproduct betweeney(i) andlogy ̃isalsocalledthemultinomialcross-entropy,andthisterminology is preferred in many neural networks papers and software packages.
It is also possible to train neural networks from other objectives, such as a margin loss. In this case, it is not necessary to use softmax at the output layer: an affine transformation of the hidden layer is enough:
Ψ(y; x(i), Θ) =θ(z→y) · z + b [3.19] y􏰛y 􏰜
lMARGIN(Θ;x(i),y(i))= max 1+Ψ(y;x(i),Θ)−Ψ(y(i);x(i),Θ) . [3.20] y̸=y(i) +
In regression problems, the output is a scalar or vector (see subsection 4.1.2). For these problems, a typical loss function is the squared error (y − yˆ)2 or squared norm ||y − yˆ||2.
3.2.4 Inputs and lookup layers
In text classification, the input layer x can refer to a bag-o􏰀f-words vector, where xj is the count of word j. The input to the hidden unit zk is then V θ(x→z)xj, and word j is
j=1 j,k
represented by the vector θ(x→z). This vector is sometimes described as the embedding of
j
word j, and can be learned from unlabeled data, using techniques discussed in chapter 14. The columns of Θ(x→z) are each Kz-dimensional word embeddings.
Chapter 2 presented an alternative view of text documents, as a sequence of word tokens, w1, w2, . . . , wM . In a neural network, each word token wm is represented with a one-hot vector, ewm , with dimension V . The matrix-vector product Θ(x→z)ewm returns the embedding of word wm. The complete document can represented by horizontally concatenating these one-hot vectors, W = [ew1 , ew2 , . . . , ewM ], and the bag-of-words rep- resentation can be recovered from the matrix-vector product W[1, 1, . . . , 1]⊤, which sums each row over the tokens m = {1, 2, . . . , M}. The matrix product Θ(x→z)W contains the horizontally concatenated embeddings of each word in the document, which will be useful as the starting point for convolutional neural networks (see section 3.4). This is sometimes called a lookup layer, because the first step is to lookup the embeddings for each word in the input text.
3.3 Learning neural networks
The feedforward network in Figure 3.1 can now be written as,
z ←f(Θ(x→z)x(i))
y ̃ ← SoftMax 􏰛Θ(z→y)z + b􏰜
l(i) ← − ey(i) · log y ̃,
Under contract with MIT Press, shared under CC-BY-NC-ND license.
[3.21] [3.22] [3.23]

54 CHAPTER3. NONLINEARCLASSIFICATION
where f is an elementwise activation function, such as σ or ReLU, and l(i) is the loss at instance i. The parameters Θ(x→z), Θ(z→y), and b can be estimated using online gradient- based optimization. The simplest such algorithm is stochastic gradient descent, which was discussed in section 2.6. Each parameter is updated by the gradient of the loss,
b ←b − η(t)∇bl(i)
θ(z→y) ←θ(z→y) − η(t)∇ (z→y) l(i) k k θk
and θ(x→z) is column n of the matrix Θ(x→z), and θ(z→y) is column k of Θ(z→y). nk
The gradients of the negative log-likelihood on b and θ(z→y) are similar to the gradi-
[3.24] [3.25]
[3.26] where η(t) is the learning rate on iteration t, l(i) is the loss on instance (or minibatch) i,
ents in logistic regression. For θ(z→y), the gradient is,
k
θ(x→z) ←θ(x→z) − η(t)∇ (x→z) l(i), n n θn
 ⊤ (i)  ∂l(i) ∂l(i) ∂l(i) 
∇θ(z→y) l = ∂θ(z→y) , ∂θ(z→y) , . . . , ∂θ(z→y)
[3.27]
k
gradient ∇bl(i) is similar to Equation 3.29.
The gradients on the input layer weights Θ(x→z) are obtained by the chain rule of
k,1 k,2 ∂l(i) ∂ (z→y)
k,Ky 
􏰁 (z→y) 
y∈Y
∂θ(z→y) =−∂θ(z→y) θy(i) k,j k,j
·z−log
= 􏰛Pr(y = j | z; Θ(z→y), b) − δ 􏰛j = y(i)􏰜􏰜 zk,
expθy ·z
where δ 􏰗j = y(i)􏰘 is a function that returns one when j = y(i), and zero otherwise. The
differentiation:
∂l(i) ∂l(i) ∂zk ∂θ(x→z) = ∂zk ∂θ(x→z)
[3.30]
[3.31]
[3.32]
[3.28] [3.29]
n,k
n,k
∂l(i) ∂f(θ(x→z) · x) k
=
= ∂zk
∂θ(x→z) n,k
′ (x→z) ×f (θk
∂zk ∂l(i)
·x)×xn,
where f′(θ(x→z) · x) is the derivative of the activation function f, applied at the input k
Jacob Eisenstein. Draft of October 15, 2018.

3.3. LEARNINGNEURALNETWORKS
55
θ(x→z) · x. For example, if f is the sigmoid function, then the derivative is, k
∂l(i) ∂l(i) (x→z) (x→z) ∂θ(x→z) = ∂zk ×σ(θk ·x)×(1−σ(θk
n,k
∂ l(i)
= ∂zk ×zk ×(1−zk)×xn.
For intuition, consider each of the terms in the product.
·x))×xn
[3.33] [3.34]
≈ 0. In this
• If the negative log-likelihood l(i) does not depend much on zk, then ∂l(i) ∂zk
case it doesn’t matter how z
is computed, and so ∂l(i) ≈ 0. k ∂θ(x→z)
n,k
• If zk is near 1 or 0, then the curve of the sigmoid function is nearly flat (Figure 3.2),
and changing the inputs will make little local difference. The term zk × (1 − zk) is
maximized at zk = 1 , where the slope of the sigmoid function is steepest. 2
n,k
The equations above rely on the chain rule to compute derivatives of the loss with respect to each parameter of the model. Furthermore, local derivatives are frequently reused: for
example, ∂l(i) is reused in computing the derivatives with respect to each θ(x→z). These ∂zk n,k
terms should therefore be computed once, and then cached. Furthermore, we should only compute any derivative once we have already computed all of the necessary “inputs” demanded by the chain rule of differentiation. This combination of sequencing, caching, and differentiation is known as backpropagation. It can be generalized to any directed acyclic computation graph.
A computation graph is a declarative representation of a computational process. At each node t, compute a value vt by applying a function ft to a (possibly empty) list of parent nodes, πt. Figure 3.3 shows the computation graph for a feedforward network with one hidden layer. There are nodes for the input x(i), the hidden layer z, the predicted output yˆ, and the parameters Θ. During training, there is also a node for the ground truth label y(i) and the loss l(i). The predicted output yˆ is one of the parents of the loss (the other is the label y(i)); its parents include Θ and z, and so on.
Computation graphs include three types of nodes:
Variables. In the feedforward network of Figure 3.3, the variables include the inputs x, the hidden nodes z, the outputs y, and the loss function. Inputs are variables that do not have parents. Backpropagation computes the gradients with respect to all
Under contract with MIT Press, shared under CC-BY-NC-ND license.
= 0, then it does not matter how we set the weights θ(x→z), so ∂l(i) = 0. n n,k ∂θ(x→z)
• If x
3.3.1 Backpropagation

56 CHAPTER3. NONLINEARCLASSIFICATION
Algorithm 6 General backpropagation algorithm. In the computation graph G, every node contains a function ft and a set of parent nodes πt; the inputs to the graph are x(i).
1:
2: 3: 4: 5:
6: 7: 8:
9:
procedureBACKPROP(G={ft,πt}Tt=1},x(i))
← x(i) for all n and associated computation nodes t(n). t(n) n
v
for t ∈ TOPOLOGICALSORT(G) do ◃ Forward pass: compute value at each node
if |πt| > 0 then
vt ← ft(vπt,1,vπt,2,…,vπt,Nt )
gobjective = 1 ◃ Backward pass: compute gradients at each node for t ∈ R􏰀EVERSE(TOPOLOGICALSORT(G)) do
gt ← t′ :t∈πt′ gt′ × ∇vt vt′ ◃ Sum over all t′ that are children of t, propagating the gradient gt′ , scaled by the local gradient ∇vt vt′
return {g1, g2, . . . , gT }
variables except the inputs, and propagates these gradients backwards to the pa-
rameters.
Parameters. In a feedforward network, the parameters include the weights and offsets. In Figure 3.3, the parameters are summarized in the node Θ, but we could have separate nodes for Θ(x→z), Θ(z→y), and any offset parameters. Parameter nodes do not have parents; they are not computed from other nodes, but rather, are learned by gradient descent.
Loss. The loss l(i) is the quantity that is to be minimized during training. The node rep- resenting the loss in the computation graph is not the parent of any other node; its parents are typically the predicted label yˆ and the true label y(i). Backpropagation begins by computing the gradient of the loss, and then propagating this gradient backwards to its immediate parents.
If the computation graph is a directed acyclic graph, then it is possible to order the nodes with a topological sort, so that if node t is a parent of node t′, then t < t′. This means that the values {vt}Tt=1 can be computed in a single forward pass. The topolog- ical sort is reversed when computing gradients: each gradient gt is computed from the gradients of the children of t, implementing the chain rule of differentiation. The general backpropagation algorithm for computation graphs is shown in Algorithm 6. While the gradients with respect to each parameter may be complex, they are com- posed of products of simple parts. For many networks, all gradients can be computed through automatic differentiation. This means that you need only specify the feedfor- ward computation, and the gradients necessary for learning can be obtained automati- cally. There are many software libraries that perform automatic differentiation on compu- Jacob Eisenstein. Draft of October 15, 2018. 3.3. LEARNINGNEURALNETWORKS 57 vx vz vyˆ vy z yˆ l(i) x(i) y(i) gz gyˆ gl gl vΘ gz gyˆ vΘ Θ Figure 3.3: A computation graph for the feedforward neural network shown in Figure 3.1. tation graphs, such as TORCH (Collobert et al., 2011), TENSORFLOW (Abadi et al., 2016), and DYNET (Neubig et al., 2017). One important distinction between these libraries is whether they support dynamic computation graphs, in which the structure of the compu- tation graph varies across instances. Static computation graphs are compiled in advance, and can be applied to fixed-dimensional data, such as bag-of-words vectors. In many nat- ural language processing problems, each input has a distinct structure, requiring a unique computation graph. A simple case occurs in recurrent neural network language models (see chapter 6), in which there is one node for each word in a sentence. More complex cases include recursive neural networks (see chapter 14), in which the network is a tree structure matching the syntactic organization of the input. 3.3.2 Regularization and dropout In linear classification, overfitting was addressed by augmenting the objective with a reg- ularization term, λ||θ||2. This same approach can be applied to feedforward neural net- works, penalizing each matrix of weights: 􏰁N i=1 to matrices. The bias parameters b are not regularized, as they do not contribute to the sensitivity of the classifier to the inputs. In gradient-based optimization, the practical effect of Frobenius norm regularization is that the weights “decay” towards zero at each update, motivating the alternative name weight decay. Another approach to controlling model complexity is dropout, which involves ran- domly setting some computation nodes to zero during training (Srivastava et al., 2014). For example, in the feedforward network, on each training instance, with probability ρ we Under contract with MIT Press, shared under CC-BY-NC-ND license. l(i) + λz→y||Θ(z→y)||2F + λx→z||Θ(x→z)||2F , [3.35] L = where ||Θ||2 = 􏰀 θ2 is the squared Frobenius norm, which generalizes the L2 norm F i,j i,j 58 CHAPTER3. NONLINEARCLASSIFICATION set each input xn and each hidden layer node zk to zero. Srivastava et al. (2014) recom- mend ρ = 0.5 for hidden units, and ρ = 0.2 for input units. Dropout is also incorporated in the gradient computation, so if node zk is dropped, then none of the weights θ(x→z) will be updated for this instance. Dropout prevents the network from learning to depend too much on any one feature or hidden node, and prevents feature co-adaptation, in which a hidden unit is only useful in combination with one or more other hidden units. Dropout is a special case of feature noising, which can also involve adding Gaussian noise to inputs or hidden units (Holmstrom and Koistinen, 1992). Wager et al. (2013) show that dropout is approximately equivalent to “adaptive” L2 regularization, with a separate regularization penalty for each feature. 3.3.3 *Learning theory Chapter 2 emphasized the importance of convexity for learning: for convex objectives, the global optimum can be found efficiently. The negative log-likelihood and hinge loss are convex functions of the parameters of the output layer. However, the output of a feed- forward network is generally not a convex function of the parameters of the input layer, Θ(x→z). Feedforward networks can be viewed as function composition, where each layer is a function that is applied to the output of the previous layer. Convexity is generally not preserved in the composition of two convex functions — and furthermore, “squashing” activation functions like tanh and sigmoid are not convex. The non-convexity of hidden layer neural networks can also be seen by permuting the elements of the hidden layer, from z = [z1, z2, . . . , zKz ] to z ̃ = [zπ(1), zπ(2), . . . , zπ(Kz )]. This corresponds to applying π to the rows of Θ(x→z) and the columns of Θ(z→y), resulting in permuted parameter matrices Θ(x→z) and Θ(z→y). As long as this permutation is applied ππ consistently, the loss will be identical, L(Θ) = L(Θπ): it is invariant to this permutation. However, the loss of the linear combination L(αΘ + (1 − α)Θπ) will generally not be identical to the loss under Θ or its permutations. If L(Θ) is better than the loss at any points in the immediate vicinity, and if L(Θ) = L(Θπ), then the loss function does not satisfy the definition of convexity (see section 2.4). One of the exercises asks you to prove this more rigorously. In practice, the existence of multiple optima is not necessary problematic, if all such optima are permutations of the sort described in the previous paragraph. In contrast, “bad” local optima are better than their neighbors, but much worse than the global op- timum. Fortunately, in large feedforward neural networks, most local optima are nearly as good as the global optimum (Choromanska et al., 2015). More generally, a critical point is one at which the gradient is zero. Critical points may be local optima, but they may also be saddle points, which are local minima in some directions, but local maxima in other directions. For example, the equation x21 − x2 has a saddle point at x = (0, 0). In large networks, the overwhelming majority of critical points are saddle points, rather Jacob Eisenstein. Draft of October 15, 2018. k 3.3. LEARNINGNEURALNETWORKS 59 than local minima or maxima (Dauphin et al., 2014). Saddle points can pose problems for gradient-based optimization, since learning will slow to a crawl as the gradient goes to zero. However, the noise introduced by stochastic gradient descent, and by feature noising techniques such as dropout, can help online optimization to escape saddle points and find high-quality optima (Ge et al., 2015). Other techniques address saddle points directly, using local reconstructions of the Hessian matrix (Dauphin et al., 2014) or higher- order derivatives (Anandkumar and Ge, 2016). Another theoretical puzzle about neural networks is how they are able to generalize to unseen data. Given enough parameters, a two-layer feedforward network can “memo- rize” its training data, attaining perfect accuracy on any training set. A particularly salient demonstration was provided by Zhang et al. (2017), who showed that neural networks can learn to perfectly classify a training set of images, even when the labels are replaced with random values! Of course, this network attains only chance accuracy when applied to heldout data. The concern is that when such a powerful learner is applied to real training data, it may learn a pathological classification function, which exploits irrelevant details of the training data and fails to generalize. Yet this extreme overfitting is rarely en- countered in practice, and can usually be prevented by regularization, dropout, and early stopping (see subsection 3.3.4). Recent papers have derived generalization guarantees for specific classes of neural networks (e.g., Kawaguchi et al., 2017; Brutzkus et al., 2018), but theoretical work in this area is ongoing. 3.3.4 Tricks Getting neural networks to work sometimes requires heuristic “tricks” (Bottou, 2012; Goodfellow et al., 2016; Goldberg, 2017b). This section presents some tricks that are espe- cially important. Initialization Initialization is not especially important for linear classifiers, since con- vexity ensures that the global optimum can usually be found quickly. But for multilayer neural networks, it is helpful to have a good starting point. One reason is that if the mag- nitude of the initial weights is too large, a sigmoid or tanh nonlinearity will be saturated, leading to a small gradient, and slow learning. Large gradients can cause training to di- verge, with the parameters taking increasingly extreme values until reaching the limits of the floating point representation. Initialization can help avoid these problems by ensuring that the variance over the initial gradients is constant and bounded throughout the network. For networks with tanh activation functions, this can be achieved by sampling the initial weights from the Under contract with MIT Press, shared under CC-BY-NC-ND license. 60 CHAPTER3. NONLINEARCLASSIFICATION following uniform distribution (Glorot and Bengio, 2010), θi,j ∼U􏱊−􏰟 √6 ,􏰟 √6 􏱋, [3.36] din(n) + dout(n) din(n) + dout(n) [3.37] For the weights leading to a ReLU activation function, He et al. (2015) use similar argu- mentation to justify sampling from a zero-mean Gaussian distribution, θi,j ∼ N(0,􏰟2/din(n)) [3.38] Rather than initializing the weights independently, it can be beneficial to initialize each layer jointly as an orthonormal matrix, ensuring that Θ⊤Θ = I (Saxe et al., 2014). Or- thonormal matrices preserve the norm of the input, so that ||Θx|| = ||x||, which prevents the gradients from exploding or vanishing. Orthogonality ensures that the hidden units are uncorrelated, so that they correspond to different features of the input. Orthonormal initialization can be performed by applying singular value decomposition to a matrix of values sampled from a standard normal distribution: ai,j ∼N(0,1) A ={ai,j}din(j),dout(j) i=1,j =1 U,S,V⊤ =SVD(A) Θ(j) ←U. The matrix U contains the singular vectors of A, and is guaranteed to be orthonormal. For more on singular value decomposition, see chapter 14. Even with careful initialization, there can still be significant variance in the final re- sults. It can be useful to make multiple training runs, and select the one with the best performance on a heldout development set. Clipping and normalization Learning can be sensitive to the magnitude of the gradient: too large, and learning can diverge, with successive updates thrashing between increas- ingly extreme values; too small, and learning can grind to a halt. Several heuristics have been proposed to address this issue. • In gradient clipping (Pascanu et al., 2013), an upper limit is placed on the norm of the gradient, and the gradient is rescaled when this limit is exceeded, CLIP(g ̃) = 􏰅g ||gˆ|| < τ [3.43] τ g otherwise. [3.39] [3.40] [3.41] [3.42] ||g|| Jacob Eisenstein. Draft of October 15, 2018. 3.3. LEARNINGNEURALNETWORKS 61 • In batch normalization (Ioffe and Szegedy, 2015), the inputs to each computation node are recentered by their mean and variance across all of the instances in the minibatch B (see subsection 2.6.2). For example, in a feedforward network with one hidden layer, batch normalization would tranform the inputs to the hidden layer as follows: [3.44] [3.45] [3.46] μ(B) = 1 􏰁x(i) |B| 􏰁 i∈B s(B) = 1 (x(i) − μ(B))2 |B| i∈B 􏰟 x(i) =(x(i) − μ(B))/ s(B). Empirically, this speeds convergence of deep architectures. One explanation is that it helps to correct for changes in the distribution of activations during training. • In layer normalization (Ba et al., 2016), the inputs to each nonlinear activation func- tion are recentered across the layer: a =Θ(x→z)x 1 􏰁K z μ =K ak z k=1 1 􏰁K z s=K (ak −μ)2 z k=1 √ z =(a − μ)/ s. [3.47] [3.48] [3.49] [3.50] Layer normalization has similar motivations to batch normalization, but it can be applied across a wider range of architectures and training conditions. Online optimization There is a cottage industry of online optimization algorithms that attempt to improve on stochastic gradient descent. AdaGrad was reviewed in subsec- tion 2.6.2; its main innovation is to set adaptive learning rates for each parameter by stor- ing the sum of squared gradients. Rather than using the sum over the entire training history, we can keep a running estimate, v(t) =βv(t−1) + (1 − β)g2 , [3.51] jj t,j where gt,j is the gradient with respect to parameter j at time t, and β ∈ [0, 1]. This term places more emphasis on recent gradients, and is employed in the AdaDelta (Zeiler, 2012) and Adam (Kingma and Ba, 2014) optimizers. Online optimization and its theoretical background are reviewed by Bottou et al. (2016). Early stopping, mentioned in subsec- tion 2.3.2, can help to avoid overfitting by terminating training after reaching a plateau in the performance on a heldout validation set. Under contract with MIT Press, shared under CC-BY-NC-ND license. 62 CHAPTER3. NONLINEARCLASSIFICATION Practical advice The bag of tricks for training neural networks continues to grow, and it is likely that there will be several new ones by the time you read this. Today, it is standard practice to use gradient clipping, early stopping, and a sensible initialization of parameters to small random values. More bells and whistles can be added as solutions to specific problems — for example, if it is difficult to find a good learning rate for stochastic gradient descent, then it may help to try a fancier optimizer with an adaptive learning rate. Alternatively, if a method such as layer normalization is used by related models in the research literature, you should probably consider it, especially if you are having trouble matching published results. As with linear classifiers, it is important to evaluate these decisions on a held-out development set, and not on the test set that will be used to provide the final measure of the model’s performance (see subsection 2.2.5). 3.4 Convolutional neural networks A basic weakness of the bag-of-words model is its inability to account for the ways in which words combine to create meaning, including even simple reversals such as not pleasant, hardly a generous offer, and I wouldn’t mind missing the flight. Computer vision faces the related challenge of identifying the semantics of images from pixel features that are uninformative in isolation. An earlier generation of computer vision research focused on designing filters to aggregate local pixel-level features into more meaningful representations, such as edges and corners (e.g., Canny, 1987). Similarly, earlier NLP re- search attempted to capture multiword linguistic phenomena by hand-designed lexical patterns (Hobbs et al., 1997). In both cases, the output of the filters and patterns could then act as base features in a linear classifier. But rather than designing these feature ex- tractors by hand, a better approach is to learn them, using the magic of backpropagation. This is the idea behind convolutional neural networks. Following subsection 3.2.4, define the base layer of a neural network as, X(0) = Θ(x→z)[ew1,ew2,...,ewM ], [3.52] whereewm isacolumnvectorofzeros,witha1atpositionwm.Thebaselayerhasdimen- sion X(0) ∈ RKe ×M , where Ke is the size of the word embeddings. To merge information across adjacent words, we convolve X(0) with a set of filter matrices C(k) ∈ RKe×h. Convo- lution is indicated by the symbol ∗, and is defined, 􏰝Keh 􏰞 X(1) =f(b+C∗X(0)) =⇒ x(1) =f b +􏰁􏰁c(k) ×x(0) , [3.53] where f is an activation function such as tanh or ReLU, and b is a vector of offsets. The convolution operation slides the matrix C(k) across the columns of X(0). At each position m, we compute the elementwise product C(k) ⊙ X(0) , and take the sum. m:m+h−1 Jacob Eisenstein. Draft of October 15, 2018. k,m k k′ ,n k′ ,m+n−1 k′=1n=1 3.4. CONVOLUTIONALNEURALNETWORKS 63 X(1) Kf Kf * pooling prediction C C Ke convolution M X(0) z Figure 3.4: A convolutional neural network for text classification A simple filter might compute a weighted average over nearby words, 0.5 1 0.5 C(k) = 0.5 1 0.5 , [3.54] ... ... ... 0.5 1 0.5 thereby representing trigram units like not so unpleasant. In one-dimensional convolu- tion, each filter matrix C(k) is constrained to have non-zero values only at row k (Kalch- brenner et al., 2014). This means that each dimension of the word embedding is processed by a separate filter, and it implies that Kf = Ke. To deal with the beginning and end of the input, the base matrix X(0) may be padded with h column vectors of zeros at the beginning and end; this is known as wide convolu- tion. If padding is not applied, then the output from each layer will be h − 1 units smaller than the input; this is known as narrow convolution. The filter matrices need not have identical filter widths, so more generally we could write hk to indicate to width of filter C(k). As suggested by the notation X(0), multiple layers of convolution may be applied, so that X(d) is the input to X(d+1). AfterDconvolutionallayers,weobtainamatrixrepresentationofthedocumentX(D) ∈ RKz ×M . If the instances have variable lengths, it is necessary to aggregate over all M word positions to obtain a fixed-length representation. This can be done by a pooling operation, Under contract with MIT Press, shared under CC-BY-NC-ND license. 64 CHAPTER3. NONLINEARCLASSIFICATION Figure 3.5: A dilated convolutional neural network captures progressively larger context through recursive application of the convolutional operator such as max-pooling (Collobert et al., 2011) or average-pooling, 􏰁􏰛 ( D ) ( D ) ( D ) 􏰜 zk = max xk,1 ,xk,2 ,...xk,M m=1 Just as in feedforward networks, the parameters (C(k),b,Θ) can be learned by back- propagating from the classification loss. This requires backpropagating through the max- pooling operation, which is a discontinuous function of the input. But because we need only a local gradient, backpropagation flows only through the argmax m: ( D ) z = MaxPool(X ) =⇒ yˆ and a loss l(i). The setup is shown in Figure 3.4. 1 M (D) [3.55] [3.56] (D) z = AvgPool(X ) =⇒ zk = M The vector z can now act as a layer in a feedforward network, culminating in a prediction ∂z 􏰅1, x(D) = max􏰛x(D),x(D),...x(D) 􏰜 k,m k,1 k,2 k,M [3.57] 0, otherwise. k ∂x(D) k,m = xk,m. The computer vision literature has produced a huge variety of convolutional archi- tectures, and many of these innovations can be applied to text data. One avenue for improvement is more complex pooling operations, such as k-max pooling (Kalchbrenner et al., 2014), which returns a matrix of the k largest values for each filter. Another innova- tion is the use of dilated convolution to build multiscale representations (Yu and Koltun, 2016). At each layer, the convolutional operator applied in strides, skipping ahead by s steps after each feature. As we move up the hierarchy, each layer is s times smaller than the layer below it, effectively summarizing the input (Kalchbrenner et al., 2016; Strubell et al., 2017). This idea is shown in Figure 3.5. Multi-layer convolutional networks can also be augmented with “shortcut” connections, as in the residual network from subsec- tion 3.2.2 (Johnson and Zhang, 2017). Jacob Eisenstein. Draft of October 15, 2018. 3.4. CONVOLUTIONALNEURALNETWORKS 65 Additional resources The deep learning textbook by Goodfellow et al. (2016) covers many of the topics in this chapter in more detail. For a comprehensive review of neural networks in natural lan- guage processing, see Goldberg (2017b). A seminal work on deep learning in natural language processing is the aggressively titled “Natural Language Processing (Almost) from Scratch”, which uses convolutional neural networks to perform a range of language processing tasks (Collobert et al., 2011), although there is earlier work (e.g., Henderson, 2004). This chapter focuses on feedforward and convolutional neural networks, but recur- rent neural networks are one of the most important deep learning architectures for natural language processing. They are covered extensively in chapters 6 and 7. The role of deep learning in natural language processing research has caused angst in some parts of the natural language processing research community (e.g., Goldberg, 2017a), especially as some of the more zealous deep learning advocates have argued that end-to-end learning from “raw” text can eliminate the need for linguistic constructs such as sentences, phrases, and even words (Zhang et al., 2015, originally titled “Text under- standing from scratch”). These developments were surveyed by Manning (2015). While reports of the demise of linguistics in natural language processing remain controversial at best, deep learning and backpropagation have become ubiquitous in both research and applications. Exercises 1. Figure3.3showsthecomputationgraphforafeedforwardneuralnetworkwithone layer. a) Updatethecomputationgraphtoincludearesidualconnectionbetweenxand z. b) Update the computation graph to include a highway connection between x and z. 2. Prove that the softmax and sigmoid functions are equivalent when the number of possible labels is two. Specifically, for any Θ(z→y) (omitting the offset b for simplic- ity), show how to construct a vector of weights θ such that, SoftMax(Θ(z→y)z)[0] = σ(θ · z). [3.58] 3. Convolutionalneuralnetworksoftenaggregateacrosswordsbyusingmax-pooling (Equation 3.55 in section 3.4). A potential concern is that there is zero gradient with respect to the parts of the input that are not included in the maximum. The following Under contract with MIT Press, shared under CC-BY-NC-ND license. 66 CHAPTER3. NONLINEARCLASSIFICATION questions consider the gradient with respect to an element of the input, x(0) , and m,k they assume that all parameters are independently distributed. a) First consider a minimal network, with z = MaxPool(X(0)). What is the prob- ability that the gradient ∂l is non-zero? ∂ x(0) m,k b) Now consider a two-level network, with X(1) = f(b + C ∗ X(0)). Express the probability that the gradient ∂l is non-zero, in terms of the input length M, the filter size n, and the number of filters Kf . c) Using a calculator, work out the probability for the case M = 128, n = 4, Kf = 32. d) Now consider a three-level network, X(2) = f(b + C ∗ X(1)). Give the general equation for the probability that ∂l is non-zero, and compute the numerical ∂ x(0) m,k probability for the scenario in the previous part, assuming Kf = 32 and n = 4 at both levels. ∂ x(0) m,k 4. Design a feedforward network to compute the XOR function: =1,x=1 2 Your network􏰅should have a single output node which uses the Sign activation func- tion, f(x) = 1, x > 0 . Use a single hidden layer, with ReLU activation func-
−1, x≤0.
tions. Describe all weights and offsets.
Consider the same network as above (with ReLU activations for the hidden layer), with an arbitrary differentiable loss function l(y(i), y ̃), where y ̃ is the activation of the output node. Suppose all weights and offsets are initialized to zero. Show that gradient descent will not learn the desired function from this initialization.
The simplest solution to the previous problem relies on the use of the ReLU activa- tion function at the hidden layer. Now consider a network with arbitrary activations on the hidden layer. Show that if the initial weights are any uniform constant, then gradient descent will not learn the desired function from this initialization.
Consider a network in which: the base features are all binary, x ∈ {0, 1}M ; the hidden layer activation function is sigmoid, zk = σ(θk · x); and the initial weights are sampled independently from a standard normal distribution, θj,k ∼ N (0, 1).
Jacob Eisenstein. Draft of October 15, 2018.
5.
6.
7.
−1, x f(x1,x2)= 1, x1
 1 1, x
=1,x2 =0. =0,x=1
[3.59]
 1 −1, x1
2 =0,x2=0

3.4. CONVOLUTIONALNEURALNETWORKS
67
• Show how the probability of a small initial gradient on any weight, ∂zk ∂ θj,k
depends on the size of the input M . Hint: use the lower bound, Pr(σ(θk ·x)×(1−σ(θk ·x))<α) ≥ 2Pr(σ(θk ·x)<α), and relate this probability to the variance V [θk · x]. • Design an alternative initialization that removes this dependence. < α, [3.60] 8. The ReLU activation function can lead to “dead neurons”, which can never be acti- vated on any input. Consider the following two-layer feedforward network with a scalar output y: zi =ReLU(θ(x→z) · x + bi) [3.61] i y =θ(z→y) · z. [3.62] Suppose that the input is a binary vector of observations, x ∈ {0, 1}D . a) Under what condition is node zi “dead”? Your answer should be expressed in terms of the parameters θ(x→z) and bi. b) Suppose that the gradient of the loss on a given instance is ∂l = 1. Derive the ∂y gradients ∂ l and ∂ l for such an instance. ∂bi ∂θ(x→z) j,i c) Using your answers to the previous two parts, explain why a dead neuron can never be brought back to life during gradient-based learning. 9. Suppose that the parameters Θ = {Θ(x→z), Θ(z → y), b} are a local optimum of a feedforward network in the following sense: there exists some ε > 0 such that,
􏰛||Θ ̃ (x→z) − Θ(x→z)||2F + ||Θ ̃ (z→y) − Θ(z→y)||2F + ||b ̃ − b||2 < ε􏰜 ⇒􏰛L(Θ ̃)>L(Θ)􏰜 [3.63]
Define the function π as a permutation on the hidden units, as described in subsec- tion 3.3.3, so that for any Θ, L(Θ) = L(Θπ ). Prove that if a feedforward network has a local optimum in the sense of Equation 3.63, then its loss is not a convex function of the parameters Θ, using the definition of convexity from section 2.4
10. Consider a network with a single hidden layer, and a single output,
y = θ(z→y) · g(Θ(x→z)x). [3.64]
Assume that g is the ReLU function. Show that for any matrix of weights Θ(x→z), it is permissible to rescale each row to have a norm of one, because an identical output can be obtained by finding a corresponding rescaling of θ(z→y).
Under contract with MIT Press, shared under CC-BY-NC-ND license.
i

Chapter 4
Linguistic applications of classification
Having covered several techniques for classification, this chapter shifts the focus from mathematics to linguistic applications. Later in the chapter, we will consider the design decisions involved in text classification, as well as best practices for evaluation.
4.1 Sentiment and opinion analysis
A popular application of text classification is to automatically determine the sentiment or opinion polarity of documents such as product reviews and social media posts. For example, marketers are interested to know how people respond to advertisements, ser- vices, and products (Hu and Liu, 2004); social scientists are interested in how emotions are affected by phenomena such as the weather (Hannak et al., 2012), and how both opin- ions and emotions spread over social networks (Coviello et al., 2014; Miller et al., 2011). In the field of digital humanities, literary scholars track plot structures through the flow of sentiment across a novel (Jockers, 2015).1
Sentiment analysis can be framed as a direct application of document classification, assuming reliable labels can be obtained. In the simplest case, sentiment analysis is a two or three-class problem, with sentiments of POSITIVE, NEGATIVE, and possibly NEU- TRAL. Such annotations could be annotated by hand, or obtained automatically through a variety of means:
• Tweets containing happy emoticons can be marked as positive, sad emoticons as negative (Read, 2005; Pak and Paroubek, 2010).
1Comprehensive surveys on sentiment analysis and related problems are offered by Pang and Lee (2008) and Liu (2015).
69

70
CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
• •
Reviews with four or more stars can be marked as positive, three or fewer stars as negative (Pang et al., 2002).
Statements from politicians who are voting for a given bill are marked as positive (towards that bill); statements from politicians voting against the bill are marked as negative (Thomas et al., 2006).
The bag-of-words model is a good fit for sentiment analysis at the document level: if the document is long enough, we would expect the words associated with its true senti- ment to overwhelm the others. Indeed, lexicon-based sentiment analysis avoids machine learning altogether, and classifies documents by counting words against positive and neg- ative sentiment word lists (Taboada et al., 2011).
Lexicon-based classification is less effective for short documents, such as single-sentence reviews or social media posts. In these documents, linguistic issues like negation and ir- realis (Polanyi and Zaenen, 2006) — events that are hypothetical or otherwise non-factual — can make bag-of-words classification ineffective. Consider the following examples:
(4.1) a. b. c. d.
e.
That’s not bad for the first day.
This is not the worst thing that can happen.
It would be nice if you acted like you understood.
There is no reason at all to believe that the polluters are suddenly going to become reasonable. (Wilson et al., 2005)
This film should be brilliant. The actors are first grade. Stallone plays a happy, wonderful man. His sweet wife is beautiful and adores him. He has a fascinating gift for living life fully. It sounds like a great plot, however, the film is a failure. (Pang et al., 2002)
A minimal solution is to move from a bag-of-words model to a bag-of-bigrams model, where each base feature is a pair of adjacent words, e.g.,
(that’s, not), (not, bad), (bad, for), . . . [4.1]
Bigrams can handle relatively straightforward cases, such as when an adjective is immedi- ately negated; trigrams would be required to extend to larger contexts (e.g., not the worst). But this approach will not scale to more complex examples like (4.1d) and (4.1e). More sophisticated solutions try to account for the syntactic structure of the sentence (Wilson et al., 2005; Socher et al., 2013), or apply more complex classifiers such as convolutional neural networks (Kim, 2014), which are described in chapter 3.
4.1.1 Related problems
Subjectivity Closely related to sentiment analysis is subjectivity detection, which re-
quires identifying the parts of a text that express subjective opinions, as well as other non- Jacob Eisenstein. Draft of October 15, 2018.

4.1. SENTIMENTANDOPINIONANALYSIS 71
factual content such as speculation and hypotheticals (Riloff and Wiebe, 2003). This can be done by treating each sentence as a separate document, and then applying a bag-of-words classifier: indeed, Pang and Lee (2004) do exactly this, using a training set consisting of (mostly) subjective sentences gathered from movie reviews, and (mostly) objective sen- tences gathered from plot descriptions. They augment this bag-of-words model with a graph-based algorithm that encourages nearby sentences to have the same subjectivity label.
Stance classification In debates, each participant takes a side: for example, advocating for or against proposals like adopting a vegetarian lifestyle or mandating free college ed- ucation. The problem of stance classification is to identify the author’s position from the text of the argument. In some cases, there is training data available for each position, so that standard document classification techniques can be employed. In other cases, it suffices to classify each document as whether it is in support or opposition of the argu- ment advanced by a previous document (Anand et al., 2011). In the most challenging case, there is no labeled data for any of the stances, so the only possibility is group docu- ments that advocate the same position (Somasundaran and Wiebe, 2009). This is a form of unsupervised learning, discussed in chapter 5.
Targeted sentiment analysis The expression of sentiment is often more nuanced than a simple binary label. Consider the following examples:
(4.2) a. The vodka was good, but the meat was rotten.
b. Go to Heaven for the climate, Hell for the company. –Mark Twain
These statements display a mixed overall sentiment: positive towards some entities (e.g., the vodka), negative towards others (e.g., the meat). Targeted sentiment analysis seeks to identify the writer’s sentiment towards specific entities (Jiang et al., 2011). This requires identifying the entities in the text and linking them to specific sentiment words — much more than we can do with the classification-based approaches discussed thus far. For example, Kim and Hovy (2006) analyze sentence-internal structure to determine the topic of each sentiment expression.
Aspect-based opinion mining seeks to identify the sentiment of the author of a review towards predefined aspects such as PRICE and SERVICE, or, in the case of (4.2b), CLIMATE and COMPANY (Hu and Liu, 2004). If the aspects are not defined in advance, it may again be necessary to employ unsupervised learning methods to identify them (e.g., Branavan et al., 2009).
Emotion classification While sentiment analysis is framed in terms of positive and neg- ative categories, psychologists generally regard emotion as more multifaceted. For ex- ample, Ekman (1992) argues that there are six basic emotions — happiness, surprise, fear,
Under contract with MIT Press, shared under CC-BY-NC-ND license.

72 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
sadness, anger, and contempt — and that they are universal across human cultures. Alm et al. (2005) build a linear classifier for recognizing the emotions expressed in children’s stories. The ultimate goal of this work was to improve text-to-speech synthesis, so that stories could be read with intonation that reflected the emotional content. They used bag- of-words features, as well as features capturing the story type (e.g., jokes, folktales), and structural features that reflect the position of each sentence in the story. The task is diffi- cult: even human annotators frequently disagreed with each other, and the best classifiers achieved accuracy between 60-70%.
4.1.2 Alternative approaches to sentiment analysis
Regression A more challenging version of sentiment analysis is to determine not just the class of a document, but its rating on a numerical scale (Pang and Lee, 2005). If the scale is continuous, it is most natural to apply regression, identifying a set of weights θ that minimize the squared error of a predictor yˆ = θ · x + b, where b is an offset. This approach is called linear regression, and sometimes least squares, because the regression coefficients θ are determined by minimizing the squared error, (y − yˆ)2. If the weights are regularized using a penalty λ||θ||2, then it is ridge regression. Unlike logistic regression, both linear regression and ridge regression can be solved in closed form as a system of linear equations.
Ordinal ranking In many problems, the labels are ordered but discrete: for example, product reviews are often integers on a scale of 1 − 5, and grades are on a scale of A − F . Such problems can be solved by discretizing the score θ · x into “ranks”,
yˆ = argmax r, [4.2] r: θ·x≥br
where b = [b1 = −∞,b2,b3,…,bK] is a vector of boundaries. It is possible to learn the weights and boundaries simultaneously, using a perceptron-like algorithm (Crammer and Singer, 2001).
Lexicon-based classification Sentiment analysis is one of the only NLP tasks where hand-crafted feature weights are still widely employed. In lexicon-based classification (Taboada et al., 2011), the user creates a list of words for each label, and then classifies each docu-
ment based on how many of the words from each list are present. In our linear classifica-
tion framework, this is equivalent to choosing the following weights:
θy,j = 􏰅1, j ∈ Ly [4.3] 0, otherwise,
Jacob Eisenstein. Draft of October 15, 2018.

4.2. WORDSENSEDISAMBIGUATION 73
where Ly is the lexicon for label y. Compared to the machine learning classifiers discussed in the previous chapters, lexicon-based classification may seem primitive. However, su- pervised machine learning relies on large annotated datasets, which are time-consuming and expensive to produce. If the goal is to distinguish two or more categories in a new domain, it may be simpler to start by writing down a list of words for each category.
An early lexicon was the General Inquirer (Stone, 1966). Today, popular sentiment lexi- cons include SENTIWORDNET (Esuli and Sebastiani, 2006) and an evolving set of lexicons from Liu (2015). For emotions and more fine-grained analysis, Linguistic Inquiry and Word Count (LIWC) provides a set of lexicons (Tausczik and Pennebaker, 2010). The MPQA lex- icon indicates the polarity (positive or negative) of 8221 terms, as well as whether they are strongly or weakly subjective (Wiebe et al., 2005). A comprehensive comparison of senti- ment lexicons is offered by Ribeiro et al. (2016). Given an initial seed lexicon, it is possible to automatically expand the lexicon by looking for words that frequently co-occur with words in the seed set (Hatzivassiloglou and McKeown, 1997; Qiu et al., 2011).
4.2 Word sense disambiguation
Consider the the following headlines:
(4.3) a.
b. Prostitutes appeal to Pope
c. Drunk gets nine years in violin case2
These headlines are ambiguous because they contain words that have multiple mean- ings, or senses. Word sense disambiguation is the problem of identifying the intended sense of each word token in a document. Word sense disambiguation is part of a larger field of research called lexical semantics, which is concerned with meanings of the words.
At a basic level, the problem of word sense disambiguation is to identify the correct sense for each word token in a document. Part-of-speech ambiguity (e.g., noun versus verb) is usually considered to be a different problem, to be solved at an earlier stage. From a linguistic perspective, senses are not properties of words, but of lemmas, which are canonical forms that stand in for a set of inflected words. For example, arm/N is a lemma that includes the inflected form arms/N — the /N indicates that it we are refer- ring to the noun, and not its homonym arm/V, which is another lemma that includes the inflected verbs (arm/V, arms/V, armed/V, arming/V). Therefore, word sense disam- biguation requires first identifying the correct part-of-speech and lemma for each token,
2These examples, and many more, can be found at http://www.ling.upenn.edu/ ̃beatrice/ humor/headlines.html
Under contract with MIT Press, shared under CC-BY-NC-ND license.
Iraqi head seeks arms

74 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION and then choosing the correct sense from the inventory associated with the corresponding
lemma.3 (Part-of-speech tagging is discussed in section 8.1.)
4.2.1 How many word senses?
Words sometimes have many more than two senses, as exemplified by the word serve:
• [FUNCTION]: The tree stump served as a table
• [CONTRIBUTE TO]: His evasive replies only served to heighten suspicion • [PROVIDE]: We serve only the rawest fish
• [ENLIST]: She served in an elite combat unit
• [JAIL]: He served six years for a crime he didn’t commit
• [LEGAL]: They were served with subpoenas4
These sense distinctions are annotated in WORDNET (http://wordnet.princeton. edu), a lexical semantic database for English. WORDNET consists of roughly 100,000 synsets, which are groups of lemmas (or phrases) that are synonymous. An example synset is {chump1, fool2, sucker1, mark9}, where the superscripts index the sense of each lemma that is included in the synset: for example, there are at least eight other senses of mark that have different meanings, and are not part of this synset. A lemma is polysemous if it participates in multiple synsets.
WORDNET defines the scope of the word sense disambiguation problem, and, more generally, formalizes lexical semantic knowledge of English. (WordNets have been cre- ated for a few dozen other languages, at varying levels of detail.) Some have argued that WordNet’s sense granularity is too fine (Ide and Wilks, 2006); more fundamentally, the premise that word senses can be differentiated in a task-neutral way has been criti- cized as linguistically na ̈ıve (Kilgarriff, 1997). One way of testing this question is to ask whether people tend to agree on the appropriate sense for example sentences: accord- ing to Mihalcea et al. (2004), people agree on roughly 70% of examples using WordNet senses; far better than chance, but less than agreement on other tasks, such as sentiment annotation (Wilson et al., 2005).
*Other lexical semantic relations Besides synonymy, WordNet also describes many other lexical semantic relationships, including:
• antonymy: x means the opposite of y, e.g. FRIEND-ENEMY; 3Navigli (2009) provides a survey of approaches for word-sense disambiguation.
4Several of the examples are adapted from WORDNET (Fellbaum, 2010).
Jacob Eisenstein. Draft of October 15, 2018.

4.2. WORDSENSEDISAMBIGUATION 75 • hyponymy: x is a special case of y, e.g. RED-COLOR; the inverse relationship is
hypernymy;
• meronymy:xisapartofy,e.g.,WHEEL-BICYCLE;theinverserelationshipisholonymy.
Classification of these relations can be performed by searching for characteristic pat- terns between pairs of words, e.g., X, such as Y, which signals hyponymy (Hearst, 1992),
or X but Y, which signals antonymy (Hatzivassiloglou and McKeown, 1997). Another ap- proach is to analyze each term’s distributional statistics (the frequency of its neighboring words). Such approaches are described in detail in chapter 14.
4.2.2 Word sense disambiguation as classification
How can we tell living plants from manufacturing plants? The context is often critical: (4.4) a. Town officials are hoping to attract new manufacturing plants through weak-
ened environmental regulations.
b. The endangered plants play an important role in the local ecosystem.
It is possible to build a feature vector using the bag-of-words representation, by treat- ing each context as a pseudo-document. The feature function is then,
f ((plant, The endangered plants play an . . . ), y) =
{(the,y) : 1,(endangered,y) : 1,(play,y) : 1,(an,y) : 1,…}
As in document classification, many of these features are irrelevant, but a few are very strong predictors. In this example, the context word endangered is a strong signal that the intended sense is biology rather than manufacturing. We would therefore expect a learning algorithm to assign high weight to (endangered,BIOLOGY), and low weight to (endangered, MANUFACTURING).5
It may also be helpful to go beyond the bag-of-words: for example, one might encode the position of each context word with respect to the target, e.g.,
f ((bank, I went to the bank to deposit my paycheck), y) =
{(i − 3, went, y) : 1, (i + 2, deposit, y) : 1, (i + 4, paycheck, y) : 1}
These are called collocation features, and they give more information about the specific role played by each context word. This idea can be taken further by incorporating addi- tional syntactic information about the grammatical role played by each context feature, such as the dependency path (see chapter 11).
5The context bag-of-words can be also used be used to perform word-sense disambiguation without machine learning: the Lesk (1986) algorithm selects the word sense whose dictionary definition best overlaps the local context.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

76 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
Using such features, a classifier can be trained from labeled data. A semantic concor- dance is a corpus in which each open-class word (nouns, verbs, adjectives, and adverbs) is tagged with its word sense from the target dictionary or thesaurus. SemCor is a seman- tic concordance built from 234K tokens of the Brown corpus (Francis and Kucera, 1982), annotated as part of the WORDNET project (Fellbaum, 2010). SemCor annotations look like this:
(4.5) As of Sunday1N night1N there was4V no word2N . . . ,
with the superscripts indicating the annotated sense of each polysemous word, and the
subscripts indicating the part-of-speech.
As always, supervised classification is only possible if enough labeled examples can be accumulated. This is difficult in word sense disambiguation, because each polysemous lemma requires its own training set: having a good classifier for the senses of serve is no help towards disambiguating plant. For this reason, unsupervised and semi-supervised methods are particularly important for word sense disambiguation (e.g., Yarowsky, 1995). These methods will be discussed in chapter 5. Unsupervised methods typically lean on the heuristic of “one sense per discourse”, which means that a lemma will usually have a single, consistent sense throughout any given document (Gale et al., 1992). Based on this heuristic, we can propagate information from high-confidence instances to lower- confidence instances in the same document (Yarowsky, 1995). Semi-supervised methods combine labeled and unlabeled data, and are discussed in more detail in chapter 5.
4.3 Design decisions for text classification
Text classification involves a number of design decisions. In some cases, the design deci- sion is clear from the mathematics: if you are using regularization, then a regularization weight λ must be chosen. Other decisions are more subtle, arising only in the low level “plumbing” code that ingests and processes the raw data. Such decision can be surpris- ingly consequential for classification accuracy.
4.3.1 What is a word?
The bag-of-words representation presupposes that extracting a vector of word counts from text is unambiguous. But text documents are generally represented as a sequences of characters (in an encoding such as ascii or unicode), and the conversion to bag-of-words presupposes a definition of the “words” that are to be counted.
Jacob Eisenstein. Draft of October 15, 2018.

4.3. DESIGNDECISIONSFORTEXTCLASSIFICATION 77
Whitespace
Treebank
Tweet
TokTok (Dehdari, 2014)
Isn’t Ahab, Ahab? 😉
Is n’t Ahab
Isn’t Ahab ,
Isn ’ t Ahab , Ahab
; )
? ; ) Figure 4.1: The output of four NLTK tokenizers, applied to the string Isn’t Ahab, Ahab? 😉
, Ahab ? Ahab ? 😉
Tokenization
The first subtask for constructing a bag-of-words vector is tokenization: converting the text from a sequence of characters to a sequence of word!tokens. A simple approach is to define a subset of characters as whitespace, and then split the text on these tokens. However, whitespace-based tokenization is not ideal: we may want to split conjunctions like isn’t and hyphenated phrases like prize-winning and half-asleep, and we likely want to separate words from commas and periods that immediately follow them. At the same time, it would be better not to split abbreviations like U.S. and Ph.D. In languages with Roman scripts, tokenization is typically performed using regular expressions, with mod- ules designed to handle each of these cases. For example, the NLTK package includes a number of tokenizers (Loper and Bird, 2002); the outputs of four of the better-known tok- enizers are shown in Figure 4.1. Social media researchers have found that emoticons and other forms of orthographic variation pose new challenges for tokenization, leading to the development of special purpose tokenizers to handle these phenomena (O’Connor et al., 2010).
Tokenization is a language-specific problem, and each language poses unique chal- lenges. For example, Chinese does not include spaces between words, nor any other consistent orthographic markers of word boundaries. A “greedy” approach is to scan the input for character substrings that are in a predefined lexicon. However, Xue et al. (2003) notes that this can be ambiguous, since many character sequences could be seg- mented in multiple ways. Instead, he trains a classifier to determine whether each Chinese character, or hanzi, is a word boundary. More advanced sequence labeling methods for word segmentation are discussed in section 8.4. Similar problems can occur in languages with alphabetic scripts, such as German, which does not include whitespace in compound nouns, yielding examples such as Freundschaftsbezeigungen (demonstration of friendship) and Dilettantenaufdringlichkeiten (the importunities of dilettantes). As Twain (1997) ar- gues, “These things are not words, they are alphabetic processions.” Social media raises similar problems for English and other languages, with hashtags such as #TrueLoveInFourWords requiring decomposition for analysis (Brun and Roux, 2014).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

78
CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
The Williams sisters are leaving this tennis
centre centr cent centre
Original
Porter stemmer the william sister are leav thi tenni Lancaster stemmer the william sist ar leav thi ten WordNetlemmatizer The Williams sister are leaving this tennis
Figure 4.2: Sample outputs of the Porter (1980) and Lancaster (Paice, 1990) stemmers, and the WORDNET lemmatizer
Text normalization
After splitting the text into tokens, the next question is which tokens are really distinct. Is it necessary to distinguish great, Great, and GREAT? Sentence-initial capitalization may be irrelevant to the classification task. Going further, the complete elimination of case distinctions will result in a smaller vocabulary, and thus smaller feature vectors. However, case distinctions might be relevant in some situations: for example, apple is a delicious pie filling, while Apple is a company that specializes in proprietary dongles and power adapters.
For Roman script, case conversion can be performed using unicode string libraries. Many scripts do not have case distinctions (e.g., the Devanagari script used for South Asian languages, the Thai alphabet, and Japanese kana), and case conversion for all scripts may not be available in every programming environment. (Unicode support is an im- portant distinction between Python’s versions 2 and 3, and is a good reason for mi- grating to Python 3 if you have not already done so. Compare the output of the code “\`a l\’hˆotel”.upper()inthetwolanguageversions.)
Case conversion is a type of text normalization, which refers to string transforma- tions that remove distinctions that are irrelevant to downstream applications (Sproat et al., 2001). Other forms of normalization include the standardization of numbers (e.g., 1,000 to 1000) and dates (e.g., August 11, 2015 to 2015/11/08). Depending on the application, it may even be worthwhile to convert all numbers and dates to special tokens, !NUM and !DATE. In social media, there are additional orthographic phenomena that may be normalized, such as expressive lengthening, e.g., cooooool (Aw et al., 2006; Yang and Eisenstein, 2013). Similarly, historical texts feature spelling variations that may need to be normalized to a contemporary standard form (Baron and Rayson, 2008).
A more extreme form of normalization is to eliminate inflectional affixes, such as the -ed and -s suffixes in English. On this view, whale, whales, and whaling all refer to the same underlying concept, so they should be grouped into a single feature. A stemmer is a program for eliminating affixes, usually by applying a series of regular expression sub- stitutions. Character-based stemming algorithms are necessarily approximate, as shown in Figure 4.2: the Lancaster stemmer incorrectly identifies -ers as an inflectional suffix of
Jacob Eisenstein. Draft of October 15, 2018.

4.3. DESIGNDECISIONSFORTEXTCLASSIFICATION 79
Pang and Lee Movie Reviews (English) 1.0
0.5
0 10000 20000 30000 40000 Vocabulary size
MAC-Morpho Corpus (Brazilian Portuguese) 1.0
0.5
0 10000 20000 30000 40000 50000 60000 70000 Vocabulary size
(a) Movie review data in English (b) News articles in Brazilian Portuguese
Figure 4.3: Tradeoff between token coverage (y-axis) and vocabulary size, on the NLTK movie review dataset, after sorting the vocabulary by decreasing frequency. The red dashed lines indicate 80%, 90%, and 95% coverage.
sisters (by analogy to fix/fixers), and both stemmers incorrectly identify -s as a suffix of this and Williams. Fortunately, even inaccurate stemming can improve bag-of-words classifi- cation models, by merging related strings and thereby reducing the vocabulary size.
Accurately handling irregular orthography requires word-specific rules. Lemmatizers are systems that identify the underlying lemma of a given wordform. They must avoid the over-generalization errors of the stemmers in Figure 4.2, and also handle more complex transformations, such as geese→goose. The output of the WordNet lemmatizer is shown in the final line of Figure 4.2. Both stemming and lemmatization are language-specific: an English stemmer or lemmatizer is of little use on a text written in another language. The discipline of morphology relates to the study of word-internal structure, and is described in more detail in subsection 9.1.2.
The value of normalization depends on the data and the task. Normalization re- duces the size of the feature space, which can help in generalization. However, there is always the risk of merging away linguistically meaningful distinctions. In supervised machine learning, regularization and smoothing can play a similar role to normalization — preventing the learner from overfitting to rare features — while avoiding the language- specific engineering required for accurate normalization. In unsupervised scenarios, such as content-based information retrieval (Manning et al., 2008) and topic modeling (Blei et al., 2003), normalization is more critical.
4.3.2 How many words?
Limiting the size of the feature vector reduces the memory footprint of the resulting mod- els, and increases the speed of prediction. Normalization can help to play this role, but a more direct approach is simply to limit the vocabulary to the N most frequent words in the dataset. For example, in the MOVIE-REVIEWS dataset provided with NLTK (origi- nally from Pang et al., 2002), there are 39,768 word types, and 1.58M tokens. As shown
Under contract with MIT Press, shared under CC-BY-NC-ND license.
Token coverage
Token coverage

80 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
in Figure 4.3a, the most frequent 4000 word types cover 90% of all tokens, offering an order-of-magnitude reduction in the model size. Such ratios are language-specific: in for example, in the Brazilian Portuguese Mac-Morpho corpus (Alu ́ısio et al., 2003), attain- ing 90% coverage requires more than 10000 word types (Figure 4.3b). This reflects the morphological complexity of Portuguese, which includes many more inflectional suffixes than English.
Eliminating rare words is not always advantageous for classification performance: for example, names, which are typically rare, play a large role in distinguishing topics of news articles. Another way to reduce the size of the feature space is to eliminate stopwords such as the, to, and and, which may seem to play little role in expressing the topic, sentiment, or stance. This is typically done by creating a stoplist (e.g., NLTK.CORPUS.STOPWORDS), and then ignoring all terms that match the list. However, corpus linguists and social psy- chologists have shown that seemingly inconsequential words can offer surprising insights about the author or nature of the text (Biber, 1991; Chung and Pennebaker, 2007). Further- more, high-frequency words are unlikely to cause overfitting in discriminative classifiers. As with normalization, stopword filtering is more important for unsupervised problems, such as term-based document retrieval.
Another alternative for controlling model size is feature hashing (Weinberger et al., 2009). Each feature is assigned an index using a hash function. If a hash function that permits collisions is chosen (typically by taking the hash output modulo some integer), then the model can be made arbitrarily small, as multiple features share a single weight. Because most features are rare, accuracy is surprisingly robust to such collisions (Ganchev and Dredze, 2008).
4.3.3 Count or binary?
Finally, we may consider whether we want our feature vector to include the count of each word, or its presence. This gets at a subtle limitation of linear classification: it’s worse to have two failures than one, but is it really twice as bad? Motivated by this intuition, Pang et al. (2002) use binary indicators of presence or absence in the feature vector: fj(x,y) ∈ {0, 1}. They find that classifiers trained on these binary vectors tend to outperform feature vectors based on word counts. One explanation is that words tend to appear in clumps: if a word has appeared once in a document, it is likely to appear again (Church, 2000). These subsequent appearances can be attributed to this tendency towards repetition, and thus provide little additional information about the class label of the document.
4.4 Evaluating classifiers
In any supervised machine learning application, it is critical to reserve a held-out test set. This data should be used for only one purpose: to evaluate the overall accuracy of a single
Jacob Eisenstein. Draft of October 15, 2018.

4.4. EVALUATINGCLASSIFIERS 81
classifier. Using this data more than once would cause the estimated accuracy to be overly optimistic, because the classifier would be customized to this data, and would not perform as well as on unseen data in the future. It is usually necessary to set hyperparameters or perform feature selection, so you may need to construct a tuning or development set for this purpose, as discussed in subsection 2.2.5.
There are a number of ways to evaluate classifier performance. The simplest is accu- racy: the number of correct predictions, divided by the total number of instances,
1 􏰁N
acc(y, yˆ) = N δ(y(i) = yˆ). [4.4]
i
Exams are usually graded by accuracy. Why are other metrics necessary? The main reason is class imbalance. Suppose you are building a classifier to detect whether an electronic health record (EHR) describes symptoms of a rare disease, which appears in only 1% of all documents in the dataset. A classifier that reports yˆ = NEGATIVE for all documents would achieve 99% accuracy, but would be practically useless. We need metrics that are capable of detecting the classifier’s ability to discriminate between classes, even when the distribution is skewed.
One solution is to build a balanced test set, in which each possible label is equally rep- resented. But in the EHR example, this would mean throwing away 98% of the original dataset! Furthermore, the detection threshold itself might be a design consideration: in health-related applications, we might prefer a very sensitive classifier, which returned a positive prediction if there is even a small chance that y(i) = POSITIVE. In other applica- tions, a positive result might trigger a costly action, so we would prefer a classifier that only makes positive predictions when absolutely certain. We need additional metrics to capture these characteristics.
4.4.1 Precision, recall, and F -MEASURE
For any label (e.g., positive for presence of symptoms of a disease), there are two possible
errors:
• False positive: the system incorrectly predicts the label.
• False negative: the system incorrectly fails to predict the label. Similarly, for any label, there are two ways to be correct:
• True positive: the system correctly predicts the label.
• True negative: the system correctly predicts that the label does not apply to this instance.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

82 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
Classifiers that make a lot of false positives have low precision: they predict the label even when it isn’t there. Classifiers that make a lot of false negatives have low recall: they fail to predict the label, even when it is there. These metrics distinguish these two sources of error, and are defined formally as:
RECALL(y, yˆ, k) = TP [4.5] TP+FN
PRECISION(y, yˆ, k) = TP . [4.6] TP+FP
Recall and precision are both conditional likelihoods of a correct prediction, which is why their numerators are the same. Recall is conditioned on k being the correct label, y(i) = k, so the denominator sums over true positive and false negatives. Precision is conditioned on k being the prediction, so the denominator sums over true positives and false positives. Note that true negatives are not considered in either statistic. The classifier that labels every document as “negative” would achieve zero recall; precision would be 0 .
Recall and precision are complementary. A high-recall classifier is preferred when false positives are cheaper than false negatives: for example, in a preliminary screening for symptoms of a disease, the cost of a false positive might be an additional test, while a false negative would result in the disease going untreated. Conversely, a high-precision classifier is preferred when false positives are more expensive: for example, in spam de- tection, a false negative is a relatively minor inconvenience, while a false positive might mean that an important message goes unread.
The F-MEASURE combines recall and precision into a single metric, using the har- monic mean:
F-MEASURE(y,yˆ,k)= 2rp , [4.7] r+p
where r is recall and p is precision.6
Evaluating multi-class classification Recall, precision, and F -MEASURE are defined with respect to a specific label k. When there are multiple labels of interest (e.g., in word sense disambiguation or emotion classification), it is necessary to combine the F-MEASURE across each class. Macro F -MEASURE is the average F -MEASURE across several classes,
Macro-F (y, yˆ) = 1 􏰁 F -MEASURE(y, yˆ, k) [4.8] |K| k∈K
6F -MEASURE is sometimes called F1, and generalizes to Fβ = (1+β2)rp . The β parameter can be tuned to
0
emphasize recall or precision.
β2p+r
Jacob Eisenstein. Draft of October 15, 2018.

4.4. EVALUATINGCLASSIFIERS 83
AUC=0.89 AUC=0.73 AUC=0.5
1.0 0.8 0.6 0.4 0.2 0.0
0.0 0.2 0.4
False positive rate
0.6 0.8 1.0
Figure 4.4: ROC curves for three classifiers of varying discriminative power, measured by AUC (area under the curve)
In multi-class problems with unbalanced class distributions, the macro F -MEASURE is a balanced measure of how well the classifier recognizes each class. In micro F -MEASURE, we compute true positives, false positives, and false negatives for each class, and then add them up to compute a single recall, precision, and F -MEASURE. This metric is balanced across instances rather than classes, so it weights each class in proportion to its frequency — unlike macro F -MEASURE, which weights each class equally.
4.4.2 Threshold-free metrics
In binary classification problems, it is possible to trade off between recall and precision by
adding a constant “threshold” to the output of the scoring function. This makes it possible
to trace out a curve, where each point indicates the performance at a single threshold. In
the receiver operating characteristic (ROC) curve,7 the x-axis indicates the false positive
rate, FP , and the y-axis indicates the recall, or true positive rate. A perfect classifier FP+TN
attains perfect recall without any false positives, tracing a “curve” from the origin (0,0) to the upper left corner (0,1), and then to (1,1). In expectation, a non-discriminative classifier traces a diagonal line from the origin (0,0) to the upper right corner (1,1). Real classifiers tend to fall between these two extremes. Examples are shown in Figure 4.4.
The ROC curve can be summarized in a single number by taking its integral, the area under the curve (AUC). The AUC can be interpreted as the probability that a randomly- selected positive example will be assigned a higher score by the classifier than a randomly-
7The name “receiver operator characteristic” comes from the metric’s origin in signal processing applica- tions (Peterson et al., 1954). Other threshold-free metrics include precision-recall curves, precision-at-k, and balanced F -MEASURE; see Manning et al. (2008) for more details.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
True positive rate

84 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
selected negative example. A perfect classifier has AUC = 1 (all positive examples score higher than all negative examples); a non-discriminative classifier has AUC = 0.5 (given a randomly selected positive and negative example, either could score higher with equal probability); a perfectly wrong classifier would have AUC = 0 (all negative examples score higher than all positive examples). One advantage of AUC in comparison to F -MEASURE is that the baseline rate of 0.5 does not depend on the label distribution.
4.4.3 Classifier comparison and statistical significance
Natural language processing research and engineering often involves comparing different classification techniques. In some cases, the comparison is between algorithms, such as logistic regression versus averaged perceptron, or L2 regularization versus L1. In other cases, the comparison is between feature sets, such as the bag-of-words versus positional bag-of-words (see subsection 4.2.2). Ablation testing involves systematically removing (ablating) various aspects of the classifier, such as feature groups, and testing the null hypothesis that the ablated classifier is as good as the full model.
A full treatment of hypothesis testing is beyond the scope of this text, but this section contains a brief summary of the techniques necessary to compare classifiers. The main aim of hypothesis testing is to determine whether the difference between two statistics — for example, the accuracies of two classifiers — is likely to arise by chance. We will be concerned with chance fluctuations that arise due to the finite size of the test set.8 An improvement of 10% on a test set with ten instances may reflect a random fluctuation that makes the test set more favorable to classifier c1 than c2; on another test set with a different ten instances, we might find that c2 does better than c1. But if we observe the same 10% improvement on a test set with 1000 instances, this is highly unlikely to be explained by chance. Such a finding is said to be statistically significant at a level p, which is the probability of observing an effect of equal or greater magnitude when the null hypothesis is true. The notation p < .05 indicates that the likelihood of an equal or greater effect is less than 5%, assuming the null hypothesis is true.9 The binomial test The statistical significance of a difference in accuracy can be evaluated using classical tests, such as the binomial test.10 Suppose that classifiers c1 and c2 disagree on N instances in a 8Other sources of variance include the initialization of non-convex classifiers such as neural networks, and the ordering of instances in online learning such as stochastic gradient descent and perceptron. 9Statistical hypothesis testing is useful only to the extent that the existing test set is representative of the instances that will be encountered in the future. If, for example, the test set is constructed from news documents, no hypothesis test can predict which classifier will perform best on documents from another domain, such as electronic health records. 10A well-known alternative to the binomial test is McNemar’s test, which computes a test statistic based on the number of examples that are correctly classified by one system and incorrectly classified by the other. Jacob Eisenstein. Draft of October 15, 2018. 4.4. EVALUATINGCLASSIFIERS 85 0.15 0.10 0.05 0.00 0 5 10 15 20 25 30 Instances where c1 is right and c2 is wrong Figure 4.5: Probability mass function for the binomial distribution. The pink highlighted areas represent the cumulative probability for a significance test on an observation of k=10andN =30. test set with binary labels, and that c1 is correct on k of those instances. Under the null hy- pothesis that the classifiers are equally accurate, we would expect k/N to be roughly equal to 1/2, and as N increases, k/N should be increasingly close to this expected value. These properties are captured by the binomial distribution, which is a probability over counts of binary random variables. We write k ∼ Binom(θ,N) to indicate that k is drawn from a binomial distribution, with parameter N indicating the number of random “draws”, and θ indicating the probability of “success” on each draw. Each draw is an example on which the two classifiers disagree, and a “success” is a case in which c1 is right and c2 is wrong. (The label space is assumed to be binary, so if the classifiers disagree, exactly one of them is correct. The test can be generalized to multi-class classification by focusing on the examples in which exactly one classifier is correct.) The probability mass function (PMF) of the binomial distribution is, pBinom(k; N, θ) = 􏰑Nk 􏰒θk(1 − θ)N−k, [4.9] with θk representing the probability of the k successes, (1 − θ)N−k representing the prob- ability of the N − k unsuccessful draws. The expression 􏰗N􏰘 = N! is a binomial k k!(N −k)! coefficient, representing the number of possible orderings of events; this ensures that the distribution sums to one over all k ∈ {0,1,2,...,N}. Under the null hypothesis, when the classifiers disagree, each classifier is equally likely to be right, so θ = 1. Now suppose that among N disagreements, c1 is correct N2 k < 2 times. The probability of c1 being correct k or fewer times is the one-tailed p-value, The null hypothesis distribution for this test statistic is known to be drawn from a chi-squared distribution with a single degree of freedom, so a p-value can be computed from the cumulative density function of this distribution (Dietterich, 1998). Both tests give similar results in most circumstances, but the binomial test is easier to understand from first principles. Under contract with MIT Press, shared under CC-BY-NC-ND license. p(k N=30, =0.5) 86 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION because it is computed from the area under the binomial probability mass function from 0 to k, as shown in the left tail of Figure 4.5. This cumulative probability is computed as a sum over all values i ≤ k, 􏰑(i)(i)(i) 1􏰒􏰁k􏰑1􏰒 Pr count(yˆ2 = y ̸= yˆ1 ) ≤ k;N,θ = = pBinom i;N,θ = . [4.10] Binom 2 i=0 2 The one-tailed p-value applies only to the asymmetric null hypothesis that c1 is at least as accurate as c2. To test the two-tailed null hypothesis that c1 and c2 are equally accu- rate, we would take the sum of one-tailed p-values, where the second term is computed from the right tail of Figure 4.5. The binomial distribution is symmetric, so this can be computed by simply doubling the one-tailed p-value. Two-tailed tests are more stringent, but they are necessary in cases in which there is no prior intuition about whether c1 or c2 is better. For example, in comparing logistic regression versus averaged perceptron, a two-tailed test is appropriate. In an ablation test, c2 may contain a superset of the features available to c1. If the additional features are thought to be likely to improve performance, then a one-tailed test would be appropriate, if chosen in advance. However, such a test can only prove that c2 is more accurate than c1, and not the reverse. *Randomized testing The binomial test is appropriate for accuracy, but not for more complex metrics such as F-MEASURE. To compute statistical significance for arbitrary metrics, we can apply ran- domization. Specifically, draw a set of M bootstrap samples (Efron and Tibshirani, 1993), by resampling instances from the original test set with replacement. Each bootstrap sam- ple is itself a test set of size N. Some instances from the original test set will not appear in any given bootstrap sample, while others will appear multiple times; but overall, the sample will be drawn from the same distribution as the original test set. We can then com- pute any desired evaluation on each bootstrap sample, which gives a distribution over the value of the metric. Algorithm 7 shows how to perform this computation. To compare the F-MEASURE of two classifiers c1 and c2, we set the function δ(·) to compute the difference in F-MEASURE on the bootstrap sample. If the difference is less than or equal to zero in at least 5% of the samples, then we cannot reject the one-tailed null hypothesis that c2 is at least as good as c1 (Berg-Kirkpatrick et al., 2012). We may also be interested in the 95% confidence interval around a metric of interest, such as the F-MEASURE of a single classifier. This can be computed by sorting the output of Algorithm 7, and then setting the top and bottom of the 95% confidence interval to the values at the 2.5% and 97.5% percentiles of the sorted outputs. Alternatively, you can fit a normal distribution to the set of differences across bootstrap samples, and compute a Gaussian confidence interval from the mean and variance. Jacob Eisenstein. Draft of October 15, 2018. 4.4. EVALUATINGCLASSIFIERS 87 Algorithm 7 Bootstrap sampling for classifier evaluation. The original test set is {x(1:N),y(1:N)}, the metric is δ(·), and the number of samples is M. procedure BOOTSTRAP-SAMPLE(x(1:N), y(1:N), δ(·), M) for t ∈ {1,2,...,M} do for i ∈ {1,2,...,N} do j ∼ UniformInteger(1, N ) x ̃ ( i ) ← x ( j ) y ̃ ( i ) ← y ( j ) d(t) ← δ(x ̃(1:N), y ̃(1:N)) return {d(t)}Mt=1 As the number of bootstrap samples goes to infinity, M → ∞, the bootstrap estimate is increasingly accurate. A typical choice for M is 104 or 105; larger numbers of samples are necessary for smaller p-values. One way to validate your choice of M is to run the test multiple times, and ensure that the p-values are similar; if not, increase M by an order of magnitude. This is a heuristic measure of the variance of the test, which can decreases with the square root √M (Robert and Casella, 2013). 4.4.4 *Multiple comparisons Sometimes it is necessary to perform multiple hypothesis tests, such as when compar- ing the performance of several classifiers on multiple datasets. Suppose you have five datasets, and you compare four versions of your classifier against a baseline system, for a total of 20 comparisons. Even if none of your classifiers is better than the baseline, there will be some chance variation in the results, and in expectation you will get one statis- tically significant improvement at p = 0.05 = 1 . It is therefore necessary to adjust the 20 p-values when reporting the results of multiple comparisons. One approach is to require a threshold of α to report a p value of p < α when per- forming m tests. This is known as the Bonferroni correction, and it limits the overall probability of incorrectly rejecting the null hypothesis at α. Another approach is to bound the false discovery rate (FDR), which is the fraction of null hypothesis rejections that are incorrect. Benjamini and Hochberg (1995) propose a p-value correction that bounds the fraction of false discoveries at α: sort the p-values of each individual test in ascending order, and set the significance threshold equal to largest k such that pk ≤ k α. If k > 1, the
FDR adjustment is more permissive than the Bonferroni correction.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
m
m

88 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION 4.5 Building datasets
Sometimes, if you want to build a classifier, you must first build a dataset of your own. This includes selecting a set of documents or instances to annotate, and then performing the annotations. The scope of the dataset may be determined by the application: if you want to build a system to classify electronic health records, then you must work with a corpus of records of the type that your classifier will encounter when deployed. In other cases, the goal is to build a system that will work across a broad range of documents. In this case, it is best to have a balanced corpus, with contributions from many styles and genres. For example, the Brown corpus draws from texts ranging from government doc- uments to romance novels (Francis, 1964), and the Google Web Treebank includes an- notations for five “domains” of web documents: question answers, emails, newsgroups, reviews, and blogs (Petrov and McDonald, 2012).
4.5.1 Metadata as labels
Annotation is difficult and time-consuming, and most people would rather avoid it. It is sometimes possible to exploit existing metadata to obtain labels for training a classi- fier. For example, reviews are often accompanied by a numerical rating, which can be converted into a classification label (see section 4.1). Similarly, the nationalities of social media users can be estimated from their profiles (Dredze et al., 2013) or even the time zones of their posts (Gouws et al., 2011). More ambitiously, we may try to classify the political affiliations of social media profiles based on their social network connections to politicians and major political parties (Rao et al., 2010).
The convenience of quickly constructing large labeled datasets without manual an- notation is appealing. However this approach relies on the assumption that unlabeled instances — for which metadata is unavailable — will be similar to labeled instances. Consider the example of labeling the political affiliation of social media users based on their network ties to politicians. If a classifier attains high accuracy on such a test set, is it safe to assume that it accurately predicts the political affiliation of all social media users? Probably not. Social media users who establish social network ties to politicians may be more likely to mention politics in the text of their messages, as compared to the average user, for whom no political metadata is available. If so, the accuracy on a test set constructed from social network metadata would give an overly optimistic picture of the method’s true performance on unlabeled data.
4.5.2 Labeling data
In many cases, there is no way to get ground truth labels other than manual annotation. An annotation protocol should satisfy several criteria: the annotations should be expressive enough to capture the phenomenon of interest; they should be replicable, meaning that
Jacob Eisenstein. Draft of October 15, 2018.

4.5. BUILDINGDATASETS 89
another annotator or team of annotators would produce very similar annotations if given the same data; and they should be scalable, so that they can be produced relatively quickly. Hovy and Lavid (2010) propose a structured procedure for obtaining annotations that meet these criteria, which is summarized below.
1. Determine what to annotate. This is usually based on some theory of the under- lying phenomenon: for example, if the goal is to produce annotations about the emotional state of a document’s author, one should start with a theoretical account of the types or dimensions of emotion (e.g., Mohammad and Turney, 2013). At this stage, the tradeoff between expressiveness and scalability should be considered: a full instantiation of the underlying theory might be too costly to annotate at scale, so reasonable approximations should be considered.
2. Optionally, one may design or select a software tool to support the annotation effort. Existing general-purpose annotation tools include BRAT (Stenetorp et al., 2012)andMMAX2(Mu ̈llerandStrube,2006).
3. Formalize the instructions for the annotation task. To the extent that the instruc- tions are not explicit, the resulting annotations will depend on the intuitions of the annotators. These intuitions may not be shared by other annotators, or by the users of the annotated data. Therefore explicit instructions are critical to ensuring the an- notations are replicable and usable by other researchers.
4. Perform a pilot annotation of a small subset of data, with multiple annotators for each instance. This will give a preliminary assessment of both the replicability and scalability of the current annotation instructions. Metrics for computing the rate of agreement are described below. Manual analysis of specific disagreements should help to clarify the instructions, and may lead to modifications of the annotation task itself. For example, if two labels are commonly conflated by annotators, it may be best to merge them.
5. Annotate the data. After finalizing the annotation protocol and instructions, the main annotation effort can begin. Some, if not all, of the instances should receive multiple annotations, so that inter-annotator agreement can be computed. In some annotation projects, instances receive many annotations, which are then aggregated into a “consensus” label (e.g., Danescu-Niculescu-Mizil et al., 2013). However, if the annotations are time-consuming or require significant expertise, it may be preferable to maximize scalability by obtaining multiple annotations for only a small subset of examples.
6. Compute and report inter-annotator agreement, and release the data. In some cases, the raw text data cannot be released, due to concerns related to copyright or
Under contract with MIT Press, shared under CC-BY-NC-ND license.

90
CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
privacy. In these cases, one solution is to publicly release stand-off annotations, which contain links to document identifiers. The documents themselves can be re- leased under the terms of a licensing agreement, which can impose conditions on how the data is used. It is important to think through the potential consequences of releasing data: people may make personal data publicly available without realizing that it could be redistributed in a dataset and publicized far beyond their expecta- tions (boyd and Crawford, 2012).
Measuring inter-annotator agreement
To measure the replicability of annotations, a standard practice is to compute the extent to which annotators agree with each other. If the annotators frequently disagree, this casts doubt on either their reliability or on the annotation system itself. For classification, one can compute the frequency with which the annotators agree; for rating scales, one can compute the average distance between ratings. These raw agreement statistics must then be compared with the rate of agreement by chance — the expected level of agreement that would be obtained between two annotators who ignored the data.
Cohen’s Kappa is widely used for quantifying the agreement on discrete labeling tasks (Cohen, 1960; Carletta, 1996),11
κ = agreement − E[agreement]. [4.11] 1 − E[agreement]
The numerator is the difference between the observed agreement and the chance agree- ment, and the denominator is the difference between perfect agreement and chance agree- ment. Thus, κ = 1 when the annotators agree in every case, and κ = 0 when the annota- tors agree only as often as would happen by chance. Various heuristic scales have been proposed for determining when κ indicates “moderate”, “good”, or “substantial” agree- ment; for reference, Lee and Narayanan (2005) report κ ≈ 0.45 − 0.47 for annotations of emotions in spoken dialogues, which they describe as “moderate agreement”; Stolcke et al. (2000) report κ = 0.8 for annotations of dialogue acts, which are labels for the pur- pose of each turn in a conversation.
When there are two annotators, the expected chance agreement is computed as,
􏰁ˆ2
E[agreement] = Pr(Y = k) , [4.12]
k
ˆ
where k is a sum over labels, and Pr(Y = k) is the empirical probability of label k across
all annotations. The formula is derived from the expected number of agreements if the annotations were randomly shuffled. Thus, in a binary labeling task, if one label is applied to 90% of instances, chance agreement is .92 + .12 = .82.
11 For other types of annotations, Krippendorf’s alpha is a popular choice (Hayes and Krippendorff, 2007; Artstein and Poesio, 2008).
Jacob Eisenstein. Draft of October 15, 2018.

4.5. BUILDINGDATASETS 91 Crowdsourcing
Crowdsourcing is often used to rapidly obtain annotations for classification problems. For example, Amazon Mechanical Turk makes it possible to define “human intelligence tasks (hits)”, such as labeling data. The researcher sets a price for each set of annotations and a list of minimal qualifications for annotators, such as their native language and their satisfaction rate on previous tasks. The use of relatively untrained “crowdworkers” con- trasts with earlier annotation efforts, which relied on professional linguists (Marcus et al., 1993). However, crowdsourcing has been found to produce reliable annotations for many language-related tasks (Snow et al., 2008). Crowdsourcing is part of the broader field of human computation (Law and Ahn, 2011).For a critical examination of ethical issues related to crowdsourcing, see Fort et al. (2011).
Additional resources
Many of the preprocessing issues discussed in this chapter also arise in information re- trieval. See Manning et al. (2008) for discussion of tokenization and related algorithms. For more on hypothesis testing in particular and replicability in general, see (Dror et al., 2017, 2018).
Exercises
1. As noted in subsection 4.3.3, words tend to appear in clumps, with subsequent oc- currences of a word being more probable. More concretely, if word j has probability φy,j of appearing in a document with label y, then the probability of two appear-
ances (x(i) = 2) is greater than φ2 . j y,j
Suppose you are applying Na ̈ıve Bayes to a binary classification. Focus on a word j which is more probable under label y = 1, so that,
Pr(w = j | y = 1) > Pr(w = j | y = 0). [4.13] Now suppose that x(i) > 1. All else equal, will the classifier overestimate or under-
estimate the posterior Pr(y = 1 | x)?
2. Prove that F-measure is never greater than the arithmetic mean of recall and preci-
sion, r+p . Your solution should also show that F-measure is equal to r+p iff r = p. 22
3. Givenabinaryclassificationprobleminwhichtheprobabilityofthe“positive”label
is equal to α, what is the expected F -MEASURE of a random classifier which ignores
the data, and selects yˆ = +1 with probability 1 ? (Assume that p(yˆ)⊥p(y).) What is 2
the expected F -MEASURE of a classifier that selects yˆ = +1 with probability α (also independent of y(i))? Depending on α, which random classifier will score better?
Under contract with MIT Press, shared under CC-BY-NC-ND license.
j

92 4.
CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION
Suppose that binary classifiers c1 and c2 disagree on N = 30 cases, and that c1 is correct in k = 10 of those cases.
• Write a program that uses primitive functions such as exp and factorial to com- pute the two-tailed p-value — you may use an implementation of the “choose” function if one is avaiable. Verify your code against the output of a library for computing the binomial test or the binomial CDF, such as SCIPY.STATS.BINOM in Python.
• Then use a randomized test to try to obtain the same p-value. In each sample, draw from a binomial distribution with N = 30 and θ = 1 . Count the fraction
of samples in which k ≤ 10. This is the one-tailed p-value; double this to
compute the two-tailed p-value.
• Try this with varying numbers of bootstrap samples: M ∈ {100, 1000, 5000, 10000}.
For M = 100 and M = 1000, run the test 10 times, and plot the resulting p-
values.
• Finally, perform the same tests for N = 70 and k = 25.
SemCor 3.0 is a labeled dataset for word sense disambiguation. You can download it,12 or access it in NLTK.CORPORA.SEMCOR.
Choose a word that appears at least ten times in SemCor (find), and annotate its WordNet senses across ten randomly-selected examples, without looking at the ground truth. Use online WordNet to understand the definition of each of the senses.13 Have
a partner do the same annotations, and compute the raw rate of agreement, expected chance rate of agreement, and Cohen’s kappa.
Download the Pang and Lee movie review data, currently available from http: //www.cs.cornell.edu/people/pabo/movie-review-data/. Hold out a randomly-selected 400 reviews as a test set.
Download a sentiment lexicon, such as the one currently available from Bing Liu, https://www.cs.uic.edu/ ̃liub/FBS/sentiment-analysis.html. Tokenize the data, and classify each document as positive iff it has more positive sentiment words than negative sentiment words. Compute the accuracy and F -MEASURE on detecting positive reviews on the test set, using this lexicon-based classifier.
Then train a discriminative classifier (averaged perceptron or logistic regression) on the training set, and compute its accuracy and F -MEASURE on the test set.
Determine whether the differences are statistically significant, using two-tailed hy- pothesis tests: Binomial for the difference in accuracy, and bootstrap for the differ- ence in macro-F -MEASURE.
5.
6.
2
12e.g., https://github.com/google-research-datasets/word_sense_disambigation_ corpora or http://globalwordnet.org/wordnet-annotated-corpora/
13 http://wordnetweb.princeton.edu/perl/webwn
Jacob Eisenstein. Draft of October 15, 2018.

4.5. BUILDINGDATASETS 93
The remaining problems will require you to build a classifier and test its properties. Pick a multi-class text classification dataset that is not already tokenized. One example is a dataset of New York Times headlines and topics (Boydstun, 2013).14 Divide your data into training (60%), development (20%), and test sets (20%), if no such division already exists. If your dataset is very large, you may want to focus on a few thousand instances at first.
7. Compare various vocabulary sizes of 102 , 103 , 104 , 105 , using the most frequent words in each case (you may use any reasonable tokenizer). Train logistic regression clas- sifiers for each vocabulary size, and apply them to the development set. Plot the accuracy and Macro-F -MEASURE with the increasing vocabulary size. For each vo- cabulary size, tune the regularizer to maximize accuracy on a subset of data that is held out from the training set.
8. Compare the following tokenization algorithms:
• Whitespace, using a regular expression;
• The Penn Treebank tokenizer from NLTK;
• Splittingtheinputintonon-overlappingfive-characterunits,regardlessofwhites- pace or punctuation.
Compute the token/type ratio for each tokenizer on the training data, and explain what you find. Train your classifier on each tokenized dataset, tuning the regularizer on a subset of data that is held out from the training data. Tokenize the development set, and report accuracy and Macro-F -MEASURE.
9. Apply the Porter and Lancaster stemmers to the training set, using any reasonable tokenizer, and compute the token/type ratios. Train your classifier on the stemmed data, and compute the accuracy and Macro-F -MEASURE on stemmed development data, again using a held-out portion of the training data to tune the regularizer.
10. Identify the best combination of vocabulary filtering, tokenization, and stemming from the previous three problems. Apply this preprocessing to the test set, and compute the test set accuracy and Macro-F -MEASURE. Compare against a baseline system that applies no vocabulary filtering, whitespace tokenization, and no stem- ming.
Use the binomial test to determine whether your best-performing system is signifi- cantly more accurate than the baseline.
14 Available as a CSV file at http://www.amber- boydstun.com/ supplementary-information-for-making-the-news.html. Use the field TOPIC 2DIGIT for this problem.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

94 CHAPTER4. LINGUISTICAPPLICATIONSOFCLASSIFICATION Use the bootstrap test with M = 104 to determine whether your best-performing
system achieves significantly higher macro-F -MEASURE.
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 5
Learning without supervision
So far, we have assumed the following setup:
• a training set where you get observations x and labels y;
• a test set where you only get observations x.
Without labeled data, is it possible to learn anything? This scenario is known as unsu- pervised learning, and we will see that indeed it is possible to learn about the underlying structure of unlabeled observations. This chapter will also explore some related scenarios: semi-supervised learning, in which only some instances are labeled, and domain adap- tation, in which the training data differs from the data on which the trained system will be deployed.
5.1 Unsupervised learning
To motivate unsupervised learning, consider the problem of word sense disambiguation (section 4.2). The goal is to classify each instance of a word, such as bank into a sense,
• bank#1: a financial institution
• bank#2: the land bordering a river
It is difficult to obtain sufficient training data for word sense disambiguation, because even a large corpus will contain only a few instances of all but the most common words. Is it possible to learn anything about these different senses without labeled data?
Word sense disambiguation is usually performed using feature vectors constructed from the local context of the word to be disambiguated. For example, for the word
95

96 CHAPTER5. LEARNINGWITHOUTSUPERVISION
40
20
0
0 10 20 30 40
density of word group 1
Figure 5.1: Counts of words from two different context groups
bank, the immediate context might typically include words from one of the following two groups:
1. financial, deposits, credit, lending, capital, markets, regulated, reserve, liquid, assets 2. land, water, geography, stream, river, flow, deposits, discharge, channel, ecology
Now consider a scatterplot, in which each point is a document containing the word bank. The location of the document on the x-axis is the count of words in group 1, and the location on the y-axis is the count for group 2. In such a plot, shown in Figure 5.1, two “blobs” might emerge, and these blobs correspond to the different senses of bank.
Here’s a related scenario, from a different problem. Suppose you download thousands of news articles, and make a scatterplot, where each point corresponds to a document: the x-axis is the frequency of the group of words (hurricane, winds, storm); the y-axis is the frequency of the group (election, voters, vote). This time, three blobs might emerge: one for documents that are largely about a hurricane, another for documents largely about a election, and a third for documents about neither topic.
These clumps represent the underlying structure of the data. But the two-dimensional scatter plots are based on groupings of context words, and in real scenarios these word lists are unknown. Unsupervised learning applies the same basic idea, but in a high- dimensional space with one dimension for every context word. This space can’t be di- rectly visualized, but the goal is the same: try to identify the underlying structure of the observed data, such that there are a few clusters of points, each of which is internally coherent. Clustering algorithms are capable of finding such structure automatically.
5.1.1 K-means clustering
Clustering algorithms assign each data point to a discrete cluster, zi ∈ 1, 2, . . . K . One of
the best known clustering algorithms is K-means, an iterative algorithm that maintains Jacob Eisenstein. Draft of October 15, 2018.
density of word group 2

5.1. UNSUPERVISEDLEARNING
97
Algorithm 8 K-means clustering algorithm
1: 2: 3:
4: 5: 6:
7: 8:
9: 10:
procedureK-MEANS(x1:N,K) for i ∈ 1 . . . N do
z(i) ← RANDOMINT(1, K) repeat
◃ initialize cluster memberships ◃ recompute cluster centers
◃ reassign instances to nearest clusters
for k ∈ 1 . . . K do􏰀
ν← 1 Nδ(z(i)=k)x(i)
k δ(z(i)=k) i=1 for i ∈ 1 . . . N do
z(i) ← argmink ||x(i) − νk||2 until converged
return {z(i)}
K-means iterates between updates to the assignments and the centers:
1. each instance is placed in the cluster with the closest center;
2. each center is recomputed as the average over points in the cluster.
a cluster assignment for each instance, and a central (“mean”) location for each cluster.
This procedure is􏰀formalized in Algorithm 8. The term ||x(i) − ν||2 refers to the squared Euclidean norm, V (x(i) − νj)2. An important property of K-means is that the con-
j=1 j
verged solution depends on the initialization, and a better clustering can sometimes be
found simply by re-running the algorithm from a different random starting point.
Soft K-means is a particularly relevant variant. Instead of directly assigning each point to a sp􏰀ecific cluster, soft K-means assigns to each point a distribution over clusters q(i), so that Kk=1 q(i)(k) = 1, and ∀k, q(i)(k) ≥ 0. The soft weight q(i)(k) is computed from the distance of x(i) to the cluster center νk. In turn, the center of each cluster is computed from a weighted average of the points in the cluster,
1 􏰁N
νk = 􏰀N q(i)(k) q(i)(k)x(i). [5.1]
i=1 i=1
We will now explore a probablistic version of soft K-means clustering, based on expectation- maximization (EM). Because EM clustering can be derived as an approximation to maximum- likelihood estimation, it can be extended in a number of useful ways.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
◃ return cluster assignments

98 CHAPTER5. LEARNINGWITHOUTSUPERVISION 5.1.2 Expectation-Maximization (EM)
Expectation-maximization combines the idea of soft K-means with Na ̈ıve Bayes classifi- cation. To review, Na ̈ıve Bayes defines a probability distribution over the data,
􏰛􏰜
log p(x(i) | y(i); φ) × p(y(i); μ) [5.2]
Now suppose that you never observe the labels. To indicate this, we’ll refer to the label of each instance as z(i), rather than y(i), which is usually reserved for observed variables. By marginalizing over the latent variables z, we obtain the marginal probability of the observed instances x:
log p(x, y; φ, μ) =
􏰁N i=1
log p(x; φ, μ) =
[5.3] [5.4] [5.5]
􏰁N
i=1 􏰁N
=
i=1 􏰁N
=
i=1
log p(x(i); φ, μ)
log log
􏰁K z=1
􏰁K z=1
p(x(i), z; φ, μ)
p(x(i) | z; φ) × p(z; μ).
The parameters φ and μ can be obtained by maximizing the marginal likelihood in Equation 5.5. Why is this the right thing to maximize? Without labels, discriminative learning is impossible — there’s nothing to discriminate. So maximum likelihood is all we have.
When the labels are observed, we can estimate the parameters of the Na ̈ıve Bayes probability model separately for each label. But marginalizing over the labels couples these parameters, making direct optimization of log p(x) intractable. We will approxi- mate the log-likelihood by introducing an auxiliary variable q(i), which is a distribution over the label set Z = {1, 2, . . . , K }. The optimization procedure will alternate between updates to q and updates to the parameters (φ,μ). Thus, q(i) plays here as in soft K- means.
To derive the updates for this optimization, multiply the right side of Equation 5.5 by Jacob Eisenstein. Draft of October 15, 2018.

5.1. UNSUPERVISEDLEARNING
99
the ratio q(i)(z) = 1, q(i)(z)
log p(x; φ, μ) =
=
􏰁N 􏰁K log
i=1 z=1 􏰁N 􏰁K
log
q ( i ) ( z ) p(x(i) | z; φ) × p(z; μ) × q(i)(z)
1 q(i)(z) × p(x(i) | z; φ) × p(z; μ) × q(i)(z)
[5.6] [5.7] [5.8]
[5.9] [5.10] [5.11]
i=1 z=1􏱊 􏱋 􏰁N p(x(i) |z;φ)p(z;μ)
log Eq(i) q(i)(z) ,
Jensen’s inequality says that because log is a concave function, we can push it inside
=
where Eq(i) [f(z)] = 􏰀Kz=1 q(i)(z) × f(z) refers to the expectation of the function f under
i=1
the expectation, and obtain a lower bound.
the distribution z ∼ q(i).
log p(x; φ, μ) ≥
J 􏱕
=
􏰁􏰁N i=1
N i=1
􏰁N i=1
􏱊 p(x(i) |z;φ)p(z;μ)􏱋 Eq(i) log q(i)(z)
Eq(i) Eq(i)
􏰪􏰫
log p(x(i) | z; φ) + log p(z; μ) − log q(i)(z) 􏰪􏰫
log p(x(i), z; φ, μ) + H(q(i))
We will focus on Equation 5.10, which is the lower bound on the marginal log-likelihood of the observed data, log p(x). Equation 􏰀5.11 shows the connection to the information theoretic concept of entropy, H(q(i)) = − Kz=1 q(i)(z) log q(i)(z), which measures the av- erage amount of information produced by a draw from the distribution q(i). The lower bound J is a function of two groups of arguments:
• the distributions q(i) for each instance; • the parameters μ and φ.
The expectation-maximization (EM) algorithm maximizes the bound with respect to each of these arguments in turn, while holding the other fixed.
The E-step
The step in which we update q(i) is known as the E-step, because it updates the distribu-
tion under which the expectation is computed. To derive this update, first write out the Under contract with MIT Press, shared under CC-BY-NC-ND license.

100 CHAPTER5. LEARNINGWITHOUTSUPERVISION expectation in the lower bound as a sum,
􏰁N􏰁K􏰪 􏰫
J = q(i)(z) log p(x(i) | z; φ) + log p(z; μ) − log q(i)(z) . [5.12]
i=1 z=1
􏰀 When optimizing this bound, we must also respect a set of “sum-to-one” constraints, Kz=1 q(i)(z) = 1 for all i. Just as in Na ̈ıve Bayes, this constraint can be incorporated into a
Lagrangian:
i=1 z=1
logq(i)(z)=logp(x(i) |z;φ)+logp(z;μ)−1−λ(i) q(i)(z) ∝p(x(i) | z; φ) × p(z; μ).
Applying the sum-to-one constraint gives an exact solution, q(i)(z)=􏰀 p(x(i) |z;φ)×p(z;μ)
K(i)′ ′ z′=1p(x |z;φ)×p(z;μ)
=p(z | x(i); φ, μ).
􏰁N􏰁K 􏰛 Jq = q(i)(z)
􏰜 􏰁K
log p(x(i) | z; φ) + log p(z; μ) − log q(i)(z) where λ(i) is the Lagrange multiplier for instance i.
∂Jq =logp(x(i) |z;φ)+logp(z;θ)−logq(i)(z)−1−λ(i) ∂q(i)(z)
+ λ(i)(1 − The Lagrangian is maximized by taking the derivative and solving for q(i):
q(i)(z)), [5.13]
[5.14]
[5.15] [5.16]
[5.17]
[5.18]
z=1
After normalizing, each q(i) — which is the soft distribution over clusters for data x(i) — is set to the posterior probability p(z | x(i); φ, μ) under the current parameters. Although the Lagrange multipliers λ(i) were introduced as additional parameters, they drop out during normalization.
The M-step
Next, we hold fixed the soft assignments q(i), and maximize with respect to the pa- rameters, φ and μ. Let’s focus on the parameter φ, which parametrizes the likelihood p(x | z; φ), and leave μ for an exercise. The parameter 􏰀φ is a distribution over words for each cluster, so it is optimized under the constraint that Vj=1 φz,j = 1. To incorporate this
Jacob Eisenstein. Draft of October 15, 2018.

5.1. UNSUPERVISEDLEARNING 101 constraint, we introduce a set of Lagrange multiplers {λz}Kz=1, and from the Lagrangian,
􏰁N􏰁K 􏰛 􏰜􏰁K 􏰁V
Jφ = q(i)(z) logp(x(i) |z;φ)+logp(z;μ)−logq(i)(z) + λz(1− φz,j). i=1 z=1 z=1 j=1
[5.19]
The term log p(x(i) | z; φ) is the conditional log-likelihood for the multinomial, which expands to,
(i) 􏰁V
logp(x | z,φ) = C + xj logφz,j, [5.20]
j=1
where C is a constant with respect to φ — see Equation 2.12 in section 2.2 for more dis-
cussion of this probability function.
Setting the derivative of Jφ equal to zero,
∂J􏰁N x(i)
φ = q(i)(z)× j −λz
[5.21] [5.22]
Because φz is constrained to be a probability distribution, the exact solution is computed as,
􏰀N q(i)(z) × x(i) E
This update sets φz equal to the relative frequency estimate of the expected counts under the distribution q. As in supervised Na ̈ıve Baye􏰀s, we can smooth these counts by adding a constant α. The update for μ is similar: μz ∝ Ni=1 q(i)(z) = Eq [count(z)], which is the expected frequency of cluster z. These probabilities can also be smoothed. In sum, the M-step is just like Na ̈ıve Bayes, but with expected counts rather than observed counts.
The multinomial likelihood p(x | z) can be replaced with other probability distribu- tions: for example, for continuous observations, a Gaussian distribution can be used. In some cases, there is no closed-form update to the parameters of the likelihood. One ap- proach is to run gradient-based optimization at each M-step; another is to simply take a single step along the gradient step and then return to the E-step (Berg-Kirkpatrick et al., 2010).
Under contract with MIT Press, shared under CC-BY-NC-ND license.
∂φz,j i=1 φz,j 􏰁N (i)
φz,j ∝ q(i)(z) × xj . i=1
φ=i=1j=􏰀q ,[5.23] z,j􏰀V􏰀N (i)V ′
[count(z, j)]
j′=1 i=1 q(i)(z) × xj′ j′=1 Eq [count(z, j )] where the counter j ∈ {1, 2, . . . , V } indexes over base features, such as words.

102 CHAPTER5. LEARNINGWITHOUTSUPERVISION
450000 440000 430000
02468
iteration
Figure 5.2: Sensitivity of expectation-maximization to initialization. Each line shows the progress of optimization from a different random initialization.
5.1.3 EM as an optimization algorithm
Algorithms that update a global objective by alternating between updates to subsets of the parameters are called coordinate ascent algorithms. The objective J (the lower bound on the marginal likelihood of the data) is separately convex in q and (μ, φ), but it is not jointly convex in all terms; this condition is known as biconvexity. Each step of the expectation- maximization algorithm is guaranteed not to decrease the lower bound J, which means that EM will converge towards a solution at which no nearby points yield further im- provements. This solution is a local optimum — it is as good or better than any of its immediate neighbors, but is not guaranteed to be optimal among all possible configura- tions of (q, μ, φ).
The fact that there is no guarantee of global optimality means that initialization is important: where you start can determine where you finish. To illustrate this point, Figure 5.2 shows the objective function for EM with ten different random initializations: while the objective function improves monotonically in each run, it converges to several differentvalues.1 Fortheconvexobjectivesthatweencounteredinchapter2,itwasnot necessary to worry about initialization, because gradient-based optimization guaranteed to reach the global minimum. But in expectation-maximization — as in the deep neural networks from chapter 3 — initialization matters.
1The figure shows the upper bound on the negative log-likelihood, because optimization is typically framed as minimization rather than maximization.
Jacob Eisenstein. Draft of October 15, 2018.
(i) (i) distribution assigns probability of 1 to a single label zˆ , and zero
In hard EM, each q
probability to all others (Neal and Hinton, 1998). This is similar in spirit to K-means clus- tering, and can outperform standard EM in some cases (Spitkovsky et al., 2010). Another variant of expectation-maximization incorporates stochastic gradient descent (SGD): after performing a local E-step at each instance x(i), we immediately make a gradient update to the parameters (μ, φ). This algorithm has been called incremental expectation maxi- mization (Neal and Hinton, 1998) and online expectation maximization (Sato and Ishii,
negative log-likelihood bound

5.1. UNSUPERVISEDLEARNING 103
2000; Cappe ́ and Moulines, 2009), and is especially useful when there is no closed-form optimum for the likelihood p(x | z), and in online settings where new data is constantly streamed in (see Liang and Klein, 2009, for a comparison for online EM variants).
5.1.4 How many clusters?
So far, we have assumed that the number of clusters K is given. In some cases, this as- sumption is valid. For example, a lexical semantic resource like WORDNET might define the number of senses for a word. In other cases, the number of clusters could be a parame- ter for the user to tune: some readers want a coarse-grained clustering of news stories into three or four clusters, while others want a fine-grained clustering into twenty or more. But many times there is little extrinsic guidance for how to choose K.
One solution is to choose the number of clusters to maximize a metric of clustering quality. The other parameters μ and φ are chosen to maximize the log-likelihood bound J, so this might seem a potential candidate for tuning K. However, J will never decrease with K: if it is possible to obtain a bound of JK with K clusters, then it is always possible to do at least as well with K + 1 clusters, by simply ignoring the additional cluster and setting its probability to zero in q and μ. It is therefore necessary to introduce a penalty for model complexity, so that fewer clusters are preferred. For example, the Akaike Infor- mation Crition (AIC; Akaike, 1974) is the linear combination of the number of parameters and the log-likelihood,
AIC = 2M − 2J, [5.24]
where M is the number of parameters. In an expectation-maximization clustering algo- rithm, M = K × V + K. Since the number of parameters increases with the number of clusters K, the AIC may prefer more parsimonious models, even if they do not fit the data quite as well.
Another choice is to maximize the predictive likelihood on heldout data. This data is not used to estimate the model parameters φ and μ, and so it is not the case that the likelihood on this data is guaranteed to increase with K. Figure 5.3 shows the negative log-likelihood on training and heldout data, as well as the AIC.
*Bayesian nonparametrics An alternative approach is to treat the number of clusters as another latent variable. This requires statistical inference over a set of models with a variable number of clusters. This is not possible within the framework of expectation- maximization, but there are several alternative inference procedures which can be ap- plied, including Markov Chain Monte Carlo (MCMC), which is briefly discussed in sec- tion 5.5 (for more details, see Chapter 25 of Murphy, 2012). Bayesian nonparametrics have been applied to the problem of unsupervised word sense induction, learning not only the word senses but also the number of senses per word (Reisinger and Mooney, 2010).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

104 CHAPTER5. LEARNINGWITHOUTSUPERVISION
260000 240000 220000
85000 80000 75000
Negative log-likelihood bound AIC
Out-of-sample negative log likelihood
10 20 30 40 50
Number of clusters
10 20 30 40 50
Number of clusters
Figure 5.3: The negative log-likelihood and AIC for several runs of expectation- maximization, on synthetic data. Although the data was generated from a model with K = 10, the optimal number of clusters is Kˆ = 15, according to AIC and the heldout log-likelihood. The training set log-likelihood continues to improve as K increases.
5.2 Applications of expectation-maximization
EM is not really an “algorithm” like, say, quicksort. Rather, it is a framework for learning with missing data. The recipe for using EM on a problem of interest is:
• Introduce latent variables z, such that it is easy to write the probability P (x, z). It should also be easy to estimate the associated parameters, given knowledge of z.
• DerivetheE-stepupdatesforq(z),whichistypicallyfactoredasq(z)=􏰧Ni=1qz(i)(z(i)), where i is an index over instances.
• The M-step updates typically correspond to the soft version of a probabilistic super- vised learning algorithm, like Na ̈ıve Bayes.
This section discusses a few of the many applications of this general framework.
5.2.1 Word sense induction
The chapter began by considering the problem of word sense disambiguation when the senses are not known in advance. Expectation-maximization can be applied to this prob- lem by treating each cluster as a word sense. Each instance represents the use of an ambiguous word, and x(i) is a vector of counts for the other words that appear nearby: Schu ̈tze(1998)usesallwordswithina50-wordwindow.Theprobabilityp(x(i) |z)canbe set to the multinomial distribution, as in Na ̈ıve Bayes. The EM algorithm can be applied directly to this data, yielding clusters that (hopefully) correspond to the word senses.
Better performance can be obtained by first applying singular value decomposition (SVD) to the matrix of context-counts Cij = count(i, j), where count(i, j) is the count of word j in the context of instance i. Truncated singular value decomposition approximates
Jacob Eisenstein. Draft of October 15, 2018.

5.2. APPLICATIONSOFEXPECTATION-MAXIMIZATION 105 the matrix C as a product of three matrices, U, S, V, under the constraint that U and V
are orthonormal, and S is diagonal:
min ||C − USV⊤||F
U,S,V
s.t.U ∈ RV ×K,UU⊤ = I
S = Diag(s1,s2,…,sK) V⊤ ∈RNp×K,VV⊤ =I,
[5.25]
where || · ||F is the Frobenius norm, ||X||F = 􏰠􏰀 X2 . The matrix U contains the i,j i,j
left singular vectors of C, and the rows of this matrix can be used as low-dimensional representations of the count vectors ci. EM clustering can be made more robust by setting theinstancedescriptionsx(i)equaltotheserows,ratherthanusingrawcounts(Schu ̈tze, 1998). However, because the instances are now dense vectors of continuous numbers, the probability p(x(i) | z) must be defined as a multivariate Gaussian distribution.
In truncated singular value decomposition, the hyperparameter K is the truncation limit: when K is equal to the rank of C, the norm of the difference between the original matrix C and its reconstruction USV⊤ will be zero. Lower values of K increase the recon- struction error, but yield vector representations that are smaller and easier to learn from. Singular value decomposition is discussed in more detail in chapter 14.
5.2.2 Semi-supervised learning
Expectation-maximization can also be applied to the problem of semi-supervised learn- ing: learning from both labeled and unlabeled data in a single model. Semi-supervised learning makes use of annotated examples, ensuring that each label y corresponds to the desired concept. By adding unlabeled examples, it is possible cover a greater fraction of the features than would appear in labeled data alone. Other methods for semi-supervised learning are discussed in section 5.3, but for now, let’s approach the problem within the framework of expectation-maximization (Nigam et al., 2000).
Suppose we have labeled data {(x(i), y(i))}Nl , and unlabeled data {x(i)}Nl+Nu , where i=1 i=Nl +1
Nl is the number of labeled instances and Nu is the number of unlabeled instances. We can learn from the combined data by maximizing a lower bound on the joint log-likelihood,
N
Nl 􏰛 = 􏰁
i=1
N +Nu ll
􏰁􏰁
L = log p(x(i), y(i); μ, φ) + log p(x(j); μ, φ) i=1 j =Nl +1
[5.26] [5.27]
􏰜 Nl+Nu + 􏰁
K
log 􏰁 p(x(j), y; μ, φ).
log p(x(i) | y(i); φ) + log p(y(i); μ)
Under contract with MIT Press, shared under CC-BY-NC-ND license.
j=Nl+1
y=1

106 CHAPTER5. LEARNINGWITHOUTSUPERVISION Algorithm 9 Generative process for the Na ̈ıve Bayes classifier with hidden components
for Instance i ∈ {1,2,…,N} do:
Draw the label y(i) ∼ Categorical(μ);
Draw the component z(i) ∼ Categorical(βy(i) );
Draw the word counts x(i) | y(i), z(i) ∼ Multinomial(φz(i) ).
The left sum is identical to the objective in Na ̈ıve Bayes; the right sum is the marginal log-
likelihood for expectation-maximization clustering, from Equation 5.5. We can construct a
lower bound on this log-likelihood by introducing distributions q(j) for all j ∈ {Nl + 1, . . . , Nl + Nu}. The E-step updates these distributions; the M-step updates the parameters φ and μ, us-
ing the expected counts from the unlabeled data and the observed counts from the labeled
data.
A critical issue in semi-supervised learning is how to balance the impact of the labeled and unlabeled data on the classifier weights, especially when the unlabeled data is much larger than the labeled dataset. The risk is that the unlabeled data will dominate, caus- ing the parameters to drift towards a “natural clustering” of the instances — which may not correspond to a good classifier for the labeled data. One solution is to heuristically reweight the two components of Equation 5.26, tuning the weight of the two components on a heldout development set (Nigam et al., 2000).
5.2.3 Multi-component modeling
As a final application, let’s return to fully supervised classification. A classic dataset for text classification is 20 newsgroups, which contains posts to a set of online forums, called newsgroups. One of the newsgroups is comp.sys.mac.hardware, which discusses Ap- ple computing hardware. Suppose that within this newsgroup there are two kinds of posts: reviews of new hardware, and question-answer posts about hardware problems. Thelanguageinthesecomponentsofthemac.hardware classmighthavelittleincom- mon; if so, it would be better to model these components separately, rather than treating their union as a single class. However, the component responsible for each instance is not directly observed.
Recall that Na ̈ıve Bayes is based on a generative process, which provides a stochastic explanation for the observed data. In Na ̈ıve Bayes, each label is drawn from a categorical distribution with parameter μ, and each vector of word counts is drawn from a multi- nomial distribution with parameter φy. For multi-component modeling, we envision a slightly different generative process, incorporating both the observed label y(i) and the latent component z(i). This generative process is shown in Algorithm 9. A new parameter βy(i) definesthedistributionofcomponents,conditionedonthelabely(i).Thecomponent, and not the class label, then parametrizes the distribution over words.
Jacob Eisenstein. Draft of October 15, 2018.

5.3. SEMI-SUPERVISEDLEARNING 107
(5.1) 􏰈 Villeneuve a bel et bien re ́ussi son pari de changer de perspectives tout en assurant
une cohe ́rence a` la franchise.2
(5.2) 􏱇 Il est e ́galement trop long et bancal dans sa narration, tie`de dans ses intentions, et tiraille ́ entre deux personnages et directions qui ne parviennent pas a` coexister en har- monie.3
(5.3) Denis Villeneuve a re ́ussi une suite parfaitement maitrise ́e4
(5.4) Long, bavard, hyper design, a` peine agite ́ (le comble de l’action : une bagarre dans la
flotte), me ́taphysique et, surtout, ennuyeux jusqu’a` la catalepsie.5
(5.5) Une suite d’une e ́crasante puissance, meˆlant parfaitement le contemplatif au narratif.6
(5.6) Le film impitoyablement bavard finit quand meˆme par se taire quand se le`ve l’espe`ce de bouquet final ou` semble se de ́chaˆıner, comme en libre parcours de poulets de ́capite ́s, l’arme ́e des graphistes nume ́riques griffant nerveusement la palette graphique entre ag- onie et orgasme.7
Table 5.1: Labeled and unlabeled reviews of the films Blade Runner 2049 and Transformers: The Last Knight.
The labeled data includes (x(i),y(i)), but not z(i), so this is another case of missing data. Again, we sum over the missing data, applying Jensen’s inequality to as to obtain a lower bound on the log-likelihood,
Z|Y
We are now ready to apply expectation-maximization. As usual, the E-step updates
􏰁Kz z=1
log p(x(i), y(i)) = log
≥ log p(y(i); μ) + Eq(i) [log p(x(i) | z; φ) + log p(z | y(i); β) − log q(i)(z)].
p(x(i), y(i), z; μ, φ, β) [5.28]
[5.29] the distribution over the missing data, q(i) . The M-step updates the parameters,
Z|Y
βy,z =􏰀 Eq [count(y,z)]
[5.30] [5.31]
In semi-supervised learning, the learner makes use of both labeled and unlabeled data. To see how this could help, suppose you want to do sentiment analysis in French. In Ta-
Under contract with MIT Press, shared under CC-BY-NC-ND license.
Kz ′ z′=1 Eq [count(y, z )]
φz,j =􏰀 Eq [count(z,j)] . V′
j′=1 Eq [count(z, j )] 5.3 Semi-supervised learning

108 CHAPTER5. LEARNINGWITHOUTSUPERVISION
ble 5.1, there are two labeled examples, one positive and one negative. From this data, a learner could conclude that re ́ussi is positive and long is negative. This isn’t much! How- ever, we can propagate this information to the unlabeled data, and potentially learn more.
• If we are confident that re ́ussi is positive, then we might guess that (5.3) is also posi- tive.
• That suggests that parfaitement is also positive.
• We can then propagate this information to (5.5), and learn from this words in this
example.
• Similarly, we can propagate from the labeled data to (5.4), which we guess to be negative because it shares the word long. This suggests that bavard is also negative, which we propagate to (5.6).
Instances (5.3) and (5.4) were “similar” to the labeled examples for positivity and negativ- ity, respectively. By using these instances to expand the models for each class, it became possible to correctly label instances (5.5) and (5.6), which didn’t share any important fea- tures with the original labeled data. This requires a key assumption: that similar instances will have similar labels.
In subsection 5.2.2, we discussed how expectation-maximization can be applied to semi-supervised learning. Using the labeled data, the initial parameters φ would assign a high weight for re ́ussi in the positive class, and a high weight for long in the negative class. These weights helped to shape the distributions q for instances (5.3) and (5.4) in the E-step. In the next iteration of the M-step, the parameters φ are updated with counts from these instances, making it possible to correctly label the instances (5.5) and (5.6).
However, expectation-maximization has an important disadvantage: it requires using a generative classification model, which restricts the features that can be used for clas- sification. In this section, we explore non-probabilistic approaches, which impose fewer restrictions on the classification model.
5.3.1 Multi-view learning
EM semi-supervised learning can be viewed as self-training: the labeled data guides the initial estimates of the classification parameters; these parameters are used to compute a label distribution over the unlabeled instances, q(i); the label distributions are used to update the parameters. The risk is that self-training drifts away from the original labeled data. This problem can be ameliorated by multi-view learning. Here we take the as- sumption that the features can be decomposed into multiple “views”, each of which is conditionally independent, given the label. For example, consider the problem of classi- fying a name as a person or location: one view is the name itself; another is the context in which it appears. This situation is illustrated in Table 5.2.
Jacob Eisenstein. Draft of October 15, 2018.

5.3. SEMI-SUPERVISEDLEARNING
109
x(1)
1. Peachtree Street
2. Dr. Walker
3. Zanzibar
4. Zanzibar
5. Dr. Robert
6. Oprah
x(2) located on
said
located in flew to recommended recommended
y
LOC PER
? → LOC ? → LOC ? → PER ? → PER
Table 5.2: Example of multiview learning for named entity classification
Co-training is an iterative multi-view learning algorithm, in which there are separate classifiers for each view (Blum and Mitchell, 1998). At each iteration of the algorithm, each classifier predicts labels for a subset of the unlabeled instances, using only the features available in its view. These predictions are then used as ground truth to train the classifiers associated with the other views. In the example shown in Table 5.2, the classifier on x(1) might correctly label instance #5 as a person, because of the feature Dr; this instance would then serve as training data for the classifier on x(2), which would then be able to correctly label instance #6, thanks to the feature recommended. If the views are truly independent, this procedure is robust to drift. Furthermore, it imposes no restrictions on the classifiers that can be used for each view.
Word-sense disambiguation is particularly suited to multi-view learning, thanks to the heuristic of “one sense per discourse”: if a polysemous word is used more than once in a given text or conversation, all usages refer to the same sense (Gale et al., 1992). This motivates a multi-view learning approach, in which one view corresponds to the local context (the surrounding words), and another view corresponds to the global context at the document level (Yarowsky, 1995). The local context view is first trained on a small seed dataset. We then identify its most confident predictions on unlabeled instances. The global context view is then used to extend these confident predictions to other instances within the same documents. These new instances are added to the training data to the local context classifier, which is retrained and then applied to the remaining unlabeled data.
5.3.2 Graph-based algorithms
Another family of approaches to semi-supervised learning begins by constructing a graph, in which pairs of instances are linked with symmetric weights ωi,j , e.g.,
ωi,j = exp(−α × ||x(i) − x(j)||2). [5.32] The goal is to use this weighted graph to propagate labels from a small set of labeled
instances to larger set of unlabeled instances.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

110 CHAPTER5. LEARNINGWITHOUTSUPERVISION
In label propagation, this is done through a series of matrix operations (Zhu et al., 2003). Let Q be a matrix of size N × K, in which each row q(i) describes the labeling of instance i. When ground truth labels are available, then q(i) is an indicator vector,
with q(i) = 1 and q(i) = 0. Let us refer to the submatrix of rows containing labeled y(i) y′̸=y(i)
instances as QL, and the remaining rows as QU . The rows of QU are initialized to assign equal probabilities to all labels, qi,k = 1 .
K
Now, let Ti,j represent the “transition” probability of moving from node j to node i,
Ti,j 􏱕Pr(j →i)= 􏰀 ωi,j . [5.33] Nk=1 ωk,j
We compute values of Ti,j for all instances j and all unlabeled instances i, forming a matrix of size NU × N . If the dataset is large, this matrix may be expensive to store and manip- ulate; a solution is to sparsify it, by keeping only the κ largest values in each row, and setting all other values to zero. We can then “propagate” the label distributions to the unlabeled instances,
Q ̃U ←TQ s ←Q ̃ U 1
QU ←Diag(s)−1Q ̃ U .
[5.34] [5.35] [5.36]
The expression Q ̃ U 1 indicates multiplication of Q ̃ U by a column vector of ones, which is equivalent to computing the sum of each row of Q ̃ U . The matrix Diag(s) is a diagonal matrix with the elements of s on the diagonals. The product Diag(s)−1Q ̃ U has the effect of normalizing the rows of Q ̃ U , so that each row of QU is a probability distribution over labels.
5.4 Domain adaptation
In many practical scenarios, the labeled data differs in some key respect from the data to which the trained model is to be applied. A classic example is in consumer reviews: we may have labeled reviews of movies (the source domain), but we want to predict the reviews of appliances (the target domain). A similar issues arise with genre differences: most linguistically-annotated data is news text, but application domains range from social media to electronic health records. In general, there may be several source and target domains, each with their own properties; however, for simplicity, this discussion will focus mainly on the case of a single source and target domain.
The simplest approach is “direct transfer”: train a classifier on the source domain, and apply it directly to the target domain. The accuracy of this approach depends on the extent to which features are shared across domains. In review text, words like outstanding and
Jacob Eisenstein. Draft of October 15, 2018.

5.4. DOMAINADAPTATION 111
disappointing will apply across both movies and appliances; but others, like terrifying, may have meanings that are domain-specific. As a result, direct transfer performs poorly: for example, an out-of-domain classifier (trained on book reviews) suffers twice the error rate of an in-domain classifier on reviews of kitchen appliances (Blitzer et al., 2007). Domain adaptation algorithms attempt to do better than direct transfer by learning from data in both domains. There are two main families of domain adaptation algorithms, depending on whether any labeled data is available in the target domain.
5.4.1 Supervised domain adaptation
In supervised domain adaptation, there is a small amount of labeled data in the target domain, and a large amount of data in the source domain. The simplest approach would be to ignore domain differences, and simply merge the training data from the source and target domains. There are several other baseline approaches to dealing with this sce- nario (Daume ́ III, 2007):
Interpolation. Train a classifier for each domain, and combine their predictions, e.g.,
yˆ = argmax λsΨs(x, y) + (1 − λs)Ψt(x, y), [5.37]
y
where Ψs and Ψt are the scoring functions from the source and target domain clas-
sifiers respectively, and λs is the interpolation weight.
Prediction. Trainaclassifieronthesourcedomaindata,useitspredictionasanadditional
feature in a classifier trained on the target domain data,
yˆs =argmaxΨs(x,y) [5.38]
y
yˆt =argmaxΨt([x;yˆS],y). [5.39] y
Priors. Train a classifier on the source domain data, and use its weights as a prior distri- bution on the weights of the classifier for the target domain data. This is equivalent to regularizing the target domain weights towards the weights of the source domain classifier (Chelba and Acero, 2006),
􏰁N i=1
where l(i) is the prediction loss on instance i, and λ is the regularization weight. Under contract with MIT Press, shared under CC-BY-NC-ND license.
l(θt) =
l(i)(x(i), y(i); θt) + λ||θt − θs||2, [5.40]

112 CHAPTER5. LEARNINGWITHOUTSUPERVISION
An effective and “frustratingly simple” alternative is EASYADAPT (Daume ́ III, 2007), which creates copies of each feature: one for each domain and one for the cross-domain setting. For example, a negative review of the film Wonder Woman begins, As boring and flavorless as a three-day-old grilled cheese sandwich….8 The resulting bag-of-words feature vector would be,
f(x,y,d) = {(boring,􏱇,MOVIE) : 1,(boring,􏱇,∗) : 1, (flavorless, 􏱇, MOVIE) : 1, (flavorless, 􏱇, ∗) : 1,
(three-day-old, 􏱇, MOVIE) : 1, (three-day-old, 􏱇, ∗) : 1, …},
with (boring, 􏱇, MOVIE) indicating the word boring appearing in a negative labeled doc- ument in the MOVIE domain, and (boring,􏱇,∗) indicating the same word in a negative labeled document in any domain. It is up to the learner to allocate weight between the domain-specific and cross-domain features: for words that facilitate prediction in both domains, the learner will use the cross-domain features; for words that are relevant only to a single domain, the domain-specific features will be used. Any discriminative classi- fier can be used with these augmented features.9
5.4.2 Unsupervised domain adaptation
In unsupervised domain adaptation, there is no labeled data in the target domain. Un- supervised domain adaptation algorithms cope with this problem by trying to make the data from the source and target domains as similar as possible. This is typically done by learning a projection function, which puts the source and target data in a shared space, in which a learner can generalize across domains. This projection is learned from data in both domains, and is applied to the base features — for example, the bag-of-words in text classification. The projected features can then be used both for training and for prediction.
Linear projection
In linear projection, the cross-domain representation is constructed by a matrix-vector product,
g(x(i)) = Ux(i). [5.41] The projected vectors g(x(i)) can then be used as base features during both training (from
the source domain) and prediction (on the target domain).
8http://www.colesmithey.com/capsules/2017/06/wonder-woman.HTML, accessed October 9. 2017.
9EASYADAPT can be explained as a hierarchical Bayesian model, in which the weights for each domain are drawn from a shared prior (Finkel and Manning, 2009).
Jacob Eisenstein. Draft of October 15, 2018.

5.4. DOMAINADAPTATION 113
The projection matrix U can be learned in a number of different ways, but many ap- proaches focus on compressing and reconstructing the base features (Ando and Zhang, 2005). For example, we can define a set of pivot features, which are typically chosen be- cause they appear in both domains: in the case of review documents, pivot features might include evaluative adjectives like outstanding and disappointing (Blitzer et al., 2007). For each pivot feature j, we define an auxiliary problem of predicting whether the feature is present in each example, using the remaining base features. Let φj denote the weights of this classifier, and us horizontally concatenate the weights for each of the Np pivot features into a matrix Φ = [φ1,φ2,…,φNP ].
We then perform truncated singular value decomposition on Φ, as described in sub- section 5.2.1, obtaining Φ ≈ USV⊤. The rows of the matrix U summarize information about each base feature: indeed, the truncated singular value decomposition identifies a low-dimension basis for the weight matrix Φ, which in turn links base features to pivot features. Suppose that a base feature reliable occurs only in the target domain of appliance reviews. Nonetheless, it will have a positive weight towards some pivot features (e.g., out- standing, recommended), and a negative weight towards others (e.g., worthless, unpleasant). A base feature such as watchable might have the same associations with the pivot features, and therefore, ureliable ≈ uwatchable. The matrix U can thus project the base features into a space in which this information is shared.
Non-linear projection
Non-linear transformations of the base features can be accomplished by implementing the transformation function as a deep neural network, which is trained from an auxiliary objective.
Denoising objectives One possibility is to train a projection function to reconstruct a corrupted version of the original input. The original input can be corrupted in various ways: by the addition of random noise (Glorot et al., 2011; Chen et al., 2012), or by the deletion of features (Chen et al., 2012; Yang and Eisenstein, 2015). Denoising objectives share many properties of the linear projection method described above: they enable the projection function to be trained on large amounts of unlabeled data from the target do- main, and allow information to be shared across the feature space, thereby reducing sen- sitivity to rare and domain-specific features.
Adversarial objectives The ultimate goal is for the transformed representations g(x(i)) to be domain-general. This can be made an explicit optimization criterion by comput- ing the similarity of transformed instances both within and between domains (Tzeng et al., 2015), or by formulating an auxiliary classification task, in which the domain it- self is treated as a label (Ganin et al., 2016). This setting is adversarial, because we want
Under contract with MIT Press, shared under CC-BY-NC-ND license.

114 CHAPTER5. LEARNINGWITHOUTSUPERVISION
ly y(i)
x g(x) ld d(i)
Figure 5.4: A schematic view of adversarial domain adaptation. The loss ly is computed
only for instances from the source domain, where labels y(i) are available.
to learn a representation that makes this classifier perform poorly. At the same time, we
want g(x(i)) to enable accurate predictions of the labels y(i).
To formalize this idea, let d(i) represent the domain of instance i, and let ld(g(x(i)), d(i); θd) represent the loss of a classifier (typically a deep neural network) trained to predict d(i)
from the transformed representation g(x(i)), using parameters θd. Analogously, let ly(g(x(i)), y(i); θy) represent the loss of a classifier trained to predict the label y(i) from g(x(i)), using param-
eters θy. The transformation g can then be trained from two criteria: it should yield accu-
rate predictions of the labels y(i), while making inaccurate predictions of the domains d(i).
This can be formulated as a joint optimization problem,
[5.42]
where Nl is the number of labeled instances and Nu is the number of unlabeled instances, with the labeled instances appearing first in the dataset. This setup is shown in Figure 5.4. The loss can be optimized by stochastic gradient descent, jointly training the parameters of the non-linear transformation θg, and the parameters of the prediction models θd and θy.
5.5 *Other approaches to learning with latent variables
Expectation-maximization provides a general approach to learning with latent variables, but it has limitations. One is the sensitivity to initialization; in practical applications, considerable attention may need to be devoted to finding a good initialization. A second issue is that EM tends to be easiest to apply in cases where the latent variables have a clear decomposition (in the cases we have considered, they decompose across the instances). For these reasons, it is worth briefly considering some alternatives to EM.
Jacob Eisenstein. Draft of October 15, 2018.
N +Nu N ll 􏰁􏰁
min ld(g(x(i); θg), d(i); θd) − ly(g(x(i); θg), y(i); θy), θg θy ,θd i=1 i=1

5.5. *OTHERAPPROACHESTOLEARNINGWITHLATENTVARIABLES 115 5.5.1 Sampling
In EM clustering, there is a distribution q(i) for the missing data related to each instance. The M-step consists of updating the parameters of this distribution. An alternative is to draw samples of the latent variables. If the sampling distribution is designed correctly, this procedure will eventually converge to drawing samples from the true posterior over the missing data, p(z(1:Nz) | x(1:Nx)). For example, in the case of clustering, the missing data z(1:Nz) is the set of cluster memberships, y(1:N), so we draw samples from the pos- terior distribution over clusterings of the data. If a single clustering is required, we can select the one with the highest conditional likelihood, zˆ = argmaxz p(z(1:Nz) | x(1:Nx)).
This general family of algorithms is called Markov Chain Monte Carlo (MCMC): “Monte Carlo” because it is based on a series of random draws; “Markov Chain” because the sampling procedure must be designed such that each sample depends only on the previous sample, and not on the entire sampling history. Gibbs sampling is an MCMC algorithm in which each latent variable is sampled from its posterior distribution,
z(n) | x, z(−n) ∼ p(z(n) | x, z(−n)), [5.43]
where z(−n) indicates {z\z(n)}, the set of all latent variables except for z(n). Repeatedly drawing samples over all latent variables constructs a Markov chain that is guaranteed to converge to a sequence of samples from p(z(1:Nz) | x(1:Nx)). In probabilistic clustering, the sampling distribution has the following form,
p(z
(i)
|x,z
(−i) p(x(i) | z(i); φ) × p(z(i); μ)
)=􏰀Kz=1p(x(i) |z;φ)×p(z;μ) [5.44]
∝Multinomial(x(i); φz(i) ) × μz(i) . [5.45]
In this case, the sampling distribution does not depend on the other instances: the poste- rior distribution over each z(i) can be computed from x(i) and the parameters given the parameters φ and μ.
In sampling algorithms, there are several choices for how to deal with the parameters. One possibility is to sample them too. To do this, we must add them to the generative story, by introducing a prior distribution. For the multinomial and categorical parameters in the EM clustering model, the Dirichlet distribution is a typical choice, since it defines a probability on exactly the set of vectors that can be parameters: vectors that sum to one and include only non-negative numbers.10
To incorporate this prior, the generative model must be augmented to indicate that each φz ∼ Dirichlet(αφ), and μ ∼ Dirichlet(αμ). The hyperparameters α are typically set to a constant vector α = [α, α, . . . , α]. When α is large, the Dirichlet distribution tends to
10If􏰀Ki θi =1andθi ≥0foralli,thenθissaidtobeontheK−1simplex.ADirichletdistributionwith Under contract with MIT Press, shared under CC-BY-NC-ND license.

116 CHAPTER5. LEARNINGWITHOUTSUPERVISION
generate vectors that are nearly uniform; when α is small, it tends to generate vectors that assign most of their probability mass to a few entries. Given prior distributions over φ and μ, we can now include them in Gibbs sampling, drawing values for these parameters from posterior distributions that are conditioned on the other variables in the model.
Unfortunately, sampling φ and μ usually leads to slow “mixing”, meaning that adja- cent samples tend to be similar, so that a large number of samples is required to explore the space of random variables. The reason is that the sampling distributions for the pa- rameters are tightly constrained by the cluster memberships y(i), which in turn are tightly constrained by the parameters. There are two solutions that are frequently employed:
• Empirical Bayesian methods maintain φ and μ as parameters rather than latent variables. They still employ sampling in the E-step of the EM algorithm, but they update the parameters using expected counts that are computed from the samples rather than from parametric distributions. This EM-MCMC hybrid is also known as Monte Carlo Expectation Maximization (MCEM; Wei and Tanner, 1990), and is well-suited for cases in which it is difficult to compute q(i) directly.
• In collapsed Gibbs sampling, we analytically integrate φ and μ out of the model. The cluster memberships y(i) are the only remaining latent variable; we sample them from the compound distribution,
p(y(i) |x(1:N),y(−i);αφ,αμ)=􏰥 p(φ,μ|y(−i),x(1:N);αφ,αμ)p(y(i) |x(1:N),y(−i),φ,μ)dφdμ. φ,μ
[5.48] For multinomial and Dirichlet distributions, this integral can be computed in closed
form.
MCMC algorithms are guaranteed to converge to the true posterior distribution over the latent variables, but there is no way to know how long this will take. In practice, the rate of convergence depends on initialization, just as expectation-maximization depends on initialization to avoid local optima. Thus, while Gibbs Sampling and other MCMC algorithms provide a powerful and flexible array of techniques for statistical inference in latent variable models, they are not a panacea for the problems experienced by EM.
parameterα∈RK+ hassupportovertheK−1simplex,
p
Dirichlet
1K
(θ | α) = 􏰙θαi−1
[5.46] [5.47]
i 􏰧Ki=1 Γ(αi)
B(α)
B(α) =Γ(􏰀Ki=1 αi),
i=1
with Γ(·) indicating the gamma function, a generalization of the factorial function to non-negative reals. Jacob Eisenstein. Draft of October 15, 2018.

5.5. *OTHERAPPROACHESTOLEARNINGWITHLATENTVARIABLES 117 5.5.2 Spectral learning
Another approach to learning with latent variables is based on the method of moments, which makes it possible to avoid the problem of non-convex log-likelihoo􏰀d. Write x(i) for
the normalized vector of word counts in document i, so that x(i) = x(i)/ V x(i). Then j=1 j
we can form a matrix of word-word co-occurrence probabilities,
􏰁N
x(i)(x(i))⊤. The expected value of this matrix under p(x | φ, μ), as
NK
E[C] = 􏰁􏰁 􏰁 Pr(Z(i) = k; μ)φkφ⊤k
i=1 k=1 K
= N μk φk φ⊤k k
=ΦDiag(N μ)Φ⊤ ,
C =
[5.49]
[5.50]
[5.51] [5.52]
i=1
where Φ is formed by horizontally concatenating φ1 . . . φK , and Diag(N μ) indicates a diagonal matrix with values Nμk at position (k,k). Setting C equal to its expectation gives,
C =ΦDiag(Nμ)Φ⊤, [5.53]
which is similar to the eigendecomposition C = QΛQ⊤. This suggests that simply by finding the eigenvectors and eigenvalues of C, we could obtain the parameters φ and μ, and this is what motivates the name spectral learning.
While moment-matching and eigendecomposition are similar in form, they impose different constraints on the solutions: eigendecomposition requires orthonormality, so that QQ⊤ = I; in estimating the parameters of a text clustering model, we require that μ and the columns of Φ are probability vectors. Spectral learning algorithms must therefore include a procedure for converting the solution into vectors that are non-negative and sum to one. One approach is to replace eigendecomposition (or the related singular value decomposition) with non-negative matrix factorization (Xu et al., 2003), which guarantees that the solutions are non-negative (Arora et al., 2013).
After obtaining the parameters φ and μ, the distribution over clusters can be com- puted from Bayes’ rule:
p(z(i) | x(i); φ, μ) ∝ p(x(i) | z(i); φ) × p(z(i); μ). [5.54] Under contract with MIT Press, shared under CC-BY-NC-ND license.

118 CHAPTER5. LEARNINGWITHOUTSUPERVISION
Spectral learning yields provably good solutions without regard to initialization, and can be quite fast in practice. However, it is more difficult to apply to a broad family of genera- tive models than EM and Gibbs Sampling. For more on applying spectral learning across a range of latent variable models, see Anandkumar et al. (2014).
Additional resources
There are a number of other learning paradigms that deviate from supervised learning.
• Activelearning:thelearnerselectsunlabeledinstancesandrequestsannotations(Set- tles, 2012).
• Multiple instance learning: labels are applied to bags of instances, with a positive label applied if at least one instance in the bag meets the criterion (Dietterich et al., 1997; Maron and Lozano-Pe ́rez, 1998).
• Constraint-driven learning: supervision is provided in the form of explicit con- straints on the learner (Chang et al., 2007; Ganchev et al., 2010).
• Distant supervision: noisy labels are generated from an external resource (Mintz et al., 2009, also see subsection 17.2.3).
• Multitask learning: the learner induces a representation that can be used to solve multiple classification tasks (Collobert et al., 2011).
• Transfer learning: the learner must solve a classification task that differs from the labeled data (Pan and Yang, 2010).
Expectation-maximization was introduced by Dempster et al. (1977), and is discussed in more detail by Murphy (2012). Like most machine learning treatments, Murphy focuses on continuous observations and Gaussian likelihoods, rather than the discrete observa- tions typically encountered in natural language processing. Murphy (2012) also includes an excellent chapter on MCMC; for a textbook-length treatment, see Robert and Casella (2013). For still more on Bayesian latent variable models, see Barber (2012), and for ap- plications of Bayesian models to natural language processing, see Cohen (2016). Surveys are available for semi-supervised learning (Zhu and Goldberg, 2009) and domain adapta- tion (Søgaard, 2013), although both pre-date the current wave of interest in deep learning.
Exercises
1. Derive the expectation maximization update for the parameter μ in the EM cluster- ing model.
Jacob Eisenstein. Draft of October 15, 2018.

5.5. *OTHERAPPROACHESTOLEARNINGWITHLATENTVARIABLES 119
2. Derive the E-step and M-step updates for the following generative model. You may
assume that the labels y(i) are observed, but z(i) is not. m
• For each instance i,
– Draw label y(i) ∼ Categorical(μ)
– For each token m ∈ {1,2,…,M(i)}
∗ Draw z(i) ∼ Categorical(π)
∗ If z(i) = 0, draw the current token from a label-specific distribution, m
w(i) ∼ φ (i) my
∗ If z(i) = 1, draw the current token from a document-specific distribu- m
tion, w(i) ∼ ν(i) m
3. UsingtheiterativeupdatesinEquations5.34-5.36,computetheoutcomeofthelabel propagation algorithm for the following examples.
m
?01?
?0 01
1? ??
The value inside the node indicates the label, y(i) ∈ {0, 1}, with y(i) =? for unlabeled nodes. The presence of an edge between two nodes indicates wi,j = 1, and the absence of an edge indicates wi,j = 0. For the third example, you need only compute the first three iterations, and then you can guess at the solution in the limit.
4. Use expectation-maximization clustering to train a word-sense induction system, applied to the word say.
• Import NLTK, run NLTK.DOWNLOAD() and select SEMCOR. Import SEMCOR from NLTK.CORPUS.
• The command SEMCOR.TAGGED SENTENCES(TAG=’SENSE’) returns an itera- tor over sense-tagged sentences in the corpus. Each sentence can be viewed
as an iterator over T R E E objects. For T R E E objects that are sense-annotated words, you can access the annotation as TREE.LABEL(), and the word itself with TREE.LEAVES(). So SEMCOR.TAGGED SENTENCES(TAG=’SENSE’)[0][2].LABEL() would return the sense annotation of the third word in the first sentence.
• Extract all sentences containing the senses SAY.V.01 and SAY.V.02.
• Build bag-of-words vectors x(i), containing the counts of other words in those
sentences, including all words that occur in at least two sentences.
• Implement and run expectation-maximization clustering on the merged data.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

120 CHAPTER5. LEARNINGWITHOUTSUPERVISION • Compute the frequency with which each cluster includes instances of SAY.V.01
and SAY.V.02.
In the remaining exercises, you will try out some approaches for semisupervised learn- ing and domain adaptation. You will need datasets in multiple domains. You can obtain product reviews in multiple domains here: https://www.cs.jhu.edu/ ̃mdredze/ datasets/sentiment/processed_acl.tar.gz. Choose a source and target domain, e.g. dvds and books, and divide the data for the target domain into training and test sets of equal size.
5. First, quantify the cost of cross-domain transfer.
• Trainalogisticregressionclassifieronthesourcedomaintrainingset,andeval- uate it on the target domain test set.
• Train a logistic regression classifier on the target domain training set, and eval- uate it on the target domain test set. This it the “direct transfer” baseline.
Compute the difference in accuracy, which is a measure of the transfer loss across domains.
6. Next, apply the label propagation algorithm from subsection 5.3.2.
As a baseline, using only 5% of the target domain training set, train a classifier, and
compute its accuracy on the target domain test set. Next, apply label propagation:
• Compute the label matrix QL for the labeled data (5% of the target domain training set), with each row equal to an indicator vector for the label (positive or negative).
• Iterate through the target domain instances, including both test and training data. At each instance i, compute all wij, using Equation 5.32, with α = 0.01. Use these values to fill in column i of the transition matrix T, setting all but the ten largest values to zero for each column i. Be sure to normalize the column so that the remaining values sum to one. You may need to use a sparse matrix for this to fit into memory.
• Apply the iterative updates from Equations 5.34-5.36 to compute the outcome of the label propagation algorithm for the unlabeled examples.
Select the test set instances from QU , and compute the accuracy of this method. Compare with the supervised classifier trained only on the 5% sample of the target domain training set.
Jacob Eisenstein. Draft of October 15, 2018.

5.5. *OTHERAPPROACHESTOLEARNINGWITHLATENTVARIABLES 121
7. Usingonly5%ofthetargetdomaintrainingdata(andallofthesourcedomaintrain- ing data), implement one of the supervised domain adaptation baselines in subsec- tion 5.4.1. See if this improves on the “direct transfer” baseline from the previous problem
8. Implement EASYADAPT (subsection 5.4.1), again using 5% of the target domain training data and all of the source domain data.
9. Now try unsupervised domain adaptation, using the “linear projection” method described in subsection 5.4.2. Specifically:
• Identify500pivotfeaturesasthewordswiththehighestfrequencyinthe(com-
plete)trainingdataforthesourceandtargetdomains.Specifically,letxdi bethe
count of the word i in domain d: choose the 500 words with the largest values of min(xsource, xtarget).
ii
• Train a classifier to predict each pivot feature from the remaining words in the document.
• ArrangethefeaturesoftheseclassifiersintoamatrixΦ,andperformtruncated singular value decomposition, with k = 20
• Train a classifier from the source domain data, using the combined features x(i) ⊕ U⊤x(i) — these include the original bag-of-words features, plus the pro- jected features.
• Apply this classifier to the target domain test set, and compute the accuracy.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Part II Sequences and trees
123

Chapter 6 Language models
In probabilistic classification, the problem is to compute the probability of a label, condi- tioned on the text. Let’s now consider the inverse problem: computing the probability of text itself. Specifically, we will consider models that assign probability to a sequence of word tokens, p(w1, w2, . . . , wM ), with wm ∈ V. The set V is a discrete vocabulary,
V = {aardvark, abacus, . . . , zither}. [6.1] Why would you want to compute the probability of a word sequence? In many appli-
cations, the goal is to produce word sequences as output:
• In machine translation (chapter 18), we convert from text in a source language to text in a target language.
• In speech recognition, we convert from audio signal to text.
• Insummarization(section16.3.4;section19.2),weconvertfromlongtextsintoshort
texts.
• In dialogue systems (section 19.3), we convert from the user’s input (and perhaps an external knowledge base) into a text response.
In many of the systems for performing these tasks, there is a subcomponent that com- putes the probability of the output text. The purpose of this component is to generate texts that are more fluent. For example, suppose we want to translate a sentence from Spanish to English.
(6.1) El cafe negro me gusta mucho.
Here is a literal word-for-word translation (a gloss):
125

126 CHAPTER6. LANGUAGEMODELS (6.2) The coffee black me pleases much.
A good language model of English will tell us that the probability of this translation is low, in comparison with more grammatical alternatives,
p(The coffee black me pleases much) < p(I love dark coffee). [6.2] How can we use this fact? Warren Weaver, one of the early leaders in machine trans- lation, viewed it as a problem of breaking a secret code (Weaver, 1955): When I look at an article in Russian, I say: ’This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ This observation motivates a generative model (like Na ̈ıve Bayes): • The English sentence w(e) is generated from a language model, pe(w(e)). • The Spanish sentence w(s) is then generated from a translation model, ps|e(w(s) | w(e)). Given these two distributions, translation can be performed by Bayes’ rule: pe|s(w(e) | w(s)) ∝pe,s(w(e), w(s)) [6.3] =ps|e(w(s) | w(e)) × pe(w(e)). [6.4] This is sometimes called the noisy channel model, because it envisions English text turning into Spanish by passing through a noisy channel, ps|e. What is the advantage of modeling translation this way, as opposed to modeling pe|s directly? The crucial point is that the two distributions ps|e (the translation model) and pe (the language model) can be estimated from separate data. The translation model requires examples of correct trans- lations, but the language model requires only text in English. Such monolingual data is much more widely available. Furthermore, once estimated, the language model pe can be reused in any application that involves generating English text, including translation from other languages. 6.1 N-gram language models A simple approach to computing the probability of a sequence of tokens is to use a relative frequency estimate. Consider the quote, attributed to Picasso, “computers are useless, they can only give you answers.” One way to estimate the probability of this sentence is, p(Computers are useless, they can only give you answers) = count(Computers are useless, they can only give you answers) [6.5] count(all sentences ever spoken) Jacob Eisenstein. Draft of October 15, 2018. 6.1. N-GRAM LANGUAGE MODELS 127 This estimator is unbiased: in the theoretical limit of infinite data, the estimate will be correct. But in practice, we are asking for accurate counts over an infinite number of events, since sequences of words can be arbitrarily long. Even with an aggressive upper bound of, say, M = 20 tokens in the sequence, the number of possible sequences is V 20, where V = |V|. A small vocabularly for English would have V = 105, so there are 10100 possible sequences. Clearly, this estimator is very data-hungry, and suffers from high vari- ance: even grammatical sentences will have probability zero if they have not occurred in the training data.1 We therefore need to introduce bias to have a chance of making reli- able estimates from finite training data. The language models that follow in this chapter introduce bias in various ways. We begin with n-gram language models, which compute the probability of a sequence as the product of probabilities of subsequences. The probability of a sequence p(w) = p(w1, w2, . . . , wM ) can be refactored using the chain rule (see section A.2): p(w) =p(w1,w2,...,wM) [6.6] =p(w1)×p(w2 |w1)×p(w3 |w2,w1)×...×p(wM |wM−1,...,w1) [6.7] Each element in the product is the probability of a word given all its predecessors. We can think of this as a word prediction task: given the context Computers are, we want to com- pute a probability over the next token. The relative frequency estimate of the probability of the word useless in this context is, count(computers are useless) p(useless | computers are) = 􏰀x∈V count(computers are x) = count(computers are useless). count(computers are) We haven’t made any approximations yet, and we could have just as well applied the chain rule in reverse order, p(w)=p(wM)×p(wM−1 |wM)×...×p(w1 |w2,...,wM), [6.8] or in any other order. But this means that we also haven’t really made any progress: to compute the conditional probability p(wM | wM−1,wM−2,...,w1), we would need to model V M −1 contexts. Such a distribution cannot be estimated from any realistic sample of text. To solve this problem, n-gram models make a crucial simplifying approximation: they condition on only the past n − 1 words. p(wm | wm−1 ...w1) ≈p(wm | wm−1,...,wm−n+1) [6.9] 1Chomsky famously argued that this is evidence against the very concept of probabilistic language mod- els: no such model could distinguish the grammatical sentence colorless green ideas sleep furiously from the ungrammatical permutation furiously sleep ideas green colorless. Under contract with MIT Press, shared under CC-BY-NC-ND license. 128 CHAPTER6. LANGUAGEMODELS This means that the probability of a sentence w can be approximated as 􏰙M m=1 To compute the probability of an entire sentence, it is convenient to pad the beginning and end with special symbols 􏱑 and 􏱒. Then the bigram (n = 2) approximation to the probability of I like black coffee is: p(I like black coffee) = p(I | 􏱑) × p(like | I) × p(black | like) × p(coffee | black) × p(􏱒 | coffee). [6.11] This model requires estimating and storing the probability of only V n events, which is exponential in the order of the n-gram, and not V M , which is exponential in the length of the sentence. The n-gram probabilities can be computed by relative frequency estimation, count(wm−2, wm−1, wm) p(wm | wm−1, wm−2) = 􏰀w′ count(wm−2, wm−1, w′) [6.12] The hyperparameter n controls the size of the context used in each conditional proba- bility. If this is misspecified, the language model will perform poorly. Let’s consider the potential problems concretely. When n is too small. Consider the following sentences: (6.3) Gorillas always like to groom their friends. (6.4) The computer that’s on the 3rd floor of our office building crashed. In each example, the words written in bold depend on each other: the likelihood of their depends on knowing that gorillas is plural, and the likelihood of crashed de- pends on knowing that the subject is a computer. If the n-grams are not big enough to capture this context, then the resulting language model would offer probabili- ties that are too low for these sentences, and too high for sentences that fail basic linguistic tests like number agreement. When n is too big. In this case, it is hard good estimates of the n-gram parameters from our dataset, because of data sparsity. To handle the gorilla example, it is necessary to model 6-grams, which means accounting for V 6 events. Under a very small vocab- ulary of V = 104, this means estimating the probability of 1024 distinct events. Jacob Eisenstein. Draft of October 15, 2018. p(w1,...,wM) ≈ p(wm | wm−1,...,wm−n+1) [6.10] 6.2. SMOOTHINGANDDISCOUNTING 129 These two problems point to another bias-variance tradeoff (see subsection 2.2.4). A small n-gram size introduces high bias, and a large n-gram size introduces high variance. We can even have both problems at the same time! Language is full of long-range depen- dencies that we cannot capture because n is too small; at the same time, language datasets are full of rare phenomena, whose probabilities we fail to estimate accurately because n is too large. One solution is to try to keep n large, while still making low-variance esti- mates of the underlying parameters. To do this, we will introduce a different sort of bias: smoothing. 6.2 Smoothing and discounting Limited data is a persistent problem in estimating language models. In section 6.1, we presented n-grams as a partial solution. Bit sparse data can be a problem even for low- order n-grams; at the same time, many linguistic phenomena, like subject-verb agreement, cannot be incorporated into language models without high-order n-grams. It is therefore necessary to add additional inductive biases to n-gram language models. This section covers some of the most intuitive and common approaches, but there are many more (see Chen and Goodman, 1999). 6.2.1 Smoothing A major concern in language modeling is to avoid the situation p(w) = 0, which could arise as a result of a single unseen n-gram. A similar problem arose in Na ̈ıve Bayes, and the solution was smoothing: adding imaginary “pseudo” counts. The same idea can be applied to n-gram language models, as shown here in the bigram case, (wm | wm−1) = 􏰀 count(wm−1, wm) + α . [6.13] smooth w′∈V count(wm−1, w′) + V α p This basic framework is called Lidstone smoothing, but special cases have other names: • Laplace smoothing corresponds to the case α = 1. • Jeffreys-Perks law corresponds to the case α = 0.5, which works well in practice andbenefitsfromsometheoreticaljustification(ManningandSchu ̈tze,1999). To ensure that the probabilities are properly normalized, anything that we add to the numerator (α) must also appear in the denominator (V α). This idea is reflected in the concept of effective counts: c∗i =(ci+α) M , [6.14] M+Vα Under contract with MIT Press, shared under CC-BY-NC-ND license. 130 CHAPTER6. LANGUAGEMODELS Lidstone smoothing, α = 0.1 Discounting, d = 0.1 counts impropriety 8 0.4 offense 5 0.25 damage 4 0.2 deficiencies 2 0.1 outbreak 1 0.05 infirmity 0 0 cephalopods 0 0 7.826 0.391 7.9 4.928 0.246 4.9 3.961 0.198 3.9 2.029 0.101 1.9 1.063 0.053 0.9 0.097 0.005 0.25 0.097 0.005 0.25 unsmoothed effective smoothed effective probability counts probability counts smoothed probability 0.395 0.245 0.195 0.095 0.045 0.013 0.013 Table 6.1: Example of Lidstone smoothing and absolute discounting in a bigram language model, for the context (alleged, ), for a toy corpus with a total of twenty counts over the seven words shown. Note that discounting decreases the probability for all but the un- seen words, while Lidstone smoothing increases the effective counts and probabilities for deficiencies and outbreak. where ci is the count of event i, c∗i is the effective count, and M = 􏰀Vi=1 ci is the total num- ber of tokens in the dataset (w1, w2, . . . , wM ). This term ensures that 􏰀Vi=1 c∗i = 􏰀Vi=1 ci = M. The discount for each n-gram is then computed as, d i = c ∗i = ( c i + α ) M . ci ci (M + V α) 6.2.2 Discounting and backoff Discounting “borrows” probability mass from observed n-grams and redistributes it. In Lidstone smoothing, the borrowing is done by increasing the denominator of the relative frequency estimates. The borrowed probability mass is then redistributed by increasing the numerator for all n-grams. Another approach would be to borrow the same amount of probability mass from all observed n-grams, and redistribute it among only the unob- served n-grams. This is called absolute discounting. For example, suppose we set an absolute discount d = 0.1 in a bigram model, and then redistribute this probability mass equally over the unseen words. The resulting probabilities are shown in Table 6.1. Discounting reserves some probability mass from the observed data, and we need not redistribute this probability mass equally. Instead, we can backoff to a lower-order lan- guage model: if you have trigrams, use trigrams; if you don’t have trigrams, use bigrams; if you don’t even have bigrams, use unigrams. This is called Katz backoff. In the simple Jacob Eisenstein. Draft of October 15, 2018. 6.2. SMOOTHINGANDDISCOUNTING 131 case of backing off from bigrams to unigrams, the bigram probabilities are, c∗(i, j) =c(i, j) − d [6.15] [6.16] c∗ (i,j ) c(j ) pKatz(i | j) = punigram(i) if c(i, j) > 0 if c(i, j) = 0.
α(j) × 􏰀i′:c(i′,j)=0 punigram(i′)
The term α(j) indicates the amount of probability mass that has been discounted for context j. This probability mass is then divided across all the unseen events, {i′ : c(i′, j) = 0}, proportional to the unigram probability of each word i′. The discount parameter d can be optimized to maximize performance (typically held-out log-likelihood) on a develop- ment set.
6.2.3 *Interpolation
Backoff is one way to combine different order n-gram models. An alternative approach is interpolation: setting the probability of a word in context to a weighted sum of its probabilities across progressively shorter contexts.
Instead of choosing a single n for the size of the n-gram, we can take the weighted average across several n-gram probabilities. For example, for an interpolated trigram model,
pInterpolation(wm | wm−1, wm−2) = λ3p∗3(wm | wm−1, wm−2) + λ2p∗2(wm | wm−1)
+ λ1p∗1(wm).
In this equation, p∗n is the unsmoothed empirical probability given by an n-gram lan- guage model, and λn is the weight assigned to this model. To ensure that the interpolated p􏰀(w) is still a valid probability distribution, the values of λ must obey the constraint,
nmax λn = 1. But how to find the specific values? n=1
An elegant solution is expectation-maximization. Recall from chapter 5 that we can think about EM as learning with missing data: we just need to choose missing data such that learning would be easy if it weren’t missing. What’s missing in this case? Think of each word wm as drawn from an n-gram of unknown size, zm ∈ {1 . . . nmax}. This zm is the missing data that we are looking for. Therefore, the application of EM to this problem involves the following generative model:
for Each token wm,m = 1,2,…,M do:
draw the n-gram size zm ∼ Categorical(λ); draw wm ∼ p∗zm(wm | wm−1,…,wm−zm).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

132 CHAPTER6. LANGUAGEMODELS If the missing data {Zm} were known, then λ could be estimated as the relative fre-
quency,
λz =count(Zm = z) M
􏰁M m=1
[6.17] [6.18]
∝
δ(Zm = z).
But since we do not know the values of the latent variables Zm, we impute a distribution qm in the E-step, which represents the degree of belief that word token wm was generated from a n-gram of order zm,
qm(z) 􏱕 Pr(Zm = z | w1:m; λ)
=􏰀p(wm |w1:m−1,Zm =z)×p(z)
z′ p(wm | w1:m−1, Zm = z′) × p(z′) ∝p∗z(wm | w1:m−1) × λz.
In the M-step, λ is computed by summing the expected counts under q,
is shown in Algorithm 10.
Algorithm 10 Expectation-maximization for interpolated language modeling
[6.19] [6.20] [6.21]
􏰁M m=1
λz ∝
A solution is obtained by iterating between updates to q and λ. The complete algorithm
qm(z).
[6.22]
◃ Initialization
◃ E-step
◃ M-step
Jacob Eisenstein. Draft of October 15, 2018.
1: 2: 3:
4: 5: 6: 7:
8:
9: 10:
11: 12:
procedureESTIMATEINTERPOLATEDn-GRAM(w1:M,{p∗n}n∈1:nmax) for z ∈ {1, 2, . . . , nmax} do
λz← 1 nmax
repeat
for m ∈ {1,2,…,M} do
for z ∈ {1,2,…,nmax} do
qm(z) ← p∗z(wm | w1:m−) × λz
qm ← Normalize(qm) for z ∈ {1,2,􏰀…,nmax} do
λz←1 M qm(z) M m=1
until tired return λ

6.3. RECURRENTNEURALNETWORKLANGUAGEMODELS 133 6.2.4 *Kneser-Ney smoothing
Kneser-Ney smoothing is based on absolute discounting, but it redistributes the result- ing probability mass in a different way from Katz backoff. Empirical evidence points to Kneser-Ney smoothing as the state-of-art for n-gram language modeling (Goodman, 2001). To motivate Kneser-Ney smoothing, consider the example: I recently visited . Which of the following is more likely: Francisco or Duluth?
Now suppose that both bigrams visited Duluth and visited Francisco are unobserved in
the training data, and furthermore, that the unigram probability p∗1(Francisco) is greater than p∗(Duluth). Nonetheless we would still guess that p(visited Duluth) > p(visited Francisco), because Duluth is a more “versatile” word: it can occur in many contexts, while Francisco usually occurs in a single context, following the word San. This notion of versatility is the
key to Kneser-Ney smoothing.
Writing u for a context of undefined length, and count(w, u) as the count of word w in context u, we define the Kneser-Ney bigram probability as
􏰅 max(count(w,u)−d,0) , pKN (w | u) = count(u)
count(w, u) > 0
[6.23]
[6.24]
α(u) × pcontinuation(w), continuation w′∈V |u′ : count(w′, u′) > 0|
otherwise p (w) =􏰀 |u : count(w,u) > 0| .
Probability mass using absolute discounting d, which is taken from all unobserved n-grams. The total amount of discounting in context u is d × |w : count(w, u) > 0|, and we divide this probability mass among the unseen n-grams. To account for versatility, we define the continuation probability pcontinuation(w) as proportional to the number of ob- served contexts in which w appears. The numerator of the continuation probability is the number of contexts u in which w appears; the denominator normalizes the probability by summing the same quantity over all words w′. The coefficient α(u) is set to ensure that the probability distribution pKN (w | u) sums to one over the vocabulary w.
The idea of modeling versatility by counting contexts may seem heuristic, but there is an elegant theoretical justification from Bayesian nonparametrics (Teh, 2006). Kneser-Ney smoothing on n-grams was the dominant language modeling technique before the arrival of neural language models.
6.3 Recurrent neural network language models
N -gram language models have been largely supplanted by neural networks. These mod- els do not make the n-gram assumption of restricted context; indeed, they can incorporate arbitrarily distant contextual information, while remaining computationally and statisti- cally tractable.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

134 CHAPTER6. LANGUAGEMODELS
h0 h1 h2 h3 ··· x1 x2 x3 ··· w1 w2 w3 ···
Figure 6.1: The recurrent neural network language model, viewed as an “unrolled” com- putation graph. Solid lines indicate direct computation, dotted blue lines indicate proba- bilistic dependencies, circles indicate random variables, and squares indicate computation nodes.
The first insight behind neural language models is to treat word prediction as a dis- criminative learning task.2 The goal is to compute the probability p(w | u), where w ∈ V is a word, and u is the context, which depends on the previous words. Rather than directly estimating the word probabilities from (smoothed) relative frequencies, we can treat treat language modeling as a machine learning problem, and estimate parameters that maxi- mize the log conditional probability of a corpus.
The second insight is to reparametrize the probability distribution p(w | u) as a func- tion of two dense K-dimensional numerical vectors, βw ∈ RK, and vu ∈ RK,
p(w|u)= 􏰀 exp(βw ·vu) , [6.25] w′∈V exp(βw′ · vu)
where βw · vu represents a dot product. As usual, the denominator ensures that the prob- ability distribution is properly normalized. This vector of probabilities is equivalent to applying the softmax transformation (see section 3.1) to the vector of dot-products,
p(·|u)=SoftMax([β1 ·vu,β2 ·vu,…,βV ·vu]). [6.26]
The word vectors βw are parameters of the model, and are estimated directly. The context vectors vu can be computed in various ways, depending on the model. A simple but effective neural language model can be built from a recurrent neural network (RNN; Mikolov et al., 2010). The basic idea is to recurrently update the context vectors while moving through the sequence. Let hm represent the contextual information at position m
2This idea predates neural language models (e.g., Rosenfeld, 1996; Roark et al., 2007).
Jacob Eisenstein. Draft of October 15, 2018.

6.3. RECURRENTNEURALNETWORKLANGUAGEMODELS
135
in the sequence. RNN language models are defined,
xm 􏱕φwm
hm =RNN(xm, hm−1)
[6.27] [6.28]
[6.29]
exp(βwm+1 · hm) p(wm+1 | w1,w2,…,wm) =􏰀 ,
where φ is a matrix of word embeddings, and xm denotes the embedding for word wm. The conversion of wm to xm is sometimes known as a lookup layer, because we simply lookup the embeddings for each word in a table; see subsection 3.2.4.
The Elman unit defines a simple recurrent operation (Elman, 1990),
RNN(xm, hm−1) 􏱕 g(Θhm−1 + xm), [6.30]
where Θ ∈ RK×K is the recurrence matrix and g is a non-linear transformation function, often defined as the elementwise hyperbolic tangent tanh (see section 3.1).3 The tanh acts as a squashing function, ensuring that each element of hm is constrained to the range [−1, 1].
Although each wm depends on only the context vector hm−1, this vector is in turn influenced by all previous tokens, w1, w2, . . . wm−1, through the recurrence operation: w1 affects h1, which affects h2, and so on, until the information is propagated all the way to hm−1, and then on to wm (see Figure 6.1). This is an important distinction from n-gram language models, where any information outside the n-word window is ignored. In prin- ciple, the RNN language model can handle long-range dependencies, such as number agreement over long spans of text — although it would be difficult to know where exactly in the vector hm this information is represented. The main limitation is that informa- tion is attenuated by repeated application of the squashing function g. Long short-term memories (LSTMs), described below, are a variant of RNNs that address this issue, us- ing memory cells to propagate information through the sequence without applying non- linearities (Hochreiter and Schmidhuber, 1997).
The denominator in Equation 6.29 is a computational bottleneck, because it involves a sum over the entire vocabulary. One solution is to use a hierarchical softmax function, which computes the sum more efficiently by organizing the vocabulary into a tree (Mikolov et al., 2011). Another strategy is to optimize an alternative metric, such as noise-contrastive estimation (Gutmann and Hyva ̈rinen, 2012), which learns by distinguishing observed in- stances from artificial instances generated from a noise distribution (Mnih and Teh, 2012). Both of these strategies are described in subsection 14.5.3.
3In the original Elman network, the sigmoid function was used in place of tanh. For an illuminating mathematical discussion of the advantages and disadvantages of various nonlinearities in recurrent neural networks, see the lecture notes from Cho (2015).
Under contract with MIT Press, shared under CC-BY-NC-ND license.
w′∈V exp(βw′ · hm)

136 CHAPTER6. LANGUAGEMODELS 6.3.1 Backpropagation through time
The recurrent neural network language model has the following parameters:
• φi ∈ RK , the “input” word vectors (these are sometimes called word embeddings,
since each word is embedded in a K-dimensional space; see chapter 14); • βi ∈ RK , the “output” word vectors;
• Θ ∈ RK×K, the recurrence operator;
• h0, the initial state.
Each of these parameters can be estimated by formulating an objective function over the training corpus, L(w), and then applying backpropagation to obtain gradients on the parameters from a minibatch of training examples (see subsection 3.3.1). Gradient-based updates can be computed from an online learning algorithm such as stochastic gradient descent (see subsection 2.6.2).
The application of backpropagation to recurrent neural networks is known as back- propagation through time, because the gradients on units at time m depend in turn on the gradients of units at earlier times n < m. Let lm+1 represent the negative log-likelihood of word m + 1, lm+1 = −logp(wm+1 | w1,w2,...,wm). [6.31] We require the gradient of this loss with respect to each parameter, such as θk,k′ , an indi- vidual element in the recurrence matrix Θ. Since the loss depends on the parameters only through hm, we can apply the chain rule of differentiation, ∂lm+1 =∂lm+1 ∂hm . [6.32] ∂θk,k′ ∂hm ∂θk,k′ The vector hm depends on Θ in several ways. First, hm is computed by multiplying Θ by the previous state hm−1. But the previous state hm−1 also depends on Θ: hm =g(xm,hm−1) [6.33] ∂hm,k =g′(xm,k + θk · hm−1)(hm−1,k′ + θk · ∂hm−1 ), [6.34] ∂θk,k′ ∂θk,k′ where g′ is the local derivative of the nonlinear function g. The key point in this equation is that the derivative ∂hm depends on ∂hm−1 , which will depend in turn on ∂hm−2 , and ∂θk,k′ so on, until reaching the initial state h0. ∂θk,k′ ∂θk,k′ Each derivative ∂hm will be reused many times: it appears in backpropagation from ∂θk,k′ the loss lm, but also in all subsequent losses ln>m. Neural network toolkits such as
Torch (Collobert et al., 2011) and DyNet (Neubig et al., 2017) compute the necessary Jacob Eisenstein. Draft of October 15, 2018.

6.3. RECURRENTNEURALNETWORKLANGUAGEMODELS 137
derivatives automatically, and cache them for future use. An important distinction from the feedforward neural networks considered in chapter 3 is that the size of the computa- tion graph is not fixed, but varies with the length of the input. This poses difficulties for toolkits that are designed around static computation graphs, such as TensorFlow (Abadi et al., 2016).4
6.3.2 Hyperparameters
The RNN language model has several hyperparameters that must be tuned to ensure good performance. The model capacity is controlled by the size of the word and context vectors K, which play a role that is somewhat analogous to the size of the n-gram context. For datasets that are large with respect to the vocabulary (i.e., there is a large token-to-type ratio), we can afford to estimate a model with a large K, which enables more subtle dis- tinctions between words and contexts. When the dataset is relatively small, then K must be smaller too, or else the model may “memorize” the training data, and fail to generalize. Unfortunately, this general advice has not yet been formalized into any concrete formula for choosing K, and trial-and-error is still necessary. Overfitting can also be prevented by dropout, which involves randomly setting some elements of the computation to zero (Sri- vastava et al., 2014), forcing the learner not to rely too much on any particular dimension of the word or context vectors. The dropout rate must also be tuned on development data.
6.3.3 Gated recurrent neural networks
In principle, recurrent neural networks can propagate information across infinitely long sequences. But in practice, repeated applications of the nonlinear recurrence function causes this information to be quickly attenuated. The same problem affects learning: back- propagation can lead to vanishing gradients that decay to zero, or exploding gradients that increase towards infinity (Bengio et al., 1994). The exploding gradient problem can be addressed by clipping gradients at some maximum value (Pascanu et al., 2013). The other issues must be addressed by altering the model itself.
The long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) is a popular variant of RNNs that is more robust to these problems. This model augments the hidden state hm with a memory cell cm. The value of the memory cell at each time m is a gated sum of two quantities: its previous value cm−1, and an “update” c ̃m, which is computed from the current input xm and the previous hidden state hm−1. The next state hm is then computed from the memory cell. Because the memory cell is not passed through a non- linear squashing function during the update, it is possible for information to propagate through the network over long distances.
4See https://www.tensorflow.org/tutorials/recurrent (retrieved Feb 8, 2018).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

138
CHAPTER6. LANGUAGEMODELS
hm om
cm fm+1 im
c ̃m
xm
hm+1 om+1 cm+1
im+1
c ̃m+1
xm+1
Figure 6.2: The long short-term memory (LSTM) architecture. Gates are shown in boxes with dotted edges. In an LSTM language model, each hm would be used to predict the next word wm+1.
The gates are functions of the input and previous hidden state. They are computed from elementwise sigmoid activations, σ(x) = (1 + exp(−x))−1, ensuring that their values will be in the range [0, 1]. They can therefore be viewed as soft, differentiable logic gates. The LSTM architecture is shown in Figure 6.2, and the complete update equations are:
fm+1 =σ(Θ(h→f)hm + Θ(x→f)xm+1 + bf ) im+1 =σ(Θ(h→i)hm + Θ(x→i)xm+1 + bi)
c ̃m+1 = tanh(Θ(h→c)hm + Θ(w→c)xm+1) cm+1 =fm+1 ⊙ cm + im+1 ⊙ c ̃m+1
om+1 =σ(Θ(h→o)hm + Θ(x→o)xm+1 + bo) hm+1 =om+1 ⊙ tanh(cm+1)
forget gate input gate
update candidate memory cell update output gate
output.
[6.35] [6.36]
[6.37] [6.38] [6.39] [6.40]
The operator ⊙ is an elementwise (Hadamard) product. Each gate is controlled by a vec- tor of weights, which parametrize the previous hidden state (e.g., Θ(h→f)) and the current input (e.g., Θ(x→f)), plus a vector offset (e.g., bf). The overall operation can be infor- mally summarized as (hm, cm) = LSTM(xm, (hm−1, cm−1)), with (hm, cm) representing the LSTM state after reading token m.
The LSTM outperforms standard recurrent neural networks across a wide range of problems. It was first used for language modeling by Sundermeyer et al. (2012), but can be applied more generally: the vector hm can be treated as a complete representation of
Jacob Eisenstein. Draft of October 15, 2018.

6.4. EVALUATINGLANGUAGEMODELS 139 the input sequence up to position m, and can be used for any labeling task on a sequence
of tokens, as we will see in the next chapter.
There are several LSTM variants, of which the Gated Recurrent Unit (Cho et al., 2014) is one of the more well known. Many software packages implement a variety of RNN architectures, so choosing between them is simple from a user’s perspective. Jozefowicz et al. (2015) provide an empirical comparison of various modeling choices circa 2015.
6.4 Evaluating language models
Language modeling is not usually an application in itself: language models are typically components of larger systems, and they would ideally be evaluated extrinisically. This means evaluating whether the language model improves performance on the application task, such as machine translation or speech recognition. But this is often hard to do, and depends on details of the overall system which may be irrelevant to language modeling. In contrast, intrinsic evaluation is task-neutral. Better performance on intrinsic metrics may be expected to improve extrinsic metrics across a variety of tasks, but there is always the risk of over-optimizing the intrinsic metric. This section discusses some intrinsic met- rics, but keep in mind the importance of performing extrinsic evaluations to ensure that intrinsic performance gains carry over to real applications.
6.4.1 Held-out likelihood
The goal of probabilistic language models is to accurately measure the probability of se- quences of word tokens. Therefore, an intrinsic evaluation metric is the likelihood that the language model assigns to held-out data, which is not used during training. Specifically, we compute,
Typically, unknown words are mapped to the ⟨UNK⟩ token. This means that we have to estimate some probability for ⟨UNK⟩ on the training data. One way to do this is to fix the vocabulary V to the V − 1 words with the highest counts in the training data, and then convert all other tokens to ⟨UNK⟩. Other strategies for dealing with out-of-vocabulary terms are discussed in section 6.5.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
􏰁M m=1
logp(wm | wm−1,…,w1), [6.41] treating the entire held-out corpus as a single stream of tokens.
l(w) =

140 CHAPTER6. LANGUAGEMODELS 6.4.2 Perplexity
Held-out likelihood is usually presented as perplexity, which is a deterministic transfor- mation of the log-likelihood into an information-theoretic quantity,
M
Perplex(w) = 2− l(w) , [6.42]
where M is the total number of tokens in the held-out corpus.
Lower perplexities correspond to higher likelihoods, so lower scores are better on this
metric — it is better to be less perplexed. Here are some special cases:
• In the limit of a perfect language model, probability 1 is assigned to the held-out
− 1 log 1 0 corpus, with Perplex(w) = 2 M 2 = 2 = 1.
• In the opposite limit, probability zero is assigned to the held-out corpus, which cor- − 1 log 0 ∞
responds to an infinite perplexity, Perplex(w) = 2 M
• Assume a uniform, unigram model in which p(wi) = 1 for all words in the vocab-
ulary. Then,
=V.
This is the “worst reasonable case” scenario, since you could build such a language
model without even looking at the data.
In practice, language models tend to give perplexities in the range between 1 and V . A small benchmark dataset is the Penn Treebank, which contains roughly a million to- kens; its vocabulary is limited to 10,000 words, with all other tokens mapped a special ⟨UNK⟩ symbol. On this dataset, a well-smoothed 5-gram model achieves a perplexity of 141 (Mikolov and Zweig, Mikolov and Zweig), and an LSTM language model achieves perplexity of roughly 80 (Zaremba, Sutskever, and Vinyals, Zaremba et al.). Various en- hancements to the LSTM architecture can bring the perplexity below 60 (Merity et al., 2018). A larger-scale language modeling dataset is the 1B Word Benchmark (Chelba et al., 2013), which contains text from Wikipedia. On this dataset, a perplexities of around 25 can be obtained by averaging together multiple LSTM language models (Jozefowicz et al., 2016).
Jacob Eisenstein. Draft of October 15, 2018.
􏰁M 1 􏰁M
log2(w)= log2 V =− log2V =−Mlog2V
V
2 = 2 = ∞.
Perplex(w) =2 1 M log V =2log2 V
m=1 m=1 M2

6.5. OUT-OF-VOCABULARYWORDS 141 6.5 Out-of-vocabulary words
So far, we have assumed a closed-vocabulary setting — the vocabulary V is assumed to be a finite set. In realistic application scenarios, this assumption may not hold. Consider, for example, the problem of translating newspaper articles. The following sentence appeared in a Reuters article on January 6, 2017:5
The report said U.S. intelligence agencies believe Russian military intelligence, the GRU, used intermediaries such as WikiLeaks, DCLeaks.com and the Guc- cifer 2.0 ”persona” to release emails…
Suppose that you trained a language model on the Gigaword corpus,6 which was released in 2003. The bolded terms either did not exist at this date, or were not widely known; they are unlikely to be in the vocabulary. The same problem can occur for a variety of other terms: new technologies, previously unknown individuals, new words (e.g., hashtag), and numbers.
One solution is to simply mark all such terms with a special token, ⟨UNK⟩. While training the language model, we decide in advance on the vocabulary (often the K most common terms), and mark all other terms in the training data as ⟨UNK⟩. If we do not want to determine the vocabulary size in advance, an alternative approach is to simply mark the first occurrence of each word type as ⟨UNK⟩.
But is often better to make distinctions about the likelihood of various unknown words. This is particularly important in languages that have rich morphological systems, with many inflections for each word. For example, Portuguese is only moderately complex from a morphological perspective, yet each verb has dozens of inflected forms (see Fig- ure 4.3b). In such languages, there will be many word types that we do not encounter in a corpus, which are nonetheless predictable from the morphological rules of the language. To use a somewhat contrived English example, if transfenestrate is in the vocabulary, our language model should assign a non-zero probability to the past tense transfenestrated, even if it does not appear in the training data.
One way to accomplish this is to supplement word-level language models with character- level language models. Such models can use n-gramsor RNNs, but with a fixed vocab- ulary equal to the set of ASCII or Unicode characters. For example, Ling et al. (2015) propose an LSTM model over characters, and Kim (2014) employ a convolutional neural network. A more linguistically motivated approach is to segment words into meaningful subword units, known as morphemes (see chapter 9). For example, Botha and Blunsom
5 Bayoumy, Y. and Strobel, W. (2017, January 6). U.S. intel report: Putin directed cy- ber campaign to help Trump. Reuters. Retrieved from http://www.reuters.com/article/ us-usa-russia-cyber-idUSKBN14Q1T8 on January 7, 2017.
6 https://catalog.ldc.upenn.edu/LDC2003T05
Under contract with MIT Press, shared under CC-BY-NC-ND license.

142 CHAPTER6. LANGUAGEMODELS (2014) induce vector representations for morphemes, which they build into a log-bilinear
language model; Bhatia et al. (2016) incorporate morpheme vectors into an LSTM.
Additional resources
A variety of neural network architectures have been applied to language modeling. No- table earlier non-recurrent architectures include the neural probabilistic language model (Ben- gio et al., 2003) and the log-bilinear language model (Mnih and Hinton, 2007). Much more detail on these models can be found in the text by Goodfellow et al. (2016).
Exercises
1. Prove that n-gram language models give valid probabilities if the n-gram probabil- ities are valid. Specifically, assume that,
􏰁V
p(wm | wm−1, wm−2, . . . , wm−n+1) = 1 [6.43] wm
for all contexts (wm−1, wm−2, . . . , wm−n+1). Prove that 􏰀w pn(w) = 1 for all w ∈ V∗,
where pn is the probability under an n-gram language model. Your proof should
proceed by induction. You should handle the start-of-string case p(w1 | 􏱑, . . . , 􏱑),
but you need not handle the end-of-string token.
2. First, show that RNN language models are valid using a similar proof technique to the one in the previous problem.
Next, let pr(w) indicate the probability of w under RNN r. An ensemble of RNN language models computes the probability,
1 􏰁R
p(w) = R pr(w). [6.44]
r=1
Does an ensemble of RNN language models compute a valid probability?
3. Consider a unigram language model over a vocabulary of size V . Suppose that a word appears m times in a corpus with M tokens in total. With Lidstone smoothing of α, for what values of m is the smoothed probability greater than the unsmoothed probability?
4. Consider a simple language in which each token is drawn from the vocabulary V
with probability 1 , independent of all other tokens. V
Jacob Eisenstein. Draft of October 15, 2018.
􏰎 􏰍􏰌 􏰏
n−1

6.5. OUT-OF-VOCABULARYWORDS 143 Given a corpus of size M, what is the expectation of the fraction of all possible
bigrams that have zero count? You may assume V is large enough that 1 ≈ 1 . V V−1
5. Continuing the previous problem, determine the value of M such that the fraction of bigrams with zero count is at most ε ∈ (0, 1). As a hint, you may use the approxi- mation ln(1 + α) ≈ α for α ≈ 0.
6. Inreallanguages,wordsprobabilitiesareneitheruniformnorindependent.Assume
that word probabilities are independent but not uniform, so that in general p(w) ̸=
1 . Prove that the expected fraction of unseen bigrams will be higher than in the IID V
case.
7. Consider a recurrent neural network with a single hidden unit and a sigmoid acti-
vation, hm = σ(θhm−1 + xm). Prove that if |θ| < 1, then the gradient ∂hm goes to zero as k → ∞.7 8. Zipf’s law states that if the word types in a corpus are sorted by frequency, then the frequency of the word at rank r is proportional to r−s, where s is a free parameter, usually around 1. (Another way to view Zipf’s law is that a plot of log frequency against log rank will be linear.) Solve for s using the counts of the first and second most frequent words, c1 and c2. 9. Download the wikitext-2 dataset.8 Read in the training data and compute word counts. Estimate the Zipf’s law coefficient by, sˆ = exp 􏰑(log r) · (log c)􏰒 , [6.45] || log r||2 where r = [1,2,3,...] is the vector of ranks of all words in the corpus, and c = [c1, c2, c3, . . .] is the vector of counts of all words in the corpus, sorted in descending order. Make a log-log plot of􏰀the observed counts, and the expected counts according to Zipf’s law. The sum ∞r=1 rs = ζ(s) is the Riemann zeta function, available in python’s scipy library as scipy.special.zeta. 10. Using the Pytorch library, train an LSTM language model from the Wikitext train- ing corpus. After each epoch of training, compute its perplexity on the Wikitext validation corpus. Stop training when the perplexity stops improving. 7 This proof generalizes to vector hidden units by considering the largest eigenvector of the matrix Θ (Pas- canu et al., 2013). 8 Available at https://github.com/pytorch/examples/tree/master/word_language_ model/data/wikitext-2 in September 2018. The dataset is already tokenized, and already replaces rare words with ⟨UNK⟩, so no preprocessing is necessary. Under contract with MIT Press, shared under CC-BY-NC-ND license. ∂ hm−k Chapter 7 Sequence labeling The goal of sequence labeling is to assign tags to words, or more generally, to assign discrete labels to discrete elements in a sequence. There are many applications of se- quence labeling in natural language processing, and chapter 8 presents an overview. For now, we’ll focus on the classic problem of part-of-speech tagging, which requires tagging each word by its grammatical category. Coarse-grained grammatical categories include NOUNs, which describe things, properties, or ideas, and VERBs, which describe actions and events. Consider a simple input: (7.1) They can fish. A dictionary of coarse-grained part-of-speech tags might include NOUN as the only valid tag for they, but both NOUN and VERB as potential tags for can and fish. A accurate se- quence labeling algorithm should select the verb tag for both can and fish in (7.1), but it should select noun for the same two words in the phrase can of fish. 7.1 Sequence labeling as classification One way to solve a tagging problem is to turn it into a classification problem. Let f ((w, m), y) indicate the feature function for tag y at position m in the sequence w = (w1, w2, . . . , wM ). A simple tagging model would have a single base feature, the word itself: f ((w = they can fish, m = 1), N) =(they, N) f ((w = they can fish, m = 2), V) =(can, V) f ((w = they can fish, m = 3), V) =(fish, V). [7.1] [7.2] [7.3] Here the feature function takes three arguments as input: the sentence to be tagged (e.g., they can fish), the proposed tag (e.g., N or V), and the index of the token to which this tag 145 146 CHAPTER7. SEQUENCELABELING is applied. This simple feature function then returns a single feature: a tuple including the word to be tagged and the tag that has been proposed. If the vocabulary size is V and the number of tags is K, then there are V × K features. Each of these features must be assigned a weight. These weights can be learned from a labeled dataset using a clas- sification algorithm such as perceptron, but this isn’t necessary in this case: it would be equivalent to define the classification weights directly, with θw,y = 1 for the tag y most frequently associated with word w, and θw,y = 0 for all other tags. However, it is easy to see that this simple classification approach cannot correctly tag both they can fish and can of fish, because can and fish are grammatically ambiguous. To han- dle both of these cases, the tagger must rely on context, such as the surrounding words. We can build context into the feature set by incorporating the surrounding words as ad- ditional features: f((w = they can fish,1),N) = {(wm = they,ym = N), (wm−1 =􏱑,ym =N), (wm+1 = can, ym = N)} f((w = they can fish,2),V) = {(wm = can,ym = V), (wm−1 = they, ym = V), (wm+1 = fish, ym = V)} f((w = they can fish,3),V) = {(wm = fish,ym = V), (wm−1 = can, ym = V), (wm+1 = 􏱒, ym = V)}. [7.4] [7.5] [7.6] These features contain enough information that a tagger should be able to choose the right tag for the word fish: words that come after can are likely to be verbs, so the feature (wm−1 = can, ym = V) should have a large positive weight. However, even with this enhanced feature set, it may be difficult to tag some se- quences correctly. One reason is that there are often relationships between the tags them- selves. For example, in English it is relatively rare for a verb to follow another verb — particularly if we differentiate MODAL verbs like can and should from more typical verbs, like give, transcend, and befuddle. We would like to incorporate preferences against tag se- quences like VERB-VERB, and in favor of tag sequences like NOUN-VERB. The need for such preferences is best illustrated by a garden path sentence: (7.2) The old man the boat. Grammatically, the word the is a DETERMINER. When you read the sentence, what part of speech did you first assign to old? Typically, this word is an ADJECTIVE — abbrevi- ated as J — which is a class of words that modify nouns. Similarly, man is usually a noun. The resulting sequence of tags is D J N D N. But this is a mistaken “garden path” inter- pretation, which ends up leading nowhere. It is unlikely that a determiner would directly Jacob Eisenstein. Draft of October 15, 2018. 7.2. SEQUENCELABELINGASSTRUCTUREPREDICTION 147 follow a noun,1 and it is particularly unlikely that the entire sentence would lack a verb. The only possible verb in (7.2) is the word man, which can refer to the act of maintaining and piloting something — often boats. But if man is tagged as a verb, then old is seated between a determiner and a verb, and must be a noun. And indeed, adjectives often have a second interpretation as nouns when used in this way (e.g., the young, the restless). This reasoning, in which the labeling decisions are intertwined, cannot be applied in a setting where each tag is produced by an independent classification decision. 7.2 Sequence labeling as structure prediction As an alternative, think of the entire sequence of tags as a label itself. For a given sequence of words w = (w1,w2,...,wM), there is a set of possible taggings Y(w) = YM, where Y = {N, V, D, . . .} refers to the set of individual tags, and YM refers to the set of tag sequences of length M . We can then treat the sequence labeling problem as a classification problem in the label space Y(w), yˆ = argmax Ψ(w, y), [7.7] y∈Y(w) where y = (y1,y2,...,yM) is a sequence of M tags, and Ψ is a scoring function on pairs of sequences, V M × YM → R. Such a function can include features that capture the rela- tionships between tagging decisions, such as the preference that determiners not follow nouns, or that all sentences have verbs. Given that the label space is exponentially large in the length of the sequence M, can it ever be practical to perform tagging in this way? The problem of making a series of in- terconnected labeling decisions is known as inference. Because natural language is full of interrelated grammatical structures, inference is a crucial aspect of natural language pro- cessing. In English, it is not unusual to have sentences of length M = 20; part-of-speech tag sets vary in size from 10 to several hundred. Taking the low end of this range, we have |Y(w1:M)| ≈ 1020, one hundred billion billion possible tag sequences. Enumerating and scoring each of these sequences would require an amount of work that is exponential in the sequence length, so inference is intractable. However, the situation changes when we restrict the scoring function. Suppose we choose a function that decomposes into a sum of local parts, M􏰁+1 m=1 where each ψ(·) scores a local part of the tag sequence. Note that the sum goes up to M +1, so that we can include a score for a special end-of-sequence tag, ψ(w1:M , 􏱓, yM , M + 1). We also define a special the tag to begin the sequence, y0 􏱕 ♦. 1The main exception occurs with ditransitive verbs, such as They gave the winner a trophy. Under contract with MIT Press, shared under CC-BY-NC-ND license. Ψ(w, y) = ψ(w, ym, ym−1, m), [7.8] 148 CHAPTER7. SEQUENCELABELING In a linear model, local scoring function can be defined as a dot product of weights and features, ψ(w1:M , ym, ym−1, m) = θ · f(w, ym, ym−1, m). [7.9] The feature vector f can consider the entire input w, and can look at pairs of adjacent tags. This is a step up from per-token classification: the weights can assign low scores to infelicitous tag pairs, such as noun-determiner, and high scores for frequent tag pairs, such as determiner-noun and noun-verb. In the example they can fish, a minimal feature function would include features for word-tag pairs (sometimes called emission features) and tag-tag pairs (sometimes called transition features): f(w = they can fish,y = N V V ) = f(w,ym,ym−1,m) [7.10] [7.11] M􏰁+1 m=1 =f(w,N,♦,1) +f(w,V,N,2) +f(w,V,V,3) + f (w, 􏱓, V, 4) =(wm =they,ym =N)+(ym =N,ym−1 =♦) +(wm =can,ym =V)+(ym =V,ym−1 =N) +(wm =fish,ym =V)+(ym =V,ym−1 =V) + (ym = 􏱓, ym−1 = V). [7.12] There are seven active features for this example: one for each word-tag pair, and one for each tag-tag pair, including a final tag yM +1 = 􏱓. These features capture the two main sources of information for part-of-speech tagging in English: which tags are appropriate for each word, and which tags tend to follow each other in sequence. Given appropriate weights for these features, taggers can achieve high accuracy, even for difficult cases like the old man the boat. We will now discuss how this restricted scoring function enables efficient inference, through the Viterbi algorithm (Viterbi, 1967). Jacob Eisenstein. Draft of October 15, 2018. 7.3. THEVITERBIALGORITHM 149 7.3 The Viterbi algorithm By decomposing the scoring function into a sum of local parts, it is possible to rewrite the tagging problem as follows: yˆ = argmax Ψ(w, y) y∈Y(w) [7.13] [7.14] [7.15] [7.16] = argmax = argmax M􏰁+1 m=1 ψ(w, ym, ym−1, m) sm(ym, ym−1), y1:M where the final line simplifies the notation with the shorthand, sm(ym, ym−1) 􏱕 ψ(w1:M , ym, ym−1, m). y1:M M􏰁+1 m=1 This inference problem can be solved efficiently using dynamic programming, an al- gorithmic technique for reusing work in recurrent computations. We begin by solving an auxiliary problem: rather than finding the best tag sequence, we compute the score of the best tag sequence, M􏰁+1 max Ψ(w, y1:M ) = max sm(ym, ym−1). [7.17] y1:M y1:M m=1 This score involves a maximization over all tag sequences of length M, written maxy1:M . This maximization can be broken into two pieces, M􏰁+1 maxΨ(w,y1:M)=max max sm(ym,ym−1). [7.18] y1:M yM y1:M−1 m=1 Within the sum, only the final term sM+1(􏱓,yM) depends on yM, so we can pull this term out of the second maximization, 􏰑􏰒 max Ψ(w, y1:M ) = max sM+1(􏱓, yM ) + y1:M yM The second term in Equation 7.19 has the same form as our original problem, with M replaced by M −1. This indicates that the problem can be reformulated as a recurrence. We do this by defining an auxiliary variable called the Viterbi variable vm(k), representing Under contract with MIT Press, shared under CC-BY-NC-ND license. 􏰝􏰁M 􏰞 max sm(ym, ym−1) . [7.19] y1:M−1 m=1 150 CHAPTER7. SEQUENCELABELING Algorithm 11 The Viterbi algorithm. Each sm(k,k′) is a local score for tag ym = k and ym−1 = k′. for k ∈ {0,...K} do v1(k) = s1(k, ♦) for m ∈ {2,...,M} do for k ∈ {0,...,K} do vm(k) = maxk′ sm(k, k′) + vm−1(k′) bm(k) = argmaxk′ sm(k, k′) + vm−1(k′) yM =argmaxksM+1(􏱓,k)+vM(k) for m ∈ {M − 1, . . . 1} do ym = bm(ym+1) return y1:M the score of the best sequence terminating in the tag k: 􏰁m vm(ym) 􏱕 max sn(yn, yn−1) y1:m−1 n=1 m􏰁−1 ym−1 y1:m−2 n=1 [7.20] [7.21] [7.22] =maxsm(ym,ym−1)+ max = max sm(ym, ym−1) + vm−1(ym−1). ym−1 Each set of Viterbi variables is computed from the local score sm(ym, ym−1), and from the previous set of Viterbi variables. The initial condition of the recurrence is simply the score for the first tag, v1(y1) 􏱕s1(y1, ♦). [7.23] The maximum overall score for the sequence is then the final Viterbi variable, max Ψ(w1:M , y1:M ) =vM +1 (􏱓). [7.24] y1:M Thus, the score of the best labeling for the sequence can be computed in a single forward sweep: first compute all variables v1(·) from Equation 7.23, and then compute all variables v2(·) from the recurrence in Equation 7.22, continuing until the final variable vM+1(􏱓). The Viterbi variables can be arranged in a structure known as a trellis, shown in Fig- ure 7.1. Each column indexes a token m in the sequence, and each row indexes a tag in Y; every vm−1(k) is connected to every vm(k′), indicating that vm(k′) is computed from vm−1(k). Special nodes are set aside for the start and end states. Jacob Eisenstein. Draft of October 15, 2018. sn(yn,yn−1) 7.3. THEVITERBIALGORITHM 151 0 they can fish -10 N -3 -9 -9 V -12 -5 -11 Figure 7.1: The trellis representation of the Viterbi variables, for the example they can fish, using the weights shown in Table 7.1. The original goal was to find the best scoring sequence, not simply to compute its score. But by solving the auxiliary problem, we are almost there. Recall that each vm(k) represents the score of the best tag sequence ending in that tag k in position m. To compute this, we maximize over possible values of ym−1. By keeping track of the “argmax” tag that maximizes this choice at each step, we can walk backwards from the final tag, and recover the optimal tag sequence. This is indicated in Figure 7.1 by the thick lines, which we trace back from the final position. These backward pointers are written bm(k), indicating the optimal tag ym−1 on the path to Ym = k. The complete Viterbi algorithm is shown in Algorithm 11. When computing the initial Viterbi variables v1(·), the special tag ♦ indicates the start of the sequence. When comput- ing the final tag YM , another special tag,􏱓 indicates the end of the sequence. These special tags enable the use of transition features for the tags that begin and end the sequence: for example, conjunctions are unlikely to end sentences in English, so we would like a low score for sM+1(􏱓, CC); nouns are relatively likely to appear at the beginning of sentences, so we would like a high score for s1(N, ♦), assuming the noun tag is compatible with the first word token w1. Complexity If there are K tags and M positions in the sequence, then there are M × K Viterbi variables to compute. Computing each variable requires finding a maximum over K possible predecessor tags. The total time complexity of populating the trellis is there- fore O(MK2), with an additional factor for the number of active features at each position. After completing the trellis, we simply trace the backwards pointers to the beginning of the sequence, which takes O(M) operations. Under contract with MIT Press, shared under CC-BY-NC-ND license. 152 CHAPTER7. SEQUENCELABELING they can fish N −2 −3 −3 V −10 −1 −3 (a) Weights for emission features. NV􏱓 ♦ −1 −2 −∞ N−3−1 −1 V−1−3 −1 (b) Weights for transition features. The “from” tags are on the columns, and the “to” tags are on the rows. Table 7.1: Feature weights for the example trellis shown in Figure 7.1. Emission weights from ♦ and 􏱓 are implicitly set to −∞. 7.3.1 Example Consider the minimal tagset {N, V}, corresponding to nouns and verbs. Even in this tagset, there is considerable ambiguity: for example, the words can and fish can each take both tags. Of the 2 × 2 × 2 = 8 possible taggings for the sentence they can fish, four are possible given these possible tags, and two are grammatical.2 The values in the trellis in Figure 7.1 are computed from the feature weights defined in Table 7.1. We begin with v1(N), which has only one possible predecessor, the start tag ♦. This score is therefore equal to s1(N, ♦) = −2 − 1 = −3, which is the sum of the scores for the emission and transition features respectively; the backpointer is b1(N) = ♦. The score for v1(V) is computed in the same way: s1(V, ♦) = −10 − 2 = −12, and again b1(V) = ♦. The backpointers are represented in the figure by thick lines. Things get more interesting at m = 2. The score v2(N) is computed by maximizing over the two possible predecessors, v2(N) = max(v1(N) + s2(N, N), v1(V) + s2(N, V)) = max(−3 − 3 − 3, −12 − 3 − 1) = −9 b2(N) =N. This continues until reaching v4(􏱓), which is computed as, v4(􏱓) = max(v3(N) + s4(􏱓, N), v3(V) + s4(􏱓, V)) =max(−9+0−1, −11+0−1) =−10, so b4(􏱓) = N. As there is no emission w4, the emission features have scores of zero. [7.25] [7.26] [7.27] [7.28] [7.29] [7.30] 2The tagging they/N can/V fish/N corresponds to the scenario of putting fish into cans, or perhaps of firing them. Jacob Eisenstein. Draft of October 15, 2018. 7.4. HIDDENMARKOVMODELS 153 To compute the optimal tag sequence, we walk backwards from here, next checking b3(N) = V, and then b2(V) = N, and finally b1(N) = ♦. This yields y = (N, V, N), which corresponds to the linguistic interpretation of the fishes being put into cans. 7.3.2 Higher-order features The Viterbi algorithm was made possible by a restriction of the scoring function to local parts that consider only pairs of adjacent tags. We can think of this as a bigram language model over tags. A natural question is how to generalize Viterbi to tag trigrams, which would involve the following decomposition: M􏰁+2 m=1 One solution is to create a new tagset Y(2) from the Cartesian product of the original tagset with itself, Y(2) = Y × Y. The tags in this product space are ordered pairs, rep- resenting adjacent tags at the token level: for example, the tag (N, V) would represent a noun followed by a verb. Transitions between such tags must be consistent: we can have a transition from (N, V) to (V, N) (corresponding to the tag sequence N V N), but not from (N, V) to (N, N), which would not correspond to any coherent tag sequence. This con- straint can be enforced in feature weights, with θ((a,b),(c,d)) = −∞ if b ̸= c. The remaining feature weights can encode preferences for and against various tag trigrams. In the Cartesian product tag space, there are K2 tags, suggesting that the time com- plexity will increase to O(MK4). However, it is unnecessary to max over predecessor tag bigrams that are incompatible with the current tag bigram. By exploiting this constraint, it is possible to limit the time complexity to O(MK3). The space complexity grows to O(MK2), since the trellis must store all possible predecessors of each tag. In general, the time and space complexity of higher-order Viterbi grows exponentially with the order of the tag n-grams that are considered in the feature decomposition. 7.4 Hidden Markov Models The Viterbi sequence labeling algorithm is built on the scores sm(y, y′). We will now dis- cuss how these scores can be estimated probabilistically. Recall from section 2.2 that the probabilistic Na ̈ıve Bayes classifier selects the label y to maximize p(y | x) ∝ p(y, x). In probabilistic sequence labeling, our goal is similar: select the tag sequence that maximizes p(y | w) ∝ p(y, w). The locality restriction in Equation 7.8 can be viewed as a conditional independence assumption on the random variables y. Under contract with MIT Press, shared under CC-BY-NC-ND license. Ψ(w,y) = wherey−1 =♦andyM+2 =􏱓. f(w,ym,ym−1,ym−2,m), [7.31] 154 CHAPTER7. SEQUENCELABELING Algorithm 12 Generative process for the hidden Markov model y0←♦, m←1 repeat ym ∼ Categorical(λym−1 ) wm ∼ Categorical(φym ) until ym = 􏱓 ◃ sample the current tag ◃ sample the current word ◃ terminate when the stop symbol is generated Na ̈ıve Bayes was introduced as a generative model — a probabilistic story that ex- plains the observed data as well as the hidden label. A similar story can be constructed for probabilistic sequence labeling: first, the tags are drawn from a prior distribution; next, the tokens are drawn from a conditional likelihood. However, for inference to be tractable, additional independence assumptions are required. First, the probability of each token depends only on its tag, and not on any other element in the sequence: [7.32] p(w | y) = Second, each tag ym depends only on its predecessor, 􏰙M m=1 p(wm | ym). p(ym | ym−1), 􏰙M m=1 p(y) = where y0 = ♦ in all cases. Due to this Markov assumption, probabilistic sequence labeling [7.33] The generative process for the hidden Markov model is shown in Algorithm 12. Given the parameters λ and φ, we can compute p(w, y) for any token sequence w and tag se- quence y. The HMM is often represented as a graphical model (Wainwright and Jordan, 2008), as shown in Figure 7.2. This representation makes the independence assumptions explicit: if a variable v1 is probabilistically conditioned on another variable v2, then there is an arrow v2 → v1 in the diagram. If there are no arrows between v1 and v2, they are conditionally independent, given each variable’s Markov blanket. In the hidden Markov model, the Markov blanket for each tag ym includes the “parent” ym−1, and the “children” ym+1 and wm.3 It is important to reflect on the implications of the HMM independence assumptions. A non-adjacent pair of tags ym and yn are conditionally independent; if m < n and we are given yn−1, then ym offers no additional information about yn. However, if we are not given any information about the tags in a sequence, then all tags are probabilistically coupled. 3 In general graphical models, a variable’s Markov blanket includes its parents, children, and its children’s other parents (Murphy, 2012). Jacob Eisenstein. Draft of October 15, 2018. models are known as hidden Markov models (HMMs). 7.4. HIDDENMARKOVMODELS 155 y1 y2 ··· yM w1 w2 ··· wM Figure 7.2: Graphical representation of the hidden Markov model. Arrows indicate prob- abilistic dependencies. 7.4.1 Estimation The hidden Markov model has two groups of parameters: Emission probabilities. The probability pe(wm | ym; φ) is the emission probability, since the words are treated as probabilistically “emitted”, conditioned on the tags. Transition probabilities. The probability pt(ym | ym−1; λ) is the transition probability, since it assigns probability to each possible tag-to-tag transition. Both of these groups of parameters are typically computed from smoothed relative frequency estimation on a labeled corpus (see section 6.2 for a review of smoothing). The unsmoothed probabilities are, φk,i 􏱕Pr(Wm =i|Ym =k)= count(Wm =i,Ym =k) count(Ym = k) λk,k′ 􏱕Pr(Ym = k′ | Ym−1 = k) = count(Ym = k′,Ym−1 = k). count(Ym−1 = k) Smoothing is more important for the emission probability than the transition probability, because the vocabulary is much larger than the number of tags. 7.4.2 Inference The goal of inference in the hidden Markov model is to find the highest probability tag sequence, yˆ = argmax p(y | w). [7.34] y As in Na ̈ıve Bayes, it is equivalent to find the tag sequence with the highest log-probability, since the logarithm is a monotonically increasing function. It is furthermore equivalent to maximize the joint probability p(y, w) = p(y | w) × p(w) ∝ p(y | w), which is pro- portional to the conditional probability. Putting these observations together, the inference Under contract with MIT Press, shared under CC-BY-NC-ND license. 156 CHAPTER7. SEQUENCELABELING problem can be reformulated as, where, and, sm(ym,ym−1)􏱕logλym,ym−1 +logφym,wm, 􏰅1, w=􏱒 φ􏱓,w = 0, otherwise, yˆ = argmax log p(y, w). y We can now apply the HMM independence assumptions: logp(y,w) =logp(y)+logp(w | y) [7.35] [7.36] [7.37] [7.38] [7.39] [7.40] [7.41] = = = M􏰁+1 m=1 M􏰁+1 m=1 M􏰁+1 m=1 logpY (ym | ym−1)+logpW|Y (wm | ym) logλym,ym−1 +logφym,wm sm(ym, ym−1), which ensures that the stop tag 􏱓 can only be applied to the final token 􏱒. This derivation shows that HMM inference can be viewed as an application of the Viterbi decoding algorithm, given an appropriately defined scoring function. The local score sm(ym, ym−1) can be interpreted probabilistically, sm(ym, ym−1) = log py(ym | ym−1) + log pw|y(wm | ym) =logp(ym,wm |ym−1). Now recall the definition of the Viterbi variables, vm(ym) = max sm(ym, ym−1) + vm−1(ym−1) ym−1 = max log p(ym, wm | ym−1) + vm−1(ym−1). ym−1 By setting vm−1(ym−1) = maxy1:m−2 log p(y1:m−1, w1:m−1), we obtain the recurrence, vm(ym) = max log p(ym, wm | ym−1) + max log p(y1:m−1, w1:m−1) ym−1 y1:m−2 = max log p(ym, wm | ym−1) + log p(y1:m−1, w1:m−1) [7.42] [7.43] [7.44] [7.45] [7.46] [7.47] [7.48] y1:m−1 = max log p(y1:m, w1:m). y1:m−1 Jacob Eisenstein. Draft of October 15, 2018. 7.5. DISCRIMINATIVESEQUENCELABELINGWITHFEATURES 157 In words, the Viterbi variable vm(ym) is the log probability of the best tag sequence ending in ym, joint with the word sequence w1:m. The log probability of the best complete tag sequence is therefore, maxlogp(y1:M+1,w1:M+1) = vM+1(􏱓) [7.49] y1:M *Viterbi as an example of the max-product algorithm The Viterbi algorithm can also be implemented using probabilities, rather than log-probabilities. In this case, each vm(ym) is equal to, vm(ym) = max p(y1:m−1, ym, w1:m) y1:m−1 =maxp(ym,wm |ym−1)× max p(y1:m−2,ym−1,w1:m−1) ym−1 y1:m−2 = max p(ym, wm | ym−1) × vm−1(ym−1) ym−1 =pw|y(wm | ym) × max py(ym | ym−1) × vm−1(ym−1). ym−1 [7.50] [7.51] [7.52] [7.53] Each Viterbi variable is computed by maximizing over a set of products. Thus, the Viterbi algorithm is a special case of the max-product algorithm for inference in graphical mod- els (Wainwright and Jordan, 2008). However, the product of probabilities tends towards zero over long sequences, so the log-probability version of Viterbi is recommended in practical implementations. 7.5 Discriminative sequence labeling with features Today, hidden Markov models are rarely used for supervised sequence labeling. This is because HMMs are limited to only two phenomena: • word-tag compatibility, via the emission probability pW |Y (wm | ym ); • local context, via the transition probability pY (ym | ym−1). The Viterbi algorithm permits the inclusion of richer information in the local scoring func- tion ψ(w1:M , ym, ym−1, m), which can be defined as a weighted sum of arbitrary local fea- tures, ψ(w, ym, ym−1, m) = θ · f(w, ym, ym−1, m), [7.54] where f is a locally-defined feature function, and θ is a vector of weights. Under contract with MIT Press, shared under CC-BY-NC-ND license. 158 CHAPTER7. SEQUENCELABELING The local decomposition of the scoring function Ψ is reflected in a corresponding de- composition of the feature function: M􏰁+1 m=1 M􏰁+1 m=1 Ψ(w, y) = = =θ · =θ · f(global)(w, y1:M ), ψ(w, ym, ym−1, m) θ · f(w, ym, ym−1, m) [7.55] [7.56] [7.57] [7.58] M􏰁+1 m=1 f(w,ym,ym−1,m) where f(global)(w,y) is a global feature vector, which is a sum of local feature vectors, M􏰁+1 f(global)(w,y) = with yM +1 = 􏱓 and y0 = ♦ by construction. Let’s now consider what additional information these features might encode. Word affix features. Consider the problem of part-of-speech tagging on the first four lines of the poem Jabberwocky (Carroll, 1917): (7.3) ’Twas brillig, and the slithy toves Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe. Many of these words were made up by the author of the poem, so a corpus would offer no information about their probabilities of being associated with any particular part of speech. Yet it is not so hard to see what their grammatical roles might be in this passage. Context helps: for example, the word slithy follows the determiner the, so it is probably a noun or adjective. Which do you think is more likely? The suffix -thy is found in a number of adjectives, like frothy, healthy, pithy, worthy. It is also found in a handful of nouns — e.g., apathy, sympathy — but nearly all of these have the longer coda -pathy, unlike slithy. So the suffix gives some evidence that slithy is an adjective, and indeed it is: later in the text we find that it is a combination of the adjectives lithe and slimy.4 4Morphology is the study of how words are formed from smaller linguistic units. chapter 9 touches on computational approaches to morphological analysis. See Bender (2013) for an overview of the underlying linguistic principles, and Haspelmath and Sims (2013) or Lieber (2015) for a full treatment. Jacob Eisenstein. Draft of October 15, 2018. m=1 f(w1:M,ym,ym−1,m), [7.59] 7.5. DISCRIMINATIVESEQUENCELABELINGWITHFEATURES 159 Fine-grained context. The hidden Markov model captures contextual information in the form of part-of-speech tag bigrams. But sometimes, the necessary contextual information is more specific. Consider the noun phrases this fish and these fish. Many part-of-speech tagsets distinguish between singular and plural nouns, but do not distinguish between singular and plural determiners; for example, the well known Penn Treebank tagset fol- lows these conventions. A hidden Markov model would be unable to correctly label fish as singular or plural in both of these cases, because it only has access to two features: the pre- ceding tag (determiner in both cases) and the word (fish in both cases). The classification- based tagger discussed in section 7.1 had the ability to use preceding and succeeding words as features, and it can also be incorporated into a Viterbi-based sequence labeler as a local feature. Example. Consider the tagging D J N (determiner, adjective, noun) for the sequence the slithy toves, so that w =the slithy toves y =D J N. Let’s create the feature vector for this example, assuming that we have word-tag features (indicated by W ), tag-tag features (indicated by T ), and suffix features (indicated by M ). You can assume that you have access to a method for extracting the suffix -thy from slithy, -es from toves, and ∅ from the, indicating that this word has no suffix.5 feature vector is, f (the slithy toves, D J N) =f (the slithy toves, D, ♦, 1) + f (the slithy toves, J, D, 2) + f (the slithy toves, N, J, 3) The resulting + f (the slithy toves, 􏱓, N, 4) ={(T : ♦,D),(W : the,D),(M : ∅,D), (T : D,J),(W : slithy,J),(M : -thy,J), (T : J,N),(W : toves,N),(M : -es,N) (T : N, 􏱓)}. These examples show that local features can incorporate information that lies beyond the scope of a hidden Markov model. Because the features are local, it is possible to apply the Viterbi algorithm to identify the optimal sequence of tags. The remaining question 5Such a system is called a morphological segmenter. The task of morphological segmentation is briefly described in section 9.1.4; a well known segmenter is MORFESSOR (Creutz and Lagus, 2007). In real applica- tions, a typical approach is to include features for all orthographic suffixes up to some maximum number of characters: for slithy, we would have suffix features for -y, -hy, and -thy. Under contract with MIT Press, shared under CC-BY-NC-ND license. 160 CHAPTER7. SEQUENCELABELING is how to estimate the weights on these features. section 2.3 presented three main types of discriminative classifiers: perceptron, support vector machine, and logistic regression. Each of these classifiers has a structured equivalent, enabling it to be trained from labeled sequences rather than individual tokens. 7.5.1 Structured perceptron The perceptron classifier is trained by increasing the weights for features that are asso- ciated with the correct label, and decreasing the weights for features that are associated with incorrectly predicted labels: yˆ = argmax θ · f (x, y) y∈Y θ(t+1) ←θ(t) +f(x,y)−f(x,yˆ). We can apply exactly the same update in the case of structure prediction, yˆ = argmax θ · f (w, y) y∈Y(w) θ(t+1) ← θ(t) + f (w, y) − f (w, yˆ). [7.60] [7.61] [7.62] [7.63] This learning algorithm is called structured perceptron, because it learns to predict the structured output y. The only difference is that instead of computing yˆ by enumerating the entire set Y, the Viterbi algorithm is used to efficiently search the set of possible tag- gings, YM. Structured perceptron can be applied to other structured outputs as long as efficient inference is possible. As in perceptron classification, weight averaging is crucial to get good performance (see subsection 2.3.2). Example For the example they can fish, suppose that the reference tag sequence is y(i) = N V V, but the tagger incorrectly returns the tag sequence yˆ = N V N. Assuming a model with features for emissions (wm, ym) and transitions (ym−1, ym), the corresponding struc- tured perceptron update is: θ(fish,V) ← θ(fish,V) + 1, θ(V,V) ← θ(V,V) + 1, θ(V,􏱓) ← θ(V,􏱓) + 1, θ(fish,N) ← θ(fish,N) − 1 θ(V,N) ← θ(V,N) − 1 θ(N,􏱓) ← θ(N,􏱓) − 1. [7.64] [7.65] [7.66] 7.5.2 Structured support vector machines Large-margin classifiers such as the support vector machine improve on the perceptron by pushing the classification boundary away from the training instances. The same idea can Jacob Eisenstein. Draft of October 15, 2018. 7.5. DISCRIMINATIVESEQUENCELABELINGWITHFEATURES 161 be applied to sequence labeling. A support vector machine in which the output is a struc- tured object, such as a sequence, is called a structured support vector machine (Tsochan- taridis et al., 2004).6 In classification, we formalized the large-margin constraint as, ∀y ̸= y(i), θ · f(x, y(i)) − θ · f(x, y) ≥ 1, [7.67] requiring a margin of at least 1 between the scores for all labels y that are not equal to the correct label y(i). The weights θ are then learned by constrained optimization (see subsection 2.4.2). This idea can be applied to sequence labeling by formulating an equivalent set of con- straints for all possible labelings Y(w) for an input w. However, there are two problems. First, in sequence labeling, some predictions are more wrong than others: we may miss only one tag out of fifty, or we may get all fifty wrong. We would like our learning algo- rithm to be sensitive to this difference. Second, the number of constraints is equal to the number of possible labelings, which is exponentially large in the length of the sequence. The first problem can be addressed by adjusting the constraint to require larger mar- gins for more serious errors. Let c(y(i), yˆ) ≥ 0 represent the cost of predicting label yˆ when the true label is y(i). We can then generalize the margin constraint, ∀y, θ · f(w(i), y(i)) − θ · f(w(i), y) ≥ c(y(i), y). [7.68] This cost-augmented margin constraint specializes to the constraint in Equation 7.67 if we choose the delta function c(y(i), y) = δ (() y(i) ̸= y). A more expressive cost function is the Hamming cost, c(y(i), y) = 􏰁M mm m=1 which computes the number of errors in y. By incorporating the cost function as the margin constraint, we require that the true labeling be seperated from the alternatives by a margin that is proportional to the number of incorrect tags in each alternative labeling. The second problem is that the number of constraints is exponential in the length of the sequence. This can be addressed by focusing on the prediction yˆ that maximally violates the margin constraint. This prediction can be identified by solving the following cost-augmented decoding problem: yˆ = argmax θ · f(w(i), y) − θ · f(w(i), y(i)) + c(y(i), y) [7.70] y̸=y(i) = argmax θ · f(w(i), y) + c(y(i), y), [7.71] y̸=y(i) 6This model is also known as a max-margin Markov network (Taskar et al., 2003), emphasizing that the scoring function is constructed from a sum of components, which are Markov independent. Under contract with MIT Press, shared under CC-BY-NC-ND license. δ(y(i) ̸= y ), [7.69] 162 CHAPTER7. SEQUENCELABELING where in the second line we drop the term θ · f(w(i), y(i)), which is constant in y. We can now reformulate the margin constraint for sequence labeling, θ·f(w(i),y(i))− max 􏰛θ·f(w(i),y)+c(y(i),y)􏰜≥0. [7.72] y∈Y(w) Ifthescoreforθ·f(w(i),y(i))isgreaterthanthecost-augmentedscoreforallalternatives, then the constraint will be met. The name “cost-augmented decoding” is due to the fact that the objective includes the standard decoding problem, maxyˆ∈Y(w) θ · f(w,yˆ), plus an additional term for the cost. Essentially, we want to train against predictions that are strong and wrong: they should score highly according to the model, yet incur a large loss with respect to the ground truth. Training adjusts the weights to reduce the score of these predictions. For cost-augmented decoding to be tractable, the cost function must decompose into local parts, just as the feature function f(·) does. The Hamming cost, defined above, obeys this property. To perform cost-augmented decoding using the Hamming cost, we need only to add features f (y ) = δ(y ̸= y(i)), and assign a constant weight of 1 to mmmm these features. Decoding can then be performed using the Viterbi algorithm.7 As with large-margin classifiers, it is possible to formulate the learning problem in an unconstrained form, by combining a regularization term on the weights and a Lagrangian for the constraints: min 1||θ||2 −C􏰝􏰁θ·f(w(i),y(i))− max 􏰪θ·f(w(i),y)+c(y(i),y)􏰫􏰞, [7.73] θ 2 i y∈Y(w(i)) In this formulation, C is a parameter that controls the tradeoff between the regulariza- tion term and the margin constraints. A number of optimization algorithms have been proposed for structured support vector machines, some of which are discussed in sub- section 2.4.2. An empirical comparison by Kummerfeld et al. (2015) shows that stochastic subgradient descent — which is essentially a cost-augmented version of the structured perceptron — is highly competitive. 7.5.3 Conditional random fields The conditional random field (CRF; Lafferty et al., 2001) is a conditional probabilistic model for sequence labeling; just as structured perceptron is built on the perceptron clas- sifier, conditional random fields are built on the logistic regression classifier.8 The basic 7Are there cost functions that do not decompose into local parts? Suppose we want to assign a constant loss c to any prediction yˆ in which k or more predicted tags are incorrect, and zero loss otherwise. This loss function is combinatorial over the predictions, and thus we cannot decompose it into parts. 8The name “conditional random field” is derived from Markov random fields, a general class of models in which the probability of a configuration of variables is proportional to a product of scores across pairs (or Jacob Eisenstein. Draft of October 15, 2018. 7.5. DISCRIMINATIVESEQUENCELABELINGWITHFEATURES 163 probability model is, p(y | w) = 􏰀 exp(Ψ(w, y)) . [7.74] y′∈Y(w) exp(Ψ(w, y′)) This is almost identical to logistic regression (section 2.5), but because the label space is now sequences of tags, we require efficient algorithms for both decoding (searching for the best tag sequence given a sequence of words w and a model θ) and for normalization (summing over all tag sequences). These algor􏰀ithms will be based on the usual locality assumption on the scoring function, Ψ(w, y) = M+1 ψ(w, ym, ym−1, m). m=1 Decoding in CRFs Decoding — finding the tag sequence yˆ that maximizes p(y | w) — is a direct applica- tion of the Viterbi algorithm. The key observation is that the decoding problem does not depend on the denominator of p(y | w), yˆ =argmaxlogp(y | w) y = argmax Ψ(y, w) − log y y′∈Y(w) recurrence as defined in Equation 7.22 can be used. Learning in CRFs As with logistic regression, the weights θ are learned by minimizing the regularized neg- ative log-probability, [7.75] [7.76] sm(ym, ym−1). This is identical to the decoding problem for structured perceptron, so the same Viterbi λ 􏰁N = argmax Ψ(y, w) = argmax yy l = 2 ||θ||2 − λ􏰁N 􏰁􏰛􏰜 i=1 y′∈Y(w(i)) log p(y(i) | w(i); θ) =2||θ||2 − θ·f(w(i),y(i))+log exp θ·f(w(i),y′) , 􏰁 exp Ψ(y′, w) M􏰁+1 m=1 i=1 more generally, cliques) of variables in a factor graph. In sequence labeling, the pairs of variables include all adjacent tags (ym,ym−1). The probability is conditioned on the words w, which are always observed, motivating the term “conditional” in the name. Under contract with MIT Press, shared under CC-BY-NC-ND license. 164 CHAPTER7. SEQUENCELABELING where λ controls the amount of regularization. The final term in Equation 7.76 is a sum over all possible labelings. This term is the log of the denominator in Equation 7.74, some- times known as the partition function.9 There are |Y|M possible labelings of an input of size M, so we must again exploit the decomposition of the scoring function to compute this sum efficiently. The sum 􏰀y∈Yw(i) exp Ψ(y, w) can be computed efficiently using the forward recur- rence, which is closely related to the Viterbi recurrence. We first define a set of forward variables, αm(ym), which is equal to the sum of the scores of all paths leading to tag ym at position m: can also be computed through a recurrence: αm(ym) 􏱕 = 􏰁 􏰁m exp sn(yn, yn−1) y1:m−1 n=1 􏰁 􏰙m y1:m−1 n=1 [7.77] [7.78] 􏰁 􏰙m y1:m−1 n=1 αm(ym) = = 􏰁􏰁 (exp sm(ym, ym−1)) 􏰁 􏰙 exp sn(yn, yn−1) exp sn(yn, yn−1) [7.79] [7.80] [7.81] m−1 ym−1 y1:m−2 n=1 = (exp sm(ym, ym−1)) × αm−1(ym−1). ym−1 Using the forward recurrence, it is possible to compute the denominator of the condi- tional probability, 􏰁 􏰁 􏰙M Ψ(w,y) = sM+1(􏱓,yM) sm(ym,ym−1) y∈Y(w) y1:M m=1 =αM +1 (􏱓). [7.82] [7.83] exp sn(yn, yn−1). NotethesimilaritytothedefinitionoftheViterbivariable,vm(ym)=maxy1:m−1 􏰀mn=1sn(yn,yn−1). In the hidden Markov model, the Viterbi recurrence had an alternative interpretation as the max-product algorithm (see Equation 7.53); analogously, the forward recurrence is known as the sum-product algorithm, because of the form of [7.78]. The forward variable 9The terminology of “potentials” and “partition functions” comes from statistical mechanics (Bishop, 2006). Jacob Eisenstein. Draft of October 15, 2018. 7.5. DISCRIMINATIVESEQUENCELABELINGWITHFEATURES 165 The conditional log-likelihood can be rewritten, λ 􏰁N l=2||θ||2 − θ·f(w(i),y(i))+logαM+1(􏱓). [7.84] i=1 Probabilistic programming environments, such as TORCH (Collobert et al., 2011) and DYNET (Neubig et al., 2017), can compute the gradient of this objective using automatic differentiation. The programmer need only implement the forward algorithm as a com- putation graph. As in logistic regression, the gradient of the likelihood with respect to the parameters is a difference between observed and expected feature counts: dl 􏰁N dθ =λθj + E[fj(w(i),y)]−fj(w(i),y(i)), [7.85] j i=1 where fj(w(i),y(i)) refers to the count of feature j for token sequence w(i) and tag se- quence y(i). The expected feature counts are computed “under the hood” when automatic differentiation is applied to Equation 7.84 (Eisner, 2016). Before the widespread use of automatic differentiation, it was common to compute the feature expectations from marginal tag probabilities p(ym | w). These marginal prob- abilities are sometimes useful on their own, and can be computed using the forward- backward algorithm. This algorithm combines the forward recurrence with an equivalent backward recurrence, which traverses the input from wM back to w1. *Forward-backward algorithm Marginal probabilities over tag bigrams can be written as,10 ′ 􏰀y:Ym=k,Ym−1=k′ 􏰧Mn=1expsn(yn,yn−1) Pr(Ym−1 = k ,Ym = k | w) = 􏰀y′ 􏰧Mn=1 expsn(yn′ ,yn′ −1) . [7.86] The numerator sums over all tag sequences that include the transition (Ym−1 = k′) → (Ym = k). Because we are only interested in sequences that include the tag bigram, this sum can be decomposed into three parts: the prefixes y1:m−1, terminating in Ym−1 = k′; the 10Recall the notational convention of upper-case letters for random variables, e.g. Ym, and lower case letters for specific values, e.g., ym, so that Ym = k is interpreted as the event of random variable Ym taking the value k. Under contract with MIT Press, shared under CC-BY-NC-ND license. 166 CHAPTER7. SEQUENCELABELING Ym−1 = k′ Ym = k αm−1(k′) exp sm(k, k′) βm(k) Figure 7.3: A schematic illustration of the computation of the marginal probability Pr(Ym−1 = k′, Ym = k), using the forward score αm−1(k′) and the backward score βm(k). transition (Ym−1 = k′) → (Ym = k); and the suffixes ym:M , beginning with the tag Ym = k: M 􏰁 􏰙 exp sn(yn, yn−1) = y:Ym=k,Ym−1=k′ n=1 m−1 􏰁 􏰙 exp sn(yn, yn−1) M+1 􏰁 􏰙 exp sn(yn, yn−1) βm(k) 􏱕 = 􏰁􏰁 exp sm+1(k′, k) 􏰁 􏰙 [7.88] [7.89] [7.90] ym:M :Ym=k n=m M+1 k′∈Y ym+1:M :Ym=k′ n=m+1 exp sn(yn, yn−1) = exp sm+1(k′, k) × βm+1(k′). k′∈Y y1:m−1:Ym−1=k′ n=1 ×expsm(k,k′) M+1 × 􏰁 􏰙 ym:M :Ym=k n=m+1 exp sn(yn, yn−1). [7.87] The result is product of three terms: a score that sums over all the ways to get to the position (Ym−1 = k′), a score for the transition from k′ to k, and a score that sums over all the ways of finishing the sequence from (Ym = k). The first term of Equation 7.87 is equal to the forward variable, αm−1(k′). The third term — the sum over ways to finish the sequence — can also be defined recursively, this time moving over the trellis from right to left, which is known as the backward recurrence: To understand this computation, compare with the forward recurrence in Equation 7.81. Jacob Eisenstein. Draft of October 15, 2018. 7.6. NEURALSEQUENCELABELING 167 In practice, numerical stability demands that we work in the log domain, log αm(k) = log 􏰁 exp 􏰗log sm(k, k′) + log αm−1(k′)􏰘 k ∈Y [7.91] [7.92] log βm−1(k) = log 􏰁′ k′∈Y exp 􏰗log sm(k′, k) + log βm(k′)􏰘 . The application of the forward and backward probabilities is shown in Figure 7.3. Both the forward and backward recurrences operate on the trellis, which implies a space complexity O(MK). Because both recurrences require computing a sum over K terms at each node in the trellis, their time complexity is O(MK2). 7.6 Neural sequence labeling In neural network approaches to sequence labeling, we construct a vector representa- tion for each tagging decision, based on the word and its context. Neural networks can perform tagging as a per-token classification decision, or they can be combined with the Viterbi algorithm to tag the entire sequence globally. 7.6.1 Recurrent neural networks Recurrent neural networks (RNNs) were introduced in chapter 6 as a language model- ing technique, in which the context at token m is summarized by a recurrently-updated vector, hm =g(xm,hm−1), m = 1,2,...M, where xm is the vector embedding of the token wm and the function g defines the recur- rence. The starting condition h0 is an additional parameter of the model. The long short- term memory (LSTM) is a more complex recurrence, in which a memory cell is through a series of gates, avoiding repeated application of the non-linearity. Despite these bells and whistles, both models share the basic architecture of recurrent updates across a sequence, and both will be referred to as RNNs here. A straightforward application of RNNs to sequence labeling is to score each tag ym as a linear function of hm: ψm(y) =βy · hm [7.93] yˆm =argmaxψm(y). [7.94] y The score ψm(y) can also be converted into a probability distribution using the usual soft- max operation, p(y | w1:m) = 􏰀 exp ψm(y) . [7.95] y′∈Y expψm(y′) Under contract with MIT Press, shared under CC-BY-NC-ND license. 168 CHAPTER7. SEQUENCELABELING Using this transformation, it is possible to train the tagger from the negative log-likelihood of the tags, as in a conditional random field. Alternatively, a hinge loss or margin loss objective can be constructed from the raw scores ψm(y). The hidden state hm accounts for information in the input leading up to position m, but it ignores the subsequent tokens, which may also be relevant to the tag ym. This can be addressed by adding a second RNN, in which the input is reversed, running the recur- rence from wM to w1. This is known as a bidirectional recurrent neural network (Graves and Schmidhuber, 2005), and is specified as: ←− ←− hm =g(xm, hm+1), m = 1,2,...,M. [7.96] The hidden states of the left-to-right RNN are denoted h m. The left-to-right and right-to- ←− →− left vectors are concatenated, hm = [ h m; h m]. The scoring function in Equation 7.93 is applied to this concatenated vector. Bidirectional RNN tagging has several attractive properties. Ideally, the representa- tion hm summarizes the useful information from the surrounding context, so that it is not necessary to design explicit features to capture this information. If the vector hm is an ad- equate summary of this context, then it may not even be necessary to perform the tagging jointly: in general, the gains offered by joint tagging of the entire sequence are diminished as the individual tagging model becomes more powerful. Using backpropagation, the word vectors x can be trained “end-to-end”, so that they capture word properties that are useful for the tagging task. Alternatively, if limited labeled data is available, we can use word embeddings that are “pre-trained” from unlabeled data, using a language modeling objective (as in section 6.3) or a related word embedding technique (see chapter 14). It is even possible to combine both fine-tuned and pre-trained embeddings in a single model. Neural structure prediction The bidirectional recurrent neural network incorporates in- formation from throughout the input, but each tagging decision is made independently. In some sequence labeling applications, there are very strong dependencies between tags: it may even be impossible for one tag to follow another. In such scenarios, the tagging decision must be made jointly across the entire sequence. Neural sequence labeling can be combined with the Viterbi algorithm by defining the local scores as: sm(ym,ym−1)=βym ·hm+ηym−1,ym, [7.97] where hm is the RNN hidden state, βym is a vector associated with tag ym, and ηym−1,ym is a scalar parameter for the tag transition (ym−1 , ym ). These local scores can then be incorporated into the Viterbi algorithm for inference, and into the forward algorithm for training. This model is shown in Figure 7.4. It can be trained from the conditional log- likelihood objective defined in Equation 7.76, backpropagating to the tagging parameters Jacob Eisenstein. Draft of October 15, 2018. →− 7.6. NEURALSEQUENCELABELING 169 ym−1 ym ym+1 ←− ←− ←− hm−1 hm hm+1 →− →− →− hm−1 hm hm+1 xm−1 xm xm+1 Figure 7.4: Bidirectional LSTM for sequence labeling. The solid lines indicate computa- tion, the dashed lines indicate probabilistic dependency, and the dotted lines indicate the optional additional probabilistic dependencies between labels in the biLSTM-CRF. β and η, as well as the parameters of the RNN. This model is called the LSTM-CRF, due to its combination of aspects of the long short-term memory and conditional random field models (Huang et al., 2015). The LSTM-CRF is especially effective on the task of named entity recognition (Lam- ple et al., 2016), a sequence labeling task that is described in detail in section 8.3. This task has strong dependencies between adjacent tags, so structure prediction is especially important. 7.6.2 Character-level models As in language modeling, rare and unseen words are a challenge: if we encounter a word that was not in the training data, then there is no obvious choice for the word embed- ding xm. One solution is to use a generic unseen word embedding for all such words. However, in many cases, properties of unseen words can be guessed from their spellings. For example, whimsical does not appear in the Universal Dependencies (UD) English Tree- bank, yet the suffix -al makes it likely to be adjective; by the same logic, unflinchingly is likely to be an adverb, and barnacle is likely to be a noun. In feature-based models, these morphological properties were handled by suffix fea- tures; in a neural network, they can be incorporated by constructing the embeddings of unseen words from their spellings or morphology. One way to do this is to incorporate an additional layer of bidirectional RNNs, one for each word in the vocabulary (Ling et al., 2015). For each such character-RNN, the inputs are the characters, and the output is the concatenation of the final states of the left-facing and right-facing passes, φw = Under contract with MIT Press, shared under CC-BY-NC-ND license. 170 CHAPTER7. SEQUENCELABELING →−(w) ←−(w) [hNw; h0 →−(w) ], where hNw is the final state of the right-facing pass for word w, and Nw is the number of characters in the word. The character RNN model is trained by back- propagation from the tagging objective. On the test data, the trained RNN is applied to out-of-vocabulary words (or all words), yielding inputs to the word-level tagging RNN. Other approaches to compositional word embeddings are described in subsection 14.7.1. 7.6.3 Convolutional Neural Networks for Sequence Labeling One disadvantage of recurrent neural networks is that the architecture requires iterating through the sequence of inputs and predictions: each hidden vector hm must be com- puted from the previous hidden vector hm−1, before predicting the tag ym. These iterative computations are difficult to parallelize, and fail to exploit the speedups offered by graph- ics processing units (GPUs) on operations such as matrix multiplication. Convolutional neural networks achieve better computational performance by predicting each label ym from a set of matrix operations on the neighboring word embeddings, xm−k:m+k (Col- lobert et al., 2011). Because there is no hidden state to update, the predictions for each ym can be computed in parallel. For more on convolutional neural networks, see section 3.4. Character-based word embeddings can also be computed using convolutional neural net- works (Santos and Zadrozny, 2014). 7.7 *Unsupervised sequence labeling In unsupervised sequence labeling, the goal is to induce a hidden Markov model from a corpus of unannotated text (w(1), w(2), . . . , w(N)), where each w(i) is a sequence of length M(i). This is an example of the general problem of structure induction, which is the unsupervised version of structure prediction. The tags that result from unsupervised se- quence labeling might be useful for some downstream task, or they might help us to better understand the language’s inherent structure. For part-of-speech tagging, it is common to use a tag dictionary that lists the allowed tags for each word, simplifying the prob- lem (Christodoulopoulos et al., 2010). Unsupervised learning in hidden Markov models can be performed using the Baum- Welch algorithm, which combines the forward-backward algorithm (section 7.5.3) with expectation-maximization (EM; subsection 5.1.2). In the M-step, the HMM parameters from expected counts: Pr(W =i|Y =k)=φk,i =E[count(W =i,Y =k)] E[count(Y = k)] Pr(Ym = k | Ym−1 = k′) = λk′,k =E[count(Ym = k,Ym−1 = k′)] E[count(Ym−1 = k′)] Jacob Eisenstein. Draft of October 15, 2018. 7.7. *UNSUPERVISEDSEQUENCELABELING 171 The expected counts are computed in the E-step, using the forward and backward recurrences. The local scores follow the usual definition for hidden Markov models, sm(k,k′)=logpE(wm |Ym =k;φ)+logpT(Ym =k|Ym−1 =k′;λ). The expected transition counts for a single instance are, 􏰁M E[count(Ym = k,Ym−1 = k′) | w] = Pr(Ym−1 = k′,Ym = k | w) m=1 􏰀y:Ym=k,Ym−1=k′ 􏰧Mn=1expsn(yn,yn−1) = 􏰀y′ 􏰧Mn=1 exp sn(yn′ , yn′ −1) . [7.98] [7.99] [7.100] As described in section 7.5.3, these marginal probabilities can be computed from the forward-backward recurrence, Pr(Ym−1 =k′,Ym =k|w)=αm−1(k′)×expsm(k,k′)×βm(k). [7.101] αM+1(􏱓) In a hidden Markov model, each element of the forward-backward computation has a special interpretation: αm−1(k′) =p(Ym−1 = k′, w1:m−1) exp sm(k, k′) =p(Ym = k, wm | Ym−1 = k′) βm(k) =p(wm+1:M | Ym = k). [7.102] [7.103] [7.104] Applying the conditional independence assumptions of the hidden Markov model (de- fined in Algorithm 12), the product is equal to the joint probability of the tag bigram and the entire input, αm−1(k′) × exp sm(k, k′) × βm(k) =p(Ym−1 = k′, w1:m−1) ×p(Ym =k,wm |Ym−1 =k′) × p(wm+1:M | Ym = k) =p(Ym−1 = k′, Ym = k, w1:M ). Dividing by αM +1 (􏱓) = p(w1:M ) gives the desired probability, αm−1(k′)×sm(k,k′)×βm(k) =p(Ym−1 =k′,Ym =k,w1:M) αM +1 (􏱓) p(w1:M ) = Pr(Ym−1 = k′, Ym = k | w1:M ). [7.105] [7.106] [7.107] The expected emission counts can be computed in a similar manner, using the product αm(k) × βm(k). Under contract with MIT Press, shared under CC-BY-NC-ND license. 172 CHAPTER7. SEQUENCELABELING 7.7.1 Linear dynamical systems The forward-backward algorithm can be viewed as Bayesian state estimation in a discrete state space. In a continuous state space, ym ∈ RK , the equivalent algorithm is the Kalman smoother. It also computes marginals p(ym | x1:M ), using a similar two-step algorithm of forward and backward passes. Instead of computing a trellis of values at each step, the Kalman smoother computes a probability density function qym (ym; μm, Σm), character- ized by a mean μm and a covariance Σm around the latent state. Connections between the Kalman smoother and the forward-backward algorithm are elucidated by Minka (1999) and Murphy (2012). 7.7.2 Alternative unsupervised learning methods As noted in section 5.5, expectation-maximization is just one of many techniques for struc- ture induction. One alternative is to use Markov Chain Monte Carlo (MCMC) sampling algorithms, which are briefly described in subsection 5.5.1. For the specific case of se- quence labeling, Gibbs sampling can be applied by iteratively sampling each tag ym con- ditioned on all the others (Finkel et al., 2005): p(ym | y−m, w1:M ) ∝ p(wm | ym)p(ym | y−m). [7.108] Gibbs Sampling has been applied to unsupervised part-of-speech tagging by Goldwater and Griffiths (2007). Beam sampling is a more sophisticated sampling algorithm, which randomly draws entire sequences y1:M , rather than individual tags ym; this algorithm was applied to unsupervised part-of-speech tagging by Van Gael et al. (2009). Spectral learn- ing (see subsection 5.5.2) can also be applied to sequence labeling. By factoring matrices of co-occurrence counts of word bigrams and trigrams (Song et al., 2010; Hsu et al., 2012), it is possible to obtain globally optimal estimates of the transition and emission parameters, under mild assumptions. 7.7.3 Semiring notation and the generalized viterbi algorithm The Viterbi and Forward recurrences can each be performed over probabilities or log probabilities, yielding a total of four closely related recurrences. These four recurrence scan in fact be expressed as a single recurrence in a more general notation, known as semiring algebra. Let the symbols ⊕ and ⊗ represent generalized addition and multipli- cation respectively.11 Given these operators, a generalized Viterbi recurrence is denoted, vm(k) = 􏱍 sm(k, k′) ⊗ vm−1(k′). [7.109] k′∈Y 11In a semiring, the addition and multiplication operators must both obey associativity, and multiplication must distribute across addition; the addition operator must be commutative; there must be additive and multiplicative identities 0 and 1, such that a ⊕ 0 = a and a ⊗ 1 = a; and there must be a multiplicative annihilator 0, such that a ⊗ 0 = 0. Jacob Eisenstein. Draft of October 15, 2018. 7.7. *UNSUPERVISEDSEQUENCELABELING 173 Each recurrence that we have seen so far is a special case of this generalized Viterbi recurrence: • In the max-product Viterbi recurrence over probabilities, the ⊕ operation corre- sponds to maximization, and the ⊗ operation corresponds to multiplication. • In the forward recurrence over probabilities, the ⊕ operation corresponds to addi- tion, and the ⊗ operation corresponds to multiplication. • In the max-product Viterbi recurrence over log-probabilities, the ⊕ operation corre- sponds to maximization, and the ⊗ operation corresponds to addition.12 • Intheforwardrecurrenceoverlog-probabilities,the⊕operationcorrespondstolog- addition, a ⊕ b = log(ea + eb). The ⊗ operation corresponds to addition. The mathematical abstraction offered by semiring notation can be applied to the soft- ware implementations of these algorithms, yielding concise and modular implementa- tions. For example, in the OPENFST library, generic operations are parametrized by the choice of semiring (Allauzen et al., 2007). Exercises 1. Extendtheexampleinsubsection7.3.1tothesentencetheycancanfish,meaningthat “they can put fish into cans.” Build the trellis for this example using the weights in Table 7.1, and identify the best-scoring tag sequence. If the scores for noun and verb are tied, then you may assume that the backpointer always goes to noun. 2. UsingthetagsetY ={N,V},andthefeaturesetf(w,ym,ym−1,m)={(wm,ym),(ym,ym−1)}, show that there is no set of weights that give the correct tagging for both they can fish (N V V) and they can can fish (N V V N). 3. Work out what happens if you train a structured perceptron on the two exam- ples mentioned in the previous problem, using the transition and emission features (ym, ym−1) and (ym, wm). Initialize all weights at 0, and assume that the Viterbi algo- rithm always chooses N when the scores for the two tags are tied, so that the initial prediction for they can fish is N N N. 4. Consider the garden path sentence, The old man the boat. Given word-tag and tag-tag features, what inequality in the weights must hold for the correct tag sequence to outscore the garden path tag sequence for this example? 12This is sometimes called the tropical semiring, in honor of the Brazilian mathematician Imre Simon. Under contract with MIT Press, shared under CC-BY-NC-ND license. 174 CHAPTER7. SEQUENCELABELING 5. Using the weights in Table 7.1, explicitly compute the log-probabilities for all pos- sible taggings of the input fish can. Verify that the forward algorithm recovers the aggregate log probability. 6. Sketch out an algorithm for a variant of Viterbi that returns the top-n label se- quences. What is the time and space complexity of this algorithm? 7. Show how to compute the marginal probability Pr(ym−2 = k, ym = k′ | w1:M ), in terms of the forward and backward variables, and the potentials sn(yn, yn−1). 8. Suppose you receive a stream of text, where some of tokens have been replaced at random with NOISE. For example: • Source: I try all things, I achieve what I can • Message received: I try NOISE NOISE, I NOISE what I NOISE Assume you have access to a pre-trained bigram language model, which gives prob- abilities p(wm | wm−1). These probabilities can be assumed to be non-zero for all bigrams. Show how to use the Viterbi algorithm to recover the source by maximizing the bigram language model log-probability. Specifically, set the scores sm(ym, ym−1) so that the Viterbi algorithm selects a sequence of words that maximizes the bigram language model log-probability, while leaving the non-noise tokens intact. Your solution should not modify the logic of the Viterbi algorithm, it should only set the scores sm(ym, ym−1). 9. Let α(·) and β(·) indicate the forward an􏰀d backward variables as defined in sec- tion 7.5.3. Prove that αM+1(􏱓) = β0(♦) = y αm(y)βm(y), ∀m ∈ {1, 2, . . . , M}. 10. Consider an RNN tagging model with a tanh activation function on the hidden layer, and a hinge loss on the output. (The problem also works for the margin loss and negative log-likelihood.) Suppose you initialize all parameters to zero: this in- cludes the word embeddings that make up x, the transition matrix Θ, the output weights β, and the initial hidden state h0. a) Prove that for any data and for any gradient-based learning algorithm, all pa- rameters will be stuck at zero. b) Would a sigmoid activation function avoid this problem? Jacob Eisenstein. Draft of October 15, 2018. Chapter 8 Applications of sequence labeling Sequence labeling has applications throughout natural language processing. This chap- ter focuses on part-of-speech tagging, morpho-syntactic attribute tagging, named entity recognition, and tokenization. It also touches briefly on two applications to interactive settings: dialogue act recognition and the detection of code-switching points between languages. 8.1 Part-of-speech tagging The syntax of a language is the set of principles under which sequences of words are judged to be grammatically acceptable by fluent speakers. One of the most basic syntactic concepts is the part-of-speech (POS), which refers to the syntactic role of each word in a sentence. This concept was used informally in the previous chapter, and you may have some intuitions from your own study of English. For example, in the sentence We like vegetarian sandwiches, you may already know that we and sandwiches are nouns, like is a verb, and vegetarian is an adjective. These labels depend on the context in which the word appears: in she eats like a vegetarian, the word like is a preposition, and the word vegetarian is a noun. Parts-of-speech can help to disentangle or explain various linguistic problems. Recall Chomsky’s proposed distinction in chapter 6: (8.1) a. Colorless green ideas sleep furiously. b. * Ideas colorless furiously green sleep. One difference between these two examples is that the first contains part-of-speech tran- sitions that are typical in English: adjective to adjective, adjective to noun, noun to verb, and verb to adverb. The second example contains transitions that are unusual: noun to adjective and adjective to verb. The ambiguity in a headline like, 175 176 CHAPTER8. APPLICATIONSOFSEQUENCELABELING (8.2) Teacher Strikes Idle Children can also be explained in terms of parts of speech: in the interpretation that was likely intended, strikes is a noun and idle is a verb; in the alternative explanation, strikes is a verb and idle is an adjective. Part-of-speech tagging is often taken as a early step in a natural language processing pipeline. Indeed, parts-of-speech provide features that can be useful for many of the tasks that we will encounter later, such as parsing (chapter 10), coreference resolution (chapter 15), and relation extraction (chapter 17). 8.1.1 Parts-of-Speech The Universal Dependencies project (UD) is an effort to create syntactically-annotated corpora across many languages, using a single annotation standard (Nivre et al., 2016). As part of this effort, they have designed a part-of-speech tagset, which is meant to capture word classes across as many languages as possible.1 This section describes that inventory, giving rough definitions for each of tags, along with supporting examples. Part-of-speech tags are morphosyntactic, rather than semantic, categories. This means that they describe words in terms of how they pattern together and how they are inter- nally constructed (e.g., what suffixes and prefixes they include). For example, you may think of a noun as referring to objects or concepts, and verbs as referring to actions or events. But events can also be nouns: (8.3) . . . the howling of the shrieking storm. Here howling and shrieking are events, but grammatically they act as a noun and adjective respectively. The Universal Dependency part-of-speech tagset The UD tagset is broken up into three groups: open class tags, closed class tags, and “others.” Open class tags Nearly all languages contain nouns, verbs, adjectives, and adverbs.2 These are all open word classes, because new words can easily be added to them. The UD tagset includes two other tags that are open classes: proper nouns and interjections. • Nouns (UD tag: NOUN) tend to describe entities and concepts, e.g., 1The UD tagset builds on earlier work from Petrov et al. (2012), in which a set of twelve universal tags was identified by creating mappings from tagsets for individual languages. 2One prominent exception is Korean, which some linguists argue does not have adjectives Kim (2002). Jacob Eisenstein. Draft of October 15, 2018. 8.1. PART-OF-SPEECHTAGGING 177 (8.4) Toes are scarce among veteran blubber men. In English, nouns tend to follow determiners and adjectives, and can play the subject role in the sentence. They can be marked for the plural number by an -s suffix. • Propernouns(PROPN)aretokensinnames,whichuniquelyspecifyagivenentity, (8.5) “Moby Dick?” shouted Ahab. • Verbs (VERB), according to the UD guidelines, “typically signal events and ac- tions.” But they are also defined grammatically: they “can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in a clause.”3 (8.6) “Moby Dick?” shouted Ahab. (8.7) Shall we keep chasing this murderous fish? English verbs tend to come in between the subject and some number of direct ob- jects, depending on the verb. They can be marked for tense and aspect using suffixes such as -ed and -ing. (These suffixes are an example of inflectional morphology, which is discussed in more detail in subsection 9.1.4.) • Adjectives (ADJ) describe properties of entities, (8.8) a. Shall we keep chasing this murderous fish? b. Toes are scarce among veteran blubber men. In the second example, scarce is a predicative adjective, linked to the subject by the copula verb are. In contrast, murderous and veteran are attributive adjectives, modi- fying the noun phrase in which they are embedded. • Adverbs (ADV) describe properties of events, and may also modify adjectives or other adverbs: (8.9) a. b. c. It is not down on any map; true places never are. . . . treacherously hidden beneath the loveliest tints of azure Not drowned entirely, though. • Interjections (INTJ) are used in exclamations, e.g., (8.10) Aye aye! it was that accursed white whale that razed me. 3 http://universaldependencies.org/u/pos/VERB.html Under contract with MIT Press, shared under CC-BY-NC-ND license. 178 CHAPTER8. APPLICATIONSOFSEQUENCELABELING Closed class tags Closed word classes rarely receive new members. They are sometimes referred to as function words — as opposed to content words — as they have little lexical meaning of their own, but rather, help to organize the components of the sentence. • Adpositions(ADP)describetherelationshipbetweenacomplement(usuallyanoun phrase) and another unit in the sentence, typically a noun or verb phrase. (8.11) a. b. c. Toes are scarce among veteran blubber men. It is not down on any map. Give not thyself up then. As the examples show, English generally uses prepositions, which are adpositions that appear before their complement. (An exception is ago, as in, we met three days ago). Postpositions are used in other languages, such as Japanese and Turkish. • Auxiliary verbs (AUX) are a closed class of verbs that add information such as tense, aspect, person, and number. (8.12) a. b. c. d. e. Shall we keep chasing this murderous fish? What the white whale was to Ahab, has been hinted. Ahab must use tools. Meditation and water are wedded forever. Toes are scarce among veteran blubber men. The final example is a copula verb, which is also tagged as an auxiliary in the UD corpus. • Coordinating conjunctions (CCONJ) express relationships between two words or phrases, which play a parallel role: (8.13) Meditation and water are wedded forever. • Subordinating conjunctions (SCONJ) link two clauses, making one syntactically subordinate to the other: (8.14) It is the easiest thing in the world for a man to look as if he had a great secret in him. Note that • Pronouns (PRON) are words that substitute for nouns or noun phrases. (8.15) a. Be it what it will, I’ll go to it laughing. Jacob Eisenstein. Draft of October 15, 2018. 8.1. PART-OF-SPEECHTAGGING 179 b. I try all things, I achieve what I can. The example includes the personal pronouns I and it, as well as the relative pronoun what. Other pronouns include myself, somebody, and nothing. • Determiners(DET)provideadditionalinformationaboutthenounsornounphrases that they modify: (8.16) a. b. c. d. What the white whale was to Ahab, has been hinted. It is not down on any map. I try all things ... Shall we keep chasing this murderous fish? Determiners include articles (the), possessive determiners (their), demonstratives (this murderous fish), and quantifiers (any map). • Numerals(NUM)areaninfinitebutclosedclass,whichincludesintegers,fractions, and decimals, regardless of whether spelled out or written in numerical form. (8.17) a. How then can this one small heart beat. b. I am going to put him down for the three hundredth. • Particles(PART)areacatch-alloffunctionwordsthatcombinewithotherwordsor phrases, but do not meet the conditions of the other tags. In English, this includes the infinitival to, the possessive marker, and negation. (8.18) a. b. c. Better to sleep with a sober cannibal than a drunk Christian. So man’s insanity is heaven’s sense It is not down on any map As the second example shows, the possessive marker is not considered part of the same token as the word that it modifies, so that man’s is split into two tokens. (To- kenization is described in more detail in section 8.4.) A non-English example of a particle is the Japanese question marker ka:4 (8.19) Sensei desu ka Teacher is ? Is she a teacher? 4In this notation, the first line is the transliterated Japanese text, the second line is a token-to-token gloss, and the third line is the translation. Under contract with MIT Press, shared under CC-BY-NC-ND license. 180 CHAPTER8. APPLICATIONSOFSEQUENCELABELING Other The remaining UD tags include punctuation (PUN) and symbols (SYM). Punc- tuation is purely structural — e.g., commas, periods, colons — while symbols can carry content of their own. Examples of symbols include dollar and percentage symbols, math- ematical operators, emoticons, emojis, and internet addresses. A final catch-all tag is X, which is used for words that cannot be assigned another part-of-speech category. The X tag is also used in cases of code switching (between languages), described in section 8.5. Other tagsets Prior to the Universal Dependency treebank, part-of-speech tagging was performed us- ing language-specific tagsets. The dominant tagset for English was designed as part of the Penn Treebank (PTB), and it includes 45 tags — more than three times as many as the UD tagset. This granularity is reflected in distinctions between singular and plural nouns, verb tenses and aspects, possessive and non-possessive pronouns, comparative and superlative adjectives and adverbs (e.g., faster, fastest), and so on. The Brown corpus includes a tagset that is even more detailed, with 87 tags (Francis, 1964), including special tags for individual auxiliary verbs such as be, do, and have. Different languages make different distinctions, and so the PTB and Brown tagsets are not appropriate for a language such as Chinese, which does not mark the verb tense (Xia, 2000); nor for Spanish, which marks every combination of person and number in the verb ending; nor for German, which marks the case of each noun phrase. Each of these languages requires more detail than English in some areas of the tagset, and less in other areas. The strategy of the Universal Dependencies corpus is to design a coarse-grained tagset to be used across all languages, and then to additionally annotate language-specific morphosyntactic attributes, such as number, tense, and case. The attribute tagging task is described in more detail in section 8.2. Social media such as Twitter have been shown to require tagsets of their own (Gimpel et al., 2011). Such corpora contain some tokens that are not equivalent to anything en- countered in a typical written corpus: e.g., emoticons, URLs, and hashtags. Social media also includes dialectal words like gonna (‘going to’, e.g. We gonna be fine) and Ima (‘I’m going to’, e.g., Ima tell you one more time), which can be analyzed either as non-standard orthography (making tokenization impossible), or as lexical items in their own right. In either case, it is clear that existing tags like NOUN and VERB cannot handle cases like Ima, which combine aspects of the noun and verb. Gimpel et al. (2011) therefore propose a new set of tags to deal with these cases. 8.1.2 Accurate part-of-speech tagging Part-of-speech tagging is the problem of selecting the correct tag for each word in a sen- tence. Success is typically measured by accuracy on an annotated test set, which is simply the fraction of tokens that were tagged correctly. Jacob Eisenstein. Draft of October 15, 2018. 8.1. PART-OF-SPEECHTAGGING 181 Baselines A simple baseline for part-of-speech tagging is to choose the most common tag for each word. For example, in the Universal Dependencies treebank, the word talk appears 96 times, and 85 of those times it is labeled as a VERB: therefore, this baseline will always predict VERB for this word. For words that do not appear in the training corpus, the base- line simply guesses the most common tag overall, which is NOUN. In the Penn Treebank, this simple baseline obtains accuracy above 92%. A more rigorous evaluation is the accu- racy on out-of-vocabulary words, which are not seen in the training data. Tagging these words correctly requires attention to the context and the word’s internal structure. Contemporary approaches Conditional random fields and structured perceptron perform at or near the state-of-the- art for part-of-speech tagging in English. For example, (Collins, 2002) achieved 97.1% accuracy on the Penn Treebank, using a structured perceptron with the following base features (originally introduced by Ratnaparkhi (1996)): • current word, wm • previous words, wm−1 , wm−2 • next words, wm+1 , wm+2 • previous tag, ym−1 • previous two tags, (ym−1, ym−2) • for rare words: – first k characters, up to k = 4 – last k characters, up to k = 4 – whether wm contains a number, uppercase character, or hyphen. Similar results for the PTB data have been achieved using conditional random fields (CRFs; Toutanova et al., 2003). More recent work has demonstrated the power of neural sequence models, such as the long short-term memory (LSTM) (section 7.6). Plank et al. (2016) apply a CRF and a bidi- rectional LSTM to twenty-two languages in the UD corpus, achieving an average accuracy of 94.3% for the CRF, and 96.5% with the bi-LSTM. Their neural model employs three types of embeddings: fine-tuned word embeddings, which are updated during training; pre-trained word embeddings, which are never updated, but which help to tag out-of- vocabulary words; and character-based embeddings. The character-based embeddings are computed by running an LSTM on the individual characters in each word, thereby capturing common orthographic patterns such as prefixes, suffixes, and capitalization. Extensive evaluations show that these additional embeddings are crucial to their model’s success. Under contract with MIT Press, shared under CC-BY-NC-ND license. 182 word PTB tag The DT German JJ Expressionist NN movement NN was VBD destroyed VBN as IN a DT result NN . . CHAPTER 8. UD tag DET ADJ NOUN NOUN AUX VERB ADP DET NOUN PUNCT APPLICATIONS OF SEQUENCE LABELING UD attributes DEFINITE=DEF PRONTYPE=ART DEGREE=POS NUMBER=SING NUMBER=SING MOOD=IND NUMBER=SING PERSON=3 TENSE=PAST VERBFORM=FIN TENSE=PAST VERBFORM=PART VOICE=PASS DEFINITE=IND PRONTYPE=ART NUMBER=SING Figure 8.1: UD and PTB part-of-speech tags, and UD morphosyntactic attributes. Example selected from the UD 1.4 English corpus. 8.2 Morphosyntactic Attributes There is considerably more to say about a word than whether it is a noun or a verb: in En- glish, verbs are distinguish by features such tense and aspect, nouns by number, adjectives by degree, and so on. These features are language-specific: other languages distinguish other features, such as case (the role of the noun with respect to the action of the sen- tence, which is marked in languages such as Latin and German5) and evidentiality (the source of information for the speaker’s statement, which is marked in languages such as Turkish). In the UD corpora, these attributes are annotated as feature-value pairs for each token.6 An example is shown in Figure 8.1. The determiner the is marked with two attributes: PRONTYPE=ART, which indicates that it is an article (as opposed to another type of deter- 5Case is marked in English for some personal pronouns, e.g., She saw her, They saw them. 6The annotation and tagging of morphosyntactic attributes can be traced back to earlier work on Turk- ish (Oflazer and Kuruo ̈z, 1994) and Czech (Hajicˇ and Hladka ́, 1998). MULTEXT-East was an early multilin- gual corpus to include morphosyntactic attributes (Dimitrova et al., 1998). Jacob Eisenstein. Draft of October 15, 2018. 8.3. NAMEDENTITYRECOGNITION 183 miner or pronominal modifier), and DEFINITE=DEF, which indicates that it is a definite article (referring to a specific, known entity). The verbs are each marked with several attributes. The auxiliary verb was is third-person, singular, past tense, finite (conjugated), and indicative (describing an event that has happened or is currently happenings); the main verb destroyed is in participle form (so there is no additional person and number information), past tense, and passive voice. Some, but not all, of these distinctions are reflected in the PTB tags VBD (past-tense verb) and VBN (past participle). While there are thousands of papers on part-of-speech tagging, there is comparatively little work on automatically labeling morphosyntactic attributes. Faruqui et al. (2016) train a support vector machine classification model, using a minimal feature set that in- cludes the word itself, its prefixes and suffixes, and type-level information listing all pos- sible morphosyntactic attributes for each word and its neighbors. Mueller et al. (2013) use a conditional random field (CRF), in which the tag space consists of all observed com- binations of morphosyntactic attributes (e.g., the tag would be DEF+ART for the word the in Figure 8.1). This massive tag space is managed by decomposing the feature space over individual attributes, and pruning paths through the trellis. More recent work has employed bidirectional LSTM sequence models. For example, Pinter et al. (2017) train a bidirectional LSTM sequence model. The input layer and hidden vectors in the LSTM are shared across attributes, but each attribute has its own output layer, culminating in a softmax over all attribute values, e.g. yNUMBER ∈ {SING, PLURAL, . . .}. They find that t character-level information is crucial, especially when the amount of labeled data is lim- ited. Evaluation is performed by first computing recall and precision for each attribute. These scores can then be averaged at either the type or token level to obtain micro- or macro-F-MEASURE. Pinter et al. (2017) evaluate on 23 languages in the UD treebank, reporting a median micro-F -MEASURE of 0.95. Performance is strongly correlated with the size of the labeled dataset for each language, with a few outliers: for example, Chinese is particularly difficult, because although the dataset is relatively large (105 tokens in the UD 1.4 corpus), only 6% of tokens have any attributes, offering few useful labeled instances. 8.3 Named Entity Recognition A classical problem in information extraction is to recognize and extract mentions of named entities in text. In news documents, the core entity types are people, locations, and organizations; more recently, the task has been extended to include amounts of money, percentages, dates, and times. In item 8.20a (Figure 8.2), the named entities include: The U.S. Army, an organization; Atlanta, a location; and May 14, 1864, a date. Named en- tity recognition is also a key task in biomedical natural language processing, with entity types including proteins, DNA, RNA, and cell lines (e.g., Collier et al., 2000; Ohta et al., 2002). Figure 8.2 shows an example from the GENIA corpus of biomedical research ab- Under contract with MIT Press, shared under CC-BY-NC-ND license. 184 CHAPTER8. APPLICATIONSOFSEQUENCELABELING (8.20) a. The U.S. Army captured Atlanta on May 14 , 1864 B-ORG I-ORG I-ORG O B-LOC O B-DATE I-DATE I-DATE I-DATE b. Number of glucocorticoid receptors in lymphocytes and . . . O O B-PROTEIN I-PROTEIN O B-CELLTYPE O . . . Figure 8.2: BIO notation for named entity recognition. Example (8.20b) is drawn from the GENIA corpus of biomedical documents (Ohta et al., 2002). stracts. A standard approach to tagging named entity spans is to use discriminative sequence labeling methods such as conditional random fields. However, the named entity recogni- tion (NER) task would seem to be fundamentally different from sequence labeling tasks like part-of-speech tagging: rather than tagging each token, the goal in is to recover spans of tokens, such as The United States Army. This is accomplished by the BIO notation, shown in Figure 8.2. Each token at the beginning of a name span is labeled with a B- prefix; each token within a name span is la- beled with an I- prefix. These prefixes are followed by a tag for the entity type, e.g. B-LOC for the beginning of a location, and I-PROTEIN for the inside of a protein name. Tokens that are not parts of name spans are labeled as O. From this representation, the entity name spans can be recovered unambiguously. This tagging scheme is also advantageous for learning: tokens at the beginning of name spans may have different properties than tokens within the name, and the learner can exploit this. This insight can be taken even further, with special labels for the last tokens of a name span, and for unique tokens in name spans, such as Atlanta in the example in Figure 8.2. This is called BILOU notation, and it can yield improvements in supervised named entity recognition (Ratinov and Roth, 2009). Feature-based sequence labeling Named entity recognition was one of the first applica- tions of conditional random fields (McCallum and Li, 2003). The u􏰀se of Viterbi decoding restricts the feature function f(w,y) to be a sum of local features, m f(w,ym,ym−1,m), so that each feature can consider only local adjacent tags. Typical features include tag tran- sitions, word features for wm and its neighbors, character-level features for prefixes and suffixes, and “word shape” features for capitalization and other orthographic properties. As an example, base features for the word Army in the example in (8.20a) include: (CURR-WORD:Army, PREV-WORD:U.S., NEXT-WORD:captured, PREFIX-1:A-, PREFIX-2:Ar-, SUFFIX-1:-y, SUFFIX-2:-my, SHAPE:Xxxx) Features can also be obtained from a gazetteer, which is a list of known entity names. For example, the U.S. Social Security Administration provides a list of tens of thousands of Jacob Eisenstein. Draft of October 15, 2018. 8.4. TOKENIZATION 185 (1)日文章魚怎麼說? Japanese octopus how say How to say octopus in Japanese? (2)日文章魚怎麼說? Japan essay fish how say Figure 8.3: An example of tokenization ambiguity in Chinese (Sproat et al., 1996) given names — more than could be observed in any annotated corpus. Tokens or spans that match an entry in a gazetteer can receive special features; this provides a way to incorporate hand-crafted resources such as name lists in a learning-driven framework. Neural sequence labeling for NER Current research has emphasized neural sequence labeling, using similar LSTM models to those employed in part-of-speech tagging (Ham- merton, 2003; Huang et al., 2015; Lample et al., 2016). The bidirectional LSTM-CRF (Fig- ure 7.4 in section 7.6) does particularly well on this task, due to its ability to model tag- to-tag dependencies. However, Strubell et al. (2017) show that convolutional neural net- works can be equally accurate, with significant improvement in speed due to the effi- ciency of implementing ConvNets on graphics processing units (GPUs). The key inno- vation in this work was the use of dilated convolution, which is described in more detail in section 3.4. 8.4 Tokenization A basic problem for text analysis, first discussed in subsection 4.3.1, is to break the text into a sequence of discrete tokens. For alphabetic languages such as English, determin- istic scripts usually suffice to achieve accurate tokenization. However, in logographic writing systems such as Chinese script, words are typically composed of a small num- ber of characters, without intervening whitespace. The tokenization must be determined by the reader, with the potential for occasional ambiguity, as shown in Figure 8.3. One approach is to match character sequences against a known dictionary (e.g., Sproat et al., 1996), using additional statistical information about word frequency. However, no dic- tionary is completely comprehensive, and dictionary-based approaches can struggle with such out-of-vocabulary words. Chinese word segmentation has therefore been approached as a supervised sequence labeling problem. Xue et al. (2003) train a logistic regression classifier to make indepen- dent segmentation decisions while moving a sliding window across the document. A set of rules is then used to convert these individual classification decisions into an overall tokenization of the input. However, these individual decisions may be globally subop- Under contract with MIT Press, shared under CC-BY-NC-ND license. 186 CHAPTER8. APPLICATIONSOFSEQUENCELABELING timal, motivating a structure prediction approach. Peng et al. (2004) train a conditional random field to predict labels of START or NONSTART on each character. More recent work has employed neural network architectures. For example, Chen et al. (2015) use an LSTM-CRF architecture, as described in section 7.6: they construct a trellis, in which each tag is scored according to the hidden state of an LSTM, and tag-tag transitions are scored according to learned transition weights. The best-scoring segmentation is then computed by the Viterbi algorithm. 8.5 Code switching Multilingual speakers and writers do not restrict themselves to a single language. Code switching is the phenomenon of switching between languages in speech and text (Auer, 2013; Poplack, 1980). Written code switching has become more common in online social media, as in the following extract from the website of Canadian President Justin Trudeau:7 (8.21) Although everything written on this site est disponible en anglais is available in English and in French, my personal videos seront bilingues will be bilingual Accurately analyzing such texts requires first determining which languages are being used. Furthermore, quantitative analysis of code switching can provide insights on the languages themselves and their relative social positions. Code switching can be viewed as a sequence labeling problem, where the goal is to la- bel each token as a candidate switch point. In the example above, the words est, and, and seront would be labeled as switch points. Solorio and Liu (2008) detect English-Spanish switch points using a supervised classifier, with features that include the word, its part-of- speech in each language (according to a supervised part-of-speech tagger), and the prob- abilities of the word and part-of-speech in each language. Nguyen and Dogruo ̈z (2013) apply a conditional random field to the problem of detecting code switching between Turkish and Dutch. Code switching is a special case of the more general problem of word level language identification, which Barman et al. (2014) address in the context of trilingual code switch- ing between Bengali, English, and Hindi. They further observe an even more challenging phenomenon: intra-word code switching, such as the use of English suffixes with Bengali roots. They therefore mark each token as either (1) belonging to one of the three languages; (2) a mix of multiple languages; (3) “universal” (e.g., symbols, numbers, emoticons); or (4) undefined. 7 As quoted in http://blogues.lapresse.ca/lagace/2008/09/08/ justin-trudeau-really-parfait-bilingue/, accessed August 21, 2017. Jacob Eisenstein. Draft of October 15, 2018. 8.6. DIALOGUEACTS 187 Speaker Dialogue Act YES-NO-QUESTION ABANDONED YES-ANSWER STATEMENT DECLARATIVE-QUESTION YES-ANSWER STATEMENT APPRECIATION BACKCHANNEL Utterance So do you go college right now? Are yo- Yeah, It’s my last year [laughter]. You’re a, so you’re a senior now. Yeah, I’m working on my projects trying to graduate [laughter] Oh, good for you. Yeah. A A B B A B B A B 8.6 Figure 8.4: An example of dialogue act labeling (Stolcke et al., 2000) Dialogue acts The sequence labeling problems that we have discussed so far have been over sequences of word tokens or characters (in the case of tokenization). However, sequence labeling can also be performed over higher-level units, such as utterances. Dialogue acts are la- bels over utterances in a dialogue, corresponding roughly to the speaker’s intention — the utterance’s illocutionary force (Austin, 1962). For example, an utterance may state a proposition (it is not down on any map), pose a question (shall we keep chasing this murderous fish?), or provide a response (aye aye!). Stolcke et al. (2000) describe how a set of 42 dia- logue acts were annotated for the 1,155 conversations in the Switchboard corpus (Godfrey et al., 1992).8 An example is shown in Figure 8.4. The annotation is performed over UTTERANCES, with the possibility of multiple utterances per conversational turn (in cases such as inter- ruptions, an utterance may split over multiple turns). Some utterances are clauses (e.g., So do you go to college right now?), while others are single words (e.g., yeah). Stolcke et al. (2000) report that hidden Markov models (HMMs) achieve 96% accuracy on supervised utter- ance segmentation. The labels themselves reflect the conversational goals of the speaker: the utterance yeah functions as an answer in response to the question you’re a senior now, but in the final line of the excerpt, it is a backchannel (demonstrating comprehension). For task of dialogue act labeling, Stolcke et al. (2000) apply a hidden Markov model. The probability p(wm | ym) must generate the entire sequence of words in the utterance, and it is modeled as a trigram language model (section 6.1). Stolcke et al. (2000) also account for acoustic features, which capture the prosody of each utterance — for example, tonal and rhythmic properties of speech, which can be used to distinguish dialogue acts 8Dialogue act modeling is not restricted to speech; it is relevant in any interactive conversation. For example, Jeong et al. (2009) annotate a more limited set of speech acts in a corpus of emails and online forums. Under contract with MIT Press, shared under CC-BY-NC-ND license. 188 CHAPTER8. APPLICATIONSOFSEQUENCELABELING such as questions and answers. These features are handled with an additional emission distribution, p(am | ym), which is modeled with a probabilistic decision tree (Murphy, 2012). While acoustic features yield small improvements overall, they play an important role in distinguish questions from statements, and agreements from backchannels. Recurrent neural architectures for dialogue act labeling have been proposed by Kalch- brenner and Blunsom (2013) and Ji et al. (2016), with strong empirical results. Both models are recurrent at the utterance level, so that each complete utterance updates a hidden state. The recurrent-convolutional network of Kalchbrenner and Blunsom (2013) uses convolu- tion to obtain a representation of each individual utterance, while Ji et al. (2016) use a second level of recurrence, over individual words. This enables their method to also func- tion as a language model, giving probabilities over sequences of words in a document. Exercises 1. Using the Universal Dependencies part-of-speech tags, annotate the following sen- tences. You may examine the UD tagging guidelines. Tokenization is shown with whitespace. Don’t forget about punctuation. (8.22) a. b. It was that accursed white whale that razed me . c. Better to sleep with a sober cannibal , than a drunk Christian . d. Beitwhatitwill,I’llgotoitlaughing. 2. Select three short sentences from a recent news article, and annotate them for UD part-of-speech tags. Ask a friend to annotate the same three sentences without look- ing at your annotations. Compute the rate of agreement, using the Kappa metric defined in section 4.5.2. Then work together to resolve any disagreements. 3. Choose one of the following morphosyntactic attributes: MOOD, TENSE, VOICE. Re- search the definition of this attribute on the universal dependencies website, http: //universaldependencies.org/u/feat/index.html. Returning to the ex- amples in the first exercise, annotate all verbs for your chosen attribute. It may be helpful to consult examples from an English-language universal dependencies cor- pus, available at https://github.com/UniversalDependencies/UD_English-EWT/ tree/master. 4. Downloadadatasetannotatedforuniversaldependencies,suchastheEnglishTree- bank at https://github.com/UniversalDependencies/UD_English-EWT/ tree/master. This corpus is already segmented into training, development, and test data. Jacob Eisenstein. Draft of October 15, 2018. I try all things , I achieve what I can . 8.6. DIALOGUEACTS 189 a) First, train a logistic regression or SVM classifier using character suffixes: char- acter n-grams up to length 4. Compute the recall, precision, and F -MEASURE on the development data. b) Next, augment your classifier using the same character suffixes of the preced- ing and succeeding tokens. Again, evaluate your classifier on heldout data. c) Optionally, train a Viterbi-based sequence labeling model, using a toolkit such as CRFSuite (http://www.chokkan.org/software/crfsuite/) or your own Viterbi implementation. This is more likely to be helpful for attributes in which agreement is required between adjacent words. For example, many Romance languages require gender and number agreement for determiners, nouns, and adjectives. 5. Provide BIO-style annotation of the named entities (person, place, organization, date, or product) in the following expressions: (8.23) a. The third mate was Flask, a native of Tisbury, in Martha’s Vineyard. b. Its official Nintendo announced today that they Will release the Nin- tendo 3DS in north America march 27 (Ritter et al., 2011). c. Jessica Reif, a media analyst at Merrill Lynch & Co., said, “If they can get up and running with exclusive programming within six months, it doesn’t set the venture back that far.”9 6. Run the examples above through the online version of a named entity recogni- tion tagger, such as the Allen NLP system here: http://demo.allennlp.org/named- entity-recognition. Do the predicted tags match your annotations? 7. Build a whitespace tokenizer for English: a) Using the NLTK library, download the complete text to the novel Alice in Won- derland (Carroll, 1865). Hold out the final 1000 words as a test set. b) Label each alphanumeric character as a segmentation point, ym = 1 if m is the final character of a token. Label every other character as ym = 0. Then concatenate all the tokens in the training and test sets.Make sure that the num- ber of labels {ym}Mm=1 is identical to the number of characters {cm}Mm=1 in your concatenated datasets. c) Train a logistic regression classifier to predict ym, using the surrounding char- acters cm−5:m+5 as features. After training the classifier, run it on the test set, using the predicted segmentation points to re-tokenize the text. 9From the Message Understanding Conference (MUC-7) dataset (Chinchor and Robinson, 1997). Under contract with MIT Press, shared under CC-BY-NC-ND license. 190 CHAPTER8. APPLICATIONSOFSEQUENCELABELING d) e) Compute the per-character segmentation accuracy on the test set. You should be able to get at least 88% accuracy. Print out a sample of segmented text from the test set, e.g. Thereareno mice in the air , I ’ m afraid , but y oumight cat chabat , and that ’ svery like a mouse , youknow . But docatseat bats , I wonder ?’ 8. Perform the following extensions to your tokenizer in the previous problem. a) Train a conditional random field sequence labeler, by incorporating the tag bigrams (ym−1,ym) as additional features. You may use a structured predic- tion library such as CRFSuite, or you may want to implement Viterbi yourself. Compare the accuracy with your classification-based approach. b) Compute the token-level performance: treating the original tokenization as ground truth, compute the number of true positives (tokens that are in both the ground truth and predicted tokenization), false positives (tokens that are in the predicted tokenization but not the ground truth), and false negatives (to- kens that are in the ground truth but not the predicted tokenization). Compute the F-measure. Hint: to match predicted and ground truth tokens, add “anchors” for the start character of each token. The number of true positives is then the size of the intersection of the sets of predicted and ground truth tokens. c) Apply the same methodology in a more practical setting: tokenization of Chi- nese, which is written without whitespace. You can find annotated datasets at http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me. html. Jacob Eisenstein. Draft of October 15, 2018. Chapter 9 Formal language theory We have now seen methods for learning to label individual words, vectors of word counts, and sequences of words; we will soon proceed to more complex structural transforma- tions. Most of these techniques could apply to counts or sequences from any discrete vo- cabulary; there is nothing fundamentally linguistic about, say, a hidden Markov model. This raises a basic question that this text has not yet considered: what is a language? This chapter will take the perspective of formal language theory, in which a language is defined as a set of strings, each of which is a sequence of elements from a finite alphabet. For interesting languages, there are an infinite number of strings that are in the language, and an infinite number of strings that are not. For example: • thesetofalleven-lengthsequencesfromthealphabet{a,b},e.g.,{∅,aa,ab,ba,bb,aaaa,aaab,...}; • the set of all sequences from the alphabet {a, b} that contain aaa as a substring, e.g., {aaa, aaaa, baaa, aaab, . . .}; • the set of all sequences of English words (drawn from a finite dictionary) that con- tain at least one verb (a finite subset of the dictionary); • the PYTHON programming language. Formal language theory defines classes of languages and their computational prop- erties. Of particular interest is the computational complexity of solving the membership problem — determining whether a string is in a language. The chapter will focus on three classes of formal languages: regular, context-free, and “mildly” context-sensitive languages. A key insight of 20th century linguistics is that formal language theory can be usefully applied to natural languages such as English, by designing formal languages that cap- ture as many properties of the natural language as possible. For many such formalisms, a useful linguistic analysis comes as a byproduct of solving the membership problem. The 191 192 CHAPTER9. FORMALLANGUAGETHEORY membership problem can be generalized to the problems of scoring strings for their ac- ceptability (as in language modeling), and of transducing one string into another (as in translation). 9.1 Regular languages If you have written a regular expression, then you have defined a regular language: a regular language is any language that can be defined by a regular expression. Formally, a regular expression can include the following elements: • A literal character drawn from some finite alphabet Σ. • The empty string ε. • The concatenation of two regular expressions RS, where R and S are both regular expressions. The resulting expression accepts any string that can be decomposed x = yz, where y is accepted by R and z is accepted by S. • The alternation R | S, where R and S are both regular expressions. The resulting expression accepts a string x if it is accepted by R or it is accepted by S. • The Kleene star R∗, which accepts any string x that can be decomposed into a se- quence of strings which are all accepted by R. • Parenthesization (R), which is used to limit the scope of the concatenation, alterna- tion, and Kleene star operators. Here are some example regular expressions: • The set of all even length strings on the alphabet {a, b}: ((aa)|(ab)|(ba)|(bb))∗ • Thesetofallsequencesofthealphabet{a,b}thatcontainaaaasasubstring:(a|b)∗aaa(a|b)∗ • The set of all sequences of English words that contain at least one verb: W ∗ V W ∗ , where W is an alternation between all words in the dictionary, and V is an alterna- tion between all verbs (V ⊆ W ). This list does not include a regular expression for the Python programming language, because this language is not regular — there is no regular expression that can capture its syntax. We will discuss why towards the end of this section. Regular languages are closed under union, intersection, and concatenation. This means that if two languages L1 and L2 are regular, then so are the languages L1 ∪ L2, L1 ∩ L2, and the language of strings that can be decomposed as s = tu, with s ∈ L1 and t ∈ L2. Regular languages are also closed under negation: if L is regular, then so is the language L = { s ∈/ L } . Jacob Eisenstein. Draft of October 15, 2018. 9.1. REGULARLANGUAGES 193 ab start q0 b q1 Figure 9.1: State diagram for the finite state acceptor M1. 9.1.1 Finite state acceptors A regular expression defines a regular language, but does not give an algorithm for de- termining whether a string is in the language that it defines. Finite state automata are theoretical models of computation on regular languages, which involve transitions be- tween a finite number of states. The most basic type of finite state automaton is the finite state acceptor (FSA), which describes the computation involved in testing if a string is a member of a language. Formally, a finite state acceptor is a tuple M = (Q, Σ, q0, F, δ), consisting of: • a finite alphabet Σ of input symbols; • a finite set of states Q = {q0,q1,...,qn}; • a start state q0 ∈ Q; • asetoffinalstatesF ⊆Q; • a transition function δ : Q × (Σ ∪ {ε}) → 2Q. The transition function maps from a state and an input symbol (or empty string ε) to a set of possible resulting states. A path in M is a sequence of transitions, π = t1,t2,...,tN, where each ti traverses an arc in the transition function δ. The finite state acceptor M accepts a string ω if there is an accepting path, in which the initial transition t1 begins at the start state q0, the final transition tN terminates in a final state in Q, and the entire input ω is consumed. Example Consider the following FSA, M1. Σ ={a, b} Q ={q0, q1} F ={q1} [9.1] [9.2] [9.3] [9.4] function δ is written as a set of arcs: (q0,a) → q0 says that if the machine is in state Under contract with MIT Press, shared under CC-BY-NC-ND license. δ ={(q0, a) → q0, (q0, b) → q1, (q1, b) → q1}. This FSA defines a language over an alphabet of two symbols, a and b. The transition 194 CHAPTER9. FORMALLANGUAGETHEORY q0 and reads symbol a, it stays in q0. Figure 9.1 provides a graphical representation of M1. Because each pair of initial state and symbol has at most one resulting state, M1 is deterministic: each string ω induces at most one accepting path. Note that there are no transitions for the symbol a in state q1; if a is encountered in q1, then the acceptor is stuck, and the input string is rejected. What strings does M1 accept? The start state is q0, and we have to get to q1, since this is the only final state. Any number of a symbols can be consumed in q0, but a b symbol is required to transition to q1. Once there, any number of b symbols can be consumed, but an a symbol cannot. So the regular expression corresponding to the language defined by M1 is a∗bb∗. Computational properties of finite state acceptors The key computational question for finite state acceptors is: how fast can we determine whether a string is accepted? For determistic FSAs, this computation can be performed by Dijkstra’s algorithm, with time complexity O(V log V + E), where V is the number of vertices in the FSA, and E is the number of edges (Cormen et al., 2009). Non-deterministic FSAs (NFSAs) can include multiple transitions from a given symbol and state. Any NSFA can be converted into a deterministic FSA, but the resulting automaton may have a num- ber of states that is exponential in the number of size of the original NFSA (Mohri et al., 2002). 9.1.2 Morphology as a regular language Many words have internal structure, such as prefixes and suffixes that shape their mean- ing. The study of word-internal structure is the domain of morphology, of which there are two main types: • Derivational morphology describes the use of affixes to convert a word from one grammatical category to another (e.g., from the noun grace to the adjective graceful), or to change the meaning of the word (e.g., from grace to disgrace). • Inflectional morphology describes the addition of details such as gender, number, person, and tense (e.g., the -ed suffix for past tense in English). Morphology is a rich topic in linguistics, deserving of a course in its own right.1 The focus here will be on the use of finite state automata for morphological analysis. The 1A good starting point would be a chapter from a linguistics textbook (e.g., Akmajian et al., 2010; Bender, 2013). A key simplification in this chapter is the focus on affixes at the sole method of derivation and inflec- tion. English makes use of affixes, but also incorporates apophony, such as the inflection of foot to feet. Semitic languages like Arabic and Hebrew feature a template-based system of morphology, in which roots are triples of consonants (e.g., ktb), and words are created by adding vowels: kataba (Arabic: he wrote), kutub (books), maktab (desk). For more detail on morphology, see texts from Haspelmath and Sims (2013) and Lieber (2015). Jacob Eisenstein. Draft of October 15, 2018. 9.1. REGULARLANGUAGES 195 current section deals with derivational morphology; inflectional morphology is discussed in section 9.1.4. Suppose that we want to write a program that accepts only those words that are con- structed in accordance with the rules of English derivational morphology: (9.1) a. b. c. d. grace, graceful, gracefully, *gracelyful disgrace, *ungrace, disgraceful, disgracefully allure, *allureful, alluring, alluringly fairness, unfair, *disfair, fairly (Recall that the asterisk indicates that a linguistic example is judged unacceptable by flu- ent speakers of a language.) These examples cover only a tiny corner of English deriva- tional morphology, but a number of things stand out. The suffix -ful converts the nouns grace and disgrace into adjectives, and the suffix -ly converts adjectives into adverbs. These suffixes must be applied in the correct order, as shown by the unacceptability of *grace- lyful. The -ful suffix works for only some words, as shown by the use of alluring as the adjectival form of allure. Other changes are made with prefixes, such as the derivation of disgrace from grace, which roughly corresponds to a negation; however, fair is negated with the un- prefix instead. Finally, while the first three examples suggest that the direc- tion of derivation is noun → adjective → adverb, the example of fair suggests that the adjective can also be the base form, with the -ness suffix performing the conversion to a noun. Can we build a computer program that accepts only well-formed English words, and rejects all others? This might at first seem trivial to solve with a brute-force attack: simply make a dictionary of all valid English words. But such an approach fails to account for morphological productivity — the applicability of existing morphological rules to new words and names, such as Trump to Trumpy and Trumpkin, and Clinton to Clintonian and Clintonite. We need an approach that represents morphological rules explicitly, and for this we will try a finite state acceptor. The dictionary approach can be implemented as a finite state acceptor, with the vo- cabulary Σ equal to the vocabulary of English, and a transition from the start state to the accepting state for each word. But this would of course fail to generalize beyond the origi- nal vocabulary, and would not capture anything about the morphotactic rules that govern derivations from new words. The first step towards a more general approach is shown in Figure 9.2, which is the state diagram for a finite state acceptor in which the vocabulary consists of morphemes, which include stems (e.g., grace, allure) and affixes (e.g., dis-, -ing, -ly). This finite state acceptor consists of a set of paths leading away from the start state, with derivational affixes added along the path. Except for qneg, the states on these paths are all final, so the FSA will accept disgrace, disgraceful, and disgracefully, but not dis-. Under contract with MIT Press, shared under CC-BY-NC-ND license. 196 CHAPTER9. FORMALLANGUAGETHEORY grace dis- allure fair q -ful q -ly q N1 J1 A1 q grace q -ful q -ly q neg N2 J2 A2 start q0 -ing q -ly q N3 J3 A3 q q -ness q q J4 N4 A4 -ly Figure 9.2: A finite state acceptor for a fragment of English derivational morphology. Each path represents possible derivations from a single root form. This FSA can be minimized to the form shown in Figure 9.3, which makes the gen- erality of the finite state approach more apparent. For example, the transition from q0 to qJ2 canbemadetoacceptnotonlyfairbutanysingle-morpheme(monomorphemic)ad- jective that takes -ness and -ly as suffixes. In this way, the finite state acceptor can easily be extended: as new word stems are added to the vocabulary, their derived forms will be accepted automatically. Of course, this FSA would still need to be extended considerably to cover even this small fragment of English morphology. As shown by cases like music → musical, athlete → athletic, English includes several classes of nouns, each with its own rules for derivation. The FSAs shown in Figure 9.2 and 9.3 accept allureing, not alluring. This reflects a dis- tinction between morphology — the question of which morphemes to use, and in what order — and orthography — the question of how the morphemes are rendered in written language. Just as orthography requires dropping the e preceding the -ing suffix, phonol- ogy imposes a related set of constraints on how words are rendered in speech. As we will see soon, these issues can be handled by finite state!transducers, which are finite state automata that take inputs and produce outputs. 9.1.3 Weighted finite state acceptors According to the FSA treatment of morphology, every word is either in or out of the lan- guage, with no wiggle room. Perhaps you agree that musicky and fishful are not valid English words; but if forced to choose, you probably find a fishful stew or a musicky trib- ute preferable to behaving disgracelyful. Rather than asking whether a word is acceptable, we might like to ask how acceptable it is. Aronoff (1976, page 36) puts it another way: Jacob Eisenstein. Draft of October 15, 2018. 9.1. REGULARLANGUAGES 197 q grace q -ful q -ly q start q0 allure qN2 fair q -ly grace -ing q neg N1 J1 A1 dis- J2 -ness N3 Figure 9.3: Minimization of the finite state acceptor shown in Figure 9.2. “Though many things are possible in morphology, some are more possible than others.” But finite state acceptors give no way to express preferences among technically valid choices. Weighted finite state acceptors (WFSAs) are generalizations of FSAs, in which each accepting path is assigned a score, computed from the transitions, the initial state, and the final state. Formally, a weighted finite state acceptor M = (Q, Σ, λ, ρ, δ) consists of: • a finite set of states Q = {q0,q1,...,qn}; • a finite alphabet Σ of input symbols; • an initial weight function, λ : Q → R; • afinalweightfunctionρ:Q→R; • atransitionfunctionδ:Q×Σ×Q→R. WFSAs depart from the FSA formalism in three ways: every state can be an initial state, with score λ(q); every state can be an accepting state, with score ρ(q); transitions are possible between any pair of states on any input, with a score δ(qi, ω, qj ). Nonetheless, FSAs can be viewed as a special case: for any FSA M we can build an equivalent WFSA bysettingλ(q) = ∞forallq ̸= q0,ρ(q) = ∞forallq ∈/ F,andδ(qi,ω,qj) = ∞forall transitions {(q1, ω) → q2} that are not permitted by the transition function of M. The total score for any path π = t1,t2,...,tN is equal to the sum of these scores, 􏰁N n A shortest-path algorithm is used to find the minimum-cost path through a WFSA for string ω, with time complexity O(E + V log V ), where E is the number of edges and V is the number of vertices (Cormen et al., 2009).2 2Shortest-path algorithms find the path with the minimum cost. In many cases, the path weights are log Under contract with MIT Press, shared under CC-BY-NC-ND license. d(π) = λ(from-state(t1)) + δ(tn) + ρ(to-state(tN )). [9.5] 198 CHAPTER9. FORMALLANGUAGETHEORY N-gram language models as WFSAs In n-gram language models (see section 6.1), the probability of a sequence of tokens w1,w2,...,wM is modeled as, 􏰙M m=1 The log probability under an n-gram language model can be modeled in a WFSA. First consider a unigram language model. We need only a single state q0, with transition scores δ(q0, ω, q0) = log p1(ω). The initial and final scores can be set to zero. Then the path score for w1,w2,...,wM is equal to, 􏰁M 􏰁M 0 + δ(q0, wm, q0) + 0 = log p1(wm). [9.7] mm For an n-gram language model with n > 1, we need probabilities that condition on the past history. For example, in a bigram language model, the transition weights must represent log p2(wm | wm−1). The transition scoring function must somehow “remember” the previous word or words. This can be done by adding more states: to model the bigram probability p2(wm | wm−1), we need a state for every possible wm−1 — a total of V states. The construction indexes each state qi by a context event wm−1 = i. The weights are then assigned as follows:
δ(qi,ω,qj) =􏰅logPr(wm = j | wm−1 = i), ω = j −∞, ω ̸= j
λ(qi) =logPr(w1 = i | w0 = 􏱑) ρ(qi) =logPr(wM+1 = 􏱒 | wM = i).
The transition function is designed to ensure that the context is recorded accurately: we can move to state j on input ω only if ω = j; otherwise, transitioning to state j is forbidden by the weight of −∞. The initial weight function λ(qi) is the log probability of receiving i as the first token, and the final weight function ρ(qi) is the log probability of receiving an “end-of-string” token after observing wM = i.
*Semiring weighted finite state acceptors
The n-gram language model WFSA is deterministic: each input has exactly one accepting path, for which the WFSA computes a score. In non-deterministic WFSAs, a given input
probabilities, so we want the path with the maximum score, which can be accomplished by making each local score into a negative log-probability.
Jacob Eisenstein. Draft of October 15, 2018.
p(w1,…,wM) ≈
pn(wm | wm−1,…,wm−n+1). [9.6]

9.1. REGULARLANGUAGES 199
may have multiple accepting paths. In some applications, the score for the input is ag- gregated across all such paths. Such aggregate scores can be computed by generalizing WFSAs with semiring notation, first introduced in subsection 7.7.3.
Let d(π) represent the total score for path π = t1, t2, . . . , tN , which is computed as,
d(π) = λ(from-state(t1)) ⊗ δ(t1) ⊗ δ(t2) ⊗ . . . ⊗ δ(tN ) ⊗ ρ(to-state(tN )). [9.8]
This is a generalization of Equation 9.5 to semiring notation, using the semiring multipli- cation operator ⊗ in place of addition.
Now let s(ω) represent the total score for all paths Π(ω) that consume input ω,
s(ω) = 􏱍 d(π). [9.9]
π∈Π(ω)
Here, semiring addition (⊕) is used to combine the scores of multiple paths.
The generalization to semirings covers a number of useful special cases. In the log- probability semiring, multiplication is defined as log p(x) ⊗ log p(y) = log p(x) + log p(y), and addition is defined as log p(x) ⊕ log p(y) = log(p(x) + p(y)). Thus, s(ω) represents the log-probability of accepting input ω, marginalizing over all paths π ∈ Π(ω). In the boolean semiring, the ⊗ operator is logical conjunction, and the ⊕ operator is logical disjunction. This reduces to the special case of unweighted finite state acceptors, where the score s(ω) is a boolean indicating whether there exists any accepting path for ω. In the tropical semiring, the ⊕ operator is a maximum, so the resulting score is the score of the best-scoring path through the WFSA. The OPENFST toolkit uses semirings and poly- morphism to implement general algorithms for weighted finite state automata (Allauzen et al., 2007).
*Interpolated n-gram language models
Recall from subsection 6.2.3 that an interpolated n-gram language model combines the probabilities from multiple n-gram models. For example, an interpolated bigram lan- guage model computes the probability,
pˆ(wm | wm−1) = λ1p1(wm) + λ2p2(wm | wm−1), [9.10]
with pˆ indicating the interpolated probability, p2 indicating the bigram probability, and p1 indicating the unigram probability. Setting λ2 = (1 − λ1) ensures that the probabilities sum to one.
Interpolated bigram language models can be implemented using a non-deterministic WFSA (Knight and May, 2009). The basic idea is shown in Figure 9.4. In an interpolated bigram language model, there is one state for each element in the vocabulary — in this
Under contract with MIT Press, shared under CC-BY-NC-ND license.

200
CHAPTER9. FORMALLANGUAGETHEORY
a : λ2p2(a | a) qA
b : λ2p2(b | a) ε : λ1
a : p1(a)
start qU
a : λ2p2(a | b)
ε2 : λ1
b : p1(b)
qB
b : λ2p2(b | b)
Figure 9.4: WFSA implementing an interpolated bigram/unigram language model, on the alphabet Σ = {a, b}. For simplicity, the WFSA is contrained to force the first token to be generated from the unigram model, and does not model the emission of the end-of- sequence token.
case, the states qA and qB — which are capture the contextual conditioning in the bigram probabilities. To model unigram probabilities, there is an additional state qU , which “for- gets” the context. Transitions out of qU involve unigram probabilities, p1(a) and p2(b); transitions into qU emit the empty symbol ε, and have probability λ1, reflecting the inter- polation weight for the unigram model. The interpolation weight for the bigram model is included in the weight of the transition qA → qB .
The epsilon transitions into qU make this WFSA non-deterministic. Consider the score for the sequence (a, b, b). The initial state is qU , so the symbol a is generated with score p1(a)3 Next, we can generate b from the unigram model by taking the transition qA → qB, with score λ2p2(b | a). Alternatively, we can take a transition back to qU with score λ1, and then emit b from the unigram model with score p1(b). To generate the final b token, we face the same choice: emit it directly from the self-transition to qB, or transition to qU first.
The total score for the sequence (a, b, b) is the semiring sum over all accepting paths,
s(a, b, b) = 􏰗p1(a) ⊗ λ2p2(b | a) ⊗ λ2p(b | b)􏰘 ⊕􏰗p1(a)⊗λ1 ⊗p1(b)⊗λ2p(b|b)􏰘
⊕ 􏰗p1(a) ⊗ λ2p2(b | a) ⊗ p1(b) ⊗ p1(b)􏰘
⊕ 􏰗p1(a) ⊗ λ1 ⊗ p1(b) ⊗ p1(b) ⊗ p1(b)􏰘 . [9.11]
Each line in Equation 9.11 represents the probability of a specific path through the WFSA. In the probability semiring, ⊗ is multiplication, so that each path is the product of each
3We could model the sequence-initial bigram probability p2(a | 􏱑), but for simplicity the WFSA does not admit this possibility, which would require another state.
Jacob Eisenstein. Draft of October 15, 2018.

9.1. REGULARLANGUAGES 201
transition weight, which are themselves probabilities. The ⊕ operator is addition, so that the total score is the sum of the scores (probabilities) for each path. This corresponds to the probability under the interpolated bigram language model.
9.1.4 Finite state transducers
Finite state acceptors can determine whether a string is in a regular language, and weighted finite state acceptors can compute a score for every string over a given alphabet. Finite state transducers (FSTs) extend the formalism further, by adding an output symbol to each transition. Formally, a finite state transducer is a tuple T = (Q, Σ, Ω, λ, ρ, δ), with Ω repre- senting an output vocabulary and the transition function δ : Q × (Σ ∪ ε) × (Ω ∪ ε) × Q → R mapping from states, input symbols, and output symbols to states. The remaining ele- ments (Q, Σ, λ, ρ) are identical to their definition in weighted finite state acceptors (sub- section 9.1.3). Thus, each path through the FST T transduces the input string into an output.
String edit distance
The edit distance between two strings s and t is a measure of how many operations are required to transform one string into another. There are several ways to compute edit distance, but one of the most popular is the Levenshtein edit distance, which counts the minimum number of insertions, deletions, and substitutions. This can be computed by a one-state weighted finite state transducer, in which the input and output alphabets are identical. For simplicity, consider the alphabet Σ = Ω = {a, b}. The edit distance can be computed by a one-state transducer with the following transitions,
δ(q,a,a,q) = δ(q,b,b,q) = 0 δ(q,a,b,q) = δ(q,b,a,q) = 1 δ(q,a,ε,q) = δ(q,b,ε,q) = 1
δ(q,ε,a,q) = δ(q,ε,b,q) = 1. The state diagram is shown in Figure 9.5.
[9.12] [9.13] [9.14] [9.15]
For a given string pair, there are multiple paths through the transducer: the best- scoring path from dessert to desert involves a single deletion, for a total score of 1; the worst-scoring path involves seven deletions and six additions, for a score of 13.
The Porter stemmer
The Porter (1980) stemming algorithm is a “lexicon-free” algorithm for stripping suffixes from English words, using a sequence of character-level rules. Each rule can be described
Under contract with MIT Press, shared under CC-BY-NC-ND license.

202
CHAPTER9. FORMALLANGUAGETHEORY
start
a/a, b/b : 0
a/ε, b/ε : 1
q
ε/a, ε/b : 1
a/b, b/a : 1
Figure 9.5: State diagram for the Levenshtein edit distance finite state transducer. The
label x/y : c indicates a cost of c for a transition with input x and output y. by an unweighted finite state transducer. The first rule is:
-sses → -ss -ies → -i -ss → -ss -s→ε
e.g., dresses → dress e.g., parties → parti e.g., dress → dress e.g., cats → cat
[9.16] [9.17] [9.18] [9.19]
The final two lines appear to conflict; they are meant to be interpreted as an instruction to remove a terminal -s unless it is part of an -ss ending. A state diagram to handle just these final two lines is shown in Figure 9.6. Make sure you understand how this finite state transducer handles cats, steps, bass, and basses.
Inflectional morphology
In inflectional morphology, word lemmas are modified to add grammatical information such as tense, number, and case. For example, many English nouns are pluralized by the suffix -s, and many verbs are converted to past tense by the suffix -ed. English’s inflectional morphology is considerably simpler than many of the world’s languages. For example, Romance languages (derived from Latin) feature complex systems of verb suffixes which must agree with the person and number of the verb, as shown in Table 9.1.
The task of morphological analysis is to read a form like canto, and output an analysis
like CANTAR+VERB+PRESIND+1P+SING, where +PRESIND describes the tense as present indicative, +1P indicates the first-person, and +SING indicates the singular number. The
task of morphological generation is the reverse, going from CANTAR+VERB+PRESIND+1P+SING to canto. Finite state transducers are an attractive solution, because they can solve both problems with a single model (Beesley and Karttunen, 2003). As an example, Figure 9.7 shows a fragment of a finite state transducer for Spanish inflectional morphology. The
Jacob Eisenstein. Draft of October 15, 2018.

9.1. REGULARLANGUAGES
203
start
¬s/¬s
q s/ε q
1
ε/a q3
2
a/s b/s
ε/b
q
…
Figure 9.6: State diagram for final two lines of step 1a of the Porter stemming diagram. States q3 and q4 “remember ” the observations a and b respectively; the ellipsis . . . repre- sents additional states for each symbol in the input alphabet. The notation ¬s/¬s is not part of the FST formalism; it is a shorthand to indicate a set of self-transition arcs for every input/output symbol except s.
4
infinitive
yo (1st singular)
tu (2nd singular)
e ́l, ella, usted (3rd singular) nosotros (1st plural)
vosotros (2nd plural, informal) ellos, ellas (3rd plural); ustedes (2nd plural)
cantar (to sing)
canto cantas canta cantamos canta ́is
cantan
comer (to eat)
como comes come comemos come ́is
comen
vivir (to live)
vivo vives vive vivimos viv ́ıs
viven
Table 9.1: Spanish verb inflections for the present indicative tense. Each row represents a person and number, and each column is a regular example from a class of verbs, as indicated by the ending of the infinitive form.
input vocabulary Σ corresponds to the set of letters used in Spanish spelling, and the out- put vocabulary Ω corresponds to these same letters, plus the vocabulary of morphological features (e.g., +SING, +VERB). In Figure 9.7, there are two paths that take canto as input, corresponding to the verb and noun meanings; the choice between these paths could be guided by a part-of-speech tagger. By inversion, the inputs and outputs for each tran- sition are switched, resulting in a finite state generator, capable of producing the correct surface form for any morphological analysis.
Finite state morphological analyzers and other unweighted transducers can be de- signed by hand. The designer’s goal is to avoid overgeneration — accepting strings or making transductions that are not valid in the language — as well as undergeneration
Under contract with MIT Press, shared under CC-BY-NC-ND license.

204
CHAPTER9. FORMALLANGUAGETHEORY
o/o
ε/+Noun ε/+Masc
ε/r ε/+Verb
ε/+Sing
o/+PresInd ε/+1p
a/+PresInd ε/+3p
start
c/c a/a n/n t/t
ε/a
ε/+Sing
ε/+Sing
Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun, meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accept- ing path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings).
— failing to accept strings or transductions that are valid. For example, a pluralization transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the trans- ducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer would then be said to overgenerate. If a transducer accepts foot/foots but not foot/feet, then it simultaneously overgenerates and undergenerates.
Finite state composition
Designing finite state transducers to capture the full range of morphological phenomena in any real language is a huge task. Modularization is a classic computer science approach for this situation: decompose a large and unwieldly problem into a set of subproblems, each of which will hopefully have a concise solution. Finite state automata can be mod- ularized through composition: feeding the output of one transducer T1 as the input to another transducer T2, written T2 ◦T1. Formally, if there exists some y such that (x, y) ∈ T1 (meaning that T1 produces output y on input x), and (y, z) ∈ T2, then (x, z) ∈ (T2 ◦ T1). Because finite state transducers are closed under composition, there is guaranteed to be a single finite state transducer that T3 = T2 ◦ T1, which can be constructed as a machine with one state for each pair of states in T1 and T2 (Mohri et al., 2002).
Example: Morphology and orthography In English morphology, the suffix -ed is added to signal the past tense for many verbs: cook→cooked, want→wanted, etc. However, English orthography dictates that this process cannot produce a spelling with consecutive e’s, so that bake→baked, not bakeed. A modular solution is to build separate transducers for mor- phology and orthography. The morphological transducer TM transduces from bake+PAST to bake+ed, with the + symbol indicating a segment boundary. The input alphabet of TM includes the lexicon of words and the set of morphological features; the output alphabet includes the characters a-z and the + boundary marker. Next, an orthographic transducer TO is responsible for the transductions cook+ed → cooked, and bake+ed → baked. The input alphabet of TO must be the same as the output alphabet for TM , and the output alphabet
Jacob Eisenstein. Draft of October 15, 2018.

9.1. REGULARLANGUAGES 205 is simply the characters a-z. The composed transducer (TO ◦ TM ) then transduces from
bake+PAST to the spelling baked. The design of TO is left as an exercise.
Example: Hidden Markov models Hidden Markov models (chapter 7) can be viewed as weighted finite state transducers, and they can be constructed by transduction. Recall that a hidden Markov model defines a joint probability over words and tags, p(w, y), which can be computed as a path through a trellis structure. This trellis is itself a weighted finite state acceptor, with edges between all adjacent nodes qm−1,i → qm,j on input Ym = j. The edge weights are log-probabilities,
δ(qm−1,i,Ym = j,qm,j) =logp(wm,Ym = j | Ym−i = j) [9.20] =logp(wm | Ym = j)+logPr(Ym = j | Ym−1 = i). [9.21]
Because there is only one possible transition for each tag Ym, this WFSA is deterministic. The score for any tag sequence {ym}Mm=1 is the sum of these log-probabilities, correspond- ing to the total log probability log p(w, y). Furthermore, the trellis can be constructed by the composition of simpler FSTs.
• First, construct a “transition” transducer to represent a bigram probability model over tag sequences, TT . This transducer is almost identical to the n-gram language model acceptor in section 9.1.3: there is one state for each tag, and the edge weights equal to the transition log-probabilities, δ(qi, j, j, qj ) = log Pr(Ym = j | Ym−1 = i). NotethatTT isatransducer,withidenticalinputandoutputateacharc;thismakes itpossibletocomposeTT withothertransducers.
• Next,constructan“emission”transducertorepresenttheprobabilityofwordsgiven tags, TE. This transducer has only a single state, with arcs for each word/tag pair, δ(q0, i, j, q0) = log Pr(Wm = j | Ym = i). The input vocabulary is the set of all tags, and the output vocabulary is the set of all words.
• The composition TE ◦ TT is a finite state transducer with one state per tag, as shown in Figure 9.8. Each state has V × K outgoing edges, representing transitions to each of the K other states, with outputs for each of the V words in the vocabulary. The weights for these edges are equal to,
δ(qi,Ym = j,wm,qj) = logp(wm,Ym = j | Ym−1 = i). [9.22]
• ThetrellisisastructurewithM×Knodes,foreachoftheMwordstobetaggedand each of the K tags in the tagset. It can be built by composition of (TE ◦ TT ) against an unweighted chain FSA MA(w) that is specially constructed to accept only a given input w1, w2, . . . , wM , shown in Figure 9.9. The trellis for input w is built from the composition MA(w) ◦ (TE ◦ TT ). Composing with the unweighted MA(w) does not affect the edge weights from (TE ◦ TT ), but it selects the subset of paths that generate the word sequence w.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

206 CHAPTER9. FORMALLANGUAGETHEORY
N/aardvark
N/abacus
N/…
start start
N
V
end
V/…
V/abacus
V/aardvark
Figure 9.8: Finite state transducer for hidden Markov models, with a small tagset of nouns and verbs. For each pair of tags (including self-loops), there is an edge for every word in the vocabulary. For simplicity, input and output are only shown for the edges from the start state. Weights are also omitted from the diagram; for each edge from qi to qj, the weight is equal to log p(wm, Ym = j | Ym−1 = i), except for edges to the end state, which are equal to logPr(Ym = 􏱓 | Ym−1 = i).
start
They
can fish
Figure 9.9: Chain finite state acceptor for the input They can fish. 9.1.5 *Learning weighted finite state automata
In generative models such as n-gram language models and hidden Markov models, the edge weights correspond to log probabilities, which can be obtained from relative fre- quency estimation. However, in other cases, we wish to learn the edge weights from in- put/output pairs. This is difficult in non-deterministic finite state automata, because we do not observe the specific arcs that are traversed in accepting the input, or in transducing from input to output. The path through the automaton is a latent variable.
Chapter 5 presented one method for learning with latent variables: expectation max- imization (EM). This involves computing a distribution q(·) over the latent variable, and iterating between updates to this distribution and updates to the parameters — in this case, the arc weights. The forward-backward algorithm (section 7.5.3) describes a dy- namic program for computing a distribution over arcs in the trellis structure of a hid-
Jacob Eisenstein. Draft of October 15, 2018.

9.2. CONTEXT-FREELANGUAGES 207
den Markov model, but this is a special case of the more general problem for finite state automata. Eisner (2002) describes an expectation semiring, which enables the expected number of transitions across each arc to be computed through a semiring shortest-path algorithm. Alternative approaches for generative models include Markov Chain Monte Carlo (Chiang et al., 2010) and spectral learning (Balle et al., 2011).
Further afield, we can take a perceptron-style approach, with each arc corresponding to a feature. The classic perceptron update would update the weights by subtracting the difference between the feature vector corresponding to the predicted path and the feature vector corresponding to the correct path. Since the path is not observed, we resort to a latent variable perceptron. The model is described formally in section 12.4, but the basic idea is to compute an update from the difference between the features from the predicted path and the features for the best-scoring path that generates the correct output.
9.2 Context-free languages
Beyond the class of regular languages lie the context-free languages. An example of a language that is context-free but not finite state is the set of arithmetic expressions with balanced parentheses. Intuitively, to accept only strings in this language, an FSA would have to “count” the number of left parentheses, and make sure that they are balanced against the number of right parentheses. An arithmetic expression can be arbitrarily long, yet by definition an FSA has a finite number of states. Thus, for any FSA, there will be a string that with too many parentheses to count. More formally, the pumping lemma is a proof technique for showing that languages are not regular. It is typically demonstrated for the simpler case anbn, the language of strings containing a sequence of a’s, and then an equal-length sequence of b’s.4
There are at least two arguments for the relevance of non-regular formal languages to linguistics. First, there are natural language phenomena that are argued to be iso- morphic to anbn. For English, the classic example is center embedding, shown in Fig- ure 9.10. The initial expression the dog specifies a single dog. Embedding this expression into the cat chased specifies a particular cat — the one chased by the dog. This cat can then be embedded again to specify a goat, in the less felicitous but arguably grammatical expression, the goat the cat the dog chased kissed, which refers to the goat who was kissed by the cat which was chased by the dog. Chomsky (1957) argues that to be grammatical, a center-embedded construction must be balanced: if it contains n noun phrases (e.g., the cat), they must be followed by exactly n − 1 verbs. An FSA that could recognize such ex- pressions would also be capable of recognizing the language anbn. Because we can prove that no FSA exists for anbn, no FSA can exist for center embedded constructions either. En-
4 Details of the proof can be found in an introductory computer science theory textbook (e.g., Sipser, 2012). Under contract with MIT Press, shared under CC-BY-NC-ND license.

208
CHAPTER9. FORMALLANGUAGETHEORY
the cat the goat the cat
the dog
the dog chased
the dog chased kissed …
Figure 9.10: Three levels of center embedding
glish includes center embedding, and so the argument goes, English grammar as a whole cannot be regular.5
A more practical argument for moving beyond regular languages is modularity. Many linguistic phenomena — especially in syntax — involve constraints that apply at long distance. Consider the problem of determiner-noun number agreement in English: we can say the coffee and these coffees, but not *these coffee. By itself, this is easy enough to model in an FSA. However, fairly complex modifying expressions can be inserted between the determiner and the noun:
(9.2) a. the burnt coffee
b. the badly-ground coffee
c. the burnt and badly-ground Italian coffee
d. these burnt and badly-ground Italian coffees
e. * these burnt and badly-ground Italian coffee
Again, an FSA can be designed to accept modifying expressions such as burnt and badly- ground Italian. Let’s call this FSA FM . To reject the final example, a finite state acceptor must somehow “remember” that the determiner was plural when it reaches the noun cof- fee at the end of the expression. The only way to do this is to make two identical copies of FM : one for singular determiners, and one for plurals. While this is possible in the finite state framework, it is inconvenient — especially in languages where more than one attribute of the noun is marked by the determiner. Context-free languages facilitate mod- ularity across such long-range dependencies.
9.2.1 Context-free grammars
Context-free languages are specified by context-free grammars (CFGs), which are tuples
(N, Σ, R, S) consisting of:
5The claim that arbitrarily deep center-embedded expressions are grammatical has drawn skepticism. Corpus evidence shows that embeddings of depth greater than two are exceedingly rare (Karlsson, 2007), and that embeddings of depth greater than three are completely unattested. If center-embedding is capped at some finite depth, then it is regular.
Jacob Eisenstein. Draft of October 15, 2018.

9.2. CONTEXT-FREELANGUAGES 209
S→SOP S|NUM OP →+ | − | × | ÷
NUM →NUM DIGIT | DIGIT DIGIT →0|1|2|…|9
Figure 9.11: A context-free grammar for arithmetic expressions
• a finite set of non-terminals N ;
• a finite alphabet Σ of terminal symbols;
• asetofproductionrulesR,eachoftheformA→β,whereA∈Nandβ∈(Σ∪N)∗; • a designated start symbol S.
In the production rule A → β, the left-hand side (LHS) A must be a non-terminal; the right-hand side (RHS) can be a sequence of terminals or non-terminals, {n,σ}∗,n ∈ N,σ ∈ Σ. A non-terminal can appear on the left-hand side of many production rules. A non-terminal can appear on both the left-hand side and the right-hand side; this is a recursive production, and is analogous to self-loops in finite state automata. The name “context-free” is based on the property that the production rule depends only on the LHS, and not on its ancestors or neighbors; this is analogous to Markov property of finite state automata, in which the behavior at each step depends only on the current state, on not on the path by which that state was reached.
A derivation τ is a sequence of steps from the start symbol S to a surface string w ∈ Σ∗, which is the yield of the derivation. A string w is in a context-free language if there is some derivation from S yielding w. Parsing is the problem of finding a derivation for a string in a grammar. Algorithms for parsing are described in chapter 10.
Like regular expressions, context-free grammars define the language but not the com- putation necessary to recognize it. The context-free analogues to finite state acceptors are pushdown automata, a theoretical model of computation in which input symbols can be pushed onto a stack with potentially infinite depth. For more details, see Sipser (2012).
Example
Figure 9.11 shows a context-free grammar for arithmetic expressions such as 1 + 2 ÷ 3 − 4. In this grammar, the terminal symbols include the digits {1, 2, …, 9} and the op- erators {+, −, ×, ÷}. The rules include the | symbol, a notational convenience that makes it possible to specify multiple right-hand sides on a single line: the statement A → x | y
Under contract with MIT Press, shared under CC-BY-NC-ND license.

210
CHAPTER9. FORMALLANGUAGETHEORY
SSS
Num S OpS SOp S
Digit S Op S − Num 4 Num + Num Digit
Num + S Op S Digit Num − Num 1 Digit Digit
Digit Digit 3
12 23
Figure 9.12: Some example derivations from the arithmetic grammar in Figure 9.11
defines two productions, A → x and A → y. This grammar is recursive: the non-termals S and NUM can produce themselves.
Derivations are typically shown as trees, with production rules applied from the top to the bottom. The tree on the left in Figure 9.12 describes the derivation of a single digit, through the sequence of productions S → NUM → DIGIT → 4 (these are all unary pro- ductions, because the right-hand side contains a single element). The other two trees in Figure 9.12 show alternative derivations of the string 1 + 2 − 3. The existence of multiple derivations for a string indicates that the grammar is ambiguous.
Context-free derivations can also be written out according to the pre-order tree traver- sal.6 For the two derivations of 1 + 2 – 3 in Figure 9.12, the notation is:
(S (S (S (Num (Digit 1))) (Op +) (S (Num (Digit 2)))) (Op – ) (S (Num (Digit 3)))) [9.23] (S (S (Num (Digit 1))) (Op +) (S (Num (Digit 2)) (Op – ) (S (Num (Digit 3))))). [9.24]
Grammar equivalence and Chomsky Normal Form
A single context-free language can be expressed by more than one context-free grammar. For example, the following two grammars both define the language anbn for n > 0.
S →aSb | ab
S →aSb|aabb|ab
Two grammars are weakly equivalent if they generate the same strings. Two grammars are strongly equivalent if they generate the same strings via the same derivations. The grammars above are only weakly equivalent.
6This is a depth-first left-to-right search that prints each node the first time it is encountered (Cormen et al., 2009, chapter 12).
Jacob Eisenstein. Draft of October 15, 2018.

9.2. CONTEXT-FREELANGUAGES 211 In Chomsky Normal Form (CNF), the right-hand side of every production includes
either two non-terminals, or a single terminal symbol:
A →BC A →a
All CFGs can be converted into a CNF grammar that is weakly equivalent. To convert a grammar into CNF, we first address productions that have more than two non-terminals on the RHS by creating new “dummy” non-terminals. For example, if we have the pro- duction,
it is replaced with two productions,
W → X Y Z, W →X W\X
W\X →Y Z.
[9.25]
[9.26] [9.27]
In these productions, W\X is a new dummy non-terminal. This transformation binarizes the grammar, which is critical for efficient bottom-up parsing, as we will see in chapter 10. Productions whose right-hand side contains a mix of terminal and non-terminal symbols can be replaced in a similar fashion.
Unary non-terminal productions A → B are replaced as follows: for each production B → α in the grammar, add a new production A → α. For example, in the grammar described in Figure 9.11, we would replace NUM → DIGIT with NUM → 1 | 2 | … | 9. However, we keep the production NUM → NUM DIGIT, which is a valid binary produc- tion.
9.2.2 Natural language syntax as a context-free language
Context-free grammars can be used to represent syntax, which is the set of rules that determine whether an utterance is judged to be grammatical. If this representation were perfectly faithful, then a natural language such as English could be transformed into a formal language, consisting of exactly the (infinite) set of strings that would be judged to be grammatical by a fluent English speaker. We could then build parsing software that would automatically determine if a given utterance were grammatical.7
Contemporary theories generally do not consider natural languages to be context-free (see section 9.3), yet context-free grammars are widely used in natural language parsing. The reason is that context-free representations strike a good balance: they cover a broad range of syntactic phenomena, and they can be parsed efficiently. This section therefore
7To move beyond this cursory treatment of syntax, consult the short introductory manuscript by Bender (2013), or the longer text by Akmajian et al. (2010).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

212 CHAPTER9. FORMALLANGUAGETHEORY
describes how to handle a core fragment of English syntax in context-free form, following the conventions of the Penn Treebank (PTB; Marcus et al., 1993), a large-scale annotation of English language syntax. The generalization to “mildly” context-sensitive languages is discussed in section 9.3.
The Penn Treebank annotation is a phrase-structure grammar of English. This means that sentences are broken down into constituents, which are contiguous sequences of words that function as coherent units for the purpose of linguistic analysis. Constituents generally have a few key properties:
Movement. Constituents can often be moved around sentences as units. (9.3) a. Abigail gave (her brother) (a fish).
b. Abigail gave (a fish) to (her brother).
In contrast, gave her and brother a cannot easily be moved while preserving gram- maticality.
Substitution. Constituents can be substituted by other phrases of the same type. (9.4) a. Max thanked (his older sister).
b. Max thanked (her).
In contrast, substitution is not possible for other contiguous units like Max thanked
and thanked his.
Coordination. Coordinators like and and or can conjoin constituents.
(9.5) a.
b. Abigail (bought a fish) and (gave it to Max).
(Abigail) and (her younger brother) bought a fish. c. Abigail (bought) and (greedily ate) a fish.
Units like brother bought and bought a cannot easily be coordinated.
These examples argue for units such as her brother and bought a fish to be treated as con- stituents. Other sequences of words in these examples, such as Abigail gave and brother a fish, cannot be moved, substituted, and coordinated in these ways. In phrase-structure grammar, constituents are nested, so that the senator from New Jersey contains the con- stituent from New Jersey, which in turn contains New Jersey. The sentence itself is the max- imal constituent; each word is a minimal constituent, derived from a unary production from a part-of-speech tag. Between part-of-speech tags and sentences are phrases. In phrase-structure grammar, phrases have a type that is usually determined by their head word: for example, a noun phrase corresponds to a noun and the group of words that
Jacob Eisenstein. Draft of October 15, 2018.

9.2. CONTEXT-FREELANGUAGES 213 modify it, such as her younger brother; a verb phrase includes the verb and its modifiers,
such as bought a fish and greedily ate it.
In context-free grammars, each phrase type is a non-terminal, and each constituent is the substring that the non-terminal yields. Grammar design involves choosing the right set of non-terminals. Fine-grained non-terminals make it possible to represent more fine- grained linguistic phenomena. For example, by distinguishing singular and plural noun phrases, it is possible to have a grammar of English that generates only sentences that obey subject-verb agreement. However, enforcing subject-verb agreement is considerably more complicated in languages like Spanish, where the verb must agree in both person and number with subject. In general, grammar designers must trade off between over- generation — a grammar that permits ungrammatical sentences — and undergeneration — a grammar that fails to generate grammatical sentences. Furthermore, if the grammar is to support manual annotation of syntactic structure, it must be simple enough to annotate efficiently.
9.2.3 A phrase-structure grammar for English
To better understand how phrase-structure grammar works, let’s consider the specific case of the Penn Treebank grammar of English. The main phrase categories in the Penn Treebank (PTB) are based on the main part-of-speech classes: noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adjectival phrase (ADJP), and adverbial phrase (ADVP). The top-level category is S, which conveniently stands in for both “sentence” and the “start” symbol. Complement clauses (e.g., I take the good old fashioned ground that the whale is a fish) are represented by the non-terminal SBAR. The terminal symbols in the grammar are individual words, which are generated from unary productions from part-of-speech tags (the PTB tagset is described in section 8.1).
This section describes some of the most common productions from the major phrase- level categories, explaining how to generate individual tag sequences. The production rules are approached in a “theory-driven” manner: first the syntactic properties of each phrase type are described, and then some of the necessary production rules are listed. But it is important to keep in mind that the Penn Treebank was produced in a “data-driven” manner. After the set of non-terminals was specified, annotators were free to analyze each sentence in whatever way seemed most linguistically accurate, subject to some high-level guidelines. The grammar of the Penn Treebank is simply the set of productions that were required to analyze the several million words of the corpus. By design, the grammar overgenerates — it does not exclude ungrammatical sentences. Furthermore, while the productions shown here cover some of the most common cases, they are only a small fraction of the several thousand different types of productions in the Penn Treebank.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

214 CHAPTER9. FORMALLANGUAGETHEORY Sentences
The most common production rule for sentences is,
S →NP VP [9.28]
which accounts for simple sentences like Abigail ate the kimchi — as we will see, the direct object the kimchi is part of the verb phrase. But there are more complex forms of sentences as well:
S →ADVP NP VP S →S CC S
S →VP
Unfortunately Abigail ate the kimchi. Abigail ate the kimchi and Max had a burger. Eat the kimchi.
[9.29] [9.30] [9.31]
where ADVP is an adverbial phrase (e.g., unfortunately, very unfortunately) and CC is a coordinating conjunction (e.g., and, but).8
Noun phrases
Noun phrases refer to entities, real or imaginary, physical or abstract: Asha, the steamed dumpling, parts and labor, nobody, the whiteness of the whale, and the rise of revolutionary syn- dicalism in the early twentieth century. Noun phrase productions include “bare” nouns, which may optionally follow determiners, as well as pronouns:
NP→NN |NNS |NNP |PRP [9.32] NP →DET NN | DET NNS | DET NNP [9.33]
The tags NN, NNS, and NNP refer to singular, plural, and proper nouns; PRP refers to personal pronouns, and DET refers to determiners. The grammar also contains terminal productions from each of these tags, e.g., PRP → I | you | we | ….
Noun phrases may be modified by adjectival phrases (ADJP; e.g., the small Russian dog) and numbers (CD; e.g., the five pastries), each of which may optionally follow a determiner:
NP →ADJP NN | ADJP NNS | DET ADJP NN | DET ADJP NNS [9.34] NP →CD NNS | DET CD NNS | … [9.35]
Some noun phrases include multiple nouns, such as the liberation movement and an antelope horn, necessitating additional productions:
NP→NN NN |NN NNS |DET NN NN |… [9.36] 8Notice that the grammar does not include the recursive production S → ADVP S. It may be helpful to
think about why this production would cause the grammar to overgenerate.
Jacob Eisenstein. Draft of October 15, 2018.

9.2. CONTEXT-FREELANGUAGES 215 These multiple noun constructions can be combined with adjectival phrases and cardinal
numbers, leading to a large number of additional productions.
Recursive noun phrase productions include coordination, prepositional phrase attach- ment, subordinate clauses, and verb phrase adjuncts:
NP →NP CC NP NP →NP PP
NP →NP SBAR NP →NP VP
e.g., the red and the black e.g., the President of the Georgia Institute of Technology e.g., a whale which he had wounded e.g., a whale taken near Shetland
[9.37] [9.38] [9.39] [9.40]
These recursive productions are a major source of ambiguity, because the VP and PP non- terminals can also generate NP children. Thus, the the President of the Georgia Institute of Technology can be derived in two ways, as can a whale taken near Shetland in October.
But aside from these few recursive productions, the noun phrase fragment of the Penn Treebank grammar is relatively flat, containing a large of number of productions that go from NP directly to a sequence of parts-of-speech. If noun phrases had more internal structure, the grammar would need fewer rules, which, as we will see, would make pars- ing faster and machine learning easier. Vadas and Curran (2011) propose to add additional structure in the form of a new non-terminal called a nominal modifier (NML), e.g.,
(9.6) a. (NP (NN crude) (NN oil) (NNS prices)) (PTB analysis)
b. (NP (NML (NN crude) (NN oil)) (NNS prices)) (NML-style analysis).
Another proposal is to treat the determiner as the head of a determiner phrase (DP; Abney, 1987). There are linguistic arguments for and against determiner phrases (e.g., Van Eynde, 2006). From the perspective of context-free grammar, DPs enable more struc- tured analyses of some constituents, e.g.,
(9.7) a. (NP (DT the) (JJ white) (NN whale)) (PTB analysis)
b. (DP (DT the) (NP (JJ white) (NN whale))) (DP-style analysis).
Verb phrases
Verb phrases describe actions, events, and states of being. The PTB tagset distinguishes several classes of verb inflections: base form (VB; she likes to snack), present-tense third- person singular (VBZ; she snacks), present tense but not third-person singular (VBP; they snack), past tense (VBD; they snacked), present participle (VBG; they are snacking), and past participle (VBN; they had snacked).9 Each of these forms can constitute a verb phrase on its
9This tagset is specific to English: for example, VBP is a meaningful category only because English mor- phology distinguishes third-person singular from all person-number combinations.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

216 CHAPTER9. FORMALLANGUAGETHEORY own:
VP→VB |VBZ |VBD |VBN |VBG |VBP [9.41]
More complex verb phrases can be formed by a number of recursive productions, including the use of coordination, modal verbs (MD; she should snack), and the infinitival to (TO):
VP → MD VP VP → VBD VP VP → VBZ VP VP → VBN VP VP → TO VP
VP → VP CC VP
She will snack She had snacked She has been snacking She has been snacking She wants to snack She buys and eats many snacks
[9.42] [9.43] [9.44] [9.45] [9.46] [9.47]
Each of these productions uses recursion, with the VP non-terminal appearing in both the LHS and RHS. This enables the creation of complex verb phrases, such as She will have wanted to have been snacking.
Transitive verbs take noun phrases as direct objects, and ditransitive verbs take two direct objects:
VP → VBZ NP
VP → VBG NP
VP → VBD NP NP
She teaches algebra She has been teaching algebra She taught her brother algebra
[9.48] [9.49] [9.50]
These productions are not recursive, so a unique production is required for each verb part-of-speech. They also do not distinguish transitive from intransitive verbs, so the resulting grammar overgenerates examples like *She sleeps sushi and *She learns Boyang algebra. Sentences can also be direct objects:
VP → VBZ S Hunter wants to eat the kimchi [9.51] VP → VBZ SBAR Hunter knows that Tristan ate the kimchi [9.52]
The first production overgenerates, licensing sentences like *Hunter sees Tristan eats the kimchi. This problem could be addressed by designing a more specific set of sentence non-terminals, indicating whether the main verb can be conjugated.
Verbs can also be modified by prepositional phrases and adverbial phrases:
VP → VBZ PP
VP → VBZ ADVP VP → ADVP VBG
She studies at night She studies intensively She is not studying
[9.53] [9.54] [9.55]
Jacob Eisenstein. Draft of October 15, 2018.

9.2. CONTEXT-FREELANGUAGES 217 Again, because these productions are not recursive, the grammar must include produc-
tions for every verb part-of-speech.
A special set of verbs, known as copula, can take predicative adjectives as direct ob- jects:
VP → VBZ ADJP She is hungry [9.56] VP → VBP ADJP Success seems increasingly unlikely [9.57]
The PTB does not have a special non-terminal for copular verbs, so this production gen- erates non-grammatical examples such as *She eats tall.
Particles (PRT as a phrase; RP as a part-of-speech) work to create phrasal verbs:
VP → VB PRT She told them to fuck off [9.58]
VP → VBD PRT NP They gave up their ill-gotten gains [9.59] As the second production shows, particle productions are required for all configurations
of verb parts-of-speech and direct objects.
Other contituents
The remaining constituents require far fewer productions. Prepositional phrases almost
always consist of a preposition and a noun phrase,
PP → IN NP the whiteness of the whale [9.60]
PP → TO NP What the white whale was to Ahab, has been hinted [9.61] Similarly, complement clauses consist of a complementizer (usually a preposition, pos-
sibly null) and a sentence,
SBAR → IN S She said that it was spicy [9.62]
SBAR → S She said it was spicy [9.63] Adverbial phrases are usually bare adverbs (ADVP → RB), with a few exceptions:
ADVP → RB RBR They went considerably further [9.64] ADVP → ADVP PP They went considerably further than before [9.65]
The tag RBR is a comparative adverb.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

218
CHAPTER9. FORMALLANGUAGETHEORY
Adjectival phrases extend beyond bare adjectives (ADJP → JJ) in a number of ways:
ADJP → RB JJ ADJP → RBR JJ ADJP → JJS JJ ADJP → RB JJR ADJP → JJ CC JJ ADJP → JJ JJ ADJP → RB VBN
very hungry more hungry best possible even bigger high and mighty West German previously reported
[9.66] [9.67] [9.68] [9.69] [9.70] [9.71] [9.72]
[9.73] [9.74] [9.75] [9.76]
The tags JJR and JJS refer to comparative and superlative adjectives respectively. All of these phrase types can be coordinated:
PP →PP CC PP
ADVP →ADVP CC ADVP
ADJP →ADJP CC ADJP SBAR →SBAR CC SBAR
9.2.4 Grammatical ambiguity
on time and under budget now and two years ago quaint and rather deceptive whether they want control or whether they want exports
Context-free parsing is useful not only because it determines whether a sentence is gram- matical, but mainly because the constituents and their relations can be applied to tasks such as information extraction (chapter 17) and sentence compression (Jing, 2000; Clarke and Lapata, 2008). However, the ambiguity of wide-coverage natural language grammars poses a serious problem for such potential applications. As an example, Figure 9.13 shows two possible analyses for the simple sentence We eat sushi with chopsticks, depending on whether the chopsticks modify eat or sushi. Realistic grammars can license thousands or even millions of parses for individual sentences. Weighted context-free grammars solve this problem by attaching weights to each production, and selecting the derivation with the highest score. This is the focus of chapter 10.
9.3 *Mildly context-sensitive languages
Beyond context-free languages lie context-sensitive languages, in which the expansion of a non-terminal depends on its neighbors. In the general class of context-sensitive languages, computation becomes much more challenging: the membership problem for context-sensitive languages is PSPACE-complete. Since PSPACE contains the complexity class NP (problems that can be solved in polynomial time on a non-deterministic Turing
Jacob Eisenstein. Draft of October 15, 2018.

9.3. *MILDLYCONTEXT-SENSITIVELANGUAGES
219
S
NP
VP
VP
S
WeV NP
eat NP
NP
PP
WeVP PP
NP eat sushi with chopsticks
sushi IN
with chopsticks
V NP IN
NP
Figure 9.13: Two derivations of the same sentence
machine), PSPACE-complete problems cannot be solved efficiently if P ̸= NP. Thus, de- signing an efficient parsing algorithm for the full class of context-sensitive languages is probably hopeless.10
However, Joshi (1985) identifies a set of properties that define mildly context-sensitive languages, which are a strict subset of context-sensitive languages. Like context-free lan- guages, mildly context-sensitive languages are parseable in polynomial time. However, the mildly context-sensitive languages include non-context-free languages, such as the “copy language” {ww | w ∈ Σ∗} and the language ambncmdn. Both are characterized by cross-serial dependencies, linking symbols at long distance across the string.11 For exam- ple, in the language anbmcndm, each a symbol is linked to exactly one c symbol, regardless of the number of intervening b symbols.
9.3.1 Context-sensitive phenomena in natural language
Such phenomena are occasionally relevant to natural language. A classic example is found in Swiss-German (Shieber, 1985), in which sentences such as we let the children help Hans paint the house are realized by listing all nouns before all verbs, i.e., we the children Hans the house let help paint. Furthermore, each noun’s determiner is dictated by the noun’s case marking (the role it plays with respect to the verb). Using an argument that is analogous to the earlier discussion of center-embedding (section 9.2), Shieber describes these case marking constraints as a set of cross-serial dependencies, homomorphic to ambncmdn, and therefore not context-free.
10If PSPACE ̸= NP, then it contains problems that cannot be solved in polynomial time on a non- deterministic Turing machine; equivalently, solutions to these problems cannot even be checked in poly- nomial time (Arora and Barak, 2009).
11A further condition of the set of mildly-context-sensitive languages is constant growth: if the strings in the language are arranged by length, the gap in length between any pair of adjacent strings is bounded by some language specific constant. This condition excludes languages such as {a2n | n ≥ 0}.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

220 CHAPTER9. FORMALLANGUAGETHEORY
Abigail eats the kimchi
NP (S\NP)/NP (NP/N) N NP
S \NP < > >
S
Figure 9.14: A syntactic analysis in CCG involving forward and backward function appli- cation
As with the move from regular to context-free languages, mildly context-sensitive languages can also be motivated by expedience. While finite sequences of cross-serial dependencies can in principle be handled in a context-free grammar, it is often more con- venient to use a mildly context-sensitive formalism like tree-adjoining grammar (TAG) and combinatory categorial grammar (CCG). TAG-inspired parsers have been shown to be particularly effective in parsing the Penn Treebank (Collins, 1997; Carreras et al., 2008), and CCG plays a leading role in current research on semantic parsing (Zettlemoyer and Collins, 2005). These two formalisms are weakly equivalent: any language that can be specified in TAG can also be specified in CCG, and vice versa (Joshi et al., 1991). The re- mainder of the chapter gives a brief overview of CCG, but you are encouraged to consult Joshi and Schabes (1997) and Steedman and Baldridge (2011) for more detail on TAG and CCG respectively.
9.3.2 Combinatory categorial grammar
In combinatory categorial grammar, structural analyses are built up through a small set of generic combinatorial operations, which apply to immediately adjacent sub-structures. These operations act on the categories of the sub-structures, producing a new structure with a new category. The basic categories include S (sentence), NP (noun phrase), VP (verb phrase) and N (noun). The goal is to label the entire span of text as a sentence, S.
Complex categories, or types, are constructed from the basic categories, parentheses, and forward and backward slashes: for example, S/NP is a complex type, indicating a sentence that is lacking a noun phrase to its right; S\NP is a sentence lacking a noun phrase to its left. Complex types act as functions, and the most basic combinatory oper- ations are function application to either the right or left neighbor. For example, the type of a verb phrase, such as eats, would be S\NP. Applying this function to a subject noun phrase to its left results in an analysis of Abigail eats as category S, indicating a successful parse.
Transitive verbs must first be applied to the direct object, which in English appears to the right of the verb, before the subject, which appears on the left. They therefore have the more complex type (S\NP)/NP. Similarly, the application of a determiner to the noun at
Jacob Eisenstein. Draft of October 15, 2018.

9.3. *MILDLYCONTEXT-SENSITIVELANGUAGES
221
Abigail
NP
might learn
(S \NP )/VP VP /NP
>B (S \NP )/NP
Swahili
NP
S \NP
< S >
Figure 9.15: A syntactic analysis in CCG involving function composition (example modi- fied from Steedman and Baldridge, 2011)
its right results in a noun phrase, so determiners have the type NP/N. Figure 9.14 pro- vides an example involving a transitive verb and a determiner. A key point from this example is that it can be trivially transformed into phrase-structure tree, by treating each function application as a constituent phrase. Indeed, when CCG’s only combinatory op- erators are forward and backward function application, it is equivalent to context-free grammar. However, the location of the “effort” has changed. Rather than designing good productions, the grammar designer must focus on the lexicon — choosing the right cate- gories for each word. This makes it possible to parse a wide range of sentences using only a few generic combinatory operators.
Things become more interesting with the introduction of two additional operators: composition and type-raising. Function composition enables the combination of com- plex types: X/Y ◦ Y/Z ⇒B X/Z (forward composition) and Y \Z ◦ X\Y ⇒B X\Z (back- ward composition).12 Composition makes it possible to “look inside” complex types, and combine two adjacent units if the “input” for one is the “output” for the other. Figure 9.15 shows how function composition can be used to handle modal verbs. While this sentence can be parsed using only function application, the composition-based analysis is prefer- able because the unit might learn functions just like a transitive verb, as in the example Abigail studies Swahili. This in turn makes it possible to analyze conjunctions such as Abi- gail studies and might learn Swahili, attaching the direct object Swahili to the entire conjoined verb phrase studies and might learn. The Penn Treebank grammar fragment from subsec- tion 9.2.3 would be unable to handle this case correctly: the direct object Swahili could attach only to the second verb learn.
TyperaisingconvertsanelementoftypeXtoamorecomplextype:X⇒T T/(T\X) (forward type-raising to type T), and X ⇒T T\(T/X) (backward type-raising to type T ). Type-raising makes it possible to reverse the relationship between a function and its argument — by transforming the argument into a function over functions over arguments! An example may help. Figure 9.15 shows how to analyze an object relative clause, a story that Abigail tells. The problem is that tells is a transitive verb, expecting a direct object to its right. As a result, Abigail tells is not a valid constituent. The issue is resolved by raising
12The subscript B follows notation from Curry and Feys (1958).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

222
CHAPTER9. FORMALLANGUAGETHEORY
a story
NP
that Abigail tells
(NP \NP )/(S /NP ) NP S/(S\NP)
(S \NP )/NP
>B > < NP \NP NP >T
S /NP
Figure 9.16: A syntactic analysis in CCG involving an object relative clause
Abigail from NP to the complex type (S/NP)\NP). This function can then be combined with the transitive verb tells by forward composition, resulting in the type (S/NP), which is a sentence lacking a direct object to its right.13 From here, we need only design the lexical entry for the complementizer that to expect a right neighbor of type (S/NP), and the remainder of the derivation can proceed by function application.
Composition and type-raising give CCG considerable power and flexibility, but at a price. The simple sentence Abigail tells Max can be parsed in two different ways: by func- tion application (first forming the verb phrase tells Max), and by type-raising and compo- sition (first forming the non-constituent Abigail tells). This derivational ambiguity does not affect the resulting linguistic analysis, so it is sometimes known as spurious ambi- guity. Hockenmaier and Steedman (2007) present a translation algorithm for converting the Penn Treebank into CCG derivations, using composition and type-raising only when necessary.
Exercises
1. Sketch out the state diagram for finite-state acceptors for the following languages on the alphabet {a, b}.
a) Even-length strings. (Be sure to include 0 as an even number.)
b) Strings that contain aaa as a substring.
c) Strings containing an even number of a and an odd number of b symbols.
d) Strings in which the substring bbb must be terminal if it appears — the string
need not contain bbb, but if it does, nothing can come after it.
2. Levenshtein edit distance is the number of insertions, substitutions, or deletions
required to convert one string to another.
13The missing direct object would be analyzed as a trace in CFG-like approaches to syntax, including the Penn Treebank.
Jacob Eisenstein. Draft of October 15, 2018.

9.3. *MILDLYCONTEXT-SENSITIVELANGUAGES 223
a) Define a finite-state acceptor that accepts all strings with edit distance 1 from
the target string, target.
b) Now think about how to generalize your design to accept all strings with edit distance from the target string equal to d. If the target string has length l, what is the minimal number of states required?
3. Construct an FSA in the style of Figure 9.3, which handles the following examples: • nation/N, national/ADJ, nationalize/V, nationalizer/N
• America/N, American/ADJ, Americanize/V, Americanizer/N
Be sure that your FSA does not accept any further derivations, such as *nationalizeral
and *Americanizern.
4. Showhowtoconstructatrigramlanguagemodelinaweightedfinite-stateacceptor.
Make sure that you handle the edge cases at the beginning and end of the input.
5. Extend the FST in Figure 9.6 to handle the other two parts of rule 1a of the Porter
stemmer: -sses → ss, and -ies → -i.
6. section 9.1.4 describes TO, a transducer that captures English orthography by trans- ducing cook + ed → cooked and bake + ed → baked. Design an unweighted finite-state transducer that captures this property of English orthography.
Next, augment the transducer to appropriately model the suffix -s when applied to words ending in s, e.g. kiss+s → kisses.
7. Add parenthesization to the grammar in Figure 9.11 so that it is no longer ambigu- ous.
8. Construct three examples — a noun phrase, a verb phrase, and a sentence — which can be derived from the Penn Treebank grammar fragment in subsection 9.2.3, yet are not grammatical. Avoid reusing examples from the text. Optionally, propose corrections to the grammar to avoid generating these cases.
9. Produce parses for the following sentences, using the Penn Treebank grammar frag- ment from subsection 9.2.3.
(9.8) This aggression will not stand.
(9.9) I can get you a toe.
(9.10) Sometimes you eat the bar and sometimes the bar eats you.
Then produce parses for three short sentences from a news article from this week. Under contract with MIT Press, shared under CC-BY-NC-ND license.

224 10.
CHAPTER9. FORMALLANGUAGETHEORY
* One advantage of CCG is its flexibility in handling coordination: (9.11) a. Hunter and Tristan speak Hawaiian
b. Hunter speaks and Tristan understands Hawaiian Define the lexical entry for and as
and := (X/X)\X, [9.77]
where X can refer to any type. Using this lexical entry, show how to parse the two examples above. In the second example, Swahili should be combined with the coor- dination Abigail speaks and Max understands, and not just with the verb understands.
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 10 Context-free parsing
Parsing is the task of determining whether a string can be derived from a given context- free grammar, and if so, how. A parser’s output is a tree, like the ones shown in Fig- ure 9.13. Such trees can answer basic questions of who-did-what-to-whom, and have ap- plications in downstream tasks like semantic analysis (chapter 12 and 13) and information extraction (chapter 17).
For a given input and grammar, how many parse trees are there? Consider a minimal context-free grammar with only one non-terminal, X, and the following productions:
X →X X
X →aardvark | abacus | . . . | zyther
The second line indicates unary productions to every nonterminal in Σ. In this gram- mar, the number of possible derivations for a string w is equal to the number of binary bracketings, e.g.,
((((w1 w2) w3) w4) w5), (((w1 (w2 w3)) w4) w5), ((w1 (w2(w3 w4))) w5), . . . .
The number of such bracketings is a Catalan number, which grows super-exponentially
in the length of the sentence, Cn = (2n)! . As with sequence labeling, it is only possible to (n+1)!n!
exhaustively search the space of parses by resorting to locality assumptions, which make it possible to search efficiently by reusing shared substructures with dynamic programming. This chapter focuses on a bottom-up dynamic programming algorithm, which enables exhaustive search of the space of possible parses, but imposes strict limitations on the form of scoring function. These limitations can be relaxed by abandoning exhaustive search. Non-exact search methods will be briefly discussed at the end of this chapter, and one of them — transition-based parsing — will be the focus of chapter 11.
225

226
CHAPTER10. CONTEXT-FREEPARSING
10.1
S →NPVP
NP → NP PP | we | sushi | chopsticks PP →INNP
IN → with
VP →VNP|VPPP
V → eat
Table 10.1: A toy example context-free grammar
Deterministic bottom-up parsing
The CKY algorithm1 is a bottom-up approach to parsing in a context-free grammar. It efficiently tests whether a string is in a language, without enumerating all possible parses. The algorithm first forms small constituents, and then tries to merge them into larger constituents.
To understand the algorithm, consider the input, We eat sushi with chopsticks. Accord- ing to the toy grammar in Table 10.1, each terminal symbol can be generated by exactly one unary production, resulting in the sequence NP V NP IN NP. In real examples, there may be many unary productions for each individual token. In any case, the next step is to try to apply binary productions to merge adjacent symbols into larger constituents: for example, V NP can be merged into a verb phrase (VP), and IN NP can be merged into a prepositional phrase (PP). Bottom-up parsing searches for a series of mergers that ultimately results in the start symbol S covering the entire input.
The CKY algorithm systematizes this search by incrementally constructing a table t in which each cell t[i, j ] contains the set of nonterminals that can derive the span wi+1:j . The algorithm fills in the upper right triangle of the table; it begins with the diagonal, which corresponds to substrings of length 1, and then computes derivations for progressively larger substrings, until reaching the upper right corner t[0, M ], which corresponds to the entire input, w1:M . If the start symbol S is in t[0, M ], then the string w is in the language defined by the grammar. This process is detailed in Algorithm 13, and the resulting data structure is shown in Figure 10.1. Informally, here’s how it works:
• Beginbyfillinginthediagonal:thecellst[m−1,m]forallm∈{1,2,…,M}.These cells are filled with terminal productions that yield the individual tokens; for the word w2 = sushi, we fill in t[1, 2] = {NP}, and so on.
• Then fill in the next diagonal, in which each cell corresponds to a subsequence of length two: t[0, 2], t[1, 3], . . . , t[M − 2, M ]. These cells are filled in by looking for
1The name is for Cocke-Kasami-Younger, the inventors of the algorithm. It is a special case of chart parsing, because its stores reusable computations in a chart-like data structure.
Jacob Eisenstein. Draft of October 15, 2018.

10.1. DETERMINISTICBOTTOM-UPPARSING 227
binary productions capable of producing at least one entry in each of the cells corre- sponding to left and right children. For example, VP can be placed in the cell t[1, 3] because the grammar includes the production VP → V NP, and because the chart contains V ∈ t[1, 2] and NP ∈ t[2, 3].
• At the next diagonal, the entries correspond to spans of length three. At this level, there is an additional decision at each cell: where to split the left and right children. The cell t[i,j] corresponds to the subsequence wi+1:j, and we must choose some split point i < k < j, so that the span wi+1:k is the left child, and the span wk+1:j is the right child. We consider all possible k, looking for productions that generate elements in t[i, k] and t[k, j]; the left-hand side of all such productions can be added to t[i, j]. When it is time to compute t[i, j], the cells t[i, k] and t[k, j] are guaranteed to be complete, since these cells correspond to shorter sub-strings of the input. • Theprocesscontinuesuntilwereacht[0,M]. Figure 10.1 shows the chart that arises from parsing the sentence We eat sushi with chop- sticks using the grammar defined above. 10.1.1 Recovering the parse tree As with the Viterbi algorithm, it is possible to identify a successful parse by storing and traversing an additional table of back-pointers. If we add an entry X to cell t[i, j] by using the production X → Y Z and the split point k, then we store the back-pointer b[i, j, X] = (Y,Z,k). Once the table is complete, we can recover a parse by tracing this pointers, starting at b[0, M, S], and stopping when they ground out at terminal productions. For ambiguous sentences, there will be multiple paths to reach S ∈ t[0,M]. For ex- ample, in Figure 10.1, the goal state S ∈ t[0,M] is reached through the state VP ∈ t[1,5], and there are two different ways to generate this constituent: one with (eat sushi) and (with chopsticks) as children, and another with (eat) and (sushi with chopsticks) as children. The presence of multiple paths indicates that the input can be generated by the grammar in more than one way. In Algorithm 13, one of these derivations is selected arbitrarily. As discussed in section 10.3, weighted context-free grammars compute a score for all permissible derivations, and a minor modification of CKY allows it to identify the single derivation with the maximum score. 10.1.2 Non-binary productions As presented above, the CKY algorithm assumes that all productions with non-terminals on the right-hand side (RHS) are binary. In real grammars, such as the one considered in chapter 9, there are other types of productions: some have more than two elements on the right-hand side, and others produce a single non-terminal. Under contract with MIT Press, shared under CC-BY-NC-ND license. 228 CHAPTER10. CONTEXT-FREEPARSING Algorithm 13 The CKY algorithm for parsing a sequence w ∈ Σ∗ in a context-free gram- mar G = (N,Σ,R,S), with non-terminals N, production rules R, and start symbol S. The grammar is assumed to be in Chomsky normal form (section 9.2.1). The function PICKFROM(b[i, j, X]) selects an element of the set b[i, j, X] arbitrarily. All values of t and b are initialized to ∅. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: procedureCKY(w,G=(N,Σ,R,S)) for m ∈ {1...M} do t[m − 1, m] ← {X : (X → wm) ∈ R} for l ∈ {2, 3, . . . , M} do for m ∈ {0, 1, . . . M − l} do for k ∈ {m + 1, m + 2, . . . , m + l − 1} do for(X →Y Z)∈Rdo ifY ∈t[m,k]∧Z ∈t[k,m+l]then ◃ Iterate over constituent lengths ◃ Iterate over left endpoints ◃ Iterate over split points ◃Iterateoverrules t[m, m + l] ← t[m, m + l] ∪ X b[m, m + l, X] ← b[m, m + l, X] ∪ (Y, Z, k) if S ∈ t[0, M ] then return TRACEBACK(S, 0, M, b) else return ∅ procedureTRACEBACK(X,i,j,b) if j = i + 1 then ◃ Add non-terminal to table ◃ Add back-pointers • • (Y, Z, k) ← PICKFROM(b[i, j, X]) return X → (TRACEBACK(Y, i, k, b), TRACEBACK(Z, k, j, b)) Productions with more than two elements on the right-hand side can be binarized by creating additional non-terminals, as described in section 9.2.1. For example, the production VP → V NP NP (for ditransitive verbs) can be converted to VP → VPditrans/NP NP, by adding the non-terminal VPditrans/NP and the production VPditrans/NP → V NP. What about unary productions like VP → V? While such productions are not a part of Chomsky Normal Form — and can therefore be eliminated in preprocessing the grammar — in practice, a more typical solution is to modify the CKY algorithm. The algorithm makes a second pass on each diagonal in the table, augmenting each cell t[i,j] with all possible unary productions capable of generating each item al- ready in the cell: formally, t[i, j] is extended to its unary closure. Suppose the ex- ample grammar in Table 10.1 were extended to include the production VP → V, enabling sentences with intransitive verb phrases, like we eat. Then the cell t[1, 2] — corresponding to the word eat — would first include the set {V}, and would be augmented to the set {V, VP} during this second pass. Jacob Eisenstein. Draft of October 15, 2018. return X else 10.2. AMBIGUITY 229 We eat sushi with chopsticks We NP ∅ S ∅ S eat VVP∅VP sushi with chopsticks NP ∅ NP P PP NP Figure 10.1: An example completed CKY chart. The solid and dashed lines show the back pointers resulting from the two different derivations of VP in position t[1, 5]. 10.1.3 Complexity For an input of length M and a grammar with R productions and N non-terminals, the space complexity of the CKY algorithm is O(M2N): the number of cells in the chart is O(M2), and each cell must hold O(N) elements. The time complexity is O(M3R): each cell is computed by searching over O(M) split points, with R possible productions for each split point. Both the time and space complexity are considerably worse than the Viterbi algorithm, which is linear in the length of the input. 10.2 Ambiguity In natural language, there is rarely a single parse for a given sentence. The main culprit is ambiguity, which is endemic to natural language syntax. Here are a few broad categories: • Attachment ambiguity: e.g., We eat sushi with chopsticks, I shot an elephant in my pajamas. In these examples, the prepositions (with, in) can attach to either the verb or the direct object. • Modifier scope: e.g., southern food store, plastic cup holder. In these examples, the first word could be modifying the subsequent adjective, or the final noun. • Particle versus preposition: e.g., The puppy tore up the staircase. Phrasal verbs like tore up often include particles which could also act as prepositions. This has struc- tural implications: if up is a preposition, then up the staircase is a prepositional phrase; if up is a particle, then the staircase is the direct object to the verb. • Complement structure: e.g., The students complained to the professor that they didn’t understand. This is another form of attachment ambiguity, where the complement Under contract with MIT Press, shared under CC-BY-NC-ND license. 230 CHAPTER10. CONTEXT-FREEPARSING that they didn’t understand could attach to the main verb (complained), or to the indi- rect object (the professor). • Coordination scope: e.g., “I see,” said the blind man, as he picked up the hammer and saw. In this example, the lexical ambiguity for saw enables it to be coordinated either with the noun hammer or the verb picked up. These forms of ambiguity can combine, so that seemingly simple headlines like Fed raises interest rates have dozens of possible analyses even in a minimal grammar. In a broad coverage grammar, typical sentences can have millions of parses. While careful grammar design can chip away at this ambiguity, a better strategy is combine broad cov- erage parsers with data-driven strategies for identifying the correct analysis. 10.2.1 Parser evaluation Before continuing to parsing algorithms that are able to handle ambiguity, let us stop to consider how to measure parsing performance. Suppose we have a set of reference parses — the ground truth — and a set of system parses that we would like to score. A simple solution would be per-sentence accuracy: the parser is scored by the proportion of sentences on which the system and reference parses exactly match.2 But as any student knows, it always nice to get partial credit, which we can assign to analyses that correctly match parts of the reference parse. The PARSEval metrics (Grishman et al., 1992) score each system parse via: Precision: the fraction of constituents in the system parse that match a constituent in the reference parse. Recall: the fraction of constituents in the reference parse that match a constituent in the system parse. In labeled precision and recall, the system must also match the phrase type for each constituent; in unlabeled precision and recall, it is only required to match the constituent structure. As described in chapter 4, the precision and recall can be combined into an F -MEASURE by their harmonic mean. Suppose that the left tree of Figure 10.2 is the system parse, and that the right tree is the reference parse. Then: • S → w1:5 is true positive, because it appears in both trees. 2Most parsing papers do not report results on this metric, but Suzuki et al. (2018) find that a strong parser recovers the exact parse in roughly 50% of all sentences. Performance on short sentences is generally much higher. Jacob Eisenstein. Draft of October 15, 2018. 10.2. AMBIGUITY 231 S NP VP S WeVNP NPVP eatNPPP WeVPPP sushi IN NP with chopsticks (a) system output Figure 10.2: Two possible analyses from the grammar in Table 10.1 • VP → w2:5 is true positive as well. • NP → w3:5 is false positive, because it appears only in the system output. • PP → w4:5 is true positive, because it appears in both trees. • VP → w2:3 is false negative, because it appears only in the reference. The labeled and unlabeled precision of this parse is 3 = 0.75, and the recall is 3 = 0.75, for 44 an F-measure of 0.75. For an example in which precision and recall are not equal, suppose the reference parse instead included the production VP → V NP PP. In this parse, the reference does not contain the constituent w2:3, so the recall would be 1.3 10.2.2 Local solutions Some ambiguity can be resolved locally. Consider the following examples, (10.1) a. We met the President on Monday. b. We met the President of Mexico. Each case ends with a prepositional phrase, which can be attached to the verb met or the noun phrase the president. If given a labeled corpus, we can compare the likelihood of the observing the preposition alongside each candidate attachment point, p(on | met) ≷ p(on | President) [10.1] p(of | met) ≷ p(of | President). [10.2] 3While the grammar must be binarized before applying the CKY algorithm, evaluation is performed on the original parses. It is therefore necessary to “unbinarize” the output of a CKY-based parser, converting it back to the original grammar. Under contract with MIT Press, shared under CC-BY-NC-ND license. NP eat sushi with chopsticks V NP IN (b) reference 232 CHAPTER10. CONTEXT-FREEPARSING A comparison of these probabilities would successfully resolve this case (Hindle and Rooth, 1993). Other cases, such as the example we eat sushi with chopsticks, require con- sidering the object of the preposition: consider the alternative we eat sushi with soy sauce. With sufficient labeled data, some instances of attachment ambiguity can be solved by supervised classification (Ratnaparkhi et al., 1994). However, there are inherent limitations to local solutions. While toy examples may have just a few ambiguities to resolve, realistic sentences have thousands or millions of possible parses. Furthermore, attachment decisions are interdependent, as shown in the garden path example: (10.2) Cats scratch people with claws with knives. We may want to attach with claws to scratch, as would be correct in the shorter sentence in cats scratch people with claws. But this leaves nowhere to attach with knives. The cor- rect interpretation can be identified only be considering the attachment decisions jointly. The huge number of potential parses may seem to make exhaustive search impossible. But as with sequence labeling, locality assumptions make it possible to search this space efficiently. 10.3 Weighted Context-Free Grammars Let us define a derivation τ as a set of anchored productions, τ = {X → α,(i,j,k)}, [10.3] with X corresponding to the left-hand side non-terminal and α corresponding to the right- hand side. For grammars in Chomsky normal formal, α is either a pair of non-terminals or a terminal symbol. The indices i, j, k anchor the production in the input, with X deriving the span wi+1:j. For binary productions, wi+1:k indicates the span of the left child, and wk+1:j indicates the span of the right child; for unary productions, k is ignored. For an input w, the optimal parse is, τˆ = argmax Ψ(τ ), τ∈T (w) where T (w) is the set of derivations that yield the input w. Define a scoring function Ψ that decomposes across anchored productions, Ψ(τ) = 􏰁 ψ(X → α,(i,j,k)). (X→α,(i,j,k))∈τ [10.4] [10.5] This is a locality assumption, akin to the assumption in Viterbi sequence labeling. In this case, the assumption states that the overall score is a sum over scores of productions, Jacob Eisenstein. Draft of October 15, 2018. 10.3. WEIGHTEDCONTEXT-FREEGRAMMARS 233 ψ(·) S →NPVP 0 NP →NPPP −1 → we −2 → sushi −3 → chopsticks −3 PP→INNP 0 1 IN →with 0 1 1 2 1 4 1 4 V→eat 01 Table 10.2: An example weighted context-free grammar (WCFG). The weights are chosen so that exp ψ(·) sums to one over right-hand sides for each non-terminal; this is required by probabilistic context-free grammars, but not by WCFGs in general. which are computed independently. In a weighted context-free grammar (WCFG), the score of each anchored production X → (α, (i, j, k)) is simply ψ(X → α), ignoring the anchor (i, j, k). In other parsing models, the anchors can be used to access features of the input, while still permitting efficient bottom-up parsing. Example Consider the weighted grammar shown in Table 10.2, and the analysis in Fig- ure 10.2b. Ψ(τ) =ψ(S → NP VP)+ψ(VP → VP PP)+ψ(VP → V NP)+ψ(PP → IN NP) + ψ(NP → We) + ψ(V → eat) + ψ(NP → sushi) + ψ(IN → with) + ψ(NP → chopsticks) [10.6] =0 − 2 − 1 + 0 − 2 + 0 − 3 + 0 − 3 = −11. [10.7] In the alternative parse in Figure 10.2a, the production VP → VP PP (with score −2) is replaced with the production NP → NP PP (with score −1); all other productions are the same. As a result, the score for this parse is −10. This example hints at a problem with WCFG parsing on non-terminals such as NP, VP, and PP: a WCFG will always prefer either VP or NP attachment, regardless of what is being attached! Solutions to this issue are discussed in section 10.5. Under contract with MIT Press, shared under CC-BY-NC-ND license. exp ψ(·) 1 1 2 1 4 1 8 1 8 VP → V NP −1 → VP PP −2 → MD V −2 234 CHAPTER10. CONTEXT-FREEPARSING Algorithm 14 CKY algorithm for parsing a string w ∈ Σ∗ in a weighted context-free grammar (N, Σ, R, S), where N is the set of non-terminals and R is the set of weighted productions. The grammar is assumed to be in Chomsky normal form (section 9.2.1). The function TRACEBACK is defined in Algorithm 13. procedure WCKY(w, G = (N, Σ, R, S)) for all i, j, X do t[i,j,X] ← 0 b[i,j,X] ← ∅ for m ∈ {1,2,...,M} do forallX ∈N do t[m,m+1,X]←ψ(X →wm,(m,m+1,m)) for l ∈ {2,3,...M} do for m ∈ {0, 1, . . . , M − l} do ◃ Initialization for k ∈ {m + 1, m + 2, . . . , m + l − 1} do t[m,m+l,X]← maxψ(X →Y Z,(m,m+l,k))+t[m,k,Y]+t[k,m+l,Z] k,Y,Z b[m,m+l,X]←argmaxψ(X →Y Z,(m+l,k))+t[m,k,Y]+t[k,m+l,Z] k,Y,Z return TRACEBACK(S, 0, M, b) 10.3.1 Parsing with weighted context-free grammars The optimization problem in Equation 10.4 can be solved by modifying the CKY algo- rithm. In the deterministic CKY algorithm, each cell t[i, j] stored a set of non-terminals capable of deriving the span wi+1:j. We now augment the table so that the cell t[i,j,X] is the score of the best derivation of wi+1:j from non-terminal X. This score is computed recursively: for the anchored binary production (X → Y Z, (i, j, k)), we compute: • the score of the anchored production, ψ(X → Y Z, (i, j, k)); • the score of the best derivation of the left child, t[i, k, Y ]; • the score of the best derivation of the right child, t[k, j, Z]. These scores are combined by addition. As in the unweighted CKY algorithm, the table is constructed by considering spans of increasing length, so the scores for spans t[i, k, Y ] and t[k, j, Z] are guaranteed to be available at the time we compute the score t[i, j, X]. The value t[0, M, S] is the score of the best derivation of w from the grammar. Algorithm 14 formalizes this procedure. As in unweighted CKY, the parse is recovered from the table of back pointers b, where each b[i, j, X ] stores the argmax split point k and production X → Y Z in the derivation of wi+1:j from X. The top scoring parse can be obtained by tracing these pointers backwards from b[0, M, S], all the way to the terminal symbols. This is analogous to the computation Jacob Eisenstein. Draft of October 15, 2018. 10.3. WEIGHTEDCONTEXT-FREEGRAMMARS 235 of the best sequence of labels in the Viterbi algorithm by tracing pointers backwards from the end of the trellis. Note that we need only store back-pointers for the best path to t[i,j,X]; this follows from the locality assumption that the global score for a parse is a combination of the local scores of each production in the parse. Example Let’s revisit the parsing table in Figure 10.1. In a weighted CFG, each cell would include a score for each non-terminal; non-terminals that cannot be generated are assumed to have a score of −∞. The first diagonal contains the scores of unary produc- tions: t[0, 1, NP] = −2, t[1, 2, V] = 0, and so on. The next diagonal contains the scores for spans of length 2: t[1, 3, VP] = −1 + 0 − 3 = −4, t[3, 5, PP] = 0 + 0 − 3 = −3, and so on. Things get interesting when we reach the cell t[1, 5, VP], which contains the score for the derivation of the span w2:5 from the non-terminal VP. This score is computed as a max over two alternatives, t[1,5,VP] = max(ψ(VP → VP PP,(1,3,5))+t[1,3,VP]+t[3,5,PP], ψ(VP → V NP,(1,2,5))+t[1,2,V]+t[2,5,NP]) [10.8] = max( − 2 − 4 − 3, −1 + 0 − 7) = −8. [10.9] Since the second case is the argmax, we set the back-pointer b[1, 5, VP] = (V, NP, 2), en- abling the optimal derivation to be recovered. 10.3.2 Probabilistic context-free grammars Probabilistic context-free grammars (PCFGs) are a special case of weighted context- free grammars that arises when the weights correspond to probabilities. Specifically, the weight ψ(X → α, (i, j, k)) = log p(α | X), where the probability of the right-hand side α is conditioned on the non-terminal X, and the anchor (i, j, k) is ig􏰀nored. These probabilities must be normalized over all possible right-hand sides, so that α p(α | X) = 1, for all X. For a given parse τ, the product of the probabilities of the productions is equal to p(τ), under the generative model τ ∼ DRAWSUBTREE(S), where the function DRAWSUBTREE is defined in Algorithm 15. The conditional probability of a parse given a string is, 􏰀 p(τ |w)= 􏰀 p(τ) = 􏰀 expΨ(τ) , [10.10] where Ψ(τ) = X→α,(i,j,k)∈τ ψ(X → α,(i,j,k)). Because the probability is monotonic in the score Ψ(τ ), the maximum likelihood parse can be identified by the CKY algorithm without modification. If a normalized probability p(τ | w) is required, the denominator of Equation 10.10 can be computed by the inside recurrence, described below. Under contract with MIT Press, shared under CC-BY-NC-ND license. τ′∈T (w) p(τ′) τ′∈T (w) expΨ(τ′) 236 CHAPTER10. CONTEXT-FREEPARSING Algorithm 15 Generative model for derivations from probabilistic context-free grammars in Chomsky Normal Form (CNF). procedure DRAWSUBTREE(X) sample(X →α)∼p(α|X) ifα=(Y Z)then return DRAWSUBTREE(Y ) ∪ DRAWSUBTREE(Z) else return (X → α) ◃ In CNF, all unary productions yield terminal symbols Example The WCFG in􏰀Table 10.2 is designed so that the weights are log-probabilities, satisfying the constraint α exp ψ(X → α) = 1. As noted earlier, there are two parses in T (we eat sushi with chopsticks), with scores Ψ(τ1) = log p(τ1) = −10 and Ψ(τ2) = log p(τ2) = −11. Therefore, the conditional probability p(τ1 | w) is equal to, p(τ1) exp Ψ(τ1) 2−10 2 p(τ1 |w)=p(τ1)+p(τ2) = expΨ(τ1)+expΨ(τ2) = 2−10 +2−11 = 3. [10.11] The inside recurrence The denominator of Equation 10.10 can be viewed as a language model, summing over all valid derivations of the string w, p(w) = 􏰁 p(τ′). [10.12] τ ′ :yield(τ ′ )=w Just as the CKY algorithm makes it possible to maximize over all such analyses, with a few modifications it can also compute their sum. Each cell t[i,j,X] must store the log probability of deriving wi+1:j from non-terminal X. To compute this, we replace the max- imization over split points k and productions X → Y Z with a “log-sum-exp” operation, which exponentiates the log probabilities of the production and the children, sums them in probability space, and then converts back to the log domain: t[i,j,X]=log 􏰁 exp(ψ(X →Y Z)+t[i,k,Y]+t[k,j,Z]) [10.13] k􏰁,Y ,Z =log exp(logp(Y Z |X)+logp(Y →wi+1:k)+logp(Z →wk+1:j)) k,Y,Z =log 􏰁 p(Y Z |X)×p(Y →wi+1:k)×p(Z →wk+1:j) k􏰁,Y ,Z = log p(Y Z,wi+1:k,wk+1:j |X) k,Y,Z [10.14] [10.15] [10.16] = log p(X 􏱔 wi+1:j ), [10.17] Jacob Eisenstein. Draft of October 15, 2018. 10.3. WEIGHTEDCONTEXT-FREEGRAMMARS 237 with X 􏱔 wi+1:j indicating the event that non-terminal X yields the span wi+1 , wi+2 , . . . , wj . The recursive computation of t[i, j, X ] is called the inside recurrence, because it computes the probability of each subtree as a combination of the probabilities of the smaller subtrees that are inside of it. The name implies a corresponding outside recurrence, which com- putes the probability of a non-terminal X spanning wi+1:j , joint with the outside context (w1:i , wj +1:M ). This recurrence is described in subsection 10.4.3. The inside and out- side recurrences are analogous to the forward and backward recurrences in probabilistic sequence labeling (see section 7.5.3). They can be used to compute the marginal prob- abilities of individual anchored productions, p(X → α, (i, j, k) | w), summing over all possible derivations of w. 10.3.3 *Semiring weighted context-free grammars The weighted and unweighted CKY algorithms can be unified with the inside recurrence using the same semiring notation described in subsection 7.7.3. The generalized recur- rence is: t[i,j,X]= 􏱍 ψ(X →Y Z,(i,j,k))⊗t[i,k,Y]⊗t[k,j,Z]. [10.18] k,Y,Z This recurrence subsumes all of the algorithms that have been discussed in this chapter to this point. Unweighted CKY. Wh􏱌en ψ(X → α, (i, j, k)) is a Boolean truth value {⊤, ⊥}, ⊗ is logical conjunction, and is logical disjunction, then we derive CKY recurrence for un- weighted context-free grammars, discussed in section 10.1 and Algorithm 13. Weighted CKY. When ψ(X → α, (i, j, k)) is a scalar score, ⊗ is addition, and 􏱌 is maxi- mization, then we derive the CKY recurrence for weighted context-free grammars, discussed in section 10.3 and Algorithm 14. When ψ(X → α, (i, j, k)) = log p(α | X), this same setting derives the CKY recurrence for finding the maximum likelihood derivation in a probabilistic context-free grammar. Inside re􏰀currence. When ψ(X → α, (i, j, k)) is a log probability, ⊗ is addition, and 􏱌 = log exp, then we derive the inside recurrence for probabilistic context-free gram- mars, discussed in subsection 10.3.2. It is also possible to set ψ(X → α,(i,j􏱌,k)) directly equal to the probability p(α | X). In this case, ⊗ is multiplication, and is addition. While this may seem more intuitive than working with log probabilities, there is the risk of underflow on long inputs. Regardless of how the scores are combined, the key point is the locality assumption: the score for a derivation is the combination of the independent scores for each anchored Under contract with MIT Press, shared under CC-BY-NC-ND license. 238 CHAPTER10. CONTEXT-FREEPARSING production, and these scores do not depend on any other part of the derivation. For exam- ple, if two non-terminals are siblings, the scores of productions from these non-terminals are computed independently. This locality assumption is analogous to the first-order Markov assumption in sequence labeling, where the score for transitions between tags depends only on the previous tag and current tag, and not on the history. As with se- quence labeling, this assumption makes it possible to find the optimal parse efficiently; its linguistic limitations are discussed in section 10.5. 10.4 Learning weighted context-free grammars Like sequence labeling, context-free parsing is a form of structure prediction. As a result, WCFGs can be learned using the same set of algorithms: generative probabilistic models, structured perceptron, maximum conditional likelihood, and maximum margin learning. In all cases, learning requires a treebank, which is a dataset of sentences labeled with context-free parses. Parsing research was catalyzed by the Penn Treebank (Marcus et al., 1993), the first large-scale dataset of this type (see subsection 9.2.2). Phrase structure tree- banks exist for roughly two dozen other languages, with coverage mainly restricted to European and East Asian languages, plus Arabic and Urdu. 10.4.1 Probabilistic context-free grammars Probabilistic context-free grammars are similar to hidden Markov models, in that they are generative models of text. In this case, the parameters of interest correspond to probabil- ities of productions, conditional on the left-hand side. As with hidden Markov models, these parameters can be estimated by relative frequency: ψ(X → α) =logp(X → α) [10.19] pˆ(X → α) =count(X → α). [10.20] For example, the probability of the production NP → DET NN is the corpus count of this production, divided by the count of the non-terminal NP. This estimator applies to terminal productions as well: the probability of NN → whale is the count of how often whale appears in the corpus as generated from an NN tag, divided by the total count of the NN tag. Even with the largest treebanks — currently on the order of one million tokens — it is difficult to accurately compute probabilities of even moderately rare events, such as NN → whale. Therefore, smoothing is critical for making PCFGs effective. Jacob Eisenstein. Draft of October 15, 2018. count(X ) 10.4. LEARNINGWEIGHTEDCONTEXT-FREEGRAMMARS 239 10.4.2 Feature-based parsing The scores for each production can be computed as an inner product of weights and fea- tures, ψ(X →α,(i,j,k))=θ·f(X,α,(i,j,k),w), [10.21] where the feature vector f is a function of the left-hand side X, the right-hand side α, the anchor indices (i, j, k), and the input w. The basic feature f (X, α, (i, j, k)) = {(X, α)} encodes only the identity of the produc- tion itself. This gives rise to a discriminatively-trained model with the same expressive- ness as a PCFG. Features on anchored productions can include the words that border the span wi, wj+1, the word at the split point wk+1, the presence of a verb or noun in the left child span wi+1:k, and so on (Durrett and Klein, 2015). Scores on anchored productions can be incorporated into CKY parsing without any modification to the algorithm, because it is still possible to compute each element of the table t[i, j, X ] recursively from its imme- diate children. Other features can be obtained by grouping elements on either the left-hand or right- hand side: for example it can be particularly beneficial to compute additional features by clustering terminal symbols, with features corresponding to groups of words with similar syntactic properties. The clustering can be obtained from unlabeled datasets that are much larger than any treebank, improving coverage. Such methods are described in chapter 14. Feature-based parsing models can be estimated using the usual array of discrimina- tive learning techniques. For example, a structure perceptron update can be computed as (Carreras et al., 2008), f(τ,w(i)) = 􏰁 f(X,α,(i,j,k),w(i)) (X→α,(i,j,k))∈τ τˆ = argmax θ · f(τ, w(i)) τ∈T (w) θ ←f(τ(i),w(i))−f(τˆ,w(i)). [10.22] [10.23] [10.24] A margin-based objective can be optimized by selecting τˆ through cost-augmented de- coding (subsection 2.4.2), enforcing a margin of ∆(τˆ, τ ) between the hypothesis and the reference parse, where ∆ is a non-negative cost function, such as the Hamming loss (Stern et al., 2017). It is also possible to train feature-based parsing models by conditional log- likelihood, as described in the next section. Under contract with MIT Press, shared under CC-BY-NC-ND license. 240 CHAPTER10. CONTEXT-FREEPARSING YY XZZX wi+1 ...wj wj+1 ...wk wk+1 ...wi wi+1 ...wj Figure 10.3: The two cases faced by the outside recurrence in the computation of β(i, j, X) 10.4.3 *Conditional random field parsing The score of a derivation Ψ(τ ) can be converted into a probability by normalizing over all possible derivations, p(τ | w) = 􏰀 expΨ(τ) . [10.25] τ′∈T (w) expΨ(τ′) Using this probability, a WCFG can be trained by maximizing the conditional log-likelihood of a labeled corpus. Just as in logistic regression and the conditional random field over sequences, the gradient of the conditional log-likelihood is the difference between the observed and ex- pected counts of each feature. The expectation Eτ|w[f(τ,w(i));θ] requires summing over all possible parses, and computing the marginal probabilities of anchored productions, p(X → α,(i,j,k) | w). In CRF sequence labeling, marginal probabilities over tag bi- grams are computed by the two-pass forward-backward algorithm (section 7.5.3). The analogue for context-free grammars is the inside-outside algorithm, in which marginal probabilities are computed from terms generated by an upward and downward pass over the parsing chart: • The upward pass is performed by the inside recurrence, which is described in sub- section 10.3.2. Each inside variable α(i, j, X ) is the score of deriving wi+1:j from the non-terminalX.InaPCFG,thiscorrespondstothelog-probabilitylogp(wi+1:j |X). This is computed by the recurrence, 􏰁 􏰁j (X→Y Z)k=i+1 The initial c􏰀ondition of this recurrence is α(m − 1, m, X) = ψ(X → wm). The de- nominator τ∈T(w)expΨ(τ)isequaltoexpα(0,M,S). • Thedownwardpassisperformedbytheoutsiderecurrence,whichrecursivelypop- ulates the same table structure, starting at the root of the tree. Each outside variable Jacob Eisenstein. Draft of October 15, 2018. α(i,j,X)􏱕log exp(ψ(X →Y Z,(i,j,k))+α(i,k,Y)+α(k,j,Z)). [10.26] 10.4. LEARNINGWEIGHTEDCONTEXT-FREEGRAMMARS 241 β(i, j, X) is the score of having a phrase of type X covering the span (i + 1 : j), joint with the exterior context w1:i and wj+1:M . In a PCFG, this corresponds to the log probability log p((X, i + 1, j), w1:i, wj+1:M ). Each outside variable is computed by the recurrence, 􏰁 􏰁M (Y →X Z) k=j+1 expβ(i,j,X)􏱕 exp[ψ(Y →XZ,(i,k,j))+α(j,k,Z)+β(i,k,Y)] [10.27] i−1 􏰁 􏰁exp[ψ(Y →ZX,(k,i,j))+α(k,i,Z)+β(k,j,Y)]. (Y→Z X)k=0 [10.28] The first line of Equation 10.28 is the score under the condition that X is a left child of its parent, which spans wi+1:k, with k > j; the second line is the score under the condition that X is a right child of its parent Y , which spans wk+1:j , with k < i. The two cases are shown in Figure 10.3. In each case, we sum over all possible productions with X on the right-hand side. The parent Y is bounded on one side by either i or j, depending on whether X is a left or right child of Y ; we must sum over all possible values for the other boundary. The initial conditions for the outside recurrence are β(0, M, S) = 0 and β(0, M, X ̸= S) = −∞. The marginal probability of a non-terminal X over span wi+1:j is written p(X 􏱔 wi+1:j | w). This probability can be computed from the inside and outside scores, p(X 􏱔 wi+1:j | w) = p(X 􏱔 wi+1:j , w) [10.29] p(w) =p(wi+1:j |X)×p(X,w1:i,xj+1:M) [10.30] p(w) =exp(α(i,j,X)+β(i,j,X)). [10.31] exp α(0, M, S) Marginal probabilities of individual productions can be computed similarly (see exercise 2). These marginal probabilities can be used for training a conditional random field parser, and also for the task of unsupervised grammar induction, in which a PCFG is estimated from a dataset of unlabeled text (Lari and Young, 1990; Pereira and Schabes, 1992). Under contract with MIT Press, shared under CC-BY-NC-ND license. + 242 CHAPTER10. CONTEXT-FREEPARSING 10.4.4 Neural context-free grammars Neural networks and can be applied to parsing by representing each span with a dense numerical vector (Socher et al., 2013; Durrett and Klein, 2015; Cross and Huang, 2016).4 For example, the anchor (i,j,k) and sentence w can be associated with a fixed-length column vector, v(i,j,k) =[uwi−1;uwi;uwj−1;uwj;uwk−1;uwk], [10.32] where uwi is a word embedding associated with the word wi. The vector vi,j,k can then be passed through a feedforward neural network, and used to compute the score of the an- chored production. For example, this score can be computed as a bilinear product (Durrett and Klein, 2015), v ̃(i,j,k) =FeedForward(v(i,j,k)) [10.33] ψ(X → α, (i, j, k)) =v ̃⊤ Θf (X → α), [10.34] where f(X → α) is a vector of features of the production, and Θ is a parameter ma- trix. The matrix Θ and the parameters of the feedforward network can be learned by backpropagating from an objective such as the margin loss or the negative conditional log-likelihood. 10.5 Grammar refinement The locality assumptions underlying CFG parsing depend on the granularity of the non- terminals. For the Penn Treebank non-terminals, there are several reasons to believe that these assumptions are too strong (Johnson, 1998): • The context-free assumption is too strict: for example, the probability of the produc- tion NP → NP PP is much higher (in the PTB) if the parent of the noun phrase is a verb phrase (indicating that the NP is a direct object) than if the parent is a sentence (indicating that the NP is the subject of the sentence). • The Penn Treebank non-terminals are too coarse: there are many kinds of noun phrases and verb phrases, and accurate parsing sometimes requires knowing the difference. As we have already seen, when faced with prepositional phrase at- tachment ambiguity, a weighted CFG will either always choose NP attachment (if ψ(NP → NP PP) > ψ(VP → VP PP)), or it will always choose VP attachment. To get more nuanced behavior, more fine-grained non-terminals are needed.
• Moregenerally,accurateparsingrequiressomeamountofsemantics—understand- ing the meaning of the text to be parsed. Consider the example cats scratch people
4Earlier work on neural constituent parsing used transition-based parsing algorithms (subsection 10.6.2) rather than CKY-style chart parsing (Henderson, 2004; Titov and Henderson, 2007).
Jacob Eisenstein. Draft of October 15, 2018.
(i,j,k)

10.5. GRAMMARREFINEMENT
243
SS
NP VP
PRP V
she likes NN
NP
PP
NP VP
PRP V NP
she likes NP
NN PP
CC NP and NNP Italy
wine P NP
from NP CC
NP NNP and NNP France Italy
wine P
from NNP
NP France
Figure 10.4: The left parse is preferable because of the conjunction of phrases headed by France and Italy, but these parses cannot be distinguished by a WCFG.
with claws: knowledge of about cats, claws, and scratching is necessary to correctly resolve the attachment ambiguity.
An extreme example is shown in Figure 10.4. The analysis on the left is preferred because of the conjunction of similar entities France and Italy. But given the non-terminals shown in the analyses, there is no way to differentiate these two parses, since they include exactly the same productions. What is needed seems to be more precise non-terminals. One possibility would be to rethink the linguistics behind the Penn Treebank, and ask the annotators to try again. But the original annotation effort took five years, and there is a little appetite for another annotation effort of this scope. Researchers have therefore turned to automated techniques.
10.5.1 Parent annotations and other tree transformations
The key assumption underlying context-free parsing is that productions depend only on the identity of the non-terminal on the left-hand side, and not on its ancestors or neigh- bors. The validity of this assumption is an empirical question, and it depends on the non-terminals themselves: ideally, every noun phrase (and verb phrase, etc) would be distributionally identical, so the assumption would hold. But in the Penn Treebank, the observed probability of productions often depends on the parent of the left-hand side. For example, noun phrases are more likely to be modified by prepositional phrases when they are in the object position (e.g., they amused the students from Georgia) than in the subject position (e.g., the students from Georgia amused them). This means that the NP → NP PP production is more likely if the entire constituent is the child of a VP than if it is the child
Under contract with MIT Press, shared under CC-BY-NC-ND license.

244
CHAPTER10. CONTEXT-FREEPARSING
SS
NP VP
she V
NP
NP-S VP-S
she VP-VP
NP-VP
heard DT NN the bear
heard DT-NP NN-NP the bear
Figure 10.5: Parent annotation in a CFG derivation of S. The observed statistics are (Johnson, 1998):
Pr(NP → NP PP) =11% Pr(NP under S → NP PP) =9%
Pr(NP under VP → NP PP) =23%.
[10.35] [10.36] [10.37]
This phenomenon can be captured by parent annotation (Johnson, 1998), in which each non-terminal is augmented with the identity of its parent, as shown in Figure 10.5). This is sometimes called vertical Markovization, since a Markov dependency is introduced be- tween each node and its parent (Klein and Manning, 2003). It is analogous to moving from a bigram to a trigram context in a hidden Markov model. In principle, parent annotation squares the size of the set of non-terminals, which could make parsing considerably less efficient. But in practice, the increase in the number of non-terminals that actually appear in the data is relatively modest (Johnson, 1998).
Parent annotation weakens the WCFG locality assumptions. This improves accuracy by enabling the parser to make more fine-grained distinctions, which better capture real linguistic phenomena. However, each production is more rare, and so careful smoothing or regularization is required to control the variance over production scores.
10.5.2 Lexicalized context-free grammars
The examples in subsection 10.2.2 demonstrate the importance of individual words in resolving parsing ambiguity: the preposition on is more likely to attach to met, while the preposition of is more likely to attachment to President. But of all word pairs, which are relevant to attachment decisions? Consider the following variants on the original examples:
(10.3) a.
b. We met the first female President of Mexico.
c. They had supposedly met the President on Monday.
Jacob Eisenstein. Draft of October 15, 2018.
We met the President of Mexico.

10.5. GRAMMARREFINEMENT
245
VB
VP(meet)
NP(President) PP(on)
VB meet
VP(meet)
NP(President)
NP(President) PP(of)
DT NN P NP the President of NN
meet DT NN P NP the President on NN
Monday
(a) Lexicalization and attachment ambiguity
Mexico
NP(Italy)
NP(Italy)
NP(wine)
NP(wine)
CC NP(Italy)
NP(wine) NN wine
PP(from)
NP(wine) NN wine
PP(from)
IN NP(France) from NNP
France
and
NNS Italy
IN from
NP(France) CC NNP and
France
NP(Italy) NNS Italy
(b) Lexicalization and coordination scope ambiguity Figure 10.6: Examples of lexicalization
The underlined words are the head words of their respective phrases: met heads the verb phrase, and President heads the direct object noun phrase. These heads provide useful semantic information. But they break the context-free assumption, which states that the score for a production depends only on the parent and its immediate children, and not the substructure under each child.
The incorporation of head words into context-free parsing is known as lexicalization, and is implemented in rules of the form,
NP(President) →NP(President) PP(of) [10.38] NP(President) →NP(President) PP(on). [10.39]
Lexicalization was a major step towards accurate PCFG parsing in the 1990s and early 2000s. It requires solving three problems: identifying the heads of all constituents in a treebank; parsing efficiently while keeping track of the heads; and estimating the scores for lexicalized productions.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

246
CHAPTER10. CONTEXT-FREEPARSING
Non-terminal
S VP NP PP
Direction
right left right left
Priority
VP SBAR ADJP UCP NP
VBD VBN MD VBZ TO VB VP VBG VBP ADJP NP N* EX $ CD QP PRP . . .
IN TO FW
Table 10.3: A fragment of head percolation rules for English (Magerman, 1995; Collins, 1997)
Identifying head words
The head of a constituent is the word that is the most useful for determining how that constituentisintegratedintotherestofthesentence.5 Theheadwordofaconstituentis determined recursively: for any non-terminal production, the head of the left-hand side must be the head of one of the children. The head is typically selected according to a set of deterministic rules, sometimes called head percolation rules. In many cases, these rules are straightforward: the head of a noun phrase in a NP → DET NN production is the head of the noun; the head of a sentence in a S → NP VP production is the head of the verb phrase.
Table 10.3 shows a fragment of the head percolation rules used in many English pars- ing systems. The meaning of the first rule is that to find the head of an S constituent, first look for the rightmost VP child; if you don’t find one, then look for the rightmost SBAR child, and so on down the list. Verb phrases are headed by left verbs (the head of can plan on walking is planned, since the modal verb can is tagged MD); noun phrases are headed by the rightmost noun-like non-terminal (so the head of the red cat is cat),6 and prepositional phrases are headed by the preposition (the head of at Georgia Tech is at). Some of these rules are somewhat arbitrary — there’s no particular reason why the head of cats and dogs should be dogs — but the point here is just to get some lexical information that can support parsing, not to make deep claims about syntax. Figure 10.6 shows the application of these rules to two of the running examples.
Parsing lexicalized context-free grammars
A na ̈ıve application of lexicalization would simply increase the set of non-terminals by taking the cross-product with the set of terminal symbols, so that the non-terminals now
5This is a pragmatic definition, befitting our goal of using head words to improve parsing; for a more formal definition, see (Bender, 2013, chapter 7).
6The noun phrase non-terminal is sometimes treated as a special case. Collins (1997) uses a heuristic that looks for the rightmost child which is a noun-like part-of-speech (e.g., NN, NNP), a possessive marker, or a superlative adjective (e.g., the greatest). If no such child is found, the heuristic then looks for the leftmost NP. If there is no child with tag NP, the heuristic then applies another priority list, this time from right to left.
Jacob Eisenstein. Draft of October 15, 2018.

10.5. GRAMMARREFINEMENT 247
include symbols like NP(President) and VP(meet). Under this approach, the CKY parsing algorithm could be applied directly to the lexicalized production rules. However, the complexity would be cubic in the size of the vocabulary of terminal symbols, which would clearly be intractable.
Another approach is to augment the CKY table with an additional index, keeping track of the head of each constituent. The cell t[i, j, h, X ] stores the score of the best derivation in which non-terminal X spans wi+1:j with head word h, where i < h ≤ j. To compute such a table recursively, we must consider the possibility that each phrase gets its head from either its left or right child. The scores of the best derivations in which the head comes from the left and right child are denoted tl and tr respectively, leading to the following recurrence: tl[i,j,h,X]= max max max t[i,k,h,Y]+t[k,j,h′,Z]+ψ(X(h)→Y(h)Z(h′)) (X→Y Z) k>h k h, since the head word must be in the left child. We then maximize again over possible head words h′ for the right child. An analogous computation is performed for tr. The size of the table is now O(M3N), where M is the length of the input and N is the number of non-terminals. Furthermore, each cell is computed by performing O(M2) operations, since we maximize over both the split point k and the head h′. The time complexity of the algorithm is therefore O(RM5N), where R is the number of rules in the grammar. Fortunately, more efficient solutions are possible. In general, the complexity of parsing can be reduced to O(M4) in the length of the input; for a broad class of lexicalized CFGs, the complexity can be made cubic in the length of the input, just as in unlexicalized CFGs (Eisner, 2000).
Estimating lexicalized context-free grammars
The final problem for lexicalized parsing is how to estimate weights for lexicalized pro- ductions X(i) → Y (j) Z(k). These productions are said to be bilexical, because they involve scores over pairs of words: in the example meet the President of Mexico, we hope to choose the correct attachment point by modeling the bilexical affinities of (meet, of) and (President, of). The number of such word pairs is quadratic in the size of the vocabulary, making it difficult to estimate the weights of lexicalized production rules directly from data. This is especially true for probabilistic context-free grammars, in which the weights are obtained from smoothed relative frequency. In a treebank with a million tokens, a
Under contract with MIT Press, shared under CC-BY-NC-ND license.

248 CHAPTER10. CONTEXT-FREEPARSING
vanishingly small fraction of the possible lexicalized productions will be observed more than once.7 The Charniak (1997) and Collins (1997) parsers therefore focus on approxi- mating the probabilities of lexicalized productions, using various smoothing techniques and independence assumptions.
In discriminatively-trained weighted context-free grammars, the scores for each pro- duction can be computed from a set of features, which can be made progressively more fine-grained (Finkel et al., 2008). For example, the score of the lexicalized production NP(President) → NP(President) PP(of) can be computed from the following features:
f(NP(President) → NP(President) PP(of)) = {NP(*) → NP(*) PP(*),
NP(President) → NP(President) PP(*),
NP(*) → NP(*) PP(of),
NP(President) → NP(President) PP(of)}
The first feature scores the unlexicalized production NP → NP PP; the next two features lexicalize only one element of the production, thereby scoring the appropriateness of NP attachment for the individual words President and of ; the final feature scores the specific bilexical affinity of President and of. For bilexical pairs that are encountered frequently in the treebank, this bilexical feature can play an important role in parsing; for pairs that are absent or rare, regularization will drive its weight to zero, forcing the parser to rely on the more coarse-grained features.
In chapter 14, we will encounter techniques for clustering words based on their distri- butional properties — the contexts in which they appear. Such a clustering would group rare and common words, such as whale, shark, beluga, Leviathan. Word clusters can be used as features in discriminative lexicalized parsing, striking a middle ground between full lexicalization and non-terminals (Finkel et al., 2008). In this way, labeled examples con- taining relatively common words like whale can help to improve parsing for rare words like beluga, as long as those two words are clustered together.
10.5.3 *Refinement grammars
Lexicalization improves on context-free parsing by adding detailed information in the form of lexical heads. However, estimating the scores of lexicalized productions is dif- ficult. Klein and Manning (2003) argue that the right level of linguistic detail is some- where between treebank categories and individual words. Some parts-of-speech and non- terminals are truly substitutable: for example, cat/N and dog/N. But others are not: for example, the preposition of exclusively attaches to nouns, while the preposition as is more
7 The real situation is even more difficult, because non-binary context-free grammars can involve trilexical or higher-order dependencies, between the head of the constituent and multiple of its children (Carreras et al., 2008).
Jacob Eisenstein. Draft of October 15, 2018.

10.5. GRAMMARREFINEMENT 249
likely to modify verb phrases. Klein and Manning (2003) obtained a 2% improvement in F -MEASURE on a parent-annotated PCFG parser by making a single change: splitting the preposition category into six subtypes. They propose a series of linguistically-motivated refinements to the Penn Treebank annotations, which in total yielded a 40% error reduc- tion.
Non-terminal refinement process can be automated by treating the refined categories as latent variables. For example, we might split the noun phrase non-terminal into NP1, NP2, NP3, . . . , without defining in advance what each refined non-terminal cor- responds to. This can be treated as partially supervised learning, similar to the multi- component document classification model described in subsection 5.2.3. A latent variable PCFG can be estimated by expectation maximization (Matsuzaki et al., 2005):8
• In the E-step, estimate a marginal distribution q over the refinement type of each non-terminal in each derivation. These marginals are constrained by the original annotation: an NP can be reannotated as NP4, but not as VP3. Marginal probabil- ities over refined productions can be computed from the inside-outside algorithm, as described in subsection 10.4.3, where the E-step enforces the constraints imposed by the original annotations.
• In the M-step, recompute the parameters of the grammar, by summing over the probabilities of anchored productions that were computed in the E-step:
􏰁M 􏰁M 􏰁j i=0 j=i k=i
As usual, this process can be iterated to convergence. To determine the number of re- finement types for each tag, Petrov et al. (2006) apply a split-merge heuristic; Liang et al. (2007) and Finkel et al. (2007) apply Bayesian nonparametrics (Cohen, 2016).
Some examples of refined non-terminals are shown in Table 10.4. The proper nouns differentiate months, first names, middle initials, last names, first names of places, and second names of places; each of these will tend to appear in different parts of grammatical productions. The personal pronouns differentiate grammatical role, with PRP-0 appear- ing in subject position at the beginning of the sentence (note the capitalization), PRP-1 appearing in subject position but not at the beginning of the sentence, and PRP-2 appear- ing in object position.
8Spectral learning, described in subsection 5.5.2, has also been applied to refinement grammars (Cohen et al., 2014).
Under contract with MIT Press, shared under CC-BY-NC-ND license.
E[count(X → Y Z)] =
p(X → Y Z,(i,j,k) | w). [10.43]

250
CHAPTER10. CONTEXT-FREEPARSING
Proper nouns NNP-14 NNP-12 NNP-2 NNP-1 NNP-15 NNP-3
Personal Pronouns PRP-0
PRP-1
PRP-2
Oct. Nov. Sept. John Robert James J. E. L. Bush Noriega Peters New San Wall York Francisco Street
It He I
it he they it them him
Table 10.4: Examples of automatically refined non-terminals and some of the words that they generate (Petrov et al., 2006).
10.6 Beyond context-free parsing
In the context-free setting, the score for a parse is a combination of the scores of individual productions. As we have seen, these models can be improved by using finer-grained non- terminals, via parent-annotation, lexicalization, and automated refinement. However, the inherent limitations to the expressiveness of context-free parsing motivate the consider- ation of other search strategies. These strategies abandon the optimality guaranteed by bottom-up parsing, in exchange for the freedom to consider arbitrary properties of the proposed parses.
10.6.1 Reranking
A simple way to relax the restrictions of context-free parsing is to perform a two-stage pro- cess, in which a context-free parser generates a k-best list of candidates, and a reranker then selects the best parse from this list (Charniak and Johnson, 2005; Collins and Koo, 2005). The reranker can be trained from an objective that is similar to multi-class classi- fication: the goal is to learn weights that assign a high score to the reference parse, or to the parse on the k-best list that has the lowest error. In either case, the reranker need only evaluate the K best parses, and so no context-free assumptions are necessary. This opens the door to more expressive scoring functions:
• It is possible to incorporate arbitrary non-local features, such as the structural par- allelism and right-branching orientation of the parse (Charniak and Johnson, 2005).
• Reranking enables the use of recursive neural networks, in which each constituent span wi+1:j receives a vector ui,j which is computed from the vector representa-
Jacob Eisenstein. Draft of October 15, 2018.

10.6. BEYONDCONTEXT-FREEPARSING 251 tions of its children, using a composition function that is linked to the production
rule (Socher et al., 2013), e.g.,
ui,j = f 􏰑ΘX→Y Z 􏰔 ui,k 􏰕􏰒 [10.44]
uk,j
The overall score of the parse can then be computed from the final vector, Ψ(τ) =
θu0,M .
Reranking can yield substantial improvements in accuracy. The main limitation is that it can only find the best parse among the K-best offered by the generator, so it is inherently limited by the ability of the bottom-up parser to find high-quality candidates.
10.6.2 Transition-based parsing
Structure prediction can be viewed as a form of search. An alternative to bottom-up pars- ing is to read the input from left-to-right, gradually building up a parse structure through a series of transitions. Transition-based parsing is described in more detail in the next chapter, in the context of dependency parsing. However, it can also be applied to CFG parsing, as briefly described here.
For any context-free grammar, there is an equivalent pushdown automaton, a model of computation that accepts exactly those strings that can be derived from the grammar. This computational model consumes the input from left to right, while pushing and pop- ping elements on a stack. This architecture provides a natural transition-based parsing framework for context-free grammars, known as shift-reduce parsing.
Shift-reduce parsing is a type of transition-based parsing, in which the parser can take the following actions:
• shift the next terminal symbol onto the stack;
• unary-reduce the top item on the stack, using a unary production rule in the gram-
mar;
• binary-reduce the top two items onto the stack, using a binary production rule in the grammar.
The set of available actions is constrained by the situation: the parser can only shift if there are remaining terminal symbols in the input, and it can only reduce if an applicable production rule exists in the grammar. If the parser arrives at a state where the input has been completely consumed, and the stack contains only the element S, then the input is accepted. If the parser arrives at a non-accepting state where there are no possible actions, the input is rejected. A parse error occurs if there is some action sequence that would accept an input, but the parser does not find it.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

252 CHAPTER10. CONTEXT-FREEPARSING Example Consider the input we eat sushi and the grammar in Table 10.1. The input can
be parsed through the following sequence of actions:
1. Shift the first token we onto the stack.
2. Reduce the top item on the stack to NP, using the production NP → we.
3. Shift the next token eat onto the stack, and reduce it to V with the production V → eat.
4. Shift the final token sushi onto the stack, and reduce it to NP. The input has been completely consumed, and the stack contains [NP, V, NP].
5. Reduce the top two items using the production VP → V NP. The stack now con- tains [VP, NP].
6. ReducethetoptwoitemsusingtheproductionS→NPVP.Thestacknowcontains [S]. Since the input is empty, this is an accepting state.
One thing to notice from this example is that the number of shift actions is equal to the length of the input. The number of reduce actions is equal to the number of non-terminals in the analysis, which grows linearly in the length of the input. Thus, the overall time complexity of shift-reduce parsing is linear in the length of the input (assuming the com- plexity of each individual classification decision is constant in the length of the input). This is far better than the cubic time complexity required by CKY parsing.
Transition-based parsing as inference In general, it is not possible to guarantee that a transition-based parser will find the optimal parse, argmaxτ Ψ(τ;w), even under the usual CFG independence assumptions. We could assign a score to each anchored parsing action in each context, with ψ(a, c) indicating the score of performing action a in context c. One might imagine that transition-based parsing could efficiently find the derivation that maximizes the sum of such scores. But this too would require backtracking and searching over an exponentially large number of possible action sequences: if a bad decision is made at the beginning of the derivation, then it may be impossible to recover the optimal action sequence without backtracking to that early mistake. This is known as a search error. Transition-based parsers can incorporate arbitrary features, without the restrictive independence assumptions required by chart parsing; search errors are the price that must be paid for this flexibility.
Learning transition-based parsing Transition-based parsing can be combined with ma- chine learning by training a classifier to select the correct action in each situation. This classifier is free to choose any feature of the input, the state of the parser, and the parse history. However, there is no optimality guarantee: the parser may choose a suboptimal parse, due to a mistake at the beginning of the analysis. Nonetheless, some of the strongest
Jacob Eisenstein. Draft of October 15, 2018.

10.6. BEYONDCONTEXT-FREEPARSING 253
CFG parsers are based on the shift-reduce architecture, rather than CKY. A recent gener- ation of models links shift-reduce parsing with recurrent neural networks, updating a hidden state vector while consuming the input (e.g., Cross and Huang, 2016; Dyer et al., 2016). Learning algorithms for transition-based parsing are discussed in more detail in section 11.3.
Exercises
1. Design a grammar that handles English subject-verb agreement. Specifically, your grammar should handle the examples below correctly:
(10.4) a. b.
(10.5) a. b.
She sings. We sing.
*She sing. *We sings.
2. Extend your grammar from the previous problem to include the auxiliary verb can, so that the following cases are handled:
(10.6) a. b.
(10.7) a. b.
She can sing. We can sing.
*She can sings. *We can sings.
3. French requires subjects and verbs to agree in person and number, and it requires determiners and nouns to agree in gender and number. Verbs and their objects need not agree. Assuming that French has two genders (feminine and masculine), three persons (first [me], second [you], third [her]), and two numbers (singular and plural), how many productions are required to extend the following simple grammar to handle agreement?
S →NPVP
VP → V|VNP|VNPNP NP → DET NN
4. Consider the grammar:
Under contract with MIT Press, shared under CC-BY-NC-ND license.

254
CHAPTER10. CONTEXT-FREEPARSING
S →NPVP VP → VNP
NP → NP → V → JJ →
JJNP
fish (the animal)
fish (the action of fishing)
fish (a modifier, as in fish sauce or fish stew)
Apply the CKY algorithm and identify all possible parses for the sentence fish fish fish fish.
5. Choose one of the possible parses for the previous problem, and show how it can be derived by a series of shift-reduce actions.
6. To handle VP coordination, a grammar includes the production VP → VP CC VP. To handle adverbs, it also includes the production VP → VP ADV. Assume all verbs are generated from a sequence of unary productions, e.g., VP → V → eat.
a) Show how to binarize the production VP → VP CC VP.
b) Use your binarized grammar to parse the sentence They eat and drink together,
treating together as an adverb.
c) Prove that a weighted CFG cannot distinguish the two possible derivations of this sentence. Your explanation should focus on the productions in the original, non-binary grammar.
d) Explain what condition must hold for a parent-annotated WCFG to prefer the derivation in which together modifies the coordination eat and drink.
7. Consider the following PCFG:
p(X → X X) = 1
2
p(X →Y)= 1 2
p(Y →σ)= 1 ,∀σ∈Σ |Σ|
[10.45] [10.46] [10.47]
a) Compute the probability p(τˆ) of the maximum probability parse for a string w ∈ ΣM .
b) Compute the conditional probability p(τˆ | w).
8. Context-free grammars can be used to parse the internal structure of words. Us- ing the weighted CKY algorithm and the following weighted context-free grammar, identify the best parse for the sequence of morphological segments in+flame+able.
Jacob Eisenstein. Draft of October 15, 2018.

10.6. BEYONDCONTEXT-FREEPARSING 255
S→V0 S→N0 S→J0 V → VPrefN -1 J →NJSuff1 J →VJSuff0 J → NegPrefJ 1 VPref→in+ 2 NegPref → in+ 1 N →flame 0 JSuff → +able 0
9. Use the inside and outside scores to compute the marginal probability p(Xi+1:j → Yi+1:k Zk+1:j | w), indicating that Y spans wi+1:k, Z spans wk+1:j, and X is the parent of Y and Z, span-
ning wi+1:j .
10. Suppose that the potentials Ψ(X → α) are log-probabilities, so that 􏰀α exp Ψ(X → α) = 1 for all X. Verify that the semiring􏰀inside recurrence from Equation 10.26 generates
the log-probability log p(w) = log τ :yield(τ )=w p(τ ).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Chapter 11 Dependency parsing
The previous chapter discussed algorithms for analyzing sentences in terms of nested con- stituents, such as noun phrases and verb phrases. However, many of the key sources of ambiguity in phrase-structure analysis relate to questions of attachment: where to attach a prepositional phrase or complement clause, how to scope a coordinating conjunction, and so on. These attachment decisions can be represented with a more lightweight structure: a directed graph over the words in the sentence, known as a dependency parse. Syn- tactic annotation has shifted its focus to such dependency structures: at the time of this writing, the Universal Dependencies project offers more than 100 dependency treebanks for more than 60 languages.1 This chapter will describe the linguistic ideas underlying dependency grammar, and then discuss exact and transition-based parsing algorithms. The chapter will also discuss recent research on learning to search in transition-based structure prediction.
11.1 Dependency grammar
While dependency grammar has a rich history of its own (Tesnie`re, 1966; Ku ̈bler et al., 2009), it can be motivated by extension from the lexicalized context-free grammars that we encountered in previous chapter (subsection 10.5.2). Recall that lexicalization augments each non-terminal with a head word. The head of a constituent is identified recursively, using a set of head rules, as shown in Table 10.3. An example of a lexicalized context-free parse is shown in Figure 11.1a. In this sentence, the head of the S constituent is the main verb, scratch; this non-terminal then produces the noun phrase the cats, whose head word is cats, and from which we finally derive the word the. Thus, the word scratch occupies the central position for the sentence, with the word cats playing a supporting role. In turn, cats
1 universaldependencies.org
257

258
CHAPTER11. DEPENDENCYPARSING
NP(cats)
VP(scratch)
NP(people)
S(scratch)
DT The
NNS cats
VB scratch
PP(with)
NNS IN NP(claws)
people with
NNS claws
(a) lexicalized constituency parse
The cats scratch people with claws
(b) unlabeled dependency tree
Figure 11.1: Dependency grammar is closely linked to lexicalized context free grammars: each lexical head has a dependency path to every other word in the constituent. (This ex- ample is based on the lexicalization rules from subsection 10.5.2, which make the preposi- tion the head of a prepositional phrase. In the more contemporary Universal Dependen- cies annotations, the head of with claws would be claws, so there would be an edge scratch → claws.)
occupies the central position for the noun phrase, with the word the playing a supporting role.
The relationships between words in a sentence can be formalized in a directed graph, based on the lexicalized phrase-structure parse: create an edge (i, j) iff word i is the head of a phrase whose child is a phrase headed by word j. Thus, in our example, we would have scratch → cats and cats → the. We would not have the edge scratch → the, because although S(scratch) dominates DET(the) in the phrase-structure parse tree, it is not its im- mediate parent. These edges describe syntactic dependencies, a bilexical relationship between a head and a dependent, which is at the heart of dependency grammar.
Continuing to build out this dependency graph, we will eventually reach every word in the sentence, as shown in Figure 11.1b. In this graph — and in all graphs constructed in this way — every word has exactly one incoming edge, except for the root word, which is indicated by a special incoming arrow from above. Furthermore, the graph is weakly connected: if the directed edges were replaced with undirected edges, there would be a path between all pairs of nodes. From these properties, it can be shown that there are no cycles in the graph (or else at least one node would have to have more than one incoming edge), and therefore, the graph is a tree. Because the graph includes all vertices, it is a spanning tree.
11.1.1 Heads and dependents
A dependency edge implies an asymmetric syntactic relationship between the head and dependent words, sometimes called modifiers. For a pair like the cats or cats scratch, how
Jacob Eisenstein. Draft of October 15, 2018.

11.1. DEPENDENCYGRAMMAR 259 do we decide which is the head? Here are some possible criteria:
• The head sets the syntactic category of the construction: for example, nouns are the heads of noun phrases, and verbs are the heads of verb phrases.
• The modifier may be optional while the head is mandatory: for example, in the sentence cats scratch people with claws, the subtrees cats scratch and cats scratch people are grammatical sentences, but with claws is not.
• The head determines the morphological form of the modifier: for example, in lan- guages that require gender agreement, the gender of the noun determines the gen- der of the adjectives and determiners.
• Edges should first connect content words, and then connect function words.
These guidelines are not universally accepted, and they sometimes conflict. The Uni- versal Dependencies (UD) project has attempted to identify a set of principles that can be applied to dozens of different languages (Nivre et al., 2016).2 These guidelines are based on the universal part-of-speech tags from chapter 8. They differ somewhat from the head rules described in subsection 10.5.2: for example, on the principle that depen- dencies should relate content words, the prepositional phrase with claws would be headed by claws, resulting in an edge scratch → claws, and another edge claws → with.
One objection to dependency grammar is that not all syntactic relations are asymmet- ric. One such relation is coordination (Popel et al., 2013): in the sentence, Abigail and Max like kimchi (Figure 11.2), which word is the head of the coordinated noun phrase Abigail and Max? Choosing either Abigail or Max seems arbitrary; fairness argues for making and the head, but this seems like the least important word in the noun phrase, and selecting it would violate the principle of linking content words first. The Universal Dependencies annotation system arbitrarily chooses the left-most item as the head — in this case, Abigail — and includes edges from this head to both Max and the coordinating conjunction and. These edges are distinguished by the labels CONJ (for the thing begin conjoined) and CC (for the coordinating conjunction). The labeling system is discussed next.
11.1.2 Labeled dependencies
Edges may be labeled to indicate the nature of the syntactic relation that holds between the two elements. For example, in Figure 11.2, the label NSUBJ on the edge from like to Abigail indicates that the subtree headed by Abigail is the noun subject of the verb like; similarly, the label OBJ on the edge from like to kimchi indicates that the subtree headed by
2The latest and most specific guidelines are available at universaldependencies.org/ guidelines.html
Under contract with MIT Press, shared under CC-BY-NC-ND license.

260
CHAPTER11. DEPENDENCYPARSING
nsubj conj
root
conj cc
cc
Abigail and Max like kimchi but not jook
Figure 11.2: In the Universal Dependencies annotation system, the left-most item of a coordination is the head.
obj advmod
root
nsubj
I know
obj
compound compound New York pizza
punct conj
and
cc
this is
nsubj
cop advmod
not it !!
Figure 11.3: A labeled dependency parse from the English UD Treebank (reviews-361348- 0006)
kimchi is the object.3 The negation not is treated as an adverbial modifier (ADVMOD) on the noun jook.
A slightly more complex example is shown in Figure 11.3. The multiword expression New York pizza is treated as a “flat” unit of text, with the elements linked by the COM- POUND relation. The sentence includes two clauses that are conjoined in the same way that noun phrases are conjoined in Figure 11.2. The second clause contains a copula verb (see subsection 8.1.1). For such clauses, we treat the “object” of the verb as the root — in this case, it — and label the verb as a dependent, with the COP relation. This example also shows how punctuations are treated, with label PUNCT.
11.1.3 Dependency subtrees and constituents
Dependency trees hide information that would be present in a CFG parse. Often what is hidden is in fact irrelevant: for example, Figure 11.4 shows three different ways of
3Earlier work distinguished direct and indirect objects (De Marneffe and Manning, 2008), but this has been dropped in version 2.0 of the Universal Dependencies annotation system.
Jacob Eisenstein. Draft of October 15, 2018.

11.1. DEPENDENCYGRAMMAR
261
VP
V ate
VP
NP PP PP dinner on the table with a fork
(a) Flat
VP
VP PP PP
VP
VP PP
PP with a fork
V ate
NP on the table dinner
(b) Chomsky adjunction
V NP on the table with a fork ate dinner
(c) Two-level (PTB-style)
ate dinner on the table with a fork
(d) Dependency representation
Figure 11.4: The three different CFG analyses of this verb phrase all correspond to a single dependency structure.
representing prepositional phrase adjuncts to the verb ate. Because there is apparently no meaningful difference between these analyses, the Penn Treebank decides by convention to use the two-level representation (see Johnson, 1998, for a discussion). As shown in Figure 11.4d, these three cases all look the same in a dependency parse.
But dependency grammar imposes its own set of annotation decisions, such as the identification of the head of a coordination (subsection 11.1.1); without lexicalization, context-free grammar does not require either element in a coordination to be privileged in this way. Dependency parses can be disappointingly flat: for example, in the sentence Yesterday, Abigail was reluctantly giving Max kimchi, the root giving is the head of every dependency! The constituent parse arguably offers a more useful structural analysis for such cases.
Projectivity Thus far, we have defined dependency trees as spanning trees over a graph in which each word is a vertex. As we have seen, one way to construct such trees is by connecting the heads in a lexicalized constituent parse. However, there are spanning trees that cannot be constructed in this way. Syntactic constituents are contiguous spans. In a spanning tree constructed from a lexicalized constituent parse, the head h of any con- stituent that spans the nodes from i to j must have a path to every node in this span. This is property is known as projectivity, and projective dependency parses are a restricted class of spanning trees. Informally, projectivity means that “crossing edges” are prohib-
Under contract with MIT Press, shared under CC-BY-NC-ND license.

262
CHAPTER11. DEPENDENCYPARSING
% non-projective edges
Czech 1.86% English 0.39% German 2.33%
% non-projective sentences
22.42% 7.63% 28.19%
Table 11.1: Frequency of non-projective dependencies in three languages (Kuhlmann and Nivre, 2010)
root
obl:tmod
ob j nsubj det
acl:relcl
nsub j cop
Lucia ate a pizza yesterday which was vegetarian
Figure 11.5: An example of a non-projective dependency parse. The “crossing edge” arises from the relative clause which was vegetarian and the oblique temporal modifier yesterday.
ited. The formal definition follows:
Definition 2 (Projectivity). An edge from i to j is projective iff all k between i and j are descen-
dants of i. A dependency parse is projective iff all its edges are projective.
Figure 11.5 gives an example of a non-projective dependency graph in English. This dependency graph does not correspond to any constituent parse. As shown in Table 11.1, non-projectivity is more common in languages such as Czech and German. Even though relatively few dependencies are non-projective in these languages, many sentences have at least one such dependency. As we will soon see, projectivity has important algorithmic consequences.
11.2 Graph-based dependency parsing
Let y = {(i →−r j)} represent a dependency graph, in which each edge is a relation r from head word i ∈ {1,2,…,M,ROOT} to modifier j ∈ {1,2,…,M}. The special node ROOT indicates the root of the graph, and M is the length of the input |w|. Given a scoring function Ψ(y, w; θ), the optimal parse is,
yˆ = argmax Ψ(y, w; θ), [11.1] y∈Y(w)
Jacob Eisenstein. Draft of October 15, 2018.

11.2. GRAPH-BASEDDEPENDENCYPARSING 263
First order h m
Secondorder h s m g h m
Thirdorder ghsmhtsm
Figure 11.6: Feature templates for higher-order dependency parsing
where Y(w) is the set of valid dependency parses on the input w. As usual, the number of possible labels |Y(w)| is exponential in the length of the input (Wu and Chao, 2004). Algorithms that search over this space of possible graphs are known as graph-based de- pendency parsers.
In sequence labeling and constituent parsing, it was possible to search efficiently over an exponential space by choosing a feature function that decomposes into a sum of local feature vectors. A similar approach is possible for dependency parsing, by requiring the scoring function to decompose across dependency arcs:
Ψ(y,w;θ)= 􏰁 ψ(i→−r j,w;θ). [11.2] i→−r j∈y
Dependency parsers that operate under this assumption are known as arc-factored, since the score of a graph is the product of the scores of all arcs.
Higher-order dependency parsing The arc-factored decomposition can be relaxed to al- low higher-order dependencies. In second-order dependency parsing, the scoring func- tion may include grandparents and siblings, as shown by the templates in Figure 11.6. The scoring function is,
Ψ(y, w; θ) =
􏰁 ψparent(i →−r j, w; θ) i→−rj∈y􏰁 r′
+ ψgrandparent(i→− j,k,r,w;θ) r′
k−→􏰁i∈y
ψsibling(i →−r j, s, r′, w; θ). [11.3] The top line scores computes a scoring function that includes the grandparent k; the
bottom line computes a scoring function for each sibling s. For projective dependency Under contract with MIT Press, shared under CC-BY-NC-ND license.
+
r′ i−→s∈y
s̸=j

264 CHAPTER11. DEPENDENCYPARSING
graphs, there are efficient algorithms for second-order and third-order dependency pars- ing (Eisner, 1996; McDonald and Pereira, 2006; Koo and Collins, 2010); for non-projective dependency graphs, second-order dependency parsing is NP-hard (McDonald and Pereira, 2006). The specific algorithms are discussed in the next section.
11.2.1 Graph-based parsing algorithms
The distinction between projective and non-projective dependency trees (subsection 11.1.3) plays a key role in the choice of algorithms. Because projective dependency trees are closely related to (and can be derived from) lexicalized constituent trees, lexicalized pars- ing algorithms can be applied directly. For the more general problem of parsing to arbi- trary spanning trees, a different class of algorithms is required. In both cases, arc-factored dependency parsing relies on precomputing the scores ψ(i →−r j, w; θ) for each potential edge. There are O(M2R) such scores, where M is the length of the input and R is the number of dependency relation types, and this is a lower bound on the time and space complexity of any exact algorithm for arc-factored dependency parsing.
Projective dependency parsing
Any lexicalized constituency tree can be converted into a projective dependency tree by creating arcs between the heads of constituents and their parents, so any algorithm for lex- icalized constituent parsing can be converted into an algorithm for projective dependency parsing, by converting arc scores into scores for lexicalized productions. As noted in sub- section 10.5.2, there are cubic time algorithms for lexicalized constituent parsing, which are extensions of the CKY algorithm. Therefore, arc-factored projective dependency pars- ing can be performed in cubic time in the length of the input.
Second-order projective dependency parsing can also be performed in cubic time, with minimal modifications to the lexicalized parsing algorithm (Eisner, 1996). It is possible to go even further, to third-order dependency parsing, in which the scoring function may consider great-grandparents, grand-siblings, and “tri-siblings”, as shown in Figure 11.6. Third-order dependency parsing can be performed in O(M4) time, which can be made practical through the use of pruning to eliminate unlikely edges (Koo and Collins, 2010).
Non-projective dependency parsing
In non-projective dependency parsing, the goal is to identify the highest-scoring span- ning tree over the words in the sentence. The arc-factored assumption ensures that the score for each spanning tree will be computed as a sum over scores for the edges, which are precomputed. Based on these scores, we build a weighted connected graph. Arc- factored non-projective dependency parsing is then equ􏰀ivalent to finding the spanning tree that achieves the maximum total score, Ψ(y, w) = i→−r j∈y ψ(i →−r j, w). The Chu-
Jacob Eisenstein. Draft of October 15, 2018.

11.2. GRAPH-BASEDDEPENDENCYPARSING 265
Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) computes this maximum directed spanning tree efficiently. It does this by first identifying the best incoming edge i →−r j for each vertex j. If the resulting graph does not contain cycles, it is the maxi- mum spanning tree. If there is a cycle, it is collapsed into a super-vertex, whose incoming and outgoing edges are based on the edges to the vertices in the cycle. The algorithm is then applied recursively to the resulting graph, and process repeats until a graph without cycles is obtained.
The time complexity of identifying the best incoming edge for each vertex is O(M2R), where M is the length of the input and R is the number of relations; in the worst case, the number of cycles is O(M). Therefore, the complexity of the Chu-Liu-Edmonds algorithm is O(M3R). This complexity can be reduced to O(M2N) by storing the edge scores in a Fibonnaci heap (Gabow et al., 1986). For more detail on graph-based parsing algorithms, seeEisner(1997)andKu ̈bleretal.(2009).
Higher-order non-projective dependency parsing Given the tractability of higher-order projective dependency parsing, you may be surprised to learn that non-projective second- order dependency parsing is NP-Hard. This can be proved by reduction from the vertex cover problem (Neuhaus and Bro ̈ker, 1997). A heuristic solution is to do projective pars- ing first, and then post-process the projective dependency parse to add non-projective edges (Nivre and Nilsson, 2005). More recent work has applied techniques for approxi- mate inference in graphical models, including belief propagation (Smith and Eisner, 2008), integer linear programming (Martins et al., 2009), variational inference (Martins et al., 2010), and Markov Chain Monte Carlo (Zhang et al., 2014).
11.2.2 Computing scores for dependency arcs
The arc-factored scoring function ψ(i →−r j, w; θ) can be defined in several ways:
Linear Neural Generative
ψ(i →−r j, w; θ) = θ · f (i →−r j, w)
ψ(i →−r j, w; θ) = Feedforward([uwi ; uwj ]; θ) ψ(i →−r j, w; θ) = log p(wj , r | wi).
[11.4] [11.5] [11.6]
Linear feature-based arc scores
Linear models for dependency parsing incorporate many of the same features used in sequence labeling and discriminative constituent parsing. These include:
• the length and direction of the arc;
• the words wi and wj linked by the dependency relation; • the prefixes, suffixes, and parts-of-speech of these words;
Under contract with MIT Press, shared under CC-BY-NC-ND license.

266 CHAPTER11. DEPENDENCYPARSING • the neighbors of the dependency arc, wi−1, wi+1, wj−1, wj+1;
• the prefixes, suffixes, and part-of-speech of these neighbor words.
Each of these features can be conjoined with the dependency edge label r. Note that features in an arc-factored parser can refer to words other than wi and wj . The restriction is that the features consider only a single arc.
Bilexical features (e.g., sushi → chopsticks) are powerful but rare, so it is useful to aug- ment them with coarse-grained alternatives, by “backing off” to the part-of-speech or affix. For example, the following features are created by backing off to part-of-speech tags in an unlabeled dependency parser:
f (3 →− 5, we eat sushi with chopsticks) = ⟨sushi → chopsticks, sushi → NNS,
NN → chopsticks, NNS → NN⟩.
Regularized discriminative learning algorithms can then trade off between features at varying levels of detail. McDonald et al. (2005) take this approach as far as tetralexical features (e.g., (wi,wi+1,wj−1,wj)). Such features help to avoid choosing arcs that are un- likely due to the intervening words: for example, there is unlikely to be an edge between two nouns if the intervening span contains a verb. A large list of first and second-order features is provided by Bohnet (2010), who uses a hashing function to store these features efficiently.
Neural arc scores
Given vector representations xi for each word wi in the input, a set of arc scores can be computed from a feedforward neural network:
ψ(i →−r j,w;θ) =FeedForward([xi;xj];θr), [11.7]
where unique weights θr are available for each arc type (Pei et al., 2015; Kiperwasser and Goldberg, 2016). Kiperwasser and Goldberg (2016) use a feedforward network with a single hidden layer,
z =g(Θ [x ;x ]+b(z)) [11.8] rijr
ψ(i→−r j)=βz+b(y), [11.9] rr
where Θr is a matrix, βr is a vector, each br is a scalar, and the function g is an elementwise tanh activation function.
Jacob Eisenstein. Draft of October 15, 2018.

11.2. GRAPH-BASEDDEPENDENCYPARSING 267
The vector xi can be set equal to the word embedding, which may be pre-trained or learned by backpropagation (Pei et al., 2015). Alternatively, contextual information can be incorporated by applying a bidirectional recurrent neural network across the input, as described in section 7.6. The RNN hidden states at each word can be used as inputs to the arc scoring function (Kiperwasser and Goldberg, 2016).
Feature-based arc scores are computationally expensive, due to the costs of storing and searching a huge table of weights. Neural arc scores can be viewed as a compact solution to this problem. Rather than working in the space of tuples of lexical features, the hidden layers of a feedforward network can be viewed as implicitly computing fea- ture combinations, with each layer of the network evaluating progressively more words. An early paper on neural dependency parsing showed substantial speed improvements at test time, while also providing higher accuracy than feature-based models (Chen and Manning, 2014).
Probabilistic arc scores
If each arc score is equal to the log probability logp(wj,r | wi), then the sum of scores gives the log probability of the sentence and arc labels, by the chain rule. For example, consider the unlabeled parse of we eat sushi with rice,
y ={(R􏰁OOT, 2), (2, 1), (2, 3), (3, 5), (5, 4)} log p(w | y) = log p(wj | wi)
(i→j )∈y
= log p(eat | ROOT) + log p(we | eat) + log p(sushi | eat)
+ log p(rice | sushi) + log p(with | rice).
Probabilistic generative models are used in combination with expectation-maximization
(chapter 5) for unsupervised dependency parsing (Klein and Manning, 2004).
11.2.3 Learning
Having formulated graph-based dependency parsing as a structure prediction problem, we can apply similar learning algorithms to those used in sequence labeling. Given a loss function l(θ; w(i), y(i)), we can compute gradient-based updates to the parameters. For a model with feature-based arc scores and a perceptron loss, we obtain the usual structured perceptron update,
yˆ = argmax θ · f(w, y′) [11.13] y′∈Y(w)
θ =θ + f (w, y) − f (w, yˆ) [11.14] Under contract with MIT Press, shared under CC-BY-NC-ND license.
[11.10] [11.11]
[11.12]

268 CHAPTER11. DEPENDENCYPARSING
In this case, the argmax requires a maximization over all dependency trees for the sen- tence, which can be computed using the algorithms described in subsection 11.2.1. We can apply all the usual tricks from section 2.3: weight averaging, a large margin objective, and regularization. McDonald et al. (2005) were the first to treat dependency parsing as a structure prediction problem, using MIRA, an online margin-based learning algorithm. Neural arc scores can be learned in the same way, backpropagating from a margin loss to updates on the feedforward network that computes the score for each edge.
A conditional random field for arc-factored dependency parsing is built on the proba- bility model,
exp􏰀i→−r j∈y ψ(i →−r j,w;θ)
p(y | w) = 􏰀y′∈Y(w) exp 􏰀i→−r j∈y′ ψ(i →−r j, w; θ) [11.15]
Such a model is trained to minimize the negative log conditional-likelihood. Just as in CRF sequence models (subsection 7.5.3) and the logistic regression classifier (section 2.5), the gradients involve marginal probabilities p(i →−r j | w; θ), which in this case are prob- abilities over individual dependencies. In arc-factored models, these probabilities can be computed in polynomial time. For projective dependency trees, the marginal probabilities can be computed in cubic time, using a variant of the inside-outside algorithm (Lari and Young, 1990). For non-projective dependency parsing, marginals can also be computed in cubic time, using the matrix-tree theorem (Koo et al., 2007; McDonald et al., 2007; Smith andSmith,2007).DetailsofthesemethodsaredescribedbyKu ̈bleretal.(2009).
11.3 Transition-based dependency parsing
Graph-based dependency parsing offers exact inference, meaning that it is possible to re- cover the best-scoring parse for any given model. But this comes at a price: the scoring function is required to decompose into local parts — in the case of non-projective parsing, these parts are restricted to individual arcs. These limitations are felt more keenly in de- pendency parsing than in sequence labeling, because second-order dependency features are critical to correctly identify some types of attachments. For example, prepositional phrase attachment depends on the attachment point, the object of the preposition, and the preposition itself; arc-factored scores cannot account for all three of these features si- multaneously. Graph-based dependency parsing may also be criticized on the basis of intuitions about human language processing: people read and listen to sentences sequen- tially, incrementally building mental models of the sentence structure and meaning before getting to the end (Jurafsky, 1996). This seems hard to reconcile with graph-based algo- rithms, which perform bottom-up operations on the entire sentence, requiring the parser to keep every word in memory. Finally, from a practical perspective, graph-based depen- dency parsing is relatively slow, running in cubic time in the length of the input.
Jacob Eisenstein. Draft of October 15, 2018.

11.3. TRANSITION-BASEDDEPENDENCYPARSING 269
Transition-based algorithms address all three of these objections. They work by mov- ing through the sentence sequentially, while performing actions that incrementally up- date a stored representation of what has been read thus far. As with the shift-reduce parser from subsection 10.6.2, this representation consists of a stack, onto which parsing substructures can be pushed and popped. In shift-reduce, these substructures were con- stituents; in the transition systems that follow, they will be projective dependency trees over partial spans of the input.4 Parsing is complete when the input is consumed and there is only a single structure on the stack. The sequence of actions that led to the parse is known as the derivation. One problem with transition-based systems is that there may be multiple derivations for a single parse structure — a phenomenon known as spurious ambiguity.
11.3.1 Transition systems for dependency parsing
A transition system consists of a representation for describing configurations of the parser, and a set of transition actions, which manipulate the configuration. There are two main transition systems for dependency parsing: arc-standard, which is closely related to shift- reduce, and arc-eager, which adds an additional action that can simplify derivations (Ab- ney and Johnson, 1991). In both cases, transitions are between configurations that are represented as triples, C = (σ, β, A), where σ is the stack, β is the input buffer, and A is the list of arcs that have been created (Nivre, 2008). In the initial configuration,
Cinitial = ([ROOT], w, ∅), [11.16] indicating that the stack contains only the special node ROOT, the entire input is on the
buffer, and the set of arcs is empty. An accepting configuration is,
Caccept = ([ROOT], ∅, A), [11.17]
where the stack contains only ROOT, the buffer is empty, and the arcs A define a spanning tree over the input. The arc-standard and arc-eager systems define a set of transitions between configurations, which are capable of transforming an initial configuration into an accepting configuration. In both of these systems, the number of actions required to parse an input grows linearly in the length of the input, making transition-based parsing considerably more efficient than graph-based methods.
Arc-standard
The arc-standard transition system is closely related to shift-reduce, and to the LR algo- rithm that is used to parse programming languages (Aho et al., 2006). It includes the following classes of actions:
4Transition systems also exist for non-projective dependency parsing (e.g., Nivre, 2008).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

270
CHAPTER11. DEPENDENCYPARSING
•
•
•
SHIFT: move the first item from the input buffer on to the top of the stack,
(σ, i|β, A) ⇒ (σ|i, β, A), [11.18]
where we write i|β to indicate that i is the leftmost item in the input buffer, and σ|i to indicate the result of pushing i on to stack σ.
ARC-LEFT: create a new left-facing arc of type r between the item on the top of the stack and the first item in the input buffer. The head of this arc is j, which remains at the front of the input buffer. The arc j →−r i is added to A. Formally,
(σ|i, j|β, A) ⇒ (σ, j|β, A ⊕ j →−r i), [11.19] where r is the label of the dependency arc, and ⊕ concatenates the new arc j →−r i to
the list A.
ARC-RIGHT: creates a new right-facing arc of type r between the item on the top of the stack and the first item in the input buffer. The head of this arc is i, which is “popped” from the stack and pushed to the front of the input buffer. The arc i →−r j is added to A. Formally,
(σ|i, j|β, A) ⇒ (σ, i|β, A ⊕ i →−r j), [11.20] where again r is the label of the dependency arc.
Each action has preconditions. The SHIFT action can be performed only when the buffer has at least one element. The ARC-LEFT action cannot be performed when the root node ROOT is on top of the stack, since this node must be the root of the entire tree. The ARC- LEFT and ARC-RIGHT remove the modifier words from the stack (in the case of ARC-LEFT) and from the buffer (in the case of ARC-RIGHT), so it is impossible for any word to have more than one parent. Furthermore, the end state can only be reached when every word is removed from the buffer and stack, so the set of arcs is guaranteed to constitute a spanning tree. An example arc-standard derivation is shown in Table 11.2.
Arc-eager dependency parsing
In the arc-standard transition system, a word is completely removed from the parse once it has been made the modifier in a dependency arc. At this time, any dependents of this word must have already been identified. Right-branching structures are common in English (and many other languages), with words often modified by units such as prepo- sitional phrases to their right. In the arc-standard system, this means that we must first shift all the units of the input onto the stack, and then work backwards, creating a series of arcs, as occurs in Table 11.2. Note that the decision to shift bagels onto the stack guarantees that the prepositional phrase with lox will attach to the noun phrase, and that this decision
Jacob Eisenstein. Draft of October 15, 2018.

11.3. TRANSITION-BASEDDEPENDENCYPARSING
271 arc added to A
(they ← like) (with ← lox)
(bagels → lox) (like → bagels) (ROOT → like)
σ
β
they like bagels with lox like bagels with lox
like bagels with lox bagels with lox
with lox lox
lox bagels like
∅
action
SHIFT ARC-LEFT SHIFT SHIFT SHIFT ARC-LEFT ARC-RIGHT ARC-RIGHT ARC-RIGHT DONE
1. [ROOT]
2. [ROOT, they]
3. [ROOT]
4. [ROOT,
5. [ROOT,
6. [ROOT,
7. [ROOT,
8. [ROOT,
9. [ROOT]
10. [ROOT]
like]
like, bagels]
like, bagels, with] like, bagels]
like]
Table 11.2: Arc-standard derivation of the unlabeled dependency parse for the input they like bagels with lox.
must be made before the prepositional phrase is itself parsed. This has been argued to be cognitively implausible (Abney and Johnson, 1991); from a computational perspective, it means that a parser may need to look several steps ahead to make the correct decision.
Arc-eager dependency parsing changes the ARC-RIGHT action so that right depen- dents can be attached before all of their dependents have been found. Rather than re- moving the modifier from both the buffer and stack, the ARC-RIGHT action pushes the modifier on to the stack, on top of the head. Because the stack can now contain elements that already have parents in the partial dependency graph, two additional changes are necessary:
• A precondition is required to ensure that the ARC-LEFT action cannot be applied when the top element on the stack already has a parent in A.
• A new REDUCE action is introduced, which can remove elements from the stack if they already have a parent in A:
(σ|i, β, A) ⇒ (σ, β, A). [11.21]
As a result of these changes, it is now possible to create the arc like → bagels before parsing the prepositional phrase with lox. Furthermore, this action does not imply a decision about whether the prepositional phrase will attach to the noun or verb. Noun attachment is chosen in the parse in Table 11.3, but verb attachment could be achieved by applying the REDUCE action at step 5 or 7.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

272
CHAPTER11. DEPENDENCYPARSING
σ
β action
arc added to A (they ← like)
(ROOT → like) (like → bagels)
(with ← lox) (bagels → lox)
1. [ROOT]
2. [ROOT, they]
3. [ROOT]
4. [ROOT, like]
5. [ROOT, like,
6. [ROOT, like,
7. [ROOT, like,
8. [ROOT, like,
9. [ROOT, like,
10. [ROOT, like]
11. [ROOT]
they like bagels with lox like bagels with lox
like bagels with lox bagels with lox
with lox lox
lox
∅
∅ ∅ ∅
SHIFT ARC-LEFT ARC-RIGHT ARC-RIGHT SHIFT ARC-LEFT ARC-RIGHT REDUCE REDUCE REDUCE DONE
bagels] bagels, with] bagels] bagels, lox] bagels]
Table 11.3: Arc-eager derivation of the unlabeled dependency parse for the input they like bagels with lox.
Projectivity
The arc-standard and arc-eager transition systems are guaranteed to produce projective dependency trees, because all arcs are between the word at the top of the stack and the left-most edge of the buffer (Nivre, 2008). Non-projective transition systems can be con- structed by adding actions that create arcs with words that are second or third in the stack (Attardi, 2006), or by adopting an alternative configuration structure, which main- tains a list of all words that do not yet have heads (Covington, 2001). In pseudo-projective dependency parsing, a projective dependency parse is generated first, and then a set of graph transformation techniques are applied, producing non-projective edges (Nivre and Nilsson, 2005).
Beam search
In “greedy” transition-based parsing, the parser tries to make the best decision at each configuration. This can lead to search errors, when an early decision locks the parser into a poor derivation. For example, in Table 11.2, if ARC-RIGHT were chosen at step 4, then the parser would later be forced to attach the prepositional phrase with lox to the verb likes. Note that the likes → bagels arc is indeed part of the correct dependency parse, but the arc-standard transition system requires it to be created later in the derivation.
Beam search is a general technique for ameliorating search errors in incremental de- coding.5 While searching, the algorithm maintains a set of partially-complete hypotheses,
5Beam search is used throughout natural language processing, and beyond. In this text, it appears again in coreference resolution (section 15.2.4) and machine translation (section 18.4).
Jacob Eisenstein. Draft of October 15, 2018.

11.3. TRANSITION-BASEDDEPENDENCYPARSING 273 t=1 t=2 t=3 t=4 t=5
􏰔 [Root] 􏰕 Shift 􏰔 [Root, they] 􏰕 Arc-Right 􏰔 [Root, they] 􏰕Arc-Right 􏰔 [Root, can] 􏰕 Arc-Right 􏰔 [Root] 􏰕 they can fish can fish fish ∅ ∅
Arc-Left 􏰔 [Root,can] 􏰕 􏰔 [Root,fish] 􏰕 􏰔 [Root] 􏰕 fish Arc-Left ∅ Arc-Right ∅
Figure 11.7: Beam search for unlabeled dependency parsing, with beam size K = 2. The arc lists for each configuration are not shown, but can be computed from the transitions.
called a beam. At step t of the derivation, there is a set of k hypotheses, each of which includes a score s(k) and a set of dependency arcs A(k):
tt
h(k) = (s(k), A(k)) [11.22] ttt
Each hypothesis is then “expanded” by considering the set of all valid actions from the current configuration c(k), written A(c(k)). This yields a large set of new hypotheses. For
each action a ∈ A(c(k)), we score the new hypothesis A(k) ⊕ a. The top k hypotheses tt
by this scoring metric are kept, and parsing proceeds to the next step (Zhang and Clark, 2008). Note that beam search requires a scoring function for action sequences, rather than individual actions. This issue will be revisited in the next section.
Figure 11.7 shows the application of beam search to dependency parsing, with a beam size of K = 2. For the first transition, the only valid action is SHIFT, so there is only one possible configuration at t = 2. From this configuration, there are three possible actions. The two best scoring actions are ARC-RIGHT and ARC-LEFT, and so the resulting hypotheses from these actions are on the beam at t = 3. From these configurations, there are three possible actions each, but the best two are expansions of the bottom hypothesis at t = 3. Parsing continues until t = 5, at which point both hypotheses reach an accepting state. The best-scoring hypothesis is then selected as the parse.
11.3.2 Scoring functions for transition-based parsers
Transition-based parsing requires selecting a series of actions. In greedy transition-based parsing, this can be done by training a classifier,
aˆ = argmax Ψ(a, c, w; θ), [11.23] a∈A(c)
where A(c) is the set of admissible actions in the current configuration c, w is the input, and Ψ is a scoring function with parameters θ (Yamada and Matsumoto, 2003).
A feature-based score can be computed, Ψ(a, c, w) = θ · f (a, c, w), using features that may consider any aspect of the current configuration and input sequence. Typical features
Under contract with MIT Press, shared under CC-BY-NC-ND license.
tt

274 CHAPTER11. DEPENDENCYPARSING
for transition-based dependency parsing include: the word and part-of-speech of the top element on the stack; the word and part-of-speech of the first, second, and third elements on the input buffer; pairs and triples of words and parts-of-speech from the top of the stack and the front of the buffer; the distance (in tokens) between the element on the top of the stack and the element in the front of the input buffer; the number of modifiers of each of these elements; and higher-order dependency features as described above in the section on graph-based dependency parsing (see, e.g., Zhang and Nivre, 2011).
Parse actions can also be scored by neural networks. For example, Chen and Manning (2014) build a feedforward network in which the input layer consists of the concatenation of embeddings of several words and tags:
• the top three words on the stack, and the first three words on the buffer;
• the first and second leftmost and rightmost children (dependents) of the top two
words on the stack;
• the leftmost and right most grandchildren of the top two words on the stack;
• embeddings of the part-of-speech tags of these words.
Let us call this base layer x(c, w), defined as, c =(σ, β, A)
x(c,w) =[vwσ1 ,vtσ1 vwσ2 ,vtσ2 ,vwσ3 ,vtσ3 ,vwβ1 ,vtβ1 ,vwβ2 ,vtβ2 ,…],
where vwσ1 is the embedding of the first word on the stack, vtβ2 is the embedding of the part-of-speech tag of the second word on the buffer, and so on. Given this base encoding of the parser state, the score for the set of possible actions is computed through a feedfor- ward network,
z =g(Θ(x→z)x(c, w)) [11.24] ψ(a, c, w; θ) =Θ(z→y)z, [11.25]
where the vector z plays the same role as the features f (a, c, w), but is a learned represen- tation. Chen and Manning (2014) use a cubic elementwise activation function, g(x) = x3, so that the hidden layer models products across all triples of input features. The learning algorithm updates the embeddings as well as the parameters of the feedforward network.
11.3.3 Learning to parse
Transition-based dependency parsing suffers from a mismatch between the supervision, which comes in the form of dependency trees, and the classifier’s prediction space, which is a set of parsing actions. One solution is to create new training data by converting parse trees into action sequences; another is to derive supervision directly from the parser’s performance.
Jacob Eisenstein. Draft of October 15, 2018.
a

11.3. TRANSITION-BASEDDEPENDENCYPARSING 275 Oracle-based training
A transition system can be viewed as a function from action sequences (derivations) to parse trees. The inverse of this function is a mapping from parse trees to derivations, which is called an oracle. For the arc-standard and arc-eager parsing system, an oracle can becomputedinlineartimeinthelengthofthederivation(Ku ̈bleretal.,2009,page32). Both the arc-standard and arc-eager transition systems suffer from spurious ambiguity: there exist dependency parses for which multiple derivations are possible, such as 1 ← 2 → 3.The oracle must choose between these different derivations. For example, the algorithmdescribedbyKu ̈bleretal.(2009)wouldfirstcreatetheleftarc(1←2),andthen create the right arc, (1 ← 2) → 3; another oracle might begin by shifting twice, resulting in the derivation 1 ← (2 → 3).
Given such an oracle, a dependency treebank can be converted into a set of oracle ac- tion sequences {A(i)}Ni=1. The parser can be trained by stepping through the oracle action sequences, and optimizing on an classification-based objective that rewards selecting the oracle action. For transition-based dependency parsing, maximum conditional likelihood is a typical choice (Chen and Manning, 2014; Dyer et al., 2015):
p(a | c, w) =􏰀 exp Ψ(a, c, w; θ) a′∈A(c) exp Ψ(a′, c, w; θ)
[11.26] [11.27]
N |A(i) |
θˆ = argmax 􏰁 􏰁 log p(a(i) | c(i), w),
θ i=1 t=1 where |A(i)| is the length of the action sequence A(i).
Recall that beam search requires a scoring function for action sequences. Such a score can be obtained by adding the log-likelihoods (or hinge losses) across all actions in the sequence (Chen and Manning, 2014).
Global objectives
The objective in Equation 11.27 is locally-normalized: it is the product of normalized probabilities over individual actions. A similar characterization could be made of non- probabilistic algorithms in which hinge-loss objectives are summed over individual ac- tions. In either case, training on individual actions can be sub-optimal with respect to global performance, due to the label bias problem (Lafferty et al., 2001; Andor et al., 2016).
As a stylized example, suppose that a given configuration appears 100 times in the training data, with action a1 as the oracle action in 51 cases, and a2 as the oracle action in the other 49 cases. However, in cases where a2 is correct, choosing a1 results in a cascade of subsequent errors, while in cases where a1 is correct, choosing a2 results in only a single
Under contract with MIT Press, shared under CC-BY-NC-ND license.
tt

276 CHAPTER11. DEPENDENCYPARSING error. A classifier that is trained on a local objective function will learn to always choose
a1, but choosing a2 would minimize the overall number of errors.
This observation motivates a global objective, such as the globally-normalized condi-
tional likelihood,
wherethedenominatorsumsoverthesetofallpossibleactionsequences,A(w).6 Inthe conditional random field model for sequence labeling (subsection 7.5.3), it was possible to compute this sum explicitly, using dynamic programming. In transition-based parsing, this is not possible. However, the sum can be approximated using beam search,
exp 􏰀|A(i)| Ψ(a(i), c(i), w)
p(A(i) |w;θ)= t=1 t t , [11.28]
􏰀 ′ exp􏰀|A′| Ψ(a′,c′,w) A∈A(w) t=1tt
|A′| K
tt tt
|A(k)|
􏰁 exp􏰁Ψ(a′,c′,w) ≈ 􏰁exp 􏰁 Ψ(a(k),c(k),w), [11.29]
A′ ∈A(w) t=1 k=1 t=1
where A(k) is an action sequence on a beam of size K. This gives rise to the following loss
function,
t=1 k=1 t=1
The derivatives of this loss involve expectations with respect to a probability distribution
over action sequences on the beam.
*Early update and the incremental perceptron
When learning in the context of beam search, the goal is to learn a decision function so that the gold dependency parse is always reachable from at least one of the partial derivations on the beam. (The combination of a transition system (such as beam search) and a scoring function for actions is known as a policy.) To achieve this, we can make an early update as soon as the oracle action sequence “falls off” the beam, even before a complete analysis is available (Collins and Roark, 2004; Daume ́ III and Marcu, 2005). The loss can be based on the best-scoring hypothesis on the beam, or the sum of all hypotheses (Huang et al., 2012).
For example, consider the beam search in Figure 11.7. In the correct parse, fish is the head of dependency arcs to both of the other two words. In the arc-standard system,
6Andor et al. (2016) prove that the set of globally-normalized conditional distributions is a strict superset of the set of locally-normalized conditional distributions, and that globally-normalized conditional models are therefore strictly more expressive.
Jacob Eisenstein. Draft of October 15, 2018.
|A(i) | K
L(θ) = − 􏰁 Ψ(a(i), c(i), w) + log 􏰁 exp 􏰁 Ψ(a(k), c(k), w). [11.30]
|A(k) |
tt tt

11.4. APPLICATIONS 277 this can be achieved only by using SHIFT for the first two actions. At t = 3, the oracle
1:3
action sequence has fallen off the beam. The parser should therefore stop, and update the parametersbythegradient ∂ L(A(i),{A(k)};θ),whereA(i) isthefirstthreeactionsofthe
∂θ 1:3 1:3 1:3 oracle sequence, and {A(k)} is the beam.
This integration of incremental search and learning was first developed in the incre- mental perceptron (Collins and Roark, 2004). This method updates the parameters with respect to a hinge loss, which compares the top-scoring hypothesis and the gold action sequence, up to the current point t. Several improvements to this basic protocol are pos- sible:
• As noted earlier, the gold dependency parse can be derived by multiple action se- quences. Rather than checking for the presence of a single oracle action sequence on the beam, we can check if the gold dependency parse is reachable from the current beam, using a dynamic oracle (Goldberg and Nivre, 2012).
• By maximizing the score of the gold action sequence, we are training a decision function to find the correct action given the gold context. But in reality, the parser will make errors, and the parser is not trained to find the best action given a context that may not itself be optimal. This issue is addressed by various generalizations of incremental perceptron, known as learning to search (Daume ́ III et al., 2009). Some of these methods are discussed in chapter 15.
11.4 Applications
Dependency parsing is used in many real-world applications: any time you want to know about pairs of words which might not be adjacent, you can use dependency arcs instead of regular expression search patterns. For example, you may want to match strings like delicious pastries, delicious French pastries, and the pastries are delicious.
It is possible to search the Google n-grams corpus by dependency edges, finding the trend in how often a dependency edge appears over time. For example, we might be inter- ested in knowing when people started talking about writing code, but we also want write some code, write good code, write all the code, etc. The result of a search on the dependency edge write → code is shown in Figure 11.8. This capability has been applied to research in digital humanities, such as the analysis of gender in Shakespeare Muralidharan and Hearst (2013).
A classic application of dependency parsing is relation extraction, which is described Under contract with MIT Press, shared under CC-BY-NC-ND license.

278 CHAPTER11. DEPENDENCYPARSING
Figure 11.8: Google n-grams results for the bigram write code and the dependency arc write => code (and their morphological variants)
in chapter 17. The goal of relation extraction is to identify entity pairs, such as
(MELVILLE, MOBY-DICK)
(TOLSTOY, WAR AND PEACE)
(MARQUE ́ Z, 100 YEARS OF SOLITUDE) (SHAKESPEARE, A MIDSUMMER NIGHT’S DREAM),
which stand in some relation to each other (in this case, the relation is authorship). Such entity pairs are often referenced via consistent chains of dependency relations. Therefore, dependency paths are often a useful feature in supervised systems which learn to detect new instances of a relation, based on labeled examples of other instances of the same relation type (Culotta and Sorensen, 2004; Fundel et al., 2007; Mintz et al., 2009).
Cui et al. (2005) show how dependency parsing can improve automated question an- swering. Suppose you receive the following query:
(11.1) What percentage of the nation’s cheese does Wisconsin produce? The corpus contains this sentence:
(11.2) In Wisconsin, where farmers produce 28% of the nation’s cheese, . . .
The location of Wisconsin in the surface form of this string makes it a poor match for the query. However, in the dependency graph, there is an edge from produce to Wisconsin in both the question and the potential answer, raising the likelihood that this span of text is relevant to the question.
A final example comes from sentiment analysis. As discussed in chapter 4, the polarity of a sentence can be reversed by negation, e.g.
Jacob Eisenstein. Draft of October 15, 2018.

11.4. APPLICATIONS 279 (11.3) There is no reason at all to believe the polluters will suddenly become reasonable.
By tracking the sentiment polarity through the dependency parse, we can better iden- tify the overall polarity of the sentence, determining when key sentiment words are re- versed (Wilson et al., 2005; Nakagawa et al., 2010).
Additional resources
More details on dependency grammar and parsing algorithms can be found in the manuscript byKu ̈bleretal.(2009).Foracomprehensivebutwhimsicaloverviewofgraph-basedde- pendency parsing algorithms, see Eisner (1997). Jurafsky and Martin (2019) describe an agenda-based version of beam search, in which the beam contains hypotheses of varying lengths. New hypotheses are added to the beam only if their score is better than the worst item currently on the beam. Another search algorithm for transition-based parsing is easy-first, which abandons the left-to-right traversal order, and adds the highest-scoring edges first, regardless of where they appear (Goldberg and Elhadad, 2010). Goldberg et al. (2013) note that although transition-based methods can be implemented in linear time in the length of the input, na ̈ıve implementations of beam search will require quadratic time, due to the cost of copying each hypothesis when it is expanded on the beam. This issue can be addressed by using a more efficient data structure for the stack.
Exercises
1. The dependency structure 1 ← 2 → 3, with 2 as the root, can be obtained from more than one set of actions in arc-standard parsing. List both sets of actions that can obtain this parse. Don’t forget about the edge ROOT → 2.
2. This problem develops the relationship between dependency parsing and lexical- ized context-free parsing. Suppose you have a set of unlabeled arc scores {ψ(i → j)}Mi,j=1 ∪ {ψ(ROOT → j)}Mj=1.
a) Assuming each word type occurs no more than once in the input ((i ̸= j) ⇒ (wi ̸= wj )), how would you construct a weighted lexicalized context-free gram- mar so that the score of any projective dependency tree is equal to the score of some equivalent derivation in the lexicalized context-free grammar?
b) Verify that your method works for the example They fish.
c) Does your method require the restriction that each word type occur no more
than once in the input? If so, why?
d) *If your method required that each word type occur only once in the input, show how to generalize it.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

280 3.
CHAPTER11. DEPENDENCYPARSING
In arc-factored dependency parsing of an input of length M, the score of a parse is the sum of M scores, one for each arc. In second order dependency parsing, the total score is the sum over many more terms. How many terms are the score of the parse for Figure 11.2, using a second-order dependency parser with grandparent and sibling features? Assume that a child of ROOT has no grandparent score, and that a node with no siblings has no sibling scores.
4. a) In the worst case, how many terms can be involved in the score of an input of length M , assuming second-order dependency parsing? Describe the structure of the worst-case parse. As in the previous problem, assume that there is only one child of ROOT, and that it does not have any grandparent scores.
5.
6.
7.
b) What about third-order dependency parsing?
ProvidetheUD-styleunlabeleddependencyparseforthesentenceXi-Laneatsshoots and leaves, assuming shoots is a noun and leaves is a verb. Provide arc-standard and arc-eager derivations for this dependency parse.
Compute an upper bound on the number of successful derivations in arc-standard
2017), where b = (a−b)!b! .
The label bias problem arises when a decision is locally correct, yet leads to a cas- cade of errors in some situations (section 11.3.3). Design a scenario in which this occurs. Specifically:
• Assume an arc-standard dependency parser, whose action classifier considers only the words at the top of the stack and at the front of the input buffer.
• Design two examples, which both involve a decision with identical features.
– In one example, shift is the correct decision; in the other example, arc-left or arc-right is the correct decision.
– In one of the two examples, a mistake should lead to at least two attach- ment errors.
– In the other example, a mistake should lead only to a single attachment error.
shift-reduce parsing for unlabeled dependencies, as a function of the length of the
input, M . Hint: a lower bound is the number of projective decision trees, 1 􏰗3M −2􏰘 (Zhang, 􏰗a􏰘 a! M+1 M−1
For the following exercises, run a dependency parser, such as Stanford’s CoreNLP parser, on a large corpus of text (at least 105 tokens), such as nltk.corpus.webtext.
8. The dependency relation NMOD:POSS indicates possession. Compute the top ten words most frequently possessed by each of the following pronouns: his, her, our, my, your, and their (inspired by Muralidharan and Hearst, 2013).
Jacob Eisenstein. Draft of October 15, 2018.

11.4. APPLICATIONS 281
9. Count all pairs of words grouped by the C O N J relation. Select all pairs of words (i, j ) for which i and j each participate in CONJ relations at least five times. Compute and sort by the pointwise mutual information, which is defined in section 14.3 as,
PMI(i, j) = log p(i, j) . [11.31] p(i)p(j )
Here, p(i) is the fraction of CONJ relations containing word i (in either position), and p(i, j) is the fraction of such relations linking i and j (in any order).
10. In section 4.2, we encountered lexical semantic relationships such as synonymy (same meaning), antonymy (opposite meaning), and hypernymy (i is a special case of j). Another relevant relation is co-hypernymy, which means that i and j share a hypernym. Of the top 20 pairs identified by PMI in the previous problem, how many participate in synsets that are linked by one of these four relations? Use WORDNET to check for these relations, and count a pair of words if any of their synsets are linked.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Part III Meaning
283

Chapter 12
Logical semantics
The previous few chapters have focused on building systems that reconstruct the syntax of natural language — its structural organization — through tagging and parsing. But some of the most exciting and promising potential applications of language technology involve going beyond syntax to semantics — the underlying meaning of the text:
• Answering questions, such as where is the nearest coffeeshop? or what is the middle name of the mother of the 44th President of the United States?.
• Building a robot that can follow natural language instructions to execute tasks.
• Translating a sentence from one language into another, while preserving the under-
lying meaning.
• Fact-checking an article by searching the web for contradictory evidence.
• Logic-checking an argument by identifying contradictions, ambiguity, and unsup- ported assertions.
Semantic analysis involves converting natural language into a meaning representa- tion. To be useful, a meaning representation must meet several criteria:
• c1: it should be unambiguous: unlike natural language, there should be exactly one meaning per statement;
• c2: it should provide a way to link language to external knowledge, observations, and actions;
• c3: it should support computational inference, so that meanings can be combined to derive additional knowledge;
• c4: it should be expressive enough to cover the full range of things that people talk about in natural language.
285

286 CHAPTER12. LOGICALSEMANTICS Much more than this can be said about the question of how best to represent knowledge
for computation (e.g., Sowa, 2000), but this chapter will focus on these four criteria.
12.1 Meaning and denotation
The first criterion for a meaning representation is that statements in the representation should be unambiguous — they should have only one possible interpretation. Natural language does not have this property: as we saw in chapter 10, sentences like cats scratch people with claws have multiple interpretations.
But what does it mean for a statement to be unambiguous? Programming languages provide a useful example: the output of a program is completely specified by the rules of the language and the properties of the environment in which the program is run. For ex- ample,thepythoncode5 + 3willhavetheoutput8,aswillthecodes(4*4)-(3*3)+1 and ((8)). This output is known as the denotation of the program, and can be written as,
􏰆5+3􏰇 = 􏰆(4*4)-(3*3)+1􏰇 = 􏰆((8))􏰇 = 8. [12.1]
The denotations of these arithmetic expressions are determined by the meaning of the constants (e.g., 5, 3) and the relations (e.g., +, *, (, )). Now let’s consider another snippet of python code, double(4). The denotation of this code could be, 􏰆double(4)􏰇 = 8, or it could be 􏰆double(4)􏰇 = 44 — it depends on the meaning of double. This meaning is defined in a world model M as an infinite set of pairs. We write the denotation with respect to model M as 􏰆·􏰇M, e.g., 􏰆double􏰇M = {(0,0), (1,2), (2,4), . . .}. The world model would also define the (infinite) list of constants, e.g., {0,1,2,…}. As long as the denotation of string φ in model M can be computed unambiguously, the language can be said to be unambiguous.
This approach to meaning is known as model-theoretic semantics, and it addresses not only criterion c1 (no ambiguity), but also c2 (connecting language to external knowl- edge, observations, and actions). For example, we can connect a representation of the meaning of a statement like the capital of Georgia with a world model that includes knowl- edge base of geographical facts, obtaining the denotation Atlanta. We might populate a world model by detecting and analyzing the objects in an image, and then use this world model to evaluate propositions like a man is riding a moose. Another desirable property of model-theoretic semantics is that when the facts change, the denotations change too: the meaning representation of President of the USA would have a different denotation in the model M2014 as it would in M2022.
Jacob Eisenstein. Draft of October 15, 2018.

12.2. LOGICALREPRESENTATIONSOFMEANING 287 12.2 Logical representations of meaning
Criterion c3 requires that the meaning representation support inference — for example, automatically deducing new facts from known premises. While many representations have been proposed that meet these criteria, the most mature is the language of first-order logic.1
12.2.1 Propositional logic
The bare bones of logical meaning representation are Boolean operations on propositions:
Propositional symbols. Greek symbols like φ and ψ will be used to represent proposi- tions, which are statements that are either true or false. For example, φ may corre- spond to the proposition, bagels are delicious.
Boolean operators. We can build up more complex propositional formulas from Boolean operators. These include:
• Negation ¬φ, which is true if φ is false.
• Conjunction, φ ∧ ψ, which is true if both φ and ψ are true.
• Disjunction, φ ∨ ψ, which is true if at least one of φ and ψ is true
• Implication, φ ⇒ ψ, which is true unless φ is true and ψ is false. Implication has identical truth conditions to ¬φ ∨ ψ.
• Equivalence, φ ⇔ ψ, which is true if φ and ψ are both true or both false. Equiv- alence has identical truth conditions to (φ ⇒ ψ) ∧ (ψ ⇒ φ).
It is not strictly necessary to have all five Boolean operators: readers familiar with Boolean logic will know that it is possible to construct all other operators from either the NAND (not-and) or NOR (not-or) operators. Nonetheless, it is clearest to use all five operators. From the truth conditions for these operators, it is possible to define a number of “laws” for these Boolean operators, such as,
• Commutativity:φ∧ψ=ψ∧φ, φ∨ψ=ψ∨φ
• Associativity:φ∧(ψ∧χ)=(φ∧ψ)∧χ, φ∨(ψ∨χ)=(φ∨ψ)∨χ
• Complementation: φ ∧ ¬φ = ⊥, φ ∨ ¬φ = ⊤, where ⊤ indicates a true proposition and ⊥ indicates a false proposition.
1Alternatives include the “variable-free” representation used in semantic parsing of geographical queries (Zelle and Mooney, 1996) and robotic control (Ge and Mooney, 2005), and dependency-based com- positional semantics (Liang et al., 2013).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

288 CHAPTER12. LOGICALSEMANTICS
These laws can be combined to derive further equivalences, which can support logical inferences. For example, suppose φ = The music is loud and ψ = Max can’t sleep. Then if we are given,
φ ⇒ ψ If the music is loud, Max can’t sleep. φ The music is loud.
we can derive ψ (Max can’t sleep) by application of modus ponens, which is one of a set of inference rules that can be derived from more basic laws and used to manipulate propositional formulas. Automated theorem provers are capable of applying inference rules to a set of premises to derive desired propositions (Loveland, 2016).
12.2.2 First-order logic
Propositional logic is so named because it treats propositions as its base units. However, the criterion c4 states that our meaning representation should be sufficiently expressive. Now consider the sentence pair,
(12.1) If anyone is making noise, then Max can’t sleep. Abigail is making noise.
People are capable of making inferences from this sentence pair, but such inferences re- quire formal tools that are beyond propositional logic. To understand the relationship between the statement anyone is making noise and the statement Abigail is making noise, our meaning representation requires the additional machinery of first-order logic (FOL).
In FOL, logical propositions can be constructed from relationships between entities. Specifically, FOL extends propositional logic with the following classes of terms:
Constants. These are elements that name individual entities in the model, such as MAX and ABIGAIL. The denotation of each constant in a model M is an element in the model, e.g., 􏰆MAX􏰇 = m and 􏰆ABIGAIL􏰇 = a.
Relations. Relations can be thought of as sets of entities, or sets of tuples. For example, the relation CAN-SLEEP is defined as the set of entities who can sleep, and has the denotation 􏰆CAN-SLEEP􏰇 = {a, m, . . .}. To test the truth value of the proposition CAN-SLEEP(MAX), we ask whether 􏰆MAX􏰇 ∈ 􏰆CAN-SLEEP􏰇. Logical relations that are defined over sets of entities are sometimes called properties.
Relations may also be ordered tuples of entities. For example BROTHER(MAX,ABIGAIL) expresses the proposition that MAX is the brother of ABIGAIL. The denotation of such relations is a set of tuples, 􏰆BROTHER􏰇 = {(m,a), (x,y), . . .}. To test the truth value of the proposition BROTHER(MAX,ABIGAIL), we ask whether the tuple (􏰆MAX􏰇, 􏰆ABIGAIL􏰇) is in the denotation 􏰆BROTHER􏰇.
Jacob Eisenstein. Draft of October 15, 2018.

12.2. LOGICALREPRESENTATIONSOFMEANING 289 Using constants and relations, it is possible to express statements like Max can’t sleep
and Max is Abigail’s brother:
¬CAN-SLEEP(MAX) BROTHER(MAX,ABIGAIL).
These statements can also be combined using Boolean operators, such as, (BROTHER(MAX,ABIGAIL) ∨ BROTHER(MAX,STEVE)) ⇒ ¬CAN-SLEEP(MAX).
This fragment of first-order logic permits only statements about specific entities. To support inferences about statements like If anyone is making noise, then Max can’t sleep, two more elements must be added to the meaning representation:
Variables. Variablesaremechanismsforreferringtoentitiesthatarenotlocallyspecified. We can then write CAN-SLEEP(x) or BROTHER(x, ABIGAIL). In these cases, x is a free variable, meaning that we have not committed to any particular assignment.
Quantifiers. Variables are bound by quantifiers. There are two quantifiers in first-order logic.2
• The existential quantifier ∃, which indicates that there must be at least one en- tity to which the variable can bind. For example, the statement ∃xMAKES-NOISE(X) indicates that there is at least one entity for which MAKES-NOISE is true.
• The universal quantifier ∀, which indicates that the variable must be able to bind to any entity in the model. For example, the statement,
MAKES-NOISE(ABIGAIL) ⇒ (∀x¬CAN-SLEEP(x)) [12.3] asserts that if Abigail makes noise, no one can sleep.
The expressions ∃x and ∀x make x into a bound variable. A formula that contains no free variables is a sentence.
Functions. Functionsmapfromentitiestoentities,e.g.,􏰆CAPITAL-OF(GEORGIA)􏰇=􏰆ATLANTA􏰇. With functions, it is convenient to add an equality operator, supporting statements
like,
∀x∃yMOTHER-OF(x) = DAUGHTER-OF(y). [12.4]
2In first-order logic, it is possible to quantify only over entities. In second-order logic, it is possible to quantify over properties. This makes it possible to represent statements like Butch has every property that a good boxer has (example from Blackburn and Bos, 2005),
∀P ∀x((GOOD-BOXER(x) ⇒ P (x)) ⇒ P (BUTCH)). [12.2]
Under contract with MIT Press, shared under CC-BY-NC-ND license.

290
CHAPTER12. LOGICALSEMANTICS
Note that MOTHER-OF is a functional analogue of the relation MOTHER, so that MOTHER-OF(x) = y if MOTHER(x, y). Any logical formula that uses functions can be rewritten using only relations and quantification. For example,
MAKES-NOISE(MOTHER-OF(ABIGAIL)) [12.5] can be rewritten as ∃xMAKES-NOISE(x) ∧ MOTHER(x, ABIGAIL).
An important property of quantifiers is that the order can matter. Unfortunately, natu- ral language is rarely clear about this! The issue is demonstrated by examples like everyone speaks a language, which has the following interpretations:
∀x∃y SPEAKS(x, y) [12.6] ∃y∀x SPEAKS(x, y). [12.7]
In the first case, y may refer to several different languages, while in the second case, there is a single y that is spoken by everyone.
Truth-conditional semantics
One way to look at the meaning of an FOL sentence φ is as a set of truth conditions, or models under which φ is satisfied. But how to determine whether a sentence is true or false in a given model? We will approach this inductively, starting with a predicate applied to a tuple of constants. The truth of such a sentence depends on whether the tuple of denotations of the constants is in the denotation of the predicate. For example, CAPITAL(GEORGIA,ATLANTA) is true in model M iff,
(􏰆GEORGIA􏰇M, 􏰆ATLANTA􏰇M) ∈ 􏰆CAPITAL􏰇M. [12.8]
The Boolean operators ∧, ∨, . . . provide ways to construct more complicated sentences, and the truth of such statements can be assessed based on the truth tables associated with these operators. The statement ∃xφ is true if there is some assignment of the variable x to an entity in the model such that φ is true; the statement ∀xφ is true if φ is true under all possible assignments of x. More formally, we would say that φ is satisfied under M, written as M |= φ.
Truth conditional semantics allows us to define several other properties of sentences and pairs of sentences. Suppose that in every M under which φ is satisfied, another formula ψ is also satisfied; then φ entails ψ, which is also written as φ |= ψ. For example,
CAPITAL(GEORGIA,ATLANTA) |= ∃xCAPITAL(GEORGIA, x). [12.9] A statement that is satisfied under any model, such as φ ∨ ¬φ, is valid, written |= (φ ∨
¬φ). A statement that is not satisfied under any model, such as φ ∧ ¬φ, is unsatisfiable, Jacob Eisenstein. Draft of October 15, 2018.

12.3. SEMANTICPARSINGANDTHELAMBDACALCULUS 291
or inconsistent. A model checker is a program that determines whether a sentence φ is satisfied in M. A model builder is a program that constructs a model in which φ is satisfied. The problems of checking for consistency and validity in first-order logic are undecidable, meaning that there is no algorithm that can automatically determine whether an FOL formula is valid or inconsistent.
Inference in first-order logic
Our original goal was to support inferences that combine general statements If anyone is making noise, then Max can’t sleep with specific statements like Abigail is making noise. We can now represent such statements in first-order logic, but how are we to perform the inference that Max can’t sleep? One approach is to use “generalized” versions of propo- sitional inference rules like modus ponens, which can be applied to FOL formulas. By repeatedly applying such inference rules to a knowledge base of facts, it is possible to produce proofs of desired propositions. To find the right sequence of inferences to derive a desired theorem, classical artificial intelligence search algorithms like backward chain- ing can be applied. Such algorithms are implemented in interpreters for the prolog logic programming language (Pereira and Shieber, 2002).
12.3 Semantic parsing and the lambda calculus
The previous section laid out a lot of formal machinery; the remainder of this chapter links these formalisms back to natural language. Given an English sentence like Alex likes Brit, how can we obtain the desired first-order logical representation, LIKES(ALEX,BRIT)? This is the task of semantic parsing. Just as a syntactic parser is a function from a natu- ral language sentence to a syntactic structure such as a phrase structure tree, a semantic parser is a function from natural language to logical formulas.
As in syntactic analysis, semantic parsing is difficult because the space of inputs and outputs is very large, and their interaction is complex. Our best hope is that, like syntactic parsing, semantic parsing can somehow be decomposed into simpler sub-problems. This idea, usually attributed to the German philosopher Gottlob Frege, is called the principle of compositionality: the meaning of a complex expression is a function of the meanings of that expression’s constituent parts. We will define these “constituent parts” as syntactic constituents: noun phrases and verb phrases. These constituents are combined using function application: if the syntactic parse contains the production x → y z, then the semantics of x, written x.sem, will be computed as a function of the semantics of the
Under contract with MIT Press, shared under CC-BY-NC-ND license.

292
CHAPTER12. LOGICALSEMANTICS
NP : alex Alex
S : likes(alex, brit)
V : ? likes
VP : ?
NP : brit Brit
Figure 12.1: The principle of compositionality requires that we identify meanings for the constituents likes and likes Brit that will make it possible to compute the meaning for the entire sentence.
constituents, y.sem and z.sem.3 4 12.3.1 The lambda calculus
Let’s see how this works for a simple sentence like Alex likes Brit, whose syntactic structure is shown in Figure 12.1. Our goal is the formula, LIKES(ALEX,BRIT), and it is clear that the meaning of the constituents Alex and Brit should be ALEX and BRIT. That leaves two more constituents: the verb likes, and the verb phrase likes Brit. The meanings of these units must be defined in a way that makes it possible to recover the desired meaning for the entire sentence by function application. If the meanings of Alex and Brit are constants, then the meanings of likes and likes Brit must be functional expressions, which can be applied to their siblings to produce the desired analyses.
Modeling these partial analyses requires extending the first-order logic meaning rep- resentation. We do this by adding lambda expressions, which are descriptions of anony- mous functions,5 e.g.,
λx.LIKES(x, BRIT). [12.10] This functional expression is the meaning of the verb phrase likes Brit; it takes a single
argument, and returns the result of substituting that argument for x in the expression
3subsection 9.3.2 briefly discusses Combinatory Categorial Grammar (CCG) as an alternative to a phrase- structure analysis of syntax. CCG is argued to be particularly well-suited to semantic parsing (Hockenmaier and Steedman, 2007), and is used in much of the contemporary work on machine learning for semantic parsing, summarized in section 12.4.
4The approach of algorithmically building up meaning representations from a series of operations on the syntactic structure of a sentence is generally attributed to the philosopher Richard Montague, who published a series of influential papers on the topic in the early 1970s (e.g., Montague, 1973).
5Formally, all first-order logic formulas are lambda expressions; in addition, if φ is a lambda expression, then λx.φ is also a lambda expression. Readers who are familiar with functional programming will recognize lambda expressions from their use in programming languages such as Lisp and Python.
Jacob Eisenstein. Draft of October 15, 2018.

12.3. SEMANTICPARSINGANDTHELAMBDACALCULUS 293
LIKES(x, BRIT). We write this substitution as,
(λx.LIKES(x, BRIT))@ALEX = LIKES(ALEX,BRIT), [12.11]
with the symbol “@” indicating function application. Function application in the lambda calculus is sometimes called β-reduction or β-conversion. The expression φ@ψ indicates a function application to be performed by β-reduction, and φ(ψ) indicates a function or predicate in the final logical form.
Equation 12.11 shows how to obtain the desired semantics for the sentence Alex likes Brit: by applying the lambda expression λx.LIKES(x, BRIT) to the logical constant ALEX. This rule of composition can be specified in a syntactic-semantic grammar, in which syntactic productions are paired with semantic operations. For the syntactic production S → NP VP, we have the semantic rule VP.sem@NP.sem.
The meaning of the transitive verb phrase likes Brit can also be obtained by function application on its syntactic constituents. For the syntactic production VP → V NP, we apply the semantic rule,
VP.sem =(V.sem)@NP.sem =(λy.λx.LIKES(x, y))@(BRIT)
=λx.LIKES(x, BRIT).
[12.12] [12.13] [12.14]
Thus, the meaning of the transitive verb likes is a lambda expression whose output is another lambda expression: it takes y as an argument to fill in one of the slots in the LIKES relation, and returns a lambda expression that is ready to take an argument to fill in the other slot.6
Table 12.1 shows a minimal syntactic-semantic grammar fragment, G1. The complete derivation of Alex likes Brit in G1 is shown in Figure 12.2. In addition to the transitive verb likes, the grammar also includes the intransitive verb sleeps; it should be clear how to derive the meaning of sentences like Alex sleeps. For verbs that can be either transitive or intransitive, such as eats, we would have two terminal productions, one for each sense (terminal productions are also called the lexical entries). Indeed, most of the grammar is in the lexicon (the terminal productions), since these productions select the basic units of the semantic interpretation.
12.3.2 Quantification
Things get more complicated when we move from sentences about named entities to sen- tences that involve more general noun phrases. Let’s consider the example, A dog sleeps,
6This can be written in a few different ways. The notation λy, x.LIKES(x, y) is a somewhat informal way to indicate a lambda expression that takes two arguments; this would be acceptable in functional programming. Logicians (e.g., Carpenter, 1997) often prefer the more formal notation λy.λx.LIKES(x)(y), indicating that each lambda expression takes exactly one argument.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

294
CHAPTER12. LOGICALSEMANTICS
NP : alex Alex
S : likes(alex, brit)
VP : λx.likes(x, brit)
Vt : λy.λx.likes(x, y) NP : brit likes Brit
Figure 12.2: Derivation of the semantic representation for Alex likes Brit in the grammar G1.
S →NPVP VP →VtNP VP →Vi
Vt → likes Vi → sleeps NP → Alex NP → Brit
VP.sem@NP.sem Vt.sem@NP.sem Vi.sem
λy.λx.LIKES(x, y) λx.SLEEPS(x) ALEX
BRIT
Table 12.1: G1, a minimal syntactic-semantic context-free grammar
which has the meaning ∃xDOG(x) ∧ SLEEPS(x). Clearly, the DOG relation will be intro- duced by the word dog, and the SLEEP relation will be introduced by the word sleeps. The existential quantifier ∃ must be introduced by the lexical entry for the determiner a.7 However, this seems problematic for the compositional approach taken in the grammar G1: if the semantics of the noun phrase a dog is an existentially quantified expression, how can it be the argument to the semantics of the verb sleeps, which expects an entity? And where does the logical conjunction come from?
There are a few different approaches to handling these issues.8 We will begin by re- versing the semantic relationship between subject NPs and VPs, so that the production S → NP VP has the semantics NP.sem@VP.sem: the meaning of the sentence is now the semantics of the noun phrase applied to the verb phrase. The implications of this change are best illustrated by exploring the derivation of the example, shown in Figure 12.3. Let’s
7Conversely, the sentence Every dog sleeps would involve a universal quantifier, ∀xDOG(x) ⇒ SLEEPS(x). The definite article the requires more consideration, since the dog must refer to some dog which is uniquely identifiable, perhaps from contextual information external to the sentence. Carpenter (1997, pp. 96-100) summarizes recent approaches to handling definite descriptions.
8Carpenter (1997) offers an alternative treatment based on combinatory categorial grammar.
Jacob Eisenstein. Draft of October 15, 2018.

12.3. SEMANTICPARSINGANDTHELAMBDACALCULUS 295 S : ∃xdog(x) ∧ sleeps(x)
NP : λP.∃xP (x) ∧ dog(x) DT : λQ.λP.∃x.P (x) ∧ Q(x) NN : dog
A dog
VP : λx.sleeps(x)
Vi : λx.sleeps(x)
sleeps
Figure 12.3: Derivation of the semantic representation for A dog sleeps, in grammar G2
start with the indefinite article a, to which we assign the rather intimidating semantics,
λP.λQ.∃xP (x) ∧ Q(x). [12.15]
This is a lambda expression that takes two relations as arguments, P and Q. The relation P is scoped to the outer lambda expression, so it will be provided by the immediately adjacent noun, which in this case is DOG. Thus, the noun phrase a dog has the semantics,
NP.sem =DET.sem@NN.sem =(λP.λQ.∃xP (x) ∧ Q(x))@(DOG)
=λQ.∃xDOG(x) ∧ Q(x).
This is a lambda expression that is expecting another relation, Q, which will be provided
by the verb phrase, SLEEPS. This gives the desired analysis, ∃xDOG(x) ∧ SLEEPS(x).9
If noun phrases like a dog are interpreted as lambda expressions, then proper nouns like Alex must be treated in the same way. This is achieved by type-raising from con- stants to lambda expressions, x ⇒ λP.P(x). After type-raising, the semantics of Alex is λP.P(ALEX) — a lambda expression that expects a relation to tell us something about ALEX.10 Again,makesureyouseehowtheanalysisinFigure12.3canbeappliedtothe sentence Alex sleeps.
9When applying β-reduction to arguments that are themselves lambda expressions, be sure to use unique variable names to avoid confusion. For example, it is important to distinguish the x in the semantics for a from the x in the semantics for likes. Variable names are abstractions, and can always be changed — this is known as α-conversion. For example, λx.P (x) can be converted to λy.P (y), etc.
10Compositional semantic analysis is often supported by type systems, which make it possible to check whether a given function application is valid. The base types are entities e and truth values t. A property, such as DOG, is a function from entities to truth values, so its type is written ⟨e, t⟩. A transitive verb has type ⟨e, ⟨e, t⟩⟩: after receiving the first entity (the direct object), it returns a function from entities to truth values, which will be applied to the subject of the sentence. The type-raising operation x ⇒ λP.P(x) corresponds to a change in type from e to ⟨⟨e, t⟩, t⟩: it expects a function from entities to truth values, and returns a truth value.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
[12.16] [12.17] [12.18]

296
CHAPTER12. LOGICALSEMANTICS
NP : λQ.∃xdog(x) ∧ Q(x) DT : λP.λQ.∃xP (x) ∧ Q(x) NN : dog
A dog
VP : λx.likes(x, alex)
Vt : λP.λx.P (λy.likes(x, y)) NP : λP.P (alex) likes NNP : alex
Alex
S : ∃xdog(x) ∧ likes(x, alex)
Figure 12.4: Derivation of the semantic representation for A dog likes Alex.
Direct objects are handled by applying the same type-raising operation to transitive
verbs: the meaning of verbs such as likes is raised to,
λP.λx.P (λy.LIKES(x, y)) [12.19]
As a result, we can keep the verb phrase production VP.sem = V.sem@NP.sem, knowing that the direct object will provide the function P in Equation 12.19. To see how this works, let’s analyze the verb phrase likes a dog. After uniquely relabeling each lambda variable,
VP.sem =V.sem@NP.sem
=(λP.λx.P (λy.LIKES(x, y)))@(λQ.∃zDOG(z) ∧ Q(z)) =λx.(λQ.∃zDOG(z) ∧ Q(z))@(λy.LIKES(x, y)) =λx.∃zDOG(z) ∧ (λy.LIKES(x, y))@z
=λx.∃zDOG(z) ∧ LIKES(x, z).
These changes are summarized in the revised grammar G2, shown in Table 12.2. Fig- ure 12.4 shows a derivation that involves a transitive verb, an indefinite noun phrase, and a proper noun.
12.4 Learning semantic parsers
As with syntactic parsing, any syntactic-semantic grammar with sufficient coverage risks producing many possible analyses for any given sentence. Machine learning is the dom- inant approach to selecting a single analysis. We will focus on algorithms that learn to score logical forms by attaching weights to features of their derivations (Zettlemoyer and Collins, 2005). Alternative approaches include transition-based parsing (Zelle and Mooney, 1996; Misra and Artzi, 2016) and methods inspired by machine translation (Wong and Mooney, 2006). Methods also differ in the form of supervision used for learning, which can range from complete derivations to much more limited training signals. We will begin with the case of complete supervision, and then consider how learning is still possible even when seemingly key information is missing.
Jacob Eisenstein. Draft of October 15, 2018.

12.4. LEARNINGSEMANTICPARSERS
297
S →NPVP VP →VtNP VP →Vi
NP →DETNN NP → NNP
DET →a DET → every Vt → likes Vi → sleeps NN → dog NNP → Alex NNP → Brit
NP.sem@VP.sem Vt.sem@NP.sem Vi.sem DET.sem@NN.sem λP.P (NNP.sem)
λP.λQ.∃xP (x) ∧ Q(x) λP.λQ.∀x(P (x) ⇒ Q(x)) λP.λx.P (λy.LIKES(x, y)) λx.SLEEPS(x)
DOG ALEX BRIT
Table 12.2: G2, a syntactic-semantic context-free grammar fragment, which supports quantified noun phrases
Datasets Early work on semantic parsing focused on natural language expressions of geographical database queries, such as What states border Texas. The GeoQuery dataset of Zelle and Mooney (1996) was originally coded in prolog, but has subsequently been expanded and converted into the SQL database query language by Popescu et al. (2003) and into first-order logic with lambda calculus by Zettlemoyer and Collins (2005), pro- viding logical forms like λx.STATE(x) ∧ BORDERS(x, TEXAS). Another early dataset con- sists of instructions for RoboCup robot soccer teams (Kate et al., 2005). More recent work has focused on broader domains, such as the Freebase database (Bollacker et al., 2008), for which queries have been annotated by Krishnamurthy and Mitchell (2012) and Cai and Yates (2013). Other recent datasets include child-directed speech (Kwiatkowski et al., 2012) and elementary school science exams (Krishnamurthy, 2016).
12.4.1 Learning from derivations
Let w(i) indicate a sequence of text, and let y(i) indicate the desired logical form. For example:
w(i) =Alex eats shoots and leaves
y(i) =EATS(ALEX,SHOOTS) ∧ EATS(ALEX,LEAVES)
In the standard supervised learning paradigm that was introduced in section 2.3, we first define a feature function, f(w,y), and then learn weights on these features, so that y(i) = argmaxy θ · f (w, y). The weight vector θ is learned by comparing the features of the true label f(w(i),y(i)) against either the features of the predicted label f(w(i),yˆ) (per-
Under contract with MIT Press, shared under CC-BY-NC-ND license.

298
CHAPTER12. LOGICALSEMANTICS
S : eats(alex, shoots) ∧ eats(alex, leavesn)
NP : λP.P(alex) VP : λx.eats(x,shoots)∧eats(x,leavesn)
Alex Vt :λP.λx.P(λy.eats(x,y)) NP:λP.P(shoots)∧P(leavesn)
eats NP : λP.P (shoots) CC : λP.λQ.λx.P (x) ∧ Q(x) NP : λP.P (leavesn)
shoots and leaves
Figure 12.5: Derivation for gold semantic analysis of Alex eats shoots and leaves ceptron, support vector machine) or the expected feature vector Ey|w[f(w(i),y)] (logistic
regression).
While this basic framework seems similar to discriminative syntactic parsing, there is a crucial difference. In (context-free) syntactic parsing, the annotation y(i) contains all of the syntactic productions; indeed, the task of identifying the correct set of productions is identical to the task of identifying the syntactic structure. In semantic parsing, this is not the case: the logical form EATS(ALEX,SHOOTS) ∧ EATS(ALEX,LEAVES) does not reveal the syntactic-semantic productions that were used to obtain it. Indeed, there may be spu- rious ambiguity, so that a single logical form can be reached by multiple derivations. (We previously encountered spurious ambiguity in transition-based dependency parsing, subsection 11.3.2.)
These ideas can be formalized by introducing an additional variable z, representing the derivation of the logical form y from the text w. Assume that 􏰀the feature function de- composes across the productions in the derivation, f(w,z,y) = Tt=1 f(w,zt,y), where zt indicates a single syntactic-semantic production. For example, we might have a feature for the production S → NP VP : NP.sem@VP.sem, as well as for terminal productions like NNP → Alex : ALEX. Under this decomposition, it is possible to compute scores for each semantically-annotated subtree in the analysis of w, so that bottom-up parsing algo- rithms like CKY (section 10.1) can be applied to find the best-scoring semantic analysis.
Figure 12.5 shows a derivation of the correct semantic analysis of the sentence Alex eats shoots and leaves, in a simplified grammar in which the plural noun phrases shoots and leaves are interpreted as logical constants SHOOTS and LEAVESn. Figure 12.6 shows a derivation of an incorrect analysis. Assuming one feature per production, the perceptron update is shown in Table 12.3. From this update, the parser would learn to prefer the noun interpretation of leaves over the verb interpretation. It would also learn to prefer noun phrase coordination over verb phrase coordination.
While the update is explained in terms of the perceptron, it would be easy to replace the perceptron with a conditional random field. In this case, the online updates would be
Jacob Eisenstein. Draft of October 15, 2018.

12.4. LEARNINGSEMANTICPARSERS
299
S : eats(alex, shoots) ∧ leavesv (alex)
NP : λP.P (alex) VP : λx.eats(x, shoots) ∧ leavesv (x)
Alex VP : λx.eats(x,shoots)
Vt : λP.λx.P(λy.eats(x,y)) NP : λP.P(shoots) eats shoots
CC : λP.λQ.λx.P(x)∧Q(x) and
VP : λx.leavesv(x) Vi : λx.leavesv(x) leaves
Figure 12.6: Derivation for incorrect semantic analysis of Alex eats shoots and leaves
NP1 → NP2 CC NP3 VP1 → VP2 CC VP3 NP → leaves
VP → Vi
(CC.sem@(NP2.sem))@(NP3.sem) +1 (CC.sem@(VP2.sem))@(VP3.sem) -1 LEAVESn +1 Vi.sem -1 λx.LEAVESv -1
Vi → leaves
Table 12.3: Perceptron update for analysis in Figure 12.5 (gold) and Figure 12.6 (predicted)
based on feature expectations, which can be computed using the inside-outside algorithm (section 10.6).
12.4.2 Learning from logical forms
Complete derivations are expensive to annotate, and are rarely available.11 One solution is to focus on learning from logical forms directly, while treating the derivations as la- tent variables (Zettlemoyer and Collins, 2005). In a conditional probabilistic model over logical forms y and derivations z, we have,
p(y,z|w)=􏰀 exp(θ·f(w,z,y)) , [12.20] y′,z′ exp(θ·f(w,z′,y′))
which is the standard log-linear model, applied to the logical form y and the derivation z.
Since the derivation z unambiguously determines the logical form y, it may seem silly to model the joint probability over y and z. However, since z is unknown, it can be marginalized out,
p(y | w) =􏰁p(y,z | w). [12.21] z
11An exception is the work of Ge and Mooney (2005), who annotate the meaning of each syntactic con- stituents for several hundred sentences.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

300 CHAPTER12. LOGICALSEMANTICS The semantic parser can then select the logical form with the maximum log marginal
probability,
log􏰁p(y,z|w)=log􏰁􏰁􏰀 exp(θ·f(w,z,y))
z z y′, z′ exp(θ · f(w, z′, y′))
∝ log exp(θ · f(w, z′, y′)) z
≥ max θ · f (w, z, y). z
[12.22]
[12.23]
[12.24]
It is impossible to push the log term inside the sum over z, so our usual linear scoring function does not apply. We can recover this scoring function only in approximation, by taking the max (rather than the sum) over derivations z, which provides a lower bound.
Learning can be performed by maximizing the log marginal likelihood,
􏰁N i=1
log p(y(i) | w(i); θ)
l(θ) =
= log p(y ,z |w ;θ).
[12.25] [12.26]
􏰁N 􏰁 (i) (i) (i) i=1 z
This log-likelihood is not convex in θ, unlike the log-likelihood of a fully-observed condi- tional random field. This means that learning can give different results depending on the initialization.
The derivative of Equation 12.26 is,
∂li =􏰁p(z|y,w;θ)f(w,z,y)−􏰁p(y′,z′ |w;θ)f(w,z′,y′) [12.27]
∂θ z y′,z′
=Ez|y,wf(w,z,y)−Ey,z|wf(w,z,y) [12.28]
Both expectations can be computed via bottom-up algorithms like inside-outside. Al- ternatively, we can again maximize rather than marginalize over derivations for an ap- proximate solution. In either case, the first term of the gradient requires us to identify derivations z that are compatible with the logical form y. This can be done in a bottom- up dynamic programming algorithm, by having each cell in the table t[i, j, X ] include the set of all possible logical forms for X 􏱔 wi+1:j . The resulting table may therefore be much larger than in syntactic parsing. This can be controlled by using pruning to eliminate in- termediate analyses that are incompatible with the final logical form y (Zettlemoyer and Collins, 2005), or by using beam search and restricting the size of each cell to some fixed constant (Liang et al., 2013).
If we replace each expectation in Equation 12.28 with argmax and then apply stochastic gradient descent to learn the weights, we obtain the latent variable perceptron, a simple
Jacob Eisenstein. Draft of October 15, 2018.

12.4. LEARNINGSEMANTICPARSERS 301 Algorithm 16 Latent variable perceptron
1: 2: 3: 4: 5: 6: 7: 8: 9:
procedureLATENTVARIABLEPERCEPTRON(w(1:N),y(1:N)) θ←0
repeat
Select an instance i
z(i) ←argmaxzθ·f(w(i),z,y(i)) yˆ,zˆ←argmaxy′,z′ θ·f(w(i),z′,y′)
θ ← θ + f ( w ( i ) , z ( i ) , y ( i ) ) − f ( w ( i ) , zˆ , yˆ )
until tired return θ
and general algorithm for learning with missing data. The algorithm is shown in its most basic form in Algorithm 16, but the usual tricks such as averaging and margin loss can be applied (Yu and Joachims, 2009). Aside from semantic parsing, the latent variable perceptron has been used in tasks such as machine translation (Liang et al., 2006) and named entity recognition (Sun et al., 2009). In latent conditional random fields, we use the full expectations rather than maximizing over the hidden variable. This model has also been employed in a range of problems beyond semantic parsing, including parse reranking (Koo and Collins, 2005) and gesture recognition (Quattoni et al., 2007).
12.4.3 Learning from denotations
Logical forms are easier to obtain than complete derivations, but the annotation of logical forms still requires considerable expertise. However, it is relatively easy to obtain deno- tations for many natural language sentences. For example, in the geography domain, the denotation of a question would be its answer (Clarke et al., 2010; Liang et al., 2013):
Text :What states border Georgia?
Logical form :λx.STATE(x) ∧ BORDER(x, GEORGIA)
Denotation :{Alabama, Florida, North Carolina, South Carolina, Tennessee}
Similarly, in a robotic control setting, the denotation of a command would be an action or sequence of actions (Artzi and Zettlemoyer, 2013). In both cases, the idea is to reward the semantic parser for choosing an analysis whose denotation is correct: the right answer to the question, or the right action.
Learning from logical forms was made possible by summing or maxing over deriva- tions. This idea can be carried one step further, summing or maxing over all logical forms with the correct denotation. Let vi(y) ∈ {0, 1} be a validation function, which assigns a
Under contract with MIT Press, shared under CC-BY-NC-ND license.

302 CHAPTER12. LOGICALSEMANTICS binary score indicating whether the denotation 􏰆y􏰇 for the text w(i) is correct. We can then
learn by maximizing a conditional-likelihood objective,
l(i)(θ) = log 􏰁 vi(y) × p(y | w; θ) [12.29]
y
= log 􏰁 vi(y) × 􏰁 p(y, z | w; θ), [12.30] yz
which sums over all derivations z of all valid logical forms, {y : vi(y) = 1}. This cor- responds to the log-probability that the semantic parser produces a logical form with a valid denotation.
Differentiating with respect to θ, we obtain,
∂l(i)􏰁 􏰁′′′′
∂θ = p(y,z|w)f(w,z,y)− p(y,z |w)f(w,z,y), [12.31] y,z:vi(y)=1 y′,z′
which is the usual difference in feature expectations. The positive term computes the expected feature expectations conditioned on the denotation being valid, while the second term computes the expected feature expectations according to the current model, without regard to the ground truth. Large-margin learning formulations are also possible for this problem. For example, Artzi and Zettlemoyer (2013) generate a set of valid and invalid derivations, and then impose a constraint that all valid derivations should score higher than all invalid derivations. This constraint drives a perceptron-like learning rule.
Additional resources
A key issue not considered here is how to handle semantic underspecification: cases in which there are multiple semantic interpretations for a single syntactic structure. Quanti- fier scope ambiguity is a classic example. Blackburn and Bos (2005) enumerate a number of approaches to this issue, and also provide links between natural language semantics and computational inference techniques. Much of the contemporary research on semantic parsing uses the framework of combinatory categorial grammar (CCG). Carpenter (1997) provides a comprehensive treatment of how CCG can support compositional semantic analysis. Another recent area of research is the semantics of multi-sentence texts. This can be handled with models of dynamic semantics, such as dynamic predicate logic (Groe- nendijk and Stokhof, 1991).
Alternative readings on formal semantics include an “informal” reading from Levy and Manning (2009), and a more involved introduction from Briscoe (2011). To learn more about ongoing research on data-driven semantic parsing, readers may consult the survey
Jacob Eisenstein. Draft of October 15, 2018.

12.4. LEARNINGSEMANTICPARSERS 303 article by Liang and Potts (2015), tutorial slides and videos by Artzi and Zettlemoyer
(2013),12 and the source code by Yoav Artzi13 and Percy Liang.14
Exercises
1. The modus ponens inference rule states that if we know φ ⇒ ψ and φ, then ψ must be true. Justify this rule, using the definition of the ⇒ operator and some of the laws provided in subsection 12.2.1, plus one additional identity: ⊥ ∨ φ = φ.
2. Convertthefollowingexamplesintofirst-orderlogic,usingtherelationsCAN-SLEEP, MAKES-NOISE, and BROTHER.
• If Abigail makes noise, no one can sleep.
• If Abigail makes noise, someone cannot sleep.
• None of Abigail’s brothers can sleep.
• If one of Abigail’s brothers makes noise, Abigail cannot sleep.
3. Extend the grammar fragment G1 to include the ditransitive verb teaches and the proper noun Swahili. Show how to derive the interpretation for the sentence Alex teaches Brit Swahili, which should be TEACHES(ALEX,BRIT,SWAHILI). The grammar need not be in Chomsky Normal Form. For the ditransitive verb, use NP1 and NP2 to indicate the two direct objects.
4. Derive the semantic interpretation for the sentence Alex likes every dog, using gram- mar fragment G2.
5. Extend the grammar fragment G2 to handle adjectives, so that the meaning of an angry dog is λP.∃xDOG(x) ∧ ANGRY(x) ∧ P(x). Specifically, you should supply the lexical entry for the adjective angry, and you should specify the syntactic-semantic productions NP → DET NOM, NOM → JJ NOM, and NOM → NN.
6. Extend your answer to the previous question to cover copula constructions with predicative adjectives, such as Alex is angry. The interpretation should be ANGRY(ALEX). You should add a verb phrase production VP → Vcop JJ, and a terminal production Vcop → is. Show why your grammar extensions result in the correct interpretation.
7. InFigure12.5andFigure12.6,wetreatthepluralsshootsandleavesasentities.Revise G2 so that the interpretation of Alex eats leaves is ∀x.(LEAF(x) ⇒ EATS(ALEX, x)), and show the resulting perceptron update.
12Videos are currently available at http://yoavartzi.com/tutorial/ 13 http://yoavartzi.com/spf
14 https://github.com/percyliang/sempre
Under contract with MIT Press, shared under CC-BY-NC-ND license.

304 CHAPTER12. LOGICALSEMANTICS 8. Statements like every student eats a pizza have two possible interpretations, depend-
ing on quantifier scope:
∀x∃yPIZZA(y) ∧ (STUDENT(x) ⇒ EATS(x, y)) [12.32]
∃y∀xPIZZA(y) ∧ (STUDENT(x) ⇒ EATS(x, y)) [12.33] a) Explain why these interpretations really are different.
b) Which is generated by grammar G2? Note that you may have to manipulate the logical form to exactly align with the grammar.
9. *Modify G2 so that produces the second interpretation in the previous problem. Hint: one possible solution involves changing the semantics of the sentence pro- duction and one other production.
10. In the GeoQuery domain, give a natural language query that has multiple plausible semantic interpretations with the same denotation. List both interpretaions and the denotation.
Hint: There are many ways to do this, but one approach involves using toponyms (place names) that could plausibly map to several different entities in the model.
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 13
Predicate-argument semantics
This chapter considers more “lightweight” semantic representations, which discard some aspects of first-order logic, but focus on predicate-argument structures. Let’s begin by thinking about the semantics of events, with a simple example:
(13.1) Asha gives Boyang a book.
A first-order logical representation of this sentence is,
∃x.BOOK(x) ∧ GIVE(ASHA, BOYANG, x) [13.1]
In this representation, we define variable x for the book, and we link the strings Asha and Boyang to entities ASHA and BOYANG. Because the action of giving involves a giver, a recipient, and a gift, the predicate GIVE must take three arguments.
Now suppose we have additional information about the event: (13.2) Yesterday, Asha reluctantly gave Boyang a book.
One possible to solution is to extend the predicate GIVE to take additional arguments, ∃x.BOOK(x) ∧ GIVE(ASHA, BOYANG, x, YESTERDAY, RELUCTANTLY) [13.2]
But this is clearly unsatisfactory: yesterday and relunctantly are optional arguments, and we would need a different version of the GIVE predicate for every possible combi- nation of arguments. Event semantics solves this problem by reifying the event as an existentially quantified variable e,
∃e, x.GIVE-EVENT(e) ∧ GIVER(e, ASHA) ∧ GIFT(e, x) ∧ BOOK(e, x) ∧ RECIPIENT(e, BOYANG) ∧ TIME(e, YESTERDAY) ∧ MANNER(e, RELUCTANTLY)
305

306 CHAPTER13. PREDICATE-ARGUMENTSEMANTICS
In this way, each argument of the event — the giver, the recipient, the gift — can be rep- resented with a relation of its own, linking the argument to the event e. The expression GIVER(e, ASHA) says that ASHA plays the role of GIVER in the event. This reformulation handles the problem of optional information such as the time or manner of the event, which are called adjuncts. Unlike arguments, adjuncts are not a mandatory part of the relation, but under this representation, they can be expressed with additional logical rela- tions that are conjoined to the semantic interpretation of the sentence. 1
The event semantic representation can be applied to nested clauses, e.g., (13.3) Chris sees Asha pay Boyang.
This is done by using the event variable as an argument:
∃e1∃e2 SEE-EVENT(e1) ∧ SEER(e1, CHRIS) ∧ SIGHT(e1, e2)
As with generalizes
(13.4) a. b. c. d.
∧ PAY-EVENT(e2) ∧ PAYER(e2, ASHA) ∧ PAYEE(e2, BOYANG)
first-order logic, the goal of event semantics is to provide a representation that
over many surface forms. Consider the following paraphrases of (13.1):
Asha gives a book to Boyang.
A book is given to Boyang by Asha.
A book is given by Asha to Boyang. ThegiftofabookfromAshatoBoyang…
All have the same event semantic meaning as Equation 13.1, but the ways in which the meaning can be expressed are diverse. The final example does not even include a verb: events are often introduced by verbs, but as shown by (13.4d), the noun gift can introduce the same predicate, with the same accompanying arguments.
Semantic role labeling (SRL) is a relaxed form of semantic parsing, in which each semantic role is filled by a set of tokens from the text itself. This is sometimes called “shallow semantics” because, unlike model-theoretic semantic parsing, role fillers need not be symbolic expressions with denotations in some world model. A semantic role labeling system is required to identify all predicates, and then specify the spans of text that fill each role. To give a sense of the task, here is a more complicated example:
(13.5) Boyang wants Asha to give him a linguistics book.
1This representation is often called Neo-Davidsonian event semantics. The use of existentially- quantified event variables was proposed by Davidson (1967) to handle the issue of optional adjuncts. In Neo-Davidsonian semantics, this treatment of adjuncts is extended to mandatory arguments as well (e.g., Parsons, 1990).
Jacob Eisenstein. Draft of October 15, 2018.
[13.3]

13.1. SEMANTICROLES 307 In this example, there are two predicates, expressed by the verbs want and give. Thus, a
semantic role labeler might return the following output:
• (PREDICATE : wants, WANTER : Boyang, DESIRE : Asha to give him a linguistics book)
• (PREDICATE : give, GIVER : Asha, RECIPIENT : him, GIFT : a linguistics book)
Boyang and him may refer to the same person, but the semantic role labeling is not re- quired to resolve this reference. Other predicate-argument representations, such as Ab- stract Meaning Representation (AMR), do require reference resolution. We will return to AMR in section 13.3, but first, let us further consider the definition of semantic roles.
13.1 Semantic roles
In event semantics, it is necessary to specify a number of additional logical relations to link arguments to events: GIVER, RECIPIENT, SEER, SIGHT, etc. Indeed, every predicate re- quires a set of logical relations to express its own arguments. In contrast, adjuncts such as TIME and MANNER are shared across many types of events. A natural question is whether it is possible to treat mandatory arguments more like adjuncts, by identifying a set of generic argument types that are shared across many event predicates. This can be further motivated by examples involving related verbs:
(13.6) a. b. c. d.
Asha gave Boyang a book. Asha loaned Boyang a book. Asha taught Boyang a lesson. Asha gave Boyang a lesson.
The respective roles of Asha, Boyang, and the book are nearly identical across the first two examples. The third example is slightly different, but the fourth example shows that the roles of GIVER and TEACHER can be viewed as related.
One way to think about the relationship between roles such as GIVER and TEACHER is by enumerating the set of properties that an entity typically possesses when it fulfills these roles: givers and teachers are usually animate (they are alive and sentient) and volitional (they choose to enter into the action).2 In contrast, the thing that gets loaned or taught is usually not animate or volitional; furthermore, it is unchanged by the event.
Building on these ideas, thematic roles generalize across predicates by leveraging the shared semantic properties of typical role fillers (Fillmore, 1968). For example, in exam- ples (13.6a-13.6d), Asha plays a similar role in all four sentences, which we will call the
2There are always exceptions. For example, in the sentence The C programming language has taught me a lot about perseverance, the “teacher” is the The C programming language, which is presumably not animate or volitional.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

308
CHAPTER13. PREDICATE-ARGUMENTSEMANTICS
VerbNet PropBank FrameNet
VerbNet PropBank FrameNet
Asha gave Boyang
AGENT RECIPIENT
ARG0: giver ARG2: entity given to DONOR RECIPIENT
Asha taught Boyang
AGENT RECIPIENT ARG0: teacher ARG2: student TEACHER STUDENT
a book
THEME
ARG1: thing given THEME
algebra
TOPIC
ARG1: subject SUBJECT
Figure 13.1: Example semantic annotations according to VerbNet, PropBank, and FrameNet
agent. This reflects several shared semantic properties: she is the one who is actively and intentionally performing the action, while Boyang is a more passive participant; the book and the lesson would play a different role, as non-animate participants in the event.
Example annotations from three well known systems are shown in Figure 13.1. We will now discuss these systems in more detail.
13.1.1 VerbNet
VerbNet (Kipper-Schuler, 2005) is a lexicon of verbs, and it includes thirty “core” thematic roles played by arguments to these verbs. Here are some example roles, accompanied by their definitions from the VerbNet Guidelines.3
• AGENT: “ACTOR in an event who initiates and carries out the event intentionally or consciously, and who exists independently of the event.”
• PATIENT: “UNDERGOER in an event that experiences a change of state, location or condition, that is causally involved or directly affected by other participants, and exists independently of the event.”
• RECIPIENT: “DESTINATION that is animate”
• THEME: “UNDERGOER that is central to an event or state that does not have control over the way the event occurs, is not structurally changed by the event, and/or is characterized as being in a certain position or condition throughout the state.”
• TOPIC: “THEME characterized by information content transferred to another partic- ipant.”
3http://verbs.colorado.edu/verb-index/VerbNet_Guidelines.pdf
Jacob Eisenstein. Draft of October 15, 2018.

13.1. SEMANTICROLES 309 VerbNet roles are organized in a hierarchy, so that a TOPIC is a type of THEME, which in
turn is a type of UNDERGOER, which is a type of PARTICIPANT, the top-level category.
In addition, VerbNet organizes verb senses into a class hierarchy, in which verb senses that have similar meanings are grouped together. Recall from section 4.2 that multiple meanings of the same word are called senses, and that WordNet identifies senses for many English words. VerbNet builds on WordNet, so that verb classes are identified by the WordNet senses of the verbs that they contain. For example, the verb class give-13.1 includes the first WordNet sense of loan and the second WordNet sense of lend.
Each VerbNet class or subclass takes a set of thematic roles. For example, give-13.1 takes arguments with the thematic roles of AGENT, THEME, and RECIPIENT;4 the pred- icate TEACH takes arguments with the thematic roles AGENT, TOPIC, RECIPIENT, and SOURCE.5 So according to VerbNet, Asha and Boyang play the roles of AGENT and RECIP- IENT in the sentences,
(13.7) a. Asha gave Boyang a book.
b. Asha taught Boyang algebra.
The book and algebra are both THEMES, but algebra is a subcategory of THEME — a TOPIC — because it consists of information content that is given to the receiver.
13.1.2 Proto-roles and PropBank
Detailed thematic role inventories of the sort used in VerbNet are not universally accepted. For example, Dowty (1991, pp. 547) notes that “Linguists have often found it hard to agree on, and to motivate, the location of the boundary between role types.” He argues that a solid distinction can be identified between just two proto-roles:
Proto-Agent. Characterized by volitional involvement in the event or state; sentience and/or perception; causing an event or change of state in another participant; move- ment; exists independently of the event.
Proto-Patient. Undergoes change of state; causally affected by another participant; sta- tionary relative to the movement of another participant; does not exist indepen- dently of the event.6
4 https://verbs.colorado.edu/verb- index/vn/give- 13.1.php
5 https://verbs.colorado.edu/verb- index/vn/transfer_mesg- 37.1.1.php
6Reisinger et al. (2015) ask crowd workers to annotate these properties directly, finding that annotators
tend to agree on the properties of each argument. They also find that in English, arguments having more proto-agent properties tend to appear in subject position, while arguments with more proto-patient proper- ties appear in object position.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

310 CHAPTER13. PREDICATE-ARGUMENTSEMANTICS
In the examples in Figure 13.1, Asha has most of the proto-agent properties: in giving the book to Boyang, she is acting volitionally (as opposed to Boyang got a book from Asha, in which it is not clear whether Asha gave up the book willingly); she is sentient; she causes a change of state in Boyang; she exists independently of the event. Boyang has some proto- agent properties: he is sentient and exists independently of the event. But he also has some proto-patient properties: he is the one who is causally affected and who undergoes change of state. The book that Asha gives Boyang has even fewer of the proto-agent properties: it is not volitional or sentient, and it has no causal role. But it also lacks many of the proto-patient properties: it does not undergo change of state, exists independently of the event, and is not stationary.
The Proposition Bank, or PropBank (Palmer et al., 2005), builds on this basic agent- patient distinction, as a middle ground between generic thematic roles and roles that are specific to each predicate. Each verb is linked to a list of numbered arguments, with ARG0 as the proto-agent and ARG1 as the proto-patient. Additional numbered arguments are verb-specific. For example, for the predicate TEACH,7 the arguments are:
• ARG0: the teacher
• ARG1: the subject
• ARG2: the student(s)
Verbs may have any number of arguments: for example, WANT and GET have five, while EAT has only ARG0 and ARG1. In addition to the semantic arguments found in the frame files, roughly a dozen general-purpose adjuncts may be used in combination with any verb. These are shown in Table 13.1.
PropBank-style semantic role labeling is annotated over the entire Penn Treebank. This annotation includes the sense of each verbal predicate, as well as the argument spans.
13.1.3 FrameNet
Semantic frames are descriptions of situations or events. Frames may be evoked by one of their lexical units (often a verb, but not always), and they include some number of frame elements, which are like roles (Fillmore, 1976). For example, the act of teaching is a frame, and can be evoked by the verb taught; the associated frame elements include the teacher, the student(s), and the subject being taught. Frame semantics has played a significant role in the history of artificial intelligence, in the work of Minsky (1974) and Schank and Abelson (1977). In natural language processing, the theory of frame semantics has been implemented in FrameNet (Fillmore and Baker, 2009), which consists of a lexicon
7 http://verbs.colorado.edu/propbank/framesets-english-aliases/teach.html
Jacob Eisenstein. Draft of October 15, 2018.

13.1. SEMANTICROLES
311
TMP time
LOC location
MOD modal verb
ADV general purpose MNR manner
DIS discourse connective PRP purpose
DIR direction
NEG negation
EXT extent
CAU cause
Boyang ate a bagel [AM-TMP yesterday].
Asha studies in [AM-LOC Stuttgart]
Asha [AM-MOD will] study in Stuttgart
[AM-ADV Luckily], Asha knew algebra.
Asha ate [AM-MNR aggressively].
[AM-DIS However], Asha prefers algebra.
Barry studied [AM-PRP to pass the bar].
Workers dumped burlap sacks [AM-DIR into a bin]. Asha does [AM-NEG not] speak Albanian.
Prices increased [AM-EXT 4%].
Boyang returned the book [AM-CAU because it was overdue].
Table 13.1: PropBank adjuncts (Palmer et al., 2005), sorted by frequency in the corpus
of roughly 1000 frames, and a corpus of more than 200,000 “exemplar sentences,” in which the frames and their elements are annotated.8
Rather than seeking to link semantic roles such as TEACHER and GIVER into the- matic roles such as AGENT, FrameNet aggressively groups verbs into frames, and links semantically-related roles across frames. For example, the following two sentences would be annotated identically in FrameNet:
(13.8) a. Asha taught Boyang algebra.
b. Boyang learned algebra from Asha.
This is because teach and learn are both lexical units in the EDUCATION TEACHING frame. Furthermore, roles can be shared even when the frames are distinct, as in the following two examples:
(13.9) a. Asha gave Boyang a book.
b. Boyang got a book from Asha.
The GIVING and GETTING frames both have RECIPIENT and THEME elements, so Boyang and the book would play the same role. Asha’s role is different: she is the DONOR in the GIVING frame, and the SOURCE in the GETTING frame. FrameNet makes extensive use of multiple inheritance to share information across frames and frame elements: for example, the COMMERCE SELL and LENDING frames inherit from GIVING frame.
8Current details and data can be found at https://framenet.icsi.berkeley.edu/
Under contract with MIT Press, shared under CC-BY-NC-ND license.

312 CHAPTER13. PREDICATE-ARGUMENTSEMANTICS 13.2 Semantic role labeling
The task of semantic role labeling is to identify the parts of the sentence comprising the semantic roles. In English, this task is typically performed on the PropBank corpus, with the goal of producing outputs in the following form:
(13.10) [ARG0 Asha] [GIVE.01 gave] [ARG2 Boyang’s mom] [ARG1 a book] [AM-TMP yesterday]. Note that a single sentence may have multiple verbs, and therefore a given word may be
part of multiple role-fillers:
(13.11) [ARG0 Asha] [WANT.01 wanted] Asha wanted
[ARG1 Boyang to give her the book].
[ARG0 Boyang] [GIVE.01 to give] [ARG2 her] [ARG1 the book].
13.2.1 Semantic role labeling as classification
PropBank is annotated on the Penn Treebank, and annotators used phrasal constituents
(subsection 9.2.2) to fill the roles. PropBank semantic role labeling can be viewed as the
task of assigning to each phrase a label from the set R = {∅, PRED, ARG0, ARG1, ARG2, . . . , AM-LOC, AM with respect to each predicate. If we treat semantic role labeling as a classification prob-
lem, we obtain the following functional form:
where,
yˆ(i,j) = argmaxψ(w,y,i,j,ρ,τ), y
[13.4]
• (i,j)indicatesthespanofaphrasalconstituent(wi+1,wi+2,…,wj);9 • w represents the sentence as a sequence of tokens;
• ρ is the index of the predicate verb in w;
• τ is the structure of the phrasal constituent parse of w.
Early work on semantic role labeling focused on discriminative feature-based models, where ψ(w, y, i, j, ρ, τ ) = θ · f (w, y, i, j, ρ, τ ). Table 13.2 shows the features used in a sem- inal paper on FrameNet semantic role labeling (Gildea and Jurafsky, 2002). By 2005 there
9PropBank roles can also be filled by split constituents, which are discontinuous spans of text. This situation most frequently in reported speech, e.g. [ARG1 By addressing these problems], Mr. Maxwell said, [ARG1 the new funds have become extremely attractive.] (example adapted from Palmer et al., 2005). This issue is typically addressed by defining “continuation arguments”, e.g. C-ARG1, which refers to the continuation of ARG1 after the split.
Jacob Eisenstein. Draft of October 15, 2018.

13.2. SEMANTICROLELABELING 313
Predicate lemma and POS tag
Voice
Phrase type
Headword and POS tag
Position
Syntactic path Subcategorization
The lemma of the predicate verb and its part-of-speech tag
Whether the predicate is in active or passive voice, as deter- mined by a set of syntactic patterns for identifying passive voice constructions
The constituent phrase type for the proposed argument in the parse tree, e.g. NP, PP
The head word of the proposed argument and its POS tag, identified using the Collins (1997) rules
Whether the proposed argument comes before or after the predicate in the sentence
The set of steps on the parse tree from the proposed argu- ment to the predicate (described in detail in the text)
The syntactic production from the first branching node above the predicate. For example, in Figure 13.2, the subcategorization feature around taught would be VP → VBD NP PP.
Table 13.2: Features used in semantic role labeling by Gildea and Jurafsky (2002).
were several systems for PropBank semantic role labeling, and their approaches and fea- ture sets are summarized by Carreras and Ma`rquez (2005). Typical features include: the phrase type, head word, part-of-speech, boundaries, and neighbors of the proposed argu- ment wi+1:j ; the word, lemma, part-of-speech, and voice of the verb wρ (active or passive), as well as features relating to its frameset; the distance and path between the verb and the proposed argument. In this way, semantic role labeling systems are high-level “con- sumers” in the NLP stack, using features produced from lower-level components such as part-of-speech taggers and parsers. More comprehensive feature sets are enumerated by Das et al. (2014) and Ta ̈ckstro ̈m et al. (2015).
A particularly powerful class of features relate to the syntactic path between the ar- gument and the predicate. These features capture the sequence of moves required to get from the argument to the verb by traversing the phrasal constituent parse of the sentence. The idea of these features is to capture syntactic regularities in how various arguments are realized. Syntactic path features are best illustrated by example, using the parse tree in Figure 13.2:
• The path from Asha to the verb taught is NNP↑NP↑S↓VP↓VBD. The first part of the path, NNP↑NP↑S, means that we must travel up the parse tree from the NNP tag (proper noun) to the S (sentence) constituent. The second part of the path, S↓VP↓VBD, means that we reach the verb by producing a VP (verb phrase) from
Under contract with MIT Press, shared under CC-BY-NC-ND license.

314
CHAPTER13. PREDICATE-ARGUMENTSEMANTICS
NP
Nnp(Arg0)
Asha
S
Vbd
taught
VP
NP(Arg2) PP(Arg1)
Det Nn In Nn
the class about algebra
Figure 13.2: Semantic role labeling on the phrase-structure parse tree for a sentence. The dashed line indicates the syntactic path from Asha to the predicate verb taught.
the S constituent, and then by producing a VBD (past tense verb). This feature is consistent with Asha being in subject position, since the path includes the sentence root S.
• The path from the class to taught is NP↑VP↓VBD. This is consistent with the class being in object position, since the path passes through the VP node that dominates the verb taught.
Because there are many possible path features, it can also be helpful to look at smaller parts: for example, the upward and downward parts can be treated as separate features; another feature might consider whether S appears anywhere in the path.
Rather than using the constituent parse, it is also possible to build features from the
dependency path (see section 11.4) between the head word of each argument and the
verb (Pradhan et al., 2005). Using the Universal Dependency part-of-speech tagset and
dependency relations (Nivre et al., 2016), the dependency path from Asha to taught is
PROPN ← VERB, because taught is the head of a relation of type ← with Asha. Simi- NSUBJ NSUBJ
larly, the dependency path from class to taught is NOUN ← VERB, because class heads the DOBJ
noun phrase that is a direct object of taught. A more interesting example is Asha wanted to teach the class, where the path from Asha to teach is PROPN ← VERB → VERB. The
NSUBJ XCOMP right-facing arrow in second relation indicates that wanted is the head of its XCOMP rela-
tion with teach.
Jacob Eisenstein. Draft of October 15, 2018.

13.2. SEMANTICROLELABELING 315 13.2.2 Semantic role labeling as constrained optimization
A potential problem with treating SRL as a classification problem is that there are a num- ber of sentence-level constraints, which a classifier might violate.
• For a given verb, there can be only one argument of each type (ARG0, ARG1, etc.)
• Arguments cannot overlap. This problem arises when we are labeling the phrases in a constituent parse tree, as shown in Figure 13.2: if we label the PP about algebra as an argument or adjunct, then its children about and algebra must be labeled as ∅. The same constraint also applies to the syntactic ancestors of this phrase.
These constraints introduce dependencies across labeling decisions. In structure pre- diction problems such as sequence labeling and parsing, such dependencies are usually handled by defining a scoring over the entire structure, y. Efficient inference requires that the global score decomposes into local parts: for example, in sequence labeling, the scoring function decomposes into scores of pairs of adjacent tags, permitting the applica- tion of the Viterbi algorithm for inference. But the constraints that arise in semantic role labeling are less amenable to local decomposition.10 We therefore consider constrained optimization as an alternative solution.
Let the set C(τ) refer to all labelings that obey the constraints introduced by the parse τ . The semantic role labeling problem can be reformulated as a constrained optimization over y ∈ C(τ), 􏰁
max ψ(w,yi,j,i,j,ρ,τ)
y
(i,j )∈τ
s.t. y ∈ C(τ). [13.5]
In this formulation, the objective (shown on the first line) is a separable function of each individual labeling decision􏰀, but the constraints (shown on the second line) apply to the overall labeling. The sum (i,j)∈τ indicates that we are summing over all constituent spans in the parse τ. The expression s.t. in the second line means that we maximize the objective subject to the constraint y ∈ C(τ).
A number of practical algorithms exist for restricted forms of constrained optimiza- tion. One such restricted form is integer linear programming, in which the objective and constraints are linear functions of integer variables. To formulate SRL as an integer linear program, we begin by rewriting the lab􏰅els as a set of binary variables z = {zi,j,r} (Pun- yakanok et al., 2008),
zi,j,r = 1, yi,j = r [13.6] 0, otherwise,
10DynamicprogrammingsolutionshavebeenproposedbyTrombleandEisner(2006)andTa ̈ckstro ̈metal. (2015), but they involves creating a trellis structure whose size is exponential in the number of labels.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

316 CHAPTER13. PREDICATE-ARGUMENTSEMANTICS where r ∈ R is a label in the set {ARG0,ARG1,…,AM-LOC,…,∅}. Thus, the variables
z are a binarized version of the semantic role labeling y.
The objective can then be formulated as a linear function of z.
􏰁 ψ(w,yi,j,i,j,ρ,τ)=􏰁ψ(w,r,i,j,ρ,τ)×zi,j,r, [13.7] (i,j )∈τ i,j,r
which is the sum of the scores of all relations, as indicated by zi,j,r.
Constraints Integer linear programming permits linear inequality constraints, of the general form Az ≤ b, where the parameters A and b define the constraints. To make this more concrete, let’s start with the constraint that each non-null role type can occur only once in a sentence. This constraint can be written,
∀r ̸= ∅, 􏰁 zi,j,r ≤ 1. [13.8] (i,j )∈τ
Recall that zi,j,r = 1 iff the span (i, j) has label r; this constraint says that for each possible label r ̸= ∅, there can be at most one (i, j) such that zi,j,r = 1. Rewriting this constraint can be written in the form Az ≤ b, as you will find if you complete the exercises at the end of the chapter.
Now consider the constraint that labels cannot overlap. Let’s define the convenience function o((i,j),(i′,j′)) = 1 iff (i,j) overlaps (i′,j′), and zero otherwise. Thus, o will indicate if a constituent (i′, j′) is either an ancestor or descendant of (i, j). The constraint is that if two constituents overlap, only one can have a non-null label:
∀(i, j) ∈ τ, 􏰁 􏰁 o((i, j), (i′, j′)) × zi′,j′,r ≤ 1, [13.9] (i′,j′)∈τ r̸=∅
where o((i, j), (i, j)) = 1.
In summary, the semantic role labeling problem can thus be rewritten as the following
integer linear program,
􏰁􏰁􏰁
max|τ | z∈{0,1}
s.t.
zi,j,r ψi,j,r
∀r ̸= ∅, zi,j,r ≤ 1.
(i,j )∈τ
∀(i, j) ∈ τ, 􏰁 􏰁 o((i, j), (i′, j′)) × zi′,j′,r ≤ 1.
[13.10] [13.11] [13.12]
(i,j)∈τ r∈R
(i′,j′)∈τ r̸=∅
Jacob Eisenstein. Draft of October 15, 2018.

13.2. SEMANTICROLELABELING 317
Learning with constraints Learning can be performed in the context of constrained op- timization using the usual perceptron or large-margin classification updates. Because constrained inference is generally more time-consuming, a key question is whether it is necessary to apply the constraints during learning. Chang et al. (2008) find that better per- formance can be obtained by learning without constraints, and then applying constraints only when using the trained model to predict semantic roles for unseen data.
How important are the constraints? Das et al. (2014) find that an unconstrained, classification- based method performs nearly as well as constrained optimization for FrameNet parsing: while it commits many violations of the “no-overlap” constraint, the overall F1 score is
less than one point worse than the score at the constrained optimum. Similar results
were obtained for PropBank semantic role labeling by Punyakanok et al. (2008). He et al. (2017) find that constrained inference makes a bigger impact if the constraints are based on manually-labeled “gold” syntactic parses. This implies that errors from the syntac- tic parser may limit the effectiveness of the constraints. Punyakanok et al. (2008) hedge against parser error by including constituents from several different parsers; any con- stituent can be selected from any parse, and additional constraints ensure that overlap- ping constituents are not selected.
Implementation Integer linear programming solvers such as glpk,11 cplex,12 and Gurobi13 allow inequality constraints to be expressed directly in the problem definition, rather than
in the matrix form Az ≤ b. The time complexity of integer linear programming is theoret- ically exponential in the number of variables |z|, but in practice these off-the-shelf solvers obtain good solutions efficiently. Using a standard desktop computer, Das et al. (2014) report that the cplex solver requires 43 seconds to perform inference on the FrameNet
test set, which contains 4,458 predicates.
Recent work has shown that many constrained optimization problems in natural lan- guage processing can be solved in a highly parallelized fashion, using optimization tech- niques such as dual decomposition, which are capable of exploiting the underlying prob- lem structure (Rush et al., 2010). Das et al. (2014) apply this technique to FrameNet se- mantic role labeling, obtaining an order-of-magnitude speedup over cplex.
13.2.3 Neural semantic role labeling
Neural network approaches to SRL have tended to treat it as a sequence labeling task, using a labeling scheme such as the BIO notation, which we previously saw in named entity recognition (section 8.3). In this notation, the first token in a span of type ARG1
11 https://www.gnu.org/software/glpk/
12 https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/ 13 http://www.gurobi.com/
Under contract with MIT Press, shared under CC-BY-NC-ND license.

318 CHAPTER13. PREDICATE-ARGUMENTSEMANTICS is labeled B-ARG1; all remaining tokens in the span are inside, and are therefore labeled
I-ARG1. Tokens outside any argument are labeled O. For example:
(13.12) Asha taught Boyang ’s mom about algebra B-ARG0 PRED B-ARG2 I-ARG2 I-ARG2 B-ARG1 I-ARG1
Recurrent neural networks (section 7.6) are a natural approach to this tagging task. For example, Zhou and Xu (2015) apply a deep bidirectional multilayer LSTM (see sec- tion 7.6) to PropBank semantic role labeling. In this model, each bidirectional LSTM serves as input for another, higher-level bidirectional LSTM, allowing complex non-linear trans- formations of the original input embeddings, X = [x1, x2, . . . , xM ]. The hidden state of
the final LSTM is Z(K) = [z(K), z(K), . . . , z(K)]. The “emission” score for each tag Ym = y 12M
is equal to the inner product θ · z(K), and there is also a transition score for each pair of ym
adjacent tags. The complete model can be written, Z(1) =BiLSTM(X)
Z(i) =BiLSTM(Z(i−1))
[13.13] [13.14]
[13.15]
yˆ = argmax y
􏰁M m−1
Θ(y)z(K) + ψ . m ym−1,ym
Note that the final step maximizes over the entire labeling y, and includes a score for each tag transition ψym−1,ym. This combination of LSTM and pairwise potentials on tags is an example of an LSTM-CRF. The maximization over y is performed by the Viterbi algorithm.
This model strongly outperformed alternative approaches at the time, including con- strained decoding and convolutional neural networks.14 More recent work has combined recurrent neural network models with constrained decoding, using the A∗ search algo- rithm to search over labelings that are feasible with respect to the constraints (He et al., 2017). This yields small improvements over the method of Zhou and Xu (2015). He et al. (2017) obtain larger improvements by creating an ensemble of SRL systems, each trained on an 80% subsample of the corpus. The average prediction across this ensemble is more robust than any individual model.
13.3 Abstract Meaning Representation
Semantic role labeling transforms the task of semantic parsing to a labeling task. Consider the sentence,
14The successful application of convolutional neural networks to semantic role labeling by Collobert and Weston (2008) was an influential early result in the current wave of neural networks in natural language processing.
Jacob Eisenstein. Draft of October 15, 2018.

13.3. ABSTRACT MEANING REPRESENTATION
319
(w / want-01
:ARG0 (h / whale)
:ARG1 (p / pursue-02
:ARG0 (c / captain)
:ARG1 h))
w / wants-01
Arg0
h / whale Arg1
Arg1
p / pursue-02
Arg0
c / captain
Figure 13.3: Two views of the AMR representation for the sentence The whale wants the captain to pursue him.
(13.13) The whale wants the captain to pursue him. The PropBank semantic role labeling analysis is:
• (PREDICATE : wants, ARG0 : the whale, ARG1 : the captain to pursue him) • (PREDICATE : pursue, ARG0 : the captain, ARG1 : him)
The Abstract Meaning Representation (AMR) unifies this analysis into a graph struc- ture, in which each node is a variable, and each edge indicates a concept (Banarescu et al., 2013). This can be written in two ways, as shown in Figure 13.3. On the left is the PENMAN notation (Matthiessen and Bateman, 1991), in which each set of parentheses in- troduces a variable. Each variable is an instance of a concept, which is indicated with the slash notation: for example, w / want-01 indicates that the variable w is an instance of the concept want-01, which in turn refers to the PropBank frame for the first sense of the verb want; pursue-02 refers to the second sense of pursue. Relations are introduced with colons: for example, :ARG0 (c / captain) indicates a relation of type ARG0 with the newly-introduced variable c. Variables can be reused, so that when the variable h ap- pears again as an argument to p, it is understood to refer to the same whale in both cases. This arrangement is indicated compactly in the graph structure on the right, with edges indicating concepts.
One way in which AMR differs from PropBank-style semantic role labeling is that it reifies each entity as a variable: for example, the whale in (13.13) is reified in the variable h, which is reused as ARG0 in its relationship with w / want-01, and as ARG1 in its relationship with p / pursue-02. Reifying entities as variables also makes it possible to represent the substructure of noun phrases more explicitly. For example, Asha borrowed the algebra book would be represented as:
(b / borrow-01
:ARG0 (p / person
:name (n / name
:op1 “Asha”))
Under contract with MIT Press, shared under CC-BY-NC-ND license.

320
CHAPTER13. PREDICATE-ARGUMENTSEMANTICS
:ARG1 (b2 / book
:topic (a / algebra)))
This indicates that the variable p is a person, whose name is the variable n; that name has one token, the string Asha. Similarly, the variable b2 is a book, and the topic of b2 is a variable a whose type is algebra. The relations name and topic are examples of “non-core roles”, which are similar to adjunct modifiers in PropBank. However, AMR’s inventory is more extensive, including more than 70 non-core roles, such as negation, time, manner, frequency, and location. Lists and sequences — such as the list of tokens in a name — are described using the roles op1, op2, etc.
Another feature of AMR is that a semantic predicate can be introduced by any syntac- tic element, as in the following examples from Banarescu et al. (2013):
(13.14) a.
b. the destruction of the room by the boy . . .
c. the boy’s destruction of the room . . .
All these examples have the same semantics in AMR,
(d / destroy-01
:ARG0 (b / boy)
:ARG1 (r / room))
The noun destruction is linked to the verb destroy, which is captured by the PropBank frame destroy-01. This can happen with adjectives as well: in the phrase the attractive spy, the adjective attractive is linked to the PropBank frame attract-01:
(s / spy
:ARG0-of (a / attract-01))
In this example, ARG0-of is an inverse relation, indicating that s is the ARG0 of the predicate a. Inverse relations make it possible for all AMR parses to have a single root concept.
While AMR goes farther than semantic role labeling, it does not link semantically- related frames such as buy/sell (as FrameNet does). AMR also does not handle quan- tification (as first-order predicate calculus does), and it makes no attempt to handle noun number and verb tense (as PropBank does).
Jacob Eisenstein. Draft of October 15, 2018.
The boy destroyed the room.

13.3. ABSTRACTMEANINGREPRESENTATION 321 13.3.1 AMR Parsing
Abstract Meaning Representation is not a labeling of the original text — unlike PropBank semantic role labeling, and most of the other tagging and parsing tasks that we have encountered thus far. The AMR for a given sentence may include multiple concepts for single words in the sentence: as we have seen, the sentence Asha likes algebra contains both person and name concepts for the word Asha. Conversely, words in the sentence may not appear in the AMR: in Boyang made a tour of campus, the light verb make would not appear in the AMR, which would instead be rooted on the predicate tour. As a result, AMR is difficult to parse, and even evaluating AMR parsing involves considerable algorithmic complexity (Cai and Yates, 2013).
A further complexity is that AMR labeled datasets do not explicitly show the align- ment between the AMR annotation and the words in the sentence. For example, the link between the word wants and the concept want-01 is not annotated. To acquire train- ing data for learning-based parsers, it is therefore necessary to first perform an alignment between the training sentences and their AMR parses. Flanigan et al. (2014) introduce a rule-based parser, which links text to concepts through a series of increasingly high-recall steps.
As with dependency parsing, AMR can be parsed by graph-based methods that ex- plore the space of graph structures, or by incremental transition-based algorithms. One approach to graph-based AMR parsing is to first group adjacent tokens into local sub- structures, and then to search the space of graphs over these substructures (Flanigan et al., 2014). The identification of concept subgraphs can be formulated as a sequence labeling problem, and the subsequent graph search can be solved using integer linear programming (subsection 13.2.2). Various transition-based parsing algorithms have been proposed. Wang et al. (2015) construct an AMR graph by incrementally modifying the syntactic dependency graph. At each step, the parser performs an action: for example, adding an AMR relation label to the current dependency edge, swapping the direction of a syntactic dependency edge, or cutting an edge and reattaching the orphaned subtree to a new parent.
Additional resources
Practical semantic role labeling was first made possible by the PropBank annotations on the Penn Treebank (Palmer et al., 2005). Abend and Rappoport (2017) survey several semantic representation schemes, including semantic role labeling and AMR. Other lin- guistic features of AMR are summarized in the original paper (Banarescu et al., 2013) and the tutorial slides by Schneider et al. (2015). Recent shared tasks have undertaken seman- tic dependency parsing, in which the goal is to identify semantic relationships between pairs of words (Oepen et al., 2014); see Ivanova et al. (2012) for an overview of connections
Under contract with MIT Press, shared under CC-BY-NC-ND license.

322 CHAPTER13. PREDICATE-ARGUMENTSEMANTICS between syntactic and semantic dependencies.
Exercises
1. Write out an event semantic representation for the following sentences. You may make up your own predicates.
(13.15) Abigail shares with Max.
(13.16) Abigail reluctantly shares a toy with Max.
(13.17) Abigail hates to share with Max.
2. FindthePropBankframesetsforshareandhateathttp://verbs.colorado.edu/ propbank/framesets-english-aliases/, and rewrite your answers from the previous question, using the thematic roles ARG0, ARG1, and ARG2.
3. ComputethesyntacticpathfeaturesforAbigailandMaxineachoftheexamplesen- tences (13.15) and (13.17) in Question 1, with respect to the verb share. If you’re not sure about the parse, you can try an online parser such as http://nlp.stanford. edu:8080/parser/.
4. Compute the dependency path features for Abigail and Max in each of the example sentences (13.15) and (13.17) in Question 1, with respect to the verb share. Again, if you’re not sure about the parse, you can try an online parser such as http://nlp. stanford.edu:8080/parser/. As a hint, the dependency relation between share and Max is OBL according to the Universal Dependency treebank.
5. PropBank semantic role labeling includes reference arguments, such as,
(13.18) [AM-LOC The bed] on [R-AM-LOC which] I slept broke.15
The label R-AM-LOC indicates that the word which is a reference to The bed, which expresses the location of the event. Reference arguments must have referents: the tag R-AM-LOC can appear only when AM-LOC also appears in the sentence. Show how to express this as a linear constraint, specifically for the tag R-AM-LOC. Be sure to correctly handle the case in which neither AM-LOC nor R-AM-LOC appear in the sentence.
6. Explain how to express the constraints on semantic role labeling in Equation 13.8 and Equation 13.9 in the general form Az ≥ b.
7. Produce the AMR annotations for the following examples:
15Example from 2013 NAACL tutorial slides by Shumin Wu
Jacob Eisenstein. Draft of October 15, 2018.

13.3. ABSTRACTMEANINGREPRESENTATION 323
(13.19) The girl likes the boy.
(13.20) The girl was liked by the boy.
(13.21) Abigail likes Maxwell Aristotle.
(13.22) The spy likes the attractive boy.
(13.23) The girl doesn’t like the boy.
(13.24) The girl likes her dog.
For (13.21), recall that multi-token names are created using op1, op2, etc. You will need to consult Banarescu et al. (2013) for (13.23), and Schneider et al. (2015) for (13.24). You may assume that her refers to the girl in this example.
8. In this problem, you will build a FrameNet sense classifier for the verb can, which can evoke two frames: POSSIBILITY (can you order a salad with french fries?) and CAPABILITY(can you eat a salad with chopsticks?).
To build the dataset, access the FrameNet corpus in NLTK:
import nltk nltk.download(’framenet_v17’)
from nltk.corpus import framenet as fn
Next, find instances in which the lexical unit can.v (the verb form of can) evokes a frame. Do this by iterating over fn.docs(), and then over sentences, and then
for doc in fn.docs():
if ’sentence’ in doc:
for sent in doc[’sentence’]:
for anno_set in sent[’annotationSet’]:
if ’luName’ in anno_set and anno_set[’luName’] == ’can.v’: pass # your code here
Use the field frameName as a label, and build a set of features from the field text. Train a classifier to try to accurately predict the frameName, disregarding cases other than CAPABILITY and POSSIBILITY. Treat the first hundred instances as a train- ing set, and the remaining instances as the test set. Can you do better than a classifier that simply selects the most common class?
9. *Download the PropBank sample data, using NLTK (http://www.nltk.org/ howto/propbank.html).
a) Use a deep learning toolkit such as PyTorch to train a BiLSTM sequence label- ing model (section 7.6) to identify words or phrases that are predicates, e.g., we/O took/B-PRED a/I-PRED walk/I-PRED together/O. Your model should com- pute the tag score from the BiLSTM hidden state ψ(ym) = βy · hm.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

324
CHAPTER13. PREDICATE-ARGUMENTSEMANTICS
10.
Using an off-the-shelf PropBank SRL system,16 build a simplified question answer- ing system in the style of Shen and Lapata (2007). Specifically, your system should do the following:
• For each document in a collection, it should apply the semantic role labeler, and should store the output as a tuple.
• For a question, your system should again apply the semantic role labeler. If any of the roles are filled by a wh-pronoun, you should mark that role as the expected answer phrase (EAP).
• Toanswerthequestion,searchforastoredtuplewhichmatchesthequestionas well as possible (same predicate, no incompatible semantic roles, and as many matching roles as possible). Align the EAP against its role filler in the stored tuple, and return this as the answer.
To evaluate your system, download a set of three news articles on the same topic, and write down five factoid questions that should be answerable from the arti- cles. See if your system can answer these questions correctly. (If this problem is assigned to an entire class, you can build a large-scale test set and compare various approaches.)
b) c)
Optionally, implement Viterbi to improve the predictions of the model in the previous section.
Try to identify ARG0 and ARG1 for each predicate. You should again use the BiLSTM and BIO notation, but you may want to include the BiLSTM hidden state at the location of the predicate in your prediction model, e.g., ψ(ym) = βy ·[hm; hrˆ], where rˆ is the predicted location of the (first word of the) predicate.
16At the time of writing, the following systems are availabe: SENNA (http://ronan.collobert. com/senna/), Illinois Semantic Role Labeler (https://cogcomp.cs.illinois.edu/page/ software_view/SRL), and mate-tools (https://code.google.com/archive/p/mate-tools/).
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 14
Distributional and distributed semantics
A recurring theme in natural language processing is the complexity of the mapping from words to meaning. In chapter 4, we saw that a single word form, like bank, can have mul- tiple meanings; conversely, a single meaning may be created by multiple surface forms, a lexical semantic relationship known as synonymy. Despite this complex mapping be- tween words and meaning, natural language processing systems usually rely on words as the basic unit of analysis. This is especially true in semantics: the logical and frame semantic methods from the previous two chapters rely on hand-crafted lexicons that map from words to semantic predicates. But how can we analyze texts that contain words that we haven’t seen before? This chapter describes methods that learn representations of word meaning by analyzing unlabeled data, vastly improving the generalizability of natural language processing systems. The theory that makes it possible to acquire mean- ingful representations from unlabeled data is the distributional hypothesis.
14.1 The distributional hypothesis
Here’sawordyoumaynotknow:tezgu ̈ino(theexampleisfromLin,1998).Ifyoudonot know the meaning of tezgu ̈ino, then you are in the same situation as a natural language processing system when it encounters a word that did not appear in its training data. Nowsupposeyouseethattezgu ̈inoisusedinthefollowingcontexts:
(14.1) A bottle of is on the table.
(14.2) Everybody likes .
(14.3) Don’t have before you drive.
(14.4) We make out of corn.
325

326
CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
tezgu ̈ino loud motoroil tortillas choices wine
(14.1) (14.2) (14.3) (14.4) …
1 1 1 1 0000 1 0 0 1 0101 0 1 0 0 1110
Table14.1:Distributionalstatisticsfortezgu ̈inoandfiverelatedterms 4
3 2 1 0
Figure 14.1: Lexical semantic relationships have regular linear structures in two dimen- sional projections of distributional statistics (Pennington et al., 2014).
What other words fit into these contexts? How about: loud, motor oil, tortillas, choices, wine? Each row of Table 14.1 is a vector that summarizes the contextual properties for each word, with a value of one for contexts in which the word can appear, and a value of zero for contexts in which it cannot. Based on these vectors, we can conclude: wine is very similar to tezgu ̈ino; motor oil and tortillas are fairly similar to tezgu ̈ino; loud is completely different.
These vectors, which we will call word representations, describe the distributional properties of each word. Does vector similarity imply semantic similarity? This is the dis- tributional hypothesis, stated by Firth (1957) as: “You shall know a word by the company it keeps.” The distributional hypothesis has stood the test of time: distributional statistics are a core part of language technology today, because they make it possible to leverage large amounts of unlabeled data to learn about rare words that do not appear in labeled training data.
Distributional statistics have a striking ability to capture lexical semantic relationships Jacob Eisenstein. Draft of October 15, 2018.
4 3 2 1 0
dark darker
soft
strong stronger
man
brother uncle
sir
king emperor
queen duke earl
darkest
softest clearer clearest
softer clear
nephew woman
sister aunt
niece
heir madam
heiress
empress
duchess countess
strongest louder
shortest slowest
1
short loud
loudest
3 4
1
2
2
shorter slow
slower
3 4
432101234
420246

14.2. DESIGNDECISIONSFORWORDREPRESENTATIONS 327
such as analogies. Figure 14.1 shows two examples, based on two-dimensional projections of distributional word embeddings, discussed later in this chapter. In each case, word- pair relationships correspond to regular linear patterns in this two dimensional space. No labeled data about the nature of these relationships was required to identify this underly- ing structure.
Distributional semantics are computed from context statistics. Distributed seman- tics are a related but distinct idea: that meaning can be represented by numerical vectors rather than symbolic structures. Distributed representations are often estimated from dis- tributional statistics, as in latent semantic analysis and WORD2VEC, described later in this chapter. However, distributed representations can also be learned in a supervised fashion from labeled data, as in the neural classification models encountered in chapter 3.
14.2 Design decisions for word representations
There are many approaches for computing word representations, but most can be distin- guished on three main dimensions: the nature of the representation, the source of contex- tual information, and the estimation procedure.
14.2.1 Representation
Today, the dominant word representations are k-dimensional vectors of real numbers, known as word embeddings. (The name is due to the fact that each discrete word is em- bedded in a continuous vector space.) This representation dates back at least to the late 1980s (Deerwester et al., 1990), and is used in popular techniques such as WORD2VEC (Mikolov et al., 2013).
Word embeddings are well suited for neural networks, where they can be plugged in as inputs. They can also be applied in linear classifiers and structure prediction mod- els (Turian et al., 2010), although it can be difficult to learn linear models that employ real-valued features (Kummerfeld et al., 2015). A popular alternative is bit-string repre- sentations, such as Brown clusters (section 14.4), in which each word is represented by a variable-length sequence of zeros and ones (Brown et al., 1992).
Another representational question is whether to estimate one embedding per surface form (e.g., bank), or to estimate distinct embeddings for each word sense or synset. In- tuitively, if word representations are to capture the meaning of individual words, then words with multiple meanings should have multiple embeddings. This can be achieved by integrating unsupervised clustering with word embedding estimation (Huang and Yates, 2012; Li and Jurafsky, 2015). However, Arora et al. (2018) argue that it is unnec- essary to model distinct word senses explicitly, because the embeddings for each surface form are a linear combination of the embeddings of the underlying senses.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

328 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS The moment one learns English, complications set in (Alfau, 1999)
Brown Clusters
WORD2VEC, h = 2 Structured WORD2VEC, h = 2 Dependency contexts,
{one}
{moment, one, English, complications}
{(moment, −2), (one, −1), (English, +1), (complications, +2)} {(one, NSUBJ), (English, DOBJ), (moment, ACL−1)}
Table 14.2: Contexts for the word learns, according to various word representations. For dependency context, (one, NSUBJ) means that there is a relation of type NSUBJ (nominal subject) to the word one, and (moment, ACL−1) means that there is a relation of type ACL (adjectival clause) from the word moment.
14.2.2 Context
The distributional hypothesis says that word meaning is related to the “contexts” in which thewordappears,butcontextcanbedefinedinmanyways.Inthetezgu ̈inoexample,con- texts are entire sentences, but in practice there are far too many sentences. At the oppo- site extreme, the context could be defined as the immediately preceding word; this is the context considered in Brown clusters. WORD2VEC takes an intermediate approach, using local neighborhoods of words (e.g., h = 5) as contexts (Mikolov et al., 2013). Contexts can also be much larger: for example, in latent semantic analysis, each word’s context vector includes an entry per document, with a value of one if the word appears in the document (Deerwester et al., 1990); in explicit semantic analysis, these documents are Wikipedia pages (Gabrilovich and Markovitch, 2007).
In structured WORD2VEC, context words are labeled by their position with respect to the target word wm (e.g., two words before, one word after), which makes the result- ing word representations more sensitive to syntactic differences (Ling et al., 2015). An- other way to incorporate syntax is to perform parsing as a preprocessing step, and then form context vectors from the dependency edges (Levy and Goldberg, 2014) or predicate- argument relations (Lin, 1998). The resulting context vectors for several of these methods are shown in Table 14.2.
The choice of context has a profound effect on the resulting representations, which can be viewed in terms of word similarity. Applying latent semantic analysis (section 14.3) to contexts of size h = 2 and h = 30 yields the following nearest-neighbors for the word dog:1
• (h = 2): cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon
1The example is from lecture slides by Marco Baroni, Alessandro Lenci, and Stefan Evert, who applied latent semantic analysis to the British National Corpus. You can find an online demo here: http://clic. cimec.unitn.it/infomap- query/
Jacob Eisenstein. Draft of October 15, 2018.

14.3. LATENTSEMANTICANALYSIS 329 • (h = 30): kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian
Which word list is better? Each word in the h = 2 list is an animal, reflecting the fact that locally, the word dog tends to appear in the same contexts as other animal types (e.g., pet the dog, feed the dog). In the h = 30 list, nearly everything is dog-related, including specific breeds such as rottweiler and Alsatian. The list also includes words that are not animals (kennel), and in one case (to bark), is not a noun at all. The 2-word context window is more sensitive to syntax, while the 30-word window is more sensitive to topic.
14.2.3 Estimation
Word embeddings are estimated by optimizing some objective: the likelihood of a set of unlabeled data (or a closely related quantity), or the reconstruction of a matrix of context counts, similar to Table 14.1.
Maximum likelihood estimation Likelihood-based optimization is derived from the objective log p(w; U), where U ∈ RK × V is matrix of word embeddings, and w = {wm}Mm=1 is a corpus, represented as a list of M tokens. Recurrent neural network lan- guage models (section 6.3) optimize this objective directly, backpropagating to the input word embeddings through the recurrent structure. However, state-of-the-art word em- beddings employ huge corpora with hundreds of billions of tokens, and recurrent archi- tectures are difficult to scale to such data. As a result, likelihood-based word embeddings are usually based on simplified likelihoods or heuristic approximations.
Matrix factorization The matrix C = {count(i,j)} stores the co-occurrence counts of word i and context j. Word representations can be obtained by approximately factoring this matrix, so that count(i, j) is approximated by a function of a word embedding ui and a context embedding vj . These embeddings can be obtained by minimizing the norm of the reconstruction error,
min||C − C ̃ (u, v)||F , [14.1] u,v
where C(u, v) is the approximate reconstruction resulting from the embeddings u and ̃􏰀
v, and ||X||F indicates the Frobenius norm, i,j x2i,j. Rather than factoring the matrix of word-context counts directly, it is often helpful to transform these counts using information- theoretic metrics such as pointwise mutual information (PMI), described in the next sec- tion.
14.3 Latent semantic analysis
Latent semantic analysis (LSA) is one of the oldest approaches to distributed seman- tics (Deerwester et al., 1990). It induces continuous vector representations of words by
Under contract with MIT Press, shared under CC-BY-NC-ND license.

330 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS factoring a matrix of word and context counts, using truncated singular value decompo-
sition (SVD),
min ||C − USV⊤||F U∈RV ×K ,S∈RK×K ,V∈R|C|×K
s.t. U⊤U = I V⊤V = I
∀i ̸= j, Si,j = 0,
[14.2]
[14.3] [14.4] [14.5]
where V is the size of the vocabulary, |C| is the number of contexts, and K is size of the resulting embeddings, which are set equal to the rows of the matrix U. The matrix S is constrained to be diagonal (these diagonal elements are called the singular values), and the columns of the product SV⊤ provide descriptions of the contexts. Each element ci,j is then reconstructed as a bilinear product,
􏰁K k=1
The objective is to minimize the sum of squared approximation errors. The orthonormal- ity constraints U⊤U = V⊤V = I ensure that all pairs of dimensions in U and V are uncorrelated, so that each dimension conveys unique information. Efficient implemen- tations of truncated singular value decomposition are available in numerical computing packages such as SCIPY and MATLAB.2
Latent semantic analysis is most effective when the count matrix is transformed before the application of SVD. One such transformation is pointwise mutual information (PMI; Church and Hanks, 1990), which captures the degree of association between word i and context j,
ci,j ≈
ui,kskvj,k. [14.6]
PMI(i,j) =log p(i,j) = log p(i | j)p(j) = log p(i | j)
p(i)p(j ) p(i)
􏰁V
[14.7] [14.8]
ditional probability of word i in context j to the marginal probability of word i in all 2An important implementation detail is to represent C as a sparse matrix, so that the storage cost is equal
to the number of non-zero entries, rather than the size V × |C|.
Jacob Eisenstein. Draft of October 15, 2018.
p(i)p(j )
= log count(i, j) − log
count(i′, j)
− log 􏰁 count(i, j′) + log 􏰁V 􏰁 count(i′, j′).
i′ =1
j′∈C i′=1 j′∈C
[14.9] The pointwise mutual information can be viewed as the logarithm of the ratio of the con-

14.4. BROWNCLUSTERS
331
evaluation assessment analysis understanding opinion conversation discussion
day year
week month quarter
half
reps representatives
representative rep
accounts people customers individuals employees students
Figure 14.2: Subtrees produced by bottom-up Brown clustering on news text (Miller et al., 2004).
contexts. When word i is statistically associated with context j, the ratio will be greater than one, so PMI(i, j) > 0. The PMI transformation focuses latent semantic analysis on re- constructing strong word-context associations, rather than on reconstructing large counts.
The PMI is negative when a word and context occur together less often than if they were independent, but such negative correlations are unreliable because counts of rare events have high variance. Furthermore, the PMI is undefined when count(i, j) = 0. One solution to these problems is to use the Positive PMI (PPMI),
PPMI(i, j) = 􏰅PMI(i, j), p(i | j) > p(i) [14.10] 0, otherwise.
Bullinaria and Levy (2007) compare a range of matrix transformations for latent se- mantic analysis, using a battery of tasks related to word meaning and word similarity (for more on evaluation, see section 14.6). They find that PPMI-based latent semantic analysis yields strong performance on a battery of tasks related to word meaning: for example, PPMI-based LSA vectors can be used to solve multiple-choice word similarity questions from the Test of English as a Foreign Language (TOEFL), obtaining 85% accuracy.
14.4 Brown clusters
Learning algorithms like perceptron and conditional random fields often perform better with discrete feature vectors. A simple way to obtain discrete representations from dis-
Under contract with MIT Press, shared under CC-BY-NC-ND license.

332
CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
bitstring
011110100111 01111010100
011110101010 011110101011
011110101100 011110101101 011110101110
ten most frequent words
excited thankful grateful stoked pumped anxious hyped psyched exited geeked
talking talkin complaining talkn bitching tlkn tlkin bragging rav- ing +k
thinking thinkin dreaming worrying thinkn speakin reminiscing dreamin daydreaming fantasizing
saying sayin suggesting stating sayn jokin talmbout implying insisting 5’2
wonder dunno wondered duno donno dno dono wonda wounder dunnoe
wondering wonders debating deciding pondering unsure won- derin debatin woundering wondern
sure suree suuure suure sure- surre sures shuree
Table 14.3: Fragment of a Brown clustering of Twitter data (Owoputi et al., 2013). Each row is a leaf in the tree, showing the ten most frequent words. This part of the tree emphasizes verbs of communicating and knowing, especially in the present partici- ple. Each leaf node includes orthographic variants (thinking, thinkin, thinkn), semanti- cally related terms (excited, thankful, grateful), and some outliers (5’2, +k). See http: //www.cs.cmu.edu/ ̃ark/TweetNLP/cluster_viewer.html for more.
tributional statistics is by clustering (subsection 5.1.1), so that words in the same cluster have similar distributional statistics. This can help in downstream tasks, by sharing fea- tures between all words in the same cluster. However, there is an obvious tradeoff: if the number of clusters is too small, the words in each cluster will not have much in common; if the number of clusters is too large, then the learner will not see enough examples from each cluster to generalize.
A solution to this problem is hierarchical clustering: using the distributional statistics to induce a tree-structured representation. Fragments of Brown cluster trees are shown in Figure 14.2 and Table 14.3. Each word’s representation consists of a binary string describ- ing a path through the tree: 0 for taking the left branch, and 1 for taking the right branch. In the subtree in the upper right of the figure, the representation of the word conversation is 10; the representation of the word assessment is 0001. Bitstring prefixes capture similar- ity at varying levels of specificity, and it is common to use the first eight, twelve, sixteen, and twenty bits as features in tasks such as named entity recognition (Miller et al., 2004) and dependency parsing (Koo et al., 2008).
Hierarchical trees can be induced from a likelihood-based objective, using a discrete Jacob Eisenstein. Draft of October 15, 2018.

14.4. BROWNCLUSTERS
333
latent variable ki ∈ {1, 2, . . . , K } to represent the cluster of word i:
􏰁M m=1
log p(w; k) ≈ 􏱕
log p(wm | wm−1; k) logp(wm|kwm)+logp(kwm |kwm−1).
[14.11]
􏰁M m=1
[14.12] This is similar to a hidden Markov model, with the crucial difference that each word can
be emitted from only a single cluster: ∀k ̸= kwm , p(wm | k) = 0.
Using the objective in Equation 14.12, the Brown clustering tree can be constructed from the bottom up: begin with each word in its own cluster, and incrementally merge clusters until only a single cluster remains. At each step, we merge the pair of clusters such that the objective in Equation 14.12 is maximized. Although the objective seems to involve a sum over the entire corpus, the score for each merger can be computed from the cluster-to-cluster co-occurrence counts. These counts can be updated incrementally as the clustering proceeds. The optimal merge at each step can be shown to maximize the average mutual information,
I(k) =
p(k1, k2) × PMI(k1, k2) count(k1, k2)
􏰁K 􏰁K k1=1 k2=1
[14.13]
where p(k1, k2) is the joint probability of a bigram involving a word in cluster k1 followed by a word in k2. This probability and the PMI are both computed from the co-occurrence counts between clusters. After each merger, the co-occurrence vectors for the merged clusters are simply added up, so that the next optimal merger can be found efficiently.
This bottom-up procedure requires iterating over the entire vocabulary, and evaluat- ing Kt2 possible mergers at each step, where Kt is the current number of clusters at step t of the algorithm. Furthermore, computing the score for each merger involves a sum over Kt2 clusters. The maximum number of clusters is K0 = V , which occurs when every word is in its own cluster at the beginning of the algorithm. The time complexity is thus O(V 5).
To avoid this complexity, practical implementations use a heuristic approximation called exchange clustering. The K most common words are placed in clusters of their own at the beginning of the process. We then consider the next most common word, and merge it with one of the existing clusters. This continues until the entire vocabulary has been incorporated, at which point the K clusters are merged down to a single cluster, forming a tree. The algorithm never considers more than K + 1 clusters at any step, and the complexity is O(V K + V log V ), with the second term representing the cost of sorting
Under contract with MIT Press, shared under CC-BY-NC-ND license.
p(k1, k2) =􏰀Kk1′ =1 􏰀Kk2′ =1 count(k1′ , k2′ ),

334
CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
vwm
matrix of word embeddings, and each vm is the context embedding for word wm.
the words at the beginning of the algorithm. For more details on the algorithm, see Liang
(2005).
14.5 Neural word embeddings
Neural word embeddings combine aspects of the previous two methods: like latent se- mantic analysis, they are a continuous vector representation; like Brown clusters, they are trained from a likelihood-based objective. Let the vector ui represent the K-dimensional embedding for word i, and let vj represent the K-dimensional embedding for context j. The inner product ui · vj represents the compatibility between word i and context j. By incorporating this inner product into an approximation to the log-likelihood of a cor- pus, it is possible to estimate both parameters by backpropagation. WORD2VEC (Mikolov et al., 2013) includes two such approximations: continuous bag-of-words (CBOW) and skipgrams.
14.5.1 Continuous bag-of-words (CBOW)
In recurrent neural network language models, each word wm is conditioned on a recurrently- updated state vector, which is based on word representations going all the way back to the beginning of the text. The continuous bag-of-words (CBOW) model is a simplification: the local context is computed as an average of embeddings for words in the immediate neighborhood m − h, m − h + 1, . . . , m + h − 1, m + h,
1 􏰁h
vm = 2h vwm+n + vwm−n . [14.14]
n=1
Thus, CBOW is a bag-of-words model, because the order of the context words does not matter; it is continuous, because rather than conditioning on the words themselves, we condition on a continuous vector constructed from the word embeddings. The parameter h determines the neighborhood size, which Mikolov et al. (2013) set to h = 4.
Jacob Eisenstein. Draft of October 15, 2018.
vwm−2 wm−2
vwm−1 vwm vwm+1 vwm+2 wm−1 wm wm+1 wm+2
U
wm−2 wm−1
(b) Skipgram
wm wm+1 wm+2
(a) Continuous bag-of-words (CBOW)
Figure 14.3: The CBOW and skipgram variants of WORD2VEC. The parameter U is the
U

14.5. NEURALWORDEMBEDDINGS
335
The CBOW model optimizes an approximation to the corpus log-likelihood,
􏰁M
m=1 􏰁M
=
m=1 􏰁M
logp(w)≈
logp(wm |wm−h,wm−h+1,…,wm+h−1,wm+h) exp(uwm ·vm)
[14.15] [14.16] [14.17]
log 􏰀V exp (uj · vm) j=1
􏰁V j=1
M hm
log p(w) ≈ 􏰁 􏰁 log p(wm−n | wm) + log p(wm+n | wm)
exp(uj ·vm). context is predicted from the word, yielding the objective:
uwm ·vm −log
In the CBOW model, words are predicted from their context. In the skipgram model, the
=
m=1
14.5.2 Skipgrams
[14.18] [14.19] [14.20]
In the skipgram approximation, each word is generated multiple times; each time it is conditioned only on a single word. This makes it possible to avoid averaging the word vectors, as in the CBOW model. The local neighborhood size hm is randomly sampled from a uniform categorical distribution over the range {1,2,…,hmax}; Mikolov et al. (2013) set hmax = 10. Because the neighborhood grows outward with h, this approach has the effect of weighting near neighbors more than distant ones. Skipgram performs better on most evaluations than CBOW (see section 14.6 for details of how to evaluate word representations), but CBOW is faster to train (Mikolov et al., 2013).
14.5.3 Computational complexity
The WORD2VEC models can be viewed as an efficient alternative to recurrent neural net- work language models, which involve a recurrent state update whose time complexity is quadratic in the size of the recurrent state vector. CBOW and skipgram avoid this computation, and incur only a linear time complexity in the size of the word and con- text representations. However, all three models compute a normalized probability over word tokens; a na ̈ıve implementation of this probability requires summing over the entire
Under contract with MIT Press, shared under CC-BY-NC-ND license.
=
exp(uwm−n ·vwm) exp(uwm+n ·vwm) log 􏰀V exp(uj ·vwm) +log 􏰀V exp(uj ·vwm)
V m=1 n=1 j=1
m=1 n=1 M hm
􏰁􏰁
j=1 j=1
= 􏰁 􏰁uwm−n ·vwm +uwm+n ·vwm −2log􏰁exp(uj ·vwm).
m=1 n=1 M hm

336
CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
0 12
Ahab
34
σ(u0 · vc)
whale blubber
σ(−u0 ·vc)×σ(u2 ·vc) σ(−u0 ·vc)×σ(−u2 ·vc)
Figure 14.4: A fragment of a hierarchical softmax tree. The probability of each word is
computed as a product of probabilities of local branching decisions in the tree.
vocabulary. The time complexity of this sum is O(V ×K), which dominates all other com- putational costs. There are two solutions: hierarchical softmax, a tree-based computation that reduces the cost to a logarithm of the size of the vocabulary; and negative sampling, an approximation that eliminates the dependence on vocabulary size. Both methods are also applicable to RNN language models.
Hierarchical softmax
In Brown clustering, the vocabulary is organized into a binary tree. Mnih and Hin- ton (2008) show that the normalized probability over words in the vocabulary can be reparametrized as a probability over paths through such a tree. This hierarchical softmax probability is computed as a product of binary decisions over whether to move left or right through the tree, with each binary decision represented as a sigmoid function of the inner product between the context embedding vc and an output embedding associated with the node un,
Pr(left at n | c) =σ(un · vc) [14.21] Pr(right at n | c) =1 − σ(un · vc) = σ(−un · vc), [14.22]
where σ refers to the sigmoid function, σ(x) = 1 . The range of the sigmoid is the 1+exp(−x)
interval (0, 1), and 1 − σ(x) = σ(−x).
As shown in Figure 14.4, the probability of generating each word is redefined as the product of the probabilities across its path. The sum of all such path probabilities is guar- anteed to be one, for any context vector vc ∈ RK . In a balanced binary tree, the depth is logarithmic in the number of leaf nodes, and thus the number of multiplications is equal to O(log V ). The number of non-leaf nodes is equal to O(2V − 1), so the number of pa- rameters to be estimated increases by only a small multiple. The tree can be constructed using an incremental clustering procedure similar to hierarchical Brown clusters (Mnih
Jacob Eisenstein. Draft of October 15, 2018.

14.5. NEURALWORDEMBEDDINGS 337 and Hinton, 2008), or by using the Huffman (1952) encoding algorithm for lossless com-
pression.
Negative sampling
Likelihood-based methods are computationally intensive because each probability must be normalized over the vocabulary. These probabilities are based on scores for each word in each context, and it is possible to design an alternative objective that is based on these scores more directly: we seek word embeddings that maximize the score for the word that was really observed in each context, while minimizing the scores for a set of randomly selected negative samples:
ψ(i,j)=logσ(ui ·vj)+ 􏰁 log(1−σ(ui′ ·vj)), [14.23] i′ ∈Wneg
where ψ(i, j) is the score for word i in context j, and W 􏰀is the set of negative samples. neg
The objective is to maximize the sum over the corpus, Mm=1 ψ(wm, cm), where wm is token m and cm is the associated context.
The set of negative samples Wneg is obtained by sampling from a unigram language
model. Mikolov et al. (2013) construct this unigram language model by exponentiating
3
the empirical word probabilities, setting pˆ(i) ∝ (count(i))4 . This has the effect of redis-
tributing probability mass from common to rare words. The number of negative samples increases the time complexity of training by a constant factor. Mikolov et al. (2013) report that 5-20 negative samples works for small training sets, and that two to five samples suffice for larger corpora.
14.5.4 Word embeddings as matrix factorization
The negative sampling objective in Equation 14.23 can be justified as an efficient approx- imation to the log-likelihood, but it is also closely linked to the matrix factorization ob- jective employed in latent semantic analysis. For a matrix of word-context pairs in which all counts are non-zero, negative sampling is equivalent to factorization of the matrix M, where Mij = PMI(i, j) − log k: each cell in the matrix is equal to the pointwise mutual information of the word and context, shifted by log k, with k equal to the number of neg- ative samples (Levy and Goldberg, 2014). For word-context pairs that are not observed in the data, the pointwise mutual information is −∞, but this can be addressed by consid- ering only PMI values that are greater than log k, resulting in a matrix of shifted positive pointwise mutual information,
Mij =max(0,PMI(i,j)−logk). [14.24]
Word embeddings are obtained by factoring this matrix with truncated singular value decomposition.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

338 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
word 1
love
stock
money development lad
word 2 similarity
sex 6.77 jaguar 0.92 cash 9.15 issue 3.97 brother 4.46
Table 14.4: Subset of the WS-353 (Finkelstein et al., 2002) dataset of word similarity ratings (examples from Faruqui et al. (2016)).
GloVe (“global vectors”) are a closely related approach (Pennington et al., 2014), in which the matrix to be factored is constructed from log co-occurrence counts, Mij = log count(i, j). The word embeddings are estimated by minimizing the sum of squares,
􏱎􏰁V􏰁􏰛􏱎 􏰜2
f(Mij) log Mij − log Mij j=1 j∈C
logMij =ui ·vj +bi + ̃bj, [14.25]
where bi and ̃bj are offsets for word i and context j, which are estimated jointly with the embeddings u and v. The weighting function f(Mij) is set to be zero at Mij = 0, thus avoiding the problem of taking the logarithm of zero counts; it saturates at Mij = mmax, thus avoiding the problem of overcounting common word-context pairs. This heuristic turns out to be critical to the method’s performance.
The time complexity of sparse matrix reconstruction is determined by the number of non-zero word-context counts. Pennington et al. (2014) show that this number grows sublinearly with the size of the dataset: roughly O(N0.8) for typical English corpora. In contrast, the time complexity of WORD2VEC is linear in the corpus size. Computing the co- occurrence counts also requires linear time in the size of the corpus, but this operation can easily be parallelized using MapReduce-style algorithms (Dean and Ghemawat, 2008).
14.6 Evaluating word embeddings
Distributed word representations can be evaluated in two main ways. Intrinsic evalu- ations test whether the representations cohere with our intuitions about word meaning. Extrinsic evaluations test whether they are useful for downstream tasks, such as sequence labeling.
Jacob Eisenstein. Draft of October 15, 2018.
min ̃ u,v,b,b
s.t.

14.6. EVALUATINGWORDEMBEDDINGS 339 14.6.1 Intrinsic evaluations
A basic question for word embeddings is whether the similarity of words i and j is re- flected in the similarity of the vectors ui and uj. Cosine similarity is typically used to compare two word embeddings,
cos(ui,uj)= ui ·uj . [14.26] ||ui||2 × ||uj||2
For any embedding method, we can evaluate whether the cosine similarity of word em- beddings is correlated with human judgments of word similarity. The WS-353 dataset (Finkel- stein et al., 2002) includes similarity scores for 353 word pairs (Table 14.4). To test the accuracy of embeddings for rare and morphologically complex words, Luong et al. (2013) introduce a dataset of “rare words.” Outside of English, word similarity resources are lim- ited, mainly consisting of translations of WS-353 and the related SimLex-999 dataset (Hill
et al., 2015).
Word analogies (e.g., king:queen :: man:woman) have also been used to evaluate word embeddings (Mikolov et al., 2013). In this evaluation, the system is provided with the first three parts of the analogy (i1 : j1 :: i2 :?), and the final element is predicted by finding the wordembeddingmostsimilartoui1 −uj1 +ui2.Anotherevaluationtestswhetherword embeddings are related to broad lexical semantic categories called supersenses (Ciaramita and Johnson, 2003): verbs of motion, nouns that describe animals, nouns that describe body parts, and so on. These supersenses are annotated for English synsets in Word- Net (Fellbaum, 2010). This evaluation is implemented in the QVEC metric, which tests whether the matrix of supersenses can be reconstructed from the matrix of word embed- dings (Tsvetkov et al., 2015).
Levy et al. (2015) compared several dense word representations for English — includ- ing latent semantic analysis, WORD2VEC, and GloVe — using six word similarity metrics and two analogy tasks. None of the embeddings outperformed the others on every task, but skipgrams were the most broadly competitive. Hyperparameter tuning played a key role: any method will perform badly if the wrong hyperparameters are used. Relevant hyperparameters include the embedding size, as well as algorithm-specific details such as the neighborhood size and the number of negative samples.
14.6.2 Extrinsic evaluations
Word representations contribute to downstream tasks like sequence labeling and docu- ment classification by enabling generalization across words. The use of distributed repre- sentations as features is a form of semi-supervised learning, in which performance on a supervised learning problem is augmented by learning distributed representations from unlabeled data (Miller et al., 2004; Koo et al., 2008; Turian et al., 2010). These pre-trained word representations can be used as features in a linear prediction model, or as the input
Under contract with MIT Press, shared under CC-BY-NC-ND license.

340 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
layer in a neural network, such as a Bi-LSTM tagging model (section 7.6). Word repre- sentations can be evaluated by the performance of the downstream systems that consume them: for example, GloVe embeddings are convincingly better than Latent Semantic Anal- ysis as features in the downstream task of named entity recognition (Pennington et al., 2014). Unfortunately, extrinsic and intrinsic evaluations do not always point in the same direction, and the best word representations for one downstream task may perform poorly on another task (Schnabel et al., 2015).
When word representations are updated from labeled data in the downstream task, they are said to be fine-tuned. When labeled data is plentiful, pre-training may be un- necessary; when labeled data is scare, fine-tuning may lead to overfitting. Various com- binations of pre-training and fine-tuning can be employed. Pre-trained embeddings can be used as initialization before fine-tuning, and this can substantially improve perfor- mance (Lample et al., 2016). Alternatively, both fine-tuned and pre-trained embeddings can be used as inputs in a single model (Kim, 2014).
In semi-supervised scenarios, pretrained word embeddings can be replaced by “con- textualized” word representations (Peters et al., 2018). These contextualized represen- tations are set to the hidden states of a deep bi-directional LSTM, which is trained as a bi-directional language model, motivating the name ELMo (embeddings from language models). By running the language model, we obtain contextualized word representa- tions, which can then be used as the base layer in a supervised neural network for any task. This approach yields significant gains over pretrained word embeddings on several tasks, presumably because the contextualized embeddings use unlabeled data to learn how to integrate linguistic context into the base layer of the supervised neural network.
14.6.3 Fairness and bias
Figure 14.1 shows how word embeddings can capture analogies such as man:woman :: king:queen. While king and queen are gender-specific by definition, other professions or titles are associated with genders and other groups merely by statistical tendency. This statistical tendency may be a fact about the world (e.g., professional baseball players are usually men), or a fact about the text corpus (e.g., there are professional basketball leagues for both women and men, but the men’s basketball is written about far more often).
There is now considerable evidence that word embeddings do indeed encode such bi- ases. Bolukbasi et al. (2016) show that the words most aligned with the vector difference she − he are stereotypically female professions homemaker, nurse, receptionist; in the other direction are maestro, skipper, protege. Caliskan et al. (2017) systematize this observation by showing that biases in word embeddings align with well-validated gender stereotypes. Garg et al. (2018) extend these results to ethnic stereotypes of Asian Americans, and pro- vide a historical perspective on how stereotypes evolve over 100 years of text data.
Because word embeddings are the input layer for many other natural language pro- Jacob Eisenstein. Draft of October 15, 2018.

14.7. DISTRIBUTEDREPRESENTATIONSBEYONDDISTRIBUTIONALSTATISTICS341
cessing systems, these findings highlight the risk that natural language processing will replicate and amplify biases in the world, as well as in text. If, for example, word em- beddings encode the belief that women are as unlikely to be programmers as they are to be nephews, then software is unlikely to successfully parse, translate, index, and generate texts in which women do indeed program computers. For example, contemporary NLP systems often fail to properly resolve pronoun references in texts that cut against gender stereotypes (Rudinger et al., 2018; Zhao et al., 2018). (The task of pronoun resolution is described in depth in chapter 15.) Such biases can have profound consequences: for exam- ple, search engines are more likely to yield personalized advertisements for public arrest records when queried with names that are statistically associated with African Ameri- cans (Sweeney, 2013). There is now an active research literature on “debiasing” machine learning and natural language processing, as evidenced by the growth of annual meet- ings such as Fairness, Accountability, and Transparency in Machine Learning (FAT/ML). However, given that the ultimate source of these biases is the text itself, it may be too much to hope for a purely algorithmic solution. There is no substitute for critical thought about the inputs to natural language processing systems – and the uses of their outputs.
14.7 Distributed representations beyond distributional statistics
Distributional word representations can be estimated from huge unlabeled datasets, thereby covering many words that do not appear in labeled data: for example, GloVe embeddings are estimated from 800 billion tokens of web data,3 while the largest labeled datasets for NLP tasks are on the order of millions of tokens. Nonetheless, even a dataset of hundreds of billions of tokens will not cover every word that may be encountered in the future. Furthermore, many words will appear only a few times, making their embeddings un- reliable. Many languages exceed English in morphological complexity, and thus have lower token-to-type ratios. When this problem is coupled with small training corpora, it becomes especially important to leverage other sources of information beyond distribu- tional statistics.
14.7.1 Word-internal structure
One solution is to incorporate word-internal structure into word embeddings. Purely distributional approaches consider words as atomic units, but in fact, many words have internal structure, so that their meaning can be composed from the representations of sub-word units. Consider the following terms, all of which are missing from Google’s pre-trained WORD2VEC embeddings:4
3 http://commoncrawl.org/ 4https://code.google.com/archive/p/word2vec/, accessed September 20, 2017
Under contract with MIT Press, shared under CC-BY-NC-ND license.

342
CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
u(M) milli+
umillicuries u(M)
curie
u ̃ millicuries u(M)
+s
umillicuries umillicurie
(M) (M) (M) umilli+ ucurie u+s
Figure 14.5: Two architectures for building word embeddings from subword units. On the left, morpheme embeddings u(m) are combined by addition with the non-compositional word embedding u ̃ (Botha and Blunsom, 2014). On the right, morpheme embeddings are combined in a recursive neural network (Luong et al., 2013).
millicuries Thiswordhasmorphologicalstructure(seesubsection9.1.2formoreonmor- phology): the prefix milli- indicates an amount, and the suffix -s indicates a plural. (A millicurie is an unit of radioactivity.)
caesium This word is a single morpheme, but the characters -ium are often associated with chemical elements. (Caesium is the British spelling of a chemical element, spelled cesium in American English.)
IAEA This term is an acronym, as suggested by the use of capitalization. The prefix I- fre- quently refers to international organizations, and the suffix -A often refers to agen- cies or associations. (IAEA is the International Atomic Energy Agency.)
Zhezhgan This term is in title case, suggesting the name of a person or place, and the character bigram zh indicates that it is likely a transliteration. (Zhezhgan is a mining facility in Kazakhstan.)
How can word-internal structure be incorporated into word representations? One approach is to construct word representations from embeddings of the characters or mor- phemes. For example, if word i has morphological segments Mi, then its embedding can be constructed by addition (Botha and Blunsom, 2014),
ui = u ̃i + 􏰁 u(M), [14.27] j
j ∈Mi
where u(M) is a morpheme embedding and u ̃ is a non-compositional embedding of the
mi
whole word, which is an additional free parameter of the model (Figure 14.5, left side). All embeddings are estimated from a log-bilinear language model (Mnih and Hinton, 2007), which is similar to the CBOW model (section 14.5), but includes only contextual information from preceding words. The morphological segments are obtained using an unsupervised segmenter (Creutz and Lagus, 2007). For words that do not appear in the
Jacob Eisenstein. Draft of October 15, 2018.

14.7. DISTRIBUTEDREPRESENTATIONSBEYONDDISTRIBUTIONALSTATISTICS343
training data, the embedding can be constructed directly from the morphemes, assuming that each morpheme appears in some other word in the training data. The free param- eter u ̃ adds flexibility: words with similar morphemes are encouraged to have similar embeddings, but this parameter makes it possible for them to be different.
Word-internal structure can be incorporated into word representations in various other ways. Here are some of the main parameters.
Subword units. Examples like IAEA and Zhezhgan are not based on morphological com- position, and a morphological segmenter is unlikely to identify meaningful sub- word units for these terms. Rather than using morphemes for subword embeddings, one can use characters (Santos and Zadrozny, 2014; Ling et al., 2015; Kim et al., 2016), character n-grams (Wieting et al., 2016a; Bojanowski et al., 2017), and byte-pair en- codings, a compression technique which captures frequent substrings (Gage, 1994; Sennrich et al., 2016).
Composition. Combining the subword embeddings by addition does not differentiate between orderings, nor does it identify any particular morpheme as the root. A range of more flexible compositional models have been considered, including re- currence (Ling et al., 2015), convolution (Santos and Zadrozny, 2014; Kim et al., 2016), and recursive neural networks (Luong et al., 2013), in which representa- tions of progressively larger units are constructed over a morphological parse, e.g. ((milli+curie)+s), ((in+flam)+able), (in+(vis+ible)). A recursive embedding model is shown in the right panel of Figure 14.5.
Estimation. Estimating subword embeddings from a full dataset is computationally ex- pensive. An alternative approach is to train a subword model to match pre-trained word embeddings (Cotterell et al., 2016; Pinter et al., 2017). To train such a model, it is only necessary to iterate over the vocabulary, and the not the corpus.
14.7.2 Lexical semantic resources
Resources such as WordNet provide another source of information about word meaning: if we know that caesium is a synonym of cesium, or that a millicurie is a type of measurement unit, then this should help to provide embeddings for the unknown words, and to smooth embeddings of rare words. One way to do this is to retrofit pre-trained word embeddings across a network of lexical semantic relationships (Faruqui et al., 2015) by minimizing the following objective,
􏰁V 􏰁
min ||ui − uˆi||2 + βij ||ui − uj ||2, [14.28]
U j=1 (i,j)∈L
Under contract with MIT Press, shared under CC-BY-NC-ND license.

344 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
where uˆi is the pretrained embedding of word i, and L = {(i,j)} is a lexicon of word relations. The hyperparameter βij controls the importance of adjacent words having similar embeddings; Faruqui et al. (2015) set it to the inverse of the degree of word i, βij = |{j : (i, j) ∈ L}|−1. Retrofitting improves performance on a range of intrinsic evalu- ations, and gives small improvements on an extrinsic document classification task.
14.8 Distributed representations of multiword units
Can distributed representations extend to phrases, sentences, paragraphs, and beyond? Before exploring this possibility, recall the distinction between distributed and distri- butional representations. Neural embeddings such as WORD2VEC are both distributed (vector-based) and distributional (derived from counts of words in context). As we con- sider larger units of text, the counts decrease: in the limit, a multi-paragraph span of text would never appear twice, except by plagiarism. Thus, the meaning of a large span of text cannot be determined from distributional statistics alone; it must be computed com- positionally from smaller spans. But these considerations are orthogonal to the question of whether distributed representations — dense numerical vectors — are sufficiently ex- pressive to capture the meaning of phrases, sentences, and paragraphs.
14.8.1 Purely distributional methods
Some multiword phrases are non-compositional: the meaning of such phrases is not de- rived from the meaning of the individual words using typical compositional semantics. This includes proper nouns like San Francisco as well as idiomatic expressions like kick the bucket (Baldwin and Kim, 2010). For these cases, purely distributional approaches can work. A simple approach is to identify multiword units that appear together fre- quently, and then treat these units as words, learning embeddings using a technique such as WORD2VEC.
The problem of identifying multiword units is sometimes called collocation extrac- tion. A good collocation has high pointwise mutual information (PMI; see section 14.3). For example, Na ̈ıve Bayes is a good collocation because p(wt = Bayes | wt−1 = na ̈ıve) is much larger than p(wt = Bayes). Collocations of more than two words can be identified by a greedy incremental search: for example, mutual information might first be extracted as a collocation and grouped into a single word type mutual information; then pointwise mutual information can be extracted later. After identifying such units, they can be treated as words when estimating skipgram embeddings. Mikolov et al. (2013) show that the re- sulting embeddings perform reasonably well on a task of solving phrasal analogies, e.g. New York : New York Times :: Baltimore : Baltimore Sun.
Jacob Eisenstein. Draft of October 15, 2018.

14.8. DISTRIBUTEDREPRESENTATIONSOFMULTIWORDUNITS 345
this was the only way
it was the only way
it was her turn to blink
it was hard to tell
it was time to move on
he had to do it again
they all looked at each other
they all turned to look back
they both turned to face him
they both turned and walked away
Figure 14.6: By interpolating between the distributed representations of two sentences (in bold), it is possible to generate grammatical sentences that combine aspects of both (Bow- man et al., 2016)
14.8.2 Distributional-compositional hybrids
To move beyond short multiword phrases, composition is necessary. A simple but sur- prisingly powerful approach is to represent a sentence with the average of its word em- beddings (Mitchell and Lapata, 2010). This can be considered a hybrid of the distribu- tional and compositional approaches to semantics: the word embeddings are computed distributionally, and then the sentence representation is computed by composition.
The WORD2VEC approach can be stretched considerably further, embedding entire sentences using a model similar to skipgrams, in the “skip-thought” model of Kiros et al. (2015). Each sentence is encoded into a vector using a recurrent neural network: the encod-
ing of sentence t is set to the RNN hidden state at its final token, h(t) . This vector is then Mt
a parameter in a decoder model that is used to generate the previous and subsequent sen- tences: the decoder is another recurrent neural network, which takes the encoding of the neighboring sentence as an additional parameter in its recurrent update. (This encoder- decoder model is discussed at length in chapter 18.) The encoder and decoder are trained simultaneously from a likelihood-based objective, and the trained encoder can be used to compute a distributed representation of any sentence. Skip-thought can also be viewed as a hybrid of distributional and compositional approaches: the vector representation of each sentence is computed compositionally from the representations of the individual words, but the training objective is distributional, based on sentence co-occurrence across a corpus.
Autoencoders are a variant of encoder-decoder models in which the decoder is trained to produce the same text that was originally encoded, using only the distributed encod- ing vector (Li et al., 2015). The encoding acts as a bottleneck, so that generalization is necessary if the model is to successfully fit the training data. In denoising autoencoders,
Under contract with MIT Press, shared under CC-BY-NC-ND license.

346 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
the input is a corrupted version of the original sentence, and the auto-encoder must re- construct the uncorrupted original (Vincent et al., 2010; Hill et al., 2016). By interpolating betweendistributedrepresentationsoftwosentences,αui+(1−α)uj,itispossibletogen- erate sentences that combine aspects of the two inputs, as shown in Figure 14.6 (Bowman et al., 2016).
Autoencoders can also be applied to longer texts, such as paragraphs and documents. This enables applications such as question answering, which can be performed by match- ing the encoding of the question with encodings of candidate answers (Miao et al., 2016).
14.8.3 Supervised compositional methods
Given a supervision signal, such as a label describing the sentiment or meaning of a sen- tence, a wide range of compositional methods can be applied to compute a distributed representation that then predicts the label. The simplest is to average the embeddings of each word in the sentence, and pass this average through a feedforward neural net- work (Iyyer et al., 2015). Convolutional and recurrent neural networks go further, with the ability to effectively capturing multiword phenomena such as negation (Kalchbrenner et al., 2014; Kim, 2014; Li et al., 2015; Tang et al., 2015). Another approach is to incorpo- rate the syntactic structure of the sentence into a recursive neural network, in which the representation for each syntactic constituent is computed from the representations of its children (Socher et al., 2012). However, in many cases, recurrent neural networks perform as well or better than recursive networks (Li et al., 2015).
Whether convolutional, recurrent, or recursive, a key question is whether supervised sentence representations are task-specific, or whether a single supervised sentence repre- sentation model can yield useful performance on other tasks. Wieting et al. (2016b) train a variety of sentence embedding models for the task of labeling pairs of sentences as para- phrases. They show that the resulting sentence embeddings give good performance for sentiment analysis. The Stanford Natural Language Inference corpus classifies sentence pairs as entailments (the truth of sentence i implies the truth of sentence j), contradictions (the truth of sentence i implies the falsity of sentence j), and neutral (i neither entails nor contradicts j). Sentence embeddings trained on this dataset transfer to a wide range of classification tasks (Conneau et al., 2017).
14.8.4 Hybrid distributed-symbolic representations
The power of distributed representations is in their generality: the distributed represen- tation of a unit of text can serve as a summary of its meaning, and therefore as the input for downstream tasks such as classification, matching, and retrieval. For example, dis- tributed sentence representations can be used to recognize the paraphrase relationship between closely related sentences like the following:
Jacob Eisenstein. Draft of October 15, 2018.

14.8. DISTRIBUTEDREPRESENTATIONSOFMULTIWORDUNITS 347
(14.5) a.
b. Donald conveyed to Vlad his profound appreciation.
c. Vlad was showered with gratitude by Donald.
Symbolic representations are relatively brittle to this sort of variation, but are better suited to describe individual entities, the things that they do, and the things that are done to them. In examples (14.5a)-(14.5c), we not only know that somebody thanked someone else, but we can make a range of inferences about what has happened between the entities named Donald and Vlad. Because distributed representations do not treat entities symbol- ically, they lack the ability to reason about the roles played by entities across a sentence or larger discourse.5 A hybrid between distributed and symbolic representations might give the best of both worlds: robustness to the many different ways of describing the same event, plus the expressiveness to support inferences about entities and the roles that they play.
A “top-down” hybrid approach is to begin with logical semantics (of the sort de- scribed in the previous two chapters), and but replace the predefined lexicon with a set of distributional word clusters (Poon and Domingos, 2009; Lewis and Steedman, 2013). A “bottom-up” approach is to add minimal symbolic structure to existing distributed repre- sentations, such as vector representations for each entity (Ji and Eisenstein, 2015; Wiseman et al., 2016). This has been shown to improve performance on two problems that we will encounter in the following chapters: classification of discourse relations between adja- cent sentences (chapter 16; Ji and Eisenstein, 2015), and coreference resolution of entity mentions (chapter 15; Wiseman et al., 2016; Ji et al., 2017). Research on hybrid seman- tic representations is still in an early stage, and future representations may deviate more boldly from existing symbolic and distributional approaches.
Additional resources
Turney and Pantel (2010) survey a number of facets of vector word representations, fo- cusing on matrix factorization methods. Schnabel et al. (2015) highlight problems with similarity-based evaluations of word embeddings, and present a novel evaluation that controls for word frequency. Baroni et al. (2014) address linguistic issues that arise in attempts to combine distributed and compositional representations.
In bilingual and multilingual distributed representations, embeddings are estimated for translation pairs or tuples, such as (dog, perro, chien). These embeddings can improve machine translation (Zou et al., 2013; Klementiev et al., 2012), transfer natural language
5At a 2014 workshop on semantic parsing, this critique of distributed representations was expressed by Ray Mooney — a leading researcher in computational semantics — in a now well-known quote, “you can’t cram the meaning of a whole sentence into a single vector!”
Under contract with MIT Press, shared under CC-BY-NC-ND license.
Donald thanked Vlad profusely.

348 CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS
processingmodelsacrosslanguages(Ta ̈ckstro ̈metal.,2012),andmakemonolingualword embeddings more accurate (Faruqui and Dyer, 2014). A typical approach is to learn a pro- jection that maximizes the correlation of the distributed representations of each element in a translation pair, which can be obtained from a bilingual dictionary. Distributed rep- resentations can also be linked to perceptual information, such as image features. Bruni et al. (2014) use textual descriptions of images to obtain visual contextual information for various words, which supplements traditional distributional context. Image features can also be inserted as contextual information in log bilinear language models (Kiros et al., 2014), making it possible to automatically generate text descriptions of images.
Exercises
1. Prove that the sum of probabilities of paths through a hierarchical softmax tree is equal to one.
2. In skipgram word embeddings, the negative sampling objective can be written as, L = 􏰁 􏰁 count(i, j)ψ(i, j), [14.29]
i∈V j∈C with ψ(i, j) is defined in Equation 14.23.
Suppose we draw the negative samples from the empirical unigram distribution pˆ(i) = punigram(i). First, compute the expectation of L with respect the negative samples, using this probability.
Next, take the derivative of this expectation with respect to the score of a single word context pair σ(ui ·vj ), and solve for the pointwise mutual information PMI(i, j). You should be able to show that at the optimum, the PMI is a simple function of σ(ui ·vj) and the number of negative samples.
(This exercise is part of a proof that shows that skipgram with negative sampling is closely related to PMI-weighted matrix factorization.)
3. * In Brown clustering, prove that the cluster merge that maximizes the average mu- tual information (Equation 14.13) also maximizes the log-likelihood objective (Equa- tion 14.12).
4. A simple way to compute a distributed phrase representation is to add up the dis- tributed representations of the words in the phrase. Consi􏰀der a sentiment analysis model in which the predicted sentiment is, ψ(w) = θ · ( Mm=1 xm), where xm is the vector representation of word m. Prove that in such a model, the following two
Jacob Eisenstein. Draft of October 15, 2018.

14.8. DISTRIBUTEDREPRESENTATIONSOFMULTIWORDUNITS
349
inequalities cannot both hold:
ψ(good) >ψ(not good) ψ(bad) <ψ(not bad). [14.30] [14.31] Then construct a similar example pair for the case in which phrase representations are the average of the word representations. 5. Now let’s consider a slight modification to the prediction model in the previous problem: For the next two problems, download a set of pre-trained word embeddings, such as the WORD2VEC or polyglot embeddings. 6. Use cosine similarity to find the most similar words to: dog, whale, before, however, fabricate. 7. Use vector addition and subtraction to compute target vectors for the analogies be- low. After computing each target vector, find the top three candidates by cosine similarity. • dog:puppy :: cat: ? • speak:speaker :: sing:? • France:French :: England:? • France:wine :: England:? The remaining problems will require you to build a classifier and test its properties. Pick a text classification dataset, such as the Cornell Movie Review data.6 Divide your data into training (60%), development (20%), and test sets (20%), if no such division already exists. 8. Train a convolutional neural network, with inputs set to pre-trained word embed- dings from the previous two problems. Use an additional, fine-tuned embedding for out-of-vocabulary words. Train until performance on the development set does not improve. You can also use the development set to tune the model architecture, such as the convolution width and depth. Report F -MEASURE and accuracy, as well as training time. 6 http://www.cs.cornell.edu/people/pabo/movie-review-data/ Under contract with MIT Press, shared under CC-BY-NC-ND license. ψ(w) = θ · ReLU( Show that in this case, it is possible to achieve the inequalities above. Your solution xm) [14.32] should provide the weights θ and the embeddings xgood, xbad, and xnot. 􏰁M m=1 350 9. 10. CHAPTER14. DISTRIBUTIONALANDDISTRIBUTEDSEMANTICS Now modify your model from the previous problem to fine-tune the word embed- dings. Report F -MEASURE, accuracy, and training time. Try a simpler approach, in which word embeddings in the document are averaged, and then this average is passed through a feed-forward neural network. Again, use the development data to tune the model architecture. How close is the accuracy to the convolutional networks from the previous problems? Jacob Eisenstein. Draft of October 15, 2018. Chapter 15 Reference Resolution References are one of the most noticeable forms of linguistic ambiguity, afflicting not just automated natural language processing systems, but also fluent human readers. Warn- ings to avoid “ambiguous pronouns” are ubiquitous in manuals and tutorials on writing style. But referential ambiguity is not limited to pronouns, as shown in the text in Fig- ure 15.1. Each of the bracketed substrings refers to an entity that is introduced earlier in the passage. These references include the pronouns he and his, but also the shortened name Cook, and nominals such as the firm and the firm’s biggest growth market. Reference resolution subsumes several subtasks. This chapter will focus on corefer- ence resolution, which is the task of grouping spans of text that refer to a single underly- ing entity, or, in some cases, a single event: for example, the spans Tim Cook, he, and Cook are all coreferent. These individual spans are called mentions, because they mention an entity; the entity is sometimes called the referent. Each mention has a set of antecedents, which are preceding mentions that are coreferent; for the first mention of an entity, the an- tecedent set is empty. The task of pronominal anaphora resolution requires identifying only the antecedents of pronouns. In entity linking, references are resolved not to other spans of text, but to entities in a knowledge base. This task is discussed in chapter 17. Coreference resolution is a challenging problem for several reasons. Resolving differ- ent types of referring expressions requires different types of reasoning: the features and methods that are useful for resolving pronouns are different from those that are useful to resolve names and nominals. Coreference resolution involves not only linguistic rea- soning, but also world knowledge and pragmatics: you may not have known that China was Apple’s biggest growth market, but it is likely that you effortlessly resolved this ref- erencewhilereadingthepassageinFigure15.1.1 Afurtherchallengeisthatcoreference 1This interpretation is based in part on the assumption that a cooperative author would not use the expression the firm’s biggest growth market to refer to an entity not yet mentioned in the article (Grice, 1975). Pragmatics is the discipline of linguistics concerned with the formalization of such assumptions (Huang, 351 352 CHAPTER15. REFERENCERESOLUTION (15.1) [[Apple Inc] Chief Executive Tim Cook] has jetted into [China] for talks with government officials as [he] seeks to clear up a pile of problems in [[the firm] ’s biggest growth market] ... [Cook] is on [his] first trip to [the country] since taking over... Figure 15.1: Running example (Yee and Jones, 2012). Coreferring entity mentions are in brackets. resolution decisions are often entangled: each mention adds information about the entity, which affects other coreference decisions. This means that coreference resolution must be addressed as a structure prediction problem. But as we will see, there is no dynamic program that allows the space of coreference decisions to be searched efficiently. 15.1 Forms of referring expressions There are three main forms of referring expressions — pronouns, names, and nominals. 15.1.1 Pronouns Pronouns are a closed class of words that are used for references. A natural way to think about pronoun resolution is SMASH (Kehler, 2007): • Search for candidate antecedents; • Match against hard agreement constraints; • And Select using Heuristics, which are “soft” constraints such as recency, syntactic prominence, and parallelism. Search In the search step, candidate antecedents are identified from the preceding text or speech.2 Any noun phrase can be a candidate antecedent, and pronoun resolution usually requires 2015). 2Pronouns whose referents come later are known as cataphora, as in the opening line from a novel by Ma ́rquez (1970): (15.1) Manyyearslater,as[he]facedthefiringsquad,[ColonelAurelianoBuend ́ıa]wastorememberthat distant afternoon when [his] father took him to discover ice. Jacob Eisenstein. Draft of October 15, 2018. 15.1. FORMSOFREFERRINGEXPRESSIONS 353 parsing the text to identify all such noun phrases.3 Filtering heuristics can help to prune the search space to noun phrases that are likely to be coreferent (Lee et al., 2013; Durrett and Klein, 2013). In nested noun phrases, mentions are generally considered to be the largest unit with a given head word (see subsection 10.5.2): thus, Apple Inc. Chief Executive Tim Cook would be included as a mention, but Tim Cook would not, since they share the same head word, Cook. Matching constraints for pronouns References and their antecedents must agree on semantic features such as number, person, gender, and animacy. Consider the pronoun he in this passage from the running example: (15.2) Tim Cook has jetted in for talks with officials as [he] seeks to clear up a pile of problems... The pronoun and possible antecedents have the following features: • he: singular, masculine, animate, third person • officials: plural, animate, third person • talks: plural, inanimate, third person • Tim Cook: singular, masculine, animate, third person The SMASH method searches backwards from he, discarding officials and talks because they do not satisfy the agreements constraints. Another source of constraints comes from syntax — specifically, from the phrase struc- ture trees discussed in chapter 10. Consider a parse tree in which both x and y are phrasal constituents. The constituent x c-commands the constituent y iff the first branching node above x also dominates y. For example, in Figure 15.2a, Abigail c-commands her, because the first branching node above Abigail, S, also dominates her. Now, if x c-commands y, government and binding theory (Chomsky, 1982) states that y can refer to x only if it is a reflexive pronoun (e.g., herself). Furthermore, if y is a reflexive pronoun, then its an- tecedent must c-command it. Thus, in Figure 15.2a, her cannot refer to Abigail; conversely, if we replace her with herself, then the reflexive pronoun must refer to Abigail, since this is the only candidate antecedent that c-commands it. Now consider the example shown in Figure 15.2b. Here, Abigail does not c-command her, but Abigail’s mom does. Thus, her can refer to Abigail — and we cannot use reflexive 3In the OntoNotes coreference annotations, verbs can also be antecedents, if they are later referenced by nominals (Pradhan et al., 2011): (15.1) Sales of passenger cars [grew] 22%. [The strong growth] followed year-to-year increases. Under contract with MIT Press, shared under CC-BY-NC-ND license. 354 CHAPTER15. REFERENCERESOLUTION S Abigail speaks (a) S NP Abigail S V VP NP VP NP VP PP S VP PP Abigail ’s mom speaks (b) hopes NP with her with her she speaks with her (c) Figure 15.2: In (a), Abigail c-commands her; in (b), Abigail does not c-command her, but Abigail’s mom does; in (c), the scope of Abigail is limited by the S non-terminal, so that she or her can bind to Abigail, but not both. herself in this context, unless we are talking about Abigail’s mom. However, her does not have to refer to Abigail. Finally, Figure 15.2c shows the how these constraints are limited. In this case, the pronoun she can refer to Abigail, because the S non-terminal puts Abigail outside the domain of she. Similarly, her can also refer to Abigail. But she and her cannot be coreferent, because she c-commands her. Heuristics After applying constraints, heuristics are applied to select among the remaining candi- dates. Recency is a particularly strong heuristic. All things equal, readers will prefer the more recent referent for a given pronoun, particularly when comparing referents that occur in different sentences. Jurafsky and Martin (2009) offer the following example: (15.3) The doctor found an old map in the captain’s chest. Jim found an even older map hidden on the shelf. [It] described an island. Readers are expected to prefer the older map as the referent for the pronoun it. However, subjects are often preferred over objects, and this can contradict the prefer- ence for recency when two candidate referents are in the same sentence. For example, (15.4) Abigail loaned Lucia a book on Spanish. [She] is always trying to help people. Here, we may prefer to link she to Abigail rather than Lucia, because of Abigail’s position in the subject role of the preceding sentence. (Arguably, this preference would not be strong enough to select Abigail if the second sentence were She is visiting Valencia next month.) A third heuristic is parallelism: (15.5)AbigailloanedLuciaabookonSpanish.O ̈zlemloaned[her]abookonPor- tuguese. Jacob Eisenstein. Draft of October 15, 2018. 15.1. FORMSOFREFERRINGEXPRESSIONS 355 S NP VP NP DETNNPP VBD NP PP CD The castle IN NP remained DET IN until NP in NNP Camelot the NN PP SBAR S residence IN NP 536 WHP of DET NN the king when NP VP PRP VBD NP1 PP he moved PRP TO NP it to NNP London Figure 15.3: Left-to-right breadth-first tree traversal (Hobbs, 1978), indicating that the search for an antecedent for it (NP1) would proceed in the following order: 536; the castle in Camelot; the residence of the king; Camelot; the king. Hobbs (1978) proposes semantic con- straints to eliminate 536 and the castle in Camelot as candidates, since they are unlikely to be the direct object of the verb move. Here Lucia is preferred as the referent for her, contradicting the preference for the subject Abigail in the preceding example. The recency and subject role heuristics can be unified by traversing the document in a syntax-driven fashion (Hobbs, 1978): each preceding sentence is traversed breadth-first, left-to-right (Figure 15.3). This heuristic successfully handles (15.4): Abigail is preferred as the referent for she because the subject NP is visited first. It also handles (15.3): the older map is preferred as the referent for it because the more recent sentence is visited first. (An alternative unification of recency and syntax is proposed by centering theory (Grosz et al., 1995), which is discussed in detail in chapter 16.) In early work on reference resolution, the number of heuristics was small enough that a set of numerical weights could be set by hand (Lappin and Leass, 1994). More recent work uses machine learning to quantify the importance of each of these factors. However, pronoun resolution cannot be completely solved by constraints and heuristics alone. This is shown by the classic example pair (Winograd, 1972): (15.6) The [city council] denied [the protesters] a permit because [they] advocated/feared violence. Without reasoning about the motivations of the city council and protesters, it is unlikely that any system could correctly resolve both versions of this example. Under contract with MIT Press, shared under CC-BY-NC-ND license. 356 CHAPTER15. REFERENCERESOLUTION Non-referential pronouns While pronouns are generally used for reference, they need not refer to entities. The fol- lowing examples show how pronouns can refer to propositions, events, and speech acts. (15.7) a. b. c. They told me that I was too ugly for show business, but I didn’t believe [it]. Elifsu saw Berthold get angry, and I saw [it] too. Emmanuel said he worked in security. I suppose [that]’s one way to put it. These forms of reference are generally not annotated in large-scale coreference resolution datasets such as OntoNotes (Pradhan et al., 2011). Pronouns may also have generic referents: (15.8) a. b. c. A poor carpenter blames [her] tools. On the moon, [you] have to carry [your] own oxygen. Every farmer who owns a donkey beats [it]. (Geach, 1962) In the OntoNotes dataset, coreference is not annotated for generic referents, even in cases like these examples, in which the same generic entity is mentioned multiple times. Some pronouns do not refer to anything at all: (15.9) a. b. c. [It]’s raining. [Il] pleut. (Fr) [It] ’s money that she’s really after. [It] is too bad that we have to work so hard. How can we automatically distinguish these usages of it from referential pronouns? Consider the the difference between the following two examples (Bergsma et al., 2008): (15.10) a. You can make [it] in advance. b. You can make [it] in showbiz. In the second example, the pronoun it is non-referential. One way to see this is by substi- tuting another pronoun, like them, into these examples: (15.11) a. You can make [them] in advance. b. ? You can make [them] in showbiz. The questionable grammaticality of the second example suggests that it is not referential. Bergsma et al. (2008) operationalize this idea by comparing distributional statistics for the n-grams around the word it, testing how often other pronouns or nouns appear in the same context. In cases where nouns and other pronouns are infrequent, the it is unlikely to be referential. Jacob Eisenstein. Draft of October 15, 2018. 15.1. FORMSOFREFERRINGEXPRESSIONS 357 15.1.2 Proper Nouns If a proper noun is used as a referring expression, it often corefers with another proper noun, so that the coreference problem is simply to determine whether the two names match. Subsequent proper noun references often use a shortened form, as in the running example (Figure 15.1): (15.12) Apple Inc Chief Executive [Tim Cook] has jetted into China . . . [Cook] is on his first business trip to the country . . . A typical solution for proper noun coreference is to match the syntactic head words of the reference with the referent. In subsection 10.5.2, we saw that the head word of a phrase can be identified by applying head percolation rules to the phrasal parse tree; alternatively, the head can be identified as the root of the dependency subtree covering the name. For sequences of proper nouns, the head word will be the final token. There are a number of caveats to the practice of matching head words of proper nouns. • In the European tradition, family names tend to be more specific than given names, and family names usually come last. However, other traditions have other practices: for example, in Chinese names, the family name typically comes first; in Japanese, honorifics come after the name, as in Nobu-San (Mr. Nobu). • Inorganizationnames,theheadwordisoftennotthemostinformative,asinGeorgia Tech and Virginia Tech. Similarly, Lebanon does not refer to the same entity as South- ern Lebanon, necessitating special rules for the specific case of geographical modi- fiers (Lee et al., 2011). • Proper nouns can be nested, as in [the CEO of [Microsoft]], resulting in head word match without coreference. Despite these difficulties, proper nouns are the easiest category of references to re- solve (Stoyanov et al., 2009). In machine learning systems, one solution is to include a range of matching features, including exact match, head match, and string inclusion. In addition to matching features, competitive systems (e.g., Bengtson and Roth, 2008) in- clude large lists, or gazetteers, of acronyms (e.g, the National Basketball Association/NBA), demonyms (e.g., the Israelis/Israel), and other aliases (e.g., the Georgia Institute of Technol- ogy/Georgia Tech). 15.1.3 Nominals In coreference resolution, noun phrases that are neither pronouns nor proper nouns are referred to as nominals. In the running example (Figure 15.1), nominal references include: the firm (Apple Inc); the firm’s biggest growth market (China); and the country (China). Under contract with MIT Press, shared under CC-BY-NC-ND license. 358 CHAPTER15. REFERENCERESOLUTION Nominals are especially difficult to resolve (Denis and Baldridge, 2007; Durrett and Klein, 2013), and the examples above suggest why this may be the case: world knowledge is required to identify Apple Inc as a firm, and China as a growth market. Other difficult examples include the use of colloquial expressions, such as coreference between Clinton campaign officials and the Clinton camp (Soon et al., 2001). 15.2 Algorithms for coreference resolution The ground truth training data for coreference resolution is a set of mention sets, where all mentionswithineachsetrefertoasingleentity.4 IntherunningexamplefromFigure15.1, the ground truth coreference annotation is: c1 ={Apple Inc1:2, the firm27:28} c2 ={Apple Inc Chief Executive Tim Cook1:6, he17, Cook33, his36} c3 ={China10, the firm ’s biggest growth market27:32, the country40:41} [15.1] [15.2] [15.3] Each row specifies the token spans that mention an entity. (“Singleton” entities, which are mentioned only once (e.g., talks, government officials), are excluded from the annotations.) Equivalently, if given a set of M mentions, {mi}Mi=1, each mention i can be assigned to a cluster zi, where zi = zj if i and j are coreferent. The cluster assignments z are invariant under permutation. The unique clustering associated with the assignment z is written c(z). Coreference resolution can thus be viewed as a structure prediction problem, involv- ing two subtasks: identifying which spans of text mention entities, and then clustering those spans. Mention identification The task of identifying mention spans for coreference resolution is often performed by applying a set of heuristics to the phrase structure parse of each sentence. A typical approach is to start with all noun phrases and named entities, and then apply filtering rules to remove nested noun phrases with the same head (e.g., [Apple CEO [Tim Cook]]), numeric entities (e.g., [100 miles], [97%]), non-referential it, etc (Lee et al., 2013; Durrett and Klein, 2013). In general, these deterministic approaches err in favor of recall, since the mention clustering component can choose to ignore false positive mentions, but cannot recover from false negatives. An alternative is to consider all spans (up to some finite length) as candidate mentions, performing mention identification and clustering jointly (Daume ́ III and Marcu, 2005; Lee et al., 2017). 4In many annotations, the term markable is used to refer to spans of text that can potentially mention an entity. The set of markables includes non-referential pronouns, which does not mention any entity. Part of the job of the coreference system is to avoid incorrectly linking these non-referential markables to any mention chains. Jacob Eisenstein. Draft of October 15, 2018. 15.2. ALGORITHMSFORCOREFERENCERESOLUTION 359 Mention clustering The subtask of mention clustering will be the focus of the remainder of this chapter. There are two main classes of models. In mention-based models, the scoring function for a coreference clustering decomposes over pairs of mentions. These pairwise decisions are then aggregated, using a clustering heuristic. Mention-based coreference clustering can be treated as a fairly direct application of supervised classification or rank- ing. However, the mention-pair locality assumption can result in incoherent clusters, like {Hillary Clinton ← Clinton ← Mr Clinton}, in which the pairwise links score well, but the overall result is unsatisfactory. Entity-based models address this issue by scoring entities holistically. This can make inference more difficult, since the number of possible entity groupings is exponential in the number of mentions. 15.2.1 Mention-pair models Inthemention-pairmodel,abinarylabelyi,j ∈{0,1}isassignedtoeachpairofmentions (i,j), where i < j. If i and j corefer (zi = zj), then yi,j = 1; otherwise, yi,j = 0. The mention he in Figure 15.1 is preceded by five other mentions: (1) Apple Inc; (2) Apple Inc Chief Executive Tim Cook; (3) China; (4) talks; (5) government officials. The correct mention pair labeling is y2,6 = 1 and yi̸=2,6 = 0 for all other i. If a mention j introduces a new entity, such as mention 3 in the example, then yi,j = 0 for all i. The same is true for “mentions” that do not refer to any entity, such as non-referential pronouns. If mention j refers to an entity that has been mentioned more than once, then yi,j = 1 for all i < j that mention the referent. By transforming coreference into a set of binary labeling problems, the mention-pair model makes it possible to apply an off-the-shelf binary classifier (Soon et al., 2001). This classifier is applied to each mention j independently, searching backwards from j until finding an antecedent i which corefers with j with high confidence. After identifying a single antecedent, the remaining mention pair labels can be computed by transitivity: if yi,j =1andyj,k =1,thenyi,k =1. Since the ground truth annotations give entity chains c but not individual mention- pair labels y, an additional heuristic must be employed to convert the labeled data into training examples for classification. A typical approach is to generate at most one pos- itive labeled instance yaj,j = 1 for mention j, where aj is the index of the most recent antecedent, aj = max{i : i < j ∧ zi = zj }. Negative labeled instances are generated for all for all i ∈ {aj + 1, . . . , j}. In the running example, the most recent antecedent of the pronounheisa6 =2,sothetrainingdatawouldbey2,6 =1andy3,6 =y4,6 =y5,6 =0. The variable y1,6 is not part of the training data, because the first mention appears before the true antecedent a6 = 2. Under contract with MIT Press, shared under CC-BY-NC-ND license. 360 CHAPTER15. REFERENCERESOLUTION 15.2.2 Mention-ranking models In mention ranking (Denis and Baldridge, 2007), the classifier learns to identify a single antecedent ai ∈ {ε, 1, 2, . . . , i − 1} for each referring expression i, aˆi = argmax ψM(a,i), [15.4] a∈{ε,1,2,...,i−1} where ψM (a, i) is a score for the mention pair (a, i). If a = ε, then mention i does not refer to any previously-introduced entity — it is not anaphoric. Mention-ranking is similar to the mention-pair model, but all candidates are considered simultaneously, and at most a single antecedent is selected. The mention-ranking model explicitly accounts for the possibility that mention i is not anaphoric, through the score ψM (ε, i). The determination of anaphoricity can be made by a special classifier in a preprocessing step, so that non-ε antecedents are identified only for spans that are determined to be anaphoric (Denis and Baldridge, 2008). As a learning problem, ranking can be trained using the same objectives as in dis- criminative classification. For each mention i, we can define a gold antecedent a∗i , and an associated loss, such as the hinge loss, li = (1 − ψM (a∗i , i) + ψM (aˆ, i))+ or the negative log-likelihood, li = − log p(a∗i | i; θ). (For more on learning to rank, see subsection 17.1.1.) But as with the mention-pair model, there is a mismatch between the labeled data, which comes in the form of mention sets, and the desired supervision, which would indicate the specific antecedent of each mention. The antecedent variables {ai}Mi=1 relate to the men- tion sets in a many-to-one mapping: each set of antecedents induces a single clustering, but a clustering can correspond to many different settings of antecedent variables. A heuristic solution is to set a∗i = max{j : j < i ∧ zj = zi}, the most recent mention in the same cluster as i. But the most recent mention may not be the most informative: in the running example, the most recent antecedent of the mention Cook is the pronoun he, but a more useful antecedent is the earlier mention Apple Inc Chief Executive Tim Cook. Rather than selecting a specific antecedent to train on, the antecedent can be treated as a latent variable, in the manner of the latent variable perceptron from subsection 12.4.2 (Fernan- des et al., 2014): aˆ = argmax a 􏰁M i=1 ψM (ai, i) [15.5] [15.6] [15.7] ∗ 􏰁M a = argmax ψM (ai, i) θ←θ+􏰁M ∂LψM(a∗i,i)−􏰁M ∂LψM(aˆi,i) a∈A(c) i=1 i=1 ∂θ i=1 ∂θ Jacob Eisenstein. Draft of October 15, 2018. 15.2. ALGORITHMSFORCOREFERENCERESOLUTION 361 where A(c) is the set of antecedent structures that is compatible with the ground truth coreference clustering c. Another alternative is to sum over all the conditional probabili- ties of antecedent structures that are compatible with the ground truth clustering (Durrett and Klein, 2013; Lee et al., 2017). For the set of mention m, we compute the following probabilities: 􏰁 􏰁􏰙M p(c|m)= p(a|m)= p(ai |i,m) a∈A(c) a∈A(c) i=1 p(ai | i, m) =􏰀 exp (ψM (ai, i)) . a′∈{ε,1,2,...,i−1} exp (ψM (a′, i)) [15.8] [15.9] This objective rewards models that assign high scores for all valid antecedent structures. In the running example, this would correspond to summing the probabilities of the two valid antecedents for Cook, he and Apple Inc Chief Executive Tim Cook. In one of the exer- cises, you will compute the number of valid antecedent structures for a given clustering. 15.2.3 Transitive closure in mention-based models A problem for mention-based models is that individual mention-level decisions may be incoherent. Consider the following mentions: m1 =Hillary Clinton m2 =Clinton m3 =Bill Clinton [15.10] [15.11] [15.12] A mention-pair system might predict yˆ1,2 = 1, yˆ2,3 = 1, yˆ1,3 = 0. Similarly, a mention- ranking system might choose aˆ2 = 1 and aˆ3 = 2. Logically, if mentions 1 and 3 are both coreferent with mention 2, then all three mentions must refer to the same entity. This constraint is known as transitive closure. Transitive closure can be applied post hoc, revising the independent mention-pair or mention-ranking decisions. However, there are many possible ways to enforce transitive closure: in the example above, we could set yˆ1,3 = 1, or yˆ1,2 = 0, or yˆ2,3 = 0. For docu- ments with many mentions, there may be many violations of transitive closure, and many possible fixes. Transitive closure can be enforced by always adding edges, so that yˆ1,3 = 1 is preferred (e.g., Soon et al., 2001), but this can result in overclustering, with too many mentions grouped into too few entities. Mention-pair coreference resolution can be viewed as a constrained optimization prob- Under contract with MIT Press, shared under CC-BY-NC-ND license. 362 lem, CHAPTER15. REFERENCERESOLUTION 􏰁M 􏰁j maxM ψM(i,j)×yi,j y∈{0,1} j=1 i=1 s.t. yi,j+yj,k−1≤yi,k, ∀i

400
CHAPTER16. DISCOURSE
the follwoing constraint: if segment i undercuts an argumentative relationship be- tween j and k, then i cannot be included in the summary unless both j and k are included. Your solution should take the form of a set of linear constraints on an integer linear program — that is, each constraint can only involve addition and sub- traction of variables.
In the next two exercises, you will explore the use of discourse connectives in a real corpus. Using NLTK, acquire the Brown corpus, and identify sentences that begin with any of the following connectives: however, nevertheless, moreover, furthermore, thus.
9. Both lexical consistency and discourse connectives contribute to the cohesion of a text. We might therefore expect adjacent sentences that are joined by explicit dis- course connectives to also have higher word overlap. Using the Brown corpus, test this theory by computing the average cosine similarity between adjacent sentences that are connected by one of the connectives mentioned above. Compare this to the average cosine similarity of all other adjacent sentences. If you know how, perform a two-sample t-test to determine whether the observed difference is statistically sig- nificant.
10. Group the above connectives into the following three discourse relations:
• Expansion: moreover, furthermore • Comparison: however, nevertheless • Contingency: thus
Focusing on pairs of sentences which are joined by one of these five connectives, build a classifier to predict the discourse relation from the text of the two adjacent sentences — taking care to ignore the connective itself. Use the first 30000 sentences of the Brown corpus as the training set, and the remaining sentences as the test set. Compare the performance of your classifier against simply choosing the most common class. Using a bag-of-words classifier, it is hard to do much better than this baseline, so consider more sophisticated alternatives!
Jacob Eisenstein. Draft of October 15, 2018.

Part IV Applications
401

Chapter 17
Information extraction
Computers offer powerful capabilities for searching and reasoning about structured records and relational data. Some have argued that the most important limitation of artificial in- telligence is not inference or learning, but simply having too little knowledge (Lenat et al., 1990). Natural language processing provides an appealing solution: automatically con- struct a structured knowledge base by reading natural language text.
For example, many Wikipedia pages have an “infobox” that provides structured in- formation about an entity or event. An example is shown in Figure 17.1a: each row rep- resents one or more properties of the entity IN THE AEROPLANE OVER THE SEA, a record album. The set of properties is determined by a predefined schema, which applies to all record albums in Wikipedia. As shown in Figure 17.1b, the values for many of these fields are indicated directly in the first few sentences of text on the same Wikipedia page.
The task of automatically constructing (or “populating”) an infobox from text is an example of information extraction. Much of information extraction can be described in terms of entities, relations, and events.
• Entitiesareuniquelyspecifiedobjectsintheworld,suchaspeople(JEFFMANGUM), places (ATHENS, GEORGIA), organizations (MERGE RECORDS), and times (FEBRUARY 10, 1998). Chapter 8 described the task of named entity recognition, which labels tokens as parts of entity spans. Now we will see how to go further, linking each entity mention to an element in a knowledge base.
• Relationsincludeapredicateandtwoarguments:forexample,CAPITAL(GEORGIA,ATLANTA).
• Events involve multiple typed arguments. For example, the production and release 403

404
CHAPTER17. INFORMATIONEXTRACTION
(17.1) In the Aeroplane Over the Sea is the
second and final studio album by the
American indie rock band Neutral Milk
Hotel.
(17.2) It was released in the United States on February 10, 1998 on Merge Records
the United Kingdom.
(17.3) Jeff Mangum moved from Athens,
Georgia to Denver, Colorado to prepare
the bulk of the album’s material with producer Robert Schneider, this time at
Schneider’s newly created Pet Sounds
Studio at the home of Jim McIntyre.
(b) The first few sentences of text. Strings that match fields or field names in the infobox are underlined; strings that mention other entities are wavy underlined.
Figure 17.1: From the Wikipedia page for the album “In the Aeroplane Over the Sea”, retrieved October 26, 2017.
of the album described in Figure 17.1 is described by the event,
⟨TITLE : IN THE AEROPLANE OVER THE SEA, ARTIST : NEUTRAL MILK HOTEL, RELEASE-DATE : 1998-FEB-10, . . .⟩
The set of arguments for an event type is defined by a schema. Events often refer to time-delimited occurrences: weddings, protests, purchases, terrorist attacks.
Information extraction is similar to semantic role labeling (chapter 13): we may think of predicates as corresponding to events, and the arguments as defining slots in the event representation. However, the goals of information extraction are different. Rather than accurately parsing every sentence, information extraction systems often focus on recog- nizing a few key relation or event types, or on the task of identifying all properties of a given entity. Information extraction is often evaluated by the correctness of the resulting knowledge base, and not by how many sentences were accurately parsed. The goal is sometimes described as macro-reading, as opposed to micro-reading, in which each sen- tence must be analyzed correctly. Macro-reading systems are not penalized for ignoring difficult sentences, as long as they can recover the same information from other, easier- to-read sources. However, macro-reading systems must resolve apparent inconsistencies
Jacob Eisenstein. Draft of October 15, 2018.
and May 1998 on Blue Rose Records in
XXXXXXXXX XXXXXXXXXXXXXXXXX
XXXXXXXXXXXXX XXXXXXX
XXXXXXX
(a) A Wikipedia infobox
XXXXXXXXXXXX
XXXXXXXXXXXXXXX

17.1. ENTITIES 405 (was the album released on MERGE RECORDS or BLUE ROSE RECORDS?), requiring rea-
soning across the entire dataset.
In addition to the basic tasks of recognizing entities, relations, and events, information extraction systems must handle negation, and must be able to distinguish statements of fact from hopes, fears, hunches, and hypotheticals. Finally, information extraction is of- ten paired with the problem of question answering, which requires accurately parsing a query, and then selecting or generating a textual answer. Question answering systems can be built on knowledge bases that are extracted from large text corpora, or may attempt to identify answers directly from the source texts.
17.1 Entities
The starting point for information extraction is to identify mentions of entities in text. Consider the following example:
(17.4) The United States Army captured a hill overlooking Atlanta on May 14, 1864. For this sentence, there are two goals:
1. Identify the spans United States Army, Atlanta, and May 14, 1864 as entity mentions. (The hill is not uniquely identified, so it is not a named entity.) We may also want to recognize the named entity types: organization, location, and date. This is named entity recognition, and is described in chapter 8.
2. Link these spans to entities in a knowledge base: U.S. ARMY, ATLANTA, and 1864- MAY-14. This task is known as entity linking.
The strings to be linked to entities are mentions — similar to the use of this term in coreference resolution. In some formulations of the entity linking task, only named enti- ties are candidates for linking. This is sometimes called named entity linking (Ling et al., 2015). In other formulations, such as Wikification (Milne and Witten, 2008), any string can be a mention. The set of target entities often corresponds to Wikipedia pages, and Wikipedia is the basis for more comprehensive knowledge bases such as YAGO (Suchanek et al., 2007), DBPedia (Auer et al., 2007), and Freebase (Bollacker et al., 2008). Entity link- ing may also be performed in more “closed” settings, where a much smaller list of targets is provided in advance. The system must also determine if a mention does not refer to any entity in the knowledge base, sometimes called a NIL entity (McNamee and Dang, 2009).
Returning to (17.4), the three entity mentions may seem unambiguous. But the Wikipedia disambiguation page for the string Atlanta says otherwise:1 there are more than twenty
1https://en.wikipedia.org/wiki/Atlanta_(disambiguation), retrieved November 1, 2017. Under contract with MIT Press, shared under CC-BY-NC-ND license.

406 CHAPTER17. INFORMATIONEXTRACTION
different towns and cities, five United States Navy vessels, a magazine, a television show, a band, and a singer — each prominent enough to have its own Wikipedia page. We now consider how to choose among these dozens of possibilities. In this chapter we will focus on supervised approaches. Unsupervised entity linking is closely related to the problem of cross-document coreference resolution, where the task is to identify pairs of mentions that corefer, across document boundaries (Bagga and Baldwin, 1998b; Singh et al., 2011).
17.1.1 Entity linking by learning to rank
Entity linking is often formulated as a ranking problem,
yˆ = argmax Ψ(y, x, c), [17.1] y∈Y(x)
where y is a target entity, x is a description of the mention, Y(x) is a set of candidate entities, and c is a description of the context — such as the other text in the document, or its metadata. The function Ψ is a scoring function, which could be a linear model, Ψ(y, x, c) = θ · f (y, x, c), or a more complex function such as a neural network. In either case, the scoring function can be learned by minimizing a margin-based ranking loss,
l(yˆ, y(i), x(i), c(i)) = 􏰛Ψ(yˆ, x(i), c(i)) − Ψ(y(i), x(i), c(i)) + 1􏰜+ , [17.2] where y(i) is the ground truth and yˆ ̸= y(i) is the predicted target for mention x(i) in
context c(i) (Joachims, 2002; Dredze et al., 2010).
Candidate identification For computational tractability, it is helpful to restrict the set of candidates, Y(x). One approach is to use a name dictionary, which maps from strings to the entities that they might mention. This mapping is many-to-many: a string such as Atlanta can refer to multiple entities, and conversely, an entity such as ATLANTA can be referenced by multiple strings. A name dictionary can be extracted from Wikipedia, with links between each Wikipedia entity page and the anchor text of all hyperlinks that point to the page (Bunescu and Pasca, 2006; Ratinov et al., 2011). To improve recall, the name dictionary can be augmented by partial and approximate matching (Dredze et al., 2010), but as the set of candidates grows, the risk of false positives increases. For example, the string Atlanta is a partial match to the Atlanta Fed (a name for the FEDERAL RESERVE BANK OF ATLANTA), and a noisy match (edit distance of one) from Atalanta (a heroine in Greek mythology and an Italian soccer team).
Features Feature-based approaches to entity ranking rely on three main types of local information (Dredze et al., 2010):
Jacob Eisenstein. Draft of October 15, 2018.

17.1. ENTITIES 407
• The similarity of the mention string to the canonical entity name, as quantified by string similarity. This feature would elevate the city ATLANTA over the basketball team ATLANTA HAWKS for the string Atlanta.
• The popularity of the entity, which can be measured by Wikipedia page views or PageRank in the Wikipedia link graph. This feature would elevate ATLANTA, GEOR- GIA over the unincorporated community of ATLANTA, OHIO.
• The entity type, as output by the named entity recognition system. This feature would elevate the city of ATLANTA over the magazine ATLANTA in contexts where the mention is tagged as a location.
In addition to these local features, the document context can also help. If Jamaica is men- tioned in a document about the Caribbean, it is likely to refer to the island nation; in the context of New York, it is likely to refer to the neighborhood in Queens; in the con- text of a menu, it might refer to a hibiscus tea beverage. Such hints can be formalized by computing the similarity between the Wikipedia page describing each candidate en- tity and the mention context c(i), which may include the bag-of-words representing the document (Dredze et al., 2010; Hoffart et al., 2011) or a smaller window of text around the mention (Ratinov et al., 2011). For example, we can compute the cosine similarity between bag-of-words vectors for the context and entity description, typically weighted using inverse document frequency to emphasize rare words.2
Neural entity linking An alternative approach is to compute the score for each entity candidate using distributed vector representations of the entities, mentions, and context. For example, for the task of entity linking in Twitter, Yang et al. (2016) employ the bilinear scoring function,
Ψ(y, x, c) = vy⊤Θ(y,x)x + vy⊤Θ(y,c)c, [17.3]
with vy ∈ RKy as the vector embedding of entity y, x ∈ RKx as the embedding of the mention, c ∈ RKc as the embedding of the context, and the matrices Θ(y,x) and Θ(y,c) as parameters that score the compatibility of each entity with respect to the mention and context. Each of the vector embeddings can be learned from an end-to-end objective, or pre-trained on unlabeled data.
• Pretrainedentityembeddingscanbeobtainedfromanexistingknowledgebase(Bor- des et al., 2011, 2013), or by running a word embedding algorithm such as WORD2VEC
2The document frequency of word j is DF(j) = 1 􏰀N δ 􏰛x(i) > 0􏰜, equal to the number of docu- N i=1 j
ments in which the word appears. The contribution of each word to the cosine similarity of two bag-of- words vectors can be weighted by the inverse document frequency 1 or log 1 , to emphasize rare
words (Spa ̈rck Jones, 1972).
Under contract with MIT Press, shared under CC-BY-NC-ND license.
DF(j ) DF(j )

408
CHAPTER17. INFORMATIONEXTRACTION
• •
on the text of Wikipedia, with hyperlinks substituted for the anchor text.3
The embedding of the mention x can be computed by averaging the embeddings of the words in the mention (Yang et al., 2016), or by the compositional techniques described in section 14.8.
The embedding of the context c can also be computed from the embeddings of the words in the context. A denoising autoencoder learns a function from raw text to dense K-dimensional vector encodings by minimizing a reconstruction loss (Vin- cent et al., 2010),
􏰁N
min ||x(i) − g(h(x ̃(i); θh); θg)||2, [17.4]
θg,θh i=1
where x ̃(i) is a noisy version of the bag-of-words counts x(i), which is produced by randomly setting some counts to zero; h : RV → RK is an encoder with parameters θh; and g : RK → RV , with parameters θg. The encoder and decoder functions are typically implemented as feedforward neural networks. To apply this model to entity linking, each entity and context are initially represented by the encoding of their bag-of-words vectors, h(e) and g(c), and these encodings are then fine-tuned from labeled data (He et al., 2013). The context vector c can also be obtained by convolution (section 3.4) on the embeddings of words in the document (Sun et al., 2015), or by examining metadata such as the author’s social network (Yang et al., 2016).
The remaining parameters Θ(y,x) and Θ(y,c) can be trained by backpropagation from the margin loss in Equation 17.2.
17.1.2 Collective entity linking
Entity linking can be more accurate when it is performed jointly across a document. To see why, consider the following lists:
(17.5) a.
b. Baltimore, Washington, Philadelphia
California, Oregon, Washington c. Washington, Adams, Jefferson
In each case, the term Washington refers to a different entity, and this reference is strongly suggested by the other entries on the list. In the last list, all three names are highly am- biguous — there are dozens of other Adams and Jefferson entities in Wikipedia. But a
3Pre-trained entity embeddings can be downloaded from https://code.google.com/archive/p/ word2vec/.
Jacob Eisenstein. Draft of October 15, 2018.

17.1. ENTITIES 409 preference for coherence motivates collectively linking these references to the first three
U.S. presidents.
A general approach to collective entity linking is to introduce a compatibility score
ψc(y). Collective entity linking is then performed by optimizing the global objective,
􏰁N i=1
where Y(x) is the set of all possible collective entity assignments for the mentions in x, and ψl is the local scoring function for each entity i. The compatibility function is typically decomposed into a sum of pairwise scores, Ψc(y) = 􏰀Ni=1 􏰀Nj̸=i Ψc(y(i), y(j)). These scores can be computed in a number of different ways:
• Wikipedia defines high-level categories for entities (e.g., living people, Presidents of the United States, States of the United States), and Ψc can reward entity pairs for the number of categories that they have in common (Cucerzan, 2007).
• Compatibility can be measured by the number of incoming hyperlinks shared by the Wikipedia pages for the two entities (Milne and Witten, 2008).
• Inaneuralarchitecture,thecompatibilityoftwoentitiescanbesetequaltotheinner productoftheirembeddings,Ψc(y(i),y(j))=vy(i) ·vy(j).
• A non-pairwise compatibility score can be defined using a type of latent variable model known as a probabilistic topic model (Blei et al., 2003; Blei, 2012). In this framework, each latent topic is a probability distribution over entities, and each document has a probability distribution over topics. Each entity helps to determine the document’s distribution over topics, and in turn these topics help to resolve am- biguous entity mentions (Newman et al., 2006). Inference can be performed using the sampling techniques described in chapter 5.
Unfortunately, collective entity linking is NP-hard even for pairwise compatibility func- tions, so exact optimization is almost certainly intractable. Various approximate inference techniques have been proposed, including integer linear programming (Cheng and Roth, 2013), Gibbs sampling (Han and Sun, 2012), and graph-based algorithms (Hoffart et al., 2011; Han et al., 2011).
17.1.3 *Pairwise ranking loss functions
The loss function defined in Equation 17.2 considers only the highest-scoring prediction yˆ, but in fact, the true entity y(i) should outscore all other entities. A loss function based on this idea would give a gradient against the features or representations of several entities,
Under contract with MIT Press, shared under CC-BY-NC-ND license.
yˆ = argmax Ψc(y) + y∈Y(x)
Ψl(y(i), x(i), c(i)), [17.5]

410 CHAPTER17. INFORMATIONEXTRACTION Algorithm 18 WARP approximate ranking loss
procedureWARP(y(i),x(i)) N←0
repeat
return Lrank(r) × (ψ(y, x(i)) + 1 − ψ(y(i), x(i)))
until N ≥ |Y(x(i))| − 1 ◃ no violation found
1: 2: 3: 4: 5: 6: 7: 8:
9: 10:
function,
Randomly sample y ∼ Y(x(i)) N←N+1
if ψ(y, x(i)) + 1 > ψ(y(i), x(i)) then
◃ check for margin violation ◃ compute approximate rank
r ← 􏱈|Y(x(i))|/N􏱉
return 0 ◃ return zero loss not just the top-scoring prediction. Usunier et al. (2009) define a general ranking error
􏰁k j=1
where k is equal to the number of labels ranked higher than the correct label y(i). This function defines a class of ranking errors: if αj = 1 for all j, then the ranking error is equal to the rank of the correct entity; if α1 = 1 and αj>1 = 0, then the ranking error is one whenever the correct entity is not ranked first; if αj decreases smoothly with j, as in αj = 1 , then the error is between these two extremes.
This ranking error can be integrated into a margin objective. Remember that large margin classification requires not only the correct label, but also that the correct label outscores other labels by a substantial margin. A similar principle applies to ranking: we want a high rank for the correct entity, and we want it to be separated from other entities by a substantial margin. We therefore define the margin-augmented rank,
r(y(i), x(i)) 􏱕 􏰁 δ 􏰛1 + ψ(y, x(i)) ≥ ψ(y(i), x(i))􏰜 , [17.7] y∈Y(x(i))\y(i)
where δ (·) is a delta function, and Y(x(i)) \ y(i) is the set of all entity candidates minus the true entity y(i). The margin-augmented rank is the rank of the true entity, after aug- menting every other candidate with a margin of one, under the current scoring function ψ. (The context c is omitted for clarity, and can be considered part of x.)
For each instance, a hinge loss is computed from the ranking error associated with this Jacob Eisenstein. Draft of October 15, 2018.
Lrank(k) =
αj, with α1 ≥ α2 ≥ ··· ≥ 0, [17.6]
j

17.2. RELATIONS
411
margin-augmented rank, and the violation of the margin constraint,
l(y(i), x(i)) =Lrank(r(y(i), x(i))) 􏰁 􏰛ψ(y, x(i)) − ψ(y(i), x(i)) + 1􏰜
r(y(i), x(i)) y∈Y(x)\y(i)
, [17.8] +
The sum in Equation 17.8 includes non-zero values for every label that is ranked at least as high as the true entity, after applying the margin augmentation. Dividing by the margin- augmented rank of the true entity thus gives the average violation.
The objective in Equation 17.8 is expensive to optimize when the label space is large, as is usually the case for entity linking against large knowledge bases. This motivates a randomized approximation called WARP (Weston et al., 2011), shown in Algorithm 18. In this procedure, we sample random entities until one violates the pairwise margin con- straint, ψ(y, x(i)) + 1 ≥ ψ(y(i), x(i)). The number of samples N required to find such a violation yields an approximation of the margin-augmented rank of the true entity,
r(y(i),x(i)) ≈ 􏱏|Y(x)|􏱐. If a violation is found immediately, N = 1, the correct entity N
probably ranks below many others, r ≈ |Y(x)|. If many samples are required before a violation is found, N → |Y(x)|, then the correct entity is probably highly ranked, r → 1. A computational advantage of WARP is that it is not necessary to find the highest-scoring label, which can impose a non-trivial computational cost when Y(x(i)) is large. The objec- tive is conceptually similar to the negative sampling objective in WORD2VEC (chapter 14), which compares the observed word against randomly sampled alternatives.
17.2 Relations
After identifying the entities that are mentioned in a text, the next step is to determine how they are related. Consider the following example:
(17.6) George Bush traveled to France on Thursday for a summit.
This sentence introduces a relation between the entities referenced by George Bush and France. In the Automatic Content Extraction (ACE) ontology (Linguistic Data Consortium, 2005), the type of this relation is PHYSICAL, and the subtype is LOCATED. This relation would be written,
PHYSICAL.LOCATED(GEORGE BUSH, FRANCE). [17.9] Relations take exactly two arguments, and the order of the arguments matters.
In the ACE datasets, relations are annotated between entity mentions, as in the exam- ple above. Relations can also hold between nominals, as in the following example from the SemEval-2010 shared task (Hendrickx et al., 2009):
Under contract with MIT Press, shared under CC-BY-NC-ND license.

412
CHAPTER17. INFORMATIONEXTRACTION
CAUSE-EFFECT INSTRUMENT-AGENCY PRODUCT-PRODUCER CONTENT-CONTAINER ENTITY-ORIGIN ENTITY-DESTINATION COMPONENT-WHOLE MEMBER-COLLECTION COMMUNICATION-TOPIC
those cancers were caused by radiation exposures phone operator
a factory manufactures suits
a bottle of honey was weighed
letters from foreign countries
the boy went to bed
my apartment has a large kitchen there are many trees in the forest the lecture was about semantics
Table 17.1: Relations and example sentences from the SemEval-2010 dataset (Hendrickx et al., 2009)
(17.7) The cup contained tea from dried ginseng.
This sentence describes a relation of type ENTITY-ORIGIN between tea and ginseng. Nomi- nal relation extraction is closely related to semantic role labeling (chapter 13). The main difference is that relation extraction is restricted to a relatively small number of relation types; for example, Table 17.1 shows the ten relation types from SemEval-2010.
17.2.1 Pattern-based relation extraction
Early work on relation extraction focused on hand-crafted patterns (Hearst, 1992). For example, the appositive Starbuck, a native of Nantucket signals the relation ENTITY-ORIGIN between Starbuck and Nantucket. This pattern can be written as,
PERSON , a native of LOCATION ⇒ ENTITY-ORIGIN(PERSON, LOCATION). [17.10]
This pattern will be “triggered” whenever the literal string , a native of occurs between an entity of type PERSON and an entity of type LOCATION. Such patterns can be generalized beyond literal matches using techniques such as lemmatization, which would enable the words (buy, buys, buying) to trigger the same patterns (see section 4.3.1). A more aggres- sive strategy would be to group all words in a WordNet synset (section 4.2), so that, e.g., buy and purchase trigger the same patterns.
Relation extraction patterns can be implemented in finite-state automata (section 9.1). If the named entity recognizer is also a finite-state machine, then the systems can be com- bined by finite-state transduction (Hobbs et al., 1997). This makes it possible to propagate uncertainty through the finite-state cascade, and disambiguate from higher-level context. For example, suppose the entity recognizer cannot decide whether Starbuck refers to ei- ther a PERSON or a LOCATION; in the composed transducer, the relation extractor would be free to select the PERSON annotation when it appears in the context of an appropriate pattern.
Jacob Eisenstein. Draft of October 15, 2018.

17.2. RELATIONS 413 17.2.2 Relation extraction as a classification task
Relation extraction can be formulated as a classification problem,
rˆ = argmax Ψ(r, (i, j), (m, n), w), [17.11]
r∈R
(i,j ),(m,n)
where r ∈ R is a relation type (possibly NIL), wi+1:j is the span of the first argument, and wm+1:n is the span of the second argument. The argument wm+1:n may appear before or after wi+1:j in the text, or they may overlap; we stipulate only that wi+1:j is the first argument of the relation. We now consider three alternatives for computing the scoring function.
Feature-based classification
In a feature-based classifier, the scoring function is defined as,
Ψ(r, (i, j), (m, n), w) = θ · f(r, (i, j), (m, n), w), [17.12]
with θ representing a vector of weights, and f(·) a vector of features. The pattern-based methods described in subsection 17.2.1 suggest several features:
• Local features of wi+1:j and wm+1:n, including: the strings themselves; whether they are recognized as entities, and if so, which type; whether the strings are present in a gazetteer of entity names; each string’s syntactic head (subsection 9.2.2).
• Features of the span between the two arguments, wj+1:m or wn+1:i (depending on which argument appears first): the length of the span; the specific words that ap- pear in the span, either as a literal sequence or a bag-of-words; the wordnet synsets (section 4.2) that appear in the span between the arguments.
• Features of the syntactic relationship between the two arguments, typically the de- pendency path between the arguments (subsection 13.2.1). Example dependency paths are shown in Table 17.2.
Kernels
Suppose that the first line of Table 17.2 is a labeled example, and the remaining lines are instances to be classified. A feature-based approach would have to decompose the depen- dency paths into features that capture individual edges, with or without their labels, and then learn weights for each of these features: for example, the second line contains identi- cal dependencies, but different arguments; the third line contains a different inflection of the word travel; the fourth and fifth lines each contain an additional edge on the depen- dency path; and the sixth example uses an entirely different path. Rather than attempting to create local features that capture all of the ways in which these dependencies paths
Under contract with MIT Press, shared under CC-BY-NC-ND license.

414 CHAPTER17. INFORMATIONEXTRACTION
George Bush ← traveled → France NSUBJ OBL
Ahab ← traveled → Nantucket NSUBJ OBL
George Bush ← travel → France NSUBJ OBL
George Bush ← wants → travel → France NSUBJ XCOMP OBL
Ahab ← traveled → city → France NSUBJ OBL NMOD
are similar and different, we can instead define a similarity function κ, which computes a score for any pair of instances, κ : X × X → R+. The score for any pair of instances (i, j) is κ(x(i), x(j)) ≥ 0, with κ(i, j) being large when instances x(i) and x(j) are similar. If the function κ obeys a few key properties it is a valid kernel function.4
Given a valid kernel function, we can build a non-linear classifier without explicitly defining a feature vector or neural network architecture. For a binary classification prob- lem y ∈ {−1, 1}, we have the decision function,
􏰁N i=1
where b and {α(i)}Ni=1 are parameters that must be learned from the training set, under the constraint ∀i, α(i) ≥ 0. Intuitively, each αi specifies the importance of the instance x(i) towards the classification rule. Kernel-based classification can be viewed as a weighted form of the nearest-neighbor classifier (Hastie et al., 2009), in which test instances are assigned the most common label among their near neighbors in the training set. This results in a non-linear classification boundary. The parameters are typically learned from a margin-based objective (see section 2.4), leading to the kernel support vector machine. To generalize to multi-class classification, we can train separate binary classifiers for each label (sometimes called one-versus-all), or train binary classifiers for each pair of possible labels (one-versus-one).
Dependency kernels are particularly effective for relation extraction, due to their abil- ity to capture syntactic properties of the path between the two candidate arguments. One class of dependency tree kernels is defined recursively, with the score for a pair of trees
4The Gram matrix K arises from computing the kernel function between all pairs in a set of instances. For a valid kernel, the Gram matrix must be symmetric (K = K⊤) and positive semi-definite (∀a,a⊤Ka ≥ 0). For more on kernel-based classification, see chapter 14 of Murphy (2012).
Jacob Eisenstein. Draft of October 15, 2018.
1. George Bush traveled to France
2. Ahab traveled to Nantucket
3. George Bush will travel to France
4. George Bush wants to travel to France
5. Ahab traveled to a city in France
6. We await Ahab ’s visit to France
Table 17.2: Candidates instances for the PHYSICAL.LOCATED relation, and their depen-
dency paths
yˆ =Sign(b +
y(i)α(i)κ(x(i), x)) [17.13]
Ahab ← visit → France NMOD:POSS NMOD

17.2. RELATIONS 415
equal to the similarity of the root nodes and the sum of similarities of matched pairs of child subtrees (Zelenko et al., 2003; Culotta and Sorensen, 2004). Alternatively, Bunescu and Mooney (2005) define a kernel function over sequences of unlabeled dependency edges, in which the score is computed as a product of scores for each pair of words in the sequence: identical words receive a high score, words that share a synset or part-of-speech receive a small non-zero score (e.g., travel / visit), and unrelated words receive a score of zero.
Neural relation extraction
Convolutional neural networks (section 3.4) were an early neural architecture for relation extraction (Zeng et al., 2014; dos Santos et al., 2015). For the sentence (w1, w2, . . . , wM ), obtain a matrix of word embeddings X, where xm ∈ RK is the embedding of wm. Now, suppose the candidate arguments appear at positions a1 and a2; then for each word in the sentence, its position with respect to each argument is m − a1 and m − a2. (Following Zeng et al. (2014), this is a restricted version of the relation extraction task in which the arguments are single tokens.) To capture any information conveyed by these positions, the
word embeddings are concatenated with vector encodings of the positional offsets, x(p) m−a1
and x(p) . (For more on positional encodings, see subsection 18.3.2.) The complete base m−a2
representation of the sentence is,
x1 x2 ··· xM 
X(a ,a ) =  x(p) x(p) ··· x(p) 12 1−a1 2−a1 M−a1
x(p) x(p) · · · x(p) 1−a2 2−a2 M −a2
,
[17.14]
where each column is a vertical concatenation of a word embedding, represented by the column vector xm, and two positional encodings, specifying the position with respect to a1 and a2. The matrix X(a1,a2) is then taken as input to a convolutional layer (see section 3.4), and max-pooling is applied to obtain a vector. The final scoring function is then,
Ψ(r, i, j, X) = θr · MaxPool(ConvNet(X(i, j); φ)), [17.15] where φ defines the parameters of the convolutional operator, and the θr defines a set of
weights for relation r. The model can be trained using a margin objective,
rˆ = argmax Ψ(r, i, j, X) [17.16]
r
l =(1 + ψ(rˆ, i, j, X) − ψ(r, i, j, X))+. [17.17]
Recurrent neural networks (section 6.3) have also been applied to relation extraction, using a network such as a bidirectional LSTM to encode the words or dependency path between the two arguments. Xu et al. (2015) segment each dependency path into left and
Under contract with MIT Press, shared under CC-BY-NC-ND license.

416 CHAPTER17. INFORMATIONEXTRACTION right subpaths: the path George Bush ← wants → travel → France is segmented into the
NSUBJ XCOMP OBL
subpaths, George Bush ← wants and wants → travel → France. In each path, a recurrent NSUBJ XCOMP OBL
neural network is run from the argument to the root word (in this case, wants). The final representation by max pooling (section 3.4) across all the recurrent states along each path. This process can be applied across separate “channels”, in which the inputs consist of em- beddings for the words, parts-of-speech, dependency relations, and WordNet hypernyms (e.g., France-nation; see section 4.2). To define the model formally, let s(m) define the suc- cessor of word m in either the left or right subpath (in a dependency path, each word can
have a successor in at most one subpath). Let x(c) indicate the embedding of word (or ←− →− m
relation) m in channel c, and let h (c) and h (c) indicate the associated recurrent states in mm
the left and right subtrees respectively. Then the complete model is specified as follows,
h(c) s(m)
=RNN(x(c) , h(c))
[17.18] [17.19] [17.20]
m
􏰛←−(c) ←−(c) ←−(c) →− (c)
s(m)
=MaxPool hi , hs(i),…, hroot, hj
→− (c) →− (c) , hs(j),…, hroot
􏰜
(c)
Ψ(r, i, j) =θ · 􏰪z(word); z(POS); z(dependency); z(hypernym)􏰫 .
z
Note that z is computed by applying max-pooling to the matrix of horizontally concate- nated vectors h, while Ψ is computed from the vector of vertically concatenated vectors z. Xu et al. (2015) pass the score Ψ through a softmax layer to obtain a probability p(r | i, j, w), and train the model by regularized cross-entropy. Miwa and Bansal (2016) show that a related model can solve the more challenging “end-to-end” relation extrac- tion task, in which the model must simultaneously detect entities and then extract their relations.
17.2.3 Knowledge base population
In many applications, what matters is not what fraction of sentences are analyzed cor- rectly, but how much accurate knowledge can be extracted. Knowledge base population (KBP) refers to the task of filling in Wikipedia-style infoboxes, as shown in Figure 17.1a. Knowledge base population can be decomposed into two subtasks: entity linking (de- scribed in section 17.1), and slot filling (Ji and Grishman, 2011). Slot filling has two key differences from the formulation of relation extraction presented above: the relations hold between entities rather than spans of text, and the performance is evaluated at the type level (on entity pairs), rather than on the token level (on individual sentences).
From a practical standpoint, there are three other important differences between slot filling and per-sentence relation extraction.
• KBP tasks are often formulated from the perspective of identifying attributes of a few “query” entities. As a result, these systems often start with an information
Jacob Eisenstein. Draft of October 15, 2018.

17.2. RELATIONS 417 retrieval phase, in which relevant passages of text are obtained by search.
• For many entity pairs, there will be multiple passages of text that provide evidence. Slot filling systems must aggregate this evidence to predict a single relation type (or set of relations).
• Labeled data is usually available in the form of pairs of related entities, rather than annotated passages of text. Training from such type-level annotations is a challenge: two entities may be linked by several relations, or they may appear together in a passage of text that nonetheless does not describe their relation to each other.
Information retrieval is beyond the scope of this text (see Manning et al., 2008). The re- mainder of this section describes approaches to information fusion and learning from type-level annotations.
Information fusion
In knowledge base population, there will often be multiple pieces of evidence for (and sometimes against) a single relation. For example, a search for the entity MAYNARD JACK- SON, JR. may return several passages that reference the entity ATLANTA:5
(17.8) a.
Elected mayor of Atlanta in 1973, Maynard Jackson was the first African American to serve as mayor of a major southern city.
b. Atlanta’s airport will be renamed to honor Maynard Jackson, the city’s first Black mayor.
c. Born in Dallas, Texas in 1938, Maynard Holbrook Jackson, Jr. moved to Atlanta when he was 8.
d. Maynard Jackson has gone from one of the worst high schools in Atlanta to one of the best.
The first and second examples provide evidence for the relation MAYOR holding between the entities ATLANTA and MAYNARD JACKSON, JR.. The third example provides evidence for a different relation between these same entities, LIVED-IN. The fourth example poses an entity linking problem, referring to MAYNARD JACKSON HIGH SCHOOL. Knowledge base population requires aggregating this sort of textual evidence, and predicting the re- lations that are most likely to hold.
One approach is to run a single-document relation extraction system (using the tech- niques described in subsection 17.2.2), and then aggregate the results (Li et al., 2011).
5 First three examples from: http://www.georgiaencyclopedia.org/articles/ government-politics/maynard-jackson-1938-2003; JET magazine, November 10, 2003; www.todayingeorgiahistory.org/content/maynard-jackson-elected
Under contract with MIT Press, shared under CC-BY-NC-ND license.

418 CHAPTER17. INFORMATIONEXTRACTION Relations that are detected with high confidence in multiple documents are more likely to
be valid, motivating the heuristic,
􏰁N i=1
where p(r(e1, e2) | w(i)) is the probability of relation r between entities e1 and e2 condi- tioned on the text w(i), and α ≫ 1 is a tunable hyperparameter. Using this heuristic, it is possible to rank all candidate relations, and trace out a precision-recall curve as more re- lations are extracted.6 Alternatively, features can be aggregated across multiple passages of text, feeding a single type-level relation extraction system (Wolfe et al., 2017).
Precision can be improved by introducing constraints across multiple relations. For example, if we are certain of the relation PARENT(e1,e2), then it cannot also be the case that PARENT(e2,e1). Integer linear programming makes it possible to incorporate such constraints into a global optimization (Li et al., 2011). Other pairs of relations have posi- tive correlations, such MAYOR(e1, e2) and LIVED-IN(e1, e2). Compatibility across relation types can be incorporated into probabilistic graphical models (e.g., Riedel et al., 2010).
Distant supervision
Relation extraction is “annotation hungry,” because each relation requires its own la- beled data. Rather than relying on annotations of individual documents, it would be preferable to use existing knowledge resources — such as the many facts that are al- ready captured in knowledge bases like DBPedia. However such annotations raise the inverse of the information fusion problem considered above: the existence of the relation MAYOR(MAYNARD JACKSON JR., ATLANTA) provides only distant supervision for the example texts in which this entity pair is mentioned.
One approach is to treat the entity pair as the instance, rather than the text itself (Mintz et al., 2009). Features are then aggregated across all sentences in which both entities are mentioned, and labels correspond to the relation (if any) between the entities in a knowl- edge base, such as FreeBase. Negative instances are constructed from entity pairs that are not related in the knowledge base. In some cases, two entities are related, but the knowl- edge base is missing the relation; however, because the number of possible entity pairs is huge, these missing relations are presumed to be relatively rare. This approach is shown in Figure 17.2.
In multiple instance learning, labels are assigned to sets of instances, of which only an unknown subset are actually relevant (Dietterich et al., 1997; Maron and Lozano-Pe ́rez, 1998). This formalizes the framework of distant supervision: the relation REL(A, B) acts
6The precision-recall curve is similar to the ROC curve shown in Figure 4.4, but it includes the precision
TP rather than the false positive rate FP . TP+FP FP+TN
ψ(r, e1, e2) =
(p(r(e1, e2) | w(i)))α, [17.21]
Jacob Eisenstein. Draft of October 15, 2018.

17.2. RELATIONS 419 • Label : MAYOR(ATLANTA, MAYNARD JACKSON)
– ElectedmayorofAtlantain1973,MaynardJackson…
– Atlanta’sairportwillberenamedtohonorMaynardJackson,thecity’sfirstBlack
mayor
– Born in Dallas, Texas in 1938, Maynard Holbrook Jackson, Jr. moved to Atlanta
when he was 8.
• Label : MAYOR(NEW YORK, FIORELLO LA GUARDIA)
– FiorelloLaGuardiawasMayorofNewYorkforthreeterms…
– FiorelloLaGuardia,thenservingontheNewYorkCityBoardofAldermen…
• Label : BORN-IN(DALLAS, MAYNARD JACKSON)
– Born in Dallas, Texas in 1938, Maynard Holbrook Jackson, Jr. moved to Atlanta when he was 8.
– MaynardJacksonwasraisedinDallas… • Label : NIL(NEW YORK, MAYNARD JACKSON)
– JacksonmarriedValerieRichardson,whomhehadmetinNewYork… – JacksonwasamemberoftheGeorgiaandNewYorkbars…
Figure 17.2: Four training instances for relation classification using distant supervi- sion Mintz et al. (2009). The first two instances are positive for the MAYOR relation, and the third instance is positive for the BORN-IN relation. The fourth instance is a negative ex- ample, constructed from a pair of entities (NEW YORK, MAYNARD JACKSON) that do not appear in any Freebase relation. Each instance’s features are computed by aggregating across all sentences in which the two entities are mentioned.
as a label for the entire set of sentences mentioning entities A and B, even when only a subset of these sentences actually describes the relation. One approach to multi-instance learning is to introduce a binary latent variable for each sentence, indicating whether the sentence expresses the labeled relation (Riedel et al., 2010). A variety of inference tech- niques have been employed for this probabilistic model of relation extraction: Surdeanu et al. (2012) use expectation maximization, Riedel et al. (2010) use sampling, and Hoff- mann et al. (2011) use a custom graph-based algorithm. Expectation maximization and sampling are surveyed in chapter 5, and are covered in more detail by Murphy (2012); graph-based methods are surveyed by Mihalcea and Radev (2011).
17.2.4 Open information extraction
In classical relation extraction, the set of relations is defined in advance, using a schema. The relation for any pair of entities can then be predicted using multi-class classification. In open information extraction (OpenIE), a relation can be any triple of text. The example
Under contract with MIT Press, shared under CC-BY-NC-ND license.

420
CHAPTER17. INFORMATIONEXTRACTION
Task
PropBank semantic role labeling FrameNet semantic role labeling Relation extraction
Slot filling
Open Information Extraction
Relation ontology
VerbNet
FrameNet
ACE, TAC, SemEval, etc ACE, TAC, SemEval, etc open
Supervision
sentence
sentence
sentence
relation
seed relations or patterns
Table 17.3: Various relation extraction tasks and their properties. VerbNet and FrameNet are described in chapter 13. ACE (Linguistic Data Consortium, 2005), TAC (McNamee and Dang, 2009), and SemEval (Hendrickx et al., 2009) refer to shared tasks, each of which involves an ontology of relation types.
sentence (17.8a) instantiates several “relations” of this sort, e.g.,
• (mayorof,MaynardJackson,Atlanta),
• (elected,MaynardJackson,mayorofAtlanta), • (electedin,MaynardJackson,1973).
Extracting such tuples can be viewed as a lightweight version of semantic role labeling (chapter 13), with only two argument types: first slot and second slot. The task is gen- erally evaluated on the relation level, rather than on the level of sentences: precision is measured by the number of extracted relations that are accurate, and recall is measured by the number of true relations that were successfully extracted. OpenIE systems are trained from distant supervision or bootstrapping, rather than from labeled sentences.
An early example is the TEXTRUNNER system (Banko et al., 2007), which identifies relations with a set of handcrafted syntactic rules. The examples that are acquired from the handcrafted rules are then used to train a classification model that uses part-of-speech patterns as features. Finally, the relations that are extracted by the classifier are aggre- gated, removing redundant relations and computing the number of times that each rela- tion is mentioned in the corpus. TEXTRUNNER was the first in a series of systems that performed increasingly accurate open relation extraction by incorporating more precise linguistic features (Etzioni et al., 2011), distant supervision from Wikipedia infoboxes (Wu and Weld, 2010), and better learning algorithms (Zhu et al., 2009).
17.3 Events
Relations link pairs of entities, but many real-world situations involve more than two enti- ties. Consider again the example sentence (17.8a), which describes the event of an election,
Jacob Eisenstein. Draft of October 15, 2018.

17.3. EVENTS 421
with four properties: the office (MAYOR), the district (ATLANTA), the date (1973), and the person elected (MAYNARD JACKSON, JR.). In event detection, a schema is provided for each event type (e.g., an election, a terrorist attack, or a chemical reaction), indicating all the possible properties of the event. The system is then required to fill in as many of these properties as possible (Doddington et al., 2004).
Event detection systems generally involve a retrieval component (finding relevant documents and passages of text) and an extraction component (determining the prop- erties of the event based on the retrieved texts). Early approaches focused on finite-state patterns for identify event properties (Hobbs et al., 1997); such patterns can be automati- cally induced by searching for patterns that are especially likely to appear in documents that match the event query (Riloff, 1996). Contemporary approaches employ techniques that are similar to FrameNet semantic role labeling (section 13.2), such as structured pre- diction over local and global features (Li et al., 2013) and bidirectional recurrent neural networks (Feng et al., 2016). These methods detect whether an event is described in a sentence, and if so, what are its properties.
Event coreference Because multiple sentences may describe unique properties of a sin- gle event, event coreference is required to link event mentions across a single passage of text, or between passages (Humphreys et al., 1997). Bejan and Harabagiu (2014) de- fine event coreference as the task of identifying event mentions that share the same event participants (i.e., the slot-filling entities) and the same event properties (e.g., the time and location), within or across documents. Event coreference resolution can be performed us- ing supervised learning techniques in a similar way to entity coreference, as described in chapter 15: move left-to-right through the document, and use a classifier to decide whether to link each event reference to an existing cluster of coreferent events, or to cre- ate a new cluster (Ahn, 2006). Each clustering decision is based on the compatibility of features describing the participants and properties of the event. Due to the difficulty of annotating large amounts of data for entity coreference, unsupervised approaches are es- pecially desirable (Chen and Ji, 2009; Bejan and Harabagiu, 2014).
Relations between events Just as entities are related to other entities, events may be related to other events: for example, the event of winning an election both precedes and causes the event of serving as mayor; moving to Atlanta precedes and enables the event of becoming mayor of Atlanta; moving from Dallas to Atlanta prevents the event of later be- coming mayor of Dallas. As these examples show, events may be related both temporally and causally. The TimeML annotation scheme specifies a set of six temporal relations between events (Pustejovsky et al., 2005), derived in part from interval algebra (Allen, 1984). The TimeBank corpus provides TimeML annotations for 186 documents (Puste- jovsky et al., 2003). Methods for detecting these temporal relations combine supervised
Under contract with MIT Press, shared under CC-BY-NC-ND license.

422
CHAPTER17. INFORMATIONEXTRACTION
Certain (CT) Probable (PR) Possible (PS) Underspecified (U)
Positive (+)
Fact: CT+ Probable: PR+ Possible: PS+ (NA)
Negative (-)
Counterfact: CT- Not probable: P R – Not possible: PS- (NA)
Underspecified (u)
Certain, but unknown: C T U (NA)
(NA)
Unknown or uncommitted: U U
Table 17.4: Table of factuality values from the FactBank corpus (Saur ́ı and Pustejovsky, 2009). The entry (NA) indicates that this combination is not annotated.
machine learning with temporal constraints, such as transitivity (e.g. Mani et al., 2006; Chambers and Jurafsky, 2008).
More recent annotation schemes and datasets combine temporal and causal relations (Mirza et al., 2014; Dunietz et al., 2017): for example, the CaTeRS dataset includes annotations of
320 five-sentence short stories (Mostafazadeh et al., 2016). Abstracting still further, pro- cesses are networks of causal relations between multiple events. A small dataset of bi- ological processes is annotated in the ProcessBank dataset (Berant et al., 2014), with the
goal of supporting automatic question answering on scientific textbooks.
17.4 Hedges, denials, and hypotheticals
The methods described thus far apply to propositions about the way things are in the real world. But natural language can also describe events and relations that are likely or unlikely, possible or impossible, desired or feared. The following examples hint at the scope of the problem (Prabhakaran et al., 2010):
(17.9) a. b. c. d. e. f. g.
GM will lay off workers.
A spokesman for GM said GM will lay off workers. GM may lay off workers.
The politician claimed that GM will lay off workers. Some wish GM would lay off workers.
Will GM lay off workers?
Many wonder whether GM will lay off workers.
Accurate information extraction requires handling these extra-propositional aspects of meaning, which are sometimes summarized under the terms modality and negation.7
7The classification of negation as extra-propositional is controversial: Packard et al. (2014) argue that negation is a “core part of compositionally constructed logical-form representations.” Negation is an element of the semantic parsing tasks discussed in chapter 12 and chapter 13 — for example, negation markers are
Jacob Eisenstein. Draft of October 15, 2018.

17.4. HEDGES,DENIALS,ANDHYPOTHETICALS 423
Modality refers to expressions of the speaker’s attitude towards her own statements, in- cluding “degree of certainty, reliability, subjectivity, sources of information, and perspec- tive” (Morante and Sporleder, 2012). Various systematizations of modality have been proposed (e.g., Palmer, 2001), including categories such as future, interrogative, imper- ative, conditional, and subjective. Information extraction is particularly concerned with negation and certainty. For example, Saur ́ı and Pustejovsky (2009) link negation with a modal calculus of certainty, likelihood, and possibility, creating the two-dimensional schema shown in Table 17.4. This is the basis for the FactBank corpus, with annotations of the factuality of all sentences in 208 documents of news text.
A related concept is hedging, in which speakers limit their commitment to a proposi- tion (Lakoff, 1973):
(17.10) a. These results suggest that expression of c-jun, jun B and jun D genes might be involved in terminal granulocyte differentiation. . . (Morante and Daelemans, 2009)
b. A whale is technically a mammal (Lakoff, 1973)
In the first example, the hedges suggest and might communicate uncertainty; in the second example, there is no uncertainty, but the hedge technically indicates that the evidence for the proposition will not fully meet the reader’s expectations. Hedging has been studied extensively in scientific texts (Medlock and Briscoe, 2007; Morante and Daelemans, 2009), where the goal of large-scale extraction of scientific facts is obstructed by hedges and spec- ulation. Still another related aspect of modality is evidentiality, in which speakers mark the source of their information. In many languages, it is obligatory to mark evidentiality through affixes or particles (Aikhenvald, 2004); while evidentiality is not grammaticalized in English, authors are expected to express this information in contexts such as journal- ism (Kovach and Rosenstiel, 2014) and Wikipedia.8
Methods for handling negation and modality generally include two phases: 1. detecting negated or uncertain events;
2. identifying scope of the negation or modal operator.
A considerable body of work on negation has employed rule-based techniques such
as regular expressions (Chapman et al., 2001) to detect negated events. Such techniques
treated as adjuncts in PropBank semantic role labeling. However, many of the relation extraction methods mentioned in this chapter do not handle negation directly. A further consideration is that negation inter- acts closely with aspects of modality that are generally not considered in propositional semantics, such as certainty and subjectivity.
8 https://en.wikipedia.org/wiki/Wikipedia:Verifiability
Under contract with MIT Press, shared under CC-BY-NC-ND license.

424 CHAPTER17. INFORMATIONEXTRACTION
match lexical cues (e.g., Norwood was not elected Mayor), while avoiding “double nega- tives” (e.g., surely all this is not without meaning). Supervised techniques involve classi- fiers over lexical and syntactic features (Uzuner et al., 2009) and sequence labeling (Prab- hakaran et al., 2010).
The scope refers to the elements of the text whose propositional meaning is negated or modulated (Huddleston and Pullum, 2005), as elucidated in the following example from Morante and Sporleder (2012):
(17.11) [ After his habit he said ] nothing, and after mine I asked no questions. After his habit he said nothing, and [ after mine I asked ] no [ questions ].
In this sentence, there are two negation cues (nothing and no). Each negates an event, in- dicated by the underlined verbs said and asked, and each occurs within a scope: after his habit he said and after mine I asked questions. Scope identification is typically formal- ized as sequence labeling problems, with each word token labeled as beginning, inside, or outside of a cue, focus, or scope span (see section 8.3). Conventional sequence labeling ap- proaches can then be applied, using surface features as well as syntax (Velldal et al., 2012) and semantic analysis (Packard et al., 2014). Labeled datasets include the BioScope corpus of biomedical texts (Vincze et al., 2008) and a shared task dataset of detective stories by Arthur Conan Doyle (Morante and Blanco, 2012).
17.5 Question answering and machine reading
The victory of the Watson question-answering system against three top human players on the game show Jeopardy! was a landmark moment for natural language processing (Fer- rucci et al., 2010). Game show questions are usually answered by factoids: entity names and short phrases.9 The task of factoid question answering is therefore closely related to information extraction, with the additional problem of accurately parsing the question.
17.5.1 Formal semantics
Semantic parsing is an effective method for question-answering in restricted domains such as questions about geography and airline reservations (Zettlemoyer and Collins, 2005), and has also been applied in “open-domain” settings such as question answering on Freebase (Berant et al., 2013) and biomedical research abstracts (Poon and Domingos, 2009). One approach is to convert the question into a lambda calculus expression that returns a boolean value: for example, the question who is the mayor of the capital of Georgia?
9The broader landscape of question answering includes “why” questions (Why did Ahab continue to pursue the white whale?), “how questions” (How did Queequeg die?), and requests for summaries (What was Ishmael’s attitude towards organized religion?). For more, see Hirschman and Gaizauskas (2001).
Jacob Eisenstein. Draft of October 15, 2018.

17.5. QUESTIONANSWERINGANDMACHINEREADING 425
would be converted to,
λx.∃y CAPITAL(GEORGIA, y) ∧ MAYOR(y, x). [17.22]
This lambda expression can then be used to query an existing knowledge base, returning “true” for all entities that satisfy it.
17.5.2 Machine reading
Recent work has focused on answering questions about specific textual passages, similar to the reading comprehension examinations for young students (Hirschman et al., 1999). This task has come to be known as machine reading.
Datasets
The machine reading problem can be formulated in a number of different ways. The most important distinction is what form the answer should take.
• Multiple-choice question answering, as in the MCTest dataset of stories (Richard- son et al., 2013) and the New York Regents Science Exams (Clark, 2015). In MCTest, the answer is deducible from the text alone, while in the science exams, the system must make inferences using an existing model of the underlying scientific phenom- ena. Here is an example from MCTest:
(17.12) James the turtle was always getting into trouble. Sometimes he’d reach into the freezer and empty out all the food . . .
Q: What is the name of the trouble making turtle? (a) Fries
(b) Pudding
(c) James
(d) Jane
• Cloze-style “fill in the blank” questions, as in the CNN/Daily Mail comprehension task (Hermann et al., 2015), the Children’s Book Test (Hill et al., 2016), and the Who- did-What dataset (Onishi et al., 2016). In these tasks, the system must guess which word or entity completes a sentence, based on reading a passage of text. Here is an example from Who-did-What:
(17.13) Q: Tottenham manager Juande Ramos has hinted he will allow to leave iftheBulgariastrikermakesitclearheisunhappy. (Onishietal.,2016)
The query sentence may be selected either from the story itself, or from an external summary. In either case, datasets can be created automatically by processing large
Under contract with MIT Press, shared under CC-BY-NC-ND license.

426
CHAPTER17. INFORMATIONEXTRACTION
quantities existing documents. An additional constraint is that that missing element from the cloze must appear in the main passage of text: for example, in Who-did- What, the candidates include all entities mentioned in the main passage. In the CNN/Daily Mail dataset, each entity name is replaced by a unique identifier, e.g., ENTITY37. This ensures that correct answers can only be obtained by accurately reading the text, and not from external knowledge about the entities.
• Extractive question answering, in which the answer is drawn from the original text. In WikiQA, answers are sentences (Yang et al., 2015). In the Stanford Question An- swering Dataset (SQuAD), answers are words or short phrases (Rajpurkar et al., 2016):
(17.14) In metereology, precipitation is any product of the condensation of atmo- spheric water vapor that falls under gravity.
Q: What causes precipitation to fall? A: gravity
In both WikiQA and SQuAD, the original texts are Wikipedia articles, and the ques- tions are generated by crowdworkers.
Methods
A baseline method is to search the text for sentences or short passages that overlap with both the query and the candidate answer (Richardson et al., 2013). In example (17.12), this baseline would select the correct answer, since James appears in a sentence that includes the query terms trouble and turtle.
This baseline can be implemented as a neural architecture, using an attention mech- anism (see subsection 18.3.1), which scores the similarity of the query to each part of the source text (Chen et al., 2016). The first step is to encode the passage w(p) and the query w(q), using two bidirectional LSTMs (section 7.6).
h(q) =BiLSTM(w(q);Θ(q)) [17.23] h(p) =BiLSTM(w(p);Θ(p)). [17.24]
The query is represented by vertically concatenating the final states of the left-to-right and right-to-left passes:
−→ ←−
u =[h(q)Mq ; h(q)0]. [17.25] Jacob Eisenstein. Draft of October 15, 2018.

17.5. QUESTIONANSWERINGANDMACHINEREADING 427 The attention vector is computed as a softmax over a vector of bilinear products, and
the expected representation is computed by summing over attention values,
α ̃ =(u(q))⊤W h(p) mam
α =SoftMax(α ̃)
[17.26] [17.27]
[17.28]
o=
α h(p).
􏰁M
mm
m=1
Each candidate answer c is represented by a vector xc. Assuming the candidate answers are spans from the original text, these vectors can be set equal to the corresponding ele- ment in h(p). The score for each candidate answer a is computed by the inner product,
cˆ = argmax o · xc. [17.29] c
This architecture can be trained end-to-end from a loss based on the log-likelihood of the correct answer. A number of related architectures have been proposed (e.g., Hermann et al., 2015; Kadlec et al., 2016; Dhingra et al., 2017; Cui et al., 2017), and these methods are surveyed by Wang et al. (2017).
Additional resources
The field of information extraction is surveyed in course notes by Grishman (2012), and more recently in a short survey paper (Grishman, 2015). Shen et al. (2015) survey the task of entity linking, and Ji and Grishman (2011) survey work on knowledge base popula- tion. This chapter’s discussion of non-propositional meaning was strongly influenced by Morante and Sporleder (2012), who introduced a special issue of the journal Computational Linguistics dedicated to recent work on modality and negation.
Exercises
1. Go to the Wikipedia page for your favorite movie. For each record in the info box (e.g., Screenplay by: Stanley Kubrick), report whether there is a sentence in the ar- ticle containing both the field and value (e.g., The screenplay was written by Stanley Kubrick). If not, is there is a sentence in the article containing just the value? (For records with more than one value, just use the first value.)
2. Building on your answer in the previous question, report the dependency path be- tween the head words of the field and value for at least three records.
3. Consider the following heuristic for entity linking:
Under contract with MIT Press, shared under CC-BY-NC-ND license.

428
CHAPTER17. INFORMATIONEXTRACTION
4.
• Among all entities that have the same type as the mention (e.g., LOC, PER), choose the one whose name has the lowest edit distance from the mention.
• If more than one entity has the right type and the lowest edit distance from the mention, choose the most popular one.
• If no candidate entity has the right type, choose NIL. Now suppose you have the following feature function:
f(y,x) = [edit-dist(name(y),x),same-type(y,x),popularity(y),δ(y = NIL)]
Design a set of ranking weights θ that match the heuristic. You may assume that edit distance and popularity are always in the range [0, 100], and that the NIL entity has values of zero for all features except δ (y = NIL).
Now consider another heuristic:
• Among all candidate entities that have edit distance zero from the mention, and are the right type, choose the most popular one.
• If no entity has edit distance zero from the mention, choose the one with the right type that is most popular, regardless of edit distance.
• If no entity has the right type, choose NIL.
Using the same features and assumptions from the previous problem, prove that there is no set of weights that could implement this heuristic. Then show that the heuristic can be implemented by adding a single feature. Your new feature should consider only the edit distance.
Download the Reuters corpus in NLTK, and iterate over the tokens in the corpus:
import nltk nltk.corpus.download(’reuters’) from nltk.corpus import reuters for word in reuters.words():
#your code here
a) Apply the pattern , such as to obtain candidates for the IS-A relation, e.g. IS-A(ROMANIA, COUNTRY). What are three pairs that this method identi- fies correctly? What are three different pairs that it gets wrong?
b) Design a pattern for the PRESIDENT relation, e.g. PRESIDENT(PHILIPPINES, CORAZON AQUINO In this case, you may want to augment your pattern matcher with the ability
to match multiple token wildcards, perhaps using case information to detect
proper names. Again, list three correct
Jacob Eisenstein. Draft of October 15, 2018.
5.

17.5. QUESTIONANSWERINGANDMACHINEREADING 429
c) Preprocess the Reuters data by running a named entity recognizer, replacing tokens with named entity spans when applicable — e.g., your pattern can now match on the United States if the NER system tags it. Apply your PRESIDENT matcher to this preprocessed data. Does the accuracy improve? Compare 20 randomly-selected pairs from this pattern and the one you designed in the pre- vious part.
6. Using the same NLTK Reuters corpus, apply distant supervision to build a training set for detecting the relation between nations and their capitals. Start with the fol- lowing known relations: (JAPAN, TOKYO), (FRANCE, PARIS), (ITALY, ROME). How many positive and negative examples are you able to extract?
7. Represent the dependency path x(i) as a sequence of words and dependency arcs of length Mi, ignoring the endpoints of the path. In example 1 of Table 17.2, the dependency path is,
x(1) = ( ← , traveled, → ) [17.30] NSUBJ OBL
If x(i) is a word, then let pos(x(i)) be its part-of-speech, using the tagset defined in mm
chapter 8.
We can define the following kernel function over pairs of dependency paths (Bunescu and Mooney, 2005):
(i) (j) 􏰅0, Mi ̸= Mj κ(x ,x ) =􏰧Mi c(x(i),x(j)), M = M
m=1 m m i j 2, x(i) = x(j)
mm
c(x(i), x(j)) = 1, x(i) ̸= x(j) and pos(x(i)) = pos(x(j)) mmmmmm
0, otherwise.
Using this kernel function, compute the kernel similarities of example 1 from Ta-
ble 17.2 with the other five examples.
8. Continuing from the previous problem, suppose that the instances have the follow- ing labels:
y2 = 1,y3 = −1,y4 = −1,y5 = 1,y6 = 1 [17.31]
Equation 17.13 defines a kernel-based classification in terms of parameters α and b. Using the above labels for y2, . . . , y6, identify the values of α and b under which yˆ1 = 1. Remember the constraint that αi ≥ 0 for all i.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

430 9.
CHAPTER17. INFORMATIONEXTRACTION
Consider the neural QA system described in section 17.5.2, but restrict the set of candidate answers to words in the passage, and set each candidate answer em-
bedding x equal to the vector h(p), representing token m in the passage, so that m
mˆ = argmax o · h(p). Suppose the system selects answer mˆ , but the correct answer mm
is m∗. Consider the gradient of the margin loss with respect to the attention: a)Provethat ∂l ≥ ∂l .
∂ α mˆ ∂ α m ∗
b) Assuming that ||hmˆ || = ||hm∗ ||, prove that ∂l ≥ 0 and ∂l ≤ 0. Explain in
∂ α mˆ ∂ α m ∗
words what this means about how the attention is expected to change after a
gradient-based update.
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 18
Machine translation
Machine translation (MT) is one of the “holy grail” problems in artificial intelligence, with the potential to transform society by facilitating communication between people anywhere in the world. As a result, MT has received significant attention and funding since the early 1950s. However, it has proved remarkably challenging, and while there has been substantial progress towards usable MT systems — especially for high-resource language pairs like English-French — we are still far from translation systems that match the nuance and depth of human translations.
18.1 Machine translation as a task
Machine translation can be formulated as an optimization problem:
wˆ(t) = argmax Ψ(w(s), w(t)), [18.1]
w(t)
where w(s) is a sentence in a source language, w(t) is a sentence in the target language, and Ψ is a scoring function. As usual, this formalism requires two components: a decod- ing algorithm for computing wˆ(t), and a learning algorithm for estimating the parameters of the scoring function Ψ.
Decoding is difficult for machine translation because of the huge space of possible translations. We have faced large label spaces before: for example, in sequence labeling, the set of possible label sequences is exponential in the length of the input. In these cases, it was possible to search the space quickly by introducing locality assumptions: for ex- ample, that each tag depends only on its predecessor, or that each production depends only on its parent. In machine translation, no such locality assumptions seem possible: human translators reword, reorder, and rearrange words; they replace single words with multi-word phrases, and vice versa. This flexibility means that in even relatively simple
431

432
CHAPTER18. MACHINETRANSLATION
interlingua semantics
syntax text
source
target
Figure 18.1: The Vauquois Pyramid
translation models, decoding is NP-hard (Knight, 1999). Approaches for dealing with this complexity are described in section 18.4.
Estimating translation models is difficult as well. Labeled translation data usually comes in the form parallel sentences, e.g.,
w(s) =A Vinay le gusta las manzanas. w(t) =Vinay likes apples.
A useful feature function would note the translation pairs (gusta, likes), (manzanas, apples), and even (Vinay, Vinay). But this word-to-word alignment is not given in the data. One solution is to treat this alignment as a latent variable; this is the approach taken by clas- sical statistical machine translation (SMT) systems, described in section 18.2. Another solution is to model the relationship between w(t) and w(s) through a more complex and expressive function; this is the approach taken by neural machine translation (NMT) sys- tems, described in section 18.3.
The Vauquois Pyramid is a theory of how translation should be done. At the lowest level, the translation system operates on individual words, but the horizontal distance at this level is large, because languages express ideas differently. If we can move up the triangle to syntactic structure, the distance for translation is reduced; we then need only produce target-language text from the syntactic representation, which can be as simple as reading off a tree. Further up the triangle lies semantics; translating between semantic representations should be easier still, but mapping between semantics and surface text is a difficult, unsolved problem. At the top of the triangle is interlingua, a semantic represen- tation that is so generic that it is identical across all human languages. Philosophers de- bate whether such a thing as interlingua is really possible (e.g., Derrida, 1985). While the first-order logic representations discussed in chapter 12 might be thought to be language independent, they are built on an inventory of predicates that are suspiciously similar to English words (Nirenburg and Wilks, 2001). Nonetheless, the idea of linking translation
Jacob Eisenstein. Draft of October 15, 2018.

18.1. MACHINETRANSLATIONASATASK 433 Adequate? Fluent?
To Vinay it like Python Vinay debugs memory leaks Vinay likes Python
yes no no yes yes yes
Table 18.1: Adequacy and fluency for translations of the Spanish sentence A Vinay le gusta Python.
and semantic understanding may still be a promising path, if the resulting translations better preserve the meaning of the original text.
18.1.1 Evaluating translations
There are two main criteria for a translation, summarized in Table 18.1.
• Adequacy: The translation w(t) should adequately reflect the linguistic content of w(s). For example, if w(s) = A Vinay le gusta Python, the reference translation is w(t) = Vinay likes Python. However, the gloss, or word-for-word translation w(t) = To Vinay it like Python is also considered adequate because it contains all the relevant content. The output w(t) = Vinay debugs memory leaks is not adequate.
• Fluency: The translation w(t) should read like fluent text in the target language. By this criterion, the gloss w(t) = To Vinay it like Python will score poorly, and w(t) = Vinay debugs memory leaks will be preferred.
Automated evaluations of machine translations typically merge both of these criteria, by comparing the system translation with one or more reference translations, produced by professional human translators. The most popular quantitative metric is BLEU (bilin- gual evaluation understudy; Papineni et al., 2002), which is based on n-gram precision: what fraction of n-grams in the system translation appear in the reference? Specifically, for each n-gram length, the precision is defined as,
pn = number of n-grams appearing in both reference and hypothesis translations . number of n-grams appearing in the hypothesis translation
[18.2] The n-gram precisions for three hypothesis translations are shown in Figure 18.2.
The BLEU score is then based on the average, exp 1 􏰀N log pn. Two modifications N n=1
of Equation 18.2 are necessary: (1) to avoid computing log 0, all precisions are smoothed to ensure that they are positive; (2) each n-gram in the reference can be used at most once, so that to to to to to to does not achieve p1 = 1 against the reference to be or not to be. Furthermore, precision-based metrics are biased in favor of short translations, which
Under contract with MIT Press, shared under CC-BY-NC-ND license.

434 CHAPTER18. MACHINETRANSLATION
Translation
Reference Vinay likes programming in Python
Sys1 To Vinay it like to program Python
Sys2 Vinay likes Python
Sys3 Vinay likes programming in his pajamas
p1 p2 p3 p4 BP BLEU 20001.21
3100.51.33 32
43211.76 6543
7
Figure 18.2: A reference translation and three system outputs. For each output, pn indi- cates the precision at each n-gram, and BP indicates the brevity penalty.
can achieve high scores by minimizing the denominator in [18.2]. To avoid this issue, a brevity penalty is applied to translations that are shorter than the reference. This penalty is indicated as “BP” in Figure 18.2.
Automated metrics like BLEU have been validated by correlation with human judg- ments of translation quality. Nonetheless, it is not difficult to construct examples in which the BLEU score is high, yet the translation is disfluent or carries a completely different meaning from the original. To give just one example, consider the problem of translating pronouns. Because pronouns refer to specific entities, a single incorrect pronoun can oblit- erate the semantics of the original sentence. Existing state-of-the-art systems generally do not attempt the reasoning necessary to correctly resolve pronominal anaphora (Hard- meier, 2012). Despite the importance of pronouns for semantics, they have a marginal impact on BLEU, which may help to explain why existing systems do not make a greater effort to translate them correctly.
Fairness and bias The problem of pronoun translation intersects with issues of fairness and bias. In many languages, such as Turkish, the third person singular pronoun is gender neutral. Today’s state-of-the-art systems produce the following Turkish-English transla- tions (Caliskan et al., 2017):
(18.1) O bir doktor. Heis adoctor.
(18.2) O bir hems ̧ire. Sheis anurse.
The same problem arises for other professions that have stereotypical genders, such as engineers, soldiers, and teachers, and for other languages that have gender-neutral pro- nouns. This bias was not directly programmed into the translation model; it arises from statistical tendencies in existing datasets. This highlights a general problem with data- driven approaches, which can perpetuate biases that negatively impact disadvantaged
Jacob Eisenstein. Draft of October 15, 2018.

18.1. MACHINETRANSLATIONASATASK 435
groups. Worse, machine learning can amplify biases in data (Bolukbasi et al., 2016): if a dataset has even a slight tendency towards men as doctors, the resulting translation model may produce translations in which doctors are always he, and nurses are always she.
Other metrics A range of other automated metrics have been proposed for machine translation. One potential weakness of B L E U is that it only measures precision; M E – TEOR is a weighted F-MEASURE, which is a combination of recall and precision (see subsection 4.4.1). Translation Error Rate (TER) computes the string edit distance (see section 9.1.4) between the reference and the hypothesis (Snover et al., 2006). For language pairs like English and Japanese, there are substantial differences in word order, and word order errors are not sufficiently captured by n-gram based metrics. The RIBES metric ap- plies rank correlation to measure the similarity in word order between the system and reference translations (Isozaki et al., 2010).
18.1.2 Data
Data-driven approaches to machine translation rely primarily on parallel corpora, which are translations at the sentence level. Early work focused on government records, in which fine-grained official translations are often required. For example, the IBM translation sys- tems were based on the proceedings of the Canadian Parliament, called Hansards, which are recorded in English and French (Brown et al., 1990). The growth of the European Union led to the development of the EuroParl corpus, which spans 21 European lan- guages (Koehn, 2005). While these datasets helped to launch the field of machine transla- tion, they are restricted to narrow domains and a formal speaking style, limiting their ap- plicability to other types of text. As more resources are committed to machine translation, new translation datasets have been commissioned. This has broadened the scope of avail- able data to news,1 movie subtitles,2 social media (Ling et al., 2013), dialogues (Fordyce, 2007), TED talks (Paul et al., 2010), and scientific research articles (Nakazawa et al., 2016).
Despite this growing set of resources, the main bottleneck in machine translation data is the need for parallel corpora that are aligned at the sentence level. Many languages have sizable parallel corpora with some high-resource language, but not with each other. The high-resource language can then be used as a “pivot” or “bridge” (Boitet, 1988; Utiyama and Isahara, 2007): for example, De Gispert and Marino (2006) use Spanish as a bridge for translation between Catalan and English. For most of the 6000 languages spoken today, the only source of translation data remains the Judeo-Christian Bible (Resnik et al., 1999). While relatively small, at less than a million tokens, the Bible has been translated into more than 2000 languages, far outpacing any other corpus. Some research has explored
1 https://catalog.ldc.upenn.edu/LDC2010T10, http://www.statmt.org/wmt15/ translation- task.html
2 http://opus.nlpl.eu/
Under contract with MIT Press, shared under CC-BY-NC-ND license.

436 CHAPTER18. MACHINETRANSLATION
the possibility of automatically identifying parallel sentence pairs from unaligned parallel texts, such as web pages and Wikipedia articles (Kilgarriff and Grefenstette, 2003; Resnik and Smith, 2003; Adafre and De Rijke, 2006). Another approach is to create large parallel corpora through crowdsourcing (Zaidan and Callison-Burch, 2011).
18.2 Statistical machine translation
The previous section introduced adequacy and fluency as the two main criteria for ma- chine translation. A natural modeling approach is to represent them with separate scores,
Ψ(w(s), w(t)) = ΨA(w(s), w(t)) + ΨF (w(t)). [18.3]
The fluency score ΨF need not even consider the source sentence; it only judges w(t) on whether it is fluent in the target language. This decomposition is advantageous because it makes it possible to estimate the two scoring functions on separate data. While the adequacy model must be estimated from aligned sentences — which are relatively expen- sive and rare — the fluency model can be estimated from monolingual text in the target language. Large monolingual corpora are now available in many languages, thanks to resources such as Wikipedia.
An elegant justification of the decomposition in Equation 18.3 is provided by the noisy channel model, in which each scoring function is a log probability:
ΨA(w(s), w(t)) 􏱕 log pS|T (w(s) | w(t)) ΨF (w(t)) 􏱕 log pT (w(t))
Ψ(w(s), w(t)) = log pS|T (w(s) | w(t)) + log pT (w(t)) = log pS,T (w(s), w(t)).
By setting the scoring functions equal to the logarithms of the prior and likelihood, their
sum is equal to log pS,T , which is the logarithm of the joint probability of the source and
target. The sentence wˆ(t) that maximizes this joint probability is also the maximizer of the conditional probability pT|S, making it the most likely target language sentence, condi- tioned on the source.
The noisy channel model can be justified by a generative story. The target text is orig- inally generated from a probability model pT . It is then encoded in a “noisy channel” pS|T , which converts it to a string in the source language. In decoding, we apply Bayes’
rule to recover the string w(t) that is maximally likely under the conditional probability pT|S. Under this interpretation, the target probability pT is just a language model, and can be estimated using any of the techniques from chapter 6. The only remaining learning problem is to estimate the translation model pS|T .
Jacob Eisenstein. Draft of October 15, 2018.
[18.4] [18.5] [18.6]

18.2. STATISTICALMACHINETRANSLATION 437
Vinay likes python
Figure 18.3: An example word-to-word alignment
18.2.1 Statistical translation modeling
The simplest decomposition of the translation model is word-to-word: each word in the source should be aligned to a word in the translation. This approach presupposes an alignment A(w(s),w(t)), which contains a list of pairs of source and target tokens. For example, given w(s) = A Vinay le gusta Python and w(t) = Vinay likes Python, one possible word-to-word alignment is,
A(w(s), w(t)) = {(A, ∅), (Vinay, Vinay), (le, likes), (gusta, likes), (Python,Python)}. [18.7] This alignment is shown in Figure 18.3. Another, less promising, alignment is:
A(w(s), w(t)) = {(A, Vinay), (Vinay, likes), (le, Python), (gusta, ∅), (Python, ∅)}. [18.8]
Each alignment contains exactly one tuple for each word in the source, which serves to explain how the source word could be translated from the target, as required by the trans- lation probability pS|T . If no appropriate word in the target can be identified for a source word, it is aligned to ∅ — as is the case for the Spanish function word a in the example, which glosses to the English word to. Words in the target can align with multiple words in the source, so that the target word likes can align to both le and gusta in the source.
The joint probability of the alignment and the translation can be defined conveniently as,
􏰙(s) M
p(w(s), A | w(t)) = =
This probability model makes two key assumptions:
Under contract with MIT Press, shared under CC-BY-NC-ND license.
m=1 􏰙(s)
p(w(s), a | w(t) , m, M(s), M(t)) m m am
[18.9] [18.10]
M m=1
p(a
|m,M(s),M(t))×p(w(s) |w(t)). m mam
A Vinay
le gusta
python

438
CHAPTER18. MACHINETRANSLATION
•
•
The alignment probability factors across tokens,
􏰙(s) M
p(A | w(s), w(t)) =
This means that each alignment decision is independent of the others, and depends
m=1
only on the index m, and the sentence lengths M(s) and M(t). The translation probability also factors across tokens,
p(w(s) | w(t), A) =
􏰙(s) M
m=1
p(w(s) | w(t) ), m am
[18.12]
so that each word in w(s) depends only on its aligned word in w(t). This means that translation is word-to-word, ignoring context. The hope is that the target language model p(w(t)) will correct any disfluencies that arise from word-to-word translation.
To translate with such a model, we could sum or max over all possible alignments,
p(w(s), w(t)) = 􏰁 p(w(s), w(t), A)
[18.13] [18.14] [18.15]
The term p(A) defines the prior probability over alignments. A series of alignment models with increasingly relaxed independence assumptions was developed by researchers at IBM in the 1980s and 1990s, known as IBM Models 1-6 (Och and Ney, 2003). IBM Model 1 makes the strongest independence assumption:
p(am | m, M(s), M(t)) = 1 . [18.16] M (t)
In this model, every alignment is equally likely. This is almost surely wrong, but it re- sults in a convex learning objective, yielding a good initialization for the more complex alignment models (Brown et al., 1993; Koehn, 2009).
18.2.2 Estimation
Let us define the parameter θu→v as the probability of translating target word u to source word v. If word-to-word alignments were annotated, these probabilities could be com- puted from relative frequencies,
θˆu→v = count(u,v), [18.17] count(u)
Jacob Eisenstein. Draft of October 15, 2018.
A􏰁 =p(w(t))
p(am | m, M(s), M(t)). [18.11]
p(A) × p(w(s) | w(t), A) ≥p(w(t)) max p(A) × p(w(s) | w(t), A).
A A

18.2. STATISTICALMACHINETRANSLATION 439
where count(u,v) is the count of instances in which word v was aligned to word u in the training set, and count(u) is the total count of the target word u. The smoothing techniques mentioned in chapter 6 can help to reduce the variance of these probability estimates.
Conversely, if we had an accurate translation model, we could estimate the likelihood of each alignment decision,
q (a |w(s),w(t))∝p(a |m,M(s),M(t))×p(w(s) |w(t)), [18.18] mm m mam
where q (a | w(s),w(t)) is a measure of our confidence in aligning source word w(s) mmm
to target word w(t) . The relative frequencies could then be computed from the expected am
[18.19] [18.20]
The expectation-maximization (EM) algorithm proceeds by iteratively updating qm and Θˆ . The algorithm is described in general form in chapter 5. For statistical machine translation, the steps of the algorithm are:
1. E-step: Update beliefs about word alignment using Equation 18.18.
2. M-step: Update the translation model using Equations 18.19 and 18.20.
As discussed in chapter 5, the expectation maximization algorithm is guaranteed to con- verge, but not to a global optimum. However, for IBM Model 1, it can be shown that EM optimizes a convex objective, and global optimality is guaranteed. For this reason, IBM Model 1 is often used as an initialization for more complex alignment models. For more detail, see Koehn (2009).
18.2.3 Phrase-based translation
Real translations are not word-to-word substitutions. One reason is that many multiword expressions are not translated literally, as shown in this example from French:
(18.3) Nous allons prendre un verre We will take a glass
We’ll have a drink
Under contract with MIT Press, shared under CC-BY-NC-ND license.
counts,
􏰁count(u)
θˆu→v =Eq [count(u,v)]
E [count(u, v)] = q (a | w(s), w(t)) × δ(w(s) = v) × δ(w(t) = u).
q mm m am m

440
CHAPTER18. MACHINETRANSLATION
We’ll have a
drink
Figure 18.4: A phrase-based alignment between French and English, corresponding to example (18.3)
The line we will take a glass is the word-for-word gloss of the French sentence; the transla- tion we’ll have a drink is shown on the third line. Such examples are difficult for word-to- word translation models, since they require translating prendre to have and verre to drink. These translations are only correct in the context of these specific phrases.
Phrase-based translation generalizes on word-based models by building translation tables and alignments between multiword spans. (These “phrases” are not necessarily syntactic constituents like the noun phrases and verb phrases described in chapters 9 and 10.) The generalization from word-based translation is surprisingly straightforward: the translation tables can now condition on multi-word units, and can assign probabilities to multi-word units; alignments are mappings from spans to spans, ((i, j), (k, l)), so that
p(w(s) | w(t),A) = 􏰙 p (s) (t)({w(s) ,w(s) ,…,w(s)} | {w(t) ,w(t) ,…,w(t)}). w |w i+1 i+2 j k+1 k+2 l
((i,j),(k,l))∈A
[18.21]
The phrase alignment ((i, j), (k, l)) indicates that the span w(s) is the translation of the i+1:j
span w(t) . An example phrasal alignment is shown in Figure 18.4. Note that the align- k+1:l
ment set A is required to cover all of the tokens in the source, just as in word-based trans- lation.Theprobabilitymodelpw(s)|w(t) mustnowincludetranslationsforallphrasepairs, which can be learned from expectation-maximization just as in word-based statistical ma- chine translation.
Jacob Eisenstein. Draft of October 15, 2018.
Nous allons
prendre une
verre

18.2. STATISTICALMACHINETRANSLATION 441 18.2.4 *Syntax-based translation
The Vauquois Pyramid (Figure 18.1) suggests that translation might be easier if we take a higher-level view. One possibility is to incorporate the syntactic structure of the source, the target, or both. This is particularly promising for language pairs that consistent syn- tactic differences. For example, English adjectives almost always precede the nouns that they modify, while in Romance languages such as French and Spanish, the adjective often follows the noun: thus, angry fish would translate to pez (fish) enojado (angry) in Spanish. In word-to-word translation, these reorderings cause the alignment model to be overly permissive. It is not that the order of any pair of English words can be reversed when translating into Spanish, but only adjectives and nouns within a noun phrase. Similar issues arise when translating between verb-final languages such as Japanese (in which verbs usually follow the subject and object), verb-initial languages like Tagalog and clas- sical Arabic, and verb-medial languages such as English.
An elegant solution is to link parsing and translation in a synchronous context-free grammar (SCFG; Chiang, 2007).3 An SCFG is a set of productions of the form X → (α, β, ∼), where X is a non-terminal, α and β are sequences of terminals or non-terminals, and ∼ is a one-to-one alignment of items in α with items in β. English-Spanish adjective-noun ordering can be handled by a set of synchronous productions, e.g.,
NP → (DET1 NN2 JJ3, DET1 JJ3 NN2), [18.22] with subscripts indicating the alignment between the Spanish (left) and English (right)
parts of the right-hand side. Terminal productions yield translation pairs,
JJ → (enojado1, angry1). [18.23]
A synchronous derivation begins with the start symbol S, and derives a pair of sequences of terminal symbols.
Given an SCFG in which each production yields at most two symbols in each language (Chomsky Normal Form; see section 9.2.1), a sentence can be parsed using only the CKY algorithm (chapter 10). The resulting derivation also includes productions in the other language, all the way down to the surface form. Therefore, SCFGs make translation very similartoparsing.InaweightedSCFG,thelogprobabilitylogpS|T canbecomputedfrom the sum of the log-probabilities of the productions. However, combining SCFGs with a target language model is computationally expensive, necessitating approximate search algorithms (Huang and Chiang, 2007).
Synchronous context-free grammars are an example of tree-to-tree translation, be- cause they model the syntactic structure of both the target and source language. In string- to-tree translation, string elements are translated into constituent tree fragments, which
3Earlier approaches to syntactic machine translation includes syntax-driven transduction (Lewis II and Stearns, 1968) and stochastic inversion transduction grammars (Wu, 1997).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

442 CHAPTER18. MACHINETRANSLATION
are then assembled into a translation (Yamada and Knight, 2001; Galley et al., 2004); in tree-to-string translation, the source side is parsed, and then transformed into a string on the target side (Liu et al., 2006). A key question for syntax-based translation is the extent to which we phrasal constituents align across translations (Fox, 2002), because this gov- erns the extent to which we can rely on monolingual parsers and treebanks. For more on syntax-based machine translation, see the monograph by Williams et al. (2016).
18.3 Neural machine translation
Neural network models for machine translation are based on the encoder-decoder archi- tecture (Cho et al., 2014). The encoder network converts the source language sentence into a vector or matrix representation; the decoder network then converts the encoding into a sentence in the target language.
z =ENCODE(w(s)) [18.24] w(t) | w(s) ∼DECODE(z), [18.25]
where the second line means that the function DECODE(z) defines the conditional proba- bility p(w(t) | w(s)).
The decoder is typically a recurrent neural network, which generates the target lan- guage sentence one word at a time, while recurrently updating a hidden state. The en- coder and decoder networks are trained end-to-end from parallel sentences. If the output layer of the decoder is a logistic function, then the entire architecture can be trained to maximize the conditional log-likelihood,
log p(w(t) | w(s)) =
p(w(t) | w(t) , w(s)) ∝ exp β (t) · h(t)
[18.26]
m=1􏰛 􏰜 m 1:m−1 wm m−1
[18.27] where the hidden state h(t) is a recurrent function of the previously generated text
m−1
w(t) and the encoding z. The second line is equivalent to writing,
w(t) | w(t) , w(s) ∼ SoftMax 􏰛β · h(t) 􏰜 , [18.28] m 1:m−1 m−1
1:m−1
􏰁(t) M
p(w(t) | w(t) , z) m 1:m−1
where β ∈ R(V (t)×K) is the matrix of output word vectors for the V (t) words in the target language vocabulary.
The simplest encoder-decoder architecture is the sequence-to-sequence model (Sutskever et al., 2014). In this model, the encoder is set to the final hidden state of a long short-term
Jacob Eisenstein. Draft of October 15, 2018.

18.3. NEURALMACHINETRANSLATION
443
h(s,D) h(s,D) h(s,D) m−1 m m+1
… … …
h(s,2) h(s,2) h(s,2) m−1 m m+1
h(s,1) h(s,1) h(s,1) m−1 m m+1
x(s) x(s) x(s) m−1 m m+1
Figure 18.5: A deep bidirectional LSTM encoder memory (LSTM) (see subsection 6.3.3) on the source sentence:
h(s) =LSTM(x(s),h(s) ) m mm−1
z 􏱕h(s) , M (s)
[18.29] [18.30]
where x(s) is the embedding of source language word w(s). The encoding then provides mm
the initial hidden state for the decoder LSTM:
h(t) =z 0
h(t) =LSTM(x(t),h(t) ), m mm−1
where x(t) is the embedding of the target language word w(t). mm
[18.31] [18.32]
Sequence-to-sequence translation is nothing more than wiring together two LSTMs: one to read the source, and another to generate the target. To make the model work well, some additional tweaks are needed:
• Most notably, the model works much better if the source sentence is reversed, read- ing from the end of the sentence back to the beginning. In this way, the words at the beginning of the source have the greatest impact on the encoding z, and there- fore impact the words at the beginning of the target sentence. Later work on more advanced encoding models, such as neural attention (see subsection 18.3.1), has eliminated the need for reversing the source sentence.
• TheencoderanddecodercanbeimplementedasdeepLSTMs,withmultiplelayers of hidden states. As shown in Figure 18.5, each hidden state h(s,i) at layer i is treated
Under contract with MIT Press, shared under CC-BY-NC-ND license.
m

444
CHAPTER18. MACHINETRANSLATION
•
•
Google’s commercial machine translation system used eight layers (Wu et al., 2016).4
Significant improvements can be obtained by creating an ensemble of translation models, each trained from a different random initialization. For an ensemble of size N , the per-token decoding probability is set equal to,
( t ) 1 􏰁N ( t )
p(w(t) | z, w1:m−1) = N pi(w(t) | z, w1:m−1), [18.35]
i=1
where pi is the decoding probability for model i. Each translation model in the
ensemble includes its own encoder and decoder networks.
Theoriginalsequence-to-sequencemodelusedafairlystandardtrainingsetup:stochas- tic gradient descent with an exponentially decreasing learning rate after the first five epochs; mini-batches of 128 sentences, chosen to have similar length so that each sentence on the batch will take roughly the same amount of time to process; gra- dient clipping (see subsection 3.3.4) to ensure that the norm of the gradient never exceeds some predefined value.
as the input to an LSTM at layer i + 1:
h(s,1) =LSTM(x(s),h(s) ) [18.33]
m mm−1
h(s,i+1) =LSTM(h(s,i), h(s,i+1)), ∀i ≥ 1. [18.34]
m mm−1
The original work on sequence-to-sequence translation used four layers; in 2016,
18.3.1 Neural attention
The sequence-to-sequence model discussed in the previous section was a radical depar- ture from statistical machine translation, in which each word or phrase in the target lan- guage is conditioned on a single word or phrase in the source language. Both approaches have advantages. Statistical translation leverages the idea of compositionality — transla- tions of large units should be based on the translations of their component parts — and this seems crucial if we are to scale translation to longer units of text. But the translation of each word or phrase often depends on the larger context, and encoder-decoder models capture this context at the sentence level.
Is it possible for translation to be both contextualized and compositional? One ap- proach is to augment neural translation with an attention mechanism. The idea of neural attention was described in section 17.5, but its application to translation bears further dis- cussion. In general, attention can be thought of as using a query to select from a memory of key-value pairs. However, the query, keys, and values are all vectors, and the entire
4Google reports that this system took six days to train for English-French translation, using 96 NVIDIA K80 GPUs, which would have cost roughly half a million dollars at the time.
Jacob Eisenstein. Draft of October 15, 2018.

18.3. NEURALMACHINETRANSLATION 445 Output
α
Query ψα
Key Value
Figure 18.6: A general view of neural attention. The dotted box indicates that each αm→n can be viewed as a gate on value n.
operation is differentiable. For each key n in the memory, we compute a score ψα(m, n) with respect to the query m. That score is a function of the compatibility of the key and the query, and can be computed using a small feedforward neural network. The vector of scores is passed through an activation function, such as softmax. The output of this activation function is a vector of non-negative numbers [αm→1, αm→2, . . . , αm→N ]⊤, with length N equal to the size of the memory. Each value in the memory vn is multiplied by the attention αm→n; the sum of these scaled values is the output. This process is shown in Figure 18.6. In the extreme case that αm→n = 1 and αm→n′ = 0 for all other n′, then the attention mechanism simply selects the value vn from the memory.
Neural attention makes it possible to integrate alignment into the encoder-decoder ar- chitecture. Rather than encoding the entire source sentence into a fixed length vector z, it can be encoded into a matrix Z ∈ RK×M(S), where K is the dimension of the hidden state, and M(S) is the number of tokens in the source input. Each column of Z represents the state of a recurrent neural network over the source sentence. These vectors are con- structed from a bidirectional LSTM (see section 7.6), which can be a deep network as shown in Figure 18.5. These columns are both the keys and the values in the attention mechanism.
At each step m in decoding, the attentional state is computed by executing a query, which is equal to the state of the decoder, h(t). The resulting compatibility scores are,
ψ (m, n) =v · tanh(Θ [h(t); h(s)]). [18.36] αααmn
The function ψ is thus a two layer feedforward neural network, with weights vα on the output layer, and weights Θα on the input layer. To convert these scores into atten- tion weights, we apply an activation function, which can be vector-wise softmax or an element-wise sigmoid:
Softmax attention
activa
tion
m
αm→n = 􏰀 expψα(m,n) [18.37] M(s) ′
n′=1 exp ψα(m, n )
Under contract with MIT Press, shared under CC-BY-NC-ND license.

446 CHAPTER18. MACHINETRANSLATION Sigmoid attention
αm→n = σ (ψα(m, n)) [18.38] The attention α is then used to compute a context vector cm by taking a weighted
average over the columns of Z,
􏰁(s)
cm =
where αm→n ∈ [0, 1] is the amount of attention from word m of the target to word n of the source. The context vector can be incorporated into the decoder’s word output probability model, by adding another layer to the decoder (Luong et al., 2015):
h ̃(t) =tanh􏰛Θ [h(t);c ]􏰜 [18.40] m cmm
p(w(t) |w(t) ,w(s))∝exp􏰑β (t) ·h ̃(t)􏰒. [18.41] m+1 1:m wm+1 m
Here the decoder state h(t) is concatenated with the context vector, forming the input m
to compute a final output vector h ̃(t). The context vector can be incorporated into the m
decoder recurrence in a similar manner (Bahdanau et al., 2014).
18.3.2 *Neural machine translation without recurrence
In the encoder-decoder model, attention’s “keys and values” are the hidden state repre- sentations in the encoder network, z, and the “queries” are state representations in the decoder network h(t). It is also possible to completely eliminate recurrence from neural translation, by applying self-attention (Lin et al., 2017; Kim et al., 2017) within the en- coder and decoder, as in the transformer architecture (Vaswani et al., 2017). For level i, the basic equations of the encoder side of the transformer are:
􏰁(s) M
z(i) =
m m→nvn
M n=1
αm→nzn, [18.39]
α(i) (Θ h(i−1)) n=1 􏰛 􏰜
[18.42]
[18.43] For each token m at level i, we compute self-attention over the entire source sentence.
h(i) =Θ ReLU Θ z(i) +b +b . m21m12
The keys, values, and queries are all projections of the vector h(i−1): for example, in Equa- tion 18.42, the value v is the projection Θ h(i−1). The attention scores α(i) are com-
n vn m→n puted using a scaled form of softmax attention,
αm→n ∝ exp(ψα(m,n)/M), [18.44] Jacob Eisenstein. Draft of October 15, 2018.

18.3. NEURALMACHINETRANSLATION
447
z(i) (i)
αm→ (i)
v
k
m−1 m m+1
Figure 18.7: The transformer encoder’s computation of z(i) from h(i−1). The key, value,
and query are shown for token m − 1. For example, ψ(i)(m, m − 1) is computed from α
the key Θ h(i−1) and the query Θ h(i−1), and the gate α(i) operates on the value k m−1 q m m→m−1
Θvh(i−1). The figure shows a minimal version of the architecture, with a single atten- m−1
tion head. With multiple heads, it is possible to attend to different properties of multiple words.
where M is the length of the input. This encourages the attention to be more evenly dispersed across the input. Self-attention is applied across multiple “heads”, each using different projections of h(i−1) to form the keys, values, and queries. This architecture is
shown in Figure 18.7. The output of the self-attentional layer is the representation z(i), m
which is then passed through a two-layer feed-forward network, yielding the input to the next layer, h(i). This self-attentional architecture can be applied in the decoder as well, but this requires that there is zero attention to future words: αm→n = 0 for all n > m.
To ensure that information about word order in the source is integrated into the model, the encoder augments the base layer of the network with positional encodings of the indices of each word in the source. These encodings are vectors for each position m ∈ {1, 2, . . . , M }. The transformer sets these encodings equal to a set of sinusoidal functions of m,
2i
e2i−1(m) =sin(m/(10000Ke )) [18.45]
2i
e2i(m) =cos(m/(10000Ke )), ∀i ∈ {1,2,…,Ke/2} [18.46]
where e2i(m) is the value at element 2i of the encoding for index m. As we progress through the encoding, the sinusoidal functions have progressively narrower bandwidths. This enables the model to learn to attend by relative positions of words. The positional encodings are concatenated with the word embeddings xm at the base layer of the model.5
5The transformer architecture relies on several additional tricks, including layer normalization (see Under contract with MIT Press, shared under CC-BY-NC-ND license.
ψα (m,·) h(i−1)
q
m

448 CHAPTER18. MACHINETRANSLATION Source: The ecotax portico in Pont-de-buis was taken down on Thursday morning
Reference: Le portique ́ecotaxe de Pont-de-buis a ́et ́e d ́emont ́e jeudi matin
System: Le unk de unk `a unk a ́et ́e pris le jeudi matin
Figure 18.8: Translation with unknown words. The system outputs unk to indicate words that are outside its vocabulary. Figure adapted from Luong et al. (2015).
Convolutional neural networks (see section 3.4) have also been applied as encoders in neural machine translation (Gehring et al., 2017). For each word w(s), a convolutional
network computes a representation h(s) from the embeddings of the word and its neigh- m
bors. This procedure is applied several times, creating a deep convolutional network. The recurrent decoder then computes a set of attention weights over these convolutional rep- resentations, using the decoder’s hidden state h(t) as the queries. This attention vector is used to compute a weighted average over the outputs of another convolutional neural network of the source, yielding an averaged representation cm, which is then fed into the decoder. As with the transformer, speed is the main advantage over recurrent encoding models; another similarity is that word order information is approximated through the use of positional encodings.6
18.3.3 Out-of-vocabulary words
Thus far, we have treated translation as a problem at the level of words or phrases. For words that do not appear in the training data, all such models will struggle. There are two main reasons for the presence of out-of-vocabulary (OOV) words:
• New proper nouns, such as family names or organizations, are constantly arising — particularly in the news domain. The same is true, to a lesser extent, for technical terminology. This issue is shown in Figure 18.8.
• In many languages, words have complex internal structure, known as morphology. An example is German, which uses compounding to form nouns like Abwasserbe- handlungsanlage (sewage water treatment plant; example from Sennrich et al. (2016)).
subsection 3.3.4), residual connections around the nonlinear activations (see subsection 3.2.2), and a non- monotonic learning rate schedule.
6A recent evaluation found that best performance was obtained by using a recurrent network for the decoder, and a transformer for the encoder (Chen et al., 2018). The transformer was also found to significantly outperform a convolutional neural network.
Jacob Eisenstein. Draft of October 15, 2018.
m

18.4. DECODING 449
While compounds could in principle be addressed by better tokenization (see sec- tion 8.4), other morphological processes involve more complex transformations of subword units.
Names and technical terms can be handled in a postprocessing step: after first identi- fying alignments between unknown words in the source and target, we can look up each aligned source word in a dictionary, and choose a replacement (Luong et al., 2015). If the word does not appear in the dictionary, it is likely to be a proper noun, and can be copied directly from the source to the target. This approach can also be integrated directly into the translation model, rather than applying it as a postprocessing step (Jean et al., 2015).
Words with complex internal structure can be handled by translating subword units rather than entire words. A popular technique for identifying subword units is byte-pair encoding (BPE; Gage, 1994; Sennrich et al., 2016). The initial vocabulary is defined as the set of characters used in the text. The most common character bigram is then merged into a new symbol, the vocabulary is updated, and the merging operation is applied again. For example, given the dictionary {fish, fished, want, wanted, bike, biked}, we would first form the subword unit ed, since this character bigram appears in three of the six words. Next, there are several bigrams that each appear in a pair of words: fi, is, sh, wa, an, etc. These can be merged in any order. By iterating this process, we eventually reach the segmentation, {fish, fish+ed, want, want+ed, bik+e, bik+ed}. At this point, there are no bigrams that appear more than once. In real data, merging is performed until the number of subword units reaches some predefined threshold, such as 104.
Each subword unit is treated as a token for translation, in both the encoder (source side) and decoder (target side). BPE can be applied jointly to the union of the source and target vocabularies, identifying subword units that appear in both languages. For lan- guages that have different scripts, such as English and Russian, transliteration between the scripts should be applied first.7
18.4 Decoding
Given a trained translation model, the decoding task is:
wˆ(t) =argmaxΨ(w,w(s)), [18.47]
w∈V∗
where w(t) is a sequence of tokens from the target vocabulary V. It is not possible to
efficiently obtain exact solutions to the decoding problem, for even minimally effective
7Transliteration is crucial for converting names and other foreign words between languages that do not share a single script, such as English and Japanese. It is typically approached using the finite-state methods discussed in chapter 9 (Knight and Graehl, 1998).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

450 CHAPTER18. MACHINETRANSLATION
models in either statistical or neural machine translation. Today’s state-of-the-art trans- lation systems use beam search (see section 11.3.1), which is an incremental decoding al- gorithm that maintains a small constant number of competitive hypotheses. Such greedy approximations are reasonably effective in practice, and this may be in part because the decoding objective is only loosely correlated with measures of translation quality, so that exact optimization of [18.47] may not greatly improve the resulting translations.
Decoding in neural machine translation is simpler than in phrase-based statistical ma- chine translation.8 The scoring function Ψ is defined,
Ψ(w(t), w(s)) =
ψ(w(t);w(t) ,z)=β (t) ·h(t) −log
􏰁(t) M
ψ(w(t); w(t)
m 1:m−1
, z)
m=1 􏰁􏰛􏰜
z and the decoding history w(t) . This formulation subsumes the attentional translation 1:m−1
model, where z is a matrix encoding of the source. Now consider the incremental decoding algorithm,
wˆ(t) =argmaxψ(w;wˆ(t) ,z), m = 1,2,… [18.50]
w∈V
1:m−1 wm m
where z is the encoding of the source sentence w(s), and h(t) is a function of the encoding
m
1:m−1
This algorithm selects the best target language word at position m, assuming that it has
already generated the sequence wˆ(t) . (Termination can be handled by augmenting 1:m−1
the vocabulary V with a special end-of-sequence token, 􏱒.) The incremental algorithm is likely to produce a suboptimal solution to the optimization problem defined in Equa- tion 18.47, because selecting the highest-scoring word at position m can set the decoder on a “garden path,” in which there are no good choices at some later position n > m. We might hope for some dynamic programming solution, as in sequence labeling (sec- tion 7.3). But the Viterbi algorithm and its relatives rely on a Markov decomposition of the objective function into a sum of local scores: for example, scores can consider locally adjacent tags (ym,ym−1), but not the entire tagging history y1:m. This decomposition is
not applicable to recurrent neural networks, because the hidden state h(t) is impacted m
by the entire history w(t) ; this sensitivity to long-range context is precisely what makes 1:m
recurrentneuralnetworkssoeffective.9 Infact,itcanbeshownthatdecodingfromany recurrent neural network is NP-complete (Siegelmann and Sontag, 1995; Chen et al., 2018).
8For more on decoding in phrase-based statistical models, see Koehn (2009).
9Note that this problem does not impact RNN-based sequence labeling models (see section 7.6). This is because the tags produced by these models do not affect the recurrent state.
Jacob Eisenstein. Draft of October 15, 2018.
w∈V
m
exp β ·h(t) , w m
[18.48] [18.49]

18.5. TRAININGTOWARDSTHEEVALUATIONMETRIC 451
Beam search Beam search is a general technique for avoiding search errors when ex- haustive search is impossible; it was first discussed in section 11.3.1. Beam search can be seen as a variant of the incremental decoding algorithm sketched in Equation 18.50, but at each step m, a set of K different hypotheses are kept on the beam. For each hypothesis
k ∈ {1,2,…,K},wecomputeboththecurrentscore􏰀M(t) ψ(w(t) ;w(t) ,z)aswellas m=1 k,m k,1:m−1
the current hidden state h(t). At each step in the beam search, the K top-scoring children k
of each hypothesis currently on the beam are “expanded”, and the beam is updated. For a detailed description of beam search for RNN decoding, see Graves (2012).
Learning and search Conventionally, the learning algorithm is trained to predict the right token in the translation, conditioned on the translation history being correct. But if decoding must be approximate, then we might do better by modifying the learning algorithm to be robust to errors in the translation history. Scheduled sampling does this by training on histories that sometimes come from the ground truth, and sometimes come from the model’s own output (Bengio et al., 2015).10 As training proceeds, the training wheels come off: we increase the fraction of tokens that come from the model rather than the ground truth. Another approach is to train on an objective that relates directly to beam search performance (Wiseman et al., 2016). Reinforcement learning has also been applied to decoding of RNN-based translation models, making it possible to directly optimize translation metrics such as BLEU (Ranzato et al., 2016).
18.5 Training towards the evaluation metric
In likelihood-based training, the objective is the maximize the probability of a parallel corpus. However, translations are not evaluated in terms of likelihood: metrics like BLEU consider only the correctness of a single output translation, and not the range of prob- abilities that the model assigns. It might therefore be better to train translation models to achieve the highest BLEU score possible — to the extent that we believe BLEU mea- sures translation quality. Unfortunately, B L E U and related metrics are not friendly for optimization: they are discontinuous, non-differentiable functions of the parameters of the translation model.
Consider an error function ∆(wˆ(t), w(t)), which measures the discrepancy between the system translation wˆ(t) and the reference translation w(t); this function could be based on BLEU or any other metric on translation quality. One possible criterion would be to select
10Scheduled sampling builds on earlier work on learning to search (Daume ́ III et al., 2009; Ross et al., 2011), which are also described in subsection 15.2.4.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

452 CHAPTER18. MACHINETRANSLATION the parameters θ that minimize the error of the system’s preferred translation,
wˆ(t) =argmaxΨ(w(t),w(s);θ) [18.51] w(t)
θˆ = argmin ∆(wˆ(t), w(s)) [18.52] θ
However, identifying the top-scoring translation wˆ(t) is usually intractable, as described in the previous section. In minimum error-rate training (MERT), wˆ(t) is selected from a set of candidate translations Y(w(s)); this is typically a strict subset of all possible transla- tions, so that it is only possible to optimize an approximation to the true error rate (Och and Ney, 2003).
A further issue is that the objective function in Equation 18.52 is discontinuous and non-differentiable, due to the argmax over translations: an infinitesimal change in the parameters θ could cause another translation to be selected, with a completely different error. To address this issue, we can instead minimize the risk, which is defined as the expected error rate,
R(θ) =Ewˆ(t􏰁)|w(s);θ[∆(wˆ(t), w(t))] [18.53] = p(wˆ(t) | w(s)) × ∆(wˆ(t), w(t)). [18.54]
wˆ (t) ∈Y (w(s) )
Minimum risk training minimizes the sum of R(θ) across all instances in the training set.
The risk can be generalized by exponentiating the translation probabilities,
􏰛􏰁 􏰜α
p ̃(w(t); θ, α) ∝ p(w(t) | w(s); θ) [18.55]
R ̃(θ)= p ̃(wˆ(t) |w(s);α,θ)×∆(wˆ(t),w(t)) [18.56] wˆ (t) ∈Y (w(s) )
where Y(w(s)) is now the set of all possible translations for w(s). Exponentiating the prob- abilities in this way is known as annealing (Smith and Eisner, 2006). When α = 1, then R ̃(θ) = R(θ); when α = ∞, then R ̃(θ) is equivalent to the sum of the errors of the maxi- mum probability translations for each sentence in the dataset.
Clearly the set of candidate translations Y(w(s)) is too large to explicitly sum over. Because the error function ∆ generally does not decompose into smaller parts, there is no efficie􏰀nt dynamic programming solution to sum over this set. We can approximate
the sum (t) (s) with a sum over a finite number of samples, {w(t), w(t), . . . , w(t)}. wˆ∈Y(w) 12 K
If these samples were drawn uniformly at random, then the (annealed) risk would be Jacob Eisenstein. Draft of October 15, 2018.

18.5. TRAININGTOWARDSTHEEVALUATIONMETRIC
453
approximated as (Shen et al., 2016),
̃ 1􏰁K (t) (s) (t) (t) R(θ)≈Z p ̃(wk |w ;θ,α)×∆(wk ,w )
[18.57] [18.58]
Shen et al. (2016) report that performance plateaus at K = 100 for minimum risk training of neural machine translation.
Uniform sampling over the set of all possible translations is undesirable, because most translations have very low probability. A solution from Monte Carlo estimation is impor- tance sampling, in which we draw samples from a proposal distribution q(w(s)). This distribution can be set equal to the current translation model p(w(t) | w(s); θ). Each sam-
k pleisthenweightedbyanimportancescore,ω =p ̃(w(t)|w(s)).Theeffectofthisweighting
k=1
􏰁K (t) (s)
Z= p ̃(wk |w ;θ,α). k=1
k q(w(t);w(s)) k
is to correct for any mismatch between the proposal distribution q and the true distribu- tion p ̃. The risk can then be approximated as,
w(t) ∼q(w(s)) k
[18.59] [18.60]
[18.61]
p ̃(w(t) | w(s)) ωk = k
q(w(t); w(s)) k
̃1􏰁K (t)(t) R(θ)≈􏰀K ω ωk×∆(wk ,w ).
k=1 k k=1
Importance sampling will generally give a more accurate approximation than uniform sampling. The only formal requirement is that the proposal assigns non-zero probability to every w(t) ∈ Y(w(s)). For more on importance sampling and related methods, see Robert and Casella (2013).
Additional resources
A complete textbook on machine translation is available from Koehn (2009). While this book precedes recent work on neural translation, a more recent draft chapter on neural translation models is also available (Koehn, 2017). Neubig (2017) provides a compre- hensive tutorial on neural machine translation, starting from first principles. The course notes from Cho (2015) are also useful. Several neural machine translation libraries are available: LAMTRAM is an implementation of neural machine translation in DYNET (Neu- big et al., 2017); OPENNMT (Klein et al., 2017) and FAIRSEQ are available in PYTORCH;
Under contract with MIT Press, shared under CC-BY-NC-ND license.

454 CHAPTER18. MACHINETRANSLATION TENSOR2TENSOR is an implementation of several of the Google translation models in TEN-
SORFLOW (Abadi et al., 2016).
Literary translation is especially challenging, even for expert human translators. Mes- sud (2014) describes some of these issues in her review of an English translation of L’e ́tranger, the 1942 French novel by Albert Camus.11 She compares the new translation by Sandra Smith against earlier translations by Stuart Gilbert and Matthew Ward, focusing on the difficulties presented by a single word in the first sentence:
Then, too, Smith has reconsidered the book’s famous opening. Camus’s original is deceptively simple: “Aujourd’hui, maman est morte.” Gilbert influ- enced generations by offering us “Mother died today”—inscribing in Meur- sault [the narrator] from the outset a formality that could be construed as heartlessness. But maman, after all, is intimate and affectionate, a child’s name for his mother. Matthew Ward concluded that it was essentially untranslatable (“mom” or “mummy” being not quite apt), and left it in the original French: “Maman died today.” There is a clear logic in this choice; but as Smith has explained, in an interview in The Guardian, maman “didn’t really tell the reader anything about the connotation.” She, instead, has translated the sentence as “My mother died today.”
I chose “My mother” because I thought about how someone would tell another person that his mother had died. Meursault is speaking to the reader directly. “My mother died today” seemed to me the way it would work, and also implied the closeness of “maman” you get in the French.
Elsewhere in the book, she has translated maman as “mama” — again, striving to come as close as possible to an actual, colloquial word that will carry the same connotations as maman does in French.
The passage is a reminder that while the quality of machine translation has improved dramatically in recent years, expert human translations draw on considerations that are beyond the ken of any contemporary computational approach.
Exercises
1. Using Google translate or another online service, translate the following example into two different languages of your choice:
11The book review is currently available online at http://www.nybooks.com/articles/2014/06/ 05/camus-new-letranger/.
Jacob Eisenstein. Draft of October 15, 2018.

18.5. TRAININGTOWARDSTHEEVALUATIONMETRIC 455 (18.4) It is not down on any map; true places never are.
Then translate each result back into English. Which is closer to the original? Can you explain the differences?
2. Compute the unsmoothed n-gram precisions p1 . . . p4 for the two back-translations in the previous problem, using the original source as the reference. Your n-grams should include punctuation, and you should segment conjunctions like it’s into two tokens.
3. You are given the following dataset of translations from “simple” to “difficult” En- glish:
(18.5) a. Kids like cats. Children adore felines.
b. Cats hats. Felines fedoras.
Estimate a word-to-word statistical translation model from simple English (source) to difficult English (target), using the expectation-maximization as described in sub- section 18.2.2. Compute two iterations of the algorithm by hand, starting from a uni- formtranslationmodel,andusingthesimplealignmentmodelp(am |m,M(s),M(t))= 1 . Hint: in the final M-step, you will want to switch from fractions to decimals.
4. Building on the previous problem, what will be the converged translation proba- bility table? Can you state a general condition about the data, under which this translation model will fail in the way that it fails here?
5. Proposeasimplealignmentmodelthatwouldmakeitpossibletorecoverthecorrect translation probabilities from the toy dataset in the previous two problems.
M (t)
6. Let l(t) represent the loss at word m + 1 of the target, and let h(s) represent the hid-
m+1 n
den state at word n of the source. Write the expression for the derivative m+1 in the
∂ l(t) ∂ h(s)
n
sequence-to-sequence translation model expressed in Equations [18.29-18.32]. You may assume that both the encoder and decoder are one-layer LSTMs. In general,
how many terms are on the shortest backpropagation path from l(t) to h(s)? m+1 n
7. Now consider the neural attentional model from subsection 18.3.1, with sigmoid
∂ l(t)
attention. The derivative m+1 is the sum of many paths through the computation
∂zn
graph; identify the shortest such path. You may assume that the initial state of the
decoder recurrence h(t) is not tied to the final state of the encoder recurrence h(s) . 0 M(s)
Under contract with MIT Press, shared under CC-BY-NC-ND license.

456 CHAPTER18. MACHINETRANSLATION
8. Apply byte-pair encoding for the vocabulary it, unit, unite, until no bigram appears
more than once.
9. This problem relates to the complexity of machine translation. Suppose you have an oracle that returns the list of words to include in the translation, so that your only task is to order the words. Furt􏰀hermore, suppose that the scoring function
over orderings is a sum over bigrams, M ψ(w(t), w(t) ). Show that the problem m=1 m m−1
of finding the optimal translation is NP-complete, by reduction from a well-known problem.
10. Hand-designanattentionalrecurrenttranslationmodelthatsimplycopiestheinput from the source to the target. You may assume an arbitrarily large hidden state, and you may assume that there is a finite maximum input length M. Specify all the weights such that the maximum probability translation of any source is the source itself. Hint: it is simplest to use the Elman recurrence hm = f(Θhm−1 + xm) rather than an LSTM.
11. Give a synchronized derivation (subsection 18.2.4) for the Spanish-English transla- tion,
(18.6) El pez enojado atacado. The fish angry attacked.
The angry fish attacked.
As above, the second line shows a word-for-word gloss, and the third line shows the desired translation. Use the synchronized production rule in [18.22], and design the other production rules necessary to derive this sentence pair. You may derive (atacado, attacked) directly from VP.
Jacob Eisenstein. Draft of October 15, 2018.

Chapter 19
Text generation
In many of the most interesting problems in natural language processing, language is the output. The previous chapter described the specific case of machine translation, but there are many other applications, from summarization of research articles, to automated journalism, to dialogue systems. This chapter emphasizes three main scenarios: data-to- text, in which text is generated to explain or describe a structured record or unstructured perceptual input; text-to-text, which typically involves fusing information from multiple linguistic sources into a single coherent summary; and dialogue, in which text is generated as part of an interactive conversation with one or more human participants.
19.1 Data-to-text generation
In data-to-text generation, the input ranges from structured records, such as the descrip- tion of an weather forecast (as shown in Figure 19.1), to unstructured perceptual data, such as a raw image or video; the output may be a single sentence, such as an image cap- tion, or a multi-paragraph argument. Despite this diversity of conditions, all data-to-text systems share some of the same challenges (Reiter and Dale, 2000):
• determining what parts of the data to describe;
• planning a presentation of this information;
• lexicalizing the data into words and phrases;
• organizing words and phrases into well-formed sentences and paragraphs.
The earlier stages of this process are sometimes called content selection and text plan- ning; the later stages are often called surface realization.
Early systems for data-to-text generation were modular, with separate software com- ponents for each task. Artificial intelligence planning algorithms can be applied to both
457

458 CHAPTER19. TEXTGENERATION
Temperature
time min mean max
06:00-21:00 9 15 21
Wind speed
time min mean max
06:00-21:00 15 20 30
Cloudy, with temperatures between 10 and 20 degrees. South wind around 20 mph.
Figure 19.1: An example input-output pair for the task of generating text descriptions of weather forecasts (adapted from Konstas and Lapata, 2013).
the high-level information structure and the organization of individual sentences, ensur- ing that communicative goals are met (McKeown, 1992; Moore and Paris, 1993). Surface realization can be performed by grammars or templates, which link specific types of data to candidate words and phrases. A simple example template is offered by Wiseman et al. (2017), for generating descriptions of basketball games:
(19.1) The (-losses1) defeated the (–), –.
The New York Knicks (45-5) defeated the Boston Celtics (11-38), 115-79.
For more complex cases, it may be necessary to apply morphological inflections such as pluralization and tense marking — even in the simple example above, languages such as Russian would require case marking suffixes for the team names. Such inflections can be applied as a postprocessing step. Another difficult challenge for surface realization is the generation of varied referring expressions (e.g., The Knicks, New York, they), which is critical to avoid repetition. As discussed in subsection 16.2.1, the form of referring expressions is constrained by the discourse and information structure.
An example at the intersection of rule-based and statistical techniques is the NITRO- GEN system (Langkilde and Knight, 1998). The input to NITROGEN is an abstract meaning representation (AMR; see section 13.3) of semantic content to be expressed in a single sen- tence. In data-to-text scenarios, the abstract meaning representation is the output of a higher-level text planning stage. A set of rules then converts the abstract meaning repre- sentation into various sentence plans, which may differ in both the high-level structure (e.g., active versus passive voice) as well as the low-level details (e.g., word and phrase choice). Some examples are shown in Figure 19.2. To control the combinatorial explosion in the number of possible realizations for any given meaning, the sentence plans are uni- fied into a single finite-state acceptor, in which word tokens are represented by arcs (see
Jacob Eisenstein. Draft of October 15, 2018.
Cloud sky cover
time
06:00-09:00
09:00-12:00
percent (%)
25-50 50-75
Wind direction
time mode
06:00-21:00 S

19.1. DATA-TO-TEXTGENERATION 459 • Visitors who came to Japan admire Mount
(a / admire-01
:ARG0 (v / visitor
:ARG1-of (c / arrive-01
:ARG4 (j / Japan)))
:ARG1 (m / “Mount Fuji”))
Fuji.
• Visitors who came in Japan admire Mount Fuji.
• Mount Fuji is admired by the visitor who came in Japan.
Figure 19.2: Abstract meaning representation and candidate surface realizations from the NITROGEN system. Example adapted from Langkilde and Knight (1998).
subsection 9.1.1). A bigram language model is then used to compute weights on the arcs, so that the shortest path is also the surface realization with the highest bigram language model probability.
More recent systems are unified models that are trained end-to-end using backpropa- gation. Data-to-text generation shares many properties with machine translation, includ- ing a problem of alignment: labeled examples provide the data and the text, but they do not specify which parts of the text correspond to which parts of the data. For example, to learn from Figure 19.1, the system must align the word cloudy to records in CLOUD SKY COVER, the phrases 10 and 20 degrees to the MIN and MAX fields in TEMPERATURE, and so on. As in machine translation, both latent variables and neural attention have been proposed as solutions.
19.1.1 Latent data-to-text alignment
Given a dataset of texts and associated records {(w(i),y(i))}Ni=1, our goal is to learn a model Ψ, so that
wˆ = argmax Ψ(w, y; θ), [19.1] w∈V∗
where V∗ is the set of strings over a discrete vocabulary, and θ is a vector of parameters. The relationship between w and y is complex: the data y may contain dozens of records, and w may extend to several sentences. To facilitate learning and inference, it would be helpful to decompose the scoring function Ψ into subcomponents. This would be possi- ble if given an alignment, specifying which element of y is expressed in each part of w. Specifically, let zm indicates the record aligned to word m. For example, in Figure 19.1, z1 might specify that the word cloudy is aligned to the record cloud-sky-cover:percent. The score for this alignment would then be given by the weight on features such as
(cloudy, cloud-sky-cover:percent). [19.2]
In general, given an observed set of alignments, the score for a generation can be Under contract with MIT Press, shared under CC-BY-NC-ND license.

460 CHAPTER19. TEXTGENERATION
written as sum of local scores (Angeli et al., 2010):
􏰁M m=1
where ψw can represent a bigram language model, and ψz can be tuned to reward coher- ence, such as the use of related records in nearby words. 1 The parameters of this model could be learned from labeled data {(w(i), y(i), z(i))}Ni=1. However, while several datasets include structured records and natural language text (Barzilay and McKeown, 2005; Chen and Mooney, 2008; Liang and Klein, 2009), the alignments between text and records are usually not available.2 One solution is to model the problem probabilistically, treating the alignment as a latent variable (Liang et al., 2009; Konstas and Lapata, 2013). The model can then be estimated using expectation maximization or sampling (see chapter 5).
19.1.2 Neural data-to-text generation
The encoder-decoder model and neural attention were introduced in section 18.3 as methods for neural machine translation. They can also be applied to data-to-text gen- eration, with the data acting as the source language (Mei et al., 2016). In neural machine translation, the attention mechanism linked words in the source to words in the target; in data-to-text generation, the attention mechanism can link each part of the generated text back to a record in the data. The biggest departure from translation is in the encoder, which depends on the form of the data.
Data encoders
In some types of structured records, all values are drawn from discrete sets. For example, the birthplace of an individual is drawn from a discrete set of possible locations; the diag- nosis and treatment of a patient are drawn from an exhaustive list of clinical codes (John- son et al., 2016). In such cases, vector embeddings can be estimated for each field and possible value: for example, a vector embedding for the field BIRTHPLACE, and another embedding for the value BERKELEY CALIFORNIA (Bordes et al., 2011). The table of such embeddings serves as the encoding of a structured record (He et al., 2017). It is also possi- ble to compress the entire table into a single vector representation, by pooling across the embeddings of each field and value (Lebret et al., 2016).
1More expressive decompositions of Ψ are possible. For example, Wong and Mooney (2007) use a syn- chronous context-free grammar (see subsection 18.2.4) to “translate” between a meaning representation and natural language text.
2An exception is a dataset of records and summaries from American football games, containing annota- tions of alignments between sentences and records (Snyder and Barzilay, 2007).
Jacob Eisenstein. Draft of October 15, 2018.
Ψ(w, y; θ) =
ψw,y(wm, yzm ) + ψw(wm, wm−1) + ψz(zm, zm−1), [19.3]

19.1. DATA-TO-TEXTGENERATION 461
Figure 19.3: Examples of the image captioning task, with attention masks shown for each of the underlined words (Xu et al., 2015).
Sequences Some types of structured records have a natural ordering, such as events in a game (Chen and Mooney, 2008) and steps in a recipe (Tutin and Kittredge, 1992). For example, the following records describe a sequence of events in a robot soccer match (Mei et al., 2016):
PASS(arg1 = PURPLE6, arg2 = PURPLE3) KICK(arg1 = PURPLE3)
BADPASS(arg1 = PURPLE3, arg2 = PINK9).
Each event is a single record, and can be encoded by a concatenation of vector represen- tations for the event type (e.g., PASS), the field (e.g., arg1), and the values (e.g., PURPLE3),
e.g.,
X = 􏰨uPASS, uarg1, uPURPLE6, uarg2, uPURPLE3􏰩 . [19.4]
This encoding can then act as the input layer for a recurrent neural network, yielding a sequence of vector representations {zr}Rr=1, where r indexes over records. Interestingly, this sequence-based approach can work even in cases where there is no natural ordering over the records, such as the weather data in Figure 19.1 (Mei et al., 2016).
Images Another flavor of data-to-text generation is the generation of text captions for images. Examples from this task are shown in Figure 19.3. Images are naturally repre- sented as tensors: a color image of 320 × 240 pixels would be stored as a tensor with 320 × 240 × 3 intensity values. The dominant approach to image classification is to en- code images as vectors using a combination of convolution and pooling (Krizhevsky et al.,
Under contract with MIT Press, shared under CC-BY-NC-ND license.

462 CHAPTER19. TEXTGENERATION
id-0: temperature(min=52,max=71,mean=63) id-2: windSpeed(min=8,mean=17,max=23) id-5: skyCover(mode=50-75) id-10: precipChance(min=19,mean=32,max=73) id-15: thunderChance(mode=SChc)
Figure 19.4: Neural attention in text generation. Figure adapted from Mei et al. (2016).
2012). Chapter 3 explains how to use convolutional networks for text; for images, convo- lution is applied across the vertical, horizontal, and color dimensions. By pooling the re- sults of successive convolutions, the image is converted to a vector representation, which can then be fed directly into the decoder as the initial state (Vinyals et al., 2015), just as in the sequence-to-sequence translation model (see section 18.3). Alternatively, one can apply a set of convolutional networks, yielding vector representations for different parts of the image, which can then be combined using neural attention (Xu et al., 2015).
Attention
Given a set of embeddings of the data {zr}Rr=1 and a decoder state hm, an attention vector over the data can be computed using the same techniques as in machine translation (see subsection 18.3.1). When generating word m of the output, attention is computed over the records,
ψα(m,r)=βα ·f(Θα[hm;zr])
αm =g([ψα(m,1),ψα(m,2),…,ψα(m,R)])
[19.5] [19.6]
[19.7]
cm =
􏰁R r=1
αm→rzr,
where f is an elementwise nonlinearity such as tanh or ReLU, and g is a either softmax or elementwise sigmoid. The weighted sum cm can then be included in the recurrent update to the decoder state, or in the emission probabilities, as described in subsection 18.3.1. Figure 19.4 shows the attention to components of a weather record, while generating the text shown on the x-axis.
Adapting this architecture to image captioning is straightforward. A convolutional neural networks is applied to a set of image locations, and the output at each location l is
Jacob Eisenstein. Draft of October 15, 2018.
a 20 % chance of showers and thunderstorms after noon . mostly cloudy with a high near 71 .

19.1. DATA-TO-TEXTGENERATION 463 represented with a vector zl. Attention can then be computed over the image locations,
as shown in the right panels of each pair of images in Figure 19.3.
Various modifications to this basic mechanism have been proposed. In coarse-to-fine attention (Mei et al., 2016), each record receives a global attention ar ∈ [0, 1], which is in- dependent of the decoder state. This global attention, which represents the overall impor- tance of the record, is multiplied with the decoder-based attention scores, before comput- ing the final normalized attentions. In structured attention, the attention vector αm→· can include structural biases, which can favor assigning higher attention values to contiguous segments or to dependency subtrees (Kim et al., 2017). Structured attention vectors can be computed by running the forward-backward algorithm to obtain marginal attention probabilities (see section 7.5.3). Because each step in the forward-backward algorithm is differentiable, it can be encoded in a computation graph, and end-to-end learning can be performed by backpropagation.
Decoder
Given the encoding, the decoder can function just as in neural machine translation (see subsection 18.3.1), using the attention-weighted encoder representation in the decoder recurrence and/or output computation. As in machine translation, beam search can help to avoid search errors (Lebret et al., 2016).
Many applications require generating words that do not appear in the training vocab- ulary. For example, a weather record may contain a previously unseen city name; a sports record may contain a previously unseen player name. Such tokens can be generated in the text by copying them over from the input (e.g., Gulcehre et al., 2016).3 First introduce an
additional variable s ∈ {gen, copy}, indicating whether token w(t) should be generated mm
or copied. The decoder probability is then,
􏰅SoftMax(β (t) · h(t) ), (t) 􏰛􏰜 p(w |w(t) ,Z,s )= 􏰀 w m−1
sm = gen
, s =copy,
[19.8] where δ(w(s) = w(t)) is an indicator function, taking the value 1 iff the text of the record
1:m−1 m R δ w(s)=w(t) ×α
r=1 r m→r
m
r
w(s) is identical to the target word w(t). The probability of copying record r from the source r
is δ (sm = copy) × αm→r , the product of the copy probability by the local attention. Note that in this model, the attention weights αm are computed from the previous decoder state hm−1. The computation graph therefore remains a feedforward network, with recurrent
paths such as h(t) → α → w(t) → h(t). m−1 m m m
To facilitate end-to-end training, the switching variable sm can be represented by a gate πm, which is computed from a two-layer feedforward network, whose input consists
3A number of variants of this strategy have been proposed (e.g., Gu et al., 2016; Merity et al., 2017). See Wiseman et al. (2017) for an overview.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

464 CHAPTER19. TEXTGENERATION of the concatenatio􏰀n of the decoder state h(t) and the attention-weighted representation
m−1
of the data, cm = Rr=1 αm→rzr,
πm = σ(Θ(2)f(Θ(1)[h(t) ; cm])).
[19.9]
δ(w(s) =w(t))×α . r m→r
m−1 The full generative probability at token m is then,
expβ (t) ·h(t)
w m−1
􏰁R r=1
p(w(t) |w(t) ,Z)=π × 1:m m
19.2 Text-to-text generation
Text-to-text generation includes problems of summarization and simplification:
• reading a novel and outputting a paragraph-long summary of the plot;4
• reading a set of blog posts about politics, and outputting a bullet list of the various issues and perspectives;
• readingatechnicalresearcharticleaboutthelong-termhealthconsequencesofdrink- ing kombucha, and outputting a summary of the article in language that non-experts can understand.
These problems can be approached in two ways: through the encoder-decoder architec- ture discussed in the previous section, or by operating directly on the input text.
19.2.1 Neural abstractive summarization
Sentence summarization is the task of shortening a sentence while preserving its mean-
ing, as in the following examples (Knight and Marcu, 2000; Rush et al., 2015):
(19.2) a. The documentation is typical of Epson quality: excellent. Documentation is excellent.
4In section 16.3.4, we encountered a special case of single-document summarization, which involved extracting the most important sentences or discourse units. We now consider the more challenging problem of abstractive summarization, in which the summary can include words that do not appear in the original text.
Jacob Eisenstein. Draft of October 15, 2018.
􏰀V expβ ·h(t) j=1 j m−1
+(1−π )× m
􏰎 􏰍􏰌 􏰏 􏰎 􏰍􏰌 􏰏
generate copy
[19.10]

19.2. TEXT-TO-TEXTGENERATION 465
b. Russian defense minister Ivanov called sunday for the creation of a joint front for combating global terrorism.
Russia calls for joint front against terrorism.
Sentence summarization is closely related to sentence compression, in which the sum- mary is produced by deleting words or phrases from the original (Clarke and Lapata, 2008). But as shown in (19.2b), a sentence summary can also introduce new words, such as against, which replaces the phrase for combatting.
Sentence summarization can be treated as a machine translation problem, using the attentional encoder-decoder translation model discussed in subsection 18.3.1 (Rush et al., 2015). The longer sentence is encoded into a sequence of vectors, one for each token. The decoder then computes attention over these vectors when updating its own recurrent state. As with data-to-text generation, it can be useful to augment the encoder-decoder model with the ability to copy words directly from the source. Rush et al. (2015) train this model by building four million sentence pairs from news articles. In each pair, the longer sentence is the first sentence of the article, and the summary is the article headline. Sentence summarization can also be trained in a semi-supervised fashion, using a proba- bilistic formulation of the encoder-decoder model called a variational autoencoder (Miao and Blunsom, 2016, also see subsection 14.8.2).
When summarizing longer documents, an additional concern is that the summary not be repetitive: each part of the summary should cover new ground. This can be a􏰀ddressed by maintaining a vector of the sum total of all attention values thus far, tm = mn=1 αn. This total can be used as an additional input to the computation of the attention weights,
α ∝ exp 􏰛v · tanh(Θ [h(t); h(s); t ])􏰜 , [19.11] m→n α αmnm
which enables the model to learn to prefer parts of the source which have not been at- tended to yet (Tu et al., 2016). To further encourage diversity in the generated summary, See et al. (2017) introduce a coverage loss to the objective function,
􏰁(s) M
n=1
This loss will be low if αmassigns little attention to words that already have large values in tm.Coverage loss is similar to the concept of marginal relevance, in which the reward for adding new content is proportional to the extent to which it increases the overall amount of information conveyed by the summary (Carbonell and Goldstein, 1998).
Under contract with MIT Press, shared under CC-BY-NC-ND license.
lm =
min(αm→n, tm→n). [19.12]

466 CHAPTER19. TEXTGENERATION 19.2.2 Sentence fusion for multi-document summarization
In multi-document summarization, the goal is to produce a summary that covers the content of several documents (McKeown et al., 2002). One approach to this challenging problem is to identify sentences across multiple documents that relate to a single theme, and then to fuse them into a single sentence (Barzilay and McKeown, 2005). As an exam- ple, consider the following two sentences (McKeown et al., 2010):
(19.3) a. Palin actually turned against the bridge project only after it became a national symbol of wasteful spending.
b. Ms. Palin supported the bridge project while running for governor, and abandoned it after it became a national scandal.
An intersection preserves only the content that is present in both sentences:
(19.4) Palin turned against the bridge project after it became a national scandal.
A union includes information from both sentences:
(19.5) Ms. Palin supported the bridge project while running for governor, but turned
against it when it became a national scandal and a symbol of wasteful spending.
Dependency parsing is often used as a technique for sentence fusion. After parsing each sentence, the resulting dependency trees can be aggregated into a lattice (Barzilay and McKeown, 2005) or a graph structure (Filippova and Strube, 2008), in which identical or closely related words (e.g., Palin, bridge, national) are fused into a single node. The resulting graph can then be pruned back to a tree by solving an integer linear program (see subsection 13.2.2),
max 􏰁 ψ(i →−r j, w; θ) × yi,j,r
[19.13] [19.14]
y
s.t. y ∈ C,
i,j,r
where the variable yi,j,r ∈ {0, 1} indicates whether there is an edge from i to j of type r, the score of this edge is ψ(i →−r j, w; θ), and C is a set of constraints, which ensures that y forms a valid dependency graph. As usual, w is the list of words in the graph, and θ is a vector of parameters. The score ψ(i →−r j, w; θ) reflects the “importance” of the modifier j to the overall meaning: in intersective fusion, this score indicates the extent to which the content in this edge is expressed in all sentences; in union fusion, the score indicates whether the content in the edge is expressed in any sentence. The constraint set C can impose additional linguistic constraints: for example, ensuring that coordinated nouns are sufficiently similar. The resulting tree must then be linearized into a sentence. Lin- earization is like the inverse of dependency parsing: instead of parsing from a sequence
Jacob Eisenstein. Draft of October 15, 2018.

19.3. DIALOGUE 467
of tokens into a tree, we must convert the tree back into a sequence of tokens. This is typically done by generating a set of candidate linearizations, and choosing the one with the highest score under a language model (Langkilde and Knight, 1998; Song et al., 2016).
19.3 Dialogue
Dialogue systems are capable of conversing with a human interlocutor, often to per- form some task (Grosz, 1979), but sometimes just to chat (Weizenbaum, 1966). While re- search on dialogue systems goes back several decades (Carbonell, 1970; Winograd, 1972), commercial systems such as Alexa and Siri have recently brought this technology into widespread use. Nonetheless, there is a significant gap between research and practice: many practical dialogue systems remain scripted and inflexible, while research systems emphasize abstractive text generation, “on-the-fly” decision making, and probabilistic reasoning about the user’s intentions.
19.3.1 Finite-state and agenda-based dialogue systems
Finite-state automata were introduced in chapter 9 as a formal model of computation, in which string inputs and outputs are linked to transitions between a finite number of discrete states. This model naturally fits simple task-oriented dialogues, such as the one shown in the left panel of Figure 19.5. This (somewhat frustrating) dialogue can be repre- sented with a finite-state transducer, as shown in the right panel of the figure. The accept- ing state is reached only when the two needed pieces of information are provided, and the human user confirms that the order is correct. In this simple scenario, the TOPPING and ADDRESS are the two slots associated with the activity of ordering a pizza, which is called a frame. Frame representations can be hierarchical: for example, an ADDRESS could have slots of its own, such as STREET and CITY.
In the example dialogue in Figure 19.5, the user provides the precise inputs that are needed in each turn (e.g., anchovies; the College of Computing building). Some users may prefer to communicate more naturally, with phrases like I’d, uh, like some anchovies please. One approach to handling such utterances is to design a custom grammar, with non- terminals for slots such as TOPPING and LOCATION. However, context-free parsing of unconstrained speech input is challenging. A more lightweight alternative is BIO-style sequence labeling (see section 8.3), e.g.:
(19.7) I’d like anchovies , and please bring it to the College of Computing O O B-TOPPING O O O O O O B-ADDR I-ADDR I-ADDR I-ADDR Building .
I-ADDR O
Under contract with MIT Press, shared under CC-BY-NC-ND license.

468
CHAPTER19. TEXTGENERATION
start
q0
What toppings?
(19.6)
A: I want to order a pizza. B: What toppings?
A: Anchovies.
B: Ok, what address?
A: The College of Computing building.
B: Please confirm: one pizza with artichokes, to be delivered to the College of Computing building.
A: No.
B: What toppings? …
q1 Topping
q2
No What address?
q3 Address
q4 Confirm?
q5 Yes
q6
Figure 19.5: An example dialogue and the associated finite-state model. In the finite-state model, SMALL CAPS indicates that the user must provide information of this type in their answer.
The tagger can be driven by a bi-directional recurrent neural network, similar to recurrent approaches to semantic role labeling described in subsection 13.2.3.
The input in (19.7) could not be handled by the finite-state system from Figure 19.5, which forces the user to provide the topping first, and then the location. In this sense, the “initiative” is driven completely by the system. Agenda-based dialogue systems extend finite-state architectures by attempting to recognize all slots that are filled by the user’s re- ply, thereby handling these more complex examples. Agenda-based systems dynamically pose additional questions until the frame is complete (Bobrow et al., 1977; Allen et al., 1995; Rudnicky and Xu, 1999). Such systems are said to be mixed-initiative, because both the user and the system can drive the direction of the dialogue.
19.3.2 Markov decision processes
The task of dynamically selecting the next move in a conversation is known as dialogue management. This problem can be framed as a Markov decision process, which is a theoretical model that includes a discrete set of states, a discrete set of actions, a function that computes the probability of transitions between states, and a function that computes the cost or reward of action-state pairs. Let’s see how each of these elements pertains to the pizza ordering dialogue system.
• Each state is a tuple of information about whether the topping and address are Jacob Eisenstein. Draft of October 15, 2018.

19.3. DIALOGUE 469
known, and whether the order has been confirmed. For example,
(KNOWN TOPPING, UNKNOWN ADDRESS, NOT CONFIRMED) [19.15]
is a possible state. Any state in which the pizza order is confirmed is a terminal state, and the Markov decision process stops after entering such a state.
• The set of actions includes querying for the topping, querying for the address, and requesting confirmation. Each action induces a probability distribution over states, p(st | at,st−1). For example, requesting confirmation of the order is not likely to result in a transition to the terminal state if the topping is not yet known. This probability distribution over state transitions may be learned from data, or it may be specified in advance.
• Each state-action-state tuple earns a reward, ra(st, st+1). In the context of the pizza ordering system, a simple reward function would be,
0, a = CONFIRM, st = (*, *, CONFIRMED)
ra(st, st−1) = −10, a = CONFIRM, st = (*, *, NOT CONFIRMED) [19.16]
−1, a ̸= CONFIRM
This function assigns zero reward for successful transitions to the terminal state, a large negative reward to a rejected request for confirmation, and a small negative re- ward for every other type of action. The system is therefore rewarded for reaching the terminal state in few steps, and penalized for prematurely requesting confirma- tion.
In a Markov decision process, a policy is a function π : S → A that maps from states to actions (se􏰀e section 15.2.4). The value of a policy is the expected sum of discounted rewards, Eπ[ Tt=1 γtrat (st, st+1)], where γ is the discount factor, γ ∈ [0, 1). Discounting has the effect of emphasizing rewards that can be obtained immediately over less certain rewards in the distant future.
An optimal policy can be obtained by dynamic programming, by iteratively updating the value function V (s), which is the expectation of the cumulative reward from s under the optimal action a,
V (s) ← max 􏰁 p(s′ | s, a)[ra(s, s′) + γV (s′)]. [19.17] a∈A s′∈S
The value function V (s) is computed in terms of V (s′) for all states s′ ∈ S. A series of iterative updates to the value function will eventually converge to a stationary point. This algorithm is known as value iteration. Given the converged value function V (s), the
Under contract with MIT Press, shared under CC-BY-NC-ND license.

470 CHAPTER19. TEXTGENERATION optimal action at each state is the argmax,
π(s) = argmax 􏰁 p(s′ | s, a)[ra(s, s′) + γV (s′)]. [19.18] a∈A s′ ∈S
Value iteration and related algorithms are described in detail by Sutton and Barto (1998). For applications to dialogue systems, see Levin et al. (1998) and Walker (2000).
The Markov decision process framework assumes that the current state of the dialogue is known. In reality, the system may misinterpret the user’s statements — for example, believing that a specification of the delivery location (PEACHTREE) is in fact a specification of the topping (PEACHES). In a partially observable Markov decision process (POMDP), the system receives an observation o, which is probabilistically conditioned on the state, p(o | s). It must therefore maintain a distribution of beliefs about which state it is in, with qt(s) indicating the degree of belief that the dialogue is in state s at time t. The POMDP formulation can help to make dialogue systems more robust to errors, particularly in the context of spoken language dialogues, where the speech itself may be misrecognized (Roy et al., 2000; Williams and Young, 2007). However, finding the optimal policy in a POMDP is computationally intractable, requiring additional approximations.
19.3.3 Neural chatbots
It’s easier to talk when you don’t need to get anything done. Chatbots are systems that parry the user’s input with a response that keeps the conversation going. They can be built from the encoder-decoder architecture discussed in section 18.3 and subsec- tion 19.1.2: the encoder converts the user’s input into a vector, and the decoder produces a sequence of words as a response. For example, Shang et al. (2015) apply the attentional encoder-decoder translation model, training on a dataset of posts and responses from the Chinese microblogging platform Sina Weibo.5 This approach is capable of generating replies that relate thematically to the input, as shown in the following examples (trans- lated from Chinese by Shang et al., 2015).
(19.8) a. A: High fever attacks me every New Year’s day. B: Get well soon and stay healthy!
b. A: I gain one more year. Grateful to my group, so happy. B: Getting old now. Time has no mercy.
While encoder-decoder models can generate responses that make sense in the con- text of the immediately preceding turn, they struggle to maintain coherence over longer
5Twitter is also frequently used for construction of dialogue datasets (Ritter et al., 2011; Sordoni et al., 2015). Another source is technical support chat logs from the Ubuntu linux distribution (Uthus and Aha, 2013; Lowe et al., 2015).
Jacob Eisenstein. Draft of October 15, 2018.

19.3. DIALOGUE 471
conversations. One solution is to model the dialogue context recurrently. This creates a hierarchical recurrent network, including both word-level and turn-level recurrences. The turn-level hidden state is then used as additional context in the decoder (Serban et al., 2016).
An open question is how to integrate the encoder-decoder architecture into task-oriented dialogue systems. Neural chatbots can be trained end-to-end: the user’s turn is analyzed by the encoder, and the system output is generated by the decoder. This architecture can be trained by log-likelihood using backpropagation (e.g., Sordoni et al., 2015; Serban et al., 2016), or by more elaborate objectives, using reinforcement learning (Li et al., 2016). In contrast, the task-oriented dialogue systems described in subsection 19.3.1 typically involve a set of specialized modules: one for recognizing the user input, another for de- ciding what action to take, and a third for arranging the text of the system output.
Recurrent neural network decoders can be integrated into Markov Decision Process dialogue systems, by conditioning the decoder on a representation of the information that is to be expressed in each turn (Wen et al., 2015). Specifically, the long short-term memory (LSTM; section 6.3) architecture is augmented so that the memory cell at turn m takes an additional input dm, which is a representation of the slots and values to be expressed in the next turn. However, this approach still relies on additional modules to recognize the user’s utterance and to plan the overall arc of the dialogue.
Another promising direction is to create embeddings for the elements in the domain: for example, the slots in a record and the entities that can fill them. The encoder then encodes not only the words of the user’s input, but the embeddings of the elements that the user mentions. Similarly, the decoder is endowed with the ability to refer to specific elements in the knowledge base. He et al. (2017) show that such a method can learn to play a collaborative dialogue game, in which both players are given a list of entities and their properties, and the goal is to find an entity that is on both players’ lists.
Additional resources
Gatt and Krahmer (2018) provide a comprehensive recent survey on text generation. For a book-length treatment of earlier work, see Reiter and Dale (2000). For a survey on image captioning, see Bernardi et al. (2016); for a survey of pre-neural approaches to dialogue systems, see Rieser and Lemon (2011). Dialogue acts were introduced in section 8.6 as a labeling scheme for human-human dialogues; they also play a critical in task-based dia- logue systems (e.g., Allen et al., 1996). The incorporation of theoretical models of dialogue into computational systems is reviewed by Jurafsky and Martin (2009, chapter 24).
While this chapter has focused on the informative dimension of text generation, an- other line of research aims to generate text with configurable stylistic properties (Walker et al., 1997; Mairesse and Walker, 2011; Ficler and Goldberg, 2017; Hu et al., 2017). This
Under contract with MIT Press, shared under CC-BY-NC-ND license.

472 CHAPTER19. TEXTGENERATION
chapter also does not address the generation of creative text such as narratives (Riedl and Young, 2010), jokes (Ritchie, 2001), poems (Colton et al., 2012), and song lyrics (Gonc ̧alo Oliveira et al., 2007).
Exercises
1. Find an article about a professional basketball game, with an associated “box score” of statistics. Which are the first three elements in the box score that are expressed in the article? Can you identify template-based patterns that express these elements of the record? Now find a second article about a different basketball game. Does it mention the same first three elements of the box score? Do your templates capture how these elements are expressed in the text?
2. Thisexerciseistobedonebyapairofstudents.Onestudentshouldchooseanarticle from the news or from Wikipedia, and manually perform semantic role labeling (SRL) on three short sentences or clauses. (See chapter 13 for a review of SRL.) Identify the main the semantic relation and its arguments and adjuncts. Pass this structured record — but not the original sentence — to the other student, whose job is to generate a sentence expressing the semantics. Then reverse roles, and try to regenerate three sentences from another article, based on the predicate-argument semantics.
3. Compute the BLEU scores (see subsection 18.1.1) for the generated sentences in the previous problem, using the original article text as the reference.
4. Align each token in the text of Figure 19.1 to a specific single record in the database, or to the null record ∅. For example, the tokens south wind would align to the record
wind direction: 06:00-21:00: mode=S. How often is each token aligned to the same record as the previous token? How many transitions are there? How might a system learn to output 10 degrees for the record min=9?
5. Insentencecompressionandfusion,wemaywishtopreservecontiguoussequences of tokens (n-grams) and/or dependency edges. Find five short news articles with headlines. For each headline, compute the fraction of bigrams that appear in the main text of the article. Then do a manual depenency parse of the headline. For each dependency edge, count how often it appears as a dependency edge in the main text. You may use an automatic dependency parser to assist with this exercise, but check the output, and focus on UD 2.0 dependency grammar, as described in chapter 11.
6. subsection 19.2.2 presents the idea of generating text from dependency trees, which requires linearization. Sometimes there are multiple ways that a dependency tree can be linearized. For example:
Jacob Eisenstein. Draft of October 15, 2018.

19.3. DIALOGUE
473
(19.9) a. b.
The sick kids stayed at home in bed. The sick kids stayed in bed at home.
Both sentences have an identical dependency parse: both home and bed are (oblique) dependents of stayed.
Identify two more English dependency trees that can each be linearized in more than one way, and try to use a different pattern of variation in each tree. As usual, specify your trees in the Universal Dependencies 2 style, which is described in chapter 11.
7. In subsection 19.3.2, we considered a pizza delivery service. Let’s simplify the prob- lem to take-out, where it is only necessary to determine the topping and confirm the order. The state is a tuple in which the first element is T if the topping is specified and ? otherwise, and the second element is either YES or NO, depending on whether the order has been confirmed. The actions are TOPPING? (request information about the topping) and CONFIRM? (request confirmation). The state transition function is:
􏰅0.9, p(st | st−1 = (?, NO), a = TOPPING?) = 􏰮 0.1,
st = (T, NO) st = (?, NO).
[19.19]
[19.20] [19.21]
[19.22]
p(st | st−1 = (?, NO), a = CONFIRM?) = 􏰮1,
p(st | st−1 = (T, NO), a = TOPPING?) = 􏰅1, st = (T, NO).
p(st | st−1 = (T, NO), a = CONFIRM?) = 0.9, 0.1,
st = (?, NO). st = (T, YES)
Using the reward function defined in Equation 19.16, the discount γ = 0.9, and the initialization V (s) = 0, execute three iterations of Equation 19.17. After these three iterations, compute the optimal action in each state. You can assume that for the terminal states, V (*, YES) = 0, so you only need to compute the values for non- terminal states, V (?, NO) and V (T, NO).
8. Thereareseveraltoolkitsthatallowyoutotrainencoder-decodertranslationmodels “out of the box”, such as FAIRSEQ (Gehring et al., 2017), XNMT (Neubig et al., 2018), TENSOR2TENSOR(Vaswanietal.,2018),andOPENNMT(Kleinetal.,2017).6 Useone of these toolkits to train a chatbot dialogue system, using either the NPS dialogue corpus that comes with NLTK (Forsyth and Martell, 2007), or, if you are feeling more ambitious, the Ubuntu dialogue corpus (Lowe et al., 2015).
6 https://github.com/facebookresearch/fairseq; https://github.com/neulab/xnmt; https://github.com/tensorflow/tensor2tensor; http://opennmt.net/
Under contract with MIT Press, shared under CC-BY-NC-ND license.
st = (T, NO).

Appendix A Probability
Probability theory provides a way to reason about random events. The sorts of random events that are typically used to explain probability theory include coin flips, card draws, and the weather. It may seem odd to think about the choice of a word as akin to the flip of a coin, particularly if you are the type of person to choose words carefully. But random or not, language has proven to be extremely difficult to model deterministically. Probability offers a powerful tool for modeling and manipulating linguistic data.
Probability can be thought of in terms of random outcomes: for example, a single coin flip has two possible outcomes, heads or tails. The set of possible outcomes is the sample space, and a subset of the sample space is an event. For a sequence of two coin flips, therearefourpossibleoutcomes,{HH,HT,TH,TT},representingtheorderedsequences heads-head, heads-tails, tails-heads, and tails-tails. The event of getting exactly one head includes two outcomes: {HT,TH}.
Formally, a probability is a function from events to the interval between zero and one: Pr : F → [0, 1], where F is the set of possible events. An event that is certain has proba- bility one; an event that is impossible has probability zero. For example, the probability of getting fewer than three heads on two coin flips is one. Each outcome is also an event (a set with exactly one element), and for two flips of a fair coin, the probability of each outcome is,
Pr({HH}) = Pr({HT}) = Pr({TH}) = Pr({TT}) = 1. [A.1] 4
A.1 Probabilities of event combinations
Because events are sets of outcomes, we can use set-theoretic operations such as comple- ment, intersection, and union to reason about the probabilities of events and their combi- nations.
475

476
APPENDIX A. PROBABILITY
For any event A, there is a complement ¬A, such that:
• TheprobabilityoftheunionA∪¬AisPr(A∪¬A)=1;
• TheintersectionA∩¬A=∅istheemptyset,andPr(A∩¬A)=0.
In the coin flip example, the event of obtaining a single head on two flips corresponds to the set of outcomes {HT,TH}; the complement event includes the other two outcomes, {TT,HH}.
A.1.1 Probabilities of disjoint events
When two events have an empty intersection, A ∩ B = ∅, they are disjoint. The probabil-
ity of the union of two disjoint events is equal to the sum of their probabilities,
A∩B=∅ ⇒ Pr(A∪B)=Pr(A)+Pr(B). [A.2]
This is the third axiom of probability, and it can be generalized to any countable sequence of disjoint events.
In the coin flip example, this axiom can derive the probability of the event of getting a single head on two flips. This event is the set of outcomes {HT,TH}, which is the union of two simpler events, {HT,TH} = {HT} ∪ {TH}. The events {HT} and {TH} are disjoint. Therefore,
Pr({H T , T H }) = Pr({H T } ∪ {T H }) = Pr({H T }) + Pr({T H }) =1 + 1 = 1.
442
In the general, the probability of the union of two events is,
Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
[A.3] [A.4]
[A.5]
This can be seen visually in Figure A.1, and it can be derived from the third axiom of probability. Consider an event that includes all outcomes in B that are not in A, denoted as B − (A ∩ B). By construction, this event is disjoint from A. We can therefore apply the additive rule,
Pr(A ∪ B) = Pr(A) + Pr(B − (A ∩ B)). [A.6] Furthermore, the event B is the union of two disjoint events: A ∩ B and B − (A ∩ B).
Pr(B) = Pr(B − (A ∩ B)) + Pr(A ∩ B). Reorganizing and subtituting into Equation A.6 gives the desired result:
Pr(B − (A ∩ B)) = Pr(B) − Pr(A ∩ B)
Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
[A.7]
[A.8] [A.9]
Jacob Eisenstein. Draft of October 15, 2018.

A.2. CONDITIONALPROBABILITYANDBAYES’RULE 477
AA∩BB
Figure A.1: A visualization of the probability of non-disjoint events A and B. A.1.2 Law of total probability
A set of events B = {B1,B2,…,BN} is a partition of the sample space iff each pair of events is disjoint (Bi ∩ Bj = ∅), and the union of the events is the entire sample space. The law of total probability states that we can marginalize over these events as follows,
Pr(A) = 􏰁 Pr(A ∩ Bn). [A.10] Bn ∈B
For any event B, the union B ∪ ¬B is a partition of the sample space. Therefore, a special case of the law of total probability is,
Pr(A) = Pr(A ∩ B) + Pr(A ∩ ¬B). [A.11] A.2 Conditional probability and Bayes’ rule
A conditional probability is an expression like Pr(A | B), which is the probability of the event A, assuming that event B happens too. For example, we may be interested in the probability of a randomly selected person answering the phone by saying hello, conditioned on that person being a speaker of English. Conditional probability is defined as the ratio,
Pr(A | B) = Pr(A ∩ B). [A.12] Pr(B)
The chain rule of probability states that Pr(A ∩ B) = Pr(A | B) × Pr(B), which is just Under contract with MIT Press, shared under CC-BY-NC-ND license.

478 APPENDIX A. PROBABILITY a rearrangement of terms from Equation A.12. The chain rule can be applied repeatedly:
Pr(A ∩ B ∩ C) = Pr(A | B ∩ C) × Pr(B ∩ C)
= Pr(A | B ∩ C) × Pr(B | C) × Pr(C).
Bayes’ rule (sometimes called Bayes’ law or Bayes’ theorem) gives us a way to convert between Pr(A | B) and Pr(B | A). It follows from the definition of conditional probability and the chain rule:
Pr(A | B) = Pr(A ∩ B) = Pr(B | A) × Pr(A) [A.13] Pr(B) Pr(B)
Each term in Bayes rule has a name, which we will occasionally use:
• Pr(A) is the prior, since it is the probability of event A without knowledge about
whether B happens or not.
• Pr(B | A) is the likelihood, the probability of event B given that event A has oc-
curred.
• Pr(A | B) is the posterior, the probability of event A with knowledge that B has occurred.
Example The classic examples for Bayes’ rule involve tests for rare diseases, but Man- ningandSchu ̈tze(1999)reframethisexampleinalinguisticsetting.Supposethatyouare is interested in a rare syntactic construction, such as parasitic gaps, which occur on average once in 100,000 sentences. Here is an example of a parasitic gap:
(A.1) Which class did you attend without registering for ?
Lana Linguist has developed a complicated pattern matcher that attempts to identify
sentences with parasitic gaps. It’s pretty good, but it’s not perfect:
• If a sentence has a parasitic gap, the pattern matcher will find it with probability
0.95. (This is the recall, which is one minus the false negative rate.)
• If the sentence doesn’t have a parasitic gap, the pattern matcher will wrongly say it does with probability 0.005. (This is the false positive rate, which is one minus the precision.)
Suppose that Lana’s pattern matcher says that a sentence contains a parasitic gap. What is the probability that this is true?
Jacob Eisenstein. Draft of October 15, 2018.

A.3. INDEPENDENCE 479
Let G be the event of a sentence having a parasitic gap, and T be the event of the test being positive. We are interested in the probability of a sentence having a parasitic gap given that the test is positive. This is the conditional probability Pr(G | T ), and it can be computed by Bayes’ rule:
Pr(G | T) =Pr(T | G) × Pr(G). [A.14] Pr(T )
We already know both terms in the numerator: Pr(T | G) is the recall, which is 0.95; Pr(G) is the prior, which is 10−5.
We are not given the denominator, but it can be computed using tools developed ear- lier in this section. First apply the law of total probability, using the partition {G, ¬G}:
Pr(T) = Pr(T ∩ G) + Pr(T ∩ ¬G). [A.15]
This says that the probability of the test being positive is the sum of the probability of a true positive (T ∩ G) and the probability of a false positive (T ∩ ¬G). The probability of each of these events can be computed using the chain rule:
Pr(T ∩G)=Pr(T |G)×Pr(G)=0.95×10−5
Pr(T ∩¬G)=Pr(T |¬G)×Pr(¬G)=0.005×(1−10−5)≈0.005
Pr(T) =Pr(T ∩ G) + Pr(T ∩ ¬G) =0.95 × 10−5 + 0.005.
Plugging these terms into Bayes’ rule gives the desired posterior probability,
Pr(G | T) =Pr(T | G)Pr(G) Pr(T )
0.95 × 10−5
=0.95 × 10−5 + 0.005 × (1 − 10−5)
≈0.002.
[A.16] [A.17] [A.18] [A.19]
[A.20]
[A.21] [A.22]
Lana’s pattern matcher seems accurate, with false positive and false negative rates below 5%. Yet the extreme rarity of the phenomenon means that a positive result from the detector is most likely to be wrong.
A.3 Independence
Two events are independent if the probability of their intersection is equal to the product of their probabilities: Pr(A ∩ B) = Pr(A) × Pr(B). For example, for two flips of a fair
Under contract with MIT Press, shared under CC-BY-NC-ND license.

480 APPENDIX A. PROBABILITY coin, the probability of getting heads on the first flip is independent of the probability of
getting heads on the second flip:
Pr({H T , H H }) = Pr(H T ) + Pr(H H ) = 1 + 1 = 1 442
Pr({HH, T H}) = Pr(HH) + Pr(T H) = 1 + 1 = 1 442
Pr({HT,HH})×Pr({HH,TH}) =1 × 1 = 1 224
Pr({HT,HH}∩{HH,TH})=Pr(HH)= 1 4
=Pr({HT,HH})×Pr({HH,TH}).
[A.23]
[A.24]
[A.25]
[A.26]
[A.27]
If Pr(A ∩ B | C) = Pr(A | C) × Pr(B | C), then the events A and B are conditionally independent, written A ⊥ B | C. Conditional independence plays a important role in probabilistic models such as Na ̈ıve Bayes chapter 2.
A.4 Random variables
Random variables are functions from events to Rn, where R is the set of real numbers.
This subsumes several useful special cases:
• An indicator random variable is a function from events to the set {0, 1}. In the coin flip example, we can define Y as an indicator random variable, taking the value 1 when the coin has come up heads on at least one flip. This would include the outcomes {HH,HT,TH}. The probability Pr(Y = 1) is the sum of the probabilities of these outcomes, Pr(Y = 1) = 1 + 1 + 1 = 3 .
• A discrete random variable is a function from events to a discrete subset of R. Con- sider the coin flip example: the number of heads on two flips, X, can be viewed as a discrete random variable, X ∈ 0, 1, 2. The event probability Pr(X = 1) can again be computed as the sum of the probabilities of the events in which there is one head, {HT,TH}, giving Pr(X = 1) = 1 + 1 = 1.
Each possible value of a random variable is associated with a subset of the sample space. In the coin flip example, X = 0 is associated with the event {TT}, X = 1 is associated with the event {HT,TH}, and X = 2 is associated with the event {HH}. Assuming a fair coin, the probabilities of these events are, respectively, 1/4, 1/2, and 1/4. This list of numbers represents the probability distribution over X, written pX, which maps from the possible values of X to the non-negative reals. For a specific value x, we write pX (x), which is equal to the event probability Pr(X = x).1 The function pX is called
1In general, capital letters (e.g., X) refer to random variables, and lower-case letters (e.g., x) refer to specific values. When the distribution is clear from context, I will simply write p(x).
Jacob Eisenstein. Draft of October 15, 2018.
4444
442

A.5. EXPECTATIONS 481
a probability mass function (pmf) if X is discrete; it is called a probability density function (pdf) if X is continuous. In either case, the function must sum to one, and all values must be non-negative:
􏰥x
∀x, pX (x) ≥0. [A.29]
Probabilities over multiple random variables can written as joint probabilities, e.g., pA,B (a, b) = Pr(A = a ∩ B = b). Several properties of event probabilities carry over to probability distributions over random variables:
• The marginal probability distribution is pA(a) = 􏰀b pA,B(a,b).
Sometimes we want the expectation of a function, such as E[g(x)] = 􏰀x∈X g(x)p(x). Expectations are easiest to think about in terms of probability distributions over discrete events:
• If it is sunny, Lucia will eat three ice creams.
• If it is rainy, she will eat only one ice cream.
• There’s a 80% chance it will be sunny.
• Theexpectednumberoficecreamsshewilleatis0.8×3+0.2×1=2.6.
If the random variable X is continuous, the expectation is an integral:
E[g(x)] = 􏰥 g(x)p(x)dx [A.30]
X
For example, a fast food restaurant in Quebec has a special offer for cold days: they give a 1% discount on poutine for every degree below zero. Assuming a thermometer with infinite precision, the expected price would be an integral over all possible temperatures,
E[price(x)] = 􏰥 min(1, 1 + x) × original-price × p(x)dx. [A.31] X
Under contract with MIT Press, shared under CC-BY-NC-ND license.
pX (x)dx =1 [A.28]
• The conditional probability distribution is p
• RandomvariablesAandBareindependentiffpA,B(a,b)=pA(a)×pB(b).
A.5 Expectations
(a | b) = pA,B (a,b) . A|B pB (b)

482 APPENDIX A. PROBABILITY A.6 Modeling and estimation
Probabilistic models provide a principled way to reason about random events and ran- dom variables. Let’s consider the coin toss example. Each toss can be modeled as a ran- dom event, with probability θ of the event H , and probability 1 − θ of the complementary event T. If we write a random variable X as the total number of heads on three coin flips, then the distribution of X depends on θ. In this case, X is distributed as a binomial random variable, meaning that it is drawn from a binomial distribution, with parameters (θ, N = 3). This is written,
X ∼ Binomial(θ, N = 3). [A.32] The properties of the binomial distribution enable us to make statements about the X,
such as its expected value and the likelihood that its value will fall within some interval.
Now suppose that θ is unknown, but we have run an experiment, in which we exe- cuted N trials, and obtained x heads. We can estimate θ by the principle of maximum likelihood:
θˆ = argmax pX (x; θ, N ). [A.33] θ
This says that the estimate θˆ should be the value that maximizes the likelihood of the data. The semicolon indicates that θ and N are parameters of the probability function. The likelihood pX (x; θ, N ) can be computed from the binomial distribution,
pX(x;θ,N) = N! θx(1−θ)N−x. [A.34] x!(N − x)!
This likelihood is proportional to the product of the probability of individual out-
comes: for example, the sequence T, H, H, T, H would have probability θ3(1 − θ)2. The
term N! arises from the many possible orderings by which we could obtain x heads x!(N −x)!
on N trials. This term does not depend on θ, so it can be ignored during estimation.
In practice, we maximize the log-likelihood, which is a monotonic function of the like-
lihood. Under the binomial distribution, the log-likelihood is a convex function of θ (see Jacob Eisenstein. Draft of October 15, 2018.

A.6. MODELINGANDESTIMATION 483 section 2.4), so it can be maximized by taking the derivative and setting it equal to zero.
l(θ) =x logθ+(N −x)log(1−θ) ∂l(θ) =x −N−x
∂θ θ 1−θ N − x =x
1−θ θ
N − x =1 − θ
xθ
N − 1 =1 − 1 xθ
θˆ = x . N
[A.35] [A.36]
[A.37] [A.38]
[A.39] [A.40]
In this case, the maximum likelihood estimate is equal to x , the fraction of trials that N
came up heads. This intuitive solution is also known as the relative frequency estimate, since it is equal to the relative frequency of the outcome.
Is maximum likelihood estimation always the right choice? Suppose you conduct one trial, and get heads. Would you conclude that θ = 1, meaning that the coin is guaran- teed to come up heads? If not, then you must have some prior expectation about θ. To incorporate this prior information, we can treat θ as a random variable, and use Bayes’ rule:
p(θ | x; N ) = p(x | θ) × p(θ) p(x)
ˆ ∝p(x | θ) × p(θ)
θ = argmax p(x | θ) × p(θ).
θ
[A.41]
[A.42] [A.43]
This it the maximum a posteriori (MAP) estimate. Given a form for p(θ), you can de- rive the MAP estimate using the same approach that was used to derive the maximum likelihood estimate.
Additional resources
A good introduction to probability theory is offered by Manning and Schu ̈tze (1999),
which helped to motivate this section. For more detail, Sharon Goldwater provides an-
other useful reference, http://homepages.inf.ed.ac.uk/sgwater/teaching/general/ probability.pdf. A historical and philosophical perspective on probability is offered
by Diaconis and Skyrms (2017).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Appendix B
Numerical optimization
Unconstrained numerical optimization involves solving problems of the form,
min f(x), x∈RD
[B.1]
where x ∈ RD is a vector of D real numbers.
Differentiation is fundamental to numerical optimiza􏰽tion. Suppose that at some x∗,
every partial derivative of f is equal to 0: formally, ∂f 􏰽􏰽 = 0. Then x∗ is said to be a ∂xi x∗
critical point of f. If f is a convex function (defined in section 2.4), then the value of f(x∗) is equal to the global minimum of f iff x∗ is a critical point of f.
As an example, consider the convex function f (x) = (x − 2)2 + 3, shown in Figure B.1a. Thederivativeis∂f =2x−4.Auniqueminimumcanbeobtainedbysettingthederivative
∂x ∗
equal to zero and solving for x, obtaining x = 2. Now consider the multivariate convex
function f (x) = 1 ||x − [2, 1]⊤ ||2 , where ||x||2 is the squared Euclidean norm. The partial 2
(a)Thefunctionf(x)=(x−2)2 +3 (b)Thefunctionf(x)=|x|−2cos(x) Figure B.1: Two functions with unique global minima
485
40 30 20 10
20 10 0
42024
x
20 10 0 10 20
x
(x 2)2+3
|x| 2cos(x)

486
derivatives are,
The unique minimum is x∗ = [2, 1]⊤.
APPENDIX B. NUMERICAL OPTIMIZATION
∂d =x1−2 ∂x1
∂d =x2−1 ∂x2
[B.2] [B.3]
For non-convex functions, critical points are not necessarily global minima. A local minimum x∗ is a point at which the function takes a smaller value than at all nearby neighbors: formally, x∗ is a local minimum if there is some positive ε such that f(x∗) ≤ f (x) for all x within distance ε of x∗ . Figure B.1b shows the function f (x) = |x| − 2 cos(x), which has many local minima, as well as a unique global minimum at x = 0. A critical point may also be the local or global maximum of the function; it may be a saddle point, which is a minimum with respect to at least one coordinate, and a maximum with respect at least one other coordinate; it may be an inflection point, which is neither or a minimum nor maximum. When available, the second derivative of f can help to distinguish these cases.
B.1 Gradient descent
For many convex functions, it is not possible to solve for x∗ in closed form. In gradient descent, we compute a series of solutions, x(0), x(1), . . . by taking steps along the local gradient ∇x(t)f, which is the vector of partial derivatives of the function f, evaluated at the point x(t). Each solution x(t+1) is computed,
x(t+1) ←x(t) − η(t)∇x(t) f. [B.4]
where η(t) > 0 is a step size. If the step size is chosen appropriately, this procedure will find the global minimum of a differentiable convex function. For non-convex functions, gradient descent will find a local minimum. The extension to non-differentiable convex functions is discussed in section 2.4.
B.2 Constrained optimization
Optimization must often be performed under constraints: for example, when optimizing the parameters of a probability distribution, the probabilities of all events must sum to one. Constrained optimization problems can be written,
min f (x) [B.5] x
s.t.gc(x)≤0, ∀c=1,2,…,C [B.6] Jacob Eisenstein. Draft of October 15, 2018.

B.3. EXAMPLE:PASSIVE-AGGRESSIVEONLINELEARNING 487
where each gc(x) is a scalar function of x. For example, suppose that x must be non- negative, and that its sum cannot exceed a budget b. Then there are D + 1 inequality constraints,
gi(x)=−xi, ∀i=1,2,…,D [B.7]
􏰁D
form, which is a function of λ:
D(λ) = min L(x, λ). [B.10] x
The Lagrangian L can be referred to as the primal form.
B.3 Example: Passive-aggressive online learning
Sometimes it is possible to solve a constrained optimization problem by manipulating the Lagrangian. One example is maximum-likelihood estimation of a Na ̈ıve Bayes probability model, as described in subsection 2.2.3. In that case, it is unnecessary to explicitly com- pute the Lagrange multiplier. Another example is illustrated by the passive-aggressive algorithm for online learning (Crammer et al., 2006). This algorithm is similar to the per- ceptron, but the goal at each step is to make the most conservative update that gives zero margin loss on the current example.1 Each update can be formulated as a constrained optimization over the weights θ:
min 1||θ − θ(i−1)||2 [B.11] θ2
s.t. l(i)(θ) = 0 [B.12] where θ(i−1) is the previous set of weights, and l(i)(θ) is the margin loss on instance i. As
in subsection 2.4.1, this loss is defined as,
l(i)(θ)=1−θ·f(x(i),y(i))+ max θ·f(x(i),y). [B.13]
y̸=y(i)
1This is the basis for the name of the algorithm: it is passive when the loss is zero, but it aggressively
moves to make the loss zero when necessary.
Under contract with MIT Press, shared under CC-BY-NC-ND license.
gD+1(x) = − b + L(x, λ) = f (x) +
xi. [B.8]
Inequality constraints can be combined with the original objective function f by form-
ing a Lagrangian,
λc gc (x), [B.9] where λc is a Lagrange multiplier. For any Lagrangian, there is a corresponding dual
i=1
􏰁C c=1

488 APPENDIX B. NUMERICAL OPTIMIZATION When the margin loss is zero for θ(i−1), the optimal solution is θ∗ = θ(i−1), so we will
focus on the case where l(i)(θ(i−1)) > 0. The Lagrangian for this problem is, L(θ,λ)= 1||θ−θ(i−1)||2 +λl(i)(θ),
2
[B.14]
[B.15] [B.16]
The Lagrange multiplier λ acts as the learning rate in a perceptron-style update to θ. We can solve for λ by plugging θ∗ back into the Lagrangian, obtaining the dual function,
Holding λ constant, we can solve for θ by differentiating, ∇θL =θ − θ(i−1) + λ ∂ l(i)(θ)
∂θ whereδ=f(x(i),y(i))−f(x(i),yˆ)andyˆ=argmaxy̸=y(i) θ·f(x(i),y).
θ∗ =θ(i−1) + λδ,
D(λ) =1||θ(i−1) + λδ − θ(i−1)||2 + λ(1 − (θ(i−1) + λδ) · δ) 2
[B.17] [B.18] [B.19]
[B.20] [B.21]
[B.22]
λ2 2 2 2
= 2 ||δ|| −λ ||δ|| +λ(1−θ
(i−1)
·δ)
λ2 2 (i) (i−1) =−2||δ||+λl (θ
).
Differentiating and solving for λ,
∂D = − λ||δ||2 + l(i)(θ(i−1))
||δ||2 The complete update equation is therefore:
θ∗ =θ(i−1) + l(i)(θ(i−1)) (f(x(i),y(i))−f(x(i),yˆ)). ||f(x(i),y(i))−f(x(i),yˆ)||2
∂λ
λ∗ =l(i)(θ(i−1)).
This learning rate makes intuitive sense. The numerator grows with the loss; the denom- inator grows with the norm of the difference between the feature vectors associated with the correct and predicted label. If this norm is large, then the step with respect to each feature should be small, and vice versa.
Jacob Eisenstein. Draft of October 15, 2018.

Bibliography
Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jo ́zefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane ́, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Vie ́gas, O. Vinyals, P. Warden, M. Watten- berg, M. Wicke, Y. Yu, and X. Zheng (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR abs/1603.04467.
Abend, O. and A. Rappoport (2017). The state of the art in semantic representation. In Proceedings of the Association for Computational Linguistics (ACL).
Abney, S., R. E. Schapire, and Y. Singer (1999). Boosting applied to tagging and PP attach- ment. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 132–134.
Abney, S. P. (1987). The English noun phrase in its sentential aspect. Ph. D. thesis, Mas- sachusetts Institute of Technology.
Abney, S. P. and M. Johnson (1991). Memory requirements and local ambiguities of pars- ing strategies. Journal of Psycholinguistic Research 20(3), 233–250.
Adafre, S. F. and M. De Rijke (2006). Finding similar sentences across multiple languages in wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources.
Ahn, D. (2006). The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pp. 1–8. Association for Computational Linguistics.
Aho, A. V., M. S. Lam, R. Sethi, and J. D. Ullman (2006). Compilers: Principles, Techniques, & Tools (2nd ed.). Addison-Wesley Publishing Company.
Aikhenvald, A. Y. (2004). Evidentiality. Oxford University Press. 489

490 BIBLIOGRAPHY Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control 19(6), 716–723.
Akmajian, A., R. A. Demers, A. K. Farmer, and R. M. Harnish (2010). Linguistics: An
introduction to language and communication (Sixth ed.). Cambridge, MA: MIT press.
Alfano, M., D. Hovy, M. Mitchell, and M. Strube (2018). Proceedings of the second acl workshop on ethics in natural language processing. In Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing. Association for Computational Lin- guistics.
Alfau, F. (1999). Chromos. Dalkey Archive Press.
Allauzen, C., M. Riley, J. Schalkwyk, W. Skut, and M. Mohri (2007). OpenFst: A gen- eral and efficient weighted finite-state transducer library. In International Conference on Implementation and Application of Automata, pp. 11–23. Springer.
Allen, J. F. (1984). Towards a general theory of action and time. Artificial intelligence 23(2), 123–154.
Allen, J. F., B. W. Miller, E. K. Ringger, and T. Sikorski (1996). A robust system for natural spoken dialogue. In Proceedings of the Association for Computational Linguistics (ACL), pp. 62–70.
Allen, J. F., L. K. Schubert, G. Ferguson, P. Heeman, C. H. Hwang, T. Kato, M. Light, N. Martin, B. Miller, M. Poesio, and D. Traum (1995). The TRAINS project: A case study in building a conversational planning agent. Journal of Experimental & Theoretical Artificial Intelligence 7(1), 7–48.
Alm, C. O., D. Roth, and R. Sproat (2005). Emotions from text: machine learning for text-based emotion prediction. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 579–586.
Alu ́ısio, S., J. Pelizzoni, A. Marchi, L. de Oliveira, R. Manenti, and V. Marquiafa ́vel (2003). An account of the challenge of tagging a reference corpus for Brazilian Portuguese. Computational Processing of the Portuguese Language, 194–194.
Anand, P., M. Walker, R. Abbott, J. E. Fox Tree, R. Bowmani, and M. Minor (2011). Cats rule and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pp. 1–9. Association for Computational Linguistics.
Anandkumar, A. and R. Ge (2016). Efficient approaches for escaping higher order saddle points in non-convex optimization. In Proceedings of the Conference On Learning Theory (COLT), pp. 81–102.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 491
Anandkumar, A., R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky (2014). Tensor decompo- sitions for learning latent variable models. The Journal of Machine Learning Research 15(1), 2773–2832.
Ando, R. K. and T. Zhang (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research 6, 1817– 1853.
Andor, D., C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins (2016). Globally normalized transition-based neural networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 2442–2452.
Angeli, G., P. Liang, and D. Klein (2010). A simple domain-independent probabilistic ap- proach to generation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 502–512.
Antol, S., A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015). Vqa: Visual question answering. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 2425–2433.
Aronoff, M. (1976). Word formation in generative grammar. MIT Press.
Arora, S. and B. Barak (2009). Computational complexity: a modern approach. Cambridge
University Press.
Arora, S., R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu (2013). A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference on Machine Learning (ICML), pp. 280–288.
Arora, S., Y. Li, Y. Liang, T. Ma, and A. Risteski (2018). Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association of Computational Linguistics 6, 483–495.
Artstein, R. and M. Poesio (2008). Inter-coder agreement for computational linguistics. Computational Linguistics 34(4), 555–596.
Artzi, Y. and L. Zettlemoyer (2013). Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguis- tics 1, 49–62.
Attardi, G. (2006). Experiments with a multilanguage non-projective dependency parser. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 166–170.
Auer, P. (2013). Code-switching in conversation: Language, interaction and identity. Routledge. Under contract with MIT Press, shared under CC-BY-NC-ND license.

492 BIBLIOGRAPHY Auer, S., C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007). Dbpedia: A
nucleus for a web of open data. The semantic web, 722–735.
Austin, J. L. (1962). How to do things with words. Oxford University Press.
Aw, A., M. Zhang, J. Xiao, and J. Su (2006). A phrase-based statistical model for SMS text normalization. In Proceedings of the Association for Computational Linguistics (ACL), pp. 33–40.
Ba, J. L., J. R. Kiros, and G. E. Hinton (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Bagga, A. and B. Baldwin (1998a). Algorithms for scoring coreference chains. In Proceed- ings of the Language Resources and Evaluation Conference, pp. 563–566.
Bagga, A. and B. Baldwin (1998b). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the International Conference on Computational Lin- guistics (COLING), pp. 79–85.
Bahdanau, D., K. Cho, and Y. Bengio (2014). Neural machine translation by jointly learn- ing to align and translate. In Neural Information Processing Systems (NIPS).
Baldwin, T. and S. N. Kim (2010). Multiword expressions. In Handbook of natural language processing, Volume 2, pp. 267–292. Boca Raton, USA: CRC Press.
Balle, B., A. Quattoni, and X. Carreras (2011). A spectral learning algorithm for finite state transducers. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML), pp. 156–171.
Banarescu, L., C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider (2013, August). Abstract meaning represen- tation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and In- teroperability with Discourse, Sofia, Bulgaria, pp. 178–186. Association for Computational Linguistics.
Banko, M., M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni (2007). Open information extraction from the web. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 2670–2676.
Bansal, N., A. Blum, and S. Chawla (2004). Correlation clustering. Machine Learning 56(1- 3), 89–113.
Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge University Press. Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 493
Barman, U., A. Das, J. Wagner, and J. Foster (2014). Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 13–23. Association for Computational Linguistics.
Baron, A. and P. Rayson (2008). Vard2: A tool for dealing with spelling variation in his- torical corpora. In Postgraduate conference in corpus linguistics.
Baroni, M., R. Bernardi, and R. Zamparelli (2014). Frege in space: A program for compo- sitional distributional semantics. Linguistic Issues in Language Technologies.
Barzilay, R. and M. Lapata (2008). Modeling local coherence: An Entity-Based approach. Computational Linguistics 34(1), 1–34.
Barzilay, R. and K. R. McKeown (2005). Sentence fusion for multidocument news summa- rization. Computational Linguistics 31(3), 297–328.
Beesley, K. R. and L. Karttunen (2003). Finite-state morphology. Stanford, CA: Center for the Study of Language and Information.
Bejan, C. A. and S. Harabagiu (2014). Unsupervised event coreference resolution. Compu- tational Linguistics 40(2), 311–347.
Bell, E. T. (1934). Exponential numbers. The American Mathematical Monthly 41(7), 411–419.
Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax, Volume 6 of Synthesis Lectures on Human Language Technolo- gies. Morgan & Claypool Publishers.
Bengio, S., O. Vinyals, N. Jaitly, and N. Shazeer (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Neural Information Processing Systems (NIPS), pp. 1171–1179.
Bengio, Y., R. Ducharme, P. Vincent, and C. Janvin (2003). A neural probabilistic language model. The Journal of Machine Learning Research 3, 1137–1155.
Bengio, Y., P. Simard, and P. Frasconi (1994). Learning long-term dependencies with gra- dient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166.
Bengtson, E. and D. Roth (2008). Understanding the value of features for coreference resolution. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 294–303.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

494 BIBLIOGRAPHY
Berant, J., A. Chou, R. Frostig, and P. Liang (2013). Semantic parsing on freebase from question-answer pairs. In Proceedings of Empirical Methods for Natural Language Process- ing (EMNLP), pp. 1533–1544.
Berant, J., V. Srikumar, P.-C. Chen, A. Vander Linden, B. Harding, B. Huang, P. Clark, and C. D. Manning (2014). Modeling biological processes for reading comprehension. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Berg-Kirkpatrick, T., A. Bouchard-Coˆte ́, J. DeNero, and D. Klein (2010). Painless unsuper- vised learning with features. In Proceedings of the North American Chapter of the Associa- tion for Computational Linguistics (NAACL), pp. 582–590.
Berg-Kirkpatrick, T., D. Burkett, and D. Klein (2012). An empirical investigation of sta- tistical significance in NLP. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 995–1005.
Berger, A. L., V. J. D. Pietra, and S. A. D. Pietra (1996). A maximum entropy approach to natural language processing. Computational linguistics 22(1), 39–71.
Bergsma, S., D. Lin, and R. Goebel (2008). Distributional identification of non-referential pronouns. In Proceedings of the Association for Computational Linguistics (ACL), pp. 10–18.
Bernardi, R., R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Mus- cat, and B. Plank (2016). Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research 55, 409–442.
Bertsekas, D. P. (2012). Incremental gradient, subgradient, and proximal methods for con- vex optimization: A survey. In S. Sra, S. Nowozin, and S. J. Wright (Eds.), Optimization for machine learning. MIT Press.
Bhatia, P., R. Guthrie, and J. Eisenstein (2016). Morphological priors for probabilistic neu- ral word embeddings. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Bhatia, P., Y. Ji, and J. Eisenstein (2015). Better document-level sentiment analysis from rst discourse parsing. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Biber, D. (1991). Variation across speech and writing. Cambridge University Press.
Bird, S., E. Klein, and E. Loper (2009). Natural language processing with Python. California:
O’Reilly Media.
Bishop, C. M. (2006). Pattern recognition and machine learning. springer.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 495 Bjo ̈rkelund, A. and P. Nugues (2011). Exploring lexicalized features for coreference reso-
lution. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 45–50. Blackburn, P. and J. Bos (2005). Representation and inference for natural language: A first
course in computational semantics. CSLI.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM 55(4), 77–84.
Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application 1, 203–232.
Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent dirichlet allocation. The Journal of Machine Learning Research 3, 993–1022.
Blitzer, J., M. Dredze, and F. Pereira (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Associa- tion for Computational Linguistics (ACL), pp. 440–447.
Blum, A. and T. Mitchell (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the Conference On Learning Theory (COLT), pp. 92–100.
Bobrow, D. G., R. M. Kaplan, M. Kay, D. A. Norman, H. Thompson, and T. Winograd (1977). Gus, a frame-driven dialog system. Artificial intelligence 8(2), 155–173.
Bohnet, B. (2010). Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 89–97.
Boitet, C. (1988). Pros and cons of the pivot and transfer approaches in multilingual ma- chine translation. Readings in machine translation, 273–279.
Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135– 146.
Bollacker, K., C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008). Freebase: a collabora- tively created graph database for structuring human knowledge. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1247–1250. AcM.
Bolukbasi, T., K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Neural Information Processing Systems (NIPS), pp. 4349–4357.
Bordes, A., N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013). Translating embeddings for modeling multi-relational data. In Neural Information Processing Systems (NIPS), pp. 2787–2795.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

496 BIBLIOGRAPHY
Bordes, A., J. Weston, R. Collobert, Y. Bengio, et al. (2011). Learning structured embed- dings of knowledge bases. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 301–306.
Borges, J. L. (1993). Other Inquisitions 1937–1952. University of Texas Press. Translated by Ruth L. C. Simms.
Botha, J. A. and P. Blunsom (2014). Compositional morphology for word representations and language modelling. In Proceedings of the International Conference on Machine Learn- ing (ICML).
Bottou, L. (2012). Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436. Springer.
Bottou, L., F. E. Curtis, and J. Nocedal (2016). Optimization methods for large-scale ma- chine learning. arXiv preprint arXiv:1606.04838.
Bowman, S. R., L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016). Gen- erating sentences from a continuous space. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 10–21.
boyd, d. and K. Crawford (2012). Critical questions for big data. Information, Communica- tion & Society 15(5), 662–679.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. New York: Cambridge Uni- versity Press.
Boydstun, A. E. (2013). Making the news: Politics, the media, and agenda setting. University of Chicago Press.
Branavan, S., H. Chen, J. Eisenstein, and R. Barzilay (2009). Learning document-level semantic properties from free-text annotations. Journal of Artificial Intelligence Re- search 34(2), 569–603.
Branavan, S. R., H. Chen, L. S. Zettlemoyer, and R. Barzilay (2009). Reinforcement learning for mapping instructions to actions. In Proceedings of the Association for Computational Linguistics (ACL), pp. 82–90.
Brants, T. and A. Franz (2006). The Google 1T 5-gram corpus. LDC2006T13.
Braud, C., O. Lacroix, and A. Søgaard (2017). Does syntax help discourse segmenta- tion? not so much. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2432–2442.
Brennan, S. E., M. W. Friedman, and C. J. Pollard (1987). A centering approach to pro- nouns. In Proceedings of the Association for Computational Linguistics (ACL), pp. 155–162.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 497 Briscoe, T. (2011). Introduction to formal semantics for natural language.
https://www.cl.cam.ac.uk/teaching/1011/L107/semantics.pdf.
Brown, P. F., J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin (1990). A statistical approach to machine translation. Computational linguistics 16(2), 79–85.
Brown, P. F., P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai (1992). Class-based n-gram models of natural language. Computational linguistics 18(4), 467–479.
Brown, P. F., V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer (1993). The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19(2), 263–311.
Brun, C. and C. Roux (2014). De ́composition des “hash tags” pour l’ame ́lioration de la classification en polarite ́ des “tweets”. Proceedings of Traitement Automatique des Langues Naturelles, 473–478.
Bruni, E., N.-K. Tran, and M. Baroni (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research 49(2014), 1–47.
Brutzkus, A., A. Globerson, E. Malach, and S. Shalev-Shwartz (2018). Sgd learns over- parameterized networks that provably generalize on linearly separable data. In Pro- ceedings of the International Conference on Learning Representations (ICLR).
Bullinaria, J. A. and J. P. Levy (2007). Extracting semantic representations from word co- occurrence statistics: A computational study. Behavior research methods 39(3), 510–526.
Bunescu, R. C. and R. J. Mooney (2005). A shortest path dependency kernel for relation extraction. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 724–731.
Bunescu, R. C. and M. Pasca (2006). Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 9–16.
Burstein, J., D. Marcu, and K. Knight (2003). Finding the WRITE stuff: Automatic identi- fication of discourse structure in student essays. IEEE Intelligent Systems 18(1), 32–39.
Burstein, J., J. Tetreault, and S. Andreyev (2010). Using entity-based features to model coherence in student essays. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics, pp. 681–684. Association for Computational Linguistics.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

498 BIBLIOGRAPHY Burstein, J., J. Tetreault, and M. Chodorow (2013). Holistic discourse coherence annotation
for noisy essay writing. Dialogue & Discourse 4(2), 34–52.
Cai, Q. and A. Yates (2013). Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of the Association for Computational Linguistics (ACL), pp. 423– 433.
Caliskan, A., J. J. Bryson, and A. Narayanan (2017). Semantics derived automatically from language corpora contain human-like biases. Science 356(6334), 183–186.
Canny, J. (1987). A computational approach to edge detection. In Readings in Computer Vision, pp. 184–203. Elsevier.
Cappe ́, O. and E. Moulines (2009). On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(3), 593–613.
Carbonell, J. and J. Goldstein (1998). The use of MMR, diversity-based reranking for re- ordering documents and producing summaries. In Proceedings of ACM SIGIR conference on Research and development in information retrieval.
Carbonell, J. R. (1970). Mixed-initiative man-computer instructional dialogues. Technical report, Bolt Beranek and Newman.
Cardie, C. and K. Wagstaff (1999). Noun phrase coreference as clustering. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 82–89.
Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Com- putational linguistics 22(2), 249–254.
Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi- everything ami meeting corpus. Language Resources and Evaluation 41(2), 181–190.
Carlson, L. and D. Marcu (2001). Discourse tagging reference manual. Technical Report ISI-TR-545, Information Sciences Institute.
Carlson, L., M. E. Okurowski, and D. Marcu (2002). RST discourse treebank. Linguistic Data Consortium, University of Pennsylvania.
Carpenter, B. (1997). Type-logical semantics. Cambridge, MA: MIT Press.
Carreras, X., M. Collins, and T. Koo (2008). Tag, dynamic programming, and the percep- tron for efficient, feature-rich parsing. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 9–16.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 499
Carreras, X. and L. Ma`rquez (2005). Introduction to the conll-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pp. 152–164. Association for Computational Linguistics.
Carroll, L. (1865). Alice’s Adventures in Wonderland. London: Macmillan.
Carroll, L. (1917). Through the looking glass: And what Alice found there. Chicago: Rand,
McNally.
Chambers, N. and D. Jurafsky (2008). Jointly combining implicit constraints improves temporal ordering. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 698–706.
Chang, K.-W., A. Krishnamurthy, A. Agarwal, H. Daume III, and J. Langford (2015). Learning to search better than your teacher. In Proceedings of the International Confer- ence on Machine Learning (ICML).
Chang, M.-W., L. Ratinov, and D. Roth (2007). Guiding semi-supervision with constraint- driven learning. In Proceedings of the Association for Computational Linguistics (ACL), pp. 280–287.
Chang, M.-W., L.-A. Ratinov, N. Rizzolo, and D. Roth (2008). Learning and inference with constraints. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 1513–1518.
Chapman, W. W., W. Bridewell, P. Hanbury, G. F. Cooper, and B. G. Buchanan (2001). A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics 34(5), 301–310.
Charniak, E. (1997). Statistical techniques for natural language parsing. AI magazine 18(4), 33–43.
Charniak, E. and M. Johnson (2005). Coarse-to-fine n-best parsing and maxent discrimi- native reranking. In Proceedings of the Association for Computational Linguistics (ACL), pp. 173–180.
Chelba, C. and A. Acero (2006). Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language 20(4), 382–399.
Chelba, C., T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Chen, D., J. Bolton, and C. D. Manning (2016). A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the Association for Computational Linguistics (ACL).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

500 BIBLIOGRAPHY
Chen, D. and C. D. Manning (2014). A fast and accurate dependency parser using neural networks. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 740–750.
Chen, D. L. and R. J. Mooney (2008). Learning to sportscast: a test of grounded language acquisition. In Proceedings of the International Conference on Machine Learning (ICML), pp. 128–135.
Chen, H., S. Branavan, R. Barzilay, and D. R. Karger (2009). Content modeling using latent permutations. Journal of Artificial Intelligence Research 36(1), 129–163.
Chen, M., Z. Xu, K. Weinberger, and F. Sha (2012). Marginalized denoising autoencoders for domain adaptation. In Proceedings of the International Conference on Machine Learning (ICML).
Chen, M. X., O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, Y. Wu, and M. Hughes (2018). The best of both worlds: Combin- ing recent advances in neural machine translation. In Proceedings of the Association for Computational Linguistics (ACL).
Chen, S. F. and J. Goodman (1999). An empirical study of smoothing techniques for lan- guage modeling. Computer Speech & Language 13(4), 359–393.
Chen, T. and C. Guestrin (2016). Xgboost: A scalable tree boosting system. In Proceedings of Knowledge Discovery and Data Mining (KDD), pp. 785–794.
Chen, X., X. Qiu, C. Zhu, P. Liu, and X. Huang (2015). Long short-term memory neural networks for chinese word segmentation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1197–1206.
Chen, Y., S. Gilroy, A. Malletti, K. Knight, and J. May (2018). Recurrent neural networks as weighted language recognizers. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Chen, Z. and H. Ji (2009). Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pp. 54–57. Association for Computational Linguistics.
Cheng, X. and D. Roth (2013). Relational inference for wikification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1787–1796.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics 33(2), 201–228.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 501
Chiang, D., J. Graehl, K. Knight, A. Pauls, and S. Ravi (2010). Bayesian inference for finite-state transducers. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 447–455.
Chinchor, N. and P. Robinson (1997). MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding, Volume 29.
Cho, K. (2015). Natural language understanding with distributed representation. CoRR abs/1511.07916.
Cho, K., B. Van Merrie ̈nboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014). Learning phrase representations using rnn encoder-decoder for sta- tistical machine translation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton & Co.
Chomsky, N. (1982). Some concepts and consequences of the theory of government and binding,
Volume 6. MIT press.
Choromanska, A., M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun (2015). The loss surfaces of multilayer networks. In Proceedings of Artificial Intelligence and Statistics (AIS- TATS), pp. 192–204.
Christodoulopoulos, C., S. Goldwater, and M. Steedman (2010). Two decades of unsuper- vised pos induction: How far have we come? In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 575–584.
Chu, Y.-J. and T.-H. Liu (1965). On shortest arborescence of a directed graph. Scientia Sinica 14(10), 1396–1400.
Chung, C. and J. W. Pennebaker (2007). The psychological functions of function words. In K. Fiedler (Ed.), Social communication, pp. 343–359. New York and Hove: Psychology Press.
Church, K. (2011). A pendulum swung too far. Linguistic Issues in Language Technology 6(5), 1–27.
Church, K. W. (2000). Empirical estimates of adaptation: the chance of two Noriegas is closer to p/2 than p2. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 180–186.
Church, K. W. and P. Hanks (1990). Word association norms, mutual information, and lexicography. Computational linguistics 16(1), 22–29.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

502 BIBLIOGRAPHY
Ciaramita, M. and M. Johnson (2003). Supersense tagging of unknown nouns in wordnet. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 168– 175.
Clark, K. and C. D. Manning (2015). Entity-centric coreference resolution with model stacking. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1405– 1415.
Clark, K. and C. D. Manning (2016). Improving coreference resolution by learning entity- level distributed representations. In Proceedings of the Association for Computational Lin- guistics (ACL).
Clark, P. (2015). Elementary school science and math tests as a driver for AI: take the aristo challenge! In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 4019–4021.
Clarke, J., D. Goldwasser, M.-W. Chang, and D. Roth (2010). Driving semantic parsing from the world’s response. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 18–27.
Clarke, J. and M. Lapata (2008). Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research 31, 399–429.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychologi- cal measurement 20(1), 37–46.
Cohen, S. (2016). Bayesian analysis in natural language processing. Synthesis Lectures on Human Language Technologies. San Rafael, CA: Morgan & Claypool Publishers.
Cohen, S. B., K. Stratos, M. Collins, D. P. Foster, and L. Ungar (2014). Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning Research 15, 2399–2449.
Collier, N., C. Nobata, and J.-i. Tsujii (2000). Extracting the names of genes and gene products with a hidden markov model. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 201–207.
Collins, M. (1997). Three generative, lexicalised models for statistical parsing. In Proceed- ings of the Association for Computational Linguistics (ACL), pp. 16–23.
Collins, M. (2002). Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1–8.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 503 Collins, M. and T. Koo (2005). Discriminative reranking for natural language parsing.
Computational Linguistics 31(1), 25–70.
Collins, M. and B. Roark (2004). Incremental parsing with the perceptron algorithm. In
Proceedings of the Association for Computational Linguistics (ACL).
Collobert, R., K. Kavukcuoglu, and C. Farabet (2011). Torch7: A matlab-like environment
for machine learning. Technical Report EPFL-CONF-192376, EPFL.
Collobert, R. and J. Weston (2008). A unified architecture for natural language process- ing: Deep neural networks with multitask learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 160–167.
Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011). Nat- ural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537.
Colton, S., J. Goodwin, and T. Veale (2012). Full-face poetry generation. In Proceedings of the International Conference on Computational Creativity, pp. 95–102.
Conneau, A., D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017). Supervised learning of universal sentence representations from natural language inference data. In Proceed- ings of Empirical Methods for Natural Language Processing (EMNLP), pp. 681–691.
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to algorithms (third ed.). MIT press.
Cotterell,R.,H.Schu ̈tze,andJ.Eisner(2016).Morphologicalsmoothingandextrapolation of word embeddings. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1651–1660.
Coviello, L., Y. Sohn, A. D. Kramer, C. Marlow, M. Franceschetti, N. A. Christakis, and J. H. Fowler (2014). Detecting emotional contagion in massive social networks. PloS one 9(3), e90315.
Covington, M. A. (2001). A fundamental algorithm for dependency parsing. In Proceedings of the 39th annual ACM southeast conference, pp. 95–102.
Crammer, K., O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer (2006, December). Online passive-aggressive algorithms. The Journal of Machine Learning Research 7, 551– 585.
Crammer, K. and Y. Singer (2001). Pranking with ranking. In Neural Information Processing Systems (NIPS), pp. 641–647.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

504 BIBLIOGRAPHY Crammer, K. and Y. Singer (2003). Ultraconservative online algorithms for multiclass
problems. The Journal of Machine Learning Research 3, 951–991.
Creutz, M. and K. Lagus (2007). Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP) 4(1), 3.
Cross, J. and L. Huang (2016). Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1–11.
Cucerzan, S. (2007). Large-scale named entity disambiguation based on wikipedia data. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Cui, H., R. Sun, K. Li, M.-Y. Kan, and T.-S. Chua (2005). Question answering passage retrieval using dependency relations. In Proceedings of ACM SIGIR conference on Research and development in information retrieval, pp. 400–407.
Cui, Y., Z. Chen, S. Wei, S. Wang, T. Liu, and G. Hu (2017). Attention-over-attention neural networks for reading comprehension. In Proceedings of the Association for Computational Linguistics (ACL).
Culotta, A. and J. Sorensen (2004). Dependency tree kernels for relation extraction. In Proceedings of the Association for Computational Linguistics (ACL).
Culotta, A., M. Wick, and A. McCallum (2007). First-order probabilistic models for coref- erence resolution. In Proceedings of the North American Chapter of the Association for Com- putational Linguistics (NAACL), pp. 81–88.
Curry, H. B. and R. Feys (1958). Combinatory Logic, Volume I. Amsterdam: North Holland.
Danescu-Niculescu-Mizil, C., M. Sudhof, D. Jurafsky, J. Leskovec, and C. Potts (2013). A computational approach to politeness with application to social factors. In Proceedings of the Association for Computational Linguistics (ACL), pp. 250–259.
Das, D., D. Chen, A. F. Martins, N. Schneider, and N. A. Smith (2014). Frame-semantic parsing. Computational Linguistics 40(1), 9–56.
Daume ́ III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of the Associa- tion for Computational Linguistics (ACL).
Daume ́ III, H., J. Langford, and D. Marcu (2009). Search-based structured prediction. Machine learning 75(3), 297–325.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 505
Daume ́ III, H. and D. Marcu (2005). A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 97–104.
Dauphin, Y. N., R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014). Iden- tifying and attacking the saddle point problem in high-dimensional non-convex opti- mization. In Neural Information Processing Systems (NIPS), pp. 2933–2941.
Davidson, D. (1967). The logical form of action sentences. In N. Rescher (Ed.), The Logic of Decision and Action. Pittsburgh: University of Pittsburgh Press.
De Gispert, A. and J. B. Marino (2006). Catalan-english statistical machine translation without parallel corpus: bridging through spanish. In Proc. of 5th International Conference on Language Resources and Evaluation (LREC), pp. 65–68. Citeseer.
De Marneffe, M.-C. and C. D. Manning (2008). The stanford typed dependencies represen- tation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pp. 1–8. Association for Computational Linguistics.
Dean, J. and S. Ghemawat (2008). Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113.
Deerwester, S. C., S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990). Indexing by latent semantic analysis. Journal of the American society for information sci- ence 41(6), 391–407.
Dehdari, J. (2014). A Neurophysiologically-Inspired Statistical Language Model. Ph. D. thesis, The Ohio State University.
Deisenroth, M. P., A. A. Faisal, and C. S. Ong (2018). Mathematics For Machine Learning. Cambridge University Press.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incom- plete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Method- ological), 1–38.
Denis, P. and J. Baldridge (2007). A ranking approach to pronoun resolution. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
Denis, P. and J. Baldridge (2008). Specialized models and ranking for coreference resolu- tion. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 660–669.
Denis, P. and J. Baldridge (2009). Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural 42.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

506 BIBLIOGRAPHY Derrida, J. (1985). Des tours de babel. In J. Graham (Ed.), Difference in translation. Ithaca,
NY: Cornell University Press.
Dhingra, B., H. Liu, Z. Yang, W. W. Cohen, and R. Salakhutdinov (2017). Gated-attention readers for text comprehension. In Proceedings of the Association for Computational Lin- guistics (ACL).
Diaconis, P. and B. Skyrms (2017). Ten Great Ideas About Chance. Princeton University Press.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classifica- tion learning algorithms. Neural computation 10(7), 1895–1923.
Dietterich, T. G., R. H. Lathrop, and T. Lozano-Pe ́rez (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89(1), 31–71.
Dimitrova, L., N. Ide, V. Petkevic, T. Erjavec, H. J. Kaalep, and D. Tufis (1998). Multext- east: Parallel and comparable corpora and lexicons for six central and eastern euro- pean languages. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 315–319.
Doddington, G. R., A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. Strassel, and R. M. Weischedel (2004). The automatic content extraction (ACE) program-tasks, data, and evaluation. In Proceedings of the Language Resources and Evaluation Conference, pp. 837– 840.
dos Santos, C., B. Xiang, and B. Zhou (2015). Classifying relations by ranking with con- volutional neural networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 626–634.
Dowty, D. (1991). Thematic proto-roles and argument selection. Language, 547–619.
Dredze, M., P. McNamee, D. Rao, A. Gerber, and T. Finin (2010). Entity disambiguation for knowledge base population. In Proceedings of the International Conference on Compu- tational Linguistics (COLING), pp. 277–285.
Dredze, M., M. J. Paul, S. Bergsma, and H. Tran (2013). Carmen: A Twitter geolocation system with applications to public health. In AAAI workshop on expanding the boundaries of health informatics using AI (HIAI), pp. 20–24.
Dreyfus, H. L. (1992). What computers still can’t do: a critique of artificial reason. MIT press.
Dror, R., G. Baumer, M. Bogomolov, and R. Reichart (2017). Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics 5, 471–486.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 507
Dror, R., G. Baumer, S. Shlomov, and R. Reichart (2018). The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1383–1392.
Du, L., W. Buntine, and M. Johnson (2013). Topic segmentation with a structured topic model. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 190–200.
Duchi, J., E. Hazan, and Y. Singer (2011). Adaptive subgradient methods for online learn- ing and stochastic optimization. The Journal of Machine Learning Research 12, 2121–2159.
Dunietz, J., L. Levin, and J. Carbonell (2017). The because corpus 2.0: Annotating causality and overlapping relations. In Proceedings of the Linguistic Annotation Workshop.
Durrett, G., T. Berg-Kirkpatrick, and D. Klein (2016). Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1998–2008.
Durrett, G. and D. Klein (2013). Easy victories and uphill battles in coreference resolution. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Durrett, G. and D. Klein (2015). Neural CRF parsing. In Proceedings of the Association for Computational Linguistics (ACL).
Dyer, C., M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith (2015). Transition-based dependency parsing with stack long short-term memory. In Proceedings of the Association for Computational Linguistics (ACL), pp. 334–343.
Dyer, C., A. Kuncoro, M. Ballesteros, and N. A. Smith (2016). Recurrent neural network grammars. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 199–209.
Edmonds, J. (1967). Optimum branchings. Journal of Research of the National Bureau of Standards B 71(4), 233–240.
Efron, B. and R. J. Tibshirani (1993). An introduction to the bootstrap: Monographs on statistics and applied probability, vol. 57. New York and London: Chapman and Hall/CRC.
Eisenstein, J. (2009). Hierarchical text segmentation from multi-scale lexical cohesion. In
Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Eisenstein, J. and R. Barzilay (2008). Bayesian unsupervised topic segmentation. In Pro- ceedings of Empirical Methods for Natural Language Processing (EMNLP).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

508 BIBLIOGRAPHY
Eisner, J. (1997). State-of-the-art algorithms for minimum spanning trees: A tutorial dis- cussion. https://www.cs.jhu.edu/ ̃jason/papers/eisner.mst-tutorial. pdf.
Eisner, J. (2000). Bilexical grammars and their cubic-time parsing algorithms. In Advances in probabilistic and other parsing technologies, pp. 29–61. Springer.
Eisner, J. (2002). Parameter estimation for probabilistic finite-state transducers. In Proceed- ings of the Association for Computational Linguistics (ACL), pp. 1–8.
Eisner, J. (2016). Inside-outside and forward-backward algorithms are just backprop. In Proceedings of the Workshop on Structured Prediction for NLP, pp. 1–17.
Eisner, J. M. (1996). Three new probabilistic models for dependency parsing: An explo- ration. In Proceedings of the International Conference on Computational Linguistics (COL- ING), pp. 340–345.
Ekman, P. (1992). Are there basic emotions? Psychological Review 99(3), 550–553.
Elman, J. L. (1990). Finding structure in time. Cognitive science 14(2), 179–211.
Elman, J. L., E. A. Bates, M. H. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett (1998). Rethinking innateness: A connectionist perspective on development, Volume 10. MIT press.
Elsner, M. and E. Charniak (2010). Disentangling chat. Computational Linguistics 36(3), 389–409.
Esuli, A. and F. Sebastiani (2006). SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the Language Resources and Evaluation Conference, pp. 417–422.
Etzioni, O., A. Fader, J. Christensen, S. Soderland, and M. Mausam (2011). Open informa- tion extraction: The second generation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 3–10.
Faruqui, M., J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith (2015). Retrofitting word vectors to semantic lexicons. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Faruqui, M. and C. Dyer (2014). Improving vector space word representations using mul- tilingual correlation. In Proceedings of the European Chapter of the Association for Computa- tional Linguistics (EACL), pp. 462–471.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 509
Faruqui, M., R. McDonald, and R. Soricut (2016). Morpho-syntactic lexicon generation using graph-based semi-supervised learning. Transactions of the Association for Computa- tional Linguistics 4, 1–16.
Faruqui, M., Y. Tsvetkov, P. Rastogi, and C. Dyer (2016). Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 30–35. Association for Computational Linguis- tics.
Fellbaum, C. (2010). WordNet. Springer.
Feng, V. W., Z. Lin, and G. Hirst (2014). The impact of deep hierarchical discourse struc- tures in the evaluation of text coherence. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 940–949.
Feng, X., L. Huang, D. Tang, H. Ji, B. Qin, and T. Liu (2016). A language-independent neural network for event detection. In Proceedings of the Association for Computational Linguistics (ACL), pp. 66–71.
Fernandes, E. R., C. N. dos Santos, and R. L. Milidiu ́ (2014). Latent trees for coreference resolution. Computational Linguistics.
Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. (2010). Building Watson: An overview of the DeepQA project. AI magazine 31(3), 59–79.
Ficler, J. and Y. Goldberg (2017). Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pp. 94–104. Association for Computational Linguistics.
Filippova, K. and M. Strube (2008). Sentence fusion via dependency graph compression. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 177– 185.
Fillmore, C. J. (1968). The case for case. In E. Bach and R. Harms (Eds.), Universals in linguistic theory. Holt, Rinehart, and Winston.
Fillmore, C. J. (1976). Frame semantics and the nature of language. Annals of the New York Academy of Sciences 280(1), 20–32.
Fillmore, C. J. and C. Baker (2009). A frames approach to semantic analysis. In The Oxford Handbook of Linguistic Analysis. Oxford University Press.
Finkel, J. R., T. Grenager, and C. Manning (2005). Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the Association for Computational Linguistics (ACL), pp. 363–370.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

510 BIBLIOGRAPHY Finkel, J. R., T. Grenager, and C. D. Manning (2007). The infinite tree. In Proceedings of the
Association for Computational Linguistics (ACL), pp. 272–279.
Finkel, J. R., A. Kleeman, and C. D. Manning (2008). Efficient, feature-based, conditional random field parsing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 959–967.
Finkel, J. R. and C. Manning (2009). Hierarchical bayesian domain adaptation. In Proceed- ings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 602–610.
Finkel, J. R. and C. D. Manning (2008). Enforcing transitivity in coreference resolution. In Proceedings of the Association for Computational Linguistics (ACL), pp. 45–48.
Finkelstein, L., E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1), 116–131.
Firth, J. R. (1957). Papers in Linguistics 1934-1951. Oxford University Press.
Flanigan, J., S. Thomson, J. Carbonell, C. Dyer, and N. A. Smith (2014). A discrimina- tive graph-based parser for the abstract meaning representation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1426–1436.
Foltz, P. W., W. Kintsch, and T. K. Landauer (1998). The measurement of textual coherence with latent semantic analysis. Discourse processes 25(2-3), 285–307.
Fordyce, C. (2007). Overview of the IWSLT 2007 evaluation campaign. In International Workshop on Spoken Language Translation (IWSLT) 2007.
Forsyth, E. N. and C. H. Martell (2007). Lexical and discourse analysis of online chat dialog. In International Conference on Semantic Computing, pp. 19–26. IEEE.
Fort, K., G. Adda, and K. B. Cohen (2011). Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics 37(2), 413–420.
Fox, H. (2002). Phrasal cohesion and statistical machine translation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 304–3111.
Francis, W. and H. Kucera (1982). Frequency analysis of English usage. Houghton Mifflin Company.
Francis, W. N. (1964). A standard sample of present-day English for use with digital computers. Report to the U.S Office of Education on Cooperative Research Project No. E-007.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 511 Freund, Y. and R. E. Schapire (1999). Large margin classification using the perceptron
algorithm. Machine learning 37(3), 277–296.
Fromkin, V., R. Rodman, and N. Hyams (2013). An introduction to language. Cengage
Learning.
Fundel, K., R. Ku ̈ffner, and R. Zimmer (2007). Relex – relation extraction using depen-
dency parse trees. Bioinformatics 23(3), 365–371.
Gabow, H. N., Z. Galil, T. Spencer, and R. E. Tarjan (1986). Efficient algorithms for finding minimum spanning trees in undirected and directed graphs. Combinatorica 6(2), 109– 122.
Gabrilovich, E. and S. Markovitch (2007). Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of the International Joint Con- ference on Artificial Intelligence (IJCAI), Volume 7, pp. 1606–1611.
Gage, P. (1994). A new algorithm for data compression. The C Users Journal 12(2), 23–38.
Gale, W. A., K. W. Church, and D. Yarowsky (1992). One sense per discourse. In Pro- ceedings of the workshop on Speech and Natural Language, pp. 233–237. Association for Computational Linguistics.
Galley, M., M. Hopkins, K. Knight, and D. Marcu (2004). What’s in a translation rule? In
Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 273–280.
Galley, M., K. R. McKeown, E. Fosler-Lussier, and H. Jing (2003). Discourse segmentation of multi-party conversation. In Proceedings of the Association for Computational Linguistics (ACL).
Ganchev, K. and M. Dredze (2008). Small statistical models by random feature mixing. In Proceedings of Workshop on Mobile Language Processing, pp. 19–20. Association of Compu- tational Linguistics.
Ganchev, K., J. Grac ̧a, J. Gillenwater, and B. Taskar (2010). Posterior regularization for structured latent variable models. The Journal of Machine Learning Research 11, 2001– 2049.
Ganin, Y., E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(59), 1–35.
Gao, J., G. Andrew, M. Johnson, and K. Toutanova (2007). A comparative study of param- eter estimation methods for statistical natural language processing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 824–831.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

512 BIBLIOGRAPHY
Garg, N., L. Schiebinger, D. Jurafsky, and J. Zou (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sci- ences 115(16), E3635–E3644.
Gatt, A. and E. Krahmer (2018). Survey of the state of the art in natural language genera- tion: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61, 65–170.
Ge, D., X. Jiang, and Y. Ye (2011). A note on the complexity of lp minimization. Mathemat- ical programming 129(2), 285–299.
Ge, N., J. Hale, and E. Charniak (1998). A statistical approach to anaphora resolution. In Proceedings of the sixth workshop on very large corpora, Volume 71, pp. 76.
Ge, R., F. Huang, C. Jin, and Y. Yuan (2015). Escaping from saddle points — online stochas- tic gradient for tensor decomposition. In P. Gru ̈nwald, E. Hazan, and S. Kale (Eds.), Proceedings of the Conference On Learning Theory (COLT).
Ge, R. and R. J. Mooney (2005). A statistical semantic parser that integrates syntax and semantics. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 9–16.
Geach, P. T. (1962). Reference and generality: An examination of some medieval and modern theories. Cornell University Press.
Gehring, J., M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017). Convolutional sequence to sequence learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1243–1252.
Gildea, D. and D. Jurafsky (2002). Automatic labeling of semantic roles. Computational linguistics 28(3), 245–288.
Gimpel, K., N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yo- gatama, J. Flanigan, and N. A. Smith (2011). Part-of-speech tagging for Twitter: an- notation, features, and experiments. In Proceedings of the Association for Computational Linguistics (ACL), pp. 42–47.
Glass, J., T. J. Hazen, S. Cyphers, I. Malioutov, D. Huynh, and R. Barzilay (2007). Recent progress in the mit spoken lecture processing project. In Eighth Annual Conference of the International Speech Communication Association.
Glorot, X. and Y. Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS), pp. 249– 256.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 513 Glorot, X., A. Bordes, and Y. Bengio (2011). Deep sparse rectifier networks. In Proceedings
of Artificial Intelligence and Statistics (AISTATS), pp. 315–323.
Godfrey, J. J., E. C. Holliman, and J. McDaniel (1992). Switchboard: Telephone speech corpus for research and development. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 517–520. IEEE.
Goldberg, Y. (2017a, June). An adversarial review of “adversarial generation of natural language”. https://medium.com/@yoav.goldberg/an-adversarial-review-of- adversarial-generation-of-natural-language-409ac3378bd7.
Goldberg, Y. (2017b). Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Goldberg, Y. and M. Elhadad (2010). An efficient algorithm for easy-first non-directional dependency parsing. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 742–750.
Goldberg, Y. and J. Nivre (2012). A dynamic oracle for arc-eager dependency parsing. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 959–976.
Goldberg, Y., K. Zhao, and L. Huang (2013). Efficient implementation of beam-search incremental parsers. In Proceedings of the Association for Computational Linguistics (ACL), pp. 628–633.
Goldwater, S. and T. Griffiths (2007). A fully bayesian approach to unsupervised part-of- speech tagging. In Proceedings of the Association for Computational Linguistics (ACL).
Gonc ̧alo Oliveira, H. R., F. A. Cardoso, and F. C. Pereira (2007). Tra-la-lyrics: An approach to generate text based on rhythm. In Proceedings of the 4th. International Joint Workshop on Computational Creativity. A. Cardoso and G. Wiggins.
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech & Lan-
guage 15(4), 403–434.
Gouws, S., D. Metzler, C. Cai, and E. Hovy (2011). Contextual bearing on linguistic varia- tion in social media. In Proceedings of the Workshop on Language and Social Media. Associ- ation for Computational Linguistics.
Goyal, A., H. Daume ́ III, and S. Venkatasubramanian (2009). Streaming for large scale NLP: Language modeling. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 512–520.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

514 BIBLIOGRAPHY Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings
of the International Conference on Machine Learning (ICML).
Graves, A. and N. Jaitly (2014). Towards end-to-end speech recognition with recur- rent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1764–1772.
Graves, A. and J. Schmidhuber (2005). Framewise phoneme classification with bidirec- tional lstm and other neural network architectures. Neural Networks 18(5), 602–610.
Grice, H. P. (1975). Logic and conversation. In P. Cole and J. L. Morgan (Eds.), Syntax and Semantics Volume 3: Speech Acts, pp. 41–58. Academic Press.
Grishman, R. (2012). Information extraction: Capabilities and challenges. Notes prepared for the 2012 International Winter School in Language and Speech Technologies, Rovira i Virgili University, Tarragona, Spain.
Grishman, R. (2015). Information extraction. IEEE Intelligent Systems 30(5), 8–15.
Grishman, R., C. Macleod, and J. Sterling (1992). Evaluating parsing strategies using standardized parse files. In Proceedings of the third conference on Applied natural language processing, pp. 156–161. Association for Computational Linguistics.
Grishman, R. and B. Sundheim (1996). Message understanding conference-6: A brief his- tory. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 466–471.
Groenendijk, J. and M. Stokhof (1991). Dynamic predicate logic. Linguistics and philoso- phy 14(1), 39–100.
Grosz, B. J. (1979). Focusing and description in natural language dialogues. Technical report, SRI International.
Grosz, B. J., S. Weinstein, and A. K. Joshi (1995). Centering: A framework for modeling the local coherence of discourse. Computational linguistics 21(2), 203–225.
Gu, J., Z. Lu, H. Li, and V. O. Li (2016). Incorporating copying mechanism in sequence-to- sequence learning. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1631–1640.
Gulcehre, C., S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio (2016). Pointing the unknown words. In Proceedings of the Association for Computational Linguistics (ACL), pp. 140–149.
Gutmann, M. U. and A. Hyva ̈rinen (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research 13(1), 307–361.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 515 Haghighi, A. and D. Klein (2007). Unsupervised coreference resolution in a nonparametric
bayesian model. In Proceedings of the Association for Computational Linguistics (ACL).
Haghighi, A. and D. Klein (2009). Simple coreference resolution with rich syntactic and semantic features. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1152–1161.
Haghighi, A. and D. Klein (2010). Coreference resolution in a modular, entity-centered model. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 385–393.
Hajicˇ, J. and B. Hladka ́ (1998). Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset. In Proceedings of the Association for Computational Linguistics (ACL), pp. 483–490.
Halliday, M. and R. Hasan (1976). Cohesion in English. London: Longman.
Hammerton, J. (2003). Named entity recognition with long short-term memory. In Pro-
ceedings of the Conference on Natural Language Learning (CoNLL), pp. 172–175.
Han, X. and L. Sun (2012). An entity-topic model for entity linking. In Proceedings of
Empirical Methods for Natural Language Processing (EMNLP), pp. 105–115.
Han, X., L. Sun, and J. Zhao (2011). Collective entity linking in web text: a graph-based method. In Proceedings of ACM SIGIR conference on Research and development in informa- tion retrieval, pp. 765–774.
Hannak, A., E. Anderson, L. F. Barrett, S. Lehmann, A. Mislove, and M. Riedewald (2012). Tweetin’in the rain: Exploring societal-scale effects of weather on mood. In Proceedings of the International Conference on Web and Social Media (ICWSM).
Hardmeier, C. (2012). Discourse in statistical machine translation. a survey and a case study. Discours (11).
Haspelmath, M. and A. Sims (2013). Understanding morphology. Routledge.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The elements of statistical learning (Second
ed.). New York: Springer.
Hatzivassiloglou, V. and K. R. McKeown (1997). Predicting the semantic orientation of adjectives. In Proceedings of the Association for Computational Linguistics (ACL), pp. 174– 181.
Hayes, A. F. and K. Krippendorff (2007). Answering the call for a standard reliability measure for coding data. Communication methods and measures 1(1), 77–89.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

516 BIBLIOGRAPHY
He, H., A. Balakrishnan, M. Eric, and P. Liang (2017). Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the As- sociation for Computational Linguistics (ACL), pp. 1766–1776.
He, K., X. Zhang, S. Ren, and J. Sun (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 1026–1034.
He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 770–778.
He, L., K. Lee, M. Lewis, and L. Zettlemoyer (2017). Deep semantic role labeling: What works and what’s next. In Proceedings of the Association for Computational Linguistics (ACL).
He, Z., S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang (2013). Learning entity repre- sentation for entity disambiguation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 30–34.
Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 539– 545.
Hearst, M. A. (1997). Texttiling: Segmenting text into multi-paragraph subtopic passages. Computational linguistics 23(1), 33–64.
Heerschop, B., F. Goossen, A. Hogenboom, F. Frasincar, U. Kaymak, and F. de Jong (2011). Polarity analysis of texts using discourse structure. In Proceedings of the 20th ACM inter- national conference on Information and knowledge management, pp. 1061–1070. ACM.
Henderson, J. (2004). Discriminative training of a neural network statistical parser. In Proceedings of the Association for Computational Linguistics (ACL), pp. 95–102.
Hendrickx,I.,S.N.Kim,Z.Kozareva,P.Nakov,D.O ́ Se ́aghdha,S.Pado ́,M.Pennacchiotti, L. Romano, and S. Szpakowicz (2009). Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pp. 94–99. Association for Com- putational Linguistics.
Hermann, K. M., T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015). Teaching machines to read and comprehend. In Advances in Neu- ral Information Processing Systems, pp. 1693–1701.
Hernault, H., H. Prendinger, D. A. duVerle, and M. Ishizuka (2010). HILDA: A discourse parser using support vector machine classification. Dialogue and Discourse 1(3), 1–33.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 517
Hill, F., A. Bordes, S. Chopra, and J. Weston (2016). The goldilocks principle: Reading children’s books with explicit memory representations. In Proceedings of the International Conference on Learning Representations (ICLR).
Hill, F., K. Cho, and A. Korhonen (2016). Learning distributed representations of sentences from unlabelled data. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Hill, F., R. Reichart, and A. Korhonen (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4), 665–695.
Hindle, D. and M. Rooth (1993). Structural ambiguity and lexical relations. Computational linguistics 19(1), 103–120.
Hirao, T., Y. Yoshida, M. Nishino, N. Yasuda, and M. Nagata (2013). Single-document summarization as a tree knapsack problem. In Proceedings of Empirical Methods for Nat- ural Language Processing (EMNLP), pp. 1515–1520.
Hirschman, L. and R. Gaizauskas (2001). Natural language question answering: the view from here. natural language engineering 7(4), 275–300.
Hirschman, L., M. Light, E. Breck, and J. D. Burger (1999). Deep read: A reading compre- hension system. In Proceedings of the Association for Computational Linguistics (ACL), pp. 325–332.
Hobbs, J. R. (1978). Resolving pronoun references. Lingua 44(4), 311–338.
Hobbs, J. R., D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson (1997). Fastus: A cascaded finite-state transducer for extracting information from natural- language text. Finite-state language processing, 383–406.
Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural computa- tion 9(8), 1735–1780.
Hockenmaier, J. and M. Steedman (2007). CCGbank: a corpus of CCG derivations and de- pendency structures extracted from the Penn Treebank. Computational Linguistics 33(3), 355–396.
Hoffart, J., M. A. Yosef, I. Bordino, H. Fu ̈rstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum (2011). Robust disambiguation of named entities in text. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 782–792.
Hoffmann, R., C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the Association for Computational Linguistics (ACL), pp. 541–550.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

518 BIBLIOGRAPHY Holmstrom, L. and P. Koistinen (1992). Using additive noise in back-propagation training.
IEEE Transactions on Neural Networks 3(1), 24–38.
Hovy, D., S. Spruit, M. Mitchell, E. M. Bender, M. Strube, and H. Wallach (2017). Proceed- ings of the first acl workshop on ethics in natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. Association for Compu- tational Linguistics.
Hovy, E. and J. Lavid (2010). Towards a ’science’ of corpus annotation: a new method- ological challenge for corpus linguistics. International journal of translation 22(1), 13–36.
Hsu, D., S. M. Kakade, and T. Zhang (2012). A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences 78(5), 1460–1480.
Hu, M. and B. Liu (2004). Mining and summarizing customer reviews. In Proceedings of Knowledge Discovery and Data Mining (KDD), pp. 168–177.
Hu, Z., Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017). Toward controlled gen- eration of text. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1587–1596.
Huang, F. and A. Yates (2012). Biased representation learning for domain adaptation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1313–1323.
Huang, L. and D. Chiang (2007). Forest rescoring: Faster decoding with integrated lan- guage models. In Proceedings of the Association for Computational Linguistics (ACL), pp. 144–151.
Huang, L., S. Fayong, and Y. Guo (2012). Structured perceptron with inexact search. In
Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 142–151.
Huang, Y. (2015). Pragmatics (Second ed.). Oxford Textbooks in Linguistics. Oxford Uni- versity Press.
Huang, Z., W. Xu, and K. Yu (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Huddleston, R. and G. K. Pullum (2005). A student’s introduction to English grammar. Cam- bridge University Press.
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40(9), 1098–1101.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 519
Humphreys, K., R. Gaizauskas, and S. Azzam (1997). Event coreference for information extraction. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pp. 75–81. Association for Computational Linguistics.
Ide, N. and Y. Wilks (2006). Making sense about sense. In Word sense disambiguation, pp. 47–73. Springer.
Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network train- ing by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), pp. 448–456.
Isozaki, H., T. Hirao, K. Duh, K. Sudoh, and H. Tsukada (2010). Automatic evaluation of translation quality for distant language pairs. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 944–952.
Ivanova, A., S. Oepen, L. Øvrelid, and D. Flickinger (2012). Who did what to whom? a contrastive study of syntacto-semantic dependencies. In Proceedings of the Sixth Linguis- tic Annotation Workshop, pp. 2–11. Association for Computational Linguistics.
Iyyer, M., V. Manjunatha, J. Boyd-Graber, and H. Daume ́ III (2015). Deep unordered com- position rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1681–1691.
James, G., D. Witten, T. Hastie, and R. Tibshirani (2013). An introduction to statistical learn- ing, Volume 112. Springer.
Janin, A., D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, et al. (2003). The ICSI meeting corpus. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Jean, S., K. Cho, R. Memisevic, and Y. Bengio (2015). On using very large target vocab- ulary for neural machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1–10.
Jeong, M., C.-Y. Lin, and G. G. Lee (2009). Semi-supervised speech act recognition in emails and forums. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1250–1259.
Ji, H. and R. Grishman (2011). Knowledge base population: Successful approaches and challenges. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1148– 1158.
Ji, Y., T. Cohn, L. Kong, C. Dyer, and J. Eisenstein (2015). Document context language models. In International Conference on Learning Representations, Workshop Track, Volume abs/1511.03962.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

520 BIBLIOGRAPHY Ji, Y. and J. Eisenstein (2014). Representation learning for text-level discourse parsing. In
Proceedings of the Association for Computational Linguistics (ACL).
Ji, Y. and J. Eisenstein (2015). One vector is not enough: Entity-augmented distributional semantics for discourse relations. Transactions of the Association for Computational Lin- guistics (TACL).
Ji, Y., G. Haffari, and J. Eisenstein (2016). A latent variable recurrent neural network for discourse relation language models. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Ji, Y. and N. A. Smith (2017). Neural discourse structure for text categorization. In Pro- ceedings of the Association for Computational Linguistics (ACL), pp. 996–1005.
Ji, Y., C. Tan, S. Martschat, Y. Choi, and N. A. Smith (2017). Dynamic entity representations in neural language models. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1831–1840.
Jiang, L., M. Yu, M. Zhou, X. Liu, and T. Zhao (2011). Target-dependent twitter sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL), pp. 151–160.
Jing, H. (2000). Sentence reduction for automatic text summarization. In Proceedings of the sixth conference on Applied natural language processing, pp. 310–315. Association for Computational Linguistics.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of Knowledge Discovery and Data Mining (KDD), pp. 133–142.
Jockers, M. L. (2015). Revealing sentiment and plot arcs with the syuzhet package. http: //www.matthewjockers.net/2015/02/02/syuzhet/.
Johnson, A. E., T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016). MIMIC-III, a freely accessible critical care database. Scientific data 3, 160035.
Johnson, M. (1998). PCFG models of linguistic tree representations. Computational Lin- guistics 24(4), 613–632.
Johnson, R. and T. Zhang (2017). Deep pyramid convolutional neural networks for text categorization. In Proceedings of the Association for Computational Linguistics (ACL), pp. 562–570.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 521
Joshi, A. K. (1985). Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In Natural Language Processing – Theoret- ical, Computational and Psychological Perspectives. New York, NY: Cambridge University Press.
Joshi, A. K. and Y. Schabes (1997). Tree-adjoining grammars. In Handbook of formal lan- guages, pp. 69–123. Springer.
Joshi, A. K., K. V. Shanker, and D. Weir (1991). The convergence of mildly context-sensitive grammar formalisms. In Foundational Issues in Natural Language Processing. Cambridge MA: MIT Press.
Jozefowicz, R., O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016). Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
Jozefowicz, R., W. Zaremba, and I. Sutskever (2015). An empirical exploration of recurrent network architectures. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2342–2350.
Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambigua- tion. Cognitive Science 20(2), 137–194.
Jurafsky, D. and J. H. Martin (2009). Speech and Language Processing (Second ed.). Prentice Hall.
Jurafsky, D. and J. H. Martin (2019). Speech and Language Processing (Third ed.). Prentice Hall.
Kadlec, R., M. Schmid, O. Bajgar, and J. Kleindienst (2016). Text understanding with the attention sum reader network. In Proceedings of the Association for Computational Linguistics (ACL), pp. 908–918.
Kalchbrenner, N. and P. Blunsom (2013). Recurrent convolutional neural networks for dis- course compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pp. 119–126. Association for Computational Linguistics.
Kalchbrenner, N., L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu (2016). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
Kalchbrenner, N., E. Grefenstette, and P. Blunsom (2014). A convolutional neural network for modelling sentences. In Proceedings of the Association for Computational Linguistics (ACL), pp. 655–665.
Karlsson, F. (2007). Constraints on multiple center-embedding of clauses. Journal of Lin- guistics 43(2), 365–392.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

522 BIBLIOGRAPHY Kate, R. J., Y. W. Wong, and R. J. Mooney (2005). Learning to transform natural to formal
languages. In Proceedings of the National Conference on Artificial Intelligence (AAAI). Kawaguchi, K., L. P. Kaelbling, and Y. Bengio (2017). Generalization in deep learning.
arXiv preprint arXiv:1710.05468.
Kehler, A. (2007). Rethinking the SMASH approach to pronoun interpretation. In Interdis- ciplinary perspectives on reference processing, New Directions in Cognitive Science Series, pp. 95–122. Oxford University Press.
Kibble, R. and R. Power (2004). Optimizing referential coherence in text generation. Com- putational Linguistics 30(4), 401–416.
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities 31(2), 91–113.
Kilgarriff, A. and G. Grefenstette (2003). Introduction to the special issue on the web as corpus. Computational linguistics 29(3), 333–347.
Kim, M.-J. (2002). Does Korean have adjectives? MIT Working Papers in Linguistics 43, 71–89.
Kim, S.-M. and E. Hovy (2006, July). Extracting opinions, opinion holders, and topics expressed in online news media text. In Proceedings of the Workshop on Sentiment and Subjectivity in Text, pp. 1–8. Association for Computational Linguistics.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1746–1751.
Kim, Y., C. Denton, L. Hoang, and A. M. Rush (2017). Structured attention networks. In Proceedings of the International Conference on Learning Representations (ICLR).
Kim, Y., Y. Jernite, D. Sontag, and A. M. Rush (2016). Character-aware neural language models. In Proceedings of the National Conference on Artificial Intelligence (AAAI).
Kingma, D. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kiperwasser, E. and Y. Goldberg (2016). Simple and accurate dependency parsing using bidirectional lstm feature representations. Transactions of the Association for Computa- tional Linguistics 4, 313–327.
Kipper-Schuler, K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. Ph. D. thesis, Computer and Information Science, University of Pennsylvania.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 523 Kiros, R., R. Salakhutdinov, and R. Zemel (2014). Multimodal neural language models. In
Proceedings of the International Conference on Machine Learning (ICML), pp. 595–603. Kiros, R., Y. Zhu, R. Salakhudinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler
(2015). Skip-thought vectors. In Neural Information Processing Systems (NIPS).
Klein, D. and C. D. Manning (2003). Accurate unlexicalized parsing. In Proceedings of the
Association for Computational Linguistics (ACL), pp. 423–430.
Klein, D. and C. D. Manning (2004). Corpus-based induction of syntactic structure: Mod- els of dependency and constituency. In Proceedings of the Association for Computational Linguistics (ACL).
Klein, G., Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017). OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
Klementiev, A., I. Titov, and B. Bhattarai (2012). Inducing crosslingual distributed repre- sentations of words. In Proceedings of the International Conference on Computational Lin- guistics (COLING), pp. 1459–1474.
Klenner, M. (2007). Enforcing consistency on coreference sets. In Recent Advances in Natu- ral Language Processing (RANLP), pp. 323–328.
Knight, K. (1999). Decoding complexity in word-replacement translation models. Compu- tational Linguistics 25(4), 607–615.
Knight, K. and J. Graehl (1998). Machine transliteration. Computational Linguistics 24(4), 599–612.
Knight, K. and D. Marcu (2000). Statistics-based summarization-step one: Sentence com- pression. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 703–710.
Knight, K. and J. May (2009). Applications of weighted automata in natural language processing. In Handbook of Weighted Automata, pp. 571–596. Springer.
Knott, A. (1996). A data-driven methodology for motivating a set of coherence relations. Ph. D. thesis, The University of Edinburgh.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT summit, Volume 5, pp. 79–86.
Koehn, P. (2009). Statistical machine translation. Cambridge University Press. Koehn, P. (2017). Neural machine translation. arXiv preprint arXiv:1709.07809.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

524 BIBLIOGRAPHY Konstas, I. and M. Lapata (2013). A global model for concept-to-text generation. Journal
of Artificial Intelligence Research 48, 305–346.
Koo, T., X. Carreras, and M. Collins (2008). Simple semi-supervised dependency parsing.
In Proceedings of the Association for Computational Linguistics (ACL), pp. 595–603.
Koo, T. and M. Collins (2005). Hidden-variable models for discriminative reranking. In
Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 507–514. Koo, T. and M. Collins (2010). Efficient third-order dependency parsers. In Proceedings of
the Association for Computational Linguistics (ACL).
Koo, T., A. Globerson, X. Carreras, and M. Collins (2007). Structured prediction models via the matrix-tree theorem. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 141–150.
Kovach, B. and T. Rosenstiel (2014). The elements of journalism: What newspeople should know and the public should expect. Three Rivers Press.
Krishnamurthy, J. (2016). Probabilistic models for learning a semantic parser lexicon. In
Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 606–616.
Krishnamurthy, J. and T. M. Mitchell (2012). Weakly supervised training of semantic parsers. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 754–765.
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pp. 1097–1105.
Ku ̈bler,S.,R.McDonald,andJ.Nivre(2009).Dependencyparsing.SynthesisLectureson Human Language Technologies 1(1), 1–127.
Kuhlmann, M. and J. Nivre (2010). Transition-based techniques for non-projective depen- dency parsing. Northern European Journal of Language Technology (NEJLT) 2(1), 1–19.
Kummerfeld, J. K., T. Berg-Kirkpatrick, and D. Klein (2015). An empirical analysis of op- timization for max-margin NLP. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Kwiatkowski, T., S. Goldwater, L. Zettlemoyer, and M. Steedman (2012). A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the European Chapter of the Association for Computational Lin- guistics (EACL), pp. 234–244.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 525
Lafferty, J., A. McCallum, and F. Pereira (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML).
Lakoff, G. (1973). Hedges: A study in meaning criteria and the logic of fuzzy concepts. Journal of philosophical logic 2(4), 458–508.
Lample, G., M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016). Neural architectures for named entity recognition. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 260–270.
Langkilde, I. and K. Knight (1998). Generation that exploits corpus-based statistical knowledge. In Proceedings of the Association for Computational Linguistics (ACL), pp. 704– 710.
Lapata, M. (2003). Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the Association for Computational Linguistics (ACL), pp. 545–552.
Lappin, S. and H. J. Leass (1994). An algorithm for pronominal anaphora resolution. Computational linguistics 20(4), 535–561.
Lari, K. and S. J. Young (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer speech & language 4(1), 35–56.
Lascarides, A. and N. Asher (2007). Segmented discourse representation theory: Dynamic semantics with discourse structure. In Computing meaning, pp. 87–124. Springer.
Law, E. and L. v. Ahn (2011). Human computation. Synthesis Lectures on Artificial Intelli- gence and Machine Learning 5(3), 1–121.
Lebret, R., D. Grangier, and M. Auli (2016). Neural text generation from structured data with application to the biography domain. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1203–1213.
LeCun, Y., L. Bottou, G. B. Orr, and K.-R. Mu ̈ller (1998). Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–50. Springer.
Lee, C. M. and S. S. Narayanan (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on speech and audio processing 13(2), 293–303.
Lee, H., A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky (2013). De- terministic coreference resolution based on entity-centric, precision-ranked rules. Com- putational Linguistics 39(4), 885–916.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

526 BIBLIOGRAPHY
Lee, H., Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky (2011). Stan- ford’s multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 28–34.
Lee, K., L. He, M. Lewis, and L. Zettlemoyer (2017). End-to-end neural coreference reso- lution. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Lee, K., L. He, and L. Zettlemoyer (2018). Higher-order coreference resolution with coarse- to-fine inference. In Proceedings of the North American Chapter of the Association for Com- putational Linguistics (NAACL), pp. 687–692.
Lenat, D. B., R. V. Guha, K. Pittman, D. Pratt, and M. Shepherd (1990). Cyc: toward programs with common sense. Communications of the ACM 33(8), 30–49.
Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual interna- tional conference on Systems documentation, pp. 24–26. ACM.
Levesque, H. J., E. Davis, and L. Morgenstern (2011). The Winograd schema challenge. In AAAI spring symposium: Logical formalizations of commonsense reasoning.
Levin, E., R. Pieraccini, and W. Eckert (1998). Using markov decision process for learning dialogue strategies. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Volume 1, pp. 201–204. IEEE.
Levy, O. and Y. Goldberg (2014). Dependency-based word embeddings. In Proceedings of the Association for Computational Linguistics (ACL), pp. 302–308.
Levy, O., Y. Goldberg, and I. Dagan (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, 211–225.
Levy, R. and C. Manning (2009). An informal introduction to computational seman- tics. http://idiom.ucsd.edu/ ̃rlevy/teaching/winter2009/ligncse256/ lectures/lecture_14_compositional_semantics.pdf.
Lewis, M. and M. Steedman (2013). Combined distributional and logical semantics. Trans- actions of the Association for Computational Linguistics 1, 179–192.
Lewis II, P. M. and R. E. Stearns (1968). Syntax-directed transduction. Journal of the ACM 15(3), 465–488.
Li, J. and D. Jurafsky (2015). Do multi-sense embeddings improve natural language understanding? In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1722–1732.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 527 Li, J. and D. Jurafsky (2017). Neural net models of open-domain discourse coherence. In
Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 198–209. Li, J., R. Li, and E. Hovy (2014). Recursive deep models for discourse parsing. In Proceed-
ings of Empirical Methods for Natural Language Processing (EMNLP).
Li, J., M.-T. Luong, and D. Jurafsky (2015). A hierarchical neural autoencoder for para- graphs and documents. In Proceedings of Empirical Methods for Natural Language Process- ing (EMNLP).
Li, J., T. Luong, D. Jurafsky, and E. Hovy (2015). When are tree structures necessary for deep learning of representations? In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2304–2314.
Li, J., W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016). Deep reinforcement learning for dialogue generation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1192–1202.
Li, Q., S. Anzaroot, W.-P. Lin, X. Li, and H. Ji (2011). Joint inference for cross-document information extraction. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pp. 2225–2228.
Li, Q., H. Ji, and L. Huang (2013). Joint event extraction via structured prediction with global features. In Proceedings of the Association for Computational Linguistics (ACL), pp. 73–82.
Liang, P. (2005). Semi-supervised learning for natural language. Master’s thesis, Mas- sachusetts Institute of Technology.
Liang, P., A. Bouchard-Coˆte ́, D. Klein, and B. Taskar (2006). An end-to-end discrimina- tive approach to machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 761–768.
Liang, P., M. Jordan, and D. Klein (2009). Learning semantic correspondences with less supervision. In Proceedings of the Association for Computational Linguistics (ACL), pp. 91– 99.
Liang, P., M. I. Jordan, and D. Klein (2013). Learning dependency-based compositional semantics. Computational Linguistics 39(2), 389–446.
Liang, P. and D. Klein (2009). Online EM for unsupervised models. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 611–619.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

528 BIBLIOGRAPHY
Liang, P., S. Petrov, M. I. Jordan, and D. Klein (2007). The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 688–697.
Liang, P. and C. Potts (2015). Bringing machine learning and compositional semantics together. Annual Review of Linguistics 1(1), 355–376.
Lieber, R. (2015). Introducing morphology. Cambridge University Press.
Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of the
International Conference on Computational Linguistics (COLING), pp. 768–774.
Lin, J. and C. Dyer (2010). Data-intensive text processing with mapreduce. Synthesis
Lectures on Human Language Technologies 3(1), 1–177.
Lin, Z., M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017). A struc- tured self-attentive sentence embedding. In Proceedings of the International Conference on Learning Representations (ICLR).
Lin, Z., M.-Y. Kan, and H. T. Ng (2009). Recognizing implicit discourse relations in the penn discourse treebank. In Proceedings of Empirical Methods for Natural Language Pro- cessing (EMNLP), pp. 343–351.
Lin, Z., H. T. Ng, and M.-Y. Kan (2011). Automatically evaluating text coherence using discourse relations. In Proceedings of the Association for Computational Linguistics (ACL), pp. 997–1006.
Lin, Z., H. T. Ng, and M.-Y. Kan (2014). A pdtb-styled end-to-end discourse parser. Natural Language Engineering 20(2), 151–184.
Ling, W., C. Dyer, A. Black, and I. Trancoso (2015). Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the North American Chapter of the As- sociation for Computational Linguistics (NAACL).
Ling, W., T. Lu ́ıs, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso (2015). Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Ling, W., G. Xiang, C. Dyer, A. Black, and I. Trancoso (2013). Microblogs as parallel cor- pora. In Proceedings of the Association for Computational Linguistics (ACL).
Ling, X., S. Singh, and D. S. Weld (2015). Design challenges for entity linking. Transactions of the Association for Computational Linguistics 3, 315–328.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 529
Linguistic Data Consortium (2005). ACE (automatic content extraction) English annota- tion guidelines for relations. Technical Report Version 5.8.3, Linguistic Data Consor- tium.
Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
Liu, D. C. and J. Nocedal (1989). On the limited memory BFGS method for large scale optimization. Mathematical programming 45(1-3), 503–528.
Liu, Y., Q. Liu, and S. Lin (2006). Tree-to-string alignment template for statistical machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 609– 616.
Loper, E. and S. Bird (2002). NLTK: The natural language toolkit. In Proceedings of the Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics, pp. 63–70. Association for Computational Linguistics.
Louis, A., A. Joshi, and A. Nenkova (2010). Discourse indicators for content selection in summarization. In Proceedings of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 147–156.
Louis, A. and A. Nenkova (2013). What makes writing great? first experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1, 341–352.
Loveland, D. W. (2016). Automated Theorem Proving: a logical basis. Elsevier.
Lowe, R., N. Pow, I. V. Serban, and J. Pineau (2015). The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the Special Interest Group on Discourse and Dialogue (SIGDIAL).
Luo, X. (2005). On coreference resolution performance metrics. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 25–32.
Luo, X., A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos (2004). A mention- synchronous coreference resolution algorithm based on the bell tree. In Proceedings of the Association for Computational Linguistics (ACL).
Luong, M.-T., R. Socher, and C. D. Manning (2013). Better word representations with recursive neural networks for morphology.
Luong, T., H. Pham, and C. D. Manning (2015). Effective approaches to attention-based neural machine translation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1412–1421.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

530 BIBLIOGRAPHY
Luong, T., I. Sutskever, Q. Le, O. Vinyals, and W. Zaremba (2015). Addressing the rare word problem in neural machine translation. In Proceedings of the Association for Compu- tational Linguistics (ACL), pp. 11–19.
Maas, A. L., A. Y. Hannun, and A. Y. Ng (2013). Rectifier nonlinearities improve neu- ral network acoustic models. In Proceedings of the International Conference on Machine Learning (ICML).
Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 276–283.
Mairesse, F. and M. A. Walker (2011). Controlling user perceptions of linguistic style: Trainable generation of personality traits. Computational Linguistics 37(3), 455–488.
Mani, I., M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky (2006). Machine learning of temporal relations. In Proceedings of the Association for Computational Linguistics (ACL), pp. 753–760.
Mann, W. C. and S. A. Thompson (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text 8(3), 243–281.
Manning, C. D. (2015). Last words: Computational linguistics and deep learning. Compu- tational Linguistics 41(4), 701–707.
Manning,C.D.,P.Raghavan,H.Schu ̈tze,etal.(2008). Introductiontoinformationretrieval, Volume 1. Cambridge university press.
Manning,C.D.andH.Schu ̈tze(1999).FoundationsofStatisticalNaturalLanguageProcess- ing. Cambridge, Massachusetts: MIT press.
Marcu, D. (1996). Building up rhetorical structure trees. In Proceedings of the National Conference on Artificial Intelligence, pp. 1069–1074.
Marcu, D. (1997a). From discourse structures to text summaries. In Proceedings of the workshop on Intelligent Scalable Text Summarization.
Marcu, D. (1997b). From local to global coherence: A bottom-up approach to text plan- ning. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 629–635.
Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330.
Maron, O. and T. Lozano-Pe ́rez (1998). A framework for multiple-instance learning. In Neural Information Processing Systems (NIPS), pp. 570–576.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 531 Ma ́rquez, G. G. (1970). One Hundred Years of Solitude. Harper & Row. English translation
by Gregory Rabassa.
Martins, A. F. T., N. A. Smith, and E. P. Xing (2009). Concise integer linear programming formulations for dependency parsing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 342–350.
Martins, A. F. T., N. A. Smith, E. P. Xing, P. M. Q. Aguiar, and M. A. T. Figueiredo (2010). Turbo parsers: Dependency parsing by approximate variational inference. In Proceed- ings of Empirical Methods for Natural Language Processing (EMNLP), pp. 34–44.
Matsuzaki, T., Y. Miyao, and J. Tsujii (2005). Probabilistic cfg with latent annotations. In Proceedings of the Association for Computational Linguistics (ACL), pp. 75–82.
Matthiessen, C. and J. A. Bateman (1991). Text generation and systemic-functional linguistics: experiences from English and Japanese. Pinter Publishers.
McCallum, A. and W. Li (2003). Early results for named entity recognition with condi- tional random fields, feature induction and web-enhanced lexicons. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 188–191.
McCallum, A. and B. Wellner (2004). Conditional models of identity uncertainty with application to noun coreference. In Neural Information Processing Systems (NIPS), pp. 905–912.
McDonald, R., K. Crammer, and F. Pereira (2005). Online large-margin training of depen- dency parsers. In Proceedings of the Association for Computational Linguistics (ACL), pp. 91–98.
McDonald, R., K. Hannan, T. Neylon, M. Wells, and J. Reynar (2007). Structured models for fine-to-coarse sentiment analysis. In Proceedings of the Association for Computational Linguistics (ACL).
McDonald, R. and F. Pereira (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).
McKeown, K. (1992). Text generation. Cambridge University Press.
McKeown, K., S. Rosenthal, K. Thadani, and C. Moore (2010). Time-efficient creation of an accurate sentence fusion corpus. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 317–320.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

532 BIBLIOGRAPHY
McKeown, K. R., R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman (2002). Tracking and summarizing news on a daily basis with Columbia’s Newsblaster. In Proceedings of the second international conference on Human Language Technology Research, pp. 280–285.
McNamee, P. and H. T. Dang (2009). Overview of the tac 2009 knowledge base population track. In Text Analysis Conference (TAC), Volume 17, pp. 111–113.
Medlock, B. and T. Briscoe (2007). Weakly supervised learning for hedge classification in scientific literature. In Proceedings of the Association for Computational Linguistics (ACL), pp. 992–999.
Mei, H., M. Bansal, and M. R. Walter (2016). What to talk about and how? selective gen- eration using lstms with coarse-to-fine alignment. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 720–730.
Merity, S., N. S. Keskar, and R. Socher (2018). Regularizing and optimizing LSTM lan- guage models. In Proceedings of the International Conference on Learning Representations (ICLR).
Merity, S., C. Xiong, J. Bradbury, and R. Socher (2017). Pointer sentinel mixture models. In Proceedings of the International Conference on Learning Representations (ICLR).
Messud, C. (2014, June). A new ‘l’e ́tranger’. New York Review of Books.
Miao, Y. and P. Blunsom (2016). Language as a latent variable: Discrete generative mod- els for sentence compression. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 319–328.
Miao, Y., L. Yu, and P. Blunsom (2016). Neural variational inference for text processing. In Proceedings of the International Conference on Machine Learning (ICML).
Mihalcea, R., T. A. Chklovski, and A. Kilgarriff (2004). The SENSEVAL-3 English lexical sample task. In Proceedings of SENSEVAL-3, Barcelona, Spain, pp. 25–28. Association for Computational Linguistics.
Mihalcea, R. and D. Radev (2011). Graph-based natural language processing and information retrieval. Cambridge University Press.
Mikolov, T., K. Chen, G. Corrado, and J. Dean (2013). Efficient estimation of word repre- sentations in vector space. In Proceedings of International Conference on Learning Represen- tations.
Mikolov, T., A. Deoras, D. Povey, L. Burget, and J. Cernocky (2011). Strategies for training large scale neural network language models. In Proceedings of the Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 196–201.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 533 Mikolov, T., M. Karafia ́t, L. Burget, J. Cernocky`, and S. Khudanpur (2010). Recurrent
neural network based language model. In INTERSPEECH, pp. 1045–1048.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed rep- resentations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
Mikolov, T., W.-t. Yih, and G. Zweig (2013). Linguistic regularities in continuous space word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 746–751.
Mikolov, T. and G. Zweig. Context dependent recurrent neural network language model. In Proceedings of Spoken Language Technology (SLT), pp. 234–239.
Miller, G. A., G. A. Heise, and W. Lichten (1951). The intelligibility of speech as a function of the context of the test materials. Journal of experimental psychology 41(5), 329.
Miller, M., C. Sathi, D. Wiesenthal, J. Leskovec, and C. Potts (2011). Sentiment flow through hyperlink networks. In Proceedings of the International Conference on Web and Social Media (ICWSM).
Miller, S., J. Guinness, and A. Zamanian (2004). Name tagging with word clusters and discriminative training. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 337–342.
Milne, D. and I. H. Witten (2008). Learning to link with wikipedia. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pp. 509–518.
Miltsakaki, E. and K. Kukich (2004). Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering 10(1), 25–55.
Minka, T. P. (1999). From hidden markov models to linear dynamical systems. Tech. Rep. 531, Vision and Modeling Group of Media Lab, MIT.
Minsky, M. (1974). A framework for representing knowledge. Technical Report 306, MIT AI Laboratory.
Minsky, M. and S. Papert (1969). Perceptrons. MIT press.
Mintz, M., S. Bills, R. Snow, and D. Jurafsky (2009). Distant supervision for relation extrac- tion without labeled data. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1003–1011.
Mirza, P., R. Sprugnoli, S. Tonelli, and M. Speranza (2014). Annotating causality in the TempEval-3 corpus. In Proceedings of the EACL 2014 Workshop on Computational Ap- proaches to Causality in Language (CAtoCL), pp. 10–19.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

534 BIBLIOGRAPHY Misra, D. K. and Y. Artzi (2016). Neural shift-reduce ccg semantic parsing. In Proceedings
of Empirical Methods for Natural Language Processing (EMNLP).
Mitchell, J. and M. Lapata (2010). Composition in distributional models of semantics.
Cognitive Science 34(8), 1388–1429.
Miwa, M. and M. Bansal (2016). End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1105–1116.
Mnih, A. and G. Hinton (2007). Three new graphical models for statistical language mod- elling. In Proceedings of the International Conference on Machine Learning (ICML), pp. 641– 648.
Mnih, A. and G. E. Hinton (2008). A scalable hierarchical distributed language model. In Neural Information Processing Systems (NIPS), pp. 1081–1088.
Mnih, A. and Y. W. Teh (2012). A fast and simple algorithm for training neural probabilis- tic language models. In Proceedings of the International Conference on Machine Learning (ICML).
Mohammad, S. M. and P. D. Turney (2013). Crowdsourcing a word–emotion association lexicon. Computational Intelligence 29(3), 436–465.
Mohri, M., F. Pereira, and M. Riley (2002). Weighted finite-state transducers in speech recognition. Computer Speech & Language 16(1), 69–88.
Mohri, M., A. Rostamizadeh, and A. Talwalkar (2012). Foundations of machine learning. MIT press.
Montague, R. (1973). The proper treatment of quantification in ordinary english. In Ap- proaches to natural language, pp. 221–242. Springer.
Moore, J. D. and C. L. Paris (1993). Planning text for advisory dialogues: Capturing inten- tional and rhetorical information. Computational Linguistics 19(4), 651–694.
Morante, R. and E. Blanco (2012). *SEM 2012 shared task: Resolving the scope and focus of negation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp. 265–274. Association for Computational Linguistics.
Morante, R. and W. Daelemans (2009). Learning the scope of hedge cues in biomedical texts. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, pp. 28–36. Association for Computational Linguistics.
Morante, R. and C. Sporleder (2012). Modality and negation: An introduction to the special issue. Computational linguistics 38(2), 223–260.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 535
Mostafazadeh, N., A. Grealish, N. Chambers, J. Allen, and L. Vanderwende (2016). CaTeRS: Causal and temporal relation scheme for semantic annotation of event struc- tures. In Proceedings of the Fourth Workshop on Events, pp. 51–61. Association for Com- putational Linguistics.
Mueller,T.,H.Schmid,andH.Schu ̈tze(2013).Efficienthigher-orderCRFsformorpholog- ical tagging. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 322–332.
Mu ̈ller, C. and M. Strube (2006). Multi-level annotation of linguistic data with mmax2. Corpus technology and language pedagogy: New resources, new tools, new methods 3, 197– 214.
Muralidharan, A. and M. A. Hearst (2013). Supporting exploratory text analysis in litera- ture study. Literary and linguistic computing 28(2), 283–295.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
Nakagawa, T., K. Inui, and S. Kurohashi (2010). Dependency tree-based sentiment classi- fication using CRFs with hidden variables. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 786–794.
Nakazawa, T., M. Yaguchi, K. Uchimoto, M. Utiyama, E. Sumita, S. Kurohashi, and H. Isa- hara (2016). ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the Language Resources and Evaluation Conference, pp. 2204–2208.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys 41(2), 10.
Neal, R. M. and G. E. Hinton (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pp. 355–368. Springer.
Nenkova, A. and K. McKeown (2012). A survey of text summarization techniques. In Mining text data, pp. 43–76. Springer.
Neubig, G. (2017). Neural machine translation and sequence-to-sequence models: A tu- torial. arXiv preprint arXiv:1703.01619.
Neubig, G., C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Balles- teros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra, S. Swayamdipta, and P. Yin (2017). Dynet: The dynamic neural network toolkit.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

536 BIBLIOGRAPHY Neubig, G., Y. Goldberg, and C. Dyer (2017). On-the-fly operation batching in dynamic
computation graphs. In Neural Information Processing Systems (NIPS).
Neubig, G., M. Sperber, X. Wang, M. Felix, A. Matthews, S. Padmanabhan, Y. Qi, D. S. Sachan, P. Arthur, P. Godard, J. Hewitt, R. Riad, and L. Wang (2018). XNMT: The ex- tensible neural machine translation toolkit. In Conference of the Association for Machine Translation in the Americas (AMTA).
Neuhaus, P. and N. Bro ̈ker (1997). The complexity of recognition of linguistically ade- quate dependency grammars. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 337–343.
Newman, D., C. Chemudugunta, and P. Smyth (2006). Statistical entity-topic models. In Proceedings of Knowledge Discovery and Data Mining (KDD), pp. 680–686.
Ng, V. (2010). Supervised noun phrase coreference research: The first fifteen years. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1396–1411.
Nguyen,D.andA.S.Dogruo ̈z(2013).Wordlevellanguageidentificationinonlinemulti- lingual communication. In Proceedings of Empirical Methods for Natural Language Process- ing (EMNLP).
Nguyen, D. T. and S. Joty (2017). A neural local coherence model. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1320–1330.
Nigam, K., A. K. McCallum, S. Thrun, and T. Mitchell (2000). Text classification from labeled and unlabeled documents using em. Machine learning 39(2-3), 103–134.
Nirenburg, S. and Y. Wilks (2001). What’s in a symbol: ontology, representation and lan- guage. Journal of Experimental & Theoretical Artificial Intelligence 13(1), 9–23.
Nivre, J. (2008). Algorithms for deterministic incremental dependency parsing. Computa- tional Linguistics 34(4), 513–553.
Nivre, J., M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajicˇ, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman (2016). Universal depen- dencies v1: A multilingual treebank collection. In Proceedings of the Language Resources and Evaluation Conference.
Nivre, J. and J. Nilsson (2005). Pseudo-projective dependency parsing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 99–106.
Novikoff, A. B. (1962). On convergence proofs on perceptrons. In Proceedings of the Sym- posium on the Mathematical Theory of Automata, Volume 12, pp. 615––622.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 537 Och, F. J. and H. Ney (2003). A systematic comparison of various statistical alignment
models. Computational linguistics 29(1), 19–51.
O’Connor, B., M. Krieger, and D. Ahn (2010). Tweetmotif: Exploratory search and topic summarization for twitter. In Proceedings of the International Conference on Web and Social Media (ICWSM), pp. 384–385.
Oepen, S., M. Kuhlmann, Y. Miyao, D. Zeman, D. Flickinger, J. Hajic, A. Ivanova, and Y. Zhang (2014). Semeval 2014 task 8: Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 63–72.
Oflazer, K. and I`. Kuruo ̈z (1994). Tagging and morphological disambiguation of turkish text. In Proceedings of the fourth conference on Applied natural language processing, pp. 144– 149.
Ohta, T., Y. Tateisi, and J.-D. Kim (2002). The genia corpus: An annotated research abstract corpus in molecular biology domain. In Proceedings of the second international conference on Human Language Technology Research, pp. 82–86. Morgan Kaufmann Publishers Inc.
Onishi, T., H. Wang, M. Bansal, K. Gimpel, and D. McAllester (2016). Who did what: A large-scale person-centered cloze dataset. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2230–2235.
Owoputi, O., B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013). Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 380–390.
Packard, W., E. M. Bender, J. Read, S. Oepen, and R. Dridan (2014). Simple negation scope resolution through deep parsing: A semantic solution to a semantic problem. In Proceedings of the Association for Computational Linguistics (ACL), pp. 69–78.
Paice, C. D. (1990). Another stemmer. In ACM SIGIR Forum, Volume 24, pp. 56–61.
Pak, A. and P. Paroubek (2010). Twitter as a corpus for sentiment analysis and opinion
mining. In Proceedings of the Language Resources and Evaluation Conference, pp. 1320–1326.
Palmer, F. R. (2001). Mood and modality. Cambridge University Press.
Palmer, M., D. Gildea, and P. Kingsbury (2005). The proposition bank: An annotated corpus of semantic roles. Computational linguistics 31(1), 71–106.
Pan, S. J. and Q. Yang (2010). A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22(10), 1345–1359.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

538 BIBLIOGRAPHY
Pang, B. and L. Lee (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Association for Computa- tional Linguistics (ACL), pp. 271–278.
Pang, B. and L. Lee (2005). Seeing stars: Exploiting class relationships for sentiment cate- gorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics (ACL), pp. 115–124.
Pang, B. and L. Lee (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval 2(1-2), 1–135.
Pang, B., L. Lee, and S. Vaithyanathan (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 79–86.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 311–318.
Park, J. and C. Cardie (2012). Improving implicit discourse relation recognition through feature set optimization. In Proceedings of the Special Interest Group on Discourse and Dia- logue (SIGDIAL), pp. 108–112.
Parsons, T. (1990). Events in the Semantics of English, Volume 5. MIT Press.
Pascanu, R., T. Mikolov, and Y. Bengio (2013). On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1310–1318.
Paul,M.,M.Federico,andS.Stu ̈ker(2010).Overviewoftheiwslt2010evaluationcam- paign. In International Workshop on Spoken Language Translation (IWSLT) 2010.
Pedersen, T., S. Patwardhan, and J. Michelizzi (2004). Wordnet::similarity – measuring the relatedness of concepts. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 38–41.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon- del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Pei, W., T. Ge, and B. Chang (2015). An effective neural network model for graph-based dependency parsing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 313–322.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 539
Peldszus, A. and M. Stede (2013). From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 7(1), 1–31.
Peldszus, A. and M. Stede (2015). An annotated corpus of argumentative microtexts. In Proceedings of the First Conference on Argumentation.
Peng, F., F. Feng, and A. McCallum (2004). Chinese segmentation and new word detec- tion using conditional random fields. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 562–568.
Pennington, J., R. Socher, and C. Manning (2014). Glove: Global vectors for word repre- sentation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1532–1543.
Pereira, F. and Y. Schabes (1992). Inside-outside reestimation from partially bracketed corpora. In Proceedings of the Association for Computational Linguistics (ACL), pp. 128– 135.
Pereira, F. C. N. and S. M. Shieber (2002). Prolog and natural-language analysis. Microtome Publishing.
Peters, M. E., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018). Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Peterson, W. W., T. G. Birdsall, and W. C. Fox (1954). The theory of signal detectability. Transactions of the IRE professional group on information theory 4(4), 171–212.
Petrov, S., L. Barrett, R. Thibaux, and D. Klein (2006). Learning accurate, compact, and in- terpretable tree annotation. In Proceedings of the Association for Computational Linguistics (ACL).
Petrov, S., D. Das, and R. McDonald (2012). A universal part-of-speech tagset. In Proceed- ings of the Language Resources and Evaluation Conference.
Petrov, S. and R. McDonald (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Pinker, S. (2003). The language instinct: How the mind creates language. Penguin UK.
Pinter, Y., R. Guthrie, and J. Eisenstein (2017). Mimicking word embeddings using subword RNNs. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

540 BIBLIOGRAPHY
Pitler, E., A. Louis, and A. Nenkova (2009). Automatic sense prediction for implicit dis- course relations in text. In Proceedings of the Association for Computational Linguistics (ACL).
Pitler, E. and A. Nenkova (2009). Using syntax to disambiguate explicit discourse con- nectives in text. In Proceedings of the Association for Computational Linguistics (ACL), pp. 13–16.
Pitler, E., M. Raghupathy, H. Mehta, A. Nenkova, A. Lee, and A. Joshi (2008). Easily iden- tifiable discourse relations. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 87–90.
Plank, B., A. Søgaard, and Y. Goldberg (2016). Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the Association for Computational Linguistics (ACL).
Poesio, M., R. Stevenson, B. Di Eugenio, and J. Hitzeman (2004). Centering: A parametric theory and its instantiations. Computational linguistics 30(3), 309–363.
Polanyi, L. and A. Zaenen (2006). Contextual valence shifters. In Computing attitude and affect in text: Theory and applications. Springer.
Ponzetto, S. P. and M. Strube (2006). Exploiting semantic role labeling, wordnet and wikipedia for coreference resolution. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 192–199.
Ponzetto, S. P. and M. Strube (2007). Knowledge derived from wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research 30, 181–212.
Poon, H. and P. Domingos (2008). Joint unsupervised coreference resolution with markov logic. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 650–659.
Poon, H. and P. Domingos (2009). Unsupervised semantic parsing. In Proceedings of Em- pirical Methods for Natural Language Processing (EMNLP), pp. 1–10.
Popel, M., D. Marecek, J. Stepa ́nek, D. Zeman, and Z. Zabokrtsky` (2013). Coordination structures in dependency treebanks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 517–527.
Popescu, A.-M., O. Etzioni, and H. Kautz (2003). Towards a theory of natural language interfaces to databases. In Proceedings of Intelligent User Interfaces (IUI), pp. 149–157.
Poplack,S.(1980).SometimesI’llstartasentenceinSpanishyterminoenEspan ̃ol:toward a typology of code-switching. Linguistics 18(7-8), 581–618.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 541 Porter, M. F. (1980). An algorithm for suffix stripping. Program 14(3), 130–137.
Prabhakaran, V., O. Rambow, and M. Diab (2010). Automatic committed belief tagging. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 1014–1022.
Pradhan, S., X. Luo, M. Recasens, E. Hovy, V. Ng, and M. Strube (2014). Scoring corefer- ence partitions of predicted mentions: A reference implementation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 30–35.
Pradhan, S., L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel, and N. Xue (2011). CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceed- ings of the Conference on Natural Language Learning (CoNLL), pp. 1–27.
Pradhan, S., W. Ward, K. Hacioglu, J. H. Martin, and D. Jurafsky (2005). Semantic role labeling using different syntactic views. In Proceedings of the Association for Computational Linguistics (ACL), pp. 581–588.
Prasad, R., N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber (2008). The Penn Discourse Treebank 2.0. In Proceedings of the Language Resources and Evaluation Conference.
Punyakanok, V., D. Roth, and W.-t. Yih (2008). The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34(2), 257–287.
Pustejovsky,J.,P.Hanks,R.Saur ́ı,A.See,R.Gaizauskas,A.Setzer,D.Radev,B.Sundheim, D. Day, L. Ferro, et al. (2003). The timebank corpus. In Proceedings of Corpus linguistics, pp. 647–656.
Pustejovsky, J., B. Ingria, R. Sauri, J. Castano, J. Littman, R. Gaizauskas, A. Setzer, G. Katz, and I. Mani (2005). The specification language TimeML. In The language of time: A reader, pp. 545–557. Oxford University Press.
Qin, L., Z. Zhang, H. Zhao, Z. Hu, and E. Xing (2017). Adversarial connective-exploiting networks for implicit discourse relation classification. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1006–1017.
Qiu, G., B. Liu, J. Bu, and C. Chen (2011). Opinion word expansion and target extraction through double propagation. Computational linguistics 37(1), 9–27.
Quattoni, A., S. Wang, L.-P. Morency, M. Collins, and T. Darrell (2007). Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(10).
Rahman, A. and V. Ng (2011). Narrowing the modeling gap: a cluster-ranking approach to coreference resolution. Journal of Artificial Intelligence Research 40, 469–521.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

542 BIBLIOGRAPHY
Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang (2016). Squad: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2383–2392.
Ranzato, M., S. Chopra, M. Auli, and W. Zaremba (2016). Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Rep- resentations (ICLR).
Rao, D., D. Yarowsky, A. Shreevats, and M. Gupta (2010). Classifying latent user attributes in twitter. In Proceedings of Workshop on Search and mining user-generated contents.
Ratinov, L. and D. Roth (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 147–155.
Ratinov, L., D. Roth, D. Downey, and M. Anderson (2011). Local and global algorithms for disambiguation to wikipedia. In Proceedings of the Association for Computational Lin- guistics (ACL), pp. 1375–1384.
Ratliff, N. D., J. A. Bagnell, and M. Zinkevich (2007). (approximate) subgradient methods for structured prediction. In Proceedings of Artificial Intelligence and Statistics (AISTATS), pp. 380–387.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Pro- ceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 133–142.
Ratnaparkhi, A., J. Reynar, and S. Roukos (1994). A maximum entropy model for preposi- tional phrase attachment. In Proceedings of the workshop on Human Language Technology, pp. 250–255.
Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL student research workshop, pp. 43–48.
Reisinger, D., R. Rudinger, F. Ferraro, C. Harman, K. Rawlins, and B. V. Durme (2015). Semantic proto-roles. Transactions of the Association for Computational Linguistics 3, 475– 488.
Reisinger, J. and R. J. Mooney (2010). Multi-prototype vector-space models of word mean- ing. In Proceedings of the North American Chapter of the Association for Computational Lin- guistics (NAACL), pp. 109–117.
Reiter, E. and R. Dale (2000). Building natural language generation systems. Cambridge University Press.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 543 Resnik, P., M. B. Olsen, and M. Diab (1999). The bible as a parallel corpus: Annotating the
‘Book of 2000 Tongues’. Computers and the Humanities 33(1-2), 129–153.
Resnik, P. and N. A. Smith (2003). The web as a parallel corpus. Computational Linguis-
tics 29(3), 349–380.
Ribeiro, F. N., M. Arau ́jo, P. Gonc ̧alves, M. A. Gonc ̧alves, and F. Benevenuto (2016). Sentibench-a benchmark comparison of state-of-the-practice sentiment analysis meth- ods. EPJ Data Science 5(1), 1–29.
Richardson, M., C. J. Burges, and E. Renshaw (2013). MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 193–203.
Riedel, S., L. Yao, and A. McCallum (2010). Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML), pp. 148–163.
Riedl, M. O. and R. M. Young (2010). Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research 39, 217–268.
Rieser, V. and O. Lemon (2011). Reinforcement learning for adaptive dialogue systems: a data- driven methodology for dialogue management and natural language generation. Springer Sci- ence & Business Media.
Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 1044–1049.
Riloff, E. and J. Wiebe (2003). Learning extraction patterns for subjective expressions. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pp. 105–112. Association for Computational Linguistics.
Ritchie, G. (2001). Current directions in computational humour. Artificial Intelligence Re- view 16(2), 119–135.
Ritter, A., C. Cherry, and W. B. Dolan (2011). Data-driven response generation in social media. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 583–593.
Ritter, A., S. Clark, Mausam, and O. Etzioni (2011). Named entity recognition in tweets: an experimental study. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Roark, B., M. Saraclar, and M. Collins (2007). Discriminative n-gram language modeling. Computer Speech & Language 21(2), 373–392.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

544 BIBLIOGRAPHY Robert, C. and G. Casella (2013). Monte Carlo statistical methods. Springer Science & Busi-
ness Media.
Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language mod-
elling. Computer Speech & Language 10(3), 187–228.
Ross, S., G. Gordon, and D. Bagnell (2011). A reduction of imitation learning and struc- tured prediction to no-regret online learning. In Proceedings of Artificial Intelligence and Statistics (AISTATS), pp. 627–635.
Roy, N., J. Pineau, and S. Thrun (2000). Spoken dialogue management using probabilistic reasoning. In Proceedings of the Association for Computational Linguistics (ACL), pp. 93– 100.
Rudinger, R., J. Naradowsky, B. Leonard, and B. Van Durme (2018). Gender bias in coref- erence resolution. In Proceedings of the North American Chapter of the Association for Com- putational Linguistics (NAACL).
Rudnicky, A. and W. Xu (1999). An agenda-based dialog management architecture for spoken language systems. In IEEE Automatic Speech Recognition and Understanding Work- shop, Volume 13.
Rush, A. M., S. Chopra, and J. Weston (2015). A neural attention model for abstractive sen- tence summarization. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 379–389.
Rush, A. M., D. Sontag, M. Collins, and T. Jaakkola (2010). On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of Em- pirical Methods for Natural Language Processing (EMNLP), pp. 1–11.
Russell, S. J. and P. Norvig (2009). Artificial intelligence: a modern approach (3rd ed.). Prentice Hall.
Rutherford, A., V. Demberg, and N. Xue (2017). A systematic study of neural discourse models for implicit discourse relation. In Proceedings of the European Chapter of the Asso- ciation for Computational Linguistics (EACL), pp. 281–291.
Rutherford, A. T. and N. Xue (2014). Discovering implicit discourse relations through brown cluster pair representation and coreference patterns. In Proceedings of the Euro- pean Chapter of the Association for Computational Linguistics (EACL).
Sag, I. A., T. Baldwin, F. Bond, A. Copestake, and D. Flickinger (2002). Multiword expres- sions: A pain in the neck for nlp. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 1–15. Springer.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 545
Sagae, K. (2009). Analysis of discourse structure with syntactic dependencies and data- driven shift-reduce parsing. In Proceedings of the 11th International Conference on Parsing Technologies, pp. 81–84.
Santos, C. D. and B. Zadrozny (2014). Learning character-level representations for part-of- speech tagging. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1818–1826.
Sato, M.-A. and S. Ishii (2000). On-line em algorithm for the normalized gaussian network. Neural computation 12(2), 407–432.
Saur ́ı, R. and J. Pustejovsky (2009). Factbank: a corpus annotated with event factuality. Language resources and evaluation 43(3), 227.
Saxe, A. M., J. L. McClelland, and S. Ganguli (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
Schank, R. C. and R. Abelson (1977). Scripts, goals, plans, and understanding. Hillsdale, NJ: Erlbaum.
Schapire, R. E. and Y. Singer (2000). BoosTexter: A boosting-based system for text catego- rization. Machine learning 39(2-3), 135–168.
Schaul, T., S. Zhang, and Y. LeCun (2013). No more pesky learning rates. In Proceedings of the International Conference on Machine Learning (ICML), pp. 343–351.
Schnabel, T., I. Labutov, D. Mimno, and T. Joachims (2015). Evaluation methods for un- supervised word embeddings. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 298–307.
Schneider, N., J. Flanigan, and T. O’Gorman (2015). The logic of AMR: Practical, unified, graph-based sentence semantics for NLP. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4–5.
Schu ̈tze,H.(1998).Automaticwordsensediscrimination.Computationallinguistics24(1), 97–123.
Schwarm, S. E. and M. Ostendorf (2005). Reading level assessment using support vector machines and statistical language models. In Proceedings of the Association for Computa- tional Linguistics (ACL), pp. 523–530.
See, A., P. J. Liu, and C. D. Manning (2017). Get to the point: Summarization with pointer- generator networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1073–1083.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

546 BIBLIOGRAPHY
Sennrich, R., B. Haddow, and A. Birch (2016). Neural machine translation of rare words with subword units. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1715–1725.
Serban, I. V., A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016). Building end-to- end dialogue systems using generative hierarchical neural network models. In Proceed- ings of the National Conference on Artificial Intelligence (AAAI), pp. 3776–3784.
Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1), 1–114.
Shang, L., Z. Lu, and H. Li (2015). Neural responding machine for short-text conversation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1577–1586.
Shen, D. and M. Lapata (2007). Using semantic roles to improve question answering. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 12–21.
Shen, S., Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu (2016). Minimum risk train- ing for neural machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1683–1692.
Shen, W., J. Wang, and J. Han (2015). Entity linking with a knowledge base: Issues, tech- niques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27(2), 443– 460.
Shieber, S. M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy 8(3), 333–343.
Siegelmann, H. T. and E. D. Sontag (1995). On the computational power of neural nets. Journal of computer and system sciences 50(1), 132–150.
Singh, S., A. Subramanya, F. Pereira, and A. McCallum (2011). Large-scale cross- document coreference using distributed inference and hierarchical models. In Proceed- ings of the Association for Computational Linguistics (ACL), pp. 793–803.
Sipser, M. (2012). Introduction to the Theory of Computation. Cengage Learning.
Smith, D. A. and J. Eisner (2006). Minimum risk annealing for training log-linear models.
In Proceedings of the Association for Computational Linguistics (ACL), pp. 787–794.
Smith, D. A. and J. Eisner (2008). Dependency parsing by belief propagation. In Proceed-
ings of Empirical Methods for Natural Language Processing (EMNLP), pp. 145–156.
Smith, D. A. and N. A. Smith (2007). Probabilistic models of nonprojective dependency trees. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 132–140.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 547 Smith, N. A. (2011). Linguistic structure prediction. Synthesis Lectures on Human Language
Technologies 4(2), 1–274.
Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul (2006). A study of transla- tion edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas (AMTA).
Snow, R., B. O’Connor, D. Jurafsky, and A. Y. Ng (2008). Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 254–263.
Snyder, B. and R. Barzilay (2007). Database-text alignment via structured multilabel classi- fication. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1713–1718.
Socher, R., J. Bauer, C. D. Manning, and A. Y. Ng (2013). Parsing with compositional vector grammars. In Proceedings of the Association for Computational Linguistics (ACL).
Socher, R., B. Huval, C. D. Manning, and A. Y. Ng (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1201–1211.
Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Søgaard, A. (2013). Semi-supervised learning and domain adaptation in natural language processing. Synthesis Lectures on Human Language Technologies 6(2), 1–103.
Solorio, T. and Y. Liu (2008). Learning to predict code-switching points. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 973–981.
Somasundaran, S., G. Namata, J. Wiebe, and L. Getoor (2009). Supervised and unsuper- vised methods in employing discourse relations for improving opinion polarity classi- fication. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Somasundaran, S. and J. Wiebe (2009). Recognizing stances in online debates. In Proceed- ings of the Association for Computational Linguistics (ACL), pp. 226–234.
Song, L., B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola (2010). Hilbert space embeddings of hidden markov models. In Proceedings of the International Conference on Machine Learning (ICML), pp. 991–998.
Song, L., Y. Zhang, X. Peng, Z. Wang, and D. Gildea (2016). AMR-to-text generation as a traveling salesman problem. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2084–2089.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

548 BIBLIOGRAPHY Soon, W. M., H. T. Ng, and D. C. Y. Lim (2001). A machine learning approach to corefer-
ence resolution of noun phrases. Computational linguistics 27(4), 521–544.
Sordoni, A., M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan (2015). A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Soricut, R. and D. Marcu (2003). Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 149–156.
Sowa, J. F. (2000). Knowledge representation: logical, philosophical, and computational founda- tions. Pacific Grove, CA: Brooks/Cole.
Spa ̈rck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of documentation 28(1), 11–21.
Spitkovsky, V. I., H. Alshawi, D. Jurafsky, and C. D. Manning (2010). Viterbi training improves unsupervised dependency parsing. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 9–17.
Sporleder, C. and M. Lapata (2005). Discourse chunking and its application to sen- tence compression. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 257–264.
Sproat, R., A. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards (2001). Normaliza- tion of non-standard words. Computer Speech & Language 15(3), 287–333.
Sproat, R., W. Gale, C. Shih, and N. Chang (1996). A stochastic finite-state word- segmentation algorithm for chinese. Computational linguistics 22(3), 377–404.
Sra, S., S. Nowozin, and S. J. Wright (2012). Optimization for machine learning. MIT Press.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958.
Srivastava, R. K., K. Greff, and J. Schmidhuber (2015). Training very deep networks. In Neural Information Processing Systems (NIPS), pp. 2377–2385.
Stab, C. and I. Gurevych (2014a). Annotating argument components and relations in per- suasive essays. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 1501–1510.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 549
Stab, C. and I. Gurevych (2014b). Identifying argumentative discourse structures in persuasive essays. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 46–56.
Stede, M. (2011, nov). Discourse Processing, Volume 4 of Synthesis Lectures on Human Lan- guage Technologies. Morgan & Claypool Publishers.
Steedman, M. and J. Baldridge (2011). Combinatory categorial grammar. In Non- Transformational Syntax: Formal and Explicit Models of Grammar. Wiley-Blackwell.
Stenetorp, P., S. Pyysalo, G. Topic ́, T. Ohta, S. Ananiadou, and J. Tsujii (2012). Brat: a web- based tool for nlp-assisted text annotation. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 102–107.
Stern, M., J. Andreas, and D. Klein (2017). A minimal span-based neural constituency parser. In Proceedings of the Association for Computational Linguistics (ACL).
Stolcke, A., K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, and M. Meteer (2000). Dialogue act modeling for automatic tag- ging and recognition of conversational speech. Computational linguistics 26(3), 339–373.
Stone, P. J. (1966). The General Inquirer: A Computer Approach to Content Analysis. The MIT Press.
Stoyanov, V., N. Gilbert, C. Cardie, and E. Riloff (2009). Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the Associ- ation for Computational Linguistics (ACL), pp. 656–664.
Strang, G. (2016). Introduction to linear algebra (Fifth ed.). Wellesley, MA: Wellesley- Cambridge Press.
Strubell, E., P. Verga, D. Belanger, and A. McCallum (2017). Fast and accurate entity recog- nition with iterated dilated convolutions. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Suchanek, F. M., G. Kasneci, and G. Weikum (2007). Yago: a core of semantic knowledge. In Proceedings of the Conference on World-Wide Web (WWW), pp. 697–706.
Sun, X., T. Matsuzaki, D. Okanohara, and J. Tsujii (2009). Latent variable perceptron algo- rithm for structured classification. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1236–1242.
Sun, Y., L. Lin, D. Tang, N. Yang, Z. Ji, and X. Wang (2015). Modeling mention, context and entity with neural networks for entity disambiguation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1333–1339.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

550 BIBLIOGRAPHY
Sundermeyer, M., R. Schlu ̈ter, and H. Ney (2012). LSTM neural networks for language modeling. In Proceedings of the International Speech Communication Association (INTER- SPEECH).
Surdeanu, M., J. Tibshirani, R. Nallapati, and C. D. Manning (2012). Multi-instance multi- label learning for relation extraction. In Proceedings of Empirical Methods for Natural Lan- guage Processing (EMNLP), pp. 455–465.
Sutskever, I., O. Vinyals, and Q. V. Le (2014). Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pp. 3104–3112.
Sutton, R. S. and A. G. Barto (1998). Reinforcement learning: An introduction, Volume 1. MIT press Cambridge.
Sutton, R. S., D. A. McAllester, S. P. Singh, and Y. Mansour (2000). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Process- ing Systems (NIPS), pp. 1057–1063.
Suzuki, J., S. Takase, H. Kamigaito, M. Morishita, and M. Nagata (2018). An empirical study of building a strong baseline for constituency parsing. In Proceedings of the Asso- ciation for Computational Linguistics (ACL), pp. 612–618.
Sweeney, L. (2013). Discrimination in online ad delivery. Queue 11(3), 10.
Taboada, M., J. Brooke, M. Tofiloski, K. Voll, and M. Stede (2011). Lexicon-based methods
for sentiment analysis. Computational linguistics 37(2), 267–307.
Taboada, M. and W. C. Mann (2006). Rhetorical structure theory: Looking back and mov-
ing ahead. Discourse studies 8(3), 423–459.
Ta ̈ckstro ̈m, O., K. Ganchev, and D. Das (2015). Efficient inference and structured learning for semantic role labeling. Transactions of the Association for Computational Linguistics 3, 29–41.
Ta ̈ckstro ̈m, O., R. McDonald, and J. Uszkoreit (2012). Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 477–487.
Tang, D., B. Qin, and T. Liu (2015). Document modeling with gated recurrent neural net- work for sentiment classification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1422–1432.
Taskar, B., C. Guestrin, and D. Koller (2003). Max-margin markov networks. In Neural Information Processing Systems (NIPS).
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 551
Tausczik, Y. R. and J. W. Pennebaker (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1), 24–54.
Teh, Y. W. (2006). A hierarchical bayesian language model based on pitman-yor processes. In Proceedings of the Association for Computational Linguistics (ACL), pp. 985–992.
Tesnie`re, L. (1966). E ́le ́ments de syntaxe structurale (second ed.). Paris: Klincksieck.
Teufel, S., J. Carletta, and M. Moens (1999). An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 110–117.
Teufel, S. and M. Moens (2002). Summarizing scientific articles: experiments with rele- vance and rhetorical status. Computational linguistics 28(4), 409–445.
Thomas, M., B. Pang, and L. Lee (2006). Get out the vote: Determining support or opposi- tion from Congressional floor-debate transcripts. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 327–335.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
Titov, I. and J. Henderson (2007). Constituent parsing with incremental sigmoid belief networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 632– 639.
Toutanova, K., D. Klein, C. D. Manning, and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Trivedi, R. and J. Eisenstein (2013). Discourse connectors for latent subjectivity in senti- ment analysis. In Proceedings of the North American Chapter of the Association for Compu- tational Linguistics (NAACL), pp. 808–813.
Tromble, R. W. and J. Eisner (2006). A fast finite-state relaxation method for enforcing global constraints on sequence decoding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Tsochantaridis, I., T. Hofmann, T. Joachims, and Y. Altun (2004). Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Interna- tional Conference on Machine Learning (ICML).
Tsvetkov, Y., M. Faruqui, W. Ling, G. Lample, and C. Dyer (2015). Evaluation of word vector representations by subspace alignment. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2049–2054.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

552 BIBLIOGRAPHY
Tu, Z., Z. Lu, Y. Liu, X. Liu, and H. Li (2016). Modeling coverage for neural machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 76– 85.
Turian, J., L. Ratinov, and Y. Bengio (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the Association for Computational Linguistics (ACL), pp. 384–394.
Turing, A. M. (2009). Computing machinery and intelligence. In R. Epstein, G. Roberts, and G. Beber (Eds.), Parsing the Turing Test, pp. 23–65. Springer.
Turney, P. D. and P. Pantel (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37, 141–188.
Tutin, A. and R. Kittredge (1992). Lexical choice in context: generating procedural texts. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 763–769.
Twain, M. (1997). A Tramp Abroad. New York: Penguin.
Tzeng, E., J. Hoffman, T. Darrell, and K. Saenko (2015). Simultaneous deep transfer across domains and tasks. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 4068–4076.
Usunier, N., D. Buffoni, and P. Gallinari (2009). Ranking with ordered weighted pairwise classification. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1057–1064.
Uthus, D. C. and D. W. Aha (2013). The ubuntu chat corpus for multiparticipant chat analysis. In AAAI Spring Symposium: Analyzing Microtext, Volume 13, pp. 01.
Utiyama, M. and H. Isahara (2001). A statistical model for domain-independent text seg- mentation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 499– 506.
Utiyama, M. and H. Isahara (2007). A comparison of pivot methods for phrase-based sta- tistical machine translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 484–491.
Uzuner,O ̈.,X.Zhang,andT.Sibanda(2009).Machinelearningandrule-basedapproaches to assertion classification. Journal of the American Medical Informatics Association 16(1), 109–115.
Vadas, D. and J. R. Curran (2011). Parsing noun phrases in the penn treebank. Computa- tional Linguistics 37(4), 753–809.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 553 Van Eynde, F. (2006). NP-internal agreement and the structure of the noun phrase. Journal
of Linguistics 42(1), 139–186.
Van Gael, J., A. Vlachos, and Z. Ghahramani (2009). The infinite hmm for unsuper- vised pos tagging. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 678–687.
Vaswani, A., S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit (2018). Ten- sor2tensor for neural machine translation. CoRR abs/1803.07416.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Neural Information Processing Systems (NIPS), pp. 6000–6010.
Velldal, E., L. Øvrelid, J. Read, and S. Oepen (2012). Speculation and negation: Rules, rankers, and the role of syntax. Computational linguistics 38(2), 369–410.
Versley, Y. (2011). Towards finer-grained tagging of discourse connectives. In Proceedings of the Workshop Beyound Semantics: Corpus-based Investigations of Pragmatic and Discourse Phenomena, pp. 2–63.
Vilain, M., J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman (1995). A model- theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message understanding, pp. 45–52. Association for Computational Linguistics.
Vincent, P., H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol (2010). Stacked de- noising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research 11, 3371–3408.
Vincze, V., G. Szarvas, R. Farkas, G. Mo ́ra, and J. Csirik (2008). The bioscope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC bioinformat- ics 9(11), S9.
Vinyals, O., A. Toshev, S. Bengio, and D. Erhan (2015). Show and tell: A neural image cap- tion generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 3156–3164. IEEE.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory 13(2), 260–269.
Voll, K. and M. Taboada (2007). Not all words are created equal: Extracting semantic orientation as a function of adjective relevance. In Proceedings of Australian Conference on Artificial Intelligence.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

554 BIBLIOGRAPHY Wager, S., S. Wang, and P. S. Liang (2013). Dropout training as adaptive regularization. In
Neural Information Processing Systems (NIPS), pp. 351–359.
Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and
variationalinference.FoundationsandTrends⃝R inMachineLearning1(1-2),1–305.
Walker, M. A. (2000). An application of reinforcement learning to dialogue strategy selec- tion in a spoken dialogue system for email. Journal of Artificial Intelligence Research 12, 387–416.
Walker, M. A., J. E. Cahn, and S. J. Whittaker (1997). Improvising linguistic style: Social and affective bases for agent personality. In Proceedings of the first international conference on Autonomous agents, pp. 96–105. ACM.
Wang, C., N. Xue, and S. Pradhan (2015). A Transition-based Algorithm for AMR Parsing. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 366–375.
Wang, H., T. Onishi, K. Gimpel, and D. McAllester (2017). Emergent predication structure in hidden state vectors of neural readers. In Proceedings of the 2nd Workshop on Represen- tation Learning for NLP, pp. 26–36.
Weaver, W. (1955). Translation. Machine translation of languages 14, 15–23.
Webber, B. (2004). D-LTAG: extending lexicalized TAG to discourse. Cognitive Sci-
ence 28(5), 751–779.
Webber, B., M. Egg, and V. Kordoni (2012). Discourse structure and language technology.
Journal of Natural Language Engineering 18(4), 437–490.
Webber, B. and A. Joshi (2012). Discourse structure and computation: past, present and future. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Dis- coveries, pp. 42–54. Association for Computational Linguistics.
Wei, G. C. and M. A. Tanner (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association 85(411), 699–704.
Weinberger, K., A. Dasgupta, J. Langford, A. Smola, and J. Attenberg (2009). Feature hashing for large scale multitask learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1113–1120.
Weizenbaum, J. (1966). Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1), 36–45.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 555
Wellner, B. and J. Pustejovsky (2007). Automatically identifying the arguments of dis- course connectives. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 92–101.
Wen, T.-H., M. Gasic, N. Mrksˇic ́, P.-H. Su, D. Vandyke, and S. Young (2015). Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1711–1721.
Weston, J., S. Bengio, and N. Usunier (2011). Wsabie: Scaling up to large vocabulary im- age annotation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 2764–2770.
Wiebe, J., T. Wilson, and C. Cardie (2005). Annotating expressions of opinions and emo- tions in language. Language resources and evaluation 39(2), 165–210.
Wieting, J., M. Bansal, K. Gimpel, and K. Livescu (2016a). CHARAGRAM: Embedding words and sentences via character n-grams. In Proceedings of Empirical Methods for Nat- ural Language Processing (EMNLP), pp. 1504–1515.
Wieting, J., M. Bansal, K. Gimpel, and K. Livescu (2016b). Towards universal paraphrastic sentence embeddings. In Proceedings of the International Conference on Learning Represen- tations (ICLR).
Williams, J. D. and S. Young (2007). Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language 21(2), 393–422.
Williams, P., R. Sennrich, M. Post, and P. Koehn (2016). Syntax-based statistical machine translation. Synthesis Lectures on Human Language Technologies 9(4), 1–208.
Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase- level sentiment analysis. In Proceedings of Empirical Methods for Natural Language Pro- cessing (EMNLP), pp. 347–354.
Winograd, T. (1972). Understanding natural language. Cognitive psychology 3(1), 1–191.
Wiseman, S., A. M. Rush, and S. M. Shieber (2016). Learning global features for corefer- ence resolution. In Proceedings of the North American Chapter of the Association for Compu- tational Linguistics (NAACL), pp. 994–1004.
Wiseman, S., S. Shieber, and A. Rush (2017). Challenges in data-to-document generation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2253– 2263.
Wiseman, S. J., A. M. Rush, S. M. Shieber, and J. Weston (2015). Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the Association for Computational Linguistics (ACL).
Under contract with MIT Press, shared under CC-BY-NC-ND license.

556 BIBLIOGRAPHY Wolf, F. and E. Gibson (2005). Representing discourse coherence: A corpus-based study.
Computational Linguistics 31(2), 249–287.
Wolfe, T., M. Dredze, and B. Van Durme (2017). Pocket knowledge base population. In
Proceedings of the Association for Computational Linguistics (ACL), pp. 305–310.
Wong, Y. W. and R. Mooney (2007). Generation by inverting a semantic parser that uses statistical machine translation. In Proceedings of the North American Chapter of the Associ- ation for Computational Linguistics (NAACL), pp. 172–179.
Wong, Y. W. and R. J. Mooney (2006). Learning for semantic parsing with statistical ma- chine translation. In Proceedings of the North American Chapter of the Association for Com- putational Linguistics (NAACL), pp. 439–446.
Wu, B. Y. and K.-M. Chao (2004). Spanning trees and optimization problems. CRC Press. Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of par-
allel corpora. Computational linguistics 23(3), 377–403.
Wu, F. and D. S. Weld (2010). Open information extraction using wikipedia. In Proceedings
of the Association for Computational Linguistics (ACL), pp. 118–127.
Wu, X., R. Ward, and L. Bottou (2018). WNGrad: Learn the learning rate in gradient
descent. arXiv preprint arXiv:1803.02865.
Wu, Y., M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016). Google’s neural machine translation system: Bridging the gap between human and ma- chine translation. CoRR abs/1609.08144.
Xia, F. (2000). The part-of-speech tagging guidelines for the Penn Chinese Treebank (3.0). Technical report, University of Pennsylvania Institute for Research in Cognitive Science.
Xu, K., J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057.
Xu, W., X. Liu, and Y. Gong (2003). Document clustering based on non-negative matrix factorization. In Proceedings of ACM SIGIR conference on Research and development in in- formation retrieval, pp. 267–273.
Xu, Y., L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin (2015). Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of Empir- ical Methods for Natural Language Processing (EMNLP), pp. 1785–1794.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 557
Xuan Bach, N., N. L. Minh, and A. Shimazu (2012). A reranking model for discourse seg- mentation using subtree features. In Proceedings of the Special Interest Group on Discourse and Dialogue (SIGDIAL).
Xue, N. et al. (2003). Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing 8(1), 29–48.
Xue, N., H. T. Ng, S. Pradhan, R. Prasad, C. Bryant, and A. T. Rutherford (2015). The CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the Conference on Natural Language Learning (CoNLL).
Yamada, H. and Y. Matsumoto (2003). Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp. 195–206.
Yamada, K. and K. Knight (2001). A syntax-based statistical translation model. In Proceed- ings of the Association for Computational Linguistics (ACL), pp. 523–530.
Yang, B. and C. Cardie (2014). Context-aware learning for sentence-level sentiment anal- ysis with posterior regularization. In Proceedings of the Association for Computational Lin- guistics (ACL).
Yang, Y., M.-W. Chang, and J. Eisenstein (2016). Toward socially-infused information ex- traction: Embedding authors, mentions, and entities. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Yang, Y. and J. Eisenstein (2013). A log-linear model for unsupervised text normalization. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Yang, Y. and J. Eisenstein (2015). Unsupervised multi-domain adaptation with feature em- beddings. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Yang, Y., W.-t. Yih, and C. Meek (2015). WikiQA: A challenge dataset for open-domain question answering. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2013–2018.
Yannakoudakis, H., T. Briscoe, and B. Medlock (2011). A new dataset and method for automatically grading esol texts. In Proceedings of the Association for Computational Lin- guistics (ACL), pp. 180–189.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised meth- ods. In Proceedings of the Association for Computational Linguistics (ACL), pp. 189–196. Association for Computational Linguistics.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

558 BIBLIOGRAPHY Yee, L. C. and T. Y. Jones (2012, March). Apple CEO in China mission to clear up problems.
Reuters. retrieved on March 26, 2017.
Yi, Y., C.-Y. Lai, S. Petrov, and K. Keutzer (2011). Efficient parallel cky parsing on gpus. In
Proceedings of the 12th International Conference on Parsing Technologies, pp. 175–185.
Yu, C.-N. J. and T. Joachims (2009). Learning structural svms with latent variables. In
Proceedings of the International Conference on Machine Learning (ICML), pp. 1169–1176. Yu, F. and V. Koltun (2016). Multi-scale context aggregation by dilated convolutions. In
Proceedings of the International Conference on Learning Representations (ICLR).
Zaidan, O. F. and C. Callison-Burch (2011). Crowdsourcing translation: Professional qual- ity from non-professionals. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1220–1229.
Zaremba, W., I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Zelenko, D., C. Aone, and A. Richardella (2003). Kernel methods for relation extraction. The Journal of Machine Learning Research 3, 1083–1106.
Zelle, J. M. and R. J. Mooney (1996). Learning to parse database queries using induc- tive logic programming. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 1050–1055.
Zeng, D., K. Liu, S. Lai, G. Zhou, and J. Zhao (2014). Relation classification via convolu- tional deep neural network. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 2335–2344.
Zettlemoyer, L. S. and M. Collins (2005). Learning to map sentences to logical form: Struc- tured classification with probabilistic categorial grammars. In Proceedings of Uncertainty in Artificial Intelligence (UAI).
Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR).
Zhang, X., J. Zhao, and Y. LeCun (2015). Character-level convolutional networks for text classification. In Neural Information Processing Systems (NIPS), pp. 649–657.
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 559
Zhang, Y. and S. Clark (2008). A tale of two parsers: investigating and combining graph- based and transition-based dependency parsing using beam-search. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 562–571.
Zhang, Y., T. Lei, R. Barzilay, T. Jaakkola, and A. Globerson (2014). Steps to excellence: Simple inference with refined scoring of dependency trees. In Proceedings of the Associa- tion for Computational Linguistics (ACL), pp. 197–207.
Zhang, Y. and J. Nivre (2011). Transition-based dependency parsing with rich non-local features. In Proceedings of the Association for Computational Linguistics (ACL), pp. 188–193.
Zhang, Z. (2017). A note on counting dependency trees. arXiv preprint arXiv:1708.08789.
Zhao, J., T. Wang, M. Yatskar, e. V. Ordonez, and K.-W. Chang (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Zhou, J. and W. Xu (2015). End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1127–1137.
Zhu, J., Z. Nie, X. Liu, B. Zhang, and J.-R. Wen (2009). Statsnowball: a statistical approach to extracting entity relationships. In Proceedings of the Conference on World-Wide Web (WWW), pp. 101–110.
Zhu, X., Z. Ghahramani, and J. D. Lafferty (2003). Semi-supervised learning using gaus- sian fields and harmonic functions. In Proceedings of the International Conference on Ma- chine Learning (ICML), pp. 912–919.
Zhu, X. and A. B. Goldberg (2009). Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3(1), 1–130.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley.
Zirn, C., M. Niepert, H. Stuckenschmidt, and M. Strube (2011). Fine-grained sentiment analysis with structural features. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 336–344.
Zou, W. Y., R. Socher, D. Cer, and C. D. Manning (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1393–1398.
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Index
α-conversion, 295 β-reduction, β-conversion, 293 n-gram, 24
language model, 198 p-value, 84
one-tailed, 85 F -MEASURE, 82
balanced, 83 macro, 82 micro, 83
WORDNET, 9, 74 BLEU, 433 METEOR, 435 RIBES, 435
semantics extra-propositional, 422
ablation test, 84
Abstract Meaning Representation
(AMR), 307, 319 accuracy, 23, 81
action, in reinforcement learning, 365 AdaGrad, 40, 61
adequacy, in translation, 433 adjectives, 177
adjuncts, 306 adpositions, 178 adverbs, 177 affix, 195
inflectional, 78
agent (thematic role), 308
alignment
in machine translation, 432, 437 in semantic parsing, 321
in text generation, 459
Amazon Mechanical Turk, 91 ambiguity, 210, 218
attachment, 257 derivational, 222 spurious, 222, 269, 298 syntactic, 229
anaphoricity, 360
anchored productions, 232 animacy, 307
annealing, 452
antecedent mention, 351, 359 antonymy, 74, 281
apophony, 194
area under the curve (AUC), 83 argumentation, 392
arguments, 403
article, 182
aspect, 177
aspect-based opinion mining, 71 autoencoder, 345
denoising, 345, 408
variational, 465
automated theorem provers, 288 automatic differentiation, 56 auxiliary verbs, 178
average mutual information, 333 averaged perceptron, 27
561

562
BIBLIOGRAPHY
backchannel, 187 backoff, 130
Katz, 130 backpropagation, 55
through time, 136
backward recurrence, 165, 166 backward-looking center, 382
bag of words, 13
balanced test set, 81
batch normalization, 61 Baum-Welch algorithm, 170 Bayes’ rule, 478
Bayesian nonparametrics, 103, 249 beam sampling, 172
beam search, 272
in coreference resolution, 363
in machine translation, 450, 451 Bell number, 362
bias, 23
bias-variance tradeoff, 23, 127, 129 bigrams, 24, 70
bilinear product, 330 binarization, 211, 228 binomial
distribution, 85 random variable, 482 test, 84
BIO notation, 184, 317
biomedical natural language processing,
183
Bonferroni correction, 87
boolean semiring, 199 boosting, 48
bootstrap samples, 86 brevity penalty, 434
Brown clusters, 327, 332 byte-pair encoding, 343, 449
c-command, 353
case marking, 182, 219 Catalan number, 225
cataphora, 352
center embedding, 207
centering theory, 355, 382 character-level language models, 141 chatbots, 470
Chomsky Normal Form (CNF), 211 Chu-Liu-Edmonds algorithm, 264 CKY algorithm, 226
class imbalance, 81
classification, 13
large margin, 30 lexicon-based, 72 weights, 14
closed-vocabulary, 141
closure, of classes of formal languages,
192
cluster ranking, 363
clustering, 96 K-means, 96
exchange, 333 hierarchical, 332 soft, 97
co-hypernymy, 281
co-training, 108
code switching, 180, 186
Cohen’s Kappa, 90
coherence, 396
cohesion, 379, 400
collective entity linking, 408 collocation extraction, 344 combinatory categorial grammar, 220 complement clause, 213
complement event (probability), 476 composition (CCG), 221 compositional vector grammars, 391 compositionality, 2, 7, 10, 291, 341 computation graph, 48, 55
dynamic, 57
computational linguistics (versus
natural language processing), 1 computational social science, 5
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY
563
concept (in Abstract Meaning Representation), 319
conditional independence, 154, 480 conditional random field, 162 confidence interval, 86
configuration (transition-based parsing),
269
consistency, in logic, 291
constants, in logic, 286 constituents, 212
split, 312
constrained optimization, 315 content selection, 457
content words, 178 context-free grammars, 208
probabilistic (PCFGs), 235 synchronous, 441 weighted, 218, 227, 233
context-free languages, 207, 208 context-free production, 209
recursive, 209
unary, 210
context-sensitive languages, 218 continuous bag-of-words (CBOW), 334 contradiction, 346
conversational turns, 187
convexity, 29, 58, 300, 482, 485
biconvexity, 102 convolution
dilated, 64, 185 narrow, 63 one-dimensional, 63 wide, 63
convolutional neural networks, 62 in machine translation, 447
in relation extraction, 415
in semantic role labeling, 318 in sequence labeling, 170
cooperative principle, 351 coordinate ascent, 102 coordinating conjunctions, 178
copula, 177, 217, 260 coreference resolution, 347, 351
cross-document, 406 coreferent, 351
cosine similarity, 339, 380 cost function, 31 cost-augmented decoding, 32 coverage loss, 465
critical point, 58, 485
cross-entropy, 53, 416
cross-serial dependencies, 219 cross-validation, 24
crowdsourcing, 91
cumulative probability distribution, 86
decidability, 291 decision trees, 48 decoding
cost-augmented, 161
in conditional random fields, 163 definiteness, 183
delta function, 22 denotation, 286 dependency, 258
grammar, 257 graph, 258 labels, 259
path, 75, 314, 413 syntactic, 258
dependency parsing, 257 arc-eager, 269, 271 arc-factored, 263 arc-standard, 269 pseudo-projective, 272 second-order, 263 third-order, 264
derivation
in context-free languages, 209 in dependency parsing, 269 in semantic parsing, 293
determiner, 179
Under contract with MIT Press, shared under CC-BY-NC-ND license.

564
BIBLIOGRAPHY
phrase, 215 development set, 23, 81 dialogue acts, 90, 187, 471 dialogue management, 468 dialogue systems, 125, 467
agenda-based, 468
mixed-initiative, 468
digital humanities, 5, 69
Dirichlet Compound Multinomial, 398 Dirichlet distribution, 115 discounting, 130
absolute, 130 discourse, 379
connectives, 385 depth, 394 parsing, 385 segment, 379 unit, 389
discourse relations, 347, 384 coordinating, 389 implicit, 385
sense classification, 386 subordinating, 389
distant supervision, 118, 418, 419 distributed semantics, 327 distributional
hypothesis, 325, 326 semantics, 10, 327 statistics, 75, 248, 326
document frequency, 407 domain adaptation, 95, 111
by projection, 112 dropout, 57, 137
dual decomposition, 317 dynamic programming, 149
E-step, 99
early stopping, 27, 61 early update, 276
edit distance, 201, 435 effective counts, 129
elementwise nonlinearity, 50 Elman unit, 135
ELMo (embeddings from language
models), 340 embedding, 167
emotion, 71
empirical Bayes, 116
empty string, 192
encoder-decoder model, 345, 442, 460 ensemble learning, 48, 318, 444 entailment, 290, 346
entity, 403
embeddings, 407
grid, 383
linking, 351, 403, 405, 416 linking, collective, 409 nil, 405
entropy, 42, 99 estimation, 482 EuroParl corpus, 435 evaluation
extrinsic, 139, 338
intrinsic, 139, 338 event, 403
coreference, 421
detection, 421
event semantics, 305 events, in probability, 475
disjoint, 476 evidentiality, 182, 423 expectation, 481 expectation-maximization, 97
hard, 102
in language modeling, 131 in machine translation, 439 incremental, 102
online, 102
explicit semantic analysis, 328
factoid question, 424 factor graph, 163
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY
565
factuality, 423 fairness and bias, 5
in machine translation, 434
in word embeddings, 340 false discovery rate, 87
false negative, 81
rate, 478
false positive, 81, 479
rate, 83, 478 feature
co-adaptation, 58 function, 14 hashing, 80 noising, 58 selection, 41
features, 6 bilexical, 266
collocation, 75 emission, 148 lexical, 47 offset, 15 pivot, 113 transition, 148
finite state acceptor, 193
acceptor, chain, 205 acceptors, weighted, 197 automata, 193
automaton, deterministic, 194 composition, 204 transducers, 196, 201
finite-state transduction, 192
fluency, 125, 433
formal language theory, 191 forward
recurrence, 164
variable, 164, 166 forward-backward algorithm, 165, 206,
240 forward-looking centers, 382
frame, 310 element, 310
in dialogue systems, 467 FrameNet, 310
Frobenius norm, 57, 105
function words, 178
function, in logic, 289
functional segmentation, 379, 381
garden path sentence, 146 gazetteer, 184, 357, 413 generalization, 27
of neural networks, 59 generalized linear models, 42 generative model
for classification, 17
for coreference, 363
for interpolated language modeling,
131
for parsing, 235
for sequence labeling, 154 Gibbs sampling, 115, 409
collapsed, 116
gloss, 125, 179, 433
government and binding theory, 353 gradient, 29
clipping, 60 descent, 38 exploding, 137 vanishing, 51, 137
Gram matrix, 414
grammar equivalence, 210
grammar induction, 241 grammaticality, 396
graphical model, 154
graphics processing units (GPUs), 170,
185 grid search, 23
Hamming cost, 161 Hansards corpus, 435
Under contract with MIT Press, shared under CC-BY-NC-ND license.

566
BIBLIOGRAPHY
hanzi, 77
head rules, for lexicalized CFGs, 246, 257 head word, 212, 245, 257, 353
of a dependency edge, 258 hedging, 423
held-out data, 139
Hessian matrix, 39
hidden Markov models, 154 hierarchical recurrent network, 471 highway network, 52
holonymy, 75
homonymy, 73
human computation, 91 hypergraph, 393
hypernymy, 75, 281 hyperparameter, 22
hyponymy, 75
illocutionary force, 187
importance sampling, 453
importance score, 453
independent and identically distributed
(IID), 17 inference
in structured prediction, 147 logical, 285
rules for propositional logic, 288
inflection point, 486 information extraction, 403
open, 419
information retrieval, 5, 416 inside recurrence, 235–237 inside-outside algorithm, 240, 249 instance labels, 16
instance, in Abstract Meaning
Representation, 319 integer linear programming, 315
in coreference resolution, 362
in entity linking, 409
in extractive summarization, 395 in sentence compression, 466
inter-annotator agreement, 90 interjections, 177
interlingua, 432 interpolation, 131, 199 interval algebra, 421
inverse document frequency, 407 inversion (of finite state automata), 203 irrealis, 70
Jensen’s inequality, 99
Kalman smoother, 172 kernel, 48, 414
Kleene star, 192 knapsack problem, 395 knowledge base, 403
population, 416
label bias problem, 275, 280
label propagation, 110, 120 Lagrangian, 487
lambda calculus, 292
lambda expressions, 292
language model, 4, 126
latent Dirichlet allocation, 380 latent semantic analysis, 328, 329 latent variable, 98, 206, 299, 419, 432
conditional random field, 301 in parsing, 249
perceptron, 207, 300, 360
layer normalization, 61, 447 learning
active, 118
batch, 25
constraint-driven, 118 deep, 47
discriminative, 25
multiple instance, 118, 418 multitask, 118
online, 25, 39 reinforcement, 365, 451 semi-supervised, 76, 95, 339
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY
567
to search, 257, 277, 366 transfer, 118 unsupervised, 71, 95
learning rate, 38
least squares, 72
leave-one-out cross-validation, 24 lemma, 73, 202
lemmatization, 79
lexical entry, 293
lexical unit, 310
lexicalization
in parsing, 245
in text generation, 457
lexicalized tree-adjoining grammar for
discourse (D-LTAG), 385 lexicon, 293
in combinatory categorial grammar, 221
lexicon-based classification, 72
seed, 73 light verb, 321
linear separability, 25 linearization, 466, 472
link function, 42
literal character, 192
local optimum, 102, 486 locally-normalized objective, 275 log-bilinear language model, 342 log-likelihood
conditional, 52 log-linear models, 42 logic, 287
first-order, 288 higher-order, 289 propositional, 287
logistic function, 42
long short-term memory (LSTM), 52,
135, 137, 181, 442 bidirectional, 169, 445 deep, 443
LSTM-CRF, 169, 318
memory cell, 137 lookup layer, 53, 135 loss
function, 27 hinge, 29 logistic, 36 WARP, 411 zero-one, 28
machine learning, 2 supervised, 16
theory, 26 machine reading, 425
macro, 404
micro, 404
machine translation, 125
neural, 432
statistical, 432 margin, 26, 30
functional, 32
geometric, 32
marginal relevance, 465 marginalization, 477
markable, 358
Markov assumption, 154
Markov blanket, 154
Markov Chain Monte Carlo (MCMC),
103, 115, 172
Markov decision process, 468
partially-observable (POMDP), 470 Markov random fields, 162 Markovization
vertical, 244
matrix-tree theorem, 268
max-margin Markov network, 161 max-product algorithm, 157 maximum a posteriori, 22, 483 maximum conditional likelihood, 36 maximum directed spanning tree, 265 maximum entropy, 42
maximum likelihood, 17, 21, 482
Under contract with MIT Press, shared under CC-BY-NC-ND license.

568
BIBLIOGRAPHY
McNemar’s test, 84 meaning representation, 285 membership problem, 191 mention
in coreference resolution, 351 in entity linking, 403
in information extraction, 405
mention ranking, 360
mention-pair model, 359
meronymy, 75
method of moments, 117
mildly context-sensitive languages, 219 minibatch, 39
minimization of finite-state automata, 196
minimum error-rate training (MERT), 452
minimum risk training, 452 modality, 422
model, 8, 38
modifier, 258
modus ponens, 288, 303 moment-matching, 42 monomorphemic, 196 morpheme, 6, 141, 195 morphological segmentation, 159 morphology, 79, 158, 194, 342, 448
derivational, 194
inflectional, 177, 194, 202 morphosyntactic, 176
attributes, 180 morphotactics, 195 multi-view learning, 108 multiclass classification
one-versus-all, 414
one-versus-one, 414 multinomial
distribution, 18 Na ̈ıve Bayes, 19
Na ̈ıve Bayes, 17
name dictionary, 406 named entity, 183
linking, 405
recognition, 169, 403, 405 types, 405
nearest-neighbor, 48, 414 negation, 70, 422
scope, 423
negative sampling, 336, 337, 411 Neo-Davidsonian event semantics, 306 neural attention, 371, 426, 443, 444, 460
coarse-to-fine, 463
structured, 463 neural gate, 52, 445 neural network, 48
adversarial, 113
bidirectional recurrent, 168 convolutional, 53, 62, 185, 415 feedforward, 50
recurrent, 134, 415
recursive, 250, 343, 346, 396
noise-contrastive estimation, 135 noisy channel model, 126, 436 nominal modifier, 215
nominals, 351, 357 non-terminals, 209 normalization, 78
noun phrase, 2, 212 nouns, 176 NP-hard, 41, 409 nucleus, in RST, 389 null hypothesis, 84 null subjects, 375 numerals, 179
one-hot vector, 53
online support vector machine, 31 ontology, 9
open word classes, 176
opinion polarity, 69
optimization
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY
569
batch, 38 combinatorial, 8 constrained, 33 convex, 38 numerical, 8 quasi-Newton, 39
oracle, 275 dynamic, 277
in learning to search, 366 orthography, 196, 204 orthonormal matrix, 60 out-of-vocabulary words, 181 outside recurrence, 237, 240 overfitting, 23, 27
in neural networks, 59 overgeneration, 203, 213
parallel corpus, 435 parameters, 482 paraphase, 346
parent annotation, 244 parsing, 209
agenda-based, 279 chart, 226
easy-first, 279 graph-based, 263 transition-based, 225
part-of-speech, 6, 175 tagging, 145
particle, 179, 217
partition, 477
partition function, 164 passive-aggressive, 43, 487 path, in an FSA, 193
Penn Discourse Treebank (PDTB), 385 Penn Treebank, 140, 159, 180, 212, 238 perceptron, 25
incremental, 277, 364 multilayer, 50 structured, 160
phonology, 196
phrase, 212
phrase-structure grammar, 212 pointwise mutual information, 281, 329,
330
in collocation identification, 344 positive, 331
shifted positive, 337
policy, 276, 365, 469 policy gradient, 366 polysemy, 74
pooling, 63, 65, 372, 460 positional encodings
in machine translation, 447
in relation extraction, 415 power law, 2
pragmatics, 351 precision, 82, 478
at-k, 83, 398 labeled, 230 unlabeled, 230
precision-recall curve, 83, 418 predicate, 403
predicative adjectives, 217 predictive likelihood, 103 prepositional phrase, 2, 217 primal form, 487
prior expectation, 483 probabilistic models, 482 probabilistic topic model, 5, 409 probability
chain rule, 477 conditional, 35, 477, 481 density function, 481 distribution, 480
joint, 17, 35, 481 likelihood, 478 marginal, 481
mass function, 85, 481 posterior, 478
prior, 22, 36, 478
perplexity, 140
Under contract with MIT Press, shared under CC-BY-NC-ND license.

570
BIBLIOGRAPHY
simplex, 18 processes, 422
productivity, 195
projectivity, 261
pronominal anaphora resolution, 351 pronoun, 178
reflexive, 353 PropBank, 310
proper nouns, 177
proposal distribution, 453 propositions, 286, 287, 422 prosody, 187
proto-roles, 309
pumping lemma, 207 pushdown automata, 209, 251
quadratic program, 34 quantifier, 289
existential, 289
universal, 289
question answering, 346, 405
cloze, 425 extractive, 426 multiple-choice, 425
random outcomes, 475 random variable, 480
discrete, 480
indicator, 480 ranking, 406
loss, 406 recall, 82, 478
at-k, 398 labeled, 230 unlabeled, 230
receiver operating characteristic (ROC), 83
rectified linear unit (ReLU), 51 leaky, 51
reference arguments, 322 reference resolution, 351
reference translations, 433 referent, 351
generic, 356
referring expressions, 351, 381, 458 regression, 72
linear, 72 logistic, 35 ridge, 72
regular expression, 192 regular language, 192 regularization, 23, 32 reification (events), 305 relation
extraction, 277
logical, 286, 295 relation extraction, 411 relations
in information extraction, 403 relative frequency estimate, 21, 126, 483 reranking, 250
residual networks, 52
retrofitting, 343
Rhetorical Structure Theory (RST), 389 rhetorical zones, 381
risk, 452
roll-in, roll-out, 366
saddle point, 58, 486
sample space, 475
satellite, in RST, 389
satisfaction, 290
scheduled sampling, 451
schema, 403, 404, 419
search error, 252, 363
segmented discourse representation
theory (SDRT), 384 self-attention, 446
self-training, 108 semantic concordance, 76 semantic parsing, 291 semantic role, 306
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY
571
semantic role labeling, 306, 412, 420 semantics, 285
dynamic, 302, 384
in parsing, 242
lexical, 73 model-theoretic, 286 underspecification, 302
semi-supervised learning, 105 semiring, 199
algebra, 172 expectation, 207 tropical, 173
sentence compression, 465 sentence fusion, 466 sentence, in logic, 289 sentiment analysis, 69
lexicon-based, 70
targeted, 71
sentiment lexicon, 16 sequence-to-sequence model, 442 shift-reduce parsing, 251 shortest-path algorithm, 197
sigmoid, 49
simplex, 115
singular value decomposition, 60, 104
truncated, 104, 330
singular vectors, 60
skipgram word embeddings, 335 slack variables, 34
slot filling, 416
slots, in dialogue systems, 467 smooth functions, 29
smoothing, 22, 129
Jeffreys-Perks, 129 Kneser-Ney, 133 Laplace, 22, 129 Lidstone, 129
softmax, 49, 134, 416 hierarchical, 135, 336
source language, 431 spanning tree, 258
sparse matrix, 330
sparsity, 41
spectral learning, 117
speech acts, 187
speech recognition, 125
squashing function, 51, 135
stand-off annotations, 90
Stanford Natural Language Inference
corpus, 346 statistical significance, 84
stemming, 7, 78, 195
step size, 486
stochastic gradient descent, 29, 35, 39 stopwords, 80
string, in formal language theory, 191 string-to-tree translation, 441
strong compositionality criterion, 390 structure induction, 170
structure prediction, 15
subgradient, 29, 41
subjectivity detection, 70 subordinating conjunctions, 178 sum-product algorithm, 164 summarization, 125, 393
abstractive, 394, 464 extractive, 393 multi-document, 466 of sentences, 464
supersenses, 339
support vector machine, 34
kernel, 48, 414
structured, 161
support vectors, 34
surface form, 203
surface realization, 457 synonymy, 74, 281, 325
synset, 74, 369
syntactic path, 313 syntactic-semantic grammar, 293 syntax, 175, 211, 285
Under contract with MIT Press, shared under CC-BY-NC-ND license.

572
BIBLIOGRAPHY
tagset, 176
tanh activation function, 51 target language, 431
tense, 177
terminal symbols, 209
test set, 23, 95
test statistic, 84
text mining, 5
text planning, 457
thematic roles, 307
third axiom of probability, 476 TimeML, 421
tokenization, 77, 185
tokens and types, 19
topic segmentation, 379
hierarchical, 381 trace, 222
training set, 17, 95 transformer architecture, 446 transition system, 269
for context-free grammars, 251 transitive closure, 361
translation error rate (TER), 435 translation model, 126 transliteration, 449
tree-adjoining grammar, 220 tree-to-string translation, 442 tree-to-tree translation, 441 treebank, 238
trellis, 150, 205
trigrams, 24
trilexical dependencies, 248 tropical semiring, 199
true negative, 81
true positive, 81, 479
rate, 83
truth conditions, 290
tuning set, see development set, 81 Turing test, 3
two-tailed test, 86
type systems, 295
type-raising, 221, 295
unary closure, 228
undercutting, in argumentation, 393, 399 underfitting, 23
underflow, 17
undergeneration, 203, 213
Universal Dependencies, 176, 257 unseen word, 169
utterances, 187
validation function, 301 validity, in logic, 290 value iteration, 469 variable, 319
bound, 289
free, 289 variance, 22, 87
Vauquois Pyramid, 432 verb phrase, 213 VerbNet, 308
verbs, 177
Viterbi
algorithm, 148
variable, 149 volition, 307
weight decay, 57 Wikification, 405 Winograd schemas, 3 word
embeddings, 47, 53, 135, 136, 327, 334
embeddings, fine-tuned, 340 embeddings, pre-trained, 339 representations, 326
sense disambiguation, 73 senses, 73, 309
tokens, 77 world model, 286
builder, 291 checker, 291
Jacob Eisenstein. Draft of October 15, 2018.

BIBLIOGRAPHY 573 yield, 209 Zipf’s law, 143
Under contract with MIT Press, shared under CC-BY-NC-ND license.

Related Posts