CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 11:
Penn Treebank Parsing;
Dependency Grammars
Class Admin
Midterm Exam: Friday, Oct 12
The midterm will be during class.
Closed book exam:
You are not allowed to use any cheat sheets, computers,
calculators, phones, etc. (you shouldn’t need them anyway)
The exam will cover the material from the lectures
Format: Short answer questions
Review session: Wednesday, Oct 10 in class.
Review the material before that class,
so that we can clear up any confusion
Conflict Exam or DRES accommodations:
Email me (juliahmr@illinois.edu) asap
Exam Question types
Define X:
Provide a mathematical/formal definition of X
Explain X; Explain what X is/does:
Use plain English to define X and say what X is/does
Compute X:
Return X; Show the steps required to calculate it
Draw X:
Draw a figure of X
Show that X is true/is the case/…:
This may require a (typically very simple) proof.
Discuss/Argue whether …
Use your knowledge (of X,Y,Z) to argue your point
4th Credit Hour
Either a research project (alone or with one other student)
or a literature survey (alone)
Upcoming deadlines:
Fri, Oct 19: Proposal due
Fri, Nov 9: Progress report due (Is your paper on track?)
Thu, Dec 13: Final report due (Summary of papers)
Good places to find NLP papers:
- ACL Anthology (http://aclweb.org/anthology)
  covers almost everything published in NLP
- JNLE (http://journals.cambridge.org/action/displayJournal?jid=NLE)
  is another big NLP journal that is not part of the ACL
- Standard machine learning/AI conferences (NIPS, ICML, IJCAI, AAAI)
  and journals (JMLR, JAIR, etc.) are okay as well.
- Other venues: check with me that this is actually NLP
4th Credit hour: Proposal
Upload a one-page PDF to Compass by Oct 19
- written in LaTeX (not MS Word)
- with a full bibliography of the papers you want to read
  or base your project on
  (ideally with links to online versions; add a url field to your BibTeX file)
- include a motivation of why you have chosen those papers
- for a research project: tell me whether you have the data you
  need, what existing software you will be using, and what you will
  have to implement yourself
- mention any questions/concerns that you may have
Today’s lecture
Penn Treebank Parsing
Dependency Grammars
Dependency Treebanks
Dependency Parsing
Penn Treebank Parsing
The Penn Treebank
The first publicly available syntactically annotated
corpus
Wall Street Journal (50,000 sentences, 1 million words)
also Switchboard, Brown corpus, ATIS
The annotation:
– POS-tagged (Ratnaparkhi’s MXPOST)
– Manually annotated with phrase-structure trees
– Richer than standard CFG: Traces and other null
elements used to represent non-local dependencies
(designed to allow extraction of predicate-argument
structure) [more on this later in the semester]
Standard data set for English parsers
The Treebank label set
48 preterminals (tags):
– 36 POS tags, 12 other symbols (punctuation etc.)
– Simplified version of Brown tagset (87 tags)
(cf. Lancaster-Oslo/Bergen (LOB) tag set: 126 tags)
14 nonterminals:
standard inventory (S, NP, VP,…)
A simple example
Relatively flat structures:
– There is no noun level
– VP arguments and adjuncts appear at the same level
Function tags, e.g. -SBJ (subject), -MNR (manner)
A more realistic (partial) example
Until Congress acts, the government hasn’t any authority to issue new debt
obligations of any kind, the Treasury said …
The Penn Treebank CFG
The Penn Treebank uses very flat rules, e.g.:
– Many of these rules appear only once.
– Many of these rules are very similar.
– Can we pool these counts?
PCFGs in practice:
Charniak (1996) Tree-bank grammars
How well do PCFGs work on the Penn Treebank?
– Split Treebank into test set (30K words)
and training set (300K words).
– Estimate a PCFG from training set.
– Parse test set (with correct POS tags).
– Evaluate unlabeled precision and recall
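As a concrete illustration of the estimation step, here is a minimal sketch of reading off a treebank grammar by relative frequency (the nested-tuple tree format and function names are my own, not the Treebank’s bracketed format):

```python
from collections import Counter, defaultdict

def count_rules(tree, counts):
    """Count CFG productions in a tree given as (label, [children]) or (POS, word)."""
    label, children = tree
    if isinstance(children, str):                 # preterminal: (POS, word)
        counts[(label, (children,))] += 1
        return
    counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        count_rules(child, counts)

def estimate_pcfg(treebank):
    """Relative-frequency estimates: P(A -> beta) = count(A -> beta) / count(A)."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
```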
Two ways to improve performance
… change the (internal) grammar:
– Parent annotation/state splits:
Not all NPs/VPs/DTs/… are the same.
It matters where they are in the tree
… change the probability model:
– Lexicalization:
Words matter!
– Markovization:
Generalizing the rules
The parent transformation
PCFGs assume that the expansion of any nonterminal
is independent of its parent.
But this is not true: NP subjects are more likely to be modified than objects.
We can change the grammar by adding the name
of the parent node to each nonterminal.
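A minimal sketch of this transformation on the same nested-tuple tree format (an NP under S becomes NP^S; the caret notation is a common convention, used here for illustration):

```python
def parent_annotate(tree, parent="TOP"):
    """Append the parent's label to each node's category: an NP under S becomes NP^S."""
    label, children = tree
    if isinstance(children, str):                 # preterminal: keep the word itself unchanged
        return (f"{label}^{parent}", children)
    return (f"{label}^{parent}",
            [parent_annotate(child, parent=label) for child in children])
```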
Markov PCFGs (Collins parser)
The RHS of each CFG rule consists of one head HX, n left sisters Li, and m right sisters Ri:
Replace rule probabilities with a generative process:
For each nonterminal X,
– generate its head HX (a nonterminal or terminal)
– then generate its left sisters L1…Ln and a STOP symbol, conditioned on HX
– then generate its right sisters R1…Rm and a STOP symbol, conditioned on HX
X → Ln … L1 HX R1 … Rm
(where Ln … L1 are the left sisters, HX is the head, and R1 … Rm are the right sisters)
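As a concrete illustration of this head-outward generative process, here is a minimal sketch (the probability tables P_head, P_left, and P_right are hypothetical, pre-estimated conditionals) that scores one expansion as the product of the head choice, the left sisters plus STOP, and the right sisters plus STOP:

```python
STOP = "<STOP>"

def markov_rule_prob(parent, head, left_sisters, right_sisters,
                     P_head, P_left, P_right):
    """Score X -> Ln...L1 HX R1...Rm under a head-outward Markov PCFG.

    P_head[(H, X)], P_left[(L, X, H)], P_right[(R, X, H)] are assumed to be
    pre-estimated conditional probabilities (hypothetical tables).
    """
    p = P_head.get((head, parent), 0.0)
    # Left sisters are generated outward from the head, then a STOP symbol.
    for sister in left_sisters + [STOP]:
        p *= P_left.get((sister, parent, head), 0.0)
    # Right sisters likewise, with their own STOP.
    for sister in right_sisters + [STOP]:
        p *= P_right.get((sister, parent, head), 0.0)
    return p
```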
Lexicalization
PCFGs can’t distinguish between
“eat sushi with chopsticks” and “eat sushi with tuna”.
We need to take words into account!
P(VP[eat] → VP PP[with chopsticks] | VP[eat])
vs. P(VP[eat] → VP PP[with tuna] | VP[eat])
Problem: sparse data (PP[with fatty|white|… tuna] …)
Solution: only take head words into account!
Assumption: each constituent has one head word.
Lexicalized PCFGs
At the root (start symbol S), generate the head word of the
sentence, wS, with P(wS).
Lexicalized rule probabilities:
Every nonterminal is lexicalized: X[wX]
Condition rules X[wX] → α Y β on the lexicalized LHS X[wX]:
P( X[wX] → α Y β | X[wX] )
Word-word dependencies:
For each nonterminal Y in the RHS of a rule X[wX] → α Y β,
condition wY (the head word of Y) on X and wX:
P( wY | Y, X, wX )
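To make the two factors concrete, here is a minimal sketch (the probability tables P_rule and P_dep are hypothetical) that scores one lexicalized expansion as the rule probability times a word-word dependency probability for each non-head child:

```python
def lexicalized_expansion_prob(lhs, lhs_head, rhs, rhs_heads, head_index,
                               P_rule, P_dep):
    """Score X[wX] -> ... Y[wY] ... as
       P(rule | X, wX) * prod over non-head children Y of P(wY | Y, X, wX).

    P_rule[(lhs, rhs, lhs_head)] and P_dep[(wY, Y, X, wX)] are assumed to be
    pre-estimated conditional probabilities (hypothetical tables).
    """
    p = P_rule.get((lhs, tuple(rhs), lhs_head), 0.0)
    for i, (child, child_head) in enumerate(zip(rhs, rhs_heads)):
        if i == head_index:
            continue  # the head child shares wX; no extra dependency factor
        p *= P_dep.get((child_head, child, lhs, lhs_head), 0.0)
    return p
```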
Dealing with unknown words
A lexicalized PCFG assigns zero probability
to any word that does not appear in the training data.
Solution:
Training: Replace rare words in training data
with a token ‘UNKNOWN’.
Testing: Replace unseen words with ‘UNKNOWN’
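A minimal sketch of this preprocessing (the count threshold is an assumption; any suitably low cutoff works):

```python
from collections import Counter

UNKNOWN = "UNKNOWN"

def replace_rare_words(training_sentences, min_count=2):
    """Replace words seen fewer than min_count times with the UNKNOWN token."""
    counts = Counter(w for sent in training_sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    train = [[w if w in vocab else UNKNOWN for w in sent]
             for sent in training_sentences]
    return train, vocab

def replace_unseen_words(test_sentence, vocab):
    """At test time, map words outside the training vocabulary to UNKNOWN."""
    return [w if w in vocab else UNKNOWN for w in test_sentence]
```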
Refining the set of categories
Unlexicalized Parsing (Klein & Manning ’03)
Unlexicalized PCFGs with various transformations
of the training data and the model, e.g.:
– Parent annotation (of preterminals and nonterminals):
distinguish preposition IN from subordinating conjunction IN, etc.
– Add head tag to nonterminals
(e.g. distinguish finite from non-finite VPs)
– Add distance features
Accuracy: 86.3 Precision and 85.1 Recall
The Berkeley parser (Petrov et al. ’06, ’07)
Automatically learns refinements of the nonterminals
Accuracy: 90.2 Precision, 89.9 Recall
Summary
The Penn Treebank has a large number of very flat
rules.
Accurate parsing requires modifications to the basic
PCFG model: refining the nonterminals, relaxing the
independence assumptions by including grandparent
information, modeling word-word dependencies, etc.
How much of this transfers to other treebanks or
languages?
Dependency Grammar
A dependency parse
[Figure 1.1: Dependency structure for an English sentence.]
The basic assumption underlying all varieties of dependency grammar is the idea that syntactic
structure essentially consists of words linked by binary, asymmetrical relations called dependency
relations (or dependencies for short). A dependency relation holds between a syntactically subordinate
word, called the dependent, and another word on which it depends, called the head.[1] This is illustrated
in figure 1.1, which shows a dependency structure for a simple English sentence, where dependency
relations are represented by arrows pointing from the head to the dependent.[2] Moreover, each arrow
has a label, indicating the dependency type. For example, the noun news is a dependent of the verb
had with the dependency type subject (SBJ). By contrast, the noun effect is a dependent of type object
(OBJ) with the same head verb had. Note also that the noun news is itself a syntactic head in relation
to the word Economic, which stands in the attribute (ATT) relation to its head noun.
One peculiarity of the dependency structure in figure 1.1 is that we have inserted an artificial
word root before the first word of the sentence. This is a mere technicality, which simplifies both
formal definitions and computational implementations. In particular, we can normally assume that
every real word of the sentence should have a syntactic head. Thus, instead of saying that the verb
had lacks a syntactic head, we can say that it is a dependent of the artificial word root. In chapter 2,
we will define dependency structures formally as labeled directed graphs, where nodes correspond to
words (including root) and labeled arcs correspond to typed dependency relations.
The information encoded in a dependency structure representation is different from the infor-
mation captured in a phrase structure representation, which is the most widely used type of syntactic
representation in both theoretical and computational linguistics. This can be seen by comparing the
dependency structure in figure 1.1 to a typical phrase structure representation for the same sentence,
shown in figure 1.2. While the dependency structure represents head-dependent relations between
words, classified by functional categories such as subject (SBJ) and object (OBJ), the phrase structure
represents the grouping of words into phrases, classified by structural categories such as noun phrase
(NP) and verb phrase (VP).
[1] Other terms that are found in the literature are modifier or child, instead of dependent, and governor, regent or parent, instead of
head. Note that, although we will not use the noun modifier, we will use the verb modify when convenient and say that a dependent
modifies its head.
[2] This is the notational convention that we will adopt throughout the book, but the reader should be warned that there is a competing
tradition in the literature on dependency grammar according to which arrows point from the dependent to the head.
Dependencies are (labeled) asymmetrical binary relations
between two lexical items (words).
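To make this representation concrete, a dependency structure can be stored as a set of labeled head→dependent arcs over word positions, with the artificial root at position 0. A minimal sketch using only the arcs explicitly mentioned in the text above (the ROOT label on the root arc is an illustrative choice):

```python
# A dependency structure as labeled, directed arcs (head, dependent, label),
# indexing words by position; position 0 is the artificial root.
words = ["root", "Economic", "news", "had", "little", "effect",
         "on", "financial", "markets"]

arcs = {
    (0, 3, "ROOT"),  # root -> had      (the verb depends on the artificial root)
    (3, 2, "SBJ"),   # had  -> news     (subject)
    (3, 5, "OBJ"),   # had  -> effect   (object)
    (2, 1, "ATT"),   # news -> Economic (attribute)
}

def head_of(position, arcs):
    """Return (head_position, label) for a word, or None if no head is recorded."""
    for head, dep, label in arcs:
        if dep == position:
            return head, label
    return None
```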
Dependency grammar
Word-word dependencies are a component of
many (most/all?) grammar formalisms.
Dependency grammar assumes that syntactic
structure consists only of dependencies.
Many variants. Modern DG began with Tesnière (1959).
DG is often used for free word order languages.
DG is purely descriptive (not a generative system
like CFGs etc.), but some formal equivalences are
known.
Different kinds of dependencies
Head-argument: eat sushi
Arguments may be obligatory, but can only occur once.
The head alone cannot necessarily replace the construction.
Head-modifier: fresh sushi
Modifiers are optional and can occur more than once.
The head alone can replace the entire construction.
Head-specifier: the sushi
Holds between function words (e.g. prepositions, determiners)
and their arguments. Syntactic head ≠ semantic head.
Coordination: sushi and sashimi
Unclear where the head is.
Dependency structures
Dependencies form a graph over the words in a
sentence.
This graph is connected (every word is a node)
and (typically) acyclic (no loops).
Single-head constraint:
Every node has at most one incoming edge.
Together with connectedness and acyclicity, this implies that the graph is a rooted tree.
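A minimal sketch of checking these constraints on a candidate arc set (arc format as in the earlier sketch; the function name is my own):

```python
def is_dependency_tree(num_words, arcs):
    """Check the single-head constraint plus connectedness/acyclicity.

    Words are numbered 1..num_words; 0 is the artificial root.
    arcs is a set of (head, dependent, label) triples.
    """
    heads = {}
    for head, dep, _ in arcs:
        if dep in heads:                      # single-head constraint violated
            return False
        heads[dep] = head
    if set(heads) != set(range(1, num_words + 1)):
        return False                          # every real word needs exactly one head
    # Following heads upward from any word must reach the root without a cycle.
    for node in range(1, num_words + 1):
        seen = set()
        while node != 0:
            if node in seen:
                return False                  # cycle
            seen.add(node)
            node = heads[node]
    return True
```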
From CFGs to dependencies
Assume each CFG rule has one head child (marked * here);
the other children are dependents of the head.
S → NP VP*        (VP is the head child, NP is a dependent)
VP → V* NP NP
NP → DT NOUN*
NOUN → ADJ N*
The head word of a constituent is the terminal that is
reached by recursively following the head child
(here, V is the head word of S, and N is the head word of NP).
If in a rule XP → X Y, X is the head child and Y a dependent,
then the head word of Y depends on the head word of X.
The maximal projection of a terminal w is the highest nonterminal in
the tree that w is the head word of.
Here, Y is a maximal projection.
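A minimal sketch of extracting word-word dependencies from a CFG parse by percolating head words (trees are nested (label, children) tuples; the toy head-child table mirrors the rules above and stands in for real head-finding rules such as Collins’):

```python
# Head-child index for each rule's right-hand side (toy table; real parsers
# use hand-written head-finding rules).
HEAD_CHILD = {
    ("S", ("NP", "VP")): 1,
    ("VP", ("V", "NP", "NP")): 0,
    ("NP", ("DT", "NOUN")): 1,
    ("NOUN", ("ADJ", "N")): 1,
}

def extract_dependencies(tree, deps=None):
    """Return the head word of `tree` and collect (head_word, dependent_word) pairs.

    A tree is either (POS, word) for a leaf or (label, [child, ...]) otherwise.
    """
    if deps is None:
        deps = []
    label, children = tree
    if isinstance(children, str):              # leaf: (POS, word)
        return children, deps
    child_labels = tuple(c[0] for c in children)
    head_idx = HEAD_CHILD[(label, child_labels)]
    child_heads = [extract_dependencies(c, deps)[0] for c in children]
    head_word = child_heads[head_idx]
    for i, dep_word in enumerate(child_heads):
        if i != head_idx:
            # the dependent's head word depends on the rule's head word
            deps.append((head_word, dep_word))
    return head_word, deps
```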
Context-free grammars
CFGs capture only nested dependencies
The dependency graph is a tree
The dependencies do not cross
Beyond CFGs:
Nonprojective dependencies
Dependencies: tree with crossing branches
Arise in the following constructions
– (Non-local) scrambling (free word order languages)
Die Pizza hat Klaus versprochen zu bringen (‘Klaus promised to bring the pizza’)
– Extraposition (The guy is coming who is wearing a hat)
– Topicalization (Cheeseburgers, I thought he likes)
Dependency Treebanks
Dependency treebanks exist for many languages:
Czech
Arabic
Turkish
Danish
Portuguese
Estonian
….
Phrase-structure treebanks (e.g. the Penn Treebank) can
also be translated into dependency trees
(although there might be noise in the translation)
The Prague Dependency Treebank
Three levels of annotation:
morphological: [<2M tokens]
Lemma (dictionary form) + detailed analysis
(15 categories with many possible values = 4,257 tags)
surface-syntactic (“analytical”): [1.5M tokens]
Labeled dependency tree encoding grammatical functions
(subject, object, conjunct, etc.)
semantic (“tectogrammatical”): [0.8M tokens]
Labeled dependency tree for predicate-argument structure,
information structure, coreference (not all words included)
(39 labels: agent, patient, origin, effect, manner, etc....)
Examples: analytical level
METU-Sabanci Turkish Treebank
Turkish is an agglutinative language with free word order.
Rich morphological annotations.
Dependencies (next slide) are at the morpheme level.
Very small: about 5,000 sentences.
METU-Sabanci Turkish Treebank (continued)
[this and the previous example are from Kemal Oflazer’s talk at Rochester, April 2007]
CS447 Natural Language Processing
Universal Dependencies
37 syntactic relations, intended to be applicable to all
languages (“universal”), with slight modifications for
each specific language, if necessary.
http://universaldependencies.org
Universal Dependency Relations
Nominal core arguments: nsubj (nominal subject), obj (direct
object), iobj (indirect object)
Clausal core arguments: csubj (clausal subject), ccomp
(clausal object [“complement”])
Non-core dependents: advcl (adverbial clause modifier), aux
(auxiliary verb),
Nominal dependents: nmod (nominal modifier), amod
(adjectival modifier),
Coordination: cc (coordinating conjunction), conj (conjunct)
and many more…
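As a small illustration (my own annotation, not taken from a UD treebank), the example sentence from the dependency-parse slide would use these relations roughly as follows:

```python
# (head word, dependent word, UD relation) -- illustrative annotation only
ud_arcs = [
    ("had", "news", "nsubj"),       # nominal subject
    ("had", "effect", "obj"),       # direct object
    ("news", "Economic", "amod"),   # adjectival modifier
    ("effect", "little", "amod"),   # adjectival modifier
]
```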
Parsing algorithms for DG
‘Transition-based’ parsers:
learn a sequence of actions to parse sentences
Models:
State = stack of partially processed items
+ queue/buffer of remaining tokens
+ set of dependency arcs that have been found already
Transitions (actions) = add dependency arcs; stack/queue operations
‘Graph-based’ parsers:
learn a model over dependency graphs
Models:
a function (typically sum) of local attachment scores
For dependency trees, you can use a minimum spanning tree algorithm
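To make the transition-based setup concrete, here is a minimal arc-standard-style sketch of the parser state and actions (names and structure are illustrative; Nivre-style parsers add a learned classifier that chooses the next action):

```python
class ParserState:
    """Stack + buffer + arc set, as in transition-based dependency parsing."""
    def __init__(self, words):
        self.stack = [0]                          # 0 is the artificial root
        self.buffer = list(range(1, len(words) + 1))
        self.arcs = set()                         # (head, dependent) pairs

def shift(state):
    """Move the next buffer item onto the stack."""
    state.stack.append(state.buffer.pop(0))

def left_arc(state):
    """Second-from-top of the stack becomes a dependent of the top item."""
    dep = state.stack.pop(-2)
    state.arcs.add((state.stack[-1], dep))

def right_arc(state):
    """Top of the stack becomes a dependent of the item below it."""
    dep = state.stack.pop()
    state.arcs.add((state.stack[-1], dep))
```

Parsing repeatedly applies the action chosen by the classifier until the buffer is empty and only the root remains on the stack; the accumulated arc set is then the predicted dependency tree.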
Transition-based parsing
(Nivre et al.)
Transition-based parsing: assumptions
This algorithm works for projective dependency trees.
Dependency tree:
Each word has a single parent
(Each word is a dependent of [is attached to] one other word)
Projective dependencies:
There are no crossing dependencies.
For any i, j, k with i < k < j: if there is a dependency between wi and wj,
the parent of wk is a word wl between (possibly including) i and j: i ≤ l ≤ j,
while any child wm of wk has to occur between (excluding) i and j: i < m < j.
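A minimal sketch of checking this condition directly from the head assignments (word positions 1..n, with 0 as the artificial root; the function name is my own):

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads maps each word position 1..n to its head position (0 for the root).
    """
    arcs = [(min(h, d), max(h, d)) for d, h in heads.items()]
    for i, j in arcs:
        for k, l in arcs:
            # Two arcs cross if one starts strictly inside the other
            # and ends strictly outside it.
            if i < k < j < l:
                return False
    return True

# Example: "eat sushi quickly" with heads eat<-root, sushi<-eat, quickly<-eat
print(is_projective({1: 0, 2: 1, 3: 1}))   # True: no crossing arcs
```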