Syntax with Dependency Grammar
ANLP: Week 7, Unit 1
Shay Cohen
Based on slides from ANLP 2019
Lexicalization, again
We saw that adding the lexical head of each phrase can help choose the right parse:
[Lexicalized constituency tree for "kids saw birds with fish": each phrase is annotated with its head word, e.g. S-saw, NP-kids, VP-saw, NP-birds, PP-fish.]
Dependency syntax focuses on the head-dependent relationships.
Dependency trees
A valid dependency tree for a sentence requires:
A single distinguished root word.
All other words have exactly one incoming edge.
A unique path from the root to each other word.
[Example dependency trees for "kids saw birds with fish" and "kids saw birds with binoculars", showing the different PP attachments.]
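These conditions are easy to check mechanically. Here is a minimal sketch (not from the lecture), assuming a parse is stored as a dict mapping each word's position (1..n) to its head's position, with 0 standing for the artificial root:

```python
# Minimal validity check for a dependency parse, assuming the parse is given as
# a dict {dependent_position: head_position}, positions 1..n, with 0 = root.

def is_valid_tree(heads):
    # A single distinguished root word: exactly one word is headed by 0.
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    # Each word has exactly one incoming edge (guaranteed by the dict), and there
    # must be a unique path from the root to every word: follow heads upward and
    # make sure we reach the root without running into a cycle.
    for word in heads:
        seen = set()
        node = word
        while node != 0:
            if node in seen:          # cycle: this word is unreachable from the root
                return False
            seen.add(node)
            node = heads[node]
    return True

# "kids saw birds with fish": kids<-saw, saw<-root, birds<-saw, with<-fish, fish<-birds
print(is_valid_tree({1: 2, 2: 0, 3: 2, 4: 5, 5: 3}))   # True
```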
Dependency syntax
An alternative approach to sentence structure.
A fully lexicalized formalism: no phrasal categories.
Assumes binary, asymmetric grammatical relations between words: head-dependent relations, shown as directed edges:
[Dependency tree drawn over "kids saw birds with fish", with edges pointing from heads to their dependents.]
Here, edges point from heads to their dependents.
It really is a tree!
The usual way to show dependency trees is with edges over ordered sentences.
But the edge structure (without word order) can also be shown as a more obvious tree:
[The same dependencies drawn as an unordered tree: saw at the root, with kids and birds below it; fish below birds; with below fish.]
Labelled dependencies
It is often useful to distinguish different kinds of head → modifier relations, by labelling edges:
[Labelled dependency tree for "kids saw birds with fish": ROOT → saw, NSUBJ (saw → kids), DOBJ (saw → birds), NMOD (birds → fish), CASE (fish → with).]
Historically, different treebanks/languages used different labels.
Now, the Universal Dependencies project aims to standardize labels and annotation conventions, bringing together annotated corpora from over 50 languages. Labels in this example (and in the textbook) are from UD.
Why dependencies??
Consider these sentences. Two ways to say the same thing:
[Two constituency trees: "Sasha gave the girl a book" (S → NP VP, VP → V NP NP) and "Sasha gave a book to the girl" (S → NP VP, VP → V NP PP).]
We only need a few phrase structure rules:
S → NP VP
VP → V NP NP
VP → V NP PP
plus rules for NP and PP.
Equivalent sentences in Russian
Russian uses morphology to mark relations between words:
knigu means book (kniga) as a direct object.
devochke means girl (devochka) as an indirect object (to the girl).
So we can have the same word orders as English:
Sasha dal devochke knigu
Sasha dal knigu devochke
But also many others!
Sasha devochke dal knigu
Devochke dal Sasha knigu
Knigu dal Sasha devochke
Phrase structure vs dependencies
In languages with free word order, phrase structure (constituency) grammars don’t make as much sense.
E.g., we would need both S → NP VP and S → VP NP, etc. Not very informative about what's really going on.
In contrast, the dependency relations stay constant:
[Dependency trees for "Sasha dal devochke knigu" and "Sasha dal knigu devochke": in both, dal is the ROOT, Sasha is NSUBJ, devochke is IOBJ, and knigu is DOBJ.]
Phrase structure vs dependencies
Even more obvious if we just look at the trees without word order:
[The unordered trees for all these word orders are identical: dal at the root, with NSUBJ Sasha, IOBJ devochke, and DOBJ knigu.]
Pros and cons
Sensible framework for free word order languages.
Identifies syntactic relations directly. (Using a CFG, how would you identify the subject of a sentence?)
Dependency pairs/chains can make good features in classifiers, for information extraction, etc.
Parsers can be very fast (coming up…)
But:
The assumption of asymmetric binary relations isn’t always right… e.g., how to parse dogs and cats?
Lexicalized Constituency Parse
[Lexicalized tree: (S-saw (NP-kids kids) (VP-saw (V-saw saw) (NP-birds (NP-birds birds) (PP-fish (P-with with) (NP-fish fish))))).]
How do we annotate dependencies?
Two options:
1. Annotate dependencies directly.
2. Convert phrase structure annotations to dependencies. (Convenient if we already have a phrase structure treebank.)
Next slides show how to convert, assuming we have head-finding rules for our phrase structure trees.
. . . remove the phrasal categories. . .
[The same tree with the phrasal categories removed: every node is now labelled only with its head word, so head words appear duplicated along each phrase's spine.]
. . . remove the (duplicated) terminals. . .
[The tree after removing the duplicated terminal copies of each head word.]
. . . and collapse chains of duplicates. . .
[Chains of identical head words are merged into single nodes, one chain at a time.]
…done!
[Resulting dependency tree: saw is the root; its dependents match the labelled example earlier (kids and birds under saw, fish under birds, with under fish).]
Constituency Tree → Dependency Tree
We saw how the lexical head of each phrase can be used to collapse a constituency tree down to a dependency tree:
[Lexicalized tree: (S-saw (NP-kids kids) (VP-saw (VP-saw (V-saw saw) (NP-birds birds)) (PP-binoculars (P-with with) (NP-binoculars binoculars)))).]
But how can we find each phrase’s head in the first place?
Head Rules
The standard solution is to use head rules: for every non-unary (P)CFG production, designate one RHS nonterminal as containing the head, e.g. S → NP VP (head: VP), VP → VP PP (head: the VP child), PP → P NP (content head: NP), etc.
[(S (NP kids) (VP (VP (V saw) (NP birds)) (PP (P with) (NP binoculars))))]
Heuristics to scale this to large grammars: e.g., within an NP, the last immediate N child is the head.
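As a toy illustration (not the head rules used for real treebanks), head rules plus head propagation give a direct recipe for the constituency-to-dependency conversion above. The nested-tuple trees, the rule table, and the function names below are invented for this one example:

```python
# A toy illustration of head rules, assuming constituency trees are nested tuples
# like ("NP", "kids") or ("S", child1, child2). The rule table is invented for
# this one example; real head-finding rule sets (e.g. Collins-style) are much larger.

HEAD_CHILD = {"S": "VP", "VP": "VP", "PP": "NP", "NP": "NP"}   # preferred head child

def head_word(tree):
    """Return the lexical head of a (sub)tree by applying the head rules."""
    label, *children = tree
    if isinstance(children[0], str):
        return children[0]                        # preterminal: head is the word itself
    preferred = HEAD_CHILD.get(label)
    for child in children:
        if child[0] == preferred:
            return head_word(child)
    return head_word(children[0])                 # fallback: leftmost child (e.g. VP -> V NP)

def to_dependencies(tree):
    """Convert a constituency tree to (head, dependent) word pairs using the heads."""
    label, *children = tree
    if isinstance(children[0], str):
        return []
    h = head_word(tree)
    deps = []
    for child in children:
        c = head_word(child)
        if c != h:
            deps.append((h, c))                   # phrase head governs the other children's heads
        deps.extend(to_dependencies(child))
    return deps

tree = ("S", ("NP", "kids"),
        ("VP", ("VP", ("V", "saw"), ("NP", "birds")),
               ("PP", ("P", "with"), ("NP", "binoculars"))))
print(to_dependencies(tree))
# [('saw', 'kids'), ('saw', 'birds'), ('saw', 'binoculars'), ('binoculars', 'with')]
```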
Head Rules
Then, propagate heads up the tree:
[Head annotations percolate upward until every node carries its head word: (S-saw (NP-kids kids) (VP-saw (VP-saw (V-saw saw) (NP-birds birds)) (PP-binoculars (P-with with) (NP-binoculars binoculars)))).]
Projectivity
If we convert constituency parses to dependencies, all the resulting trees will be projective.
Every subtree (node and all its descendants) occupies a contiguous span of the sentence.
= the parse can be drawn over the sentence w/ no crossing edges.
[Projective dependency tree for "A hearing on the issue is scheduled today" (edge labels ROOT, SBJ, ATT, PC, VC, TMP): drawn above the sentence, no edges cross.]
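Projectivity is also easy to test. A minimal sketch (not from the lecture), using the same dict representation as the earlier validity check (positions 1..n, 0 for the artificial root); the crossing-edges test is one standard way to state the condition:

```python
# Minimal projectivity check: a tree is projective iff no two edges cross when
# drawn above the sentence. heads = {dependent_position: head_position}, root = 0.

def is_projective(heads):
    edges = [(min(d, h), max(d, h)) for d, h in heads.items()]
    for l1, r1 in edges:
        for l2, r2 in edges:
            # Two edges cross if exactly one endpoint of one edge falls strictly
            # inside the span of the other.
            if l1 < l2 < r1 < r2:
                return False
    return True

# "kids saw birds with fish" (projective):
print(is_projective({1: 2, 2: 0, 3: 2, 4: 5, 5: 3}))   # True
# A toy tree where the edges 1<-3 and 2<-4 cross (nonprojective):
print(is_projective({1: 3, 2: 4, 3: 0, 4: 3}))          # False
```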
Nonprojectivity
But some sentences are nonprojective.
[Nonprojective dependency tree for "A hearing is scheduled on the issue today": the edge attaching on the issue to hearing crosses other edges.]
We'll only get these annotations right if we directly annotate the sentences (or correct the converted parses).
Nonprojectivity is rare in English, but common in many languages.
Nonprojectivity presents problems for parsing algorithms.
Dependency Parsing
Some of the algorithms you have seen for PCFGs can be adapted to dependency parsing.
CKY can be adapted, though efficiency is a concern: the obvious approach is O(Gn^5); the Eisner algorithm brings it down to O(Gn^3).
N. Smith's slides explaining the Eisner algorithm: http://courses.cs.washington.edu/courses/cse517/16wi/slides/an-dep-slides.pdf
Shift-reduce: more efficient, doesn’t even require a grammar!
Transition-based Dependency Parsing
The arc-standard approach parses input sentence w1 . . . wN using two types of reduce actions (three actions altogether):
Shift: Read next word wi from the input and push it onto the stack.
LeftArc: Assign head-dependent relation s2 ← s1; pop s2.
RightArc: Assign head-dependent relation s2 → s1; pop s1.
where s1 and s2 are the top and second item on the stack, respectively. (So, s2 preceded s1 in the input sentence.)
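As a rough sketch (not the lecture's reference code), the three actions can be written as operations on a stack and an input buffer. The action sequence here is supplied by hand; a real parser predicts it with a classifier, as discussed later:

```python
# Minimal arc-standard parser driven by a given action sequence (illustrative only).

def arc_standard(words, actions):
    """Apply a sequence of arc-standard actions to a sentence.

    words:   list of tokens, e.g. ["Kim", "saw", "Sandy"]
    actions: list of "Shift", "LeftArc", "RightArc"
    Returns a list of (head, dependent) arcs; "root" is the artificial root.
    """
    stack = ["root"]          # parser stack; "root" is the artificial root node
    buffer = list(words)      # remaining input words
    arcs = []                 # head-dependent relations found so far

    for action in actions:
        if action == "Shift":
            stack.append(buffer.pop(0))          # read next word onto the stack
        elif action == "LeftArc":
            s1, s2 = stack[-1], stack[-2]
            arcs.append((s1, s2))                # s2 <- s1: s1 is head of s2
            del stack[-2]                        # pop s2
        elif action == "RightArc":
            s1, s2 = stack[-1], stack[-2]
            arcs.append((s2, s1))                # s2 -> s1: s2 is head of s1
            stack.pop()                          # pop s1
    return arcs

# Oracle action sequence for "Kim saw Sandy" (cf. the worked example below):
print(arc_standard(["Kim", "saw", "Sandy"],
                   ["Shift", "Shift", "LeftArc", "Shift", "RightArc", "RightArc"]))
# [('saw', 'Kim'), ('saw', 'Sandy'), ('root', 'saw')]
```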
Recall: shift-reduce parser with CFG
Same example grammar and sentence.
Operations: Shift (S), Reduce (R), Backtrack to step n (Bn).
[Worked trace for parsing "the dog bit": the parser shifts and reduces, reaches dead ends, and backtracks (B6 at step 9, B4 at step 11, B3 at step 15) before finding a complete parse.]
Note that at steps 9 and 11 we skipped over backtracking to steps 7 and 5 respectively, as there were actually no choices to be made at those points.
Example
Parsing Kim saw Sandy:
Step  Stack (bottom … top)   Word List           Action     New relation
0     [root]                 [Kim, saw, Sandy]   Shift
1     [root, Kim]            [saw, Sandy]        Shift
2     [root, Kim, saw]       [Sandy]             LeftArc    Kim ← saw
3     [root, saw]            [Sandy]             Shift
4     [root, saw, Sandy]     []                  RightArc   saw → Sandy
5     [root, saw]            []                  RightArc   root → saw
6     [root]                 []                  (done)
Here, the top two words on the stack are also always adjacent in the sentence. Not true in general! (See the longer example in JM3.)
Labelled dependency parsing
These parsing actions produce unlabelled dependencies (left). For labelled dependencies (right), just use more actions:
LeftArc(NSUBJ), RightArc(NSUBJ), LeftArc(DOBJ), . . .
[Left: unlabelled dependency tree for "Kim saw Sandy". Right: the same tree with labels: ROOT → saw, NSUBJ (saw → Kim), DOBJ (saw → Sandy).]
Notions of validity
In constituency parsing, valid parse = grammatical parse. That is, we first define a grammar, then use it for parsing.
In dependency parsing, we don't normally define a grammar. Valid parses are those with the properties listed on the earlier Dependency trees slide.
Differences to constituency parsing
Shift-reduce parser for CFG: not all sequences of actions lead to valid parses. Choose incorrect action → may need to backtrack.
Here, all valid action sequences lead to valid parses.
Invalid actions: can’t apply LeftArc with root as dependent; can’t apply RightArc with root as head unless input is empty.
Other actions may lead to incorrect parses, but still valid.
So, the parser doesn't backtrack. Instead, it tries to greedily predict the correct action at each step.
Therefore, dependency parsers can be very fast (linear time). But we need a good way to predict correct actions (next lecture).
Summary: Transition-based Parsing
The arc-standard approach is based on a simple shift-reduce idea.
Can do labelled or unlabelled parsing, but we need to train a classifier to predict the next action, as we'll see next time.
The greedy algorithm means time complexity is linear in sentence length.
Only finds projective trees (without special extensions).
Pioneering system: Nivre’s MaltParser.
Alternative: Graph-based Parsing
Global algorithm: From the fully connected directed graph of all possible edges, choose the best ones that form a tree.
Edge-factored models: a classifier assigns a nonnegative score to each possible edge; a maximum spanning tree algorithm finds the spanning tree with the highest total score in O(n^2) time (see the sketch after this slide).
Pioneering work: McDonald’s MSTParser
Can be formulated as constraint satisfaction with integer linear programming (Martins's TurboParser).
Details in JM3, Ch 16.5 (optional).
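A minimal sketch of the edge-factored idea, assuming the networkx library is available (its maximum_spanning_arborescence routine implements Chu-Liu/Edmonds); edge_score here is a hand-written stand-in for a trained classifier:

```python
# Edge-factored graph-based parsing sketch. Assumes networkx is installed;
# edge_score() is a hypothetical scorer, not a real model.
import networkx as nx

def edge_score(head, dep):
    # Stand-in scores; a real parser uses learned feature weights.
    scores = {("root", "saw"): 10, ("saw", "kids"): 9, ("saw", "birds"): 8,
              ("birds", "fish"): 7, ("fish", "with"): 6}
    return scores.get((head, dep), 1)

def mst_parse(words):
    """Build the fully connected graph of candidate edges and return the
    maximum spanning arborescence rooted at the artificial 'root' node."""
    G = nx.DiGraph()
    nodes = ["root"] + words
    for head in nodes:
        for dep in words:            # no edges point into 'root'
            if head != dep:
                G.add_edge(head, dep, weight=edge_score(head, dep))
    tree = nx.maximum_spanning_arborescence(G)   # highest-scoring tree over all words
    return sorted(tree.edges())

print(mst_parse(["kids", "saw", "birds", "with", "fish"]))
# e.g. [('birds', 'fish'), ('fish', 'with'), ('root', 'saw'), ('saw', 'birds'), ('saw', 'kids')]
```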
Choosing a Parser: Criteria
Target representation: constituency or dependency?
Efficiency? In practice, both runtime and memory use.
Incrementality: parse the whole sentence at once, or obtain partial left-to-right analyses/expectations?
Retrainable system?
Accuracy?
Graph-based vs. Transition-based vs. Conversion-based
TB: Features in scoring function can look at any part of the stack; no optimality guarantees for search; linear-time; (classically) projective only
GB: Features in scoring function limited by factorization; optimal search within that model; quadratic-time; no projectivity constraint
CB: In terms of accuracy, sometimes best to first constituency-parse, then convert to dependencies (e.g., Stanford Parser). Slower than direct methods.
Summary
Constituency syntax: hierarchically nested phrases with categories like NP.
Dependency syntax: trees whose edges connect words in the sentence. Edges often labeled with relations like nsubj.
Can convert constituency to dependency parse using head rules.
For projective trees, transition-based parsing is very fast and can be very accurate.
Google “online dependency parser”.
Try out the Stanford parser and SEMAFOR!
Probabilistic transition-based dependency parsing
At each step in parsing we have:
Current configuration: consisting of the stack state, input buffer, and dependency relations found so far.
Possible actions: e.g., Shift, LeftArc, RightArc.
A probabilistic parser assumes we also have a model that tells us P(action | configuration). Then,
Choosing the most probable action at each step (greedy parsing) produces a parse in linear time.
But it might not be the best one: choices made early could lead to a worse overall parse.
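As a sketch of this loop (all helpers below are placeholders, not a real API): the parser repeatedly asks the model for action probabilities and commits to the argmax:

```python
# Greedy transition-based parsing, assuming model(config) returns a dict of
# action probabilities and apply_action()/is_final() come from a transition
# system such as the arc-standard sketch earlier. All three are placeholders.

def greedy_parse(config, model, apply_action, is_final):
    while not is_final(config):
        probs = model(config)                      # P(action | configuration)
        best = max(probs, key=probs.get)           # most probable action
        config = apply_action(config, best)        # commit to it; no backtracking
    return config
```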
Beam search: basic idea
Instead of choosing only the best action at each step, choose a few of the best.
Extend previous partial parses using these options.
At each time step, keep a fixed number of best options and discard anything else.
Advantages:
May find a better overall parse than greedy search,
While using less time/memory than exhaustive search.
Recap: parsing as search
Parser is searching through a very large space of possible parses.
Greedy parsing is a depth-first strategy.
Beam search is a limited breadth-first strategy.
[Diagram: a search tree of partial parses, branching at each choice point.]
The agenda
An ordered list of configurations (parser state + parse so far).
Items are ordered by score: how good a configuration is it?
Implemented using a priority queue data structure, which efficiently inserts items into the ordered list.
In beam search, we use an agenda with a fixed size (the beam width). When new high-scoring items are inserted, items at the bottom that no longer fit within the beam width are discarded.
We won't discuss the scoring function here; but the beam search idea is used across NLP (e.g., in best-first constituency parsing and neural network models).
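A minimal sketch of a fixed-width agenda (a hypothetical interface, using heapq.nlargest in place of an explicit priority queue): at each step we expand every configuration on the beam and keep only the best few successors:

```python
# Fixed-width beam search sketch. expand() and score() are placeholders:
# expand(config) -> successor configurations (e.g. one per parser action),
# score(config)  -> higher is better.
import heapq

def beam_search(initial_config, expand, score, beam_width=4, max_steps=50):
    beam = [initial_config]
    for _ in range(max_steps):
        candidates = []
        for config in beam:
            candidates.extend(expand(config))
        if not candidates:
            break
        # Agenda with a fixed size: keep the beam_width highest-scoring items,
        # discard everything else.
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score)

# Toy usage: "configurations" are just numbers; expanding adds 1 or 2.
best = beam_search(0, expand=lambda c: [c + 1, c + 2] if c < 10 else [],
                   score=lambda c: c, beam_width=3)
print(best)   # 11
```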
Evaluating dependency parsers
How do we know if beam search is helping?
As usual, we can evaluate against a gold standard data set. But what evaluation measure to use?
Evaluating dependency parsers
By construction, the number of dependencies is the same as the number of words in the sentence.
So we do not need to worry about precision and recall, just plain old accuracy.
Labelled Attachment Score (LAS): Proportion of words where we predicted the correct head and label.
Unlabelled Attachment Score (UAS): Proportion of words where we predicted the correct head, regardless of label.
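A small sketch of the two measures, assuming each parse is stored as a list of (head, label) pairs, one per word, aligned with the gold annotation (the representation is just for illustration):

```python
# Attachment-score evaluation sketch: gold and predicted are lists of
# (head_index, label) pairs, one entry per word of the sentence.

def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as proportions over all words."""
    assert len(gold) == len(predicted)
    correct_head = correct_head_and_label = 0
    for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
        if g_head == p_head:
            correct_head += 1                     # counts toward UAS
            if g_label == p_label:
                correct_head_and_label += 1       # counts toward LAS
    n = len(gold)
    return correct_head / n, correct_head_and_label / n

# "kids saw birds": prediction gets the head of every word right
# but mislabels the relation of "birds".
gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
print(attachment_scores(gold, pred))   # (1.0, 0.666...)
```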