Computational Linguistics
Copyright © 2017 Gerald Penn. All rights reserved.
10A. Log-Likelihood Dependency Parsing
Gerald Penn
Department of Computer Science, University of Toronto
CSC 2501 / 485
Fall 2018
Based on slides by Yuji Matsumoto, Dragomir Radev,
David Smith and Jason Eisner
Word Dependency Parsing
Raw sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September.

Part-of-speech tagging gives the POS-tagged sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September .
  PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .

Word dependency parsing gives the word dependency parsed sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September .
  [Figure: labeled dependency arcs over the sentence, with labels ROOT, SUBJ, S-COMP, SPEC, MOD, COMP.]

slide adapted from Yuji Matsumoto
Parsing Methods
Shift-Reduce Type Algorithms
◮ Data structures:
  ◮ Stack [..., wi]S of partially processed tokens
  ◮ Queue [wj, ...]Q of remaining input tokens
◮ Parsing actions built from atomic actions:
  ◮ Adding arcs (wi → wj, wi ← wj)
  ◮ Stack and queue operations
◮ Left-to-right parsing in O(n) time
◮ Restricted to projective dependency graphs
Yamada's Algorithm
◮ Three parsing actions:

  Shift:  [...]S [wi, ...]Q  ⇒  [..., wi]S [...]Q
  Left:   [..., wi, wj]S [...]Q  ⇒  [..., wi]S [...]Q    (adds arc wi → wj)
  Right:  [..., wi, wj]S [...]Q  ⇒  [..., wj]S [...]Q    (adds arc wi ← wj)

◮ Algorithm variants:
  ◮ Originally developed for Japanese (strictly head-final) with only the Shift and Right actions [Kudo and Matsumoto 2002]
  ◮ Adapted for English (with mixed headedness) by adding the Left action [Yamada and Matsumoto 2003]
  ◮ Multiple passes over the input give time complexity O(n²)
Nivre's Algorithm
◮ Four parsing actions:

  Shift:        [...]S [wi, ...]Q  ⇒  [..., wi]S [...]Q
  Reduce:       [..., wi]S [...]Q  ⇒  [...]S [...]Q    (requires ∃wk : wk → wi)
  Left-Arc(r):  [..., wi]S [wj, ...]Q  ⇒  [...]S [wj, ...]Q    (adds arc wi ←r− wj; requires ¬∃wk : wk → wi)
  Right-Arc(r): [..., wi]S [wj, ...]Q  ⇒  [..., wi, wj]S [...]Q    (adds arc wi −r→ wj; requires ¬∃wk : wk → wj)

◮ Characteristics:
  ◮ Integrated labeled dependency parsing
  ◮ Arc-eager processing of right-dependents
  ◮ Single pass over the input gives time complexity O(n)
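The four actions translate directly into code. Below is a minimal sketch in Python, assuming words are unique strings and arcs are stored as (head, dependent, label) triples; the names Config and apply_action are illustrative, not from any parsing library. The assertions mirror the ∃/¬∃ side conditions above: Reduce may only pop a token that already has a head, and Left-Arc may not attach a token twice.

class Config:
    def __init__(self, words):
        self.stack = ["root"]       # S: partially processed tokens
        self.queue = list(words)    # Q: remaining input tokens
        self.arcs = set()           # accumulated (head, dependent, label) arcs

    def has_head(self, w):
        return any(d == w for (_, d, _) in self.arcs)

def apply_action(c, action, label=None):
    if action == "Shift":                        # push queue front onto the stack
        c.stack.append(c.queue.pop(0))
    elif action == "Reduce":                     # pop stack top; it must have a head
        assert c.has_head(c.stack[-1])
        c.stack.pop()
    elif action == "Left-Arc":                   # queue front becomes head of stack top
        assert not c.has_head(c.stack[-1])
        c.arcs.add((c.queue[0], c.stack.pop(), label))
    elif action == "Right-Arc":                  # stack top becomes head of queue front,
        assert not c.has_head(c.queue[0])        # which is then pushed (arc-eager)
        c.arcs.add((c.stack[-1], c.queue[0], label))
        c.stack.append(c.queue.pop(0))
    return c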
Example

Starting configuration:
  [root]S [Economic news had little effect on financial markets .]Q

(In the configurations below, words that have been attached by Left-Arc or popped by Reduce are still drawn in their linear position, as on the original slides, but they are no longer accessible on the stack.)

Shift            ⇒ [root Economic]S [news had little effect on financial markets .]Q
Left-Arc(nmod)   ⇒ [root]S Economic [news had little effect on financial markets .]Q      (Economic ←nmod− news)
Shift            ⇒ [root Economic news]S [had little effect on financial markets .]Q
Left-Arc(sbj)    ⇒ [root]S Economic news [had little effect on financial markets .]Q      (news ←sbj− had)
Right-Arc(pred)  ⇒ [root Economic news had]S [little effect on financial markets .]Q      (root −pred→ had)
Shift            ⇒ [root Economic news had little]S [effect on financial markets .]Q
Left-Arc(nmod)   ⇒ [root Economic news had]S little [effect on financial markets .]Q      (little ←nmod− effect)
Right-Arc(obj)   ⇒ [root Economic news had little effect]S [on financial markets .]Q      (had −obj→ effect)
Right-Arc(nmod)  ⇒ [root Economic news had little effect on]S [financial markets .]Q      (effect −nmod→ on)
Shift            ⇒ [root Economic news had little effect on financial]S [markets .]Q
Left-Arc(nmod)   ⇒ [root Economic news had little effect on]S financial [markets .]Q      (financial ←nmod− markets)
Right-Arc(pc)    ⇒ [root Economic news had little effect on financial markets]S [.]Q      (on −pc→ markets)
Reduce           ⇒ [root Economic news had little effect on]S financial markets [.]Q      (pops markets)
Reduce           ⇒ [root Economic news had little effect]S on financial markets [.]Q      (pops on)
Reduce           ⇒ [root Economic news had]S little effect on financial markets [.]Q      (pops effect)
Reduce           ⇒ [root]S Economic news had little effect on financial markets [.]Q      (pops had)
Right-Arc(p)     ⇒ [root Economic news had little effect on financial markets .]S []Q     (root −p→ .)
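Under the sketch above, this derivation is just a list of (action, label) pairs; replaying it reproduces the nine arcs of the example (a small sanity check, assuming the Config and apply_action names from the earlier sketch):

words = "Economic news had little effect on financial markets .".split()
c = Config(words)
gold = [("Shift", None), ("Left-Arc", "nmod"), ("Shift", None), ("Left-Arc", "sbj"),
        ("Right-Arc", "pred"), ("Shift", None), ("Left-Arc", "nmod"),
        ("Right-Arc", "obj"), ("Right-Arc", "nmod"), ("Shift", None),
        ("Left-Arc", "nmod"), ("Right-Arc", "pc"), ("Reduce", None),
        ("Reduce", None), ("Reduce", None), ("Reduce", None), ("Right-Arc", "p")]
for action, label in gold:
    apply_action(c, action, label)
assert ("root", "had", "pred") in c.arcs and ("on", "markets", "pc") in c.arcs
# c.arcs now holds all nine labeled arcs of the tree derived above.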
Classifier-Based Parsing
◮ Data-driven deterministic parsing:
  ◮ Deterministic parsing requires an oracle.
  ◮ An oracle can be approximated by a classifier.
  ◮ A classifier can be trained using treebank data (sketched below).
◮ Learning methods:
  ◮ Support vector machines (SVM) [Kudo and Matsumoto 2002, Yamada and Matsumoto 2003, Isozaki et al. 2004, Cheng et al. 2004, Nivre et al. 2006]
  ◮ Memory-based learning (MBL) [Nivre et al. 2004, Nivre and Scholz 2004]
  ◮ Maximum entropy modeling (MaxEnt) [Cheng et al. 2005]
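A minimal sketch of the classifier-as-oracle idea, reusing the Config and apply_action sketch from above. The treebank format, the gold_actions oracle, and the learner's fit/predict interface are all assumptions for illustration; any of the learners listed (SVM, MBL, MaxEnt) could fill the learner slot.

def train_oracle(treebank, learner, extract_features, gold_actions):
    # Replay each gold derivation, recording (parser state, correct action) pairs.
    X, y = [], []
    for words, gold_tree in treebank:
        c = Config(words)
        for action, label in gold_actions(c, gold_tree):   # hypothetical oracle
            X.append(extract_features(c))
            y.append((action, label))
            apply_action(c, action, label)
    learner.fit(X, y)                                      # assumed multiclass interface
    return learner

def classifier_parse(words, learner, extract_features):
    # Deterministic parsing: the classifier approximates the oracle at each step.
    c = Config(words)
    while c.queue:
        action, label = learner.predict(extract_features(c))
        apply_action(c, action, label)
    return c.arcs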
Feature Models
◮ Learning problem:
  ◮ Approximate a function from parser states, represented by feature vectors, to parser actions, given a training set of gold-standard derivations.
◮ Typical features (cf. the sketch below):
  ◮ Tokens:
    ◮ Target tokens
    ◮ Linear context (neighbors in S and Q)
    ◮ Structural context (parents, children, siblings in G)
  ◮ Attributes:
    ◮ Word form (and lemma)
    ◮ Part-of-speech (and morpho-syntactic features)
    ◮ Dependency type (if labeled)
    ◮ Distance (between target tokens)
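A hedged sketch of such a feature model for the arc-eager configurations above: the target tokens are the stack top and queue front, with linear context from Q and structural context from the arcs built so far. The tags argument and the feature names are illustrative assumptions, not a specific published feature set.

def extract_features(c, tags):
    s0 = c.stack[-1] if c.stack else None          # target token on the stack
    q0 = c.queue[0] if c.queue else None           # target token in the queue
    q1 = c.queue[1] if len(c.queue) > 1 else None  # linear context in Q
    head_of = {d: (h, r) for (h, d, r) in c.arcs}  # structural context in G
    return {
        "s0.form": s0, "s0.pos": tags.get(s0),
        "q0.form": q0, "q0.pos": tags.get(q0),
        "q1.pos": tags.get(q1),
        "s0.deprel": head_of.get(s0, (None, None))[1],  # dependency type, if attached
    }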
Great ideas in NLP: Log-linear models
(Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models:

  p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …

Each choice depends on a limited part of the history. But which dependencies to allow, given limited training data? What if they're all worthwhile? p(D | A,B,C)? Or p(D | A,B) * p(C | A,B,D)?

Solution: Log-linear (max-entropy) modeling:

  (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …

Features may interact in arbitrary ways: throw them all in! Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
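Concretely, a log-linear model just exponentiates a weighted feature sum and normalizes by Z. A minimal sketch (the feature and weight representations are illustrative; training by iterative scaling or gradient ascent is not shown):

import math

def loglinear_probs(x, candidates, features, weights):
    # p(y | x) = exp(w · f(x, y)) / Z, with Z summing over all candidate y.
    scores = {y: sum(weights.get(f, 0.0) * v
                     for f, v in features(x, y).items())
              for y in candidates}
    m = max(scores.values())                     # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exps.values())
    return {y: e / Z for y, e in exps.items()}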
Log-linear models are great for n-way classification. How about structured outputs?

Also good for predicting sequences (find preferred tags), but to allow fast dynamic programming, only use n-gram features.

Also good for dependency parsing (find preferred links), but to allow fast dynamic programming or MST parsing, only use single-edge features.
Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou
V    A      A       A       N   J  N      V         C
byl  jasn   stud    dubn    den a  hodi   odbí      třin      (stems)
("It was a bright cold day in April, and the clocks were striking thirteen.")

Is this a good edge (jasný → den)? Yes, lots of green. The features that fire on it:
  jasný den ("bright day")
  jasný N ("bright NOUN")
  A N
  A N preceding a conjunction

How about the competing edge (jasný → hodiny)? Not as good, lots of red. The features that fire on it:
  jasný hodiny ("bright clocks") … undertrained …
  jasn hodi ("bright clock," stems only)
  A(plural) N(singular)
  A N where N follows a conjunction

Which edge is better? "bright day" or "bright clocks"?
Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score.
The parse must also satisfy hard constraints: no word may have two parents (one parent per word), no two links may cross (no crossing links), and no set of links may form a cycle (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges. A sketch of the edge scorer follows.
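Below is a hedged sketch of the edge-factored score θ · features(e). The feature templates mirror the slides (word pair, word-tag pair, stem pair, tag pair, tag pair in context); all names and the context template are illustrative assumptions, not McDonald et al.'s exact feature set.

def edge_features(words, tags, stems, h, d):
    # Feature templates for a candidate arc words[h] -> words[d].
    return {
        ("form", words[h], words[d]),          # e.g., jasný -> den ("bright day")
        ("form+tag", words[h], tags[d]),       # jasný -> N ("bright NOUN")
        ("stem", stems[h], stems[d]),          # jasn -> hodi (stems only)
        ("tag", tags[h], tags[d]),             # A -> N
        ("tag+ctx", tags[h], tags[d],
         tags[d - 1] if d > 0 else "<s>"),     # e.g., A N where N follows a conjunction
    }

def edge_score(theta, feats):
    # theta: dict feature -> weight; unseen ("undertrained") features score 0.
    return sum(theta.get(f, 0.0) for f in feats)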
Finding Highest-Scoring Parse

The cat in the hat wore a stovepipe. ROOT

Convert to a context-free grammar (CFG), then use dynamic programming: each subtree is a linguistic constituent (here a noun phrase).

[Figure: the same dependency graph drawn vertically, each word heading its constituent.]

The CKY algorithm for CFG parsing is O(n³). Unfortunately, it is O(n⁵) in this case: to score the "cat wore" link, it is not enough to know that this constituent is an NP; we must know it is rooted at "cat". So the nonterminal set expands by a factor of O(n), to {NP_the, NP_cat, NP_hat, …}, and CKY's "grammar constant" is no longer constant.

Solution: use a different decomposition (Eisner 1996). Back to O(n³), as sketched below.
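A compact sketch of Eisner's O(n³) decomposition for first-order projective parsing. It computes only the best total score; adding backpointers on the two max operations would recover the tree. Here score[h][m] is the (assumed, edge-factored) weight of the arc h → m, with word 0 as the artificial ROOT.

NEG_INF = float("-inf")

def eisner_best_score(score):
    n = len(score)
    # C[i][j][d]: best "complete" span from i to j; I[i][j][d]: best "incomplete"
    # span (the arc between its endpoints still open on one side).
    # d = 1: headed by the left endpoint i; d = 0: headed by the right endpoint j.
    # The next slides motivate this span decomposition.
    C = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        C[i][i][0] = C[i][i][1] = 0.0
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Incomplete: join two adjacent complete spans, add one arc across.
            best = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][1] = best + score[i][j]    # arc i -> j
            I[i][j][0] = best + score[j][i]    # arc j -> i
            # Complete: absorb an incomplete span on the appropriate side.
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
    return C[0][n - 1][1]                      # best projective tree headed by ROOT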
Spans vs. constituents

Two kinds of substring.
» Constituent of the tree: links to the rest only through its headword (root).
» Span of the tree: links to the rest only through its endwords.

The cat in the hat wore a stovepipe. ROOT

Decomposing a tree into spans: adjacent spans share an endword, e.g.

  The cat in the hat wore a stovepipe. ROOT
    = The cat + cat in the hat wore + wore a stovepipe. ROOT
  cat in the hat wore = cat in + in the hat wore
  in the hat wore = in the hat + hat wore
Hard Constraints on Valid Trees

Recall: score of an edge e = θ · features(e), and standard algorithms find the valid parse with maximum total score, where valid means one parent per word, no crossing links, and no cycles. Thus, an edge may lose (or win) because of a consensus of other edges.
Non-Projective Parses

I 'll give a talk tomorrow on bootstrapping ROOT

The subtree rooted at "talk" is a discontiguous noun phrase: "on bootstrapping" modifies "talk" across the intervening "tomorrow", so two links must cross. This is the "projectivity" restriction (no crossing links). Do we really want it? Non-projectivity is occasional in English, but frequent in Latin, etc.:

ista meam norit gloria canitiem ROOT
that(NOM) my(ACC) may-know glory(NOM) going-gray(ACC)
"That glory may-know my going-gray" (i.e., it shall last till I go gray)
Parsing Methods
Non-Projective Parsing Algorithms

◮ Complexity considerations: projective (Proj) vs. non-projective (NonP)

  Problem/Algorithm                                                  Proj    NonP
  Complete grammar parsing [Gaifman 1965, Neuhaus and Bröker 1997]   P       NP-hard
  Deterministic parsing [Nivre 2003, Covington 2001]                 O(n)    O(n²)
  First-order spanning tree [McDonald et al. 2005b]                  O(n³)   O(n²)
  Nth-order spanning tree (N > 1) [McDonald and Pereira 2006]        P       NP-hard
McDonald's Approach (non-projective)

Consider the sentence "John saw Mary" (left). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective, in time O(n²). Every node selects its best parent; if cycles result, contract them and repeat.

[Figure: the complete arc-score graph over {root, John, saw, Mary} (left) and its maximum spanning tree, root → saw (10), saw → John (30), saw → Mary (30) (right).]

slide thanks to Dragomir Radev
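A hedged, clarity-first sketch of Chu-Liu-Edmonds (the O(n²) bound needs Tarjan's more careful implementation; this plain recursion is roughly cubic). Arcs map (head, dependent) pairs to scores; the demo weights are those recoverable from the figure.

def _find_cycle(best_in):
    for start in best_in:
        path, node = set(), start
        while node in best_in and node not in path:
            path.add(node)
            node = best_in[node]
        if node in path:                      # walked back onto this path: a cycle
            cycle, cur = {node}, best_in[node]
            while cur != node:
                cycle.add(cur)
                cur = best_in[cur]
            return cycle
    return None

def chu_liu_edmonds(arcs, root):
    # 1. Every node greedily selects its best (highest-scoring) parent.
    best_in = {}
    for (h, d), w in arcs.items():
        if d != root and (d not in best_in or w > arcs[(best_in[d], d)]):
            best_in[d] = h
    cycle = _find_cycle(best_in)
    if cycle is None:
        return best_in                        # already a tree: done
    # 2. Contract the cycle to one node c; arcs entering the cycle pay for
    #    the cycle arc they would break.
    c = min(cycle, key=str)
    new_arcs, origin = {}, {}
    for (h, d), w in arcs.items():
        hc, dc = (c if h in cycle else h), (c if d in cycle else d)
        if hc == dc:
            continue
        if d in cycle:
            w -= arcs[(best_in[d], d)]
        if (hc, dc) not in new_arcs or w > new_arcs[(hc, dc)]:
            new_arcs[(hc, dc)] = w
            origin[(hc, dc)] = (h, d)
    # 3. Solve the contracted problem, then expand the cycle again.
    contracted = chu_liu_edmonds(new_arcs, root)
    heads = {}
    for dc, hc in contracted.items():
        h, d = origin[(hc, dc)]
        heads[d] = h
    for d in cycle:
        heads.setdefault(d, best_in[d])       # keep cycle arcs except the broken one
    return heads

arcs = {("root", "John"): 9, ("root", "saw"): 10, ("root", "Mary"): 9,
        ("John", "saw"): 20, ("saw", "John"): 30, ("Mary", "saw"): 0,
        ("saw", "Mary"): 30, ("John", "Mary"): 3, ("Mary", "John"): 11}
print(chu_liu_edmonds(arcs, "root"))
# -> {'saw': 'root', 'John': 'saw', 'Mary': 'saw'}   (total weight 70)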
Summing over all non-projective trees

Chu-Liu-Edmonds finds the highest-scoring non-projective tree in O(n²). How about the total weight Z of all trees? It can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).

slide thanks to Dragomir Radev
Graph Theory to the Rescue!

Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, equals the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need! O(n³) time!
Building the Kirchhoff (Laplacian) Matrix

Start from the matrix of arc scores s(i, j) (the score of the arc i → j, with node 0 the root), then:
• Negate the edge scores off the diagonal.
• Put each column's sum on the diagonal (columns index children): entry (j, j) becomes Σ_{i≠j} s(i, j).
• Strike out the root's row and column.
• Take the determinant.

That is, for i, j ∈ {1, …, n}:

  K[j][j] = Σ_{i≠j} s(i, j),    K[i][j] = −s(i, j) for i ≠ j,    Z = det K.

N.B.: This allows multiple children of root, but see Koo et al. 2007.
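In code, the whole construction is a few lines of numpy. A sketch, assuming s[i, j] already holds the multiplicative weight (e.g., the exponentiated log-linear score) of arc i → j:

import numpy as np

def tree_partition_Z(s):
    # Z = total weight of all spanning trees rooted at node 0 (Matrix-Tree Theorem).
    A = np.array(s, dtype=float)
    np.fill_diagonal(A, 0.0)          # ignore self-loops
    K = np.diag(A.sum(axis=0)) - A    # column sums on the diagonal, negated scores off it
    return np.linalg.det(K[1:, 1:])   # strike the root's row and column

As a check, for a root and two words with arc weights root→w1 = a, root→w2 = b, w1→w2 = c, w2→w1 = d, this returns ab + ac + bd, one term per spanning tree.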
Why Should This Work?

Chu-Liu-Edmonds analogy: every node selects its best parent; if cycles form, contract and recur. The determinant follows the same pattern: det K is linear in the entries for any one edge, so it splits as

  det K = s(1,2) · det K(with edge 1,2 contracted) + det K(with edge 1,2 deleted)

that is, the total weight of trees containing the edge plus the total weight of trees without it. The claim is clear for a 1×1 matrix; use induction. (Strictly, this is the undirected case; the directed case needs special handling of the root.)
Graph Deletion & Contraction
Important fact: κ(G) = κ(G − e) + κ(G \ e), where κ counts spanning trees, G − e deletes edge e, and G \ e contracts it.
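As a worked statement of the induction step (a sketch; the weighted, directed version is what the parser actually uses):

\kappa(G) \;=\; \underbrace{\kappa(G - e)}_{\text{spanning trees avoiding } e}
       \;+\; \underbrace{\kappa(G \setminus e)}_{\text{spanning trees using } e}

% With arc weights, the same split reads Z(G) = Z(G - e) + s(e)\,Z(G \setminus e),
% which is exactly the deletion-contraction expansion of det K used above.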