Computational Linguistics
CSC 485 Summer 2020
10A. Log-Likelihood Dependency Parsing
Gerald Penn
Department of Computer Science, University of Toronto
Based on slides by Yuji Matsumoto, Dragomir Radev, David Smith, Sam Thomson and Jason Eisner
Copyright © 2020 Gerald Penn. All rights reserved.
How about structured outputs?
Log-linear models are great for n-way classification
Also good for predicting sequences
find preferred tags
but to allow fast dynamic programming,
only use n-gram features
Also good for dependency parsing
…find preferred links…
but to allow fast dynamic programming or MST parsing, only use single-edge features
5
Parsing Methods
Classifier-Based Parsing
◮ Data-driven deterministic parsing:
◮ Deterministic parsing requires an oracle.
◮ An oracle can be approximated by a classifier.
◮ A classifier can be trained using treebank data.
◮ Learning methods:
◮ Support vector machines (SVM)
[Kudo and Matsumoto 2002, Yamada and Matsumoto 2003,
Isozaki et al. 2004, Cheng et al. 2004, Nivre et al. 2006]
◮ Memory-based learning (MBL)
[Nivre et al. 2004, Nivre and Scholz 2004]
◮ Maximum entropy modeling (MaxEnt) [Cheng et al. 2005]
Dependency Parsing 58(103)
Parsing Methods
Feature Models
◮ Learning problem:
◮ Approximate a function from parser states (represented by feature vectors) to parser actions, given a training set of gold-standard derivations.
◮ Typical features:
◮ Tokens:
◮ Target tokens
◮ Linear context (neighbors in S and Q)
◮ Structural context (parents, children, siblings in G)
◮ Attributes:
◮ Word form (and lemma)
◮ Part-of-speech (and morpho-syntactic features)
◮ Dependency type (if labeled)
◮ Distance (between target tokens)
Dependency Parsing 59(103)
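The feature templates above can be made concrete with a small sketch. This is a hypothetical extractor over a parser state (stack S, queue Q, partial graph G); the template names and the example state are invented for illustration, not Nivre's actual feature model:

```python
def extract_features(stack, queue, heads, tags, words):
    """Map a transition-parser state to a sparse binary feature dict.

    stack/queue hold token indices; heads maps dependent -> head
    in the partial graph built so far. (Illustrative templates only.)
    """
    feats = {}
    s0 = stack[-1] if stack else None   # top of stack (target token)
    q0 = queue[0] if queue else None    # front of queue (target token)
    if s0 is not None:
        feats["s0.word=" + words[s0]] = 1.0      # word form
        feats["s0.tag=" + tags[s0]] = 1.0        # part of speech
    if q0 is not None:
        feats["q0.word=" + words[q0]] = 1.0
        feats["q0.tag=" + tags[q0]] = 1.0
    if s0 is not None and q0 is not None:
        feats["s0.tag+q0.tag=" + tags[s0] + "+" + tags[q0]] = 1.0  # linear context
        feats["dist=" + str(min(q0 - s0, 5))] = 1.0                # distance, capped
    if s0 is not None and s0 in heads:
        feats["s0.head.tag=" + tags[heads[s0]]] = 1.0              # structural context
    return feats

state_feats = extract_features(stack=[0, 2], queue=[3, 4],
                               heads={1: 2},
                               tags=["ROOT", "DT", "NN", "VBD", "NN"],
                               words=["<root>", "the", "cat", "wore", "hats"])
```

A classifier then predicts the next parser action from such a feature dict.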
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)
In the beginning, we used generative models. p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
each choice depends on a limited part of the history
but which dependencies to allow? what if they're all worthwhile?
p(D | A,B,C)? … p(D | A,B) * p(C | A,B,D)?
3
which dependencies to allow? (given limited training data)
Solution: Log-linear (max-entropy) modeling
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
throw them all in!
Features may interact in arbitrary ways
Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
4
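The multiplicative-features idea can be sketched numerically. The outcome set, features, and weights below are invented toy values; the point is just that overlapping features Φ are combined by multiplication (addition in log space) and renormalized by Z:

```python
import math

def loglinear_prob(y, outcomes, features, weights):
    """p(y) = exp(w . f(y)) / Z, with Z summing over all outcomes."""
    def score(y_):
        return sum(weights.get(name, 0.0) * val
                   for name, val in features(y_).items())
    Z = sum(math.exp(score(y_)) for y_ in outcomes)
    return math.exp(score(y)) / Z

# Two overlapping features of a binary outcome d; both fire when d = 1.
feats = lambda d: {"bias": 1.0, "d-is-one": float(d == 1)}
w = {"bias": 0.0, "d-is-one": 1.0}
p_one = loglinear_prob(1, [0, 1], feats, w)   # e / (1 + e)
```

Iterative scaling (or any gradient method) would adjust `w` until the model's expected feature counts match the training data's.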
Edge-Factored Parsers (McDonald et al. 2005) Is this a good edge?
yes, lots of green …
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
“It was a bright cold day in April and the clocks were striking thirteen”
Features that fire on this candidate edge (jasný, den):
jasný den (“bright day”)
jasný N (“bright NOUN”)
A N
A N preceding conjunction
10
Edge-Factored Parsers (McDonald et al. 2005) How about this competing edge?
not as good, lots of red …
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
Features that fire on this candidate edge (jasný, hodiny):
jasný hodiny (“bright clocks”) … undertrained …
jasn hodi (“bright clock,” stems only)
Aplural Nsingular
A N where N follows a conjunction
15
Edge-Factored Parsers (McDonald et al. 2005) Which edge is better?
“bright day” or “bright clocks”?
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
16
Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better?
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
17
Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better?
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can’t have both (one parent per word)
can’t have both (no crossing links)
Can’t have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
18
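The score in the slides is the dot product of the weight vector θ with the edge's feature vector. A sketch with sparse binary features; the feature names and weights below are invented for illustration, not McDonald's actual templates:

```python
def score_edge(theta, edge_features):
    """score(e) = theta . features(e), for sparse binary features."""
    return sum(theta.get(name, 0.0) for name in edge_features)

# Hypothetical weights for the Czech example.
theta = {
    "pair=jasny+den": 2.0,              # well-attested word pair
    "pair=jasny+hodiny": 0.0,           # undertrained word pair
    "tags=A+N": 1.5,                    # adjective attaching to noun
    "tags=A+N,N-follows-conj": -1.0,    # ... but across a conjunction
}
good = score_edge(theta, ["pair=jasny+den", "tags=A+N"])
bad = score_edge(theta, ["pair=jasny+hodiny", "tags=A+N,N-follows-conj"])
# good > bad, so "bright day" beats "bright clocks" under these weights
```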
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG) Then use dynamic programming
The cat in the hat wore a stovepipe. ROOT
let’s vertically stretch this graph drawing
[tree drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a]
each subtree is a linguistic constituent (here a noun phrase)
19
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG)
Then use dynamic programming
CKY algorithm for CFG parsing is O(n3)
Unfortunately, O(n5) in this case
to score “cat wore” link, not enough to know this is NP
must know it’s rooted at “cat”
so expand nonterminal set by O(n): {NPthe, NPcat, NPhat, …}
so CKY’s “grammar constant” is no longer constant
[tree drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a]
each subtree is a linguistic constituent (here a noun phrase)
20
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG)
Then use dynamic programming
CKY algorithm for CFG parsing is O(n3)
Unfortunately, O(n5) in this case
Solution: Use a different decomposition (Eisner 1996) Back to O(n3)
[tree drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a]
each subtree is a linguistic constituent (here a noun phrase)
21
Spans vs. constituents
Two kinds of substring.
» Constituent of the tree: links to the rest only through its headword (root).
e.g. “The cat in the hat”, “wore a stovepipe. ROOT”
» Span of the tree: links to the rest only through its endwords.
e.g. “cat in the hat wore”
22
Decomposing a tree into spans
The cat in the hat wore a stovepipe. ROOT
The cat +
cat in the hat wore + wore a stovepipe. ROOT
cat in + in the hat wore
in the hat + hat wore
cat in the hat wore a stovepipe. ROOT
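The O(n3) span decomposition can be sketched as the standard first-order Eisner dynamic program over complete and incomplete spans. This is a generic reconstruction of Eisner (1996) computing only the best score (no backpointers), not pseudocode from the lecture:

```python
NEG_INF = float("-inf")

def eisner_best_score(score):
    """Best total score of a projective dependency tree.

    score[h][m] is the score of attaching word m to head h;
    index 0 is ROOT, the words are 1..n.
    """
    n = len(score) - 1
    # C[s][t][d]: best complete span s..t, headed at s (d=1) or at t (d=0).
    # I[s][t][d]: best incomplete span (the edge between s and t is pending).
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for width in range(1, n + 1):
        for s in range(0, n - width + 1):
            t = s + width
            # Two back-to-back complete halves meet; add one edge over the top.
            best_halves = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
            I[s][t][0] = best_halves + score[t][s]   # leftward edge t -> s
            I[s][t][1] = best_halves + score[s][t]   # rightward edge s -> t
            # An incomplete span absorbs a complete span on its open side.
            C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
            C[s][t][1] = max(I[s][r][1] + C[r][t][1] for r in range(s + 1, t + 1))
    return C[0][n][1]
```

Three nested loops over (width, s, r) give the O(n3) total, against O(n5) for naive lexicalized CKY.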
Hard Constraints on Valid Trees
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can’t have both (one parent per word)
can’t have both (no crossing links)
Can’t have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
24
Non-Projective Parses
ROOT I ’ll give a talk tomorrow on bootstrapping
subtree rooted at “talk” is a discontiguous noun phrase
can’t have both (no crossing links)
The “projectivity” restriction. Do we really want it?
25
Non-Projective Parses
ROOT I ‘ll give a talk tomorrow on bootstrapping
occasional non-projectivity in English
ista meam norit gloria canitiem
thatNOM myACC may-know gloryNOM going-grayACC
“That glory may-know my going-gray” (i.e., it shall last till I go gray)
frequent non-projectivity in Latin, etc.
26
Parsing Methods
Non-Projective Parsing Algorithms
◮ Complexity considerations (Proj = projective, NonP = non-projective):

Problem/Algorithm                                                  Proj   NonP
Complete grammar parsing [Gaifman 1965, Neuhaus and Bröker 1997]   P      NP hard
Deterministic parsing [Nivre 2003, Covington 2001]                 O(n)   O(n2)
First order spanning tree [McDonald et al. 2005b]                  O(n3)  O(n2)
Nth order spanning tree (N > 1) [McDonald and Pereira 2006]        P      NP hard
Dependency Parsing 65(103)
McDonald’s Approach (non-projective)
Consider the sentence “John saw Mary” (left).
The Chu-Liu-Edmonds algorithm finds the maximum-
weight spanning tree (right) – may be non-projective. Can be found in time O(n2).
[figure: left, all candidate edges over {root, John, saw, Mary} with their scores; right, the maximum-weight spanning tree]
Every node selects best parent
If cycles, contract them and repeat
slide thanks to Dragomir Radev
27
Chu-Liu-Edmonds – Contracting Stage
For each non-ROOT node v, set bestInEdge[v] to be its highest scoring incoming edge.
If a cycle C is formed:
contract the nodes in C into a new node vC
edges outgoing from any node in C now get source vC
edges incoming to any node in C now get destination vC
For each node u in C, and for each edge e incoming to u from
outside of C:
set e.kicksOut to bestInEdge[u], and
set e.score to be e.score − e.kicksOut.score.
Repeat until every non-ROOT node has an incoming edge and no cycles are formed
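The whole algorithm (contraction plus the expansion described below) can be sketched recursively: contract a cycle, solve the smaller graph, then undo the contraction. This recursive variant replaces the explicit kicksOut bookkeeping, its tie-breaking may differ from the worked example that follows, and it assumes every non-root node has at least one incoming edge:

```python
def chu_liu_edmonds(score, root=0):
    """Maximum-weight spanning arborescence (Chu-Liu-Edmonds).

    score[h][m] = weight of edge h -> m (every node appears as a key,
    possibly with an empty dict). Returns {dependent: head}.
    """
    # Every non-ROOT node selects its best incoming edge.
    best = {m: max((h for h in score if h != m and m in score[h]),
                   key=lambda h: score[h][m])
            for m in score if m != root}
    # Detect a cycle among the selected edges.
    cycle = None
    for start in best:
        seen, u = [], start
        while u in best and u not in seen:
            seen.append(u)
            u = best[u]
        if u in seen:
            cycle = seen[seen.index(u):]
            break
    if cycle is None:
        return best                    # already a tree
    # Contract the cycle into a fresh node c; edges into the cycle are
    # rescored by subtracting the cycle edge they would kick out.
    c, in_cycle = max(score) + 1, set(cycle)
    new_score = {v: {} for v in score if v not in in_cycle}
    new_score[c] = {}
    enter, leave = {}, {}
    for h in score:
        for m, s in score[h].items():
            if h not in in_cycle and m in in_cycle:
                adj = s - score[best[m]][m]
                if c not in new_score[h] or adj > new_score[h][c]:
                    new_score[h][c], enter[h] = adj, m
            elif h in in_cycle and m not in in_cycle:
                if m not in new_score[c] or s > new_score[c][m]:
                    new_score[c][m], leave[m] = s, h
            elif h not in in_cycle and m not in in_cycle:
                new_score[h][m] = s
    # Solve the contracted graph, then expand c back into the cycle.
    sub = chu_liu_edmonds(new_score, root)
    parents = {}
    for m, h in sub.items():
        if m == c:                     # the edge entering the cycle ...
            parents[enter[h]] = h      # ... breaks the cycle at its target
        elif h == c:                   # an edge leaving the cycle
            parents[m] = leave[m]
        else:
            parents[m] = h
    for m in cycle:                    # keep cycle edges except the broken one
        if m not in parents:
            parents[m] = best[m]
    return parents
```

On a three-node example with a 2-cycle (1 and 2 each preferring the other as parent), contraction and expansion recover the optimal tree.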
An Example – Contracting Stage
Graph: ROOT plus V1, V2, V3, with scored edges
a: ROOT→V1 (5)   b: ROOT→V2 (1)   c: ROOT→V3 (1)
d: V1→V2 (11)    e: V1→V3 (4)     f: V2→V3 (5)
g: V2→V1 (10)    h: V3→V1 (9)     i: V3→V2 (8)

Step 1: bestInEdge[V1] = g (10) and bestInEdge[V2] = d (11).
g and d form a cycle over {V1, V2}: contract it into V4, and rescore the edges entering the cycle:
a: 5 − 10 = −5 (kicksOut g)    b: 1 − 11 = −10 (kicksOut d)
h: 9 − 10 = −1 (kicksOut g)    i: 8 − 11 = −3 (kicksOut d)

Step 2: bestInEdge[V3] = f (5) and bestInEdge[V4] = h (−1).
f and h form a cycle over {V3, V4}: contract it into V5, and rescore:
a: −5 − (−1) = −4 (kicksOut g, h)    b: −10 − (−1) = −9 (kicksOut d, h)
c: 1 − 5 = −4 (kicksOut f)

Step 3: bestInEdge[V5] = a (−4). Now every non-ROOT node has an incoming edge and no cycle is formed, so contraction stops.

Final tables:
bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}
Chu-Liu-Edmonds – Expanding Stage
After the contracting stage, every contracted node will have exactly one bestInEdge. This edge will kick out one edge inside the contracted node, breaking the cycle.
Go through each bestInEdge e in the reverse order that we added them
lock down e, and remove every edge in kicksOut(e) from bestInEdge.
An Example – Expanding Stage
Go through the bestInEdge entries in the reverse order they were added, starting with V5:
bestInEdge[V5] = a: lock down a (ROOT→V1) and remove kicksOut(a) = {g, h} from bestInEdge.
This breaks both contracted cycles: V1 loses g and V4 loses h.
The surviving entries d (V1→V2) and f (V2→V3) are then locked down in turn.

Final tree: a (ROOT→V1), d (V1→V2), f (V2→V3), with total score 5 + 11 + 5 = 21.
Summing over all non-projective trees
Consider the sentence “John saw Mary” (left).
The Chu-Liu-Edmonds algorithm finds the maximum-
weight spanning tree (right) – may be non-projective.
Can be found in time O(n2).
How about total weight Z of all trees?
Can be found in time O(n3) by matrix determinants
and inverses (Smith & Smith, 2007).
slide thanks to Dragomir Radev
28
Graph Theory to the Rescue!
O(n3) time!
Tutte’s Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need!
29
Building the Kirchhoff (Laplacian) Matrix
Writing s(h, m) for the score of the edge h → m, with node 0 the root:
• Negate edge scores: start from the matrix with entries −s(h, m)
• Sum columns (children): set each diagonal entry to the total score entering that child
• Strike root row/col.
• Take determinant

The resulting matrix K, with rows and columns over the non-root words 1…n:
K[h][m] = −s(h, m)  for h ≠ m
K[m][m] = Σ_{h≠m} s(h, m)

N.B.: This allows multiple children of root, but see Koo et al. 2007.
30
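The construction above can be sanity-checked on a tiny graph. The sketch below builds the root-struck Kirchhoff matrix with the convention just described (diagonal = total weight into each child, off-diagonal = negated edge score), takes its determinant by cofactor expansion, and compares against brute-force enumeration of the three spanning trees; the edge weights are arbitrary:

```python
def determinant(M):
    """Determinant by cofactor expansion along row 0 (tiny matrices only)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] *
               determinant([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def kirchhoff_minor(w, nodes, root):
    """Kirchhoff (Laplacian) matrix with the root row/column struck out.

    w[h][m] is the multiplicative weight of edge h -> m.
    """
    others = [v for v in nodes if v != root]
    K = [[0.0] * len(others) for _ in others]
    for i, h in enumerate(others):
        for j, m in enumerate(others):
            if h == m:   # diagonal: total weight entering child m
                K[i][j] = sum(w.get(u, {}).get(m, 0.0) for u in nodes if u != m)
            else:        # off-diagonal: negated edge weight
                K[i][j] = -w.get(h, {}).get(m, 0.0)
    return K

# Nodes {0 (root), 1, 2}; arbitrary positive edge weights.
w = {0: {1: 2.0, 2: 3.0}, 1: {2: 5.0}, 2: {1: 7.0}}
Z = determinant(kirchhoff_minor(w, [0, 1, 2], root=0))
# The three spanning trees rooted at 0: {0->1, 0->2}, {0->1, 1->2}, {0->2, 2->1}.
brute_Z = 2.0 * 3.0 + 2.0 * 5.0 + 3.0 * 7.0
```

Both computations agree, and the determinant route stays O(n3) while the number of trees grows exponentially.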
Why Should This Work?
Clear for 1×1 matrix; use induction
Chu-Liu-Edmonds analogy:
Every node selects best parent
If cycles, contract and recurse
|K| = |K with edge (1,2) contracted| + s(1,2) · |K({1,2} | {1,2})|
Undirected case; special root cases for directed
31
Graph Deletion & Contraction
Important fact: κ(G) = κ(G−{e}) + κ(G\{e}), i.e., the spanning trees of G split into those that do not use edge e (delete e) and those that do (contract e).
1