Computational Linguistics
CSC 485 Summer 2020
10A. Log-Likelihood Dependency Parsing
Gerald Penn
Department of Computer Science, University of Toronto
Based on slides by Yuji Matsumoto, Dragomir Radev, David Smith, Sam Thomson and Jason Eisner
Copyright © 2020 Gerald Penn. All rights reserved.
How about structured outputs?
Log-linear models are great for n-way classification
Also good for predicting sequences
find preferred tags
but to allow fast dynamic programming,
only use n-gram features
Also good for dependency parsing
…find preferred links…
but to allow fast dynamic programming or MST parsing, only use single-edge features
5
Parsing Methods
Classifier-Based Parsing
◮ Data-driven deterministic parsing:
◮ Deterministic parsing requires an oracle.
◮ An oracle can be approximated by a classifier.
◮ A classifier can be trained using treebank data.
◮ Learning methods:
◮ Support vector machines (SVM)
[Kudo and Matsumoto 2002, Yamada and Matsumoto 2003,
Isozaki et al. 2004, Cheng et al. 2004, Nivre et al. 2006]
◮ Memory-based learning (MBL)
[Nivre et al. 2004, Nivre and Scholz 2004]
◮ Maximum entropy modeling (MaxEnt) [Cheng et al. 2005]
Dependency Parsing 58(103)
Parsing Methods
Feature Models
◮ Learning problem:
◮ Approximate a function from parser states (represented by feature vectors) to parser actions, given a training set of gold-standard derivations.
◮ Typical features:
◮ Tokens:
◮ Target tokens
◮ Linear context (neighbors in S and Q)
◮ Structural context (parents, children, siblings in G)
◮ Attributes:
◮ Word form (and lemma)
◮ Part-of-speech (and morpho-syntactic features)
◮ Dependency type (if labeled)
◮ Distance (between target tokens)
Dependency Parsing 59(103)
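The feature templates above can be made concrete with a small sketch. This is a hypothetical extractor over a parser state (stack S, queue Q, partial graph G); the template names and the example state are invented for illustration, not Nivre's actual feature model:

```python
def extract_features(stack, queue, heads, tags, words):
    """Map a transition-parser state to a sparse binary feature dict.

    stack/queue hold token indices; heads maps dependent -> head
    in the partial graph built so far. (Illustrative templates only.)
    """
    feats = {}
    s0 = stack[-1] if stack else None   # top of stack (target token)
    q0 = queue[0] if queue else None    # front of queue (target token)
    if s0 is not None:
        feats["s0.word=" + words[s0]] = 1.0      # word form
        feats["s0.tag=" + tags[s0]] = 1.0        # part of speech
    if q0 is not None:
        feats["q0.word=" + words[q0]] = 1.0
        feats["q0.tag=" + tags[q0]] = 1.0
    if s0 is not None and q0 is not None:
        feats["s0.tag+q0.tag=" + tags[s0] + "+" + tags[q0]] = 1.0  # linear context
        feats["dist=" + str(min(q0 - s0, 5))] = 1.0                # distance, capped
    if s0 is not None and s0 in heads:
        feats["s0.head.tag=" + tags[heads[s0]]] = 1.0              # structural context
    return feats

state_feats = extract_features(stack=[0, 2], queue=[3, 4],
                               heads={1: 2},
                               tags=["ROOT", "DT", "NN", "VBD", "NN"],
                               words=["<root>", "the", "cat", "wore", "hats"])
```

A classifier then predicts the next parser action from such a feature dict.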
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)
In the beginning, we used generative models. p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
each choice depends on a limited part of the history
but which dependencies to allow? what if they're all worthwhile?
p(D | A,B,C)? … p(D | A,B) * p(C | A,B,D)?
3
which dependencies to allow? (given limited training data)
Solution: Log-linear (max-entropy) modeling
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
throw them all in!
Features may interact in arbitrary ways
Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
4
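The multiplicative-features idea can be sketched numerically. The outcome set, features, and weights below are invented toy values; the point is just that overlapping features Φ are combined by multiplication (addition in log space) and renormalized by Z:

```python
import math

def loglinear_prob(y, outcomes, features, weights):
    """p(y) = exp(w . f(y)) / Z, with Z summing over all outcomes."""
    def score(y_):
        return sum(weights.get(name, 0.0) * val
                   for name, val in features(y_).items())
    Z = sum(math.exp(score(y_)) for y_ in outcomes)
    return math.exp(score(y)) / Z

# Two overlapping features of a binary outcome d; both fire when d = 1.
feats = lambda d: {"bias": 1.0, "d-is-one": float(d == 1)}
w = {"bias": 0.0, "d-is-one": 1.0}
p_one = loglinear_prob(1, [0, 1], feats, w)   # e / (1 + e)
```

Iterative scaling (or any gradient method) would adjust `w` until the model's expected feature counts match the training data's.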
Edge-Factored Parsers (McDonald et al. 2005) Is this a good edge?
yes, lots of green …
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
“It was a bright cold day in April and the clocks were striking thirteen”
Features that fire on this candidate edge (jasný, den):
jasný den (“bright day”)
jasný N (“bright NOUN”)
A N
A N preceding conjunction
10
Edge-Factored Parsers (McDonald et al. 2005) How about this competing edge?
not as good, lots of red …
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
Features that fire on this candidate edge (jasný, hodiny):
jasný hodiny (“bright clocks”) … undertrained …
jasn hodi (“bright clock,” stems only)
Aplural Nsingular
A N where N follows a conjunction
15
Edge-Factored Parsers (McDonald et al. 2005) Which edge is better?
“bright day” or “bright clocks”?
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
16
Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better?
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
Byl jasný studený dubnový den a hodiny odbíjely třináctou
V A A A N J N V C
byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
17
Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better?
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can’t have both (one parent per word)
can’t have both (no crossing links)
Can’t have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
18
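The score in the slides is the dot product of the weight vector θ with the edge's feature vector. A sketch with sparse binary features; the feature names and weights below are invented for illustration, not McDonald's actual templates:

```python
def score_edge(theta, edge_features):
    """score(e) = theta . features(e), for sparse binary features."""
    return sum(theta.get(name, 0.0) for name in edge_features)

# Hypothetical weights for the Czech example.
theta = {
    "pair=jasny+den": 2.0,              # well-attested word pair
    "pair=jasny+hodiny": 0.0,           # undertrained word pair
    "tags=A+N": 1.5,                    # adjective attaching to noun
    "tags=A+N,N-follows-conj": -1.0,    # ... but across a conjunction
}
good = score_edge(theta, ["pair=jasny+den", "tags=A+N"])
bad = score_edge(theta, ["pair=jasny+hodiny", "tags=A+N,N-follows-conj"])
# good > bad, so "bright day" beats "bright clocks" under these weights
```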
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG) Then use dynamic programming
The cat in the hat wore a stovepipe. ROOT
let’s vertically stretch this graph drawing
[tree drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a]
each subtree is a linguistic constituent (here a noun phrase)
19
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG)
Then use dynamic programming
CKY algorithm for CFG parsing is O(n3)
Unfortunately, O(n5) in this case
to score “cat wore” link, not enough to know this is NP
must know it’s rooted at “cat”
so expand nonterminal set by O(n): {NPthe, NPcat, NPhat, …}
so CKY’s “grammar constant” is no longer constant
[tree drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a]
each subtree is a linguistic constituent (here a noun phrase)
20
Finding Highest-Scoring Parse
Convert to context-free grammar (CFG)
Then use dynamic programming
CKY algorithm for CFG parsing is O(n3)
Unfortunately, O(n5) in this case
Solution: Use a different decomposition (Eisner 1996) Back to O(n3)
[tree drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a]
each subtree is a linguistic constituent (here a noun phrase)
21
Spans vs. constituents
Two kinds of substring.
» Constituent of the tree: links to the rest only through its headword (root).
e.g. “The cat in the hat”, “wore a stovepipe. ROOT”
» Span of the tree: links to the rest only through its endwords.
e.g. “cat in the hat wore”
22
Decomposing a tree into spans
The cat in the hat wore a stovepipe. ROOT
The cat +
cat in the hat wore + wore a stovepipe. ROOT
cat in + in the hat wore
in the hat + hat wore
cat in the hat wore a stovepipe. ROOT
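The O(n3) span decomposition can be sketched as the standard first-order Eisner dynamic program over complete and incomplete spans. This is a generic reconstruction of Eisner (1996) computing only the best score (no backpointers), not pseudocode from the lecture:

```python
NEG_INF = float("-inf")

def eisner_best_score(score):
    """Best total score of a projective dependency tree.

    score[h][m] is the score of attaching word m to head h;
    index 0 is ROOT, the words are 1..n.
    """
    n = len(score) - 1
    # C[s][t][d]: best complete span s..t, headed at s (d=1) or at t (d=0).
    # I[s][t][d]: best incomplete span (the edge between s and t is pending).
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for width in range(1, n + 1):
        for s in range(0, n - width + 1):
            t = s + width
            # Two back-to-back complete halves meet; add one edge over the top.
            best_halves = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
            I[s][t][0] = best_halves + score[t][s]   # leftward edge t -> s
            I[s][t][1] = best_halves + score[s][t]   # rightward edge s -> t
            # An incomplete span absorbs a complete span on its open side.
            C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
            C[s][t][1] = max(I[s][r][1] + C[r][t][1] for r in range(s + 1, t + 1))
    return C[0][n][1]
```

Three nested loops over (width, s, r) give the O(n3) total, against O(n5) for naive lexicalized CKY.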
Hard Constraints on Valid Trees
Score of an edge e = θ ∙ features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can’t have both (one parent per word)
can’t have both (no crossing links)
Can’t have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
24
Non-Projective Parses
ROOT I ’ll give a talk tomorrow on bootstrapping
subtree rooted at “talk” is a discontiguous noun phrase
can’t have both (no crossing links)
The “projectivity” restriction. Do we really want it?
25
Non-Projective Parses
ROOT I ‘ll give a talk tomorrow on bootstrapping
occasional non-projectivity in English
ista meam norit gloria canitiem
thatNOM myACC may-know gloryNOM going-grayACC
“That glory may-know my going-gray” (i.e., it shall last till I go gray)
frequent non-projectivity in Latin, etc.
26
Parsing Methods
Non-Projective Parsing Algorithms
◮ Complexity considerations (Proj = projective, NonP = non-projective):

Problem/Algorithm                                                  Proj   NonP
Complete grammar parsing [Gaifman 1965, Neuhaus and Bröker 1997]   P      NP hard
Deterministic parsing [Nivre 2003, Covington 2001]                 O(n)   O(n2)
First order spanning tree [McDonald et al. 2005b]                  O(n3)  O(n2)
Nth order spanning tree (N > 1) [McDonald and Pereira 2006]        P      NP hard
Dependency Parsing 65(103)
McDonald’s Approach (non-projective)
Consider the sentence “John saw Mary” (left).
The Chu-Liu-Edmonds algorithm finds the maximum-
weight spanning tree (right) – may be non-projective. Can be found in time O(n2).
[figure: left, all candidate edges over {root, John, saw, Mary} with their scores; right, the maximum-weight spanning tree]
Every node selects best parent
If cycles, contract them and repeat
slide thanks to Dragomir Radev
27
Chu-Liu-Edmonds – Contracting Stage
For each non-ROOT node v, set bestInEdge[v] to be its highest scoring incoming edge.
If a cycle C is formed:
contract the nodes in C into a new node vC
edges outgoing from any node in C now get source vC
edges incoming to any node in C now get destination vC
For each node u in C, and for each edge e incoming to u from
outside of C:
set e.kicksOut to bestInEdge[u], and
set e.score to be e.score − e.kicksOut.score.
Repeat until every non-ROOT node has an incoming edge and no cycles are formed
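The whole algorithm (contraction plus the expansion described below) can be sketched recursively: contract a cycle, solve the smaller graph, then undo the contraction. This recursive variant replaces the explicit kicksOut bookkeeping, its tie-breaking may differ from the worked example that follows, and it assumes every non-root node has at least one incoming edge:

```python
def chu_liu_edmonds(score, root=0):
    """Maximum-weight spanning arborescence (Chu-Liu-Edmonds).

    score[h][m] = weight of edge h -> m (every node appears as a key,
    possibly with an empty dict). Returns {dependent: head}.
    """
    # Every non-ROOT node selects its best incoming edge.
    best = {m: max((h for h in score if h != m and m in score[h]),
                   key=lambda h: score[h][m])
            for m in score if m != root}
    # Detect a cycle among the selected edges.
    cycle = None
    for start in best:
        seen, u = [], start
        while u in best and u not in seen:
            seen.append(u)
            u = best[u]
        if u in seen:
            cycle = seen[seen.index(u):]
            break
    if cycle is None:
        return best                    # already a tree
    # Contract the cycle into a fresh node c; edges into the cycle are
    # rescored by subtracting the cycle edge they would kick out.
    c, in_cycle = max(score) + 1, set(cycle)
    new_score = {v: {} for v in score if v not in in_cycle}
    new_score[c] = {}
    enter, leave = {}, {}
    for h in score:
        for m, s in score[h].items():
            if h not in in_cycle and m in in_cycle:
                adj = s - score[best[m]][m]
                if c not in new_score[h] or adj > new_score[h][c]:
                    new_score[h][c], enter[h] = adj, m
            elif h in in_cycle and m not in in_cycle:
                if m not in new_score[c] or s > new_score[c][m]:
                    new_score[c][m], leave[m] = s, h
            elif h not in in_cycle and m not in in_cycle:
                new_score[h][m] = s
    # Solve the contracted graph, then expand c back into the cycle.
    sub = chu_liu_edmonds(new_score, root)
    parents = {}
    for m, h in sub.items():
        if m == c:                     # the edge entering the cycle ...
            parents[enter[h]] = h      # ... breaks the cycle at its target
        elif h == c:                   # an edge leaving the cycle
            parents[m] = leave[m]
        else:
            parents[m] = h
    for m in cycle:                    # keep cycle edges except the broken one
        if m not in parents:
            parents[m] = best[m]
    return parents
```

On a three-node example with a 2-cycle (1 and 2 each preferring the other as parent), contraction and expansion recover the optimal tree.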
An Example – Contracting Stage
Graph: ROOT plus V1, V2, V3, with scored edges
a: ROOT→V1 (5)   b: ROOT→V2 (1)   c: ROOT→V3 (1)
d: V1→V2 (11)    e: V1→V3 (4)     f: V2→V3 (5)
g: V2→V1 (10)    h: V3→V1 (9)     i: V3→V2 (8)

Step 1: bestInEdge[V1] = g (10) and bestInEdge[V2] = d (11).
g and d form a cycle over {V1, V2}: contract it into V4, and rescore the edges entering the cycle:
a: 5 − 10 = −5 (kicksOut g)    b: 1 − 11 = −10 (kicksOut d)
h: 9 − 10 = −1 (kicksOut g)    i: 8 − 11 = −3 (kicksOut d)

Step 2: bestInEdge[V3] = f (5) and bestInEdge[V4] = h (−1).
f and h form a cycle over {V3, V4}: contract it into V5, and rescore:
a: −5 − (−1) = −4 (kicksOut g, h)    b: −10 − (−1) = −9 (kicksOut d, h)
c: 1 − 5 = −4 (kicksOut f)

Step 3: bestInEdge[V5] = a (−4). Now every non-ROOT node has an incoming edge and no cycle is formed, so contraction stops.

Final tables:
bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: a
kicksOut: a: {g, h}, b: {d, h}, c: {f}, h: {g}, i: {d}
Chu-Liu-Edmonds – Expanding Stage
After the contracting stage, every contracted node will have exactly one bestInEdge. This edge will kick out one edge inside the contracted node, breaking the cycle.
Go through each bestInEdge e in the reverse order that we added them
lock down e, and remove every edge in kicksOut(e) from bestInEdge.
An Example – Expanding Stage
Go through the bestInEdge entries in the reverse order they were added, starting with V5:
bestInEdge[V5] = a: lock down a (ROOT→V1) and remove kicksOut(a) = {g, h} from bestInEdge.
This breaks both contracted cycles: V1 loses g and V4 loses h.
The surviving entries d (V1→V2) and f (V2→V3) are then locked down in turn.

Final tree: a (ROOT→V1), d (V1→V2), f (V2→V3), with total score 5 + 11 + 5 = 21.
Summing over all non-projective trees
Consider the sentence “John saw Mary” (left).
The Chu-Liu-Edmonds algorithm finds the maximum-
weight spanning tree (right) – may be non-projective.
Can be found in time O(n2).
How about total weight Z of all trees?
Can be found in time O(n3) by matrix determinants
and inverses (Smith & Smith, 2007).
slide thanks to Dragomir Radev
28
Graph Theory to the Rescue!
O(n3) time!
Tutte’s Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need!
29
Building the Kirchhoff (Laplacian) Matrix
Writing s(h, m) for the score of the edge h → m, with node 0 the root:
• Negate edge scores: start from the matrix with entries −s(h, m)
• Sum columns (children): set each diagonal entry to the total score entering that child
• Strike root row/col.
• Take determinant

The resulting matrix K, with rows and columns over the non-root words 1…n:
K[h][m] = −s(h, m)  for h ≠ m
K[m][m] = Σ_{h≠m} s(h, m)

N.B.: This allows multiple children of root, but see Koo et al. 2007.
30
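The construction above can be sanity-checked on a tiny graph. The sketch below builds the root-struck Kirchhoff matrix with the convention just described (diagonal = total weight into each child, off-diagonal = negated edge score), takes its determinant by cofactor expansion, and compares against brute-force enumeration of the three spanning trees; the edge weights are arbitrary:

```python
def determinant(M):
    """Determinant by cofactor expansion along row 0 (tiny matrices only)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] *
               determinant([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def kirchhoff_minor(w, nodes, root):
    """Kirchhoff (Laplacian) matrix with the root row/column struck out.

    w[h][m] is the multiplicative weight of edge h -> m.
    """
    others = [v for v in nodes if v != root]
    K = [[0.0] * len(others) for _ in others]
    for i, h in enumerate(others):
        for j, m in enumerate(others):
            if h == m:   # diagonal: total weight entering child m
                K[i][j] = sum(w.get(u, {}).get(m, 0.0) for u in nodes if u != m)
            else:        # off-diagonal: negated edge weight
                K[i][j] = -w.get(h, {}).get(m, 0.0)
    return K

# Nodes {0 (root), 1, 2}; arbitrary positive edge weights.
w = {0: {1: 2.0, 2: 3.0}, 1: {2: 5.0}, 2: {1: 7.0}}
Z = determinant(kirchhoff_minor(w, [0, 1, 2], root=0))
# The three spanning trees rooted at 0: {0->1, 0->2}, {0->1, 1->2}, {0->2, 2->1}.
brute_Z = 2.0 * 3.0 + 2.0 * 5.0 + 3.0 * 7.0
```

Both computations agree, and the determinant route stays O(n3) while the number of trees grows exponentially.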
Why Should This Work?
Clear for 1×1 matrix; use induction
Chu-Liu-Edmonds analogy:
Every node selects best parent
If cycles, contract and recurse
|K| = |K with edge (1,2) contracted| + s(1,2) · |K({1,2} | {1,2})|
Undirected case; special root cases for directed
31
Graph Deletion & Contraction
Important fact: κ(G) = κ(G−{e}) + κ(G\{e}), i.e., the spanning trees of G split into those that do not use edge e (delete e) and those that do (contract e).
1