
Syntax with Dependency Grammar
ANLP: Week 7, Unit 1
Shay Cohen
Based on slides from ANLP 2019
Lexicalization, again
We saw that adding the lexical head of each phrase can help choose the right parse:
(S-saw
  (NP-kids kids)
  (VP-saw
    (VP-saw (V-saw saw) (NP-birds birds))
    (PP-fish (P-with with) (NP-fish fish))))
Dependency syntax focuses on the head-dependent relationships.
Dependency trees
A valid dependency tree for a sentence requires:
• A single distinguished root word.
• All other words have exactly one incoming edge.
• A unique path from the root to each other word.
(Example: dependency trees drawn over "kids saw birds with fish" and "kids saw birds with binoculars".)
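Because every word has exactly one incoming edge, a parse can be stored as a simple array of head indices. Here is a minimal sketch (assumed code, not from the slides) that checks the three conditions above in that representation:

```python
def is_valid_tree(heads):
    """heads[i] = head of word i; words are numbered 1..n-1, 0 stands for root.
    One head per word is guaranteed by the representation itself."""
    n = len(heads)
    roots = [i for i in range(1, n) if heads[i] == 0]
    if len(roots) != 1:                  # a single distinguished root word
        return False
    for i in range(1, n):                # every head index must be in range
        if not 0 <= heads[i] < n:
            return False
    for i in range(1, n):                # a unique path from root to each word
        seen, j = set(), i
        while j != 0:                    # follow head links upward
            if j in seen:                # revisiting a node means a cycle
                return False
            seen.add(j)
            j = heads[j]
    return True

# "kids saw birds with fish": saw is the root; kids and birds attach to saw,
# fish attaches to birds, and with attaches to fish.
print(is_valid_tree([None, 2, 0, 2, 5, 3]))   # True
```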
Dependency syntax
An alternative approach to sentence structure.
• A fully lexicalized formalism: no phrasal categories.
• Assumes binary, asymmetric grammatical relations between words: head-dependent relations, shown as directed edges:

kids saw birds with fish

• Here, edges point from heads to their dependents.

It really is a tree!
• The usual way to show dependency trees is with edges drawn over the ordered sentence:

kids saw birds with fish

• But the edge structure (without word order) can also be shown as a more obvious tree:

saw
├── kids
└── birds
    └── fish
        └── with
Labelled dependencies
It is often useful to distinguish different kinds of head → modifier relations, by labelling edges:
ROOT → saw; NSUBJ(saw → kids); DOBJ(saw → birds); NMOD(birds → fish); CASE(fish → with)

kids saw birds with fish
• Historically, different treebanks/languages used different labels.
• Now, the Universal Dependencies project aims to standardize labels and annotation conventions, bringing together annotated corpora from over 50 languages.
• Labels in this example (and in the textbook) are from UD.
Why dependencies??
Consider these sentences. Two ways to say the same thing:

(S (NP Sasha) (VP (V gave) (NP the girl) (NP a book)))
(S (NP Sasha) (VP (V gave) (NP a book) (PP to the girl)))

• We only need a few phrase structure rules:
  S → NP VP
  VP → V NP NP
  VP → V NP PP
  plus rules for NP and PP.

Equivalent sentences in Russian
• Russian uses morphology to mark relations between words:
  • knigu means book (kniga) as a direct object.
  • devochke means girl (devochka) as an indirect object (to the girl).
• So we can have the same word orders as English:
  • Sasha dal devochke knigu
  • Sasha dal knigu devochke
• But also many others!
  • Sasha devochke dal knigu
  • Devochke dal Sasha knigu
  • Knigu dal Sasha devochke
Phrase structure vs dependencies
• In languages with free word order, phrase structure (constituency) grammars don't make as much sense.
  • E.g., we would need both S → NP VP and S → VP NP, etc. Not very informative about what's really going on.
• In contrast, the dependency relations stay constant:

Sasha dal devochke knigu / Sasha dal knigu devochke
(In both orders: ROOT → dal; NSUBJ(dal → Sasha); IOBJ(dal → devochke); DOBJ(dal → knigu).)
Phrase structure vs dependencies
• Even more obvious if we just look at the trees without word order: both orders give the same tree:

dal
├── Sasha (NSUBJ)
├── devochke (IOBJ)
└── knigu (DOBJ)
Pros and cons
• Sensible framework for free word order languages.
• Identifies syntactic relations directly. (Using a CFG, how would you identify the subject of a sentence?)
• Dependency pairs/chains can make good features in classifiers, for information extraction, etc.
• Parsers can be very fast (coming up…)

But
• The assumption of asymmetric binary relations isn't always right… e.g., how to parse dogs and cats?
Lexicalized Constituency Parse
(S-saw
  (NP-kids kids)
  (VP-saw
    (V-saw saw)
    (NP-birds
      (NP-birds birds)
      (PP-fish (P-with with) (NP-fish fish)))))
How do we annotate dependencies?
Two options:
1. Annotate dependencies directly.
2. Convert phrase structure annotations to dependencies. (Convenient if we already have a phrase structure treebank.)
Next slides show how to convert, assuming we have head-finding rules for our phrase structure trees.

. . . remove the phrasal categories. . .

saw
├── kids
│   └── kids
└── saw
    ├── saw
    │   └── saw
    └── birds
        ├── birds
        │   └── birds
        └── fish
            ├── with
            │   └── with
            └── fish
                └── fish
. . . remove the (duplicated) terminals. . .

saw
├── kids
└── saw
    ├── saw
    └── birds
        ├── birds
        └── fish
            ├── with
            └── fish
. . . and collapse chains of duplicates. . .

saw
├── kids
└── saw
    ├── saw
    └── birds
        ├── birds
        └── fish
            └── with
. . . and collapse chains of duplicates. . .

saw
├── kids
└── saw
    ├── saw
    └── birds
        └── fish
            └── with
…done!

saw
├── kids
└── birds
    └── fish
        └── with

Constituency Tree → Dependency Tree
We saw how the lexical head of the phrase can be used to collapse down to a dependency tree:

(S-saw
  (NP-kids kids)
  (VP-saw
    (VP-saw (V-saw saw) (NP-birds birds))
    (PP-binoculars (P-with with) (NP-binoculars binoculars))))

• But how can we find each phrase's head in the first place?
Head Rules
The standard solution is to use head rules: for every non-unary (P)CFG production, designate one RHS nonterminal as containing the head, e.g. the VP in S → NP VP, the first VP in VP → VP PP, and the NP in PP → P NP (content head), etc.

(S (NP kids) (VP (VP (V saw) (NP birds)) (PP (P with) (NP binoculars))))

• Heuristics to scale this to large grammars: e.g., within an NP, the last immediate N child is the head.
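The whole head-rule-based conversion fits in a few lines. The sketch below is illustrative only (the tree encoding, rule table, and function names are invented for this example, and it assumes each word appears once):

```python
# A tree is (label, [children]); preterminals have a single string child.
HEAD_RULES = {                       # which RHS child contains the head
    ("S", ("NP", "VP")): 1,
    ("VP", ("VP", "PP")): 0,
    ("VP", ("V", "NP")): 0,
    ("PP", ("P", "NP")): 1,          # content head, as in UD
    ("NP", ("NP", "PP")): 0,
}

def head_word(tree):
    """Find the lexical head of a subtree by following head rules down."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    if len(children) == 1:                       # unary: head is below
        return head_word(children[0])
    child_labels = tuple(c[0] for c in children)
    return head_word(children[HEAD_RULES[(label, child_labels)]])

def dependencies(tree, deps=None):
    """Each non-head child's head word becomes a dependent of the
    production's head word."""
    if deps is None:
        deps = []
    if isinstance(tree, str):
        return deps
    _label, children = tree
    h = head_word(tree)
    for c in children:
        if not isinstance(c, str) and head_word(c) != h:
            deps.append((h, head_word(c)))
        dependencies(c, deps)
    return deps

tree = ("S", [("NP", ["kids"]),
              ("VP", [("VP", [("V", ["saw"]), ("NP", ["birds"])]),
                      ("PP", [("P", ["with"]), ("NP", ["binoculars"])])])])
print(dependencies(tree))
# [('saw', 'kids'), ('saw', 'birds'), ('saw', 'binoculars'), ('binoculars', 'with')]
```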
Head Rules

Then, propagate heads up the tree, bottom-up, until every node is annotated:

(S-saw
  (NP-kids kids)
  (VP-saw
    (VP-saw (V-saw saw) (NP-birds birds))
    (PP-binoculars (P-with with) (NP-binoculars binoculars))))
Projectivity
If we convert constituency parses to dependencies, all the resulting trees will be projective:
• Every subtree (a node and all its descendants) occupies a contiguous span of the sentence.
• = the parse can be drawn over the sentence with no crossing edges.

(Projective example: "A hearing on the issue is scheduled today", with edges ROOT, SBJ, VC, TMP, PC, and ATT drawn above the sentence; none of them cross.)

Nonprojectivity
But some sentences are nonprojective:

(Nonprojective example: "A hearing is scheduled on the issue today". The edge attaching "on the issue" to "hearing" crosses the edge attaching "today" to "scheduled".)

• We'll only get these annotations right if we directly annotate the sentences (or correct the converted parses).
• Nonprojectivity is rare in English, but common in many languages.
• Nonprojectivity presents problems for parsing algorithms (a simple crossing-edge test is sketched below).
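Projectivity is easy to test mechanically: two edges drawn above the sentence cross iff exactly one endpoint of one edge lies strictly inside the other's span. A small sketch (assumed code), reusing the head-array representation from earlier:

```python
def is_projective(heads):
    """heads[i] = head of word i (words 1..n, 0 = root); heads[0] = None."""
    edges = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h is not None]
    for a, b in edges:
        for c, d in edges:
            if a < c < b < d:   # (c,d) starts inside (a,b) but ends outside
                return False
    return True

# "A hearing on the issue is scheduled today" (projective):
print(is_projective([None, 2, 6, 2, 5, 3, 0, 6, 7]))   # True

# "A hearing is scheduled on the issue today" (nonprojective:
# hearing -> on crosses scheduled -> today):
print(is_projective([None, 2, 3, 0, 3, 2, 7, 5, 4]))   # False
```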
Dependency Parsing
Some of the algorithms you have seen for PCFGs can be adapted to dependency parsing.
• CKY can be adapted, though efficiency is a concern: the obvious approach is O(Gn⁵); the Eisner algorithm brings it down to O(Gn³).
• N. Smith's slides explaining the Eisner algorithm: http://courses.cs.washington.edu/courses/cse517/16wi/slides/an-dep-slides.pdf
• Shift-reduce: more efficient, doesn't even require a grammar!
Transition-based Dependency Parsing
The arc-standard approach parses an input sentence w1 . . . wN using two types of reduce actions (three actions altogether):
• Shift: read the next word wi from the input and push it onto the stack.
• LeftArc: assign head-dependent relation s2 ← s1; pop s2.
• RightArc: assign head-dependent relation s2 → s1; pop s1.
Here s1 and s2 are the top and second item on the stack, respectively. (So, s2 preceded s1 in the input sentence.)
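As a concrete sketch of these three actions (assumed code, not the lecture's implementation), the following applies a given action sequence to a sentence; deciding which action to take at each step is the job of a classifier, discussed later:

```python
def arc_standard(words, actions):
    """Apply a sequence of arc-standard actions; return (head, dependent) arcs."""
    stack = [(0, "root")]                      # (position, word) pairs
    buffer = list(enumerate(words, start=1))   # remaining input
    arcs = []
    for action in actions:
        if action == "shift":                  # push next input word
            stack.append(buffer.pop(0))
        elif action == "leftarc":              # s2 <- s1: s1 is the head; pop s2
            arcs.append((stack[-1], stack[-2]))
            del stack[-2]
        elif action == "rightarc":             # s2 -> s1: s2 is the head; pop s1
            arcs.append((stack[-2], stack[-1]))
            stack.pop()
    return arcs

# The same trace as the "Kim saw Sandy" example later in this lecture:
actions = ["shift", "shift", "leftarc", "shift", "rightarc", "rightarc"]
for (hp, head), (dp, dep) in arc_standard(["Kim", "saw", "Sandy"], actions):
    print(head, "->", dep)     # saw -> Kim, saw -> Sandy, root -> saw
```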
Recall: shift-reduce parser with CFG
• Same example grammar and sentence.
• Operations:
  • Shift (S)
  • Reduce (R)
  • Backtrack to step n (Bn)
• Note that at steps 9 and 11 we skipped over backtracking to 7 and 5 respectively, as there were actually no choices to be made at those points.
Step  Op.  Stack        Input
0                       the dog bit
1     S    the          dog bit
2     R    DT           dog bit
3     S    DT dog       bit
4     R    DT V         bit
5     R    DT VP        bit
6     S    DT VP bit
7     R    DT VP V
8     R    DT VP VP
9     B6   DT VP bit
10    R    DT VP NN
11    B4   DT V         bit
12    S    DT V bit
13    R    DT V V
14    R    DT V VP
15    B3   DT dog       bit
16    R    DT NN        bit
17    R    NP           bit

Example
Parsing Kim saw Sandy:

Step  Stack (bottom → top)  Word List        Action    Relations
0     [root]                [Kim,saw,Sandy]  Shift
1     [root,Kim]            [saw,Sandy]      Shift
2     [root,Kim,saw]        [Sandy]          LeftArc   Kim ← saw
3     [root,saw]            [Sandy]          Shift
4     [root,saw,Sandy]      []               RightArc  saw → Sandy
5     [root,saw]            []               RightArc  root → saw
6     [root]                []               (done)
• Here, the top two words on the stack are also always adjacent in the sentence. Not true in general! (See the longer example in JM3.)
Labelled dependency parsing
• These parsing actions produce unlabelled dependencies (left).
• For labelled dependencies (right), just use more actions: LeftArc(NSUBJ), RightArc(NSUBJ), LeftArc(DOBJ), . . .

(Figure: Kim saw Sandy parsed twice; unlabelled arcs on the left, the same arcs labelled ROOT, NSUBJ, DOBJ on the right.)
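Extending the earlier arc-standard sketch to labelled parsing is a small change: each reduce action carries a relation label, which is stored on the arc (again, assumed illustrative code):

```python
def arc_standard_labelled(words, actions):
    """Like arc_standard above, but each arc action carries a relation label."""
    stack = [(0, "root")]
    buffer = list(enumerate(words, start=1))
    arcs = []
    for action, label in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "leftarc":              # s1 heads s2, with a label
            arcs.append((stack[-1], stack[-2], label))
            del stack[-2]
        elif action == "rightarc":             # s2 heads s1, with a label
            arcs.append((stack[-2], stack[-1], label))
            stack.pop()
    return arcs

actions = [("shift", None), ("shift", None), ("leftarc", "NSUBJ"),
           ("shift", None), ("rightarc", "DOBJ"), ("rightarc", "ROOT")]
for (hp, head), (dp, dep), label in arc_standard_labelled(["Kim", "saw", "Sandy"], actions):
    print(f"{label}({head} -> {dep})")
# NSUBJ(saw -> Kim), DOBJ(saw -> Sandy), ROOT(root -> saw)
```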
Notions of validity
• In constituency parsing, valid parse = grammatical parse.
  • That is, we first define a grammar, then use it for parsing.
• In dependency parsing, we don't normally define a grammar.
  • Valid parses are those with the properties listed on the Dependency trees slide (a single root, one head per word, a unique path from the root to each word).
Differences to constituency parsing
• Shift-reduce parser for CFG: not all sequences of actions lead to valid parses. Choose an incorrect action → may need to backtrack.
• Here, all valid action sequences lead to valid parses.
  • Invalid actions: can't apply LeftArc with root as dependent; can't apply RightArc with root as head unless the input is empty.
  • Other actions may lead to incorrect parses, but still valid ones.
• So, the parser doesn't backtrack. Instead, it tries to greedily predict the correct action at each step.
• Therefore, dependency parsers can be very fast (linear time).
• But we need a good way to predict correct actions (next lecture).

Summary: Transition-based Parsing
• The arc-standard approach is based on a simple shift-reduce idea.
• Can do labelled or unlabelled parsing, but need to train a classifier to predict the next action, as we'll see next time.
• Greedy algorithm means time complexity is linear in sentence length.
• Only finds projective trees (without special extensions).
• Pioneering system: Nivre's MaltParser.
Alternative: Graph-based Parsing
• Global algorithm: from the fully connected directed graph of all possible edges, choose the best ones that form a tree.
• Edge-factored models: a classifier assigns a nonnegative score to each possible edge; a maximum spanning tree algorithm finds the spanning tree with the highest total score in O(n²) time (see the sketch after this list).
• Pioneering work: McDonald's MSTParser.
• Can be formulated as constraint satisfaction with integer linear programming (Martins's TurboParser).
• Details in JM3, Ch 16.5 (optional).
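To make the edge-factored idea concrete, here is a small sketch (assumed code) that scores every candidate head → dependent edge and extracts the maximum spanning arborescence using networkx's Chu-Liu/Edmonds implementation. The scores are invented; in a real parser they would come from a trained classifier:

```python
import networkx as nx

words = ["ROOT", "Kim", "saw", "Sandy"]
scores = {                       # (head, dependent) -> score, made up here
    (0, 1): 1.0, (0, 2): 10.0, (0, 3): 1.0,
    (2, 1): 9.0, (2, 3): 8.0,
    (1, 2): 3.0, (3, 2): 3.0,
    (1, 3): 2.0, (3, 1): 2.0,
}

G = nx.DiGraph()
for (h, d), s in scores.items():         # no edges into ROOT (node 0)
    G.add_edge(h, d, weight=s)

tree = nx.maximum_spanning_arborescence(G)   # Chu-Liu/Edmonds
for h, d in sorted(tree.edges()):
    print(words[h], "->", words[d])
# ROOT -> saw
# saw -> Kim
# saw -> Sandy
```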
Choosing a Parser: Criteria
• Target representation: constituency or dependency?
• Efficiency? In practice, both runtime and memory use.
• Incrementality: parse the whole sentence at once, or obtain partial left-to-right analyses/expectations?
• Retrainable system?
• Accuracy?
Graph-based vs. Transition-based vs. Conversion-based
• TB: features in the scoring function can look at any part of the stack; no optimality guarantees for search; linear-time; (classically) projective only.
• GB: features in the scoring function limited by the factorization; optimal search within that model; quadratic-time; no projectivity constraint.
• CB: in terms of accuracy, it is sometimes best to first constituency-parse, then convert to dependencies (e.g., Stanford Parser). Slower than direct methods.

Summary
• Constituency syntax: hierarchically nested phrases with categories like NP.
• Dependency syntax: trees whose edges connect words in the sentence. Edges are often labelled with relations like nsubj.
• Can convert a constituency parse to a dependency parse using head rules.
• For projective trees, transition-based parsing is very fast and can be very accurate.
• Google "online dependency parser". Try out the Stanford parser and SEMAFOR!
Probabilistic transition-based dep’y parsing
At each step in parsing we have:
• Current configuration: the stack state, the input buffer, and the dependency relations found so far.
• Possible actions: e.g., Shift, LeftArc, RightArc.
A probabilistic parser assumes we also have a model that tells us P(action | configuration). Then:
• Choosing the most probable action at each step (greedy parsing) produces a parse in linear time.
• But it might not be the best one: choices made early could lead to a worse overall parse.
Beam search: basic idea
• Instead of choosing only the best action at each step, choose a few of the best.
• Extend previous partial parses using these options.
• At each time step, keep a fixed number of best options, discard anything else.

Advantages:
• May find a better overall parse than greedy search,
• While using less time/memory than exhaustive search.
Recap: parsing as search
The parser is searching through a very large space of possible parses.
• Greedy parsing is a depth-first strategy.
• Beam search is a limited breadth-first strategy.

(Figure: a branching search tree of partial parses; the root S expands into alternatives such as S → aux NP VP and S → NP VP, each of which expands further.)
The agenda
An ordered list of configurations (parser state + parse so far).
• Items are ordered by score: how good a configuration is it?
• Implemented using a priority queue data structure, which efficiently inserts items into the ordered list.
• In beam search, we use an agenda with a fixed size (the beam width). If new high-scoring items are inserted, items that fall below the beam width are discarded (see the sketch below).

We won't discuss the scoring function here; but the beam search idea is used across NLP (e.g., in best-first constituency parsing and neural network models).
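One beam-search step can be sketched as a priority-queue operation: expand every configuration on the agenda, then keep only the beam_width best successors. The configurations, expansion function, and scores below are toy stand-ins (assumed code):

```python
import heapq

def beam_step(agenda, expand, score, beam_width):
    """Expand every configuration on the agenda, keep the beam_width best."""
    successors = [s for config in agenda for s in expand(config)]
    return heapq.nlargest(beam_width, successors, key=score)

# Toy configurations: (action history, total log-probability).
expand = lambda c: [(c[0] + [a], c[1] + lp)
                    for a, lp in [("shift", -0.1), ("leftarc", -0.5), ("rightarc", -0.9)]]
score = lambda c: c[1]

agenda = [([], 0.0)]
for _ in range(3):                       # three steps with beam width 2
    agenda = beam_step(agenda, expand, score, beam_width=2)
print(agenda[0])                         # best-scoring history after three steps
```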
Evaluating dependency parsers
• How do we know if beam search is helping?
• As usual, we can evaluate against a gold standard data set. But what evaluation measure should we use?
Evaluating dependency parsers
• By construction, the number of dependencies is the same as the number of words in the sentence.
• So we do not need to worry about precision and recall, just plain old accuracy.
• Labelled Attachment Score (LAS): proportion of words where we predicted the correct head and label.
• Unlabelled Attachment Score (UAS): proportion of words where we predicted the correct head, regardless of label.
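Both scores reduce to simple counting over per-word (head, label) pairs. A short sketch (assumed code):

```python
def attachment_scores(gold, pred):
    """gold/pred: one (head, label) pair per word. Returns (UAS, LAS)."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)   # heads only
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)         # heads + labels
    return uas, las

# Kim saw Sandy: the prediction gets every head right but one label wrong.
gold = [(2, "NSUBJ"), (0, "ROOT"), (2, "DOBJ")]
pred = [(2, "NSUBJ"), (0, "ROOT"), (2, "IOBJ")]
print(attachment_scores(gold, pred))     # (1.0, 0.666...)
```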