Dependency grammars
- Based on syntactic relations between a head word and a dependent word
- “Shallower” in hierarchical structure than a phrase-structure (constituency) analysis
Head and dependents
- The head sets the syntactic category of the construction: for example, nouns are the heads of noun phrases, and verbs are the heads of verb phrases.
- The modifier may be optional while the head is mandatory: for example, in the sentence Cats scratch people with claws, the subtrees cats scratch and cats scratch people are grammatical sentences, but with claws is not.
- The head determines the morphological form of the modifier: for example, in languages that require gender agreement, the gender of the noun determines the gender of the adjectives and determiners.
- Edges should first connect content words, and then connect function words.
Relationship between phrase structures and dependency structures
[Figure: (a) the lexicalized constituency parse and (b) the unlabeled dependency tree for the sentence The cats scratch people with claws.]
Figure 11.1: Dependency grammar is closely linked to lexicalized context-free grammars: each lexical head has a dependency path to every other word in the constituent. (This example is based on the lexicalization rules from § 10.5.2, which make the preposition the head of a prepositional phrase. In the more contemporary Universal Dependencies annotations, the head of with claws would be claws, so there would be an edge scratch → claws.)
In the noun phrase the cats, the noun cats occupies the central position, with the word the playing a supporting role.
The relationships between words in a sentence can be formalized in a directed graph, based on the lexicalized phrase-structure parse: create an edge (i, j) iff word i is the head of a phrase whose child is a phrase headed by word j. Thus, in our example, we would have scratch → cats and cats → the. We would not have the edge scratch → the, because although S(scratch) dominates DT(the) in the phrase-structure parse tree, it is not its immediate parent. These edges describe syntactic dependencies, a bilexical relationship between a head and a dependent, which is at the heart of dependency grammar.

Continuing to build out this dependency graph, we will eventually reach every word in the sentence, as shown in Figure 11.1b. In this graph — and in all graphs constructed in this way — every word has exactly one incoming edge, except for the root word, which is indicated by a special incoming arrow from above. Furthermore, the graph is weakly connected: if the directed edges were replaced with undirected edges, there would be a path between every pair of words.
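This construction is mechanical enough to sketch in code. The Node class below is a hypothetical stand-in for a lexicalized parse tree in which every node records the index of its head word; it is not the data structure of any particular parser.

```python
# A minimal sketch of extracting dependency edges from a lexicalized
# constituency parse, following the rule above: add (i, j) whenever word i
# heads a phrase whose immediate child is a phrase headed by word j != i.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    head: int                                   # index of the lexical head word
    children: List["Node"] = field(default_factory=list)

def dependency_edges(node: Node) -> List[Tuple[int, int]]:
    edges = []
    for child in node.children:
        if child.head != node.head:             # head changes -> dependency edge
            edges.append((node.head, child.head))
        edges.extend(dependency_edges(child))
    return edges

# Figure 11.1a, with words indexed the=0 cats=1 scratch=2 people=3 with=4 claws=5:
tree = Node(2, [                                # S(scratch)
    Node(1, [Node(0), Node(1)]),                #   NP(cats) -> DT(the) NNS(cats)
    Node(2, [Node(2),                           #   VP(scratch) -> VB(scratch)
             Node(3, [Node(3)]),                #     NP(people)
             Node(4, [Node(4),                  #     PP(with) -> IN(with)
                      Node(5, [Node(5)])])])])  #       NP(claws)
print(dependency_edges(tree))                   # [(2, 1), (1, 0), (2, 3), (2, 4), (4, 5)]
```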
Labeled dependencies

[Figure: a labeled dependency parse of Abigail and Max like kimchi but not jook, using the relations ROOT, NSUBJ, CONJ, CC, OBJ, and ADVMOD.]
Figure 11.2: In the Universal Dependencies annotation system, the left-most item of a coordination is the head.
[Figure: a labeled dependency parse of I know New York pizza, and this is not it !!, using the relations ROOT, OBJ, COMPOUND, CC, CONJ, NSUBJ, COP, ADVMOD, and PUNCT.]

Figure 11.3: A labeled dependency parse from the English UD Treebank (reviews-361348-0006)
In Figure 11.2, Abigail is the subject of like, and kimchi is the object.³ The negation not is treated as an adverbial modifier (ADVMOD) on the noun jook.
A slightly more complex example is shown in Figure 11.3. The multiword expression New York pizza is treated as a “flat” unit of text, with the elements linked by the COMPOUND relation. The sentence includes two clauses that are conjoined in the same way that noun phrases are conjoined in Figure 11.2. The second clause contains a copula verb (see § 8.1.1). For such clauses, we treat the “object” of the verb as the root — in this case, it — and label the verb as a dependent, with the COP relation. This example also shows how punctuation is treated, with the label PUNCT.
11.1.3 Dependency subtrees and constituents
Dependency trees hide information that would be present in a CFG parse. Often what is hidden is in fact irrelevant.
³ Earlier work distinguished direct and indirect objects (De Marneffe and Manning, 2008), but this has been dropped in version 2.0 of the Universal Dependencies annotation system.
Projectivity
- Projectivity: an edge from i to j is projective iff all k between i and j are descendants of i. A dependency parse is projective iff all its edges are projective.
- Informally, a dependency parse is projective if there are no crossing edges when all dependencies are drawn on one side of the sentence.

Language    % non-projective edges    % non-projective sentences
Czech       1.86%                     22.42%
English     0.39%                     7.63%
German      2.33%                     28.19%

Table 11.1: Frequency of non-projective dependencies in three languages (Kuhlmann and Nivre, 2010)
[Figure: a non-projective dependency parse of Lucia ate a pizza yesterday which was vegetarian, using the relations ROOT, NSUBJ, OBJ, DET, OBL:TMOD, ACL:RELCL, and COP.]
Figure 11.5: An example of a non-projective dependency parse. The “crossing edge” arises from the relative clause which was vegetarian and the oblique temporal modifier yesterday.
Definition 2 (Projectivity). An edge from i to j is projective iff all k between i and j are descendants of i. A dependency parse is projective iff all its edges are projective.
Figure 11.5 gives an example of a non-projective dependency graph in English. This dependency graph does not correspond to any constituent parse. As shown in Table 11.1, non-projectivity is more common in languages such as Czech and German. Even though relatively few dependencies are non-projective in these languages, many sentences have at least one such dependency. As we will soon see, projectivity has important algorithmic consequences.
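Definition 2 translates almost directly into code. A minimal sketch, assuming the parse is given as a list heads in which heads[j] is the index of the head of word j and the root's entry is -1; both the representation and the function names are illustrative, not from any library.

```python
def descendants(heads, i):
    """Words whose chain of heads passes through (or starts at) word i."""
    result = set()
    for j in range(len(heads)):
        k = j
        while k != -1 and k != i:
            k = heads[k]
        if k == i:
            result.add(j)
    return result

def is_projective(heads):
    """Definition 2: an edge from i to j is projective iff every word strictly
    between i and j is a descendant of i; a parse is projective iff all of
    its edges are."""
    for j, i in enumerate(heads):
        if i == -1:                      # skip the root, which has no head
            continue
        lo, hi = sorted((i, j))
        desc = descendants(heads, i)
        if any(k not in desc for k in range(lo + 1, hi)):
            return False
    return True

# One head-index encoding of Figure 11.5 (Lucia ate a pizza yesterday which
# was vegetarian); the arc pizza -> vegetarian crosses over yesterday:
heads = [1, -1, 3, 1, 1, 7, 7, 3]
print(is_projective(heads))              # False
```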
11.2 Graph-based dependency parsing
Let y = {(i →r j)} represent a dependency graph, in which each edge is a relation r from head word i to modifier j. The special node ROOT indicates the root of the graph.
Main dependency parsing approaches
- Graph-based approaches
- Transition-based approaches
Graph-based approach
- Let y = {(i →r j)} represent a dependency graph, in which r is a relation from head word i ∈ {1, 2, ..., M, ROOT} to modifier j ∈ {1, 2, ..., M}. The special node ROOT indicates the root of the graph, and M is the length of the input |w|.
- Given a scoring function ψ(y, w; θ), the predicted parse is

  \hat{y} = \arg\max_{y \in \mathcal{Y}(w)} \psi(y, w; \theta)

  where Y(w) is the set of valid dependency parses of the input w.
- The number of possible parses |Y(w)| is exponential in the length of the input.
- Algorithms that search over this space of possible graphs are known as graph-based dependency parsers.
Factorization
Assume that the score of a dependency graph decomposes into a sum of local scores for its arcs. Dependency parsers that operate under this assumption are called arc-factored.
- First-order factorization (sketched in code after this slide):

  \psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w; \theta)

- Second-order factorization, which adds grandparent and sibling scores:

  \psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \Big[ \psi(i \xrightarrow{r} j, w; \theta)
      + \sum_{k \xrightarrow{r'} i \in y} \psi_{\text{grandparent}}(i \xrightarrow{r} j, k, r', w; \theta)
      + \sum_{i \xrightarrow{r'} s \in y,\; s \neq j} \psi_{\text{sibling}}(i \xrightarrow{r} j, s, r', w; \theta) \Big]
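Under the first-order factorization, scoring a graph is just a sum of arc scores. A minimal sketch, where arcs is a collection of (head, relation, modifier) triples and arc_score is any implementation of the local score ψ; both names are illustrative.

```python
def graph_score(arcs, words, arc_score):
    """Arc-factored score of a dependency graph: the sum of the scores of
    its arcs, psi(y, w) = sum over (i -r-> j) in y of psi(i -r-> j, w)."""
    return sum(arc_score(head, rel, mod, words) for head, rel, mod in arcs)
```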
Computing scores for dependency arcs
- Linear: ψ(i →r j, w; θ) = θ · f(i →r j, w)
- Neural: ψ(i →r j, w; θ) = FeedForward([u_{w_i}; u_{w_j}]; θ)
- Generative: ψ(i →r j, w; θ) = log P(w_j, r | w_i)
Linear feature-based arc scores
- The length and direction of the arc;
- The words w_i and w_j linked by the dependency relation;
- The prefixes, suffixes, and parts-of-speech of these words;
- The neighbors of the dependency arc, w_{i-1}, w_{i+1}, w_{j-1}, w_{j+1};
- The prefixes, suffixes, and parts-of-speech of these neighbor words (a feature function along these lines is sketched below).
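These templates can be sketched as a function from an arc to a sparse feature vector, represented here as a dict of string keys with value 1. The template set is illustrative rather than the feature set of any particular parser, and handling of the ROOT pseudo-token is omitted.

```python
def arc_features(i, r, j, words, tags):
    """Sparse features for the arc i -r-> j: arc length and direction, the
    linked words, their affixes and parts-of-speech, and the arc's neighbors."""
    def tok(k):                                  # pad out-of-range neighbors
        return words[k] if 0 <= k < len(words) else "<PAD>"
    return {
        f"dist={j - i}": 1,                                      # length and direction
        f"head={words[i]}_rel={r}": 1,                           # head word
        f"mod={words[j]}_rel={r}": 1,                            # modifier word
        f"head-pos={tags[i]}_mod-pos={tags[j]}_rel={r}": 1,      # parts-of-speech
        f"head-suffix={words[i][-3:]}_rel={r}": 1,               # affixes
        f"mod-prefix={words[j][:3]}_rel={r}": 1,
        f"left-of-head={tok(i - 1)}_rel={r}": 1,                 # neighbors of the arc
        f"right-of-mod={tok(j + 1)}_rel={r}": 1,
    }
```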
Learning a linear model with Perceptron
- For a model with feature-based arc scores and a perceptron loss, we obtain the usual structured perceptron update (sketched in code below).
- Find the tree with the highest score under the current weights:

  \hat{y} = \arg\max_{y' \in \mathcal{Y}(w)} \theta \cdot f(w, y')

- Update the weights:

  \theta \leftarrow \theta + f(w, y) - f(w, \hat{y})
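A sketch of one training step, assuming the arc_features function above, a decode function that returns the highest-scoring parse under the current weights (for example, by running the Chu-Liu-Edmonds algorithm described at the end of this section), and a defaultdict standing in for the weight vector θ.

```python
from collections import defaultdict

def feature_vector(arcs, words, tags):
    """f(w, y): sum of the sparse arc features over all arcs in a parse."""
    f = defaultdict(float)
    for head, rel, mod in arcs:
        for key, value in arc_features(head, rel, mod, words, tags).items():
            f[key] += value
    return f

def perceptron_update(theta, words, tags, gold_arcs, decode):
    """theta <- theta + f(w, y) - f(w, y_hat), with y_hat decoded under theta."""
    predicted_arcs = decode(words, tags, theta)
    gold = feature_vector(gold_arcs, words, tags)
    pred = feature_vector(predicted_arcs, words, tags)
    for key in set(gold) | set(pred):
        theta[key] += gold.get(key, 0.0) - pred.get(key, 0.0)
```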
Learning a linear model with CRF
- A CRF for arc-factored dependency parsing is built on the probability model:

  P(y \mid w) = \frac{\exp \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w; \theta)}{\sum_{y' \in \mathcal{Y}(w)} \exp \sum_{i \xrightarrow{r} j \in y'} \psi(i \xrightarrow{r} j, w; \theta)}

  so the probability of a dependency graph is its exponentiated score, normalized by the sum of the exponentiated scores of all possible graphs.
- Questions: how do we compute the score of one dependency graph? How do we compute the sum of scores over all possible graphs?
- Such a model is trained to minimize the negative conditional log-likelihood (a toy sketch of this loss follows).
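To make the two questions concrete: the numerator needs only a sum of arc scores, while the denominator sums over every graph in Y(w), which can be enumerated explicitly only for toy inputs (practical parsers compute this sum with specialized algorithms rather than enumeration). A brute-force sketch, where all_parses stands in for Y(w) and arc_score for ψ:

```python
import math

def crf_negative_log_likelihood(gold_arcs, words, all_parses, arc_score):
    """-log P(y|w) for the arc-factored CRF, with the partition function
    computed by explicit enumeration of all candidate parses."""
    def score(arcs):
        return sum(arc_score(h, r, m, words) for h, r, m in arcs)
    log_partition = math.log(sum(math.exp(score(y)) for y in all_parses))
    return -(score(gold_arcs) - log_partition)
```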
Neural arc scores
- Given a vector representation x_i for each word w_i in the input, a set of arc scores can be computed from a feedforward neural network:

  \psi(i \xrightarrow{r} j, w; \theta) = \text{FeedForward}([x_i; x_j]; \theta_r)

  where unique weights θ_r are available for each arc type.
- Specifically, the score of each arc for each relation can be computed as:

  z = g(\Theta_r [x_i; x_j] + b_r^{(z)})
  \psi(i \xrightarrow{r} j) = \beta_r \cdot z + b_r^{(y)}

  where Θ_r is a matrix, β_r is a vector, each b_r is a scalar, and g is an elementwise tanh activation function (sketched in code below).
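A NumPy sketch of this two-layer scorer; the argument names follow the equations above, with Θ_r a matrix over the concatenated input, β_r a vector, and the two biases treated as scalars, per the slide.

```python
import numpy as np

def neural_arc_score(x_i, x_j, Theta_r, b_z_r, beta_r, b_y_r):
    """z = tanh(Theta_r [x_i; x_j] + b_z_r);  psi(i -r-> j) = beta_r . z + b_y_r."""
    z = np.tanh(Theta_r @ np.concatenate([x_i, x_j]) + b_z_r)   # hidden layer
    return float(beta_r @ z + b_y_r)                            # scalar arc score
```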
SOA in neural dependency parsing: Biaffine models
Tutorial: http://www.cse.chalmers.se/~richajo/nlp2019/l7/Biaffine%20dependency%20parsing.html
Biaffine models for dependency parsing
- The input is a sequence of word embeddings concatenated with their POS embeddings:

  x_i = v_i^{(\text{word})} \oplus v_i^{(\text{pos})}

- The input is fed into a BiLSTM to get a sequence of hidden states:

  h_i, c_i = \text{BiLSTM}(r_0, (x_1, \ldots, x_n))_i

- Each hidden state h_i is projected into four distinct vectors:

  h_i^{(\text{arc-dep})} = \text{MLP}^{(\text{arc-dep})}(h_i)
  h_i^{(\text{arc-head})} = \text{MLP}^{(\text{arc-head})}(h_i)
  h_i^{(\text{rel-dep})} = \text{MLP}^{(\text{rel-dep})}(h_i)
  h_i^{(\text{rel-head})} = \text{MLP}^{(\text{rel-head})}(h_i)
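A PyTorch sketch of this encoder; the embedding and hidden dimensions are arbitrary illustrative choices, not the hyperparameters of any published model.

```python
import torch
import torch.nn as nn

class BiaffineEncoder(nn.Module):
    """Word + POS embeddings -> BiLSTM -> four MLP projections per token."""
    def __init__(self, n_words, n_tags, d_emb=100, d_lstm=200, d_arc=200, d_rel=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_emb)
        self.pos_emb = nn.Embedding(n_tags, d_emb)
        self.bilstm = nn.LSTM(2 * d_emb, d_lstm, batch_first=True, bidirectional=True)
        def mlp(d_out):
            return nn.Sequential(nn.Linear(2 * d_lstm, d_out), nn.ReLU())
        self.mlp_arc_dep, self.mlp_arc_head = mlp(d_arc), mlp(d_arc)
        self.mlp_rel_dep, self.mlp_rel_head = mlp(d_rel), mlp(d_rel)

    def forward(self, word_ids, pos_ids):
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)  # x_i
        h, _ = self.bilstm(x)                                                    # h_i
        return (self.mlp_arc_dep(h), self.mlp_arc_head(h),
                self.mlp_rel_dep(h), self.mlp_rel_head(h))
```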
Predicting the arcs with hidden vectors
- Given a dependent, pair it up with each potential head (all other tokens in the sentence), and compute a score (sketched in code below):

  s_i^{(\text{arc})} = H^{(\text{arc-head})} W^{(\text{arc})} h_i^{(\text{arc-dep})} + H^{(\text{arc-head})} b^{(\text{arc})}

- The predicted head is the candidate with the highest score:

  y_i'^{(\text{arc})} = \arg\max_j s_{ij}^{(\text{arc})}
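A sketch of the arc scorer for a single sentence, where H^(arc-head) and H^(arc-dep) stack the projected vectors of all n tokens; the shapes are assumptions. Entry [i, j] of the result scores token j as the head of dependent i, so predicted heads are a row-wise argmax.

```python
import torch

def arc_scores(H_arc_head, H_arc_dep, W_arc, b_arc):
    """s^(arc): H_arc_head, H_arc_dep are (n, d); W_arc is (d, d); b_arc is (d,)."""
    bilinear = H_arc_dep @ W_arc @ H_arc_head.T     # (n, n): dependent x candidate head
    head_prior = (H_arc_head @ b_arc).unsqueeze(0)  # (1, n): per-head bias, broadcast
    return bilinear + head_prior

# predicted_heads = arc_scores(H_arc_head, H_arc_dep, W_arc, b_arc).argmax(dim=1)
```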
Predicting relations with hidden states
- Given a predicted head y_i'^{(arc)} for word i, perform another biaffine transformation to predict the relation label.
- First use the relation vectors to compute a score for each possible label:

  s_i^{(\text{rel})} = h_{y_i'^{(\text{arc})}}^{(\text{rel-head})\top} U^{(\text{rel})} h_i^{(\text{rel-dep})} + W^{(\text{rel})} \big( h_i^{(\text{rel-dep})} \oplus h_{y_i'^{(\text{arc})}}^{(\text{rel-head})} \big) + b^{(\text{rel})}

  What's the shape of U^{(rel)}?
- Find the relation label with the highest score (sketched in code below):

  y_i'^{(\text{rel})} = \arg\max_j s_{ij}^{(\text{rel})}
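A sketch of the relation scorer for one dependent and its predicted head. Under these assumed shapes, U^(rel) is a third-order tensor with one d × d slice per relation label, which is one answer to the shape question above.

```python
import torch

def relation_scores(h_head, h_dep, U_rel, W_rel, b_rel):
    """h_head, h_dep: (d,) relation vectors; U_rel: (n_rels, d, d);
    W_rel: (n_rels, 2d); b_rel: (n_rels,). Returns one score per relation."""
    bilinear = torch.einsum("d,rde,e->r", h_head, U_rel, h_dep)   # h^T U^(rel) h
    linear = W_rel @ torch.cat([h_dep, h_head]) + b_rel           # W^(rel)[h; h] + b
    return bilinear + linear

# predicted_relation = relation_scores(h_head, h_dep, U_rel, W_rel, b_rel).argmax()
```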
Breaking potential cycles in factored graph-based dependency parsing
- Since factored graph-based dependency parsing predicts the parent of each dependent independently, by finding the incoming edge with the highest score, the resulting dependency structure may contain cycles. For instance, the model might predict that i is the head of j and that j is the head of i.
- When this happens, we need to break the cycle to ensure that the resulting dependency tree is well-formed.
- One algorithm that can do this is the Chu-Liu/Edmonds algorithm (a code sketch follows the walkthrough below).
The Chu-Liu-Edmonds algorithm
- Assume a model that assigns a score to each possible edge in a dependency tree.
- x = root John saw Mary
The Chu-Liu-Edmonds algorithm
- Start by removing all incoming arcs to root, and finding the highest-scoring incoming arc for each node.
The Chu-Liu-Edmonds algorithm
- If the result is not a tree, identify cycles and contract them.
- Recalculate arc weights into and out of the cycle.
- New incoming arc weights equal the weight of the best spanning tree that includes the head of the incoming arc and all nodes in the cycle:
- root → saw → John: 40
- root → John → saw: 40
The Chu-Liu-Edmonds algorithm
- This is the final dependency tree.
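The procedure walked through in these slides can be written compactly as a recursive function over a dense score matrix. This is a minimal sketch (node 0 is ROOT, scores[h][d] is the score of the arc h → d); it favors clarity over efficiency and does not try to enforce a single dependent of ROOT.

```python
def _find_cycle(heads):
    """Return the set of nodes on a cycle of the current head choices, or None."""
    for start in range(1, len(heads)):
        seen, node = [], start
        while node is not None and node not in seen:
            seen.append(node)
            node = heads[node]
        if node is not None:                 # revisited a node: cycle found
            return set(seen[seen.index(node):])
    return None

def chu_liu_edmonds(scores):
    """Maximum spanning arborescence rooted at node 0. scores[h][d] scores the
    arc h -> d; returns heads[d] for every node, with heads[0] = None."""
    n = len(scores)
    # Step 1: every non-root node greedily picks its best incoming arc.
    heads = [None] + [max((h for h in range(n) if h != d), key=lambda h: scores[h][d])
                      for d in range(1, n)]
    cycle = _find_cycle(heads)
    if cycle is None:
        return heads                         # already a tree

    # Step 2: contract the cycle into a single node c and recompute arc weights.
    cycle_score = sum(scores[heads[d]][d] for d in cycle)
    outside = [v for v in range(n) if v not in cycle]   # root stays at index 0
    c = len(outside)
    new_scores = [[float("-inf")] * (c + 1) for _ in range(c + 1)]
    best_in, best_out = {}, {}
    for oi, u in enumerate(outside):
        for oj, v in enumerate(outside):
            if u != v:
                new_scores[oi][oj] = scores[u][v]
        # Arc u -> cycle: weight of the best spanning tree of the cycle that
        # enters at node v (drop v's old in-cycle arc, add u -> v).
        gain, entry = max((scores[u][v] - scores[heads[v]][v], v) for v in cycle)
        new_scores[oi][c] = cycle_score + gain
        best_in[oi] = entry
        # Arc cycle -> u: leave the cycle from its best exit node.
        exit_score, exit_node = max((scores[v][u], v) for v in cycle)
        new_scores[c][oi] = exit_score
        best_out[oi] = exit_node

    # Step 3: solve the contracted problem, then expand the cycle: keep its
    # internal arcs except at the chosen entry point.
    sub_heads = chu_liu_edmonds(new_scores)
    result = list(heads)
    for oj in range(1, c):
        h = sub_heads[oj]
        result[outside[oj]] = best_out[oj] if h == c else outside[h]
    entry = best_in[sub_heads[c]]
    result[entry] = outside[sub_heads[c]]
    return result
```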