Computational Linguistics
CSC 485/2501 Fall 2021
2A. Dependency Grammar
Department of Computer Science, University of Toronto
Based on slides by , , , , and
Copyright © 2019 . All rights reserved.
Word Dependency Parsing
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
He reckons the current account deficit will narrow to only 1.8 billion in September.
PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .
POS-tagged sentence
Word dependency parsing
He reckons the current account deficit will narrow to only 1.8 billion in September .
Word dependency parsed sentence
(Figure: dependency arcs over the sentence, with labels SUBJ, SPEC, MOD, S-COMP, COMP, and ROOT.)
slide adapted from
2
All of these conditions will be violated by the semantic dependency graphs we will consider later.
You can think of it as (related to) planarity.
Underspecifications of simple typed dependencies
– Flat bracketings
– Non-projective dependency
  A woman arrived who was wearing a hat
– Complex word–word dependency constructions:
  Predicative adjectives: I ate the fish naked/raw
  Coordination: Pat and Terry sat and laughed
– More generally, semantic roles:
  The door opened / Erin opened the door / The door opened a crack
– Quantifier scoping, temporal interpretation, and so forth
Parsing Methods
Shift-Reduce Type Algorithms
◮ Data structures:
  ◮ Stack [..., wi]S of partially processed tokens
  ◮ Queue [wj, ...]Q of remaining input tokens
◮ Parsing actions built from atomic actions:
  ◮ Adding arcs (wi → wj, wi ← wj)
  ◮ Stack and queue operations
◮ Left-to-right parsing in O(n) time
◮ Restricted to projective dependency graphs
Dependency Parsing 54(103)
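The following is a minimal Python sketch, not from the slides, of the configuration that these shift-reduce parsers manipulate: a stack, a queue (buffer), and a growing set of arcs. All names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    """Parser configuration for shift-reduce dependency parsing.
    Tokens are indexed 1..n; index 0 is the artificial root."""
    stack: list = field(default_factory=lambda: [0])   # [..., wi]S
    queue: list = field(default_factory=list)          # [wj, ...]Q
    arcs: dict = field(default_factory=dict)           # dependent -> (head, label)

    def add_arc(self, head, dep, label=None):
        self.arcs[dep] = (head, label)                 # atomic action: add an arc

    def shift(self):
        self.stack.append(self.queue.pop(0))           # atomic stack/queue operation

c = Config(queue=list(range(1, 10)))   # a 9-token sentence
c.shift()                              # [0, 1]S [2, ..., 9]Q
```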
Parsing Methods
Yamada’s Algorithm
◮ Three parsing actions:
  Shift:  [...]S [wi, ...]Q  ⇒  [..., wi]S [...]Q
  Left:   [..., wi, wj]S [...]Q  ⇒  [..., wi]S [...]Q    (wi → wj)
  Right:  [..., wi, wj]S [...]Q  ⇒  [..., wj]S [...]Q    (wi ← wj)
◮ Algorithm variants:
◮ Originally developed for Japanese (strictly head-final) with only the Shift and Right actions [Kudo and Matsumoto 2002]
◮ Adapted for English (with mixed headedness) by adding the Left action [Yamada and Matsumoto 2003]
◮ Multiple passes over the input give time complexity O(n²)
Dependency Parsing 55(103)
Parsing Methods
Nivre’s Algorithm
◮ Four parsing actions:
  Shift:       [...]S [wi, ...]Q  ⇒  [..., wi]S [...]Q
  Reduce:      [..., wi]S [...]Q  ⇒  [...]S [...]Q               (only if ∃wk : wk → wi)
  Left-Arcr:   [..., wi]S [wj, ...]Q  ⇒  [...]S [wj, ...]Q       wi ←r wj   (only if ¬∃wk : wk → wi)
  Right-Arcr:  [..., wi]S [wj, ...]Q  ⇒  [..., wi, wj]S [...]Q   wi →r wj   (only if ¬∃wk : wk → wj)
◮ Characteristics:
◮ Integrated labeled dependency parsing
◮ Arc-eager processing of right-dependents
◮ A single pass over the input gives time complexity O(n)
Dependency Parsing 56(103)
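A minimal Python sketch, not from the slides, of the four arc-eager transitions above, including their preconditions. The configuration is a (stack, queue, arcs) triple over token indices, with 0 as the artificial root, and the parse() driver simply replays an oracle-supplied action sequence.

```python
def shift(c):
    stack, queue, arcs = c
    stack.append(queue.pop(0))

def reduce_(c):
    stack, queue, arcs = c
    assert stack[-1] in arcs              # only if the stack top already has a head
    stack.pop()

def left_arc(c, label):
    stack, queue, arcs = c
    wi, wj = stack[-1], queue[0]
    assert wi != 0 and wi not in arcs     # only if wi has no head yet (and is not root)
    arcs[wi] = (wj, label)                # wi <-label- wj
    stack.pop()

def right_arc(c, label):
    stack, queue, arcs = c
    wi, wj = stack[-1], queue[0]
    assert wj not in arcs                 # only if wj has no head yet
    arcs[wj] = (wi, label)                # wi -label-> wj
    stack.append(queue.pop(0))

ACTIONS = {"shift": lambda c, l: shift(c), "reduce": lambda c, l: reduce_(c),
           "left_arc": left_arc, "right_arc": right_arc}

def parse(n_tokens, actions):
    """Replay an (oracle-supplied) action sequence over tokens 1..n_tokens."""
    c = ([0], list(range(1, n_tokens + 1)), {})
    for name, label in actions:
        ACTIONS[name](c, label)
    return c[2]                           # dependent -> (head, label)
```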
Parsing Methods
Example
Sentence: Economic news had little effect on financial markets .

                [root]S [Economic news had little effect on financial markets .]Q
Shift           [root Economic]S [news had little effect on financial markets .]Q
Left-Arcnmod    [root]S Economic [news had little effect on financial markets .]Q   (Economic ←nmod news)
Shift           [root Economic news]S [had little effect on financial markets .]Q
Left-Arcsbj     [root]S Economic news [had little effect on financial markets .]Q   (news ←sbj had)
Right-Arcpred   [root Economic news had]S [little effect on financial markets .]Q   (root →pred had)
Shift           [root Economic news had little]S [effect on financial markets .]Q
Left-Arcnmod    [root Economic news had]S little [effect on financial markets .]Q   (little ←nmod effect)
Right-Arcobj    [root Economic news had little effect]S [on financial markets .]Q   (had →obj effect)
Right-Arcnmod   [root Economic news had little effect on]S [financial markets .]Q   (effect →nmod on)
Shift           [root Economic news had little effect on financial]S [markets .]Q
Left-Arcnmod    [root Economic news had little effect on]S financial [markets .]Q   (financial ←nmod markets)
Right-Arcpc     [root Economic news had little effect on financial markets]S [.]Q   (on →pc markets)
Reduce          [root Economic news had little effect on]S financial markets [.]Q
Reduce          [root Economic news had little effect]S on financial markets [.]Q
Reduce          [root Economic news had]S little effect on financial markets [.]Q
Reduce          [root]S Economic news had little effect on financial markets [.]Q
Right-Arcp      [root Economic news had little effect on financial markets .]S []Q  (root →p .)
Dependency Parsing 57(103)
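For reference, the trace above corresponds to the action sequence below; assuming the parse() helper from the arc-eager sketch earlier is in scope (a hypothetical helper, not part of the slides), replaying it reproduces exactly the arcs in the example.

```python
# Tokens 1..9: Economic news had little effect on financial markets .
GOLD = [("shift", None), ("left_arc", "nmod"), ("shift", None), ("left_arc", "sbj"),
        ("right_arc", "pred"), ("shift", None), ("left_arc", "nmod"), ("right_arc", "obj"),
        ("right_arc", "nmod"), ("shift", None), ("left_arc", "nmod"), ("right_arc", "pc"),
        ("reduce", None), ("reduce", None), ("reduce", None), ("reduce", None),
        ("right_arc", "p")]

arcs = parse(9, GOLD)
# arcs == {1: (2, 'nmod'), 2: (3, 'sbj'), 3: (0, 'pred'), 4: (5, 'nmod'),
#          5: (3, 'obj'), 6: (5, 'nmod'), 7: (8, 'nmod'), 8: (6, 'pc'), 9: (0, 'p')}
```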
Parsing Methods
Classifier-Based Parsing
◮ Data-driven deterministic parsing:
◮ Deterministic parsing requires an oracle.
◮ An oracle can be approximated by a classifier.
◮ A classifier can be trained using treebank data.
◮ Learning methods:
◮ Support vector machines (SVM)
[Kudo and Matsumoto 2002, Yamada and Matsumoto 2003,
Isozaki et al. 2004, Cheng et al. 2004, Nivre et al. 2006]
◮ Memory-based learning (MBL)
[Nivre et al. 2004, Nivre and Scholz 2004]
◮ Maximum entropy modeling (MaxEnt) [Cheng et al. 2005]
◮ Neural networks [you!]
Dependency Parsing 58(103)
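A minimal sketch, not from the slides, of the classifier-as-oracle recipe: harvest (parser state, gold action) pairs from treebank derivations and fit any off-the-shelf learner. Here scikit-learn's LogisticRegression stands in for the SVM/MBL/MaxEnt learners cited above, and the toy feature dicts are purely illustrative.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training pairs: one feature dict per parser state, with the gold action
# taken in that state (normally extracted from gold-standard derivations).
states = [{"s0.pos": "root", "q0.pos": "JJ"},
          {"s0.pos": "JJ",   "q0.pos": "NN"},
          {"s0.pos": "NN",   "q0.pos": "VBD"}]
actions = ["shift", "left_arc:nmod", "left_arc:sbj"]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(states), actions)

# At parsing time the classifier approximates the oracle: it predicts the next action.
next_action = clf.predict(vec.transform([{"s0.pos": "root", "q0.pos": "JJ"}]))[0]
```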
Parsing Methods
Feature Models
◮ Learning problem:
◮ Approximate a function from parser states, represented by feature vectors, to parser actions, given a training set of gold-standard derivations.
◮ Typical features:
  ◮ Tokens:
◮ Target tokens
◮ Linear context (neighbors in S and Q)
◮ Structural context (parents, children, siblings in G)
◮ Attributes:
◮ Word form (and lemma)
◮ Part-of-speech (and morpho-syntactic features)
◮ Dependency type (if labeled)
◮ Distance (between target tokens)
Dependency Parsing 59(103)
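A sketch, not from the slides, of how a parser state might be mapped to the kinds of features listed above; the template names and the (words, tags) lookups are illustrative assumptions.

```python
def extract_features(stack, queue, arcs, words, tags):
    """Map a parser configuration to a feature dict (token + attribute templates)."""
    feats = {}
    if stack:
        s0 = stack[-1]
        feats["s0.form"] = words[s0]               # word form of the stack top
        feats["s0.pos"] = tags[s0]                 # part-of-speech of the stack top
        if s0 in arcs:
            feats["s0.deprel"] = arcs[s0][1]       # dependency type, if already attached
    if queue:
        q0 = queue[0]
        feats["q0.form"] = words[q0]               # word form of the queue front
        feats["q0.pos"] = tags[q0]
        if len(queue) > 1:
            feats["q1.pos"] = tags[queue[1]]       # linear context in Q
    if stack and queue:
        feats["dist"] = str(queue[0] - stack[-1])  # distance between the target tokens
    return feats
```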
Neural Networks
Neural networks can be built for different input and output types.
– Outputs can be:
  – Linear, single output (Linear)
  – Linear, multiple outputs (Linear)
  – Single binary output (Logistic)
  – Multiple binary outputs (Logistic)
  – 1-of-k multinomial (categorical) output (Softmax)
– Inputs can be:
  – A scalar number
  – A vector of real numbers
  – A vector of binary values
(Fig: courtesy R. Socher)
Goal of training: given the training data (inputs, targets) and the architecture, determine the model parameters.
Model parameters for a 3-layer network:
– Weight matrix from the input layer to the hidden layer (Wjk)
– Weight matrix from the hidden layer to the output layer (Wkj)
– Bias terms for the hidden layer
– Bias terms for the output layer
Our strategy will be:
– Compute the error at the output.
– Determine the contribution of each parameter to the error by taking the derivative of the error with respect to that parameter.
– Update each parameter commensurate with the error it contributed.
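A minimal numpy sketch, not from the slides, of these four parameter groups and a forward pass for a 3-layer network with a tanh hidden layer and a softmax output; the layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3

W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))    # input -> hidden weight matrix
b1 = np.zeros(n_hidden)                               # hidden-layer biases
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))    # hidden -> output weight matrix
b2 = np.zeros(n_out)                                  # output-layer biases

def forward(x):
    z = np.tanh(W1 @ x + b1)          # hidden activations
    logits = W2 @ z + b2
    e = np.exp(logits - logits.max())
    return z, e / e.sum()             # softmax output (sums to 1 over the classes)

z, y = forward(rng.normal(size=n_in))
```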
Design Choices
• When building a neural network, the designer chooses the following hyperparameters and nonlinearities based on the application characteristics:
  • Number of hidden layers
  • Number of hidden units in each layer
  • Learning rate
  • Regularization coefficient
  • Number of outputs
  • Type of output (linear, logistic, softmax)
  • Choice of nonlinearity at the output layer and hidden layers (see next slide)
  • Input representation and dimensionality
Commonly used nonlinearities (fig: courtesy Socher)
Objective Functions and gradients
• Linear – Mean squared error: $E(w) = \frac{1}{2N}\sum_{n=1}^{N}(t_n - y_n)^2$
• Logistic with binary classifications: Cross Entropy Error
• Logistic with k outputs: k > 2: Cross Entropy Error
• Softmax: 1 of K multinomial classification: Cross Entropy Error, minimize NLL
• In all of the above cases we can show that the gradient with respect to output unit k's net input is $(y_k - t_k)$, where $y_k$ is the predicted output for output unit k and $t_k$ is the corresponding target.
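A short numpy check, not from the slides, of that claim for the softmax/cross-entropy case: the analytic gradient of the negative log-likelihood with respect to the logits equals y − t, which matches a finite-difference estimate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(z, t):
    return -np.log(softmax(z)[t.argmax()])   # cross-entropy with a one-hot target

z = np.array([1.0, -0.5, 2.0])               # logits for one example
t = np.array([0.0, 1.0, 0.0])                # one-hot target
analytic = softmax(z) - t                    # the gradient claimed above: y - t

numeric, eps = np.zeros_like(z), 1e-6
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (nll(zp, t) - nll(zm, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```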
High-Level Backpropagation Algorithm
• Apply the input vector to the network and forward propagate. This yields the activations for the hidden layer(s) and the output layer:
  $net_j = \sum_i w_{ji} z_i$
  $z_j = h(net_j)$, where $h$ is your choice of nonlinearity. Usually it is sigmoid or tanh; the Rectified Linear Unit (ReLU) is also used.
• Evaluate the error $\delta_k$ for all the output units:
  $\delta_k = o_k - t_k$, where $o_k$ is the output produced by the model and $t_k$ is the target provided in the training dataset.
• Backpropagate the $\delta$'s to obtain $\delta_j$ for each hidden unit j:
  $\delta_j = h'(z_j) \sum_k w_{kj} \delta_k$
• Evaluate the required derivatives:
  $\frac{\partial E}{\partial w_{ji}} = \delta_j z_i$
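A minimal numpy sketch, not from the slides, of one backpropagation step for a single hidden layer, following the equations above with a sigmoid hidden nonlinearity and a softmax output (so δ_k = o_k − t_k); all names, sizes, and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # input vector (the z_i of the layer below)
t = np.array([0.0, 1.0, 0.0])             # one-hot target
W1 = rng.normal(scale=0.1, size=(5, 4))   # hidden-layer weights w_ji
W2 = rng.normal(scale=0.1, size=(3, 5))   # output-layer weights w_kj

# Forward propagate
net_hidden = W1 @ x                       # net_j = sum_i w_ji z_i
z = sigmoid(net_hidden)                   # z_j = h(net_j)
logits = W2 @ z
o = np.exp(logits - logits.max())
o /= o.sum()                              # softmax outputs o_k

# Output-layer error
delta_k = o - t                           # delta_k = o_k - t_k

# Backpropagate to the hidden layer: delta_j = h'(z_j) * sum_k w_kj delta_k
delta_j = z * (1 - z) * (W2.T @ delta_k)  # z*(1-z) is the sigmoid derivative

# Required derivatives: dE/dw_ji = delta_j z_i (outer products)
grad_W2 = np.outer(delta_k, z)
grad_W1 = np.outer(delta_j, x)

# A gradient-descent update with a (hypothetical) learning rate of 0.1
W2 -= 0.1 * grad_W2
W1 -= 0.1 * grad_W1
```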