
Computational
Linguistics
CSC 485 Fall 2020
2A. Dependency Grammar
Gerald Penn
Department of Computer Science, University of Toronto
Based on slides by Roger Levy, Yuji Matsumoto, Dragomir Radev, Dan Roth, David Smith and Jason Eisner
Copyright © 2019 Gerald Penn. All rights reserved.

Word Dependency Parsing
Raw sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging gives the POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsing then gives the word-dependency-parsed sentence: a tree of labelled head-dependent arcs over the same words, with labels such as ROOT, SUBJ, SPEC, MOD, COMP and S-COMP.
slide adapted from Yuji Matsumoto

All of these conditions will be violated by the semantic dependency graphs that we will consider later.

You can think of it as (something related to) planarity.

Underspecifications of simple typed dependencies
• Flat bracketings
• Non-projective dependency:
  A woman arrived who was wearing a hat
• Complex word-word dependency constructions:
  • Predicative adjectives: I ate the fish naked/raw
  • Coordination: Pat and Terry sat and laughed
• More generally, semantic roles:
  The door opened / Erin opened the door / The door opened a crack
• Quantifier scoping, temporal interpretation, and so forth

Parsing Methods
Shift-Reduce Type Algorithms
◮ Data structures:
  ◮ Stack [..., wi]S of partially processed tokens
  ◮ Queue [wj, ...]Q of remaining input tokens
◮ Parsing actions built from atomic actions:
  ◮ Adding arcs (wi → wj, wi ← wj)
  ◮ Stack and queue operations
◮ Left-to-right parsing in O(n) time
◮ Restricted to projective dependency graphs
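The data structures above can be sketched directly in code. This is a minimal Python illustration only; the class name, token encoding, and arc representation are my own assumptions, not a prescribed interface.

# A minimal sketch of the shift-reduce parser state (illustrative only).
# Tokens are represented by their positions; 0 is the artificial root.
class Configuration:
    def __init__(self, words):
        self.stack = [0]                            # [..., w_i]_S, 0 = root
        self.queue = list(range(1, len(words) + 1)) # [w_j, ...]_Q
        self.arcs = set()                           # {(head, label, dependent)}
        self.words = ["<root>"] + list(words)

    def add_arc(self, head, label, dep):
        self.arcs.add((head, label, dep))

    def is_terminal(self):
        return not self.queue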

Parsing Methods
Yamada’s Algorithm
◮ Three parsing actions:
  Shift:  [...]S [wi, ...]Q ⇒ [..., wi]S [...]Q
  Left:   [..., wi, wj]S [...]Q ⇒ [..., wi]S [...]Q   (wi → wj)
  Right:  [..., wi, wj]S [...]Q ⇒ [..., wj]S [...]Q   (wi ← wj)
◮ Algorithm variants:
◮ Originally developed for Japanese (strictly head-final) with only the Shift and Right actions [Kudo and Matsumoto 2002]
◮ Adapted for English (with mixed headedness) by adding the Left action [Yamada and Matsumoto 2003]
◮ Multiple passes over the input give time complexity O(n²)
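A minimal Python sketch of the three actions above, under the convention that an arc (h, d) records h as the head of d; the function names and state layout are illustrative only.

# Yamada's three actions, sketched over a simple (stack, queue, arcs) state.
def shift(stack, queue, arcs):
    stack.append(queue.pop(0))      # [...]S [wi,...]Q => [..., wi]S [...]Q

def left(stack, queue, arcs):
    wi, wj = stack[-2], stack[-1]
    arcs.add((wi, wj))              # wi -> wj: wj becomes a dependent of wi
    stack.pop()                     # [..., wi, wj]S => [..., wi]S

def right(stack, queue, arcs):
    wi, wj = stack[-2], stack[-1]
    arcs.add((wj, wi))              # wi <- wj: wi becomes a dependent of wj
    del stack[-2]                   # [..., wi, wj]S => [..., wj]S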

Parsing Methods
Nivre’s Algorithm
◮ Four parsing actions:
  Shift:        [...]S [wi, ...]Q ⇒ [..., wi]S [...]Q
  Reduce:       [..., wi]S [...]Q ⇒ [...]S [...]Q              (only if ∃wk : wk → wi)
  Left-Arc_r:   [..., wi]S [wj, ...]Q ⇒ [...]S [wj, ...]Q       adds wi ←r wj   (only if ¬∃wk : wk → wi)
  Right-Arc_r:  [..., wi]S [wj, ...]Q ⇒ [..., wi, wj]S [...]Q   adds wi →r wj   (only if ¬∃wk : wk → wj)
◮ Characteristics:
◮ Integrated labeled dependency parsing
◮ Arc-eager processing of right-dependents
◮ Two passes over the input give time complexity O(n)
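The four transitions and their preconditions can be sketched as follows; this is a minimal illustration in Python, with the state layout and function names my own assumptions and arcs stored as (head, label, dependent) triples.

# Nivre's arc-eager transitions, sketched.  State is (stack, queue, arcs); 0 = root.
def has_head(w, arcs):
    return any(d == w for (_, _, d) in arcs)

def shift(stack, queue, arcs):
    stack.append(queue.pop(0))                 # push w_i from the queue

def reduce_(stack, queue, arcs):
    assert has_head(stack[-1], arcs)           # only if some w_k -> w_i exists
    stack.pop()                                # pop w_i

def left_arc(stack, queue, arcs, label):
    wi, wj = stack[-1], queue[0]
    assert not has_head(wi, arcs)              # w_i must not already have a head
    arcs.add((wj, label, wi))                  # adds w_i <-_r w_j
    stack.pop()                                # pop w_i

def right_arc(stack, queue, arcs, label):
    wi, wj = stack[-1], queue[0]
    assert not has_head(wj, arcs)              # w_j must not already have a head
    arcs.add((wi, label, wj))                  # adds w_i ->_r w_j
    stack.append(queue.pop(0))                 # push w_j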

Parsing Methods
Example
Arc-eager parsing of "Economic news had little effect on financial markets ." Each line shows the configuration reached (the ]S bracket marks the top of the stack, the [Q bracket the front of the queue) and the transition that produced it, together with the arc it added.

[root]S [Economic news had little effect on financial markets .]Q          (initial configuration)
[root Economic]S [news had little effect on financial markets .]Q          Shift
[root]S Economic [news had little effect on financial markets .]Q          Left-Arc_nmod   (Economic ←nmod news)
[root Economic news]S [had little effect on financial markets .]Q          Shift
[root]S Economic news [had little effect on financial markets .]Q          Left-Arc_sbj    (news ←sbj had)
[root Economic news had]S [little effect on financial markets .]Q          Right-Arc_pred  (root →pred had)
[root Economic news had little]S [effect on financial markets .]Q          Shift
[root Economic news had]S little [effect on financial markets .]Q          Left-Arc_nmod   (little ←nmod effect)
[root Economic news had little effect]S [on financial markets .]Q          Right-Arc_obj   (had →obj effect)
[root Economic news had little effect on]S [financial markets .]Q          Right-Arc_nmod  (effect →nmod on)
[root Economic news had little effect on financial]S [markets .]Q          Shift
[root Economic news had little effect on]S financial [markets .]Q          Left-Arc_nmod   (financial ←nmod markets)
[root Economic news had little effect on financial markets]S [.]Q          Right-Arc_pc    (on →pc markets)
[root Economic news had little effect on]S financial markets [.]Q          Reduce
[root Economic news had little effect]S on financial markets [.]Q          Reduce
[root Economic news had]S little effect on financial markets [.]Q          Reduce
[root]S Economic news had little effect on financial markets [.]Q          Reduce
[root Economic news had little effect on financial markets .]S []Q         Right-Arc_p     (root →p .)
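The derivation above can be replayed programmatically. The following self-contained sketch (a compact re-implementation of the same transitions, for illustration only) reproduces the nine labelled arcs.

# Replaying the arc-eager derivation above to recover the dependency arcs.
words = ["root", "Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]
stack, queue, arcs = [0], list(range(1, len(words))), set()

def apply(action, label=None):
    if action == "SH":                               # Shift
        stack.append(queue.pop(0))
    elif action == "RE":                             # Reduce
        stack.pop()
    elif action == "LA":                             # Left-Arc_label
        arcs.add((queue[0], label, stack.pop()))
    elif action == "RA":                             # Right-Arc_label
        arcs.add((stack[-1], label, queue[0]))
        stack.append(queue.pop(0))

sequence = [("SH", None), ("LA", "nmod"), ("SH", None), ("LA", "sbj"),
            ("RA", "pred"), ("SH", None), ("LA", "nmod"), ("RA", "obj"),
            ("RA", "nmod"), ("SH", None), ("LA", "nmod"), ("RA", "pc"),
            ("RE", None), ("RE", None), ("RE", None), ("RE", None),
            ("RA", "p")]
for action, label in sequence:
    apply(action, label)

for head, label, dep in sorted(arcs, key=lambda a: a[2]):
    print(f"{words[dep]} <-{label}- {words[head]}")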

Parsing Methods
Classifier-Based Parsing
◮ Data-driven deterministic parsing:
◮ Deterministic parsing requires an oracle.
◮ An oracle can be approximated by a classifier.
◮ A classifier can be trained using treebank data.
◮ Learning methods:
◮ Support vector machines (SVM)
[Kudo and Matsumoto 2002, Yamada and Matsumoto 2003,
Isozaki et al. 2004, Cheng et al. 2004, Nivre et al. 2006]
◮ Memory-based learning (MBL)
[Nivre et al. 2004, Nivre and Scholz 2004]
◮ Maximum entropy modeling (MaxEnt) [Cheng et al. 2005]
◮ Neural networks [you!]
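As a hedged sketch of the idea above: training pairs (parser-state features, gold transition) are extracted from treebank derivations via the oracle, a classifier is fit to them, and at parsing time the classifier's prediction is applied deterministically. Every helper passed in below (initial_config, is_terminal, apply_transition, extract_features, oracle, classifier) is a placeholder for your own components, not a given API.

# Sketch: approximating the parsing oracle with a classifier (SVM, MBL,
# MaxEnt, or a neural network could all play the role of `classifier`).
def collect_training_data(treebank, initial_config, is_terminal,
                          apply_transition, extract_features, oracle):
    """Extract (parser-state features, gold transition) pairs from gold trees."""
    X, y = [], []
    for sentence, gold_tree in treebank:
        config = initial_config(sentence)
        while not is_terminal(config):
            t = oracle(config, gold_tree)        # the gold-standard transition
            X.append(extract_features(config))
            y.append(t)
            apply_transition(config, t)
    return X, y

def greedy_parse(sentence, classifier, initial_config, is_terminal,
                 apply_transition, extract_features):
    """Deterministic parsing: the classifier stands in for the oracle."""
    config = initial_config(sentence)
    while not is_terminal(config):
        t = classifier.predict(extract_features(config))
        apply_transition(config, t)
    return config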

Parsing Methods
Feature Models
◮ Learning problem:
◮ Approximate a function from parser states, represented by feature vectors, to parser actions, given a training set of gold-standard derivations.
◮ Typical features:
◮ Tokens:
◮ Target tokens
◮ Linear context (neighbors in S and Q)
◮ Structural context (parents, children, siblings in G)
◮ Attributes:
◮ Word form (and lemma)
◮ Part-of-speech (and morpho-syntactic features)
◮ Dependency type (if labeled)
◮ Distance (between target tokens)
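For illustration, such a feature model might be realized roughly as follows; the attribute names on config and the particular feature templates are assumptions, not a required feature set.

# Sketch of feature extraction from a parser state.  `config` is assumed to
# have .stack, .queue, .arcs and per-token attribute lists .form and .pos.
def extract_features(config):
    feats = {}
    s0 = config.stack[-1] if config.stack else None           # target token on S
    q0 = config.queue[0] if config.queue else None            # target token on Q
    q1 = config.queue[1] if len(config.queue) > 1 else None   # linear context in Q

    if s0 is not None:
        feats["s0.form"] = config.form[s0]                    # word form
        feats["s0.pos"] = config.pos[s0]                      # part of speech
        # dependency type already assigned to s0, if any (labeled parsing)
        feats["s0.deprel"] = next((r for (h, r, d) in config.arcs if d == s0), "NONE")
    if q0 is not None:
        feats["q0.form"] = config.form[q0]
        feats["q0.pos"] = config.pos[q0]
    if q1 is not None:
        feats["q1.pos"] = config.pos[q1]
    if s0 is not None and q0 is not None:
        feats["dist"] = q0 - s0                               # distance between targets
    return feats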

Neural Networks
Neural networks can be built for different input and output types.
– Outputs can be:
– Linear, single output (linear)
– Linear, multiple outputs (linear)
– Single binary output (logistic)
– Multiple binary outputs (logistic)
– 1-of-k multinomial output (softmax)
– Inputs can be:
– A scalar number
– A vector of real numbers
– A vector of binary values
(Fig: courtesy R. Socher)
Goal of training: Given the training data (inputs, targets) and the architecture, determine the model parameters.
Model parameters for a 3-layer network:
– Weight matrix from the input layer to the hidden layer (Wjk)
– Weight matrix from the hidden layer to the output layer (Wkj)
– Bias terms for the hidden layer
– Bias terms for the output layer
Our strategy will be:
– Compute the error at the output.
– Determine the contribution of each parameter to the error by taking the derivative of the error with respect to that parameter.
– Update each parameter commensurately with the error it contributed.
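A minimal numpy sketch of the forward pass for such a 3-layer (one-hidden-layer) network, with a tanh hidden layer and a softmax output chosen purely for illustration; the shapes and example sizes are arbitrary.

import numpy as np

# One-hidden-layer ("3-layer") network: input -> hidden -> output.
# Shapes: x (D,), W1 (H, D), b1 (H,), W2 (K, H), b2 (K,).
def forward(x, W1, b1, W2, b2):
    net_hidden = W1 @ x + b1            # net_j = sum_i w_ji x_i + b_j
    z = np.tanh(net_hidden)             # hidden activations (tanh nonlinearity)
    net_out = W2 @ z + b2               # output pre-activations
    y = np.exp(net_out - net_out.max())
    y /= y.sum()                        # softmax: 1-of-k multinomial output
    return z, y

# Example with random parameters (D=4 inputs, H=5 hidden units, K=3 classes).
rng = np.random.default_rng(0)
D, H, K = 4, 5, 3
W1, b1 = rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = rng.normal(size=(K, H)), np.zeros(K)
z, y = forward(rng.normal(size=D), W1, b1, W2, b2)
print(y, y.sum())                       # a probability distribution over 3 classes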

Design Choices
• When building a neural network, the designer chooses the following hyperparameters and nonlinearities based on the characteristics of the application:
• Number of hidden layers
• Number of hidden units in each layer
• Learning rate
• Regularization coefficient
• Number of outputs
• Type of output (linear, logistic, softmax)
• Choice of nonlinearity at the output layer and hidden layers (see next slide)
• Input representation and dimensionality
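These choices might, for instance, be gathered into a small configuration object; every value below is an arbitrary placeholder rather than a recommendation.

from dataclasses import dataclass

# Illustrative hyperparameter configuration; all values are placeholders.
@dataclass
class NetConfig:
    n_hidden_layers: int = 1
    n_hidden_units: int = 200
    learning_rate: float = 0.01
    l2_coefficient: float = 1e-4      # regularization coefficient
    n_outputs: int = 3
    output_type: str = "softmax"      # "linear", "logistic", or "softmax"
    hidden_nonlinearity: str = "tanh"
    input_dim: int = 50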

Commonly used nonlinearities (fig: courtesy R. Socher)

Objective Functions and gradients
• Linear: Mean squared error
• E(w) = (1/2N) Σ_{n=1}^{N} (t_n − y_n)²
• Logistic with binary classifications: Cross Entropy Error
• Logistic with k outputs: k > 2: Cross Entropy Error
• Softmax: 1 of K multinomial classification: Cross Entropy Error, minimize NLL
• In all of the above cases we can show that the gradient (of the error with respect to output unit k's net input) is (y_k − t_k), where y_k is the predicted output for output unit k and t_k is the corresponding target.
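That claim can be checked numerically for the softmax/cross-entropy case; this is a small self-contained sketch using numpy only.

import numpy as np

# Verify that d(cross-entropy)/d(net_k) = y_k - t_k for a softmax output layer.
rng = np.random.default_rng(1)
net = rng.normal(size=5)                       # output pre-activations
t = np.eye(5)[2]                               # 1-of-k target (class 2)

def loss(net):
    y = np.exp(net - net.max()); y /= y.sum()  # softmax
    return -np.sum(t * np.log(y))              # negative log-likelihood

y = np.exp(net - net.max()); y /= y.sum()
analytic = y - t                               # the claimed gradient

numeric = np.zeros_like(net)                   # finite-difference check
eps = 1e-6
for k in range(len(net)):
    d = np.zeros_like(net); d[k] = eps
    numeric[k] = (loss(net + d) - loss(net - d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True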

High Level Backpropagation Algorithm
• Apply the input vector to the network and forward propagate. This yields the activations for the hidden layer(s) and the output layer:
  net_j = Σ_i w_ji z_i
  z_j = h(net_j), where h is your choice of nonlinearity. Usually it is the sigmoid or tanh; the rectified linear unit (ReLU) is also used.
• Evaluate the error δ_k for all the output units:
  δ_k = o_k − t_k, where o_k is the output produced by the model and t_k is the target provided in the training dataset.
• Backpropagate the δ's to obtain δ_j for each hidden unit j:
  δ_j = h′(z_j) Σ_k w_kj δ_k
• Evaluate the required derivatives:
  ∂E/∂w_ji = δ_j z_i
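Putting the steps together, here is a minimal numpy sketch of one backpropagation update for a single hidden layer, assuming sigmoid hidden units and a softmax output; variable names follow the equations above, and the learning-rate update is a plain gradient-descent step.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    # Forward propagate: activations for the hidden and output layers.
    net_hidden = W1 @ x + b1
    z = sigmoid(net_hidden)                            # z_j = h(net_j)
    net_out = W2 @ z + b2
    o = np.exp(net_out - net_out.max()); o /= o.sum()  # softmax outputs o_k

    # Error at the output units: delta_k = o_k - t_k.
    delta_out = o - t

    # Backpropagate: delta_j = h'(net_j) * sum_k w_kj delta_k;
    # for the sigmoid, h'(net_j) = z_j (1 - z_j), so it can be written via z_j.
    delta_hidden = z * (1 - z) * (W2.T @ delta_out)

    # Required derivatives dE/dw_ji = delta_j z_i (x plays the role of z_i for
    # the first layer), followed by a gradient-descent parameter update.
    W2 -= lr * np.outer(delta_out, z);    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hidden, x); b1 -= lr * delta_hidden
    return -np.sum(t * np.log(o))          # cross-entropy loss before the update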