Syntactic parsing approaches
▶ Grammar-based approach with CKY decoding
  ▶ PCFG, a generative approach that extends the Naive Bayes model
  ▶ Lexicalization, parent annotation
  ▶ Discriminative approaches: linear and neural models
    ▶ Perceptron and CRF training with discrete features
    ▶ Neural models
▶ Transition-based approach: the shift-reduce algorithm with greedy or beam search
  ▶ Linear models with discrete features – Perceptron, Conditional Random Fields
  ▶ Non-linear (neural) models
▶ Thinking out of the box: a sequence-to-sequence approach to syntactic parsing
Learning PCFGs
Parameters in probabilistic context-free grammars can be estimated by relative frequency, as with HMMs:
$\psi(X \to \alpha) = \log P(X \to \alpha), \qquad \hat{P}(X \to \alpha) = \dfrac{\mathrm{count}(X \to \alpha)}{\mathrm{count}(X)}$
E.g., the probability of the production NP → DET NN is the corpus count of this production, divided by the count of the non-terminal NP. This applies to terminals as well.
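A minimal sketch of this relative-frequency estimation, assuming trees are given as nested (label, children) tuples; the function name and data format are illustrative, not from the slides.

```python
from collections import defaultdict
from math import log

def estimate_pcfg(trees):
    """Relative-frequency estimation of PCFG production log-probabilities.

    Each tree is a nested tuple (label, child_1, ..., child_n); a leaf child
    is a plain string (a terminal word).
    """
    rule_count = defaultdict(int)   # count(X -> alpha)
    lhs_count = defaultdict(int)    # count(X)

    def visit(node):
        if isinstance(node, str):   # terminal: nothing to count here
            return
        label, *children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_count[(label, rhs)] += 1
        lhs_count[label] += 1
        for c in children:
            visit(c)

    for t in trees:
        visit(t)

    # log P(X -> alpha) = log count(X -> alpha) - log count(X)
    return {rule: log(c) - log(lhs_count[rule[0]]) for rule, c in rule_count.items()}

# Toy example: one tree for "the dog barks"
tree = ("S", ("NP", ("DET", "the"), ("NN", "dog")), ("VP", ("VB", "barks")))
weights = estimate_pcfg([tree])
print(weights[("NP", ("DET", "NN"))])  # log(1/1) = 0.0
```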
Grammar Refinement
▶ Grammars extracted from treebanks (e.g., the Penn Treebank) are often sensitive to ambiguities in the parses, even with the weighted productions
▶ There have been various attempts to augment the vanilla PCFG with more expressive productions
  ▶ Parent annotation
  ▶ Lexicalization
Parent annotation
Lexicalized CFGs
Discriminative approaches with discrete features
▶ The score for each production can be computed as an inner product of features and their weights,
$\psi(X \to \alpha, (i, j, k)) = \theta \cdot f(X, \alpha, (i, j, k), w)$
where the feature vector f is a function of the left-hand side X, the right-hand side α, the anchor indices (i, j, k), and the input w.
▶ The basic feature $f(X, \alpha, (i, j, k)) = \{(X, \alpha)\}$ encodes only the identity of the production itself and is therefore only as expressive as a PCFG trained discriminatively.
▶ Other features include the words at the beginning and end of the span, $w_i$ and $w_{j+1}$, the word at the split point $w_{k+1}$, etc.
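A minimal sketch of a discrete feature function for an anchored production; the exact feature set and string encoding are illustrative, not the slides' definition.

```python
def anchored_features(X, alpha, span, words):
    """Discrete features for an anchored production X -> alpha over (i, j, k).

    i..j is the span covered by X and k is the split point between the two
    children. Features are returned as string keys of an indicator vector.
    """
    i, j, k = span
    rhs = " ".join(alpha)
    feats = [
        f"rule={X}->{rhs}",                        # the production identity itself
        f"rule={X}->{rhs}_first-word={words[i]}",  # first word in the span
        f"rule={X}->{rhs}_split-word={words[k]}",  # word at the split point
    ]
    if j < len(words) - 1:
        feats.append(f"rule={X}->{rhs}_next-word={words[j + 1]}")  # word after the span
    return feats

words = ["He", "eats", "noodles", "with", "chopsticks"]
print(anchored_features("VP", ("VB", "NP"), (1, 2, 2), words))
```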
Perceptron training
▶ Perceptron training for parsing is very similar to that of sequence labeling
▶ The feature vector for a sentence–tree pair decomposes into the sum of local features
$f(\tau, w^{(i)}) = \sum_{(X \to \alpha, (i,j,k)) \in \tau} f(X, \alpha, (i,j,k), w^{(i)})$
▶ Find the tree with the highest score under the current model
$\hat{\tau} = \operatorname{argmax}_{\tau \in \mathcal{T}(w)} \theta \cdot f(\tau, w^{(i)})$
▶ Update the feature weights
$\theta \leftarrow \theta + f(\tau^{(i)}, w^{(i)}) - f(\hat{\tau}, w^{(i)})$
CRF parsing
▶ The score $\Psi(\tau)$ of a derivation $\tau$ can be converted into a probability by normalizing over all possible derivations:
$P(\tau \mid w) = \dfrac{\exp \Psi(\tau)}{\sum_{\tau' \in \mathcal{T}(w)} \exp \Psi(\tau')}$
▶ Using this probability, a WCFG can be trained by maximizing the conditional log-likelihood of a labeled corpus.
CRF training
▶ Just as in logistic regression and the conditional random field over sequences, the gradient of the conditional log-likelihood is the difference between the observed and expected counts of each feature.
▶ The expectation $E_{\tau \mid w}[f(\tau, w^{(i)}); \theta]$ requires summing over all possible parses, and computing the marginal probabilities of anchored productions, $P(X \to \alpha, (i,j,k) \mid w)$.
▶ In CRF sequence labeling, marginal probabilities over tag bigrams are computed by the two-pass forward-backward algorithm. The analogue for context-free grammars is the inside-outside algorithm, in which marginal probabilities are computed from terms generated by an upward and a downward pass over the parsing chart.
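A minimal sketch of the resulting gradient computation, assuming the anchored-production marginals have already been obtained from the inside-outside algorithm; the `marginals` argument and the feature-function interface are illustrative placeholders.

```python
from collections import defaultdict

def crf_gradient(gold_tree, marginals, words, feature_fn):
    """Gradient of the conditional log-likelihood for one sentence:
    observed feature counts minus expected feature counts.

    gold_tree: iterable of anchored productions (X, alpha, (i, j, k))
    marginals: dict mapping anchored productions to P(X -> alpha, (i, j, k) | w),
               as computed by the inside-outside algorithm (not shown here)
    feature_fn: returns a list of feature keys for one anchored production
    """
    grad = defaultdict(float)
    # Observed counts from the gold derivation
    for (X, alpha, span) in gold_tree:
        for feat in feature_fn(X, alpha, span, words):
            grad[feat] += 1.0
    # Expected counts under the current model
    for (X, alpha, span), prob in marginals.items():
        for feat in feature_fn(X, alpha, span, words):
            grad[feat] -= prob
    return grad
```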
Neural context-free grammars
▶ Neural networks can be applied to parsing by representing each span with a dense numerical vector. For example, the anchor (i, j, k) and sentence w can be associated with a column vector:
$\nu_{(i,j,k)} = [u_{w_{i-1}}; u_{w_i}; u_{w_{j-1}}; u_{w_j}; u_{w_{k-1}}; u_{w_k}]$
▶ This vector can be fed into a feed-forward neural net:
$\tilde{\nu}_{(i,j,k)} = \mathrm{FeedForward}(\nu_{(i,j,k)})$
▶ The score of a constituent can be computed with a weight matrix:
$\psi(X \to \alpha, (i,j,k)) = \tilde{\nu}_{(i,j,k)}^{\top}\, \Theta\, f(X \to \alpha)$
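A minimal numpy sketch of this scoring, with made-up dimensions; the embedding table, feed-forward weights, nonlinearity, and production indexing are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hidden, n_rules = 50, 64, 20

E = rng.normal(size=(1000, d_emb))           # word embedding table u_w (toy vocab of 1000)
W1 = rng.normal(size=(d_hidden, 6 * d_emb))  # feed-forward layer
Theta = rng.normal(size=(d_hidden, n_rules)) # bilinear scoring matrix

def span_vector(word_ids, i, j, k):
    """nu_(i,j,k): concatenate embeddings around the span boundaries and split point."""
    idx = [i - 1, i, j - 1, j, k - 1, k]
    return np.concatenate([E[word_ids[t]] for t in idx])

def score(word_ids, i, j, k, rule_id):
    """psi(X -> alpha, (i, j, k)) = FeedForward(nu)^T Theta f(X -> alpha)."""
    nu = span_vector(word_ids, i, j, k)
    nu_tilde = np.tanh(W1 @ nu)              # one feed-forward layer (tanh chosen arbitrarily)
    return nu_tilde @ Theta[:, rule_id]      # f(X -> alpha) treated as a one-hot rule indicator

word_ids = [3, 17, 42, 8, 99]                # a toy 5-word sentence
print(score(word_ids, 1, 4, 2, rule_id=5))
```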
Parsing with the Transformer-based encoder-decoder framework
▶ Use the contextualized embeddings produced by a Transformer to compute the vector representations of constituents, and then efficiently search for the highest-scoring syntactic tree with the CKY algorithm.
Score the candidate trees and search for the optimal one under the model
▶ Assign a real-valued score s(T) to each tree T, which decomposes as
$s(T) = \sum_{(i,j,l) \in T} s(i, j, l)$
where s(i, j, l) is a real-valued score for a constituent between positions i and j with the label l
▶ Given the constituent scores, the model-optimal tree can be found with the CKY algorithm.
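A minimal sketch of CKY decoding over span scores, assuming binarized trees and a `span_score(i, j)` helper that returns the best label and its score for each span (both names are illustrative). Real implementations also handle unary chains and the implicit spans introduced by binarization.

```python
def cky_decode(n, span_score):
    """Find the highest-scoring binary tree over positions 0..n.

    span_score(i, j) -> (best_label, score) for the span covering words i..j-1.
    Returns (total_score, nested tree) for the whole sentence.
    """
    best = {}   # (i, j) -> (score, tree)
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            label, s = span_score(i, j)
            if length == 1:
                best[(i, j)] = (s, (label, i))           # leaf span
                continue
            # choose the split point k that maximizes the children's scores
            k, (ls, lt), (rs, rt) = max(
                ((k, best[(i, k)], best[(k, j)]) for k in range(i + 1, j)),
                key=lambda x: x[1][0] + x[2][0],
            )
            best[(i, j)] = (s + ls + rs, (label, lt, rt))
    return best[(0, n)]

# Toy usage with a trivial scoring function
print(cky_decode(3, lambda i, j: ("X", float(j - i))))
```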
Train the model with a max-margin objective
▶ Given the correct tree $T^*$, the model is trained to satisfy the margin constraints
$s(T^*) \geq s(T) + \Delta(T, T^*) \quad \text{for all trees } T$
by minimizing the hinge loss
$\max\Big(0,\ \max_{T \neq T^*}\big[s(T) + \Delta(T, T^*)\big] - s(T^*)\Big)$
▶ $\Delta$ is the Hamming loss on labeled spans, and the tree that most violates the margin constraints is selected for the purposes of updating the parameters.
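A minimal sketch of the corresponding loss computation, assuming trees are represented simply as sets of labeled spans and that the inner maximization is fed by some candidate enumerator; in practice the most-violating tree is found by loss-augmented CKY decoding, not by listing candidates.

```python
def hamming_loss(pred_spans, gold_spans):
    """Delta(T, T*): number of labeled spans (i, j, l) in the prediction
    that do not appear in the gold tree."""
    return len(pred_spans - gold_spans)

def hinge_loss(score_fn, gold_spans, candidate_trees):
    """max(0, max_{T != T*} [s(T) + Delta(T, T*)] - s(T*))."""
    gold_score = score_fn(gold_spans)
    worst = max(
        score_fn(T) + hamming_loss(T, gold_spans)
        for T in candidate_trees
        if T != gold_spans
    )
    return max(0.0, worst - gold_score)

gold = {(0, 5, "S"), (0, 1, "NP"), (1, 5, "VP")}
wrong = {(0, 5, "S"), (0, 2, "NP"), (2, 5, "VP")}
print(hinge_loss(lambda T: 0.1 * len(T), gold, [gold, wrong]))  # 2.0
```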
Encoder
▶ The encoder portion of the model is split into two parts:
  ▶ A word-based portion that assigns a context-aware vector representation $y_t$ to each position in the sentence with a Transformer (self-attention followed by a position-wise feed-forward neural network)
    ▶ The input is the sum of a word embedding, a position embedding, and a POS embedding
  ▶ A chart portion that combines the vectors $y_t$ to generate the scores for each span, s(i, j, l).
▶ Span score:
$s(i, j, \cdot) = \Theta_2\, \mathrm{ReLU}(\mathrm{LayerNorm}(\Theta_1 v + b_1)) + b_2$
▶ The input vector v combines the word-based vectors:
$v = [\overleftarrow{y}_j - \overleftarrow{y}_i;\ \overrightarrow{y}_{j+1} - \overrightarrow{y}_{i+1}]$
where $\overleftarrow{y}_t$ and $\overrightarrow{y}_t$ are the first and second half of $y_t$, respectively
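A minimal numpy sketch of this span representation and scorer, with made-up dimensions; the layer-norm and weight names are illustrative stand-ins for the model's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden, n_labels = 8, 16, 5
n_words = 6

Y = rng.normal(size=(n_words + 1, d_model))   # y_t for fencepost positions 0..n
Theta1 = rng.normal(size=(d_hidden, d_model))
b1 = np.zeros(d_hidden)
Theta2 = rng.normal(size=(n_labels, d_hidden))
b2 = np.zeros(n_labels)

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / (x.std() + eps)

def span_scores(i, j):
    """s(i, j, .) for all labels, from the split-half span representation."""
    half = d_model // 2
    back, fwd = Y[:, :half], Y[:, half:]      # first and second half of each y_t
    v = np.concatenate([back[j] - back[i], fwd[j + 1] - fwd[i + 1]])
    return Theta2 @ np.maximum(0.0, layer_norm(Theta1 @ v + b1)) + b2

print(span_scores(1, 4))   # scores of span (1, 4), one per label
```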
Parser evaluation
▶ Precision: the fraction of constituents in the system parse that match a constituent in the reference parse.
▶ Recall: the fraction of constituents in the reference parse that match a constituent in the system parse.
▶ Labeled vs. unlabeled precision and recall: in labeled precision and recall, the system must also match the phrase type of each constituent; in unlabeled precision and recall, it is only required to match the constituent structure.
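A minimal sketch of labeled and unlabeled precision/recall over sets of constituent spans; the (i, j, label) span representation is an assumption for illustration.

```python
def parseval(system_spans, reference_spans, labeled=True):
    """Labeled or unlabeled precision and recall over constituent spans.

    Each constituent is a tuple (i, j, label). For the unlabeled variant,
    only the (i, j) boundaries have to match.
    """
    if not labeled:
        system_spans = {(i, j) for (i, j, _) in system_spans}
        reference_spans = {(i, j) for (i, j, _) in reference_spans}
    matched = len(system_spans & reference_spans)
    precision = matched / len(system_spans) if system_spans else 0.0
    recall = matched / len(reference_spans) if reference_spans else 0.0
    return precision, recall

sys_parse = {(0, 1, "NP"), (1, 5, "VP"), (3, 5, "NP"), (0, 5, "S")}
ref_parse = {(0, 1, "NP"), (1, 5, "VP"), (3, 5, "PP"), (0, 5, "S")}
print(parseval(sys_parse, ref_parse, labeled=True))   # (0.75, 0.75)
print(parseval(sys_parse, ref_parse, labeled=False))  # (1.0, 1.0)
```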
Transition-based syntactic parsing
▶ Transition-based constituent parsing
▶ Transition-based dependency parsing
▶ Transition-based AMR parsing
Transition-based Constituent Parsing
▶ A transition-based constituent parsing model is a quadruple C = (S, T, s_0, S_t) where:
  ▶ S is a set of parser states or configurations,
  ▶ T is a set of actions, e.g., shift, reduce,
  ▶ s_0 is an initialization function that maps an input sentence into a unique initial state,
  ▶ S_t ⊆ S is a set of terminal states
▶ An action t ∈ T is a transition function that transforms the current state into a new state
▶ A state s ∈ S is defined as a tuple s = (α, β), where α is a stack that holds already-constructed subtrees, and β is a queue used to store words that are yet to be processed.
Shift-Reduce: a Transition-based algorithm
[Worked example (sequence of figures): parsing "He1 eats2 noodles3 with4 chopsticks5". Starting from an empty stack and the full queue, the derivation proceeds: shift He; reduce He to NP; shift eats; shift noodles; reduce noodles to NP; reduce eats + NP to VP; shift with; shift chopsticks; reduce chopsticks to NP; reduce with + NP to PP; reduce VP + PP to VP; reduce NP + VP to S. Success!]
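A minimal sketch of this derivation as executable code, with the oracle action sequence hard-coded for the example sentence; the tree representation and helper names are illustrative.

```python
# Shift-reduce derivation for "He eats noodles with chopsticks".
# The stack holds (label, children) subtrees; the queue holds the remaining words.

def shift(stack, queue):
    stack.append(queue.pop(0))

def reduce_unary(stack, queue, label):     # queue unused; kept for a uniform signature
    child = stack.pop()
    stack.append((label, [child]))

def reduce_binary(stack, queue, label):
    right, left = stack.pop(), stack.pop()
    stack.append((label, [left, right]))

stack, queue = [], ["He", "eats", "noodles", "with", "chopsticks"]

# The oracle action sequence shown in the figures above.
oracle = [
    (shift,),                  # He
    (reduce_unary, "NP"),      # (NP He)
    (shift,),                  # eats
    (shift,),                  # noodles
    (reduce_unary, "NP"),      # (NP noodles)
    (reduce_binary, "VP"),     # (VP eats (NP noodles))
    (shift,),                  # with
    (shift,),                  # chopsticks
    (reduce_unary, "NP"),      # (NP chopsticks)
    (reduce_binary, "PP"),     # (PP with (NP chopsticks))
    (reduce_binary, "VP"),     # (VP (VP ...) (PP ...))
    (reduce_binary, "S"),      # (S (NP He) (VP ...))
]

for action, *args in oracle:
    action(stack, queue, *args)

print(stack[0])   # the completed parse tree; the queue is now empty
```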
“Oracle”
▶ The oracle is a sequence of actions that leads to the correct parse of a sentence.
▶ When training a transition-based parsing model, we first map a gold parse tree onto an oracle sequence of actions
▶ We can learn a model by comparing the oracle to predicted action sequences and updating the parameters of the model.
The Perceptron learning algorithm
Input: training examples (x_i, y_i), i = 1..N
Initialization: set θ = 0
for t ← 1..T do
    for i ← 1..N do
        z_i ← argmax_{z ∈ GEN(x_i)} f(x_i, z) · θ
        if z_i ≠ y_i then
            θ ← θ + f(x_i, y_i) − f(x_i, z_i)
Output: parameters θ
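A minimal sketch of this algorithm in Python, assuming a `gen` function that enumerates candidate outputs and a feature function returning sparse dicts; both are illustrative placeholders for the parser's decoder and feature extractor.

```python
from collections import defaultdict

def dot(theta, feats):
    return sum(theta[k] * v for k, v in feats.items())

def perceptron(examples, gen, feature_fn, T=5):
    """Structured perceptron: examples is a list of (x, y) pairs,
    gen(x) enumerates candidate outputs, feature_fn(x, y) returns a
    sparse feature dict."""
    theta = defaultdict(float)
    for _ in range(T):
        for x, y in examples:
            # Decode with the current model (argmax over GEN(x))
            z = max(gen(x), key=lambda cand: dot(theta, feature_fn(x, cand)))
            if z != y:
                # Promote the gold structure, demote the prediction
                for k, v in feature_fn(x, y).items():
                    theta[k] += v
                for k, v in feature_fn(x, z).items():
                    theta[k] -= v
    return theta
```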
Lexicalized transition-based parsing actions
▶ Each action t ∈ T is a transition function that transforms a state into a new state.
  ▶ SHIFT (sh): remove the first word–POS pair from β and push it onto the top of α;
  ▶ REDUCE-UNARY-X (ru-x): pop the top subtree from α, construct a new unary node labeled X for the subtree, and then push the new subtree back onto α. The head of the new subtree is inherited from its child;
  ▶ REDUCE-BINARY-L/R-X (rl-x/rr-x): pop the top two subtrees from α, combine them into a new tree with a node labeled X, then push the new subtree back onto α. The left (L) and right (R) versions of the action indicate whether the head of the new subtree is inherited from its left or right child.
▶ A parsing state s ∈ S is defined as a tuple s = (α, β), where α is a stack maintained to hold the partial parse structures that have already been constructed and β is a queue used to store unprocessed input (typically word–POS tag pairs).
Updating feature weights
θ ← θ + f(x, y) − f(x, z): the weights of features paired with the oracle action ("reduce") are increased, and the weights of the same features paired with the wrongly predicted action ("shift") are decreased:

p0tc = N-NP ~ shift        −1        p0tc = N-NP ~ reduce        +1
p0wc = noodles-NP ~ shift  −1        p0wc = noodles-NP ~ reduce  +1
p1tc = V-V ~ shift         −1        p1tc = V-V ~ reduce         +1
p1wc = eats-V ~ shift      −1        p1wc = eats-V ~ reduce      +1

Notes: the feature p0tc = N-NP predicts a "shift" action when the oracle action should be "reduce".
Transition-based parsing features
Type       Feature templates
unigrams   p0tc, p0wc, p1tc, p1wc, p2tc, p2wc, p3tc, p3wc, q0wt, q1wt, q2wt, q3wt, p0lwc, p0rwc, p0uwc, p1lwc, p1rwc, p1uwc
bigrams    p0wp1w, p0wp1c, p0cp1w, p0cp1c, p0wq0w, p0wq0t, p0cq0w, p0cq0t, q0wq1w, q0wq1t, q0tq1w, q0tq1t, p1wq0w, p1wq0t, p1cq0w, p1cq0t
trigrams   p0cs1cs2c, p0wp1cp2c, p0cp1wq0t, p0cp1cp2w, p0cp1cq0t, p0wp1cq0t, p0cp1wq0t, p0cp1cq0w

Baseline features, where pi represents the ith subtree on the stack α and qi denotes the ith item in the queue β. w refers to the head word, t to the head POS tag, and c to the constituent label. pil and pir refer to the left and right child of a binary subtree pi, and piu refers to the child of a unary subtree pi.
Feature vector
feature                         count    feature                          count
p0tc = N-NP ~ shift               0      p0tc = N-NP ~ reduce               1
p0wc = noodles-NP ~ shift         0      p0wc = noodles-NP ~ reduce         1
p1tc = V-V ~ shift                0      p1tc = V-V ~ reduce                1
p1wc = eats-V ~ shift             0      p1wc = eats-V ~ reduce             1
p0uwc = noodles-N ~ shift         0      p0uwc = noodles-N ~ reduce         1
q0wt = with-P ~ shift             0      q0wt = with-P ~ reduce             1
q1wt = chopsticks-N ~ shift       0      q1wt = chopsticks-N ~ reduce       1
···                               ···    ···                                ···

Notes: feature counts for one configuration. The total count for a sentence is a sum over all configurations in the derivation of the sentence's syntactic structure.
Beam Search
Input: a POS-tagged sentence, beam size k.
Output: a constituent parse tree.

beam_0 ← {s_0}                        ▷ initialization
i ← 0                                 ▷ step index
loop
    P ← {}                            ▷ a priority queue
    while beam_i is not empty do
        s ← Pop(beam_i)
        for all possible t ∈ T do
            s_new ← apply t to s
            score s_new
            insert s_new into P
    beam_{i+1} ← k best states of P
    s_best ← best state in beam_{i+1}
    if s_best ∈ S_t then
        return s_best
    i ← i + 1
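A minimal sketch of this beam search in Python, assuming `initial_state`, `possible_actions`, `apply_action`, `score`, and `is_terminal` helpers supplied by the parser; all of these names are illustrative placeholders, not a real parser API.

```python
import heapq

def beam_search(sentence, k, initial_state, possible_actions, apply_action, score, is_terminal):
    """Beam search over parser states: keep the k best states at each step
    and stop as soon as the best state in the beam is terminal."""
    beam = [initial_state(sentence)]
    while True:
        candidates = []                        # plays the role of the priority queue P
        for state in beam:
            for action in possible_actions(state):
                new_state = apply_action(state, action)
                candidates.append((score(new_state), new_state))
        # keep the k best successor states
        beam = [s for _, s in heapq.nlargest(k, candidates, key=lambda x: x[0])]
        best = beam[0]                         # nlargest returns states in descending score order
        if is_terminal(best):
            return best
```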
CFG-based parsing vs. transition-based parsing
▶ A transition-based parser scores the actions, while a PCFG-based parsing model scores the rules
▶ It is customary to use the beam search algorithm in transition-based parsing
▶ The transition-based approach can easily be applied to dependency parsing as well as graph-based semantic parsing
▶ Learning for transition-based parsing can be done with basically any type of classifier, including neural network models