Transition-based syntactic parsing
- Transition-based constituent parsing
- Transition-based dependency parsing
- Transition-based AMR parsing
Transition-based Constituent Parsing
- A transition-based constituent parsing model is a quadruple C = (S, T, s0, St), where:
  - S is a set of parser states (or configurations),
  - T is a set of actions, e.g., shift and reduce,
  - s0 is an initialization function that maps an input sentence onto a unique initial state,
  - St ⊆ S is a set of terminal states.
- An action t ∈ T is a transition function that transforms the current state into a new state.
- A state s ∈ S is defined as a tuple s = (α, β), where α is a stack that holds already constructed subtrees, and β is a queue that stores the words yet to be processed.
Shift-Reduce: a Transition-based algorithm

[Slide figures: a step-by-step shift-reduce derivation for "He1 eats2 noodles3 with4 chopsticks5". Starting from an empty stack and the full sentence in the queue, the parser applies:
shift He → reduce to NP → shift eats → shift noodles → reduce noodles to NP → reduce eats + NP to VP → shift with → shift chopsticks → reduce chopsticks to NP → reduce with + NP to PP → reduce VP + PP to VP → reduce NP + VP to S.
Parsing succeeds when the queue is empty and the stack holds a single S-rooted tree.]
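Below is a minimal sketch (not from the slides) that replays this derivation on a plain (stack, queue) state; subtrees are nested tuples, and the constituent labels passed to the reduce steps are assumptions about which node each reduce builds.

```python
# A sketch: replaying the derivation above on a plain (stack, queue) state.
# Subtrees are nested tuples (label, children...).

def shift(stack, queue):
    stack.append(queue.pop(0))          # move the front word onto the stack

def reduce_unary(stack, label):
    stack.append((label, stack.pop()))  # wrap the top subtree under `label`

def reduce_binary(stack, label):
    right, left = stack.pop(), stack.pop()
    stack.append((label, left, right))  # combine the top two subtrees

stack, queue = [], ["He", "eats", "noodles", "with", "chopsticks"]
shift(stack, queue)                      # [He]
reduce_unary(stack, "NP")                # [NP(He)]
shift(stack, queue)                      # [NP, eats]
shift(stack, queue)                      # [NP, eats, noodles]
reduce_unary(stack, "NP")                # [NP, eats, NP(noodles)]
reduce_binary(stack, "VP")               # [NP, VP(eats NP)]
shift(stack, queue)                      # [NP, VP, with]
shift(stack, queue)                      # [NP, VP, with, chopsticks]
reduce_unary(stack, "NP")                # [NP, VP, with, NP(chopsticks)]
reduce_binary(stack, "PP")               # [NP, VP, PP(with NP)]
reduce_binary(stack, "VP")               # [NP, VP(VP PP)]
reduce_binary(stack, "S")                # [S(NP VP)]
assert len(stack) == 1 and not queue     # success: one S-rooted tree, empty queue
```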
“Oracle”
- The oracle is the sequence of actions that leads to the correct parse of a sentence.
- When training a transition-based parsing model, we first map a gold parse tree onto an oracle sequence of actions (a sketch follows below).
- We can then learn a model by comparing the oracle against predicted action sequences and updating the parameters of the model.
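One simple way to obtain such an oracle sequence is sketched below, under the assumption that gold trees are binarized and represented as nested tuples with word leaves; the choice between left- and right-headed binary reduces is omitted for brevity.

```python
# Sketch (assumption): gold trees are binarized nested tuples such as
# ("S", ("NP", "He"), ("VP", ...)); leaves are plain word strings.
# The oracle is a post-order traversal: children first, then the reduce
# that builds their parent.

def oracle_actions(node):
    if isinstance(node, str):           # a word enters the stack via shift
        return ["sh"]
    label, *children = node
    actions = []
    for child in children:
        actions += oracle_actions(child)
    if len(children) == 1:
        actions.append(f"ru-{label}")   # unary reduce
    else:
        actions.append(f"r-{label}")    # binary reduce (L/R head choice omitted)
    return actions

gold = ("S", ("NP", "He"),
             ("VP", ("VP", "eats", ("NP", "noodles")),
                    ("PP", "with", ("NP", "chopsticks"))))
print(oracle_actions(gold))
# ['sh', 'ru-NP', 'sh', 'sh', 'ru-NP', 'r-VP', 'sh', 'sh', 'ru-NP', 'r-PP', 'r-VP', 'r-S']
```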
The Perceptron learning algorithm
Input: training examples (xi, yi), i = 1, ..., N
Initialization: set θ = 0
1: for t ← 1, ..., T do
2:   for i ← 1, ..., N do
3:     zi ← argmax_{z ∈ GEN(xi)} f(xi, z) · θ
4:     if zi ≠ yi then
5:       θ ← θ + f(xi, yi) − f(xi, zi)
6:     end if
7:   end for
8: end for
Output: parameters θ
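A sketch of this update with sparse feature dictionaries; `decode` is a hypothetical stand-in for the argmax over GEN(x), and the feature representation is an assumption, not the exact setup from the slides.

```python
from collections import defaultdict

def sparse_dot(features, theta):
    # f(x, z) . theta for a sparse {feature: count} dict
    return sum(count * theta[feat] for feat, count in features.items())

def perceptron_train(examples, feature_fn, decode, T=5):
    """examples: list of (x, y_gold); feature_fn(x, y) -> {feature: count};
    decode(x, theta, feature_fn) -> highest-scoring candidate, standing in
    for argmax over GEN(x)."""
    theta = defaultdict(float)
    for _ in range(T):
        for x, y in examples:
            z = decode(x, theta, feature_fn)
            if z != y:
                # theta <- theta + f(x, y) - f(x, z)
                for feat, count in feature_fn(x, y).items():
                    theta[feat] += count
                for feat, count in feature_fn(x, z).items():
                    theta[feat] -= count
    return theta
```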
Lexicalized transition-based parsing actions

- A parsing state s ∈ S is defined as a tuple s = (α, β), where α is a stack maintained to hold the partial parsing structures already constructed and β is a queue used to store unprocessed input (typically word-POS tag pairs).
- Each action t ∈ T is a transition action that transforms a state into a new state (see the sketch below):
  - SHIFT (sh): remove the first word-POS pair from β and push it onto the top of α;
  - REDUCE-UNARY-X (ru-x): pop the top subtree from α, construct a new unary node labeled X for the subtree, and push the new subtree back onto α. The head of the new subtree is inherited from its child;
  - REDUCE-BINARY-L/R-X (rl-x/rr-x): pop the top two subtrees from α, combine them into a new tree with a node labeled X, and push the new subtree back onto α. The left (L) and right (R) versions of the action indicate whether the head of the new subtree is inherited from its left or right child.
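A minimal sketch of the three action types operating on the stack α and queue β, with head-word inheritance as described above; the concrete tuple layout of subtrees is an assumption, not from the slides.

```python
# Sketch: each stack item is (label, head_word, children); queue items are
# (word, pos) pairs.

def sh(stack, queue):
    word, pos = queue.pop(0)
    stack.append((pos, word, [word]))            # a word is its own head

def ru_x(stack, label):
    child = stack.pop()
    stack.append((label, child[1], [child]))     # head inherited from the child

def rb_x(stack, label, head_from_left):
    right, left = stack.pop(), stack.pop()
    head = left[1] if head_from_left else right[1]   # rl-x vs rr-x
    stack.append((label, head, [left, right]))

stack, queue = [], [("eats", "V"), ("noodles", "N")]
sh(stack, queue)
sh(stack, queue)
ru_x(stack, "NP")                            # NP headed by "noodles"
rb_x(stack, "VP", head_from_left=True)       # rl-VP: VP headed by "eats"
print(stack[0][0], stack[0][1])              # VP eats
```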
Updating feature weights
Weight update for one configuration, θ ← θ + f(x, y) − f(x, z), where the oracle action y is "reduce" but the predicted action z is "shift":

  feature                          f(x,y)   f(x,z)   update
  p0tc = N-NP ∼ shift                 0        1       −1
  p0tc = N-NP ∼ reduce                1        0       +1
  p0wc = noodles-NP ∼ shift           0        1       −1
  p0wc = noodles-NP ∼ reduce          1        0       +1
  p1tc = V-V ∼ shift                  0        1       −1
  p1tc = V-V ∼ reduce                 1        0       +1
  p1wc = eats-V ∼ shift               0        1       −1
  p1wc = eats-V ∼ reduce              1        0       +1

Notes: The feature p0tc = N-NP fires with a predicted "shift" action when the oracle action should be "reduce"; the update therefore lowers the weights of the features conjoined with "shift" and raises those conjoined with "reduce".
Transition-based parsing features
Type       Feature Templates
unigrams   p0tc, p0wc, p1tc, p1wc, p2tc, p2wc, p3tc, p3wc, q0wt, q1wt, q2wt, q3wt,
           p0lwc, p0rwc, p0uwc, p1lwc, p1rwc, p1uwc
bigrams    p0wp1w, p0wp1c, p0cp1w, p0cp1c, p0wq0w, p0wq0t, p0cq0w, p0cq0t,
           q0wq1w, q0wq1t, q0tq1w, q0tq1t, p1wq0w, p1wq0t, p1cq0w, p1cq0t
trigrams   p0cs1cs2c, p0wp1cp2c, p0cp1wq0t, p0cp1cp2w, p0cp1cq0t, p0wp1cq0t,
           p0cp1wq0t, p0cp1cq0w

Baseline features, where pi represents the ith subtree in the stack α and qi denotes the ith item in the queue β. w refers to the head word, t refers to the head POS, and c refers to the constituent label. pil and pir refer to the left and right child of a binary subtree pi, and piu refers to the child of a unary subtree pi.
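As an illustration (not from the slides), here are a few of these templates instantiated for one configuration; the state layout with word/tag/label fields is an assumption. In training, each resulting string would additionally be conjoined with the candidate action (∼shift, ∼reduce), as in the feature-vector slide below.

```python
# Sketch (assumption): each stack subtree is a dict with keys "w" (head word),
# "t" (head POS) and "c" (constituent label); queue items are (word, tag).

def extract_features(stack, queue):
    feats = []
    if len(stack) >= 1:
        p0 = stack[-1]
        feats.append(f"p0tc={p0['t']}-{p0['c']}")            # unigram: head POS + label
        feats.append(f"p0wc={p0['w']}-{p0['c']}")            # unigram: head word + label
    if len(stack) >= 2:
        p1 = stack[-2]
        feats.append(f"p1tc={p1['t']}-{p1['c']}")
        feats.append(f"p0wp1w={stack[-1]['w']}-{p1['w']}")   # bigram: two stack heads
    if queue:
        w0, t0 = queue[0]
        feats.append(f"q0wt={w0}-{t0}")
        if len(stack) >= 1:
            feats.append(f"p0cq0t={stack[-1]['c']}-{t0}")    # bigram: stack label + queue tag
    return feats

stack = [{"w": "eats", "t": "V", "c": "V"},
         {"w": "noodles", "t": "N", "c": "NP"}]   # top of stack is the last item
queue = [("with", "P"), ("chopsticks", "N")]
print(extract_features(stack, queue))
```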
Feature vector
feature                           count   feature                            count
p0tc = N-NP ∼ shift                 0     p0tc = N-NP ∼ reduce                 1
p0wc = noodles-NP ∼ shift           0     p0wc = noodles-NP ∼ reduce           1
p1tc = V-V ∼ shift                  0     p1tc = V-V ∼ reduce                  1
p1wc = eats-V ∼ shift               0     p1wc = eats-V ∼ reduce               1
p0uwc = noodles-N ∼ shift           0     p0uwc = noodles-N ∼ reduce           1
q0wt = with-P ∼ shift               0     q0wt = with-P ∼ reduce               1
q1wt = chopsticks-N ∼ shift         0     q1wt = chopsticks-N ∼ reduce         1
...                                       ...

Notes: Feature counts for one configuration. The total count for a sentence is the sum over all configurations in the derivation of the sentence's syntactic structure.
Beam Search
Input: a POS-tagged sentence, beam size k.
Output: a constituent parse tree.
 1: beam0 ← {s0}                        ▷ initialization
 2: i ← 0                               ▷ step index
 3: loop
 4:   P ← {}                            ▷ a priority queue
 5:   while beami is not empty do
 6:     s ← Pop(beami)
 7:     for all possible t ∈ T do
 8:       snew ← apply t to s
 9:       score snew
10:       insert snew into P
11:   beami+1 ← the k best states of P
12:   sbest ← the best state in beami+1
13:   if sbest ∈ St then
14:     return sbest
15:   i ← i + 1
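A sketch of the pseudocode above in Python; the helper functions `possible_actions`, `apply_action`, `score`, and `is_terminal` are hypothetical hooks supplied by the parsing model.

```python
import heapq

def beam_search(s0, k, possible_actions, apply_action, score, is_terminal):
    """Beam search over parser states, following the pseudocode above.
    s0: initial state; k: beam size."""
    beam = [s0]
    while True:
        candidates = []                       # plays the role of the priority queue P
        for s in beam:
            for t in possible_actions(s):
                s_new = apply_action(t, s)
                candidates.append((score(s_new), s_new))
        # keep the k best states by score
        beam = [s for _, s in heapq.nlargest(k, candidates, key=lambda x: x[0])]
        best = max(beam, key=score)
        if is_terminal(best):
            return best
```

As in the pseudocode, the loop ends only when the highest-scoring state in the beam is terminal; practical implementations typically also bound the number of steps.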
CFG-based parsing vs transition-based parsing

- A transition-based parser scores the actions, while a PCFG-based parsing model scores the rules.
- It is customary to use the beam search algorithm in transition-based parsing.
- The transition-based approach can be easily applied to dependency parsing as well as graph-based semantic parsing.
- Learning for transition-based parsing can be done with basically any type of classifier, including neural network models.
Constituent parsing as sequence-to-sequence learning
- When you have a hammer, everything looks like a nail. Can we apply sequence-to-sequence modeling to syntactic parsing?
- Our input is the raw sentence, which is a sequence, but our output is a constituent tree. How do we make the tree a sequence?
- We need to linearize the tree in such a way that it is fully reversible and can be mapped back to a tree.
Tree linearization
(S (NP NNP )NP (VP VBD (NP (NP NNS )NP (PP IN (NP NNS )NP )PP )NP )VP )S
Alternative linearizations are possible, e.g., a sequence of shift-reduce actions.
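A sketch of a reversible linearization in the style of the example above; the nested-tuple tree representation is my own, and leaves are POS tags (as with the XX-normalization trick discussed next).

```python
# Sketch: linearize a tree into a token sequence with labeled closing
# brackets, and map it back. Trees are nested tuples (label, children...);
# leaves are POS-tag strings.

def linearize(node):
    if isinstance(node, str):
        return [node]
    label, *children = node
    tokens = [f"({label}"]
    for child in children:
        tokens += linearize(child)
    tokens.append(f"){label}")
    return tokens

def delinearize(tokens):
    stack = [("ROOT", [])]
    for tok in tokens:
        if tok.startswith("("):
            stack.append((tok[1:], []))
        elif tok.startswith(")"):
            label, children = stack.pop()
            stack[-1][1].append(tuple([label] + children))
        else:
            stack[-1][1].append(tok)              # a leaf (POS tag)
    return stack[0][1][0]

tree = ("S", ("NP", "NNP"),
             ("VP", "VBD", ("NP", ("NP", "NNS"),
                                  ("PP", "IN", ("NP", "NNS")))))
tokens = linearize(tree)
print(" ".join(tokens))        # reproduces the bracket sequence above
assert delinearize(tokens) == tree                # fully reversible
```

Note that delinearization assumes the token sequence is well formed, which connects to the post-processing point in the next slide.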
Tricks and tips
- Normalizing POS tags: replacing all POS tags with XX further reduces the size of the output vocabulary.
- Input reversal: reversing the input sentence but not the output trees helps. Similar effects have been shown in MT: reversing the source sentence but not the target sentence.
- There is no guarantee that the output will be a well-formed tree structure. If the model is well trained, most of the trees will be well-formed, but you need to do some post-processing or checking to make sure the trees are well-formed, with matching parentheses, etc. (see the sketch below).
- In practical systems, LSTM-RNNs are used rather than vanilla RNNs; the latter are easier to explain but don't work as well in practice.
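As a concrete example of the post-processing check mentioned above (a sketch, not a method from the slides): verify that labeled opening and closing bracket tokens in a predicted sequence balance.

```python
def is_well_formed(tokens):
    """Check that labeled brackets in a predicted linearization balance,
    e.g., ")NP" must close the most recent unmatched "(NP"."""
    stack = []
    for tok in tokens:
        if tok.startswith("("):
            stack.append(tok[1:])
        elif tok.startswith(")"):
            if not stack or stack.pop() != tok[1:]:
                return False
    return not stack

print(is_well_formed("(S (NP NNP )NP (VP VBD )VP )S".split()))   # True
print(is_well_formed("(S (NP NNP )NP (VP VBD )VP".split()))      # False: unclosed (S
```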
BLEU: a metric for measuring similarity of sequences
- BLEU stands for Bilingual Evaluation Understudy, and it is the most popular MT evaluation metric.
- It measures n-gram overlap between the machine translation output and a reference translation:
  - compute precision for n-grams of size 1 to 4,
  - add a brevity penalty (for translations that are too short).
bleu = min(1, exp(1 − reference-length / output-length)) × (∏_{i=1..4} precision_i)^(1/4)

- In logarithms, the geometric mean of the precisions is exp((1/N) Σ_{n=1..N} log p_n).
- BLEU is typically computed over the entire corpus, not over single sentences.
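A sketch of corpus-level BLEU along the lines of the formula above, assuming a single reference per output, clipped n-gram counts (see the fine prints below), and no smoothing; function and variable names are my own, and this is not a drop-in replacement for standard tools.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(outputs, references, max_n=4):
    """outputs, references: lists of token lists (one reference per output)."""
    matched = [0] * max_n
    total = [0] * max_n
    out_len = ref_len = 0
    for out, ref in zip(outputs, references):
        out_len += len(out)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            out_counts = Counter(ngrams(out, n))
            ref_counts = Counter(ngrams(ref, n))
            # each reference n-gram can be used at most once (clipping)
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in out_counts.items())
            total[n - 1] += max(len(out) - n + 1, 0)
    precisions = [m / t if t > 0 else 0.0 for m, t in zip(matched, total)]
    if min(precisions) == 0:
        return 0.0                        # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - ref_len / out_len))
    return brevity * geo_mean
```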
Example
Metric                System A   System B
precision (1-gram)      6/6        3/6
precision (2-gram)      4/5        1/5
precision (3-gram)      2/4        0/4
precision (4-gram)      1/3        0/3
brevity penalty         6/7        6/7
bleu                    52%        0%
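As a rough check of the System A score, using the ratio form of the brevity penalty shown in the table: (6/6 × 4/5 × 2/4 × 1/3)^(1/4) ≈ 0.604, and 0.604 × 6/7 ≈ 0.52, i.e., about 52%. With the exponential penalty from the formula, exp(1 − 7/6) ≈ 0.846, the score comes out at roughly 51%; the small difference comes from the form of the penalty and rounding.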
Fine prints of BLEU
- To avoid computing log 0, all precisions are smoothed to ensure that they are positive.
- Each n-gram in the reference can be used at most once, so that "to to to to to to to" does not achieve P1 = 1 against the reference "to be or not to be" (see the worked example below).
- The brevity penalty is applied to translations that are shorter than the reference.
- Normalization is often performed on the reference and hypothesis translations to get a better match.
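A quick check of the clipping rule on the example above (a sketch): the reference contains "to" only twice, so the clipped unigram precision is 2/7, not 7/7.

```python
from collections import Counter

hyp = "to to to to to to to".split()
ref = "to be or not to be".split()
ref_counts = Counter(ref)
clipped = sum(min(c, ref_counts[w]) for w, c in Counter(hyp).items())
print(f"{clipped}/{len(hyp)}")   # 2/7: "to" can only be matched twice
```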