Computational Linguistics
CSC 485 Summer 2020
7. Statistical parsing
Gerald Penn
Department of Computer Science, University of Toronto
Reading: Jurafsky & Martin: 5.2–5.5.2, 5.6, 12.4, 14.0–1, 14.3–4, 14.6–7. Bird et al: 8.6.
Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.
Statistical parsing 1
• General idea:
• Assign probabilities to rules in a context-free
grammar.
• Use a likelihood model.
• Combine probabilities of rules in a tree.
• Yields likelihood of a parse.
• The best parse is the most likely one.
2
Statistical parsing 2
• Motivations:
• Uniform process for attachment decisions.
• Use lexical preferences in all decisions.
3
Three general approaches
1. Assign a probability to each rule of the grammar, including lexical productions.
   – Parse the string of input words with probabilistic rules.
     The can will rust.
2. Assign probabilities only to non-lexical productions.
   – Probabilistically tag the input words with syntactic categories using a part-of-speech tagger.
   – Consider the pre-terminal syntactic categories to be terminals, and parse that string with probabilistic rules.
     Det N Modal Verb.
3. “Supertagging”: parsing as tagging with tree fragments.
4
Part-of-speech tagging 1
• Part-of-speech (PoS) tagging:
Given a sequence of words w1 … wn (from well-formed text), determine the syntactic category (PoS) Ci of each word.
• I.e., find the most likely category sequence C1 … Cn to assign to the word sequence w1 … wn:
  argmax over C1 … Cn of P(C1 … Cn | w1 … wn)
5
Part-of-speech tagging 2
• Example (possible PoS tags for each word):
  The:  det
  can:  modal, noun, verb
  will: modal, noun, verb
  rust: noun, verb
Example from Charniak 1997
6
Part-of-speech tagging 3
• We cannot get this probability, P(C1 … Cn | w1 … wn), directly.
• Have to estimate it (through counts).
• Perhaps after first approximating it (by modifying the formula).
• Counts: Need representative corpus.
7
PoS tagging: Unigram MLE 1
• Look at individual words (unigrams): estimate P(C | w) for each word w.
• Maximum likelihood estimator (MLE):
  P_MLE(C | w) = c(w is C) / c(w)
  where c(·) is a count in the corpus.
8
PoS tagging: Unigram MLE 2
• Problems of MLE:
• Sparse data.
• Extreme cases:
a. Undefined if w is not in the corpus.
b. 0 if w does not appear in a particular category.
9
PoS tagging: Unigram MLE 3
• Smoothing of the formula, e.g.:
  P(C | w) ≈ (c(w is C) + ε) / (c(w) + εN)
• Give a small (non-zero) probability value to unseen events, taken from seen events by discounting them.
• Various methods to ensure we still have a valid probability distribution.
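A minimal sketch of this smoothed unigram estimate; the toy corpus and the value of EPS below are illustrative assumptions, not from the lecture:

```python
from collections import Counter

# Toy tagged corpus; a real estimate would come from a large tagged corpus.
tagged_corpus = [("the", "Det"), ("can", "N"), ("will", "Modal"), ("rust", "V"),
                 ("the", "Det"), ("can", "Modal"), ("rust", "N")]

word_tag_counts = Counter(tagged_corpus)             # c(w is C)
word_counts = Counter(w for w, _ in tagged_corpus)   # c(w)
categories = {c for _, c in tagged_corpus}
N = len(categories)                                  # number of categories
EPS = 0.1                                            # small smoothing constant

def p_category_given_word(c, w):
    """Smoothed estimate P(C|w) = (c(w is C) + eps) / (c(w) + eps * N)."""
    return (word_tag_counts[(w, c)] + EPS) / (word_counts[w] + EPS * N)

# Unseen (word, category) pairs get a small non-zero probability,
# and the estimates for a given word still sum to 1 over the categories.
print(p_category_given_word("Modal", "can"))   # seen once
print(p_category_given_word("Det", "can"))     # unseen, but non-zero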
10
PoS tagging: Unigram MLE 4
• Just choosing the most frequent PoS for each word yields 90% accuracy in PoS tagging.
• But:
• Not uniform across words.
• Accuracy is low (0.9^n) when multiplied over n words.
• No context: The fly vs. I will fly.
• Need better approximations for P(C1 … Cn | w1 … wn).
11
PoS tagging: Bayesian method
• Use Bayes’s rule to rewrite:
  P(C1 … Cn | w1 … wn) = P(C1 … Cn) ⋅ P(w1 … wn | C1 … Cn) / P(w1 … wn)
  where ❶ is the first factor of the numerator, P(C1 … Cn), and ❷ is the second, P(w1 … wn | C1 … Cn).
• For a given word string, we want to maximize this, i.e., find the most likely C1 … Cn.
• The denominator is the same for every candidate C1 … Cn, so we just need to maximize the numerator.
12
Approximating probabilities 1
• Approximate ❶ P(C1 … Cn) by predicting each category from the previous N – 1 categories: an N-gram model.
  (Warning: this N is not the same as the n counting the words!)
• Bigram (2-gram) model:
  P(C1 … Cn) ≈ ∏i P(Ci | Ci–1)   (i = 1 … n+1)
• Posit pseudo-categories START as C0 and END as Cn+1.
• Example: P(Det N Modal Verb) ≈ P(Det | START) P(N | Det) P(Modal | N) P(Verb | Modal) P(END | Verb).
13
Approximating probabilities 2
• Approximate ❷ P(w1 … wn | C1 … Cn) by assuming that the probability of a word appearing in a category is independent of the words surrounding it:
  P(w1 … wn | C1 … Cn) ≈ ∏i P(wi | Ci)
• The P(wi | Ci) are the lexical generation probabilities.
15
Approximating probabilities 3
• Why is P(w|C) better than P(C|w)?
• P(C|w) is clearly not independent of surrounding
categories.
• Lexical generation probability is somewhat more independent.
• Complete formula for PoS includes bigrams, and so it does capture some context.
16
Putting it all together
❸  Choose the C1 … Cn that maximizes ∏i P(Ci | Ci–1) P(wi | Ci).
(Really we should use a smoothed MLE, cf slide 10; and the MLE for categories is not the same as the one for words, cf slide 8.)
17
Finding max 1
• Want to find the argmax (most probable) C1 … Cn.
• Brute force method: Find all possible sequences of categories and compute P.
• Unnecessary: Our approximation assumes independence:
• Category bigrams: Ci depends only on Ci–1.
• Lexical generation: wi depends only on Ci.
• Hence we do not need to enumerate all sequences independently.
18
Finding max 2
• Bigrams: a Markov model.
  • States are categories and transitions are bigrams.
• Lexical generation probabilities: a hidden Markov model.
  • Words are outputs (with given probabilities) of states.
  • A word could be the output of more than one state.
  • The current state is unknown (“hidden”).
19
Example
Based on an example in section 7.3 of: Allen, James. Natural Language Understanding (2nd ed), 1995, Benjamin Cummings.
• An artificial corpus of 300 PoS-tagged sentences using only Det, N, V, P.
• The flower flowers like a bird.
  Some birds like a flower with fruit beetles.
  Like flies like flies.
  …
• Some lexical generation probabilities:
P(the|Det) = .54 P(a|Det) = .36 P(a|N) = .001
⋮
P(like|N) = .012 P(like|V) = .1 P(like|P) = .068
⋮
P(flower|N) = .063 P(flower|V) = .050 P(flowers|N) = .050 P(flowers|V) = .053
⋮
P(birds|N) = .076 P(flies|V) = .076 P(flies|N) = .025
⋮
20
Markov model: Bigram table
Bigram Ci–1, Ci   Count(Ci–1)   Count(Ci–1, Ci)   P(Ci | Ci–1)     Estimate
START, Det        300           213               P(Det | START)   0.710
START, N          300            87               P(N | START)     0.290
Det, N            558           558               P(N | Det)       1.000
N, V              883           300               P(V | N)         0.340
N, N              883            51               P(N | N)         0.058
N, P              883           307               P(P | N)         0.348
N, END            883           225               P(END | N)       0.255
V, N              300           106               P(N | V)         0.353
V, Det            300           119               P(Det | V)       0.397
V, END            300            75               P(END | V)       0.250
P, Det            307           226               P(Det | P)       0.740
P, N              307            81               P(N | P)         0.260
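The Estimate column is just the MLE c(Ci–1, Ci) / c(Ci–1); a small sketch that reproduces it from the counts above:

```python
from collections import Counter

# Counts transcribed from the table above.
bigram_counts = Counter({
    ("START", "Det"): 213, ("START", "N"): 87, ("Det", "N"): 558,
    ("N", "V"): 300, ("N", "N"): 51, ("N", "P"): 307, ("N", "END"): 225,
    ("V", "N"): 106, ("V", "Det"): 119, ("V", "END"): 75,
    ("P", "Det"): 226, ("P", "N"): 81,
})
unigram_counts = Counter({"START": 300, "Det": 558, "N": 883, "V": 300, "P": 307})

def p_transition(curr, prev):
    """MLE estimate of P(C_i | C_{i-1}) = c(C_{i-1}, C_i) / c(C_{i-1})."""
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(round(p_transition("Det", "START"), 3))   # 0.71
print(round(p_transition("END", "N"), 3))       # 0.255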
21
Markov model: Transition probabilities
[Figure: state-transition diagram of the Markov model, with states START, Det, N, V, P, END and the transition probabilities from the table above (e.g. START→Det .71, START→N .29, Det→N 1.0, N→V .34, N→N .058, N→P .348, N→END .255, V→Det .397, V→N .353, V→END .25, P→Det .74, P→N .26).]
22
HMM: Lexical generation probabilities
[Figure: the same transition diagram, with each state now emitting words according to its lexical generation probabilities, e.g. Det emits the (.54) and a (.36); N emits flower (.063), flies (.025), birds (.076), like (.012), a (.001); V emits flower (.05), flies (.076), like (.1); P emits like (.068).]
P(the|Det) = .54 P(a|Det) = .36 P(a|N) = .001
⋮
P(flower|N) = .063 P(flower|V) = .050 P(flowers|N) = .050 P(flowers|V) = .053
P(like|N) = .012 P(like|V) = .1 P(like|P) = .068
⋮
P(birds|N) = .076 P(flies|V) = .076 P(flies|N) = .025
⋮
23
Hidden Markov models 1
• Given the observed output, we want to find the most likely path through the model.
The:  det
can:  modal, noun, verb
will: modal, noun, verb
rust: noun, verb
24
Hidden Markov models 2
• At any state in an HMM, how you got there is irrelevant to computing the next transition.
• So, just need to remember the best path and probability up to that point.
• Define P(Ci–1) as the probability of the best sequence up to state Ci–1.
• Then find the Ci that maximizes
  P(Ci–1) × P(Ci | Ci–1) × P(wi | Ci)
  (the last two factors are ❸ from slide 17).
25
Viterbi algorithm
• Given an HMM and an observation O of its output, finds the most probable sequence S of states that produced O.
• O = words of sentence, S = PoS tags of sentence
• Parameters of HMM based on large training corpus.
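A minimal sketch of Viterbi for this kind of bigram HMM tagger; the argument functions p_trans and p_word are assumed to wrap the transition and lexical generation tables estimated earlier:

```python
def viterbi(words, categories, p_trans, p_word):
    """Return the most probable category sequence C1..Cn for words w1..wn.
    p_trans(c, prev) = P(c | prev); p_word(w, c) = P(w | c).
    `categories` holds only the real tags (START/END are pseudo-categories)."""
    # best[i][c]: probability of the best sequence for words[:i+1] ending in c
    best = [{c: p_trans(c, "START") * p_word(words[0], c) for c in categories}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for c in categories:
            prev = max(categories, key=lambda p: best[i - 1][p] * p_trans(c, p))
            best[i][c] = best[i - 1][prev] * p_trans(c, prev) * p_word(w, c)
            back[i][c] = prev
    # Fold in the transition to END, then follow the back-pointers.
    last = max(categories, key=lambda c: best[-1][c] * p_trans("END", c))
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```

Because of the two independence assumptions, remembering only the best predecessor for each (position, category) pair is sufficient, which is what keeps the search polynomial rather than exponential.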
26
Statistical chart parsing 1
• Consider tags as terminals (i.e., use a PoS tagger to pre-process input texts).
Det N Modal Verb.
• For probability of each grammar rule, use MLE.
• Probabilities derived from hand-parsed corpora (treebanks).
• Count the frequency of use, c(C → RHS), of each rule C → RHS, for each non-terminal C and each different RHS.
• What are some problems with this approach?
27
Statistical chart parsing 2
• MLE probability of rules:
  • For each rule C → RHS:
    ❹  P(C → RHS | C) = c(C → RHS) / c(C)
• Takes no account of the context of use of a rule: an independence assumption.
• Source-normalized: assumes a top-down generative process.
• NLTK’s pchart demo doesn’t PoS-tag first (words are generated top-down), and it shows P(t) rather than P(t|s). Why?
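For instance, NLTK can induce exactly these MLE rule probabilities from treebank productions; a sketch, assuming the small Penn Treebank sample that ships with NLTK has been downloaded:

```python
import nltk
from nltk.corpus import treebank   # requires nltk.download("treebank")

# Collect productions (including lexical ones) from hand-parsed trees.
productions = []
for tree in treebank.parsed_sents()[:200]:       # a small sample
    productions += tree.productions()

# induce_pcfg computes P(C -> RHS | C) = c(C -> RHS) / c(C) for each observed rule.
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
for prod in grammar.productions(lhs=nltk.Nonterminal("NP"))[:5]:
    print(prod)                                  # each production prints with its probability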
28
Statistical chart parsing 3
• In this view of chart parsing, the probability of a chart entry is relatively simple to calculate. For completed constituents:
  ❺  P(e0) = P(C0 → C1 … Cn | C0) × P(e1) × … × P(en)
  where e0 is the entry for the current constituent, of category C0, and e1 … en are the chart entries for C1 … Cn in the RHS of the rule.
NB: Unlike for PoS tagging above, the Ci are not necessarily lexical categories.
39
Statistical chart parsing 4
• Consider a complete parse tree, t, with root label S.
• Recasting ❺, t has the probability:
  ❻  P(t) = P(S) × ∏n P(rule(n) | cat(n))
  where n ranges over all nodes in the tree t; rule(n) is the rule used at n; cat(n) is the category of n.
• P(S) = 1!
• “Bottoms out” at lexical categories.
• Note that we’re parsing bottom-up, but the generative model “thinks” top-down regardless.
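A small sketch of ❻ for one concrete tree; the lexical rule probabilities are taken from slide 20, while the structural rule probabilities are made-up illustrative values:

```python
from nltk import Tree

# Rule probabilities keyed by (category, RHS). Structural values are assumed;
# the lexical ones are the P(word|category) values from slide 20.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,        # assumed
    ("NP", ("Det", "N")): 0.4,       # assumed
    ("VP", ("V",)): 0.3,             # assumed
    ("Det", ("the",)): 0.54,
    ("N", ("flower",)): 0.063,
    ("V", ("flowers",)): 0.053,
}

def tree_prob(t):
    """P(t) = product over nodes n of P(rule(n) | cat(n)); P(S) = 1."""
    if isinstance(t, str):           # a word: already covered by its parent's lexical rule
        return 1.0
    rhs = tuple(c if isinstance(c, str) else c.label() for c in t)
    p = rule_prob[(t.label(), rhs)]
    for child in t:
        p *= tree_prob(child)
    return p

t = Tree.fromstring("(S (NP (Det the) (N flower)) (VP (V flowers)))")
print(tree_prob(t))   # 1.0 * 0.4 * 0.54 * 0.063 * 0.3 * 0.053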
40
Inside-Outside Algorithm
• Maximum likelihood estimates on an annotated corpus can be improved to increase the likelihood of a different, unannotated corpus.
• Step 1: parse the unannotated corpus using the MLE parameters.
• Step 2: adjust the parameters according to the expected relative frequencies of different rules in the parse trees obtained in Step 1:
• ṗ(A→B C) = μ(A→B C) / Z
• ṗ(A→w) = μ(A→w) / Z
41
Inside-Outside Algorithm 2
• μ(A → B C) = Σi,k,j μ(A → B C, i, k, j)
• μ(A → w) = Σi μ(A, i) δ(wi = w)
where we now count having seen an A from i to j, a B from i to k, and a C from k to j,
… or an A at position i, where the word w appears.
42
Inside-Outside Algorithm 3
• We can define these position-specific μ’s in terms of:
  • the outside probability β(N, p, q)
  • the inside probability α(N, p, q)
[Figure: a parse tree over w1 … wm with a node N spanning wp … wq; the inside probability α(N, p, q) covers the derivation of wp … wq from N, and the outside probability β(N, p, q) covers the rest: w1 … wp–1, N, wq+1 … wm.]
43
Inside-Outside Algorithm 4
• μ(A → B C, i, k, j) = p(A → B C) β(A, i, j) α(B, i, k) α(C, k+1, j)
• μ(A, i) = μ(A, i, i)
• μ(A, i, j) = α(A, i, j) β(A, i, j)
• Z = α(S, 1, n)
There are also very terse, recursive formulations of α and β that are amenable to dynamic programming.
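For example, the inside probabilities α can be filled in bottom-up, CKY-style, for a grammar in Chomsky normal form; a sketch in which binary_rules and lexical_rules are assumed dictionary representations rather than any particular library's objects:

```python
from collections import defaultdict

def inside_probs(words, binary_rules, lexical_rules):
    """alpha[(A, p, q)] = P(A derives words p..q), with 1-based inclusive spans.
    binary_rules: {A: [(B, C, prob), ...]}; lexical_rules: {A: {word: prob}}."""
    n = len(words)
    alpha = defaultdict(float)
    for p, w in enumerate(words, start=1):                    # length-1 spans
        for A, emissions in lexical_rules.items():
            alpha[(A, p, p)] = emissions.get(w, 0.0)
    for length in range(2, n + 1):                            # longer spans
        for p in range(1, n - length + 2):
            q = p + length - 1
            for A, expansions in binary_rules.items():
                for B, C, prob in expansions:                 # A -> B C
                    for k in range(p, q):                     # split point
                        alpha[(A, p, q)] += prob * alpha[(B, p, k)] * alpha[(C, k + 1, q)]
    return alpha

# Z = alpha[("S", 1, n)], the total probability of the string under the grammar.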
44
Statistical chart parsing 5
• But just like non-statistical chart parsers, this one only answers ‘yes’ or ‘no’ (with a probability) in polynomial time:
• It’s not supposed to matter how we got each constituent. Just the non-terminal label and the span are all that should matter.
• There might be exponentially many trees in this formulation.
• And we’re not calculating the probability that the input is a sentence – this is only the probability of one interpretation (tree).
45
Evaluation 1
• Evaluation method:
• Train on part of a parsed corpus.
(I.e., gather rules and statistics.)
• Test on a different part of the corpus.
• In one sense, the best evaluation of a method like this would be data likelihood, but since we’re scoring trees instead of strings, it’s difficult to defend any sort of intuition about the numbers assigned to them.
47
Evaluation 2
• Evaluation: PARSEVAL measures compare parser output to known correct parse:
• Labelled precision: the fraction of the constituents in the output that are correct.
• Labelled recall: the fraction of the correct (gold) constituents that appear in the output.
• F-measure = harmonic mean of precision and recall = 2PR / (P + R)
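A sketch of these measures, with each labelled constituent represented as a (label, start, end) triple; the representation and the toy spans are assumptions for illustration:

```python
def parseval(predicted, gold):
    """Labelled precision, recall and F over sets of (label, start, end) spans."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f

gold = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)}
pred = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 3, 4)}
print(parseval(pred, gold))   # (0.75, 0.75, 0.75)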
48
Evaluation 3
• Evaluation: PARSEVAL measures compare parser output to known correct parse:
• Penalize for cross-brackets per sentence: Constituents in output that overlap two (or more) correct ones; e.g., [[A B] C] for [A [B C]].
• [[Nadia] [[smelled] [the eggplant]]] [[[Nadia] [smelled]] [the eggplant]]
• The labels on the subtrees aren’t necessary for this one.
49
Evaluation 4
• PARSEVAL is a classifier accuracy score – much more extensional. All that matters is the right answer at the end.
• But that still means that we can look at parts of the right answer.
• Can get ~75% labelled precision, recall, and F with above methods.
50
Improving statistical parsing
• Problem: Probabilities are based only on structures and categories:
❹
• But actual words strongly condition which rule is used (cf Ratnaparkhi).
• Improve results by conditioning on more factors, including words. Think semantics – the words themselves give us a little bit of access to this.
51
Lexicalized grammars 1
• Head of a phrase: its central or key word.
• The noun of an NP, the preposition of a PP, etc.
• Lexicalized grammar: Refine the grammar so that rules take heads of phrases into account — the actual words.
• BEFORE: Rule for NP.
AFTER: Rules for NP-whose-head-is-aardvark, NP-whose-head-is-abacus, …, NP-whose-head-is- zymurgy.
• And similarly for VP, PP, etc.
52
Lexicalized grammars 2
• Notation: cat(head,tag) for constituent category cat headed by head with part-of- speech tag.
• e.g., NP(aardvark,NN), PP(without,IN)
NP-whose-head-is-the-NN-aardvark
PP-whose-head-is-the-IN-without
53
A lexicalized grammar
TOP → S(bought,VBD)
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
NP(week,NN) → JJ(Last,JJ) NN(week,NN)
NP(IBM,NNP) → NNP(IBM,NNP)
VP(bought,VBD) → VBD(bought,VBD) NP(Lotus,NNP)
NP(Lotus,NNP) → NNP(Lotus,NNP)
Lexical Rules:
JJ(Last,JJ) → Last
NN(week,NN) → week
NNP(IBM,NNP) → IBM
VBD(bought,VBD) → bought
NNP(Lotus,NNP) → Lotus
Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.
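Lexicalized nonterminals like these are usually produced by percolating heads up the treebank trees with a head-finding table; a rough sketch, where the head preferences are a tiny illustrative subset rather than Collins's actual head rules:

```python
from nltk import Tree

# Tiny illustrative head table: which child category heads each phrase type.
HEAD_PREFERENCES = {"S": ["VP", "NP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNP", "NNS"]}

def lexicalize(tree):
    """Rewrite labels in place as cat(head,tag) and return the (word, tag) head."""
    if isinstance(tree[0], str):                 # preterminal: it is its own head
        word, tag = tree[0], tree.label()
        tree.set_label(f"{tag}({word},{tag})")
        return word, tag
    heads = [lexicalize(child) for child in tree]
    child_cats = [child.label().split("(")[0] for child in tree]
    head_index = len(tree) - 1                   # default: rightmost child
    for preferred in HEAD_PREFERENCES.get(tree.label(), []):
        if preferred in child_cats:
            head_index = child_cats.index(preferred)
            break
    word, tag = heads[head_index]
    tree.set_label(f"{tree.label()}({word},{tag})")
    return word, tag

t = Tree.fromstring("(S (NP (NNP IBM)) (VP (VBD bought) (NP (NNP Lotus))))")
lexicalize(t)
print(t)   # root becomes S(bought,VBD), children become NP(IBM,NNP), VP(bought,VBD), ...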
54
Lexicalized grammars 3
• Number of rules and categories explodes, but no theoretical change in parsing process
(whether statistical or not).
• But far too specific for practical use; each is too rarely used to determine its probability.
• Need something more than regular (unlexicalized) rules and less than complete lexicalization …
• … perhaps we should change the process after all.
56
Lexicalized parsing 1
Starting from unlexicalized rules:
• 1. Lexicalization: Consider the head word of each node, not just its category:
  P(t) = P(S) × ∏n P(rule(n) | head(n))   (replaces ❻ from slide 40)
  where head(n) is the PoS-tagged head word of node n.
• Needs finer-grained probabilities:
• e.g., probability that rule r is used, given we have an NP whose head is the noun deficit.
57
Lexicalized parsing 2
• 2. Head and parent: Condition on the head and the head of the parent node in the tree:
e.g., probability of rule r given that head is the noun deficit.
e.g., probability that head is the noun deficit, given that parent phrase’s head is the verb report.
58
Effects on parsing
• Lexical information introduces context into CFG.
• Grammar is larger.
• Potential problems of sparse data.
• Possible solutions: Smoothing; back-off estimates.
If you don’t have data for a fine-grained situation, use data from a coarser-grained situation that contains it, as in the sketch below.
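One simple way to realize this idea is linear interpolation (not Collins's actual back-off scheme); the counts object and the weights below are assumptions:

```python
def backed_off_rule_prob(rule, head_word, head_tag, counts, weights=(0.6, 0.3, 0.1)):
    """Interpolate three MLE estimates of P(rule | context), finest first:
    conditioned on (head word, head tag), on the head tag alone, and on the
    bare category. `counts` is an assumed object whose methods return MLE
    estimates (0.0 when the conditioning event was never seen)."""
    w1, w2, w3 = weights
    return (w1 * counts.p_rule_given_word(rule, head_word, head_tag)
            + w2 * counts.p_rule_given_tag(rule, head_tag)
            + w3 * counts.p_rule_given_cat(rule))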
60
Bikel’s 2004 interpretation
• Can condition on any information available in generating the tree.
• Basic idea: Avoid sparseness of lexicalization by decomposing rules.
• Make plausible independence assumptions.
• Break rules down into small steps (small number
of parameters).
• Each rule still parameterized with word/PoS pair:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.
62
Collins’s “model 1” 1
• Lexical Rules, with probability 1: tag(word, tag) → word
• Internal Rules, with treebank-based probabilities. Separate the constituents to the left and right of the head; generate them one at a time:
  X(h) → Ln(ln) … L1(l1) H(h) R1(r1) … Rm(rm)
  X, Li, H, and Ri all have the form cat(head,tag). Notation: an italic lowercase symbol stands for the (head,tag) pair.
63
Collins’s “model 1” 2
• Assume there are additional Ln+1 and Rm+1 representing phrase boundaries (“STOP”).
• Example:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD) n = 2, m = 0 (two constituents on the left of the head, zero on the right).
X = S, H = VP, L1 = NP, L2 = NP, L3 = STOP, R1 = STOP.
h = (bought,VBD), l1 = (IBM,NNP), l2 = (week,NN).
• Distinguish probabilities of heads Ph, of left constituents Pl, and of right constituents Pr.
64
Probabilities of internal rules 1
P(X(h) → Ln(ln) … L1(l1) H(h) R1(r1) … Rm(rm))
  ≈ Ph(H | X, h)                          (generate the head constituent)
  × ∏i=1..n+1 Pl(Li(li) | X, h, H)        (generate left modifiers; stop at STOP)
  × ∏i=1..m+1 Pr(Ri(ri) | X, h, H)        (generate right modifiers; stop at STOP)
by the independence assumption.
65
Probabilities of internal rules 2
Example:
P( S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD) )
  ≈ Ph(VP | S, bought,VBD)                  (generate head constituent)
  × Pl(NP(IBM,NNP) | S, bought,VBD, VP)     (generate left modifiers)
  × Pl(NP(week,NN) | S, bought,VBD, VP)
  × Pl(STOP | S, bought,VBD, VP)
  × Pr(STOP | S, bought,VBD, VP)            (generate right modifiers)
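As code, the decomposition is just a product of look-ups into the three estimated distributions; a sketch in which Ph, Pl and Pr are assumed to be functions over the treebank-estimated tables:

```python
def internal_rule_prob(X, H, h, left_mods, right_mods, Ph, Pl, Pr):
    """P(X -> L_n .. L_1 H R_1 .. R_m) decomposed as
    Ph(H | X, h) * prod_i Pl(L_i | X, h, H) * prod_i Pr(R_i | X, h, H),
    with STOP appended to each modifier list."""
    p = Ph(H, X, h)
    for L in list(left_mods) + ["STOP"]:    # e.g. ["NP(IBM,NNP)", "NP(week,NN)"]
        p *= Pl(L, X, h, H)
    for R in list(right_mods) + ["STOP"]:
        p *= Pr(R, X, h, H)
    return p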
66
Adding other dependencies
• (Badly-named) “distance measure” to capture properties of attachment relevant to current modifier.
• Pl(Li(li) | X, h, H) becomes Pl(Li(li) | X, h, H, distancel(i – 1)),
  and analogously on the right.
The value of distancex is actually a pair of Boolean random variables:
• Is the string 1..(i – 1) of length 0?
  I.e., is modifier i attached adjacent to the head?
• Does the string 1..(i – 1) contain a verb?
  I.e., does the attachment of modifier i cross a verb?
67
Collins’s “model 1” 4
• Backs off …
• to tag probability when no data for specific word;
• to complete non-lexicalization when necessary.
68
Collins’s Models 2 and 3
• Model 2: Add verb subcategorization and argument/adjunct distinction.
• Model 3: Integrate gaps and trace identification into model.
• Especially important with addition of subcategorization.
69
Results and conclusions 1
• Model 2 outperforms Model 1.
• Model 3: Similar performance, but identifies
traces too.
• Model 2 performs best overall:
• LP = 89.6, LR = 89.9 [sentences ≤ 100 words].
• LP = 90.1, LR = 90.4 [sentences ≤ 40 words].
• Rich information improves parsing performance.
72
Results and conclusions 2
• Strengths:
• Incorporation of lexical and other linguistic
information.
• Competitive results.
• Weaknesses:
• Supervised training.
• Performance tightly linked to particular type of corpus used.
73
Results and conclusions 3
• Importance to CL:
• High-performance parser showing benefits of
lexicalization and linguistic information.
• Publicly available, widely used in research.
• There was some initial hope that it would make language models better, but that didn’t pan out.
• But it was fairly successful at giving us some access to semantics; that is, language modelling makes parsing better.
74
Graph Deletion & Contraction
Important fact: κ(G) = κ(G-{e}) + κ(G\{e})
75