Computational Linguistics
Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.
9. Statistical parsing
Gerald Penn
Department of Computer Science, University of Toronto
CSC 2501 / 485
Fall 2018
Reading: Jurafsky & Martin: 5.2–5.5.2, 5.6, 12.4, 14.0–1,
14.3–4, 14.6–7. Bird et al: 8.6.
2
• General idea:
• Assign probabilities to rules in a context-free
grammar.
• Use a likelihood model.
• Combine probabilities of rules in a tree.
• Yields likelihood of a parse.
• The best parse is the most likely one.
Statistical parsing 1
3
• Motivations:
• Uniform process for attachment decisions.
• Use lexical preferences in all decisions.
Statistical parsing 2
4
1. Assign a probability to each rule of grammar, including
lexical productions.
–Parse string of input words with probabilistic rules.
The can will rust.
2. Assign probabilities only to non-lexical productions.
–Probabilistically tag input words with syntactic
categories using a part-of-speech tagger.
–Consider the pre-terminal syntactic categories to be
terminals, and parse that string with probabilistic rules.
Det N Modal Verb.
3. “Supertagging” – parsing as tagging with tree
fragments.
Two general approaches
5
• Part-of-speech (PoS) tagging:
Given a sequence of words w1 … wn (from
well-formed text), determine the syntactic
category (PoS) Ci of each word.
• I.e., the best category sequence C1 … Cn to
assign to the word sequence w1 … wn.
Part-of-speech tagging 1
6
Part-of-speech tagging 2
• Example (from Charniak 1997):

The    can          will         rust
det    modal verb   modal verb   noun    ← most likely
       noun         noun         verb
       verb         verb
7
• We cannot get this probability, P(C1 … Cn | w1 … wn), directly.
• Have to estimate it (through counts).
• Perhaps after first approximating it (by
modifying the formula).
• Counts: Need representative corpus.
Part-of-speech tagging 3
8
• Look at individual words (unigrams):
• Maximum likelihood estimator (MLE):
P(C|w) ≈ c(w is C) / c(w)
where c(·) is a count in the corpus.
PoS tagging: Unigram MLE 1
• Problems of MLE:
• Sparse data.
• Extreme cases:
a. Undefined if w is not in the corpus.
b. 0 if w does not appear in a particular category.
9
PoS tagging: Unigram MLE 2
• Smoothing of the formula, e.g.:
P(C|w) ≈ (c(w is C) + ε) / (c(w) + εN)
where N is the number of categories.
• Give small (non-zero) probability value to
unseen events, taken from seen events by
discounting them.
• Various methods to ensure we still have valid
probability distribution.
10
PoS tagging: Unigram MLE 3
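A minimal sketch of this smoothed unigram estimate in Python (the toy corpus, tagset, and ε value here are illustrative, not from the slides):

from collections import Counter

def smoothed_unigram_tag_probs(tagged_corpus, tagset, eps=0.1):
    """P(C|w) ≈ (c(w is C) + eps) / (c(w) + eps*N), N = number of categories."""
    word_counts = Counter(w for w, _ in tagged_corpus)
    pair_counts = Counter(tagged_corpus)
    N = len(tagset)
    def prob(tag, word):
        # Unseen words get 1/N; unseen (word, tag) pairs get a small non-zero value.
        return (pair_counts[(word, tag)] + eps) / (word_counts[word] + eps * N)
    return prob

# Illustrative use with a tiny hand-made tagged corpus:
corpus = [("the", "Det"), ("can", "N"), ("will", "Modal"), ("rust", "V"),
          ("the", "Det"), ("can", "Modal"), ("will", "N")]
p = smoothed_unigram_tag_probs(corpus, tagset={"Det", "N", "V", "Modal"})
print(p("N", "can"))   # seen pairing
print(p("V", "can"))   # unseen pairing, still non-zero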
11
• Just choosing the most frequent PoS for each
word yields 90% accuracy in PoS tagging.
• But:
• Not uniform across words.
• Accuracy is low (0.9^n) when multiplied over n
words.
• No context: The fly vs. I will fly.
• Need better approximations for P(C1 … Cn | w1 … wn).
PoS tagging: Unigram MLE 4
12
• Use Bayes’s rule to rewrite:
P(C1 … Cn | w1 … wn) = ❶ P(C1 … Cn) × ❷ P(w1 … wn | C1 … Cn) / P(w1 … wn)
• For a given word string, we want to maximize
this, i.e., find the most likely C1 … Cn.
• The denominator is constant for the given words,
so we just need to maximize the numerator.
PoS tagging: Bayesian method
13
• Approximate ❶ P(C1 … Cn) by predicting
each category from the previous n – 1 categories:
an n-gram model.
(Warning: this n is not the same as the number of words n!)
• Bigram (2-gram) model:
P(C1 … Cn) ≈ ∏i=1..n+1 P(Ci | Ci–1)
• Posit pseudo-categories START as C0, and
END as Cn+1. Example:
P(Det N V) ≈ P(Det|START) × P(N|Det) × P(V|N) × P(END|V)
Approximating probabilities 1
• Approximate ❷ P(w1 … wn | C1 … Cn) by
assuming that the probability of a word
appearing in a category is independent of the
words surrounding it:
P(w1 … wn | C1 … Cn) ≈ ∏i P(wi | Ci)
The P(wi | Ci) are the lexical generation probabilities.
15
Approximating probabilities 2
• Why is P(w|C) better than P(C|w)?
• P(C|w) is clearly not independent of surrounding
categories.
• Lexical generation probability is somewhat more
independent.
• Complete formula for PoS includes bigrams, and
so it does capture some context.
16
Approximating probabilities 3
Putting it all together
P(C1 … Cn | w1 … wn) ∝ ∏i P(Ci | Ci–1) × P(wi | Ci)    ❸
• Really we should use the smoothed MLE (cf slide 10).
• The MLE for categories is not the same as the MLE for words (cf slide 8).
17
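As a sketch, formula ❸ can be turned directly into a scoring function; the transition and lexical-generation tables (p_trans, p_emit) are assumed here to be given, e.g., estimated as on the following slides:

def tag_sequence_score(words, tags, p_trans, p_emit):
    """Product of P(Ci|Ci-1) * P(wi|Ci) over the sentence, with START/END pseudo-tags.
    p_trans[(prev_tag, tag)] = P(tag|prev_tag); p_emit[(word, tag)] = P(word|tag)."""
    score = 1.0
    prev = "START"
    for w, t in zip(words, tags):
        score *= p_trans.get((prev, t), 0.0) * p_emit.get((w, t), 0.0)
        prev = t
    return score * p_trans.get((prev, "END"), 0.0)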
18
• Want to find the argmax (most probable)
C1 … Cn.
• Brute force method: Find all possible
sequences of categories and compute P.
• Unnecessary: Our approximation assumes
independence:
• Category bigrams: Ci depends only on Ci – 1.
Lexical generation: wi depends only on Ci.
• Hence we do not need to enumerate and score
each sequence separately.
Finding max 1
19
• Bigrams:
Markov model.
• States are categories and transitions are bigrams.
• Lexical generation probabilities:
Hidden Markov model.
• Words are outputs (with given probability) of states.
• A word could be the output of more than one state.
• Current state is unknown (“hidden”).
Finding max 2
Example
• Artificial corpus of 300 PoS-tagged sentences
using only Det, N, V, P.
• The flower flowers like a bird.
Some birds like a flower with fruit beetles.
Like flies like flies.
…
• Some lexical generation probabilities:
20
P(the|Det) = .54 P(like|N) = .012 P(flower|N) = .063 P(birds|N) = .076
P(a|Det) = .36 P(like|V) = .1 P(flower|V) = .050 P(flies|V) = .076
P(a|N) = .001 P(like|P) = .068 P(flowers|N) = .050 P(flies|N) = .025
⋮ ⋮ P(flowers|V) = .053 ⋮
⋮
Based on an example in section 7.3 of: Allen, James. Natural
Language Understanding (2nd ed), 1995, Benjamin Cummings.
21
Bigram Ci–1,Ci   Count Ci–1   Count Ci–1,Ci   Estimate of P(Ci|Ci–1)
START, Det       300          213             P(Det|START) = 0.710
START, N         300          87              P(N|START)   = 0.290
Det, N           558          558             P(N|Det)     = 1.000
N, V             883          300             P(V|N)       = 0.340
N, N             883          51              P(N|N)       = 0.058
N, P             883          307             P(P|N)       = 0.348
N, END           883          225             P(END|N)     = 0.255
V, N             300          106             P(N|V)       = 0.353
V, Det           300          119             P(Det|V)     = 0.397
V, END           300          75              P(END|V)     = 0.250
P, Det           307          226             P(Det|P)     = 0.740
P, N             307          81              P(N|P)       = 0.260
Markov model: Bigram table
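A sketch of how such a table can be computed by MLE from a tagged corpus (the input format, a list of per-sentence tag sequences, is an assumption for illustration):

from collections import Counter

def bigram_transition_probs(tag_sequences):
    """P(Ci|Ci-1) = Count(Ci-1, Ci) / Count(Ci-1), with START/END pseudo-tags."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["START"] + list(tags) + ["END"]
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return {bg: count / unigrams[bg[0]] for bg, count in bigrams.items()}

# e.g. bigram_transition_probs([["Det", "N", "V", "Det", "N"],
#                               ["N", "V", "P", "Det", "N"]])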
22
Markov model: Transition probabilities
[State-transition diagram: states START, Det, N, V, P, END, with arcs labelled
by the transition probabilities P(Ci|Ci–1) from the bigram table above
(e.g., START→Det .71, Det→N 1.0, N→P .348, P→Det .74, V→END .25).]
23
P(the|Det) = .54 P(like|N) = .012 P(flower|N) = .063 P(birds|N) = .076
P(a|Det) = .36 P(like|V) = .1 P(flower|V) = .050 P(flies|V) = .076
P(a|N) = .001 P(like|P) = .068 P(flowers|N) = .050 P(flies|N) = .025
⋮ ⋮ P(flowers|V) = .053 ⋮
HMM: Lexical generation probabilities
[HMM diagram: the transition network from the previous slide, in which each
state also emits words with the lexical generation probabilities above
(e.g., Det emits "the" with probability .54, V emits "like" with probability .1,
N emits "flies" with probability .025).]
24
• Given the observed output, we want to find
the most likely path through the model.
Hidden Markov models 1
The    can          will         rust
det    modal verb   modal verb   noun
       noun         noun         verb
       verb         verb
25
• At any state in an HMM, how you got there is
irrelevant to computing the next transition.
• So, just need to remember the best path and
probability up to that point.
• Define P(Ci–1) as the probability of the best
sequence up to state Ci–1.
• Then find the Ci that maximizes
P(Ci–1) × P(Ci|Ci–1) × P(wi|Ci)    ❸ from slide 17
Hidden Markov models 2
Viterbi algorithm
• Given an HMM and an observation O of its
output, finds the most probable sequence S
of states that produced O.
• O = words of sentence, S = PoS tags of sentence
• Parameters of HMM based on large training
corpus.
26
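A compact sketch of Viterbi decoding for the bigram HMM of the preceding slides (the dictionary-based parameter tables are an assumed representation):

def viterbi(words, tagset, p_trans, p_emit):
    """Most probable tag sequence for words under a bigram HMM.
    p_trans[(prev_tag, tag)] = P(tag|prev_tag); p_emit[(word, tag)] = P(word|tag)."""
    # best[i][t]: probability of the best path covering words[0..i] and ending in tag t
    best = [{t: p_trans.get(("START", t), 0.0) * p_emit.get((words[0], t), 0.0)
             for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tagset:
            prev, score = max(((s, best[i - 1][s] * p_trans.get((s, t), 0.0))
                               for s in tagset), key=lambda x: x[1])
            best[i][t] = score * p_emit.get((words[i], t), 0.0)
            back[i][t] = prev
    # Close with the END transition, then follow the back-pointers.
    last = max(tagset, key=lambda t: best[-1][t] * p_trans.get((t, "END"), 0.0))
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))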
27
• Consider tags as terminals (i.e., use a PoS
tagger to pre-process input texts).
Det N Modal Verb.
• For probability of each grammar rule, use
MLE.
• Probabilities derived from hand-parsed
corpora (treebanks).
• Count the frequency of use c(C → α) of each rule, for
each non-terminal C and each different RHS α.
What are some problems with this approach?
Statistical chart parsing 1
28
• MLE probability of rules:
• For each rule C → α:
P(C → α) = c(C → α) / c(C)
• Takes no account of the context of use of a
rule: independence assumption.
• Source-normalized: assumes a top-down
generative process.
• NLTK’s pchart demo doesn’t PoS-tag first
(words are generated top-down), and it
shows P(t) rather than P(t|s). Why?
Statistical chart parsing 2
❹
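A sketch of these MLE rule probabilities, using NLTK trees; reading the trees from nltk.corpus.treebank assumes the Penn Treebank sample is installed:

from collections import Counter
from nltk.corpus import treebank

def pcfg_rule_probs(parsed_sents):
    """P(C -> alpha) = c(C -> alpha) / c(C): counts of each production,
    normalized over all productions with the same left-hand side."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in parsed_sents:
        for prod in tree.productions():
            rule_counts[prod] += 1
            lhs_counts[prod.lhs()] += 1
    return {prod: count / lhs_counts[prod.lhs()] for prod, count in rule_counts.items()}

rule_probs = pcfg_rule_probs(treebank.parsed_sents()[:200])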
33
• In this view of chart parsing, the probability of chart
entries is relatively simple to calculate. For
completed constituents:
P(e0) = P(C0 → C1 … Cn | C0) × P(e1) × … × P(en)    ❺
where e0 is the entry for the current constituent, of category C0;
e1 … en are the chart entries for C1 … Cn in the RHS of
the rule.
NB: Unlike for PoS tagging above, the Ci are not necessarily lexical categories.
39
Statistical chart parsing 3
• Consider a complete parse tree, t, with root
label S.
• Recasting ❺, t has the probability:
P(t) = P(S) × ∏n P(rule(n) | cat(n))    ❻
where n ranges over all nodes in the tree t;
rule(n) is the rule used for n;
cat(n) is the category of n.
• P(S) = 1!
• “Bottoms out” at lexical categories.
• Note that we’re parsing bottom-up, but the
generative model “thinks” top-down
regardless.
40
Statistical chart parsing 4
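A sketch of ❻ applied to an nltk.Tree, reusing rule probabilities estimated as above (the rule_probs table is an assumption, and unseen rules are not smoothed):

from nltk import Tree

def tree_prob(tree, rule_probs):
    """P(t) = product over all nodes n of P(rule(n) | cat(n));
    P(S) = 1, and the product bottoms out at the lexical productions."""
    p = 1.0
    for prod in tree.productions():
        p *= rule_probs.get(prod, 0.0)   # an unseen rule gives probability 0
    return p

# e.g. tree_prob(Tree.fromstring("(S (NP (DT the) (NN can)) (VP (VBD rusted)))"), rule_probs)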
• Maximum likelihood estimates on an
annotated corpus can be improved to
increase the likelihood of a different,
unannotated corpus.
• Step 1: parse the unannotated corpus using
the MLE parameters.
• Step 2: adjust the parameters according to the
expected relative frequencies of different rules
in the parse trees obtained in Step 1:
• p̂(A → B C) = μ(A → B C) / Z
• p̂(A → w) = μ(A → w) / Z
41
Inside-Outside Algorithm
• μ(A → B C) = Σi,k,j μ(A → B C, i, k, j)
• μ(A → w) = Σi μ(A, i) δi(w)
where we now count having seen an A from i to j, a B
from i to k, and a C from k + 1 to j,
…or an A at location i, where the word w appears.
42
Inside-Outside Algorithm 2
• We can define these position-specific μ’s in
terms of:
• the outside probability β(A, i, j)
• the inside probability α(A, i, j)
43
Inside-Outside Algorithm 3
[Figure: a non-terminal N dominating the span wp … wq within the sentence
w1 … wm; α(N, p, q) is the inside probability of deriving that span from N,
and β(N, p, q) is the outside probability of everything outside the span
(w1 … wp–1 and wq+1 … wm).]
• μ(A → B C, i, k, j) = p(A → B C) β(A, i, j) α(B, i, k) α(C, k + 1, j)
• μ(A, i) = μ(A, i, i)
• μ(A, i, j) = α(A, i, j) β(A, i, j)
• Z = α(S, 1, n)
There are also very terse, recursive
formulations of α and β that are amenable to
dynamic programming.
44
Inside-Outside Algorithm 4
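A sketch of one such dynamic program, computing the inside probabilities α(A, i, j) bottom-up for a PCFG in Chomsky normal form (the dictionary-based grammar representation and 0-based, inclusive spans are assumptions):

from collections import defaultdict

def inside_probs(words, unary_probs, binary_probs):
    """alpha[(A, i, j)] = P(A derives words[i..j]).
    unary_probs[(A, w)] = p(A -> w); binary_probs[(A, B, C)] = p(A -> B C)."""
    n = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words):                  # spans of length 1
        for (A, word), p in unary_probs.items():
            if word == w:
                alpha[(A, i, i)] += p
    for length in range(2, n + 1):                 # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length - 1
            for (A, B, C), p in binary_probs.items():
                alpha[(A, i, j)] += sum(p * alpha[(B, i, k)] * alpha[(C, k + 1, j)]
                                        for k in range(i, j))
    return alpha    # Z = alpha[("S", 0, n - 1)]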
• But just like non-statistical chart parsers, this
one only answers ‘yes’ or ‘no’ (with a
probability) in polynomial time:
• It’s not supposed to matter how we got each
constituent. Just the non-terminal label and the
span are all that should matter.
• There might be exponentially many trees in
this formulation.
• And we’re not calculating the probability that
the input is a sentence – this is only the
probability of one interpretation (tree).
45
Statistical chart parsing 5
47
• Evaluation method:
• Train on part of a parsed corpus.
(I.e., gather rules and statistics.)
• Test on a different part of the corpus.
• In one sense, the best evaluation of a method
like this would be data likelihood, but since
we’re scoring trees instead of strings, it’s
difficult to defend any sort of intuition about
the numbers assigned to them.
Evaluation 1
Labelled recall: the fraction of the correct (gold) constituents that appear in the parser’s output.
48
• Evaluation: PARSEVAL measures compare
parser output to known correct parse:
• Labelled precision, labelled recall.
• F-measure = harmonic mean of precision and
recall = 2PR / (P + R)
Evaluation 2
Labelled precision: the fraction of constituents in the parser’s output that are correct.
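A sketch of the labelled PARSEVAL scores, with each constituent represented as a (label, start, end) tuple (that representation, and ignoring duplicate constituents by using sets, are simplifying assumptions):

def parseval(gold_constituents, output_constituents):
    """Labelled precision, labelled recall, and their harmonic mean (F)."""
    gold, output = set(gold_constituents), set(output_constituents)
    correct = len(gold & output)
    precision = correct / len(output) if output else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# e.g. parseval({("S", 0, 3), ("NP", 0, 0), ("VP", 1, 3), ("NP", 2, 3)},
#               {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)})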
49
• Evaluation: PARSEVAL measures compare
parser output to known correct parse:
• Penalize for cross-brackets per sentence:
Constituents in output that overlap two (or more)
correct ones; e.g., [[A B] C] for [A [B C]].
• [[Nadia] [[smelled] [the eggplant]]]
[[[Nadia] [smelled]] [the eggplant]]
• The labels on the subtrees aren’t necessary
for this one.
Evaluation 3
50
• PARSEVAL is a classifier accuracy score –
much more extensional. All that matters is
the right answer at the end.
• But that still means that we can look at parts
of the right answer.
• Can get ~75% labelled precision, recall, and
F with above methods.
Evaluation 4
Improving statistical parsing
• Problem: Probabilities are based only on
structures and categories:
• But actual words strongly condition which rule
is used (cf Ratnaparkhi).
• Improve results by conditioning on more
factors, including words. Think semantics –
the words themselves give us a little bit of
access to this.
51
❹
• Head of a phrase: its central or key word.
• The noun of an NP, the preposition of a PP, etc.
• Lexicalized grammar: Refine the grammar
so that rules take heads of phrases into
account — the actual words.
• BEFORE: Rule for NP.
AFTER: Rules for NP-whose-head-is-aardvark,
NP-whose-head-is-abacus, …, NP-whose-head-is-
zymurgy.
• And similarly for VP, PP, etc.
52
Lexicalized grammars 1
• Notation: cat(head,tag) for constituent
category cat headed by head with part-of-
speech tag.
• e.g., NP(aardvark,NN), PP(without,IN)
53
Lexicalized grammars 2
NP-whose-head-is-the-NN-aardvark
PP-whose-head-is-the-IN-without
A lexicalized grammar
54
TOP → S(bought,VBD)
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
NP(week,NN) → JJ(Last,JJ) NN(week,NN)
NP(IBM,NNP) → NNP(IBM,NNP)
VP(bought,VBD) → VBD(bought,VBD) NP(Lotus,NNP)
NP(Lotus,NNP) → NNP(Lotus,NNP)
Lexical Rules:
JJ(Last,JJ) → Last
NN(week,NN) → week
NNP(IBM,NNP) → IBM
VBD(bought,VBD) → bought
NNP(Lotus,NNP) → Lotus
Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.
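A sketch of the lexicalization step: relabelling each node of an nltk.Tree as cat(head,tag) using a toy head-percolation table (the HEAD_CHILD table below is illustrative, not Collins’s actual head-finding rules):

from nltk import Tree

# Illustrative priority lists: which child category supplies the head of each parent.
HEAD_CHILD = {"TOP": ["S"], "S": ["VP"], "VP": ["VBD", "VB"],
              "NP": ["NN", "NNP"], "PP": ["IN"]}

def lexicalize(tree):
    """Relabel every node as cat(head,tag) and return that node's (head, tag)."""
    if isinstance(tree[0], str):                        # preterminal: label is the tag
        head, tag = tree[0], tree.label()
    else:
        child_cats = [child.label() for child in tree]  # record labels before relabelling
        child_heads = [lexicalize(child) for child in tree]
        head, tag = child_heads[-1]                     # fallback: rightmost child's head
        for cat in HEAD_CHILD.get(tree.label(), []):
            if cat in child_cats:
                head, tag = child_heads[child_cats.index(cat)]
                break
    tree.set_label("%s(%s,%s)" % (tree.label(), head, tag))
    return head, tag

t = Tree.fromstring(
    "(TOP (S (NP (JJ Last) (NN week)) (NP (NNP IBM)) "
    "(VP (VBD bought) (NP (NNP Lotus)))))")
lexicalize(t)    # node labels become S(bought,VBD), NP(week,NN), VP(bought,VBD), ...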
• Number of rules and categories explodes, but
no theoretical change in parsing process
(whether statistical or not).
• But far too specific for practical use; each rule is
used too rarely to estimate its probability reliably.
• Need something more than regular (unlexicalized)
rules and less than complete lexicalization …
• … perhaps we should change the process
after all.
56
Lexicalized grammars 3
Starting from unlexicalized rules:
• 1. Lexicalization: Consider the head word of
each node, not just its category:
• P(t) = P(S) × ∏n P(rule(n) | head(n))
where head(n) is the PoS-tagged head word of node n.
• Needs finer-grained probabilities:
• e.g., probability that rule r is used, given we have an
NP whose head is the noun deficit.
57
Lexicalized parsing 1
Replaces ❻
from slide 40
58
• 2. Head and parent: Condition on the head
and the head of the parent node in the tree:
P(t) = P(S) × ∏n P(rule(n) | head(n)) × P(head(n) | head(parent(n)))
Lexicalized parsing 2
e.g., probability of rule r given that its head is the noun deficit.
e.g., probability that the head is the noun deficit,
given that the parent phrase’s head is the verb report.
60
• Lexical information introduces context into
CFG.
• Grammar is larger.
• Potential problems of sparse data.
• Possible solutions: Smoothing; back-off
estimates.
Effects on parsing
If you don’t have data for a fine-grained
situation, use data from a coarser-grained
situation that it’s contained in.
62
Bikel’s 2004 interpretation
• Can condition on any information available in
generating the tree.
• Basic idea: Avoid sparseness of
lexicalization by decomposing rules.
• Make plausible independence assumptions.
• Break rules down into small steps (small number
of parameters).
• Each rule still parameterized with word/PoS pair:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.
63
Collins’s “model 1” 1
• Lexical Rules, with probability 1:
tag(word, tag) → word
• Internal Rules, with treebank-based
probabilities. Separate the constituents to the left
and right of the head; generate them one at a time:
X → Ln … L1 H R1 … Rm
X, Li, H, and Ri all have the form cat(head,tag).
Notation: italic lowercase symbol for a (head,tag) pair,
e.g., h for the (head,tag) of H.
64
Collins’s “model 1” 2
• Assume there are additional Ln+1 and Rm+1
representing phrase boundaries (“STOP”).
• Example:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
n = 2, m = 0 (two constituents on the left of the head, zero on the right).
X = S, H = VP, L1 = NP, L2 = NP, L3 = STOP, R1 = STOP.
h = (bought,VBD), l1 = (IBM,NNP), l2 = (week,NN).
• Distinguish probabilities of heads Ph, of left
constituents Pl, and of right constituents Pr.
65
Probabilities of internal rules 1
P(X → Ln … L1 H R1 … Rm | X, h)
≈ Ph(H | X, h)                                  ← generate head constituent
× ∏i=1..n+1 Pl(Li(li) | X, h, H)                ← generate left modifiers (stop at STOP)
× ∏i=1..m+1 Pr(Ri(ri) | X, h, H)                ← generate right modifiers (stop at STOP)
(by the independence assumption)
Probabilities of internal rules 2
• Example:
P( S(bought,VBD)
→ NP(week,NN) NP(IBM,NNP) VP(bought,VBD) )
≈ Ph(VP | S, bought,VBD)                       ← generate head constituent
× Pl(NP(IBM,NNP) | S, bought,VBD, VP)          ← generate left modifiers
× Pl(NP(week,NN) | S, bought,VBD, VP)
× Pl(STOP | S, bought,VBD, VP)
× Pr(STOP | S, bought,VBD, VP)                 ← generate right modifiers
66
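A sketch of this decomposition, multiplying the three kinds of parameters for one internal rule (the dictionary keys for ph, pl, and pr are an assumed representation; no distance features or backoff):

def internal_rule_prob(parent, head_const, left_mods, right_mods, head, ph, pl, pr):
    """Probability of an internal rule X -> Ln .. L1 H R1 .. Rm, decomposed as
    Ph(H | X, h) * product of Pl(Li | X, h, H) * product of Pr(Ri | X, h, H),
    with STOP generated at each end.  Modifier lists are ordered closest-to-head first."""
    p = ph[(head_const, parent, head)]
    for li in left_mods + ["STOP"]:
        p *= pl[(li, parent, head, head_const)]
    for ri in right_mods + ["STOP"]:
        p *= pr[(ri, parent, head, head_const)]
    return p

# The slide's example (the parameter tables ph, pl, pr would come from treebank counts):
# internal_rule_prob("S", "VP", ["NP(IBM,NNP)", "NP(week,NN)"], [],
#                    ("bought", "VBD"), ph, pl, pr)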
67
• (Badly-named) “distance measure” to capture
properties of attachment relevant to current
modifier.
• Pl(Li(li) | X, h, H) becomes Pl(Li(li) | X, h, H, distancel(i – 1)),
and analogously on the right.
• The value of distancex is actually a pair of Boolean
random variables:
• Is the string of modifiers 1..(i – 1) of length 0?
I.e., is modifier i attached adjacent to the head?
• Does the string of modifiers 1..(i – 1) contain a verb?
I.e., does the attachment of modifier i cross a verb?
Adding other dependencies
Collins’s “model 1” 4
• Backs off …
• to tag probability when no data for specific word;
• to complete non-lexicalization when necessary.
68
69
Collins’s Models 2 and 3
• Model 2: Add verb subcategorization and
argument/adjunct distinction.
• Model 3: Integrate gaps and trace
identification into model.
• Especially important with addition of
subcategorization.
72
Results and conclusions 1
• Model 2 outperforms Model 1.
• Model 3: Similar performance, but identifies
traces too.
• Model 2 performs best overall:
• LP = 89.6, LR = 89.9 [sentences ≤ 100 words].
• LP = 90.1, LR = 90.4 [sentences ≤ 40 words].
• Rich information improves parsing
performance.
73
Results and conclusions 2
• Strengths:
• Incorporation of lexical and other linguistic
information.
• Competitive results.
• Weaknesses:
• Supervised training.
• Performance tightly linked to particular type of
corpus used.
74
Results and conclusions 3
• Importance to CL:
• High-performance parser showing benefits of
lexicalization and linguistic information.
• Publicly available, widely used in research.
• There was some initial hope that it would make
language models better, but that didn’t pan out.
• But it was fairly successful at giving us some
access to semantics, i.e., language modelling
makes parsing better.