CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 10: Statistical Parsing with PCFGs
Where we’re at
Previous lecture:
Standard CKY (for non-probabilistic CFGs)
The standard CKY algorithm finds all possible parse
trees τ for a sentence S = w(1)…w(n) under a CFG G
in Chomsky Normal Form.
Today’s lecture:
Probabilistic Context-Free Grammars (PCFGs)
– CFGs in which each rule is associated with a probability
CKY for PCFGs (Viterbi):
– CKY for PCFGs finds the most likely parse tree
τ* = argmax_τ P(τ | S) for the sentence S under a PCFG.
Previous Lecture:
CKY for CFGs
CKY: filling the chart
[Figure: the n×n upper-triangular chart for w1…wn, filled bottom-up from shorter spans to longer spans.]
CKY: filling one cell
[Figure: filling chart[2][6] for the span w2…w6 of a seven-word sentence w1…w7, by combining entries from chart[2][k] and chart[k+1][6] for each split point k.]
CKY for standard CFGs
CKY is a bottom-up chart parsing algorithm that finds
all possible parse trees τ for a sentence S = w(1)…w(n)
under a CFG G in Chomsky Normal Form (CNF).
– CNF: G has two types of rules: X ⟶ Y Z and X ⟶ w
(X, Y, Z are nonterminals, w is a terminal)
– CKY is a dynamic programming algorithm
– The parse chart is an n×n upper triangular matrix:
Each cell chart[i][j] (i ≤ j) stores all subtrees for w(i)…w(j)
– Each cell chart[i][j] has at most one entry for each nonterminal X (with backpointers to every pair of entries (Y, Z) in cells chart[i][k] and chart[k+1][j] from which an X can be formed)
– Time complexity: O(n³ · |G|)
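A minimal recognizer-style sketch of this chart-filling loop in Python (a sketch under assumed rule representations, not the course's reference implementation): lexical rules as a set of (X, w) pairs, binary rules as a set of (X, Y, Z) triples, 0-based spans.

```python
def cky_recognize(words, lexical_rules, binary_rules):
    """Fill the CKY chart bottom-up; chart[i][j] is the set of nonterminals
    deriving the span words[i..j] (0-based, inclusive)."""
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # length-1 spans: X -> w
        chart[i][i] = {X for (X, word) in lexical_rules if word == w}
    for span in range(2, n + 1):                       # longer spans, shortest first
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                      # split point: [i..k] + [k+1..j]
                for (X, Y, Z) in binary_rules:
                    if Y in chart[i][k] and Z in chart[k + 1][j]:
                        chart[i][j].add(X)
    return chart
```

The sentence is in the language iff the start symbol appears in chart[0][n-1]; a full parser additionally stores backpointers to the (Y, Z) pairs.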
Dealing with ambiguity: Probabilistic Context-Free Grammars (PCFGs)
Probabilistic Context-Free Grammars
For every nonterminal X, define a probability distribution
P(X → α | X) over all rules with the same LHS symbol X:
S → NP VP        0.8
S → S conj S     0.2
NP → Noun        0.2
NP → Det Noun    0.4
NP → NP PP       0.2
NP → NP conj NP  0.2
VP → Verb        0.4
VP → Verb NP     0.3
VP → Verb NP NP  0.1
VP → VP PP       0.2
PP → P NP        1.0
Computing P(τ) with a PCFG
The probability of a tree τ is the product of the probabilities
of all its rules:
Using the PCFG defined above, consider the tree for "John eats pie with cream":
[S [NP [Noun John]] [VP [VP [Verb eats] [NP [Noun pie]]] [PP [P with] [NP [Noun cream]]]]]
P(τ) = 0.8 × 0.3 × 0.2 × 1.0 × 0.2³ = 0.000384
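A minimal sketch of this product-of-rules computation in Python, assuming trees are nested (label, children) tuples and rule_probs maps (lhs, rhs) pairs to probabilities (hypothetical names; lexical rules such as Noun → John are assumed to be included, with probability 1.0 in the running example).

```python
def tree_probability(tree, rule_probs):
    """P(tau) = product of the probabilities of all rules used in the tree.
    tree is (label, children); children is either a terminal string or a list of subtrees."""
    label, children = tree
    if isinstance(children, str):                      # lexical rule X -> w
        return rule_probs[(label, (children,))]
    rhs = tuple(child[0] for child in children)        # e.g. ('NP', 'VP') for S -> NP VP
    p = rule_probs[(label, rhs)]
    for child in children:
        p *= tree_probability(child, rule_probs)
    return p
```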
Learning the parameters of a PCFG
If we have a treebank (a corpus in which each
sentence is associated with a parse tree), we can just
count the number of times each rule appears, e.g.:
S → NP VP .  (count = 1000)
S → S conj S .  (count = 220)
and then we divide the observed frequency of each rule X → Y Z by the sum of the frequencies of all rules with the same LHS X to turn these counts into probabilities:
S → NP VP .  (p = 1000/1220)
S → S conj S .  (p = 220/1220)
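A minimal sketch of this relative-frequency (maximum likelihood) estimation in Python, using the same nested (label, children) tree representation as above (hypothetical helper names).

```python
from collections import defaultdict

def count_rules(tree, counts):
    """Recursively count the rules used in one tree."""
    label, children = tree
    if isinstance(children, str):                      # lexical rule X -> w
        counts[label][(children,)] += 1
        return
    counts[label][tuple(child[0] for child in children)] += 1
    for child in children:
        count_rules(child, counts)

def estimate_pcfg(treebank):
    """P(X -> alpha | X) = count(X -> alpha) / total count of rules with LHS X."""
    counts = defaultdict(lambda: defaultdict(int))
    for tree in treebank:
        count_rules(tree, counts)
    return {(lhs, rhs): c / sum(rules.values())
            for lhs, rules in counts.items() for rhs, c in rules.items()}
```

The returned dict is keyed by (lhs, rhs) pairs, so it can be passed directly to tree_probability above.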
More on probabilities:
Computing P(s):
If P(τ) is the probability of a tree τ,
the probability of a sentence s is the sum of the
probabilities of all its parse trees:
P(s) = ∑_{τ : yield(τ) = s} P(τ)
How do we know that P(L) = ∑τ P(τ) = 1?
If we have learned the PCFG from a corpus via MLE,
this is guaranteed to be the case.
If we just set the probabilities by hand, we could run into
trouble, as in the following example:
S → S S  (0.9)
S → w  (0.1)
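A short worked check (not on the original slide) of why this hand-set grammar leaks probability mass: let q be the probability that a derivation started from S terminates. A derivation terminates either via S → w (probability 0.1) or via S → S S with both daughters terminating (probability 0.9·q²), so
q = 0.1 + 0.9 q²   ⇒   q ∈ {1/9, 1},
and the relevant solution (the smallest fixed point) is q = 1/9. Hence ∑_τ P(τ) = 1/9 < 1: most of the probability mass goes to infinite derivations, and this PCFG is inconsistent.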
PCFG parsing (decoding): Probabilistic CKY
Probabilistic CKY: Viterbi
Like standard CKY, but with probabilities.
Finding the most likely tree is similar to Viterbi for HMMs:
Initialization:
– [optional] Every chart entry that corresponds to a terminal (entry w in cell[i][i]) has Viterbi probability P_VIT(w[i][i]) = 1 (*)
– Every entry for a nonterminal X in cell[i][i] has Viterbi probability P_VIT(X[i][i]) = P(X → w | X) [and a single backpointer to w[i][i] (*)]
Recurrence: For every entry for a nonterminal X in cell[i][j], keep only the highest-scoring pair of children (Y in cell[i][k] and Z in cell[k+1][j]), together with backpointers to that pair:
P_VIT(X[i][j]) = max over Y, Z, k of  P_VIT(Y[i][k]) × P_VIT(Z[k+1][j]) × P(X → Y Z | X)
Final step: Return the Viterbi parse for the start symbol S
in the top cell[1][n].
*this is unnecessary for simple PCFGs, but can be helpful for more complex probability models
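A minimal sketch of this Viterbi CKY in Python (a sketch under assumed representations, not the course's reference code): lexical rule probabilities as lexical[(X, w)] = P(X → w | X), binary rules as binary[(X, Y, Z)] = P(X → Y Z | X), log probabilities to avoid underflow, and 0-based spans.

```python
import math

def viterbi_cky(words, lexical, binary, start="S"):
    """Return the best log-probability for the start symbol over the whole sentence,
    plus the backpointer chart needed to reconstruct the Viterbi parse."""
    n = len(words)
    best = [[dict() for _ in range(n)] for _ in range(n)]   # best[i][j][X] = max log prob
    back = [[dict() for _ in range(n)] for _ in range(n)]   # back[i][j][X] = (k, Y, Z)
    for i, w in enumerate(words):                           # initialization: cell[i][i]
        for (X, word), p in lexical.items():
            if word == w:
                best[i][i][X] = math.log(p)
    for span in range(2, n + 1):                            # recurrence: longer spans
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (X, Y, Z), p in binary.items():
                    if Y in best[i][k] and Z in best[k + 1][j]:
                        score = math.log(p) + best[i][k][Y] + best[k + 1][j][Z]
                        if score > best[i][j].get(X, float("-inf")):
                            best[i][j][X] = score
                            back[i][j][X] = (k, Y, Z)
    return best[0][n - 1].get(start), back                  # final step: S over w1..wn
```

Following the backpointers from (start, 0, n-1) down to the lexical entries recovers the Viterbi tree τ*.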
Probabilistic CKY
Input: POS-tagged sentence  John_N eats_V pie_N with_P cream_N
Tagging rules: Noun → N 1.0,  Verb → V 1.0,  Prep → P 1.0,  and PP → Prep NP 1.0
(The chart entries on this slide use VP → Verb 0.3 and VP → VP PP 0.3; the other rules are as in the grammar defined earlier.)

Chart (Viterbi scores; cell [i, j] spans words i…j):
[1,1] Noun 1.0 (John), NP 0.2    [2,2] Verb 1.0 (eats), VP 0.3    [3,3] Noun 1.0 (pie), NP 0.2
[4,4] Prep 1.0 (with)            [5,5] Noun 1.0 (cream), NP 0.2
[1,2] S  = 0.8 · 0.2 · 0.3
[2,3] VP = 1 · 0.3 · 0.2 = 0.06  [4,5] PP = 1 · 1 · 0.2 = 0.2
[1,3] S  = 0.8 · 0.2 · 0.06      [3,5] NP = 0.2 · 0.2 · 0.2 = 0.008
[2,5] VP = max( 1.0 · 0.008 · 0.3,  0.06 · 0.2 · 0.3 ) = 0.0036
[1,5] S  = 0.2 · 0.0036 · 0.8
How do we handle flat rules?
Binarize each flat rule by adding dummy nonterminals (here ConjS), and set the probability of the rule with the dummy nonterminal on the LHS to 1:
S → S ConjS        0.2
ConjS → conj S     1.0
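A minimal sketch of this binarization in Python, assuming rules are (lhs, rhs_tuple, prob) triples; it uses generic dummy names of the form X|B.C rather than the slide's ConjS (a sketch, hypothetical representation).

```python
def binarize(rules):
    """Turn rules with more than two RHS symbols into binary rules.
    The original probability stays on the first new rule; dummy rules get probability 1.0."""
    out = []
    for lhs, rhs, p in rules:
        if len(rhs) <= 2:
            out.append((lhs, rhs, p))
            continue
        current_lhs, prob = lhs, p
        for i in range(len(rhs) - 2):
            dummy = f"{lhs}|{'.'.join(rhs[i + 1:])}"   # e.g. VP|NP.NP
            out.append((current_lhs, (rhs[i], dummy), prob))
            current_lhs, prob = dummy, 1.0
        out.append((current_lhs, tuple(rhs[-2:]), prob))
    return out
```

For example, binarize([("VP", ("Verb", "NP", "NP"), 0.1)]) yields VP → Verb VP|NP.NP (0.1) and VP|NP.NP → NP NP (1.0).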
Parser evaluation
Precision and recall
Precision and recall were originally developed
as evaluation metrics for information retrieval:
-Precision: What percentage of retrieved documents are
relevant to the query?
-Recall: What percentage of relevant documents were
retrieved?
In NLP, they are often used in addition to accuracy:
-Precision: What percentage of items that were assigned
label X do actually have label X in the test data?
-Recall: What percentage of items that have label X in the test
data were assigned label X by the system?
Particularly useful when there are more than two labels.
True vs. false positives, false negatives
-True positives: Items that were labeled X by the system,
and should be labeled X.
-False positives: Items that were labeled X by the system,
but should not be labeled X.
-False negatives: Items that were not labeled X by the system,
but should be labeled X
[Diagram: two overlapping sets. Items labeled X in the gold standard (‘truth’) = TP + FN; items labeled X by the system = TP + FP. Their intersection contains the true positives (TP); false negatives (FN) lie only in the gold set, false positives (FP) only in the system set.]
Precision, recall, f-measure
Precision: P = TP ∕( TP + FP )
Recall: R = TP ∕( TP + FN )
F-measure: harmonic mean of precision and recall
F = (2·P·R)∕(P + R)
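A one-function sketch of these three formulas (not from the slides):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure from true-positive, false-positive,
    and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```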
Evalb (“Parseval”)
Measures recovery of phrase-structure trees.
Labeled: span and label (NP, PP,…) has to be right.
[Earlier variant— unlabeled: span of nodes has to be right]
Two aspects of evaluation
Precision: How many of the predicted nodes are correct?
Recall: How many of the correct nodes were predicted?
Usually combined into one metric (F-measure):
P = #correctly predicted nodes ∕ #predicted nodes
R = #correctly predicted nodes ∕ #correct nodes
F = 2PR ∕ (P + R)
Parseval in practice
Gold standard:
  eat sushi with tuna:        [VP [V eat] [NP [NP [N sushi]] [PP [P with] [NP [N tuna]]]]]
  eat sushi with chopsticks:  [VP [VP [V eat] [NP [N sushi]]] [PP [P with] [NP [N chopsticks]]]]
Parser output:
  eat sushi with tuna:        [VP [VP [V eat] [NP [N sushi]]] [PP [P with] [NP [N tuna]]]]
  eat sushi with chopsticks:  [VP [V eat] [NP [NP [N sushi]] [PP [P with] [NP [N chopsticks]]]]]
The parser gets both PP attachments wrong, but 4 of the 5 labeled nodes in each tree are correct:
  eat sushi with tuna:        Precision: 4/5   Recall: 4/5
  eat sushi with chopsticks:  Precision: 4/5   Recall: 4/5
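A minimal sketch of labeled Parseval scoring over the nested (label, children) trees used earlier (hypothetical helpers; evalb itself applies additional conventions, e.g. for punctuation):

```python
def labeled_spans(tree, start=0):
    """Return (set of (label, i, j) constituents, span length), excluding preterminals."""
    label, children = tree
    if isinstance(children, str):                     # preterminal: contributes no bracket
        return set(), 1
    spans, length = set(), 0
    for child in children:
        child_spans, child_length = labeled_spans(child, start + length)
        spans |= child_spans
        length += child_length
    spans.add((label, start, start + length))
    return spans, length

def parseval(gold_tree, predicted_tree):
    """Labeled precision and recall of the predicted tree against the gold tree."""
    gold, _ = labeled_spans(gold_tree)
    pred, _ = labeled_spans(predicted_tree)
    correct = len(gold & pred)
    return correct / len(pred), correct / len(gold)
```

On either sentence above this returns precision = recall = 4/5.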
Shortcomings of PCFGs
How well can a PCFG model the distribution of trees?

PCFGs make independence assumptions:
Only the label of a node determines what children it has.
Factors that influence these assumptions:
Shape of the trees:
A corpus with flat trees (i.e. few nodes per sentence) results in a model with few independence assumptions.
Labeling of the trees:
A corpus with many node labels (nonterminals) results in a model with few independence assumptions.
Example 1: flat trees
Corpus:
  [S I eat sushi with tuna]
  [S I eat sushi with chopsticks]
What sentences would a PCFG estimated from this corpus generate?
Example 2: deep trees, few labels
Corpus:
  [S I [S eat [S sushi [S with chopsticks]]]]
  [S I [S eat [S sushi [S with tuna]]]]
What sentences would a PCFG estimated from this corpus generate?
Example 3: deep trees, many labels
Corpus:
  [S I [S1 eat [S2 sushi [S3 with tuna]]]]
  [S I [S1 eat [S2 sushi [S3 with chopsticks]]]]
What sentences would a PCFG estimated from this corpus generate?
Aside: Bias/Variance tradeoff
A probability model has low bias if it makes
few independence assumptions.
⇒ It can capture the structures in the training data.
This typically leads to a more fine-grained partitioning
of the training data.
Hence, fewer data points are available to estimate
the model parameters.
This increases the variance of the model.
⇒ This yields a poor estimate of the distribution.
Penn Treebank parsing
The Penn Treebank
The first publicly available syntactically annotated
corpus
Wall Street Journal (50,000 sentences, 1 million words)
also Switchboard, Brown corpus, ATIS
The annotation:
– POS-tagged (Ratnaparkhi’s MXPOST)
– Manually annotated with phrase-structure trees
– Richer than standard CFG: Traces and other null
elements used to represent non-local dependencies
(designed to allow extraction of predicate-argument
structure) [more on this later in the semester]
Standard data set for English parsers
The Treebank label set
48 preterminals (tags):
– 36 POS tags, 12 other symbols (punctuation etc.)
– Simplified version of Brown tagset (87 tags)
(cf. Lancaster-Oslo/Bergen (LOB) tag set: 126 tags)
14 nonterminals:
standard inventory (S, NP, VP,…)
A simple example
Relatively flat structures:
– There is no noun level
– VP arguments and adjuncts appear at the same level
Function tags, e.g. -SBJ (subject), -MNR (manner)
A more realistic (partial) example
Until Congress acts, the government hasn’t any authority to issue new debt
obligations of any kind, the Treasury said …. .
The Penn Treebank CFG
The Penn Treebank grammar contains many very flat rules (e.g. NPs and VPs with long sequences of daughters):
– Many of these rules appear only once.
– Many of these rules are very similar.
– Can we pool these counts?
PCFGs in practice:
Charniak (1996) Tree-bank grammars
How well do PCFGs work on the Penn Treebank?
– Split Treebank into test set (30K words)
and training set (300K words).
– Estimate a PCFG from training set.
– Parse test set (with correct POS tags).
– Evaluate unlabeled precision and recall
Two ways to improve performance
… change the (internal) grammar:
– Parent annotation/state splits:
Not all NPs/VPs/DTs/… are the same.
It matters where they are in the tree
… change the probability model:
– Lexicalization:
Words matter!
– Markovization:
Generalizing the rules
The parent transformation
PCFGs assume the expansion of any nonterminal is independent of its parent.
But this is not true: NP subjects are more likely to be modified than NP objects.
We can change the grammar by adding the name of the parent node to each nonterminal (e.g. NP^S for a subject NP, NP^VP for an object NP).
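A minimal sketch of this transformation on the nested (label, children) trees used earlier (a sketch; real implementations also decide whether to annotate preterminals):

```python
def parent_annotate(tree, parent="TOP"):
    """Append the parent label to every nonterminal, e.g. NP under S becomes NP^S."""
    label, children = tree
    if isinstance(children, str):                     # leave preterminals unchanged here
        return (label, children)
    return (f"{label}^{parent}", [parent_annotate(child, label) for child in children])
```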
Markov PCFGs (Collins parser)
The RHS of each CFG rule consists of one head H_X, n left sisters L_i, and m right sisters R_j:
Replace rule probabilities with a generative process. For each nonterminal X:
– generate its head H_X (nonterminal or terminal)
– then generate its left sisters L_1…L_n and a STOP symbol, conditioned on H_X
– then generate its right sisters R_1…R_m and a STOP symbol, conditioned on H_X
X → L_n … L_1  H_X  R_1 … R_m        (L_n … L_1: left sisters;  R_1 … R_m: right sisters)
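Under this head-outward process, the probability of a rule factorizes roughly as follows (a schematic decomposition consistent with the description above; actual Collins-style models condition on additional information such as distance and subcategorization):
P(X → L_n…L_1 H_X R_1…R_m) = P_H(H_X | X) · ∏_{i=1..n} P_L(L_i | X, H_X) · P_L(STOP | X, H_X) · ∏_{j=1..m} P_R(R_j | X, H_X) · P_R(STOP | X, H_X)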
Lexicalization
PCFGs can’t distinguish between
“eat sushi with chopsticks” and “eat sushi with tuna”.
We need to take words into account!
P(VP_eat → VP PP_with chopsticks | VP_eat)
vs. P(VP_eat → VP PP_with tuna | VP_eat)
Problem: sparse data (PP_with fatty|white|… tuna …)
Solution: only take head words into account!
Assumption: each constituent has one head word.
Lexicalized PCFGs
At the root (start symbol S), generate the head word of the sentence, w_S, with P(w_S).
Lexicalized rule probabilities:
Every nonterminal is lexicalized: X_wX
Condition rules X_wX → α Y β on the lexicalized LHS X_wX:
P( X_wX → α Y β | X_wX )
Word-word dependencies:
For each nonterminal Y in the RHS of a rule X_wX → α Y β, condition w_Y (the head word of Y) on X and w_X:
P( w_Y | Y, X, w_X )
Dealing with unknown words
A lexicalized PCFG assigns zero probability
to any word that does not appear in the training data.
Solution:
Training: Replace rare words in training data
with a token ‘UNKNOWN’.
Testing: Replace unseen words with ‘UNKNOWN’
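A minimal sketch of this preprocessing in Python (hypothetical frequency threshold; real parsers often use shaped unknown-word classes, e.g. by suffix or capitalization, instead of a single UNKNOWN token):

```python
from collections import Counter

UNKNOWN = "UNKNOWN"

def replace_rare_words(training_sentences, min_count=2):
    """Training: replace words seen fewer than min_count times with UNKNOWN."""
    counts = Counter(w for sentence in training_sentences for w in sentence)
    return [[w if counts[w] >= min_count else UNKNOWN for w in sentence]
            for sentence in training_sentences]

def map_unseen_words(sentence, training_vocabulary):
    """Testing: replace words not seen in the (preprocessed) training data with UNKNOWN."""
    return [w if w in training_vocabulary else UNKNOWN for w in sentence]
```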
Refining the set of categories
Unlexicalized Parsing (Klein & Manning ’03)
Unlexicalized PCFGs with various transformations
of the training data and the model, e.g.:
– Parent annotation (of preterminals and nonterminals):
e.g. distinguish the preposition tag IN from the subordinating-conjunction tag IN
– Add head tags to nonterminals
(e.g. distinguish finite from non-finite VPs)
– Add distance features
Performance: 86.3 precision and 85.1 recall
The Berkeley parser (Petrov et al. ’06, ’07)
Automatically learns refinements of the nonterminals
Performance: 90.2 precision, 89.9 recall
Summary
The Penn Treebank has a large number of very flat
rules.
Accurate parsing requires modifications to the basic
PCFG model: refining the nonterminals, relaxing the
independence assumptions by including grandparent
information, modeling word-word dependencies, etc.
How much of this transfers to other treebanks or
languages?