CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 10: Statistical Parsing with PCFGs
Where we’re at
Previous lecture:
Standard CKY (for non-probabilistic CFGs)
The standard CKY algorithm finds all possible parse
trees τ for a sentence S = w(1)…w(n) under a CFG G
in Chomsky Normal Form.
Today’s lecture:
Probabilistic Context-Free Grammars (PCFGs)
– CFGs in which each rule is associated with a probability
CKY for PCFGs (Viterbi):
– CKY for PCFGs finds the most likely parse tree
τ* = argmax_τ P(τ | S) for the sentence S under a PCFG.
Previous Lecture:
CKY for CFGs
CKY: filling the chart
[Figure: the n×n upper-triangular chart for w1…wn, filled bottom-up from shorter spans to longer spans.]
CKY: filling one cell
[Figure: filling chart[2][6] for the span w2…w6 of a seven-word sentence w1…w7, by combining entries from chart[2][k] and chart[k+1][6] for each split point k.]
CKY for standard CFGs
CKY is a bottom-up chart parsing algorithm that finds
all possible parse trees τ for a sentence S = w(1)…w(n)
under a CFG G in Chomsky Normal Form (CNF).
– CNF: G has two types of rules: X ⟶ Y Z and X ⟶ w
(X, Y, Z are nonterminals, w is a terminal)
– CKY is a dynamic programming algorithm
– The parse chart is an n×n upper triangular matrix:
Each cell chart[i][j] (i ≤ j) stores all subtrees for w(i)…w(j)
– Each cell chart[i][j] has at most one entry for each nonterminal X (with backpointers to every pair of entries (Y, Z) in cells chart[i][k] and chart[k+1][j] from which an X can be formed)
– Time complexity: O(n³ · |G|)
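A minimal recognizer-style sketch of this chart-filling loop in Python (a sketch under assumed rule representations, not the course's reference implementation): lexical rules as a set of (X, w) pairs, binary rules as a set of (X, Y, Z) triples, 0-based spans.

```python
def cky_recognize(words, lexical_rules, binary_rules):
    """Fill the CKY chart bottom-up; chart[i][j] is the set of nonterminals
    deriving the span words[i..j] (0-based, inclusive)."""
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # length-1 spans: X -> w
        chart[i][i] = {X for (X, word) in lexical_rules if word == w}
    for span in range(2, n + 1):                       # longer spans, shortest first
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                      # split point: [i..k] + [k+1..j]
                for (X, Y, Z) in binary_rules:
                    if Y in chart[i][k] and Z in chart[k + 1][j]:
                        chart[i][j].add(X)
    return chart
```

The sentence is in the language iff the start symbol appears in chart[0][n-1]; a full parser additionally stores backpointers to the (Y, Z) pairs.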
Dealing with ambiguity: Probabilistic Context-Free Grammars (PCFGs)
Probabilistic Context-Free Grammars
For every nonterminal X, define a probability distribution
P(X → α | X) over all rules with the same LHS symbol X:
S → NP VP        0.8
S → S conj S     0.2
NP → Noun        0.2
NP → Det Noun    0.4
NP → NP PP       0.2
NP → NP conj NP  0.2
VP → Verb        0.4
VP → Verb NP     0.3
VP → Verb NP NP  0.1
VP → VP PP       0.2
PP → P NP        1.0
Computing P(τ) with a PCFG
The probability of a tree τ is the product of the probabilities
of all its rules:
Using the PCFG defined above, consider the tree for "John eats pie with cream":
[S [NP [Noun John]] [VP [VP [Verb eats] [NP [Noun pie]]] [PP [P with] [NP [Noun cream]]]]]
P(τ) = 0.8 × 0.3 × 0.2 × 1.0 × 0.2³ = 0.000384
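A minimal sketch of this product-of-rules computation in Python, assuming trees are nested (label, children) tuples and rule_probs maps (lhs, rhs) pairs to probabilities (hypothetical names; lexical rules such as Noun → John are assumed to be included, with probability 1.0 in the running example).

```python
def tree_probability(tree, rule_probs):
    """P(tau) = product of the probabilities of all rules used in the tree.
    tree is (label, children); children is either a terminal string or a list of subtrees."""
    label, children = tree
    if isinstance(children, str):                      # lexical rule X -> w
        return rule_probs[(label, (children,))]
    rhs = tuple(child[0] for child in children)        # e.g. ('NP', 'VP') for S -> NP VP
    p = rule_probs[(label, rhs)]
    for child in children:
        p *= tree_probability(child, rule_probs)
    return p
```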
Learning the parameters of a PCFG
If we have a treebank (a corpus in which each
sentence is associated with a parse tree), we can just
count the number of times each rule appears, e.g.:
S → NP VP .  (count = 1000)
S → S conj S .  (count = 220)
and then we divide the observed frequency of each rule X → Y Z by the sum of the frequencies of all rules with the same LHS X to turn these counts into probabilities:
S → NP VP .  (p = 1000/1220)
S → S conj S .  (p = 220/1220)
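A minimal sketch of this relative-frequency (maximum likelihood) estimation in Python, using the same nested (label, children) tree representation as above (hypothetical helper names).

```python
from collections import defaultdict

def count_rules(tree, counts):
    """Recursively count the rules used in one tree."""
    label, children = tree
    if isinstance(children, str):                      # lexical rule X -> w
        counts[label][(children,)] += 1
        return
    counts[label][tuple(child[0] for child in children)] += 1
    for child in children:
        count_rules(child, counts)

def estimate_pcfg(treebank):
    """P(X -> alpha | X) = count(X -> alpha) / total count of rules with LHS X."""
    counts = defaultdict(lambda: defaultdict(int))
    for tree in treebank:
        count_rules(tree, counts)
    return {(lhs, rhs): c / sum(rules.values())
            for lhs, rules in counts.items() for rhs, c in rules.items()}
```

The returned dict is keyed by (lhs, rhs) pairs, so it can be passed directly to tree_probability above.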
More on probabilities:
Computing P(s):
If P(τ) is the probability of a tree τ,
the probability of a sentence s is the sum of the
probabilities of all its parse trees:
P(s) = ∑_{τ : yield(τ) = s} P(τ)
How do we know that P(L) = ∑τ P(τ) = 1?
If we have learned the PCFG from a corpus via MLE,
this is guaranteed to be the case.
If we just set the probabilities by hand, we could run into
trouble, as in the following example:
S → S S  (0.9)
S → w  (0.1)
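A short worked check (not on the original slide) of why this hand-set grammar leaks probability mass: let q be the probability that a derivation started from S terminates. A derivation terminates either via S → w (probability 0.1) or via S → S S with both daughters terminating (probability 0.9·q²), so
q = 0.1 + 0.9 q²   ⇒   q ∈ {1/9, 1},
and the relevant solution (the smallest fixed point) is q = 1/9. Hence ∑_τ P(τ) = 1/9 < 1: most of the probability mass goes to infinite derivations, and this PCFG is inconsistent.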
PCFG parsing (decoding): Probabilistic CKY
Probabilistic CKY: Viterbi
Like standard CKY, but with probabilities.
Finding the most likely tree is similar to Viterbi for HMMs:
Initialization:
– [optional] Every chart entry that corresponds to a terminal (entry w in cell[i][i]) has Viterbi probability P_VIT(w[i][i]) = 1 (*)
– Every entry for a nonterminal X in cell[i][i] has Viterbi probability P_VIT(X[i][i]) = P(X → w | X) [and a single backpointer to w[i][i] (*)]
Recurrence: For every entry for a nonterminal X in cell[i][j], keep only the highest-scoring pair of children (Y in cell[i][k] and Z in cell[k+1][j]), together with backpointers to that pair:
P_VIT(X[i][j]) = max over Y, Z, k of  P_VIT(Y[i][k]) × P_VIT(Z[k+1][j]) × P(X → Y Z | X)
Final step: Return the Viterbi parse for the start symbol S
in the top cell[1][n].
*this is unnecessary for simple PCFGs, but can be helpful for more complex probability models
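A minimal sketch of this Viterbi CKY in Python (a sketch under assumed representations, not the course's reference code): lexical rule probabilities as lexical[(X, w)] = P(X → w | X), binary rules as binary[(X, Y, Z)] = P(X → Y Z | X), log probabilities to avoid underflow, and 0-based spans.

```python
import math

def viterbi_cky(words, lexical, binary, start="S"):
    """Return the best log-probability for the start symbol over the whole sentence,
    plus the backpointer chart needed to reconstruct the Viterbi parse."""
    n = len(words)
    best = [[dict() for _ in range(n)] for _ in range(n)]   # best[i][j][X] = max log prob
    back = [[dict() for _ in range(n)] for _ in range(n)]   # back[i][j][X] = (k, Y, Z)
    for i, w in enumerate(words):                           # initialization: cell[i][i]
        for (X, word), p in lexical.items():
            if word == w:
                best[i][i][X] = math.log(p)
    for span in range(2, n + 1):                            # recurrence: longer spans
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (X, Y, Z), p in binary.items():
                    if Y in best[i][k] and Z in best[k + 1][j]:
                        score = math.log(p) + best[i][k][Y] + best[k + 1][j][Z]
                        if score > best[i][j].get(X, float("-inf")):
                            best[i][j][X] = score
                            back[i][j][X] = (k, Y, Z)
    return best[0][n - 1].get(start), back                  # final step: S over w1..wn
```

Following the backpointers from (start, 0, n-1) down to the lexical entries recovers the Viterbi tree τ*.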
Probabilistic CKY
Input: POS-tagged sentence  John_N eats_V pie_N with_P cream_N
Tagging rules: Noun → N 1.0,  Verb → V 1.0,  Prep → P 1.0,  and PP → Prep NP 1.0
(The chart entries on this slide use VP → Verb 0.3 and VP → VP PP 0.3; the other rules are as in the grammar defined earlier.)

Chart (Viterbi scores; cell [i, j] spans words i…j):
[1,1] Noun 1.0 (John), NP 0.2    [2,2] Verb 1.0 (eats), VP 0.3    [3,3] Noun 1.0 (pie), NP 0.2
[4,4] Prep 1.0 (with)            [5,5] Noun 1.0 (cream), NP 0.2
[1,2] S  = 0.8 · 0.2 · 0.3
[2,3] VP = 1 · 0.3 · 0.2 = 0.06  [4,5] PP = 1 · 1 · 0.2 = 0.2
[1,3] S  = 0.8 · 0.2 · 0.06      [3,5] NP = 0.2 · 0.2 · 0.2 = 0.008
[2,5] VP = max( 1.0 · 0.008 · 0.3,  0.06 · 0.2 · 0.3 ) = 0.0036
[1,5] S  = 0.2 · 0.0036 · 0.8
How do we handle flat rules?
Binarize each flat rule by adding dummy nonterminals (here ConjS), and set the probability of the rule with the dummy nonterminal on the LHS to 1:
S → S ConjS        0.2
ConjS → conj S     1.0
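A minimal sketch of this binarization in Python, assuming rules are (lhs, rhs_tuple, prob) triples; it uses generic dummy names of the form X|B.C rather than the slide's ConjS (a sketch, hypothetical representation).

```python
def binarize(rules):
    """Turn rules with more than two RHS symbols into binary rules.
    The original probability stays on the first new rule; dummy rules get probability 1.0."""
    out = []
    for lhs, rhs, p in rules:
        if len(rhs) <= 2:
            out.append((lhs, rhs, p))
            continue
        current_lhs, prob = lhs, p
        for i in range(len(rhs) - 2):
            dummy = f"{lhs}|{'.'.join(rhs[i + 1:])}"   # e.g. VP|NP.NP
            out.append((current_lhs, (rhs[i], dummy), prob))
            current_lhs, prob = dummy, 1.0
        out.append((current_lhs, tuple(rhs[-2:]), prob))
    return out
```

For example, binarize([("VP", ("Verb", "NP", "NP"), 0.1)]) yields VP → Verb VP|NP.NP (0.1) and VP|NP.NP → NP NP (1.0).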
Parser evaluation
Precision and recall
Precision and recall were originally developed
as evaluation metrics for information retrieval:
-Precision: What percentage of retrieved documents are
relevant to the query?
-Recall: What percentage of relevant documents were
retrieved?
In NLP, they are often used in addition to accuracy:
-Precision: What percentage of items that were assigned
label X do actually have label X in the test data?
-Recall: What percentage of items that have label X in the test
data were assigned label X by the system?
Particularly useful when there are more than two labels.
True vs. false positives, false negatives
-True positives: Items that were labeled X by the system,
and should be labeled X.
-False positives: Items that were labeled X by the system,
but should not be labeled X.
-False negatives: Items that were not labeled X by the system,
but should be labeled X
[Diagram: two overlapping sets. Items labeled X in the gold standard (‘truth’) = TP + FN; items labeled X by the system = TP + FP. Their intersection contains the true positives (TP); false negatives (FN) lie only in the gold set, false positives (FP) only in the system set.]
Precision, recall, f-measure
Precision: P = TP ∕( TP + FP )
Recall: R = TP ∕( TP + FN )
F-measure: harmonic mean of precision and recall
F = (2·P·R)∕(P + R)
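A one-function sketch of these three formulas (not from the slides):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure from true-positive, false-positive,
    and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```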
Evalb (“Parseval”)
Measures recovery of phrase-structure trees.
Labeled: span and label (NP, PP,…) has to be right.
[Earlier variant— unlabeled: span of nodes has to be right]
Two aspects of evaluation
Precision: How many of the predicted nodes are correct?
Recall: How many of the correct nodes were predicted?
Usually combined into one metric (F-measure):
P = #correctly predicted nodes ∕ #predicted nodes
R = #correctly predicted nodes ∕ #correct nodes
F = 2PR ∕ (P + R)
Parseval in practice
Gold standard:
  eat sushi with tuna:        [VP [V eat] [NP [NP [N sushi]] [PP [P with] [NP [N tuna]]]]]
  eat sushi with chopsticks:  [VP [VP [V eat] [NP [N sushi]]] [PP [P with] [NP [N chopsticks]]]]
Parser output:
  eat sushi with tuna:        [VP [VP [V eat] [NP [N sushi]]] [PP [P with] [NP [N tuna]]]]
  eat sushi with chopsticks:  [VP [V eat] [NP [NP [N sushi]] [PP [P with] [NP [N chopsticks]]]]]
The parser gets both PP attachments wrong, but 4 of the 5 labeled nodes in each tree are correct:
  eat sushi with tuna:        Precision: 4/5   Recall: 4/5
  eat sushi with chopsticks:  Precision: 4/5   Recall: 4/5
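A minimal sketch of labeled Parseval scoring over the nested (label, children) trees used earlier (hypothetical helpers; evalb itself applies additional conventions, e.g. for punctuation):

```python
def labeled_spans(tree, start=0):
    """Return (set of (label, i, j) constituents, span length), excluding preterminals."""
    label, children = tree
    if isinstance(children, str):                     # preterminal: contributes no bracket
        return set(), 1
    spans, length = set(), 0
    for child in children:
        child_spans, child_length = labeled_spans(child, start + length)
        spans |= child_spans
        length += child_length
    spans.add((label, start, start + length))
    return spans, length

def parseval(gold_tree, predicted_tree):
    """Labeled precision and recall of the predicted tree against the gold tree."""
    gold, _ = labeled_spans(gold_tree)
    pred, _ = labeled_spans(predicted_tree)
    correct = len(gold & pred)
    return correct / len(pred), correct / len(gold)
```

On either sentence above this returns precision = recall = 4/5.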
Shortcomings of PCFGs
How well can a PCFG model the distribution of trees?

PCFGs make independence assumptions:
Only the label of a node determines what children it has.
Factors that influence these assumptions:
Shape of the trees:
A corpus with flat trees (i.e. few nodes per sentence) results in a model with few independence assumptions.
Labeling of the trees:
A corpus with many node labels (nonterminals) results in a model with few independence assumptions.
Example 1: flat trees
Corpus:
  [S I eat sushi with tuna]
  [S I eat sushi with chopsticks]
What sentences would a PCFG estimated from this corpus generate?
Example 2: deep trees, few labels
Corpus:
  [S I [S eat [S sushi [S with chopsticks]]]]
  [S I [S eat [S sushi [S with tuna]]]]
What sentences would a PCFG estimated from this corpus generate?
Example 3: deep trees, many labels
Corpus:
  [S I [S1 eat [S2 sushi [S3 with tuna]]]]
  [S I [S1 eat [S2 sushi [S3 with chopsticks]]]]
What sentences would a PCFG estimated from this corpus generate?
Aside: Bias/Variance tradeoff
A probability model has low bias if it makes
few independence assumptions.
⇒ It can capture the structures in the training data.
This typically leads to a more fine-grained partitioning
of the training data.
Hence, fewer data points are available to estimate
the model parameters.
This increases the variance of the model.
⇒ This yields a poor estimate of the distribution.
Penn Treebank parsing
The Penn Treebank
The first publicly available syntactically annotated
corpus
Wall Street Journal (50,000 sentences, 1 million words)
also Switchboard, Brown corpus, ATIS
The annotation:
– POS-tagged (Ratnaparkhi’s MXPOST)
– Manually annotated with phrase-structure trees
– Richer than standard CFG: Traces and other null
elements used to represent non-local dependencies
(designed to allow extraction of predicate-argument
structure) [more on this later in the semester]
Standard data set for English parsers
The Treebank label set
48 preterminals (tags):
– 36 POS tags, 12 other symbols (punctuation etc.)
– Simplified version of Brown tagset (87 tags)
(cf. Lancaster-Oslo/Bergen (LOB) tag set: 126 tags)
14 nonterminals:
standard inventory (S, NP, VP,…)
A simple example
Relatively flat structures:
– There is no noun level
– VP arguments and adjuncts appear at the same level
Function tags, e.g. -SBJ (subject), -MNR (manner)
A more realistic (partial) example
Until Congress acts, the government hasn’t any authority to issue new debt
obligations of any kind, the Treasury said …. .
The Penn Treebank CFG
The Penn Treebank grammar contains many very flat rules (e.g. NPs and VPs with long sequences of daughters):
– Many of these rules appear only once.
– Many of these rules are very similar.
– Can we pool these counts?
PCFGs in practice:
Charniak (1996) Tree-bank grammars
How well do PCFGs work on the Penn Treebank?
– Split Treebank into test set (30K words)
and training set (300K words).
– Estimate a PCFG from training set.
– Parse test set (with correct POS tags).
– Evaluate unlabeled precision and recall
Two ways to improve performance
… change the (internal) grammar:
– Parent annotation/state splits:
Not all NPs/VPs/DTs/… are the same.
It matters where they are in the tree
… change the probability model:
– Lexicalization:
Words matter!
– Markovization:
Generalizing the rules
The parent transformation
PCFGs assume the expansion of any nonterminal is independent of its parent.
But this is not true: NP subjects are more likely to be modified than NP objects.
We can change the grammar by adding the name of the parent node to each nonterminal (e.g. NP^S for a subject NP, NP^VP for an object NP).
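A minimal sketch of this transformation on the nested (label, children) trees used earlier (a sketch; real implementations also decide whether to annotate preterminals):

```python
def parent_annotate(tree, parent="TOP"):
    """Append the parent label to every nonterminal, e.g. NP under S becomes NP^S."""
    label, children = tree
    if isinstance(children, str):                     # leave preterminals unchanged here
        return (label, children)
    return (f"{label}^{parent}", [parent_annotate(child, label) for child in children])
```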
Markov PCFGs (Collins parser)
The RHS of each CFG rule consists of one head H_X, n left sisters L_i, and m right sisters R_j:
Replace rule probabilities with a generative process. For each nonterminal X:
– generate its head H_X (nonterminal or terminal)
– then generate its left sisters L_1…L_n and a STOP symbol, conditioned on H_X
– then generate its right sisters R_1…R_m and a STOP symbol, conditioned on H_X
X → L_n … L_1  H_X  R_1 … R_m        (L_n … L_1: left sisters;  R_1 … R_m: right sisters)
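Under this head-outward process, the probability of a rule factorizes roughly as follows (a schematic decomposition consistent with the description above; actual Collins-style models condition on additional information such as distance and subcategorization):
P(X → L_n…L_1 H_X R_1…R_m) = P_H(H_X | X) · ∏_{i=1..n} P_L(L_i | X, H_X) · P_L(STOP | X, H_X) · ∏_{j=1..m} P_R(R_j | X, H_X) · P_R(STOP | X, H_X)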
Lexicalization
PCFGs can’t distinguish between
“eat sushi with chopsticks” and “eat sushi with tuna”.
We need to take words into account!
P(VP_eat → VP PP_with chopsticks | VP_eat)
vs. P(VP_eat → VP PP_with tuna | VP_eat)
Problem: sparse data (PP_with fatty|white|… tuna …)
Solution: only take head words into account!
Assumption: each constituent has one head word.
Lexicalized PCFGs
At the root (start symbol S), generate the head word of the sentence, w_S, with P(w_S).
Lexicalized rule probabilities:
Every nonterminal is lexicalized: X_wX
Condition rules X_wX → α Y β on the lexicalized LHS X_wX:
P( X_wX → α Y β | X_wX )
Word-word dependencies:
For each nonterminal Y in the RHS of a rule X_wX → α Y β, condition w_Y (the head word of Y) on X and w_X:
P( w_Y | Y, X, w_X )
Dealing with unknown words
A lexicalized PCFG assigns zero probability
to any word that does not appear in the training data.
Solution:
Training: Replace rare words in training data
with a token ‘UNKNOWN’.
Testing: Replace unseen words with ‘UNKNOWN’
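A minimal sketch of this preprocessing in Python (hypothetical frequency threshold; real parsers often use shaped unknown-word classes, e.g. by suffix or capitalization, instead of a single UNKNOWN token):

```python
from collections import Counter

UNKNOWN = "UNKNOWN"

def replace_rare_words(training_sentences, min_count=2):
    """Training: replace words seen fewer than min_count times with UNKNOWN."""
    counts = Counter(w for sentence in training_sentences for w in sentence)
    return [[w if counts[w] >= min_count else UNKNOWN for w in sentence]
            for sentence in training_sentences]

def map_unseen_words(sentence, training_vocabulary):
    """Testing: replace words not seen in the (preprocessed) training data with UNKNOWN."""
    return [w if w in training_vocabulary else UNKNOWN for w in sentence]
```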
Refining the set of categories
Unlexicalized Parsing (Klein & Manning ’03)
Unlexicalized PCFGs with various transformations
of the training data and the model, e.g.:
– Parent annotation (of preterminals and nonterminals):
e.g. distinguish the preposition tag IN from the subordinating-conjunction tag IN
– Add head tags to nonterminals
(e.g. distinguish finite from non-finite VPs)
– Add distance features
Performance: 86.3 precision and 85.1 recall
The Berkeley parser (Petrov et al. ’06, ’07)
Automatically learns refinements of the nonterminals
Performance: 90.2 precision, 89.9 recall
Summary
The Penn Treebank has a large number of very flat
rules.
Accurate parsing requires modifications to the basic
PCFG model: refining the nonterminals, relaxing the
independence assumptions by including grandparent
information, modeling word-word dependencies, etc.
How much of this transfers to other treebanks or
languages?