
CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center

Lecture 10:
Statistical Parsing with
PCFGs


Where we’re at
Previous lecture:
Standard CKY (for non-probabilistic CFGs)
The standard CKY algorithm finds all possible parse trees τ for a sentence S = w(1)…w(n) under a CFG G in Chomsky Normal Form.

Today's lecture:
Probabilistic Context-Free Grammars (PCFGs)
– CFGs in which each rule is associated with a probability
CKY for PCFGs (Viterbi):
– CKY for PCFGs finds the most likely parse tree τ* = argmax P(τ | S) for the sentence S under a PCFG.


Previous Lecture:
CKY for CFGs


CKY: filling the chart

(Figure: successive snapshots of the n×n chart, indexed by w1 … wi … wn, being filled cell by cell.)

CKY: filling one cell

(Figure: filling cell chart[2][6] of the chart for w1 w2 w3 w4 w5 w6 w7: its entries are computed by combining the entries of the cell pairs chart[2][k] and chart[k+1][6], one split point k at a time.)


CKY for standard CFGs
CKY is a bottom-up chart parsing algorithm that finds all possible parse trees τ for a sentence S = w(1)…w(n) under a CFG G in Chomsky Normal Form (CNF).

– CNF: G has two types of rules: X ⟶ Y Z and X ⟶ w
  (X, Y, Z are nonterminals, w is a terminal)
– CKY is a dynamic programming algorithm.
– The parse chart is an n×n upper triangular matrix:
  each cell chart[i][j] (i ≤ j) stores all subtrees for w(i)…w(j).
– Each cell chart[i][j] has at most one entry for each nonterminal X (plus backpointers to each pair of (Y, Z) entries in cells chart[i][k] and chart[k+1][j] from which an X can be formed).
– Time complexity: O(n³ |G|)
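To make the chart and the CNF rule format concrete, here is a minimal sketch of the recognizer version of the algorithm in Python (my own illustration, not the lecture's reference implementation). It assumes the grammar is given as two dictionaries, one for lexical rules X ⟶ w and one for binary rules X ⟶ Y Z:

```python
from collections import defaultdict

def cky_recognize(words, lex_rules, bin_rules, start='S'):
    """CKY recognition for a CFG in CNF.
    lex_rules maps a word w to the set of nonterminals X with X -> w;
    bin_rules maps a pair (Y, Z) to the set of nonterminals X with X -> Y Z."""
    n = len(words)
    chart = defaultdict(set)                     # chart[(i, j)] = nonterminals spanning w(i)...w(j)
    for i, w in enumerate(words, start=1):       # diagonal: lexical rules
        chart[(i, i)] = set(lex_rules.get(w, set()))
    for span in range(2, n + 1):                 # longer spans, bottom-up
        for i in range(1, n - span + 2):
            j = i + span - 1
            for k in range(i, j):                # split point: w(i)..w(k) and w(k+1)..w(j)
                for Y in chart[(i, k)]:
                    for Z in chart[(k + 1, j)]:
                        chart[(i, j)] |= bin_rules.get((Y, Z), set())
    return start in chart[(1, n)]
```

Storing, for each X in chart[i][j], backpointers to the (Y, Z, k) combinations that produced it turns this recognizer into a parser that can enumerate all parse trees.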

Dealing with ambiguity:
Probabilistic Context-
Free Grammars (PCFGs)


Probabilistic Context-Free Grammars
For every nonterminal X, define a probability distribution
P(X → α | X) over all rules with the same LHS symbol X:


S  → NP VP         0.8
S  → S conj S      0.2
NP → Noun          0.2
NP → Det Noun      0.4
NP → NP PP         0.2
NP → NP conj NP    0.2
VP → Verb          0.4
VP → Verb NP       0.3
VP → Verb NP NP    0.1
VP → VP PP         0.2
PP → P NP          1.0
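As a sketch of what such a grammar looks like in code (the data layout is mine, not from the lecture), each LHS nonterminal maps to a distribution over right-hand sides, and each of these distributions must sum to one:

```python
# Toy PCFG from the slide: rules[X][rhs] = P(X -> rhs | X)
rules = {
    'S':  {('NP', 'VP'): 0.8, ('S', 'conj', 'S'): 0.2},
    'NP': {('Noun',): 0.2, ('Det', 'Noun'): 0.4,
           ('NP', 'PP'): 0.2, ('NP', 'conj', 'NP'): 0.2},
    'VP': {('Verb',): 0.4, ('Verb', 'NP'): 0.3,
           ('Verb', 'NP', 'NP'): 0.1, ('VP', 'PP'): 0.2},
    'PP': {('P', 'NP'): 1.0},
}

# Sanity check: every LHS must define a proper probability distribution.
for lhs, dist in rules.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, lhs
```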


Computing P(τ) with a PCFG
The probability of a tree τ is the product of the probabilities of all of its rules.

Example, for the tree

  (S (NP (Noun John))
     (VP (VP (Verb eats) (NP (Noun pie)))
         (PP (P with) (NP (Noun cream)))))

under the grammar above:

  P(τ) = 0.8 × 0.3 × 0.2 × 1.0 × 0.2³ = 0.000384

(The factor 0.2³ covers the three applications of NP → Noun.)
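A small sketch of this computation, reusing the rules dictionary from the block above and encoding trees as nested tuples (the encoding is mine, not from the lecture):

```python
def tree_prob(tree, rules):
    """P(tree) = product of the probabilities of all rules used in the tree.
    A tree is (label, child, ..., child); a preterminal is (label, word).
    The rules for the words themselves (Noun -> John, ...) count as 1.0 here,
    as in the slide's example."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0
    p = rules[label][tuple(child[0] for child in children)]
    for child in children:
        p *= tree_prob(child, rules)
    return p

tau = ('S',
       ('NP', ('Noun', 'John')),
       ('VP',
        ('VP', ('Verb', 'eats'), ('NP', ('Noun', 'pie'))),
        ('PP', ('P', 'with'), ('NP', ('Noun', 'cream')))))

print(tree_prob(tau, rules))   # 0.8 * 0.3 * 0.2 * 1.0 * 0.2**3 = 0.000384
```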


Learning the parameters of a PCFG
If we have a treebank (a corpus in which each sentence is associated with a parse tree), we can just count the number of times each rule appears, e.g.:
  S → NP VP .     (count = 1000)
  S → S conj S .  (count = 220)
and then divide the observed frequency of each rule X → Y Z by the sum of the frequencies of all rules with the same LHS X to turn these counts into probabilities:
  S → NP VP .     (p = 1000/1220)
  S → S conj S .  (p = 220/1220)
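A minimal sketch of this relative-frequency estimate, assuming the treebank is a list of trees in the nested-tuple format used above:

```python
from collections import Counter, defaultdict

def estimate_pcfg(treebank):
    """MLE: P(X -> rhs | X) = count(X -> rhs) / count(X)."""
    counts = defaultdict(Counter)                   # counts[X][rhs]

    def collect(tree):
        label, *children = tree
        if len(children) == 1 and isinstance(children[0], str):
            counts[label][(children[0],)] += 1      # lexical rule X -> w
            return
        counts[label][tuple(child[0] for child in children)] += 1
        for child in children:
            collect(child)

    for tree in treebank:
        collect(tree)
    return {lhs: {rhs: n / sum(dist.values()) for rhs, n in dist.items()}
            for lhs, dist in counts.items()}
```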


More on probabilities:
Computing P(s):
If P(τ) is the probability of a tree τ, the probability of a sentence s is the sum of the probabilities of all of its parse trees:
  P(s) = ∑τ: yield(τ)=s P(τ)

How do we know that P(L) = ∑τ P(τ) = 1?
If we have learned the PCFG from a corpus via MLE, this is guaranteed to be the case.
If we just set the probabilities by hand, we could run into trouble, as in the following example:
  S → S S   (0.9)
  S → w     (0.1)
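A quick way to see the trouble with this hand-set grammar (an aside, not from the slides): the probability q that a derivation from S ever terminates must satisfy q = 0.1 + 0.9·q², and iterating that map from 0 converges to its smallest solution:

```python
# For S -> S S (0.9) | w (0.1): q = P(a derivation from S terminates).
q = 0.0
for _ in range(200):
    q = 0.1 + 0.9 * q * q
print(q)   # converges to 1/9: only 1/9 of the probability mass ends up on finite trees
```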


PCFG parsing
(decoding):
Probabilistic CKY


Probabilistic CKY: Viterbi

Like standard CKY, but with probabilities.
Finding the most likely tree is similar to Viterbi for HMMs:

Initialization:
– [optional] Every chart entry that corresponds to a terminal (entry w in cell[i][i]) has a Viterbi probability PVIT(w[i][i]) = 1 (*)
– Every entry for a non-terminal X in cell[i][i] has Viterbi probability PVIT(X[i][i]) = P(X → w | X) [and a single backpointer to w[i][i] (*)]

Recurrence: For every entry that corresponds to a non-terminal X in cell[i][j], keep only the highest-scoring pair of backpointers to any pair of children (Y in cell[i][k] and Z in cell[k+1][j]):
  PVIT(X[i][j]) = maxY,Z,k PVIT(Y[i][k]) × PVIT(Z[k+1][j]) × P(X → Y Z | X)

Final step: Return the Viterbi parse for the start symbol S in the top cell[1][n].

(*) This is unnecessary for simple PCFGs, but can be helpful for more complex probability models.
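A compact sketch of this procedure (the data layout is mine; as in the worked example on the next slide, the terminals here would be the POS tags):

```python
def viterbi_cky(words, lex_rules, bin_rules, start='S'):
    """Viterbi CKY for a PCFG in CNF.
    lex_rules maps a terminal w to a list of (X, P(X -> w | X));
    bin_rules maps a pair (Y, Z) to a list of (X, P(X -> Y Z | X)).
    Returns (probability, tree) of the most likely parse, or None."""
    n = len(words)
    best = {}                                  # best[(i, j)][X] = (PVIT, subtree)
    for i, w in enumerate(words, start=1):     # initialization: cell[i][i]
        best[(i, i)] = {X: (p, (X, w)) for X, p in lex_rules.get(w, [])}
    for span in range(2, n + 1):               # recurrence
        for i in range(1, n - span + 2):
            j = i + span - 1
            cell = {}
            for k in range(i, j):              # split point
                for Y, (pY, tY) in best[(i, k)].items():
                    for Z, (pZ, tZ) in best[(k + 1, j)].items():
                        for X, pRule in bin_rules.get((Y, Z), []):
                            p = pRule * pY * pZ
                            if p > cell.get(X, (0.0, None))[0]:
                                cell[X] = (p, (X, tY, tZ))   # keep only the best analysis of X
            best[(i, j)] = cell
    return best[(1, n)].get(start)             # final step: Viterbi parse for S in cell[1][n]
```

Instead of storing the best subtree directly, one can store the backpointers (Y, Z, k) and reconstruct the tree at the end; the max over (Y, Z, k) implemented by the inner loops is exactly the recurrence above.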


Probabilistic CKY

Input: POS-tagged sentence
  John_N eats_V pie_N with_P cream_N
Tagging rules: Noun ⟶ N 1.0, Verb ⟶ V 1.0, Prep ⟶ P 1.0; grammar as above.

(Figure: the Viterbi chart for this sentence. The diagonal holds the tags and their unary projections, e.g. Noun 1.0 and NP 0.2 for John, pie and cream, Verb 1.0 for eats, Prep 1.0 for with. The larger cells hold, e.g., VP = 1·0.3·0.2 = 0.06 for "eats pie", PP = 1·1·0.2 for "with cream", S = 0.8·0.2·0.06 for "John eats pie", NP = 0.2·0.2·0.2 = 0.008 for "pie with cream", VP = max(1.0·0.008·0.3, 0.06·0.2·0.3) for "eats pie with cream", and S = 0.2·0.0036·0.8 for the whole sentence.)


How do we handle flat rules?

Example: the ternary rule S ⟶ S conj S 0.2 from the grammar above can be replaced by
  S ⟶ S ConjS      0.2
  ConjS ⟶ conj S   1.0

Binarize each flat rule by adding dummy nonterminals (like ConjS), and set the probability of each rule with a dummy nonterminal on the LHS to 1.
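A sketch of this transformation for an arbitrary flat rule (the dummy-naming scheme is illustrative):

```python
def binarize_rule(lhs, rhs, prob):
    """Right-binarize a flat rule lhs -> rhs into a chain of binary rules,
    introducing dummy nonterminals. The first rule keeps the original
    probability; every rule with a dummy LHS gets probability 1.0, so the
    product over the binarized rules equals the original rule probability."""
    new_rules, current, p = [], lhs, prob
    for i in range(len(rhs) - 2):
        dummy = f'{lhs}_bin{i}'
        new_rules.append((current, (rhs[i], dummy), p))
        current, p = dummy, 1.0
    new_rules.append((current, (rhs[-2], rhs[-1]), p))
    return new_rules

# S -> S conj S (0.2)  ==>  S -> S S_bin0 (0.2)  and  S_bin0 -> conj S (1.0),
# mirroring the S / ConjS construction above.
print(binarize_rule('S', ('S', 'conj', 'S'), 0.2))
```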


Parser evaluation


Precision and recall
Precision and recall were originally developed as evaluation metrics for information retrieval:
– Precision: What percentage of retrieved documents are relevant to the query?
– Recall: What percentage of relevant documents were retrieved?

In NLP, they are often used in addition to accuracy:
– Precision: What percentage of items that were assigned label X do actually have label X in the test data?
– Recall: What percentage of items that have label X in the test data were assigned label X by the system?

Particularly useful when there are more than two labels.


True vs. false positives, false negatives

– True positives (TP): Items that were labeled X by the system, and should be labeled X.
– False positives (FP): Items that were labeled X by the system, but should not be labeled X.
– False negatives (FN): Items that were not labeled X by the system, but should be labeled X.

(Figure: two overlapping sets. Items labeled X in the gold standard ('truth') = TP + FN; items labeled X by the system = TP + FP; their intersection is the set of true positives.)


Precision, recall, f-measure

(Same figure as above: items labeled X in the gold standard = TP + FN; items labeled X by the system = TP + FP.)

Precision:  P = TP ∕ (TP + FP)
Recall:     R = TP ∕ (TP + FN)
F-measure:  the harmonic mean of precision and recall, F = (2·P·R) ∕ (P + R)


Evalb (“Parseval”)

Measures recovery of phrase-structure trees.
Labeled: the span and the label (NP, PP, …) have to be right.
(Earlier variant, unlabeled: only the span of nodes has to be right.)

Two aspects of evaluation:
Precision: How many of the predicted nodes are correct?
Recall: How many of the correct nodes were predicted?
Usually combined into one metric (F-measure):

  P = #correctly predicted nodes ∕ #predicted nodes
  R = #correctly predicted nodes ∕ #correct nodes
  F = 2PR ∕ (P + R)
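A minimal sketch of labeled-bracket scoring over trees in the nested-tuple format used earlier (this shows the idea rather than reproducing evalb's exact bookkeeping, which, e.g., treats punctuation and duplicate brackets specially):

```python
def labeled_spans(tree, start=1):
    """Return ([(label, i, j), ...], next_position) for all phrasal nodes.
    Preterminals (POS tags) are not counted, as in the example on the next slide."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [], start + 1                   # a preterminal covers one word
    spans, pos = [], start
    for child in children:
        child_spans, pos = labeled_spans(child, pos)
        spans.extend(child_spans)
    return spans + [(label, start, pos - 1)], pos

def parseval(gold_tree, predicted_tree):
    gold = set(labeled_spans(gold_tree)[0])
    pred = set(labeled_spans(predicted_tree)[0])
    P = len(gold & pred) / len(pred)
    R = len(gold & pred) / len(gold)
    return P, R, 2 * P * R / (P + R)
```

On the attachment example on the next slide, gold and predicted trees each contain five phrasal nodes and share four of them, which is where the 4/5 precision and recall come from.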


Parseval in practice

eat sushi with tuna: Precision: 4/5 Recall: 4/5
eat sushi with chopsticks: Precision: 4/5 Recall: 4/5

Gold standard:
  (VP (V eat) (NP (NP (N sushi)) (PP (P with) (NP (N tuna)))))
  (VP (VP (V eat) (NP (N sushi))) (PP (P with) (NP (N chopsticks))))

Parser output:
  (VP (VP (V eat) (NP (N sushi))) (PP (P with) (NP (N tuna))))
  (VP (V eat) (NP (NP (N sushi)) (PP (P with) (NP (N chopsticks)))))


Shortcomings of PCFGs




How well can a PCFG model the distribution of trees?

PCFGs make independence assumptions:
only the label of a node determines what children it has.

Factors that influence these assumptions:

Shape of the trees:
A corpus with flat trees (i.e. few nodes per sentence) results in a model with few independence assumptions.

Labeling of the trees:
A corpus with many node labels (nonterminals) results in a model with few independence assumptions.


Example 1: flat trees

Corpus:
  (S I eat sushi with tuna)
  (S I eat sushi with chopsticks)

What sentences would a PCFG estimated from this corpus generate?

Example 2: deep trees, few labels

Corpus:
  (S I (S eat (S sushi (S with chopsticks))))
  (S I (S eat (S sushi (S with tuna))))

What sentences would a PCFG estimated from this corpus generate?
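One way to make the answer concrete (my own sketch, not from the slides): estimate the rule probabilities from these two trees by relative frequency and sample from the resulting PCFG. Since every S expands independently of its context, the sampler freely recombines the five rules and produces many strings that never occurred in the corpus:

```python
import random

# MLE rule probabilities from the Example 2 corpus (8 S-expansions in total):
S_RULES = [(('I', 'S'), 2 / 8), (('eat', 'S'), 2 / 8), (('sushi', 'S'), 2 / 8),
           (('with', 'chopsticks'), 1 / 8), (('with', 'tuna'), 1 / 8)]

def sample_S():
    """Expand S once and recurse; the words are generated directly by the rules."""
    [rhs] = random.choices([r for r, _ in S_RULES],
                           weights=[p for _, p in S_RULES], k=1)
    words = []
    for symbol in rhs:
        words.extend(sample_S() if symbol == 'S' else [symbol])
    return words

print(' '.join(sample_S()))   # e.g. "sushi I with tuna" or "eat eat with chopsticks"
```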

Example 3: deep trees, many labels

Corpus:
  (S I (S1 eat (S2 sushi (S3 with tuna))))
  (S I (S1 eat (S2 sushi (S3 with chopsticks))))

What sentences would a PCFG estimated from this corpus generate?

Aside: Bias/Variance tradeoff

A probability model has low bias if it makes few independence assumptions.
⇒ It can capture the structures in the training data.

This typically leads to a more fine-grained partitioning of the training data.
Hence, fewer data points are available to estimate the model parameters.
This increases the variance of the model.
⇒ This yields a poor estimate of the distribution.

Penn Treebank
parsing


The Penn Treebank
The first publicly available syntactically annotated corpus:
the Wall Street Journal (50,000 sentences, 1 million words);
also Switchboard, the Brown corpus, and ATIS.

The annotation:
– POS-tagged (Ratnaparkhi's MXPOST)
– Manually annotated with phrase-structure trees
– Richer than a standard CFG: traces and other null elements are used to represent non-local dependencies (designed to allow extraction of predicate-argument structure) [more on this later in the semester]

The standard data set for English parsers.

The Treebank label set
48 preterminals (tags):
– 36 POS tags, 12 other symbols (punctuation etc.)
– A simplified version of the Brown tagset (87 tags)
  (cf. the Lancaster-Oslo/Bergen (LOB) tag set: 126 tags)

14 nonterminals: the standard inventory (S, NP, VP, …)

A simple example


Relatively flat structures:
– There is no noun level
– VP arguments and adjuncts appear at the same level 


Function tags, e.g. -SBJ (subject), -MNR (manner)


A more realistic (partial) example
Until Congress acts, the government hasn’t any authority to issue new debt
obligations of any kind, the Treasury said …. .


The Penn Treebank CFG

The Penn Treebank uses very flat rules:

– Many of these rules appear only once.
– Many of these rules are very similar.
– Can we pool these counts?

PCFGs in practice:
Charniak (1996) Tree-bank grammars

How well do PCFGs work on the Penn Treebank?

– Split the Treebank into a test set (30K words) and a training set (300K words).
– Estimate a PCFG from the training set.
– Parse the test set (with correct POS tags).
– Evaluate unlabeled precision and recall.

Two ways to improve performance
… change the (internal) grammar:
– Parent annotation / state splits:
  Not all NPs/VPs/DTs/… are the same. It matters where they are in the tree.

… change the probability model:
– Lexicalization: Words matter!
– Markovization: Generalizing the rules

The parent transformation

PCFGs assume the expansion of any nonterminal is independent of its parent.
But this is not true: NP subjects are more likely to be modified than objects.

We can change the grammar by adding the name of the parent node to each nonterminal.
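A minimal sketch of the transformation on trees in the nested-tuple format used earlier (the '^' separator follows the common NP^S notation; whether to also split the preterminals is a design choice, cf. the Klein & Manning settings below):

```python
def parent_annotate(tree, parent=None):
    """Append the parent's label to each phrasal nonterminal, so that an NP
    directly under S becomes 'NP^S'. Preterminals are left unsplit here."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree                            # preterminal: (tag, word)
    new_label = f'{label}^{parent}' if parent else label
    return (new_label,) + tuple(parent_annotate(child, label) for child in children)
```

Estimating a PCFG from the transformed treebank then gives, e.g., separate distributions for NP^S (subjects) and NP^VP (objects).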

Markov PCFGs (Collins parser)
The RHS of each CFG rule consists of one head HX, n left sisters Li, and m right sisters Ri:

  X → Ln … L1 HX R1 … Rm
      (Ln … L1: left sisters; R1 … Rm: right sisters)

Replace rule probabilities with a generative process. For each nonterminal X:
– generate its head HX (nonterminal or terminal)
– then generate its left sisters L1..n and a STOP symbol, conditioned on HX
– then generate its right sisters R1..m and a STOP symbol, conditioned on HX
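A rough sketch of the probability of one expansion under a zeroth-order version of this process (Collins' actual models condition on considerably more context, e.g. head words, distance and subcategorization; the dictionary layout here is mine):

```python
def markov_rule_prob(X, head, left_sisters, right_sisters, p_head, p_side):
    """P(X -> Ln..L1 HX R1..Rm): generate the head, then each left sister and a
    STOP symbol, then each right sister and a STOP symbol, conditioned on (X, HX)."""
    p = p_head[(head, X)]                          # P(HX | X)
    for sym in list(left_sisters) + ['STOP']:
        p *= p_side[('left', sym, X, head)]        # P(Li | left, X, HX)
    for sym in list(right_sisters) + ['STOP']:
        p *= p_side[('right', sym, X, head)]       # P(Ri | right, X, HX)
    return p
```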

Lexicalization
PCFGs can't distinguish between "eat sushi with chopsticks" and "eat sushi with tuna".

We need to take words into account!
  P(VP_eat → VP PP_with chopsticks | VP_eat)  vs.  P(VP_eat → VP PP_with tuna | VP_eat)

Problem: sparse data (PP_with fatty|white|… tuna …)
Solution: only take head words into account!
Assumption: each constituent has one head word.

Lexicalized PCFGs

At the root (start symbol S), generate the head word of the sentence, wS, with P(wS).

Lexicalized rule probabilities:
Every nonterminal is lexicalized: X_wx.
Condition rules X_wx → α Y β on the lexicalized LHS X_wx:
  P(X_wx → α Y β | X_wx)

Word-word dependencies:
For each nonterminal Y in the RHS of a rule X_wx → α Y β, condition wY (the head word of Y) on X and wX:
  P(wY | Y, X, wX)

Dealing with unknown words
A lexicalized PCFG assigns zero probability to any word that does not appear in the training data.

Solution:
Training: Replace rare words in the training data with a token 'UNKNOWN'.
Testing: Replace unseen words with 'UNKNOWN'.
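A minimal sketch of this preprocessing step (the frequency threshold and the token name are illustrative choices):

```python
from collections import Counter

UNK = 'UNKNOWN'

def unk_training(sentences, min_count=2):
    """Replace words seen fewer than min_count times in training with UNK."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else UNK for w in sent] for sent in sentences], vocab

def unk_test(sentence, vocab):
    """Replace words that were not kept in the training vocabulary with UNK."""
    return [w if w in vocab else UNK for w in sentence]
```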

Refining the set of categories

Unlexicalized Parsing (Klein & Manning '03)
Unlexicalized PCFGs with various transformations of the training data and the model, e.g.:
– Parent annotation (of terminals and nonterminals):
  distinguish the preposition IN from the subordinating conjunction IN, etc.
– Add the head tag to nonterminals (e.g. distinguish finite from infinitival VPs)
– Add distance features
Accuracy: 86.3 Precision and 85.1 Recall

The Berkeley parser (Petrov et al. '06, '07)
Automatically learns refinements of the nonterminals.
Accuracy: 90.2 Precision, 89.9 Recall

Summary
The Penn Treebank has a large number of very flat rules.

Accurate parsing requires modifications to the basic PCFG model: refining the nonterminals, relaxing the independence assumptions by including grandparent information, modeling word-word dependencies, etc.

How much of this transfers to other treebanks or languages?