CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 9: The CKY parsing algorithm
Last lecture’s key concepts
Natural language syntax
Constituents
Dependencies
Context-free grammar
Arguments and modifiers
Recursion in natural language
Defining grammars for natural language
An example CFG
DT → {the, a}
N → {ball, garden, house, sushi}
P → {in, behind, with}
NP → DT N
NP → NP PP
PP → P NP
N: noun
P: preposition
NP: “noun phrase”
PP: “prepositional phrase”
Reminder: Context-free grammars
A CFG is a 4-tuple 〈N, Σ, R, S〉 consisting of:
A set of nonterminals N
(e.g. N = {S, NP, VP, PP, Noun, Verb, ….})
A set of terminals Σ
(e.g. Σ = {I, you, he, eat, drink, sushi, ball, …})
A set of rules R
R ⊆ {A → β with left-hand-side (LHS) A ∈ N
and right-hand-side (RHS) β ∈ (N ∪ Σ)* }
A start symbol S ∈ N
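To make this definition concrete, here is a minimal Python sketch of a CFG as a 4-tuple (the class and the encoding of rules as (LHS, RHS-tuple) pairs are illustrative choices, not part of the lecture; the instance is the toy grammar from the previous slide, which has no S, so NP serves as its start symbol):

from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    nonterminals: frozenset   # N
    terminals: frozenset      # Sigma
    rules: frozenset          # R: pairs (lhs, rhs), rhs a tuple over N ∪ Sigma
    start: str                # S

toy = CFG(
    nonterminals=frozenset({"NP", "PP", "DT", "N", "P"}),
    terminals=frozenset({"the", "a", "ball", "garden", "house", "sushi",
                         "in", "behind", "with"}),
    rules=frozenset({
        ("NP", ("DT", "N")), ("NP", ("NP", "PP")), ("PP", ("P", "NP")),
        ("DT", ("the",)), ("DT", ("a",)),
        ("N", ("ball",)), ("N", ("garden",)), ("N", ("house",)), ("N", ("sushi",)),
        ("P", ("in",)), ("P", ("behind",)), ("P", ("with",)),
    }),
    start="NP",
)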
Constituents:
Heads and dependents
There are different kinds of constituents:
Noun phrases: the man, a girl with glasses, Illinois
Prepositional phrases: with glasses, in the garden
Verb phrases: eat sushi, sleep, sleep soundly
Every phrase has a head:
Noun phrases: the man, a girl with glasses, Illinois
Prepositional phrases: with glasses, in the garden
Verb phrases: eat sushi, sleep, sleep soundly
The other parts are its dependents.
Dependents are either arguments or adjuncts
Is string α a constituent?
He talks [in class].
Substitution test:
Can α be replaced by a single word?
He talks [there].
Movement test:
Can α be moved around in the sentence?
[In class], he talks.
Answer test:
Can α be the answer to a question?
Where does he talk? – [In class].
Arguments are obligatory
Words subcategorize for specific sets of arguments:
Transitive verbs (sbj + obj): [John] likes [Mary]
All arguments have to be present:
*[John] likes. *likes [Mary].
No argument can be occupied multiple times:
*[John] [Peter] likes [Ann] [Mary].
Words can have multiple subcat frames:
Transitive eat (sbj + obj): [John] eats [sushi].
Intransitive eat (sbj): [John] eats.
Adjuncts are optional
Adverbs, PPs and adjectives can be adjuncts:
Adverbs: John runs [fast].
a [very] heavy book.
PPs: John runs [in the gym].
the book [on the table]
Adjectives: a [heavy] book
There can be an arbitrary number of adjuncts:
John saw Mary.
John saw Mary [yesterday].
John saw Mary [yesterday] [in town]
John saw Mary [yesterday] [in town] [during lunch]
[Perhaps] John saw Mary [yesterday] [in town] [during lunch]
Heads, Arguments and Adjuncts in CFGs
Heads:
We assume that each RHS has one head, e.g.
VP → Verb NP (Verbs are heads of VPs)
NP → Det Noun (Nouns are heads of NPs)
S → NP VP (VPs are heads of sentences)
Exception: Coordination, lists: VP → VP conj VP
Arguments:
The head has a different category from the parent:
VP → Verb NP (the NP is an argument of the verb)
Adjuncts:
The head has the same category as the parent:
VP → VP PP (the PP is an adjunct)
Chomsky Normal Form
The right-hand side of a standard CFG rule can have an arbitrary
number of symbols (terminals and nonterminals):
VP → ADV eat NP
A CFG in Chomsky Normal Form (CNF) allows only two
kinds of right-hand sides:
– Two nonterminals: VP → ADV VP
– One terminal: VP → eat
Any CFG can be transformed into an equivalent grammar in CNF:
VP → ADV VP1
VP1 → VP2 NP
VP2 → eat
(Tree diagrams: the flat tree for VP → ADV eat NP and its binarized
equivalent, in which VP1 and VP2 are newly introduced nonterminals.)
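A rough Python sketch of the binarization step (the fresh names VP1, VP2, … are generated on the fly; unlike the slide's conversion, this sketch does not also pull terminals such as eat out into their own preterminal rules, and it ignores ε- and unary rules):

def binarize(rules):
    """Split every rule with more than two RHS symbols into binary rules."""
    new_rules, counter = [], {}
    for lhs, rhs in rules:
        while len(rhs) > 2:
            counter[lhs] = counter.get(lhs, 0) + 1
            fresh = f"{lhs}{counter[lhs]}"              # e.g. VP1, VP2, ...
            new_rules.append((lhs, (rhs[0], fresh)))    # keep the first symbol, push the rest down
            lhs, rhs = fresh, rhs[1:]
        new_rules.append((lhs, rhs))
    return new_rules

# binarize([("VP", ("ADV", "eat", "NP"))])
# -> [("VP", ("ADV", "VP1")), ("VP1", ("eat", "NP"))]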
A note about ε-productions
Formally, context-free grammars are allowed to have
empty productions (ε = the empty string):
VP → V NP NP → DT Noun NP → ε
These can always be eliminated without changing the
language generated by the grammar:
VP → V NP NP → DT Noun NP → ε
becomes
VP → V NP VP → V ε NP → DT Noun
which in turn becomes
VP → V NP VP → V NP → DT Noun
We will assume that our grammars don’t have ε-productions
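A possible sketch of this elimination in Python, assuming (as the slide does) that the language itself does not contain the empty string; rule right-hand sides are tuples, and an ε-rule has an empty tuple as its RHS:

def eliminate_epsilon(rules):
    """Remove A → ε rules by adding variants of other rules that omit nullable symbols."""
    # 1. Find all nullable nonterminals (those that can derive ε).
    nullable = {lhs for lhs, rhs in rules if rhs == ()}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and rhs and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    # 2. For every rule, add every variant obtained by dropping nullable symbols.
    new_rules = set()
    for lhs, rhs in rules:
        variants = {()}
        for sym in rhs:
            with_sym = {v + (sym,) for v in variants}
            without = variants if sym in nullable else set()
            variants = with_sym | without
        for v in variants:
            if v:                                       # drop empty right-hand sides
                new_rules.add((lhs, v))
    return new_rules

# eliminate_epsilon({("VP", ("V", "NP")), ("NP", ("DT", "Noun")), ("NP", ())})
# -> {("VP", ("V", "NP")), ("VP", ("V",)), ("NP", ("DT", "Noun"))}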
CKY chart parsing algorithm
Bottom-up parsing:
start with the words
Dynamic programming:
save the results in a table/chart
re-use these results in finding larger constituents
Complexity: O(n³ |G|)
(n: length of the string, |G|: size of the grammar)
Presumes a CFG in Chomsky Normal Form:
Rules are all either A → B C or A → a
(with A,B,C nonterminals and a a terminal)
The CKY parsing algorithm
Input: We eat sushi
Grammar:
S → NP VP
VP → V NP
V → eat
NP → we
NP → sushi
Chart:
chart[1][1] (we): NP
chart[2][2] (eat): V
chart[3][3] (sushi): NP
chart[2][3] (eat sushi): VP
chart[1][3] (we eat sushi): S
To recover the parse tree, each entry needs pairs of backpointers.
CKY algorithm
1. Create the chart
(an n×n upper triangular matrix for a sentence with n words)
– Each cell chart[i][j] corresponds to the substring w(i)…w(j)
2. Initialize the chart (fill the diagonal cells chart[i][i]):
For all rules X → w(i), add an entry X to chart[i][i]
3. Fill in the chart:
Fill in all cells chart[i][i+1], then chart[i][i+2], …,
until you reach chart[1][n] (the top right corner of the chart)
– To fill chart[i][j], consider all binary splits w(i)…w(k)|w(k+1)…w(j)
– If the grammar has a rule X → YZ, chart[i][k] contains a Y
and chart[k+1][j] contains a Z, add an X to chart[i][j] with two
backpointers to the Y in chart[i][k] and the Z in chart[k+1][j]
4. Extract the parse trees from the S in chart[1][n].
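Here is a small, self-contained Python sketch of these four steps (the grammar encoding and function names are my own, not the lecture's; the grammar is assumed to be in CNF, and the toy rules are those of the "We eat sushi" example above):

from collections import defaultdict

def cky_parse(words, lexical_rules, binary_rules):
    """CKY chart parsing for a grammar in CNF.
    lexical_rules: dict word -> set of nonterminals X with X -> word
    binary_rules:  dict (Y, Z) -> set of nonterminals X with X -> Y Z
    Each chart cell maps a nonterminal to its list of backpointers."""
    n = len(words)
    chart = defaultdict(dict)                           # chart[(i, j)][X] = list of backpointers
    # 2. Initialize the diagonal: chart[i][i] gets an X for every rule X -> w(i)
    for i, w in enumerate(words, start=1):
        for X in lexical_rules.get(w, ()):
            chart[(i, i)][X] = [w]                      # lexical entry: the backpointer is the word
    # 3. Fill the chart, shorter spans first
    for span in range(1, n):
        for i in range(1, n - span + 1):
            j = i + span
            for k in range(i, j):                       # binary split w(i)...w(k) | w(k+1)...w(j)
                for Y in chart[(i, k)]:
                    for Z in chart[(k + 1, j)]:
                        for X in binary_rules.get((Y, Z), ()):
                            chart[(i, j)].setdefault(X, []).append(((Y, i, k), (Z, k + 1, j)))
    return chart                                        # 4. parse trees start at S in chart[(1, n)]

lexical = {"we": {"NP"}, "sushi": {"NP"}, "eat": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
chart = cky_parse(["we", "eat", "sushi"], lexical, binary)
print("S" in chart[(1, 3)])                             # True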
CKY: filling the chart
(Figure: the chart is filled cell by cell, starting with the cells for single words on the diagonal and proceeding to longer spans until the top right corner chart[1][n] is reached.)
CKY: filling one cell
(Figure: for a sentence w1 … w7, the cell chart[2][6], covering w2 … w6, is filled by combining chart[2][k] and chart[k+1][6] for every split point k = 2 … 5: chart[2][2]+chart[3][6], chart[2][3]+chart[4][6], chart[2][4]+chart[5][6], chart[2][5]+chart[6][6].)
The CKY parsing algorithm
Input: We buy drinks with milk
Grammar:
S → NP VP
VP → V NP
VP → VP PP
V → buy
V → drinks
NP → NP PP
NP → we
NP → drinks
NP → milk
PP → P NP
P → with
Chart (each cell may have one entry for each nonterminal; empty cells omitted):
chart[1][1] (we): NP
chart[2][2] (buy): V
chart[3][3] (drinks): V, NP
chart[4][4] (with): P
chart[5][5] (milk): NP
chart[2][3] (buy drinks): VP
chart[4][5] (with milk): PP
chart[1][3] (we buy drinks): S
chart[3][5] (drinks with milk): NP
chart[2][5] (buy drinks with milk): VP
chart[1][5] (we buy drinks with milk): S
The CKY parsing algorithm
Input: We eat sushi with tuna
Grammar:
S → NP VP
VP → V NP
VP → VP PP
V → eat
NP → NP PP
NP → we
NP → sushi
NP → tuna
PP → P NP
P → with
Chart (empty cells omitted):
chart[1][1] (we): NP
chart[2][2] (eat): V
chart[3][3] (sushi): NP
chart[4][4] (with): P
chart[5][5] (tuna): NP
chart[2][3] (eat sushi): VP
chart[4][5] (with tuna): PP
chart[1][3] (we eat sushi): S
chart[3][5] (sushi with tuna): NP
chart[2][5] (eat sushi with tuna): VP, with two pairs of backpointers: V[2][2]+NP[3][5] and VP[2][3]+PP[4][5]
chart[1][5] (we eat sushi with tuna): S
Each cell contains only a single entry for each nonterminal.
Each entry may have a list of pairs of backpointers.
What are the terminals in NLP?
Are the “terminals” words or POS tags?
For toy examples (e.g. on slides), it’s typically the words
With POS-tagged input, we may either treat the POS tags as
the terminals, or we assume that the unary rules in our
grammar are of the form
POS-tag → word
(so POS tags are the only nonterminals that can be rewritten
as words; some people call POS tags “preterminals”)
Additional unary rules
In practice, we may allow other unary rules, e.g.
NP → Noun
(where Noun is also a nonterminal)
In that case, we apply all unary rules to the entries in
chart[i][j] after we’ve checked all binary splits
(chart[i][k], chart[k+1][j])
Unary rules are fine as long as there are no “loops”
that could lead to an infinite chain of unary
productions, e.g.:
X → Y and Y → X
or: X → Y and Y → Z and Z → X
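A sketch of how that unary step might look, applied to one cell after all its binary splits have been checked (the cell and rule encodings follow the CKY sketch above; the loop terminates only because the unary rules are assumed to be loop-free):

def apply_unary_rules(cell, unary_rules):
    """Add X for every rule X -> Y whose Y is already in the cell,
    and repeat until nothing new can be added.
    unary_rules: dict Y -> set of nonterminals X with X -> Y."""
    added = True
    while added:
        added = False
        for Y in list(cell):
            for X in unary_rules.get(Y, ()):
                if X not in cell:
                    cell[X] = [("unary", Y)]            # backpointer to the Y entry in the same cell
                    added = True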
CKY so far…
Each entry in a cell chart[i][j] is associated with a
nonterminal X.
If there is a rule X → YZ in the grammar, and there is
a pair of cells chart[i][k], chart[k+1][j] with a Y in
chart[i][k] and a Z in chart[k+1][j],
we can add an entry X to cell chart[i][j], and associate
one pair of backpointers with the X in cell chart[i][j]
Each entry might have multiple pairs of backpointers.
When we extract the parse trees at the end,
we can get all possible trees.
We will need probabilities to find the single best tree!
Exercise: CKY parser
I eat sushi with chopsticks with you
S ⟶ NP VP
NP ⟶ NP PP
NP ⟶ sushi
NP ⟶ I
NP ⟶ chopsticks
NP ⟶ you
VP ⟶ VP PP
VP ⟶ Verb NP
Verb ⟶ eat
PP ⟶ Prep NP
Prep ⟶ with
How do you count the number of parse
trees for a sentence?
1. For each pair of backpointers
(e.g. VP → V NP): multiply the number of trees of the children:
trees(VP via VP → V NP) = trees(V) × trees(NP)
2. For each list of pairs of backpointers
(e.g. VP → V NP and VP → VP PP): sum the number of trees:
trees(VP) = trees(VP via VP → V NP) + trees(VP via VP → VP PP)
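With the chart representation used in the CKY sketch above (chart[(i, j)][X] = list of backpointers, purely binary rules, no unary extension), this counting scheme might look as follows:

def count_trees(chart, X, i, j, memo=None):
    """Number of parse trees rooted in the entry for X in chart[(i, j)]."""
    if memo is None:
        memo = {}
    if (X, i, j) in memo:
        return memo[(X, i, j)]
    backpointers = chart[(i, j)][X]
    if not isinstance(backpointers[0], tuple):          # lexical entry: exactly one tree
        return 1
    total = 0
    for (Y, i1, k), (Z, k1, j1) in backpointers:        # 2. sum over the list of backpointer pairs
        total += count_trees(chart, Y, i1, k, memo) * count_trees(chart, Z, k1, j1, memo)  # 1. multiply children
    memo[(X, i, j)] = total
    return total

# count_trees(chart, 'S', 1, n) gives the number of parses of the whole sentence.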
Cocke Kasami Younger (1)

ckyParse(n):
    initChart(n)
    fillChart(n)

initChart(n):
    for i = 1…n:
        initCell(i,i)

initCell(i,i):
    for c in lex(word[i]):
        addToCell(cell[i][i], c, null, null)

fillChart(n):
    for span = 1…n-1:
        for i = 1…n-span:
            fillCell(i, i+span)

fillCell(i,j):
    for k = i…j-1:
        combineCells(i, k, j)

combineCells(i,k,j):
    for Y in cell[i][k]:
        for Z in cell[k+1][j]:
            for X in Nonterminals:
                if X → Y Z in Rules:
                    addToCell(cell[i][j], X, Y, Z)

addToCell(cell, Parent, Left, Right):
    if cell.hasEntry(Parent):
        P = cell.getEntry(Parent)
        P.addBackpointers(Left, Right)
    else:
        cell.addEntry(Parent, Left, Right)

(Figures: chart diagrams highlighting the cells each routine reads and writes.)
Dealing with ambiguity: Probabilistic Context-Free Grammars (PCFGs)
Grammars are ambiguous
A grammar might generate multiple trees for a sentence:
What’s the most likely parse τ for sentence S ?
We need a model of P(τ | S)
(Figure: two parses each for two sentences. For "eat sushi with tuna", the correct analysis attaches the PP "with tuna" inside the NP, giving "sushi with tuna"; attaching it to the VP is incorrect. For "eat sushi with chopsticks", the correct analysis attaches the PP "with chopsticks" to the VP; attaching it inside the NP is incorrect.)
Computing P(τ | S)
Using Bayes’ Rule:
The yield of a tree is the string of terminal symbols
that can be read off the leaf nodes
argmax_τ P(τ | S) = argmax_τ P(τ, S) / P(S)
                  = argmax_τ P(τ, S)
                  = argmax_τ P(τ)    if S = yield(τ)
(Figure: a parse tree τ of "eat sushi with tuna", with yield(τ) = eat sushi with tuna)
Computing P(τ)
T is the (infinite) set of all trees in the language
L = {s ∈ Σ* | ∃τ ∈ T: yield(τ) = s}
The set T is generated by a context-free grammar:
S → NP VP        VP → Verb NP    NP → Det Noun
S → S conj S     VP → VP PP      NP → NP PP
S → …            VP → …          NP → …
We need to define P(τ) such that:
∀τ ∈ T: 0 ≤ P(τ) ≤ 1
Σ_{τ ∈ T} P(τ) = 1
Probabilistic Context-Free Grammars
For every nonterminal X, define a probability distribution
P(X → α | X) over all rules with the same LHS symbol X:
S → NP VP 0.8
S → S conj S 0.2
NP → Noun 0.2
NP → Det Noun 0.4
NP → NP PP 0.2
NP → NP conj NP 0.2
VP → Verb 0.4
VP → Verb NP 0.3
VP → Verb NP NP 0.1
VP → VP PP 0.2
PP → P NP 1.0
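The defining property, that the probabilities of all rules with the same LHS sum to one, is easy to check mechanically; here is a small sketch with the table above encoded as a Python dict (the encoding is mine, not the lecture's):

from collections import defaultdict

pcfg = {("S", ("NP", "VP")): 0.8, ("S", ("S", "conj", "S")): 0.2,
        ("NP", ("Noun",)): 0.2, ("NP", ("Det", "Noun")): 0.4,
        ("NP", ("NP", "PP")): 0.2, ("NP", ("NP", "conj", "NP")): 0.2,
        ("VP", ("Verb",)): 0.4, ("VP", ("Verb", "NP")): 0.3,
        ("VP", ("Verb", "NP", "NP")): 0.1, ("VP", ("VP", "PP")): 0.2,
        ("PP", ("P", "NP")): 1.0}

totals = defaultdict(float)
for (lhs, rhs), p in pcfg.items():
    totals[lhs] += p
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())   # P(X → α | X) sums to 1 for each X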
Computing P(τ) with a PCFG
The probability of a tree τ is the product of the probabilities
of all its rules:
P(τ) = 0.8 × 0.3 × 0.2 × 1.0 × 0.2³ = 0.000384
(Tree τ: [S [NP [Noun John]] [VP [VP [Verb eats] [NP [Noun pie]]] [PP [P with] [NP [Noun cream]]]]])
S → NP VP 0.8
S → S conj S 0.2
NP → Noun 0.2
NP → Det Noun 0.4
NP → NP PP 0.2
NP → NP conj NP 0.2
VP → Verb 0.4
VP → Verb NP 0.3
VP → Verb NP NP 0.1
VP → VP PP 0.2
PP → P NP 1.0
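As a cross-check of the product above, the tree can be encoded as nested lists and its probability computed recursively; this is only a sketch (the bracket encoding and rule table format are mine, and preterminal rules such as Noun → John are treated as having probability 1, as on the slide):

def tree_prob(tree, rule_probs):
    """P(tree) = product of the probabilities of all rules used in the tree.
    A tree is [label, child, ...]; a leaf is a plain string (a word)."""
    if isinstance(tree, str):                           # a word: no rule to score
        return 1.0
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_probs.get((lhs, rhs), 1.0)                 # unlisted (preterminal) rules count as 1.0
    for child in children:
        p *= tree_prob(child, rule_probs)
    return p

rule_probs = {("S", ("NP", "VP")): 0.8, ("NP", ("Noun",)): 0.2,
              ("VP", ("VP", "PP")): 0.2, ("VP", ("Verb", "NP")): 0.3,
              ("PP", ("P", "NP")): 1.0}
tau = ["S", ["NP", ["Noun", "John"]],
            ["VP", ["VP", ["Verb", "eats"], ["NP", ["Noun", "pie"]]],
                   ["PP", ["P", "with"], ["NP", ["Noun", "cream"]]]]]
print(tree_prob(tau, rule_probs))                       # ≈ 0.000384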
PCFG parsing (decoding): Probabilistic CKY
Probabilistic CKY: Viterbi
Like standard CKY, but with probabilities.
Finding the most likely tree argmax_τ P(τ, s) is similar to
Viterbi for HMMs:
Initialization: every chart entry that corresponds to a terminal
(entries X in cell[i][i]) has a Viterbi probability PVIT(X[i][i]) = 1
Recurrence: For every entry that corresponds to a non-terminal X
in cell[i][j], keep only the highest-scoring pair of backpointers
to any pair of children (Y in cell[i][k] and Z in cell[k+1][j]):
PVIT(X[i][j]) = max over Y, Z, k of PVIT(Y[i][k]) × PVIT(Z[k+1][j]) × P(X → Y Z | X)
Final step: Return the Viterbi parse for the start symbol S
in the top cell[1][n].
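A minimal Python sketch of this recurrence over POS-tagged input (data structures and names are mine; preterminal entries start with PVIT = 1 as above, and the simple unary rules NP → Noun and VP → Verb are applied once per cell, which suffices here because they do not chain):

def viterbi_cky(tags, binary_rules, unary_rules):
    """Probabilistic CKY (Viterbi) over POS tags for a grammar in CNF plus simple unary rules.
    binary_rules: dict (Y, Z) -> list of (X, P(X -> Y Z | X))
    unary_rules:  dict Y -> list of (X, P(X -> Y | X)), assumed loop-free and non-chaining
    Returns best[(i, j)][X] = (Viterbi probability, backpointer)."""
    def apply_unaries(cell):
        for Y, (py, _) in list(cell.items()):
            for X, p_rule in unary_rules.get(Y, ()):
                if py * p_rule > cell.get(X, (0.0, None))[0]:
                    cell[X] = (py * p_rule, (Y,))
    n = len(tags)
    best = {}
    for i, t in enumerate(tags, start=1):               # initialization: PVIT(tag) = 1
        best[(i, i)] = {t: (1.0, None)}
        apply_unaries(best[(i, i)])
    for span in range(1, n):
        for i in range(1, n - span + 1):
            j = i + span
            cell = {}
            for k in range(i, j):                       # all binary splits
                for Y, (py, _) in best.get((i, k), {}).items():
                    for Z, (pz, _) in best.get((k + 1, j), {}).items():
                        for X, p_rule in binary_rules.get((Y, Z), ()):
                            p = py * pz * p_rule
                            if p > cell.get(X, (0.0, None))[0]:   # keep only the best pair
                                cell[X] = (p, (Y, i, k, Z, k + 1, j))
            apply_unaries(cell)
            best[(i, j)] = cell
    return best

binary = {("NP", "VP"): [("S", 0.8)], ("Verb", "NP"): [("VP", 0.3)],
          ("VP", "PP"): [("VP", 0.2)], ("NP", "PP"): [("NP", 0.2)],
          ("P", "NP"): [("PP", 1.0)]}
unary = {"Noun": [("NP", 0.2)], "Verb": [("VP", 0.4)]}
best = viterbi_cky(["Noun", "Verb", "Noun", "P", "Noun"], binary, unary)
print(best[(1, 5)]["S"][0])                             # ≈ 0.000384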
Probabilistic CKY
Input: POS-tagged sentence
John_N eats_V pie_N with_P cream_N
Grammar:
S → NP VP 0.8
S → S conj S 0.2
NP → Noun 0.2
NP → Det Noun 0.4
NP → NP PP 0.2
NP → NP conj NP 0.2
VP → Verb 0.4
VP → Verb NP 0.3
VP → Verb NP NP 0.1
VP → VP PP 0.2
PP → P NP 1.0
Chart (Viterbi probabilities; empty cells omitted):
cell[1][1] (John): N 1.0, NP 0.2
cell[2][2] (eats): V 1.0, VP 0.4
cell[3][3] (pie): N 1.0, NP 0.2
cell[4][4] (with): P 1.0
cell[5][5] (cream): N 1.0, NP 0.2
cell[1][2] (John eats): S 0.2·0.4·0.8 = 0.064
cell[2][3] (eats pie): VP 1.0·0.2·0.3 = 0.06
cell[4][5] (with cream): PP 1.0·0.2·1.0 = 0.2
cell[1][3] (John eats pie): S 0.2·0.06·0.8 = 0.0096
cell[3][5] (pie with cream): NP 0.2·0.2·0.2 = 0.008
cell[2][5] (eats pie with cream): VP max(1.0·0.008·0.3, 0.06·0.2·0.2) = 0.0024
cell[1][5] (John eats pie with cream): S 0.2·0.0024·0.8 = 0.000384