Microsoft PowerPoint – lec9.pptx
Bottom-Up Parsing
(A First Step)
Copyright By PowCoder代写 加微信 powcoder
Cocke–Younger–Kasami (CYK) algorithm
Chomsky Normal Form
Showed how to use Java CUP for getting ASTs
But we never saw HOW the parser works
Dip our toe into parsing
– Approaches to parsing
– CFG transformations
• Useless non-terminals
• Chomsky Normal Form: A form of grammar that is easier to
– CYK: powerful, heavyweight approach to parsing
Approaches to Parsing
Top Down / “Goal driven”
– Begin with the start nonterminal
– Grow parse tree downward to
match the string
Bottom Up / “Data Driven”
– Start at terminals
– Generate ever larger subtrees;
the goal is to obtain a single tree
whose root is the start
nonterminal
CYK: A General Approach to Parsing
(Cocke–Younger–Kasami algorithm)
Operates in time O(n3)
Works bottom-up
Requires the grammar to be in Chomsky Normal Form
– This turns out not to be a limitation: any context-free grammar
can be converted into one in Chomsky Normal Form
Chomsky Normal Form
All rules must be one of two
X t (terminal)
The only rule allowed to
derive epsilon is the start S
What CNF buys CYK
• The fact that non-terminals come in pairs
allows you to think of a subtree as a subspan
of the input
• The fact that non-terminals are not nullable
(except for start) means that each subspan
has at least one character
s = s1 s2 s3 s4
CYK: Dynamic Programming
Form the leaves of the parse tree
Form binary interior nodes of the parse tree
s1 s2 s3 s4
S1,1 S2,2 S3,3 S4,4
s1 s2 s3 s4
S1,1 S2,2 S3,3 S4,4
S4,4S3,3S2,2S1,1
Running CYK …
Track every viable subtree from leaf to root. Here are
all the subspans for a string of 6 terminals:
1,1 2,2 3,3 4,4 5,5 6,6
1,2 2,3 3,4 4,5 5,6
1,3 2,4 3,5 4,6
1,4 2,5 3,6
Starting position of subspan
position of
start, end
characters
Full string
CYK Example
I,N L I,N C I,N R
id id id, )(
1,2 2,3 3,4 4,5 5,6
1,3 2,4 3,5 4,6
1,4 2,5 3,6
In general, go up a column
and down a diagonal
CYK Example
I,N L I,N C I,N R
id id id, )(
CYK Example
I,N L I,N C N R
id id id, )(
CYK Example
I,N L I C N R
id id id, )(
CYK Example
I,N L I C N R
id id id, )(
CYK Example
I,N L I C N R
id id id, )(
CYK Example
I,N L I C N R
id id id, )(
Cleaning up our grammars
We want to avoid unnecessary work
– Remove useless rules
Eliminating Useless Nonterminals
1. If a nonterminal cannot derive a sequence of
terminal symbols, then it is useless
2. If a nonterminal cannot be derived from the
start symbol, then it is useless
Eliminate Useless Nonterminals
If a nonterminal
cannot derive a
sequence of
terminal symbols,
then it is useless
Mark all terminal symbols
If all symbols on the
righthand side of a
production are marked
mark the lefthand side
Until no more non-terminals
can be marked
Eliminate Useless Nonterminals
If a nonterminal
cannot be derived
from the start
symbol, then it is
Mark the start symbol
If the lefthand side of a
production is marked
mark all righthand
non-terminal
Until no more non-terminals
can be marked
A + | – | ε
B digit | B digit
Chomsky Normal Form
– Eliminate epsilon rules
– Eliminate unit rules
– Fix productions with terminals on RHS
– Fix productions with > 2 nonterminals on RHS
Eliminate (Most) Epsilon Productions
If a nonterminal A immediately derives epsilon
– Make copies of all rules with A on the RHS and
delete all combinations of A in those copies
F id ( A )
F id ( A )
X A x A y A
X A x A y A
Eliminate Unit Productions
Productions of the form A B are called unit
productions
Place B anywhere A could have appeared and
remove the unit production
F id ( N )
F id ( A )
Fix RHS Terminals
For productions with terminals and something
else on the RHS
– For each terminal t add the rule
Where X is a new non-terminal
– Replace t with X in the original rules
F id ( N )
Fix RHS Nonterminals
For productions with > 2 Nonterminals on the
– Replace all but the first nonterminal with a new
nonterminal
– Add a rule from the new nonterminal to the
replaced nonterminal sequence
Parsing is Tough
CYK parses an arbitrary CFG, but
– O(n3) time
– Too slow!
For special classes of grammars
– O(n) time
– Examples of such classes: LL(1) and LALR(1)
Classes of Grammars
– Scans input from Left-to-right (first L)
– Builds a Leftmost Derivation (second L)
– Can peek (1) token ahead of the token being parsed
– Top-down “predictive parsers”
– Uses special lookahead procedure (LA)
– Scans input from Left-to-right (second L)
– Rightmost derivation (R)
– Can also peek (1) token ahead
LALR(1) strictly more powerful, but the algorithm is harder
to understand
(Java CUP generates a LALR(1) parser)
We covered
• How to parse with the CYK algorithm (dynamic
programming)
• How to put a grammar into Chomsky Normal
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com