代写代考 Microsoft PowerPoint – lec9.pptx

Microsoft PowerPoint – lec9.pptx

Bottom-Up Parsing
(A First Step)

Copyright By PowCoder代写 加微信 powcoder

Cocke–Younger–Kasami (CYK) algorithm

Chomsky Normal Form

Showed how to use Java CUP for getting ASTs
But we never saw HOW the parser works

Dip our toe into parsing
– Approaches to parsing
– CFG transformations

• Useless non-terminals
• Chomsky Normal Form: A form of grammar that is easier to

– CYK: powerful, heavyweight approach to parsing

Approaches to Parsing

Top Down / “Goal driven”
– Begin with the start nonterminal
– Grow parse tree downward to

match the string
Bottom Up / “Data Driven”
– Start at terminals
– Generate ever larger subtrees;

the goal is to obtain a single tree
whose root is the start
nonterminal

CYK: A General Approach to Parsing
(Cocke–Younger–Kasami algorithm)

Operates in time O(n3)
Works bottom-up
Requires the grammar to be in Chomsky Normal Form
– This turns out not to be a limitation: any context-free grammar

can be converted into one in Chomsky Normal Form

Chomsky Normal Form

All rules must be one of two

X t (terminal)

The only rule allowed to
derive epsilon is the start S

What CNF buys CYK

• The fact that non-terminals come in pairs
allows you to think of a subtree as a subspan
of the input

• The fact that non-terminals are not nullable
(except for start) means that each subspan
has at least one character

s = s1 s2 s3 s4

CYK: Dynamic Programming

Form the leaves of the parse tree

Form binary interior nodes of the parse tree

s1 s2 s3 s4

S1,1 S2,2 S3,3 S4,4

s1 s2 s3 s4

S1,1 S2,2 S3,3 S4,4

S4,4S3,3S2,2S1,1

Running CYK …
Track every viable subtree from leaf to root. Here are
all the subspans for a string of 6 terminals:

1,1 2,2 3,3 4,4 5,5 6,6

1,2 2,3 3,4 4,5 5,6

1,3 2,4 3,5 4,6

1,4 2,5 3,6

Starting position of subspan

position of

start, end

characters

Full string

CYK Example

I,N L I,N C I,N R

id id id, )(

1,2 2,3 3,4 4,5 5,6

1,3 2,4 3,5 4,6

1,4 2,5 3,6

In general, go up a column
and down a diagonal

CYK Example

I,N L I,N C I,N R

id id id, )(

CYK Example

I,N L I,N C N R

id id id, )(

CYK Example

I,N L I C N R

id id id, )(

CYK Example

I,N L I C N R

id id id, )(

CYK Example

I,N L I C N R

id id id, )(

CYK Example

I,N L I C N R

id id id, )(

Cleaning up our grammars

We want to avoid unnecessary work
– Remove useless rules

Eliminating Useless Nonterminals

1. If a nonterminal cannot derive a sequence of
terminal symbols, then it is useless

2. If a nonterminal cannot be derived from the
start symbol, then it is useless

Eliminate Useless Nonterminals

If a nonterminal
cannot derive a
sequence of
terminal symbols,
then it is useless

Mark all terminal symbols

If all symbols on the
righthand side of a
production are marked

mark the lefthand side

Until no more non-terminals
can be marked

Eliminate Useless Nonterminals

If a nonterminal
cannot be derived
from the start
symbol, then it is

Mark the start symbol

If the lefthand side of a
production is marked

mark all righthand
non-terminal

Until no more non-terminals
can be marked

A + | – | ε
B digit | B digit

Chomsky Normal Form

– Eliminate epsilon rules
– Eliminate unit rules
– Fix productions with terminals on RHS
– Fix productions with > 2 nonterminals on RHS

Eliminate (Most) Epsilon Productions

If a nonterminal A immediately derives epsilon
– Make copies of all rules with A on the RHS and

delete all combinations of A in those copies

F id ( A )

F id ( A )

X A x A y A

X A x A y A

Eliminate Unit Productions

Productions of the form A B are called unit
productions
Place B anywhere A could have appeared and
remove the unit production

F id ( N )

F id ( A )

Fix RHS Terminals

For productions with terminals and something
else on the RHS
– For each terminal t add the rule

Where X is a new non-terminal

– Replace t with X in the original rules

F id ( N )

Fix RHS Nonterminals

For productions with > 2 Nonterminals on the
– Replace all but the first nonterminal with a new

nonterminal
– Add a rule from the new nonterminal to the

replaced nonterminal sequence

Parsing is Tough

CYK parses an arbitrary CFG, but
– O(n3) time
– Too slow!

For special classes of grammars
– O(n) time
– Examples of such classes: LL(1) and LALR(1)

Classes of Grammars

– Scans input from Left-to-right (first L)
– Builds a Leftmost Derivation (second L)
– Can peek (1) token ahead of the token being parsed
– Top-down “predictive parsers”

– Uses special lookahead procedure (LA)
– Scans input from Left-to-right (second L)
– Rightmost derivation (R)
– Can also peek (1) token ahead

LALR(1) strictly more powerful, but the algorithm is harder
to understand
(Java CUP generates a LALR(1) parser)

We covered
• How to parse with the CYK algorithm (dynamic

programming)
• How to put a grammar into Chomsky Normal

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com