Scanning and Parsing
Stephen A. Edwards Columbia University Fall 2016
The First Question
How do you represent one of many things?
Compilers should accept many programs; how do we describe which one we want?
Use continuously varying values?
Very efficient, but has serious noise issues. (Edison Model B Home Cylinder phonograph, 1906)
The ENIAC: Programming with Spaghetti
Have one symbol per thing?
Works nicely when there are only a few things. (Sholes and Glidden Typewriter, E. Remington and Sons, 1874)
Have one symbol per thing?
Not so good when there are many, many things. (Nippon Typewriter SH-280, 2268 keys)
Solution: Use a Discrete Combinatorial System
Use combinations of a small number of things to represent (exponentially) many different things.
Every Human Writing System Does This
Hieroglyphics (24+)
Cuneiform (1000 – 300)
Sanskrit (36)
Chinese (214 – 4000)
IBM Selectric (88–96)
Mayan (100)
Roman (21–26)
The Second Question
How do you describe only certain combinations?
Compilers should only accept correct programs; how should a compiler check that its input is correct?
Just List Them?
Gets annoying for large numbers of combinations
Just List Them?
Can be really redundant
Choices: CS Research Jargon Generator
Pick one from each column
an integrated     functional        preprocessor
a parallel        programmable      compiler
a virtual         distributed       system
an interactive    logical           interface
a responsive      digital           protocol
a synchronized    concurrent        architecture
a balanced        knowledge-based   database
a meta-level      multimedia        algorithm
mobile            network
E.g., “a responsive knowledge-based preprocessor.” http://www.cs.purdue.edu/homes/dec/essay.topic.generator.html
SCIgen: An Automatic CS Paper Generator
Rooter: A Methodology for the Typical Unification of Access Points and Redundancy
Jeremy Stribling, Daniel Aguayo and Maxwell Krohn
ABSTRACT
Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public- private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.
I. INTRODUCTION
Many scholars would agree that, had it not been for active networks, the simulation of Lamport clocks might never have occurred. The notion that end-users synchronize with the investigation of Markov models is rarely outdated. A theoretical grand challenge in theory is the important unification of virtual machines and real-time theory. To what extent can web browsers be constructed to achieve this purpose?

[Remainder of the generated paper, including its “Architecture” section, omitted]

http://pdos.csail.mit.edu/scigen/
How about more structured collections of things?
The boy eats hot dogs.
The dog eats ice cream.
Every happy girl eats candy.
A dog eats candy.
The happy happy dog eats hot dogs.
[Finite-state diagram: (The | A | Every) happy* (boy | girl | dog) eats (hot dogs. | ice cream. | candy.)]
Pinker, The Language Instinct
Lexical Analysis
Lexical Analysis (Scanning)
Translate a stream of characters to a stream of tokens
f o o = a + bar ( 0 , 42 , q ) ;
ID EQUALS ID PLUS ID LPAREN NUM COMMA NUM COMMA ID RPAREN SEMI
Token    Lexemes     Pattern
EQUALS   =           an equals sign
PLUS     +           a plus sign
ID       a foo bar   letter followed by letters or digits
NUM      0 42        one or more digits
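In OCaml, a scanner typically delivers such tokens as values of a variant type whose constructors carry the lexemes the parser will need; a minimal sketch (constructor names are illustrative, not from the slides):

(* One constructor per token kind; ID carries its lexeme, NUM its value *)
type token =
  | EQUALS | PLUS
  | LPAREN | RPAREN | COMMA | SEMI
  | ID of string
  | NUM of int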
Lexical Analysis
Goal: simplify the job of the parser and reject some wrong programs, e.g.,
%#@$^#!@#%#$
is not a C program†
Scanners are usually much faster than parsers.
Discard as many irrelevant details as possible (e.g., whitespace, comments).
Parser does not care that the identifier is “supercalifragilisticexpialidocious.”
Parser rules are only concerned with tokens.
† It is what you type when your head hits the keyboard
Describing Tokens
Alphabet: A finite set of symbols
Examples: { 0, 1 }, { A, B, C, …, Z }, ASCII, Unicode
String: A finite sequence of symbols from an alphabet
Examples: ε (the empty string), Stephen, αβγ
Language: A set of strings over an alphabet
Examples: ∅ (the empty language), { 1, 11, 111, 1111 }, all English words, strings that start with a letter followed by any sequence of letters and digits
Operations on Languages
Let L = { ε, wo }, M = { man, men }
Concatenation: Strings from one followed by the other LM = { man, men, woman, women }
Union: All strings from each language L∪M ={ε, wo, man, men }
Kleene Closure: Zero or more concatenations
M∗ = {ε} ∪ M ∪ MM ∪ MMM ∪ ··· =
{ε, man, men, manman, manmen, menman, menmen, manmanman, manmanmen, manmenman, . . . }
Kleene Closure
“*” is named after Stephen Cole Kleene, the inventor of regular expressions, who pronounced his last name “CLAY-nee.”
His son Ken writes “As far as I am aware this pronunciation is incorrect in all known languages. I believe that this novel pronunciation was invented by my father.”
Regular Expressions over an Alphabet Σ
A standard way to express languages for tokens.
1. ε is a regular expression that denotes {ε}
2. If a ∈ Σ, a is an RE that denotes {a}
3. If r and s denote languages L(r) and L(s), then
   (r)|(s) denotes L(r) ∪ L(s)
   (r)(s) denotes { tu : t ∈ L(r), u ∈ L(s) }
   (r)∗ denotes the union over i ≥ 0 of L(r)^i, where L(r)^0 = {ε} and L(r)^i = L(r) L(r)^(i−1)
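These three RE-forming operations map directly onto a variant type; a minimal OCaml sketch (type and constructor names are illustrative), together with a value for (a | b)∗abb:

(* Abstract syntax of regular expressions over characters *)
type re =
  | Eps                (* ε *)
  | Sym of char        (* a single symbol a ∈ Σ *)
  | Alt of re * re     (* (r)|(s) *)
  | Cat of re * re     (* (r)(s)  *)
  | Star of re         (* (r)∗    *)

(* (a|b)∗abb *)
let example =
  Cat (Star (Alt (Sym 'a', Sym 'b')),
       Cat (Sym 'a', Cat (Sym 'b', Sym 'b')))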
Regular Expression Examples
Σ = {a, b}

Regexp        Language
a|b           {a, b}
(a|b)(a|b)    {aa, ab, ba, bb}
a∗            {ε, a, aa, aaa, aaaa, . . . }
(a|b)∗        {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, . . . }
a|a∗b         {a, b, ab, aab, aaab, aaaab, . . . }
Specifying Tokens with REs
Typical choice: Σ = ASCII characters, i.e., { ,!,”,#,$,…,0,1,…,9,…,A,…,Z,…,~}
letters: A|B|···|Z|a|···|z
digits: 0|1|···|9
identifier: letter(letter | digit)∗
Implementing Scanners Automatically
Regular Expressions (Rules)
  ↓
Nondeterministic Finite Automata
  ↓  (Subset Construction)
Deterministic Finite Automata
  ↓
Tables
Nondeterministic Finite Automata
“All strings containing an even number of 0’s and 1’s”

1. Set of states S: {A, B, C, D}
2. Set of input symbols Σ: {0, 1}
3. Transition function σ: S × Σε → 2^S

   state   ε   0    1
   A       –   {B}  {C}
   B       –   {A}  {D}
   C       –   {D}  {A}
   D       –   {C}  {B}

4. Start state s0: A
5. Set of accepting states F: {A}

[State diagram omitted: A, B, C, D with the 0- and 1-transitions above]
The Language induced by an NFA
An NFA accepts an input string x iff there is a path from the start state to an accepting state that “spells out” x.
[Same NFA as above]

Show that the string “010010” is accepted.

A →0→ B →1→ D →0→ C →0→ D →1→ B →0→ A
Translating REs into NFAs (Thompson’s algorithm)
[Construction diagrams omitted]
Symbol: a becomes a single a-labeled transition between a new start and end state.
Sequence: r1 r2 chains the machine for r1 into the machine for r2.
Choice: r1 | r2 uses ε-transitions to branch into either machine and back out again.
Kleene Closure: (r)∗ wraps the machine for r in ε-transitions that allow skipping it or repeating it.
Why So Many Extra States and Transitions?
Invariant: Single start state; single end state; at most two outgoing arcs from any state: helpful for simulation.
What if we used this simpler rule for Kleene Closure?

[Diagram omitted: ε-transitions looping directly around the machine for r, with no new states]

Now consider a∗b∗ with this rule:

[Diagram omitted: the a-loop and the b-loop end up sharing states]

Is this right?
Translating REs into NFAs
Example: Translate (a | b)∗abb into an NFA. Answer:

[NFA diagram: 0 →ε→ 1, 0 →ε→ 7, 1 →ε→ 2, 1 →ε→ 4, 2 →a→ 3, 4 →b→ 5, 3 →ε→ 6, 5 →ε→ 6, 6 →ε→ 1, 6 →ε→ 7, 7 →a→ 8, 8 →b→ 9, 9 →b→ 10 (accepting)]

Show that the string “aabb” is accepted. Answer:
0 →ε→ 1 →ε→ 2 →a→ 3 →ε→ 6 →ε→ 7 →a→ 8 →b→ 9 →b→ 10
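A sketch of Thompson’s construction in OCaml over the re type sketched earlier. Each case allocates fresh states and ε-edges much as in the diagrams; the edge-list representation is an assumption of this sketch, and the sequence case links the two machines with an ε-edge rather than merging states.

(* Build an NFA fragment with one start and one end state per the
   Thompson invariant.  An edge is (source, Some c | None for ε, target). *)
let thompson re =
  let next = ref 0 in
  let fresh () = let s = !next in incr next; s in
  let edges = ref [] in
  let add src lbl dst = edges := (src, lbl, dst) :: !edges in
  let rec build = function
    | Eps ->
        let s = fresh () and f = fresh () in
        add s None f; (s, f)
    | Sym c ->
        let s = fresh () and f = fresh () in
        add s (Some c) f; (s, f)
    | Cat (r1, r2) ->
        let (s1, f1) = build r1 and (s2, f2) = build r2 in
        add f1 None s2; (s1, f2)
    | Alt (r1, r2) ->
        let s = fresh () and f = fresh () in
        let (s1, f1) = build r1 and (s2, f2) = build r2 in
        add s None s1; add s None s2; add f1 None f; add f2 None f;
        (s, f)
    | Star r ->
        let s = fresh () and f = fresh () in
        let (s1, f1) = build r in
        add s None s1;      (* enter the machine for r *)
        add f1 None f;      (* leave it *)
        add s None f;       (* skip it entirely *)
        add f1 None s1;     (* or repeat it *)
        (s, f)
  in
  let (start, accept) = build re in
  (start, accept, !edges)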
Simulating NFAs
Problem: you must follow the “right” arcs to show that a string is accepted. How do you know which arc is right?
Solution: follow them all and sort it out later. “Two-stack” NFA simulation algorithm:
1. Initial states: the ε-closure of the start state
2. For each input character c:
   New states: follow all transitions labeled c
   Form the ε-closure of the resulting set of states
3. Accept if, at the end of the input, any of the current states is an accepting state
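A direct OCaml rendering of this algorithm, tracking the current set of states; the NFA is given as two functions, and all names and representations here are illustrative:

module S = Set.Make (Int)

(* eps : int -> int list gives the ε-successors of a state;
   step : int -> char -> int list gives its c-successors. *)
let eps_closure eps states =
  let rec go frontier seen =
    match frontier with
    | [] -> seen
    | s :: rest ->
        let nexts = List.filter (fun t -> not (S.mem t seen)) (eps s) in
        go (nexts @ rest)
           (List.fold_left (fun acc t -> S.add t acc) seen nexts)
  in
  go (S.elements states) states

let nfa_accepts ~eps ~step ~start ~accepting input =
  let current = ref (eps_closure eps (S.singleton start)) in
  String.iter
    (fun c ->
       (* follow all transitions labeled c, then take the ε-closure *)
       let moved =
         S.fold
           (fun s acc -> List.fold_left (fun a t -> S.add t a) acc (step s c))
           !current S.empty
       in
       current := eps_closure eps moved)
    input;
  S.exists accepting !current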
Simulating an NFA on “aabb”: ·aabb (start), ·aabb after ε-closure, a·abb, its ε-closure, aa·bb, its ε-closure, aab·b, its ε-closure, aabb·, done.

[The slides highlight the current set of active states on the (a | b)∗abb NFA at each of these steps; diagrams omitted]
Deterministic Finite Automata
Restricted form of NFAs:
No state has a transition on ε
For each state s and symbol a, there is at most one edge labeled a leaving s.
Differs subtly from the definition used in COMS W3261 (Sipser, Introduction to the Theory of Computation)
Very easy to check acceptance: simulate by maintaining current state. Accept if you end up on an accepting state. Reject if you end on a non-accepting state or if there is no transition from the current state for the next symbol.
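This acceptance check is a few lines of OCaml; a minimal sketch in which the transition function returns None when there is no edge (names are illustrative):

(* next : state -> char -> state option;  accepting : state -> bool *)
let dfa_accepts ~next ~accepting ~start input =
  let rec run state i =
    if i = String.length input then accepting state
    else
      match next state input.[i] with
      | None -> false                 (* no transition: reject *)
      | Some state' -> run state' (i + 1)
  in
  run start 0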
Deterministic Finite Automata
{ type token = ELSE | ELSEIF }

rule token =
  parse "else"   { ELSE }
      | "elseif" { ELSEIF }

[DFA diagram omitted: a chain of states spelling e-l-s-e with an accepting state after “else” and another after “elseif”]
Deterministic Finite Automata
{ type token = IF | ID of string | NUM of string }

rule token =
  parse "if"                                { IF }
      | ['a'-'z'] ['a'-'z' '0'-'9']* as lit { ID(lit) }
      | ['0'-'9']+ as num                   { NUM(num) }

[DFA diagram omitted: an IF state reached on “if”, ID states reached on letters a–h and j–z and continued on a–z 0–9, and a NUM state on digits 0–9]
Building a DFA from an NFA
Subset construction algorithm
Simulate the NFA for all possible inputs and track the states that appear.
Each unique state during simulation becomes a state in the DFA.
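The same idea, run over every possible input symbol instead of one particular string, gives the subset construction. A compact OCaml sketch, reusing the S module and eps_closure function from the simulation sketch above (the representation and names are again illustrative):

(* Each DFA state is a set of NFA states, keyed canonically by its
   sorted element list.  alphabet lists the input symbols. *)
let subset_construction ~eps ~step ~start ~alphabet =
  let key s = S.elements s in
  let table = Hashtbl.create 16 in          (* DFA state -> (symbol, DFA state) list *)
  let rec explore dstate =
    let k = key dstate in
    if not (Hashtbl.mem table k) then begin
      Hashtbl.add table k [];               (* mark as visited before recursing *)
      let edges =
        List.filter_map
          (fun c ->
             let moved =
               S.fold
                 (fun s acc -> List.fold_left (fun a t -> S.add t a) acc (step s c))
                 dstate S.empty
             in
             if S.is_empty moved then None
             else begin
               let target = eps_closure eps moved in
               explore target;
               Some (c, key target)
             end)
          alphabet
      in
      Hashtbl.replace table k edges
    end
  in
  let start_set = eps_closure eps (S.singleton start) in
  explore start_set;
  (key start_set, table)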
Subset construction for (a | b)∗abb

[Diagrams omitted: starting from the ε-closure of the NFA start state, the DFA is built one state at a time by following a- and b-transitions]

Result of subset construction for (a | b)∗abb: a five-state DFA.

Is this minimal?

Minimized result for (a | b)∗abb: a four-state DFA.
Transition Table Used In the Dragon Book
Problem: Translate (a | b)∗abb into an NFA and perform subset construction to produce a DFA.
Solution:

[NFA for (a | b)∗abb as before, states 0–10]

NFA States            DFA State
{0,1,2,4,7}           A
{1,2,3,4,6,7,8}       B
{1,2,4,5,6,7}         C
{1,2,4,5,6,7,9}       D
{1,2,4,5,6,7,10}      E

DFA State   a   b
A           B   C
B           B   D
C           B   C
D           B   E
E           B   C

[Resulting DFA diagram omitted; E is the accepting state]
Subset Construction
A DFA can be exponentially larger than the corresponding NFA: n NFA states can become as many as 2^n DFA states.
Tools often try to strike a balance between the two representations.
Lexical Analysis with Ocamllex
Constructing Scanners with Ocamllex
scanner.mll → ocamllex (subset construction) → scanner.ml
An example:
scanner.mll
{ open Parser }

rule token =
  parse [' ' '\t' '\r' '\n'] { token lexbuf }
      | '+'                  { PLUS }
      | '-'                  { MINUS }
      | '*'                  { TIMES }
      | '/'                  { DIVIDE }
      | ['0'-'9']+ as lit    { LITERAL(int_of_string lit) }
      | eof                  { EOF }
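To try the generated scanner on its own, build a Lexing.lexbuf and call the entry point in a loop; a small sketch assuming the generated module is Scanner and the token type (including EOF) comes from Parser as above:

let () =
  let lexbuf = Lexing.from_channel stdin in
  let rec loop () =
    match Scanner.token lexbuf with
    | Parser.EOF -> ()
    | _tok -> loop ()        (* inspect or print _tok here *)
  in
  loop ()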
Ocamllex Specifications
{
  (* Header: verbatim OCaml code; mandatory *)
}

(* Definitions: optional *)
let ident = regexp
let ...

(* Rules: mandatory *)
rule entrypoint1 [arg1 ... argn] =
  parse pattern1 { action (* OCaml code *) }
      | ...
      | patternn { action }
and entrypoint2 [arg1 ... argn] =
  ...
and ...

{
  (* Trailer: verbatim OCaml code; optional *)
}
Patterns (In Order of Decreasing Precedence)
Pattern               Meaning
'c'                   A single character
_                     Any character (underline)
eof                   The end-of-file
"foo"                 A literal string
['1' '5' 'a'-'z']     “1,” “5,” or any lowercase letter
[^ '0'-'9']           Any character except a digit
( pattern )           Grouping
identifier            A pattern defined in the let section
pattern *             Zero or more patterns
pattern +             One or more patterns
pattern ?             Zero or one patterns
pattern1 pattern2     pattern1 followed by pattern2
pattern1 | pattern2   Either pattern1 or pattern2
pattern as id         Bind the matched pattern to variable id
An Example
{ type token = PLUS | IF | ID of string | NUM of int }

let letter = ['a'-'z' 'A'-'Z']
let digit  = ['0'-'9']

rule token =
  parse [' ' '\n' '\t']                      { token lexbuf }           (* Ignore whitespace *)
      | '+'                                  { PLUS }                   (* A symbol *)
      | "if"                                 { IF }                     (* A keyword *)
      | letter (letter | digit | '_')* as id { ID(id) }                 (* Identifiers *)
      | digit+ as lit                        { NUM(int_of_string lit) } (* Numeric literals *)
      | "/*"                                 { comment lexbuf }         (* C-style comments *)

and comment =
  parse "*/" { token lexbuf }    (* Return to normal scanning *)
      | _    { comment lexbuf }  (* Ignore other characters *)
Free-Format Languages
Typical style arising from scanner/parser division
Program text is a series of tokens possibly separated by whitespace and comments, which are both ignored.
keywords (if while)
punctuation (, ( +)
identifiers (foo bar)
numbers (10 -3.14159e+32)
strings (“A String”)
Free-Format Languages
Java C C++ C# Algol Pascal
Some deviate a little (e.g., C and C++ have a separate preprocessor)
But not all languages are free-format.
FORTRAN 77
FORTRAN 77 is not free-format. 72-character lines:
      100  IF(IN .EQ. 'Y' .OR. IN .EQ. 'y' .OR.
         $     IN .EQ. 'T' .OR. IN .EQ. 't') THEN
           ···

Columns 1–5: statement label; column 6: continuation; columns 7–72: normal statement text.
When column 6 is not a space, the line is considered part of the previous one.
Fixed-length lines work well with a one-line buffer.
Makes sense on punch cards.
Python
The Python scripting language groups with indentation
i = 0
while i < 10:
    i = i + 1
    print i       # Prints 1, 2, ..., 10

i = 0
while i < 10:
    i = i + 1
print i           # Just prints 10
This is succinct, but can be error-prone.
How do you wrap a conditional around instructions?
Syntax and Language Design
Does syntax matter? Yes and no
More important is a language’s semantics—its meaning.
The syntax is aesthetic, but can be a religious issue.
But aesthetics matter to people, and can be critical.
Verbosity does matter: smaller is usually better.
Too small can be problematic: APL is a succinct language with its own character set.
There are no APL programs, only puzzles.
Syntax and Language Design
Some syntax is error-prone. Classic fortran example:
DO 5 I = 1,25     ! Loop header (for i = 1 to 25)
DO 5 I = 1.25     ! Assignment to variable DO5I
Trying too hard to reuse existing syntax in C++:
vector< vector<int> > foo;
C distinguishes > and >> as different operators. Bjarne Stroustrup tells me they have finally fixed this.
Modeling Sentences
Simple Sentences Are Easy to Model
The boy eats hot dogs.
The dog eats ice cream.
Every happy girl eats candy.
A dog eats candy.
The happy happy dog eats hot dogs.
[Finite-state diagram: (The | A | Every) happy* (boy | girl | dog) eats (hot dogs. | ice cream. | candy.)]
Pinker, The Language Instinct
Richer Sentences Are Harder
If the boy eats hot dogs, then the girl eats ice cream. Either the boy eats candy, or every dog eats candy.
[Diagram omitted: an attempted finite-state model in which (If | Either) leads into the sentence automaton above and (then | or) leads back into it]
Does this work?
Automata Have Poor Memories
Want to “remember” whether it is an “either-or” or “if-then” sentence. Only solution: duplicate states.
[Diagram omitted: the sentence automaton is duplicated, one copy for “Either ... or ...” sentences and one for “If ... then ...” sentences, both feeding a shared final clause]
Automata in the form of Production Rules
Problem: automata do not remember where they’ve been
S → Either A      S → If A
A → the B         A → the C
A → a B           A → a C
A → every B       A → every C
B → happy B       B → happy C
C → boy D         C → girl D       C → dog D
D → eats E
E → hot dogs F    E → ice cream F  E → candy F
F → or A          F → then A       F → ε

[Diagram omitted: the automaton annotated with which rules label which states: S handles “Either/If,” A the article, B “happy,” C the noun, D “eats,” E the object, and F “or/then”]
Solution: Context-Free Grammars
Context-Free Grammars have the ability to “call subroutines:”
S → Either P, or P.
S → If P, then P.               (exactly two Ps)
P → A H N eats O                (one each of A, H, N, and O)
A → the
A → a
A → every
H → happy H                     (H is “happy” zero or more times)
H → ε
N → boy
N → girl
N → dog
O → hot dogs
O → ice cream
O → candy
A Context-Free Grammar for a Simplified C
program → ε | program vdecl | program fdecl
fdecl   → id ( formals ) { vdecls stmts }
formals → id | formals , id
vdecls  → vdecl | vdecls vdecl
vdecl   → int id ;
stmts   → ε | stmts stmt
stmt    → expr ; | return expr ; | { stmts } | if ( expr ) stmt |
          if ( expr ) stmt else stmt |
          for ( expr ; expr ; expr ) stmt | while ( expr ) stmt
expr    → lit | id | id ( actuals ) | ( expr ) |
          expr + expr | expr - expr | expr * expr | expr / expr |
          expr == expr | expr != expr | expr < expr | expr <= expr |
          expr > expr | expr >= expr | expr = expr
actuals → expr | actuals , expr
Constructing Grammars and Ocamlyacc
Parsing
Objective: build an abstract syntax tree (AST) for the token sequence from the scanner.
2*3+4  ⇒    +
           / \
          *   4
         / \
        2   3
Goal: verify the syntax of the program, discard irrelevant information, and “understand” the structure of the program.
Parentheses and most other forms of punctuation removed.
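In OCaml the AST is naturally a recursive variant type; a minimal sketch (constructor names are illustrative) together with the tree for 2*3+4:

type expr =
  | Lit of int
  | Add of expr * expr
  | Mul of expr * expr

(* 2*3+4 parses as (2*3)+4 under the usual precedence *)
let ast = Add (Mul (Lit 2, Lit 3), Lit 4)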
Ambiguity
One morning I shot an elephant in my pajamas.
Ambiguity
One morning I shot an elephant in my pajamas. How he got in my pajamas I don’t know. —Groucho Marx
Ambiguity in English
S  → NP VP
VP → V NP
VP → V NP PP
NP → Pro
NP → Det Noun
NP → Poss Noun
NP → NP PP
PP → P NP
V → shot    Noun → elephant    Noun → pajamas    Pro → I
Det → an    P → in             Poss → my

I shot an elephant in my pajamas

[Two parse trees omitted: in one, the PP “in my pajamas” attaches inside the NP “an elephant”; in the other, it attaches to the VP]

Jurafsky and Martin, Speech and Language Processing
The Dangling Else Problem
Who owns the else?
if (a) if (b) c(); else d();
Should this be
  if (a) { if (b) c(); else d(); }
or
  if (a) { if (b) c(); } else d(); ?
[Parse trees omitted]
Grammars are usually ambiguous; manuals give disambiguating rules such as C’s:
As usual the “else” is resolved by connecting an else with the last encountered elseless if.
The Dangling Else Problem
stmt : IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
Problem comes after matching the first statement. Question is whether an “else” should be part of the current statement or a surrounding one since the second line tells us “stmt ELSE” is possible.
The Dangling Else Problem
Some languages resolve this problem by insisting on nesting everything.
E.g., Algol 68:
if a < b then a else b fi;
“fi” is “if” spelled backwards. The language also uses do–od and case–esac.
Another Solution to the Dangling Else Problem
Idea: break into two types of statements: those that have a dangling “then” (“dstmt”) and those that do not (“cstmt”). A statement may be either, but the statement just before an “else” must not have a dangling clause because if it did, the “else” would belong to it.
stmt : dstmt | cstmt
dstmt : IF expr THEN stmt
| IF expr THEN cstmt ELSE dstmt
cstmt : IF expr THEN cstmt ELSE cstmt | other statements...
We are effectively carrying an extra bit of information during parsing: whether there is an open “then” clause. Unfortunately, duplicating rules is the only way to do this in a context-free grammar.
Ambiguous Arithmetic
Ambiguity can be a problem in expressions. Consider parsing

  3 - 4 * 2 + 5

with the grammar

  e → e + e | e − e | e ∗ e | e / e | N

[The five possible parse trees omitted]
Operator Precedence and Associativity
Usually resolve ambiguity in arithmetic expressions Like you were taught in elementary school:
“My Dear Aunt Sally”
Mnemonic for multiplication and division before addition and subtraction.
Operator Precedence
Defines how “sticky” an operator is.
1 * 2 + 3 * 4

* at higher precedence than +:  (1 * 2) + (3 * 4)
[Parse tree omitted: + at the root with the two * subtrees as children]

+ at higher precedence than *:  1 * (2 + 3) * 4
[Parse tree omitted: * nodes at the top with the + subtree inside]
Associativity
Whether to evaluate left-to-right or right-to-left.
Most operators are left-associative.

1 - 2 - 3 - 4

((1 − 2) − 3) − 4     1 − (2 − (3 − 4))
left associative      right associative
[Parse trees omitted]
Fixing Ambiguous Grammars
A grammar specification:
expr : expr PLUS expr
     | expr MINUS expr
     | expr TIMES expr
     | expr DIVIDE expr
     | NUMBER
Ambiguous: no precedence or associativity. Ocamlyacc’s complaint: “16 shift/reduce conflicts.”
Assigning Precedence Levels
Split into multiple rules, one per level
expr : expr PLUS expr
     | expr MINUS expr
     | term

term : term TIMES term
     | term DIVIDE term
     | atom

atom : NUMBER
Still ambiguous: associativity not defined Ocamlyacc’s complaint: “8 shift/reduce conflicts.”
Assigning Associativity
Make one side the next level of precedence
expr : expr PLUS term
     | expr MINUS term
     | term

term : term TIMES atom
     | term DIVIDE atom
     | atom

atom : NUMBER
This is left-associative. No shift/reduce conflicts.
Statement separators/terminators
C uses ; as a statement terminator.
if (a…

%token <type> symbol . . .
Define symbols with attached attribute (also exported)
%start symbol . . .
Define start symbols (entry points)
%type <type> symbol . . .
Define the type for a symbol (mandatory for start)
%left symbol . . .
%right symbol . . .
%nonassoc symbol . . .
Define precedence and associativity for the given symbols, listed in order from lowest to highest precedence
Rules
nonterminal :
symbol … symbol { semantic-action }
| …
| symbol … symbol { semantic-action }
nonterminal is the name of a rule, e.g., “program,” “expr”
symbol is either a terminal (token) or another rule
semantic-action is OCaml code evaluated when the rule
is matched
In a semantic-action, $1, $2, . . . returns the value of the first, second, . . . symbol matched
A rule may include “%prec symbol” to override its default precedence
An Example .mly File
%token <int> INT
%token PLUS MINUS TIMES DIV LPAREN RPAREN EOL
%left PLUS MINUS        /* lowest precedence */
%left TIMES DIV
%nonassoc UMINUS        /* highest precedence */
%start main             /* the entry point */
%type <int> main
%%
main:
    expr EOL                 { $1 }

expr:
    INT                      { $1 }
  | LPAREN expr RPAREN       { $2 }
  | expr PLUS expr           { $1 + $3 }
  | expr MINUS expr          { $1 - $3 }
  | expr TIMES expr          { $1 * $3 }
  | expr DIV expr            { $1 / $3 }
  | MINUS expr %prec UMINUS  { - $2 }
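A few lines suffice to connect the generated parser to a scanner and run it; a sketch assuming the generated modules are named Parser and Scanner and that Scanner.token produces this parser's tokens:

let () =
  let lexbuf = Lexing.from_channel stdin in
  (* Parser.main takes the token function and the lexbuf, per ocamlyacc convention *)
  let result = Parser.main Scanner.token lexbuf in
  Printf.printf "%d\n" result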
Parsing Algorithms
Parsing Context-Free Grammars
There are O(n³) algorithms for parsing arbitrary CFGs, but most compilers demand O(n) algorithms.
Fortunately, the LL and LR subclasses of CFGs have O(n) parsing algorithms. People use these in practice.
Rightmost Derivation of Id ∗ Id + Id

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

e
t + e
t + t
t + Id
Id ∗ t + Id
Id ∗ Id + Id

At each step, expand the rightmost nonterminal.
“handle”: the right side of a production used in the expansion.
Fun and interesting fact: there is exactly one rightmost expansion if the grammar is unambiguous.

Dragon-book style: underline handles
e → t + e → t + t → t + Id → Id ∗ t + Id → Id ∗ Id + Id
Rightmost Derivation: What to Expand
1:e→t +e 2:e→t 3:t→Id ∗t 4:t→Id
e
t + e
t + t
t + Id
Id ∗ t + Id
Id ∗ Id + Id

Expand at the rightmost nonterminal ↑; everything to its right is terminals only.
Reverse Rightmost Derivation

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

Run the rightmost derivation backwards, replacing a handle with its nonterminal at each step:

Id ∗ Id + Id
Id ∗ t + Id       (reduce by 4: t → Id)
t + Id            (reduce by 3: t → Id ∗ t)
t + t             (reduce by 4: t → Id)
t + e             (reduce by 2: e → t)
e                 (reduce by 1: e → t + e)

At each step, everything to the left of the handle is a viable prefix; everything to its right is terminals only.
Shift/Reduce Parsing Using an Oracle

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

stack            input            action
                 Id ∗ Id + Id     shift
Id               ∗ Id + Id        shift
Id ∗             Id + Id          shift
Id ∗ Id          + Id             reduce 4
Id ∗ t           + Id             reduce 3
t                + Id             shift
t +              Id               shift
t + Id                            reduce 4
t + t                             reduce 2
t + e                             reduce 1
e                                 accept
Handle Hunting
Right Sentential Form: any step in a rightmost derivation
Handle: in a sentential form, a RHS of a rule that, when rewritten, yields the previous step in a rightmost derivation.

The big question in shift/reduce parsing:
When is there a handle on the top of the stack?
Enumerate all the right-sentential forms and pattern-match against them? Usually infinitely many; let’s try anyway.
Some Right-Sentential Forms and Their Handles

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

[Tree of right-sentential forms omitted; examples include e, t, Id ∗ t, Id, Id ∗ Id ∗ t, Id ∗ Id, Id ∗ Id ∗ Id ∗ t, Id ∗ Id ∗ Id, t + e, t + t, t + Id, Id ∗ t + Id, Id ∗ Id ∗ t + Id, . . .]

Patterns:
Id ∗ Id ∗ ··· ∗ Id ∗ t ···
Id ∗ Id ∗ ··· ∗ Id ···
t + t + ··· + t + e
t + t + ··· + t + Id
t + t + ··· + t + Id ∗ Id ∗ ··· ∗ Id ∗ t
t + t + ··· + t
The Handle-Identifying Automaton
Magical result, due to Knuth: An automaton suffices to locate a handle in a right-sentential form.
[Automaton diagram omitted: by scanning for the patterns above, a small automaton over Id, ∗, +, t, and e can tell which handle, if any, is on top of the stack]
Building the Initial State of the LR(0) Automaton

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

e′ → ·e
e → ·t + e
e → ·t
t → ·Id ∗ t
t → ·Id

Key idea: automata identify viable prefixes of right sentential forms. Each state is an equivalence class of possible places in productions.

At the beginning, any viable prefix must be at the beginning of a string expanded from e. We write this condition “e′ → ·e”.

There are two choices for what an e may expand to: t + e and t. So when e′ → ·e, e → ·t + e and e → ·t are also true, i.e., it must start with a string expanded from t.

Also, t must be Id ∗ t or Id, so t → ·Id ∗ t and t → ·Id.

This is a closure, like ε-closure in subset construction.
Building the LR(0) Automaton

The first state, S0, suggests a viable prefix can start as any string derived from e, any string derived from t, or Id.

The items for each successor state come from advancing the · across the symbol just passed, then performing the closure operation. In S2 a + may be next, giving e → t + ·e, and closure adds four more items; in S1 a ∗ may be next, giving t → Id ∗ ·t and two more items.

S0: e′ → ·e    e → ·t + e    e → ·t    t → ·Id ∗ t    t → ·Id
S1: t → Id·∗ t    t → Id·                         (from S0, S3, or S4 on Id: “just passed a prefix that ended in an Id”)
S2: e → t·+ e    e → t·                           (from S0 or S4 on t: “just passed a prefix ending in a string derived from t”)
S3: t → Id ∗·t    t → ·Id ∗ t    t → ·Id          (from S1 on ∗)
S4: e → t +·e    e → ·t + e    e → ·t    t → ·Id ∗ t    t → ·Id    (from S2 on +)
S5: t → Id ∗ t·                                   (from S3 on t)
S6: e → t + e·                                    (from S4 on e)
S7: e′ → e·                                       (from S0 on e: “just passed a string derived from e”)
What to do in each state?
1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

In S1: t → Id·∗ t    t → Id·

Stack                  Input       Action
Id ∗ Id ∗ ··· ∗ Id     ∗ ···       Shift
Id ∗ Id ∗ ··· ∗ Id     + ···       Reduce 4
Id ∗ Id ∗ ··· ∗ Id     $           Reduce 4
Id ∗ Id ∗ ··· ∗ Id     Id ···      Syntax Error
The first function
If you can derive a string that starts with terminal t from some sequence of terminals and nonterminals α, then
t ∈ first(α).
1. Trivially, first(X) = {X} if X is a terminal.
2. If X → ε, then add ε to first(X).
3. For each production X → Y ···, add first(Y) − {ε} to first(X).
   If X can produce something, X can start with whatever that starts with.
4. For each production X → Y1 ··· Yk Z ··· where ε ∈ first(Yi) for i = 1, . . . , k, add first(Z) − {ε} to first(X).
   Skip all potential ε’s at the beginning of whatever X produces.
1:e→t +e 2:e→t 3:t→Id ∗t 4:t→Id
first(Id) = {Id}
first(t)={Id} because t →Id ∗t and t →Id
first(e ) = {Id} because e → t + e , e → t , and first(t ) = {Id}.
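These rules are naturally computed as a fixed-point iteration; a small OCaml sketch for grammars without ε-productions, such as the one above (the string-based symbol encoding and the is_terminal predicate are assumptions of this sketch):

module SS = Set.Make (String)

(* productions : (lhs, rhs) list, e.g. ("e", ["t"; "+"; "e"]).
   Without ε-productions, first(X) depends only on the first symbol
   of each right-hand side. *)
let first_sets ~is_terminal ~productions =
  let first = Hashtbl.create 16 in
  let get x =
    if is_terminal x then SS.singleton x
    else try Hashtbl.find first x with Not_found -> SS.empty
  in
  let changed = ref true in
  while !changed do
    changed := false;
    List.iter
      (fun (lhs, rhs) ->
         match rhs with
         | [] -> ()                          (* ε-productions not handled here *)
         | y :: _ ->
             let updated = SS.union (get lhs) (get y) in
             if not (SS.equal updated (get lhs)) then begin
               Hashtbl.replace first lhs updated;
               changed := true
             end)
      productions
  done;
  first

(* For the grammar above,
     first_sets ~is_terminal:(fun s -> s = "Id" || s = "+" || s = "*")
       ~productions:[ ("e", ["t"; "+"; "e"]); ("e", ["t"]);
                      ("t", ["Id"; "*"; "t"]); ("t", ["Id"]) ]
   yields first(e) = first(t) = {Id}. *)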
The follow function

If t is a terminal, A is a nonterminal, and ··· A t ··· can be derived, then t ∈ follow(A).

1. Add $ (“end-of-input”) to follow(S) (start symbol).
   End-of-input comes after the start symbol.
2. For each production X → ··· A α, add first(α) − {ε} to follow(A).
   A is followed by the first thing after it.
3. For each production A → ··· B or A → ··· B α where ε ∈ first(α), add everything in follow(A) to follow(B).
   If B appears at the end of a production, it can be followed by whatever follows that production.

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id
first(t) = {Id}    first(e) = {Id}

follow(e) = {$}       1. Because e is the start symbol
follow(t) = {+}       2. Because e → t + e and first(+) = {+}
follow(t) = {+, $}    3. Because e → t and $ ∈ follow(e)

Fixed point reached: applying any rule does not change any set.
Converting the LR(0) Automaton to an SLR Table

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id
follow(e) = {$}    follow(t) = {+, $}

[LR(0) automaton as above: S0 –Id→ S1, S0 –t→ S2, S0 –e→ S7, S1 –∗→ S3, S2 –+→ S4, S3 –Id→ S1, S3 –t→ S5, S4 –Id→ S1, S4 –t→ S2, S4 –e→ S6]

              Action                    Goto
State      Id     +     ∗     $        e    t
  0        s1                           7    2
  1               r4    s3    r4
  2               s4          r2
  3        s1                                5
  4        s1                           6    2
  5               r3          r3
  6                           r1
  7                           accept

From S0, shift an Id and go to S1; or cross a t and go to S2; or cross an e and go to S7.
From S1, shift a ∗ and go to S3; or, if the next input ∈ follow(t), reduce by rule 4.
From S2, shift a + and go to S4; or, if the next input ∈ follow(e), reduce by rule 2.
From S3, shift an Id and go to S1; or cross a t and go to S5.
From S4, shift an Id and go to S1; or cross an e or a t.
From S5, reduce using rule 3 if the next symbol ∈ follow(t).
From S6, reduce using rule 1 if the next symbol ∈ follow(e).
If, in S7, we just crossed an e, accept if we are at the end of the input.
Shift/Reduce Parsing with an SLR Table

1: e → t + e    2: e → t    3: t → Id ∗ t    4: t → Id

Look at the state on top of the stack and the next input token, and find the action (shift, reduce, or error) in the table.
On a shift, push the token onto the stack and mark it with the indicated state.
On a reduce, remove the RHS of the rule from the stack, observe the state now on top, and consult the “goto” portion of the table; push the LHS nonterminal marked with that goto state. This effectively “backs up” the LR(0) automaton and runs it over the newly added nonterminal.

Stack (each symbol marked with its state)   Input            Action
0                                           Id ∗ Id + Id $   Shift, goto 1
0 Id1                                       ∗ Id + Id $      Shift, goto 3
0 Id1 ∗3                                    Id + Id $        Shift, goto 1
0 Id1 ∗3 Id1                                + Id $           Reduce 4
0 Id1 ∗3 t5                                 + Id $           Reduce 3
0 t2                                        + Id $           Shift, goto 4
0 t2 +4                                     Id $             Shift, goto 1
0 t2 +4 Id1                                 $                Reduce 4
0 t2 +4 t2                                  $                Reduce 2
0 t2 +4 e6                                  $                Reduce 1
0 e7                                        $                Accept

For example, after the third shift the state on top is 1 and the next symbol is +, so the table says reduce using rule 4: remove the RHS (here, just Id), observe state 3 now on top, and push a t marked with goto state 5. In state 5 with an upcoming +, the action is “reduce 3”: strip off the RHS Id ∗ t, exposing state 0, and push a t with state 2.
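The table is all a driver needs; a toy OCaml sketch of the shift/reduce loop with the action and goto tables above transcribed by hand (the names, the list-based input, and the omission of error recovery are simplifications of this sketch):

type sym = Id | Plus | Star | End            (* terminals; $ is written End *)
type act = Shift of int | Reduce of int | Accept | Error

(* action table: state -> next terminal -> action *)
let action state sym =
  match state, sym with
  | 0, Id -> Shift 1
  | 1, Star -> Shift 3
  | 1, (Plus | End) -> Reduce 4
  | 2, Plus -> Shift 4
  | 2, End -> Reduce 2
  | 3, Id -> Shift 1
  | 4, Id -> Shift 1
  | 5, (Plus | End) -> Reduce 3
  | 6, End -> Reduce 1
  | 7, End -> Accept
  | _ -> Error

(* goto table: state -> nonterminal -> state *)
let goto state nt =
  match state, nt with
  | 0, "e" -> 7 | 0, "t" -> 2
  | 3, "t" -> 5
  | 4, "e" -> 6 | 4, "t" -> 2
  | _ -> failwith "no goto"

(* For each rule: (LHS nonterminal, length of RHS) *)
let rule = function
  | 1 -> ("e", 3)   (* e -> t + e  *)
  | 2 -> ("e", 1)   (* e -> t      *)
  | 3 -> ("t", 3)   (* t -> Id * t *)
  | 4 -> ("t", 1)   (* t -> Id     *)
  | _ -> failwith "no such rule"

(* Only states live on the stack; the input ends with End. *)
let parse input =
  let rec step stack input =
    let state = List.hd stack in
    match action state (List.hd input) with
    | Shift s -> step (s :: stack) (List.tl input)
    | Reduce r ->
        let lhs, len = rule r in
        let rec drop n l = if n = 0 then l else drop (n - 1) (List.tl l) in
        let stack' = drop len stack in
        step (goto (List.hd stack') lhs :: stack') input
    | Accept -> true
    | Error -> false
  in
  step [0] input

(* parse [Id; Star; Id; Plus; Id; End] returns true,
   following exactly the trace in the table above. *)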
L, R, and all that
LR parser: “Bottom-up parser”:
L = Left-to-right scan, R = (reverse) Rightmost derivation
RR parser: R = Right-to-left scan (from end)
I called them “Australian style”; nobody uses these
LL parser: “Top-down parser”:
L = Left-to-right scan; L = Leftmost derivation
LR(1): LR parser that considers the next token (lookahead of 1)
LR(0): Only considers the stack to decide shift/reduce
SLR(1): Simple LR: lookahead from first/follow rules; derived from the LR(0) automaton
LALR(1): Lookahead LR(1): fancier lookahead analysis; uses the same LR(0) automaton as SLR(1)
Ocamlyacc builds LALR(1) tables.
The Punchline
This is a tricky, but mechanical procedure. The Ocamlyacc parser generator uses a modified version of this technique to generate fast bottom-up parsers.
You need to understand it to comprehend error messages:
Shift/reduce conflicts are caused by a state like
t → · Else s
t → ·
If the next token is Else, do you reduce it since Else may follow a t, or shift it?
Reduce/reduce conflicts are caused by a state like
t → Id ∗ t ·
e → t + e ·
Do you reduce by “t →Id∗t” or by “e → t +e”?