Parser: Syntactic Analysis Readings: EAC2 Chapter 3
EECS4302 M: Compilers and Interpreters Winter 2020
CHEN-WEI WANG
Parser in Context
○ Recall:
Lexical Analysis Source Program
(seq. of characters)
Scanner
Syntactic Analysis seq. of tokens
Parser
Semantic Analysis
AST1 … ASTn
pretty printed
Target Program
○ Treats the input programas as a a sequence of classified tokens/words
○ Applies rules parsing token sequences as
abstract syntax trees (ASTs) [ syntactic analysis ]
○ Upon termination:
● ReportstokensequencesnotderivableasASTs ● ProducesanAST
○ No longer considers every character in input program.
○ token sequences constitute a
.
2 of 96
Derivable
context-free language (CFL)
Context-Free Languages: Introduction
● We have seen regular languages :
○ Can be described using finite automata or regular expressions.
○ Satisfy the pumping lemma.
● Languages with a recursive structure are provably non-regular.
● ●
e.g.,{0n1n ∣n≥0}
Context-free grammars (CFG’s) are used to describe strings
that can be generated in a recursive fashion. Context-free languages (CFL’s) are:
○ Languages that can be described using CFG’s.
○ A proper superset of the set of regular languages.
3 of 96
CFG: Example (1.1)
● The language that we previously proved as non-regular {0n#1n ∣n≥0}
can be described using the following grammar :
A → 0A1 A→B B→#
● A grammar contains a collection of substitution or production rules, where:
○ A terminal is a word w ∈ Σ∗ (e.g., 0, 1, etc.).
○ A variable or non-terminal is a word w ∈/ Σ∗ (e.g., A, B, etc.).
○ A start variable occurs on the LHS of the topmost rule (e.g., A).
4 of 96
CFG: Example (1.2)
● Given a grammar, generate a string by:
1. Write down the start variable.
2. Choose a production rule where the start variable appears on the
LHS of the arrow, and substitute it by the RHS.
3. There are two cases of the re-written string:
3.1 It contains no variables, then you are done.
3.2 It contains some variables, then substitute each variable using the
relevant production rules.
4. Repeat Step 3.
● e.g., We can generate an infinite number of strings from
A → 0A1 A→B B→#
○ A⇒B⇒#
○ A⇒0A1⇒0B1⇒0#1
○ A⇒0A1⇒00A11⇒00B11⇒00#11
○ … 5 of 96
CFG: Example (1.2)
Given a CFG, the derivation of a string can be shown as a parse tree .
e.g., The derivation of 000#111 has the parse tree
6 of 96
CFG: Example (2)
Design a CFG for the following language: {w∣w∈{0,1}∗∧w is a palidrome}
e.g., 00, 11, 0110, 1001, etc.
P→ε P→0 P→1
P → 0P0 P → 1P1
7 of 96
CFG: Example (3)
Design a CFG for the following language: {wwR ∣w∈{0,1}∗}
e.g., 00, 11, 0110, etc.
8 of 96
P→ε
P → 0P0 P → 1P1
CFG: Example (4)
Design a CFG for the set of binary strings, where each block of 0’s followed by at least as many 1’s.
e.g., 000111, 0001111, etc.
● We use S to represent one such string, and A to represent
each such block in S.
S → ε {BC of S}
S → AS {RC of S}
A → ε {BC of A}
A → 01 {BC of A}
A → 0A1 {RC of A: equal 0’s and 1’s} A → A1 {RCofA:more1’s}
9 of 96
CFG: Example (5.1) Version 1
Design the grammar for the following small expression language, which supports:
● Arithmetic operations: +, -, *, /
● Relational operations: >, <, >=, <=, ==, /=
● Logical operations: true, false, !, &&, ||, =>
Start with the variable Expression. ● There are two possible versions:
1. All operations are mixed together.
2. Relevant operations are grouped together.
Try both!
[e.g., (1 + true)/false]
10 of 96
CFG: Example (5.2) Version 1
11 of 96
Expression
→ IntegerConstant
∣ -IntegerConstant
∣ BooleanConstant
∣ BinaryOp
∣ UnaryOp
∣ ( Expression )
IntegerConstant → Digit
∣ Digit IntegerConstant
Digit → 0∣1∣2∣3∣4∣5∣6∣7∣8∣9
BooleanConstant → TRUE
∣ FALSE
CFG: Example (5.3) Version 1
12 of 96
BinaryOp → ∣
∣ ∣ ∣ ∣ ∣ ∣ ∣ ∣ ∣
UnaryOp →
Expression + Expression Expression – Expression Expression * Expression Expression / Expression Expression && Expression Expression || Expression Expression => Expression Expression == Expression Expression /= Expression Expression > Expression Expression < Expression
! Expression
CFG: Example (5.4) Version 1
However, Version 1 of CFG:
○ Parses string that requires further semantic analysis (e.g., type
checking):
e.g.,3 => 6
○ Is ambiguous , meaning that a string may have more than one
ways to interpret it.
e.g., Draw the parse tree(s) for 3 * 5 + 4
13 of 96
CFG: Example (5.5) Version 2
Expression → ArithmeticOp
∣ RelationalOp
∣ LogicalOp
∣ ( Expression )
IntegerConstant → Digit
∣ Digit IntegerConstant
Digit → 0∣1∣2∣3∣4∣5∣6∣7∣8∣9
BooleanConstant → TRUE
∣ FALSE
14 of 96
CFG: Example (5.6) Version 2
15 of 96
ArithmeticOp → ∣
∣ ∣ ∣ ∣ ∣
ArithmeticOp + ArithmeticOp ArithmeticOp – ArithmeticOp ArithmeticOp * ArithmeticOp ArithmeticOp / ArithmeticOp (ArithmeticOp) IntegerConstant -IntegerConstant ArithmeticOp == ArithmeticOp
RelationalOp →
∣ ArithmeticOp /= ArithmeticOp
LogicalOp
∣ ArithmeticOp > ArithmeticOp
∣ ArithmeticOp < ArithmeticOp → LogicalOp && LogicalOp
∣ LogicalOp || LogicalOp
∣ LogicalOp => LogicalOp
∣ ! LogicalOp
∣ (LogicalOp)
∣ RelationalOp
∣ BooleanConstant
CFG: Example (5.7) Version 2
However, Version 2 of CFG:
○ Eliminates some cases for further semantic analysis:
e.g.,(1 + 2) => (5 / 4) [noparsetree]
○ Still Parses string that might require further semantic analysis :
e.g.,(1 + 2) / (5 – (2 + 3))
○ Is ambiguous , meaning that a string may have more than one
ways to interpret it.
e.g., Draw the parse tree(s) for 3 * 5 + 4
16 of 96
CFG: Formal Definition (1)
● A context-free grammar (CFG) is a 4-tuple (V , Σ, R, S): ○ V is a finite set of variables.
○ Σisafinitesetofterminals. [V∩Σ=∅] ○ R is a finite set of rules s.t.
R ⊆ {v → s ∣ v ∈ V ∧ s ∈ (V ∪ Σ)∗}
○ S ∈ V is is the start variable.
● Givenstringsu,v,w∈(V∪Σ)∗,variableA∈V,andarule
A → w:
○ menas that uAv yields uwv.
○ means that u derives v, if: ● u=v;or
● u⇒u1 ⇒u2 ⇒⋅⋅⋅⇒uk ⇒v [ayieldsequence] ● GivenaCFGG=(V, Σ, R, S),thelanguageofG
L ( G ) = { w ∈ Σ ∗ ∣ S ⇒∗ w }
17 of 96
uAv ⇒ uwv
u ⇒∗ v
CFG: Formal Definition (2): Example
● Design the CFG for strings of properly-nested parentheses. e.g., (), ()(), ((()()))(), etc.
Present your answer in a formal manner.
● G=({S},{(,)},R,S),whereRis S→( S )∣SS∣ε
● Draw parse trees for the above three strings that G generates.
18 of 96
CFG: Formal Definition (3): Example
● ConsiderthegrammarG=(V,Σ,R,S):
○ R is
Expr → ∣
Term → ∣
Factor → ∣a
○ V={Expr,Term,Factor}
○ Σ = {a,+,*,(,)}
○ S=Expr
● Precedence of operators + and * is embedded in the grammar. ○ “Plus” is specified at a higher level (Expr) than is “times” (Term). ○ Both operands of a multiplication (Factor) may be parenthesized.
19 of 96
Expr + Term Term
Term * Factor Factor
(Expr )
Regular Expressions to CFG’s
● Recall the semantics of regular expressions (assuming that we do not consider ∅):
20 of 96
L( ε )
L( a )
L(E+F)
L( EF ) L( E∗ )
= {ε}
= {a}
= L(E)∪L(F) = L(E )L(F )
= (L(E ))∗
= L(E )
L( (E) )
● e.g.,Grammarfor(00+1)∗+(11+0)∗
S→A∣B A → ε∣AC C → 00∣1 B → ε∣BD D → 11∣0
DFA to CFG’s
● Given a DFA M = (Q, Σ, δ, q0, F):
○ Make a variable Ri for each state qi ∈ Q.
○ Make R0 the start variable, where q0 is the start state of M. ○ Add a rule Ri → aRj to the grammar if δ(qi,a) = qj.
○ AddaruleRi →εifqi ∈F.
● e.g., Grammar for
21 of 96
11 0
s0: even 0’s
s1: odd 0’s
R0 → 1R0∣0R1 R1 → 0R0∣1R1∣ε
0
CFG: Leftmost Derivations (1)
Expr → Expr + Term ∣ Term Term → Term * Factor ∣ Factor Factor → (Expr)∣a
○ Unique leftmost derivation for the string a + a * a:
Expr ⇒ Expr + Term ⇒ Term + Term
⇒ Factor + Term
⇒ a+Term
⇒ a + Term * Factor ⇒ a + Factor * Factor ⇒ a+a*Factor ⇒a+a*a
○ This leftmost derivation suggests that a * a is the right operand of +.
22 of 96
CFG: Rightmost Derivations (1)
Expr → Expr + Term ∣ Term Term → Term * Factor ∣ Factor Factor → (Expr)∣a
○ Unique rightmost derivation for the string a + a * a:
Expr ⇒ Expr + Term
⇒ Expr + Term * Factor ⇒ Expr+Term*a
⇒ Expr + Factor * a
⇒ Expr+a*a
⇒ Term+a*a
⇒ Factor+a*a ⇒a+a*a
○ This rightmost derivation suggests that a * a is the right operand of +.
23 of 96
CFG: Leftmost Derivations (2)
Expr → Expr + Term ∣ Term Term → Term * Factor ∣ Factor Factor → (Expr)∣a
○ Unique leftmost derivation for the string (a + a) * a: Expr ⇒ Term
⇒ Term * Factor
⇒ Factor * Factor
⇒ ( Expr ) * Factor
⇒ (Expr+Term)*Factor ⇒ (Term+Term)*Factor ⇒ ( Factor + Term ) * Factor ⇒ (a+Term)*Factor
⇒ (a+Factor)*Factor
⇒ (a+a)*Factor ⇒(a+a)*a
This leftmost derivation suggests that (a + a) is the left 24 of 96operand of *.
CFG: Rightmost Derivations (2)
Expr → Expr + Term ∣ Term Term → Term * Factor ∣ Factor Factor → (Expr)∣a
○ Unique rightmost derivation for the string (a + a) * a: Expr ⇒ Term
⇒ Term * Factor
⇒ Term * a
⇒ Factor * a
⇒ (Expr)*a
⇒ (Expr+Term)*a ⇒ (Expr+Factor)*a ⇒ (Expr+a)*a
⇒ (Term+a)*a ⇒ (Factor+a)*a ⇒(a+a)*a
This rightmost derivation suggests that (a + a) is the left 25 of 96operand of *.
CFG: Parse Trees vs. Derivations (1)
○ Parse trees for (leftmost & rightmost) derivations of expressions: a + a * a (a + a) * a
○ Orders in which derivations are performed are not reflected on parse trees.
26 of 96
CFG: Parse Trees vs. Derivations (2)
● A string w ∈ Σ∗ may have more than one derivations.
Q: distinct derivations for w ∈ Σ∗ ⇒ distinct parse trees for w?
A: Not in general ∵ Derivations with distinct orders of variable
substitutions may still result in the same parse tree. ● For example:
Expr → Expr + Term ∣ Term Term → Term * Factor ∣ Factor Factor → (Expr)∣a
For string a + a * a, the leftmost and rightmost derivations have distinct orders of variable substitutions, but their corresponding parse trees are the same.
27 of 96
CFG: Ambiguity: Definition
Given a grammar G = (V,Σ,R,S):
○ A string w ∈ Σ∗ is derived ambiguously in G if there exist two or more distinct parse trees or, equally,
two or more distinct leftmost derivations or, equally, two or more distinct rightmost derivations.
Here we require that all such derivations have been completed by following a particular order (leftmost or rightmost) to avoid false alarm.
○ G is ambiguous if it generates some string ambiguously.
28 of 96
CFG: Ambiguity: Exercise (1)
● Is the following grammar ambiguous ?
Expr →Expr + Expr ∣Expr * Expr ∣( Expr )∣a
● Yes ∵ it generates the string a + a * a ambiguously :
● Distinct ASTs (for the same input) mean distinct semantic interpretations: e.g.,
when a post-order traversal is used to implement evaluation
● Exercise: Show leftmost derivations for the two parse trees. 29 of 96
1
4 | …other statements … 3.2
This fragment shows that the else is optional. Unfortunately, the code Expressing Syntax 91
CFG: Ambiguity: Exercise (2.1)
The classic example of an ambiguous construct in the grammar for a pro-
gramming language is the if-then-else construct of many Algol-like
languages. The straightforward grammar for if-then-else ● Is the following grammar
1 Statement → if Expr then Statement else Statement
has two distinct rightmost derivations. The difference between them is The first derivation has Assignment2 controlled by the inner
Statement
2 | if Expr then Statement
3 Statem|enAtssignme→nt if
Expr then This fragment shows that the else is optional. Unfortunately, the code
4 | …other statements …
∣ if Expr then Statement else Statement
fragment
if Expr1 then if Expr2 then Assignment1 else Assignment2
∣ Assignment
if Expr1 then Statement
if Expr2 then Statement else Statement
…
● Yes ∵ it generates the following string : hastwodistinctrightmostderivations.ThedifferencebetweenthemisThesececlausewiththefirstif,sothat simple. The first derivation has Assignment2 controlled by the inner Assignment2 executes when Expr1 is false, independent of the value of
if Expr then if Expr then As
g
if, so Assignment2 e1xecutes when Expr1 is true and E2xpr2 is false: Statement
if Expr1 then Statement
if Expr2 then Statement else Statement
xp
r
2
:
fragment
if Expr1 then if Expr2 then Assignment1 else Assignment2
s
E
i
nment else Assignment 12
?
Assignment executes when Expr is true and Expr is false:
Statement
212
Statement
Assignment Assignment
12
if
Expr1
then
Statement else Statement
if Expr2 then Statement Assignment2
Assignment1 The second derivation associates the else
Assignment2 if
Assignment1
Clearly, these two derivations produce different behaviors in the compiled
Assignment2 executes when Expr1 is fals ● This is called the
clause with the first , so that
might be simple.
ambiguo
problem.
● Exercise: Show leftmost derivations for the two parse trees.
Expr2 :
Statement
if Expr then Statement else Statement
30 of 96
u
i
s
f,
so
code.
ambiguously
ond derivation associates the els
dangling else
e, independent of the value of
has two distinct rightmost derivations. The difference between them is
if Expr then if Expr then Assignment else Assignment
1212 simple. The first derivation has Assignment2 controlled by the inner
if, so Assignment2 executes when Expr1 is true and Expr2 is false: has two distinct rightmost derivations. The difference between them is
CFG:Amsimpble.igThueifitrsyt:derEivaxtioenrhcasiAsseignm(2ent.2c)ontrolledbytheinner
2
if, so Assignment2 executes when Expr1 is true and Expr2 is false: (Meaning 1) Assignment2 may be associated with the inner if:
if
if
ExSprtatementthen Statement 1
if Expr then Statement else Statement Expr1 then 2 Statement
Assignment Assignment if Expr2 then Statement 1 else Statement2
Statement
The second derivation associates the else clause with the first if, so that
Assignment2 executes when Expr1 is falsAes,singdnempentd1ent of tAhsesivganlmueenotf2
Expr2 :
(MeaninThge 2sec)oAndsdseriigvantimoneasnsotciamtesatyhebeleseacslasuosecwiaithtethde fiwrsithif,thsoethoatuter if:
2
Assignment2 executes when Expr1 is false, independent of the value of
Expr2 : if
if
Statement
Expr
Expr1
then Statement else Statement
Statement
Assignment Statement2
1
31 of 96
Assignment2 Clearly, these two derivations produce different behaviors in the compiled
then
if if
Expr then Statement Stat2ement else
Assignment1 Expr2 then Statement
CFG: Ambiguity: Exercise (2.3)
● We may remove the ambiguity by specifying that the dangling else is associated with the nearest if:
Statement → if Expr then Statement
∣ if Expr then WithElse else Statement
∣ Assignment
WithElse → if Expr then WithElse else WithElse
∣ Assignment
● When applying : ○ The true branch will be produced via WithElse.
○ The false branch will be produced via Statement.
There is no circularity between the two non-terminals. 32 of 96
if . . . then WithElse else Statement
Discovering Derivations
● GivenaCFGG=(V, Σ, R, S)andaninputprogramp∈Σ∗: ○ So far we manually come up a valid derivation S ⇒∗ p.
○ A parser is supposed to automate this derivation process.
Given an input sequence of (t,c) pairs, where token t (e.g., r241) belongs to some syntactic category c (e.g., register):
Either output a valid derivation (as an AST ), or signal an error .
● In the process of building an AST for the input program:
○ Root of AST: start symbol S of G
○ Internal nodes: A subset of variables V of G
○ Leaves of AST: token sequence input by the scanner
⇒ Discovering the grammatical connections (according to R)
between the root, internal nodes, and leaves is the hard part!
● ApproachestoParsing: [w∈(V∪Σ)∗,A∈V, ∈R] ○ Top-down parsing
For a node representing A, extend it with a subtree representing w. ○ Bottom-up parsing
A→w
33 of 96
For a substring matching w, build a node representing A accordingly.
TDP: Discovering Leftmost Derivation
ALGORITHM: TDParse
INPUT: CFG G=(V, Σ, R, S)
OUTPUT: Root of a Parse Tree or Syntax Error
PROCEDURE:
root := a new node for the start symbol S focus := root
initialize an empty stack trace trace.push(null)
word := NextWord()
while (true):
if focus ∈ V then
if ∃ unvisited rule focus → β1β2 …βn ∈ R then
create β1 , β2 . . . βn as children of focus trace.push(βn βn−1 . . . β2 )
focus := β1
else
if focus = S then report syntax error else backtrack
elseif word matches focus then word := NextWord()
focus := trace.pop()
elseif word = EOF ∧ focus = null then else backtrack
return root
backtrack ≜ pop focus.siblings; focus := focus.parent; focus.resetChildren 34 of 96
TDP: Exercise (1)
● Given the following CFG G:
Expr → Expr + Term ∣ Term
Term → Term * Factor ∣ Factor
Factor → (Expr ) ∣a
Trace TDParse on how to build an AST for input a + a * a. ● Running TDParse with G results an infinite loop !!!
○ TDParse focuses on the leftmost non-terminal. ○ The grammar G contains left-recursions.
● We must first convert left-recursions in G to right-recursions. 35 of 96
TDP: Exercise (2)
● Given the following CFG G:
Expr Expr′
Term Term′
Factor
→ Term Expr′
→ + Term Expr′
∣ε
→ Factor Term′
→ * Factor Term′
∣ε
→ (Expr )
∣a
Exercise. Trace TDParse on building AST for a + a * a.
Exercise. Trace TDParse on building AST for (a + a) * a. Q: How to handle ε-productions (e.g., Expr → ε)?
A: Execute focus := trace.pop() to advance to next node.
● Running TDParse will terminate ∵ G is right-recursive.
● We will learn about a systematic approach to converting
left-recursions in a given grammar to right-recursions. 36 of 96
Left-Recursions (LR): Direct vs. Indirect
GivenCFGG=(V, Σ, R, S),α,β,γ∈(V∪Σ)∗,Gcontains:
○ A cycle if ∃A ∈ V ● A ⇒∗ A
○ AdirectLRifA→Aα∈Rfornon-terminalA∈V e.g., e.g.,
37 of 96
Expr → Expr + Term
∣ Term
Term → Term * Factor
∣ Factor Factor → (Expr )
∣a
○ AnindirectLRifA→Bβ∈Rfornon-terminalsA,B∈V,B⇒∗ Aγ
A→Br,B⇒∗ Atd A→Ba,B⇒∗ Aafd
Expr → ∣
Expr + Term
Expr – Term ∣ Term
Term → Term ∣ Term
∣ Factor
* Factor / Factor
A → Ba ∣ b
B → Cd ∣ e
C → Df ∣ g
D → f ∣ Aa ∣ Cg
A → Br B → Cd C → At
TDP: (Preventively) Eliminating LRs
1 2 3 4 5 6 7 8 9
10 11 12 13 14
L9 to L11: Remove indirect left-recursions from A1 to Ai−1. L12 to L13: Remove direct left-recursions from A1 to Ai−1. Loop Invariant (outer for-loop)? At the start of ith iteration:
○ No direct or indirect left-recursions for A1,A2,…,Ai−1.
○ Moreprecisely:∀k∶k