Parser: Syntactic Analysis Readings: EAC2 Chapter 3
EECS4302 M: Compilers and Interpreters Winter 2020
CHEN-WEI WANG
Context-Free Languages: Introduction
● We have seen regular languages :
○ Can be described using finite automata or regular expressions.
○ Satisfy the pumping lemma.
● Languages with a recursive structure are provably non-regular.
e.g.,{0n1n n≥0}
Context-free grammars (CFG’s) are used to describe strings
● ●
3 of 96
that can be generated in a recursive fashion. Context-free languages (CFL’s) are:
○ Languages that can be described using CFG’s.
○ A proper superset of the set of regular languages.
Parser in Context
○ Recall:
Lexical Analysis Source Program
(seq. of characters)
Scanner
Syntactic Analysis seq. of tokens
Parser
Semantic Analysis
AST1 … ASTn
pretty printed
Target Program
○ Treats the input programas as a a sequence of classified tokens/words
○ Applies rules parsing token sequences as
abstract syntax trees (ASTs) [ syntactic analysis ]
○ Upon termination:
● ReportstokensequencesnotderivableasASTs ● ProducesanAST
○ No longer considers every character in input program.
○ token sequences constitute a
Derivable
context-free language (CFL)
.
2 of 96
CFG: Example (1.1)
● The language that we previously proved as non-regular {0n#1n n≥0}
can be described using the following grammar :
A → 0A1 A→B B→#
● A grammar contains a collection of substitution or production rules, where:
4 of 96
○ A terminal is a word w ∈ ⌃∗ (e.g., 0, 1, etc.).
○ A variable or non-terminal is a word w ∈ ⌃∗ (e.g., A, B, etc.).
○ A start variable occurs on the LHS of the topmost rule (e.g., A).
CFG: Example (1.2)
● Given a grammar, generate a string by:
1. Write down the start variable.
2. Choose a production rule where the start variable appears on the
LHS of the arrow, and substitute it by the RHS. 3. There are two cases of the re-written string:
3.1 It contains no variables, then you are done.
3.2 It contains some variables, then substitute each variable using the
relevant production rules. 4. Repeat Step 3.
● e.g., We can generate an infinite number of strings from
○ A⇒B⇒#
○ A⇒0A1⇒0B1⇒0#1
○ A⇒0A1⇒00A11⇒00B11⇒00#11
○ … 5 of 96
A → 0A1 A→B B→#
CFG: Example (2)
Design a CFG for the following language: {ww∈{0,1}∗∧w is a palidrome}
e.g., 00, 11, 0110, 1001, etc.
P→✏ P→0 P→1
P → 0P0 P → 1P1
7 of 96
CFG: Example (1.2)
Given a CFG, the derivation of a string can be shown as a parse tree .
e.g., The derivation of 000#111 has the parse tree
6 of 96
CFG: Example (3)
Design a CFG for the following language: {wwR w∈{0,1}∗}
e.g., 00, 11, 0110, etc.
P→✏
P → 0P0 P → 1P1
8 of 96
CFG: Example (4)
Design a CFG for the set of binary strings, where each block of 0’s followed by at least as many 1’s.
e.g., 000111, 0001111, etc.
● We use S to represent one such string, and A to represent
each such block in S.
S → ✏ {BC of S}
S → AS {RC of S}
A → ✏ {BC of A}
A → 01 {BC of A}
A → 0A1 {RC of A: equal 0’s and 1’s} A → A1 {RCofA:more1’s}
9 of 96
CFG: Example (5.2) Version 1
Expression
→ IntegerConstant
-IntegerConstant BooleanConstant BinaryOp
UnaryOp
( Expression )
IntegerConstant → Digit
Digit IntegerConstant
Digit → 0123456789 BooleanConstant → TRUE
FALSE 11 of 96
CFG: Example (5.1) Version 1
Design the grammar for the following small expression language, which supports:
● Arithmetic operations: +, -, *, /
● Relational operations: >, <, >=, <=, ==, /=
● Logical operations: true, false, !, &&, ||, =>
Start with the variable Expression. ● There are two possible versions:
1. All operations are mixed together. [e.g., (1 + true)false] 2. Relevant operations are grouped together.
10 of 96
Try both!
CFG: Example (5.3) Version 1
BinaryOp →
UnaryOp →
Expression + Expression Expression – Expression Expression * Expression Expression / Expression Expression && Expression Expression || Expression Expression => Expression Expression == Expression Expression /= Expression Expression > Expression Expression < Expression
! Expression
12 of 96
CFG: Example (5.4) Version 1
However, Version 1 of CFG:
○ Parses string that requires further semantic analysis (e.g., type
checking):
e.g.,3 => 6
○ Is ambiguous , meaning that a string may have more than one
ways to interpret it.
e.g., Draw the parse tree(s) for 3 * 5 + 4
13 of 96
CFG: Example (5.6) Version 2
15 of 96
ArithmeticOp →
RelationalOp →
ArithmeticOp + ArithmeticOp ArithmeticOp – ArithmeticOp ArithmeticOp * ArithmeticOp ArithmeticOp / ArithmeticOp (ArithmeticOp) IntegerConstant -IntegerConstant ArithmeticOp == ArithmeticOp ArithmeticOp /= ArithmeticOp ArithmeticOp > ArithmeticOp ArithmeticOp < ArithmeticOp LogicalOp && LogicalOp LogicalOp || LogicalOp LogicalOp => LogicalOp
! LogicalOp (LogicalOp) RelationalOp BooleanConstant
LogicalOp
→
CFG: Example (5.5) Version 2
Expression → ArithmeticOp RelationalOp
LogicalOp
( Expression )
IntegerConstant → Digit
Digit IntegerConstant
Digit → 0123456789 BooleanConstant → TRUE
FALSE 14 of 96
CFG: Example (5.7) Version 2
However, Version 2 of CFG:
○ Eliminates some cases for further semantic analysis:
e.g.,(1 + 2) => (5 / 4) [noparsetree]
○ Still Parses string that might require further semantic analysis :
e.g.,(1 + 2) / (5 – (2 + 3))
○ Is ambiguous , meaning that a string may have more than one
ways to interpret it.
e.g., Draw the parse tree(s) for 3 * 5 + 4
16 of 96
CFG: Formal Definition (1)
● A isa4-tuple(V, ⌃, R, S):
○ V is a finite set of variables. ○ ⌃isafinitesetofterminals. ○ R is a finite set of rules s.t.
[V∩⌃=]
R ⊆ {v → s v ∈ V ∧ s ∈ (V ∪ ⌃)∗}
○ S ∈ V is is the start variable.
● Givenstringsu,v,w∈(V∪⌃)∗,variableA∈V,andarule
A → w:
○ uAv ⇒ uwv menas that uAv yields uwv .
○ u ⇒∗ v means that u derives v, if: ● u=v;or
● u⇒u1 ⇒u2 ⇒⋅⋅⋅⇒uk ⇒v [ayieldsequence] ● GivenaCFGG=(V, ⌃, R, S),thelanguageofG
L ( G ) = { w ∈ ⌃ ∗ S ⇒∗ w } 17 of 96
context-free grammar (CFG)
CFG: Formal Definition (3): Example
● ConsiderthegrammarG=(V,⌃,R,S):
○ R is
Expr → Expr + Term
Term
Term → Term * Factor
○ V={Expr,Term,Factor} ○ ⌃ = {a,+,*,(,)}
○ S=Expr
a
Factor Factor → (Expr )
● Precedence of operators + and * is embedded in the grammar. ○ “Plus” is specified at a higher level (Expr) than is “times” (Term). ○ Both operands of a multiplication (Factor) may be parenthesized.
19 of 96
CFG: Formal Definition (2): Example
● Design the CFG for strings of properly-nested parentheses. e.g., (), ()(), ((()()))(), etc.
Present your answer in a formal manner.
● G=({S},{(,)},R,S),whereRis S→( S )SS✏
● Draw parse trees for the above three strings that G generates. 18 of 96
Regular Expressions to CFG’s
● Recall the semantics of regular expressions (assuming that we do not consider ):
L(✏) = {✏}
L(a) = {a} L(E+F) = L(E)∪L(F) L(EF) = L(E )L(F ) L(E∗) = (L(E))∗
L( (E) ) =
● e.g.,Grammarfor(00+1)∗+(11+0)∗
D → 110 20 of 96
S→AB
A → ✏AC
C → 001
B → ✏BD
L(E )
DFA to CFG’s
● Given a DFA M = (Q, ⌃, , q0, F):
○ Make a variable Ri for each state qi ∈ Q.
○ Make R0 the start variable, where q0 is the start state of M. ○ Add a rule Ri → aRj to the grammar if (qi,a) = qj.
○ AddaruleRi →✏ifqi ∈F.
● e.g., Grammar for
R1 → 0R01R1✏ 21 of 96
11 0
s0: even 0’s
s1: odd 0’s
R0 → 1R00R1
0
CFG: Rightmost Derivations (1)
Expr → Expr + Term Term Term → Term * Factor Factor Factor → (Expr)a
○ Unique rightmost derivation for the string a + a * a:
Expr ⇒ Expr + Term
⇒ Expr + Term * Factor ⇒ Expr+Term*a
⇒ Expr + Factor * a
⇒ Expr+a*a
⇒ Term+a*a
⇒ Factor+a*a ⇒a+a*a
○ This rightmost derivation suggests that a * a is the right operand of +.
23 of 96
CFG: Leftmost Derivations (1)
Expr → Expr + Term Term Term → Term * Factor Factor Factor → (Expr)a
○ Unique leftmost derivation for the string a + a * a:
Expr ⇒ Expr + Term ⇒ Term + Term
⇒ Factor + Term
⇒ a + Term
⇒ a + Term * Factor ⇒ a + Factor * Factor ⇒ a+a*Factor ⇒a+a*a
○ This leftmost derivation suggests that a * a is the right operand of +.
22 of 96
CFG: Leftmost Derivations (2)
Expr → Expr + Term Term Term → Term * Factor Factor Factor → (Expr)a
○ Unique leftmost derivation for the string (a + a) * a: Expr ⇒ Term
⇒ Term * Factor
⇒ Factor * Factor
⇒ ( Expr ) * Factor
⇒ (Expr+Term)*Factor ⇒ (Term+Term)*Factor ⇒ ( Factor + Term ) * Factor ⇒ (a+Term)*Factor
⇒ (a+Factor)*Factor
⇒ (a+a)*Factor ⇒(a+a)*a
This leftmost derivation suggests that (a + a) is the left 24 of 96operand of *.
CFG: Rightmost Derivations (2)
Expr → Expr + Term Term Term → Term * Factor Factor Factor → (Expr)a
○ Unique rightmost derivation for the string (a + a) * a: Expr ⇒ Term
⇒ Term * Factor
⇒ Term * a
⇒ Factor * a
⇒ (Expr)*a
⇒ (Expr+Term)*a ⇒ (Expr+Factor)*a ⇒ (Expr+a)*a
⇒ (Term+a)*a ⇒ (Factor+a)*a ⇒(a+a)*a
This rightmost derivation suggests that (a + a) is the left 25 of 96operand of *.
CFG: Parse Trees vs. Derivations (2)
● A string w ∈ ⌃∗ may have more than one derivations.
Q: distinct derivations for w ∈ ⌃∗ ⇒ distinct parse trees for w?
A: Not in general ∵ Derivations with distinct orders of variable substitutions may still result in the same parse tree.
● For example:
Expr → Expr + Term Term Term → Term * Factor Factor Factor → (Expr)a
For string a + a * a, the leftmost and rightmost derivations have distinct orders of variable substitutions, but their corresponding parse trees are the same.
27 of 96
CFG: Parse Trees vs. Derivations (1)
○ Parse trees for (leftmost & rightmost) derivations of expressions: a + a * a (a + a) * a
26 of 96
○ Orders in which derivations are performed are not reflected on
parse trees.
CFG: Ambiguity: Definition
Given a grammar G = (V,⌃,R,S):
○ A string w ∈ ⌃∗ is derived ambiguously in G if there exist two or more distinct parse trees or, equally,
two or more distinct leftmost derivations or, equally, two or more distinct rightmost derivations.
Here we require that all such derivations have been completed by following a particular order (leftmost or rightmost) to avoid false alarm.
○ G is ambiguous if it generates some string ambiguously. 28 of 96
fragment
|
if Expr then Statement else Statement | Assignment
code.
| …other statements …
This fragment shows that the else is optional. Unfortunately, the code if Expr1 then if Expr2 then Assignment1 else Assignment2
fragment
CFG: Ambiguity: Exercise (1)
● Is the following grammar ?
Expr →Expr + Expr Expr * Expr ( Expr )a
● Yes ∵ it generates the string a + a * a ambiguously :
● Distinct ASTs (for the same input) mean distinct semantic interpretations: e.g.,
when a post-order traversal is used to implement evaluation
29 of 96
● Exercise: Show leftmost derivation
o
w
The
cla
mmar for a pro-
s
f
r
t
s
sic
ex
h
am
p
e
t
o
p
le of
a
na
mbig
uo
a
us
co
gramming language is the if-then-else construct of many Algol-like languages. The straightforward grammar for if-then-else might be
r
s
ns
e
t
r
truc
t
in
th
e
e
eg
s
ra
.
ambiguous
has two distinct rightmost derivations. The difference between them is if Expr1 then if Expr2 then Assignment1 else Assignment2
simple. The first derivation has Assignment2 controlled by the inner if, so Assignment2 executes when Expr1 is true and Expr2 is false:
has two distinct rightmost derivations. The difference between them is
CFG:Amsimpble.igThueifitrsyt:derEivaxtioenrhcasiAsseignm(2ent.2c)ontrolledbytheinner 2
if, so Assignment2 executes when Expr1 is true and Expr2 is false: (Meaning 1) Assignment2 may be associated with the inner if:
Statement
if Expr1 then Statement
if Expr2 then Statement else Statement
The second derivation associates the else clause with the first if, so that Assignment2 executes when Expr1 is falsAes,singdnempentd1ent of tAhsesivganlmueenotf2 Expr2 :
(MeaninThge 2sec)oAndsdseriigvantimoneasnsotciamtesatyhebeleseacslasuosecwiaithtethde fiwrsithif,thsoethoatuter if: 2
Assignment2 executes when Expr1 is false, independent of the value of
Expr2 :
if Expr1
Statement
2 Expressing Syntax 91
31 of 96
then Statement else
if Expr2 then Statement
Statement
Assignment2
Assignment1
Clearly, these two derivations produce different behaviors in the compiled
1 2 3 4
Statement !
| if Expr then Statement
3.
CFG: Ambiguity: Exercise (2.1)
has tw
The classic example of an ambiguous construct in the grammar for a pro-
gramming language is the if-then-else construct of many Algol-like
languages. The straightforward grammar for if-then-else ● Is the following grammar
2 d 212
1 Statement ! if Expr then Statement else St
2 | if Expr then Statement
3 Statem|enAtssignme→nt if Expr then
4 | …other statements …
This fragment shows that the else is optional. Unfortunately, the code
This fragment shows that the else is optional. Unfortunately, the code 3.2 Expressing Syntax 91
fragment
if Expr1 then if Expr2 then Assignment1 else A
by the Assignment executes when Expr is true and Expr is false:
Statement
if Expr then Statement else Statement
Assignment
fragment
if Expr1 then if Expr2 then Assignment1 else Assignment2
● Yes ∵ it generates the following string :
…
hastwodistinctrightmostderivations.ThedifferencebetweenthemisThesececlausewiththefirstif,sothat simple. The first derivation has Assignment2 controlled by the inner Assignment2 executes when Expr1 is false, independent of the value of
if Expr then if Expr then As
i
nment else Assignment 12
if, so Assignment2 e1xecutes when Expr1 is true and E2xpr2 is false: Statement
if Expr1 then Statement
if Expr2 then Statement else Statement
Expr2 :
● Exercise: Show leftmost derivations for the two parse trees. 30 of 96
s
if
g
E
:
xp
r
2
Statement
ssignment2 between them is
?
inner
dangling else
e, independent of the value of
might be simple.
ambiguo
u
i
f,
so
s
ambiguously
ond derivation associates the els
Expr1
then
Statement else Statement
if Expr2 then Statement Assignment2 Assignment1
Assignment1 Assignment2 The second derivation associates the else clause with the first if, so that
Assignment2 executes when Expr1 is fals ● This is called the
Clearly, these two derivations produce different behaviors in the compiled code.
problem.
Clearly, these two derivations produce different behaviors in the compiled
code.
CFG: Ambiguity: Exercise (2.3)
● We may remove the ambiguity by specifying that the dangling else is associated with the nearest if:
Statement → if Expr then Statement
if Expr then WithElse else Statement
Assignment
WithElse → if Expr then WithElse else WithElse
Assignment
○ The true branch will be produced via WithElse.
There is no circularity between the two non-terminals. 32 of 96
● When applying
○ The false branch will be produced via Statement.
:
if . . . then WithElse else Statement
o distinct rightmost derivations. The difference The first derivation has Assignment controlle
atement
Discovering Derivations
● GivenaCFGG=(V, ⌃, R, S)andaninputprogramp∈⌃∗: ○ So far we manually come up a valid derivation S ⇒∗ p.
○ A parser is supposed to automate this derivation process.
Given an input sequence of (t,c) pairs, where token t (e.g., r241) belongs to some syntactic category c (e.g., register):
Either output a valid derivation (as an AST ), or signal an error .
● In the process of building an AST for the input program: ○ Root of AST: start symbol S of G
○ Internal nodes: A subset of variables V of G
○ Leaves of AST: token sequence input by the scanner
⇒ Discovering the grammatical connections (according to R)
between the root, internal nodes, and leaves is the hard part!
● ApproachestoParsing: [w∈(V∪⌃)∗,A∈V, A→w ∈R] ○ Top-down parsing
For a node representing A, extend it with a subtree representing w. ○ Bottom-up parsing
33 of 96
For a substring matching w, build a node representing A accordingly.
TDP: Exercise (1)
● Given the following CFG G:
Expr → Expr + Term Term
Term → Term * Factor Factor
Factor → (Expr ) a
Trace TDParse on how to build an AST for input a + a * a. ● Running TDParse with G results an infinite loop !!!
○ TDParse focuses on the leftmost non-terminal. ○ The grammar G contains left-recursions.
● We must first convert left-recursions in G to right-recursions. 35 of 96
TDP: Discovering Leftmost Derivation
ALGORITHM: TDParse
INPUT: CFG G=(V, ⌃, R, S)
OUTPUT: Root of a Parse Tree or Syntax Error
PROCEDURE:
root := a new node for the start symbol S focus := root
initialize an empty stack trace trace.push(null)
word := NextWord()
while (true):
if focus ∈ V then
if ∃ unvisited rule focus → 1 2 …