Defining syntax using CFGs
Roadmap Last time
– Defined context-free grammar This time
– CFGs for specifying a language’s syntax • Language membership
• List grammars
• Resolving ambiguity
• G = (N,Σ,P,S)
• ⇒!means“derivesin
1 or more steps”
• CFGgeneratesa string by applying productions until no non-terminals remain
Example: Nested parens N={Q}
Σ = { ( , ) }
P=Q→(Q)
S = Q | ε
CFG Review
Formal Definition of a CFG’s Language Let G = (N,Σ,P,S) be a CFG. Then
L(G) = 𝑤 𝑆 ⇒! 𝑤 where
S is the start nonterminal of G, and w is a sequence that consists of
(only) terminal symbols or 𝜀
A CFG Defines a Language CFG productions define the syntax of a language
1.
2. 3. 4. 5. 6.
Prog → begin Stmts end Stmts → Stmts semicolon Stmt
| Stmt
Stmt → id assign Expr
Expr →id
| Expr plus id
We call this notation “BNF” (for “Backus-Naur Form”) or “extended BNF”
HTTP grammar using BNF:
– http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html
List Grammars
• Useful to repeat a structure arbitrarily often
Stmts → Stmts semicolon Stmt | Stmt
Stmts
Stmts
Stmts
;
Stmt
Stmts
;
Stmt
List skews left
Stmts
;
Stmt
Stmts
…
Stmts
;
Stmt
List Grammars
• Useful to repeat a structure arbitrarily often Stmts → Stmt semicolon Stmts | Stmt
Stmts
Stmt
;
Stmts
Stmt
;
Stmts
List skews right
Stmts
Stmt
;
Stmts
Stmts
Stmt
;
Stmts
List Grammars
• What if we allowed both “skews”? Stmts → Stmts semicolon Stmts | Stmt
Stmts
;
Stmts
Stmts
;
Stmts
Stmts
Stmts
;
Stmts
;
Stmts
Stmt
Stmts
• •
1.
2. 3. 4. 5. 6.
Derivation Order
Leftmost Derivation: always expand the leftmost nonterminal Rightmost Derivation: always expand the rightmost nonterminal
Prog
Prog → begin Stmts end Stmts → Stmts semicolon Stmt
| Stmt
Stmt → id assign Expr
Expr →id
| Expr plus id
Leftmost expands this nonterminal
Rightmost expands this nonterminal
begin
Stmts
end
Stmts
semicolon
Stmt
Ambiguity
Even with a fixed derivation order, it is possible to derive the same string in multiple ways
For Grammar G and string w
–G is ambiguous if
• >1 leftmost derivation of w • >1 rightmost derivation of w •>1parsetreeforw
Exercise
• Give a grammar G and a word w that has more than 1 left-most derivation in G
Example: Ambiguous Grammars
Expr → intlit
| Expr minus Expr
| Expr times Expr
| lparen Expr rparen
Derive the string 4 – 7 * 3 (assume tokenization)
Parse Tree 2
Parse Tree 1
Expr
Expr
Expr
4
3 7
734
Expr
Expr
minus
Expr
Expr
times
Expr
minus
times
Expr
intlit
Expr
intlit
intlit
intlit
intlit
intlit
Why is Ambiguity Bad?
Why is Ambiguity Bad? Eventually, we’ll be using CFGs as the basis for our parser
– Parsing is much easier when there is no ambiguity in the grammar
– The parse tree may mismatch user understanding! 4-7*3
Operator precedence
Expr
Expr
Expr
4
3 7
734
Expr
Expr
minus
Expr
Expr
times
Expr
minus
times
Expr
intlit
Expr
intlit
intlit
intlit
intlit
intlit
Resolving Grammar Ambiguity: Precedence Intuitive problem
Expr → intlit
| Expr minus Expr
| Expr times Expr
| lparen Expr rparen
– Nonterminals are the same for both operators
To fix precedence
– 1 nonterminal per precedence level
– Parse lowest level first
Resolving Grammar Ambiguity: Precedence
Expr → intlit
| Expr minus Expr
| Expr times Expr
| lparen Expr rparen
lowest precedence level first
1 nonterminal per precedence level
Derive the string 4 – 7 * 3
Expr
Expr
minus
Expr
Expr → Expr minus Expr | Term
Term → Term times Term | Factor
Factor → intlit
| lparen Expr rparen
4
Term
Term
Factor
Term
times
Term
intlit
Factor
Factor
intlit
7
3
intlit
Resolving Grammar Ambiguity: Precedence
Fixed Grammar
Expr → expr minus expr
| Term
Term → Term times Term
| Factor Factor → intlit
| lparen Expr rparen
Derive the string 4 – 7 * 3
Let’s try to re-build the wrong parse tree
Expr
Term
Expr
Term
times
Term
Factor
Expr
Term
minus
Expr
We’ll never be able to derive minus without parens
intlit
3
Term
times
Factor
Factor
intlit
Term
4
7
3
intlit
Term
Factor
intlit
Fixed Grammar
Expr → Expr minus Expr
| Term
Term → Term times Term
| Factor Factor → intlit
| lparen Expr rparen
NO!
Derive the string 4 – 7 – 3
Did we fix all ambiguity?
Expr
Expr
minus
Expr
Expr
minus
Term
Expr
Factor
Term
Term
intlit
Factor
Factor
intlit
intlit
These subtrees could have been swapped!
Where we are so far
Precedence
– We want correct behavior on 4 – 7 * 9
– A new nonterminal for each precedence level
Associativity
– We want correct behavior on 4 – 7 – 9
– Minus should be left associative: a – b – c = (a – b) – c – Problem: the recursion in a rule like
Expr →ExprminusExpr
Definition: Recursion in Grammars
• A grammar is recursive in (nonterminal) X if
𝑋 ⇒! α𝑋γ for non-empty strings of symbols α and γ
• A grammar is left-recursive in X if
𝑋 ⇒! 𝑋γ for non-empty string of symbols γ
• A grammar is right-recursive in X if
𝑋 ⇒! α𝑋 for non-empty string of symbols α
Resolving Grammar Ambiguity: Associativity
Recognize left-assoc operators with left-recursive productions Recognize right-assoc operators with right-recursive productions
Term
Expr → Expr minus Expr
Example: 4 – 7 – 9
E
| Term
Term → Term times Term
| Factor
Factor → intlit | lparen Expr rparen
Factor
E
E
–
T
–
T
F
T
F
intlit
F
intlit
9
7
intlit
4
Resolving Grammar Ambiguity: Associativity
Expr → Expr minus Term | Term
Term → Term times Factor | Factor
Factor → intlit | lparen Expr rparen
Example: 4 – 7 – 9
Let’s try to re-build the wrong parse tree again
E
E
–
T
T
F
intlit
4
We’ll never be able to derive minus without parens
Example
• LanguageofBooleanexpressions
– bexp → TRUE
bexp → FALSE
bexp → bexp OR bexp
bexp → bexp AND bexp
bexp → NOT bexp
bexp → LPAREN bexp RPAREN
• AddnonterminalssothatORhaslowest precedence, then AND, then NOT. Then change the grammar to reflect the fact that
both AND and OR are left associative.
• Drawaparsetreefortheexpression: – true AND NOT true.
Stmt →
Consider this word in this grammar: if a then if b then s else s2
How would you derive it?
Another ambiguous example
if Cond then Stmt |
if Cond then Stmt else Stmt | …
Summary
To understand how a parser works, we start by understanding context-free grammars, which are used to define the language recognized by the parser.
terminal symbol
– (non)terminal symbol
– grammar rule (or production)
– derivation (leftmost derivation, rightmost derivation) – parse (or derivation) tree
– the language defined by a grammar
– ambiguous grammar