Compilers and computer architecture
From strings to ASTs (2): context free grammars
Martin Berger, 1 October 2019
Email: M.F.Berger@sussex.ac.uk. Office hours: Wed 12-13 in Chi-2R312
Recall we are discussing parsing
[Compiler phases, in order:]

Source program
→ Lexical analysis
→ Syntax analysis
→ Semantic analysis, e.g. type checking
→ Intermediate code generation
→ Optimisation
→ Code generation
→ Translated program
Introduction
Remember, we want to take a program given as a string and:
- Check if it's syntactically correct, e.g. is every opened bracket later closed?
- Produce an AST to facilitate efficient code generation.
[AST of the program below: the root is T_while; its condition child is T_greater with children T_var(n) and T_num(0); its body child is T_semicolon, sequencing T_decrement of T_var(n) and T_update of T_var(res) with T_mult of T_var(res) and T_num(2).]

while( n > 0 ){
    n--;
    res *= 2;
}
We split that task into two phases, lexing and parsing. Lexing throws away some information (e.g. how much whitespace there is) and prepares a token list, which is used by the parser. The token list simplifies the parser, because some detail is not important for syntactic correctness:

if x < 2 + 3 then P else Q

is syntactically correct exactly when

if y < 111 + 222 then P else Q

is.
So, from the point of view of the next stage (parsing), all we need to know is that the input is

T_if T_var T_less T_int T_plus T_int T_then ...

Of course we cannot throw away the names of variables etc. completely, as the later stages (type checking and code generation) need them. They are just irrelevant for syntax checking. We keep them, and our token lists look like this:

T_if T_var ( "x" ) T_less T_int ( 2 ) T_plus ...
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
- Specifying what the syntactically correct programs (token lists) are.
- Checking whether an input program (token list) is syntactically correct according to the specification, and outputting a corresponding AST.
Let’s deal with specification first. What are our options? How about using regular expressions for this purpose?
Alas, not every language can be expressed in these formalisms. Example:

- Alphabet = { '(', ')' }.
- Language = all balanced parentheses: (), ()(), (()), ((()(()())()(()))), ... Note: the empty string is balanced.
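As a quick aside, here is a minimal Java sketch of a recogniser for this language (the class and method names are mine, not from the lecture). Note that it needs a counter that can grow without bound, which is precisely what a finite-state machine cannot provide:

public class Balanced {
    // Returns true iff every '(' is later matched by a ')'.
    static boolean isBalanced(String s) {
        int depth = 0;                    // number of currently unmatched '('
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')') depth--;
            if (depth < 0) return false;  // ')' with no matching '('
        }
        return depth == 0;                // every '(' was closed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(()())"));  // true
        System.out.println(isBalanced("(()"));     // false
    }
}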
FSAs/REs can’t count
Let's analyse the situation a bit more. Why can we not describe the language of all balanced parentheses using REs or FSAs?

Each FSA has only a fixed number (say n) of states. But what if we have more than n open brackets before we hit a closing bracket?

Since there are only n states, by the time we reach the n-th open bracket we must have returned to a state that we already visited earlier, say when we processed the i-th bracket, with i < n. This means the automaton treats i open brackets just as it does n, leading to confusion.

Summary: FSAs can't count, and likewise for REs (why?).
Lack of expressivity of regular expressions & FSAs
Why is it a problem for the syntax analysis of programming languages if REs and FSAs can't count?

Because programming languages contain many bracket-like constructs that can be nested, e.g.

begin ... end
do ... while
if ( ... ) then { ... } else { ... }
3 + ( 3 - (x + 6) )

But we must formalise the syntax of our language if we want a computer to process it. So we need a formalism that can 'count'.
Problem
What we are looking for is something like REs, but more powerful:

regular expression/FSA is to the lexer what ??? is to the parser

Let me introduce you to: context-free grammars (CFGs).
Context free grammars
Programs have a naturally recursive and nested structure. A program is, e.g.:

- if P then Q else Q′, where P, Q, Q′ are programs.
- x := P, where P is a program.
- begin x := 1; begin ... end; y := 2; end

CFGs are a generalisation of regular expressions that is ideal for describing such recursive and nested structures.
Context free grammar
A context-free grammar is a tuple (A, V, Init, R) where

- A is a finite set called the alphabet.
- V is a finite, non-empty set of variables.
- A ∩ V = ∅.
- Init ∈ V is the initial variable.
- R is a finite set of reductions, where each reduction in R is of the form (l, r) such that
  - l is a variable, i.e. l ∈ V.
  - r is a string (possibly empty) over the new alphabet A ∪ V.

We usually write l → r for (l, r) ∈ R.

Note that the elements of the alphabet are often also called terminal symbols, reductions are also called reduction steps, transitions or productions, some people say non-terminal symbol for variable, and the initial variable is also called the start symbol.
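To make the definition concrete, here is one way the tuple (A, V, Init, R) might be rendered as a Java data structure. This is a sketch of mine, not course code; all type names are my invention:

import java.util.List;
import java.util.Set;

// A symbol is either a terminal (element of A) or a variable (element of V).
record Symbol(String name, boolean isVariable) {}

// A reduction l -> r: a variable on the left, a string over A ∪ V on the right.
record Production(Symbol left, List<Symbol> right) {}

record Grammar(Set<Symbol> alphabet,        // A: the terminal symbols
               Set<Symbol> variables,       // V: disjoint from A
               Symbol init,                 // Init ∈ V: the start symbol
               Set<Production> reductions)  // R
{}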
Context free grammar
Example:

- A = {a, b}.
- V = {S}.
- The initial variable is S.
- R contains only three reductions: S → aSb, S → SS, S → ε.

Recall that ε is the empty string. Now the CFG is (A, V, S, R).

This is the language of balanced brackets, with a being the open bracket and b being the closed bracket!
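Using the hypothetical Java sketch from above, this grammar could be written down as the following fragment (again purely illustrative):

// The balanced-brackets grammar, using the Symbol/Production/Grammar
// records sketched earlier.
Symbol a = new Symbol("a", false), b = new Symbol("b", false);
Symbol S = new Symbol("S", true);
Grammar balanced = new Grammar(
    Set.of(a, b), Set.of(S), S,
    Set.of(new Production(S, List.of(a, S, b)),  // S -> aSb
           new Production(S, List.of(S, S)),     // S -> SS
           new Production(S, List.of())));       // S -> ε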
To make this intuition precise, we need to define exactly what the language of a CFG is.
The language accepted by a CFG
The key idea is simple: replace the variables according to the reductions.
Given a string s over A ∪ V, i.e. the alphabet and variables, any occurrence of a variable T in s can be replaced by the string r1 ... rn, provided there is a reduction T → r1 ... rn.

For example, if we have a reduction

S → aTb

then we can rewrite the string

aaSbb

to

aaaTbbb
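This replacement step is easy to express in code. A sketch, reusing the hypothetical Symbol and Production records from above (the class and method names are mine):

import java.util.ArrayList;
import java.util.List;

class Rewriter {
    // Replace the first occurrence of p's left-hand variable in s
    // by p's right-hand side. Returns s unchanged if the variable
    // does not occur in s.
    static List<Symbol> rewriteOnce(List<Symbol> s, Production p) {
        int i = s.indexOf(p.left());
        if (i < 0) return s;
        List<Symbol> result = new ArrayList<>(s.subList(0, i));
        result.addAll(p.right());                  // splice in the right-hand side
        result.addAll(s.subList(i + 1, s.size()));
        return result;
    }
}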
The language accepted by a CFG
How do we start this rewriting of variables? With the initial variable.
When does this rewriting of variables stop? When the string we arrive at by rewriting in a finite number of steps from the initial variable contains no more variables.
Then: the language of a CFG is the set of all strings over the alphabet of the CFG that can be arrived at by rewriting from the initial variable.
Let's do this with the CFG for balanced brackets (A, V, S, R) where

- A = { (, ) }.
- V = {S}.
- The initial variable is S.
- Reductions R are S → (S), S → SS, and S → ε.

S → (S)
  → (SS)
  → ((S)S)
  → (((S))S)
  → (((S))SS)
  → (((S))εS) = (((S))S)
  → (((ε))S) = ((())S)
  → ((())ε) = ((()))
Question: Why / how can CFGs count?
Why / how does the CFG (A, V, S, R) with

S → (S)    S → SS    S → ε

count?

Because only S → (S) introduces new brackets. But by construction it always introduces a closing bracket for each new open bracket.
The language accepted by a CFG: infinite reductions
Note that many CFGs allow infinite reductions: for example, with the grammar from the previous slide we can do this:

S → (S)
  → ((S))
  → (((S)))
  → ((((S))))
  → (((((S)))))
  → ((((((S))))))
  → ...

Such infinite reductions don't affect the language of the grammar. Only sequences of rewrites that end in a string free from variables count towards the language.
The language accepted by a CFG
If you like formal definitions ...
Given a fixed CFG G = (A, V, S, R). For arbitrary strings σ, σ′ ∈ (V ∪ A)∗ we define the one-step reduction relation ⇒, which relates strings from (V ∪ A)∗, as follows. σ ⇒ σ′ if and only if:

- σ = σ1 l σ2, where l ∈ V and σ1, σ2 are strings from (V ∪ A)∗.
- There is a reduction l → γ in R.
- σ′ = σ1 γ σ2.

The language accepted by G, written lang(G), is given as follows:

lang(G) =def { γn | S → γ1 → ··· → γn, where γn ∈ A∗ }

The sequence S → γ1 → ··· → γn is called a derivation. Note: only strings free from variables can be in lang(G).
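The definition suggests a (very inefficient) way to enumerate lang(G): breadth-first search over the one-step reduction relation. Here is a sketch, again my own illustrative code built on the records from earlier; note that it only explores strings up to a length bound, so derivations that temporarily grow past the bound and then shrink via ε-reductions may be missed:

import java.util.*;

class Enumerator {
    // Enumerate strings in lang(g) whose intermediate forms stay
    // within maxLen symbols, by BFS over one-step reductions.
    static Set<List<Symbol>> lang(Grammar g, int maxLen) {
        Set<List<Symbol>> seen = new HashSet<>();
        Set<List<Symbol>> result = new LinkedHashSet<>();
        Deque<List<Symbol>> todo = new ArrayDeque<>();
        todo.add(List.of(g.init()));
        while (!todo.isEmpty()) {
            List<Symbol> s = todo.poll();
            if (s.size() > maxLen || !seen.add(s)) continue;
            if (s.stream().noneMatch(Symbol::isVariable)) {
                result.add(s);            // variable-free: s is in the language
                continue;
            }
            // Rewrite each variable occurrence with each applicable reduction.
            for (int i = 0; i < s.size(); i++) {
                if (!s.get(i).isVariable()) continue;
                for (Production p : g.reductions()) {
                    if (!p.left().equals(s.get(i))) continue;
                    List<Symbol> t = new ArrayList<>(s.subList(0, i));
                    t.addAll(p.right());
                    t.addAll(s.subList(i + 1, s.size()));
                    todo.add(t);
                }
            }
        }
        return result;
    }
}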
Example CFG
Consider the following CFG, where while, if, ;, etc. are elements of the alphabet, and M is a variable.

M → while M do M
M → if M then M
M → M;M
...

If M is the starting variable, then we can derive

M → M;M
  → M; if M then M
  → M; if M then while M do M
  → ...

We do this until we reach a string without variables.
Some conventions regarding CFGs
Here is a collection of conventions for making CFGs more readable. You will see them a lot when programming languages are discussed.

- Variables are CAPITALISED, the alphabet is lower case (or vice versa).
- Variables are in BOLD, the alphabet is not (or vice versa).
- Variables are written in ⟨angle-brackets⟩, the alphabet isn't.
Instead of multiple reductions from the same variable, like

N → r1
N → r2
N → r3

we write

N → r1 | r2 | r3

Instead of

P → if P then P | while P do P

we often write

P, Q → if P then Q | while P do Q

Finally, many write ::= instead of →.
Simple arithmetic expressions
Let’s do another example. Grammar:
E → E+E | E∗E | (E) | 0 | 1 | ...

The language contains:

7
7∗4
7∗4+222
7∗(4+222)
...
A well-known context free grammar
R → ∅ | ε | 'c' | R+R | RR | R∗ | (R)

What's this?

(The syntax of) regular expressions can be described by a CFG (but not by an RE)!
REs vs CFGs
Since regular expressions are a special case of CFGs, could we not do both, lexing and parsing, using only CFGs?
In principle yes, but lexing based on REs (and FSAs) is simpler and faster!
Example: Java grammar
Let’s look at the CFG for a real language:
https://docs.oracle.com/javase/specs/jls/se13/html/jls-19.html
CFGs, what’s next?
Recall we were looking for this:

regular expression/FSA is to the lexer what ??? is to the parser

And the answer was CFGs. But what is a parser?

regular expression/FSA is to the lexer what CFG/??? is to the parser

Regular expressions come with the FSA as the machine that recognises them; we still need the corresponding machine for CFGs.
CFGs allow us to specify the grammar of programming languages. But that's not all we want. We also want:

- An algorithm that decides whether a given token list is in the language of the grammar or not.
- An algorithm that converts the list of tokens (if valid) into an AST.

The key idea for solving both problems in one go is the parse tree.
CFG vs AST
Here is a grammar that you were asked to write an AST for in the tutorials.
P ::= x := e | if0 e then P else P
    | whileGt0 e do P | P; P

e ::= e + e | e - e | e * e | (e) | e % e
    | x | 0 | 1 | 2 | ...
Here's a plausible definition of ASTs for the language: Syntax.java
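Syntax.java is not reproduced in these notes. A plausible sketch of what it might contain (all class names below are my guesses, not the course's actual code):

// One abstract class per grammar variable, one subclass per alternative.
abstract class Program {}
class Assign    extends Program { String x; Expr e; }          // x := e
class If0       extends Program { Expr cond; Program thenBranch, elseBranch; }
class WhileGt0  extends Program { Expr cond; Program body; }
class Seq       extends Program { Program first, second; }     // P; P

abstract class Expr {}
class Var   extends Expr { String name; }
class Num   extends Expr { int value; }
class BinOp extends Expr { String op; Expr left, right; }      // +, -, *, %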
Do you notice something?

They look very similar. The CFG is (almost?) a description of a data type for the AST.

This is no coincidence, and we will use this similarity to construct the AST as we parse, in that we will see the parsing process as a tree. How?
Derivations and parse trees
Recall that a derivation in a CFG (A, V, I, R) is a sequence

I → t1 → t2 → ··· → tn

where tn is free from variables, and each step is governed by a reduction from R.

We can draw each derivation as a tree, called a parse tree. The parse tree tells us how the input token list 'fits into' the grammar, e.g. which reduction we applied when, to 'consume' the input.

- The start symbol I is the tree's root.
- For each reduction X → ⟨y1, ..., yn⟩ we add all the yi as children to the node X.

I.e. nodes in the tree are elements of A ∪ V. Let's look at an example.
Example parse tree
Recall our CFG for arithmetic expressions:

E → E+E | E∗E | (E) | 0 | 1 | ...

Let's say we have the string "4 * 3 + 17". Let's parse this string and build the corresponding parse tree.
[Parse tree: E at the root with children E, +, E; the left child expands to E ∗ E with leaves 4 and 3; the right child is 17.]

E → E+E
  → E∗E+E
  → 4∗E+E
  → 4∗E+17
  → 4∗3+17

Let's do this in detail on the board.
Derivations and parse trees
The following is important about parse trees.
- Terminal symbols are at the leaves of the tree.
- Variable symbols are at the non-leaf nodes.
- An in-order traversal of the tree returns the input string.
- The parse tree reveals the bracketing structure explicitly.
Left- vs rightmost derivation
BUT ... usually there are many derivations for a chosen string, all giving the same parse tree. For example:

[Parse tree: E at the root with children E, +, E; the left child expands to E ∗ E with leaves 4 and 3; the right child is 17.]

E → E+E → E∗E+E → 4∗E+E → 4∗E+17 → 4∗3+17
E → E+E → E∗E+E → E∗3+E → E∗3+17 → 4∗3+17

Canonical choices:

- Left-most: always replace the left-most variable.
- Right-most: always replace the right-most variable.

NB the example derivations above were neither left- nor right-most!

In parsing we usually use either left- or right-most derivations to construct a parse tree.
Question: do left- and right-most derivations lead to the same parse tree?

Answer: for a context-free grammar, YES. It really doesn't matter in what order variables are rewritten. Why? Because the rest of the string is unaffected by rewriting a variable, so we can reorder the rewrites.
Which rule to choose?
Alas, there is a second degree of freedom: which rule to choose?

In contrast to the order of rewriting, it can make a big difference which rule we apply when rewriting a variable.

This leads to an important subject: ambiguity.
Ambiguity
In the grammar

E → E+E | E∗E | (E) | 0 | 1 | ...

the string 4 * 3 + 17 has two distinct parse trees!

[Two parse trees: one with ∗ at the root, grouping the string as 4 ∗ (3 + 17); one with + at the root, grouping it as (4 ∗ 3) + 17.]

A CFG with this property is called ambiguous.
More precisely: a context-free grammar is ambiguous if there is a string in the language of the grammar that has more than one parse tree.

Note that this has nothing to do with left- vs right-derivations. Each of the ambiguous parse trees has both a left- and a right-derivation.
We also have ambiguity in natural language.
Ambiguity in programming languages is bad, because it leaves the meaning of a program unclear: e.g. the compiler should generate different code for 1 + 2 ∗ 3 when it's understood as (1 + 2) ∗ 3 than when it's understood as 1 + (2 ∗ 3).

Can we automatically check whether a grammar is ambiguous?

Bad news: ambiguity of grammars is undecidable, i.e. no algorithm can exist that takes as input a CFG and correctly returns "Ambiguous" or "Not ambiguous" for all CFGs.
There are several ways to deal with ambiguity.

- The parser returns all possible parse trees, leaving the choice to later compiler phases. Example: combinator parsers often do this, as does the Earley parser. Downside: this kicks the can down the road ... we need to disambiguate later (i.e. it doesn't really solve the problem), and it does too much work if some of the parse trees are later discarded.
- Use a non-ambiguous grammar. Easier said than done ...
- Rewrite the grammar to remove ambiguity, for example by enforcing the precedence that * binds more tightly than +. We look at this now.
Ambiguity: grammar rewriting
The problem with

E → E+E | E∗E | (E) | 0 | 1 | ...

is that addition and multiplication have the same status. But in our everyday understanding, we think of a ∗ b + c as meaning (a ∗ b) + c. Moreover, we evaluate a + b + c as (a + b) + c. But there's nothing in the naive grammar that ensures this.

Let's bake these preferences into the grammar.
Let's rewrite

E → E+E | E∗E | (E) | 0 | 1 | ...

to

E → F+E | F
F → N∗F | N | (E)∗F | (E)
N → 0 | 1 | ...

Examples in class.
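As a taster of where this is heading, here is a minimal recursive-descent recogniser for the rewritten grammar, with one method per variable. This is my sketch, not course code: the two pairs of F-alternatives are left-factored, N is extended to multi-digit numbers, and it only accepts or rejects rather than building an AST.

class ExprParser {
    private final String in;
    private int pos = 0;

    ExprParser(String in) { this.in = in; }

    private char peek() { return pos < in.length() ? in.charAt(pos) : '$'; }
    private void expect(char c) {
        if (peek() != c) throw new RuntimeException("expected " + c + " at " + pos);
        pos++;
    }

    // E -> F+E | F
    void parseE() {
        parseF();
        if (peek() == '+') { pos++; parseE(); }
    }

    // F -> N*F | N | (E)*F | (E), left-factored into: (N | (E)) followed by optional *F
    void parseF() {
        if (peek() == '(') { pos++; parseE(); expect(')'); }
        else parseN();
        if (peek() == '*') { pos++; parseF(); }
    }

    // N -> 0 | 1 | ... (one or more digits)
    void parseN() {
        if (!Character.isDigit(peek())) throw new RuntimeException("expected digit at " + pos);
        while (Character.isDigit(peek())) pos++;
    }

    boolean accepts() { parseE(); return pos == in.length(); }

    public static void main(String[] args) {
        System.out.println(new ExprParser("4*3+17").accepts());   // true
        System.out.println(new ExprParser("(1+2)*3").accepts());  // true
    }
}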
If-Then ambiguity
Here is a problem that often arises when specifying programming languages.
M → if M then M
  | if M then M else M
  | ...
Now we can find two distinct parse trees for

if B then if B' then P else Q

[Two parse trees: one where the else belongs to the outer if, reading it as if B then (if B' then P) else Q; one where it belongs to the inner if, reading it as if B then (if B' then P else Q).]
If-Then ambiguity, aka the dangling-else problem

We solved the */+ ambiguity by giving * precedence. At the level of the grammar, that meant we had + coming 'first' in the grammar.

Let's do this for the if-then ambiguity by saying: else always closes the nearest unclosed if, so if-then-else has priority over if-then.

M    → ITE | BOTH
ITE  → if M then ITE else ITE | ...
BOTH → if M then M | if M then ITE else BOTH

Here ITE generates only if-then-else (the grammar's other reductions go into the "..." of ITE), while BOTH generates both if-then and if-then-else and has no other reductions. With this grammar, the else in if B then if B' then P else Q can only attach to the nearest if.
Ambiguity: general algorithm?
Alas, there is no algorithm that can rewrite every ambiguous CFG into an unambiguous CFG with the same language, since some languages are inherently ambiguous, meaning they can only be described by ambiguous CFGs.

Fortunately, such languages are esoteric and not relevant for programming languages. For the languages relevant in programming, it is generally straightforward to produce an unambiguous CFG.

I will not ask you in the exam to convert an ambiguous CFG into an unambiguous one. You should just know what ambiguity means in parsing and why it is a problem.