程序代写代做代考 computer architecture compiler Java Haskell assembler algorithm Compilers and computer architecture: Parsing

Compilers and computer architecture: Parsing
Martin Berger
October 2015

Recall the function of compilers

Recall we are discussing parsing
Source program
Lexical analysis
Intermediate code generation
Optimisation
Syntax analysis
Semantic analysis, e.g. type checking
Code generation
Translated program

Key steps
Remember we need:
􏹩 Specifythesyntaxofthelanguage.
􏹩 Haveanalgorithmthatdecidesthelanguageandreturns an AST.

Key steps
􏹩 CFGsasmechanismtospecifysyntaxwithrecursive structure G = (A,V,I,→).
􏹩 V variables, A alphabet.
􏹩 I ∈ V initial variable.
􏹩 TransitionsX →σwhereX ∈A,σ∈(A∪V)∗
􏹩 IfX →σthenαXβ⇒ασβ.
􏹩 ThelanguageofGis{σ∈V∗|I⇒···⇒σ}.
􏹩 We call each I ⇒ ··· ⇒ σ a derivation.

Parser
A parser (in its simplest form) for a CFG G takes as input a string/token-list, and returns true/false depending on whether the string is in the language of G or not, and, at the same time, build an AST representing the structure of the input.
Key concepts: derivations and parse trees. The latter can be seen as a 2D representation of the former. It is very close to ASTs (but contains some redundant information like brackets). When constructing the AST we drop those, and only keep stuff that’s needed for later phases of the compiler.

Parser
A parser is an algorithm that takes as input a list (e.g. of tokens) and (1) decides if the list is a member of the language defined by the ambient CFG, and (2) does so by building up the parse tree (AST).
There are two approaches to parsers:
􏹩 Topdownorpredictive(wewillstudytherecursive descent algorithm).
􏹩 Bottomup(alsoknownasshift-reducealgorithms)

Two approaches to parsers
􏹩 Topdown.
􏹩 Good: conceptually easy, can be hand-written, powerful.
􏹩 Bad: Can be slow. Can’t deal with left-recursing CFGs (see
later), but left-recursive grammars can be converted automatically to non-left-recursive CFGs.
􏹩 Bottomup.
􏹩 Good: Fast, can deal with all (non-ambiguous) CFGs.
􏹩 Bad: Complicated, hard to write by hand.

Top down parsing: intuition for the decision problem
You look at the current string and go through all rules starting with a variable in the string (say leftmost) if the input can be used with one of the rules, if it matches.
If there’s a matching rule, recursively process the rest of the string. (To start, we must rewrite the initial variable.)
If you ’consume’ the whole string, you are finished.
If you get stuck, backtrack.
If you have exhaused all rules without success, reject the string.

Example top down parsing
P → beginQ P → prog
Q → end Q→P;Q
Let P be the initial variable. Note: grammar not ambiguous!

Example top down parsing
P → beginQ P → prog
Q → end Q→P;Q
Let P be the initial variable. Note: grammar not ambiguous! Example string: begin prog; prog; end

Example top down parsing
P → beginQ P → prog
Q → end Q→P;Q
Let P be the initial variable. Note: grammar not ambiguous! Example string: begin prog; prog; end
Slogan for top-down parsing: “Starting from the initial variable, search for a rule which rewrites the variables to yield characters from the alphabet consistent with the input.”

Example top down parsing
In class.

Example top down parsing
P
CODE: begin prog; prog; end PARSED:

Example top down parsing
P
CODE: prog; prog; end PARSED: begin
begin
Q

Example top down parsing
P
begin Q P;Q
CODE: prog; prog; end PARSED: begin

Example top down parsing
P
begin Q P;Q
prog
CODE: ; prog; end PARSED: begin prog

Example top down parsing
P
begin Q P;Q
prog
CODE: prog; end PARSED: begin prog;

Example top down parsing
P
begin Q P;Q
CODE: prog; end PARSED: begin prog;
prog
P;Q

Example top down parsing
P
begin Q P;Q
CODE: ; end PARSED: begin prog; prog
prog
P;Q prog

Example top down parsing
P
begin Q P;Q
CODE: end PARSED: begin prog; prog;
prog
P;Q prog

Example top down parsing
P
begin Q CODE:
P;Q
PARSED: begin prog; prog; end
prog
P;Q prog end

Summary of approach we used
􏹩 Westartwiththefullinput,andtheinitialvariable.
􏹩 Once,theparsingprocessisgoing,weassumethecurrent
input is derived from some variable (let’s call it X)
􏹩 ExamineeachalternativetransitionforX
􏹩 Comparefirst(ormore)unmatchedinputtokenwithfirst symbol on RHS of each alternative transition for X
􏹩 If a matching transition is found (e.g. begin) use it to rewrite X, and remove the matched stuff from the input string.
􏹩 Repeat, using next input token to determine the transition to be used for the next variable
􏹩 If no match, try a different transition. If nothing matches, backtrack.
􏹩 Ateachstep,oneofthetransitionswaschosen,andused from left-to-right, to replace a variable in the parse tree by a RHS.

Top down parsing pseudo code
Recall our grammar.
P → beginQ P → prog
Q → end Q→P;Q

Top down parsing pseudo code
Recall our grammar.
P → beginQ P → prog
Q → end Q→P;Q
First we define our tokens in pseudo-code (for more complicated languages, tokens may carry additional information).
interface Token
class T_begin () implementes Token
class T_end () implementes Token
class T_prog () implementes Token
class T_semicolon () implementes Token

Top down parsing pseudo code
Recall our grammar.
P → beginQ P → prog
Q → end Q→P;Q
We use methods, one for each variable (can be programmed in other ways)
def parseQ( tl : List [Token] ) : List [Token] = …
def parseP( tl : List [Token] ) : List [Token] = …

Top down parsing pseudo code
We use methods, one for each variable (can be programmed in other ways)
def parseQ( tl : List [Token] ) : List [Token] = …
def parseP( tl : List [Token] ) : List [Token] = …
Each method does the following:
􏹩 Consume(eat)asmuchoftheinputaspossibleaccording to grammar.
􏹩 Indicateifnoinputcanbeparsed.
􏹩 Ifsomeinputcanbeparsed,butnotall,returnthe leftovers.

Parsing P transitions
Recall our grammar.
P → beginQ P → prog
Q → end Q→P;Q
def parseP ( tl : List [ Token ] ) : List [ Token ] =
if tl is of form
T_begin :: rest then parseQ ( rest )
T_prog :: rest then rest
else “rule doesnt apply”
Here x::l is short for a list with first element x and rest l.

Parsing Q transitions
Recall our grammar.
P → beginQ P → prog
Q → end Q→P;Q
def parseQ ( tl : List [ Token ] ) : List [ Token ] =
if tl is of form
T_end :: rest then rest
else {
let tl2 = parseP ( tl )
if tl2 is of form
T_semicolon :: rest2 then parseQ ( rest2 )
else “rule doesnt apply”
In other words: if we have a terminal (token), we remove it from the input. If we have a variable, we call the associated parser.

Parsing P & Q transitions
That’s it. No further code needed. That was simple.

Parsing P & Q transitions
That’s it. No further code needed. That was simple. Usage:
let t = List ( T_begin,
T_prog,
T_semicolon,
T_end )
try {
val result = parseP ( t )
if ( result.size <= 0 ) println ( "success" ) else println ( "failure" ) } catch { case e : Exception => println ( “failure” ) }

Top down parsing
This was simple. The approach can be refinded massively, leading to combinator parsing, where you can more or less write a grammar, and it’s a valid Haskell or Scala program. In Java, this is difficult due to lacking expressivity. Combinator parsers are extremely elegant in my opinion.
However, we’ve omitted various issues.

Constructing an AST during parsing
Usually we don’t just want to decide if an input string is in the language defined by the ambient CFG. Instead we want to build up something, e.g. an AST. This is quite easy. Here’s a pseudo-code example for the grammar we’ve been looking at.
interface Token
class T_begin implements Token
class T_end implements Token
class T_prog ( s : String ) implements Token
class T_semicolon implements Token
interface AST
class Empty implements AST
class Prog implements AST
class Sequence ( lhs : AST, rhs : AST ) implements AS

Constructing an AST during parsing
We use options None and Some(…) to indicate success or absence of success in parsing. (See java.util.Optional in Java.) In case of success, we return the AST that has been constructed and the remaining tokens:
Some ( ( ast, rest ) )
Otherwise we return None. Many languages offer this, in Java you can use an interface Option with implentations None and Some( … ) to do this.

Constructing an AST during parsing
def parseP ( tl : List [ Token ] ) : Result =
if tl is of form
T_begin :: rest then parseQ ( rest )
T_prog :: rest then
Some ( new Prog (), rest ) )
else None }

Constructing an AST during parsing
def parseQ ( tl : List [ Token ] ) : Result =
if tl is of form
T_end :: rest then Some( ( Empty, rest ) )
else {
if parseP ( tl ) is of form
None then None
Some( ( astL, T_semicolon :: rest ) ) then
if parseQ ( rest ) is of form
None then None
Some( ( astR, rest2 ) ) then
Some(Sequence(astL, astR), rest2)) }}

Question: what’s the difference between parse trees and ASTs?
4 * (3 + 17)
4
*
4+
E E*E
E 3
(E)
+E 317
17

Question: what’s the difference between parse trees and ASTs?
Parse trees contain redundant information, e.g. brackets. Moreover, subsequent stages don’t care about e.g. variable names, so we can drop them.
4 * (3 + 17)
4
*
4+
E E*E
E 3
(E)
+E 317
17

What happens if we choose the wrong transition?
Several options.

What happens if we choose the wrong transition?
Several options.
􏹩 Backtracking:iftryalltransitionsuntilonematches.Ifyou can’t find a matching transition at all, then signal parsing failure. Simple, but might lead to slow parsing as we may have to abandon partially parsed strings.
􏹩 Changethegrammarsuchthatitisalwayspossibleto determine which transition to use (if the string is parsable at all). This is not always easy, but leads to fast parsers (and to LL(k) and LR(k) CFGs).

What happens if we choose the wrong transition?
Consider the following grammar:
P → loopQuntilE P → loop Q forever
…
If our input string is loop x:=x+1; y:=y-1; forever
…, our parser looks just at the leftmost token and it starts with the upper transition for P, it will have to reparse x:=x+1; y:=y-1;. That’s wasteful. In this case, we can rewrite the grammar so it is always possible without recursion to determine which transition to use.
P → loopQR R → until E R → forever
…

Left-recursion
Recall our grammar.
P → beginQ P → prog
Q → end Q→P;Q
and its implementation in a simple top-down parser:
def parseP( tl : List[Token] ) : List[Token] = {
if tl is of the form
T_begin :: rest then parseQ ( rest )
T_prog :: rest then rest
else “no matching transition”

Left-recursion
Let’s look at a grammar for expressions E→E+E|E∗E| −E|(E)|0|1|…
and its implementation in a simple top-down parser:

Left-recursion
E→E+E|E∗E| −E|(E)|0|1|…
def parseE ( tl : List [ Token ] ) : List [ Token ] = {
if tl is of form
T_leftBracket :: rest then …
T_minus :: rest then parseE ( rest )
T_int ( n ) :: rest then rest
else {
val tryAdd = parseE ( parse+ ( parseE ( tl ) ) )
if ( parsing successful … )
tryAdd else {
val tryMul = parseE ( parse* ( parseE ( tl ) ) )
if ( parsing successful … )
tryMul else
‘‘[parseE] no matching transition’’ )

Left-recursion
def parseE ( tl : List [ Token ] ) : List [ Token ] = {
if tl is of form
… then {
val tryAdd = parseE ( parse+ ( parseE ( tl ) ) )
if ( tryAdd.size > 0 )
tryAdd else {
…
The parser doesn’t terminate!
parseE(2 + 3) → parseE(2 + 3) → …

Left-recursion
The problem is that the grammar E→E+E|E∗E| −E|(E)|0|1|…
is left-recursive! Unlike
Why?
P → beginQ P → prog
Q → end Q→P;Q

Left-recursion
The problem is that the grammar E→E+E|E∗E| −E|(E)|0|1|…
is left-recursive! Unlike
P → beginQ P → prog
Q → end Q→P;Q
Why? because ’eating’ begin shortens the argument for the recursive call. More generally, a grammar is left-recursive if
we can find a variable N and a string σ such that N → ··· → Nσ

Removing left-recursion
It is possible algorithmically to remove left-recursion from a grammar by introducing new variable. Here is an example.
P → Pwoof|baaa What is the language of this CFG?

Removing left-recursion
It is possible algorithmically to remove left-recursion from a grammar by introducing new variable. Here is an example.
P → Pwoof|baaa
What is the language of this CFG? Strings of the form
baaa woof woof … woof We can rewrite this as:
P → baaaQ
Q → woofQ|ε
Here Q is a new variable, and ε the empty string. Now every ’recursive call’ needs to ’chew off’ an initial terminal. Hence recursion terminates.

Removing left-recursion
A more non-trivial example with two instances of left-recursion.
E → E+T|E−T|T T → T∗F|T/F|F
F → 0|1|…

Removing left-recursion
A more non-trivial example with two instances of left-recursion.
E → E+T|E−T|T T → T∗F|T/F|F
F → 0|1|…
E′ → +TE′|−TE′|ε rewrite ′
E → TE′
⇒ T → FT
T′ → ∗FE′|/FT′|ε
F → 0|1|…

Removing left-recursion
A more non-trivial example with two instances of left-recursion.
E → E+T|E−T|T T → T∗F|T/F|F
F → 0|1|…
E′→ +TE′|−TE′|ε rewrite ′
With this new grammar, a top-down parser will
􏹩 terminate(becauseallleft-recursionhasbeenremoved) 􏹩 backtrackonsome(more)inputs
E → TE′
⇒ T→ FT
T′→ ∗FE′|/FT′|ε
F → 0|1|…

Removing left-recursion
Every left-recursive grammar G can be mechanically transformed into a grammar G′ that is not left-recursive and accepts the same language.
It’s a fun and enlightening exercise to write implement this algorithm.

Bottom-up parsing
􏹩 Intop-downparsingthegrammarisusedleft-to-right:in trying to match against a variable, each of the possible RHSs are tried in order to extend the parse tree correctly.
􏹩 Inbottom-upparsing,theinputiscomparedagainstall the RHSs, to find where a string can be replaced by a variable by using a transition from right- to-left. Parsing succeeds when the whole input has been replaced by the initial variable.

Bottom-up parsing
􏹩 Bottom-upparsersaresomewhatcomplicatedtoconstruct by hand.
􏹩 However,theycanbeconstructedautomaticallybyparser generators.
􏹩 Thisisoftenthemostpracticalwaytobuildaparserfora non-trivial language.

Parser generators
Parsers can be generated by parser generators (e.g. JavaCC, CUPS, Yacc, Bison). Parsers generated by such tools are likely to be better (e.g. faster, better error handling) than hand-written parsers
Parser generators usually provide precedence declarations to handle ambiguity (or they produce parsers that return all possible parse trees). They also handle left-recursion. They are available for most programming languages.
CFG
Description of AST construction
Parser generator
Token stream Parser AST

Using parser generators
There are many parser (and lexer) generators. To use them properly, you will have to read the manual. However, often you
preamble
———- boundary ———-
grammarRule action
…
grammarRule action
In the preamble you are setting up the parser, e.g. saying that the name and type of the parsing function to be produced should be, what the type of the input is, what the initial variable is etc.

Using parser generators
With each rule
grammarRule action
we must specify what should happen when we encounter input that can be parsed in accordance with the rule. For example a rule E → E + E could be handled something like this
E ::= E TokenPlus E { return new ASTPlus($1,$2); }
Here E ::= E TokenPlus E is the rendering of the rule E → E + E in the language of the generator. The red stuff is the action in the target language (e.g. Java).
The $1 $2 allow us recursively to access the results from parsing the left $1 and right E $2 in E + E.

Using parser generators
So an grammar/action like
E ::= E TokenPlus E { return new ASTPlus($1,$2); }
says the following: whenever the input matches E + E then return a fresh object of class ASTPlus, and the constructor gets as first argument whatever matches the left E and as second argument whatever fits the right E.
So the input 3 + 4 might lead to the parser an object generated like so.
new ASTPlus ( new IntLiteral ( 3 ),
new IntLiteral ( 4 ) )
(Here we assume that parsing an integer yields objects of class IntLiteral.)

Using parser generators
Note that you can embed arbitrary code in an action:

Using parser generators
Note that you can embed arbitrary code in an action:
E ::= E TokenPlus E { print( “Let’s go to Mars” ) }

Using parser generators
Note that you can embed arbitrary code in an action:
E ::= E TokenPlus E { print( “Let’s go to Mars” ) }
The parser generator will simply generate a parser and add the action to handle the result of of successful parses. Note that parser generators typically don’t check the action for syntatic correctness, instead it simply embeds the actions in the code it produces. That means the generated parse may not compile, or may produce non-sense. (And likewise for lexer generators.)

Using parser generators
It is the programmer’s responibility to compile the code generated by the parser generator and integrate it with the rest of the compiler, i.e. feed the lexer’s output to the generated parser, and feed the generated perser’s output to the compiler’s backend.
Description of RE Token list construction
Lexer generator
User compilers
String Lexer
Description of CFG AST construction
Token list
Parser generator
User compilers
Parser
AST
Backend (Type check, code generation)
Assembler

Conclusion
We use CFGs to specify the syntax of programming languages (after giving the lexical details using regular expressions/FSAs).

Conclusion
We use CFGs to specify the syntax of programming languages (after giving the lexical details using regular expressions/FSAs).
Parsers for a CFG are algorithms that decide if an input string is in the language of the CFG, and at the same time can build up ASTs.
Ambiguity means some string can be parsed in more than one way, this must be avoided in some way or other.
Parsers are either top-down (easy to write but require absence of left recursion) or bottom-up.

The material in the textbooks
􏹩 Introductiontoparsing
􏹩 EaC Chapter 1
􏹩 Dragon book Ch.s 1 and 2.
􏹩 Appel & Palsberg Ch. 1
􏹩 Generalgrammarissues,top-downparsing
􏹩 EaC Chapter 3 sections 3.1-3.3
􏹩 Dragon Book pp.42, pp.60 and Ch. 4
􏹩 Appel & Palsberg Ch. 3.
􏹩 Parser-generators
􏹩 EaC Section 3.5
􏹩 Dragon Book Ch. 4 pp.287
􏹩 Appel & Palsberg Ch. 3 pp.68

Related Posts