程序代写代做代考 python Java c/c++ algorithm compiler Syntax

Syntax

COP5556 Programming Language Principles
Parsing 2

Recap of parsing so far:
Specify phrase structure of programming language with context free grammar (CFG) using EBNF notation
CFGs generate sentences
Parsers recognize sentences
Several approaches to parsing
bottom up
top down
Can classify grammars according to the type of parsing algorithm that can be used to parse them.
2

We are primarily interested in (mostly) LL(1) grammars which admit top-down parsing with one-step lookahead.
We defined
FIRST set
FOLLOW set
PREDICT set
Grammar is LL(1) if the PREDICT sets of all productions with the same left hand side are disjoint

Is this grammar LL(1)?

dec ::= modifier type ident ;
dec ::= procedure ident ( formals ) body ;
type ::= int | bool
modifier ::= public | ε
formals ::= ε | type ident ( , type ident )*
4

dec ::= modifier type ident ;
dec ::= procedure ident ( formals ) body ;
type ::= int | bool
modifier ::= public | ε
formals ::= ε | type ident ( , type ident )*
—————————-
PREDICT(dec ::= modifier type ident 😉 = ????

PREDICT(dec ::= procedure ident ( formals ) body ;)=
{ procedure }

8
Same left hand side,
PREDICT sets are disjoint,
so far so good

dec ::= modifier type ident ;
dec ::= procedure ident ( formals ) body ;
type ::= int | bool
modifier ::= public | ε
formals ::= ε | type ident ( , type ident )*
—————————-
PREDICT ( type ::= int ) = { int }
PREDICT ( type ::= bool ) = { bool }

PREDICT ( modifier ::= public ) = { public }
PREDICT ( modifier ::= ε ) =FOLLOW(modifier) = { int, bool }

PREDICT ( modifier ::= public ) = { public }
PREDICT ( modfier ::= ε ) = FOLLOW(modifier) =
{ int, bool }

10
Same left hand side,
PREDICT sets are disjoint,
still on track to be
LL(1)

dec ::= modifier type ident ;
dec ::= procedure ident ( formals ) body ;
type ::= int | bool
modifier ::= public | ε
formals ::= ε | type ident ( , type ident )*
—————————-
PREDICT ( formals ::= type ident ( , type ident )* ) =
{ int, bool }
PREDICT (formals ::= ε ) = FOLLOW(formals) = { ) }

12
Same left hand side,
PREDICT sets are disjoint.

Grammar is LL(1)

dec ::= modifier type ident ; { public, int, bool }
dec ::= procedure ident ( formals ) body ;
{ procedure }
type ::= int {int } | bool { bool }
modifier ::= public { public }
| ε {int, bool }
formals ::= ε { ) }
| type ident ( , type ident )* {int, bool }
——————————————–
Top down parse of int ident ;

dec int ident ;

dec int ident ;
modifier type ident ;

dec int ident ;
modifier type ident ;
ε type ident ;

dec int ident ;
modifier type ident ;
ε type ident ;
int ident ; int ident ;

Summary
property of PREDICT sets tells us that grammar is LL(1)
PREDICT sets are also used during parsing to choose a production

Is this grammar LL(1)?

ident_list ::= ident
ident_list ::= ident_list , ident

Is this grammar LL(1)?

ident_list ::= ident | ident_list , ident

FIRST(ident_list) =
ident U FIRST(ident_list)

PREDICT (ident_list ::= ident) = {ident}
PREDICT (ident_list ::= ident_list , ident)
= FIRST(ident_list) = {ident}
19

Is this grammar LL(1)?

ident_list ::= ident | ident_list , ident

FIRST(ident_list) =
ident U FIRST(ident_list)

PREDICT (ident_list ::= ident) = {ident}
PREDICT (ident_list ::= ident_list , ident)
= FIRST(ident_list) = {ident}
20
If you see left recursion,
the grammar is not LL

We saw that left recursion makes a grammar non-LL(1)
A grammar is left recursive if, for some nonterminal A,
A →+ A σ.
If there is a production of the form
A ::= A σ | β
then we immediately see that it is left recursive.
Left recursion may be indirect, involving several rules, but we will only worry about direct recursion
21
Left recursion

A production of the form
A ::= A σ | β
can be replaced in the grammar with
A ::= β B
B ::= σ B | ε

22
Left recursion removal
B is a new symbol

More generally, productions of the form
A ::= A σ1 |….|A σn| β1 | …| βm

can be replaced by
A ::= β1 B | ….| βm B
B ::= σ1 B | …. σn B | ε

23
Left recursion removal (2)

Original grammar
ident_list ::= ident | ident_list , ident

Recall rule : replace A ::= A σ | β with
A ::= β B
B ::= σ B | ε
Let
A := ident_list,
σ := , ident,
β = ident,
introduce new symbol ident_list_tail instead of B

Equivalent grammar without left-recursion
ident_list ::= ident ident_list_tail
ident_list_tail ::= , ident ident_list_tail | ε
24

A ::= β B
B ::= σ B | ε

can be replaced with

A ::= β σ*

ident_list ::= ident (, ident)*

New grammar specifies the same language as the starting point, but the structure of the parse tree may change.
25
Transforming to EBNF

Operators are left associative:
10-4-3 = (10-4)-3
A mult_op has higher precedence than an add_op:
3+4*5 = 3+(4*5)

28
Example: expression grammar with precedence and associativity

28
Precedence and associatively are important aspects of expressions that are not captured by the phrase structure of the language. However, these things can be captured in the structure of the parse tree, so we need to be careful when we make transformations. In particular, left recursion is the most useful way to indicate left associativity, but makes the grammar not LL(1).

29
Parse tree for 3+4*5

30
Parse tree for 10-4-3

31
Is this expression grammar LL(1)?

FIRST(expr) = FIRST(term) U FIRST(expr)

Grammar not LL(1)

32
Left recursive

32
In top-down parsing, left recursion can lead to non-termination of the parsing algorithm.

Start with left recursive grammar for expressions

expr ::= number | expr add_op number
add_op := + | –

Left recursion gives left-associative operators.
Right recursion gives right-associative operators

expr ::= number
| number right_assoc_op expr
33
Example

34
Parse tree for 4-3-5

number(3)
expr
expr
add_op
number(5)
expr
add_op
number(4)
–

expr ::= number expr_tail
expr_tail ::= add_op number expr_tail | ε
add_op ::= + | –

35
Transformed example

36
Parse tree for 3-4-5
expr
number(3)
expr_tail
add_op
number(4)
expr_tail
add_op
number(5)
expr_tail
ε

–
–

Fails to correctly capture associativity

expr ::= number | expr add_op number
add_op := + | –

expr ::= number expr_tail
expr_tail ::= add_op number expr_tail | ε
add_op ::= + | –

expr ::= number (add_op number)*
add_op ::= + | –

37
Transformed again– to EBNF

38
Parse tree for 3-4-5

39
Parse tree for 3-4-5
Captures associativity provided children handled from left to right

40
Thus the left-recursion in the grammar provided structural information beyond just the set of legal strings.

We need to be somewhat careful when applying transformations.

Note that bottom up parsers can handle left recursion just fine.

Grammar with production of shape

A ::= α β | α γ

is not LL(1)

41
Another cause of non-LL(1) ness

A ::= αβ | αγ

can be transformed using left factorization

A ::= α B
B ::= β | γ

where B is a new non-terminal

We still need PREDICT(B ::= β ) and PREDICT B ::= γ) to be disjoint
42
Common initial part can be factored out

It is often convenient to replace

A ::= α B
B ::= β | γ

with
A ::= α ( β | γ )

to avoid introducing a new symbol.

stmt ::= ident := expr | ident ( arg_list )

can be transformed to

stmt ::= ident ident_stmt_tail
ident_stmt_tail ::= := expr | ( arg_list )

Or more compactly

stmt ::= ident (:= expr | ( arg_list ))

44
Example

stmt ::= ident ident_stmt_tail

ident_stmt_tail ::= := expr | ( arg_list )

OK!

45
Example

expr ::= @expr . field_name
| @expr . field_name = expr
| @expr . method_name ( ε | expr (, expr)* )
| … .

field_name ::= ident
method_name ::= ident
46
Another example

expr ::= @expr . field_name
| @expr . field_name = expr
| @expr . method_name ( ε | expr (, expr)* )
| … .

field_name ::= ident
method_name ::= ident
47
Another example (2)

expr ::= @expr . field_name
| @expr . field_name = expr
| @expr . method_name ( ε | expr (, expr)* )
| … .

Problem!
field_name ::= ident
method_name ::= ident
48
Another example (3)

substitute ident for field_name and method_name

expr ::= @expr . ident
| @expr . ident = expr
| @expr . ident ( ε | expr (, expr)* )
| … .

49
Another example (4)

expr ::= @expr . ident
| @expr . ident = expr
| @expr . ident ( ε | expr (, expr)* )
| … .

50
Another example (5)

51
Another example (5)

expr ::= @expr . ident
(ε | = expr | ( ε | expr (, expr)* ) )

Because of the ε we can’t be sure this will work without knowing FOLLOW(expr) in entire grammar

52
Another example (6)

Note that eliminating left recursion and common prefixes does NOT necessarily make a grammar LL
There are infinitely many non-LL LANGUAGES, and the mechanical transformations work on most of them just fine
The few that arise in practice, however, can generally be handled with kludges

53
Transformations do not always work

Given an LL(1) grammar, we can systematically construct a recursive descent parser.
Tools exist that will do this automatically, we’ll do it by hand.
Our first parser will only recognize whether or not a sentence is legal.
Later we’ll extend the approach to construct an AST (abstract syntax tree)
54
Systematic construction of parsers

Compute the PREDICT sets for each production
Rewrite the grammar so that each non-terminal is on the left side of exactly one production.
For example, change
A::= B
A::= C
to
A::= B | C
55
Parser construction

For each non-terminal A, we will have a method void A().
For each possible shape of the rhs of a production

A ::= σ

we have a rule for constructing the code for the method A();
56

σ1 | σ 2 | … σn (where none are )

becomes code fragment

if (token ∈ PREDICT(lhs ::= σ1 )) parse σ1 ;
else if (token ∈ PREDICT(lhs ::= σ 2 ))
parse σ 2;
…
else if (token ∈ PREDICT(lhs ::= σn ))
parse σn
else error()

57
Notation: lhs = left hand side

σ1 | σ 2 | … σn | 

becomes

if (token ∈ PREDICT(lhs ::= σ1 )) parse σ1 ;
else if (token ∈ PREDICT(lhs ::= σ2 ))
parse σ2;
…
else if (token ∈ PREDICT(lhs ::= σn ))
parse σn ;
else just return //this branch matches 
58

σ1 σ 2 … σn

becomes

parse σ1;
parse σ2;
…
parse σn
59

σ*

becomes

while(token ∈ FIRST(σ)) { parse σ; }

A (where A is a non-terminal)

becomes

A();

(i.e. just invoke the method for A)

c (where a is a terminal symbol)

becomes

match(c);

where

match(c)
{ if ( current token is c )
{ get next token from scanner;}
else error();
}
62

It may be worthwhile to simplify the grammar
before starting.

For example,

A ::= B| xC
C ::= D

could be simplified to
A ::= B | xD

Also, code can be simplified after the fact to eliminate redundant tests, etc.
63

expr ::= term ( ( + | – ) term )*
term ::= factor ( ( * | / ) factor )*
factor ::= intlit | ( expr )

We are given a Scanner and Token class
64
Implementing a Parser in Java

public class SimpleParser
{
Scanner scanner; //Token producer
Token t; //next token

Parser(Scanner scanner) {
this.scanner = scanner;
t = scanner.nextToken();
}

65
Class SimpleParser

Token consume()
{
t = scanner.getNext();
}

void match(Kind kind){
if( t.isKind(kind)){
consume();
} else handle error
}

// expr ::= term ( ( + | – ) term )*

public void expr()
{
term();
while (t.isKind(PLUS) || t.isKind(MINUS))
{
if (t.isKind(PLUS))
{ match(PLUS);
}
else if (t.isKind(MINUS))
{
match(MINUS);
}
term();
}
}

term ::= factor ( ( * | / ) factor )*
same shape as expr()—here we use switch statement

void term()
{
factor();
while (t.isKind(TIMES) || t.isKind(DIV))
{
switch (t.kind)
{
case TIMES:
match(TIMES);
break;
case DIV:
match(DIV);
break;
}
factor();
}
}
68

term ::= factor ( ( * | / ) factor )*
same shape as expr()

void term()
{
factor();
while (t.isKind(TIMES) || t.isKind(DIV))
{
switch (t.kind)
{
case TIMES:
consume(); // match(TIMES);
break;
case DIV:
consume (); //match(DIV);
break;
}
factor();
}
}
69
Slightly simplified

term ::= factor ( ( * | / ) factor )*

void term()
{
factor();
while (t.isKind(TIMES) || t.isKind(DIV))
{
switch (t.kind)
{
case TIMES: case DIV:
consume();
break;
// case DIV:
// consume ();
// break;
}
factor();
}
}
70
Simplified again

// term ::= factor ( ( * | / ) factor )*

void term()
{
factor();
while (t.isKind(TIMES) || t.isKind(DIV))
{
// switch (t.kind)
// {
// case TIMES: case DIV
consume();
// break;
// }
factor();

}
}
71
And again

factor ::= int_lit | ( expr )

void factor() {
if (t.isKind(INT_LIT)) {
match(INT_LIT);
} else if (t.isKind(LPAREN)) {
match(LPAREN);
expr();
match(RPAREN);
}
else handle error
}
}

We showed how to systematically construct a recursive descent parser for a language specified by an LL(1) grammar.

Next
Example to show that the parser really does implement the top down parsing algorithm
Error handling in recursive descent parsers

expr ::= term ( ( + | – ) term )*
term ::= factor ( ( * | / ) factor )*
factor ::= int_lit | ( expr )

74
Recap of example

public void expr() // expr ::= term ( ( + | – ) term )*
{
term();
while (t.isKind(PLUS) || t.isKind(MINUS)) { consume(); term(); }
return;
}

void term() // term ::= factor ( ( * | / ) factor )*
{
factor();
while (t.isKind(TIMES) || t.isKind(DIV)){ consume(); factor(); }
return;
}

void factor() //factor ::= int_lit | ( expr )
{
if (t.isKind(INT_LIT)) { consume(); }
else if (t.isKind(LPAREN))
{ consume(); expr(); match(RPAREN); }
else error();
return
}
75

Sentence to parse:
3 – ( 4 * 5 )

Tokens:
num_lit(3), minus, lparen,
num_lit(4), times, numlit(5),
rparen

* factor ( ( * | / ) factor )* ( ( + | – ) term )* ) ( ( * | / ) factor )*
( ( + | – ) term )* * 5 ) consume

Syntax Errors
Read Scott section 2.3.5 (in the online supplement) through Exception based errors
We will discuss several approaches
Halting
Syntax error recovery
Panic mode recovery
Phrase level recovery
Context specific
Exception based

http://Scott 4e Online Supplement
80

Halting
When an error is detected, just stop and print message
This is easy for the compiler implementer, but inconvenient for the programmer using the compiler
Syntax error recovery
compiler continues looking for errors
high quality error recovery essential for production compilers
Problem: how to avoid cascading errors
81

Panic mode recovery
Define small set of safe symbols that delimit “clean” points in the input
When an error occurs, delete input tokens until a safe symbol is encountered
Return from subroutines until the parser is in a context where the new symbol might occur
Tends to generate lots of spurious errors in most languages
82

Phrase level recovery
Use different sets of safe symbols in different contexts
When a parsing routine for non-terminal A discovers an error at its beginning, it deletes tokens until it find one that is in FIRST(A) and proceeds, or a member of FOLLOW(A) and returns
83

void factor() //factor ::= int_lit | ( expr )
{ if (! (t.isKind(INT_LIT) || t.isKind(LPAREN))
{ report error
do { t = s.next();}
while ( t not in FIRST(factor) && t not in
FOLLOW(factor)
}
if (t.isKind(INT_LIT)) { consume(); }
else if (t.isKind(LPAREN))
{ consume(); expr(); match(RPAREN); }
//else error();
return();
}
84

87
More general description of phrase level recovery from Scott

If foo →ε, this approach tends to PREDICT ε and return when it should detect an error—

$$ is eof
87

88
More general description of phrase level recovery from Scott

A symbol may be in FOLLOW set because it can follow foo somewhere, but not in the particular context

Context-specific look-ahead (Wirth ’76)
Tokens may follow A somewhere in a valid program (and thus be in FOLLOW(A)) but are not necessarily allowed at a particular place.
Let the follow set be context-dependent
Pass in FOLLOW set in recursive calls
Otherwise, the same as phrase-level recovery
Example
stmt ::= ident := expr ;
would pass SEMI when calling expr here,
factor ::= ( expr )
would pass RPAREN when calling expr here,

Additional heuristic—don’t delete tokens that start major constructs that require matching tokens later (BEGIN, WHILE, etc)
89

Exception-based error recovery
Choose small set of contexts to back-out to when an error occurs (for example the beginning of code to handle a declaration, or a statement)
Attach exception handler to blocks of code that should implement recovery
When error is detected—raise an exception (throw in Java)
The system unwinds the stack to find most recent block of code with a handler.
The handler can continue, or re-throw the exception.
90

class SyntaxException extends Exception{}

Can add fields for additional information, constructors, etc.
When an exception occurs
the block is exited.
If there is no handler, the exception is propagated to enclosing block.
Propagation continues until the exception is caught.
If the exception reaches the outermost block without being caught, the program terminates.

91
Aside: Exceptions in Java

void factor() throws SyntaxException
{
if (t.isKind(INT_LIT)) {
consume();
}
else if (t.isKind(LPAREN)) { consume();
expr();
match(RPAREN);
}
//else error();
else throw new SyntaxException(…);
}
92

public void expr()
{ try{
term();
while (t.isKind(PLUS) || t.isKind(MINUS))
{ consume();
term();
}
}
catch (SyntaxException e)
{ while (t not EOF)
{ if (t in FIRST(expr) expr(); return;}
else if (t in FOLLOW(expr); return;}
else t = s.next();
}
}

}

Lex and Yacc
ANTLR
LL(*)
LPG
LR(k), backtracking possible
Xtext
based on ANTRL
creates simple eclipse IDE
many, many more
94
Tools for parser generation

ANTLR (ANother Tool for Language Recognition)
Reads a grammar file and generates a
lexer
parser
Can generate the lexer and parser in several languages (maybe)
java, c/c++, c#, python, ruby, …
Can include target language fragments in the grammar file that will be included in the generated parser.
More info at www.antlr.org
Accepts LL grammars, including LL(*) (unbounded lookahead)
Holds the entire input string in memory.
Approach only feasible recently, but necessary for IDEs such as eclipse.
95
ANTLR

/docProps/thumbnail.jpeg

Related Posts