CE204
Data Structures and Algorithms
Part 9
10/03/2019 CE204 Part 8
1
Parse Trees 1
A tree provides a convenient means of representing the structure of the source code of a program. For example, the expression (a+7)*b can be represented by the tree
10/03/2019 CE204 Part 6
2
Parse Trees 2
It is not necessary to show the parentheses in the tree; the structure of the tree for a+(7*b) would be different.
A parser is a program that converts text such as program source code into a tree; hence these trees are called parse trees.
Binary parse trees are adequate for the representation of most programming language constructs, but it is sometimes desirable to use ternary trees – for example a tree for an if…then…else… statement could have three children, the expression and the two statements. We could, however, represent this as a binary tree with the two statements being stored as grandchildren.
10/03/2019 CE204 Part 6
3
Parse Trees in Java 1
A node in a parse tree needs to hold two pieces of information – the kind of node (e.g. a number, identifier or operator) and its value (e.g. 7, myId, or plus). Since the latter can have various types we need to be able to store different types of value in the different kinds of nodes. Two approaches are possible in Java – we can use the Object class to store the value, or declare the node type as an abstract class with subclasses for each kind of node, with each subclass having a different type for the value.
10/03/2019 CE204 Part 6
4
Parse Trees in Java 2
Using the first approach we would store the kind of node as an integer (e.g. 0 for number nodes, 1 for identifier nodes and 2 for operator nodes). In order to make programs readable we should enumerate as constants the values used to represent each kind of node. The following slide shows an outline of the class declaration for this approach.
10/03/2019 CE204 Part 6
5
Parse Trees in Java 3
public class ParseTree
{ private int kind;
private Object value;
private ParseTree lChild, rChild;
public static final int numNode = 0,
idNode = 1, opNode = 2;
public ParseTree(int knd, Object val, ParseTree l, ParseTree r)
{ kind = knd; value = val;
lChild = l; rChild = r;
}
// methods to examine the node needed }
10/03/2019 CE204 Part 6
6
Parse Trees in Java 4
If we choose the abstract class approach for representing parse trees it is not necessary to store the kind in the node since this can be determined using the instanceof operator. However, if there are many kinds of node we might decide to store the kind so that a switch statement could be used instead of many if statements.
Assuming that the kind is not being stored, the declaration of the abstract class might take the form shown on the next slide. The instance variables have been declared as protected so that the constructors of the subclasses can access them.
10/03/2019 CE204 Part 6
7
Parse Trees in Java 5
public abstract class ParseTree
{ protected ParseTree lChild, rChild;
public ParseTree leftChild() { return lChild;
}
public ParseTree rightChild() { return rChild;
}
public abstract String toString();
}
10/03/2019 CE204 Part 6
8
Parse Trees in Java 6
A class that extends the abstract class ParseTree would then be written for each kind of node that can occur in a parse tree.
These classes would inherit the child-access methods of the abstract class but would have constructors and value-returning methods appropriate to the type of data that they store. The constructors for leaf nodes, such as numbers, would have only one argument, the value to be stored in the node, but those for non-leaf nodes, such as operators, would need the children as extra arguments.
An example is seen on the next slide. 10/03/2019 CE204 Part 6
9
Parse Trees in Java 7
class NumNode extends ParseTree
{ private int value;
public NumNode(int val)
{ lChild = rChild = null; // it’s a leaf
value = val; }
public int numValue() { return value;
}
public String toString() { return “”+value;
}
}
10/03/2019 CE204 Part 6
10
Lexical Analysis and Parsing 1
A parser is a program or method that takes input from some language (usually from a file), checks that the input is syntactically correct and generates an internal representation, usually in the form of a parse tree. Parsing is the first phase of the compilation process but may be used in other applications that expect input conforming to some well-defined syntax.
10/03/2019 CE204 Part 8
11
Lexical Analysis and Parsing 2
Parsers usually expect as input not a sequence of characters, but a sequence of tokens representing the lexemes (language items) from the input source. Typical lexemes include reserved words (e.g. class, if, while), numbers (e.g. 7, 12.5), string and character literals (e.g ”Hello”, ’x’) identifiers (e.g. myId), operators (e.g. <=) and punctuation symbols (e.g. ;).
A lexical analyser is used to convert the input into tokens. It is possible to convert the whole of the input before starting to parse but the usual approach is to provide a method to obtain the next token, which will be called by the parser whenever a token is needed.
10/03/2019 CE204 Part 8
12
Lexical Tokens 1
A token may be implemented as an object containing two instance variables, one indicating the kind of token and the other providing information about the value. The kinds should be enumerated as constants. Values may be strings, characters or numbers; the simplest way to store this information is to store the lexeme as a string.
To allow quick access the instance variables are usually declared to be public or protected; an example class can be seen on the next slide. Two constructors are provided: one for tokens that have value information and one for tokens that have not.
10/03/2019 CE204 Part 8
13
Lexical Tokens 2
public class Token
{ public int kind;
public String value;
public static final int
leftParen = 0, rightParen = 1, number = 2, ident = 3,
plus = 4, minus = 6, times = 7, divide = 8, power = 9, eof = 10;
public Token(int k)
{ kind = k; value = null;
}
public Token(int k, String s) { kind = k; value = s;
}
}
10/03/2019 CE204 Part 8
14
Grammars 1
Every programming language has a syntax that defines how lexemes may be combined to form a valid program. The syntax is usually expressed using a grammar.
A grammar comprises
a finite set of terminal symbols, corresponding to the kinds of token
a finite set of non-terminal symbols, corresponding to other language entities
a particular non-terminal symbol defined as the start symbol a finite set of productions of the form αβ where α and β
are strings of terminals and non-terminals 10/03/2019 CE204 Part 8
15
Grammars 2
The set of all strings of terminal symbols that can be obtained by starting with the start symbol and repeatedly replacing a string that occurs on the left-hand-side of a production with the corresponding right-hand-side is known as the language of the grammar.
The individual strings in the language are called sentences.
The sequence of steps of the replacement process is called a
derivation.
10/03/2019 CE204 Part 8
16
Grammars 3
When a grammar is used to specify the syntax of a programming language the non-terminal symbols will be constructs such as declaration, expression and statement.
If each left-hand-side string of a grammar comprises just a single non-terminal symbol the grammar is said to be context- free. Programming language grammars are usually context- free, but it should be noted that in such grammars we can not express constraints such as “all variables must be declared before use”.
10/03/2019 CE204 Part 8
17
Grammars 4
It is conventional to use bold text for terminals and italic text for non-terminals when writing grammars.
Notation of the form α β γ δ is often used as a shorthand to express several productions (αβ, αγ and αδ) on a single line.
The start symbol is usually identified as such by being the symbol on the left-hand side of the first production in the grammar.
10/03/2019 CE204 Part 8
18
Grammars 5
Consider the following grammar:
expr operand operator operand
operand ident number lparen expr rparen operator plus minus times
To show that (a+4)*b is a sentence of this grammar (assuming that a and b are valid identifiers and 4 is a valid number) we can use the derivation on the following slide.
The derivation is an example of a leftmost derivation – at each step a production is applied to the leftmost non-terminal symbol.
10/03/2019 CE204 Part 8
19
Grammars 6
expr operand operator operand
lparen expr rparen operator operand
lparen operand operator operand rparen operator operand lparen ident operator operand rparen operator operand lparen ident plus operand rparen operator operand
lparen ident plus number rparen operator operand
lparen ident plus number rparen times operand
lparen ident plus number rparen times ident
10/03/2019 CE204 Part 8
20
Ambiguous Grammars 1
Consider the following grammar:
expr expr operator expr id number lparen expr rparen
operator plus minus times divide
The sentence a+7*b can be obtained using two different leftmost derivations as seen on the next two slides; these give rise to two different parse trees as seen on the subsequent slide.
The grammar is hence ambiguous. Grammars that are to be used by parsers must either be rewritten in an unambiguous form or augmented with operator precedence rules.
10/03/2019 CE204 Part 8
21
Ambiguous Grammars 2
expr expr operator expr
expr operator expr operator expr id operator expr operator expr id plus expr operator expr
id plus number operator expr id plus number times expr
id plus number times id
10/03/2019 CE204 Part 8
22
Ambiguous Grammars 3
expr expr operator expr
id operator expr
id plus expr
id plus expr operator expr
id plus number operator expr id plus number times expr id plus number times id
10/03/2019 CE204 Part 8
23
Ambiguous Grammars 4
10/03/2019 CE204 Part 8
24
Parsing 1
There are two common approaches to parsing:
bottom-up: the parser obtains tokens from the lexical analyser and tries to match sequences to right-hand sides of grammar productions, building the parse trees from the leaves upwards.
top-down: the parser attempts to build a parse tree by starting with the start symbol and obtaining a leftmost derivation. As a choice of productions is often available it is necessary to examine the input tokens to determine which production to use. When a non-terminal symbol is reached in the production it must be matched against a token from the input.
10/03/2019 CE204 Part 8
25
Parsing 2
When writing parsers by hand the top-down approach is usually easiest. However the class of grammars that can be successfully parsed in a top-down manner without backtracking is limited. Furthermore, top-down parsers usually require unambiguous grammars.
There are systematic techniques for developing bottom-up parsers from grammars. These techniques can cope with ambiguous grammars augmented with precedence information. They are, however, quite complicated and not suitable for hand- coding parsers for large grammars. Tools such as yacc and bison have been developed to perform this task.
10/03/2019 CE204 Part 8
26
Recursive Descent Parsing 1
The simplest form of top-down parser is a recursive descent parser without look-ahead or backtracking. This can be used for only a limited set of grammars. The grammar from slide 19 (repeated here) is an example where such a parser can be used:
expr operand operator operand
operand ident number lparen expr rparen operator plus minus times
10/03/2019 CE204 Part 8
27
Recursive Descent Parsing 2
In a recursive descent parser there is a method for each non- terminal symbol in the grammar; these methods may be mutually recursive. For this grammar we need to write methods parseExpr, parseOperand and parseOperator. The first two of these will return trees but the latter will simply return the character representing the operator (for example ’+’). These three methods will be private and will be called by a public method called parse.
10/03/2019 CE204 Part 8
28
Recursive Descent Parsing 3
We shall assume that numbers are integers and identifiers are single letters. We shall also assume that a ParseTree class has been written using the abstract class approach as seen on slides 7-10, with subclasses called NumNode, IdNode and OperatorNode, and that the operator in an operator node is stored as a character. In addition we assume that a lexical analyser class called Lexer has been written, with a getToken method that retrieves the next token from the input as an object of the Token class from slide 14. To keep things as simple as possible our parser will not attempt to continue if there is an error in the input, but will simply output an appropriate error message.
10/03/2019 CE204 Part 8
29
Recursive Descent Parsing 4
public class Parser
{ private Lexer lex;
public ParseTree parse() { lex = new Lexer();
return parseExpr();
}
private ParseTree parseExpr()
{ ParseTree left = parseOperand();
char op = parseOperator();
ParseTree right = parseOperand();
return new OperatorNode(op, left, right);
}
10/03/2019 CE204 Part 8
30
Recursive Descent Parsing 5
// Parser class continued
private char parseOperator()
{ Token token = lex.getToken();
switch (token.kind)
{ case Token.plus: return '+';
case Token.minus: return '-'; case Token.times: return '*'; default:
System.out.println(
"Parse error: operator expected");
System.exit(0);
}
}
10/03/2019 CE204 Part 8
31
Recursive Descent Parsing 6
// Parser class continued
private ParseTree parseOperand()
{ Token token = lex.getToken();
switch(token.kind) { case Token.ident:
return new IdNode(token.value.charAt(0));
case Token.number: return new NumNode(
Integer.parseInt(token.value));
// switch statement continued on next slide
10/03/2019 CE204 Part 8
32
Recursive Descent Parsing 7
// parseOperand method continued
// still in switch(token.kind) case token.lparen:
{ ParseTree t = parseExpr();
if(lex.getToken().kind!=Token.rparen)
{ System.out.println(
"Parse error: ')' expected"); System.exit(0);
}
return t; }
// switch statement continued on next slide
10/03/2019 CE204 Part 8
33
Recursive Descent Parsing 8
// parseOperand method continued
// still in switch(token.kind)
default:
System.out.println(
"Parse error: invalid operand”);
System.exit(0);
} // end of switch statement
} // end of method } // end of class
10/03/2019 CE204 Part 8
34
Recursive Descent Parsing 9
It was easy to write a recursive descent parser for the grammar on slide 27 since whenever a production had a choice on the right-hand side each of the options began with a different terminal symbol, so it was possible to decide how to proceed by examining the next token.
Many languages require a one-token look-ahead if parsed using recursive descent techniques. Consider the following production:
stmt expr semicolon if lparen expr rparen stmt while lparen expr rparen stmt
10/03/2019 CE204 Part 8
35
Recursive Descent Parsing 10
The next token must be examined to determine which option to choose. If the token is neither if nor while it must be the first token of an expression so the parseExpr method should be called. This method will need to examine the same token again. Simply modifying the method so that it takes the previously- examined token as an argument is not an option since there are occurrences of expr in the grammar in positions other than the first symbol of a right-hand-side option.
We need to provide a look-ahead facility that allows us to examine a token without consuming it.
10/03/2019 CE204 Part 8
36
Recursive Descent Parsing 11
There are two ways of implementing a one-token look-ahead: one is to add to the lexical analyser an extra method which returns the next token but also saves it so it will be returned again when the getToken method is next called.
The second, more common, approach is to add a public nextToken variable to the Lexer class: this will always store the next token. The getToken method should now update this variable instead of returning a result, and should be called whenever the stored token has been used, but not when it has only been examined. If using this approach it is important to ensure that getToken is called at the beginning of the parse method to initialise the nextToken variable.
10/03/2019 CE204 Part 8
37
Recursive Descent Parsing 12
There are still situations where a simple recursive descent parser with a one-token look-ahead is insufficient. One such situation is where two right-hand-side options of a production begin with the same symbol. In this case we can use a look- ahead of more than one token to determine which to use, or attempt one and, if it fails, back-track and try the other (this could require the storage of several used tokens for possible reuse). An alternative solution is to attempt to write another grammar for the same language.
If two or more right-hand-side options in a production begin with non-terminal symbols similar problems can occur.
10/03/2019 CE204 Part 8
38
Left-Recursive Productions 1
Consider the following grammar production for the left-hand side of an assignment statement:
lhs lparen lhs rparen ident lhs dot ident lhs lbracket expr rbracket
If a parseLHS method were written in the same style as our previous parsing methods and the next token was neither a left parenthesis nor an identifier, the method would call itself recursively without consuming the token. This recursion would never terminate.
10/03/2019 CE204 Part 8
39
Left-Recursive Productions 2
The production on the previous is said to be left-recursive – one or more of the right-hand side options begins with the same non- terminal symbol as the symbol on the left-hand side. There is a technique for removing left-recursion from grammars; if applied to the above production it would generate productions of the form
lhs lparen lhs rparen lhs2 ident lhs2
lhs2 dot ident lhs2 lbracket expr rbracket lhs2 empty
Although this is now suitable for parsing by a recursive descent parser, it does not correspond to the parse trees that we wish to generate – if would be necessary to transform the tree after parsing a left-hand side.
10/03/2019 CE204 Part 8
40
Handling Operator Precedence 1
It is not difficult to write a grammar which correctly handles precedence of binary operators. The key is to introduce a different non-terminal symbol for each level of precedence:
exp exp plus term exp minus term term
term term times power term divide power power power opd uparrow power opd
opd ident number lparen exp rparen
The choice of whether to use exp minus term or term minus exp depends on the associativity of the operator. In this grammar the expression a-b-c would mean (a-b)-c but a^b^c would mean a^(b^c).
10/03/2019 CE204 Part 8
41
Handling Operator Precedence 2
Although the grammar on the previous slide is left-recursive it can be parsed easily without transformation. When the only left-recursive productions are of the form
eeop1 g...eopn gg
we can use a loop that repeatedly calls the method to parse g in
the method to parse e.
We present on the remaining slides a parser for the grammar from the previous slide, using a lexical analyser with a nextToken variable to implement look-ahead. We assume that the Token and ParseTree classes are as previously presented.
10/03/2019 CE204 Part 8
42
Handling Operator Precedence 3
public class Parser
{ private Lexer lex;
public ParseTree parse() { lex = new Lexer();
lex.getToken();
ParseTree t = parseExp();
if (lex.nextToken.kind!=Token.eof) System.out.println(
"Warning – garbage after expression");
return t; }
// class continued on next slides
10/03/2019 CE204 Part 8
43
Handling Operator Precedence 4
// Parser class continued
private ParseTree parseExp()
{ ParseTree result = parseTerm();
while (lex.nextToken.kind==Token.plus || lex.nextToken.kind==Token.minus)
{ char op =
lex.nextToken.kind==Token.plus ?
'+' : '-';
lex.getToken();
result = new OperatorNode(op, result,
}
return(result);
}
parseTerm());
10/03/2019 CE204 Part 8
44
Handling Operator Precedence 5
// Parser class continued
private ParseTree parseTerm()
{ ParseTree result = parsePower();
while (lex.nextToken.kind==Token.times || lex.nextToken.kind==Token.divide)
{ char op =
lex.nextToken.kind==Token.times ?
'*' : '/';
lex.getToken();
result = new OperatorNode(op, result,
}
return(result);
}
parsePower());
10/03/2019 CE204 Part 8
45
Handling Operator Precedence 6
// Parser class continued
private ParseTree parsePower()
{ ParseTree left = parseOpd();
if (lex.nextToken.kind==Token.uparrow) { lex.getToken();
return new OperatorNode('^', left,
parsePower());
} else
return left; }
10/03/2019
CE204 Part 8
46
Handling Operator Precedence 7
// Parser class continued
private ParseTree parseOpd()
{ Token token = lex.nextToken;
lex.getToken();
switch(token.kind)
{ case Token.ident:
return new IdNode( token.value.charAt(0));
case Token.number:
return new NumNode(
Integer.parseInt(token.value)); // switch continued on next slide
10/03/2019 CE204 Part 8
47
Handling Operator Precedence 8
// Parser class continued
// parseOpd switch continued
case Token.lparen:
{ ParseTree t = parseExp();
if (lex.nextToken.kind!=Token.rparen)
{ System.out.println(
"Parse error: ')' expected"); System.exit(0);
}
lex.getToken();
return(t);
}
// switch continued on next slide
10/03/2019 CE204 Part 8
48
Handling Operator Precedence 9
// Parser class continued
// parseOpd switch continued
default: System.out.println(
"Parse error: invalid operand");
System.exit(0);
} // end of switch
} // end of parseOperand method
}
10/03/2019 CE204 Part 8
49