Compiler Design Week 2
Detailed content
Weekly program
Week Week
Week Week Week Week Week Week Week Week Week Week Week
1 – Introduction to Compiler Design
2 – CD Programming Language
3 – Lexical Analysis
4 – Syntax Analysis
5 – Top Down Parsing
6 – Symbol Table and Error Recovery 7 – Bottom-Up Parsing
8 – Stack Machine (SM) Architecture 9 – Semantic Analysis
10 – Code Generation
11 – Code Optimisation
12 – Revision
13 – Extra revision (if needed)
2
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
Week 02 Lecture Outline
CD Programming Language
Grammar and BNF
CD21 Programming Language Specification CD21 Data items and expressions
CD21 Structures and arrays
CD21 Constructs – branching and looping CD21 procedures and functions
CD21 special features
Lexical Analyser
Lexeme, Token and Patterns
How to specify Tokens?
3
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
Grammar – COMP2270 refresher
• Context Free Grammar is used to specify the syntax of a language.
• Describes the hierarchical structure of most programming language constructs
• CFG has four components
– A set of terminals, sometimes referred to as “tokens”
– A set of nonterminals each of which represent a set of terminals
– A set of productions each of which has
• Head: a nonterminal
• Body: a sequence terminals and/or nonterminals
• A symbol between them (likeor ::= )
– One of the nonterminals is designated as start symbol
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
4
Grammar – COMP2270 refresher
• Grammar for arithmetic expressions of digits and + / –
listlist + digit listlist – digit list digit
digit 0
digit 1 ….
digit 9
Derivation: A process in which you begin with the start sybmol and repeatedly replace a nonterminal by the body of its production. Eventually you end up with a sequence of terminals – a sentence/string derived using that grammar.
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
5
Grammar – COMP2270 refresher
list
list + digit
list + 3
list – digit + 3 list – 4 + 3
digit – 4 + 3 9–4 + 3
Language: the set of all strings that can be derived in a grammar
listlist + digit listlist – digit list digit
digit 0
digit 1 ….
digit 9
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
6
Backus-Naur Form (BNF)
• Most will have seen languages specified in BNF, which is one fairly easy way to specify a language as a set of rules or derivations. Eg.
::= stands for “can be defined as”,
rules may also be recursive
<...> are phrase types in the language
• Eg. there are many other rules for statements
BNF (2)
• We can also have circular definitions
• BNF allows us to groups these alternatives into a single rule
• We can define an identifier as
• A rule that specifies could be
::=
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
8
Extended BNF (EBNF)
• EBNF is a few simple extensions to BNF
– makes expressing grammars more convenient;
• EBNF is no more “powerful” than BNF; that
– Anything that can be expressed in EBNF can also be expressed in BNF
• EBNF more concise than BNF
• EBNF is widely used as the de facto standard to define programming languages
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
9
What are extensions?
• They vary, but often are derived from regular expression syntax
• “*” (The Kleene Star): means 0 or more occurrences
• “+” (The Kleene Plus): means 1 or more occurrences
• “?”: means 0 or 1 occurrences (sometimes “[ … ]” used instead)
• Use of parentheses for grouping
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
10
BNF VS EBNF
• Grammar for decimal numbers in plain BNF:
• Same grammar in EBNF
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
11
Syntax for Arithmetic Expressions
• An interesting set of rules are the ones which define operator precedence for + – * / in most languages:
• Some of these rules are known as Left-Recursive.
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
12
The CD21 Programming Language
• This is the source language which is to be compiled for the project.
• It has features that match special features of the SM21 architecture.
• It does not take full advantage of all the special features of the SM21 architecture.
• Where there is lack of orthogonality, it is in the hope that this will make the language simpler.
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
13
The CD21 Programming Language
CD21 HelloWorld
main
i : boolean /– only primitive types possible, /– array or struct type not possible
/– Can NOT be empty
begin
Out << "Hello World" << Line ;
end CD21 HelloWorld
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
14
CD21 Keywords
• 31 reserved keywords
CD21 constants types is arrays main begin end array of func void const integer real boolean for repeat until if else In Out Line return not and or xor true false
• Keywords are NOT case sensitive
– By convention CD21 is in uppercase others in lower-case (In, Out and Line are in sentence case)
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
15
CD21 Data Items
• Integer Expressions and Calculations
• Real Expressions and Calculations
• Boolean Expressions and Calculations
• Arithmetic Expressions
• Relational and Boolean Expressions
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
16
CD21 Primitive data types
• Integer: integer
• Floating Point: real
• Boolean: boolean
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
17
Integer Expressions, Variables & Fields
• These use 64 bit two’s complement format
• The effective range is ±9*1018
• / does the usual integer division
• % does the usual remainder (modulo) calculation
• An integer variable or field name is required to store an integer value
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
18
Real Expressions, Variables & Fields
• These are 64-bit floating point values. This means that you will have about 13 significant digits in calculations.
• Addition, subtraction, multiplication and division work as you would expect.
• The modulo operation is not valid for reals.
• A real variable or field is required to store a real value
• When one operand is integer and the other is real, then the integer operand is automatically promoted to real and the result computed as a real quantity.
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
19
Numeric Expressions
• These follow normal precedence and associativity
• ( ... ) has highest precedence
• Then^
• Then*/%
• Lastly+and-
• Note that there is no unary -, or unary + in CD21, so x = -3 must be written x = 0 – 3 (or 0.0 – 3.0)
• Integer expressions automatically promote to real as required
• / implements the usual integer divide and floating point divide.
• ^ uses repeated multiplication for integer exponents and uses ex log a for real exponents.
• integer^integer is integer while all other variants of this operation are all real.
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
20
Relational and Boolean Expressions
• A boolean variable or field name is required to store a boolean value.
• Relational and boolean operators give a boolean result.
• The only special thing about relational expressions is the way that == and != are done for real values/expressions and variables – i.e. the definition of “effectively equals”.
• Relational operators continue the precedence hierarchy for expressions which ended with + and -
• == != < >= > <= are next highest precedence
• Boolean Operators continue the precedence hierarchy
• Then the logical NOT operator
• Then and, or and xor
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
21
Strings
• Delimited by "..." and may not go across a source line and may not contain " characters
• String literals are of any length
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
22
Structure Types
• These are simple records containing a list of fields which may be integer/real/boolean.
• The field names may be re-used within different structure types, however, this is not usual practice and is only allowed to facilitate normal cut&paste re-use (the only re-use available in CD21) between developers.
• Structures use named-type-equivalence.
• CD21 does not have variables which are a single variable of a structure
type.
• Structures only exist as array elements in CD21.
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
23
Structure Types
types
point is
x : integer,
y : integer
end
circle is
centreX : integer, centreY : integer, radius : real
end
Fields can by only simple types integer, real and boolean. Fields can not be structure type or array type
• •
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
24
Array Types and Variables
• Array types define what sort of arrays can be used within a program.
• Arrays types are always arrays-of-structures.
• Basic syntax is
– ArrayTypeName is array [Size] of StructTypeName end
– Size may be any (constant) integer expressions but is usually just a
single constant value or constant id.
• Note: There are no arrays of integer/real/boolean in CD21.
• An array variable has a single array type name as its type.
• Array variables use named-type-equivalence.
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
25
Array Types and Variables
constants
COUNT is 10
types mark is
id : integer, midterm : real, final : real
end
groupMarks is array [COUNT] of mark end
arrays
comp3290 : groupMarks
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
26
Arrays
• Array Lengths and Declarations
– Array element numbers start at 0 and go up to length-1.
– Referencing outside the bounds of an array (by means of an illegal element number or non-existent field) is not allowed.
– Array elements are variables (of their particular structure type) in their own right.
• Individual data values within an array is updated using the syntax:
– ArrayName[ElementNo].FieldName = expression;
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
27
Arrays Copy
• Whole of array operations are allowed, provided that the arrays have been declared with the same type name.
array1 : Type1, array2 : Type1
array2 = array1;
• Whole “array element” copy is also allowed within CD21.
array2[0] = array1[0]; /-- is legal
• Arithmetic within arrays must be by single fields, e.g.
array2[i].x = array1[k].x * array2[k].y;
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
28
Constructs for Structured Programming
• if (
• if (
• for (… , … ;
/– if/then
/– if/then/else
/– for loop construct
/– repeat loop construct
……
end
• repeat (… , … ) ……
until
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
29
if (
• Works as you would expect
• Note that the “then” and “else” clauses are inherently “lists” of statements (or compound statements) without resorting to any extra syntax to differentiate from a single statement clause.
• An else is assumed to match up with the nearest previous unmatched then.
• Omitting the else clause from the statement, turns it into a standard if/then statement.
• Can be nested
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
30
if (
i = 5;
/– An if statement with no else
if(i > 5)
Out << "i is greater than 5” << Line;
end
/-- An if statement with an else
if( i == (2+3) )
Out << "i is equal to 5" << Line;
else
Out << "i is not equal to 5” << Line ;
end
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
31
if (
/– Nested if statement
if (i > 0) if(i >= 1000)
Out << "i is positive and big " << Line; else
Out << "i is positive but small " << Line; end
else
Out << "i is not positive " << Line;
end
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
32
Loops
for (... , ... , ... ;
• This is a for loop that incorporates the ability to initialise (only executed at the start) within the construct itself.
• The continuation condition is given after the ; in the initialisation section and operates as a “while true, continue with next iteration”.
• The update mechanism is forced into the loop body.
repeat (… , … ) … until
• The repeat until loop also has an initialisation section.
• The until condition stops the loop when the condition is true.
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
33
Loops
Out << "For loop change i and j“ << Line ; for(i = 10, j= 5; i > 0)
Out << "i is currently ", i , " and j is currently " , j << Line; i = i - 1;
j += 1;
end
Out << "Repeat loop to get them back" << Line; repeat (i=i-1, j=j+1)
Out "i is currently ", i, " and j is currently ", j << Line; i = i + 1;
j -= 1;
until i == 10;
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
34
CD21 Procedures and Functions
• Procedures in CD21 are simply Functions which “return void”.
• Functions are able to return integer/real/boolean values by way of the return statement within the function body.
• Simple variables may be passed by value (varname syntax)
• Whole arrays are passed by descriptor copy, which is equivalent to
pass by reference with array size support.
• Whole array formal parameters may be labelled “const” which means the actual array becomes read-only during the function.
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
35
CD21 Procedures and Functions (2)
• Functions may both have side effects on the environment by way of reference parameters (and input/output statements).
• The procedure call (void function call) is a statement in its own right. eg: ) /– returns void
• Return of control is via the return void; statement.
• The function call is an expression, which evaluates to the last exported
value, eg:
x = )
• return
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
36
CD21 Procedures and Functions (3)
func squareSum (a : integer, b : integer) : integer
/– return type must be specified : integer, real, boolean, void
tmp : integer
/– local can be empty, local can be
/– (1) primitive types (integer, real, boolean) /– and (2) array type
begin
tmp = a^2;
tmp += b^2;
return tmp; /– return can be (1) return void; OR (2) return expr;
end
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
37
CD21 Procedures and Functions (4)
func funcOne(i : integer) : void begin
Out << "Function 1 called with i = ", i << Line ; funcTwo(i);
return;
end
func funcTwo(i:integer) : void begin
Out << "Function 2 called with i = ", i << Line; i = i -1;
if (i > 0)
funcOne(i); /– recurse end
return; end
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
38
CD21 Special Features
1. A power-of operator for positive powers.
2. Loop constructs that contain explicit initialization sections.
3. Structured data (record) types
4. Protected arrays, mandating the use of structured data.
5. Global “static” data area for arrays.
6. Array parameters to subprograms carry their descriptor information with them into the subprogram but can be marked as read-only.
7. Whole-array copy is possible and results in a deep copy of the body of the r-value array.
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
39
CD21 Special Features (2)
• Local variable declaration and visibility
– Variables declared within any procedure or function or main program exist for the entire lifetime of the block, but …
– They are visible from the point of declaration to the end of that particular function or main program, eg.
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
40
Task of Lexical Analyser
• Read the input characters
• Group them into tokens
• Produce as output a sequence of tokens
• Strip out comments and whitespaces
• Report any lexical error
• Generate a program listing for error reporting
• May interact with the symbol table as well
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
41
Interaction of the Scanner with the Parser
Source Code
Tokens
Get next token
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
Lexical Analyser
error
error
Syntax Analyser
Symbol Table
42
Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler – Most important consideration
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers by hand or automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character encodings can be more easily translated
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
43
Lexemes and Tokens
Lexeme – an independently meaningful subsequence of the character in
the input program.
Token – a token is a tuple, holding the information about the lexeme which is actually passed on to the syntactic analyser by the lexical analyser. It is a description of the lexeme containing information about the item that the following phases will need.
Eg. if x <= 0 then flag := 1
Lexeme Tokens will be
if (5,0)
then (6,0) 0 means unused x (1,1) 1st identifier
flag (1,2) 2nd identifier
<= (30,0)
:= (26,0)
0 (2,1) 1st integer constant
1 (2,2) 2nd integer constant
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
44
Lexemes and Tokens
This is more easily understood using an enumerated type, eg IDENT = 1, ICONST = 2, ... KWIF = 5, KWTHN = 6, ... OPASN = 26, ... OPLEQ = 30.
Lexeme Tokens will be
if (KWIF,0)
then (KWTHN,0) 0 means unused x (IDENT,1) 1st identifier
flag (IDENT,2) 2nd identifier
<= (OPLEQ,0)
:= (OPASN,0)
0 (ICONST,1) 1st integer constant
1 (ICONST,2) 2nd integer constant
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
45
Lexemes, Tokens and Patterns
• A pattern is a rule for describing the of lexemes belonging to a token.
• A pattern rule can be described as
– “letter followed by letters and digits” and – “non-empty sequence of digits”
• For keywords, the pattern is just the sequence of characters that form the keyword.
• Eg.if,elseetc.
• For identifiers and some other tokens, the pattern is a more complex
structure that is matched by many strings.
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
46
Common Token Classes
• Token classes that cover most or all of the tokens in many languages
– One token for each keyword
– Tokens for operators
– Token for identifiers
– Tokens for constants and literals
– T okens for delimeters
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
47
How do we Specify Tokens?
• Regular expression
– Effective in specifying lexeme patterns
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
48
RECAP- COMP2270
• Alphabet ()
• Strings – string length, empty string ()
• Strings – prefix, suffix, substring, proper {prefix, suffix, substring}
• String operations – concatenation, exponentiation
• Language
• Language operations – union, concatenation, exponentiation
• Kleene * (Kleene closure) and Kleene + (positive closure) operations
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
49
RECAP - Regular Expressions
• Basis symbols:
– is a regular expression denoting language {}
– a is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
– r | s is a regular expression denoting L(r) M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set / regular language
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
50
Regular Definitions
•
Naming convention for regular expressions: d1 r1
d2 r2 ...
dn rn
where
Each di is a new symbol not in and different from other ds ri is a regular expression over {d1, d2, ..., di-1 }
Each dj in ri is textually substituted in ri
•
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
51
Regular Definitions
• Example:
letter A|B| ... |Z|a|b| ... |z digit 0|1| ... |9
id letter ( letter | digit )* What about the following definition?
digits digit digits | digit
• Cannot use recursion, this is illegal
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
52
• We frequently use the following shorthands: r+ = rr*
r? = r |
[a-z] =a|b|c| ... |z
• For example: digit [0-9]
num digit+ (. digit+)? ( E (+|-)? digit+ )?
27/7/2021
COMP3290 - Semester 2 - 2021 | www.newcastle.edu.au
53
Regular Definitions and Grammars
Grammar
stmt if expr then stmt
| if expr then stmt else stmt
|
expr term relop term
| term term id
| num
Regular definitions
if if
then then
else else
relop <|<=|<>|>|>=|=
digit [0-9]
letter [A-Za-z]
id letter ( letter | digit )*
num digit+ (. digit+)? ( E (+|-)? digit+ )?
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
54
References
• Compilers: Principles, Techniques, and Tools (2nd Edition)
• By A.V. Aho, Lam, R. Sethi, . Ullman
• Chapter 3
27/7/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
55