Compiler Design Week 1
Detailed content
Weekly program
Week
1 – Introduction to Compiler Design
Week
Week
Week
Week
Week
Week
Week
Week
Week
Week
Week
Week 13 – Extra revision (if needed)
2 – CD Programming Language 3 – Lexical Analysis
4 – Syntax Analysis
5 – Top Down Parsing
6 – Symbol Table and Error Recovery
7 – Bottom-Up Parsing
8 – Semantic Analysis
9 – Runtime Environment: Stack Machine (SM) Architecture
10 – Code Generation 11 – Code Optimisation 12 – Revision
2
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
Week 01 Lecture Outline
Introduction to Compiler Design
Compilers and Programming languages Basic Structure of a Compiler
Lexical Analysis
Syntactic Analysis
Semantic Analysis
Code Generation
Code Optimisation
Analysis-Synthesis model of compiler Compiler Construction tools
Complier Design for COMP3290
3
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
What is a compiler ?
• “A compiler is a program that reads a program written in a given language and translates it into an equivalent program in another language“, it is a translator.
Source Language
Target Language
Translator
Source program Target program
error messages
20/07/2021
Compiler
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
4
Examples
C++
Java
Polish
RISC object code
Java Byte Code Estonian
Compiler
Compiler
Translator at the UN
All compilers basically does the same thing. Difference lies in the choice of source and target languages.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
5
Programming Languages
• Programming languages are notations to specify algorithms
– to human
– to machine
• Programming languages are very young, not even a century has passed since the dawn of electro-mechanical computing era.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
6
Programming Languages
Machine Language:
• Simple machine code (all 0’s and 1’s)
• Instructions very closely resemble the operations performed in the underlying hardware
• Writing complex algorithm is severely challenging.
Low-level Programming Languages:
• Assembly languages (mnemonics like ADD A, B, C)
• Painfully huge number of lines even for really, small algorithms
High-level Programming Languages:
• High-level languages – e.g., Fortran, C, C++, Java, Python
• Programmers were happier to code, but compilers got complex
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
7
Type of Machine Codes
Compilers may generate three types of machine code
1. Pure Machine Code
– Generates code in a particular machine instruction set
– The code runs on bare hardware – No support for operating system of library routines
– Not common. Used for implementing operating systems or embedded applications
2. Augmented Machine Code
– Generates code for a machine architecture that is augmented with OS routines and runtime language support routines
– The code needs a particular OS in the target machine and language specific runtime support routines (for I/O, memory allocation, mathematical functions etc.)
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
8
Type of Machine Codes
3. Virtual Machine Code
– Generates virtual instructions for a VM
– Attractive for better portability (via an interpreter for the VM)
• CanrunonanymachineiftheVMinterpreterisavailable
– Can make the compiler itself easy to port. How?
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
9
HLL to LLL – CD to SM
• Hardware operations are usually quite limited
• One statement in a HLL = many lines of code in a low-level language = humongous amount of 0’s and 1’s in machine code
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
10
Compilers and Interpreters
• Share some of the functionalities of compilers
• Executes the programs with explicitly performing much translation
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
11
Compilers and Interpreters
• Special capabilities in interpreters
– Programs can be easily modified during execution – interactive debugging capability
– Can easily support languages in which type of objects change dynamically (though re-examination of the program as execution proceeds)
– Significant degree of machine independence
• What is the cost?
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
12
Compiler Writing – How difficult?
• A compiler is a very complex program the requires:
– Knowledge of programming languages
– Machine architecture
– Language theory
– Algorithms and Data Structure
– Software engineering
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
13
Compiler Writing – How difficult?
• A compiler is a very complex program that involves a lot of task:
– The first FORTRAN compiler took 18 staff-years to implement (1957)
– Requirements for implementation of C++ standard: 27 chapters, 5 annexes.
COMP3290: ~ 140 hours* – roughly 4 weeks full-time
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
14
Compiler Writing – How difficult?
• A compiler is a very complex program, but:
– Since the first compilers, systematic methods for writing compilers have been developed
– There is a traditional compiler structure: the compiler is implemented as a number of different phases that work together
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
15
Cousins of Compiler
Source Program
Modified Source Program Target Assembly Program
Relocatable Object Code
Absolute Machine Code
20/07/2021
Libraries and Relocatable Object Files
Preprocessor
Compiler
Assembler
OMP3290 – Semester 2 – 2021| www.newcastle.edu.au
16
Phases of a Compiler
• Compilation is generally divided into at least 4 simpler processes, often referred to as Phases
– Lexical Analysis (The Scanner)
– Syntactic Analysis (The Parser)
– Semantic Analysis
– Code Generation
• And another added towards the end called – Optimisation
• Where each phase begins its function and where it passes duties over to the next phase can vary significantly according to the languages involved.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
17
Lexical Analysis
• The first thing we do with “Time flies like an arrow” is think of it as 5 words rather than a string of 20 characters. Before we group words into syntactic phrases we group letters into words.
• The lexical analyser reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• For example:
position = initial + rate * 60
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
18
Lexemes and Tokens
Token – a token is a tuple of the form
Token_Classification is what is needed to test the syntax. E.g. all id’s are syntactically equivalent. So in order to check syntax, all we need to know is that the lexeme is an identifier, there is no need to know which id it actually is.
Extra_Data for most lexemes is “NULL”, once we have found a keyword such as if or then or an operator such as +, we don’t need to know anything more about the lexeme. However, for lexemes such as id’s, integer constants, etc. we need to carry forward information about “which” id or constant this lexeme actually is, so that it is available for semantic analysis and code generation in later phases.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
19
Syntactic Analysis
• Syntactic Analysis of language structure is obviously important in both natural language and computer languages.
• Syntax analyser checks if the program is grammatically correct.
• A complete set of rules covering all the structures possible within the
language is needed.
• It is generally easier with programming languages than with natural language as its unlikely that a PL will use the same word as both a noun and a verb (e.g. flies), or as both a verb and a preposition (e.g. like).
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
20
Basic Structure of a Compiler
Source Code
Tokens
Syntax Object Tree Code
Symbol Table
Lexical Analyser
Syntax Analyser
Code Gen.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
21
Syntax Analyser and Syntax Trees
• The syntax analyser takes tokens from the lexical analyser and determines how they should be structured in to expressions, statements and other “phrases” of the programming language.
• Parser creates a tree-like intermediate representation that depicts the grammatical structure of the token stream.
• Syntax Trees – are one of many different schemes for describing the grammatical structure of a program.
• Syntax Trees are a good method because they encode structural information in a nice way – easily understood, yet containing only necessary information.
• Note (on following examples) that the only items which appear in syntax trees are identifiers and constants. Keywords and delimiters never appear as leaves but determine the structure of the tree. In English ALL words would be leaves of the tree.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
22
Syntax Tree Examples
A * ( B + C ) has a self-evident structure
* A
if x <= 0 then flag := 1
if then stmt
+ BC
<=
asgn stmt
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
x 0flag 1
23
Semantic Analysis
• Knowing the lexical and syntactic structuring that should be associated with a program’s parts does not imply that the meaning of the program is known.
• Eg. x := 8 is an assignment statement with target x. The meaning of the statement cannot be determined until the meaning of x is known. If x was never declared or was declared to be of type boolean, then the statement would be meaningless.
• Determining the meaning is Semantic Analysis. With compilers the goal isnotjusttounderstandaprogrambuttotranslateit. Semantic Analysis and Code Generation are tightly integrated.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
24
Semantic Analysis
• Deciding what as actually meant by the particular example of the source language.
• Some meaning can be gained syntactically, e.g. in Pascal, x=3 in the context : const x=3 ... means declare a constant x and give it the value 3. On the other hand in the context if x=3 then ... means do the ... if the value of x is currently 3. However there are aspects of meaning that cannot be decided solely on syntax.
• What does const x=3 mean if x has already been declared?
• What does if x=3 then ... mean if x has never been declared or
has been declared to be an array?
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
25
Semantic Analysis
• Semantic analysis involves the further inspection of the syntax tree, using information from the Symbol Table to determine if the source HAS A SINGLE USEFUL MEANING.
• Semantic analysis also gather type information and stores in syntax tree or symbol table for later phases
• Type-checking is an important part of semantic analysis. – E.g. is the array index an integer?
• Coersion – another task often performed by semantic analyser.
– E.g if a binary arithmetic operator is applied to a floating-point number and an integer, the compiler may convert or coerce the integer into a floating-point number.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
26
Code Generation
• Takes as input an intermediate representation and maps it into the target language.
• The intermediate instructions are translated into sequences of machine instructions that perform the same task.
• If the target language is machine code, registers or memory locations are selected for each of the variables.
• A crucial aspect of code generation is the judicious assignment of registers to hold variables.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
27
Code Generation
• Requires an intricate knowledge of the target language.
• If you wish to translate Polish into Estonian for the UN then you must know both languages VERY well.
– Knowledge of idiom in the target language (as well as the source language in the case of natural language) can help with providing the closest meaning possible for the target.
– Such knowledge can make the translation much more effective (optimised).
• The same is true in code generation – intricacies of the architecture and particular syntactic structures within the source language can make a big difference to the performance of the target code.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
28
Code Generation
• With compilers, the requirement of Semantics is to have one EXACT meaning
– Idiom still plays a part in code generation in the form of more efficient ways of translating for the target machine, in the form of Code Optimisation.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
29
Phases of a Compiler
character stream
token stream
syntax tree
syntax tree intermediate representation
intermediate representation target machine code target machine code
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator
Machine Independent Code Optimiser
Symbol Table
Code Generator
Machine Dependent Code Optimiser
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
30
Code Optimisation
• Machine-independent code-optimisation phase attempts to improve the intermediate code so that better target code will result.
– Usually better means faster
– Other objectives may be desired, such as shorter code, or target code that consumes less power.
• Usually is a time consuming process
• The problem of generating the optimal target code from a source program is undecidable in general
– Some simple algorithms can help to improve a lot
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
31
Code Optimisation
• Suppressing code generation of unreachable code segments,
• Ridding of unused variables,
• Eliminating multiplication by 1 and addition by 0,
• Loop optimisation (e.g., remove statements that are not modified in the loop),
• Common subexpression elimination
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
32
Code Optimisation
Example:
_t1 = b * c
(eliminate add 0)
(eliminate common subexp, _t1 is same as _t3) _t2 = _t1 + _t1
a = _t2
_t1 = b * c
_t2 = _t1 + 0 _t3 = b * c
_t4 = _t2 + _t3 a = _t4
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
33
Code Optimisation
• Compiler optimisations must meet the following design objectives:
– The optimisation must be correct, that is, preserve the meaning of the compiled program,
– The optimisation must improve the performance of many programs,
– The compilation time must be kept reasonable, and
– The engineering effort required must be manageable.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
34
Code Optimisation
• Idioms in natural language translation - equivalent to knowledge of architecture of the target language
• Recognise sets of instructions that can be replaced by shorter or faster sets of instructions, based on the target machine architecture.
• Often instructions are pipelined, yielding out-of-order execution and decreasing total time of execution
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
35
Code Optimisation
• Redundant-instruction elimination
• Flow-of-control optimisations
• Algebraic simplifications
• Use of machine idioms
• These optimisations are often coined as Peephole optimisation techniques, since they consider a small window/segment of object codes for optimisation.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
36
Important Sub-process Symbol Table Generation
• A symbol table contains information about all the identifiers in the program along with important attributes such as type and scope.
• An entry for each identifier is created during syntax analysis.
• During semantic analysis, type and scope are added for each identifier.
• During code generation, different instructions are often required to be emitted based on the type of the identifier.
• Different instructions are used for integer and floating point arithmetic
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
37
Important Sub-process Error Handling
• Lexical analysis - bad tokens
• Syntax Analysis - bad combination of valid tokens
• Semantic Analysis - declare before use, type mismatch in assignments
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
38
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
39
The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis (Front End): High level language to an intermediate representation
• Analysisdeterminestheoperationsimpliedbythesource program which are recorded in a tree structure
• Checksforsyntaxandsemanticerrors
• Creates an intermediate representation & symbol table
– Synthesis (Back End): From the intermediate representation, synthesise the target machine code
• Synthesis takes the tree structure and translates the operations therein into the target program
• Benefit?
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
40
Benefit of Modularising of a Compiler
• Problem becomes easier to handle
• Decouples analysis from synthesis
• Re-use the same intermediate representation as an input to different code generators for different target machine architectures
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
41
Compiler-Construction Tools
• Software development tools are available to implement one or more compiler phases
– Scanner generators
– Parser generators
– Syntax-directed translation engines
– Automatic code generators
– Data-flow engines
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
42
Benefit of Building a Compiler
• How complicated real-world problems are solved by abstracting the essence of the problem mathematically.
– Identify the problem
– Formulate a mathematical abstraction – Solve using mathematical techniques.
• We will see application on every phase of compiler
• We will learn the general methodology of solving complex and open- ended problems
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
43
Impact on the design of Compiler
• Design of programming languages and compilers are intimately related
• Advances in programming languages placed new demands on compiler writers
– to devise algorithms and representations to translate and support the new language features
• Compilers are critical in making high-performance computer architectures effective
– to devise translation algorithms that would take maximal advantage of the new hardware capabilities.
– Compilers are also critical in making high-performance computer architectures effective on users' applications.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
44
COMP3290: CD to SM
• In order to translate we need to know the language we are translating (CD).
• In order to translate into machine code, we need to look quite closely at the architecture of our target machine (SM).
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
45
COMP3290: CD
• CD has features that are similar to Ada/Algol60.
• It is a free format language & strongly typed language.
• It contains a set of features that’s simple to translate to the target SM code.
• Any language requires a complete and unambiguous language definition - both for the programmer and the compiler writer.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
46
CD Example
CD Ex1 main
i : int, j : int begin
i = 10;
j = i /2 ; print j ;
end CD Ex1
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
47
COMP3290: Lexical Analysis
• Scan a CD program character by character
• Check for juxtaposed characters forming words
• Split them as meaningful words in the language or so-called tokens
• Report errors if they are not allowed in CD.
• Such a tool is called lexer/scanner.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
48
COMP3290: Lexcial Analysis
CD Ex1 main
i : int, j : int begin
i = 10; j=i/2; print j ;
end CD Ex1
TCD TIDEN Ex1 TMAIN
TIDEN i
TIDEN j
TBEGN
TIDEN i
TIDEN j
TPRIN TIDEN j TSEMI TEND
TCD TIDEN Ex1 TEOF
20/07/2021
TCOLN TINTG TCOMA TCOLN TINTG
TEQUL TILIT 10 TEQUL TIDEN i
TSEMI
TDIVD TILIT 2
TSEMI
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
49
COMP3290: Parsing
Fragment of CD Language Syntax
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
50
COMP3290: Semantic Analysis
For example,
j = i /2 ;
• Have all variables been declared?
• Is there a type mismatch in assignments?
• Is it a valid operation on that identifier?
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
51
COMP3290: Code Generation
SM:
• Stack machine (SM) hardware is used in a small hand-held device designed to work in extreme conditions.
• Most of its instructions do not carry any operand information within the instruction itself but operate on a stack of register. Assumes that operands will be from the stack, and results placed in the stack.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
52
COMP3290: Code Generation
• Generate assembly code and object code for the SM simulator from
any given CD program
20/07/2021
LA 1, j LV 1, i LB 2 DIV
/– divide i by 2
ST
/– store the result at j
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
53
COMP3290: Code Generation
• Generate assembly code and object code for the SM simulator from
any given CD program
5
41 2 52 91 0 0 0 8 41 10 43 91 0 0 0 0 81 0 0 0 8 41 2 14 43 81 0 0 0 0 64 67 00000000
3
32
10
2
1
2.500000
0
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
54
References
• Compilers: Principles, Techniques, and Tools (2nd Edition)
• By A.V. Aho, Lam, R. Sethi, . Ullman
• Chapter 1
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
55