Corporate PowerPoint Template
Compiler Design
Week 1
20/07/2021
2
Detailed content
Weekly program
Week 1 – Introduction to Compiler Design
Week 2 – CD Programming Language
Week 3 – Lexical Analysis
Week 4 – Syntax Analysis
Week 5 – Top Down Parsing
Week 6 – Symbol Table and Error Recovery
Week 7 – Bottom-Up Parsing
Week 8 – Semantic Analysis
Week 9 – Runtime Environment: Stack Machine (SM) Architecture
Week 10 – Code Generation
Week 11 – Code Optimisation
Week 12 – Revision
Week 13 – Extra revision (if needed)
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
Week 01 Lecture Outline
Introduction to Compiler Design
Compilers and Programming languages
Basic Structure of a Compiler
Lexical Analysis
Syntactic Analysis
Semantic Analysis
Code Generation
Code Optimisation
Analysis-Synthesis model of compiler
Compiler Construction tools
Complier Design for COMP3290
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
3
20/07/2021
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
4What is a compiler ?
• “A compiler is a program that reads a program written in a given language and
translates it into an equivalent program in another language“, it is a translator.
Source program Compiler Target program
error messages
Source
Language
Translator Target
Language
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
5Examples
Java Compiler Java Byte Code
C++ Compiler RISC object code
Polish
Translator
at the UN
Estonian
All compilers basically does the same thing. Difference lies in the choice of source
and target languages.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
6Programming Languages
• Programming languages are notations to specify algorithms
– to human
– to machine
• Programming languages are very young, not even a century has
passed since the dawn of electro-mechanical computing era.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
7Programming Languages
Machine Language:
• Simple machine code (all 0’s and 1’s)
• Instructions very closely resemble the operations performed in the
underlying hardware
• Writing complex algorithm is severely challenging.
Low-level Programming Languages:
• Assembly languages (mnemonics like ADD A, B, C)
• Painfully huge number of lines even for really, small algorithms
High-level Programming Languages:
• High-level languages – e.g., Fortran, C, C++, Java, Python
• Programmers were happier to code, but compilers got complex
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
8Type of Machine Codes
Compilers may generate three types of machine code
1. Pure Machine Code
– Generates code in a particular machine instruction set
– The code runs on bare hardware – No support for operating system of library
routines
– Not common. Used for implementing operating systems or embedded
applications
2. Augmented Machine Code
– Generates code for a machine architecture that is augmented with OS
routines and runtime language support routines
– The code needs a particular OS in the target machine and language specific
runtime support routines (for I/O, memory allocation, mathematical functions
etc.)
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
9Type of Machine Codes
3. Virtual Machine Code
– Generates virtual instructions for a VM
– Attractive for better portability (via an interpreter for the VM)
• Can run on any machine if the VM interpreter is available
– Can make the compiler itself easy to port. How?
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
10HLL to LLL – CD to SM
• Hardware operations are usually quite limited
• One statement in a HLL = many lines of code in a low-level language =
humongous amount of 0’s and 1’s in machine code
Compilers and Interpreters
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
11
• Share some of the functionalities of compilers
• Executes the programs with explicitly performing much translation
Compilers and Interpreters
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
12
• Special capabilities in interpreters
– Programs can be easily modified during execution – interactive
debugging capability
– Can easily support languages in which type of objects change
dynamically (though re-examination of the program as execution
proceeds)
– Significant degree of machine independence
• What is the cost?
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
13Compiler Writing – How difficult?
• A compiler is a very complex program the requires:
– Knowledge of programming languages
– Machine architecture
– Language theory
– Algorithms and Data Structure
– Software engineering
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
14Compiler Writing – How difficult?
• A compiler is a very complex program that involves a lot of task:
– The first FORTRAN compiler took 18 staff-years to implement
(1957)
– Requirements for implementation of C++ standard:
27 chapters, 5 annexes.
COMP3290: ~ 140 hours* – roughly 4 weeks full-time
Compiler Writing – How difficult?
• A compiler is a very complex program, but:
– Since the first compilers, systematic methods for writing compilers
have been developed
– There is a traditional compiler structure: the compiler is
implemented as a number of different phases that work together
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
15
Cousins of Compiler
Preprocessor
Compiler
Assembler
Linker
Source Program
Modified Source Program
Target Assembly Program
Relocatable Object Code
Absolute Machine Code
Libraries and
Relocatable Object Files
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
16
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
17Phases of a Compiler
• Compilation is generally divided into at least 4 simpler processes, often
referred to as Phases
– Lexical Analysis (The Scanner)
– Syntactic Analysis (The Parser)
– Semantic Analysis
– Code Generation
• And another added towards the end called
– Optimisation
• Where each phase begins its function and where it passes duties over
to the next phase can vary significantly according to the languages
involved.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
18Lexical Analysis
• The first thing we do with “Time flies like an arrow” is think of it as 5
words rather than a string of 20 characters. Before we group words into
syntactic phrases we group letters into words.
• The lexical analyser reads the stream of characters making up the
source program and groups the characters into meaningful sequences
called lexemes.
• For example:
position = initial + rate * 60
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
19
Token – a token is a tuple of the form
Token_Classification is what is needed to test the syntax. E.g. all id’s are
syntactically equivalent. So in order to check syntax, all we need to know is that
the lexeme is an identifier, there is no need to know which id it actually is.
Extra_Data for most lexemes is “NULL”, once we have found a keyword such as if
or then or an operator such as +, we don’t need to know anything more about the
lexeme. However, for lexemes such as id’s, integer constants, etc. we need to
carry forward information about “which” id or constant this lexeme actually is, so
that it is available for semantic analysis and code generation in later phases.
Lexemes and Tokens
Syntactic Analysis
• Syntactic Analysis of language structure is obviously important in both
natural language and computer languages.
• Syntax analyser checks if the program is grammatically correct.
• A complete set of rules covering all the structures possible within the
language is needed.
• It is generally easier with programming languages than with natural
language as its unlikely that a PL will use the same word as both a
noun and a verb (e.g. flies), or as both a verb and a preposition (e.g.
like).
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
20
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
21Basic Structure of a Compiler
Lexical
Analyser
Syntax
Analyser
Code
Gen.
Symbol
Table
Source
Code
Tokens Syntax
Tree
Object
Code
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
22Syntax Analyser and Syntax Trees
• The syntax analyser takes tokens from the lexical analyser and
determines how they should be structured in to expressions, statements
and other “phrases” of the programming language.
• Parser creates a tree-like intermediate representation that depicts the
grammatical structure of the token stream.
• Syntax Trees – are one of many different schemes for describing the
grammatical structure of a program.
• Syntax Trees are a good method because they encode structural
information in a nice way – easily understood, yet containing only
necessary information.
• Note (on following examples) that the only items which appear in syntax
trees are identifiers and constants. Keywords and delimiters never
appear as leaves but determine the structure of the tree. In English
ALL words would be leaves of the tree.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
23Syntax Tree Examples
A * ( B + C ) has a self-evident
structure
if x <= 0 then flag := 1
*
+
CB
A
if then
stmt
asgn
stmt<=
x 0 flag 1
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
24Semantic Analysis
• Knowing the lexical and syntactic structuring that should be associated
with a program’s parts does not imply that the meaning of the program
is known.
• Eg. x := 8 is an assignment statement with target x. The meaning of the
statement cannot be determined until the meaning of x is known. If x
was never declared or was declared to be of type boolean, then the
statement would be meaningless.
• Determining the meaning is Semantic Analysis. With compilers the goal
is not just to understand a program but to translate it. Semantic
Analysis and Code Generation are tightly integrated.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
25Semantic Analysis
• Deciding what as actually meant by the particular example of the
source language.
• Some meaning can be gained syntactically, e.g. in Pascal, x=3
in the context : const x=3 … means declare a constant x and
give it the value 3. On the other hand in the context if x=3 then
… means do the … if the value of x is currently 3. However
there are aspects of meaning that cannot be decided solely on
syntax.
• What does const x=3 mean if x has already been declared?
• What does if x=3 then … mean if x has never been declared or
has been declared to be an array?
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
26Semantic Analysis
• Semantic analysis involves the further inspection of the syntax
tree, using information from the Symbol Table to determine if the
source HAS A SINGLE USEFUL MEANING.
• Semantic analysis also gather type information and stores in
syntax tree or symbol table for later phases
• Type-checking is an important part of semantic analysis.
– E.g. is the array index an integer?
• Coersion – another task often performed by semantic analyser.
– E.g if a binary arithmetic operator is applied to a floating-point
number and an integer, the compiler may convert or coerce the
integer into a floating-point number.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
27Code Generation
• Takes as input an intermediate representation and maps it into the
target language.
• The intermediate instructions are translated into sequences of machine
instructions that perform the same task.
• If the target language is machine code, registers or memory locations
are selected for each of the variables.
• A crucial aspect of code generation is the judicious assignment of
registers to hold variables.
Code Generation
• Requires an intricate knowledge of the target language.
• If you wish to translate Polish into Estonian for the UN then you must
know both languages VERY well.
– Knowledge of idiom in the target language (as well as the source
language in the case of natural language) can help with providing
the closest meaning possible for the target.
– Such knowledge can make the translation much more effective
(optimised).
• The same is true in code generation – intricacies of the architecture and
particular syntactic structures within the source language can make a
big difference to the performance of the target code.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
28
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
29Code Generation
• With compilers, the requirement of Semantics is to have one EXACT
meaning
– Idiom still plays a part in code generation in the form of more
efficient ways of translating for the target machine, in the form of
Code Optimisation.
Phases of a Compiler
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator
Machine Independent Code Optimiser
Code Generator
Machine Dependent Code Optimiser
character stream
token stream
syntax tree
syntax tree
intermediate representation
intermediate representation
target machine code
target machine code
Symbol Table
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
30
Code Optimisation
• Machine-independent code-optimisation phase attempts to improve the
intermediate code so that better target code will result.
– Usually better means faster
– Other objectives may be desired, such as shorter code, or target
code that consumes less power.
• Usually is a time consuming process
• The problem of generating the optimal target code from a source
program is undecidable in general
– Some simple algorithms can help to improve a lot
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
31
Code Optimisation
• Suppressing code generation of unreachable code segments,
• Ridding of unused variables,
• Eliminating multiplication by 1 and addition by 0,
• Loop optimisation (e.g., remove statements that are not modified in the
loop),
• Common subexpression elimination
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
32
Code Optimisation
Example:
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
33
_t1 = b * c
(eliminate add 0)
(eliminate common subexp, _t1 is same as _t3)
_t2 = _t1 + _t1
a = _t2
_t1 = b * c
_t2 = _t1 + 0
_t3 = b * c
_t4 = _t2 + _t3
a = _t4
Code Optimisation
• Compiler optimisations must meet the following design objectives:
– The optimisation must be correct, that is, preserve the meaning of
the compiled program,
– The optimisation must improve the performance of many programs,
– The compilation time must be kept reasonable, and
– The engineering effort required must be manageable.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
34
Code Optimisation
• Idioms in natural language translation - equivalent to knowledge of
architecture of the target language
• Recognise sets of instructions that can be replaced by shorter or faster
sets of instructions, based on the target machine architecture.
• Often instructions are pipelined, yielding out-of-order execution and
decreasing total time of execution
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
35
Code Optimisation
• Redundant-instruction elimination
• Flow-of-control optimisations
• Algebraic simplifications
• Use of machine idioms
• These optimisations are often coined as Peephole optimisation
techniques, since they consider a small window/segment of object
codes for optimisation.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
36
Important Sub-process
Symbol Table Generation
• A symbol table contains information about all the identifiers in the
program along with important attributes such as type and scope.
• An entry for each identifier is created during syntax analysis.
• During semantic analysis, type and scope are added for each identifier.
• During code generation, different instructions are often required to be
emitted based on the type of the identifier.
• Different instructions are used for integer and floating point arithmetic
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
37
Important Sub-process
Error Handling
• Lexical analysis - bad tokens
• Syntax Analysis - bad combination of valid tokens
• Semantic Analysis - declare before use, type mismatch in assignments
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
38
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
39
The Analysis-Synthesis Model of
Compilation
• There are two parts to compilation:
– Analysis (Front End): High level language to an intermediate
representation
• Analysis determines the operations implied by the source
program which are recorded in a tree structure
• Checks for syntax and semantic errors
• Creates an intermediate representation & symbol table
– Synthesis (Back End): From the intermediate representation,
synthesise the target machine code
• Synthesis takes the tree structure and translates the operations
therein into the target program
• Benefit?
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
40
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
41Benefit of Modularising of a Compiler
• Problem becomes easier to handle
• Decouples analysis from synthesis
• Re-use the same intermediate representation as an input to different
code generators for different target machine architectures
Compiler-Construction Tools
• Software development tools are available to implement one or more
compiler phases
– Scanner generators
– Parser generators
– Syntax-directed translation engines
– Automatic code generators
– Data-flow engines
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
42
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
43Benefit of Building a Compiler
• How complicated real-world problems are solved by abstracting the
essence of the problem mathematically.
– Identify the problem
– Formulate a mathematical abstraction
– Solve using mathematical techniques.
• We will see application on every phase of compiler
• We will learn the general methodology of solving complex and open-
ended problems
Impact on the design of Compiler
• Design of programming languages and compilers are intimately related
• Advances in programming languages placed new demands on compiler
writers
– to devise algorithms and representations to translate and support
the new language features
• Compilers are critical in making high-performance computer
architectures effective
– to devise translation algorithms that would take maximal advantage
of the new hardware capabilities.
– Compilers are also critical in making high-performance computer
architectures effective on users' applications.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
44
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
45
COMP3290: CD to SM
• In order to translate we need to know the language we are translating
(CD).
• In order to translate into machine code, we need to look quite closely
at the architecture of our target machine (SM).
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
46COMP3290: CD
• CD has features that are similar to Ada/Algol60.
• It is a free format language & strongly typed language.
• It contains a set of features that’s simple to translate to the target SM
code.
• Any language requires a complete and unambiguous language
definition - both for the programmer and the compiler writer.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
47CD Example
CD Ex1
main
i : int, j : int
begin
i = 10;
j = i / 2 ;
print j ;
end CD Ex1
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
48COMP3290: Lexical Analysis
• Scan a CD program character by character
• Check for juxtaposed characters forming words
• Split them as meaningful words in the language or so-called tokens
• Report errors if they are not allowed in CD.
• Such a tool is called lexer/scanner.
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
49COMP3290: Lexcial Analysis
TCD TIDEN Ex1
TMAIN
TIDEN i TCOLN TINTG TCOMA
TIDEN j TCOLN TINTG
TBEGN
TIDEN i TEQUL TILIT 10 TSEMI
TIDEN j TEQUL TIDEN i TDIVD TILIT 2 TSEMI
TPRIN TIDEN j TSEMI
TEND
TCD TIDEN Ex1
TEOF
CD Ex1
main
i : int, j : int
begin
i = 10;
j = i / 2 ;
print j ;
end CD Ex1
20/07/2021
COMP3290 - Semester 2 - 2021| www.newcastle.edu.au
50COMP3290: Parsing
Fragment of CD Language Syntax
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
51COMP3290: Semantic Analysis
For example,
j = i / 2 ;
• Have all variables been declared?
• Is there a type mismatch in assignments?
• Is it a valid operation on that identifier?
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
52COMP3290: Code Generation
SM:
• Stack machine (SM) hardware is used in a small hand-held device
designed to work in extreme conditions.
• Most of its instructions do not carry any operand information within the
instruction itself but operate on a stack of register. Assumes that
operands will be from the stack, and results placed in the stack.
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
53COMP3290: Code Generation
• Generate assembly code and object code for the SM simulator from
any given CD program
LA 1, j
LV 1, i
LB 2
DIV
/– divide i by 2
ST
/– store the result at j
20/07/2021
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
54COMP3290: Code Generation
• Generate assembly code and object code for the SM simulator from
any given CD program
5
41 2 52 91 0 0 0 8
41 10 43 91 0 0 0 0
81 0 0 0 8 41 2 14
43 81 0 0 0 0 64 67
0 0 0 0 0 0 0 0
3
32
10
2
1
2.500000
0
References
• Compilers: Principles, Techniques, and Tools
(2nd Edition)
• By A.V. Aho, Monica S Lam, R. Sethi, Jeffrey D.
Ullman
• Chapter 1
COMP3290 – Semester 2 – 2021| www.newcastle.edu.au
55
20/07/2021