Chapter 1: Introduction
Chapter 1: Introduction
CS106 — Compiler Principles and Construction
Fall 2011
MUST FIT
Dr. Liang
Introduction Dr. Zhiyao Liang
1
Compiler is a translator
Source program is written in source language
Target ( or object) program is written in target (object) language.
A translator is also a program. Running this program will generate a target program.
Introduction Dr. Zhiyao Liang
2
Source Program
Translator
Target program
Languages
High-level languages: C, C++, Java
Assembly Languages:
symbolic form of machine languages
Machine languages
All instruction are sequences of 0s’ and 1s’.
Usually the instructions of an assembly language correspond one-to-one to the instruction of a machine language.
Introduction Dr. Zhiyao Liang
3
Example of assembly language code
The activity in a computer is governed by instructions.
Memory is accessed
via addresses.
Addresses can also be
stored in memory
A simple addition
[104] [104]+[100]:
ARM
mov r0, #100
mov r1, #104
ldr r2, [r0]
ldr r3, [r1]
add r2, r2, r3
str r2, [r1]
Introduction Dr. Zhiyao Liang
4
Programs related to compilers
Interpreter: It is a translator. It translates one or more statements of the source program into target code, then immediately executes the target code.
For example, BASIC, LISP are interpreted.
Assembler: A translator that translates programs in an assembly language to programs in a machine language.
Linker: A program puts several pieces of object codes together in order to make an executable file.
Introduction Dr. Zhiyao Liang
5
Example of the process of generating an executable program
Introduction Dr. Zhiyao Liang
6
Programs related to compilers, cont.
Loader: each time a program is loaded into a memory, its starting location is different. Such a program is called relocatable.
A loader resolves all relocatable addresses relative to a given base.
Usually an actual loader does not exists. Its job is done by the operating environment.
Preprocessor (sometime also called a macro): A program that is applied to the source code before compiling starts.
Example:
#include
#define pi 3.1415 replace every pi with 3.1415
Introduction Dr. Zhiyao Liang
7
Programs related to compilers, cont.
Editor: A program that edits your source code. such as notepad, VI, emacs, …
Debugger: A program that helps to determine execution error.
Usually it can halt execution of a program at a prespecified location called breakpoints.
Example: GDB
Profiler: A program that collects statistics on a program during its execution.
Project Manager: Manages the work of groups of programmers, and different versions of the same project.
Example: sccs and rcs on Unix systems.
Introduction Dr. Zhiyao Liang
8
The translation process
Introduction Dr. Zhiyao Liang
9
Literal Table
Symbol Table
Error Handler
Source Code Optimizer is actually also a intermediate code generator.
9
Source Code
Scanner
Tokens
Parser
Syntax Tree
Semantic Analyzer
Annotated Tree
Source Code Optimizer
Intermediate Code or Rep.
Code Generator
Target Code Optimizer
Target Code
Target Code
The scanner
The scanner performs lexical analysis:
The source code file is partitioned into a sequence of tokens
A string of characters is grouped into a token according to some lexical rules
Example: the code
a[index] = 4 + 2
contains 8 tokens:
Introduction Dr. Zhiyao Liang
10
Token Token type
a identifier
[ Left bracket
index identifier
] Right bracket
= assignment
4 number
+ plus sign
2 number
The Parser
Given a sequence of tokens, the parser performs syntax analysis:
Checking if the sequence of tokens can be generated by a certain grammar (an instance of a parsing problem).
Or equivalently, checking whether the sequence of tokens can form a parse tree (also called a syntax tree), according to the grammar.
Introduction Dr. Zhiyao Liang
11
Introduction Dr. Zhiyao Liang
12
assign-expression
expression
identifier
a
identifier
index
number
4
number
2
assign-expression
expression
expression
expression
expression
[
]
additive-expression
expression
=
expression
+
Example:
a syntax tree of
a[index] = 4 + 2
Usually syntax trees are further simplified to abstract syntax trees.
Introduction Dr. Zhiyao Liang
13
additive-expression
subscript-expression
assign-expression
identifier
a
identifier
index
number
4
number
2
Keep only the necessary nodes
The semantic analyzer
The semantics (informally) is its meaning, as opposed to its syntax, which means all feature of an execution of the program.
Requirements of the semantics of programs are usually described (informally) by English.
Some features of a program can be analyzed before running it.
They are referred to as static semantics.
Example: the types of variables.
Type checking is a common task of the (static) semantic analyzer.
Example of type checking
const int x = 3;
int * p = &x; // Reporting error.
Introduction Dr. Zhiyao Liang
14
Semantic analyzer produces an annotated tree, example
Introduction Dr. Zhiyao Liang
15
subscript-expression
assign-expression
identifier
a
array of
integer
identifier
index
integer
number
4
integer
number
2
integer
additive-expression
Static semantics information are added
Source code optimizer
The goal is to represent the source code in a way that is easier to be translated into target code.
One example of improving the source code is constant folding.
like replacing 4 + 2 with 6.
The optimization could be performed on linear list of the source code. For example:
a[index] = 4+2;
t = 4+2; a[index] = t;
t = 6; a[index] = t;
a[index]=6;
The above code, using t, is an example of three-address code, which is closer to assembly code.
three-address code is a form of intermediate code.
Introduction Dr. Zhiyao Liang
16
Source code optimization can also perform on annotated trees, example:
Introduction Dr. Zhiyao Liang
17
Introduction Dr. Zhiyao Liang
17
subscript-expression
assign-expression
identifier
index
integer
number
6
integer
identifier
a
array of
integer
This tree is an intermediate representation
Code Generator
Possible code generated from a[index] = 6;
MOV R0, index ;; value of index -> R0
MUL R0, 2 ;; double value in R0
MOV R1, &a ;; address of a -> R1
ADD R1, R0 ;; add R0 to R1
MOV *R1, 6 ;; constant 6 -> address in R1
Why need to double the value in R0?
Here we assume each integer contains two bytes, while addresses are based on bytes.
Introduction Dr. Zhiyao Liang
18
Target code optimizer
The goal is to make the target code simpler, and faster.
For example,
using shifting to replace multiplication:
MUL R0, 2 SHL R0
using more powerful addressing:
MOV R1, &a ;; address of a -> R1
ADD R1, R0 ;; add R0 to R1
MOV *R1, 6 ;;
MOV &a[R0], 6 ;; constant 6 -> address a + R0
Introduction Dr. Zhiyao Liang
19
Major data structures in a compiler
Tokens: does the scanner need to generate only one token at a time (single symbol lookahead), or an array of tokens?
Syntax tree: The nodes are usually structures created dynamically (using malloc, calloc), and are connected by pointers.
Symbol table: It records all needed information of each identifier.
It is accessed in every phase.
Need to be implemented using fast data structure, such as hash-table.
Introduction Dr. Zhiyao Liang
20
Major data structures in a compiler, cont.
Literal table: It stores the constants and strings of a program.
Avoiding multiple copies of a literal, and thus avoiding the program size.
Intermediate code: three-address code and p-code, better to represented for easy organization.
Temporary files: it is needed by the following reasons
Historically, when computer memory is too little to contain the whole program.
Nowadays, more often, when the compiler generates code (temporary file) with “blanks” in the target code, and these blanks can only be filled later (backpatching).
For example, a “blank” could be the address of some code, which is the destination of a jump instruction.
Introduction Dr. Zhiyao Liang
21
Views on the compiling process
Two parts:
Analysis part: analyzing the source program.
Synthesis part: producing the target code.
Two ends:
The front end depends on the source program
The back end depends on the target program.
When translating from (to) the same language, while the target (source) language is different, the front end (back end) can be reused.
For portability.
Introduction Dr. Zhiyao Liang
22
Source Code
Target
Code
Back
End
Intermediate Code
Front
End
Passes
A compiler often needs to process the entire source program several times.
Each time is called a pass.
A pass may contain several phases.
Some languages allow efficient compilers that only need one pass.
Such as Pascal and C.
Most compilers need three passes:
One for scanning and parsing.
One for semantic analysis and source code optimization.
One for code generation and target code optimization.
Introduction Dr. Zhiyao Liang
23
Language definition
A programming language usually has its reference manual.
It is usually written in English.
It is also called the definition the language.
Some language has reached its official standards.
Occasionally, a language has its formal definition.
It is expressed in some formal way, such as denotational semantics.
This ideal situation is not always available.
Compilers of a language must be implemented according to these standards.
Introduction Dr. Zhiyao Liang
24
Runtime environment
The definition of a language describes the behavior of the runtime environment of its program.
Three basic types of runtime environments, from simple to complex, easy to difficult for implementation:
Completely static:
No pointers or dynamic allocation or recursive calls.
All memory allocation is done before execution
Like FORTRAN 77
Semi-dynamic:
Allows programmer schedule dynamic allocation on heap (not automatically).
Allow recursive function calls, which is automatically implemented on the stack by compiler.
Like Pascal, C, and other Algol-like languages.
Fully-dynamic:
All allocation is performed automatically by code generated by compiler.
The compiler needs to implement complex garbage collection algorithms.
Like LISP (functional), and Java (OO) and Smalltalk (OO).
Introduction Dr. Zhiyao Liang
25
Compiler options and interfaces
A compiler needs interface mechanisms to communicate with the operating system.
Like I/O, accessing file system.
A compiler also needs to provide options to users.
Like listing characteristics, code optimization options.
gcc options:
http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Option-Summary.html#Option%20Summary
cl options:
http://msdn.microsoft.com/en-us/library/fwkeyyhe.aspx
Interfacing and options are collectively referred to as compiler pragmatics.
Introduction Dr. Zhiyao Liang
26
Error Handling
A complier needs to find and report errors as much as possible in compile time.
A compiler also needs to generate code to capture run-time errors.
The definition of a language may require static error reporting and exception handling mechanism.
Exception handling is part of the run-time environment.
For example, an exception can cause the execution to halt.
Introduction Dr. Zhiyao Liang
27
Which is easier?
Translating all possible source programs in source code vs. translating only several programs, which is easier?
Translating from a high-level language (C, Java, C++) vs. a low level language (machine code, or assembly code), which is easier?
Translating to a high-level language vs. to a lower-level language, which is easier?
Writing a translator in a high-level language vs. in a low-level language, which is easier?
Introduction Dr. Zhiyao Liang
28
Answer: the former choice is easier.
28
Bootstrapping
The goal is to make the task of compiling easier.
The following explanations of bootstrapping is from http://bootstrap.com.sg/
This is how the saying “pull yourself up by your bootstrap” came about to describe a person who is able to overcome a situation by his or her own effort.
For compilers, bootstrapping means to accomplish a more difficult task by the help of an easier or an already accomplished task.
It is important, since affects the direction of efforts.
Introduction Dr. Zhiyao Liang
29
A bootstrap is a loop sewn onto the top rear of a boot. This little contraption allows for better leverage when wearing boots, enabling one to pull their boots swiftly and surely.
Tombstone diagrams
Tombstone diagrams are used represent language processors and their interactions.
This formalism was fully developed by Earley and Sturgis (1970).
We will use some materials of the content of chapter 2 of the book “Programming Language Processors in Java: Compliers and Interpreters”, written by Watt & Brown.
Cases of bootstrapping are shown with Tombstone diagrams.
Introduction Dr. Zhiyao Liang
30
Tombstone diagrams
Introduction Dr. Zhiyao Liang
31
A program with name P, written in language L.
A machine that runs code written in language M.
Running program P on machine M.
Note that the two matching Ms’.
The program can be a compiler or an interpreter.
An S-into-T translator written in language L.
When T and L are different, it is called a cross-compiler.
S
L
An interpreter, source language is S,
implementation and target language are both L.
P
M
S T
L
M
P
M
M
The goal of a bootstrap process
We want to run the programs written in a high-level language S on machine M.
So, we want to create a good
compiler, which has the form:
The good compiler should have the following features:
Correctness: the target program and the source program produce the same result.
Efficiency: this compiler should run efficiently, that is, fast and consume little space.
Optimizing: The generated target code should run efficiently.
We want to create such a compiler in an easy way.
Introduction Dr. Zhiyao Liang
32
S M
M
32
A bootstrap process
Step 1: Writing a good complete compiler, call it (1),
which has the form:
Writing the (1)is easier than directly writing a compiler using language M.
(1) is designed to be correct, efficient and optimizing. MH means M code with high efficiency.
Step 2: Write a “quick and dirty” program,
call it (2), of the form
This program is NOT a full-fledged compiler (note that it has dashed border).
It is designed to be capable of translating only a single program, which is (1).
The translation is correct, but may not be efficient (ML at the bottom), and the generated code may have low efficiency (indicated as ML )
Writing (2) could be achieved smartly (e.g., by some bootstrapping method).
Step 3: Running (2) on M, with (1) as the input, and obtain a compiler (3) as the output.
Introduction Dr. Zhiyao Liang
33
S → MH
SH
S → ML
ML
S → MH
SH
S → ML
ML
S → MH
ML
M
A bootstrap process, cont.
Running (1) can generate correct and efficient target code.
Running (3) can generate correct and efficient target code, why?
Since (2) does a correct translation, (3) should produce the same result as (1) does.
Running (3) is inefficient, why?
Since the translation of (2) does not do optimization: the target code is only correct, but does not run efficiently.
(3) is a full-fledged compiler.
Now, the only thing that (3) misses is efficiency.
How to do it?
Introduction Dr. Zhiyao Liang
34
(1)
(2)
(3)
S → MH
SH
S → ML
ML
S → MH
ML
M
A bootstrap process, cont.
How to generate the final ideal compiler (4)?
Running (4) produces the same result as (1), since (3) does a correct translation.
(4) is a compiler that produces efficient and correct target programs.
Running (4) is efficient, why?
Since (3) produces code that runs efficiently, which is (4).
Introduction Dr. Zhiyao Liang
35
(1)
(3)
(4)
S → MH
SH
S → MH
ML
S → MH
MH
M
Porting a compiler from machine H to machine K
We want create a compiler of language A that runs on machine K.
We already have to compiles.
(1) is ideally written, producing correct
and efficient code, and can run efficiently.
(2) is the work horse of compiling A on machine H.
Step 1:
Step 2:
Introduction Dr. Zhiyao Liang
36
Both (3) and (4) are correct, efficient and quality compilers
(1)
A → K
A
(2)
A → H
H
(3)
(1)
(2)
A → K
A
A → H
H
A → K
H
H
(4)
(1)
(3)
A → K
A
A → K
H
A → K
K
H
Abstract machine
For a real machine running language L:
We can simulate it on machine M by
It is an abstract machine
Introduction Dr. Zhiyao Liang
37
L
L
M
M
P
L
L
L
M
M
P
L
is simulated by
About Java
The jdk package has two programs:
javac is a compiler java an interpreter.
H is the platform. it can be windows, or Mac OS, Linux, Unix …
A Java program is first compiled to a JVM program.
Then, the JVM code is interpreted.
This kind of combination is called an interpretive compiler
Introduction Dr. Zhiyao Liang
38
Java → JVM
H
JVM
H
JVM
H
H
P
JVM
Java JVM
H
P
Java
P
JVM
H
Thinking about portability
Goal: we want to let Java programs run on a new machine M.
Suppose a compiler for C is available on M.
What kind of language processors of Java should we prepare for the goal?
These language processors should be independent to any kind of machine.
Introduction Dr. Zhiyao Liang
39
Thinking about portability
Introduction Dr. Zhiyao Liang
40
We prepare three language processors as a portability kit:
1: To run Java on M, we first write an interpreter of
JVM expressed in C . This is not too difficult, we can
reuse the front end of the interpreter in the kit .
2: Then the C compiler is used to
compile this interpreter:
3: Now Java code can run on M
It is used to generate and maintain the Java-to-JVM translator expressed in JVM
Java → JVM
Java
Java → JVM
JVM
JVM
Java
JVM
C
JVM
C
C → M
M
JVM
M
M
P
Java
Java → JVM
JVM
P
JVM
JVM
M
M
P
JVM
JVM
M
M
ldrr2, [r0]
ldrr3, [r1]
3
5
100
104
pc
addr2, r2, r3
strr2, [r1]
movr0, #100r0
r1
r2
Memory
movr1, #104
r3
[100] = ?
[104] = ?
/docProps/thumbnail.jpeg