
COMP90045 Programming Language Implementation
Intermediate Code Generation
Harald Søndergaard Lecture 15
Semester 1, 2019
PLI (Sem 1, 2019) Intermediate Code Generation © University of Melbourne 1 / 24

Intermediate Code
Most compilers don’t generate target code directly from the annotated syntax tree.
Instead, they
translate the program from its abstract syntax tree (AST) form to an intermediate representation (IR),
then translate the IR to the target language.
Usually, most of the work is in the first stage.
This arrangement makes it easier to implement optimizations and to retarget the compiler.

How Intermediate Representations Are Used
annotated syntax tree
  → intermediate code generator
  → intermediate representation
  → optimization 1
  → intermediate representation
  → optimization 2
  → intermediate representation
  → final code generator
  → target language

IR for Optimization
Most compilers have several optimization passes (usually many more than two) that try to improve the generated code in some way.
The compiler writer can design the IR to be easier to optimize than the final target language. This typically means making it simpler than the actual target language, and removing some of the constraints that the target instruction set may impose.
For example, many instruction sets have two different forms of conditional branches. One form is smaller, but can branch only to nearby instructions, while the other form is bigger but can branch anywhere.
There is no point in worrying about which form to use until we know the distance between the branch instruction and its target. We will know that only after all optimizations.

Composition of Optimizations
Making each optimization generate its output in the same format as its input has several advantages.
A given compiler invocation can perform any subset of the available optimizations. This allows the compiler user to select exactly what optimizations they want via command line switches.
The various optimizations can be performed in any order, so the compiler writer can choose whichever order is likely to be most effective.
It is even possible to repeat an optimization, which is useful if, for example, running optimization 2 after optimization 1 generates new opportunities for optimization 1.

How Intermediate Representations Are Used
intermediate representation
  → optimizations
  → intermediate representation
      → code generator 1 → target language 1
      → code generator 2 → target language 2
      → code generator 3 → target language 3

Retargeting
A compiler that generates code for one instruction set can be modified to also generate code for another instruction set, provided
the instruction sets are not too different, and
the IR does not depend on concepts that are present in the first instruction set but not the second.
Adding a new target requires creating a new final code generator. If the new instruction set has features that the existing IR cannot express, it also requires changes to the IR and to all the code that operates on it. This can be a lot of work.
Many compilers can generate code for several different versions of one instruction set. Some can generate code for several different instruction sets; gcc can do so for a couple of dozen.

IR Design Considerations
The IR should expose those aspects of the target instruction set(s) that are of importance for the intended optimizations, and hide the rest.
This is a subjective definition; IR design is an art, not a science.
For example, if the target instruction set has two separate sets of registers, one for floating point values and one for all other values, the IR should reflect this.
Increasing the number of optimizations will in general require exposing more details of the target(s) in the IR.

IR Design Considerations
Some instruction set architectures (ISAs) store C doubles in two 32-bit registers, while others store them in a single 64-bit register.
If you target both kinds of ISAs, exposing the concept of register pairs in the IR is necessary for generating good code for the first kind of ISA but is an unnecessary complication for the second kind.
Better optimizations means exposing more details in the IR. Targeting a larger number of platforms means exposing fewer details. The two objectives (more optimizations, more targets) are therefore in conflict.
The conflicts become more intense if you want to add more source languages. For example, gcc compiles not just C but also C++ and Objective-C.

Universal Computer-Oriented Language
source language 1 ─┐        ┌→ target language 1
source language 2 ─┼→ ? ─┼→ target language 2
source language 3 ─┘        └→ target language 3
It would be nice to have a universal IR, to which all source languages could be translated and which could itself be translated to all instruction sets. The attempt that comes closest to success is LLVM, but LLVM IR is not truly universal. Programming languages and instruction sets are simply too diverse.

Three-Address Code
The choice of the intermediate representation is one of the most crucial design decisions in the implementation of a compiler (the other big decision being the structure of the attributed abstract syntax tree).
The usual form of IR is some variant of three-address code. Three-address code represents the program as a sequence of (possibly labelled) instructions for a relatively simple and more-or-less idealized abstract machine.
It gets its name from the fact that the most complex instructions of this abstract machine usually have the form x := y op z.

Lvalues vs Rvalues
An lvalue designates a storage location—a memory cell or a register. An rvalue designates a value that may or may not be stored in an lvalue.
Assignments have lvalues on the left and rvalues on the right. (This is how lvalues and rvalues got their names.)
Any lvalues on the right are implicitly dereferenced. Dereferencing looks up the value stored in a location.
Most three-address codes use lvalues whenever possible. Some opcodes put a constant into an lvalue; others then refer to it using that lvalue.
For example, x, y and z are usually all lvalues in x := y op z.

Specifying Lvalues
One important decision in IR design is how the IR specifies lvalues. To specify a location, you must make two decisions:
is it in memory or in a register (and which kind of register, if there is more than one kind), and
what the register number is, or how the memory address is computed (offset in stack frame, index in array, etc.).
Many IRs allow these decisions to be deferred by providing an effectively unlimited number of registers of each kind.
This allows earlier optimizations to put everything into registers without worrying about running out of registers. Later optimizations can restrict themselves to using only the actual number of registers.
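The idea of an effectively unlimited register supply can be sketched as follows (a minimal illustration; the names VReg and freshReg are ours, not part of any particular IR):

```haskell
-- A minimal sketch (names VReg/freshReg are ours): virtual registers are
-- just numbers handed out from a counter, so earlier passes never run out.
-- A later pass (register allocation) maps them onto the machine's real
-- registers or stack slots.
newtype VReg = VReg Int deriving (Eq, Show)

-- Allocate a fresh virtual register, threading the counter explicitly.
freshReg :: Int -> (VReg, Int)
freshReg next = (VReg next, next + 1)

-- Example: allocate three registers in a row.
threeRegs :: ([VReg], Int)
threeRegs =
  let (r0, n1) = freshReg 0
      (r1, n2) = freshReg n1
      (r2, n3) = freshReg n2
  in ([r0, r1, r2], n3)
```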

Representing Three-Address Code
The most convenient way to represent three-address code internally is the same as the most convenient way to represent abstract syntax trees: using algebraic data types in a language such as Haskell.
data Reg
  = Reg Int

data Slot
  = Slot Int

data Instr
  = Load Reg Slot
  | Store Slot Reg
  | IntConst Reg Int
  | RealConst Reg Float
  | AddInt Reg Reg Reg
  | …
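As a sketch of how such instruction values are built, here is the translation of the assignment x := y + z, under our own assumption (not stated on the slides) that x, y and z live in stack slots 0, 1 and 2; the types are a self-contained simplification with the elided constructors omitted:

```haskell
-- Self-contained simplification of the slide's types.
data Reg = Reg Int deriving (Eq, Show)
data Slot = Slot Int deriving (Eq, Show)

data Instr
  = Load Reg Slot       -- load a stack slot into a register
  | Store Slot Reg      -- store a register into a stack slot
  | IntConst Reg Int    -- put an integer constant into a register
  | AddInt Reg Reg Reg  -- dest := src1 + src2
  deriving (Eq, Show)

-- Code for x := y + z, assuming x, y, z are in slots 0, 1, 2.
assignSum :: [Instr]
assignSum =
  [ Load (Reg 0) (Slot 1)           -- r0 := y
  , Load (Reg 1) (Slot 2)           -- r1 := z
  , AddInt (Reg 0) (Reg 0) (Reg 1)  -- r0 := r0 + r1
  , Store (Slot 0) (Reg 0)          -- x := r0
  ]
```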

Representing Instruction Sequences
The traditional representation of instruction sequences uses arrays. Their fixed size is a problem, and insertion and deletion are hard.
Using lists instead of arrays avoids fixed limits on code size, and makes insertions/deletions much easier.
Using a tree structure can be even better, because it is more flexible. It allows code sequences to be generated in one order even if they are to be executed in another order.
data Code
  = Node Code Code
  | Leaf Instr
  | EmptyLeaf
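A sketch of how such a tree is linearised once code generation is finished (flatten is our name for the operation, and Instr is a stand-in type; the accumulator keeps the traversal linear-time even for left-heavy trees):

```haskell
-- Minimal stand-in for the slide's instruction type.
newtype Instr = Instr String deriving (Eq, Show)

data Code
  = Node Code Code
  | Leaf Instr
  | EmptyLeaf

-- Flatten a Code tree into its instruction list in left-to-right order.
-- Passing the "rest" list as an accumulator avoids repeated appends.
flatten :: Code -> [Instr]
flatten code = go code []
  where
    go (Node left right) rest = go left (go right rest)
    go (Leaf instr)      rest = instr : rest
    go EmptyLeaf         rest = rest
```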

Representing Labels
If the IR stores instructions in arrays, a label can be represented as an index into the array.
If the IR stores instructions in a dynamic data structure such as a list or a tree, a label can be represented as a pointer to an instruction. However, we often want to put a reference to a label into an instruction before we create the instruction at the label.
This can be handled with backpatching: create instructions with empty fields, and fill in the fields later, when the labels to be used become known.
However, it is much simpler to add a “label” pseudo-instruction to the IR, and give each label its own name. The uniqueness of the name can be ensured by embedding the value of a counter in it.

Code Snippet Using Label “Instructions”
label1:
load r0, 4
int_const r1, 0
cmp_gt_int r0, r0, r1
branch_on_false r0, label2
load r0, 1
store 0, r0
:
load r1, 3
load r2, 1
mul_int r1, r1, r2
sub_int r0, r0, r1
store 4, r0
branch_uncond label1
label2:

Handling Labels in Haskell
In Haskell the counter that is used to generate new labels could be represented as an Int which initially is 0.
Code generator functions can receive the current value nextLabel, use it to generate a label, then pass on the value nextLabel + 1.
let
  suffix    = show nextLabel
  labelName = "label" ++ suffix
in
:
An alternative is to use a state monad and make the label counter be part of the state.

Handling Labels Monadic Style
type LabelCounter = Int

data State = State Env LabelCounter

data CodeGen a
  = CodeGen (State -> (a, State))

instance Monad CodeGen where
  return code
    = CodeGen (\st -> (code, st))
  CodeGen gen >>= f
    = CodeGen (\st0 ->
        let
          (code', st1) = gen st0
          CodeGen gen' = f code'
        in
          gen' st1)
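Note that on GHC 7.10 and later a Monad instance also requires Functor and Applicative instances. A minimal sketch that makes the slide's monad compile today, deriving the extra instances mechanically from the monad operations (Env is elided to a dummy type here):

```haskell
import Control.Monad (ap, liftM)

-- Self-contained version of the slide's types (Env elided to a dummy).
data Env = Env
data State = State Env Int
newtype CodeGen a = CodeGen (State -> (a, State))

instance Functor CodeGen where
  fmap = liftM              -- derived from >>= and pure

instance Applicative CodeGen where
  pure code = CodeGen (\st -> (code, st))
  (<*>) = ap                -- derived from >>= and pure

instance Monad CodeGen where
  CodeGen gen >>= f
    = CodeGen (\st0 ->
        let (code', st1) = gen st0
            CodeGen gen' = f code'
        in gen' st1)

-- Run a code generator from an initial state.
runCodeGen :: CodeGen a -> State -> (a, State)
runCodeGen (CodeGen gen) = gen
```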

Handling Labels Monadic Style
Extracting a code generator’s state:
getState :: CodeGen State getState
= CodeGen (\st -> (st, st))
getLabelCounter :: CodeGen LabelCounter getLabelCounter
= do
State _ lc <- getState return lc incLabelCounter :: CodeGen () = CodeGen (\(State env lc) -> ((), State env (lc+1)))
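These operations compose: a hypothetical helper newLabel (our name, not from the slides) can read the counter, bump it, and build a unique label name in one step. A self-contained sketch, with the environment dropped from the state for brevity:

```haskell
import Control.Monad (ap, liftM)

-- Simplified CodeGen: the state is just the label counter.
newtype CodeGen a = CodeGen (Int -> (a, Int))

instance Functor CodeGen where
  fmap = liftM
instance Applicative CodeGen where
  pure x = CodeGen (\lc -> (x, lc))
  (<*>) = ap
instance Monad CodeGen where
  CodeGen gen >>= f
    = CodeGen (\lc0 ->
        let (x, lc1) = gen lc0
            CodeGen gen' = f x
        in gen' lc1)

getLabelCounter :: CodeGen Int
getLabelCounter = CodeGen (\lc -> (lc, lc))

incLabelCounter :: CodeGen ()
incLabelCounter = CodeGen (\lc -> ((), lc + 1))

-- newLabel (our name): allocate a fresh, unique label name.
newLabel :: CodeGen String
newLabel = do
  lc <- getLabelCounter
  incLabelCounter
  return ("label" ++ show lc)

runCodeGen :: CodeGen a -> Int -> (a, Int)
runCodeGen (CodeGen gen) = gen
```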

Handling Labels Monadic Style
Or, simpler, use the Control.Monad.State library:

type Codegen a
  = State LabelCounter a

getLabelCounter :: Codegen LabelCounter
getLabelCounter
  = do
      label <- get
      return label

incLabelCounter :: Codegen ()
incLabelCounter
  = do
      label <- get
      put (label + 1)
      return ()

Complex Expressions and Temporaries

Since in three-address code each instruction can perform only one operation, an expression that performs more than one operation needs to store intermediate results in temporary variables.

For example, the expression x + z * 2 would normally be translated to something like this:

t0 := 2
t1 := z * t0
t2 := x + t1

where t0, t1 and t2 are temporaries.

If the IR has unlimited registers, we can put each temporary into one. Later passes may move some of them to stack slots.

Reuse of Temporaries

Temporaries are called that because their lifetime is short. Most temporaries are used just once, soon after they are defined.

A temporary may be reused if the value it represents is used again.

b := x + (z*2) < (z*2)^(-y)

may become, for example,

t0 := 2
t1 := z * t0
t2 := x + t1
t3 := 2
t4 := z * t3
t5 := uminus y
t6 := t4 pow t5
t7 := t2 < t6
b := t7

or

t0 := 2
t1 := z * t0
t2 := x + t1
t5 := uminus y
t6 := t1 pow t5
t7 := t2 < t6
b := t7

t1 will hold z*2 until the next assignment to t1 or to z.

Next Up

Syntax-directed translation: semantic rules for code generation.

See also ICD chapter 6.