
Compilers and computer architecture: Realistic code generation
Martin Berger 1 November 2019
Email: M.F.Berger@sussex.ac.uk, Office hours: Wed 12-13 in Chi-2R312
1/1

Recall the function of compilers
2/1

Recall the structure of compilers
(Diagram: Source program → Lexical analysis → Syntax analysis → Semantic analysis (e.g. type checking) → Intermediate code generation → Optimisation → Code generation → Translated program.)
3/1

Introduction
We have ’finished’ the compilers course, in the sense that we looked at all compiler phases: lexing, parsing, semantic analysis and code generation.
What you’ve learned so far is enough to make a basic compiler for a simple programming language.
Now we learn how to make a good compiler for a (more) sophisticated programming language.
To be able to do this, we must revisit code generation.
4/1

Introduction
We looked at code generation, but made various simplifications to highlight the key ideas.
- Source language was simplistic (no procedures, no objects etc.).
- Target language was simplistic (e.g. a stack machine, or a register machine with unbounded registers).
- Translation produced slow executables (in the case of stack and accumulator machines), and we looked at only one simple way to improve register allocation.
5/1

Introduction
Now we look at more realistic code generation.
- Target is a 'real' architecture (e.g. RISC-V).
- Source language features e.g. procedures and classes/objects.
- Code generation is typically split into three parts:
  - Generation of intermediate code.
  - Optimisation.
  - Generation of machine code.
This leads to various complications.
6/1

Key issues in translation to a realistic processor: speed
The code generators we’ve discussed at the start of this course work fine, even on real processors. The problem is speed of executables.
Naive code generation schemes make many memory reference, that can be avoided with better generated code.
Why are memory references a problem? Because they are slow in comparison with register access (approx. two orders of magnitude slower).
But registers are scarce in real processors (RISC-V has 32 integer registers and 32 floating point registers). Putting a lot of effort into making best possible of registers pays off handsomely in executable speed.
7/1

Key issues in translation to a realistic processor: memory, energy
More generally, a lot of effort goes into optimising the generated executable, not just for speed, but sometimes for memory and, increasingly, for energy efficiency (important for battery-powered devices and for large-scale data centers).
For example on some architectures caching a computed value in memory and reusing it requires more energy than recomputing the value multiple times in registers.
8/1

Key issues in translation to a realistic processor: processor evolution
Processors evolve quickly. An extreme example: the ARM processors in our phones.
In modern compilers, up to 90% of the compiler's lines of code are devoted to optimisation. It would be annoying if we had to rewrite all of this every time a processor evolves a bit.
Fortunately, most machine languages are fairly similar.
So we compile first to an intermediate representation (IR) which is rather close to generic machine code, but hides some detail.
- Most optimisation is done on this IR.
- Code for the target machine is then easy to generate from the IR.
A picture says more than 1000 words.
9/1

Key issues in translation to a realistic processor: processor evolution
(Diagram: on the left, every source language (C++, Java, Python, Scala, Haskell, C#, Ocaml/F#) is translated directly to every target architecture (x86, ARM, MIPS, RISC-V, PowerPC); on the right, all source languages are translated to a single IR, which is then translated to each target architecture.)
10/1

Intermediate representation
A good intermediate representation should have the following properties.
- The IR must be easy to generate for the front-end.
- The IR must be easy to convert to real machine code for all desired target machines.
- Each construct in the IR must have a clear and simple meaning, so that optimisations can easily be specified and implemented. Optimisations in real compilers are complicated and regularly introduce errors.
Real compilers often even use multiple IRs.
11/1

Intermediate representation: LLVM
In 2019 the most well-known and widely used IR is probably LLVM (the name used to stand for "Low-Level Virtual Machine", although the project is now much bigger than virtual machines). It was started by a student, Chris Lattner, for his final year dissertation. Lattner went to Apple, where LLVM is now used in all Apple dev tools (OS X, iOS); he also created Clang and Swift, was briefly head of Autopilot Software at Tesla, and is now at Google Brain.
LLVM is free and open source, and has evolved into a large tool suite for building compilers. I recommend that you take a look: http://llvm.org
If you want / need to build an optimising compiler, my recommendation is to use LLVM as a backend.
12/1

Intermediate representation: WebAssembly
The last few years have seen the rise of compile-to-Javascript languages, e.g. TypeScript, Dart and Scala.js.
Javascript is not a good language to compile to.
In the summer of 2017, all major browser vendors agreed to support WebAssembly as a platform-independent IR.
WebAssembly's most interesting features:
- Typed.
- Secure (e.g. no buffer overflow possible).
Will WebAssembly succeed? Time will tell ... but things are looking good.
Check out: http://webassembly.org
13/1

Intermediate representation
Often IRs (including LLVM) are quite simple and look much like machine code for a register machine with unlimited registers.
Such IR is often called three address code because most instructions look something like this:
r1 := r2 op r3
Here r1, r2, r3 are registers and op is an operation like addition, comparison … It is clear that this can easily be mapped to real assembly instructions (i.e. a mini-compiler).
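For example (an illustrative translation, with hypothetical temporary names), the source assignment a = b * (c + d) might be translated into the three instructions below; each instruction has at most one operator and at most three addresses, hence the name.

r1 := c + d
r2 := b * r1
a := r2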
14/1

Intermediate representation: SSA
Most modern compilers use a variant of three address code called SSA form, short for static single assignment form.
- Each variable is assigned exactly once (what about loops?).
- Each variable must be defined (= assigned to) before use.
This format makes compiler optimisations easier and faster. It is surprisingly close to functional programming.
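A small illustration (hypothetical, in the three-address notation above): every assignment gets a fresh version of its target variable, so straight-line code is renamed like this (y is assumed to be defined earlier):

x := 1                 x1 := 1
x := x + y    ==>      x2 := x1 + y
x := x * 2             x3 := x2 * 2

At points where control flow merges (after a branch, or at a loop header), SSA inserts a so-called φ-function, e.g. x4 := φ(x2, x3), which picks whichever version was assigned on the path actually taken; this is how SSA deals with the 'what about loops?' question.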
15/1

Intermediate representation: PTX
PTX (= Parallel Thread Execution) is an intermediate language used for compiling Nvidia’s CUDA to Nvidia’s GPUs.
It’s a register machine architecture with unlimited registers.
16/1

Key issues in translation to a realistic processor: procedures, objects, methods
The key issues making code generation more complicated than the schemes we saw in the first few weeks are:
- Procedures, potentially recursively defined.
- Objects and subtyping.
- ...
17/1

Key issues in translation to a realistic processor: procedures
Procedures, potentially recursively defined, that can be invoked like this:
def f ( x : Int, y : Int ) : Int = {
  if ( x == 0 ) 7 else f ( x-1, x+y ) }

f(2, f(3, f(4, 100)))
Problem: we don’t know at compile time when, where and how often they will be invoked, so we must have dynamic (run-time) mechanisms that orchestrate the invocations.
18/1

Key issues in translation to a realistic processor: procedures, objects, methods
Objects and subtyping: We might have objects
class A { void f () {…} }
class B extends A { void f () {…} }
class C extends A { void f () {…} }

int inp = read_user_input ();
A a = ( inp == 0 ) ? new B () : new C ();
a.f (); // how do we know which f to use? B::f or C::f?
We must have a dynamic mechanism (i.e. working at run-time) that can make a dynamic dispatch at run time to the correct f.
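One common mechanism is a virtual method table (vtable). This is only a sketch of the idea, not necessarily the exact scheme used by any particular JVM or compiler: every object carries a pointer to a table of method addresses for its run-time class, and a.f() compiles to an indexed load plus an indirect call. In an illustrative notation (the offset k and the M[...] memory-load notation are assumptions, not from the slides):

r1 := M[a]         // load the vtable pointer stored in the object that a refers to
r2 := M[r1 + k]    // load the address of f from the vtable; k is f's fixed slot offset
call r2            // indirect call: runs B::f or C::f, depending on a's run-time class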
19/1

End of overview
Recursive procedures are the tallest mountain to climb. The technology that allows us to compile recursive procedures also helps us to translate objects, exceptions etc.
Now that we understand the problems, let us begin looking at solutions.
First we look at the key ideas and data-structures (lifetimes, activation records, stack, heap, alignment) in a general way.
Then we will write a code generator using these ideas.
20/1

Management of run-time resources
Important terminology:
- Static, happens at compile-time.
- Dynamic, happens at run-time.
It's important to be clear what happens statically and what happens dynamically.
They interact: statically, the compiler generates code that dynamically organises computation (e.g. garbage collection, invocation of procedures/methods, allocation of heap memory).
21/1

Run-time organisation of memory
What happens when a program is invoked?
- We ask the OS to allocate space for the program.
- The OS loads the code of the program into the allocated space, and maybe maps symbolic names (remember those?) to real machine addresses. This is called (dynamic) linking.
- The OS jumps to the entry point of the program, i.e. the (translation of the) program's entry point (e.g. in Java, the main method).
22/1

Visualisation of memory
(Diagram: a block of memory, with lines dividing it into a 'Code' area and an 'Other (e.g. data)' area.)
We often draw memory like this. Lines delimit different areas for different kinds of data. This is a simplification; e.g. memory blocks might not be contiguous.
23/1

Runtime organisation
The compiler (at compile time) is responsible for:
- Generating the executable code. This happens statically. Compiled code stays unchanged (i.e. is read only) after linking (with all modern compilers I know of). Question: why? Security: it prevents attackers from modifying the code (alas, this is an insufficient defense on its own).
- Orchestrating the dynamic use of the data area of memory.
(Diagram: memory divided into a 'Code' area and an 'Other (e.g. data)' area.)
24/1

Compilation of procedures
Now we are going to learn how to compile procedures (e.g. static methods in Java). The technology used here can also be used to compile (non-static) methods, but we need other technology in addition.
Note that we have two goals in code generation.
- Code should be fast.
- Code should be correct.
It’s easy to achieve each goal separately.
All complication arises from trying to meet both goals together.
25/1

Compilation of procedures: key assumptions
From now on we base our discussion of code generation on two fundamental assumptions.
- Execution is sequential.
- When a procedure is called, control returns to the point immediately after the procedure call.
Compiling without either assumption is substantially harder.
Compiling without either assumption is substantially harder.
Let’s look at both in some more detail.
26/1

Compilation of procedures: key assumption sequential execution
We assume that there is no parallel or concurrent execution in our programming language. Only one thing happens at a time, and code is executed one step after the other.
Concurrency (e.g. Java threads, or multi-core CPUs) violates this assumption.
We will ignore concurrency.
27/1

Compilation of procedures: key assumption is simple control flow
We assume that when a procedure is called, and returns, the next command to be executed after the procedure returns is the command immediately after the procedure call.
x = x+1
y = f( x, 2 ) // assignment is executed after
// f returns a value
z = y*z
We also assume that each procedure returns at most once for each invocation!
Languages with advanced control constructs such as call/cc, goto, exceptions or concurrency violate this assumption.
We will ignore such advanced control constructs.
28/1

Two definitions: activations and lifetime of a procedure
Let Proc be a procedure.
An activation of Proc is simply an execution of a call to Proc,
i.e. running Proc.
Activations have a lifetime. The lifetime of an activation of
Proc is
- all the steps to execute the activation of Proc, including ...
- ... all the steps in procedures that Proc calls while running (including recursive calls to itself).
29/1

Reminder: lifetime and scope of a variable
Similarly, the lifetime of a variable x is the part of the execution of the program during which x is defined, available and used.
Note that
- the lifetime of a variable is a dynamic concept.
- the scope of a variable is a static concept.
Recall that scope is the portion of the program text where a variable is visible, e.g. scope in Java:
public class Main {
  private int x = 0;
  public static void main ( String [] args ) {}
  public void f () {
    int x = 2;
    ( new B () ).f ( x ); } // prints 2: B.f's parameter x shadows both fields named x
  class B {
    int x = 10;
    public void f ( int x ) {
      System.out.println ( x ); } } }
30/1

Important observation of procedure lifetimes
def f () = {
  println ( "entering f" )
  g ()
  println ( "leaving f" ) }

def g () = {
  println ( "entering g" )
  h ()
  println ( "leaving g" ) }

def h () = {
  println ( "entering h" )
  println ( "leaving h" ) }

def main ( args : Array [ String ] ) {
  f ()
  g ()
  h () }
In class, test.scala, test2.scala.
31/1

Important observation of procedure lifetimes
Lifetimes of procedure activations are properly nested (well-bracketed):
When f calls g, then g returns before f returns, and so on. Hence we can draw activations as below. This is a consequence of our two assumptions above.
(Diagram: nested activation bars plotted against time. main's bar contains, in sequence, an activation of f (which contains g, which in turn contains h), then an activation of g (which contains h), then an activation of h.)
32/1

Important observation of procedure lifetimes
What about recursive procedures?
def factorial ( n : Int ) : Int = {
  println ( "entering f ( " + n + " )" )
  val result = if ( n <= 0 ) 1 else n * factorial ( n-1 )
  println ( "leaving f ( " + n + " )" )
  result }

def main ( args : Array [ String ] ) {
  println ( "entering main" )
  factorial ( 4 )
  println ( "leaving main" ) }

In class, factorial, factorial2.
33/1

Important observation of procedure lifetimes
Again we see the nesting structure as a form of stacking.
(Diagram: nested activation bars against time. main's bar contains the activation of factorial(4), which contains factorial(3), which contains factorial(2), and so on.)
34/1

Activations
We need to store some data to orchestrate the execution of a procedure call, e.g. the return address, the procedure arguments and the return value. Let's also call this information the activation ...
Where should we store this information, i.e. the activation, at run time?
We cannot use a statically allocated memory address. Why? Because more than one activation might be active at the same time (recursion), and we cannot predict how many activations are active at the same time. Why not? Recall Rice's theorem?
35/1

Activations
The activation tree of a program is dynamic; it is run-time behaviour. It cannot be predicted statically how many activations there are and how they are nested (e.g. this might depend on user input). The activation tree may be different for different runs. But activations are always 'well-bracketed'.
This suggests the following implementation of procedures: use a stack to keep track of the currently active activations, in the order encountered.
- When we call a procedure, we create an activation on the stack, containing all relevant data about the activation (e.g. arguments, return address).
- When the procedure call terminates, we pop the activation off the stack.
36/1

Example: stack stores activations
def f () = { g () }
def g () = { h () }
def h () = {}
def main ( args : Array [ String ] ) {
  f (); g (); f () }
Draw stack in class.
37/1

Memory organisation: stack
(Diagram: memory divided into 'Code', 'Other data', the stack and empty space; the stack pointer marks the top of the stack, and an arrow indicates the direction of stack growth.)
The stack will grow as new procedures are called, and shrink when procedure calls terminate.
38/1

Activation records, aka (stack)frames
Let's summarise. An activation, also known as an activation record, frame or stack frame, stores all the information needed to execute one procedure activation.
Activation records (and their being on the stack) are the key data structure for making procedures (and later methods) work.
39/1

Activation records, aka (stack)frames
What information do we keep in an activation record? That depends on the details of the programming language, compilation strategy, and target architecture. But roughly this:
- Arguments for the procedure just called.
- The result of the activation, to be handed back to the caller.
- The return address.
40/1

Activation records, aka (stack)frames
Let's look at an example.
def g () : Int = { 1 }
def f ( n : Int ) : Int = {
  val result = if ( n <= 0 ) g () else n * f ( n-1 )
  result }
def main ( args : Array [ String ] ) {
  f ( 3 ) }
Here is a possible layout for activation records for f (for main and g they are different):
result
argument
control link
return address
41/1

Activation records, aka (stack)frames
(AR layout, as on the previous slide: result, argument, control link, return address.)
Here is a possible layout for activation records for f (for main and g they are different).
- The result holds the integer f returns.
- The argument holds the unique integer argument f gets. If a procedure has n arguments, then n slots are needed in the AR to hold them.
- The control link is a pointer to the caller's activation record (explained later).
- The return address is where f jumps to when finished. It is needed because f is called in multiple places.
Do you notice something important about this? The size and shape (e.g. which field is at what offset) can be determined at compile-time. This has important consequences.
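To make 'fixed offset' concrete, here is a minimal sketch (not from the slides) of how f's AR might be laid out, assuming each field occupies 4 bytes; the concrete offsets and their order are illustrative assumptions:

offset  0 : result
offset  4 : argument (n)
offset  8 : control link
offset 12 : return address

With such a fixed layout, the compiler can translate every access to, say, the argument into a load at a constant offset from a pointer into the AR, which is exactly what the next slides exploit.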
42/1

Example of activation record stacking
(Diagram: the stack after main calls f(3), which calls f(2). From the bottom: main's AR, then f(3)'s AR [result; argument: 3; control link; return address: "in main"], then f(2)'s AR [result; argument: 2; control link; return address: "recursive"].)
main has no arguments and no interesting return value, so its AR is not so interesting. "In main" and "recursive" denote the two places where f is called, i.e. the calls where execution resumes when f finishes.
This is only one of many possible AR designs.
43/1

Stacks as arrays
Remember that we realised something very important earlier: the size and shape (e.g. which field is at what offset) of an AR can be determined at compile-time.
This has important consequences: the compiler can emit code that accesses any field in an AR, provided there is a pointer to the top of the AR.
But does the stack data type support this?
44/1

Stacks as arrays
Usually we think of the data type stack as a black box that we can either push something onto, or pop something off, with no other operations provided.
Stacks in real CPUs support those operations, but they are also big arrays. This means we can access the fields in the activation record pointed to by the stack pointer relative to the stack pointer, e.g. at SP-4 or SP-16.
For example, if the SP points to the top of the AR (i.e. to the return address) and each field is 32 bits, then the argument is at SP-8 and the result sits at SP-12. If the SP points to the first free slot above the stack, then the argument is at SP-12 and the result sits at SP-16.
(AR layout, as before: result, argument, control link, return address.)
45/1

When AR is popped, SP points to result
Assume the SP points to the first free slot above the stack.
(Diagram: before and after popping f(2)'s AR. Before: main's AR, f(3)'s AR [result; argument: 3; control link; return address: "in main"], f(2)'s AR [result: 1; argument: 2; control link; return address: "recursive"], with the SP just above f(2)'s AR. After: the SP has moved back to just above f(3)'s AR.)
46/1

When AR is popped, SP points to result
Here we are using the fact that popping something off only rearranges the SP; the data is physically still there (not yet overwritten). So the caller can easily access the result.
As we will see later, local variables of procedures are also stored in ARs at fixed offsets, so to access local variables we also make use of the 'stack' being an array.
(Diagram: as on the previous slide. After the pop, f(2)'s result: 1 is still in memory just beyond the new SP, where the caller can read it.)
47/1

Saving caller data
def f (...) {
  x = x*y
  g ( x, y, z )
  x = z+y+z }
If a procedure f calls g, for example, then the execution of g typically uses registers. But these registers are typically also used by the activation of f, which is not yet finished. It would be bad if the activation of g overwrote f's registers.
What happens is that the call to g saves f's registers before executing g, and when the execution of g is finished, these saved registers are restored. The saved registers are also stored in g's AR.
The compiler must generate the code to save and restore the caller's registers. More about this later.
48/1

Note that the AR layout just sketched, and the division of responsibility between the caller and the callee, are contingent (they could have been done otherwise). What conventions we use depends on many factors, such as:
- Compatibility with existing software.
- Availability of CPU support for some operations.
- Speed or memory usage.
- Code generation simplicity.
49/1

ARs in real compilers
Production compilers attempt to hold as much of the AR in registers as possible, especially the procedure arguments and the result. In other words, the AR is then stored partly on the stack and partly in registers.
Reason: speed! Memory access is much slower than register access.
This can make procedure invocations complicated to compile.
50/1

Global and heap data
We've now begun to understand how a compiler handles procedure calls. We will come back to this soon. But now we will look at global variables and heap data.
51/1

Global data
Consider the following Java program.
public class Test {
  static int x = 1;
  public static void main ( String [] args ) {
    System.out.println ( x ); } }
Where do we store x? As x is static, in the sense of being available from the start of the program to its termination, we cannot store x in an AR. That's because ARs live only for part of the program's lifetime. We also say that x is statically allocated.
Depending on the programming language, other static data may exist.
52/1

Static data
To deal with static data, we augment the run-time layout of memory a bit by adding an area where static data lives throughout the lifetime of the program.
(Diagram: memory divided into 'Code', 'Static data', the stack (delimited by the stack base and the stack pointer), and empty space.)
53/1

Heap data
Static variables are not the only thing that cannot be stored in ARs. Consider the following Java program.
public class A { ... }
public class B {
  A f () { return new A (); }
  void g () { A a = f (); ... } }
The object created by new A () cannot live in the ARs for invocations of f, because it will outlive the invocation. The AR will hold a pointer to the object, though.
Can we store the object with the static data? Yes, but typically the object will not be used for the program's entire lifetime. The static data will not be garbage collected (see later), so memory there will not be reclaimed / reused. So storing the object there is wasteful.
54/1

Heap data
So we need an additional memory region where we store things that live longer than the activation in which they are created, but (typically) shorter than the program's lifetime. This region is called the heap.
ARs are automatically popped off the stack when an activation of a procedure terminates, but how do we free memory in the heap? Answer: garbage collection (e.g. in Java, Scala, Haskell, Python, Javascript), or manual memory management (e.g. C/C++). We will learn more about this later.
55/1

(Simplified) Memory layout
(Diagram: memory divided into 'Code', 'Static data', the stack (between the stack base and the stack pointer), empty space, and the heap (delimited by the heap pointer).)
Code: fixed size. Static data: fixed size, contains static variables. Stack: dynamic, contains ARs. Heap: dynamic, contains all other data.
56/1

(Simplified) Memory layout
Both heap and stack have dynamically changing sizes. We must make sure they don't grow into each other.
Solution: stack and heap start at opposite ends of memory and grow towards each other. When the stack and heap boundaries cross, we've run out of memory and must terminate the program (or ask the OS for more memory). The compiler must generate code that checks for (and deals with) this. The OS can help with this (e.g. the MMU).
57/1

Multiple meanings of "Heap"
A warning: the meaning of the term "heap" is different in other parts of CS. In particular, in the study of algorithms and data structures, a heap is a tree-based data structure.
(Diagram: a binary tree, as used for the heap data structure.)
These heaps have nothing to do with the heaps in compilation.
58/1

Alignment
Here is another important (and annoying) issue to do with memory.
Most modern CPUs have a 32 or 64 bit data bus.
That means the CPU reads 32 or 64 bits in one cycle. But memory is addressed in bytes. This is for historical reasons.
Recall that a byte is 8 bits, and a word is 32 bits (= 4 bytes) on a 32 bit machine and 64 bits (= 8 bytes) on a 64 bit machine.
59/1

Alignment
Example: assume we have a 32 bit CPU that reads 32 bit words.
- The CPU reads a word at address 3000. Then it reads in one cycle the contents of addresses 3000, 3001, 3002 and 3003.
- The CPU reads (or should read) a word at address 3001. Then it reads in one cycle the contents of addresses 3001, 3002, 3003 and 3004.
Note that 3000 is divisible by 4, while 3001 is not. Divisibility by 4 is important for many (most? all?) 32 bit CPUs because accessing 32 bits of memory starting at an address that is divisible by 4 is (much) faster than accessing memory at an address that is not divisible by 4. We say addresses divisible by 4 are (32 bit) aligned.
60/1

Alignment
More generally, for a 2^n bit CPU (e.g. n = 5 means 32 bit, n = 6 means 64 bit), we say that addresses divisible by 2^(n-3) are word boundaries. Memory access of 2^n bits starting at a word boundary is aligned; access at all other addresses is misaligned.
- 32 bit CPUs, n = 5: word boundaries, hence aligned accesses, begin at addresses that are multiples of 4 = 2^(5-3), e.g. 0, 4, 8, 12, 16, 20, ...
- 64 bit CPUs, n = 6: word boundaries are at addresses that are multiples of 8 = 2^(6-3), e.g. 0, 8, 16, 24, 32, 40, ...
- 128 bit CPUs, n = 7: word boundaries are at addresses that are multiples of 16 = 2^(7-3), e.g. 0, 16, 32, 48, 64, ...
Important: misaligned access is much slower (approx. 10 times) than aligned access on many (most? all?) CPUs. On some CPUs misaligned access is an error.
61/1

Alignment
Because of the huge speed penalty for misaligned memory access, compilers must do their best to ensure that all memory accesses are aligned.
Compilers achieve this by always locating data at word boundaries, using padding where necessary.
62/1

Alignment
Here is an example of padding data for a 32 bit CPU. We want to store the strings "Hello" and "World".
(Diagram: without padding, "World" would start in the middle of a word, immediately after the 5 bytes of "Hello"; with padding, the bytes after "Hello" up to the next word boundary are left unused, so that the next data starts at a word boundary.)
63/1

Alignment
Most assembly languages have built-in support that aligns (some) data automatically. For example, RISC-V assembly language has the .align command. Putting .align n on a line means that the succeeding lines are aligned on a 2^n byte boundary.
.align 2
.asciiz "Hello world!"
Here, .align 2 aligns the next value (the string "Hello world!") on a word boundary on a 32 bit machine.
64/1
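As a worked illustration (not from the slides): to round an address a up to the next multiple of 2^k (k = 2 for a 32 bit word boundary, k = 3 for 64 bit), an assembler or compiler can compute a + 2^k - 1 and then clear the low k bits. In the three-address notation used earlier, and assuming a bitwise-and operation is available:

r1 := a + 3     // 3 = 2^2 - 1; use 7 for a 64 bit boundary
r2 := r1 & -4   // -4 in two's complement is ...11111100, so this clears the two lowest bits

For example, address 3000 stays at 3000, while 3001, 3002 and 3003 are all rounded up to 3004.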