
COMP90045 Programming Language Implementation
Code Generation and Local Optimization
Harald Søndergaard
Lecture 19
Semester 1, 2019

Unconventional Target Languages
Some compilers for high level languages (e.g., C++, Haskell, Mercury) generate C code instead of assembler or machine code. This effectively avoids having to repeat all the work done on C compilers to generate good code, and to generate code for many platforms.
GNU C has some extensions over standard C that make it closer to assembler, such as the ability to put global variables in specified registers, and to include assembly code in C code. These make it a better target language, but not an ideal one.
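For instance, here is a minimal sketch of the two extensions just mentioned (GNU C only, not ISO C; the register choice and the rdtsc example are illustrative, not taken from the slides):

    /* GNU C: dedicate a machine register (here rbx, illustrative) to a
       global variable for the whole compilation unit. */
    register long frame_ptr asm("rbx");

    /* GNU C: assembly code embedded in C; read the x86 time-stamp counter. */
    unsigned long long read_tsc(void)
    {
        unsigned int lo, hi;
        asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long) hi << 32) | lo;
    }
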
C-- is a variant specifically designed to be a target language; for example, it allows control over stack frame layout, for use in RTTI.

Virtual Machines
Some compilers generate code for neither an existing machine nor an existing language, but for a made-up or virtual machine.
The code for the virtual machine (VM) can either be interpreted, or compiled to the machine code of the platform just before execution. The latter technique is called just-in-time (JIT) compilation; the VM code is the source language of the JIT compiler, which must be fast.
Either way, the virtual machine code (as text or in binary) is portable to any machine that has an interpreter or JITter for the VM.
This idea is many decades old, but was popularized by the JVM.

.NET and LLVM
Most VMs are designed to support just one language.
The VM used by Microsoft’s .NET is a partial exception to this (supporting too many source languages requires solving the Universal IR problem mentioned previously).
LLVM (originally the “low-level virtual machine”) was similarly designed to serve as a target for many different languages.
Today LLVM is much more than a VM—it is a compiler construction kit with many reusable technologies. A primary design criterion was support for source- and target-independent compiler optimization, something that is reflected in the LLVM IR and infrastructure.

Conventional Target Languages
Many compilers generate assembly language code, which is translated into relocatable machine code by the system’s assembler. Most other compilers generate relocatable machine code directly.
Generating relocatable code directly makes compilation faster (eliminating the need to write, read and parse a temporary file), but
requires the compiler as well as the assembler to know the format and encoding of instructions,
requires the compiler as well as the assembler and the linker to know the object file format, including relocation information,
requires some other means (for example, a disassembler) to allow the compiler’s output to be debugged.

Final Code Generation
The process of translating the IR to the actual target code needs to take care of the details of the target machine that the IR is designed to hide. It is thus often quite low-level and boring, but requires full knowledge of the target platform.
Usually, the final code generator turns a single IR instruction into either a single target instruction or a short sequence of instructions.
Consider the IR instruction if r1 < r2 goto L. If the target supports a compare-less-than-and-branch instruction, then we can translate this to a single target instruction. With other instruction sets, we must translate it to a comparison instruction followed by a conditional branch instruction.

Instruction Selection
Sometimes there are many ways to translate a single IR instruction. For example, on the Motorola 680x0, the IR instruction r1 := 0 can be translated into
(a) move r1,#0
(b) clr r1
(c) sub r1,r1
(d) ...
The compiler associates a cost with each alternative code sequence, and chooses the one with the best cost measure. The cost may be the size of the code sequence (measured in bytes), its execution time (measured in clock cycles), or a composite measure.
Unfortunately, on modern CPUs, the time taken to execute an instruction sequence depends on many external factors (e.g., the contents of the cache, and what the previous instructions in the pipeline are), so there may not be a choice that is always best.

Looking Up Variables
When generating code that accesses the value of a variable, we need to know all the places where the value is available. This can be any subset of the following:
in the variable’s memory location (its stack slot, or its slot in a rodata, data or bss section),
in one or more registers,
a known constant.
The code generator will prefer to pick up the value from a register, since the instruction sequence using a register will be faster and smaller than the alternatives. If the variable is not available in any register, loading a constant into a register is usually preferable to a memory access.

Pattern Matching
Many instruction sets have instructions that perform two or more operations. For example, on a MIPS machine, lw r1,16(r2) performs both an add and a load: it computes an address by adding 16 to r2, and loads the word at that address into r1.
If the two operations are performed in a single instruction in the IR, this can be translated to lw r1,16(r2) directly.
If the two operations are performed in two or more separate instructions in the IR, then the final code generator should recognize that the pattern of use of those IR instructions is one that matches the pattern implemented by a particular machine instruction.
Since patterns can be described by grammars, the pattern matching is often done using parsing technology.

Pattern Matching: An Example
Consider translating the following IR instructions to MIPS code:
    r3 := 16; r4 := r2 + r3; r1 := *r4
You can implement the first IR instruction with addi r3,r0,16 because on the MIPS machines r0 always contains zero.
You can implement the first two IR instructions with addi r4,r2,16 provided r3 is not needed later.
You can implement all three IR instructions with lw r1,16(r2) provided r3 and r4 are not needed later.
Pattern matchers in code generators are therefore often greedy: they try to implement as many IR instructions as possible with a single target instruction, as long as doing so improves the cost.
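As a rough illustration, here is a sketch in C of how a greedy matcher might collapse those three IR instructions into a single lw; the IR representation, the field names and the dead_after liveness query are hypothetical, not the subject's actual data structures:

    #include <stdio.h>

    /* Hypothetical three-address IR: rd := imm, rd := rs1 + rs2, rd := *rs1. */
    typedef enum { IR_CONST, IR_ADD, IR_LOAD } IrOp;
    typedef struct { IrOp op; int rd, rs1, rs2; long imm; } IrInstr;

    /* Assumed to come from a separate liveness analysis. */
    extern int dead_after(int reg, int instr_index);

    /* At position i, try to match  rC := c; rA := rB + rC; rD := *rA  and emit
       a single "lw rD,c(rB)". Returns the number of IR instructions consumed. */
    int try_match_load(const IrInstr *ir, int i, int n)
    {
        if (i + 2 < n
            && ir[i].op == IR_CONST
            && ir[i+1].op == IR_ADD  && ir[i+1].rs2 == ir[i].rd
            && ir[i+2].op == IR_LOAD && ir[i+2].rs1 == ir[i+1].rd
            && dead_after(ir[i].rd, i + 2)        /* r3 not needed later */
            && dead_after(ir[i+1].rd, i + 2)) {   /* r4 not needed later */
            printf("lw r%d,%ld(r%d)\n", ir[i+2].rd, ir[i].imm, ir[i+1].rs1);
            return 3;
        }
        return 0;
    }

A matcher in this style would try the longest pattern first and fall back to shorter ones (the addi forms, then separate instructions) when the liveness conditions fail.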

Greedy Pattern Matching
Greedy pattern matchers often create code no ordinary sane human would write. Example: the instruction leal (%eax,%edx,4), %eax on x86. This performs eax := eax + 4*edx (eax and edx are the names of two x86 registers).
The x86 has several complex addressing modes. One computes an address by adding together two registers and a constant. However, once you compute the address, you don’t have to use it to look up memory; you can just save the address in a register. That is what the load effective address instructions do, of which leal is one.
The leal instruction is faster as well as smaller than two separate shift and addition instructions, because it uses a specialized fast shifter that can only shift left by 1, 2 or 3 bits.

Optimizations
Compilers can perform optimizations both on IR and on the generated target code. Usually, each optimization is done on only one representation, but some compilers perform similar optimizations on both the IR and the target code.
An optimization that works on the target code has a chance to optimize details that the IR keeps hidden. An optimization that does not need to access those details is better done on the IR.
An optimization that makes the IR smaller reduces the size of the input to later compiler passes. If there are many later passes and the optimization is fast, it may “pay for itself”.

Optimizations: Strength Reduction
Look for opportunities, in loops, to replace a more expensive operation by a cheaper one:

    k := 15; i := 0;
    while (i < N) do
        a[i] := k * i;
        i := i + 1;
    od

Here a multiplication has been replaced by an addition:

    x := 0; k := 15; i := 0;
    while (i < N) do
        a[i] := x;
        x := x + k;
        i := i + 1;
    od

Exploiting Algebraic Identities
The identity x ∗ 2 = x + x allows us to replace mul r1, r2, 2 with add r1, r2, r2, which is probably faster, but we won’t count that as strength reduction.
Better yet, utilize x ∗ 2 = x ≪ 1 (a left shift). More subtly, utilize, say, x ∗ 15 = (x ≪ 4) − x.
Other algebraic laws about commutativity, associativity, distributivity, and so on, may help us reorganise expressions so that operations are performed in a better order.

Reordering of Operations
Let array a[lo..hi] be laid out in memory from position start. The address of its ith element is start + (i − lo) ∗ eltsize, where eltsize is the size of an element of a. But

    start + (i − lo) ∗ eltsize = start + i ∗ eltsize − lo ∗ eltsize
                               = start − lo ∗ eltsize + i ∗ eltsize

Now t = start − lo ∗ eltsize can be computed once and for all at compile time, leaving only t + (i ∗ eltsize) to be performed at runtime.
However, when reordering operations, the compiler must be careful about overflows; an identity that holds in Z or R may not hold for fixed-size integers or floats.
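As a small illustration in C (the function names are ours; assume byte addressing via char *), the reordered form lets the subtraction be folded into a single precomputed value:

    /* Address of a[i] for an array a[lo..hi] laid out from 'start'. */
    char *addr_direct(char *start, long i, long lo, long eltsize)
    {
        return start + (i - lo) * eltsize;   /* subtract, multiply, add per access */
    }

    /* After reordering, t = start - lo*eltsize depends only on the array,
       so the compiler can compute it once (often at compile time). */
    char *addr_reordered(char *t, long i, long eltsize)
    {
        return t + i * eltsize;              /* multiply and add per access */
    }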

Quiz: Algebraic Laws of Machine Arithmetic
With x and y being signed integers, can we replace the C code on the left with that on the right?

    if (x > y)              if (x - y > 0)
        x = 42;                 x = 42;

No! In signed machine arithmetic, x > y does not imply x − y > 0. Why not? Give an example.

With 4 bits, take x = 4, y = −6. Then x > y holds, but the true difference 10 does not fit in 4 bits and wraps around to −6, so x − y > 0 is false.
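The same effect can be demonstrated in C; a small sketch using 8-bit values so the wrap-around is easy to see (narrowing an out-of-range result to int8_t is implementation-defined, but wraps on common two's-complement platforms):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int8_t x = 100, y = -100;               /* x > y certainly holds */
        int8_t diff = (int8_t)(x - y);          /* true difference 200 wraps to -56 */
        printf("x > y     : %d\n", x > y);      /* prints 1 */
        printf("x - y > 0 : %d\n", diff > 0);   /* prints 0 */
        return 0;
    }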

Quiz: Algebraic Laws of Machine Arithmetic
Can we replace the C code on the left with that on the right?
    if (x < 0)              if (x < 0)
        x = -x;                 x = -x;
    if (x < 0)
        x = 42;

No! In two’s complement machine arithmetic, −m = m has two solutions! One is m = 0; what is the other?

Peephole Optimization
A cheap way to optimize code sequences is to peep at them through a small sliding window, and try to fit the instructions in that window into known patterns. Code that fits a source pattern is then replaced with equivalent code that is shorter and/or faster, created from the target pattern associated with the source pattern.

    store X(Y), rN                    store X(Y), rN
    load  rN, X(Y)        =>

    store X(Y), rN                    store X(Y), rN
    load  rM, X(Y)        =>          move  rM, rN
Compiler writers can find new patterns by looking at the generated code. However, it may be preferable to fix the code generator so that it does not emit suboptimal code in the first place.
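To make the mechanism concrete, here is a minimal sketch in C of a two-instruction peephole window applying the two store/load patterns above; the instruction representation is hypothetical:

    /* store: mem[addr] := reg;   load: reg := mem[addr];   move: reg := src. */
    typedef enum { OP_STORE, OP_LOAD, OP_MOVE } Op;
    typedef struct { Op op; int reg; int addr; int src; } Instr;

    /* Slide a window over the code, rewriting in place; returns the new length. */
    int peephole(Instr *code, int n)
    {
        int out = 0;
        for (int i = 0; i < n; i++) {
            code[out++] = code[i];
            if (out >= 2
                && code[out-2].op == OP_STORE
                && code[out-1].op == OP_LOAD
                && code[out-1].addr == code[out-2].addr) {
                if (code[out-1].reg == code[out-2].reg) {
                    out--;                           /* load rN, X(Y): drop it */
                } else {
                    code[out-1].op  = OP_MOVE;       /* load rM, X(Y): use move rM, rN */
                    code[out-1].src = code[out-2].reg;
                }
            }
        }
        return out;
    }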

Quiz: Peephole Optimization
The following uses patterns like rX for arbitrary Oz registers. Is this a valid peephole optimization for Oz?

    add_int rX, rA, rA              int_const rX, 4
    add_int rY, rX, rA       =>     mul_int   rZ, rA, rX
    add_int rZ, rY, rA

Why or why not?

It is not a valid optimization, because the second instruction sequence is not equivalent in all respects to the first instruction sequence: while it leaves the same value in rZ, it leaves different values in rX and rY.
Besides, a single multiplication is likely to be slower than three additions.

Jump Optimizations
Branch and jump instructions can be quite expensive on current CPUs, because they disrupt the flow of instructions through pipelines.
Some branching can be improved by peephole optimization:
    if Cond then goto L1            if !Cond then goto L2
    goto L2                   =>    L1: Instr
    L1: Instr
Removing jumps-to-jumps needs a separate optimization pass, because the knowledge it needs does not fit in a peephole window:
    if Cond goto L1                 if Cond goto L2
    ...                       =>    ...
    L1: goto L2                     L1: goto L2
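A sketch of how such a pass might work (the representation is hypothetical: targets[l] holds the label that label l's instruction unconditionally jumps to, or -1 otherwise):

    /* Follow a chain of gotos L1 -> L2 -> ... to its final destination,
       guarding against cycles of gotos. */
    int final_target(const int *targets, int n_labels, int label)
    {
        int steps = 0;
        while (targets[label] != -1 && steps++ < n_labels)
            label = targets[label];
        return label;
    }
    /* The pass then rewrites every branch to jump to
       final_target(targets, n_labels, its_label) directly. */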