COMPILER OPTIMISATION II: Instruction scheduling
Introduction
• We will consider a set of optimisations which a typical optimising compiler might perform.
• We will illustrate many transformations at the source level.
• It is important to remember that the compiler makes these transformations at the level of individual machine instructions.
Programmer’s perspective:
These are (largely) optimisations which you would expect a compiler to do, and should very rarely be hand-coded.
• Instruction scheduling optimisations
• Loop unrolling
• Variable expansion
• Register renaming
• Software pipelining
Why is instruction scheduling important?
• Compiler code-generator translates IR into sequences of hardware instructions
• Most modern processors use register/load-store instruction sets.
• Load/store operations go between registers and memory
• Other operations read/write registers
• Most instructions take many cycles to complete but are pipelined
• Processor can continue to issue new instructions until it needs to use the result of an incomplete operation (the pipeline stalls waiting for the source register to become ready)
Dependency graph
• Values in registers represent the dependencies between instructions (forms a graph)
• Instructions can be issued in any order that preserves dependency order.
• Compiler wants to choose schedule to leave enough time between issue and result-use to prevent pipeline stall.
Dependency graph
• A = (B*C) + (D*E)
• Instructions are graph nodes
• Values (registers) are graph edges
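As a sketch of the idea, the pseudo-instructions for A = (B*C) + (D*E) and their dependence edges can be encoded as a small graph, and a proposed issue order checked against it. The instruction names and the encoding below are invented for illustration:

```c
#include <assert.h>

/* Pseudo-instruction sequence for A = (B*C) + (D*E).
   Each instruction is a node; an edge u -> v means v reads a
   register written by u. */
enum { LD_B, LD_C, MUL1, LD_D, LD_E, MUL2, ADD, ST_A, N_INSTR };

/* dep[v][u] != 0 means v depends on (must be issued after) u */
static const int dep[N_INSTR][N_INSTR] = {
    [MUL1] = { [LD_B] = 1, [LD_C] = 1 },  /* B*C reads both loads */
    [MUL2] = { [LD_D] = 1, [LD_E] = 1 },  /* D*E reads both loads */
    [ADD]  = { [MUL1] = 1, [MUL2] = 1 },  /* sum reads both products */
    [ST_A] = { [ADD]  = 1 },              /* store reads the sum */
};

/* Check that a proposed issue order preserves every dependence. */
int schedule_ok(const int *order, int n) {
    int pos[N_INSTR];
    for (int i = 0; i < n; i++) pos[order[i]] = i;
    for (int v = 0; v < N_INSTR; v++)
        for (int u = 0; u < N_INSTR; u++)
            if (dep[v][u] && pos[v] <= pos[u]) return 0;
    return 1;
}
```

For example, hoisting the loads of D and E above MUL1 (so they hide its latency) is a legal reordering, while issuing ADD before MUL2 is not.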
Non-pipelined instructions
• Similar considerations apply for non-pipelined instructions
• Many processors include special purpose functional units like sqrt unit.
• sqrt often does not pipeline but other instructions can still be issued while sqrt progresses.
• Compiler needs to put other instructions between sqrt issue and use of result
• Has to find instructions that don’t require the sqrt unit as these would also block
Out of order execution
• Some processors support out-of-order execution
• Processor analyses the fetched instruction stream for dependencies and can issue instructions out-of-order
• Especially useful for running legacy binaries.
• Schedule generated by the compiler is less critical
• However transformations to increase the range of possible schedules also help hardware scheduler.
• Out-of-order execution takes a lot of hardware
• Current trend is towards larger number of simple cores.
Basic Blocks
• A Basic Block is a linear sequence of instructions with a single entry point and a single exit point.
• Execution always starts at the beginning of the block.
• Every time the block is executed, every instruction in the sequence is executed.
• Typically a compiler will break a program into its basic blocks and schedule the instructions in each block independently.
• Block boundaries are formed by:
• Conditional branches (including loops)
• Subroutine calls
• Return statements
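The standard way to find these boundaries is the "leader" algorithm: mark the first instruction, every branch target, and every instruction after a branch; each leader then starts a new basic block. A minimal sketch, with an instruction encoding invented for illustration:

```c
#include <assert.h>

/* Toy instruction stream: each entry is either a fall-through
   instruction or a branch with the given target index.
   This encoding is made up for illustration. */
typedef struct { int is_branch; int target; } Instr;

/* Mark basic-block leaders: the first instruction, every branch
   target, and every instruction following a branch. */
void find_leaders(const Instr *code, int n, int *leader) {
    for (int i = 0; i < n; i++) leader[i] = 0;
    if (n > 0) leader[0] = 1;
    for (int i = 0; i < n; i++) {
        if (code[i].is_branch) {
            if (code[i].target >= 0 && code[i].target < n)
                leader[code[i].target] = 1;   /* target starts a block */
            if (i + 1 < n)
                leader[i + 1] = 1;            /* fall-through starts one too */
        }
    }
}
```

Each block then runs from one leader up to (but not including) the next, matching the single-entry/single-exit property above.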
Basic block scheduling
• For a basic block of instructions, try to find the order which minimises execution time while preserving the ordering of dependent instructions. For example:
• Construct dependency graph for basic block.
• Label each node with the maximum possible delay from the node to the end of the block.
• At each stage, there is a set of candidate nodes all of whose predecessors have been scheduled.
• Choose candidate with largest possible delay.
• May include many other heuristics (e.g. suitable neighbouring instructions for superscalar/VLIW execution).
Instruction scheduling
Compiler has to try to:
• Minimise the number of instructions executed
• Minimise the cost of pipeline stalls
• reducing data hazards, control hazards, structural hazards
• Enable multiple issue (superscalar or VLIW)
• In principle, instructions can also be explicitly coalesced into SIMD instructions, though most compilers instead start by recognising vectorisable loops and use an alternative code generator that parallelises over loop iterations.
• Look at some basic techniques first…
• Branches can cause stalls to the instruction fetch
• Processor does not know which branch to fetch until later in pipeline.
• Two common architectural solutions:
• Delayed branches
• A branch delay slot occurs where a branch instruction is defined as taking place some number of cycles after the instruction is issued.
• Relies on the compiler
• Speculative execution
• Processor starts executing down one (or possibly both) branches
• Needs to “roll-back” instructions on the branch not taken
• Relies on more complex hardware
Branch scheduling
• Compiler optimisation for when delayed branches are used
• Finding instructions to fit into branch delay slots
• Instruction must not affect the branch, nor be a branch itself, nor stall the pipeline.
• Try to find an instruction which can be executed regardless of whether the branch is taken or not
• otherwise insert null instructions
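A much-simplified sketch of the legality test: the candidate instruction must not write the register the branch depends on. The one-writes-two-reads instruction model below is invented for illustration; a real compiler also checks dependences against everything the instruction moves past:

```c
#include <assert.h>

/* Toy instruction: writes register dst, reads src1/src2 (-1 = unused). */
typedef struct { int dst, src1, src2; } Op;

/* ops[0..n-1] are followed by a branch that tests register
   branch_src.  Try only the instruction immediately before the
   branch: it may move into the delay slot if it does not produce
   the value the branch tests. */
int pick_delay_slot(const Op *ops, int n, int branch_src) {
    if (n > 0 && ops[n - 1].dst != branch_src)
        return n - 1;
    return -1;   /* no safe candidate: the compiler inserts a nop */
}
```

If the last instruction feeds the branch condition, no candidate is found and a nop fills the slot, as the bullet above describes.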
Reorder Buffer
• Possible implementation of speculative execution
• Processors may have a special set of registers to support out-of-order and speculative execution
• Re-order Buffer (ROB): holds results until instructions commit to the architectural registers
• Allows retiring instructions in the correct order
• Allows roll back of speculative execution
• Often combined with a reservation station to enable re-use of data between pipeline stages without using a register
Loop unrolling
• Loops with small bodies generate small basic blocks of assembly code
• a lot of dependencies between instructions
• high branch frequency
• little scope for good instruction scheduling
• Loop unrolling is a technique for increasing the size of the loop body
• gives more scope for better schedules
• reduces branch frequency
• makes more independent instructions available for multiple issue.
Loop unrolling
• Replace loop body by multiple copies of the body
• Modify loop control
• take care of arbitrary loop bounds
• Number of copies is called unroll factor
do i = 1, n
   a(i) = b(i) + d*c(i)
end do

becomes, with unroll factor 4:

do i = 1, n-3, 4
   a(i)   = b(i)   + d*c(i)
   a(i+1) = b(i+1) + d*c(i+1)
   a(i+2) = b(i+2) + d*c(i+2)
   a(i+3) = b(i+3) + d*c(i+3)
end do
do j = i, n
   a(j) = b(j) + d*c(j)
end do
• Remember that this is in fact done by the compiler at the IR or assembly code level.
• If the loop iterations are independent, then we end up with a larger basic block with relatively few dependencies, and more scope for scheduling.
• also reduces the number of compare and branch instructions
• Choice of unroll factor is important (usually 2, 4, 8)
• if factor is too large, can run out of registers
• May help if loop blocks correspond to cache lines
• Cannot unroll loops with complex flow control
• hard to generate code to jump out of the unrolled version at the right place
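The same transformation in C (the compiler performs it on the IR or assembly, as noted above; the function name is illustrative) shows the unrolled body plus the cleanup loop for arbitrary bounds:

```c
#include <assert.h>

/* a(i) = b(i) + d*c(i), unrolled by 4, with a cleanup loop so that
   any value of n is handled correctly. */
void axpy_unrolled(double *a, const double *b, const double *c,
                   double d, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* unrolled body, factor 4 */
        a[i]     = b[i]     + d * c[i];
        a[i + 1] = b[i + 1] + d * c[i + 1];
        a[i + 2] = b[i + 2] + d * c[i + 2];
        a[i + 3] = b[i + 3] + d * c[i + 3];
    }
    for (; i < n; i++)                 /* cleanup for leftover iterations */
        a[i] = b[i] + d * c[i];
}
```

The four statements in the body are mutually independent, so the scheduler can interleave their loads, multiplies, and stores freely, and one compare-and-branch now covers four iterations.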
Outer loop unrolling
• If we have a loop nest, then it is possible to unroll one of the outer loops instead of the innermost
• Can improve locality.

do i = 1, n
   do j = 1, m
      a(i,j) = c*d(j)
   end do
end do
2 loads for 1 flop
do i = 1, n, 4
   do j = 1, m
      a(i,j)   = c*d(j)
      a(i+1,j) = c*d(j)
      a(i+2,j) = c*d(j)
      a(i+3,j) = c*d(j)
   end do
end do
5 loads for 4 flops
Value of c held in register
2 loads for 4 flops with CSE, but the compiler needs to know a and d are not aliases
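A C analogue of the unrolled nest (names and fixed sizes are illustrative; note C is row-major, so the store stride differs from the Fortran original). Writing the common subexpression by hand makes the point concrete: d(j) is loaded once per inner iteration and reused by all four statements, which a compiler may only do itself if it can prove a and d do not alias:

```c
#include <assert.h>

#define N 8   /* outer (unrolled) extent; divisible by 4 for brevity */
#define M 8   /* inner extent */

void scale_unrolled(double a[N][M], const double *d, double c) {
    for (int i = 0; i + 3 < N; i += 4) {
        for (int j = 0; j < M; j++) {
            double dj = d[j];          /* d(j) loaded once ... */
            a[i][j]     = c * dj;      /* ... reused by 4 statements */
            a[i + 1][j] = c * dj;      /* c stays in a register */
            a[i + 2][j] = c * dj;
            a[i + 3][j] = c * dj;
        }
    }
}
```

Per inner iteration this is one load and four stores for four multiplies, versus one load and one store per multiply in the original nest.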
Variable expansion
• Variable expansion can help break dependencies in unrolled loops
• improves scheduling opportunities
• Close connection to reduction variables in parallel loops
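A minimal sketch of variable expansion on a sum reduction (the factor-2 unrolling and names are illustrative): the single accumulator is a loop-carried dependence, so it is expanded into two partial sums whose chains share no registers, then combined after the loop:

```c
#include <assert.h>

/* Sum reduction with the accumulator expanded into two partials. */
double sum_expanded(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];        /* the two chains are independent, */
        s1 += a[i + 1];    /* so they can be scheduled freely */
    }
    double s = s0 + s1;    /* combine the partial sums */
    for (; i < n; i++)     /* cleanup for odd n */
        s += a[i];
    return s;
}
```

Note that this reassociates the floating-point additions, so the rounding may differ from the sequential sum; compilers generally need permission (e.g. a fast-math mode) to do this for floating-point reductions.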