Static Scheduling & VLIW 15-740
Prof. Nathan Beckmann
(Original slides by Onur Mutlu, edited by Seth Goldstein)
Carnegie Mellon University
Reprise of dynamic scheduling
2
DO WE REALLY NEED ALL THIS COMPLEX HARDWARE?
HOW FAR CAN WE GET WITHOUT IT?
3
Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations?
e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, …
Q3. How do we increase the instruction fetch rate?
i.e., have the ability to fetch more instructions per cycle
4
Key Questions
Q1. How do we find independent instructions to fetch/execute?
Q2. How do we enable more compiler optimizations?
e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, …
Q3. How do we increase the instruction fetch rate?
i.e., have the ability to fetch more instructions per cycle
A: Enabling the compiler to optimize across a larger number of instructions that will execute straight-line (without branches getting in the way) eases all of the above
5
Very long instruction word – VLIW
n Compiler does the scheduling statically
n Simple hardware with multiple function units
q Reduced hardware complexity
q Little or no scheduling done in hardware, e.g., in-order
q Hopefully, faster clock and less power
n Compiler required to group and schedule instructions (compare to OoO superscalar)
q Predicated instructions to help with scheduling (trace, etc.)
q More registers (for software pipelining, etc.)
6
VLIW example – 2-cycle loads
n RISC code
MUL R1, R3, 3
LD R4, 0(R1)
ADD R2, R2, R4
SUB R3, R3, 1
BNEZ R3, -4
n VLIW code
MUL R1, R3, 3
LD R4, 0(R1)
NOP
ADD R2, R2, R4
SUB R3, R3, 1
NOP
NOP
BNEZ R3, -4
7
Very long instruction word – VLIW
n Compiler does the scheduling statically
n Simple hardware with multiple function units
n Compiler required to group and schedule instructions (compare to OoO superscalar)
n Example machines:
q Multiflow, Cydra 5 (8-16 ops per VLIW)
q IA-64 Itanium (3 ops per bundle, “EPIC” VLIW variant)
q TMS32xxxx (5+ ops per VLIW, DSPs)
q Crusoe (4 ops per VLIW, JIT x86 → VLIW)
q Tilera TILEPro (3 ops per VLIW, 100+ cores)
8
Comparison between SS ↔ VLIW
From Mark Smotherman, “Understanding EPIC Architectures and Implementations”
Comparison: CISC, RISC, VLIW
VLIW is a natural extension of RISC ideas to superscalar
VLIW relies on compiler for ILP
11
VLIW: Finding Independent Operations
n Within a basic block, there is limited instruction-level parallelism
n To find multiple instructions to be executed in parallel, the compiler needs to consider multiple basic blocks
n Problem: Moving an instruction above a branch is unsafe because instruction is not guaranteed to be executed
n Idea: Enlarge blocks at compile time by finding the frequently-executed paths
q Trace scheduling
q Superblock scheduling
q Hyperblock scheduling
q Software pipelining
It’s all about the compiler and how to schedule the instructions to maximize parallelism
12
List Scheduling: For 1 basic block
n Idea: Assign priority to each instruction (a C sketch of this loop follows below)
n Initialize a ready list that holds all ready instructions
n Choose one ready instruction I from the ready list with the highest priority
n Insert I into the schedule
q Ensuring resources are available (structural hazards)
n Add those instructions whose precedence constraints are now satisfied into the ready list
13
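The following is a minimal C sketch of the list-scheduling loop above. It assumes a single-issue machine (the slides use a 4-wide VLIW), unit resources, and a precomputed priority array; the graph data (pred_count, succ, priority) is a made-up example, not taken from the slides.

#include <stdio.h>

#define N_OPS 5

/* Hypothetical precedence graph: pred_count[i] = # of unscheduled predecessors,
 * succ[i][] = successors of op i (terminated by -1), priority[i] = e.g. DPG height. */
static int pred_count[N_OPS] = {0, 1, 1, 2, 1};
static int succ[N_OPS][N_OPS] = { {1, 2, -1}, {3, -1}, {3, -1}, {4, -1}, {-1} };
static int priority[N_OPS]   = {4, 3, 3, 2, 1};

int main(void) {
    int scheduled = 0, cycle = 0;
    int done[N_OPS] = {0};

    while (scheduled < N_OPS) {
        /* Pick the ready op (all predecessors scheduled) with highest priority. */
        int best = -1;
        for (int i = 0; i < N_OPS; i++)
            if (!done[i] && pred_count[i] == 0 &&
                (best < 0 || priority[i] > priority[best]))
                best = i;

        if (best >= 0) {   /* issue it (resource check trivial here: 1 op/cycle) */
            printf("cycle %d: op %d\n", cycle, best);
            done[best] = 1;
            scheduled++;
            /* Successors whose precedence constraints are now satisfied
             * become ready on a later pass. */
            for (int j = 0; succ[best][j] >= 0; j++)
                pred_count[succ[best][j]]--;
        }
        cycle++;
    }
    return 0;
}

A wider machine would simply pick up to W ready ops per cycle and check a per-resource table before issuing each one.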
Data Precedence Graph
[Figure: data precedence graph with nodes i1–i16; most edges have latency 2, and the edges into i15 have latency 4.]
14
Instruction Prioritization Heuristics
n Number of descendants in precedence graph
n Maximum latency from root node of precedence graph
n Length of operation latency
n Ranking of paths based on importance
n Some combination of above
15
VLIW List Scheduling
n Assign priorities – critical path in this example
n Compute data ready list (DRL) – all operations whose predecessors
have been scheduled.
n Select from DRL in priority order while checking resource constraints
n Add newly ready operations to DRL and repeat for the next instruction
[Figure: example DAG (nodes 1–13) scheduled onto a 4-wide VLIW. The data ready list evolves as {1} → {2, 3, 4, 5, 6} → {2, 7, 8, 9} → {10, 11, 12} → {13}.]
16
VLIW List Scheduling
[Figure: the same example DAG (nodes 1–13) with its edge latencies, being scheduled onto a 4-wide VLIW.]
17
VLIW List Scheduling
[Figure: the resulting schedule with no structural hazards: cycle 1: {1}; cycle 2: {3, 4, 5, 6}; cycle 3: {2, 7, 8, 9}; cycle 4: {10, 11, 12}; cycle 5: {13}.]
18
VLIW List Scheduling+Structural Hazards
[Figure: the same example DAG and 4-wide VLIW, now scheduled while respecting structural hazards (limited function units per slot type).]
19
VLIW List Scheduling+Structural Hazards
[Figure: the resulting schedule with structural hazards: some operations that issued together in the unconstrained schedule are pushed to later cycles, lengthening the schedule.]
20
Extending the scheduling domain
n Basic block is too small to get any real parallelism
q Recall: 88% of OoO benefit comes from speculation → larger scheduling window
n How to extend the basic block?
q Why do we have basic blocks in the first place?
q Loops
n Loop unrolling
n Software pipelining
q Non-loops
n Will almost always involve some speculation
n Thus profiling may be very important
[MTZ’13]
21
Safety and Legality in Code Motion
n Two characteristics of speculative code motion:
q Safety: whether or not spurious exceptions may occur
q Legality: whether or not the result will always be correct
n Four possible types of code motion:
[Figure: four code-motion cases built from the fragments r1 = r2 & r3, r1 = load A, r4 = r1 …, and r1 = …: (a) safe and legal, (b) illegal (the hoisted result overwrites a value still needed on another path), (c) unsafe (the hoisted load may fault), (d) unsafe and illegal.]
22
Code Movement Constraints
n Downward
q When moving an operation from a BB to one of its dest BBs,
n all the other dest basic blocks should still be able to use the result of the operation
n the other source BBs of the dest BB should not be disturbed
n Upward
q When moving an operation from a BB to its source BBs
n register values required by the other dest BBs must not be destroyed
n the movement must not cause new exceptions
23
Trace Scheduling
n Trace: A frequently executed path in the control-flow graph (has multiple side entrances and multiple side exits)
n Idea: Find independent operations within a trace to pack into VLIW instructions.
q Traces determined via profiling
q Compiler adds fix-up code for correctness (if a side entrance or side exit of a trace is exercised at runtime, corresponding fix-up code is executed)
24
Trace Scheduling Idea
25
Trace Scheduling (II)
n There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
n These control-flow transitions are ignored during trace scheduling.
n After scheduling, fix-up/bookkeeping code is inserted to ensure the correct execution of off-trace code.
n Fisher, “Trace scheduling: A technique for global microcode compaction,” IEEE TC 1981.
26
Trace Scheduling (III)
[Figure: a trace Instr 1 … Instr 5 with a side entrance; after scheduling, the order becomes Instr 2, Instr 3, Instr 4, Instr 1, Instr 5.]
What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?
27
Trace Scheduling (IV)
[Figure: bookkeeping for the move: copies of Instr 3 and Instr 4 are placed on the side-entrance path, which now enters the reordered trace (Instr 2, Instr 3, Instr 4, Instr 1, Instr 5) at a later point.]
28
Trace Scheduling (V)
[Figure: a trace Instr 1 … Instr 5 with a side entrance; after scheduling, Instr 5 is moved up: Instr 1, Instr 5, Instr 2, Instr 3, Instr 4.]
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?
29
Trace Scheduling (VI)
[Figure: bookkeeping for the move: a copy of Instr 5 is placed on the side-entrance path so that entries from the side still execute it.]
30
Trace Scheduling Fixup Code Issues
n Sometimes need to copy instructions more than once to ensure correctness on all paths (see C below)
[Figure: original trace A–B–C–D–E with side blocks X and Y; the scheduled trace reorders the blocks, and correctness requires compensation copies (A', B', C', B'', C''', D'', E'') on the off-trace paths, so block C ends up copied more than once.]
31
Trace Scheduling Overview
n Trace Selection (see the sketch after this slide)
q select seed block (the highest frequency basic block)
q extend trace (along the highest frequency edges)
n forward (successor of the last block of the trace)
n backward (predecessor of the first block of the trace)
q don't cross loop back edges
q bound max_trace_length heuristically
n Trace Scheduling
q build data precedence graph for a whole trace
q perform list scheduling and allocate registers
q add compensation code to maintain semantic correctness
n Speculative Code Motion (upward)
q move an instruction above a branch if safe
32
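A minimal C sketch of the trace-selection step above, assuming block and edge execution counts come from profiling and growing only forward along the most frequent successor edge (a real implementation also grows backward from the seed). The profile data and helper arrays below are illustrative placeholders.

#include <stdio.h>

#define MAX_BLOCKS 8
#define MAX_TRACE  8

/* Hypothetical profile: count[b] = execution count of block b,
 * best_succ[b] = most frequently taken successor (-1 if none),
 * is_back_edge[b] = 1 if the edge b -> best_succ[b] is a loop back edge. */
static int count[MAX_BLOCKS]        = {10, 990, 10, 1000, 800, 200, 1000, 1000};
static int best_succ[MAX_BLOCKS]    = { 1,   3,  3,    4,   6,   6,    7,   -1};
static int is_back_edge[MAX_BLOCKS] = {0};
static int in_trace[MAX_BLOCKS]     = {0};

int main(void) {
    /* 1. Seed: the highest-frequency block not yet in a trace. */
    int seed = 0;
    for (int b = 1; b < MAX_BLOCKS; b++)
        if (count[b] > count[seed]) seed = b;

    /* 2. Grow forward along the most frequent edges, bounded by MAX_TRACE. */
    int trace[MAX_TRACE], len = 0, b = seed;
    while (b >= 0 && len < MAX_TRACE && !in_trace[b]) {
        trace[len++] = b;
        in_trace[b] = 1;
        int s = best_succ[b];
        if (s < 0 || is_back_edge[b]) break;   /* don't cross loop back edges */
        b = s;
    }

    printf("trace:");
    for (int i = 0; i < len; i++) printf(" B%d", trace[i]);
    printf("\n");
    return 0;
}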
Trace Scheduling Example (I)
[Figure: control-flow graph and selected trace for the example. Blocks B1–B7 with edge frequencies 990/10 and 800/200. The trace (frequent path) is:
fdiv f1, f2, f3; fadd f4, f1, f5; beq r1, $0 — 9 stalls
ld r2, 0(r3)
add r2, r2, 4; beq r2, $0
fsub f2, f2, f6; st.d f2, 0(r8) — 1 stall
add r3, r3, 4; add r8, r8, 4
Off-trace blocks contain ld r2, 4(r3) and fsub f2, f3, f7. Liveness annotations: r2 and f2 are not live out (near B3); f2 is not live out (near B6).]
33
Trace Scheduling Example (I)
[Figure: the trace after list scheduling:
fdiv f1, f2, f3; beq r1, $0
ld r2, 0(r3); fsub f2, f2, f6
add r2, r2, 4; beq r2, $0
st.d f2, 0(r8)
add r3, r3, 4; add r8, r8, 4; fadd f4, f1, f5
The 9-stall fdiv–fadd gap is hidden; only 1 stall remains in the trace.]
34
Trace Scheduling Example (II)
[Figure: split compensation code: because fadd f4, f1, f5 was moved below the side exits, a copy of it is placed on each exit path (toward B3 and toward B6).]
35
Trace Scheduling Example (III)
[Figure: join compensation code: copies of add r3, r3, 4 and add r8, r8, 4 are placed on the path that joins the trace from the side.]
36
Trace Scheduling Example (IV)
[Figure: copied split instructions: the trace instructions below the side entrance (add r2, r2, 4; beq r2, $0; fsub f2, f2, f6; st.d f2, 0(r8); add r3, r3, 4; add r8, r8, 4) are duplicated for the entering path.]
37
Trace Scheduling Example (V)
[Figure: the complete code after trace scheduling: the scheduled trace plus the off-trace paths (through ld r2, 4(r3) and fsub f2, f3, f7) with all compensation copies in place.]
38
Trace Scheduling Tradeoffs
n Advantages
+ Enables finding more independent instructions → fewer NOPs in a VLIW instruction
n Disadvantages
— Profile dependent
— What if the dynamic path deviates from the trace → lots of NOPs in the VLIW instructions
— Code bloat and additional fix-up code executed
— Due to side entrances and side exits
— Infrequent paths interfere with the frequent path
— Effectiveness depends on the bias of branches
— Unbiased branches → smaller traces → less opportunity for finding independent instructions
39
Superblock Scheduling
n Trace: multiple entry, multiple exit block
n Superblock: single-entry, multiple exit block
q A trace whose side entrances have been eliminated
q Infrequent paths do not interfere with the frequent path
+ More optimization/scheduling opportunity than traces
+ Eliminates “difficult” bookkeeping due to side entrances
Hwu+, “The Superblock: An Effective Technique for VLIW and superscalar compilation,” J of SC 1991.
40
Superblock example
Original code: opA: mul r1,r2,3 branches with weights 99/1; the frequent path (99) falls through to opC: mul r3,r2,3, while the infrequent path (1) executes opB: add r2,r2,1 and then joins at opC.
Code after superblock formation: tail duplication removes the side entrance; the superblock is opA → opC (weight 99), and the infrequent path gets its own duplicated copy opC': mul r3,r2,3.
Code after common subexpression elimination: inside the superblock r2 is unchanged between opA and opC, so opC becomes mov r3,r1; the duplicate opC' on the infrequent path still recomputes mul r3,r2,3.
41
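As a C-level rendering of the transformation on this slide (a sketch only; r1/r2/r3 stand in for registers and unlikely_cond for the 1%-taken branch):

#include <stdio.h>

/* Original: opC sits at a control-flow join, so r2 may have been updated. */
static int original(int r2, int unlikely_cond) {
    int r1 = r2 * 3;            /* opA */
    if (unlikely_cond)          /* taken ~1% of the time */
        r2 = r2 + 1;            /* opB */
    int r3 = r2 * 3;            /* opC: cannot be simplified at the join */
    return r3 + r1;
}

/* After superblock formation (tail duplication) + CSE: on the frequent path
 * r2 is unchanged between opA and opC, so opC becomes a copy (mov r3, r1);
 * the infrequent path keeps its own duplicated copy opC'. */
static int superblock(int r2, int unlikely_cond) {
    int r1 = r2 * 3;            /* opA */
    int r3;
    if (unlikely_cond) {
        r2 = r2 + 1;            /* opB  */
        r3 = r2 * 3;            /* opC' (duplicated tail) */
    } else {
        r3 = r1;                /* opC after CSE */
    }
    return r3 + r1;
}

int main(void) {
    printf("%d %d\n", original(5, 0), superblock(5, 0));   /* frequent path   */
    printf("%d %d\n", original(5, 1), superblock(5, 1));   /* infrequent path */
    return 0;
}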
Superblock Scheduling Shortcomings
— Still profile-dependent
— No single frequently executed path if there is an unbiased branch
— Reduces the size of superblocks
— Code bloat and additional fix-up code executed
— Due to side exits
42
Hyperblock Scheduling
n Idea: Use predication support to eliminate unbiased branches and increase the size of superblocks
n Hyperblock: A single-entry, multiple-exit block with internal control flow eliminated using predication (if-conversion)
n Advantages
+ Reduces the effect of unbiased branches on scheduled block size
n Disadvantages
— Requires predicated execution support
— All disadvantages of predicated execution
43
Hyperblock Formation (I)
n Hyperblock formation:
1. Block selection
2. Tail duplication
3. If-conversion
n Block selection
q Select subset of BBs for inclusion in HB
q Difficult problem
q Weighted cost/benefit function
n Height overhead
n Resource overhead
n Dependency overhead
n Branch elimination benefit
n Weighted by frequency
n Mahlke et al., "Effective Compiler Support for Predicated Execution Using the Hyperblock," MICRO 1992.
[Figure: example CFG with blocks BB1–BB6 and edge frequencies (10/90, 80/20, …) used for block selection.]
44
Hyperblock Formation (II)
Tail duplication, same as with superblock formation.
[Figure: BB6 is duplicated as BB6' so that the off-hyperblock path through BB5 no longer side-enters the block; edge frequencies are redistributed (81/9, 10/19, …).]
45
Hyperblock Formation (III)
If-convert (predicate) the intra-hyperblock branches. (A C sketch of if-conversion follows this slide.)
[Figure: the resulting hyperblock: BB1 computes p1, p2 = CMPP; BB2 executes if p1; BB3 executes if p2; BB4 and BB6 follow; BB5 and the duplicated BB6' remain outside the hyperblock.]
46
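C has no architectural predicate registers, but the effect of if-conversion can be sketched with explicit boolean "predicates" guarding each operation. This is only an illustration of the semantics, not how a predicated ISA encodes it; the block and variable names are made up.

#include <stdio.h>

int main(void) {
    int x = 7, a = 0, b = 0;

    /* Before if-conversion: a branchy diamond ("BB2" / "BB3"). */
    if (x > 5)          /* branch to be eliminated */
        a = x * 2;      /* "BB2" */
    else
        b = x + 1;      /* "BB3" */

    /* After if-conversion: compute predicates once (p1, p2 = CMPP),
     * then execute both blocks' operations guarded by their predicate;
     * the control flow inside the hyperblock is gone. */
    int p1 = (x > 5);
    int p2 = !p1;
    int a2 = 0, b2 = 0;
    if (p1) a2 = x * 2;  /* "BB2 if p1" -- a predicated op, not a branch, on real HW */
    if (p2) b2 = x + 1;  /* "BB3 if p2" */

    printf("%d %d | %d %d\n", a, b, a2, b2);
    return 0;
}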
WHAT ABOUT MEMORY?
47
Non-Faulting Loads and Exception Propagation
[Figure: unsafe code motion. Before: …; br; ld r1=[a]; use=r1. After: the load is hoisted above the branch as ld.s r1=[a]; at its original place a chk.s r1 guards use=r1, with recovery code re-executing ld r1=[a].]
n ld.s fetches speculatively from memory, i.e., any exception due to ld.s is suppressed
n If ld.s r1 did not cause an exception, then chk.s r1 is a NOP; else a branch is taken (to execute some compensation code)
48
Non-Faulting Loads and Exception Propagation in IA-64
[Figure: unsafe code motion where the use is hoisted too. Before: …; br; ld r1=[a]; use=r1. After: ld.s r1=[a] and use=r1 both move above the branch; a chk.s use remains at the original location, with recovery code re-executing ld r1=[a]; use=r1.]
n Load data can be speculatively consumed prior to the check
n "Speculation" status is propagated with the speculated data
n Any instruction that uses a speculative result also becomes speculative itself (i.e., suppressed exceptions)
n chk.s checks the entire dataflow sequence for exceptions
49
Aggressive ST-LD Reordering in IA-64
[Figure: before: inst 1; inst 2; …; st [?]; …; ld r1=[x]; use=r1 — the stores and the load potentially alias. After: ld.a r1=[x] is hoisted above the stores; ld.c r1=[x]; use=r1 remain at the original location.]
n ld.a starts the monitoring of any store to the same address as the advanced load
n If no aliasing has occurred since ld.a, ld.c is a NOP
n If aliasing has occurred, ld.c re-loads from memory
50
Aggressive ST-LD Reordering in IA-64
[Figure: as above, but use=r1 is hoisted as well: ld.a r1=[x]; …; use=r1 execute above the potentially aliasing stores; a chk.a at the original location branches to recovery code that re-executes the load and its uses if aliasing occurred.]
51
Can We Do Better?
n Hyperblock still
q Profile dependent
q Requires fix-up code
q And, requires predication support
n Single-entry, single-exit enlarged blocks
q Block-structured ISA
n Optimizes multiple paths (can use predication to enlarge blocks)
n No need for fix-up code (duplication instead of fixup)
52
Block Structured ISA
n Blocks (groups of instructions) are atomic (all-or-none) operations
q Either all of the block is committed or none of it
n Compiler enlarges blocks by combining basic blocks with their control flow successors
q Branches within the enlarged block are converted to "fault" operations → if the fault operation evaluates to true, the block is discarded and the target of the fault is fetched
Melvin and Patt, “Enhancing Instruction Scheduling with a Block-Structured ISA,” IJPP 1995.
53
Block Structured ISA (II)
n Advantages:
+ Larger atomic blocks → larger units can be fetched from I-cache
+ Aggressive compiler optimizations (e.g. reordering) can be enabled within atomic blocks (no side entries or exits)
+ Can explicitly represent dependencies among operations within an enlarged block
n Disadvantages:
— “Fault operations” can lead to wasted work (atomicity)
— Code bloat (multiple copies of the same basic block exists in the binary and possibly in I-cache)
— Need to predict which enlarged block comes next
n Optimizations
q Within an enlarged block, the compiler can perform optimizations that cannot normally/easily be performed across basic blocks
54
Block Structured ISA (III)
n Hao et al., “Increasing the instruction fetch rate via block- structured instruction set architectures,” MICRO 1996.
55
Superblock vs. BS-ISA
n Superblock
q Single-entry, multiple exit code block
q Not atomic
q Compiler inserts fix-up code on superblock side exit
n BS-ISA blocks
q Single-entry, single exit
q Atomic
q Need to roll back to the beginning of the block on fault
56
Superblock vs. BS-ISA
n Superblock
+ No ISA support needed
— Optimizes for only 1 frequently executed path
— Not good if the dynamic path deviates from the profiled path → missed opportunity to optimize another path
n Block Structured ISA
+ Enables optimization of multiple paths and their dynamic selection
+ Dynamic prediction to choose the next enlarged block; can dynamically adapt to changes in frequently executed paths at run-time
+ Atomicity can enable more aggressive code optimization
— Code bloat becomes severe as more blocks are combined
— Requires "next enlarged block" prediction, ISA+HW support
— More wasted work on "fault" due to atomicity requirement
57
Summary and Questions
n Trace, superblock, hyperblock, block-structured ISA
n How many entries, how many exits does each of them have?
q What are the corresponding benefits and downsides?
n What are the common benefits?
q Enable and enlarge the scope of code optimizations
q Reduce fetch breaks; increase fetch rate
n What are the common downsides?
q Code bloat (code size increase)
q Wasted work if control flow deviates from enlarged block’s path
58
VLIW Summary
n Heavy reliance on compiler (push RISC to the extreme)
q Compiler algorithms (e.g., software pipelining) have lasting impact outside of VLIW
n Is there enough statically knowable parallelism?
q E.g., memory aliasing and branch bias
n What about wasted FUs? Code bloat?
q Code size is already a big problem with x86 apps!
n Architecture joke: "VLIW is the architecture of the future, and always will be."
n Yet many DSPs are VLIW. Why?
59
WHAT ABOUT LOOPS?
60
The problem with loops
n Consider the following code:
for (int i = 0; i < N; i++) {
b[i] = b[i] * b[i];
}
61
The problem with loops
n Consider the following code:
for (int i = 0; i < N; i++) {
b[i] = b[i] * b[i];
}
n RISC assembly (LD and MUL take 2 cycles)
LOOP:
LD R1, 0(R3)        ; useful work
MUL R2, R1, R1      ; useful work
ST R2, 0(R3)        ; useful work
ADD R3, R3, 4       ; loop overhead
BLT R3, R4, LOOP    ; loop overhead
→ 7 cycles per iteration
62
The problem with loops
n Consider the following code:
for (int i = 0; i < N; i++) {
b[i] = b[i] * b[i];
}
n VLIW assembly (1 ALU slot, 1 LD/ST slot; 2-cycle LD and MUL)
LOOP:
{ NOP               | LD R1, 0(R3) }
{ NOP               | NOP }
{ MUL R2, R1, R1    | NOP }
{ ADD R3, R3, 4     | NOP }
{ BLT R3, R4, LOOP  | ST R2, -4(R3) }
→ 5 cycles per iteration (ADD/BLT are loop overhead; the rest is useful work)
63
Amortize overheads by unrolling loops
n Key idea is to schedule the following code instead:
for (int i = 0; i < N; i+=4) {
b[i+0] = b[i+0] * b[i+0];
b[i+1] = b[i+1] * b[i+1];
b[i+2] = b[i+2] * b[i+2];
b[i+3] = b[i+3] * b[i+3];
}
Loop unrolling
→ Larger scheduling block → better schedule
64
Amortize overheads by unrolling loops
n VLIW assembly (1 ALU slot, 1 LD/ST slot)
LOOP:
{ NOP                | LD R1, 0(R9) }
{ NOP                | LD R3, 4(R9) }
{ MUL R2, R1, R1     | LD R5, 8(R9) }
{ MUL R4, R3, R3     | LD R7, 12(R9) }
{ MUL R6, R5, R5     | ST R2, 0(R9) }
{ MUL R8, R7, R7     | ST R4, 4(R9) }
{ ADD R9, R9, 16     | ST R6, 8(R9) }
{ BLT R9, R10, LOOP  | ST R8, -4(R9) }
→ 2 cycles per iteration (only ADD/BLT remain as loop overhead)
65
Correctness of loop unrolling
n Is this transformation legal?
for (int i = 0; i < N; i++) {
b[i] = b[i] * b[i];
}
for (int i = 0; i < N; i+=4) {
b[i+0] = b[i+0] * b[i+0];
b[i+1] = b[i+1] * b[i+1];
b[i+2] = b[i+2] * b[i+2];
b[i+3] = b[i+3] * b[i+3];
}
66
Correctness of loop unrolling
n Instead, schedule the following code:
int i;
for (i = 0; i+3 < N; i+=4) {
b[i+0] = b[i+0] * b[i+0];
b[i+1] = b[i+1] * b[i+1];
b[i+2] = b[i+2] * b[i+2];
b[i+3] = b[i+3] * b[i+3];
}
for (; i < N; i++) {
b[i] = b[i] * b[i];
}
67
Loop unrolling summary
n Advantages
q Reduces loop overhead
q Improves code schedule within the loop (e.g., hiding MUL latency in the example)
n Disadvantages
q Increases code size
q Less effective on loops with internal branches
n Can use predication ... more on this later
68
Loop Unrolling
n Idea: Replicate loop body multiple times within an iteration
+ Reduces loop maintenance overhead
q Induction variable increment or loop condition test
+ Enlarges basic block (and analysis scope)
q Enables code optimization and scheduling opportunities
-- What if iteration count is not a multiple of the unroll factor? (need extra code to detect this)
-- Increases code size
69
Can we do better?
n VLIW assembly (1 ALU + 1 LD/ST): the same 4×-unrolled schedule as before
→ 2 cycles per iteration
Not in this case – the LD/ST unit is at 100% utilization
70
Can we do better?
n VLIW assembly (3-wide)
LOOP: the same 4×-unrolled body (LD R1/R3/R5/R7, MUL R2/R4/R6/R8, ST R2/R4/R6/R8, ADD R9, R9, 16, BLT R9, R10, LOOP) scheduled into 7 long instructions, with NOPs filling unused slots
→ 7/4 cycles per iteration (ADD/BLT are the only loop overhead)
71
Can we do better?
n VLIW assembly (3-wide), unrolled 8×: eight LDs (offsets 0–28(R31)), eight MULs, eight STs, ADD R31, R31, 32, and BLT R31, R10, LOOP packed into 11 long instructions
→ 11/8 cycles per iteration ... but code & register bloat
n The same schedule, annotated: the first few long instructions (only loads issuing) and the last few (only stores issuing) are ramp-up & ramp-down overhead around a perfectly packed middle where every slot does useful work
n Goal: maintain peak efficiency, without the ramp-up/ramp-down
74
Software Pipelining
n Idea: Move instructions across iterations of the loop
q Very large improvements in running time are possible
n E.g., 5-wide VLIW
[Figure: the software-pipelined schedule. The steady-state loop body packs into one long instruction the LD R1 of the current iteration, the MUL R2, R1, R1 of one iteration ago, the ST R2 of two iterations ago, plus ADD R9, R9, 4 and BLT R9, R10, LOOP. A prologue (LD, MUL, SUB R10, R10, 4, ADD, BGE R9, R10, END) fills the pipeline; an epilogue (the remaining MUL and STs) drains it.]
→ 1 cycle per iteration in the steady state: perfect efficiency
15-745 © Seth Copen Goldstein 2000-5 76
Goal of SP
n Increase distance between dependent operations by moving destination operation to a later iteration
A: a ← ld [d]
B: b ← a * a
C: st [d], b
D: d ← d + 4
Assume all have latency of 2.
[Figure: dependence chain A → B → C, with D feeding the next iteration's A.]
15-745 © Seth Copen Goldstein 2000-5 77
Can we decrease the latency?
n Let's unroll:
A:  a ← ld [d]
B:  b ← a * a
C:  st [d], b
D:  d ← d + 4
A1: a ← ld [d]
B1: b ← a * a
C1: st [d], b
D1: d ← d + 4
[Figure: the dependence graph now contains two copies of the chain, serialized through d and the reused names a, b.]
15-745 © Seth Copen Goldstein 2000-5 78
Rename variables
A:  a  ← ld [d]
B:  b  ← a * a
C:  st [d], b
D:  d1 ← d + 4
A1: a1 ← ld [d1]
B1: b1 ← a1 * a1
C1: st [d1], b1
D1: d  ← d1 + 4
[Figure: after renaming, the two iterations' chains are connected only through the d → d1 → d updates.]
15-745 © Seth Copen Goldstein 2000-5 79
Schedule
(Same renamed code as above.)
[Figure: scheduling the renamed, unrolled body: A and D issue together, A1 overlaps with B, B1 and D1 overlap with C, then C1 finishes — the two iterations overlap rather than running back to back.]
15-745 © Seth Copen Goldstein 2000-5 80
Unroll Some More
[Figures (several slides): unrolling the renamed body further (A2/B2/C2/D2, A3/…, A4/…) makes a repeating pattern visible in the schedule: after a short start-up, each row contains an A, a B, a C, and a D drawn from consecutive iterations. "One More Time" and "Can Rearrange"/"Rearrange": the rows can be rearranged so that exactly the same row pattern recurs every iteration; that recurring row becomes the software-pipelined loop body.]
15-745 © Seth Copen Goldstein 2000-5 86
SP Loop
A:  a  ← ld [d]
B:  b  ← a * a
D:  d1 ← d + 4
A1: a1 ← ld [d1]
D1: d2 ← d1 + 4        ← Prolog
C:  st [d], b
B1: b1 ← a1 * a1
A2: a2 ← ld [d2]
D2: d  ← d2 + 4        ← Body (repeats)
B2: b2 ← a2 * a2
C1: st [d1], b1
C2: st [d2], b2        ← Epilog
[Figure: the body's instructions come from three different iterations executing concurrently.]
15-745 © Seth Copen Goldstein 2000-5 87
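A C-level sketch of the same prolog/body/epilog structure for the earlier b[i] = b[i]*b[i] loop, assuming N >= 1. It keeps two iterations in flight (load running ahead of the multiply and store), which mirrors what the VLIW schedule does with its rotating copies; the variable names are illustrative.

#include <stdio.h>

#define N 8

int main(void) {
    int b[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Prolog: start iteration 0 (load only). */
    int cur = b[0];
    int sq;

    /* Body: each pass finishes iteration i-1 (multiply + store)
     * while starting iteration i (load) -- two iterations overlap. */
    for (int i = 1; i < N; i++) {
        sq = cur * cur;   /* multiply from the previous iteration */
        int nxt = b[i];   /* load for the current iteration       */
        b[i - 1] = sq;    /* store from the previous iteration    */
        cur = nxt;
    }

    /* Epilog: drain the last iteration. */
    b[N - 1] = cur * cur;

    for (int i = 0; i < N; i++) printf("%d ", b[i]);
    printf("\n");
    return 0;
}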
Goal of Software Pipelining
n Increase distance between dependent operations by moving destination operation to a later iteration
[Figure: dependencies A → B → C within the initial loop; after SP, A from iteration i+2, B from iteration i+1, and C from iteration i execute together, so dependent operations are separated by whole iterations.]
15-745 © Seth Copen Goldstein 2000-5 88
Goal of Software Pipelining
n Discover ILP across iterations!
[Figure: left, the overlapped iteration space A0 / A1 B0 / A2 B1 C0 / A3 B2 C1 / B3 C2 / C3; right, the rerolled form: prolog A0; A1, B0 — body Ai, Bi-1, Ci-2 — epilog Bi, Ci-1; Ci.]
15-745 © Seth Copen Goldstein 2000-5 89
Example
Assume an infinitely wide machine.
[Figure: prolog A0; A1, B0 — loop body Ai, Bi-1, Ci-2 — epilog Bi, Ci-1; Ci.]
15-745 © Seth Copen Goldstein 2000-5 90
Dealing with exit conditions
for (i = 0; i < N; i++) { Ai; Bi; Ci }   — original loop

   A0
   B0
   if (i+1 == N) goto last
   i = 1
   A1
   if (i+2 == N) goto epilog
   i = 2
loop:
   Ai
   Bi-1
   Ci-2
   i++
   if (i < N) goto loop
epilog:
   Bi
   Ci-1
last:
   Ci
done:
15-745 © Seth Copen Goldstein 2000-5 91
Loop Unrolling vs. Software Pipelining
For SuperScalar or VLIW
n Loop unrolling reduces loop overhead
n Software pipelining reduces fill/drain
n Best is if you combine them
[Figure: execution timeline comparing software pipelining and loop unrolling.]
15-745 © Seth Copen Goldstein 2000-5 92
TODO: More complicated SP example
93
Software Pipelining Approaches
n The first serious approach to software pipelining was presented by Aiken & Nicolau.
q Aiken’s 1988 Ph.D. thesis.
q Impractical as it ignores resource hazards (focusing only on data-dependence constraints)
n “Iterative Modulo Scheduling” Rau MICRO’94
q Addresses resource constraints
q Iterate over different loop lengths until one schedules successfully
q Compute loop lower/upper bounds from required & available resources
94
SYSTOLIC ARRAYS
95
Why Systolic Architectures?
n Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
n Similar to an assembly line
q Different people work on the same car
q Many cars are assembled simultaneously
q Can be two-dimensional
n Special purpose accelerators/architectures need
q Simple, regular designs (keep # unique parts small and regular)
q High concurrency → high performance
q Balanced computation and I/O (memory access)
96
Systolic Architectures
n H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.
Memory: heart; PEs: cells
Memory pulses data through cells
97
Systolic Architectures
n Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs → achieve high throughput w/o increasing memory bandwidth requirements
n Differences from pipelining:
q Array structure can be non-linear and multi-dimensional
q PE connections can be multidirectional (and different speed)
q PEs can have local memory and execute kernels (rather than a piece of the instruction)
98
Systolic Computation Example
99
Systolic Computation Example: Convolution
100
Systolic Computation: Convolution
[Figure: a linear systolic array for 3-tap convolution. Weights w1, w2, w3 are held in the cells; inputs x1, x2, x3, … stream in while outputs y1, y2, y3, … accumulate and stream out; successive snapshots show the xs and the partial ys advancing one cell per step.]
101
Systolic Computation: Convolution
[Figure (several slides): successive cycles of the same array: y1 forms as x1, x2, x3 meet w1, w2, w3; y2 forms one cycle later from x2, x3, x4; then y3 — each partial sum advances through the cells while new x values enter.]
106
Systolic Computation Example: Convolution
n Worthwhile to implement adder and multiplier separately to allow overlapping of add/multiply executions
107
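A minimal C simulation of a weight-stationary linear array like the one in the figures: each cell holds one weight and adds its product into the partial sum passing through it, while completed results leave from the last cell. This is the broadcast-input variant (each input reaches all cells in the same beat); the purely systolic designs pump x through the cells instead, but the arithmetic is identical. A sketch of the dataflow, not of any particular hardware.

#include <stdio.h>

#define K  3                 /* taps = number of cells */
#define NX 8                 /* number of input samples */
#define NY (NX - K + 1)      /* number of outputs */

int main(void) {
    int w[K]  = {2, 3, 4};                    /* weights, one per cell (stay put) */
    int x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
    int y[NY] = {0};

    int psum[K] = {0};       /* partial sums resident in the cells */

    for (int t = 0; t < NX; t++) {
        /* Partial sums advance one cell to the right each beat;
         * a fresh (zero) partial sum enters cell 0. */
        for (int c = K - 1; c > 0; c--) psum[c] = psum[c - 1];
        psum[0] = 0;

        /* x[t] reaches every cell this beat; each cell adds its product
         * into the partial sum currently passing through it. */
        for (int c = 0; c < K; c++) psum[c] += w[c] * x[t];

        /* The rightmost cell now holds a completed output. */
        if (t - K + 1 >= 0 && t - K + 1 < NY)
            y[t - K + 1] = psum[K - 1];
    }

    for (int i = 0; i < NY; i++)
        printf("y[%d] = %d\n", i, y[i]);      /* y[i] = w[0]*x[i] + w[1]*x[i+1] + w[2]*x[i+2] */
    return 0;
}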
TODO: Example relating SP to systolic architecture for some computation (maybe the convolution)
108
More Programmability
n Each PE in a systolic array
q Can store multiple “weights”
q Weights can be selected on the fly
q Eases implementation of, e.g., adaptive filtering
n Taken further
q Each PE can have its own data and instruction memory
q Data memory → to store partial/temporary results, constants
q Leads to stream processing, pipeline parallelism
n More generally, staged execution
109
Pipeline Parallelism
110
File Compression Example
111
Why pipeline parallelism in software?
n Pipeline parallelism vs data parallelism
q Why split pipeline stages across PEs?
q No cycle-time benefit like we got in hardware
n Data movement patterns differ
q Pipeline parallelism: move input data between PEs
q Data parallelism: move task code/data between PEs
n Tight feedback loops within a single stage
q E.g., compression or encryption
n Appropriate design depends on application
112
Systolic Array Summary
n Advantages
q Makes multiple uses of each data item → reduces data fetches
q High concurrency
q Regular design (both data and control flow)
n Disadvantages
q Not good at exploiting irregular parallelism
q Relatively special purpose → need software, programmer support to be a general purpose model
113
The WARP Computer
n HT Kung, CMU, 1984-1988
n Linear array of 10 cells, each cell a 10 MFLOP programmable processor
n Attached to a general purpose host machine
n High-level language and optimizing compiler to program the systolic array
n Used extensively to accelerate vision and robotics tasks
n Annaratone et al., "Warp Architecture and Implementation," ISCA 1986.
n Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987.
114
The WARP Computer
115
The WARP Computer
116
Resource Constraints
n Minimally indivisible sequences, i and j, can execute together if combined resources in a step do not exceed available resources.
n R is a resource configuration vector: R(i) is the number of units of resource i
n r is a resource usage vector s.t. 0 ≤ r(i) ≤ R(i)
n Each node in G has an associated r
15-745 © Seth Copen Goldstein 2000-5 117
SCALAR REPLACEMENT
118
Scalar Replacement: Example
DO I = 1, N
DO J = 1, M
A(I) = A(I) + B(J)
ENDDO
ENDDO
DO I = 1, N
T = A(I)
DO J = 1, M
T = T + B(J)
ENDDO
A(I) = T
ENDDO
n All loads and stores to A in the inner loop have been saved
n A(I) can be left in a register throughout the inner loop
n Superscalar + cache will get most of this, but will not allocate A(I) to a register
n High chance of T being allocated a register by the compiler
Scalar Replacement
n Convert array reference to scalar reference
n Approach is to use dependences to achieve these memory hierarchy transformations
Dependence and Memory Hierarchy
n True or Flow - save loads and cache misses
n Anti - save cache miss?
n Output - save stores
n Input - save loads
Example references: A(I) = ... + B(I); ... = A(I) + k; A(I) = ...; ... = B(I)
Dependence and Memory Hierarchy
n Loop-carried dependences: consistent dependences are the most useful for memory management purposes
n Consistent dependences - dependences with constant dependence distance
Using Dependences
n In the reduction example DO I = 1, N
DO J = 1, M
A(I) = A(I) + B(J)
ENDDO
ENDDO
DO I = 1, N
T = A(I)
DO J = 1, M
T = T + B(J)
ENDDO
A(I) = T
ENDDO
n True dependence - replace the references to A in the inner loop by scalar T
n Output dependence - store can be moved outside the inner loop
n Antidependence - load can be moved before the inner loop
Scalar Replacement
n Example: Scalar Replacement in case of loop independent dependence
DO I = 1, N
A(I) = B(I) + C
X(I) = A(I)*Q
ENDDO
DO I = 1, N
t = B(I) + C
A(I) = t
X(I) = t*Q
ENDDO
n One less load for each iteration for reference to A
Scalar Replacement
n Example: Scalar Replacement in case of loop carried dependence spanning single iteration
DO I = 1, N
A(I) = B(I-1)
B(I) = A(I) + C(I)
ENDDO
tB = B(0)
DO I = 1, N
tA = tB
A(I) = tA
tB = tA + C(I)
B(I) = tB
ENDDO
n One less load for each iter for ref to B which had a loop carried true dependence of 1 iter
n Also one less load per iter for reference to A
Scalar Replacement
n Example: Scalar Replacement in case of loop carried dependence spanning multiple iterations
DO I = 1, N
A(I) = B(I-1) + B(I+1)
ENDDO
t1 = B(0)
t2 = B(1)
DO I = 1, N
t3 = B(I+1)
A(I) = t1 + t3
t1 = t2
t2 = t3
ENDDO
n One less load for each iter for ref to B which had a loop carried input dependence of 2 iters
n Invariants maintained were t1 = B(I-1); t2 = B(I); t3 = B(I+1)
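The same rotating-scalar idea in C (a sketch only; B and A are plain arrays, and the loop bounds mirror the Fortran example with the indexing shifted to 0-based):

#include <stdio.h>

#define N 6

int main(void) {
    int A[N], B[N + 2] = {10, 20, 30, 40, 50, 60, 70, 80};

    /* Original: two loads of B per iteration
     * (written with a +1 offset so indices stay in range). */
    for (int i = 0; i < N; i++)
        A[i] = B[i] + B[i + 2];

    /* After scalar replacement: the loop-carried input dependence on B
     * (distance 2) lets us keep B[i-1], B[i], B[i+1] in scalars and do
     * only one new load of B per iteration. */
    int t1 = B[0], t2 = B[1], A2[N];
    for (int i = 0; i < N; i++) {
        int t3 = B[i + 2];        /* the only load of B in the loop body */
        A2[i] = t1 + t3;
        t1 = t2;                  /* rotate the scalars for the next iteration */
        t2 = t3;
    }

    for (int i = 0; i < N; i++)
        printf("%d %d\n", A[i], A2[i]);   /* the two versions agree */
    return 0;
}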
Aiken/Nicolau Scheduling Step 1
Perform scalar replacement to eliminate memory references where possible.
Before scalar replacement:
for i := 1 to N do
  a := j ⊕ V[i-1]
  b := a ⊕ f
  c := e ⊕ j
  d := f ⊕ c
  e := b ⊕ d
  f := U[i]
g: V[i] := b
h: W[i] := d
  j := X[i]

After scalar replacement (V[i-1] is the b stored by the previous iteration):
for i := 1 to N do
  a := j ⊕ b
  b := a ⊕ f
  c := e ⊕ j
  d := f ⊕ c
  e := b ⊕ d
  f := U[i]
g: V[i] := b
h: W[i] := d
  j := X[i]
15-745 © Seth Copen Goldstein 2000-5 127
Aiken/Nicolau Scheduling Step 2
Unroll the loop and compute the data-dependence graph (DDG).
DDG for the rolled loop:
for i := 1 to N do
  a := j ⊕ b
  b := a ⊕ f
  c := e ⊕ j
  d := f ⊕ c
  e := b ⊕ d
  f := U[i]
g: V[i] := b
h: W[i] := d
  j := X[i]
[Figure: data-dependence graph over a, b, c, d, e, f, g, h, j, including the loop-carried edges.]
15-745 © Seth Copen Goldstein 2000-5 128
Aiken/Nicolau Scheduling Step 2, cont’d
DDG for the unrolled loop:
(same code as above)
[Figure: the DDG replicated per iteration (a1, b1, c1, d1, e1, f1, g1, h1, j1, a2, b2, …), with cross-iteration edges connecting the copies.]
15-745 © Seth Copen Goldstein 2000-5 129
Aiken/Nicolau Scheduling Step 3
Build a tableau of iteration number vs cycle time.
[Figure: tableau of iteration number (columns 1–6) vs. cycle (rows 1–15). f and j of every iteration can issue in cycle 1; iteration 1 schedules a, c, f, j in cycle 1, b, d in cycle 2, and e, g, h in cycle 3; each later iteration starts a few cycles after the previous one, forming a diagonal.]
Basically, you're emulating a superscalar with infinite resources, infinite register renaming, always predicting the loop-back branch: thus, just pure data dependence.
Aiken/Nicolau Scheduling Step 4
Find repeating patterns of instructions.
[Figure: in the tableau, after the first few cycles the same group of instructions recurs, shifted one iteration to the right each time (a fixed slope).]
Go back and relate the slopes to the DDG.
Aiken/Nicolau Scheduling Step 5
"Coalesce" the slopes.
[Figure: the start-up rows are compressed so that later iterations begin as early as their dependences allow, turning the tableau into a compact repeating schedule.]
Aiken/Nicolau Scheduling Step 6
Find the loop body and "reroll" the loop.
[Figure: the coalesced tableau partitioned into prologue/entry code, the repeating loop body, and epilogue/exit code.]
Aiken/Nicolau Scheduling Step 7
Generate code. (Assume a VLIW-like machine for this example; the instructions on each line are issued in parallel.)
[Figure: the generated code: a prologue computing the first iterations' values (a1, b1, c1, d1, e1, the first V[]/W[] stores, f1..f4 := U[..], j1..j4 := X[..], i := 3); a loop body at L: issuing one steady-state row per cycle (e.g., bi+1 := ai ⊕ fi; W[i] := di; ai+2 := ji+1 ⊕ bi+1; di := fi-1 ⊕ ci; ei := bi ⊕ di; ci+1 := ei ⊕ ji; V[i+1] := bi+1; fi+2 := U[i+2]; ji+2 := X[i+2]; i := i+1); and an epilogue finishing the last iterations (…; eN := bN ⊕ dN; V[N] := bN; W[N] := dN).]
n For an edge u → v, we must have s(v) - s(u) ≥ d(u,v)
15-745 © Seth Copen Goldstein 2000-5 149
Precedence Constraints
n Cyclic: constraint becomes a tuple <p, d>:
q p is the minimum iteration delay (or the loop-carried dependence distance)
q d is the delay
n For an edge u → v, we must have s(v) - s(u) ≥ d(u,v) - s·p(u,v)
n p ≥ 0
n If data dependence is
q within an iteration, p=0
q loop-carried across p iter boundaries, p>0
15-745 © Seth Copen Goldstein 2000-5 150
Iterative Approach
n Finding minimum S that satisfies the constraints is NP- Complete.
n Heuristic:
q Find lower and upper bounds for S
q foreach s from lower to upper bound:
n Schedule graph.
n If succeed, done
n Otherwise try again (with next higher s)
n Thus: “Iterative Modulo Scheduling” Rau MICRO’94
15-745 © Seth Copen Goldstein 2000-5 151
Iterative Approach
n Heuristic:
q Find lower and upper bounds for S
q foreach s from lower to upper bound
n Schedule graph.
n If succeed, done
n Otherwise try again (with next higher s)
n So the key difference:
q AN88 does not assume S when scheduling
q IMS must assume an S for each scheduling attempt to understand resource conflicts
15-745 © Seth Copen Goldstein 2000-5 152
Lower Bounds
n Resource Constraints: SR
q maximum over all resources of (# of uses) divided by (# available)
n Precedence Constraints: SE
q max delay over all cycles in the dataflow graph
In practice, one is easy, other is hard.
Tim’s secret approach: just use SR as lower bound, then do binary search for best S
15-745 © Seth Copen Goldstein 2000-5 153
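A tiny C sketch of the resource lower bound SR (often called ResMII): for each resource class, divide the number of operations needing it by the number of units, rounding up, and take the maximum. The data below mirrors the earlier example (5 ALU ops, 4 memory ops, one unit of each), which gives the 5-cycle resource bound quoted on that slide; the array layout is illustrative.

#include <stdio.h>

#define N_RES 2   /* e.g. 0 = ALU, 1 = memory unit */

int main(void) {
    int units[N_RES] = {1, 1};   /* available units of each resource      */
    int uses[N_RES]  = {5, 4};   /* ops in the loop body needing each one */

    int res_mii = 1;
    for (int r = 0; r < N_RES; r++) {
        int need = (uses[r] + units[r] - 1) / units[r];   /* ceiling division */
        if (need > res_mii) res_mii = need;
    }
    printf("ResMII (SR) = %d\n", res_mii);                /* 5 for this data */

    /* Then: step (or binary-search) candidate s values upward from this bound,
     * attempting a modulo schedule at each, until one succeeds. */
    return 0;
}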
Acyclic Example
[Figure: three operations a, b, c with dependence edges labeled <0,2>, <0,3>, <0,1>.]
Lower Bound: SR = 2; Upper Bound: 5
15-745 © Seth Copen Goldstein 2000-5 154
Lower Bound on s
• Assume 1 ALU and 1 MU (memory unit)
• Assume the latency of an op or load is 1 cycle
(Same loop as before: a := j ⊕ b; b := a ⊕ f; c := e ⊕ j; d := f ⊕ c; e := b ⊕ d; f := U[i]; g: V[i] := b; h: W[i] := d; j := X[i])
[Figure: the DDG with edges labeled <1,1> (loop-carried) and <0,1> (within an iteration).]
Resources => 5 cycles; Dependencies => 3 cycles
15-745 © Seth Copen Goldstein 2000-5 155
Scheduling data structures
To schedule for initiation interval s:
n Create a resource table with s rows and R columns
n Create a vector, s, of length n for the n instructions in the loop
q s[n] = the time at which n is scheduled, or NONE
n Prioritize instructions by some heuristic
q critical path (or cycle)
q resource critical
15-745 © Seth Copen Goldstein 2000-5 156
Scheduling algorithm
n Pick an instruction, n
n Calculate the earliest time due to dependence constraints: for all x = pred(n), earliest = max(earliest, s(x) + d(x,n) - s·p(x,n))
n Try to schedule n from earliest to (earliest + s - 1) s.t. resource constraints are obeyed
q Possible twist: deschedule a conflicting node to make way for n, maybe randomly, like simulated annealing
n If we fail, then this schedule is faulty (i.e. give up on this s)
15-745 © Seth Copen Goldstein 2000-5 157
Scheduling algorithm – cont.
n We now schedule n at earliest, i.e., s(n) = earliest
n Fix up schedule
q Successors, x, of n must be scheduled s.t. s(x) >= s(n) + d(n,x) - s·p(n,x); otherwise they are removed (descheduled) and put back on the worklist
n Repeat this some number of times until either
q succeed, then register allocate
q fail, then increase s
15-745 © Seth Copen Goldstein 2000-5 158
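Below is a compact C sketch of the scheduling loop just described, assuming one unit of each resource modeled by a modulo resource table, a precomputed priority order, and dependence edges carrying (p, d) pairs. It omits the deschedule-and-retry machinery (it simply fails and tries the next II), so it is a simplification meant only to make the earliest-time and modulo-slot computations concrete. The data reproduces the three-op loop used in the "Simplest Example" slides that follow; II = 2 fails the loop-carried edge check and II = 3 succeeds.

#include <stdio.h>

#define N_OPS 3
#define N_EDGES 4
#define NONE (-1)

/* a = b + c (adder);  b = a * a (multiplier);  c = a * 194 (multiplier).
 * Edge {u, v, p, d}: v depends on u with iteration distance p and delay d. */
struct edge { int u, v, p, d; };
static struct edge E[N_EDGES] = {
    {0, 1, 0, 1},   /* a -> b, same iteration */
    {0, 2, 0, 1},   /* a -> c, same iteration */
    {1, 0, 1, 1},   /* b -> a, next iteration */
    {2, 0, 1, 1},   /* c -> a, next iteration */
};
static int resource[N_OPS] = {0, 1, 1};   /* 0 = adder, 1 = multiplier */
static int prio[N_OPS]     = {0, 1, 2};   /* schedule a, then b, then c */

static int try_schedule(int ii, int s[N_OPS]) {
    int slot_used[16][2] = {{0}};          /* modulo resource table [ii][resource] */
    for (int i = 0; i < N_OPS; i++) s[i] = NONE;

    for (int k = 0; k < N_OPS; k++) {
        int n = prio[k], earliest = 0;
        for (int e = 0; e < N_EDGES; e++)   /* earliest time from scheduled preds */
            if (E[e].v == n && s[E[e].u] != NONE) {
                int t = s[E[e].u] + E[e].d - ii * E[e].p;
                if (t > earliest) earliest = t;
            }
        for (int t = earliest; t < earliest + ii; t++)   /* find a free modulo slot */
            if (!slot_used[t % ii][resource[n]]) {
                slot_used[t % ii][resource[n]] = 1;
                s[n] = t;
                break;
            }
        if (s[n] == NONE) return 0;
    }
    for (int e = 0; e < N_EDGES; e++)       /* verify every edge, incl. loop-carried */
        if (s[E[e].v] - s[E[e].u] < E[e].d - ii * E[e].p) return 0;
    return 1;
}

int main(void) {
    int s[N_OPS];
    for (int ii = 2; ii <= 8; ii++)         /* II = 2 is the lower bound here */
        if (try_schedule(ii, s)) {
            printf("II = %d: a@%d b@%d c@%d\n", ii, s[0], s[1], s[2]);
            return 0;
        }
    printf("no schedule found\n");
    return 0;
}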
Simplest Example
for () {
a = b+c
b = a*a
c = a*194 }
[Figure: DDG for a, b, c with edges labeled <1,1> (loop-carried) and <0,1>; resources: one unit of each of the two function-unit types.]
What is IIres? What is IIrec?
15-745 © Seth Copen Goldstein 2000-5 159
Simplest Example
[Worked example (several slides): modulo scheduling the body a = b+c; b = a*a; c = a*194 with II = 2, using a 2-row modulo resource table.
Operations are placed one at a time; loop-carried edges force already-placed operations to be rescheduled (e.g., earliest a = sigma(c) + delay(c) − II = 2 + 1 − 2 = 1), and with II = 2 the process does not converge to a legal schedule.
Lesson: the lower bound may not be achievable.]
15-745 © Seth Copen Goldstein 2000-5 165
Example
for i := 1 to N do
  a := j ⊕ b
  b := a ⊕ f
  c := e ⊕ j
  d := f ⊕ c
  e := b ⊕ d
  f := U[i]
g: V[i] := b
h: W[i] := d
  j := X[i]
Priorities: ?
[Figure: the DDG with edges labeled <1,1> and <0,1>.]
15-745 © Seth Copen Goldstein 2000-5 166
Example
(Same loop and DDG as above.)
Priorities: c, d, e, a, b, f, j, g, h
15-745 © Seth Copen Goldstein 2000-5 167
[Worked example (several slides), s = 5, one ALU and one memory unit, priorities c, d, e, a, b, f, j, g, h:
c, d, e are scheduled first at times 0, 1, 2, then a at 3 and b at 4.
Placing b invalidates e ("b causes b → e edge violation"), so e is descheduled; re-placing e at 7 then invalidates c ("e causes e → c edge violation"), so c and d are descheduled and re-placed at 5 and 6.
The remaining operations fill in, giving the final schedule f = 0, j = 1, a = 3, b = 4, c = 5, d = 6, e = 7, g = 7, h = 8, with the 5-row ALU/MU modulo resource table fully determined.]
15-745 © Seth Copen Goldstein 2000-5 176
Creating the Loop
n Create the body from the schedule
n Determine which iteration an instruction falls into
q Mark its sources and dest as belonging to that iteration
q Add moves to update registers
n Prolog fills in gaps at the beginning
q For each move we will have an instruction in the prolog, and we fill in dependent instructions
n Epilog fills in gaps at the end
[Table: the schedule from the previous example: f = 0, j = 1, a = 3, b = 4, c = 5, d = 6, e = 7, g = 7, h = 8.]
f0 = U[0]; j0 = X[0];
FOR i = 0 to N
  f1 := U[i+1]
  j1 := X[i+1]
  nop
  a := j0 ⊕ b
  b := a ⊕ f0
  c := e ⊕ j0
  d := f0 ⊕ c
  e := b ⊕ d
h: W[i] := d
  f0 = f1
  j0 = j1
g: V[i] := b
15-745 © Seth Copen Goldstein 2000-5 178
Conditionals
n What about internal control structure, i.e., conditionals?
n Three approaches
q Schedule both sides and use conditional moves
q Schedule each side, then make the body of the conditional a macro op with an appropriate resource vector
q Trace schedule the loop
15-745 © Seth Copen Goldstein 2000-5 179
What to take away
n Architecture includes compiler!
n Dependence analysis is very important (including alias analysis)
n Software pipelining crucial for statically scheduled, but also very useful for dynamically scheduled
15-745 © Seth Copen Goldstein 2000-5 180
Multiflow: An early VLIW architecture (1987)
EPIC – Intel IA-64 Architecture
n Gets rid of lock-step execution of instructions within a VLIW instruction
n Idea: More ISA support for static scheduling and parallelization
q Specify dependencies within and between VLIW instructions (explicitly parallel)
+ No lock-step execution
+ Static reordering of stores and loads + dynamic checking
— Hardware needs to perform dependency checking (albeit aided by software)
— Other disadvantages of VLIW still exist
n Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, Sep/Oct 2000.
183
IA-64 Instructions
n IA-64 “Bundle” (~EPIC Instruction)
q Total of 128 bits
q Contains three IA-64 instructions
q Template bits in each bundle specify dependencies within a bundle
n IA-64 Instruction
q Fixed-length 41 bits long
q Contains three 7-bit register specifiers
q Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
184
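The sizes quoted above fit together as 5 + 3 × 41 = 128 (the 5 being the leftover template bits). The C sketch below just packs a bundle with that layout to make the arithmetic concrete; the field placement within the 128 bits is illustrative, not the actual IA-64 encoding.

#include <stdio.h>
#include <stdint.h>

/* One 128-bit bundle: a 5-bit template plus three 41-bit instruction slots,
 * stored here as two 64-bit halves. Field placement is illustrative only. */
struct bundle {
    uint64_t lo, hi;
};

static struct bundle pack(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    struct bundle b;
    /* bits 0-4: template, bits 5-45: slot 0, bits 46-86: slot 1, bits 87-127: slot 2 */
    b.lo  = (uint64_t)(tmpl & 0x1f);
    b.lo |= (s0 & ((1ULL << 41) - 1)) << 5;
    b.lo |= (s1 & ((1ULL << 41) - 1)) << 46;          /* low 18 bits of slot 1 */
    b.hi  = (s1 & ((1ULL << 41) - 1)) >> 18;          /* remaining 23 bits     */
    b.hi |= (s2 & ((1ULL << 41) - 1)) << 23;
    return b;
}

int main(void) {
    printf("bits: 5 + 3*41 = %d\n", 5 + 3 * 41);      /* 128 */
    struct bundle b = pack(0x1c, 0x1ffffffffffULL, 0x0aaaaaaaaaaULL, 0x15555555555ULL);
    printf("lo=%016llx hi=%016llx\n",
           (unsigned long long)b.lo, (unsigned long long)b.hi);
    return 0;
}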
IA-64 Instruction Bundles and Groups
n Groups of instructions can be executed safely in parallel
q Marked by “stop bits”
n Bundles are for packaging
q Groups can span multiple bundles
n Alleviates recompilation need somewhat
185
Template Bits
n Specify two things
q Stop information: Boundary of independent instructions
q Functional unit information: Where should each instruction be routed
186