Module Outline:
Module 2: High Performance Techniques— Scoreboard
• How to Optimize the Pipeline? Extract More Parallelism! • Compiler-Directed (Static) Approaches
– VLIW
– EPIC
– Superscalar
– Software Pipelining
• A Dynamic Approach—Scoreboard
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 1
Pipelining: Can we somehow make CPI closer to 1?
• Let’s assume full pipelining
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 2
FP Loop: Where are the hazards?
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 3
FP Loop Showing Stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 4
Revised FP Loop Minimizing Stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 5
Unroll Loop Four Times (straightforward way)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 6
Unrolled Loop That Minimizes Stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 7
Getting CPI < 1: Processing Multiple Instructions/Cycle
• Use parallel processing!!
• Two main variations: superscalar and VLIW
• Superscalar: varying number of instructions/cycle (1 to 6)
- parallelism and dependencies determined/resolved by HW
- IBM PowerPC 604, Sun UltraSPARC, DEC Alpha 21164, HP 7100
• Very Long Instruction Words (VLIW): fixed number of instructions (16) determined by compiler
- pipeline is exposed; compiler must schedule delays to get right results • Explicit Parallel Instruction Computer (EPIC)/Intel
- 128 bit packets containing 3 instructions (can execute sequentially)
- can link 128 bit packets together to allow more parallelism
- compiler determines parallelism, HW checks dependencies and forwards/stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 8
Parallelism: Overt vs. Covert
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 9
Parallelism: Overt vs. Covert
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 10
Problem vs. Program
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 11
Compilers for Covert Parallelism
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 12
Compilers for Covert Parallelism
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 13
Compilers for Covert Parallelism
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 14
Compilers for Covert Parallelism
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 15
Compilers for Covert Parallelism
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 16
Compilers for Covert Parallelism
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 17
Unrolled Loop
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 18
Compilation and ISA
• Efficient compilation requires knowledge of the pipeline structure - latency and bandwidth of each operation type
• But a good ISA transcends several implementations with different pipelines
- should things like a delayed branch be in an ISA?
- should a compiler use the properties of one implementation when compiling for an ISA? - do we need a new interface?
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 19
An Alternative
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 20
Very Long Instruction Word (VLIW) Computers
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 21
Pros:
Cons:
• Very simple hardware
• Lockstep execution (static schedule)
- no dependency detection
- simple issue logic
- just ALUs and register files
- very sensitive to long latency operations (cache misses)
• Potentially exploits large amounts of ILP
• Global register file hard to build • Lots of NO-OPs
ELEC6036 (Vincent TAM)
Module 2: High Performance Techniques—Scoreboard Page 22
VLIW Pros and Cons
- poor “code density”
- I-cache capacity and bandwidth
compromised
• Must recompile sources to deliver potential
• Implementation visible through ISA
EPIC: Explicit Parallel Instruction Computer
128-bit instructions:
• three 3-address operations
op1 op2 op3 tmp
• a template that encodes dependencies
• 128 general registers
• predication
• speculative load (data prediction)
• Example: IA-64 of Intel/HP
pred op rd
rs1 rs2 const
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard
Page 23
Getting CPI < 1: Issuing Multiple Instructions/Cycle
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 24
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
- fetch 64 bits/clock cycle; integer on left, FP on right
- can only issue 2nd instruction if 1st instruction issues
- more ports for FP registers to do FP load & FP op in a pair
• 1 cycle load delay expands to 3 instructions in SS
- instruction in right half can’t use it, nor instructions in next slot
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 25
Loop Unrolling in Superscalar
• Unrolled 5 times to avoid delays (+1 due to SS) • 12 clocks, or 2.4 clocks per iteration
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 26
Another Example: Multiple Issue
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 27
Another Example: Multiple Issue—Rescheduled Code
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 28
- exactly 50% FP operations - no hazards
Limits of Superscalar
• While integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
• If more instructions issue at same time, greater difficulty of decode and issue
- even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– the long instruction word has room for many operations
– by definition, all the operations the compiler puts in the long instruction word can execut
in parallel
– e.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch;
16 to 24 bits for each of these fields ==> 7 x 16 or 112 bits to 7 x 24 or 168 bits wide – need compiling technique that schedules across several branches
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 29
Loop Unrolling in VLIW
• Need more registers in VLIW (EPIC ==> 128 integer + 128 FP)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 30
Software Pipelining
• Observation: if iterations from loops are independent, then can get more ILP (instruction level parallelism) by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (i.e., Tomasulo algorithms in SW)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 31
Software Pipelining Example
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 32
Software Pipelining with Loop Unrolling in VLIW
• 9 results in 9 cycles, or 1 clock per iteration
• average: 3.3 ops per clock, 66% efficiency
• Note: need less registers for software pipelining (only using 7 registers here, was using 15)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 33
Multiple Issue:
• more complex issue logic
Multiple Issue as compared with VLIW
– check dependencies
– check structural hazards
– issue variable number of instructions (0-N) – shift unissued instructions over
• Able to run existing binaries
– recompile for performance, not correctness
• Datapaths identical
– but bypass requires detection
• Neither VLIW or multiple-issue can schedule around run-time variation in instruction latency
– cache misses
• Dealing with run-time variation requires run-time or dynamic scheduling
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 34
The Problem with Static Scheduling (Compile-Time)
In-Order Execution:
• an unexpected long latency blocks ready instructions from executing (scheduled code cannot be changed at run-time)
• binaries need to be rescheduled (recompiled) for each new processor implementation
• small number of named registers becomes a bottleneck
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 35
• Why in HW at run time?
Can we use HW to get CPI closer to 1?
– works when can’t know real dependencies at compile time – compiler simpler; avoid recompilation also!
– code for one machine runs well on another
• Dynamic scheduling?!
• Key ideas: allow instructions behind stall to proceed:
DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14
• Out-of-order execution ==> out-of-order completion • Disadvantages?
– complexity
– precise interrupts harder!
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 36
• How do we prevent WAR and WAW hazards? • How do we deal with variable latency?
– forwarding for RAW hazards harder
Problems?
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 37
Scoreboard: a bookkeeping technique
• Out-of-order execution divides ID stage:
– 1. Issue—decode instructions, check for structural hazards
– 2. Read operands—wait until no data hazards, then read operands
• Scoreboards date to CDC6600 designed in 1963
• Instructions execute whenever not dependent on previous instructions and no hazards
• CDC6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
– no forwarding!
– imprecise interrupt/exception model for now
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 38
Scoreboard Architecture (CDC6600)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 39
Scoreboard Implications
• Out-of-order completion ==> WAR, WAW hazards?
• Solutions for WAR:
– stall writeback until registers have been read
– read registers only during Read Operands stage
• No register renaming!
• Need to have multiple instructions in execution phase ==> multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued
• Scoreboard replaces ID, EX, WB with 4 stages
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 40
Four Stages of Scoreboard Control
• Issue—decode instructions & check for structural hazards (ID1)
– instructions issued in program order (for hazard checking)
– don’t issue if structural hazard
– don’t issue if instruction is output dependent on any previously issued but uncompleted
instruction (no WAW hazards)
• Read operands—wait until no data hazards, then read operands (ID2)
– all real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data
– no forwarding of data in this model
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 41
Four Stages of Scoreboard Control
• Execution—operate on operands (EX)
– the functional unit begins execution upon receiving operands;
when the result is ready, it notifies the scoreboard that it has completed execution
• Write result—finish execution (WB)
– stall until no WAR hazards with previous instructions:
Example: DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F8, F8, F14
CDC6600 scoreboard would stall SUBD until ADDD reads operands
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 42
• Instruction status:
Which of 4 steps the instruction is in
Field
Meaning
Busy Op
Fi
Fj, Fk Qj, Qk Rj, Rk
indicates whether the unit is busy or not operation to perform in the unit (e.g., add or sub) destination register
source register numbers
functional units producing source registers Fj, Fk flags indicating when Fj, Fk are ready
Three Parts of the Scoreboard
• Functional unit status:
Indicates the state of the functional unit (FU). 9 fields for each functional unit
• Register result status:
Indicates which functional unit will write each register, if one exists; blank when no pending instructions will write that register
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 43
Scoreboard Example
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 44
Detailed Scoreboard Pipeline Control
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 45
Scoreboard Example: Cycle 1
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 46
Scoreboard Example: Cycle 2
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 47
Scoreboard Example: Cycle 3
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 48
Scoreboard Example: Cycle 4
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 49
Scoreboard Example: Cycle 5
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 50
Scoreboard Example: Cycle 6
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 51
Scoreboard Example: Cycle 7
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 52
Scoreboard Example: Cycle 8a (1st half of clock cycle)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 53
Scoreboard Example: Cycle 8b (2nd half of clock cycle)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 54
Scoreboard Example: Cycle 9
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 55
Scoreboard Example: Cycle 10
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 56
Scoreboard Example: Cycle 11
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 57
Scoreboard Example: Cycle 12
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 58
Scoreboard Example: Cycle 13
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 59
Scoreboard Example: Cycle 14
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 60
Scoreboard Example: Cycle 15
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 61
Scoreboard Example: Cycle 16
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 62
Scoreboard Example: Cycle 17
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 63
Scoreboard Example: Cycle 18
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 64
Scoreboard Example: Cycle 19
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 65
Scoreboard Example: Cycle 20
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 66
Scoreboard Example: Cycle 21
• WAR Hazard is now gone…
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 67
Scoreboard Example: Cycle 22
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 68
Let’s skip some cycles…Scoreboard Example: Cycle 61
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 69
Scoreboard Example: Cycle 62
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 70
Review—Scoreboard Example: Cycle 62
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 71
• Speedup 1.7 from compiler; 2.5 by hand; UT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
CDC6600 Scoreboard
– no forwarding hardware
– limited to instructions in basic block (small instruction window)
– small number of functional units (structural hazards), especially integer/load store units – do not issue on structural hazards
– wait for WAR hazards
– prevent WAW hazards
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 72
• Compiler scheduling • HW exploiting ILP
Summary of Concepts
– works when we can’t possibly know dependencies at compile time – code for one machine runs well on another
• Key idea of scoreboard: allow instructions behind stall to proceed (decode => issue instructions and read operands)
– enables out-of-order execution ==> out-of-order completion – ID stage checked both for structural and data dependencies – original version didn’t handle forwarding
– no automatic register renaming
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 73