程序代写代做代考 algorithm compiler cache clock Module Outline:

Module Outline:
Module 2: High Performance Techniques— Scoreboard
• How to Optimize the Pipeline? Extract More Parallelism! • Compiler-Directed (Static) Approaches
– VLIW
– EPIC
– Superscalar
– Software Pipelining
• A Dynamic Approach—Scoreboard
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 1

Pipelining: Can we somehow make CPI closer to 1?
• Let’s assume full pipelining
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 2

FP Loop: Where are the hazards?
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 3

FP Loop Showing Stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 4

Revised FP Loop Minimizing Stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 5

Unroll Loop Four Times (straightforward way)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 6

Unrolled Loop That Minimizes Stalls
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 7

Getting CPI < 1: Processing Multiple Instructions/Cycle • Use parallel processing!! • Two main variations: superscalar and VLIW • Superscalar: varying number of instructions/cycle (1 to 6) - parallelism and dependencies determined/resolved by HW - IBM PowerPC 604, Sun UltraSPARC, DEC Alpha 21164, HP 7100 • Very Long Instruction Words (VLIW): fixed number of instructions (16) determined by compiler - pipeline is exposed; compiler must schedule delays to get right results • Explicit Parallel Instruction Computer (EPIC)/Intel - 128 bit packets containing 3 instructions (can execute sequentially) - can link 128 bit packets together to allow more parallelism - compiler determines parallelism, HW checks dependencies and forwards/stalls ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 8 Parallelism: Overt vs. Covert ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 9 Parallelism: Overt vs. Covert ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 10 Problem vs. Program ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 11 Compilers for Covert Parallelism ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 12 Compilers for Covert Parallelism ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 13 Compilers for Covert Parallelism ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 14 Compilers for Covert Parallelism ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 15 Compilers for Covert Parallelism ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 16 Compilers for Covert Parallelism ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 17 Unrolled Loop ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 18 Compilation and ISA • Efficient compilation requires knowledge of the pipeline structure - latency and bandwidth of each operation type • But a good ISA transcends several implementations with different pipelines - should things like a delayed branch be in an ISA? - should a compiler use the properties of one implementation when compiling for an ISA? - do we need a new interface? ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 19 An Alternative ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 20 Very Long Instruction Word (VLIW) Computers ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 21 Pros: Cons: • Very simple hardware • Lockstep execution (static schedule) - no dependency detection - simple issue logic - just ALUs and register files - very sensitive to long latency operations (cache misses) • Potentially exploits large amounts of ILP • Global register file hard to build • Lots of NO-OPs ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 22 VLIW Pros and Cons - poor “code density” - I-cache capacity and bandwidth compromised • Must recompile sources to deliver potential • Implementation visible through ISA EPIC: Explicit Parallel Instruction Computer 128-bit instructions: • three 3-address operations op1 op2 op3 tmp • a template that encodes dependencies • 128 general registers • predication • speculative load (data prediction) • Example: IA-64 of Intel/HP pred op rd rs1 rs2 const ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 23 Getting CPI < 1: Issuing Multiple Instructions/Cycle ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 24 Getting CPI < 1: Issuing Multiple Instructions/Cycle • Superscalar DLX: 2 instructions, 1 FP & 1 anything else - fetch 64 bits/clock cycle; integer on left, FP on right - can only issue 2nd instruction if 1st instruction issues - more ports for FP registers to do FP load & FP op in a pair • 1 cycle load delay expands to 3 instructions in SS - instruction in right half can’t use it, nor instructions in next slot ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 25 Loop Unrolling in Superscalar • Unrolled 5 times to avoid delays (+1 due to SS) • 12 clocks, or 2.4 clocks per iteration ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 26 Another Example: Multiple Issue ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 27 Another Example: Multiple Issue—Rescheduled Code ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 28 - exactly 50% FP operations - no hazards Limits of Superscalar • While integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: • If more instructions issue at same time, greater difficulty of decode and issue - even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– the long instruction word has room for many operations
– by definition, all the operations the compiler puts in the long instruction word can execut
in parallel
– e.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch;
16 to 24 bits for each of these fields ==> 7 x 16 or 112 bits to 7 x 24 or 168 bits wide – need compiling technique that schedules across several branches
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 29

Loop Unrolling in VLIW
• Need more registers in VLIW (EPIC ==> 128 integer + 128 FP)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 30

Software Pipelining
• Observation: if iterations from loops are independent, then can get more ILP (instruction level parallelism) by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (i.e., Tomasulo algorithms in SW)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 31

Software Pipelining Example
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 32

Software Pipelining with Loop Unrolling in VLIW
• 9 results in 9 cycles, or 1 clock per iteration
• average: 3.3 ops per clock, 66% efficiency
• Note: need less registers for software pipelining (only using 7 registers here, was using 15)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 33

Multiple Issue:
• more complex issue logic
Multiple Issue as compared with VLIW
– check dependencies
– check structural hazards
– issue variable number of instructions (0-N) – shift unissued instructions over
• Able to run existing binaries
– recompile for performance, not correctness
• Datapaths identical
– but bypass requires detection
• Neither VLIW or multiple-issue can schedule around run-time variation in instruction latency
– cache misses
• Dealing with run-time variation requires run-time or dynamic scheduling
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 34

The Problem with Static Scheduling (Compile-Time)
In-Order Execution:
• an unexpected long latency blocks ready instructions from executing (scheduled code cannot be changed at run-time)
• binaries need to be rescheduled (recompiled) for each new processor implementation
• small number of named registers becomes a bottleneck
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 35

• Why in HW at run time?
Can we use HW to get CPI closer to 1?
– works when can’t know real dependencies at compile time – compiler simpler; avoid recompilation also!
– code for one machine runs well on another
• Dynamic scheduling?!
• Key ideas: allow instructions behind stall to proceed:
DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14
• Out-of-order execution ==> out-of-order completion • Disadvantages?
– complexity
– precise interrupts harder!
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 36

• How do we prevent WAR and WAW hazards? • How do we deal with variable latency?
– forwarding for RAW hazards harder
Problems?
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 37

Scoreboard: a bookkeeping technique
• Out-of-order execution divides ID stage:
– 1. Issue—decode instructions, check for structural hazards
– 2. Read operands—wait until no data hazards, then read operands
• Scoreboards date to CDC6600 designed in 1963
• Instructions execute whenever not dependent on previous instructions and no hazards
• CDC6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
– no forwarding!
– imprecise interrupt/exception model for now
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 38

Scoreboard Architecture (CDC6600)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 39

Scoreboard Implications
• Out-of-order completion ==> WAR, WAW hazards?
• Solutions for WAR:
– stall writeback until registers have been read
– read registers only during Read Operands stage
• No register renaming!
• Need to have multiple instructions in execution phase ==> multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued
• Scoreboard replaces ID, EX, WB with 4 stages
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 40

Four Stages of Scoreboard Control
• Issue—decode instructions & check for structural hazards (ID1)
– instructions issued in program order (for hazard checking)
– don’t issue if structural hazard
– don’t issue if instruction is output dependent on any previously issued but uncompleted
instruction (no WAW hazards)
• Read operands—wait until no data hazards, then read operands (ID2)
– all real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data
– no forwarding of data in this model
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 41

Four Stages of Scoreboard Control
• Execution—operate on operands (EX)
– the functional unit begins execution upon receiving operands;
when the result is ready, it notifies the scoreboard that it has completed execution
• Write result—finish execution (WB)
– stall until no WAR hazards with previous instructions:
Example: DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F8, F8, F14
CDC6600 scoreboard would stall SUBD until ADDD reads operands
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 42

• Instruction status:
Which of 4 steps the instruction is in
Field
Meaning
Busy Op
Fi
Fj, Fk Qj, Qk Rj, Rk
indicates whether the unit is busy or not operation to perform in the unit (e.g., add or sub) destination register
source register numbers
functional units producing source registers Fj, Fk flags indicating when Fj, Fk are ready
Three Parts of the Scoreboard
• Functional unit status:
Indicates the state of the functional unit (FU). 9 fields for each functional unit
• Register result status:
Indicates which functional unit will write each register, if one exists; blank when no pending instructions will write that register
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 43

Scoreboard Example
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 44

Detailed Scoreboard Pipeline Control
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 45

Scoreboard Example: Cycle 1
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 46

Scoreboard Example: Cycle 2
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 47

Scoreboard Example: Cycle 3
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 48

Scoreboard Example: Cycle 4
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 49

Scoreboard Example: Cycle 5
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 50

Scoreboard Example: Cycle 6
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 51

Scoreboard Example: Cycle 7
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 52

Scoreboard Example: Cycle 8a (1st half of clock cycle)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 53

Scoreboard Example: Cycle 8b (2nd half of clock cycle)
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 54

Scoreboard Example: Cycle 9
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 55

Scoreboard Example: Cycle 10
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 56

Scoreboard Example: Cycle 11
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 57

Scoreboard Example: Cycle 12
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 58

Scoreboard Example: Cycle 13
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 59

Scoreboard Example: Cycle 14
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 60

Scoreboard Example: Cycle 15
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 61

Scoreboard Example: Cycle 16
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 62

Scoreboard Example: Cycle 17
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 63

Scoreboard Example: Cycle 18
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 64

Scoreboard Example: Cycle 19
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 65

Scoreboard Example: Cycle 20
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 66

Scoreboard Example: Cycle 21
• WAR Hazard is now gone…
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 67

Scoreboard Example: Cycle 22
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 68

Let’s skip some cycles…Scoreboard Example: Cycle 61
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 69

Scoreboard Example: Cycle 62
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 70

Review—Scoreboard Example: Cycle 62
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 71

• Speedup 1.7 from compiler; 2.5 by hand; UT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
CDC6600 Scoreboard
– no forwarding hardware
– limited to instructions in basic block (small instruction window)
– small number of functional units (structural hazards), especially integer/load store units – do not issue on structural hazards
– wait for WAR hazards
– prevent WAW hazards
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 72

• Compiler scheduling • HW exploiting ILP
Summary of Concepts
– works when we can’t possibly know dependencies at compile time – code for one machine runs well on another
• Key idea of scoreboard: allow instructions behind stall to proceed (decode => issue instructions and read operands)
– enables out-of-order execution ==> out-of-order completion – ID stage checked both for structural and data dependencies – original version didn’t handle forwarding
– no automatic register renaming
ELEC6036 (Vincent TAM) Module 2: High Performance Techniques—Scoreboard Page 73