CS计算机代考程序代写 algorithm data structure GPU cache computer architecture RISC-V mips Computer Architecture ELEC3441

Computer Architecture ELEC3441
Lecture 11 – Advanced Pipeline
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
Simple pipeline so far…
n CPI of simple pipelined processor is always >= 1
• Commitsatmost1instructionpercycle
n Stalling and wasted cycles increase CPI • Cachemiss
• TLBmiss
• pagefault
• Branchmisprediction
HKUEEE ENGG3441 – HS 2
Challenges with pipeline:
n Pipeline works best when all the stages have equal and predictable latencies
• The longest stage limits overall performance
• Unpredictable latency must be hidden by stalling the
rest of the pipeline But, in reality…
n Many operations in a processor have
• Variable latency: latency changes during run time
• Long latency: takes many cycles to complete
• Partially pipelined: initiating interval > 1
n Notable challenging stages:
• Memory operation
• Floating point operation
Can we design processors with CPI ≤ 1? HKUEEE ENGG3441 – HS 3
Beyond Simple Pipeline
n Out-of-order issue, execute, commit
n Tomasulo Architecture
n Superscalar, dynamic OOO processor n Multi-scalar processor
n VLIW
n Vector Processor
n SIMD
n GPU acceleration
n Multi-core processor
n Reconfigurable Computer
CPI ≥ 1
CPI < 1 HKUEEE ENGG3441 - HS 4 § § § § § Much more hardware than an integer unit – Single-cycle FPU is a bad idea – why? Common to have several FPU’s Common to have different types of FPU’s: Fadd, Fmul, Fdiv, ... An FPU may be pipelined, partially pipelined or not pipelined To operate several FPU’s concurrently the FP register file needs to have more read and write ports Floating-Point Unit (FPU) 5 Functional Unit Characteristics 1cyc 1cyc 1cyc fully pipelined partially pipelined start new operation every 1 cycle start new operation every 2 cycles 2 cyc 2 cyc Functional units have internal pipeline registers Þ operands are latched when an instruction enters a functional unit Þ following instructions are able to write register file during a long-latency operation 6 Floating-Point ISA §Interaction between floating-point datapath and integer datapath is determined by ISA §RISC-V ISA – separate register files for FP and Integer instructions • the only interaction is via a set of move/convert instructions (some ISA’s don’t even permit this) – separate load/store for FPR’s and GPR’s but both use GPR’s for address calculation – FP compares write integer registers, then use integer branch 7 Issues in Complex Pipeline Control • Structural conflicts at the execution stage if some FPU or memory unit is not pipelined and takes more than one cycle • Structural conflicts at the write-back stage due to variable latencies of different functional units • Out-of-order write hazards due to variable latencies of different functional units • How to handle exceptions? GPRs FPRs ALU Mem IF ID Issue WB Fadd Fmul Fdiv 8 Complex In-Order Pipeline PC Inst. Mem D Decode GPRs X1 + X2 Data Mem X3 W § Delay writeback so all operations have same latency to W stage – Writeportsneveroversubscribed (one inst. in & one inst. out every cycle) – Stall pipeline on long latency operations, e.g., divides, cache misses – Handle exceptions in-order at commit point How to prevent increased writeback latency from slowing down single cycle integer operations? Bypassing FAdd FMul Unpipelined X2 divider X3 FPRs X1 X2 X3 W X2 X3 FDiv Commit Point 9 In-Order Superscalar Pipeline 2 § Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and other is floating point § Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996) § Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC & Alpha 21164) but regfile ports and bypassing costs grow quickly + PC Inst. Mem D Dual Decode GPRs X1 X2 Data Mem X3 W FPRs X1 X2 FAdd FMul Unpipelined divider X3 X3 W X2 X3 FDiv X2 Commit Point 10 Can we get away from in-order processing? 11 Reordering of Instructions n 3 general categories: • In-orderIssue+In-orderCompletion • In-orderIssue+Out-of-orderCompletion • Out-of-orderIssue+Out-of-orderCompletion HKUEEE ENGG3441 - HS 12 Types of Data Hazards Consider executing a sequence of rkßri op rj type of instructions Data-dependence r3ßr1opr2 r5ßr3opr4 Anti-dependence r3ßr1opr2 r1ßr4opr5 Output-dependence r3ßr1opr2 r3ßr6opr7 Read-after-Write (RAW) hazard Write-after-Read (WAR) hazard Write-after-Write (WAW) hazard 13 Register vs. Memory Dependence Data hazards due to register operands can be determined at the decode stage, but data hazards due to memory operands can be determined only after computing the effective address Store: M[r1 + disp1] ß r2 Load: r3 ß M[r4 + disp2] Does (r1 + disp1) = (r4 + disp2) ? 14 Data Hazards: An Example I1 FDIV.D I2 FLD I3 FMUL.D I4 FDIV.D I5 FSUB.D I6 FADD.D f6, f6, f4 f2, 45(x3) f0, f2, f4 f8, f6, f2 f10, f0, f6 f6, f8, f2 RAW Hazards WAR Hazards WAW Hazards 15 Instruction Scheduling I1 FDIV.D I2 FLD I3 FMULT.D I4 FDIV.D I5 FSUB.D I6 FADD.D f6, f6, f4 f2, 45(x3) f0, f2, f4 f8, f6, f2 f10, f0, f6 f6, f8, f2 Valid orderings: in-order I1 I2 I3 I4 I5 I6 I4 I5 I6 I5 I4 I6 I1 I2 I3 I4 I5 I6 out-of-order I2 out-of-order I1 I1 I3 I2 I3 16 Out-of-order Completion In-order Issue I1 FDIV.D f6, I2 FLD f2, I3 FMULT.Df0, f2, I4 FDIV.D f8, I5 FSUB.D f10, I6 FADD.D f6, Latency f6, f4 4 45(x3) 1 f43 f6, f2 4 f0, f6 1 f8, f2 1 in-ordercomp 12 1234 354656 out-of-ordercomp 1 2 2 3 1 4 3 5 5 4 6 6 17 Complex Pipeline ALU Mem IF ID Issue WB GPR’s FPR’s Can we solve write hazards without equalizing all pipeline depths and without bypassing? Fadd Fmul Fdiv 18 When is it Safe to Issue an Instruction? Suppose a data structure keeps track of all the instructions in all the functional units The following checks need to be made before the Issue stage can dispatch an instruction § Is the required function unit available? § Is the input data available? è RAW? § Is it safe to write the destination? è WAR? WAW? § Is there a structural conflict at the WB stage? 19 Name Busy Int Mem Add1 Add2 Add3 Mult1 Mult2 Div Op Dest Src1 Src2 A Data Structure for Correct Issues Keeps track of the status of Functional Units The instruction i at the Issue stage consults this table FU available? RAW? WAR? WAW? check the busy column search the dest column for i’s sources search the source columns for i’s destination search the dest column for i’s destination An entry is added to the table if no hazard is detected; An entry is removed from the table after Write-Back 20 Simplifying the Data Structure Assuming In-order Issue Suppose the instruction is not dispatched by the Issue stage if a RAW hazard exists or the required FU is busy, and that operands are latched by functional unit on issue: Can the dispatched instruction cause a WAR hazard ? NO: Operands read at issue WAW hazard ? YES: Out-of-order completion 21 Simplifying the Data Structure ... §No WAR hazard ⟹ no need to keep src1 and src2 §The Issue stage does not dispatch an instruction in case of a WAW hazard ⟹ a register name can occur at most once in the dest column §WP[reg#] : a bit-vector to record the registers for which writes are pending – These bits are set by the Issue stage and cleared by the WB stage ⟹ Each pipeline stage in the FU's must carry the dest field and a flag to indicate if it is valid “the (we, ws) pair” 22 Scoreboard for In-order Issues Busy[FU#] : a bit-vector to indicate FU’s availability. (FU = Int, Add, Mult, Div) These bits are hardwired to FU's. WP[reg#] : a bit-vector to record the registers for which writes are pending. These bits are set by Issue stage and cleared by WB stage Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch FU available? RAW? WAR? Busy[FU#] WP[src1] or WP[src2] cannot arise WAW? WP[dest] 23 Scoreboard Dynamics Functional Unit Status Int(1) Add(1) Mult(3) Div(4) Registers Reserved WB for Writes f6 f6, f2 f6, f2 I2 f6, f0 f6, f0 I1 f0, f8 f0, f8 I3 f8, f10 f8, f10 I5 f8I4 f6 f6 I6 I1 f6 I2 f2 f6 f6 f2 I3 f0 f6 f0 f6 I4 f0 f8 f8 f0 I5 f10 f8 f8 f10 f8 I6 f6 f6 t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 I1 I2 I3 I4 I5 I6 FDIV.D FLD FMULT.D FDIV.D FSUB.D FADD.D f6, f6, f4 f2, 45(x3) f0, f2, f4 f8, f6, f2 f10, f0, f6 f6, f8, f2 24 In-Order Issue Limitations: an example 1 FLD f2, 2 FLD f4, 3 FMULT.D f6, 4 FSUB.D f8, 5 FDIV.D f4, 6 FADD.D f10, 34(x2) 45(x3) f4, f2 f2, f2 f2, f8 f6, f4 latency 1 12 long 3 1 4 5 1 43 6 In-order issue restriction prevents instruction 4 from being dispatched In-order: 1(2,1). . . . . . 2344 35. . .566 25 Out-of-Order Issue ALU Mem Issue § Issue stage buffer holds multiple instructions waiting to issue. § Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard. – Note: WAR possible again because issue is out-of-order (WAR not possible with in-order issue and latching of input operands at functional unit) § Any instruction in buffer whose RAW hazards are satisfied can be issued (for now at most one dispatch per cycle). On a write back (WB), new instructions may get enabled. 26 IF ID WB Fadd Fmul How many instructions can be in the pipeline? Which features of an ISA limit the number of instructions in the pipeline? Number of Registers Out-of-order dispatch by itself does not provide any significant performance improvement! 27 Issue Limitations: In-Order and Out-of-Order 1 FLD f2, 2 FLD f4, 3 FMULT.D f6, 4 FSUB.D f8, 5 FDIV.D f4, 6 FADD.D f10, 34(x2) 45(x3) f4, f2 f2, f2 f2, f8 f6, f4 latency 1 12 long 3 1 4 5 1 43 6 Out-of-order execution did not allow any significant improvement! In-order: 1(2,1). . . . . . 2344 35. . .566 Out-of-order: 1(2,1)44. . . . 235 . 3 .566 cannot dispatch due to WAR on f4 28 Overcoming the Lack of Register Names Floating Point pipelines often cannot be kept filled with small number of registers. IBM 360 had only 4 floating-point registers Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility ? Robert Tomasulo of IBM suggested an ingenious solution in 1967 using on-the-fly register renaming 29 Issue Limitations: In-Order and Out-of-Order 1 FLD f2, 2 FLD f4, 3 FMULT.D f6, 4 FSUB.D f8, 5 FDIV.D f4’, 6 FADD.D f10, 34(x2) 45(x3) f4, f2 f2, f2 f2, f8 f6, f4’ latency 1 12 long 3 1 4 5 1 43 X 6 Any antidependence can be eliminated by renaming. In-order: 1(2,1). . . . . . 2344 35. . .566 Out-of-order: 1(2,1)445 . . . 2(3,5)366 (renaming ⟹ additional storage) Can it be done in hardware? yes! 30 Register Renaming ALU Mem IF ID Issue WB Fadd Fmul § Decode does register renaming and adds instructions to the issue-stage instruction reorder buffer (ROB) ⟹ renaming makes WAR or WAW hazards impossible § Any instruction in ROB whose RAW hazards have been satisfied can be dispatched. ⟹ Out-of-order or dataflow execution 31 Tomasulo Architecture R. M. Tomasulo, 1967 IBM 360/91 From Mem FP Op Queue FP Registers Store Buffers Load1 Load2 Load3 Load4 Load5 Load6 Load Buffers Add1 Add2 Add3 Mult1 Mult2 Reservation Stations To Mem FP adders FP multipliers HKUEEE ENGG3441 - HS 32 Common Data Bus (CDB) Tomasulo’s Algorithm n Control & buffers distributed with Function Units (FU) • FU buffers called reservation stations • Contains pending operands n Registers in instructions replaced by values or pointers to reservation stations(RS) • On-the-fly register renaming • Renaming avoids WAR, WAW hazards (for now) • More reservation stations than registers n Common Data Bus that broadcasts results to all Fus • RS stores new (pending) data until consumed by FU • No register file • Avoids RAW hazards by executing an instruction only when its operands are available n Loads and Stores treated as FUs with RSs as well n Branch prediction allows FP ops beyond basic block in FP queue HKUEEE ENGG3441 - HS 33 Three Stages of Tomasulo Algorithm 1. Issue—get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers). 2. Execute—operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available n Normal data bus: data + destination (“go to” bus) n Common data bus: data + source (“come from” bus) • 64 bits of data + 4 bits of Functional Unit source address • Write if matches expected Functional Unit (produces result) • Does the broadcast n Example speed: 3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for / HKUEEE ENGG3441 - HS 34 Reorder Buffer n A buffer storing all instructions in-flight • Commit to architectural state only when all previous instructions have committed n Enables in-order commit • Avoid WAW w.r.t. ISA registers • Precise exception • Handle branch misprediction n Usually implemented as a hardware circular buffer • Insert entry by issue stages • Remove entry by commit n May handle memory writes in similar fashion • e.g. Out-of-order execution of SW, but turns out there is a branch misprediction HKUEEE ENGG3441 - HS 35 Instruction Commit/Retire n Instruction commit: commit the results of an instruction to permanent machine states • Completing an instruction not necessary means committing/retiring an instruction • e.g. ROB commits in-order, while instructions complete out-of-order n In-order commits benefits: • guarantee no WAW • precise exception • enable speculative executions n In-order commits limitation: • Added hardware • CPI=1 barrier if not careful HKUEEE ENGG3441 - HS 36 Acknowledgements n These slides contain material developed and copyright by: • Arvind (MIT) • Krste Asanovic (MIT/UCB) • Joel Emer (Intel/MIT) • James Hoe (CMU) • John Kubiatowicz (UCB) • David Patterson (UCB) • John Lazzaro (UCB) n MIT material derived from course 6.823 n UCB material derived from course CS152, CS252 HKUEEE ENGG3441 - HS 37