Computer Architecture ELEC3441
Lecture 11 – Advanced Pipeline
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
Simple pipeline so far…
n CPI of simple pipelined processor is always >= 1
• Commitsatmost1instructionpercycle
n Stalling and wasted cycles increase CPI • Cachemiss
• TLBmiss
• pagefault
• Branchmisprediction
HKUEEE ENGG3441 – HS 2
Challenges with pipeline:
n Pipeline works best when all the stages have equal and predictable latencies
• The longest stage limits overall performance
• Unpredictable latency must be hidden by stalling the
rest of the pipeline But, in reality…
n Many operations in a processor have
• Variable latency: latency changes during run time
• Long latency: takes many cycles to complete
• Partially pipelined: initiating interval > 1
n Notable challenging stages:
• Memory operation
• Floating point operation
Can we design processors with CPI ≤ 1? HKUEEE ENGG3441 – HS 3
Beyond Simple Pipeline
n Out-of-order issue, execute, commit
n Tomasulo Architecture
n Superscalar, dynamic OOO processor n Multi-scalar processor
n VLIW
n Vector Processor
n SIMD
n GPU acceleration
n Multi-core processor
n Reconfigurable Computer
CPI ≥ 1
CPI < 1
HKUEEE ENGG3441 - HS
4
§
§ §
§ §
Much more hardware than an integer unit – Single-cycle FPU is a bad idea – why?
Common to have several FPU’s
Common to have different types of FPU’s: Fadd, Fmul, Fdiv, ...
An FPU may be pipelined, partially pipelined or not pipelined
To operate several FPU’s concurrently the FP register file needs to have more read and write ports
Floating-Point Unit (FPU)
5
Functional Unit Characteristics
1cyc
1cyc
1cyc
fully pipelined
partially pipelined
start new operation every 1 cycle
start new operation every 2 cycles
2 cyc
2 cyc
Functional units have internal pipeline registers
Þ operands are latched when an instruction enters a functional unit
Þ following instructions are able to write register file during a long-latency operation
6
Floating-Point ISA
§Interaction between floating-point datapath and integer datapath is determined by ISA
§RISC-V ISA
– separate register files for FP and Integer instructions
• the only interaction is via a set of move/convert instructions (some ISA’s don’t even permit this)
– separate load/store for FPR’s and GPR’s but both use GPR’s for address calculation
– FP compares write integer registers, then use integer branch
7
Issues in Complex Pipeline Control
• Structural conflicts at the execution stage if some FPU or memory unit is not pipelined and takes more than one cycle
• Structural conflicts at the write-back stage due to variable latencies of different functional units
• Out-of-order write hazards due to variable latencies of different functional units
• How to handle exceptions?
GPRs FPRs
ALU
Mem
IF
ID
Issue
WB
Fadd
Fmul
Fdiv
8
Complex In-Order Pipeline
PC
Inst.
Mem
D
Decode
GPRs
X1
+
X2
Data
Mem
X3
W
§ Delay writeback so all operations have same latency to W stage
– Writeportsneveroversubscribed (one inst. in & one inst. out every cycle)
– Stall pipeline on long latency operations, e.g., divides, cache misses
– Handle exceptions in-order at commit point
How to prevent increased writeback latency from slowing down single cycle integer operations? Bypassing
FAdd
FMul
Unpipelined X2 divider X3
FPRs
X1
X2
X3
W
X2
X3
FDiv
Commit Point
9
In-Order Superscalar Pipeline
2
§ Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and other is floating point
§ Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
§ Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC & Alpha 21164) but regfile ports and bypassing costs grow quickly
+
PC
Inst.
Mem
D
Dual
Decode
GPRs
X1
X2
Data Mem
X3
W
FPRs
X1
X2
FAdd
FMul
Unpipelined divider X3
X3
W
X2
X3
FDiv
X2
Commit Point
10
Can we get away from in-order processing?
11
Reordering of Instructions
n 3 general categories:
• In-orderIssue+In-orderCompletion
• In-orderIssue+Out-of-orderCompletion
• Out-of-orderIssue+Out-of-orderCompletion
HKUEEE ENGG3441 - HS 12
Types of Data Hazards
Consider executing a sequence of
rkßri op rj type of instructions
Data-dependence
r3ßr1opr2 r5ßr3opr4
Anti-dependence
r3ßr1opr2 r1ßr4opr5
Output-dependence
r3ßr1opr2 r3ßr6opr7
Read-after-Write (RAW) hazard
Write-after-Read (WAR) hazard
Write-after-Write (WAW) hazard
13
Register vs. Memory Dependence
Data hazards due to register operands can be determined at the decode stage, but data hazards due to memory operands can be determined only after computing the effective address
Store: M[r1 + disp1] ß r2 Load: r3 ß M[r4 + disp2]
Does (r1 + disp1) = (r4 + disp2) ?
14
Data Hazards: An Example
I1 FDIV.D
I2 FLD
I3 FMUL.D
I4 FDIV.D
I5 FSUB.D
I6 FADD.D
f6, f6, f4 f2, 45(x3)
f0, f2, f4 f8, f6, f2 f10, f0, f6 f6, f8, f2
RAW Hazards
WAR Hazards
WAW Hazards
15
Instruction Scheduling
I1 FDIV.D
I2 FLD
I3 FMULT.D
I4 FDIV.D
I5 FSUB.D
I6 FADD.D
f6, f6, f4 f2, 45(x3)
f0, f2, f4 f8, f6, f2 f10, f0, f6 f6, f8, f2
Valid orderings:
in-order I1 I2 I3
I4 I5 I6
I4 I5 I6 I5 I4 I6
I1 I2 I3 I4
I5 I6
out-of-order I2 out-of-order I1
I1 I3 I2 I3
16
Out-of-order Completion
In-order Issue
I1 FDIV.D f6,
I2 FLD f2,
I3 FMULT.Df0, f2,
I4 FDIV.D f8,
I5 FSUB.D f10,
I6 FADD.D f6,
Latency f6, f4 4
45(x3) 1 f43
f6, f2 4 f0, f6 1 f8, f2 1
in-ordercomp 12 1234 354656 out-of-ordercomp 1 2 2 3 1 4 3 5 5 4 6 6
17
Complex Pipeline
ALU
Mem
IF
ID
Issue
WB
GPR’s FPR’s
Can we solve write hazards without equalizing all pipeline depths and without bypassing?
Fadd
Fmul
Fdiv
18
When is it Safe to Issue an Instruction?
Suppose a data structure keeps track of all the instructions in all the functional units
The following checks need to be made before the Issue stage can dispatch an instruction
§ Is the required function unit available?
§ Is the input data available? è RAW?
§ Is it safe to write the destination? è WAR? WAW? § Is there a structural conflict at the WB stage?
19
Name Busy
Int Mem Add1 Add2 Add3 Mult1 Mult2 Div
Op Dest Src1 Src2
A Data Structure for Correct Issues
Keeps track of the status of Functional Units
The instruction i at the Issue stage consults this table
FU available? RAW?
WAR?
WAW?
check the busy column
search the dest column for i’s sources search the source columns for i’s destination search the dest column for i’s destination
An entry is added to the table if no hazard is detected; An entry is removed from the table after Write-Back
20
Simplifying the Data Structure Assuming In-order Issue
Suppose the instruction is not dispatched by the Issue stage if a RAW hazard exists or the required FU is busy, and that operands are latched by functional unit on issue:
Can the dispatched instruction cause a WAR hazard ?
NO: Operands read at issue
WAW hazard ?
YES: Out-of-order completion
21
Simplifying the Data Structure ...
§No WAR hazard
⟹ no need to keep src1 and src2
§The Issue stage does not dispatch an instruction in case of a WAW hazard
⟹ a register name can occur at most once in the dest column
§WP[reg#] : a bit-vector to record the registers for which writes are pending
– These bits are set by the Issue stage and cleared by the WB stage
⟹ Each pipeline stage in the FU's must carry the dest field and a flag to indicate if it is valid “the (we, ws) pair”
22
Scoreboard for In-order Issues
Busy[FU#] : a bit-vector to indicate FU’s availability. (FU = Int, Add, Mult, Div)
These bits are hardwired to FU's.
WP[reg#] : a bit-vector to record the registers for which writes are pending.
These bits are set by Issue stage and cleared by WB stage
Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch
FU available? RAW?
WAR?
Busy[FU#]
WP[src1] or WP[src2] cannot arise
WAW? WP[dest]
23
Scoreboard Dynamics
Functional Unit Status Int(1) Add(1) Mult(3) Div(4)
Registers Reserved WB for Writes
f6
f6, f2
f6, f2 I2 f6, f0
f6, f0 I1 f0, f8
f0, f8 I3 f8, f10
f8, f10 I5 f8I4 f6
f6 I6
I1
f6
I2 f2
f6
f6
f2
I3
f0
f6
f0
f6
I4
f0
f8
f8
f0
I5
f10
f8
f8
f10
f8
I6
f6
f6
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
I1
I2
I3
I4
I5
I6
FDIV.D
FLD
FMULT.D
FDIV.D
FSUB.D
FADD.D
f6, f6, f4
f2, 45(x3)
f0, f2, f4
f8, f6, f2
f10, f0, f6
f6, f8, f2
24
In-Order Issue Limitations: an example
1 FLD f2,
2 FLD f4,
3 FMULT.D f6,
4 FSUB.D f8,
5 FDIV.D f4,
6 FADD.D f10,
34(x2) 45(x3)
f4, f2 f2, f2 f2, f8 f6, f4
latency 1
12
long
3
1
4 5 1
43
6
In-order issue restriction prevents instruction 4 from being dispatched
In-order:
1(2,1). . . . . . 2344 35. . .566
25
Out-of-Order Issue
ALU
Mem
Issue
§ Issue stage buffer holds multiple instructions waiting to issue.
§ Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.
– Note: WAR possible again because issue is out-of-order (WAR not possible with in-order issue and latching of input operands at functional unit)
§ Any instruction in buffer whose RAW hazards are satisfied can be issued (for now at most one dispatch per cycle). On a write back (WB), new instructions may get enabled.
26
IF
ID
WB
Fadd
Fmul
How many instructions can be in the pipeline?
Which features of an ISA limit the number of instructions in the pipeline?
Number of Registers
Out-of-order dispatch by itself does not provide any significant performance improvement!
27
Issue Limitations: In-Order and Out-of-Order
1 FLD f2,
2 FLD f4,
3 FMULT.D f6,
4 FSUB.D f8,
5 FDIV.D f4,
6 FADD.D f10,
34(x2) 45(x3)
f4, f2 f2, f2 f2, f8 f6, f4
latency 1
12
long
3
1
4 5 1
43
6
Out-of-order execution did not allow any significant improvement!
In-order:
1(2,1). . . . . . 2344 35. . .566
Out-of-order: 1(2,1)44. . . . 235 . 3 .566
cannot dispatch due to WAR on f4
28
Overcoming the Lack of Register Names
Floating Point pipelines often cannot be kept filled with small number of registers.
IBM 360 had only 4 floating-point registers
Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility ?
Robert Tomasulo of IBM suggested an ingenious solution in 1967 using on-the-fly register renaming
29
Issue Limitations: In-Order and Out-of-Order
1 FLD f2,
2 FLD f4,
3 FMULT.D f6,
4 FSUB.D f8,
5 FDIV.D f4’,
6 FADD.D f10,
34(x2) 45(x3)
f4, f2 f2, f2 f2, f8 f6, f4’
latency 1
12
long
3
1
4 5 1
43
X
6
Any antidependence can be eliminated by renaming.
In-order: 1(2,1). . . . . . 2344 35. . .566 Out-of-order: 1(2,1)445 . . . 2(3,5)366
(renaming ⟹ additional storage) Can it be done in hardware? yes!
30
Register Renaming
ALU
Mem
IF
ID
Issue
WB
Fadd
Fmul
§ Decode does register renaming and adds instructions to the issue-stage instruction reorder buffer (ROB)
⟹ renaming makes WAR or WAW hazards impossible
§ Any instruction in ROB whose RAW hazards have been satisfied
can be dispatched.
⟹ Out-of-order or dataflow execution
31
Tomasulo Architecture
R. M. Tomasulo, 1967 IBM 360/91
From Mem FP Op Queue
FP Registers
Store Buffers
Load1 Load2 Load3 Load4 Load5 Load6
Load Buffers
Add1 Add2 Add3
Mult1 Mult2
Reservation Stations
To Mem
FP adders
FP multipliers
HKUEEE
ENGG3441 - HS
32
Common Data Bus (CDB)
Tomasulo’s Algorithm
n Control & buffers distributed with Function Units (FU)
• FU buffers called reservation stations
• Contains pending operands
n Registers in instructions replaced by values or pointers to reservation stations(RS)
• On-the-fly register renaming
• Renaming avoids WAR, WAW hazards (for now)
• More reservation stations than registers
n Common Data Bus that broadcasts results to all Fus
• RS stores new (pending) data until consumed by FU
• No register file
• Avoids RAW hazards by executing an instruction only when its
operands are available
n Loads and Stores treated as FUs with RSs as well
n Branch prediction allows FP ops beyond basic block in FP queue
HKUEEE ENGG3441 - HS 33
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX) When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
n Normal data bus: data + destination (“go to” bus)
n Common data bus: data + source (“come from” bus)
• 64 bits of data + 4 bits of Functional Unit source address
• Write if matches expected Functional Unit (produces result)
• Does the broadcast
n Example speed:
3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /
HKUEEE ENGG3441 - HS 34
Reorder Buffer
n A buffer storing all instructions in-flight
• Commit to architectural state only when all previous
instructions have committed
n Enables in-order commit
• Avoid WAW w.r.t. ISA registers
• Precise exception
• Handle branch misprediction
n Usually implemented as a hardware circular buffer
• Insert entry by issue stages
• Remove entry by commit
n May handle memory writes in similar fashion
• e.g. Out-of-order execution of SW, but turns out there is a
branch misprediction
HKUEEE ENGG3441 - HS 35
Instruction Commit/Retire
n Instruction commit: commit the results of an instruction to permanent machine states
• Completing an instruction not necessary means committing/retiring an instruction
• e.g. ROB commits in-order, while instructions complete out-of-order
n In-order commits benefits:
• guarantee no WAW
• precise exception
• enable speculative executions
n In-order commits limitation:
• Added hardware
• CPI=1 barrier if not careful
HKUEEE ENGG3441 - HS 36
Acknowledgements
n These slides contain material developed and copyright by:
• Arvind (MIT)
• Krste Asanovic (MIT/UCB)
• Joel Emer (Intel/MIT)
• James Hoe (CMU)
• John Kubiatowicz (UCB)
• David Patterson (UCB)
• John Lazzaro (UCB)
n MIT material derived from course 6.823
n UCB material derived from course CS152,
CS252
HKUEEE ENGG3441 - HS 37