Computer Architecture ELEC3441
Lecture 5 – Pipelining
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
RISC vs CISC – Iron Law
CPUTime = # of instruction × # of cycle × time program instruction cycle
L4 L5,6
HKUEEE ENGG3441 – HS
2
Microarchitecture
CPI
Cycle Time
CISC
>1
short
RISC – single cycle unpipelined
1
long
RISC – pipelined
1
short
Pipeline Motivation – Buying
Food from Canteen
Slow 2 customers Order Food Drink Order Food Drink
n Getting food from canteen involves 3 steps:
• Place order (P)
• Pickup food (F)
• Pickup drink (D)
n If there is only 1 customer: • PèFèD
1 customer Order Food Drink
time
n How to serve 2 customer?
HKUEEE ENGG3441 – HS 3
Food Ordering – Pipeline
Pre-requisite: All steps must be able to operate independently in parallel
4 customers (pipeline)
4 customers (no pipeline)
PFD PFD
PFD PFD
PFDPFDPFDPFD
n Serving one after oneèSlow:
• Assume each step take 1 unit of time, then
• N customersè3N units of time
n Better solution: Pipeline
• Overlap different steps in parallel
• N customersè2 + N units of time
HKUEEE ENGG3441 – HS 4
time
Pipeline Observations:
4 customers (unbalanced pipeline)
4 customers (pipeline)
4 customers (no pipeline)
PFD PFD PFDPFD
PFDPFD PFDPFD
PFDPFDPFDPFD
time
n All “stages” (P, F, D) are busy all the time • Non-pipeline: busy 1/3 of the time
n Balanced pipeline: 1 customer per 1 unit of time
n The longest stage dictates the overall performance
• 1 customer per time-of-longest-stage
• Balanced delay on each stageèbest performance
HKUEEE ENGG3441 – HS 5
2 Views of Pipeline
Customer 0 Customer 1 Customer 2 Customer 3
P
F D
P F D
Timeline View
P
F D
P F D
Order (P)
Food (F)
Drink (D)
c0 c1 c2 c3
c0 c1 c0
c2 c3
c1 c2 c3
ENGG3441 – HS
Resource View
time
HKUEEE
6
Instruction Pipelining
n Recall there are 5 steps to execute 1 instruction in RISC-V
• InstructionFetch
• InstructionDecode • Execution
• Memoryoperation • WriteBack
n The 5 steps can be pipelined if they can operate independently
• è Pipeline registers • Andmore…
HKUEEE ENGG3441 – HS 7
Pipelined Datapath
0x4
Add
we rs1
rs2
rd1 wa
wd rd2 GPRs
PC
IR
addr rdata
Inst. Memory
we addr
rdata Data
Memory wdata
Instruction Fetch
Instruction decode & Reg-fetch
ALU
Execute
Memory Access
write -back
ELEC3441 – HS
8
5-Stage Pipelined Execution
0x4
Add
we rs1
rs2
rd1 wa
wd rd2 GPRs
Decode, Reg. Fetch (ID)
we addr
rdata Data
Memory wdata
Memory (MA)
t5 t6 t7 ….
WB2
MA3 WB3
EX4 MA4 WB4
ID5 EX5 MA5 WB5
PC addr
rdata IR
Inst. Memory
I-Fetch (IF)
time
instruction1
instruction2
instruction3
instruction4
instruction5
ALU
Execute (EX)
Write -Back (WB)
t0 t1 t2 t3
IF1 ID1 EX1
IF2 ID2
MA1
EX2
IF3 ID3
IF4
ELEC3441 – HS
9
t4
WB1
MA2
EX3
ID4
IF5
0x4
Add
5-Stage Pipelined Execution
Resource Usage Diagram
we rs1
rs2
PC addr rd1 we
rdata IR ws ALU addr
wd rd2 rdata
Inst. GPRs Data
Memory wdata
Memory (MA)
time
IF
ID
EX I1I2 I4I5
MA I1 I3 I4 I5
WB I2 I3 I4 I5
ELEC3441 – HS
Memory
I-Fetch (IF)
Decode, Reg. Fetch Execute (ID) (EX)
Write -Back (WB)
t0 t1 t2 t3
t5 t6 t7 …. I5
I1 I2 I3 I4 I1 I2 I3
t4 I5 I4 I3 I2 I1
10
HKUEEE ENGG3441 – HS 11
we rs1
rs2
rd1 wa
wd rd2 GPRs
addr
inst
Inst Memory
we addr
rdata Data
Memory
Imm Select
wdata
wdata
add x1, x2, x3
lw x4, 20(x5)
ori x6, x7, 1
sub x8, x9, x10
What is wrong with this?
0x4
Data writing back to regfile is from an instruction 3 cycles ago
ALU
Add
PC
IR
HKUEEE
ENGG3441 – HS
12
Resources
sub x8, x9, 10
ori x6, x7, 1
lw x4, 20(x5)
add x1, x2, x3
decode
we rs1
rs2
rd1 wa
wd rd2 GPRs
addr
inst
Inst Memory
we addr
rdata Data
Memory
Imm Select
wdata
wdata
Pipelined RISC-V Datapath
without jumps
FDEMW
IRIR IR
0x4
PC
Add
RegWriteEn
we rs1
rs2
rd1 wa
wd rd2 GPRs
FuncSel
MemWrite
WBSel
addr
inst
Inst Memory
IR
we ALU addr
rdata
Data
Memory wdata wdata
HKUEEE
14
Imm Select
Op2Sel
ENGG3441 – HS
Control Points Need to Be Connected
Pipelined Execution Control:
0x4
PC
Add
IRIR IR
IR
A
ALU Y
n Replicate instruction register to every stage n Distributed decoding for each stage based
on the current instruction of that stage
HKUEEE ENGG3441 – HS 13
B
MD1
R
MD2
Benefit of Instruction Pipelining
CPUTime = # of instruction × # of cycle × time program instruction cycle
n When the pipeline is filled, CPI=1
n Shorter cycle time because less work to do per cycle
• In fact, more pipeline stagesè shorter cycle time
• Commercialprocessorscanhave up to 20 stages pipeline
HKUEEE ENGG3441 – HS 15
Pipeline is Difficult
Structural Data Control Hazard Hazard Hazard
n On every cycle, the hardware needs to detect and resolve all types of hazards, while keeping pipeline as filled as possible to achieve CPI=1
• In real systems, CPI suffers slightly in return for higher clock speed
n Need to make sure hardware adheres to the ISA “contract” with the programmer
… difficult but worth it…
HKUEEE ENGG3441 – HS 16
Structural Hazard
n Structural hazard arises when more than 1 pipeline stages require access to the same physical hardware
n Solutions:
1. Addextracopiesoftheresource
2. Changeresourcesothatitcanhandle concurrent use
3. Requiredifferentstagestoaccesshardwareat different time
HKUEEE
• Stall one (some) of the conflicting stages • Avoid the concurrent use
ENGG3441 – HS 17
IF
i0
i1
i2
i3
i4
i5
ID
i0
i1
i2
i3
i4
i5
EX
i0
i1
i2
i3
i4
i5
MEM
i0
i1
i2
i3
i4
i5
WB
i0
i1
i2
i3
i4
i5
Structural Hazard Ex – Memory
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
n Physically there is only 1 main memory in a computer
• Holds instruction and data
n During run-time, both IF and MEM stages need access to the main memory
• èstructural hazard
n Solution so far: “replicate” the memory
HKUEEE
• •
ENGG3441 – HS
Instruction + Data memory Reality: Inst + Data Cache
18
IF
i0
i1
i2
i3
i4
i5
ID
i0
i1
i2
i3
i4
i5
EX
i0
i1
i2
i3
i4
i5
MEM
i0
i1
i2
i3
i4
i5
WB
i0
i1
i2
i3
i4
i5
HKUEEE ENGG3441 – HS 20
Structural Hazard Ex – Regfile
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
n Physically there is only 1 register file
n During run-time, there are chances that both ID and WB stages need access to the regfile
• ID: read regfile (rs1, rs2)
• WB: write regfile (rd) • èstructural hazard
n Solution so far: regfile supports concurrent read and write
HKUEEE
ENGG3441 – HS 19
Data Hazard
n Data hazard arises when pipeline stages access data location in ways that are incompatible with the ISA contract with the programmer
n Technically 3 types of hazards
• Read After Write hazards (RAW)
• Write After Read hazards (WAR)
• Write After Write hazards (WAW)
n What may go wrong?
• RAW: a later read happens before an earlier write
• WAR: a later write happens before an earlier read
• WAW: a later write happens before an earlier write
n Data hazard happens on register AND memory locations
n In our 5-stage pipeline, only RAW can happen
HKUEEE ENGG3441 – HS 21
RAW
x1ßx0 + 10 x4ßx1 + 17 x4ßx2 + x3 x2ßx4 + 1
WAW
WAR
Data Hazard Example
0x4
Add
we rs1
rs2
PC addr rd1
we ALU addr
rdata Data
Memory wdata
rdata IR wa
wd rd2
Inst. GPRs Memory
I-Fetch Decode, Reg. Fetch
Execute
(EX) (MA)
Write -Back (WB)
(IF)
time
x1ßx0 + 10 x4ßx1 + 17
(ID)
t0 t1 t2 t3 t4 t5 t6 t7 ….
IF1 ID1 EX1 MA1 WB1
IF2 ID2 EX2 MA2 WB2
ELEC3441 – HS
22
Memory
writes val of x1
new val of x1 calculated
old val of x1 read
Resolving Data Hazards
n Strategy 1: Stalling
• Waitfortheresulttobeavailablebyfreezing
earlier pipeline stages
• è Interlocks
n Strategy 2: Data Forwarding
• Routedataassoonaspossibleafteritis calculated to the earlier pipeline stage
• è bypass
HKUEEE ENGG3441 – HS 23
Feedback to Resolve Hazards
FB1 FB2
FB3 FB4
stage stage 34
stage 1
stage 2
§ Later stages provide dependence information to earlier stages which can stall (or kill) instructions
• Controllingapipelineinthismannerworksprovided
the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i
(otherwise deadlocks may occur)
24
Interlocks to resolve Data Hazards
Stall Condition
0x4
PC
bubble
IRIR IR 1
Add
we rs1
rs2 rd1
wa
wd rd2
GPRs
addr
inst
Inst Memory
IR
A
B
MD1
we ALU Y addr
rdata Data
Memory
R
25
…
x1 ¬ x0 + 10 x4 ¬ x1 + 17 …
wdata
Imm Select
MD2
wdata
Interlocks to resolve Data Hazards
FreSetzaellPCoantdition 2nd instruction
Send bubble in place of decoded 2nd instruction
0x4
PC
bubble
IRIR IR 1
Add
we rs1
rs2 rd1
wa
wd rd2
GPRs
addr
inst
Inst Memory
1st instruction proceeds
IR
A
B
MD1
we ALU Y addr
rdata Data
Memory
R
26
…
x1 ¬ x0 + 10 x4 ¬ x1 + 17 …
wdata
Imm Select
MD2
wdata
(I1) x1 ¬ (x0) + 10IF1 (I2)x4¬(x1)+17 (I3)
(I4)
ID1
IF2
EX1 MA1 WB1
stalled stages
ID2 EX2 MA2 WB2
IF3 ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
Stalled Stages and Pipeline Bubbles
time
t0 t1 t2 t3 t4 t5 t6 t7 ….
(I5)
Resource Usage
time
t0 t1 t2 t3 t4 IF I1 I2
ID I1
EX I1– MA I1- WB I1
t5 t6 t7 I3I4I5
I2I3I4 -I2I3
–I2 – – –
….
I5
I4 I5 I3I4I5
I2 I3 I4 I5
ID2 ID2 ID2
IF3 IF3 IF3
I3 I3 I3
I2 I2 I2
– Þ pipelinebubble
27
Pipeline Bubbles
n Pipeline bubble is a logical concept
n Can be implemented using
• NOPinstruction
• Specialcontroldecoding
• Stallpipelinebydisablingpipelineregisters •…
n Causes pipeline stalls
HKUEEE ENGG3441 – HS 28
Bubbles turns into NOPs
addi x1, x0, 10
addi x4, x1, 17
ori x6, x7, 1
sub x8, x9, x10
HKUEEE ENGG3441 – HS 29
addi x1, x0, 10
NOP
NOP
NOP
addi x4, x1, 17
ori x6, x7, 1
sub x8, x9, x10
Hazards due to Loads & Stores
Stall Condition
0x4
PC
bubble
IRIR IR 1
Add
we rs1
rs2
rd1 wa
wd rd2 GPRs
addr
inst
Inst Memory
What if x1+7 = x3+5 ?
we addr
rdata Data
Memory
IR
A
ALU Y
B
MD1
MD2
R
30
…
M[x1+7] ß x2 x4ßM[x3+5] …
Is there any possible data hazard in this instruction sequence?
Imm Select
wdata
wdata
Load & Store Hazards
x1+7 = x3+5 è data hazard
However, the hazard is avoided because our memory system completes writes in one cycle !
Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself.
More on this later in the course.
…
M[x1+7] ß x2 x4 ß M[x3+5] …
31
Time
Pipeline CPI Examples
3 instructions finish in 3 cycles CPI = 3/3 =1
3 instructions finish in 4 cycles CPI = 4/3 = 1.33
3 instructions finish in 5cycles CPI = 5/3 = 1.67
Measure from when first instruction finishes to when last instruction in sequence finishes.
Inst
1
Inst 2
Inst 3
Inst
1
Inst 2
Bubble
Inst 3
Inst
1
Bubble 1
Inst 2
InBsutb3ble 2
Inst
3
32
Resolving Data Hazards
n Strategy 1: Stalling
• Waitfortheresulttobeavailablebyfreezing
earlier pipeline stages
• è Interlocks
n Strategy 2: Data Forwarding
• Routedataassoonaspossibleafteritis calculated to the earlier pipeline stage
• è bypass
HKUEEE ENGG3441 – HS 33
Bypassing
t0 t1 t2 t3 t4 t5 t6 t7 ….
time
(I1)x1¬x0+10 (I2)x4¬x1+17 (I3)
(I4)
(I5)
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input
IF1 ID1 EX1 MA1 WB1
IF2
ID2 EX2 MA2 WB2
IF3 ID3 EX3 MA3
ID4 EX4 IF5 ID5
stalled stages IF4 Each stall or kill introduces a bubble in the pipeline
time
(I1)x1¬x0+10 (I2)x4¬x1+17 (I3)
(I4)
(I5)
t0 t1 t2 t3 t4 t5 t6 t7 ….
IF1 ID1 EX1 MA1 WB1
IF2 ID2 EX2 MA2 WB2
IF3 ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
ID2 ID2 ID2
IF3 IF3 IF3
⇒CPI > 1
34
stall
Add
PC D IR
…
x1¬x0+10
x4 ¬ x1 + 17 yes
x4 ¬ x1…