Pipelined Processor Design
Pipelined Processor Design
COE 301 / ICS 233
Computer Organization
Dr. Muhamed Mudawar
College of Computer Sciences and Engineering
King Fahd University of Petroleum and Minerals
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Presentation Outline
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce pipeline registers at end of each stage
Branch Target Address
A
L
U
Address
Instruction
Instruction
Memory
Rs
Rd
Ext
Rt
Jump Target = PC[31:28] ‖ Imm26
ALU result
clk
PC
00
Data
Memory
Address
Data_in
Data_out
Registers
RA
RB
BusA
BusB
RW
BusW
1
0
Imm16
Next PC Address
0
1
1
0
+
0
1
2
IF = Instruction Fetch
+1
ID = Instruction Decode
& Register Read
EX = Execute
MEM = Memory Access
WB = Write Back
ALUOp
RegWr
RegDst
ALUSrc
MemRd
MemWr
WBdata
PCSrc
ExtOp
Zero
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
clk
Pipelined Datapath
Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers and PC
In addition to updating register file and data memory (for store)
Branch Target Address
A
L
U
Address
Instruction
Instruction
Memory
Rs
Rd
Ext
Rt
Jump Target = PC[31:28] ‖ Imm26
ALU Result
1
0
Imm16
Next PC Address
0
1
+
0
1
2
IF = Instruction Fetch
+1
ID = Instruction Decode
& Register Read
EX = Execute
MEM = Memory Access
WB = Write Back
ALUOp
RegWr
RegDst
ALUSrc
MemRd
MemWr
1
0
WBdata
PCSrc
ExtOp
Zero
Data
Memory
Address
Data_in
Data_out
Registers
RA
RB
BusA
BusB
RW
BusW
PC
00
Inst
NPC
BTA
A
B
Imm
D
R
Data
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Problem with Register Destination
Instruction in ID stage is different from the one in WB stage
WB stage is writing to a different destination register
Writing the destination register of the instruction in the ID Stage
1
RegDst
0
clk
Branch Target Address
A
L
U
Address
Instruction
Instruction
Memory
Rs
Rd
Ext
Rt
Jump Target = PC[31:28] ‖ Imm26
ALU Result
Data
Memory
Address
Data_in
Data_out
Registers
RA
RB
BusA
BusB
RW
BusW
1
0
Imm16
Next PC Address
+
0
1
2
IF = Instruction Fetch
+1
ID = Instruction Decode
& Register Read
EX = Execute
MEM = Memory Access
WB = Write Back
ALUOp
RegWr
ALUSrc
MemRd
MemWr
1
0
WBdata
PCSrc
ExtOp
Zero
PC
00
Inst
NPC
BTA
A
B
Imm
D
R
Data
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Registers
RA
RB
BusA
BusB
RW
BusW
Pipelining the Destination Register
Destination Register should be pipelined from ID to WB
The WB stage writes back data knowing the destination register
clk
Branch Target Address
A
L
U
Address
Instruction
Instruction
Memory
Rs
Rd
Ext
Rt
Jump Target = PC[31:28] ‖ Imm26
ALU Result
Data
Memory
Address
Data_in
Data_out
1
0
Imm16
Next PC Address
RegDst
+
0
1
2
IF = Instruction Fetch
+1
ID = Instruction Decode
& Register Read
EX = Execute
MEM = Memory Access
WB = Write Back
ALUOp
RegWr
ALUSrc
MemRd
MemWr
1
0
WBdata
PCSrc
ExtOp
Zero
PC
00
Inst
NPC
BTA
A
B
Imm
D
R
Data
1
0
Rd2
Rd3
Rd4
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
Time (in cycles)
Program Execution Order
add $s1, $s2, $s3
CC2
Reg
IM
DM
Reg
sub $t5, $s2, $t3
CC4
ALU
IM
sw $s2, 10($t3)
DM
Reg
CC5
Reg
ALU
IM
DM
Reg
CC6
Reg
ALU
DM
CC7
Reg
ALU
CC8
Reg
DM
lw $t6, 8($s5)
IM
CC1
Reg
ori $s4, $t3, 7
ALU
CC3
IM
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Instruction-Time Diagram shows:
Which instruction occupying what stage at each clock cycle
Instruction flow is pipelined over the 5 stages
Instruction-Time Diagram
IF
WB
–
EX
ID
WB
–
EX
WB
MEM
–
ID
IF
EX
ID
IF
Time
CC1
CC4
CC5
CC6
CC7
CC8
CC9
CC2
CC3
MEM
EX
ID
IF
WB
MEM
EX
ID
IF
lw $t7, 8($s3)
lw $t6, 8($s5)
ori $t4, $s3, 7
sub $s5, $s2, $t3
sw $s2, 10($s3)
Instruction Order
Up to five instructions can be in the pipeline during the same cycle
Instruction Level Parallelism (ILP)
ALU instructions skip the MEM stage. Store instructions skip the WB stage
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Control Signals
Same control signals used in the single-cycle datapath
Registers
RA
RB
BusA
BusB
RW
BusW
clk
Branch Target Address
A
L
U
Address
Instruction
Instruction
Memory
Rs
Rd
Ext
Rt
Jump Target = PC[31:28] ‖ Imm26
ALU Result
Data
Memory
Address
Data_in
Data_out
1
0
Imm16
Next PC Address
RegDst
+
0
1
2
IF = Instruction Fetch
+1
ID = Instruction Decode
EX = Execute
MEM = Memory Access
WB = Write Back
ALUOp
RegWr
ALUSrc
MemRd
MemWr
1
0
WBdata
PCSrc
ExtOp
Zero
PC
00
Inst
NPC
BTA
A
B
Imm
D
R
Data
1
0
Rd2
Rd3
Rd4
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
RegWr
WB
MemRd
MemWr
WBdata
Pipelined Control
PCSrc
RegDst
EX
ExtOp
ExtOp
Registers
RA
RB
BusA
BusB
RW
BusW
clk
Branch Target Address
A
L
U
Address
Instruction
Instruction
Memory
Rs
Rd
Ext
Rt
Jump Target = PC[31:28] ‖ Imm26
ALU Result
Data
Memory
Address
Data_in
Data_out
1
0
Imm16
Next PC Address
+
0
1
2
IF = Instruction Fetch
+1
ID = Instruction Decode
EX = Execute
MEM = Memory Access
WB = Write Back
1
0
PC
00
Inst
NPC
BTA
A
B
Imm
D
R
Data
1
0
Rd2
Rd3
Rd4
Main & ALU
Control
Op
func
ALUSrc
ALUOp
MEM
Pipeline control signals just like data
Zero
PC
Control
Zero
J
BEQ, BNE
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Pipelined Control – Cont’d
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals
Instruction Decode and Register Read
Control signals are generated
RegDst and ExtOp are used in this stage, J (Jump) is used by PC control
Execution Stage => ALUSrc, ALUOp, BEQ, BNE
ALU generates zero signal for PC control logic (Branch Control)
Memory Stage => MemRd, MemWr, and WBdata
Write Back Stage => RegWr control signal is used in the last stage
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Control Signals Summary
Op Decode
Stage Execute
Stage Memory
Stage Write
Back PC
Control
RegDst ExtOp ALUSrc ALUOp MemRd MemWr WBdata RegWr PCSrc
R-Type 1=Rd X 0=Reg func 0 0 0 1 0 = next PC
ADDI 0=Rt 1=sign 1=Imm ADD 0 0 0 1 0 = next PC
SLTI 0=Rt 1=sign 1=Imm SLT 0 0 0 1 0 = next PC
ANDI 0=Rt 0=zero 1=Imm AND 0 0 0 1 0 = next PC
ORI 0=Rt 0=zero 1=Imm OR 0 0 0 1 0 = next PC
LW 0=Rt 1=sign 1=Imm ADD 1 0 1 1 0 = next PC
SW X 1=sign 1=Imm ADD 0 1 X 0 0 = next PC
BEQ X X 0=Reg SUB 0 0 X 0 0 or 2 = BTA
BNE X X 0=Reg SUB 0 0 X 0 0 or 2 = BTA
J X X X X 0 0 X 0 1 = jump target
PCSrc = 0 or 2 (BTA) for BEQ and BNE, depending on the zero flag
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
Data hazards
An instruction may compute a result needed by next instruction
Data hazards are caused by data dependencies between instructions
Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Pipeline Hazards
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Structural Hazards
Problem
Attempt to use the same hardware resource by two different
instructions during the same clock cycle
Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5
WB
WB
EX
ID
WB
EX
MEM
IF
ID
IF
Time
CC1
CC4
CC5
CC6
CC7
CC8
CC9
CC2
CC3
EX
ID
IF
MEM
EX
ID
IF
lw $t6, 8($s5)
ori $t4, $s3, 7
sub $t5, $s2, $s3
sw $s2, 10($s3)
Instructions
Structural Hazard
Two instructions are attempting to write the register file during same cycle
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything
Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write – RAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
I: add $s1, $s2, $s3 # $s1 is written
J: sub $s4, $s1, $s3 # $s1 is read
Hazard occurs when J reads the operand before I writes it
Data Hazards
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
DM
Reg
IM
Reg
ALU
IM
DM
Reg
Reg
ALU
IM
DM
Reg
Reg
ALU
DM
Reg
ALU
Reg
DM
IM
Reg
ALU
IM
Time (cycles)
Program Execution Order
value of $s2
sub $s2, $t1, $t3
CC1
10
CC2
add $s4, $s2, $t5
10
CC3
or $s6, $t3, $s2
10
CC4
and $s7, $t4, $s2
10
CC6
20
CC7
20
CC8
20
CC5
sw $t8, 10($s2)
10
Example of a RAW Data Hazard
Result of sub is needed by add, or, and, & sw instructions
Instructions add & or will read old value of $s2 from reg file
During CC5, $s2 is written at end of cycle, old value is read
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Reg
Reg
Solution 1: Stalling the Pipeline
Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
The 3 stall cycles delay the execution of add and the fetching of or
The 3 stall cycles insert 3 bubbles (No operations) into the ALU
The add instruction remains in the second stage until CC6
The or instruction is not fetched until CC6
DM
Reg
Reg
Reg
Time (in cycles)
Instruction Order
value of $s2
CC1
10
CC2
10
CC3
10
CC4
10
CC6
20
CC7
20
CC8
20
CC5
10
add $s4, $s2, $t5
IM
or $s6, $t3, $s2
IM
ALU
ALU
Reg
sub $s2, $t1, $t3
IM
Reg
ALU
DM
Reg
CC9
20
DM
stall
stall
stall
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
DM
Reg
Reg
Reg
Reg
Reg
Time (cycles)
Program Execution Order
value of $s2
sub $s2, $t1, $t3
IM
CC1
10
CC2
add $s4, $s2, $t5
IM
10
CC3
or $s6, $t3, $s2
ALU
IM
10
CC4
and $s7, $s6, $s2
ALU
IM
10
CC6
Reg
DM
ALU
20
CC7
Reg
DM
ALU
20
CC8
Reg
DM
20
CC5
sw $t8, 10($s2)
Reg
DM
ALU
IM
10
Solution 2: Forwarding ALU Result
The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded from ALU, MEM, and WB stages
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Implementing Forwarding
0
1
2
3
0
1
2
3
R
32
32
clk
32
Rs
Instruction
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
A
L
U
Ext
Imm16
1
0
Rd3
Rd2
A
B
Data
D
Imm
32
Register File
RB
BusA
BusB
RW
BusW
RA
Rt
Two multiplexers added at the inputs of A & B registers
Data from ALU stage, MEM stage, and WB stage is fed back
Two signals: ForwardA and ForwardB to control forwarding
ForwardA
ForwardB
32
Rd
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Forwarding Control Signals
Signal Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardA = 1 Forward result of previous instruction to A (from ALU stage)
ForwardA = 2 Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3 Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1 Forward result of previous instruction to B (from ALU stage)
ForwardB = 2 Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3 Forward result of 3rd previous instruction to B (from WB stage)
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Forwarding Example
0
1
2
3
0
1
2
3
R
32
32
clk
32
Rs
Instruction
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
A
L
U
Ext
Imm16
1
0
Rd3
Rd2
A
B
Data
D
Imm
32
Register File
RB
BusA
BusB
RW
BusW
RA
Rt
32
Rd
Instruction sequence:
lw $t4, 4($t0)
ori $t7, $t1, 2
sub $t3, $t4, $t7
When sub instruction in ID stage
ori will be in the ALU stage
lw will be in the MEM stage
lw $t4,4($t0)
ori $t7,$t1,2
sub $t3,$t4,$t7
ForwardA = 2 (from MEM stage)
ForwardB = 1 (from ALU stage)
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
RAW Hazard Detection
Current instruction is being decoded in the Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction is in the Write Back stage
If ((Rs != 0) and (Rs == Rd2) and (EX.RegWr)) ForwardA = 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWr)) ForwardA = 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWr)) ForwardA = 3
Else ForwardA = 0
If ((Rt != 0) and (Rt == Rd2) and (EX.RegWr)) ForwardB = 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWr)) ForwardB = 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWr)) ForwardB = 3
Else ForwardB = 0
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Hazard Detecting and Forwarding Logic
ForwardB
ForwardA
Rs
Rt
Hazard
Detect &
Forward
0
1
2
3
0
1
2
3
R
32
32
clk
32
Rs
Instruction
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
A
L
U
Ext
Imm16
1
0
Rd3
Rd2
A
B
Data
D
Imm
32
Register File
RB
BusA
BusB
RW
BusW
RA
Rt
32
Rd
Main
& ALU
Control
Op
func
WB
RegDst
EX
ExtOp
ExtOp
ALUSrc
ALUOp
MEM
MemRd
MemWr
WBdata
RegWr
RegWr
RegWr
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Reg
Reg
Reg
Time (cycles)
Program Order
CC2
add $s4, $s2, $t5
Reg
IF
CC3
or $t6, $t3, $s2
ALU
IF
CC6
Reg
DM
ALU
CC7
Reg
Reg
DM
CC8
Reg
lw $s2, 20($t1)
IF
CC1
CC4
and $t7, $s2, $t4
DM
ALU
IF
CC5
DM
ALU
Load Delay
Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
In the example shown below …
The LW instruction does not read data until end of CC4
Cannot forward data to ADD at end of CC3 – NOT possible
However, load can forward data to 2nd next and later instructions
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
Instruction that depends on the load data is in the decode stage
Condition for stalling the pipeline
if ((EX.MemRd == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard
Insert a bubble into the EX stage after a load instruction
Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after load by one cycle
Because of RAW hazard
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Reg
or $t6, $s3, $s2
IM
DM
Reg
ALU
Reg
ALU
DM
Reg
add $s4, $s2, $t5
IM
Reg
lw $s2, 20($s1)
IM
stall
ALU
bubble
bubble
bubble
DM
Reg
Stall the Pipeline for one Cycle
ADD instruction depends on LW stall at CC3
Allow Load instruction in ALU stage to proceed
Freeze PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)
Load can forward data to next instruction after delaying it
Time (cycles)
Program Order
CC2
CC3
CC6
CC7
CC8
CC1
CC4
CC5
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
lw $s2, 8($s1)
MEM
WB
EX
ID
Stall
IF
lw $s1, ($t5)
MEM
WB
EX
ID
IF
Showing Stall Cycles
Stall cycles can be shown on instruction-time diagram
Hazard is detected in the Decode stage
Stall indicates that instruction is delayed
Instruction fetching is also delayed after a stall
Example:
add $v0, $s2, $t3
–
WB
EX
ID
Stall
IF
sub $v1, $s2, $v0
–
WB
EX
ID
IF
Time
CC1
CC4
CC5
CC6
CC7
CC8
CC9
CC2
CC3
CC10
Data forwarding is shown using green arrows
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Hazard Detecting and Forwarding Logic
ForwardB
ForwardA
Rs
Rt
Hazard Detect
Forward
and Stall
MemRd
RegWr
RegWr
RegWr
Stall
Disable PC
Disable IR
Bubble = 0
0
1
2
3
0
1
2
3
R
32
32
clk
32
Rs
Instruction
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
A
L
U
Ext
Imm16
1
0
Rd3
Rd2
A
B
Data
D
Imm
32
Register File
RB
BusA
BusB
RW
BusW
RA
Rt
32
Rd
Main
& ALU
Control
Op
func
WB
MemRd
MemWr
WBdata
RegDst
EX
ExtOp
MEM
PC
Control Signals
0
1
ALUSrc
ALUOp
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Code Scheduling to Avoid Stalls
Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E – F; // A thru F are in Memory
Slow code:
lw $t0, 4($s0) # &B = 4($s0)
lw $t1, 8($s0) # &C = 8($s0)
add $t2, $t0, $t1 # stall cycle
sw $t2, 0($s0) # &A = 0($s0)
lw $t3, 16($s0) # &E = 16($s0)
lw $t4, 20($s0) # &F = 20($s0)
sub $t5, $t3, $t4 # stall cycle
sw $t5, 12($0) # &D = 12($0)
Fast code: No Stalls
lw $t0, 4($s0)
lw $t1, 8($s0)
lw $t3, 16($s0)
lw $t4, 20($s0)
add $t2, $t0, $t1
sw $t2, 0($s0)
sub $t5, $t3, $t4
sw $t5, 12($s0)
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Instruction J should write its result after it is read by I
Called anti-dependence by compiler writers
I: sub $t4, $t1, $t3 # $t1 is read
J: add $t1, $t2, $t3 # $t1 is written
Results from reuse of the name $t1
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Anti-dependence can be eliminated by renaming
Use a different destination register for add (eg, $t5)
Name Dependence: Write After Read
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Name Dependence: Write After Write
Same destination register is written by two instructions
Called output-dependence in compiler terminology
I: sub $t1, $t4, $t3 # $t1 is written
J: add $t1, $t2, $t3 # $t1 is written again
Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5
However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction J completes and writes $t1 before instruction I
Output dependence can be eliminated by renaming $t1
Read After Read is NOT a name dependence
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch Result Taken or Not Taken
Branch Target Address
PC + 4 If Branch is NOT taken
PC + 4 + 4 × immediate If Branch is Taken
Jump and Branch targets are computed in the ID stage
At which point a new instruction is already being fetched
Jump Instruction: 1-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
1-Cycle Jump Delay
Control logic detects a Jump instruction in the 2nd Stage
Next instruction is fetched anyway
Convert Next instruction into bubble (Jump is always taken)
J L1
IF
cc1
Next instruction
. . .
L1: Target instruction
cc2
ID
IF
Jump
Target
Addr
cc4
cc5
cc6
cc7
cc3
Bubble
Bubble
Bubble
Bubble
IF
Reg
DM
ALU
Reg
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
2-Cycle Branch Delay
Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken
Beq $t1,$t2,L1
IF
cc1
Next1
cc2
Reg
IF
Next2
cc4
cc5
cc6
cc7
IF
Reg
DM
ALU
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
L1: target instruction
cc3
Branch
Target
Addr
ALU
Reg
IF
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Predict Branch NOT Taken
Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles
Beq $t1,$t2,L1
IF
cc1
Next1
cc2
Reg
IF
Next2
cc3
NOT Taken
ALU
Reg
IF
Reg
cc4
cc5
cc6
cc7
ALU
DM
ALU
DM
Reg
Reg
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
J
J
Pipelined Jump and Branch
Main
& ALU
Control
Op
func
ForwardB
ForwardA
Rs
Rt
Forward & Stall
Rd2, Rd3, Rd4
RegWr, MemRd
0
1
2
3
0
1
2
3
R
32
32
Rs
Instruction
0
1
A
L
U
Ext
Imm16
1
0
Rd3
Rd2
A
B
D
Imm
32
Register File
RB
BusA
BusB
RW
BusW
RA
Rt
32
Rd
Zero
Address
Instruction
Instruction
Memory
Branch Target Address
Jump Target = PC[31:28] ‖ Imm26
Next PC Address
+
0
1
2
+1
PC
00
NPC
BTA
0
1
Bubble = 0
Stall
Disable PC
Disable IR
Bubble = NOP
Kill1
Jump
kills next
instruction
Kill2
Taken branch kills two
MEM
Control Signals
0
1
Control Signals
EX
PC
Control
PCSrc
BEQ, BNE
BEQ, BNE
Zero
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
PC Control for Pipelined Jump and Branch
if ((BEQ && Zero) || (BNE && !Zero))
{ Jmp=0; Br=1; Kill1=1; Kill2=1; }
else if (J)
{ Jmp=1; Br=0; Kill1=1; Kill2=0; }
else
{ Jmp=0; Br=0; Kill1=0; Kill2=0; }
Br = (( BEQ · Zero ) + (BNE · Zero ))
Jmp = J · Br
Kill1 = J + Br
Kill2 = Br
PCSrc = { Br, Jmp } // 0, 1, or 2
Br
Jmp
Kill1
Kill2
BEQ
BNE
J
Zero
PCSrc
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Jump and Branch Impact on CPI
Base CPI = 1 without counting jump and branch
Unconditional Jump = 5%, Conditional branch = 20%
90% of conditional branches are taken
Jump kills next instruction, Taken Branch kills next two
What is the effect of jump and branch on the CPI?
Solution:
Jump adds 1 wasted cycle for 5% of instructions = 1 × 0.05
Branch adds 2 wasted cycles for 20% × 90% of instructions
= 2 × 0.2 × 0.9 = 0.36
New CPI = 1 + 0.05 + 0.36 = 1.41 (due to wasted cycles)
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Branch Hazard Alternatives
Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT kill instructions if the branch is NOT taken
Kill only instructions appearing after Jump or taken branch
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Dynamic Branch Prediction
Loop branches are taken most of time
Must reduce the branch delay to 0, but how?
How to predict branch behavior at runtime?
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Define branch to take place after the next instruction
MIPS defines one delay slot
Reduces branch penalty
Compiler fills the branch delay slot
By selecting an independent instruction
from before the branch
Must be okay to execute instruction in the
delay slot whether branch is taken or not
If no instruction is found
Compiler fills delay slot with a NO-OP
Delayed Branch
label:
. . .
add $t2,$t3,$t4
beq $s1,$s0,label
Delay Slot
label:
. . .
beq $s1,$s0,label
add $t2,$t3,$t4
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
New meaning for branch instruction
Branching takes place after next instruction (Not immediately!)
Impacts software and compiler
Compiler is responsible to fill the branch delay slot
However, modern processors and deeply pipelined
Branch penalty is multiple cycles in deep pipelines
Multiple delay slots are difficult to fill with useful instructions
MIPS used delayed branching in earlier pipelines
However, delayed branching lost popularity in recent processors
Dynamic branch prediction has replaced delayed branching
Drawback of Delayed Branching
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Use the lower bits of the PC to index the BTB
Each BTB entry stores Branch/Jump address & Target Address
Check the PC to see if the instruction being fetched is a branch
Update the PC using the target address stored in the BTB
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Branch Target Buffer (IF Stage)
The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are determined by the hardware at runtime
mux
PC
Branch Target & Prediction Buffer
Addresses of Recent Branches
Target
Addresses
low-order bits used as index
Predict
Bits
Inc
=
predict_taken
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Branch Target Buffer – cont’d
Each Branch Target Buffer (BTB) entry stores:
Address of a recent jump or branch instruction
Target address of jump or branch
Prediction bits for a conditional branch (Taken or Not Taken)
To predict jump/branch target address and branch outcome before instruction is decoded and branch outcome is computed
Use the lower bits of the PC to index the BTB
Check if the PC matches an entry in the BTB (jump or branch)
If there is a match and the branch is predicted to be Taken then Update the PC using the target address stored in the BTB
The BTB entries are updated by the hardware at runtime
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically few prediction bits (1 or 2) are used per entry
We don’t know if the prediction is correct or not
If correct prediction …
Continue normal execution – no wasted cycles
If incorrect prediction (misprediction) …
Kill the instructions that were incorrectly fetched wasted cycles
Update prediction bits and target address for future use
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Correct
Prediction
No stall
cycles
Yes
No
Dynamic Branch Prediction – Cont’d
Use PC to address Instruction memory and Branch Target Buffer
Found
BTB entry with predict
taken?
Increment PC
PC = target address
Enter branch & target address,
and set prediction in BTB entry.
Kill fetched instructions.
Restart PC at target address
Mispredicted branch
Kill fetched instructions
Update prediction bits
Restart PC after branch
Normal
Execution
Yes
No
Taken
branch?
No
Yes
IF
ID
EX
Taken
branch?
Jump?
No
Enter jump & target address in BTB
Kill fetched instruction.
Restart PC at jump target address
Yes
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are killed
1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch
1-bit Prediction Scheme
Predict
Not Taken
Taken
Predict
Taken
Not
Taken
Not Taken
Taken
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
1-Bit Predictor: Shortcoming
Inner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
outer: …
…
inner: …
…
bne …, …, inner
…
bne …, …, outer
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Morgan Kaufmann Publishers
11 April, 2018
Chapter 4 — The Processor
53
1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken
Implemented as a saturating counter
Counter is incremented to max=3 when branch outcome is taken
Counter is decremented to min=0 when branch is not taken
2-bit Prediction Scheme
Not Taken
Taken
Not Taken
Taken
Strong
Predict
Not Taken
Taken
Weak
Predict
Taken
Not Taken
Weak
Predict
Not Taken
Not Taken
Taken
Strong
Predict
Taken
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Evaluating Branch Alternatives
Assume: Jump = 2%, Branch-Not-Taken = 5%, Branch-Taken = 15%
Assume a branch target buffer with hit rate = 90% for jump & branch
Prediction accuracy for jump = 100%, for conditional branch = 95%
What is the impact on the CPI? (Ideal CPI = 1 if no control hazards)
Branch Scheme Jump Branch Not Taken Branch Taken
Predict not taken Penalty = 1 cycle Penalty = 0 cycles Penalty = 2 cycles
Delayed Penalty = 0 cycles Penalty = 0 cycles Penalty = 1 cycle
BTB Prediction Penalty = 1 cycle Penalty = 2 cycles Penalty = 2 cycles
Branch Scheme Jump = 2% Branch NT = 5% Branch Taken = 15% CPI
Predict not taken 0.02 × 1 0 0.15 × 2 = 0.30 1+0.32
Delayed 0 0 0.15 × 1 = 0.15 1+0.15
BTB Prediction 0.02×0.1×1 0.05×0.9×0.05×2 0.15×(0.1+0.9×0.05)×2 1+0.05
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
In Summary
Three types of pipeline hazards
Structural hazards: conflicts using a resource during same cycle
Data hazards: caused by data dependencies between instructions
Control hazards: caused by branch and jump instructions
Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching reduces branch delay but needs compiler support
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should kill the wrongly fetched instructions
Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
/docProps/thumbnail.jpeg