
Pipelined Processor Design
COE 301 / ICS 233
Computer Organization
Dr. Muhamed Mudawar
College of Computer Sciences and Engineering
King Fahd University of Petroleum and Minerals

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Presentation Outline
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce pipeline registers at end of each stage

[Figure: single-cycle datapath with PC, instruction memory, register file, ALU, extender, data memory, and next-PC logic, annotated with the control signals RegDst, RegWr, ExtOp, ALUSrc, ALUOp, MemRd, MemWr, WBdata, and PCSrc. Stage labels: IF = Instruction Fetch, ID = Instruction Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Pipelined Datapath
Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers and PC
In addition to updating register file and data memory (for store)
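As a software analogy (a minimal Python sketch, not part of the actual hardware design; all names are illustrative), the pipeline registers can be modeled as state that is written all at once on each clock edge:

from dataclasses import dataclass, field

@dataclass
class PipelineState:
    pc: int = 0
    if_id: dict = field(default_factory=dict)   # Inst, NPC
    id_ex: dict = field(default_factory=dict)   # A, B, Imm, BTA, control signals
    ex_mem: dict = field(default_factory=dict)  # ALU result R, B, control signals
    mem_wb: dict = field(default_factory=dict)  # Data, R, destination register, control signals

def clock_edge(state, computed):
    # One rising clock edge: the PC and every pipeline register capture
    # their newly computed values at the same time.
    state.pc = computed.pc
    state.if_id, state.id_ex = computed.if_id, computed.id_ex
    state.ex_mem, state.mem_wb = computed.ex_mem, computed.mem_wb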

[Figure: pipelined datapath with pipeline registers (shown in green) at the end of each stage: PC, Inst, NPC, A, B, Imm, BTA, R, D, and Data, along with the control signals RegDst, ExtOp, ALUSrc, ALUOp, MemRd, MemWr, WBdata, RegWr, and PCSrc. Stage labels: IF = Instruction Fetch, ID = Instruction Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Problem with Register Destination
The instruction in the ID stage is different from the one in the WB stage
The WB stage is writing to the wrong destination register:
It writes to the destination register of the instruction currently in the ID stage

[Figure: pipelined datapath in which RegDst still selects between the Rd and Rt fields of the instruction in the ID stage, so the write-back uses the wrong destination register number]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›


Pipelining the Destination Register
Destination Register should be pipelined from ID to WB
The WB stage writes back data knowing the destination register

[Figure: pipelined datapath in which the destination register number (selected by RegDst between Rd and Rt in the ID stage) is carried through pipeline registers Rd2, Rd3, and Rd4 so that the WB stage writes back to the correct register]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
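Such a figure can also be generated programmatically. The following Python sketch (illustrative only; it ignores hazards and stage skipping) prints which resource each instruction occupies in every cycle, matching the layout of the figure that follows:

def print_pipeline(instructions, stages=("IM", "Reg", "ALU", "DM", "Reg")):
    # One instruction enters the pipeline per cycle; hazards are ignored in this sketch.
    cycles = len(instructions) + len(stages) - 1
    print(" " * 20 + "".join(f"{'CC' + str(c + 1):<8}" for c in range(cycles)))
    for i, inst in enumerate(instructions):
        row = [" " * 8] * cycles
        for s, stage in enumerate(stages):
            row[i + s] = f"{stage:<8}"
        print(f"{inst:<20}" + "".join(row))

print_pipeline(["lw  $t6, 8($s5)",
                "add $s1, $s2, $s3",
                "sub $t5, $s2, $t3",
                "ori $s4, $t3, 7",
                "sw  $s2, 10($t3)"])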

[Figure: program execution order from top to bottom, time (clock cycles CC1 through CC8) from left to right; each instruction (lw $t6, 8($s5), add $s1, $s2, $s3, sub $t5, $s2, $t3, ori $s4, $t3, 7, sw $s2, 10($t3)) uses the IM, Reg, ALU, DM, and Reg resources in successive cycles]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Instruction-Time Diagram
The instruction-time diagram shows:
Which instruction occupies which stage at each clock cycle
Instruction flow is pipelined over the 5 stages
[Figure: instruction-time diagram for lw $t7, 8($s3), lw $t6, 8($s5), ori $t4, $s3, 7, sub $s5, $s2, $t3, and sw $s2, 10($s3), showing each instruction's IF, ID, EX, MEM, and WB stages over clock cycles CC1 through CC9]

Up to five instructions can be in the pipeline during the same cycle
Instruction Level Parallelism (ILP)

ALU instructions skip the MEM stage. Store instructions skip the WB stage

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Control Signals
Same control signals used in the single-cycle datapath

[Figure: pipelined datapath annotated with the control signals RegDst and ExtOp (ID stage), ALUSrc and ALUOp (EX stage), MemRd, MemWr, and WBdata (MEM stage), RegWr (WB stage), and PCSrc]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Pipelined Control
Pipeline control signals just like data
[Figure: the Main & ALU Control in the ID stage generates RegDst, ExtOp, ALUSrc, ALUOp, BEQ, BNE, J, MemRd, MemWr, WBdata, and RegWr; the EX, MEM, and WB control signals are carried along in the pipeline registers, and the PC Control uses J, BEQ, BNE, and the Zero flag to generate PCSrc]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Pipelined Control – Cont’d
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals
Instruction Decode and Register Read
Control signals are generated
RegDst and ExtOp are used in this stage, J (Jump) is used by PC control
Execution Stage => ALUSrc, ALUOp, BEQ, BNE
ALU generates zero signal for PC control logic (Branch Control)
Memory Stage => MemRd, MemWr, and WBdata
Write Back Stage => RegWr control signal is used in the last stage

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Control Signals Summary
Stages: RegDst, ExtOp (Decode); ALUSrc, ALUOp (Execute); MemRd, MemWr, WBdata (Memory); RegWr (Write Back); PCSrc (PC Control)
Op       RegDst  ExtOp   ALUSrc  ALUOp  MemRd  MemWr  WBdata  RegWr  PCSrc
R-Type   1=Rd    X       0=Reg   func   0      0      0       1      0 = next PC
ADDI     0=Rt    1=sign  1=Imm   ADD    0      0      0       1      0 = next PC
SLTI     0=Rt    1=sign  1=Imm   SLT    0      0      0       1      0 = next PC
ANDI     0=Rt    0=zero  1=Imm   AND    0      0      0       1      0 = next PC
ORI      0=Rt    0=zero  1=Imm   OR     0      0      0       1      0 = next PC
LW       0=Rt    1=sign  1=Imm   ADD    1      0      1       1      0 = next PC
SW       X       1=sign  1=Imm   ADD    0      1      X       0      0 = next PC
BEQ      X       X       0=Reg   SUB    0      0      X       0      0 or 2 = BTA
BNE      X       X       0=Reg   SUB    0      0      X       0      0 or 2 = BTA
J        X       X       X       X      0      0      X       0      1 = jump target

PCSrc = 0 or 2 (BTA) for BEQ and BNE, depending on the zero flag

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Pipeline Hazards
Hazards: situations that would cause incorrect execution
If the next instruction were launched during its designated clock cycle
Structural hazards
Caused by resource contention
Using the same resource by two instructions during the same cycle
Data hazards
An instruction may compute a result needed by the next instruction
Data hazards are caused by data dependencies between instructions
Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Structural Hazards
Problem
Attempt to use the same hardware resource by two different instructions during the same clock cycle
Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5
[Figure: instruction-time diagram for lw $t6, 8($s5), ori $t4, $s3, 7, sub $t5, $s2, $s3, and sw $s2, 10($s3); the ALU instructions write back in stage 4 while the load writes back in stage 5, so two instructions attempt to write the register file during the same cycle – a structural hazard]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything
Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write – RAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
I: add $s1, $s2, $s3 # $s1 is written
J: sub $s4, $s1, $s3 # $s1 is read
Hazard occurs when J reads the operand before I writes it

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Example of a RAW Data Hazard
Result of sub is needed by add, or, and, & sw instructions
Instructions add & or will read old value of $s2 from reg file
During CC5, $s2 is written at end of cycle, old value is read
[Figure: pipeline diagram for sub $s2, $t1, $t3 followed by add $s4, $s2, $t5, or $s6, $t3, $s2, and $s7, $t4, $s2, and sw $t8, 10($s2) over CC1–CC8; the value of $s2 is 10 through CC5 and 20 from CC6 onward]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Solution 1: Stalling the Pipeline
Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
The 3 stall cycles delay the execution of add and the fetching of or
The 3 stall cycles insert 3 bubbles (No operations) into the ALU
The add instruction remains in the second stage until CC6
The or instruction is not fetched until CC6

[Figure: pipeline diagram for sub $s2, $t1, $t3, add $s4, $s2, $t5, and or $s6, $t3, $s2 over CC1–CC9; three stall cycles (bubbles) are inserted before the add executes, and the value of $s2 changes from 10 to 20 in CC6]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Solution 2: Forwarding ALU Result
The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded from ALU, MEM, and WB stages
[Figure: pipeline diagram for sub $s2, $t1, $t3, add $s4, $s2, $t5, or $s6, $t3, $s2, and $s7, $s6, $s2, and sw $t8, 10($s2) over CC1–CC8; the sub result is forwarded to the dependent instructions, so no stall cycles are needed]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Implementing Forwarding

Two multiplexers are added at the inputs of the A & B registers
Data from the ALU stage, MEM stage, and WB stage is fed back
Two signals, ForwardA and ForwardB, control the forwarding
[Figure: ID/EX datapath with 4-to-1 forwarding multiplexers in front of the A and B registers; inputs 0–3 select BusA/BusB from the register file, the ALU result (EX stage), the value in the MEM stage, and the value being written back in the WB stage]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Forwarding Control Signals
Signal Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardA = 1 Forward result of previous instruction to A (from ALU stage)
ForwardA = 2 Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3 Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1 Forward result of previous instruction to B (from ALU stage)
ForwardB = 2 Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3 Forward result of 3rd previous instruction to B (from WB stage)

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Forwarding Example

Instruction sequence:
lw  $t4, 4($t0)
ori $t7, $t1, 2
sub $t3, $t4, $t7
When the sub instruction is in the ID stage:
ori will be in the ALU stage
lw will be in the MEM stage
ForwardA = 2 (forward the lw result from the MEM stage)
ForwardB = 1 (forward the ori result from the ALU stage)
[Figure: datapath with the two forwarding paths highlighted for this sequence]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
RAW Hazard Detection
Current instruction is being decoded in the Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction is in the Write Back stage
If ((Rs != 0) and (Rs == Rd2) and (EX.RegWr)) ForwardA = 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWr)) ForwardA = 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWr)) ForwardA = 3
Else ForwardA = 0
If ((Rt != 0) and (Rt == Rd2) and (EX.RegWr)) ForwardB = 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWr)) ForwardB = 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWr)) ForwardB = 3
Else ForwardB = 0
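These conditions translate directly into code. Here is an illustrative Python sketch of the forwarding logic (signal names follow the slide; the helper function itself is an assumption, not part of the original design):

def forward_select(reg, rd2, rd3, rd4, ex_regwr, mem_regwr, wb_regwr):
    # Returns the forwarding mux select for one ALU operand register number.
    if reg != 0 and reg == rd2 and ex_regwr:
        return 1    # forward from the ALU (EX) stage
    if reg != 0 and reg == rd3 and mem_regwr:
        return 2    # forward from the MEM stage
    if reg != 0 and reg == rd4 and wb_regwr:
        return 3    # forward from the WB stage
    return 0        # no forwarding: read the value from the register file

# ForwardA = forward_select(Rs, ...) and ForwardB = forward_select(Rt, ...)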

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Hazard Detecting and Forwarding Logic

[Figure: datapath with a Hazard Detect & Forward unit; it compares Rs and Rt of the instruction in the ID stage against Rd2, Rd3, and Rd4 and the RegWr signals of the EX, MEM, and WB stages to generate ForwardA and ForwardB]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Load Delay
Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
In the example shown below …
The LW instruction does not read data until end of CC4
Cannot forward data to ADD at end of CC3 – NOT possible
However, load can forward data to 2nd next and later instructions
[Figure: pipeline diagram for lw $s2, 20($t1), add $s4, $s2, $t5, or $t6, $t3, $s2, and and $t7, $s2, $t4; the load data is available only at the end of CC4, too late for the add in CC3 but in time for the or and the and]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
Instruction that depends on the load data is in the decode stage
Condition for stalling the pipeline
if ((EX.MemRd == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard
Insert a bubble into the EX stage after a load instruction
Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after load by one cycle
Because of RAW hazard
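The stall condition can be sketched in the same style (illustrative Python; signal names follow the slides, the function is an assumption):

def load_use_stall(ex_memrd, forward_a, forward_b):
    # Detect a load in the EX stage whose result is needed by the
    # instruction currently being decoded (ForwardA or ForwardB = 1).
    return ex_memrd == 1 and (forward_a == 1 or forward_b == 1)

# When load_use_stall(...) is True: freeze the PC and the Instruction
# register, and insert a bubble (all-zero control signals) into the EX
# stage for one cycle.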

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

Stall the Pipeline for one Cycle
ADD instruction depends on LW => stall at CC3
Allow Load instruction in ALU stage to proceed
Freeze PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)
Load can forward data to next instruction after delaying it by one cycle
[Figure: pipeline diagram for lw $s2, 20($s1), add $s4, $s2, $t5, and or $t6, $s3, $s2 over CC1–CC8; a one-cycle stall and a bubble are inserted, then the load data is forwarded to the add]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Showing Stall Cycles
Stall cycles can be shown on the instruction-time diagram
The hazard is detected in the Decode stage
Stall indicates that the instruction is delayed
Instruction fetching is also delayed after a stall
Example:
lw  $s1, ($t5)
lw  $s2, 8($s1)
add $v0, $s2, $t3
sub $v1, $s2, $v0
[Figure: instruction-time diagram over CC1–CC10; lw $s2 stalls one cycle waiting for lw $s1, and add $v0 stalls one cycle waiting for lw $s2; data forwarding is shown using green arrows]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Hazard Detecting and Forwarding Logic

[Figure: datapath with a Hazard Detect, Forward, and Stall unit; in addition to ForwardA and ForwardB it uses MemRd of the instruction in the EX stage to generate Stall, which disables the PC and Instruction registers and inserts a bubble (zero control signals) into the EX stage]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Code Scheduling to Avoid Stalls
Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E – F; // A thru F are in Memory
Slow code:
lw $t0, 4($s0) # &B = 4($s0)
lw $t1, 8($s0) # &C = 8($s0)
add $t2, $t0, $t1 # stall cycle
sw $t2, 0($s0) # &A = 0($s0)
lw $t3, 16($s0) # &E = 16($s0)
lw $t4, 20($s0) # &F = 20($s0)
sub $t5, $t3, $t4 # stall cycle
sw $t5, 12($s0) # &D = 12($s0)
Fast code: No Stalls
lw $t0, 4($s0)
lw $t1, 8($s0)
lw $t3, 16($s0)
lw $t4, 20($s0)
add $t2, $t0, $t1
sw $t2, 0($s0)
sub $t5, $t3, $t4
sw $t5, 12($s0)

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Name Dependence: Write After Read
Instruction J should write its result after it is read by I
Called anti-dependence by compiler writers
I: sub $t4, $t1, $t3 # $t1 is read
J: add $t1, $t2, $t3 # $t1 is written
Results from reuse of the name $t1
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Anti-dependence can be eliminated by renaming
Use a different destination register for add (e.g., $t5)

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Name Dependence: Write After Write
Same destination register is written by two instructions
Called output-dependence in compiler terminology
I: sub $t1, $t4, $t3 # $t1 is written
J: add $t1, $t2, $t3 # $t1 is written again
Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5
However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction J completes and writes $t1 before instruction I
Output dependence can be eliminated by renaming $t1
Read After Read is NOT a name dependence

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch outcome: Taken or Not Taken
Branch Target Address:
PC + 4 if the branch is NOT taken
PC + 4 + 4 × immediate if the branch is Taken
Jump and Branch targets are computed in the ID stage
At which point a new instruction is already being fetched
Jump Instruction: 1-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
1-Cycle Jump Delay
Control logic detects a Jump instruction in the 2nd Stage
Next instruction is fetched anyway
Convert Next instruction into bubble (Jump is always taken)

[Figure: timing of J L1 followed by the next instruction and the target instruction at L1; the jump target address is known in cc2 (ID stage), the already-fetched next instruction is converted into a bubble, and the target instruction is fetched in cc3]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
2-Cycle Branch Delay
Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken

[Figure: timing of beq $t1,$t2,L1 followed by Next1 and Next2; the branch outcome is computed by the ALU in cc3, Next1 and Next2 are converted into bubbles if the branch is taken, and the target instruction at L1 is fetched in cc4]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Predict Branch NOT Taken
Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles

[Figure: timing of beq $t1,$t2,L1 followed by Next1 and Next2 when the branch is NOT taken; Next1 and Next2 continue normally through the pipeline and no cycles are wasted]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Pipelined Jump and Branch
[Figure: pipelined datapath with PC control; the jump and branch target addresses are computed in the ID stage; Kill1 converts the next instruction into a bubble (a jump kills the next instruction) and Kill2 kills the second instruction after a taken branch; BEQ and BNE are pipelined to the EX stage, where the ALU Zero flag and the PC Control generate PCSrc; on a load-use stall the PC and IR are disabled and a bubble (NOP control signals) is inserted]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
PC Control for Pipelined Jump and Branch
if ((BEQ && Zero) || (BNE && !Zero))
{ Jmp=0; Br=1; Kill1=1; Kill2=1; }
else if (J)
{ Jmp=1; Br=0; Kill1=1; Kill2=0; }
else
{ Jmp=0; Br=0; Kill1=0; Kill2=0; }
Br = (BEQ · Zero) + (BNE · Zero')
Jmp = J · Br'
Kill1 = J + Br
Kill2 = Br
PCSrc = { Br, Jmp } // 0, 1, or 2
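The same PC control written as a small Python function (a sketch of the equations above, not the actual hardware):

def pc_control(beq, bne, j, zero):
    br = (beq and zero) or (bne and not zero)   # taken branch
    jmp = j and not br                          # jump
    kill1 = j or br                             # jump or taken branch kills the next instruction
    kill2 = br                                  # taken branch also kills the second one
    pcsrc = 2 if br else (1 if jmp else 0)      # {Br, Jmp} encoded as 0, 1, or 2
    return jmp, br, kill1, kill2, pcsrc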

[Figure: combinational logic generating Br, Jmp, Kill1, Kill2, and PCSrc from BEQ, BNE, J, and Zero]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Jump and Branch Impact on CPI
Base CPI = 1 without counting jump and branch
Unconditional Jump = 5%, Conditional branch = 20%
90% of conditional branches are taken
Jump kills next instruction, Taken Branch kills next two
What is the effect of jump and branch on the CPI?
Solution:
Jump adds 1 wasted cycle for 5% of instructions = 1 × 0.05
Branch adds 2 wasted cycles for 20% × 90% of instructions
= 2 × 0.2 × 0.9 = 0.36
New CPI = 1 + 0.05 + 0.36 = 1.41 (due to wasted cycles)
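The same calculation in Python (illustrative):

jump_frac, branch_frac, taken_frac = 0.05, 0.20, 0.90
cpi = 1 + jump_frac * 1 + branch_frac * taken_frac * 2   # 1 cycle wasted per jump, 2 per taken branch
print(round(cpi, 2))   # 1.41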

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Next . . .
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Branch Hazard Alternatives
Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT kill instructions if the branch is NOT taken
Kill only instructions appearing after Jump or taken branch
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Dynamic Branch Prediction
Loop branches are taken most of time
Must reduce the branch delay to 0, but how?
How to predict branch behavior at runtime?

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Delayed Branch
Define branch to take place after the next instruction
MIPS defines one delay slot
Reduces branch penalty
Compiler fills the branch delay slot
By selecting an independent instruction from before the branch
Must be okay to execute the instruction in the delay slot whether the branch is taken or not
If no such instruction is found, the compiler fills the delay slot with a NO-OP
Original code:
label: . . .
    add $t2,$t3,$t4
    beq $s1,$s0,label    # delay slot follows the branch
Code with the delay slot filled (add moved into the slot):
label: . . .
    beq $s1,$s0,label
    add $t2,$t3,$t4      # delay slot

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Drawback of Delayed Branching
New meaning for the branch instruction
Branching takes place after the next instruction (not immediately!)
Impacts software and the compiler
The compiler is responsible for filling the branch delay slot
However, modern processors are deeply pipelined
Branch penalty is multiple cycles in deep pipelines
Multiple delay slots are difficult to fill with useful instructions
MIPS used delayed branching in its earlier pipelines
However, delayed branching lost popularity in recent processors
Dynamic branch prediction has replaced delayed branching

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Use the lower bits of the PC to index the BTB
Each BTB entry stores Branch/Jump address & Target Address
Check the PC to see if the instruction being fetched is a branch
Update the PC using the target address stored in the BTB

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Branch Target Buffer (IF Stage)
The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are determined by the hardware at runtime

[Figure: Branch Target & Prediction Buffer in the IF stage; the low-order bits of the PC index the buffer, which stores the addresses of recent branches, their target addresses, and predict bits; a comparator checks for a match, and predict_taken selects between the incremented PC and the stored target address]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Branch Target Buffer – cont’d
Each Branch Target Buffer (BTB) entry stores:
Address of a recent jump or branch instruction
Target address of jump or branch
Prediction bits for a conditional branch (Taken or Not Taken)
To predict jump/branch target address and branch outcome before instruction is decoded and branch outcome is computed
Use the lower bits of the PC to index the BTB
Check if the PC matches an entry in the BTB (jump or branch)
If there is a match and the branch is predicted to be Taken then Update the PC using the target address stored in the BTB
The BTB entries are updated by the hardware at runtime
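A simplified software model of this lookup and update (an illustrative sketch; the entry format and buffer size are assumptions, and the prediction here is a single last-outcome bit):

class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries   # each entry: (branch_pc, target_address, predict_taken)

    def predict_next_pc(self, pc):
        # IF stage: index with the low-order PC bits and check for a match
        entry = self.table[(pc >> 2) % self.entries]
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]             # predicted taken: use the stored target address
        return pc + 4                   # otherwise just increment the PC

    def update(self, pc, target, taken):
        # At branch/jump resolution: record the instruction, its target, and the outcome
        self.table[(pc >> 2) % self.entries] = (pc, target, taken)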

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically a few prediction bits (1 or 2) are used per entry
We don’t know if the prediction is correct or not
If correct prediction …
Continue normal execution – no wasted cycles
If incorrect prediction (misprediction) …
Kill the instructions that were incorrectly fetched => wasted cycles
Update prediction bits and target address for future use

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Dynamic Branch Prediction – Cont'd
Use PC to address Instruction memory and Branch Target Buffer
[Flowchart summary:
IF: Found a BTB entry with predict taken? Yes: PC = target address. No: increment the PC.
ID: If no BTB entry was found and the instruction is a Jump: enter the jump & target address in the BTB, kill the fetched instruction, and restart the PC at the jump target address.
EX: If no BTB entry was found and the branch is taken: enter the branch & target address, set the prediction in the BTB entry, kill the fetched instructions, and restart the PC at the target address; if not taken, continue normal execution.
EX: If a BTB entry predicted taken: correct prediction? Yes: normal execution, no stall cycles. No (mispredicted branch): kill the fetched instructions, update the prediction bits, and restart the PC after the branch.]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
1-bit Prediction Scheme
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are killed
1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch
[State diagram: two states, Predict Not Taken and Predict Taken; a Taken outcome moves to Predict Taken, a Not Taken outcome moves to Predict Not Taken]

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
1-Bit Predictor: Shortcoming
Inner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around

outer: …

inner: …

bne …, …, inner

bne …, …, outer

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›

2-bit Prediction Scheme
1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken
Implemented as a saturating counter
Counter is incremented to max=3 when branch outcome is taken
Counter is decremented to min=0 when branch is not taken
[State diagram: Strong Predict Not Taken, Weak Predict Not Taken, Weak Predict Taken, Strong Predict Taken; a Taken outcome moves one state toward Strong Predict Taken, a Not Taken outcome moves one state toward Strong Predict Not Taken]
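A 2-bit saturating counter takes only a few lines of code; this Python sketch (illustrative, names assumed) follows the state machine above:

class TwoBitPredictor:
    def __init__(self, counter=2):      # start in weak predict taken (arbitrary choice)
        self.counter = counter          # 0,1 = predict not taken; 2,3 = predict taken

    def predict_taken(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, 3)   # increment, saturating at 3
        else:
            self.counter = max(self.counter - 1, 0)   # decrement, saturating at 0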

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Evaluating Branch Alternatives
Assume: Jump = 2%, Branch-Not-Taken = 5%, Branch-Taken = 15%
Assume a branch target buffer with hit rate = 90% for jump & branch
Prediction accuracy for jump = 100%, for conditional branch = 95%
What is the impact on the CPI? (Ideal CPI = 1 if no control hazards)
Branch Scheme       Jump                 Branch Not Taken      Branch Taken
Predict not taken   Penalty = 1 cycle    Penalty = 0 cycles    Penalty = 2 cycles
Delayed             Penalty = 0 cycles   Penalty = 0 cycles    Penalty = 1 cycle
BTB Prediction      Penalty = 1 cycle    Penalty = 2 cycles    Penalty = 2 cycles

Branch Scheme       Jump = 2%      Branch NT = 5%        Branch Taken = 15%            CPI
Predict not taken   0.02 × 1       0                     0.15 × 2 = 0.30               1 + 0.32
Delayed             0              0                     0.15 × 1 = 0.15               1 + 0.15
BTB Prediction      0.02×0.1×1     0.05×0.9×0.05×2       0.15×(0.1+0.9×0.05)×2         1 + 0.05
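The CPI column can be reproduced with a short calculation (illustrative Python that follows the assumptions and penalties above):

jump, branch_nt, branch_t = 0.02, 0.05, 0.15
btb_hit, branch_accuracy = 0.90, 0.95

cpi_predict_not_taken = 1 + jump * 1 + branch_t * 2
cpi_delayed           = 1 + branch_t * 1
cpi_btb = (1 + jump * (1 - btb_hit) * 1
             + branch_nt * btb_hit * (1 - branch_accuracy) * 2
             + branch_t * ((1 - btb_hit) + btb_hit * (1 - branch_accuracy)) * 2)

print(round(cpi_predict_not_taken, 2), round(cpi_delayed, 2), round(cpi_btb, 2))  # 1.32 1.15 1.05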

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
In Summary
Three types of pipeline hazards
Structural hazards: conflicts using a resource during same cycle
Data hazards: caused by data dependencies between instructions
Control hazards: caused by branch and jump instructions
Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching reduces branch delay but needs compiler support
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should kill the wrongly fetched instructions

Pipelined Processor Design COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
