L16_1 Pipeline- Performance_Data-Hazards
EECS 370 – Introduction to Computer Organization – Fall 2020
EECS 370 – Introduction to Computer Organization – © Bill Arthur 1 The material in this presentation cannot be copied in any form without written permission
Learning Objectives
• To identify and apply performance metrics related to data hazards for the LC2K pipeline datapath.
EECS 370 – Introduction to Computer Organization
2
Building with Pipelines
• CPI for pipelining:
• 1 (ideal case – no stalls)
• > 1 (reality, depends on program)
• What if we want to improve performance more?
• Want CPI as low as possible – lower than 1
• Use Parallelism
• Instruction Level Parallelism (ILP) – Within task
• Thread Level Parallelism (TLP) – Having many tasks
• Data Level Parallelism (DLP) – Many tasks with same instructions
EECS 370 – Introduction to Computer Organization
3
LC2K Pipeline Summary
• Data hazards
• Hazard exists if producer-consumer of a register within a 2-instruction
window
• Noteforproject,thewindowis3instructions
• Detect and stall – insert enough noops to separate producer and consumer by > 2 instructions (> 3 instructions for project)
• Detect and forward
• Handles all cases except LW-USE, need 1 noop here
• Control hazards
• Detect and stall – needs 3 noops inserted after each branch
• Predict and squash
• Zero noops if predict correctly • 3 if predict incorrectly
Fetch Decode Execute Memory WB
PC read Need register values Branches resolved Register values produced
EECS 370 – Introduction to Computer Organization
4
Review: Basic Performance Equation
• Execution time (Time/Program) =
• # of instr (I/P) x CPI (C/I) x cycle time (T/C)
• Multi-cycle decreases cycle time but increases CPI • Pipelining decreases CPI
• Approaches 1.0 if no stalls (hazards that are fixed by stalling)
EECS 370 – Introduction to Computer Organization
5
Calculating Performance with No Stalls
How many cycles does this code take to execute?
add 1 2 3 nor 1 4 5 add 4 6 7
What value is written to the
ALU result field of the Mem/WB pipeline register at the end of cycle 5.
EECS 370 – Introduction to Computer Organization
6
Calculating Performance with No Stalls
How many cycles does this code take to execute?
add 1 2 3 nor 1 4 5 add 4 6 7
No stalls – Final WB @ cycle 7
Time:
1
2
3
4
5
6
7
8
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
IF
ID
EX
ME
WB
add 3 5 6
IF
ID
EX
ME
WB
What value is written to the
ALU result field of the Mem/WB pipeline register at the end of cycle 5.
nor result
EECS 370 – Introduction to Computer Organization
7
Performance: Data Hazards -Detect and Stall
How many data hazards are there in this code?
How many stall cycles if we use detect and stall to handle the hazards?
add 1 2 3 nor 3 4 5 add 3 5 6
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
add 3 5 6
EECS 370 – Introduction to Computer Organization
8
Performance: Data Hazards -Detect and Stall
How many data hazards are there
add 1 2 3 nor 3 4 5 add 3 5 6
How many stall cycles if we use detect and stall to handle the hazards?
in this code?
3 data hazards
Stall : 4 cycles Total : 11 cycles
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
IF
ID*
ID*
ID
EX
ME
WB
add 3 5 6
IF*
IF*
IF
ID*
ID*
ID
EX
ME
WB
EECS 370 – Introduction to Computer Organization
9
en
First half of cycle 3
M U X
1
+
+
A L U
target
2
1
Hazard
3
regA
0
14
eq?
M U X
7
ALU result
en
PC
Inst mem
regB
10
14
M U X
R0
R1
R2
R3
R4
R5
R6
R7
3
11
7
M U X
ALU result
Data memory
mdata
77
data
1 8
add 1 2 3
nor 3 4 5
add 3 5 6
add
valB
IF/ ID/
ID EX Mem WB
EX/ Mem/
10
nor 3 4 5
Register file
Performance: Data Hazards -Detect and Forward
Where do the values for the second add instruction come from?
add 1 2 3 nor 3 4 5 add 3 5 6 lw 3 6 7 add 6 6 1
How many stall cycles on the LC2K pipelined datapath with data forwarding from lecture?
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
add 3 5 6
lw 367
add 6 6 1
EECS 370 – Introduction to Computer Organization
11
First half of cycle 4
M U X
1
+
+
21 M U X
11
M U X
A L U
3
New Hazard
3 regA
14
M U X
7
2
0
5
regB
PC
Inst mem
10
M U X
R0
R1
R2
R3
R4
R5
R6
R7
5
3
10
11
21
77
11
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 3 5 6 4. lw 3 6 7 5. add 6 6 1
nor
add
H1
IF/ ID/ Mem/ ID EX WB
12
add 356
Register file
First half of cycle 8
+
99
M U X
M U 99 X
A L U
5
M U X
R0
R1
R2
R3
R4
R5
R6 -11 R7 22
regA
0
6
21
11
data
14
7
regB
-32
-11
-11
Data memory
99
add
noop
lw
H3
M U X
1
+
M U X
Inst mem
PC
1. add 1 2 3 2. nor 3 4 5 3. add 3 5 6 4. lw 3 6 7 5. add 6 6 1
IF/ ID/ Mem/ ID EX WB
13
Register file
Performance: Data Hazards -Detect and Forward
Where do the values for the second add instruction come from?
add 1 2 3 nor 3 4 5 add 3 5 6 lw 3 6 7 add 6 6 1
How many stall cycles on the LC2K pipelined datapath with data forwarding from lecture?
From Mem/WB and EX/Mem
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
IF
ID
EX
ME
WB
add 3 5 6
IF
ID
EX
ME
WB
lw 367
IF
ID
EX
ME
WB
add 6 6 1
IF
ID*
ID
EX
ME
WB
Data forward
EECS 370 – Introduction to Computer Organization
14
1stallfor lw➜add
Logistics
• There are 3 videos for lecture 16
• L16_1 – Pipeline-Performance_Data-Hazards
• L16_2 – Pipeline-Performance_Control-Hazards • L16_3 – Pipeline-Performance
• There is one worksheet for lecture 16 1. L16 worksheet
EECS 370 – Introduction to Computer Organization
15
L16_2 Pipeline- Performance_Control-Hazards
EECS 370 – Introduction to Computer Organization – Fall 2020
EECS 370 – Introduction to Computer Organization – © Bill Arthur 16 The material in this presentation cannot be copied in any form without written permission
Learning Objectives
• To identify and apply performance metrics related to control hazards for the LC2K pipeline datapath.
EECS 370 – Introduction to Computer Organization
17
Performance: Control Hazards – Speculate and Squash
• How many cycles are saved if you perform speculate and squash for the following code (assume that branches are predicted to be not
taken)?
add 1 2 3 beq 1 5 1 nor 6 4 1 add 3 4 5
• Assume the branch is taken: How many cycles to execute this code?
• Assume the branch is not taken: How many cycles execute this code? EECS 370 – Introduction to Computer Organization
18
Performance: Control Hazards – Speculate and
Squash
Branch prediction not taken
add 1 2 3 beq 1 5 1 nor 6 4 1 add 3 4 5
Branch taken
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
beq 1 5 1
nor 6 4 1
add 3 4 5
Branch not taken
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
beq 1 5 1
nor 6 4 1
add 3 4 5
EECS 370 – Introduction to Computer Organization
19
Performance: Control Hazards – Speculate and
Squash
Branch prediction not taken
add 1 2 3 beq 1 5 1 nor 6 4 1 add 3 4 5
Branch taken
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
beq 1 5 1
IF
ID
EX
ME
WB
nor 6 4 1
IF
ID
EX
add 3 4 5
IF
ID
IF
ID
EX
ME
WB
Branch not taken
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
beq 1 5 1
IF
ID
EX
ME
WB
nor 6 4 1
IF
ID
EX
ME
WB
add 3 4 5
IF
ID
EX
ME
WB
EECS 370 – Introduction to Computer Organization
20
Performance: Control Hazards – Detect and Stall
Branch taken
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
beq 1 5 1
IF
ID
EX
ME
WB
nor 6 4 1
IF*
IF*
IF*
add 3 4 5
IF
ID
EX
ME
WB
Same code, detect and stall
add 1 2 3 beq 1 5 1 nor 6 4 1 add 3 4 5
Branch not taken
Time:
1
2
3
4
5
6
7
8
9
10
11
add 1 2 3
IF
ID
EX
ME
WB
beq 1 5 1
IF
ID
EX
ME
WB
nor 6 4 1
IF*
IF*
IF*
IF
ID
EX
ME
WB
add 3 4 5
IF
ID
EX
ME
WB
EECS 370 – Introduction to Computer Organization
21
Performance: Control Hazards – Speculate and Squash
• How many cycles are saved if you perform speculate and squash for the following code (assume that branches are predicted to be not
taken)?
add 1 2 3 beq 1 5 1 nor 6 4 1 add 3 4 5
• Assume the branch is taken: How many cycles to execute this code? 3 instructions + 3 stalls + 4 to empty pipe = 10 cycles
• Assume the branch is not taken: How many cycles execute this code? 4 instructions + 4 to empty pipe = 8 cycles
EECS 370 – Introduction to Computer Organization
22
Performance: Control Hazards I
Assume halt is resolved in WB stage
Assume the first branch is taken 50% of the time and the loop iterates 100 times and forwarding for all data hazards.
1. How many cycles does the code take
assuming detect and stall for control hazards?
add 1 2 3 beq 1 5 1 lw 6 4 1 add 3 4 5 beq 5 7 -5 halt
EECS 370 – Introduction to Computer Organization
23
Control Hazards – Stall
beq1 1 10 add 3 4 5
time
fetch decode execute memory writeback fetch* fetch* fetch* fetch
beq add
EECS 370 – Introduction to Computer Organization
24
Branch target address
OR
fetch
Time Graph – Detect and Forward
Time:
1
2
3
4
5
6
7
8
9
10
11
12
13
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
IF
ID
EX
ME
WB
add 6 3 7
IF
ID
EX
ME
WB
lw 3 6 10
IF
ID
EX
ME
WB
sw 6 2 12
IF
ID*
ID
EX
ME
WB
Data forward
EECS 370 – Introduction to Computer Organization
25
Performance: Control Hazards I
Assume halt is resolved in WB stage
Assume the first branch is taken 50% of the time and the loop iterates 100 times and forwarding for all data hazards.
1. How many cycles does the code take
assuming detect and stall for control hazards?
add 1 2 3 beq 1 5 1 lw 6 4 1 add 3 4 5 beq 5 7 -5 halt
# Instructions = 100*(0.5*5 + 0.5*4) + 1 = 451
Time = 451 + load stalls + branch stalls + empty pipe Time = 451 + 100*0.5*1 +
Time = 451 + 100*0.5*1 + (100*3 + 100*3) + 4
Time = 1105
EECS 370 – Introduction to Computer Organization
26
Performance: Control Hazards II
Assume halt is resolved in WB stage
Assume the first branch is taken 50% of the time and the loop iterates 100 times and forwarding for all data hazards.
2. How many cycles does the code take
assuming speculate and squash where all branches are predicted not taken?
add 1 2 3 beq 1 5 1 lw 6 4 1 add 3 4 5 beq 5 7 -5 halt
EECS 370 – Introduction to Computer Organization
27
Performance: Control Hazards II
Assume halt is resolved in WB stage
Assume the first branch is taken 50% of the time and the loop iterates 100 times and forwarding for all data hazards.
2. How many cycles does the code take
assuming speculate and squash where all branches are predicted not taken?
add 1 2 3 beq 1 5 1 lw 6 4 1 add 3 4 5 beq 5 7 -5 halt
# Instructions = 100*(0.5*5 + 0.5*4) + 1 = 451
Time = 451 + load stalls + branch stalls + empty pipe Time = 451 + 100*0.5*1 + (100*0.5*3 + 99*3) + 4 Time = 952
EECS 370 – Introduction to Computer Organization
28
Performance: Control Hazards III
Assume halt is resolved in WB stage
Assume the first branch is taken 50% of the time and the loop iterates 100 times and forwarding for all data hazards.
3. How many cycles does the code take
assuming speculate and squash where backward branches are predicted taken and forward branches not taken (BTFN)? Assume that the predictor has a BTB with entries for both branches to start.
add123 beq151 lw641 add345 beq5 7 -5 halt
EECS 370 – Introduction to Computer Organization
29
Performance: Control Hazards III
Assume halt is resolved in WB stage
Assume the first branch is taken 50% of the time and the loop iterates 100 times and forwarding for all data hazards.
3. How many cycles does the code take
assuming speculate and squash where backward branches are predicted taken and forward branches not taken (BTFN)? Assume that the predictor has a BTB with entries for both branches to start.
add123 beq151 lw641 add345 beq5 7 -5 halt
# Instructions = 100*(0.5*5 + 0.5*4) + 1 = 451
Time = 451 + load stalls + branch stalls + empty pipe Time = 451 + 100*0.5*1 + (100*0.5*3 + 1*3) + 4
Time = 658
EECS 370 – Introduction to Computer Organization
30
Performance: Control Hazards IV
Assume the first branch has the pattern TTTN that repeats, and the loop is iterated 100 times and forwarding for all data hazards.
4. How many cycles does the code take if a 2-bit counter
BTB is used to predict each branch, how many cycles does the code take? Assume initial state of branch predictor counter is “10” (WT)
Assume halt is resolved in WB stage
add 1 2 3 beq 1 5 1 lw 6 4 1 add 3 4 5 beq5 7 -5 halt
EECS 370 – Introduction to Computer Organization
31
Performance: Control Hazards IV
Assume the first branch has the pattern TTTN that repeats, and the loop is iterated 100 times and forwarding for all data hazards.
4. How many cycles does the code take if a 2-bit counter
BTB is used to predict each branch, how many cycles does the code take? Assume initial state of branch
Assume halt is resolved in WB stage
add 1 2 3 beq 1 5 1 lw 6 4 1 add 3 4 5 beq5 7 -5 halt
√ √ √ X √ √ √ X √ √ √ X
T T T N T T T N T T T N • • •
predictor counter is “10” (WT)
beq 1 5 1
beq 5 7 -5 is correct 99 times, then incorrect last iteration
# Instructions = 100*(0.25*5 + 0.75*4) + 1 = 426 Time = 426 + load stalls + branch stalls + empty pipe Time = 426 + 100*0.25*1 + 100*0.25*3 + 1*3 + 4 Time = 533
EECS 370 – Introduction to Computer Organization
32
Logistics
• There are 3 videos for lecture 16
• L16_1 – Pipeline-Performance_Data-Hazards
• L16_2 – Pipeline-Performance_Control-Hazards • L16_3 – Pipeline-Performance
• There is one worksheet for lecture 16 1. L16 worksheet
• There are optional, supplementary videos with detailed walk-through for the examples
• These are optional, if you want to see the (repetitious) walk-through for examples in the lecture.
EECS 370 – Introduction to Computer Organization
33
L16_3 Pipeline-Performance
EECS 370 – Introduction to Computer Organization – Fall 2020
EECS 370 – Introduction to Computer Organization – © Bill Arthur 34 The material in this presentation cannot be copied in any form without written permission
Learning Objectives
• To identify and apply performance metrics related to all hazards for the LC2K pipeline datapath.
EECS 370 – Introduction to Computer Organization
35
Classic Performance Problem
• Program with following instruction breakdown:
lw 10% sw 15% beq 25% R-type 50%
• Speculate “always not-taken” and squash. 80% of branches not-taken
• Full forwarding to execute stage. 20% of loads stall for 1 cycle
• What is the CPI of the program?
• What is the total execution time if cycle
frequency is 100MHz?
EECS 370 – Introduction to Computer Organization
36
Classic Performance Problem
• Program with following instruction breakdown:
lw 10% sw 15% beq 25% R-type 50%
CPI = 1 + 0.10 (loads) * 0.20 (load use stall)*1 + 0.25 (branch) * 0.20 (miss rate)*3
CPI = 1 + 0.02 + 0.15 = 1.17 Time = 1.17 * 10ns = 11.7ns
• Speculate “always not-taken” and squash. 80% of branches not-taken
• Full forwarding to execute stage. 20% of loads stall for 1 cycle
• What is the CPI of the program?
• What is the total execution time if cycle
frequency is 100MHz?
EECS 370 – Introduction to Computer Organization
37
Classic Performance Problem 2.0
• Assume branches are resolved at Execute? • What is the CPI?
• What happens to cycle time?
• What is the total execution time?
EECS 370 – Introduction to Computer Organization
38
Classic Performance Problem 2.0
• Assume branches are resolved at Execute? • What is the CPI?
• What happens to cycle time?
• What is the total execution time?
CPI = 1 + 0.10 (loads) *0.20 (load use stall)*1 + 0.25 (branch) * 0.20 (miss rate)*2
CPI = 1 + 0.02 + 0.1 = 1.12
EECS 370 – Introduction to Computer Organization
39
Performance: Deeper Pipelines
• Assume the setup of the previous problem.
• What if we have a 10-stage pipeline? • Instructions are fetched at stage 1.
• Register file is read at stage 3.
• Execution begins at stage 5.
• Branches are resolved at stage 7.
• Memory access is complete in stage 9.
• What’s the CPI of the program?
• If the clock rate was doubled by doubling the
pipeline depth, is performance also doubled?
EECS 370 – Introduction to Computer Organization
40
Performance: Deeper Pipelines
• Assume the setup of the previous problem.
• What if we have a 10-stage pipeline? • Instructions are fetched at stage 1.
• Register file is read at stage 3.
• Execution begins at stage 5.
• Branches are resolved at stage 7.
• Memory access is complete in stage 9.
• What’s the CPI of the program?
• If the clock rate was doubled by doubling the
pipeline depth, is performance also doubled?
EECS 370 – Introduction to Computer Organization
41
CPI = 1
+ 0.10 (loads) * 0.20 (load use stall) * ??? + 0.25 (branch) * 0.20 (N stalls) * ???
Performance: Deeper Pipelines
• Assume the setup of the previous problem.
• What if we have a 10-stage pipeline? • Instructions are fetched at stage 1.
• Register file is read at stage 3.
• Execution begins at stage 5.
• Branches are resolved at stage 7.
• Memory access is complete in stage 9.
• What’s the CPI of the program?
CPI = 1
+ 0.10 (loads) * 0.20 (load use stall) * 4 + 0.25 (branch) * 0.20 (N stalls) * 6
CPI = 1 + 0.08 + 0.30 = 1.38 Time = 1.38 * 5ns = 6.9 ns
• If the clock frequency was doubled by doubling the pipeline depth, is performance also doubled?
EECS 370 – Introduction to Computer Organization
42
Up Next… Caches
• This is the last lecture on pipeline datapath. • Next lecture: caches
• Usually memory hierarchy between the processor and main memory
• Starting Thursday Prof. Satish Narayanasamy will be recording lectures
and holding office hours • It was great to teach you!
EECS 370 – Introduction to Computer Organization
43
Logistics
• There are 3 videos for lecture 16
• L16_1 – Pipeline-Performance_Data-Hazards
• L16_2 – Pipeline-Performance_Control-Hazards • L16_3 – Pipeline-Performance
• There is one worksheet for lecture 16 1. L16 worksheet
EECS 370 – Introduction to Computer Organization
44