L15_1 Detect-and-Forward
EECS 370 – Introduction to Computer Organization – Fall 2020
EECS 370 – Introduction to Computer Organization – © Bill Arthur 1 The material in this presentation cannot be copied in any form without written permission
Learning Objectives
• To understand the flow of data through the datapath and between pipeline stages in reducing stalls arising from data hazards.
EECS 370 – Introduction to Computer Organization
2
Detect and Stall – Problems
• CPI increases every time a hazard is detected!
• Is that necessary? Not always!
• Re-route the result of the add to the nor
• nor no longer needs to read R3 from reg file • It can get the data later (when it is ready)
• This lets us complete the decode this cycle
• But we need more control to remember that the data that we are not getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.
EECS 370 – Introduction to Computer Organization
3
M U X
1
+
+
A L U
eq?
ALU result
regA
0
PC+1
target
M U X
data dest
regB
valA
PC
Inst mem
PC+1
M U X
R0 R1 R2 R3 R4 R5 R6 R7
ALU result
mdata
valB offset
M U X
Data memory
valB
dest
dest
dest
op
op
op
EECS 370 – Introduction to Computer Organization
4
instruction
Register file
Handling Data Hazards III: Detect and Forward
• Detect: same as detect and stall
• Except that all 4 hazards must be treated differently
• i.e., you can’t logical-OR the 4 hazard signals
• Forward:
• New bypass datapaths route computed data to where it is needed • New MUX and control to pick the right data
• Beware: Stalling may still be required even in the presence of forwarding
EECS 370 – Introduction to Computer Organization
5
Sample Code (Simple)
We will use this program for the next example
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4.lw 3610 5.sw 6212
HAZARDS:
1.add 1 2 2.nor 45 3.add6 7 4.lw 3 10 5. sw 2 12
3
3
3
6
6
EECS 370 – Introduction to Computer Organization
6
First half of cycle 3
M U X
1
+
+
2
14
7
1
Hazard
3
regA
0
PC
Inst mem
14
M U X
R0
R1
R2
R3
R4
R5
R6
R7
A L U
M U X
3
regB
10
11
77
7
M U X
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
add
fwd
IF/ ID/
ID EX Mem WB
fwd
EX/ Mem/
fwd
7
nor 345
Register file
End of cycle 3
M U X
1
+
+
3
14
7
2
regA
0
PC
Inst mem
regB
10
M U X
R0
R1
R2
R3
R4
R5
R6
R7
A L U
M U X
5
3
10
11
21
77
11
M U X
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
nor
H1
EX/ Mem/ ID EX Mem WB
IF/ ID/
add
8
add 437
Register file
First half of cycle 4
M U X
1
+
+
21 M U X
11
M U X
A L U
3
M U X
3 regB
14
7
2
New Hazard
regA
0
PC
Inst mem
M U X
R0
R1
R2
R3
R4
R5
R6
R7
5
3
10
11
10
21
77
11
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
nor
add
H1
IF/ ID/
ID EX Mem WB
EX/ Mem/
9
add 637
Register file
End of cycle 4
M U X
1
+
+
M U X
M U X
A L U
4
14
3
M U X
PC
Inst mem
M U X
R0
R1
R2
R3
R4
R5
R6
R7
regA
0
7
21
regB
1
7
5
3
10
11
-32
77
10
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
add
nor
add
H2
H1
IF/ ID/
ID EX Mem WB
EX/ Mem/
10
lw 3610
Register file
First half of cycle 5
M U X
1
+
+
M
1U X
21M U X
A L U
14
3
No Hazard
3
regA
7
21
M U X
PC
4
Inst mem
regB
1
M U X
R0 R1 R2 R3 R4 R5 R6 R7
0
3
10
7
5
11
-32
77
10
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
add
nor
add
H2
H1
IF/ ID/
ID EX Mem WB
EX/ Mem/
11
lw 3610
Register file
End of cycle 5
M U X
1
+
+
M U X
M U X
A L U
5
14
4
regA
0
M U X
7
-32
PC
Inst mem
regB
21
M U X
R0 R1 R2 R3 R4 R5 R6 R7
7
5
21
6
11
22
77
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
10
lw
add
H2
H1
IF/ ID/
ID EX Mem WB
EX/ Mem/
nand
12
sw 6 2 12
Register file
First half of cycle 6
M U X
1
+
+
M U X
M U X
A L U
en
PC
en
5
Hazard
6
14
4
regA
0
M U X
7
-32
Inst mem
regB
21
M U X
R0
R1
R2
R3
R4
R5
1 8
6
7
5
21
11
22
L
77
Data memory
R6
data
R7
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
10
lw
add
nor
H2
H1
IF/ ID/
ID EX Mem WB
EX/ Mem/
13
sw 6 2 12
Register file
End of cycle 6
M U X
1
+
+
M U X
M U X
A L U
5
14
M U X
7
22
PC
Inst mem
M U X
R0 R1 R2 R3 R4 R5 R6 R7
regA
regB
0
21
6
7
11
31
-32
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
noop
lw
add
H2
IF/ ID/
ID EX Mem WB
EX/ Mem/
14
sw 6 2 12
Register file
First half of cycle 7
M U X
1
+
+
M U X
M U X
A L U
5
14
M U X
7
22
PC
Inst mem
M U X
R0 R1 R2 R3 R4 R5 R6 R7
Hazard 6 regA
regB
0
21
6
7
11
31
-32
Data memory
data
1 8
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
noop
lw
add
H2
IF/ ID/
ID EX Mem WB
EX/ Mem/
15
sw 6 2 12
Register file
End of cycle 7
+
M U X
M U X
A L U
5
regA
0
M U X
R0
R1
R2
R3
R4
R5
R6
R7 22
6
21
11
-32
7 12
Data memory
99
data
1
14
7
regB
1
sw
noop
lw
H3
M U X
1
+
M U X
Inst mem
PC
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
IF/ ID/
ID EX Mem WB
EX/ Mem/
16
Register file
First half of cycle 8
+
99
M U X
M U 12 X
A L U
5
M U X
R0
R1
R2
R3
R4
R5
R6
R7 22
regA
0
6
21
11
data
1
14
7
regB
-32
1
7 12
Data memory
99
sw
noop
lw
H3
M U X
1
+
M U X
Inst mem
PC
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
IF/ ID/
ID EX Mem WB
EX/ Mem/
17
Register file
End of cycle 8
+
M U X
M U X
A L U
regA
0
14
7
regB
21
M U X
R0 R1 R2 R3 R4 R5
R6 99
R7 22
11
data
-32
111
Data memory
sw
H3
M U X
1
+
M U X
Inst mem
PC
1. add 1 2 3 2. nor 3 4 5 3. add 6 3 7 4. lw 3 6 10 5. sw 6 2 12
IF/ ID/
ID EX Mem WB
EX/ Mem/
noop
18
Register file
Time Graph – Detect and Forward
Time:
1
2
3
4
5
6
7
8
9
10
11
12
13
add 1 2 3
IF
ID
EX
ME
WB
nor 3 4 5
IF
ID
EX
ME
WB
add 6 3 7
IF
ID
EX
ME
WB
lw 3 6 10
IF
ID
EX
ME
WB
sw 6 2 12
IF
ID*
ID
EX
ME
WB
Data forward
EECS 370 – Introduction to Computer Organization
19
Data Forwarding – Lecture vs. Project 3
• Some questions you may have
• What is the WBEND pipeline register in the project for?
• Why are 3 nops required to avoid hazards in the project? • But only 2 nops in class?
• Answer
• The “magic” register file
• Lecture register file assumes internal forwarding – in a single cycle, values written to a register are immediately reflected to any reads that occur in the same cycle.
• Project register file does not do internal forwarding.
• Most modern processors have internal forwarding as its cheaper than having
an additional pipeline register.
EECS 370 – Introduction to Computer Organization
20
Project 3 Design Tips
• Build up your simulator in pieces.
• First, design code without any hazards – get the pipeline flow to work for all
instructions
• Build up your data forwarding (remember forwarding from 3 places back to EX: EX/MEM, MEM/WB, and WB/END).
• Handle control hazards.
• One extra cycle for stalls (simple register file implementation).
• Testcases are critical to debugging your code – don’t rely on the autograder!
• Use your functionally correct previous “golden design” for testing.
• Implement without considering hazards first, test with hazard-free code,
then consider hazards.
• Watch discussion videos (or attend!)
EECS 370 – Introduction to Computer Organization
21
Logistics
• There are 3 videos for lecture 15 • L15_1 – Detect-and-Forward
• L15_2 – Control-Hazards
• L15_3 – Branch-Prediction
• There is one worksheet for lecture 15 1. L15 worksheet
EECS 370 – Introduction to Computer Organization
22
L15_2 Control-Hazards
EECS 370 – Introduction to Computer Organization – Fall 2020
EECS 370 – Introduction to Computer Organization – © Bill Arthur 23 The material in this presentation cannot be copied in any form without written permission
Learning Objectives
• To identify and describe control hazards in the LC2K datapath.
• To identify the approaches to handling control hazards and their trade-offs.
• Resources:
• Pipeline simulator in Resources on EECS 370 website
• https://vhosts.eecs.umich.edu/370simulators/pipeline/simulator.html EECS 370 – Introduction to Computer Organization
24
Control Hazard
Control hazard:
Also called branch hazard. When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.
EECS 370 – Introduction to Computer Organization
25
Pipeline Function for beq
• Fetch: read instruction from memory.
• Decode: read source operands from registers.
• Execute: calculate target address and test for equality. • Memory: Send target to PC if test is equal.
• Writeback: Nothing left to do.
• Branch outcomes • Not Taken
• PC = PC + 1 • Taken
• PC = Branch Target Address EECS 370 – Introduction to Computer Organization
26
Control Hazards
beq1 1 10 add 3 4 5
time
fetch decode execute memory writeback fetch decode execute
beq add
EECS 370 – Introduction to Computer Organization
27
Approaches to Handling Control Hazards
I. Avoid
Make sure there are no hazards in the code.
II.
Detect and stall
Delay fetch until branch resolved
Speculate and Squash-if-Wrong
Go ahead and fetch more instructions in case it is correct, but stop
them if they should not have been executed.
III.
EECS 370 – Introduction to Computer Organization
28
Handling Control Hazards I: Avoid all Hazards
• Do not have branch instructions!
• Impractical.
• Using predicated instructions works for small branches (not covered in this class)
• Delay taking branch
• Special branch instruction where next N instructions still execute, and then
branches after those
• Not covered in more detail in this class
EECS 370 – Introduction to Computer Organization
29
Handling Control Hazards II: Detect and Stall
• Detection
• Must wait until decode
• Compare opcode to beq or jalr
• Alternately, this is just another control signal
• Stall
• Keep current instructions in fetch
• Insert noops
• Pass noop to decode stage, not execute!
EECS 370 – Introduction to Computer Organization
30
Control Hazards – Stall
beq1 1 10 add 3 4 5
time
fetch decode execute memory writeback fetch* fetch* fetch* fetch
beq add
EECS 370 – Introduction to Computer Organization
31
Branch target address
OR
fetch
1
+
M U X
PC
Inst mem
REG file
M U X
+
A L U
sign ext
Data memory
Control
ID/ EX/ Mem/
IF/
ID EX Mem WB
32
noop
Problems with Detect and Stall
• CPI increases every time a branch is detected!
• Is that necessary? Not always!
• Branch not always taken
• Let us assume that it is NOT taken…
• In this case, we can ignore the beq (treat it like a noop). • KeepfetchingPC+1
• What if we are wrong?
• OK, as long as we do not COMPLETE any instructions we mistakenly executed.
• i.e., make changes that will be seen later such as changing register or memory values.
EECS 370 – Introduction to Computer Organization
33
Handling Control Hazards III: Speculate and Squash
• Speculate: assume not equal
• Keep fetching from PC+1 until we know that the branch is really taken.
• Squash: stop bad instructions if taken
• Send a noop to Decode, Execute, and Memory • Send target address to PC
EECS 370 – Introduction to Computer Organization
34
1
+
M U X
PC
Inst mem
REG file
M U X
+
equal
A L U
sign ext
Data memory
1. beq 2. nor 3. add 4. nor
Control
ID/ EX/ Mem/
IF/
ID EX Mem WB
35
naododp
beq
nboeoqp
noorp
Problems with Fetching PC+1
• CPI increases every time a branch is taken! • About 50%-66% of time
• Is that necessary?
No! But how can you fetch from the target before you even know the previous instruction is a branch – much less whether it is taken?
EECS 370 – Introduction to Computer Organization
36
1
+
M U X
PC
Inst mem
REG file
eq?
Control
sign ext
M U X
+
A L U
Data memory
bpc
target
beq
IF/
ID EX Mem WB 37
ID/ EX/ Mem/
target
Logistics
• There are 3 videos for lecture 15 • L15_1 – Detect-and-Forward
• L15_2 – Control-Hazards
• L15_3 – Branch-Prediction
• There is one worksheet for lecture 15 1. L15 worksheet
EECS 370 – Introduction to Computer Organization
38
L15_3 Branch-Prediction
EECS 370 – Introduction to Computer Organization – Fall 2020
EECS 370 – Introduction to Computer Organization – © Bill Arthur 39 The material in this presentation cannot be copied in any form without written permission
Learning Objectives
• To identify the control and state for simple branch prediction. • Ability to understand metrics for branch prediction accuracy.
EECS 370 – Introduction to Computer Organization
40
Branch Prediction
• Predict the next fetch address (to be used in the next cycle)
• Requires three things to be predicted at fetch stage: • Whether the fetched instruction is a branch
• Branch direction (if conditional)
• Branch target address (if direction is taken)
• Observation: Target address remains the same for a conditional direct branch across dynamic instances
• Store the target address from previous instance and access it with the PC • Called Branch Target Buffer (BTB) or Branch Target Address Cache
EECS 370 – Introduction to Computer Organization
41
Fetch Stage with Branch Prediction
Direction predictor (2-bit counters)
taken?
PC + inst size
hit?
Next Fetch Address
Program Counter
Address of the current branch
target address
Cache of Target Addresses (BTB: Branch Target Buffer)
EECS 370 – Introduction to Computer Organization
42
Branch Direction Prediction
• “Branch direction” refers to whether the branch was taken or not • Two methods for predicting direction:
• Static – We predict once during compilation, and that prediction never changes
• Dynamic – We predict (potentially) many times during execution, and the prediction may change over time
• Static vs dynamic strategies are a very common topic in computer architecture
EECS 370 – Introduction to Computer Organization
43
Branch Direction Prediction
• Compile time (static)
• Always not taken
• Always taken
• BTFN (Backward taken, forward not taken) • Profile based (likely direction)
• Program analysis based (likely direction)
• Run time (dynamic)
• Last time prediction (single-bit)
• Two-bit counter based prediction
• Two-level prediction (global vs. local) • Hybrid
EECS 370 – Introduction to Computer Organization
44
Branch Direction Prediction (Static)
• Always not-taken
• Simple to implement: no need for BTB, no direction prediction
• Low accuracy: ~30-40%
• Compiler can layout code such that the likely path is the “not-taken” path
• Always taken
• No direction prediction
• Better accuracy: ~60-70%
• Backward branches (i.e. loop branches) are usually taken • Backward branch: target address lower than branch PC
• Backward taken, forward not taken (BTFN)
• Predict backward (loop) branches as taken, others not-taken
EECS 370 – Introduction to Computer Organization
45
Branch Direction Prediction (Dynamic)
• Last time predictor
• Single bit per branch (stored in BTB)
• Indicates which direction branch went last time it executed
TTTTTTTTTTNNNNNNNNNN → 90% accuracy
• Always mispredicts the last iteration and the first iteration of a loop branch
• Accuracy for a loop with N iterations = (N-2)/N
+ Loop branches for loops with large number of iterations — Loop branches for loops will small number of iterations
TNTNTNTNTNTNTNTNTNTN→ 0% accuracy EECS 370 – Introduction to Computer Organization
46
State Machine for Last-Time Prediction
actually taken
actually predict not taken not
taken
predict taken
actually taken
EECS 370 – Introduction to Computer Organization
47
actually not taken
Improving the Last Time Predictor
• Problem: A last-time predictor changes its prediction from T→NT or NT→T too quickly
• Even though the branch may be mostly taken or mostly not taken • Solution Idea: Add hysteresis to the predictor so that prediction does
not change on a single different outcome
• Use two bits to track the history of predictions for a branch instead of a single bit
• Can have 2 states for T or NT instead of 1 state for each EECS 370 – Introduction to Computer Organization
48
Two-Bit Counter Based Prediction
• Each branch associated with a two-bit counter
• One more bit provides hysteresis
• A strong prediction does not change with one single different outcome
+ Better prediction accuracy
— More hardware cost (but counter can be part of a BTB entry)
EECS 370 – Introduction to Computer Organization
49
State Machine for 2-bit Saturating Counter
actually taken
“strongly taken”
pred taken 11
actually taken
pred !taken 01
actually
!taken pred
taken actually 10
taken
actually !taken
actually
!taken pred
!taken actually 00
taken
“weakly taken”
“weakly !taken”
“strongly !taken”
actually !taken
EECS 370 – Introduction to Computer Organization
50
Two-Bit Counter Based Prediction
▪What is the prediction accuracy of a branch with the following sequence of taken/not taken evaluations
TTTTNTTNNNTNTNN
Branch State
Pred
T
10
T
T
T
N
T
T
N
N
N
T
N
T
N
N
EECS 370 – Introduction to Computer Organization
51
11
11
11
11
10
11
11
10
01
00
01
00
01
00
T
T
T
T
T
T
T
T
T
N
N
N
N
N
N
Branch Prediction
• Predict not taken:
• Predict backward taken:
• Predict same as last time:
• Pentium:
• Pentium Pro:
• Best paper designs:
~50% accurate. ~65% accurate. ~80% accurate.
~85% accurate. ~92% accurate. ~96% accurate.
EECS 370 – Introduction to Computer Organization
52
Can We Do Better? – Beyond EECS 370
• Absolutely – take EECS 470 • TAGE Branch Prediction
• TAgged GEometric length predictor • Neural Branch Prediction
• State of branch prediction
• Branch Prediction is Not a Solved Problem
EECS 370 – Introduction to Computer Organization
53
Logistics
• There are 3 videos for lecture 15 • L15_1 – Detect-and-Forward
• L15_2 – Control-Hazards
• L15_3 – Branch-Prediction
• There is one worksheet for lecture 15 1. L15 worksheet
EECS 370 – Introduction to Computer Organization
54