CSE 371 Computer Organization and Design
CIS 501: Comp. Arch. | Dr. | Branch Prediction
Unit 7: Branch Prediction
Based on slides by Profs. , & C.J.
This Unit: Branch Prediction
Control hazards
Branch prediction
System software
Control Dependences and Branch Prediction
What About Branches?
Branch speculation
Could just stall to wait for branch outcome (two-cycle penalty)
Fetch past branch insns before branch outcome is known
Default: assume “not-taken” (at fetch, can’t tell it’s a branch)
Big Idea: Speculative Execution
Speculation: “risky transactions on chance of profit”
Speculative execution
Execute before all parameters known with certainty
Correct speculation
Avoid stall, improve performance
Incorrect speculation (mis-speculation)
Must abort/flush/squash incorrect insns
Must undo incorrect changes (recover pre-speculation state)
Control speculation: speculation aimed at control hazards
Unknown parameter: are these the correct insns to execute next?
Control Speculation Mechanics
Guess branch target, start fetching at guessed position
Doing nothing is implicitly guessing target is PC+4
We were already speculating before!
Can actively guess other targets: dynamic branch prediction
Execute branch to verify (check) guess
Correct speculation? keep going
Mis-speculation? Flush mis-speculated insns
Hopefully haven’t modified permanent state (Regfile, DMem)
Happens naturally in in-order 5-stage pipeline
Actually have modified one piece of state: PC! Why is that ok?
When to Perform Branch Prediction?
Option #1: During Decode
Look at instruction opcode to determine branch instructions
Can calculate next PC from instruction (for PC-relative branches)
One cycle “mis-fetch” penalty even if branch predictor is correct
Option #2: During Fetch?
How do we do that?
1 2 3 4 5 6 7 8 9
bnez r3,targ F D X M W
targ:add r4⟵r5,r4 F D X M W
More specifically: if you look at the schematic in the book, even if you know the branch will be taken, you can't redirect fetch until you have calculated the branch target.
Branch Recovery
Branch recovery: what to do when branch is actually taken
Insns that are in F and D are wrong
Flush them, i.e., replace them with nops
They haven’t written permanent state yet (regfile, DMem)
Two cycle penalty for taken branches
Branch Speculation and Recovery
Mis-speculation recovery: what to do on wrong guess
Not too painful in a short, in-order pipeline
Branch resolves in X
Younger insns (in F, D) haven’t changed permanent state
Flush insns currently in F and D (i.e., replace with nops)
1 2 3 4 5 6 7 8 9
addi r3⟵r1,1 F D X M W
bnez r3,targ F D X M W
st r6⟶[r7+4] F D X M W
mul r10⟵r8,r9 F D X M W
1 2 3 4 5 6 7 8 9
addi r3⟵r1,1 F D X M W
bnez r3,targ F D X M W
st r6⟶[r7+4] F D — — —
mul r10⟵r8,r9 F — — — —
targ:add r4⟵r4,r5 F D X M W
speculative
Branch Performance
Back of the envelope calculation
Branch: 20%, load: 20%, store: 10%, other: 50%
Say, 75% of branches are taken
CPI = 1 + 20% * 75% * 2 = 1 + 0.20 * 0.75 * 2 = 1.3
Branches cause 30% slowdown
Worse with deeper pipelines (higher mis-prediction penalty)
Can we do better than assuming branch is not taken?
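The back-of-the-envelope calculation above can be sketched as a tiny helper (the name `branch_cpi` is illustrative; the 20% / 75% / 2-cycle figures are the slide's numbers):

```python
def branch_cpi(branch_frac, flush_frac, penalty, base_cpi=1.0):
    """CPI = base + (fraction of insns that are branches)
                  * (fraction of those that cause a flush)
                  * (cycles lost per flush)."""
    return base_cpi + branch_frac * flush_frac * penalty

# Predict not-taken: every taken branch (75% of the 20%) costs 2 cycles.
print(round(branch_cpi(0.20, 0.75, 2), 3))  # 1.3
# Preview: a 95%-accurate predictor flushes only 5% of branches.
print(round(branch_cpi(0.20, 0.05, 2), 3))  # 1.02
```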
Dynamic Branch Prediction
Dynamic branch prediction: hardware guesses outcome
Start fetching from guessed address
Flush on mis-prediction
Branch Prediction Performance
Parameters
Branch: 20%, load: 20%, store: 10%, other: 50%
75% of branches are taken
Dynamic branch prediction
Branches predicted with 95% accuracy
CPI = 1 + 20% * 5% * 2 = 1.02
Dynamic Branch Prediction Components
Step #1: is it a branch?
Easy after decode…
Step #2: is the branch taken or not taken?
Direction predictor (applies to conditional branches only)
Predicts taken/not-taken
Step #3: if the branch is taken, where does it go?
Easy after decode…
Branch Prediction Steps
[Figure: fetch-stage prediction hardware. A branch target buffer answers "is insn a branch?" and supplies the predicted target; a direction predictor supplies taken/not-taken. Prediction source: we are trying to predict the behavior of the Decode insn, so the predictors are indexed by the Decode insn's PC.]
Branch TARGET PREDICTION
Revisiting Branch Prediction Components
Step #1: is it a branch?
Easy after decode… during fetch: use a predictor (the BTB)
Step #2: is the branch taken or not taken?
Direction predictor (later)
Step #3: if the branch is taken, where does it go?
Branch target predictor (BTB)
Supplies target PC if branch is taken
Branch Target Buffer
Learn from past, predict the future
Record the past in a hardware structure
Branch target buffer (BTB):
“guess” the future PC based on past behavior
“Last time the branch X was taken, it went to address Y”
“So, in the future, if address X is fetched, fetch address Y next”
PC indexes table of target addresses
Essentially: branch will go to same place it went last time
What about aliasing?
Two PCs with the same lower bits?
No problem, just a prediction!
Branch Target Buffer (continued)
At Fetch, how do we know we have a branch? We don’t…
…all insns access BTB in parallel with Imem Fetch
Key idea: use BTB to predict which insns are branches
Implement by “tagging” each entry with its corresponding PC
Update BTB on every taken branch insn, record target PC:
BTB[PC].tag = PC, BTB[PC].target = target of branch
All insns access at Fetch in parallel with Imem
Check for tag match, signifies insn at that PC is a branch
otherwise, assume insn is not a branch
Predicted PC = (BTB[PC].tag == PC) ? BTB[PC].target : PC+4
We record a (PC → target) tuple, but only for branches. When an entry is read out at the indexed slot, the tag is checked to verify the tuple is relevant, i.e., that the full PC matches.
Question: why does this work? Most branches/jumps are direct.
Are there some that are not? Jump-register instructions, which show up frequently in function returns.
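The tag-match logic can be sketched as a toy direct-mapped BTB (a sketch: the 4-entry size and the names `predict`/`update` are illustrative, not from any real design):

```python
# Minimal BTB sketch: direct-mapped, tagged, word-aligned (4-byte) insns.
SIZE = 4
btb = [{"tag": None, "target": None} for _ in range(SIZE)]

def index(pc):
    return (pc >> 2) % SIZE        # low PC bits pick the entry

def predict(pc):
    e = btb[index(pc)]
    # Tag match => we believe the insn at this PC is a taken branch;
    # otherwise assume it is not a branch and fall through to PC+4.
    return e["target"] if e["tag"] == pc else pc + 4

def update(pc, target):            # called for every taken branch
    btb[index(pc)] = {"tag": pc, "target": target}

update(0x1008, 0x2000)             # branch at 0x1008 went to 0x2000
print(hex(predict(0x1008)))        # 0x2000 (tag hit: predicted taken)
print(hex(predict(0x1004)))        # 0x1008 (tag miss: PC+4)
```

Note how a tag mismatch is harmless: the fetch unit simply guesses PC+4, which is just a prediction.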
Why Does a BTB Work?
Because most control insns use direct targets
Target encoded in insn itself → same "taken" target every time
What about indirect targets?
Target held in a register can be different each time
Two indirect call idioms
Dynamically linked functions (DLLs): target always the same
Dynamically dispatched (virtual) functions: hard but uncommon
Also two indirect unconditional jump idioms
Switches: hard but uncommon
Function returns: hard and common but…
Return Address Stack (RAS)
Return address stack (RAS)
Call instruction? RAS[TopOfStack++] = PC+4
Return instruction? Predicted-target = RAS[--TopOfStack]
Q: how can you tell if an insn is a call/return before decoding it?
Accessing RAS on every insn BTB-style doesn’t work
Answer: another predictor (or put them in BTB marked as “return”)
Or, pre-decode bits in insn mem, written when first executed
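The push/pop behavior above can be sketched in a few lines, assuming a fixed-depth stack that wraps on overflow (a common hardware choice; names are illustrative):

```python
# Return address stack sketch.
DEPTH = 8
ras = [0] * DEPTH
tos = 0                          # top-of-stack index

def on_call(pc):
    global tos
    ras[tos % DEPTH] = pc + 4    # push the return address
    tos += 1

def on_return():
    global tos
    tos -= 1
    return ras[tos % DEPTH]      # predicted return target

on_call(0x1000)                  # call insn at 0x1000
on_call(0x2000)                  # nested call insn at 0x2000
print(hex(on_return()))          # 0x2004
print(hex(on_return()))          # 0x1004
```

Returns come off the stack in last-in-first-out order, matching call/return nesting, which is why the RAS handles this "hard and common" indirect-jump case so well.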
Branch Direction PREDICTION
Branch Direction Prediction
Learn from past, predict the future
Record the past in a hardware structure
Direction predictor (DIRP)
Map conditional-branch PC to taken/not-taken (T/N) decision
Individual conditional branches often biased or weakly biased
90%+ one way or the other considered “biased”
Why? Loop back edges, checking for uncommon conditions
Bimodal predictor: simplest predictor
PC indexes Branch History Table of bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time
What about aliasing?
Two PCs with the same lower bits?
No problem, just a prediction!
Prediction (taken or not taken)
Bimodal Branch Predictor
simplest direction predictor
PC indexes table of bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time
Problem: inner loop branch below
for (i=0;i<100;i++)
for (j=0;j<3;j++)
// whatever
Two “built-in” mis-predictions per inner loop iteration
Branch predictor “changes its mind too quickly”
Time State Prediction Outcome Result?
1 N N T Wrong
2 T T T Correct
3 T T T Correct
4 T T N Wrong
5 N N T Wrong
6 T T T Correct
7 T T T Correct
8 T T N Wrong
9 N N T Wrong
10 T T T Correct
11 T T T Correct
12 T T N Wrong
Loop: T T T N T T T N T T T N
Pred: - T T T N T T T N T T T
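The "changes its mind too quickly" pathology can be reproduced in a few lines (a sketch; the T T T N stream is the inner-loop outcome sequence from the table above, and `one_bit_mispredicts` is an illustrative name):

```python
def one_bit_mispredicts(outcomes, state="N"):
    """Single untagged history bit: predict whatever happened last time."""
    wrong = 0
    for out in outcomes:
        if state != out:         # prediction is the stored bit
            wrong += 1
        state = out              # train: remember the latest outcome
    return wrong

# Inner loop taken 3 times then exits, for 3 outer-loop iterations:
print(one_bit_mispredicts(["T", "T", "T", "N"] * 3))  # 6 of 12 wrong
```

Two of the mis-predictions per outer iteration are "built in": the loop-exit N and the re-entry T each flip the bit the wrong way.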
Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981]
Replace each single-bit prediction with a two-bit saturating counter
(0,1,2,3) = (N,n,t,T)
Adds “hysteresis”
Force predictor to mis-predict twice before “changing its mind”
One mispredict each loop execution
(rather than two)
Fixes this pathology (which is not contrived, by the way)
Can we do even better?
Time State Prediction Outcome Result?
1 N N T Wrong
2 n N T Wrong
3 t T T Correct
4 T T N Wrong
5 t T T Correct
6 T T T Correct
7 T T T Correct
8 T T N Wrong
9 t T T Correct
10 T T T Correct
11 T T T Correct
12 T T N Wrong
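Running the same outcome stream through a two-bit saturating counter shows the hysteresis (a sketch; counter values 0..3 correspond to N, n, t, T as above, and `two_bit_mispredicts` is an illustrative name):

```python
def two_bit_mispredicts(outcomes, ctr=0):
    """2-bit saturating counter: predict taken when counter >= 2."""
    wrong = 0
    for out in outcomes:
        pred = "T" if ctr >= 2 else "N"
        if pred != out:
            wrong += 1
        # Saturating update: increment toward T on taken, else decrement.
        ctr = min(ctr + 1, 3) if out == "T" else max(ctr - 1, 0)
    return wrong

print(two_bit_mispredicts(["T", "T", "T", "N"] * 3))  # 5 of 12 wrong
```

After the two-step warm-up, the predictor mis-predicts only once per loop execution (the exit N), matching the table.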
Correlated Predictor
Correlated (two-level) predictor [Patt 1991]
Exploits observation that branch outcomes are correlated
Maintains separate prediction per (PC, BHR) pairs
Branch history register (BHR): recent branch outcomes
Simple working example: assume program has one branch
BHT: one 1-bit DIRP entry
BHT + 2-bit BHR: 2^2 = 4 1-bit DIRP entries
Why didn’t we do better?
BHR not long enough to capture the pattern
Time “Pattern” State Prediction Outcome Result?
NN NT TN TT
1 NN N N N N N T Wrong
2 NT T N N N N T Wrong
3 TT T T N N N T Wrong
4 TT T T N T T N Wrong
5 TN T T N N N T Wrong
6 NT T T T N T T Correct
7 TT T T T N N T Wrong
8 TT T T T T T N Wrong
9 TN T T T N T T Correct
10 NT T T T N T T Correct
11 TT T T T N N T Wrong
12 TT T T T T T N Wrong
Shift register storing last k branch outcomes. A global history context.
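Both working examples (2-bit and 3-bit history) can be reproduced with a small simulation (a sketch, assuming 1-bit entries and the same T T T N stream; `correlated_mispredicts` is a made-up name):

```python
def correlated_mispredicts(outcomes, k):
    """Two-level predictor: a k-bit BHR selects one of 2^k 1-bit entries."""
    table = ["N"] * (2 ** k)
    bhr = 0                                # last k outcomes as bits
    wrong = 0
    for out in outcomes:
        if table[bhr] != out:
            wrong += 1
        table[bhr] = out                   # train the selected entry
        bhr = ((bhr << 1) | (out == "T")) & (2 ** k - 1)
    return wrong

pattern = ["T", "T", "T", "N"] * 3
print(correlated_mispredicts(pattern, 2))  # 9: 2-bit BHR can't capture TTTN
print(correlated_mispredicts(pattern, 3))  # 5: 3-bit BHR learns the pattern
```

With 2 bits of history the TT context is ambiguous (sometimes followed by T, sometimes by N), so the predictor keeps mis-training; with 3 bits every context has a unique next outcome, and after warm-up it never mis-predicts.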
Correlated Predictor – 3 Bit Pattern
Time “Pattern” State Prediction Outcome Result?
NNN NNT NTN NTT TNN TNT TTN TTT
1 NNN N N N N N N N N N T Wrong
2 NNT T N N N N N N N N T Wrong
3 NTT T T N N N N N N N T Wrong
4 TTT T T N T N N N N N N Correct
5 TTN T T N T N N N N N T Wrong
6 TNT T T N T N N T N N T Wrong
7 NTT T T N T N T T N T T Correct
8 TTT T T N T N T T N N N Correct
9 TTN T T N T N T T N T T Correct
10 TNT T T N T N T T N T T Correct
11 NTT T T N T N T T N T T Correct
12 TTT T T N T N T T N N N Correct
Try 3 bits of history
No mis-predictions after predictor learns all the relevant patterns!
Branches may be correlated
for (i=0; i<1000000; i++) {      // Highly biased
  if (i % 3 == 0) {              // Locally correlated
    ...
  }
  if (random() % 2 == 0) {       // Unpredictable
    if (i % 3 == 0) {
      ...                        // Globally correlated
    }
  }
}
Gshare History-Based Predictor
Exploits observation that branch outcomes are correlated
Maintains recent branch outcomes in Branch History Register (BHR)
In addition to BHT of counters (typically 2-bit sat. counters)
How do we incorporate history into our predictions?
Use PC xor BHR to index into BHT. Why?
Gshare History-based Predictor
Gshare working example
assume program has one branch
BHT: one 1-bit DIRP entry
3-bit BHR: last 3 branch outcomes
train counter, and update BHR after each branch
Time State BHR Prediction Outcome Result?
1 N NNN N T wrong
2 N NNT N T wrong
3 N NTT N T wrong
4 N TTT N N correct
5 N TTN N T wrong
6 N TNT N T wrong
7 T NTT T T correct
8 N TTT N N correct
9 T TTN T T correct
10 T TNT T T correct
11 T NTT T T correct
12 N TTT N N correct
Shift register storing last k branch outcomes, “global history”
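Matching the working example above (1-bit entries, 3-bit BHR), a gshare sketch xors low PC bits into the index; the PC value 0x40 and the name `gshare_mispredicts` are illustrative:

```python
def gshare_mispredicts(outcomes, pc, k=3):
    """Gshare with 1-bit entries: index = (low PC bits) xor (k-bit BHR)."""
    table = ["N"] * (2 ** k)
    mask = 2 ** k - 1
    bhr, wrong = 0, 0
    for out in outcomes:
        idx = ((pc >> 2) ^ bhr) & mask   # xor folds history into the index
        if table[idx] != out:
            wrong += 1
        table[idx] = out                 # train the selected entry
        bhr = ((bhr << 1) | (out == "T")) & mask
    return wrong

print(gshare_mispredicts(["T", "T", "T", "N"] * 3, pc=0x40))  # 5 of 12 wrong
```

The xor lets one table serve many (PC, history) combinations: different branches with different histories tend to land on different entries, without the table growing as PC bits times history bits.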
Correlated Predictor Design I
Design choice I: one global BHR or one per PC (local)?
Each one captures different kinds of patterns
Global history captures relationship among different branches
Local history captures “self” correlation
Local history requires another table to store the per-PC history
for (i=0; i<1000000; i++) {      // Highly biased
  if (i % 3 == 0) {              // "Local" correlated
    // whatever
  }
  if (random() % 2 == 0) {       // Unpredictable
    if (i % 3 >= 1) {
      // whatever                // "Global" correlated
    }
  }
}
Correlated Predictor Design II
Design choice II: how many history bits (BHR size)?
Tricky one
Given unlimited resources, longer BHRs are better, but…
BHT utilization decreases
Many history patterns are never seen
Many branches are history independent (don’t care)
PC xor BHR allows multiple PCs to dynamically share BHT
BHR length < log2(BHT size)
Predictor takes longer to train
Typical length: 8–12
Hybrid Predictor
Hybrid (tournament) predictor [McFarling 1993]
Attacks correlated predictor BHT capacity problem
Idea: combine two predictors
Simple bimodal predictor for history-independent branches
Correlated predictor for branches that need history
Chooser assigns branches to one predictor or the other
Branches start in simple BHT, move to correlated predictor past a mis-prediction threshold
Correlated predictor can be made smaller, handles fewer branches
90–95% accuracy
Very similar to ensemble learning techniques from ML
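One common chooser design (in the spirit of McFarling's combining predictor) keeps a 2-bit counter per branch that trains toward whichever component was correct when they disagree; this is a sketch with illustrative names, not a specific hardware implementation:

```python
def update_chooser(ctr, bimodal_ok, correlated_ok):
    """Train the 2-bit chooser only when the two components disagree."""
    if bimodal_ok and not correlated_ok:
        return max(ctr - 1, 0)        # drift toward the bimodal predictor
    if correlated_ok and not bimodal_ok:
        return min(ctr + 1, 3)        # drift toward the correlated predictor
    return ctr                        # both right or both wrong: no change

def choose(ctr, bimodal_pred, correlated_pred):
    """Counter >= 2 means this branch is assigned to the correlated side."""
    return correlated_pred if ctr >= 2 else bimodal_pred
```

History-independent branches stay on the cheap bimodal side, so the correlated predictor's limited BHT capacity is spent only on branches that actually need history.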
REDUCING BRANCH PENALTY
Reducing Penalty: Fast Branches
Fast branch: can decide at D, not X
Test must be comparison to zero or equality, no time for ALU
New taken branch penalty is 1
Additional insns (slt) for more complex tests, must bypass to D too
Bypassing into D (not just X) is needed because earlier instructions may produce results used by the equality comparison.
Reducing Branch Penalty
The approach taken in the text is to move branch testing into the ID stage so fewer instructions are flushed on a mis-prediction.
Note the equality unit shoehorned into the decode stage, between the register-file outputs.
Reducing Penalty: Fast Branches
Fast branch: targets control-hazard penalty
Basically, branch insns that can resolve at D, not X
Test must be comparison to zero or equality, no time for ALU
New taken branch penalty is 1
Additional comparison insns (e.g., cmplt, slt) for complex tests
Must bypass into decode stage now, too
1 2 3 4 5 6 7 8 9
bnez r3,targ F D X M W
st r6⟶[r7+4] F D -- -- --
targ:add r4⟵r5,r4 F D X M W
Fast Branch Performance
Assume: Branch: 20%, 75% of branches are taken
CPI = 1 + 20% * 75% * 1 = 1 + 0.20*0.75*1 = 1.15
15% slowdown (better than the 30% from before)
But wait, fast branches assume only simple comparisons
Fine for MIPS
But not fine for ISAs with “branch if $1 > $2” operations
In such cases, say 25% of branches require an extra insn
CPI = 1 + (20% * 75% * 1) + 20%*25%*1(extra insn) = 1.2
Example of ISA and micro-architecture interaction
Type of branch instructions
Another option: “Delayed branch” or “branch delay slot”
What about condition codes?
Delay-slot idea: we always execute the next instruction after a branch; it's the compiler's job to put something useful there.
Putting It All Together
BTB & branch direction predictor during fetch
If branch prediction correct, no taken branch penalty
Branch Prediction Performance
Dynamic branch prediction
20% of instruction branches
Simple predictor: branches predicted with 75% accuracy
CPI = 1 + (20% * 25% * 2) = 1.1
More advanced predictor: 95% accuracy
CPI = 1 + (20% * 5% * 2) = 1.02
Branch mis-predictions still a big problem though
Pipelines are long: typical mis-prediction penalty is 10+ cycles
For cores that do more per cycle, mis-predictions are more costly (later)
PREDICATION
Predication
Instead of predicting which way we’re going, why not go both ways?
compute a predicate bit indicating a condition
ISA includes predicated instructions
predicated insns either execute as normal or as NOPs, depending on the predicate bit
x86 cmov performs a conditional move
32b ARM allows almost all insns to be predicated
64b ARM has predicated reg-reg move, inc, dec, not
Nvidia’s CUDA ISA supports predication on most insns
predicate bits are like LC4 NZP bits
x86 FLAGS, ARM condition codes
Predication Performance
Predication overhead is additional insns
Sometimes overhead is zero
for if-then statement where condition is true
Most of the time it isn't
if-then-else statement, only one of the paths is useful
Calculation for a given branch, predicate (vs speculate) if…
Average number of additional insns < overall mis-prediction penalty
For an individual branch
Mis-prediction penalty in a 5-stage pipeline = 2
Mis-prediction rate is <50%, and often <20%
Overall mis-prediction penalty <1 and often <0.4
So when is predication ever worth it?
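The predicate-vs-speculate rule above reduces to a one-line comparison (a sketch; the numbers are the slide's examples and the function name is illustrative):

```python
def prefer_predication(extra_insns, mispredict_rate, penalty):
    """Predicate only when the expected extra-insn cost is below the
    expected mis-prediction cost (rate * penalty, in cycles)."""
    return extra_insns < mispredict_rate * penalty

# 5-stage pipeline, penalty 2: even 1 extra insn rarely pays off.
print(prefer_predication(1, 0.20, 2))   # False (expected cost only 0.4)
# Deep pipeline, penalty 20, hard-to-predict branch: predication wins.
print(prefer_predication(2, 0.30, 20))  # True (expected cost 6.0)
```

This is why predication becomes attractive mainly for poorly predicted branches on deeply pipelined machines.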
Predication Performance
What does predication actually accomplish?
In a scalar 5-stage pipeline…