
CSE 371 Computer Organization and Design

CIS 501: Comp. Arch. | Dr. | Branch Prediction


Unit 7: Branch Prediction

Based on slides by Profs. , & C.J.
This Unit: Branch Prediction
Control hazards
Branch prediction



Control Dependences and
Branch Prediction

What About Branches?
Branch speculation
Could just stall to wait for branch outcome (two-cycle penalty)
Fetch past branch insns before branch outcome is known
Default: assume “not-taken” (at fetch, can’t tell it’s a branch)

Big Idea: Speculative Execution
Speculation: “risky transactions on chance of profit”

Speculative execution
Execute before all parameters known with certainty
Correct speculation
Avoid stall, improve performance
Incorrect speculation (mis-speculation)
Must abort/flush/squash incorrect insns
Must undo incorrect changes (recover pre-speculation state)

Control speculation: speculation aimed at control hazards
Unknown parameter: are these the correct insns to execute next?

Control Speculation Mechanics
Guess branch target, start fetching at guessed position
Doing nothing is implicitly guessing target is PC+4
We were already speculating before!
Can actively guess other targets: dynamic branch prediction

Execute branch to verify (check) guess
Correct speculation? keep going
Mis-speculation? Flush mis-speculated insns
Hopefully haven’t modified permanent state (Regfile, DMem)
Happens naturally in in-order 5-stage pipeline

Actually have modified one piece of state: PC! Why is that ok?

When to Perform Branch Prediction?
Option #1: During Decode
Look at instruction opcode to determine branch instructions
Can calculate next PC from instruction (for PC-relative branches)
One cycle “mis-fetch” penalty even if branch predictor is correct

Option #2: During Fetch?
How do we do that?
                      1  2  3  4  5  6  7  8  9
bnez r3,targ          F  D  X  M  W
targ: add r4⟵r5,r4       F  D  X  M  W

More specifically: if you look at the schematic in the book, even if you know that the branch will be taken, you can't redirect fetch until you have calculated the branch target

Branch Recovery

Branch recovery: what to do when branch is actually taken
Insns that are in F and D are wrong
Flush them, i.e., replace them with nops
They haven’t written permanent state yet (regfile, DMem)
Two cycle penalty for taken branches

Branch Speculation and Recovery
Mis-speculation recovery: what to do on wrong guess
Not too painful in a short, in-order pipeline
Branch resolves in X
Younger insns (in F, D) haven’t changed permanent state
Flush insns currently in F and D (i.e., replace with nops)
                      1  2  3  4  5  6  7  8  9
addi r3⟵r1,1          F  D  X  M  W
bnez r3,targ             F  D  X  M  W
st r6⟶[r7+4]                F  D  X  M  W
mul r10⟵r8,r9                  F  D  X  M  W

                      1  2  3  4  5  6  7  8  9
addi r3⟵r1,1          F  D  X  M  W
bnez r3,targ             F  D  X  M  W
st r6⟶[r7+4]                F  D  —  —  —
mul r10⟵r8,r9                  F  —  —  —  —
targ: add r4⟵r4,r5                F  D  X  M  W


Branch Performance
Back of the envelope calculation
Branch: 20%, load: 20%, store: 10%, other: 50%
Say, 75% of branches are taken

CPI = 1 + 20% * 75% * 2 =
1 + 0.20 * 0.75 * 2 = 1.3
Branches cause 30% slowdown
Worse with deeper pipelines (higher mis-prediction penalty)

Can we do better than assuming branch is not taken?

Dynamic Branch Prediction
Dynamic branch prediction: hardware guesses outcome
Start fetching from guessed address
Flush on mis-prediction

Branch Prediction Performance
Parameters
Branch: 20%, load: 20%, store: 10%, other: 50%
75% of branches are taken
Dynamic branch prediction
Branches predicted with 95% accuracy
CPI = 1 + 20% * 5% * 2 = 1.02

Dynamic Branch Prediction Components
Step #1: is it a branch?
Easy after decode…
Step #2: is the branch taken or not taken?
Direction predictor (applies to conditional branches only)
Predicts taken/not-taken
Step #3: if the branch is taken, where does it go?
Easy after decode…

Branch Prediction Steps
[Figure: prediction hardware at Fetch. The branch target buffer answers "is insn a branch?" and supplies the predicted target; the direction predictor supplies taken/not-taken. Prediction source: the PC of the insn whose behavior we are trying to predict (the Decode insn).]

BRANCH TARGET PREDICTION

Revisiting Branch Prediction Components
Step #1: is it a branch?
Easy after decode… during fetch: use a predictor
Step #2: is the branch taken or not taken?
Direction predictor (later)
Step #3: if the branch is taken, where does it go?
Branch target predictor (BTB)
Supplies target PC if branch is taken

Branch Target Buffer
Learn from past, predict the future
Record the past in a hardware structure

Branch target buffer (BTB):
“guess” the future PC based on past behavior
“Last time the branch X was taken, it went to address Y”
“So, in the future, if address X is fetched, fetch address Y next”
PC indexes a table of target addresses
Essentially: branch will go to same place it went last time

What about aliasing?
Two PCs with the same lower bits?
No problem, just a prediction!


Branch Target Buffer (continued)
At Fetch, how do we know we have a branch? We don’t…
…all insns access BTB in parallel with Imem Fetch
Key idea: use BTB to predict which insns are branches
Implement by “tagging” each entry with its corresponding PC
Update BTB on every taken branch insn, record target PC:
BTB[PC].tag = PC, BTB[PC].target = target of branch
All insns access at Fetch in parallel with Imem
Check for tag match; a match signifies the insn at that PC is a branch
otherwise, assume insn is not a branch
Predicted PC = (BTB[PC].tag == PC) ? BTB[PC].target : PC+4


The BTB records PC -> target tuples, but only for branches. When you fetch out an entry with your index, you verify that the tuple is relevant by checking that the stored tag matches the PC.

Question: why does this work? Most branches/jumps are direct.

Are there some that are not? Jump-register instructions, which show up a lot in function returns.

Why Does a BTB Work?
Because most control insns use direct targets
Target encoded in insn itself ⇒ same “taken” target every time

What about indirect targets?
Target held in a register ⇒ can be different each time
Two indirect call idioms
Dynamically linked functions (DLLs): target always the same
Dynamically dispatched (virtual) functions: hard but uncommon
Also two indirect unconditional jump idioms
Switches: hard but uncommon
Function returns: hard and common but…

Return Address Stack (RAS)
Return address stack (RAS)
Call instruction? RAS[TopOfStack++] = PC+4
Return instruction? Predicted-target = RAS[--TopOfStack]
Q: how can you tell if an insn is a call/return before decoding it?
Accessing RAS on every insn BTB-style doesn’t work
Answer: another predictor (or put them in BTB marked as “return”)
Or, pre-decode bits in insn mem, written when first executed


BRANCH DIRECTION PREDICTION

Branch Direction Prediction
Learn from past, predict the future
Record the past in a hardware structure
Direction predictor (DIRP)
Map conditional-branch PC to taken/not-taken (T/N) decision
Individual conditional branches often biased or weakly biased
90%+ one way or the other considered “biased”
Why? Loop back edges, checking for uncommon conditions
Bimodal predictor: simplest predictor
PC indexes a Branch History Table of single bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time

What about aliasing?
Two PCs with the same lower bits?
No problem, just a prediction!


Bimodal Branch Predictor
simplest direction predictor
PC indexes a table of single bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time
Problem: inner loop branch below
for (i=0; i<100; i++)
    for (j=0; j<3; j++)
        // whatever

Two “built-in” mis-predictions per inner-loop execution
Branch predictor “changes its mind too quickly”

Time  State  Prediction  Outcome  Result?
 1    N      N           T        Wrong
 2    T      T           T        Correct
 3    T      T           T        Correct
 4    T      T           N        Wrong
 5    N      N           T        Wrong
 6    T      T           T        Correct
 7    T      T           T        Correct
 8    T      T           N        Wrong
 9    N      N           T        Wrong
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong

Loop: Y Y Y N Y Y Y N Y Y Y N
Pred: - Y Y Y N Y Y Y N Y Y Y

Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981]
Replace each single-bit prediction with a two-bit counter: (0,1,2,3) = (N,n,t,T)
Adds “hysteresis”
Force predictor to mis-predict twice before “changing its mind”
One mispredict each loop execution (rather than two)
Fixes this pathology (which is not contrived, by the way)
Can we do even better?

Time  State  Prediction  Outcome  Result?
 1    N      N           T        Wrong
 2    n      N           T        Wrong
 3    t      T           T        Correct
 4    T      T           N        Wrong
 5    t      T           T        Correct
 6    T      T           T        Correct
 7    T      T           T        Correct
 8    T      T           N        Wrong
 9    t      T           T        Correct
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong

Correlated Predictor
Correlated (two-level) predictor [Patt 1991]
Exploits observation that branch outcomes are correlated
Maintains separate prediction per (PC, BHR) pair
Branch history register (BHR): recent branch outcomes
Shift register storing last k branch outcomes: a global history context
Simple working example: assume program has one branch
BHT: one 1-bit DIRP entry
BHT+2BHR: 2² = 4 1-bit DIRP entries
Why didn’t we do better? BHR not long enough to capture the pattern

Time  “Pattern”  State (NN NT TN TT)  Prediction  Outcome  Result?
 1    NN         N  N  N  N           N           T        Wrong
 2    NT         T  N  N  N           N           T        Wrong
 3    TT         T  T  N  N           N           T        Wrong
 4    TT         T  T  N  T           T           N        Wrong
 5    TN         T  T  N  N           N           T        Wrong
 6    NT         T  T  T  N           T           T        Correct
 7    TT         T  T  T  N           N           T        Wrong
 8    TT         T  T  T  T           T           N        Wrong
 9    TN         T  T  T  N           T           T        Correct
10    NT         T  T  T  N           T           T        Correct
11    TT         T  T  T  N           N           T        Wrong
12    TT         T  T  T  T           T           N        Wrong
Correlated Predictor – 3-Bit Pattern
Try 3 bits of history
No mis-predictions after predictor learns all the relevant patterns!

Time  “Pattern”  State (NNN NNT NTN NTT TNN TNT TTN TTT)  Prediction  Outcome  Result?
 1    NNN        N  N  N  N  N  N  N  N                   N           T        Wrong
 2    NNT        T  N  N  N  N  N  N  N                   N           T        Wrong
 3    NTT        T  T  N  N  N  N  N  N                   N           T        Wrong
 4    TTT        T  T  N  T  N  N  N  N                   N           N        Correct
 5    TTN        T  T  N  T  N  N  N  N                   N           T        Wrong
 6    TNT        T  T  N  T  N  N  T  N                   N           T        Wrong
 7    NTT        T  T  N  T  N  T  T  N                   T           T        Correct
 8    TTT        T  T  N  T  N  T  T  N                   N           N        Correct
 9    TTN        T  T  N  T  N  T  T  N                   T           T        Correct
10    TNT        T  T  N  T  N  T  T  N                   T           T        Correct
11    NTT        T  T  N  T  N  T  T  N                   T           T        Correct
12    TTT        T  T  N  T  N  T  T  N                   N           N        Correct

Branches may be correlated

for (i=0; i<1000000; i++) {    // Highly biased
    if (i % 3 == 0) {          // Locally correlated
        …
    }
    if (random() % 2 == 0) {   // Unpredictable
        …
    }
    if (i % 3 == 0) {          // Globally correlated
        …
    }
}

Gshare History-Based Predictor
Exploits observation that branch outcomes are correlated
Maintains recent branch outcomes in Branch History Register (BHR)
In addition to BHT of counters (typically 2-bit sat. counters)
How do we incorporate history into our predictions?
Use PC xor BHR to index into BHT. Why?

Gshare working example: assume program has one branch
BHT: 1-bit DIRP entries
3-bit BHR: last 3 branch outcomes (“global history” shift register)
Train counter, and update BHR after each branch

Time  State  BHR  Prediction  Outcome  Result?
 1    N      NNN  N           T        wrong
 2    N      NNT  N           T        wrong
 3    N      NTT  N           T        wrong
 4    N      TTT  N           N        correct
 5    N      TTN  N           T        wrong
 6    N      TNT  N           T        wrong
 7    T      NTT  T           T        correct
 8    N      TTT  N           N        correct
 9    T      TTN  T           T        correct
10    T      TNT  T           T        correct
11    T      NTT  T           T        correct
12    N      TTT  N           N        correct

Correlated Predictor Design I
Design choice I: one global BHR or one per PC (local)?
Each one captures different kinds of patterns
Global history captures relationships among different branches
Local history captures “self” correlation
Local history requires another table to store the per-PC history

for (i=0; i<1000000; i++) {    // Highly biased
    if (i % 3 == 0) {          // “Local” correlated
        // whatever
    }
    if (random() % 2 == 0) {   // Unpredictable
        // whatever
    }
    if (i % 3 >= 1) {          // “Global” correlated
        // whatever
    }
}

Correlated Predictor Design II
Design choice II: how many history bits (BHR size)?
Tricky one
Given unlimited resources, longer BHRs are better, but…
BHT utilization decreases
Many history patterns are never seen
Many branches are history independent (don’t care)
PC xor BHR allows multiple PCs to dynamically share BHT
BHR length < log2(BHT size)
Predictor takes longer to train
Typical length: 8–12

Hybrid Predictor
Hybrid (tournament) predictor [McFarling 1993]
Attacks correlated predictor BHT capacity problem
Idea: combine two predictors
Simple bimodal predictor for history-independent branches
Correlated predictor for branches that need history
Chooser assigns branches to one predictor or the other
Branches start in simple BHT, move after crossing a mis-prediction threshold
Correlated predictor can be made smaller, handles fewer branches
90–95% accuracy
Very similar to ensemble learning techniques from ML

REDUCING BRANCH PENALTY

Reducing Penalty: Fast Branches
Fast branch: can decide at D, not X
Test must be comparison to zero or equality; no time for ALU
New taken-branch penalty is 1
Additional insns (slt) for more complex tests; must bypass to D too
Bypassing into D (instead of just X) is needed because previous instructions may compute results that feed the equality comparison.

Reducing Branch Penalty
Approach taken in the text is to move branch testing into the ID stage so fewer instructions are flushed on a mis-prediction.
Note the equality unit shoe-horned into the decode stage between the register-file outputs.

Reducing Penalty: Fast Branches
Fast branch: targets control-hazard penalty
Basically, branch insns that can resolve at D, not X
Test must be comparison to zero or equality; no time for ALU
New taken-branch penalty is 1
Additional comparison insns (e.g., cmplt, slt) for complex tests
Must bypass into decode stage now, too

                      1  2  3  4  5  6  7  8  9
bnez r3,targ          F  D  X  M  W
st r6⟶[r7+4]             F  --  --  --  --
targ: add r4⟵r5,r4          F  D  X  M  W
Fast Branch Performance
Assume: branch: 20%, 75% of branches are taken
CPI = 1 + 20% * 75% * 1 = 1 + 0.20 * 0.75 * 1 = 1.15
15% slowdown (better than the 30% from before)

But wait, fast branches assume only simple comparisons
Fine for MIPS
But not fine for ISAs with “branch if $1 > $2” operations

In such cases, say 25% of branches require an extra insn
CPI = 1 + (20% * 75% * 1) + (20% * 25% * 1 extra insn) = 1.2

Example of ISA and micro-architecture interaction
Type of branch instructions
Another option: “Delayed branch” or “branch delay slot”
What about condition codes?


Delay slot idea: we always execute the next instruction after a branch; it is the compiler's job to put something useful in there.

Putting It All Together
BTB & branch direction predictor during fetch

If branch prediction correct, no taken branch penalty


Branch Prediction Performance
Dynamic branch prediction
20% of instructions are branches
Simple predictor: branches predicted with 75% accuracy
CPI = 1 + (20% * 25% * 2) = 1.1
More advanced predictor: 95% accuracy
CPI = 1 + (20% * 5% * 2) = 1.02

Branch mis-predictions still a big problem though
Pipelines are long: typical mis-prediction penalty is 10+ cycles
For cores that do more per cycle, predictions more costly (later)

PREDICATION

Predication
Instead of predicting which way we’re going, why not go both ways?
compute a predicate bit indicating a condition
ISA includes predicated instructions
predicated insns either execute as normal or as NOPs, depending on the predicate bit
x86 cmov performs a conditional move
32b ARM allows almost all insns to be predicated
64b ARM has predicated reg-reg move, inc, dec, not
Nvidia’s CUDA ISA supports predication on most insns
predicate bits are like LC4 NZP bits
x86 FLAGS, ARM condition codes

Predication Performance
Predication overhead is additional insns
Sometimes overhead is zero
for if-then statement where condition is true
Most of the time it isn’t
if-then-else statement, only one of the paths is useful
Calculation for a given branch: predicate (vs speculate) if…
Average number of additional insns < overall mis-prediction penalty
For an individual branch
Mis-prediction penalty in a 5-stage pipeline = 2
Mis-prediction rate is <50%, and often <20%
Overall mis-prediction penalty <1 and often <0.4
So when is predication ever worth it?

Predication Performance
What does predication actually accomplish?
In a scalar 5-stage pipeline…