L19-BranchPrediction
Nathan Beckmann
CMU
Based on slides by Joel Emer, MIT
Branch Prediction
Beckmann27 April 2017
Commit: Instruction irrevocably updates
architectural state (aka “graduation” or
“completion”).
Execute: Instructions and operands sent to
execution units .
When execution completes, all results and
exception flags are available.
Decode: Instructions placed in appropriate
issue (aka “dispatch”) stage buffer
Fetch: Instruction bits retrieved
from cache.
Phases of Instruction Execution
I-cache
Fetch
Buffer
Issue
Buffer
Func.
Units
Arch.
State
Result
Buffer
PC
2
Beckmann27 April 2017
I-cache
Fetch
Buffer
Issue
Buffer
Func.
Units
Arch.
State
Execute
Decode
Result
Buffer Commit
PC
Fetch
Modern processors may have
> 10 pipeline stages between
next PC calculation and branch
resolution !
Control Flow Penalty
How much work is lost if
pipeline doesn’t follow
correct instruction flow?
~ Loop length x pipeline width
Loose loop
Branch
executed
Next fetch
started
3
Beckmann27 April 2017
Average Run-Length between
Branches
Average dynamic instruction mix from SPEC92:
SPECint92 SPECfp92
ALU 39 % 13 %
FPU Add 20 %
FPU Mult 13 %
load 26 % 23 %
store 9 % 9 %
branch 16 % 8 %
other 10 % 12 %
SPECint92: compress, eqntott, espresso, gcc , li
SPECfp92: doduc, ear, hydro2d, mdijdp2, su2cor
What is the average run length between branches
4
Beckmann27 April 2017
Instruction Taken known? Target known?
J
JR
BEQZ/BNEZ
MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces
of information from the preceding instruction:
After Reg. Fetch* After Inst. Decode
After Inst. Decode After Inst. Decode
After Inst. Decode After Reg. Fetch
*Assuming zero detect on register read
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?
5
Beckmann27 April 2017
Realistic Branch Penalties
A PC Generation/Mux
P Instruction Fetch Stage 1
F Instruction Fetch Stage 2
B Branch Address Calc/Begin Decode
I Complete Decode
J Steer Instructions to Functional units
R Register File Read
E Integer Execute
Remainder of execute pipeline
(+ another 6 stages)
UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750MHz, 2000)
Branch
Target
Address
Known
Branch
Direction &
Jump
Register
Target
Known
6
Beckmann27 April 2017
Reducing Control Flow Penalty
Software solutions
• Eliminate branches – loop unrolling
Increases the run length
• Reduce resolution time – instruction scheduling
Compute the branch condition as early
as possible (of limited value – why?)
Hardware solutions
• Find something else to do architecturally
• delay slots – replace pipeline bubbles with
useful work (requires software cooperation)
• Speculate – branch prediction
Speculative execution of instructions beyond
the branch
Useful in both in- and out-of-order processors
7
Beckmann27 April 2017
Branch Prediction
Motivation:
Branch penalties limit performance of deeply pipelined
processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures:
• Branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• Keep result computation separate from commit
• Kill instructions following branch in pipeline
• Restore state to state following branch
8
Beckmann27 April 2017
Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
ISA can attach preferred direction semantics to branches,
e.g., Motorola MC88110
bne0 (preferred taken) beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction,
e.g., HP PA-RISC, Intel IA-64
typically reported as ~80% accurate
JZ
JZ
backward
90%
forward
50%
9
Beckmann
Is Static Prediction Enough?
• UltraSPARC-III
– Mispredicted branches have penalty of 6 cycles
– 4-wide issue
• Wasted work per branch @ 80%
accuracy:
20% misprediction rate
× 6 stages
× 4-way issue
= 4.8 wasted instructions / branch
• Branches are 15-20% of instructions!
27 April 2017
10
Beckmann
Dynamic Branch Prediction
learning based on past behavior
27 April 2017
Spatial correlation
Several branches may resolve in a highly
correlated manner (a preferred path of
execution)
Temporal correlation
The way a branch resolves may be a good
predictor of the way it will resolve at the next
execution
Echoes of temporal/spatial locality in caches…
11
Beckmann27 April 2017
Dynamic Prediction
Input
Truth/Feedback
Prediction
Predictor
Operations
• Predict
Prediction as a feedback control process • Update
12
Beckmann27 April 2017
Predictor Primitive
Emer & Gloy, 1997
• Indexed table holding values
• Operations
– Predict
– Update
• Algebraic notation
Prediction = P[Width, Depth](Index; Update)
Index
Prediction
Update
Depth
Width
P
UI
13
Beckmann27 April 2017
One-bit Predictor
PC
Taken
Prediction
A21064(PC; T) = P[ 1, 2K ](PC; T)
P
U
I
1 bit
What happens on loop branches?
At best, mispredicts twice for every use of loop.
Simple temporal prediction
14
Beckmann27 April 2017
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Use saturating counter
O
n
¬
taken
è
ç
O
n
taken
1 1 Strongly taken
1 0 Weakly taken
0 1 Weakly ¬taken
0 0 Strongly ¬taken
15
Beckmann27 April 2017
Two-bit Predictor
Smith, 1981
PC
+/- Adder
Taken
Prediction
Counter[W,D](I; T) = P[W, D](I; if T then P+1 else P-1)
A21164(PC; T) = MSB(Counter[2, 2K](PC; T))
P
U
I
2 bits
16
Beckmann27 April 2017
Branch History Table
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
0 0Fetch PC
Branch? Target PC
+
I-Cache
Opcode offset
Instruction
k
BHT Index
2k-entry
BHT,
2 bits/entry
Taken/¬Taken?
17
Beckmann27 April 2017
Exploiting Spatial Correlation
Yeh and Patt, 1992
History register, H, records the direction of the last
N branches executed by the processor
if (x[i] < 7) then y += 1; if (x[i] < 5) then c -= 4; If first condition false, second condition also false 18 Beckmann27 April 2017 History Register PC Concatenate Taken History History(PC, T) = P(PC; P || T) P U I 19 Beckmann27 April 2017 Global History GHist(;T) = MSB(Counter(History(0, T); T)) Ind-Ghist(PC;T) = MSB(Counter(PC || Hist(GHist(;T);T))) Taken Concat Global History +/- Prediction Can we take advantage of a pattern at a particular PC? 20 Beckmann27 April 2017 Local History PC Concat Local History +/- Prediction Taken LHist(PC, T) = MSB(Counter(History(PC; T); T)) Can we take advantage of the global pattern at a particular PC? 21 Beckmann27 April 2017 Two-level Predictor Concat Global History +/- Prediction Taken 2Level(PC, T) = MSB(Counter(History(0; T)||PC; T)) Concat PC 22 Beckmann27 April 2017 Two-Level Branch Predictor Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct) 0 0 kFetch PC Shift in Taken/¬Taken results of each branch 2-bit global branch history shift register Taken/¬Taken? 23 Beckmann Which predictor is best? • Many different predictors were proposed • Each handles particular patterns well • Common principles: temporal / spatial correlation, saturating counters, etc • But none is universal • What to do? 27 April 2017 24 Beckmann27 April 2017 Choosing Predictors LHist GHist Chooser Chooser = MSB(P(PC; P + (A==T) - (B==T)) or Chooser = MSB(P(GHist(PC; T); P + (A==T) - (B==T)) Prediction 25 Beckmann27 April 2017 Tournament Branch Predictor (Alpha 21264) • Choice predictor learns whether best to use local or global branch history in predicting next branch • Global history is speculatively updated but restored on mispredict • Claim 90-100% success on range of applications Local history table (1,024x10b) PC Local prediction (1,024x3b) Global Prediction (4,096x2b) Choice Prediction (4,096x2b) Global History (12b)Prediction 26 Beckmann Sophisticated Designs • Neural-network-based branch predictors – [Vintan, IJCNN ‘99][Jiminez, HPCA ‘01] – High prediction accuracy, but more computation – Actually implemented in AMD Piledriver, AMD Ryzen • TAGE predictor – TAgged GEometric predictor [Seznac, JILP ‘06] – Keep multiple predictions with different history lengths – Partially tag predictions to avoid false matches – Only provide prediction on tag match – Rumored to be what Intel uses 27 April 2017 27 Beckmann TAGE predictor Seznec & Michaud, 2006 27 April 2017 TAGE_TREE[L1, L2, L3](PC; T) = TAGE[L3](PC, TAGE[L2](PC, TAGE[L1](PC, Bimodal(PC;T) ;T) ;T ;T) TAGE[L3] Final Prediction TAGE[L2]TAGE[L1]BiModal PC Use me? My Guess 28 Beckmann27 April 2017 TAGE component Counter Prediction Useful Tag Use me? My Guess PC Next Predictor GHist 29 Useful bit is updated when predictor was correct and its alternative was wrong; used for replacement. Beckmann TAGE predictor component 27 April 2017 TAGE[L](PC, NEXT; T) = idx = hash(PC, GHIST[L](;T)) tag = hash(PC, GHIST[L](;T)) TAGE.U = SA(idx, tag; ((TAGE == T) && (NEXT != T))?1:SA) TAGE.Counter = SA(idx, tag; T?SA+1:SA-1) use_me = TAGE.U && isStrong(TAGE.Counter) TAGE = use_me?MSB(TAGE.Counter):NEXT Notes: SA is a ‘set associative’ structure SA allocation occurs on mispredict (not shown) TAGE.U cleared on global counter saturation 30 Beckmann27 April 2017 Limitations of branch predictors Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined. UltraSPARC-III fetch pipeline Correctly predicted taken branch penalty Jump Register penalty A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B Branch Address Calc/Begin Decode I Complete Decode J Steer Instructions to Functional units R Register File Read E Integer Execute Remainder of execute pipeline (+ another 6 stages) 31 Beckmann27 April 2017 Branch Target Buffer (untagged) BP bits are stored with the predicted target address. IF stage: If (BP=taken) then nPC=target else nPC=PC+4 later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb IMEM PC Branch Target Buffer (BTB) (2k entries)k BPbpredicted target BP target 32 Beckmann27 April 2017 Address Collisions What will be fetched after the instruction at 1028? BTB prediction = Correct target = Þ Assume a 128-entry BTB BPbtarget take236 1028 Add ..... 132 Jump 100 Instruction Memory 236 1032 kill PC=236 and fetch PC=1032 Is this a common occurrence? Can we avoid these bubbles? 33 Beckmann27 April 2017 BTB is only for Control Instructions BTB contains useful information for branch and jump instructions only Þ Do not update it for other instructions For all other instructions the next PC is (PC)+4 ! How to achieve this effect without decoding the instruction? 34 Beckmann27 April 2017 Branch Target Buffer (tagged) • Keep both the branch PC and target PC in the BTB • PC+4 is fetched if match fails • Only taken branches and jumps held in BTB • Next PC determined before branch fetched and decoded 2k-entry direct-mapped BTB (can also be associative) I-Cache PC k Valid valid Entry PC = match predicted target target PC 35 Beckmann27 April 2017 Consulting BTB Before Decoding 1028 Add ..... 132 Jump 100 BPbtarget take236 entry PC 132 • The match for PC=1028 fails and 1028+4 is fetched Þ eliminates false predictions after ALU instructions • BTB contains entries only for control transfer instructions Þ more room to store branch targets 36 Beckmann27 April 2017 Combining BTB and BHT • BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR) • BHT can hold many more entries and is more accurate A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B Branch Address Calc/Begin Decode I Complete Decode J Steer Instructions to Functional units R Register File Read E Integer Execute BTB BHTBHT in later pipeline stage corrects when BTB misses a predicted taken branch BTB/BHT only updated after branch resolves in E stage 37 Beckmann27 April 2017 Line Prediction (Alpha 21[234]64) • Line Predictor predicts line to fetch each cycle (tight loop) – Untagged BTB structure – Why? – 21464 was to predict 2 lines per cycle • Icache fetches block, and predictors improve target prediction • PC Calc checks accuracy of line prediction(s) • For superscalar useful to predict next cache line(s) to fetch Line Predictor Instr Cache Branch Predictor Return Stack Indirect Branch Predictor Decode & PC Calc Mispredict 38 Beckmann27 April 2017 Uses of Jump Register (JR) • Switch statements (jump to address of matching case) • Dynamic function call (jump to run-time function address) • Subroutine returns (jump to return address) How well does BTB work for each of these cases? BTB works well if same case used repeatedly BTB works well if same function usually called, (e.g., in C++ programming, when objects have same type in virtual function call) BTB works well if usually return to the same place Þ Often one function called from many distinct call sites! 39 Beckmann27 April 2017 Subroutine Return Stack Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs. &fb() &fc() Push call address when function call executed Pop return address when subroutine return decoded fa() { fb(); } fb() { fc(); } fc() { fd(); } &fd() k entries (typically k=8-16) 40 Beckmann27 April 2017 Overview of branch prediction P C Need next PC immediately Decode RegRead Execute Instr type, PC relative targets available Simple conditions, register targets available Complex conditions available BTB BP, JMP, Ret Loose loop Loose loop Loose loopTight loop Must speculation check always be correct? No… Best predictors reflect program behavior 41 Thank you !