程序代写代做代考 c++ mips cache L19-BranchPrediction

L19-BranchPrediction

Nathan Beckmann
CMU

Based on slides by Joel Emer, MIT

Branch Prediction

Beckmann27 April 2017

Commit: Instruction irrevocably updates
architectural state (aka “graduation” or
“completion”).

Execute: Instructions and operands sent to
execution units .
When execution completes, all results and
exception flags are available.

Decode: Instructions placed in appropriate
issue (aka “dispatch”) stage buffer

Fetch: Instruction bits retrieved
from cache.

Phases of Instruction Execution

I-cache

Fetch
Buffer

Issue
Buffer

Func.
Units

Arch.
State

Result
Buffer

Beckmann27 April 2017

I-cache

Fetch
Buffer

Issue
Buffer

Func.
Units

Arch.
State

Execute

Decode

Result
Buffer Commit

Fetch
Modern processors may have
> 10 pipeline stages between
next PC calculation and branch
resolution !

Control Flow Penalty

How much work is lost if
pipeline doesn’t follow
correct instruction flow?

~ Loop length x pipeline width

Loose loop

Branch
executed

Next fetch
started

Beckmann27 April 2017

Average Run-Length between
Branches

Average dynamic instruction mix from SPEC92:
SPECint92 SPECfp92

ALU 39 % 13 %
FPU Add 20 %
FPU Mult 13 %
load 26 % 23 %
store 9 % 9 %
branch 16 % 8 %
other 10 % 12 %

SPECint92: compress, eqntott, espresso, gcc , li
SPECfp92: doduc, ear, hydro2d, mdijdp2, su2cor

What is the average run length between branches

Beckmann27 April 2017

Instruction Taken known? Target known?

JR
BEQZ/BNEZ

MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces
of information from the preceding instruction:

After Reg. Fetch* After Inst. Decode

After Inst. Decode After Inst. Decode

After Inst. Decode After Reg. Fetch

*Assuming zero detect on register read

1) Is the preceding instruction a taken branch?

2) If so, what is the target address?

Beckmann27 April 2017

Realistic Branch Penalties

A PC Generation/Mux
P Instruction Fetch Stage 1
F Instruction Fetch Stage 2
B Branch Address Calc/Begin Decode
I Complete Decode
J Steer Instructions to Functional units
R Register File Read
E Integer Execute

Remainder of execute pipeline
(+ another 6 stages)

UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750MHz, 2000)

Branch
Target
Address
Known

Branch
Direction &
Jump
Register
Target
Known

Beckmann27 April 2017

Reducing Control Flow Penalty
Software solutions

• Eliminate branches – loop unrolling
Increases the run length

• Reduce resolution time – instruction scheduling
Compute the branch condition as early
as possible (of limited value – why?)

Hardware solutions
• Find something else to do architecturally

• delay slots – replace pipeline bubbles with
useful work (requires software cooperation)

• Speculate – branch prediction
Speculative execution of instructions beyond
the branch
Useful in both in- and out-of-order processors

Beckmann27 April 2017

Branch Prediction
Motivation:

Branch penalties limit performance of deeply pipelined
processors

Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly

Required hardware support:
Prediction structures:

• Branch history tables, branch target buffers, etc.

Mispredict recovery mechanisms:
• Keep result computation separate from commit
• Kill instructions following branch in pipeline
• Restore state to state following branch

Beckmann27 April 2017

Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:

ISA can attach preferred direction semantics to branches,
e.g., Motorola MC88110

bne0 (preferred taken) beq0 (not taken)

ISA can allow arbitrary choice of statically predicted direction,
e.g., HP PA-RISC, Intel IA-64

typically reported as ~80% accurate

JZ
backward

90%
forward

50%

Beckmann

Is Static Prediction Enough?

• UltraSPARC-III
– Mispredicted branches have penalty of 6 cycles
– 4-wide issue

• Wasted work per branch @ 80%
accuracy:

20% misprediction rate
× 6 stages
× 4-way issue
= 4.8 wasted instructions / branch

• Branches are 15-20% of instructions!

27 April 2017

Beckmann

Dynamic Branch Prediction
learning based on past behavior

27 April 2017

Spatial correlation
Several branches may resolve in a highly
correlated manner (a preferred path of
execution)

Temporal correlation
The way a branch resolves may be a good
predictor of the way it will resolve at the next
execution

Echoes of temporal/spatial locality in caches…

Beckmann27 April 2017

Dynamic Prediction

Input

Truth/Feedback

Prediction
Predictor

Operations

• Predict
Prediction as a feedback control process • Update

Beckmann27 April 2017

Predictor Primitive
Emer & Gloy, 1997

• Indexed table holding values

• Operations
– Predict
– Update

• Algebraic notation

Prediction = P[Width, Depth](Index; Update)

Index

Prediction

Update

Depth

Width

Beckmann27 April 2017

One-bit Predictor

Taken

Prediction

A21064(PC; T) = P[ 1, 2K ](PC; T)

1 bit

What happens on loop branches?

At best, mispredicts twice for every use of loop.

Simple temporal prediction

Beckmann27 April 2017

Branch Prediction Bits
• Assume 2 BP bits per instruction
• Use saturating counter

O
n
¬

taken
è

ç
O

n
taken

1 1 Strongly taken

1 0 Weakly taken

0 1 Weakly ¬taken

0 0 Strongly ¬taken

Beckmann27 April 2017

Two-bit Predictor
Smith, 1981

+/- Adder

Taken
Prediction

Counter[W,D](I; T) = P[W, D](I; if T then P+1 else P-1)

A21164(PC; T) = MSB(Counter[2, 2K](PC; T))

2 bits

Beckmann27 April 2017

Branch History Table

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

0 0Fetch PC

Branch? Target PC

I-Cache

Opcode offset
Instruction

BHT Index
2k-entry
BHT,
2 bits/entry

Taken/¬Taken?

Beckmann27 April 2017

Exploiting Spatial Correlation
Yeh and Patt, 1992

History register, H, records the direction of the last
N branches executed by the processor

if (x[i] < 7) then y += 1; if (x[i] < 5) then c -= 4; If first condition false, second condition also false 18 Beckmann27 April 2017 History Register PC Concatenate Taken History History(PC, T) = P(PC; P || T) P U I 19 Beckmann27 April 2017 Global History GHist(;T) = MSB(Counter(History(0, T); T)) Ind-Ghist(PC;T) = MSB(Counter(PC || Hist(GHist(;T);T))) Taken Concat Global History +/- Prediction Can we take advantage of a pattern at a particular PC? 20 Beckmann27 April 2017 Local History PC Concat Local History +/- Prediction Taken LHist(PC, T) = MSB(Counter(History(PC; T); T)) Can we take advantage of the global pattern at a particular PC? 21 Beckmann27 April 2017 Two-level Predictor Concat Global History +/- Prediction Taken 2Level(PC, T) = MSB(Counter(History(0; T)||PC; T)) Concat PC 22 Beckmann27 April 2017 Two-Level Branch Predictor Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct) 0 0 kFetch PC Shift in Taken/¬Taken results of each branch 2-bit global branch history shift register Taken/¬Taken? 23 Beckmann Which predictor is best? • Many different predictors were proposed • Each handles particular patterns well • Common principles: temporal / spatial correlation, saturating counters, etc • But none is universal • What to do? 27 April 2017 24 Beckmann27 April 2017 Choosing Predictors LHist GHist Chooser Chooser = MSB(P(PC; P + (A==T) - (B==T)) or Chooser = MSB(P(GHist(PC; T); P + (A==T) - (B==T)) Prediction 25 Beckmann27 April 2017 Tournament Branch Predictor (Alpha 21264) • Choice predictor learns whether best to use local or global branch history in predicting next branch • Global history is speculatively updated but restored on mispredict • Claim 90-100% success on range of applications Local history table (1,024x10b) PC Local prediction (1,024x3b) Global Prediction (4,096x2b) Choice Prediction (4,096x2b) Global History (12b)Prediction 26 Beckmann Sophisticated Designs • Neural-network-based branch predictors – [Vintan, IJCNN ‘99][Jiminez, HPCA ‘01] – High prediction accuracy, but more computation – Actually implemented in AMD Piledriver, AMD Ryzen • TAGE predictor – TAgged GEometric predictor [Seznac, JILP ‘06] – Keep multiple predictions with different history lengths – Partially tag predictions to avoid false matches – Only provide prediction on tag match – Rumored to be what Intel uses 27 April 2017 27 Beckmann TAGE predictor Seznec & Michaud, 2006 27 April 2017 TAGE_TREE[L1, L2, L3](PC; T) = TAGE[L3](PC, TAGE[L2](PC, TAGE[L1](PC, Bimodal(PC;T) ;T) ;T ;T) TAGE[L3] Final Prediction TAGE[L2]TAGE[L1]BiModal PC Use me? My Guess 28 Beckmann27 April 2017 TAGE component Counter Prediction Useful Tag Use me? My Guess PC Next Predictor GHist 29 Useful bit is updated when predictor was correct and its alternative was wrong; used for replacement. Beckmann TAGE predictor component 27 April 2017 TAGE[L](PC, NEXT; T) = idx = hash(PC, GHIST[L](;T)) tag = hash(PC, GHIST[L](;T)) TAGE.U = SA(idx, tag; ((TAGE == T) && (NEXT != T))?1:SA) TAGE.Counter = SA(idx, tag; T?SA+1:SA-1) use_me = TAGE.U && isStrong(TAGE.Counter) TAGE = use_me?MSB(TAGE.Counter):NEXT Notes: SA is a ‘set associative’ structure SA allocation occurs on mispredict (not shown) TAGE.U cleared on global counter saturation 30 Beckmann27 April 2017 Limitations of branch predictors Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined. UltraSPARC-III fetch pipeline Correctly predicted taken branch penalty Jump Register penalty A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B Branch Address Calc/Begin Decode I Complete Decode J Steer Instructions to Functional units R Register File Read E Integer Execute Remainder of execute pipeline (+ another 6 stages) 31 Beckmann27 April 2017 Branch Target Buffer (untagged) BP bits are stored with the predicted target address. IF stage: If (BP=taken) then nPC=target else nPC=PC+4 later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb IMEM PC Branch Target Buffer (BTB) (2k entries)k BPbpredicted target BP target 32 Beckmann27 April 2017 Address Collisions What will be fetched after the instruction at 1028? BTB prediction = Correct target = Þ Assume a 128-entry BTB BPbtarget take236 1028 Add ..... 132 Jump 100 Instruction Memory 236 1032 kill PC=236 and fetch PC=1032 Is this a common occurrence? Can we avoid these bubbles? 33 Beckmann27 April 2017 BTB is only for Control Instructions BTB contains useful information for branch and jump instructions only Þ Do not update it for other instructions For all other instructions the next PC is (PC)+4 ! How to achieve this effect without decoding the instruction? 34 Beckmann27 April 2017 Branch Target Buffer (tagged) • Keep both the branch PC and target PC in the BTB • PC+4 is fetched if match fails • Only taken branches and jumps held in BTB • Next PC determined before branch fetched and decoded 2k-entry direct-mapped BTB (can also be associative) I-Cache PC k Valid valid Entry PC = match predicted target target PC 35 Beckmann27 April 2017 Consulting BTB Before Decoding 1028 Add ..... 132 Jump 100 BPbtarget take236 entry PC 132 • The match for PC=1028 fails and 1028+4 is fetched Þ eliminates false predictions after ALU instructions • BTB contains entries only for control transfer instructions Þ more room to store branch targets 36 Beckmann27 April 2017 Combining BTB and BHT • BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR) • BHT can hold many more entries and is more accurate A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B Branch Address Calc/Begin Decode I Complete Decode J Steer Instructions to Functional units R Register File Read E Integer Execute BTB BHTBHT in later pipeline stage corrects when BTB misses a predicted taken branch BTB/BHT only updated after branch resolves in E stage 37 Beckmann27 April 2017 Line Prediction (Alpha 21[234]64) • Line Predictor predicts line to fetch each cycle (tight loop) – Untagged BTB structure – Why? – 21464 was to predict 2 lines per cycle • Icache fetches block, and predictors improve target prediction • PC Calc checks accuracy of line prediction(s) • For superscalar useful to predict next cache line(s) to fetch Line Predictor Instr Cache Branch Predictor Return Stack Indirect Branch Predictor Decode & PC Calc Mispredict 38 Beckmann27 April 2017 Uses of Jump Register (JR) • Switch statements (jump to address of matching case) • Dynamic function call (jump to run-time function address) • Subroutine returns (jump to return address) How well does BTB work for each of these cases? BTB works well if same case used repeatedly BTB works well if same function usually called, (e.g., in C++ programming, when objects have same type in virtual function call) BTB works well if usually return to the same place Þ Often one function called from many distinct call sites! 39 Beckmann27 April 2017 Subroutine Return Stack Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs. &fb() &fc() Push call address when function call executed Pop return address when subroutine return decoded fa() { fb(); } fb() { fc(); } fc() { fd(); } &fd() k entries (typically k=8-16) 40 Beckmann27 April 2017 Overview of branch prediction P C Need next PC immediately Decode RegRead Execute Instr type, PC relative targets available Simple conditions, register targets available Complex conditions available BTB BP, JMP, Ret Loose loop Loose loop Loose loopTight loop Must speculation check always be correct? No… Best predictors reflect program behavior 41 Thank you !

Related Posts