CS计算机代考程序代写 compiler cache c++ computer architecture RISC-V mips Computer Architecture ELEC3441

Computer Architecture ELEC3441
Lecture 6 – Branch Prediction + Interrupts
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
Why an Instruction may not be dispatched every cycle (CPI>1)
§ Full bypassing may be too expensive to implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
§ Loads have two-cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
• MIPS:“Microprocessor without Interlocked Pipeline Stages” § Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler. NOPs increase instructions/program!
2
RISC-V Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1) Is the preceding instruction a taken branch? 2) If so, what is the target address?
Instruction
JAL
JALR B
Taken known?
After Inst. Decode After Inst. Decode
After Execute
Target known?
After Inst. Decode After Reg. Fetch
After Inst. Decode
3
Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000)
A
P
F
B
I
J
R
E
Branch Target Address Known
Branch Direction &
Jump Register Target Known
PC Generation/Mux
Instruction Fetch Stage 1 Instruction Fetch Stage 2
Branch Address Calc/Begin Decode Complete Decode
Steer Instructions to Functional units Register File Read
Integer Execute
Remainder of execute pipeline (+ another 6 stages)
4

Reducing Control Flow Penalty
§ Software solutions
– Eliminate branches – loop unrolling
• Increases the run length
– Reduce resolution time – instruction scheduling
• Compute the branch condition as early as possible (of limited value because branches often in critical path through code)
§ Hardware solutions
– Find something else to do – delay slots
• Replaces pipeline bubbles with useful work (requires software cooperation)
– Speculate – branch prediction
• Speculative execution of instructions beyond the branch
5
Branch Prediction
Motivation:
Branch penalties limit performance of deeply pipelined processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures:
• Branch history tables (BHT), branch target buffers (BTB), etc.
Mispredict recovery mechanisms:
• Keep result computation separate from commit • Kill instructions following branch in pipeline
• Restore state to that following branch
6
Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
backward forward 90% 50%
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110
bne0 (preferred taken) beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64
typically reported as ~80% accurate
7
Dynamic Branch Prediction learning based on past behavior
§Temporal correlation
– The way a branch resolves may be a good predictor of
the way it will resolve at the next execution
§Spatial correlation
– Several branches may resolve in a highly correlated manner (a preferred path of execution)
8

Loop Example
addi x1, x0, 10
loop: lw x2, 0(x3)
addi x2, x2, 1
sw x2, 0(x3)
addi x3, x3, -4
addi x1, x1, -1
bneq x1, x0, loop
exit: nop
n In the above code, the branch is executed 10 times
• First 9 timesèbranch taken
• 10th timeèbranch not taken
n Temporal correlation works:
• If the branch has been taken 5 times, it is likely it will be
taken the 6th time
n Last time must be wrong
HKUEEE ENGG3441 – HS 9
for (i=10; i>0; i–) {
a[i] = a[i] + 1;
}
Simple Dynamic Branch Prediction
Predict the direction of next branch based on previous branch result
n Predict based on the immediate previous direction
• If last branch was taken, then predict branch will be taken this time
• If last branch was not taken, also predict this branch will not be taken
n Draw back:
• 2 misses on the loop example above in most cases: 1
miss on enter, 1 miss on exit
• Consider nested loop
HKUEEE ENGG3441 – HS 10
2-bit Branch Prediction
n Predict branch direction based on 2 previous branch results of the same branch
• Change prediction only after 2 consecutive mispredictions taken
not taken
PT0
taken
PNT1 not
taken
not taken
taken
taken
reset
PT1
PTN0
HKUEEE
ENGG3441 – HS
11
not taken
Implementation Considerations
n Ideally, hardware contains branch predictor for every branch instructions in the program
n Challenges:
• Branch instructions may occur at any address
• One predictor for every location (2^32) is impractical
• Dynamically mapping predictor to branch instruction is too slow (Fully associative)
n Usual solution:
• Use a small hardware table indexed by lower k bits of the instruction address
• The probability of having two branch instructions with the same lower k bits address is small
• In case it happens, it only lower the prediction accuracyèprogram is functionally correct regardless
HKUEEE ENGG3441 – HS 12

Branch History Table
00
Fetch PC
Instruction
k
BHT Index
2k-entry BHT,
2 bits/entry
I-Mem
Opcode
offset
Branch?
+
Target PC
Taken/¬Taken?
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
13
Exploiting Spatial Correlation
Yeh and Patt, 1992
if (x[i] < 7) then y += 1; if (x[i] < 5) then c -= 4; If first condition false, second condition also false History register, H, records the direction of the last N branches executed by the processor 14 Two-Level Branch Predictor Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct) 00 Fetch PC k 2-bit global branch history shift register Shift in Taken/¬Taken results of each branch Taken/¬Taken? 15 2-bit Branch Prediction Accuracy nASA7 matrix300 tomcatv doduc spice fpppp gcc espresso eqntott li 1% 0% 1% 0% 0% 0% 1% 0% 1% n Local 2-bit predictor with 4096 entries n Local 2-bit predictor with unlimited entries n 2-bit local + 2-bit global predictor 5% 5% 5% 9% 9% 5% 9% 5% 5% 5% 4% 9% 12% 11% 11% 5% 10% 6% 10% 18% 18% HKUEEE ENGG3441 - HS 16 0% 2% 4% 6% 8% 10%12%14%16%18% Frequency of mispredictions 4096 entries: 2 bits per entry Unlimited entries: 2 bits per entry 1024 entries: (2,2) SPEC89 benchmarks Speculating Both Directions §An alternative to branch prediction is to execute both directions of a branch speculatively – resource requirement is proportional to the number of concurrent speculative executions – only half the resources engage in useful work when both directions of a branch are executed speculatively – branch prediction takes less resources than speculative execution of both paths §With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction! 17 Limitations of BHTs Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined. PC Generation/Mux Instruction Fetch Stage 1 Instruction Fetch Stage 2 Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute Remainder of execute pipeline (+ another 6 stages) A P F B I J R E Correctly predicted taken branch penalty Jump Register penalty UltraSPARC-III fetch pipeline 18 predicted BPb target IMEM Branch Target Buffer k Branch Target Buffer (2k entries) PC target BP BP bits are stored with the predicted target address. IF stage: If (BP=taken) then nPC=target else nPC=PC+4 Later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb 19 Assume a 128-entry BTB Address Collisions target BPb 236 take What will be fetched after the instruction at 1028? Instruction Memory BTB prediction = Correct target = 1032 è kill PC=236 and fetch PC=1032 Is this a common occurrence? Can we avoid these bubbles? 236 132 Jump +104 1028 Add ..... 1028 mod 128 = 4 132 mod 128 = 4 20 BTB is only for Control Instructions § BTB contains useful information for branch and jump instructions only èDo not update it for other instructions § For all other instructions the next PC is PC+4 ! § How to achieve this effect without decoding the instruction? 21 I-Cache PC 2k-entry direct-mapped BTB (can also be associative) Branch Target Buffer (BTB) Entry PC Valid predicted target PC k • Keep both the branch PC and target PC in the BTB • PC+4 is fetched if match fails • Only taken branches and jumps held in BTB • Next PC determined before branch fetched and decoded = match valid target 23 Combining BTB and BHT § BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR) § BHT can hold many more entries and is more accurate BHT in later pipeline stage corrects when BTB misses a predicted taken branch PC Generation/Mux Instruction Fetch Stage 1 Instruction Fetch Stage 2 Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute BTB A P F BHT B I J R E BTB/BHT only updated after branch resolves in E stage 24 Send PC to memory and branch-target buffer IF No Entry found in Yes branch-target buffer? ID Is No instruction Yes Normal instruction execution a taken branch? Send out predicted PC No Taken Yes branch? EX Enter branch instruction address and next PC into branch- target buffer Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer Branch correctly predicted; continue execution with no stalls Combining BTB and BHT in Pipeline n Assume branch address is computed at EX HKUEEE ENGG3441 - HS 25 Uses of Jump Register (JR) § Switch statements (jump to address of matching case) BTB works well if same case used repeatedly § Dynamic function call (jump to run-time function address) BTB works well if same function usually called, (e.g., in C++ programming, when objects have same type in virtual function call) § Subroutine returns (jump to return address) BTB works well if usually return to the same place Þ Often one function called from many distinct call sites! How well does BTB work for each of these cases? 26 Subroutine Return Stack Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs. fa() { fb(); } fb() { fc(); } fc() { fd(); } Push call address when function call executed Pop return address when subroutine return decoded k entries (typically k=8-16) &fd() &fc() &fb() 27 HKUEEE ENGG3441 - HS 28 Causes of Interrupts § Asynchronous: an external event – input/output device service-request – timer expiration – power disruptions, hardware failure Interrupts Interrupt: an event that requests the attention of the processor § Synchronous: an internal event (a.k.a. traps or exceptions) – undefined opcode, privileged instruction – arithmetic overflow, FPU exception – misaligned memory access – virtual memory exceptions: page faults, TLB misses, protection violations – system calls, e.g., jumps into kernel Exceptions 29 Interrupts: altering the normal flow of control Ii-1 HI1 program Ii HI2 Ii+1 HIn interrupt handler An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from program’s point of view. 30 Asynchronous Interrupts: invoking the interrupt handler § An I/O device requests attention by asserting one of the prioritized interrupt request lines § When the processor decides to process the interrupt – It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (precise interrupt) – It saves the PC of instruction Ii in a special register (EPC) – It disables interrupts and transfers control to a designated interrupt handler running in the kernel mode 31 Interrupt Handler § Saves EPC before enabling interrupts to allow nested interrupts Þ – need an instruction to move EPC into GPRs – need a way to mask further interrupts at least until EPC can be saved § Needs to read a status register that indicates the cause of the interrupt § Uses a special indirect jump instruction RFE (return-from- exception) which – enablesinterrupts – restores the processor to the user mode – restores hardware status and control state 32 Synchronous Interrupts § A synchronous interrupt (exception) is caused by a particular instruction § In general, the instruction cannot be completed and needs to be restarted after the exception has been handled – requires undoing the effect of one or more partially executed instructions § In the case of a system call trap, the instruction is considered to have been completed – a special jump instruction involving a change to privileged kernel mode 33 Exception Handling 5-Stage Pipeline PC Inst. Mem D Decode E + M Data Mem W PC address Illegal Exception Opcode Asynchronous Interrupts Overflow Data address Exceptions § How to handle multiple simultaneous exceptions in different pipeline stages? § How and where to handle external asynchronous interrupts? 34 PC Inst. Mem PC address Exception Kill F Stage Illegal Opcode Kill D Stage + Overflow Kill E Stage Commit Point Data address Exceptions Exception Handling 5-Stage Pipeline D Decode E M Data Mem W Exc D Exc E Exc M Select Handler PC Asynchronous Interrupts PC D PC E PC M Kill Writeback 35 Exception Handling 5-Stage Pipeline § Hold exception flags in pipeline until commit point (M stage) § Exceptions in earlier pipe stages override later exceptions for a given instruction § Inject external interrupts at commit point (override others) § If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage 36 Speculating on Exceptions §Prediction mechanism – Exceptions are rare, so simply predicting no exceptions is very accurate! §Check prediction mechanism – Exceptions detected at end of instruction execution pipeline, special hardware for various exception types §Recovery mechanism – Only write architectural state at commit point, so can throw away partially executed instructions after exception – Launch exception handler after flushing pipeline §Bypassing allows use of uncommitted instruction results by following instructions 37 EPC Cause (I1) 096: ADD (I2) 100: XOR (I3) 104: SUB IF1 ID1 IF2 EX1 MA1 - overflow! ID2 EX2 - - IF3 ID3 - -- IF4 - --- IF5 ID5 EX5 MA5 WB5 t2 t3 t4 t5 t6 t7 .... Exception Pipeline Diagram time t0 t1 t2 t3 t4 t5 t6 t7 .... (I4) 108: ADD (I5) Exc. Handler code Resource Usage I3 I4 I5 ID I1 I2 I3 - I5 time t0 t1 IF I1 I2 EX I1I2--I5 MA I1 - - - I5 WB - - - - I5 38 In Conclusions... n Branch prediction allows hardware to speculate the direction of branch instructions • If speculation wrong, kill incorrect instructions in pipeline • Otherwise proceed as normal n BTB helps predict the branch target address n BHT helps predict the branch direction n Implementation in critical path of instruction fetch è must be light weight n Exceptions and Interrupts alter the behavior of pipeline • Commonly handles them at end of pipeline before commit • Need to be careful with potentially incorrect forwarded values HKUEEE ENGG3441 - HS 39 Acknowledgements n These slides contain material developed and copyright by: • Arvind(MIT) • KrsteAsanovic(MIT/UCB) • JoelEmer(Intel/MIT) • JamesHoe(CMU) • JohnKubiatowicz(UCB) • DavidPatterson(UCB) n MIT material derived from course 6.823 n UCB material derived from course CS152, CS252 HKUEEE ENGG3441 - HS 40

Related Posts