Lect. 2: Superscalar Processors

▪ Pipelining: several instructions are simultaneously at different stages of their execution
▪ Superscalar: several instructions are simultaneously at the same stage of their execution
▪ Out-of-order execution: instructions can be executed in an order different from that specified in the program
▪ Speculative execution: tentative execution despite dependences
▪ Dependences between instructions:
– Data dependence (a.k.a. read after write – RAW)
– Control dependence
Parallel Architectures – 2019-20
A 5-stage Pipeline
IF = instruction fetch (includes PC increment)
ID = instruction decode + fetching values from general-purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB = write back results to general purpose registers
A Pipelining Diagram
Start one instruction per clock cycle
[Diagram: instructions flow through IF, ID, EXE, MEM, WB in successive clock cycles, one new instruction entering per cycle]
∴ each instruction still takes 5 cycles, but instructions now complete every cycle: CPI → 1
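This cycle count can be sketched with a toy model (a hypothetical illustration, not code from the lecture): an ideal k-stage pipeline with no stalls finishes N instructions in k + (N − 1) cycles, so CPI approaches 1.

```python
def pipelined_cycles(n_instructions: int, stages: int = 5) -> int:
    """Cycles for an ideal, stall-free pipeline: the first instruction
    takes `stages` cycles, then one more instruction completes per cycle."""
    return stages + (n_instructions - 1)

# CPI = cycles / instructions tends to 1 as the instruction count grows
for n in (1, 10, 1_000_000):
    print(n, pipelined_cycles(n) / n)
```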
Multiple-issue Superscalar
Start two instructions per clock cycle
[Diagram: two instructions enter the IF, ID, EXE, MEM, WB pipeline together each clock cycle]
CPI → 0.5; IPC → 2
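The same toy model extends to multiple issue (again a hypothetical sketch, not lecture code): a width-w pipeline starts up to w instructions per cycle, so N instructions need roughly stages + ⌈N/w⌉ − 1 cycles and IPC approaches w.

```python
import math

def superscalar_cycles(n_instructions: int, stages: int = 5, width: int = 2) -> int:
    """Cycles for an ideal width-issue pipeline: a group of `width`
    instructions enters every cycle and drains through `stages` stages."""
    return stages + math.ceil(n_instructions / width) - 1

# IPC = instructions / cycles approaches `width` for long instruction runs
```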
Advanced Superscalar Execution
▪ Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle
▪ In practice:
– Data, control, and structural hazards spoil issue flow
– Multi-cycle instructions spoil commit flow
▪ Buffers at issue (the issue queue) and commit (the reorder buffer) decouple these stages from the rest of the pipeline and somewhat smooth out breaks in the flow
[Block diagram: fetch engine, instruction buffers, and general registers around the execution pipeline]
Problems At Instruction Fetch
▪ Crossing instruction cache line boundaries
– e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
– Case 1: all instructions located in the same cache line and no branch
– Case 2: instructions spread over multiple lines and no branch
– More than one cache lookup is required in the same cycle
– Words from different lines must be ordered and packed into the instruction queue
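A quick sketch of the line-crossing arithmetic for the example parameters (32-byte lines, 32-bit instructions, 4-wide fetch; the function name is made up):

```python
LINE_BYTES, INST_BYTES, FETCH_WIDTH = 32, 4, 4  # parameters from the example

def insts_fetchable(pc: int) -> int:
    """Instructions available this cycle with a single cache lookup,
    i.e. without crossing the current line boundary."""
    to_line_end = (LINE_BYTES - pc % LINE_BYTES) // INST_BYTES
    return min(FETCH_WIDTH, to_line_end)

# A fetch starting at the last word of a line gets only 1 of the 4 slots:
# insts_fetchable(28) == 1
```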
Problems At Instruction Fetch
▪ Control flow
– e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
– Case 1: single not-taken branch
– Case 2: single taken branch outside the fetch range and into another cache line
– Branch prediction is required within the instruction fetch stage
– For wider-issue processors multiple predictions are likely required
– In practice most fetch units only fetch up to the first predicted taken branch
Example Frequencies of Control Flow
[Table: average basic-block size and number of instructions between taken branches; data from Rotenberg et al. for SPEC 92 Int]
One branch about every 4 to 6 instructions
One taken branch about every 5 to 9 instructions
Solutions For Instruction Fetch
Advanced fetch engines that can perform multiple cache line lookups
– E.g., interleaved I-caches, where consecutive program lines are stored in different banks that can be accessed in parallel
Very fast, albeit not very accurate branch predictors (e.g. branch target buffers)
– Note: usually used in conjunction with more accurate but slower predictors
Restructuring instruction storage to keep commonly consecutive instructions together (e.g., Trace cache in Pentium 4)
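The interleaving trick can be sketched as address arithmetic (a hypothetical 2-bank model, not a real design): consecutive cache lines live in alternating banks, so a fetch spanning two lines touches two different banks and both can be read in the same cycle.

```python
LINE_BYTES = 32

def banks_for_fetch(pc: int, n_bytes: int) -> list:
    """(bank, index-within-bank) pairs touched by a fetch of n_bytes at pc.
    Even-numbered lines sit in bank 0, odd-numbered lines in bank 1."""
    first = pc // LINE_BYTES
    last = (pc + n_bytes - 1) // LINE_BYTES
    return [(line % 2, line // 2) for line in range(first, last + 1)]

# A 16-byte fetch at pc=24 spans lines 0 and 1 -> two different banks:
# banks_for_fetch(24, 16) == [(0, 0), (1, 0)]
```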
Example: Advanced Fetch Engine
[Figure from Rotenberg et al.: a 2-way interleaved I-cache; control-flow prediction units (branch target buffer, return address stack, branch predictor); a mask to select instructions from each of the cache lines; a final alignment unit]
Trace Caches
▪ Traditional I-cache: instructions laid out in program order
▪ Dynamic execution order does not always follow program order (e.g., taken branches), and the dynamic order also changes
▪ Trace cache:
– Stores instructions in execution order (traces)
– Traces can start with any static instruction and are identified by the starting instruction’s PC
– Traces are dynamically created as instructions are normally fetched and branches are resolved
– Traces also contain the outcomes of the implicitly predicted branches
– When the same trace is encountered again (i.e., same starting instruction and same branch predictions), instructions are obtained from the trace cache
– Note that multiple traces can be stored with the same starting instruction
Branch Prediction
We already saw BTB for quick predictions
▪ Combining predictor
– Processors have multiple branch predictors with different accuracy/delay tradeoffs
– A meta-predictor chooses which predictor to use
▪ Perceptron predictor
– Uses a simple neural network (a perceptron) for branch prediction
▪ TAGE predictor
– Similar to the combining-predictor idea, but with no explicit meta-predictor
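A minimal sketch of the perceptron idea (a simplification; the table organization, history length, and indexing by raw PC are illustrative assumptions): the prediction is the sign of a dot product between per-branch weights and the global history, and weights are adjusted only on a misprediction or a low-confidence prediction.

```python
HISTORY = 8
THETA = int(1.93 * HISTORY + 14)  # training threshold suggested in the literature
weights = {}                       # pc -> [bias, w1..wH]; a real design hashes the PC
history = [1] * HISTORY            # global branch history encoded as +1 / -1

def predict(pc: int) -> bool:
    w = weights.setdefault(pc, [0] * (HISTORY + 1))
    y = w[0] + sum(wi * h for wi, h in zip(w[1:], history))
    return y >= 0                  # taken iff the dot product is non-negative

def train(pc: int, taken: bool) -> None:
    w = weights.setdefault(pc, [0] * (HISTORY + 1))
    t = 1 if taken else -1
    y = w[0] + sum(wi * h for wi, h in zip(w[1:], history))
    if (y >= 0) != taken or abs(y) <= THETA:  # update only on error / low confidence
        w[0] += t
        for i, h in enumerate(history):
            w[i + 1] += t * h
    history.pop(0)                 # shift the new outcome into the history
    history.append(t)
```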
Superscalar: Other Challenges
Superscalar decode
– Replicate decoders (ok)
Superscalar issue
– Number of dependence tests increases quadratically (bad)
Superscalar register read
– Number of register ports increases linearly (bad)
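The quadratic growth of issue-stage dependence tests is easy to count (a hypothetical helper, assuming two source registers per instruction): every instruction in an issue group must compare each source against the destinations of all earlier instructions in the group, giving s·n(n−1)/2 comparisons.

```python
def dependence_checks(n_issue: int, srcs_per_inst: int = 2) -> int:
    """Comparators needed to cross-check one issue group: instruction i
    compares its sources against the destinations of instructions 0..i-1."""
    return sum(srcs_per_inst * earlier for earlier in range(n_issue))

# Doubling the issue width roughly quadruples the work:
# dependence_checks(4) == 12, dependence_checks(8) == 56
```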
Superscalar: Other Challenges
▪ Superscalar execute
– Replicate functional units (not bad)
▪ Superscalar bypass/forwarding
– Increases quadratically (bad)
– Clustering mitigates this problem
▪ Superscalar register writeback
– Increases linearly (bad)
▪ ILP uncovered
– Limited by the ILP inherent in the program
– Bigger instruction windows help expose more ILP
Effect of Instruction Window
[Figure: instructions per clock for gcc, espresso, and li as the instruction window grows from 32 to 128, 512, 2048 entries, and infinite]
References and Further Reading
▪ Software trace cache:
“Software Trace Cache”, A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero, Intl. Conf. on Supercomputing, June 1999.
Original hardware trace cache:
“Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching”, E. Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.
Next trace prediction for trace caches:
“Path-Based Next Trace Prediction”, Q. Jacobson, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1997.
Probing Further
Advanced register allocation and de-allocation
“Late Allocation and Early Release of Physical Registers”, T. Monreal, V. Vinals, J. Gonzalez, A. Gonzalez, and M. Valero, IEEE Trans. on Computers, October 2004.
Value prediction
“Exceeding the Dataflow Limit Via Value Prediction”, M. H. Lipasti and J. P. Shen, Intl. Symp. on Microarchitecture, December 1996.
Limitations to wide issue processors
“Complexity-Effective Superscalar Processors”, S. Palacharla, N. P. Jouppi, and J. Smith, Intl. Symp. on Computer Architecture, June 1997.
“Clock Rate Versus IPC: the End of the Road for Conventional Microarchitectures”, V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Intl. Symp. on Computer Architecture, June 2000.
Pros/Cons of Trace Caches
+ Instructions come from a single trace cache line
+ Branches are implicitly predicted
– The instruction that follows the branch is fixed in the trace and implies the branch’s direction (taken or not taken)
+ I-cache still present, so no need to change cache hierarchy
+ In CISC ISA’s (e.g., x86) the trace cache can keep decoded instructions (e.g., Pentium 4)
– Wasted storage as instructions appear in both I-cache and trace cache, and in possibly multiple trace cache lines
– Not very good when there are traces with common sub-paths
– Not very good at handling indirect jumps and returns (which have multiple targets, instead of only taken/not taken)
Structure of a Trace Cache
Figure from Rotenberg et al.
Structure of a Trace Cache
▪ Each line contains n instructions from up to m basic blocks
▪ Control bits:
– Branch flags: m−1 bits to specify the directions of the up to m branches (the last branch’s direction is implied by which exit address is followed)
– Branch mask: the number of branches in the trace
– Trace target address and fall-through address: the address of the next instruction to be fetched after the trace is exhausted
▪ Trace cache hit:
– Tag must match
– Branch predictions must match the branch flags for all branches in the trace
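The hit condition above can be sketched as follows (field and function names are illustrative, not from the Rotenberg et al. design):

```python
from dataclasses import dataclass

@dataclass
class TraceLine:
    tag: int                # PC of the trace's first instruction
    branch_flags: tuple     # stored directions of the branches in the trace

def trace_hit(line: TraceLine, fetch_pc: int, predictions: tuple) -> bool:
    """Hit only if the tag matches and the current branch predictions
    agree with the directions embedded in the trace."""
    return (line.tag == fetch_pc
            and predictions[:len(line.branch_flags)] == line.branch_flags)
```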
Trace Creation
▪ Starts on a trace cache miss
▪ Instructions are fetched up to the first predicted taken branch
▪ Instructions are collected, possibly from multiple basic blocks (when branches are predicted taken)
▪ Trace is terminated when either n instructions or m branches have been added
▪ Trace target/fall-through addresses are computed at the end
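The creation loop can be sketched as follows (`fetch_block` is a hypothetical helper returning the instructions up to and including the next branch, together with the address the predicted direction leads to; it is not part of the lecture):

```python
def build_trace(start_pc, fetch_block, n=16, m=3):
    """Collect a trace of at most n instructions and m branches,
    following predicted branch directions across basic blocks."""
    trace, branches, pc = [], 0, start_pc
    while len(trace) < n and branches < m:
        insts, next_pc = fetch_block(pc)   # block ends at a branch
        trace.extend(insts[: n - len(trace)])
        branches += 1
        pc = next_pc                       # continue on the predicted path
    return trace  # target / fall-through addresses are computed at the end
```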
▪ Processor can fetch up to 4 instructions per cycle
▪ I-cache lines contain eight 32-bit instructions; trace cache lines contain up to 24 instructions and 3 branches
Machine Code:
L1: I1 [ALU] … I5 [Cond. Br. to L3]
L2: I6 [ALU] … I12 [Jump to L4]
L3: I13 [ALU] … I18 [Cond. Br. to L5]
L4: I19 [ALU] … I24 [Cond. Br. to L1]
L5: …

Basic Blocks: B1 (I1–I5), B2 (I6–I12), B3 (I13–I18), B4 (I19–I24)

Layout in I-Cache (8 instructions per line):
… I1 I2 I3
I4 I5 I6 I7 I8 I9 I10 I11
I12 I13 I14 I15 I16 I17 I18 I19
I20 I21 I22 I23 I24 …
▪ Step 1: fetch I1–I3 (stop at end of line) → trace cache miss → start trace collection
▪ Step 2: fetch I4–I5 (possible I-cache miss) (stop at predicted taken branch)
▪ Step 3: fetch I13–I16 (possible I-cache miss)
▪ Step 4: fetch I17–I19 (I18 is a predicted not-taken branch; stop at end of line)
▪ Step 5: fetch I20–I23 (possible I-cache miss)
▪ Step 6: fetch I24 (stop at predicted taken branch)
▪ Step 7: fetch of I1–I4 replaced by trace cache access
Basic Blocks: B1 (I1–I5), B2 (I6–I12), B3 (I13–I18), B4 (I19–I24)

Layout in I-Cache:
… I1 I2 I3
I4 I5 I6 I7 I8 I9 I10 I11
I12 I13 I14 I15 I16 I17 I18 I19
I20 I21 I22 I23 I24 …

Common path: B1 → B3 → B4

Layout in Trace Cache:
I1 I2 I3 I4 I5 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24