CIS 371: Computer Organization and Design
Unit 9: Superscalar Pipelines
Slides developed by M. Martin, A. Roth, C.J. Taylor, and others at the University of Pennsylvania,
with sources that included University of Wisconsin slides.
A Key Theme: Parallelism
Previously: pipeline-level parallelism
Work on execute of one instruction in parallel with decode of next
Next: instruction-level parallelism (ILP)
Execute multiple independent instructions fully in parallel
Static & dynamic scheduling
Extract much more ILP
Data-level parallelism (DLP)
Single-instruction, multiple data (one insn., four 64-bit adds)
Thread-level parallelism (TLP)
Multiple software threads running on multiple cores
This Unit: (In-Order) Superscalar Pipelines
Idea of instruction-level parallelism
Superscalar hardware issues
Bypassing and register file
Stall logic
“Superscalar” vs VLIW/EPIC
System software
“Scalar” Pipeline & the Flynn Bottleneck
So far we have looked at scalar pipelines
One instruction per stage
With control speculation, bypassing, etc.
Performance limit (aka the “Flynn bottleneck”) is CPI = IPC = 1
Limit is never even achieved (hazards)
Diminishing returns from “super-pipelining” (hazards + overhead)
An Opportunity…
But consider:
ADD r1, r2 -> r3
ADD r4, r5 -> r6
Why not execute them at the same time? (We can!)
What about:
ADD r1, r2 -> r3
ADD r4, r3 -> r6
In this case, dependences prevent parallel execution
What about three instructions at a time?
Or four instructions at a time?
What Checking Is Required?
For two instructions: 2 checks
ADD src11, src21 -> dest1
ADD src12, src22 -> dest2 (2 checks)
For three instructions: 6 checks
ADD src11, src21 -> dest1
ADD src12, src22 -> dest2 (2 checks)
ADD src13, src23 -> dest3 (4 checks)
For four instructions: 12 checks
ADD src11, src21 -> dest1
ADD src12, src22 -> dest2 (2 checks)
ADD src13, src23 -> dest3 (4 checks)
ADD src14, src24 -> dest4 (6 checks)
Plus checking for load-to-use stalls from prior n loads
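The check counts above can be sketched in a few lines of Python (an illustration, not hardware): instruction i in an issue group must compare each of its two source registers against the destination of every earlier instruction in the group.

```python
# Sketch: count source-vs-destination cross-checks needed to issue a
# group of n instructions together.  Instruction i (0-based) compares
# its 2 sources against the destinations of all i earlier instructions.
def cross_checks(n):
    """Total dependence checks to co-issue n instructions (2 srcs each)."""
    return sum(2 * i for i in range(n))

for n in (2, 3, 4):
    print(n, cross_checks(n))  # 2 -> 2, 3 -> 6, 4 -> 12, as on the slide
```

The sum grows as roughly N² for an N-wide machine, which is the scaling problem discussed later in the unit.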
How do we build such “superscalar” hardware?
Multiple-Issue or “Superscalar” Pipeline
Overcome this limit using multiple issue
Also called superscalar
Two instructions per stage at once, or three, or four, or eight…
“Instruction-Level Parallelism (ILP)” [Fisher, IEEE TC’81]
Today, typically “4-ish-wide” (Intel Broadwell, AMD Ryzen)
Broadwell issues up to 8 in the right circumstances, Ryzen up to 6
ARM cores usually issue fewer
A Typical Dual-Issue Pipeline (1 of 2)
Fetch an entire 16B or 32B cache block
4 to 8 instructions (assuming 4-byte average instruction length)
Predict a single branch per cycle
Parallel decode
Need to check for conflicting instructions
Is the output register of I1 an input register of I2?
Other stalls, too (for example, load-use delay)
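A minimal software sketch of the decode-stage cross-check (illustrative only; register names and the tuple encoding are made up for the example): the younger instruction I2 cannot issue in the same cycle as I1 if I1 writes a register that I2 reads.

```python
# Hypothetical dual-issue cross-check: I2 may issue alongside I1 only
# if I1's destination register is not one of I2's source registers
# (a result cannot be bypassed within the same cycle).
def can_dual_issue(i1, i2):
    """Each instruction is (dest_reg, [src_regs])."""
    dest1, _ = i1
    _, srcs2 = i2
    return dest1 not in srcs2

print(can_dual_issue(("r3", ["r1", "r2"]), ("r6", ["r4", "r5"])))  # True: independent
print(can_dual_issue(("r3", ["r1", "r2"]), ("r6", ["r4", "r3"])))  # False: RAW on r3
```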
A Typical Dual-Issue Pipeline (2 of 2)
Multi-ported register file
Larger area, latency, power, cost, complexity
Multiple execution units
Simple adders are easy, but bypass paths are expensive
Memory unit
Single load per cycle (stall at decode) probably okay for dual issue
Alternative: add a read port to data cache
Larger area, latency, power, cost, complexity
How Much ILP is There?
The compiler tries to “schedule” code to avoid stalls
Even for scalar machines (to fill load-use delay slot)
Even harder to schedule multiple-issue (superscalar)
How much ILP is common?
Greatly depends on the application
Consider memory copy
Unroll loop, lots of independent operations
Other programs, less so
Even given unbounded ILP,
superscalar has implementation limits
IPC (or CPI) vs clock frequency trade-off
Given these challenges, what is reasonable today?
~4 instructions per cycle maximum
Superscalar Pipeline Diagrams – Ideal

scalar               1  2  3  4  5  6  7  8  9  10 11
lw 0(r1) -> r2       F  D  X  M  W
lw 4(r1) -> r3          F  D  X  M  W
lw 8(r1) -> r4             F  D  X  M  W
add r14,r15 -> r6             F  D  X  M  W
add r12,r13 -> r7                F  D  X  M  W
add r17,r16 -> r8                   F  D  X  M  W
lw 0(r18) -> r9                        F  D  X  M  W

2-way superscalar    1  2  3  4  5  6  7  8
lw 0(r1) -> r2       F  D  X  M  W
lw 4(r1) -> r3       F  D  X  M  W
lw 8(r1) -> r4          F  D  X  M  W
add r14,r15 -> r6       F  D  X  M  W
add r12,r13 -> r7          F  D  X  M  W
add r17,r16 -> r8          F  D  X  M  W
lw 0(r18) -> r9               F  D  X  M  W

Like Noah’s ark, the instructions come in two by two.
CIS 371: Comp. Org & Design | Dr. | Superscalar
Superscalar Pipeline Diagrams – Realistic
(d* = stall cycle caused by a data dependence)

scalar             1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) -> r2     F  D  X  M  W
lw 4(r1) -> r3        F  D  X  M  W
lw 8(r1) -> r4           F  D  X  M  W
add r4,r5 -> r6             F  D  d* X  M  W
add r2,r3 -> r7                F  d* D  X  M  W
add r7,r6 -> r8                      F  D  X  M  W
lw 4(r8) -> r9                          F  D  X  M  W

2-way superscalar  1  2  3  4  5  6  7  8  9  10
lw 0(r1) -> r2     F  D  X  M  W
lw 4(r1) -> r3     F  D  X  M  W
lw 8(r1) -> r4        F  D  X  M  W
add r4,r5 -> r6       F  D  d* d* X  M  W
add r2,r3 -> r7          F  D  d* X  M  W
add r7,r6 -> r8          F  d* D  X  M  W
lw 4(r8) -> r9              F  d* d* D  X  M  W
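The stall behavior in diagrams like these can be approximated with a toy in-order scheduler (an illustrative sketch, not the processor's actual issue logic; it assumes full ALU bypassing, a one-cycle load-to-use penalty, and a made-up tuple encoding of instructions).

```python
# Toy in-order issue model: each instruction issues at the earliest
# cycle that respects (a) the machine width and (b) its producers,
# where a load's consumer must wait one extra cycle vs. an ALU op.
def schedule(insns, width):
    """insns: list of (dest, srcs, is_load). Returns issue cycle per insn."""
    ready = {}                      # register -> first cycle a consumer may issue
    issue_cycles = []
    cycle, issued_this_cycle = 0, 0
    for dest, srcs, is_load in insns:
        earliest = max([cycle] + [ready.get(s, 0) for s in srcs])
        if earliest > cycle or issued_this_cycle == width:
            cycle, issued_this_cycle = max(earliest, cycle + 1), 0
        issue_cycles.append(cycle)
        # loads deliver their result one cycle later than ALU ops
        ready[dest] = cycle + (2 if is_load else 1)
        issued_this_cycle += 1
    return issue_cycles

# The instruction sequence from the realistic diagram above:
prog = [("r2", ["r1"], True), ("r3", ["r1"], True), ("r4", ["r1"], True),
        ("r6", ["r4", "r5"], False), ("r7", ["r2", "r3"], False),
        ("r8", ["r7", "r6"], False), ("r9", ["r8"], True)]
print(schedule(prog, 2))   # [0, 0, 1, 3, 3, 4, 5]
print(schedule(prog, 1))   # [0, 1, 2, 4, 5, 6, 7]
```

Note how going from 1-wide to 2-wide shortens the schedule far less than 2x: the dependences, not the width, set the limit.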
Superscalar Implementation Challenges
Superscalar Challenges – Front End
Superscalar instruction fetch
Modest: fetch multiple instructions per cycle
Aggressive: buffer instructions and/or predict multiple branches
Superscalar instruction decode
Replicate decoders
Superscalar instruction issue
Determine when instructions can proceed in parallel
More complex stall logic – order N² for an N-wide machine
Not all combinations of types of instructions possible
Superscalar register read
Port for each register read (4-wide superscalar -> 8 read “ports”)
Each port needs its own set of address and data wires
Latency & area grow roughly with (# ports)²
Superscalar Challenges – Back End
Superscalar instruction execution
Replicate arithmetic units (but not all, for example, integer divider)
Perhaps multiple cache ports (slower access, higher energy)
Only for 4-wide or larger (why? only ~35% are load/store insn)
Superscalar bypass paths
More possible sources for data values
Order N² × P for an N-wide machine with execute pipeline depth P
Superscalar instruction register writeback
One write port per instruction that writes a register
Example: 4-wide superscalar -> 4 write ports
Fundamental challenge:
Amount of ILP (instruction-level parallelism) in the program
Compiler must schedule code and extract parallelism
Superscalar Bypass
N² bypass network
N+1 input muxes at each ALU input
N² point-to-point connections
Routing lengthens wires
Heavy capacitive load
And this is just one bypass stage (MX)!
There is also WX bypassing
Even more for deeper pipelines
One of the big problems of superscalar
Why? On the critical path of
single-cycle “bypass & execute” loop
Not All N² Created Equal
N² bypass vs. N² stall logic & dependence cross-check
Which is the bigger problem?
N² bypass … by far
64-bit quantities (vs. 5-bit register specifiers)
Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
Must fit in one clock period with the ALU (vs. not)
Dependence cross-check is not even the 2nd biggest N² problem
Regfile is also an N² problem (think latency, where N is # ports)
And also more serious than cross-check
As it turns out, delay scales roughly as the square of the number of ports, because of the layout and wiring constraints involved in the routing.
Mitigating N² Bypass & Register File
Clustering: mitigates N2 bypass
Group ALUs into K clusters
Full bypassing within a cluster
Limited bypassing between clusters
With 1 or 2 cycle delay
Can hurt IPC, but faster clock
(N/K) + 1 inputs at each mux
(N/K)² bypass paths in each cluster
Steering: key to performance
Steer dependent insns to same cluster
Cluster register file, too
Replicate a register file per cluster
All register writes update all replicas
Fewer read ports; only for cluster
Mitigating N² RegFile: Clustering++
Clustering: split N-wide execution pipeline into K clusters
With centralized register file, 2N read ports and N write ports
Clustered register file: extend clustering to register file
Replicate the register file (one replica per cluster)
Register file supplies register operands to just its cluster
All register writes go to all register files (keep them in sync)
Advantage: fewer read ports per register!
K register files, each with 2N/K read ports and N write ports
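A back-of-envelope count (assumption-laden sketch, not a circuit model) shows why clustering helps: a flat N-wide machine needs roughly N² full-speed bypass paths per stage, while K clusters of width N/K need only K × (N/K)².

```python
# Rough bypass-path counts per bypass stage (illustrative only):
# a flat N-wide datapath is fully connected, N producers x N consumers;
# clustering keeps full-speed paths only within each cluster.
def flat_paths(n):
    return n * n

def clustered_paths(n, k):
    return k * (n // k) ** 2

print(flat_paths(8))          # 64: fully connected 8-wide machine
print(clustered_paths(8, 2))  # 32: two clusters of four
print(clustered_paths(8, 4))  # 16: four clusters of two
```

The saved paths are not free: inter-cluster bypasses take 1-2 extra cycles, trading some IPC for a faster clock.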
Another Challenge: Superscalar Fetch
What is involved in fetching multiple instructions per cycle?
In the same cache block? No problem
64-byte cache block is 16 instructions (~4 bytes per instruction)
Favors larger block size (independent of hit rate)
What if next instruction is last instruction in a block?
Fetch only one instruction that cycle
Or, some processors may allow fetching from 2 consecutive blocks
What about taken branches?
How many instructions can be fetched on average?
Average number of instructions per taken branch?
Assume 20% branches, 50% taken -> ~10 instructions per taken branch
Consider a 5-instruction loop on a 4-issue processor
Without smarter fetch, throughput is limited to 2.5 instructions/cycle (not 4)
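The fetch-rate arithmetic above can be sketched as follows (illustration only: it assumes fetch stops at every taken branch and the loop branch is always taken).

```python
# If fetch ends at each taken branch, a tight loop of L instructions
# on a W-wide machine needs ceil(L / W) fetch cycles per iteration.
from math import ceil

def fetch_rate(loop_len, width):
    """Average instructions fetched per cycle for a tight loop."""
    return loop_len / ceil(loop_len / width)

print(fetch_rate(5, 4))   # 2.5: the slide's 5-insn loop on a 4-wide machine
print(fetch_rate(8, 4))   # 4.0: loop length a multiple of the fetch width
```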
Increasing Superscalar Fetch Rate
Option #1: over-fetch and buffer
Add a queue between fetch and decode (18 entries in Intel Core2)
Compensates for cycles that fetch less than maximum instructions
“decouples” the “front end” (fetch) from the “back end” (execute)
Option #2: “loop stream detector” (Core 2, Core i7)
Put entire loop body into a small cache
Core2: 18 macro-ops, up to four taken branches
Core i7: 28 micro-ops (avoids re-decoding macro-ops!)
Any branch mis-prediction requires normal re-fetch
Other options: next-next-block prediction, “trace cache”
(figure: an instruction queue sits between fetch and decode; it also serves as the loop stream detector)
Multiple-Issue Implementations
Statically-scheduled (in-order) superscalar
What we’ve talked about thus far
Executes unmodified sequential programs
Hardware must figure out what can be done in parallel
E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
Very Long Instruction Word (VLIW)
Compiler identifies independent instructions, new ISA
Hardware can be simple and perhaps lower power
E.g., TransMeta Crusoe (4-wide), most DSPs
Variant: Explicitly Parallel Instruction Computing (EPIC)
A bit more flexible encoding & some hardware to help compiler
E.g., Intel Itanium (6-wide)
Dynamically-scheduled superscalar (next topic)
Hardware extracts more ILP by on-the-fly reordering
Intel Atom/Core/Xeon, AMD Opteron/Ryzen, some ARM A-series
Itanium: 3 instructions in each “bundle”; all elements of a bundle go through together.
Trends in Single-Processor Multiple Issue
Issue width has saturated at 4-6 for high-performance cores
Canceled Alpha 21464 was 8-way issue
Not enough ILP to justify going to wider issue
Hardware or compiler scheduling needed to exploit 4-6 effectively
More on this in the next unit
For high-performance per watt cores (say, smart phones)
Typically 2-wide superscalar (but increasing each generation)
        486   Pentium  PentiumII  Pentium4  Itanium  ItaniumII  Core2
Year    1989  1993     1998       2001      2002     2004       2006
Width   1     2        3          3         3        6          4
Multiple Issue Redux
Multiple issue
Exploits insn level parallelism (ILP) beyond pipelining
Improves IPC, but perhaps at some clock & energy penalty
4-6 way issue is about the peak issue width currently justifiable
Low-power implementations today typically 2-wide superscalar
Problem spots
N² bypass & register file -> clustering
Fetch + branch prediction -> buffering, loop streaming, trace cache
N² dependence check -> VLIW/EPIC (but unclear how key this is)
Implementations
Superscalar vs. VLIW/EPIC
This Unit: (In-Order) Superscalar Pipelines
Idea of instruction-level parallelism
Superscalar hardware issues
Bypassing and register file
Stall logic
“Superscalar” vs VLIW/EPIC
System software
Trends in Single-Processor Multiple Issue
Issue width has saturated at 4-6 for high-performance cores
Canceled Alpha 21464 was 8-way issue
No justification for going wider
Hardware or compiler “scheduling” needed to exploit 4-6 effectively
For high-performance per watt cores, issue width is ~2
Advanced scheduling techniques not needed
Multi-threading (a little later) helps cope with cache misses
        486   Pentium  PentiumII  Pentium4  Itanium  ItaniumII  Core2
Year    1989  1993     1998       2001      2002     2004       2006
Width   1     2        3          3         3        6          4
Impact of Branch Prediction
Base CPI for scalar pipeline is 1
Base CPI for N-way superscalar pipeline is 1/N
Amplifies stall penalties
Assumes no data stalls (an overly optimistic assumption)
Example: Branch penalty calculation
20% branches, 75% taken, 2 cycle penalty, no branch prediction
Scalar pipeline
1 + 0.2×0.75×2 = 1.3 -> 1.3/1 = 1.3 -> 30% slowdown
2-way superscalar pipeline
0.5 + 0.2×0.75×2 = 0.8 -> 0.8/0.5 = 1.6 -> 60% slowdown
4-way superscalar pipeline
0.25 + 0.2×0.75×2 = 0.55 -> 0.55/0.25 = 2.2 -> 120% slowdown
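The slide's arithmetic can be reproduced in a small model (a sketch using the slide's own numbers: 20% branches, 75% taken, 2-cycle penalty, no data stalls).

```python
# Branch-penalty CPI model: base CPI for an ideal N-wide superscalar
# is 1/N; every taken branch adds the full penalty on top of that,
# so the *relative* slowdown grows with the issue width.
def cpi(width, br_frac=0.2, taken_frac=0.75, penalty=2):
    base = 1.0 / width
    return base + br_frac * taken_frac * penalty

for w in (1, 2, 4):
    c = cpi(w)
    print(w, round(c, 2), round(c / (1.0 / w), 2))  # width, CPI, slowdown
```

This is why branch prediction matters more, not less, as machines get wider.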
Predication
Branch mis-predictions hurt more on superscalar
Replace difficult branches with something else…
Convert control flow into data flow (& dependencies)
Helps hard-to-predict branches (but can hurt predictable branches)
Predication
Conditionally executed insns unconditionally fetched
Full predication (ARM, Intel Itanium)
Can tag every insn with predicate, but extra bits in instruction
Conditional moves (Alpha, x86)
Construct appearance of full predication from one primitive
cmoveq r1,r2,r3 // if (r1==0) r3=r2;
May require some code duplication to achieve desired effect
Doesn’t handle conditional memory operations
Only good way of adding predication to an existing ISA
If-conversion: replacing control with predication
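A software analogue of if-conversion (illustration only; Python stands in for the hardware idea): both versions below compute the same result, but the second replaces control flow with a cmov-style select, mirroring the cmoveq example above.

```python
# Branchy version: control flow depends on r1 (a mispredictable branch).
def branchy(r1, r2, r3):
    if r1 == 0:
        r3 = r2
    return r3

# If-converted version: a straight-line select, like cmoveq r1,r2,r3.
def cmov_style(r1, r2, r3):
    return r2 if r1 == 0 else r3

print(branchy(0, 7, 9), cmov_style(0, 7, 9))   # 7 7
print(branchy(1, 7, 9), cmov_style(1, 7, 9))   # 9 9
```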
Note: predication hurts performance when the branch would have been predicted correctly
Predication If-Conversion Example

Machine code with a branch:
0: ldf Y(r1),f2
1: fbne f2,4
2: ldf W(r1),f2
4: stf f0,Y(r1)
5: ldf X(r1),f4
6: mulf f4,f2,f6
7: stf f6,Z(r1)

If-converted (predicated) machine code:
0: ldf Y(r1),f2
1: fspne f2,p1
2: ldf.p p1,W(r1),f2
4: stf.np p1,f0,Y(r1)
5: ldf X(r1),f4
6: mulf f4,f2,f6
7: stf f6,Z(r1)

Using Predication

Source code:
if (A == 0)
  Z[i] = A*X[i];

Machine code:
0: ldf Y(r1),f2
1: fbne f2,4
2: ldf W(r1),f2
4: stf f0,Y(r1)
5: ldf X(r1),f4
6: mulf f4,f2,f6
7: stf f6,Z(r1)
ISA Support for Predication
Itanium: change branch 1 to set-predicate insn fspne
Change insns 2 and 4 to predicated insns
ldf.p performs ldf if predicate p1 is true
stf.np performs stf if predicate p1 is false
0: ldf Y(r1),f2
1: fspne f2,p1
2: ldf.p p1,W(r1),f2
4: stf.np p1,f0,Y(r1)
5: ldf X(r1),f4
6: mulf f4,f2,f6
7: stf f6,Z(r1)
CMOV Predication Example
x86 only has a “CMOV” instruction
Note: in x86’s CMOV, any “load” part is non-conditional
Small change in the code helps the compiler optimize
int func(int a, int b, int* array)
  if (a > 0) {
    return array[b];

func:   testl  %edi, %edi
        movslq %esi, %rax
        movl   (%rdx,%rax,4), %esi
.L2:    movl   %esi, %eax

int func2(int a, int b, int* array)
  int temp = array[b];
  if (a > 0) {
    return temp;

func2:  movslq %esi, %rax
        testl  %edi, %edi
        cmovle (%rdx,%rax,4), %esi
        movl   %esi, %eax
Another CMOV Example (Part I)
Same with and without the -fno-if-conversion flag!
tree_t* search(tree_t* t, int key)
{
  while (t != NULL) {
    if (t->value == key) {
      return t;
    }
    if (t->value > key) {
      t = t->right_ptr;
    } else {
      t = t->left_ptr;
    }
  }
  return NULL;
}

Compiler output (excerpt):
        cmpl  %esi, (%rdi)
        movq  8(%rdi), %rdi
        movq  16(%rdi), %rdi
        testq %rdi, %rdi

gcc -Os -fno-if-conversion
Another CMOV Example (Part II)
Similar assembly to before (-fno-if-conversion)
Does reduce taken branches
tree_t* search(tree_t* t, int key)
{
  while (t != NULL) {
    if (t->value == key) {
      return t;
    }
    tree_t* right = t->right_ptr;
    tree_t* left = t->left_ptr;
    if (t->value > key) {
      t = right;
    } else {
      t = left;
    }
  }
  return NULL;
}

Compiler output (excerpt):
        cmpl  %esi, (%rdi)
        movq  8(%rdi), %rax
        movq  16(%rdi), %rdi
        movq  %rax, %rdi
        testq %rdi, %rdi

gcc -Os -fno-if-conversion
Another CMOV Example (Part III)
Now, with -fif-conversion (enabled by default)
Uses CMOV to avoid branch misprediction
tree_t* search(tree_t* t, int key)
{
  while (t != NULL) {
    if (t->value == key) {
      return t;
    }
    tree_t* right = t->right_ptr;
    tree_t* left = t->left_ptr;
    if (t->value > key) {
      t = right;
    } else {
      t = left;
    }
  }
  return NULL;
}

Compiler output (excerpt):
        cmpl   %esi, (%rdi)
        movq   16(%rdi), %rax
        movq   8(%rdi), %rdi
        cmovle %rax, %rdi
        testq  %rdi, %rdi
Predication Performance
Cost/benefit analysis
Benefit: predication avoids branches
Thus avoiding mis-predictions
Also reduces pressure on predictor table (fewer branches to track)
Cost: extra instructions (fetched, but not actually executed)
As branch predictors are highly accurate…
Might not help:
5-stage pipeline, two instructions on each path of an if-then-else
No performance gain, likely slower if branch predictable
Or even hurt!
But can help:
Deeper pipelines, hard-to-predict branches, and few added insn
Thus, predication is useful, but not a panacea
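The cost/benefit reasoning above can be put into a rough expected-cost model (every number below is an illustrative assumption, not a measurement): keep the branch when its expected misprediction cost is below the fixed cost of fetching the extra predicated instructions.

```python
# Expected cycles lost per execution of the branch, two ways:
# with a branch, we pay the penalty only on a misprediction;
# with predication, we always fetch the extra (squashed) instructions.
def branch_cost(mispredict_rate, penalty):
    return mispredict_rate * penalty

def predication_cost(extra_insns, width):
    return extra_insns / width   # extra fetch slots, in cycles

# Predictable branch (2% mispredicts, 10-cycle penalty): keep the branch.
print(branch_cost(0.02, 10), predication_cost(2, 4))   # branch is cheaper
# Coin-flip branch (40% mispredicts): predication wins easily.
print(branch_cost(0.40, 10), predication_cost(2, 4))
```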
Textbook (MA:FSPTCM)
Sections 3.1, 3.2 (but not “Sidebar” in 3.2), 3.5.1
Sections 4.2, 4.3, 5.3.3