CIS 371: Comp. Org & Design | Dr. | Superscalar 1

Computer Organization and Design
Unit 9: Superscalar Pipelines
Slides developed by M. Martin, A. Roth, C.J. Taylor and at the University of Pennsylvania
with sources that included University of Wisconsin slides

by , , , and .

A Key Theme: Parallelism
• Previously: pipeline-level parallelism
• Work on execute of one instruction in parallel with decode of next
• Next: instruction-level parallelism (ILP)
• Execute multiple independent instructions fully in parallel
• Static & dynamic scheduling
• Extract much more ILP
• Data-level parallelism (DLP)
• Single-instruction, multiple data (one insn., four 64-bit adds)
• Thread-level parallelism (TLP)
• Multiple software threads running on multiple cores

This Unit: (In-Order) Superscalar Pipelines
• Idea of instruction-level parallelism
• Superscalar hardware issues
• Bypassing and register file
• Stall logic
• “Superscalar” vs VLIW/EPIC

“Scalar” Pipeline & the
• So far we have looked at scalar pipelines
• One instruction per stage
• With control speculation, bypassing, etc.
– Performance limit (aka “ ”) is CPI = IPC = 1
– Limit is never even achieved (hazards)
– Diminishing returns from “super-pipelining” (hazards + overhead)

An Opportunity…
• But consider:
ADD r1, r2 -> r3
ADD r4, r5 -> r6
• Why not execute them at the same time? (We can!)
• What about:
ADD r1, r2 -> r3
ADD r4, r3 -> r6
• In this case, dependences prevent parallel execution
• What about three instructions at a time?
• Or four instructions at a time?
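The two ADD pairs above differ only in whether the second instruction reads the register the first one writes. A minimal sketch of that check follows; the `Insn` tuple and `can_dual_issue` are hypothetical names for illustration, and the check covers only this read-after-write case (no load-use, structural, or write-after-write conditions).

```python
from typing import NamedTuple


class Insn(NamedTuple):
    """Illustrative three-register form: op src1, src2 -> dest."""
    op: str
    src1: str
    src2: str
    dest: str


def can_dual_issue(first: Insn, second: Insn) -> bool:
    """True if `second` reads nothing that `first` writes (no RAW dependence)."""
    return first.dest not in (second.src1, second.src2)


# The two pairs from the slide.
independent = (Insn("ADD", "r1", "r2", "r3"), Insn("ADD", "r4", "r5", "r6"))
dependent = (Insn("ADD", "r1", "r2", "r3"), Insn("ADD", "r4", "r3", "r6"))

print(can_dual_issue(*independent))  # True: no shared registers
print(can_dual_issue(*dependent))    # False: second ADD reads r3
```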

What Checking Is Required?
• For two instructions: 2 checks
ADD src11, src21 -> dest1
ADD src12, src22 -> dest2 (2 checks)
• For three instructions: 6 checks
ADD src11, src21 -> dest1
ADD src12, src22 -> dest2 (2 checks)
ADD src13, src23 -> dest3 (4 checks)
• For four instructions: 12 checks
ADD src11, src21 -> dest1
ADD src12, src22 -> dest2 (2 checks)
ADD src13, src23 -> dest3 (4 checks)
ADD src14, src24 -> dest4 (6 checks)
• Plus checking for load-to-use stalls from prior n loads
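The counts above follow a simple pattern: each instruction's two sources are compared against the destination of every earlier instruction in the group. A short sketch of that arithmetic, assuming two source registers per instruction and ignoring the extra load-to-use checks (`cross_checks` is a name made up for this example):

```python
def cross_checks(width: int) -> int:
    """Source-vs-earlier-destination comparisons within one issue group.

    Instruction i (0-based) has two sources, each compared against the
    destinations of the i earlier instructions: 2 * i checks.
    """
    return sum(2 * i for i in range(width))


for n in (2, 3, 4, 8):
    # The sum works out to n * (n - 1), which is why this logic is "order N^2".
    print(n, cross_checks(n), n * (n - 1))
```

For widths 2, 3, and 4 this gives 2, 6, and 12 checks, matching the slide.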

How do we build such “superscalar” hardware?

Multiple-Issue or “Superscalar” Pipeline
• Overcome this limit using multiple issue
• Also called superscalar
• Two instructions per stage at once, or three, or four, or eight…
• “Instruction-Level Parallelism (ILP)” [Fisher, IEEE TC’81]
• Today, typically “4-ish-wide” (Intel Broadwell, AMD Ryzen)
• Broadwell issues up to 8 in the right circumstances, Ryzen up to 6
• ARM cores usually issue fewer

A Typical Dual-Issue Pipeline (1 of 2)
• Fetch an entire 16B or 32B cache block
• 4 to 8 instructions (assuming 4-byte average instruction length)
• Predict a single branch per cycle
• Parallel decode
• Need to check for conflicting instructions
• Is the output register of I1 an input register to I2?
• Other stalls, too (for example, load-use delay)

A Typical Dual-Issue Pipeline (2 of 2)
• Multi-ported register file
• Larger area, latency, power, cost, complexity
• Multiple execution units
• Simple adders are easy, but bypass paths are expensive
• Memory unit
• Single load per cycle (stall at decode) probably okay for dual issue
• Alternative: add a read port to data cache
• Larger area, latency, power, cost, complexity

How Much ILP is There?
• The compiler tries to “schedule” code to avoid stalls
• Even for scalar machines (to fill load-use delay slot)
• Even harder to schedule multiple-issue (superscalar)
• How much ILP is common?
• Greatly depends on the application
• Consider memory copy
• Unroll loop, lots of independent operations
• Other programs, less so
• Even given unbounded ILP, superscalar has implementation limits
• IPC (or CPI) vs clock frequency trade-off
• Given these challenges, what is reasonable today?
• ~4 instructions per cycle maximum

Superscalar Pipeline Diagrams – Ideal
scalar            1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3          F  D  X  M  W
lw 8(r1)→r4             F  D  X  M  W
add r14,r15→r6             F  D  X  M  W
add r12,r13→r7                F  D  X  M  W
add r17,r16→r8                   F  D  X  M  W
lw 0(r18)→r9                        F  D  X  M  W

2-way superscalar 1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3       F  D  X  M  W
lw 8(r1)→r4          F  D  X  M  W
add r14,r15→r6       F  D  X  M  W
add r12,r13→r7          F  D  X  M  W
add r17,r16→r8          F  D  X  M  W
lw 0(r18)→r9               F  D  X  M  W
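In the ideal case nothing stalls, so the length of each diagram is plain arithmetic: the last issue group enters fetch in cycle ceil(instructions / width) and needs the full pipeline depth to finish. A small sketch, assuming the five-stage F/D/X/M/W pipeline used in these diagrams (`ideal_cycles` is a hypothetical helper name):

```python
import math


def ideal_cycles(num_insns: int, width: int, depth: int = 5) -> int:
    """Cycle in which the last instruction leaves the pipeline, with no stalls."""
    groups = math.ceil(num_insns / width)  # fetch cycles needed
    return depth + groups - 1


# The seven-instruction sequence from the slide.
print(ideal_cycles(7, width=1))  # 11 cycles on the scalar pipeline
print(ideal_cycles(7, width=2))  # 8 cycles on the 2-way superscalar
```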

Superscalar Pipeline Diagrams – Realistic
Instruction sequence (same for both diagrams):
lw 0(r1)→r2
lw 4(r1)→r3
lw 8(r1)→r4
add r4,r5→r6
add r2,r3→r7
add r7,r6→r8
lw 4(r8)→r9
[Figure: scalar and 2-way superscalar pipeline diagrams for this sequence, cycles 1–12; d* marks a stall cycle]

Superscalar Implementation Challenges

Superscalar Challenges – Front End
• Superscalar instruction fetch
• Modest: fetch multiple instructions per cycle
• Aggressive: buffer instructions and/or predict multiple branches
• Superscalar instruction decode
• Replicate decoders
• Superscalar instruction issue
• Determine when instructions can proceed in parallel
• More complex stall logic – order N² for N-wide machine
• Not all combinations of types of instructions possible
• Superscalar register read
• Port for each register read (4-wide superscalar → 8 read “ports”)
• Each port needs its own set of address and data wires
• Latency & area ∝ #ports²

Superscalar Challenges – Back End
• Superscalar instruction execution
• Replicate arithmetic units (but not all, for example, integer divider)
• Perhaps multiple cache ports (slower access, higher energy)
• Only for 4-wide or larger (why? only ~35% are load/store insn)
• Superscalar bypass paths
• More possible sources for data values
• Order (N² × P) for N-wide machine with execute pipeline depth P
• Superscalar instruction register writeback
• One write port per instruction that writes a register
• Example, 4-wide superscalar → 4 write ports
• Fundamental challenge:
• Amount of ILP (instruction-level parallelism) in the program
• Compiler must schedule code and extract parallelism

Superscalar Bypass
• N² bypass network
– N+1 input muxes at each ALU input
– N² point-to-point connections
– Routing lengthens wires
– Heavy capacitive load
• And this is just one bypass stage (MX)!
• There is also WX bypassing
• Even more for deeper pipelines
• One of the big problems of superscalar
• Why? On the critical path of single-cycle “bypass & execute” loop

Not All N² Created Equal
• N² bypass vs. N² stall logic & dependence cross-check
• Which is the bigger problem?
• N² bypass … by far
• 64-bit quantities (vs. 5-bit)
• Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
• Must fit in one clock period with ALU (vs. not)
• Dependence cross-check not even 2nd biggest N² problem
• Regfile is also an N² problem (think latency where N is #ports)
• And also more serious than cross-check

Mitigating N² Bypass & Register File
• Clustering: mitigates N² bypass
• Group ALUs into K clusters
• Full bypassing within a cluster
• Limited bypassing between clusters
• With 1 or 2 cycle delay
• Can hurt IPC, but faster clock
• (N/K) + 1 inputs at each mux
• (N/K)² bypass paths in each cluster
• Steering: key to performance
• Steer dependent insns to same cluster
• Cluster register file, too
• Replicate a register file per cluster
• All register writes update all replicas
• Fewer read ports; only for cluster
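To see what clustering buys, the mux and path counts from this slide can be tabulated for one bypass stage. This is a rough sketch only: `bypass_cost` is a made-up helper, it assumes K divides N evenly, and it leaves out the limited inter-cluster paths and the extra WX stage.

```python
def bypass_cost(n: int, k: int = 1) -> dict:
    """Per-stage bypass cost for an n-wide machine split into k equal clusters.

    k = 1 models the unclustered full bypass network.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    per_cluster = n // k
    return {
        "mux_inputs": per_cluster + 1,               # (N/K) + 1 inputs at each mux
        "paths_per_cluster": per_cluster ** 2,       # (N/K)^2 bypass paths
        "total_intra_paths": k * per_cluster ** 2,   # summed over all clusters
    }


print(bypass_cost(4))       # unclustered 4-wide: 5-input muxes, 16 paths
print(bypass_cost(4, k=2))  # two 2-wide clusters: 3-input muxes, 4 paths each
```

The smaller muxes and shorter wires are what allow the faster clock; the cost is the delayed inter-cluster forwarding that this sketch does not model.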

Mitigating N² RegFile: Clustering++
• Clustering: split N-wide execution pipeline into K clusters
• With centralized register file, 2N read ports and N write ports
• Clustered register file: extend clustering to register file
• Replicate the register file (one replica per cluster)
• Register file supplies register operands to just its cluster
• All register writes go to all register files (keep them in sync)
• Advantage: fewer read ports per register!
• K register files, each with 2N/K read ports and N write ports
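The port counts above can be worked through for a concrete width. The sketch below assumes two source operands per instruction and K dividing N; the `relative_cost` field simply squares the port count, following the "latency & area ∝ #ports²" rule of thumb from the earlier slide, and is a proxy rather than a real area model. The function name is hypothetical.

```python
def regfile_ports(n: int, k: int = 1) -> dict:
    """Ports on each register file replica for an n-wide machine with k clusters.

    k = 1 is the centralized register file.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    read_ports = 2 * n // k   # each replica feeds only its own cluster
    write_ports = n           # every write goes to every replica
    total = read_ports + write_ports
    return {
        "read_ports": read_ports,
        "write_ports": write_ports,
        "relative_cost": total ** 2,  # rough proxy: cost grows as ports squared
    }


print(regfile_ports(4))       # centralized: 8 read, 4 write
print(regfile_ports(4, k=2))  # per replica: 4 read, 4 write
```

Note that the write ports do not shrink with K, which limits how far replication alone can reduce per-file cost.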

Another Challenge: Superscalar Fetch
• What is involved in fetching multiple instructions per cycle?
• In same cache block? → no problem
• 64-byte cache block is 16 instructions (~4 bytes per instruction)
• Favors larger block size (independent of hit rate)
• What if next instruction is last instruction in a block?
• Fetch only one instruction that cycle
• Or, some processors may allow fetching from 2 consecutive blocks
• What about taken branches?
• How many instructions can be fetched on average?
• Average number of instructions per taken branch?
• Assume: 20% branches, 50% taken → ~10 instructions
• Consider a 5-instruction loop with a 4-issue processor
• Without smarter fetch, ILP is limited to 2.5 (5 instructions per 2 fetch cycles; not 4, which is bad)
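Both numbers on this slide come from short calculations. The sketch below assumes that fetch stops at a taken branch and restarts at the loop top in the next cycle, with no fetch buffering; the function names are made up for the example.

```python
import math


def insns_per_taken_branch(branch_frac: float, taken_frac: float) -> float:
    """Average run of instructions between taken branches."""
    return 1.0 / (branch_frac * taken_frac)


def loop_fetch_rate(loop_len: int, width: int) -> float:
    """Average instructions fetched per cycle for a loop ending in a taken branch.

    Each iteration needs ceil(loop_len / width) fetch cycles because the
    cycle containing the taken branch cannot fetch past it.
    """
    cycles_per_iteration = math.ceil(loop_len / width)
    return loop_len / cycles_per_iteration


print(round(insns_per_taken_branch(0.20, 0.50)))  # about 10 instructions
print(loop_fetch_rate(5, 4))                      # 2.5 instructions per cycle
```

The 5-instruction loop fetches 4 instructions, then 1, so a 4-wide machine averages only 2.5 per cycle.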

Increasing Superscalar
• Option #1: over-fetch and buffer
• Add a queue between fetch and decode (18 entries in Intel Core 2)
• Compensates for cycles that fetch fewer than the maximum number of instructions
• “decouples” the “front end” (fetch) from the “back end” (execute)
• Option #2: “loop stream detector” (Core 2, Core i7)
• Put entire loop body into a small cache
• Core 2: 18 macro-ops, up to four taken branches
• Core i7: 28 micro-ops (avoids re-decoding macro-ops!)
• Any branch mis-prediction requires normal re-fetch
• Other options: next-next-block prediction, “trace cache”

Multiple-Issue Implementations
• Statically-scheduled (in-order) superscalar
• What we’ve talked about thus far
+ Executes unmodified sequential programs
– Hardware must figure out what can be done in parallel
• E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
• Very Long Instruction Word (VLIW)
– Compiler identifies independent instructions, new ISA
+ Hardware can be simple and perhaps lower power
• E.g., TransMeta Crusoe (4-wide), most DSPs
• Variant: Explicitly Parallel Instruction Computing (EPIC)
• A bit more flexible encoding & some hardware to help compiler
• E.g., Intel Itanium (6-wide)
• Dynamically-scheduled superscalar (next topic)
• Hardware extracts more ILP by on-the-fly reordering
• Intel Atom/Core/Xeon, AMD Opteron/Ryzen, some ARM A-series

Trends in Single-Processor Multiple Issue
• Issue width has saturated at 4-6 for high-performance cores
• Canceled Alpha 21464 was 8-way issue
• Not enough ILP to justify going to wider issue
• Hardware or compiler scheduling needed to exploit 4-6 effectively
• More on this in the next unit
• For high performance-per-watt cores (say, smart phones)
• Typically 2-wide superscalar (but increasing each generation)

Multiple Issue Redux
• Multiple issue
• Exploits insn level parallelism (ILP) beyond pipelining
• Improves IPC, but perhaps at some clock & energy penalty
• 4-6 way issue is about the peak issue width currently justifiable
• Low-power implementations today typically 2-wide superscalar
• Problem spots
• N² bypass & register file → clustering
• Fetch + branch prediction → buffering, loop streaming, trace cache
• N² dependency check → VLIW/EPIC (but unclear how key this is)
• Implementations
• Superscalar vs. VLIW/EPIC