CIS 371: Comp. Org & Design | Dr. | Superscalar 1

Computer Organization and Design
Unit 9: Superscalar Pipelines
Slides developed by M. Martin, A. Roth, C.J. Taylor and at the University of Pennsylvania
with sources that included University of Wisconsin slides


A Key Theme: Parallelism
• Previously: pipeline-level parallelism
• Work on execute of one instruction in parallel with decode of next
• Next: instruction-level parallelism (ILP)
• Execute multiple independent instructions fully in parallel
• Static & dynamic scheduling
• Extract much more ILP
• Data-level parallelism (DLP)
• Single-instruction, multiple data (one insn., four 64-bit adds)
• Thread-level parallelism (TLP)
• Multiple software threads running on multiple cores

This Unit: (In-Order) Superscalar Pipelines
• Idea of instruction-level parallelism
• Superscalar hardware issues
• Bypassing and register file
• Stall logic
• “Superscalar” vs VLIW/EPIC

“Scalar” Pipeline & the Flynn Bottleneck
• So far we have looked at scalar pipelines
• One instruction per stage
• With control speculation, bypassing, etc.
– Performance limit (aka “Flynn Bottleneck”) is CPI = IPC = 1
– Limit is never even achieved (hazards)
– Diminishing returns from “super-pipelining” (hazards + overhead)

An Opportunity…
• But consider:
ADD r1, r2 -> r3
ADD r4, r5 -> r6
• Why not execute them at the same time? (We can!)
• What about:
ADD r1, r2 -> r3
ADD r4, r3 -> r6
• In this case, dependences prevent parallel execution
• What about three instructions at a time?
• Or four instructions at a time?

What Checking Is Required?
• For two instructions: 2 checks
ADD src1_1, src2_1 -> dest_1
ADD src1_2, src2_2 -> dest_2 (2 checks)
• For three instructions: 6 checks
ADD src1_1, src2_1 -> dest_1
ADD src1_2, src2_2 -> dest_2 (2 checks)
ADD src1_3, src2_3 -> dest_3 (4 checks)
• For four instructions: 12 checks
ADD src1_1, src2_1 -> dest_1
ADD src1_2, src2_2 -> dest_2 (2 checks)
ADD src1_3, src2_3 -> dest_3 (4 checks)
ADD src1_4, src2_4 -> dest_4 (6 checks)
• Plus checking for load-to-use stalls from prior n loads
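
The check counts on this slide can be sketched in Python. This is only an illustration, assuming two source registers per instruction and that each source is compared against the destination of every earlier instruction in the same issue group. The names `cross_checks` and `independent_prefix` are hypothetical, not from the slides.

```python
def cross_checks(n):
    """Comparisons needed to cross-check an n-instruction issue group.

    Instruction i (0-based) compares its 2 sources against the
    destinations of the i earlier instructions in the group.
    """
    return sum(2 * i for i in range(n))  # equals n * (n - 1)


def independent_prefix(group):
    """How many leading instructions of the group may issue together.

    `group` is a list of (sources, dest) tuples in program order.
    Issue stops at the first instruction that reads a register
    written earlier in the same group.
    """
    written = set()
    for count, (sources, dest) in enumerate(group):
        if any(src in written for src in sources):
            return count
        written.add(dest)
    return len(group)


# The two ADD pairs from the "An Opportunity" slide.
independent = [(("r1", "r2"), "r3"), (("r4", "r5"), "r6")]
dependent = [(("r1", "r2"), "r3"), (("r4", "r3"), "r6")]
print(cross_checks(2), cross_checks(3), cross_checks(4))
print(independent_prefix(independent), independent_prefix(dependent))
```

The sum 2·(0 + 1 + … + (N−1)) is N(N−1), which is why the stall logic is described as order N².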

How do we build such “superscalar” hardware?

Multiple-Issue or “Superscalar” Pipeline
• Overcome this limit using multiple issue
• Also called superscalar
• Two instructions per stage at once, or three, or four, or eight…
• “Instruction-Level Parallelism (ILP)” [Fisher, IEEE TC’81]
• Today, typically “4-ish-wide” (Intel Broadwell, AMD Ryzen)
• Broadwell issues up to 8 in the right circumstances, Ryzen up to 6
• ARM cores usually issue fewer

A Typical Dual-Issue Pipeline (1 of 2)
• Fetch an entire 16B or 32B cache block
• 4 to 8 instructions (assuming 4-byte average instruction length)
• Predict a single branch per cycle
• Parallel decode
• Need to check for conflicting instructions
• Is the output register of I1 an input register of I2?
• Other stalls, too (for example, load-use delay)

A Typical Dual-Issue Pipeline (2 of 2)
• Multi-ported register file
• Larger area, latency, power, cost, complexity
• Multiple execution units
• Simple adders are easy, but bypass paths are expensive
• Memory unit
• Single load per cycle (stall at decode) probably okay for dual issue
• Alternative: add a read port to data cache
• Larger area, latency, power, cost, complexity

How Much ILP is There?
• The compiler tries to “schedule” code to avoid stalls
• Even for scalar machines (to fill load-use delay slot)
• Even harder to schedule multiple-issue (superscalar)
• How much ILP is common?
• Greatly depends on the application
• Consider memory copy
• Unroll loop, lots of independent operations
• Other programs, less so
• Even given unbounded ILP, superscalar has implementation limits
• IPC (or CPI) vs clock frequency trade-off
• Given these challenges, what is reasonable today?
• ~4 instructions per cycle maximum
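
One rough way to put a number on "how much ILP" is instructions divided by the length of the longest register dependence chain. The Python sketch below does that under loud assumptions: unit latency for every instruction, unlimited issue width, and only register read-after-write dependences (memory dependences are ignored). The function name `ilp` and the tuple encoding are illustrative, not from the slides.

```python
def ilp(program):
    """Estimate ILP as instruction count / dataflow critical-path length.

    `program` is a list of (sources, dest) tuples in program order;
    dest is None for instructions that write no register (stores).
    """
    depth = {}    # register -> chain depth of its most recent writer
    longest = 0
    for sources, dest in program:
        d = 1 + max((depth.get(src, 0) for src in sources), default=0)
        if dest is not None:
            depth[dest] = d
        longest = max(longest, d)
    return len(program) / longest


# Unrolled memory copy: four loads, each followed by its dependent store.
memcpy_unrolled = []
for i in range(4):
    memcpy_unrolled.append((("r1",), f"t{i}"))        # load  t_i <- [r1 + 4i]
    memcpy_unrolled.append(((f"t{i}", "r2"), None))   # store [r2 + 4i] <- t_i

# A serial chain: each instruction reads the previous result.
chain = [(("a",), "b"), (("b",), "c"), (("c",), "d"), (("d",), "e")]

print(ilp(memcpy_unrolled), ilp(chain))
```

Under these assumptions the unrolled copy has eight instructions on a critical path of two, while the chain exposes no parallelism at all.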

Superscalar Pipeline Diagrams – Ideal
scalar              1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) -> r2      F  D  X  M  W
lw 4(r1) -> r3         F  D  X  M  W
lw 8(r1) -> r4            F  D  X  M  W
add r14,r15 -> r6            F  D  X  M  W
add r12,r13 -> r7               F  D  X  M  W
add r17,r16 -> r8                  F  D  X  M  W
lw 0(r18) -> r9                       F  D  X  M  W

2-way superscalar   1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) -> r2      F  D  X  M  W
lw 4(r1) -> r3      F  D  X  M  W
lw 8(r1) -> r4         F  D  X  M  W
add r14,r15 -> r6      F  D  X  M  W
add r12,r13 -> r7         F  D  X  M  W
add r17,r16 -> r8         F  D  X  M  W
lw 0(r18) -> r9              F  D  X  M  W

Superscalar Pipeline Diagrams – Realistic
scalar              1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) -> r2      F  D  X  M  W
lw 4(r1) -> r3         F  D  X  M  W
lw 8(r1) -> r4            F  D  X  M  W
add r4,r5 -> r6              F  D  d* X  M  W
add r2,r3 -> r7                 F  d* D  X  M  W
add r7,r6 -> r8                       F  D  X  M  W
lw 4(r8) -> r9                           F  D  X  M  W

2-way superscalar   1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) -> r2      F  D  X  M  W
lw 4(r1) -> r3      F  D  X  M  W
lw 8(r1) -> r4         F  D  X  M  W
add r4,r5 -> r6        F  D  d* d* X  M  W
add r2,r3 -> r7           F  D  d* X  M  W
add r7,r6 -> r8              F  d* D  X  M  W
lw 4(r8) -> r9               F  d* d* D  X  M  W
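
The execute (X) cycles in diagrams like these can be approximated with a small in-order issue model. The Python sketch below is a simplification under stated assumptions: full bypassing, a one-cycle load-use penalty, `width` instructions fetched per cycle starting in cycle 1, and no limit on loads per cycle. It tracks only the X stage, not the F/D stall placement. The function name `execute_cycles` and the tuple encoding are hypothetical.

```python
def execute_cycles(program, width, load_use_penalty=1):
    """Return the cycle in which each instruction enters X (in-order issue).

    `program` is a list of (opcode, sources, dest) tuples in program order.
    """
    ready = {}     # register -> first cycle a consumer may be in X
    cycles = []    # X cycle of each instruction so far (non-decreasing)
    for i, (opcode, sources, dest) in enumerate(program):
        # Fetched in cycle 1 + i // width, then D, then X at the earliest.
        x = 3 + i // width
        # Wait for source operands (bypassed from producers).
        for src in sources:
            x = max(x, ready.get(src, 0))
        # In-order: never execute before an older instruction.
        if cycles:
            x = max(x, cycles[-1])
        # At most `width` instructions may enter X in one cycle.
        while cycles.count(x) >= width:
            x += 1
        cycles.append(x)
        # ALU results bypass to the next cycle; loads take one cycle more.
        ready[dest] = x + 1 + (load_use_penalty if opcode == "lw" else 0)
    return cycles


realistic = [
    ("lw", ("r1",), "r2"),
    ("lw", ("r1",), "r3"),
    ("lw", ("r1",), "r4"),
    ("add", ("r4", "r5"), "r6"),
    ("add", ("r2", "r3"), "r7"),
    ("add", ("r7", "r6"), "r8"),
    ("lw", ("r8",), "r9"),
]
scalar_x = execute_cycles(realistic, width=1)
dual_x = execute_cycles(realistic, width=2)
print(scalar_x)
print(dual_x)
```

In this model the dual-issue machine saves only two cycles on the dependent sequence, because the load-use and add-to-add dependences serialize most of the work.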

Superscalar Implementation Challenges

Superscalar Challenges – Front End
• Superscalar instruction fetch
• Modest: fetch multiple instructions per cycle
• Aggressive: buffer instructions and/or predict multiple branches
• Superscalar instruction decode
• Replicate decoders
• Superscalar instruction issue
• Determine when instructions can proceed in parallel
• More complex stall logic: order N² for N-wide machine
• Not all combinations of types of instructions possible
• Superscalar register read
• Port for each register read (4-wide superscalar → 8 read “ports”)
• Each port needs its own set of address and data wires
• Latency & area ∝ #ports²
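
The port arithmetic can be sketched in a few lines of Python. This assumes two register reads and one register write per instruction, and takes "#ports" to mean read plus write ports. Both are simplifying assumptions, and the function names are hypothetical.

```python
def regfile_ports(width):
    """(read_ports, write_ports) for an N-wide machine.

    Assumes 2 source registers and 1 destination per instruction.
    """
    return 2 * width, width


def relative_cost(width):
    """Register-file latency/area relative to a scalar design,
    assuming cost grows with the square of the total port count."""
    ports = sum(regfile_ports(width))
    scalar_ports = sum(regfile_ports(1))
    return (ports / scalar_ports) ** 2


print(regfile_ports(4), relative_cost(2), relative_cost(4))
```

Under the quadratic assumption, going from scalar to 4-wide multiplies register-file cost by 16, not 4.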

Superscalar Challenges – Back End
• Superscalar instruction execution
• Replicate arithmetic units (but not all, for example, integer divider)
• Perhaps multiple cache ports (slower access, higher energy)
• Only for 4-wide or larger (why? only ~35% are load/store insn)
• Superscalar bypass paths
• More possible sources for data values
• Order (N² * P) for N-wide machine with execute pipeline depth P
• Superscalar instruction register writeback
• One write port per instruction that writes a register
• Example, 4-wide superscalar → 4 write ports
• Fundamental challenge:
• Amount of ILP (instruction-level parallelism) in the program
• Compiler must schedule code and extract parallelism

Superscalar Bypass
• N² bypass network
– N+1 input muxes at each ALU input
– N² point-to-point connections
– Routing lengthens wires
– Heavy capacitive load
• And this is just one bypass stage (MX)!
• There is also WX bypassing
• Even more for deeper pipelines
• One of the big problems of superscalar
• Why? On the critical path of single-cycle “bypass & execute” loop
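
A small Python sketch of the bypass-network counts, assuming N ALUs whose outputs all feed every ALU input, plus one register-file input per mux. The `stages` parameter (number of bypass levels such as MX and WX) and the function name are illustrative assumptions.

```python
def bypass_cost(width, stages=1):
    """Mux fan-in and wire count for an N-wide bypass network.

    Each ALU input mux selects among the register-file value plus
    `width` bypassed results per bypass stage.
    """
    return {
        "mux_inputs": stages * width + 1,
        "connections": stages * width * width,
    }


one_stage = bypass_cost(4)             # MX bypass only
two_stage = bypass_cost(4, stages=2)   # MX and WX bypass
print(one_stage, two_stage)
```

For a 4-wide machine this gives 5-input muxes and 16 connections for one stage, and the counts keep growing with each additional bypass level.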

Not All N² Created Equal
• N² bypass vs. N² stall logic & dependence cross-check
• Which is the bigger problem?
• N² bypass … by far
• 64-bit quantities (vs. 5-bit)
• Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
• Must fit in one clock period with ALU (vs. not)
• Dependence cross-check not even 2nd biggest N² problem
• Regfile is also an N² problem (think latency where N is #ports)
• And also more serious than cross-check

Mitigating N² Bypass & Register File
• Clustering: mitigates N² bypass
• Group ALUs into K clusters
• Full bypassing within a cluster
• Limited bypassing between clusters
• With 1 or 2 cycle delay
• Can hurt IPC, but faster clock
• (N/K) + 1 inputs at each mux
• (N/K)² bypass paths in each cluster
• Steering: key to performance
• Steer dependent insns to same cluster
• Cluster register file, too
• Replicate a register file per cluster
• All register writes update all replicas
• Fewer read ports; only for cluster

Mitigating N² RegFile: Clustering++
• Clustering: split N-wide execution pipeline into K clusters
• With centralized register file, 2N read ports and N write ports
• Clustered register file: extend clustering to register file
• Replicate the register file (one replica per cluster)
• Register file supplies register operands to just its cluster
• All register writes go to all register files (keep them in sync)
• Advantage: fewer read ports per register!
• K register files, each with 2N/K read ports and N write ports
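
The clustering formulas (per-cluster bypass and replicated register files) can be tabulated with a short Python sketch. It assumes N divides evenly by K and counts only intra-cluster bypass paths. The function name and dictionary keys are hypothetical.

```python
def clustered_costs(width, clusters):
    """Per-cluster structure sizes for an N-wide machine in K clusters."""
    if width % clusters != 0:
        raise ValueError("width must be a multiple of clusters")
    per_cluster = width // clusters
    return {
        "mux_inputs": per_cluster + 1,               # (N/K) + 1
        "bypass_paths_per_cluster": per_cluster ** 2,  # (N/K)^2
        "regfile_copies": clusters,                  # one replica per cluster
        "read_ports_per_copy": 2 * per_cluster,      # 2N/K
        "write_ports_per_copy": width,               # all writes reach all replicas
    }


centralized = clustered_costs(4, 1)
two_clusters = clustered_costs(4, 2)
print(centralized)
print(two_clusters)
```

Splitting a 4-wide machine into two clusters halves the read ports per register file and cuts per-cluster bypass paths from 16 to 4, while the write-port count is unchanged.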

Another Challenge: Superscalar Fetch
• What is involved in fetching multiple instructions per cycle?
• In same cache block? → no problem
• 64-byte cache block is 16 instructions (~4 bytes per instruction)
• Favors larger block size (independent of hit rate)
• What if next instruction is last instruction in a block?
• Fetch only one instruction that cycle
• Or, some processors may allow fetching from 2 consecutive blocks
• What about taken branches?
• How many instructions can be fetched on average?
• Average number of instructions per taken branch?
• Assume: 20% branches, 50% taken → ~10 instructions
• Consider a 5-instruction loop with a 4-issue processor
• Without smarter fetch, each iteration takes 2 fetch cycles (4 insns, then 1), so IPC is limited to 5/2 = 2.5 (not 4, which is bad)
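
The two fetch estimates above can be reproduced with a short Python sketch. It assumes fetch stops at the taken loop-back branch on every iteration and that each iteration starts at the beginning of a fetch group. The function names are hypothetical.

```python
import math


def insns_per_taken_branch(branch_fraction, taken_fraction):
    """Average run of instructions between taken branches."""
    return 1.0 / (branch_fraction * taken_fraction)


def loop_fetch_rate(loop_length, fetch_width):
    """Instructions fetched per cycle for a loop whose taken
    back-edge ends the fetch group on every iteration."""
    cycles_per_iteration = math.ceil(loop_length / fetch_width)
    return loop_length / cycles_per_iteration


run_length = insns_per_taken_branch(0.20, 0.50)
rate = loop_fetch_rate(5, 4)
print(run_length, rate)
```

A loop whose length is a multiple of the fetch width (for example 8 instructions on a 4-wide machine) reaches the full rate in this model, which is one reason compilers unroll and align loops.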

Increasing Superscalar Fetch Rate
• Option #1: over-fetch and buffer
• Add a queue between fetch and decode (18 entries in Intel Core 2)
• Compensates for cycles that fetch fewer than the maximum number of instructions
• “Decouples” the “front end” (fetch) from the “back end” (execute)
• Option #2: “loop stream detector” (Core 2, Core i7)
• Put entire loop body into a small cache
• Core 2: 18 macro-ops, up to four taken branches
• Core i7: 28 micro-ops (avoids re-decoding macro-ops!)
• Any branch mis-prediction requires normal re-fetch
• Other options: next-next-block prediction, “trace cache”

Multiple-Issue Implementations
• Statically-scheduled (in-order) superscalar
• What we’ve talked about thus far
+ Executes unmodified sequential programs
– Hardware must figure out what can be done in parallel
• E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
• Very Long Instruction Word (VLIW)
– Compiler identifies independent instructions, new ISA
+ Hardware can be simple and perhaps lower power
• E.g., Transmeta Crusoe (4-wide), most DSPs
• Variant: Explicitly Parallel Instruction Computing (EPIC)
• A bit more flexible encoding & some hardware to help compiler
• E.g., Intel Itanium (6-wide)
• Dynamically-scheduled superscalar (next topic)
• Hardware extracts more ILP by on-the-fly reordering
• Intel Atom/Core/Xeon, AMD Opteron/Ryzen, some ARM A-series

Trends in Single-Processor Multiple Issue
• Issue width has saturated at 4-6 for high-performance cores
• Canceled Alpha 21464 was 8-way issue
• Not enough ILP to justify going to wider issue
• Hardware or compiler scheduling needed to exploit 4-6 effectively
• More on this in the next unit
• For high performance-per-watt cores (say, smart phones)
• Typically 2-wide superscalar (but increasing each generation)

Multiple Issue Redux
• Multiple issue
• Exploits insn level parallelism (ILP) beyond pipelining
• Improves IPC, but perhaps at some clock & energy penalty
• 4-6 way issue is about the peak issue width currently justifiable
• Low-power implementations today typically 2-wide superscalar
• Problem spots
• N² bypass & register file → clustering
• Fetch + branch prediction → buffering, loop streaming, trace cache
• N² dependency check → VLIW/EPIC (but unclear how key this is)
• Implementations
• Superscalar vs. VLIW/EPIC

