
Digital System Design 4: Parallel Computing Architecture 1
Stewart Smith Digital Systems Design 4

This Lecture

• Parallel computing aspects of previous topics:
‣ Instruction set architecture – synchronisation (Textbook §2.11)
‣ Arithmetic systems – sub-word parallelism (Textbook §3.6)
‣ Pipelining – multiple-issue processors (Textbook §4.1)
‣ Multicore processors and memory (Textbook §5.10)

Parallelism in Instruction Set Architecture

• Synchronisation of instructions
• Two processors sharing an area of memory
‣ P1 writes, then P2 reads
‣ Data race if P1 and P2 don't synchronise
– Result depends on the order of accesses
• Hardware support required
‣ Atomic read/write memory operation
‣ No other access to the location allowed between the read and the write
• Could be a single instruction
‣ E.g., atomic swap of register ↔ memory
‣ Or an atomic pair of instructions

Synchronisation in MIPS
• Load linked: ll rt, offset(rs)
• Store conditional: sc rt, offset(rs)
‣ Succeeds if the location has not changed since the ll
– Returns 1 in rt
‣ Fails if the location has changed
– Returns 0 in rt
• Example: atomic swap (to test/set a lock variable) – see the C sketch below

try: add $t0,$zero,$s4   ;copy exchange value
     ll  $t1,0($s1)      ;load linked
     sc  $t0,0($s1)      ;store conditional
     beq $t0,$zero,try   ;branch if store fails
     add $s4,$zero,$t1   ;put load value in $s4
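A minimal C sketch of the same test-and-set idea, using C11 atomics rather than raw ll/sc (the lock variable and function names are illustrative, not from the slides). On MIPS a compiler typically lowers the atomic exchange to an ll/sc retry loop like the assembly above.

/* Sketch: a test-and-set spinlock built on an atomic exchange. */
#include <stdatomic.h>

static atomic_int lock = 0;          /* 0 = free, 1 = held */

void acquire(void)
{
    /* Atomically swap 1 into the lock; if the old value was 1,
     * someone else holds it, so spin and retry. */
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
        ;                            /* spin */
}

void release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}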

Subword Parallelism

• Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
‣ Example: a 128-bit adder can perform
– Sixteen 8-bit adds
– Eight 16-bit adds
– Four 32-bit adds
• Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD) – see the C sketch below
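A minimal sketch of sub-word parallelism, assuming the GCC/Clang vector extensions (the slides do not tie this to a particular ISA extension; the type name and values are illustrative): one 128-bit addition carries out eight independent 16-bit adds.

/* Eight 16-bit adds performed by a single 128-bit SIMD addition. */
#include <stdint.h>
#include <stdio.h>

typedef int16_t v8i16 __attribute__((vector_size(16)));  /* 8 x 16-bit lanes */

int main(void)
{
    v8i16 a = {1, 2, 3, 4, 5, 6, 7, 8};
    v8i16 b = {10, 20, 30, 40, 50, 60, 70, 80};
    v8i16 c = a + b;                 /* one SIMD add, eight lane-wise results */

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}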

Parallelism in FP Arithmetic


• Parallel programs may interleave operations in unexpected orders
• Assumptions of associativity may fail (demonstrated in the C sketch below)

Variable   Value        Sum 1: (X+Y)+Z   Sum 2: X+(Y+Z)
X          -1.50E+38    -1.50E+38
Y           1.50E+38     0.00E+00
Z           1.00E+00     1.00E+00         1.50E+38
Result                   1.00E+00         0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
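A minimal C sketch using the values from the table above, showing that single-precision addition is not associative:

/* The same three values summed in two orders give different results. */
#include <stdio.h>

int main(void)
{
    float x = -1.5e38f, y = 1.5e38f, z = 1.0f;

    float sum1 = (x + y) + z;   /* 0.0 + 1.0        = 1.0 */
    float sum2 = x + (y + z);   /* -1.5e38 + 1.5e38 = 0.0 (the 1.0 is lost) */

    printf("(X+Y)+Z = %g\n", sum1);   /* prints 1 */
    printf("X+(Y+Z) = %g\n", sum2);   /* prints 0 */
    return 0;
}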

Instruction Level Parallelism

• Pipelining: executing multiple instructions in parallel
• To increase ILP:
‣ Deeper pipeline
– Less work per stage ⇒ shorter clock cycle
‣ Multiple issue
– Replicate pipeline stages ⇒ multiple pipelines
– Start multiple instructions per clock cycle
– CPI < 1, so use Instructions Per Cycle (IPC)
– E.g., 4 GHz, 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
(A small C illustration of exposing ILP in software follows.)
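Illustrative only (not from the slides): the first loop below has a serial dependence through a single accumulator, so every add waits for the previous one; the second uses two independent accumulators, giving a multiple-issue processor independent operations to start in the same cycle.

/* Exposing instruction-level parallelism in a reduction (sketch). */
#include <stdio.h>

#define N 1024

static int serial_sum(const int *a)
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];                  /* each iteration depends on the last */
    return sum;
}

static int ilp_sum(const int *a)
{
    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < N; i += 2) {  /* assumes N is even */
        sum0 += a[i];                 /* these two adds are independent ...  */
        sum1 += a[i + 1];             /* ... and can issue in the same cycle */
    }
    return sum0 + sum1;
}

int main(void)
{
    int a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;
    printf("%d %d\n", serial_sum(a), ilp_sum(a));   /* same result */
    return 0;
}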
Multiple Issue

• Static multiple issue
‣ Compiler groups instructions to be issued together
‣ Packages them into "issue slots"
‣ Compiler detects and avoids hazards
• Dynamic multiple issue
‣ CPU examines the instruction stream and chooses instructions to issue each cycle
‣ Compiler can help by reordering instructions
‣ CPU resolves hazards using advanced techniques at runtime

Speculation

• "Guess" what to do with an instruction
‣ Start the operation as soon as possible
‣ Check whether the guess was right
– If so, complete the operation
– If not, roll back and do the right thing
• Common to static and dynamic multiple issue
• Examples
‣ Speculate on branch outcome
– Roll back if the path taken is different
‣ Speculate on load
– Roll back if the location is updated

Meltdown and Spectre

(Illustrative slide: Meltdown and Spectre are security vulnerabilities that exploit speculative execution.)

Static Multiple Issue

• Compiler groups instructions into "issue packets"
‣ Group of instructions that can be issued in a single cycle
‣ Determined by the pipeline resources required
• Think of an issue packet as a very long instruction
‣ Specifies multiple concurrent operations
‣ ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue

• Compiler must remove some/all hazards
‣ Reorder instructions into issue packets
‣ No dependencies within a packet
‣ Possibly some dependencies between packets
– Varies between ISAs; compiler must know!
‣ Pad with nop if necessary

MIPS with Static Dual Issue

• Two-issue packets
‣ One ALU/branch instruction
‣ One load/store instruction
‣ 64-bit aligned
– ALU/branch, then load/store
– Pad an unused instruction with nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB

Hazards in Dual Issue MIPS

• More instructions executing in parallel
• EX data hazard
‣ Forwarding avoided stalls with single issue
‣ Now can't use an ALU result in a load/store in the same packet
  add  $t0, $s0, $s1
  load $s2, 0($t0)
– Split into two packets, effectively a stall
• Load-use hazard
‣ Still one cycle of use latency, but now two instructions
• More aggressive scheduling required

Scheduling Example

• Schedule this code for dual-issue MIPS

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

        ALU/branch               Load/store        Cycle
Loop:   nop                      lw  $t0, 0($s1)     1
        addi $s1, $s1, -4        nop                 2
        addu $t0, $t0, $s2       nop                 3
        bne  $s1, $zero, Loop    sw  $t0, 4($s1)     4

‣ IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Loop Unrolling

• Replicate the loop body to expose more parallelism
‣ Reduces loop-control overhead
• Use different registers per replication
‣ Called "register renaming"
‣ Avoid loop-carried "anti-dependencies"
– Store followed by a load of the same register
– AKA "name dependence": reuse of a register name

Loop Unrolling Example

• Same code as the previous example; assume the loop index is a multiple of 4

        ALU/branch               Load/store          Cycle
Loop:   addi $s1, $s1, -16       lw  $t0,  0($s1)      1
        nop                      lw  $t1, 12($s1)      2
        addu $t0, $t0, $s2       lw  $t2,  8($s1)      3
        addu $t1, $t1, $s2       lw  $t3,  4($s1)      4
        addu $t2, $t2, $s2       sw  $t0, 16($s1)      5
        addu $t3, $t3, $s2       sw  $t1, 12($s1)      6
        nop                      sw  $t2,  8($s1)      7
        bne  $s1, $zero, Loop    sw  $t3,  4($s1)      8

‣ IPC = 14/8 = 1.75
– Closer to 2, but at the cost of registers and code size (a C sketch of unrolling follows)
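A rough C analogue of the example above (illustrative only; register renaming and nop padding are handled by the compiler and hardware, not visible at this level). The loop adds a scalar to every array element, as in the MIPS code; the unrolled version processes four elements per iteration, assuming the element count is a multiple of 4, as the slide assumes.

/* Loop unrolling sketch: same work, fewer loop-control instructions. */
void add_scalar(int *a, int n, int s)
{
    for (int i = 0; i < n; i++)
        a[i] += s;
}

void add_scalar_unrolled(int *a, int n, int s)
{
    for (int i = 0; i < n; i += 4) {
        /* Four independent updates per iteration: fewer branches and more
         * work available to fill the issue slots of a multiple-issue CPU. */
        a[i]     += s;
        a[i + 1] += s;
        a[i + 2] += s;
        a[i + 3] += s;
    }
}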
Dynamic Multiple Issue

• "Superscalar" processors
• CPU decides whether to issue 0, 1, 2, ... instructions each cycle
‣ Avoiding structural and data hazards
• Avoids the need for compiler scheduling
‣ Though it may still help
‣ Code semantics ensured by the CPU

Dynamic Pipeline Scheduling

• Allow the CPU to execute instructions out of order to avoid stalls
‣ But commit results to registers in order
• Example
  lw   $t0, 20($s2)
  addu $t1, $t0, $t2
  sub  $s4, $s4, $t3
  slti $t5, $s4, 20
‣ Can start sub while addu is waiting for lw

Dynamically Scheduled CPU

(Datapath figure; annotations from the figure:)
‣ Preserves dependencies
‣ Hold pending operands
‣ Reorder buffer for register writes
– Can supply operands for issued instructions
‣ Results also sent to any waiting reservation stations

Power Efficiency

• Complexity of dynamic scheduling and speculation requires power
• Multiple simpler cores may be better

Microprocessor             Year   Clock Rate   Pipeline Stages   Issue width   Out-of-order/Speculation   Cores   Power
i486                       1989   25 MHz       5                 1             No                         1       5 W
Pentium                    1993   66 MHz       5                 2             No                         1       10 W
Pentium Pro                1997   200 MHz      10                3             Yes                        1       29 W
P4 Willamette              2001   2000 MHz     22                3             Yes                        1       75 W
P4 Prescott                2004   3600 MHz     31                3             Yes                        1       103 W
Intel Core                 2006   3000 MHz     14                4             Yes                        2       75 W
Intel Core i7 Nehalem      2008   3600 MHz     14                4             Yes                        2-4     87 W
Intel Core Westmere        2010   3730 MHz     14                4             Yes                        6       130 W
Intel Core i7 Ivy Bridge   2012   3400 MHz     14                4             Yes                        6       130 W
Intel Core Broadwell       2014   3700 MHz     14                4             Yes                        10      140 W
Intel Core i9 Skylake      2016   3100 MHz     14                4             Yes                        14      165 W
Intel Ice Lake             2018   4200 MHz     14                4             Yes                        16      185 W

Parallelism in Memory: Cache Coherence Problems

• Suppose two CPU cores share a physical address space
‣ Write-through caches

Time step   Event                 CPU A's cache   CPU B's cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1

(A small C sketch of this shared-variable scenario follows.)
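A minimal software view of the scenario in the table above (illustrative, not from the slides): one thread writes X while another reads it. C11 atomics stop the compiler from caching the value in a register; making the new value visible between cores is then the job of the hardware's cache-coherence protocol, defined next.

/* Thread A writes X = 1; thread B spins until it observes the new value.
 * Cache coherence is what guarantees B's core eventually sees A's write.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int X = 0;

static void *cpu_a(void *arg)
{
    (void)arg;
    atomic_store(&X, 1);                 /* CPU A writes 1 to X */
    return NULL;
}

static void *cpu_b(void *arg)
{
    (void)arg;
    while (atomic_load(&X) == 0)         /* CPU B reads X until it sees 1 */
        ;
    printf("CPU B read X = %d\n", atomic_load(&X));
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&b, NULL, cpu_b, NULL);
    pthread_create(&a, NULL, cpu_a, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}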
Coherence Defined

• Informally: reads return the most recently written value
• Formally:
‣ P writes X; P reads X (no intervening writes) ⇒ read returns the written value
‣ P1 writes X; P2 reads X (sufficiently later) ⇒ read returns the written value
– c.f. CPU B reading X after step 3 in the example above
‣ P1 writes X, P2 writes X ⇒ all processors see the writes in the same order
– End up with the same final value for X

Cache Coherence Protocols

• Operations performed by caches in multiprocessors to ensure coherence
‣ Migration of data to local caches
– Reduces bandwidth for shared memory
‣ Replication of read-shared data
– Reduces contention for access
• Snooping protocols
‣ Each cache monitors bus reads/writes
• Directory-based protocols
‣ Caches and memory record the sharing status of blocks in a directory

Invalidating Snooping Protocols

• Cache gets exclusive access to a block when it is to be written
‣ Broadcasts an invalidate message on the bus
‣ Subsequent read in another cache misses
– Owning cache supplies the updated value

CPU activity          Bus activity       CPU A's cache   CPU B's cache   Memory
                                                                         0
CPU A reads X         Cache miss for X   0                               0
CPU B reads X         Cache miss for X   0               0               0
CPU A writes 1 to X   Invalidate for X   1                               0
CPU B reads X         Cache miss for X   1               1               1

Memory Consistency

• When are writes seen by other processors?
‣ "Seen" means a read returns the written value
‣ Can't be instantaneous
• Assumptions
‣ A write completes only when all processors have seen it
‣ A processor does not reorder writes with other accesses
• Consequence
‣ P writes X then writes Y ⇒ all processors that see the new Y also see the new X
‣ Processors can reorder reads, but not writes

Next Lecture

• Parallel Computing and Performance
‣ Definitions
‣ Speedup and Efficiency
‣ Return to Amdahl's Law
‣ Scaling
‣ Load balancing