
Parallel 1 2021


Digital System Design 4
Parallel Computing Architecture 1


This Lecture
• Parallel computing aspects of previous topics
‣ Instruction set architecture – synchronisation

(Textbook §2.11)

‣ Arithmetic systems – subword parallelism
(Textbook §3.6)

‣ Pipelining – multiple issue processors
(Textbook §4.1)

‣ Multicore processors and memory
(Textbook §5.10)


Parallelism in Instruction Set Architecture

• Synchronisation of instructions
‣ Two processors sharing an area of memory
‣ P1 writes, then P2 reads
‣ Data race if P1 and P2 don’t synchronise

– Result depends on order of accesses
• Hardware support required
‣ Atomic read/write memory operation
‣ No other access to the location allowed between the read and write

• Could be a single instruction
‣ E.g., atomic swap of register ↔ memory
‣ Or an atomic pair of instructions
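
For illustration (not part of the lecture material), a minimal C sketch of such a data race, assuming POSIX threads: two threads perform unsynchronised read-modify-write updates on a shared counter, so the final result depends on how their accesses interleave.

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                 /* shared memory location */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        shared++;                      /* read-modify-write with no synchronisation: data race */
    return NULL;
}

int main(void) {
    pthread_t p1, p2;
    pthread_create(&p1, NULL, worker, NULL);
    pthread_create(&p2, NULL, worker, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("%d\n", shared);            /* often less than 2000000: updates are lost */
    return 0;
}

Replacing the plain increment with an atomic read-modify-write (or a lock built from the instructions on the next slide) removes the race.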


Synchronisation in MIPS
• Load linked: ll rt, offset(rs)
• Store conditional: sc rt, offset(rs)
‣ Succeeds if location not changed since the ll

– Returns 1 in rt
‣ Fails if location is changed

– Returns 0 in rt
• Example: atomic swap (to test/set lock variable)
try: add $t0, $zero, $s4   ;copy exchange value
     ll  $t1, 0($s1)       ;load linked
     sc  $t0, 0($s1)       ;store conditional
     beq $t0, $zero, try   ;branch if store fails
     add $s4, $zero, $t1   ;put load value in $s4
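
A hedged C11 sketch of the same test-and-set idea: atomic_exchange plays the role of the atomic swap, and on MIPS a compiler would typically lower it to an ll/sc retry loop like the one above. The function names and lock variable are illustrative, not from the course material.

#include <stdatomic.h>

static atomic_int lock = 0;                 /* 0 = free, 1 = held */

void acquire(void) {
    /* Atomically swap 1 into the lock; retry while the old value was 1 (already held). */
    while (atomic_exchange(&lock, 1) == 1) {
        /* spin */
    }
}

void release(void) {
    atomic_store(&lock, 0);                 /* hand the lock back */
}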


Subword Parallelism
• Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
‣ Example: 128-bit adder:

– Sixteen 8-bit adds
– Eight 16-bit adds
– Four 32-bit adds

• Also called data-level parallelism, vector
parallelism, or Single Instruction, Multiple
Data (SIMD)
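
As a sketch of how this looks in code (assuming the GCC/Clang vector extensions, which are not part of the lecture material), a single 128-bit addition performing sixteen independent 8-bit adds:

#include <stdint.h>

/* A 128-bit value treated as sixteen 8-bit lanes (GCC/Clang vector extension). */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

v16u8 add16x8(v16u8 a, v16u8 b) {
    return a + b;   /* element-wise: can compile to one 128-bit SIMD add, i.e. sixteen 8-bit adds */
}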


Parallelism in FP Arithmetic

• Parallel programs may interleave operations in
unexpected orders
‣ Assumptions of associativity may fail

• Need to validate parallel programs under varying
degrees of parallelism

Variable values: X = -1.50E+38, Y = 1.50E+38, Z = 1.00E+00

Sum 1: (X+Y)+Z = (-1.50E+38 + 1.50E+38) + 1.00E+00 = 0.00E+00 + 1.00E+00 = 1.00E+00
Sum 2: X+(Y+Z) = -1.50E+38 + (1.50E+38 + 1.00E+00) = -1.50E+38 + 1.50E+38 = 0.00E+00
(the 1.00E+00 is lost when rounded into 1.50E+38)

Result: Sum 1 = 1.00E+00, Sum 2 = 0.00E+00
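
A minimal C illustration of the same effect (not from the lecture), using single-precision values:

#include <stdio.h>

int main(void) {
    float x = -1.5e38f, y = 1.5e38f, z = 1.0f;
    float sum1 = (x + y) + z;   /* 0.0 + 1.0 = 1.0 */
    float sum2 = x + (y + z);   /* y + z rounds to 1.5e38, so the total is 0.0 */
    printf("sum1 = %g, sum2 = %g\n", sum1, sum2);
    return 0;
}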


Instruction Level Parallelism

• Pipelining: executing multiple instructions in parallel
• To increase ILP
‣ Deeper pipeline

– Less work per stage ⇒ shorter clock cycle

• Multiple issue
‣ Replicate pipeline stages ⇒ multiple pipelines
‣ Start multiple instructions per clock cycle
‣ CPI < 1, so use Instructions Per Cycle (IPC)
- E.g., 4GHz 4-way multiple-issue
- 16 BIPS, peak CPI = 0.25, peak IPC = 4

Multiple Issue

• Static multiple issue
‣ Compiler groups instructions to be issued together
‣ Packages them into “issue slots”
‣ Compiler detects and avoids hazards
• Dynamic multiple issue
‣ CPU examines instruction stream and chooses instructions to issue each cycle
‣ Compiler can help by reordering instructions
‣ CPU resolves hazards using advanced techniques at runtime

Speculation

• “Guess” what to do with an instruction
‣ Start operation as soon as possible
‣ Check whether guess was right
‣ If so, complete the operation
‣ If not, roll-back and do the right thing
• Common to static and dynamic multiple issue
• Examples
‣ Speculate on branch outcome
- Roll back if path taken is different
‣ Speculate on load
- Roll back if location is updated

Meltdown and Spectre

Static Multiple Issue

• Compiler groups instructions into “issue packets”
‣ Group of instructions that can be issued on a single cycle
‣ Determined by pipeline resources required
• Think of an issue packet as a very long instruction
‣ Specifies multiple concurrent operations
‣ ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue

• Compiler must remove some/all hazards
‣ Reorder instructions into issue packets
‣ No dependencies within a packet
‣ Possibly some dependencies between packets
- Varies between ISAs; compiler must know!
‣ Pad with nop if necessary

MIPS with Static Dual Issue

• Two-issue packets
‣ One ALU/branch instruction
‣ One load/store instruction
‣ 64-bit aligned
- ALU/branch, then load/store
- Pad an unused instruction with nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF ID EX MEM WB
n + 4     Load/store         IF ID EX MEM WB
n + 8     ALU/branch            IF ID EX MEM WB
n + 12    Load/store            IF ID EX MEM WB
n + 16    ALU/branch               IF ID EX MEM WB
n + 20    Load/store               IF ID EX MEM WB

Hazards in Dual Issue MIPS

• More instructions executing in parallel
• EX data hazard
‣ Forwarding avoided stalls with single-issue
‣ Now can’t use ALU result in load/store in same packet
add  $t0, $s0, $s1
load $s2, 0($t0)
- Split into two packets, effectively a stall
• Load-use hazard
‣ Still one cycle use latency, but now two instructions
• More aggressive scheduling required

Scheduling Example

• Schedule this code for dual issue MIPS
‣ IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch $s1!=0

ALU/branch             Load/store       Cycle
Loop: nop              lw $t0, 0($s1)   1
addi $s1, $s1, -4      nop              2
addu $t0, $t0, $s2     nop              3
bne $s1, $zero, Loop   sw $t0, 4($s1)   4

Loop Unrolling

• Replicate loop body to expose more parallelism
‣ Reduces loop-control overhead
• Use different registers per replication
‣ Called “register renaming”
‣ Avoid loop-carried “anti-dependencies”
- Store followed by a load of the same register
- AKA “name dependence”
- Reuse of a register name

Loop Unrolling Example

• Same code as the previous example, assume loop index is a multiple of 4
‣ IPC = 14/8 = 1.75
- Closer to 2, but at cost of registers and code size

ALU/branch               Load/store        Cycle
Loop: addi $s1, $s1, -16 lw $t0, 0($s1)    1
nop                      lw $t1, 12($s1)   2
addu $t0, $t0, $s2       lw $t2, 8($s1)    3
addu $t1, $t1, $s2       lw $t3, 4($s1)    4
addu $t2, $t2, $s2       sw $t0, 16($s1)   5
addu $t3, $t3, $s2       sw $t1, 12($s1)   6
nop                      sw $t2, 8($s1)    7
bne $s1, $zero, Loop     sw $t3, 4($s1)    8
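
To relate the schedules above back to source code, a hedged C sketch of the loop before and after unrolling by four; the array name, element count and scalar are illustrative assumptions, not the course code.

/* Original loop: one element per iteration. */
void scale_add(int *a, int n, int s) {
    for (int i = n - 1; i >= 0; i--)
        a[i] += s;
}

/* Unrolled by four (assumes n is a multiple of 4): fewer loop-control
   instructions and more independent work to schedule per iteration. */
void scale_add_unrolled(int *a, int n, int s) {
    for (int i = n - 4; i >= 0; i -= 4) {
        a[i + 3] += s;
        a[i + 2] += s;
        a[i + 1] += s;
        a[i]     += s;
    }
}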

Dynamic Multiple Issue

• “Superscalar” processors
‣ CPU decides whether to issue 0, 1, 2, … instructions each cycle
- Avoiding structural and data hazards
• Avoids the need for compiler scheduling
‣ Though it may still help
‣ Code semantics ensured by the CPU

Dynamic Pipeline Scheduling

• Allow the CPU to execute instructions out of order to avoid stalls
‣ But commit results to registers in order
• Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
‣ Can start sub while addu is waiting for lw

Dynamically Scheduled CPU

• (Figure: dynamically scheduled pipeline; callouts from the diagram:)
‣ Preserves dependencies
‣ Hold pending operands
‣ Results also sent to any waiting reservation stations
‣ Reorder buffer for register writes
‣ Can supply operands for issued instructions

Power Efficiency

• Complexity of dynamic scheduling and speculation requires power
• Multiple simpler cores may be better

Microprocessor             Year   Clock rate   Pipeline stages   Issue width   Out-of-order/Speculation   Cores   Power
i486                       1989   25 MHz       5                 1             No                         1       5 W
Pentium                    1993   66 MHz       5                 2             No                         1       10 W
Pentium Pro                1997   200 MHz      10                3             Yes                        1       29 W
P4 Willamette              2001   2000 MHz     22                3             Yes                        1       75 W
P4 Prescott                2004   3600 MHz     31                3             Yes                        1       103 W
Intel Core                 2006   3000 MHz     14                4             Yes                        2       75 W
Intel Core i7 Nehalem      2008   3600 MHz     14                4             Yes                        2-4     87 W
Intel Core Westmere        2010   3730 MHz     14                4             Yes                        6       130 W
Intel Core i7 Ivy Bridge   2012   3400 MHz     14                4             Yes                        6       130 W
Intel Core Broadwell       2014   3700 MHz     14                4             Yes                        10      140 W
Intel Core i9 Skylake      2016   3100 MHz     14                4             Yes                        14      165 W
Intel Ice Lake             2018   4200 MHz     14                4             Yes                        16      185 W

Parallelism in Memory: Cache Coherence Problems

• Suppose two CPU cores share a physical address space
‣ Write-through caches

Time step   Event                 CPU A’s cache   CPU B’s cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1

Coherence Defined

• Informally: reads return most recently written value
• Formally:
‣ P writes X; P reads X (no intervening writes) ⇒ read returns written value
‣ P1 writes X; P2 reads X (sufficiently later) ⇒ read returns written value
- c.f. CPU B reading X after step 3 in example
‣ P1 writes X, P2 writes X ⇒ all processors see writes in the same order
- End up with the same final value for X

Cache Coherence Protocols

• Operations performed by caches in multiprocessors to ensure coherence
‣ Migration of data to local caches
- Reduces bandwidth for shared memory
‣ Replication of read-shared data
- Reduces contention for access
• Snooping protocols
‣ Each cache monitors bus reads/writes
• Directory-based protocols
‣ Caches and memory record sharing status of blocks in a directory

Invalidating Snooping Protocols

• Cache gets exclusive access to a block when it is to be written
‣ Broadcasts an invalidate message on the bus
‣ Subsequent read in another cache misses
- Owning cache supplies updated value

CPU activity          Bus activity       CPU A’s cache   CPU B’s cache   Memory
                                                                         0
CPU A reads X         Cache miss for X   0                               0
CPU B reads X         Cache miss for X   0               0               0
CPU A writes 1 to X   Invalidate for X   1                               0
CPU B reads X         Cache miss for X   1               1               1

Memory Consistency

• When are writes seen by other processors?
‣ “Seen” means a read returns the written value
‣ Can’t be instantaneous
• Assumptions
‣ A write completes only when all processors have seen it
‣ A processor does not reorder writes with other accesses
• Consequence
‣ P writes X then writes Y ⇒ all processors that see new Y also see new X
‣ Processors can reorder reads, but not writes
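
One way to see the write-ordering consequence above in software is with C11 release/acquire atomics. This is an illustrative sketch (not from the lecture), with data and flag standing in for X and Y:

#include <stdatomic.h>

int data = 0;               /* "X": ordinary shared data       */
atomic_int flag = 0;        /* "Y": flag announcing X is ready */

void producer(void) {
    data = 42;                                              /* write X ...            */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* ... then write Y       */
}

int consumer(void) {
    if (atomic_load_explicit(&flag, memory_order_acquire))  /* sees new Y ...         */
        return data;                                        /* ... so also sees new X */
    return -1;                                              /* Y not yet visible      */
}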

Next Lecture

• Parallel Computing and Performance
‣ Definitions
‣ Speedup and Efficiency
‣ Return to Amdahl’s Law
‣ Scaling
‣ Load balancing