Digital Systems Design 4
Parallel Computing Architecture 1
Stewart Smith, 2021
This Lecture
• Parallel computing aspects of previous topics
‣ Instruction set architecture – synchronisation
(Textbook §2.11)
‣ Arithmetic systems – subword parallelism
(Textbook §3.6)
‣ Pipelining – multiple issue processors
(Textbook §4.1)
‣ Multicore processors and memory
(Textbook §5.10)
Parallelism in Instruction Set
Architecture
• Synchronisation of instructions
‣ Two processors sharing an area of memory
‣ P1 writes, then P2 reads
‣ Data race if P1 and P2 don’t synchronise
– Result depends on the order of accesses
• Hardware support required
‣ Atomic read/write memory operation
‣ No other access to the location allowed between the read and
write
• Could be a single instruction
‣ E.g., atomic swap of register ↔ memory
‣ Or an atomic pair of instructions
Synchronisation in MIPS
• Load linked: ll rt, offset(rs)
• Store conditional: sc rt, offset(rs)
‣ Succeeds if location not changed since the ll
– Returns 1 in rt
‣ Fails if location is changed
– Returns 0 in rt
• Example: atomic swap (to test/set lock variable)
try: add $t0,$zero,$s4 ;copy exchange value
ll $t1,0($s1) ;load linked
sc $t0,0($s1) ;store conditional
beq $t0,$zero,try ;branch if store fails
add $s4,$zero,$t1 ;put load value in $s4
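The same test-and-set pattern can be sketched with C11 atomics; `atomic_exchange` plays the role of the slide's single-instruction atomic swap, and the retry loop mirrors the `try`/`beq` loop above. The `acquire`/`release` names are illustrative, not from the source.

```c
#include <stdatomic.h>

/* Hypothetical lock variable: 0 = free, 1 = held. */
atomic_int lock_var = 0;

/* Spin until the atomic swap returns 0, i.e. we changed the lock
   from free to held -- the same retry loop as the ll/sc example. */
void acquire(atomic_int *l) {
    while (atomic_exchange(l, 1) != 0)
        ; /* lock was already held: try again */
}

void release(atomic_int *l) {
    atomic_store(l, 0);
}
```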
Subword Parallelism
• Graphics and audio applications can take
advantage of performing simultaneous
operations on short vectors
‣ Example: 128-bit adder:
– Sixteen 8-bit adds
– Eight 16-bit adds
– Four 32-bit adds
• Also called data-level parallelism, vector
parallelism, or Single Instruction, Multiple
Data (SIMD)
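Even without SIMD hardware, the idea can be sketched in plain C as "SWAR" (SIMD within a register). The hypothetical helper below performs four independent 8-bit adds inside one 32-bit add by keeping carries from crossing lane boundaries:

```c
#include <stdint.h>

/* Add four packed 8-bit lanes in one 32-bit operation (SWAR).
   Mask off the top bit of each byte so carries cannot cross into
   the next lane, add the low 7 bits, then XOR the top bits back. */
uint32_t add_bytes(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}
```

A 128-bit SIMD unit does the same thing in hardware across sixteen 8-bit lanes at once.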
Parallelism in FP Arithmetic
• Parallel programs may interleave operations in
unexpected orders
‣ Assumptions of associativity may fail
• Need to validate parallel programs under varying
degrees of parallelism
Variable   Value       Sum 1: (X+Y)+Z     Sum 2: X+(Y+Z)
X          -1.50E+38
Y           1.50E+38   X+Y = 0.00E+00
Z           1.00E+00                      Y+Z = 1.50E+38
Result                 1.00E+00           0.00E+00
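The table's example can be checked directly in single-precision C; `sum_left` and `sum_right` are illustrative names for the two association orders:

```c
/* Floating-point addition is not associative: with
   x = -1.5e38, y = 1.5e38, z = 1.0 in single precision,
   (x + y) + z = 0.0 + 1.0 = 1.0, but
   x + (y + z) = -1.5e38 + 1.5e38 = 0.0, because y + z
   rounds back to 1.5e38 (1.0 is below the rounding precision). */
float sum_left(float x, float y, float z)  { return (x + y) + z; }
float sum_right(float x, float y, float z) { return x + (y + z); }
```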
Instruction Level Parallelism
• Pipelining: executing multiple instructions in parallel
• To increase ILP
‣ Deeper pipeline
– Less work per stage ⇒ shorter clock cycle
• Multiple issue
‣ Replicate pipeline stages ⇒ multiple pipelines
‣ Start multiple instructions per clock cycle
‣ CPI < 1, so use Instructions Per Cycle (IPC)
- E.g., 4GHz 4-way multiple-issue
- 16 BIPS, peak CPI = 0.25, peak IPC = 4
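The peak figures above are just clock rate times issue width; a one-line helper (illustrative, not from the source) makes the arithmetic explicit:

```c
/* Peak throughput in billions of instructions per second (BIPS):
   clock rate (Hz) x issue width. At this peak, CPI = 1/issue width. */
double peak_bips(double clock_hz, int issue_width) {
    return clock_hz * issue_width / 1e9;
}
```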
Multiple Issue
• Static multiple issue
‣ Compiler groups instructions to be issued together
‣ Packages them into “issue slots”
‣ Compiler detects and avoids hazards
• Dynamic multiple issue
‣ CPU examines instruction stream and chooses
instructions to issue each cycle
‣ Compiler can help by reordering instructions
‣ CPU resolves hazards using advanced techniques
at runtime
Speculation
• “Guess” what to do with an instruction
‣ Start operation as soon as possible
‣ Check whether guess was right
‣ If so, complete the operation
‣ If not, roll back and do the right thing
• Common to static and dynamic multiple issue
• Examples
‣ Speculate on branch outcome
- Roll back if path taken is different
‣ Speculate on load
- Roll back if location is updated
Meltdown and Spectre: security attacks that exploit speculative execution to leak data
Static Multiple Issue
• Compiler groups instructions into “issue
packets”
‣ Group of instructions that can be issued on a
single cycle
‣ Determined by pipeline resources required
• Think of an issue packet as a very long
instruction
‣ Specifies multiple concurrent operations
⇒ Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
• Compiler must remove some/all hazards
‣ Reorder instructions into issue packets
‣ No dependencies within a packet
‣ Possibly some dependencies between packets
- Varies between ISAs; compiler must know!
‣ Pad with nop if necessary
MIPS with Static Dual Issue
• Two-issue packets
‣ One ALU/branch instruction
‣ One load/store instruction
‣ 64-bit aligned
- ALU/branch, then load/store
- Pad an unused instruction with nop
Address   Instruction type   Pipeline stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB
Hazards in Dual Issue MIPS
• More instructions executing in parallel
• EX data hazard
‣ Forwarding avoided stalls with single-issue
‣ Now can’t use ALU result in load/store in same
packet
add $t0, $s0, $s1
lw $s2, 0($t0)
- Split into two packets, effectively a stall
• Load-use hazard
‣ Still one cycle use latency, but now two instructions
• More aggressive scheduling required
Scheduling Example
• Schedule this code for dual issue MIPS
‣ IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
Loop: lw $t0, 0($s1) # $t0=array element
addu $t0, $t0, $s2 # add scalar in $s2
sw $t0, 0($s1) # store result
addi $s1, $s1,-4 # decrement pointer
bne $s1, $zero, Loop # branch if $s1 != 0
ALU/branch Load/store Cycle
Loop: nop lw $t0, 0($s1) 1
addi $s1, $s1,-4 nop 2
addu $t0, $t0, $s2 nop 3
bne $s1, $zero, Loop sw $t0, 4($s1) 4
Loop Unrolling
• Replicate loop body to expose more parallelism
‣ Reduces loop-control overhead
• Use different registers per replication
‣ Called “register renaming”
‣ Avoid loop-carried “anti-dependencies”
- Store followed by a load of the same register
- AKA “name dependence”
- Reuse of a register name
Loop Unrolling Example
• Same code as the previous example, assuming the number of elements is a multiple of 4
‣ IPC = 14/8 = 1.75
- Closer to 2, but at cost of registers and code size
ALU/branch Load/store Cycle
Loop: addi $s1, $s1,-16 lw $t0, 0($s1) 1
nop lw $t1, 12($s1) 2
addu $t0, $t0, $s2 lw $t2, 8($s1) 3
addu $t1, $t1, $s2 lw $t3, 4($s1) 4
addu $t2, $t2, $s2 sw $t0, 16($s1) 5
addu $t3, $t3, $s2 sw $t1, 12($s1) 6
nop sw $t2, 8($s1) 7
bne $s1, $zero, Loop sw $t3, 4($s1) 8
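The same transformation can be written at the C level, assuming the element count is a multiple of 4. The distinct temporaries `t0`..`t3` correspond to the renamed registers `$t0`..`$t3` in the schedule above (function name is illustrative):

```c
/* Add a scalar to every element of an array, unrolled by four.
   Four independent load/add/store chains per iteration expose more
   instruction-level parallelism and amortise the loop overhead.
   Assumes n is a multiple of 4. */
void add_scalar_unrolled(int *a, int n, int s) {
    for (int i = 0; i < n; i += 4) {
        int t0 = a[i],     t1 = a[i + 1];   /* distinct temporaries: */
        int t2 = a[i + 2], t3 = a[i + 3];   /* "register renaming"   */
        a[i]     = t0 + s;
        a[i + 1] = t1 + s;
        a[i + 2] = t2 + s;
        a[i + 3] = t3 + s;
    }
}
```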
Dynamic Multiple Issue
• “Superscalar” processors
‣ CPU decides whether to issue 0, 1, 2, … instructions each cycle
- Avoiding structural and data hazards
• Avoids the need for compiler scheduling
‣ Though it may still help
‣ Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
• Allow the CPU to execute instructions out
of order to avoid stalls
‣ But commit result to registers in order
• Example
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
‣ Can start sub while addu is waiting for lw
Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU. Reservation stations hold pending operands; results are also sent to any waiting reservation stations; the reorder buffer for register writes can supply operands for issued instructions and preserves dependencies.]
Power Efficiency
• Complexity of dynamic scheduling and speculation requires power
• Multiple simpler cores may be better
Microprocessor             Year   Clock rate   Pipeline stages   Issue width   Out-of-order/Speculation   Cores   Power
i486                       1989   25 MHz       5                 1             No                         1       5 W
Pentium                    1993   66 MHz       5                 2             No                         1       10 W
Pentium Pro                1997   200 MHz      10                3             Yes                        1       29 W
P4 Willamette              2001   2000 MHz     22                3             Yes                        1       75 W
P4 Prescott                2004   3600 MHz     31                3             Yes                        1       103 W
Intel Core                 2006   3000 MHz     14                4             Yes                        2       75 W
Intel Core i7 Nehalem      2008   3600 MHz     14                4             Yes                        2-4     87 W
Intel Core Westmere        2010   3730 MHz     14                4             Yes                        6       130 W
Intel Core i7 Ivy Bridge   2012   3400 MHz     14                4             Yes                        6       130 W
Intel Core Broadwell       2014   3700 MHz     14                4             Yes                        10      140 W
Intel Core i9 Skylake      2016   3100 MHz     14                4             Yes                        14      165 W
Intel Ice Lake             2018   4200 MHz     14                4             Yes                        16      185 W
Parallelism in Memory:
Cache Coherence Problems
• Suppose two CPU cores share a physical
address space
‣ Write-through caches
Time step   Event                 CPU A's cache   CPU B's cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1
Coherence Defined
• Informally: Reads return most recently written
value
• Formally:
‣ P writes X; P reads X (no intervening writes)
⇒ read returns written value
‣ P1 writes X; P2 reads X (sufficiently later)
⇒ read returns written value
- c.f. CPU B reading X after step 3 in example
‣ P1 writes X, P2 writes X
⇒ all processors see writes in the same order
- End up with the same final value for X
Cache Coherence Protocols
• Operations performed by caches in multiprocessors
to ensure coherence
‣ Migration of data to local caches
- Reduces bandwidth for shared memory
‣ Replication of read-shared data
- Reduces contention for access
• Snooping protocols
‣ Each cache monitors bus reads/writes
• Directory-based protocols
‣ Caches and memory record sharing status of blocks in a
directory
Invalidating Snooping Protocols
• Cache gets exclusive access to a block when it is
to be written
‣ Broadcasts an invalidate message on the bus
‣ Subsequent read in another cache misses
- Owning cache supplies updated value
CPU activity          Bus activity        CPU A's cache   CPU B's cache   Memory
                                                                          0
CPU A reads X         Cache miss for X    0                               0
CPU B reads X         Cache miss for X    0               0               0
CPU A writes 1 to X   Invalidate for X    1                               0
CPU B reads X         Cache miss for X    1               1               1
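The sequence in this table can be modelled with a toy two-cache simulation. This is a minimal sketch, not a real protocol implementation; the `Cache` struct and `read_x`/`write_x` names are assumptions made for illustration:

```c
/* Toy model of an invalidating snooping protocol for one block X,
   shared by two caches A and B over a common bus. */
typedef struct { int valid, dirty, value; } Cache;

int mem_x = 0;                       /* memory copy of block X */
Cache A = {0, 0, 0}, B = {0, 0, 0};

int read_x(Cache *self, Cache *other) {
    if (!self->valid) {                     /* cache miss: snoop the bus */
        if (other->valid && other->dirty) { /* owner supplies updated value */
            mem_x = other->value;           /* memory updated as it passes  */
            other->dirty = 0;
        }
        self->value = mem_x;
        self->valid = 1;
        self->dirty = 0;
    }
    return self->value;
}

void write_x(Cache *self, Cache *other, int v) {
    other->valid = 0;    /* broadcast invalidate on the bus */
    self->valid = 1;
    self->dirty = 1;     /* exclusive, memory now stale     */
    self->value = v;
}
```

Replaying the table: A reads 0, B reads 0, A writes 1 (B invalidated, memory still 0), then B's read misses and A supplies the value 1, updating memory.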
Memory Consistency
• When are writes seen by other processors?
‣ “Seen” means a read returns the written value
‣ Can’t be instantaneous
• Assumptions
‣ A write completes only when all processors have seen it
‣ A processor does not reorder writes with other accesses
• Consequence
‣ P writes X then writes Y
⇒ all processors that see new Y also see new X
‣ Processors can reorder reads, but not writes
Next Lecture
• Parallel Computing and Performance
‣ Definitions
‣ Speedup and Efficiency
‣ Return to Amdahl’s Law
‣ Scaling
‣ Load balancing