Digital Systems Design 4: Parallel Computing Architecture 1
Stewart Smith
This Lecture
• Parallel computing aspects of previous topics
  ‣ Instruction set architecture – synchronisation (Textbook §2.11)
  ‣ Arithmetic systems – subword parallelism (Textbook §3.6)
  ‣ Pipelining – multiple-issue processors (Textbook §4.10)
  ‣ Multicore processors and memory (Textbook §5.10)
Parallelism in Instruction Set Architecture
• Synchronisation of instructions
• Two processors sharing an area of memory
  ‣ P1 writes, then P2 reads
  ‣ Data race if P1 and P2 don’t synchronise
    - Result depends on order of accesses
• Hardware support required
  ‣ Atomic read/write memory operation
  ‣ No other access to the location allowed between the read and write
• Could be a single instruction
  ‣ E.g., atomic swap of register ↔ memory
  ‣ Or an atomic pair of instructions
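A minimal C sketch of the data race described above (illustrative, not from the slides): two threads update a shared counter with no synchronisation, so the final value depends on how their read-modify-write sequences interleave.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;                  /* shared location, no synchronisation */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                 /* load, add, store: not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t p1, p2;
        pthread_create(&p1, NULL, worker, NULL);
        pthread_create(&p2, NULL, worker, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        /* Usually prints less than 2000000: the two increments interleave. */
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }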
Synchronisation in MIPS
• Load linked: ll rt, offset(rs)
• Store conditional: sc rt, offset(rs)
  ‣ Succeeds if location not changed since the ll
    - Returns 1 in rt
  ‣ Fails if location is changed
    - Returns 0 in rt
• Example: atomic swap (to test/set lock variable)

    try: add  $t0, $zero, $s4   ;copy exchange value
         ll   $t1, 0($s1)       ;load linked
         sc   $t0, 0($s1)       ;store conditional
         beq  $t0, $zero, try   ;branch if store fails
         add  $s4, $zero, $t1   ;put load value in $s4
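A sketch of how such an atomic exchange is typically used to build a spin lock, written with C11 atomics (illustrative names, not from the slides); on MIPS-like ISAs the exchange compiles down to an ll/sc retry loop much like the one above.

    #include <stdatomic.h>

    atomic_int lock = 0;                       /* 0 = free, 1 = held */

    void acquire(void) {
        /* Atomically swap 1 into the lock and examine the old value;
           keep trying until the old value was 0, i.e. the lock was free. */
        while (atomic_exchange(&lock, 1) != 0)
            ;                                  /* spin */
    }

    void release(void) {
        atomic_store(&lock, 0);                /* hand the lock back */
    }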
Subword Parallelism
• Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
  ‣ Example: 128-bit adder:
    - Sixteen 8-bit adds
    - Eight 16-bit adds
    - Four 32-bit adds
• Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)
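A small C illustration of subword parallelism (a sketch assuming the GCC/Clang vector extension; not part of the slides): one 128-bit addition performs sixteen 8-bit additions at once.

    #include <stdint.h>
    #include <stdio.h>

    typedef int8_t v16i8 __attribute__((vector_size(16)));   /* 16 × 8-bit lanes */

    int main(void) {
        v16i8 a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
        v16i8 b = { 1, 1, 1, 1, 1, 1, 1, 1, 1,  1,  1,  1,  1,  1,  1,  1 };
        v16i8 c = a + b;              /* one 128-bit add = sixteen 8-bit adds */
        for (int i = 0; i < 16; i++)
            printf("%d ", c[i]);
        printf("\n");
        return 0;
    }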
Parallelism in FP Arithmetic
• Parallel programs may interleave operations in unexpected orders
  ‣ Assumptions of associativity may fail

    Variable   Value        Sum 1: (X+Y)+Z   Sum 2: X+(Y+Z)
    X          -1.50E+38    -1.50E+38
    Y           1.50E+38     0.00E+00
    Z           1.00E+00     1.00E+00         1.50E+38
    Result                   1.00E+00         0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
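The effect can be reproduced directly in single-precision C (a sketch using the values from the table above):

    #include <stdio.h>

    int main(void) {
        float x = -1.5e38f, y = 1.5e38f, z = 1.0f;

        float sum1 = (x + y) + z;   /* 0.0 + 1.0 = 1.0 */
        float sum2 = x + (y + z);   /* y + z rounds to 1.5e38, so x + (y+z) = 0.0 */

        printf("(x+y)+z = %g\n", sum1);
        printf("x+(y+z) = %g\n", sum2);
        return 0;
    }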
Instruction Level Parallelism
• Pipelining: executing multiple instructions in parallel
• To increase ILP
  ‣ Deeper pipeline
    - Less work per stage ⇒ shorter clock cycle
  ‣ Multiple issue
    - Replicate pipeline stages ⇒ multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4 GHz 4-way multiple-issue
      16 BIPS, peak CPI = 0.25, peak IPC = 4
Multiple Issue
• Static multiple issue
  ‣ Compiler groups instructions to be issued together
  ‣ Packages them into “issue slots”
  ‣ Compiler detects and avoids hazards
• Dynamic multiple issue
  ‣ CPU examines the instruction stream and chooses instructions to issue each cycle
  ‣ Compiler can help by reordering instructions
  ‣ CPU resolves hazards using advanced techniques at runtime
Speculation
• “Guess” what to do with an instruction
  ‣ Start operation as soon as possible
  ‣ Check whether guess was right
    - If so, complete the operation
    - If not, roll back and do the right thing
• Common to static and dynamic multiple issue
• Examples
  ‣ Speculate on branch outcome
    - Roll back if path taken is different
  ‣ Speculate on load
    - Roll back if location is updated
Meltdown and Spectre
Static Multiple Issue
• Compiler groups instructions into “issue packets”
  ‣ Group of instructions that can be issued in a single cycle
  ‣ Determined by pipeline resources required
• Think of an issue packet as a very long instruction
  ‣ Specifies multiple concurrent operations ⇒ Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
• Compiler must remove some/all hazards
  ‣ Reorder instructions into issue packets
  ‣ No dependencies within a packet
  ‣ Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  ‣ Pad with nop if necessary
MIPS with Static Dual Issue
• Two-issue packets
  ‣ One ALU/branch instruction
  ‣ One load/store instruction
  ‣ 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

    Address   Instruction type   Pipeline stages
    n         ALU/branch         IF  ID  EX  MEM  WB
    n + 4     Load/store         IF  ID  EX  MEM  WB
    n + 8     ALU/branch             IF  ID  EX   MEM  WB
    n + 12    Load/store             IF  ID  EX   MEM  WB
    n + 16    ALU/branch                 IF  ID   EX   MEM  WB
    n + 20    Load/store                 IF  ID   EX   MEM  WB
Hazards in Dual Issue MIPS
• More instructions executing in parallel
• EX data hazard
  ‣ Forwarding avoided stalls with single issue
  ‣ Now can’t use ALU result in load/store in the same packet

      add $t0, $s0, $s1
      lw  $s2, 0($t0)

    - Split into two packets, effectively a stall
• Load-use hazard
  ‣ Still one-cycle use latency, but now two instructions
• More aggressive scheduling required
Scheduling Example
• Schedule this code for dual-issue MIPS

    Loop: lw   $t0, 0($s1)       # $t0 = array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0
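A plausible C-level view of this loop (illustrative only; it assumes $s1 walks backwards through a word array x[] and $s2 holds the scalar s):

    void add_scalar(int x[], int n, int s) {
        for (int i = n; i != 0; i--)      /* addi decrements, bne tests for zero */
            x[i] = x[i] + s;              /* lw, addu, sw */
    }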
             ALU/branch               Load/store          Cycle
    Loop:    nop                      lw   $t0, 0($s1)      1
             addi $s1, $s1, -4        nop                   2
             addu $t0, $t0, $s2       nop                   3
             bne  $s1, $zero, Loop    sw   $t0, 4($s1)      4

  ‣ IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
Loop Unrolling
• Replicate loop body to expose more parallelism
  ‣ Reduces loop-control overhead
• Use different registers per replication
  ‣ Called “register renaming”
  ‣ Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - AKA “name dependence”: reuse of a register name
Loop Unrolling Example
• Same code as the previous example; assume the loop index is a multiple of 4

             ALU/branch               Load/store          Cycle
    Loop:    addi $s1, $s1, -16       lw   $t0, 0($s1)      1
             nop                      lw   $t1, 12($s1)     2
             addu $t0, $t0, $s2       lw   $t2, 8($s1)      3
             addu $t1, $t1, $s2       lw   $t3, 4($s1)      4
             addu $t2, $t2, $s2       sw   $t0, 16($s1)     5
             addu $t3, $t3, $s2       sw   $t1, 12($s1)     6
             nop                      sw   $t2, 8($s1)      7
             bne  $s1, $zero, Loop    sw   $t3, 4($s1)      8
  ‣ IPC = 14/8 = 1.75
    - Closer to 2, but at the cost of registers and code size
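A C-level sketch of the unrolled loop (illustrative; four independent temporaries stand in for $t0–$t3, and the element count is assumed to be a multiple of 4 as stated above):

    void add_scalar_unrolled(int x[], int n, int s) {
        for (int i = n; i != 0; i -= 4) {      /* one decrement per four elements */
            int t0 = x[i]     + s;
            int t1 = x[i - 1] + s;
            int t2 = x[i - 2] + s;
            int t3 = x[i - 3] + s;             /* separate temporaries: no
                                                  anti-dependences between copies */
            x[i]     = t0;
            x[i - 1] = t1;
            x[i - 2] = t2;
            x[i - 3] = t3;
        }
    }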
Dynamic Multiple Issue
• “Superscalar” processors
• CPU decides whether to issue 0, 1, 2, ... instructions each cycle
  ‣ Avoiding structural and data hazards
• Avoids the need for compiler scheduling
  ‣ Though it may still help
  ‣ Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
• Allow the CPU to execute instructions out of order to avoid stalls
  ‣ But commit results to registers in order
• Example

      lw   $t0, 20($s2)
      addu $t1, $t0, $t2
      sub  $s4, $s4, $t3
      slti $t5, $s4, 20

  ‣ Can start sub while addu is waiting for lw
Dynamically Scheduled CPU
[Figure: dynamically scheduled CPU. Reservation stations hold pending operands and can supply operands to issued instructions; a reorder buffer holds register writes so results commit in order, preserving dependencies; results are also sent to any waiting reservation stations.]
Power Efficiency
• Complexity of dynamic scheduling and speculation requires power
• Multiple simpler cores may be better
    Microprocessor             Year   Clock rate   Pipeline stages   Issue width   Out-of-order/Speculation   Cores   Power
    i486                       1989     25 MHz           5                1                  No                  1       5 W
    Pentium                    1993     66 MHz           5                2                  No                  1      10 W
    Pentium Pro                1997    200 MHz          10                3                  Yes                 1      29 W
    P4 Willamette              2001   2000 MHz          22                3                  Yes                 1      75 W
    P4 Prescott                2004   3600 MHz          31                3                  Yes                 1     103 W
    Intel Core                 2006   3000 MHz          14                4                  Yes                 2      75 W
    Intel Core i7 Nehalem      2008   3600 MHz          14                4                  Yes                2–4     87 W
    Intel Core Westmere        2010   3730 MHz          14                4                  Yes                 6     130 W
    Intel Core i7 Ivy Bridge   2012   3400 MHz          14                4                  Yes                 6     130 W
    Intel Core Broadwell       2014   3700 MHz          14                4                  Yes                10     140 W
    Intel Core i9 Skylake      2016   3100 MHz          14                4                  Yes                14     165 W
    Intel Ice Lake             2018   4200 MHz          14                4                  Yes                16     185 W
Parallelism in Memory: Cache Coherence Problems
• Suppose two CPU cores share a physical address space
  ‣ Write-through caches

    Time step   Event                 CPU A’s cache   CPU B’s cache   Memory
    0                                                                   0
    1           CPU A reads X               0                           0
    2           CPU B reads X               0               0           0
    3           CPU A writes 1 to X         1               0           1
Coherence Defined
• Informally: reads return the most recently written value
• Formally:
  ‣ P writes X; P reads X (no intervening writes)
    ⇒ read returns written value
  ‣ P1 writes X; P2 reads X (sufficiently later)
    ⇒ read returns written value
    - c.f. CPU B reading X after step 3 in the example
  ‣ P1 writes X, P2 writes X
    ⇒ all processors see writes in the same order
    - End up with the same final value for X
Cache Coherence Protocols
• Operations performed by caches in multiprocessors to ensure coherence
  ‣ Migration of data to local caches
    - Reduces bandwidth for shared memory
  ‣ Replication of read-shared data
    - Reduces contention for access
• Snooping protocols
  ‣ Each cache monitors bus reads/writes
• Directory-based protocols
  ‣ Caches and memory record sharing status of blocks in a directory
Invalidating Snooping Protocols
• Cache gets exclusive access to a block when it is to be written
  ‣ Broadcasts an invalidate message on the bus
  ‣ Subsequent read in another cache misses
    - Owning cache supplies updated value

    CPU activity          Bus activity        CPU A’s cache   CPU B’s cache   Memory
                                                                                0
    CPU A reads X         Cache miss for X          0                           0
    CPU B reads X         Cache miss for X          0               0           0
    CPU A writes 1 to X   Invalidate for X          1                           0
    CPU B reads X         Cache miss for X          1               1           1
Memory Consistency
• When are writes seen by other processors?
  ‣ “Seen” means a read returns the written value
  ‣ Can’t be instantaneous
• Assumptions
  ‣ A write completes only when all processors have seen it
  ‣ A processor does not reorder writes with other accesses
• Consequence
  ‣ P writes X then writes Y
    ⇒ all processors that see the new Y also see the new X
  ‣ Processors can reorder reads, but not writes
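A C11-level analogue of the consequence above (a sketch with illustrative names, not from the slides): because the two writes are not reordered, a reader that sees the new flag must also see the new data.

    #include <stdatomic.h>

    int data = 0;                      /* "X" */
    atomic_int flag = 0;               /* "Y" */

    void writer(void) {                /* P writes X then writes Y */
        data = 42;
        atomic_store_explicit(&flag, 1,
                              memory_order_release);   /* not reordered before the write to data */
    }

    void reader(void) {                /* a processor that sees new Y also sees new X */
        if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
            int x = data;              /* guaranteed to read 42 here */
            (void)x;
        }
    }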
Next Lecture
• Parallel Computing and Performance
  ‣ Definitions
  ‣ Speedup and efficiency
  ‣ Return to Amdahl’s Law
  ‣ Scaling
  ‣ Load balancing