Lect. 3: Shared Memory Multiprocessors
Have a single OS for the whole system, support both processes and threads, and appear as a common multiprogrammed system. (Thus, by this definition, Beowulf clusters are not multiprocessors.)
Obtained by connecting full processors together
– Processors have their own connection to memory
– Processors are capable of independent execution and control
(Thus, by this definition, a GPU is not a multiprocessor, as GPU cores are not capable of independent execution, but the 2nd generation Xeon Phi is!)
▪ Can be used to run multiple sequential programs concurrently or parallel programs
▪ Suitable for parallel programs where threads can follow different code (task-level parallelism)
Parallel Architectures – 2019-20 !1
Shared Memory Multiprocessors
Recall the two common organizations:
– Physically centralized memory, uniform memory access (UMA) (a.k.a. SMP)
– Physically distributed memory, non-uniform memory access (NUMA)
[Figure: UMA — all CPUs connect to a single main memory; NUMA — each CPU has its own local memory]
(Note: both organizations have local caches)
Shared Memory Multiprocessors
Recall the communication model:
– Threads in different processors can use the same virtual address space
– Communication is done through shared memory variables
Producer (p1): data = 10;
Consumer (p2): x = data * y;
Shared Memory Multiprocessors
Recall the communication model:
– Threads in different processors can use the same virtual address space
– Communication is done through shared memory variables
– Explicit synchronization (e.g., variable flag below)
// Initially: flag = 0
Producer (p1): data = 10; flag = 1;
Consumer (p2): while (!flag); x = data * y;
HW Support for Shared Memory
▪ Cache Coherence
– Caches + multiprocessors → stale values
– System must behave as if caches are not present
– Cache coherence is the implementation that ensures this via:
▪ Write propagation
▪ Write serialization
▪ Primitive synchronization instructions
– Certain hardware instructions are extremely useful/required for
achieving synchronisation
– Memory fences: memory ordering on demand
– Read-Modify-writes: support for locks (critical sections)
– Transactional memory extensions
Memory Consistency
– When (in what order) should writes propagate?
– What value should a read return?
– Memory consistency is a specification.
Cache Coherence
// Initially: flag = 0
Producer (p1): data = 10; flag = 1;
Consumer (p2): while (!flag); x = data * y;
The update to flag (and data) should be (eventually) visible to p2
Cache Coherence
Make Caches Invisible!
[Figure: both organizations, each CPU with a private cache between it and main memory]
Memory Consistency
// Initially: flag = 0
Producer (p1): data = 10; flag = 1;
Consumer (p2): while (!flag) {} x = data * y;
If p2 sees the update to flag, will p2 see the update to data?
Primitive Synchronization
// Initially: flag = 0
Producer (p1): data = 10; flag = 1;
Consumer (p2): while (!flag) {} x = data * y;
If p2 sees the update to flag, will it see the update to data?
The Cache Coherence Problem
Per-CPU timelines for a shared variable A (initially A=1 in memory):
CPU1: T0 A not cached; T1 load A (A=1); T2 A=1; T5 load A (A=1) → uses stale value!
CPU2: T0–T3 A not cached; T4 load A (A=1) → uses old value
CPU3: T0–T1 A not cached; T2 load A (A=1); T3 store A (A=2); T4 A=2
Memory: A=1 from T0 to T4 (the new value has not propagated)
Cache Coherence
Write Propagation
– writes are (eventually) visible in all processors
Write Serialization
– Writes are observable in the same order from all processors
[Example: initially all values are 0; P1–P4 must all observe any two writes in the same order]
Cache Coherence Protocols
Idea of most protocols:
– Keep track of which processors have copies of what data
– Enforce that at any given time a single value of every datum exists:
▪ By getting rid of copies of the data with old values → invalidate protocols
▪ By updating everyone’s copy of the data → update protocols
– They enforce the Single-Writer-Multiple-Reader (SWMR) invariant.
▪ There are two alternating epochs: a single-writer epoch and a multiple-reader epoch
▪ Before a write happens, the writer must communicate with all other actors to obtain exclusive access.
Cache Coherence Protocols
– Guarantee that old values are eventually invalidated/updated (write propagation)
– Guarantee that only a single processor is allowed to modify a certain datum at any given time (write serialization)
Write-invalidate Example
Per-CPU timelines for A (initially A=1 in memory):
CPU1: T1 load A (A=1); T2 A=1; T3 A not cached (invalidated); T4 A not cached; T5 load A (A=2)
CPU2: T1–T3 A not cached; T4 load A (A=2)
CPU3: T1 A not cached; T2 load A (A=1); T3 store A (A=2) → invalidate other copies; T4 A=2
Memory: A=1 from T1 to T4 (updated only when the modified block is written back)
Write-update Example
Per-CPU timelines for A (initially A=1 in memory):
CPU1: T1 load A (A=1); T3 A=2 (updated); T5 load A (A=2)
CPU2: T1–T3 A not cached; T4 load A (A=2) → new value
CPU3: T1 A not cached; T2 load A (A=1); T3 store A (A=2) → update other copies; T4 A=2
Memory: T1 A=1; T2 A=1; T3 A=2; T4 A=2 (the update is written through to memory)
Invalidate vs. Update Protocols
Update:
+ New value can be re-used without the need to ask for it again
+ Data can always be read from memory
+ Modified blocks can be evicted from caches silently
– Possible multiple useless updates (more bandwidth)
Usually used with write-through caches (less popular)
Invalidate:
+ Multiple writes by the same processor to the cache block only require one
invalidation
+ No need to send the new value of the data (less bandwidth)
– Caches must be able to provide up-to-date data upon request
– Must write-back data to memory when evicting a modified block
Usually used with write-back caches (more popular)
Cache Coherence Protocols
Implementation can either be in software or hardware.
▪ Software coherence: Programmer-directed or compiler-controlled
▪ Expose instructions such as writeback and self-invalidate to software
Insert these at appropriate points, e.g., by leveraging static analysis.
Problem: conservatism of static analysis (or programmer burden)
▪ Hardware coherence
Add state bits to cache lines to track state of the line
– Most common: Modified, Owned, Exclusive, Shared, Invalid
– Protocols usually named after the states supported
Cache lines transition between states upon load/store operations from the local processor and by remote processors
These state transitions must guarantee SWMR and in doing so:
write propagation and write serialization
Example: MSI Protocol
– Modified (M): block is cached only in this cache and has been modified
– Shared (S): block is cached in this cache and possibly in other caches (no cache can modify the block)
– Invalid (I): block is not cached
Example: MSI Protocol
Transactions originated at this CPU:
– I → S: CPU read miss
– I → M: CPU write miss
– S → M: CPU write (upgrade)
– S → S: CPU read hit
– M → M: CPU read hit, CPU write hit
Example: MSI Protocol
Transactions originated at other CPUs:
– M → S: remote read miss
– M → I: remote write miss
– S → S: remote read miss
– S → I: remote write miss
Example: MESI Protocol
State E is obtained on reads when no other processor has a shared copy
– All processors must answer if they have copies or not
– Or some device must know if processors have copies
– Modified (M): block is cached only in this cache and has been modified
– Exclusive (E): block is cached only in this cache, has not been modified, but can be modified at will
– Shared (S): block is cached in this cache and possibly in other caches
– Invalid (I): block is not cached
Advantage over MSI
– Often variables are loaded, modified in register, and then stored
– The store on state E then does not require asking for permission to write
Example: MESI Protocol
Transactions originated at this CPU:
– I → E: CPU read miss & no sharing
– I → S: CPU read miss & sharing
– I → M: CPU write miss
– S → M: CPU write (upgrade; must inform everyone)
– E → M: CPU write hit (can be done silently)
– E, S, M: CPU read hits stay in the same state
Example: MESI Protocol
Transactions originated at other CPUs:
– E → S: remote read miss
– E → I: remote write miss
– S → S: remote read miss
– S → I: remote write miss
– M → S: remote read miss
– M → I: remote write miss
Possible Implementations
Two ways of implementing coherence protocols in hardware
– Snooping: all cache controllers monitor all other caches’ activities and maintain the state of their lines
▪ Commonly used with buses and in many CMPs today
– Directory: a central control device directly handles all cache activities and tells the caches what transitions to make
▪ Can be of two types: physically centralized and physically distributed
▪ Commonly used with scalable interconnects and in many CMPs today
Behavior of Cache Coherence Protocols
▪ Uniprocessor cache misses (the 3 C’s):
– Cold (or compulsory) misses: when a block is accessed for the first time
– Capacity misses: when a block is not in the cache because it was evicted because the cache was full
– Conflict misses: when a block is not in the cache because it was evicted because the cache set was full
▪ Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor
– Hard to reduce → relates to intrinsic communication and sharing of data in the parallel application
– False sharing coherence misses: processors modify different words of the cache block (no real communication or sharing) but end up invalidating the complete block
Behavior of Cache Coherence Protocols
False sharing misses increase with larger cache line sizes
– Only true sharing remains with single-word/byte cache lines
False sharing misses can be reduced with better placement of data in memory
True sharing misses tend to decrease with larger cache line sizes (due to locality)
Classifying misses in a multiprocessor is not straightforward
– E.g., if P0 has line A in the cache and evicts it due to capacity limitation, and later P1 writes to the same line: is this a capacity or a coherence miss?
It is both, as fixing one problem (e.g., increasing cache size) won’t fix the other (see Figure 5.20 of Culler&Singh for a complete decision chart)
Behavior of Cache Coherence Protocols
Bottom-line: threads don’t usually read and write the same data indiscriminately
Common types of data access patterns
– Private: data that is only accessed by a single processor
– Read-only shared: data that is accessed by multiple processors but only for
reading (this includes instructions)
– Migratory: data that is used and modified by multiple processors, but in turns
– Producer-consumer: data that is updated by one processor and consumed by another
– Read-write: data that is used and modified by multiple processors simultaneously
▪ Data used for synchronization
References and Further Reading
“Parallel Computer Architecture”, David E. Culler, Jaswinder Pal Singh (Chapter 5).
“A Primer on Memory Consistency and Cache Coherence”, Daniel J. Sorin, Mark D. Hill, David A. Wood.