CS3350B Computer Organization Chapter 5: Parallel Architectures
Alex Brandt
Department of Computer Science University of Western Ontario, Canada
Monday March 29, 2021
Outline
1 Introduction
2 Multiprocessors and Multi-core processors
3 Cache Coherency
4 False Sharing
5 Multithreading
Needing Multicore Architectures
Recall:
Processor-Memory gap.
Power Wall.
Moore’s law failing?
Great Ideas in Computer Architecture: Performance via Parallelism.
New reasons:
ILP has hit a peak within current power/thermal constraints.
Can leverage vector operations and SIMD processors but this limits parallelism to a single instruction type.
Needing Multicore: Performance Plot
SIMD Processors
SIMD processors are sometimes called vector processors.
However, not all SIMD processors are vector processors.
They execute a Single Instruction on Multiple Data elements.
In a vector processor the data elements must be adjacent.
SSE (Streaming SIMD Extension) extends x86 for vectorized instrs.
↪ Modern CPUs: Intel, AMD.
Ex: The loop can be unrolled and executed using a 128-bit (4×32-bit int) vectorized instruction.
// Original loop
int i = 0;
for (i; i < n; i++) {
    A[i] += 10;
}

// Unrolled by a factor of 4
int i = 0;
for (i; i < n; i += 4) {
    A[i]   += 10;
    A[i+1] += 10;
    A[i+2] += 10;
    A[i+3] += 10;
}
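As a rough sketch of what each unrolled iteration looks like with SSE2 intrinsics (addTenSSE is a hypothetical helper, not from the slides; assumes n is a multiple of 4):

#include <emmintrin.h>  // SSE2 intrinsics

void addTenSSE(int* A, int n) {
    __m128i tens = _mm_set1_epi32(10);                  // four copies of 10
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i*)&A[i]);   // load 4 ints
        v = _mm_add_epi32(v, tens);                     // one instruction, 4 adds
        _mm_storeu_si128((__m128i*)&A[i], v);           // store 4 ints
    }
}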
Flynn’s Taxonomy
Flynn’s Taxonomy is a characterization of an architecture’s parallelism.
                          Single Data Stream   Multiple Data Streams
Single Instr. Stream      SISD                 SIMD
Multiple Instr. Streams   MISD                 MIMD
SISD – the normal architecture; basic von Neumann architecture.
SIMD – single instruction applied to multiple data elements.
Sometimes called a vector processor.
MISD – obscure and not usually used. A data pipeline where each stage performs a different operation.
MIMD – a multiprocessor; multiple processors fetch different instructions and operate on different data.
Outline
1 Introduction
2 Multiprocessors and Multi-core processors
3 Cache Coherency
4 False Sharing
5 Multithreading
Multiprocessors
Multiprocessors belong to the MIMD type.
Multiple, independent processors executing different instructions/programs.
A computer with literally multiple processors.
Multi-core Processors
Multi-core processors are still MIMD type but differ from multiprocessors.
Also called chip-level multiprocessor; the processor itself has multiple processors (datapaths).
Contrast with superscalar (which is only SISD).
Each core can operate a completely different instruction stream.
Modern Multi-Core Circuitry
[Die photo: separate L2 cache per core.]
Multiprocessor vs Multi-core
Each processor in a multiprocessor system can also be multi-core. To the operating system, each core is seen as an independent processor.
But notice that cores within a processor usually share some cache. This gives better performance to processes which need to share information (e.g. multiple threads within a single process).
Simultaneous Multi-threading
Simultaneous Multi-threading (SMT) is hardware-level support within a single core to handle multiple threads at once.
One SMT core presents itself to the OS as multiple processors, one for each possible thread.
Called hyper-threading by Intel.
The hardware is not completely duplicated as it is with separate cores, but some parts of the datapath are duplicated.
Sort of like superscalar for ILP but with different instruction streams.
[Figure: within an SMT core.]
Pros and Cons of Multi (core) Processors
Possible throughput increases are stellar.
Allows multi-tasking in OS.
Parallelism tackles the power wall, Moore's law, and performance bottlenecks.
However, a single program cannot always make use of multiple cores or processors.
Must program explicitly for parallelism.
Parallel programming is hard.
↪ Must explicitly program for thread-level parallelism (TLP).
↪ Some tools try to help: compiler vectorization, OpenMP, Cilk (see the sketch below).
Still only one global memory.
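For instance, OpenMP expresses thread-level parallelism with a single pragma. A minimal sketch (assumes compilation with -fopenmp; the function name parallelAddOMP is illustrative, not from the slides):

void parallelAddOMP(int* a, int n) {
    // Distribute the loop iterations across the available threads/cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] += 10;
    }
}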
Multi-core Configurations
Outline
1 Introduction
2 Multiprocessors and Multi-core processors
3 Cache Coherency
4 False Sharing
5 Multithreading
The Problem of Cache Coherence
If each processor (core) has its own (L1) cache then it also has its own copy of memory.
If two cores both have a copy, how does one core get the write updates of another core?
This is the problem of cache coherence.
Writing back to lower levels of cache which are shared by multiple processors (cores) is not sufficient; the change must also be propagated upward to the other caches.
This is particularly a problem if this lowest level is very slow memory. What can we do?
Note: cache coherency is usually performed on a per cache block basis, not per memory word/address.
Cache Coherency and Consistency
Coherency: If processor P1 reads a value of a particular memory address after processor P2 writes to it (and no other writes have occurred to that address) then P1 must read the value P2 wrote.
(Sequential) Consistency: all writes to a memory address must be performed in some sequential order.
Coherency is about what value is read, while consistency is about when (in what order) writes become visible.
Cache coherency requires two parts in a solution:
Write propagation: writes to a cache block must be propagated to other copies of that same cache block in other caches.
Transaction serialization: reads and writes to a particular address must be seen by all processors in the same order.
Snooping Policies
Whenever a processor reads from or writes to a lower level of cache, these transactions are broadcast on some shared global bus.
A snooping policy (a.k.a bus sniffing) has all cores monitor this bus for all requests and responses and act accordingly.
Write invalidate protocol: when a write is broadcast, all other cores which share a copy of the address being written to invalidate their copies.
Write update protocol: when a write is broadcast, all other cores which share a copy of that address being written to update their local copy with the value which was broadcast.
In either case, serialization is handled by mutually exclusive access to the global bus; a core stalls if the bus is busy.
Invalidate vs Update
Consider multiple writes to the same memory address without any reads. Consider writes to adjacent memory words within a single cache block.
Both of these cases are common due to temporal/spatial locality.
Both of these cases require multiple update signals and new values to be sent across the bus.
However, only one invalidate signal would be needed in either case.
Bus bandwidth is a precious resource.
Invalidate protocols are preferred because they require less bandwidth; modern CPUs use invalidate protocols rather than update protocols.
The MESI Protocol (1/2)
MESI: an invalidate snooping protocol which adds several optimizations to further reduce bus bandwidth.
Recall that a normal cache has a “valid” state and, if using a write-back policy, a “dirty” state.
MESI adds “Exclusive” and “Shared” states to the usual “Modified” (dirty) and “Invalid” states. If a cache block is “Exclusive”, a write to it need not be broadcast.
MESI can be described by looking at what happens on read misses, read hits, write misses, and write hits.
The MESI Protocol (2/2)
Each cache block in each processor’s cache has one of 4 states:
Modified: The cache block is exclusively owned by the processor and is dirty (i.e. it differs from main memory).
Exclusive: The cache block is exclusively owned by the processor and is clean (i.e. the same value as main memory).
Shared: The cache block is shared between multiple processors and is clean.
Invalid: The cache block was loaded into cache but its value is no longer valid.
MESI Read Hit
The cache block's state must be one of M, E, or S for the access to be considered a hit. Nothing to do; just return the value and do not change states.
If state is M then the cache block is dirty but that’s fine for this particular core.
MESI Read Miss (1/2)
Case 1: No copies anywhere.
The requesting core waits for response from lower level memory. The value is stored in cache with state E.
Case 2: One cache has an E copy.
The snooping cache hears read request, puts copy of value on bus. The access to lower level memory is abandoned.
The requesting core reads value from bus.
Both cores set cache block’s state to S.
MESI Read Miss (2/2)
Case 3: Several caches have an S copy.
One snooping cache hears request, puts copy on bus.
The access to lower level memory is abandoned.
The requesting core reads value from bus and sets state to S. All copies are still in state S.
Case 4: One cache has an M copy.
The snooping cache hears read request, puts copy on bus.
The access to lower level memory is abandoned.
The requesting core reads value from bus and sets state to S.
The snooping core sets state to S and writes updated value back to main memory.
MESI Write Hit
The cache block's state must be one of M, E, or S for the access to be considered a hit.
If in M state:
Cache block is exclusively owned and already dirty. Update cache value, no state change.
If in E state:
Cache block is exclusively owned and clean. Update cache value, set state to M.
If in S state:
Cache block is shared but clean.
Requesting core broadcasts invalidate on the bus. Snooping cores with S copy change to I state.
Requesting core updates cache value and sets state to M.
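As a toy software model of just these write-hit transitions (an illustration only, not how a cache controller is implemented in hardware; the names are made up):

enum class State { Modified, Exclusive, Shared, Invalid };

// MESI transition for a write hit (block present in state M, E, or S).
State writeHit(State s, bool& broadcastInvalidate) {
    broadcastInvalidate = false;
    switch (s) {
        case State::Modified:               // exclusively owned, already dirty
            return State::Modified;         // just update the value
        case State::Exclusive:              // exclusively owned, clean
            return State::Modified;         // becomes dirty; no bus traffic
        case State::Shared:                 // other caches hold clean copies
            broadcastInvalidate = true;     // they must change to Invalid
            return State::Modified;
        default:                            // Invalid would be a write miss
            return s;
    }
}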
MESI Write Miss
This one is tricky: we need to read first and then write.
Case 1: No other copies.
Read from main memory, write new value in cache, set state to M.
Case 2: The other copy is in the E state, or other copies are in the S state.
The requesting cache issues a Read With Intent To Modify (RWITM).
The snooping cache(s) hear the RWITM request, put a copy on the bus, and set their own state to I.
Requester reads value from bus, updates value, sets state to M.
Case 3: Other copy in M state
The snooping cache hears RWITM, puts copy on bus, writes back to main memory, sets its own state to I.
Requester reads value from bus, updates value, sets state to M.
MESI: An Example
Consider a system with 3 processors where all processors are referencing the same cache block.
Using the MESI protocol, fill in the table to describe the cache block's state in each processor's cache.
Request     P1   P2   P3   Data Supplier
Initially   -    -    -    -
P1 Read     E    -    -    Main Mem.
P1 Write    M    -    -    -
P3 Read     S    -    S    P1 Cache
P3 Write    I    -    M    -
P1 Read     S    -    S    P3 Cache
P3 Read     S    -    S    -
P2 Read     S    S    S    P1 or P3 Cache
Outline
1 Introduction
2 Multiprocessors and Multi-core processors
3 Cache Coherency
4 False Sharing
5 Multithreading
Block Size and Cache Coherency
Recall: Memory always moved to and from a cache as a full cache block.
In MESI, a write invalidates an entire cache block.
Cache blocks larger than 1 word can cause false sharing.
↪ Two cores are writing to different memory addresses that just happen to be in the same cache block.
↪ False sharing disproportionately increases cache misses in write-invalidate schemes.
↪ Can cause thrashing of invalidate signals.
↪ Not a problem if cores share a cache (but, in modern processors, each core has its own L1 cache).
[Figure: Core1 and Core2 write to different words, A and B, within the same 4-word cache block.]
True Sharing vs False Sharing
True sharing is caused by communication of data between threads. Data is truly invalidated and must be read again.
Would still occur if cache block size was 1 word.
False sharing is caused by multiple processors writing to different memory addresses within a single cache block.
The cache block is shared but no word within the block is actually shared.
Miss would not occur if cache block size was 1 word.
Invalidation on each write causes a cache miss for the other processor. This goes back and forth, thrashing.
Multiprocessor Cache Performance
[Plot: memory cycles per instruction.]
False Sharing Example
Let x1 and x2 belong in the same cache block. Processor 1 and Processor 2 each want to access either x1 or x2. Let us assume both processors begin with the cache block loaded into cache.
Time   P1         P2         True, False, Hit? Why?
1      Write x1              Hit; invalidate x1 in P2
2                 Read x2    False miss; x1 irrelevant to P2
3      Write x1              Hit; x1 irrelevant to P2
4                 Write x2   False miss; x1 irrelevant to P2
5      Read x2               True miss; invalidate x2 in P1
Combating False Sharing
Change the algorithm
Change the “stride” of the loop to access data in different cache lines.
Re-evaluate: do these threads really need to be accessing memory which is this close together?
Have different threads access different, far away parts of memory.
Change the data structure
Add “padding” to data structures so that data is better aligned in memory (with respect to memory word or cache line).
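A minimal sketch of the padding idea, assuming a 64-byte cache line: alignas(64) forces each counter onto its own line, so the two threads no longer write into the same block.

#include <thread>

// Without the alignas, counters[0] and counters[1] would likely share a
// cache block and the two threads would falsely share it.
struct alignas(64) PaddedCounter {
    long value = 0;
};

PaddedCounter counters[2];

void work(int id) {
    for (int i = 0; i < 1000000; ++i)
        counters[id].value++;   // each thread touches only its own cache line
}

int main() {
    std::thread t1(work, 0);
    std::thread t2(work, 1);
    t1.join();
    t2.join();
    return 0;
}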
Outline
1 Introduction
2 Multiprocessors and Multi-core processors
3 Cache Coherency
4 False Sharing
5 Multithreading
What is Multithreading?
Multithreading is not multi-core or multiprocessor.
↪ But together they give us performance.
Multithreading is a programming model which allows for multiple, concurrent threads of execution, each with their own context.
Thread: the smallest processing unit that can be scheduled by the OS.
Context: A thread’s own and unique local variables, PC, register values, stack.
But, threads within the same process share an address space (heap).
Process: An instance of a program being executed. Usually handled by the operating system and requires a lot of overhead to instantiate and set-up properly. A process can contain many threads.
Multithreading Memory Model (1/2)
Multithreading Memory Model (2/2)
It is also possible to separate the heap into “thread-local” or private memory and shared memory.
Different programming languages allow this construct.
Multithreading and Multi-core
Using multiple threads does not require multiple cores (processors).
A single processor can handle multiple threads via time-division multiplexing: the sharing of time on the datapath between threads.
↪ This requires context switching: updating the state of the processor's register file, PC, stack pointer, etc. to match the thread's context.
With multiple cores (processors), threads can run simultaneously.
↪ Each core/processor has its own registers, etc., so no context switching is needed.
↪ If the number of threads exceeds the number of processors (cores), then multiple threads must run on one processor (core) via context switching.
Thread scheduling is hard. Generally, the OS and hardware handle this; we will mostly ignore it here.
↪ Preemptive scheduling: threads are interrupted and context switches are forced.
↪ Non-preemptive scheduling: a.k.a. cooperative scheduling; threads yield themselves to allow others to run. Threads are not interrupted.
Context Switching
To switch contexts, the context/state being switched from must first be saved in some way.
Generally, this occurs by storing all the values of the registers, PC, etc. in some special data structure (e.g. a process control block, sketched below) and storing that somewhere in memory.
↪ Usually in the operating system's memory address space.
Context switching is expensive, particularly if threads are not from the same processes.
↪ Of course, an operating system can switch between multiple processes.
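As a rough, hypothetical illustration (field names and sizes are made up, not any real OS's layout), the saved state might look like a small struct of register values:

#include <cstdint>

// Hypothetical, heavily simplified saved context: everything the CPU must
// reload into its registers to resume this thread where it left off.
struct ThreadContext {
    uint64_t pc;        // program counter of the next instruction
    uint64_t sp;        // stack pointer
    uint64_t regs[31];  // general-purpose register values
    uint64_t flags;     // status/condition flags
};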
Data Races
Via MESI, we know cache coherency is a problem.
If two threads both attempt to write to the same memory location at the same time, one must be first.
Recall: transaction serialization.
But, what order do the writes occur?
Non-Determinism
#include <iostream>
#include <thread>

void setAddress(int* addr, int val) {
    *addr = val;
}

int main(int argc, char** argv) {
    int* p = new int[1];
    *p = 0;
    // Both threads write to *p; which write lands last is non-deterministic.
    std::thread t1(setAddress, p, 1);
    std::thread t2(setAddress, p, 2);
    t1.join();
    t2.join();
    std::cerr << "p: " << *p << "\n";
    return 0;
}
Data Races In Action
More Data Races
What happens if multiple threads are reading and writing?
↪ Need synchronization between threads.

#include <iostream>
#include <thread>

void incrAddr(int* address) {
    int val = *address;
    val++;
    *address = val;
}

int main(int argc, char** argv) {
    int* p = new int[1];
    *p = 0;
    std::thread t1(incrAddr, p);
    std::thread t2(incrAddr, p);
    t1.join();
    t2.join();
    std::cerr << "p: " << *p << "\n";
    return 0;
}
What are the possible values of *p that could be printed?
Either 1 or 2: non-determinism.
A context switch could occur between reading from the address and writing back the updated result.
Scheduling Multiple Threads
The interleaving of instructions for multiple threads being executed simultaneously is non-deterministic.
Case 1:

Time Step   Thread 1       Thread 2
1           val = *addr
2           val++
3           *addr = val
4                          val = *addr
5                          val++
6                          *addr = val

The final value is 2.

Case 2:

Time Step   Thread 1       Thread 2
1           val = *addr
2                          val = *addr
3           val++
4                          val++
5           *addr = val
6                          *addr = val

The final value is 1.
The address was read twice before it was ever updated: both threads read 0 from *addr and both write 1 to *addr.
Fixing Data Races
To fix data races we need thread synchronization.
↪ Only one thread can execute some critical section at one time.
↪ This is called mutual exclusion.
We generally use locks whose “ownership” allows a thread to access a critical section.
If a thread tries to “lock” (a.k.a take/own/capture) a lock that is already locked by another thread, it waits for the lock to be unlocked and then tries to lock it again.
#include <mutex>

std::mutex mutex;

void incrAddr(int* address) {
    mutex.lock(); // wait here until the lock is successfully acquired
    int val = *address;
    val++;
    *address = val;
    mutex.unlock();
}
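In practice, the lock/unlock pair is often wrapped in an RAII guard so the mutex is released even on early returns or exceptions; a sketch using std::lock_guard:

#include <mutex>

std::mutex mutex;

void incrAddr(int* address) {
    std::lock_guard<std::mutex> guard(mutex); // locks now, unlocks when guard is destroyed
    int val = *address;
    val++;
    *address = val;
}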
Implementing a Lock
Semaphore: a counter used to control access to a critical section.
Locks usually use a specialized binary semaphore: its value is 0 or 1.
The simplest lock is a spinlock.
It waits by “spinning” until the lock can be acquired. A bad, non-working, but simple example:
void spinlock::lock() {
    int spins = 0;
    while (this->semaphore == 1) {  // spin while another thread holds the lock
        ++spins;
    }
    // BUG: the test and the set are not atomic; two threads can both exit
    // the loop and both believe they own the lock.
    this->semaphore = 1;
}
If multiple threads spinning on same spinlock, which one is first to set the semaphore to 1? (i.e. which one owns the lock?) Which one goes back to spinning?
Better Lock Implementation: “Test and Set”
“Test and set” is an atomic operation that sets a variable’s value, returning its old value.
Atomic: an operation (possibly spanning many instructions) which is viewed as happening instantly across all threads. It cannot be interrupted by a context switch.
this->setSemaphore() is atomic.
void spinlock::lock() {
    int spins = 0;
    while (this->semaphore == 1) {  // spin while the lock appears taken
        ++spins;
    }
    // Atomically set the semaphore to 1 and get its previous value.
    int oldVal = this->setSemaphore(1);
    if (oldVal == 1) {
        // another thread got there first; go back to spinning
        this->lock();
    }
}
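For comparison, a working sketch of a spinlock built on std::atomic<bool>: exchange atomically stores a new value and returns the old one, playing the role of the slide's hypothetical setSemaphore.

#include <atomic>

class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // exchange(true) atomically sets "locked" and returns its previous
        // value; if that value was true, another thread holds the lock, so spin.
        while (locked.exchange(true, std::memory_order_acquire)) {
            /* spin */
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};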
Automatic Multithreading
Some tools exist to automatically parallelize code.
↪ Cilk (GCC 5-7), OpenMP (GCC 4.2+).
These tools are good but can't replace a fine-tuned implementation by a human.
But writing efficient parallel code is even harder than getting it working in the first place.
↪ Automatic tuning of parallel code is one of my current research topics.
void parallelAdd(int* a, int n) {
    // cilk_for: each loop iteration may be done in parallel
    cilk_for (int i = 0; i < n; ++i) {
        a[i] += 10;
    }
}
Summary
Serial execution has hit many walls: memory, power, ILP.
Parallelism needed for performance.
One processor can give us vectorized instructions (SIMD).
Multi-core and multiprocessor (MIMD) gives us a way to effectively use multithreading.
Cache coherency is a problem with multiple cores, but it can be addressed with snooping policies (e.g. MESI).
False sharing impacts cache performance.
Multithreading ⇒ parallelism on multiprocessors ⇒ performance.
Want more? CS4402 - Distributed and Parallel Systems