Parallel 3 2021
Stewart Smith Digital Systems Design 4
Digital System Design 4
Parallel Computing Architecture 3
Stewart Smith Digital Systems Design 4
This Lecture
• Multicore and Multiprocessor systems
• Multi-threading in processors
Stewart Smith Digital Systems Design 4
Multicore Processors
• Effectively several processors on the same die
• Shared memory controller
• Separate L1 and usually L2 caches
• Processes don’t necessarily map to the same
processor each time they’re run (but can
request it (Processor Affinity))
• Each processor can run a separate process or
threads of the same process
Stewart Smith Digital Systems Design 4
Shared Memory Multiprocessors
• Hardware provides single physical
address space for all processors
• Synchronise shared variables using locks
• Memory access time
‣ UMA (uniform) vs. NUMA (nonuniform)
Stewart Smith Digital Systems Design 4
Example: Sum Reduction
• Sum 100,000 numbers on 100 processor UMA
‣ Each processor has ID: 0 ≤ Pn ≤ 99
‣ Partition 1000 numbers per processor
‣ Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
‣ Reduction: divide and conquer
‣ Half the processors add pairs, then quarter, …
‣ Need to synchronise between reduction steps
Stewart Smith Digital Systems Design 4
Example: Sum Reduction
half = 100; /* 100 processors in multiprocessor */
repeat
synch(); /* wait for partial sum completion */
if (half%2 != 0 && Pn == 0)
sum[0] = sum[0] + sum[half-1];
/* Conditional sum needed when half is odd;
Processor0 gets missing element */
half = half/2; /* dividing line on who sums */
if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
Stewart Smith Digital Systems Design 4
Message Passing
• Each processor has private physical address space
• Hardware sends/receives messages between
processors
Stewart Smith Digital Systems Design 4
Loosely Coupled Clusters
• Network of independent computers
‣ Each has private memory and OS
‣ Connected using I/O system
- E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
‣ Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
‣ Administration cost (prefer virtual machines)
‣ Low interconnect bandwidth
- c.f. processor/memory bandwidth on an SMP
Stewart Smith Digital Systems Design 4
Sum Reduction (again)
• Sum 100,000 on 100 processors
‣ First distribute 1000 numbers to each
‣ Then do partial sums
sum = 0;
for (i = 0; i<1000; i = i + 1)
sum = sum + AN[i];
• Reduction
‣ Half the processors send, other half receive and add
‣ Then quarter send, quarter receive and add, …
Stewart Smith Digital Systems Design 4
Sum Reduction (again)
• Given send() and receive() operations
limit = 100; half = 100;/* 100 processors */
repeat
half = (half+1)/2; /* send vs. receive
dividing line */
if (Pn >= half && Pn < limit)
send(Pn - half, sum);
if (Pn < (limit/2)) sum = sum + receive();
limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */
• Send/receive also provides synchronisation
• Assumes send/receive take similar time to addition
Stewart Smith Digital Systems Design 4
Distributed Computing
• Using multiple separate computer systems in
parallel
• May all be
‣ In the same cabinet (compute cluster)
‣ In the same room (multi-node HPC)
‣ A bunch of desktop PC’s in a department (Beowolf)
‣ Spread across the whole internet (Grid Computing,
e.g. SETI@Home)
‣ Commercial: Cloud computing frameworks
Stewart Smith Digital Systems Design 4
Grid Computing
• Disappearing, to be replaced by cloud
computing
• Still used for non-profit projects, like
SETI@Home or Folding@Home
• No central server
• Complicated to keep high uptime (people
turn their PCs off at night, etc)
• Diverse platforms, operating systems …
Stewart Smith Digital Systems Design 4
Cloud Computing
• Commercial services
• Big Players like Amazon & Google
• Can hire:
‣ File storage
‣ Compute
‣ GPU Computing (this is newer)
• Pay for what you use…
• Central → Economies of Scale
Stewart Smith Digital Systems Design 4
Processes and Threads
• Process
‣ An application or service
- [Service: A process that runs in the background,
no user interface]
‣ Has its own executable file
‣ Gets its own virtual memory space
‣ Can’t ‘see’ other processes
- [Have to use special mechanisms for inter-process
communication (IPC) ]
‣ Multitasking controlled by Operating System
Stewart Smith Digital Systems Design 4
Processes and Threads
• Thread
‣ One process may have many threads
‣ Share the same address space
‣ Not separate executables, parts of the same
program
‣ Multitasking of threads must be coded for
manually
Stewart Smith Digital Systems Design 4
Multithreading
• Performing multiple threads of execution in parallel
‣ Replicate registers, PC, etc.
‣ Fast switching between threads
• Fine-grain multithreading
‣ Switch threads after each cycle
‣ Interleave instruction execution
‣ If one thread stalls, others are executed
• Coarse-grain multithreading
‣ Only switch on long stall (e.g., L2-cache miss)
‣ Simplifies hardware, but doesn’t hide short stalls (eg,
data hazards)
Stewart Smith Digital Systems Design 4
Simultaneous Multithreading
• In multiple-issue dynamically scheduled
processor
‣ Schedule instructions from multiple threads
‣ Instructions from independent threads execute
when functional units are available
‣ Within threads, dependencies handled by
scheduling and register renaming
Stewart Smith Digital Systems Design 4
Intel Hyperthreading
• Each processor core has a double set of registers
but single cache / ALU
• Can execute two threads in parallel, by interleaving
commands (like a GPU)
• Advertising claims 2x speed-up
‣ Depends on load
‣ Depends on code itself
‣ Usually in the range 1.2x – 1.5x
• Every other manufacturer does something similar
under a different name
Stewart Smith Digital Systems Design 4
Multithreading Examples