Digital System Design 4 Parallel Computing Architecture 3
Stewart Smith Digital Systems Design 4
• •
Multicore and Multiprocessor systems Multi-threading in processors
This Lecture
Stewart Smith
Digital Systems Design 4
Multicore Processors
•
•
•
•
•
Effectively several processors on the same die Shared memory controller
Separate L1 and usually L2 caches
Processes don’t necessarily map to the same processor each time they’re run (but can request it (Processor Affinity))
Each processor can run a separate process or threads of the same process
Stewart Smith
Digital Systems Design 4
Shared Memory Multiprocessors
•
• •
‣
Hardware provides single physical address space for all processors
Synchronise shared variables using locks Memory access time
UMA (uniform) vs. NUMA (nonuniform)
Stewart Smith
Digital Systems Design 4
Example: Sum Reduction
•
‣
‣
‣
•
‣ ‣ ‣
Sum 100,000 numbers on 100 processor UMA
Each processor has ID: 0 ≤ Pn ≤ 99
Partition 1000 numbers per processor
Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
sum[Pn] = sum[Pn] + A[i];
Now need to add these partial sums
Reduction: divide and conquer
Half the processors add pairs, then quarter, ... Need to synchronise between reduction steps
Stewart Smith
Digital Systems Design 4
Example: Sum Reduction
half = 100; /* 100 processors in multiprocessor */
repeat
synch(); /* wait for partial sum completion */
if (half%2 != 0 && Pn == 0)
sum[0] = sum[0] + sum[half-1];
/* Conditional sum needed when half is odd;
Processor0 gets missing element */
half = half/2; /* dividing line on who sums */
if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
Stewart Smith Digital Systems Design 4
Message Passing
• •
Each processor has private physical address space
Hardware sends/receives messages between processors
Stewart Smith
Digital Systems Design 4
Loosely Coupled Clusters
•
‣
- E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
‣ Web servers, databases, simulations, ... High availability, scalable, affordable
Problems
Administration cost (prefer virtual machines) Low interconnect bandwidth
- c.f. processor/memory bandwidth on an SMP Digital Systems Design 4
•
• •
‣ ‣
Network of independent computers
Each has private memory and OS ‣ Connected using I/O system
Stewart Smith
Sum Reduction (again)
•
‣ ‣
•
‣ ‣
Sum 100,000 on 100 processors
First distribute 1000 numbers to each Then do partial sums
sum = 0;
for (i = 0; i<1000; i = i + 1)
sum = sum + AN[i];
Reduction
Half the processors send, other half receive and add Then quarter send, quarter receive and add, ...
Stewart Smith
Digital Systems Design 4
Sum Reduction (again)
•
• •
Send/receive also provides synchronisation Assumes send/receive take similar time to addition
Given send() and receive() operations limit = 100; half = 100;/* 100 processors */
repeat
half = (half+1)/2; /* send vs. receive
dividing line */
if (Pn >= half && Pn < limit)
send(Pn - half, sum);
if (Pn < (limit/2)) sum = sum + receive();
limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */
Stewart Smith
Digital Systems Design 4
Distributed Computing
• •
‣ ‣ ‣ ‣
‣
Using multiple separate computer systems in parallel
May all be
In the same cabinet (compute cluster)
In the same room (multi-node HPC)
A bunch of desktop PC’s in a department (Beowolf)
Spread across the whole internet (Grid Computing, e.g. SETI@Home)
Commercial: Cloud computing frameworks
Stewart Smith
Digital Systems Design 4
Grid Computing
• •
• •
•
Disappearing, to be replaced by cloud computing
Still used for non-profit projects, like SETI@Home or Folding@Home
No central server
Complicated to keep high uptime (people turn their PCs off at night, etc)
Diverse platforms, operating systems ...
Stewart Smith
Digital Systems Design 4
Cloud Computing
•
•
•
‣ ‣
Pay for what you use... Central → Economies of Scale
Commercial services
Big Players like Amazon & Google
Can hire:
File storage
Compute
•‣ GPU Computing (this is newer)
•
Stewart Smith Digital Systems Design 4
Processes and Threads
•
‣
-
Has its own executable file
Gets its own virtual memory space
Can’t ‘see’ other processes
- [Have to use special mechanisms for inter-process communication (IPC) ]
Multitasking controlled by Operating System
‣
‣
‣
‣
Process
An application or service
[Service: A process that runs in the background, no user interface]
Stewart Smith
Digital Systems Design 4
Processes and Threads
•
‣
‣
‣
‣
Thread
One process may have many threads Share the same address space
Not separate executables, parts of the same program
Multitasking of threads must be coded for manually
Stewart Smith
Digital Systems Design 4
Multithreading
•
‣
•
‣ ‣
•
‣ ‣
Performing multiple threads of execution in parallel
Replicate registers, PC, etc.
‣ Fast switching between threads
Fine-grain multithreading
Switch threads after each cycle
Interleave instruction execution
‣ If one thread stalls, others are executed
Coarse-grain multithreading
Only switch on long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn’t hide short stalls (eg, data hazards)
Stewart Smith
Digital Systems Design 4
Simultaneous Multithreading
•
‣ ‣
‣
In multiple-issue dynamically scheduled processor
Schedule instructions from multiple threads
Instructions from independent threads execute when functional units are available
Within threads, dependencies handled by scheduling and register renaming
Stewart Smith
Digital Systems Design 4
Intel Hyperthreading
• •
•
‣ ‣
•
Each processor core has a double set of registers but single cache / ALU
Can execute two threads in parallel, by interleaving commands (like a GPU)
Advertising claims 2x speed-up
Depends on load
Depends on code itself
‣ Usually in the range 1.2x – 1.5x
Every other manufacturer does something similar under a different name
Stewart Smith
Digital Systems Design 4
Multithreading Examples
Stewart Smith Digital Systems Design 4