CS计算机代考程序代写 database GPU cache Digital System Design 4 Parallel Computing Architecture 3

Digital System Design 4 Parallel Computing Architecture 3
Stewart Smith Digital Systems Design 4

• •
Multicore and Multiprocessor systems Multi-threading in processors
This Lecture
Stewart Smith
Digital Systems Design 4

Multicore Processors





Effectively several processors on the same die Shared memory controller
Separate L1 and usually L2 caches
Processes don’t necessarily map to the same processor each time they’re run (but can request it (Processor Affinity))
Each processor can run a separate process or threads of the same process
Stewart Smith
Digital Systems Design 4

Shared Memory Multiprocessors

• •

Hardware provides single physical address space for all processors
Synchronise shared variables using locks Memory access time
UMA (uniform) vs. NUMA (nonuniform)
Stewart Smith
Digital Systems Design 4

Example: Sum Reduction





‣ ‣ ‣
Sum 100,000 numbers on 100 processor UMA
Each processor has ID: 0 ≤ Pn ≤ 99
Partition 1000 numbers per processor
Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1) sum[Pn] = sum[Pn] + A[i]; Now need to add these partial sums Reduction: divide and conquer Half the processors add pairs, then quarter, ... Need to synchronise between reduction steps Stewart Smith Digital Systems Design 4 Example: Sum Reduction half = 100; /* 100 processors in multiprocessor */ repeat synch(); /* wait for partial sum completion */ if (half%2 != 0 && Pn == 0) sum[0] = sum[0] + sum[half-1]; /* Conditional sum needed when half is odd; Processor0 gets missing element */ half = half/2; /* dividing line on who sums */ if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half]; until (half == 1); Stewart Smith Digital Systems Design 4 Message Passing • • Each processor has private physical address space Hardware sends/receives messages between processors Stewart Smith Digital Systems Design 4 Loosely Coupled Clusters • ‣ - E.g., Ethernet/switch, Internet Suitable for applications with independent tasks ‣ Web servers, databases, simulations, ... High availability, scalable, affordable Problems Administration cost (prefer virtual machines) Low interconnect bandwidth - c.f. processor/memory bandwidth on an SMP Digital Systems Design 4 • • • ‣ ‣ Network of independent computers Each has private memory and OS ‣ Connected using I/O system Stewart Smith Sum Reduction (again) • ‣ ‣ • ‣ ‣ Sum 100,000 on 100 processors First distribute 1000 numbers to each Then do partial sums sum = 0; for (i = 0; i<1000; i = i + 1) sum = sum + AN[i]; Reduction Half the processors send, other half receive and add Then quarter send, quarter receive and add, ... Stewart Smith Digital Systems Design 4 Sum Reduction (again) • • • Send/receive also provides synchronisation Assumes send/receive take similar time to addition Given send() and receive() operations limit = 100; half = 100;/* 100 processors */ repeat half = (half+1)/2; /* send vs. receive dividing line */ if (Pn >= half && Pn < limit) send(Pn - half, sum); if (Pn < (limit/2)) sum = sum + receive(); limit = half; /* upper limit of senders */ until (half == 1); /* exit with final sum */ Stewart Smith Digital Systems Design 4 Distributed Computing • • ‣ ‣ ‣ ‣ ‣ Using multiple separate computer systems in parallel May all be In the same cabinet (compute cluster) In the same room (multi-node HPC) A bunch of desktop PC’s in a department (Beowolf) Spread across the whole internet (Grid Computing, e.g. SETI@Home) Commercial: Cloud computing frameworks Stewart Smith Digital Systems Design 4 Grid Computing • • • • • Disappearing, to be replaced by cloud computing Still used for non-profit projects, like SETI@Home or Folding@Home No central server Complicated to keep high uptime (people turn their PCs off at night, etc) Diverse platforms, operating systems ... Stewart Smith Digital Systems Design 4 Cloud Computing • • • ‣ ‣ Pay for what you use... Central → Economies of Scale Commercial services Big Players like Amazon & Google Can hire: File storage Compute •‣ GPU Computing (this is newer) • Stewart Smith Digital Systems Design 4 Processes and Threads • ‣ - Has its own executable file Gets its own virtual memory space Can’t ‘see’ other processes - [Have to use special mechanisms for inter-process communication (IPC) ] Multitasking controlled by Operating System ‣ ‣ ‣ ‣ Process An application or service [Service: A process that runs in the background, no user interface] Stewart Smith Digital Systems Design 4 Processes and Threads • ‣ ‣ ‣ ‣ Thread One process may have many threads Share the same address space Not separate executables, parts of the same program Multitasking of threads must be coded for manually Stewart Smith Digital Systems Design 4 Multithreading • ‣ • ‣ ‣ • ‣ ‣ Performing multiple threads of execution in parallel Replicate registers, PC, etc. ‣ Fast switching between threads Fine-grain multithreading Switch threads after each cycle Interleave instruction execution ‣ If one thread stalls, others are executed Coarse-grain multithreading Only switch on long stall (e.g., L2-cache miss) Simplifies hardware, but doesn’t hide short stalls (eg, data hazards) Stewart Smith Digital Systems Design 4 Simultaneous Multithreading • ‣ ‣ ‣ In multiple-issue dynamically scheduled processor Schedule instructions from multiple threads Instructions from independent threads execute when functional units are available Within threads, dependencies handled by scheduling and register renaming Stewart Smith Digital Systems Design 4 Intel Hyperthreading • • • ‣ ‣ • Each processor core has a double set of registers but single cache / ALU Can execute two threads in parallel, by interleaving commands (like a GPU) Advertising claims 2x speed-up Depends on load Depends on code itself ‣ Usually in the range 1.2x – 1.5x Every other manufacturer does something similar under a different name Stewart Smith Digital Systems Design 4 Multithreading Examples Stewart Smith Digital Systems Design 4