


Digital System Design 4
Parallel Computing Architecture 3


This Lecture

• Multicore and Multiprocessor systems
• Multi-threading in processors


Multicore Processors
• Effectively several processors on the same die
• Shared memory controller
• Separate L1 and usually L2 caches
• Processes don’t necessarily map to the same processor each time they’re run (but can request it: processor affinity, see the sketch below)
• Each processor can run a separate process, or threads of the same process
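A process can ask the operating system to keep it on a particular core. A minimal sketch of requesting processor affinity on Linux follows (the sched_setaffinity call is Linux-specific, and pinning to core 2 is an arbitrary choice for illustration):

#define _GNU_SOURCE
#include <sched.h>   /* cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity, sched_getcpu */
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);                 /* start with an empty CPU set       */
    CPU_SET(2, &mask);               /* allow this process on core 2 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 2, currently running on CPU %d\n", sched_getcpu());
    return 0;
}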


Shared Memory Multiprocessors
• Hardware provides a single physical address space for all processors
• Synchronise shared variables using locks (see the sketch below)
• Memory access time
‣ UMA (uniform) vs. NUMA (nonuniform)
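Lock-based synchronisation of a shared variable might look like the following minimal POSIX-threads sketch (the counter, loop count and thread count are arbitrary choices; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                        /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* enter critical section */
        counter = counter + 1;                  /* safe read-modify-write */
        pthread_mutex_unlock(&lock);            /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);         /* 4 * 100000 = 400000 */
    return 0;
}

Without the lock the four threads race on the read-modify-write and the final count is usually well below 400,000.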


Example: Sum Reduction
• Sum 100,000 numbers on a 100-processor UMA machine
‣ Each processor has ID: 0 ≤ Pn ≤ 99
‣ Partition 1000 numbers per processor
‣ Initial summation on each processor

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
  sum[Pn] = sum[Pn] + A[i];
• Now need to add these partial sums
‣ Reduction: divide and conquer
‣ Half the processors add pairs, then a quarter, …
‣ Need to synchronise between reduction steps

Example: Sum Reduction

half = 100;                 /* 100 processors in multiprocessor */
repeat
  synch();                  /* wait for partial sum completion */
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
                            /* conditional sum needed when half is odd;
                               processor 0 gets the missing element */
  half = half/2;            /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);

Message Passing
• Each processor has a private physical address space
• Hardware sends/receives messages between processors

Loosely Coupled Clusters
• Network of independent computers
‣ Each has private memory and OS
‣ Connected using the I/O system
- E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
‣ Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
‣ Administration cost (prefer virtual machines)
‣ Low interconnect bandwidth
- c.f. processor/memory bandwidth on an SMP

Sum Reduction (again)
• Sum 100,000 numbers on 100 processors
‣ First distribute 1000 numbers to each
‣ Then do partial sums

sum = 0;
for (i = 0; i < 1000; i = i + 1)
  sum = sum + AN[i];

• Reduction
‣ Half the processors send, the other half receive and add
‣ Then a quarter send, a quarter receive and add, …

Sum Reduction (again)
• Given send() and receive() operations

limit = 100; half = 100;    /* 100 processors */
repeat
  half = (half+1)/2;        /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum = sum + receive();
  limit = half;             /* upper limit of senders */
until (half == 1);          /* exit with final sum */

• Send/receive also provides synchronisation
• Assumes send/receive take a similar time to an addition
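The same pattern can be written with a real message-passing library. Below is a minimal MPI sketch (not the course code; for brevity each process contributes a single number, its rank, standing in for a 1000-element partial sum):

/* compile with mpicc, run with e.g. mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int Pn, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);       /* this processor's ID  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* number of processors */

    double sum = (double)Pn;                  /* stand-in for a partial sum */

    int limit = nprocs, half = nprocs;
    while (half > 1) {
        half = (half + 1) / 2;                /* send vs. receive dividing line    */
        if (Pn >= half && Pn < limit)         /* upper half: send partial sum down */
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {                 /* lower ranks: receive and add      */
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum = sum + other;
        }
        limit = half;                         /* upper limit of senders */
    }
    if (Pn == 0)
        printf("total = %f\n", sum);          /* final sum ends up on rank 0 */

    MPI_Finalize();
    return 0;
}

MPI also provides the collective MPI_Reduce, which would replace the explicit loop above with a single call; the loop is kept here only to mirror the pseudocode.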
Distributed Computing
• Using multiple separate computer systems in parallel
• May all be
‣ In the same cabinet (compute cluster)
‣ In the same room (multi-node HPC)
‣ A bunch of desktop PCs in a department (Beowulf cluster)
‣ Spread across the whole internet (grid computing, e.g. SETI@Home)
‣ Commercial: cloud computing frameworks

Grid Computing
• Disappearing, being replaced by cloud computing
• Still used for non-profit projects, like SETI@Home or Folding@Home
• No central server
• Complicated to keep uptime high (people turn their PCs off at night, etc.)
• Diverse platforms, operating systems, …

Cloud Computing
• Commercial services
• Big players like Amazon & Google
• Can hire:
‣ File storage
‣ Compute
‣ GPU computing (this is newer)
• Pay for what you use…
• Centralised → economies of scale

Processes and Threads
• Process
‣ An application or service
- [Service: a process that runs in the background, with no user interface]
‣ Has its own executable file
‣ Gets its own virtual memory space
‣ Can’t ‘see’ other processes
- [Have to use special mechanisms for inter-process communication (IPC)]
‣ Multitasking controlled by the operating system

Processes and Threads
• Thread
‣ One process may have many threads
‣ Threads share the same address space
‣ Not separate executables, but parts of the same program
‣ Multitasking of threads must be coded for manually (see the sketch at the end of this lecture)

Multithreading
• Performing multiple threads of execution in parallel
‣ Replicate registers, PC, etc.
‣ Fast switching between threads
• Fine-grain multithreading
‣ Switch threads after each cycle
‣ Interleave instruction execution
‣ If one thread stalls, others are executed
• Coarse-grain multithreading
‣ Only switch on a long stall (e.g. L2-cache miss)
‣ Simplifies hardware, but doesn’t hide short stalls (e.g. data hazards)

Simultaneous Multithreading
• In a multiple-issue, dynamically scheduled processor
‣ Schedule instructions from multiple threads
‣ Instructions from independent threads execute when functional units are available
‣ Within threads, dependencies are handled by scheduling and register renaming

Intel Hyperthreading
• Each processor core has a double set of registers but a single cache/ALU
• Can execute two threads in parallel by interleaving instructions (like a GPU)
• Advertising claims a 2x speed-up
‣ Depends on load
‣ Depends on the code itself
‣ Usually in the range 1.2x – 1.5x
• Every other manufacturer does something similar under a different name

Multithreading Examples
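Tying the processes-and-threads and multithreading slides together, here is a minimal software-level sketch (POSIX threads plus fork(); the variable and the values written are arbitrary) contrasting a child process, which gets its own copy of the address space, with a thread, which shares it. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared = 0;        /* one copy per process; shared between threads */

static void *thread_fn(void *arg)
{
    (void)arg;
    shared = 42;              /* same address space, so main() sees this write */
    return NULL;
}

int main(void)
{
    /* A child process gets a *copy* of the address space. */
    pid_t pid = fork();
    if (pid == 0) {           /* child */
        shared = 7;           /* modifies only the child's copy */
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after child process: shared = %d\n", shared);   /* prints 0  */

    /* A thread shares the parent's address space. */
    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    printf("after thread:        shared = %d\n", shared);   /* prints 42 */
    return 0;
}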