Chapter 6
Parallel Processors from Client to Cloud
Introduction
Goal: connecting multiple computers to get higher performance
Multiprocessors
Scalability, availability, power efficiency
Task-level (process-level) parallelism
High throughput for independent jobs
Parallel processing program
Single program run on multiple processors
Multicore microprocessors
Chips with multiple processors (cores)
§6.1 Introduction
Hardware and Software
Hardware can be serial (e.g., Pentium 4) or parallel (e.g., quad-core Xeon e5345)
Software can be sequential (e.g., matrix multiplication) or concurrent (e.g., operating system)
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware
What We’ve Already Covered
§2.11: Parallelism and Instructions
Synchronization
§3.6: Parallelism and Computer Arithmetic
Subword Parallelism
§4.10: Parallelism and Advanced Instruction-Level Parallelism
§5.10: Parallelism and Memory Hierarchies
Cache Coherence
Parallel Programming
Parallel software is the problem
Need to get significant performance improvement
Otherwise, just use a faster uniprocessor, since it’s easier!
Difficulties
Partitioning
Coordination
Communications overhead
§6.2 The Difficulty of Creating Parallel Processing Programs
Amdahl’s Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
Tnew = Tparallelizable/100 + Tsequential
Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90
Solving: Fparallelizable = 0.999
Need sequential part to be 0.1% of original time
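As a sanity check on the numbers above, here is a minimal C sketch (an illustration, not from the text) that evaluates Amdahl's Law for a given parallelizable fraction f and processor count p:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the
   parallelizable fraction and p is the number of processors. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    printf("f = 0.999, p = 100: speedup = %.1f\n", amdahl_speedup(0.999, 100)); /* ~91.0 */
    printf("f = 0.990, p = 100: speedup = %.1f\n", amdahl_speedup(0.990, 100)); /* ~50.3 */
    return 0;
}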
Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
What speedup do we get with 10 and with 100 processors?
Single processor: Time = (10 + 100) × tadd
10 processors
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
Speedup = 110/20 = 5.5 (55% of potential)
100 processors
Time = 10 × tadd + 100/100 × tadd = 11 × tadd
Speedup = 110/11 = 10 (10% of potential)
Assumes load can be balanced across processors
Scaling Example (cont)
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd
10 processors
Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced
Strong vs Weak Scaling
Strong scaling: problem size fixed
As in example
Weak scaling: problem size proportional to number of processors
10 processors, 10 × 10 matrix
Time = 20 × tadd
100 processors, 32 × 32 matrix (≈1000 elements)
Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example
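A small C sketch (not from the text) that reproduces these numbers by modeling execution time as 10 sequential scalar adds plus the matrix additions divided evenly across p processors:

#include <stdio.h>

/* Time in units of t_add: 10 sequential scalar adds plus
   matrix_elems additions spread evenly over p processors. */
static double time_units(int matrix_elems, int p) {
    return 10.0 + (double)matrix_elems / p;
}

int main(void) {
    /* Strong scaling: problem fixed at 10 x 10 = 100 elements. */
    printf("strong, p = 10:  speedup = %.1f\n", time_units(100, 1) / time_units(100, 10));  /* 5.5  */
    printf("strong, p = 100: speedup = %.1f\n", time_units(100, 1) / time_units(100, 100)); /* 10.0 */

    /* Weak scaling: matrix grows with p, 10 elements per processor. */
    printf("weak,   p = 10:  time = %.0f t_add\n", time_units(100, 10));   /* 20 */
    printf("weak,   p = 100: time = %.0f t_add\n", time_units(1000, 100)); /* 20 */
    return 0;
}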
Instruction and Data Streams
An alternate classification
SPMD: Single Program Multiple Data
A parallel program on a MIMD computer
Conditional code for different processors
§6.3 SISD, MIMD, SIMD, SPMD, and Vector
Instruction Streams \ Data Streams | Single | Multiple
Single | SISD: Intel Pentium 4 | SIMD: SSE instructions of x86
Multiple | MISD: No examples today | MIMD: Intel Xeon e5345
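A minimal C sketch of the SPMD style (an illustration; get_proc_id() and num_procs() are hypothetical helpers standing in for whatever the runtime provides): every processor runs the same program, and conditional code keyed on the processor ID selects each one's share of the work.

extern int get_proc_id(void);   /* hypothetical: this processor's ID, 0..P-1 */
extern int num_procs(void);     /* hypothetical: total processor count P     */

void spmd_scale(double *A, int n) {
    int Pn = get_proc_id();
    int P  = num_procs();
    int chunk = n / P;

    /* Same code on every processor, different data for each. */
    for (int i = Pn * chunk; i < (Pn + 1) * chunk; i++)
        A[i] = 2.0 * A[i];

    /* Conditional code for different processors:
       only processor 0 handles leftover elements. */
    if (Pn == 0)
        for (int i = P * chunk; i < n; i++)
            A[i] = 2.0 * A[i];
}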
Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: Vector extension to RISC-V
v0 to v31: 32 × 64-element registers (64-bit elements)
Vector instructions
fld.v, fsd.v: load/store vector
fadd.d.v: add vectors of double
fadd.d.vs: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth
Example: DAXPY (Y = a × X + Y)
Conventional RISC-V code:
fld f0,a(x3) // load scalar a
addi x5,x19,512 // end of array X
loop: fld f1,0(x19) // load x[i]
fmul.d f1,f1,f0 // a * x[i]
fld f2,0(x20) // load y[i]
fadd.d f2,f2,f1 // a * x[i] + y[i]
fsd f2,0(x20) // store y[i]
addi x19,x19,8 // increment index to x
addi x20,x20,8 // increment index to y
bltu x19,x5,loop // repeat if not done
Vector RISC-V code:
fld f0,a(x3) // load scalar a
fld.v v0,0(x19) // load vector x
fmul.d.vs v0,v0,f0 // vector-scalar multiply
fld.v v1,0(x20) // load vector y
fadd.d.v v1,v1,v0 // vector-vector add
fsd.v v1,0(x20) // store vector y
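For reference, both sequences implement the same C loop (the standard DAXPY kernel; this source form is not shown in the slides):

/* Y = a*X + Y over 64 double-precision elements
   (64 x 8 bytes = 512, matching addi x5,x19,512 above). */
void daxpy(double a, double *x, double *y) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}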
Vector vs. Scalar
Vector architectures and compilers
Simplify data-parallel programming
Explicit statement of absence of loop-carried dependences
Reduced checking in hardware
Regular access patterns benefit from interleaved and burst memory
Avoid control hazards by avoiding loops
More general than ad-hoc media extensions (such as MMX, SSE)
Better match with compiler technology
SIMD
Operate elementwise on vectors of data
E.g., MMX and SSE instructions in x86
Multiple data elements in 128-bit wide registers
All processors execute the same instruction at the same time
Each with different data address, etc.
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
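As a concrete illustration of this style, a minimal C sketch using x86 SSE intrinsics (one possible formulation, not from the slides): each 128-bit register holds four single-precision floats, so one instruction performs four adds.

#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, ... */

void add_sse(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats   */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds at once  */
    }
    for (; i < n; i++)                             /* scalar tail     */
        c[i] = a[i] + b[i];
}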
Vector vs. Multimedia Extensions
Vector instructions have a variable vector width; multimedia extensions have a fixed width
Vector instructions support strided access; multimedia extensions do not
Vector units can be a combination of pipelined and arrayed functional units
Multithreading
Performing multiple threads of execution in parallel
Replicate registers, PC, etc.
Fast switching between threads
Fine-grain multithreading
Switch threads after each cycle
Interleave instruction execution
If one thread stalls, others are executed
Coarse-grain multithreading
Only switch on long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
§6.4 Hardware Multithreading
Simultaneous Multithreading
In a multiple-issue, dynamically scheduled processor
Schedule instructions from multiple threads
Instructions from independent threads execute when function units are available
Within threads, dependencies handled by scheduling and register renaming
Example: Intel Pentium-4 HT
Two threads: duplicated registers, shared function units and caches
Multithreading Example
Future of Multithreading
Will it survive? In what form?
Power considerations favor simplified microarchitectures
Simpler forms of multithreading
Tolerating cache-miss latency
Thread switch may be most effective
Multiple simple cores might share resources more effectively
Shared Memory
SMP: shared memory multiprocessor
Hardware provides a single physical address space for all processors
Synchronize shared variables using locks
Memory access time
UMA (uniform) vs. NUMA (nonuniform)
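A minimal POSIX-threads sketch of synchronizing a shared variable with a lock (an illustration; the slides do not prescribe a particular threading API):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                   /* shared variable      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects the counter */

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* only one thread updates at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 with the lock in place */
    return 0;
}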
§6.5 Multicore and Other Shared Memory Multiprocessors
Example: Sum Reduction
Sum 64,000 numbers on a 64-processor UMA
Each processor has ID: 0 ≤ Pn ≤ 63
Partition 1000 numbers per processor
Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
  sum[Pn] += A[i];
Now need to add these partial sums
Reduction: divide and conquer
Half the processors add pairs, then quarter, …
Need to synchronize between reduction steps
Example: Sum Reduction
half = 64;                     /* 64 processors */
do {
  synch();                     /* barrier: wait for partial sums from all processors */
  if (half % 2 != 0 && Pn == 0)
    sum[0] += sum[half-1];     /* conditional sum needed when half is odd;
                                  Processor0 gets the missing element */
  half = half / 2;             /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] += sum[Pn + half];
} while (half > 1);
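For comparison, the same shared-memory reduction can be written with OpenMP (a sketch, not part of the original slides); the reduction clause gives each thread a private partial sum and combines them at the end of the loop.

double sum_reduce(const double *A, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* partition + combine handled by the runtime */
    for (int i = 0; i < n; i++)
        sum += A[i];
    return sum;
}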
History of GPUs
Early video cards
Frame buffer memory with address generation for video output
3D graphics processing
Originally high-end computers (e.g., SGI)
Moore’s Law brought lower cost, higher density
3D graphics cards for PCs and game consoles
Graphics Processing Units
Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping, rasterization
§6.6 Introduction to Graphics Processing Units
Graphics in the System
GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Less reliance on multi-level caches
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)
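A minimal CUDA C sketch of the DAXPY kernel from earlier in the chapter (an illustration of the programming model; the block size of 256 is an arbitrary choice): each GPU thread computes one element.

__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)                                       /* guard the tail      */
        y[i] = a * x[i] + y[i];
}

void launch_daxpy(int n, double a, const double *x, double *y) {
    int threads = 256;                               /* threads per block (arbitrary) */
    int blocks  = (n + threads - 1) / threads;       /* enough blocks to cover n      */
    daxpy<<<blocks, threads>>>(n, a, x, y);
}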
Example: NVIDIA Fermi
Multiple SIMD processors, each as shown:
Example: NVIDIA Fermi
SIMD Processor: 16 SIMD lanes
SIMD instruction
Operates on 32-element-wide threads
Dynamically scheduled on the 16-lane processor over 2 cycles
32K x 32-bit registers spread across lanes
64 registers per thread context
GPU Memory Structures
Classifying GPUs
Don’t fit nicely into SIMD/MIMD model
Conditional execution in a thread allows an illusion of MIMD
But with performance degradation
Need to write general purpose code with care
Parallelism | Static: Discovered at Compile Time | Dynamic: Discovered at Runtime
Instruction-Level Parallelism | VLIW | Superscalar
Data-Level Parallelism | SIMD or Vector | Tesla Multiprocessor
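To illustrate the conditional-execution point above, a hypothetical CUDA C sketch: threads in the same 32-wide SIMD group take different branch paths, so the hardware runs both sides with lanes masked off, giving MIMD-like behavior at reduced throughput.

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   /* even lanes active, odd lanes masked */
    else
        data[i] = data[i] + 1.0f;   /* odd lanes active, even lanes masked */
}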
Putting GPUs into Perspective
Feature | Multicore with SIMD | GPU
SIMD processors | 4 to 8 | 8 to 16
SIMD lanes/processor | 2 to 4 | 8 to 16
Multithreading hardware support for SIMD threads | 2 to 4 | 16 to 32
Typical ratio of single-precision to double-precision performance | 2:1 | 2:1
Largest cache size | 8 MB | 0.75 MB
Size of memory address | 64-bit | 64-bit
Size of main memory | 8 GB to 256 GB | 4 GB to 6 GB
Memory protection at level of page | Yes | Yes
Demand paging | Yes | No
Integrated scalar processor/SIMD processor | Yes | No
Cache coherent | Yes | No
Guide to GPU Terms
Message Passing
Each processor has private physical address space
Hardware sends/receives messages between processors
§6.7 Clusters, WSC, and Other Message-Passing MPs
Loosely Coupled Clusters
Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems
Administration cost (prefer virtual machines)
Low interconnect bandwidth
c.f. processor/memory bandwidth on an SMP
Sum Reduction (Again)
Sum 64,000 numbers on 64 processors
First distribute 1000 numbers to each
Then do partial sums
sum = 0;
for (i = 0; i < 1000; i += 1)
  sum += AN[i];
Half the processors send, other half receive and add
Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
Given send() and receive() operations
limit = 64; half = 64;      /* 64 processors */
do {
  half = (half + 1) / 2;    /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);   /* upper half sends its partial sum */
  if (Pn < (limit / 2))
    sum += receive();       /* lower half receives and adds */
  limit = half;             /* upper limit of senders */
} while (half > 1);         /* exit with final sum */
Send/receive also provide synchronization
Assumes send/receive take similar time to addition
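The slides’ send()/receive() are abstract; as one concrete rendering, a sketch of the same reduction with MPI (assuming an MPI library and that MPI_Init has already been called): each process sums its local numbers and MPI_Reduce combines the partial sums on rank 0, internally using a tree much like the loop above.

#include <mpi.h>

double cluster_sum(const double *AN) {
    double local = 0.0, total = 0.0;
    for (int i = 0; i < 1000; i++)           /* local partial sum */
        local += AN[i];
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return total;                            /* meaningful only on rank 0 */
}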
Grid Computing
Separate computers interconnected by long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., World Community Grid
Interconnection Networks
Network topologies
Arrangements of processors, switches, and links
§6.8 Introduction to Multiprocessor Network Topologies
Examples: N-cube (N = 3), fully connected
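A small C sketch (not from the slides) comparing two of these topologies by total link count and bisection bandwidth, the usual cost and performance measures, using the standard formulas for P nodes:

#include <stdio.h>

int main(void) {
    int n = 3, P = 1 << n;                      /* P = 2^n nodes, here 8 */

    /* Fully connected: every node links to every other node. */
    int fc_links     = P * (P - 1) / 2;         /* 28 */
    int fc_bisection = (P / 2) * (P / 2);       /* 16 */

    /* n-cube (hypercube): each node has n links. */
    int cube_links     = n * P / 2;             /* 12 */
    int cube_bisection = P / 2;                 /*  4 */

    printf("fully connected: %d links, bisection %d\n", fc_links, fc_bisection);
    printf("%d-cube: %d links, bisection %d\n", n, cube_links, cube_bisection);
    return 0;
}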
Multistage Networks