
Chapter …

Parallel Processors from Client to Cloud



Introduction
Goal: connecting multiple computers
to get higher performance
Multiprocessors
Scalability, availability, power efficiency
Task-level (process-level) parallelism
High throughput for independent jobs
Parallel processing program
Single program run on multiple processors
Multicore microprocessors
Chips with multiple processors (cores)

§6.1 Introduction

Hardware and Software
Serial: e.g., Pentium 4
Parallel: e.g., quad-core Xeon e5345
Sequential: e.g., matrix multiplication
Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware


What We’ve Already Covered
§2.11: Parallelism and Instructions
Synchronization
§3.6: Parallelism and Computer Arithmetic
Subword Parallelism
§4.10: Parallelism and Advanced Instruction-Level Parallelism
§5.10: Parallelism and Memory Hierarchies
Cache Coherence


Parallel Programming
Parallel software is the problem
Need to get significant performance improvement
Otherwise, just use a faster uniprocessor, since it’s easier!
Difficulties
Partitioning
Coordination
Communications overhead

§6.2 The Difficulty of Creating Parallel Processing Programs

Amdahl’s Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
Tnew = Tparallelizable/100 + Tsequential
Speedup = Told/Tnew = 1/((1 - Fparallelizable) + Fparallelizable/100) = 90
Solving: Fparallelizable = 0.999
Need sequential part to be 0.1% of original time
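
A minimal C sketch of this calculation (the helper names and the driver below are illustrative, not from the text):

#include <stdio.h>

/* Amdahl's Law: speedup on p processors when a fraction f of the
   original execution time is parallelizable. */
double speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

/* Fraction f needed to reach a target speedup s on p processors,
   obtained by rearranging the formula above. */
double fraction_needed(double s, int p) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / (double)p);
}

int main(void) {
    double f = fraction_needed(90.0, 100);
    printf("fraction needed for 90x on 100 processors: %.4f\n", f);  /* ~0.9989 */
    printf("check: speedup = %.1f\n", speedup(f, 100));              /* ~90.0 */
    return 0;
}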


Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
Speed up from 10 to 100 processors
Single processor: Time = (10 + 100) × tadd
10 processors
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
Speedup = 110/20 = 5.5 (55% of potential)
100 processors
Time = 10 × tadd + 100/100 × tadd = 11 × tadd
Speedup = 110/11 = 10 (10% of potential)
Assumes load can be balanced across processors
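
The same arithmetic as a small C helper (a sketch; the constants mirror this example only):

#include <stdio.h>

/* Execution time of the example workload (10 scalar adds plus an n-element
   matrix sum) on p processors, in units of t_add.  Only the matrix sum is
   assumed to parallelize and balance perfectly. */
double example_time(int p, int n) {
    return 10.0 + (double)n / p;
}

int main(void) {
    int n = 100;                        /* 10 x 10 matrix */
    double t1 = example_time(1, n);     /* 110 x t_add */
    printf("10 processors:  speedup = %.1f\n", t1 / example_time(10, n));   /* 5.5 */
    printf("100 processors: speedup = %.1f\n", t1 / example_time(100, n));  /* 10.0 */
    return 0;
}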


Scaling Example (cont)
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd
10 processors
Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced


Strong vs Weak Scaling
Strong scaling: problem size fixed
As in example
Weak scaling: problem size proportional to number of processors
10 processors, 10 × 10 matrix
Time = 20 × tadd
100 processors, 32 × 32 matrix (≈1000 elements)
Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example


Instruction and Data Streams
An alternate classification

SISD (single instruction stream, single data stream): e.g., Intel Pentium 4
SIMD (single instruction stream, multiple data streams): e.g., SSE instructions of x86
MISD (multiple instruction streams, single data stream): no examples today
MIMD (multiple instruction streams, multiple data streams): e.g., Intel Xeon e5345

SPMD: Single Program Multiple Data
A parallel program on a MIMD computer
Conditional code for different processors

§6.3 SISD, MIMD, SIMD, SPMD, and Vector


Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: Vector extension to RISC-V
v0 to v31: 32 × 64-element registers (64-bit elements)
Vector instructions
fld.v, fsd.v: load/store vector
fadd.d.v: add vectors of double
fadd.d.vs: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth


Example: DAXPY (Y = a × X + Y)
Conventional RISC-V code:

fld f0,a(x3) // load scalar a
addi x5,x19,512 // end of array X
loop: fld f1,0(x19) // load x[i]
fmul.d f1,f1,f0 // a * x[i]
fld f2,0(x20) // load y[i]
fadd.d f2,f2,f1 // a * x[i] + y[i]
fsd f2,0(x20) // store y[i]
addi x19,x19,8 // increment index to x
addi x20,x20,8 // increment index to y
bltu x19,x5,loop // repeat if not done
Vector RISC-V code:
fld f0,a(x3) // load scalar a
fld.v v0,0(x19) // load vector x
fmul.d.vs v0,v0,f0 // vector-scalar multiply
fld.v v1,0(x20) // load vector y
fadd.d.v v1,v1,v0 // vector-vector add
fsd.v v1,0(x20) // store vector y
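
For reference, the same DAXPY computation in plain C (a sketch; n would be 64 for the vector code above):

/* DAXPY: Y = a * X + Y, the computation implemented by both code
   sequences above. */
void daxpy(double a, const double *x, double *y, int n) {
    for (int i = 0; i < n; i += 1)
        y[i] = a * x[i] + y[i];
}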


Vector vs. Scalar
Vector architectures and compilers
Simplify data-parallel programming
Explicit statement of absence of loop-carried dependences
Reduced checking in hardware
Regular access patterns benefit from interleaved and burst memory
Avoid control hazards by avoiding loops
More general than ad-hoc media extensions (such as MMX, SSE)
Better match with compiler technology


SIMD
Operate elementwise on vectors of data
E.g., MMX and SSE instructions in x86
Multiple data elements in 128-bit wide registers
All processors execute the same instruction at the same time
Each with different data address, etc.
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
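
As an illustration, a short C fragment using x86 SSE intrinsics (a sketch assuming an SSE-capable x86 compiler; the function name is made up for this example):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two float arrays four elements at a time using 128-bit SSE registers.
   n is assumed to be a multiple of 4 for simplicity. */
void add_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* elementwise add, store */
    }
}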


Vector vs. Multimedia Extensions
Vector instructions have a variable vector width, multimedia extensions have a fixed width
Vector instructions support strided access, multimedia extensions do not
Vector units can be a combination of pipelined and arrayed functional units


Multithreading
Performing multiple threads of execution in parallel
Replicate registers, PC, etc.
Fast switching between threads
Fine-grain multithreading
Switch threads after each cycle
Interleave instruction execution
If one thread stalls, others are executed
Coarse-grain multithreading
Only switch on long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)

§6.4 Hardware Multithreading

Simultaneous Multithreading
In multiple-issue dynamically scheduled processor
Schedule instructions from multiple threads
Instructions from independent threads execute when function units are available
Within threads, dependencies handled by scheduling and register renaming
Example: Intel Pentium-4 HT
Two threads: duplicated registers, shared function units and caches


Multithreading Example

Future of Multithreading
Will it survive? In what form?
Power considerations ⇒ simplified microarchitectures
Simpler forms of multithreading
Tolerating cache-miss latency
Thread switch may be most effective
Multiple simple cores might share resources more effectively


Shared Memory
SMP: shared memory multiprocessor
Hardware provides single physical address space for all processors
Synchronize shared variables using locks
Memory access time
UMA (uniform) vs. NUMA (nonuniform)

§6.5 Multicore and Other Shared Memory Multiprocessors


Example: Sum Reduction
Sum 64,000 numbers on a 64-processor UMA
Each processor has ID: 0 ≤ Pn ≤ 63
Partition 1000 numbers per processor
Initial summation on each processor

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
  sum[Pn] += A[i];

Now need to add these partial sums
Reduction: divide and conquer
Half the processors add pairs, then quarter, …
Need to synchronize between reduction steps

Example: Sum Reduction

half = 64;                       /* 64 processors */
do
  synch();                       /* wait for all partial sums */
  if (half%2 != 0 && Pn == 0)
    sum[0] += sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
  half = half/2;                 /* dividing line on who sums */
  if (Pn < half) sum[Pn] += sum[Pn+half];
while (half > 1);
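
A runnable shared-memory sketch of the same reduction, using OpenMP instead of the explicit tree above (OpenMP is an assumption here; the slides do not name a threading library):

#include <stdio.h>
#include <omp.h>

#define N 64000

int main(void) {
    static double A[N];
    for (int i = 0; i < N; i += 1) A[i] = 1.0;

    double sum = 0.0;
    /* Each thread accumulates a private partial sum; OpenMP combines
       them, playing the role of the reduction tree on the slide. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i += 1)
        sum += A[i];

    printf("sum = %.0f\n", sum);   /* 64000 */
    return 0;
}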

History of GPUs
Early video cards
Frame buffer memory with address generation for video output
3D graphics processing
Originally high-end computers (e.g., SGI)
Moore’s Law ⇒ lower cost, higher density
3D graphics cards for PCs and game consoles
Graphics Processing Units
Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping, rasterization

§6.6 Introduction to Graphics Processing Units

Graphics in the System

GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Less reliance on multi-level caches
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)


Example: NVIDIA Fermi
Multiple SIMD processors, each as shown:


Example: NVIDIA Fermi
SIMD Processor: 16 SIMD lanes
SIMD instruction
Operates on 32-element-wide threads
Dynamically scheduled on a 16-wide processor over 2 cycles
32K × 32-bit registers spread across lanes
64 registers per thread context


GPU Memory Structures

Classifying GPUs
Don’t fit nicely into SIMD/MIMD model
Conditional execution in a thread allows an illusion of MIMD
But with performance degradation
Need to write general purpose code with care

Instruction-level parallelism: VLIW (static, discovered at compile time) vs. superscalar (dynamic, discovered at runtime)
Data-level parallelism: SIMD or vector (static) vs. Tesla multiprocessor (dynamic)


Putting GPUs into Perspective
Feature                                                        | Multicore with SIMD | GPU
SIMD processors                                                | 4 to 8              | 8 to 16
SIMD lanes/processor                                           | 2 to 4              | 8 to 16
Multithreading hardware support for SIMD threads               | 2 to 4              | 16 to 32
Typical ratio of single precision to double-precision perf.    | 2:1                 | 2:1
Largest cache size                                             | 8 MB                | 0.75 MB
Size of memory address                                         | 64-bit              | 64-bit
Size of main memory                                            | 8 GB to 256 GB      | 4 GB to 6 GB
Memory protection at level of page                             | Yes                 | Yes
Demand paging                                                  | Yes                 | No
Integrated scalar processor/SIMD processor                     | Yes                 | No
Cache coherent                                                 | Yes                 | No


Guide to GPU Terms

Message Passing
Each processor has private physical address space
Hardware sends/receives messages between processors

§6.7 Clusters, WSC, and Other Message-Passing MPs

Loosely Coupled Clusters
Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems:
Administration cost (prefer virtual machines)
Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)


Sum Reduction (Again)
Sum 64,000 on 64 processors
First distribute 1000 numbers to each
Then do partial sums on each processor

sum = 0;
for (i = 0; i < 1000; i += 1)
  sum += AN[i];

Half the processors send, other half receive and add
Then a quarter send, a quarter receive and add, …

Sum Reduction (Again)

Given send() and receive() operations

limit = 64; half = 64;           /* 64 processors */
do
  half = (half+1)/2;             /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum += receive();
  limit = half;                  /* upper limit of senders */
while (half > 1);                /* exit with final sum */
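
For comparison, a minimal message-passing version in C using MPI (an assumption; the slides only posit generic send() and receive() operations):

#include <stdio.h>
#include <mpi.h>

#define PER_PROC 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* 64 in the example */

    /* Each processor sums its own 1000 numbers. */
    double an[PER_PROC], sum = 0.0;
    for (int i = 0; i < PER_PROC; i += 1) { an[i] = 1.0; sum += an[i]; }

    /* MPI_Reduce plays the role of the explicit send/receive tree. */
    double total = 0.0;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (pn == 0)
        printf("total over %d processes = %.0f\n", nprocs, total);  /* 64000 with 64 processes */

    MPI_Finalize();
    return 0;
}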
Send/receive also provide synchronization
Assumes send/receive take similar time to addition


Grid Computing
Separate computers interconnected by long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., World Community Grid


Interconnection Networks
Network topologies
Arrangements of processors, switches, and links

§6.8 Introduction to Multiprocessor Network Topologies
Example topologies (figures): N-cube (N = 3), fully connected

Multistage Networks