COMP5426 Distributed
Introduction
References
– NVIDIA GPU Educators Program
– https://developer.nvidia.com/educators
– NVIDIA’s Academic Programs
– https://developer.nvidia.com/academia
– The contents of this short course ppt slides are mainly copied from the following book and its accompanying teaching materials:
David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd edition, Morgan Kaufmann, 2013
– A method for extracting notable features and patterns from large data sets
– Feature extraction for object recognition in images
– Fraud detection in credit card transactions
– Correlating heavenly object movements in astrophysics
– …
– Basic histograms – for each element in the data set, use the value to identify a “bin counter” to increment
A Text Histogram Example
– Define the bins as four-letter sections of the alphabet: a-d, e-h, i-l, m-p, …
– For each character in an input string, increment the appropriate bin counter.
– In the phrase “Programming Massively Parallel Processors” the output histogram is shown below:
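A minimal sequential sketch of this binning (the function and variable names are illustrative; it assumes the input has been lowercased and ignores non-letter characters):

void histogram_cpu(const char *text, unsigned int bins[7]) {
    // Seven bins: a-d, e-h, i-l, m-p, q-t, u-x, y-z (the last bin has 2 letters)
    for (int b = 0; b < 7; b++) bins[b] = 0;
    for (const char *p = text; *p != '\0'; p++) {
        int alphabet_position = *p - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)  // skip spaces, punctuation
            bins[alphabet_position / 4]++;
    }
}

Counting the letters of "programming massively parallel processors" this way gives a-d: 5, e-h: 5, i-l: 6, m-p: 10, q-t: 10, u-x: 1, y-z: 1.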
A simple parallel histogram algorithm
– Partition the input into sections
– Have each thread take a section of the input
– Each thread iterates through its section.
– For each letter, increment the appropriate bin counter
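A possible sketch of this sectioned approach (the kernel and variable names are illustrative, not from the original slides; it uses atomicAdd, introduced later in these slides, so that concurrent bin updates stay correct):

__global__ void histo_sectioned(unsigned char *buffer, long size,
                                unsigned int *histo) {
    // Each thread owns one contiguous section of the input
    long num_threads = (long)blockDim.x * gridDim.x;
    long section = (size + num_threads - 1) / num_threads;   // elements per thread
    long start = ((long)blockIdx.x * blockDim.x + threadIdx.x) * section;
    for (long i = start; i < start + section && i < size; i++) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&histo[alphabet_position / 4], 1);
    }
}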
Sectioned Partitioning (Iteration #1)
Sectioned Partitioning (Iteration #2)
Input Partitioning Affects Memory Access Efficiency
– Sectioned partitioning results in poor memory access efficiency
– Adjacent threads do not access adjacent memory locations
– Accesses are not coalesced
– DRAM bandwidth is poorly utilized
– Note: For coalescing we need locality across threads for one instruction instead of locality across subsequent instructions for one thread
– Change to interleaved partitioning
– In each iteration, all threads together process one contiguous section of elements
– They then all move to the next section and repeat
– The memory accesses are coalesced
Interleaved Partitioning of Input
– For coalescing and better memory access performance
Interleaved Partitioning (Iteration 2)
– For coalescing and better memory access performance
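A small sketch of the interleaved (strided) access pattern behind these figures (the kernel and variable names are illustrative; the full histogram kernel appears later in these slides):

__global__ void interleaved_access(unsigned char *buffer, long size) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's index
    int T = blockDim.x * gridDim.x;                  // total number of threads
    for (long i = t; i < size; i += T) {
        // In iteration k, thread t reads element k*T + t, so adjacent threads
        // read adjacent elements and each warp's loads can be coalesced
        unsigned char value = buffer[i];
        (void)value;
        // ... per-element work (e.g. the bin increment) goes here ...
    }
}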
Read-modify-write in the Text Histogram Example
– Multiple threads trying to access and modify the same data location simultaneously can cause a data race
Data Race in Parallel Thread Execution
thread1:              thread2:
  Old ← Mem[x]          Old ← Mem[x]
  New ← Old + 1         New ← Old + 1
  Mem[x] ← New          Mem[x] ← New

Old and New are per-thread register variables.
Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?
Question 2: What does each thread get in their Old variable?
Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.
Data Race Without Atomic Operations
Mem[x] initialized to 0

time  thread1: Old ← Mem[x]     (Old = 0)
      thread1: New ← Old + 1    (New = 1)
      thread2: Old ← Mem[x]     (Old = 0, thread1 has not written yet)
      thread1: Mem[x] ← New     (Mem[x] = 1)
      thread2: New ← Old + 1    (New = 1)
      thread2: Mem[x] ← New     (Mem[x] = 1)
– Both threads receive 0 in Old
– Mem[x] becomes 1
Purpose of Atomic Operations – To Ensure Good Outcomes
One correct outcome (thread1's read-modify-write completes before thread2's begins):

time  thread1: Old ← Mem[x]     (Old = 0)
      thread1: New ← Old + 1    (New = 1)
      thread1: Mem[x] ← New     (Mem[x] = 1)
      thread2: Old ← Mem[x]     (Old = 1)
      thread2: New ← Old + 1    (New = 2)
      thread2: Mem[x] ← New     (Mem[x] = 2)

The other correct outcome is the mirror image, with thread2 completing first. Either way, Mem[x] ends up as 2.
Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location address
– Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other threads can perform another read-modify-write operation on the same location until the current atomic operation is complete
– Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
– All threads perform their atomic operations serially on the same location
Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
– Atomic Add
int atomicAdd(int* address, int val);
– reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
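A small usage sketch (the kernel name and output array are illustrative assumptions), revisiting the earlier Mem[x] example:

__global__ void safe_increment(int *Mem, int *old_vals) {
    // atomicAdd makes the read-modify-write on Mem[0] indivisible
    int Old = atomicAdd(&Mem[0], 1);   // returns the value before the add
    old_vals[threadIdx.x] = Old;       // one thread gets 0, the other gets 1
}

Launched as safe_increment<<<1, 2>>>(d_Mem, d_old) with Mem[0] initialized to 0, Mem[0] is now guaranteed to end up as 2; the two threads' Old values are 0 and 1 in some order.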
More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
unsigned int atomicAdd(unsigned int* address,
unsigned int val);
– Unsigned 64-bit integer atomic add
unsigned long long int atomicAdd(unsigned long long int* address,
unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 and higher)
float atomicAdd(float* address, float val);
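A brief sketch using the single-precision overload (the kernel and variable names are illustrative): every thread adds its element into one global accumulator.

__global__ void sum_floats(const float *x, int n, float *sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, x[i]);   // float overload of atomicAdd
}

Because floating-point addition is not associative, the accumulation order chosen by the hardware can produce small run-to-run rounding differences.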
A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern
__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is the total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    // In each iteration, all threads together handle
    // blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position/4]), 1);
        i += stride;
    }
}
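A possible host-side launch for this kernel (variable names and the grid configuration are assumptions, not from the slides). Note that histo[] must be zeroed before the launch because the kernel only increments it:

void run_histogram(const unsigned char *h_buffer, long size,
                   unsigned int h_histo[7]) {
    unsigned char *d_buffer;
    unsigned int *d_histo;
    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));   // kernel only adds to histo[]
    // Grid size is illustrative; the strided loop covers any input size
    histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);
    cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);
    cudaFree(d_histo);
}

Error checking of the CUDA API calls is omitted here for brevity.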
Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location
Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized
[Timing diagram: atomic operation N occupies a full DRAM read latency followed by a full DRAM write latency; atomic operation N+1 on the same location can start only after that, adding its own DRAM read and write latencies.]
Latency determines throughput
– The throughput of atomic operations on the same DRAM location is the rate at which the hardware can complete successive read-modify-write sequences on that location.
– The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations.
– This means that if many threads attempt to do atomic operation on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel!
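A rough worked example (the clock rate and latency are illustrative assumptions, not measurements): with a 1 GHz memory clock and a read-modify-write latency of about 1000 cycles, one contended location sustains at most 10^9 / 1000 = 10^6 atomic operations per second. At 4 bytes per update that is roughly 4 MB/s of useful traffic, versus tens of GB/s of peak bandwidth for a memory channel, which is where the "< 1/1000" figure comes from.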
Hardware Improvements
– Atomic operations on Fermi L2 cache
– Medium latency, about 1/10 of the DRAM latency
– Shared among all blocks
– “Free improvement” on Global Memory atomics
[Timing diagram: with L2 atomics, atomic operation N and atomic operation N+1 each cost an L2 read latency plus an L2 write latency, so the serialized sequence completes roughly 10x faster than in DRAM.]
Hardware Improvements
– Atomic operations on Shared Memory
– Very short latency
– Private to each thread block
– Requires algorithm work by programmers (more later)
[Timing diagram: with shared memory atomics, atomic operation N and atomic operation N+1 complete back-to-back with very short latency.]
Privatization
[Diagram: without privatization, Block 0, Block 1, ..., Block N all perform their atomic updates directly on the single final copy of the histogram, causing heavy contention and serialization.]
Privatization (cont.)
[Diagram: with privatization, each of Block 0, Block 1, ..., Block N performs its atomic updates on its own private copy; the private copies are then accumulated into the final copy, giving much less contention and serialization.]
Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing the private copies
  – Overhead for accumulating the contents of the private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x
Shared Memory Atomics for Histogram
– Each private copy is updated only by the subset of threads in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory!
Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block
__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();
Initialize the bin counters in the private copies of histo[]
Build Private Histogram
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is the total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position/4]), 1);
        i += stride;
    }
    __syncthreads();   // wait for all other threads in the block to finish
    // Build the final histogram: merge this block's private copy into histo[]
    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}
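For reference, a consolidated sketch of the complete privatized kernel assembled from the fragments above (the kernel name is illustrative; it assumes 7 bins, lowercase input, blockDim.x >= 7, and histo[] zero-initialized by the host):

__global__ void histo_privatized_kernel(unsigned char *buffer,
                                        long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with interleaved partitioning
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;   // total number of threads
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }
    __syncthreads();   // wait until the whole block has finished

    // Merge this block's private copy into the global histogram
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}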
More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
– The histogram add operation is associative and commutative
– No privatization if the operation does not meet these requirements
– The private histogram size needs to be small
– It must fit into shared memory