Parallel Computing with GPUs: Parallel Patterns
Dr Paul Richmond
http://paulrichmond.shef.ac.uk/teaching/COM4521/
Parallel Patterns Overview
Reduction
Scan
What are Parallel Patterns?
Parallel patterns are high level building blocks that can be used to
create algorithms
Implementation is abstracted to give a higher level view
Patterns describe techniques suited to parallelism
Allows algorithms to be built with parallelism from the ground up
A top down approach might not parallelise very easily…
Consider the simplest parallel pattern: Map
Takes the input list i
Applies a function f
Writes the result list o by applying f to all members of i
Equivalent to a CUDA kernel where i and o are memory locations determined
by threadIdx etc.
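A minimal sketch of such a kernel (hypothetical names; assuming squaring as the function f and n as the list length):

//map: each thread applies f to one element of the input list i and writes the result to o
__global__ void map_square(const float *i, float *o, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        o[idx] = i[idx] * i[idx];  //f(x) = x*x applied element-wise
}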
Gather
Multiple inputs and single coalesced output
Might have sequential loading or random access
This affects memory performance
Differs from map due to multiple inputs
[Figure: gather operations — each thread (threadIdx.x) reads from a number of memory values/locations and writes a single output; the second example shows random access loads]
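A minimal gather sketch (hypothetical names; assumes an index map with two entries per output element):

//gather: each thread reads from multiple (possibly random) input locations and writes one coalesced output
__global__ void gather_pairs(const float *in, const unsigned int *indices, float *out, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[indices[2*idx]] + in[indices[2*idx + 1]];  //two loads, one write
}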
Scatter
Reads from a single input and writes to one or many
Can be implemented in CUDA using atomics
Write pattern will determine performance
[Figure: scatter operations — each thread (threadIdx.x) writes to a number of memory values/locations; random access writes can lead to collisions on write]
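A minimal scatter sketch using atomics to handle collisions on write (hypothetical names):

//scatter: each thread reads one input and writes to a location given by an index map;
//atomicAdd resolves collisions when two threads target the same location
__global__ void scatter_add(const float *in, const unsigned int *indices, float *out, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&out[indices[idx]], in[idx]);
}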
Other Parallel Patterns
Stencil
Gather a fixed pattern, usually based on locality
See 2D shared memory examples
Reduce (this lecture)
Reduce a set of values to a single value or a set of key-value pairs
Combined with Map to form Map Reduce (often with intermediate shuffle or
sort)
Scan (this lecture)
Compute the running sum of the previous values in a set
Sort (later)
Sort values or key-value pairs
Parallel Patterns Overview
Reduction
Scan
Reduction
A reduction is where all elements of a set have a common binary associative
operator (⊕) applied to them to “reduce” the set to a single value
Binary associative = the order in which operations are performed on the set does not matter
E.g. (1 + 2) + 3 + 4 == 1 + (2 + 3) + 4 == 10
Example operators
Most obvious example is addition (Summation)
Other examples: maximum, minimum, product
Serial example is trivial but how does this work in parallel?
int reduce(int r, int i){
    return r + i;
}

int data[N];
int r = 0;
for (int i = 0; i < N; i++){
    r = reduce(r, data[i]);
}

OR

int data[N];
int r = 0;
for (int i = N-1; i >= 0; i--){
    r = reduce(r, data[i]);
}
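The same serial pattern works for any of the operators above; e.g. a hypothetical maximum reduction only changes the reduce function:

int reduce_max(int r, int i){
    //note: for a max reduction r should be initialised to INT_MIN (or data[0]) rather than 0
    return (i > r) ? i : r;
}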
Parallel Reduction
Order of operations does not matter so we don’t have to think serially.
A tree based approach can be used
At each step data is reduced by a factor of 2
[Figure: tree based reduction using ⊕ — 9 5 2 1 2 3 8 1 → 14 3 5 9 → 17 14 → 31; N elements reduced in log2(N) steps]
Parallel Reduction in CUDA
No global synchronisation so how do multiple blocks perform
reduction?
Split the execution into multiple stages
Recursive method
[Figure: the same tree based reduction (9 5 2 1 2 3 8 1 → 14 3 5 9 → 17 14 → 31), with each level of the tree performed by a separate kernel launch (Kernel Launch 1, 2, 3)]
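A rough host-side sketch of the multi-stage idea (illustrative only: it assumes a block level sum_reduction kernel like the one shown on the following slides that writes one partial sum per block, and that N is padded to a multiple of the block size):

//each launch reduces N elements to one partial sum per block;
//repeat, swapping input and output buffers, until a single value remains
void recursive_reduce(float *d_in, float *d_out, unsigned int N, unsigned int block_size)
{
    while (N > 1) {
        unsigned int blocks = (N + block_size - 1) / block_size;
        sum_reduction<<<blocks, block_size, block_size * sizeof(float)>>>(d_in, d_out);
        float *tmp = d_in; d_in = d_out; d_out = tmp;  //next pass reduces the block results
        N = blocks;
    }
}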
Recursive Reduction Problems
What might be some problems with the following?
__global__ void sum_reduction(float *input, float *results){
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    //load one element per thread into shared memory
    sdata[threadIdx.x] = input[i];
    __syncthreads();
    //even threads add their neighbour and write a result back to global memory
    if (i % 2 == 0){
        results[i / 2] = sdata[threadIdx.x] + sdata[threadIdx.x+1];
    }
}
Block Level Reduction
Lower launch overhead (reduction within block)
Much better use of shared memory
__global__ void sum_reduction(float *input, float *block_results){
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = input[i];
    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride*=2){
        unsigned int strided_i = threadIdx.x * 2 * stride;
        if (strided_i < blockDim.x){
            sdata[strided_i] += sdata[strided_i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}

Block Level Recursive Reduction
[Figure: the block level reduction loop applied to shared memory values 9 5 2 1 2 3 8 1 —
Loop 1 (stride = 1): 14 5 3 1 5 3 9 1
Loop 2 (stride = 2): 17 5 3 1 14 3 9 1
Loop 3 (stride = 4): 31 5 3 1 14 3 9 1]

Block Level Reduction
Is this shared memory access pattern bank conflict free?

for (unsigned int stride = 1; stride < blockDim.x; stride*=2){
    unsigned int strided_i = threadIdx.x * 2 * stride;
    if (strided_i < blockDim.x){
        sdata[strided_i] += sdata[strided_i + stride];
    }
    __syncthreads();
}

Block Level Reduction (Sequential Addressing)

for (unsigned int stride = blockDim.x/2; stride > 0; stride>>=1){ //stride /= 2 each loop
    if (threadIdx.x < stride){
        sdata[threadIdx.x] += sdata[threadIdx.x + stride];
    }
    __syncthreads();
}

[Figure: the sequential addressing loop applied to shared memory values 9 5 2 1 2 3 8 1 —
Loop 1 (stride = 4): 11 8 10 2 2 3 8 1
Loop 2 (stride = 2): 21 10 10 2 2 3 8 1
Loop 3 (stride = 1): 31 10 10 2 2 3 8 1]

Now conflict free regardless of the reduction loop stride
The stride between shared memory variable accesses for threads is always sequential
Careful: two types of stride are discussed
1. Loop stride (of the algorithm)
2. Shared memory variable stride (in 4 byte words)
[Table: with a shared memory stride of 1 and a loop stride of 1, thread t accesses index t + 1, i.e. bank (t + 1) % 32 — 32 banks used, max 1 conflict]

Global Reduction Approach
Use the recursive method
Our block level reduction can be applied to the result
At some stage it may be more effective to simply sum the final block on the CPU
Or use atomics on block results

Global Reduction Atomics

__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = input[i];
    __syncthreads();

    for (unsigned int stride = blockDim.x/2; stride > 0; stride>>=1){
        if (threadIdx.x < stride){
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}

Further Optimisation?
Can we improve our technique further?

__global__ void sum_reduction(float *input, float *result){
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = input[i];
    __syncthreads();

    for (unsigned int stride = blockDim.x/2; stride > 0; stride>>=1){
        if (threadIdx.x < stride){
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}

Parallel Patterns Overview
Reduction
Scan

What is scan?
Consider the following…
Remove empty elements from an array (compact)
Split elements of an array based on a condition (split)
Output a variable number of values per thread
[Figure: examples — compacting a sparse array of letters (A … I); splitting 1 2 5 1 3 4 3 8 into 1 5 1 3 3 and 2 4 8; compacting 2 0 5 6 3 0 1 0 to 2 5 6 3 1, where each thread outputs a variable number of values]
Each has the same problem
This is not even a consideration for sequential programs!
Where to write output in parallel?

Parallel Prefix Sum (scan)
Where to write output in parallel?
Each thread needs to know the output location(s) it can write to avoid conflicts.
[Figure: compaction —
Thread/Read index: 0 1 2 3 4 5 6 7
Sparse data: 2 0 5 6 3 0 1 0
Output/Write index (running sum of binary output state): 0 0 1 2 3 3 4 4
Compacted data: 2 5 6 3 1]
The solution is a parallel prefix sum (or scan)
Given the inputs A = [a0, a1, …, an-1] and a binary associative operator ⊕
Scan(A) = [0, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]

Serial Parallel Prefix Sum Example
E.g. given the input and the addition operator
A = [2, 6, 2, 4, 7, 2, 1, 5]
Scan(A) = [0, 2, 2+6, 2+6+2, 2+6+2+4, …]
Scan(A) = [0, 2, 8, 10, 14, 21, 23, 24]
More generally a serial implementation of an additive scan using a running sum looks like…

int A[8] = { 2, 6, 2, 4, 7, 2, 1, 5 };
int scan_A[8];
int running_sum = 0;
for (int i = 0; i < 8; ++i) {
    scan_A[i] = running_sum;
    running_sum += A[i];
}

Serial Scan for Compaction

int Input[8] = { 2, 0, 5, 6, 3, 0, 1, 0 };
int A[8];
int scan_A[8];
int output[5];
int running_sum = 0;

for (int i = 0; i < 8; ++i){
    A[i] = Input[i] > 0;
}
for (int i = 0; i < 8; ++i){
    scan_A[i] = running_sum;
    running_sum += A[i];
}
for (int i = 0; i < 8; ++i){
    int input = Input[i];
    if (input > 0){
        int idx = scan_A[i];
        output[idx] = input;
    }
}
// generate scan input
// A = {1, 0, 1, 1, 1, 0, 1, 0}
// scan
// result = {0, 1, 1, 2, 3, 4, 4, 5}
// scattered write
// output = {2, 5, 6, 3, 1}
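In parallel, the scatter stage of the compaction could be written as a kernel once the exclusive scan of the flags is available (a sketch with hypothetical names; the following slides cover how to compute the scan itself in parallel):

//compaction: threads with a non-empty input write to the output index given by the exclusive scan
__global__ void compact(const int *input, const int *scan_A, int *output, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && input[idx] > 0)
        output[scan_A[idx]] = input[idx];
}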
Parallel Local (Shared Memory) Scan
[Figure: parallel local scan of shared memory values 2 6 2 4 7 2 1 5 —
Loop 1 (stride = 1): 2 8 8 6 11 9 3 6
Loop 2 (stride = 2): 2 8 10 14 19 15 14 15
Loop 3 (stride = 4): 2 8 10 14 21 23 24 29]
Log2(N) steps
After log2(N) loops each sum holds its own value plus the preceding 2^n - 1 values
Inclusive scan: 2 8 10 14 21 23 24 29
Exclusive scan + reduction: 0 2 8 10 14 21 23 24, total 29
Implementing Local Scan with Shared Memory
No bank conflicts (stride of 1 between threads)
Synchronisation required between read and write
__global__ void scan(float *input) {
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        //read the value stride elements to the left (0 for the first stride threads)
        float s_value = (threadIdx.x >= stride) ? s_data[threadIdx.x - stride] : 0;
        __syncthreads();
        s_data[threadIdx.x] += s_value;
    }
    //something with global results?
}
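A hypothetical launch of this kernel, sizing the dynamic shared memory to one float per thread (names are illustrative):

scan<<<num_blocks, threads_per_block, threads_per_block * sizeof(float)>>>(d_input);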
Implementing Local Scan (at warp level)
Exactly the same as the block level technique but at warp level
The warp prefix sum (the warp total) ends up in the lane where threadIdx.x % 32 == 31
Either use shared memory to reduce between warps
Or consider the following global scan approaches.
__global__ void scan(float *input) {
    __shared__ float s_data[32];
    float val1, val2;
    val1 = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int s = 1; s < 32; s <<= 1) {
        //read val1 from the lane s places lower in the warp
        val2 = __shfl_up(val1, s);
        if (threadIdx.x % 32 >= s)
            val1 += val2;
    }
    //store warp level results
}
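Note that on current CUDA toolkits the warp shuffle intrinsics take an explicit participation mask; assuming all 32 lanes of the warp are active, the equivalent call would be:

val2 = __shfl_up_sync(0xffffffff, val1, s);  //synchronising variant of __shfl_up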
Implementing scan at Grid Level
[Figure: grid level scan — two thread blocks each perform a local scan in shared memory:
Thread Block 1: 2 6 2 4 7 2 1 5 → local scan result 2 8 10 14 21 23 24 29
Thread Block 2: 4 2 2 4 1 0 1 2 → local scan result 4 6 8 12 13 13 14 16
Global scan result: 2 8 10 14 21 23 24 29 33 35 37 41 42 42 43 45]
Implementing scan at Grid Level
Same problem as reduction when scaling to grid level
Each block is required to add the reduction value (total) of the preceding blocks
A global scan therefore requires either:
1. A recursive scan kernel on the results of the local scan
An additional kernel to add the sums of the preceding blocks (a sketch of this add step follows below)
2. Atomic increments (next slides)
Increment a counter for block level results
An additional kernel to add the sums of the preceding blocks to each value
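A minimal sketch of the add step from approach 1 (hypothetical names; assumes the per-block totals have already been scanned so that scanned_block_sums[b] holds the sum of all blocks before b):

//add the scanned total of the preceding blocks to every element of this block's local scan result
__global__ void add_block_offsets(float *local_result, const float *scanned_block_sums)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    local_result[idx] += scanned_block_sums[blockIdx.x];
}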
Global Level Scan (Atomics Part 1)
__device__ float block_sums[BLOCK_DIM];
__global__ void scan(float *input, float *local_result) {
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        //read the value stride elements to the left (0 for the first stride threads)
        float s_value = (threadIdx.x >= stride) ? s_data[threadIdx.x - stride] : 0;
        __syncthreads();
        s_data[threadIdx.x] += s_value;
    }
    //store the local scan result for each thread
    local_result[threadIdx.x + blockIdx.x*blockDim.x] = s_data[threadIdx.x];
    //atomic store: add this block's total to the block totals of all following blocks
    if (threadIdx.x == 0){
        for (int i = 0; i < gridDim.x; i++){
            if (i > blockIdx.x)  //only blocks after this one need this total as part of their offset
                atomicAdd(&block_sums[i], s_data[blockDim.x - 1]);
        }
    }
}