University of Toronto Mississauga, Department of Mathematical and Computational Sciences
CSC 367
Parallel Programming
Bogdan Simion
General-purpose computing with
Graphics Processing Units (GPUs)
Comprehensive examples – Reductions
With many thanks to NVIDIA’s Mark Harris for some of the neat CUDA examples!
This week
• Comprehensive example: reduction operation on GPUs
• Use all we learnt so far …
• Other ways to do reduction
• Shuffle on down, use atomics
2
Reduction
• Reduce multiple values into a single value
• e.g., find the sum (or max, min, etc.) of an array's values
• This is quite a common operation => must be efficiently implemented
• Easy to implement in parallel, but hard to get it done efficiently
• Great optimization example for all we’ve learnt so far
• Go through several iterations of this implementation
• Take several optimization strategies
3
Reduction – sum of an array
• Sequential implementation – O(n):
4
[Figure: sequential left-to-right accumulation over the array elements, producing the final sum]
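For reference, a minimal sequential version (a sketch; the array name and length are illustrative):
// Sequential O(n) reduction: one pass, accumulating into a single variable
int sum = 0;
for (int i = 0; i < n; i++)
    sum += a[i];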
Parallel Reduction
• Approach: pair-wise sums, one per thread
• Tree-like structure, O(log₂ n)
• Each thread adds two elements and produces a partial sum
5
[Figure: binary reduction tree over the array; the root is the final sum]
Problem: each level of the tree (the red dotted lines in the figure) must act like a barrier –
all values in the level above must be computed before we can proceed
Parallel Reduction
• Now sum four elements per thread instead
• Tree-like structure, O(log₄ n)
• Each thread adds 4 elements
6
[Figure: 4-ary reduction tree over the array; the root is the final sum]
Problem: similarly, must synchronize threads between stages
Parallel Reduction
• Main goals:
• Must keep all SMs on the GPU busy
• Must leverage the various GPU memories efficiently
• Considerations
• Must use a large number of thread blocks
• Each block produces a partial result
• Must communicate partial results between blocks, to compute further partial results and ultimately get the final result
7
Using shared memory – preliminaries
• Compiler creates a copy of a __shared__ variable for each block
• All threads in a block share the variable's memory, but cannot see (or change) the copy that another block sees
• Idea: maintain partial sums in shared memory!
• __shared__ int sdata[];
• Each block processes a portion of the input
• Recall also that we need barriers:
• __syncthreads(): waits until all threads in the block have reached it (i.e., all their prior instructions have completed)
• Not a global barrier, only within a block!
8
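A minimal sketch of the mechanics (illustrative only; the kernel name partial_sums and the launch parameters are assumptions, not from the slides) – a dynamically sized __shared__ array is declared with extern and sized via the third launch parameter:
__global__ void partial_sums(const int *g_in) {
    extern __shared__ int sdata[];                 // one copy per block, size set at launch
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                               // barrier: all loads in this block are done
    // ... per-block reduction on sdata (developed in the next slides) ...
}
// Host side: the third launch argument is the bytes of shared memory per block
// partial_sums<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(d_in);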
Global synchronization problem
• Problem: how to know when each block has finished its partial sum?
9
[Figure: blocks 0–3 each computing their partial sum, finishing at different times]
• We need some way to wait for the last block to finish => we need a global barrier
• CUDA does not provide a global barrier – and it would not be a good idea anyway:
• Expensive to implement
• Either limits the number of blocks, or risks deadlock – think why!
Decompose reduction into multiple kernels
• Leverage implicit synchronization between kernel invocations
10
• Wait .. what about the overheads of launching a new kernel?
• negligible hardware overhead, low software overhead
[Figure: Kernel 1 – blocks 0–3 each produce a partial sum; Kernel 2 – combines those partial sums]
Parallel reduction with multiple kernels
• Basic idea: decompose computations into multiple successive
kernel invocations
• The code for all levels is exactly the same
• “Recursive” kernel invocation
11
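A host-side sketch of this pattern (assumptions: a kernel reduce that takes the current element count and writes one partial sum per block, device buffers d_in/d_out, and a fixed threads per block):
int n = N;
while (n > 1) {
    int blocks = (n + threads - 1) / threads;
    reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    int *tmp = d_in; d_in = d_out; d_out = tmp;   // this round's output is next round's input
    n = blocks;                                   // one partial sum per block remains
}
// d_in[0] now holds the final result; launches on the same stream serialize,
// and a cudaMemcpy/cudaDeviceSynchronize is still needed before reading it back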
What is our optimization goal?
• We should strive to reach GPU peak performance
• Choose the right metric:
• GFLOP/s: for compute-bound kernels
• Bandwidth: for memory-bound kernels
• Reductions have very low arithmetic intensity
• 1 flop per element loaded (bandwidth-optimal)
• Calculate the peak bandwidth of your GPU, e.g.:
• 384-bit memory interface, 900 MHz DDR
• (384/8) bytes × (900 MHz × 2 for DDR) = 384 × 1800 / 8 = 86.4 GB/s
12
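The bandwidth figures reported in the tables that follow can be derived from the measured time (a sketch; n is the element count and t_ms the measured kernel time in milliseconds):
// Effective bandwidth: bytes moved / elapsed time
// (a reduction reads each of the n ints from global memory once)
double gbps = ((double)n * sizeof(int)) / (t_ms * 1e-3) / 1e9;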
Access pattern
14
[Figure: interleaved addressing over 16 elements – active thread ids per step:
s = 1: 0, 2, 4, 6, 8, 10, 12, 14
s = 2: 0, 4, 8, 12
s = 4: 0, 8
s = 8: 0]
• Compute partial sums at each step, with a different stride
• First adjacent elements, then every other element, every 4th, etc.
• Each tid maps onto one element; depending on the step, some threads do no work
Attempt #1: Interleaved addressing
15
__global__ void reduce1(int *g_idata, int *g_odata) {
extern __shared__ int sdata[]; // ignore the “extern” for now
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i]; // Each thread loads one elem to shmem
__syncthreads();
// Reduction in shared memory, half the threads each step
for (int s = 1; s < blockDim.x; s *= 2) { // step = s * 2
if (tid % (2*s) == 0) { // Take only tids divisible by step
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// Write the result for this block back to global memory
if (tid == 0) { g_odata[blockIdx.x] = sdata[0]; }
}
Note: could we wait only for the threads that update elements, i.e., move the __syncthreads() inside the if?
No – __syncthreads() must be reached by every thread in the block.
Also, within a warp not all threads take the "if" branch:
Divergent code impacts performance!
Divergent execution
16
• Consider two warps – what happens at each step?
[Figure: active vs. inactive threads across the two warps at each step, s = 1 through s = 32]
Attempt #2: Deal with divergent code
17
[Figure: strided indexing over 16 elements – active thread ids per step:
s = 1: 0–7
s = 2: 0–3
s = 4: 0, 1
s = 8: 0]
• Compute partial sums at each step, with a different stride
• First adjacent elements, then every other element, every 4th, etc.
• Now the threads that do work are at the start (lowest tids), which decreases divergence…
Attempt #2: Deal with divergent code
18
for (int s = 1; s < blockDim.x; s *= 2) { // step = s * 2
if (tid % (2*s) == 0) { // Take only tids divisible by step
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
• Idea: replace the divergent branch ...
• ... by using strided index and non-divergent branch
for (int s = 1; s < blockDim.x; s *= 2) {
int idx = 2 * s * tid; // Take all tids, then first half, etc
if (idx < blockDim.x) {
sdata[idx] += sdata[idx + s];
}
__syncthreads();
}
Deal with divergence
19
• Previous approach had divergence at every step. No divergence now for s = 1
• In reality, there are more than just 2 warps => more non-divergent steps
[Figure: active vs. inactive threads across the two warps at each step, s = 1 through s = 32, with strided indexing]
New problem: bank conflicts
• Reminder: a bank conflict occurs when threads in a half-warp access different 32-bit words in the same bank
• Interleaved accesses => at almost every step, we have 2-way bank conflicts
• Consider what happens with a large number of threads (more than a half-warp!)
20
[Figure: same strided-index pattern as before – active thread ids per step: s = 1: 0–7; s = 2: 0–3; s = 4: 0, 1; s = 8: 0 – but now consider which shared-memory banks are touched at each step]
Intuition: accessing memory non-sequentially => we end up accessing multiple memory locations in the same bank…
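As a quick sanity check for the strided-index version (a sketch assuming 32 banks of 4-byte words, bank(i) = i mod 32; older 16-bank hardware gives the same 2-way effect per half-warp):
// Step s = 1: thread t updates sdata[2*t] (and reads sdata[2*t + 1]).
// Within one warp, threads t and t+16 map to the same bank:
//   thread 0  -> word 0  -> bank 0
//   thread 16 -> word 32 -> bank 0    (2-way conflict)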
“Sequential” access pattern
• No conflicts with this sequential addressing
• Keep in mind how consecutive memory locations are laid out across banks!
21
[Figure: sequential addressing over 16 elements – active thread ids per step: s = 8: 0–7; s = 4: 0–3; s = 2: 0, 1; s = 1: 0; each active thread adds sdata[tid + s] into sdata[tid]]
Attempt #3: Enforce sequential addressing
22
• Idea: replace the strided indexing in the loop …
for (int s = 1; s < blockDim.x; s *= 2) {
    int idx = 2 * s * tid; // Take all tids, then first half, etc
    if (idx < blockDim.x) {
        sdata[idx] += sdata[idx + s];
    }
    __syncthreads();
}
• … with a "reversed" loop and threadID-based indexing:
for (int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
Idle threads
• Problem: half the threads are idle on the first iteration, and it only gets worse after that:
• This is not efficient!!!
23
for (int s = blockDim.x/2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
Attempt #4: First iteration at load time
• How about halving the number of blocks, and replacing the single load …
24
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i]; // Each thread loads 1 elem to shmem
__syncthreads();
• … with 2 loads and the first add of the reduction:
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
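On the host side, the grid shrinks to match (a sketch; variable names are illustrative):
// Each block now covers 2 * blockDim.x input elements, so half as many blocks are needed
int blocks = (n + threads * 2 - 1) / (threads * 2);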
Performance for 4M element reduction
• Block size: 128 threads; results are from an older GPU:
25
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
Instruction Bottleneck
• At 17 GB/s, we're far from bandwidth-bound
• And we know the reduction has low arithmetic intensity ...
• Therefore, a likely bottleneck is instruction overhead
• Ancillary instructions that are not loads, stores, or arithmetic for the core
computation
• In other words: address arithmetic and loop overhead
• Strategy: unroll loops!
26
Unrolling the last warp
• As reduction proceeds, number of "active" threads decreases
• When s <= 32, we have only one warp left
• Recall that instructions are SIMD synchronous within a warp
• That means when s <= 32:
• We don't need to __syncthreads()
• We don't need the " if (tid < s) " because it doesn't save any work
• Let's unroll the last 6 iterations of the inner loop
27
Attempt #5: Unroll the last warp
• Note: this saves useless work in all warps, not just the last one!
• Without unrolling, all warps execute every iteration of the for loop and if stmt
28
for (int s = blockDim.x/2; s > 32; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid < 32) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
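(Note: this relies on implicit warp-synchronous execution; on current CUDA the shared array should be declared volatile, or __syncwarp() calls inserted between these steps, to keep it correct.)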
Performance for 4M element reduction
29
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
• Attempt #5 (unroll last warp): 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
Complete unrolling
• Idea: If we knew the number of iterations at compile time, we could
completely unroll the reduction
• Luckily, the block size is limited by the GPU (assume that limit is 512 threads
for our card)
• Also, we are sticking to power-of-2 block sizes
• So we can easily unroll for a fixed block size
• Problem: we need to be generic – how can we unroll for block sizes
that we don’t know at compile time?
30
Unrolling with Templates
• Templates to the rescue!
• CUDA supports C++ template parameters on device and host functions
• Specify block size as a function template parameter:
31
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata) {
…
}
Attempt #6: Completely Unrolled
• Note: all of the blockSize comparisons (shown in red on the original slide) are evaluated at compile time
• Results in a very efficient inner loop!
32
if (blockSize >= 512) {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
}
if (blockSize >= 256) {
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
}
if (blockSize >= 128) {
    if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
}
if (tid < 32) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
Invoking Template Kernels
• Don’t we still need block size at compile time?
• Nope, just a switch statement with 10 possible block sizes:
33
switch (threads) {
    case 512:
        reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 256:
        reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 128:
        reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 64:
        reduce6< 64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 32:
        reduce6< 32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 16:
        reduce6< 16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 8:
        reduce6<  8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 4:
        reduce6<  4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 2:
        reduce6<  2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 1:
        reduce6<  1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
}
// (the launch arguments – dimGrid, dimBlock, smemSize, d_idata, d_odata – are the usual
//  grid/block dimensions, shared-memory size, and device pointers, assumed from context)
Performance for 4M element reduction
34
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
• Attempt #5 (unroll last warp): 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
• Attempt #6 (completely unrolled): 0.381 ms, 43.996 GB/s, 1.41x step, 21.16x cumulative
Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
• e.g., N = 16, S = 1, 2, 3, …
35
[Figure: sequential-addressing reduction of 16 elements – S = 1: 8 ops (threads 0–7), S = 2: 4 ops (threads 0–3), S = 3: 2 ops (threads 0, 1), S = 4: 1 op (thread 0)]
Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
• Step complexity is O(log N)
• For N = 2^D, it performs ∑_{S=1..D} 2^(D−S) = N − 1 operations
• Work complexity is O(N) => it is work efficient
• i.e., does not perform more operations than a sequential algorithm
• With P threads physically in parallel (P processors), time complexity is
O(N/P + log N)
• Compared to O(N) for sequential reduction
• In a thread block: P=N => O(log N)
36
What about cost?
• Cost of a parallel algorithm is # of processors x time complexity
• Allocate threads instead of processors: O(N) threads
• Time complexity is O(log N) => cost is O(N log N) => not cost efficient!
• A previous lecture: let’s scale down to O(N / log N) threads
(technically, this derives from something called Brent’s theorem…)
• Each thread does O(log N) sequential work
• Then all O(N / log N) threads cooperate for O(log N) steps
• Cost = O( (N / log N) * log N) = O(N) => cost efficient!
• Recall discussion from a few lectures ago too … (same idea)
• Sometimes called algorithm cascading
• Can lead to significant speedups in practice
37
Algorithm cascading
• Combine sequential and parallel reduction
• Each thread loads and sums multiple elements into shared memory
• Tree-based reduction in shared memory
• Each thread should sum up O(log N) elements
• i.e., 1024 or 2048 elements per block vs. 128 or 256
• Might be worth pushing it even further
• Possibly better latency hiding with more work per thread
• More threads per block reduces levels in tree of recursive kernel invocations
• High kernel launch overhead in last levels with few blocks
• Play around with this on your GPU – optimal # of blocks and threads?
• How many elements per thread?
38
Attempt #7: Multiple adds per thread
• Replace the load and add of two elements:
39
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
• … with a while loop that adds as many elements as necessary:
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
unsigned int gridSize = (blockSize*2*gridDim.x);
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
Why do we use a gridSize loop stride?
Maintains coalescing!
Performance for 4M element reduction
40
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
• Attempt #5 (unroll last warp): 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
• Attempt #6 (completely unrolled): 0.381 ms, 43.996 GB/s, 1.41x step, 21.16x cumulative
• Attempt #7 (multiple elements per thread): 0.268 ms, 62.671 GB/s, 1.42x step, 30.04x cumulative
• Attempt 7 on 32M elements: 73 GB/s!
Final optimized reduction kernel
41
template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x; // tid within a block
    unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
    unsigned int gridSize = (blockSize*2*gridDim.x);
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }
    if (tid < 32) { // warp-synchronous final steps (see the note after Attempt #5)
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Take-aways
• Must understand CUDA performance characteristics
• Memory coalescing
• Divergent branching
• Bank conflicts
• Latency hiding
• Algorithmic optimizations
• Decompose into multiple kernels
• Changes to addressing
• Loop unrolling
• Algorithm cascading
• Efficiently parallelizing code requires thought + analysis of tradeoffs !
42
Next up …
• Other ways to do reduction:
• Shuffle on down, use atomics
43
Reduction (revisited): faster version using shuffles
• As more features get added, various operations can be implemented much
faster using neat tricks
• Parallel reduction exchanges data between threads in the same block
• On earlier hardware, shared memory was the way to go: write data to shmem,
synchronize, reduce locally, then put data back into global mem
• As of Kepler cards, shuffle instruction (SHFL) enables a thread to directly read a
register from another thread in the same warp
• Threads in a warp can now collectively exchange or broadcast data
• Shuffle intrinsics (pre-CUDA 9): __shfl(), __shfl_down(), __shfl_up(), __shfl_xor()
• Shuffle intrinsics (CUDA 9+): __shfl_sync(), __shfl_down_sync(), __shfl_up_sync(), __shfl_xor_sync()
• We’ll use __shfl_down_sync(), for more info on others check out NVIDIA docs
• See older post on: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
44
Shuffle down
• Shuffle down operation (pre-CUDA9, use __shfl_down and no mask):
45
T __shfl_down_sync(unsigned mask, T var, unsigned int delta,
                   int width=warpSize);
• Calculates a source lane ID by adding a delta to the caller’s lane ID
• A lane ID is a thread’s index within its warp (0 – 31)
• Shifts var down the warp by delta lanes
• If the source lane ID is out of range or the source thread has exited, the calling thread’s
own var is returned
• The ID number of the source lane will not wrap around the value of width and so the
upper delta lanes will remain unchanged
• The mask indicates the threads participating in the call
• Width must be one of (2, 4, 8, 16, 32), default is warp size
Shuffle down
• Shuffle down operation:
46
T __shfl_down_sync(unsigned mask, T var, unsigned int delta,
                   int width=warpSize);
i = threadIdx.x % 32;
j = __shfl_down_sync(0xFFFFFFFF, i, 2, 8);
[Figure: with width 8 and delta 2, lanes 0–7 start with i = 0 1 2 3 4 5 6 7 and end with j = 2 3 4 5 6 7 6 7 – each lane reads i from the lane delta positions above it, while the upper delta lanes keep their own value]
• Advantages
• Replaces a multi-instruction shared mem sequence with a single instruction
• Does not use any shared memory
• Synchronization is within a warp and is implicit in the instruction, so no need
for __syncthreads()
Shuffle warp reduce
• First, use it to reduce within a warp
• Use __shfl_down_sync to build “reduction tree”
47
[Figure: reduction tree over lanes 0–7 of a warp – each lane starts with one element; after delta = 4 lanes 0–3 hold sums of 2 elements, after delta = 2 lanes 0–1 hold sums of 4, after delta = 1 lane 0 holds the sum of all 8]
v += __shfl_down_sync(0xFFFFFFFF, v, 4);
v += __shfl_down_sync(0xFFFFFFFF, v, 2);
v += __shfl_down_sync(0xFFFFFFFF, v, 1);
• After the __shfl_down_sync operations, thread 0 has the result in its variable v
Shuffle warp reduce
• Shfl-based warp reduction:
48
__inline__ __device__
int warp_reduce_sum (int val){
for (int offset = warpSize/2; offset > 0; offset >>= 1)
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
return val;
}
• After the __shfl_down_sync operations, thread 0 has the result in its val
• Multiple warps: thread 0 (lane 0) of each warp will hold that warp's reduced value
• Side-note: could implement an “all-reduce” semantic using
__shfl_xor_sync
• Check documentation for further details
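A possible sketch of that all-reduce variant (the function name is illustrative): with the XOR ("butterfly") exchange pattern, every lane ends up holding the warp-wide sum, not just lane 0.
__inline__ __device__
int warp_allreduce_sum(int val) {
    for (int mask = warpSize/2; mask > 0; mask >>= 1)
        val += __shfl_xor_sync(0xFFFFFFFF, val, mask);   // exchange with lane (laneID XOR mask)
    return val;
}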
Block reduce
• We can use warp reduction to implement reduction across entire block
49
__inline__ __device__
int block_reduce_sum(int val) {
static __shared__ int shared[32]; // Shared mem for 32 partial sums
int lane = threadIdx.x % warpSize; // Lane = thread id within a warp
int wid = threadIdx.x / warpSize; // Warp ID of the thread
val = warp_reduce_sum(val); // Each warp performs partial reduction
if(lane == 0) shared[wid] = val; // Write reduced value to shmem
__syncthreads(); // Wait for all partial reductions within a block
// Read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if(wid == 0) val = warp_reduce_sum(val);//Final reduce within 1st warp
return val;
}
Reduce across blocks
• Involves global communication => break computation into two kernel launches
• Kernel1 generates partial reduction results (1 per block, in t0 of the block)
• Kernel2 reduces the partial results into a grand total
50
__global__ void reduce_kernel(int *in, int *out, int N) {
int sum = 0;
//reduce multiple elements per thread
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < N;
i += blockDim.x * gridDim.x) {
sum += in[i];
}
sum = block_reduce_sum(sum);
if (threadIdx.x == 0) out[blockIdx.x] = sum;
}
...
int threads = 512;
int blocks = min((N + threads - 1) / threads, 1024); // restrict to at most 1024 blocks
                                                     // (at most 1024 partial results)
reduce_kernel<<<blocks, threads>>>(in, out, N);      // K1: generate up to 1024 partial results
reduce_kernel<<<1, 1024>>>(out, out, blocks);        // K2: reduce the partials into the grand total
// We know there are at most 1024 partial results now, so the same kernel,
// launched with a single block of 1024 threads, finishes the job
Reminder: Atomic Operations
• CUDA supports several atomic operations:
• atomicAdd, atomicMin, etc.
• Atomics are great for many use cases
• Check out the documentation for more operations
• Advanced atomics: e.g., atomicCAS – can implement locks for complex code (sketch below)!
• Do not rely on atomics blindly
• Can lead to serious performance degradation
• Must understand the impact of atomics in the specific context at hand
• Rethink the algorithm when problems arise
• Dealing with contention for global memory
• Strategy: Use shared memory and staged approach, like in the example
51
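For the atomicCAS point above, a minimal and deliberately simplistic lock sketch – illustrative only, since spinning in divergent code can livelock and plain atomics are usually the better tool:
__device__ int lock = 0;                      // 0 = free, 1 = taken

__device__ void acquire(int *l) {
    while (atomicCAS(l, 0, 1) != 0) { }       // spin until we swap 0 -> 1
}

__device__ void release(int *l) {
    __threadfence();                          // make protected writes visible before releasing
    atomicExch(l, 0);
}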
Use atomics to optimize reductions
• Use same approach for warp reduction, but have 1st thread atomically
update the reduced value directly in the grand total
52
__global__ void reduce_warp_atomic_kernel(int *in, int *out, int N) {
int sum = 0;
for(int i = blockIdx.x * blockDim.x + threadIdx.x;
i < N;
i += blockDim.x * gridDim.x) {
sum += in[i];
}
sum = warp_reduce_sum(sum);
if ((threadIdx.x & (warpSize - 1)) == 0)
atomicAdd(out, sum);
}
Use atomics to optimize reductions
• Other approach: same as block reduction, but each block's thread 0
atomically adds the partial result to grand total
53
__global__ void reduce_warp_atomic_kernel(int *in, int *out, int N) {
int sum = 0;
for(int i = blockIdx.x * blockDim.x + threadIdx.x;
i < N;
i += blockDim.x * gridDim.x) {
sum += in[i];
}
sum = block_reduce_sum(sum);
if (threadIdx.x == 0)
atomicAdd(out, sum);
}
• Evaluate performance for each of these – which one is faster?
• Atomics can be useful if the data is integral, or if floating-point precision/ordering doesn't matter!
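One practical detail for both atomic variants (a sketch; buffer names are illustrative): since atomicAdd accumulates into *out, the result must be zeroed before the launch.
cudaMemset(d_out, 0, sizeof(int));                               // grand total starts at 0
reduce_warp_atomic_kernel<<<blocks, threads>>>(d_in, d_out, N);
// copy *d_out back with cudaMemcpy when the kernel has finished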
Take-aways
• Rich features in newer hardware simplify programming effort
• The basic concepts are still there, but squeezing every bit of
performance means knowing and using latest features
• Hardware evolves at a fast pace => you should be excited to explore
all the cool new features!
54
[Images: GPUs from 2016 and from 2020]
Next time
• Pinned memory
• GPU streams
• Overlapping kernels with data transfers
• Further directions for exploration in GPU domain…
55
A4 extension?
• Anonymous request for an extension, reason brought up:
• Due to other assignment deadlines, might help to have more time…
• At most can extend to last day of classes
• Extended deadline: December 8th, 10pm
• In-class poll: Yay or nay?
56