University of Toronto Mississauga, Department of Mathematical and Computational Sciences
CSC 367
Parallel Programming
Bogdan Simion
General-purpose computing with
Graphics Processing Units (GPUs)
Comprehensive examples – Reductions
With many thanks to NVIDIA’s Mark Harris for some of the neat CUDA examples!
This week
• Comprehensive example: reduction operation on GPUs
• Use all we learnt so far …
• Other ways to do reduction
• Shuffle on down, use atomics
2
Reduction
• Reduce multiple values into a single value
• e.g., find the sum (or max, min, etc.) of an array's values
• This is quite a common operation => must be efficiently implemented
• Easy to implement in parallel, but hard to get it done efficiently
• Great optimization example for all we’ve learnt so far
• Go through several iterations of this implementation
• Take several optimization strategies
3
Reduction – sum of an array
• Sequential implementation – O(n):
4
[Figure: sequential left-to-right accumulation over the array elements, producing the final sum]
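For reference, a minimal sequential version (a sketch; the array name and length are illustrative):
// Sequential O(n) reduction: one pass, accumulating into a single variable
int sum = 0;
for (int i = 0; i < n; i++)
    sum += a[i];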
Parallel Reduction
• Approach: pair-wise sums, one per thread
• Tree-like structure, O(log₂ n)
• Each thread adds two elements and produces a partial sum
5
[Figure: binary reduction tree over the array; the root is the final sum]
Problem: each level of the tree (the red dotted lines in the figure) must act like a barrier –
all values in the level above must be computed before we can proceed
Parallel Reduction
• Now sum four elements per thread instead
• Tree-like structure, O(log₄ n)
• Each thread adds 4 elements
6
[Figure: 4-ary reduction tree over the array; the root is the final sum]
Problem: similarly, must synchronize threads between stages
Parallel Reduction
• Main goals:
• Must keep all SMs on the GPU busy
• Must leverage the various GPU memories efficiently
• Considerations
• Must use a large number of thread blocks
• Each block produces a partial result
• Must communicate partial results between blocks, to compute further partial results and ultimately get the final result
7
Using shared memory – preliminaries
• Compiler creates a copy of a __shared__ variable for each block
• All threads in a block share the variable's memory, but cannot see (or change) the copy that another block sees
• Idea: maintain partial sums in shared memory!
• __shared__ int sdata[];
• Each block processes a portion of the input
• Recall also that we need barriers:
• __syncthreads(): waits until all threads in the block have reached it (i.e., all their prior instructions have completed)
• Not a global barrier, only within a block!
8
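A minimal sketch of the mechanics (illustrative only; the kernel name partial_sums and the launch parameters are assumptions, not from the slides) – a dynamically sized __shared__ array is declared with extern and sized via the third launch parameter:
__global__ void partial_sums(const int *g_in) {
    extern __shared__ int sdata[];                 // one copy per block, size set at launch
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                               // barrier: all loads in this block are done
    // ... per-block reduction on sdata (developed in the next slides) ...
}
// Host side: the third launch argument is the bytes of shared memory per block
// partial_sums<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(d_in);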
Global synchronization problem
• Problem: how to know when each block has finished its partial sum?
9
[Figure: blocks 0–3 each computing their partial sum, finishing at different times]
• We need some way to wait for the last block to finish => we need a global barrier
• CUDA does not provide a global barrier – and it would not be a good idea anyway:
• Expensive to implement
• Either limits the number of blocks, or risks deadlock – think why!
Decompose reduction into multiple kernels
• Leverage implicit synchronization between kernel invocations
10
• Wait .. what about the overheads of launching a new kernel?
• negligible hardware overhead, low software overhead
[Figure: Kernel 1 – blocks 0–3 each produce a partial sum; Kernel 2 – combines those partial sums]
Parallel reduction with multiple kernels
• Basic idea: decompose computations into multiple successive
kernel invocations
• The code for all levels is exactly the same
• “Recursive” kernel invocation
11
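A host-side sketch of this pattern (assumptions: a kernel reduce that takes the current element count and writes one partial sum per block, device buffers d_in/d_out, and a fixed threads per block):
int n = N;
while (n > 1) {
    int blocks = (n + threads - 1) / threads;
    reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    int *tmp = d_in; d_in = d_out; d_out = tmp;   // this round's output is next round's input
    n = blocks;                                   // one partial sum per block remains
}
// d_in[0] now holds the final result; launches on the same stream serialize,
// and a cudaMemcpy/cudaDeviceSynchronize is still needed before reading it back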
What is our optimization goal?
• We should strive to reach GPU peak performance
• Choose the right metric:
• GFLOP/s: for compute-bound kernels
• Bandwidth: for memory-bound kernels
• Reductions have very low arithmetic intensity
• 1 flop per element loaded (bandwidth-optimal)
• Calculate the peak bandwidth of your GPU, e.g.:
• 384-bit memory interface, 900 MHz DDR
• (384/8) bytes × (900 MHz × 2 for DDR) = 384 × 1800 / 8 = 86.4 GB/s
12
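The bandwidth figures reported in the tables that follow can be derived from the measured time (a sketch; n is the element count and t_ms the measured kernel time in milliseconds):
// Effective bandwidth: bytes moved / elapsed time
// (a reduction reads each of the n ints from global memory once)
double gbps = ((double)n * sizeof(int)) / (t_ms * 1e-3) / 1e9;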
Access pattern
14
[Figure: interleaved addressing over 16 elements – active thread ids per step:
s = 1: 0, 2, 4, 6, 8, 10, 12, 14
s = 2: 0, 4, 8, 12
s = 4: 0, 8
s = 8: 0]
• Compute partial sums at each step, with a different stride
• First adjacent elements, then every other element, every 4th, etc.
• Each tid maps onto one element; depending on the step, some threads do no work
Attempt #1: Interleaved addressing
15
__global__ void reduce1(int *g_idata, int *g_odata) {
extern __shared__ int sdata[]; // ignore the “extern” for now
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i]; // Each thread loads one elem to shmem
__syncthreads();
// Reduction in shared memory, half the threads each step
for (int s = 1; s < blockDim.x; s *= 2) { // step = s * 2
if (tid % (2*s) == 0) { // Take only tids divisible by step
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// Write the result for this block back to global memory
if (tid == 0) { g_odata[blockIdx.x] = sdata[0]; }
}
Note: could we wait only for the threads that update elements, i.e., move the __syncthreads() inside the if?
No – __syncthreads() must be reached by every thread in the block.
Also, within a warp not all threads take the "if" branch:
Divergent code impacts performance!
Divergent execution
16
• Consider two warps – what happens at each step?
[Figure: active vs. inactive threads across the two warps at each step, s = 1 through s = 32]
Attempt #2: Deal with divergent code
17
[Figure: strided indexing over 16 elements – active thread ids per step:
s = 1: 0–7
s = 2: 0–3
s = 4: 0, 1
s = 8: 0]
• Compute partial sums at each step, with a different stride
• First adjacent elements, then every other element, every 4th, etc.
• Now the threads that do work are at the start (lowest tids), which decreases divergence…
Attempt #2: Deal with divergent code
18
for (int s = 1; s < blockDim.x; s *= 2) { // step = s * 2
if (tid % (2*s) == 0) { // Take only tids divisible by step
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
• Idea: replace the divergent branch ...
• ... by using strided index and non-divergent branch
for (int s = 1; s < blockDim.x; s *= 2) {
int idx = 2 * s * tid; // Take all tids, then first half, etc
if (idx < blockDim.x) {
sdata[idx] += sdata[idx + s];
}
__syncthreads();
}
Deal with divergence
19
• Previous approach had divergence at every step. No divergence now for s = 1
• In reality, there are more than just 2 warps => more non-divergent steps
[Figure: active vs. inactive threads across the two warps at each step, s = 1 through s = 32, with strided indexing]
New problem: bank conflicts
• Reminder: a bank conflict occurs when threads in a half-warp access different 32-bit words in the same bank
• Interleaved accesses => at almost every step, we have 2-way bank conflicts
• Consider what happens with a large number of threads (more than a half-warp!)
20
[Figure: same strided-index pattern as before – active thread ids per step: s = 1: 0–7; s = 2: 0–3; s = 4: 0, 1; s = 8: 0 – but now consider which shared-memory banks are touched at each step]
Intuition: accessing memory non-sequentially => we end up accessing multiple memory locations in the same bank…
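As a quick sanity check for the strided-index version (a sketch assuming 32 banks of 4-byte words, bank(i) = i mod 32; older 16-bank hardware gives the same 2-way effect per half-warp):
// Step s = 1: thread t updates sdata[2*t] (and reads sdata[2*t + 1]).
// Within one warp, threads t and t+16 map to the same bank:
//   thread 0  -> word 0  -> bank 0
//   thread 16 -> word 32 -> bank 0    (2-way conflict)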
“Sequential” access pattern
• No conflicts with this sequential addressing
• Keep in mind how consecutive memory locations are laid out across banks!
21
[Figure: sequential addressing over 16 elements – active thread ids per step: s = 8: 0–7; s = 4: 0–3; s = 2: 0, 1; s = 1: 0; each active thread adds sdata[tid + s] into sdata[tid]]
Attempt #3: Enforce sequential addressing
22
• Idea: replace the strided indexing in the loop …
for (int s = 1; s < blockDim.x; s *= 2) {
    int idx = 2 * s * tid; // Take all tids, then first half, etc
    if (idx < blockDim.x) {
        sdata[idx] += sdata[idx + s];
    }
    __syncthreads();
}
• … with a "reversed" loop and threadID-based indexing:
for (int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
Idle threads
• Problem: half the threads are idle on the first iteration, and it only gets worse after that:
• This is not efficient!!!
23
for (int s = blockDim.x/2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
Attempt #4: First iteration at load time
• How about halving the number of blocks, and replacing the single load …
24
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i]; // Each thread loads 1 elem to shmem
__syncthreads();
• … with 2 loads and the first add of the reduction:
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
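On the host side, the grid shrinks to match (a sketch; variable names are illustrative):
// Each block now covers 2 * blockDim.x input elements, so half as many blocks are needed
int blocks = (n + threads * 2 - 1) / (threads * 2);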
Performance for 4M element reduction
• Block size: 128 threads; results are from an older GPU:
25
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
Instruction Bottleneck
• At 17 GB/s, we're far from bandwidth-bound
• And we know the reduction has low arithmetic intensity ...
• Therefore, a likely bottleneck is instruction overhead
• Ancillary instructions that are not loads, stores, or arithmetic for the core
computation
• In other words: address arithmetic and loop overhead
• Strategy: unroll loops!
26
Unrolling the last warp
• As reduction proceeds, number of "active" threads decreases
• When s <= 32, we have only one warp left
• Recall that instructions are SIMD synchronous within a warp
• That means when s <= 32:
• We don't need to __syncthreads()
• We don't need the " if (tid < s) " because it doesn't save any work
• Let's unroll the last 6 iterations of the inner loop
27
Attempt #5: Unroll the last warp
• Note: this saves useless work in all warps, not just the last one!
• Without unrolling, all warps execute every iteration of the for loop and if stmt
28
for (int s = blockDim.x/2; s > 32; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid < 32) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
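(Note: this relies on implicit warp-synchronous execution; on current CUDA the shared array should be declared volatile, or __syncwarp() calls inserted between these steps, to keep it correct.)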
Performance for 4M element reduction
29
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
• Attempt #5 (unroll last warp): 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
Complete unrolling
• Idea: If we knew the number of iterations at compile time, we could
completely unroll the reduction
• Luckily, the block size is limited by the GPU (assume that limit is 512 threads
for our card)
• Also, we are sticking to power-of-2 block sizes
• So we can easily unroll for a fixed block size
• Problem: we need to be generic – how can we unroll for block sizes
that we don’t know at compile time?
30
Unrolling with Templates
• Templates to the rescue!
• CUDA supports C++ template parameters on device and host functions
• Specify block size as a function template parameter:
31
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata) {
…
}
Attempt #6: Completely Unrolled
• Note: all of the blockSize comparisons (shown in red on the original slide) are evaluated at compile time
• Results in a very efficient inner loop!
32
if (blockSize >= 512) {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
}
if (blockSize >= 256) {
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
}
if (blockSize >= 128) {
    if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
}
if (tid < 32) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
Invoking Template Kernels
• Don’t we still need block size at compile time?
• Nope, just a switch statement with 10 possible block sizes:
33
switch (threads) {
    case 512:
        reduce6<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 256:
        reduce6<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 128:
        reduce6<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 64:
        reduce6< 64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 32:
        reduce6< 32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 16:
        reduce6< 16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 8:
        reduce6<  8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 4:
        reduce6<  4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 2:
        reduce6<  2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 1:
        reduce6<  1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
}
// (the launch arguments – dimGrid, dimBlock, smemSize, d_idata, d_odata – are the usual
//  grid/block dimensions, shared-memory size, and device pointers, assumed from context)
Performance for 4M element reduction
34
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
• Attempt #5 (unroll last warp): 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
• Attempt #6 (completely unrolled): 0.381 ms, 43.996 GB/s, 1.41x step, 21.16x cumulative
Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
• e.g., N = 16, S = 1, 2, 3, …
35
[Figure: sequential-addressing reduction of 16 elements – S = 1: 8 ops (threads 0–7), S = 2: 4 ops (threads 0–3), S = 3: 2 ops (threads 0, 1), S = 4: 1 op (thread 0)]
Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
• Step complexity is O(log N)
• For N = 2^D, it performs ∑_{S=1..D} 2^(D−S) = N − 1 operations
• Work complexity is O(N) => it is work efficient
• i.e., does not perform more operations than a sequential algorithm
• With P threads physically in parallel (P processors), time complexity is
O(N/P + log N)
• Compared to O(N) for sequential reduction
• In a thread block: P=N => O(log N)
36
What about cost?
• Cost of a parallel algorithm is # of processors x time complexity
• Allocate threads instead of processors: O(N) threads
• Time complexity is O(log N) => cost is O(N log N) => not cost efficient!
• A previous lecture: let’s scale down to O(N / log N) threads
(technically, this derives from something called Brent’s theorem…)
• Each thread does O(log N) sequential work
• Then all O(N / log N) threads cooperate for O(log N) steps
• Cost = O( (N / log N) * log N) = O(N) => cost efficient!
• Recall discussion from a few lectures ago too … (same idea)
• Sometimes called algorithm cascading
• Can lead to significant speedups in practice
37
Algorithm cascading
• Combine sequential and parallel reduction
• Each thread loads and sums multiple elements into shared memory
• Tree-based reduction in shared memory
• Each thread should sum up O(log N) elements
• i.e., 1024 or 2048 elements per block vs. 128 or 256
• Might be worth pushing it even further
• Possibly better latency hiding with more work per thread
• More threads per block reduces levels in tree of recursive kernel invocations
• High kernel launch overhead in last levels with few blocks
• Play around with this on your GPU – optimal # of blocks and threads?
• How many elements per thread?
38
Attempt #7: Multiple adds per thread
• Replace the load and add of two elements:
39
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
• … with a while loop that adds as many elements as necessary:
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
unsigned int gridSize = (blockSize*2*gridDim.x);
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
Why do we use a gridSize loop stride?
Maintains coalescing!
Performance for 4M element reduction
40
Time (4M ints) / Bandwidth / Step speedup / Cumulative speedup:
• Attempt #1 (interleaved addressing + divergent branch): 8.054 ms, 2.083 GB/s
• Attempt #2 (interleaved addressing + bank conflicts): 3.456 ms, 4.854 GB/s, 2.33x step, 2.33x cumulative
• Attempt #3 (sequential addressing): 1.722 ms, 9.741 GB/s, 2.01x step, 4.68x cumulative
• Attempt #4 (first add during global load): 0.965 ms, 17.377 GB/s, 1.78x step, 8.34x cumulative
• Attempt #5 (unroll last warp): 0.536 ms, 31.289 GB/s, 1.8x step, 15.01x cumulative
• Attempt #6 (completely unrolled): 0.381 ms, 43.996 GB/s, 1.41x step, 21.16x cumulative
• Attempt #7 (multiple elements per thread): 0.268 ms, 62.671 GB/s, 1.42x step, 30.04x cumulative
• Attempt 7 on 32M elements: 73 GB/s!
Final optimized reduction kernel
41
template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x; // tid within a block
    unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
    unsigned int gridSize = (blockSize*2*gridDim.x);
    sdata[tid] = 0;
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }
    if (tid < 32) { // warp-synchronous final steps (see the note after Attempt #5)
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Take-aways
• Must understand CUDA performance characteristics
• Memory coalescing
• Divergent branching
• Bank conflicts
• Latency hiding
• Algorithmic optimizations
• Decompose into multiple kernels
• Changes to addressing
• Loop unrolling
• Algorithm cascading
• Efficiently parallelizing code requires thought + analysis of tradeoffs !
42
Next up …
• Other ways to do reduction:
• Shuffle on down, use atomics
43
Reduction (revisited): faster version using shuffles
• As more features get added, various operations can be implemented much
faster using neat tricks
• Parallel reduction exchanges data between threads in the same block
• On earlier hardware, shared memory was the way to go: write data to shmem,
synchronize, reduce locally, then put data back into global mem
• As of Kepler cards, shuffle instruction (SHFL) enables a thread to directly read a
register from another thread in the same warp
• Threads in a warp can now collectively exchange or broadcast data
• Shuffle intrinsics (pre-CUDA 9): __shfl(), __shfl_down(), __shfl_up(), __shfl_xor()
• Shuffle intrinsics (CUDA 9+): __shfl_sync(), __shfl_down_sync(), __shfl_up_sync(), __shfl_xor_sync()
• We’ll use __shfl_down_sync(), for more info on others check out NVIDIA docs
• See older post on: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
44
Shuffle down
• Shuffle down operation (pre-CUDA9, use __shfl_down and no mask):
45
T __shfl_down_sync(unsigned mask, T var, unsigned int delta,
                   int width=warpSize);
• Calculates a source lane ID by adding a delta to the caller’s lane ID
• A lane ID is a thread’s index within its warp (0 – 31)
• Shifts var down the warp by delta lanes
• If the source lane ID is out of range or the source thread has exited, the calling thread’s
own var is returned
• The ID number of the source lane will not wrap around the value of width and so the
upper delta lanes will remain unchanged
• The mask indicates the threads participating in the call
• Width must be one of (2, 4, 8, 16, 32), default is warp size
Shuffle down
• Shuffle down operation:
46
T __shfl_down_sync(unsigned mask, T var, unsigned int delta,
                   int width=warpSize);
i = threadIdx.x % 32;
j = __shfl_down_sync(0xFFFFFFFF, i, 2, 8);
[Figure: with width 8 and delta 2, lanes 0–7 start with i = 0 1 2 3 4 5 6 7 and end with j = 2 3 4 5 6 7 6 7 – each lane reads i from the lane delta positions above it, while the upper delta lanes keep their own value]
• Advantages
• Replaces a multi-instruction shared mem sequence with a single instruction
• Does not use any shared memory
• Synchronization is within a warp and is implicit in the instruction, so no need
for __syncthreads()
Shuffle warp reduce
• First, use it to reduce within a warp
• Use __shfl_down_sync to build “reduction tree”
47
[Figure: reduction tree over lanes 0–7 of a warp – each lane starts with one element; after delta = 4 lanes 0–3 hold sums of 2 elements, after delta = 2 lanes 0–1 hold sums of 4, after delta = 1 lane 0 holds the sum of all 8]
v += __shfl_down_sync(0xFFFFFFFF, v, 4);
v += __shfl_down_sync(0xFFFFFFFF, v, 2);
v += __shfl_down_sync(0xFFFFFFFF, v, 1);
• After the __shfl_down_sync operations, thread 0 has the result in its variable v
Shuffle warp reduce
• Shfl-based warp reduction:
48
__inline__ __device__
int warp_reduce_sum (int val){
for (int offset = warpSize/2; offset > 0; offset >>= 1)
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
return val;
}
• After the __shfl_down_sync operations, thread 0 has the result in its val
• Multiple warps: thread 0 (lane 0) of each warp will hold that warp's reduced value
• Side-note: could implement an “all-reduce” semantic using
__shfl_xor_sync
• Check documentation for further details
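A possible sketch of that all-reduce variant (the function name is illustrative): with the XOR ("butterfly") exchange pattern, every lane ends up holding the warp-wide sum, not just lane 0.
__inline__ __device__
int warp_allreduce_sum(int val) {
    for (int mask = warpSize/2; mask > 0; mask >>= 1)
        val += __shfl_xor_sync(0xFFFFFFFF, val, mask);   // exchange with lane (laneID XOR mask)
    return val;
}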
Block reduce
• We can use warp reduction to implement reduction across entire block
49
__inline__ __device__
int block_reduce_sum(int val) {
static __shared__ int shared[32]; // Shared mem for 32 partial sums
int lane = threadIdx.x % warpSize; // Lane = thread id within a warp
int wid = threadIdx.x / warpSize; // Warp ID of the thread
val = warp_reduce_sum(val); // Each warp performs partial reduction
if(lane == 0) shared[wid] = val; // Write reduced value to shmem
__syncthreads(); // Wait for all partial reductions within a block
// Read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if(wid == 0) val = warp_reduce_sum(val);//Final reduce within 1st warp
return val;
}
Reduce across blocks
• Involves global communication => break computation into two kernel launches
• Kernel1 generates partial reduction results (1 per block, in t0 of the block)
• Kernel2 reduces the partial results into a grand total
50
__global__ void reduce_kernel(int *in, int *out, int N) {
int sum = 0;
//reduce multiple elements per thread
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < N;
i += blockDim.x * gridDim.x) {
sum += in[i];
}
sum = block_reduce_sum(sum);
if (threadIdx.x == 0) out[blockIdx.x] = sum;
}
...
int threads = 512;
int blocks = min((N + threads - 1) / threads, 1024); // restrict to at most 1024 blocks
                                                     // (at most 1024 partial results)
reduce_kernel<<<blocks, threads>>>(in, out, N);      // K1: generate up to 1024 partial results
reduce_kernel<<<1, 1024>>>(out, out, blocks);        // K2: reduce the partials into the grand total
// We know there are at most 1024 partial results now, so the same kernel,
// launched with a single block of 1024 threads, finishes the job
Reminder: Atomic Operations
• CUDA supports several atomic operations:
• atomicAdd, atomicMin, etc.
• Atomics are great for many use cases
• Check out the documentation for more operations
• Advanced atomics: e.g., atomicCAS – can implement locks for complex code (sketch below)!
• Do not rely on atomics blindly
• Can lead to serious performance degradation
• Must understand the impact of atomics in the specific context at hand
• Rethink the algorithm when problems arise
• Dealing with contention for global memory
• Strategy: Use shared memory and staged approach, like in the example
51
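For the atomicCAS point above, a minimal and deliberately simplistic lock sketch – illustrative only, since spinning in divergent code can livelock and plain atomics are usually the better tool:
__device__ int lock = 0;                      // 0 = free, 1 = taken

__device__ void acquire(int *l) {
    while (atomicCAS(l, 0, 1) != 0) { }       // spin until we swap 0 -> 1
}

__device__ void release(int *l) {
    __threadfence();                          // make protected writes visible before releasing
    atomicExch(l, 0);
}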
Use atomics to optimize reductions
• Use same approach for warp reduction, but have 1st thread atomically
update the reduced value directly in the grand total
52
__global__ void reduce_warp_atomic_kernel(int *in, int *out, int N) {
int sum = 0;
for(int i = blockIdx.x * blockDim.x + threadIdx.x;
i < N;
i += blockDim.x * gridDim.x) {
sum += in[i];
}
sum = warp_reduce_sum(sum);
if ((threadIdx.x & (warpSize - 1)) == 0)
atomicAdd(out, sum);
}
Use atomics to optimize reductions
• Other approach: same as block reduction, but each block's thread 0
atomically adds the partial result to grand total
53
__global__ void reduce_warp_atomic_kernel(int *in, int *out, int N) {
int sum = 0;
for(int i = blockIdx.x * blockDim.x + threadIdx.x;
i < N;
i += blockDim.x * gridDim.x) {
sum += in[i];
}
sum = block_reduce_sum(sum);
if (threadIdx.x == 0)
atomicAdd(out, sum);
}
• Evaluate performance for each of these – which one is faster?
• Atomics can be useful if the data is integral, or if floating-point precision/ordering doesn't matter!
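One practical detail for both atomic variants (a sketch; buffer names are illustrative): since atomicAdd accumulates into *out, the result must be zeroed before the launch.
cudaMemset(d_out, 0, sizeof(int));                               // grand total starts at 0
reduce_warp_atomic_kernel<<<blocks, threads>>>(d_in, d_out, N);
// copy *d_out back with cudaMemcpy when the kernel has finished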
Take-aways
• Rich features in newer hardware simplify programming effort
• The basic concepts are still there, but squeezing every bit of
performance means knowing and using latest features
• Hardware evolves at a fast pace => you should be excited to explore
all the cool new features!
54
[Images: GPUs from 2016 and from 2020]
Next time
• Pinned memory
• GPU streams
• Overlapping kernels with data transfers
• Further directions for exploration in GPU domain…
55
A4 extension?
• Anonymous request for an extension, reason brought up:
• Due to other assignment deadlines, might help to have more time…
• At most can extend to last day of classes
• Extended deadline: December 8th, 10pm
• In-class poll: Yay or nay?
56