CSC 367 Parallel Programming
General-purpose computing with Graphics Processing Units (GPUs)
Comprehensive examples – Reductions
With many thanks to NVIDIA for some of the neat CUDA examples!
University of Toronto Mississauga, Department of Mathematical and Computational Sciences
• Other ways to do reduction
  • Shuffle on down, use atomics
• Comprehensive example: reduction operation on GPUs
  • Use all we learnt so far ...
• Reduce multiple values into a single value
  • e.g., find the sum (or max, min, etc.) of an array
• This is quite a common operation => must be implemented efficiently
• Easy to implement in parallel, but hard to get it done efficiently
• Great optimization example for all we've learnt so far
  • Go through several iterations of this implementation
  • Take several optimization strategies
Reduction – sum of an array
• Sequential implementation – O(n):
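A minimal sketch of the sequential O(n) reduction in plain C++ (the helper name reduce_seq is illustrative, not part of the course code):

```cpp
#include <cassert>
#include <cstddef>

// Sequential reduction: one pass over the array, O(n) additions.
int reduce_seq(const int *data, std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        sum += data[i];  // accumulate one element per iteration
    }
    return sum;
}
```

This is the baseline the parallel versions are measured against.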
Problem: each red dotted line must act like a barrier!
– values above it must be calculated before we can proceed

Parallel Reduction
• Approach: pair-wise sums per thread
  • Tree-like structure, O(log2 n)
  • Each thread adds two elements and produces a partial sum
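The tree-shaped combining can be simulated on the host: at each step, position t adds the element one stride away, and the stride doubles until one value remains. A plain C++ sketch of the access pattern (function name is mine; a power-of-2 size is assumed):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulate the pair-wise tree reduction: log2(n) steps,
// each step halves the number of live partial sums.
int reduce_tree(std::vector<int> v) {
    std::size_t n = v.size();                 // assume n is a power of 2
    for (std::size_t s = 1; s < n; s *= 2) {  // stride doubles each step
        for (std::size_t t = 0; t < n; t += 2 * s) {
            v[t] += v[t + s];                 // each "thread" adds two partials
        }
    }
    return v[0];
}
```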
Parallel Reduction
• Now sum four per thread instead
  • Tree-like structure, O(log4 n)
  • Each thread adds 4 elements
• Problem: similarly, must synchronize threads between stages
Parallel Reduction
• Main goals:
  • Must keep all SMs on the GPU busy
  • Must leverage the various GPU memories efficiently
• Considerations:
  • Must use a large number of thread blocks
  • Each block produces a partial result
  • Must communicate partial results between blocks, to compute further partial results and ultimately get the final result
Using shared memory – preliminaries
• Compiler creates a copy of a __shared__ variable for each block
  • All threads in a block share the variable's memory, but cannot see (or change) the copy of the same variable that another block sees
• Idea: maintain partial sums in shared memory!
  • __shared__ int sdata[];
• Each block processes a portion of the input
• Recall also that we need barriers:
  • __syncthreads(): waits until all threads in the block have completed the instructions before it
  • Not a global barrier, only within a block!
Global synchronization problem
• Problem: how to know when each block has finished its partial sum?
  [Figure: Block 0..Block 3 each produce a partial sum that must be combined]
• We need some way to wait for the last block to finish => we need a global barrier
• CUDA does not provide a global barrier – not a good idea:
  • Expensive to implement
  • Either limits the number of blocks, or potential for deadlock – think why!
Decompose reduction into multiple kernels
• Leverage implicit synchronization between kernel invocations
  [Figure: Kernel 1 reduces Block 0..Block 3 to partial sums; Kernel 2 combines them]
• Wait ... what about the overheads of launching a new kernel?
  • Negligible hardware overhead, low software overhead
Parallel reduction with multiple kernels
• Basic idea: decompose computations into multiple successive kernel invocations
• The code for all levels is exactly the same
  • "Recursive" kernel invocation
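A host-side sketch of the multi-kernel idea: each "launch" reduces every block of elements to one partial sum, and we relaunch on the partials until a single value remains. This is a plain C++ stand-in for the CUDA launches (names and the block size of 4 are my own choices):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "kernel launch": every block of blockDim elements -> one partial sum.
std::vector<int> reduce_level(const std::vector<int>& in, std::size_t blockDim) {
    std::vector<int> out((in.size() + blockDim - 1) / blockDim, 0);
    for (std::size_t i = 0; i < in.size(); i++) {
        out[i / blockDim] += in[i];   // element lands in its block's partial
    }
    return out;
}

// "Recursive" invocation: same code at every level; the implicit
// synchronization between launches replaces a global barrier.
int reduce_multi_level(std::vector<int> data, std::size_t blockDim) {
    while (data.size() > 1) {
        data = reduce_level(data, blockDim);  // next level sees partial sums
    }
    return data[0];
}
```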
What is our optimization goal?
• We should strive to reach GPU peak performance
• Choose the right metric:
  • GFLOP/s: for compute-bound kernels
  • Bandwidth: for memory-bound kernels
• Reductions have very low arithmetic intensity
  • 1 flop per element loaded (bandwidth-optimal)
• Calculate the bandwidth on your GPU, e.g.:
  • 384-bit memory interface, 900 MHz DDR
  • 384 * 1800 / 8 = 86.4 GB/s
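The bandwidth arithmetic above can be re-done directly: DDR transfers on both clock edges, so 900 MHz yields 1800 MT/s, and peak bandwidth is bus width in bytes times the transfer rate. A small C++ check (function name is mine):

```cpp
#include <cassert>
#include <cmath>

// Peak memory bandwidth = bus width (bytes) * effective transfer rate.
double peak_bandwidth_gbs(int bus_bits, int ddr_mhz) {
    int mega_transfers = ddr_mhz * 2;  // DDR: two transfers per clock
    return (bus_bits / 8.0) * mega_transfers / 1000.0;  // MB/s -> GB/s
}
```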
Access pattern
• Compute partial sums at each step, with a different stride
  • First adjacent elements, then every other element, every 4th, etc.
• Each tid is mapped onto an element; some threads don't do work depending on the step
  [Figure: at s=1 threads 0, 2, 4, ..., 14 are active; at s=2, s=4, s=8 progressively fewer threads work]
Attempt #1: Interleaved addressing
• Note: can we wait only for those threads that update elements, i.e., move the __syncthreads() inside the if?
  • In a warp, not all threads participate in the "if" – divergent code impacts performance!
__global__ void reduce1(int *g_idata, int *g_odata) {
  extern __shared__ int sdata[];  // ignore the "extern" for now

  unsigned int tid = threadIdx.x;  // tid within a block
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];  // Each thread loads one elem to shmem
  __syncthreads();

  // Reduction in shared memory, half the threads each step
  for (unsigned int s = 1; s < blockDim.x; s *= 2) {  // step = s * 2
    if (tid % (2*s) == 0) {  // Take only tids divisible by step
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

  // Write the result for this block back to global memory
  if (tid == 0) { g_odata[blockIdx.x] = sdata[0]; }
}
Divergent execution
• Consider two warps – what happens at each step?
  [Figure: at each step, each warp contains a mix of active and inactive threads]
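The divergence can be made concrete by counting, per step, how many warps contain a mix of active (tid % (2s) == 0) and inactive threads under attempt #1's branch. A host-side C++ sketch (the function is mine; 256 threads and warp size 32 are illustrative assumptions):

```cpp
#include <cassert>

// Count warps that diverge (contain both active and inactive threads)
// at a given step s of attempt #1's interleaved addressing.
int divergent_warps(int blockDim, int warpSize, int s) {
    int divergent = 0;
    for (int w = 0; w < blockDim / warpSize; w++) {
        int active = 0;
        for (int lane = 0; lane < warpSize; lane++) {
            int tid = w * warpSize + lane;
            if (tid % (2 * s) == 0) active++;  // attempt #1's branch condition
        }
        if (active != 0 && active != warpSize) divergent++;
    }
    return divergent;
}
```

With 256 threads, every one of the 8 warps diverges at small strides; only once the stride exceeds the warp size do some warps become uniformly idle.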
Attempt #2: Deal with divergent code
• Compute partial sums at each step, with a different stride
  • First adjacent elements, then every other element, every 4th, etc.
• Now all threads do work at the start, to decrease divergence ...
  [Figure: threads 0–7 are all active at s=1; at larger strides the active threads are the contiguous lower tids]
Attempt #2: Deal with divergent code
• Idea: replace the divergent branch ...

  for (unsigned int s = 1; s < blockDim.x; s *= 2) {  // step = s * 2
    if (tid % (2*s) == 0) {  // Take only tids divisible by step
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

• ... by using a strided index and non-divergent branch:

  for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int idx = 2 * s * tid;  // Take all tids, then the first half, etc.
    if (idx < blockDim.x) {
      sdata[idx] += sdata[idx + s];
    }
    __syncthreads();
  }
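A quick host-side check that attempt #2's index mapping computes the same reduction: simulate the loop on a small array and compare against the plain sum (function name is mine; a power-of-2 block is assumed, and the sequential simulation is safe because no step updates a location it also reads as the `+ s` operand):

```cpp
#include <cassert>
#include <vector>

// Simulate attempt #2: contiguous tids compute strided indices, so the
// low-numbered "threads" stay active (non-divergent branch).
int reduce_strided_index(std::vector<int> sdata) {
    int blockDim = static_cast<int>(sdata.size());  // power of 2 assumed
    for (int s = 1; s < blockDim; s *= 2) {
        for (int tid = 0; tid < blockDim; tid++) {  // all "threads" run
            int idx = 2 * s * tid;                  // strided index
            if (idx < blockDim) sdata[idx] += sdata[idx + s];
        }
        // __syncthreads() would go here on the GPU
    }
    return sdata[0];
}
```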
Deal with divergence
• Previous approach had divergence at every step. No divergence now for s=1
• In reality, more than just 2 warps => more non-divergent steps
New problem: bank conflicts
• Reminder: BC = threads in a half-warp access different 32-bit words from the same bank
• Intuition: accessing memory non-sequentially => we end up accessing more memory locations from the same bank
• Interleaved accesses => at almost every step, we have 2-way bank conflicts
  • Consider what happens with a large number of threads (more than a half-warp!)
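The bank of a 32-bit shared-memory word is its index modulo the number of banks. Computing the banks hit by a half-warp under attempt #2's indexing (idx = 2·s·tid) shows the conflicts, and that the conflict degree grows with the stride. Host-side C++ sketch (function name is mine; 16 banks and a 16-thread half-warp are assumed, as on the older hardware these slides target):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Max number of threads in a half-warp whose shared-memory write
// falls in the same bank, for attempt #2's indexing (idx = 2*s*tid).
int conflict_degree(int s, int halfWarp = 16, int numBanks = 16) {
    std::vector<int> hits(numBanks, 0);       // accesses per bank
    for (int tid = 0; tid < halfWarp; tid++) {
        int idx = 2 * s * tid;                // word index accessed
        hits[idx % numBanks]++;               // bank of a 32-bit word
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

A degree of 1 would mean conflict-free; here even s=1 already produces a 2-way conflict.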
"Sequential" access pattern
• No conflicts with this sequential addressing
• Keep in mind how consecutive memory locations are laid out across banks!
  [Figure: at each step, the active threads 0..s-1 access consecutive shared-memory words]
Attempt #3: Enforce sequential addressing
• Idea: replace the strided indexing in the loop ...

  for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int idx = 2 * s * tid;  // Take all tids, then the first half, etc.
    if (idx < blockDim.x) {
      sdata[idx] += sdata[idx + s];
    }
    __syncthreads();
  }

• ... with a "reversed" loop and threadID-based indexing:

  for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }
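The reversed loop can likewise be checked on the host: thread tid adds sdata[tid + s], so the active threads touch consecutive words (conflict-free banks) and the result is unchanged. A plain C++ simulation (function name is mine; power-of-2 size assumed):

```cpp
#include <cassert>
#include <vector>

// Simulate attempt #3: reversed loop, thread tid adds sdata[tid + s].
// Active threads access consecutive words -> no bank conflicts.
int reduce_sequential_addr(std::vector<int> sdata) {
    int blockDim = static_cast<int>(sdata.size());  // power of 2 assumed
    for (int s = blockDim / 2; s > 0; s >>= 1) {
        for (int tid = 0; tid < s; tid++) {
            sdata[tid] += sdata[tid + s];
        }
        // __syncthreads() would go here on the GPU
    }
    return sdata[0];
}
```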
Idle threads
• Problem: half the threads are idle on the first iteration, and it only gets worse after that:

  for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

• This is not efficient!!!
Attempt #4: First iteration at load time
• How about halving the number of blocks, and replacing the single load ...

  unsigned int tid = threadIdx.x;  // tid within a block
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];  // Each thread loads 1 elem to shmem
  __syncthreads();

• ... with 2 loads and the first add of the reduction:

  unsigned int tid = threadIdx.x;  // tid within a block
  unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
  sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
  __syncthreads();
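The halved-blocks trick can be sanity-checked on the host: each "thread" sums two elements a block apart at load time, then the shared-memory tree runs as before. A plain C++ sketch (function name is mine; one block covering the whole input is assumed):

```cpp
#include <cassert>
#include <vector>

// Simulate attempt #4: each thread loads and adds two elements
// (blockDim apart), so the first reduction step happens at load time.
int reduce_first_add(const std::vector<int>& g_idata) {
    int blockDim = static_cast<int>(g_idata.size()) / 2;  // half the threads
    std::vector<int> sdata(blockDim);
    for (int tid = 0; tid < blockDim; tid++) {
        sdata[tid] = g_idata[tid] + g_idata[tid + blockDim];  // 2 loads + add
    }
    for (int s = blockDim / 2; s > 0; s >>= 1) {  // usual tree in shmem
        for (int tid = 0; tid < s; tid++) sdata[tid] += sdata[tid + s];
    }
    return sdata[0];
}
```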
Performance for 4M element reduction
• Block size: 128 threads; results are on an older GPU:

                                                          Time (4M ints)   Bandwidth     Step speedup   Cumulative speedup
  Attempt #1 (interleaved addressing + divergent branch)      8.054 ms      2.083 GB/s       –                –
  Attempt #2 (interleaved addressing + bank conflicts)        3.456 ms      4.854 GB/s      2.33x            2.33x
  Attempt #3 (sequential addressing)                          1.722 ms      9.741 GB/s      2.01x            4.68x
  Attempt #4 (first add during global load)                   0.965 ms     17.377 GB/s      1.78x            8.34x
Instruction Bottleneck
• At 17 GB/s, we're far from bandwidth-bound
  • And we know the reduction has low arithmetic intensity ...
• Therefore, a likely bottleneck is instruction overhead
  • Ancillary instructions that are not loads, stores, or arithmetic for the core computation
  • In other words: address arithmetic and loop overhead
• Strategy: unroll loops!
Unrolling the last warp
• As the reduction proceeds, the number of "active" threads decreases
  • When s <= 32, we have only one warp left
• Recall that instructions are SIMD synchronous within a warp
• That means when s <= 32:
  • We don't need __syncthreads()
  • We don't need the "if (tid < s)" because it doesn't save any work
• Let's unroll the last 6 iterations of the inner loop
Attempt #5: Unroll the last warp
  for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
    if (tid < s) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

  // Warp-synchronous: no __syncthreads() needed within the last warp
  // (sdata must be declared volatile for this to be correct; on recent
  // GPUs, use __syncwarp() between the steps below)
  if (tid < 32) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
  }
• Note: this saves useless work in all warps, not just the last one!
• Without unrolling, all warps execute every iteration of the for loop and if statement
Performance for 4M element reduction

                                                          Time (4M ints)   Bandwidth     Step speedup   Cumulative speedup
  Attempt #1 (interleaved addressing + divergent branch)      8.054 ms      2.083 GB/s       –                –
  Attempt #2 (interleaved addressing + bank conflicts)        3.456 ms      4.854 GB/s      2.33x            2.33x
  Attempt #3 (sequential addressing)                          1.722 ms      9.741 GB/s      2.01x            4.68x
  Attempt #4 (first add during global load)                   0.965 ms     17.377 GB/s      1.78x            8.34x
  Attempt #5 (unroll last warp)                               0.536 ms     31.289 GB/s      1.8x            15.01x
Complete unrolling
• Idea: if we knew the number of iterations at compile time, we could completely unroll the reduction
• Luckily, the block size is limited by the GPU (assume that limit is 512 threads for our card)
• Also, we are sticking to power-of-2 block sizes
  • So we can easily unroll for a fixed block size
• Problem: we need to be generic – how can we unroll for block sizes that we don't know at compile time?
Unrolling with Templates
• Templates to the rescue!
• CUDA supports C++ template parameters on device and host functions
• Specify the block size as a function template parameter:

  template <unsigned int blockSize>
  __global__ void reduce6(int *g_idata, int *g_odata) {
Attempt #6: Completely Unrolled
  if (blockSize >= 512) {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
  }
  if (blockSize >= 256) {
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
  }
  if (blockSize >= 128) {
    if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads();
  }
  if (tid < 32) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
  }
• Note: all of the blockSize comparisons will be evaluated at compile time
• Results in a very efficient inner loop!
Invoking Template Kernels
• Don't we still need the block size at compile time?
• Nope, just a switch statement with 10 possible block sizes:
  switch (threads) {
    case 512: reduce5<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 256: reduce5<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 128: reduce5<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case  64: reduce5< 64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case  32: reduce5< 32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case  16: reduce5< 16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   8: reduce5<  8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   4: reduce5<  4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   2: reduce5<  2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   1: reduce5<  1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
  }
Performance for 4M element reduction

                                                          Time (4M ints)   Bandwidth     Step speedup   Cumulative speedup
  Attempt #1 (interleaved addressing + divergent branch)      8.054 ms      2.083 GB/s       –                –
  Attempt #2 (interleaved addressing + bank conflicts)        3.456 ms      4.854 GB/s      2.33x            2.33x
  Attempt #3 (sequential addressing)                          1.722 ms      9.741 GB/s      2.01x            4.68x
  Attempt #4 (first add during global load)                   0.965 ms     17.377 GB/s      1.78x            8.34x
  Attempt #5 (unroll last warp)                               0.536 ms     31.289 GB/s      1.8x            15.01x
  Attempt #6 (completely unrolled)                            0.381 ms     43.996 GB/s      1.41x           21.16x
Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
  • e.g., N=16, S=1,2,3,...
Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
  • Step complexity is O(log N)
• For N = 2^D, performs Σ (S ∈ [1..D]) 2^(D-S) = N - 1 operations
  • Work complexity is O(N) => it is work-efficient
  • i.e., does not perform more operations than a sequential algorithm
• With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
  • Compared to O(N) for sequential reduction
  • In a thread block: P = N => O(log N)
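The N − 1 operation count can be verified numerically: summing N/2^S over steps S = 1..D for N = 2^D gives exactly N − 1. A short C++ check (function name is mine):

```cpp
#include <cassert>

// Total ops of the tree reduction for N = 2^D:
// step S performs N / 2^S additions, summed over S = 1..D.
long long tree_reduction_ops(int D) {
    long long N = 1LL << D;
    long long ops = 0;
    for (int S = 1; S <= D; S++) {
        ops += N >> S;  // N / 2^S independent additions at step S
    }
    return ops;         // equals N - 1, hence work-efficient
}
```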
What about cost?
• Cost of a parallel algorithm is # of processors x time complexity
• Allocate threads instead of processors: O(N) threads
  • Time complexity is O(log N) => cost is O(N log N) => not cost-efficient!
• A previous lecture: let's scale down to O(N / log N) threads (technically, this derives from something called Brent's theorem ...)
  • Each thread does O(log N) sequential work
  • Then all O(N / log N) threads cooperate for O(log N) steps
  • Cost = O((N / log N) * log N) = O(N) => cost-efficient!
• Recall the discussion from a few lectures ago too ... (same idea)
• Sometimes called algorithm cascading
  • Can lead to significant speedups in practice
Algorithm cascading
• Combine sequential and parallel reduction
  • Each thread loads and sums multiple elements into shared memory
  • Tree-based reduction in shared memory
• Each thread should sum up O(log N) elements
  • i.e., 1024 or 2048 elements per block vs. 128 or 256
• Might be worth pushing it even further
  • Possibly better latency hiding with more work per thread
  • More threads per block reduces levels in the tree of recursive kernel invocations
  • High kernel launch overhead in last levels with few blocks
• Play around with this on your GPU – optimal # of blocks and threads?
  • How many elements per thread?
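Algorithm cascading in host form: each "thread" first sums several elements sequentially into a partial, then the partials go through the usual tree. A plain C++ sketch (function name and sizes are mine; a power-of-2 thread count is assumed):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cascading: numThreads "threads" each sequentially sum a strided share
// of the input, then the partial sums are combined by the tree reduction.
int reduce_cascaded(const std::vector<int>& data, int numThreads) {
    std::vector<int> partial(numThreads, 0);
    for (std::size_t i = 0; i < data.size(); i++) {
        partial[i % numThreads] += data[i];  // sequential phase, O(N/P) each
    }
    for (int s = numThreads / 2; s > 0; s >>= 1) {  // parallel tree phase
        for (int tid = 0; tid < s; tid++) partial[tid] += partial[tid + s];
    }
    return partial[0];
}
```

The sequential phase costs O(N/P) per thread and the tree costs O(log P) steps, which is where the O(N) total cost comes from.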
Attempt #7: Multiple adds per thread
• Replace the load and add of two elements:

  unsigned int tid = threadIdx.x;  // tid within a block
  unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
  sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
  __syncthreads();

• ... with a while loop to add as many as necessary:

  unsigned int tid = threadIdx.x;  // tid within a block
  unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
  unsigned int gridSize = blockSize * 2 * gridDim.x;
  sdata[tid] = 0;
  while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
  }
  __syncthreads();

• Why do we use gridSize as the loop stride? It maintains coalescing!
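A host-side check that the gridSize-strided while loop visits every input element exactly once, whatever the element count. This is my simulation, not the kernel itself, and it adds a bounds guard on the second load that the slide's code omits (the slide assumes n is sized appropriately):

```cpp
#include <cassert>
#include <vector>

// Simulate attempt #7's load loop: thread (block, tid) starts at
// block*blockSize*2 + tid and strides by gridSize = blockSize*2*gridDim.
// Verify that all n elements are read exactly once.
bool covers_all_once(int n, int gridDim, int blockSize) {
    std::vector<int> visits(n, 0);
    int gridSize = blockSize * 2 * gridDim;
    for (int block = 0; block < gridDim; block++) {
        for (int tid = 0; tid < blockSize; tid++) {
            for (int i = block * blockSize * 2 + tid; i < n; i += gridSize) {
                visits[i]++;                                      // g_idata[i]
                if (i + blockSize < n) visits[i + blockSize]++;   // g_idata[i+blockSize]
            }
        }
    }
    for (int v : visits) {
        if (v != 1) return false;
    }
    return true;
}
```

The key property for coalescing is that, at each trip of the loop, the threads of a warp read consecutive addresses.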
Performance for 4M element reduction

                                                          Time (4M ints)   Bandwidth     Step speedup   Cumulative speedup
  Attempt #1 (interleaved addressing + divergent branch)      8.054 ms      2.083 GB/s       –                –
  Attempt #2 (interleaved addressing + bank conflicts)        3.456 ms      4.854 GB/s      2.33x            2.33x
  Attempt #3 (sequential addressing)                          1.722 ms      9.741 GB/s      2.01x            4.68x
  Attempt #4 (first add during global load)                   0.965 ms     17.377 GB/s      1.78x            8.34x
  Attempt #5 (unroll last warp)                               0.536 ms     31.289 GB/s      1.8x            15.01x
  Attempt #6 (completely unrolled)                            0.381 ms     43.996 GB/s      1.41x           21.16x
  Attempt #7 (multiple elements per thread)                   0.268 ms     62.671 GB/s      1.42x           30.04x

• Attempt #7 on 32M elements: 73 GB/s!
Final optimized reduction kernel
template <unsigned int blockSize>
__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x; // tid within a block
unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
unsigned int gridSize = (blockSize*2*gridDim.x);
sdata[tid] = 0;