
CSC 367 Parallel Programming

General-purpose computing with Graphics Processing Units (GPUs)
Comprehensive example – Reductions
With many thanks to NVIDIA for some of the neat CUDA examples!


University of Toronto Mississauga, Department of Mathematical and Computational Sciences

• Other ways to do reduction
  • Shuffle down, use atomics
• Comprehensive example: reduction operation on GPUs
  • Use all we learnt so far …

• Reduce multiple values into a single value
  • e.g., find the sum (or max, min, etc.) of an array
  • This is quite a common operation => must be efficiently implemented
• Easy to implement in parallel, but hard to get done efficiently
• Great optimization example for all we’ve learnt so far
  • Go through several iterations of this implementation
  • Take several optimization strategies

Reduction – sum of an array
• Sequential implementation – O(n):

Parallel Reduction
• Approach: pair-wise sums per each thread
• Tree-like structure, O(log2 n)
• Each thread adds two elements and produces a partial sum
• Problem: each stage boundary (red dotted line in the figure) must act like a barrier!
  • values above it must be calculated before we can proceed

Parallel Reduction
• Now sum four per thread instead
• Tree-like structure, O(log4 n)
• Each thread adds 4 elements
• Problem: similarly, must synchronize threads between stages

Parallel Reduction
• Main goals:
  • Must keep all SMs on the GPU busy
  • Must leverage the various GPU memories efficiently
• Considerations:
  • Must use a large number of thread blocks
  • Each block produces a partial result
  • Must communicate partial results between blocks, to compute further partial results and ultimately the final result

Using shared memory – preliminaries
• Compiler creates a copy of a __shared__ variable for each block
  • All threads in a block share the variable’s memory, but cannot see (or change) the copy of the same variable that another block sees
• Idea: maintain partial sums in shared memory!
  • __shared__ int sdata[];
  • Each block processes a portion of the input
• Recall also that we need barriers:
  • __syncthreads(): all threads in a block have completed instructions before it
  • Not a global barrier, only within a block!
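A CPU sketch of the per-block idea: every block gets its own private sdata (mirroring how each block sees its own copy of a __shared__ variable), reduces its slice, and emits one partial sum per block. The block size and even split are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// One partial sum per block. Each block's sdata is a separate array,
// just as __shared__ memory is private to a block on the GPU; on a real
// GPU a __syncthreads() barrier would separate the reduction steps.
std::vector<int> block_partial_sums(const std::vector<int>& in, int blockDim) {
    int nblocks = (int)in.size() / blockDim;  // assume an even split
    std::vector<int> partial(nblocks);
    for (int b = 0; b < nblocks; b++) {
        std::vector<int> sdata(in.begin() + b*blockDim,
                               in.begin() + (b+1)*blockDim);  // this block's copy
        for (int s = blockDim/2; s > 0; s /= 2)  // tree reduction in "shared" memory
            for (int t = 0; t < s; t++)
                sdata[t] += sdata[t + s];
        partial[b] = sdata[0];  // like g_odata[blockIdx.x] = sdata[0]
    }
    return partial;
}
```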

Global synchronization problem
• Problem: how to know when each block has finished its partial sum?
• We need some way to wait for the last block to finish => we need a global barrier
• CUDA does not provide a global barrier – not a good idea
  • Expensive to implement
  • Either limits the number of blocks, or potential for deadlock – think why!

Decompose reduction into multiple kernels
• Leverage implicit synchronization between kernel invocations
• Wait… what about the overheads of launching a new kernel?
  • Negligible hardware overhead, low software overhead

Parallel reduction with multiple kernels
• Basic idea: decompose computations into multiple successive kernel invocations
• The code for all levels is exactly the same
  • “Recursive” kernel invocation

What is our optimization goal?
• We should strive to reach GPU peak performance
• Choose the right metric:
  • GFLOP/s: for compute-bound kernels
  • Bandwidth: for memory-bound kernels
• Reductions have very low arithmetic intensity
  • 1 flop per element loaded (bandwidth-optimal)
• Calculate the bandwidth on your GPU, e.g.:
  • 384-bit memory interface, 900 MHz DDR (1800 MT/s effective)
  • 384 * 1800 / 8 = 86,400 MB/s = 86.4 GB/s
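The peak-bandwidth arithmetic above, spelled out: a 384-bit interface moves 48 bytes per transfer, and 900 MHz DDR memory transfers on both clock edges, i.e. 1800 million transfers per second.

```cpp
#include <cassert>
#include <cmath>

// Peak memory bandwidth from bus width and effective transfer rate.
double peak_gb_per_s(int bus_bits, int mega_transfers_per_s) {
    return (bus_bits / 8.0)        // bytes per transfer
           * mega_transfers_per_s  // MT/s => MB/s
           / 1000.0;               // MB/s => GB/s
}
```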

Access pattern
• Compute partial sums at each step, with a different stride
• First adjacent elements, then every other element, every 4th, etc.
• Each tid is mapped onto an element; some threads don’t do work depending on the step
[Figure: interleaved addressing – at s=1 threads 0, 2, 4, …, 14 are active; by s=8 only thread 0 remains]

Attempt #1: Interleaved addressing

    __global__ void reduce1(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];  // ignore the “extern” for now

        unsigned int tid = threadIdx.x;  // tid within a block
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];  // Each thread loads one elem to shmem
        __syncthreads();

        // Reduction in shared memory, half the threads each step
        for (int s = 1; s < blockDim.x; s *= 2) {  // step = s * 2
            if (tid % (2*s) == 0) {  // Take only tids divisible by step
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // Write the result for this block back to global memory
        if (tid == 0) { g_odata[blockIdx.x] = sdata[0]; }
    }

• Note: can we only wait for those threads that update elements, i.e., move the __syncthreads() inside the “if”? No – __syncthreads() must be reached by all threads in the block
• In a warp, not all threads participate in the “if” – divergent code impacts performance!

Divergent execution
• Consider two warps – what happens at each step?
[Figure: active vs. inactive threads per warp at each step]

Attempt #2: Deal with divergent code
• Compute partial sums at each step, with a different stride
• First adjacent elements, then every other element, every 4th, etc.
• Now all threads do work at the start, to decrease divergence
• Idea: replace the divergent branch ...

    for (int s = 1; s < blockDim.x; s *= 2) {  // step = s * 2
        if (tid % (2*s) == 0) {  // Take only tids divisible by step
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

• ... with a strided index and non-divergent branch:

    for (int s = 1; s < blockDim.x; s *= 2) {
        int idx = 2 * s * tid;  // Take all tids, then first half, etc
        if (idx < blockDim.x) {
            sdata[idx] += sdata[idx + s];
        }
        __syncthreads();
    }

Deal with divergence
• Previous approach had divergence at every step; no divergence now for s=1
• In reality, more than just 2 warps => more non-divergent steps
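The divergence difference can be made concrete with a small sketch that counts, at a given step s, how many 32-thread warps contain both active and inactive threads. Interleaved addressing activates tids with tid % (2*s) == 0; the strided index activates tids whose 2*s*tid still falls inside the block. The 128-thread block size is illustrative.

```cpp
#include <cassert>

// Count warps that mix active and inactive threads ("divergent" warps)
// at step s, for a block of blockDim threads (warp size 32).
int divergent_warps(int blockDim, int s, bool strided_index) {
    int count = 0;
    for (int w = 0; w < blockDim / 32; w++) {
        int active = 0;
        for (int lane = 0; lane < 32; lane++) {
            int tid = w*32 + lane;
            bool works = strided_index ? (2*s*tid < blockDim)
                                       : (tid % (2*s) == 0);
            if (works) active++;
        }
        if (active != 0 && active != 32)  // mixed warp => divergence
            count++;
    }
    return count;
}
```

With 128 threads at s=1, the interleaved branch leaves every warp half-active (4 divergent warps), while the strided index keeps each warp either fully active or fully idle (0 divergent warps).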

New problem: bank conflicts
• Reminder: BC = threads in a half-warp access different 32-bit words from the same bank
• Interleaved accesses => at almost every step, we have 2-way bank conflicts
  • Consider what happens with a large number of threads (more than a half-warp!)
• Intuition: accessing memory non-sequentially => end up accessing more memory locations from the same bank
[Figure: interleaved addressing – thread ids vs. the banks their accesses fall into at each stride]
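The intuition can be checked with a small sketch that maps each thread's access to a bank. It assumes 16 banks of 32-bit words and 16-thread half-warps (the hardware generation these slides target); `stride` is the distance in words between consecutive threads' accesses, so interleaved addressing at step s gives stride 2*s, while sequential addressing gives stride 1.

```cpp
#include <cassert>

// Worst-case number of threads in a half-warp hitting the same bank
// (bank = word index % banks). Returns 1 if conflict-free, k for a
// k-way conflict. Bank and half-warp counts are assumptions.
int max_bank_conflict(int stride, int half_warp = 16, int banks = 16) {
    int hits[64] = {0};
    for (int t = 0; t < half_warp; t++)
        hits[(stride * t) % banks]++;
    int worst = 0;
    for (int b = 0; b < banks; b++)
        if (hits[b] > worst) worst = hits[b];
    return worst;
}
```

Under this model, max_bank_conflict(2) reports the 2-way conflict of interleaved addressing at s=1, and the conflicts double with each further step, while max_bank_conflict(1) shows the sequential pattern is conflict-free.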

“Sequential” access pattern
• Keep in mind how consecutive memory locations are laid out across banks!
• No conflicts with this sequential addressing
[Figure: sequential addressing – at each step, threads 0–7 access consecutive words, spreading accesses across banks]

Attempt #3: Enforce sequential addressing
• Idea: replace the strided indexing in the loop ...

    for (int s = 1; s < blockDim.x; s *= 2) {
        int idx = 2 * s * tid;  // Take all tids, then first half, etc
        if (idx < blockDim.x) {
            sdata[idx] += sdata[idx + s];
        }
        __syncthreads();
    }

• ... with a “reversed” loop and threadID-based indexing:

    for (int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

Idle threads
• Problem: in the loop above, half the threads are idle on the first iteration, and it only gets worse after that – this is not efficient!

Attempt #4: First iteration at load time
• How about halving the number of blocks, and replacing the single load ...

    unsigned int tid = threadIdx.x;  // tid within a block
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];  // Each thread loads 1 elem to shmem
    __syncthreads();

• ... with 2 loads and the first add of the reduction:

    unsigned int tid = threadIdx.x;  // tid within a block
    unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

Performance for 4M element reduction
• Block size: 128 threads; results are on an older GPU:

                                                        Time (4M ints)  Bandwidth    Step speedup  Cumulative speedup
Attempt #1 (interleaved addressing + divergent branch)  8.054 ms         2.083 GB/s  –             –
Attempt #2 (interleaved addressing + bank conflicts)    3.456 ms         4.854 GB/s  2.33x         2.33x
Attempt #3 (sequential addressing)                      1.722 ms         9.741 GB/s  2.01x         4.68x
Attempt #4 (first add during global load)               0.965 ms        17.377 GB/s  1.78x         8.34x

Instruction bottleneck
• At 17 GB/s, we’re far from bandwidth-bound
  • And we know the reduction has low arithmetic intensity ...
• Therefore, a likely bottleneck is instruction overhead
  • Ancillary instructions that are not loads, stores, or arithmetic for the core computation
  • In other words: address arithmetic and loop overhead
• Strategy: unroll loops!

Unrolling the last warp
• As the reduction proceeds, the number of “active” threads decreases
  • When s <= 32, we have only one warp left
• Recall that instructions are SIMD-synchronous within a warp
• That means when s <= 32:
  • We don’t need to __syncthreads()
  • We don’t need the “if (tid < s)” because it doesn’t save any work
• Let’s unroll the last 6 iterations of the inner loop

Attempt #5: Unroll the last warp

    for (int s = blockDim.x/2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if (tid < 32) {  // sdata must be declared volatile for this warp-synchronous code
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }

• Note: this saves useless work in all warps, not just the last one!
  • Without unrolling, all warps execute every iteration of the for loop and if statement
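Attempt #5 can be sketched on the CPU for a single block's sdata. One caveat the sketch makes explicit: each of the six unrolled steps must fully complete before the next begins, which is exactly what warp-synchronous SIMD execution guarantees on the GPU (given a volatile sdata); a per-thread ordering would compute the wrong sum.

```cpp
#include <cassert>
#include <vector>

// Loop down to a single warp, then run the six unrolled steps.
// Block size is the vector length; power of two >= 64 assumed.
int reduce_unrolled_tail(std::vector<int> sdata) {
    int blockSize = (int)sdata.size();
    for (int s = blockSize / 2; s > 32; s >>= 1)
        for (int tid = 0; tid < s; tid++)   // the "if (tid < s)" part
            sdata[tid] += sdata[tid + s];
    const int steps[6] = {32, 16, 8, 4, 2, 1};  // the six unrolled warp steps
    for (int s : steps)
        for (int tid = 0; tid < s; tid++)   // step completes before the next starts
            sdata[tid] += sdata[tid + s];
    return sdata[0];
}
```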

Performance for 4M element reduction

                                                        Time (4M ints)  Bandwidth    Step speedup  Cumulative speedup
Attempt #1 (interleaved addressing + divergent branch)  8.054 ms         2.083 GB/s  –             –
Attempt #2 (interleaved addressing + bank conflicts)    3.456 ms         4.854 GB/s  2.33x         2.33x
Attempt #3 (sequential addressing)                      1.722 ms         9.741 GB/s  2.01x         4.68x
Attempt #4 (first add during global load)               0.965 ms        17.377 GB/s  1.78x         8.34x
Attempt #5 (unroll last warp)                           0.536 ms        31.289 GB/s  1.8x          15.01x

Complete unrolling
• Idea: if we knew the number of iterations at compile time, we could completely unroll the reduction
• Luckily, the block size is limited by the GPU (assume that limit is 512 threads for our card)
• Also, we are sticking to power-of-2 block sizes
  • So we can easily unroll for a fixed block size
• Problem: we need to be generic – how can we unroll for block sizes that we don’t know at compile time?

Unrolling with Templates
• Templates to the rescue!
• CUDA supports C++ template parameters on device and host functions
• Specify the block size as a function template parameter:

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata) {
        ...
    }
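The trick relies on ordinary C++ non-type template parameters, so it can be sketched on the CPU: because blockSize is a compile-time constant, each `if (blockSize >= ...)` compares two constants and the compiler drops the dead stages entirely. This single-threaded sketch runs each "stage" (one warp-wide step) to completion before the next.

```cpp
#include <cassert>

// Compile-time-sized reduction stages: instantiating with a different
// blockSize produces a differently unrolled function, no runtime branches.
template <unsigned int blockSize>
void unrolled_stages(int *sdata) {
    auto stage = [&](unsigned int s) {
        for (unsigned int tid = 0; tid < s; tid++)
            sdata[tid] += sdata[tid + s];
    };
    if (blockSize >= 64) stage(32);  // each test folds at compile time
    if (blockSize >= 32) stage(16);
    if (blockSize >= 16) stage(8);
    if (blockSize >=  8) stage(4);
    if (blockSize >=  4) stage(2);
    if (blockSize >=  2) stage(1);
}
```

Calling unrolled_stages<64>(a) on a 64-element array leaves the total in a[0]; unrolled_stages<8> compiles to just the last three stages.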

Attempt #6: Completely Unrolled

    if (blockSize >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
    }
    if (blockSize >= 256) {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
    }
    if (blockSize >= 128) {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
    }
    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }

• Note: all comparisons against blockSize (highlighted in red on the original slide) are evaluated at compile time
• Results in a very efficient inner loop!

Invoking Template Kernels
• Don’t we still need the block size at compile time?
• Nope, just a switch statement with 10 possible block sizes:

    switch (threads) {
    case 512: reduce5<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 256: reduce5<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case 128: reduce5<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case  64: reduce5< 64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case  32: reduce5< 32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case  16: reduce5< 16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   8: reduce5<  8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   4: reduce5<  4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   2: reduce5<  2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    case   1: reduce5<  1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    }

Performance for 4M element reduction

                                                        Time (4M ints)  Bandwidth    Step speedup  Cumulative speedup
Attempt #1 (interleaved addressing + divergent branch)  8.054 ms         2.083 GB/s  –             –
Attempt #2 (interleaved addressing + bank conflicts)    3.456 ms         4.854 GB/s  2.33x         2.33x
Attempt #3 (sequential addressing)                      1.722 ms         9.741 GB/s  2.01x         4.68x
Attempt #4 (first add during global load)               0.965 ms        17.377 GB/s  1.78x         8.34x
Attempt #5 (unroll last warp)                           0.536 ms        31.289 GB/s  1.8x          15.01x
Attempt #6 (completely unrolled)                        0.381 ms        43.996 GB/s  1.41x         21.16x

Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
  • e.g., N=16, S=1, 2, 3, …
[Figure: tree reduction over 16 elements, thread ids 0–7 active at the first step]

Parallel reduction complexity
• We have log N parallel steps; each step S does N/2^S independent ops
  • Step complexity is O(log N)
• For N = 2^D, performs Σ_{S ∈ [1..D]} 2^(D−S) = N − 1 operations
  • Work complexity is O(N) => it is work-efficient
  • i.e., does not perform more operations than a sequential algorithm
• With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
  • Compared to O(N) for sequential reduction
  • In a thread block: P=N => O(log N)
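The operation count above is easy to verify numerically: for N = 2^D, summing 2^(D−S) over S from 1 to D telescopes to N − 1, the same total as the sequential algorithm.

```cpp
#include <cassert>

// Total adds performed by the tree reduction on N = 2^D elements:
// step S contributes N/2^S = 2^(D-S) independent adds.
long long tree_ops(int D) {
    long long ops = 0;
    for (int S = 1; S <= D; S++)
        ops += 1LL << (D - S);
    return ops;  // equals 2^D - 1, i.e. N - 1
}
```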

What about cost?
• Cost of a parallel algorithm is # of processors × time complexity
• Allocate threads instead of processors: O(N) threads
  • Time complexity is O(log N) => cost is O(N log N) => not cost-efficient!
• A previous lecture: let’s scale down to O(N / log N) threads (technically, this derives from something called Brent’s theorem…)
  • Each thread does O(log N) sequential work
  • Then all O(N / log N) threads cooperate for O(log N) steps
  • Cost = O((N / log N) × log N) = O(N) => cost-efficient!
• Recall the discussion from a few lectures ago too… (same idea)
• Sometimes called algorithm cascading
  • Can lead to significant speedups in practice

Algorithm cascading
• Combine sequential and parallel reduction
  • Each thread loads and sums multiple elements into shared memory
  • Tree-based reduction in shared memory
• Each thread should sum up O(log N) elements
  • i.e., 1024 or 2048 elements per block vs. 128 or 256
• Might be worth pushing it even further
  • Possibly better latency hiding with more work per thread
  • More threads per block reduces levels in the tree of recursive kernel invocations
  • High kernel launch overhead in last levels with few blocks
• Play around with this on your GPU – optimal # of blocks and threads? How many elements per thread?

Attempt #7: Multiple adds per thread
• Replace the load and add of two elements ...

    unsigned int tid = threadIdx.x;  // tid within a block
    unsigned int i = blockIdx.x * (blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

• ... with a while loop to add as many as necessary:

    unsigned int tid = threadIdx.x;  // tid within a block
    unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

• Why do we use gridSize as the loop stride? Maintains coalescing!

Performance for 4M element reduction

                                                        Time (4M ints)  Bandwidth    Step speedup  Cumulative speedup
Attempt #1 (interleaved addressing + divergent branch)  8.054 ms         2.083 GB/s  –             –
Attempt #2 (interleaved addressing + bank conflicts)    3.456 ms         4.854 GB/s  2.33x         2.33x
Attempt #3 (sequential addressing)                      1.722 ms         9.741 GB/s  2.01x         4.68x
Attempt #4 (first add during global load)               0.965 ms        17.377 GB/s  1.78x         8.34x
Attempt #5 (unroll last warp)                           0.536 ms        31.289 GB/s  1.8x          15.01x
Attempt #6 (completely unrolled)                        0.381 ms        43.996 GB/s  1.41x         21.16x
Attempt #7 (multiple elems per thread)                  –               62.671 GB/s  –             –

• Attempt #7 on 32M elements: 73 GB/s!

Final optimized reduction kernel

    template <unsigned int blockSize>
    __global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;  // tid within a block
        unsigned int i = blockIdx.x * (blockSize*2) + threadIdx.x;
        unsigned int gridSize = blockSize*2*gridDim.x;
        sdata[tid] = 0;