
CSC 367 Parallel Programming

General-purpose computing with Graphics Processing Units (GPUs) (continued)
With many thanks to NVIDIA for some of the neat CUDA examples!

Reminder: Why GPUs?
• Large collection of SIMD multiprocessors
• Massive thread parallelism – 100s of processors, high memory bandwidth
• Hides memory latency better than CPUs
• Addresses the CPU-memory gap

[Figure: host/device architecture – a host CPU with a few processing units and main memory, connected to a device GPU with many processing units, per-SM local memories, and device memory]

• CUDA programming basics, examples
• Grids and blocks and threads
• Occupancy
• Parallel execution model
• Memory model basics, memory types and purpose
  • Global memory, memory coalescing, local memory
  • Shared memory, bank conflicts
  • Constant memory
  • Texture memory

CUDA C (Typical) program
• allocate memory on CPU (on the “host”)
• allocate memory on GPU (on the “device”)
• transfer data to device memory
• launch kernel
• wait to finish
• transfer data back to host memory (if necessary)
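A minimal host-side sketch of this flow (a fragment, not from the slides; error checking omitted; the kernel name myKernel and sizes are hypothetical):

int n = 1 << 20;                                        // hypothetical problem size
size_t bytes = n * sizeof(float);

float *h_a = (float *)malloc(bytes);                    // allocate memory on the host
float *d_a;
cudaMalloc(&d_a, bytes);                                // allocate memory on the device

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // transfer data to device memory

myKernel<<<(n + 255) / 256, 256>>>(d_a, n);             // launch kernel
cudaDeviceSynchronize();                                // wait for the kernel to finish

cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    // transfer data back to host memory

cudaFree(d_a);                                          // clean up
free(h_a);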

Examples – simple kernels
• Remember terminology
  • Host = CPU
  • Device = GPU
• Keywords __global__, __device__
• Kernel launched with <<< >>> – blocks, threads (more on this later)
• Code…
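For instance, a minimal (hypothetical) kernel and its launch could look like this:

__global__ void add(int *a, int *b, int *c) {
    int i = threadIdx.x;          // one thread per element, single block
    c[i] = a[i] + b[i];
}

// Host code: launch 1 block of 64 threads on device arrays dev_a, dev_b, dev_c
// add<<<1, 64>>>(dev_a, dev_b, dev_c);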

Useful tools
• Visual Studio edition of Nsight for Windows
• Later versions of CUDA include Nsight, an IDE for Linux/Mac based on Eclipse (includes a profiler and debugger too!)
• The command-line profiler (nvprof), or the visual profiler (nvvp) – useful to analyze your programs
  • nvprof’s “Profiling result” section – useful to know where time is being spent
  • nvprof’s API trace can be turned off (--profile-api-trace none) to reduce some profiling overhead for short kernels
  • nvvp’s timeline – visually see what the execution looks like
• For more info, consult NVIDIA’s documentation or man pages (nvprof --help)
• Recall that profiling your code can save you a lot of trouble

Debugging and memory checks
• cuda-memcheck (runtime error checker tool for memory accesses, similar to valgrind to some extent) – some instability issues in the past
• Good ole’ printing will only get you so far…
  • Printing on the device involves data transfers (even in CUDA’s cuPrintf…)
  • You *can* still do it, but it will be painful
• cuda-gdb: must compile code with -g (host code) and -G (device code)
  • Example: $ nvcc -g -G worst_program_ever.cu -o best_program_ever
  • Similar to gdb (run, continue, bt, info, kill, break, print, next, step, quit, etc.)
  • cuda-gdb has cuda-memcheck integrated (use: set cuda memcheck on)
• Quick guide: http://developer.download.nvidia.com/GTC/PDF/1062_Satoor.pdf

Grids and blocks and threads, oh my…
• Thread: basic unit of execution/parallelism
  • Can be organized in a 1D, 2D, or 3D layout within a block (for easier indexing)
  • Scheduled in warps (batches of 32 threads)
• Block: logical organization of a collection of threads
  • Each block could have, e.g., 64, 256, 512, 768, 1024, etc., threads
  • Not all blocks run in parallel, but more than 1 can run on an SM concurrently
  • Each block of threads is assigned to an SM (no control over which goes where!)
  • If there are way more blocks than SMs, blocks are queued and context-switched
  • Each SM can maintain the context for multiple blocks!
• Grid: logical organization of a collection of blocks
  • e.g., 1D, 2D, 3D logical layout of the blocks (easier indexing)

Grids and blocks and threads, oh my…
• Example: increment all elements in an N * N matrix
  • Say we want to assign one element per thread, and N is 1024
  • Let’s pick block size = 256 threads => 1024 * 1024 / 256 = 4096 blocks
  • Organize threads in 2D blocks => 16 * 16 threads per block
  • Organize blocks in a 2D grid => 64 * 64 blocks
• Declare them in CUDA C:
  • dim3 threadsPerBlock(16, 16); // 256 threads in total
  • dim3 blocks(N / threadsPerBlock.x, N / threadsPerBlock.y); // 4096 blocks total
• Launch kernel using the declared dimensions:
  • compute<<<blocks, threadsPerBlock>>>(/* kernel parameters */);
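A sketch of a matching kernel for this launch configuration (the kernel body is an assumption, not from the slides):

__global__ void compute(float *A, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. N-1
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. N-1
    if (row < N && col < N)
        A[row * N + col] += 1.0f;                      // increment one element per thread
}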

Identifying/indexing a thread
• Each thread can identify its assigned element, as follows:
  • int i = (blockIdx.x * blockDim.x) + threadIdx.x;
  • int j = (blockIdx.y * blockDim.y) + threadIdx.y;
• These are built-in variables which a thread can use to identify its index within the grid and block:
  • threadIdx = thread index within its block
  • blockDim = size of a block (how many threads in each dimension)
  • blockIdx = block index in the grid
  • gridDim = size of a grid (how many blocks in each dimension)

[Figure: a 1D grid of blocks B0 B1 B2 … Bn-1, each a 1D block of threads T0 T1 T2 … Tm-1; and a 2D grid of blocks B(blockIdx.x, blockIdx.y), each a 2D block of threads T(threadIdx.x, threadIdx.y)]
• 1D grid of blocks, each one is a 1D block of threads:
  • globalTID = blockIdx.x * blockDim.x + threadIdx.x
• 2D grid of blocks, each one is a 2D block of threads – locate a thread in this 2D topology:
  • thrRowID = blockIdx.x * blockDim.x + threadIdx.x
  • thrColId = blockIdx.y * blockDim.y + threadIdx.y

[Figure: the same 1D and 2D grid/block layouts as on the previous slide]
• 1D grid of blocks, each one is a 1D block of threads:
  • globalTID = blockIdx.x * blockDim.x + threadIdx.x
• 2D grid of blocks, each one is a 2D block of threads – IFF you must convert to a global 1D tid:
  • Block ID within the 2D grid: bid = blockIdx.y * gridDim.x + blockIdx.x
  • Tid within the 2D grid of 2D blocks: tid = bid * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x
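As a sketch, the same conversion wrapped in a device helper (the helper name is hypothetical):

__device__ int globalTid2D(void) {
    int bid = blockIdx.y * gridDim.x + blockIdx.x;            // block ID within the 2D grid
    return bid * (blockDim.x * blockDim.y)
         + threadIdx.y * blockDim.x + threadIdx.x;            // thread ID within the grid
}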

Limitations
• Maximum number of threads per block
• Maximum number of blocks per grid
• These numbers depend on the GPU – see previous lab (card specs)!
• These are hardware limitations => if exceeded, kernel launch failure!
  => For huge data, cannot count on one item per thread
• Not always best to run with the max for each of these
  • Sometimes less is more :)
  • Understand your device and pick the parallelism parameters for the job

Occupancy
• Hiding latency: when one warp is stalled, execute a different warp
• Need a metric related to how many active warps are on an SM
  • Tells us how effectively the H/W is kept busy
• Occupancy = ratio of the number of active warps per SM to the max number of possible active warps (for the latter, see the lab exercise)
• Higher occupancy does not necessarily mean higher performance
  • At some point additional occupancy doesn’t improve performance
• However, low occupancy is always bad – poor memory latency hiding

• Register availability is a limiting factor
  • Ability to hold local variables (with low-latency access) for lots of threads
  • If each block uses many registers, the number of blocks that can be resident on an SM is reduced => lowers the occupancy of the SM
  • e.g., 8192 registers per SM, 1024 threads resident per SM
    => for 100% occupancy, each thread can only use 8 registers max
  • Depends on the CUDA compute capability
• In older CUDA, the occupancy calculator is basically a spreadsheet
  • Calculate occupancy based on block size and shared memory usage
• In newer versions (after CUDA 6.5), runtime functions for this
  • Occupancy calculator API: see documentation and cuda_occupancy.h
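A minimal sketch of the runtime occupancy API (assumes CUDA 6.5+, device 0, and a kernel named myKernel):

int blockSize = 256;
int maxActiveBlocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, myKernel,
                                              blockSize, 0 /* dynamic shmem */);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// active warps per SM divided by the maximum possible active warps per SM
float occupancy = (maxActiveBlocks * blockSize / (float)prop.warpSize) /
                  (prop.maxThreadsPerMultiProcessor / (float)prop.warpSize);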

• How do we implement this in CUDA?
• Problem: array A, add a number X to all elements in parallel => array A’
  • Allocate memory for A and B (copy of A) on the CPU
  • Allocate A’ on the GPU
  • Transfer contents of A to A’ in device memory
  • Launch a kernel
  • Copy result A’ back to CPU memory into A
  • Compare A to B, to check correctness
• Next up: How to do this efficiently? What does efficient mean in this context?
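For the “launch a kernel” step, one possible sketch (the kernel name and launch configuration are assumptions):

__global__ void addX(int *a, int x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += x;                 // A'[i] = A[i] + X, one element per thread
}

// addX<<<(n + 255) / 256, 256>>>(d_a, x, n);   // hypothetical launch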

Parallel execution
• Single-Instruction Multiple-Threads (SIMT) model
  • SIMD requires vector code in each thread; SIMT allows you to write scalar code per thread (vectorization is guaranteed by the hardware)
• A single instruction is issued for a warp at a time (warp = 32 threads)
  • Threads in a warp execute instructions in lock-step (same instruction for all)
• Warps can run ahead of other warps – use __syncthreads() to barrier all thread warps in a block (not all threads from all blocks!!)

[Figure: a block’s threads split into warps of 32 – T0–T31, T32–T63, T64–T95, …]

Control Flow
[Figure: warp execution timelines contrasting coherent execution with divergent execution across different instruction paths (“instructions” vs. “other instructions”)]
• Possible loss of efficiency if code is not written with this in mind!
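An illustrative sketch of the difference (hypothetical kernel): the first branch splits lanes within a warp, the second branches on a warp-aligned condition.

__global__ void branches(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the warp executes both paths one after the other.
    if (tid % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Coherent: all 32 threads of a warp see the same condition,
    // so each warp executes only one of the two paths.
    if ((tid / 32) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}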

Memory and access techniques
• Memory types
• Memory coalescing
• Shared memory and bank conflicts

Memory types
• Global memory
• Local memory
• Shared memory
• Constant memory
• Texture memory

Local and Global memory
• Global memory
  • Most data resides here
  • Host communication (this is where data gets transferred from/to CPU memory)
  • Shared by all threads
  • Large size (a few GB typically), but slower than shared memory
  • L1 cache helps hide the latency of global (and local) memory accesses
  • Good bandwidth via memory coalescing
• Local memory (keep in mind: terms used in CUDA)
  • aka private per-thread global memory
  • Auto variables, register spills
  • Same speed as global memory!

Memory coalescing
• Warp accesses should reference sequential memory locations for best performance => these accesses get coalesced into a single access

[Figure: coalesced accesses (threads T0 … T31 touching consecutive locations) vs. scattered accesses (threads T0 … T31 touching non-contiguous locations)]

Examples – memory coalescing
• In C, a matrix is stored row-wise as a 1-D array

A0,0 A0,1 A0,2 A0,3 A0,4 A0,5 A0,6 A0,7
A1,0 A1,1 A1,2 A1,3 A1,4 A1,5 A1,6 A1,7
A2,0 A2,1 A2,2 A2,3 A2,4 A2,5 A2,6 A2,7
A3,0 A3,1 A3,2 A3,3 A3,4 A3,5 A3,6 A3,7
A4,0 A4,1 A4,2 A4,3 A4,4 A4,5 A4,6 A4,7
A5,0 A5,1 A5,2 A5,3 A5,4 A5,5 A5,6 A5,7
A6,0 A6,1 A6,2 A6,3 A6,4 A6,5 A6,6 A6,7
A7,0 A7,1 A7,2 A7,3 A7,4 A7,5 A7,6 A7,7

In memory (row-major): A0,0 A0,1 A0,2 … A0,6 A0,7 A1,0 A1,1 A1,2 … A1,6 A1,7 … A7,0 A7,1 A7,2 … A7,6 A7,7

Examples – memory coalescing
• Consider warp = 8 threads, each processing either a row or a column

[Figure: the same 8×8 matrix shown twice – once with each thread assigned a full row, once with each thread assigned a full column]

Which accesses are coalesced?

Examples – memory coalescing
• Consider what happens on each memory access

[Figure: the row-major 1-D layout A0,0 A0,1 A0,2 … A7,7, shown once for the row-per-thread case and once for the column-per-thread case]

Examples – memory coalescing
• Consider this matrix of elements:

A B C D
E F G H
I J K L

• Consider that each thread processes the following elements… thoughts?
  • Case a) T0: A,B,C   T1: D,E,F   T2: G,H,I   T3: J,K,L
  • Case b) T0: A,E,I   T1: B,F,J   T2: C,G,K   T3: D,H,L
• Code may sometimes not be intuitive to write in the most efficient way
  • We do want good performance though… (see the sketch below)
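A sketch of the two patterns for this 3 x 4 row-major matrix, with 4 threads each handling 3 elements (the kernels are hypothetical):

__global__ void caseA(int *m) {         // T0:A,B,C  T1:D,E,F  T2:G,H,I  T3:J,K,L
    int t = threadIdx.x;                // 0..3
    for (int k = 0; k < 3; k++)
        m[t * 3 + k] += 1;              // at each step, threads are 3 elements apart (not coalesced)
}

__global__ void caseB(int *m) {         // T0:A,E,I  T1:B,F,J  T2:C,G,K  T3:D,H,L
    int t = threadIdx.x;                // 0..3
    for (int k = 0; k < 3; k++)
        m[k * 4 + t] += 1;              // at each step, threads touch consecutive elements (coalesced)
}

Even though case b) looks “column-wise” per thread, on every step consecutive threads touch consecutive addresses, which is exactly what coalescing needs.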

Example (revisited)
• Launch several threads, 1 block
  • Coalesced vs. uncoalesced accesses
• Launch several blocks
  • Let’s assign one element per thread
  • Introduce warp divergence
  • Hardware limits – max number of threads and blocks!
  • Generalize kernel to handle several elements per thread
• Learn by doing: try these things yourself!
  • Analyze your findings!

Shared memory
• Lower latency than global memory
• Acts as a software-programmable cache (it’s a chunk of L1!)
  • Declare intention by using the __shared__ keyword
• Organized in 32 banks (Bank0 … Bank29, Bank30, Bank31)
  • Successive 32-bit words are assigned to successive banks
  • Any memory load/store of N addresses spanning N distinct memory banks can be serviced simultaneously => N times the bandwidth of a single bank!
  • Bandwidth of shared memory: 32 bits per bank per cycle
• Bank conflicts: intuitively, it’s the failure to distribute the threads’ accesses across memory banks

Bank conflicts
[Figure: shared-memory access patterns across banks Bank0 … Bank14, Bank15, … – linear addressing (stride = 1), a random 1:1 mapping to distinct banks, and a pattern where some threads access the same bank]
• When threads in a warp access different 32-bit words from the same bank
• Must avoid bank conflicts! => Design code accordingly (see the sketch below)
  • Threads accessing bytes within the same 32-bit word are ok though => no conflict
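A sketch showing how the indexing stride determines conflicts (assumes a single warp of 32 threads; the kernel is hypothetical):

__global__ void bankDemo(float *out) {
    __shared__ float s[1024];
    int t = threadIdx.x;                    // assume blockDim.x == 32 (one warp)

    for (int k = t; k < 1024; k += 32)      // fill shared memory; each step is stride 1:
        s[k] = (float)k;                    // 32 distinct banks, no conflict
    __syncthreads();

    float a = s[t * 2];                     // stride 2: threads t and t+16 hit the same bank (2-way conflict)
    float b = s[t * 32];                    // stride 32: all 32 threads hit bank 0 (32-way conflict, serialized)
    out[t] = a + b;
}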

Shared memory – small example
• Reverse an array
• Assume small size, e.g., 64 elements, just for clarity

[Figure: example array 1 9 2 3 1 4 7 3 3 1 9 5 reversed to 5 9 1 3 3 7 4 1 3 2 9 1]

CPU version:

void cpuReverse(int *a, int n) {
  for (int i = 0; i < n/2; i++) {
    int tmp = a[i];
    a[i] = a[n - i - 1];
    a[n - i - 1] = tmp;
  }
}

• GPU: copy array to global memory, use shared memory for reversing
• Each thread processes an element (simplifying assumption, can generalize)
• Careful about synchronization!

Static shared memory
• Static shared memory
  • Statically allocate a shared memory array of fixed size
  • When we know the amount of memory at compile time

__global__ void staticReverse(int *g, int n) {
  __shared__ int s[64];   // declare shmem, note g is global
  int t = threadIdx.x;
  int tr = n - t - 1;     // index of element to reverse by tid t
  s[t] = g[t];            // store an element into shared memory
  __syncthreads();
  g[t] = s[tr];           // copy element in the right position
}
...
staticReverse<<<1,n>>>(dev_arr, n);
Courtesy of NVIDIA: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/

Dynamic shared memory
• Dynamic shared memory
  • Don’t know the size of s at compile time, only known at runtime => specify it at kernel launch!

__global__ void dynamicReverse(int *g, int n) {
  extern __shared__ int s[];   // dynamic shmem, unsized array
  int t = threadIdx.x;
  int tr = n - t - 1;          // index of element to reverse by tid t
  s[t] = g[t];                 // store an element into shared memory
  __syncthreads();
  g[t] = s[tr];                // copy element in the right position
}
...
dynamicReverse<<<1,n,n*sizeof(int)>>>(dev_arr, n); // size arg

• What if we needed multiple dynamically sized arrays?
  • Declare a single extern array, divide it up, and use pointer magic! (see the sketch below)
Courtesy of NVIDIA: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
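A sketch of that pointer trick, with hypothetical array sizes nInts and nFloats:

__global__ void multiArrays(int nInts, int nFloats) {
    extern __shared__ char smem[];               // one raw dynamic allocation
    int   *s_int   = (int *)smem;                // first nInts ints
    float *s_float = (float *)&s_int[nInts];     // then nFloats floats

    int t = threadIdx.x;
    if (t < nInts)   s_int[t]   = t;             // ... use the two carved-out arrays ...
    if (t < nFloats) s_float[t] = (float)t;
}

// Launch with the combined size as the third launch parameter:
// multiArrays<<<blocks, threads, nInts * sizeof(int) + nFloats * sizeof(float)>>>(nInts, nFloats);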

Shared memory – key observations
• Much faster than global memory; it’s a “controllable” part of the L1 cache
  • Configurable amount on some cards!
• Shared memory is shared by threads in a block => provides a mechanism for threads to cooperate!
  • When necessary, use __syncthreads() for block-level barriers
• Facilitates global memory coalescing in cases where it would not otherwise be possible
• Does not have the sequential access restrictions of global memory to achieve optimal performance
  • Only need to avoid bank conflicts
  • Otherwise accesses get serialized => poor performance, potentially worse than global memory

So far … Key takeaways!
• Must have enough parallelism
  • At least a few thousand threads executing concurrently
  • Keep the cores busy and benefit from high memory bandwidth
• Coalesced memory access
  • Accesses to sequential memory locations by threads in a warp are very fast
  • Not as crucial on newer GPUs / compute capabilities, but still a big performance hit!
• Coherent execution
  • Threads in a warp are automatically synchronized (proceed in lockstep)
  • Careful with warp divergence
• Shared memory
  • Fast, but must avoid bank conflicts
  • Rework your data access patterns when necessary!

Next up …
• Constant memory
• Texture memory
• Atomics

Constant memory
• Used for data which does not change during kernel execution
  • => Constrains usage of that data to be read-only
• Small amount: typically 64KB
• Advantages: handled differently than global memory
  • 1. A single read can be broadcast to a half-warp => saves up to 15 reads, helps reduce the required memory bandwidth by 94%
  • 2. Constant memory is cached => consecutive reads from the same address will not incur any additional memory traffic
• Disadvantage: half-warp threads must read from the same location
• Use the __constant__ modifier to indicate data is stored there (see the sketch below)
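A minimal sketch (names are hypothetical): declare the symbol, fill it from the host with cudaMemcpyToSymbol, and read it in a kernel.

__constant__ float coeffs[16];                       // lives in constant memory (read-only in kernels)

__global__ void scale(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= coeffs[i % 16];                    // many threads read the same small table
}

// Host side, before the launch:
// float h_coeffs[16] = { /* ... */ };
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));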

Constant memory
• Examples…
• Remember: practice!