CSC 367 Parallel Programming
General-purpose computing with Graphics Processing Units (GPUs): Page-locked memory and Streams
With many thanks to NVIDIA for some of the neat CUDA examples!
So far …
• Revisiting PC architecture
• Why GPUs?
• General-purpose GPUs, CUDA framework
• GPU execution model: threads, blocks, grids, warp scheduling
• GPU memory model: types of memories and atomics
• Reductions in shared memory: case study for efficient use of GPU features
• Reductions using shuffle operations
Today …
• Page-locked (pinned) memory
• Batching data transfers
• Streams: overlapping kernels and data transfers
• See also Mark Harris's post: https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
Pinned Host Memory
• So far, allocate memory with cudaMalloc()
• … and life was beautiful
• CUDA runtime offers its own mechanism to allocate host memory
• cudaHostAlloc() or cudaMallocHost()
• Deallocate with cudaFreeHost()

    cudaError_t status = cudaMallocHost((void**)&h_mPinned, bytes);
    if (status != cudaSuccess)
        printf("Error allocating pinned host memory\n");
    ...
    cudaFreeHost(h_mPinned);

• Why not just use malloc()?
• Host memory is subject to paging out to disk by the OS
[Figure: virtual address space (for process X) mapped onto physical memory; pages not currently resident are brought in from disk on access. This is called demand paging.]
Why allocate pinned/locked memory?
• It's now safe to allow an application access to the physical address
• GPU can use DMA to copy data to or from the host (no CPU intervention)
• Pinned memory transfers are typically faster
• Besides, the GPU cannot access data directly from pageable memory; pinned memory is used as a "staging area" (must pin before transfer)
[Figure: pageable data transfer vs. pinned data transfer. A pageable transfer first copies the data into a pinned buffer on the host, then moves it to the device; a pinned transfer moves data directly from pinned host memory to the device.]
• Caveat: can affect other apps (less pageable memory to go around) => don't overuse!
Benchmark performance
• Let's do this … (see the sketch below)
• What did we notice?
• How do pinned memory transfers compare to pageable memory transfers?
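A minimal benchmark sketch along these lines (buffer size and names are illustrative; error checking omitted):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Time one host-to-device copy using CUDA events
    static float time_h2d(void *dst_dev, void *src_host, size_t bytes) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(dst_dev, src_host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main() {
        const size_t bytes = 64 << 20;                 // 64 MiB, arbitrary
        float *h_pageable = (float*)malloc(bytes);     // ordinary pageable memory
        float *h_pinned, *d_buf;
        cudaMallocHost((void**)&h_pinned, bytes);      // page-locked host memory
        cudaMalloc((void**)&d_buf, bytes);

        time_h2d(d_buf, h_pageable, bytes);            // warm-up, ignore result
        printf("pageable H-to-D: %6.2f ms\n", time_h2d(d_buf, h_pageable, bytes));
        printf("pinned   H-to-D: %6.2f ms\n", time_h2d(d_buf, h_pinned, bytes));

        free(h_pageable);
        cudaFreeHost(h_pinned);
        cudaFree(d_buf);
        return 0;
    }

On most machines the pinned copy reports noticeably higher bandwidth, since the runtime can DMA directly instead of staging through a pinned buffer.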
Batching small transfers
• You've seen by now that each transfer has overheads
• Idea: instead of many small transfers, perform fewer larger transfers
• Batch many small transfers into a large transfer (see the sketch below)
• The same goes for memory allocation requests
• One large malloc is more efficient than lots of tiny allocation requests
• For 2-D array transfers, you can also use cudaMemcpy2D()
• See documentation for more info!
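A rough sketch of the idea (array names and sizes are made up): pack several small arrays into one pinned staging buffer and ship them in a single copy.

    #include <cstring>
    #include <cuda_runtime.h>

    int main() {
        const size_t na = 1000, nb = 2000, nc = 3000;  // three small arrays
        static float a[1000], b[2000], c[3000];
        const size_t bytes = (na + nb + nc) * sizeof(float);

        // One pinned staging buffer and one device buffer, instead of three of each
        float *h_batch, *d_batch;
        cudaMallocHost((void**)&h_batch, bytes);
        cudaMalloc((void**)&d_batch, bytes);

        // Pack the small arrays contiguously ...
        memcpy(h_batch,           a, na * sizeof(float));
        memcpy(h_batch + na,      b, nb * sizeof(float));
        memcpy(h_batch + na + nb, c, nc * sizeof(float));

        // ... and ship them in a single transfer; kernels index d_batch
        // at the corresponding offsets (0, na, na + nb)
        cudaMemcpy(d_batch, h_batch, bytes, cudaMemcpyHostToDevice);

        cudaFree(d_batch);
        cudaFreeHost(h_batch);
        return 0;
    }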
CUDA Streams
• A sequence of operations that execute (on the device) in the order they are issued in the host code
• e.g., copy host-to-device, kernel, copy device-to-host
• Operations in different streams can be run concurrently!
• All kernels and data transfers run in a stream
• If no stream is specified explicitly, the default stream is used
• Default stream – slightly different semantics than other streams
Default stream
• Typical execution: copy in, launch kernel, copy out

    cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
    lots_o_computations<<<1,N>>>(d_a);
    cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

• Copies are blocking, kernel is asynchronous!
• We could run independent host code during kernel execution

    cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
    lots_o_computations<<<1,N>>>(d_a);
    some_other_computations(); // runs on CPU in parallel with kernel
    cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

• Which one finishes first? Kernel or some_other_computations?
• Doesn’t matter, device-to-host copy must wait for kernel to finish anyway
Non-default streams
• Declare, create and destroy on host side (check errors as usual though)

    cudaStream_t stream1;
    cudaError_t result;
    result = cudaStreamCreate(&stream1);
    result = cudaStreamDestroy(stream1);

• Data transfers in non-default streams: use cudaMemcpyAsync()
• Similar to cudaMemcpy(), but non-blocking on the host
• cudaMemcpy2DAsync() and cudaMemcpy3DAsync() variants exist (see docs)

    cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);

• Launch a kernel in a non-default stream

    lots_o_computations<<<1,N,0,stream1>>>(d_a);

• All operations (transfers and kernels) are non-blocking!
Streams and synchronization
• Might need to synchronize the host code with operations in a stream!
• Blunt way: cudaDeviceSynchronize()
• Block the host code on this line until all previous device operations complete
• Can be overkill, affects performance
• Other, more subtle ways:
• cudaStreamSynchronize(str): block the host until all ops from 'str' complete
• cudaStreamQuery(str): test whether all ops from 'str' are complete
• Consider exploring other ways too: cudaEventSynchronize(event), cudaEventQuery(event), cudaStreamWaitEvent(stream, event, flags) (see the sketch below)
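As a small illustration of the event-based options (kernel and variable names are made up), one stream can be made to wait on another without blocking the host:

    #include <cuda_runtime.h>

    __global__ void producer(int *buf) { buf[threadIdx.x] = threadIdx.x; }
    __global__ void consumer(int *buf) { buf[threadIdx.x] *= 2; }

    int main() {
        int *d_buf;
        cudaMalloc((void**)&d_buf, 32 * sizeof(int));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaEvent_t ev;
        cudaEventCreate(&ev);

        producer<<<1, 32, 0, s1>>>(d_buf);
        cudaEventRecord(ev, s1);         // mark this point in s1
        cudaStreamWaitEvent(s2, ev, 0);  // s2 work below waits for ev ...
        consumer<<<1, 32, 0, s2>>>(d_buf);

        cudaStreamSynchronize(s2);       // ... and here the host blocks on s2
        cudaEventDestroy(ev);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(d_buf);
        return 0;
    }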
When can we overlap kernel with transfers?
• Device must be able to do "concurrent copy and execution"!
• Recall lab 8? Check the deviceOverlap property of the cudaDeviceProp structure! (see the sketch below)
• Most recent cards have this ability, since it's pretty crucial
• Kernel execution and data transfer must occur in different streams
• Same stream => same order they are issued
• The host memory being transferred must be allocated as pinned
• Otherwise can't issue async memcpy
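A quick way to check the capability (a minimal sketch; values vary by GPU):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0
        // deviceOverlap is the legacy flag; asyncEngineCount additionally
        // reports how many copy engines the device has
        printf("deviceOverlap: %d, asyncEngineCount: %d\n",
               prop.deviceOverlap, prop.asyncEngineCount);
        return 0;
    }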
Example: using multiple streams
• Here are multiple ways to issue the operations from the host (see the sketch after this list)
• Not the order of execution on the GPU though!
• Within the same stream, order is enforced
• Can overlap operations across streams
• Both produce correct results
• What's the difference?

Issue order 1 (loop over streams):
Copy H-to-D (s1), Kernel (s1), Copy D-to-H (s1); Copy H-to-D (s2), Kernel (s2), Copy D-to-H (s2); Copy H-to-D (s3), Kernel (s3), Copy D-to-H (s3); Copy H-to-D (s4), Kernel (s4), Copy D-to-H (s4)

Issue order 2 (batched by operation type):
Copy H-to-D (s1), Copy H-to-D (s2), Copy H-to-D (s3), Copy H-to-D (s4); Kernel (s1), Kernel (s2), Kernel (s3), Kernel (s4); Copy D-to-H (s1), Copy D-to-H (s2), Copy D-to-H (s3), Copy D-to-H (s4)

• Must understand GPU scheduling!
• Copy engines and kernel engines
• Inter-engine: dependencies enforced
• Intra-engine: execute in order of issue
• GPU-dependent! # of copy engines?
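Here is a sketch of the two issue orders in code (kernel name, sizes, and launch configuration are illustrative; error checks omitted):

    #include <cuda_runtime.h>

    __global__ void scale(float *a) {
        a[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    }

    int main() {
        const int nStreams = 4;
        const size_t n = 1 << 20;           // total elements, divisible by nStreams
        const size_t chunk = n / nStreams;

        float *h_a, *d_a;
        cudaMallocHost((void**)&h_a, n * sizeof(float)); // pinned, required for async copies
        cudaMalloc((void**)&d_a, n * sizeof(float));

        cudaStream_t s[nStreams];
        for (int i = 0; i < nStreams; i++) cudaStreamCreate(&s[i]);

        // Issue order 1: each stream's H-to-D, kernel, D-to-H issued together
        for (int i = 0; i < nStreams; i++) {
            size_t off = i * chunk;
            cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[i]);
            scale<<<chunk / 256, 256, 0, s[i]>>>(d_a + off);
            cudaMemcpyAsync(h_a + off, d_a + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[i]);
        }
        cudaDeviceSynchronize();

        // Issue order 2: batch the issue by operation type
        for (int i = 0; i < nStreams; i++)
            cudaMemcpyAsync(d_a + i * chunk, h_a + i * chunk, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[i]);
        for (int i = 0; i < nStreams; i++)
            scale<<<chunk / 256, 256, 0, s[i]>>>(d_a + i * chunk);
        for (int i = 0; i < nStreams; i++)
            cudaMemcpyAsync(h_a + i * chunk, d_a + i * chunk, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[i]);
        cudaDeviceSynchronize();

        for (int i = 0; i < nStreams; i++) cudaStreamDestroy(s[i]);
        cudaFreeHost(h_a);
        cudaFree(d_a);
        return 0;
    }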
GPU work scheduling
• Assume all transfers or kernel executions take the same amount of time
• 1 kernel engine, 1 copy engine
[Timeline figures: with issue order 1, the single copy engine must execute D-to-H (s1) before H-to-D (s2), so every operation waits on the previous one: no better than synchronous. With issue order 2, the four H-to-D copies run back-to-back and each kernel overlaps the next copy: better overlap.]
GPU work scheduling
• Assume all transfers or kernel executions take the same amount of time
• 1 kernel engine, 2 copy engines: one for H-to-D, one for D-to-H
[Timeline figures: with separate engines for each direction, copies and kernels pipeline across the streams and total time drops to half, relative to synchronous; here issue order 2 produces a schedule identical to the other one.]
GPU work scheduling
• Some GPUs can run multiple kernels concurrently if issued back-to-back in different streams => can delay D-to-H's
[Timeline figures: issue order 1 again finishes in half the time relative to synchronous; with issue order 2, the back-to-back kernel launches run concurrently and the D-to-H copies start only after they drain: slower, delayed D-to-Hs.]
Take-aways
• Default stream is nice, but can’t take full advantage of GPU power
• Streams allow overlapping computations and data transfers
• Better performance if we can pipeline operations efficiently
• Must understand your device though
• Depending on the GPU, it might matter how you write your code
• In newer compute capabilities, no need to be as careful about launch order
• When in doubt, run small experiments and analyze performance
• Testing / profiling is crucial!
Future exploration
• CUDA Dynamic Parallelism – as of CUDA 5.0
• Can launch fine-grained kernels within other kernels (see the sketch below)
• Think launching adaptive grids (as coarse-grained or fine-grained as needed), using recursion, etc.
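A minimal sketch of a nested launch (names are made up; dynamic parallelism needs compute capability 3.5+ and something like nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void child(int parent_block) {
        printf("child of block %d, thread %d\n", parent_block, threadIdx.x);
    }

    __global__ void parent() {
        // One thread per block launches a child grid from the device
        if (threadIdx.x == 0)
            child<<<1, 4>>>(blockIdx.x);
    }

    int main() {
        parent<<<2, 32>>>();
        cudaDeviceSynchronize();  // waits for parents and their children
        return 0;
    }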
• Hyper-Q: parallel queues (Kepler/Pascal – 32 parallel queues) – less serialization, better GPU utilization (sharing the GPU between CPU threads => better saturate the GPU)
• More resources for exploration – CUDA 8 and beyond:
• http://on-demand.gputechconf.com/gtc/2016/presentation/s6224-mark-harris.pdf
• Keep in mind: hardware (and software) advances at a very fast pace!
• New features simplify complex tasks
• Knowing what's under the covers is important!
• It is your duty to Explore, Test, Profile, and Analyze!
GPU Wrap-up – more stuff worth exploring
• OpenCL – open standard, a competitor to NVIDIA's CUDA
• Can be used on different GPUs, or other platforms (CPU, FPGA, etc.)
• Slightly different terminology, more involved low-level development
• AMD’s Radeon Open Compute Platform (ROCm) – HIP toolchain:
https://github.com/ROCm-Developer-Tools/HIP
• Google TPUs (Tensor Processing Units)
• For accelerating the inference phase of neural networks
• Deployed in Google's datacenters since 2015
• 2017study:https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf
• OpenACC framework – similar to OpenMP but on several accelerators
Next up…
• Revisiting an older exercise
• Announcements
Spinlock using CAS (revisited)
• Compare-And-Swap (CAS) instruction
• int CAS(int *address, int expected, int new_val);
• Atomically: if *address == expected, write new_val into *address; returns the old value of *address

Global variables:
int owner = 0;

void spin_lock(int caller_tid) {
    // Spin until we atomically change owner from 0 (free) to caller_tid
    while (CAS(&owner, 0, caller_tid) != 0);
}

void spin_unlock() {
    owner = 0;
}

Thread i:                    Thread j:
my_tid = i;                  my_tid = j;
spin_lock(my_tid);           spin_lock(my_tid);
// Critical section          // Critical section
spin_unlock();               spin_unlock();
Announcements
• No lab this week!
• More time to work on A4
• A4 extra office hours: as usual, announced on Piazza
• Next week's lecture:
• (Brief) Future exploration of relevant topics in Parallel Programming and Parallel Computing in general
• Exam review
• Exam prep exercises
• Course evaluations: see on Quercus, under the Course Evals menu
• Important to get feedback!