
COMP Distributed
Introduction

References


– NVIDIA GPU Educators Program – https://developer.nvidia.com/educators
– NVIDIA’s Academic Programs – https://developer.nvidia.com/academia
– The contents of these short-course slides are mainly adapted from the following book and its accompanying teaching materials:
– David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd edition, Morgan Kaufmann, 2013

Review – CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
Serial parts in host C code
Parallel parts in device kernel code
SIMD and multithreading (Single Instruction, Multiple Threads, or SIMT)
Serial Code (host)
Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);

Review – Typical Structure of a CUDA Program
– Kernel function
– __global__ void kernelOne(args…){}
– Host code
– allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
– transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
– kernel call – kernelOne<<<nBlk, nTid>>>(args…);
– transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …) as needed
– optional: compare against golden (host-computed) solution
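A minimal sketch of this typical structure, assuming a simple vector addition; the kernel name vecAddKernel, the host arrays h_A/h_B/h_C, and the device arrays d_A/d_B/d_C are illustrative, not from the slides, and error checking is omitted for brevity:

// Minimal sketch of the typical CUDA program structure (vector addition).
// All names here are illustrative; error checking is omitted for brevity.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) C[i] = A[i] + B[i];                   // guard against extra threads
}

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host allocation and initialization
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Allocate memory space on the device
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Transfer data from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Kernel call with execution configuration <<<nBlk, nTid>>>
    int nTid = 256;
    int nBlk = (n + nTid - 1) / nTid;
    vecAddKernel<<<nBlk, nTid>>>(d_A, d_B, d_C, n);

    // Transfer results from device to host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Optional: compare against a golden (host-computed) solution
    int ok = 1;
    for (int i = 0; i < n; ++i) if (h_C[i] != 3.0f) { ok = 0; break; }
    printf("%s\n", ok ? "Test PASSED" : "Test FAILED");

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}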

Review – Thread Grid, Blocks and Warps
– A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program Multiple Data)
– Thread array is divided into multiple blocks, which are distributed to different SMs (typically 8 – 12 blocks/SM)
– Multiple dimensional blocks/grid and multiple dimensional threads/block
– Threads in different blocks do not interact
– Thread indices are used to compute memory addresses and make control decisions (i = blockIdx.x * blockDim.x + threadIdx.x)
– Each block is executed as 32-thread warps – warps are the scheduling units in an SM
– Threads in a warp execute in SIMD
– Grid size is application dependent
– Block size is machine dependent
– Control divergence: threads in a warp take different control-flow paths, which are then serialized
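A short sketch of the global index computation and of the boundary check that causes (benign) control divergence only in the warp straddling the end of the array; scaleKernel and its arguments are illustrative names, not from the slides:

// Sketch: global index computation and a boundary check (illustrative kernel).
__global__ void scaleKernel(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
    if (i < n) {                                    // only the last warp diverges here
        data[i] = alpha * data[i];
    }
}

// Host-side launch: block size is machine dependent, grid size is application dependent.
// int blockSize = 256;                              // a multiple of the 32-thread warp size
// int gridSize  = (n + blockSize - 1) / blockSize;  // round up to cover all n elements
// scaleKernel<<<gridSize, blockSize>>>(d_data, 2.0f, n);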

Review – CUDA Memories
– Registers (per-thread)
– Shared memory (per-block)
– Global memory (all threads)
– Memory coalescing: locality across threads for one instruction
– Shared memory is ~100x faster than global memory: data loaded into shared memory once can be reused many times (see the stencil sketch below)
[Figure: CUDA device memory hierarchy – per-thread registers, per-block shared memory, and global/constant memory accessible to threads in all blocks]
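A sketch of shared-memory reuse with coalesced loads, assuming a simple 1D averaging stencil; stencil1D, BLOCK_SIZE, and RADIUS are illustrative names, and the kernel assumes it is launched with BLOCK_SIZE threads per block:

#define RADIUS 1
#define BLOCK_SIZE 256

// Each global-memory element is loaded once (adjacent threads read adjacent
// addresses, so the loads coalesce) and then reused by up to 2*RADIUS+1 threads.
__global__ void stencil1D(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lIdx = threadIdx.x + RADIUS;                   // position inside the shared tile

    tile[lIdx] = (gIdx < n) ? in[gIdx] : 0.0f;         // coalesced load of the main tile
    if (threadIdx.x < RADIUS) {                        // a few threads also load the halos
        int left  = gIdx - RADIUS;
        int right = gIdx + blockDim.x;
        tile[lIdx - RADIUS]     = (left  >= 0) ? in[left]  : 0.0f;
        tile[lIdx + blockDim.x] = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();                                   // wait until the whole tile is loaded

    if (gIdx < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lIdx + k];                     // reuse data already in shared memory
        out[gIdx] = sum / (2 * RADIUS + 1);
    }
}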

Review – Synchronization
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Barrier Synchronization – __syncthreads()
– Avoid data race with atomic operations
– e.g., int atomicAdd(int* address, int val);
– Privatization: each block accumulates into a private copy in shared memory with atomics, then the per-block copies are merged into global memory with atomics (see the histogram sketch below)
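A sketch of privatization with barrier synchronization and atomics, assuming a simple 256-bin byte histogram; histogramKernel and NUM_BINS are illustrative names, not from the slides:

#define NUM_BINS 256

__global__ void histogramKernel(const unsigned char *data, int n,
                                unsigned int *globalBins) {
    __shared__ unsigned int localBins[NUM_BINS];          // per-block private histogram

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localBins[b] = 0;                                 // cooperatively clear the bins
    __syncthreads();                                      // barrier: bins ready before use

    // Atomic updates go to fast shared memory; contention stays within the block
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localBins[data[i]], 1u);
    __syncthreads();                                      // barrier: all updates are visible

    // One atomic per bin per block merges the private copy into global memory
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&globalBins[b], localBins[b]);
}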

Not Covered
– Asynchronous memory copy between host and device
– Multiple Streams
– Multiple devices
– Distributed-memory cluster with multiple GPUs

Parallel Algorithm Design and Analysis
– The most difficult task
– Unfortunately, there are no simple recipes
– Requires the sort of integrative thought commonly referred to as “creativity”
– Needs experience
– In general, we need to consider
– Fine grained parallelism
– Memory coalescing
– Effective use of shared memory
– Control divergence
– Synchronization overhead

Developer Tools – Debuggers
NVIDIA provided: Nsight, CUDA-GDB, CUDA-MEMCHECK
https://developer.nvidia.com/debugging-solutions

Developer Tools – Profilers
NVIDIA provided: NVVP, nvprof
Third party: VampirTrace
https://developer.nvidia.com/performance-analysis-tools

Ways to Accelerate Applications
Three ways to accelerate applications:
– Libraries: easy to use, most performance
– Compiler directives: easy to use, portable code
– Programming languages: most performance, most flexibility

GPU Accelerated Libraries
– Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
– Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND
– Data Struct. & AI (Sort, Scan, Zero Sum): GPU AI – Board Games, GPU AI – Path Finding
– Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

Compiler Directives: Easy, Portable Acceleration
Ease of use: Compiler takes care of details of parallelism management and data movement
Portable: The code is generic, not specific to any type of hardware and can be deployed into multiple languages
Uncertain: Performance of code can vary across compiler versions

– Compiler directives for C, C++, and Fortran, e.g. OpenACC:
#pragma acc parallel loop copyin(input1[0:inputLength], input2[0:inputLength]) \
                          copyout(output[0:inputLength])
for (i = 0; i < inputLength; ++i) {
    output[i] = input1[i] + input2[i];
}

Programming Languages: Most Performance and Flexible Acceleration
Performance: Programmer has the best control of parallelism and data movement
Flexible: The computation does not need to fit into a limited set of library patterns or directive types
Verbose: The programmer often needs to express more details
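For comparison with the directive version above, a sketch of the same vector addition written directly in CUDA C, where the programmer spells out the execution configuration, index arithmetic, and data movement that the compiler directive handles automatically; vecAdd and the d_* pointer names are illustrative:

// Sketch: the same loop as a CUDA kernel; the programmer manages the details.
__global__ void vecAdd(const float *input1, const float *input2,
                       float *output, int inputLength) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < inputLength)
        output[i] = input1[i] + input2[i];
}

// Host side (device buffers assumed already set up with cudaMalloc/cudaMemcpy):
// int threads = 256;
// int blocks  = (inputLength + threads - 1) / threads;
// vecAdd<<<blocks, threads>>>(d_input1, d_input2, d_output, inputLength);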