
COMP528 HAL30: CUDA memory model, thread block matters

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

30 – HAL

Steps to Porting Serial to CUDA

1. Determine work that has
   inherent parallelism

2. Move (serial) work to a "kernel"

3. (a) Allocate kernel vars
   (b) Initialise kernel vars

       typically by copying data from h->d using cudaMemcpy()

   (c) Add call to parallel CUDA kernel using <<< blocks, threadsPerBlock >>>

       kernel runs on device asynchronously

   (d) Copy results from d->h using cudaMemcpy()

Based upon
“Steps to CUDA”

© High End Compute Ltd

__global__ void cuda_kernel(float *x, float *y, float *z) {

   // parallel control via varying index

   int my_i = threadIdx.x + blockIdx.x*blockDim.x;

   z[my_i] = x[my_i] + y[my_i];

   // note there is NO 'for' loop over the index

}

Example: Z = X + Y

[Diagram: host variables x, y, z are transferred across PCI Express to device variables x_dev, y_dev, z_dev;
 on the device, each thread computes one element:
 Z_dev[0] = X_dev[0] + Y_dev[0], Z_dev[1] = X_dev[1] + Y_dev[1], ..., Z_dev[n-1] = X_dev[n-1] + Y_dev[n-1]]

int main(int argc, char *argv[]) {

float *z, *x, *y;

int num, i;

// memory for device (GPU)

float *x_dev, *y_dev, *z_dev;

// set up arrays

num = (int) atoi(argv[1]);

printf("Forming z=x+y for dimension %d\n", num);

x = (float *) malloc(num*sizeof(*x));

y = (float *) malloc(num*sizeof(*y));

z = (float *) malloc(num*sizeof(*z));

// set up device memory

cudaMalloc(&x_dev, num*sizeof(float));

cudaMalloc(&y_dev, num*sizeof(float));

cudaMalloc(&z_dev, num*sizeof(float));

// init vars

for (i=0; i<num; i++) {
   x[i] = (float) i;                 // illustrative values; the slide's initialisation was lost in extraction
   y[i] = (float) (2*i);
}

// copy inputs from host to device
cudaMemcpy(x_dev, x, num*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(y_dev, y, num*sizeof(float), cudaMemcpyHostToDevice);

// launch kernel (example launch configuration: 256 threads per block; the slide's values were lost in extraction)
cuda_kernel <<< (num+255)/256, 256 >>> (x_dev, y_dev, z_dev, num);

// copy results back

cudaMemcpy(z, z_dev, num*sizeof(float), cudaMemcpyDeviceToHost);

// use results in 'z' on local host

}


__global__ void cuda_kernel(float *x, float *y, float *z, int num) {

   // parallel control via varying index

   int my_i = threadIdx.x + blockIdx.x*blockDim.x;

   // handle my_i exceeds num

   if (my_i < num) {
      z[my_i] = x[my_i] + y[my_i];
   }
}

Timing CUDA

• makes use of "CUDA event API"
• events make use of "CUDA streams"

• CUDA streams
   • items in a given stream are executed in that order
   • e.g. start timer, run kernel, stop timer
   • possible (with hw support) to run 2 (or more) streams per GPU
      • further asynchronicity
   • need to be aware of where synchronisations are
      • so that we time what we want to time…

CUDA Event: timing example

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);                    // set-up

cudaEventRecord(start,0);                  // timestamp to &start (stream 0)
Kernel_thing_to_time <<< blocks, threadsPerBlock >>> ();
cudaEventRecord(stop,0);                   // timestamp to &stop
cudaEventSynchronize(stop);                // wait for 'stop' to be recorded before reading it

float eTime;
cudaEventElapsedTime(&eTime, start, stop); // get elapsed time in milliseconds

LESSON FROM 2018/2019 LABS

– it was pointed out by previous students (with thanks) that in some cases the
  initial CUDA call (whatever it was) was taking about 5 seconds

– investigation with the Chadwick sys admin determined the need to set
  "persistence" on the cards, see
  https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode

– having applied this, the 5 seconds per initial
call disappeared

– if you are doing timings another day on
another GPU card you may wish to check this
setting
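For reference, persistence mode is normally enabled from the command line with the NVIDIA driver tools (requires root privileges; see the NVIDIA link above for details), e.g.

   nvidia-smi -pm 1        # enable persistence mode (nvidia-smi -pm 0 disables it)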


Initialisation Matters

• using this example, we are going to examine some options for initialisation
• Since cost of transfer can be high, you need to consider options

• Option 1
• initialise x[] and y[] on CPU then transfer to GPU
• GPU kernel “vecAddKernel”: forms z[i] = x[i] + y[i]
• copy z[] from GPU to CPU

• Option 2
• GPU kernel: initialises x & y, then forms z=x+y
• copy z[] from GPU to CPU

• Which is faster depends on the CPU, the GPU and the PCI-e (and how the transfer is undertaken)

• i.e. compare cost(CPU init) + cost(transfer-to-dev) versus cost(GPU init)

• By use of “CUDA streams” we
can have overlapping events

• Such as async memCpy:
• cudaMemcpyAsync(*dest, *src, countBytes, direction, stream)
• CPU: init(x), then async copy whilst init(y)
   ==> maybe get the cost of copying x for free

   • dependent on cost of init(y) wrt cost of copying x in the background

Asynchronous…

• what about async copy?
• CPU: init(x), then async copy whilst init(y)
• Effectiveness will depend on GPU e.g.

• pinning host memory
• # of “engines” (doing copying) for hardware

• what is “pinning host memory”?
• non-page-able memory
• ==> device knows where it lives ==> CUDA can optimise ==> faster too
• instead of malloc-ing x on the host, we now use cudaHostAlloc for x (as sketched below)
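A minimal sketch of that pattern (pinned allocation for x, asynchronous copy of x overlapped with initialising y on the CPU); the stream handling, sizes and the init() routine are illustrative assumptions rather than the lab code:

   float *x, *y;                         // host pointers
   float *x_dev, *y_dev;                 // device pointers
   cudaStream_t s;
   cudaStreamCreate(&s);

   cudaHostAlloc(&x, num*sizeof(float), cudaHostAllocDefault);   // pinned (non-page-able) host memory
   y = (float *) malloc(num*sizeof(float));                      // ordinary page-able host memory
   cudaMalloc(&x_dev, num*sizeof(float));
   cudaMalloc(&y_dev, num*sizeof(float));

   init(x, num);                                                 // (hypothetical) initialise x on the CPU
   cudaMemcpyAsync(x_dev, x, num*sizeof(float),
                   cudaMemcpyHostToDevice, s);                   // copy x in the background...
   init(y, num);                                                 // ...whilst the CPU initialises y
   cudaMemcpy(y_dev, y, num*sizeof(float), cudaMemcpyHostToDevice);

   cudaStreamSynchronize(s);                                     // ensure x has arrived before any kernel uses it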

Optimising Use of GPU Threads
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation

Performance: warps
• threads scheduled in bunches: “warps” of 32 threads
• threads within a warp, run in lockstep

• same instruction at the same time

• IMPORTANT to avoid divergence within a warp

   if (t%2 == 0)
      x[i] = coloring(t);
   else
      x[i] = coloring(t-1);

• generally, can lose the 32x speed-up
• NB cost of branches may not necessarily be equal
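For this particular example the divergence can be removed altogether by folding the branch into the argument, so every thread in the warp executes the same instruction stream (coloring() being the hypothetical function from the snippet above):

   // t - (t & 1) rounds t down to the nearest even number, so even t gives
   // coloring(t) and odd t gives coloring(t-1), with no branch at all
   x[i] = coloring(t - (t & 1));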

Threads, Thread Blocks and GRIDS
• So far…

• <<< blocks, threadsPerBlock >>>
• threadIdx.x, blockIdx.x, blockDim.x

• But “grids” are supported:
• dim3 grid(xDim, yDim, zDim)
  dim3 threadsPerBlock(xNum, yNum, zNum)
• <<< grid, threadsPerBlock >>>
   ==> launches xDim*yDim*zDim*xNum*yNum*zNum threads
   (1000s! but lightweight and the GPU is a thread-handling machine)

• threadIdx.x, threadIdx.y, threadIdx.z
• blockIdx.[xyz] & blockDim.[xyz]
• gridDim.[xyz] – no. of blocks in given direction

How many threads? Blocks?
• Strangely, not one single “correct” answer
• Often useful to compare performance of

threads=128; <<< num/threads, threads >>>   // fix the threads per block
blocks=4;    <<< blocks, num/blocks >>>     // fix the number of blocks

• some limits (from "deviceQuery" or the cudaGetDeviceProperties() runtime call – see the sketch below)

• key may be degree of
synchronisation &
shared memory…

• usually some
mapping to
physical problem
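A minimal sketch of querying those limits at runtime; the fields are members of the CUDA runtime's cudaDeviceProp structure, and the printout format is only illustrative:

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);              // properties of device 0
   printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
   printf("max block dims: %d x %d x %d\n",
          prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
   printf("number of SMs: %d\n", prop.multiProcessorCount);
   printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);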

Threads

• Using grid
• to represent underlying maths & physics (etc)
• to get more threads in total

• Synchronisation is important
• thread block: access to shared memory
• warp: thread-scheduler works on a warp (=32 threads)

• launches 32 threads, all working in lock step;
• context switches are of a warp

• Good walk through: https://www3.nd.edu/~zxu2/acms60212-40212-S12/Lec-12-02.pdf

2D grid example: mat-mat addition

__global__ void myAdd(float *B, float *C, float *A) {
   // form A = B + C
   int row = blockIdx.x*blockDim.x + threadIdx.x;
   int col = blockIdx.y*blockDim.y + threadIdx.y;
   int idx;
   // note that element (row,col) of an N*N array can be represented as item col + row*N in the 1D equivalent
   if (row < N && col < N) {           // N assumed to be a compile-time constant / global dimension
      idx = col + row*N;
      A[idx] = B[idx] + C[idx];
   }
}

In-kernel dynamic memory allocation

• cc >= 2.0 allows in-kernel dynamic memory allocation via malloc

• HOWEVER, care since each thread has now allocated memory…
• … and may not need everything it is allocating
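Returning to the 2D myAdd kernel above, a host-side launch with dim3 might look like the following sketch (the 16x16 block size, N and the *_dev device pointers are illustrative assumptions, not the slide's code):

   dim3 threadsPerBlock(16, 16);
   dim3 grid((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
             (N + threadsPerBlock.y - 1) / threadsPerBlock.y);   // enough blocks to cover N x N
   myAdd <<< grid, threadsPerBlock >>> (B_dev, C_dev, A_dev);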

Shared Memory

• Lower latency than global
• Allows threads within a given thread block to share data

   ==> comms between threads

• Synchronisation at thread block level
   • hence why we have <<< blocks, threadsPerBlock >>>
   • i.e. we can choose how many threads we want to share data / synchronise
   • whilst bearing in mind, threads are dispatched in a “warp” of 32 threads
   • so for performance can choose 32 or 64 or… threads in a block


How to use Thread Block shared memory

• Within kernel, define
__shared__ type val;

• To synchronise the threads running within a given thread block
__syncthreads()

• Pre CUDA-9:
• You cannot synchronise between thread blocks (other than kernel ending)

• CUDA 9 onwards
• Use of “co-operative groups”

• Keeping threads busy is a good thing
• Avoiding divergence within a warp is a good thing

• Shared memory can also have “bank conflicts”
• many threads accessing addresses in same memory bank

==> conflict
==> requests are serialised (very bad news)

Summation Example

• DEMO:
~mkbane/PC_DEMOS/acceleratorReduction/summation.cu

• use of __shared__
• use of __syncthreads()
• (explain how it works – next slides)
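The demo file itself is not reproduced here, but a minimal sketch of the pattern it illustrates (a per-block tree reduction in shared memory; the fixed block size of 256 and the partial[] output array are assumptions):

   __global__ void blockSum(float *in, float *partial, int n) {
      __shared__ float s[256];                     // one entry per thread in the block

      int tid = threadIdx.x;
      int i   = blockIdx.x*blockDim.x + threadIdx.x;

      s[tid] = (i < n) ? in[i] : 0.0f;             // load this thread's element into shared memory
      __syncthreads();                             // all loads must complete before reducing

      // assumes blockDim.x is a power of two and <= 256
      for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
         if (tid < stride) s[tid] += s[tid + stride];
         __syncthreads();                          // each level must finish before the next starts
      }

      if (tid == 0) partial[blockIdx.x] = s[0];    // one partial sum per thread block
   }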

Multiple GPUs?

• WHY?
• further parallelisation
• scales up the memory available

• YES
• for CUDA4+, cc 2.0+
• check device info

• cudaSetDevice()
sets the current GPU

Multiple GPUs?

• so mix of CUDA calls and changing
the target device via (at appropriate times!)
cudaSetDevice()
==> multiple GPUs

• BUT data transfer too?
• UVA: unified virtual addressing
• GPU-to-GPU data transfers
• GPU-0 accessing GPU-1 data directly
• check support with cudaDeviceCanAccessPeer(), then call cudaDeviceEnablePeerAccess() to enable (sketch below)
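A minimal sketch of that peer-access set-up for two GPUs (device numbers and error handling are illustrative rather than lab code):

   int canAccess = 0;
   cudaDeviceCanAccessPeer(&canAccess, 0, 1);      // can device 0 access device 1's memory?
   if (canAccess) {
      cudaSetDevice(0);                            // make device 0 current
      cudaDeviceEnablePeerAccess(1, 0);            // enable access from device 0 to device 1 (flags must be 0)
   }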

>1 Kernel

• Yes (as per lab!)

• But can you have
>1 running kernel at once?

• depends on hardware (supported on modern GPUs with cc>=2.0)
• depends on SM available (if 1 kernel uses all SMs then…)
• use “Streams”
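A minimal sketch of using two streams so that two kernels may run concurrently, hardware and SM resources permitting (kernelA/kernelB, the launch sizes and the device pointers are hypothetical):

   cudaStream_t s1, s2;
   cudaStreamCreate(&s1);
   cudaStreamCreate(&s2);

   // 3rd launch parameter = dynamic shared memory bytes (0 here), 4th = stream
   kernelA <<< blocks, threads, 0, s1 >>> (a_dev);
   kernelB <<< blocks, threads, 0, s2 >>> (b_dev);

   cudaDeviceSynchronize();                        // wait for both streams to finish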

Kernels can call Kernels!

• __global__ qualifier: CUDA kernel called from the host
• __device__ qualifier: CUDA function called from the GPU (i.e. from a __global__ or
  another __device__ qualified function)
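A minimal sketch of the two qualifiers together (the function names are illustrative):

   __device__ float scale(float v) {               // callable only from device code
      return 2.0f * v;
   }

   __global__ void applyScale(float *x, int n) {   // callable from the host
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) x[i] = scale(x[i]);
   }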

Profiling

• NVIDIA recommends…
• IDE: nsight
   • but cannot run on a chadwick GPU-enabled compute node (no X11 support)

• nvprof
• either output to screen
• or to file, then use nvprof from login node (X11 supported)
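For example, assuming an executable called ./vecadd (the name is illustrative):

   nvprof ./vecadd                                 # summary of kernel and memcpy times to the screen
   nvprof --export-profile vecadd.prof ./vecadd    # save the profile to a file to examine later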

demo

Optimisation

• shared memory
• no warp divergence
• play with #threads, #blocks
• potentially async / multiple kernels

• “occupancy”
• ratio: #active warps in SM/max #warps in SM

https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

Other things to know about

• Dynamic parallelism (CUDA capability (cc) 3.5+)
• eg problem changes wrt time ==> need more threads / diff work pattern
• launching kernels from a kernel

• Tensor cores
• machine learning
• cc 7.0 onwards


CUDA

• Only had time for introductory-intermediate topics
• future reading on key topics:
• Sanders & Kandrot:
• Chapter 3: calling a CUDA kernel
• Section 4.2.1: data transfers
• Sect 5.3: Shared Memory & synchronisation
• Chapter 6 (constant memory: const) & Chapter 7 (texture mem: locality)
• Sect 6.3: Events / timing
• Chapter 10: streams (asynchronisation)

• NVIDIA’s CUDA web/resources
• https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/  << memory of GPU


Questions via MS Teams / email

Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane