COMP528 HAL30: CUDA memory model, thread block matters
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
30 – HAL
Steps to Porting Serial to CUDA
1. Determine work that has inherent parallelism
2. Move (serial) work to a "kernel"
3. (a) Allocate device ("kernel") variables
   (b) Initialise device variables
       typically by copying data from host to device using cudaMemcpy()
   (c) Add a call to the parallel CUDA kernel using <<<blocks, threads>>>
       the kernel runs on the device asynchronously
   (d) Copy results from device to host using cudaMemcpy()
Based upon
“Steps to CUDA”
© High End Compute Ltd
__global__ void cuda_kernel(float *x, float *y, float *z) {
   // parallel control via varying index
   int my_i = threadIdx.x + blockIdx.x*blockDim.x;
   z[my_i] = x[my_i] + y[my_i];
   // note there is NO 'for' loop over the index
}
[Diagram: host variables x, y, z and device variables x_dev, y_dev, z_dev, separated by the PCI Express bus. Each GPU thread computes one element:
   z_dev[0] = x_dev[0] + y_dev[0]
   z_dev[1] = x_dev[1] + y_dev[1]
   ...
   z_dev[n-1] = x_dev[n-1] + y_dev[n-1] ]
Example: Z = X + Y
#include <stdio.h>
#include <stdlib.h>

// forward declaration of the kernel (defined below)
__global__ void cuda_kernel(float *x, float *y, float *z, int num);

int main(int argc, char *argv[]) {
   float *z, *x, *y;
   int num, i;
   // memory for device (GPU)
   float *x_dev, *y_dev, *z_dev;
   // set up arrays
   num = (int) atoi(argv[1]);
   printf("Forming z=x+y for dimension %d\n", num);
   x = (float *) malloc(num*sizeof(*x));
   y = (float *) malloc(num*sizeof(*y));
   z = (float *) malloc(num*sizeof(*z));
   // set up device memory
   cudaMalloc((void **) &x_dev, num*sizeof(float));
   cudaMalloc((void **) &y_dev, num*sizeof(float));
   cudaMalloc((void **) &z_dev, num*sizeof(float));
   // init vars (example values)
   for (i=0; i<num; i++) {
      x[i] = (float) i;
      y[i] = 2.0f * (float) i;
   }
   // copy inputs from host to device
   cudaMemcpy(x_dev, x, num*sizeof(float), cudaMemcpyHostToDevice);
   cudaMemcpy(y_dev, y, num*sizeof(float), cudaMemcpyHostToDevice);
   // launch enough blocks of 'threadsPerBlock' threads to cover 'num' elements
   int threadsPerBlock = 128;
   int blocks = (num + threadsPerBlock - 1) / threadsPerBlock;
   cuda_kernel<<<blocks, threadsPerBlock>>>(x_dev, y_dev, z_dev, num);
   // copy results back
   cudaMemcpy(z, z_dev, num*sizeof(float), cudaMemcpyDeviceToHost);
   // use results in 'z' on local host
   return 0;
}

__global__ void cuda_kernel(float *x, float *y, float *z, int num) {
   // parallel control via varying index
   int my_i = threadIdx.x + blockIdx.x*blockDim.x;
   // handle my_i exceeding num
   if (my_i < num) {
      z[my_i] = x[my_i] + y[my_i];
   }
}
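Assuming the file is saved as, say, vecadd.cu (an illustrative name), it would typically be compiled with the NVIDIA compiler and run as, e.g., nvcc vecadd.cu -o vecadd followed by ./vecadd 1000000.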
COMP528 (c) mkbane, Univ of Liverpool (2018, 2019)
Timing CUDA
• makes use of “CUDA event API”
• events make use of “CUDA streams”
• CUDA streams
• items in a given stream are executed in that order
• eg start timer, run kernel, stop timer
• possible (with hw support) to run 2 (or more) streams per GPU
• further asynchronicity
• need to be aware of where synchronisations are
• so that we time what we want to time…
CUDA Event: timing example
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop); // set-up
cudaEventRecord(start,0); // timestamp to 'start' (stream 0)
Kernel_thing_to_time <<<blocks, threads>>> (…); // the work being timed
cudaEventRecord(stop,0); // timestamp to 'stop'
cudaEventSynchronize(stop); // wait until 'stop' has actually been recorded
float eTime;
cudaEventElapsedTime(&eTime, start, stop); // get time in milliseconds
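• when the events are no longer needed they can be released with cudaEventDestroy(start); and cudaEventDestroy(stop);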
LESSON FROM 2018/2019 LABS
• it was pointed out by previous students (with thanks) that in some cases the initial CUDA call (whatever it was) was taking about 5 seconds
• investigation with the Chadwick sys admin determined the need to set "persistence" on the cards, see
  https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode
• having applied this, the 5 seconds per initial call disappeared
• if you are doing timings another day on another GPU card you may wish to check this setting
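• for reference (an aside based on the NVIDIA documentation linked above, not covered in the lab itself): on a system where you have admin rights, persistence mode can usually be switched on with nvidia-smi -pm 1, run as root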
Initialisation Matters
• using this example, we are going to examine some options for initialisation
• Since the cost of transfer can be high, you need to consider the options
• Option 1
  • initialise x[] and y[] on the CPU, then transfer to the GPU
  • GPU kernel "vecAddKernel": forms z[i] = x[i] + y[i]
  • copy z[] from GPU to CPU
• Option 2
  • GPU kernel: initialises x & y, then forms z = x + y
  • copy z[] from GPU to CPU
• Which is faster depends on the CPU, the GPU and the PCI-e (and how the transfer is undertaken)
  • compare cost(CPU init) + cost(transfer to device) vs cost(GPU init)
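As an illustration of Option 2, a minimal sketch (initAndAddKernel is a placeholder name, not the lecture's actual demo code):

__global__ void initAndAddKernel(float *x, float *y, float *z, int num) {
   // each thread initialises its own x[i] and y[i] on the device and forms z[i],
   // so x and y never cross the PCI-e bus; only z[] is copied back to the host
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   if (i < num) {
      x[i] = (float) i;          // example initialisation values
      y[i] = 2.0f * (float) i;
      z[i] = x[i] + y[i];
   }
}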
• By use of "CUDA streams" we can have overlapping events
• Such as an async memcpy:
  • cudaMemcpyAsync(dest, src, countBytes, direction, stream)
• CPU: init(x), then async copy whilst init(y)
  ==> maybe get the cost of copying x for free
  • dependent on the cost of init(y) relative to the cost of copying x in the background
Asynchronous…
• what about async copy?
• CPU: init(x), then async copy whilst init(y)
• Effectiveness will depend on GPU e.g.
• pinning host memory
• # of copy "engines" the hardware provides
• what is "pinning host memory"?
  • non-page-able memory
  • ==> the device knows where it lives ==> CUDA can optimise ==> faster too
  • instead of malloc-ing x on the host, we now use cudaHostAlloc() for x (a sketch follows)
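A minimal sketch of this overlap, assuming a helper init() that fills a host array (a placeholder, not lecture code) and omitting error checking:

float *x, *y;                 // host buffers
float *x_dev, *y_dev;         // device buffers
cudaStream_t s;
cudaStreamCreate(&s);
// pinned (page-locked) host allocation for x, so the async copy can run in the background
cudaHostAlloc((void **) &x, num*sizeof(float), cudaHostAllocDefault);
y = (float *) malloc(num*sizeof(float));
cudaMalloc((void **) &x_dev, num*sizeof(float));
cudaMalloc((void **) &y_dev, num*sizeof(float));
init(x, num);                                            // initialise x on the CPU
cudaMemcpyAsync(x_dev, x, num*sizeof(float),
                cudaMemcpyHostToDevice, s);              // copy x in the background...
init(y, num);                                            // ...whilst initialising y
cudaMemcpy(y_dev, y, num*sizeof(float), cudaMemcpyHostToDevice);
cudaStreamSynchronize(s);                                // ensure the copy of x has completed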
Optimising Use of GPU Threads
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation
Performance: warps
• threads scheduled in bunches: “warps” of 32 threads
• threads within a warp, run in lockstep
• same instruction at the same time
• IMPORTANT to avoid divergence within a warp
if (t%2 == 0)
   x[i] = coloring(t);     // even threads take this branch
else
   x[i] = coloring(t-1);   // odd threads take the other ==> the warp diverges
• generally, can lose the 32x speed-up
• NB cost of branches may not necessarily be equal
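For this particular pattern the branch can be removed entirely, e.g. (a sketch, keeping the slide's illustrative coloring() function):

// even t uses t, odd t uses t-1, i.e. t rounded down to an even value;
// every thread in the warp now executes the same instruction stream
x[i] = coloring(t - (t % 2));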
Threads, Thread Blocks and GRIDS
• So far…
  • kernel<<<blocks, threadsPerBlock>>>(…)
  • threadIdx.x, blockIdx.x, blockDim.x
• But "grids" are supported:
  • dim3 grid(xDim, yDim, zDim)
    dim3 threadsPerBlock(xNum, yNum, zNum)
  • kernel<<<grid, threadsPerBlock>>>(…)
    launches xDim*yDim*zDim*xNum*yNum*zNum threads
    (1000s! but lightweight and the GPU is a thread-handling machine)
  • threadIdx.x, threadIdx.y, threadIdx.z
  • blockIdx.[xyz] & blockDim.[xyz]
  • gridDim.[xyz] – no. of blocks in the given direction
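Inside a kernel launched this way, each thread can recover its global position in each direction; a generic sketch (not specific lecture code):

// global coordinates of this thread across the whole grid
int gx = blockIdx.x*blockDim.x + threadIdx.x;
int gy = blockIdx.y*blockDim.y + threadIdx.y;
int gz = blockIdx.z*blockDim.z + threadIdx.z;
// total threads launched in the x and y directions (all blocks)
int nx = gridDim.x*blockDim.x;
int ny = gridDim.y*blockDim.y;
// flatten (gx, gy, gz) to a single linear index if the data is stored 1-D
int global_id = gx + gy*nx + gz*nx*ny;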
How many threads? Blocks?
• Strangely, there is not one single "correct" answer
• Often useful to compare performance of, e.g.
  threads=128; blocks=(num+threads-1)/threads; kernel<<<blocks, threads>>>(…);
  against
  blocks=4; threads=(num+blocks-1)/blocks; kernel<<<blocks, threads>>>(…);
• some limits apply (from "deviceQuery" or the cudaGetDeviceProperties() runtime call; see the sketch below)
• key may be the degree of synchronisation & shared memory…
• usually some mapping to the physical problem
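A minimal sketch of querying those limits at runtime (field names are from the CUDA runtime's cudaDeviceProp structure; error checking omitted):

cudaDeviceProp prop;
int dev = 0;
cudaGetDevice(&dev);                  // which GPU is currently selected?
cudaGetDeviceProperties(&prop, dev);  // fill 'prop' with its properties
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("number of SMs:         %d\n", prop.multiProcessorCount);
printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);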
Threads
• Using grid
• to represent underlying maths & physics (etc)
• to get more threads in total
• Synchronisation is important
• thread block: access to shared memory
• warp: thread-scheduler works on a warp (=32 threads)
• launches 32 threads, all working in lock step;
• context switches are of a warp
• Good walk through: https://www3.nd.edu/~zxu2/acms60212-40212-S12/Lec-12-02.pdf
2D grid example: mat-mat addition
__global__ void myAdd(float *B, float *C, float *A, int N) {  // N (matrix dimension) added as an argument here
   // form A = B + C
   int row = blockIdx.x*blockDim.x + threadIdx.x;
   int col = blockIdx.y*blockDim.y + threadIdx.y;
   int idx;
   // note that element (row,col) for an N*N array can be represented as item col+row*N in the 1D equivalent
   if (row < N && col < N) {
      idx = col + row*N;
      A[idx] = B[idx] + C[idx];
   }
}
• HOWEVER, care since each thread has now allocated memory…
• … and may not need everything it is allocating
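A possible host-side launch for this kernel (a sketch assuming the extra N argument above, square 16x16 thread blocks and device arrays A_dev, B_dev, C_dev):

dim3 threadsPerBlock(16, 16);
dim3 grid((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
          (N + threadsPerBlock.y - 1) / threadsPerBlock.y);  // round up so all N*N elements are covered
myAdd<<<grid, threadsPerBlock>>>(B_dev, C_dev, A_dev, N);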
Shared Memory
• Lower latency than global
• Allows threads within a given thread block to share
==> comms between threads
• Synchronisation at thread block level
  • hence why we choose the number of threads per block in <<<blocks, threadsPerBlock>>>
  • i.e. we can choose how many threads we want to share data / synchronise
  • whilst bearing in mind, threads are dispatched in a "warp" of 32 threads
  • so for performance can choose 32 or 64 or… threads in a block
How to use Thread Block shared memory
• Within kernel, define
__shared__ type val;
• To synchronise the threads running within a given thread block
__syncthreads()
• Pre CUDA-9:
• You cannot synchronise between thread blocks (other than kernel ending)
• CUDA 9 onwards
• Use of “co-operative groups”
• Keeping threads busy is a good thing
• Avoiding divergence within a warp is a good thing
• Shared memory can also have “bank conflicts”
• many threads accessing addresses in same memory bank
==> conflict
==> requests are serialised (very bad news)
Summation Example
• DEMO:
~mkbane/PC_DEMOS/acceleratorReduction/summation.cu
• use of __shared__
• use of __syncthreads()
• (explain how it works – next slides)
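The demo itself is in the file above; as a rough sketch of the technique (not the demo code), a block-level summation using __shared__ and __syncthreads() could look like:

__global__ void blockSum(float *in, float *blockTotals, int num) {
   __shared__ float partial[128];             // one slot per thread; assumes blockDim.x == 128
   int tid = threadIdx.x;
   int i   = blockIdx.x*blockDim.x + threadIdx.x;
   partial[tid] = (i < num) ? in[i] : 0.0f;   // each thread loads one element into shared memory
   __syncthreads();                           // all loads complete before any additions start
   // tree reduction within the thread block
   for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
      if (tid < stride) partial[tid] += partial[tid + stride];
      __syncthreads();
   }
   if (tid == 0) blockTotals[blockIdx.x] = partial[0];  // one partial sum per block; combine on the host or in a second kernel
}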
Multiple GPUs?
• WHY?
• further parallelisation
• scales up the memory available
• YES
• for CUDA4+, cc 2.0+
• check device info
• cudaSetDevice()
sets the current GPU
Multiple GPUs?
• so mix of CUDA calls and changing
the target device via (at appropriate times!)
cudaSetDevice()
==> multiple GPUs
• BUT data transfer too?
• UVA: unified virtual addressing
• GPU-to-GPU data transfers
• GPU-0 accessing GPU-1 data directly
• check with cudaDeviceCanAccessPeer(), then call cudaDeviceEnablePeerAccess() to enable it
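A minimal sketch of the multi-GPU pattern (illustrative; error checking omitted):

int nDevices = 0;
cudaGetDeviceCount(&nDevices);          // how many GPUs are visible?

// check and enable direct GPU-0 access to GPU-1 memory (needs UVA & suitable hardware)
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);
if (canAccess) {
   cudaSetDevice(0);                    // subsequent CUDA calls target GPU 0
   cudaDeviceEnablePeerAccess(1, 0);    // GPU 0 may now access GPU 1's memory (flags must be 0)
}

cudaSetDevice(0);
// ... allocate and launch work on GPU 0 ...
cudaSetDevice(1);
// ... allocate and launch work on GPU 1 ...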
>1 Kernel
• Yes (as per lab!)
• But can you have
>1 running kernel at once?
• depends on hardware (supported on modern GPUs with cc>=2.0)
• depends on SM available (if 1 kernel uses all SMs then…)
• use “Streams”
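A sketch of launching two kernels that may overlap (hardware permitting) by placing them in different streams; kernelA and kernelB are placeholder names:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
// 4th launch parameter selects the stream (3rd is dynamic shared memory, 0 here)
kernelA<<<blocksA, threadsA, 0, s1>>>(/* args */);
kernelB<<<blocksB, threadsB, 0, s2>>>(/* args */);
cudaDeviceSynchronize();   // wait for both kernels before using their results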
Kernels can call Kernels!
• __global__ qualifier: CUDA kernel called from the host
• __device__ qualifier: CUDA function called from the GPU (ie from a __global__ or another __device__ qualified function)
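For example (a small sketch):

// callable only from device code
__device__ float squarePlusOne(float v) {
   return v*v + 1.0f;
}

// callable from the host via <<< >>>; calls the __device__ function above
__global__ void applyAll(float *data, int num) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if (i < num) data[i] = squarePlusOne(data[i]);
}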
Profiling
• NVIDIA recommends…
• IDE: nsight
  • but cannot run on a chadwick GPU-enabled compute node (no X11 support)
• nvprof
  • either output to screen
  • or to file, then use nvprof from a login node (X11 supported)
demo
Optimisation
• shared memory
• no warp divergence
• play with #threads, #blocks
• potentially async / multiple kernels
• "occupancy"
  • ratio: #active warps per SM / max #warps per SM
    https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm
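If helpful, the runtime can report a theoretical figure for a given kernel and block size; a sketch using the cudaOccupancyMaxActiveBlocksPerMultiprocessor call (myKernel is a placeholder kernel name):

int blockSize = 128;
int maxActiveBlocks = 0;
// how many blocks of 'blockSize' threads can be resident per SM for this kernel?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, myKernel, blockSize, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(maxActiveBlocks * blockSize)
                / (float) prop.maxThreadsPerMultiProcessor;   // fraction of the SM's thread slots in use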
Other things to know about
• Dynamic parallelism (CUDA capability (cc) 3.5+)
• eg problem changes wrt time ==> need more threads / diff work pattern
• launching kernels from a kernel
• Tensor cores
• machine learning
• cc 7.0 onwards
CUDA
• Only had time for introductory-intermediate topics
• future reading on key topics:
• Sanders & Kandrot:
• Chapter 3: calling a CUDA kernel
• Section 4.2.1: data transfers
• Sect 5.3: Shared Memory & synchronisation
• Chapter 6 (constant memory: const) & Chapter 7 (texture mem: locality)
• Sect 6.3: Events / timing
• Chapter 10: streams (asynchronisation)
• NVIDIA's CUDA web resources
• https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/ << GPU memory types

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane