COMP528 HAL28: Intro to CUDA
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
28 – HAL
CUDA
Reading/Background Materials
• “CUDA by Example: An Introduction to General-Purpose
GPU Programming”, Sanders & Kandrot (2011)
– hard copies in the library
– available as a ~300-page download (save it for reference!) via
http://www.mat.unimi.it/users/sansotte/cuda/CUDA_by_Example.pdf
• NVIDIA’s CUDA web resources
– https://www.nvidia.com/en-us/deep-learning-ai/education/
• “GPU Gems”, e.g. download from NVIDIA
– https://developer.nvidia.com/gpugems/gpugems/contributors
– only the parts on computation (not the parts on graphics!)
Steps of converting key work to use CUDA
1. Determine which work has inherent parallelism, i.e. can run
as concurrent work-packages
2. Move a work-package into a “kernel”
3. Invoke the kernel in parallel via CUDA
(which will run multiple copies of the kernel, one for each
work-package required)
Based upon
“Steps to CUDA”
© High End Compute Ltd
Steps of converting key work to use CUDA
1. Determine work that has inherent parallelism
“z = x + y” loop
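The “z = x + y” loop has inherent parallelism because each iteration writes one element of z and reads only the matching elements of x and y: no iteration depends on another, so each is an independent work-package. A minimal C sketch (function names are illustrative, not from the slides) makes this concrete by showing that running the iterations in reverse order gives the same result:

```c
#include <assert.h>

/* each iteration writes one element of z and reads only the
   matching elements of x and y: an independent "work-package" */
void add_forward(const double *x, const double *y, double *z, int n) {
    for (int i = 0; i < n; i++) z[i] = x[i] + y[i];
}

/* running the same work-packages in reverse order gives an
   identical result, confirming the iterations are independent */
void add_reverse(const double *x, const double *y, double *z, int n) {
    for (int i = n - 1; i >= 0; i--) z[i] = x[i] + y[i];
}
```

Because order does not matter, the iterations can also run concurrently, which is exactly what the CUDA version will exploit.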
CUDA by Example: vector arithmetic kernel
void serial_kernel(float *x, float *y, float *z, int num) {
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}

start = clock();
serial_kernel(x, y, z, num);
finish = clock();

/* equivalently, timing the loop inline: */
start = clock();
for (int i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();
Steps of converting key work to use CUDA
1. Determine work that has inherent parallelism via running
concurrent work-packages
2. Move a work-package to a "kernel"
3. Invoke a parallel kernel by use of CUDA
(which will run multiple copies of kernel, one for each
work-package required)
Based upon
“Steps to CUDA”
© High End Compute Ltd
CUDA by Example: CUDA kernel
void serial_kernel(float *x, float *y, float *z, int num) {
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}

start = clock();
serial_kernel(x, y, z, num);
finish = clock();

__global__ void cuda_kernel(float *x, float *y, float *z) {
    // parallel control via varying index
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over the index
}

start = clock();
cuda_kernel<<<numBlocks, threadsPerBlock>>>(x, y, z);
finish = clock();
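Putting the three steps together, a minimal complete CUDA program might look like the sketch below. This is a hedged sketch, not the course's reference code: error checking is elided, the block/grid sizes are assumptions, and the kernel here additionally takes `num` and guards the index (needed whenever the grid is rounded up past the array length), which the slide's simpler kernel omits. Device memory management uses the standard CUDA runtime API (cudaMalloc/cudaMemcpy/cudaFree):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Step 2: one work-package (one element of z) per kernel instance
__global__ void cuda_kernel(float *x, float *y, float *z, int num) {
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    if (my_i < num)                  // guard: grid may exceed num
        z[my_i] = x[my_i] + y[my_i];
}

int main(void) {
    int num = 1 << 20;               // assumed size: 1M elements
    size_t bytes = num * sizeof(float);

    float *x = (float*)malloc(bytes);
    float *y = (float*)malloc(bytes);
    float *z = (float*)malloc(bytes);
    for (int i = 0; i < num; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // device copies of x, y, z
    float *d_x, *d_y, *d_z;
    cudaMalloc(&d_x, bytes); cudaMalloc(&d_y, bytes); cudaMalloc(&d_z, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // Step 3: launch one copy of the kernel per work-package
    int threadsPerBlock = 256;       // assumed launch configuration
    int numBlocks = (num + threadsPerBlock - 1) / threadsPerBlock;
    cuda_kernel<<<numBlocks, threadsPerBlock>>>(d_x, d_y, d_z, num);

    cudaMemcpy(z, d_z, bytes, cudaMemcpyDeviceToHost);
    printf("z[0] = %.1f\n", z[0]);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(x); free(y); free(z);
    return 0;
}
```

Note the design choice in the launch configuration: numBlocks is rounded up so that every element gets a thread, which is why the in-kernel bounds guard is required.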
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane