COMP528 HAL28: Intro to CUDA
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk
COMP528: Multi-core and
Multi-Processor Programming
28 – HAL
Reading/Background Materials
• “CUDA by example: an introduction to general-purpose
GPU programming”, Sanders & Kandrot (2011)
– hard copies in the library
– available (300 pages download ==> save it for ref!) via
• NVIDIA’s CUDA web/resources
• “GPU Gems”, e.g. download from NVIDIA
– only the parts on computation (not the parts on graphics!)
Steps of converting key work to use CUDA
1. Determine work that has inherent parallelism via running
concurrent work-packages
2. Move a work-package to a “kernel”
3. Invoke a parallel kernel by use of CUDA
(which will run multiple copies of kernel, one for each
work-package required)
Based upon
“Steps to CUDA”
© High End Compute Ltd
Steps of converting key work to use CUDA
1. Determine work that has inherent parallelism
“z = x + y” loop
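One informal way to see why the “z = x + y” loop has inherent parallelism: no iteration reads a value written by another iteration, so the work-packages can run in any order (and hence concurrently) without changing the result. The C sketch below is an illustration only. The names `add_forward`, `add_reverse`, `running_sum` and `order_independent` are hypothetical and not from the lecture, and `float` is an assumed element type. It runs the addition loop forwards and backwards and compares the results, then contrasts this with a running-sum loop whose iterations depend on each other.

```c
#include <assert.h>

/* z = x + y, visiting elements from first to last */
void add_forward(const float *x, const float *y, float *z, int num) {
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}

/* the same work-packages, visited in the opposite order */
void add_reverse(const float *x, const float *y, float *z, int num) {
    for (int i = num - 1; i >= 0; i--) {
        z[i] = x[i] + y[i];
    }
}

/* counter-example: iteration i reads s[i-1], which iteration i-1 wrote,
   so these iterations are NOT independent work-packages */
void running_sum(const float *x, float *s, int num) {
    if (num > 0) {
        s[0] = x[0];
    }
    for (int i = 1; i < num; i++) {
        s[i] = s[i - 1] + x[i];
    }
}

/* returns 1 if both orderings of the addition loop give identical results */
int order_independent(const float *x, const float *y, int num) {
    enum { MAX = 64 };          /* sketch only handles small inputs */
    float zf[MAX], zr[MAX];
    assert(num <= MAX);
    add_forward(x, y, zf, num);
    add_reverse(x, y, zr, num);
    for (int i = 0; i < num; i++) {
        if (zf[i] != zr[i]) {
            return 0;
        }
    }
    return 1;
}
```

Passing this check does not prove independence in general, but a loop like `running_sum` cannot simply be handed out one iteration per thread, whereas the addition loop can.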
CUDA by Example: vector arithmetic kernel
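// the loop moved into a "kernel" function (element type float is assumed)
void serial_kernel(const float *x, const float *y, float *z, int num) {
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}

// timing the call to the kernel function
start = clock();
serial_kernel(x, y, z, num);
finish = clock();

// the same loop timed in place, without a kernel function
start = clock();
for (i = 0; i < num; i++) {
    z[i] = x[i] + y[i];
}
finish = clock();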
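A fuller version of the timed serial kernel might look like the following C sketch. The element type (`float`), the initial values, and the helper name `time_serial_kernel` are assumptions made for illustration and are not from the slides. `clock()` and `CLOCKS_PER_SEC` are standard C (`<time.h>`): `clock()` reports processor time in ticks, so dividing the difference by `CLOCKS_PER_SEC` gives seconds.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* serial kernel: one loop over all work-packages */
void serial_kernel(const float *x, const float *y, float *z, int num) {
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}

/* Allocates and fills x and y, times one call to serial_kernel, and
   returns the elapsed processor time in seconds (or -1.0 if allocation
   fails). The last element of z is stored in *last so the caller can
   check the result. Requires num >= 1. */
double time_serial_kernel(int num, float *last) {
    float *x = malloc((size_t)num * sizeof *x);
    float *y = malloc((size_t)num * sizeof *y);
    float *z = malloc((size_t)num * sizeof *z);
    if (x == NULL || y == NULL || z == NULL) {
        free(x);
        free(y);
        free(z);
        return -1.0;
    }

    /* assumed test data: x[i] = i, y[i] = 2i, so z[i] should be 3i */
    for (int i = 0; i < num; i++) {
        x[i] = (float)i;
        y[i] = 2.0f * (float)i;
    }

    clock_t start = clock();
    serial_kernel(x, y, z, num);
    clock_t finish = clock();

    *last = z[num - 1];
    free(x);
    free(y);
    free(z);
    return (double)(finish - start) / CLOCKS_PER_SEC;
}
```

For small `num` the elapsed time may be below the resolution of `clock()` and come back as zero, so a large array (or repeated calls) is usually needed for a meaningful measurement.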
CUDA by Example: CUDA kernel
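// serial version (element type float is assumed)
void serial_kernel(const float *x, const float *y, float *z, int num) {
    for (int i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
}
start = clock();
serial_kernel(x, y, z, num);
finish = clock();

// CUDA version: each copy of the kernel handles one work-package
__global__ void cuda_kernel(const float *x, const float *y, float *z) {
    // parallel control via varying index
    int my_i = threadIdx.x + blockIdx.x*blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over index
}
start = clock();
// launch multiple copies of the kernel, one for each work-package
cuda_kernel<<<...>>>(x, y, z);
finish = clock();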
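To see how `my_i = threadIdx.x + blockIdx.x*blockDim.x` gives each copy of the kernel its own element, the index arithmetic can be emulated on the host in plain C. This is only a sketch and not CUDA: the two nested loops stand in for the blocks and threads that a GPU would run concurrently, and the names `emulated_kernel`, `emulated_launch`, `blocks_needed`, `num_blocks` and `block_dim` are hypothetical. The `my_i < num` guard is an assumption beyond the slide. It is a common precaution when the number of threads launched exceeds the number of elements.

```c
#include <assert.h>

/* body of one "thread": mirrors the kernel's index calculation,
   plus a bounds guard (not on the slide) for surplus threads */
void emulated_kernel(const float *x, const float *y, float *z, int num,
                     int thread_idx, int block_idx, int block_dim) {
    int my_i = thread_idx + block_idx * block_dim;
    if (my_i < num) {
        z[my_i] = x[my_i] + y[my_i];
    }
}

/* stand-in for launching num_blocks blocks of block_dim threads each;
   on a GPU these copies would run concurrently, not in nested loops */
void emulated_launch(const float *x, const float *y, float *z, int num,
                     int num_blocks, int block_dim) {
    for (int b = 0; b < num_blocks; b++) {
        for (int t = 0; t < block_dim; t++) {
            emulated_kernel(x, y, z, num, t, b, block_dim);
        }
    }
}

/* smallest number of blocks of size block_dim that covers num elements */
int blocks_needed(int num, int block_dim) {
    return (num + block_dim - 1) / block_dim;
}
```

For example, with `num = 10` and `block_dim = 4`, `blocks_needed` returns 3, so 12 copies run and the guard makes the last two do nothing. One caveat when timing real CUDA code with `clock()`: a kernel launch returns to the host before the kernel has finished, so a synchronisation such as `cudaDeviceSynchronize()` is needed before reading `finish`, otherwise only the launch overhead is measured.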
Questions via MS Teams / email