
Microsoft PowerPoint – COMP528 HAL28 Intro to CUDA.pptx

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

28 – HAL

CUDA

Reading/Background Materials
• “CUDA by Example: An Introduction to General-Purpose GPU Programming”, Sanders & Kandrot (2011)
  – hard copies in the library
  – available as a ~300-page PDF download (save it for reference!) via
    http://www.mat.unimi.it/users/sansotte/cuda/CUDA_by_Example.pdf

• NVIDIA’s CUDA web resources
  – https://www.nvidia.com/en-us/deep-learning-ai/education/

• “GPU Gems”, e.g. download from NVIDIA
  – https://developer.nvidia.com/gpugems/gpugems/contributors
  – only the parts on computation (not the parts on graphics!)

Steps of converting key work to use CUDA

1. Determine work that has inherent parallelism via running
concurrent work-packages

2. Move a work-package to a “kernel”

3. Invoke a parallel kernel by use of CUDA
(which will run multiple copies of the kernel, one for each
work-package required)

Based upon
“Steps to CUDA”

© High End Compute Ltd

Steps of converting key work to use CUDA

1. Determine work that has inherent parallelism

e.g. the “z = x + y” loop


CUDA by Example: vector arithmetic kernel

serial_kernel(x, y, z, num) {
    for (i = 0; i < num; i++) {
        z[i] = x[i] + y[i];
    }
    return;
}

start = clock();
serial_kernel(x, y, z, num);
finish = clock();

CUDA by Example: CUDA kernel

__global__ cuda_kernel(x, y, z) {
    // parallel control via varying index
    my_i = threadIdx.x + blockIdx.x*blockDim.x;
    z[my_i] = x[my_i] + y[my_i];
    // note there is NO 'for' loop over the index
}

start = clock();
cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(x, y, z);
finish = clock();
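The slides show only the kernel and its launch; a complete CUDA program also needs device allocation, host-to-device copies, and a launch configuration. A minimal sketch of the whole flow (error checking omitted; the names blocksPerGrid and threadsPerBlock, the bounds guard, and the problem size are illustrative additions, not from the slides):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void cuda_kernel(float *x, float *y, float *z, int num) {
    // parallel control via varying index, one thread per element
    int my_i = threadIdx.x + blockIdx.x * blockDim.x;
    if (my_i < num)                    // guard: the grid may overshoot num
        z[my_i] = x[my_i] + y[my_i];
}

int main(void) {
    const int num = 1000;
    size_t bytes = num * sizeof(float);

    // host arrays
    float *h_x = (float*)malloc(bytes);
    float *h_y = (float*)malloc(bytes);
    float *h_z = (float*)malloc(bytes);
    for (int i = 0; i < num; i++) { h_x[i] = (float)i; h_y[i] = 2.0f * i; }

    // device arrays
    float *d_x, *d_y, *d_z;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_z, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // enough blocks to cover num elements (rounding up)
    int threadsPerBlock = 256;
    int blocksPerGrid = (num + threadsPerBlock - 1) / threadsPerBlock;
    cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_x, d_y, d_z, num);

    // copy result back and inspect one element
    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);
    printf("z[999] = %f\n", h_z[999]);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    free(h_x); free(h_y); free(h_z);
    return 0;
}
```

The guard `if (my_i < num)` is needed because the block count is rounded up, so the last block may contain threads with indices beyond the array; the slides' kernel omits it for clarity.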


Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane