
CUDA Programming
Module II

September 27, 2018
School of Computer Science and Engineering, KNU

Objectives

 Know how to check whether your kernel operates properly
   Verifying your kernel
   Handling errors
   Compiling & executing
 Know how long a kernel takes to execute
   Timing your kernel
 Organize parallel threads

Brief Review

checkResult
 Two ways to verify kernel results:
   1st: printf, for Fermi or later generation devices
   2nd: the <<<1, 1>>> execution configuration
     Forces the kernel to run with only one block and one thread

Handling Errors

 Difficult to identify which routine caused an error
   Reason: many CUDA calls are asynchronous
 Solution: an error-handling macro to wrap all CUDA API calls

Handling Errors

 CHECK(cudaMemcpy(d_c, gpuRef, nBytes, cudaMemcpyHostToDevice));
 kernel_function<<<grid, block>>>(argument list);
   CHECK(cudaDeviceSynchronize());
     Blocks the host thread until the device has completed all preceding requested tasks
     Ensures no error occurred as part of the last kernel launch
     For debugging purposes
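A minimal sketch of such an error-handling macro, in the style common to this course's examples. The macro name CHECK matches the slides; the exact body is an assumption, not necessarily the course's verbatim definition:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA API call; print the failing file/line and exit on error.
#define CHECK(call)                                                    \
{                                                                      \
    const cudaError_t error = (call);                                  \
    if (error != cudaSuccess)                                          \
    {                                                                  \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);         \
        fprintf(stderr, "code: %d, reason: %s\n",                      \
                error, cudaGetErrorString(error));                     \
        exit(1);                                                       \
    }                                                                  \
}

// Usage pattern from the slide: wrap API calls, and follow a kernel
// launch with a synchronized check so launch errors surface immediately.
//   CHECK(cudaMemcpy(d_c, gpuRef, nBytes, cudaMemcpyHostToDevice));
//   kernel_function<<<grid, block>>>(argument list);
//   CHECK(cudaDeviceSynchronize());
```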

Timing with a CPU Timer

 CPU timer
   System call: gettimeofday()
   Header: sys/time.h
 cudaDeviceSynchronize(): waits for all GPU threads to complete
 iElaps: the time spent, as if you had measured kernel execution with a wristwatch

Example

 For a big vector with 16M (16×10^6) elements:
   int nElem = 1 << 24;
 GPU scalability: a row-major array index computed from the block and thread indices
 The total number of vector elements must not exceed the total number of threads

Example code (1/3): "sumArraysOnGPU.cu"

Example code (2/3): "sumArraysOnGPU.cu"

Example code (3/3): "sumArraysOnGPU.cu"

Execution Results

Device Limitations

 Limitations on grid and block dimensions
   Device dependent
 E.g., Fermi devices:
   Maximum # of threads/block is 1,024
   Maximum grid size in each of the x, y, and z dimensions is 65,535
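Since these limits are device dependent, they can be queried at runtime rather than hard-coded; a sketch using the standard cudaGetDeviceProperties API (device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```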

Organizing Parallel Threads

 2D grid with 2D blocks
 1D grid with 1D blocks
 2D grid with 1D blocks

Indexing Matrices with Blocks and Threads

 Example: an 8-by-6 matrix
 In a matrix addition kernel, a thread is assigned one data element to process
 First issue: accessing the assigned data from global memory using the block and thread indices
 3 kinds of indices for a 2D case:
   Thread and block index
   Coordinate of a given point in the matrix
   Offset in linear global memory

Indexing Matrices with Blocks and Threads

 1st step: map the thread & block index to a matrix coordinate
   ix = threadIdx.x + blockIdx.x * blockDim.x
   iy = threadIdx.y + blockIdx.y * blockDim.y
 2nd step: map the matrix coordinate to a global memory location/index
   idx = iy * nx + ix

Indexing Matrices with Blocks and Threads

 Example: create a function "printThreadInfo" to print each thread's
  •   Thread index
  •   Block index
  •   Matrix coordinate
  •   Global linear memory offset
  •   Value of the corresponding element
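A sketch of what such a kernel might look like, assuming a Fermi-or-later device so that device-side printf is available; the body is an illustration consistent with the indexing scheme above, not the course's verbatim code:

```cuda
#include <cstdio>

__global__ void printThreadInfo(int *A, const int nx, const int ny)
{
    // map thread & block index to a matrix coordinate
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;

    // map the matrix coordinate to a linear global memory offset
    unsigned int idx = iy * nx + ix;

    if (ix < nx && iy < ny)
        printf("thread_id (%d,%d) block_id (%d,%d) coordinate (%d,%d) "
               "global index %2d ival %2d\n",
               threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
               ix, iy, idx, A[idx]);
}
```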

Example (1/4): "checkThreadIndex.cu"

Example (2/4): "checkThreadIndex.cu"

Example (3/4): "checkThreadIndex.cu"

Example (4/4): "checkThreadIndex.cu"

Summing Matrices with a 2D Grid and 2D Blocks
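A sketch of a matrix-add kernel for this configuration, consistent with the two-step indexing above (the kernel name and the 32x32 block shape in the launch comment are illustrative assumptions):

```cuda
// 2D grid of 2D blocks: each thread handles exactly one matrix element.
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int idx = iy * nx + ix;

    if (ix < nx && iy < ny)               // guard threads past the matrix edge
        MatC[idx] = MatA[idx] + MatB[idx];
}

// Possible launch:
//   dim3 block(32, 32);
//   dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
//   sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
```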

Example: "sumMatrixOnGPU-2D-grid-2D-block.cu"

Example Execution: "sumMatrixOnGPU-2D-grid-2D-block.cu"

Summing Matrices with a 1D Grid and 1D Block

1D Grid & 1D Block
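With a 1D grid of 1D blocks there are only nx threads in total, so each thread must loop over a whole column of the nx-by-ny matrix. A sketch under that assumption (kernel name illustrative):

```cuda
// 1D grid, 1D blocks: thread ix sums column ix across all ny rows.
__global__ void sumMatrixOnGPU1D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;

    if (ix < nx)
        for (int iy = 0; iy < ny; iy++)
        {
            int idx = iy * nx + ix;        // row-major offset
            MatC[idx] = MatA[idx] + MatB[idx];
        }
}
```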

Summing Matrices with a 2D Grid and 1D Blocks

2D Grid & 1D Block
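In the mixed configuration, the second grid dimension replaces the loop: blockIdx.y directly supplies the row index, and each thread is back to one element. A sketch (kernel name illustrative):

```cuda
// 2D grid of 1D blocks: ix comes from the 1D block, iy from blockIdx.y.
__global__ void sumMatrixOnGPUMix(float *MatA, float *MatB, float *MatC,
                                  int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = blockIdx.y;          // one block row per matrix row
    unsigned int idx = iy * nx + ix;

    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}
```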

Practice & Submission

 1st deadline: 11:59 PM, Sept. 27, 2018 (100%)
 2nd deadline: 11:59 PM, Oct. 1, 2018 (70%)

 Tasks:
   1st: Practice the program named "sumMatrixOnGPU-2D-grid-2D-block.cu"
   2nd: Adapt it to integer matrix addition, and find the best execution configuration

 Submissions should include the following:
   Source code
   A PDF file including 1) screenshots of your execution results and 2) the reasoning behind your choice of the best execution configuration

Thank you!