CUDA Programming Model II
September 27, 2018
School of Computer Science and Engineering, KNU
Objectives
- Know how to check whether your kernel operates properly: verifying your kernel, handling errors
- Compiling & executing
- Know how long a kernel takes to execute: timing your kernel
- Organize parallel threads
Brief Review
checkResult
Two ways to verify your kernel:
- 1st: "printf" for Fermi or later generation devices
- 2nd: the <<<1, 1>>> execution configuration: forces the kernel to run with only one block and one thread
Handling Errors
- Difficult to identify which routine caused an error
- Reason: CUDA calls are asynchronous
- Solution: an error-handling macro to wrap all CUDA API calls
Handling Errors
CHECK(cudaMemcpy(d_c, gpuRef, nBytes, cudaMemcpyHostToDevice));
kernel_function<<<grid, block>>>(argument list);
CHECK(cudaDeviceSynchronize());
- cudaDeviceSynchronize() blocks the host thread until the device has completed all preceding requested tasks
- Ensures no error occurred as part of the last kernel launch (for debugging purposes)
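The definition of the CHECK macro is not reproduced on the slides; a typical sketch of such a wrapper looks like the following (the error message format is an assumption):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap every CUDA API call so a failure reports file and line,
// plus the human-readable reason from cudaGetErrorString().
#define CHECK(call)                                                    \
{                                                                      \
    const cudaError_t error = (call);                                  \
    if (error != cudaSuccess)                                          \
    {                                                                  \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);         \
        fprintf(stderr, "code: %d, reason: %s\n", error,               \
                cudaGetErrorString(error));                            \
        exit(1);                                                       \
    }                                                                  \
}
```

Because kernel launches themselves return no error code, wrapping a following cudaDeviceSynchronize() call, as above, is what surfaces launch errors.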
Timing with CPU Timer
- CPU timer
  - System call: gettimeofday(); header: sys/time.h
- cudaDeviceSynchronize(): wait for all GPU threads to complete
- iElaps: the time spent, as if you had measured kernel execution with a wristwatch
Example
- For a big vector with 16M (2^24 = 16,777,216) elements: int nElem = 1<<24;
- GPU scalability: compute a row-major array index from the block and thread indices
- The total vector element count should not exceed the total number of threads
Example code (1/3)-(3/3): "sumArraysOnGPU.cu" (listing spans three slides)
Execution Results
Device Limitations
- Limitations on grid and block dimensions are device dependent
- E.g., Fermi devices:
  - Maximum # of threads/block is 1024
  - Maximum grid size in each of the x, y, and z dimensions is 65,535
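Rather than hard-coding the Fermi limits, the limits of the installed device can be queried at run time; a minimal sketch (device 0 is assumed):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Fill prop with the capabilities of device 0.
    cudaGetDeviceProperties(&prop, 0);

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size: (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```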
Organizing Parallel Threads
- 2D grid with 2D blocks
- 1D grid with 1D blocks
- 2D grid with 1D blocks
Indexing Matrices with Blocks and Threads
Example: 8-by-6 matrix
In a matrix addition kernel:
- A thread is assigned one data element to process
- First issue: accessing the assigned data from global memory using the block and thread index
Three kinds of indices for a 2D case:
- Thread and block index
- Coordinate of a given point in the matrix
- Offset in linear global memory
Indexing Matrices with Blocks and Threads
- 1st step: map the thread & block index to a matrix coordinate:
  ix = threadIdx.x + blockIdx.x * blockDim.x
  iy = threadIdx.y + blockIdx.y * blockDim.y
- 2nd step: map the matrix coordinate to a global memory location/index:
  idx = iy * nx + ix
Indexing Matrices with Blocks and Threads
Example: create a function "printThreadInfo" to print:
- Thread index
- Block index
- Matrix coordinate
- Global linear memory offset
- Value of the corresponding element
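Such a kernel can be sketched as follows (the slide's own listing is not reproduced here; the parameter names are illustrative, and the launch is assumed to cover exactly nx-by-ny threads so A[idx] stays in bounds):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread prints its own thread/block indices, the matrix
// coordinate it maps to, the linear global-memory offset, and the
// element of A it was assigned.
__global__ void printThreadInfo(int *A, const int nx, const int ny)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int idx = iy * nx + ix;

    printf("thread_id (%d,%d) block_id (%d,%d) coordinate (%d,%d) "
           "global index %2d ival %2d\n",
           threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
           ix, iy, idx, A[idx]);
}
```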
Example (1/4)-(4/4): "checkThreadIndex.cu" (listing spans four slides)
Summing Matrices with a 2D Grid and 2D Blocks
Example: "sumMatrixOnGPU-2D-grid-2D-block.cu" (listing and execution results span five slides)
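The kernel at the heart of that example can be sketched as follows, combining the two-step index mapping with a bounds guard (a sketch assuming float matrices; names are illustrative):

```cuda
#include <cuda_runtime.h>

// 2D grid of 2D blocks: each thread computes one element.
// Map (threadIdx, blockIdx) to the matrix coordinate (ix, iy),
// then to the row-major offset idx.
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int idx = iy * nx + ix;

    // Guard: the rounded-up grid may contain more threads than elements.
    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}
```

A matching launch rounds the grid up in both dimensions, e.g. `dim3 block(32, 32); dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);`.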
Summing Matrices with a 1D Grid and 1D Block
1D Grid & 1D Block
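In the 1D-grid, 1D-block variant each thread handles one whole column and loops over the rows, reusing the same row-major offset computation (a sketch; names are illustrative):

```cuda
#include <cuda_runtime.h>

// 1D grid of 1D blocks: one thread per column, iterating down the rows.
__global__ void sumMatrixOnGPU1D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;

    if (ix < nx) {
        for (int iy = 0; iy < ny; iy++) {
            int idx = iy * nx + ix;      // idx = iy * nx + ix
            MatC[idx] = MatA[idx] + MatB[idx];
        }
    }
}
```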
Summing Matrices with a 2D Grid and 1D Blocks
2D Grid & 1D Block
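For a 2D grid of 1D blocks, the y block index directly gives the matrix row, so each thread again handles exactly one element (a sketch; names are illustrative):

```cuda
#include <cuda_runtime.h>

// 2D grid of 1D blocks: blockIdx.y is the row, so blockDim.y and
// threadIdx.y are not needed.
__global__ void sumMatrixOnGPUMix(float *MatA, float *MatB, float *MatC,
                                  int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = blockIdx.y;
    unsigned int idx = iy * nx + ix;

    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}
```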
Practice & Submission
1st Deadline: 11:59 PM, Sept. 27, 2018 (100%)
2nd Deadline: 11:59 PM, Oct. 1, 2018 (70%)
Tasks:
- 1st: Practice the program named "sumMatrixOnGPU-2D-grid-2D-block.cu"
- 2nd: Adapt it to integer matrix addition, and find the best execution configuration.
Submissions should include the following:
- Source code
- A PDF file including 1) screenshots of your execution results, and 2) the reasoning behind your choice of the best execution configuration
Thank you!