CUDA Tutorial
MSBD5009 Assignment 3
Outline
• CUDA Environment
• CUDA programming basics
• Assignment 3
CUDA Environment
• The CUDA environment is already enabled on your Azure instance.
• Check your CUDA environment:
• Use nvcc --version
• If you cannot find the nvcc command, please email the TAs.
Typical CUDA programming model
Memory Allocation
• Host memory
• malloc
• void* malloc(size_t size);
• Parameters:
• size: size of the memory block, in bytes. size_t is an unsigned integral type.
• returns:
• On success, a pointer to the memory block allocated by the function.
Example
• Device memory
• cudaMalloc
• cudaError_t cudaMalloc(void **devPtr, size_t size);
• Parameters:
• devPtr: pointer to the allocated device memory.
• size: requested allocation size in bytes.
• returns:
• cudaSuccess, cudaErrorMemoryAllocation
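A minimal sketch contrasting the two allocation calls (the element count is illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const size_t n = 1024;

    // Host allocation: malloc returns NULL on failure.
    float *h_a = (float *)malloc(n * sizeof(float));

    // Device allocation: note the extra level of indirection
    // and the cudaError_t return value.
    float *d_a;
    cudaError_t err = cudaMalloc((void **)&d_a, n * sizeof(float));
    if (err != cudaSuccess) return 1;  // cudaErrorMemoryAllocation

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```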
Memory deallocation
• Host memory
• free
• void free(void *ptr);
• Parameters:
• ptr: pointer to a memory block previously allocated with malloc, calloc, or realloc. If a null pointer is passed, no action occurs.
• returns:
• This function does not return a value.
Example
• Device memory
• cudaFree
• cudaError_t cudaFree(void *devPtr);
• Parameters:
• devPtr: device pointer to memory to free.
• returns:
• cudaSuccess, cudaErrorInvalidDevicePointer, cudaErrorInitializationError
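A minimal sketch pairing each allocation with its deallocation:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    float *h_a = (float *)malloc(64 * sizeof(float));
    float *d_a;
    cudaMalloc((void **)&d_a, 64 * sizeof(float));

    // free() returns nothing; passing NULL is a no-op.
    free(h_a);

    // cudaFree() returns a cudaError_t, e.g. cudaErrorInvalidDevicePointer
    // if the pointer was not allocated by cudaMalloc.
    cudaError_t err = cudaFree(d_a);
    return err == cudaSuccess ? 0 : 1;
}
```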
Data transfer between host and device
cudaMemcpy(void* dst,
const void* src,
size_t count,
enum cudaMemcpyKind kind)
kind:
cudaMemcpyHostToHost,
cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost,
cudaMemcpyDeviceToDevice
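A round-trip sketch using cudaMemcpy (sizes are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const int n = 16;
    float h_in[16], h_out[16];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    // Host -> device, then device -> host.
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_out[15] = %.0f\n", h_out[15]);  // 15 if the copies succeeded

    cudaFree(d_buf);
    return 0;
}
```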
CUDA Kernel Declaration and Invocation
• A kernel function declaration has the prefix __global__ and return type void.
__global__ void kernelName(param1, ...);
• A kernel function invocation includes launch parameters: #block, #thread, ...
kernelName<<<#block, #thread, size, stream>>>(param1, ...);
#block: number of blocks per grid.
#thread: number of threads per block.
size and stream are ignored in our assignment.
• E.g:
AddKernel<<<32, 1024>>>(d_c, d_a, d_b);
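A possible definition behind that launch; the kernel body and the assumption that the arrays hold exactly 32 * 1024 elements are illustrative:

```cuda
#include <cuda_runtime.h>

// Element-wise addition: one thread per element.
__global__ void AddKernel(float *c, const float *a, const float *b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    const int n = 32 * 1024;  // one element per launched thread
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, n * sizeof(float));

    AddKernel<<<32, 1024>>>(d_c, d_a, d_b);  // 32 blocks, 1024 threads each
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```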
Built-in type dim3
• dim3 is an integer vector type that can be used in CUDA code.
• Its most common application:
• pass the grid and block dimensions in a kernel invocation.
• dim3 has 3 elements: x, y, z
• C-style initialization: dim3 grid = {512, 512, 1};
• C++-style initialization: dim3 grid(512, 512, 1);
• Not all three elements need to be provided.
• Any element not provided during initialization is initialized to 1, not 0!
• E.g:
dim3 block(32); // 32 * 1 * 1
dim3 thread(1024); // 1024 * 1 * 1
AddKernel<<<block, thread>>>(d_c, d_a, d_b);
More Dim3 example
// 1 grid, 4 blocks per grid, 4 threads per block.
dim3 block(4, 1, 1); //4 blocks per grid
dim3 thread(4, 1, 1); // 4 threads per block
addKernel<<<block, thread>>>(d_c, d_a, d_b);
Thread index calculation
Assignment 3: Compiling on Azure Instance
Compile: nvcc -std=c++11 -arch=compute_37 -code=sm_37 main.cu
cuda_smith_waterman_skeleton.cu -o cuda_smith_waterman
Run: ./cuda_smith_waterman
Or just use run_cuda.sh bash script.
Assignment 3: 2D score -> 1D score
• Reason:
• Coalesced Access
• Linear representation of 2D array
• Index transform.
• Score matrix allocation changes.
• Score matrix addressing changes.
Assignment3: 2D score -> 1D score
cuda_smith_waterman.h
Assignment 3: Error check helper function
E.g:
GPUErrChk(cudaMalloc(&d_A, size));
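The helper is provided in the skeleton; a typical implementation looks roughly like this (the exact message format in the skeleton may differ):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Wrap every CUDA runtime call; abort with file/line on failure.
#define GPUErrChk(call)                                             \
    do {                                                            \
        cudaError_t e = (call);                                     \
        if (e != cudaSuccess) {                                     \
            fprintf(stderr, "GPU error: %s at %s:%d\n",             \
                    cudaGetErrorString(e), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main() {
    float *d_A;
    GPUErrChk(cudaMalloc((void **)&d_A, 64 * sizeof(float)));
    GPUErrChk(cudaFree(d_A));
    return 0;
}
```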
Assignment 3: Your task
• Handle memory allocation & deallocation by yourself.
• Handle memory copy by yourself.
• Write one or more kernels to do the computation of score matrix.
• Note: Using the global memory is enough.
Assignment 3: Hints
• Coalesced main memory access
• Improves performance greatly.
• If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request (coalesced) by the hardware.
[Figure: threads P0, P1, P2 accessing a 4x4 score matrix with cells numbered 1-16.]
• Refer to the lecture notes:
• CUDA programming model
• n-body simulation
• Pattern: Consecutive threads access consecutive memory addresses.
[Figure: the 4x4 score matrix (cells 1-16, row-major) before and after transferring anti-diagonals to new rows. After the transform, cells on the same anti-diagonal (e.g. 5, 2 or 9, 6, 3 or 13, 10, 7, 4) are stored contiguously, padded with ɛ where diagonals are shorter, so threads P0, P1, P2 access consecutive addresses.]
Transfer anti-diagonals to new rows.
Consecutive threads access consecutive memory addresses.
Assignment 3: References
• Kernel launch parameters:
• #block is set to the number of SMs (streaming multiprocessors) of the GPU, or a multiple of it.
• The Azure instance's GPU has 13 SMs.
• #thread is set to a multiple of 32 (the warp size).
• E.g: <<<13, 256>>> <<<26, 1024>>> <<<13, 512>>>
• Reference running time:
Assignment 3: Suggestions
If you have questions about CUDA programming:
• Work hard
• Read your lecture notes
• Read the NVIDIA CUDA documentation
• Ask Google
• Read the previous year's final exam algorithm code on the Maximum Flow problem
• Send emails to TAs