Intro to Parallel Computing
Topic 13: CUDA Threads
COSC 407: Intro to Parallel Computing
Previous pre-recorded lecture (student-led Q&As):
– CUDA basics: program structure
– Useful Built-in CUDA functions
– Function Declarations (global, device, host)
– Error Handling, cudaDeviceSynchronize
– Hardware architecture: SP → SM → GPU
– Thread Organization: threads → blocks → grids
• Dimension variables (blockDim, gridDim)
– Thread Life Cycle From the HW Perspective
– Kernel Launch Configuration: 1D grids/blocks
Next Lecture:
– Kernel Launch Configuration: nD grids/blocks
– CUDA limits
– Thread Cooperation
– Running Example: Matrix Multiplication
CPU and GPU have separate memory spaces
• Need to move data to device (GPU) if it is processed there
• Need to move results back to CPU memory
• Functions: cudaMalloc, cudaFree, cudaMemcpy
• Pointers hold memory addresses in either CPU or GPU memory
• You can't differentiate CPU pointers from GPU pointers just by checking their values
• Therefore, you must use each pointer in its appropriate location:
• Dereferencing a CPU pointer in kernel code will likely crash
• Dereferencing a GPU pointer in host code will likely crash
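For example, a minimal round trip between host and device memory might look like this (a sketch; the names h_a and d_a are illustrative, not from the slides):

int main(void) {
    int h_a[4] = {1, 2, 3, 4};         // array in CPU (host) memory
    int *d_a;                          // will hold an address in GPU (device) memory
    size_t num_bytes = sizeof(h_a);
    cudaMalloc(&d_a, num_bytes);                              // allocate GPU memory
    cudaMemcpy(d_a, h_a, num_bytes, cudaMemcpyHostToDevice);  // copy host -> device
    // ... a kernel would read/write d_a here; never dereference d_a in host code ...
    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);  // copy device -> host
    cudaFree(d_a);                                            // release GPU memory
    return 0;
}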
Error Handling
CUDA has two sources of errors:
(1) Errors from the CUDA API
E.g. cannot allocate memory space on the device
(2) Errors from CUDA Kernel
i.e. errors that happen inside your kernel code.
Handling CUDA API Errors
• CUDA API functions return an error code of type cudaError_t
• For example:
• cudaSuccess (=0, if no problems)
• cudaErrorMemoryAllocation (=2, if cannot allocate memory)
• Other error codes (positive values) are possible
• see the CUDA Runtime API documentation for the full list.
Such errors should be handled using some extra code. For example:
cudaError_t err = cudaMalloc(&d_a, num_bytes);
if (err != cudaSuccess) {
    printf("Can't allocate CUDA memory");
    // ... more code to handle the error
} else {
    // ...
}
Handling CUDA API Errors
A better idea: to avoid repeatedly writing if statements after each CUDA call, you can define and use a macro as follows:
#define CHK(call) { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        printf("Error %d: %s:%d\n", err, __FILE__, __LINE__); \
        printf("%s\n", cudaGetErrorString(err)); \
        cudaDeviceReset(); \
    } \
}
Note: cudaDeviceReset() destroys and cleans up all resources associated with the current device in the current process immediately.
Then use this macro whenever you call a CUDA API function:
CHK( cudaMalloc(&d_a, num_bytes) );
Handling CUDA Kernel Errors
• The other type of errors is the one that happens during the execution of YOUR kernel function
• You can check for this error as follows
Kernel<<<..,..>>>();              // call kernel
CHK( cudaGetLastError() );        // 1
CHK( cudaDeviceSynchronize() );   // 2
• Statement #1 will check for kernel launch errors
• e.g. too many threads per block
• CUDA runtime maintains an error variable that is overwritten each time an error occurs. cudaGetLastError() returns the value of this variable and resets the variable to cudaSuccess.
• Statement #2 will block the host until GPU is done
• Any asynchronous error (i.e. an error that occurs during kernel execution) is returned by cudaDeviceSynchronize()
More details: https://devblogs.nvidia.com/how-query-device-properties-and-handle-errors-cuda-cc/
cudaDeviceSynchronize()
• Kernel launches (and some CUDA functions) are asynchronous with respect to host code
• i.e. they return control to the calling CPU thread before
they finish their work
• cudaDeviceSynchronize() can be used to block the calling CPU thread until all CUDA calls made by this thread have completed
• Example use: timing your kernel (must include time.h and the CUDA runtime library)
Synchronization is expensive, so don’t overuse it!
double t = clock();
Kernel<<<..,..>>>();
cudaDeviceSynchronize();
t = (clock() - t) / CLOCKS_PER_SEC;
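For instance, a fuller sketch of this timing pattern (the kernel myKernel and the launch configuration <<<1, 256>>> are placeholders, not from the slides):

#include <stdio.h>
#include <time.h>

__global__ void myKernel(void) { /* ... work ... */ }

int main(void) {
    clock_t start = clock();
    myKernel<<<1, 256>>>();                        // launch returns immediately (asynchronous)
    cudaDeviceSynchronize();                       // block until the kernel has finished
    double t = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("Kernel time: %f s\n", t);
    return 0;
}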
Adding Vectors: Revisited
• You saw before that we usually assign a thread to process each element in an array.
• Assigning threads to vector elements was easy
__global__ void vec_add(float *A, float *B, float *C, int N) {
    int i = threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}
Typical GPU Program
Host code (running on CPU)
1. Allocate space on GPU
2. Copy CPU data to GPU
3. Launch kernel function(s) on GPU
– define the launch configuration before launching
4. Copy results from GPU to CPU
5. Free GPU memory
Kernel code (running on GPU)
• Write kernel function as if it will run on a single thread
• Use IDs to identify which piece of data is processed by
this thread
• Remember that this SAME kernel function is executed by many threads
• Parallelism of threads is expressed in the host code
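A minimal host-side sketch of these five steps (the kernel addOne and the helper runOnGpu are illustrative placeholders, not from the slides):

__global__ void addOne(const int* in, int* out, int n) {        // assumed example kernel
    int i = threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

void runOnGpu(const int* h_in, int* h_out, int n) {              // assumes n <= 1024 (one block)
    size_t num_bytes = n * sizeof(int);
    int *d_in, *d_out;
    cudaMalloc(&d_in,  num_bytes);                               // 1. allocate space on GPU
    cudaMalloc(&d_out, num_bytes);
    cudaMemcpy(d_in, h_in, num_bytes, cudaMemcpyHostToDevice);   // 2. copy CPU data to GPU
    addOne<<<1, n>>>(d_in, d_out, n);                            // 3. launch kernel (1 block, n threads)
    cudaMemcpy(h_out, d_out, num_bytes, cudaMemcpyDeviceToHost); // 4. copy results from GPU to CPU
    cudaFree(d_in);                                              // 5. free GPU memory
    cudaFree(d_out);
}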
GPU Design
§ Massively threaded, sustains 1000s of threads per app
§ The figure: 8 SMs x 16 SPs = 128 SPs (CUDA cores)
§ SM: streaming multiprocessor
§ SP: streaming processor
[Figure: GPU block diagram: an input assembler and thread execution manager feed 8 SMs, each with 16 SPs and a parallel data cache; load/store units connect the SMs to global memory (DRAM).]
GPU Design
• A scalable array of multithreaded Streaming Multiprocessors (SMs)
• A multithreaded program is partitioned into blocks of threads that execute independently from each other
• A GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
GPU Design
• When the host invokes a kernel grid:
• Blocks of the grid are enumerated
• Distributed to multiprocessors with available execution capacity
• The threads of a thread block execute concurrently on one multiprocessor
• Multiple thread blocks can execute concurrently on one multiprocessor
• As thread blocks terminate
• New blocks are launched on the vacated multiprocessors
• Each SM is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it uses a SIMT (Single-Instruction, Multiple-Thread) architecture.
Threads Organization: Basics
On SOFTWARE (code) side:
– Threads are grouped into Blocks
• All threads in a block execute the same kernel program (SPMD)
– Blocks are grouped into a Grid
• Each thread has a unique ID within a block
• Each block has a unique ID within a grid
On HARDWARE side:
– Each block runs on one SM.
• An SM might run more than one block
– Each thread runs on an SP (within an SM)
• An SP can only run one thread at any time
• Might run many successive threads.
– More about this later
Device Properties
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include
int main()
//just to check
cudaDeviceProp prop;
int count;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; i++)
cudaGetDeviceProperties(&prop, i);
//examine members of the strcut
Threads in a Block
Threads in a block are organized as a 1D, 2D, or 3D array of threads.
Built-in ID variables:
• threadIdx.x
• threadIdx.y
• threadIdx.z
Thread IDs are unique within a block.
[Figure: a 3×2×2 block of threads; the highlighted thread has threadIdx.x = 2, threadIdx.y = 1, threadIdx.z = 1.]
Why 1/2/3-D Organization?
Simplifies memory addressing when processing multidimensional data:
• 1D threads are most suitable for processing vectors
• 2D threads are most suitable for 2D arrays (e.g. images)
• 3D threads are most suitable for 3D arrays (e.g. 3D environments)
[Figure: a 1D block of threads mapped onto a vector V[], a 2D block mapped onto a 2D array, and a 3D block mapped onto a 3D array.]
Blocks in a Grid
§ A kernel launch (from host code) may create one or more blocks, each with many threads:
__global__ void kernel1(..) {..}
kernel1<<< nBlk, nTid >>>(..);
§ All blocks for a given kernel belong to a grid
§ All blocks in a grid must finish before the next kernel runs.
– Synchronization point!
§ Remember that each block runs on one SM.
[Figure (based on NVIDIA): the host launches each kernel as a grid of blocks, e.g. a 2×2 grid containing Block (0,0), Block (1,0), Block (0,1), and Block (1,1).]
Blocks in a Grid
§ Blocks in a grid are organized as a 1D, 2D, or 3D array of blocks.
§ Built-in ID variables:
blockIdx.x
blockIdx.y
blockIdx.z
§ Block IDs are unique within a grid.
[Figure: a grid of blocks; the highlighted block has blockIdx.x = 1, blockIdx.y = 1, blockIdx.z = 1.]
Dimension Variables
A dimension variable holds the number of elements along that dimension.
§ Dimensions may be unique for each grid and are set at launch time
– you cannot change a kernel's dimensions once it is launched.
§ Built-in dimension variables:
blockDim.x, blockDim.y, blockDim.z
gridDim.x, gridDim.y, gridDim.z
– E.g. Grid 2 in the figure: gridDim.x = 2, blockDim.x = 3
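As a small illustration (this kernel and launch are assumed examples, not from the slides), a launch of <<<2, 3>>> makes every thread see gridDim.x = 2 and blockDim.x = 3:

#include <stdio.h>

__global__ void showDims(void) {
    // with the launch below, every thread prints gridDim.x = 2 and blockDim.x = 3
    printf("gridDim.x=%d blockDim.x=%d blockIdx.x=%d threadIdx.x=%d\n",
           gridDim.x, blockDim.x, blockIdx.x, threadIdx.x);
}

int main(void) {
    showDims<<<2, 3>>>();        // a grid of 2 blocks, each with 3 threads
    cudaDeviceSynchronize();     // wait for (and flush) the device-side printf output
    return 0;
}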
Thread Life Cycle in HW
Incomplete version
1. Grid is launched
kernelFoo<<< nBlk, nTid >>>(…);
2. Blocks are distributed to SMs
– Potentially more than one Block per SM
3. Each SM launches the threads in its block.
– One thread per core (SP)
4. As blocks complete, their resources are freed
Note: this is not the complete lifecycle. We are still missing the “warps” which are discussed later.
Thread Life Cycle in HW
Incomplete version
[Figure (based on NVIDIA): blocks of the launched grid are distributed across the SMs; each SM contains SPs, a parallel data cache, and texture units.]
Kernel Launch Configuration
• From the vector addition example:
vectorAdd<<<1, N>>>(…);
• This statement tells the GPU to launch N threads in 1 block.
• The general format:
kernelFunc<<< nBlocks, nThreadsPerBlock >>>(args);
• You can:
• Run as many blocks as you want at once (they all belong to the same grid)
• Each block can have a maximum of:
• 1024 threads on newer GPUs.
• 512 threads on older GPUs
• We cannot specify which block runs before which.
Kernel Launch Configuration
• You should choose the breakdown of threads and blocks that makes sense for your problem.
• Example: for a vector of 30 elements, a 1D setup gives options such as:
KernelFunc<<< 1, 30 >>>(…);   // 30 threads: 1 block with 30 threads
KernelFunc<<< 3, 10 >>>(…);   // 30 threads: 3 blocks, each with 10 threads
• Dimensionality of the above example: 1D blocks and 1D grid
• The x-dimension is used by default for 1D items
• Higher dimensionality can be defined using dim3 (more about this later)
Remember: each block is assigned to one SM. If you want to fully use the GPU, then #blocks should be ≥ # of SMs.
Vector Addition: Revisited
__global__ void vectorAdd(int* a, int* b, int* c, int n) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    int *a, *b, *c, *d_A, *d_B, *d_C;
    // ...allocate space on CPU and GPU
    // ...initialize a, b
    // ...copy a, b to GPU at d_A, d_B
    // launch the kernel
    vectorAdd<<<1, N>>>(d_A, d_B, d_C, N);
    // ...copy results back from d_C to c
    // ...free up memory
}
© NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign
Note: only the x-dimension is used. This launch means 1 grid with 1 block running on 1 SM, with N threads organized as a 1D array (over the x-dimension only).
Computing Array Index for Each Thread
Configuration 1: foo<<< 1, 30 >>>(…);   (1 block with 30 threads)
void foo(…){
    int i = threadIdx.x;
    …
}
With a single block, the thread ID alone identifies the array element. Example, computing i for v[2] and v[5]: thread T2 gets i = 2; thread T5 gets i = 5.

Configuration 2: foo<<< 3, 10 >>>(…);   (3 blocks, each with 10 threads)
void foo(…){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    …
}
With multiple blocks, thread IDs are only unique within a block, so the index must combine the block ID and the thread ID. Example (the figure uses blocks of 4 threads over a 12-element vector v): v[2] is handled by thread T2 of block 0, so i = 0*4 + 2 = 2; v[5] is handled by thread T1 of block 1, so i = 1*4 + 1 = 5.
[Figure: vector elements 0 to 11 partitioned across blocks of threads T0 to T3.]
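A small sketch to check the second formula (an assumed example, not from the slides; device-side printf requires compute capability 2.0 or higher):

#include <stdio.h>

__global__ void foo(void) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;                             // global element index
    if (blockIdx.x == 0 && threadIdx.x == 2) printf("v[2]: i = %d\n", i);      // 0*4+2 = 2
    if (blockIdx.x == 1 && threadIdx.x == 1) printf("v[5]: i = %d\n", i);      // 1*4+1 = 5
}

int main(void) {
    foo<<<3, 4>>>();             // 3 blocks of 4 threads, as in the figure
    cudaDeviceSynchronize();
    return 0;
}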
Computing Array Index for Each Thread
• The general formulas to compute the global thread index:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
• Use the formulas above in the kernel function to identify which data element is accessed by each thread.
• Example: using the formulas above, compute the (x, y) of the highlighted element; i.e. confirm that thread T1,1,0 in block B1,0,0 will only be accessing img[1][4].
Answer: for this thread (blockDim.x = 3, blockDim.y = 2):
x = 1*3 + 1 = 4   // col
y = 0*2 + 1 = 1   // row
so the thread accesses img[1][4].
[Figure: blocks of 3×2 threads laid over the 2D array img; the highlighted thread is T1,1,0 in block B1,0,0.]
Processing 100×70 Picture
__global__ void PicKrnl(float* d_Pin, float* d_Pout, int w, int h) {
    // Calculate row # of the d_Pin and d_Pout element to process
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Calculate column # of the d_Pin and d_Pout element to process
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread computes one element of d_Pout, if it is in range
    if ((y < h) && (x < w))
        d_Pout[y * w + x] = f(d_Pin[y * w + x]);
}
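One possible host-side launch for this kernel (a sketch; the 16×16 block size is an assumption, and dim3 launch configurations are covered in more detail in the next lecture):

dim3 dimBlock(16, 16);                       // 16 x 16 = 256 threads per block
dim3 dimGrid((100 - 1) / 16 + 1,             // 7 blocks to cover the 100 columns (x)
             (70 - 1) / 16 + 1);             // 5 blocks to cover the 70 rows (y)
PicKrnl<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, 100, 70);   // d_Pin, d_Pout already allocated on the GPU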
To set the launch configuration for an array of N data elements:
1. Choose the number of threads per block (nthreads).
2. Compute the number of blocks as follows:
nblocks = (N-1)/nthreads + 1
Note (again): if you want to fully use the GPU, then
• #threads per block should be large (≥ #SPs per SM)
• #blocks should be ≥ # of SMs
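A short sketch of this recipe applied to a vector kernel (the choice nthreads = 256 and the kernel/pointer names are assumptions for illustration):

int nthreads = 256;                           // 1. threads per block
int nblocks  = (N - 1) / nthreads + 1;        // 2. enough blocks to cover all N elements
vectorAdd<<<nblocks, nthreads>>>(d_A, d_B, d_C, N);
// inside the kernel: i = blockIdx.x * blockDim.x + threadIdx.x, guarded by if (i < N),
// because nblocks * nthreads may be larger than N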
Launch Configuration Examples
Assume we choose nthreads = 256 (i.e. # threads per block):
• # of array elements N = 200:
nblocks = 199/256 + 1 = 1  → total # threads = 256
• # of array elements N = 256:
nblocks = 255/256 + 1 = 1  → total # threads = 256
• # of array elements N = 400:
nblocks = 399/256 + 1 = 2  → total # threads = 512
Note: use if (i < N) inside the kernel so that the extra threads do not access elements past the end of the array.
[Figure: a 1D array of blocks (B0,0,0, B1,0,0), each containing a 1D array of threads.]
Vector Addition: Full Code
Task: rewrite the serial program below so that vectorAdd runs on the GPU with 4 blocks, each having 256 threads.
SERIAL CODE
#include <stdlib.h>
#define N 1024
void vectorAdd(int* a, int* b, int* c, int n) {
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
int main() {
    int *a, *b, *c, i;
    a = (int*) malloc(N * sizeof(int)); b = (int*) malloc(N * sizeof(int)); c = (int*) malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        // ... initialize a[i] and b[i] (values elided in the original)
    }
    vectorAdd(a, b, c, N);
    // ... use c, then free a, b, c
}
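One possible GPU rewrite with 4 blocks of 256 threads (4 × 256 = 1024 = N). This is a sketch, not the official solution: error checking is omitted and the initialization values are placeholders.

#include <stdlib.h>
#define N 1024

__global__ void vectorAdd(int* a, int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global index of this thread
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    int *a, *b, *c, *d_A, *d_B, *d_C, i;
    int num_bytes = N * sizeof(int);
    a = (int*) malloc(num_bytes);                                // host (CPU) memory
    b = (int*) malloc(num_bytes);
    c = (int*) malloc(num_bytes);
    cudaMalloc(&d_A, num_bytes);                                 // device (GPU) memory
    cudaMalloc(&d_B, num_bytes);
    cudaMalloc(&d_C, num_bytes);
    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }          // placeholder initialization
    cudaMemcpy(d_A, a, num_bytes, cudaMemcpyHostToDevice);       // copy inputs to the GPU
    cudaMemcpy(d_B, b, num_bytes, cudaMemcpyHostToDevice);
    vectorAdd<<<4, 256>>>(d_A, d_B, d_C, N);                     // 4 blocks x 256 threads
    cudaMemcpy(c, d_C, num_bytes, cudaMemcpyDeviceToHost);       // copy the result back
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);                 // free GPU memory
    free(a); free(b); free(c);                                   // free CPU memory
    return 0;
}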