Intro to Parallel Computing
Topic 13: CUDA Threads – Part 2
COSC 407: Intro to Parallel Computing
Previously:
– Error Handling, cudaDeviceSynchronize
– Hardware architecture: SP → SM → GPU
– Thread Organization: threads → blocks → grids
• Dimension variables (blockDim, gridDim)
– Thread Life Cycle From the HW Perspective
– Kernel Launch Configuration: 1D grids/blocks
– Kernel Launch Configuration: nD grids/blocks
– CUDA limits
– Thread Cooperation
– Running Example: Matrix Multiplication
Slide materials based on ECE 408, University of Illinois at Urbana-Champaign, © 2007–2010 NVIDIA and Wen-mei W. Hwu
Higher Dimensional Grids / Blocks
Remember: choose the breakdown into threads and blocks that makes sense for your problem. Example:
• Assume you want to process a 100-pixel x 70-pixel image
(each thread processes 1 pixel).
• We have many options, e.g.:
Option: (1 block/row, 1 thread/pixel)
A grid of 1×70 blocks (gx=1, gy=70)
each block with 100×1 threads (dx=100,dy=1)
Another Option (1 block/segment)
A grid of 10×7 blocks (gx=10, gy=7) each block with 10×10 threads
Higher Dimensional Grids/Blocks
kernelFunction <<< gridSize , blockSize >>>
• gridSize: dimension and size of the grid in terms of blocks. Could be one of the following:
– dim3(gx, gy, gz) → in case of 3D grid
» where gx, gy, gz define the three dimensions
– dim3(gx, gy) → in case of 2D grid
» equivalent to dim3(gx, gy, 1)
– dim3(gx), or an integer → in case of 1D grid
» equivalent to dim3(gx,1,1) or simply gx (the integer)
» e.g., dim3(8,1,1) = dim3(8) = 8
• blockSize: dimension and size of each block in threads.
– dim3(bx, by, bz) → in case of 3D block
– dim3(bx, by) → in case of 2D block
– dim3(bx), or an integer → in case of 1D block
Hello Again….
__global__ void hello(){
    printf("Thread(%d,%d,%d) in Block(%d,%d,%d) says:Hello!\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x, blockIdx.y, blockIdx.z);
}
int main(){
    hello<<<dim3(2), dim3(2)>>>();  // same as hello<<<2,2>>>()
    cudaDeviceSynchronize();        // force the printf() in device to flush here
    printf("That's all!\n");
    return 0;
}
COSC 407: Intro to Parallel Computing
What is the output?
Thread(0,0,0) in Block(0,0,0) says:Hello!
Thread(1,0,0) in Block(0,0,0) says:Hello!
Thread(0,0,0) in Block(1,0,0) says:Hello!
Thread(1,0,0) in Block(1,0,0) says:Hello!
That's all!
Hello Again (same function)
__global__ void hello(){
    printf("Thread(%d,%d,%d) in Block(%d,%d,%d) says:Hello!\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x, blockIdx.y, blockIdx.z);
}
int main(){
    dim3 gridSize(2,1,1), blockSize(2,1,1);
    hello<<<gridSize, blockSize>>>();
    cudaDeviceSynchronize();   // force the printf() in device to flush
    printf("That's all!\n");
    return 0;
}
Thread(0,0,0) in Block(0,0,0) says:Hello!
Thread(1,0,0) in Block(0,0,0) says:Hello!
Thread(0,0,0) in Block(1,0,0) says:Hello!
Thread(1,0,0) in Block(1,0,0) says:Hello!
That’s all!
Aside: printf on the kernel?
▪ Yes, although not a great idea…
– Specific use cases only
▪ Need to use cudaDeviceSynchronize()
– Kernel runs asynchronously from the host
– See the code in the previous slide
Computing # of Blocks for 2D Grids
Let's say we have an image of size WIDTH x HEIGHT,
and assume we use 2D blocks of TILE_WIDTH x TILE_HEIGHT threads
– E.g., TILE_WIDTH = TILE_HEIGHT = 32, totaling 1024 threads
How do we determine the grid & block organization?
//block dimensions
int TILE_WIDTH = 32;  //num of threads along x
int TILE_HEIGHT = 32; //num of threads along y
dim3 blocksize(TILE_WIDTH, TILE_HEIGHT);

//grid dimensions
int nblk_x = (WIDTH - 1) / TILE_WIDTH + 1;
int nblk_y = (HEIGHT - 1) / TILE_HEIGHT + 1;
dim3 gridsize(nblk_x, nblk_y);

//launch kernel
kernel<<<gridsize, blocksize>>>(…);

__global__ void kernel(…){
    int x = blockIdx.x * blockDim.x + threadIdx.x; //pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y; //pixel row
    …
}
In this example: nblk_x = 7 nblk_y = 5
CUDA Limits
For CUDA compute capability 3.0+:
▪ Within a grid:
– Total of (2³¹ − 1) x 65535 x 65535 blocks
• Maximum x-dimension of a grid: 2³¹ − 1
• Maximum y- and z-dimension of a grid: 2¹⁶ − 1 (= 65535)
– That is, launch as many blocks as you want (almost)!
▪ Within a block:
– Maximum total number of threads per block:
• 1024 (or 512 on older GPUs supporting compute capability < 2.0)
– Maximum dimension of a block (# of threads per dimension):
• x- or y-dimension: 1024 (or 512)
• z-dimension: 64
▪ The first assignment on CUDA walks you through this (see last lecture)
▪ Check full specs here.
• Organization of OpenMP threads vs. CUDA threads?
• OpenMP:
• number of threads p is close to the number of processors
• CUDA:
• many, many threads, organized in 1D, 2D or 3D arrays
Thread Cooperation
▪ Threads in same block can cooperate
– Synchronize their execution
– Communicate via shared memory
– thread/block index is used to assign work and address shared data
▪ Threads in different blocks cannot cooperate
– Blocks can execute in any order relative to other blocks.
– There is no native way to synchronize all threads in all blocks.
• To synchronize threads in all blocks, terminate your kernel at the synchronization point, then launch a new kernel that continues the job (more about this later)
Thread Cooperation
▪ For now, all you need to remember is:
– All threads in all blocks run the same kernel.
– Threads within the same block cooperate via shared memory, atomic operations and barrier synchronization.
– Threads in different blocks CANNOT cooperate.
[Figure: Block 0, Block 1, …, Block N − 1, each with threadID 0 1 2 3 4 5 6 7, and every block running the same code:]
float x = input[threadID];
float y = func(x);
output[threadID] = y;
We will discuss more about this later
Matrix Multiplication
• A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
• Assume square matrix for simplicity
• For now, we will discuss
• Memory data transfer API between host and device
• Thread ID usage
• Later
• How to speed up performance
Programming Model
P = M * N
Size is WIDTH x WIDTH
Each thread calculates one element of P
M and N are loaded WIDTH times from global memory
How Each Element in P is Computed
One thread per element in P
48 = 3*2 + 2*4 + 5*2 + 4*6
Serial Code
// Matrix multiplication on the (CPU) host
void MatrixMulSerial(float* M, float* N, float* P, int width) {
    //for each element P[r][c]
    for (int r = 0; r < width; r++)
        for (int c = 0; c < width; c++) {
            float sum = 0;
            for (int k = 0; k < width; k++)
                sum += M[r*width + k] * N[k*width + c];
            P[r*width + c] = sum;
        }
}
//… (earlier steps — allocating device matrices, copying M and N to the
//    device, and launching the kernel — are not shown in this extract)
//3) Copy the result matrix back to host
cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
//4) Free device matrices
cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
Using Multiple Blocks
▪ We saw that using only one block has a serious limitation: the matrix size is capped by the 1024-threads-per-block limit.
▪ Also, you are not fully using your GPU
▪ Solution: use multiple blocks
– We shall apply the method explained previously
▪ More on this next lecture
Remember…
Why do we need to divide threads into blocks within the grid?
• To make the thread organization better fit the problem
• e.g., 2D blocks for 2D images
• To satisfy CUDA limits (only 1024 threads per block)
• We also need to avoid GPU hardware limits
• For example, G80 has 16 SMs
• Each SM can process up to 8 blocks at a time and up to 768 threads at a time (more later)
• To exploit the GPU's full power
• e.g., using one block means one SM is working while the remaining SMs are idle
• To allow threads to communicate at different levels
• Threads within the same block have "shared memory" and can
sync. Threads in different blocks cannot sync (at least not directly) and can only share data through global memory.
• More about this next…
Next:
– Kernel Launch Configuration: nD grids/blocks
– CUDA limits
– Thread Cooperation
– Running Example: Matrix Multiplication
– CUDA Scalability
– Thread Scheduling on the H/W: Thread Lifecycle
• zero-overhead and latency tolerance
– GPU limits
– CUDA Memories Types (and Performance)
– Example: Improving Performance of Matrix Multiplication