COMP Distributed
Introduction to GPU Programming
References
– NVIDIA GPU Educators Program – https://developer.nvidia.com/educators
– NVIDIA’s Academic Programs
– https://developer.nvidia.com/academia
– The contents of the ppt slides are mainly copied from the following book and its accompanying teaching materials:
– David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 3rd edition, Morgan Kaufmann, 2016
Architecture of a Modern GPU
Architecture of a Modern GPU
NVIDIA Fermi, 512 Processing Elements (SPs)
Architecture of a Modern GPU
NVIDIA Fermi, Streaming Multiprocessor (SM)
Architecture of a Modern GPU (cont.)
For newer-generation GPUs, e.g., Pascal:
Each streaming multiprocessor (SM) has
128 cores (streaming processors, or SPs)
96KB of shared memory
48KB L1 cache
Up to 2K threads
Each chip may have 10 – 20 SMs, i.e., 1280 to 2560 cores, and 6 – 8GB of memory
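These per-SM figures vary by GPU generation; as a quick check on a given card, here is a minimal sketch using the CUDA runtime's device-query call (the printed fields are a selection, and device 0 is assumed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // properties of device 0
    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Shared memory per SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Global memory: %zu GB\n", prop.totalGlobalMem >> 30);
    return 0;
}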
CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
– Serial parts in host C code
– Parallel parts in device kernel code
CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
Serial parts in host C code
Parallel parts in device kernel code
SIMD execution combined with multithreading (Single Instruction, Multiple Threads, or SIMT)
Needs fine-grained parallelism
Serial Code (host)
Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);
Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    for (int i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}
Vector Addition – CUDA Host Code (outline)
void vecAdd(float *h_A, float *h_B, float *h_C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    // Allocate device memory for A, B, and C
    // Copy A and B to device memory
    // Kernel launch code – the device performs the actual vector addition
    // Copy C from the device memory
    // Free device vectors
}
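One way the commented steps might be filled in – a minimal sketch, assuming the vecAddKernel defined later in these slides and 256 threads per block:

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n);   // defined later

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device memory for A, B, and C
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy A and B to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel launch – the device performs the actual vector addition
    vecAddKernel<<<(n - 1) / 256 + 1, 256>>>(d_A, d_B, d_C, n);

    // Copy C from device memory back to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device vectors
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}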
Data Parallelism – Vector Addition
• Partition data and then associate data with computational tasks
• Create multiple threads to perform tasks
• Think of multiple operations being conducted simultaneously on multiple SPs!
[Figure: vector A, vector B and vector C partitioned element-wise, one addition per thread]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program
Multiple Data)
– thread array divided into multiple blocks which are distributed to different SMs
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
– Each thread has indexes that it uses to compute memory addresses and make control decisions
i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i];
Thread Blocks: Scalable Cooperation
Thread Block 0, Thread Block 1, …, Thread Block N-1: every block runs the same code on its own portion of the data
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
Transparent Scalability
[Figure: the same thread grid of blocks assigned to devices with different numbers of SMs]
– Each block can execute in any order relative to others.
– Hardware is free to assign blocks to any processor at any time
– A kernel scales to any number of parallel processors
Executing Thread Blocks
– Threads are assigned to Streaming Multiprocessors (SM) in block granularity
– Up to 8 blocks to each SM as resource allows
– Fermi SM can take up to 1536 threads
– Could be 256 (threads/block) * 6 blocks
– Or 512 (threads/block) * 3 blocks, etc.
– SM maintains thread/block idx #s
– SM manages/schedules thread execution
[Figure: an SM running threads t0 t1 t2 … tm of its assigned blocks, with per-SM shared memory]
Warps as Scheduling Units
• Each Block is executed as 32-thread Warps
– An implementation decision, not part of the CUDA programming model
– Warps are scheduling units in SM
– Threads in a warp execute in SIMD
– Future GPUs may have a different number of threads in each warp
• If 3 blocks are assigned to a SM and each block has 256 threads, how many Warps are there in an SM?
– Each Block is divided into 256/32 = 8 Warps
– There are 8 * 3 = 24 Warps
[Figure: Block 0, Block 1 and Block 2 warps, each of threads t0 t1 t2 … t31, sharing the SM's register file and L1/shared memory]
Thread Scheduling
– SM implements zero-overhead warp scheduling
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible Warps are selected for execution based on a prioritized scheduling policy
– All threads in a warp execute the same instruction when selected
Vector Addition Kernel
Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Total number of threads = grid dimension (total number of blocks) * block dimension (number of threads per block)
Example: Vector Addition Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
The ceiling function makes sure that there are enough threads to cover all elements.
Kernel execution in a nutshell
void vecAdd(...)
{
    // d_A, d_B, d_C allocations and copies omitted
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
More on Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
This is an equivalent way to express the ceiling function.
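For example, with n = 1000 both forms launch 4 blocks: ceil(1000/256.0) = 4, and in integer arithmetic (1000-1)/256 + 1 = 3 + 1 = 4. The 4 * 256 = 1024 threads cover all 1000 elements, and the extra 24 threads are disabled by the kernel's if (i < n) test.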
Covering a 62×76 Picture with 16×16 Blocks
Not all threads in a Block will follow the same control flow path.
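For example, covering a 62×76 picture with 16×16 blocks takes ceil(62/16) * ceil(76/16) = 4 * 5 = 20 blocks, i.e., 20 * 256 = 5120 threads for only 62 * 76 = 4712 pixels; the 408 threads that fall outside the picture fail the kernel's range test and do nothing.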
Row-Major Layout in C/C++
[Figure: a 4-column matrix M (M0,0 … M2,3) stored in row-major order; element M2,1 is at linear index Row*Width+Col = 2*4+1 = 9]
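As a small illustrative helper (not from the slides), row-major indexing of a flat 1-D array looks like this; element (2, 1) of a 4-column matrix maps to index 2*4 + 1 = 9:

// Element (Row, Col) of a Width-column matrix stored row-major in a flat array
float getElement(const float *M, int Row, int Col, int Width)
{
    return M[Row * Width + Col];
}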
Source Code of a PictureKernel
__global__ void PictureKernel(float* d_Pin, float* d_Pout,
                              int height, int width)
{
    // Calculate the row # of the d_Pin and d_Pout element
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column # of the d_Pin and d_Pout element
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    // Each thread computes one element of d_Pout if in range
    if ((Row < height) && (Col < width)) {
        d_Pout[Row*width+Col] = 2.0*d_Pin[Row*width+Col];
    }
}
Scale every pixel value by 2.0
Host Code for Launching PictureKernel
// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device
dim3 DimGrid((n-1)/16 + 1, (m-1)/16+1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, m, n);
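A minimal sketch (not from the slides) of the allocation and copy steps assumed above, wrapped in a hypothetical host function scalePicture; h_Pin and h_Pout are host buffers of m*n floats:

void scalePicture(float *h_Pin, float *h_Pout, int m, int n)
{
    int size = m * n * sizeof(float);
    float *d_Pin, *d_Pout;

    // Allocate device memory and copy the input picture to the device
    cudaMalloc((void**)&d_Pin, size);
    cudaMalloc((void**)&d_Pout, size);
    cudaMemcpy(d_Pin, h_Pin, size, cudaMemcpyHostToDevice);

    // Launch enough 16x16 blocks to cover the m x n picture
    dim3 DimGrid((n-1)/16 + 1, (m-1)/16 + 1, 1);
    dim3 DimBlock(16, 16, 1);
    PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, m, n);

    // Copy the scaled picture back and free device memory
    cudaMemcpy(h_Pout, d_Pout, size, cudaMemcpyDeviceToHost);
    cudaFree(d_Pin); cudaFree(d_Pout);
}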
Review – Typical Structure of a CUDA Program
– Kernel function
– __global__ void kernelOne(args…){}
– allocate memory space on the device – cudaMalloc((void**)&d_GlblVarPtr, bytes)
– transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
– kernel call – kernelOne<<<execution configuration>>>(args…)
– transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …) as needed
– optional: compare against golden (host computed) solution
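A minimal sketch (not from the slides) of a test driver following this structure: it fills two host vectors, calls the vecAdd() from the earlier slides, and compares against a host-computed (golden) result:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n);   // from the earlier slides

int main(void)
{
    int n = 1000;
    float *h_A = (float*)malloc(n * sizeof(float));
    float *h_B = (float*)malloc(n * sizeof(float));
    float *h_C = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_A[i] = i; h_B[i] = 2.0f * i; }

    vecAdd(h_A, h_B, h_C, n);                              // device computation

    int errors = 0;                                        // compare against golden solution
    for (int i = 0; i < n; i++)
        if (fabsf(h_C[i] - (h_A[i] + h_B[i])) > 1e-5f) errors++;
    if (errors) printf("FAILED: %d mismatches\n", errors);
    else        printf("PASSED\n");

    free(h_A); free(h_B); free(h_C);
    return 0;
}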
Review – Thread Grid, Blocks and Warps
– A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program Multiple Data)
– Thread array divided into multiple blocks which are distributed to different SMs (8 – 12 blocks/SM)
– Multiple dimensional blocks/grid and multiple dimensional threads/block
– Threads in different blocks do not interact
– Threads use their indexes to compute memory addresses and make control decisions (i = blockIdx.x * blockDim.x + threadIdx.x)
– Each Block is executed as 32-thread Warps
– Warps are scheduling units in SM
– Threads in a warp execute in SIMD
– Grid size is application dependent
– Block size is machine dependent
Review – CUDA Memories
– Registers (per-thread)
– Shared memory (per-block)
– Global memory (all threads)
– Memory coalescing: locality across threads for one instruction
– Shared memory is ~100x faster than global memory: load data into shared memory once, then reuse it many times (see the sketch below)
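A minimal sketch (not from the slides) of this load-once, reuse-many-times pattern: a hypothetical 3-point averaging kernel in which each input element staged in shared memory is read by up to three threads of the block (assumes a launch with BLOCK_SIZE threads per block):

#define BLOCK_SIZE 256

__global__ void avg3Kernel(const float *in, float *out, int n)
{
    // Tile holds this block's elements plus one halo element on each side
    __shared__ float tile[BLOCK_SIZE + 2];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;                       // +1 leaves room for the left halo

    // Each thread loads one element from global into shared memory
    tile[t] = (i < n) ? in[i] : 0.0f;

    // The first and last threads of the block also load the halo elements
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK_SIZE + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();                               // barrier: tile is fully loaded

    // Each shared-memory element is now reused by neighbouring threads
    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}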
[Figure: CUDA memory hierarchy – per-thread registers, per-block shared memory in Block (0, 0) and Block (1, 0), and global and constant memory accessible to all threads]
Compiling A CUDA Program
Integrated C programs with CUDA extensions (e.g., myprog.cu)
NVCC Compiler (nvcc -o myprog myprog.cu)
Host Code → Host C Compiler
Device Code → Just-in-Time Compiler
Heterogeneous Computing Platform with CPUs, GPUs, etc.