High Performance Computing
GPU and CUDA
Dr Ligang He
Architecture of GPU
– Contains a number of Streaming Multiprocessors (SM)
• Nvidia GeForce RTX 3090 has 82 SMs
– A SM contains a number of GPU cores
• In RTX 3090, each SM has 128 cores
• In total, there are 10496 cores in RTX 3090.
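The SM count of the device at hand can be queried at run time through the CUDA runtime API; a minimal sketch (device 0 is assumed, and the printed values depend on the GPU installed):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                       // query properties of device 0
    printf("Device name  : %s\n", prop.name);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
    // The number of cores per SM is architecture-specific (e.g., 128 FP32 cores per SM
    // on the Ampere GA102 in the RTX 3090) and is not reported directly by the API.
    return 0;
}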
Graphics Processing Unit
SM Internal in RTX 3090 (Ampere architecture)
GPU is a PCI-e peripheral device
• Connected to the host via a PCI-e slot
Performance Trend
For massively parallel workloads, a GPU is typically ~100x more powerful than a multicore CPU
Why is there such performance gap?
qBecause of the differences in the design between GPU and CPU
Design of CPU
The design objective of CPU is to optimize the performance of a sequential code
Control unit is important in CPU
– Obtain instructions from memory
– Interpret the instructions
– Figure out what data are needed and where it is stored
– Signal the functional units (ALUs) to run the instructions
Design of CPU
The design objective of CPU is to optimize the performance of a sequential code
Control unit is important in CPU: the CPU has complicated control units
– Execute instructions out of their sequential order (single core) or in parallel (multicore)
– Branch prediction
– Data forwarding
Design of CPU
The design objective of CPU is to optimize the performance of a sequential code
Has complicated control unit
Has large cache to reduce data access latencies
Has powerful ALU
Design Objective of CPU
Latency-oriented design
qComplicated control unit
qLarge caches on CPU chip (on-chip cache)
qPowerful ALU
qThese come at the cost of increased chip area and power consumption
Applications with one or very few threads achieve higher performance on a CPU
OR gate with transistors
Motivation of GPU Design
– Video game industry: need to perform a massive number of floating-point and integer calculations per video frame
– Motivate GPU vendors to use most chip area for floating-point/integer calculations
qEach calculation is simple
• Simple control unit and simple ALUs
qCalculation is more important than cache
• Small cache, allowing data access to have long latency
GPU Design
– Use a large number of simple ALUs to increase the total throughput
q Throughput is the number of instructions completed in a time unit
q The application runs with a large number of threads
q While some threads are waiting for long-latency operations (e.g., memory access), GPU can always find other threads to run
q Throughput-oriented design: maximize the total throughput of a large number of threads, allowing individual threads to take longer time
GPU vs. CPU in Architecture
• CUDA is the most popular programming model for writing parallel programs on GPU
• developed by NVIDIA
CUDA keywords and kernel
A CUDA program has two parts of code
Host code: the part of the code that runs on the CPU
Device code: the part of the code that runs on the GPU
The functions that will be run on the GPU device are marked with CUDA keywords
A function that is run on GPU is called a kernel function
Keywords in function declaration
– __global__: a kernel function to be executed on the GPU
– __host__: a function that runs on the host
– By default, all functions are host functions if they do not have any CUDA keywords
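A minimal sketch of how these keywords appear in declarations (the function names are illustrative only):

__global__ void myKernel(float* data, int n);  // kernel: runs on the GPU, launched from host code
__host__ void hostHelper(void);                // runs on the CPU (the __host__ keyword is optional)
void plainFunction(void);                      // no CUDA keyword: a host function by default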
An Example of a CPU Program
// Compute vector sum h_C = h_A + h_B
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    for (int i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main() {
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements each
    ...
    vecAdd(h_A, h_B, h_C, N);
}
A simple traditional vector addition C code example.
Outline of a vecAdd() function for GPU
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n) {
    // Part 1: Allocate device memory for A, B, and C, and copy A and B from host memory to device memory
    // Part 2: Launch kernel code to perform the actual operation on the GPU
    // Part 3: Copy the result C from device memory to host memory; free the device vectors
}
Parts 1&3: dealing with GPU memory
• Allocate GPU memory
• Copy the data from CPU memory to GPU memory
• Copy the result in GPU memory back to CPU memory
Memory Management in GPU
cudaMalloc(void** devPtr, size_t size)
Allocates device memory on the GPU
Two parameters
– devPtr: the address of a pointer variable, which will be set to point to the allocated device memory
– size: size of the allocated memory in bytes
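For example, allocating space for n floats on the device could look as follows (n is assumed to hold the number of elements, and <stdio.h> is assumed to be included; checking the returned error code is optional but useful):

float *d_A;
cudaError_t err = cudaMalloc((void**)&d_A, n * sizeof(float));
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));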
Memory Management in GPU
cudaMemcpy(dst, src, count, kind)
Memory data transfer
Four parameters
– 1. destination location of the data to be copied
– 2. source location of the data
– 3. size of the data
– 4. The type of copy: host to host, host to device, device to device, or device to host
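The kind argument is one of the cudaMemcpyKind constants; for example (d_A, d_B and d_C are assumed to be device pointers, A and C host pointers):

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);     // host   -> device
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);     // device -> host
cudaMemcpy(d_B, d_A, size, cudaMemcpyDeviceToDevice); // device -> device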
void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc(&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc(&d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc(&d_C, size);

    // Part 2: Launch kernel code to perform the actual operation on the GPU

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
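These API calls all return a cudaError_t, so in practice they are often wrapped in a small checking macro. A sketch of such a macro is shown below; it is not part of the CUDA API itself and assumes <stdio.h> and <stdlib.h> are included:

#define CHECK(call)                                                  \
    {                                                                \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("CUDA error %s at %s:%d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                 \
        }                                                            \
    }

// Usage: CHECK(cudaMalloc(&d_A, size));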
Part 2: Launch and Run the Kernel Code
Launch and execute the kernel function
Various related issues in Part 2:
q Execution model of the kernel function
q Thread structure
q Execution configuration
q Workload distribution
Part 2: Launch and Run the Kernel Function
Various related issues in Part 2:
q Execution model of the kernel function
q Thread structure
q Execution configuration
q Workload distribution
Execution Model of GPU
The execution starts with host (CPU) execution
When a kernel function is called, it is executed by a large number of threads on the GPU
All the threads are collectively called a grid
When all threads of a kernel complete their execution, the corresponding grid terminates
The execution continues on the host until another kernel is called
CPU serial code → GPU parallel kernel KernelA<<< nBlK, nTid >>>(args); → CPU serial code → GPU parallel kernel KernelA<<< nBlK, nTid >>>(args); → …
FIGURE 3.3 Execution of a CUDA program.
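One detail worth noting: a kernel launch is asynchronous with respect to the host, i.e., the host thread continues immediately after issuing the launch. If the host must wait for the grid to finish (for example before timing it or using its results), it can call cudaDeviceSynchronize(); a minimal sketch using the names from the figure above:

KernelA<<<nBlK, nTid>>>(args);   // host thread returns immediately after the launch
cudaDeviceSynchronize();         // block the host until all threads of the grid have completed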
Launch and execute the kernel function
Various related issues in Part 2:
q Execution model of the kernel function
q Thread structure
q Execution configuration
q Workload distribution
Thread Structure for Running a Kernel
When a host code launches a kernel, CUDA generates a grid of thread blocks
Each block contains the same number of threads
Each thread runs the same kernel function

Block 0, Block 1, …, Block N-1, each containing threads 0, 1, 2, …, 255, and every thread executes:
i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
FIGURE 3.10 All threads in a grid execute the same kernel code.
Thread Organization
Two-level architecture: threads are organized into a grid of blocks
The grid and blocks can be multidimensional
Thread Organization
– gridDim(x, y, z): the dimensions of the grid
– blockDim(x, y, z): the dimensions of a block
– blockIdx(x, y, z): the coordinate (ID) of the block in the grid; it can be accessed by the calling thread to find out which block it is in
– threadIdx(x, y, z): the local coordinate (ID) of a thread in its block; it can be accessed by the calling thread to obtain its local position in the block
Built-in variables
gridDim: the dimensions of the grid
blockDim: the dimensions of the block
blockIdx: the block index within the grid
All the threads in a block share the same blockIdx value
threadIdx: the thread index within the block
Their values are preinitialized by the CUDA runtime library when invoking the kernel function
Can be accessed in the kernel function
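For example, a kernel launched with a 1D grid of 1D blocks can combine these built-in variables to compute a unique global index for each thread (the kernel name and output array here are illustrative):

__global__ void whoAmI(int* out) {
    // blockIdx.x selects the block, threadIdx.x the position within it
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalId] = globalId;    // each thread writes its own global index
                                 // (out must hold at least gridDim.x * blockDim.x elements)
}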
Launch and execute the kernel function
Various related issues in Part 2:
q Execution model of the kernel function
q Thread structure
q Execution configuration
q Workload distribution
Execution configuration of kernel launch
Execution configuration is specified when invoking a kernel function
Set two parameters between <<< and >>> before the function parameters, for example
dim3 grid(3, 2, 4), block(128, 1, 1);
vecAdd<<<grid, block>>>(A, B, C);
First parameter defines the grid dimension
The second parameter defines the block dimension
Execution Configuration
– When the following kernel function is called:
dim3 grid(3, 2, 4), block(128, 1, 1);
vecAdd <<< grid, block>>>(A, B, C);
– Dimensions values are stored in the built-in variables gridDim and blockDim
– The built-in variables are initialized:
– gridDim.x = 3, gridDim.y = 2, gridDim.z = 4
– blockDim.x = 128, blockDim.y = 1, blockDim.z = 1
– Question: how many threads will be generated?
– Answer: 3*2*4*128 = 3072
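The total follows directly from the two dim3 values; a small host-side sketch of the calculation:

dim3 grid(3, 2, 4), block(128, 1, 1);
int blocksPerGrid   = grid.x * grid.y * grid.z;        // 3 * 2 * 4   = 24 blocks
int threadsPerBlock = block.x * block.y * block.z;     // 128 * 1 * 1 = 128 threads per block
int totalThreads    = blocksPerGrid * threadsPerBlock; // 24 * 128    = 3072 threads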
Execution configuration
dim3 grid(3, 2, 4), block(128, 1, 1);
vecAdd<<<grid, block>>>(A, B, C);
If single values are specified in the execution configuration, it means the grid or the block is specified as 1D
For example,
int vecAdd(float* A, float* B, float* C, int n) {
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
FIGURE 3.13
A vector addition kernel function and its launch statement.
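Note that ceil(n/256.0) is evaluated on the host; an equivalent formulation using integer arithmetic, which avoids the floating-point call, is sketched below (not from the original slide):

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // integer ceiling of n/256
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);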
Launch and execute the kernel function
Various related issues in Part 2:
q Execution model of the kernel function
q Thread structure
q Execution configuration
q Workload distribution
Workload distribution
• Use different threads to process different data items
• Workload distribution: which threads run which data items?
Match threads to data items
– Assume the following grid of blocks are generated to compute C_d=A_d+B_d
gridDim(x, y, z) = (N, 1, 1), blockDim(x, y, z) = (256, 1, 1), blockIdx = (x, 0, 0), threadIdx = (x, 0, 0)
– Question: how to allocate a thread (threadIdx) to calculate element i: A_d[i] + B_d[i]?
Block 0, Block 1, …, Block N-1, where every thread executes:
i = blockIdx.x * blockDim.x + threadIdx.x;
C_d[i] = A_d[i] + B_d[i];
(FIGURE 3.10)
Match threads to data items
– Assume the following grid of blocks are generated to compute C_d=A_d+B_d
– Question: how to allocate a thread (threadIdx) to compute A_d[i] + B_d[i]?
All threads in a grid execute the same kernel function
The threads use their indices (i.e., blockIdx and threadIdx) to
– distinguish themselves from each other
– identify the appropriate part of the data for processing

Kernel Execution for Vector Addition
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// In the host vecAdd() function, after Part 1 (allocations and copies) and the Part 2
// kernel launch vecAddKernel<<<ceil(n/256.0), 256>>>(dA, dB, dC, n), Part 3 follows:
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost); //Part 3
cudaFree(dA); cudaFree(dB); cudaFree(dC); }
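For completeness, a minimal host program that could drive vecAdd() is sketched below (the array size and fill values are illustrative only, and <stdio.h> and <stdlib.h> are assumed to be included):

int main() {
    const int N = 1000;
    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { A[i] = (float)i; B[i] = 2.0f * i; }   // fill the inputs
    vecAdd(A, B, C, N);                                                 // add the vectors on the GPU
    printf("C[10] = %f (expected %f)\n", C[10], A[10] + B[10]);         // quick sanity check
    free(A); free(B); free(C);
    return 0;
}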
Compilation Process of a CUDA Program
– NVCC compiler uses the CUDA keywords to separate the host code and device code
– The host code is further compiled with a standard C compiler and runs as a CPU process
– A device code is first compiled by NVCC to PTX code
– The PTX code is further compiled by NVCC to executable
FIGURE 3.2 Overview of the compilation process of a CUDA program: integrated C programs with CUDA extensions are split by the NVCC compiler into host code (handled by the host C preprocessor, compiler and linker) and device code in PTX (handled by the device just-in-time compiler), yielding an application for a heterogeneous computing platform with CPUs and GPUs.
Timing the GPU code
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
cudaEventRecord(stop, 0); cudaEventSynchronize(stop);
float time;
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start); cudaEventDestroy(stop);
– Using Events for timing on GPU
– cudaEventRecord() is used to record a time stamp for the event when this function is reached in the program.
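– cudaEventElapsedTime() returns the time between the two recorded events in milliseconds; calling cudaEventSynchronize(stop) before it ensures the stop event has actually completed, since kernel launches and event recording are asynchronous with respect to the host.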