
CUDA

– CUDA is the most popular programming model for writing parallel programs that run on GPUs

– It was developed by NVIDIA


CUDA keywords and kernel

– A CUDA program has two parts

– Host code: the part that runs on the CPU

– Device code: the part that runs on the GPU

– The functions that will run on the GPU device are marked with CUDA keywords

– A function that runs on the GPU is called a kernel function


Keywords in function declaration

– __global__

– A global function is a kernel function to be executed on the GPU

– It can only be called from host code

– __host__

– A host function (e.g., a traditional C function) executes on the host

– It can only be called from another host function

– By default, all functions are host functions if they do not have any CUDA keywords
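As a minimal sketch (not from the original slides; the function names are illustrative), the keywords are placed in front of the function's return type:

__global__ void myKernel(float* data, int n)
{
    // device code: runs on the GPU, launched from host code
}

__host__ void myHelper(void)
{
    // host code: __host__ can be omitted, since functions without
    // CUDA keywords are host functions by default
}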


An Example of CPU Code
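The original slide shows this code as an image; a minimal sketch of the conventional sequential implementation of vector addition would be:

// compute C = A + B element by element on the CPU
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}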


Outline of a vecAdd() function for GPU

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    // Part 1: Allocate device memory for A, B, and C; copy A and B
    // from host memory to device memory

    // Part 2: Launch the kernel to perform the actual operation on the GPU

    // Part 3: Copy the result C from device memory to host memory;
    // free the device vectors
}


Part 1 and Part 3: dealing with GPU memory

– Allocate GPU memory

– Copy the data from CPU memory to GPU memory

– Copy the result in GPU memory back to CPU memory


Memory Management in GPU

– cudaMalloc(void** devPtr, size_t size)

– Allocates device global memory

– Two parameters

– devPtr: a pointer to the pointer that will hold the address of the allocated memory

– size: the size of the allocated memory in bytes


Memory Management in GPU

– cudaMemcpy(dst, src, count, kind)

– Memory data transfer

– Four parameters

– 1. destination location of the data to be copied

– 2. source location of the data

– 3. size of the data in bytes

– 4. the type of memory copying: host to host, host to device, device to device, device to host
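For example, a minimal sketch for a single array (dA and size are as used in the vecAdd function below; the kind argument is one of the built-in constants cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, cudaMemcpyDeviceToHost):

float *dA;
int size = n * sizeof(float);
cudaMalloc((void**)&dA, size);                    // allocate n floats in device global memory
cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);  // copy host array A to the device
// ... launch a kernel that uses dA ...
cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);  // copy the result back to the host
cudaFree(dA);                                     // release the device memory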


vecAdd Function

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *dA, *dB, *dC;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc(&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    cudaMalloc(&dB, size);
    cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
    cudaMalloc(&dC, size);

    // Part 2: launch the kernel to perform the actual operation on the GPU

    // Part 3: copy the result back and free the device vectors
    cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}


Part 2: Launch and Run the Kernel Code

– Launch and execute the kernel function

– Various related issues in Part 2:

– Execution model of the kernel function

– Thread structure

– Execution configuration

– Kernel execution


Execution Model of GPU

– The execution starts with host (CPU) execution

– When a kernel function is called, it is executed by a large number of threads on the GPU

– All the threads that run a kernel are collectively called a grid

– When all the threads of a kernel complete their execution, the corresponding grid terminates

– The execution continues on the host until another kernel is called



Thread Structure for Running a Kernel

– When the host code launches a kernel, CUDA generates a grid of thread blocks

– Each block contains the same number of threads (up to 1024)

– Each thread runs the same kernel function


Thread Organization

– Threads are organized into a grid of blocks (a two-level architecture)

– The grid and the blocks can be multidimensional


Thread Organization

– gridDim(x, y, z): the dimensions of the grid

– blockDim(x, y, z): the dimensions of a block

– blockIdx(x, y, z): the coordinate (ID) of the block in the grid; it can be accessed by the calling thread to find which block it is in

– threadIdx(x, y, z): the local coordinate (ID) of a thread in its block; it can be accessed by the calling thread to find its local position in the block


Built-in variables

– gridDim: the dimensions of the grid

– blockDim: the dimensions of the block

– blockIdx: the block index within the grid

– All the threads in a block share the same blockIdx value

– threadIdx: the thread index within the block

– Their values are pre-initialized by the CUDA runtime when the kernel function is invoked

– They can be accessed inside the kernel function
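As an illustrative sketch (the kernel name is hypothetical; device-side printf is supported by CUDA):

#include <cstdio>

__global__ void showCoordinates()
{
    // each thread reads the built-in variables to report which
    // block it belongs to and its local position within the block
    printf("block (%d, %d), thread (%d, %d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}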



Execution configuration of kernel launch

– The execution configuration sets the grid size and block size

– It is placed between <<< and >>>, after the kernel name and before the function arguments

– The first parameter defines the grid size: the number of thread blocks in the grid

– The second parameter specifies the block size: the number of threads in each block

– The same kernel can be launched with different execution configurations


Execution Configuration

– The grid and blocks are multidimensional

– The execution configuration sets the grid and block dimensions

– The dimension values are stored in the built-in variables gridDim and blockDim

Example: dim3 a(3, 2, 4); dim3 b(128, 1, 1);

vecAdd<<<a, b>>>(…); Then,

– gridDim.x=3, gridDim.y=2, gridDim.z=4

– blockDim.x=128, blockDim.y=1, blockDim.z=1

– Question: how many threads will be generated?

– Answer: 3*2*4*128 = 3072
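In the common one-dimensional case, plain integers can be used instead of dim3 variables. For example, to cover a vector of n elements with blocks of 256 threads (as in the vecAdd program later in this lecture):

// ceil(n/256.0) blocks of 256 threads each: enough threads to cover all n elements
vecAddKernel<<<ceil(n/256.0), 256>>>(dA, dB, dC, n);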



Kernel execution

– Different threads process different parts of the data in the kernel code

– We need to match each thread to the part of the data it processes


Match threads to data items

– Assume the following grid of blocks is generated to compute C_d = A_d + B_d:

gridDim(x, y, z) = (N, 1, 1), blockDim(x, y, z) = (256, 1, 1), blockIdx = (x, 0, 0), threadIdx = (x, 0, 0)

– Question: how do we match a thread (threadIdx) to compute A_d[i] + B_d[i]?


Match threads to data items

– Answer: each thread computes its global index and uses it as the array index:

i = blockIdx.x * blockDim.x + threadIdx.x

– Thread i then computes C_d[i] = A_d[i] + B_d[i]


Exercise

– Calculate C = A + B, where A, B, and C are 6*12 matrices

– Assume the following grid of blocks is generated: gridDim = (2, 3), blockDim = (3, 4)

– Question: how do we match a thread to calculate one element of C = A + B?

– Answer: calculate the global index of the thread in the grid (see the kernel sketch below):

x = blockIdx.x * blockDim.x + threadIdx.x

y = blockIdx.y * blockDim.y + threadIdx.y
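A sketch of the corresponding kernel, assuming the 6*12 matrices are stored row-major in one-dimensional arrays, with x indexing the 6 rows and y the 12 columns (the kernel name and layout are assumptions, not from the slides):

__global__ void matAddKernel(float* A, float* B, float* C)
{
    // global coordinates of this thread in the grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // row-major layout: element (x, y) of a 6*12 matrix is at offset x*12 + y
    int i = x * 12 + y;
    C[i] = A[i] + B[i];
}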


Kernel Execution for vecAdd

– All threads in a grid execute the same kernel function

– The threads use their coordinates (i.e., blockIdx and threadIdx) to

– distinguish themselves from each other

– identify the appropriate part of the data to process (see the sketch below)
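The slides show the kernel code as a figure; based on the index formula above and the if statement discussed below, the vecAdd kernel can be sketched as:

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    // global index of this thread: block offset plus local thread offset
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // only the first n threads perform the addition
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}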


Local Variables in a Kernel Function

– Local (automatic) variables in a kernel function are private to each thread

– Each thread has its own local copy of the variable (such as the index i in the kernel above)


If statement

– Only the first n threads (those with global index i < n) perform the addition

– This guard is needed because not every vector length is a multiple of the block size, so the grid may contain more threads than vector elements

– It allows the kernel to process vectors of any length


Comparison between the CPU and GPU versions

– There is a "for" loop in the CPU version

– In the GPU version, the grid of threads is equivalent to the loop: each thread performs one iteration



The complete program of vecAdd

void vecAdd(float* A, float* B, float* C, int n)
{

int size=n*sizeof(float);
float *dA, *dB, *dC;

cudaMalloc(&dA, size); //Part 1

cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);

cudaMalloc(&dB, size);

cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);

cudaMalloc(&dC, size);

vecAddKernel<<<ceil(n/256.0), 256>>>(dA, dB, dC, n); //Part 2

cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost); //Part 3
cudaFree(dA); cudaFree(dB); cudaFree(dC);

}
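A hypothetical caller (not shown on the slides; the vector length and initial values are illustrative):

#include <stdlib.h>

int main(void)
{
    int n = 1024;
    float *A = (float*)malloc(n * sizeof(float));
    float *B = (float*)malloc(n * sizeof(float));
    float *C = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { A[i] = 1.0f; B[i] = 2.0f; }
    vecAdd(A, B, C, n);   // afterwards C[i] == 3.0f for every i
    free(A); free(B); free(C);
    return 0;
}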


Compilation Process of a CUDA Program

– The NVCC compiler uses the CUDA keywords to separate the host code from the device code

– The device code is first compiled by NVCC into PTX code

– The PTX code is then compiled by NVCC into the device executable

– The host code is compiled with a standard C compiler and runs as a CPU process
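For example, assuming the program is saved as vecadd.cu, it can be compiled and run with:

nvcc -o vecadd vecadd.cu
./vecadd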


Timing the GPU code

– Using events for timing on the GPU

– Events are special markers that are inserted into the GPU's stream of work and timestamped on the GPU


Timing the GPU code

– cudaEventRecord() is used to place the start and stop events into the stream of GPU work around the kernel launch

– The GPU records a timestamp for an event when it reaches that event in the stream

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
vecAddKernel<<<ceil(n/256.0), 256>>>(dA, dB, dC, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);  // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);