Parallel Computing
with GPUs:
Introduction to CUDA
Dr Paul Richmond
http://paulrichmond.shef.ac.uk/teaching/COM4521/
This lecture
CUDA Programming Model
CUDA Device Code
CUDA Host Code and Memory Management
CUDA Compilation and execution in Visual Studio
Programming a GPU with CUDA
Application code: use the GPU to parallelize the compute-intensive functions (GPU kernel code) and run the rest of the sequential code on the CPU
[Figure: main program code on the CPU (with DRAM) launching GPU kernel code on the NVIDIA GPU (with GDRAM), connected by the PCIe bus]
Simple processing flow
1. Copy input data from CPU memory to GPU memory (across the PCIe bus)
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
Images: Mark Harris (NVIDIA)
Stream Computing
Data set decomposed into a stream of elements
A single computational function (kernel) operates on each element
A thread is the execution of a kernel on one data element
Multiple Streaming Multiprocessor cores can operate on multiple elements in parallel
Many parallel threads
Suitable for data-parallel problems
[Figure: a GPU built from multiple Streaming Multiprocessors (SMPs); each SMP has a scheduler/dispatcher, instruction cache and registers, and shared memory/cache, all backed by device memory]
How does the stream computing principle map onto the hardware model?
CUDA Software Model
Hardware abstracted as a Grid of Thread Blocks
Blocks map to SMPs
Each thread maps onto a CUDA core
Don't need to know the hardware characteristics
Code is portable across different GPU versions
[Figure: software hierarchy of Grid -> Block -> Thread]
CUDA introduces new vector types; the most important for launching kernels is dim3
dim3 contains a collection of three integers (X, Y, Z)
dim3 my_xyz (x_value, y_value, z_value);
Values are accessed as members
int x = my_xyz.x;
CUDA Vector Types
threadIdx
The location of a thread within a block. E.g. (2,1,0)
blockIdx
The location of a block within a grid. E.g. (1,0,0)
blockDim
The dimensions of the blocks. E.g. (3,9,1)
gridDim
The dimensions of the grid. E.g. (3,2,1)
Idx values are zero-based indices; Dim values are sizes
Special dim3 Vectors
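As a concrete illustration (not from the original slides), here is a minimal sketch of a kernel that combines these built-in variables; the kernel name whereAmI and its argument are assumptions for the example:

__global__ void whereAmI(int *out)
{
	//unique 1D index of this thread across the whole grid
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = i;   //each thread records its own global index
	//gridDim.x * blockDim.x gives the total number of threads launched in x
}

Launched as, say, whereAmI<<<dim3(4,1,1), dim3(8,1,1)>>>(d_out); this would write the indices 0 to 31.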
Students arrive at halls of residence to check in
Rooms allocated in order
Unfortunately admission rates are down!
Only half as many students as rooms
Each student can be moved from room i to room 2i so that no-one has a
neighbour
Analogy
Serial Solution
The receptionist performs the following tasks:
1. Asks each student their assigned room number
2. Works out their new room number
3. Informs them of their new room number

Parallel Solution
"Everybody check your room number. Multiply it by 2 and go to that room"
CUDA Programming Model
CUDA Device Code
CUDA Host Code and Memory Management
CUDA Compilation and execution in Visual Studio
Serial solution
for (i = 0; i < N; i++){
	result[i] = 2 * i;
}
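The device code for this example is not shown in the extracted slides; a minimal sketch of what such a kernel could look like (the name myKernel matches the host launch below, the parameter is an assumption):

__global__ void myKernel(int *result)
{
	int i = threadIdx.x;   //one thread per element, single block
	result[i] = 2 * i;     //each "student" in room i moves to room 2*i
}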
CUDA C Example: Host
What should the dimensions be?
dim3 blocksPerGrid(?,?,?);
dim3 threadsPerBlock(?,?,?);
myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);
[Figure: a single block containing N threads and their threadIdx.x values]
Call the kernel by using the CUDA kernel launch syntax:
kernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);

CUDA C Example: Host
dim3 blocksPerGrid(1,1,1); //use only one block
dim3 threadsPerBlock(N,1,1); //use N threads in the block
myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);
[Figure: a single block containing N threads and their threadIdx.x values]
Vector Addition Example
Consider a more interesting example
Vector addition: e.g. a + b = c
[Figure: element-wise addition of vectors a and b into c, i.e. a0 + b0 = c0, a1 + b1 = c1, ..., an + bn = cn]
//Kernel Code
__global__ void vectorAdd(float *a, float *b, float *c)
{
int i = threadIdx.x;
c[i] = a[i] + b[i];
}
//Host Code
…
dim3 blocksPerGrid(1,1,1);
dim3 threadsPerBlock(N,1,1); //single block of threads
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
Vector Addition Example
Only one block will give poor performance
A block gets allocated to a single SMP!
Solution: Use multiple blocks
dim3 blocksPerGrid(N/8,1,1); // assumes 8 divides N exactly
dim3 threadsPerBlock(8,1,1); //8 threads in the block
myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);
CUDA C Example: Host
[Figure: a grid of multiple blocks, each containing 8 threads]
//Kernel Code
__global__ void vectorAdd(float *a, float *b, float *c)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
c[i] = a[i] + b[i];
}
Vector Addition Example
The integer i gives a unique thread index used to access a unique value from the vectors a, b and c
[Figure: within each block threadIdx.x runs 0...7; blockIdx.x identifies the block, so i gives the global element indices 0, 1, 2, 3, ... across the whole grid]
A note on block sizes
Thread block sizes cannot be larger than 1024 threads
Max grid size is 2147483647 for 1D
Grid y and z dimensions are limited to 65535
Block size should ideally be divisible by 32
This is the warp size, the granularity at which threads are scheduled
The block size should also not be less than 32, unlike our trivial example!
Varying the block size will result in different performance characteristics
Try incrementing by values of 32 and benchmark.
Calling a kernel with scalar parameters assumes a 1D grid of thread blocks.
E.g. my_kernel<<<8, 128>>>(arguments);
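The earlier N/8 launch assumed the block size divides N exactly. A common pattern (a sketch, not from the original slides) is to round the number of blocks up and guard out-of-range threads inside the kernel:

//host: round the number of blocks up so every element is covered
dim3 threadsPerBlock(128,1,1);
dim3 blocksPerGrid((N + threadsPerBlock.x - 1) / threadsPerBlock.x, 1, 1);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

//kernel: the final block may contain threads beyond the end of the data
__global__ void vectorAdd(float *a, float *b, float *c)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < N)              //bounds check (N is the #define used throughout these examples)
		c[i] = a[i] + b[i];
}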
Device functions
Kernels are always prefixed with __global__
To call a function from a kernel the function must be a device function (i.e. it must be compiled for the GPU device)
A device function must be prefixed with __device__
A device function is not available from the host
Unless it is also prefixed with __host__
int increment(int a){ return a + 1; }                     //host only
__device__ int increment(int a){ return a + 1; }          //device only
__device__ __host__ int increment(int a){ return a + 1; } //host and device
Global functions are always void return type. Why?
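To make the distinction concrete, a minimal sketch (not from the original slides) of the __device__ increment function being called from a kernel; the kernel name and argument are assumptions:

__device__ int increment(int a){ return a + 1; }

__global__ void incrementAll(int *data)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	data[i] = increment(data[i]);   //device function called from device code
}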
CUDA Programming Model
CUDA Device Code
CUDA Host Code and Memory Management
CUDA Compilation and execution in Visual Studio
GPU has separate dedicated memory from the host CPU
Data accessed in kernels must be in GPU memory
Data must be explicitly copied and transferred
cudaMalloc() is used to allocate memory on the GPU
cudaFree() releases memory
float *a;
cudaMalloc(&a, N*sizeof(float));
…
cudaFree(a);
Memory Management
Once memory has been allocated we need to copy data to it and from it.
cudaMemcpy() transfers memory from the host to the device and vice versa
cudaMemcpy(array_device, array_host, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(array_host, array_device, N*sizeof(float), cudaMemcpyDeviceToHost);
First argument is always the destination of transfer
Transfers are relatively slow and should be minimised where possible
Memory Copying
//Define macros
#define N 2048
#define THREADS_PER_BLOCK 128

//Define kernel
__global__ void vectorAdd(float *a, float *b, float *c) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	c[i] = a[i] + b[i];
}

int main(void) {
	//Define pointer variables
	float *a, *b, *c;       // host copies of a, b, c
	float *d_a, *d_b, *d_c; // device copies of a, b, c
	int size = N * sizeof(float);

	//Allocate GPU memory
	cudaMalloc((void **)&d_a, size);
	cudaMalloc((void **)&d_b, size);
	cudaMalloc((void **)&d_c, size);

	//Allocate host memory and initialise contents (random_floats is a helper defined elsewhere)
	a = (float *)malloc(size); random_floats(a, N);
	b = (float *)malloc(size); random_floats(b, N);
	c = (float *)malloc(size);

	//Copy input data to the device
	cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

	//Launch the kernel
	vectorAdd<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

	//Copy data back to host
	cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

	//Clean up
	free(a); free(b); free(c);
	cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
	return 0;
}
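A quick host-side sanity check (not part of the original example) could be added just before the cleanup to confirm the kernel produced the expected sums; it requires #include <stdio.h>:

//optional: verify the device results against a host calculation
for (int i = 0; i < N; i++){
	if (c[i] != a[i] + b[i]){
		printf("Mismatch at element %d\n", i);
		break;
	}
}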
Statically allocated device memory
How do we declare a large array on the host without using malloc?
Statically allocate using a compile-time size:
int array[N];
We can do the same on the device, i.e.
__device__ int array[N];
Just like __device__ when applied to a function:
Only available on the device
Must be a global variable
Must use cudaMemcpyToSymbol() to copy to and from it
//Define macros
#define N 2048
#define THREADS_PER_BLOCK 128

//Statically allocate GPU memory
__device__ float d_a[N];
__device__ float d_b[N];
__device__ float d_c[N];

//Define kernel
__global__ void vectorAdd() {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	d_c[i] = d_a[i] + d_b[i];
}

int main(void) {
	//Define pointer variables
	float *a, *b, *c; // host copies of a, b, c
	int size = N * sizeof(float);

	//Allocate host memory and initialise contents
	a = (float *)malloc(size); random_floats(a, N);
	b = (float *)malloc(size); random_floats(b, N);
	c = (float *)malloc(size);

	//Copy input data to the device
	cudaMemcpyToSymbol(d_a, a, size);
	cudaMemcpyToSymbol(d_b, b, size);

	//Launch the kernel
	vectorAdd<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>();

	//Copy data back to host
	cudaMemcpyFromSymbol(c, d_c, size);

	//Clean up
	free(a); free(b); free(c);
	return 0;
}
Device Synchronisation
Kernel calls are non-blocking
The host continues after the kernel launch
This overlaps CPU and GPU execution
cudaDeviceSynchronize() can be called from the host to block until GPU kernels have completed

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
//do work on host (that doesn't depend on c)
cudaDeviceSynchronize(); //wait for kernel to finish

Standard cudaMemcpy calls are blocking
Non-blocking variants exist (see the sketch below)
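A minimal sketch (not from the slides) of the non-blocking copy variant, cudaMemcpyAsync, issued into a CUDA stream. Note that the host buffers should be pinned (e.g. allocated with cudaMallocHost) for the copies to be truly asynchronous:

cudaStream_t stream;
cudaStreamCreate(&stream);

//queue the copies and the kernel in the same stream; the calls return immediately
cudaMemcpyAsync(d_a, a, size, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_b, b, size, cudaMemcpyHostToDevice, stream);
vectorAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(d_a, d_b, d_c);
cudaMemcpyAsync(c, d_c, size, cudaMemcpyDeviceToHost, stream);

//do independent work on the host here

cudaStreamSynchronize(stream); //block until all work queued in the stream has finished
cudaStreamDestroy(stream);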
CUDA Programming Model
CUDA Device Code
CUDA Host Code and Memory Management
CUDA Compilation and execution in Visual Studio
CUDA C Code is compiled using nvcc e.g.
Will compile host AND device code to produce an executable
nvcc -o example example.cu
We will be using Visual Studio to build our CUDA code so we will not
need to compile at the command line (unless you are running on
ShARC)
Compiling a CUDA program
Creating a CUDA Project
Create a new CUDA Project
Select NVIDIA -> CUDA 9.1
This will create a project with a default kernels.cu file containing a basic vector addition example
Preferred Method!
Adding a CUDA source file
Alternatively add a CUDA source file to an existing application
If you do this you must modify the project properties to include CUDA build customisations
http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_Getting_Started_Windows.pdf (section 3.4)
Compilation
CUDA source files (*.cu) are compiled by nvcc
An existing cuda.rules file creates a property page for CUDA source files
Configures nvcc in the same way as configuring the C compiler
Options such as optimisation and include directories can be inherited from project defaults
C and C++ files are compiled with cl (the MSVC compiler)
[Figure: compilation trajectory: .c/.cpp files are compiled by cl.exe into host .obj files; .cu files are compiled by nvcc.exe into host functions (passed to cl.exe) and CUDA kernels, producing CUDA .obj files; the linker combines all object files into a single executable]
Device Versions
Different generations of NVIDIA hardware have different compatibility
In the last lecture we saw product families and chip variants
These are classified by CUDA compute versions
Compilation normally builds for CUDA compute version 2
See Project Properties, CUDA C/C++ -> Device -> Code Generation
Default value is “compute_20,sm_20”
Any hardware with greater than the compiled compute version can execute the code
(backwards compatibility)
You can build for multiple versions using the ';' separator
E.g. "compute_20,sm_20;compute_30,sm_30;compute_35,sm_35" (covers all Diamond and Lewin Lab GPUs)
This will increase build time and execution file size
Runtime will select the best version for your hardware
https://en.wikipedia.org/wiki/CUDA#Supported_GPUs
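For reference, a sketch of the equivalent nvcc command line (useful when building on ShARC rather than in Visual Studio); the architectures listed here are examples and should match your target GPUs:

nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_60,code=sm_60 -o example example.cu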
Device Versions of Available GPUs
Diamond High Spec Lab (lab machines): Quadro 5200 (compute_35,sm_35)
Lewin Lab Main Room: GeForce GT 520 (compute_20,sm_20)
Lewin Lab Quiet Room: GeForce GT 630 (compute_30,sm_30)
ShARC: Tesla K80 (compute_37,sm_37), P100 (compute_60,sm_60)
CUDA Properties
Debugging
NSIGHT is a GPU debugger for debugging GPU kernel code
It does not debug breakpoints in host code
To launch, insert a breakpoint and select NSIGHT -> Start CUDA Debugging
You must be in the debug build configuration
When stepping, all warps except the debugger focus will be paused
Use conditional breakpoints to focus on specific threads
Right click on break point and select Condition
Error Checking
cudaError_t: an enumeration of runtime error codes
Can be converted to an error string (const char *) using
cudaGetErrorString(cudaError_t)
Many host functions (e.g. cudaMalloc, cudaMemcpy) return a
cudaError_t which can be used to handle errors gracefully
cudaError_t cudaStatus;
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
//handle error
}
Kernels do not return an error but if one is raised it can be queried
using the cudaGetLastError() function
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
cudaStatus = cudaGetLastError();
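A common convenience (a sketch, not from the slides) is a checking macro that reports the file and line of any failing runtime call; the macro name CUDA_CHECK is arbitrary:

#define CUDA_CHECK(call) do { \
	cudaError_t err = (call); \
	if (err != cudaSuccess) { \
		fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
		exit(EXIT_FAILURE); \
	} \
} while (0)

//usage (requires <stdio.h> and <stdlib.h>)
CUDA_CHECK(cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice));
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
CUDA_CHECK(cudaGetLastError());      //check the kernel launch
CUDA_CHECK(cudaDeviceSynchronize()); //check errors raised during kernel execution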
Summary
CUDA is a C-like programming language
Programming a GPU requires moving data to and from the device
Parallel regions are executed as kernels
Kernels require high levels of parallelism
Exposed as many threads grouped into blocks
Thread blocks are mapped to SMs
Host and device code are compiled separately and linked into a single
executable
Acknowledgements and Further Reading
Some of the content in this lecture material has been provided by;
1. GPUComputing@Sheffield Introduction to CUDA Teaching Material
Originally from content provided by Alan Gray at EPCC
2. NVIDIA Educational Material
Specifically Mark Harris's "Introduction to CUDA C"
Further Reading
Essential Reading: CUDA C Programming Guide
http://docs.nvidia.com/cuda/cuda-c-programming-guide/