CUDA: Introduction
Christian Trefftz / Greg Wolffe Grand Valley State University
Supercomputing 2008 Education Program
(modifications by Jernej Barbic, 2008
–
2019)
Ø
What is GPGPU?
Ø
l
l
l
l
–
–
Terms
Purpose computing on a Graphics Processing Unit
l
General
Using graphic hardware for non
computations What is CUDA?
Parallel computing platform and API by Nvidia Compute Unified Device Architecture
l
graphic
–
parallel Introduced in 2007; still actively updated
Software architecture for managing data programming
2
Motivation
3
Motivation
4
Motivation
5
Ø CPU
l Fastcaches
CPU vs. GPU
l Branching adaptability l High performance
Ø GPU
l Multiple ALUs
l Fast onboard memory
l High throughput on parallel tasks
• Executesprogramoneachfragment/vertex
Ø CPUs are great for task parallelism Ø GPUs are great for data parallelism
6
CPU vs. GPU
–
Hardware
Ø
More transistors devoted to data processing
7
Traditional Graphics Pipeline
Vertex processing
ò
Rasterizer
ò
Fragment processing
ò
Renderer (textures)
8
Pixel / Thread Processing
9
GPU Architecture
10
Processing Element
Ø
Processing element = thread processor
11
GPU Memory Architecture
Uncached:
Registers Shared Memory Local Memory Global Memory
Cached:
Constant Memory Texture Memory
Ø Ø Ø Ø
Ø Ø
12
Data
co
Ø
–
parallel Programming
Ø
Think of the GPU as a massively
threaded
– Write “kernel” functions that execute on
–
processor
Ø Ø
Keep it busy!
Keep your data close!
the device elements in parallel
[
[
—
processing multiple data
massive threading local memory
13
Hardware Requirements
Ø
capable
CUDA
– video card
Ø Ø Ø
Power supply Cooling
PCI
–
Express
14
A Gentle Introduction to CUDA Programming
17
Credits
Ø
l l
l l
The code used in this presentation is based on code available in:
the Tutorial on CUDA in Dr. Dobbs Journal Andrew Bellenir’s code for matrix multiplication
Igor Majdandzic’s code for Voronoi diagrams NVIDIA’s CUDA programming guide
18
Software Requirements/Tools
CUDA device driver
CUDA Toolkit (compiler, CUBLAS, CUFFT) CUDA Software Development Kit
Emulator Profiling:
Occupancy calculator Visual profiler
Ø Ø Ø
l
Ø Ø
19
Ø
Ø Ø Ø
To compute, we need to:
Allocate memory for the computation on the GPU (incl. variables)
Provide input data
computation to be performed
Specify the
Read the results from the GPU (output)
20
Initially:
array
CPU Memory
GPU Card’s Memory
21
Allocate Memory in the GPU card
array
Host’s Memory
array_d
GPU Card’s Memory
22
Copy content from the host’s memory to the GPU card memory
array
Host’s Memory
array_d
GPU Card’s Memory
23
Execute code on the GPU
GPU MPs
array
Host’s Memory
array_d
GPU Card’s Memory
24
Copy results back to the host memory
array
Host’s Memory
array_d
GPU Card’s Memory
25
The Kernel
Ø The code to be executed in the stream processors on the GPU
Ø Simultaneous execution in several (perhaps all) stream processors on the GPU
Ø How is every instance of the kernel going to know which piece of data it is working on?
26
Grid and Block Size
Grid size: The number of blocks
•
l
•
l
–
Can be 1, 2, or 3
dimensional array of blocks Each block is divided into threads
Can be 1 or 2
–
dimensional array of threads
27
Let’s look at a very simple example
The code has been divided into two files:
simple.c simple.cu
simple.c is ordinary code in C
It allocates an array of integers, initializes it to values corresponding to the indices in the array and prints the array.
It calls a function that modifies the array The array is printed again.
Ø
l l
Ø Ø
Ø Ø
28
simple.c
Ø
}
fillArray(a,SIZEOFARRAY);
/* Now print the array after calling fillArray */
#include
#define SIZEOFARRAY 64
extern void fillArray(int *a,int size);
/* The main program */
int main(int argc,char *argv[])
{
/* Declare the array that will be modified by the GPU */
int a[SIZEOFARRAY];
int i;
/* Initialize the array to 0s */
for(i=0;i < SIZEOFARRAY;i++) {
a[i]=0; }
/* Print the initial array */
printf("Initial state of the array:
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
\
n");
printf("
\
n");
/* Call the function that will in turn call the function in the GPU that will fill the array */
printf("Final state of the array:
n"); for(i = 0;i < SIZEOFARRAY;i++) {
\
}
printf("%d ",a[i]);
printf("
n");
return 0;
\
}
29
simple.cu
Ø
l
simple.cu contains two functions
fillArray(): A function that will be executed on the host and which takes care of:
• • • • • •
Allocating variables in the global GPU memory Copying the array from the host to the GPU memory Setting the grid and block sizes
Invoking the kernel that is executed on the GPU Copying the values back to the host memory Freeing the GPU memory
30
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize) {
int * array_d; cudaError_t result;
/* cudaMalloc allocates space in GPU memory */
result = cudaMalloc((void**)&array_d,sizeof(int)*arraySize);
/* copy the CPU array into the GPU array_d */
result = cudaMemcpy(array_d,array,sizeof(int)*arraySize,
cudaMemcpyHostToDevice);
31
fillArray (part 2)
/* Indicate block size */
dim3 dimblock(BLOCK_SIZE);
/* Indicate grid size */
dim3 dimgrid(arraySize / BLOCK_SIZE);
/* Call the kernel */ cu_fillArray<<
/* Copy the results from GPU back to CPU memory */
result = cudaMemcpy(array,array_d,sizeof(int)*arraySize,cudaMemcpyDevice ToHost);
/* Release the GPU memory */
cudaFree(array_d);
}
32
simple.cu (cont.)
Ø
l
l
The other function in simple.cu is cu_fillArray():
This is the GPU kernel
Identified by the keyword: __global__
Built
• •
l
in variables:
–
blockIdx.x : block index within the grid threadIdx.x: thread index within the block
33
cu_fillArray
__global__ void cu_fillArray(int * array_d) {
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x; array_d[x] = x;
}
__global__ void cu_addIntegers(int * array_d1, int * array_d2) {
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x; array_d1[x] += array_d2[x];
}
34
To compile:
nvcc simple.c simple.cu
The compiler generates the code for both the host and the GPU
Demo on cuda.littlefe.net …
Ø Ø
Ø
–
o simple
35
In the GPU:
Elements
Array Elements
Block 0
Processing
Thread 0
Thread 1
Thread 2
Thread 3
Thread 0
Thread 1
Thread 2
Thread 3
Block 1
37
Another Example:
SAXPY (Scalar Alpha X Plus Y)
A common operation in linear algebra
saxpy
Ø
l
Ø
CUDA: loop iteration
ð
thread
41
Traditional Sequential Code
void saxpy_serial(int n, float alpha,
float *x,
float *y)
{
for(int i = 0;i < n;i++)
y[i] = alpha*x[i] + y[i];
}
42
CUDA Code
__global__ void saxpy_parallel(int n, float alpha,
float *x,
float *y) {
int i = blockIdx.x*blockDim.x+threadIdx.x;
if (i