CS402: Lab Session 8
CUDA
1 Introduction
In the labs up until now we have been looking at mechanisms for exposing parallelism on CPUs. For this lab we will be looking at CUDA, a programming model designed for GPUs. GPUs offer a finer level of data parallelism than a CPU, with which large speedups are possible if your application is well suited. The CUDA programming model is vastly different from OpenMP and MPI. This lab session will touch on kernels and how they run in the context of the thread hierarchy abstraction. For more information, you can refer to the CUDA C Programming Guide (http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf); the contents covered in this lab session are mainly in sections 2.1, 2.2 and 2.3.
2 Hello World in CUDA
As with the other labs, the first example we will look at is a variant of “Hello,
world!”, which has been modified to demonstrate a minimal CUDA program.
Normally, the parallel program is modified such that each thread prints out its
ID and “Hello, world!”, but this is not possible as we cannot print to the screen
from code running on a GPU.
#include <stdio.h>

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b) {
    a[threadIdx.x] += b[threadIdx.x];
}

int main() {
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    char *ad;
    int *bd;
    const int csize = N * sizeof(char);
    const int isize = N * sizeof(int);

    printf("%s", a);

    cudaMalloc((void **)&ad, csize);
    cudaMalloc((void **)&bd, isize);
    cudaMemcpy(ad, a, csize, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, isize, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize, 1);
    dim3 dimGrid(1, 1);
    hello<<<dimGrid, dimBlock>>>(ad, bd);

    cudaMemcpy(a, ad, csize, cudaMemcpyDeviceToHost);
    cudaFree(ad);
    cudaFree(bd);

    printf("%s\n", a);
    return 0;
}
Listing 1: CUDA “Hello World” example
In this case we create an array, a, equal to {‘H’,‘e’,‘l’,‘l’,‘o’,‘ ’},
then we use CUDA to transform this into {‘W’,‘o’,‘r’,‘l’,‘d’,‘!’} using
the fact that we can add, for example, 15 to the ASCII value of “H” to create
a “W”. The difference values are stored in the array b. You can find this code
in Listing 1 and the file named helloworldCUDA.cu.
CUDA code is not as easy to compile as the examples in previous labs, and requires two steps. The first involves setting up the environment for the compiler, by using the source command on the file named environment.sh, which is included as part of this lab.
source environment.sh
If you open this file you will see that it appends the locations of the CUDA compiler and the CUDA libraries to the environment variables $PATH and $LD_LIBRARY_PATH respectively.
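The exact contents depend on where CUDA is installed on the machines you are using, but as a rough sketch (the /usr/local/cuda prefix here is an assumption, not necessarily the lab's actual path) the file will contain something along these lines:

# Hypothetical environment.sh: add the CUDA toolchain and libraries
# to the search paths. The install prefix will differ per system.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH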
The second step is the compilation process, which differs from the compilation of plain C in that you must use the nvcc command. This is a wrapper around gcc which takes care of the CUDA language extensions.
nvcc -o helloworldCUDA.b helloworldCUDA.cu
Run the compiled binary just like any other, and you will be greeted with
the output below.
Hello World!
The following list contains explanations for the CUDA specific portions of
the code.
• __global__ – This annotation is used to declare a kernel.
• cudaMalloc – This is similar to malloc in C but allocates memory on the
GPU.
• cudaMemcpy – This copies data between host memory and device memory, with the direction given by the final argument (a sketch of checking these calls for errors follows this list).
• hello<<<dimGrid, dimBlock>>>(ad, bd) – This calls the kernel, in much the same way as a function is called in C. The main difference is the execution configuration within the triple chevrons. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads per thread block.
• threadIdx.x – The index, within its thread block, of the thread executing the kernel.
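One CUDA habit worth picking up early: the runtime API functions all return a cudaError_t, which the listings in this lab mostly ignore for brevity. A minimal sketch of checking one of the calls from Listing 1 (the variables are those from the listing):

cudaError_t err = cudaMemcpy(ad, a, csize, cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message.
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}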
As always, if you have any questions at this point, make sure you ask one of
the tutors for help.
3 Using Threads
Task 1 This task will be carried out in the context of a Single-precision A*X Plus Y (SAXPY) example written in CUDA, which can be found in Listing 2 and the file named saxpy.cu. Your objective is to use the knowledge gained from the lectures and the CUDA programming guide to complete the kernel; the key is in understanding the thread hierarchy and how its levels relate to threadIdx, blockIdx and blockDim. You will know you have completed the task correctly when the reported error value drops from 2.0 to 0.0.
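As a hint rather than a complete solution, the vector-addition example in the CUDA C Programming Guide computes a global index for each thread from the hierarchy values mentioned above; the general pattern looks like this (the bounds check matters because the launch in Listing 2 rounds the number of blocks up):

// Global index: each block covers blockDim.x consecutive elements,
// so block blockIdx.x starts at element blockIdx.x * blockDim.x.
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
    // ... operate on element i ...
}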
#include <stdio.h>
#include <stdlib.h>

__global__
void saxpy(int n, float a, float *x, float *y) {
    //TODO
}

int main(void) {
    int N = 1 << 20;
    float *x, *y, *d_x, *d_y;
    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));

    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = max(maxError, abs(y[i] - 4.0f));
    printf("Max error: %f\n", maxError);
}
Listing 2: CUDA SAXPY example
4 Assessing Performance
Performance analysis requires you to time the various portions of your code, but this works a little differently in CUDA than in previous labs. Timing CUDA code relies on creating, recording, synchronising and calculating the time difference between events, using cudaEventCreate(cudaEvent_t* e), cudaEventRecord(cudaEvent_t e), cudaEventSynchronize(cudaEvent_t e) and cudaEventElapsedTime(float* out, cudaEvent_t start, cudaEvent_t stop) respectively. These functions are used to time a region of code as shown in Listing 3.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// Region to time.
cudaEventRecord(stop);

// Blocks the CPU until the stop event has been completed.
cudaEventSynchronize(stop);
float milliseconds = 0.0;
cudaEventElapsedTime(&milliseconds, start, stop);
Listing 3: Event timing example
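Once you no longer need the events, they can be released with cudaEventDestroy(start) and cudaEventDestroy(stop).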
Task 2 Often it is useful to know how long data transfers to and from the GPU take: if the cost of transferring the data outweighs the gain from parallelisation, then there is little value in offloading the computation. For Task 2, place timing calls around the cudaMalloc call and both cudaMemcpy calls in Listing 4.
#include <stdlib.h>
#include <string.h>

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    int *d_a;
    cudaMalloc((void **)&d_a, bytes);

    memset(h_a, 0, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    return 0;
}
Listing 4: Memory transfer example
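Once you have the elapsed time for a transfer, it is often worth turning it into an effective bandwidth figure, which you can compare between the two directions. A minimal sketch, assuming bytes from Listing 4 and a milliseconds value obtained with the pattern from Listing 3:

// bytes / (ms * 1e-3) gives bytes per second; dividing by 1e9
// converts to GB/s, hence the combined factor of 1e6.
float gbps = bytes / (milliseconds * 1.0e6f);
printf("Effective bandwidth: %f GB/s\n", gbps);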
5 Matrix Multiplication
Task 3 Matrix multiplication is a staple operation in scientific codes due to its use in solving systems of linear equations. In this task we will consolidate your knowledge of the thread hierarchy by completing the mapping from threads to rows and columns (marked by the TODO comments). You will know you have done this correctly because the program will no longer print “Validation failed.”. Additionally, you should determine whether it is useful, in the context of the code in Listing 5, to offload the compute to the device, by timing the CPU version and the GPU version of the code (think carefully about which operations you should include in your timed regions).
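Listing 5 is abridged below and does not show the kernel launch, but a typical two-dimensional execution configuration for an N×N matrix looks something like the following (dA, dB and dC are the device buffers used later in the listing; this sketch assumes N is a multiple of BLOCK_SIZE):

// BLOCK_SIZE x BLOCK_SIZE threads per block, and enough blocks in
// each dimension to cover the whole N x N output matrix.
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 blocks(N / BLOCK_SIZE, N / BLOCK_SIZE);
mat_mult<<<blocks, threads>>>(dA, dB, dC, N);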
#include <stdio.h>
#include <omp.h>

#define BLOCK_SIZE 16

__global__
void mat_mult(float *A, float *B, float *C, int N) {
    int row = 0; //TODO
    int col = 0; //TODO
    float sum = 0.0f;
    for (int n = 0; n < N; ++n) {
        sum += A[row*N+n] * B[n*N+col];
    }
    C[row*N+col] = sum;
}

void mat_mult_cpu(float *A, float *B, float *C, int N) {
#pragma omp parallel for
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int n = 0; n < N; ++n) {
                sum += A[row*N+n] * B[n*N+col];
            }
            C[row*N+col] = sum;
        }
    }
}

int main() {
    // ... (allocation and initialisation of the host matrices hA, hB,
    // hC and the device matrices dA, dB, dC, along with N, size and
    // the kernel launch, are elided in this extract) ...
    if (cudaPeekAtLastError() != cudaSuccess) {
        fprintf(stderr, "CUDA error detected: \"%s\"\n",
                cudaGetErrorString(cudaGetLastError()));
        return 1;
    }
    float *C;
    C = new float[N*N];
    cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    mat_mult_cpu(hA, hB, hC, N);
    for (int row = 0; row < N; ++row) {
        // ... (the validation loop, which prints “Validation failed.”
        // on a mismatch, is truncated in this extract) ...