CUDA
CMPSC 450
What is CUDA?
• Compute Unified Device Architecture
• An extension of the C programming language created by NVIDIA.
• Enables GPUs to execute code written in C as part of an integrated host (CPU) + device (GPU) program
• Executes “kernels” as SIMT programs
• A dedicated hardware solution
CMPSC 450
NVIDIA Pascal Architecture
• 20 SMs (4 groups of 5)
• 1 raster engine per group
• Shared L2 cache (~2 MB)
• 64 ROPs (Render Output Units, also called the Raster Operations Pipeline)
• Memory controllers
CMPSC 450
SM – Streaming Multiprocessor
• Each SM is subdivided into 4 partitions, each containing:
  • 1 warp scheduler
  • 32 cores
  • 8 load/store units
  • 8 special function units
  • 64 KB register file
• Shared within the SM:
  • 96 KB shared memory
  • Instruction cache
  • 4 FP64 CUDA cores
  • 1 FP16×2 CUDA core
CMPSC 450
CUDA – Threading
• Relies on lots of lightweight threads
• Half-Warp – Group of 16 consecutive threads
• Warp – 32 Consecutive threads
• Block – a collection of threads (1D, 2D or 3D); commonly 192, 256, 512 or 768 threads (the maximum depends on the GPU). Threads in the same block can synchronize with each other.
• Grid – a collection of blocks (1D or 2D). Blocks cannot synchronize with each other.
CMPSC 450
SM Thread Details
• 32 cores share an instruction stream (warp)
• Up to 64*32=2048 CUDA threads can be active on the SM at a time
• But only 4*32=128 CUDA threads are executed simultaneously.
• Groups of 32 CUDA threads share an instruction stream. (and are executed in a SIMD manner)
• The SM may choose to execute 2 instructions from a CUDA thread at once when an appropriate mixture is available (and each of those two instructions is executed in a SIMD manner as described above).
CMPSC 450
Theoretical Performance
• Consider a GeForce GTX 1080 with 2560 cores.
• Each CUDA core can perform a single-precision FMA (fused multiply-add) instruction every clock cycle. An FMA counts as two operations, so the peak single-precision performance is 2 x 2560 x clock frequency.
• The clock frequency is 1.607 GHz (base) and 1.733 GHz (with boost).
• Using the base clock: 2 x 2560 x 1.607 GHz ≈ 8.23 TFLOP/s (8,228 GFLOP/s).
CMPSC 450
Memory Architecture
                     CPU             GPU
Memory               6 - 64 GB       768 MB - 8 GB
Memory Bandwidth     24 - 32 GB/s    100 - 200 GB/s
L2 Cache             8 - 15 MB       512 kB - 2 MB
L1 Cache             256 - 512 kB    16 - 64 kB
CMPSC 450
CUDA – memory hierarchy
• Global memory – video card RAM
• Shared memory – ~16 kB, shared by threads in the same block
• Texture memory – cached, generally read-only
CUDA cores can only operate on data in device memory, so data must be copied to GPU memory before operating on it.
CMPSC 450
Memory Allocation
float *h_dataA, *h_dataB, *h_resultC;
float *d_dataA, *d_dataB, *d_resultC;

h_dataA   = (float *)malloc(sizeof(float) * MAX_DATA_SIZE);
h_dataB   = (float *)malloc(sizeof(float) * MAX_DATA_SIZE);
h_resultC = (float *)malloc(sizeof(float) * MAX_DATA_SIZE);

cudaMalloc((void **)&d_dataA,   sizeof(float) * MAX_DATA_SIZE);
cudaMalloc((void **)&d_dataB,   sizeof(float) * MAX_DATA_SIZE);
cudaMalloc((void **)&d_resultC, sizeof(float) * MAX_DATA_SIZE);
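cudaMalloc, like most CUDA runtime calls, returns a cudaError_t; a minimal sketch of checking it is shown below (checkCuda is a hypothetical helper, not part of the CUDA API):

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
// Requires <stdio.h>, <stdlib.h> and the CUDA runtime header (cuda_runtime.h).
void checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Example use with the allocations above:
checkCuda(cudaMalloc((void **)&d_dataA, sizeof(float) * MAX_DATA_SIZE), "cudaMalloc d_dataA");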
CMPSC 450
Copy data to and from device
// Copy the data to the device
cudaMemcpy(d_dataA, h_dataA, sizeof(float) * dataAmount, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataB, h_dataB, sizeof(float) * dataAmount, cudaMemcpyHostToDevice);

// Do the multiplication on the GPU (grid/block configuration is set up beforehand;
// the variable names here are placeholders for that configuration)
multiplyNumbersGPU<<<blockGridRows, threadBlockRows>>>(d_dataA, d_dataB, d_resultC);

// Copy the result back to the host
cudaMemcpy(h_resultC, d_resultC, sizeof(float) * dataAmount, cudaMemcpyDeviceToHost);
CMPSC 450
The Kernel (an example)
• Ex: perform a mathematical operation on two vectors of data
  • Size: 1024 x 1024 x 32 elements
• Organize threads, thread blocks and grids appropriately
• Ex: choose 256 threads per block (any 1D/2D/3D shape is allowed)
  • dim3 threadBlockRows(256, 1);
• Number of blocks required = (# elements / 256) = 131072
• A single grid dimension is limited to 65535 blocks, so we choose a 128 x 1024 x 1 grid (see the launch sketch below)
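A minimal sketch of this launch configuration (the grid variable name blockGridRows is illustrative; threadBlockRows and multiplyNumbersGPU come from the slides):

dim3 threadBlockRows(256, 1);        // 256 threads per block
dim3 blockGridRows(128, 1024);       // 128 * 1024 = 131072 blocks, covering all elements
multiplyNumbersGPU<<<blockGridRows, threadBlockRows>>>(d_dataA, d_dataB, d_resultC);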
CMPSC 450
CUDA Built-In Variables
• blockIdx.x, blockIdx.y, blockIdx.z are built-in variables that return the block ID in the x-axis, y-axis, and z-axis of the block that is executing the given block of code.
• threadIdx.x, threadIdx.y, threadIdx.z are built-in variables that return the thread ID in the x-axis, y-axis, and z-axis of the thread that is being executed by this stream processor in this particular block.
• blockDim.x, blockDim.y, blockDim.z are built-in variables that return the “block dimension” (i.e., the number of threads in a block in the x-axis, y-axis, and z-axis).
• So, you can express your collection of blocks, and your collection of threads within a block, as a 1D array, a 2D array or a 3D array.
• These can be helpful when thinking of your data as 2D or 3D.
• The full global thread ID in x dimension can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
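As a sketch, the same idea extends to 2D data; assuming a row-major array of width `width` (an illustrative name):

int x = blockIdx.x * blockDim.x + threadIdx.x;   // global index in x
int y = blockIdx.y * blockDim.y + threadIdx.y;   // global index in y
int idx = y * width + x;                         // offset into a row-major 2D array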
CMPSC 450
Higher Dimension Grids/Blocks
• 1D grids/blocks are suitable for 1D data, but higher dimensional grids/blocks are necessary for:
• higher dimensional data.
• data set larger than the hardware dimensional limitations of blocks.
• CUDA has built-in variables and structures to define the number of blocks in a grid in each dimension and the number of threads in a block in each dimension.
CMPSC 450
The Kernel – an example
__global__ void multiplyNumbersGPU(float *pDataA, float *pDataB, float *pResult)
{
// We already set it to 256 threads per block, with 128 thread blocks per grid row.
// This gives every thread a unique ID.
int tid = (blockIdx.y * 128 * 256) + blockIdx.x * 256 + threadIdx.x;

// Each thread only multiplies one data element; earlier variants of the demo:
//pResult[tid] = pDataA[tid] * pDataB[tid];
//pResult[tid] = pDataA[tid] * pDataB[tid] / 12.34567;
//pResult[tid] = sqrt(pDataA[tid] * pDataB[tid] / 12.34567);
pResult[tid] = sqrt(pDataA[tid] * pDataB[tid] / 12.34567) * sin(pDataA[tid]);
}
• The __global__ keyword marks a kernel: it executes on the CUDA device and is launched from host code
• Each thread figures out which data element it is responsible for computing
CMPSC 450
Performance Analysis
CMPSC 450
Thread Communication
• Threads may only safely communicate with each other within the same block
• Two ways threads can communicate:
  • Shared memory (but only 16 kB per SM)
  • Global memory (larger, but slower)
CMPSC 450
CUDA Function Declarations
                                 Executed on the:    Only callable from the:
__device__ float DeviceFunc()    device              device
__global__ void KernelFunc()     device              host
__host__ float HostFunc()        host                host

For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments
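A minimal sketch (function names are illustrative) showing the three qualifiers together:

// Executed on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// A kernel: executed on the device, launched from the host.
__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// Executed on the host, callable only from the host (the default for unqualified functions).
__host__ float squareOnHost(float x) { return x * x; }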
CMPSC 450
Min Max Average Demo
• Share min, max, average
• 256 Threads / block
• 3 variables * 4 bytes / variable * 256 threads / block = 3 KB / block
• Thread blocks configured as 256 x 1 x 1
• Grid configured as 128 x Y x 1; use Y to adjust for the data size
CMPSC 450
CUDA shared keyword
• Use the keyword __shared__ to place a variable into shared memory for a thread block.
// Declare arrays to be in shared memory.
// 256 elements * (4 bytes / element) * 3 = 3KB.
__shared__ float min[256];
__shared__ float max[256];
__shared__ float avg[256];
CMPSC 450
Value reduction
int nTotalThreads = blockDim.x;   // Total number of active threads

while (nTotalThreads > 1)
{
    int halfPoint = (nTotalThreads >> 1);   // divide by two
    // Only the first half of the threads will be active.
    if (threadIdx.x < halfPoint)
    {
        // Get the shared value stored by another thread.
        float temp = min[threadIdx.x + halfPoint];
        if (temp < min[threadIdx.x]) min[threadIdx.x] = temp;

        temp = max[threadIdx.x + halfPoint];
        if (temp > max[threadIdx.x]) max[threadIdx.x] = temp;

        // When calculating the average, sum and divide.
        avg[threadIdx.x] += avg[threadIdx.x + halfPoint];
        avg[threadIdx.x] /= 2;
    }
    __syncthreads();

    nTotalThreads = (nTotalThreads >> 1);   // divide by two
}
CMPSC 450
Atomic Operations
• Atomic operations avoid race conditions by preventing the interference from other threads
• Example: int atomicAdd(int* address, int val);
  • This loads the value at address, adds val to it, and writes it back without being interrupted.
• This will force code to be serialized if multiple threads want to operate on the same shared memory address!
• Atomic operations were added with CUDA 1.1; older graphics cards may not support some atomic operations.
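A minimal sketch (kernel and names are illustrative) of accumulating a global sum with atomicAdd; note that atomicAdd on float requires compute capability 2.0 or newer:

__global__ void sumAll(const float *data, int n, float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, data[i]);   // serialized only for threads updating the same address
}

// *result must be initialized to 0.0f (e.g. with cudaMemset) before the launch.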
CMPSC 450
Atomic Instructions
• Addition/subtraction: atomicAdd, atomicSub
• Minimum/maximum: atomicMin, atomicMax
• Conditional increment/decrement: atomicInc, atomicDec
• Exchange/compare-and-swap: atomicExch, atomicCAS
• More types in Fermi: atomicAnd, atomicOr, atomicXor
CMPSC 450
Warp Divergence
• Threads are executed in warps (of 32), all threads in a warp execute the same instruction at the same time
• What about conditionals?
  • This is warp divergence

Consider:

if (x < 0.0)
    z = x - 2.0;
else
    z = sqrt(x);
CMPSC 450
Warp Divergence
• NVIDIA GPUs have predicated instructions, which are carried out only if a logical flag is true:

p: a = b + c;   // computed only if p is true

Therefore, from our previous example:

 p = (x < 0.0);
 p: z = x - 2.0;
!p: z = sqrt(x);

All threads in the warp execute both conditional branches, so the execution cost is the sum of both branches (see the sketch below).
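A small sketch (the kernel is illustrative) of the example above inside a kernel; the branch diverges whenever neighbouring elements of x straddle 0.0, so every such warp pays for both paths:

__global__ void divergent(const float *x, float *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            z[i] = x[i] - 2.0f;      // the warp steps through this path even if only some threads need it
        else
            z[i] = sqrtf(x[i]);      // and likewise through this one
    }
}

// A condition that is uniform across each warp of 32 threads, e.g. if ((i / 32) % 2 == 0),
// lets every warp take a single path and avoids the divergence cost.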
CMPSC 450
More Demos
V_add.cu M_add.cu M_mult.cu M_multTile.cu
CMPSC 450
Prefix Sums (Parallel Scan)
• Recall: the all-prefix-sums operation takes a binary associative operator ⊕ and an ordered set of n elements
  [a0, a1, ..., an−1],
and returns the ordered set
  [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)].
CMPSC 450
Exclusive Scan
The exclusive all-prefix-sums operation takes a binary associative operator ⊕ and an ordered set of n elements
  [a0, a1, ..., an−1],
and returns the ordered set
  [0, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)].
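For example, with ⊕ = + and the 8-element input [3, 1, 7, 0, 4, 1, 6, 3]:
  inclusive (all-prefix-sums) result: [3, 4, 11, 11, 15, 16, 22, 25]
  exclusive scan result:              [0, 3, 4, 11, 11, 15, 16, 22]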
CMPSC 450
A visualisation
for d = 1 to log2 n do
  for all k in parallel do
    if k >= 2^(d-1) then
      x[k] = x[k - 2^(d-1)] + x[k]

• Not work-efficient
• Not all threads execute concurrently (only within warps)
• This algorithm will not work as written: the in-place updates race, since a thread may read an element that another thread has already overwritten in the same step
CMPSC 450
Double-Buffered Sum Scan
for d = 1 to log2 n do
  for all k in parallel do
    if k >= 2^(d-1) then
      x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
    else
      x[out][k] = x[in][k]
  swap(in, out)

Can only handle arrays as large as can be processed by a single thread block running on one multiprocessor of a GPU.
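A minimal single-block CUDA sketch of this double-buffered (inclusive) scan; the kernel name and layout are illustrative, with the two buffers packed into one shared array of 2n floats:

__global__ void scanNaive(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];          // 2*n floats: the double buffer
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = g_idata[tid];     // load input into shared memory
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;                     // swap buffers each step
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    g_odata[tid] = temp[pout * n + tid];     // write results
}

// Launch with one block of n threads and 2*n*sizeof(float) bytes of shared memory:
// scanNaive<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);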
CMPSC 450
A Work-Efficient Parallel Scan
Up-Sweep (Reduce) Phase (Blelloch 1990):
for d = 0 to log2 n - 1 do
  for all k = 0 to n - 1 by 2^(d+1) in parallel do
    x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
CMPSC 450
W.E. Parallel Scan (Down-Sweep)
x[n - 1] = 0
for d = log2 n - 1 downto 0 do
  for all k = 0 to n - 1 by 2^(d+1) in parallel do
    t = x[k + 2^d - 1]
    x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
    x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
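A worked trace on the 8-element input [3, 1, 7, 0, 4, 1, 6, 3] with addition as the operator:
  Up-sweep:   d = 0: [3, 4, 7, 7, 4, 5, 6, 9]
              d = 1: [3, 4, 7, 11, 4, 5, 6, 14]
              d = 2: [3, 4, 7, 11, 4, 5, 6, 25]
  Down-sweep (after setting x[7] = 0):
              d = 2: [3, 4, 7, 0, 4, 5, 6, 11]
              d = 1: [3, 0, 7, 4, 4, 11, 6, 16]
              d = 0: [0, 3, 4, 11, 11, 15, 16, 22]   (the exclusive scan of the input)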
CMPSC 450
Banking…
When multiple threads in the same warp access the same bank, a bank conflict occurs unless all threads of the warp access the same address within the same 32-bit word.
• The number of threads that access a single bank is called the degree of the bank conflict
• Bank conflicts cause serialization of the multiple accesses to the memory bank, so that a shared memory access with a degree-n bank conflict requires n times as many cycles to process as an access with no conflict
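A small sketch (the kernel is illustrative) contrasting conflict-free and strided shared-memory accesses; with 32 banks of 4-byte words, a stride-2 float access gives a degree-2 bank conflict:

__global__ void bankExample(float *out)
{
    __shared__ float s[64];
    int tid = threadIdx.x;              // assume a single warp: blockDim.x == 32

    s[tid]      = (float)tid;           // stride-1 writes: thread i -> bank i, no conflict
    s[tid + 32] = (float)(tid + 32);
    __syncthreads();

    float a = s[tid];                   // conflict-free: 32 threads hit 32 different banks
    float b = s[tid * 2];               // stride-2: threads i and i+16 hit the same bank (degree 2)
    out[tid] = a + b;
}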
CMPSC 450
Sources
• http://15418.courses.cs.cmu.edu/spring2017/lecture/gpuarch/slide_001
• https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review
• http://supercomputingblog.com/cuda/what-is-cuda-an-introduction/
• https://devblogs.nvidia.com/parallelforall/
• http://users.wfu.edu/choss/CUDA/lectures.html
• https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec3-2x2.pdf
• https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
CMPSC 450