
GPU PROGRAMMING
1. Introduction
In the labs up until now we have been looking at mechanisms for exposing parallelism on CPUs. For this lab we will be looking at CUDA, which is a programming model designed for GPUs. GPUs offer a finer level of data parallelism than a CPU, with which large speed-ups are possible if your application is well suited. The CUDA programming model is vastly different from OpenMP and MPI. This lab session will touch on kernels and how they run in the context of the thread hierarchy abstraction model. For more information, you can refer to the CUDA C Programming Guide [1]. The contents covered in this lab session are mainly in sections 2.1, 2.2, and 2.3.
For additional support, examples, and explanations regarding some of the concepts introduced in today's lab, see the relevant posts on NVIDIA's Developer Blog. They cover a wide range of core concepts, including an introduction to CUDA [2], accessing global memory efficiently [3], optimising data transfers [4], and overlapping data transfers for efficiency [5], among others.


In order to run any of the CUDA examples in this lab sheet, you can use the desktop machines in the labs, or you can use a particular partition on kudu. To get to these, log in to the remote nodes using the following command (where uXXXXXXX is replaced by your username and nn in remote-nn is replaced by the last two digits of your username):
You will need to change the partition flag to utilise the desktop-batch partition. Each of these machines has an NVIDIA GTX 1050 Ti graphics card.
3. Compilation and Running
CUDA code is compiled slightly differently to the examples in previous labs and as such requires two steps. The first involves setting up the environment for the compiler, by loading the following environment module.
$ module load cuda11.2
The second step is the compilation process, which differs from the compilation of plain C in that you must use the nvcc command. This is a wrapper around gcc which takes care of the CUDA language extensions. Aside from using nvcc instead of gcc/g++, compilation and execution of programs is exactly the same as before. Code can be run directly on the stone compute nodes, and as such it is not necessary to submit any jobs through Slurm if you are making use of these nodes. If you are running the examples from this sheet on kudu, the script provided in Listing 1 can be used to submit any CUDA jobs.
#!/bin/bash
#SBATCH --job-name=cuda-job
#SBATCH --partition=desktop-batch
#SBATCH --nodes=1
#SBATCH --time=00:01:00
module purge
module load cuda11.2
srun ./
Listing 1. A sample Batch Script for CUDA job submission
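Assuming the script above is saved in a file such as cuda-job.sbatch (the filename here is only illustrative, and the srun line must first be completed with the name of your compiled executable), it can be submitted and monitored with the usual Slurm commands:
$ sbatch cuda-job.sbatch
$ squeue -u $USER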
nvcc implements and recognises many of the common command-line flags and arguments that the other compilers do, but it also has flags of its own. For a full reference, see the CUDA Toolkit Documentation [6].
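For example, to enable optimisation and target the compute capability of the GTX 1050 Ti (6.1), you might compile with something like the following; these flags are common nvcc options rather than a requirement of this lab:
$ nvcc -O3 -arch=sm_61 -o helloworldCUDA helloworldCUDA.cu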
4. Exercises
As with other seminars, work through each exercise in turn before having a go
at the problems.
4.1. Hello World in CUDA. As with the other labs, the first example we will look at is a variant of “Hello, world!”, which has been modified to demonstrate a minimal CUDA program. Normally, the parallel program is modified such that each thread prints out its ID and “Hello, world!”, but this is not possible here as we cannot print to the screen from code running on the GPU.
#include <stdio.h>
const int N = 16;
const int blocksize = 16;
__global__
void hello(char* a, int* b) {
a[threadIdx.x] += b[threadIdx.x];
}

int main() {
char a[N] = "Hello \0\0\0\0\0\0";
int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
char *ad;
int *bd;
const int csize = N * sizeof(char);
const int isize = N * sizeof(int);
printf("%s", a);
cudaMalloc((void**)&ad, csize);
cudaMalloc((void**)&bd, isize);
cudaMemcpy(ad, a, csize, cudaMemcpyHostToDevice);
cudaMemcpy(bd, b, isize, cudaMemcpyHostToDevice);
dim3 dimBlock(blocksize, 1);
dim3 dimGrid(1, 1);
hello<<<dimGrid, dimBlock>>>(ad, bd);
cudaMemcpy(a, ad, csize, cudaMemcpyDeviceToHost);
cudaFree(ad);
cudaFree(bd);
printf("%s\n", a);
return 0; }
Listing 2. CUDA “Hello World” example
In this case we create an array, a, equal to ‘H’,‘e’,‘l’,‘l’,‘o’,‘ ’, then we use CUDA to transform this into ‘W’,‘o’,‘r’,‘l’,‘d’,‘!’ using the fact that we can add, for example, 15 to the ASCII value of “H” to create a “W”. The difference values are stored in the array b. You can find this code in Listing 2 and the file named helloworldCUDA.cu.
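As a quick, purely host-side illustration of the character arithmetic being used (this snippet is not part of the lab code):
#include <stdio.h>

int main(void) {
    /* 'H' has ASCII value 72; adding 15 gives 87, which is 'W'. */
    printf("%c\n", 'H' + 15);
    return 0;
}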
To compile the “Hello World” example, type the following command into your terminal:
$ nvcc -o helloworldCUDA helloworldCUDA.cu
Run the compiled binary just like any other, and you will be greeted with the output below.
Hello World!
The following list contains explanations for the CUDA specific portions of the code.

__global__ – This annotation is used to declare a kernel.
cudaMalloc – This is similar to malloc in C but allocates memory on the GPU.
cudaMemcpy – This copies data from host memory to device memory (or vice versa, depending on the direction argument).
hello<<<dimGrid, dimBlock>>>(ad, bd) – This invokes the kernel in much the same way as a function is called in C. The main difference is the execution configuration within the triple chevrons: the first argument specifies the number of thread blocks in the grid, and the second specifies the number of threads per thread block.
threadIdx.x – The ID of the thread executing the kernel within its thread block.
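One thing Listing 2 does not show is error handling: each of the CUDA runtime calls above returns a cudaError_t, which the example silently ignores. A minimal sketch of how those return values could be checked is given below; the CHECK macro name is our own invention rather than part of the CUDA API.
#include <stdio.h>
#include <cuda_runtime.h>

/* Report any failed CUDA runtime call with a readable message and bail out.
   Intended for use inside main() or another function returning int. */
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            return 1;                                                 \
        }                                                             \
    } while (0)
With this in place, a call such as cudaMalloc((void**)&ad, csize) would simply be wrapped as CHECK(cudaMalloc((void**)&ad, csize)).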
4.2. Threads. As discussed in the lectures on GPU programming, CUDA implements a block-thread model for its parallelism, a diagram of which can be seen in Figure 1.
Figure 1. CUDA’s Block-Thread Parallelism Model
Threads are arranged in a grid of blocks, with each block containing a multidimensional arrangement of threads; this gives structure to parallelised constructs and aids data processing and mapping.
Each block and thread is assigned an index within its parent, i.e. blockIdx and threadIdx respectively. The available components of the indices (x, y, etc.) vary depending on the dimensionality of the kernel (the arguments passed into the <<<>>> part of the function call). Both the grid and block types have an associated dimension object, gridDim and blockDim respectively. The child blocks and threads can read these to retrieve the shape of their parent, usually to calculate which indices of the data to read and transform.
Using Figure 1 as an example:
gridDim = {3, 2}, gridDim.x = 3, gridDim.y = 2
blockDim = {4, 3}, blockDim.x = 4, blockDim.y = 3
For thread(1, 0):
– blockIdx = {1, 1}, blockIdx.x = 1, blockIdx.y = 1
– threadIdx = {1, 0}, threadIdx.x = 1, threadIdx.y = 0
Often it is necessary to calculate the i and j indices of the thread the kernel function is being run on, in order to work out which input data to read and which output data to return or write. Details of this are left to the reader in the tasks that follow, but you will find the components of blockIdx, blockDim, and threadIdx useful when calculating this (you may sometimes also need gridDim, depending on the grid and block dimensions you specify when calling your kernel function). You now know enough to start Task 1.
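As a general illustration (and deliberately not the complete solution to any one task), a kernel launched over a two-dimensional grid of two-dimensional blocks would typically recover its global column and row indices as sketched below. The kernel name, its arguments, and the doubling operation are made up for this example; the bounds check matters because the grid is usually rounded up to cover problem sizes that are not an exact multiple of the block size.
__global__
void example_kernel(float* data, int width, int height) {
    // Global column (i) and row (j) of the element this thread is responsible for.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < width && j < height) {
        data[j * width + i] *= 2.0f;   // placeholder operation on element (j, i)
    }
}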
4.3. Assessing Performance. Performance analysis requires you to time the various portions of your code, but this works a little differently in CUDA than in previous labs. Timing CUDA code relies on creating, recording, synchronising, and calculating the time difference of events, using cudaEventCreate(cudaEvent_t* event), cudaEventRecord(cudaEvent_t e), cudaEventSynchronize(cudaEvent_t e) and cudaEventElapsedTime(float* out, cudaEvent_t start, cudaEvent_t end) respectively [7]. These functions are used to time a region of code as shown in Listing 3.
You now know enough to start Tasks 2 and 3.

5. Problems
5.1. Task 1: Using Threads. This task will be carried out in the context of a Single-precision A*X Plus Y (SAXPY) example written in CUDA, which can be found in Listing 4 and the file named saxpy.cu. Your objective is to use the knowledge gained from the lectures and the CUDA programming guide to complete the kernel; the key is in understanding the thread hierarchy and how it relates to threadIdx, blockIdx, and blockDim. You will know you have completed the task correctly when the reported error value reduces from 2.0 to 0.0.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
// Region to time…
cudaEventRecord(stop);
// Block the CPU until the stop event has completed
cudaEventSynchronize(stop);
float milliseconds = 0.0;
cudaEventElapsedTime(&milliseconds, start, stop);
Listing 3. Timing a block of CUDA code
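Note that cudaEventElapsedTime reports the elapsed time in milliseconds. Listing 3 also omits clean-up; once the events are no longer needed they can be released with:
cudaEventDestroy(start);
cudaEventDestroy(stop);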
#include <stdio.h>
#include <stdlib.h>

__global__
void saxpy(int n, float a, float* x, float* y) {
    // TODO: compute this thread's global index i and, if i < n,
    // perform the SAXPY update y[i] = a * x[i] + y[i].
}

int main(void) {
    int N = 1 << 20;
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N * sizeof(float));
    y = (float*)malloc(N * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);
    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = max(maxError, abs(y[i] - 4.0f));
    }
    printf("Max error: %f\n", maxError);
    return 0;
}
Listing 4. CUDA SAXPY example

5.2. Task 2: Assessing Performance. Often it is useful to know how long data transfers to and from the GPU take, since if the cost of sending the data outweighs the gain from parallelisation, then there is little value in trying. For Task 2, place timing calls around the cudaMalloc and both cudaMemcpy functions in Listing 5.

#include <stdlib.h>
#include <string.h>
int main() {
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);
    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    return 0;
}
Listing 5. Memory transfer example
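Once you have a transfer time from Task 2, a common follow-up is to convert it into an effective bandwidth, which is easier to compare against the theoretical peak of the PCIe link. A minimal sketch, assuming a variable milliseconds obtained exactly as in Listing 3 and the bytes constant from Listing 5:
float seconds = milliseconds / 1000.0f;    // cudaEventElapsedTime reports milliseconds
float gigabytes = bytes / 1.0e9f;          // bytes transferred, expressed in GB
printf("Effective bandwidth: %f GB/s\n", gigabytes / seconds);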
5.3. Task 3: Matrix Multiplication. Matrix multiplication is a staple operation in scientific codes due to its use in solving linear equations. In this task we will consolidate your knowledge of the thread hierarchy by completing the mapping from threads to rows and columns (marked by the TODO comments). You will know if you have done this correctly because the program will no longer print "Validation failed.". Additionally, you should determine whether it is useful, in the context of the code in Listing 6, to offload the compute to the device, by timing the CPU version and the GPU version of the code (think carefully about which operations you should include in your timed regions).
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 16

__global__
void mat_mult(float *A, float *B, float *C, int N) {
    int row = 0; // TODO
    int col = 0; // TODO
    float sum = 0.0f;
    for (int n = 0; n < N; ++n) {
        sum += A[row * N + n] * B[n * N + col];
    }
    C[row * N + col] = sum;
}

void mat_mult_cpu(float *A, float *B, float *C, int N) {
    #pragma omp parallel for
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int n = 0; n < N; ++n) {
                sum += A[row * N + n] * B[n * N + col];
            }
            C[row * N + col] = sum;
        }
    }
}

int main(int argc, char* argv[]) {
    const int K = 64; // number of thread blocks per grid dimension (choose as appropriate)
    const int N = K * BLOCK_SIZE;
    float *hA, *hB, *hC;
    hA = new float[N * N];
    hB = new float[N * N];
    hC = new float[N * N];
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            hA[j * N + i] = 2.0f * (j + i);
            hB[j * N + i] = 1.0f * (j - i);
        }
    }
    int size = N * N * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    cudaMalloc(&dC, size);
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(K, K);
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);
    mat_mult<<<grid, threadBlock>>>(dA, dB, dC, N);
    if (cudaPeekAtLastError() != cudaSuccess) {
        fprintf(stderr, "CUDA error detected: \"%s\"\n",
                cudaGetErrorString(cudaGetLastError()));
        return 1;
    }
    float *C = new float[N * N];
    cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    mat_mult_cpu(hA, hB, hC, N);
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            if (C[row * N + col] != hC[row * N + col]) {
                fprintf(stderr, "Validation failed at row=%d, col=%d.\n", row, col);
                return 1;
            }
        }
    }
    return 0;
}
Listing 6. CUDA Matrix multiplication example

References
[1] NVIDIA. CUDA C Programming Guide. https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (accessed December 10, 2019), 2019.
[2] NVIDIA Developer Blog. An Even Easier Introduction to CUDA. https://devblogs.nvidia.com/even-easier-introduction-cuda/ (accessed December 10, 2019), 2017.
[3] NVIDIA Developer Blog. How to Access Global Memory Efficiently in CUDA C/C++ Kernels. https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/ (accessed December 10, 2019), 2013.
[4] NVIDIA Developer Blog. How to Optimize Data Transfers in CUDA C/C++. https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/ (accessed December 10, 2019), 2012.
[5] NVIDIA Developer Blog. How to Overlap Data Transfers in CUDA C/C++. https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/ (accessed December 10, 2019), 2012.
[6] NVIDIA. CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html (accessed December 10, 2019), 2019.
[7] NVIDIA. CUDA Toolkit Documentation, 5.5 Event Management. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html (accessed December 16, 2019), 2019.