
Com 4521 Parallel Computing with GPUs: Lab 04

Spring Semester 2018

Dr Paul Richmond

Lab Assistants: Robert Chisholm, John Charlton

Department of Computer Science, University of Sheffield

Learning Outcomes

• Understand how to launch CUDA kernels
• Understand and demonstrate how to allocate and move memory to and from the GPU
• Understand CUDA thread block layouts for 1D and 2D problems
• Learn how to error check code by implementing a reference version
• Learn how to memory check code using the NSIGHT profiler

Prerequisites

Install CUDA Toolkit from the software centre.

Lab Register

The lab register must be completed by every student after finishing the exercises. Complete it once you have finished the lab, including reviewing the solutions. You are not expected to complete it during the lab class, but you should do so by the end of the teaching week.

Lab Register Link: https://goo.gl/0r73gD

Exercise 01

Exercise 1 requires that we decipher some encoded text. The provided text (in the file encrypted01.bin) has been encoded using an affine cipher. The affine cipher is a type of monoalphabetic substitution cipher in which each numerical character of the alphabet is encrypted using a mathematical function. The encryption function is defined as:

E(x) = (Ax + B) mod M

where A and B are keys of the cipher, mod is the modulo operation, and A and M are co-prime. For this exercise the value of A is 15, B is 27 and M is 128 (the size of the ASCII alphabet). The affine decryption function is defined as

D(x) = A^-1 (x - B) mod M

where A^-1 is the modular multiplicative inverse of A modulo M. For this exercise A^-1 has a value of 111. Note: the mod operation is not the same as the C remainder operator (%) for negative numbers. A suitable mod function has been provided for the exercise. The provided function takes the form modulo(int a, int b), where a is everything to the left of the decryption function's mod operator (e.g. A^-1(x - B)) and b is everything to the right of the mod operator (e.g. M).


As each of the encrypted character values is independent, we can use the GPU to decrypt them in parallel. To do this we will launch a thread for each encrypted character value and use a kernel function to perform the decryption. Starting from the code provided, complete the exercise by performing the following:

1.1 Modify the modulo function so that it can be called on the device by the affine_decrypt

kernel.

1.2 Implement the decryption kernel for a single block of threads with an x dimension of N (1024). The function should store the result in d_output. You can define the modular inverse A^-1, B and M using pre-processor definitions.
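One possible shape for the single-block kernel is sketched below. The argument names follow the handout; the device-side modulo helper and key definitions are assumed to match those described above.

```cuda
#define N 1024
#define A_INV 111
#define B 27
#define M 128

// Assumed device version of the provided modulo helper (exercise 1.1)
__device__ int modulo(int a, int b) {
    int r = a % b;
    return r < 0 ? r + b : r;
}

__global__ void affine_decrypt(int *d_input, int *d_output) {
    int i = threadIdx.x;  // single block: the thread index is the element index
    d_output[i] = modulo(A_INV * (d_input[i] - B), M);
}
```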

1.3 Allocate some memory on the device for the input (d_input) and output (d_output).

1.4 Copy the host input values in h_input to the device memory d_input.

1.5 Configure a single block of N threads and launch the affine_decrypt kernel.

1.6 Copy the device output values in d_output to the host memory h_output.
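Steps 1.3 to 1.6 follow the standard allocate / copy-in / launch / copy-out pattern. A minimal host-side sketch, assuming the variable names from the handout and omitting error checking:

```cuda
int *d_input, *d_output;
cudaMalloc((void **)&d_input,  N * sizeof(int));                          // 1.3
cudaMalloc((void **)&d_output, N * sizeof(int));
cudaMemcpy(d_input, h_input, N * sizeof(int), cudaMemcpyHostToDevice);   // 1.4
affine_decrypt<<<1, N>>>(d_input, d_output);                             // 1.5
cudaMemcpy(h_output, d_output, N * sizeof(int), cudaMemcpyDeviceToHost); // 1.6
```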

1.7 Compile and execute your program. If you have performed the exercise correctly you should see the decrypted text.

1.8 Don’t go running off through the forest just yet! Modify your code to complete the

affine_decrypt_multiblock kernel which should work when using multiple blocks of

threads. Change your grid and block dimensions so that you launch 8 blocks of 128 threads.
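When multiple blocks are used, the element index must combine the block and thread indices. A sketch of the multi-block variant, assuming the same modulo helper and key definitions as before:

```cuda
__global__ void affine_decrypt_multiblock(int *d_input, int *d_output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    d_output[i] = modulo(A_INV * (d_input[i] - B), M);
}

// launched with 8 blocks of 128 threads (8 * 128 = 1024 = N):
// affine_decrypt_multiblock<<<8, 128>>>(d_input, d_output);
```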

Exercise 02

In exercise 2 we are going to extend the vector addition example from the lecture. Create a new

CUDA project and import the starting code (exercise02.cu). Perform the following modifications.

2.1 The code has an obvious mistake. Rather than correct it, implement a CPU version of the vector addition (called vectorAddCPU) storing the result in an array called c_ref. Implement a new function validate which compares the GPU result to the CPU result. It should print an error for each value which is incorrect and return the total number of errors. You should also print the number of errors to the console. Now fix the error and confirm that your error-checking code works.

2.2 Change the value of N to 2050. Your code will now produce an error. Why? Modify your code so that you launch enough threads to cover all 2050 elements.

2.3 If you performed the above without considering the extra threads then chances are that you have written to GPU memory beyond the bounds you allocated. This will not necessarily raise an error. We can check the program for out-of-bounds accesses using the CUDA debugger. Without inserting any breakpoints, select the NSIGHT menu and ensure that 'Enable CUDA Memory Checker' is enabled. Start the CUDA debugger; the first time you do this a firewall dialog will appear. Press cancel and NSIGHT will continue as planned. The debugger will halt in your kernel when the memory checker detects an access violation; view the NSIGHT Output window for a summary of these. Correct the error by adding a check in the kernel so that it does not write beyond the bounds of the allocated memory. Test in the CUDA debugger and ensure that you no longer have any errors.
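A sketch of the guarded kernel (the signature is an assumption; exercise02.cu may differ):

```cuda
__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // threads in the final block past the end do nothing,
        c[i] = a[i] + b[i];  // preventing out-of-bounds writes
}
```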

Exercise 03

We are going to implement a matrix addition kernel. In matrix addition, two matrices of the same dimensions are added entry-wise. If you modify your code from exercise 2 it will require the following changes:

3.1 Modify the value of size so that you allocate enough memory for a matrix of size N x N and move the correct amount of data using cudaMemcpy. Set N to 2048.

3.2 Modify the random_ints function to generate a random matrix rather than a vector.

3.3 Rename your CPU implementation to matrixAddCPU and update the validate function.

3.4 Change your launch parameters to launch a 2D grid of thread blocks with 256 threads per block.

Create a new kernel (matrixAdd) to perform the matrix addition. Hint: You might find it helps

to reduce N to a single thread block to test your code.
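One way to sketch the 2D launch and kernel, assuming the matrices are stored row-major as flat arrays (block shape 16 x 16 is one choice of 256 threads per block):

```cuda
#define N 2048

__global__ void matrixAdd(const int *a, const int *b, int *c) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int i = row * N + col;   // flatten the 2D index into row-major storage
    c[i] = a[i] + b[i];
}

// dim3 threads_per_block(16, 16);        // 16 * 16 = 256 threads per block
// dim3 blocks(N / 16, N / 16);           // N = 2048 divides evenly by 16
// matrixAdd<<<blocks, threads_per_block>>>(d_a, d_b, d_c);
```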

3.5 Finally, modify your code so that it works with non-square matrices of size N x M for any values of N and M.