Com 4521 Parallel Computing with GPUs: Lab 09 (CUDA Libraries and Streams)

Spring Semester 2018

Dr Paul Richmond

Lab Assistants: Robert Chisholm and John Charlton

Department of Computer Science, University of Sheffield

Learning Outcomes

• Understand how to use the Thrust library to perform sorting of key-value pairs
• Understand how to use the Thrust library to simplify the scan algorithm
• Understand how parallel primitives can be used within a complex problem (particle-particle interactions)
• Understand how CUDA streams can be used to overlap memory copies and kernel execution
• Demonstrate the performance differences of two distinct methods of scheduling stream work

Exercise 01

For this exercise we are revisiting some code from the very first lab, where we implemented a command-driven calculator. The exercise has been extended to create a parallel calculator which applies a fixed program of calculations to a set of input data using an SPMD model. A GPU implementation has already been provided. The performance-critical (and hence timed) part of the implementation requires copying the input data to the device, performing the calculations and finally reading the data back from the device. The provided implementation is strictly synchronous in execution and uses only the default CUDA stream. It is possible to improve the performance by breaking the task down into smaller independent parts and using streams to ensure that the copy engine(s) and compute engine are kept busy with independent work. We will implement two different streaming versions and compare the performance of each.

The first asynchronous version that we implement will loop through each stream and perform a copy to the device, a kernel execution and a copy back from the device. Each stream will be responsible for a subset of the data (of size SAMPLES divided by NUM_STREAMS). Implement this version by completing the following tasks, which require you to modify the cudaCalculatorNStream1 function (a sketch of the resulting pattern follows the task list):

1.1 Allocate the host and device memory so that it can be used for asynchronous copying. Hint: asynchronous copies require page-locked (pinned) host memory, e.g. allocated with cudaMallocHost.

1.2 Create NUM_STREAMS streams in the streams array declared within the function.

1.3 Create a loop to iterate over the streams. For each stream, launch stages 1, 2 and 3 from the synchronous version so that the stream operates on only a subset of the data. Test your code.

1.4 Destroy each of the CUDA streams.
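
Putting the tasks together, the loop body for this version might look something like the sketch below. This is illustrative only: the kernel name (calculate), its launch configuration and the buffer names (h_in, h_out, d_in, d_out) are assumptions rather than the lab code's actual identifiers, and the host buffers are assumed to have been allocated with cudaMallocHost (task 1.1).

    const int samples_per_stream = SAMPLES / NUM_STREAMS;
    cudaStream_t streams[NUM_STREAMS];

    // Stream creation (task 1.2)
    for (int i = 0; i < NUM_STREAMS; i++)
        cudaStreamCreate(&streams[i]);

    // Depth-first scheduling (task 1.3): each stream performs all three
    // stages on its own subset of the data before the loop moves on
    for (int i = 0; i < NUM_STREAMS; i++) {
        const int offset = i * samples_per_stream;
        // Stage 1: copy this subset of the input to the device
        cudaMemcpyAsync(&d_in[offset], &h_in[offset],
                        samples_per_stream * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        // Stage 2: launch the kernel on the same subset in the same stream
        // (assumes samples_per_stream divides evenly by the block size)
        calculate<<<samples_per_stream / THREADS_PER_BLOCK, THREADS_PER_BLOCK,
                    0, streams[i]>>>(&d_in[offset], &d_out[offset]);
        // Stage 3: copy the results for this subset back to the host
        cudaMemcpyAsync(&h_out[offset], &d_out[offset],
                        samples_per_stream * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();  // wait for all streams before stopping the timer

    // Stream destruction (task 1.4)
    for (int i = 0; i < NUM_STREAMS; i++)
        cudaStreamDestroy(streams[i]);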

For the second asynchronous version you can copy the allocation, stream creation and destruction code from your cudaCalculatorNStream1 function to cudaCalculatorNStream2. For this exercise we will make a subtle change to the way work within streams is scheduled (sketched after task 1.7).

1.5 Loop through the streams and schedule each stream to perform a host-to-device memory copy on a suitable subset of the data.

1.6 Loop through the streams and schedule each stream to perform the kernel execution on a suitable subset of the data.

1.7 Loop through the streams and schedule each stream to perform a device-to-host memory copy on a suitable subset of the data, hence copying the results back.
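
A sketch of this breadth-first scheduling, under the same naming assumptions as the previous sketch:

    // All host-to-device copies are issued first (task 1.5)...
    for (int i = 0; i < NUM_STREAMS; i++) {
        const int offset = i * samples_per_stream;
        cudaMemcpyAsync(&d_in[offset], &h_in[offset],
                        samples_per_stream * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
    }
    // ...then all kernel launches (task 1.6)...
    for (int i = 0; i < NUM_STREAMS; i++) {
        const int offset = i * samples_per_stream;
        calculate<<<samples_per_stream / THREADS_PER_BLOCK, THREADS_PER_BLOCK,
                    0, streams[i]>>>(&d_in[offset], &d_out[offset]);
    }
    // ...and finally all device-to-host copies (task 1.7)
    for (int i = 0; i < NUM_STREAMS; i++) {
        const int offset = i * samples_per_stream;
        cudaMemcpyAsync(&h_out[offset], &d_out[offset],
                        samples_per_stream * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

The work scheduled is identical to the first version; only the order in which it is issued to the hardware work queues changes, which can affect how copies and kernels overlap. The benchmark in task 1.8 explores this.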

1.8 Test and benchmark your code. Modify the NUM_STREAMS value and complete the following table of results. Do you understand why the two stream versions differ in performance?

Version                       Streams   GPU timing (s)
cudaCalculatorDefaultStream   N/A
cudaCalculatorNStream1        2
cudaCalculatorNStream1        3
cudaCalculatorNStream1        6
cudaCalculatorNStream2        2
cudaCalculatorNStream2        3
cudaCalculatorNStream2        6

Exercise 02

In exercise two we are going to migrate an implementation of the simple particle simulation which was described in the previous lecture. The implementation calculates each particle's nearest neighbour within a fixed range. It does this through the following steps:

1. Generate key-value pairs: the key is the bucket that the particle belongs to and the value is the index of the particle.
2. Sort the key-value pairs: the pairs are sorted according to their key.
3. Re-order the particles: the particles are reordered according to the sorted key-value pairs. Using the value, a scattered write can move each particle to its new location.
4. Calculate a histogram of the particles: for each bucket that discretises the environment, a count of the number of particles it contains is generated.
5. Prefix sum of the buckets: given the bucket counts, a prefix sum is used to determine the starting index of each bucket within the sorted list of particles.
6. Calculate the nearest neighbour of each particle by considering all neighbours within a fixed range.

The implementation provided uses the CPU for most of these stages (other than the nearest-neighbour calculation). To improve this code we are going to re-implement the CPU functions on the GPU by completing a number of steps. In the particlesGPU function, perform the following:

2.1 Start by copying the initialised particle data from the host (h_particles) to the device (d_particles).

2.2 Complete the kernel which calculates the key-value pairs (one possible form is sketched below).
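
A minimal sketch of such a kernel, assuming a 2D environment discretised into square buckets; the particle and key_value structures and the BUCKET_* constants are hypothetical names, not the lab code's actual definitions.

    __global__ void generate_key_value_pairs(const particle *particles,
                                             key_value *key_values, int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;
        // Discretise the particle's position into a bucket (grid cell) index
        const int bx = (int)(particles[i].x / BUCKET_WIDTH);
        const int by = (int)(particles[i].y / BUCKET_WIDTH);
        key_values[i].key = by * BUCKETS_PER_ROW + bx;  // bucket the particle is in
        key_values[i].value = i;                        // the particle's index
    }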

2.3 Use Thrust's sort_by_key to sort the key-value pairs (d_key_values), as sketched below.
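
If the keys and values were held in two separate device arrays (an assumption; the layout of d_key_values in the lab code may differ), the Thrust call would look something like the following. The names d_keys, d_values and NUM_PARTICLES are illustrative.

    #include <thrust/sort.h>
    #include <thrust/device_ptr.h>

    // Wrap the raw device pointers so Thrust dispatches to the device
    thrust::device_ptr<int> keys = thrust::device_pointer_cast(d_keys);
    thrust::device_ptr<int> values = thrust::device_pointer_cast(d_values);
    // Sort the values in place according to their keys
    thrust::sort_by_key(keys, keys + NUM_PARTICLES, values);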

2.4 Implement a kernel for re-ordering particles given the old positions (d_particles) and the key-value pairs. Save the results in d_particles_sorted (see the sketch below).
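
One way to write this kernel, reusing the hypothetical structures from the earlier sketch; here each thread produces one particle of the sorted array by reading the original index from its sorted pair.

    __global__ void reorder_particles(const particle *particles,
                                      const key_value *sorted_key_values,
                                      particle *particles_sorted, int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;
        // The value of the i-th sorted pair is the particle's original
        // (unsorted) index, so one indexed copy moves it into sorted order
        const int old_index = sorted_key_values[i].value;
        particles_sorted[i] = particles[old_index];
    }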

2.5 Implement a kernel to calculate the counts for each bucket in the environment. Store these in the d_env structure. Hint: you will need to use atomics if you want to increment device-global values; see the sketch below.
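
For illustration, a minimal histogram kernel might look like this, assuming the bucket counts are a zero-initialised int array with one entry per bucket (the bucket_counts name is an assumption about the d_env member).

    __global__ void count_particles(const key_value *sorted_key_values,
                                    int *bucket_counts, int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;
        // Many threads may land in the same bucket concurrently, so the
        // increment of the shared counter must be atomic
        atomicAdd(&bucket_counts[sorted_key_values[i].key], 1);
    }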

2.6 Use Thrust to perform an exclusive scan on the environment counts. The scan output should go in d_env->start_index. Hint: you will need to cast the arrays within d_env using device pointer casts, as in the sketch below.
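
A minimal sketch, assuming d_env is a host-side structure whose count and start_index members are raw pointers to device arrays of NUM_BUCKETS ints (the count member and the constant name are assumptions).

    #include <thrust/scan.h>
    #include <thrust/device_ptr.h>

    // Cast the raw device pointers so Thrust operates on device memory
    thrust::device_ptr<int> counts = thrust::device_pointer_cast(d_env->count);
    thrust::device_ptr<int> start = thrust::device_pointer_cast(d_env->start_index);
    // Exclusive scan: start[b] becomes the sum of the counts of all buckets
    // before b, i.e. the first index of bucket b in the sorted particle list
    thrust::exclusive_scan(counts, counts + NUM_BUCKETS, start);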

2.7 Ensure your code produces the correct result and compare the CPU and GPU versions. Note that our timing code does not use CUDA events, as we are timing both CPU and GPU code execution.