
Com 4521 Parallel Computing with GPUs: Lab 05

Spring Semester 2018

Dr Paul Richmond

Lab Assistants: John Carlton and Robert Chisholm

Department of Computer Science, University of Sheffield

Learning Outcomes

• How to query CUDA device properties

• Understanding how to observe the difference between theoretical and measured memory bandwidth

• Understanding and observing the difference between the constant cache and the read-only cache

• Understanding how to use texture binding for problems which naturally map to 2D domains

Corrections

 6/3/18 Exercise 1.3 corrected to reflect the result is measured in kilobits

Lab Register

The lab register must be completed by every student after finishing the exercises, including reviewing the solutions. You are not expected to complete it during the lab class, but you should complete it by the end of the teaching week.

Lab Register Link: https://goo.gl/0r73gD

Exercise 01

In exercise one we are going to extend our vector addition kernel. The code is provided in exercise01.cu. Complete the following changes:

1.1 Modify the example to use statically defined global variables (i.e. where the size is declared at

compile time and you do not need to use cudaMalloc). Note: A device symbol (statically

defined CUDA memory) is not the same as a device address in the host code. Passing a symbol as

an argument to the kernel launch will cause invalid memory accesses in the kernel.
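
A minimal sketch of the statically defined version, assuming a fixed size N and int data (names such as d_a, h_a and THREADS_PER_BLOCK are illustrative, not from the provided source):

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 2048
    #define THREADS_PER_BLOCK 128

    /* Statically defined device memory: no cudaMalloc/cudaFree needed */
    __device__ int d_a[N];
    __device__ int d_b[N];
    __device__ int d_c[N];

    /* The kernel references the device symbols directly; they are NOT
       passed as launch arguments (doing so gives invalid accesses) */
    __global__ void vectorAdd() {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            d_c[i] = d_a[i] + d_b[i];
    }

    int main() {
        int h_a[N], h_b[N], h_c[N];
        for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = N - i; }

        /* Symbols are copied to/from with the *Symbol variants of cudaMemcpy */
        cudaMemcpyToSymbol(d_a, h_a, sizeof(int) * N);
        cudaMemcpyToSymbol(d_b, h_b, sizeof(int) * N);
        vectorAdd<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>();
        cudaMemcpyFromSymbol(h_c, d_c, sizeof(int) * N);

        printf("c[0] = %d\n", h_c[0]);
        return 0;
    }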

1.2 Modify the code to record timing data of the kernel execution. Print this data to the console.
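
CUDA events are the usual way to time a kernel; a host-side sketch (kernel name and launch configuration carried over from the sketch in 1.1):

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAdd<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>();
    cudaEventRecord(stop);

    /* The stop event must have completed before reading the elapsed time */
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel execution time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);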

1.3 We would like to query the device properties so that we can calculate the theoretical memory

bandwidth of the device. The formula for theoretical bandwidth is given by:

theoreticalBW = memoryClockRate × memoryBusWidth

Using cudaDeviceProp, query the two values from the first CUDA device available and multiply these by two (as DDR memory is double pumped, hence the name) to calculate the theoretical bandwidth. Print the theoretical bandwidth to the console in GB/s (gigabytes per second). Note that the above will calculate the result in kilobits/second (as memoryClockRate


is measured in kilohertz and memoryBusWidth is measured in bits). You will need to convert the memory clock rate to Gb/s (gigabits per second) and then convert this to GB/s.
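
A sketch of the query and the unit conversions, assuming the first device (device 0) is the one in use:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); /* first CUDA device */

    /* memoryClockRate is in kHz and memoryBusWidth in bits, so the doubled
       (DDR) product is in kilobits/second: /8 gives kilobytes/second and a
       further /1e6 gives gigabytes/second */
    double theoreticalBW = 2.0 * prop.memoryClockRate
                         * (prop.memoryBusWidth / 8.0) / 1.0e6;
    printf("Theoretical bandwidth: %.2f GB/s\n", theoreticalBW);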

1.4 Theoretical bandwidth is the maximum bandwidth we could achieve in ideal conditions. We will

learn more about improving bandwidth in later lectures. For now we would like to calculate the

measured bandwidth of the vectorAdd kernel. Measured bandwidth is given by:

measuredBW = (RBytes + WBytes) / t

where RBytes is the number of bytes read and WBytes is the number of bytes written by the kernel. You can calculate these values by considering how many bytes the kernel reads and writes per thread and multiplying by the number of threads that are launched. The value t is given by your timing data in ms; you will need to convert this to seconds to give the bandwidth in GB/s.

Print the value to the console so that you can compare it with the theoretical bandwidth. Note:

Don’t forget to switch to Release mode to profile your code execution times.
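
For the vector addition kernel each thread reads two values and writes one, so assuming int data, N threads and the ms timing value from 1.2, the calculation might look like:

    /* vectorAdd: each of the N threads reads a[i] and b[i] and writes c[i] */
    double RBytes = 2.0 * sizeof(int) * N;
    double WBytes = 1.0 * sizeof(int) * N;

    /* ms is the kernel time recorded with cudaEventElapsedTime (milliseconds) */
    double measuredBW = (RBytes + WBytes) / (ms / 1000.0) / 1.0e9;
    printf("Measured bandwidth: %.2f GB/s\n", measuredBW);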

Exercise 02

In the last lecture we learned about the different types of memory and caches which are available on

a GPU device. For this exercise we are going to optimise a simple ray tracer application by changing

the memory types which are used. Download the starting code (exercise02.cu) and take a look at it.

The ray tracer is a simple ray casting algorithm which casts a ray for each pixel into a scene

consisting of sphere objects. The ray checks for intersections with the spheres; where there is an intersection, a colour value for the pixel is generated based on the intersection position of the ray on

the sphere (giving an impression of forward facing lighting). For more information on the ray tracing

technique read Chapter 6 of the CUDA by Example book, on which this exercise is based. Try

executing the starting code and examining the output image (output.ppm) using GIMP or Adobe

Photoshop.

The initial code places the spheres in GPU global memory. We know that there are two options for

improving this in the form of constant memory and texture/read-only memory. Implement the

following changes.

The following exercise should be completed by building the device code for the latest supported

version of the GPU you are using. For the Diamond high-spec labs this is “compute_35,sm_35”.

2.1 Create a modified version of the ray tracing kernel which uses the read-only data cache

(ray_trace_read_only). You should implement this by using the const and

__restrict__ qualifiers. Calculate the execution time of the new version alongside the old version so that they can be directly compared. You will also need to create a modified version of the sphere intersect function (sphere_intersect_read_only).
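
A sketch of the qualifier changes; the parameter lists and the uchar4 image type are assumptions based on the CUDA by Example ray tracer rather than the exact signatures in the starting code:

    /* const + __restrict__ lets the compiler issue loads through the
       read-only data cache (compute capability 3.5 and above) */
    __device__ float sphere_intersect_read_only(
            const Sphere* __restrict__ s, float ox, float oy, float *n) {
        /* ... body identical to sphere_intersect ... */
    }

    __global__ void ray_trace_read_only(
            uchar4 *image, const Sphere* __restrict__ spheres) {
        /* ... body identical to ray_trace, but calling
           sphere_intersect_read_only ... */
    }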

2.2 Create a modified version of the ray tracing kernel which uses the constant data cache

(ray_trace_const). Calculate the execution time of the new version alongside the two

other versions so that they can be directly compared.
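
A sketch, assuming a SPHERES macro giving the sphere count as in the CUDA by Example version:

    /* Constant memory must be declared at file scope with a compile-time size */
    __constant__ Sphere d_const_spheres[SPHERES];

    __global__ void ray_trace_const(uchar4 *image) {
        /* ... body identical to ray_trace, but indexing d_const_spheres
           directly instead of reading a pointer argument ... */
    }

    /* Host side, before the launch: copy the sphere data to the symbol */
    /* cudaMemcpyToSymbol(d_const_spheres, h_spheres, sizeof(Sphere) * SPHERES); */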

2.3 How does the performance compare? Is this what you expected, and why? Modify the number of spheres to complete the following table. For an extra challenge, try to do this automatically so that you loop over all the sphere count sizes in the table and record the timing results in a 2D array (a sketch of this approach is given after the table).

Sphere Count | Normal | Read-only cache | Constant cache
-------------|--------|-----------------|---------------
          16 |        |                 |
          32 |        |                 |
          64 |        |                 |
         128 |        |                 |
         256 |        |                 |
        1024 |        |                 |
        2048 |        |                 |
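
One possible shape for the automated version (the allocation, launch and timing code is elided; the array and variable names are illustrative):

    const int sphere_counts[] = {16, 32, 64, 128, 256, 1024, 2048};
    const int NUM_COUNTS = sizeof(sphere_counts) / sizeof(sphere_counts[0]);
    float timings[NUM_COUNTS][3]; /* columns: normal, read-only, constant */

    for (int i = 0; i < NUM_COUNTS; i++) {
        /* generate sphere_counts[i] spheres and copy them to the device
           (for the constant version, via cudaMemcpyToSymbol) */
        /* launch and time each of the three kernel versions, storing the
           elapsed milliseconds in timings[i][0], timings[i][1], timings[i][2] */
    }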

Exercise 03

In exercise 3 we are going to experiment with using texture memory. In this example we are going to

explicitly use texture binding rather than using qualifiers to force memory loads through the read-

only cache. There are good reasons for doing this when dealing with problems which relate to images or problems which decompose naturally to 2D layouts.1 The example that we will be working

with is an image blur. The code (exercise03.cu) is provided along with an image input.ppm (an image

of my dog relaxing). Build and execute the code to see the result of executing the image blur kernel.

You can modify the macro SAMPLE_SIZE to increase the scale of the blur.

Take a look at the kernel code and ensure that you understand it. We will now implement texture

sampling by performing the following:

3.1 Duplicate the image_blur kernel (naming it image_blur_texture1D). Declare a 1

dimensional texture reference with cudaReadModeElementType. Modify the new kernel to

perform a texture lookup using tex1Dfetch. Modify the host code to execute the texture1D version of the kernel after the first version, saving the timing value to the y component of the

variable ms. You will need to add appropriate host code to bind and unbind the texture before

and after the kernel execution respectively.
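
A sketch using the texture reference API of the CUDA versions this lab targets (tex1d, d_image, d_image_output and IMAGE_DIM are illustrative names; ms is assumed to be a float3, as the exercise implies):

    /* Texture references must be declared at file scope */
    texture<uchar4, cudaTextureType1D, cudaReadModeElementType> tex1d;

    __global__ void image_blur_texture1D(uchar4 *image_output) {
        /* ... body as image_blur, but each image read becomes a fetch: */
        /* uchar4 pixel = tex1Dfetch(tex1d, y * IMAGE_DIM + x); */
    }

    /* Host side, around the timed launch: */
    /* cudaBindTexture(0, tex1d, d_image, IMAGE_DIM * IMAGE_DIM * sizeof(uchar4)); */
    /* image_blur_texture1D<<<blocksPerGrid, threadsPerBlock>>>(d_image_output); */
    /* cudaEventElapsedTime(&ms.y, start, stop); */
    /* cudaUnbindTexture(tex1d); */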

3.2 Duplicate the image_blur kernel (naming it image_blur_texture2D). Declare a 2

dimensional texture reference with cudaReadModeElementType. Modify the new kernel to

perform a texture lookup using tex2D. Modify the host code to execute the texture2D version of the kernel after the first version, saving the timing value to the z component of the variable

ms. You will need to add appropriate host code to bind and unbind the texture before and after

the kernel execution respectively.
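
A sketch along the same lines (again with illustrative names):

    texture<uchar4, cudaTextureType2D, cudaReadModeElementType> tex2d;

    __global__ void image_blur_texture2D(uchar4 *image_output) {
        /* ... body as image_blur, but 2D reads become: */
        /* uchar4 pixel = tex2D(tex2d, x, y); */
    }

    /* Host side: a 2D binding also needs a channel descriptor and row pitch */
    /* cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar4>();        */
    /* cudaBindTexture2D(0, tex2d, d_image, desc, IMAGE_DIM, IMAGE_DIM,     */
    /*                   IMAGE_DIM * sizeof(uchar4));                       */
    /* image_blur_texture2D<<<blocksPerGrid, threadsPerBlock>>>(d_image_output); */
    /* cudaUnbindTexture(tex2d); */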

3.3 In the case of the 2D version it is possible to perform wrapping of the index values without

explicitly checking the x and y offset values. To do this, remove the checks from your kernel and set the addressMode[0] and addressMode[1] structure members of your 2D texture

reference to cudaAddressModeWrap.
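
The corresponding host-side setting might look like the following; note that the CUDA programming guide restricts wrap addressing to normalized coordinates, so the texture’s normalized flag and the lookup coordinates may need adjusting to match:

    /* Set on the host before binding; out-of-range coordinates then wrap
       around rather than needing explicit boundary checks in the kernel */
    tex2d.addressMode[0] = cudaAddressModeWrap; /* x dimension */
    tex2d.addressMode[1] = cudaAddressModeWrap; /* y dimension */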

Compare the performance metrics. There is likely not much difference between the 1D and 2D texture versions; however, the code is more concise with the 2D version.

1 E.g. Improved caching, address wrapping and filtering.