Com4521/Com6521: Parallel Computing with GPUs

Assignment: Part 2

Deadline: Tuesday 15th May 2018 17:00 (week 12)

Last Edited: 09/03/2018

Marking

Assignment 2 (of 2) is worth 70% of the total assignment mark. The total assignment mark (both

parts 1 and 2) is worth 80% of the total module mark.

Assignment 2 marks will be weighted as 50% for code functionality and performance and 50% for

demonstrating understanding via a written report.

Document Changes

This is Version 1 of the assignment document. Any corrections or changes to this document will be

noted here and an update will be sent out to the course google group mailing list.

Introduction

The aim of the assignment is to test your understanding and technical ability to implement efficient

code on the GPU. You will be expected to benchmark and optimise the implementation of a simple

image processing problem. You have already implemented a serial and multi-core version in

Assignment 1; you are now expected to implement a GPU version of the same task. The emphasis of this

assignment is on your ability to progressively improve your code to converge on an efficient

implementation. In order to demonstrate this, you are expected to document (in a written report) any

design considerations you have made in order to improve the performance (and ensure correctness).

For example, you should use benchmarking to demonstrate how changes to your implementation

have resulted in performance improvements, or explain through theory why a particular method was

chosen. Handing in just a piece of code with excellent performance will not score highly in the

assessment, unless you have also demonstrated an understanding in the written report, of how you

have progressively refined your implementation to reach the final solution.

The Assignment Requirements (Code)

You are expected to implement the pixelate filter on the GPU (using the same problem definition from

the Assignment 1 document). You should use your hand-in for assignment 1 as your starting code

which should already perform IO and the CPU/OpenMP modes. Your assignment 1 code will already

have a placeholder for when the program mode is CUDA. In order to add GPU code to your assignment

1 hand-in you will need to rename your existing C file to a CUDA extension (*.cu) before importing

it to a new Visual Studio, NVIDIA CUDA project. Your program should accept the additional CUDA

mode argument and you should update the print_help() function to reflect the new CUDA option.
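As a rough sketch of how the additional mode could be accepted, the C fragment below maps a mode string to an enumeration. The names Mode and parse_mode, and the mode strings CPU, OPENMP, CUDA and ALL, are assumptions made for illustration; use whatever names your assignment 1 code already defines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mode enumeration; match it to your assignment 1 code. */
typedef enum { MODE_CPU, MODE_OPENMP, MODE_CUDA, MODE_ALL, MODE_INVALID } Mode;

/* Map a mode argument to a Mode value.
   Returns MODE_INVALID for NULL or any unrecognised string. */
Mode parse_mode(const char *arg)
{
    if (arg == NULL)                  return MODE_INVALID;
    if (strcmp(arg, "CPU") == 0)      return MODE_CPU;
    if (strcmp(arg, "OPENMP") == 0)   return MODE_OPENMP;
    if (strcmp(arg, "CUDA") == 0)     return MODE_CUDA;   /* new in part 2 */
    if (strcmp(arg, "ALL") == 0)      return MODE_ALL;
    return MODE_INVALID;
}
```

When parse_mode returns MODE_INVALID, the caller can call print_help() and exit, so the help text and the accepted modes stay in step.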

You should use the feedback from your assignment 1 hand-in to ensure your GPU code is efficient and

works correctly. You may need to update your file IO code if you have previously implemented this

incorrectly (otherwise your GPU code may produce incorrect results). Your CPU and OpenMP code

should still be able to be run using the mode program argument but this code will not be re-assessed

as part of assignment 2.

Course Google group: https://groups.google.com/a/sheffield.ac.uk/d/forum/com4521-group

Timing:

The program arguments for the GPU version of your code are the same as the previous assignment.

You should take care to ensure that you accurately time the GPU code using an appropriate technique.

Parallel Implementation:

There are two obvious methods to implement the pixelate technique on the GPU. You can either

parallelise each mosaic cell or you can parallelise each input pixel. Each approach may be more suited

to certain sizes of the mosaic cell size C. You may implement either method (or a hybrid of both),

however, you must justify your decision within the document. You may consider implementing both

techniques to provide evidence as to which approach is favourable when considering different C

values. You should consider a range of input image sizes ensuring that for either version you have

enough threads to fully utilise the device. Larger input images may be slower but may give better

device utilisation.
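The trade-off between the two approaches can be estimated before writing any kernel. The C sketch below counts the threads each approach would launch and shows the index arithmetic that maps a pixel to its mosaic cell. The function names are hypothetical, and it assumes that partial cells at the right and bottom edges count as cells; check this against the part 1 problem definition.

```c
#include <stddef.h>

/* Ceiling division: number of cells of size c needed to cover n pixels. */
static size_t div_ceil(size_t n, size_t c)
{
    return (n + c - 1) / c;
}

/* Threads needed when each thread processes one mosaic cell. */
size_t threads_per_cell_approach(size_t width, size_t height, size_t c)
{
    return div_ceil(width, c) * div_ceil(height, c);
}

/* Threads needed when each thread processes one input pixel. */
size_t threads_per_pixel_approach(size_t width, size_t height)
{
    return width * height;
}

/* Row-major index of the mosaic cell that contains pixel (x, y). */
size_t cell_index(size_t x, size_t y, size_t width, size_t c)
{
    return (y / c) * div_ceil(width, c) + (x / c);
}
```

For a 2048 x 2048 image with C = 256, the per-cell approach yields only 8 x 8 = 64 threads, whereas the per-pixel approach yields 4,194,304. With C = 2 the per-cell approach already yields 1,048,576 threads, which is one reason the better choice may depend on C.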

Documentation Requirements

You are expected to document the implementation of your code. More specifically, you are

expected to compare and contrast various implementation techniques to show how you have

converged on a particular implementation. In particular, you should benchmark your code in

Release mode to compare alternative techniques (such as the use of various GPU memory caches

where appropriate) and give an explanation as to why one implementation technique (or

optimisation you have made) is better than the other. Some examples of interesting benchmarks or

discussions include:

● Parallelisation approach?

● Different methods to layout or represent your data in GPU memory to ensure coalesced

access patterns. E.g. Arrays of Structures vs Structures of Arrays.

● The use of various GPU memory caches (texture/read-only, constant, shared memory) to

reduce the number of global memory reads? Note: not all will be suitable for your problem,

but you should discuss why not.

● Any GPU optimisations you have made to improve the performance? A description of any

investigations into performance through benchmarking or profiling.

● Any other interesting aspects of the implementation or optimisation techniques you have

applied to the GPU version of your code.
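To make the Arrays of Structures vs Structures of Arrays comparison concrete, the following C sketch converts an interleaved pixel array into one array per colour channel. The type and function names are hypothetical, and a 3-byte RGB pixel is assumed. In the SoA layout, consecutive threads that read r[i] and r[i + 1] touch adjacent bytes, whereas in this AoS layout the same reads are 3 bytes apart.

```c
#include <stddef.h>

/* Array of Structures: one struct per pixel, channels interleaved. */
typedef struct {
    unsigned char r, g, b;
} PixelAoS;

/* Structure of Arrays: one contiguous array per channel. */
typedef struct {
    unsigned char *r;
    unsigned char *g;
    unsigned char *b;
} ImageSoA;

/* Copy n interleaved pixels into three caller-allocated channel arrays,
   each of which must hold at least n elements. */
void aos_to_soa(const PixelAoS *in, ImageSoA out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out.r[i] = in[i].r;
        out.g[i] = in[i].g;
        out.b[i] = in[i].b;
    }
}
```

Which layout is faster for your kernel is something to establish by benchmarking; the conversion cost itself should be included in, or clearly separated from, the timings you report.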

Benchmarking should always be done in Release mode within Visual Studio with timing results for a

single run of your program averaged over a number of independent program runs. Benchmarks

should consider various values of image size and the mosaic cell size C (as defined in the part 1

document) to demonstrate performance scaling. For each significant improvement to your code try

to show the performance of your code before and after changes. You should highlight (with short

code samples) any novel aspects or optimisations you have made.

Project Hand In

You should hand in your program code via MOLE with the documentation as a single pdf within a single

zip file. You should also include the Visual Studio solution and any project files. Your code should build

in the Release mode configuration without errors or warnings (other than those caused by

IntelliSense) on Diamond computer room 4 (lab) machines. You should submit whatever you have

done if you have not completed the entire assignment. Your code should not rely on any third party

libraries or tools (other than those included with CUDA or OpenMP).

Marking

The marks for part 2 of the assignment will be distributed as follows:

● 50% of the assignment is for the coding aspect. Half of this percentage is for the quality of the

programming and optimisation (including the performance of your code) and the other half is

for satisfying the requirements.

● 50% of the assignment is for the production of a document describing the processes you have

undertaken to implement and optimise your code. This should include benchmarking and

iterative refinement of approaches as described in the documentation requirements.

In assessing your work, the following requirements will be considered for the code aspect.

1. Is the GPU code functionally correct? I.e. Has the technique been implemented correctly and

does it produce the correct result? A number of test cases will be used to evaluate this against

the reference implementation.

2. Have you managed to use GPUs appropriately ensuring that the device has sufficient levels of

parallelism for all mosaic cell sizes?

3. Has iterative improvement of the code yielded a sufficiently optimised final GPU program?

4. Does the code make good use of memory bandwidth?

5. Does the GPU code avoid race conditions when reducing values?

6. Does the GPU code use effective caching to reduce the number of global memory loads?

7. Are there any compiler warnings or dangerous memory accesses (beyond the bounds of the

memory allocated)? Does your program free any memory which is allocated?

8. Is your code structured clearly and well commented?

In assessing your documentation, the following will be considered and should act as a guideline for

discussing incremental improvements to your code.

1. Description of the technique and how it is implemented. Is a good justification given for the

choice of parallelisation method?

2. Have appropriate investigations been made into using a good memory access pattern and

suitable caching technique? Are good explanations given for the benchmarking results?

3. Does your document describe optimisations to your code and show the impact of these?

4. Is there benchmarking and discussion about the performance difference between all three

versions of the code?

Tips for Developing Your Code and Documentation

If you are unable to implement all aspects of the technique on the GPU, then you should default back

to using the CPU or OpenMP versions for that part so that your code builds and executes producing

the correct result. Similarly, if you apply a technique that does not improve the performance, you

should include this in your documentation and explain your belief/understanding as to why it did not

work as expected. You can use #define to allow your code to be built in different versions to make

a comparison of techniques more straightforward.
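One possible way to use #define for this is sketched below in C. The macro name AVERAGE_VERSION and the function average are hypothetical; the two variants stand in for any pair of techniques you want to compare. The variant is chosen at build time, for example with -DAVERAGE_VERSION=2 or through the project's preprocessor definitions.

```c
#include <stddef.h>

/* Hypothetical build switch; defaults to variant 1 when not set. */
#ifndef AVERAGE_VERSION
#define AVERAGE_VERSION 1
#endif

#if AVERAGE_VERSION == 1

/* Variant 1: accumulate in an integer and divide once (truncates). */
unsigned char average(const unsigned char *v, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return n ? (unsigned char)(sum / n) : 0;
}

#else

/* Variant 2: accumulate in floating point and round to nearest.
   Results may differ from variant 1 by one because of the rounding. */
unsigned char average(const unsigned char *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += (float)v[i];
    return n ? (unsigned char)(sum / (float)n + 0.5f) : 0;
}

#endif
```

Because both variants share one signature, the calling code and the benchmarking harness stay identical, so any timing difference comes from the variant alone.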

You should comment your code to make it clear what you have done. You should test your code to

make sure that it works for all image sizes, values of 𝑐 and image input files. For values and input files

which are incorrect (for example 𝑐 = −10 or an input file with the incorrect number of pixel values)

your code should exit gracefully with a helpful error message. Your code should never read or write

beyond allocated memory.
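The input checks described above could look something like the following C sketch. The function names are hypothetical. It only checks that 𝑐 is a positive integer with no trailing characters; any further constraint on 𝑐 from the part 1 document would need to be added.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Parse the mosaic cell size from a program argument.
   Returns 1 and stores the value in *c_out on success, 0 on failure. */
int parse_cell_size(const char *arg, unsigned int *c_out)
{
    char *end = NULL;
    long value;

    if (arg == NULL || *arg == '\0')
        return 0;

    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || *end != '\0')
        return 0;                      /* overflow or trailing characters */
    if (value <= 0 || value > INT_MAX)
        return 0;                      /* rejects values such as -10 */

    *c_out = (unsigned int)value;
    return 1;
}

/* Check that the number of values read from the file matches the
   dimensions given in its header. */
int pixel_count_matches(size_t width, size_t height,
                        size_t channels, size_t values_read)
{
    return values_read == width * height * channels;
}
```

When either check fails, the caller can print a message to stderr that names the offending argument or file, free anything already allocated, and return EXIT_FAILURE.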

Assignment Help

Help for your assignment will be available in general lab classes. For specific questions outside of the

labs you should use the course google discussion group.