COM4521/COM6521: Parallel Computing with Graphical Processing Units (GPUs)
Assignment (80%)
Deadline: 17:00 Monday 17th May (week 12) Last Edited: 26/02/2021
Document Changes:
Any corrections or changes to this document will be noted here and an update will be sent out to the course Google group mailing list.
Introduction
The assessment has been designed against the module's learning objectives. The assignment is worth 80% of the total module mark. The aim of the assignment is to assess your ability and understanding of implementing and optimising parallel algorithms using both OpenMP and CUDA.
An existing project containing a single-threaded implementation of an algorithm has been provided. The starting code also contains functions for validating the correctness and timing the performance of your implemented algorithms.
You are expected to implement both an OpenMP and a CUDA version of the provided algorithm, and to complete a report which documents the justification for the techniques you have used and how profiling and/or benchmarking supports your choices.
The Algorithm and Starting Code
The algorithm to be implemented and optimised is called Contrast Limited Adaptive Histogram Equalisation (CLAHE)1. This algorithm is a technique for improving the contrast of images by decomposing an image into a regular grid of tiles and equalising each tile's histogram of contrasts, so that the range of contrasts in a given area is wider.
1 https://en.wikipedia.org/wiki/Adaptive_histogram_equalization#Contrast_Limited_AHE
Figure 1: An example input (left) and output (right) from the provided reference implementation.
The CLAHE algorithm can be broken up into 3 stages:
1. Histogram generation (stage 1) – a histogram is generated for each tile's pixel values, and the most common pixel value for the whole image is returned.
2. Histogram equalisation (stage 2) – each tile's histogram is equalised.
3. Histogram interpolation (stage 3) – each pixel's value is updated using interpolated values from the equalised histograms of the nearest tiles.
The reference implementation and starting code are available to download from:
https://github.com/RSE-Sheffield/COMCUDA_2021/archive/master.zip
The provided code only handles a single image channel (this conversion is handled automatically); you do not need to attempt handling multiple channels. Furthermore, the tile size is fixed (see TILE_SIZE in config.h) and images will be cropped to a multiple of the tile size in each dimension. Your algorithms only need to support these provided configurations. The only variable input is the image size, which may range from a single tile to hundreds of tiles when your code is tested on submission.
The handout code includes test images (in the samples folder) of different input sizes. You can create your own test images in Photoshop by compressing the range using the contrast tool and saving them in the PNG format.
Each of the stages are described in more detail below.
Histogram Generation (Stage 1)
Figure 2: A Simple 5×4 image tile showing pixel values in the range of 0-12. Brightness indicates occurrences of particular values.
Table 1: Histogram calculation of the simple image in Figure 2. Left is the value and right represents the occurrences of the value. The table has 13 counts (or bins) representing the inclusive pixel range of 0-12.
During histogram generation the algorithm is required to count the occurrences of grayscale pixel values for each tile of the image. Figure 2 above shows a single tile of 4 x 5 pixels. In this simplified example the pixel range is only 0-12. The colour represents the occurrence of a particular pixel value from Table 1.
To perform histogram generation you will need to decompose the image into square tiles. Within the handout code there is a fixed tile size (TILE_SIZE). For each tile you will need to calculate the histogram of pixel values. Unlike the simplified example above, pixels can take any value in the inclusive range 0-255, therefore each histogram must have 256 (or PIXEL_RANGE) counts (or bins).
During this stage you are also required to calculate the most common pixel value of the whole image, i.e. generate a whole-image histogram of pixel values and return the index of the largest bin. In the example above this is the pixel value 2, as it has the highest count. If there are multiple most common pixel values, only one needs to be returned. When completing the template files for OpenMP and CUDA, the provided stage 1 function should return the most common pixel value.
The method cpu_stage1() provides the single threaded implementation of this stage of the algorithm. The methods validate_histogram(…) and skip_histogram(…) can be used to validate or skip the output of this stage. However, both of these functions require pointers to host memory matching the layout and order of the CPU reference implementation.
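As a sketch of one possible approach (not the handout's required design; tile_histogram, d_input and d_histograms are hypothetical names, and TILE_SIZE and PIXEL_RANGE are assumed to come from config.h), a CUDA kernel could assign one thread block per tile and accumulate that tile's histogram in shared memory with atomic additions, before writing it out to global memory:

    #include "config.h"  // assumed to define TILE_SIZE and PIXEL_RANGE

    // One thread block per tile; gridDim.x/gridDim.y give the tiles per row/column.
    __global__ void tile_histogram(const unsigned char *d_input,
                                   unsigned int *d_histograms,
                                   unsigned int image_width) {
        // Build the histogram in shared memory to keep atomic contention off global memory.
        __shared__ unsigned int sm_histogram[PIXEL_RANGE];
        for (unsigned int i = threadIdx.x; i < PIXEL_RANGE; i += blockDim.x)
            sm_histogram[i] = 0;
        __syncthreads();
        // Each thread counts a strided subset of the tile's pixels.
        const unsigned int tile_origin = blockIdx.y * TILE_SIZE * image_width
                                       + blockIdx.x * TILE_SIZE;
        for (unsigned int i = threadIdx.x; i < TILE_SIZE * TILE_SIZE; i += blockDim.x) {
            const unsigned int px = tile_origin + (i / TILE_SIZE) * image_width + (i % TILE_SIZE);
            atomicAdd(&sm_histogram[d_input[px]], 1u);
        }
        __syncthreads();
        // Write the finished tile histogram to its slot in global memory.
        const unsigned int tile_index = blockIdx.y * gridDim.x + blockIdx.x;
        for (unsigned int i = threadIdx.x; i < PIXEL_RANGE; i += blockDim.x)
            d_histograms[tile_index * PIXEL_RANGE + i] = sm_histogram[i];
    }

The most common value for the whole image could then be found by summing the per-tile histograms and taking the index of the largest bin, either with a further reduction kernel or, since there are only PIXEL_RANGE bins, with a simple host-side loop.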
Histogram Equalisation (Stage 2)
Table 2: An extended version of Table 1 showing the calculated histogram values for the various steps required within stage 2 of the algorithm.
In order to equalise the histogram a number of steps are required. Firstly, each histogram count must be clamped to the range 0 to the absolute contrast limit. From the simple example in the previous stage, Table 2 shows the Clamped values when an absolute contrast limit of 3 is used. Within the handout code the contrast limit is set by the ABSOLUTE_CONTRAST_LIMIT definition. Any counts in excess of the absolute contrast limit are summed (the extra contrast) and then redistributed across the histogram, by calculating a bonus contrast value to be added to each histogram value according to the following formula:
bonus_contrast = floor(extra_contrast / PIXEL_RANGE)
In Table 2 the bonus contrast is 0, so the column for the Limited distribution is the same as the Clamped distribution. Any contrast lost through the use of the floor function is calculated as:
lost_contrast = extra_contrast % PIXEL_RANGE
The lost contrast will later be used in the formula for normalisation.
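For example (illustrative values only): an extra contrast of 300 with a PIXEL_RANGE of 256 would give a bonus contrast of floor(300 / 256) = 1 added to every bin, and a lost contrast of 300 % 256 = 44.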
The next step requires the calculation of a cumulative histogram (often referred to as cumulative distribution function or CDF). The cumulative distribution can be obtained by applying a scan operation over the limited histogram. The results of this are shown in the Cumulative column for the simple example.
Finally, the histogram equalisation formula is applied to obtain an Equalised cumulative histogram. The formula for equalisation is:

e(i) = ((cdf(i) − cdf_min) × (PIXEL_RANGE − 2)) / ((W × H) − lost_contrast) + 1

Where e(i) is the equalised contrast value for a pixel with contrast value i, cdf(i) is the cumulative value for pixels of contrast i, cdf_min is the minimum cumulative value, and W and H are the tile width and height respectively. The result of this formula must also be clamped, as it may exceed the pixel range.
Within the CPU reference code the calculation of the variable t within cpu_stage2() demonstrates the application of this formula. Stage 2 does not necessarily need to remain as three distinct parts; however, be careful to ensure that your output validates.
The methods validate_limited_histogram(…), validate_cumulative_histogram(…) and validate_equalised_histogram(…) can be used to check the results of the 3 parts of this stage described above. Likewise, validate_ can be replaced with skip_ to bypass these parts if incomplete. All of these functions require pointers to host memory matching the layout and order of the CPU reference implementation.
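As an illustrative sketch only (equalise_histograms, d_histograms and d_equalised are hypothetical names; ABSOLUTE_CONTRAST_LIMIT, TILE_SIZE and PIXEL_RANGE are assumed from config.h, and the exact definition of cdf_min should be checked against the reference code), the three parts could even be fused into one simple kernel that assigns one thread per tile histogram:

    // One thread per tile: simple and correct, though it leaves parallelism
    // within each 256-bin histogram unexploited.
    __global__ void equalise_histograms(const unsigned int *d_histograms,
                                        unsigned int *d_equalised,
                                        unsigned int tile_count) {
        const unsigned int tile = blockIdx.x * blockDim.x + threadIdx.x;
        if (tile >= tile_count) return;
        const unsigned int *hist = d_histograms + tile * PIXEL_RANGE;
        unsigned int *out = d_equalised + tile * PIXEL_RANGE;

        // Part 1: clamp each count and sum the excess as the extra contrast.
        unsigned int clamped[PIXEL_RANGE];
        unsigned int extra_contrast = 0;
        for (unsigned int i = 0; i < PIXEL_RANGE; ++i) {
            clamped[i] = min(hist[i], (unsigned int)ABSOLUTE_CONTRAST_LIMIT);
            extra_contrast += hist[i] - clamped[i];
        }
        const unsigned int bonus_contrast = extra_contrast / PIXEL_RANGE;  // floor
        const unsigned int lost_contrast = extra_contrast % PIXEL_RANGE;

        // Parts 2 and 3: inclusive scan of the limited histogram (clamped + bonus),
        // then the equalisation formula, clamped to the valid pixel range.
        unsigned int cdf = 0;
        const unsigned int cdf_min = clamped[0] + bonus_contrast;  // minimum of a non-decreasing CDF
        for (unsigned int i = 0; i < PIXEL_RANGE; ++i) {
            cdf += clamped[i] + bonus_contrast;
            const unsigned int e = ((cdf - cdf_min) * (PIXEL_RANGE - 2))
                                 / ((TILE_SIZE * TILE_SIZE) - lost_contrast) + 1;
            out[i] = min(e, (unsigned int)(PIXEL_RANGE - 1));
        }
    }

If you prefer a block-per-tile decomposition, the scan step could instead use a parallel block-level scan (CUB provides one, and its use is permitted).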
Histogram Interpolation (Stage 3)
The final output image is calculated in stage 3 by reading back the value stored in the equalised histograms for each original pixel's contrast value. However, as the image has been tiled, the value must be interpolated from the equalised histograms of the nearest 4 tiles using bilinear interpolation. Each tile is subdivided into 4 corners, with pixels from a corner accessing histograms from the 4 tiles touching that corner. Edge and corner sub-tiles are only able to sample from 2 and 1 tiles respectively.
Figure 3: A guide to interpolation. Solid lines represent tiles, dashed lines represent the separation of the 4 corners of each tile. Red has no interpolation, green has linear interpolation, blue has bilinear interpolation. Image source: Wikipedia/user:Vswitches, CC BY-SA 3.0.
The method cpu_stage3() provides the single threaded implementation of this stage of the algorithm. The example code explicitly checks for corners and X and Y borders.
The methods validate_interpolate(…) and skip_interpolate(…) can be used to validate or skip the output of this stage. However, both of these functions require
pointers to host memory matching the layout and order of the CPU reference implementation.
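As a rough sketch of the bilinear lookup only (all names are hypothetical, edge handling is simplified by clamping tile indices rather than the explicit corner/border checks used in cpu_stage3(), and rounding details should be matched against the reference implementation):

    // Value of the equalised histogram for tile (tx, ty) at pixel value v.
    __device__ static float equalised_at(const unsigned int *d_equalised,
                                         int tiles_per_row, int tx, int ty,
                                         unsigned char v) {
        return (float)d_equalised[(ty * tiles_per_row + tx) * PIXEL_RANGE + v];
    }

    __global__ void interpolate(const unsigned char *d_input, unsigned char *d_output,
                                const unsigned int *d_equalised,
                                unsigned int width, unsigned int height) {
        // One thread per pixel, launched over a 2D grid covering the image.
        const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        const int tiles_per_row = (int)(width / TILE_SIZE);
        const int tiles_per_col = (int)(height / TILE_SIZE);

        // Pixel position measured in tile units, relative to tile centres.
        const float fx = ((float)x - TILE_SIZE / 2.0f) / TILE_SIZE;
        const float fy = ((float)y - TILE_SIZE / 2.0f) / TILE_SIZE;
        const int ix = (int)floorf(fx), iy = (int)floorf(fy);
        const float wx = fx - ix, wy = fy - iy;

        // Clamp to valid tiles; at edges and corners both neighbours collapse to
        // the same tile, which degenerates to linear or no interpolation.
        const int tx0 = max(ix, 0), ty0 = max(iy, 0);
        const int tx1 = min(ix + 1, tiles_per_row - 1), ty1 = min(iy + 1, tiles_per_col - 1);

        const unsigned char v = d_input[y * width + x];
        const float top = equalised_at(d_equalised, tiles_per_row, tx0, ty0, v) * (1.0f - wx)
                        + equalised_at(d_equalised, tiles_per_row, tx1, ty0, v) * wx;
        const float bot = equalised_at(d_equalised, tiles_per_row, tx0, ty1, v) * (1.0f - wx)
                        + equalised_at(d_equalised, tiles_per_row, tx1, ty1, v) * wx;
        d_output[y * width + x] = (unsigned char)(top * (1.0f - wy) + bot * wy);
    }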
The Task Code
For this assignment you must complete the code found in openmp.c and cuda.cu, so that they perform the same algorithm described above and found in the reference implementation (cpu.c), using OpenMP and CUDA respectively. You should not modify or create any other files within the project. Each of the two implementations is separated into 5 methods, named openmp_begin(), openmp_stage1(), openmp_stage2(), openmp_stage3() and openmp_end() (and likewise for CUDA). The begin method takes the input image and performs any memory allocations or copies required by the algorithm. The end method similarly returns the output image and frees allocated memory resources.
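As an illustration of this begin/end pairing only (the real function signatures are fixed by the handout and will differ; the file-scope pointers and the simplified signatures below are assumptions), the CUDA side might pair allocations, copies and frees like so:

    // Hypothetical file-scope device pointers and image dimensions (kept between calls).
    static unsigned char *d_input, *d_output;
    static unsigned int *d_histograms;
    static unsigned int image_width, image_height;

    // Simplified signature for illustration; the handout fixes the real one.
    void cuda_begin(const unsigned char *host_image, unsigned int width, unsigned int height) {
        image_width = width;
        image_height = height;
        const size_t image_bytes = (size_t)width * height;
        const size_t tile_count = (width / TILE_SIZE) * (height / TILE_SIZE);
        cudaMalloc(&d_input, image_bytes);
        cudaMalloc(&d_output, image_bytes);
        cudaMalloc(&d_histograms, tile_count * PIXEL_RANGE * sizeof(unsigned int));
        cudaMemcpy(d_input, host_image, image_bytes, cudaMemcpyHostToDevice);
    }

    void cuda_end(unsigned char *host_output) {
        const size_t image_bytes = (size_t)image_width * image_height;
        cudaMemcpy(host_output, d_output, image_bytes, cudaMemcpyDeviceToHost);
        // Free everything allocated in cuda_begin() so benchmark repeats do not leak.
        cudaFree(d_input);
        cudaFree(d_output);
        cudaFree(d_histograms);
    }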
You should implement the OpenMP and CUDA algorithms with the intention of achieving the fastest performance for each algorithm on the hardware you use to develop and test your assignment. As the second stage is the most advanced, it is recommended that you address the other stages first. The starting code has helper methods, which allow you to skip and validate the three stages individually.
It is important to free all used memory, as memory leaks could cause the benchmark mode, which repeats the algorithm, to run out of memory.
The header helper.h contains methods which can be used to skip and validate individual stages of the assignment, to assist you with development and finding problems with the correctness of your code. The VALIDATION pre-processor definition is defined only for Debug builds. Each of the stage functions (for both OpenMP and CUDA) provides a place for you to call the appropriate validation functions to check that your code is correct. As VALIDATION is not defined for Release builds this will not affect your benchmarking performance.
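For example (a sketch only: d_histograms and histogram_bytes are assumed to exist from the begin/stage code above, and the real argument list of validate_histogram(…) is in helper.h), a CUDA stage function might guard its check like this:

    void cuda_stage1() {
        // ... stage 1 kernel launches ...
    #if defined(VALIDATION)
        // Compiled only when VALIDATION is defined (Debug builds), so Release
        // benchmarks pay no cost for the device-to-host copy or the check.
        unsigned int *h_histograms = (unsigned int *)malloc(histogram_bytes);
        cudaMemcpy(h_histograms, d_histograms, histogram_bytes, cudaMemcpyDeviceToHost);
        validate_histogram(h_histograms /* plus the other arguments listed in helper.h */);
        free(h_histograms);
    #endif
    }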
The Report
You are expected to provide a report alongside your code submission. For each of the 6 stages you should complete the template provided in Appendix A. The report is your chance to demonstrate to the marker that you understand what has been taught in the module.
Benchmarks should always be carried out in Release mode, with timing averaged over several runs. The provided project code has a runtime argument --bench which will repeat the algorithm for a given input 1000 times. It is important to benchmark over a range of inputs, to allow consideration of how the performance scales with the size of the input.
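The provided --bench mode and helper functions already handle measurement, but if you want a quick timing of an individual stage while developing, one common pattern (real CUDA API, hypothetical usage) uses CUDA events:

    cudaEvent_t start, stop;
    float elapsed_ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cuda_stage1();  // the stage being measured
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for all preceding GPU work to finish
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);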
Deliverables
You must submit your openmp.c, cuda.cu and your report document (e.g. .pdf/.docx) within a single zip file via Mole, before the deadline. Your code should build in the Release mode configuration provided to you, without errors or warnings (other than those caused by IntelliSense), on Diamond machines. You do not need to hand in the project files or any files other than openmp.c and cuda.cu. As such, it is important that you do not modify any of the other files provided in the starting code, so that your submitted code remains compatible with the projects that will be used to mark your submission.
Your code should not rely on any third party tools/libraries not introduced within the lectures/lab classes. The use of Thrust and CUB is permitted.
Even if you do not complete all aspects of the assignment, partial progress should be submitted as this can still receive marks.
Marking
When marking, both the correctness of the output, and the quality/appropriateness of the technique used will be assessed. The report should be used to justify the approach used and demonstrate its impact on the performance.
The marks for each stage of the assignment will be distributed as follows:
                    OpenMP (40%)    CUDA (60%)
    Stage 1 (30%)       12%             18%
    Stage 2 (40%)       16%             24%
    Stage 3 (30%)       12%             18%
For each of the 6 stages, the distribution of the marks will be determined by the following criteria:
1) Has the stage been implemented? [25%]
a) Is the implementation for the specific stage complete?
b) Is the implementation free from race conditions or other errors, regardless of the output?
c) Is the code structured clearly and logically?
d) Is memory used efficiently and freed where allocated?
2) Is the implementation correct for the stage when compared to a number of test cases? I.e. does the implementation pass validation using the helper functions in Debug mode? [25%]
3) Choice, justification and performance reporting of the approach towards implementation, as evidenced in the report. [50%] A breakdown of how marks are awarded is provided in the report structure template in Appendix A.
If you submit work after the deadline you will incur a deduction of 5% of the mark for each working day that the work is late after the deadline. Work submitted more than 5 working days late will be graded as 0. This is the same lateness policy applied university wide to all undergraduate and postgraduate programmes.
Assignment Help and Feedback
There is a dedicated assignment feedback lab in Week 11 but you are expected to seek help with your assignment in general lab classes. The labs provide you with an opportunity to receive feedback on your work so you are expected to bring your assignment work to the labs and ask questions. The assignment has been designed so that it is possible to start on the OpenMP aspects as soon as it is handed out and to incrementally progress with the
CUDA aspects as the course progresses. You should not leave implementation to the last week of the module.
For questions outside of the labs you should use the course Google discussion group (COM4521-group@sheffield.ac.uk). However, as messages to the Google group are public to all students, emails should avoid including assignment code and should be questions about ideas and techniques rather than requests to fix code (which can be done within labs).
Appendix A: Report Structure Template
Each stage should focus on a specific choice of technique which you have applied in your implementation. E.g. OpenMP Scheduling, OpenMP approaches for avoiding race conditions, CUDA memory caching, Atomics, Reductions, Warp operations, Shared Memory, etc.
Each stage should be no more than 500 words and may be far fewer for some stages.
Description
● Briefly describe how the stage is implemented, focusing on what choice of technique you have applied in your code.
Marks will be awarded for:
● Clarity of description.
Justification
● Describe why you selected a particular technique or approach. Provide justification using your understanding from experience in the lectures and labs as to why the approach is appropriate and efficient.
Marks will be awarded for:
● Appropriateness of the approach, i.e. is this the most efficient choice?
● Justification of the approach and demonstration of understanding.
Performance
    Problem size        CPU Reference Timing (ms)
    256 x 256 px
    1024 x 1024 px
    2048 x 2048 px
    4096 x 4096 px
● Report your benchmark results in the table provided above.
● Describe which aspects of your implementation limit performance. E.g. is your code compute, memory or latency bound on the GPU? Have you performed any profiling? Is a particular operation slow?
● What could be improved in your code if you had more time?
Marks will be awarded for:
● Does the justification match the experimental result?
● Have limiting factors of the code been identified?
● Has justification for limiting factors been described or evidenced?