KIT308/408 Multicore Architecture and Programming Assignment 3 — OpenCL
Aims of the assignment
The purpose of this assignment is to give you experience at writing a program using OpenCL programming techniques. This assignment will give you an opportunity to demonstrate your understanding of:
setting up OpenCL structures;
passing memory from the CPU to the GPU;
creating memory formats that work across devices; and
the translation of C/C++ code to OpenCL code.
Due Date
11:55pm Friday 16th of October (Week 13 of Semester)
Late assignments will only be accepted in exceptional circumstances and provided that the proper procedures have been followed (see the School Office or this link for details). Assignments which are submitted late without good reason will be subject to mark penalties if they are accepted at all (see the School Office or this link for details on this as well).
Forms to request extensions of time to submit assignments are available from the Discipline of ICT office. Requests must be accompanied by suitable documentation and should be submitted before the assignment due date.
Assignment Submission
Your assignment is to be submitted electronically via MyLO and should contain:
An assignment cover sheet;
A .zip (or .rar) containing a Visual Studio Solution containing a project for each attempted stage of the assignment (labelled sensibly so it’s easy to determine what is Stage 1, Stage 2, Stage 3, Stage 4, and Stage 5).
A document containing:
A table of timing information comparing the original provided base code times against the Stage 4 OpenCL implementation on each scene file.
An analysis of the above timing data.
You do not need to (and shouldn’t) submit executables, temporary object files, or images. In particular, you must delete the “.vs” directory before submission as it is just Visual Studio temporary files and 100s of MBs. Do not, however, delete the “Scenes” folder or the “Outputs” folder (but do delete the images within this one).

Marking
This assignment will be marked out of 100. The following is the breakdown of marks:
Task/Topic | Marks
1. OpenCL Setup and Data Transfer | 50%
CPU-side |
Correct OpenCL setup and kernel creation | 5%
Well thought out and correct OpenCL buffers and kernel arguments for representing scene data | 5%
Correct creation of two-dimensional kernel job | 5%
Correct memory management (i.e. freeing resources) | 5%
Inclusion of (helpful) error handling | 5%
Correct alignment of datastructures | 5%
GPU-side |
Correct declaration of equivalent OpenCL types and datastructures (for primitives, sceneobjects, etc.) | 10%
Output (via printf) that verifies that the scene has been correctly transferred (for full marks, this information should only print once per execution) | 5%
Correctly fills output buffer with colour | 5%
2. Basic Rendering | 10%
OpenCL implementation of renderer that colours spheres white and everything else black (i.e. no lighting, material application, reflection, refraction, or shadowing, etc.) | 5%
Correctly handles all scenes | 5%
3. Rendering with Lighting | 10%
OpenCL implementation of renderer with correct lighting for spheres and boxes but no shadows or materials | 5%
Correctly handles all scenes | 5%
4. Full Rendering | 10%
OpenCL implementation of renderer with correct lighting, shadowing, materials, reflection, and refraction | 5%
Correctly handles all scenes | 5%
5. Rendering Large/Complex Images | 10%
OpenCL implementation (with all features of Stage 4) that can render all the Stage 5 tests | 5%
OpenCL implementation (with all features of Stage 4) that handles images of arbitrary complexity OR a well explained rationale for why a general solution is not possible to do in code using OpenCL | 5%
Documentation | 10%
Outputs showing timing information for Stage 4 on all applicable scene files | 5%
Analysis of data (comparison between Stage 4 execution times and the base code) | 5%
Penalties |
Failure to comply with submission instructions (eg. no cover sheet, incorrect submission of files, abnormal solution/project structure, etc.) | -10%
Poor programming style (eg. insufficient / poor comments, poor variable names, poor indenting, obfuscated code without documentation, compiler warnings, etc.) | up to -20%
Lateness (-20% for up to 24 hours, -50% for up to 7 days, -100% after 7 days) | up to -100%
Programming Style
This assignment is not focussed on programming style (although it is concerned with efficient code), but you should endeavour to follow good programming practices. You should, for example:
comment your code;
use sensible variable names;
use correct and consistent indenting; and
internally document (with comments) any notable design decisions.
[NOTE: any examples in the provided assignment materials that don’t live up to the above criteria, should be considered to be deliberate examples of what not to do and are provided to aid your learning ;P]

The Assignment Task
You are to heavily modify a slightly simplified implementation of our original single-threaded “simple” raytracer so that it runs as an OpenCL program. To fully complete this assignment (to Stage 4), you will need to convert everything from render onwards (i.e. all the functionality of the render function, every function that is called from render, and everything that is called from there, and so on) to be written in OpenCL.
From the provided single-threaded raytracer implementation, you will create multiple subsequent versions that convert various parts of the program to be implemented in OpenCL:
1. Code to start an OpenCL Kernel and transfer the scene datastructure (no easy task).
2. Basic rendering that just shows spheres as white and everything else as black (i.e. traceRay working to the point of objectIntersection working for spheres, but no lighting etc.).
3. Rendering with support for boxes and basic lighting complete.
4. Full rendering with all features of the base code in OpenCL.
5. Rendering of large / complex images.
Implementation
1. OpenCL Initialisation and Data Transfer
This stage involves initialising OpenCL (getting the platform, context, etc. etc.) as we’ve done in the tutorials. However, it requires careful thought when deciding how to pass through data to the OpenCL kernel as all the parameters to render (e.g. width, height, aaLevel) need to be available in the kernel and this includes the entire scene object.
You can’t simply pass a pointer to the scene, as any objects more complicated than scalars need to be allocated to a buffer (so that they can be available to the GPU). Even if you allocate the scene object to a buffer, the containers for the scene objects won’t transfer over automatically, so you’ll need to think about how to transfer them as well.
Additionally, to be able to use the scene in OpenCL, you’ll need an equivalent definition of the Scene struct which in turn means you’ll need definitions of scene objects and primitives. Some of these structs (e.g. Point, Vector, Colour, etc.) can be represented by built-in OpenCL types to greatly simplify your implementation (and gain access to a host of built-in functions).
Memory layout for structs on the CPU and GPU is not likely to be the same, and so you need to take great care to ensure that each struct has the same size on both devices. This means that on the CPU-side, in order to get the memory alignment you want, you may need to either manually pad structs, use the pre-defined OpenCL types (e.g. cl_float3), or make use of the _declspec(align(X)) command.
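As a minimal sketch of the padding and alignment point above, the following host-side C++ uses the standard alignas specifier (swapped in here as a portable alternative to _declspec(align(X))) and compile-time checks. The names HostFloat3, HostSphere, sphereBufferBytes, and checkDeviceSphereSize are hypothetical and are not taken from the provided base code; the only OpenCL fact relied on is that a float3 occupies 16 bytes with 16-byte alignment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side stand-in for a 3-component float vector.
// OpenCL's float3 has the size and alignment of float4 (16 bytes),
// so the host struct is padded and aligned to match.
struct alignas(16) HostFloat3 {
    float x, y, z;
    float pad;  // explicit padding to reach 16 bytes
};

// Hypothetical sphere layout; the field names are illustrative only.
struct alignas(16) HostSphere {
    HostFloat3   centre;      // bytes 0..15
    float        radius;      // bytes 16..19
    unsigned int materialId;  // bytes 20..23
    // alignas(16) rounds the struct up to 32 bytes (8 bytes of tail padding)
};

// Compile-time checks: the build fails if the layout drifts.
static_assert(sizeof(HostFloat3) == 16, "HostFloat3 must be 16 bytes like float3");
static_assert(alignof(HostFloat3) == 16, "HostFloat3 must be 16-byte aligned");
static_assert(offsetof(HostSphere, radius) == 16, "radius must follow the padded centre");
static_assert(sizeof(HostSphere) == 32, "HostSphere must be a multiple of 16 bytes");

// Number of bytes a buffer holding `count` spheres would need.
constexpr std::size_t sphereBufferBytes(std::size_t count) {
    return count * sizeof(HostSphere);
}

// Runtime check: compare against the sizeof value the kernel reports via printf.
inline void checkDeviceSphereSize(std::size_t deviceSize) {
    assert(deviceSize == sizeof(HostSphere));
}
```

Printing sizeof for the matching struct from inside the kernel and comparing it with the host value is one way to catch a mismatch before any field is read incorrectly.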
You also need to be explicit about what kind of memory different variables are. Many of your datastructures and pointers to datastructures will need to be declared __global.
In order to complete this step you will need to:
Have completely transferred a copy of the scene (including the scene object containers) to the GPU and verified this via outputting the whole structure.
Filled the entire output buffer with colour to ensure that your kernel is working over the correct range.
Each pixel should be coloured as RGB(xCoordinate % 256, 0, yCoordinate % 256)
(i.e. the red-channel of the pixel will be the x-coordinate of the pixel modulo 256, and the blue channel of the pixel will be the y-coordinate of the pixel modulo 256).
At the end of this stage the program should successfully transfer all scene information to the OpenCL kernel and be able to create a full image (filled with coloured pixels). Be thorough in your testing (i.e. test all the scenes) to ensure that everything works correctly before progressing.
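The Stage 1 fill pattern described above can be generated on the CPU as a reference to compare against the kernel's output. This is only a sketch: the Rgb struct and the function name expectedStage1Image are hypothetical, and the base code's actual pixel format (it mentions BGR conversions) may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pixel type for illustration; the base code's own format may differ.
struct Rgb {
    std::uint8_t r, g, b;
};

// Builds the image Stage 1 is expected to produce:
// red = x % 256, green = 0, blue = y % 256, stored row by row.
std::vector<Rgb> expectedStage1Image(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<Rgb> pixels(static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Rgb& p = pixels[static_cast<std::size_t>(y) * width + x];
            p.r = static_cast<std::uint8_t>(x % 256);
            p.g = 0;
            p.b = static_cast<std::uint8_t>(y % 256);
        }
    }
    return pixels;
}
```

For an image wider than 256 pixels the red channel wraps around, which makes it easy to see whether the kernel covered the full x range.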
Hints / Tips
There is a commented out function outputInfo (and call to the function in main) that you can use to display helpful values about the scene for testing. This code can be duplicated in OpenCL and called from your kernel to confirm that everything has transferred correctly.
Just make a basic kernel to begin with and transfer only the scene object. Verify that it has transferred correctly by using printf (this is available in OpenCL (huzzah!) at the cost of greatly reduced performance) on the CPU and on the GPU (use the work offset values to ensure you only print stuff once).
In OpenCL, rather than defining structs for many of the renderer types you can simply typedef OpenCL data types. e.g. use “typedef well_chosen_type Point;” instead of a struct.
Memory alignment is a big issue. Never assume that something has transferred correctly to the OpenCL program — verify that it has via printf.
Pay attention to where in memory your datastructures are. You will need to use __global and/or __constant to avoid compiler errors/warnings.
Your kernel function will be a conversion of the render function, but at this stage it doesn’t need to do anything except get all of the same parameters, output the scene (via printf), and fill the image in with a colour. Refer to the introductory OpenCL tutorials for help making a basic kernel.
There are a number of possible problems you may encounter when loading OpenCL files:
Due to how the assignment solution/projects are set up to use shared folders for scenes, outputs, and reference images, you’ll need to specify the path of your OpenCL file when using clLoadSource (e.g. “Stage1/Raytrace.cl”). This also applies for any further files you #include within OpenCL as below.
Rather than create one giant OpenCL source file, you would be wise to split your code across multiple files and #include them from within the OpenCL file you load using the clLoadSource function. You’ll need to ensure that you define things in the correct order (sometimes function prototypes can help here).

Some OpenCL implementations have a problem with caching old versions of included files. If you have trouble, you may wish to manually clear this cache, or set up a pre-build event to do it for you (see this stackoverflow thread for more info).
NOTE: this is perhaps the hardest stage of the assignment.
2. Basic Rendering
This stage involves implementing some of the functionality of the traceRay function to at least detect whether a collision with an object has occurred or not. As you only need to add colour when a collision with a sphere has been detected, you can either check for collisions with all kinds of objects and be selective, or just not check for collisions with boxes at this stage.
Hints / Tips
Be really careful when converting * operations. Much of the original code overloads * to be dot-product and this will not automatically translate in the OpenCL program (it may compile, but instead just do a SIMD like multiplication) — there is a dot(X, Y) function that you can use.
All of the stdlib math functions need to be converted to OpenCL ones. Most of the time this is as simple as removing “f” from the end of the function name (e.g. sqrtf becomes sqrt).
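To illustrate the warning about * above, here is a small host-side sketch contrasting the two operations. Vec3, componentwise, dotProduct, and selfCheck are hypothetical names used only for this example; componentwise mirrors what a * b does on OpenCL vector types, and dotProduct mirrors what the built-in dot(X, Y) returns.

```cpp
#include <cassert>

// Hypothetical 3-component vector for illustration only.
struct Vec3 {
    float x, y, z;
};

// Component-wise product: what `a * b` means for OpenCL vector types.
// The result is another vector, not a scalar.
Vec3 componentwise(Vec3 a, Vec3 b) {
    return {a.x * b.x, a.y * b.y, a.z * b.z};
}

// Dot product: what the original overloaded * computes for two vectors,
// and what OpenCL's dot(a, b) returns.
float dotProduct(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Small self-check showing that the two operations give different results.
inline void selfCheck() {
    const Vec3 a{1.0f, 2.0f, 3.0f};
    const Vec3 b{4.0f, 5.0f, 6.0f};
    const Vec3 c = componentwise(a, b);
    assert(c.x == 4.0f && c.y == 10.0f && c.z == 18.0f);  // a vector
    assert(dotProduct(a, b) == 32.0f);                    // a single scalar
}
```

Because the component-wise form still compiles when its result is used in a larger vector expression, a mistranslated dot product tends to show up as a wrong image rather than as a compiler error.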
3. Rendering with Boxes and Lighting
This stage involves including the intersection tests with boxes and adding in the lighting calculations.
4. Full Rendering
This stage involves including the materials, shadowing, reflection, refraction, etc.
5. Full Rendering of Large / Complex Images
This stage involves trying to render images that require a large amount of time to generate. Once you’ve completed Stage 4 successfully, try running the Stage 5 tests. NOTE: on some system configurations, trying to run such a large OpenCL job may cause stability issues.
The first part of this stage involves creating a bespoke solution that would be successful at rendering these tests, but is not required to render other images (e.g. much smaller images, much larger images, images with a million times more objects, etc.).
The second part of this stage is generalising that solution so that it could handle images of arbitrary complexity — or, explaining why this would not be feasible.
Hints / Tips
Changing the Timeout Detection and Recovery (TDR) value of your system is not a valid approach to this stage of the assignment. You should endeavour to find a purely code-based solution.
Techniques we’ve explored for Assignment 1 should be helpful here.
You do not need to wait for the longer tests to complete if you are confident you have created a solution that works (some of them may take a very long time to produce a result).
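One possible code-based direction, offered here only as an assumption and not as the intended solution, is to split a large image into smaller rectangular pieces so that each kernel launch covers less work. The sketch below computes such pieces on the host; Tile and splitIntoTiles are hypothetical names, and each tile's offset and size could, for example, be supplied as the global_work_offset and global_work_size arguments of clEnqueueNDRangeKernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one rectangular piece of the image.
struct Tile {
    std::size_t offsetX, offsetY;  // top-left pixel of the tile
    std::size_t width, height;     // size of the tile in pixels
};

// Splits an image into tiles of at most tileSize x tileSize pixels.
// Tiles on the right and bottom edges are clipped to the image bounds.
std::vector<Tile> splitIntoTiles(std::size_t imageWidth,
                                 std::size_t imageHeight,
                                 std::size_t tileSize) {
    assert(tileSize > 0);
    std::vector<Tile> tiles;
    for (std::size_t y = 0; y < imageHeight; y += tileSize) {
        for (std::size_t x = 0; x < imageWidth; x += tileSize) {
            tiles.push_back({x, y,
                             std::min(tileSize, imageWidth - x),
                             std::min(tileSize, imageHeight - y)});
        }
    }
    return tiles;
}
```

Whether a fixed tile size is enough for images of arbitrary complexity is exactly the question the second part of this stage asks, so the tile size here should be treated as a tunable assumption.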

Documentation
After completing stage 4 of the assignment (anything before a complete stage 4 solution would produce incomparable times) you should provide:
timing information for each scene file for:
The base single-threaded code;
The base multi-threaded code using a thread count matching the maximum number of logical processors in your system and a block size of 16;
The base SIMD code using a thread count matching the maximum number of logical processors in your system and a block size of 16;
The time taken for the first run using OpenCL (for a Stage 4 implementation); and
The average time taken (to 1 decimal place) over the remaining 9 runs.
an explanation of the results (e.g. why there’s no difference between the performance of Stage 4 compared to the base code, or why a particular implementation works well (or poorly) on a particular scene, etc.).
Tests / Timing
The following tables list all the tests that your code needs to generate correctly at each stage. They also show the timing tests that need to be performed in order to fully complete the documentation section of the assignment. Fully completing these tests may take up to an hour (with the 1 run for the previous versions, and 10 required runs for OpenCL) on some hardware, so plan your time accordingly.
In order to confirm your images match the images created by the base version of the assignment code, it’s strongly recommended you use an image comparison tool. For part of the marking for this, Image Magick will be used (as it was in Assignment 1 and 2).
NOTE: all debug printf outputs for normal executions (including the output of Stage 1 of the assignment) should be removed before timing — however, you should NOT remove error printfs that occur if something goes wrong in the OpenCL setup process.
Timing Test | Base Single-Threaded | Base Multi-Threaded (i.e. Ass1 Solution) | Multi-Threaded SIMD (i.e. Ass2 Solution) | Stage 4 OpenCL First Run | Stage 4 OpenCL Average of 9 Runs
1. -input Scenes/cornell.txt -size 1024 1024 -samples 1 | | | | |
2. -input Scenes/cornell.txt -size 1024 1024 -samples 4 | | | | |
3. -input Scenes/cornell.txt -size 1024 1024 -samples 16 | | | | |
4. -input Scenes/allmaterials.txt -size 1000 1000 -samples 4 | | | | |
5. -input Scenes/5000spheres.txt -size 1280 720 -samples 1 | | | | |
6. -input Scenes/dudes.txt -size 1024 1024 -samples 1 | | | | |
7. -input Scenes/cornell-199lights.txt -size 1024 1024 -samples 1 | | | | |

The following tests will be run on your code for each scene file and compared against the reference output shown:
Test | Stage 1 | Stage 2 | Stage 3 | Stage 4
1. -input Scenes/cornell.txt -size 256 256 -samples 1
2. -input Scenes/allmaterials.txt -size 1000 1000 -samples 4
3. -input Scenes/5000spheres.txt -size 1280 720 -samples 1
4. -input Scenes/dudes.txt -size 1024 1024 -samples 1
5. -input Scenes/cornell-199lights.txt -size 1024 1024 -samples 1
The following tests will be run on your code for Stage 5:
Test | Stage 5
1. -input Scenes/dudes.txt -size 2048 2048 -samples 1
2. -input Scenes/dudes.txt -size 2048 2048 -samples 2
3. -input Scenes/dudes.txt -size 2048 2048 -samples 4
4. -input Scenes/dudes.txt -size 2048 2048 -samples 8
5. -input Scenes/dudes.txt -size 2048 2048 -samples 16
6. -input Scenes/dudes.txt -size 2048 2048 -samples 32

Provided Materials
The materials provided with this assignment contain:
The source code of the base version of the raytracer (i.e. the original starting point of Assignment 1).
A set of scene files to be supplied to the program.
A set of reference images for testing.
Some batch files for testing purposes.
Download the materials as a ZIP file.
Source Code
The provided MSVC solution contains 6 projects.
RayTracerAss3
The provided code consists of 21 source files.
Raytracing logic:
Raytrace.cpp: this file contains the main function which reads the supplied scene file, begins the raytracing, and writes the output BMP file. The main render loop, ray trace function, and handling of reflection and refraction is also in this file.
Intersection.h and Intersection.cpp: these files define a datastructure for capturing relevant information at the point of intersection between a ray and a scene object and functions for testing for individual ray-object collisions and ray-scene collisions.
Lighting.h and Lighting.cpp: these files provide functions to apply a lighting calculation at a single intersection point.
Texturing.h and Texturing.cpp: these files provide functions for reading points from 3D procedural textures.
Constants.h: this header provides constant definitions used in the raytracing.
Basic types:
Primitives.h: this header contains definitions for points, vectors, and rays. It also provides functions and overloaded operators for performing calculations with vectors and points.
SceneObjects.h: this header file provides definitions for scene objects (ie. materials, lights, spheres, and boxes).
Colour.h: this header defines a datastructure for representing colours (with each colour component represented as a float) and simple operations on colours, including conversions to/from the standard BGR pixel format.
Scene definition and I/O:
Scene.h and Scene.cpp: the header file contains the datastructure to represent a scene and a single function that initialises this datastructure from a file. The scene datastructure itself consists of properties of the scene and lists of the various scene objects as described above. The implementation file contains many functions to aid in the scene loading process. Scene loading relies upon the functionality provided by the Config class.
Config.h and Config.cpp: this class provides facilities for parsing the scene file.
SimpleString.h: this is a helper string class used by the Config class.
OpenCL I/O:
LoadCL.h and LoadCL.cpp: these files contain a helper function for loading OpenCL files. These are not required for the base code, but are useful for all assignment Stages.
Image I/O:
ImageIO.h and ImageIO.cpp: these files contain the definitions of functions to read and write BMP files.
Miscellaneous:
Timer.h: this class provides a simple timer that makes use of different system functions depending on whether TARGET_WINDOWS, TARGET_PPU, or TARGET_SPU is defined (we don’t use the latter two, but I left this file unchanged in case anyone wanted to see how such cross-platform stuff can be handled).
Stage1 – Stage5
These projects are empty.
To begin work on the assignment you should (in Windows Explorer) copy all of the 21 .h and .cpp files from RayTracerAss3 into the Stage1 folder, then right-click on the Stage1 project in Visual Studio, choose “Add / Existing Item…”, and add those 21 files.
Executing
The program has the following functionality:
By default it will attempt to load the scene “Scenes/cornell.txt” and render it at 1024×1024 with 1×1 samples.
By default it will output a file named “Outputs/[scenefile-name]_[width]x[height]x[sample-level]_[executable-filename].bmp” (e.g. with all the default options, “Outputs/cornell.txt_1024x1024x1_RayTracerAss3.exe.bmp”).
It takes command line arguments that allow the user to specify the width and height, the anti-aliasing level (must be a power of two), the name of the source scene file, the name of the destination BMP file, and the number of times to perform the render (to improve the timing information).
Additionally it accepts an argument for whether each thread will instead colour the area rendered by the thread as a solid tint based on the x,y coordinates of each pixel.
It loads the specified scene.
It renders the scene (as many times as requested).
It produces timing information for the first run, and the average of the time taken for all subsequent renders produced, ignoring all file IO.

It outputs the rendered scene as a BMP file.
For example, running the program at the command line with no arguments would produce output similar to the following (as well as writing the resultant BMP file to Outputs/cornell.txt_1024x1024x1_RayTracerAss3.exe.bmp):
first run time: 3847ms, subsequent average time taken (0 run(s)): N/A
Adding -runs 10 to the arguments would produce output similar to the following:
first run time: 3891ms, subsequent average time taken (9 run(s)): 3814.2ms
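The two sample outputs above separate the first run from the average of the remaining runs. A minimal sketch of that calculation follows, assuming the per-run times are already collected in milliseconds; summariseRuns is a hypothetical helper and is not part of the provided code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>

// Formats timing results in the same shape as the sample output:
// the first run is reported on its own, and the remaining runs are
// averaged to 1 decimal place (or "N/A" when there are none).
std::string summariseRuns(const std::vector<long long>& timesMs) {
    assert(!timesMs.empty());
    const std::size_t subsequent = timesMs.size() - 1;
    char buffer[160];
    if (subsequent == 0) {
        std::snprintf(buffer, sizeof buffer,
                      "first run time: %lldms, subsequent average time taken (0 run(s)): N/A",
                      timesMs[0]);
    } else {
        // Sum every run except the first, then average.
        const long long total = std::accumulate(timesMs.begin() + 1, timesMs.end(), 0LL);
        const double average = static_cast<double>(total) / static_cast<double>(subsequent);
        std::snprintf(buffer, sizeof buffer,
                      "first run time: %lldms, subsequent average time taken (%zu run(s)): %.1fms",
                      timesMs[0], subsequent, average);
    }
    return buffer;
}
```

Keeping the first run out of the average matters for the OpenCL stages, since the documentation section asks for the first-run time and the average of the remaining 9 runs as separate columns.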
Testing Batch Files
A number of batch files are provided that are intended to be executed from the command line, e.g.
For timing:
baseTiming.bat will perform all the timing tests required for the base single-threaded code.
stage4Timing.bat will perform all the timing tests required for Stage 4.
For testing (requires Image Magick installation), e.g.:
stage1Tests.bat will perform all the comparisons required for Stage 1 Tests.
stage2Tests.bat will perform all the comparisons required for Stage 2 Tests.
stage3Tests.bat will perform all the comparisons required for Stage 3 Tests.
stage4Tests.bat will perform all the comparisons required for Stage 4 Tests.
stage5Tests.bat will perform all the comparisons required for Stage 5 Tests.