程序代写代做代考 data structure compiler c/c++ KIT308/408 Multicore Architecture and Programming Assignment 2 — SIMD

KIT308/408 Multicore Architecture and Programming Assignment 2 — SIMD
Aims of the assignment
The purpose of this assignment is to give you experience at writing a program using SIMD programming techniques. This assignment will give you an opportunity to demonstrate your understanding of:
the where there is one there is many approach (WTIOTIM); structures of arrays;
the use of SIMD for calculation; and
the translation of conditional statements into SIMD.
Due Date
11:55pm Friday 25th of September (Week 10 of semester)
Late assignments will only be accepted in exceptional circumstances and provided that the proper procedures have been followed (see the School Office or this link for details) assignments which are submitted late without good reason will be subject to mark penalties if they are accepted at all (see the School office or this link for details on this as well).
Forms to request extensions of time to submit assignments are available from the Discipline of ICT office. Requests must be accompanied by suitable documentation and should be submitted before the assignment due date.
Assignment Submission
Your assignment is to be submitted electronically via MyLO and should contain:
An assignment cover sheet;
A .zip (or .rar) containing a Visual Studio Solution containing a project for each attempted stage of the assignment (in the format provided in the downloadable materials provided below).
A document containing:
A table of timing information comparing the original single-threaded code (from assignment 1), the provided base code times, and the Stage 1, 2, 3, 4, and 5 SIMD implementation on each scene file (all running with the maximum threads natively supported by your CPU).
An analysis of the above timing data.
You do not need to (and shouldn’t) submit executables, temporary object files, or images. In particular, you must delete the “.vs” diretory before submission as it just Visual Studio temporary files and 100s of MBs. Do not however delete the “Scenes” folder or the “Outputs” folder (but do delete the images within this one).

Marking
This assignment will be marked out of 100. NOTE: if your code for a particular stage does not correctly produce the expected output images, you will only be able to receive a maximum of half marks for that stage — see below for more details.
The following is the breakdown of marks:
Task/Topic
Marks 10%
5%
5%
15%
5% 5%
5%
35%
25%
5% 5% 5%
5%
5%
10%
5%
5%
20%
5% 5% 5%
5%
10%
5% 5%
10%
5% 5%
-10% up to -20% up to -100%
1. Boxes — Where there is One there is Many (WTIOTIM)
Correct implementation of WTIOTIM approach to isBoxIntersected function to calculate distance to nearest object — and use in the objectIntersection function (in Intersection.cpp)
Correct implementation of WTIOTIM approach to create a short-circuiting isBoxIntersected function to return (as soon as possible) whether any box intersects — and use in the isInShadow function (in Lighting.cpp)
2. Boxes — Structures of Arrays
Correct declarations of data structures for SoA SIMD form of Box container data
Correct code to convert AoS container to dynamically declared SoA structures (conversion should happen after the scene has been loaded)
Correct rewriting of isBoxIntersected functions and the calls to it to use SoA code 3. Boxes — SIMD Conversion of Intersection Test
A. SIMD Conversion of short-circuiting isBoxIntersected function
Correct and efficient (e.g. no use of if-statements, for-loops, or scalar code) intersection point (and exit point)
calculation (i.e. tmin and tmax)
Correct and efficient (e.g. no use of if-statements) handling of cases where no intersection occurs due to point
being outside of box (i.e. first if-statement)
Correct and efficient (e.g. no use of if-statements) handling of cases where intersection does occur (i.e. second if- statement)
Correct declaration of loop constants before loop
Correct handling of box list length not being divisible by 8
B. SIMD Conversion of non-short-circuiting isBoxIntersected function
Correct and efficient (e.g. no use of if-statements) SIMD calculation of closest intersection (i.e. t)
Correct and efficient (e.g. no use of if-statements) SIMD calculation of corresponding box index (i.e. index)
4. SIMD Lighting
(NOTE: this section will require SoA declarations and conversion from AoS for Light container data)
Correct SIMD conversion of applyLighting function
Correct SIMD conversion of applySpecular function (i.e. to make it handles 8 lights)
Correct and efficient SIMD conversion (e.g. no use of switch-statements) of applyDiffuse function with applyCheckboard, applyCircles, and applyWood functions manually inlined (i.e. to make it handle 8 lights)
Splitting of applyDiffuse function into two functions, one that can be done before the loop in applyLighting, and one with the part of the calculation that must be done within the loop
5. SIMD Render Loop and Ray Tracing
Correct SIMD conversion of renderSection function (including call to traceRay function and output of pixels)
Correct SIMD conversion of traceRay, calculateIntersectionResponse, calculateReflection, and calculateRefraction
OR a well reasoned explanation of why this conversion is not worth completing (in terms of likely efficiency increase)
Documentation
Outputs showing timing information for the base assignment code and Stages 1–5 (with the maximum threads natively supported by your CPU) on all applicable scene files
Analysis of data (comparisons across the base code and Stages 1–5)
Penalties
Failure to comply with submission instructions (eg. no cover sheet, incorrect submission of files, abnormal solution/project structure, etc.)
Poor programming style (eg. insufficient / poor comments, poor variable names, poor indenting, obfuscated code without documentation, compiler warnings, etc.)
Lateness (-20% for up to 24 hours, -50% for up to 7 days, -100% after 7 days)

Marking and Correct Images
As SIMD code is very very difficult to debug — as grows exponentially harder as the amount of it grows — there will be limited opportunity in the marking process to determine where exactly mistakes have been made, and even less chance of being able to provide fixes to those mistakes.
As a result, if your code for a stage does not produce the expected image, then you will only be eligible for half marks for that stage of the assignment.
In order to work within this constraint, you should attempt translations into SIMD in very small steps (i.e. a single line at a time, or even a partial line at a time). The lectures will demonstrate this approach to SIMD translation. You will likely receive more marks for a partially complete SIMD translation than a fully complete translation that doesn’t work 100%.
Stage 4 and Stage 5
As these stages may require the use of less accurate approximations to some mathematical functions, the concept of “correct” image will be somewhat relaxed from having to be an exact match as specified by ImageMagick — an acceptable visual match will likely be enough.
Programming Style
This assignment is not focussed on programming style (although it is concerned with efficient code), but you should endeavour to follow good programming practices. You should, for example:
comment your code;
use sensible variables names;
use correct and consistent indenting; and
internally document (with comments) any notable design decisions.
[NOTE: any examples in the provided assignment materials that don’t live up to the above criteria, should be considered to be deliberate examples of what not to do and are provided to aid your learning ;P]

The Assignment Task
You are to modify the (square-based) multithreaded implementation of a “simple” raytracer from the first assignment to take advantage of SIMD instructions. This requires changes across multiple files.
To aid you in this conversion, the sphere/ray intersection code has already been translated into SIMD code (by following the techniques required to do stages 1–3 of the assignment).
From the provided (square-based) multithreaded raytracer implementation, for Stages 1–3 of the assignment you will create multiple subsequent versions that modify the box implementation as follows:
1. Rewritten functions for the box/ray intersection tests in a where-there-is-one-there-is-many approach. 2. Translation of array-of-structures (AoS) with structures-of-arrays (SoA) for the box container.
3. Optimisation of the box/ray intersection test to take advantage of SIMD.
For Stage 4, you will follow a similar pattern, but for lights and the lighting calculations. For Stage 5, you will attempt to rewrite the main rendering loop and subsequent functions.
Implementation — SIMD Boxes (Stages 1–3)
The following section describes in detail the steps required to complete Stages 1–3 of the assignment. If you complete this step, only then should you attempt Stage 4 and do similar steps for lights. If you complete this step, only then should you attempt Stage 5.
1. Where there is One there is Many
This stage involves replacing the isBoxIntersected function with two new versions that use the where there is one there is many paradigm.
In order to complete this step you will need to:
Write two functions that take the entire box container (or at least a pointer/reference) as an argument and will perform whatever action took place at the calling site. There are two versions as the isInShadow function (in Lighting.cpp) uses a short-circuited approach, exiting as soon as any box collision is discovered, and the objectIntersection function (in Intersection.cpp) finds the closest box intersected with (which means it must examine them all).
A version of isBoxIntersected that returns a boolean depending on whether or not a box intersects with the ray (i.e. stopping as soon as one is found).
A version of isBoxIntersected that returns void and updates its t parameter to be the closest box that intersects with the ray.
2. Structures of Arrays
This stage involves modifying the Scene struct (Scene.h) to store a duplicate of the box data via structure of arrays rather than the currently existing array of structures (currently in the boxContainer variable).
In order to complete this step you will need to:
Create a SoA copy of the data that is stored in boxContainer in the Scene struct.
NOTE: this means the program has two copies of the box data, the original one in AoS form and a second one in SoA form. As this assignment predominately only requires changes to isBoxIntersected (and the sites of the calls to this function) the rest of the code will still use the original AoS version of the data.
Fill this SoA copy of the box data after the Scene struct has been loaded (dynamically allocating memory for the SoA representation).
Rewrite the isBoxIntersected function to make use of the SoA.
At times this code update may require conversions to and/or from Vector structs to the equivalently stored data in the SoA.
At the end of this stage the program should still produce the same results as the base code. Be thorough in your testing (i.e. test all the scenes) to ensure that everything works correctly before progressing.
3A. SIMD Conversion of isBoxIntersected function
This stage involves converting the first of the two new version of the isBoxIntersected function from Stage 2 (the bool returning
one) into a SIMD implementation.
In order to complete this step you will need to:
Convert many of the input parameters (or parts of them) into SIMD-ready values (e.g. the ray starting location and direction need to be converted into a SIMD format).
Step through the SoA in chunks proportional to the number of value stored in each SIMD variable (i.e. 8).
Convert the calculation to use SIMD. Care must be taken to correctly deal with the various conditional expressions in the loop. There are two if-statements that must be converted to SIMD code via the use of appropriate masking statements.
Ensure your approach deals with the situation where the box count isn’t equally divisible by the number in each SIMD calculation (e.g. 9 boxes with 8 in each SIMD value).
3B. SIMD Conversion of isBoxIntersected function
This stage involves converting the latter of the two new versions of the isBoxIntersected function from Stage 2 (the void returning
one) into a SIMD implementation.
Most of this conversion is identical to step 3, and the code can just be copied from that function. It does however require the addition of:

New variables to keep track of the closest intersection found, and which box index it was found at.
Consolidation of the values calculated using SIMD into a single scalar return value. This should be done after the loop finishes.
Hints / Tips
The techniques required to complete each stage rely heavily on work done in the SIMD tutorials and the step-by-step approach shown in lectures — refer to them often.
When implementing the SIMD stage, it’s best to SIMD-ify small portions of code at a time (e.g. even a single line is often a lot) and then perform the rest of the calculation through the existing scalar code, or even compare the result of the SIMD version to the existing scalar code.
Again for SIMD, it’s helpful to render scenes at a greatly reduced size for testing, with an equally small block size, one sample per pixel, and with one thread (e.g. try using allMaterials -size 8 8 -blockSize 8 -samples 1 -threads 1). It may even be helpful to fix the number of rays cast (MAX_RAYS_CAST) to be 1, so that reflection and refraction don’t occur. This doesn’t produce a very nice image, but images can be easily compared visually for problems (the example is only 64 pixels) and printf statements can be used to verify the correctness of calculations without outputting a huge amount of data (the use of debuggers is somewhat perilous in a multithreaded environment, but that shouldn’t be an issue here).
When converting conditional statements to masks it can be helpful to do this with scalars first (although C/C++ implementations often use 1 rather than 0xFFFFFFFF for true, so the calculations may be different).
Write functions to output all the elements in a SIMD value for easier printf debugging (I am not a fan of how these types are shown in the debugger).
Implementation — SIMD Lighting (Stage 4)
The following section describes in detail the steps required to complete Stage 4 of the assignment. You should only attempt this step if you have fully completed (and tested) Stages 1–3. If you complete this step, only then should you attempt Stage 5.
This stage involves rewriting the applyLighting function and a number of functions it calls. In order to complete this step you will need to:
Create SoA declarations and conversions from AoS for Light container data (in a similar way that this was done for spheres and boxes).
Convert the applyLighting function to SIMD (i.e. handle multiple lights simultaneously).
NOTE: the call to isInShadow is not suitable for SIMD conversion, so a simple for loop that repeatedly calls the function
and stores each result in turn in a SIMD vector is acceptable.
Convert the applySpecular function to SIMD (i.e. perform this for 8 lights and produce 8 colours).
Create a normal non-SIMD applyDiffuse function that has applyCheckboard, applyCircles, and applyWood functions manually inlined before attempting the conversion to SIMD.
Convert this new applyDiffuse function to SIMD (i.e. perform this for 8 lights and produce 8 colours).
Determine what parts of applyDiffuse are the same regardless of where the light is, and separate the function into two parts, and then integrate this into applyLighting.
Missing Mathematical Functions
There are no definitions of a few needed mathematical functions in the AVX/AVX2 instruction sets:
The powf function, as used in the applySpecular function in Lighting.cpp.
The sinf and cosf functions, as used in the applyWood function in Texture.cpp. The expf function, as used in the convertToPixel method in Colour.h.
Definitions for these functions can either be sourced from:
The SVML api — if your processor / compiler supports them (e.g. _mm256_sin_ps).
From the included MathSIMD.h file (e.g. _sin256_ps). NOTE: when using this library, function precision will likely be reduced, resulting in output images diverging somewhat from the reference images).
By writing your own function that just calls the required function 8 times for each slot in the vector (e.g. sinf).
Hints / Tips
All the hints/tips from the stages 1–3 apply here (i.e. complete this in small steps, use 8×8 images as tests, use a lot of printfs, etc.).
You may find it helpful to define some helper classes similar to Vector8 to aid with the completion of this step, e.g. Colour8 (and maybe even Ray8). Also in a similar fashion to Vector8, defining useful operators can help with translation and code readability.
Implementation — SIMD Main Rendering Loop (Stage 5)
The following section describes in detail the steps required to complete Stage 5 of the assignment. You should only attempt this step if you have fully completed (and tested) Stages 1–4.
This stage involves rewriting the renderSection function and a number of functions it calls. In order to complete this step you will need to:
Determine how to introduce SIMD into the main loop within renderSection with the aim of generating and writing 8 pixels of the image at a time.
Determine how to convert traceRay, calculateIntersectionResponse, calculateReflection, and calculateRefraction to SIMD. At this point you can either complete this conversion, or explain why this conversion is not likely to yield significant effeciency increases to be worth the effort — if you determine this to be the case.

Documentation
When completing either stage of the assignment you should provide:
timing information for each scene file for the average time (to 1 decimal place) taken over 5 runs for a render using a thread count supported by your the number of logical processors in your system and a block size of 16. See a later section for an example format for this timing table; and
an explanation of the results (e.g. why there’s no difference between the performance of stages 1, 2, and 3 (NOTE: this is a made up example and isn’t necessarily what to expect), or why a particular type of implementation works well (or poorly) on a particular scene, etc.). This explanation should be with respect to the CPU on the system on which you ran the tests, and you should discuss how the architectural features of the CPU explain the results.
Tests / Timing
The following table lists all the tests that your code needs to generate correctly at each stage. It also shows the timing tests that need to be performed in order to fully complete the documentation section of the assignment. Fully completing this tests may take up to an hour (with the 5 required runs) on some hardware, so plan your time accordingly.
In order to confirm your images match the images created by the base version of the assignment code, it’s strongly recommended you use a image comparison tool. For part of the marking for this, Image Magick will be used (as it was in Assignment 1).
Timing Test
Average Time of 5 Runs (Milliseconds)
Base Code
SIMD Spheres
Stage 1 WTIOTIM
Stage 2 SoA
Stage 3 SIMD Boxes
Stage 4 SIMD Lighting
Stage 5 SIMD Main Loop
1.
-input Scenes/cornell.txt – size 1024 1024 -samples 4
2.
-input Scenes/allmaterials.txt -size 1000 1000 -samples 4
3.
-input Scenes/5000spheres.txt – size 1280 720 -samples 1
4.
-input Scenes/dudes.txt – size 1024 1024 -samples 1
5.
-input Scenes/cornell- 199lights.txt -size 1024 1024 -samples 1
The following tests will be run on your code for each scene file. You also might find the 8×8 tests useful for your SIMD code conversion:
Test
Image Result
1.
-blockSize 8 -size 8 8 -samples 1 -threads 1
2.
-blockSize 16 -size 256 256 -samples 1
3.
-blockSize 16 -size 256 256 -samples 2

Provided Materials
The materials provided with this assignment contain:
The source code of the base multi-threaded version of the raytracer (i.e. a solution to Assignment 1).
The source code of the base multi-threaded version of the raytracer with a SIMD implementation of the Ray-Sphere test. A set of scene files to be supplied to the program.
A set of reference images for testing.
Some batch files for testing purposes.
Download the materials as an ZIP file.
Source Code
The provided MSVC solution, contains 7 projects.
RayTracerAss2
The provided code consists of 19 source files.
Raytracing logic:
Raytrace.cpp: this file contains the main function which reads the supplied scene file, begins the raytracing, and writes the output BMP file. The main render loop, ray trace function, and handling of reflection and refraction is also in this file.
Intersection.h and Intersection.cpp: these files define a datastructure for capturing relevant information at the point of intersection between a ray and a scene object and functions for testing for individual ray-object collisions and ray-scene collisions.
Lighting.h and Lighting.cpp: these files provide functions to apply a lighting calculation at a single intersection point. Texturing.h and Texturing.cpp: these files provide functions for the reading points from 3D procedural textures. Constants.h: this header provide constant definitions used in the raytracing.
Basic types:
Primitives.h: this header contains definitions for points, vector, and rays. It also provides functions and overloaded operators for performing calculations with vectors and points.
SceneObjects.h: this header file provides definitions for scene objects (ie. materials, lights, spheres, and boxes). Colour.h: this header defines a datastructure for representing colours (with each colour component represented as a float) and simple operations on colours, including conversions to/from the standard BGR pixel format.
Scene definition and I/O:
Scene.h and Scene.cpp: the header file contains the datastructure to represent a scene and a single function that initialises this datastructure from a file. The scene datastructure itself consists of properties of the scene and lists of the various scene objects as described above. The implementation file contains many functions to aide in the scene loading process. Scene loading relies upon the functionality provided by the Config class.
Config.h and Config.cpp: this class provide facilities for parsing the scene file.
SimpleString.h: this is helper string class used by the Config class.
Image I/O:
ImageIO.h and ImageIO.cpp: these files contain the definitions of functions to read and write BMP files.
Miscellaneous:
Timer.h: this class provides a simple timer that makes use of different system functions depending on whether TARGET_WINDOWS, TARGET_PPU, or TARGET_SPU is defined (we don’t use the latter two, but I left this file unchanged in case anyone wanted to see how such cross-platform stuff can be handled).
Stage0_SIMDSpheres
The provided code consists of 21 source files. There are a number of modified files:
Scene definition and I/O:
Scene.h: SIMD SoA datastructures for representing spheres.
Raytracing logic:
Raytrace.cpp: simdifySceneContainers makes a copy of the SoA sphere data in AoS SIMD form.
Intersection.h: prototypes of WTIOTIM short-circuiting and non-short-circuiting versions of isSphereIntersected. Intersection.cpp:
selectMinimumAndIndex helper function.
WTIOTIM short-circuiting version of isSphereIntersected.
WTIOTIM non-short-circuiting version of isSphereIntersected.
A commented out version that of isSphereIntersected that doesn’t use helper functions/classes from PrimitivesSIMD.h (just for reference).
A call to non-short-circuiting version of isSphereIntersected in the objectIntersection function.
Lighting.cpp: call to short-circuiting version of isSphereIntersected function in isInShadow function. There are two new files (not included in the base project):
SIMD Helpers:
PrimitivesSIMD.h: this header contains operators for many common SIMD functions as well as a definition for Vector8 (as a representation for 8 vectors). It also provides functions and overloaded operators for performing calculations with Vector8s.
MathSIMD.h: this header file provides definitions for some math/trigonometry functions not provided on most systems (e.g. sin, cos, pow, exp, etc.). NOTE: this is not needed until Stage 4.

Stage1 – Stage5
These projects are empty.
To begin work on the assignment you should (in Windows Explorer) copy all of the 21 .h and .cpp files from Stage0_SIMDSphere into
the Stage1 folder and then right-click on the Stage 1 in Visual Studio and choose “Add / Exiting Item…” and add those 21 files.
Executing
The program has the following functionality:
By default it will attempt to load the scene “Scenes/cornell.txt” and render it at 1024×1024 with 1×1 samples (using the maximum number of threads supported by the CPU natively, with a block size of 16).
By default it will output a file named “Outputs/[scenefile-name]_[width]x[height]x[sample-level]_[executable-filename].bmp” (e.g. with all the default options, “Outputs/cornell.txt_1024x1024x1_RayTracerAss2.exe.bmp”)
It takes command line arguments that allow the user to specify the width and height, the anti-aliasing level (must be a power of two), the name of the source scene file, the name of the destination BMP file, and the number of times to perform the render (to improve the timing information).
Additionally it accepts some arguments for setting the number of threads, whether each thread will tint the area that it renders, whether each thread will instead colour the area rendered by the thread as a solid grey, and the size of the block to render.
It loads the specified scene.
It renders the scene (as many times as requested).
It produces timing information for the average time taken to produce a render ignoring all file IO.
It outputs the rendered scene as a BMP file.
For example, running the program at the command line with no arguments would perform the first test (as described in the scene file section):
On execution this would produce output similar to the following (as well as writing the resultant BMP file to Outputs/cornell.txt_1024x1024x1_RayTracerAss2.exe.bmp):
rendered 1048576 samples using 8 threads, average time taken (1 run(s)): 578.0ms
Testing Batch Files
A number of batch files are provided that are intended to be executed from the command line, e.g.
For timing:
baseTiming.bat will perform all the timing tests required for Stage 1 (i.e. 5 runs with the appropriate amount of threads for each Test scene).
stage0Timing.bat will perform all the timing tests required for Stage 0 (SIMD Spheres).
stage1Timing.bat will perform all the timing tests required for Stage 1.
stage2Timing.bat will perform all the timing tests required for Stage 2. stage3Timing.bat will perform all the timing tests required for Stage 3. stage4Timing.bat will perform all the timing tests required for Stage 4. stage5Timing.bat will perform all the timing tests required for Stage 5.
For testing (requires Image Magick installation), e.g.:
stage1Tests.bat will perform all the comparisons required for Stage 1 Tests.
stage2Tests.bat will perform all the comparisons required for Stage 2 Tests.
stage3Tests.bat will perform all the comparisons required for Stage 3 Tests.
stage4Tests.bat will perform all the comparisons required for Stage 4 Tests.
stage4Tests_MathSIMD.bat will perform all the comparisons required for Stage 4 Tests if you use theMathSIMD.h functions.
stage5Tests.bat will perform all the comparisons required for Stage 5 Tests.
stage5Tests_MathSIMD.bat will perform all the comparisons required for Stage 5 Tests if you use theMathSIMD.h functions.