KIT308/408 Multicore Architecture and Programming Assignment 1 — Multithreading
Aims of the assignment
The purpose of this assignment is to give you experience at writing a multithreaded program for the CPU. This assignment will give you an opportunity to demonstrate your understanding of:
creating and handling multiple threads; partitioning work;
dynamic scheduling; and
data-sharing between threads.
Due Date
11:55pm Friday 21st of August (Week 6 of semester)
Late assignments will only be accepted in exceptional circumstances and provided that the proper procedures have been followed (see the School Office or this link for details) assignments which are submitted late without good reason will be subject to mark penalties if they are accepted at all (see the School office or this link for details on this as well).
Forms to request extensions of time to submit assignments are available from the School of Engineering and ICT office. Requests must be accompanied by suitable documentation and should be submitted before the assignment due date.
Due to the tight schedule of this unit (and the reliance of future assignments on a solution to this one), extensions will be unable to be granted and late assignments will not be accepted. If exception circumstances occur for an individual student they must contact the lecture-in-charge before the assignment due date to arrange alternative methods of assessment in this unit.
Assignment Submission
Your assignment is to be submitted electronically via MyLO and should contain:
An assignment cover sheet;
A .zip (or .rar) containing a Visual Studio Solution containing a project for each attempted stage of the assignment (in the format provided in the downloadable materials provided below).
A document containing:
A table of timing information comparing the original single-threaded times against each of three stages on each scene file.
An analysis of the above timing data.
You do not need to (and shouldn’t) submit executables, temporary object files, or images. In particular, you must delete the “.vs” diretory before submission as it just Visual Studio temporary files and 100s of MBs. Do not however delete the “Scenes” folder or the “Outputs” folder (but do delete the images within this one).
Marking
This assignment will be marked out of 100. The following is the breakdown of marks:
Task/Topic
1. Basic Multithreaded CPU Implementation
Correct (5%) and elegant (5%) thread management
(i.e. the correct image for any number of threads (even >64 threads)
Correct use of “-thread” command-line argument
Successful use of “-testMode” and “-colourise” command-line arguments
Correct partitioning of render for images with height equally divisible by thread count
(i.e. the correct image, with exactly equal work area for each thread, and with the correct number of samples rendered for: Tests 1, 2, 4, 5, and 7 with 8 threads)
Correct partitioning of render for images with height NOT equally divisible by thread count
(i.e. the correct image, with work area allocated “evenly” across threads, and with the correct number of samples rendered for: Tests 3 and 6 with 8 threads, and also Test 1 with 3, 5, 6, 7, and 513 threads)
2. Dynamically Balanced Line-Based Multithreaded CPU Implementation
Correct (5%) and elegant (5%) thread management
Correct (5%) and elegant (5%) data-sharing between threads
(i.e. no possibility of threads “fighting” over shared memory locations)
3. Dynamically Balanced Square-Based Multithreaded CPU Implementation
Correct and elegant thread management
Correct and elegant data-sharing between threads Correct use of “-blockSize” command-line argument
Correct partitioning of render into squares for images with size that is equally divisible by block size
(i.e. the correct image, with exactly equal work area for each block, rendered for: Tests 1, 2, 5, and 7 with block size 8 and 64)
Correct partitioning of render into squares for images with size that is NOT equally divisible by block size (i.e. the correct image, with work area for each block either at block size or smaller at edges, rendered for: Tests 3, 4, and 6 with block size 8 and 64)
Correct partitioning of render with no extra samples generated
(i.e. the correct image, with work area for each block either at block size or smaller at edges, and with the correct number of samples rendered for: Tests 3, 4, and 6 with block size 8 and 64)
Documentation
Outputs showing timing information for each stage on all applicable scene files with all thread counts (to get full marks for this part your code needs to pass all tests for all stages above)
Analysis of data with respect to CPU used
(to get full marks for this part your code needs to pass all tests for all stages above)
Penalties
Failure to comply with submission instructions (eg. no cover sheet, incorrect submission of files, submission of unnecessary temporary or image files, abnormal solution/project structure, etc.)
Poor programming style (eg. insufficient / poor comments, poor variable names, poor indenting, obfuscated code without documentation, compiler warnings, etc.)
Lateness (-20% for up to 24 hours, -50% for up to 7 days, -100% after 7 days) Late assignments will not be accepted
Programming Style
Marks 30%
10%
5% 5% 5%
5%
20%
10% 10%
30%
5% 5% 5% 5%
5%
5%
20%
10% 10%
-10% up to -20% up to -100%
This assignment is not focussed on programming style, but you should endeavour to follow good programming practices. You should, for example:
comment your code;
use sensible variables names;
use correct and consistent indenting; and
internally document (with comments) any notable design decisions.
[NOTE: any examples in the provided assignment materials that don’t live up to the above criteria, should be considered to be deliberate examples of what not to do and are provided to aid your learning ;P]
The Assignment Task
You are to implement a “simple” raytracer in a multithreaded fashion (for a quick definition of raytracing see the wikipedia page). From the provided single-threaded raytracer implementation, you will create multiple subsequent versions as follows:
1. A basic multithreaded CPU implementation.
2. A dynamically balanced line-based multithreaded CPU implementation.
3. A dynamically balanced square-based multithreaded CPU implementation.
Implementation
1. Basic Multithreaded CPU Implementation
This stage involves splitting up the render across multiple threads by splitting up the render into equal sized chunks (or almost equal sized chunks when the number of threads doesn’t divide evenly into the image height) with each chunk being rendered entirely by an individual thread.
In order to complete this step you will need to:
Write code to partition the rendering job into chunks (so that each thread generates a different part of the final image). Write code for the to create and manage multiple threads.
At the end of this stage the program should be able to handle any image size (any dimensions that are multiples of four, at least, up to the maximum of 2048×2048).
To be eligible for full marks, your assignment should also use the command-line option “-threads” to set the number of threads to any value that is less than or equal to the height of the image being rendered and also respect the “-colourise” option to tint each section of the image with a set of colours (colour re-use is expected for a large number of threads).
2. Dynamically Balanced Line-Based Multithreaded CPU Implementation
This stage involves splitting up the render into lines which are dynamically allocated to threads from the thread pool as they request more work.
In order to complete this step you will need to:
Start as many threads as required (specified from the command-line arguments) for the thread pool. Manage some shared data between the threads so that each do the correct work.
3. Dynamically Balanced Square-Based Multithreaded CPU Implementation
This stage involves splitting up the render into squares which are dynamically allocated to thread from the thread pool as they request more work.
In order to complete this step you will need to:
Partition the render into squares (requires more carefulness when writing results to the image).
NOTE: this implementation is (slighty) harder than Stage 2, but might not result in an increase in performance.
Hints / Tips
The techniques required to complete each stage rely heavily on work done in the tutorials 2 and 3 — refer to them often.
Documentation
For each stage of the assignment you attempt you should provide:
average (of 3 runs) timing information for each scene file for the total time taken for a render for particular thread counts. See the next section for an example format for this timing table that specifies which tests are required for which scenes; and
an explanation of the results (e.g. why there’s no difference between the performance of stages 1, 2, and 3 (NOTE: this is a made up example and isn’t necessarily what to expect), or why a particular type of implementation works well (or poorly) on a particular scene, etc.). This explanation should be with respect to the CPU on the system on which you ran the tests, and you should discuss how the architectural features of the CPU explain the results.
Tests / Timing
The following table lists all the tests that your code needs to generate correctly at each stage. It also shows the timing tests that need to be performed in order to fully complete the documentation section of the assignment.
In order to confirm your images match the images created by the base version of the assignment code, it’s strongly recommended you use a image comparison tool. For part of the marking for this, Image Magick will be used:
Image Magick: download for Windows/Mac/Linux. e.g. running this from the command-line to test:
magick.exe compare -metric mae Outputs\cornell.txt_1024x1024x1_RayTracerAss1.exe.bmp
Outputs_REFERENCE\cornell.txt_1024x1024x4_RayTracerAss1.exe.bmp diff.bmp
This produces an image (“diff.bmp”) showing the differences between the two source images, and also a numeric summary of the difference, here these images are 0.13266% different):
0.33827 (0.0013266)
(this example deliberately compares the 1×1 sampled to the 4×4 sampled image to show there is a difference).
Note: to fully complete the empty cells in the table below, around 180 total images will need be calculated (each test needs 3 runs, and the first six tests need to be run with the base code, and with 8 threads for each stage of the assignment, and test 7 needs to be run for 1–8 threads for stage 1, stage 2, and stage 3 with two different block sizes), so plan your time accordingly.
Test
1. -input Scenes/cornell.txt -size 1024 1024 -samples 1
2. -input Scenes/cornell.txt -size 1024 1024 -samples 4
3. -input Scenes/cornell.txt -size 500 300 -samples 1
4. -input Scenes/allmaterials.txt -size 1000 1000 – samples 1
5. -input Scenes/cornell- 199lights.txt -size 256 256 -samples 1
6. -input Scenes/5000spheres.txt -size 480 270 -samples 1
7. -input Scenes/dudes.txt -size 256 256 -samples 1
Single Threaded
Average Time (Milliseconds) Multithreaded Technique / (Stage)
Threads 12345678
Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage Static Chunks (Stage 1) Dynamic Lines (Stage 2) Dynamic Squares (Stage Dynamic Squares (Stage
3) — blocksize 8 3) — blocksize 64
3) — blocksize 8 3) — blocksize 64
3) — blocksize 8 3) — blocksize 64
3) — blocksize 8 3) — blocksize 64
3) — blocksize 8 3) — blocksize 64
3) — blocksize 8 3) — blocksize 64
3) — blocksize 8 3) — blocksize 64
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X
Provided Materials
The materials provided with this assignment contain:
the source code of the base single-threaded version of the raytracer;
a set of scene files to be supplied to the program;
a set of reference output files created from the single-threaded version of the program;
four batch files that will run all 7 timing tests for the base code, and each of the three stages; and
a further set of five batch files that will run comparison tests (assuming Image Magick is installed in its default location) for each of the three stages.
Download the materials as an ZIP file.
Source Code
The provided code consists of 19 source files.
Raytracing logic:
Raytrace.cpp: this file contains the main function which reads the supplied scene file, begins the raytracing, and writes the output BMP file. The main render loop, ray trace function, and handling of reflection and refraction is also in this file.
Intersection.h and Intersection.cpp: these files define a datastructure for capturing relevant information at the point of intersection between a ray and a scene object and functions for testing for individual ray-object collisions and ray- scene collisions.
Lighting.h and Lighting.cpp: these files provide functions to apply a lighting calculation at a single intersection point. Texturing.h and Texturing.cpp: these files provide functions for the reading points from 3D procedural textures. Constants.h: this header provide constant definitions used in the raytracing.
Basic types:
Primitives.h: this header contains definitions for points, vector, and rays. It also provides functions and overloaded operators for performing calculations with vectors and points.
SceneObjects.h: this header file provides definitions for scene objects (ie. materials, lights, spheres, and boxes). Colour.h: this header defines a datastructure for representing colours (with each colour component represented as a float) and simple operations on colours, including conversions to/from the standard BGR pixel format.
Scene definition and I/O:
Scene.h and Scene.cpp: the header file contains the datastructure to represent a scene and a single function that initialises this datastructure from a file. The scene datastructure itself consists of properties of the scene and lists of the various scene objects as described above. The implementation file contains many functions to aide in the scene loading process. Scene loading relies upon the functionality provided by the Config class.
Config.h and Config.cpp: this class provide facilities for parsing the scene file.
SimpleString.h: this is helper string class used by the Config class.
Image I/O:
ImageIO.h and ImageIO.cpp: these files contain the definitions of functions to read and write BMP files.
Miscellaneous:
Timer.h: this class provides a simple timer that makes use of different system functions depending on whether TARGET_WINDOWS, TARGET_PPU, or TARGET_SPU is defined (we don’t use the latter two, but I left this file unchanged in case anyone wanted to see how such cross-platform stuff can be handled).
Executing
The program has the following functionality:
By default it will attempt to load the scene “Scenes/cornell.txt” and render it at 1024×1024 with 1×1 samples.
By default it will output a file named “Outputs/[scenefile-name]_[width]x[height]x[sample-level]_[executable- filename].bmp” (e.g. with all the default options, “Outputs/cornell.txt_1024x1024x1_RayTracerAss1.exe.bmp”)
It takes command line arguments that allow the user to specify the width and height, the anti-aliasing level (must be a power of two), the name of the source scene file, the name of the destination BMP file, and the number of times to perform the render (to improve the timing information).
Additionally it accepts some arguments (that are currently unused) for setting the number of threads, whether each thread will tint the area that it renders, and the size of the block to render (only applicable for stage 3).
Further it accepts an arguments “-testMode” that will fill the output with white. When executing with multiple threads this should be a greyscale colour that represents the thread number.
It loads the specified scene.
It renders the scene (as many times as requested).
It produces timing information for the average time taken to produce a render ignoring all file IO, and the number of samples generated per run.
It outputs the rendered scene as a BMP file.
For example, running the program at the command line with no arguments would perform the first test (as described in the scene file section):
On execution this would produce output similar to the following (as well as writing the resultant BMP file to Outputs/cornell.txt_1024x1024x1_RayTracerAss1.exe.bmp):
average time taken (1 run(s)): 3672ms
Testing Batch Files
A number of batch files are provided that are intended to be executed from the command line, e.g.
For timing:
Stage1Timing.bat 3 8 will perform all the timing tests required for Stage 1 (i.e. 3 runs with 8 threads one each Test scene).
Stage2Timing.bat 3 8 will perform all the timing tests required for Stage 2 (i.e. 3 runs with 8 threads one each Test scene).
Stage3Timing.bat 3 8 8 will perform all the timing tests required for Stage 3 at a particular block size (i.e. 3 runs
with 8 threads and block size 8 on each Test scene). For testing (requires Image Magick installation), e.g.:
Stage1Tests1.bat will perform all the comparison required for Stage 1 Tests where the thread count evenly divides into the image height.
Stage1Tests2.bat will perform all the comparison required for Stage 1 Tests where the thread count doesn’t evenly divide into the image height.
Ray Tracing
The materials provided with this assignment include an implementation of a simple raytracer based on the raytracing tutorial at codermind.com (links currently broken, but still available via the wayback machine here). This tutorial also provides a pretty good introduction to the concepts of ray tracing.
Apart from a general structure rewrite and simplification of the code, the supplied implementation has the following similarities and differences to the codermind tutorial:
it retains support for:
spheres;
diffuse (Lambert) and specular (Blinn-Phong) lighting; shadows;
super-sampled anti-aliasing;
simple exposure;
a conic perspective camera model; and
reflection and refraction.
it is extended, in that is provides additional support for: (axis-aligned) boxes;
(simple) procedural 3D textures;
setting the camera’s position and (xz) rotation; reading and writing BMP files.
it is simplified, in that it doesn’t provide support for: blobs;
gamma calculations;
perlin noise based procedural textures;
bump mapping;
auto exposure estimation;
bitmap texturing;
cubemap texturing;
an environment cubemap;
depth of field; and
support for reflection and refraction in the same material.
For those of you who want more information (as it’s not entirely necessary to understand these concepts to be able to complete the assignment) on some of the general and/or mathematical aspects of raytracing, see:
General explanations:
Codermind Tutorial (archived mirror): A step-by-step guide and the original source of the codebase of the assignment materials.
Project Amiga Juggler: A step-by-step guide to the maths and implementation of a raytracer (handling only spheres and a ground plane) in Java:
Step 1: vectors, rays, dot-product, and cross-product. Step 5: camera model.
Step 8: ray-sphere intersections and Phong shading. Step 9: ray-plane intersections.
Step 13: reflection.
Wikipedia’s ray tracing page: a basic outline of the concepts.
Princeton Ray Casting Lectures: concepts, maths, and some pseudo code. Princeton Illumination lectures: concepts, maths, and some pseudo code. Softsurfer Algorithms: geometry algorithms.
Intersections:
Ray-sphere intersections: wikipedia, codermind (part 1), Project Amiga Juggler (step 8), Princeton Ray Casting Lectures (Slide 11–14).
Ray-AABB (axis-aligned bounding box) intersections: medium.
Ray-plane intersections: wikipedia, Princeton Ray Casting Lectures (Slide 17), Softsurfer.com Algorithms. Ray-triangle intersections: wikipedia, source paper, lighthouse3d, one more explanation.
Ray-something intersections: Real-time Rendering.
Lighting:
Diffuse lighting: codermind (part 1), Princeton Illumination Lectures (Slide 19–21). Specular lighting: codermind (part 2), Princeton Illumination Lectures (Slide 23–25). Shadowing: Project Amiga Juggler (step 8).
Perlin Noise:
2D: wikipedia.
3D: Understanding Perlin Noise.
One of the scenes uses voxel characters created by miklovesrobots originally in the .VOX format:
Voxel character / objects: Mini Mike’s Metro Minis. Magica Voxel editor and resources: MagicaVoxel. Voxel file format: voxel-model.