Student ID No: _________________
UNIVERSITY OF TASMANIA
Pages: 14 Questions: 23
EXAMINATIONS FOR DEGREES AND DIPLOMAS June 2019
KIT308 Multicore Architecture and Programming
KIT408 Advanced Multicore Architecture and Programming
Instructions:
First and Only Paper Ordinary Examination
Examiners: Ian Lewis
Time Allowed: TWO (2) hours Reading Time: FIFTEEN (15) minutes
There are a total of 120 marks available. Attempt ALL questions from Section A; attempt EIGHT (8) questions from Section B; and attempt ALL questions from Section C.
The answers to questions 1–10 (Section A) must be clearly written on the answer sheet provided at the back of the paper. Answers to Section B and Section C must be written in the booklet provided.
-2- KIT308 Multicore Architecture and Programming
SECTION A – MULTIPLE CHOICE
Attempt ALL questions from Section A. Each question is worth 2 marks. This section is worth 20 marks, or 16.7% of the examination.
Question 1. (Addresses ILO 1)
Which of the a. b. c.
d. e.
following is a description of simultaneous multithreading?
Threadsareexecutedsimultaneouslyonmultiplecores
Multiplethreadsareexecutedsimultaneouslywithinthesamepipeline
Control is passed between threads and each thread is given a small amount of time to execute
Eachthreadexecutesuntilcompletion Alloftheabove
Question 2. (Addresses ILO 1)
Why are multicore architectures often composed of simplified cores?
a. ToreduceCPUclockspeed
b. Toenablemorecorestobeplacedononechip
c. To simplify the design of the L1 caches
d. Toincreaseinstructionparallelism
e. Toallowforeasycommunicationbetweencores
Question 3. (Addresses ILO 1)
[2 marks]
Which of the a. b. c. d. e.
following is a description of a multicore heterogeneous architecture? Aprocessorwitheachcorehavingpipelinesofvaryinglengths Aprocessorthatallowsformultiplethreadstoberunoneachcore A processor with each core having multiple different ALUs Aprocessorwithspecialisedcores
Continued…
Aprocessorwithidenticalcores
[2 marks]
[2 marks]
-3- KIT308 Multicore Architecture and Programming
Question 4. (Addresses ILO 1)
Which of the CPU?
a.
b. c. d. e.
following is the best description of the way caches are shared on a multicore
OnlytheL2andabove(e.g.L3andL4asappropriate)aresharedbetween cores
TheL1andL2cachesaresharedbetweencores
Only the L1 data cache is shared between cores
OnlytheL1cacheissharedbetweencores
DifferentCPUdesignssharedifferentlevelsofcache(anywherefromthe L1 onwards) between each core
Question 5. (Addresses ILO 1)
The following are the execution times of a multithreaded program (performing a task that can be split up into fully independent portions) being run with a various number of threads:
[2 marks]
Threads
Time (msecs)
1
1000
2
511
4
460
8
427
16
431
Which of the a. b. c.
d. e.
following architectures is it most likely that the program is being run on?
Atwo-coresystemwith4-waysimultaneousmultithreading
Ahomogeneouseight-coreRISC-basedsystemwithL1/L2/L3cache
A single-core system with 2-way simultaneous multithreading with SIMD instructions that pack 4 values into a single vector
Asingle-coresystemwith4-waysimultaneousmultithreading Aheterogenoussuperscalararchitecturewithin-orderscheduling
Continued…
[2 marks]
-4- KIT308 Multicore Architecture and Programming
Question 6. (Addresses ILO 2)
When sending data to the function that acts as the main function for a thread that is
created by the Windows function CreateThread, how are the arguments passed?
a. Viashareddatathatthethreadstartfunctionautomaticallyknowshowto
access
b. UsingaC++template
c. By using multiple arguments each representing a single piece of data
d. Inastructure/classthatispassedviaavoidpointerthatmustbecast back to a pointer to its original type before the data can be used
e. Bystoringallthevaluesinanarray,andpassingthatintothefunction, and then deconstructing the array
[2 marks]
Question 7. (Addresses ILO 1)
Which of the following is a possible reason why using the Structure of Arrays (SoA) approach can improve efficiency over the more traditional Array of Structures (AoS) approach?
a. Betterlocalityofreference
b. Amorelogicallayoutofdatafromprogrammers
c. Reduction of the number of executed instructions
d. Reductionofthenumberofbranchesmispredicted
e. Alloftheabove
[2 marks]
Question 8. (Addresses ILO 2)
Which of the following SSE code snippets creates a vector containing four single-precision
(32-bit) floating-point values each with the value 3.0f?
a. _mm_set_epi32(3.0f, 3.0f, 3.0f, 3.0f) b. _mm_set_pd(3.0f, 3.0f, 3.0f, 3.0f)
c. _mm_load_ps(3.0f)
d. _mm_set1_ps(3.0f)
e. _mm_setr_ps(3.0f)
Continued…
[2 marks]
-5- KIT308 Multicore Architecture and Programming
Question 9. (Addresses ILO 2)
What is the value of result after execution of the following SSE code?
__m128 a = _mm_set_ps(1.0f, 3.14f, 6 * 9f, -0.5f);
__m128 b = _mm_set_ps(-1.0f, 3.1415f, 42f, -0.5f);
__m128 result = _mm_cmpgt_ps(a, b);
a. { false, true, false, false }
b. {1,0,1,1}
c. {0,1,0,0}
d. { 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0x00000000 } e. { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000}
Question 10. (Addresses ILO 2)
What is the value of result after execution of the following SSE code?
__m128 a = _mm_set_ps(1.0f, -0.5f, 3.1415f, 42.9f);
__m128i result = _mm_cvtps_epi32(a);
a. { 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }
b. {1,0,3,42}
c. {1,-1,3,43}
[2 marks]
Continued…
d. result will contain the result of treating the bits of the floating-point values as integers without trying to numerically convert them
e. Thesecondinstructionwouldcauseatypeerror
[2 marks]
-6- KIT308 Multicore Architecture and Programming
SECTION B – SHORT ANSWER QUESTIONS
Answer any EIGHT (8) of Questions 11 through 20 (inclusive). Each is worth 5 marks. This section is worth 40 marks, or 33.3% of the examination.
Question 11. Pipelined CPU Architectures (Addresses ILOs 1 and 4)
Explain in your own words, the architectural features of a pipelined CPU that lead to
increased performance and the limitations of these features.
[5 marks]
Question 12. Superscalar CPU Architectures (Addresses ILOs 1 and 4)
Explain in your own words, the architectural features of a superscalar CPU that lead to
increased performance and the limitations of these features.
[5 marks]
Question 13. Unified Shader GPU Architectures (Addresses ILOs 1 and 4)
Explain in your own words, the architectural features of a unified shader architecture GPU
that lead to increased performance and the limitations of these features.
[5 marks]
Question 14. Memory Access (Addresses ILOs 1 and 4)
Describe the most costly sequence of events that could occur on a CPU with two levels of
cache when reading from a memory location for the first time.
Question 15. Branches (Addresses ILOs 1 and 4)
[5 marks]
Why are conditional branches problematic in modern CPU and GPU architectures? Describe different techniques that are used to reduce the effect of branches on such architectures.
[5 marks]
Question 16. Loop Unrolling (Addresses ILOs 1 and 4)
Describe the technique of loop unrolling and explain how this can improve the efficiency of programs. What features of modern CPU architectures are supported by this technique?
Continued…
[5 marks]
-7- KIT308 Multicore Architecture and Programming
Question 17. Context (Addresses ILOs 1 and 4)
Describe the concept of a processor context switch in detail. Explain why context switches are treated differently on CPUs and GPUs. What effect does this have on latency and throughput of calculations?
[5 marks]
Question 18. Explicit Memory Management (Addresses ILOs 2 and 4)
What makes the following type of datastructure problematic to transfer from the CPU to GPU for use in an OpenCL program? Given that the variable branchyTextCPU was initialized to contain 1000 strings with links between them, describe (either in plain English or pseudocode) what steps would be required to fully transfer a copy of the text tree to the GPU.
struct TextTree
{
char* text;
TextTree* left;
}; TextTree* right;
TextTree branchyTextCPU;
Question 19. Translating AoS to SoA (Addresses ILOs 2 and 4) Rewrite the following code fragment using Structures of Arrays (SoA).
struct PointColour
{
float x;
float y;
float z;
}; unsigned char colour[3];
PointColour pcs[TOTAL];
for (int i = 0; i < TOTAL; ++i)
[5 marks]
{
pcs[i].x *= pcs[i].x;
pcs[i].z = -pcs[i].y;
pcs[i].colour[1] = (pcs[i].colour[0] + pcs[i].colour[2]) & 0x0F;
}
[5 marks]
Continued...
-8- KIT308 Multicore Architecture and Programming
Question 20. SIMD (Addresses ILOs 2 and 4)
Rewrite the following code using AVX SIMD instructions and ensure it contains no
branches by use of intrinsics for comparison and selection.
float a[8], b[8], c[8], e[8];
for (unsigned int i = 0; i < 8; i++)
if (a[i] >= b[i])
e[i] = b[i];
elseif (a[i] < c[i]) e[i] = a[i];
else[i] = c[i];
[5 marks]
Continued...
-9- KIT308 Multicore Architecture and Programming
SECTION C – LONG ANSWER QUESTIONS
There are THREE (3) questions in Section C, attempt ALL questions. Each question is worth 20 marks. Section C is worth 60 marks in total, or 50% of the examination.
Question 21. Multithreaded Programming (Addresses ILOs 2 and 3)
Write a multithreaded program that calculates the colour histogram for a 2D RGB image of size 1000x1000. Colour data is stored with a single byte per channel, as in the following definition:
struct Colour {
}; unsigned char channels[3];
A colour histogram requires calculating the frequency of each different numerical value for each colour channel in an image, i.e. the program should calculate and store how often each of the 256 possible values of each colour channel (an unsigned char) appears in the original colour array. For example, if the value 0xFF (255) appears 123456 times in the zeroth channel (i.e. the “red” channel), 1000000 times in the first channel (i.e. the “green” channel), and 0 times in the second channel (i.e. the “blue” channel), the value at index 0xFF in the output arrays should be 123456, 1000000, and 0 respectively.
Your program should accept a single command-line argument to specify the number of threads to execute with, e.g. to execute the program with 13 threads it would be run with:
Q21.exe 13
Despite the relative simplicity of the calculation task, your program should allocate tasks to the threads dynamically via the use of a shared memory location.
Your program should consist of:
a. A data structure to hold all the necessary parameters for each thread’s
execution and the shared memory for thread synchronization.
[5 marks]
b. The main function to read the command line argument and set up the data for, create, and manage the threads. You can assume the Colour array is pre- initialised with some data and you don’t have to detect or handle errors in this function. This function should also collate the results from the threads to have an individual frequencies array per channel.
[10 marks]
c. A thread start routine that converts its argument, synchronizes via shared
Continued...
memory, and performs the colour histogram calculation.
[5 marks]
-10- KIT308 Multicore Architecture and Programming
Question 22. SIMD Programming (Addresses ILOs 2 and 3)
The following C function performs a simple blur (each new pixel value is calculated by applying a mask calculation to itself and the surrounding pixels) on a width × height image where each pixel is a floating-point value. (Note: pixels on the edge of the image are not calculated and can be given any value).
void blur(float* in, float* out, int width, int height)
{
float vertical_total[width];
for (int y = 1; y < height - 1; y++)
{
for (int x = 0; x < width; x++)
{
vertical_total[x] = in[x + (y - 1) * width] +
in[x + y * width] + in[x + (y + 1) * width];
}
for (int x = 1; x < width - 1; x++)
{
out[x] = (vertical_total[x + 1] + vertical_total[x] +
vertical_total[x - 1]) / 9;
}
} }
Rewrite the blur function to use AVX (256-bit) vector types for its array parameters and AVX SIMD mathematical operations and intrinsics in its body.
(Hint: using some combination of the _mm256_shuffle_ps, permutevar8x32, and _mm256_or_ps intrinsics is the most effective approach for the second loop over x).
[20 marks]
Continued...
-11- KIT308 Multicore Architecture and Programming
Question 23. OpenCL Programming (Addresses ILOs 2 and 3)
Given the following function calculation to perform a calculation of a cubic region of a
procedural wood texture, write an OpenCL program to perform the same calculation.
typedef struct vec3
{
union { float x; float r; };
union { float y; float g; };
union { float z; float b; };
} vec3;
vec3 operator*(float a, vec3 v)
{
return { a * v.x, a * v.y, a * v.z };
}
Colour wood(vec3 point, float scale, vec3 col1, vec3 col2)
{
vec3 p = scale * point;
p = { p.x * cosf(p.y * 0.996f) * sinf(p.z * 1.023f),
cosf(p.x) * p.y * sinf(p.z * 1.211f),
cosf(p.x * 1.473f) * cosf(p.y * 0.795f) * p.z };
int which = int(floorf(sqrtf(p.x*p.x + p.y*p.y + p.z*p.z))) & 1;
return (which ? col1 : col2);
}
void calculation(unsigned int xSize, unsigned int ySize,
unsigned int zSize, float* out)
{
vec3 blue = { 0.25f, 0.25f, 1.0f };
vec3 yellow = { 1.0f, 1.0f, 0.5f };
for (unsigned int x = 0; x < xSize; x++)
{
for (unsigned int y = 0; y < ySize; y++)
{
for (unsigned int z = 0; x < zSize; x++)
{
vec3 point = { x, y, z };
*out++ = wood(point, 5f, blue, yellow);
}
} }
}
Your program should perform the calculation for a 256x256x256 buffer and consist of two files:
a. A CPU program that creates the context for execution, builds the OpenCL program, creates the output buffer, distributes work appropriately, and gathers the results. Note: you do not have to detect or handle errors in this part.
b. An OpenCL program that performs a single calculation step.
[10 marks] [10 marks]
Continued...
-12- KIT308 Multicore Architecture and Programming
SSE / AVX Intrinsics Reference Table (Subset):
__m128 _mm_set1_ps(float a)
res[0..3] = a
__m128 _mm_set_ps(float a, float b, float c, float d)
res[0] = d
res[1] = c
res[2] = b
res[3] = a
__m128 _mm_setr_ps(float a)
res[0] = a
res[1] = b
res[2] = c
res[3] = d
__m128 _mm_load_ps(float* mem)
res[n] = mem[n]
void _mm_store_ps(float* mem, float a)
mem[n] = a[n]
__m128i _mm_castps_si128(__m128 a)
res[n] = *((int*) &a[n])
__m128i _mm_cvtps_epi32(__m128 a)
res[n] = (int) a[n]
__m128 _mm_cmpge_ps(__m128 a, __m128 b)
res[n] = a[n] >= b[n]
__m128 _mm_cmpgt_ps(__m128 a, __m128 b)
res[n] = a[n] > b[n]
__m128 _mm_cmple_ps(__m128 a, __m128 b)
res[n] = a[n] <= b[n]
__m128 _mm_cmplt_ps(__m128 a, __m128 b)
res[n] = a[n] < b[n]
__m128 _mm_cmpeq_ps(__m128 a, __m128 b)
res[n] = a[n] == b[n]
__m256 _mm256_add_ps(__m256 a, __m256 b)
res[n] = a[n] + b[n]
__m256 _mm256_sub_ps(__m256 a, __m256 b)
res[n] = a[n] - b[n]
__m256 _mm256_mul_ps(__m256 a, __m256 b)
res[n] = a[n] * b[n]
__m256 _mm256_or_ps(__m256 a, __m256 b)
res[n] = a[n] | b[n]
__m256 _mm256_and_ps(__m256 a, __m256 b)
res[n] = a[n] & b[n]
__m256 _mm256_andnot_ps(__m256 a, __m256 b)
res[n] = (!a[n]) & b[n]
__m256 _mm256_permutevar8x32_ps(__m256 a, __m256i idx)
res[n] = a[idx[n]]
__m256 _mm256_shuffle_ps(__m256 a,
__m256 b, const int imm8)
treating 8-bit imm8 as an array of four 2-bit values:
res[0] = a[imm[0]]
res[1] = a[imm[1]]
res[2] = b[imm[2]]
res[3] = b[imm[3]]
res[4] = a[imm[0] + 4]
res[5] = a[imm[1] + 4]
res[6] = b[imm[2] + 4]
res[7] = b[imm[3] + 4]
__m256 _mm256_cmp_ps(__m256 a, __m256 b, const int imm8)
res[n] = a[n] OP b[n]
where OP is determined by value of imm8,
e.g.
_CMP_EQ_0Q: == (equality) _CMP_GE_0Q: >= (greater than or equal) _CMP_GT_0Q: > (greater than)
Continued…
-13- KIT308 Multicore Architecture and Programming
OpenCL Initialisation Function Reference Table:
cl_int clGetPlatformIDs(
cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)
Obtain a list of platforms
cl_int clGetDeviceIDs(
cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id *devices,
cl_uint *num_devices)
Obtain a list of devices
cl_context clCreateContext(
const cl_context_properties *props, cl_uint num_devices, const cl_device_id *devices,
void (*pfn_notify)(const char *errinfo,
const void *private_info, size_t cb, void *user_data),
void *user_data, cl_int *errcode_ret)
Creates an OpenCL context
cl_command_queue clCreateCommandQueue(
cl_context context, cl_device_id device, cl_command_queue_properties properties, cl_int *errcode_ret)
Creates a command queue on a specific device
cl_mem clCreateBuffer(
cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)
Creates a memory buffer object
cl_int clEnqueueReadBuffer(
cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
Enqueue command to read from a buffer
cl_int clEnqueueWriteBuffer(
cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write, size_t offset, size_t cb, const void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
Enqueue command to write to a buffer
cl_program clCreateProgramWithBinary(
cl_context context, cl_uint num_devices,
const cl_device_id *device_list, const size_t *lengths, const unsigned char **binaries, cl_int *binary_status, cl_int *errcode_ret)
Creates a program object for a context
cl_int clBuildProgram(
cl_program program, cl_uint num_devices,
const cl_device_id *device_list, const char *options, void (*pfn_notify)(cl_program, void *user_data), void *user_data)
Builds a program executable from the program sources or binary
cl_int clCreateKernelsInProgram(
cl_program program, cl_uint num_kernels, cl_kernel *kernels, cl_uint *num_kernels_ret)
Creates a kernel object
cl_int clSetKernelArg(
cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)
Set an argument value for a kernel
cl_int clEnqueueNDRangeKernel(
cl_command_queue command_queue, cl_kernel kernel, cl_uint work_dim, const size_t *global_work_offset, const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list, cl_event *event)
Enqueue command to execute a kernel on a device
Continued…
-14- KIT308 Multicore Architecture and Programming
Student Number: _____________________________
Your selection must be written as a BLOCK CAPITAL in the box provided for each question. For example:
Question A1
A
Question A1
Question A2
Question A3
Question A4
Question A5
Question A6
Question A7
Question A8
Question A9
Question A10