程序代写代做代考 cache GPU C kernel data structure Student ID No: _________________

Student ID No: _________________
UNIVERSITY OF TASMANIA
Pages: 13 Questions: 23
EXAMINATIONS FOR DEGREES AND DIPLOMAS June–July 2019
KIT308 Multicore Architecture and Programming
KIT408 Advanced Multicore Architecture and Programming
First and Only Paper
Deferred and Supplementary Examination
Instructions:
Examiners: Ian Lewis
Time Allowed: TWO (2) hours Reading Time: FIFTEEN (15) minutes
There are a total of 120 marks available. Attempt ALL questions from Section A; attempt EIGHT (8) questions from Section B; and attempt ALL questions from Section C.
The answers to questions 1–10 (Section A) must be clearly written on the answer sheet provided at the back of the paper. Answers to Section B and Section C must be written in the booklet provided.

-2- KIT308 Multicore Architecture and Programming
SECTION A – MULTIPLE CHOICE
Attempt ALL questions from Section A. Each question is worth 2 marks. This section is worth 20 marks, or 16.7% of the examination.
Question 1. (Addresses ILO 1)
Which of the a. b. c.
d. e.
following is a description of simultaneous multithreading?
Threadsareexecutedsimultaneouslyonmultiplecores
Multiplethreadsareexecutedsimultaneouslywithinthesamepipeline
Control is passed between threads and each thread is given a small amount of time to execute
Eachthreadexecutesuntilcompletion Alloftheabove
[2 marks]
Question 2. (Addresses ILO 1)
Which of the following resources are often shared between cores in a typical multi-core
architecture?
a. L1datacache
b. L3cache c. ALUs
d. MMU
e. Registers
Question 3. (Addresses ILO 1)
Cache coherency is required for which kind of architecture?
a. ASIMDsuperscalararchitecture
b. ApipelinedRISCarchitecture
c. A unicore architecture
d. Amulticorearchitecture
e. ASMT-capablesuperscalararchitecture
[2 marks]
Continued…
[2 marks]

-3- KIT308 Multicore Architecture and Programming
Question 4. (Addresses ILO 1)
What is the purpose of a split L1 cache?
a. Toprovidesafetyofoperationsthroughsegmentation
b. Toreducecontentionformemoryaccess
c. To allow for simultaneous multithreading
d. Toallowforself-modifyingcode
e. Alloftheabove
Question 5. (Addresses ILO 1)
[2 marks]
The following are the execution times of a multithreaded program (performing a task that can be split up into fully independent portions) being run with a various number of threads:
Threads
Time (msecs)
1
1000
2
511
4
270
8
190
16
194
Which of the a. b. c.
d. e.
following architectures is it most likely that the program is being run on?
Afour-coresystemwith2-waysimultaneousmultithreading
Ahomogeneoussixteen-coreRISC-basedsystemwithL1/L2/L3cache
A single-core system with 2-way simultaneous multithreading with SIMD instructions that pack 4 values into a single vector
Asingle-coresystemwith4-waysimultaneousmultithreading Aheterogenoussuperscalararchitecturewithout-of-orderscheduling
[2 marks]
Continued…

-4- KIT308 Multicore Architecture and Programming
Question 6. (Addresses ILO 1)
Which of the a. b. c.
d. e.
following is a possible reason why loop unrolling can improve efficiency?
Reductionofthenumberofconditionalbranches
Reductionofthenumberofloopcounterdecrements
Increased opportunity for scheduling of independent instructions within the loop
Reductionofthenumberofloopcountertests Alloftheabove
[2 marks]
Question 7. (Addresses ILO 2)
Which of the following SSE code snippets creates a vector containing four single-precision
(32-bit) floating-point values each with the value 3.0f? a. _mm_set_ps(3.0f)
b. _mm_set_pd(3.0f, 3.0f, 3.0f, 3.0f) c. _mm_load_ps(3.0f)
d. _mm_set1_pd(3.0f)
e. _mm_set1_ps(3.0f)
Question 8. (Addresses ILO 2)
What is the value of result after execution of the following SSE code?
__m128 a = _mm_set_ps(-1.0f, -0.5f, 3.14f, 6 * 9f);
__m128 b = _mm_set_ps(1.0f, -0.5f, 3.1415f, 42f);
__m128 result = _mm_cmpge_ps(a, b);
a. { false, true, false, true }
b. {1,0,1,0}
c. {0,1,0,1}
d. { 0x00000000, 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF } e. { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }
[2 marks]
Continued…
[2 marks]

-5- KIT308 Multicore Architecture and Programming
Question 9. (Addresses ILO 2)
What is the value of result after execution of the following SSE code?
__m128 a = _mm_set_ps(1.0f, -0.5f, 3.1415f, 42.9f);
__m128i result = _mm_castps_si128(a);
a. {1,-1,3,42}
b. {1,0,3,42}
c. {1,-1,3,43}
d. result will contain the result of treating the bits of the floating-point values as integers without trying to numerically convert them
e. Thesecondinstructionwouldcauseatypeerror
Question 10. (Addresses ILO 2)
What is the value of result after execution of the following OpenCL code?
float4 a = (float4) (-0.5f, 1.0f, 42.9f, 3.1415f);
float4 result = 1.1f;
result.yz = a.xw;
a. { 1.0f, -0.5f, 1.1f, 1.1f }
b. { 1.1f, -0.5f, 3.1415f, 1.1f } c. { 1.1f, 1.1f, 1.1f, 1.1f }
d. { 1.0f, 1.1f, 1.1f, 42.9f }
e. Noneoftheabove
[2 marks]
Continued…
[2 marks]

-6- KIT308 Multicore Architecture and Programming
SECTION B – SHORT ANSWER QUESTIONS
Answer any EIGHT (8) of Questions 11 through 20 (inclusive). Each is worth 5 marks. This section is worth 40 marks, or 33.3% of the examination.
Question 11. Pipelined CPU Architectures (Addresses ILOs 1 and 4)
Explain in your own words, the architectural features of a pipelined CPU that lead to
increased performance and the limitations of these features.
[5 marks]
Question 12. Superscalar CPU Architectures (Addresses ILOs 1 and 4)
Explain in your own words, the architectural features of a superscalar CPU that lead to
increased performance and the limitations of these features.
[5 marks]
Question 13. Fixed-Function Pipeline GPU Architectures (Addresses ILOs 1 and 4) Explain in your own words, the architectural features of a fixed-function pipeline GPU that
lead to increased performance and the limitations of these features.
[5 marks]
Question 14. Unified Shader GPU Architectures (Addresses ILOs 1 and 4)
Explain in your own words, the architectural features of a unified shader architecture GPU
that lead to increased performance and the limitations of these features.
[5 marks]
Question 15. Memory Access (Addresses ILOs 1 and 4)
Describe the most costly sequence of events that could occur on a CPU with two levels of
cache when reading from a memory location for the first time.
Question 16. Branches (Addresses ILOs 1 and 4)
[5 marks]
Why are conditional branches problematic in modern CPU and GPU architectures? Describe different techniques that are used to reduce the effect of branches on such architectures.
Continued…
[5 marks]

-7- KIT308 Multicore Architecture and Programming
Question 17. AoS Versus SoA (Addresses ILOs 1 and 4)
What are the advantages and disadvantages of storing values as arrays of structures or as structures of arrays? What features of modern CPU architectures are supported by an SoA data layout?
[5 marks]
Question 18. Context (Addresses ILOs 1 and 4)
Describe the concept of a processor context switch in detail. Explain why context switches are treated differently on CPUs and GPUs. What effect does this have on latency and throughput of calculations?
Question 19. Translating AoS to SoA (Addresses ILOs 2 and 4) Rewrite the following code fragment using Structures of Arrays (SoA).
struct PointColour
{
float vector[3];
unsigned char red;
unsigned char green;
}; unsigned char blue;
PointColour pcs[TOTAL];
for (int i = 0; i < TOTAL; ++i) { pcs[i].vector[1] *= pcs[i].vector[0]; pcs[i].vector[2] = -pcs[i].vector[0]; pcs[i].red = (pcs[i].green + pcs[i].blue) & 0xFF; } [5 marks] Question 20. SIMD (Addresses ILOs 2 and 4) Rewrite the following code using AVX SIMD instructions and ensure it contains no branches by use of intrinsics for comparison and selection. float a[8], b[8], c[8], e[8]; for (unsigned int i = 0; i < 8; i++) if (a[i] == b[i]) e[i] = b[i]; elseif (a[i] > c[i]) e[i] = a[i];
else[i] = c[i];
[5 marks]
Continued…
[5 marks]

-8- KIT308 Multicore Architecture and Programming
SECTION C – LONG ANSWER QUESTIONS
There are THREE (3) questions in Section C, attempt ALL questions. Each question is worth 20 marks. Section C is worth 60 marks in total, or 50% of the examination.
Question 21. Multithreaded Programming (Addresses ILOs 2 and 3)
Write a multithreaded program that calculates the frequency of each different numerical value in a ten million element array of unsigned shorts, i.e. the program should calculate and store how often each of the 65536 possible values of a unsigned short appears in the original array. For example, if the value 10 appears 123 times, the tenth value in the output array should be 123.
Your program should accept a single command-line argument to specify the number of threads to execute with, e.g. to execute the program with 13 threads it would be run with:
Q21.exe 13
Despite the simplicity of the calculation task, your program should allocate tasks to the threads dynamically via the use of a shared memory location.
Your program should consist of:
a. A data structure to hold all the necessary parameters for each thread’s
execution and the shared memory for thread synchronization.
[5 marks]
b. The main function to read the command line argument and set up the data for, create, and manage the threads. You can assume the unsigned short array is pre-initialised with some data and you don’t have to detect or handle errors in this function. This function should also collate the results from the threads into a single frequencies array.
[10 marks]
c. A thread start routine that converts its argument, synchronizes via shared
Continued…
memory, and performs the frequency calculation.
[5 marks]

-9- KIT308 Multicore Architecture and Programming
Question 22. SIMD Programming (Addresses ILOs 2 and 3)
The following C function performs a Sobel vertical edge detection (each new pixel value is calculated by applying a mask calculation to itself and the surrounding pixels) on a
width × height image where each pixel is a floating-point value. (Note: pixels on the edge of the image are not calculated and can be given any value).
void detect_edge(float* in, float* out, int width, int height)
{
float vertical_total[width];
for (int y = 1; y < height - 1; y++) { for (int x = 0; x < width; x++) { vertical_total[x] = in[x + (y - 1) * width] + 2 * in[x + y * width] + in[x + (y + 1) * width]; } for (int x = 1; x < width - 1; x++) { out[x] = vertical_total[x + 1] - vertical_total[x - 1]; } } } Rewrite the detect_edge function to use AVX vector types for its array parameters and AVX SIMD mathematical operations and intrinsics in its body. (Hint: using some combination of the _mm256_shuffle_ps, permutevar8x32, and _mm256_or_ps intrinsics is the most effective approach for the second loop over x). [20 marks] Continued... -10- KIT308 Multicore Architecture and Programming Question 23. OpenCL Programming (Addresses ILOs 2 and 3) Given the following function calculation to perform a calculation of a cubic region of a procedural wood grain texture, write an OpenCL program to perform the same calculation. typedef struct vec3 { union { float x; float r; }; union { float y; float g; }; union { float z; float b; }; } vec3; vec3 operator*(float a, vec3 v) { return { a * v.x, a * v.y, a * v.z }; } float noise(vec3 point) { return /* magic noise calc... */; } Colour wood(vec3 point, float scale, vec3 col1, vec3 col2) { vec3 p = scale * point; float grain = noise(p) * 5; grain = grain - (int) grain; return (grain * col1) + (1.0f - grain) * col2; } void calculation(unsigned int xSize, unsigned int ySize, unsigned int zSize, float* out) { vec3 red = { 1f, 0.5f, 0.5f }; vec3 yellow = { 1f, 1f, 0.5f }; for (unsigned int x = 0; x < xSize; x++) { for (unsigned int y = 0; y < ySize; y++) { for (unsigned int z = 0; x < zSize; x++) { vec3 point = { x, y, z }; *out++ = wood(point, 5f, red, yellow); } } } } Your program should perform the calculation for a 256x256x256 buffer and consist of two files: a. A CPU program that creates the context for execution, builds the OpenCL program, creates the output buffer, distributes work appropriately, and gathers the results. Note: you do not have to detect or handle errors in this part. b. An OpenCL program that performs a single calculation step. [10 marks] [10 marks] Continued... -11- KIT308 Multicore Architecture and Programming SSE / AVX Intrinsics Reference Table (Subset): __m128 _mm_set1_ps(float a) res[0..3] = a __m128 _mm_set_ps(float a, float b, float c, float d) res[0] = d res[1] = c res[2] = b res[3] = a __m128 _mm_setr_ps(float a) res[0] = a res[1] = b res[2] = c res[3] = d __m128 _mm_load_ps(float* mem) res[n] = mem[n] void _mm_store_ps(float* mem, float a) mem[n] = a[n] __m128i _mm_castps_si128(__m128 a) res[n] = *((int*) &a[n]) __m128i _mm_cvtps_epi32(__m128 a) res[n] = (int) a[n] __m128 _mm_cmpge_ps(__m128 a, __m128 b) res[n] = a[n] >= b[n]
__m128 _mm_cmpgt_ps(__m128 a, __m128 b)
res[n] = a[n] > b[n]
__m128 _mm_cmple_ps(__m128 a, __m128 b)
res[n] = a[n] <= b[n] __m128 _mm_cmplt_ps(__m128 a, __m128 b) res[n] = a[n] < b[n] __m128 _mm_cmpeq_ps(__m128 a, __m128 b) res[n] = a[n] == b[n] __m256 _mm256_add_ps(__m256 a, __m256 b) res[n] = a[n] + b[n] __m256 _mm256_sub_ps(__m256 a, __m256 b) res[n] = a[n] - b[n] __m256 _mm256_mul_ps(__m256 a, __m256 b) res[n] = a[n] * b[n] __m256 _mm256_or_ps(__m256 a, __m256 b) res[n] = a[n] | b[n] __m256 _mm256_and_ps(__m256 a, __m256 b) res[n] = a[n] & b[n] __m256 _mm256_andnot_ps(__m256 a, __m256 b) res[n] = (!a[n]) & b[n] __m256 _mm256_permutevar8x32_ps(__m256 a, __m256i idx) res[n] = a[idx[n]] __m256 _mm256_shuffle_ps(__m256 a, __m256 b, const int imm8) treating 8-bit imm8 as an array of four 2-bit values: res[0] = a[imm[0]] res[1] = a[imm[1]] res[2] = b[imm[2]] res[3] = b[imm[3]] res[4] = a[imm[0] + 4] res[5] = a[imm[1] + 4] res[6] = b[imm[2] + 4] res[7] = b[imm[3] + 4] __m256 _mm256_cmp_ps(__m256 a, __m256 b, const int imm8) res[n] = a[n] OP b[n] where OP is determined by value of imm8, e.g. _CMP_EQ_0Q: == (equality) _CMP_GE_0Q: >= (greater than or equal) _CMP_GT_0Q: > (greater than)
Continued…

-12- KIT308 Multicore Architecture and Programming
OpenCL Initialisation Function Reference Table:
cl_int clGetPlatformIDs(
cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)
Obtain a list of platforms
cl_int clGetDeviceIDs(
cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id *devices,
cl_uint *num_devices)
Obtain a list of devices
cl_context clCreateContext(
const cl_context_properties *props, cl_uint num_devices, const cl_device_id *devices,
void (*pfn_notify)(const char *errinfo,
const void *private_info, size_t cb, void *user_data),
void *user_data, cl_int *errcode_ret)
Creates an OpenCL context
cl_command_queue clCreateCommandQueue(
cl_context context, cl_device_id device, cl_command_queue_properties properties, cl_int *errcode_ret)
Creates a command queue on a specific device
cl_mem clCreateBuffer(
cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)
Creates a memory buffer object
cl_int clEnqueueReadBuffer(
cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
Enqueue command to read from a buffer
cl_int clEnqueueWriteBuffer(
cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write, size_t offset, size_t cb, const void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
Enqueue command to write to a buffer
cl_program clCreateProgramWithBinary(
cl_context context, cl_uint num_devices,
const cl_device_id *device_list, const size_t *lengths, const unsigned char **binaries, cl_int *binary_status, cl_int *errcode_ret)
Creates a program object for a context
cl_int clBuildProgram(
cl_program program, cl_uint num_devices,
const cl_device_id *device_list, const char *options, void (*pfn_notify)(cl_program, void *user_data), void *user_data)
Builds a program executable from the program sources or binary
cl_int clCreateKernelsInProgram(
cl_program program, cl_uint num_kernels, cl_kernel *kernels, cl_uint *num_kernels_ret)
Creates a kernel object
cl_int clSetKernelArg(
cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)
Set an argument value for a kernel
cl_int clEnqueueNDRangeKernel(
cl_command_queue command_queue, cl_kernel kernel, cl_uint work_dim, const size_t *global_work_offset, const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list, cl_event *event)
Enqueue command to execute a kernel on a device
Continued…

-13- KIT308 Multicore Architecture and Programming
Student Number: _____________________________
Your selection must be written as a BLOCK CAPITAL in the box provided for each question. For example:
Question A1
A
Question A1
Question A2
Question A3
Question A4
Question A5
Question A6
Question A7
Question A8
Question A9
Question A10