Student ID number: ____________________
UNIVERSITY OF TASMANIA
EXAMINATIONS FOR DEGREES AND DIPLOMAS
October 2015
KIT308 Multicore Architecture and Programming Examiners: Ian Lewis
Time allowed: TWO (2) hours Reading Time: FIFTEEN (15) minutes
Instructions:
There are a total of ONE HUNDRED AND TWENTY (120) marks available. Attempt ALL TEN (10) questions from Section A; attempt EIGHT (8) questions from Section B; and attempt ALL THREE (3) questions from Section C.
Each question in Section A is worth TWO (2) marks, each question in Section B is worth FIVE (5) marks, and each question in Section C is worth TWENTY (20) marks.
Section A is worth TWENTY (20) marks overall, Section B is worth FORTY (40) marks overall, and Section C is worth SIXTY (60) marks overall.
The answers to questions 1–10 (Section A) must be clearly written on the answer sheet provided at the back of the paper. Answers to Section B and Section C must be written in the booklet provided.
Pages: 12 Questions: 23

SECTION A – MULTIPLE CHOICE QUESTIONS
Attempt ALL questions from Section A. Each question is worth 2 marks. This section is worth 20 marks, or 16.7% of the examination.
Question 1.
Which of the following is NOT a typical characteristic of RISC processors?
a. A single instruction size
b. Varying alignment of data
c. A small number of addressing modes
d. A large number of registers
e. A maximum of one load or store performed per instruction
[2 marks]
Question 2.
Which of the following is the best description of simultaneous multithreading?
a. Control is passed between threads and each thread is given a small amount of time to execute
b. Multiple threads are executed simultaneously within the same pipeline
c. Threads are executed simultaneously on multiple cores
d. Each thread executes until completion
e. None of the above
[2 marks]
Question 3.
Which of the following resources are often shared between cores in a typical multi-core architecture?
a. L1 cache
b. L3 cache
c. ALUs
d. MMU
e. Registers
[2 marks]
Question 4.
The following are the execution times of a multithreaded program (performing a task that can be split up into fully independent portions) run with varying numbers of threads:
Threads    Time (msecs)
1          1000
2          553
4          271
8          230
16         241
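As an aside, a minimal sketch of how timings like those in the table could be turned into speedup and parallel-efficiency figures, assuming the usual definitions speedup = T(1) / T(n) and efficiency = speedup / n. The names Timing, speedup, efficiency, fastest, and questionTimings are hypothetical helpers introduced here for illustration and are not part of the exam paper.

```cpp
#include <cassert>
#include <vector>

// One row of the timing table: thread count and measured time in msecs.
// (Hypothetical type, introduced only for this sketch.)
struct Timing {
    unsigned threads;
    double msecs;
};

// Speedup relative to the single-threaded run: T(1) / T(n).
double speedup(double baselineMsecs, double msecs) {
    return baselineMsecs / msecs;
}

// Parallel efficiency: speedup divided by the number of threads used.
double efficiency(double baselineMsecs, const Timing& t) {
    return speedup(baselineMsecs, t.msecs) / t.threads;
}

// Returns the row with the smallest measured time.
Timing fastest(const std::vector<Timing>& timings) {
    assert(!timings.empty());
    Timing best = timings.front();
    for (const Timing& t : timings) {
        if (t.msecs < best.msecs) best = t;
    }
    return best;
}

// The measurements from the table in Question 4.
std::vector<Timing> questionTimings() {
    return { {1u, 1000.0}, {2u, 553.0}, {4u, 271.0}, {8u, 230.0}, {16u, 241.0} };
}
```

For instance, speedup(1000.0, 553.0) evaluates to roughly 1.81 for the two-thread run; comparing how efficiency changes from row to row is one way to reason about where hardware resources stop scaling.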
Which of the following architectures is it most likely that the program is being run on?
a. A single-core system with 2-way simultaneous multithreading with SIMD instructions that pack 8 values into a single vector
b. A heterogeneous superscalar architecture with out-of-order scheduling
c. A four-core system with 2-way simultaneous multithreading
d. A homogeneous eight-core RISC-based system with L1/L2/L3 cache
e. A single-core system with 4-way simultaneous multithreading
[2 marks]
Question 5.
Cache coherency is required for which kind of architecture?
a. A pipelined RISC architecture
b. A unicore architecture
c. A multi-core architecture
d. A SIMD superscalar architecture
e. An SMT-capable superscalar architecture
[2 marks]
Question 6.
Which of the following is a description of a multi-core heterogeneous architecture?
a. A processor with each core having a different clock speed
b. A processor with multiple specialised cores
c. A processor with each core having multiple different specialised ALUs
d. A processor with each core having pipelines of varying lengths
e. A processor with each core having its own local memory
[2 marks]
Question 7.
Which of the following SSE code snippets creates a vector containing four single-precision (32-bit) floating-point values each with the value 3.0f?
a. _mm_set1_pd(3.0f)
b. _mm_set_ps(3.0f)
c. _mm_load_ps(3.0f)
d. _mm_set1_epi32(3.0f)
e. _mm_set_ps(3.0f, 3.0f, 3.0f, 3.0f)
[2 marks]
Question 8.
What is the value of result after execution of the following SSE code?
__m128 a = _mm_set_ps(-1.0f, -0.5f, 3.14f, 6 * 9.0f);
__m128 b = _mm_set_ps(1.0f, -0.5f, 3.1415f, 42.0f);
__m128 result = _mm_cmpge_ps(a, b);
a. { false, true, false, true }
b. { 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0x00000000 }
c. { 1, 0, 1, 0 }
d. { 0x00000000, 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF }
e. { 0, 1, 0, 1 }
[2 marks]
Question 9.
What is the value of result after execution of the following SSE code?
__m128 a = _mm_set_ps(1.0f, -0.5f, 3.1415f, 42.9f);
__m128i result = _mm_castps_si128(a);
a. { 1, 0, 3, 42 }
b. { 1, -1, 3, 42 }
c. { 1, -1, 3, 43 }
d. result will contain the result of treating each of the floating-point values as integers without trying to convert them
e. The second instruction would cause a type error
[2 marks]
Question 10.
What is the value of result after execution of the following OpenCL code?
float4 a = (float4)(1.0f, -0.5f, 3.1415f, 42.9f);
float4 result = 1.1f;
result.xw = a.yz;
a. { 1.0f, -0.5f, 1.1f, 1.1f }
b. { 1.1f, 1.1f, 1.1f, 1.1f }
c. { 1.0f, 1.1f, 1.1f, 42.9f }
d. { 1.1f, -0.5f, 3.1415f, 1.1f }
e. None of the above
[2 marks]
SECTION B — SHORT ANSWER QUESTIONS
Answer any EIGHT (8) of Questions 11 through 20 (inclusive). Each is worth 5 marks. This section is worth 40 marks, or 33.3% of the examination.
Question 11. Pipelined Architectures
Explain in your own words the difference between a non-pipelined and pipelined architecture.
[5 marks]
Question 12. Superscalar Architectures
Explain in your own words the difference between a pipelined and superscalar architecture.
[5 marks]
Question 13. GPU Architectures
Explain in your own words the difference between a typical multicore CPU architecture and a modern GPU architecture.
[5 marks]
Question 14. Microcode ROMs
What is the purpose of the microcode ROM on the Pentium 4 and later processors? What architectural design decision led to the inclusion of this ROM?
[5 marks]
Question 15. Dependencies
Consider the following pseudo-code program in the context of executing it on an out-of-order superscalar architecture:
R1 = R2 * R3
R2 = R4
R1 = R1 - R2
For each dependency present in this code:
• name the type of dependency;
• explain why the dependency affects execution; and
• suggest techniques for removing the dependency (if possible).
[5 marks]
Question 16. Memory Access
Describe the most costly sequence of events that could occur on a CPU when reading from a memory location for the first time.
[5 marks]
Question 17. Branches
Why are conditional branches problematic in pipelined architectures? Describe different techniques that are used to reduce the effect of branches on such architectures.
[5 marks]
Question 18. Loop Unrolling
Describe the technique of loop unrolling and explain how this can improve the efficiency of programs.
[5 marks]
Question 19. AoS Versus SoA
Rewrite the following code fragment using Structures of Arrays (SoA).
struct PointColour {
    double x, y, z;
    int colour[3];
};
PointColour pcs[TOTAL];
for (int i = 0; i < TOTAL; ++i) {
    pcs[i].x += pcs[i].y * 4;
    pcs[i].z = -pcs[i].y;
    pcs[i].colour[1] = pcs[i].colour[0] + pcs[i].colour[2];
}
[5 marks]
Question 20. SIMD
Rewrite the following code using SSE SIMD instructions and ensure it contains no branches by use of intrinsics for comparison and selection.
float a[4], b[4], c[4], e[4];
for (unsigned int i = 0; i < 4; i++)
    if (a[i] < b[i])
        if (a[i] == c[i])
            e[i] = a[i];
        else
            e[i] = c[i];
    else
        e[i] = b[i];
[5 marks]
SECTION C – LONG ANSWER QUESTIONS
Answer all THREE (3) of Questions 21 through 23 (inclusive). Each is worth 20 marks. This section is worth 60 marks, or 50% of the examination.
Question 21. Multithreaded Programming
Write a multithreaded program that converts all lowercase characters to uppercase (and leaves all other characters unchanged) in a ten-million-element array of unsigned chars. Your program should accept a single command-line argument to specify the number of threads to execute with, e.g. to execute the program with 13 threads it would be run with:
Q21.exe 13
Despite the simplicity of the calculation task, your program should allocate tasks to the threads dynamically via the use of a shared memory location.
Your program should consist of:
a. A data structure to hold all the necessary parameters for each thread's execution and the shared memory for thread synchronization.
[5 marks]
b. The main function to read the command-line argument and set up the data for, create, and manage the threads. You can assume the char array is pre-initialised with some data and you don't have to detect or handle errors in this function.
[10 marks]
c. A thread start routine that converts its argument, synchronizes via shared memory, and performs the uppercase conversion.
[5 marks]
Question 22. SIMD Programming
The following scalar code performs a 4x4 matrix multiplication:
typedef float Matrix[4][4];
void matrix_mul(Matrix& a, const Matrix& b, const Matrix& c) {
    for (int i = 0; i < 4; i++) {
        for (int row = 0; row < 4; row++) {
            for (int col = 0; col < 4; col++) {
                if (i == 0)
                    a[row][col] = b[row][i] * c[i][col];
                else
                    a[row][col] += b[row][i] * c[i][col];
            }
        }
    }
}
Redefine the matrix type to use an appropriate SSE SIMD vector type and rewrite the code using this type and SSE SIMD intrinsics.
Hints:
• After redefining the Matrix type, carefully consider which loop(s) will become unnecessary through the SIMD conversion process.
• You may need to access single scalar values from within a SIMD vector; if so, it's fine to use the fields defined in the Visual Studio union types.
• When removing the conditional expression from the innermost loop, carefully consider what would be the best approach.
[20 marks]
Question 23. GPGPU Programming
Given the following function, calculation, which performs some unspecified calculation, write a GPGPU program to perform the same calculation.
float oneCalculation(unsigned int xPos, unsigned int yPos, unsigned int width, unsigned int height) {
    ... do some complicated calculation ...
    return ... result of complicated calculation ...;
}
void calculation(unsigned int width, unsigned int height, float* out) {
    for (unsigned int x = 0; x < width; x++) {
        for (unsigned int y = 0; y < height; y++) {
            *out++ = oneCalculation(x, y, width, height);
        }
    }
}
Your program should perform the calculation for a 256x256 buffer and consist of two files:
a. A CPU program that creates the context for execution, builds the OpenCL program, creates the output buffer, distributes work appropriately, and gathers the results. Note: you do not have to detect or handle errors in this part.
[15 marks]
b. An OpenCL program that performs a single calculation step.
[5 marks]

Student Number: _____________________________
Your selection must be written as a BLOCK CAPITAL in the box provided for each question. For example:
Question A1    A
Question A1
Question A2
Question A3
Question A4
Question A5
Question A6
Question A7
Question A8
Question A9
Question A10