CS代考 COMP322101 Module Title: Parallel Computation School of Computing

Module Code: COMP322101 Module Title: Parallel Computation School of Computing
Examination Information
– There are 10 pages to this exam.
– Answer all 2 questions.

Copyright By PowCoder代写 加微信 powcoder

©c UNIVERSITY OF LEEDS Semester 2 2019/2020
– The total number of marks for this examination paper is 100.
– This exam is worth approximately 58.82% of the final module grade.
– The number in brackets [ ] indicates the marks available for each question or part question.
– It is expected that it will take approximately 4 hours to complete this exam.
– Your submission should be less than 5000 words in total.
– It is possible to receive full marks using less than 3000 words in total.
– You may lose marks for including extensive discussions that are not relevant to the question.
– You will not receive any marks, and may lose marks, for simply copying material from your notes verbatim.
– You are reminded of the need for clear presentation in your answers.
– You are required to submit a single PDF file via the Minerva submission point.
– Support for the examination will be available via the module Yammer group.
Page 1 of 10 Turn the page over

Module Code: COMP322101 Question 1
This question concerns a parallel implementation of a cellular automoton called “Conway’s game of life”. The rules for this game are very simple. There is an N × N array of integers (i.e. a two-dimensional array) for which each entry in the array has a value that is either 1 (we say the entry “has life”) or 0 (“no life”). The game consists of a (possibly infinite) sequence of iterations from some initial configuration. At each iteration the following rules are applied to determine the value of each array entry after the iteration:
• If the entry has a value of 1 and either two or three (not more and not less) of the surrounding eight entries have a value of 1, then the entry has a value of 1 at the end of the iteration;
• If the entry has a value of 0 and exactly three of the eight surrounding entries have a value of 1, then the entry has a value of 1 at the end of the iteration;
• Otherwise the entry has a value of 0 at the end of the iteration. The diagram below shows three consecutive iterations for N = 6:
000000 000000 000000 001000 001100 001100 001100 001100 010010 000100 001100 001100 000000 000000 000000 000000 000000 000000
Some sequential code, for which we fix the values of the arrays u and unew in their first and last rows and columns, is shown below.
// The code below is repeated for each iteration of Game of Life for (i=1; i scratch[gid+1] ){
float temp = scratch[gid];
scratch[gid] = scratch[gid+1];
scratch[gid+1] = temp;
// Copy back to global memory. Ensure sorting completed first. barrier(CLK LOCAL MEM FENCE);
array[gid] = scratch[gid];
(a) Inspect the code for this kernel, and then answer the following questions.
(i) What does the descriptor local before the argument scratch in line 2 denote, and why is this a suitable choice for this kernel? Where is the memory for scratch allocated, and how large should the memory pointed to by scratch be?
(ii) Somebody new to GPU programming looks at the kernel code and is confused by line 8. They understand it is copying the array to the memory pointed to by scratch, but do not understand why it is not in a loop. Explain why line 8 does indeed copy the full array despite not being in a loop.
Page 7 of 10 Turn the page over

Module Code: COMP322101
(iii) What function does the barrier command on line 9 perform in the context of this kernel?
(iv) On testing it is found that the sort is sometimes performed correctly, but sometimes fails. Identify the line(s) in the code where the problem lies, and explain why it produces this non-deterministic behaviour. How would you solve this problem? You do not need to provide code or pseudo-code as long as your description is clear.
(b) The bubblesort described in part (a) is limited to arrays no larger than the maximum work group size T supported by the hardware. Now suppose that you want to extend the algorithm to work for array sizes R greater than this, i.e. R > T . Someone suggests that you adapt your host code to make multiple calls to a modified kernel.
(i) Why do you think it was recommended to make multiple kernel calls?
(ii) Do you agree it is impossible to perform bubblesort for large problems R > T using a single kernel, and even if it could be achieved, would you recommend doing so? Explain your answer.
(iii) The host code is now altered to call a modified kernel multiple times to perform the full sort for R > T . During each kernel call, sections of the array of size 2T are sorted using an algorithm which is an extension of that given in part (a). The index ranges for these small array sections alternate between kernel calls, and by calling the kernel a sufficient number of times, the full list becomes sorted. The first few kernel calls and the alternating index ranges are shown in the diagram below for R = 7T, where the grey squares denote array sections of size T, and the crossed arrows denote sorting over 2T array elements. How many times must the host code enqueue the kernel to ensure the data is fully sorted? Explain your answer. You may assume R is divisible by T , i.e. R%T==0. Is your answer different for odd and even values of R/T ?
Page 8 of 10 Turn the page over

Module Code: COMP322101
Kernel Kernel Kernel
(iv) With this new implementation, is there a maximum array size R that can be sorted? Explain you answer.
(c) Returning to the original kernel of part (a), identify three sources of parallel overhead in this code, giving line number(s). For each, explain why the introduction of the overhead was necessary for this specific problem, and whether or not you expect it to be a significant overhead.
(d) Suppose that you now implement an OpenMP version of bubblesort, and you have verified it correctly works on a multi-core CPU. The same implementation also works in serial, i.e. for p = 1 core. To determine the parallel scaling of this OpenMP bubblesort, you perform some timing runs.
(i) You measure the time to sort a list of size R = 1000 in serial, and find that your implementation takes ts = 1 ms. You then sort a list of size R = 4000 using p = 4 cores, and find the execution time has dropped to tp = 0.3 ms. Finally, you measure thetimetosortalistofsizeR=8000onp=8cores,andnowfindtp =0.2ms. What is the speedup S and efficiency E for both p = 4 and p = 8?
(ii) Do the timing runs in part (i) correspond to weak or strong scaling? Explain your answer.
(iii) It is suggested that a fraction f = 0.2 of the serial bubblesort has not been parallelised in the OpenMP implementation. What is the maximum speedup S for this f for both of the cases given in (i) above, according to Amdhal’s law, and again for the Gustafson-Barsis law? By comparing these predictions to the actual measurements
from (i), what can you say about the hypothesis that f = 0.2?
Page 9 of 10 Turn the page over

Module Code: COMP322101
(iv) Suppose that the time for the GPU version to execute for a problem size R = 128 is tp = 0.1 ms using 128 work items or threads. Can you calculate the speedup S for this case?
[Question 2 Total: 50 marks] [Grand Total: 100 marks]
Page 10 of 10 End

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com