留学生辅导 INFR11170 ADVANCED PARALLEL TECHNIQUES

INFR11170 ADVANCED PARALLEL TECHNIQUES
Monday 26 th April 2021
13:00 to 15:00
Answers must be submitted to Turnitin by 16:00

Copyright By PowCoder代写 加微信 powcoder

INSTRUCTIONS TO CANDIDATES
1. Note that ALL QUESTIONS ARE COMPULSORY.
2. EACHQUESTIONISWORTH25MARKS.Differentsub-questions may have different numbers of marks. Take note of this in allocating time to questions.
3. CALCULATORS MAY BE USED IN THIS EXAMINATION.
4. THIS EXAMINATION IS AN OPEN-BOOK ASSESSMENT. You may refer to material from your notes, course material, or beyond to assist you. You should not copy any text or images into your answer however as your answer must remain your own work. If you refer to material from outside the course it must be referenced properly.
5. THIS IS A REMOTE EXAMINATION. As stated in the Own Work Declaration, for the duration of the assessment plus one hour you must not communicate with any other person about your work either electronically or by word or sign, nor let your work be seen by any other person.
6. Please refer to guidance in Learn under Examination Information if you have any difficulties.
THIS EXAMINATION WILL BE MARKED ANONYMOUSLY

(a) Describe at a high level how one uses the CUDA programming model to run a calculation across the many cores of a single GPU.
The following function computes one iteration of a simple percolation model in serial.
// Inputs:
// M, N are the sizes of the arrays without halos.
// state and next are arrays with halos
// (size (M + 2) x (N + 2))
// Returns number of changed values.
int percolate_step(int M, int N, int const* state, int* next) {
int nchange = 0;
// Only update non-halo values. for (int i = 1; i <= M; ++i) { for (int j = 1; j <= N; ++j) { int const oldval = state[i*(N+2) + j]; int newval = oldval; // 0 => solid, so do nothing
if (oldval != 0) {
// Set new to be the maximum value of old[i][j] // and its four nearest neighbours.
newval = max(newval, state[(i-1)*(N+2) + j]); newval = max(newval, state[(i+1)*(N+2) + j]); newval = max(newval, state[(i)*(N+2) + j-1]); newval = max(newval, state[(i)*(N+2) + j+1]);
if (newval != oldval) {
++nchange;
next[(i)*(N+2) + j] = newval;
return nchange;
(b) Ignoring the requirement to compute the number of changes made, adapt this code to run using CUDA on an NVIDIA GPU and explain how you will obtain good performance. You will not be penalised for minor syntax and spelling errors.
(c) Explain, with reference to the hardware and programming model, how you might adapt your solution to accumulate the total number of updates with good performance.
Page 1 of 4
[15 marks]

Describe the limitations of typical GPU architectures that prevent the effi- cient implementation of traditional threading models on these devices, and how these limitations are reflected in the OpenMP offloading API.
(b) Discuss the advantages and disadvantages of using OpenMP instead of CUDA for programming GPU devices.
(c) Consider the following code fragment which represents the kernel of a 2- dimensional PDE solver, parallelised using a 1-dimensional decomposition and exchanging halo data with MPI_Sendrecv:
for (int step=0; step
OutputIt exclusive_scan(InputIt begin, InputIt end,
OutputIt out,
T identity, BinOp op) {
typename std::iterator_traits::value_type;
value_type tot = identity;
for (; begin != end; ++begin, ++out) {
*out = tot;
tot = op(tot, *begin);
return out; }
(b) Write a snippet of code that uses this function to compute the exclusive scan of an input std::vector where each element in the result (also a std::vector) is the product of the preceeding elements of the input.
QUESTION CONTINUES ON NEXT PAGE
Page 3 of 4
using value_type =

QUESTION CONTINUED FROM PREVIOUS PAGE
The algorithm can be parallelised by splitting the input range into blocks, as shown in the figure below for four blocks. An independent task can compute the exclusive_scan of each block and store the generalised sum of each block (note this includes the last element) into an auxilliary array. The exclusive_scan of these sub-totals can be computed and then each block’s entries can be updated by applying x = op(total, x) to every element.
Input array of values
Scan block 0 Scan block 1 Scan block 2 Scan block 3
Store block totals
Compute scan of block totals
Output array of scanned values
(c) Implement, with commentary, a function template
template
OutputIt exclusive_scan(int NTHREAD,
InputIt begin, InputIt end,
OutputIt out_begin,
T identity, BinOp op);
which performs this algorithm using the specified number of parallel threads, using the C++ threading library. Note:
• You may assume that both InputIt and OutputIt are random access iterators.
• You will not be penalised for minor syntax and spelling errors.
• Credit will be given for identifying any performance problems with your
implementation along with suggestions for fixing these.
(d) Comment briefly on how you might use the OpenMP API instead of C++ threads for this task, the relative ease of implementation, and any perfor- mance differences.
Page 4 of 4
[14 marks]

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com