
UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS
INFR11170 ADVANCED PARALLEL TECHNIQUES
Monday 4 th May 2020
13:00 to 15:00


Answers must be submitted to Turnitin by 16:00
INSTRUCTIONS TO CANDIDATES
1. Note that ALL QUESTIONS ARE COMPULSORY.
2. EACH QUESTION IS WORTH 25 MARKS. Different sub-questions may have different numbers of total marks. Take note of this in allocating time to questions.
3. THIS EXAMINATION IS AN OPEN-BOOK ASSESSMENT. You may refer to material from your notes, course material, or beyond to assist you. You should not copy any text or images into your answer, however, as your answer must remain your own work. If you refer to material from outside the course it must be referenced properly.
4. Please refer to guidance in Learn under Examination Information if you have any difficulties.
EPCC Courses
Convener: Dr Vijay Nagarajan External Examiner: Prof Matt Probert
THIS EXAMINATION WILL BE MARKED ANONYMOUSLY

1. Describe the main differences between the OpenMP C++ API and the C++ std::thread API.
Consider the following code fragment which tests the Goldbach conjecture for the even numbers up to N, and is parallelised using OpenMP:
int counterex = 0;
#pragma omp parallel for reduction(+:counterex) schedule(static,1)
for (int num = 4; num <= N; num += 2) {
    bool result = false;
    for (int i = 2; i <= num/2; ++i) {
        if (isprime(i) && isprime(num-i)) {
            result = true;
        }
    }
    if (!result) counterex++;
}
bool goldbach = {counterex == 0};

Show how you would implement the equivalent code in C++ using std::thread, std::atomic and a lambda expression. (You will not be penalised for minor syntax errors.)

The parallel STL provides the algorithm std::transform_reduce, which can be used to apply a function to all the elements in a range, and then reduce the results to a single value using a given associative operator, subject to an execution policy. Describe how you could use this algorithm to implement the code fragment above (you are not expected to write code), and discuss the advantages and disadvantages of this approach.

[12 marks]

Page 1 of 6

2. (a) NVIDIA GPU accelerators contain multiple types of memory in addition to the host system's main memory. Complete a copy of the following table describing these types, as available through the CUDA language and APIs.

Type | Size | Latency | Lifetime | Access
---- | ---- | ------- | -------- | ------
     | Large (16 GB) | Poor (~300 cycles) | Managed by programmer on host via cudaMalloc and cudaFree |
     | Up to 96 kB per SM | Good (1 to 100s of clock cycles) | | All threads in block

A programmer has written the following CUDA fragment to offload the application of a square stencil to a two-dimensional array of data (of size NI by NJ).

constexpr int RADIUS = 2;
constexpr int SIZE = 2*RADIUS + 1;
constexpr int DATA_SIZE = NI * NJ * sizeof(double);

__global__ void apply_stencil(const double* in, const double* stencil,
                              double* out) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    double ans = 0.0;
    if (i < RADIUS || i >= (NI-RADIUS) || j < RADIUS || j >= (NJ-RADIUS)) {
        ans = in[i*NJ + j];
    } else {
        for (int di = -RADIUS; di <= RADIUS; ++di) {
            for (int dj = -RADIUS; dj <= RADIUS; ++dj) {
                ans += stencil[(di+RADIUS)*SIZE + (dj+RADIUS)]
                     * in[(i+di)*NJ + (j+dj)];
            }
        }
    }
    out[i*NJ + j] = ans;
}

[Question continues overleaf]

Page 2 of 6

void gpu_solve(const double* in, const double* stencil, double* out) {
    double* device_old;
    cudaMalloc(&device_old, DATA_SIZE);
    cudaMemcpy(device_old, in, DATA_SIZE, cudaMemcpyHostToDevice);

    double* device_new;
    cudaMalloc(&device_new, DATA_SIZE);

    dim3 TPB{16, 16, 1};
    dim3 BPG{(NJ-1)/TPB.x + 1, (NI-1)/TPB.y + 1, 1};

    for (int i = 0; i < NITER; ++i) {
        apply_stencil<<<BPG, TPB>>>(device_old, stencil, device_new);
        cudaDeviceSynchronize();
        std::swap(device_old, device_new);
    }
    std::swap(device_old, device_new); // Undo last swap

    cudaMemcpy(out, device_new, DATA_SIZE, cudaMemcpyDeviceToHost);
    cudaFree(device_old);
    cudaFree(device_new);
}
(b) Explain, with reference to the hardware, why the programmer has chosen to map the CUDA x-direction to their j index and CUDA y-direction to i, rather than vice-versa.
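By way of illustration (a Python aside, not part of the question): in a row-major array consecutive j indices are adjacent in memory, which is why mapping consecutive threadIdx.x values to j lets the hardware coalesce their loads. The strides of a small NumPy array make the layout concrete:

```python
import numpy

# Row-major (C-order) layout: 4 rows ("i") of 8 float64 columns ("j").
a = numpy.zeros((4, 8))

# Stepping j moves one 8-byte element; stepping i jumps a whole row,
# 8 * 8 = 64 bytes.  Consecutive j accesses are therefore contiguous.
row_stride, col_stride = a.strides
```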
(c) The programmer wants to improve performance by using constant memory to store the stencil coefficients. Rewrite this code to do so. Minor syntax and spelling errors will not be penalised.
(d) The programmer further wants to improve performance by reading a tile of the input into CUDA shared memory before applying the stencil. Explain, in words, how this can be implemented, what steps they will have to take to ensure correctness, and why (with reference to the hardware) this is likely to improve performance.
Page 3 of 6

3. (a) PyPy and Numba both perform just-in-time (JIT) compilation, but in different ways. Explain how they differ, stating clearly what is being compiled in each case, and when.
Consider the following Python functions, in which a and b are NumPy arrays, a being two-dimensional of size N × N and b being one-dimensional of size N:
def kernel(a, b):
    b += computeInterior(a, b)**2

def computeInterior(a, b):
    c = numpy.zeros_like(b)
    for i in range(1, b.size - 1):
        for j in range(1, b.size - 1):
            c[i] += a[i, j] * b[j]
    return c
(b) Identify the ufuncs that appear in kernel, and explain why these provide better performance than explicit for loops. Rewrite computeInterior to avoid explicit for loops and use NumPy functions instead.
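By way of illustration, one possible loop-free formulation (a sketch only): the interior rows of c are a matrix-vector product of the interior of a with the interior of b, which NumPy slicing expresses directly:

```python
import numpy

def computeInterior(a, b):
    # Boundary entries of c stay zero, matching the explicit-loop version;
    # the interior is a single matrix-vector product over rows/cols 1..N-2.
    c = numpy.zeros_like(b)
    c[1:-1] = a[1:-1, 1:-1] @ b[1:-1]
    return c
```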
(c) Numba can fuse multiple NumPy operations involving arrays when it compiles a function. Rewrite the above code, aiming for better performance, using Numba and explain the reasons you expect this code to be faster and use less memory than the version in part (b).
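As an illustrative sketch (assuming Numba is available; a plain-Python fallback is included so the fragment runs either way), a fused version might look like:

```python
import numpy

try:
    from numba import njit
except ImportError:
    njit = lambda f: f   # fallback: run unjitted if Numba is unavailable

@njit
def kernel(a, b):
    # The matvec and the squaring are fused into one compiled loop nest,
    # avoiding the temporaries NumPy would allocate for computeInterior(a, b)
    # and for its square.  A temporary c is still needed: the matvec must
    # read the *old* values of b, so b cannot be updated inside the i loop.
    n = b.size
    c = numpy.zeros_like(b)
    for i in range(1, n - 1):
        acc = 0.0
        for j in range(1, n - 1):
            acc += a[i, j] * b[j]
        c[i] = acc * acc
    b += c
```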
(d) What changes, if any, need to be made to your code in part (c) to enable it to run in parallel? If it is part of a larger Python code that uses Python threads, what happens if you release the global interpreter lock (GIL) for your code, and what issues could arise?
[10 marks]
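For illustration, the change to the part (c) sketch might look as follows (again with a fallback so it runs without Numba): prange marks the i iterations, which are independent, as parallelisable, and nogil=True lets the compiled function release the GIL when called from Python threads.

```python
import numpy

try:
    from numba import njit, prange
except ImportError:
    njit = lambda **kw: (lambda f: f)   # fallback decorator without Numba
    prange = range

@njit(parallel=True, nogil=True)
def kernel(a, b):
    n = b.size
    c = numpy.zeros_like(b)
    for i in prange(1, n - 1):   # independent iterations: safe to parallelise
        acc = 0.0
        for j in range(1, n - 1):
            acc += a[i, j] * b[j]
        c[i] = acc * acc
    b += c                       # elementwise update after the parallel loop
```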
Page 4 of 6

4. The following fragment of C code (see overleaf for an equivalent Fortran version) shows the key stages in a simple 1D cellular automaton traffic model where, at each iteration, the values of the new road new[i], i = 1,2,…, N, depend on the three values old[i-1], old[i] and old[i+1]. In this parallel OpenSHMEM version, halo cells old[0] and old[N+1] have been introduced into the old array.
// parallel OpenSHMEM traffic model (C version)
// array sizes are: int new[N] and int old[N+2]
// "up" and "down" are the two neighbouring PEs

for (iter = 1; iter <= maxiter; iter++) {

    shmem_barrier_all();

    shmem_put(&old[0],   &old[N], 1, up  );
    shmem_put(&old[N+1], &old[1], 1, down);

    shmem_barrier_all();

    // update the new road based on the old values
    // new[i], i = 1, 2, ..., N, depends on old[i-1], old[i] and old[i+1]
    updateroad(new, old, N);

    // copy back non-halo values for next iteration
    for (i = 1; i <= N; i++) {
        old[i] = new[i];
    }
}

[Question continues overleaf]

Page 5 of 6

! parallel OpenSHMEM traffic model (Fortran version)
! array sizes are: integer new(1:N) and integer old(0:N+1)
! "up" and "down" are the two neighbouring PEs

do iter = 1, maxiter

   call shmem_barrier_all()

   call shmem_put(old(0),   old(N), 1, up  )
   call shmem_put(old(N+1), old(1), 1, down)

   call shmem_barrier_all()

   ! update the new road based on the old values
   ! new(i), i = 1, 2, ..., N, depends on old(i-1), old(i) and old(i+1)
   call updateroad(new, old, N)

   ! copy back non-halo values for next iteration
   do i = 1, N
      old(i) = new(i)
   end do

end do

(a) OpenSHMEM requires that data accessed remotely is allocated in symmetric storage. Explain which array(s) this applies to and how it could be implemented. [5 marks]

(b) Explain the communications that occur from each of the two shmem_put calls, stating which array elements are transferred to/from which PEs. [5 marks]

(c) What does a call to shmem_barrier_all() achieve? Explain carefully why each of the two calls above is required. [7 marks]

(d) Would the code still be correct if the second call to shmem_barrier_all() was after the call to updateroad instead of before? Explain your answer. [4 marks]

(e) Would your answer to part (d) change if the communications were done using shmem_get() instead of shmem_put()? Explain your answer. [4 marks]

Page 6 of 6
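Although updateroad is not shown in the question, the overall scheme can be illustrated serially in a short sketch. Here the update rule is assumed to be the standard rule-184 traffic model (a car advances one cell per iteration if the cell ahead is empty), and the two halo assignments play the role of the shmem_put calls with periodic boundaries:

```python
import numpy

def updateroad(old):
    # Assumed rule-184 update: a car at i stays only if cell i+1 is occupied;
    # an empty cell i becomes occupied if there is a car at i-1.
    new = numpy.empty(old.size - 2, dtype=old.dtype)
    for i in range(1, old.size - 1):
        new[i - 1] = old[i + 1] if old[i] == 1 else old[i - 1]
    return new

def run(road, maxiter):
    # Serial analogue of the parallel loop: fill halo cells old[0] and
    # old[N+1], update, then copy the non-halo values back.
    N = road.size
    old = numpy.empty(N + 2, dtype=road.dtype)
    old[1:N+1] = road
    for _ in range(maxiter):
        old[0] = old[N]        # role of the first shmem_put (periodic)
        old[N+1] = old[1]      # role of the second shmem_put
        old[1:N+1] = updateroad(old)
    return old[1:N+1]
```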