CS代考 COMP5426 Parallel and Distributed Computing Final Exam Semester 1 2021

COMP5426 Parallel and Distributed Computing Final Exam Semester 1 2021
Question 1 (50 Marks)
(This section contains 15 short-answer questions.)
1) Give three methods for increasing the speed of a computer system. [3]

Copyright By PowCoder代写 加微信 powcoder

2) Classify MIMD computers based on memory organization and communication methods. [3]
3) Describe at least FOUR characteristics of shared-memory multiprocessors that distinguish them from distributed-memory multicomputers on the basis of hardware architecture, data sharing and communication/synchronization. [4]
4) Give a definition of cloud computing and briefly describe its main characteristics. [4]
5) What are the major performance metrics for characterizing an interconnection
network, and what are the main purposes of using these metrics? [4]
6) What is the execution time, in general, relate to the standard all-to-all reduction operation without broadcasting on a two-dimensional torus with a total number of compute nodes p. [4]
7) What is a non-blocking network? Give two examples. [3]
8) Parallel paths are a number of distinct paths collecting nodes A and B and these paths have no common nodes other than A and B. What is the maximum number of possible parallel paths between any two nodes in a d-dimensional hypercube? [4]
9) Will it take longer time to perform a one-to-all broadcast operation on a one- dimensional array of size than that on a two-dimensional mash of size if per-hop time is ignored? You must give reasons to justify your answer. [4]
10)Prove that if total overhead for a given problem size, then the parallel execution time will continue to decrease as is increased and asymptotically
approach a constant value.
11) What is race condition? Explain race condition with two examples. [3]
12) In the context of a GPU program using CUDA, what are (i) threads; (ii) blocks; and
(iii) grids? [3]
13) In CUDA programming, how many different kinds of memories are in a GPU? [2]
14) Can functions annotated with __global__ be executed on the host? [2]
15) What does coalesced / un-coalesced memory access mean? [3]
Page 1 of 4

COMP5426 Parallel and Distributed Computing Final Exam Semester 1 2021
Question 2 (10 Marks)
Suppose that MPI_COMM_WORLD consists of three processes with rank 0, 1, and 2, respectively. Suppose the following code is executed:
int x, y, z;
switch(my_rank) {
case 0: x=1; y=4; z=2;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Send(&y, 1, MPI_INT, 2, 2, MPI_COMM_WORLD); MPI_Recv(&z, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status); MPI_Allreduce(&y, &x, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD); break;
case 1: x=3; y=8; z=6;
MPI_Bcast(&y, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Send(&z, 1, MPI_INT, 2, 1, MPI_COMM_WORLD); MPI_Send(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD); MPI_Allreduce(&z, &y, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD); break;
case 2: x=6; y=7; z=8;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Recv(&y, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status); MPI_Recv(&z, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status); MPI_Allreduce(&x, &y, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD); break;
What are the values of x, y, and z on each process after the code has been executed?
QUESTION 3 (10 Marks)
A counting semaphore is a synchronization mechanism widely used in practice. It has an integer variable used for signaling among threads. Only three atomic operations may be performed on a semaphore:
i. sem_Init: The integer variable is initially assigned a zero or positive value.
ii. sem_Wait: The integer value is decremented by 1; if the value is smaller than zero,
the calling thread will be blocked.
iii. sem_Signal: The integer value is incremented by 1; if the value is smaller than or
equal to zero, unblock one of the waiting thread.
You are asked to implement a simple counting semaphore using pthread mutex and condition variable.
A counting semaphore type may be defined as typedef struct {
int count; /* the integer variable */ pthread_mutex_t s_m; /* mutex */ pthread_cond_t s_cv; /* condition variable */
} mylib_sem_t;
You must implement the following THREE functions:
mylib_sem_init(mylib_sem_t *s, int value)/*sem_Init as discussed above*/ mylib_sem_wait(mylib_sem_t *s) /*sem_Wait operation as discussed above*/ mylib_sem_signal(mylib_sem_t *s) /* sem_Signal as discussed above */
Page 2 of 4

COMP5426 Parallel and Distributed Computing Final Exam Semester 1 2021
Question 4 (10 Marks)
Consider a CUDA kernel routine reduceOp() below. This routine performs a parallel reduction operation, i.e., parallel summation of an array of integers with n elements. The routine does not make use of shared memory in each SM (or Stream Multiprocessor) and the reduction is done in-place, which means that the values in global memory are replaced by partial sums at each step.
__global__ void reduceOp(int *g_idata, int *g_odata, unsigned int n)
// set thread ID
unsigned int tid = threadIdx.x;
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
// convert global data pointer to the local pointer of this block int *idata = g_idata + blockIdx.x*blockDim.x;
// boundary check
if(idx >= n) return;
// in-place reduction in global memory
for (int stride = 1; stride < blockDim.x; stride *= 2) { int index = 2 * stride * tid; if (index < blockDim.x) { idata[index] += idata[index + stride]; // synchronize within threadblock __syncthreads(); // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = idata[0]; } You need to answer the following questions after analyzing the routine. 1) Describe how the parallel algorithm adopted by this routine works. 2) Discuss if there are any shortcomings in this algorithm (in addition to the use of shared memory). 3) Based on your discussion, modify the routine to make it work more efficiently still for in-place reduction in global memory without making use of shared memory in each 4) Justify your modifications, i.e., discuss why your modifications can enhance the performance of the original routine. Page 3 of 4 COMP5426 Parallel and Distributed Computing Final Exam Semester 1 2021 Question 5 (20 Marks) Relational operator ¡°.>¡± performs element-wise comparisons between two arrays of equal length and the results is a logical array of the same length indicating the locations where the relation ¡°>¡± (i.e., greater than) is true. Given two integer arrays and
, for example, . The total number of locations where the relation is true is in this example. (Note in general
1) Given a set of
algorithm to find the array pairs which have the largest value for in a distributed memory system of processing elements organized as a one-dimensional ring for . (Note there may be multiple such array pairs.)
In your design you must seriously consider how to balance the workloads across the processing elements and minimize the communication cost.
You need to discuss how both data and computation are partitioned, how many computation-and-communication steps are needed, and in each step what computation is performed and how data are sent/received by each processing element. You may draw figures and/or write pseudocode to simplify the description of your parallel algorithm.
2) Calculate the speedup, efficiency and isoefficiency function for your designed parallel algorithm. Show your work.
END OF EXAMINATION
arrays, each being of length , you are asked to design a parallel
Page 4 of 4

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com