CSci 5451, S’21 Homework # 2 Due Date: 04-02-21
1. We have a supercomputer with 8,192 processors and want to demonstrate that the machine can achieve an impressive speed-up of 7,000 on a problem of great importance. What is the maximum portion of the parallel execution time that can be devoted to the inherently sequential part of the code if we want to achieve this goal?
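A minimal sketch of the kind of algebra involved, under the assumption that the parallel run time splits as Tp = ts + tpar with the parallelizable part speeding up perfectly, so that S = x + p(1 − x) where x = ts/Tp is the fraction asked about (this interpretation is an assumption, not the required derivation):

    # Hypothetical helper for problem 1: largest fraction x of the
    # parallel run time that can be sequential while still reaching a
    # target speed-up S on p processors, assuming S = x + p*(1 - x).
    def max_sequential_fraction(speedup_target, p):
        return (p - speedup_target) / (p - 1)

    print(max_sequential_fraction(7000, 8192))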
2. This question is about the Karp-Flatt metric.
(a) Suppose that you observe the speed-up of a parallel program and find that it behaves like S(p) ≈ p/(1 + 0.1√(p − 1)). Find the related experimental serial fraction e(p). What can you conclude about the efficiency of your program?
(b) What is e(p) if S(p) takes the form S(p) = p/(1 + (p − 1)f(p))? In this case, what should f satisfy in order for e(p) to converge rapidly to a constant? Among the following functions, which ones will do the job:
(i) f(p) = 0.01; (ii) f(p) = √p; (iii) f(p) = 0.1 + exp(−p)?
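As a quick way to experiment with part (a), a sketch using the usual Karp-Flatt definition e(p) = (1/S − 1/p)/(1 − 1/p); the grouping of the square root in the speed-up law is my reading of the statement:

    import math

    # Karp-Flatt experimentally determined serial fraction.
    def karp_flatt(speedup, p):
        return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

    # Speed-up law from part (a), as read above.
    def speedup_a(p):
        return p / (1.0 + 0.1 * math.sqrt(p - 1))

    for p in (4, 16, 64, 256):
        print(p, karp_flatt(speedup_a(p), p))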
3. We consider the problem of matrix by vector multiplication y = Ax on p processes, where A is n × n and n is exactly divisible by p.
(a) Assume that the matrix A and the operand x are distributed row-wise as shown on Pages 10-13 of the notes (n/p rows of A in each process and n/p components of x on each process). What MPI command would you invoke to perform the communication needed for the matrix-vector product in this case?
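One natural choice for this layout is an all-gather of x. A minimal mpi4py sketch (mpi4py and numpy assumed available; this is one possible scheme, not necessarily the intended answer):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, rank = comm.Get_size(), comm.Get_rank()
    n = 8 * p                           # toy size, divisible by p

    A_loc = np.random.rand(n // p, n)   # my n/p rows of A
    x_loc = np.random.rand(n // p)      # my n/p components of x

    x = np.empty(n)
    comm.Allgather(x_loc, x)            # every process assembles the full x
    y_loc = A_loc @ x                   # my n/p components of y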
(b) Assume now that the matrix is distributed column-wise (n/p columns per process) and that x is located in process 0. We want the result y to land in process 0 as well. Describe (in words or with pseudo-code) how you would implement the matvec operation in this case, explicitly naming the MPI commands that you would invoke. There may be a few possible options, but you need to pay attention to efficiency.
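One reasonable option, sketched here with mpi4py as an illustration (not necessarily the most efficient scheme): scatter the pieces of x from process 0, form local partial products, and sum-reduce onto process 0.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, rank = comm.Get_size(), comm.Get_rank()
    n = 8 * p

    A_loc = np.random.rand(n, n // p)             # my n/p columns of A
    x = np.random.rand(n) if rank == 0 else None  # x lives on process 0

    x_loc = np.empty(n // p)
    comm.Scatter(x, x_loc, root=0)                # MPI_Scatter from process 0
    y_part = A_loc @ x_loc                        # length-n vector of partial sums
    y = np.empty(n) if rank == 0 else None
    comm.Reduce(y_part, y, op=MPI.SUM, root=0)    # MPI_Reduce onto process 0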
4. A sequential code has time complexity ≈ αn², where n is the size of the data set. When a parallel version of this code is implemented, it is found that it executes a purely sequential data-related portion that takes time (12000 + n) μsec. The rest of the code is purely parallel and runs (sequentially) in time n²/200.
(a) What is the maximum speed-up achieved by this program as a function of n and p? What is this maximum when n = 8,000 and p = 20? Plot the speed-up for n = 8,000 and p = 2, 4, 6, ..., 16, along with the perfect speed-up curve S(p) = p. What do you get when p → ∞?
(b) The above does not take into account communication overhead. We assume that the parallel execution entails a communication overhead of the form
t(n, p) = β log n × [τs log p + βn]
microseconds, where τs = 5,000 and β = 0.1. Recalculate the speed-up under this scenario. Redo the case p = 20, n = 8,000 seen in (a). Also produce a new plot of the corresponding curve along with those of (a) for the case n = 8,000.
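A sketch of how the two speed-up formulas could be evaluated numerically. The exact forms of T1 and Tp below (and the use of base-2 logarithms) are my reading of the problem, so treat them as assumptions rather than the required model:

    import math

    TAU_S, BETA = 5000.0, 0.1

    # Communication overhead from part (b), in microseconds.
    def t_comm(n, p):
        return BETA * math.log2(n) * (TAU_S * math.log2(p) + BETA * n)

    # Assumed: T1 = (12000 + n) + n**2/200; Tp adds the overhead when asked.
    def speedup(n, p, with_comm=False):
        t1 = (12000 + n) + n**2 / 200
        tp = (12000 + n) + n**2 / (200 * p)
        if with_comm:
            tp += t_comm(n, p)
        return t1 / tp

    n = 8000
    for p in range(2, 17, 2):
        print(p, round(speedup(n, p), 2), round(speedup(n, p, True), 2))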
5. Recall that scaled speedup is defined by selecting the problem so that work is increased linearly with the number of processing elements; that is, if a base problem of a certain size involves work W on a single processing element, then on p processors the size is selected so that the work is pW, and the scaled speed-up is SScaled = pW / Tp(pW, p).
Consider the problem of adding n numbers on p processing elements. Assume that it takes ten time units to communicate a number between two processing elements, and that it takes one unit of time to add two numbers.
(a) Plot the standard speedup curve for the case when we are adding n = 256 numbers on p = 1, 4, 16, 64, and 256 processors.
(b) On the same figure plot the scaled speedup curve. Hint: the parallel run time is (n/p − 1) + 11 log p.
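A sketch for generating both curves, assuming the hinted run time Tp(n, p) = (n/p − 1) + 11 log2 p, serial time T1(n) = n − 1, and problem size n = 256·p in the scaled case (all three are my reading of the definitions above):

    import math

    def t_par(n, p):
        return (n / p - 1) + 11 * math.log2(p)

    def standard_speedup(n, p):
        return (n - 1) / t_par(n, p)

    def scaled_speedup(n0, p):
        n = n0 * p                       # work grows linearly with p
        return (n - 1) / t_par(n, p)

    for p in (1, 4, 16, 64, 256):
        print(p, round(standard_speedup(256, p), 2),
                 round(scaled_speedup(256, p), 2))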
6. (Continuation of the previous exercise.) Using the expression for the standard speed-up from the previous question (same value n = 256), obtain the efficiency for p = 1, 2, 3, ..., 10. How many processors can be used if we want the efficiency to be no less than 50%?
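The same model gives the efficiency directly as E(p) = S(p)/p; a small self-contained sketch, again assuming the hinted run time:

    import math

    def efficiency(n, p):
        t1 = n - 1
        tp = (n / p - 1) + 11 * math.log2(p)
        return (t1 / tp) / p

    for p in range(1, 11):
        print(p, round(efficiency(256, p), 3))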
7. Assume that you have two matrices of size n × n and p processors, configured as a √p × √p torus. Obtain a timing model for the Cannon algorithm. Assume that communicating m data items to any nearest neighbor costs ts + m·tw time units [only one channel (east, west, north, or south) can be used at a time] and that one flop costs one time unit. Next perform a scaled speed-up analysis where the scaling is based on memory rather than time, so that when p increases, n increases as n = √p · n0, where n0 is the initial size. On the same figure plot the standard and scaled speed-up for n0 = 32 and √p = 2^k with k = 0, 1, 2, 3, 4 (i.e., p = 1, 4, 16, 64, 256). Assume ts = 10, tw = 1.
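One simple model sometimes used for Cannon's algorithm is sketched below: √p shift-multiply steps, 2(n/√p)³ flops per step for the local block multiply-add, and one shift of the A block plus one of the B block per step, with the initial alignment ignored. These constants are my assumptions, not the required derivation; the sketch only shows how to tabulate the curves once you have your own model:

    import math

    TS, TW = 10.0, 1.0

    def t_cannon(n, p):
        q = int(math.sqrt(p))                     # q x q process grid
        comp = 2 * n**3 / p                       # total local flops
        comm = 2 * q * (TS + TW * n * n / p) if p > 1 else 0.0
        return comp + comm

    def standard_speedup(n, p):
        return (2 * n**3) / t_cannon(n, p)

    def scaled_speedup(n0, p):
        n = int(math.sqrt(p)) * n0                # memory-based scaling
        return (2 * n**3) / t_cannon(n, p)

    for k in range(5):                            # p = 1, 4, 16, 64, 256
        p = 4**k
        print(p, round(standard_speedup(32, p), 2),
                 round(scaled_speedup(32, p), 2))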
8. Show the progression of (a) the odd-even mergesort algorithm and (b) the bitonic sort algorithm when sorting the array
3, 10, 1, 11, 30, 18, 8, 20.
[Hint: follow similar detail of process as for the examples in the notes.]
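To check a hand-worked bitonic progression, a sketch that prints the array after every compare-exchange stage; the stage grouping follows the usual iterative formulation of bitonic sort, which may differ slightly from how the notes draw the network (and it does not cover the odd-even mergesort part):

    def bitonic_sort_trace(values):
        a = list(values)
        n = len(a)                          # must be a power of two
        k = 2
        while k <= n:                       # size of the bitonic sequences
            j = k // 2
            while j >= 1:                   # compare-exchange distance
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (ascending and a[i] > a[partner]) or \
                           (not ascending and a[i] < a[partner]):
                            a[i], a[partner] = a[partner], a[i]
                print("k=%d j=%d:" % (k, j), a)
                j //= 2
            k *= 2
        return a

    bitonic_sort_trace([3, 10, 1, 11, 30, 18, 8, 20])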