Problem:
Basics of Parallel Algorithms Practice Questions
What are the T(n) and W(n) for the following PRAM algorithm:
Solution:
T(n) = Ө(log n) and W(n) = Ө(n). This is the balanced binary tree-based algorithm for summation.
Problem:
A work-efficient PRAM requires W(n) = O(n log n) and T(n) = O(log2 n). For what values of p does the algorithm achieve optimal speedup?
Solution:
Problem:
Show two work-efficient PRAM algorithms for summation using DAGs, for summing up 7 elements. If you are given 3 processors, which DAG would you use?
Solution:
Work efficient => Ө(n)
Dag 1 is preferable with 3 processors. Ops can e schedules in 3 time steps.
Problem:
Consider two PRAM algorithms for solving a problem. One algorithm has T(n) = O(log n), W(n) = O(n log n), and requires Ω(n) processors to achieve these bounds. Another algorithm has T(n) = O(log2 n) and W(n) = O(n) and requires Ω(n log n) processors. Which algorithm would you prefer, given a system with Ө(log n) processors? Why?
Solution:
Case 1: Tp(n) = O(n + log n) = O(n) Case2:Tp(n)=O(n/logn+log2 n)=O(n/logn) Case 2 is preferable because of lower Tp(n).
Problem:
What is the distinction between the partitioning and the divide and conquer parallelization strategies?
Solution:
Partitioning:
Split problem into ‘p’ parts
Merge step is simpler
Iterative algorithm Divide and conquer
Problem is subdivided with implicit assumption on ‘p’ Recursive algorithm
Problem:
The execution time of a parallel algorithm on 100 processors is 1 second. The execution time of the parallel algorithm on one processor is 50 seconds, and the running time of the best sequential algorithm for this problem is 20 seconds. What is the parallel efficiency for the 100-processor run?
Solution:
T1 = 50s, Tp = 1s, p = 100. Parallel efficiency (our class definition) is T1/(pTp) = 0.5.
Problem:
The single core/serial execution time of a benchmark for problem size X is 110 seconds. Assume that there is a constant overhead (serial component) of 10 seconds in the benchmark, and that this overhead is independent of problem size. Further assume that the rest of the benchmark work is perfectly parallelizable (i.e., you obtain a speedup of p using p cores for the parallelizable routines). What is the overall parallel speedup with p = 10 processors? According to Amdahl’s law, what is the maximum obtainable parallel speedup for this benchmark?
Solution:
The speedup with p = 10 processors is
According to Amdahl’s law, the serial overhead will limit the maximum parallel speedup achieved. The
maximum parallel speedup would be
Problem:
The serial complexity of a problem is Ө(n log n). The time taken by FPX, a fast PRAM algorithm for this problem is Ө(log n) using n/2 processors. Further, you are given that FPX performs Ө(n log n) work. Which of the following statements is correct?
A) This algorithm is work-optimal.
B) This algorithm has bounds similar to the balanced binary tree based summation algorithm.
C) On a p-processor system, this algorithm will require time. D) Speedup is optimal when p = Ө(n log n)
Solution: A, B & C
Problem:
Give an efficient PRAM algorithm for dense matrix-vector multiplication. What is the parallel time, work performed and how many processors are required?
Solution:
Given an n x n matrix A and a vector b of size n, we want to compute the vector c = Ab. Each element
. We can use the balanced binary tree based PRAM algorithm for summation to compute each element ci. This algorithm has running time T(n) = Ө(log n) and W(n) = Ө(n) using Ө(n) processors. Since all n elements of c can be computed concurrently, the overall algorithm will have T(n) = Ө(log n), W(n) = Ө(n2) using Ө(n2) processors. This is a work-efficient algorithm because the work complexity of the parallel algorithm matches the asymptotic running time of the best serial algorithm.
Problem:
Given two matrices A (size n x l) and B (size l x m), discuss how you can apply the balanced binary tree- based parallelization strategy to design an algorithm for computing the matrix product C (= AB). What is the work performed and the parallel time? How many processors are required to achieve the time bounds?
Solution:
C is of size n x m.
Cij = sum(k = 1:l)(aik * bkj)
For each cij, we can use the balance binary tree based summation algorithm after computing the products aik*bkj
Work performed = Ө(nml). Time = Ө(log l)
We require nml processors to achieve these bounds
Problem:
Consider a problem whose serial complexity is Ө(n2). PRAM algorithm #1 for this problem performs Ө(n2) work and the parallel running time is Ө(√n). PRAM algorithm #2 for this problem performs Ө(n2 log n) work and the parallel running time is Ө(log n). Given p = Ө(n) processors, which algorithm would you prefer to use? Why?
Solution:
We can use Brent’s theorem to determine Tp(n).
Alg. #1: Tp(n) = O(n2/p + √n) when p = Ө(n), Tp(n) = O(n)
Alg. #2: Tp(n) = O( (n2 log n)/p + log(n) when p = Ө(n), Tp(n) = O(n log n) Alg 1 is the faster choice.