OpenMP Practice Questions
Problem:
What are the values of v1, v2, and v3 after execution of the OpenMP region below?
int v1 = 0; int v2 = 0; int v3 = 0;
#pragma omp parallel num_threads(4)
{
    int tid = omp_get_thread_num();
    #pragma omp critical
    v1 = v1 + tid;
    #pragma omp single
    v2 = v2 + tid;
    #pragma omp master
    v3 = v3 + tid;
}
Solution:
v1 = 6 (0 + 1 + 2 + 3: all four threads execute this statement, and the critical pragma serializes the updates).
v2 can be 0, 1, 2, or 3, since exactly one (arbitrary) thread executes the single region.
v3 = 0 (only the master thread, i.e., the thread with tid = 0, executes this statement).
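A complete, compilable version of the region (a sketch, assuming gcc with -fopenmp; v1 and v3 are deterministic, v2 is not):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int v1 = 0, v2 = 0, v3 = 0;
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        #pragma omp critical
        v1 = v1 + tid;  /* all four threads add their tid: 0+1+2+3 = 6 */
        #pragma omp single
        v2 = v2 + tid;  /* exactly one (arbitrary) thread adds its tid */
        #pragma omp master
        v3 = v3 + tid;  /* only thread 0 executes, so v3 stays 0 */
    }
    printf("v1 = %d, v2 = %d, v3 = %d\n", v1, v2, v3);
    return 0;
}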
Problem:
What is your best estimate for the time taken by an OpenMP-parallelized program to sum a billion double-precision floating-point values on a dual-socket, quad-core system with 32 GB DDR3 memory, a peak memory bandwidth of 30 GB/s, and a clock speed of 2.0 GHz? Briefly explain your calculation.
Solution:
10^9 double-precision values => 10^9 adds and 8 * 10^9 bytes accessed.
Assume 1 add per clock cycle (0.5 ns at 2 GHz) => 10^9 adds take 10^9 * 0.5 ns = 0.5 seconds.
2 sockets * 4 cores = 8 cores => 8-way parallelism in the adds => 0.5 / 8 ≈ 0.06 seconds.
Data transfer time (min) = (8 * 10^9 bytes) / (30 * 10^9 bytes/s) ≈ 0.26 seconds.
The program is memory-bound, so the data transfer time dominates: the estimate is about 0.26 seconds.
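For reference, a minimal sketch of the summation loop being estimated (assuming the 10^9 doubles are already resident in an array A; the function name is illustrative):

#include <omp.h>

double sum_array(const double *A, long n) {
    double sum = 0.0;
    /* reduction(+:sum) gives each thread a private partial sum
       and combines the partial sums at the end of the loop */
    #pragma omp parallel for reduction(+:sum) num_threads(8)
    for (long i = 0; i < n; i++)
        sum += A[i];
    return sum;
}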
Problem:
Identify correctness and performance bugs in the following OpenMP code.
int sum = 0;
int n = 100000;
#pragma omp parallel num_threads(8)
{
    int partial_sum = 0;
    #pragma omp parallel for schedule(dynamic) nowait
    for (i = 0; i < n; i++)
        partial_sum += A[i] + B[i];
    sum = sum + partial_sum;
}
Solution:
Correctness bug 1: sum is a shared variable and must be protected by a critical pragma preceding it. Otherwise, multiple threads may concurrently increment sum by the value partial_sum, and the result would be incorrect.
Correctness bug 2: Since a parallel region has already been created, the pragma before the for loop should just be #pragma omp for, not #pragma omp parallel for. #pragma omp for correctly partitions the loop iterations among the threads of the existing team, whereas the nested parallel for does not (by default, with nested parallelism disabled, each of the 8 threads would execute all n iterations itself).
Possible correctness bug 3: The loop variable i is never declared or initialized. Loop iteration variables must be private; declaring it in the for statement (for (int i = 0; ...)) makes it private automatically.
Performance bug: A dynamic schedule assigns loop iterations to threads at run time, which incurs a very high overhead here because each iteration performs very little work (two loads and one add). A static loop schedule would be a better option.
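Putting the fixes together, a corrected sketch (the arrays A and B are assumed to be declared and initialized elsewhere):

int sum = 0;
int n = 100000;
#pragma omp parallel num_threads(8)
{
    int partial_sum = 0;
    /* omp for (not parallel for) splits iterations across the existing team;
       a static schedule suits the uniform, tiny loop body */
    #pragma omp for schedule(static)
    for (int i = 0; i < n; i++)
        partial_sum += A[i] + B[i];
    /* protect the shared accumulation */
    #pragma omp critical
    sum = sum + partial_sum;
}

Equivalently, the whole pattern collapses to a single #pragma omp parallel for reduction(+:sum) loop.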
Problem:
In a shared memory program using OpenMP threads, which of the following operations need to be protected using synchronization constructs in order to ensure correct execution of code?
A) Two threads reading the same shared variable x
B) One thread reading shared variable x, another thread writing to x
C) All threads writing to the same shared variable x
D) Two threads writing to a private variable x
Solution: B & C
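For instance, cases B and C become correct once the conflicting accesses are serialized; a minimal sketch for case C using an atomic construct:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 0;
    #pragma omp parallel num_threads(4)
    {
        /* an unprotected x++ here would be a data race (case C);
           atomic makes the read-modify-write indivisible */
        #pragma omp atomic
        x++;
    }
    printf("x = %d\n", x);  /* always prints 4 */
    return 0;
}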
Problem:
Is the loop below amenable to OpenMP parallelization? What changes would you make to parallelize this loop?
for (int i = 1; i < 1000000; i++) {
    B[i] = A[i] + C[i];
    sum = sum + B[i-1];
}
Solution:
No, the loop cannot be parallelized as-is because of an inter-iteration dependence: the sum update in iteration i reads B[i-1], which is computed in iteration i-1. Shifting the accumulation to B[i] and handling the boundary terms B[0] and B[999999] outside the loop removes the dependence. Rewrite the loop as follows to correctly compute the value of sum and all values in array B.
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 1; i < 999999; i++)
{
    B[i] = A[i] + C[i];
    sum = sum + B[i];
}
sum = sum + B[0];
B[999999] = A[999999] + C[999999];
Problem:
Estimate the running time of an OpenMP program for computing pi using Riemann sums. Assume that you are running the program on a 2 GHz quad-core single-processor Intel system based on the Haswell microarchitecture (i.e., support for AVX SIMD intrinsics). Further assume that each iteration of the summation loop requires four additions and ten multiplications, and that the instructions can be fully pipelined. You are also given that the peak memory bandwidth of the system is 21.3 GB/s and that the last-level cache size is 6 MB. Explain your calculations.
Solution:
Let us assume we use n iterations to approximate pi. Each iteration has 14 floating-point instructions. Assume each arithmetic instruction takes one clock cycle, or 0.5 ns, to complete, giving 14 * 0.5 = 7n ns of serial work. The memory bandwidth and cache capacity information is irrelevant because there are no reads from or writes to main memory; this is a compute-bound program. Assuming a 4x speedup on this quad-core system, the running time estimate is 7n / 4 = 1.75n nanoseconds. If n = 10^9, the running time estimate is 1.75 seconds.
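One common form of the summation loop being modeled, using the midpoint rule for the integral of 4/(1+x^2) on [0,1] (a sketch; its per-iteration operation count differs from the 14 assumed above):

#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 1000000000L;        /* number of rectangles */
    const double dx = 1.0 / (double)n;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * dx;     /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);    /* integrand whose integral is pi */
    }
    printf("pi ~= %.12f\n", sum * dx);
    return 0;
}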
Problem:
Identify the correctness and performance bugs in the following OpenMP code. Assume that the array A is of size 10^6 and has positive integers.
int maxval = 0;
#pragma omp parallel num_threads(8)
{
    int lmax = 0;
    #pragma omp for schedule(dynamic)
    for (int i = 1; i <= 1000000; i++)
    {
        if (A[i] > lmax) lmax = A[i];
    }
    #pragma omp master
    if (lmax > maxval) maxval = lmax;
}
printf("Max value: %d\n", maxval);
Solution:
Correctness bug 1: Use omp critical instead of omp master; we want every thread to merge its local maximum into maxval, one at a time. With omp master, only thread 0's lmax is considered and the other threads' maxima are lost.
Correctness bug 2: A has 10^6 elements (indices 0 to 999999), so the loop bounds must be for (i = 0; i < 1000000; i++); as written, the loop skips A[0] and reads A[1000000] out of bounds.
Performance bug: Use static scheduling instead of dynamic to improve performance; every iteration does the same small amount of work, so dynamic scheduling only adds overhead.
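A corrected sketch (assuming A is declared elsewhere and holds 10^6 positive integers):

int maxval = 0;
#pragma omp parallel num_threads(8)
{
    int lmax = 0;
    #pragma omp for schedule(static)
    for (int i = 0; i < 1000000; i++)
    {
        if (A[i] > lmax) lmax = A[i];
    }
    /* every thread merges its local maximum, one at a time */
    #pragma omp critical
    if (lmax > maxval) maxval = lmax;
}
printf("Max value: %d\n", maxval);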
Problem:
Estimate the running time of an OpenMP program for computing the dot product of two vectors u and v of size 100 million. Assume that you are running the program on a 2 GHz dual-core single-processor Intel system based on the Sandy Bridge microarchitecture (i.e., support for AVX SIMD intrinsics). You are also given that the peak memory bandwidth of the system is 20 GB/s and that the last-level cache size is 6 MB. Explain your calculations.
Solution:
Assume the vectors have double-precision values.
Total data size = 2 * 100 * 10^6 * 8 bytes = 1.6 * 10^9 bytes.
Data transfer time = (1.6 * 10^9 bytes) / (20 * 10^9 bytes/s) = 80 ms.
The dot product requires 100 million adds and 100 million multiplies.
AVX intrinsics => 4-way SIMD on doubles => 25 * 10^6 add instructions and 25 * 10^6 multiply instructions.
Assuming full instruction pipelining (with adds and multiplies issuing concurrently), arithmetic time = (25 * 10^6) / (2 cores * 2 * 10^9 instructions/s) = 6.25 ms.
The program is memory-bound (80 ms >> 6.25 ms), so the running time is 80 ms.
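A sketch of the loop being estimated (relying on the compiler to vectorize the body with AVX; u and v are assumed to be arrays of 10^8 doubles):

#include <omp.h>

double dot(const double *u, const double *v, long n) {
    double sum = 0.0;
    /* each thread accumulates privately via the reduction; the streaming
       access pattern is what makes memory bandwidth the bottleneck */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += u[i] * v[i];
    return sum;
}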