Serial Optimization Practice Questions
Problem:
Rank the following in increasing order of how expensive they would be inside a loop (say, with a billion iterations and approximately 10 instructions per iteration).
1. A floating point add
2. Exp(x)
3. OpenMP critical region
4. A floating point divide
5. OpenMP atomic pragma
Solution:
A floating point add is the least expensive operation. An OpenMP critical region is the most expensive, because it acquires and releases a lock (which may involve system calls) to provide mutual exclusion. An OpenMP atomic pragma is less expensive than a critical region: it protects just a single update statement and typically maps to a hardware atomic instruction. Floating point divides cannot be fully pipelined, so they are more expensive than additions. Exp(x) is the third most expensive, as it is computed with multiple adds, multiplies, and divides. The increasing order of cost is therefore: floating point add, floating point divide, Exp(x), OpenMP atomic pragma, OpenMP critical region.
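As a minimal illustration (not part of the original question; the array x and length n are placeholders), the two OpenMP constructs compared above look like this in code. The atomic form annotates a single update, while the critical form locks an arbitrary block:

double sum = 0.0;
#pragma omp parallel for
for (long i = 0; i < n; i++) {
    #pragma omp atomic      /* typically a single hardware read-modify-write */
    sum += x[i];
}

#pragma omp parallel for
for (long i = 0; i < n; i++) {
    #pragma omp critical    /* lock acquire/release on every iteration */
    { sum += x[i]; }
}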
Problem:
For the program below, what is the number of floating point operations performed in each iteration? Assume that the compiler doesn’t perform any fancy optimizations, or auto-vectorize the code, to reduce the operation count.
double sv = 2.423;
for (i = 0; i < n; i++) {
    double a = A[i];
    double b = B[i];
    double c = C[i];
    double d = 2.0 * c * (a * a + b * b - 2.0 * a * b) / sv;
    D[i] = d;
}
Suggest ways to rewrite the loop to reduce the number of floating point operations, in particular avoiding the division inside the loop. What is the new floating point operation count?
Solution:
Pre-compute "2.0 / sv" outside the loop; call it "sv2_inv". Replace the line "double d = ..." with:
double e = a - b;
double d = sv2_inv * c * e * e;
This reduces the operation count from 8 operations plus 1 divide per iteration to 4 operations per iteration (1 subtract and 3 multiplies), using the identity a*a + b*b - 2.0*a*b == (a - b) * (a - b).
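Putting it together, one way to write the full rewritten loop (using the sv2_inv and e names from above) is:

double sv = 2.423;
double sv2_inv = 2.0 / sv;  /* hoisted: one divide total, not one per iteration */
for (i = 0; i < n; i++) {
    double a = A[i];
    double b = B[i];
    double c = C[i];
    double e = a - b;            /* (a - b)^2 replaces a*a + b*b - 2*a*b */
    D[i] = sv2_inv * c * e * e;  /* 4 flops per iteration */
}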
Problem:
What are SIMD instructions? For which code listing below are they more relevant / useful?
Listing 1:
for (i = 0; i < n; i++) {
    B[i] = A[A[i]];
}
Listing 2:
for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        A[i] = A[i] + B[j];
    }
}
Solution:
Typical instructions involve two operands. Current Intel and AMD microarchitectures have additional special registers sized up to 512 bits, and support instructions that can operate on 8 pairs of double-precision operands concurrently (a 512-bit register holds 8 doubles). A programmer can exploit this parallelism using special low-level functions called intrinsics. These intrinsics map directly to SIMD (Single Instruction Multiple Data) instructions.
The inner loop in Listing 2 is more amenable to SIMD vectorization: it can be unrolled by a factor of 4 or 8, and the partial sums over the values in B can be accumulated with vector adds. Listing 1 involves indirection, because B[i] is set to A[A[i]]: we need to load the value A[i] first, and only then A[A[i]], which requires a scattered gather rather than a contiguous vector load. There are also no arithmetic operations in Listing 1 to vectorize.
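As a sketch of how intrinsics expose this (assuming AVX with 256-bit registers, i.e. 4 doubles per register; the helper name sum_b is illustrative, not from the original code), the inner reduction of Listing 2 could be written as:

#include <immintrin.h>

double sum_b(const double *B, int m) {
    __m256d acc = _mm256_setzero_pd();  /* 4 partial sums, all zero */
    int j = 0;
    for (; j + 4 <= m; j += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(&B[j]));  /* 4 adds per instruction */
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    double s = tmp[0] + tmp[1] + tmp[2] + tmp[3];  /* horizontal sum of partials */
    for (; j < m; j++)
        s += B[j];                                 /* leftover elements */
    return s;
}

The outer loop of Listing 2 would then simply do A[i] = A[i] + sum_b(B, m). No equivalent contiguous load is possible for A[A[i]] in Listing 1, which is why it does not vectorize well.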