Problem:
Data Locality / Caches Practice Questions
Consider a matrix A of dimension m x n, and a vector x. Both A and x are single-precision floats; the result of each row's dot product with x is stored in a vector c.

for (i = 0; i < m; i++)
{
float dot_prod = 0;
for (j = 0; j < n; j++)
{
dot_prod = dot_prod + A[i][j] * x[j];
}
c[i] = dot_prod;
}
Assume the processor can complete one floating point add and one multiply every clock cycle. What is the lower bound on execution time (in clock cycles) based on floating point operations?
Which loop would you unroll?
Assuming an initial cold cache configuration, m = 10^6, n = 256, and a last-level cache line size of 64 bytes, what is the total number of cache lines fetched?
Solution:
There are a total of mn additions and mn multiplications inside the loop. Assuming the compiler can transform the code so that one add and one multiply are performed every clock cycle, the lower bound on execution time based on floating-point operations is mn clock cycles.
The outer loop, to enable re-use of x[j]: each element of x loaded from memory is then used across several rows of A.
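As a sketch of that transformation (illustrative names; fixed M x N dimensions and an unroll factor of 4, assuming the row count is a multiple of 4):

```c
/* Sketch: outer (i) loop unrolled by 4 so each x[j] loaded from
   memory is reused across four rows of A before moving on.
   M, N and matvec_unrolled are illustrative names, not from the
   original problem. */
#define M 8
#define N 16

void matvec_unrolled(float A[M][N], const float x[N], float c[M])
{
    for (int i = 0; i < M; i += 4) {
        float d0 = 0, d1 = 0, d2 = 0, d3 = 0;
        for (int j = 0; j < N; j++) {
            float xj = x[j];            /* loaded once, used four times */
            d0 += A[i+0][j] * xj;
            d1 += A[i+1][j] * xj;
            d2 += A[i+2][j] * xj;
            d3 += A[i+3][j] * xj;
        }
        c[i+0] = d0;
        c[i+1] = d1;
        c[i+2] = d2;
        c[i+3] = d3;
    }
}
```

A side benefit is that the four accumulators are independent, which also exposes instruction-level parallelism in the inner loop.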
For A, we need to fetch 10^6 x 256 x 4 / 64 cache lines. c would require 10^6 x 4 / 64 cache lines, and x would require 256 x 4 / 64 lines.
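A quick check of that arithmetic (values from the problem: m = 10^6, n = 256, 4-byte floats, 64-byte lines; cache_lines is a helper name introduced here):

```c
/* Cache lines fetched for an array of `elems` elements of
   `elem_bytes` each, with `line_bytes` per cache line. */
long long cache_lines(long long elems, long long elem_bytes, long long line_bytes)
{
    return elems * elem_bytes / line_bytes;
}

/* cache_lines(1000000LL * 256, 4, 64) -> 16000000   (A)
   cache_lines(1000000LL,       4, 64) -> 62500      (c)
   cache_lines(256LL,           4, 64) -> 16         (x)  */
```

The total is dominated by A, at about 1.6 x 10^7 lines.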
Problem:
Consider the following loop accessing two matrices A and B of double-precision floating point values.

double h = 0.0001;
for (i = 1; i < n-1; i++)
{
for (j = 1; j < n-1; j++)
{
B[i][j] = B[i][j-1] + h*(A[i][j-1] + A[i][j+1] + A[i+1][j]
+ A[i-1][j]);
}
}
Assume a processor can complete one addition and one multiplication per clock cycle.
o What is the lower bound for running time (in clock cycles) based on just the floating point operations performed?
o Assume a last-level cache line size of 64 bytes, what is the number of cache lines fetched from main memory to the last level cache?
o For n = 10^5, a 2 GHz processor, and a single-core memory read bandwidth of 8 GB/s, what is a lower bound on the running time?
o Would you say that this code is compute-bound, memory bound or neither?
Solution:
o The number of additions is 4(n-2)^2 and the number of multiplications is (n-2)^2. Because of the dependencies, the multiplies and adds cannot be performed concurrently. So the total number of floating-point instructions is 5(n-2)^2, and since the cost of an add/multiply is one cycle, the lower bound is 5(n-2)^2 clock cycles.
o A and B are arrays of doubles, so each of them is of size 8n^2 bytes. The total data that has to be fetched from memory is 16n^2 bytes, and since the cache line size is given to be 64 bytes, the number of cache lines fetched is 16n^2 / 64 = n^2/4.
o The time due to floating-point operations would be approx. (5 x 10^10) / (2 x 10^9) = 25 seconds. The time due to memory reads would be approx. (16 x 10^10) / (8 x 10^9) = 20 seconds. The memory traffic due to stores of B also needs to be accounted for, but let's ignore that for now.
o It is compute bound: the 25-second floating-point lower bound exceeds the 20-second memory-transfer lower bound.
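The two bounds can be worked out directly for n = 10^5 (B's store traffic ignored, as above; flop_seconds and mem_seconds are illustrative helper names):

```c
/* Lower bound from floating-point work: 5(n-2)^2 dependent
   operations, one per cycle, at 2 GHz. */
double flop_seconds(double n)
{
    return 5.0 * (n - 2) * (n - 2) / 2e9;
}

/* Lower bound from memory traffic: 16n^2 bytes read at 8 GB/s. */
double mem_seconds(double n)
{
    return 16.0 * n * n / 8e9;
}

/* flop_seconds(1e5) -> ~25 s, mem_seconds(1e5) -> 20 s, so the
   floating-point work dominates and the code is compute bound. */
```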
Problem:
Consider the following code snippet below:
for (i = 0; i < m; i++)
{
float ai = A[i];
for (j = 0; j < n; j++)
{
B[n*i+j] = ai*C[n*i+j] + D[j];
}
}
A, B, C and D are single precision floating point arrays of size m, mn, mn and n respectively.
1. On some 64-bit processor, you are given that the latency of add and multiply instructions are 3 and 5 clock cycles respectively. The throughput of add and multiply is 1 instruction/clock cycle. What is the lower bound on execution time (in clock cycles) based on floating point operations performed?
2. Assuming an initial "cold cache" configuration, m = 10^6, n = 10^3, and a last level cache line size of 64 bytes, what is the total number of cache lines fetched?
3. Suggest an optimization to improve performance of this code.
Solution:
1. There are a total of mn iterations. In each iteration, there is one floating point add and one floating point multiply. The add and the multiply cannot be performed concurrently: the add depends on the result of the multiply. Also, since the loop isn't unrolled, there is no ILP, and each instruction will incur its full latency. Hence the lower bound on execution time would be (5+3) * mn clock cycles = 8mn clock cycles.
2. The total cache lines fetched are going to be dominated by B (writes, 4mn bytes) and C (read-only, 4mn bytes). Assuming a write-allocate policy, the values of B will first be fetched into cache before they are updated. Also, 4m bytes of A and 4n bytes of D are read. Ideally, D is small enough to be held in cache even if A, B and C are not. So the total cache lines fetched in a cold cache configuration would be (4mn + 4mn + 4m + 4n)/64 ~= 1.25 x 10^8.
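Checking that total for m = 10^6 and n = 10^3 (total_cache_lines is a helper name introduced here):

```c
/* Bytes fetched for B (write-allocate), C, A and D, divided by
   the 64-byte cache line size. */
long long total_cache_lines(long long m, long long n)
{
    long long bytes = 4*m*n + 4*m*n + 4*m + 4*n;  /* B + C + A + D */
    return bytes / 64;
}

/* total_cache_lines(1000000, 1000) -> 125062562, i.e. ~1.25 x 10^8. */
```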
3. Unroll the inner loop. This will expose ILP: the unrolled iterations are independent of one another, so their multiplies and adds can overlap in the pipeline and hide the 5- and 3-cycle latencies.
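A sketch of that unrolling (illustrative function name; unroll factor of 4, assuming n is a multiple of 4):

```c
/* Inner (j) loop unrolled by 4. The four multiply-add chains are
   independent, so the hardware can overlap their latencies. */
void unrolled_update(int m, int n, const float *A, float *B,
                     const float *C, const float *D)
{
    for (int i = 0; i < m; i++) {
        float ai = A[i];
        for (int j = 0; j < n; j += 4) {   /* assumes n % 4 == 0 */
            B[n*i + j    ] = ai * C[n*i + j    ] + D[j    ];
            B[n*i + j + 1] = ai * C[n*i + j + 1] + D[j + 1];
            B[n*i + j + 2] = ai * C[n*i + j + 2] + D[j + 2];
            B[n*i + j + 3] = ai * C[n*i + j + 3] + D[j + 3];
        }
    }
}
```

In practice a compiler may also vectorize this loop, but the unrolled form makes the available parallelism explicit.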