Modern Computer Architecture Practice Questions
Problem:
Suppose you are programming a processor with an add latency of 3 clock cycles and a multiply latency of 3 clock cycles. It is also given that this processor can complete one add and one multiply every clock cycle when instructions are fully pipelined. Consider the following loop:
for (i = 0; i < m; i++) {
    A[i] = B[i] * C[i] + D[i] * E[i];
}
1. Assuming the program is executed as-is (i.e. no pipelining), what is the lower bound on execution time (in clock cycles) based on floating point arithmetic performed?
2. How can you exploit more instruction level parallelism in this program? What changes do you propose?
3. Assuming you can pipeline the adds and multiplies, what would be the lower bound on execution time in clock cycles for the arithmetic?
Solution:
1. 2 mults, 1 add => 3 * 3 = 9 cycles / iteration, or 9m clock cycles for the loop.
2. Apply loop unrolling; coupled with out-of-order execution and pipelining, this will enhance instruction-level parallelism.
3. Still 2 mults, then 1 add, because of the dependencies. Assuming efficient loop unrolling plus data and instruction prefetching, the multiply unit has 2 mults issued per iteration and becomes the bottleneck. Therefore the execution time will be approximately 2m cycles.
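The loop unrolling suggested in part 2 can be sketched as follows. This is a minimal illustration (the function name and unroll factor of 4 are arbitrary choices, not from the problem): each pass of the main loop issues four independent multiply-add groups, so the pipelined multiply unit can accept a new operation every cycle instead of waiting out the 3-cycle latency.

```c
#include <stddef.h>

/* Sketch: the problem's loop, unrolled by a factor of 4.  The four
 * iterations in the body are independent, so their multiplies can be
 * in flight in the pipeline simultaneously. */
void fma_loop_unrolled(double *A, const double *B, const double *C,
                       const double *D, const double *E, size_t m) {
    size_t i = 0;
    /* Main unrolled body: 4 independent iterations per pass. */
    for (; i + 4 <= m; i += 4) {
        A[i]     = B[i]     * C[i]     + D[i]     * E[i];
        A[i + 1] = B[i + 1] * C[i + 1] + D[i + 1] * E[i + 1];
        A[i + 2] = B[i + 2] * C[i + 2] + D[i + 2] * E[i + 2];
        A[i + 3] = B[i + 3] * C[i + 3] + D[i + 3] * E[i + 3];
    }
    /* Remainder loop for m not divisible by 4. */
    for (; i < m; i++)
        A[i] = B[i] * C[i] + D[i] * E[i];
}
```

In practice an optimizing compiler often performs this transformation itself; writing it out simply makes the independent work visible.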
Problem:
How long would you expect the following loop to take on a system with a 2 GHz Intel Processor, a 6MB last-level cache, a memory bandwidth of 10 GB/s, and a memory latency of 75ns per cache line? Assume that A is an integer (4 bytes/word) array with its elements having random values between 0 and 1000000000.
n = 1000000000;
for (int i = 1; i < n; i++) {
    int j = A[i];
    sum = sum + A[j];
}
Solution:
Assume an add throughput of 1 op/clock cycle.
2 GHz => 0.5 ns clock cycles
10^9 adds => 10^9 * 0.5 * 10^-9 s = 0.5 s
All accesses to A[i] are unit-stride and regular:
o 4 * 10^9 bytes are fetched from memory
o (4 * 10^9 B) / (10 * 10^9 B/s) = 0.4 s
(Nearly) all accesses to A[j] are cache misses, since j is a random value and the array size is 4 GB while the cache size is only 6 MB:
o 10^9 cache misses => 10^9 * 75 * 10^-9 s = 75 s
The latency-bound random accesses dominate: total time ≈ 0.5 + 0.4 + 75 ≈ 76 s.
Problem:
Explain the distinction between pipelining and multithreading.
Solution:
Pipelining, or instruction pipelining, improves overall instruction throughput by performing subtasks of different instructions concurrently. Instructions are split into a sequence of elementary subtasks that take roughly the same amount of time. Multithreading, or hardware multithreading, refers to hardware support for concurrent execution of multiple software threads: most hardware resources are shared among the threads, but some (such as the architectural register state) are replicated. Both pipelining and multithreading improve instruction throughput rather than reducing the latency of any individual instruction.
Problem:
Suppose you are programming a processor with an add latency of 3 clock cycles and a multiply latency of 5 clock cycles. It is also given that this processor can complete one add and one multiply instruction every clock cycle, when instructions are fully pipelined. Consider the following loop:
for (i = 0; i < m; i++) {
    A[i] = B[i] * C[i] + D[i] + E[i];
}
1. Assume the program is executed as-is (i.e. no pipelining), what is the lower bound on execution time (in clock cycles) based on floating point arithmetic performed?
2. How can you exploit more instruction level parallelism in this program? What changes do you propose?
3. Assuming you can pipeline the adds and multiplies, what would be the lower bound on execution time in clock cycles for the arithmetic?
Solution:
1. 2 adds, 1 mult => 2 * 3 + 5 = 11 clock cycles / iteration, or 11m clock cycles for the loop.
2. Apply loop unrolling; coupled with out-of-order execution and pipelining, this will enhance instruction-level parallelism.
3. There are 2 adds and 1 mult per loop iteration, so the add unit will be the bottleneck; with pipelining and loop unrolling, the execution time will be approximately 2m cycles.
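One way to shorten the dependence chain in this loop body is to reassociate the adds, a sketch under the assumption that the small floating-point rounding differences this introduces are acceptable (the function name is illustrative). As written, `B[i] * C[i] + D[i] + E[i]` evaluates left to right, so the two adds form a serial chain behind the multiply; computing `D[i] + E[i]` separately lets the add unit work in parallel with the multiply unit, leaving only one dependent add at the end.

```c
#include <stddef.h>

/* Sketch: reassociating ((B*C) + D) + E into (B*C) + (D + E).
 * The sum D[i] + E[i] is independent of the multiply B[i] * C[i],
 * so the add and multiply units can operate concurrently.
 * Note: FP addition is not associative, so results may differ in
 * the last bits from the original expression. */
void loop_reassociated(double *A, const double *B, const double *C,
                       const double *D, const double *E, size_t m) {
    for (size_t i = 0; i < m; i++) {
        double prod = B[i] * C[i];  /* multiply unit */
        double sum  = D[i] + E[i];  /* add unit, overlaps the multiply */
        A[i] = prod + sum;          /* single dependent final add */
    }
}
```

Compilers only apply this kind of reassociation when explicitly allowed to relax FP semantics (e.g. with "fast math"-style flags), which is why writing it by hand can matter here.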