Microsoft Word – Answers to Final.docx
Answers to Final Practice Exam A
1. (1)
6 x 2 = 12 bits of tag
(2)
i = 0: A[0]: miss; A[4]: miss; A[2]: miss;
i = 1: A[1]: miss; A[3]: hit; A[3]: hit;
i = 2: A[2]: hit; A[2]: hit; A[4]: miss;
So the hit ratio is 4/9 And the final cache contents: A[4]A[5] in cache block 0 and A[2]A[3] in block 1.
(3) 0x23, 0x22, 0x30
Final cache contents: 0x30-31 in block 0; 0x22-23 in block 1
2.
(1) 1 + (4+1) x 8/2 = 21 bus cycles = 84 CPU cycles
(2) CPI = 1.6 + (0.5% + 10% x 15%) x 84 = 3.28
(3) (10 x 109) x 3.28 / (2 x 109) = 16.4 seconds
(4) 1 + (4+4) x 2 = 17 bus cycles
3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the
five stages. You then need to show that, for the instruction: At the ID stage, X2 and X3 are read from the
register file. At the EXE stage, the ALU is configured as an adder. The DM is bypassed, while the MUX at the
write back stage takes input from the ALU output and sends the result as X1 to the register file.
4.
(1) latency:1, throughput: 4
(2) ADD X1, X2, X3
ADD X1, X4, X5
(3) L1: write-through, L2: write-back
(4) It does not check if an instruction is a conditional branch.
(5) fine-grain multithreading
(6) * each GPU has a large number of cores
* GPUs evolve faster than general purpose CPUs
1 1 6
offset cache
block no.
tag
Cache
Block
No
0
1
Answers to Final Practice Exam B
1. (1)
4 x 7 = 28 bits of tag
(2)
i = 0: A[0]: miss; A[8]: miss; A[4]: miss;
i = 1: A[1]: miss; A[9]: miss; A[5]: miss;
i = 2: A[2]: miss; A[10]: miss; A[6]: miss;
So the hit ratio is 0 %.
(3) 0x2A6, 0xA23, 0xA2B, 0xA20
Final cache contents: B20-B2F in set 0 way 0; A20-A2F in set 0 way 1
2.
(1) 1+(9+1) x 2 = 21 bus cycles = 42 CPU cycles
(2) CPI = 1.2 + 5% x 42 + 40% x 10% x 42 = 4.98
(3) (20 x 109) x 4.98 / (500 x 106) = 199.2 seconds
(4) 1+9+2 = 12 bus cycles = 24 CPU cycles
3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the
five stages. You then need to add MUX and PC to the IF stage. You also need to add the hardware needed for
making branch decision and for calculating branch target address. As to the number of cycles for pipeline
stalling, it depends on the specific implementation you put down in your drawing. It is 2 if the branch decision
making/target address calculation is resolved at the EXE stage. It is 3 at the MEM stage.
4.
(1) ADD X1, X1, X2
(2) LDUR X19, [X2, #0]
SUB X4, X19, X18
(3) Yes. It increases parallelism that can be explored by the processor.
(4) 2
(5) multiple sets of registers
(6) * short data types (8 or 16 bits)
* massive regular computations
4 1 7
offset set no. tag
Cache
Set
No
0
1
Way-0 Way-1
3
Answers to Final Practice Exam C
1. (1)
16 bits of tag
(2)
i = 0: A[0]: miss; A[8]: miss; A[4]: miss;
i = 1: A[1]: miss; A[9]: miss; A[5]: hit;
i = 2: A[2]: miss; A[10]: miss; A[6]: miss;
So the hit ratio is 1/9.
(3) 0xA6, 0x23, 0x2B, 0x20
Final cache contents: 0x20-23 in block 0; 0x24-27 in block 1; 0x28-2B in block 2
2.
(1) 1+7+2 = 10 bus cycles = 40 CPU cycles
(2) CPI = 1.5 + 2% x 40 + 20% x 40% x 40 = 5.5
(3) (10 x 109) x 5.5 / (109) = 55 seconds
(4) 1+(7+1)*2 = 17 bus cycles = 68 CPU cycles
3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the
five stages. You then need to add a control unit to the ID stage.
For the WB stage, the MUX inputs from DM for the LDUR instruction. The MUX inputs from the ALU for the
ADDI instruction. The MUX input control does not matter for the STUR instruction.
4.
(1) CBZ X1, loop
SUB X4, X19, X18
(2) ADD X1, X2, X3
ADD X4, X1, X5
(3) Access the row elements of the matrix sequentially.
(4) 10
(5) Fine MT: switch to a new thread every clock cycle
SMT: multiple threads are executed at the same time
(6) grid computing
Cache
Block
No
0
1
2
3
2 2 4
offset block no. tag
4
*** No answers are included for Practice Exam D ***
Answers to Final Practice Exam E
1. (1)
6 x 8 = 48 bits of tag
(2) i=0 => Read A[0]: miss; Read A[2]: miss; Write A[11]: miss
i=1 => Read A[1]: hit; Read A[3]: hit; Write A[10]: hit
i=2 => Read A[2]: hit; Read A[4]: miss; Write A[9]: miss
i=3 => Read A[3]: hit; Read A[5]: hit; Write A[8]: hit
So the hit ratio is 7/12.
(3) 0x110 and 0x126. The final cache contents: 0x110-0x11F in set 1, way-0; 0x020-0x02F in set 2, way-0;
0x120-0x12F in set 2, way-1.
2. (1) 1+ 11 + 8 = 20 bus cycles = 100 CPU cycles
(2) CPI = 1.35 + (1% + 20% x 20%) x 100 = 6.35
(3) 20 billion x 6.35 / 2 billion = 63.5 seconds
(4) 1 + 11 x 8/2 + 8 = 53 bus cycles.
3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the
five stages. For the “BR X30” instruction, you then need to add a MUX and a PC to the IF stage. The MUX
has one input coming from PC+4 and the other input, i.e., X30, coming from the register file. There is one
penalty cycle for the BR instruction because the fetching of the next instruction cannot start until the completion
of the ID stage (of the BR instruction).
4. (1) Control Unit
(2) ADD X1, X2, X3
ADD X4, X5, X6
(3) IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
An alternative is to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major
components of a five stage pipe, but adding extra wires and ALU to show the ability to
fetch/decode/execute multiple instructions every clock cycle.
(4) The instruction results are not written to the register file or memory.
(5) 4-way set associative. It normally has less conflict because each memory block has four places to go
(instead of one in the case of direct-mapped cache).
(6) Clusters.
4 2 6
offset set no. tag
Cache
Set
No
0
1
2
3
Way-0 Way-1