CS计算机代考程序代写 GPU cache Microsoft Word - Answers to Final.docx

Microsoft Word – Answers to Final.docx

Answers to Final Practice Exam A

1. (1)

6 x 2 = 12 bits of tag

(2)

i = 0: A[0]: miss; A[4]: miss; A[2]: miss;

i = 1: A[1]: miss; A[3]: hit; A[3]: hit;

i = 2: A[2]: hit; A[2]: hit; A[4]: miss;

So the hit ratio is 4/9 And the final cache contents: A[4]A[5] in cache block 0 and A[2]A[3] in block 1.

(3) 0x23, 0x22, 0x30

Final cache contents: 0x30-31 in block 0; 0x22-23 in block 1

(1) 1 + (4+1) x 8/2 = 21 bus cycles = 84 CPU cycles

(2) CPI = 1.6 + (0.5% + 10% x 15%) x 84 = 3.28

(3) (10 x 109) x 3.28 / (2 x 109) = 16.4 seconds

(4) 1 + (4+4) x 2 = 17 bus cycles

3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the

five stages. You then need to show that, for the instruction: At the ID stage, X2 and X3 are read from the

write back stage takes input from the ALU output and sends the result as X1 to the register file.

(1) latency:1, throughput: 4

(2) ADD X1, X2, X3

ADD X1, X4, X5

(3) L1: write-through, L2: write-back

(4) It does not check if an instruction is a conditional branch.

(5) fine-grain multithreading

(6) * each GPU has a large number of cores

* GPUs evolve faster than general purpose CPUs

1 1 6

offset cache

block no.

tag

Cache

Block

Answers to Final Practice Exam B

1. (1)

4 x 7 = 28 bits of tag

(2)

i = 0: A[0]: miss; A[8]: miss; A[4]: miss;

i = 1: A[1]: miss; A[9]: miss; A[5]: miss;

i = 2: A[2]: miss; A[10]: miss; A[6]: miss;

So the hit ratio is 0 %.

(3) 0x2A6, 0xA23, 0xA2B, 0xA20

Final cache contents: B20-B2F in set 0 way 0; A20-A2F in set 0 way 1

(1) 1+(9+1) x 2 = 21 bus cycles = 42 CPU cycles

(2) CPI = 1.2 + 5% x 42 + 40% x 10% x 42 = 4.98

(3) (20 x 109) x 4.98 / (500 x 106) = 199.2 seconds

(4) 1+9+2 = 12 bus cycles = 24 CPU cycles

3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the

five stages. You then need to add MUX and PC to the IF stage. You also need to add the hardware needed for

making branch decision and for calculating branch target address. As to the number of cycles for pipeline

stalling, it depends on the specific implementation you put down in your drawing. It is 2 if the branch decision

making/target address calculation is resolved at the EXE stage. It is 3 at the MEM stage.

(1) ADD X1, X1, X2

(2) LDUR X19, [X2, #0]

SUB X4, X19, X18

(3) Yes. It increases parallelism that can be explored by the processor.

(4) 2

(5) multiple sets of registers

(6) * short data types (8 or 16 bits)

* massive regular computations

4 1 7

offset set no. tag

Cache

Set

Way-0 Way-1

Answers to Final Practice Exam C

1. (1)

16 bits of tag

(2)

i = 0: A[0]: miss; A[8]: miss; A[4]: miss;

i = 1: A[1]: miss; A[9]: miss; A[5]: hit;

i = 2: A[2]: miss; A[10]: miss; A[6]: miss;

So the hit ratio is 1/9.

(3) 0xA6, 0x23, 0x2B, 0x20

Final cache contents: 0x20-23 in block 0; 0x24-27 in block 1; 0x28-2B in block 2

(1) 1+7+2 = 10 bus cycles = 40 CPU cycles

(2) CPI = 1.5 + 2% x 40 + 20% x 40% x 40 = 5.5

(3) (10 x 109) x 5.5 / (109) = 55 seconds

(4) 1+(7+1)*2 = 17 bus cycles = 68 CPU cycles

3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the

five stages. You then need to add a control unit to the ID stage.

For the WB stage, the MUX inputs from DM for the LDUR instruction. The MUX inputs from the ALU for the

ADDI instruction. The MUX input control does not matter for the STUR instruction.

(1) CBZ X1, loop

SUB X4, X19, X18

(2) ADD X1, X2, X3

ADD X4, X1, X5

(3) Access the row elements of the matrix sequentially.

(4) 10

(5) Fine MT: switch to a new thread every clock cycle

SMT: multiple threads are executed at the same time

(6) grid computing

Cache

Block

2 2 4

offset block no. tag

*** No answers are included for Practice Exam D ***

Answers to Final Practice Exam E

1. (1)

6 x 8 = 48 bits of tag

(2) i=0 => Read A[0]: miss; Read A[2]: miss; Write A[11]: miss

i=1 => Read A[1]: hit; Read A[3]: hit; Write A[10]: hit

i=2 => Read A[2]: hit; Read A[4]: miss; Write A[9]: miss

i=3 => Read A[3]: hit; Read A[5]: hit; Write A[8]: hit

So the hit ratio is 7/12.

(3) 0x110 and 0x126. The final cache contents: 0x110-0x11F in set 1, way-0; 0x020-0x02F in set 2, way-0;

0x120-0x12F in set 2, way-1.

2. (1) 1+ 11 + 8 = 20 bus cycles = 100 CPU cycles

(2) CPI = 1.35 + (1% + 20% x 20%) x 100 = 6.35

(3) 20 billion x 6.35 / 2 billion = 63.5 seconds

(4) 1 + 11 x 8/2 + 8 = 53 bus cycles.

3. You need to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major components of the

five stages. For the “BR X30” instruction, you then need to add a MUX and a PC to the IF stage. The MUX

has one input coming from PC+4 and the other input, i.e., X30, coming from the register file. There is one

penalty cycle for the BR instruction because the fetching of the next instruction cannot start until the completion

of the ID stage (of the BR instruction).

4. (1) Control Unit

(2) ADD X1, X2, X3

ADD X4, X5, X6

(3) IF ID EXE MEM WB

IF ID EXE MEM WB

An alternative is to draw a diagram that shows IM, Register File, ALU, DM, MUX as the major

components of a five stage pipe, but adding extra wires and ALU to show the ability to

fetch/decode/execute multiple instructions every clock cycle.

(4) The instruction results are not written to the register file or memory.

(5) 4-way set associative. It normally has less conflict because each memory block has four places to go

(instead of one in the case of direct-mapped cache).

(6) Clusters.

4 2 6

offset set no. tag

Cache

Set

Way-0 Way-1

Related Posts