程序代写代做代考 cache C assembly clock EECS 370 Final Exam Fall 2018 SOLUTIONS

EECS 370 Final Exam Fall 2018 SOLUTIONS
Notes:
● Closed book.
● You are allowed one cheat sheet (notes on both sides allowed)
● 10 problems on 16 pages​. Count them to be sure you have them all.
● Calculators without wireless are allowed, but no PDAs, Portables, Cell phones, etc.
● This exam is fairly long; ​don’t spend too much time on any one problem​.
● You have ​120 minutes​ for the exam.
● Some questions may be more difficult than others. You may want to skip around.
● Partial credit cannot be given if work is not shown​.
● Write legibly and dark enough for the scanners to read your work​.
Write your uniqname on the line provided at the top of each page​.
You are to abide by the University of Michigan/Engineering honor code. Please sign below to
signify that you have kept the honor code pledge:
I have neither given nor received aid on this exam, nor have I concealed any violations of the Honor Code.​
1 of 16
Signature:
Name:
Uniqname:
First and last name of person sitting to your ​right​ (write the symbol ⊥ if at the end of a row): _________________________________________
_________________________________________
_________________________________________
_________________________________________
First and last name of person sitting to your ​left​ (write the symbol ⊥ if at the end of a row): _________________________________________

2 of 16 uniqname: ______________________
Problem 1. True / False Points: ___/13
True False
a. A 4-bit 2’s complement number can represent values [-8, 8].
b. Branch prediction with our 5-stage pipeline will always have a lower CPI than detect and stall.
c. Caller and callee save registers are ​not​ defined by the hardware architecture.
d. Sign extension can be used on both loads and stores to preserve the sign (negative or positive) of data.
e. The number 1.0 * 2^(-127) cannot be represented in 32-bit IEEE-754 floating point format.
f. When an integer is loaded from memory into a register, endian-ness defines the order bytes are placed in the register.
g. A control ROM is sequential logic, not combinational.
h. Function parameters passed in memory are always stored on the stack.
i. LRU is a concept used to maximize temporal locality within a direct mapped cache.
j. The relocation table of an object file contains the location of references to symbols in the text section that have to be fixed up during linking.
k. Write-through, no-allocate caches make better use of temporal locality than write-back, write-allocate caches.
l. The virtual address space is limited in size by the amount of physical memory (DRAM) on a computer.
m. Adding an extra stage to a pipelined processor will greatly increase the CPI of a 100,000 instruction test bench. ​Assume the clock period, number of hazards, and stalls due to hazards are unchanged.

3 of 16 uniqname: ______________________
Problem 2: LC2K Assembly Points: ___/15
Convert the following C-code to LC2K assembly. For the assembly code, assume all registers are initialized to 0 and the arrays ​arr1 ​and​ arr2​ are stored somewhere else in memory.
Hints:
● ind1→R3,ind2→R4,num→R5,2147483647==0x7FFFFFFF
● To do ​X < Y comparison in assembly, compute ​–Y; add ​X + (–Y); extract the signbit of ​X – Y​ and compare to ​0​. C-code LC2K assembly int size = 5; int arr1[5] = int arr2[5] = int dest[10]; int ind1 = 0, int num = 0; while (num != if (ind2 == (ind1 != {1, 3, 5, 7, 9}; {2, 4, 6, 8, 10}; ind2 = 0; (size * 2)){ size || size && arr1[ind1] < arr2[ind2])) { dest[num] = arr1[ind1]; ind1++; num++; } else { dest[num] = arr2[ind2]; ind2++; num++; } } 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 lw lw whileadd beq if beq beq 0 1 one 0 2 size 2 ​2 ​ 7 5 7 end 4 2 ​true ​3 ​ 2 ​else ​3 ​ 6 arr1 ​4 ​ 7 arr2 7 ​7 ​ 7 1 7 7 lw lw nor add add 6 ​7 ​ 6 lw nor beq else lw sw add add 1 beq 0 true lw 3 sw 5 add 1 add 1 beq 0 end halt one .fill 1 size .fill 5 0 7 mask ​6/7​ ​7/6​ 6 6 0 ​true 4 6 arr2 ​5 ​ 6 dest 1 4 4 5 5 0 ​while ​7 ​ arr1 7 ​dest ​3 ​ 3 5 5 0 ​while mask .fill dest .fill ... 2147483647 0 4 of 16 uniqname: ______________________ Problem 3: Repeat Performance Points: ___/10 You are given a benchmark test case with the following makeup: 20% beq 30% lw 10% noop 10% sw 30% add Assume noop takes 2 cycles in multicycle datapath. In this benchmark, for a pipeline datapath, assume ​detect-and-stall is used for all data hazard resolution and speculate-and-squash i​ s used for all control hazard resolution. Also, note that: ● 40% of the beq instructions are incorrectly predicted ● 30% of the lw instructions are immediately followed by a dependent instruction ● 30% of the add instructions are immediately followed by a dependent instruction Memory Read: 15 ns Memory Write: 20 ns Register read/write: 10 ns ALU: 5 ns All other operations: 0 ns Assume that there is internal register forwarding, the pipeline has five stages, and instructions are read from memory. a. What is the clock period (in nanoseconds) for the following datapaths? Your answers must be numbers, ​not​ numeric equations. ● SingleCycle: 55ns(lw:15+10+5+15+10) ● Multicycle: 20ns ● Pipelined: 20ns b. What is the CPI of the benchmark on the following datapaths? Your answers may be numeric equations. Simplify them as much as possible​. ● Single Cycle: ● Multicycle: ● Pipelined: 1 CPI 4.1 CPI = 2(0.10) + 4(0.60) + 5(0.30) 1.6 CPI = 1 + 3(0.20)(0.40) + 2(0.60)(0.30) c. A new memory upgrade comes out that reduces the latency of memory reads to 5 ns. All other latencies are unchanged. Now what is the clock period (in nanoseconds) for the following datapaths? Your answers must be numbers, ​not​ numeric equations. ● SingleCycle: 40ns(sw:5+10+5+20) ● Multicycle: 20ns ● Pipelined: 20ns 5 of 16 uniqname: ______________________ Problem 4. Detecting Hazards in LC2K Points: ___/15 We decide to use the following program as a benchmark for our 5-stage LC2K pipeline. Our pipeline uses detect-and-forward to handle data hazards and speculate and squash (predict not-taken) to handle control hazards. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 end sw halt lw 0 1 lw 0 2 lw 0 3 posOne a b end 4 4 5 mask 5 6 6 less 4 4 2 loop 3 loop result # Load +1 into reg 1 # Load a into reg 2 # Load b into reg 3 # If a == b, then branch # Negate a (i) # Negate a (ii) # Calculate b - a # Determine if a > b (i) # Determine if a > b (ii) # Determine if a > b (iii) # Determine if a > b (iv) # If a <= b, then branch # Negate b (i) # Negate b (ii) #Seta=a-b # Branch back to loop # Set b = b - a # Branch back to loop # Store a into memory # End program # 0x8000 loop beq nor 2 2 add 4 1 add 4 3 lw 0 6 nor 5 5 nor 6 6 nor 5 6 beq 0 6 nor 3 3 add 4 1 add 2 4 beq 0 0 less add beq 0 0 0 2 posOne .fill 1 mask .fill a .fill 18 b .fill 12 result .fill 1 2 3 3 4 -32768 a. How many stall cycles are added due to data hazards throughout the lifetime of the program? b. How many stall cycles are added due to control hazards throughout the lifetime of the program? 1 cycle 12 cycles c. This benchmark executes 30 ​instructions​ in total. How many cycles​ does this program take to execute (including stall cycles and loading the pipeline)? 1 + 12 + 30 + 4 = 47 cycles d. Given that this benchmark executes 30 instructions, what is the CPI of this benchmark? 47/30 = 1.57 CPI e. Write two lines we could swap with each other to reduce the total number of stalls that would occur due to data hazards. Lines 1 & 3 6 of 16 uniqname: ______________________ Problem 5. Do You C What I C? Points: ___/15 Consider an 8-bit byte-addressable architecture that uses a ​32-byte, 2-way set-associative cache with a block size of ​8 bytes.​ The cache is initially empty, and uses LRU replacement policy. Fill out the following table for the given sequence of memory accesses. If the access is a hit, fill in the “Hit” bubble. If the access is a miss, indicate the type of miss (Compulsory / Capacity / Conflict) by filling in the bubble in the correct column. Misses Address Hit Compulsory Capacity Conflict 0x3E 0xAE 0x3C 0x35 0xB3 0x77 0x30 0x7F 0xAF 0x70 0xB4 7 of 16 uniqname: ______________________ Problem 6. Level Three Points: ___/15 Consider a 64-bit byte-addressable system that supports virtual memory with the following specifications: ● Apageis2KB. ● The page table is hierarchical with 3 levels. ● The first-level occupies 4 pages of memory. ● Each second-level occupies 8 pages of memory. ● A page table entry is 8 B and is the same for all levels. ● 32 GB of physical memory is installed. ● Virtual addresses are 64 bits 1. How many bits are used for the page offset? Ans: 2. How many virtual pages exist in this system? How many physical pages exist? # Virtual Pages: # Physical Pages: 3. How many bits in a virtual address are used to index the first-level page table? Ans: 11 bits 25​ 3​ pages 22​ 4​ pages 10 bits 4. How many bits in a virtual address are used to index a second-level page table? Ans: 11 bits 5. How many bits in a virtual address are used to index a third-level page table? Ans: 32 bits 6. How much memory does this page table system occupy in the worst case? ​Express your final answer i​ n bytes​ and in powers of 2. Do not simplify (e.g. if your answer is 2x​ ​ + 2y​ ​, leave it at that). Ans: 21​ 3​ +2​24​ +25​ 6​ B 7. If this system used a single-level page table instead of a three-level page table, how much memory would it occupy? Express your final answer in bytes and in powers of 2. Do not simplify (e.g. if your answer is 2x​ ​ + 2y​ ​, leave it at that). A n s : 2 5​ 6 ​ B 8 of 16 uniqname: ______________________ Problem 7. Historical Eviction Points: ___/20 It’s 1987 and Apple has just released the Macintosh II featuring a Motorola 68030 processor. The 68030 contains a state of the art ​64 byte data-cache ​and uses ​byte-addressable memory​. Having taken 370, you know there’s more to cache design than just size, so you write a C program to help you determine the block size, number of blocks per set, and number of sets in the cache. ​Assume all variables other than the data[ ] array are stored in registers and that the cache has been cleared before each call to miss_rate(...). #define CACHE_SIZE ​XX double miss_rate(int stride) { char data[CACHE_SIZE * 2]; /* note array is 2x cache size */ int misses = 0; for (int i = 0; i < stride; i++) for (int j = i; j < (CACHE_SIZE * 2); j += stride) if (ACCESS(data[j]) == CACHE_MISS)) misses++; return (misses / (CACHE_SIZE * 2)); } a. In order to test your code, you first run it on a ​32 B fully-associative​ cache with ​8 B blocks​. You set the ​CACHE_SIZE ​macro to​ 32​. Fill in the miss rate for each stride in the following table. ​Only answers in the table will be graded. Fractions are OK. Stride Miss Rate 1 0.125 2 0.25 4 0.5 8 1 16 0.125 32 0.125 9 of 16 uniqname: ______________________ b. You now run your program on the Motorola 68030 processor with a ​64 byte data-cache​. You set the ​CACHE_SIZE ​macro to​ 64​ and collect the following data: Stride Miss Rate 1 0.25 2 0.50 4 1.00 8 1.00 16 1.00 32 1.00 64 0.25 Fill in one bubble completely for each question. 64 32 16 8 4 2 1 What is the block size in bytes? What is the number of blocks per set? What is the number of sets? 10 of 16 uniqname: ______________________ Problem 8. Modified Pipeline Datapath Points: ___/20 We once again want to upgrade our datapath to run the new LC2K instruction ​madd​. The instruction has the following functionality: madd regA regB offset # ​regB = regB + Mem[regA + offset] To accommodate this new instruction, the pipeline has been expanded to 6 stages: Fetch, Decode, Execute1, Memory, Execute2, Writeback​. a. What values from the ​EX1/MEM​ pipeline registers and D​ ata memory​ need to be sent to the ​MEM/EX2​ pipeline register in order to allow for correct execution? ● Select up to four connections to be made to the ​MEM/EX2​ stage ● Fill in the bubble next to any values you want to send to ​MEM/EX2 ALUresult valB mdata eq? target 11 of 16 uniqname: ______________________ b. Using the values you assigned in part a, make connections to the components in the Execute2​ stage and the ​WriteData​ block in the ​EX2/WB​ pipeline register. ● Up to 4 connections can be made to ​MUX​ (may not need all) ● Up to 2 connections can be made to ​ALU​ (may not need all) ● Possible connections are ​ALUout​, ​MUXout​, and any connections you made in part a. ALUout ALUresult mdata ___________​ to ​MUXin ​ to ​MUXin ​ to ​MUXin ​ to ​MUXin mdata valB MUXout ​ to ​ALUin ​ to ​ALUin ​ to ​WriteData c. Assuming all connections are correctly done (independent of part a/b), fill out the timing diagram for the following program. Assume data hazards are handled using detect-and-forward, with stalls are inserted when necessary. Use the symbol * for stalls. lw 0 lw 0 add 1 madd 0 add 3 halt five .fill 5 six .fill 6 seven .fill 7 eight .fill 8 nine .fill 9 ten .fill 10 1 2 2 3 3 five six 3 ten 4 #lw1 #lw2 #add1 # madd #add2 # halt Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 lw1 IF ID EX1 MEM EX2 WB lw2 IF ID EX1 ME M EX2 WB add1 IF * ID EX1 ME M EX2 WB madd IF ID EX1 ME M EX2 WB add2 IF * * ID EX1 ME M EX2 WB halt IF ID EX1 ME M EX2 WB 12 of 16 uniqname: ______________________ Problem 9. Virtual Memory + Data Cache A. Consider a system with the following specs: ○ 1 MB physical memory ○ 32-bit virtual address ○ Single-level Page Table ■ 64 KB page size ○ 64 KB Data-cache ■ Direct Mapped ■ 4 KB block size Points: ___/15 Determine the cache tag, set index, and block offset for virtual address 0x0EEC5370 if the system uses a virtually addressed data-cache. Then, do the same for a physically addressed data-cache. ​Answers must be in hexadecimal. Assume the virtual page for this address maps to physical page ​0x7​. a. Virtually addressed data-cache ○ Tag: ○ Set Index: ○ Block Offset: 0x​0EEC 0x​5 0x​370 b. Physically addressed data-cache ○ Tag: 0x​7 ○ Set Index: 0x​5 ○ Block Offset: 0x​370 13 of 16 uniqname: ______________________ B. Consideravirtualmemorysystemwith​two-levelhierarchicalpage-table​,TLB,​virtually addressed​ data-cache, and the following specs: ○ TLB ● Access time: 1 cycle ○ Data-Cache ● Access time: 1 cycle ○ Memory ● Access time: 30 cycles ○ Disk ● Access time: 100,000 cycles ○ NOTES ● The TLB and Data-cache are ​accessed in parallel ● On a 1st level page fault, the creation of a 2nd level page table occurs in parallel with accessing the disk ● The page table cannot be cached and is available in main memory ● Pages that aren’t in memory are on disk ● Upon retrieval from cache, main memory or disk, the data is sent immediately to the CPU, while other updates occur in parallel ● What are the possible memory access latencies, in cycles, for the following situations? For the following three questions, there may be one or more answers. Include all possible answers. Your answer(s) must be a number, not a numeric equation. a. When data is in the data-cache? Ans: 1 cycle b. When data must be fetched from memory? Ans: 31 and 91 cycles Hit TLB: 1 (dcache/tlb) + 30 (memory) Miss TLB, hit page table: 1 (dcache/tlb) + 30 (1st level pt) + 30 (2nd level pt) + 30 (memory) c. When data must be fetched from disk? Ans: 100,031/100,061 cycles miss in 1st level page table: 1 (dcache/tlb) + 30 (1st level pt) + 100,000 (disk) miss in 2nd level page table: 1 (dcache/tlb) + 30 (1st level pt) + 30 (2nd level pt) + 100,000 (disk) 14 of 16 uniqname: ______________________ Problem 10. The Lost Page Table Points: ___/20 A process, PID=739, just crashed, and we would like to recover some of its data. In order to recover its memory, we need ​you​ to find the value of the p​ age table base register​. Right before the crash, a portion of physical memory (bytes given ​in hexadecimal) ​was saved: Address Data from

to

+0
+1
+2
+3
+4
+5
+6
+7
+8
+9
+A
+B
+C
+D
+E
+F
0x30
89
DF
EE
83
2F
AA
B2
C7
EF
A9
2A
33
85
01
A0
29
0x40
22
46
BE
4D
32
2F
B9
CC
AB
13
90
60
24
A0
D8
34
0x50
5A
B7
D4
50
E9
A8
24
38
A7
E9
13
8B
6A
3B
22
41
0x60
46
D9
B7
C5
88
91
25
67
44
22
EE
FF
BB
A2
1F
7D
0x70
16
95
2F
B0
E7
E9
B2
10
19
14
BE
C0
FF
EE
25
66
0x80
78
55
20
10
14
D0
3A
3F
20
6D
3B
C5
D4
A9
FC
10
0x90
00
07
0F
A2
0A
0C
6F
88
16
14
13
08
00
C3
21
6B
0xA0
31
26
2B
45
4F
35
8A
09
CC
4D
44
0B
4A
C9
8A
89
0xB0
A0
BE
3B
29
85
86
75
30
16
BA
EE
C5
0A
24
08
09
Figure 1: Physical Memory before crash (bytes given in hexadecimal)
Process 739’s memory accesses (from process start to crash) are reproduced below. You may ​assume that:
● no other processes are running in the system.
● process 739 had no other memory accesses besides those in Figure 2
Hint: When the process started, its page table was empty.
Time (ms) before crash 13
8
3
1
Virtual Address 0x8A 0x92 0x83 0xAC
Page Fault Yes Yes
No Yes
Data Retrieved 0xEE 0xBE 0xC5 0x85
Figure 2: Virtual Memory Address accesses from process 739

15 of 16 uniqname: ______________________
You are given the following clues:
● Virtual Addresses are 10 bits (10-bit byte addressable system)
● 256 bytes of physical memory
● Single-level page table
● All sizes are powers of 2
● Page table entries are 8 bits (1 byte) and of the form:
#unusedbits + #PPN bits = 7
1. What is the page size?
2. How many page table entries are in the page table?
3. How many bits is a physical page number (PPN)?
Ans: 16 B Ans: 26​ ​ = 64 entries Ans: 4 bits
← 8 bits →



4. How are Virtual Addresses and Physical Addresses composed? Recall that these can be broken into [VPN || Page Offset] and [PPN || Page Offset], respectively. Identify the number of bits for each below:
6 bits
4 bits
Virtual Page Number (VPN)
Page Offset
4 bits
4 bits
Physical Page Number (PPN)
Page Offset
5. For each of the 4 virtual address accesses, write down the Physical Page Number (PPN)
Time (ms)
Virtual Address
Page Fault
Data Retrieved
PPN
13
0x8A
Yes
0xEE
0x​6
8
0x92
Yes
0xBE
0x​4
3
0x83
No
0xC5
0x​6
1
0xAC
Yes
0x85
0x​3
6. What is the value of the Page Table Base Register (PTBR), and thus the address of the start of the page table?
Starting Address of Page Table: 0x90

16 of 16 uniqname: ______________________
Problem 10 Explanation:
1. Know page size is >= 16 b/c 0x83 does not cause page fault. Know page size < 32 since 0x92 does result in a page fault 2. Page size = 16b -> Page offset = 4 bits. # PTE entries = VPN bits = VA – page offset
= 2^6 entries
3. P.A.=log(256)=8bits.P.A.-pageoffset=8-4=4bits.
4. (follows from 1-3)
5. Asdf
a. 0x8A -> look for “A” column where data is 0xEE -> row 0x60 or 0xB0
b. 0x92 -> look for “2” column where data is 0xBE -> row 0x40
c. 0x83 -> look for “3” column where data is 0xC5 -> row 0x60 (so we know
0x8A is in row 0x60 too, since a,c are on the same page)
d. 0xAC -> look for “C” column where data is 0x85 -> row 0x30
6. Look for row where offset 8,9,A are 6, 4, 3, respectively. This happens in row 0x90.