CSci 4203/EE4367, Spring 2021 Homework Assignment III (Issued March 30, 2021)
Instructions:
1. You can type your solutions into this MS Word file, or write your solutions by hand, scan them, and upload a PDF file.
2. Label your assignment with your name and UMN email address
3. Submit assignments via Canvas course web page. Note: No late submissions accepted.
4. Due Date: 11:59 pm Thursday 4/27/2021 (Central Standard Time)
5. For each problem, please show all of your work to get full credit. Without shown work, up to half of the total credit may be deducted.
6. Due to TA resource limitations, we will be able to grade only a subset of the assigned
problems (same subset for everyone).
7. Homework must be done individually. Please pay specific attention to the “Student
Academic Integrity and Scholastic Dishonesty” noted in the syllabus. Failing the class because of cheating on a homework is not worth it.
Problem 1 From Homework #2 Problem 5 (Multiple-Issue Processors, Chapter 4)
This exercise compares the performance of a 1-issue and a 2-issue statically-scheduled processor with a design as in Figure 4.69 (in textbook) shown below. Note that there is no hazard detection unit and no data forwarding unit in either processor. So, it is the responsibility of the programmer or the compiler to insert nops (if needed), and to statically schedule the instructions for better performance (i.e. to reduce the number of nops). In the 2-issue processor, one instruction can be an ALU or branch instruction, and the other can be a load or store instruction. Note that nops need to be filled in if there is no instruction available for a particular slot. Assume we have the following C program.
for (j=0; j < m; j++)
y[j] = x[j] + z[j-1];
When translating into MIPS code, assume that variables are kept in registers as shown in the following table, and that all registers except those assigned in the following table (i.e. r6 – r15) can be used as free registers to allow register renaming in the unrolled loop.
Variable:  j   m   x   y   z
Register:  r4  r5  r1  r2  r3
A. Translate this C code directly into MIPS instructions, i.e. without considering hazards.
B. Insert nops into the code from (A) for the 1-issue processor described above (i.e. no hazard detection unit and no data forwarding unit). Try to rearrange the instructions to achieve better performance, i.e. try to minimize the number of nops needed. Assuming we have a perfect branch predictor, how many cycles are needed to execute one iteration of the loop using your code?
C. Unroll the loop two times, and try to schedule the instructions on the 2-issue statically-scheduled processor as described above (as in Figure 4.69). Assuming we have a perfect branch predictor, how many cycles are needed to execute one iteration of the unrolled loop (i.e. two iterations of the original code) using your code? Please present your code schedule similar to that in Figure 4.71 (in textbook) as shown below, in which the empty slots are nops.
D. Assume both the 1-issue and the 2-issue processor have perfect branch prediction. Also, assume the loop has 1000 iterations (i.e. m = 1000). What is the speedup we can achieve by going from a 1-issue to a 2-issue processor? Note that there will be only 500 iterations of the unrolled loop (since we unroll the loop two times).
Solutions:
A. Note: There could be multiple valid code sequences for the loop. We just show one of them here.

      addi r4, $zero, 0        // s1: Initialize j
LOOP: beq  r4, r5, EXIT        // s2: if (j == m), loop is done
      sll  r6, r4, 2           // s3: j x 4
      add  r6, r6, r1          // s4: calculate addr of x[j]
      lw   r6, 0(r6)           // s5: Load x[j]
      addi r11, r4, -1         // s6: j-1
      sll  r11, r11, 2         // s7: (j-1) x 4
      add  r11, r11, r3        // s8: calculate addr of z[j-1]
      lw   r11, 0(r11)         // s9: Load z[j-1]
      add  r11, r6, r11        // s10: x[j] + z[j-1]
      sll  r12, r4, 2          // s11: j x 4
      add  r12, r12, r2        // s12: calculate addr of y[j]
      sw   r11, 0(r12)         // s13: y[j] <-- x[j] + z[j-1]
      addi r4, r4, 1           // s14: j++
      beq  $zero, $zero, LOOP  // s15: Iterate loop
EXIT:

B. Note: There could be multiple valid code sequences for the loop. We just show one of them here.

      addi r4, $zero, 0        // s1: Initialize j
      nop
      nop
LOOP: beq  r4, r5, EXIT        // s2: if (j == m), loop is done
      addi r11, r4, -1         // s6: j-1
      nop                      // nop for r11 in s6
      sll  r6, r4, 2           // s3: j x 4
      sll  r11, r11, 2         // s7: (j-1) x 4
      nop                      // nop for r6 in s3
      add  r6, r6, r1          // s4: calculate addr of x[j]
      add  r11, r11, r3        // s8: calculate addr of z[j-1]
      nop                      // nop for r6 in s4
      lw   r6, 0(r6)           // s5: Load x[j]
      lw   r11, 0(r11)         // s9: Load z[j-1]
      sll  r12, r4, 2          // s11: j x 4
      nop                      // nop for r11 in s9
      add  r11, r6, r11        // s10: x[j] + z[j-1]
      add  r12, r12, r2        // s12: calculate addr of y[j]
      addi r4, r4, 1           // s14: j++
      nop                      // nop for r12 in s12
      sw   r11, 0(r12)         // s13: y[j] <-- x[j] + z[j-1]
      beq  $zero, $zero, LOOP  // s15: Iterate loop
EXIT:

With a perfect branch predictor, one iteration of the loop (the 19 instructions from s2 through s15, including the five nops) takes 19 cycles.
C. After unrolling the loop twice and renaming registers in the second copy (r6 -> r7, r11 -> r13, r12 -> r14), we have the following loop:

      addi r4, $zero, 0        // Initialize j
LOOP: beq  r4, r5, EXIT        // s1: if (j == m), loop is done
      sll  r6, r4, 2           // s2: j x 4
      add  r6, r6, r1          // s3: calculate addr of x[j]
      lw   r6, 0(r6)           // s4: Load x[j]
      addi r11, r4, -1         // s5: j-1
      sll  r11, r11, 2         // s6: (j-1) x 4
      add  r11, r11, r3        // s7: calculate addr of z[j-1]
      lw   r11, 0(r11)         // s8: Load z[j-1]
      add  r11, r6, r11        // s9: x[j] + z[j-1]
      sll  r12, r4, 2          // s10: j x 4
      add  r12, r12, r2        // s11: calculate addr of y[j]
      sw   r11, 0(r12)         // s12: y[j] <-- x[j] + z[j-1]
      addi r4, r4, 1           // s13: j++
      // start of the next iteration
      sll  r7, r4, 2           // s14: j x 4
      add  r7, r7, r1          // s15: calculate addr of x[j]
      lw   r7, 0(r7)           // s16: Load x[j]
      addi r13, r4, -1         // s17: j-1
      sll  r13, r13, 2         // s18: (j-1) x 4
      add  r13, r13, r3        // s19: calculate addr of z[j-1]
      lw   r13, 0(r13)         // s20: Load z[j-1]
      add  r13, r7, r13        // s21: x[j] + z[j-1]
      sll  r14, r4, 2          // s22: j x 4
      add  r14, r14, r2        // s23: calculate addr of y[j]
      sw   r13, 0(r14)         // s24: y[j] <-- x[j] + z[j-1]
      addi r4, r4, 1           // s25: j++
      beq  $zero, $zero, LOOP  // s26: Iterate the loop
EXIT:

After scheduling the above loop to reduce the number of nops, we have the following loop. We ignore the initialization of j. (Note: There could be multiple valid code sequences for the loop. We just show one of them here.)
LOOP: beq  r4, r5, EXIT        // s1: if (j == m), loop is done
      sll  r6, r4, 2           // s2: j x 4
      addi r11, r4, -1         // s5: j-1
      sll  r12, r4, 2          // s10: j x 4
      addi r4, r4, 1           // s13: j++
      add  r6, r6, r1          // s3: calculate addr of x[j]
      sll  r11, r11, 2         // s6: (j-1) x 4
      addi r13, r4, -1         // s17: j (r4 now holds j+1)
      sll  r7, r4, 2           // s14: (j+1) x 4
      lw   r6, 0(r6)           // s4: Load x[j]
      add  r11, r11, r3        // s7: calculate addr of z[j-1]
      sll  r13, r13, 2         // s18: j x 4
      add  r12, r12, r2        // s11: calculate addr of y[j]
      add  r7, r7, r1          // s15: calculate addr of x[j+1]
      lw   r11, 0(r11)         // s8: Load z[j-1]
      add  r13, r13, r3        // s19: calculate addr of z[j]
      sll  r14, r4, 2          // s22: (j+1) x 4
      add  r11, r6, r11        // s9: x[j] + z[j-1]
      lw   r7, 0(r7)           // s16: Load x[j+1]
      add  r14, r14, r2        // s23: calculate addr of y[j+1]
      lw   r13, 0(r13)         // s20: Load z[j]
      sw   r11, 0(r12)         // s12: y[j] <-- x[j] + z[j-1]
      add  r13, r7, r13        // s21: x[j+1] + z[j]
      addi r4, r4, 1           // s25: j++
      sw   r13, 0(r14)         // s24: y[j+1] <-- x[j+1] + z[j]
      beq  $zero, $zero, LOOP  // s26: Iterate the loop
EXIT:

The schedule below is shown in the format of Figure 4.71; empty slots are nops. One iteration of the unrolled loop (i.e. two iterations of the original loop) takes 23 cycles.

ALU/Branch Instr               Data Transfer Instr     Clock Cycle
s1:  beq r4, r5, EXIT                                  1
s2:  sll r6, r4, 2                                     2
s5:  addi r11, r4, -1                                  3
s10: sll r12, r4, 2                                    4
s13: addi r4, r4, 1                                    5
s3:  add r6, r6, r1                                    6
s6:  sll r11, r11, 2                                   7
s17: addi r13, r4, -1                                  8
s14: sll r7, r4, 2             s4:  lw r6, 0(r6)       9
s7:  add r11, r11, r3                                  10
s18: sll r13, r13, 2                                   11
s11: add r12, r12, r2                                  12
s15: add r7, r7, r1            s8:  lw r11, 0(r11)     13
s19: add r13, r13, r3                                  14
s22: sll r14, r4, 2                                    15
s9:  add r11, r6, r11          s16: lw r7, 0(r7)       16
s23: add r14, r14, r2          s20: lw r13, 0(r13)     17
                                                       18
                               s12: sw r11, 0(r12)     19
s21: add r13, r7, r13                                  20
s25: addi r4, r4, 1                                    21
                                                       22
s26: beq $zero, $zero, LOOP    s24: sw r13, 0(r14)     23
D.
With the assumption of perfect branch prediction and based on the code schedules in (B) and (C), the 1-issue processor takes 19 x 1000 = 19,000 cycles, and the 2-issue processor takes 23 x 500 = 11,500 cycles. The speedup is 19,000 / 11,500 = 1.65.
(Figure 4.71 in Textbook)
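The Part D arithmetic can be double-checked with a few lines of Python. This is an illustrative sketch, not part of the required solution; the cycle counts are taken from the schedules in (B) and (C).

```python
# Cycle counts from the 1-issue schedule (B) and the 2-issue schedule (C).
one_issue_cycles = 19 * 1000   # 19 cycles/iteration, 1000 iterations
two_issue_cycles = 23 * 500    # 23 cycles/iteration, 500 unrolled iterations
speedup = one_issue_cycles / two_issue_cycles
print(one_issue_cycles, two_issue_cycles, round(speedup, 2))  # 19000 11500 1.65
```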
Problem 2 (Cache Memory Organizations, Chapter 5)
Below is a trace of 8-bit memory references. Each memory reference in the trace is given as a word address. Assume the memory is word-addressable, not byte-addressable, in this problem.
Memory reference trace (decimal word addresses): 60, 146, 92, 186, 60, 146, 186, 92
Problem 2.a
Assume we have a direct-mapped cache with 8 one-word blocks. For each of those references, identify its memory address in binary form, its tag, and its index to the direct-mapped cache. Also, indicate if each reference is a hit or a miss, assuming the cache is initially empty. Show your results in the following table.
Word Address   Memory Address   Tag    Index   Hit/Miss
60             0011 1100        0x07   4       Miss
146            1001 0010        0x12   2       Miss
92             0101 1100        0x0B   4       Miss
186            1011 1010        0x17   2       Miss
60             0011 1100        0x07   4       Miss
146            1001 0010        0x12   2       Miss
186            1011 1010        0x17   2       Miss
92             0101 1100        0x0B   4       Miss

Problem 2.b
Assume we have a 2-way set-associative cache with 8 one-word blocks. For each of those references in the same trace, identify its binary address, its tag, and its index. Also, list if each reference is a hit or a miss, assuming the cache is initially empty. Show your results in the following table.
Word Address   Binary Address   Tag    Index   Hit/Miss
60             0011 1100        0x0F   0       Miss
146            1001 0010        0x24   2       Miss
92             0101 1100        0x17   0       Miss
186            1011 1010        0x2E   2       Miss
60             0011 1100        0x0F   0       Hit
146            1001 0010        0x24   2       Hit
186            1011 1010        0x2E   2       Hit
92             0101 1100        0x17   0       Hit

Problem 2.c
We are trying to decide the best cache design for the given address trace among two possible cache designs, each with a total of 8 words of data. Case-1 is the direct-mapped cache design with 1-word blocks as in Problem 2.a. Case-2 is the 2-way set-associative cache design with 1-word blocks as in Problem 2.b. Given the address trace above, which cache design has the lowest miss rate? Assume the miss penalty is 20 cycles. Case-1 has an access time of 1 cycle. Case-2 has an access time of 2 cycles. Which cache design has better performance in terms of AMAT (Average Memory Access Time)?
AMAT = Hit Time + Miss Rate x Miss Penalty
Case 1: It has 8/8 = 100% miss rate. AMAT = 1 + 1.0 x 20 = 21 cycles.
Case 2: It has 4/8 = 50% miss rate, so Case 2 has the lower miss rate. AMAT = 2 + 0.5 x 20 = 12 cycles.
Case 2 has the better performance.
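The miss counts and AMAT values above can be cross-checked with a short simulation. This is an illustrative sketch, not part of the required solution; the function names are ours.

```python
# Simulate both cache organizations on the word-address trace, then recompute
# miss rates and AMAT. All addresses are word addresses, one word per block.
trace = [60, 146, 92, 186, 60, 146, 186, 92]

def direct_mapped_misses(trace, num_blocks=8):
    cache = {}                                 # index -> tag currently stored
    misses = 0
    for addr in trace:
        index, tag = addr % num_blocks, addr // num_blocks
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag
    return misses

def two_way_lru_misses(trace, num_sets=4):
    sets = {s: [] for s in range(num_sets)}    # each set: tags, LRU first
    misses = 0
    for addr in trace:
        index, tag = addr % num_sets, addr // num_sets
        ways = sets[index]
        if tag in ways:
            ways.remove(tag)                   # hit: refresh LRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)                    # evict least recently used
        ways.append(tag)                       # most recently used at the end
    return misses

m1, m2 = direct_mapped_misses(trace), two_way_lru_misses(trace)
amat1 = 1 + (m1 / len(trace)) * 20             # Case 1: hit time 1 cycle
amat2 = 2 + (m2 / len(trace)) * 20             # Case 2: hit time 2 cycles
print(m1, m2, amat1, amat2)                    # 8 4 21.0 12.0
```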
Problem 2.d
There are many design parameters that are important to a cache's overall performance. Assume we have the following parameters for a direct-mapped cache: (1) cache data size: 16 Kbyte; (2) cache block size: 4 words. Calculate the total number of bits required for the direct-mapped cache, including data, tag, and valid bits, assuming it is in a machine with a 32-bit address. Given that total size, find the total size of the closest 4-way set-associative cache with 2-word blocks whose size is equal or greater.
Total number of data bits in a 4-word block is 16 bytes x 8 bits/byte = 128 bits. Total number of blocks in the cache is 16 Kbyte / 16 byte = 1K blocks = 2^10 blocks.
The tag bits required for each block: 32 - 10 - 4 = 18 bits.
We need 1 valid bit for each block.
Total number of bits needed: (1 + 18 + 128) x 1024 = 147 x 1024 = 150,528 bits.
For the 4-way set-associative 2-word/block cache:
The number of data bits per block is 2 words x 4 bytes = 8 bytes x 8 bits/byte = 64 = 2^6 bits.
We need 2^x sets (each set has 4 blocks) so that the total size is equal to or greater than the cache size above. The tag bits required will be 32 - x - 3 = (29 - x) bits.
Total number of bits needed: (1 + 29 - x + 64) x 4 x 2^x >= 150,528.
So the closest x is 9, which requires (1 + 29 - 9 + 64) x 4 x 2^9 = 85 x 4 x 512 = 174,080 bits.
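The same bit-count arithmetic, including the search for the smallest x, can be sketched in Python (variable names are ours; illustrative only):

```python
# Part 2.d arithmetic: direct-mapped cache bit count, then the smallest 4-way
# set-associative cache (2-word blocks) of equal or greater total size.
addr_bits = 32

# Direct-mapped: 16 KB of data, 4-word (16-byte) blocks.
dm_block_bits = 4 * 4 * 8                      # 128 data bits per block
dm_blocks = 16 * 1024 // 16                    # 1024 = 2^10 blocks
dm_tag = addr_bits - 10 - 4                    # 10 index bits, 4 byte-offset bits
dm_total = (1 + dm_tag + dm_block_bits) * dm_blocks
print(dm_total)                                # 150528

# 4-way set-associative, 2-word (8-byte) blocks: find the smallest 2^x sets.
sa_block_bits = 2 * 4 * 8                      # 64 data bits per block
x = 1
while True:
    sa_tag = addr_bits - x - 3                 # x index bits, 3 byte-offset bits
    sa_total = (1 + sa_tag + sa_block_bits) * 4 * 2 ** x
    if sa_total >= dm_total:
        break
    x += 1
print(x, sa_total)                             # 9 174080
```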
Problem 3. TLB and Address Mapping
Virtual memory uses a page table to track the mapping of (i.e. to translate) virtual addresses to physical addresses in main memory. This exercise shows how the page table must be updated as memory references are being processed. Assume we have a 32-bit virtual address and the page size is 4Kbytes. Also, it has a 4-entry fully-associative TLB with an LRU (least recently used) replacement scheme. When a page is brought into the main memory due to a page fault, it is placed in the page following the page with the largest page number, i.e. if the current largest physical page number is 12 as shown in the table below, it will be placed in the page number 13. The following list shows the stream of virtual addresses during the program execution.
4797, 2499, 14168, 34203, 48742, 14664, 51915 (all decimal)
The initial content of the TLB and the page table is shown in the following two tables.

TLB (4-entry fully associative with LRU replacement scheme)
Valid Bit   Tag   Physical Page Number (PPN)   LRU Status
1           11    12                           LRU
1           7     4                            2nd LRU
1           3     6                            3rd LRU
0           -     -                            N/A
Page Table (first 13 entries of the page table)

Virtual Page Number (VPN)   Valid Bit   PPN or on Disk (if Valid Bit = 0)
0                           1           5
1                           0           Disk
2                           0           Disk
3                           1           6
4                           1           9
5                           1           11
6                           0           Disk
7                           1           4
8                           0           Disk
9                           0           Disk
10                          1           3
11                          1           12
12                          0           Disk
Problem 3.a
Please show the final state of the TLB and the page table given the address stream shown above. Also, list for each memory reference if it is a hit/miss in the TLB, a hit/miss in the page table, or a page fault.
Since we have a 32-bit virtual address and a page size of 4 KB = 2^12 bytes, we have 2^32 / 2^12 = 2^20 pages in the virtual address space, i.e. each virtual page number (VPN) has 20 bits, and the page offset has 12 bits.
As we have a fully-associative TLB, we will use the 20-bit VPN as the tag in the TLB.
Note that we also use 20-bit VPN to index into the page table.
04797 -- 0x12BD -- 0000 0000 0000 0000 0001 0010 1011 1101
02499 -- 0x09C3 -- 0000 0000 0000 0000 0000 1001 1100 0011
14168 -- 0x3758 -- 0000 0000 0000 0000 0011 0111 0101 1000
34203 -- 0x859B -- 0000 0000 0000 0000 1000 0101 1001 1011
48742 -- 0xBE66 -- 0000 0000 0000 0000 1011 1110 0110 0110
14664 -- 0x3948 -- 0000 0000 0000 0000 0011 1001 0100 1000
51915 -- 0xCACB -- 0000 0000 0000 0000 1100 1010 1100 1011
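The VPN/offset split above can be reproduced with a short script (illustrative only; `decomposed` is our own name):

```python
# Decompose each 32-bit virtual address into a 20-bit VPN and a 12-bit page
# offset (4 KB pages).
PAGE_OFFSET_BITS = 12
trace = [4797, 2499, 14168, 34203, 48742, 14664, 51915]
decomposed = [(va >> PAGE_OFFSET_BITS, va & ((1 << PAGE_OFFSET_BITS) - 1))
              for va in trace]
for va, (vpn, off) in zip(trace, decomposed):
    print(f"{va:5d} -> VPN 0x{vpn:X}, offset 0x{off:03X}")
```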
Address   Tag (VPN)   Offset   Hit/Miss in TLB   Hit/Miss in Page Table   Page Fault
04797     0x1         0x2BD    Miss              Miss                     Y
02499     0x0         0x9C3    Miss              Hit                      N
14168     0x3         0x758    Hit               Not Accessed (N/A)       N
34203     0x8         0x59B    Miss              Miss                     Y
48742     0xB         0xE66    Miss              Hit                      N
14664     0x3         0x948    Hit               Not Accessed (N/A)       N
51915     0xC         0xACB    Miss              Miss                     Y

TLB (final state; "A -> B" means the field was updated from A to B during the trace, X means undefined):

Valid Bit   Tag (VPN)       Physical Page Number (PPN)
1           11 -> 0 -> 12   12 -> 5 -> 15
1           7 -> 8          4 -> 14
1           3               6
0 -> 1      X -> 1 -> 11    X -> 13 -> 12

Page Table (final state):

Virtual Page Number (VPN)   Valid Bit   PPN or on Disk
0                           1           5
1                           0 -> 1      Disk -> 13
2                           0           Disk
3                           1           6
4                           1           9
5                           1           11
6                           0           Disk
7                           1           4
8                           0 -> 1      Disk -> 14
9                           0           Disk
10                          1           3
11                          1           12
12                          0 -> 1      Disk -> 15
Problem 3.b
Repeat Problem 3.a, but assume we have 8-Kbyte pages instead of 4-Kbyte pages.
Since we have 8-KB pages (2^13 bytes), we have a 19-bit VPN (used as the TLB tag) and a 13-bit page offset.
04797 -- 0x12BD -- 000 0000 0000 0000 0000  1 0010 1011 1101
02499 -- 0x09C3 -- 000 0000 0000 0000 0000  0 1001 1100 0011
14168 -- 0x3758 -- 000 0000 0000 0000 0001  1 0111 0101 1000
34203 -- 0x859B -- 000 0000 0000 0000 0100  0 0101 1001 1011
48742 -- 0xBE66 -- 000 0000 0000 0000 0101  1 1110 0110 0110
14664 -- 0x3948 -- 000 0000 0000 0000 0001  1 1001 0100 1000
51915 -- 0xCACB -- 000 0000 0000 0000 0110  0 1010 1100 1011
(19 tag bits | 13 offset bits)
Address   Tag   Offset   Hit/Miss in TLB   Hit/Miss in Page Table   Page Fault
04797     0x0   0x12BD   Miss              Hit                      N
02499     0x0   0x09C3   Hit               Not Accessed (N/A)       N
14168     0x1   0x1758   Miss              Miss                     Y
34203     0x4   0x059B   Miss              Hit                      N
48742     0x5   0x1E66   Miss              Hit                      N
14664     0x1   0x1948   Hit               Not Accessed (N/A)       N
51915     0x6   0x0ACB   Miss              Miss                     Y

TLB (final state; "A -> B" shows successive updates, X means undefined):

Valid Bit   Tag           Physical Page Number (PPN)
1           11 -> 1       12 -> 13
1           7 -> 4        4 -> 9
1           3 -> 5        6 -> 11
0 -> 1      X -> 0 -> 6   X -> 5 -> 14

Page Table (final state):

Virtual Page Number (VPN)   Valid Bit   PPN or on Disk
0                           1           5
1                           0 -> 1      Disk -> 13
2                           0           Disk
3                           1           6
4                           1           9
5                           1           11
6                           0 -> 1      Disk -> 14
7                           1           4
8                           0           Disk
9                           0           Disk
10                          1           3
11                          1           12
12                          0           Disk
Problem 3.c
Repeat Problem 3.a. Assume we have a page size of 4KB, and the TLB is 2-way set associative as shown below.
TLB (2-way set associative with LRU replacement scheme in each set)

Set Number   Valid Bit   Tag   Physical Page Number (PPN)   LRU Status
0            0           5     3                            1st LRU
0            0           2     9                            2nd LRU
1            0           3     4                            1st LRU
1            0           -     -                            N/A
2-way set-associative TLB:
Since we have 4-KB pages and a 2-way set-associative TLB with 2 sets, the 20-bit VPN is partitioned into a 19-bit tag and a 1-bit set index, followed by the 12-bit page offset.
Note that we still have a 20-bit virtual page number (VPN), which is used to index the page table.
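The tag/set-index split can likewise be checked with a short script (illustrative only; `split` is our own name):

```python
# Split each 20-bit VPN (4 KB pages) into a 19-bit tag and a 1-bit set index
# for the 2-way, 2-set TLB.
trace = [4797, 2499, 14168, 34203, 48742, 14664, 51915]
split = []
for va in trace:
    vpn = va >> 12                      # strip the 12-bit page offset
    split.append((vpn >> 1, vpn & 1))   # (tag, set index)
    print(f"VPN 0x{vpn:X} -> tag 0x{vpn >> 1:X}, set {vpn & 1}")
```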
04797 -- 0x12BD -- 000 0000 0000 0000 0000  1  0010 1011 1101
02499 -- 0x09C3 -- 000 0000 0000 0000 0000  0  1001 1100 0011
14168 -- 0x3758 -- 000 0000 0000 0000 0001  1  0111 0101 1000
34203 -- 0x859B -- 000 0000 0000 0000 0100  0  0101 1001 1011
48742 -- 0xBE66 -- 000 0000 0000 0000 0101  1  1110 0110 0110
14664 -- 0x3948 -- 000 0000 0000 0000 0001  1  1001 0100 1000
51915 -- 0xCACB -- 000 0000 0000 0000 0110  0  1010 1100 1011
(19 tag bits | 1 set-index bit | 12 offset bits)
Virtual Address   VPN   Tag   Set Index   Offset   Hit/Miss in TLB   Hit/Miss in Page Table   Page Fault
04797             0x1   0x0   1           0x2BD    Miss              Miss                     Yes
02499             0x0   0x0   0           0x9C3    Miss              Hit                      No
14168             0x3   0x1   1           0x758    Miss              Hit                      No
34203             0x8   0x4   0           0x59B    Miss              Miss                     Yes
48742             0xB   0x5   1           0xE66    Miss              Hit                      No
14664             0x3   0x1   1           0x948    Hit               N/A                      No
51915             0xC   0x6   0           0xACB    Miss              Miss                     Yes

TLB (final state; "A -> B" shows successive updates, X means undefined):

Set   Valid Bit   Tag           Physical Page Number (PPN)
0     0 -> 1      5 -> 0 -> 6   3 -> 5 -> 15
0     0 -> 1      2 -> 4        9 -> 14
1     0 -> 1      3 -> 1        4 -> 6
1     0 -> 1      X -> 0 -> 5   X -> 13 -> 12

Page Table (final state):

Virtual Page Number (VPN)   Valid Bit   PPN or on Disk
0                           1           5
1                           0 -> 1      Disk -> 13
2                           0           Disk
3                           1           6
4                           1           9
5                           1           11
6                           0           Disk
7                           1           4
8                           0 -> 1      Disk -> 14
9                           0           Disk
10                          1           3
11                          1           12
12                          0 -> 1      Disk -> 15
Problem 3.d
Repeat Problem 3.a. Assume we have a 4-KB page size, and the TLB is direct-mapped, instead of fully associative. Also, discuss the importance of a TLB for high performance, i.e. how would virtual memory accesses be handled if there were no TLB?
Since we have 4-KB pages and a direct-mapped TLB with 4 entries, we have an 18-bit tag and a 2-bit index into the TLB, with a 12-bit page offset.
Note that we still have a 20-bit virtual page number (VPN), which is used to index the page table.
04797 -- 0x12BD -- 00 0000 0000 0000 0000  01  0010 1011 1101
02499 -- 0x09C3 -- 00 0000 0000 0000 0000  00  1001 1100 0011
14168 -- 0x3758 -- 00 0000 0000 0000 0000  11  0111 0101 1000
34203 -- 0x859B -- 00 0000 0000 0000 0010  00  0101 1001 1011
48742 -- 0xBE66 -- 00 0000 0000 0000 0010  11  1110 0110 0110
14664 -- 0x3948 -- 00 0000 0000 0000 0000  11  1001 0100 1000
51915 -- 0xCACB -- 00 0000 0000 0000 0011  00  1010 1100 1011
(18 tag bits | 2 index bits | 12 offset bits)
Address   VPN   Tag   Index   Offset   Hit/Miss in TLB   Hit/Miss in Page Table   Page Fault
04797     0x1   0x0   1       0x2BD    Miss              Miss                     Yes
02499     0x0   0x0   0       0x9C3    Miss              Hit                      No
14168     0x3   0x0   3       0x758    Miss              Hit                      No
34203     0x8   0x2   0       0x59B    Miss              Miss                     Yes
48742     0xB   0x2   3       0xE66    Miss              Hit                      No
14664     0x3   0x0   3       0x948    Miss              Hit                      No
51915     0xC   0x3   0       0xACB    Miss              Miss                     Yes

(14664 misses in the TLB because the reference to 48742 evicted VPN 0x3 from TLB entry 3; 51915 page-faults because page 12 is initially on disk.)

TLB (final state, assuming the initial entries of Problem 3.a occupy entries 0-3 in order; "A -> B" shows successive updates, X means undefined):

Entry   Valid Bit   Tag                 Physical Page Number (PPN)
0       1           11 -> 0 -> 2 -> 3   12 -> 5 -> 14 -> 15
1       1           7 -> 0              4 -> 13
2       1           3                   6
3       0 -> 1      X -> 0 -> 2 -> 0    X -> 6 -> 12 -> 6

Page Table (final state):

Virtual Page Number (VPN)   Valid Bit   PPN or on Disk
0                           1           5
1                           0 -> 1      Disk -> 13
2                           0           Disk
3                           1           6
4                           1           9
5                           1           11
6                           0           Disk
7                           1           4
8                           0 -> 1      Disk -> 14
9                           0           Disk
10                          1           3
11                          1           12
12                          0 -> 1      Disk -> 15
Each memory reference must access its page table entry to translate its virtual address to a physical address in main memory. The TLB is a small cache that keeps the most recently used page table entries in the processor for address translation. It allows such translations to be performed without accessing the page table in off-chip main memory, which incurs a long latency. If there were no TLB, every memory reference would require an additional main-memory access for translation, and memory access time would increase significantly.
Problem 3.e
Here are several parameters that can impact the overall size of the page table: (1) the virtual address size; (2) page size, and (3) the size of each page table entry. Assume the virtual address size is 32 bits, the page size is 8KB and the size of each page table entry is 4 bytes (assume it includes all of the information, e.g. the valid bit, physical page number, etc.). Please calculate the total page table size (in bytes) for a system running 6 applications. Each has its own page table, i.e. there are 6 page tables active at the same time in the system.
Page Size = 8 KB = 2^13 bytes.
Each virtual address has 32 bits, i.e. the total size of the virtual address space is 2^32 bytes, which is 4 gigabytes (4 GB).
So, we have 2^32 / 2^13 = 2^19 pages in the virtual address space.
Each page has a page table entry in the page table.
As specified, each page table entry has 4 bytes = 2^2 bytes.
So, the page table size is: 2^2 x 2^19 = 2^21 bytes = 2 megabytes (2 MB).
As our system is running 6 applications at any time, we need to keep all 6 page tables in main memory. The total page table size for the 6 applications is thus 2 MB x 6 = 12 MB.
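A quick check of this arithmetic (illustrative; variable names are ours):

```python
# Part 3.e: 32-bit VA, 8 KB pages, 4-byte page table entries, 6 processes.
va_bits, page_bytes, pte_bytes, num_processes = 32, 8 * 1024, 4, 6
entries_per_table = 2 ** va_bits // page_bytes   # 2^19 pages per address space
table_bytes = entries_per_table * pte_bytes      # one page table
total_bytes = table_bytes * num_processes        # all six page tables
print(table_bytes // 2 ** 20, total_bytes // 2 ** 20)  # 2 12  (in MB)
```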
Problem 4 Cache Coherence, False Sharing
Cache coherence concerns the views of multiple processors on a given cache block. The following table shows two processors and their read/write operations on two different words X[0] and X[1] in an 8-byte cache block (initially, X[0] = X[1] = 0). Assume the size of integers is 4 bytes (i.e. one word). Hence, X[0] and X[1] are in the same cache block.
Processor 1   Processor 2
X[0]++;       X[0] = 6;
X[1] = 3;     X[1] += 2;
Problem 4.a
List all possible final values of the given cache block, assuming we have a correct coherence protocol implementation. Note that there are 6 possible orderings in the execution of the two statements on each processor. Two possible orderings are shown in the two tables below. You need to show the rest of the orderings, and the results of each ordering. Also list at least one more possible value of the block if the 2-processor system does not ensure cache coherence, i.e. a processor is not aware of the value having been changed in the other processor.
(Time flows downward in each table.)

Order 1:
  Processor 1   Processor 2
  X[0]++;
  X[1] = 3;
                X[0] = 6;
                X[1] += 2;
Order 1 results: X[0] = 6, X[1] = 5

Order 2:
  Processor 1   Processor 2
                X[0] = 6;
  X[0]++;
                X[1] += 2;
  X[1] = 3;
Order 2 results: X[0] = 7, X[1] = 3

Order 3:
  Processor 1   Processor 2
  X[0]++;
                X[0] = 6;
  X[1] = 3;
                X[1] += 2;
Order 3 results: X[0] = 6, X[1] = 5

Order 4:
  Processor 1   Processor 2
  X[0]++;
                X[0] = 6;
                X[1] += 2;
  X[1] = 3;
Order 4 results: X[0] = 6, X[1] = 3

Order 5:
  Processor 1   Processor 2
                X[0] = 6;
  X[0]++;
  X[1] = 3;
                X[1] += 2;
Order 5 results: X[0] = 7, X[1] = 5

Order 6:
  Processor 1   Processor 2
                X[0] = 6;
                X[1] += 2;
  X[0]++;
  X[1] = 3;
Order 6 results: X[0] = 7, X[1] = 3

If coherence isn't ensured and P2's operations take precedence over P1's (P2 never sees P1's updates), we could have X[0] = 6, X[1] = 2.
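The six orderings and their results can be enumerated mechanically. In this illustrative sketch, A1/A2 label Processor 1's statements and B1/B2 label Processor 2's (the labels are ours, not part of the assignment):

```python
# Enumerate the interleavings that preserve each processor's program order
# and collect the final (X[0], X[1]) values.
from itertools import combinations

def run(order):
    x = [0, 0]                                     # X[0], X[1], initially 0
    step = {
        "A1": lambda: x.__setitem__(0, x[0] + 1),  # P1: X[0]++
        "A2": lambda: x.__setitem__(1, 3),         # P1: X[1] = 3
        "B1": lambda: x.__setitem__(0, 6),         # P2: X[0] = 6
        "B2": lambda: x.__setitem__(1, x[1] + 2),  # P2: X[1] += 2
    }
    for op in order:
        step[op]()
    return tuple(x)

results = set()
for slots in combinations(range(4), 2):            # positions of P1's two ops
    order = [None] * 4
    order[slots[0]], order[slots[1]] = "A1", "A2"
    it = iter(["B1", "B2"])
    results.add(run([op if op else next(it) for op in order]))
print(sorted(results))  # [(6, 3), (6, 5), (7, 3), (7, 5)]
```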
Problem 4.b
Assuming we have a snooping cache coherence protocol, list the sequence of read/write operations (in a format similar to Figure 5.42) on each processor/cache for Order 2 shown in the table above. Remember that X[0] and X[1] are in the same cache block.
Processor activity   Bus activity            Contents of P1's cache   Contents of P2's cache   Contents of memory
P2 reads X[0]        Cache miss for X[0]     Empty                    X[0]=0; X[1]=0           X[0]=0; X[1]=0
P2 writes X[0]       Invalidation for X[0]   Empty                    X[0]=6; X[1]=0           X[0]=0; X[1]=0
P1 reads X[0]        Cache miss for X[0]     X[0]=6; X[1]=0           X[0]=6; X[1]=0           X[0]=6; X[1]=0
P1 writes X[0]       Invalidation for X[0]   X[0]=7; X[1]=0           Invalidated              X[0]=6; X[1]=0
P2 reads X[1]        Cache miss for X[1]     X[0]=7; X[1]=0           X[0]=7; X[1]=0           X[0]=7; X[1]=0
P2 writes X[1]       Invalidation for X[1]   Invalidated              X[0]=7; X[1]=2           X[0]=7; X[1]=0
P1 reads X[1]        Cache miss for X[1]     X[0]=7; X[1]=2           X[0]=7; X[1]=2           X[0]=7; X[1]=2
P1 writes X[1]       Invalidation for X[1]   X[0]=7; X[1]=3           Invalidated              X[0]=7; X[1]=2

Problem 4.c
Among the 6 possible orderings in Problem 4.a, what are the best-case and the worst-case numbers of cache misses needed to execute the listed read/write instructions?
Listing the operation sequence as above for all the different orderings helps us identify the numbers of cache misses needed to execute the listed read/write instructions.
Best case: Orderings 1 and 6, which require 2 cache misses. Worst case: Orderings 2 and 3, which require 4 cache misses. As a reference, Orderings 4 and 5 require 3 cache misses.
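The miss counts can be cross-checked with a simplified model (our own sketch). We assume, as in the table for Problem 4.b, that every statement both accesses and writes the shared block, so a statement misses whenever its processor's cache lacks a valid copy, and each write invalidates the other processor's copy:

```python
# Count coherence misses for every interleaving of P1's statements (labeled
# A1, A2) and P2's statements (labeled B1, B2); labels are ours.
from itertools import combinations

P1_OPS, P2_OPS = ("A1", "A2"), ("B1", "B2")

def misses(order):
    valid = {"P1": False, "P2": False}          # valid copy present?
    count = 0
    for op in order:
        proc = "P1" if op.startswith("A") else "P2"
        other = "P2" if proc == "P1" else "P1"
        if not valid[proc]:
            count += 1                          # cold/coherence miss
            valid[proc] = True
        valid[other] = False                    # write invalidates other copy
    return count

counts = []
for slots in combinations(range(4), 2):         # positions of P1's statements
    order = [None] * 4
    order[slots[0]], order[slots[1]] = P1_OPS
    it = iter(P2_OPS)
    counts.append(misses([op if op else next(it) for op in order]))
print(sorted(counts), min(counts), max(counts))  # [2, 2, 3, 3, 4, 4] 2 4
```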