Q1) Virtual memory
(1pt) What is the purpose of a TLB?
(2pts) Suppose we’re using an 8-bit system with 16-byte pages and a set-associative TLB with a total of 8 sets. How many bits are there in the TLB tag?
(1pt) What is the purpose of a page table?
(2pts) Suppose we’re using a 16-bit system with 2-byte pointers. Suppose the system has 64- byte pages with a single-level page table. What is the size of a page table in bytes?
(1pt) What is the purpose of making a page table multi-level?
(3pts) Describe how Copy-On-Write (COW) affects the page tables in the parent and child process in fork().
(5pts) Suppose we’re using an 8-bit system with 16-byte pages and a two-level page table. Assume the two-level split is such that each table contains the same number of entries. Suppose the following is a dump of physical memory:
00: 501020201090f050 10: 9000c05030e03000 20: 40e040f0f000f0f0 30: 5000f0f090e09040 40: 804000f04090e010 50: 6040f01020201090 60: 5070e050f05080f0 70: 5000f0d0c0f02010 80: 50401020c050f000 90: 7050f06090e09000 a0: 402030f020e020e0 b0: 6000201040f030f0 c0: a050205070903040 d0: 60401020c0500030 e0: 0060103040f02090 f0: b0d000c050003040
e040209000508010 f0f0602010d050f0 a030a0f050e040f0 f010d0201090f050 d0903000f0f090e0 f030b010204000f0 602090d050f08040 c0f0f06060305040 f0f0c0902030f030 60205050000070c0 5000104040303050 405000f09040d0f0 502040d030c0f0e0 70908030804020c0 10c0709060f05080 4090e0000070c090
Suppose we declare char* p = 0xca. What is *p, assuming the process’s base page table register (e.g., cr3) contains 0x80?
Q2) Concurrency
(5pts) Consider the following multi-threaded C code:
1: bool send_money(acct* src, acct* dst, unsigned int amount) {
2: bool sent = false;
3: mu t e x _ l o c k ( &s r c – > mu t e x ) ;
4: if (src->value >= amount) {
5: src->value -= amount;
6: mu t e x _ l o c k ( &d s t – > mu t e x ) ;
7: dst->value += amount;
8: mu t e x _ u n l o c k ( &d s t – > mu t e x ) ;
9: sent = true;
10: }
11: mu t e x _ u n l o c k ( &s r c – > mu t e x ) ;
12: return sent;
13: }
Describe and fix any bug(s) you see, if any.
(5pts) Consider the following multi-threaded C code:
1: mut ex_t queueMut ex;
2: i nt queueMut exIni t = 0;
3: int insertQueueThreadSafe(queue_t* q, void* data) {
4: if (queueMutexInit == 0) {
5: queueMut exI ni t = 1;
6: mut ex_i ni t ( &queueMut ex) ;
7: }
8: mut ex_l ock( &queueMut ex) ;
9: int ret = insertQueue(q, data);
10: mut ex_unl ock( &queueMut ex) ;
11: return ret;
12: }
Describe any bug(s) you see, if any.
(5pts) Consider the following multi-threaded C code:
1: void push(void* data) {
2: mut ex_l ock( &mut ex) ;
3: queue. i nsert (dat a);
4: cond_signal(&cv);
5: mut ex_unl ock( &mut ex) ;
6: }
7: void* pop() {
8: mut ex_l ock( &mut ex) ;
9: if (queue.empty()) {
10: cond_wai t ( &cv, &mut ex) ;
11: }
12: void* data = queue.pop();
13: mut ex_unl ock( &mut ex) ;
14: return data;
15: }
Describe and fix any bug(s) you see, if any.
Q3) Kernel
(1pt) Give an example where a page fault results in an application crash.
(1pt) Give an example where a page fault doesn’t result in an application crash.
(1pt) Why is the kernel stack separate from the user program’s stack?
(3pts) List three (3) events that generate different interrupts.
(2pts) To implement a context switch, what data needs to be saved?
(2pts) Consider the following code in the kernel space for swapping the contents of two non- overlapping 4kb arrays:
1: void swap4kb(void* a, void* b) {
2: char tmp[4096];
3: assert(a!=NULL);
4: assert(b != NULL);
5: memcpy( t mp, a, 4096) ;
6: memcpy( a, b, 4096) ;
7: memcpy( b, t mp, 4096) ;
8: }
Describe any bug(s) you see, if any.
Q4) Storage
(1pt) Briefly describe the data that is stored in the page cache.
(1pt) Briefly describe the data that is stored in the disk cache.
(1pt) Briefly describe the purpose of the swap partition.
(1pt) Briefly describe one reason why SSDs are better than HDDs.
(1pt) Briefly describe one reason why HDDs are better than SSDs.
(1pt) What is the difference between RAID4 and RAID5, which is better, and why?
(2pts) What is the purpose of journaling in filesystems?
(2pts) What is the purpose of log structuring in filesystems?
Q5) Understanding processor performance (15 pts)
(4pts) Assume that you are moving from a single-cycle MIPS processor design with frequency F and CPI Cbase to a 5-stage MIPS processor design with frequency 3F. At what CPI would your new processor have twice the performance of the original?
(4pts) Assume that your processor has no cache, and that your memory takes a fixed amount of time independent of cycle time to access, stalling the pipeline while memory is being accessed. If memory stalls currently account for K% of CPI, what is the maximum speedup that can be achieved by increasing the clock frequency by a factor of G?
(7pts) Assume that a parallelization optimization across two cores for an originally serial program provides an improvement factor of K for the X% of execution that it applies to, but requires an additional Y% of the original execution worth of computation to be performed in order to apply the optimization. Assuming that the tradeoff is worthwhile (i.e. speedup >1) what are the improvements in throughput and latency over the original serial code if two copies of the two-way parallelized program are run across four cores? Express your throughput improvement as a function of your latency improvement.
Q6) Pipelining with data hazards (15 pts)
(a) (4) Label all data dependencies and (false/anti)-dependencies in the below code fragment with an arrow and their data dependency type.
1| FOO: lw
2| lw
3| addu $2, $2, 4| sw
$2, 0($4) $3, 0($5) $3
$2, 0($4) $4, 4($4) 4
5| lw
6| addi $5, $5,
7| bne $4, $0, F00
(b) (7) Consider the instructions from (a) scheduled on a five stage, scalar MIPS pipeline (FDEMW) with full forwarding and bypass networks and hazard detection and branch resolution in D. In the table below, for each cycle, indicate the cycle an instruction completes a stage with a capital letter (FDEMW) and indicate stalls with a lowercase letter (fdemw). Indicate by drawing a vertical arrow when a value is bypassed from one instruction to another in the cycle that the forwarding occurs. For simplicity, omit WD bypassing arrows. Assume all loads/stores are 1- cycle hits, that there are no exceptions, and that all branches are perfectly predicted.
CYCLE
1
2
3
4
5
6
7
8
9
1 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
FOO: lw $2, 0($4)
lw $3,0($5)
addu $2, $2, $3
sw $2, 0($4)
lw $4,4($4)
addi $5, $5, 4
bne $4, $0, FOO
(c) (4) Given the schedule below, list what forwarding is demonstrated as being supported (and list the pair of instructions that demonstrate this, for each pipeline stage pair) and where
hazard detection and branch resolution occur.
CYCLE
FOO: lw $2, 0($4)
1
2
3
4
5
6
7
8
9
1 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
F
D
E
M
W
lw $3,0($5)
F
D
E
M
W
addu $2, $2, $3
F
D
e
E
M
W
sw $2, 0($4)
F
d
D
e
E
M
W
lw $4,4($4)
f
F
d
D
E
M
W
addi $5, $5, 4 bne $4, $0, FOO
fFDEMW FDEMW
Q7) Caching (10 pts)
(a) (4) The base CPI of a system, excluding memory stalls, is C
Loads and stores collectively constitute LS% of all instructions
Accessing the L1 data cache takes K cycles (accounted for in base CPI). Accessing the L1 instruction cache takes P cycles (accounted for in base CPI). Misses to main memory take an average of D cycles.
The L1 D-cache miss rate/access is MD. The L1 I-cache miss rate/access is MI.
Your team is considering adding an L2 cache for data accesses only. Assuming that the L2 access time is 4K, how high can the L2 miss rate be and still improve AMAT?
(b) (6) There are three programs, FOO, BAR, and BAZ, a reference microbenchmark “ALWAYS- HITS” and an architecture SHODAN-N where each processor in the SHODAN-N series varies only by the total size of its (highly associative) single-level cache. The size of the cache doubles with each increment of N (i.e. SHODAN-3 has 8 times as much cache as SHODAN-0). Assume that the access patterns of all three programs are not pathological with respect to cache parameters (i.e. assume as a simplifying assumption that changing the replacement policy would not have significant effects on performance and that FOO, BAR, and BAZ do not vary wildly in performance when increasing associativity beyond two, etc.)
As FOO, BAR, and BAZ are run across the SHODAN series of processors in order from a SHODAN-0 to a SHODAN-7, the following behaviors are observed:
The performance of FOO is very good (similar to ALWAYS-HITS) on a SHODAN-0, but gets progressively worse by SHODAN-7, although it still performs reasonably well.
The performance of BAR is initially poor, and then increases greatly between SHODAN-2 to SHODAN-3, and then declines.
The performance of BAZ is initially poor, and gets progressively better from SHODAN-0 to SHODAN-7, but the rate of improvement decreases with every step.
a) Describe the properties of FOO, BAR, and BAZ that would produce these respective behaviors (you may assume that each program has uniform behavior if it simplifies your answer).
b) Assume that you are designing a follow-on to the SHODAN series, the POLITO processors. Describe a cache memory system that will effectively serve all three programs and justify your answers. Note any compromises in the design where one program suffers at the expense of the others, and how.
Q8) Multiprocessors and topologies (10 pts)
(2pts) What is the intuitive meaning of bisection bandwidth?
(8pts) Using LL/SC, safely insert an element into a concurrently shared list that may have an arbitrary number of readers and writers. You should assume that no reader or writer will delete nodes from the list and that there is an invariant header node that is always valid. Only provide the insertion code below, ignoring how the appropriate insertion point was located. Specify any necessary invariants.