
Name:
Problem 1: True / False
1.1 In an LC2K 5-stage pipeline, the sw instruction will never be stalled due to data hazards.
◯True
◯False
◯Not Assigned
1.2 In an LC2K 5-stage pipeline, the beq instruction will never be stalled due to data hazards.
◯True
◯False
◯Not Assigned
1.3 In an LC2K 5-stage pipeline, the beq instruction will never cause later instructions to be stalled due to data hazards.
◯True
◯False
◯Not Assigned
1.4 In an LC2K 5-stage pipeline, the lw instruction will never be stalled due to data hazards.
◯True
◯False
◯Not Assigned
1.5 When compared with avoidance for handling control hazards, a detect-and-stall scheme is guaranteed to result in fewer noops executed.
◯True
◯False
◯Not Assigned
1.6 When compared with avoidance for handling control hazards, a detect-and-stall scheme will not always result in fewer noops executed.
◯True
◯False
◯Not Assigned
1.7 The branch pattern “T T T T N N N N” (where T indicates taken and N indicates not taken), when repeated for a long time, would be better predicted by a
two-bit saturating counter instead of a one-bit “last-time” predictor.
◯True
◯False
◯Not Assigned
1.8 The branch pattern “T T T T N N N N” (where T indicates taken and N indicates not taken), when repeated for a long time, would be better predicted by a one-bit “last-time” predictor, rather than a two-bit saturating counter.
◯True
◯False
◯Not Assigned
1.9 Whether or not a one-bit “last-time” predictor would outperform a two-bit saturating counter when predicting the branch pattern “T T T T N N N N” (where T indicates taken and N indicates not taken), when repeated for a long time, would depend on the initial state of the predictors.
◯True
◯False
◯Not Assigned
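Several of the statements above hinge on how the two predictors behave on this repeated pattern, which a short simulation can make concrete. The sketch below is an illustrative aid, not part of the exam; it models the standard definitions (one-bit: predict the last outcome; two-bit counter: states 0–3, predict taken when the state is 2 or above), with the starting states chosen arbitrarily.

```python
# Compare a 1-bit "last-time" predictor with a 2-bit saturating counter
# on the repeated pattern T T T T N N N N.
def one_bit_accuracy(pattern, reps, last=True):
    correct = 0
    for outcome in pattern * reps:
        correct += (last == outcome)
        last = outcome                          # remember the most recent outcome
    return correct / (len(pattern) * reps)

def two_bit_accuracy(pattern, reps, state=2):   # 2 = "weakly taken"
    correct = 0
    for outcome in pattern * reps:
        correct += ((state >= 2) == outcome)    # predict taken iff state >= 2
        state = min(state + 1, 3) if outcome else max(state - 1, 0)
    return correct / (len(pattern) * reps)

pattern = [True] * 4 + [False] * 4              # T T T T N N N N
print(one_bit_accuracy(pattern, 1000))          # ~0.75
print(two_bit_accuracy(pattern, 1000))          # ~0.50
```

In steady state the one-bit predictor mispredicts only at the two transitions per period, while the two-bit counter also mispredicts twice while its state crosses the taken/not-taken threshold in each direction; initial state affects only the first few predictions, not the long-run rates.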

1.10 Iterating through an array of 1000 integers and incrementing each of them once exhibits lower temporal locality than iterating through a 10 element array 100 times.
◯True
◯False
◯Not Assigned
1.11 Iterating through an array of 1000 integers and incrementing each of them once exhibits higher temporal locality than iterating through a 10 element array 100 times.
◯True
◯False
◯Not Assigned
1.12 A direct mapped cache contains more tag storage than an associative cache with the same data storage capacity.
◯True
◯False
◯Not Assigned
1.13 A direct mapped cache contains less tag storage than an associative cache with the same data storage capacity.
◯True
◯False
◯Not Assigned
1.14 If program A exhibits significantly more spatial locality than program B, then increasing the block size of a cache is generally expected to improve performance of program A more than the performance of program B.
◯True
◯False
◯Not Assigned
1.15 If program A exhibits significantly more temporal locality than program B, then increasing the block size of a cache is generally expected to improve performance of program A more than the performance of program B.
◯True
◯False
◯Not Assigned
1.16 If program A exhibits significantly more spatial locality than program B, then increasing the number of lines in the cache is generally expected to improve performance of program A more than the performance of program B.
◯True
◯False
◯Not Assigned
1.17 Fully associative caches never have any capacity misses.
◯True
◯False
◯Not Assigned
1.18 Set associative caches never have any conflict misses.
◯True
◯False
◯Not Assigned
1.19 Increasing the number of lines in a cache (while keeping everything else the same) is generally expected to decrease conflict misses.
◯True
◯False
◯Not Assigned
1.20 When compared with single-level page tables, multi-level page tables improve the latency of translations at the cost of increased storage.
◯True
◯False
◯Not Assigned
1.21 When compared with single-level page tables, multi-level page tables decrease typical storage costs at the expense of longer access times.
◯True
◯False
◯Not Assigned
1.22 Improving branch prediction accuracy will improve performance on multi-cycle machines.
◯True
◯False
◯Not Assigned
1.23 Improving branch prediction accuracy will improve performance on pipelined machines.
◯True
◯False
◯Not Assigned
1.24 Virtually addressed caches are typically faster to use than physically addressed caches when only a single process is running.
◯True
◯False
◯Not Assigned
1.25 Virtually addressed caches are typically slower to use than physically addressed caches when only a single process is running.
◯True
◯False
◯Not Assigned
Problem 2: Multiple Choice
2.1 Which of the following are advantages of the detect-and-stall scheme over avoidance regarding data hazards? (select all that apply):
◯Detect-and-stall reduces the number of stalls due to data hazards when executing a program
◯Detect-and-stall preserves backwards compatibility of programs on new processors
◯Detect-and-stall reduces hardware complexity
◯Detect-and-stall programs generally have fewer misses in the instruction cache
◯None of the above
◯Not Assigned
2.2 Which of the following are advantages of the detect-and-stall scheme over avoidance regarding data hazards? (select all that apply):
◯Detect-and-stall reduces the number of stalls due to data hazards when executing a program
◯Detect-and-stall programs generally have fewer misses in the instruction cache
◯Detect-and-stall reduces hardware complexity
◯Detect-and-stall preserves backwards compatibility of programs on new processors
◯None of the above
◯Not Assigned
2.3 Which of the following are guaranteed not to increase after increasing the associativity of a cache (while keeping the total data storage size constant)? (select all that apply):
◯Number of conflict misses
◯Number of capacity misses
◯Number of compulsory misses
◯Clock period
◯None of the above
◯Not Assigned
2.4 Which of the following may increase after increasing the associativity of a cache (while keeping the total data storage size constant)? (select all that apply):
◯Number of conflict misses
◯Number of compulsory misses
◯Number of capacity misses
◯Clock period
◯None of the above
◯Not Assigned
2.5 Which of the following modifications would be considered an architectural change (i.e. a change in the ISA)? (select all that apply)
◯Increasing the number of registers
◯Increasing the amount of virtual address space
◯Increasing the amount of physical address space
◯Adding an instruction cache
◯None of the above
◯Not Assigned
2.6 Which of the following modifications would not be considered an architectural change (i.e. a change in the ISA)? (select all that apply)
◯Increasing the number of registers
◯Increasing the amount of physical address space
◯Increasing the amount of virtual address space
◯Adding an instruction cache
◯None of the above
◯Not Assigned
2.7 Which of the following structures are indexed using bits from a virtual address? (select all that apply)
◯Virtually addressed cache
◯1st level page table
◯2nd level page table
◯TLB
◯None of the above
◯Not Assigned
2.8 Which of the following structures are not indexed using any bits from a virtual address? (select all that apply)
◯Virtually addressed cache
◯2nd level page table
◯1st level page table
◯TLB
◯None of the above
◯Not Assigned
Problem 3: Performance
For each performance optimization, indicate whether the performance is expected to improve by: decreasing CPI, decreasing the number of instructions executed, or decreasing the clock period (select all that apply):
3.1 Introducing a memory cache into a previously cache-less system
◯Decreasing CPI
◯Decreasing number of instructions executed
◯Decreasing clock period
◯Not Assigned
3.2 Converting from a single-cycle machine into a 5-stage pipeline
◯Decreasing CPI
◯Decreasing number of instructions executed
◯Decreasing clock period
◯Not Assigned
3.3 Switching from using a static “predict-not-taken” branch predictor to a more accurate, dynamic predictor
◯Decreasing CPI
◯Decreasing number of instructions executed
◯Decreasing clock period
◯Not Assigned
3.4 Switching from avoidance to detect-and-forward to handle data hazards
◯Decreasing CPI
◯Decreasing number of instructions executed
◯Decreasing clock period
◯Not Assigned
Problem 4: CPI

Consider a program with the following instruction breakdown:
add (5%) nor (15%) beq (20%) lw (25%) sw (35%)
60% of the branches are taken. 10% of instructions that write to a register are followed immediately by a dependent instruction, and 1% of instructions that write to a register have a dependent instruction two positions later.
For example, consider the sequence of instructions:
Inst1 Inst2 Inst3
The above implies that there is a 10% chance that Inst2 reads the result of Inst1, and a 1% chance that Inst3 reads the result of Inst1. Note that it is possible that both Inst2 and Inst3 read the result of Inst1 (you should assume all probabilities are independent of one another).
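One common way to organize a calculation like 4.1 is as a base CPI plus frequency-weighted penalty terms. The sketch below is purely illustrative and is not the graded solution: the penalty values (3 squashed instructions per taken branch when speculating not-taken with branches resolved in MEM, and 1 stall cycle for a load immediately followed by a dependent instruction under detect-and-forward) are assumptions matching a typical 5-stage pipeline, and your pipeline may have additional or different hazard terms.

```python
# Illustrative CPI breakdown: base CPI plus frequency-weighted penalties.
# The penalty values below are assumptions for a typical 5-stage pipeline
# (branches resolved in MEM, full forwarding); adjust to match lecture.
instr_mix = {"add": 0.05, "nor": 0.15, "beq": 0.20, "lw": 0.25, "sw": 0.35}

taken_rate = 0.60
dep_next = 0.10          # P(next instruction depends on this one)

branch_penalty = 3       # squashed instructions per taken (mispredicted) branch
lw_use_penalty = 1       # stall cycles for a load-use hazard with forwarding

cpi = 1.0                                               # base CPI
cpi += instr_mix["beq"] * taken_rate * branch_penalty   # 0.20 * 0.60 * 3
cpi += instr_mix["lw"] * dep_next * lw_use_penalty      # 0.25 * 0.10 * 1
print(round(cpi, 3))                                    # 1.385 under these assumptions
```

The value printed is only as good as the assumed penalties; the point of the sketch is the structure (base CPI + sum of event-frequency × event-cost terms), which is the same structure the answer tables below ask for.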
4.1 Pipeline A is a 5-stage pipeline using detect-and-forward and speculate not-taken, identical to the one discussed in lecture. Calculate the CPI of this program, assuming that it runs for a very long time. List the components that contribute to the CPI below for partial credit. We have filled in the first one for you:
Overall CPI :
Description CPI Contribution (Additional comments, if needed)
Base CPI 1
4.2 Pipeline B – Consider a modified version of pipeline A which uses a “predict taken” scheme. In this pipeline, the PC of the fetched instruction is always incremented to PC+1 until a branch instruction reaches the decode stage. When the branch instruction is decoded, the IF stage is always squashed, and the PC is set to the branch’s target destination (assume a BTB is included with 100% accuracy) on the next positive clock edge. Branch mispredictions are resolved in the same stage as pipeline A. Keeping all other factors constant, what is the CPI of the new processor?
Overall CPI :
Description CPI Contribution (Additional comments, if needed)

4.3 Pipeline C is identical to Pipeline A, but the decode stage is split into 2 stages (ID1 and ID2) and the execute stage is split into 2 stages (EX1 and EX2). The processor is still fully pipelined and can maintain a CPI of 1 under ideal circumstances with no hazards. All operands must be available before an instruction can enter the EX1 stage. Branches are resolved at the end of EX2, and mispredictions are handled during the MEM stage. What is the CPI of the new machine?
Overall CPI :
Description CPI Contribution (Additional comments, if needed)
Problem 5: Branch Prediction
Consider the following LC2K assembly code (which has been extended to allow a bitwise “AND” instruction):
        add   0 0 1
        add   0 0 2
        lw    0 3 three
        lw    0 4 one
        AND   4 4 5
loop    beq   2 3 end      (beq1)
        lw    2 6 Stack
        AND   6 5 7
if      beq   7 4 inc      (beq2)
        beq   0 0 endif    (beq3)
inc     add   1 4 1
endif   add   2 4 2
        beq   0 0 loop     (beq4)
end     halt
three   .fill 3
one     .fill 1
Assume Stack is pointing to a 5 element array consisting of {5, 7, 12, 11, 9} (where 5 is at the smallest address and 9 is at the largest address). This code is run on a processor where every branch instruction is assigned its own 2-bit saturating counter for the purposes of branch prediction, and every counter is initialized to “weakly taken”.
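When filling in tables like those below, it can help to mechanize the counter update. The helper here is an illustrative sketch only: the states are SN, WN, WT, ST (moving toward taken on a taken branch and toward not-taken otherwise), and the outcome sequence passed in at the end is a made-up example, not the actual behavior of any branch in the code above.

```python
# Trace a single branch's 2-bit saturating counter, starting at "weakly taken".
# States: 0 = SN, 1 = WN, 2 = WT, 3 = ST; predict taken when state >= 2.
NAMES = ["SN", "WN", "WT", "ST"]

def trace(outcomes, state=2):
    rows = []
    for taken in outcomes:
        prediction = "T" if state >= 2 else "N"
        rows.append((NAMES[state], prediction, "T" if taken else "N"))
        # Saturate toward ST on taken, toward SN on not-taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return rows

# Hypothetical outcome sequence, purely for illustration:
for row in trace([False, True, True, False, False]):
    print(*row)
```

Each printed row is (state before resolution, prediction, actual result), which matches the three rows of the answer tables.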
Fill out the table below indicating the state of the predictor before the branch is resolved, its prediction, and the actual result of the branch. If a branch is not executed 5 times, leave the later entries blank. We have filled in the first column for beq1 already.
5.1 Table for beq1
Iteration   0    1    2    3    4
State       WT
Prediction  T
Result      N
5.2 Table for beq2
Iteration   0    1    2    3    4
State
Prediction
Result
5.3 Table for beq3
Iteration   0    1    2    3    4
State
Prediction
Result
5.4 Table for beq4
Iteration   0    1    2    3    4
State
Prediction
Result
Problem 6: Virtual Memory

Consider a system with the following parameters:
● 30 bit address space
● 1 GB of physical memory
● 2 KB pages
● 4 B page table entry size
6.1 How many physical pages are there?
How many virtual pages are there?
How large would a single level page table be, in bytes?
If we were to design a multi-level page table where every page table was required to fit in a single page, how many levels would be needed when all virtual pages are in use?
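The arithmetic behind these questions can be checked mechanically. This sketch is a calculation aid, not the graded answer; it simply encodes the stated parameters (30-bit virtual addresses, 1 GB of physical memory, 2 KB pages, 4 B page table entries) and derives the quantities asked about.

```python
# Page-count and page-table-size arithmetic for the stated parameters.
vaddr_bits = 30
phys_bytes = 1 << 30        # 1 GB
page_bytes = 2 << 10        # 2 KB
pte_bytes  = 4

offset_bits    = page_bytes.bit_length() - 1      # 11 bits of page offset
virtual_pages  = 1 << (vaddr_bits - offset_bits)  # 2^19
physical_pages = phys_bytes // page_bytes         # 2^19
flat_table_bytes = virtual_pages * pte_bytes      # single-level table size

# Multi-level: each table must fit in one page.
entries_per_page = page_bytes // pte_bytes        # 512 entries -> 9 index bits
bits_per_level = entries_per_page.bit_length() - 1
vpn_bits = vaddr_bits - offset_bits
levels = -(-vpn_bits // bits_per_level)           # ceil(19 / 9)

print(virtual_pages, physical_pages, flat_table_bytes, levels)
```

Note that with 19 VPN bits and 9 index bits per level, the top level of the multi-level design uses only the leftover bits (19 − 2 × 9 = 1 bit), which matters for the per-level storage question below.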
6.2 Use the below space to show any calculations made
6.3 How many bits would be needed to index into each level page table, and how much storage would each level occupy? Leave unused levels blank.

                                                      Level 1  Level 2  Level 3  Level 4  Level 5
Number of bits used for index into table
Combined page table size for all tables at that level
Problem 7: Caches
Consider a cache that is 2-way set associative, has 16 bytes of data storage, 4 byte blocks and is write-allocate. There is a 16-bit, byte addressable, virtual address space. Evictions are done based on an LRU scheme. Loads and stores are to one byte values. For the following memory addresses, indicate the tag, set index, and block offset, whether it is a “Hit” or a “Miss”, and for misses, what type of miss it is (“compulsory”, “capacity”, or “conflict”).
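A quick way to double-check the address breakdown is to derive the field widths from the cache geometry. The sketch below is an illustrative aid assuming a 2-way organization: 4-byte blocks give 2 offset bits, 16 B / (4 B × 2 ways) = 2 sets give 1 index bit, and the remaining 13 bits of the 16-bit address are tag.

```python
# Split a 16-bit byte address into tag / set index / block offset
# for a 2-way cache with 16 B of data storage and 4 B blocks.
ADDR_BITS, DATA_BYTES, BLOCK_BYTES, WAYS = 16, 16, 4, 2

offset_bits = (BLOCK_BYTES - 1).bit_length()      # 2
sets = DATA_BYTES // (BLOCK_BYTES * WAYS)         # 2
index_bits = (sets - 1).bit_length()              # 1
tag_bits = ADDR_BITS - index_bits - offset_bits   # 13

def split(addr):
    offset = addr & (BLOCK_BYTES - 1)             # low 2 bits
    index = (addr >> offset_bits) & (sets - 1)    # next 1 bit
    tag = addr >> (offset_bits + index_bits)      # remaining high bits
    return hex(tag), index, offset

print(split(0xE7))   # ('0x1c', 1, 3)
```

Running `split` on each address in the table gives the tag/index/offset columns; the hit/miss and miss-type columns still require tracking the LRU contents of each set by hand.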

7.1
Address   Type    Tag   Set Index   Block Offset   Hit or Miss   Miss Type
0xE7      Load
0x35      Load
0x12      Load
0xE6      Store
0x1D      Store
0x34      Load
0xE5      Store
0x1B      Load
0xE3      Store
0x11      Store
7.2 How many bytes are read from / written to memory when using a write-back policy?
Bytes written:
Bytes read:
7.3 How many bytes are read from / written to memory when using a write-through policy?
Bytes written:
Bytes read:
Problem 8: Caches Part Two
Consider two cache designs (A and B), each of which has 8 lines, stores a total of 32 bytes of data, and is write-through. They differ in their associativity: cache A is direct-mapped, while cache B is two-way set associative with LRU replacement. The system has a 32-bit, byte addressable memory system, and the cache is initially empty.
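The overhead-bit arithmetic follows from the geometry: 32 B / 8 lines gives 4 B blocks, hence 2 offset bits in a 32-bit address. The sketch below is illustrative and makes explicit assumptions: each line stores a tag plus one valid bit, and the two-way cache additionally needs one LRU bit per set. If your course counts LRU or other metadata bits differently, the totals change accordingly.

```python
# Overhead (non-data) bits for an 8-line, 32 B cache at two associativities.
# Assumes tag + 1 valid bit per line, plus 1 LRU bit per set when 2-way.
ADDR_BITS, LINES, DATA_BYTES = 32, 8, 32
BLOCK_BYTES = DATA_BYTES // LINES                 # 4 B blocks
offset_bits = (BLOCK_BYTES - 1).bit_length()      # 2

def overhead_bits(ways, lru_bits_per_set=0):
    sets = LINES // ways
    index_bits = (sets - 1).bit_length()
    tag_bits = ADDR_BITS - index_bits - offset_bits
    return LINES * (tag_bits + 1) + sets * lru_bits_per_set

print(overhead_bits(ways=1))                      # direct-mapped: 8 * (27 + 1) = 224
print(overhead_bits(ways=2, lru_bits_per_set=1))  # 2-way: 8 * (28 + 1) + 4 = 236
```

Note the tension the numbers illustrate: going two-way removes an index bit, which lengthens every tag, so the more associative cache carries more overhead for the same data capacity.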
8.1 What is the total storage needed (in bits) for storing overhead (i.e. non-data) bits in cache A? Answer : bits
8.2 Use the below space to show any calculations made
8.3 What is the total storage needed (in bits) for storing overhead (i.e. non-data) bits in cache B?
Answer : bits
8.4 Use the below space to show any calculations made
8.5 What is the shortest sequence of addresses (written in hex) that would result in a HIT when loading from cache A, but a MISS in cache B? If no such sequence exists, write “N/A” instead.
8.6 What is the shortest sequence of addresses (written in hex) that would result in a MISS when loading from cache A, but a HIT in cache B? If no such sequence exists, write “N/A” instead.