UNIVERSITY OF CALIFORNIA, LOS ANGELES UCLA
“
BERKELEY • DAVIS • IRVINE • LOS ANGELES • RIVERSIDE • SAN DIEGO • SAN FRANCISCO SANTA BARBARA • SANTA CRUZ
CS M151B / EE M116C
Final Exam
Before you start, make sure you have all 13 pages attached to this cover sheet.
All work and answers should be written directly on these pages, use the backs of pages if needed. This is an open book, open notes final – but you cannot share books, notes, or calculators.
Problem 1 (10 points): ___________ Problem 2 (10 points): ___________ Problem 3 (10 points): ___________ Problem 4 (10 points): ___________ Problem 5 (20 points): ___________ Problem 6 (10 points): ___________ Problem 7 (30 points): ___________
Total: ________________ (out of 100 points)
NAME: ________________________________________________________________ ID: ____________________________________________________________________
1. I Amdahl-ighted with Tradeoffs (10 points): Given the following problems, suggest one solution and give one drawback of the solution. Be brief, but specific.
EXAMPLE
Problem: long memory latencies
Solution: Caches
Drawback: when the cache misses, the latency becomes worse due to the cache access latency
We would not accept solutions like: “do not use memory”, “use a slower CPU”, “cache is hard to spell”, etc
Problem: too many capacity misses in the data cache
Solution:
drawback:
Problem: too many control hazards
Solution:
drawback:
Problem: our carry lookahead adder is too slow
Solution:
drawback:
Problem: we want to be able to use a larger immediate field in the MIPS ISA
Solution:
drawback:
Problem: the execution time of our CPU with a single-cycle datapath is too high
Solution: drawback:
1
2. Hazard a Guess? (10 points): Assume you are using the 5-stage pipelined MIPS processor, with a three-cycle branch penalty. Further assume that we always use predict not taken. Consider the following instruction sequence, where the bne is taken once, and then not taken once (so 7 instructions will be executed total):
Loop : lw $t0, 512($t0) lw $t1, 64($t0)
bne $s0, $t1, Loop sw $s1, 128($t0)
Assuming that the pipeline is empty before the first instruction:
a. Suppose we do not have any data forwarding hardware – we stall on data hazards. The register file is still written in the first half of a cycle and read in the second half of a cycle, so there is no hazard from WB to ID. Calculate the number of cycles that this sequence of instructions would take:
__________________
2
b. How many cycles would this sequence of instructions take with data forwarding hardware: _____________________
3
3. More $ More Problems (10 points): Find the data cache hit or miss stats for a given set of addresses.
The data cache is a 1KB, direct mapped cache with 64-byte blocks. Find the hit/miss behavior of the cache for a given byte address stream, and label misses as compulsory, capacity, or conflict misses. All blocks in the cache are initially invalid.
Cache Miss Type
Cache Address in Binary Hit or Miss
…0011011010000
…0001011100000
…0000011010000
…0011011100000
…0011011010000
…0001011100000
…0000011010000
…0011011100000
4. The Trouble with TLBs (10 points): Consider an architecture with 32-bit virtual addresses and 1 GB of physical memory. Pages are 32KB and we have a TLB with 64 sets that is 8-way set associative. The data and instruction caches are 8KB with 16B block sizes and are direct mapped – and they are both virtually indexed and physically tagged. Assume that every page mapping (in the TLB or page table) requires 1 extra bit for storing protection information. Answer the following:
a. How many pages of virtual memory can fit in physical memory at a time? ______________
b. How large (in bytes) is the page table? ________________
c. What fraction of the total number of page translations can fit in the TLB? ____________
d. What bits of a virtual address will be used for the index to the TLB? Specify this as a range of bits – i.e. bits 4 to 28 will be used as the index. The least significant bit is labeled 0 and the most significant bit is labeled 31.
___________________
4
5. Starting Some Static (Scheduling) (20 points): Consider the 2-way superscalar processor we covered in class – a five stage pipeline where we can issue one ALU or branch instruction along with one load or store instruction every cycle. Suppose that the branch delay penalty is two cycles and that we handle control hazards with branch delay slots (since the penalty is two cycles, and this is a 2-way superscalar processor, that would be four instructions that we need to place in delay slots). This processor has full forwarding hardware. This processor is a VLIW machine. How long would the following code take to execute on this processor assuming the loop is executed 200 times? Assume the pipeline is initially empty and give the time taken up until the completed execution of the instruction sequence shown here. First you will need to schedule (i.e. reorder) the code (use the table below) to reduce the total number of cycles required (but don’t unroll it…yet).
Total # of cycles for 200 iterations: _____________________________
(Hint – schedule the code first for one iteration, then figure out how long it will take the processor to run 200 iterations of this scheduled code)
Loop:
lw $t0, 0 ($s0)
lw $t1, 0 ($t0)
add $t1, $s1, $t1
sw $t1, 0 ($t0) # you may assume that this store never goes to the same address as the first load addi $s0, $s0, 4
bne $s0, $s2, Loop
Cycle
1st Issue Slot (ALU or Branch)
2nd Issue Slot (LW or SW)
1
2
3
4
5
6
7
8
9
10
11
12
13
5
Now unroll the loop once to make two copies of the loop body. Schedule it again and record the total # of cycles for 200 iterations:
__________________________________
Cycle
1st Issue Slot (ALU or Branch)
2nd Issue Slot (LW or SW)
1
2
3
4
5
6
7
8
9
10
11
12
13
6
6. A Branch Too Far (10 points): One difficulty in designing a branch predictor is trying to avoid cases where two PCs with very different branch behavior index to the same entry of a 2-bit branch predictor. This is called destructive aliasing. One way around this is to use multiple 2-bit branch predictors with different sizes. This way, if two PCs index to the same entry in one predictor, they will not likely index to the same entry in the other predictor. We will evaluate a scheme with three 2-bit branch predictors – each with a different number of entries. The three predictors will be accessed in parallel, and the majority decision of the predictors will be chosen. So if two predictors say taken and the other predictor says not taken, the majority decision will be taken. The scheme looks like this:
PC
1K-entry 2-bit predictor
512-entry 2-bit predictor
Taken or Not Taken
Each predictor has the following FSM:
Taken Predict taken
Not taken
Majority Vote
Predict taken
Taken
11 Taken 10 Not taken
Not taken
Predict not taken
01 Taken 00
Not taken
7
768- entry 2-bit predictor
Predict not taken
Evaluate the performance of this prediction scheme on the following sequence of PCs. The table shows the address of the branch and the actual direction of the branch (taken or not taken). You get to fill in whether or not the branch predictor would guess correctly or not. Each node of the FSM is marked with the 2-bit value representing that state. Assume that all predictors are initialized to 00. To find an index into a predictor, assume we use the simplified branch indexing formula: index = PC % predictor_size. The symbol % represents the modulo operator. Predictor_size will be different according to the predictor.
PC
Actual Direction
Correctly Predicted?
128
T
640
NT
1152
NT
128
T
640
T
1152
NT
128
T
640
NT
1152
NT
128
T
640
T
1152
NT
8
i
r W
7.
With Friends Like These… (30 points): Consider the scalar pipeline we have explored in class:
i
r W
Hazard detection unit
ID/EX.MemRead
ID/EX
M ux
et
EX/MEM
DI / FI
IF/ID
Control
MEM/WB
0
et
CP
on
i t c
ur
PC
t s
nI
M u x
WB
M
WB
EX
M
WB
Instruction memory
Registers
M u x
M u x
M u x
ALU
Data memory
IF/ID.RegisterRs
IF/ID.RegisterRt
IF/ID.RegisterRt
IF/ID.RegisterRd
Rt
Rd
EX/MEM.RegisterRd
ID/EX.RegisterRt
Rs Rt
Forwarding unit
MEM/WB.RegisterRd
a. (10 points) Suppose 10% of instructions are stores, 15% are branches, 25% are loads, and the rest are R-type. 30% of all loads are followed by a dependent instruction. We have full forwarding hardware on this architecture. We use a predict not taken branch prediction policy and there is a 2 cycle branch penalty. This means that the PC is updated at the end of the EX stage – after the comparison is made in the ALU. One third of all branches are taken. There is an instruction cache with a single cycle latency and a miss rate of 10% and a data cache with a single cycle latency and a miss rate of 20%.
We have an L2 cache that misses 5% – it has a 10 cycle latency – and memory has a 100 cycle latency. Find the TCPI for this architecture.
TCPI = __________________
9
b. (5 points) Your friend has a flash of brilliance – “I know a way to get rid of stalls in this pipeline. The reason we have to stall now is because a load can have a dependent instruction follow it through the pipeline, and we cannot forward the load’s data until the end of the MEM stage – but the dependent instruction needs it at the beginning of the EX stage. So what if we add another ALU that recomputes what we did in EX if the instruction before it is a load and it is dependent on the load?” This ALU will be in the memory stage of the pipeline as shown below in this simplified picture:
M u x
M u x
ALU
Data memory
M u x
M u Mx
u x
M u x
Is your friend right or wrong? If they are wrong, give an example of when we would still need to stall. They are right: ___________________
Or
Counter example: ____________________________________________________________
ALU
10
c. (5 points) Another friend offers an alternative – using the original pipeline from part a, let’s get rid of base + displacement addressing for loads and stores. Loads and stores can only use register addressing now. This will allow us to combine EX and MEM into one stage (called EM) and avoid the need to stall entirely. Instructions will either use the ALU or memory – but not both. There is still forwarding hardware, but now we only need to forward from the EM/WB latch to the EM stage ALU. The pipeline will now be:
IF
ID
EM (EX or MEM)
WB
Suppose that four fifths of loads actually use base + displacement addressing (i.e. they have a non-zero displacement), which means that these loads will need to have add instructions before them to do their effective address computation. Half of stores use base + displacement addressing, and these will also need to be replaced with an add plus the store instruction. This modification has no impact on the branch penalty or the instruction cache miss rate.
Is your friend right or wrong – will this eliminate all stalls? If they are wrong, give an example of when we would still need to stall.
They are right: ___________________
Or
Counter example: _____________________________________________________________
11
d. (10 points) A third friend has a different idea (it may be time for you to get new friends who don’t talk about architecture all the time). Forget about trying to eliminate hazards – she says we should just use superpipelining and get a win on cycle time. Take the original architecture from part a – ignore the suggestions from b and c – and assume that the stages have the following latencies:
Your friend suggests a way to cut the IF, EX, and MEM stages in half – just increase the pipeline depth and make each of these stages into two stages. So your pipeline would now have IF1, IF2, ID, EX1, EX2, MEM1, MEM2, and WB stages – each of which would have 100 picosecond latency. Your friend also finds a way to do full forwarding between stages – even in the ALU – but loads are still a problem. In fact, load stalls will increase now because of this increase in pipeline depth. To help you figure out the new # of pipeline stalls from load data hazards, use the following table:
Stage
Latency (in picoseconds)
IF
200
ID
100
EX
200
MEM
200
WB
100
% of Loads
Distance of the next dependent instruction
30%
1 cycle
20%
Exactly 2 cycles later
20%
Exactly 3 cycles later
10%
Exactly 4 cycles later
10%
Exactly 5 cycles later
5%
Exactly 6 cycles later
5%
Exactly 7 or more cycles later
So this means that 30% of loads are immediately followed by a dependent (i.e. 1 cycle later), 20% of loads have a dependent exactly 2 cycles later, 20% have a dependent 3 cycles later, and so on. These classifications are completely disjoint – the 20% of loads that have a dependent 2 cycles later do NOT have dependents 1 cycle later.
Find the TCPI of this new architecture:_________________
12
Assume your target application will run 1M instructions. Find the execution time of this architecture for that application:
ET: ________________
13