CS 520 – Ghose – Fall 2018 – Midterm Exam EXAM ENDS AT 4:20 pm, Open Book, Open Notes
ANSWER ALL 5 PROBLEMS
SUGGESTED LENGTH OF ANSWERS ARE ALSO GIVEN BELOW – DON’T BE TOO VERBOSE
Problem 1 (30 Points, 2 Pages). A new instruction, LOADR
As an example, if the registers
SOLUTION:
Fetch Stage:
Activity: Fetches instruction.
Output: fetched instruction sent to D/RF stage, PC_value of instruction.
D/RF:
Activity: Decode instruction and read out values of
Output: PC_value of instruction, decoded information for instruction, values of
EX:
Activity: Decoded information sets up ALU to do an addition to add up the values of
Output: PC_value of instruction, decoded information for instruction, sum of the values of
MEM:
Activity: Decoded information directs the logic of the MEM stage to perform a memory read using sum passed to it from the EX stage as the address.
Output: PC_value of instruction, decoded information for instruction, value of data, data read from memory, address of
WB:
Activity: Decoded information directs the logic of the WB stage to write the data read from memory into the
Output: None.
GRADING: Fetch stage: 2 points total. For all subsequent stages, a total of 7 points per stage, broken up as follows:
Activity: 3 points; outputs: 3 points. 1 off for each missing item, max penalty of 7 points per stage (other than Fetch).
Did not show PC chain or Instruction register (decoded info) chain: 2 points off in each case.
No mention of use of decoded info anywhere: 2 off – once
OK to draw diagram and show info flow but must state how decoded info is used in MEM (to do nothing).
Problem 2 (25 Points, 3 Pages). Answer the claims made in following questions as “True” or “False” and explain each of your answers:
a) In an out-of-order processor, if execution constraints implied by anti-dependencies and output- dependencies are correctly handled, then flow dependencies are taken care of automatically.
SOLUTION: False. Flow dependency requires hardware to ensure that operation on source values of an instruction can be considered for starting only when all source operand values are available.
b) In an out-of-order processor with a LSQ, if a LOAD is allowed to perform a memory operation before earlier STOREs queued up ahead of it in the LSQ, the ordering implied by the sequential execution model is not violated as long as the memory address computed for the LOAD does not match the memory addresses of all of the STOREs ahead of it in the LSQ.
SOLUTION: True. The memory location from which the LOAD accesses the data is unaffected by the STORES ahead of the LOAD in the LSQ. The net effect is that when the LOAD is to be committed based on dispatch order (as recorded in the ROB), the memory operation performed by the LOAD was delayed till all previous STOREs committed and updated memory in program order.
c) For an out-of-order processor using register renaming like Variation 3 on Page 198 of the notes, assume that the ISA specifies 32 general purpose registers and one condition code register. If the unified register file (RF) contains a total of 33 registers, the performance of the processor is going to be very poor.
SOLUTION: True. The unified register file has enough registers only to hold the committed values of the architectural registers (32 general-purpose registers plus the condition code register), leaving no registers for use to be assigned to another instance of an architectural register. This will stall dispatches till another previously dispatched instruction commits; the advantages of register renaming is not exploited.
d) The code for a loop that is transformed using software pipelining for one pipelined implementation of an ISA will always produce the correct results when run on a different pipelined implementation of the same ISA.
SOLUTION: False. Software pipelining requires specific knowledge of the instruction execution latencies, which are implementation-specific. Running the software code generated for a specific implementation may lead to incorrect results in a different implementation that has different execution latencies.
e) In a processor that uses register renaming, an instruction like ADD R1, R1, R1 is illegal and cannot be handled.
SOLUTION: False. The two “R1”s used as the source refer to the previous instance of R1 which come from a physical register different from the one allocated to the destination “R1” of the instruction. Since the rename table reads for the source registers are completed before the rename table entry for the destination is updated, the renaming step will correctly obtain the physical register address for the instance of “R1” that is used as the two sources for the ADD.
GRADING:
5 points for each of (a) through (e). Correct answer: 2 points, explanation: 3 points. Minimum score for trying: 5 points.
Problem 3 (25 Points, 2 Pages). Consider the out-of-order processor shown as Variation 2 on Page 198 of the notes.
a) Propose a format of the entry for an ADD instruction in the ROB?
b) Describe how information stored in the ROB entry for the ADD is used to commit the ADD instruction using a C-like code similar to that shown on Page 185 of the notes.
SOLUTION:
ARF: architectural register file – holds committed instances of architectural registers PRF: physical register file – holds result till commitment
a) Entry for ADD in the ROB has the following fields:
Type: type of instruction corresponding to this entry (load or store or register-to-register etc., in this case, “register-to-register”).
PC value: address of the instruction
Status (of result/exception code field)
Excode (exception code)
Destination architectural register (darch) for this instruction
Destination physical register (dphy) assigned to this ADD instruction
Note that a ROB field is not needed to hold the result, as it is assigned a physical register in the PRF.
b) Committing the ADD:
if (ROB[tail].status == INVALID) then stall commitment else {
if (ROB[tail].Excode < > 0) /* execution has generated exception */ signal exception /* and exit to exception handling functions */
else {
}
Critical portions of the solution are shown in red, boldface fonts.
GRADING:
Part (a): Total on this part: 10 points
Must NOT list “Result” field to hold result computed by ADD – if this is included, 4 points off. Any missing field: 1 off for each missing field
Part (b): Total on this part: 15 points
Each of the two statements in red and boldface: 5 points each, that is:
ARF [darch] = PRF [dphy]; 5 Points
Release physical register back to free list; 5 Points
Check for exception missing: 2 points off Check for status missing: 3 points off Minimum score on Part (a) for trying: 2 points Minimum score on Part (b) for trying: 3 points.
Problem 4 (15 Points, 1 Page). In an out-of-order processor shown as Variation 2 on Page 198 of the notes and instruction semantics as defined in the notes, what are the conditions that have to be met for dispatching:
a) A LOAD instruction
b) A STORE instruction
c) An ADD instruction
SOLUTION:
a) An IQ entry must be free AND the LSQ must not be full AND a physical register is available in the
PRF AND the ROB must not be full.
b) An IQ entry must be free AND the LSQ must not be full AND the ROB must not be full.
c) An IQ entry must be free AND a physical register is available in the PRF AND the ROB must not be full.
ARF [darch] = PRF [dphy];
Release physical register back to free list; ROB.tail ++
GRADING:
1.5 points for each resource needed in each part. (a), (b) and (c) are thus worth 6, 4.5 and 4.5 points, respectively. Minimum score for trying: 3 points.
Problem 5 (15 Points, 1 Page`). Repeat Problem 4 for the out-of-order processor shown as Variation 3 on Page 198 of the notes.
SOLUTION:
d) An IQ entry must be free AND the LSQ must not be full AND a physical register is available in the
unified register file (RF) AND the ROB must not be full.
e) An IQ entry must be free AND the LSQ must not be full AND the ROB must not be full.
f) An IQ entry must be free AND a physical register is available in the unified register file (RF) AND the ROB must not be full.
GRADING:
1.5 points for each resource needed in each part. (a), (b) and (c) are thus worth 6, 4.5 and 4.5 points, respectively. Minimum score for trying: 3 points.
CS 520 – Fall 2019 – Ghose Exam 1 – Open book/open notes
PLEASE BE CONCISE. SUGGESTED LENGTH OF ANSWERS SHOWN BELOW FOR EACH PROBLEM. MAKE SURE YOU PRINT YOUR NAME CLEARLY ON THE EXAM BOOK. PLEASE WRITE LEGIBLY.
ANSWER PROBLEM 1 ON THE QUESTION SHEET, PROBLEMS 2 THROUGH 5 IN THE ANSWER BOOK AND RETURN THE QUESTION SHEET WITH YOUR EXAM BOOK
Name (Print Clearly) ________________________________________________________________ Signature: _________________________________________________________________________
Problem 1 (20 Points). Answer the following claims as True (T) or False (F) in the box next to the question in this question sheet. Read each claim very carefully before you answer.
a. Machine-level parallelism is required for exploiting instruction level parallelism
T
b. Data forwarding in pipelines eliminates bubbles due to data dependencies completely in all cases
F
c. The sequential execution model assumed by programmers requires memory updates to be made in program order
T
d. An issue queue always forces instructions to complete execution out of program order
F
e. Software pipelined code are binary compatible
F
f. In an out-of-order processor with multiple function units, where each function unit is pipelined, a selection logic is required for every function unit to select an eligible instruction for issue
T
g. In an out-of-order processor supporting tag broadcasting and associative tag matching to wake up dependents one cycle before the actual result is available, the result may need to be forwarded to a consumer as it enters a function unit to begin execution
T
h. Multithreaded processors cannot improve the processing latency of individual threads
T
i. Dynamic branch predictors are likely to be incorrect when consecutive executions of a branch have identical outcomes
F
j. Branch prediction accuracy is increased when the control flow path into a branch instruction is taken into consideration
T
GRADING: 2 POINTS PER QUESTION, MINIMUM SCORE: 4 POINTS
(TURN OVER)
1 of 4
Problem 2 (20 Points, 2 Pages). Consider a 5-way out-of-order superscalar processor that handles ALL three types of data dependencies explicitly without using register renaming. Assuming that instructions have at most three source registers and one destination register, calculate the total number of comparators that are needed to detect all dependencies concurrently within the group of co-dispatched instructions. Show all work and provide necessary explanations.
SOLUTION:
Assume instructions being co-dispatched are: I1, I2, I3, I4 and I5 (in program order).
(Worth 7 Points) Detecting flow dependencies – number of comparators needed are:
From dest of I1 to each of the three srcs of I2, I3, I4 and I5: 3 comparators for each of I2 to I5, total: 12 From dest of I2 to each of the three srcs of I3, I4 and I5: 3 comparators for each of I3 to I5, total: 9 From dest of I3 to each of the three srcs of I4 and I5: 3 comparators for each of I4 and I5, total: 6
From dest of I4 to each of the three srcs of I5: 3 comparators
So total number of comparators for detecting flow dependencies within co-dispatched group is (12 + 9 + 6 + 3) = 30
(Worth 7 Points) Detecting anti-dependencies – number of comparators needed are:
From each of the three sources of I1 to the destinations of I2, I3, I4 and I5: total 12 From each of the three sources of I2 to the destinations of I3, I4 and I5: total 9 From each of the three sources of I3 to the destinations of I4 and I5: total 6
From each of the three sources of I4 to the destination of I5: total 3
So total number of comparators for detecting anti-dependencies within co-dispatched group is (12 + 9 + 6 + 3) = 30
(Worth 6 Points) Detecting output dependencies – number of comparators needed are:
From the destination of I1 to the destinations of I2, I3, I4 and I5: total 4 From the destination of I2 to the destinations of I3, I4 and I5: total 3 From the destination of I3 to the destinations of I4 and I5: total 2
From the destination of I4 to the destination of I5: total 1
So total number of comparators for detecting anti-dependencies within co-dispatched group is (4 + 3 + 2 + 1) = 10
So a total of 70 comparators are needed to concurrently detect all three types of dependencies within the group of 5 co-dispatched instructions.
Other grading guidelines for Problem 2: 2 off for each mistake, -16 off maximum. Minimum score for trying; 4 points
Total 5 points off on entire problem if ALL answers or answers to individual parts are correct, but assumed 2 source registers per instruction instead of three.
2 of 4
Problem 3 (40 Points, 4 Pages). Consider a processor supporting out-of-order execution that uses a centralized issue queue (IQ) where all source registers are read out at the time of instruction issue. This processor has multiple function units and uses associative tag based notifications on the availability of a register value with physical register addresses as the tags. Assume further that instructions have at most two source registers. Modify this processor design as follows:
Split up the IQ into two different issue queues, IQ0 and IQ1 with IQ1 supporting associative tag-based forwarding, as in the original design. The total number of entries in IQ0 and IQ1 is the same as the total number of entries in IQ.
Instructions that have one or both of the source registers as invalid at the time of dispatch are dispatched to IQ1. Instructions that have all of their source register ready at the time of dispatch are dispatched to IQ0.
a. What are the potential advantages of splitting up the centralized issue queue in this manner? What are the disadvantages?
ANSWER: 20 Points for Part (a):
Advantages: Worth 10 points (or 15, with bonus)
1) Fewer number of comparators are needed:
IQ0 has both source register operands valid at the time of dispatch – it does not need any comparators.
IQ1 needs two comparators per entry per forwarding bus
2) The overall hardware requirements of IQ0 and IQ1 is simplified: IQ0 does not need any wakeup
logic or a forwarding bus connection.
3) Shorter forwarding bus with fewer comparator connected to it can speed up tag broadcast and potentially support a faster clock frequency.
GRADING: 5 points each for ANY TWO of (1), (2) and (3). 5 bonus points if (c) is ALSO answered along with correct answers for (1) and (2)
Disadvantages: Worth 10 points
1) Two sets of connections from decode/rename/dispatch stages to IQ0 and IQ1 instead of just one in the original design.
2) IQ0 and IQ1 need independent connections to the physical register file and the function units.
3) Need an arbitration logic to select issued instructions from IQ0 and/or IQ1.
4) Dispatch can get blocked if queue needed is full while the other queue has free entries. With a single queue, any free IQ entry can hold a dispatched instruction regardless of the number of ready operands.
Must get any two of these correctly – 5 points each.
b. Is it necessary to have IQ0 at all? Why cannot instructions that have all sources ready be issued directly to the function units?
ANSWER: IQ0 is needed to momentarily hold instructions while instructions from IQ1 are issued. If we directly issue instructions from the dispatch stage, issue of instructions from IQ1 may be unnecessarily delayed/starved.
3 of 4
Worth: 10 Points
c. Propose a strategy for selection instruction for issuing from these three queues. Justify your strategy.
ANSWER: To free starvation of issues from either queue, a round robin policy may be used to select instructions from the issue queue to each function unit. This will be simpler to implement. Other selection policies such as picking up instructions from the queue that has more allocated entries or ready entries can also be used but these policies will be difficult to implement.
Worth 10 points
Make sure that you provide explanations to your answers.
Minimum score on Problem 3: 8 points.
Problem 4 (20 Points, 2 Pages). Consider an out-of-order processor with a centralized issue queue (IQ) and a load-store queue (LSQ) that has a dedicated Memory Access Unit to calculate the target memory address and then start the memory access once the entry at the head of the LSQ is eligible to perform the memory access. The processor uses tag-based notifications on the availability of register values.
a. What are the similarities between the IQ and the LSQ?
ANSWER: This part worth 10 points, 5 points each for any two below.
1) Both queues use associative tag matching logic to become aware of the availability of source register values.
2) Both hold dispatched instructions.
3) Both need connections to read out register values at the time of issue.
b. What are the dissimilarities?
ANSWER: This part worth 10 points, 5 points each for any two below.
1) IQ is not FIFO – any free entry can be allocated; LSQ is FIFO and entries are allocated at its tail
2) IQ needs a selection logic for issue, LSQ does not – issues take place from its head.
3) Issue conditions are different for the IQ and LSQ: IQ can issue out of program order, while LSQ can only issue in program order and only when corresponding ROB entry is at the head of the ROB.
Minimum score on Problem 4: 4 points.
Problem 5 (20 Points, 2 Pages). You are to design a branch predictor that keeps track of the actual outcomes of the previous four executions of the branch and predicts the branch to be taken if three consecutive executions of the branch were all taken in the past four executions of the branch.
a. What information will be stored in the BTB? Worth 10 points
ANSWER: Past 4 actual branch outcomes: A4, A3, A2, A1, where A1 is the most recent outcome.
b. What is the logic for deriving the prediction from the stored information? Worth 10 points
c. ANSWER: (A4 & A3 & A2) | (A3 & A2 & A1), where & = logical AND and | = bitwise OR.
Can also draw logic gate equivalents with 3-input gates or 2-input gates. Minimum score on Problem 5: 4 points
4 of 4