
UNIVERSITY OF CALIFORNIA, DAVIS
Department of Electrical and Computer Engineering
EEC 170: Introduction to Computer Architecture
Midterm 1, Fall 2019
Name:
Instructions:
1. This exam is open-note, open-book. Calculators are allowed.
2. No devices that can communicate (laptops, phones, or PDAs) are allowed, with the exception of a device that can read your e-book textbook, on which you must turn off your wifi and/or cellular connections. Other than “find”, no typing. Turn off all phones, pagers, and alarms.
3. You may not share any materials with anyone else, including books, notes, lab reports, datasheets, calculators, and most importantly, answers!
4. This exam is designed to be a 60 minute exam, but you have 110 minutes to complete it. (Roughly) one point is budgeted per minute. Use your time wisely!
Excerpts from the UC Davis Code of Academic Conduct:
1. Each student should act with personal honesty at all times.
2. Each student should act with fairness to others in the class. That means, for example, that when taking an examination, students should not seek an unfair advantage over classmates through cheating or other dishonest behavior.
3. Students should take group as well as individual responsibility for honorable behavior.
I understand the honor code and agree to be bound by it.
Signature:
Page:     2    3    4    5    6    9  Total
Points:   8   12   12    8   10   15     65
Score:
Page 1 of 12
(c) John Owens 2019
Do not reproduce in any way Do not upload anywhere

All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions. Feel free to detach the two RISC-V reference pages and the adder reference page at the end of the exam, but if you do, please take them with you and recycle them (don’t try to reattach them or turn them in).
1. Many machine learning applications do not require the precision or range of single- precision (32b) floating point. Thus Google has introduced a new 16-bit floating-point format, bfloat16, which is structured in a similar way to the IEEE 754 formats we discussed in class. bfloat16 has one sign bit, an 8-bit exponent, and a 7-bit significand (the bits after the 1.).
(a) (5 points) What is the smallest normal positive number that can be represented as a bfloat16? (“Normal” here means “ignore subnormal numbers”.) Express your answer as a normalized number ±a.b × 2^c, where you specify the sign, a, b, and c. Show your work clearly if you want partial credit.
(b) NVIDIA hardware also supports a 16-bit floating-point format, half. It has 1 sign bit, 5 exponent bits, and 10 significand bits (the bits after the 1.).
Answer each of the following questions with bfloat16 or half.
i. (1 point) The format with the largest representable value is (bfloat16 or half):
ii. (1 point) The format with a representable value closest to zero is (bfloat16 or half):
iii. (1 point) The format with the smallest gap between two adjacent values when the bias-adjusted exponent is zero (meaning the normalized number is 1.x × 2^0) (where, for these two adjacent values, the floating-point representation is identical except for the least significant bit of the significand) is (bfloat16 or half):
Page 2 of 12

2. Short code sequences.
(a) (6 points) MIPS has the instruction nor. RISC-V lacks this instruction. Implement nor x2, x3, x4 in as few RISC-V instructions as possible. You may not make any changes to any register except x2.
(b) (6 points) RISC-V also lacks the instruction sgtz, which sets its output register to 1 if the input register is larger than zero and to 0 otherwise. Implement sgtz x2, x3 in as few RISC-V instructions as possible. You may not make any changes to any register except x2.
Page 3 of 12

(c) (12 points) RISC-V has a floating-point instruction called fsgnj.d rd, rs1, rs2, “floating-point sign injection”. It returns a number whose magnitude is the same as its rs1 argument but whose sign is the same as its rs2 argument. Note it works on double-precision (64b) data.
Implement fsgnj.d x2, x3, x4 in as few integer RISC-V instructions as possible. Your inputs are 64b floating-point numbers stored in 64b integer registers and your output is also a 64b floating-point number in a 64b integer register. Do not use floating-point registers or instructions to solve this problem. You may use x10, x11, etc. as temporary registers if needed. For this part only, you may use the pseudoinstruction li (load immediate), which will count as one instruction.
To aid grading (to help give you partial credit), describing your approach in a sentence or two will be helpful.
Page 4 of 12

3. (8 points) Let’s consider adding a reciprocal operation to RISC-V. Your engineers have determined that the primary use for a reciprocal operation is to normalize 3-component vectors:
xnew = x / sqrt(x^2 + y^2 + z^2),
with ynew and znew defined analogously. Note we could take the reciprocal of the square root once and then use multiply instructions to achieve the same result: to compute xnew, ynew, and znew, we replace the divide operations with reciprocal and multiply instructions.
If we assume that add instructions take 1 cycle, multiply operations take 4 cycles, and both square-root and divide operations take 20 cycles, what is the maximum number of cycles for the reciprocal operation so that 3-component vector normalization is at least as fast using reciprocal and multiply as it was using division?
Page 5 of 12

4. You have simplified the design of your new RISC-Y processor to the point where you only have two classes of instructions, A-type (which takes 1 cycle to finish) and B-type (which takes 3 cycles). You compile your company’s most important benchmark with your internal compiler and find it has a CPI of 2.8.
The Team Rocket compiler company visits your company and boasts they can achieve twice the MIPS (millions of instructions per second) on your benchmark on your RISC-Y processor. You test their compiler: it’s true, they can. However, the runtime of their compiled code is the same as yours.
(a) (7 points) What is the instruction mix (what percentage of A-type and what percentage of B-type) of the compiled code from the Team Rocket compiler?
(b) (3 points) If Team Rocket’s code gets twice as many MIPS, why does it run at the same speed as your internal compiler’s? Be precise as to why and quantify your answer.
Page 6 of 12

5. In this problem we design an integer ALU that is inspired by the POWER instruction set. For the purposes of this problem, assume that POWER supports 32-bit integers, that it has an R-type instruction format that takes two input registers and one output register, and an I-type instruction format that takes one input register, one output register, and a 16-bit immediate.
Our base hardware is a 32-bit adder that has five inputs. Four of the inputs (AH, AL, BH, and BL) specify the A and B inputs to the adder, each divided into two 16-bit chunks (the 16b high and low halves of the 32b input words to the adder). The fifth input (Cin) is a one-bit carry in.
Your job is to select what goes into these five inputs. For the first four inputs, you must set the control signals on a SUPERMUX, which is a 16-bit wide mux that can select from 14 16-bit inputs:
• A[31:16]
• A[15:0]
• B[31:16]
• B[15:0]
• ~A[31:16] (invert the bits of A[31:16])
• ~A[15:0]
• ~B[31:16]
• ~B[15:0]
• imm[15:0] (the immediate from I-type instructions)
• A[31] x 16 (bit A[31] copied to all 16 bits)
• B[31] x 16 (bit B[31] copied to all 16 bits)
• imm[15] x 16 (bit imm[15] copied to all 16 bits)
• 0 x 16 (16 zeroes)
• 1 x 16 (16 ones)
[Figure: the 32-bit adder. Four SUPERMUXes feed its AH, AL, BH, and BL inputs; the adder also takes Cin and produces Sum (Out[31:0]) and Cout.]

Page 7 of 12

For the last control input (Cin), you can set it to 0 or 1.
As an example, if I asked you to configure this ALU to compute a standard 32-bit add,
you would set the following:
• AH = A[31:16]
• AL = A[15:0]
• BH = B[31:16]
• BL = B[15:0]
• Cin = 0
This information is replicated at the end of the exam. Feel free to detach it and use it to answer the following questions. There are no questions on this page.
Page 8 of 12

For each of the following instructions, fill in how you would set the five control signals.
(a) (3 points) SUB: a 32-bit subtract, A-B.
(b) (3 points) LUI: load the 16-bit immediate into the upper 16 bits of the output, setting the other output bits to zero.
(c) (3 points) ADDIS: “add immediate shifted”; add input A to the (immediate left-shifted by 16 bits).
(d) (3 points) DBL: “multiply by 2”; return A×2
(e) (3 points) In English, what does the instruction with the following control settings do?
• AH = 1 x 16
• AL = 1 x 16
• BH = B[15:0]
• BL = B[31:16]
• Cin = 1
Page 9 of 12

There are no questions on this page. This page is for reference only. Feel free to detach it.
SUPERMUX inputs:
• A[31:16]
• A[15:0]
• B[31:16]
• B[15:0]
• ~A[31:16] (invert the bits of A[31:16])
• ~A[15:0]
• ~B[31:16]
• ~B[15:0]
• imm[15:0] (the immediate from I-type instructions)
• A[31] x 16 (bit A[31] copied to all 16 bits)
• B[31] x 16 (bit B[31] copied to all 16 bits)
• imm[15] x 16 (bit imm[15] copied to all 16 bits)
• 0 x 16 (16 zeroes)
• 1 x 16 (16 ones)
For the last control input (Cin), you can set it to 0 or 1.
As an example, if I asked you to configure this ALU to compute a standard 32-bit add, you would set the following: AH = A[31:16]; AL = A[15:0]; BH = B[31:16]; BL = B[15:0]; Cin = 0.
[Figure: the 32-bit adder again, with its four SUPERMUX inputs (AH, AL, BH, BL), Cin, Sum (Out[31:0]), and Cout.]

Page 10 of 12

[Reference pages: a “RISC-V Reference” card (credited to James Zhu). The first page gives the core instruction formats (the R-, I-, S-, B-, U-, and J-type bit layouts) and the RV32I base integer instructions with their opcodes, funct3/funct7 values, and C-style descriptions. The second page (“RISC-V Reference Card V0.1, Standard Extensions”) covers the RV32M multiply, RV32A atomic, and RV32F/D floating-point extensions. The tables did not survive text extraction and are omitted.]

1. The following C code computes the absolute value of z:

if (z >= 0) return z;
return -z;
A straightforward implementation of the above code requires control-flow instructions (branches and/or jumps). In this problem we consider non-RISC-V instructions that might allow us to implement it without control-flow instructions. Ignore the fact that a two’s-complement representation is unbalanced (-MININT cannot be represented).
(a) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. ABSDIFF x1, x2, x3 means x1 = ABS(x2 – x3). You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use one or more conditional move instructions (which, as we discussed in class, are part of many instruction sets but not RISC-V), and specifically any or all of three different conditional move instructions:
• CMOVGTZ x1, x2, x3 # if (x3 > 0) x1 = x2
• CMOVLTZ x1, x2, x3 # if (x3 < 0) x1 = x2
• CMOVEQZ x1, x2, x3 # if (x3 == 0) x1 = x2

Page 2 of 14

(b) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use either or both of the following saturating-subtract instructions:

• SUBSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as signed numbers and producing a signed result, but rather than overflowing, saturating to the largest/smallest signed value)
• SUBUSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as unsigned numbers and producing an unsigned result, but rather than overflowing, saturating to the largest/smallest unsigned value)

Recall that the actual computation of subtraction of signed and unsigned numbers generates exactly the same bit result; it’s only the saturation operation that differs between these two operations. This problem is fairly tricky.

Page 3 of 14

2. In this problem we will look at the single-cycle datapath of the “Bitner” processor, focusing only on the compute part of the Bitner pipeline. In this datapath, the ALU can perform one arithmetic operation every clock cycle and the data memory can perform one memory operation every clock cycle. Note the output of the ALU is an input to the data memory and the output of the data memory is an input to the ALU. Assume Bitner’s instruction format contains an opcode, three registers, and one immediate. At the end of this exam is a larger, detachable copy of this datapath.
[Figure: the Bitner datapath. A register file (read ports RF1 and RF2, write input RFIn), an ALU (inputs A and B), and a data memory (address and data inputs). Muxes select RF1/IMM/MEM into ALU input 1, RF2/IMM/MEM into ALU input 2, RF1/IMM/ALU into the memory address input, and RF2/IMM/ALU into the memory data input.]

Control points:

• RFR1: register number of the first register to be read from the register file
• RFR2: register number of the second register to be read from the register file
• IMM: the immediate specified by the instruction
• ALU1: controls the source of data into the ALU’s input 1 (choices: RF1, IMM, MEM)
• ALU2: controls the source of data into the ALU’s input 2 (choices: RF2, IMM, MEM)
• MEMA: controls the source of data into the memory’s address input (choices: RF1, IMM, ALU)
• MEMD: controls the source of data into the memory’s data input (choices: RF2, IMM, ALU)
• ALUOP: controls the operation performed by the ALU (choices: ADD)
• RFW: register number of the register to be written to the register file
• RFData: controls the source of data to be written into register RFW in the register file (choices: ALU or MEM)

In this problem we ask you to “set the control points” for this datapath. As an example, the instruction ADD x1, x2, x3 would have the following settings: RFR1 = 2; RFR2 = 3; IMM = X; ALU1 = RF1; ALU2 = RF2; MEMA = X; MEMD = X; ALUOP = ADD; RFW = 1; RFData = ALU, where X is a don’t-care.

Page 4 of 14

(a) (5 points) Which of these signals should come directly from bits in the 32-bit instruction (a bit or bits in the instruction are routed as control signals directly into the datapath without any intermediate logic) and which should be generated by additional logic derived from the 32-bit instruction? Answer either “bits” or “logic”.

• RFR1
• RFR2
• IMM
• ALU1
• ALU2
• MEMA
• MEMD
• ALUOP
• RFW
• RFData

(b) (4 points) What is the primary reason why this particular datapath would be a challenge to pipeline with good performance?
Page 5 of 14

(c) Consider the following C program. This program fetches the value in array from memory, adds 42 to it, stores the new value back to memory, and increments array to point to the next element in the array.

int * array; // assume 64-bit integers
while (1) {
    *array++ += 42;
}

Assume that array is stored in x2.

i. (4 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions in RISC-V assembly. You can use x3 as a temporary. Your solution should have 4 RISC-V instructions. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions).

Page 6 of 14

ii. (10 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions on Bitner. Express each instruction as a setting of control points; by this we mean to specify how you would set each of the 10 “control points” listed above. If a control point is a don’t-care, mark it as X. You can use x3 as a temporary. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions). Hint: Because this datapath is potentially more capable than the RISC-V datapath we discussed in class, it may be possible to accomplish more in a single instruction than you could in a RISC-V datapath.

Page 7 of 14

(d) (5 points) On the following diagram, note the 12 circles on the inputs to the 4 muxes. Put an X through each circle whose input is not required to run RISC-V instructions. For inputs that are required to run RISC-V instructions, do nothing. For instance, if the top input of the ALU does not require a RF1 input to run RISC-V instructions, put an X in the topmost circle.
[Figure: the Bitner datapath again, with circles on the 12 mux inputs.]

Page 8 of 14

(e) Ben Bitdiddle proposes new instructions for Bitner. Ben hopes each of these instructions can be implemented as single Bitner instructions (not sequences of Bitner instructions). For each of Ben’s proposals:

• If it is possible to implement Ben’s proposal as a single instruction with the current datapath, set the control points that implement the instruction on this datapath; OR
• If it is not possible to implement Ben’s proposal as a single instruction with the current datapath, describe the changes needed to the datapath (as English sentences).

i. (4 points) LDADDI xdest1, xdest2, xsrc, imm: Perform two operations in parallel: (1) register xdest1 is set to xsrc + imm; (2) register xdest2 is set to the memory contents of the address stored in xsrc.

ii. (4 points) MEMCPY xdest, xsrc: Copies the memory word stored at the address specified in xsrc into the memory location specified by the address in xdest.

iii. (4 points) LD2 xdest, imm(xbase1+xbase2): Same as LD, but instead of the load address being xbase+imm, it now supports xbase1+xbase2+imm.

Page 9 of 14

3. Alyssa P. Hacker was looking at the class datapath (below) and identified some possible inefficiencies.
Specifically, she was looking at the following control signals:

• RegWrite: 1 only if data is written into the register file
• ALUSrc: the second input of the ALU is from the register file (0) or from the immediate (1)
• PCSrc: the next PC comes from PC+4 (0) or from a computed branch/jump target (1)
• MemWrite: 1 only if the write data input to the memory will be written to the address input
• MemRead: 1 only if the data memory is outputting the contents of the memory address specified by its address input
• MemtoReg: the data to write into the register file comes from the ALU (0) or from the data memory (1)

She proposed the following design simplifications. For each proposed design change, indicate whether her proposal could work properly OR, if it will not, an instruction that will not work correctly and why. For the purposes of this question, only consider the following instructions (the ones we covered in Lecture 2): ADD, ADDI, AND, ANDI, BEQ, BGE, BLT, BNE, JAL, JALR, LD, LUI, OR, ORI, SD, SLLI, SRLI, SUB, XOR, XORI.

Page 10 of 14

(a) (4 points) Combine MemWrite and MemRead into one signal MemOp, where MemOp is set to 1 only on store instructions, and MemWrite <- MemOp and MemRead <- not(MemOp). Assume that the memory system can read from any arbitrary memory address without any side effects (no exceptions/page faults/etc.) and that there are no performance consequences if we do this.

(b) (4 points) Remove RegWrite; instead set RegWrite if PCSrc is 0.

(c) (4 points) Remove ALUSrc; instead set ALUSrc if either MemWrite or MemRead is 1.
Page 11 of 14

[Reference pages: the same “RISC-V Reference” card is replicated here; its tables likewise did not survive text extraction and are omitted.]

1. The following C code computes the absolute value of z:

if (z >= 0) return z;
return -z;
A straightforward implementation of the above code requires control-flow instructions (branches and/or jumps). In this problem we consider non-RISC-V instructions that might allow us to implement this without control-flow instructions. In this problem ignore the fact that a twos-complement representation is unbalanced (-MININT cannot be represented).
(a) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. ABSDIFF x1, x2, x3 means x1 = ABS(x2 – x3). You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use one or more conditional move instructions (which, as we discussed in class, are part of many instruction sets but not RISC-V), and specifically any or all of three different conditional move instructions:
• CMOVGTZ x1, x2, x3 # if (x3 > 0) x1 = x2
• CMOVLTZ x1, x2, x3 # if (x3 < 0) x1 = x2
• CMOVEQZ x1, x2, x3 # if (x3 == 0) x1 = x2

Page 1 of 9

Solution:

SUB x4, x2, x3       # x4 = the difference x2 - x3
CMOVGTZ x1, x4, x4   # x1 now has the correct answer if the diff was positive
SUB x4, x0, x4       # flip the sign of x4
CMOVGTZ x1, x4, x4   # x1 now also has the correct answer if the diff was negative
CMOVEQZ x1, x4, x4   # better cover the 0 case too

(b) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use either or both of the following saturating-subtract instructions:

• SUBSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as signed numbers and producing a signed result, but rather than overflowing, saturating to the largest/smallest signed value)
• SUBUSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as unsigned numbers and producing an unsigned result, but rather than overflowing, saturating to the largest/smallest unsigned value)

Recall that the actual computation of subtraction of signed and unsigned numbers generates exactly the same bit result; it’s only the saturation operation that differs between these two operations. This problem is fairly tricky.

2. In this problem we will look at the single-cycle datapath of the “Bitner” processor, focusing only on the compute part of the Bitner pipeline. In this datapath, the ALU can perform one arithmetic operation every clock cycle and the data memory can perform one memory operation every clock cycle. Note the output of the ALU is an input to the data memory and the output of the data memory is an input to the ALU. Assume Bitner’s instruction format contains an opcode, three registers, and one immediate. At the end of this exam is a larger, detachable copy of this datapath.
Solution: The key realization here is that unsigned saturated subtract of a − b is equivalent to max(a − b, 0). SUBUSAT x1, x2, x3 \# x1 = max(x2-x3,0) SUBUSAT x4, x3, x2 \# x4 = max(x3-x2,0); either x1 or x4 is now 0 OR x1, x1, x4 \# choose whichever of x1 or x4 is not 0 Page 2 of 9 (c) John Owens 2019 Do not reproduce in any way Do not upload anywhere RFData Control RFW RFR1 RFR2 IMM MEMD MEMA ALU2 ALU1 ALUOp RFIn Register File RF2 RF Register Addresses RF1 RF 1 2 A ALU B RF1 IMM MEM RF2 IMM MEM Mux RF1 IMM ALU RF2 IMM ALU Data Memory Addr MEM Data In Control points: • RFR1: register number of the first register to be read from the register file • RFR2: register number of the second register to be read from the register file • IMM: the immediate specified by the instruction • ALU1: controls the source of data into the ALU’s input 1 (choices: RF1, IMM MEM) • ALU2: controls the source of data into the ALU’s input 2 (choices: RF2, IMM, MEM) • MEMA: controls the source of data into the memory’s address input (choices: RF1, IMM, ALU) • MEMD: controls the source of data into the memory’s data input (choices: RF2, IMM, ALU) • ALUOP: controls the operation performed by the ALU (choices: ADD) • RFW: register number of the register to be written to the register file • RFData: controls the source of data to be written into register RFW in the register file (choices: ALU or MEM) In this problem we ask you to “set the control points” for this datapath. As an example, the instruction ADD x1, x2, x3 would have the following settings: RFR1 = 2; RFR2 = 3; IMM = X; ALU1 = RF1; ALU2 = RF2; MEMA = X; MEMD = X; ALUOP = ADD; RFW = 1; RFData = ALU, where X is a don’t-care. (a) (5 points) Which of these signals should come directly from bits in the 32-bit instruction (a bit or bits in the instruction are routed as control signals directly into the datapath without any intermediate logic) and which should be generated by additional logic derived from the 32-bit instruction? 
Answer either “bits” or “logic”.
• bits: RFR1
• bits: RFR2
• bits: IMM
• logic: ALU1
• logic: ALU2
• logic: MEMA
• logic: MEMD
• logic: ALUOP
• bits: RFW
• logic: RFData

(b) (4 points) What is the primary reason why this particular datapath would be a challenge to pipeline with good performance?

Solution: Because the data memory output is an input to the ALU and the ALU output is an input to the memory, we’d have two choices:
• We could put them both into the same pipeline stage, in which case that stage would be very long and yield poor performance.
• We could put them into separate pipeline stages, but then we would have to choose which came first, and we would have a structural hazard every time they were used in the opposite order.

(c) Consider the following C program. This program fetches the value in array from memory, adds 42 to it, stores the new value back to memory, and increments array to point to the next element in the array.

int * array; // assume 64-bit integers
while (1) {
    *array++ += 42;
}

Assume that array is stored in x2.

i. (4 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions in RISC-V assembly. You can use x3 as a temporary. Your solution should have 4 RISC-V instructions. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions).

Solution:
ld   x3, 0(x2)
addi x3, x3, 42
sd   x3, 0(x2)
addi x2, x2, 8

ii. (10 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions on Bitner. Express each instruction as a setting of control points; by this we mean to specify how you would set each of the 10 “control points” listed above. If a control point is a don’t-care, mark it as X.
You can use x3 as a temporary. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions). Hint: Because this datapath is potentially more capable than the RISC-V datapath we discussed in class, it may be possible to accomplish more in a single instruction than you could in a RISC-V datapath.

Solution: We can collapse the four RISC-V instructions above into two Bitner instructions:
• Combine the load and the first add-immediate (+42) into one instruction; the output of the load is the input to an add, and the output of the add goes back into the register file.
  RFR1 = 2; RFR2 = X; IMM = 42; RFW = 3; RFData = ALU; ALU1 = MEM; ALU2 = IMM (42); MEMA = RF1; MEMD = X; ALUOP = ADD
• Then combine the store and the second add-immediate (+8) into one instruction; the store and the add run in parallel. Only the add-immediate writes back.
  RFR1 = 2; RFR2 = 3; IMM = 8; RFW = 2; RFData = ALU; ALU1 = RF1; ALU2 = IMM (8); MEMA = RF1; MEMD = RF2; ALUOP = ADD

(d) (5 points) On the following diagram, note 12 circles on the inputs to the 4 muxes. Put an X through each circle whose input is not required to run RISC-V instructions. For inputs that are required to run RISC-V instructions, do nothing. For instance, if the top input of the ALU does not require a RF1 input to run RISC-V instructions, put an X in the topmost circle.

[Datapath diagram repeated here, with the 12 mux inputs circled.]

Solution: To run only RISC-V instructions, we do not need the following connections:
• ALU A: IMM, MEM
• ALU B: MEM
• Memory address: RF1, IMM
• Memory data: IMM, ALU
It’s arguable that ALU A and ALU B could switch their answers, although that may lead to complications with where I-type instructions get their RF input.

[Solution diagram repeated here with the unneeded mux inputs marked with an X.]

(e) Ben Bitdiddle proposes new instructions for Bitner. Ben hopes each of these instructions can be implemented as single Bitner instructions (not sequences of Bitner instructions). For each of Ben’s proposals:
• If it is possible to implement Ben’s proposal as a single instruction with the current datapath, set the control points that implement the instruction on this datapath; OR
• If it is not possible to implement Ben’s proposal as a single instruction with the current datapath, describe the changes needed to the datapath (as English sentences).

i. (4 points) LDADDI xdest1, xdest2, xsrc, imm: Perform two operations in parallel: (1) register xdest1 is set to xsrc + imm; (2) register xdest2 is set to the memory contents of the address stored in xsrc.

Solution: The current datapath does not support this; it requires writing back to two different registers. We would have to add a second write port to the register file and make sure the outputs of the ALU and the memory could both be routed into the two write ports.

ii. (4 points) MEMCPY xdest, xsrc: Copies the memory word stored at the address specified in xsrc into the memory location specified by the address in xdest.

Solution: The current datapath does not support this; it requires two memory operations and thus two data memories. There are multiple places where a second memory could be placed, including:
• In parallel with the current data memory (rather than having one ALU and one memory in parallel, we’d have one ALU and two memories in parallel). The output of the first memory (load) must be able to be routed into the data input of the second memory (store).
• In series after the current data memory (the output of the first memory can be routed into the data input of the second memory).

iii. (4 points) LD2 xdest, imm(xbase1+xbase2): Same as LD but instead of computing the load address as xbase+imm, it now supports xbase1+xbase2+imm.

Solution: The current datapath does not support this; it does not allow two additions. One way to address this is to add an adder to the address input of the data memory that allows a second addition. One input to that adder should be the output of the ALU; the second input could be either read register or the immediate (the datapath allows any of those three to work).

3. Alyssa P. Hacker was looking at the class datapath (below) and identified some possible inefficiencies. Specifically, she was looking at the following control signals:
• RegWrite: 1 only if data is written into the register file
• ALUSrc: the second input of the ALU is from the register file (0) or from the immediate (1)
• PCSrc: the next PC comes from PC+4 (0) or from a computed branch/jump target (1)
• MemWrite: 1 only if the write data input to the memory will be written to the address input
• MemRead: 1 only if the data memory is outputting the contents of the memory address specified by its address input
• MemtoReg: the data to write into the register file comes from the ALU (0) or from the data memory (1)

She proposed the following design simplifications. For each proposed design change, indicate whether her proposal could work properly OR, if it will not, an instruction that will not work correctly and why. For the purposes of this question, only consider the following instructions (the ones we covered in Lecture 2): ADD, ADDI, AND, ANDI, BEQ, BGE, BLT, BNE, JAL, JALR, LD, LUI, OR, ORI, SD, SLLI, SRLI, SUB, XOR, XORI.
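Proposals like Alyssa's can be sanity-checked by tabulating the control signals per instruction. Below is a minimal Python sketch; the table values are assumptions that follow the signal definitions above and cover only a representative subset of the instruction list, with X (don't-care) modeled as None:

```python
# Sketch: per-instruction control-signal table for the class datapath
# (values follow the six signal definitions above; None = don't-care
# or data-dependent). Only a representative subset is tabulated.
signals = ['RegWrite', 'ALUSrc', 'PCSrc', 'MemWrite', 'MemRead', 'MemtoReg']
table = {
    #        RegWrite ALUSrc PCSrc MemWrite MemRead MemtoReg
    'ADD':  (1,       0,     0,    0,       0,      0),
    'ADDI': (1,       1,     0,    0,       0,      0),
    'LD':   (1,       1,     0,    0,       1,      1),
    'SD':   (0,       1,     0,    1,       0,      None),
    'BEQ':  (0,       0,     None, 0,       0,      None),  # PCSrc depends on the comparison
}

def check_regwrite_from_pcsrc():
    """Alyssa's proposal (b): derive RegWrite as (PCSrc == 0).
    Return the tabulated instructions that the proposal breaks."""
    broken = []
    for name, row in table.items():
        ctl = dict(zip(signals, row))
        if ctl['PCSrc'] == 0 and ctl['RegWrite'] != 1:
            broken.append(name)
    return broken

print(check_regwrite_from_pcsrc())  # ['SD']: a store has PCSrc == 0 but writes no register
```

The same table can be reused to probe the other proposals, e.g., deriving ALUSrc from MemWrite/MemRead immediately flags ADDI.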
(a) (4 points) Combine MemWrite and MemRead into one signal MemOp, where MemOp is set to 1 only on store instructions, and MemWrite <- MemOp and MemRead <- not(MemOp). Assume that the memory system can read from any arbitrary memory address without any side effects (no exceptions/page faults/etc.) and that there are no performance consequences if we do this.

Solution: This should work properly. The reasons to have two signals are to allow both of them to be true at a time (and we have no instructions that do both a read and a write) or both of them to be false (which is probably true on non-memory instructions, but the MemtoReg mux handles this case). Note, however, that in a real machine we don’t generally want to dispatch memory reads that aren’t from a load instruction; we may access addresses that we’re not allowed to access (e.g., segmentation faults), we’d probably pollute the cache, and we’d burden the memory system with requests that we don’t need, so there would be performance consequences.

(b) (4 points) Remove RegWrite; instead set RegWrite if PCSrc is 0.

Solution: This doesn’t work. Alyssa’s logic here is that non-branch-or-jump instructions always write back to the register file. This is not true for store instructions (SD).

(c) (4 points) Remove ALUSrc; instead set ALUSrc if either MemWrite or MemRead are 1.

Solution: This doesn’t work. Alyssa’s logic here is that we use the immediate input to the ALU only on load and store instructions, but we also use it on ALU instructions that take an immediate (ADDI, ANDI, ORI, etc.).

UNIVERSITY OF CALIFORNIA, DAVIS
Department of Electrical and Computer Engineering
EEC 170 Introduction to Computer Architecture
Final Examination, Fall 2019

Name:

Instructions:
1. This exam is open-note, open-book. Calculators are allowed.
2. No devices that can communicate (laptops, phones, or PDAs) are allowed, with the exception of a device that can read your e-book textbook, on which you must turn off your wifi and/or cellular connections. Other than “find”, no typing. Turn off all phones, pagers, and alarms.
3. You may not share any materials with anyone else, including books, notes, lab reports, datasheets, calculators, and most importantly, answers!
4. This exam is designed to be a 60 minute exam, but you have 120 minutes to complete it. (Roughly) one point is budgeted per minute. Use your time wisely!

Excerpts from the UC Davis Code of Academic Conduct:
1. Each student should act with personal honesty at all times.
2. Each student should act with fairness to others in the class. That means, for example, that when taking an examination, students should not seek an unfair advantage over classmates through cheating or other dishonest behavior.
3. Students should take group as well as individual responsibility for honorable behavior.

I understand the honor code and agree to be bound by it. Signature:

Points by page: p.2: 4; p.3: 10; p.4: 10; p.5: 9; p.6: 7; p.7: 3; p.8: 12; p.9: 6. Total: 61.

All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions. Feel free to detach the two RISC-V reference pages and the memory system diagram at the end of the exam, but if you do, please take them with you and recycle them (don’t try to reattach them or turn them in).

1. We discussed dense matrix-dense matrix multiplication in class. To recap, consider A, B, and C as n×n matrices; we wish to compute C = C + A×B.
This can be computed by:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)  // this is normal (scalar) + and *

We begin by computing the arithmetic intensity of this code in operations per memory word, assuming no caches, as a function of n. (Recall that the arithmetic intensity is the number of arithmetic operations per word of memory bandwidth.) Assume n is large; you can approximate any answers given this assumption. Assume each addition or multiplication counts as one operation. Assume all of A, B, and C are stored in main memory (DRAM) and each read from/write to main memory counts as one memory word.

The inner line of code runs n^3 times. It has 2 arithmetic operations, 3 loads, and 1 write. Thus the arithmetic intensity is 2n^3 / ((3+1)n^3) = 1/2.

(a) (4 points) Consider two different implementations of the same task with the same runtime. Baby Yoda’s has high arithmetic intensity and is limited by arithmetic capability. The Mandalorian’s has low arithmetic intensity and is limited by memory bandwidth. Even though both have the same runtime, we prefer Baby Yoda’s implementation with higher arithmetic intensity because of how we expect future machines to behave. In terms of technology trends, why do we prefer implementations with more arithmetic intensity?

(b) (4 points) Assume we have a very large cache so all memory loads only read a reference the first time they see that reference and (because they are captured in the cache) all memory writes occur only at the end of the program. Reads from/writes to cache are free (do not count in terms of arithmetic intensity). What is the arithmetic intensity now?

(c) (6 points) Now consider a more realistic cache that is not large enough to hold the entire input and output. This cache can hold two matrix rows or columns plus one output element.
Assume there are no cache conflicts; any two rows/columns plus one output element can fit in the cache. The Mandalorian proposes using the cache in the following way:

for i = 1 to n
  // read row i of A into cache
  for j = 1 to n
    // read column j of B into cache
    // read C(i,j) into cache
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)  // this is normal (scalar) + and *
    // after k loop is complete, write C(i,j) back to memory

What is the arithmetic intensity now? It is OK to ignore lower-order terms in this analysis.

(d) (10 points) Finally, Baby Yoda proposes to partition the matrix in a different way. Instead of reading 1-dimensional rows or columns from A and B, Baby Yoda arranges those matrices as b × b subblocks. (For instance, a 128 × 128 matrix may be structured as a 4×4 grid of 32 × 32 subblocks.) The cache can hold any two input blocks plus one output block at any time. Now our matrix multiply looks like this (note our operations are now on blocks at a time instead of rows or columns at a time). The arithmetic, while factored differently, is exactly the same as the previous case.

// note i, j, and k are now block indexes, not element indexes
for i = 1 to n/b
  for j = 1 to n/b
    // read block C(i,j) into cache
    for k = 1 to n/b
      // read block A(i,k) into cache
      // read block B(k,j) into cache
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
      // Here + indicates a matrix sum of blocks (block + block)
      // Here * indicates a matrix multiply of blocks (block * block)
    // after k loop is complete, write block C(i,j) back to memory

What is the arithmetic intensity now? It is again OK to ignore lower-order terms in this analysis.

2. In the 2-page RISC-V Reference Card at the end of the exam, at the bottom right of the first page, you will see the “16-bit (RVC) Instruction Formats”.
RVC stands for “RISC-V Compressed”.

(a) (5 points) Note CR-type instructions have a funct4 entry as part of their opcodes instead of a funct3 (as in all the other instructions). Assume that the opcode combinations (inst[1:0] == 01 OR inst[1:0] == 10) AND inst[15:13] == 100 are used to encode CR-type instructions. Given the instruction formats listed in the reference page, how many different instructions can RVC support? Note not all possible RVC instructions are specified on the reference card.

(b) In practice, these 16-bit RVC instructions are translated to 32-bit full (“RVI”) RISC-V instructions. This process is summarized in the table on the reference card called “Optional Compressed (16-bit) Instruction Extension: RVC”, but we will highlight the important parts here directly. In this part, you may need to consult the table “16-bit (RVC) Instruction Formats”.

i. (2 points) The RVC instruction C.ADD rd, rs1 translates to the RVI instruction ADD rd, rd, rs1. Both instructions run a full 64b add. However, the C.ADD operation has a significant restriction compared to the full ADD instruction. What is that restriction? Be specific.

ii. (2 points) Some RVC instructions, like C.BEQZ, have a register argument with a prime (apostrophe) after the register name (C.BEQZ rs1’, imm → BEQ rs1’, x0, imm). What is the restriction indicated by the prime?
(c) (3 points) The new BabyYoda compiler inputs a RVI (32-bit) RISC-V instruction sequence and transforms some RVI instructions into their RVC (16-bit) equivalents. When running the resulting code, compared to the RVI implementation, would you expect the instruction cache hit rate would increase, decrease, or stay the same? Justify your answer. Page 6 of 12 (c) John Owens 2019 Do not reproduce in any way Do not upload anywhere 3. While sipping his soup, Baby Yoda uses the Force to derive the following characteristics of his laptop’s memory system, which is byte-addressed and contains a virtual memory subsystem and L1 and L2 caches. A copy of this diagram is attached at the end of the exam, which you can detach. In this diagram, means a signal of bits wide.
Figure B.17: The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits. Copyright © 2011, Elsevier Inc. All rights reserved.

(a) (3 points) What is the page size?
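Given the figure's parameters, the address-field widths follow directly from the page size. A small Python sketch of the arithmetic (the parameter values are read off the figure caption):

```python
# Sketch: virtual-address split implied by the figure's parameters
# (64-bit virtual address, 40-bit physical address, 16 KB pages).
page_size = 16 * 1024                            # bytes
page_offset_bits = page_size.bit_length() - 1    # log2(16 KB)
virtual_page_number_bits = 64 - page_offset_bits
physical_page_number_bits = 40 - page_offset_bits

print(page_offset_bits)           # 14
print(virtual_page_number_bits)   # 50
print(physical_page_number_bits)  # 26
```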
(b) (3 points) The TLB has 256 entries in it. What is the associativity of the TLB (e.g., 1-way set associative = direct-mapped, 2-way set associative, 4-way, fully associative, etc.)?
(c) (3 points) How large is the L1 cache (data only)?
(d) (3 points) Assuming the L2 cache stores 2^22 B = 4 MiB of data, what is its associativity?
(e) (3 points) What is the motivation for making replacement algorithms for hardware caches less complicated than replacement algorithms for pages?
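For parts (c) and (d), the cache index/tag splits follow from capacity, block size, and associativity. Below is a Python sketch of that arithmetic using the caption's parameters; the helper name and the (offset, index, tag) convention are illustrative, not from the exam:

```python
# Sketch: derive address-field widths for a set-associative cache.
def cache_fields(capacity_bytes, block_bytes, ways, phys_addr_bits=40):
    """Return (offset_bits, index_bits, tag_bits)."""
    sets = capacity_bytes // (block_bytes * ways)
    offset_bits = (block_bytes - 1).bit_length()          # log2(block size)
    index_bits = (sets - 1).bit_length() if sets > 1 else 0
    tag_bits = phys_addr_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits

print(cache_fields(16 * 1024, 64, 1))        # L1: direct-mapped 16 KB, 64 B blocks
print(cache_fields(4 * 1024 * 1024, 64, 4))  # L2: four-way 4 MB, 64 B blocks
```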

4. Assume that you have two 64-bit double-precision floating-point numbers stored in (integer) registers x2 and x3. You run the instruction XOR x1, x2, x3. In this question “normal number” means “not zero, not infinity, not denormal, not NaN”.
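Reasoning about which result bits are fixed is easier with the binary64 field layout in view: bit 63 is the sign, bits 62-52 the biased exponent, bits 51-0 the fraction. A Python sketch that extracts these fields from a double (the helper name is illustrative):

```python
import struct

def double_bits(x):
    """Return the (sign, exponent, fraction) fields of a 64-bit IEEE 754 double."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    sign = bits >> 63                  # bit 63
    exponent = (bits >> 52) & 0x7FF    # bits 62-52: 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)  # bits 51-0: 52-bit fraction
    return sign, exponent, fraction

print(double_bits(1.0))            # (0, 1023, 0): exponent bias is 1023
print(double_bits(-2.0))
print(double_bits(float('inf')))   # infinity has an all-ones exponent, zero fraction
```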
(a) (3 points) If you know that x2 is a positive normal number and x3 is a negative normal number, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.) Bit 0 is the least significant bit (LSB) and bit 63 is the most significant bit (MSB).
(b) (3 points) If you know that x2 is a negative denormal number and x3 is negative infinity, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.)
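One way to experiment with questions like (a) and (b) is to XOR the raw bit patterns directly, as XOR x1, x2, x3 would. A Python sketch (a probe for specific values, not a general answer):

```python
import struct

def xor_doubles(a, b):
    """XOR the raw 64-bit patterns of two doubles."""
    (ba,) = struct.unpack('<Q', struct.pack('<d', a))
    (bb,) = struct.unpack('<Q', struct.pack('<d', b))
    return ba ^ bb

# Positive normal XOR negative normal: the sign bits differ,
# so bit 63 of the result is always 1.
r = xor_doubles(1.5, -2.25)
print(r >> 63)  # 1
```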
RISC-V Reference Card (riscv.org), detachable two-page appendix (8/21/15 revision). The original card's multi-column tables do not survive extraction; the summary below preserves the entries this exam relies on.

Page 1: Base Integer Instructions (RV32I, RV64I, RV128I)
• Loads: LB, LH, LW, LBU, LHU rd,rs1,imm; RV64/128 add L{D|Q} and L{W|D}U
• Stores: SB, SH, SW rs1,rs2,imm; RV64/128 add S{D|Q}
• Shifts: SLL, SRL, SRA rd,rs1,rs2 and SLLI, SRLI, SRAI rd,rs1,shamt (plus {W|D} variants)
• Arithmetic: ADD, ADDI, SUB (plus {W|D} variants); LUI rd,imm; AUIPC rd,imm
• Logical: XOR, XORI, OR, ORI, AND, ANDI
• Compare: SLT, SLTI, SLTU, SLTIU
• Branches: BEQ, BNE, BLT, BGE, BLTU, BGEU rs1,rs2,imm
• Jump & link: JAL rd,imm; JALR rd,rs1,imm
• Synch, system, counters: FENCE, FENCE.I, SCALL, SBREAK, RDCYCLE[H], RDTIME[H], RDINSTRET[H]
• Privileged: CSRRW/CSRRS/CSRRC (and immediate forms), ECALL, EBREAK, ERET, MRTS, MRTH, HRTS, WFI, SFENCE.VM

Optional Compressed (16-bit) Instruction Extension: RVC (format in parentheses; rd′/rs′ denote the restricted register subset):
• C.LW rd′,rs1′,imm → LW rd′,rs1′,imm*4 (CL); C.LWSP rd,imm → LW rd,sp,imm*4 (CI); C.LD and C.LQ scale imm by 8 and 16
• C.SW rs1′,rs2′,imm → SW rs1′,rs2′,imm*4 (CS); C.SWSP rs2,imm → SW rs2,sp,imm*4 (CSS); C.SD and C.SQ scale imm by 8 and 16
• C.ADD rd,rs1 → ADD rd,rd,rs1 (CR); C.ADDW rd,rs1 → ADDW rd,rd,rs1; C.ADDI rd,imm → ADDI rd,rd,imm (CI); C.ADDIW rd,imm → ADDIW rd,rd,imm
• C.ADDI16SP x0,imm → ADDI sp,sp,imm*16 (CI); C.ADDI4SPN rd′,imm → ADDI rd′,sp,imm*4 (CIW)
• C.LI rd,imm → ADDI rd,x0,imm; C.LUI rd,imm → LUI rd,imm; C.MV rd,rs1 → ADD rd,rs1,x0; C.SUB rd,rs1 → SUB rd,rd,rs1; C.SLLI rd,imm → SLLI rd,rd,imm
• C.BEQZ rs1′,imm → BEQ rs1′,x0,imm (CB); C.BNEZ rs1′,imm → BNE rs1′,x0,imm (CB)
• C.J imm → JAL x0,imm (CJ); C.JR rs1 → JALR x0,rs1,0 (CR); C.JAL imm → JAL ra,imm; C.JALR rs1 → JALR ra,rs1,0 (CR); C.EBREAK → EBREAK

32-bit instruction formats: R, I, S, SB, U, UJ. 16-bit (RVC) instruction formats: CR, CI, CSS, CIW, CL, CS, CB, CJ. (The card also shows the bit-level field layout of each format, including the opcode, funct3 or funct4, register, and immediate fields.)

Registers x1-x31 and the pc are 32 bits wide in RV32I, 64 in RV64I, and 128 in RV128I (x0 = 0). RV64I/128I add 10 instructions for the wider formats. The RVI base of <50 classic integer RISC instructions is required. Every 16-bit RVC instruction matches an existing 32-bit RVI instruction. See riscv.org.

Page 2: Optional extensions and calling convention
• RVM (multiply-divide): MUL, MULH, MULHSU, MULHU, DIV, DIVU, REM, REMU (plus {W|D} variants)
• RVA (atomic): LR.W, SC.W, AMOSWAP.W, AMOADD.W, AMOXOR.W, AMOAND.W, AMOOR.W, AMOMIN.W, AMOMAX.W, AMOMINU.W, AMOMAXU.W (plus {D|Q} variants)
• RVF/RVD/RVQ (floating point): loads/stores FL{W,D,Q} and FS{W,D,Q}; FADD/FSUB/FMUL/FDIV/FSQRT.{S|D|Q}; fused multiply-add FMADD/FMSUB/FNMSUB/FNMADD.{S|D|Q}; sign injection FSGNJ/FSGNJN/FSGNJX.{S|D|Q}; FMIN/FMAX; compares FEQ/FLT/FLE; FCLASS; conversions and moves between integer and FP registers; and fcsr access (FRCSR, FRRM, FRFLAGS, etc.). These add registers f0-f31, whose width matches the widest supported precision, and a floating-point control and status register fcsr.
• Calling convention: x0 = zero (hard-wired zero); x1 = ra (return address); x2 = sp (stack pointer); x3 = gp (global pointer); x4 = tp (thread pointer); x5-7 and x28-31 = t0-t6 (caller-saved temporaries); x8 = s0/fp (saved register/frame pointer); x9 and x18-27 = s1-s11 (callee-saved); x10-17 = a0-a7 (function arguments/return values). f0-f31 follow the analogous ft/fa/fs pattern.

Using regex notation, {} means set, so L{D|Q} is both LD and LQ.
Diagram from Problem 3 (Figure B.17) repeated here as a detachable page.

UNIVERSITY OF CALIFORNIA, DAVIS
Department of Electrical and Computer Engineering
EEC 170 Introduction to Computer Architecture
Fall 2019 Final Examination Solutions

1. We discussed dense matrix-dense matrix multiplication in class. To recap, consider A, B, and C as n×n matrices; we wish to compute C = C + A×B.
This can be computed by

    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                C(i,j) = C(i,j) + A(i,k) * B(k,j)   // this is normal (scalar) + and *

We begin by computing the arithmetic intensity of this code in operations per memory word, assuming no caches, as a function of n. (Recall that the arithmetic intensity is the number of arithmetic operations per word of memory bandwidth.) Assume n is large; you can approximate any answers given this assumption. Assume each addition or multiplication counts as one operation. Assume all of A, B, and C are stored in main memory (DRAM) and each read from/write to main memory counts as one memory word.

The inner line of code runs n³ times. It has 2 arithmetic operations, 3 loads, and 1 write. Thus the arithmetic intensity is 2n³ / ((3+1)n³) = 1/2.

(a) (4 points) Consider two different implementations of the same task with the same runtime. Baby Yoda's has high arithmetic intensity and is limited by arithmetic capability. The Mandalorian's has low arithmetic intensity and is limited by memory bandwidth. Even though both have the same runtime, we prefer Baby Yoda's implementation with higher arithmetic intensity because of how we expect future machines to behave. In terms of technology trends, why do we prefer implementations with more arithmetic intensity?

Solution: Historically, arithmetic capability increases more quickly than memory bandwidth (this is the "memory wall"). A future machine will hopefully continue to increase both arithmetic capability and memory bandwidth, but arithmetic capability will likely increase more, so Baby Yoda's implementation will run faster on future machines than the Mandalorian's.

(b) (4 points) Assume we have a very large cache, so all memory loads read a reference only the first time they see that reference and (because they are captured in the cache) all memory writes occur only at the end of the program.
Reads from/writes to cache are free (they do not count in terms of arithmetic intensity). What is the arithmetic intensity now?

Solution: We still do 2n³ operations. But now we read each element of A, B, and C once and write each element of C once. That's a total of 4n² memory operations. So the arithmetic intensity is 2n³ / (4n²) = n/2.

(c) (6 points) Now consider a more realistic cache that is not large enough to hold the entire input and output. This cache can hold two matrix rows or columns plus one output element. Assume there are no cache conflicts; any two rows/columns plus one output element can fit in the cache. The Mandalorian proposes using the cache in the following way:

    for i = 1 to n
        // read row i of A into cache
        for j = 1 to n
            // read column j of B into cache
            // read C(i,j) into cache
            for k = 1 to n
                C(i,j) = C(i,j) + A(i,k) * B(k,j)   // this is normal (scalar) + and *
            // after k loop is complete, write C(i,j) back to memory

What is the arithmetic intensity now? It is OK to ignore lower-order terms in this analysis.

Solution: Again we have 2n³ operations. The primary memory cost is reading each column of B n times (n² column reads × n elements per column = n³ memory accesses). Thus the arithmetic intensity is 2n³ / n³ = 2.

More precisely, we do the following memory operations:

• read each column of B n times (n³ memory loads)
• read each row of A once (n² memory loads)
• read and write each element of C once (2n² memory operations)

for a total of n³ + 3n² memory operations. The arithmetic intensity is thus 2n³ / (n³ + 3n²), roughly 2 for large n.

(d) (10 points) Finally, Baby Yoda proposes to partition the matrix in a different way. Instead of reading 1-dimensional rows or columns from A and B, Baby Yoda arranges those matrices as b × b subblocks. (For instance, a 128 × 128 matrix may be structured as a 4 × 4 grid of 32 × 32 subblocks.)
The cache can hold any two input blocks plus one output block at any time. Now our matrix multiply looks like this (note our operations are now on blocks at a time instead of rows or columns at a time). The arithmetic, while factored differently, is exactly the same as in the previous case.

    // note i, j, and k are now block indexes, not element indexes
    for i = 1 to n/b
        for j = 1 to n/b
            // read block C(i,j) into cache
            for k = 1 to n/b
                // read block A(i,k) into cache
                // read block B(k,j) into cache
                C(i,j) = C(i,j) + A(i,k) * B(k,j)
                // Here + indicates a matrix sum of blocks (block + block)
                // Here * indicates a matrix multiply of blocks (block * block)
            // after k loop is complete, write block C(i,j) back to memory

What is the arithmetic intensity now? It is again OK to ignore lower-order terms in this analysis.

Solution: Again we have 2n³ operations. For both A and B, we read (n/b)³ blocks of size b². That's 2n³/b reads. We also read and write each block of C once (n² reads, n² writes). (For large n, you can ignore this term.) The arithmetic intensity is thus 2n³ / (2n³/b + 2n²), roughly b for large n. This is substantially better than the previous cached version (2). It suggests larger b gives better arithmetic intensity. This is true, but it also means the cache must be large enough to hold three b × b blocks.

2. In the 2-page RISC-V Reference Card at the end of the exam, at the bottom right of the first page, you will see the "16-bit (RVC) Instruction Formats". RVC stands for "RISC-V Compressed".

(a) (5 points) Note CR-type instructions have a funct4 entry as part of their opcodes instead of a funct3 (as in all the other instructions). Assume that the opcode combinations (inst[1:0] == 01 OR inst[1:0] == 10) AND inst[15:13] == 100 are used to encode CR-type instructions.
Given the instruction formats listed in the reference page, how many different instructions can RVC support? Note not all possible RVC instructions are specified on the reference card.

Solution: Most instructions have 2 opcode bits and 3 funct bits. This implies 5 opcode bits total, so 2⁵ = 32 instructions can be supported in this way. However, CR-type instructions get an extra opcode bit, inst[12]. The CR instructions take 2 of the 32 instruction slots available to a 5-bit opcode. Since CR-type instructions have an additional opcode bit, they can fit 4 instructions into those 2 instruction slots. Thus RVC, given these assumptions, supports 34 instructions.

(b) In practice, these 16-bit RVC instructions are translated to 32-bit full ("RVI") RISC-V instructions. This process is summarized in the table on the reference card called "Optional Compressed (16-bit) Instruction Extension: RVC", but we will highlight the important parts here directly. In this part, you may need to consult the table "16-bit (RVC) Instruction Formats".

i. (2 points) The RVC instruction C.ADD rd, rs1 translates to the RVI instruction ADD rd, rd, rs1. Both instructions run a full 64-bit add. However, the C.ADD operation has a significant restriction compared to the full ADD instruction. What is that restriction? Be specific.

Solution: C.ADD only takes 2 register arguments, so the destination register must also be a source register. ADD takes 3 register arguments, so the two source registers can be distinct from the destination register.

ii. (2 points) Some RVC instructions, like C.BEQZ, have a register argument with a prime (apostrophe) after the register name (C.BEQZ rs1', imm → BEQ rs1', x0, imm). What is the restriction indicated by the prime?

Solution: Registers indicated with a prime are specified with only 3 bits in the instruction encoding, instead of RISC-V's 5 bits. Thus these instructions can only access a subset of registers (this is not important, but it's x8–x15).

iii. (2 points) All RVC memory-access instructions multiply their immediates by a power-of-two constant (e.g., C.LW rd', rs1', imm → LW rd', rs1', imm*4). Because of this multiplication, what can an RVI memory-access instruction do that its partner RVC instruction cannot? (The answer is not "allows byte addressing"; both RVC and RVI memory-access instructions are byte-addressed.)

Solution: RVI instructions support misaligned memory operations, where an n-byte memory access is not aligned to an n-byte boundary. RVC instructions do not allow misaligned memory addresses.

iv. (2 points) The instruction C.ADDI16SP is primarily intended to manipulate the ________. (Fill in the previous blank with one word.)

Solution: the stack.

(c) (3 points) The new BabyYoda compiler inputs an RVI (32-bit) RISC-V instruction sequence and transforms some RVI instructions into their RVC (16-bit) equivalents. When running the resulting code, compared to the RVI implementation, would you expect the instruction cache hit rate to increase, decrease, or stay the same? Justify your answer.

Solution: The I-cache hit rate should increase. The instruction stream is identical in terms of operations, but the instruction stream with RVC instructions takes fewer bytes, takes up less space in the cache, can thus store more instructions than the RVI-only stream, and hence will have a higher hit rate.

3. While sipping his soup, Baby Yoda uses the Force to derive the following characteristics of his laptop's memory system, which is byte-addressed and contains a virtual memory subsystem and L1 and L2 caches. A copy of this diagram is attached at the end of the exam, which you can detach. In this diagram, a number next to a slash on a wire means a signal of that many bits wide.
(a) (3 points) What is the page size?

Solution: The page offset is 14 bits, so the page size is 2¹⁴ = 16 kB.
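As a quick check, part (a)'s arithmetic can be written out in a few lines of Python (the bit width comes from the attached diagram; the variable names are mine):

```python
# Page size follows directly from the width of the page-offset field.
page_offset_bits = 14                  # from the memory-hierarchy diagram
page_size_bytes = 2 ** page_offset_bits
print(page_size_bytes)                 # 16384 bytes = 16 kB

# The rest of the 64-bit virtual address is the virtual page number.
vpn_bits = 64 - page_offset_bits
print(vpn_bits)                        # 50
```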
(b) (3 points) The TLB has 256 entries in it. What is the associativity of the TLB (e.g., 1-way set associative = direct-mapped, 2-way set associative, 4-way, fully associative, etc.)?

Solution: The TLB index has 7 bits, so there are 2⁷ = 128 sets. That's 2 entries per set; the TLB is 2-way set associative.

(c) (3 points) How large is the L1 cache (data only)?

Solution: The cache index is 8 bits, so there are 2⁸ = 256 entries in the cache. Each entry is 512 bits (64 bytes), also shown by a 6-bit block offset. 256 entries × 64 bytes is 16 kB.

(d) (3 points) Assuming the L2 cache stores 2²² B = 4 MiB of data, what is its associativity?

Solution: The data per entry is 512 bits = 64 B. Thus there are 4 MiB / 64 B = 64 k = 2¹⁶ entries. The cache index has only 14 bits, so that implies 4 entries per set, making the cache 4-way set associative.
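The reasoning in parts (b)-(d) follows the same pattern: count entries, count sets from the index bits, and divide. A sketch, using the field widths from the diagram:

```python
# (b) TLB associativity: total entries divided by the number of sets.
tlb_entries = 256
tlb_sets = 2 ** 7                      # 7 TLB index bits
print(tlb_entries // tlb_sets)         # 2 -> 2-way set associative

# (c) L1 size: entries (from the index bits) times block size (from the offset bits).
l1_bytes = (2 ** 8) * (2 ** 6)         # 256 entries x 64-byte blocks
print(l1_bytes)                        # 16384 bytes = 16 kB

# (d) L2 associativity: total blocks divided by the number of sets.
l2_blocks = (4 * 2 ** 20) // 64        # 4 MiB of data in 64 B blocks = 2^16 blocks
l2_sets = 2 ** 14                      # 14 L2 index bits
print(l2_blocks // l2_sets)            # 4 -> 4-way set associative
```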
Page 6 of 7
(c) John Owens 2019
Do not reproduce in any way Do not upload anywhere

(e) (3 points) What is the motivation for making replacement algorithms for hardware caches less complicated than replacement algorithms for pages?
Solution: Either of a couple of general reasons is a good answer here:
• Cache replacement algorithms have to run quickly (and hence can’t be as complicated) because processing time for the algorithm (thus hit/miss times) must be low.
• Page replacement algorithms must be as accurate as possible (reducing the miss rate) because the miss penalty when pages are not in memory and have to be paged from disk is very long.
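The penalty gap behind the second bullet can be made concrete with some rough latency numbers (these figures are illustrative assumptions, not from the exam):

```python
clock_ghz = 2.0                        # assumed clock rate
cycle_ns = 1.0 / clock_ghz

dram_penalty_ns = 100.0                # assumed cache-miss-to-DRAM latency
disk_penalty_ns = 5e6                  # assumed 5 ms page fault serviced from disk

print(dram_penalty_ns / cycle_ns)      # a cache miss costs on the order of 200 cycles
print(disk_penalty_ns / cycle_ns)      # a page fault costs on the order of 10,000,000 cycles
```

With a miss penalty of millions of cycles, a page-replacement algorithm can afford sophisticated bookkeeping to shave the miss rate; with a penalty of a few hundred cycles, a cache-replacement algorithm cannot.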
4. Assume that you have two 64-bit double-precision floating-point numbers stored in (integer) registers x2 and x3. You run the instruction XOR x1, x2, x3. In this question, "normal number" means "not zero, not infinity, not denormal, not NaN".
(a) (3 points) If you know that x2 is a positive normal number and x3 is a negative normal number, what bits in x1 do you know for certain? (Your answer should be of the form "Bit 42 will be a 1 and bit 24 will be a 0".) Bit 0 is the least significant bit (LSB) and bit 63 is the most significant bit (MSB).
Solution: Bit 63 will be 1, because you know the two input sign bits will be different.

(b) (3 points) If you know that x2 is a negative denormal number and x3 is negative infinity, what bits in x1 do you know for certain? (Your answer should be of the form "Bit 42 will be a 1 and bit 24 will be a 0".)

Solution: Bit 63 will be 0, because both input sign bits are the same. Bits 62–52 inclusive will all be 1, because x2's exponent is all zeroes and x3's exponent is all ones.
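Both answers can be checked by XORing the raw IEEE 754 encodings of sample values (the specific constants are my choices; any values in the stated classes behave the same way):

```python
import struct

def double_bits(x):
    """Return the 64-bit IEEE 754 encoding of a double as an integer."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

# (a) positive normal XOR negative normal: the sign bits differ, so bit 63 is 1.
xa = double_bits(1.5) ^ double_bits(-2.25)
print(xa >> 63)                            # 1

# (b) negative denormal XOR negative infinity: the signs agree, so bit 63 is 0;
# the exponent fields are all zeroes vs. all ones, so bits 62-52 are all 1.
xb = double_bits(-5e-324) ^ double_bits(float('-inf'))
print(xb >> 63)                            # 0
print(format((xb >> 52) & 0x7FF, '011b'))  # 11111111111
```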