EEC 170
Name:

Instructions:
UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering
Introduction to Computer Architecture
Midterm 1
Fall 2019
1. This exam is open-note, open-book. Calculators are allowed.
2. No devices that can communicate (laptops, phones, or PDAs) are allowed, with the exception of a device that can read your e-book textbook, on which you must turn off your wifi and/or cellular connections. Other than “find”, no typing. Turn off all phones, pagers, and alarms.
3. You may not share any materials with anyone else, including books, notes, lab reports, datasheets, calculators, and most importantly, answers!
4. This exam is designed to be a 60 minute exam, but you have 110 minutes to complete it. (Roughly) one point is budgeted per minute. Use your time wisely!
Excerpts from the UC Davis Code of Academic Conduct:
1. Each student should act with personal honesty at all times.
2. Each student should act with fairness to others in the class. That means, for example, that when taking an examination, students should not seek an unfair advantage over classmates through cheating or other dishonest behavior.
3. Students should take group as well as individual responsibility for honorable behavior.
I understand the honor code and agree to be bound by it.
Signature:
Page:    2   3   4   5   6   9  Total
Points:  8  12  12   8  10  15     65
Score:
Page 1 of 12
(c) John Owens 2019
Do not reproduce in any way Do not upload anywhere
All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions. Feel free to detach the two RISC-V reference pages and the adder reference page at the end of the exam, but if you do, please take them with you and recycle them (don’t try to reattach them or turn them in).
1. Many machine learning applications do not require the precision or range of single- precision (32b) floating point. Thus Google has introduced a new 16-bit floating-point format, bfloat16, which is structured in a similar way to the IEEE 754 formats we discussed in class. bfloat16 has one sign bit, an 8-bit exponent, and a 7-bit significand (the bits after the 1.).
(a) (5 points) What is the smallest normal positive number that can be represented as a bfloat16? (“Normal” here means “ignore subnormal numbers”.) Express your answer as a normalized number ±a.b × 2^c, where you specify the sign, a, b, and c. You should clearly show your work if you want partial credit.
(b) NVIDIA hardware also supports a 16-bit floating-point format, half. It has 1 sign bit, 5 exponent bits, and 11 significand bits.
Answer each of the following questions with bfloat16 or half.
i. (1 point) The format with the largest representable value is (bfloat16 or half):
ii. (1 point) The format with a representable value closest to zero is (bfloat16 or half):
iii. (1 point) The format with the smallest range between two adjacent values when the bias-adjusted exponent is zero (meaning the normalized number is 1.x × 2^0) (where, for these two adjacent values, the floating-point representation is identical except for the least significant bit of the significand) is (bfloat16 or half):
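For study purposes, the key values of the two formats can be checked numerically. This is an illustrative sketch (not the official solution), assuming the IEEE-style conventions from class: bias = 2^(e−1) − 1, a hidden leading 1, and the all-ones exponent reserved, so bfloat16 has 7 fraction bits after the hidden 1 and half has 10.

```python
# Sketch: compare two binary floating-point formats by their exponent and
# fraction widths, under the stated IEEE-style assumptions.
def fmt_props(exp_bits, frac_bits):
    bias = 2 ** (exp_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)              # 1.0 x 2^emin
    largest = (2 - 2.0 ** -frac_bits) * 2.0 ** bias  # 1.11...1 x 2^emax
    gap_at_exp0 = 2.0 ** -frac_bits                  # spacing between adjacent 1.x values
    return smallest_normal, largest, gap_at_exp0

bf16 = fmt_props(exp_bits=8, frac_bits=7)   # bfloat16: 1 sign, 8 exp, 7 fraction
half = fmt_props(exp_bits=5, frac_bits=10)  # half: 1 sign, 5 exp, 10 fraction

print(bf16)  # smallest normal bfloat16 is 1.0 x 2^-126
print(half)  # largest half is (2 - 2^-10) x 2^15 = 65504.0
```

The comparison of the two tuples then answers which format reaches larger values, which gets closer to zero, and which has finer spacing near 1.0.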
Page 2 of 12
2. Short code sequences.
(a) (6 points) MIPS has the instruction nor. RISC-V lacks this instruction. Implement nor x2, x3, x4 in as few RISC-V instructions as possible. You may not make any changes to any register except x2.
(b) (6 points) RISC-V also lacks the instruction sgtz, which sets its output register to 1 if the input register is larger than zero and to 0 otherwise. Implement sgtz x2, x3 in as few RISC-V instructions as possible. You may not make any changes to any register except x2.
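When checking candidate answers to (a) and (b), a tiny reference model of the desired semantics helps. This sketch is illustrative only; it assumes 64-bit registers and two's-complement signed values.

```python
MASK64 = (1 << 64) - 1  # 64-bit register width (assumption)

def nor(a, b):
    # MIPS-style nor: bitwise OR, then invert, within a 64-bit register
    return ~(a | b) & MASK64

def sgtz(a):
    # set-if-greater-than-zero on a signed value
    return 1 if a > 0 else 0

print(hex(nor(0, 0)))              # all ones: 0xffffffffffffffff
print(sgtz(5), sgtz(0), sgtz(-3))  # 1 0 0
```

An assembly answer can be tested by comparing its effect on sample inputs against these two functions.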
Page 3 of 12
(c) (12 points) RISC-V has a floating-point instruction called fsgnj.d rd, rs1, rs2, “floating-point sign injection”. It returns a number whose magnitude is the same as its rs1 argument but whose sign is the same as its rs2 argument. Note it works on double-precision (64b) data.
Implement fsgnj.d x2, x3, x4 in as few integer RISC-V instructions as possible. Your inputs are 64b floating-point numbers stored in 64b integer registers and your output is also a 64b floating-point number in a 64b integer register. Do not use floating-point registers or instructions to solve this problem. You may use x10, x11, etc. as temporary registers if needed. For this part only, you may use the pseudoinstruction li (load immediate), which will count as one instruction.
To aid grading (to help give you partial credit), describing your approach in a sentence or two will be helpful.
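The sign-injection idea is pure bit manipulation: keep the low 63 bits of rs1 and the top (sign) bit of rs2. Here is a hedged Python sketch of that bit-level approach (not the graded assembly answer), using struct to view doubles as 64-bit integer patterns:

```python
import struct

SIGN = 1 << 63  # bit 63 is the IEEE 754 double-precision sign bit

def f2b(x):
    # view a double as its 64-bit integer pattern
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def b2f(bits):
    # view a 64-bit integer pattern as a double
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def fsgnj_d(a_bits, b_bits):
    # magnitude of a, sign of b
    return (a_bits & (SIGN - 1)) | (b_bits & SIGN)

print(b2f(fsgnj_d(f2b(3.5), f2b(-1.0))))  # -3.5
print(b2f(fsgnj_d(f2b(-2.0), f2b(5.0))))  # 2.0
```

The integer-register version of the instruction does exactly this masking and ORing on the raw 64-bit patterns.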
Page 4 of 12
3. (8 points) Let’s consider adding a reciprocal operation to RISC-V. Your engineers have determined that the primary use for a reciprocal operation is to normalize 3-component vectors:
xnew = x / √(x² + y² + z²),
with ynew and znew defined analogously. Note we could take the reciprocal of the square root once and then perform multiply instructions to achieve the same results. To compute xnew, ynew, and znew, we could replace divide operations with reciprocal and multiply instructions.
If we assume that add instructions take 1 cycle, multiply operations take 4 cycles, and both square-root and divide operations take 20 cycles, what is the maximum number of cycles for the reciprocal operation so that 3-component vector normaliza- tion is at least as fast using reciprocal and multiply as it was using division?
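One way to set up the comparison (a sketch under one reading of the problem, where x² + y² + z² costs three multiplies and two adds, the square root is shared by both versions, and the reciprocal version replaces the three divides with one reciprocal plus three multiplies):

```python
ADD, MUL, SQRT, DIV = 1, 4, 20, 20  # cycle counts given in the problem

shared = 3 * MUL + 2 * ADD + SQRT   # x*x + y*y + z*z, then sqrt: 34 cycles
with_div = shared + 3 * DIV         # divide each component: 94 cycles
recip_overhead = 3 * MUL            # multiply each component by the reciprocal

# Break-even: shared + recip + 3 muls <= shared + 3 divs
max_recip = with_div - (shared + recip_overhead)
print(with_div, max_recip)  # 94 48
```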
Page 5 of 12
4. You have simplified the design of your new RISC-Y processor to the point where you only have two classes of instructions, A-type (which takes 1 cycle to finish) and B-type (which takes 3). You compile your company’s most important benchmark on your internal compiler and find it has a CPI of 2.8.
The Team Rocket compiler company visits your company and boasts they can achieve twice the MIPS (millions of instructions per second) on your benchmark on your RISC-Y processor. You test their compiler: it’s true, they can. However, the runtime of their compiled code is the same as yours.
(a) (7 points) What is the instruction mix (what percentage of A-type and what percentage of B-type) of the compiled code from the Team Rocket compiler?
(b) (3 points) If Team Rocket’s code gets twice as many MIPS, why does it run at the same speed as your internal compiler’s? Be precise as to why and quantify your answer.
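The arithmetic behind this problem can be sanity-checked numerically; this is a sketch, not the official solution. With A-type at 1 cycle and B-type at 3, CPI = a·1 + (1 − a)·3, so a = (3 − CPI)/2, and doubling MIPS at the same clock rate halves CPI.

```python
def a_fraction(cpi):
    # CPI = a*1 + (1 - a)*3  =>  a = (3 - CPI) / 2
    return (3 - cpi) / 2

ours = a_fraction(2.8)        # fraction of A-type with the internal compiler
rocket = a_fraction(2.8 / 2)  # twice the MIPS at the same clock => half the CPI
print(ours, rocket)

# Same runtime at twice the instruction rate means twice as many instructions:
# time = insts * CPI / clock, so insts_rocket * 1.4 == insts_ours * 2.8
# requires insts_rocket == 2 * insts_ours.
```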
Page 6 of 12
5. In this problem we design an integer ALU that is inspired by the POWER instruction set. For the purposes of this problem, assume that POWER supports 32-bit integers, that it has an R-type instruction format that takes two input registers and one output register, and an I-type instruction format that takes one input register, one output register, and a 16-bit immediate.
Our base hardware is a 32-bit adder that has five inputs. Four of the inputs (AH, AL, BH, and BL) specify the A and B inputs to the adder, each divided into two 16-bit chunks (the 16b high and low halves of the 32b input words to the adder). The fifth input (Cin) is a one-bit carry in.
Your job is to select what goes into these five inputs. For the first four inputs, you must set the control signals on a SUPERMUX, which is a 16-bit wide mux that can select from 14 16-bit inputs:
• A[31:16]
• A[15:0]
• B[31:16]
• B[15:0]
• ~A[31:16] (invert the bits of A[31:16])
• ~A[15:0]
• ~B[31:16]
• ~B[15:0]
• imm[15:0] (the immediate from I-type instructions)
• A[31] x 16 (bit A[31] copied to all 16 bits)
• B[31] x 16 (bit B[31] copied to all 16 bits)
• imm[15] x 16 (bit imm[15] copied to all 16 bits)
• 0 x 16 (16 zeroes)
• 1 x 16 (16 ones)
[Figure: the 32-bit adder. Four SUPERMUXes drive its AH, AL, BH, and BL inputs; Cin is the one-bit carry in; the outputs are the sum Out[31:0] and the carry out Cout.]
Page 7 of 12
For the last control input (Cin), you can set it to 0 or 1.
As an example, if I asked you to configure this ALU to compute a standard 32-bit add,
you would set the following:
• AH = A[31:16]
• AL = A[15:0]
• BH = B[31:16]
• BL = B[15:0]
• Cin = 0
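The worked ADD example can be mirrored in a small model of the adder datapath. This sketch (the function names are mine, not the exam's) treats each SUPERMUX output as a 16-bit value that has already been selected:

```python
def adder32(ah, al, bh, bl, cin):
    # Concatenate the 16-bit halves and add with carry-in, modulo 2^32.
    a = (ah << 16) | al
    b = (bh << 16) | bl
    return (a + b + cin) & 0xFFFFFFFF

def hi(x):
    return (x >> 16) & 0xFFFF

def lo(x):
    return x & 0xFFFF

# Standard 32-bit add: AH=A[31:16], AL=A[15:0], BH=B[31:16], BL=B[15:0], Cin=0
A, B = 0x12345678, 0x11111111
print(hex(adder32(hi(A), lo(A), hi(B), lo(B), 0)))  # 0x23456789
```

Candidate settings for the later parts can be tested by feeding the corresponding mux selections into adder32.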
This information is replicated at the end of the exam. Feel free to detach it and use it to answer the following questions. There are no questions on this page.
Page 8 of 12
For each of the following instructions, fill in how you would set the five control signals.
(a) (3 points) SUB: a 32-bit subtract, A-B.
(b) (3 points) LUI: load the 16-bit immediate into the upper 16 bits of the output, setting the other output bits to zero
(c) (3 points) ADDIS: “add immediate shifted”; add input A to the (immediate left-shifted by 16 bits)
(d) (3 points) DBL: “multiply by 2”; return A×2
(e) (3 points) In English, what does the instruction with the following control settings do?
• AH = 1 x 16
• AL = 1 x 16
• BH = B[15:0]
• BL = B[31:16]
• Cin = 1
Page 9 of 12
There are no questions on this page. This page is for reference only. Feel free to detach it.
SUPERMUX inputs:
• A[31:16]
• A[15:0]
• B[31:16]
• B[15:0]
• ~A[31:16] (invert the bits of A[31:16])
• ~A[15:0]
• ~B[31:16]
• ~B[15:0]
• imm[15:0] (the immediate from I-type instructions)
• A[31] x 16 (bit A[31] copied to all 16 bits)
• B[31] x 16 (bit B[31] copied to all 16 bits)
• imm[15] x 16 (bit imm[15] copied to all 16 bits)
• 0 x 16 (16 zeroes)
• 1 x 16 (16 ones)
For the last control input (Cin), you can set it to 0 or 1.
As an example, if I asked you to configure this ALU to compute a standard 32-bit add, you would set the following: AH = A[31:16]; AL = A[15:0]; BH = B[31:16]; BL = B[15:0]; Cin = 0.
[Figure: the 32-bit adder, repeated for reference. Four SUPERMUXes drive its AH, AL, BH, and BL inputs; Cin is the one-bit carry in; the outputs are the sum Out[31:0] and the carry out Cout.]
Page 10 of 12
RISC-V Reference (James Zhu)

Core Instruction Formats (bits 31–0):
R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode
I-type: imm[11:0] | rs1 | funct3 | rd | opcode
S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
B-type: imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode
U-type: imm[31:12] | rd | opcode
J-type: imm[20|10:1|11|19:12] | rd | opcode

RV32I Base Integer Instructions (Inst, Name, FMT, Opcode, F3, F7, Description):
add    ADD                     R  0110011  0x0  0x00  rd = rs1 + rs2
sub    SUB                     R  0110011  0x0  0x20  rd = rs1 - rs2
xor    XOR                     R  0110011  0x4  0x00  rd = rs1 ^ rs2
or     OR                      R  0110011  0x6  0x00  rd = rs1 | rs2
and    AND                     R  0110011  0x7  0x00  rd = rs1 & rs2
sll    Shift Left Logical      R  0110011  0x1  0x00  rd = rs1 << rs2
srl    Shift Right Logical     R  0110011  0x5  0x00  rd = rs1 >> rs2 (zero-extends)
sra    Shift Right Arith       R  0110011  0x5  0x20  rd = rs1 >> rs2 (msb-extends)
slt    Set Less Than           R  0110011  0x2  0x00  rd = (rs1 < rs2) ? 1 : 0
sltu   Set Less Than (U)       R  0110011  0x3  0x00  rd = (rs1 < rs2) ? 1 : 0 (zero-extends)
addi   ADD Immediate           I  0010011  0x0        rd = rs1 + imm
xori   XOR Immediate           I  0010011  0x4        rd = rs1 ^ imm
ori    OR Immediate            I  0010011  0x6        rd = rs1 | imm
andi   AND Immediate           I  0010011  0x7        rd = rs1 & imm
slli   Shift Left Logical Imm  I  0010011  0x1  0x00  rd = rs1 << imm[4:0]
srli   Shift Right Logical Imm I  0010011  0x5  0x00  rd = rs1 >> imm[4:0] (zero-extends)
srai   Shift Right Arith Imm   I  0010011  0x5  0x20  rd = rs1 >> imm[4:0] (msb-extends)
slti   Set Less Than Imm       I  0010011  0x2        rd = (rs1 < imm) ? 1 : 0
sltiu  Set Less Than Imm (U)   I  0010011  0x3        rd = (rs1 < imm) ? 1 : 0 (zero-extends)
lb     Load Byte               I  0000011  0x0        rd = M[rs1+imm][0:7] (msb-extends)
lh     Load Half               I  0000011  0x1        rd = M[rs1+imm][0:15] (msb-extends)
lw     Load Word               I  0000011  0x2        rd = M[rs1+imm][0:31]
lbu    Load Byte (U)           I  0000011  0x4        rd = M[rs1+imm][0:7] (zero-extends)
lhu    Load Half (U)           I  0000011  0x5        rd = M[rs1+imm][0:15] (zero-extends)
sb     Store Byte              S  0100011  0x0        M[rs1+imm][0:7] = rs2[0:7]
sh     Store Half              S  0100011  0x1        M[rs1+imm][0:15] = rs2[0:15]
sw     Store Word              S  0100011  0x2        M[rs1+imm][0:31] = rs2[0:31]
beq    Branch ==               B  1100011  0x0        if (rs1 == rs2) PC += imm
bne    Branch !=               B  1100011  0x1        if (rs1 != rs2) PC += imm
blt    Branch <                B  1100011  0x4        if (rs1 < rs2) PC += imm
bge    Branch >=               B  1100011  0x5        if (rs1 >= rs2) PC += imm
bltu   Branch < (U)            B  1100011  0x6        if (rs1 < rs2) PC += imm
bgeu   Branch >= (U)           B  1100011  0x7        if (rs1 >= rs2) PC += imm
jal    Jump And Link           J  1101111             rd = PC + 4; PC += imm
jalr   Jump And Link Reg       I  1100111  0x0        rd = PC + 4; PC = rs1 + imm
lui    Load Upper Imm          U  0110111             rd = imm << 12
auipc  Add Upper Imm to PC     U  0010111             rd = PC + (imm << 12)
ecall  Environment Call        I  1110011  0x0        (imm: 0x000) Transfer control to OS
ebreak Environment Break       I  1110011  0x0        (imm: 0x001) Transfer control to debugger

RISC-V Reference Card V0.1 — Standard Extensions

RV32M Multiply Extension (all R-format, opcode 0110011, F7 0x01; Inst, Name, F3, Description):
mul     MUL               0x0  rd = (rs1 * rs2)[31:0]
mulh    MUL High          0x1  rd = (rs1 * rs2)[63:32]
mulhsu  MUL High (S) (U)  0x2  rd = (rs1 * rs2)[63:32]
mulhu   MUL High (U)      0x3  rd = (rs1 * rs2)[63:32]
div     DIV               0x4  rd = rs1 / rs2
divu    DIV (U)           0x5  rd = rs1 / rs2
rem     Remainder         0x6  rd = rs1 % rs2
remu    Remainder (U)     0x7  rd = rs1 % rs2

RV32A Atomic Extension (format: funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode; all R-format, opcode 0101111, F3 0x2; Inst, Name, F5, Description):
lr.w       Load Reserved      0x02  rd = M[rs1], reserve M[rs1]
sc.w       Store Conditional  0x03  if (reserved) { M[rs1] = rs2; rd = 0 } else { rd = 1 }
amoswap.w  Atomic Swap        0x01  rd = M[rs1]; swap(rd, rs2); M[rs1] = rd
amoadd.w   Atomic ADD         0x00  rd = M[rs1] + rs2; M[rs1] = rd
amoand.w   Atomic AND         0x0C  rd = M[rs1] & rs2; M[rs1] = rd
amoor.w    Atomic OR          0x08  rd = M[rs1] | rs2; M[rs1] = rd
amoxor.w   Atomic XOR         0x04  rd = M[rs1] ^ rs2; M[rs1] = rd
amomax.w   Atomic MAX         0x14  rd = max(M[rs1], rs2); M[rs1] = rd
amomin.w   Atomic MIN         0x10  rd = min(M[rs1], rs2); M[rs1] = rd

RV32F / D Floating-Point Extensions (single-precision forms; encodings not reproduced in this copy):
flw       Flt Load Word           rd = M[rs1+imm]
fsw       Flt Store Word          M[rs1+imm] = rs2
fmadd.s   Flt Fused Mul-Add       rd = rs1 * rs2 + rs3
fmsub.s   Flt Fused Mul-Sub       rd = rs1 * rs2 - rs3
fnmadd.s  Flt Neg Fused Mul-Add   rd = -rs1 * rs2 + rs3
fnmsub.s  Flt Neg Fused Mul-Sub   rd = -rs1 * rs2 - rs3
fadd.s    Flt Add                 rd = rs1 + rs2
fsub.s    Flt Sub                 rd = rs1 - rs2
fmul.s    Flt Mul                 rd = rs1 * rs2
fdiv.s    Flt Div                 rd = rs1 / rs2
fsqrt.s   Flt Square Root         rd = sqrt(rs1)
fsgnj.s   Flt Sign Injection      rd = magnitude of rs1, sign of rs2
fsgnjn.s  Flt Sign Neg Injection  rd = magnitude of rs1, opposite of rs2's sign
fsgnjx.s  Flt Sign Xor Injection  rd = rs1 with sign xored with rs2's sign
fmin.s    Flt Minimum             rd = min(rs1, rs2)
fmax.s    Flt Maximum             rd = max(rs1, rs2)
fcvt.s.w  Flt Conv from Sign Int  rd = (float) rs1
fcvt.s.wu Flt Conv from Uns Int   rd = (float) rs1
fcvt.w.s  Flt Convert to Int      rd = (int32_t) rs1
fcvt.wu.s Flt Convert to Int (U)  rd = (uint32_t) rs1
fmv.x.w   Move Float to Int       rd = *((int*) &rs1)
fmv.w.x   Move Int to Float       rd = *((float*) &rs1)
feq.s     Float Equality          rd = (rs1 == rs2) ? 1 : 0
flt.s     Float Less Than         rd = (rs1 < rs2) ? 1 : 0
fle.s     Float Less / Equal      rd = (rs1 <= rs2) ? 1 : 0
fclass.s  Float Classify          rd = classification of rs1
1. Consider the following C code:

int absdiff(int x, int y) {
  int z = x - y;
  if (z >= 0) return z;
  return -z;
}
A straightforward implementation of the above code requires control-flow instructions (branches and/or jumps). In this problem we consider non-RISC-V instructions that might allow us to implement this without control-flow instructions. In this problem ignore the fact that a twos-complement representation is unbalanced (-MININT cannot be represented).
(a) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. ABSDIFF x1, x2, x3 means x1 = ABS(x2 – x3). You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use one or more conditional move instructions (which, as we discussed in class, are part of many instruction sets but not RISC-V), and specifically any or all of three different conditional move instructions:
• CMOVGTZ x1, x2, x3 # if (x3 > 0) x1 = x2
• CMOVLTZ x1, x2, x3 # if (x3 < 0) x1 = x2
• CMOVEQZ x1, x2, x3 # if (x3 == 0) x1 = x2
Page 2 of 14
(b) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use either or both of the following saturating-subtract instructions:
• SUBSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as signed numbers and producing a signed result, but rather than overflowing, saturating to the largest/smallest signed value)
• SUBUSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as unsigned numbers and producing an unsigned result, but rather than overflowing, saturating to the largest/smallest unsigned value)
Recall that the actual computation of subtraction of signed and unsigned numbers generates exactly the same bit result; it’s only the saturation operation that differs between these two operations. This problem is fairly tricky.
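A reference model of the two saturating subtracts (a sketch of the stated semantics, assuming 64-bit registers) is handy for testing candidate answers against:

```python
BITS = 64  # register width (assumption)
UMAX = 2 ** BITS - 1
SMIN, SMAX = -(2 ** (BITS - 1)), 2 ** (BITS - 1) - 1

def subsat(a, b):
    # signed subtract, saturating to the largest/smallest signed value
    return max(SMIN, min(SMAX, a - b))

def subusat(a, b):
    # unsigned subtract, saturating to the largest/smallest unsigned value
    return max(0, min(UMAX, a - b))

print(subusat(3, 7))     # 0: would go below zero, so it saturates
print(subsat(SMAX, -1))  # would overflow, so it saturates to SMAX
```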
Page 3 of 14
2. In this problem we will look at the single-cycle datapath of the “Bitner” processor, only focusing on the compute part of the Bitner pipeline. In this datapath, the ALU can perform one arithmetic operation every clock cycle and the data memory can perform one memory operation every clock cycle. Note the output of the ALU is an input to the data memory and the output of the data memory is an input to the ALU. Assume Bitner’s instruction format contains an opcode, three registers, and one immediate.
At the end of this exam is a larger, detachable copy of this datapath.
[Figure: the Bitner single-cycle datapath. The control unit drives RFW, RFR1, RFR2, IMM, MEMD, MEMA, ALU2, ALU1, and ALUOp. A register file (read ports RF1 and RF2, write port RFIn) feeds two muxes into the ALU's A and B inputs; two more muxes feed the data memory's Addr and Data In inputs; the ALU output is an input to the memory and the memory output is an input to the ALU.]

Control points:
• RFR1: register number of the first register to be read from the register file
• RFR2: register number of the second register to be read from the register file
• IMM: the immediate specified by the instruction
• ALU1: controls the source of data into the ALU’s input 1 (choices: RF1, IMM, MEM)
• ALU2: controls the source of data into the ALU’s input 2 (choices: RF2, IMM, MEM)
• MEMA: controls the source of data into the memory’s address input (choices: RF1, IMM, ALU)
• MEMD: controls the source of data into the memory’s data input (choices: RF2, IMM, ALU)
• ALUOP: controls the operation performed by the ALU (choices: ADD)
• RFW: register number of the register to be written to the register file
• RFData: controls the source of data to be written into register RFW in the register file (choices: ALU or MEM)
In this problem we ask you to “set the control points” for this datapath. As an example, the instruction ADD x1, x2, x3 would have the following settings: RFR1 = 2; RFR2 = 3; IMM = X; ALU1 = RF1; ALU2 = RF2; MEMA = X; MEMD = X; ALUOP = ADD; RFW = 1; RFData = ALU, where X is a don’t-care.
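A minimal executable model of these control points (register-to-register instructions only; the memory path is omitted) makes the worked example concrete. The dictionary encoding below is my own sketch, not Bitner's actual instruction format:

```python
def bitner_alu_step(ctl, regs):
    # Read the two register-file ports and the immediate.
    rf1 = regs[ctl['RFR1']]
    rf2 = regs[ctl['RFR2']]
    src = {'RF1': rf1, 'RF2': rf2, 'IMM': ctl['IMM']}
    # ALUOP supports only ADD in this datapath.
    alu = src[ctl['ALU1']] + src[ctl['ALU2']]
    # Write back through the RFData mux (memory source not modeled here).
    if ctl['RFData'] == 'ALU':
        regs[ctl['RFW']] = alu
    return regs

# ADD x1, x2, x3 from the worked example (don't-cares omitted):
regs = {1: 0, 2: 5, 3: 7}
ctl = dict(RFR1=2, RFR2=3, IMM=0, ALU1='RF1', ALU2='RF2',
           RFW=1, RFData='ALU')
print(bitner_alu_step(ctl, regs)[1])  # 12
```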
Page 4 of 14
(a) (5 points) Which of these signals should come directly from bits in the 32-bit instruction (a bit or bits in the instruction are routed as control signals directly into the datapath without any intermediate logic) and which should be generated by additional logic derived from the 32-bit instruction? Answer either “bits” or “logic”.
• RFR1
• RFR2
• IMM
• ALU1
• ALU2
• MEMA
• MEMD
• ALUOP
• RFW
• RFData
(b) (4 points) What is the primary reason why this particular datapath would be a challenge to pipeline with good performance?
Page 5 of 14
(c) Consider the following C program. This program fetches the value in array from memory, adds 42 to it, stores the new value back to memory, and increments array to point to the next element in the array.
int * array; // assume 64-bit integers
while (1) {
*array++ += 42;
}
Assume that array is stored in x2.
i. (4 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions in RISC-V assembly. You can use x3 as a temporary. Your solution should have 4 RISC-V instructions. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions).
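A reference model of what the loop body must do is useful for checking a four-instruction answer against; this sketch treats memory as a dict keyed by byte address, with 8-byte (64-bit) array elements as stated:

```python
def loop_body(mem, array_ptr):
    # *array++ += 42, with 64-bit (8-byte) array elements
    mem[array_ptr] = mem.get(array_ptr, 0) + 42
    return array_ptr + 8  # post-increment advances one element

mem = {1000: 1, 1008: 2}
p = 1000
p = loop_body(mem, p)
p = loop_body(mem, p)
print(mem, p)  # {1000: 43, 1008: 44} 1016
```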
Page 6 of 14
ii. (10 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions on Bitner. Express each instruction as a setting of control points; by this we mean to specify how you would set each of the 10 “control points” listed above. If a control point is a don’t-care, mark it as X. You can use x3 as a temporary. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions). Hint: Because this datapath is potentially more capable than the RISC-V datapath we discussed in class, it may be possible to accomplish more in a single instruction than you could in a RISC-V datapath.
Page 7 of 14
(d) (5 points) On the following diagram, note 12 circles on the inputs to the 4 muxes.
Put an X through each circle whose input is not required to run RISC-V instructions. For inputs that are required to run RISC-V instructions, do nothing. For instance, if the top input of the ALU does not require a RF1 input to run RISC-V instructions, put an X in the topmost circle.
[Figure: the Bitner datapath, with a circle on each of the 12 mux inputs: RF1/IMM/MEM into ALU input 1, RF2/IMM/MEM into ALU input 2, RF1/IMM/ALU into the memory's Addr input, and RF2/IMM/ALU into the memory's Data In input.]
Page 8 of 14
(e) Ben Bitdiddle proposes new instructions for Bitner. Ben hopes each of these instructions can be implemented as single Bitner instructions (not sequences of Bitner instructions). For each of Ben’s proposals:
• If it is possible to implement Ben’s proposal as a single instruction with the current datapath, set the control points that implement the instruction on this datapath; OR
• If it is not possible to implement Ben’s proposal as a single instruction with the current datapath, describe the changes needed to the datapath (as English sentences).
i. (4 points) LDADDI xdest1, xdest2, xsrc, imm: Perform two operations in parallel: (1) register xdest1 is set to xsrc + imm; (2) register xdest2 is set to the memory contents of the address stored in xsrc.
ii. (4 points) MEMCPY xdest, xsrc: Copies the memory word stored at the address specified in xsrc into the memory location specified by the address in xdest.
iii. (4 points) LD2 xdest, imm(xbase1+xbase2): Same as LD but instead of the load address as xbase+imm, it now supports xbase1+xbase2+imm.
Page 9 of 14
3. Alyssa P. Hacker was looking at the class datapath (below) and identified some possible inefficiencies. Specifically, she was looking at the following control signals:
• RegWrite: 1 only if data is written into the register file
• ALUSrc: the second input of the ALU is from the register file (0) or from the immediate (1)
• PCSrc: the next PC comes from PC+4 (0) or from a computed branch/jump target (1)
• MemWrite: 1 only if the write data input to the memory will be written to the address input
• MemRead: 1 only if the data memory is outputting the contents of the memory address specified by its address input
• MemtoReg: the data to write into the register file comes from the ALU (0) or from the data memory (1)
She proposed the following design simplifications. For each proposed design change,
indicate whether her proposal could work properly OR, if it will not, give an instruction that will not work correctly and why. For the purposes of this question, only consider the following instructions (the ones we covered in Lecture 2): ADD, ADDI, AND, ANDI, BEQ, BGE, BLT, BNE, JAL, JALR, LD, LUI, OR, ORI, SD, SLLI, SRLI, SUB, XOR, XORI.
Page 10 of 14
(a) (4 points) Combine MemWrite and MemRead into one signal MemOp, where MemOp is set to 1 only on store instructions, and MemWrite <- MemOp and MemRead <- not(MemOp). Assume that the memory system can read from any arbitrary memory address without any side effects (no exceptions/page faults/etc.) and that there are no performance consequences if we do this.
(b) (4 points) Remove RegWrite; instead set RegWrite if PCSrc is 0.
(c) (4 points) Remove ALUSrc; instead set ALUSrc if either MemWrite or MemRead are 1.
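One way to reason about proposals like (a) is a small table of which memory signals each instruction class needs. The table and helper below are my own sketch covering a subset of the listed instructions, not the official rubric:

```python
# Required memory-write behavior per instruction class (hypothetical subset).
need = {
    'LD':  dict(MemWrite=0),   # load: must not write memory
    'SD':  dict(MemWrite=1),   # store: must write memory
    'ADD': dict(MemWrite=0),   # ALU op: must not write memory
}

def proposal_a(inst):
    # MemOp = 1 only on stores; MemWrite <- MemOp, MemRead <- not(MemOp)
    memop = 1 if inst == 'SD' else 0
    return dict(MemWrite=memop, MemRead=1 - memop)

# Under the stated assumption that spurious reads have no side effects,
# correctness only requires MemWrite to match the table.
for inst, req in need.items():
    assert proposal_a(inst)['MemWrite'] == req['MemWrite']
print('proposal (a) writes memory only on stores')
```

The same table-driven style extends to RegWrite and ALUSrc for parts (b) and (c).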
Page 11 of 14
1. Consider the following C code:

int absdiff(int x, int y) {
  int z = x - y;
  if (z >= 0) return z;
  return -z;
}
A straightforward implementation of the above code requires control-flow instructions (branches and/or jumps). In this problem we consider non-RISC-V instructions that might allow us to implement this without control-flow instructions. In this problem ignore the fact that a twos-complement representation is unbalanced (-MININT cannot be represented).
(a) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. ABSDIFF x1, x2, x3 means x1 = ABS(x2 – x3). You may use any RISC-V registers, but do not overwrite x2 or x3. In this part, you must use one or more conditional move instructions (which, as we discussed in class, are part of many instruction sets but not RISC-V), and specifically any or all of three different conditional move instructions:
• CMOVGTZ x1, x2, x3 # if (x3 > 0) x1 = x2
• CMOVLTZ x1, x2, x3 # if (x3 < 0) x1 = x2
• CMOVEQZ x1, x2, x3 # if (x3 == 0) x1 = x2
Solution:
Page 1 of 9
SUB x4, x2, x3
CMOVGTZ x1, x4, x4 # x1 now has the correct answer
                   # if the diff was positive
SUB x4, x0, x4     # flip the sign of x4
CMOVGTZ x1, x4, x4 # x1 now also has the correct answer
                   # if the diff was negative
CMOVEQZ x1, x4, x4 # better cover the 0 case too
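The conditional-move sequence can be checked against a direct Python model (a sketch; x1's initial value does not matter because one of the three moves always fires):

```python
def absdiff_cmov(x2, x3, x1=0):
    x4 = x2 - x3
    if x4 > 0:   # CMOVGTZ x1, x4, x4
        x1 = x4
    x4 = -x4     # SUB x4, x0, x4
    if x4 > 0:   # CMOVGTZ x1, x4, x4
        x1 = x4
    if x4 == 0:  # CMOVEQZ x1, x4, x4
        x1 = x4
    return x1

print(absdiff_cmov(9, 4), absdiff_cmov(4, 9), absdiff_cmov(5, 5))  # 5 5 0
```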
(b) (5 points) Using a minimal sequence of RISC-V instructions, implement ABSDIFF x1, x2, x3 without control-flow instructions. You may use any RISC-V reg- isters, but do not overwrite x2 or x3. In this part, you must use either or both of the following saturating-subtract instructions:
• SUBSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as signed numbers and producing a signed result, but rather than overflowing, saturating to the largest/smallest signed value)
• SUBUSAT x1, x2, x3 (subtraction of x2-x3, treating both operands as unsigned numbers and producing an unsigned result, but rather than overflowing, saturating to the largest/smallest unsigned value)
Recall that the actual computation of subtraction of signed and unsigned numbers generates exactly the same bit result; it’s only the saturation operation that differs between these two operations. This problem is fairly tricky.
2. In this problem we will look at the single-cycle datapath of the “Bitner” processor, only focusing on the compute part of the Bitner pipeline. In this datapath, the ALU can perform one arithmetic operation every clock cycle and the data memory can perform one memory operation every clock cycle. Note the output of the ALU is an input to the data memory and the output of the data memory is an input to the ALU. Assume Bitner’s instruction format contains an opcode, three registers, and one immediate.
At the end of this exam is a larger, detachable copy of this datapath.
Solution: The key realization here is that unsigned saturated subtract of a − b is equivalent to max(a − b, 0).
SUBUSAT x1, x2, x3 # x1 = max(x2-x3,0)
SUBUSAT x4, x3, x2 # x4 = max(x3-x2,0); either x1 or x4 is now 0
OR x1, x1, x4 # choose whichever of x1 or x4 is not 0
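The SUBUSAT-based solution above can be checked the same way with a small C model; subusat below models the hypothetical saturating instruction, not a real RISC-V one:

```c
#include <stdint.h>

/* Model of the hypothetical SUBUSAT instruction: unsigned subtract that
   saturates to 0 on underflow instead of wrapping around. */
static uint64_t subusat(uint64_t a, uint64_t b) {
    return (a > b) ? (a - b) : 0;   /* equivalently max(a - b, 0) */
}

/* x1 = |x2 - x3| using only SUBUSAT and OR, as in the solution. */
uint64_t absdiff_subusat(uint64_t x2, uint64_t x3) {
    uint64_t x1 = subusat(x2, x3);  /* SUBUSAT x1, x2, x3 */
    uint64_t x4 = subusat(x3, x2);  /* SUBUSAT x4, x3, x2 */
    return x1 | x4;                 /* OR x1, x1, x4: one operand is 0 */
}
```

Because at least one of the two saturated differences is zero, the OR simply selects the nonzero (or zero) result.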
[Bitner datapath diagram: a Control unit drives RFW, RFR1, RFR2, IMM, MEMD, MEMA, ALU2, ALU1, and ALUOp. The Register File has read ports RF1 and RF2 and write input RFIn. The ALU inputs A and B are each fed by a mux (A: RF1/IMM/MEM; B: RF2/IMM/MEM). The Data Memory address and data inputs are each fed by a mux (Addr: RF1/IMM/ALU; Data In: RF2/IMM/ALU); the memory output is MEM.]
Control points:
• RFR1: register number of the first register to be read from the register file
• RFR2: register number of the second register to be read from the register file
• IMM: the immediate specified by the instruction
• ALU1: controls the source of data into the ALU’s input 1 (choices: RF1, IMM, MEM)
• ALU2: controls the source of data into the ALU’s input 2 (choices: RF2, IMM, MEM)
• MEMA: controls the source of data into the memory’s address input (choices: RF1, IMM, ALU)
• MEMD: controls the source of data into the memory’s data input (choices: RF2, IMM, ALU)
• ALUOP: controls the operation performed by the ALU (choices: ADD)
• RFW: register number of the register to be written to the register file
• RFData: controls the source of data to be written into register RFW in the register file (choices: ALU or MEM)
In this problem we ask you to “set the control points” for this datapath. As an example, the instruction ADD x1, x2, x3 would have the following settings: RFR1 = 2; RFR2 = 3; IMM = X; ALU1 = RF1; ALU2 = RF2; MEMA = X; MEMD = X; ALUOP = ADD; RFW = 1; RFData = ALU, where X is a don’t-care.
(a) (5 points) Which of these signals should come directly from bits in the 32-bit instruction (a bit or bits in the instruction are routed as control signals directly into the datapath without any intermediate logic) and which should be generated by additional logic derived from the 32-bit instruction? Answer either “bits” or “logic”.
• bits RFR1
• bits RFR2
• bits IMM
• logic ALU1
• logic ALU2
• logic MEMA
• logic MEMD
• logic ALUOP
• bits RFW
• logic RFData
(b) (4 points) What is the primary reason why this particular datapath would be a challenge to pipeline with good performance?
Solution: Because the data memory output is an input to the ALU and the ALU output is an input to the memory, we’d have two choices:
• We could put them both into the same pipeline stage in which case that stage would be very long and yield poor performance
• We could put them into separate pipeline stages but then would have to choose which came first, and we would have a structural hazard every time they were used in the opposite order
(c) Consider the following C program. This program fetches the value in array from memory, adds 42 to it, stores the new value back to memory, and increments array to point to the next element in the array.
int * array; // assume 64-bit integers
while (1) {
*array++ += 42;
}
Assume that array is stored in x2.
i. (4 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions in RISC-V assembly. You can use x3 as a temporary. Your solution should have 4 RISC-V instructions. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions).
Solution:
ld x3, 0(x2)
addi x3, x3, 42
sd x3, 0(x2)
addi x2, x2, 8
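As an illustrative cross-check (not part of the exam), the four instructions can be modeled in C. Memory is modeled as an array of 8-byte words, so the byte address held in x2 is divided by 8:

```c
#include <stdint.h>

/* C model of the four-instruction loop body: *array += 42; array++. */
void loop_body(int64_t mem[], int64_t *x2) {
    int64_t x3 = mem[*x2 / 8];   /* ld   x3, 0(x2)  */
    x3 = x3 + 42;                /* addi x3, x3, 42 */
    mem[*x2 / 8] = x3;           /* sd   x3, 0(x2)  */
    *x2 = *x2 + 8;               /* addi x2, x2, 8  */
}
```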
ii. (10 points) Write the body of this loop (corresponding to the single line of C code between the braces) as a minimal sequence of instructions on Bitner. Express each instruction as a setting of control points; by this we mean to specify how you would set each of the 10 “control points” listed above. If a control point is a don’t-care, mark it as X. You can use x3 as a temporary. Do not write any instructions that relate to the loop itself (you should have no branch/jump instructions). Hint: Because this datapath is potentially more capable than the RISC-V datapath we discussed in class, it may be possible to accomplish more in a single instruction than you could in a RISC-V datapath.
Solution: We can collapse the four RISC-V instructions above into two Bitner instructions:
• Combine the load and the first add-immediate (+42) into one instruction; the output of the load is the input to an add, and the output of the add goes back into the register file.
RFR1 = 2
RFR2 = X
IMM = 42
RFW = 3
RFData = ALU
ALU1 = MEM
ALU2 = IMM (42)
MEMA = RF1
MEMD = X
ALUOP = ADD
• Then combine the store and the second add-immediate (+8) into one instruction; the two operations run in parallel. Only the add-immediate writes back.
RFR1 = 2
RFR2 = 3
IMM = 8
RFW = 2
RFData = ALU
ALU1 = RF1
ALU2 = IMM (8)
MEMA = RF1
MEMD = RF2
ALUOP = ADD
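The two fused Bitner instructions compute the same thing as the four RISC-V instructions; a C model of one iteration (illustrative sketch; memory modeled as an array of 8-byte words) makes the fusion explicit:

```c
#include <stdint.h>

/* C model of the two fused Bitner instructions for one loop iteration. */
void bitner_body(int64_t mem[], int64_t *x2) {
    /* Instruction 1: load mem[x2] and add 42 in one pass
       (MEMA = RF1, ALU1 = MEM, ALU2 = IMM(42), RFW = 3). */
    int64_t x3 = mem[*x2 / 8] + 42;
    /* Instruction 2: store x3 and increment the pointer in parallel
       (MEMA = RF1, MEMD = RF2, ALU1 = RF1, ALU2 = IMM(8), RFW = 2). */
    mem[*x2 / 8] = x3;
    *x2 = *x2 + 8;
}
```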
(d) (5 points) On the following diagram, note 12 circles on the inputs to the 4 muxes.
Put an X through each circle whose input is not required to run RISC-V instructions. For inputs that are required to run RISC-V instructions, do nothing. For instance, if the top input of the ALU does not require a RF1 input to run RISC-V instructions, put an X in the topmost circle.
[Same datapath diagram as above, with a circle on each of the 12 mux inputs: ALU A (RF1, IMM, MEM), ALU B (RF2, IMM, MEM), memory Addr (RF1, IMM, ALU), and memory Data In (RF2, IMM, ALU).]
Solution: To run only RISC-V instructions, we do not need the following connections:
• ALU A: IMM, MEM
• ALU B: MEM
• Memory address: RF1, IMM
• Memory data: IMM, ALU
It’s arguable that ALU A and ALU B could switch their answers, although that may lead to complications with where I-type instructions get their RF input.
[Solution diagram: the same datapath with an X through each of the seven mux inputs listed above.]
(e) Ben Bitdiddle proposes new instructions for Bitner. Ben hopes each of these instructions can be implemented as single Bitner instructions (not sequences of Bitner instructions). For each of Ben’s proposals:
• If it is possible to implement Ben’s proposal as a single instruction with the current datapath, set the control points that implement the instruction on this datapath; OR
• If it is not possible to implement Ben’s proposal as a single instruction with the current datapath, describe the changes needed to the datapath (as English sentences).
i. (4 points) LDADDI xdest1, xdest2, xsrc, imm: Perform two operations in parallel: (1) register xdest1 is set to xsrc + imm; (2) register xdest2 is set to the memory contents of the address stored in xsrc.
ii. (4 points) MEMCPY xdest, xsrc: Copies the memory word stored at the address specified in xsrc into the memory location specified by the address in xdest.
Solution: The current datapath does not support this; it requires writing back to two different registers. We would have to add a second write port to the register file and make sure the outputs of the ALU and the memory could both be routed into the two write ports.
Solution: The current datapath does not support this; it requires two memory operations and thus two data memories. There are multiple places where a second memory could be placed, including:
• In parallel with the current data memory (rather than having one ALU and one memory in parallel, we’d have one ALU and two memories in parallel). The output of the first memory (load) must be able to be routed into the data input of the second memory (store).
• In series after the current data memory (the output of the first memory can be routed into the data input of the second memory).
iii. (4 points) LD2 xdest, imm(xbase1+xbase2): Same as LD but instead of the load address as xbase+imm, it now supports xbase1+xbase2+imm.
Solution: The current datapath does not support this; it does not allow two additions. One way to address this is to add an adder to the address input of the data memory that allows a second addition. One input to that adder should be the output of the ALU; the second input could be either read register or the immediate (the datapath allows any of those three to work).
3. Alyssa P. Hacker was looking at the class datapath (below) and identified some possible inefficiencies. Specifically, she was looking at the following control signals:
• RegWrite: 1 only if data is written into the register file
• ALUSrc: the second input of the ALU is from the register file (0) or from the immediate (1)
• PCSrc: the next PC comes from PC+4 (0) or from a computed branch/jump target (1)
• MemWrite: 1 only if the write data input to the memory will be written to the address input
• MemRead: 1 only if the data memory is outputting the contents of the memory address specified by its address input
• MemtoReg: the data to write into the register file comes from the ALU (0) or from the data memory (1)
She proposed the following design simplifications. For each proposed design change, indicate whether her proposal would work properly OR, if it will not, give an instruction that will not work correctly and why. For the purposes of this question, only consider the following instructions (the ones we covered in Lecture 2): ADD, ADDI, AND, ANDI, BEQ, BGE, BLT, BNE, JAL, JALR, LD, LUI, OR, ORI, SD, SLLI, SRLI, SUB, XOR, XORI.
(a) (4 points) Combine MemWrite and MemRead into one signal MemOp, where MemOp is set to 1 only on store instructions, and MemWrite <- MemOp and MemRead <- not(MemOp). Assume that the memory system can read from any arbitrary memory address without any side effects (no exceptions/page faults/etc.) and that there are no performance consequences if we do this.
Solution: This should work properly. The reasons to have two signals are to allow both of them to be true at a time (and we have no instructions that do both a read and a write) or both of them to be false (which is probably true on non-memory instructions, but the MemtoReg mux handles this case). Note, however, that we don’t generally want to dispatch memory reads that aren’t
from a load instruction; we may access addresses that we’re not allowed to access (e.g., segmentation faults), we’d probably pollute the cache, and we’d burden the memory system with requests that we don’t need so there would be performance consequences.
(b) (4 points) Remove RegWrite; instead set RegWrite if PCSrc is 0.
(c) (4 points) Remove ALUSrc; instead set ALUSrc if either MemWrite or MemRead are 1.
Solution: This doesn’t work. Alyssa’s logic here is that non-branch-or-jump instructions always write back to the register file. This is not true for store instructions (SD).
Solution: This doesn’t work. Alyssa’s logic here is that we use the immediate input to the ALU only on load and store instructions, but we also use it on ALU instructions that take an immediate (ADDI, ANDI, ORI, etc.).
UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering
EEC 170
Introduction to Computer Architecture
Final Examination
Fall 2019
Name:
Instructions:
1. This exam is open-note, open-book. Calculators are allowed.
2. No devices that can communicate (laptops, phones, or PDAs) allowed, with the exception of a device that can read your e-book textbook, on which you must turn off your wifi and/or cellular connections. Other than “find”, no typing. Turn off all phones, pagers, and alarms.
3. You may not share any materials with anyone else, including books, notes, lab reports, datasheets, calculators, and most importantly, answers!
4. This exam is designed to be a 60 minute exam, but you have 120 minutes to complete it. (Roughly) one point is budgeted per minute. Use your time wisely!
Excerpts from the UC Davis Code of Academic Conduct:
1. Each student should act with personal honesty at all times.
2. Each student should act with fairness to others in the class. That means, for example, that when taking an examination, students should not seek an unfair advantage over classmates through cheating or other dishonest behavior.
3. Students should take group as well as individual responsibility for honorable behavior.
I understand the honor code and agree to be bound by it.
Signature:
Page:    2   3   4   5   6   7   8   9  Total
Points:  4  10  10   9   7   3  12   6     61
Score:
All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions. Feel free to detach the two RISC-V reference pages and the memory system diagram at the end of the exam, but if you do, please take them with you and recycle them (don’t try to reattach them or turn them in).
1.
We discussed dense matrix-dense matrix multiplication in class. To recap, consider A, B, and C as n×n matrices; we wish to compute C = C + A×B. This can be computed by

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *
We begin by computing the arithmetic intensity of this code in operations per memory word, assuming no caches, as a function of n. (Recall that the arithmetic intensity is the number of arithmetic operations per word of memory bandwidth.) Assume n is large; you can approximate any answers given this assumption. Assume each addition or multiplication counts as one operation. Assume all of A, B, and C are stored in main memory (DRAM) and each read from/write to main memory counts as one memory word.
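To make the counting concrete, this C sketch (illustrative only, not part of the exam) tallies the arithmetic operations and memory words touched by the naive loop; the ops-to-words ratio it reports is the arithmetic intensity:

```c
/* Tally arithmetic operations and memory words touched by the naive
   triple loop (no cache). n is kept small only so the count runs fast;
   the ops/mem ratio is independent of n. */
void naive_counts(long n, long *ops, long *mem) {
    *ops = 0;
    *mem = 0;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            for (long k = 0; k < n; k++) {
                *mem += 3;  /* read C(i,j), A(i,k), B(k,j) */
                *ops += 2;  /* one multiply, one add */
                *mem += 1;  /* write C(i,j) back */
            }
}
```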
The inner line of code runs n^3 times. It has 2 arithmetic operations, 3 loads, and 1 write. Thus the arithmetic intensity is 2n^3/((3+1)n^3) = 1/2.

(a) (4 points) Consider two different implementations of the same task with the same runtime. Baby Yoda’s has high arithmetic intensity and is limited by arithmetic capability. The Mandalorian’s has low arithmetic intensity and is limited by memory bandwidth. Even though both have the same runtime, we prefer Baby Yoda’s implementation with higher arithmetic intensity because of how we expect future machines to behave. In terms of technology trends, why do we prefer implementations with more arithmetic intensity?
(b) (4 points) Assume we have a very large cache so all memory loads only read a reference the first time they see that reference and (because they are captured in the cache) all memory writes occur only at the end of the program. Reads from/writes to cache are free (do not count in terms of arithmetic intensity). What is the arithmetic intensity now?
(c) (6 points) Now consider a more realistic cache that is not large enough to hold the entire input and output. This cache can hold two matrix rows or columns plus one output element. Assume there are no cache conflicts; any two rows/columns plus one output element can fit in the cache. The Mandalorian proposes using the cache in the following way:
for i = 1 to n
  // read row i of A into cache
  for j = 1 to n
    // read row j of B into cache
    // read C(i,j) into cache
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *
    // after k loop is complete, write C(i,j) back to memory
What is the arithmetic intensity now? It is OK to ignore lower-order terms in this analysis.
(d) (10 points) Finally, Baby Yoda proposes to partition the matrix in a different way. Instead of reading 1-dimensional rows or columns from A and B, Baby Yoda arranges those matrices as b × b subblocks. (For instance, a 128 × 128 matrix may be structured as a 4×4 grid of 32 × 32 subblocks.) The cache can hold any two input blocks plus one output block at any time.

Now our matrix multiply looks like this (note our operations are now on blocks at a time instead of rows or columns at a time). The arithmetic, while factored differently, is exactly the same as the previous case.
// note i, j, and k are now block indexes, not element indexes
for i = 1 to n/b
  for j = 1 to n/b
    // read block C(i,j) into cache
    for k = 1 to n/b
      // read block A(i,k) into cache
      // read block B(k,j) into cache
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
      // Here + indicates a matrix sum of blocks (block + block)
      // Here * indicates a matrix multiply of blocks (block * block)
    // after k loop is complete, write block C(i,j) back to memory
What is the arithmetic intensity now? It is again OK to ignore lower-order terms in this analysis.
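As an illustrative aside, the blocked factoring can be checked against the ordinary product in C (names and layout here are assumptions, not from the exam):

```c
/* Blocked matrix multiply C += A*B with b x b blocks, mirroring the
   pseudocode above. Matrices are n x n, row-major; b must divide n. */
void blocked_matmul(int n, int b, const double *A, const double *B,
                    double *C) {
    for (int i = 0; i < n; i += b)
        for (int j = 0; j < n; j += b)
            for (int k = 0; k < n; k += b)
                /* block-level C(i,j) += A(i,k) * B(k,j) */
                for (int ii = i; ii < i + b; ii++)
                    for (int jj = j; jj < j + b; jj++)
                        for (int kk = k; kk < k + b; kk++)
                            C[ii * n + jj] += A[ii * n + kk] * B[kk * n + jj];
}
```

With b = n the loops degenerate to the naive triple loop, so both factorings produce the same C.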
2. In the 2-page RISC-V Reference Card at the end of the exam, at the bottom right of the first page, you will see the “16-bit (RVC) Instruction Formats”. RVC stands for “RISC-V Compressed”.
(a) (5 points) Note CR-type instructions have a funct4 entry as part of their opcodes instead of a funct3 (as in all the other instructions). Assume that the opcode combinations (inst[1:0] == 01 OR inst[1:0] == 10) AND inst[15:13] == 100 are used to encode CR-type instructions. Given the instruction formats listed in the reference page, how many different instructions can RVC support? Note not all possible RVC instructions are specified on the reference card.
(b) In practice, these 16-bit RVC instructions are translated to 32-bit full (“RVI”) RISC-V instructions. This process is summarized in the table on the reference card called “Optional Compressed (16-bit) Instruction Extension: RVC”, but we will highlight the important parts here directly. In this part, you may need to consult the table “16-bit (RVC) instruction formats”.
i. (2 points) The RVC instruction C.ADD rd, rs1 translates to the RVI instruction ADD rd, rd, rs1. Both instructions run a full 64b add. However, the C.ADD operation has a significant restriction compared to the full ADD instruction. What is that restriction? Be specific.
ii. (2 points) Some RVC instructions, like C.BEQZ, have a register argument with a prime (apostrophe) after the register name (C.BEQZ rs1’, imm → BEQ rs1’, x0, imm). What is the restriction indicated by the prime?
iii. (2 points) All RVC memory-access instructions multiply their immediates by a power-of-two constant (e.g., C.LW rd′, rs1′, imm → LW rd′, rs1′, imm*4). Because of this multiplication, what can a RVI memory-access instruction do that its partner RVC instruction cannot? (The answer is not “allows byte addressing”; both RVC and RVI memory-access instructions are byte-addressed.)
iv. (2 points) The instruction C.ADDI16SP is primarily intended to manipulate the ________. (Fill in the previous blank with one word.)
(c) (3 points) The new BabyYoda compiler inputs a RVI (32-bit) RISC-V instruction sequence and transforms some RVI instructions into their RVC (16-bit) equivalents. When running the resulting code, compared to the RVI implementation, would you expect the instruction cache hit rate would increase, decrease, or stay the same? Justify your answer.
3. While sipping his soup, Baby Yoda uses the Force to derive the following characteristics of his laptop’s memory system, which is byte-addressed and contains a virtual memory subsystem and L1 and L2 caches. A copy of this diagram is attached at the end of the exam, which you can detach. In this diagram,
Figure B.17 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits.

(a) (3 points) What is the page size?
Copyright © 2011, Elsevier Inc. All rights Reserved.
(b) (3 points) The TLB has 256 entries in it. What is the associativity of the TLB (e.g., 1-way set associative = direct-mapped, 2-way set associative, 4-way, fully associative, etc.)?
(c) (3 points) How large is the L1 cache (data only)?
(d) (3 points) Assuming the L2 cache stores 2^22 B = 4 MiB of data, what is its associativity?
(e) (3 points) What is the motivation for making replacement algorithms for hardware caches less complicated than replacement algorithms for pages?
4. Assume that you have two 64-bit double-precision floating-point numbers stored in (integer) registers x2 and x3. You run the instruction XOR x1, x2, x3. In this question “normal number” means “not zero, not infinity, not denormal, not NaN”.

(a) (3 points) If you know that x2 is a positive normal number and x3 is a negative normal number, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.) Bit 0 is the least significant bit (LSB) and bit 63 is the most significant bit (MSB).
(b) (3 points) If you know that x2 is a negative denormal number and x3 is negative infinity, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.)
Free & Open Reference Card 1
Base Integer Instructions: RV32I, RV64I, and RV128I
Category Name
Loads Load Byte Load Halfword Load Word Load Byte Unsigned Load Half Unsigned
Stores Store Byte Store Halfword Store Word
Shifts Shift Left Shift Left Immediate Shift Right Shift Right Immediate Shift Right Arithmetic Shift Right Arith Imm
Arithmetic ADD ADD Immediate SUBtract
Load Upper Imm Add Upper Imm to PC
Fmt
I I I I I
S S S
R I R I R I
R I R
U U
RV32I Base
LB rd,rs1,imm
LH rd,rs1,imm
LW rd,rs1,imm
LBU rd,rs1,imm
LHU rd,rs1,imm
SB rs1,rs2,imm
SH rs1,rs2,imm
SW rs1,rs2,imm
SLL rd,rs1,rs2
SLLI rd,rs1,shamt
SRL rd,rs1,rs2
SRLI rd,rs1,shamt
SRA rd,rs1,rs2
SRAI rd,rs1,shamt
ADD rd,rs1,rs2
ADDI rd,rs1,imm
SUB rd,rs1,rs2
LUI rd,imm
AUIPC rd,imm
+RV{64,128}
L{D|Q} rd,rs1,imm
L{W|D}U rd,rs1,imm
S{D|Q} rs1,rs2,imm
SLL{W|D} rd,rs1,rs2
SLLI{W|D} rd,rs1,shamt
SRL{W|D} rd,rs1,rs2
SRLI{W|D} rd,rs1,shamt
SRA{W|D} rd,rs1,rs2
SRAI{W|D} rd,rs1,shamt
ADD{W|D} rd,rs1,rs2
ADDI{W|D} rd,rs1,imm
SUB{W|D} rd,rs1,rs2
RV Privileged Instructions
Category Name
CSR Access Atomic R/W Atomic Read & Set Bit Atomic Read & Clear Bit Atomic R/W Imm Atomic Read & Set Bit Imm Atomic Read & Clear Bit Imm
Change Level Env. Call Environment Breakpoint
Environment Return
Trap Redirect to Supervisor Redirect Trap to Hypervisor Hypervisor Trap to Supervisor
Interrupt Wait for Interrupt
MMU Supervisor FENCE
ECALL EBREAK
ERET
MRTS
MRTH
HRTS
WFI
RV mnemonic
CSRRW rd,csr,rs1
CSRRS rd,csr,rs1
CSRRC rd,csr,rs1
CSRRWI rd,csr,imm
CSRRSI rd,csr,imm
CSRRCI rd,csr,imm
SFENCE.VM rs1
Logical XOR XOR Immediate
OR OR Immediate AND AND Immediate
Compare Set < Set < Immediate Set < Unsigned Set < Imm Unsigned
Branches Branch = Branch ≠ Branch < Branch ≥ Branch < Unsigned Branch ≥ Unsigned
Jump & Link J&L Jump & Link Register
Synch Synch thread Synch Instr & Data
System System CALL System BREAK
Counters ReaD CYCLE ReaD CYCLE upper Half ReaD TIME ReaD TIME upper Half ReaD INSTR RETired ReaD INSTR upper Half
R I
R I R I
R I R I
SB
SB
SB
SB
SB
SB
UJ UJ
I I
I I
I I I I I I
XOR rd,rs1,rs2
XORI rd,rs1,imm
OR rd,rs1,rs2
ORI rd,rs1,imm
AND rd,rs1,rs2
ANDI rd,rs1,imm
SLT rd,rs1,rs2
SLTI rd,rs1,imm
SLTU rd,rs1,rs2
SLTIU rd,rs1,imm
BEQ rs1,rs2,imm
BNE rs1,rs2,imm
BLT rs1,rs2,imm
BGE rs1,rs2,imm
BLTU rs1,rs2,imm
BGEU rs1,rs2,imm
JAL rd,imm
JALR rd,rs1,imm
FENCE
FENCE.I
SCALL SBREAK
RDCYCLE rd
RDCYCLEH rd
RDTIME rd
RDTIMEH rd
RDINSTRET rd
RDINSTRETH rd
Optional Compressed (16-bit) Instruction Extension: RVC

Category     Name               Fmt  RVC                   RVI equivalent
Loads        Load Word          CL   C.LW rd′,rs1′,imm     LW rd′,rs1′,imm*4
             Load Word SP       CI   C.LWSP rd,imm         LW rd,sp,imm*4
             Load Double        CL   C.LD rd′,rs1′,imm     LD rd′,rs1′,imm*8
             Load Double SP     CI   C.LDSP rd,imm         LD rd,sp,imm*8
             Load Quad          CL   C.LQ rd′,rs1′,imm     LQ rd′,rs1′,imm*16
             Load Quad SP       CI   C.LQSP rd,imm         LQ rd,sp,imm*16
Stores       Store Word         CS   C.SW rs1′,rs2′,imm    SW rs1′,rs2′,imm*4
             Store Word SP      CSS  C.SWSP rs2,imm        SW rs2,sp,imm*4
             Store Double       CS   C.SD rs1′,rs2′,imm    SD rs1′,rs2′,imm*8
             Store Double SP    CSS  C.SDSP rs2,imm        SD rs2,sp,imm*8
             Store Quad         CS   C.SQ rs1′,rs2′,imm    SQ rs1′,rs2′,imm*16
             Store Quad SP      CSS  C.SQSP rs2,imm        SQ rs2,sp,imm*16
Arithmetic   ADD                CR   C.ADD rd,rs1          ADD rd,rd,rs1
             ADD Word           CR   C.ADDW rd,rs1         ADDW rd,rd,rs1
             ADD Immediate      CI   C.ADDI rd,imm         ADDI rd,rd,imm
             ADD Word Imm       CI   C.ADDIW rd,imm        ADDIW rd,rd,imm
             ADD SP Imm * 16    CI   C.ADDI16SP x0,imm     ADDI sp,sp,imm*16
             ADD SP Imm * 4     CIW  C.ADDI4SPN rd′,imm    ADDI rd′,sp,imm*4
             Load Immediate     CI   C.LI rd,imm           ADDI rd,x0,imm
             Load Upper Imm     CI   C.LUI rd,imm          LUI rd,imm
             MoVe               CR   C.MV rd,rs1           ADD rd,rs1,x0
             SUB                CR   C.SUB rd,rs1          SUB rd,rd,rs1
Shifts       Shift Left Imm     CI   C.SLLI rd,imm         SLLI rd,rd,imm
Branches     Branch=0           CB   C.BEQZ rs1′,imm       BEQ rs1′,x0,imm
             Branch≠0           CB   C.BNEZ rs1′,imm       BNE rs1′,x0,imm
Jump         Jump               CJ   C.J imm               JAL x0,imm
             Jump Register      CR   C.JR rd,rs1           JALR x0,rs1,0
Jump & Link  J&L                CJ   C.JAL imm             JAL ra,imm
             J&L Register       CR   C.JALR rs1            JALR ra,rs1,0
System       Env. BREAK         CI   C.EBREAK              EBREAK
32-bit Instruction Formats: R, I, S, SB, U, UJ
16-bit (RVC) Instruction Formats: CR, CI, CSS, CIW, CL, CS, CB, CJ
RISC-V Integer Base (RV32I/64I/128I), privileged, and optional compressed extension (RVC). Registers x1-x31 and the pc are 32 bits wide in RV32I, 64 in RV64I, and 128 in RV128I (x0=0). RV64I/128I add 10 instructions for the wider formats. The RVI base of <50 classic integer RISC instructions is required. Every 16-bit RVC instruction matches an existing 32-bit RVI instruction. See riscv.org.
Free & Open Reference Card (riscv.org) 2
Category Name
Multiply MULtiply MULtiply upper Half MULtiply Half Sign/Uns MULtiply upper Half Uns Divide DIVide DIVide Unsigned Remainder REMainder REMainder Unsigned
Swap SWAP
Add ADD
Logical XOR AND OR
Min/Max MINimum MAXimum MINimum Unsigned MAXimum Unsigned
Move Move from Integer Move to Integer
Convert Convert from Int Convert from Int Unsigned Convert to Int Convert to Int Unsigned
Arithmetic ADD SUBtract MULtiply DIVide SQuare RooT
Mul-Add Multiply-ADD Multiply-SUBtract Negative Multiply-SUBtract Negative Multiply-ADD
Sign Inject SiGN source Negative SiGN source Xor SiGN source
Min/Max MINimum MAXimum
Compare Compare Float = Compare Float < Compare Float ≤ Categorization Classify Type
Configuration Read Status Read Rounding Mode Read Flags Swap Status Reg Swap Rounding Mode Swap Flags Swap Rounding Mode Imm Swap Flags Imm
Fmt
R R R R R R R R
R
R
R R R
R R R R
R R R R
R R R R R
R R R R
R R R
R R
R R R R
R R R R R R I I
Optional Multiply-Divide Instruction Extension: RVM
RV32M (Multiply-Divide)
MUL rd,rs1,rs2
MULH rd,rs1,rs2
MULHSU rd,rs1,rs2
MULHU rd,rs1,rs2
DIV rd,rs1,rs2
DIVU rd,rs1,rs2
REM rd,rs1,rs2
REMU rd,rs1,rs2
SC.W rd,rs1,rs2
AMOSWAP.W rd,rs1,rs2
AMOADD.W rd,rs1,rs2
AMOXOR.W
AMOAND.W
AMOOR.W
rd,rs1,rs2
rd,rs1,rs2
rd,rs1,rs2
AMOMIN.W
AMOMAX.W
AMOMINU.W
AMOMAXU.W
rd,rs1,rs2
rd,rs1,rs2
rd,rs1,rs2
rd,rs1,rs2
Three Optional Floating-Point Instruction Extensions: RVF, RVD, & RVQ
FMV.{H|S}.X rd,rs1
FMV.X.{H|S} rd,rs1
FCVT.{H|S|D|Q}.W rd,rs1
FCVT.{H|S|D|Q}.WU rd,rs1
FCVT.W.{H|S|D|Q} rd,rs1
FCVT.WU.{H|S|D|Q} rd,rs1
FADD.{S|D|Q} rd,rs1,rs2
FSUB.{S|D|Q} rd,rs1,rs2
FMUL.{S|D|Q} rd,rs1,rs2
FDIV.{S|D|Q} rd,rs1,rs2
FSQRT.{S|D|Q} rd,rs1
FMADD.{S|D|Q} rd,rs1,rs2,rs3
FMSUB.{S|D|Q} rd,rs1,rs2,rs3
FNMSUB.{S|D|Q} rd,rs1,rs2,rs3
FNMADD.{S|D|Q} rd,rs1,rs2,rs3
FSGNJ.{S|D|Q} rd,rs1,rs2
FSGNJN.{S|D|Q} rd,rs1,rs2
FSGNJX.{S|D|Q} rd,rs1,rs2
FMIN.{S|D|Q} rd,rs1,rs2
FMAX.{S|D|Q} rd,rs1,rs2
FEQ.{S|D|Q} rd,rs1,rs2
FLT.{S|D|Q} rd,rs1,rs2
FLE.{S|D|Q} rd,rs1,rs2
FCLASS.{S|D|Q} rd,rs1
FRCSR
FRRM
FRFLAGS
FSCSR
FSRM
FSFLAGS
FSRMI
FSFLAGSI
rd
rd
rd
rd,rs1
rd,rs1
rd,rs1
rd,imm
rd,imm
MUL{W|D}
DIV{W|D}
REM{W|D}
REMU{W|D}
rd,rs1,rs2
rd,rs1,rs2
rd,rs1,rs2
rd,rs1,rs2
+RV{64,128}
Category Name
Load Load Reserved
Store Store Conditional
Optional Atomic Instruction Extension: RVA
Fmt
R
R
RV32A (Atomic)
LR.W rd,rs1
+RV{64,128}
LR.{D|Q} rd,rs1
SC.{D|Q} rd,rs1,rs2
AMOSWAP.{D|Q} rd,rs1,rs2
AMOADD.{D|Q} rd,rs1,rs2
AMOXOR.{D|Q} rd,rs1,rs2
AMOAND.{D|Q} rd,rs1,rs2
AMOOR.{D|Q} rd,rs1,rs2
AMOMIN.{D|Q} rd,rs1,rs2
AMOMAX.{D|Q} rd,rs1,rs2
AMOMINU.{D|Q} rd,rs1,rs2
AMOMAXU.{D|Q} rd,rs1,rs2
Category Name
Fmt
R R
RV32{F|D|Q} (HP/SP,DP,QP Fl Pt)
+RV{64,128}
FMV.{D|Q}.X rd,rs1
FMV.X.{D|Q} rd,rs1
FCVT.{H|S|D|Q}.{L|T} rd,rs1
FCVT.{H|S|D|Q}.{L|T}U rd,rs1
FCVT.{L|T}.{H|S|D|Q} rd,rs1
FCVT.{L|T}U.{H|S|D|Q} rd,rs1
Load Load
Store Store
I
S
FL{W,D,Q} rd,rs1,imm
FS{W,D,Q} rs1,rs2,imm
Register
x0
x1
x2
x3
x4
x5-7
x8
x9
x10-11
x12-17
x18-27
x28-31
f0-7
f8-9
f10-11
f12-17
f18-27
f28-31
ABI Name
zero
ra
sp
gp
tp
t0-2 s0/fp
s1
a0-1
a2-7
s2-11
t3-t6
ft0-7
fs0-1
fa0-1
fa2-7
fs2-11
ft8-11
RISC-V Calling Convention
Saver
--- Caller Callee --- --- Caller Callee Callee Caller Caller Callee Caller
Caller Callee Caller Caller Callee Caller
Description
Hard-wired zero
Return address
Stack pointer
Global pointer
Thread pointer
Temporaries
Saved register/frame pointer Saved register
Function arguments/return values Function arguments
Saved registers
Temporaries
FP temporaries
FP saved registers
FP arguments/return values FP arguments
FP saved registers
FP temporaries
RISC-V calling convention and five optional extensions: 10 multiply-divide instructions (RV32M); 11 optional atomic instructions (RV32A); and 25 floating-point instructions each for single-, double-, and quadruple-precision (RV32F, RV32D, RV32Q). The latter add registers f0-f31, whose width matches the widest precision, and a floating-point control and status register fcsr. Each larger address adds some instructions: 4 for RVM, 11 for RVA, and 6 each for RVF/D/Q. Using regex notation, {} means set, so L{D|Q} is both LD and LQ. See riscv.org. (8/21/15 revision)
Diagram from Problem 3.
Figure B.17 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits.
Copyright © 2011, Elsevier Inc. All rights Reserved.
UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering
EEC 170 Introduction to Computer Architecture Fall 2019
Final Examination Solutions
All instructions in this exam, unless otherwise noted, are RISC-V instructions. Recall that if the instruction writes to a register, that register appears first in the instruction. Unless otherwise indicated, you may not use any RISC-V pseudoinstructions on this exam, only actual RISC-V instructions. Feel free to detach the two RISC-V reference pages and the memory system diagram at the end of the exam, but if you do, please take them with you and recycle them (don’t try to reattach them or turn them in).
1. We discussed dense matrix-dense matrix multiplication in class. To recap, consider A, B, and C as n×n matrices; we wish to compute C = C + A×B. This can be computed by

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *
We begin by computing the arithmetic intensity of this code in operations per memory word, assuming no caches, as a function of n. (Recall that the arithmetic intensity is the number of arithmetic operations per word of memory bandwidth.) Assume n is large; you can approximate any answers given this assumption. Assume each addition or multiplication counts as one operation. Assume all of A, B, and C are stored in main memory (DRAM) and each read from/write to main memory counts as one memory word.
Solution: The inner line of code runs n³ times. It has 2 arithmetic operations, 3 loads, and 1 write. Thus the arithmetic intensity is 2n³ / ((3+1)n³) = 1/2.

(a) (4 points) Consider two different implementations of the same task with the same runtime. Baby Yoda’s has high arithmetic intensity and is limited by arithmetic capability. The Mandalorian’s has low arithmetic intensity and is limited by memory bandwidth. Even though both have the same runtime, we prefer Baby Yoda’s implementation with higher arithmetic intensity because of how we expect future machines to behave. In terms of technology trends, why do we prefer implementations with more arithmetic intensity?
Solution: Historically, arithmetic capability increases more quickly than mem- ory bandwidth (this is the “memory wall”). A future machine will hopefully continue to increase both arithmetic capability and memory bandwidth, but arithmetic capability will likely increase more, so Baby Yoda’s implementation will run faster on future machines than the Mandalorian’s.
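The operation and traffic counts for the naive loop can be sanity-checked numerically. This is a minimal sketch in Python; the function name and loop structure are ours, not part of the exam:

```python
def count_naive(n):
    """Count arithmetic ops and (cache-less) memory words for the
    naive triple loop, then return the arithmetic intensity."""
    ops = 0  # one add + one multiply per inner iteration
    mem = 0  # 3 loads (C, A, B) + 1 store (C) per inner iteration
    for i in range(n):
        for j in range(n):
            for k in range(n):
                ops += 2
                mem += 4
    return ops / mem

print(count_naive(16))  # 0.5, independent of n
```

The ratio is exactly 2n³/4n³ = 1/2 for every n, matching the closed-form answer above.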
(b) (4 points) Assume we have a very large cache so all memory loads only read a reference the first time they see that reference and (because they are captured in the cache) all memory writes occur only at the end of the program. Reads from/writes to cache are free (do not count in terms of arithmetic intensity). What is the arithmetic intensity now?
Solution: We still do 2n³ operations. But now we read each element of A, B, and C once and write each element of C once. That’s a total of 4n² memory operations. So the arithmetic intensity is 2n³ / 4n² = n/2.

(c) (6 points) Now consider a more realistic cache that is not large enough to hold the entire input and output. This cache can hold two matrix rows or columns plus one output element. Assume there are no cache conflicts; any two rows/columns plus one output element can fit in the cache. The Mandalorian proposes using the cache in the following way:

for i = 1 to n
  // read row i of A into cache
  for j = 1 to n
    // read column j of B into cache
    // read C(i,j) into cache
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j) // this is normal (scalar) + and *
    // after k loop is complete, write C(i,j) back to memory

What is the arithmetic intensity now? It is OK to ignore lower-order terms in this analysis.
Solution: Again we have 2n³ operations. The primary memory cost is reading each column of B n times (n² column reads × n elements per column = n³ memory accesses). Thus the arithmetic intensity is 2n³ / n³ = 2.

More precisely, we do the following memory operations:

• read each column of B n times (n³ memory loads)
• read each row of A once (n² memory loads)
• read and write each element of C once (2n² memory operations)

for a total of n³ + 3n² memory operations. The arithmetic intensity is thus 2n³ / (n³ + 3n²), roughly 2 for large n.
(d) (10 points) Finally, Baby Yoda proposes to partition the matrix in a different way. Instead of reading 1-dimensional rows or columns from A and B, Baby Yoda arranges those matrices as b × b subblocks. (For instance, a 128 × 128 matrix may be structured as a 4 × 4 grid of 32 × 32 subblocks.) The cache can hold any two input blocks plus one output block at any time.

Now our matrix multiply looks like this (note our operations are now on blocks at a time instead of rows or columns at a time). The arithmetic, while factored differently, is exactly the same as the previous case.
// note i, j, and k are now block indexes, not element indexes
for i = 1 to n/b
  for j = 1 to n/b
    // read block C(i,j) into cache
    for k = 1 to n/b
      // read block A(i,k) into cache
      // read block B(k,j) into cache
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
      // Here + indicates a matrix sum of blocks (block + block)
      // Here * indicates a matrix multiply of blocks (block * block)
    // after k loop is complete, write block C(i,j) back to memory
What is the arithmetic intensity now? It is again OK to ignore lower-order terms in this analysis.
Solution: Again we have 2n³ operations.

For both A and B, we read (n/b)³ blocks of size b². That’s 2n³/b reads.

We also read and write each block of C once (n² reads, n² writes). (For large n, you can ignore this term.)

The arithmetic intensity is thus 2n³ / (2n³/b + 2n²), roughly b for large n. This is substantially better than the previous cached version (2). It suggests larger b gives better arithmetic intensity. This is true, but it also means the cache must be large enough to hold three b × b blocks.
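The three cached schemes can be compared side by side by plugging the closed-form traffic counts derived above into Python. This is a sketch using those formulas directly, not a cache simulator:

```python
def ai_ideal(n):
    # part (b): every element of A, B, C moved exactly once (4n^2 words)
    return 2 * n**3 / (4 * n**2)

def ai_row_column(n):
    # part (c): each column of B re-read n times, plus A and C once
    return 2 * n**3 / (n**3 + 3 * n**2)

def ai_blocked(n, b):
    # part (d): b x b blocks; 2n^3/b block reads plus C traffic
    return 2 * n**3 / (2 * n**3 / b + 2 * n**2)

n = 1024
print(ai_ideal(n))        # n/2 = 512.0
print(ai_row_column(n))   # roughly 2 for large n
print(ai_blocked(n, 32))  # roughly b for large n (about 31 here)
```

Increasing b in `ai_blocked` keeps raising the arithmetic intensity, which is exactly the trade-off noted above: a bigger b needs a cache that holds three b × b blocks.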
Page 3 of 7
(c) John Owens 2019
Do not reproduce in any way Do not upload anywhere
2. In the 2-page RISC-V Reference Card at the end of the exam, at the bottom right of the first page, you will see the “16-bit (RVC) Instruction Formats”. RVC stands for “RISC-V Compressed”.
(a) (5 points) Note CR-type instructions have a funct4 entry as part of their opcodes instead of a funct3 (as in all the other instructions). Assume that the opcode combinations (inst[1:0] == 01 OR inst[1:0] == 10) AND inst[15:13] == 100 are used to encode CR-type instructions. Given the instruction formats listed in the reference page, how many different instructions can RVC support? Note not all possible RVC instructions are specified on the reference card.
Solution: Most instructions have 2 opcode bits and 3 funct bits. This implies 5 opcode bits total, so 2⁵ = 32 instructions can be supported in this way.

However, CR-type instructions get an extra opcode bit, inst[12]. The CR instructions take 2 of the 32 instruction slots available to a 5-bit opcode. Since CR-type instructions have an additional opcode bit, they can fit 4 instructions into those 2 instruction slots.

Thus RVC, given these assumptions, supports 34 instructions.
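The slot arithmetic above can be written out as a quick check (a sketch; the variable names are ours, not the exam’s):

```python
# 2 opcode bits + 3 funct3 bits give 32 encoding slots total.
slots = 2 ** 5

# CR-type consumes 2 of those slots, but its extra opcode bit
# (inst[12]) doubles what each of those 2 slots can encode.
cr_slots = 2
total = (slots - cr_slots) + cr_slots * 2

print(total)  # 34
```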
(b) In practice, these 16-bit RVC instructions are translated to 32-bit full (“RVI”) RISC-V instructions. This process is summarized in the table on the reference card called “Optional Compressed (16-bit) Instruction Extension: RVC”, but we will highlight the important parts here directly. In this part, you may need to consult the table “16-bit (RVC) instruction formats”.
i. (2 points) The RVC instruction C.ADD rd, rs1 translates to the RVI instruction ADD rd, rd, rs1. Both instructions run a full 64b add. However, the C.ADD operation has a significant restriction compared to the full ADD instruction. What is that restriction? Be specific.

Solution: C.ADD only takes 2 register arguments, so the destination register must also be a source register. ADD takes 3 register arguments, so the two source registers can be distinct from the destination register.

ii. (2 points) Some RVC instructions, like C.BEQZ, have a register argument with a prime (apostrophe) after the register name (C.BEQZ rs1’, imm → BEQ rs1’, x0, imm). What is the restriction indicated by the prime?

Solution: Registers indicated with a prime are specified with only 3 bits in the instruction encoding, instead of RISC-V’s 5 bits. Thus these instructions can only access a subset of registers (this is not important, but it’s x8–x15).

iii. (2 points) All RVC memory-access instructions multiply their immediates by a power-of-two constant (e.g., C.LW rd’, rs1’, imm → LW rd’, rs1’, imm*4). Because of this multiplication, what can an RVI memory-access instruction do that its partner RVC instruction cannot? (The answer is not “allows byte addressing”; both RVC and RVI memory-access instructions are byte-addressed.)
Solution: RVI instructions support misaligned memory operations, where an n-byte memory access is not aligned to an n-byte boundary. RVC memory-access instructions cannot encode misaligned addresses.

iv. (2 points) The instruction C.ADDI16SP is primarily intended to manipulate the stack. (Fill in the previous blank with one word.)

(c) (3 points) The new BabyYoda compiler inputs an RVI (32-bit) RISC-V instruction sequence and transforms some RVI instructions into their RVC (16-bit) equivalents. When running the resulting code, compared to the RVI implementation, would you expect the instruction cache hit rate to increase, decrease, or stay the same? Justify your answer.

Solution: The I-cache hit rate should increase. The instruction stream is identical in terms of operations, but the stream with RVC instructions takes fewer bytes, takes up less space in the cache, can thus store more instructions than the RVI-only stream, and hence will have a higher hit rate.

3. While sipping his soup, Baby Yoda uses the Force to derive the following characteristics of his laptop’s memory system, which is byte-addressed and contains a virtual memory subsystem and L1 and L2 caches. A copy of this diagram is attached at the end of the exam, which you can detach. In this diagram,
Figure B.17: The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KB, and the L2 cache is a four-way set associative with a total of 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits. Copyright © 2011, Elsevier Inc. All rights reserved.

(a) (3 points) What is the page size?

Solution: The page offset is 14 bits, so the page size is 2¹⁴ B = 16 kB.
(b) (3 points) The TLB has 256 entries in it. What is the associativity of the TLB (e.g., 1-way set associative = direct-mapped, 2-way set associative, 4-way, fully associative, etc.)?

Solution: The TLB index has 7 bits, so there are 2⁷ = 128 sets. That’s 2 entries per set; the TLB is 2-way set associative.

(c) (3 points) How large is the L1 cache (data only)?

Solution: The cache index is 8 bits, so there are 2⁸ = 256 entries in the cache. Each entry is 512 bits (64 bytes), also shown by a 6-bit block offset. 256 entries × 64 bytes is 16 kB.

(d) (3 points) Assuming the L2 cache stores 2²² B = 4 MiB of data, what is its associativity?

Solution: The data per entry is 512 bits = 64 B. Thus there are 4 MiB / 64 B = 64 k = 2¹⁶ entries. The cache index has only 14 bits, so that implies 4 entries per set, making the cache 4-way set associative.
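All four geometry answers follow mechanically from the bit-field widths in Figure B.17. A minimal sketch, using the field widths quoted in the solutions above:

```python
# Page size: 14-bit page offset
page_offset_bits = 14
print(2 ** page_offset_bits)  # 16384 B = 16 KiB

# TLB associativity: 256 entries spread over 2^7 sets
tlb_entries, tlb_index_bits = 256, 7
print(tlb_entries // 2 ** tlb_index_bits)  # 2 (2-way set associative)

# L1 data size: 8-bit index, 6-bit block offset (64-byte blocks)
l1_index_bits, block_offset_bits = 8, 6
print(2 ** (l1_index_bits + block_offset_bits))  # 16384 B = 16 KiB

# L2 associativity: 4 MiB of 64-byte blocks over 2^14 sets
l2_bytes, l2_index_bits = 2 ** 22, 14
l2_blocks = l2_bytes // 2 ** block_offset_bits  # 65536 blocks
print(l2_blocks // 2 ** l2_index_bits)  # 4 (4-way set associative)
```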
(e) (3 points) What is the motivation for making replacement algorithms for hardware caches less complicated than replacement algorithms for pages?
Solution: Either of a couple of general reasons is a good answer here:
• Cache replacement algorithms have to run quickly (and hence can’t be as complicated) because processing time for the algorithm (thus hit/miss times) must be low.
• Page replacement algorithms must be as accurate as possible (reducing the miss rate) because the miss penalty when pages are not in memory and have to be paged from disk is very long.
4. Assume that you have two 64-bit double-precision floating-point numbers stored in (integer) registers x2 and x3. You run the instruction XOR x1, x2, x3. In this question, “normal number” means “not zero, not infinity, not denormal, not NaN”.
(a) (3 points) If you know that x2 is a positive normal number and x3 is a negative normal number, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.) Bit 0 is the least significant bit (LSB) and bit 63 is the most significant bit (MSB).

Solution: Bit 63 will be 1, because you know the two input sign bits will be different.

(b) (3 points) If you know that x2 is a negative denormal number and x3 is negative infinity, what bits in x1 do you know for certain? (Your answer should be of the form “Bit 42 will be a 1 and bit 24 will be a 0”.)

Solution: Bit 63 will be 0, because both input sign bits are the same. Bits 62–52 inclusive will all be 1, because x2’s exponent is all zeroes and x3’s exponent is all ones.
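Both predictions can be tested on real IEEE 754 bit patterns. This sketch XORs the raw 64-bit encodings of two doubles, as the integer XOR instruction would; the specific values (1.5, -2.75, the smallest negative denormal) are ours, chosen only to satisfy the stated categories:

```python
import struct

def bits(x):
    """Return the 64-bit IEEE 754 encoding of a double as an int."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

# part (a): positive normal XOR negative normal -> sign bit (63) is 1
r = bits(1.5) ^ bits(-2.75)
assert (r >> 63) & 1 == 1

# part (b): negative denormal XOR negative infinity
r = bits(-5e-324) ^ bits(float('-inf'))
assert (r >> 63) & 1 == 0          # sign bits match, so bit 63 is 0
assert (r >> 52) & 0x7FF == 0x7FF  # all-0 exponent XOR all-1 exponent

print("predictions hold")
```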