CS 61C Great Ideas in Computer Architecture Summer 2020
INSTRUCTIONS
Final
This is your exam. Complete it either at exam.cs61a.org or, if that doesn’t work, by emailing course staff with your solutions before the exam deadline.
This exam is intended for the student with email address qwe1029384756@berkeley.edu. If this is not your email address, notify course staff immediately, as each exam is different. Do not distribute this exam PDF even after the exam ends, as some students may be taking the exam in a different time zone.
For questions with circular bubbles, you should select exactly one choice. # You must choose either this option
# Or this one, but not both!
For questions with square checkboxes, you may select multiple choices. 2 You could select this choice.
2 You could select this one too!
You may start your exam now. Your exam is due at 12:00PM Pacific Time. Go to the next page to begin.
Exam generated for qwe1029384756@berkeley.edu 2
Preliminaries
Please complete and submit these questions before the exam starts.
(a) What is your full name?
(b) What is your student ID number?
(c) If an answer requires hex input, make sure you only use capitalized letters! For example, 0xDEADBEEF instead of 0xdeadbeef. You will be graded incorrectly otherwise! Please always add the hex (0x) and binary (0b) prefix to your answers or you will receive 0 points. For all other bases, do not add the suffix or prefixes.
Do not add units unless the problem explicitly tells you to!
Some of the questions may use images to describe a problem. If the image is too small, you can click and drag the image to a new tab to see the full image. You can also right click the image and download it or copy its address to view it better. You can use the image below to try this. You can also click the star by the question if you would like to go back to it (it will show up on the side bar). In addition, you are able see a check mark for questions you have fully entered in the sidebar. Questions will auto submit about 5 seconds after you click off of them, though we still recommend you click the save button.
Good luck!
Exam generated for qwe1029384756@berkeley.edu 3 1. CALL
Suppose we have compiled some C code using the Hilfinger-Approved(TM) CS61Compiler, which will compile, assemble, and link the files max.c and jie.c,among others, to create a wonderful executable. After the code has been assembled to RISC-V we have the following labels across all files: sean, jenny, stephan, philspel, poggers, crossroads, and segfault. Assume no two files define the same label, though each file interacts with every label, either via reference or definition.
Note: segment refers to a directive in any assembly file, e.g. .data or .text
The CS 61Compiler begins to fill out the relocation table on the first pass of assembling max.s, which defines or
references all of the labels above. This is its relocation table after the first pass:
label address
sean ???? stephan ???? jenny ???? segfault ???? philspel ????
(a) (2.0 pt) sean, stephan, jenny, segfault, and philspel all show up in the relocation table after the first pass through. Which of the following must be true? Select all that apply.
2 They are external references.
2 They are referenced before poggers and crossroads. 2 They belong in the .text segment.
2 None of the other options
2 They are referenced before they are defined.
(b) (2.0 pt) After the first pass through, poggers and crossroads don’t show up in the relocation table. What does this imply about the two function labels? Select all that apply.
2 They are .globals.
2 They are both referenced before they are defined.
2 None of the other options
2 After the assembler is finished, they are in the same segment.
(c) (2.0 pt) After the second pass by the assembler, we see that philspel is no longer in the relocation table. Which of the following is true about philspel? Select all that apply.
2 philspel is in the .text segment of max.s 2 philspel is an external reference.
2 philspel is in the .text segment of jie.s 2 The address for philspel was resolved.
2 None of the other options
(d) (2.0 pt) After assembling jie.s to jie.o we have the following symbol table for jie.o. In linking max.o and jie.o we get dan.out. Which of the following could be true about ‘sean’ and ‘jenny’ after linking? Select all that apply.
Exam generated for qwe1029384756@berkeley.edu 4 label address
sean 0x061c jenny 0x1620
2 None of the other options
2 sean and jenny are in different sections of jie.s.
2 They are in different files.
2 They are in the same segment.
2 sean and jenny will have the same byte difference after linking as it did in jie.o.
Exam generated for qwe1029384756@berkeley.edu 5 2. Cache and MOESI
Consider a computer which has 2 processors, each with their own cache. Both have the same design: A 128 B cache size, 2-way set associative, 4 ints per block, write-back, and write-allocate with LRU replacement. Each cache takes in 20-bit addresses. Assume that ints are 4 bytes, and we are using the MOESI cache-coherence protocol.
(a) (0.25 pt) The 20-bit addresses are Virtual Addresses # True
# False
(b) i. (0.25 pt) How many Offset bits?
ii. (0.25 pt) How many Index bits?
iii. (0.25 pt) How many Tag bits?
Exam generated for qwe1029384756@berkeley.edu 6
(c) We decide to parallelize a for loop across these 2 processors, but instead of using OpenMP, we have each thread do a strided memory access, where processor 0 handles even indices, while processor 1 handles odd indices. However, the memory accesses are perfectly interleaved, i.e. the order of array accesses are still A[0], A[1], A[2], A[3]. . .
# define ARR_LEN 32
// A is located at address 0xA0000
int A[ARR_LEN];
// Processor 0’s loop
for (int i = 0; i < ARR_LEN; i += 2) {
A[i] += i }
// Processor 1’s loop
for (int j = 1; j < ARR_LEN; j += 2) {
A[j] += j }
For each memory access below,
i. Classify it as a Hit or Miss. Snooping another cache for data is considered a coherency Miss.
ii. Since we are working in a multiprocessor system, classify the state of the block that the data accessed
resides in from the specified processors perspective. i. A[0] Read
A. (0.25 pt)
# Hit # Miss
B. (0.25 pt)
State from proc 0’s point of view. #E
#S
#I
#M #O
Exam generated for qwe1029384756@berkeley.edu 7
ii. A[0] Write A. (0.25 pt)
# Miss # Hit
B. (0.25 pt)
State from proc 0’s point of view. #O
#E
#I
#M #S
Exam generated for qwe1029384756@berkeley.edu 8
iii. A[1] Read A. (0.25 pt)
# Miss # Hit
B. (0.25 pt)
State from proc 1’s point of view. #I
#S
#O
#E #M
Exam generated for qwe1029384756@berkeley.edu 9
iv. A[1] Write A. (0.25 pt)
# Hit # Miss
B. (0.25 pt)
State from proc 0’s point of view. #M
#S
#O
#E #I
Exam generated for qwe1029384756@berkeley.edu 10
v. A[2] Read A. (0.25 pt)
# Hit # Miss
B. (0.25 pt)
State from proc 1’s point of view. #M
#O
#S
#E #I
Exam generated for qwe1029384756@berkeley.edu 11
vi. A[2] Write A. (0.25 pt)
# Hit # Miss
B. (0.25 pt)
State from proc 1’s point of view. #I
#M
#O
#E #S
Exam generated for qwe1029384756@berkeley.edu 12
vii. A[3] Read A. (0.25 pt)
# Hit # Miss
B. (0.25 pt)
State from proc 0’s point of view. #E
#I
#S
#O #M
Exam generated for qwe1029384756@berkeley.edu 13
viii. A[3] Write A. (0.25 pt)
# Miss # Hit
B. (0.25 pt)
State from proc 1’s point of view. #I
#O
#E
#M #S
Exam generated for qwe1029384756@berkeley.edu 14 (d) (2.0 pt) What is the overall hit rate? Leave your answer as a fully simplified fraction.
(e) (2.0 pt) What fraction of misses are coherency misses? Leave your answer as a fully simplified fraction.
(f) (1.0 pt) In total, how many times did we need to go to main memory to write-back?
(g) (2.0 pt) We want to avoid all the coherency misses, so we look to see if we can rewrite our code to optimize for cache performance. Which of the following methods will lead to a higher HR than that from the interleaved accesses?
2 Letting processor 1 start and finish, then processor 0 starts and finishes 2 Letting processor 0 start and finish, then processor 1 starts and finishes 2 None of the other options
Exam generated for qwe1029384756@berkeley.edu 15 3. TLP
In signal processing, the technique of cross-correlation (or sliding dot-product) is often used to determine the delay of a signal. In this problem, we will implement a cross-correlation function in C, parallelized of course!
(You don’t need to know anything about EE to ace this problem!) :0
The following function, sliding_dot, takes two arrays. original contains an array of length n and other contains n + k elements. We will shift other k times, and for each shift, compute its dot product with original. We then store these values in result.
We want to parallelize sliding_dot with OpenMP. Examine our attempts below and choose the behavior(s) you expect from each version. Assume the processor has four threads, 32B cache blocks, and sizeof(int) = 4.You may also assume that all calls to calloc() succeed.
Here is the template of the code where we will replace the /* OPTIMIZED CODE */ with the code on each question.
int * sliding_dot(int * other, int * original, int n, int k) {
int * result = (int *) calloc(k * sizeof(int))
// shift the array
for (int shift = 0; shift <= k; shift++) {
/* OPTIMIZED CODE */
}
return result;
}
(a) (2.0 pt)
int * shifted = other + shift;
int dot_product = 0;
#pragma omp parallel for private(dot_product)
for (int i = 0; i < n; i++) {
#pragma omp critical
dot_product += shifted[i] * original[i];
}
result[shift] = dot_product;
How will this code behave?
2 Sometimes Incorrect
2 Always Correct, slower than serial 2 Always Correct, faster than serial
Exam generated for qwe1029384756@berkeley.edu 16 (b) (2.0 pt)
int * shifted = other + shift;
int dot_product = 0;
#pragma omp parallel for reduction(+:dot_product)
for (int i = 0; i < n; i++) {
#pragma omp critical
dot_product += shifted[i] * original[i];
}
result[shift] = dot_product;
How will this code behave?
2 Always Correct, faster than serial 2 Sometimes Incorrect
2 Always Correct, slower than serial
(c) (2.0 pt)
int * shifted = other + shift;
int dps[4] = {0,0,0,0};
#pragma omp parallel
{
#pragma omp for
for (int i = 0; i < n; i++) {
int id = omp_get_thread_num();
dps[id] += shifted[i] * original[i];
}
}
result[shift] = dps[0] + dps[1] + dps[2] + dps[3];
How will this code behave?
2 Always Correct, slower than serial 2 Sometimes Incorrect
2 Always Correct, faster than serial
(d) (2.0 pt)
int * shifted = other + shift;
int dot_product = 0
#pragma omp parallel for
for (int i = 0; i < n; i++) {
dot_product += shifted[i] * original[i];
}
result[shift] = dot_product;
How will this code behave?
2 Always Correct, faster than serial 2 Sometimes Incorrect
2 Always Correct, slower than serial
Exam generated for qwe1029384756@berkeley.edu 17 4. RISC-V Instruction Format
You are working on a new chip for an embedded application, and want to create a new ISA. Fed up with the different RISC-V instruction types, you decide to include only one, universal type - the X-type instruction.
Say we wish to include the following instructions:
0. add rd1, rs1, rs2
1. and rd1, rs1, rs2
2. lw rd1, offset1 (rs1)
3. sw rs2, offset1 (rs1)
4. addi rd1, rs1, imm1
5. beq rs1, rs2, offset1
6. lui rd1, offset1
7. jal rd1, imm
8. stw rs3, offset1, offset2 (rs1)
The new stw instruction stores the contents of rs3 into both rs1 + offset1 and rs1 + offset2. The RTL is: Mem(R[rs1] + offset1) R[rs3] AND Mem(R[rs1] + offset2) R[rs3]
(a) (2.0 pt) You want to do away with the funct3 and funct7 fields and only use an opcode. If we only wish to support the instructions listed above, what is the minimum number of bits the opcode field can be?
(b) (3.0 pt) We want to be able to jump up to 64 KiB in either direction with a single instruction. How many bits are necessary to encode an immediate that would allow us to do this? Assume that, just like RV32, the least significant bit is an implicit 0 and is not stored in the instruction.
(c) (2.0 pt) Regardless of your previous answers, you finally decide on the instruction format below. You’ve added some fields to account for new instructions you might want to include later on. The opcode for each instruction is the same as the list index given at the beginning of this problem (e.g. sw has opcode 3).
imm2 imm1 rs3 rs2 rs1 rd2 rd1 opcode
This instruction format is quite long, so we decide to work on a 64-bit machine. Each immediate field is 11 bits, and the opcode is 7 bits. What is the maximum number of registers we can have?
Exam generated for qwe1029384756@berkeley.edu 18
(d) (3.0 pt) Realizing supplies have run low due to COVID-19, you switch to a 32-bit machine, and finalize your instruction format to have 4 bits for each of the immediate fields, 4 bits for each register, and 4 bits for the opcode.
Convert the instruction stw x8, 0, 4 (x5) into machine code. Leave your answer in binary (don’t forget the prefix!). If a field is not used, fill in the field with ‘x’s.
Exam generated for qwe1029384756@berkeley.edu 19 5. DLP
In many applications, we wish to not only find the maximum element of an array, but the index of the maximum element, or the argmax. To do this quickly, we decide to utilize Data Level Parallelism. The following function, argmax, takes in an array, arr, and its length, n, and returns the index of the maximum value. If there exist multiple indices which contain the same maximum value, the function returns the first of these indices.
Use the provided “pseudo” SIMD intrinsics to fill in the function so it behaves as expected. The SIMD intrinsics operate on vec structs which represent SIMD vectors that contain 4 packed integers (exactly like Intel’s __m128i structs). You may not need all lines.
SIMD Instructions:
vec sum_epi32 (vec a, vec b)
// returns a + b
vec and_epi32 (vec a, vec b)
// returns a & b
vec set_epi32 (int a)
// return SIMD vector with all entries set to a
vec load_epi32 (int *a)
// return SIMD vector with entries a[0], a[1], a[2], and a[3] respectively
int reducemax_epi32 (vec a)
// return the value of the maximum int in vector a
vec maskeq_epi32 (vec a, int b)
// return mask vector with 0xFFFFFFFF for indices where a is equal to b and 0 otherwise
int firstv_epi32 (vec a)
// return index of first entry with lowest bit set to 1
int argmax(int *arr, int n) {
int curr, index = 0, running_max = -2147483648; // -2^31
vec temp;
/* Your Code Here */
return index;
}
Exam generated for qwe1029384756@berkeley.edu 20 (a) (10.0 pt) /* Your Code Here */
Exam generated for qwe1029384756@berkeley.edu 21 6. Single Cycle Datapath
(a) Which of the following components are not utilized by the given instruction? As in, the output(s) of the component are not useful to the overall execution of the instruction. Select all that apply.
i. (2.0 pt) lui s2, 0xC561C 2 Immediate generator
2 Branch comparator
2 IMEM
2 Register File
2 All components are utilized by this instruction
ii. (2.0 pt) jal ra, t0, label
2 All components are utilized by this instruction 2 ALU
2 Control Logic Unit
2 DMEM
2 PC register
Exam generated for qwe1029384756@berkeley.edu 22
(b) You’ve been running multiple recursive programs on your RISC-V CPU lately, and noticed that one of the main causes of slowdown is that you always have to save ra onto the stack before you do the recursive call. You decide to modify your current single-cycle RISC-V datapath to implement an instruction that can save to the stack and jump at the same time.
Jump-and-save
jas label
R[ra] = PC + 4, R[sp] = sp - 4, PC = PC + offset, Mem[sp - 4] = PC + 4
i. (2.0 pt) Which of the following will correctly implement sp_wb, the value that will get written back to sp? spMinus4 is a pre-computed value equal to sp - 4.
2C 2D 2A 2E 2B
sp_wb Choices
Exam generated for qwe1029384756@berkeley.edu 23
RegFile Choices
Exam generated for qwe1029384756@berkeley.edu 24
ii. (2.0 pt) Now that we have sp_wb, which of the following will correctly write it back to the RegFile? SPWEn is a signal analogous to RegWEn: it is 1 when we wish to write to sp and 0 otherwise. Furthermore, if SPEn is false, SP will not be updated, even if RegWEn is true.
2C 2A 2B 2D
Exam generated for qwe1029384756@berkeley.edu 25 iii. (2.0 pt) Which combination of the following circuits will correctly implement the “save to the stack”
operation?
2E 2H 2D 2C 2G 2B 2A 2F
Memory Choices
Exam generated for qwe1029384756@berkeley.edu 26 7. Virtual Memory
(a) We are working with a system with a 4 GiB physical memory, and 16 MiB virtual memory, and a page size of 4 KiB. For each PTE, we choose to store 12 bits of metadata (dirty bit, permissions).
i. For this part, assume we are working with a single level page table. A. (0.5 pt) How many bits are in the page offset?
B. (0.5 pt) How many bits are in the PPN?
C. (0.5 pt) How many bits are in the VPN?
D. (0.5 pt) How many bits are in a PTE?
Exam generated for qwe1029384756@berkeley.edu 27
(b) For the rest of the problem, we will be working with a 2-level, hierarchical page table with no TLBs. Assume the VPN bits are split evenly between levels, so every PT at every level has the same number of PTEs.
i. For each page table level, calculate the number of PTEs in total, across all possible page tables in that level.
A. (0.5 pt) L1 Number of PTEs
B. (0.5 pt) L2 Number of PTEs
Exam generated for qwe1029384756@berkeley.edu 28
ii. (2.0 pt) Let’s say the computer just started up, meaning that the page table has yet to allocate any pages in the physical memory. We then store 8 contiguous bytes to memory. In the worst case, how many page tables will we use?
iii. Consider the following hierarchical page table. Regardless of your previous answers, assume that there are 64 PTEs in each page table. Arrows from one level to another represent a valid PTE for the page tables, or page for physical memory. The indices of the PTEs/pages are ordered from the top-down, i.e. the top-most refers to index 0. Only consider slots with a letter inside of them.
Multi-Level Page Table
Given the following PTEs accessed at each level, reconstruct the virtual and physical addresses in hex. If the data provided creates an invalid address, your answers should be N/A. For all memory accesses, we are attempting to access the 0th byte of the page.
L1 PTE: E
L2 PTE: K
Exam generated for qwe1029384756@berkeley.edu 29 A. (2.0 pt) Virtual Address
B. (2.0 pt) Physical Address
Exam generated for qwe1029384756@berkeley.edu 30
iv. Given the following virtual addresses, first identify whether it is a page hit or page fault. If it is a hit, write out the sequence of “letters” that make up the path. If it page faults, then you must play the role of the OS, create the appropriate mapping given the available PTEs/physical pages, and leave your answer as the new path that will now be taken. Format your answer without spaces between the letters e.g. ABC.
A. VA = 0xF83000 B. (0.5 pt)
# Page Fault # Page Hit
C. (1.5 pt) Path
Exam generated for qwe1029384756@berkeley.edu 31
D. VA = 0x0C3000 E. (0.5 pt)
# Page Hit # Page Fault
F. (1.5 pt) Path
Exam generated for qwe1029384756@berkeley.edu 32
8. ECC, RAID
(a) (0.5 pt) If we want to tolerate 1 disk failure, which version(s) of RAID should we use?
2 RAID 5
2 None of the other options 2 RAID 4
2 RAID 1
2 RAID 0
(b) (0.5 pt) Which version of RAID is fastest for small random writes? 2 RAID 1
2 RAID 0
2 RAID 5
2 None of the other options 2 RAID 4
Parity Bits
i. (2.0 pt) What is the minimum Hamming distance necessary to allow for Single Error Detection
and Single Error Correction?
ii. (2.0 pt) How many bits of data can we cover if we have 7 parity bits?
iii. Consider the following codeword we wish to send: 0b10110100.
A. (3.0 pt) What is the Hamming ECC we should send over to ensure that we can detect and correct a 1-bit error?
(c)
Exam generated for qwe1029384756@berkeley.edu 33
B. (2.0 pt) We receive the word, but notice something is off. We are unable to see the contents of the bits, but we are told that only the parity check for p1 failed. Given this information, which bit position holds the error? (Remember that indices begin at 1 for ECC)
Exam generated for qwe1029384756@berkeley.edu 34 9. RISC-V Coding
We wish to implement a function, reverse_str, that will take in a pointer to a string, its length, and reverse it. Assume that the argument registers, a0, a1, hold the pointer to and length of the string, respectively. Complete the following code skeleton to implement this function. You must use commas to separate arguments in your code, e.g. add x0, x0, x0.
Reverse_str:
# This part saves all the required registers you will use.
# HIDDEN CODE
mv s0, a0 # memory address
mv s1, a1 # strlen
addi t0, x0, 0 # iteration
Loop:
# YOUR CODE HERE
# retrieve left and right letters
# switch chars
# iterate if necessary
# END YOUR CODE HERE
# This part restores all of the registers which were used.
# HIDDEN CODE
ret
Exam generated for qwe1029384756@berkeley.edu 35 (a) (10.0 pt)
Exam generated for qwe1029384756@berkeley.edu 10. SDS, Logic
We will be analyzing the following circuit:
36
Given the following information:
• AND gates have a propagation delay of 9ns
• OR gates have a propagation delay of 14ns
• NOT gates have a propagation delay of 5ns
• x_input switches value(i.e. 1 to 0, 0 to 1) 30 ns after the rising edge of the clk
• y_output is directly attached to a register
• Setup time is 3ns
• Clk-to-q delay time: 4ns
(a) (2.0 pt) What is the max hold time in ns?
(b) (2.0 pt) What is the minimum clock period in ns?
Circuit
(c) (3.0 pt) Regardless of your previous answers, assume the clock period is 50ns, the first rising edge of the clock is at 25 ns and x_input is initialized to 0 at 0ns. At what time in ns will y_output become 1?
Exam generated for qwe1029384756@berkeley.edu 37 (d) (3.0 pt) How long will y_output remain equal to 1 before switching to 0?
Exam generated for qwe1029384756@berkeley.edu 38 11. Number Rep
(a) Translate the following numbers to their specified bases and representations. Do not include leading 0s, and remember to include the appropriate prefix for hex and binary, but no other base.
21213
i. (1.5 pt) Decimal
ii. (1.5 pt) Base 5 bias (added bias of -100)
Exam generated for qwe1029384756@berkeley.edu 39 (b) We want to use a new floating point format with base 3. Consider an 8 digit “minifloat” S EEEE MMM (1
sign digit, 4 exponent digit, 3 mantissa digit).
All other properties of IEEE754 apply (bias, denormalized numbers,1, NaNs, etc), which includes normalized numbers having a most significant digit of 1 and denormalized numbers having an implicit leading 1 and denormalized numbers having an implicit leading 0. The sign digit only takes values of 0 and 1.
Normalized: ( 1)sign ⇤ 3exponent+bias ⇤ 1.mantissa Denormalized: ( 1)sign ⇤ 3exponent+bias+1 ⇤ 0.mantissa Assume we have a bias of -33.
i. (2.5 pt) Represent 5.2 with our new floating point format.
ii. (2.5 pt) What is the decimal value of the largest positive normalized float? Express your answer in terms of powers of 3 from largest to smallest. Ex: 2*3ˆ8+1*3ˆ2+2*3ˆ0. Leave out the zero bits and do NOT add spaces. Do not add parenthses for powers!
Exam generated for qwe1029384756@berkeley.edu 40 12. I/O
We wish to communicate with an I/O device using Memory Mapped I/O. To do so, we have set aside a portion of our address space to communicate with this device, beginning at address 0xA0000000. Below is a table describing all the special addresses (control/data registers) and the purpose of each value that lives there. Assume that our device has 32 pins for I/O which can each hold 64 bits of data, sizeof(uint16_t) == 2, sizeof(uint32_t) == 4, and sizeof(uint64_t) == 8:
Address
0xA0000100 0xA0000108
0xA0000200 0xA0000208
Field Name
IN_READY OUT_READY
IN_DATA OUT_DATA
Purpose
The i-th bit indicates whether or not the device has a value at pin i that should be read by the computer via a 1 or 0, respectively The i-th bit indicates whether or not the computer has a value for pin i that should be read by the device via a 1 or 0, respectively The input data from the pin indicated by IN_READY
The output data to the pin indicated by OUT_READY
(a) Fill in the following C code to complete the implementation of a struct that will “cover” these addresses and allow us to manage this device without hard-coding all the addresses. For example, we should be able to access IN_READY by using IO_device->IN_READY. Assume that memory will be word-aligned, but not padded. You should be using all provided lines and can only have one semicolon per line:
typedef struct {
uint32_t IN_READY;
uint32_t padding1[<**CODE INPUT 1**>];
<**CODE INPUT 2**>;
uint32_t padding2[<**CODE INPUT 3**>];
<**CODE INPUT 4**>
uint64_t OUT_DATA;
} IO_DEVICE;
i. (1.0 pt) <**CODE INPUT 1**>
ii. (0.5 pt) <**CODE INPUT 2**>
iii. (1.0 pt) <**CODE INPUT 3**>
Exam generated for qwe1029384756@berkeley.edu 41 iv. (0.5 pt) <**CODE INPUT 4**>
Exam generated for qwe1029384756@berkeley.edu 42 (b) Now that you have this struct at your disposal, use it to complete the following functions that will allow
you to communicate with your device. Assume that memory has been initialized to random data.
# define base_io_addr 0xA0000000
IO_DEVICE* IO_DEVICE_ptr = <**CODE INPUT 1**>;
uint64_t read_from_pin(int pin) {
# Check if pin has something to be read, and if so, read this value. Else, return 0
<**CODE INPUT 2**>
}
void write_to_pin(int pin, uint64_t data) {
# Notify the device that we have something to write, and then write it.
# Note that OUT_READY can only have one bit active at a time,
# but our device handles resetting this value every time it reads.
<**CODE INPUT 3**>
}
i. (1.0 pt) <**CODE INPUT 1**>
ii. (3.0 pt) <**CODE INPUT 2**>
iii. (5.0 pt) <**CODE INPUT 3**>
Exam generated for qwe1029384756@berkeley.edu 43 13. Pipeline
We wish to implement SIMD instructions in our pipelined RISC-V datapath. In order to do so, we will take 2 steps:
Combine the IF and ID stages into an IFD stage Create 2 paths for the datapath to take after IF: one path for normal RISC-V instructions and one for SIMD RISC-V containing specialized hardware.
The only problem is that the SIMD EX stage takes 3 cycles to complete instead of 1, and no other SIMD instruction is allowed to enter the SIMD EX stage while another SIMD instruction is there.
Pipeline
(a) (3.0 pt) This “delay” from the SIMD EX stage necessitates the use of stalls to ensure proper functionality. Which of the following implementations correctly generates the stall signal? You may ignore any kinds of stalls caused by hazards; we are only concerned with this special case in our new pipeline. However, we still want to maintain a good instruction throughput. To do this, we should allow normal instructions to continue through the CPU, as they are not blocked from doing so by the SIMD path.
The signal SIMD_Inst is an indicator that the current instruction fetched from IFD is a SIMD instruction, while the signal SIMD_count refers to the number of the cycle the SIMD instruction is completing in the EX stage, i.e. when it is in the first cycle of the EX stage, SIMD_count = 1. If there is no instruction in the SIMD EX stage, this value is 0. The comparators are unsigned. Select all that apply.
2A
2D
2 None of the other options 2C
2B
Exam generated for qwe1029384756@berkeley.edu
44
Pipeline 1
Pipeline 2
Exam generated for qwe1029384756@berkeley.edu 45
(b) (3.0 pt) Because we wish to actually stall and not flush, how should the PC and PC mux be updated to allow for this? Assume stall is a signal that is 1 when we should stall, and therefore not fetch a new instruction, or 0 otherwise. Select all that apply.
2B
2D
2A
2C
2 None of the other options
(c) (2.0 pt) How many stalls caused by the SIMD EX stage are needed for the following piece of code?
1. addi t0, x0, 1
2. simd_add x1, x2, x3
3. simd_sub x2, x1, x3
4. addi t1, t0, 2
5. simd_mul x2, x2, x2
6. sub t0, t0, t1
7. mul t0, t0, t0
8. simd_div x2, x2, x1
Exam generated for qwe1029384756@berkeley.edu 46 No more questions.