CS代考 CS 61C Great Ideas in Computer Architecture Summer 2020

CS 61C Great Ideas in Computer Architecture Summer 2020
INSTRUCTIONS
This is your exam. Complete it either at exam.cs61a.org or, if that doesn’t work, by emailing course staff with your solutions before the exam deadline.
This exam is intended for the student with email address If this is not your email address, notify course staff immediately, as each exam is different. Do not distribute this exam PDF even after the exam ends, as some students may be taking the exam in a different time zone.

For questions with circular bubbles, you should select exactly one choice. # You must choose either this option
# Or this one, but not both!
For questions with square checkboxes, you may select multiple choices. 2 You could select this choice.
2 You could select this one too!
You may start your exam now. Your exam is due at 12:00PM Pacific Time. Go to the next page to begin.

Exam generated for 2
Preliminaries
Please complete and submit these questions before the exam starts.
(a) What is your full name?
(b) What is your student ID number?
(c) If an answer requires hex input, make sure you only use capitalized letters! For example, 0xDEADBEEF instead of 0xdeadbeef. You will be graded incorrectly otherwise! Please always add the hex (0x) and binary (0b) prefix to your answers or you will receive 0 points. For all other bases, do not add the suffix or prefixes.
Do not add units unless the problem explicitly tells you to!
Some of the questions may use images to describe a problem. If the image is too small, you can click and drag the image to a new tab to see the full image. You can also right click the image and download it or copy its address to view it better. You can use the image below to try this. You can also click the star by the question if you would like to go back to it (it will show up on the side bar). In addition, you are able see a check mark for questions you have fully entered in the sidebar. Questions will auto submit about 5 seconds after you click off of them, though we still recommend you click the save button.
Good luck!

Exam generated for 3 1. CALL
Suppose we have compiled some C code using the Hilfinger-Approved(TM) CS61Compiler, which will compile, assemble, and link the files max.c and jie.c,among others, to create a wonderful executable. After the code has been assembled to RISC-V we have the following labels across all files: sean, jenny, stephan, philspel, poggers, crossroads, and segfault. Assume no two files define the same label, though each file interacts with every label, either via reference or definition.
Note: segment refers to a directive in any assembly file, e.g. .data or .text
The CS 61Compiler begins to fill out the relocation table on the first pass of assembling max.s, which defines or
references all of the labels above. This is its relocation table after the first pass:
label address
sean ???? stephan ???? jenny ???? segfault ???? philspel ????
(a) (2.0 pt) sean, stephan, jenny, segfault, and philspel all show up in the relocation table after the first pass through. Which of the following must be true? Select all that apply.
2 They are external references.
2 They are referenced before poggers and crossroads. 2 They belong in the .text segment.
2 None of the other options
2 They are referenced before they are defined.
(b) (2.0 pt) After the first pass through, poggers and crossroads don’t show up in the relocation table. What does this imply about the two function labels? Select all that apply.
2 They are .globals.
2 They are both referenced before they are defined.
2 None of the other options
2 After the assembler is finished, they are in the same segment.
(c) (2.0 pt) After the second pass by the assembler, we see that philspel is no longer in the relocation table. Which of the following is true about philspel? Select all that apply.
2 philspel is in the .text segment of max.s 2 philspel is an external reference.
2 philspel is in the .text segment of jie.s 2 The address for philspel was resolved.
2 None of the other options
(d) (2.0 pt) After assembling jie.s to jie.o we have the following symbol table for jie.o. In linking max.o and jie.o we get dan.out. Which of the following could be true about ‘sean’ and ‘jenny’ after linking? Select all that apply.

Exam generated for 4 label address
sean 0x061c jenny 0x1620
2 None of the other options
2 sean and jenny are in different sections of jie.s.
2 They are in different files.
2 They are in the same segment.
2 sean and jenny will have the same byte difference after linking as it did in jie.o.

Exam generated for 5 2. Cache and MOESI
Consider a computer which has 2 processors, each with their own cache. Both have the same design: A 128 B cache size, 2-way set associative, 4 ints per block, write-back, and write-allocate with LRU replacement. Each cache takes in 20-bit addresses. Assume that ints are 4 bytes, and we are using the MOESI cache-coherence protocol.
(a) (0.25 pt) The 20-bit addresses are Virtual Addresses # True
(b) i. (0.25 pt) How many Offset bits?
ii. (0.25 pt) How many Index bits?
iii. (0.25 pt) How many Tag bits?

Exam generated for 6
(c) We decide to parallelize a for loop across these 2 processors, but instead of using OpenMP, we have each thread do a strided memory access, where processor 0 handles even indices, while processor 1 handles odd indices. However, the memory accesses are perfectly interleaved, i.e. the order of array accesses are still A[0], A[1], A[2], A[3]. . .
# define ARR_LEN 32
// A is located at address 0xA0000
int A[ARR_LEN];
// Processor 0’s loop
for (int i = 0; i < ARR_LEN; i += 2) { A[i] += i } // Processor 1’s loop for (int j = 1; j < ARR_LEN; j += 2) { A[j] += j } For each memory access below, i. Classify it as a Hit or Miss. Snooping another cache for data is considered a coherency Miss. ii. Since we are working in a multiprocessor system, classify the state of the block that the data accessed resides in from the specified processors perspective. i. A[0] Read A. (0.25 pt) # Hit # Miss B. (0.25 pt) State from proc 0’s point of view. #E Exam generated for 7 ii. A[0] Write A. (0.25 pt) # Miss # Hit B. (0.25 pt) State from proc 0’s point of view. #O Exam generated for 8 iii. A[1] Read A. (0.25 pt) # Miss # Hit B. (0.25 pt) State from proc 1’s point of view. #I Exam generated for 9 iv. A[1] Write A. (0.25 pt) # Hit # Miss B. (0.25 pt) State from proc 0’s point of view. #M Exam generated for 10 v. A[2] Read A. (0.25 pt) # Hit # Miss B. (0.25 pt) State from proc 1’s point of view. #M Exam generated for 11 vi. A[2] Write A. (0.25 pt) # Hit # Miss B. (0.25 pt) State from proc 1’s point of view. #I Exam generated for 12 vii. A[3] Read A. (0.25 pt) # Hit # Miss B. (0.25 pt) State from proc 0’s point of view. #E Exam generated for 13 viii. A[3] Write A. (0.25 pt) # Miss # Hit B. (0.25 pt) State from proc 1’s point of view. #I Exam generated for 14 (d) (2.0 pt) What is the overall hit rate? Leave your answer as a fully simplified fraction. (e) (2.0 pt) What fraction of misses are coherency misses? Leave your answer as a fully simplified fraction. (f) (1.0 pt) In total, how many times did we need to go to main memory to write-back? (g) (2.0 pt) We want to avoid all the coherency misses, so we look to see if we can rewrite our code to optimize for cache performance. Which of the following methods will lead to a higher HR than that from the interleaved accesses? 2 Letting processor 1 start and finish, then processor 0 starts and finishes 2 Letting processor 0 start and finish, then processor 1 starts and finishes 2 None of the other options Exam generated for 15 3. TLP In signal processing, the technique of cross-correlation (or sliding dot-product) is often used to determine the delay of a signal. In this problem, we will implement a cross-correlation function in C, parallelized of course! (You don’t need to know anything about EE to ace this problem!) :0 The following function, sliding_dot, takes two arrays. original contains an array of length n and other contains n + k elements. We will shift other k times, and for each shift, compute its dot product with original. We then store these values in result. We want to parallelize sliding_dot with OpenMP. Examine our attempts below and choose the behavior(s) you expect from each version. Assume the processor has four threads, 32B cache blocks, and sizeof(int) = 4.You may also assume that all calls to calloc() succeed. Here is the template of the code where we will replace the /* OPTIMIZED CODE */ with the code on each question. int * sliding_dot(int * other, int * original, int n, int k) { int * result = (int *) calloc(k * sizeof(int)) // shift the array for (int shift = 0; shift <= k; shift++) { /* OPTIMIZED CODE */ return result; (a) (2.0 pt) int * shifted = other + shift; int dot_product = 0; #pragma omp parallel for private(dot_product) for (int i = 0; i < n; i++) { #pragma omp critical dot_product += shifted[i] * original[i]; result[shift] = dot_product; How will this code behave? 2 Sometimes Incorrect 2 Always Correct, slower than serial 2 Always Correct, faster than serial Exam generated for 16 (b) (2.0 pt) int * shifted = other + shift; int dot_product = 0; #pragma omp parallel for reduction(+:dot_product) for (int i = 0; i < n; i++) { #pragma omp critical dot_product += shifted[i] * original[i]; result[shift] = dot_product; How will this code behave? 2 Always Correct, faster than serial 2 Sometimes Incorrect 2 Always Correct, slower than serial (c) (2.0 pt) int * shifted = other + shift; int dps[4] = {0,0,0,0}; #pragma omp parallel #pragma omp for for (int i = 0; i < n; i++) { int id = omp_get_thread_num(); dps[id] += shifted[i] * original[i]; result[shift] = dps[0] + dps[1] + dps[2] + dps[3]; How will this code behave? 2 Always Correct, slower than serial 2 Sometimes Incorrect 2 Always Correct, faster than serial (d) (2.0 pt) int * shifted = other + shift; int dot_product = 0 #pragma omp parallel for for (int i = 0; i < n; i++) { dot_product += shifted[i] * original[i]; result[shift] = dot_product; How will this code behave? 2 Always Correct, faster than serial 2 Sometimes Incorrect 2 Always Correct, slower than serial Exam generated for 17 4. RISC-V Instruction Format You are working on a new chip for an embedded application, and want to create a new ISA. Fed up with the different RISC-V instruction types, you decide to include only one, universal type - the X-type instruction. Say we wish to include the following instructions: 0. add rd1, rs1, rs2 1. and rd1, rs1, rs2 2. lw rd1, offset1 (rs1) 3. sw rs2, offset1 (rs1) 4. addi rd1, rs1, imm1 5. beq rs1, rs2, offset1 6. lui rd1, offset1 7. jal rd1, imm 8. stw rs3, offset1, offset2 (rs1) The new stw instruction stores the contents of rs3 into both rs1 + offset1 and rs1 + offset2. The RTL is: Mem(R[rs1] + offset1) R[rs3] AND Mem(R[rs1] + offset2) R[rs3] (a) (2.0 pt) You want to do away with the funct3 and funct7 fields and only use an opcode. If we only wish to support the instructions listed above, what is the minimum number of bits the opcode field can be? (b) (3.0 pt) We want to be able to jump up to 64 KiB in either direction with a single instruction. How many bits are necessary to encode an immediate that would allow us to do this? Assume that, just like RV32, the least significant bit is an implicit 0 and is not stored in the instruction. (c) (2.0 pt) Regardless of your previous answers, you finally decide on the instruction format below. You’ve added some fields to account for new instructions you might want to include later on. The opcode for each instruction is the same as the list index given at the beginning of this problem (e.g. sw has opcode 3). imm2 imm1 rs3 rs2 rs1 rd2 rd1 opcode This instruction format is quite long, so we decide to work on a 64-bit machine. Each immediate field is 11 bits, and the opcode is 7 bits. What is the maximum number of registers we can have? Exam generated for 18 (d) (3.0 pt) Realizing supplies have run low due to COVID-19, you switch to a 32-bit machine, and finalize your instruction format to have 4 bits for each of the immediate fields, 4 bits for each register, and 4 bits for the opcode. Convert the instruction stw x8, 0, 4 (x5) into machine code. Leave your answer in binary (don’t forget the prefix!). If a field is not used, fill in the field with ‘x’s. Exam generated for 19 5. DLP In many applications, we wish to not only find the maximum element of an array, but the index of the maximum element, or the argmax. To do this quickly, we decide to utilize Data Level Parallelism. The following function, argmax, takes in an array, arr, and its length, n, and returns the index of the maximum value. If there exist multiple indices which contain the same maximum value, the function returns the first of these indices. Use the provided “pseudo” SIMD intrinsics to fill in the function so it behaves as expected. The SIMD intrinsics operate on vec structs which represent SIMD vectors that contain 4 packed integers (exactly like Intel’s __m128i structs). You may not need all lines. SIMD Instructions: vec sum_epi32 (vec a, vec b) // returns a + b vec and_epi32 (vec a, vec b) // returns a & b vec set_epi32 (int a) // return SIMD vector with all entries set to a vec load_epi32 (int *a) // return SIMD vector with entries a[0], a[1], a[2], and a[3] respectively int reducemax_epi32 (vec a) // return the value of the maximum int in vector a vec maskeq_epi32 (vec a, int b) // return mask vector with 0xFFFFFFFF for indices where a is equal to b and 0 otherwise int firstv_epi32 (vec a) // return index of first entry with lowest bit set to 1 int argmax(int *arr, int n) { int curr, index = 0, running_max = -2147483648; // -2^31 /* Your Code Here */ return index; Exam generated for 20 (a) (10.0 pt) /* Your Code Here */ Exam generated for 21 6. Single Cycle Datapath (a) Which of the following components are not utilized by the given instruction? As in, the output(s) of the component are not useful to the overall execution of the instruction. Select all that apply. i. (2.0 pt) lui s2, 0xC561C 2 Immediate generator 2 Branch comparator 2 Register File 2 All components are utilized by this instruction ii. (2.0 pt) jal ra, t0, label 2 All components are utilized by this instruction 2 ALU 2 Control Logic Unit 2 PC register Exam generated for 22 (b) You’ve been running multiple recursive programs on your RISC-V CPU lately, and noticed that one of the main causes of slowdown is that you always have to save ra onto the stack before you do the recursive call. You decide to modify your current single-cycle RISC-V datapath to implement an instruction that can save to the stack and jump at the same time. Jump-and-save R[ra] = PC + 4, R[sp] = sp - 4, PC = PC + offset, Mem[sp - 4] = PC + 4 i. (2.0 pt) Which of the following will correctly implement sp_wb, the value that will get written back to sp? spMinus4 is a pre-computed value equal to sp - 4. 2C 2D 2A 2E 2B sp_wb Choices Exam generated for 23 RegFile Choices Exam generated for 24 ii. (2.0 pt) Now that we have sp_wb, which of the following will correctly write it back to the RegFile? SPWEn is a signal analogous to RegWEn: it is 1 when we wish to write to sp and 0 otherwise. Furthermore, if SPEn is false, SP will not be updated, even if RegWEn is true. 2C 2A 2B 2D Exam generated for 25 iii. (2.0 pt) Which combination of the following circuits will correctly implement the “save to the stack” operation? 2E 2H 2D 2C 2G 2B 2A 2F Memory Choices Exam generated for 26 7. Virtual Memory (a) We are working with a system with a 4 GiB physical memory, and 16 MiB virtual memory, and a page size of 4 KiB. For each PTE, we choose to store 12 bits of metadata (dirty bit, permissions). i. For this part, assume we are working with a single level page table. A. (0.5 pt) How many bits are in the page offset? B. (0.5 pt) How many bits are in the PPN? C. (0.5 pt) How many bits are in the VPN? D. (0.5 pt) How many bits are in a PTE? Exam generated for 27 (b) For the rest of the problem, we will be working with a 2-level, hierarchical page table with no TLBs. Assume the VPN bits are split evenly between levels, so every PT at every level has the same number of PTEs. i. For each page table level, calculate the number of PTEs in total, across all possible page tables in that level. A. (0.5 pt) L1 Number of PTEs B. (0.5 pt) L2 Number of PTEs Exam generated for 28 ii. (2.0 pt) Let’s say the computer just started up, meaning that the page table has yet to allocate any pages in the physical memory. We then store 8 contiguous bytes to memory. In the worst case, how many page tables will we use? iii. Consider the following hierarchical page table. Regardless of your previous answers, assume that there are 64 PTEs in each page table. Arrows from one level to another represent a valid PTE for the page tables, or page for physical memory. The indices of the PTEs/pages are ordered from the top-down, i.e. the top-most refers to index 0. Only consider slots with a letter inside of them. Multi-Level Page Table Given the following PTEs accessed at each level, reconstruct the virtual and physical addresses in hex. If the data provided creates an invalid address, your answers should be N/A. For all memory accesses, we are attempting to access the 0th byte of the page. Exam generated for 29 A. (2.0 pt) Virtual Address B. (2.0 pt) Physical Address Exam generated for 30 iv. Given the following virtual addresses, first identify whether it is a page hit or page fault. If it is a hit, write out the sequence of “letters” that make up the path. If it page faults, then you must play the role of the OS, create the appropriate mapping given the available PTEs/physical pages, and leave your answer as the new path that will now be taken. Format your answer without spaces between the letters e.g. ABC. A. VA = 0xF83000 B. (0.5 pt) # Page Fault # Page Hit C. (1.5 pt) Path Exam generated for 31 D. VA = 0x0C3000 E. (0.5 pt) # Page Hit # Page Fault F. (1.5 pt) Path Exam generated for 32 8. ECC, RAID (a) (0.5 pt) If we want to tolerate 1 disk failure, which version(s) of RAID should we us 程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts