CS代考 ELEC3441: Computer Architecture Second Semester, 2021–22

ELEC3441: Computer Architecture Second Semester, 2021–22
• Part (A) — 4 Mar, 2022, 11:59pm • Part (B) — 4 Mar, 2022, 11:59pm
Instruction: Submit your answers electronically through Moodle.
Homework 1 (r1.0)

Copyright By PowCoder代写 加微信 powcoder

There are 2 parts in this homework. Part A includes questions that aim to help you with understanding the lecture materials. They resemble the kind of questions you will encounter in quizzes and the final exam. Part A is an individual portion of this homework. You should be completing and turning in your own work.
Part B of this homework are hands-on exercises that require you to design and evaluate processor systems using various software and hardware tools, including the RISCV-V compilation tool chains. They are designed to help you understand real-world processor design and the use of various tools to help you along the way. Part B is a group work. Each group should turn in only 1 set of project file and 1 project report.
The following summarizes the 2 parts.
A Basic problem set
B Hands-on
Individual
Individual or Group of 2 to 3
In all cases, you are encouraged to discuss the homework problems offline or online using Piazza. However, you should not ask for or give out solution directly as that defeat the idea of having homework exercise. Giving out answers or copying answers directly will likely constitute an act of plagiarism.

ELEC3441: Computer Architecture Homework 1, Part A
Part A: Problem Set
A.1 Iron Law
Consider the 2 programs P and Q running in processor A and B with the following number of instructions:
Cycles per instruction Cycles per instruction
# instructions in P # instructions in Q
ALU Load/Store
10000 2000
70000 3000
Branch/Jumps
A.1.1 What is the average CPI of program P, Q when executed on processor A and B?
A.1.2 Processor A runs at 2GHz while processor B runs at 3GHz. Consider only program P, which
processor results in higher performance, and by how much?
A.1.3 Now consider only program Q, which processor results in higher performance, and by how much?
A.1.4 You are allowed to improve the speed of only 1 of the three instruction types (ALU, Load/Store, Branch/Jump) of the slower processor. Assuming the clock frequency remains unchanged, what is the maximum speedup possible when using the total runtime for both program P and Q as metric?
A.2 Benchmark Performance
In class, we mentioned it is necessary to use geometric means when considering normalized performance of multiple benchmark programs. In this question, you will explore further how various ways of performance evaluation can be performed. For those of you who are interested, a lot of the ideas in this question can be traced back to this paper http://www.eee.hku.hk/~elec3441/sp21/handout/Fleming_Wallace_ 86.pdf.
Now, you are given 3 processors A, B, C, and 6 benchmark programs P 1 to P 6. The following summarizes the run time of these programs in the 3 processors. In all cases, both the raw CPU times (in seconds) and the normalized time relative to processor A are shown. HINT: use a spreadsheet program to help with answering this question.
r1.0 Page 2 of 15

ELEC3441: Computer Architecture
Homework 1, Part A
(i) Arithmetic mean of raw CPU time of the benchmark programs; (ii) Geometric mean of raw CPU time of the benchmark programs;
(iii) Arithmetic mean of normalized CPU time of the benchmark programs; (iv) Geometric mean of normalized CPU time of the benchmark programs;
(v) Total raw CPU time;
(vi) Total normalized CPU time;
P1 P2 P3 P4 P5
compute the following:
300 1 1200 1 600 1 300 1 60 000 1
time time 600 2.00
300 0.25 1200 2.00 1200 4.00
time time 1200 4.00
600 0.50 300 0.50 600 2.00
120 000 2.00
A.2.1 Consider the first 3 benchmark programs (P1 to P3) only in this part. For
Hint: You may want to use a spredsheet program to help you in this and the following parts of this question.
A.2.2 Follow up from previous part, examine the benchmark results closely and you will notice that processor A, B, and C, is fastest in exactly 1 of the 3 benchmark. As a result, a reasonable conclusion can be that all three processors are equally good.
Now, consider each of 6 results you obtained in A.2.1, what is the conclusion that each metric is sug- gesting? For example, consider the total run time of the 3 programs, which processor is fastest? Do they correspond to your expectation? Which metric would you choose to use for a fair comparison of the performance of the processors?
A.2.3 Now, repeat the previous 2 subquestions but with benchmark programs P1 to P4. You would notice that Processor A is fastest in program P4. As a result, considering all 4 programs, a reasonable conclusion is that A is the fastest processor. How about processor B and C?
Consider the 6 different metrics, how have they changed when compared to your answers above regarding when P1 to P3 were considered. Which metrics have produced consistent results? Which metrics have changed?
A.2.4 Now, consider all benchmark programs together (P1 to P5). How have the metric results changed? Which metric(s) results in conclusions consistent with your expectation? Which metric(s) contradicts your expectation? Explain your answers.
A.3 RISCV Assembly Programming
In this question, you will practice writing assembly programs using the RISCV assembly code. In all cases, assume the C variables are stored in the following registers:
r1.0 Page 3 of 15
30 000 0.50
each processor,

ELEC3441: Computer Architecture
Homework 1, Part A
C Declaration
unsigned x
unsigned y
unsigned z
unsigned a[16]
Variable Register
A.3.1 Arithmetic and logic operations Implement the following C code segment as RISCV assembly code.
• z = x – (y – z)
• y = x & 0x000000F0 | 0x80000000
• a[5] = a[4] + a[3]
• a[x & 0xF] = a[y & 0xF] + 1
• x = y % 8 Hint: you don’t need to use modulo operator. • z = 2*y + 1 Hint: you don’t need to use multiplication.
A.3.2 Assembly to C Convert the following assembly code segment into C code:
• addi t0, a0, 18
add a2, t0, a1
• lw t0, 4(a0) addi a1, t0, 4
• andi t0, a0, 0xFF ori t1, a2, 0x55
add a1, t0, t1
• addi t0, a3, 16 lw t0, -12(t0)
sw t0, 0(a3)
A.3.3 Branches and Jumps Implement the following C code segment as RISCV assembly code.
x a0 y a1 z a2 a a3
if (x > y) {
z = x – y;
z = y – x;
Page 4 of 15

ELEC3441: Computer Architecture
Homework 1, Part A
• if (x < y && x > z) { x = x + a[x];
x = x + y + z;
• for (x = 0; x < 10; x++) { a[x] = x; while (x <= 10) { Page 5 of 15 ELEC3441: Computer Architecture Homework 1, Part A A.4 Bit Rotation A bit rotation operation is similar to a shift operation, except that the bits that get shifted away are rotated back to the other end of the word. The following C code shows a rotate left by 1 in action: unsigned slr1(unsigned x) { unsigned r = (x >> 31) | (x << 1); A.4.1 BasedonthesampleCcode,implementthe“leftrotateby1”functionasRISC-Vassemblycode. Assume x is stored in register a0 and also store r in a0 when returning from the function. A.4.2 The rotation function will be much more useful if variable rotation amount can be provided. A naive way to achieve that is simply to iterate the rotate-by-1 function n times as in the following C code: unsigned slr(unsigned x, int n) { unsigned r = x; for (i = 0; i < n; i++) { r = slr1(r) } return r; } Implement this naive version of n-bit left rotation in RISC-V assembly code. Assume x is stored in a0, n is in a1 and store return value r in a0. For function calling, you should follow the following subset of RISC-V calling convention: • Pass argument 0 and 1 by storing them in register a0 and a1 respectively. • Return value in register a0. Refer to the RISC-V documents for the full calling convention. You can assume a label slr1 has been defined for the function slr1(). A.4.3 It is possible to implement the above variable rotate without using any branch or jump instruc- tions. Implement an n-bit rotation function that supports BOTH variable left and right rotate without any iterations. Your code should be less than 10 lines. // dir: 0 = rotate left, 1 = rotate right unsigned rotate(unsigned x, int n, int dir){} r1.0 Page 6 of 15 ELEC3441: Computer Architecture Homework 1, Part B Part B: Hands-on Exercise B.1 RISC-V Processor in Logisim In this question, your task is to build a complete functional single cycle RISC-V processor similar to that shown in class. Most of the datapath for the single cycle processor design is already provided to you, and we have implemented R-type instructions as an example. You will have to add support for the other types of instructions. Your main task will be to complete the instruction decoder, but some additional data connections will also be needed. For this homework, you may choose to use either the original Logisim or the newer Logisim-Evolution software. The original Logisim software was created by , but its development has stopped since 2011. You can still obtain the original files from its homepwage http://www.cburch.com/logisim/ Logisim is available for Windows, Linux and macOS, and requires only the Java JRE to run. You should not encounter any trouble using it at home. Alternatively, you may choose to use/install a newer version of Logisim called Logisim-Evolution. Logisim- Evolution is a fork of the original Logisim software from . Just like the original Logisim software, it is an educational tools for designing and simulating digital circuit. It is free and you can download the latest version from its homepage https://github.com/reds-heig/logisim-evolution. On its homepage you can also find online documentation about the tools. If you are new to Logisim (or Logisim-Evolution), the online Beginner’s tutorial from the above link can be a good starting point. You may also consult the lab 1 handout from ENGG1203 that covers some basic operations: http://www.eee.hku.hk/~engg1203/sp18/labs/lab1.pdf You may download and run Logisim from your own computer. Alternatively, you may also use the installed logisim on tux-1.eee.hku.hk. To launch logisim on tux-1 using x2go, open a terminal within the remote desktop, ensure that the environment is set up as described in Lab 1, and type: tux-1$ java -jar logisim-evolution.jar & For this hands-on exercise, we will provide you with precompiled memory images to load into the in- struction and data memory of the processor that you will be building. Translating from assembly to memory image is done with the RISC-V compilation tool-chain. Any tools you need have been installed on a departmental server called tux-1. They are all freely available open-source software, so you may also choose to install them on your own machine. If you wish to generate your own test code for the processor, you will need a Unix system with the RISC- V cross-compilation GCC toolchain configured to compile to the RV32I subset only. For this optional part, we recommend that you use the toolchain already installed on tux-1.eee.hku.hk. r1.0 Page 7 of 15 ELEC3441: Computer Architecture Homework 1, Part B B.1.1 Getting the Files You will need the files located inside the logisim subdirectory in your down- loaded file elec3441hw1.tar.gz in {HW1ROOT}/logisim. It contains the following files: – rv32-addi.circ ∗ addsub.s ∗ addsub.mem ∗ ... ∗ s2mem.sh ← Logisim source ← Test programs ← Assembly file ← Memory content file for use with Logisim ← Converts .s to .mem (requires riscv-gcc) B.1.2 Subset of RISC-V ISA To keep your design simple, you only need to implement a subset of the RISC-V ISA as shown below. Of course, if you are feeling adventurous, you are more than welcome to implement and try out additional instructions. 25 24 0000000 rs2 0100000 rs2 0000000 imm[11:0] 15 14 12 11 rs1 000 rs1 000 rs1 000 rs1 001 rs1 010 rs1 010 rs1 000 rs1 001 rs1 100 rs1 101 imm[11:5] imm[12][10:5] imm[12][10:5] imm[12][10:5] imm[12][10:5] imm[31:12] shamt rs2 rs2 rs2 rs2 rs2 rd imm[4:0] imm[4:1][11] imm[4:1][11] imm[4:1][11] imm[4:1][11] rd add rd,rs1,rs2 sub rd,rs1,rs2 addi rd,rs1,imm lui rd, imm slli rd,rs1,shamt lw rd,imm(rs1) sw rs2, imm(rs1) beq rs1,rs2,imm bne rs1,rs2,imm blt rs1,rs2,imm bge rs1,rs2,imm jal rd, imm jalr rd,imm(rs1) imm[20][10:1][11][19:12] imm[11:0] rs1 000 Figure B.1: Subset of RISC-V instruction to be implemented in Logisim B.1.3 Basic Data Path – Add You will begin by the examining the design of a minimal processor that supports only the add, sub, and, or instructions as shown in Figure B.2. It contains the main components we have seen in class: an instruction fetch unit, an instruction memory, an instruction decoder, a register file, an ALU, and a data memory. Open the file rv32-addi.circ in Logisim. Figure B.2: Data Path r1.0 Page 8 of 15 ELEC3441: Computer Architecture Homework 1, Part B Instruction Fetch Unit The heart of the instruction fetch is the program counter (PC) and an adder to compute the next instruction address (PC+4). In Logisim, PC can be implemented using a register. You can find the library component register from the memory library that you can see on the left hand side of Logisim windows. As we are implementing a 32-bit architecture, the bit width of the register is set to 32. You can click on the component with the edit tool to view the property “Data Bits” in the lower half of the sidebar on the left. To compute the next PC, the built-in component adder from Logisim is used to add the constant 4 to the current PC value. Note that there is not yet a way to choose a value other than PC+4: you will have to modify this path when implementing jump and branch instructions in part C. Instruction Memory To implement the instruction memory of a processor, we use the Logisim built-in ROM unit. You can find a ROM unit under the built-in memory library on the left. While we have a 32-bit address space, we do not want to implement 4GB of memory in this small simulated system. Instead, the ROM component is set to 64 kB of memory. As a result, the instruction memory will have a 16 bit address bus. However, in the RISC-V processor (as well as most other processors), memory is byte addressable. That is, each memory address corresponds to 1 byte. However, in the case of Logisim, as well as in most real processors, the memory returns one 32-bit word at a time. Therefore, the address signal passed to the instruction ROM should omit the lowest 2 bits, which are always 00 in aligned word addresses anyway. If an individual byte is needed (such as in the case of load byte (lb)), the lowest 2 bits will then be used to select the correct byte from the output word. In this assignment we only deal with 32-bit words, so we don’t need this capacity for now. That is why you can see: Address Bit Width 14 Data Bit Width 32 in the parameters for the ROM unit. A splitter is used to select bits 2-15 to address the memory component. Instruction Decoder This is the component that decodes the fetched instruction from the instruction memory and generates the corresponding control signals. For the sample add instruction, it extracts the following: • Register number of rs1, rs2 and rd. • The corresponding alu fun for the ALU From the instruction encoding of add in Figure B.1, or equivalently the RISC-V ISA from the course website, you see that rs1, rs2 and rd are located at inst<19:15>, inst<24:20>, and inst<11:7> respectively. You can also determine that it is an add instruction from its opcode at inst<6:0>, and in the case of ALU-OP, also funct3 and funct7.
All the decoding logic should be implemented within the inst decoder subcircuit. In Logisim, you can edit a subcircuit by right-clicking the component and selecting “View component name”, or by double- clicking the component name in the left-hand sidebar. Take a look at the provided instruction decoder (inst decoder).
For each ALU operation add, sub, and, or, we create an individual 1-bit signal is high when we encounter an instruction that uses this function. Observe how we use comparators to identify the
r1.0 Page 9 of 15

ELEC3441: Computer Architecture Homework 1, Part B
instruction. The signals are fed into a component called priority encoder (lower left) which will output the number of the first port with a signal set to 1. For example, if the and signal is 1 and the others are 0, it will output “3”, because and is connected to port “Input 3”. If you look inside the ALU component and compare the mux that selects alu out, you will see that the inputs to the MUX are connected in the same order: the and32 component is also connected to port 3. So the priority encoder generates the right signal to select the same input port on the mux as is high on its own input ports.
Register File
The register file of a RISC-V processor can essentially be modeled as a RAM with 32 locations. The only difference is that it also has to support 2 read and 1 write operations at the same cycle, and to hardwire the value 0 to register x0. A register file has been created for you. Take a good look at the rf32 subcircuit to understand how it works.
The register file has debug ports dx0-dx31 that allow you to peek at the contents of the registers so you can see if your circuit is functioning correctly. Above the register file, we have connected probes that display the hexadecimal value of the registers x0-x8. This will be your main interface for debugging when you implement the missing instructions.
The final and arguably most important component in the processor to implement the R-type instructions is the ALU. In the provided design, a simple ALU has been created for you. Its job is to provide the required function (addition, subtraction, etc.) as specified by the alu fun signal.
Data Memory
Since the lw and sw instructions are not yet implemented, the data memory is not in use yet. We have however already added it to the circuit so that it is ready for use in the next steps.
The data memory is the same size and layout as the instruction memory, however this time we use a RAM component as it also has write capability. This means an additional port for write data and a store enable signal that tells the component to write the data to the address given on the address port at the next rising clock edge.
With all the above components correctly connected, you have a complete data path for the R-type instructions.
B.1.4 Testing Your Processor To test that the processor works correctly, you should manually load the program into the instruction memory. To do so, right click the ROM component corresponding to the instruction memory, and select Load image. Here is where you can load your test program directly into Logisim using its own memory image format.
Load the memory image file addsub.mem in the tests directory. It contains the following instructions:
add x5, x1, x2
sub x6, x3, x4
and x7, x5, x6
or x8, x5, x6
# indicates end of program
Page 10 of 15

ELEC3441: Computer Architecture Homework 1, Part B
Once it is loaded, you can try executing the processor by toggling the clock signal using the poke tool. When you execute the ebreak instruction, the output halt (located above the decoder) will turn high, so you can recognize that you have reached the end of the program.
Of course, with only support for register-register add/sub/and/or, it is not going to be very interesting as all the values are zero. Before you have support for immediate instructions in next step, you can manually change the values of x1, x2, x3, and x4 by going inside the rf32 component.

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com