• You may work in pairs for this assignment. If you choose to work with a partner, make sure only one of you submits a solution, and you paste a copy of the Partners Template that contains the names and PIDs of both students at the beginning of the file.

• Always justify your answer; describe your reasoning clearly. No points will be given if a GTA cannot understand your description. This will help you practice for the real world; your team or manager will judge you poorly if they cannot understand your written communication or documentation.

• What to submit: Write your answers in this Word document, create a PDF from this Word document, and submit the PDF. No late submissions will be accepted.

Problems 4, 5, 6, and 7 use the CPUs below. All three CPUs have a cycle time of 200ps; every CPU also writes to the register file during the first half of a cycle and reads from it during the second half of the cycle.

[Figures: CPU A, CPU B, CPU C]

• [15 pts] Benchmarking

A benchmark suite has two benchmarks. In Benchmark 1, 20% of instructions are “j”, 20% of instructions are “beq”, 50% of instructions are R-type instructions, 5% of instructions are “sw”, and 5% of instructions are “lw”. In Benchmark 2, 5% of instructions are “j”, 5% of instructions are “beq”, 20% of instructions are R-type instructions, 10% of instructions are “sw”, and 60% of instructions are “lw”.

• [5 pts] Consider a single-cycle CPU in which ALL five processing steps take 200ps each (i.e., even Register Fetch and Writeback take 200ps each, instead of 100ps). What are the CPU’s IPC (instructions per cycle) and IPS (instructions per second) when running Benchmark 1?

• [5 pts] Consider a basic multi-cycle CPU in which ALL five processing steps take 200ps each (i.e., even Decode/Register Fetch and Writeback take 200ps each, instead of 100ps). The CPU has a cycle time of 400ps. What are the CPU’s IPC and IPS when running Benchmark 1?

• [5 pts] What is the average speedup of the basic multi-cycle CPU above over the single-cycle CPU across the whole benchmark suite?
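
To illustrate how the quantities in this problem relate to one another, here is a minimal sketch that derives average CPI, IPC, and IPS from an instruction mix. The function name `ipc_ips`, the two-class mix, and the per-class cycle counts are made-up assumptions for illustration; they are not the values from Benchmark 1 or Benchmark 2.

```python
def ipc_ips(mix, cycles_per_class, cycle_time_s):
    """Return (IPC, IPS) for an instruction mix.

    mix: fraction of executed instructions in each class (should sum to 1).
    cycles_per_class: cycles one instruction of each class takes on the CPU.
    cycle_time_s: clock cycle time in seconds.
    """
    # Average cycles per instruction is the mix-weighted sum of cycle counts.
    cpi = sum(frac * cycles_per_class[name] for name, frac in mix.items())
    ipc = 1.0 / cpi
    ips = ipc / cycle_time_s
    return ipc, ips


# Hypothetical numbers (assumptions, not the assignment's data).
mix = {"alu": 0.5, "load": 0.5}
cycles = {"alu": 4, "load": 5}
ipc, ips = ipc_ips(mix, cycles, cycle_time_s=1e-9)
print(f"IPC = {ipc:.4f}, IPS = {ips:.3e}")
```

For a single-cycle CPU every class takes one cycle, so the same function applies with all cycle counts set to 1.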

• [10 pts] Amdahl’s Law

An old CPU spends 60% of its clock cycles on handling pipeline hazards and 40% of its cycles on actually executing useful instructions.

• [5 pts] A new CPU Design 1 can execute useful instructions four times as fast as before (e.g., by having more hardware to concurrently process more useful instructions). What is the speedup of CPU Design 1 over the old CPU?

• [5 pts] An alternative new CPU Design 2 seeks to improve performance by handling pipeline hazards faster instead. The performance target for CPU Design 2 is a 1.5x speedup over the old CPU. How many times as fast should CPU Design 2 be at handling hazards compared to the old CPU?
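
Both parts of this problem rest on Amdahl's Law: if a fraction f of execution time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below shows the formula and its inverse (solving for s given a target). The function names and the sample values (f = 0.5, s = 2, target = 1.25) are illustrative assumptions, not the numbers from this problem.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time runs `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def required_factor(fraction, target):
    """Factor by which `fraction` must speed up to reach `target` overall."""
    # Time budget left for the improved part, as a share of the old time.
    remaining = 1.0 / target - (1.0 - fraction)
    if remaining <= 0:
        raise ValueError("target is unreachable by improving this part alone")
    return fraction / remaining


# Hypothetical values (assumptions for illustration only).
overall = amdahl_speedup(fraction=0.5, factor=2.0)
needed = required_factor(fraction=0.5, target=1.25)
print(f"overall speedup = {overall:.4f}, required factor = {needed:.4f}")
```

The `ValueError` branch reflects the upper bound in Amdahl's Law: the unimproved part alone caps the achievable speedup at 1 / (1 - f).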

• [10 pts] Pipeline Stages

A micro-controller company has a single-cycle CPU model with a cycle time of 10ns. The company decides to turn the single-cycle model into a faster 4-stage pipelined CPU model. Due to some practical design constraints, the engineers were unable to make the new CPU perfectly pipelined. Instead, three out of the four stages have identical latency; the latency of the remaining stage is 10% longer than the latency of the three other stages.

• [5 pts] What is the cycle time of the new pipelined CPU?

• [5 pts] For typical programs with billions of instructions, what is the speedup of the new pipelined CPU over the single-cycle CPU? Simply assume the programs targeting the CPU have perfect instruction-level parallelism.
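
As a sketch of the reasoning behind this problem: a pipeline's cycle time is set by its slowest stage, and for N instructions an S-stage pipeline needs S + N - 1 cycles when there are no stalls. The code below assumes zero pipeline-register overhead and uses made-up stage latencies (2, 2, and 3 ns for a 3-stage pipeline split from a 7 ns single-cycle design); the function names are hypothetical.

```python
def pipelined_cycle_time(stage_latencies):
    """The clock must accommodate the slowest stage."""
    return max(stage_latencies)


def pipeline_speedup(single_cycle_time, stage_latencies, n_instructions):
    """Speedup of the pipelined CPU over the single-cycle CPU, no stalls."""
    stages = len(stage_latencies)
    cycle = pipelined_cycle_time(stage_latencies)
    single_time = n_instructions * single_cycle_time
    # The first instruction takes `stages` cycles; each later one adds a cycle.
    pipelined_time = (stages + n_instructions - 1) * cycle
    return single_time / pipelined_time


# Hypothetical latencies in nanoseconds (assumptions for illustration).
latencies = [2.0, 2.0, 3.0]
for n in (1, 10, 10**9):
    print(n, round(pipeline_speedup(7.0, latencies, n), 4))
```

For very large N the fill time becomes negligible and the speedup approaches the ratio of the two cycle times.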

• [15 pts] Compiler Support to Correctly Handle Data Hazards

The following assembly code is generated by compiling a C program to run on a single-cycle CPU:

or $7, $8, $9 #instruction 1
or $10, $7, $4 #instruction 2
lw $12, 7($10) #instruction 3
lw $13, ($12) #instruction 4
add $13, $1, $1 #instruction 5
add $14, $15, $16 #instruction 6
add $17, $18, $19 #instruction 7

• [5 pts] How should the same C program be recompiled to run on CPU A on the front page? Assume a basic compiler that handles data hazards via NOPs, instead of reordering instructions. Give the new assembly code below.

• [5 pts] How should the same C program be recompiled to run on CPU B on the front page? Assume a basic compiler that handles data hazards via NOPs, instead of reordering instructions. Give the new assembly code below.

• [5 pts] How should the same C program be recompiled to run on CPU C on the front page? Assume a basic compiler that handles data hazards via NOPs, instead of reordering instructions. Give the new assembly code below.
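
The NOP-insertion strategy described in these questions can be sketched as a small pass over the instruction list. This is only an illustration under stated assumptions: `gap` (the number of instruction slots a given CPU needs between a producer and a dependent consumer) is a hypothetical parameter whose correct value for CPU A, B, or C depends on the datapath and on which instruction pairs can be forwarded; the sample program is made up and is not the one from this problem.

```python
def insert_nops(program, gap):
    """Insert NOPs so each RAW-dependent pair is at least `gap` slots apart.

    program: list of (text, dest, sources); dest is None if the instruction
             writes no register.
    gap: assumed number of slots required between producer and consumer.
    """
    out = []  # emitted slots as (text, dest)
    for text, dest, sources in program:
        needed = 0
        # Only the most recent `gap` slots can be too close to this consumer.
        for distance in range(1, min(gap, len(out)) + 1):
            producer_dest = out[-distance][1]
            if producer_dest is not None and producer_dest in sources:
                # `distance - 1` slots already separate the pair.
                needed = max(needed, gap - (distance - 1))
        out.extend([("nop", None)] * needed)
        out.append((text, dest))
    return [text for text, _ in out]


# Hypothetical program (not the assignment's code).
program = [
    ("add $1, $2, $3", "$1", ["$2", "$3"]),
    ("sub $4, $1, $5", "$4", ["$1", "$5"]),
    ("and $6, $7, $8", "$6", ["$7", "$8"]),
    ("or $9, $4, $6", "$9", ["$4", "$6"]),
]
scheduled = insert_nops(program, gap=2)
print("\n".join(scheduled))
```

A uniform `gap` is a simplification: a CPU with forwarding typically needs different spacing after a load than after an ALU instruction, so a fuller model would make the gap depend on the producer's type.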

• [15 pts] Performance Optimizations

When working on this problem, use the same C program as Problem 4 and use the CPUs on the front page.

• What is the speedup of CPU A over a single-cycle CPU for this C program? The single-cycle CPU has a cycle time of 800ps. Assume the pipeline is initially completely empty (i.e., the program counter has not yet recorded the first instruction). Also assume the C program is compiled without instruction reordering.

• What is the speedup of CPU C over CPU A for this C program? Assume the pipeline is initially completely empty (i.e., the program counter has not yet recorded the first instruction). Also assume the C program is compiled without instruction reordering.

• How should the same C program be recompiled to run on CPU C to achieve maximum performance? Assume an optimized compiler that can reorder instructions. Show the new assembly code below. How much speedup can the optimized compiler achieve over a more basic compiler?
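
When comparing CPUs on a short program, the pipeline fill time matters. A minimal sketch of the cycle count, assuming a 5-stage pipeline that starts empty: every issued slot (a real instruction, a compiler-inserted NOP, or a hardware stall bubble) adds one cycle after the first instruction's five. The function names and the sample counts below are illustrative assumptions, not this problem's answer.

```python
def pipeline_cycles(n_instructions, n_extra_slots=0, stages=5):
    """Cycles to run a program on an initially empty pipeline.

    n_extra_slots counts NOPs and stall bubbles added to the instruction stream.
    """
    return stages + n_instructions + n_extra_slots - 1


def speedup(time_old, time_new):
    """How many times faster the new configuration is than the old one."""
    return time_old / time_new


# Hypothetical example: 6 instructions, 4 NOPs on one CPU and 1 stall on another.
cycle_time_ps = 200
time_with_nops = pipeline_cycles(6, n_extra_slots=4) * cycle_time_ps
time_with_stall = pipeline_cycles(6, n_extra_slots=1) * cycle_time_ps
print(time_with_nops, time_with_stall, speedup(time_with_nops, time_with_stall))
```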

• [15 pts] Hazard Detection

Consider the following assembly code:

lw $1, 2($2)
lw $1, 2($3)
lw $1, 2($4)

This code exhibits a new type of data dependence: Write-after-Write (WAW) dependence. The data dependence examples covered in class are all Read-after-Write (RAW) dependences. To correctly answer this question, you may want to walk through the pipeline cycle by cycle.

• [5 pts] Can the assembly code above, AS IS, run correctly on CPU A on the front page? If so, how many cycles would it take the CPU to run the code above? Assume the pipeline is initially completely empty (i.e., the program counter has not yet recorded the first instruction).

• [5 pts] As covered in class, the combinational logic that generates CPU C’s Hazard Detection Unit’s output implements the following C expression: “(IF/ID.rs == ID/EX.rt || IF/ID.rt == ID/EX.rt) && ID/EX.MemRead”. How many cycles would it take CPU C to run the code above? Assume the pipeline is initially completely empty (i.e., the program counter has not yet recorded the first instruction).

• [5 pts] Modify the Hazard Detection Unit to run the code above faster; show the new C expression that describes the modified Hazard Detection Unit.
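
For experimenting with the Hazard Detection Unit expression quoted above, it can be transcribed directly into a small function and evaluated for a pair of adjacent instructions. The function name, the register numbers, and the sample instructions in the comments are hypothetical; only the Boolean expression itself comes from the problem statement.

```python
def load_use_stall(ifid_rs, ifid_rt, idex_rt, idex_memread):
    """(IF/ID.rs == ID/EX.rt || IF/ID.rt == ID/EX.rt) && ID/EX.MemRead"""
    return (ifid_rs == idex_rt or ifid_rt == idex_rt) and idex_memread


# Hypothetical case 1: "lw $5, 0($6)" is in EX, "add $7, $5, $8" is in ID.
stall_after_load = load_use_stall(ifid_rs=5, ifid_rt=8, idex_rt=5, idex_memread=True)

# Hypothetical case 2: same register match, but the older instruction is not
# a load, so MemRead is 0 and the unit does not stall.
stall_after_alu = load_use_stall(ifid_rs=5, ifid_rt=8, idex_rt=5, idex_memread=False)

print(stall_after_load, stall_after_alu)
```

Note that the expression compares the rt field of the instruction in ID whether or not that instruction actually reads rt as a source operand.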

• [20 pts] Hazard Detection and Data Forwarding

The following code carries out memory copy:

lw $1, 134242($2)
sw $1, 134634($3)

CPU C currently stalls 1 cycle for memory copy due to the load-use hazard between load and store. Answer the following questions to enhance CPU C to not stall for memory copy.

• [5 pts] How should the Hazard Detection Unit be modified? Give the new C expression that describes the modified Hazard Detection Unit.

• [5 pts] CPU C needs a new forwarding mux. What pipeline register fields should connect to the new forwarding mux’s data inputs? What should connect to the new forwarding mux’s output?

• [5 pts] How should Control be modified so that the new forwarding unit controlling the new forwarding mux can get all the inputs it needs from the EX/Mem register and the Mem/WB register to detect that a younger sw depends on an older lw? [Hint: Only one Control output should be modified.]

• [5 pts] One pipeline register must add new field(s); the new forwarding unit that controls the new forwarding mux uses these new pipeline register field(s) as input(s). Which pipeline register needs new field(s)? What new field(s) does the register need?