Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
Computer Architecture and Assembly Language Lab
Fall 2021
Lab 6
CPU Structure, Pipeline Programming and Hazards
Goal
After completing this lab, you will:
• Know the difference between pipelined and non-pipelined processors
• Learn how the pipeline works; know how to prevent hazards in the pipeline
Introduction
A CPU has 5 basic components, the Arithmetic and Logic Unit (ALU), the Instruction
Memory, the Register File, the Data Memory and its Control block. The ALU is
responsible for executing the arithmetic functions, such as addition, subtraction, and the
logical functions, e.g. AND, OR, logical shift, as well as computing branch, jump and
memory addresses. The instruction memory is the memory where an assembly program
is stored before execution. The Program Counter contains the address of the instruction
being fetched. The data memory is the memory where all the data produced or not are
stored. Finally, the control block is responsible for decoding an instruction and setting all
necessary control signals with the proper values. Figure 1 shows a diagram of a simple
CPU. Computers execute instructions in a serial manner. A way to speed up a
program’s execution is by using a Pipeline. Pipelining is a method to increase the
throughput of the executing instruction without actually affecting the time a single
instruction needs to execute. Finally, control is the most challenging aspect of processor
design: it is both the hardest part to get right and the hardest part to make fast. One of
the difficult parts of control design is implementing exceptions and interrupts. An
exception is an unscheduled event that disrupts program execution. An interrupt is an
exception that comes from outside of the processor, such as when an Input/Output
device needs servicing.
CPU Structure
In order to examine the structure of the CPU, the implementation of actual instructions
must be analyzed, so that the organization of the memory units, the ALU and the control
of the CPU are understood. Figure 1 depicts the parts of a CPU. The most important
parts are:
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
1. The Program Counter (PC) points to the address of the instruction that is being
fetched from the instruction memory.
2. The Register File contains registers that store the state arguments for the ALU or
results of Arithmetic operations, at every clock cycle.
3. The Arithmetic and Logic Unit (ALU) is responsible to execute every arithmetic or
logical operation that is needed by an instruction.
4. The Data Memory contains all the data that either are generated by the execution
if the instructions or are the initial data that a program may need.
5. The Instruction Memory holds the compiled and linked instructions before
execution.
6. The Control Block of the processor decodes each instruction and sets all the
control signals according to that instruction.
7. The ALU control is responsible to pass the correct control signal to the ALU
according to the arithmetic or logical operation that is requested by the Control
unit.
Figure 1. The simple data path with the control unit shown.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
In order to understand how an instruction is executed and how the data path operates,
two examples will be given below. The first example explains the instruction add x5,
x6, x7 and the second explains the instruction sw x5, offset(x6). Table 1 shows
the values the control signals have for the two examples.
Example 1
add x5, x6, x7
The execution can be shown in four steps:
1. The instruction is fetched and the PC is incremented by 4.
2. Two registers, x6 and x7, are read from the register file. Also, the main
control unit sets of the control lines for an add operation.
3. The ALU operates on the data read from the register file, using the
function code (bits: 2 ALUOp of the instruction) to generate the ALU
function.
4. The result of the ALU is written back into the register file using the bits
15:11 of the instruction to select the destination register x5.
Example 2
sw x5, offset(x6)
The execution can be shown in 5 steps [1]:
1. The instruction is fetched and the PC is incremented.
2. A register (x5) value is read from the register file.
3. The ALU computes the sum of the value read from the register file and the
sign-extended, lower 16 bits of the instruction (offset).
4. The sum of the ALU is used as the address for the data memory.
5. The data from the register file is written into the memory unit; the register
destination is given by x6.
Instruction ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp [1:0]
add x5, x6, x7 0 0 1 0 0 0 10
sw x5, offset(x6) 1 0 0 0 1 0 00
Table 1. The setting of the control lines of examples 1 and 2 based on Figure 1.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
The explanation of the control signals is as follows:
Signal Name Explanation
ALUSrc Controls the MUX to select the second input of the ALU either from the second
output of the register file or the sign extended lower 16 bits of the instruction
MemtoReg Selects the data fed to the register Write data input port either from the ALU
output or the data memory output
RegWrite The register on the Write register input is written with the value on the Write
data input
MemRead Data memory contents designated by the address input are put on the Read
data output
MemWrite Data memory contents designated by the address input are replaced by the
value on the Write data input
Branch Line goes HI if the instruction is a branch instruction
ALUOp[1:0] The ALU operation.
Pipeline Programming and Hazards
When a processor supports pipeline, an instruction can be divided in four or five steps
which are 1) Instruction Fetch [IF], 2) Instruction Decode [ID], 3) Execution [EX], 4)
Data Memory Access [MEM] and 5) Write Back the result into the register file [WB].
For a processor to achieve its maximum throughput, the pipeline must always be kept
full (i.e. each stage of the pipeline needs to be occupied at each clock cycle). Also,
the clock cycle time is kept fixed and is defined by the slowest step of an instruction.
Let us now see an example. Consider 3 consecutive instructions A, B and C. Each
instruction has the above-mentioned five steps. When a processor does not support
pipelining, each instruction will be executed after the previous one has finished, as it
is shown in Table 2. Also, suppose that instruction A starts at cycle 1.
Instr/Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A IF ID EX MEM WB
B IF ID EX MEM WB
C IF ID EX MEM WB
Table 2. Instruction execution without pipelining.
On the other hand, when pipeline is supported, each instruction can start executing
even though the previous one has not necessarily finished.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
Instr/Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A IF ID EX MEM WB
B IF ID EX MEM WB
C IF ID EX MEM WB
Table 3. Instruction execution with pipeline support.
Note that there are situations in pipelining when the next instruction cannot
execute in the following clock cycle. These events are called Hazards and there
are three different types of hazards:
A. Structural Hazard. When a planned instruction cannot execute in the
proper clock cycle because the hardware does not support the combination
of instructions that are set to execute.
B. Data Hazard or Pipeline Data Hazard. When a planned instruction cannot
execute in the proper clock cycle because data that is needed to execute
the instruction is not yet available.
C. Control Hazard or Branch Hazard. When the instruction cannot execute
in the proper pipeline clock cycle because the instruction that was fetched
is not the one that is needed; that is, the flow of instruction addresses is not
what the pipeline expected.
To understand a Data Hazard, consider the following code fragment:
lw x5, 0(x6)
sub x7, x5, x6
The execution of the code with pipeline support is as follows:
Instr/Cycle 1 2 3 4 5 6
lw x5,0(x6)
IF ID EX MEM WB
sub x7,x5,x6
IF ID EX MEM WB
The lw will write the value from the memory in the register file in step WB, in the
meantime the sub instruction will have retrieved its operands in the ID step. This
actually means that the old value of register x5 will be used for the addition.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
Such hazards can be prevented by either writing a program that does not provoke
any hazard or by adding an ample number of no-operation instructions (NOP)
bubbles between the instructions that depend on the result of each other. NOP
operations act like bubbles inside the pipeline, delaying the execution of the second
instruction (and subsequent instructions) until the values it requires as input are
ready. When the NOP technique is used in the above example, the following
happens:
Instr/Cycl 1 2 3 4 5 6 7 8
lw x5, 0(x6)
IF ID EX MEM WB
sub x7,x5,x6 IF Stall Stall ID EX MEM WB
Unfortunately, Stalling slows down the pipeline and degrades its efficiency. An
alternative technique instead of NOPs is Instruction Forwarding. In this technique,
the result of an instruction will be sent forward to the next instruction before it is
written back to the register file and it is possible that stalling is not needed. The
above data hazard between lw and sub can be resolved with instruction
forwarding:
Instr/Cycl 1 2 3 4 5 6 7 8
lw x5, 0(x6)
IF ID EX MEM WB
sub x7, x5, x6
IF Stall ID EX MEM WB
It can be easily seen that a Stall cycle still exists. The reason is that the lw
instruction will not have the data ready for forwarding before it is read from the
memory. When it writes back the value in the register file, it also forwards it to the
input of the ALU for immediate use (as symbolized by the arrow). If that stall does
not exist even with instruction forwarding, the forwarded result would arrive too late
and the result of the sub instruction would be wrong.
The third technique that a programmer can use to avoid data hazards is
Instruction Reordering. As implied by its name, the independence between
instructions can be exploited in order to reorder them in a way that keeps the
pipeline as full as possible.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
Consider the following code segment:
lw x5, 0(x6)
addi x7, x5, 2
sub x28, x29, x9
Without instruction reordering, the execution will be the following:
Instr/Cycle
1 2 3 4 5 6 7 8 9
lw x5,0(x6)
IF ID EX MEM WB
addi x7,x5,2 IF Stall Stall ID EX MEM WB
sub x28,x29,x9 IF ID EX MEM WB
It is clear that two cycles are lost due to the NOP instruction. Also, it is easily seen
that sub does not need the result of the lw or the add. In this case sub is
independent of the other two and it can be moved between them. After reordering
the execution will look like this:
Instr/Cycle
1 2 3 4 5 6 7 8 9
lw x5,0(x6)
IF ID EX MEM WB
sub x28,x29,x9
IF ID EX MEM WB
addi x7,x5,2 IF Stall ID EX MEM WB
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
Exercises
1. [30 pts] Suppose you have the CPU of Figure 1 and that the architecture is not pipelined.
1. [15 pts] For the following program indicate the values of the control signals for each
instruction.
sw x8, 0(x10)
lw x15, 4(x8)
beq x0, x0, BoolTrue
2. [15 pts] Given the following control signals, write the corresponding assembly
instruction(s).
ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp[1:0]
1 0 1 0 0 0 10
ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp[1:0]
0 0 1 0 0 0 10
2. [30 pts] Suppose you have a machine that supports the five pipeline stages mentioned
above. Using the program below, answer the following questions:
lw x20, 4(x19)
add x20, x20, x11
sub x11, x19, x11
addi x10, x19, 4
sub x17, x17, x10
1. [5 pts] Complete the table of execution with five pipeline stages. Ignore all hazards.
2. [5 pts] Suppose that registers x10, x11, x17, x19, and x20 have the initial values of 2,
5, 8, 0, and 1, respectively. Give the values of the registers after running the program
without the processor handling any hazards.
3. [5 pts] Assume that the initial value of the PC is 0x0…0BF. What is the new PC value
at cycle 4 when ignoring hazards?
4. [5 pts] Resolve the hazards through stalling. Insert NOP instructions so that the
registers would get the correct values when the processor executes the program one
instruction per cycle.
5. [5 pts] Resolve all hazards (if possible) through instruction reordering. Can all
hazards be resolved through instruction reordering only? Explain.
6. [5 pts] Resolve the hazards (if possible) through instruction forwarding. Can all
hazards be resolved through instruction forwarding only? Explain.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
3. [25 pts] Consider the following code:
addi x10, x0, 1
add x12, x0, x10
slli x13, x12, 2
addi x7, x12, 0
addi x29, x29, 4
Loop: sw x7, 10(x29)
addi x28, x0, 2
bne x7, x29, Loop
1. [5 pts] Determine how many cycles it would take for the program to execute without
hazard handling hardware.
2. [10 pts] Draw the execution table for one iteration. Identify the hazards and their types.
3. [10 pts] Assuming perfect hazard handling, what is the minimum number of cycles to
execute one iteration of the program? Explain what combination of hazard resolution
techniques must be used to achieve this and how. Draw the execution table to justify
your explanation.
Department of Electrical and Computer Engineering
Rutgers, The State University of New Jersey
4. [15 pts] Please read the following code, and answer the questions:
main:
addi x5, x0, 10
addi x6, x0, 0
jal x0, block_b
block_a:
add x6, x6, x7
addi x10, x0, 1
addi x11, x6, 0
ecall
jal x0, end
block_b:
addi x7, x0, 0
addi x8, x0, 0
jal x0, block_c
block_c:
sub x9, x5, x7
bge x8, x9, block_a
add x7, x7, x8
add x12, x8, x7
addi x8, x8, 1
jal x0, block_c
end:
Using any combination of techniques you like, identify and resolve all hazards in the program.
Draw execution tables showing the pipelined instructions before and after implementing your
hazard resolution method to illustrate how your solution fixes the issue.
For simplicity, you do not need to draw the table for every instruction in the program, only the
instructions affected by each hazard you identify.