Chapter 4
part 2
Processor (Datapath and Control) Pipelining
1
Pipelining
• Implementation technique in which multiple instructions are overlapped in execution, like assembly line
• Suppose there are only one washer, one dryer, one “folder”, and “storer”. If there are 20 people lined up for the laundry, how faster the second one is than the first one?
2
Multi-cycle MIPS processor
3
5 steps of lw instruction
• Assume that
– Instruction memory read takes:
– Register read/instruction decode:
– Execution
– Data memory access:
– Register write:
200 ps
How log does it take to execute the instruction?
100 ps
200 ps 200 ps 100 ps
4
Pipelining Example:
P ro g ra m e x e c u tio n o rd e r
200 400 600 800 1000 1200 1400 1600
1800
T im e (in instructions)
l w
$ 1 , 1 0 0 ($ 0 ) $2, 200($0) $3, 300($0)
Instruction fe tc h
Reg
ALU 800 ps
D a ta
Reg
lw lw
ALU Data 800 ps
Reg
•
P ro g ra m
e x e c u tio n
o rd e r
(in instructions)
200 400 600 800 1000 1200 1400
Tim e
lw $1, 100($0) Instruction Reg ALU Data Reg fetch access
lw $2, 200($0) 200 ps Instruction Reg fe tc h
ALU Data Reg access
lw $3, 300($0) 200 ps Instruction fe tc h
Data Reg access
access
Instruction Reg
fetch access
Reg ALU
200ps 200ps 200ps 200ps 200ps
What is the minimum clock cycle time for single cycle, non-pipelined execution?
For pipelined execution?
5
Instruction fe tc h
800 ps
Pipelined Processor
• Each functional unit is used only once per each instruction execution.
• Assume R-type instructions take 4 steps for pipelined execution. What could be a problem in running the following sequence of
instructions?
lw $s1, 100($s2) add $s3, $s4, $s5
6
• A solution is to make all instructions with 5 similar stages. Each functional unit can only be used at the same state for all instructions.
no operation
7
Pipeline lw instructions
-One instruction enters the pipeline every cycle. -The effective CPI is ______________
8
Designing instruction sets for Pipelining
• What makes the design easy?
1. all instructions have the same length
2. justafewinstructionformatskeepingsymmetryamongthem.
3. memory operands appear only in loads and stores
• What makes the design hard?
1. structural hazards: suppose we had only one memory
2. control hazards: need to worry about branch instructions
3. data hazards: an instruction depends on a previous instruction
9
Pipeline Hazards
• Situations in pipelining when the next instruction
cannot execute in the following clock cycle. 1. structural hazards:
– suppose we had only one memory unit
2. data hazards:
– an instruction depends on a previous instruction
3. control hazards:
– need to worry about branch instructions
10
1. Structural Hazards
1. structural hazards: Hardware cannot support the combination of instructions that are set to execute in the same clock cycle.
What is wrong with the sequence of instructions in the following if there
is only one memory?
800 ps
P ro g ra m
e x e c u tio n
o rd e r
(in instructions)
200 400 600 800 1000 1200 1400
lw $ 1 , 1 0 0 ($ 0 )
Instruction f e t c h
Reg
ALU
D a ta access
Reg
Tim e
lw $2, 200($0) 200 ps Instruction fe tc h
Reg
ALU Reg
Data Reg access
lw $3, 300($0) 200 ps Instruction fe tc h
Data Reg access
ALU
200 ps 200 ps 200 ps 200 ps 200 ps
If we had a fourth instruction, we would see that in the same clock cycle the first instruction is accessing data from memory while the forth instruction is fetching an instruction from that same memory.
Without two memories, we could have a structural hazard. 11
Structure Hazards
• The MIPS instruction set was designed to be pipelined, making it fairly easy for designers to avoid structural hazards.
• We have an instruction memory unit and a data memory unit.
• For the register file, “write” occurs in the first half clock cycle, and
“read” occurs in the second half clock cycle.
register file
write read
12
2. Data Hazards
2. Datahazards:Aninstructioncannotbeexecutedasplanned because data is not yet available.
What could be a problem with executing the following instructions by pipelining?
add $s0, $t0, $t1 sub $t2, $s0, $t3
1. A solution: Forwarding (a.k.a. bypassing) : Adding extra hardware
to retrieve the missing
item early from the
internal resources
13
• 2. Another solution: stall (with forwarding)
A stall is needed when an R-type instruction following a load tries
to use the data loaded.
Bubble – a nickname for a pipeline stall
14
• 3. Yet another solution: re-ordering the code.
Ex: The following C code: A = B + D;
C = B + E;
is translated into
lw $t1, 4($t0)
lw $t2, 12($t0)
add $t3, $t1, $t2 sw $t3, 0($t0) lw $t4, 16($t0) add $t5, $t1, $t4 sw $t5, 8($t0)
Find the hazards and reorder the instructions to avoid any pipeline stalls.
15
Re-ordered result:
lw $t1, 4($t0)
lw $t2, 12($t0) lw $t4, 16($t1) add $t3, $t1, $t2 sw $t3, 0($t0) add $t5, $t1, $t4 sw $t5, 8($t1)
16
• Forwarding cannot prevent all pipeline stalls. Ex:
lw $s0, 12($t4) sub $t2, $s0, $t3
The desired data $s0 is a available only after the 4th stage, MEM. It is too late for the input of the input stage of sub.
17
3. Control Hazards
3. Controlhazards(a.k.a.branchhazards):
When we decide to branch, other instructions are already in the pipeline.
One solution is to “predict” that branches are not taken: need hardware to flush instructions if we are wrong.
18
Branch not taken
Branch taken
The pipelined datapath
• Pipeline registers are added (to the single cycle datapath).
IF
ID
EX
MEM
WB 19
64bits
128bits
97bits
64bits
1. IF:InstructionFetch
2. ID:InstructionDecodeandRegisterFileRead 3. EX:ExecutionorAddressCalculation
4. MEM:DataMemoryAccess
5. WB:WriteBack
Two exceptions to left-to-right flow of instructions:
• The write-back state, which places the result back into the register
file in the middle of the datapath
• The selection of the next value of the PC, choosing between the incremented PC and the branch address from the MEM stage.
20
The five stages of pipelined execution for lw • We look at a load because it is active in all five stages.
• The right half of registers or memory is highlighted when they are being read.
• The left half is highlighted when they are being written.
21
Pipelined execution of lw • IF
22
Pipelined execution of lw • ID
23
Pipelined execution of lw • EX
24
Pipelined execution of lw • MEM
25
Pipelined execution of lw • WB
26
• Instruction fetch:
-The instruction is read from memory using the address in the PC ->
the IF/ID pipeline register.
• Instruction decode and register file read:
-The instruction in the IF/ID register -> register numbers
-Three 32 bits values (reg1, reg2, extended 16 bit immediate) -> the ID/EX pipeline register.
• Execute and address calculation:
-The effective address -> the EX/MEM pipeline register
27
• Memory access:
– The register containing the data to be stored was stored in ID/EX in
an earlier state.
– Place the data into the EX/MEM pipeline register in the EX state, as we stored the effective address into EX/MEM
• Write back:
– Aninstructionpassesthroughastateevenifthereisnothingto do because later instructions are already progressing at the maximum rate. (see the next page.)
28
We adjusted the arrow to Write register from previous datapath
29
• The Write Register from the instruction needed to be preserved through stages so that it can be used in WB stage to write the data.
30
Example
31
Pipelined Control
32
• Because each control line is associated with a component active in only a single pipeline stage, we can divide the control lines into five groups according to the pipeline state.
1. Instructionfetch:Thecontrolsignalstoreadinstructionmemory and to write the PC are always asserted, so there is nothing special to control in this pipeline stage.
2. Instruction decode/register file read: As in the previous stage, the
same thing happens at every clock cycle, so there are no optional
control lines to set.
3. Execution/address calculation: The signals to be set are RegDst, ALUOp, and ALUSrc. The signals select the Result register, the ALU operation, and either Read data 2 or a sign-extended immediate for the ALU.
33
4. Memoryaccess:ThecontrollinessetinthisstageareBranch, MemRead, and MemWrite. These signals are set by the branch equal, load, and store instructions, respectively. Recall that PCSrc selects the next sequential address unless control asserts Branch and the ALU result was zero.
5. Writeback:ThetwocontrollinesareMemtoReg,whichdecides between sending the ALU result or the memory value to the register file, and RegWrite, which writes the chosen value.
34
The effect of each of the seven control signals.
Branch – is it a branch instruction? If it is, 1, otherwise 0. ALU Op1 ALU Op0
R-format 1 0 lw 0 0 sw 0 0 beq 0 1
-> ALU operation determined by funct field -> ALU performs addition
-> ALU performs addition
-> ALU performs subtraction
35
Can you fill the control table for pipelined control?
Instruction
Execution/address calculation stage control lines
Memory access stage control lines
Write-back stage control lines
R-format lw
sw
beq
Reg ALU ALU ALU Branch Mem Mem Reg Mem
Dst Op1 Op0 Src
Read Write Write to Reg
The answer is in the next page.
36
Pipelined Control
Control information is created during instruction decode, and moved
down the pipeline via pipelined registers.
37
• The values of the control lines are the same as the ones in the single cycle datapath, but they have been shuffled into three groups corresponding to the last three pipeline stages.
38
Data hazards and Forwarding
sub $2, $1, $3 add $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2)
The last four instructions are all dependent on the result in register $2 of the first instruction.
39
The values read for $2 would not be the result of sub, unless the read occurred during the clock cycle 5 or later.
40
Only the last two, add and sw get the correct value of $2.
Actually the value of $2 is available at the end of the EX stage, or cycle 3.
=> We can forward the data.
We need the data as input to ALU in the next instruction
41
How to detect hazards
• The two pairs of hazard conditions are:
– 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
– 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt And
EX/MEM.RegisterRd ≠ 0
– 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
– 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt And
MEM/WB.RegisterRd ≠ 0
ID/EX.RegisterRs refers to the number of one register whose value is found in the pipeline register ID/EX, etc.
(We should not have: sll $0, $1, 2 – destination is $zero register)
42
add Type 1.a
$12, $2, $5 hazard
or Type 2.b
$13, $6, $2 hazard
sub $2, $1, $3 add $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2)
Based on the previous conditions:
sub
$2, $1, $3
sub
$2, $1, $3
sub
$2, $1, $3 $14, $2, $2
add No hazard
sub $2, $1, $3
sw $15, 100($2) No hazard
43
The dependences between the pipeline register move forwards in time
$1
$2
$3
$2
$5
$6
$2
$2
$2
$2
100
44
45
Datapath with forwarding
46
Control values for the forwarding multiplexors in the previous page
Mux control
Source Explanation
ForwardA = 00 ForwardA = 10
ID/EX The first ALU operand comes from the register file.
ForwardA = 01 ForwardB = 00 ForwardB = 10 ForwardB = 01
MEM/WB The first ALU operand is forwarded from data memory or an earlier ALU result.
EX/MEM The first ALU operand is forwarded from the prior ALU result.
ID/EX The second ALU operand comes from the register file.
EX/MEM The second ALU operand is forwarded from the prior ALU result.
MEM/WB The second ALU operand is forwarded from data memory or an earlier ALU result.
47
Hazards detection and control for forwarding
1. EX hazard
if ( (EX/MEM. RegWrite=1)
and (EX/MEM.RegisterRd ≠ 0) — $zero cannot be overwritten and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
if ( (EX/MEM. RegWrite=1)
and (EX/MEM.RegisterRd ≠ 0) — $zero cannot be overwritten and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
48
Hazards detection and control for forwarding 2. MEM hazard
if ( and and
(MEM/WB. RegWrite=1)
(MEM/WB.RegisterRd ≠ 0) — $zero cannot be overwritten
and
(MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (
(MEM/WB. RegWrite=1)
(MEM/WB.RegisterRd ≠ 0) — $zero cannot be overwritten
and
and and
not ( (EX/MEM. RegWrite=1)
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))– check if result in MEM is more recent than WB
not ( (EX/MEM. RegWrite=1)
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) — check if result in MEM is more recent than WB
(MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
49
50
• There is no hazard in the WB stage because we assume that the register file supplied the correct result if the instruction in the ID stage read the same register written by the instruction in the WB stage.
• For loads immediately followed by stores, we need to add more forwarding hardware to make memory-to-memory copies run faster.
51
Data Hazards and Stalls
• When an instruction tried to read a register following a load instruction that writes the same register
– cannot be saved by forwarding.
52
Example: “add” uses $2 register that “load” wrote:
53
How to detect the hazards of that kind
if (ID/EX.MemRead=1 and — test if the instruction is load(lw) ((ID/EX.RegisterRt=IF/ID.RegisterRs) or
(ID/EX.RegisterRt=IF/ID.RegisterRt)))
stall the pipeline — the instruction holds 1 cycle clock
To stall the instructions in ID and IF stages
It is sufficient to prevent PC and IF/ID pipeline register from changing.
The back half of the pipeline starting with the EX stage must be doing something — doing “nop”.
nop: An instruction that does no operation to change state.
We can implement nop in pipelining by changing EX, MEM, WB control fields of ID/EX pipeline register to 0.
54
and $4, $2, $5
NOP is implemented by deasserting all 9 control signals
55
With forwarding, the hazard detection unit, and the forwarding unit
56
Control Hazards/Branch Hazards
• 2 ways to improve branch performance:
– Topredictthebranchdecisionwithahighprobability.
– Toreducethecostofthetakenbranch,bymovingthebranch decision in the earlier cycle (reduce to penalty of one cycle):
• Move the branch target address calculation to the ID stage from the EX stage.
• Evaluating the branch decision (check the equality) during the ID stage by adding a new unit to check the equality. – First exclusive ORing their respective bits, and then ORing all the result bits. (note: stalls might be needed. 2 stalls for “lw”)
57
Control Hazards/Branch Hazards Example
• IF.Flush signal to zero controls in IF is added.
• In the example in the next page, since the branch instruction decides whether to branch in clock cycle 4 for the beq, the 3 sequential instruction that following the branch will be fetched and begin execution, if there is no intervention.
58
Note: branch address = PC + 4 + const (28—already multiplied by 4) = 40+4+28 = 72
59
The ID stage of clock cycle 3 determines that a branch must be taken, so it selects 72 as the next PC address and zeros the instruction fetched for the next cycle.
60
61
• One solution is to assume branch is not taken.
• If the branch is actually taken, we need to discard instructions that are being fetched and decoded. – we change the original control values to 0s, We also need to change the 3 instructions in the IF, ID, EX stages when the branch reaches the MEM stage. For load-use stall, we just changed control to – in the ID stage. Flush instructions.
• Modern processors use “dynamic branch prediction”, which records the history of branch on the run (in branch history table/branch prediction buffer) and predicts branch based on it. Typically, they predict correctly 95% of the time.
• A branch prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction.
62
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction Taken
Predict taken
Not taken Taken
Predict taken
Taken
Not taken
Predict not taken
Not taken Taken
Predict not taken Not taken
A 2-bit prediction scheme
63
Branch Prediction
• Sophisticated Techniques:
– A“branchtargetbuffer”tohelpuslookupthedestination
– Correlatingpredictorsthatbasepredictiononglobalbehavior and recently executed branches (e.g., prediction for a specific
branch instruction based on what happened in previous branches)
– Tournamentpredictorsthatusedifferenttypesofprediction strategies and keep track of which one is performing best.
– A“branchdelayslot”whichthecompilertriestofillwithauseful instruction (make the one cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
64
65
Exceptions/interrupts
• Events other than branches or jumps that change the normal flow of instruction exception
Type of event
From where?
MIPS terminology
I/O device request
External
Interrupt
Invoke the operating system from user program
Internal
Exception
Arithmetic overflow
Using an undefined instruction Hardware malfunctions
Internal Internal Either
Exception Exception
• When an exception occurs, the machine saves the address of the offending instruction in EPC (Exception Program Counter) and then transfer control to Operating System at some specified address.
Exception or interrupt
66
Arithmetic Overflow
• To flush instructions in the ID stage, we use the multiplexor already in the ID stage that zeros control signals for stalls. A new control signal, ID.Flush is ORed with the stall signal from the hazard detection unit to flush during ID.
• To flush instruction in the EX phase, we use a new signal called EX.Flush to cause new multiplexors to zero the control lines. To start fetching instructions from location 8000 0180 hex, which is the exception location for an arithmetic overflow, we add an additional input to the PC multiplexor that sends 8000 0180 hex to the PC.
• IF.Flush signal to zero controls in IF (from Branch hazard detection).
• ID.Flush signal to zero controls in ID.
• EX.Flush signal to zero controls in EX.
67
Datapath to handle exceptions
• EPC (Exception Program Counter): A 32-bit register used to hold the address of the affected instruction.
• Cause: A register used to record the cause of the exception.
68
• ALU overflow signal will be input to Control.
69
Example (Exception in a Pipelined Computer) Given this instruction sequence,
40 hex sub 44 hex and
$11, $2, $4 $12, $2, $5 $13, $2, $6 $1, $2, $1 $15, $6, $7 $16, 50($7)
48 hex or 4C hex add
50 hex slt 54 hex lw
Assume the instructions to be invoked on an exception begin like this:
4000 0040 hex sw $25, 1000($0) 4000 0044 hex sw $26, 1004($0)
What happens in the pipeline if an overflow exception occurs in the add instruction?
70
• The overflow is detected during the EX stage of clock 6, saving the address following the add in the EPC register (4C+4 = 50).
• Overflow causes all the Flush signals to be set near the end of this clock cycle, de-asserting control values (setting them to 0) for the add. Clock cycle 7 shows the instructions converted to bubbles in the pipeline plus the fetching of the first instruction of the exception routine – sw $25, 1000 ($0) from instruction location 4000 0040 hex.
• Note that the “and“ and “or” instructions, which are prior to the add, still complete. Although not shown, the ALU overflow signal is an input to the control unit.
• See the next 2 pages.
71
72
73
Super scalars (Dynamic multiple-issue processors)
• Advanced pipelining that enables the processor to execute more than one instruction per clock cycle.
• The processor decides whether zero, one, or more instructions can issue in a given clock cycle. It needs to schedule instructions to move dependencies apart and improve the instruction issue rate.
• They extend the basic framework of dynamic issue decisions to include dynamic pipeline scheduling
74
Simple multiple code scheduling
Loop: lw
$t0, 0($s1) # $t0=array element addu $t0, $t0, $t2 # add scalar in $s2 sw $t0, 0($s1) # store result
addi $s1, $s1, -4 # decrement pointer bne $s1, $zero, Loop # branch $s1!=0
• Reorder the instructions to avoid as many pipeline stalls as possible. Assume branches are predicted, so that control hazards are handled by the hardware.
75
• The instructions addu and sw have data dependences on lw, and the last “addi” instruction does not have data dependences.
• So after lw, execute “addi” on the 2nd cycle, then execute “addu” on the 3rd cycle, then we do not need to stall for “addu”.
• Also, after the 3rd cycle, “sw” is ready to execute without any stall.
What is CPI for this example? 0.8
76
Dynamic Pipeline Scheduling
• It chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls.
• It analyzes the data flow structure of a program. Then the processor executes the instructions in some order that preserved the data flow order of the program.
• The instruction fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked.
• The multiple functional units. – has reservation stations that hold the operands and the operation.
• The commit unit is required to write results to registers and memory in program execution order. (it has the reorder buffer).
77
78
1. Whenaninstructionissues,ifeitherofitsoperandsisintheregister file or the reorder buffer, it is copied to the reservation station immediately, where it is buffered until all the operands and an execution unit are available. For the issuing instruction, the register copy of the operand is no longer required, and if a write to that register occurred, the value could be over-written.
2. If an operand is not in the register file or reorder buffer, it must be waiting to be produced by a functional unit. The name of the functional unit that will produce the result is tracked. When that unit eventually produces the result, it is copied directly into the waiting reservation station from the functional unit bypassing the registers.
These steps effectively use the reorder buffer and the reservation stations to implement register renaming.
79
• Dynamic Pipeline Scheduling
– Hardwarechooseswhichinstructionstoexecutenext
– Willexecuteinstructionsoutoforder(e.g.,doesn’twaitfora dependency to be resolved, but rather keeps going!)
– Speculatesonbranchesandkeepsthepipelinefull (may need to rollback if prediction incorrect)
80
Summary
• In chapter 4, we discussed and compared different approaches to processor implementation.
– Singlecycle – Pipelined
Pipelining improves the average exception time per instruction by exploiting instruction-level parallelism.
81