Chapter …
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 1
Chapter 4 — The Processor — 58
Pipelining Analogy
Pipelined laundry: overlapping execution
Parallelism improves performance
§
4
.5
A
n
O
v
e
rv
ie
w
o
f P
ip
e
lin
in
g Four loads:
Speedup
= 8/3.5 = 2.3
Non-stop:
Speedup
= 2n/0.5n + 1.5 ≈ 4
= number of stages
Pop Quiz
On the previous slide, we had 4 stages.
Therefore, each load of laundry should
finish in 25% of the time.
A: True
B: False
Chapter 4 — The Processor — 59
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 2
Pipelining Impacts Throughput
… not latency (time it takes for a single
instruction to complete)
Another example: can nine women
produce a baby in a single month?
Nine women can produce nine babies in
nine months (avg. 1 baby / month)
Chapter 4 — The Processor — 60
Chapter 4 — The Processor — 61
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate
address
4. MEM: Access memory operand
5. WB: Write result back to register
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 3
Chapter 4 — The Processor — 62
Pipeline Performance
Assume time for stages is
100ps for register read or write
200ps for other stages
Compare pipelined datapath with single-cycle
datapath
Instr Instr fetch Register
read
ALU op Memory
access
Register
write
Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps
Chapter 4 — The Processor — 63
Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 4
Chapter 4 — The Processor — 64
Pipeline Speedup
If all stages are balanced
i.e., all take the same time
Time between instructionspipelined
= Time between instructionsnonpipelined
Number of stages
If not balanced, speedup is less
Speedup due to increased throughput
Latency (time for each instruction) does
not decrease
Chapter 4 — The Processor — 65
Pipelining and ISA Design
MIPS ISA designed for pipelining
All instructions are 32-bits
Easier to fetch and decode in one cycle
c.f. x86: 1- to 17-byte instructions
Few and regular instruction formats
Can decode and read registers in one step
Load/store addressing
Can calculate address in 3rd stage, access
memory in 4th stage
Alignment of memory operands
Memory access takes only one cycle
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 5
Chapter 4 — The Processor — 66
Hazards
Situations that prevent starting the
next instruction in the next cycle
Structure hazards
A required resource is busy
Data hazard
Need to wait for previous instruction to
complete its data read/write
Control hazard
Deciding on control action depends on
previous instruction
Chapter 4 — The Processor — 67
Structure Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for
that cycle
Would cause a pipeline “bubble”
Hence, pipelined datapaths require
separate instruction/data memories
Or separate instruction/data caches
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 6
Chapter 4 — The Processor — 68
Data Hazards
An instruction depends on completion of
data access by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3
Chapter 4 — The Processor — 69
Forwarding (aka Bypassing)
Use result when it is computed
Don’t wait for it to be stored in a register
Requires extra connections in the datapath
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 7
Chapter 4 — The Processor — 70
Load-Use Data Hazard
Can’t always avoid stalls by forwarding
If value not computed when needed
Can’t forward backward in time!
Chapter 4 — The Processor — 71
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result
in the next instruction
C code for A = B + E; C = B + F;
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
stall
stall
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
11 cycles13 cycles
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 8
Chapter 4 — The Processor — 72
Control Hazards
Branch determines flow of control
Fetching next instruction depends on
branch outcome
Pipeline can’t always fetch correct
instruction
Still working on ID stage of branch
In MIPS pipeline
Need to compare registers and compute
target early in the pipeline
Add hardware to do it in ID stage
Chapter 4 — The Processor — 73
Stall on Branch
Wait until branch outcome determined
before fetching next instruction
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 9
Chapter 4 — The Processor — 74
Branch Prediction
Longer pipelines can’t readily determine
branch outcome early
Stall penalty becomes unacceptable
Predict outcome of branch
Only stall if prediction is wrong
In MIPS pipeline
Can predict branches not taken
Fetch instruction after branch, with no
delay
Chapter 4 — The Processor — 75
MIPS with Predict Not Taken
Prediction
correct
Prediction
incorrect
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 10
Chapter 4 — The Processor — 76
More-Realistic Branch Prediction
Static branch prediction
Based on typical branch behavior
Example: loop and if-statement branches
Predict backward branches taken
Predict forward branches not taken
Dynamic branch prediction
Hardware measures actual branch behavior
e.g., record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetching, and update history
Pop Quiz
We have approaches to mitigate pipeline
stalls from:
A: structural hazards
B: structural and control hazards
C: data and control hazards
D: structural, data, and control hazards
Chapter 4 — The Processor — 77
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 11
Chapter 4 — The Processor — 78
Pipeline Summary
Pipelining improves performance by
increasing instruction throughput
Executes multiple instructions in parallel
Each instruction has the same latency
Subject to hazards
Structure, data, control
Instruction set design affects
complexity of pipeline implementation
The BIG Picture
Chapter 4 — The Processor — 79
MIPS Pipelined Datapath
§
4
.6
P
ip
e
lin
e
d
D
a
ta
p
a
th
a
n
d
C
o
n
tro
l
WB
MEM
Right-to-left
flow leads to
hazards
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 12
Chapter 4 — The Processor — 80
Pipeline registers
Need registers between stages
To hold information produced in previous
cycle
Chapter 4 — The Processor — 81
Pipeline Operation
Cycle-by-cycle flow of instructions
through the pipelined datapath
“Single-clock-cycle” pipeline diagram
Shows pipeline usage in a single cycle
Highlight resources used
c.f. “multi-clock-cycle” diagram
Graph of operation over time
We’ll look at “single-clock-cycle”
diagrams for load & store
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 13
Chapter 4 — The Processor — 82
IF for Load, Store, …
Chapter 4 — The Processor — 83
ID for Load, Store, …
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 14
Chapter 4 — The Processor — 84
EX for Load
Chapter 4 — The Processor — 85
MEM for Load
Morgan Kaufmann Publishers 4 November, 2018
Chapter 4 — The Processor 15
Chapter 4 — The Processor — 86
WB for Load
Wrong
register
number
Chapter 4 — The Processor — 87
Corrected Datapath for Load