CPU Datapaths 3: Intro to Pipelining
CS 154: Computer Architecture Lecture #13
Winter 2020
Ziad Matni, Ph.D.
Dept. of Computer Science, UCSB
Administrative
• Talk next week: must attend
• Tuesday at 5:00 PM
2/26/20
Matni, CS154, Wi20 2
Lecture Outline
• Full Single-Cycle Datapaths
• Pipelining
The Main Control Unit
• Control signals derived (i.e. decoded) from instruction opcode
[Figure: instruction formats with annotated fields: opcode Op[5:0], "always read", "write", "write", "sign extend and add"]
Full Datapath showing 7 Control Signals
See Fig. 4.16 in book (p.264) for a description of each signal
One Control Unit to Set them All… my precious
Let’s do some of these examples:
add $t0, $t1, $t2
addi $a0, $v0, 64
lw $t0, 4($sp)
beq $a1, $a2, blabel
jal jlabel
add $t0, $t1, $t2
rd = rs + rt
rs = $t1 code, rt = $t2 code, rd = $t0 code
[Figure: datapath trace; rs and rt are read, the ALU computes rs + rt, and the result is written to rd]
Control signals:
RegDst = 1, Branch = 0, Zero = X, MemRead = 0, MemtoReg = 0, MemWrite = 0, ALUOp = 0010 (add), ALUSrc = 0, RegWrite = 1
addi $a0, $v0, 64
rt = rs + immed
rs = $v0 code, rt = $a0 code, immed = 64
[Figure: datapath trace; rs is read, the ALU computes rs + immed, and the result is written to rt]
Control signals:
RegDst = 0, Branch = 0, Zero = X, MemRead = 0, MemtoReg = 0, MemWrite = 0, ALUOp = 0010 (add), ALUSrc = 1, RegWrite = 1
lw $t0, 4($sp)
rt = *(rs + immed)
rs = $sp code, rt = $t0 code, immed = 4
[Figure: datapath trace; rs is read, the ALU computes rs + immed, and the value at address (rs + immed) is written to rt]
Control signals:
RegDst = 0, Branch = 0, Zero = X, MemRead = 1, MemtoReg = 1, MemWrite = 0, ALUOp = 0010 (add), ALUSrc = 1, RegWrite = 1
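The address arithmetic behind rt = *(rs + immed) can be sketched in Python. This is only an illustration: the names `sign_extend16` and `load_word`, the dictionary-based register file and memory, and the register and memory contents are all assumptions made for this sketch, not part of the lecture.

```python
def sign_extend16(immed: int) -> int:
    """Interpret the low 16 bits of immed as a signed two's-complement value."""
    immed &= 0xFFFF
    return immed - 0x10000 if immed & 0x8000 else immed


def load_word(regs: dict, memory: dict, rs: str, rt: str, immed: int) -> None:
    """Model lw rt, immed(rs): rt = *(rs + immed)."""
    # The ALU adds rs and the sign-extended immediate (ALUSrc = 1).
    address = (regs[rs] + sign_extend16(immed)) & 0xFFFFFFFF
    # Data memory is read and its output is written to rt
    # (MemRead = 1, MemtoReg = 1, RegWrite = 1).
    regs[rt] = memory[address]


# Hypothetical register and memory contents, chosen only for illustration.
regs = {"$sp": 0x7FFFEFFC, "$t0": 0}
memory = {0x7FFFF000: 1234}

load_word(regs, memory, rs="$sp", rt="$t0", immed=4)
print(regs["$t0"])
```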
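beq $a1, $a2, blabel
Assume in this example that a1 = a2
rs = $a1 code, rt = $a2 code, immed = blabel (encoded as a word offset)
[Figure: datapath trace; rs and rt are read, the ALU computes rs - rt, Zero is asserted, and the new address computed from immed is loaded into the PC]
Control signals:
RegDst = 1, Branch = 1, Zero = 1, MemRead = 0, MemtoReg = 0, MemWrite = 0, ALUOp = 0110 (subtract), ALUSrc = 0, RegWrite = 0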
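The four worked examples (add, addi, lw, beq) can be collected into a lookup table keyed by opcode, which is roughly what the main control unit does. The sketch below is hedged: the signal values are copied from the examples, the opcode numbers are the standard MIPS encodings, and the names `CONTROL_TABLE`, `main_control`, and `take_branch` are made up for this illustration. The ALUOp strings follow the slides' labeling.

```python
# Control settings per opcode, copied from the worked examples above.
CONTROL_TABLE = {
    0x00: dict(RegDst=1, Branch=0, MemRead=0, MemtoReg=0, MemWrite=0,
               ALUOp="0010", ALUSrc=0, RegWrite=1),   # R-type (add)
    0x08: dict(RegDst=0, Branch=0, MemRead=0, MemtoReg=0, MemWrite=0,
               ALUOp="0010", ALUSrc=1, RegWrite=1),   # addi
    0x23: dict(RegDst=0, Branch=0, MemRead=1, MemtoReg=1, MemWrite=0,
               ALUOp="0010", ALUSrc=1, RegWrite=1),   # lw
    0x04: dict(RegDst=1, Branch=1, MemRead=0, MemtoReg=0, MemWrite=0,
               ALUOp="0110", ALUSrc=0, RegWrite=0),   # beq
}


def main_control(instruction: int) -> dict:
    """Decode the control signals from Op[5:0], i.e. instruction bits 31..26."""
    opcode = (instruction >> 26) & 0x3F
    return CONTROL_TABLE[opcode]


def take_branch(signals: dict, zero: int) -> int:
    """The PC takes the branch target only when Branch AND Zero are both 1."""
    return signals["Branch"] & zero


# Machine encodings of the examples (the beq offset of 3 is arbitrary).
ADD_T0_T1_T2 = 0x012A4020
ADDI_A0_V0_64 = 0x20440040
LW_T0_4_SP = 0x8FA80004
BEQ_A1_A2 = 0x10A60003

for word in (ADD_T0_T1_T2, ADDI_A0_V0_64, LW_T0_4_SP, BEQ_A1_A2):
    print(f"{word:#010x}", main_control(word))
```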
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Reminder: Implementing Jumps
• Jump uses word address
• Update PC with concatenation of the 4 most significant bits of the old PC, the 26-bit jump address, and 00 at the end
• Need an extra control signal decoded from opcode
• Need to implement a couple of other logic blocks…
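The concatenation described above can be sketched with shifts and masks. The function name `jump_target` and the example PC and instruction values are assumptions for illustration; in the textbook datapath the upper 4 bits are taken from PC + 4, which gives the same bits as the old PC except right at a 256 MB boundary.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Build the 32-bit jump target from the PC and a J-format instruction word."""
    address26 = instruction & 0x03FFFFFF   # low 26 bits: the word address field
    upper4 = pc & 0xF0000000               # 4 most significant bits of the PC
    return upper4 | (address26 << 2)       # shifting left by 2 appends "00"


# Hypothetical values: a jal (opcode 3) whose 26-bit field is 0x0100009.
pc = 0x00400018
jal_word = (0x03 << 26) | 0x0100009
print(hex(jump_target(pc, jal_word)))
```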
Jump Instruction
Performance Issues
• Longest delay determines clock period
• Critical path: load instruction
• Goes: instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle: making the common case fast
• We can/will improve performance by pipelining
Pipelining Analogy
• Pipelined laundry: overlapping execution
• An example of how parallelism improves throughput performance
• 4 loads sped up:
• From 8 hrs to 3.5 hrs
• Speed-up factor: 2.3
• But for infinite loads:
• Speed-up factor ≈ 4 = number of stages
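The laundry numbers can be reproduced with a few lines of arithmetic. The sketch assumes 4 equal stages of 0.5 hours each, which is consistent with 8 hours for 4 sequential loads; the function names are made up for this illustration.

```python
def sequential_hours(loads: int, stages: int, stage_hours: float) -> float:
    """Each load finishes all of its stages before the next load starts."""
    return loads * stages * stage_hours


def pipelined_hours(loads: int, stages: int, stage_hours: float) -> float:
    """The first load takes `stages` steps; each later load adds one more step."""
    return (stages + loads - 1) * stage_hours


STAGES, STAGE_HOURS = 4, 0.5  # assumed: 4 balanced stages of half an hour

for loads in (4, 1_000_000):
    seq = sequential_hours(loads, STAGES, STAGE_HOURS)
    pipe = pipelined_hours(loads, STAGES, STAGE_HOURS)
    print(f"{loads} loads: {seq} h vs {pipe} h, speed-up {seq / pipe:.2f}")
```

As the number of loads grows, the fill time of the pipeline matters less and the speed-up approaches the number of stages.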
MIPS Pipeline
Five stages, one step per stage:
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
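To see how the five stages overlap, the sketch below prints which stage each instruction occupies in each clock cycle. It assumes an ideal pipeline with no stalls or hazards, and the function name `pipeline_diagram` is invented for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions: list) -> list:
    """Return one row per instruction; row[c] is the stage in cycle c, or ''."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i in range(len(instructions)):
        row = [""] * total_cycles
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append(row)
    return rows


program = ["lw $t0, 4($sp)", "add $t0, $t1, $t2", "addi $a0, $v0, 64"]
for instr, row in zip(program, pipeline_diagram(program)):
    print(f"{instr:<20}" + "".join(f"{stage:<5}" for stage in row))
```

With 3 instructions and 5 stages the diagram spans 7 cycles, since each new instruction starts one cycle after the previous one.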
Pipeline Performance
• Assume time for stages is
• 100ps for register read or write
• 200ps for other stages
• Compare pipelined datapath with single-cycle datapath
Comparison of Per-Instruction Time
[Figure: timing diagrams; single-cycle Tc = 800 ps, pipelined Tc = 200 ps]
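The two clock periods follow directly from the stage times. In the sketch below, the stage delays are the ones assumed on the slide; the mapping of instruction classes to the stages they use is an assumption based on the usual textbook treatment, and all names are invented for this illustration.

```python
# Stage delays in picoseconds: 100 ps for register read (ID) and
# register write (WB), 200 ps for the other stages.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Assumption: which stages each instruction class actually exercises.
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time(kind: str) -> int:
    """Total delay through the stages that this instruction class uses."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[kind])


def single_cycle_period() -> int:
    """The clock must fit the slowest instruction (the critical path)."""
    return max(instruction_time(kind) for kind in STAGES_USED)


def pipelined_period() -> int:
    """The clock must fit the slowest single stage."""
    return max(STAGE_PS.values())


for kind in STAGES_USED:
    print(kind, instruction_time(kind), "ps")
print("single-cycle Tc =", single_cycle_period(), "ps")
print("pipelined Tc =", pipelined_period(), "ps")
```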
Improvement
• In the previous example, per-instruction improvement was 4x
• 800 ps to 200 ps
• But total execution time went from 2400 ps to 1400 ps (~1.7x improvement)
• That’s because we’re only looking at 3 instructions…
• What if we looked at 1,000,003 instructions?
• Total execution time = 1,000,000 x 200 ps + 1400 ps = 200,001,400 ps
• In non-pipelined, total time = 1,000,000 x 800 ps + 2400 ps = 800,002,400 ps
• Improvement = 800,002,400 ps / 200,001,400 ps ≈ 4.00
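The totals above can be checked with a short calculation. The sketch assumes an ideal 5-stage pipeline with no stalls, so n instructions take (5 + n - 1) cycles of 200 ps; the function names are made up for this illustration.

```python
def single_cycle_total_ps(n: int, tc_ps: int = 800) -> int:
    """Every instruction takes one full 800 ps cycle."""
    return n * tc_ps


def pipelined_total_ps(n: int, stages: int = 5, tc_ps: int = 200) -> int:
    """The first instruction needs `stages` cycles; each later one adds a cycle."""
    return (stages + n - 1) * tc_ps


for n in (3, 1_000_003):
    single = single_cycle_total_ps(n)
    piped = pipelined_total_ps(n)
    print(f"n = {n}: {single} ps vs {piped} ps, improvement {single / piped:.2f}x")
```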
About Pipeline Speedup
• If all stages are balanced, i.e. all take the same time:
• Time between instructions (pipelined) = Time between instructions (non-pipelined) / # of stages
• If not balanced, speedup will be less
• Speedup is due to increased throughput, but instruction latency does not change
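A small comparison shows why unbalanced stages reduce the speedup. The sketch uses the 200/100/200/200/100 ps stage times from the earlier example, plus a hypothetical balanced pipeline with the same 800 ps total; the function name is invented for this illustration.

```python
def time_between_instructions(stage_ps: list) -> tuple:
    """Return (non-pipelined, ideal pipelined, actual pipelined) times in ps."""
    nonpipelined = sum(stage_ps)              # one instruction at a time
    ideal = nonpipelined / len(stage_ps)      # the balanced-stage formula
    actual = max(stage_ps)                    # the clock is set by the slowest stage
    return nonpipelined, ideal, actual


unbalanced = [200, 100, 200, 200, 100]   # stage times from the earlier example
balanced = [160] * 5                     # hypothetical: same total, equal stages

for name, stages in (("unbalanced", unbalanced), ("balanced", balanced)):
    nonpipelined, ideal, actual = time_between_instructions(stages)
    print(f"{name}: ideal {ideal} ps, actual {actual} ps, "
          f"speedup {nonpipelined / actual:.1f}x")
```

With the unbalanced stages the speedup is 4x rather than the 5x that five balanced stages would give.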
MIPS vs Others’ Pipelining
MIPS (and RISC-types in general) simplification advantages:
• All instructions are the same length (32 bits)
• x86 has variable length instructions (8 bits to 120 bits)
• MIPS has only 3 instruction formats (R, I, J), with the rs fields all in the same place
• x86 requires extra pipelines because its formats don't keep fields in the same place
• Memory ops only appear in load/store
• x86 requires extra pipelines because its memory ops appear in other instructions too
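Because every MIPS instruction is 32 bits and the register fields sit at fixed positions, the fields can be extracted before the format is even known. The sketch below illustrates this; the function name `decode_fields` is an assumption, and the two instruction words are the standard encodings of the earlier add and lw examples.

```python
def decode_fields(instruction: int) -> dict:
    """Slice a 32-bit MIPS instruction word at its fixed field positions."""
    return {
        "opcode": (instruction >> 26) & 0x3F,  # bits 31..26
        "rs": (instruction >> 21) & 0x1F,      # bits 25..21, same in R and I formats
        "rt": (instruction >> 16) & 0x1F,      # bits 20..16
        "rd": (instruction >> 11) & 0x1F,      # bits 15..11, meaningful for R format
        "immed": instruction & 0xFFFF,         # bits 15..0, meaningful for I format
    }


add_word = 0x012A4020   # add $t0, $t1, $t2  ($t0 = 8, $t1 = 9, $t2 = 10)
lw_word = 0x8FA80004    # lw $t0, 4($sp)     ($sp = 29)

print(decode_fields(add_word))
print(decode_fields(lw_word))
```

The register file can start reading rs and rt while the opcode is still being decoded, which is what lets decode and register read share one pipeline stage.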
YOUR TO-DOs for the Week
• Lab 6 due soon…