Pipelined Processor Design
Serial versus Pipelined Execution
COE 301 / ICS 233
Computer Organization
Dr. Muhamed Mudawar
College of Computer Sciences and Engineering
King Fahd University of Petroleum and Minerals
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Drawback of Single Cycle Processor
Drawback is the Long cycle time
All instructions take as much time as the slowest instruction
longest delay
Instruction
Fetch
ALU
Decode
Reg Read
ALU
Reg
Write
Load
Instruction
Fetch
Decode
Reg Read
Compute
Address
Reg
Write
Memory Read
Store
Instruction
Fetch
Decode
Reg Read
Compute
Address
Memory Write
Jump
Instruction
Fetch
Decode &
Update PC
Branch
Instruction
Fetch
Reg Read
Br Target
Compare &
Update PC
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Alternative: Multicycle Implementation
Break instruction execution into five steps
Instruction fetch
Instruction decode, register read, target address for jump/branch
Execution, memory address calculation, or branch outcome
Memory access or ALU instruction completion
Load instruction completion
One clock cycle per step (clock cycle is reduced)
First 2 steps are the same for all instructions
Instruction # cycles Instruction # cycles
ALU & Store 4 Branch 3
Load 5 Jump 2
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Performance Example
Assume the following operation times for components:
Access time for Instruction and data memories: 200 ps
Delay in ALU and adders: 180 ps
Delay in Decode and Register file access (read or write): 150 ps
Ignore the other delays in PC, mux, extender, and wires
Which of the following would be faster and by how much?
Single-cycle implementation for all instructions
Multicycle implementation optimized for every class of instructions
Assume the following instruction mix:
40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Solution
Instruction
Class Instruction
Memory Register
Read ALU
Operation Data
Memory Register
Write Total
ALU 200 150 180 150 680 ps
Load 200 150 180 200 150 880 ps
Store 200 150 180 200 730 ps
Branch 200 150 180 530 ps
Jump 200 150 350 ps
For fixed single-cycle implementation:
Clock cycle =
For multi-cycle implementation:
Clock cycle =
Average CPI =
Speedup =
0.4×4 + 0.2×5 + 0.1×4+ 0.2×3 + 0.1×2 = 3.8
max (200, 150, 180) = 200 ps (maximum delay at any step)
880 ps determined by longest delay (load instruction)
880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
Compare and update PC
Decode and update PC
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Laundry Example: Three Stages
Wash dirty load of clothes
Dry wet clothes
Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold
A
B
C
D
Pipelining Example
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
6
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry
Sequential Laundry
Time
6 PM
A
30
30
30
7
8
9
10
11
12 AM
30
30
30
B
30
30
30
C
30
30
30
D
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
7
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)
Pipelined Laundry: Start Load ASAP
Time
6 PM
A
30
7
8
9 PM
B
30
30
C
30
30
30
D
30
30
30
30
30
30
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
8
Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining is to overlap the execution
The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
1
2
k
…
1
2
k
…
1
2
k
…
1
2
k
…
1
2
k
…
1
2
k
…
Without Pipelining
One completion every k time units
With Pipelining
One completion every 1 time unit
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge …
All registers hold the results of previous stages simultaneously
The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages
Clock period is determined by the maximum stage delay
S1
S2
Sk
Register
Register
Register
Register
Input
Clock
Output
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A pipeline can process n tasks in k + n – 1 cycles
k cycles are needed to complete the first task
n – 1 cycles are needed to complete the remaining n – 1 tasks
Ideal speedup of a k-stage pipeline over serial execution
Pipeline Performance
k + n – 1
Pipelined execution in cycles
Serial execution in cycles
=
=
Sk → k for large n
nk
Sk
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
MIPS Processor Pipeline
Five stages, one cycle per stage
IF: Instruction Fetch from instruction memory
ID: Instruction Decode, register read, and J/Br address
EX: Execute operation or calculate load/store address
MEM: Memory access for load and store
WB: Write Back result to register
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Morgan Kaufmann Publishers
30 April, 2017
Chapter 4 — The Processor
12
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which …
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution
Single-Cycle Clock =
200+150+200+200+150 = 900 ps
Reg
ALU
MEM
IF
900 ps
Reg
Reg
ALU
MEM
IF
900 ps
Reg
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Single-Cycle versus Pipelined – cont’d
Pipelined clock cycle =
CPI for pipelined execution =
One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution =
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (number of pipeline stage)
Because the pipeline stages are not balanced
900 ps / 200 ps = 4.5
1
max(200, 150) = 200 ps
200
IF
Reg
MEM
ALU
Reg
IF
Reg
MEM
Reg
ALU
IF
Reg
MEM
ALU
Reg
200
200
200
200
200
200
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
Pipeline Performance Summary
Pipelining doesn’t improve latency of a single instruction
However, it improves throughput of entire workload
Instructions are initiated and completed at a higher rate
In a k-stage pipeline, k instructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k
Pipeline rate is limited by slowest pipeline stage
Unbalanced lengths of pipeline stages reduces speedup
Also, time to fill and drain pipeline reduces speedup
Serial versus Pipelined Execution COE 301 / ICS 233 – Computer Organization © Muhamed Mudawar – slide ‹#›
15
/docProps/thumbnail.jpeg