COMP 273 Pipelines (1)
Based on Patterson’s 61C
Introduction to Pipelined Execution
COMP 273
Review (1/3)
• Datapath is the hardware that performs operations necessary to execute programs.
• Control instructs datapath on what to do next.
• Datapath needs:
– access to storage (general purpose registers and memory) – computational ability (ALU)
– helper hardware (local registers and PC)
COMP 273 Pipelines (2)
Based on Patterson’s 61C
COMP 273 Pipelines (3)
Based on Patterson’s 61C
Review (2/3)
• Five stages of datapath (executing an instruction): 1. Instruction Fetch (Increment PC)
2. Instruction Decode (Read Registers)
3. ALU (Computation)
4. Memory Access 5. Write to Registers
• ALL instructions must go through ALL five stages.
Review Datapath
rd
rs rt
imm
2. Decode/ Register
Read
ALU
+4
1. Instruction Fetch
3. Execute
4. Memory 5. Write Back
COMP 273 Pipelines (4)
Based on Patterson’s 61C
instruction memory
PC
registers
Data memory
Outline • Pipelining Instruction Execution
• Pipelining Analogy • Hazards
COMP 273 Pipelines (5) Based on Patterson’s 61C
COMP 273 Pipelines (6) Based on Patterson’s 61C
•
Bono, The Edge, Adam Clayton, and Larry Mullen Jr., each have one load of clothes to wash, dry, fold, and put away
• Washer takes 30 minutes
• Dryer takes 30 minutes
• “Folder” takes 30 minutes
• “Stasher” takes 30 minutes to put clothes into drawers
With or without your laundry
ABCD
COMP 273 Pipelines (7)
Based on Patterson’s 61C
T a s k
O r d e r
Sequential Laundry
6PM 7 8 9 10 11 12 1 2AM 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
A
B C
D
Time
COMP 273 Pipelines (8)
Based on Patterson’s 61C
Sequential laundry takes 8 hours for 4 loads
Pipelined Laundry
6 PM 7 8 9 10 11 12 1 2 AM
T a s k
O r d e r
30 30 30 30 30 30 30
A
B C D
Time
COMP 273 Pipelines (9)
Based on Patterson’s 61C
Pipelined laundry takes 3.5 hours for 4 loads!
General Definitions – Time to completely execute a certain task
• Latency
• For example, time to read a sector from disk is disk access time or disk latency
• Throughput
– Amount of work that can be done over a period of time
COMP 273 Pipelines (10) Based on Patterson’s 61C
T a s k
6 PM 7 8 9 Time
30 30 30 30 30 30 30
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously using different resources
Pipelining Lessons (1/2)
A B
O C• r D•
d e r
Potential speedup = Number pipe stages Time to “fill” pipeline and time to “drain” it reduces speedup:
2.3X v. 4X in this example
COMP 273 Pipelines (11)
Based on Patterson’s 61C
6 PM 7 8 9
• Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
• Pipeline rate limited by slowest pipeline stage
• Unbalanced lengths of pipe stages also reduces speedup
Pipelining Lessons (2/2)
T a s k
B C
D
Time
30 30 30 30 30 30 30
A
O r d e r
COMP 273 Pipelines (12)
Based on Patterson’s 61C
1) IFetch:
2) Decode:
3) Execute: Mem-ref: Arith-log:
4) Memory: Load:
Steps in Executing MIPS
Fetch Instruction, Increment PC Instruction, Read Registers
Calculate Address Perform Operation
Read Data from Memory Write Data to Memory
Store:
5) Write Back: Write Data to Register
COMP 273 Pipelines (13)
Based on Patterson’s 61C
Pipelined Execution Representation
Time
IFtch
Dcd
Exec
Mem
WB
IFtch
Dcd
Exec
Mem
WB
IFtch
Dcd
Exec
Mem
WB
IFtch
Dcd
Exec
Mem
WB
IFtch
Dcd
Exec
Mem
WB
IFtch
Dcd
Exec
Mem
WB
COMP 273 Pipelines (14)
Based on Patterson’s 61C
• Every instruction must take same number of steps, also called pipeline “stages”, so some will go idle sometimes
Review: Datapath for MIPS
rd
rs rt
imm
2. Decode/ Register Read
ALU
+4
1. Instruction Fetch
3. Execute
4. Memory 5. Write Back
• Use datapath figure to represent pipeline
IFtch
Dcd
Exec
Mem
WB
I$
Reg
D$
Reg
COMP 273 Pipelines (15)
Based on Patterson’s 61C
instruction memory
PC
registers
Data memory
ALU
Graphical Pipeline Representation
(In Reg, right half highlight read, left half write) Time (clock cycles)
Load Add
Store Sub
Or
I$
Reg
D$
I$
Re
g
I$
Reg
Reg
D$
Reg
D$
Reg
I$
Reg
I$
D$
Reg
Reg
D$
Reg
COMP 273 Pipelines (16)
Based on Patterson’s 61C
ALU
ALU ALU
Instruction Order
ALU ALU
Example
• Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; what is the instruction execution rate?
• Non-pipelined Execution, consider LW and ADD:
LW: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
ADD: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns
• Pipelined Execution:
Max(IF,Read Reg,ALU,Memory,Write Reg)
= 2 ns
COMP 273 Pipelines (17)
Based on Patterson’s 61C
Problems for Pipelined Processors
• There exist Hazards that prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions
– Control hazards: Pipelining of branches & other instructions can stall the pipeline until the hazard
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
COMP 273 Pipelines (18) Based on Patterson’s 61C
Structural Hazard #1: Single Memory (1/2)
Time (clock cycles)
Load Instr 1
Instr 2 Instr 3
Instr 4
I$
Reg
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
COMP 273 Pipelines (19)
Based on Patterson’s 61C
Read same memory twice in same clock cycle
D$
ALU
ALU ALU
Instruction Order
ALU ALU
Structural Hazard #1: Single Memory (2/2) • Solution:
– Infeasible and inefficient to create an independent second memory, but can simulate this by having two Level 1 Caches
– Use an L1 Instruction Cache and an L1 Data Cache
– Need more complex hardware to control when both caches miss
COMP 273 Pipelines (20) Based on Patterson’s 61C
Structural Hazard #2: Registers (1/2)
Time (clock cycles)
lw Instr 1
Instr 2 Instr 3
Instr 4
I$
Reg
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
COMP 273 Pipelines (21)
Based on Patterson’s 61C
Can’t read and write to registers simultaneously
D$
ALU
ALU ALU
Instruction Order
ALU ALU
Structural Hazard #2: Registers (2/2)
• Fact: Register access is VERY fast: takes less than half the time of ALU stage
• Solution: modify clocking convention
– always Write to Registers during first half of each clock cycle
– always Read from Registers during second half of each clock cycle – Result: can perform Read and Write during same clock cycle
COMP 273 Pipelines (22) Based on Patterson’s 61C
Control Hazard: Branching (1/7)
Time (clock cycles)
beq Instr 1
Instr 2 Instr 3
Instr 4
I$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
D$
Reg
COMP 273 Pipelines (23)
Based on Patterson’s 61C
Reg
Where do we do the compare for the branch?
D$
ALU
ALU ALU
Instruction Order
ALU ALU
Control Hazard: Branching (2/7)
• We naively put branch decision-making hardware in ALU stage
– therefore two more instructions after the branch will always be fetched, whether or not the branch is taken
• Consider: Desired functionality of a branch
– if we do not take the branch, don’t waste any time and continue
executing normally
– if we take the branch, don’t execute any instructions after the branch, just go to the desired label
COMP 273 Pipelines (24) Based on Patterson’s 61C
Control Hazard: Branching (3/7)
• Initial Solution: Stall until decision is made
– insert “no-op” instructions that accomplish nothing, just take time
– Drawback: branches take 3 clock cycles each (assuming comparator is put in ALU stage)
COMP 273 Pipelines (25) Based on Patterson’s 61C
Control Hazard: Branching (4/7)
• Optimization #1:
– Move asynchronous comparator up to Stage 2
– As soon as instruction is decoded (Opcode identifies is as a branch), immediately make a decision and set the value of the PC (if necessary)
– Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
– Side Note: This means that branches are idle in Stages 3, 4 and 5.
COMP 273 Pipelines (26) Based on Patterson’s 61C
Control Hazard: Branching (5/7) Insert a single no-op (bubble)
add beq
lw
Time (clock cycles)
Reg
B?
bubble
I$
Reg
D$
Reg
I$
D$
Reg
I$
Reg
D$
Reg
COMP 273 Pipelines (27)
Based on Patterson’s 61C
Impact: 2 clock cycles per branch instruction, slow
ALU ALU
Instruction Order
ALU
Control Hazard: Branching (6/7) • Optimization #2: Redefine branches
– Old definition: if we take the branch, none of the instructions after the branch get executed by accident
– New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed
– This is called the branch-delay slot
COMP 273 Pipelines (28) Based on Patterson’s 61C
Control Hazard: Branching (7/7)
• Notes on Branch-Delay Slot
– Worst-Case Scenario: can always put a no-op in the branch-delay slot
– Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program
• Re-ordering instructions is a common method of speeding up programs • Compiler must be very smart in order to find instructions to do this
• Usually can find such an instruction at least 50% of the time
• Jumps also have a delay slot!
COMP 273 Pipelines (29)
Based on Patterson’s 61C
Example: Nondelayed vs. Delayed Branch
Nondelayed Branch
or $8, $9 ,$10
add $1 ,$2,$3
sub $4, $5,$6
beq $1, $4, Exit
Delayed Branch
add $1 ,$2,$3
sub $4, $5,$6
beq $1, $4, Exit
or $8, $9 ,$10
COMP 273 Pipelines (30)
Based on Patterson’s 61C
Exit:
What can we stick in the branch delay slot?
xor $10, $1,$11
xor $10, $1,$11
Exit:
Simulating Hazards • MARS can simulate delayed branches
COMP 273 Pipelines (31) Based on Patterson’s 61C
Sub will try to read $t0 before the result is written!
Data Hazards
Dependencies backwards in time are hazards
I n s t r.
O r d e r
Time (clock cycles)
add $t0,$t1,$t2 sub $t4,$t0,$t3 and $t5,$t0,$t6
or $t7,$t0,$t8 xor $t9,$t0,$t10
IF
ID/RF EX
MEM WB
I$
Reg
D$
Reg
I$
Reg
D$
I$
Reg
D$
Reg
I$
Reg
D$
Reg
I$
Reg
Reg
D$
Reg
COMP 273 Pipelines (32)
Based on Patterson’s 61C
$t0 is ready here because we wrote back in first half of clock cycle
ALU
ALU ALU
ALU ALU
Data Hazard Solution: Forwarding Forward result from one stage to another
add $t0,$t1,$t2 sub $t4,$t0,$t3 and $t5,$t0,$t6
or $t7,$t0,$t8 xor $t9,$t0,$t10
IF ID/RF
EX MEM WB
I$
D$
Reg
I$
Reg
D$
Reg
D$
I$
Reg
D$
I$
Reg
Reg
I$
Reg
Reg
D$
Reg
COMP 273 Pipelines (33)
Based on Patterson’s 61C
Reg
“or” hazard solved by register hardware
ALU
ALU ALU
ALU ALU
•
Data Hazard: Loads (1/4) Dependencies backwards in time are hazards
IF ID/RF EX MEM WB
lw $t0,0($t1) sub $t3,$t0,$t2
Can’t solve with forwarding
Must stall instruction dependent on load, then forward (more hardware)
Reg
I$
Reg
D$
I$
Reg
D$
Reg
COMP 273 Pipelines (34)
Based on Patterson’s 61C
• •
ALU ALU
Data Hazard: Loads (2/4) • Hardware must stall pipeline
• Also called “interlock”
lw $t0, 0($t1) sub $t3,$t0,$t2
and $t5,$t0,$t4 or $t7,$t0,$t6
IF
ID/RF EX
MEM WB
bubb le
bubb le
bubb le
I$
Reg
Reg
I$
Reg
Reg
D$
Reg
D$
Reg
I$
Reg
D$
COMP 273 Pipelines (35)
Based on Patterson’s 61C
D$
I$
ALU
ALU ALU
ALU
Data Hazard: Loads (3/4)
• Instruction slot after a load is called load delay slot
• If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle.
• If compiler puts an unrelated instruction in that slot, then no stall
• Letting the hardware stall the instruction in the delay slot is
equivalent to putting a nop in the slot
• Alternatively: could redefine instruction semantics to introduce a delay slot, i.e., after lw (or lb or lbu) instruction, the data isn’t actually available in the destination register until an extra instruction later
COMP 273 Pipelines (36) Based on Patterson’s 61C
Recall: Control Hazard
Decision to branch made during instruction decode, registers can be read in the 2nd half of the cycle. Comparison of values very fast with simple XNOR + AND.
Reg
D$
Reg
addi $t5 $t6 -1 beq $t0 $0 exit ori (delay slot) add $t2 $t3 $t4
Time (clock cycles)
Reg
B?
Reg
I$
D$
I$
Reg
Reg
D$
Reg
I$
Reg
D$
COMP 273 Pipelines (37)
Based on Patterson’s 61C
I$
ALU ALU
Instruction Order
ALU ALU
Data-Control Hazard?
Decision to branch might also need a result from the ALU. Can we assume that the branch comparison is fast enough to asynchronously access the ALU result in the ID stage?
Reg
D$
Reg
addi $t0 $t0 -1 beq $t0 $0 exit ori (delay slot) add $t2 $t3 $t4
Time (clock cycles)
Reg
B?
Reg
I$
D$
I$
Reg
Reg
D$
Reg
I$
Reg
D$
COMP 273 Pipelines (38)
Based on Patterson’s 61C
I$
ALU ALU
Instruction Order
ALU ALU
Data-Control Hazard?
Even with the load delay slot, we may need to asynchronously access the load from memory to make the branch decision.
Time (clock cycles)
Reg
D$
Reg
I$
lw $t0 0($t1)
ori (load delay slot) beq $t0 $0 exit
ori (branch delay slot) add $t2 $t3 $t4
Reg
B?
Reg
Reg
I$
D$
Reg
I$
D$
Reg
I$
Reg
D$
I$
Reg
D$
Reg
COMP 273 Pipelines (39)
Based on Patterson’s 61C
ALU
ALU ALU
Instruction Order
ALU ALU
Optimization (1/3)
• Now that we know what is fast and what is slow, how do we write fast programs?
– As long as we avoid hazards and corresponding pipeline stalls, we maximize instruction throughput.
• First, simplest technique: maintain locality
– Stalling a cycle or two for a control/data hazard is nothing compared
to stalling hundreds of cycles for a cache miss!
COMP 273 Pipelines (40) Based on Patterson’s 61C
Optimization (2/3) • Instruction reordering:
– Be aware of delay slots, reorder instructions to put a useful yet unrelated instruction in a delay slot.
– This is a pretty tedious task: compilers and even assemblers do a good job of doing the dirty work for you.
COMP 273 Pipelines (41) Based on Patterson’s 61C
COMP 273 Pipelines (42)
Based on Patterson’s 61C
Question
• Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full)
Loop:
lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1) addiu $s1, $s1, -4
bne $s1, $zero, Loop nop
•
How many pipeline stages (clock cycles) per loop iteration to execute this code?
Answer
• Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full)
Loop:
2 (data hazard causes stall) 1 lw $t0, 0($s1)
3 addu $t0, $t0, $s2 4 sw $t0, 0($s1)
5 addiu $s1, $s1, -4
bne $s1, $s3, Loop
nop
COMP 273 Pipelines (43)
Based on Patterson’s 61C
•
How many pipeline stages (clock cycles) per loop iteration to execute this code?
6 7
(delayed branch means we execute this instruction)
COMP 273 Pipelines (44)
Based on Patterson’s 61C
Optimization Question • Instruction Reordering Example
– The loop takes 7 clock cycles per iteration. Can you improve it?
Loop: lw $t0, 0($s1)
nop
addu $t0, $t0, $s2 sw $t0, 0($s1) addiu $s1, $s1, -4 bne $s1, $s3, Loop nop
COMP 273 Pipelines (45)
Based on Patterson’s 61C
Intruction Reordering Solution
• Reordering can reduce the number of cycles to 5 per iteration:
Loop: lw $t0, 0($s1) addiu $s1, $s1, -4 addu $t0, $t0, $s2 bne $s1, $s3, loop
sw $t0, 4($s1)
Optimization (3/3): Loop Unrolling • Branches are difficult, just avoid them if possible.
• For a loop with a fixed number of iterations, write the assembly code by just writing the body of the loop out repeatedly rather than using conditional branches.
• Tradeoff: more total code but fewer instructions executed overall and fewer branch instructions→means faster execution time.
• Can also partially unroll large loops!
COMP 273 Pipelines (46) Based on Patterson’s 61C
Loop Unrolling Example
addi $s0, $0, $0
loop: lw $t0, 0($s1)
add $s2, $s2, $t0
addi $s1, $s1, 4
addi $s0, $s0, 1
slt $t1, $s0, 4
bne $t1, $0, loop # then loop
What does this code do
# $s0 = 0
If this loop were longer, it would be useful to eliminate the counter s0 and instead keep track of the target end address s1+4n and eliminate two instructions in the loop
# if $s0 < 4
COMP 273 Pipelines (47)
Based on Patterson’s 61C
Loop Unrolling Example
lw $t0, 0($s1)
add $s2, $s2, $t0
lw $t0, 4($s1)
add $s2, $s2, $t0
lw $t0, 8($s1)
add $s2, $s2, $t0
lw $t0, 12($s1)
add $s2, $s2, $t0
COMP 273 Pipelines (48)
Based on Patterson’s 61C
lw $t0, 0($s1)
lw $t1, 4($s1)
lw $t2, 8($s1)
lw $t3, 12($s1)
add $s2, $s2, $t0
add $s2, $s2, $t1
add $s2, $s2, $t2
add $s2, $s2, $t3
Can still rearrange to use fewer registers
Loop Unrolling Example
COMP 273 Pipelines (49)
Based on Patterson’s 61C
lw $t0, 0($s1)
lw $t1, 4($s1)
add $s2, $s2, $t0
add $s2, $s2, $t1
lw $t0, 8($s1)
lw $t1, 12($s1)
add $s2, $s2, $t0
add $s2, $s2, $t1
Can still rearrange to use fewer registers
Loop Unrolling Example
COMP 273 Pipelines (50)
Based on Patterson’s 61C
Branch Prediction
• Modern processors with long pipelines also use a technique
called branch prediction.
• As soon as any branch instruction is fetched, make an instantaneous guess about whether the branch will be taken or not and keep executing assuming we guessed right.
• If we guess wrong, undo everything that we shouldn’t have done and start over at the correct place.
COMP 273 Pipelines (51) Based on Patterson’s 61C
Branch Prediction • A naïve branch prediction algorithm:
– If the branch target is backward, guess that it will be taken. If the target is forward, guess that it will not be taken.
– What’s the rationale?
• More elaborate algorithms actually track how often each branch instruction is taken and modify their behavior – called dynamic branch prediction.
COMP 273 Pipelines (52) Based on Patterson’s 61C
The Big Picture
• Although the compiler generally relies on the hardware to resolve hazards and thereby ensure correct execution, the compiler must understand the pipeline to achieve the best performance. Otherwise, unexpected stalls will reduce the performance of the compiled code.
COMP 273 Pipelines (53) Based on Patterson’s 61C
Questions
• The book uses the excellent "washing and drying clothes" analogy to explain pipelines and the three common hazards. Describe another analogy in which pipelines are useful and explain what the three hazards are in that domain. Aim for an analogy as far from clothes (and computers) as possible -- think outside the box.
• What requirements did the original MIPS processor place on compilers to avoid the need for hardware to stall the pipeline?
COMP 273 Pipelines (54) Based on Patterson’s 61C
Questions
A. Thanks to pipelining, I have reduced the time it took me to wash my shirt.
B. Longer pipelines are always a win (since less work per stage & a faster clock).
C. We can rely on compilers to help us avoid data hazards by reordering instructions.
COMP 273 Pipelines (55)
Based on Patterson’s 61C
Exception Handling
• If the datapath and control aren’t already complex enough, what about exceptions?
• On an exception, need to jump immediately to the exception handler
– Exceptions really must be precise, we can’t just arbitrarily redefine the behavior to add delay slots.
• Definitely an advanced topic, beyond the scope of this course...
COMP 273 Pipelines (56) Based on Patterson’s 61C
Superscalar Laundry: Parallel per stage More resources,
HW to match mix of parallel tasks?
T a s k
O r d e r
6PM 7 8 9 10 11 12 1 2AM
30 30 30 30 30 Time (light clothing)
A
B C D
E F
(dark clothing) (very dirty clothing)
(light clothing) (dark clothing)
(very dirty clothing)
COMP 273 Pipelines (59)
Based on Patterson’s 61C
Things to Remember (1/2)
• Optimal Pipeline
– Each stage is executing part of an instruction each clock cycle. – One instruction finishes during each clock cycle.
– On average, execute far more quickly.
• What makes this work?
– Similarities between instructions allow us to use same stages for all
instructions (generally).
– Each stage takes about the same amount of time as all others: little wasted time.
COMP 273 Pipelines (60) Based on Patterson’s 61C
Things to Remember (2/2)
• Pipelining is a BIG IDEA – widely used concept
• What makes it less than perfect?
– Structural hazards: suppose we had only one cache?
Need more HW resources
– Control hazards: need to worry about branch instructions?
Delayed branch
– Data hazards: an instruction depends on a previous instruction?
• Advanced techniques: branch prediction, out of order execution, superscalar
COMP 273 Pipelines (61) Based on Patterson’s 61C
Review and More Information • Textbook Section 4.5 and 4.6
• Hazards in Sections 4.7 and 4.8
COMP 273 Pipelines (62) Based on Patterson’s 61C
Extra Question • Implement the memory copy function
memcopy( int[] A, int[] B, int n ) {
for ( int i = 0; i < n; i++ ) A[i] = B[i];
}
– Assume a MIPS machine with 1 instruction per clock cycle, delayed branching, a 5 stage pipeline, forwarding, and interlock on unresolved load hazards
– Respect register conventions
– Use only true assembly language
– Use careful instruction ordering to make a loop that takes the shortest possible number of cycles to complete
COMP 273 Pipelines (63) Based on Patterson’s 61C
Extra Question, Part 2 • Partially unroll the memory copy function
blockcopy( int[] A, int[] B, int n ) {
for ( int i = 0; i < 4*n; i+=4 ) {
A[i] = B[i]; A[i+1] = B[i+1]; A[i+2] = B[i+2]; A[i+3] = B[i+3];
} }
– Same assumptions as before
– Respect register conventions and only use true MIPS
– Use careful instruction ordering to produce a loop that takes the shortest possible number of cycles to complete
COMP 273 Pipelines (64)
Based on Patterson’s 61C
Questions
• How much faster is blockcopy than memcopy? – Can you make memcopy do 1 word in 5 cycles?
– Can you make blockcopy do 4 words in 11 cycles? – If yes, 1.82 times faster!
COMP 273 Pipelines (65)
Based on Patterson’s 61C