Digital System Design 4
Lecture 11 – Processor Architecture 3
Computer Architecture
Dr Chang Liu
Course Outline
Week  Lecture  Topic                                     Chapter  Tutorial
1     1        Introduction
1     2        A Historical Perspective
2     3        Modern Technology and Types of Computer
2     4        Computer Performance 1
3     5        Digital Logic Review                      C
3     6        Instruction Set Architecture 1            2
4     7        Instruction Set Architecture 2            2
4     8        Processor Architecture 1                  4
5     9        Instruction Set Architecture 3            2
5     10       Processor Architecture 2                  4
               Festival of Creative Learning
6     11       Processor Architecture 3                  4
6     12       Processor Architecture 4                  4
Processor Architecture 3 – Chang Liu 2
This Lecture
• Pipelining
• Hazards
Pipelining: The Laundry Analogy
• Pipelined laundry: overlapping execution
– Parallelism improves performance
• Four loads:
– Speedup = 8/3.5 = 2.3
• Non-stop:
– Speedup = 2n/(0.5n + 1.5) ≈ 4 for large n
= number of stages
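The laundry arithmetic above can be checked with a short sketch, assuming four stages of 0.5 hours each (the function names are my own, not from the lecture):

```python
# Laundry pipelining speedup, assuming 4 stages of 0.5 h each.

def sequential_time(n_loads, stages=4, stage_time=0.5):
    """Each load runs start-to-finish before the next begins."""
    return n_loads * stages * stage_time

def pipelined_time(n_loads, stages=4, stage_time=0.5):
    """First load takes all stages; each later load adds one stage time."""
    return (stages + (n_loads - 1)) * stage_time

# Four loads: 8 h sequential vs 3.5 h pipelined
print(sequential_time(4), pipelined_time(4))             # 8.0 3.5
print(round(sequential_time(4) / pipelined_time(4), 1))  # 2.3

# Non-stop: speedup = 2n / (0.5n + 1.5), approaching 4 as n grows
n = 1000
print(round(sequential_time(n) / pipelined_time(n), 2))  # 3.99
```

For large n the speedup approaches the number of stages, which is the ideal limit the slide states.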
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
MIPS Pipelined Datapath
[Figure: MIPS pipelined datapath with IF, ID, EX, MEM, WB stages. Right-to-left flow (into MEM and WB) leads to hazards.]
Pipeline Performance
• Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle
datapath
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps
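The per-instruction totals follow directly by summing the stages each instruction actually uses; a quick sketch with the lecture's stage times:

```python
# Stage times from the lecture: 100 ps for register read/write,
# 200 ps for the other stages. Each instruction's single-cycle
# total is the sum of the stages it actually uses.

STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

USES = {
    "lw":       ["IF", "ID", "EX", "MEM", "WB"],
    "sw":       ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq":      ["IF", "ID", "EX"],
}

for instr, stages in USES.items():
    total = sum(STAGE_PS[s] for s in stages)
    print(f"{instr}: {total} ps")
# lw: 800 ps, sw: 700 ps, R-format: 600 ps, beq: 500 ps
```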
Pipeline Performance
[Figures: instruction timing for the single-cycle datapath (Tc = 800 ps) and the pipelined datapath (Tc = 200 ps)]
Pipeline Speedup
• If all stages are balanced
– i.e., all take the same time
– Time between instructions (pipelined)
= Time between instructions (non-pipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
– Latency (time for each instruction) does not
decrease
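The lecture's own stage times make the "if not balanced, speedup is less" point concrete: the pipelined cycle time is set by the slowest stage, not by the single-cycle time divided by five. A small sketch:

```python
# Unbalanced stages from the lecture (IF, ID, EX, MEM, WB).
# The pipelined clock period is the slowest stage, so speedup
# is 800/200 = 4x rather than the ideal 5x for balanced stages.

stage_ps = [200, 100, 200, 200, 100]
single_cycle = sum(stage_ps)   # 800 ps: every stage in one long cycle
pipelined = max(stage_ps)      # 200 ps: clock limited by slowest stage

print(single_cycle, pipelined, single_cycle / pipelined)  # 800 200 4.0
print(single_cycle / len(stage_ps))  # 160.0 ps if stages were balanced
```

Note the speedup is in throughput only: each individual instruction still takes five cycles (1000 ps) through the pipeline, more than its 800 ps single-cycle latency.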
Hazards
• Situations that prevent starting the next
instruction in the next cycle
• Structure hazards
– A required resource is busy
• Data hazards
– Need to wait for previous instruction to complete
its data read/write
• Control hazards
– Deciding on control action depends on previous
instruction
Structure Hazards
• Conflict for use of a resource
• In MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate
instruction/data memories
– Or separate instruction/data caches
Data Hazards
• An instruction depends on completion of data
access by a previous instruction
– add $s0, $t0, $t1
sub $t2, $s0, $t3
Forwarding (aka Bypassing)
• Use result when it is computed
– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath
Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
– If value not computed when needed
– Can’t forward backward in time!
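The distinction between a forwardable ALU dependence and a load-use hazard can be sketched with a small classifier (a hypothetical helper for illustration, not real MIPS tooling; instruction tuples are my own encoding):

```python
# Classify the dependence between two adjacent instructions:
# an ALU result can be forwarded from EX to the next EX, but a
# loaded value is only available after MEM, too late to forward
# to the next instruction's EX stage, so one stall is needed.

def classify_hazard(producer, consumer):
    """Each instruction is a tuple: (op, dest, src1, src2)."""
    op, dest, *_ = producer
    srcs = consumer[2:]
    if dest not in srcs:
        return "none"
    return "load-use stall" if op == "lw" else "forward"

print(classify_hazard(("add", "$s0", "$t0", "$t1"),
                      ("sub", "$t2", "$s0", "$t3")))   # forward
print(classify_hazard(("lw", "$t1", "0($t0)", None),
                      ("add", "$t3", "$t1", "$t2")))   # load-use stall
```

The first pair is the add/sub example from the Data Hazards slide, resolvable by forwarding; the second needs a one-cycle bubble even with forwarding.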
MIPS…
• “…Data hazards can be detected quite easily when the program’s machine code is written by
the compiler.
• The original Stanford RISC machine relied on the compiler to add the NOP instructions in this
case, rather than having the circuitry to detect and (more taxingly) stall the first two pipeline
stages. Hence the name MIPS: Microprocessor without Interlocked Pipeline Stages.
• It turned out that the extra NOP instructions added by the compiler expanded the program
binaries enough that the instruction cache hit rate was reduced. The stall hardware, although
expensive, was put back into later designs to improve instruction cache hit rate…
• at which point the acronym no longer makes sense.”
or is it?
http://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_B._Pipeline_interlock
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the
next instruction
• C code for A = B + E; C = B + F;

Unscheduled (13 cycles):
lw   $t1, 0($t0)
lw   $t2, 4($t0)
(stall)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
lw   $t4, 8($t0)
(stall)
add  $t5, $t1, $t4
sw   $t5, 16($t0)

Scheduled (11 cycles):
lw   $t1, 0($t0)
lw   $t2, 4($t0)
lw   $t4, 8($t0)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
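The cycle counts can be reproduced with a minimal model: one cycle per instruction plus four fill cycles for the five-stage pipeline, plus one bubble per load-use pair (the tuple encoding and helper are my own illustration):

```python
# Count cycles for a program on a 5-stage pipeline with forwarding:
# n instructions + 4 fill cycles + 1 stall for each load whose
# result is used by the immediately following instruction.

def cycles(program):
    total = len(program) + 4
    for prev, cur in zip(program, program[1:]):
        op, dest, *_ = prev
        if op == "lw" and dest in cur[2:]:
            total += 1   # load-use stall bubble
    return total

unscheduled = [
    ("lw",  "$t1", "$t0"),
    ("lw",  "$t2", "$t0"),
    ("add", "$t3", "$t1", "$t2"),   # uses $t2 right after its load
    ("sw",  "$t3", "$t0"),
    ("lw",  "$t4", "$t0"),
    ("add", "$t5", "$t1", "$t4"),   # uses $t4 right after its load
    ("sw",  "$t5", "$t0"),
]
scheduled = [unscheduled[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(cycles(unscheduled), cycles(scheduled))   # 13 11
```

Hoisting the third load past the first add separates both loads from their users, so the two stalls disappear and the count drops from 13 to 11 cycles.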
Control Hazards
• Branch determines flow of control
– Fetching next instruction depends on branch
outcome
– Pipeline can’t always fetch correct instruction
• Still working on ID stage of branch
• In MIPS pipeline
– Need to compare registers and compute target
early in the pipeline
– Add hardware to do it in ID stage
Stall on Branch
• Wait until branch outcome determined before
fetching next instruction
Branch Prediction
• Longer pipelines can’t readily determine
branch outcome early
– Stall penalty becomes unacceptable
• Predict outcome of branch
– Only stall if prediction is wrong
• In MIPS pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay
MIPS with Predict Not Taken
[Figures: pipeline timing when the prediction is correct (no delay) and when it is incorrect (fetched instruction discarded)]
More-Realistic Branch Prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
– Hardware measures actual branch behavior
• e.g., record recent history of each branch
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history
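One classic way to "record recent history of each branch" is a 2-bit saturating counter per branch; this sketch is my illustration of the idea, not a scheme the lecture specifies:

```python
# 2-bit saturating counter: two mispredictions in a row are needed
# to flip the prediction, so a loop branch that is taken many times
# and falls through once per iteration mispredicts rarely.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0          # 0,1 = predict not taken; 2,3 = taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 9 times, then not taken once, repeated.
p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3
hits = 0
for taken in outcomes:
    hits += p.predict() == taken
    p.update(taken)
print(f"{hits}/{len(outcomes)} correct")   # 25/30 correct
```

After warming up, the predictor misses only the single not-taken outcome at the end of each loop, which is why such history-based schemes beat static prediction on loop-heavy code.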
Next Lecture
• Pipelined Datapath
• Pipeline Control