Memory Allocation III CSE 351 Autumn 2016
Review of Last Lecture
Implementing controller for your datapath
Take decoded signals from instruction and generate control signals
Pipelining improves performance by exploiting Instruction Level Parallelism
5-stage pipeline for RISC-V: IF, ID, EX, MEM, WB
Executes multiple instructions in parallel
Each instruction has the same latency
What can go wrong???
1
CMPT 295
L29:Pipeline Hazards
1
Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
2
Hazards Ahead!
CMPT 295
L29:Pipeline Hazards
Pipelining Hazards
A hazard is a situation that prevents starting the next instruction in the next clock cycle
Structural hazard
A required resource is busy
(e.g. needed in multiple stages)
Data hazard
Data dependency between instructions
Need to wait for previous instruction to complete its data write
Control hazard
Flow of execution depends on previous instruction
3
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
Chapter 4 — The Processor
3
Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
4
CMPT 295
L29:Pipeline Hazards
Structural Hazard
Problem: Two or more instructions in the pipeline compete for access to a single physical resource
Solution 1: Instructions take turns using resource, some instructions have to stall (wait)
Solution 2: Add more hardware to machine
Can always solve a structural hazard by adding more hardware
5
CMPT 295
L29:Pipeline Hazards
I
n
s
t
r
O
r
d
e
r
Load
Add
Store
Sub
Or
Time (clock cycles)
RegFile: Used in ID and WB!
Structural Hazard: Regfile!
6
CMPT 295
L29:Pipeline Hazards
Regfile Structural Hazards
Each instruction:
can read up to two operands in decode stage
can write one value in writeback stage
Avoid structural hazard by having separate “ports”
two independent read ports and one independent write port
Three accesses per cycle can happen simultaneously
7
Clk
portW
Write Enable
32
32
portA
32
portB
5
5
5
RW
RA
RB
32 x 32-bit
Registers
CMPT 295
L29:Pipeline Hazards
Two alternate solutions:
Build RegFile with independent read and write ports (what you will do in the project; good for single-stage)
Double Pumping: split RegFile access in two! Prepare to write during 1st half, write on falling edge, read during 2nd half of each clock cycle
Will save us a cycle later…
Possible because RegFile access is VERY fast
(takes less than half the time of ALU stage)
Conclusion: Read and Write to registers during same clock cycle is okay
8
Regfile Structural Hazards
CMPT 295
L29:Pipeline Hazards
Structural Hazard: Memory Access
add t0, t1, t2
or t3, t4, t5
slt t6, t0, t3
instruction sequence
sw t0, 4(t3)
lw t0, 8(t3)
Instruction and data memory used simultaneously
Use two separate memories
9
CMPT 295
L29:Pipeline Hazards
Instruction and Data Caches
10
Processor
Control
Datapath
PC
Registers
Arithmetic & Logic Unit
(ALU)
Memory (DRAM)
Bytes
Program
Data
Instruction Cache
Data
Cache
Caches: small and fast “buffer” memories
CMPT 295
L29:Pipeline Hazards
Structural Hazards – Summary
Conflict for use of a resource
In RISC-V pipeline with a single memory unit
Load/store requires data access
Without separate memory units, instruction fetch would have to stall for that cycle
All other operations in pipeline would have to wait
Pipelined datapaths require separate instruction/data memory units
Or separate instruction/data caches
RISC ISAs (including RISC-V) designed to avoid structural hazards
e.g. at most one memory access/instruction
11
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
11
Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
12
CMPT 295
L29:Pipeline Hazards
2. Data Hazards (1/2)
Consider the following sequence of instructions:
13
add s0, s1, s2
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
Stored
during WB
Read during ID
CMPT 295
L29:Pipeline Hazards
2. Data Hazards (2/2)
14
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2
Time (clock cycles)
Identifying data hazards:
Where is data WRITTEN?
Where is data READ?
Does the WRITE happen AFTER the READ?
CMPT 295
L29:Pipeline Hazards
Solution 1: Stalling
Problem: Instruction depends on result from previous instruction
add s0, s1, s2
sub s4, s0, s3
Bubble:
effecsively NOP: affecsed pipeline ssages do “nothing” (add x0 x0 x0)
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
15
Stalls and Performance
Stalls reduce performance
Decrease throughput of “valid” or useful instructions
Can also be seen as increasing the latency of our stalled instruction
But stalls are required to get correct results
Compiler can arrange code to avoid hazards and stalls!
Requires knowledge of the pipeline structure, and knowledge of instruction interactions
16
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
16
Data Hazard Solution: Forwarding
Forward result as soon as it is available, even though it’s not stored in RegFile yet
17
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2
Forwarding: grab operand from pipeline stage, rather than register file
CMPT 295
L29:Pipeline Hazards
Forwarding (aka Bypassing)
Use result when it is computed
Don’t wait for it to be stored in a register
Requires extra hardware in the datapath (and extra control!)
18
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
18
Forwarding Path
19
IMEM
ALU
+4
DMEM
Branch Comp.
Reg[]
AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR
1
0
aluX
pcF+4
+4
pcD
pcF
pcX
pcM
instD
instX
rs1X
rs2X
aluM
rs2M
immX
Imm.
instM
instW
Forwarding Control Logic
CMPT 295
L29:Pipeline Hazards
19
Detect Need for Forwarding
(example)
add s0, s1, s2
or s3, s0, s5
sub s6, s0, s3
X
M
W
D
instX.rd
instD.rs1
20
Compare destination of older instructions in pipeline with sources of new instruction in decode stage.
Must ignore writes to x0!
CMPT 295
L29:Pipeline Hazards
20
In our 5-stage pipeline, how many subsequent instructions do we need to look at to detect data hazards for this add? Assume we have double-pumping.
1 instruction
2 instructions
3 instructions
4 instructions
5 instructions
21
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2
CMPT 295
L29:Pipeline Hazards
21
In our 5-stage pipeline, how many subsequent instructions do we need to look at to detect data hazards for this add? Assume we have double-pumping.
1 instruction
2 instructions
3 instructions
4 instructions
5 instructions
22
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2
CMPT 295
L29:Pipeline Hazards
22
Agenda
RISC-V Pipeline
Hazards
Ssructural
Data
R-type instructions
Load
Control
Superscalar processors
23
CMPT 295
L29:Pipeline Hazards
Data Hazard: Loads (1/4)
Recall: Dataflow backwards in time are hazards
Can’t solve all cases with forwarding
Must stall instruction dependent on load (sub), then forward after the load is done (more hardware)
24
sub t3, t0, t2
lw t0, 0(t1)
CMPT 295
L29:Pipeline Hazards
Data Hazard: Loads (2/4)
Hardware stalls pipeline
Called “hardware interlock”
25
This is what happens in hardware in a “hardware interlock”
sub t3, t0, t2
and t5, t0, t4
or t7, t0, t6
lw t0, 0(t1)
bubble
bubble
bubble
CMPT 295
L29:Pipeline Hazards
Data Hazard: Loads (3/4)
Stall is equivalent to nop
26
sub t3, t0, t2
and t5, t0, t4
or t7, t0, t6
lw t0, 0(t1)
nop
bubble
bubble
bubble
bubble
bubble
CMPT 295
L29:Pipeline Hazards
Data Hazard: Loads (4/4)
Slot after a load is called a load delay slot
If that instruction uses the result of the load, then the hardware will stall for one cycle
Equivalent to inserting an explicit nop in the slot
except the latter uses more code space
Performance loss
Idea: Let the compiler/assembler put an unrelated instruction in that slot → no stall!
27
CMPT 295
L29:Pipeline Hazards
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction!
RISC-V code for D=A+B; E=A+C;
28
Original Order:
lw t1, 0(t0)
lw t2, 4(t0)
add t3, t1, t2
sw t3, 12(t0)
lw t4, 8(t0)
add t5, t1, t4
sw t5, 16(t0)
Alternative:
lw t1, 0(t0)
lw t2, 4(t0)
lw t4, 8(t0)
add t3, t1, t2
sw t3, 12(t0)
add t5, t1, t4
sw t5, 16(t0)
Stall!
Stall!
13 cycles
11 cycles
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
Chapter 4 — The Processor
28
Break!
29
CMPT 295
L29:Pipeline Hazards
Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
30
CMPT 295
L29:Pipeline Hazards
3. Control Hazards
Branch (beq, bne,…) determines flow of control
Fetching next instruction depends on branch outcome
Pipeline can’t always fetch correct instruction
Result isn’t known until end of execute
Simple Solution: Stall or flush on every branch until we have the new PC value
How long must we stall?
31
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
Chapter 4 — The Processor
31
32
How many instructions after beq are affected by the control hazard?
1
2
3
4
5
beq
Instr 1
Instr 2
Instr 3
Instr 4
CMPT 295
L29:Pipeline Hazards
Branch Stall
33
How many bubbles required for branch?
beq
Instr 1
Instr 2
Instr 3
Instr 4
Time (clock cycles)
CMPT 295
L29:Pipeline Hazards
3. Control Hazard: Branching
Option #1: Move branch comparator to ID stage
As soon as instruction is decoded, immediately make a decision and set the new value of PC
Benefit: Branch decision made in 2nd stage, so only one nop is needed instead of two
Side Note: Have to compute new PC value (PC + imm) in ID instead of EX
Adds extra copy of new-PC logic in ID stage
Branches are idle in EX, MEM, and WB
34
CMPT 295
L29:Pipeline Hazards
Improved Branch Stall
35
When is comparison result available?
beq
Instr 1
Instr 2
Instr 3
Instr 4
Only one stall needed!
CMPT 295
L29:Pipeline Hazards
Data Hazard: Branches!
Recall: Dataflow backwards in time are hazards
Now that t0 is needed earlier (ID instead of EX), we can’t forward it to the beq’s ID stage
Must stall after add, then forward (more hardware)
36
beq x0, t0, foo
add t0, t0, t1
CMPT 295
L29:Pipeline Hazards
Takeaway: Moving branch comparator to ID stage would add redundant hardware and introduce new problems
Can we work with the nature of branches?
If branch not taken, then instructions fetched sequentially after branch are correct
If branch or jump taken, then need to flush incorrect instructions from pipeline by converting to NOPs
Observations
37
CMPT 295
L29:Pipeline Hazards
Agenda
Structural Hazards
Data Hazards
Forwarding
Administrivia
Data Hazards (Continued)
Load Delay Slot
Control Hazards
Branch and Jump Delay Slots
Branch Prediction
38
CMPT 295
L29:Pipeline Hazards
3. Control Hazard: Branching
RISC-V Solution: Branch Prediction – guess outcome of a branch, fix afterwards if necessary
Must cancel (flush) all instructions in pipeline that depended on guess that was wrong
How many instructions do we end up flushing?
39
CMPT 295
L29:Pipeline Hazards
39
Kill Instructions after Branch if Taken
beq t0, t1, label
sub t2, s0, t5
or t6, s0, t3
label: xxxxxx
PC updated reflecting branch outcome
40
Taken branch
Convert to NOP
Convert to NOP
Two instructions are affected by an incorrect branch, just like we’d have to insert two NOP’s/stalls in the pipeline to wait on the correct value!
CMPT 295
L29:Pipeline Hazards
Branch Prediction
beq t0, t1, label
label: …..
…..
41
Taken branch
Guess next PC!
Check guess correct
In the correct case, we don’t have any stalls/NOP’s at all!
Prediction, if done correctly, is better on average than stalling
CMPT 295
L29:Pipeline Hazards
Dynamic Branch Prediction
Branch penalty is more significant in deeper pipelines
Use dynamic branch prediction
Have branch prediction mechanism (a.k.a. branch history table) that stores outcomes (taken/not taken) of previous branches
To execute a branch
Check table and predict the same outcome for next fetch
If wrong, flush pipeline and flip prediction
42
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
Chapter 4 — The Processor
42
1-Bit Predictor: Shortcoming
Examine the code below, assuming both loops will be executed multiple times:
43
outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
Inner loop branches are predicted wrong twice!
Predict as taken on last iteration of inner loop
Then predict as not taken on first iteration of inner loop next time around
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 10; j++) {
….
}
}
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
Chapter 4 — The Processor
43
Question: For the code sequence below, choose the statement that best describes requirements for correctness
No stalls as is
A
No stalls with forwarding
B
Must stall
C
lw t0,0(t0)
add t1,t0,t0
44
CMPT 295
L29:Pipeline Hazards
44
Code Sequence 1
45
lw
add
instr
instr
instr
I
n
s
t
r
O
r
d
e
r
Time (clock cycles)
Must stall at least once! Forwarding doesn’t help us here!
CMPT 295
L29:Pipeline Hazards
Question: For the code sequence below, choose the statement that best describes requirements for correctness
No stalls as is
A
No stalls with forwarding
B
Must stall
C
add t1, t0, t0
addi t2,t0,5
addi t4,t1,5
46
CMPT 295
L29:Pipeline Hazards
46
Code Sequence 2
47
add
addi
addi
instr
instr
I
n
s
t
r
O
r
d
e
r
Time (clock cycles)
No stalls are necessary if we forward!
CMPT 295
L29:Pipeline Hazards
Question: For the code sequence below, choose the statement that best describes requirements for correctness
No stalls as is
A
No stalls with forwarding
B
Must stall
C
3:
addi t1,t0,1
addi t2,t0,2
addi t3,t0,2
addi t3,t0,4
addi t5,t1,5
48
CMPT 295
L29:Pipeline Hazards
48
Code Sequence 3
49
addi
addi
addi
addi
addi
I
n
s
t
r
O
r
d
e
r
Time (clock cycles)
No stalls as is! Our reading takes place after our write has finished!
CMPT 295
L29:Pipeline Hazards
Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
50
CMPT 295
L29:Pipeline Hazards
Increasing Processor Performance
Clock rate
Limited by technology and power dissipation
Pipelining
“Overlap” instruction execution
Deeper pipeline: 5 => 10 => 15 stages
Less work per stage → shorter clock cycle
But more potential for hazards (CPI > 1)
Multi-issue ”super-scalar” processor
Multiple execution units (ALUs)
Several instructions executed simultaneously
CPI < 1 (ideally)
51
CMPT 295
L29:Pipeline Hazards
Superscalar Processor
52
P&H p. 340
DECODE ORDER
addi t0 t0 t1
lw t2 0(a0)
sub t4 a1 a0
addi t0 t0 t1
EXECUTION ORDER
sub AND addi (1) AND load
addi (2)
COMMIT ORDER
addi t0 t0 t1
lw t2 0(a0)
sub t4 a1 a0
addi t0 t0 t1
CMPT 295
L29:Pipeline Hazards
Benchmark: CPI of Intel Core i7
53
CPI = 1
P&H p. 350
CMPT 295
L29:Pipeline Hazards
Hazards reduce effectiveness of pipelining
Cause stalls/bubbles
Structural Hazards
Conflict in use of a datapath component
Data Hazards
Need to wait for result of a previous instruction
Control Hazards
Address of next instruction uncertain/unknown
Superscalar processors use multiple execution units for additional instruction level parallelism
Performance benefit highly code dependent
Summary
54
CMPT 295
L29:Pipeline Hazards
Extra Slides
55
CMPT 295
L29:Pipeline Hazards
Pipelining and ISA Design
RISC-V ISA designed for pipelining
All instructions are 32-bits
Easy to fetch and decode in one cycle
Versus x86: 1- to 15-byte instructions
Few and regular instruction formats
Decode and read registers in one step
Load/store addressing
Calculate address in 3rd stage, access memory in 4th stage
Alignment of memory operands
Memory access takes only one cycle
56
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
56
Superscalar Processor
Multiple issue “superscalar”
Replicate pipeline stages ⇒ multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4GHz 4-way multiple-issue
16 BIPS, peak CPI = 0.25, peak IPC = 4
Dependencies reduce this in practice
“Out-of-Order” execution
Reorder instructions dynamically in hardware to reduce impact of hazards
CS152 discusses these techniques!
57
CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
57