程序代写代做代考 compiler assembler cache RISC-V Memory Allocation III CSE 351 Autumn 2016

Memory Allocation III CSE 351 Autumn 2016

Review of Last Lecture
Implementing controller for your datapath
Take decoded signals from instruction and generate control signals
Pipelining improves performance by exploiting Instruction Level Parallelism
5-stage pipeline for RISC-V: IF, ID, EX, MEM, WB
Executes multiple instructions in parallel
Each instruction has the same latency
What can go wrong???

CMPT 295
L29:Pipeline Hazards

Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
2

Hazards Ahead!

CMPT 295
L29:Pipeline Hazards

Pipelining Hazards
A hazard is a situation that prevents starting the next instruction in the next clock cycle
Structural hazard
A required resource is busy
(e.g. needed in multiple stages)
Data hazard
Data dependency between instructions
Need to wait for previous instruction to complete its data write
Control hazard
Flow of execution depends on previous instruction
3

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers

Chapter 4 — The Processor
3

Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
4

CMPT 295
L29:Pipeline Hazards

Structural Hazard
Problem: Two or more instructions in the pipeline compete for access to a single physical resource
Solution 1: Instructions take turns using resource, some instructions have to stall (wait)
Solution 2: Add more hardware to machine
Can always solve a structural hazard by adding more hardware

CMPT 295
L29:Pipeline Hazards

I
n
s
t
r

O
r
d
e
r
Load
Add
Store
Sub
Or
Time (clock cycles)
RegFile: Used in ID and WB!
Structural Hazard: Regfile!
6

CMPT 295
L29:Pipeline Hazards

Regfile Structural Hazards
Each instruction:
can read up to two operands in decode stage
can write one value in writeback stage
Avoid structural hazard by having separate “ports”
two independent read ports and one independent write port
Three accesses per cycle can happen simultaneously
7
Clk
portW

Write Enable
32
32
portA
32
portB
5
5
5
RW
RA
RB
32 x 32-bit
Registers

CMPT 295
L29:Pipeline Hazards

Two alternate solutions:
Build RegFile with independent read and write ports (what you will do in the project; good for single-stage)
Double Pumping: split RegFile access in two! Prepare to write during 1st half, write on falling edge, read during 2nd half of each clock cycle
Will save us a cycle later…
Possible because RegFile access is VERY fast
(takes less than half the time of ALU stage)
Conclusion: Read and Write to registers during same clock cycle is okay
8
Regfile Structural Hazards

CMPT 295
L29:Pipeline Hazards

Structural Hazard: Memory Access

add t0, t1, t2
or t3, t4, t5
slt t6, t0, t3

instruction sequence
sw t0, 4(t3)
lw t0, 8(t3)

Instruction and data memory used simultaneously
Use two separate memories
9

CMPT 295
L29:Pipeline Hazards
Instruction and Data Caches
10
Processor
Control
Datapath
PC

Registers

Arithmetic & Logic Unit
(ALU)
Memory (DRAM)

Bytes
Program
Data
Instruction Cache
Data
Cache
Caches: small and fast “buffer” memories

CMPT 295
L29:Pipeline Hazards

Structural Hazards – Summary
Conflict for use of a resource

In RISC-V pipeline with a single memory unit
Load/store requires data access
Without separate memory units, instruction fetch would have to stall for that cycle
All other operations in pipeline would have to wait

Pipelined datapaths require separate instruction/data memory units
Or separate instruction/data caches

RISC ISAs (including RISC-V) designed to avoid structural hazards
e.g. at most one memory access/instruction
11

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
11

Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
12

CMPT 295
L29:Pipeline Hazards

2. Data Hazards (1/2)
Consider the following sequence of instructions:
13
add s0, s1, s2
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10

Stored
during WB

Read during ID

CMPT 295
L29:Pipeline Hazards

2. Data Hazards (2/2)
14
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2
Time (clock cycles)

Identifying data hazards:
Where is data WRITTEN?
Where is data READ?
Does the WRITE happen AFTER the READ?

CMPT 295
L29:Pipeline Hazards

Solution 1: Stalling
Problem: Instruction depends on result from previous instruction
add s0, s1, s2
sub s4, s0, s3

Bubble:
effecsively NOP: affecsed pipeline ssages do “nothing” (add x0 x0 x0)

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
15

Stalls and Performance
Stalls reduce performance
Decrease throughput of “valid” or useful instructions
Can also be seen as increasing the latency of our stalled instruction
But stalls are required to get correct results
Compiler can arrange code to avoid hazards and stalls!
Requires knowledge of the pipeline structure, and knowledge of instruction interactions
16

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
16

Data Hazard Solution: Forwarding
Forward result as soon as it is available, even though it’s not stored in RegFile yet
17
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2
Forwarding: grab operand from pipeline stage, rather than register file

CMPT 295
L29:Pipeline Hazards

Forwarding (aka Bypassing)
Use result when it is computed
Don’t wait for it to be stored in a register
Requires extra hardware in the datapath (and extra control!)
18

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers
10 October, 2017
Chapter 4 — The Processor
18

Forwarding Path
19
IMEM

ALU

+4
DMEM

Branch Comp.
Reg[]

AddrA
AddrB
DataA
AddrD
DataB
DataD
Addr
DataW
DataR

1
0

aluX
pcF+4

+4
pcD
pcF
pcX
pcM
instD
instX
rs1X
rs2X
aluM
rs2M
immX

Imm.

instM
instW
Forwarding Control Logic

CMPT 295
L29:Pipeline Hazards

Detect Need for Forwarding
(example)
add s0, s1, s2
or s3, s0, s5
sub s6, s0, s3

X
M
W

D
instX.rd
instD.rs1
20
Compare destination of older instructions in pipeline with sources of new instruction in decode stage.
Must ignore writes to x0!

CMPT 295
L29:Pipeline Hazards

In our 5-stage pipeline, how many subsequent instructions do we need to look at to detect data hazards for this add? Assume we have double-pumping.

1 instruction
2 instructions
3 instructions
4 instructions
5 instructions
21
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2

CMPT 295
L29:Pipeline Hazards

In our 5-stage pipeline, how many subsequent instructions do we need to look at to detect data hazards for this add? Assume we have double-pumping.

1 instruction
2 instructions
3 instructions
4 instructions
5 instructions
22
sub s4, s0, s3
and s5, s0, s6
or s7, s0, s8
xor s9, s0, s10
add s0, s1, s2

CMPT 295
L29:Pipeline Hazards

Agenda
RISC-V Pipeline
Hazards
Ssructural
Data
R-type instructions
Load
Control
Superscalar processors
23

CMPT 295
L29:Pipeline Hazards

Data Hazard: Loads (1/4)
Recall: Dataflow backwards in time are hazards

Can’t solve all cases with forwarding
Must stall instruction dependent on load (sub), then forward after the load is done (more hardware)

24
sub t3, t0, t2
lw t0, 0(t1)

CMPT 295
L29:Pipeline Hazards

Data Hazard: Loads (2/4)
Hardware stalls pipeline
Called “hardware interlock”

25
This is what happens in hardware in a “hardware interlock”
sub t3, t0, t2

and t5, t0, t4
or t7, t0, t6
lw t0, 0(t1)

bubble

bubble
bubble

CMPT 295
L29:Pipeline Hazards

Data Hazard: Loads (3/4)
Stall is equivalent to nop
26
sub t3, t0, t2

and t5, t0, t4
or t7, t0, t6
lw t0, 0(t1)

nop

bubble
bubble
bubble
bubble
bubble

CMPT 295
L29:Pipeline Hazards

Data Hazard: Loads (4/4)
Slot after a load is called a load delay slot
If that instruction uses the result of the load, then the hardware will stall for one cycle
Equivalent to inserting an explicit nop in the slot
except the latter uses more code space
Performance loss
Idea: Let the compiler/assembler put an unrelated instruction in that slot → no stall!
27

CMPT 295
L29:Pipeline Hazards

Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction!
RISC-V code for D=A+B; E=A+C;
28
Original Order:
lw t1, 0(t0)
lw t2, 4(t0)
add t3, t1, t2
sw t3, 12(t0)
lw t4, 8(t0)
add t5, t1, t4
sw t5, 16(t0)
Alternative:
lw t1, 0(t0)
lw t2, 4(t0)
lw t4, 8(t0)
add t3, t1, t2
sw t3, 12(t0)
add t5, t1, t4
sw t5, 16(t0)

Stall!
Stall!
13 cycles
11 cycles

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers

Chapter 4 — The Processor
28

Break!
29

CMPT 295
L29:Pipeline Hazards

Agenda
RISC-V Pipeline
Hazards
Structural
Data
R-type instructions
Load
Control
Superscalar processors
30

CMPT 295
L29:Pipeline Hazards

3. Control Hazards
Branch (beq, bne,…) determines flow of control
Fetching next instruction depends on branch outcome
Pipeline can’t always fetch correct instruction
Result isn’t known until end of execute
Simple Solution: Stall or flush on every branch until we have the new PC value
How long must we stall?
31

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers

Chapter 4 — The Processor
31

32
How many instructions after beq are affected by the control hazard?

1
2
3
4
5
beq
Instr 1
Instr 2
Instr 3
Instr 4

CMPT 295
L29:Pipeline Hazards

Branch Stall
33
How many bubbles required for branch?
beq
Instr 1
Instr 2
Instr 3
Instr 4
Time (clock cycles)

CMPT 295
L29:Pipeline Hazards

3. Control Hazard: Branching
Option #1: Move branch comparator to ID stage

As soon as instruction is decoded, immediately make a decision and set the new value of PC
Benefit: Branch decision made in 2nd stage, so only one nop is needed instead of two
Side Note: Have to compute new PC value (PC + imm) in ID instead of EX
Adds extra copy of new-PC logic in ID stage
Branches are idle in EX, MEM, and WB
34

CMPT 295
L29:Pipeline Hazards

Improved Branch Stall
35
When is comparison result available?
beq
Instr 1
Instr 2
Instr 3
Instr 4

Only one stall needed!

CMPT 295
L29:Pipeline Hazards

Data Hazard: Branches!
Recall: Dataflow backwards in time are hazards

Now that t0 is needed earlier (ID instead of EX), we can’t forward it to the beq’s ID stage
Must stall after add, then forward (more hardware)

36
beq x0, t0, foo
add t0, t0, t1

CMPT 295
L29:Pipeline Hazards

Takeaway: Moving branch comparator to ID stage would add redundant hardware and introduce new problems
Can we work with the nature of branches?
If branch not taken, then instructions fetched sequentially after branch are correct
If branch or jump taken, then need to flush incorrect instructions from pipeline by converting to NOPs
Observations
37

CMPT 295
L29:Pipeline Hazards

Agenda
Structural Hazards
Data Hazards
Forwarding
Administrivia
Data Hazards (Continued)
Load Delay Slot
Control Hazards
Branch and Jump Delay Slots
Branch Prediction
38

CMPT 295
L29:Pipeline Hazards

3. Control Hazard: Branching
RISC-V Solution: Branch Prediction – guess outcome of a branch, fix afterwards if necessary
Must cancel (flush) all instructions in pipeline that depended on guess that was wrong
How many instructions do we end up flushing?
39

CMPT 295
L29:Pipeline Hazards

Kill Instructions after Branch if Taken

beq t0, t1, label
sub t2, s0, t5
or t6, s0, t3

label: xxxxxx
PC updated reflecting branch outcome
40
Taken branch
Convert to NOP
Convert to NOP
Two instructions are affected by an incorrect branch, just like we’d have to insert two NOP’s/stalls in the pipeline to wait on the correct value!

CMPT 295
L29:Pipeline Hazards

Branch Prediction

beq t0, t1, label
label: …..
…..

41
Taken branch
Guess next PC!
Check guess correct
In the correct case, we don’t have any stalls/NOP’s at all!
Prediction, if done correctly, is better on average than stalling

CMPT 295
L29:Pipeline Hazards

Dynamic Branch Prediction
Branch penalty is more significant in deeper pipelines
Use dynamic branch prediction
Have branch prediction mechanism (a.k.a. branch history table) that stores outcomes (taken/not taken) of previous branches
To execute a branch
Check table and predict the same outcome for next fetch
If wrong, flush pipeline and flip prediction
42

CMPT 295
L29:Pipeline Hazards
Morgan Kaufmann Publishers

Chapter 4 — The Processor
42

1-Bit Predictor: Shortcoming
Examine the code below, assuming both loops will be executed multiple times:
43

outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
Inner loop branches are predicted wrong twice!
Predict as taken on last iteration of inner loop
Then predict as not taken on first iteration of inner loop next time around
for (int i = 0; i < 10; i++) { for (int j = 0; j < 10; j++) { …. } } CMPT 295 L29:Pipeline Hazards Morgan Kaufmann Publishers Chapter 4 — The Processor 43 Question: For the code sequence below, choose the statement that best describes requirements for correctness No stalls as is A No stalls with forwarding B Must stall C lw t0,0(t0) add t1,t0,t0 44 CMPT 295 L29:Pipeline Hazards 44 Code Sequence 1 45 lw add instr instr instr I n s t r O r d e r Time (clock cycles) Must stall at least once! Forwarding doesn’t help us here! CMPT 295 L29:Pipeline Hazards Question: For the code sequence below, choose the statement that best describes requirements for correctness No stalls as is A No stalls with forwarding B Must stall C add t1, t0, t0 addi t2,t0,5 addi t4,t1,5 46 CMPT 295 L29:Pipeline Hazards 46 Code Sequence 2 47 add addi addi instr instr I n s t r O r d e r Time (clock cycles) No stalls are necessary if we forward! CMPT 295 L29:Pipeline Hazards Question: For the code sequence below, choose the statement that best describes requirements for correctness No stalls as is A No stalls with forwarding B Must stall C 3: addi t1,t0,1 addi t2,t0,2 addi t3,t0,2 addi t3,t0,4 addi t5,t1,5 48 CMPT 295 L29:Pipeline Hazards 48 Code Sequence 3 49 addi addi addi addi addi I n s t r O r d e r Time (clock cycles) No stalls as is! Our reading takes place after our write has finished! CMPT 295 L29:Pipeline Hazards Agenda RISC-V Pipeline Hazards Structural Data R-type instructions Load Control Superscalar processors 50 CMPT 295 L29:Pipeline Hazards Increasing Processor Performance Clock rate Limited by technology and power dissipation Pipelining “Overlap” instruction execution Deeper pipeline: 5 => 10 => 15 stages
Less work per stage → shorter clock cycle
But more potential for hazards (CPI > 1)
Multi-issue ”super-scalar” processor
Multiple execution units (ALUs)
Several instructions executed simultaneously
CPI < 1 (ideally) 51 CMPT 295 L29:Pipeline Hazards Superscalar Processor 52 P&H p. 340 DECODE ORDER addi t0 t0 t1 lw t2 0(a0) sub t4 a1 a0 addi t0 t0 t1 EXECUTION ORDER sub AND addi (1) AND load addi (2) COMMIT ORDER addi t0 t0 t1 lw t2 0(a0) sub t4 a1 a0 addi t0 t0 t1 CMPT 295 L29:Pipeline Hazards Benchmark: CPI of Intel Core i7 53 CPI = 1 P&H p. 350 CMPT 295 L29:Pipeline Hazards Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of a datapath component Data Hazards Need to wait for result of a previous instruction Control Hazards Address of next instruction uncertain/unknown Superscalar processors use multiple execution units for additional instruction level parallelism Performance benefit highly code dependent Summary 54 CMPT 295 L29:Pipeline Hazards Extra Slides 55 CMPT 295 L29:Pipeline Hazards Pipelining and ISA Design RISC-V ISA designed for pipelining All instructions are 32-bits Easy to fetch and decode in one cycle Versus x86: 1- to 15-byte instructions Few and regular instruction formats Decode and read registers in one step Load/store addressing Calculate address in 3rd stage, access memory in 4th stage Alignment of memory operands Memory access takes only one cycle 56 CMPT 295 L29:Pipeline Hazards Morgan Kaufmann Publishers 10 October, 2017 Chapter 4 — The Processor 56 Superscalar Processor Multiple issue “superscalar” Replicate pipeline stages ⇒ multiple pipelines Start multiple instructions per clock cycle CPI < 1, so use Instructions Per Cycle (IPC) E.g., 4GHz 4-way multiple-issue 16 BIPS, peak CPI = 0.25, peak IPC = 4 Dependencies reduce this in practice “Out-of-Order” execution Reorder instructions dynamically in hardware to reduce impact of hazards CS152 discusses these techniques! 57 CMPT 295 L29:Pipeline Hazards Morgan Kaufmann Publishers 10 October, 2017 Chapter 4 — The Processor 57

Related Posts