CS计算机代考程序代写 computer architecture assembler x86 compiler scheme mips arm ECE437: Introduction to Digital Computer Design

ECE437: Introduction to Digital Computer Design
Chapter 4b (Pipelining)
1st of 4 pillars of modern computer systems in 437
Applies to ALL hardware and software Spring 2021
Outline
• Pipelining
– What? Basic concepts
• Overlapping execution • Latency vs. throughput
– Why? Performance implications • Speedup
• CPI, cycletime
– How? Implementation challenges
ECE437, S’21 © Vijaykumar and Thottethodi (2)
Motivation
• We want to improve performance
– Single-cycle CPU’s CPI = 1 but long cycle time
– Can we shrink cycle time without worsening CPI?
• Simply 4x faster clock would mean CPI = 4  no improvement in performance!
• Breaking up each instr into many cycles is one (only?) way to get faster clock assuming single- cycle CPU implementation is as fast as it can be
– Break every instr into 5 cycles for fast cycle time and then overlap 5 instrs to make the CPI = 1!
• Remember CPI is EFFECTIVE CPI • We overlap through “pipelining”
ECE437, S’21 © Vijaykumar and Thottethodi (3)
RECALL: Iron law
• Time/program = instrs/program x cycles/instr x sec/cycle
– NEVER forget this!
• sec/cycle (a.k.a. cycle time, clock time)
– mostly determined by technology and CPU orgn. • cycles/instr (a.k.a. CPI)
– mostly determined by ISA and CPU organization – EFFECTIVE cycles/instr and NOT actual latency – overlap among instructions makes this smaller
• Each instr 5 cycles but 5 instrs overlap  CPI = 1
– AVERAGE over instrs (instrs have different CPI)
• instr/program (a.k.a. instruction count)
– instrs executed, NOT static code
– mostly determined by program, compiler, ISA
ECE437, S’21 © Vijaykumar and Thottethodi (4)
1

T a s k
6PM 7 8 9 10 11 12 1 2AM 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
T aA
s kB
30 30 30 30 30 30 30
Time
Pipelining
• CENTRAL computer science/engineering concept in BOTH hardware and software
– There is almost no hardware or software that does not use pipelining for performance!
Pipelining
• Ann, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
• Washertakes30minutes
• Dryertakes30minutes
• “Folder”takes30minutes
• “Stasher”takes30minutes to put clothes into drawers
ECE437, S’21 © Vijaykumar and Thottethodi (6)
ECE437, S’21 © Vijaykumar and Thottethodi
(5)
A Time B
Sequential Laundry
Pipelined Laundry
6PM 7 8 9 10 11 12 1 2AM
OC OC rD rD
de • Sequential laundry takes 8 hours for 4 loads r • If they learned pipelining, how long would
laundrytake?
ECE437, S’21 © Vijaykumar and Thottethodi (7)
de r
• Pipelined laundry takes 3.5 hours for 4 loads!
• Not esoteric computer architecture concept • Natural,Intuitive
ECE437, S’21 © Vijaykumar and Thottethodi (8)
• LaundryExample
ABCD
2

Pipelining lessons
6PM 7 8 9
Time entire workload
T • a 30 30 30 30 30 30 30
s A
k •
Multiple tasks operating simultaneously using different resources Potential speedup =
• Pipeliningdoesn’thelp latency of single task, it helps throughput of
B Number pipe steps/
O
stages
C •
Pipeline rate limited by
r D d
e r
ECE437, S’21 © Vijaykumar and Thottethodi
slowest pipeline stage • Unbalanced lengths of
pipe stages reduces
speedup
• Timeto“fill”pipeline
and time to “drain” it
reduces speedup
• Stallfordependencies
(9)
The Steps/Stages of Load
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Load Ifetch Reg/Dec Exec Mem Wr
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: compute/calculate memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
• REMEMBER THESE 5 STEPS FOREVER!
ECE437, S’21 © Vijaykumar and Thottethodi (10)
IF
Time
Exe
Mem WB
Single-cycle execution
Dec
IF
Dec
Exe
Mem WB
IF
Dec
Exe
Mem WB
Program Flow
• When an instruction is in one stage what are the other stages doing?
ECE437, S’21 © Vijaykumar and Thottethodi (11)
IF
Dec
Pipelined Execution Representation
Time
Program Flow
• What is the CPI here?
• Ideal speedup =?
ECE437, S’21 © Vijaykumar and Thottethodi (12)
IFetch
Dcd
IFetch
Exec
Dcd
IFetch
Mem
Exec
Dcd
IFetch
WB
Mem
Exec
Dcd
IFetch
WB
Mem
Exec
Dcd
IFetch
WB
Mem
Exec
Dcd
WB
Mem
Exec
WB
Mem
WB
E
3

Ideal speedup
• All instructions are executed in 1 cycle in a single-cycle datapath (i.e. CPI = 1)
• Single-cycle cycletime = t ns (say)
• Instr. Count = n
• For P pipeline stages, pipelined cycletime = t/P
• Old time (single-cycle) = n x 1 x t
• Newtime(pipelined)=nx1xt/P+2x(P-1)x t/P (fill + drain)
• Speedup = P/(1 + 2(P-1)/n)
• P is some constant, n is large  Speedup = P
(= the number of pipeline stages)
ECE437, S’21 © Vijaykumar and Thottethodi (13)
One-minute quiz
• Compared to single-cycle, which (one or more) of these three factors does pipelining improve: instruction count, CPI, clock speed?
• Previous quiz
– Iron Law: Time/prog = Instr. Count x CPI x cycle time
ECE437, S’21 © Vijaykumar and Thottethodi (14)
•
Non-uniform pipeline stages
Ideal speedup is number of stages in the pipeline. Do we achieve this?
ECE437, S’21 © Vijaykumar and Thottethodi (15)
Non-uniform stages
Maximum Speedup  Number of stages Speedup Time for unpipelined operation
Time for longest stage
UNLESS you go to asynchronous or self- timed pipelines
– Idea has been around for decades
– Not caught on because very hard to do
ECE437, S’21 © Vijaykumar and Thottethodi (16)
4

Exercise
• A single cycle processor implementation can be pipelined in two ways
• Pipeline A uses a 5-stage pipeline
– the 5 stages account for 15%, 10%, 15%, 20%, 40%
of the delays respectively
• Pipeline B uses a 3-stage pipeline
– the stages are balanced
• If instructions are all independent, which pipeline implementation is the better option
ECE437, S’21 © Vijaykumar and Thottethodi (17)
Single Cycle vs. Pipeline
Cycle 1
Clk
Single Cycle Implementation:
Load
Cycle 2
Store Waste
Cycle1 Cycle2 Cycle3 Cycle4 Cycle5 Cycle6 Cycle7 Cycle8 Cycle9Cycle10 Clk
Pipeline Implementation: Load Ifetch Reg Exec Store Ifetch Reg
R-type Ifetch
Mem Wr
Exec Mem Wr
Reg Exec Mem Wr
ECE437, S’21 © Vijaykumar and Thottethodi
(18)
What hardware is added to Single-cycle to get pipelining?
ECE437, S’21 © Vijaykumar and Thottethodi (19)
Pipelined Datapath
• Pipeline datapath with latches
ECE437, S’21 © Vijaykumar and Thottethodi (20)
5

Hardware requirements
• Similar to single cycle pipeline – Latches** to separate stages
• Real machines use latches and multiphase clocks • “Latches” == flipflops in this course
• Control
– Good news
• Minor changes from single-cycle version
ECE437, S’21 © Vijaykumar and Thottethodi (21)
More on Pipeline “latches”
phi1 phi2
• Remember construction of flip flop – Two latches, clock and inverted clock.
• 2-phase non-overlapping clocks
• 1 pipe stage uses two (level-sensitive) latches
Edge-triggered
ECE437, S’21 © Vijaykumar and Thottethodi
phi1 phi2 phi1
(22)
Ideal speedup with latches?
• Suppose we execute N instructions • Single Cycle Machine
–45ns/cycle x1CPIxNinst=45Nns • Ideal pipelined machine
– 5 stages but assume slightly slower clock at 10ns and not 9ns (due to latch overhead)
– 10 ns/cycle x (1 CPI x N inst + 8 cycle fill+drain) = 10N + 80 ns = ~10N for large N
• Speedup = 4.5
ECE437, S’21 © Vijaykumar and Thottethodi (23)
Resources other than latches are already there!
Reg Dm
Reg Im
nIInst0Im Reg s
Time (clock cycles)
Dm
Reg Reg
Im Reg Dm Reg
Im Dm Reg
t Inst 1 r.
Im
O Inst2 r
d Inst 3 e
r
Inst 4
ECE437, S’21 © Vijaykumar and Thottethodi
Reg
(24)
Dm Reg
6
ALU
ALU ALU
ALU ALU

One important thing about latency vs. throughput
• Pipelining does a strange thing
• To improve program latency, pipelining improves instruction throughput without improving instruction latency
– May actually worsen instruction latency – E.g., Pipeline Latch overheads
ECE437, S’21 © Vijaykumar and Thottethodi (25)
Pipelined Processor Design
• Designing a pipelined processor
– Associate resources with states
– Resources not necessarily indivisible or full-cycle
• Register reads and writes can happen in the same cycle
• Writes in first half of cycle and reads in the second half – we will see later
– Assert appropriate controls in each stage
• Make sure all necessary information is carried through the pipeline along with the instruction
ECE437, S’21 © Vijaykumar and Thottethodi (26)
Shading convention
• Shade first half => write shade second half => read
• ALU full-cycle
• Register file can handle read/write in the same cycle
• Memory can handle only 1 read or 1 write per cycle
ECE437, S’21 © Vijaykumar and Thottethodi (27)
Pipelining the Load Instruction
Clock 1st lw
Ifetch 2nd lw
Reg/Dec Ifetch
3rd lw
Exec Reg/Dec Ifetch
Mem
Exec Reg/Dec
Wr Mem Exec
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Wr
Mem Wr
• Thefiveindependentblocksinthepipelinedatapath are:
– Instruction Memory for the Ifetch stage
– RegisterFile’sReadports(busAandbusB)fortheReg/Dec
stage
– ALUfortheExecstage
– DataMemoryfortheMemstage
– Register File’s Write port (bus W) for the Wr stage
ECE437, S’21 © Vijaykumar and Thottethodi (28)
7

Pipelined Datapath
• Pipeline datapath with latches
ECE437, S’21 © Vijaykumar and Thottethodi (29)
lw
Instruction fetch
0 M u x 1
Add 4
MEM/WB
1 M u x 0
Instruction Fetch
IF/ID ID/EX
EX/MEM
Add
Add result
Zero
ALU ALU result
Shift left 2
0
M u x
PC
Address
Instru mem
ction ory
Read register 1
Read register 2
Read data 1
Registers Read Write data 2
Write data
register
Read
data
ta ory
1
Address
Da
mem
Write data
ECE437, S’21 © Vijaykumar and Thottethodi (30)
16
Sign extend
32
Address
Instru mem
0 M u x 1
Instruction Decode/Reg Read
lw
Instruction decode
ction ory
IF/ID
ID/EX
EX/MEM
MEM/WB
1 M u x 0
Add 4
Read register 1
Read register 2
Registers Read Write data 2
Write data
Shift left 2
0 M u x 1
Add
Add result
Zero
ALU ALU result
PC
register
ECE437, S’21 © Vijaykumar and Thottethodi (31)
Read data1
Address
Da me
Write data
Read
data
ta mory
16
Sign extend
32
0 M u x 1
Execute (Address calculation)
lw
Execution
MEM/WB
1 M u x 0
Add 4
PC
Address
Instru mem
IF/ID ID/EX
EX/MEM
Read register 1
Read register 2
Registers Read Write data 2
Write data
A
dd
Add result
Zero
ALU ALU result
Shift left 2
M u x
ction ory
register
ECE437, S’21 © Vijaykumar and Thottethodi (32)
Read data 1
0
Address
Da
mem
Write data
Read
data ta
ory
1
16
Sign extend
32
8
Instruction
Instruction
Instruction

Memory Access
lw
Memory
0 M u x 1
Add 4
MEM/WB
1 M u x 0
IF/ID ID/EX
EX/MEM
Add result
Zero ALU ALU result
M u x
A
dd
Shift left 2
PC
Address
Instru mem
Read register 1
Read register 2
Registers Read Write data 2
Write data
Read data1
ction ory
0
1
Address Da
mem
Write data
Read data
ta ory
register
ECE437, S’21 © Vijaykumar and Thottethodi (33)
16
Sign extend
32
IF/ID ID/EX
EX/MEM
Writeback
0 M ux 1
Add 4
lw
W r i t e b a c k
Shift left 2
M u x
A
dd
Add result
Zero ALU ALU result
PC
MEM/WB
1 M u x 0
Address
Instru mem
Read register 1
Read register 2
Registers Read Write data 2
Write data
Read data1
ction ory
0
Address Da
mem
Write data
Read ta data
ory
register
1
ECE437, S’21 © Vijaykumar and Thottethodi
(34)
16
Sign extend
32
Pipeline latches
• Length of Pipeline latches • Book says (Fig 4.35)
– IF/ID : IR (32), PC+4 (32) : 64 bits
– ID/EX: IR (32), PC+4 (32) + RegA + RegB : 128 bits – EX/MEM: ALUout(32) + Equal(1) + PC+4+SX(imm)
(32) : 97
– MEM/WB: ALUout (32) + MemData(32) : 64
• Inaccurate, will be refined:
– (see next slide)
– Other control bits (IR not going through)
ECE437, S’21 © Vijaykumar and Thottethodi (35)
Corrected Datapath for lw
• Carry destination register through pipeline latches to WB stage
ECE437, S’21 © Vijaykumar and Thottethodi (36)
9
Instruction
Instruction

The Four Stages of R-type
Cycle1 Cycle2 Cycle3 Cycle4
R-type Ifetch Reg/Dec Exec Wr
• Ifetch:InstructionFetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode • Exec:
– ALU operates on the two register operands
– Update PC
• Wr:WritetheALUoutputbacktotheregisterfile
ECE437, S’21 © Vijaykumar and Thottethodi (37)
Pipelining R-type and Loads
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
R-type Ifetch
R-type
Reg/Dec Ifetch
Load
Exec Reg/Dec Ifetch
R-type
Wr
Exec Reg/Dec Ifetch
R-type
Wr
Exec Reg/Dec Ifetch
Oops! We have a problem!
Mem Wr
Exec Wr
Reg/Dec Exec Wr
• Wehavepipelineconflictor“structuralhazard”:
– Twoinstructionstrytowritetotheregisterfileatthesame
time!
– Onlyonewriteport
ECE437, S’21 © Vijaykumar and Thottethodi (38)
Key observation
• Eachhardwareblockcanonlybeusedonceper instruction
• Eachhardwareblockmustbeusedatthesamestage for all instructions:
– Load uses Register File’s Write Port during its 5th stage 12345
Load Ifetch Reg/Dec Exec Mem Wr
– R-type uses Register File’s Write Port during its 4th stage 1234
R-type Ifetch Reg/Dec Exec Wr
– 2 ways to solve this pipeline hazard ECE437, S’21 © Vijaykumar and Thottethodi (39)
Soln.1: Insert “Bubble”
Cycle 1 Cycle 2
Ifetch Load
Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
Wr
Exec Mem R-type Ifetch Reg/Dec Exec
Reg/Dec Ifetch
Exec Reg/Dec
Wr
Wr
R-type Ifetch Reg/Dec Pipeline Exec Wr R-type Ifetch Bubble Reg/Dec Exec
Wr Ifetch Reg/Dec Exec
• Inserta“bubble”intothepipelinetoprevent2writes at the same cycle
– Thecontrollogiccanbecomplex.
– Lose instruction fetch opportunity.
• NoinstructionisstartedinCycle6–CPIincreases!
ECE437, S’21 © Vijaykumar and Thottethodi (40)
10

Soln.2: Delay R-type’s Write
• Delay R-type’s register write by one cycle:
– Now R-type instructions also use Reg File’s write port at Stage 5
– Mem stage is NOOP for R-type: nothing done 12345
R-type Ifetch Reg/Dec Exec
Mem
Cycle 6
Wr
Mem
Exec
Reg/Dec Exec
(41)
Cycle 1 Cycle 2
Cycle 3 Cycle 4 Cycle 5
Wr
Cycle 7 Cycle 8
Cycle 9
Wr
Clock
R-type
Ifetch R-type
Reg/Dec Ifetch
Load
Exec Reg/Dec Ifetch
R-type
Mem Wr
Exec Mem Reg/Dec Exec
Ifetch Reg/Dec R-type Ifetch
Mem
Wr
Mem
Wr
ECE437, S’21 © Vijaykumar and Thottethodi
Does soln. 2 lose performance?
• DoesitincreaseR-type’slatency? • Does it worsen R-type’s CPI?
• Whatisthebottomlineperformanceconcernin pipelining?
ECE437, S’21 © Vijaykumar and Thottethodi (42)
Modified RTL & Datapath
SA + B;
SA or ZX;
SA + SX;
If Cond
PC  PC+SX;
MS R[rd]  M;
MS R[rt]  M;
IRMem[PC]; PC <– PC+4; AR[rs]; B<– R[rt] MMem[S] Mem[S]B R[rt]  M; ASM B D ECE437, S'21 © Vijaykumar and Thottethodi (43) Four Stages of Store Cycle1 Cycle2 Cycle3 Cycle4 Store Ifetch Reg/Dec Exec Mem Wr • Ifetch:InstructionFetch – Fetch the instruction from the Instruction Memory • Reg/Dec: Registers Fetch and Instruction Decode • Exec:Calculatethememoryaddress • Mem:WritethedataintotheDataMemory ECE437, S'21 © Vijaykumar and Thottethodi (44) 11 Mem Access Data Mem Next PC PC Inst. Mem IR Reg File Exec Reg. File Equal 0 M u x 1 Exec Stage of Store sw Execution IF/ID ID/EX EX/MEM MEM/WB 1 M u x 0 PC Add 4 Reg B and SX(Imm) are both needed Address Instru me Read register 1 Read register 2 Registers Read Write data 2 Write data • register ECE437, S'21 © Vijaykumar and Thottethodi (45) Read data1 Add Add result Zero ALU ALU result Shift left 2 0 M u x ction mory 1 Address Da me Write data Read data ta mory 16 Sign extend 32 0 M u x 1 Add 4 sw Memory Mem stage of Store IF/ID ID/EX EX/MEM PC MEM/WB 1 M u x 0 Address Instru mem Read data 1 Shift left 2 Add result Zero ALU ALU result M u x A dd ction ory Read register 1 Read register 2 Registers Read Write data 2 Write data 0 1 Address Da mem Write data Read data ta ory register 16 Sign extend 32 ECE437, S'21 © Vijaykumar and Thottethodi (46) FourStages of Beq Cycle1 Cycle2 Cycle3 Cycle4 Beq Ifetch Reg/Dec Exec Mem Wr • Ifetch: Instruction Fetch – Fetch the instruction from the Instruction Memory • Reg/Dec: – Registers Fetch and Instruction Decode • Exec: – compares the two register operand, – Compute branch target • Mem – select correct branch target address – latch into PC • There is more to this ......later ECE437, S'21 © Vijaykumar and Thottethodi (47) 10 lw 14 addI 20 sub 24 beq 30 ori 34 add 100 and r1, r2(35) r2, r2, 3 r3, r4, r5 r6, r7, 100 r8, r9, 17 r10, r11, r12 r13, r14, 15 these addresses are OCTAL (48) ECE437, S'21 © Vijaykumar and Thottethodi Visualizing the pipeline 12 Instruction Instruction Start: Fetch 10 nn IR nn WB Ctrl Mem Ctrl rs rt im A B S D M 10 lw 14 addI 20 sub 24 beq 30 ori 34 add 100 and r1, r2(35) r2, r2, 3 r3, r4, r5 r6, r7, 100 r8, r9, 17 r10, r11, r12 r13, r14, 15 ECE437, S'21 © Vijaykumar and Thottethodi (49) Fetch 14, Decode 10 n nn WB Ctrl IR 2 rt im A B M S D 10 lw 14 addI 20 sub 24 beq 30 ori 34 add 100 and r1, r2(35) r2, r2, 3 r3, r4, r5 r6, r7, 100 r8, r9, 17 r10, r11, r12 r13, r14, 15 ECE437, S'21 © Vijaykumar and Thottethodi (50) Mem Ctrl Fetch 20, Decode 14, Exec 10 nn WB Ctrl Mem Ctrl IR 2 rt r2 B M S D 10 lw 14 addI 20 sub 24 beq 30 ori 34 add 100 and r1, r2(35) r2, r2, 3 r3, r4, r5 r6, r7, 100 r8, r9, 17 r10, r11, r12 r13, r14, 15 ECE437, S'21 © Vijaykumar and Thottethodi (51) Fetch 24, Decode 20, Exec 14, Mem 10 n WB Ctrl em Ctrl M IR 45 r2 B M 10 lw D 14 addI 20 sub 24 beq 30 ori 34 add 100 and r1, r2(35) r2, r2, 3 r3, r4, r5 r6, r7, 100 r8, r9, 17 r10, r11, r12 r13, r14, 15 ECE437, S'21 © Vijaykumar and Thottethodi (52) 13 PC 20 PC 10 PC 24 r2+35 lw r1 PC 14 Next PC Exec Next PC Exec Next PC Exec Next PC Exec Mem Access Mem Access Data Mem Data Mem Mem Access Mem Access Data Mem Data Mem Reg File Decode Reg File Decode Reg. File Reg. File Reg File Decode Reg File Decode Reg. File Reg. File 35 lw r1 3 addI r2, r2, 3 Inst. Mem Inst. Mem addI r2, r2, 3 Inst. Mem Inst. Mem sub r3, r4, r5 lw r1, r2(35) Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 --- deal with branch later IR 67 r4 r5 Mem Ctrl WB Ctrl 10 lw D 14 addI 20 sub 24 beq 30 ori 34 add 100 and r1, r2(35) r2, r2, 3 r3, r4, r5 r6, r7, 100 r8, r9, 17 r10, r11, r12 r13, r14, 15 ECE437, S'21 © Vijaykumar and Thottethodi (53) One-minute quiz • What is the speedup over single-cycle? – 5-stage pipeline with per-stage delays in terms of the single-cycle clock of 10%, 15%, 20%, 20%, and the remaining – On top of these delays, the pipeline latch overhead is 5% of single-cycle clock • FYI: Real latch overhead is 1-2% • Previous quiz – Compared to single-cycle, which (one or more) of these three factors does pipelining improve: instruction count, CPI, clock speed? Clock speed ECE437, S'21 © Vijaykumar and Thottethodi (54) 0 M u x 1 Add 4 PC Recap: Datapath IF/ID ID/EX EX/MEM MEM/WB 1 M u x 0 Add result Zero ALU ALU result M u x A dd Shift left 2 Read register 1 Read register 2 Registers Read Write data 2 Write data Address Instru me ction mory 0 1 Address Da me Write data Read data ta mory register Read data 1 ECE437, S'21 © Vijaykumar and Thottethodi (55) 16 Sign extend 32 Outline • Pipeline Datapath – Under simplifying assumptions • All independent instructions • Dependencies – later – Datapath not yet capable of handling dependencies • Walk-through • Control – Most complicated (read as ‘irregular/unstructured’) – Not that bad for pipelined processors • Similar to single cycle control ECE437, S'21 © Vijaykumar and Thottethodi (56) 14 Instruction Reg File Decode Next PC Exec PC 30 Mem r2+3 addI r2 Access Data Mem M[r2+35] lw r1 Reg. File Inst. Mem beq r6, r7 100 sub r3 Pipelined Datapath with Controls 0 M u x 1 Add 4 IF/ID ID/EX EX/MEM B Add Add result ZeZreoro ALU ALU result ranch PCSrc MemWrite MEM/WB MemtoReg 1 M u x 0 RegWrite Shift left 2 RegDst PC Address Instruction memory Read register 1 Read register 2 Read data 1 Registers Read Write data 2 Write data Address Read register Write data data memory Data Instruction [15–0] 16 truction [20– 16] truction [15– 11] Sign extend 32 ALUSrc 0 M u x 1 6 0 M u x 1 ALU control Ins Ins MemRead ECE437, S'21 © Vijaykumar and Thottethodi (57) ALUOp Pipeline Control vs. Single cycle control • Similar (generation) • What about – PCWrite? IR-write? – Write enable for the pipeline latches? • Pipelined processor implementation – Stages common to all instructions • Instruction fetch stage (IF) • Decode and Register read (ID) – Instruction-specific stages • Execute (Exec) • Memory access (Mem) • Write back (WB) ECE437, S'21 © Vijaykumar and Thottethodi (58) Generating controls • Simplify the problem – Reduce pipeline control to single-cycle control (almost) – Generate controls once – Consume (i.e., use and discard) signals as you proceed along the pipeline stages • Identify Stage of consumption for all control signals ECE437, S'21 © Vijaykumar and Thottethodi (59) Focus on Control Exec ALUSrc ALUOp RegDst MemRd MemWr Branch MemtoReg RegWr Reg/Dec ALUSrc ALUOp RegDst MemRd MemWr Branch MemtoReg RegWr Mem MemRd MemWr Branch MemtoReg RegWr Wr MemtoReg RegWr Main Control • The Main Control generates the control signals during Reg/Dec – Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later – Control signals for Mem (MemWr Branch) are used 2 cycles later – Control signals for Wr (MemtoReg MemWr) are used 3 cycles later ECE437, S'21 © Vijaykumar and Thottethodi (60) 15 Instruction Mem/Wr Register Ex/Mem Register ID/Ex Register IF/ID Register Meaning of controls • RegWr : 1-> write, 0-> no write
• MemToReg : 1-> MDR, 0-> ALUOut
• RegDst : 1->rd, 0->rt
• ALUOp<1:0> : 00->Add,01->Sub,10->’funct’ • ALUSrc : 0->RegB, 1->SX(Imm)
• Branch: 0->non-branch inst, 1-> branch
• MemRead: 1->memread, 0-> no memread
• MemWrite: 1->memwrite, 0->no memwrite • ALU control abstracted away (as before)
– Inputs: ALUop (2 bits), 6 “funct” bits from IR
ECE437, S’21 © Vijaykumar and Thottethodi (61)
Extended pipeline latches
Control
ID/EX EX/MEM MEM/WB
• Simple extensions to carry control state
ECE437, S’21 © Vijaykumar and Thottethodi (62)
WB
Instruction
M
EX
WB
M
WB
IF/ID
Pipeline Walkthrough with controls
• Use walkthrough worksheets – Use code segment shown
– Fill in controls
– Interesting stages
• Controls generated in Decode stage
• Controls consumed in subsequent stages
ECE437, S’21 © Vijaykumar and Thottethodi (63)
lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
IF: lw $10, 20($1)
ID: before<1>
EX: before<2>
MEM: before<3>
WB: before<4>
1 M u x 0
Pipeline Walkthrough with controls
4
Add
IF/ID
ID/EX
WB M
EX/MEM
MEM/WB
0 WB 0
0 M u x 1
00 000 0000
00
Control
WB
000 00
00
0 00
EX
00
M
0
M u x 1
Ad
Add d result
ALUSrc
Shift left2
Branch
Address
Instru mem
Read
register 1 Read Read data 1 register 2
Registers Read Write data 2
Write data
ction ory
register
Zero
ALU ALU result
Read data
Write data
Address memory
Data
truction
[15– 0] Sign
Ins
truction [20–16]
Ins
extend
0
ALU control
ALUOp
MemRead
Ins
M truction u
[15–11] x 1
RegDst
PC
lw
$10, 20($1)
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 1
ECE437, S’21 © Vijaykumar and Thottethodi
(64)
16
WB <1:0> Exec <3:0>
Mem <2:0>
Instruction
MemtoReg
MemWrite
RegWrite

Pipeline Walkthrough with controls
IF: sub $11, $2, $3
ID: lw $10, 20($1)
EX: before<1>
MEM: before<2>
WB: before<3>
1 M u x 0
0
M u
IF/ID
x
1
l
Control
000 00
00
00 0
ID/EX
WB M
EX/MEM
WB
MEM/WB
0 WB 0
11 010 0001
$1 $X
00
EX M 00
1 X
w
Branch
Address
Instru mem
Read register 1
Read register 2
Read data 1
ction ory
Registers Read Write data 2
Write data
register
Read data
Write data
Address memory
Data
Instruction
[15–0] Sign 20
Shift left 2
0
M u x 1
0 M u
Add Add result
ALUSrc
Zero
ALU ALU result
ALU control
20
MemRead
Ins 10
extend [20–16] 10
truction
ALUOp
X
Instruction
[15–11] X x
1
RegDst
PC
4
Add
lw
$10, 20($1)
sub $11, $2, $3
and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 2
ECE437, S’21 © Vijaykumar and Thottethodi
(65)
IF: and $12, $4, $5
ID: sub $11, $2, $3
EX: lw $10, . . .
MEM: before<1>
WB: before<2>
1 M u x 0
Pipeline Walkthrough with controls
Add
4
0
M u
IF/ID
EX/MEM
WB
MEM/WB
0 WB 0
x
1
sub
Control
10 000 1100
ID/EX
WB M
11
010 00
00 00 0 10
EX
M
Shift left 2
$2 $1
Ad
Add d result
ALUSrc
Branch
Address
Instru mem
2 3
Read
register 1
Read data 1 register 2
Registers Read Write data 2
Write data
Read
ction ory
0
register
$3
M u x 1
Zero
ALU ALU result
Read data
Write data
Address memory
Data
Ins
X
truction
[15–0] Sign X
20
10
M u
ALU control
ALUOp
Ins
X
truction [20– 16]
X
extend
0
[15–11] 11 x 1
MemRead
Ins
11
truction
RegDst
PC
lw
$10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7 add $14, $8, $9
Clock 3
ECE437, S’21 © Vijaykumar and Thottethodi
(66)
IF: or $13, $6, $7
ID: and $12, $2, $3
EX: sub $11, . . .
MEM: lw $10, . . .
WB: before<1>
1 M u x 0
Pipeline Walkthrough with controls
Add
IF/ID
ID/EX EX/MEM
11
WB
0
1
0
MEM/WB
0 WB 0
0 M u x 1
and
Control
10 000 1 100
WB M EX
10
000
1 10 0
d r e As ud l dt
ALUSrc
M
Ad
Shift left 2
$4 $2
Branch
4 5
Address
Instru mem
Read register 1
Read register 2
Read data 1
ction ory
Registers Read Write data 2
Write data
$5 $3
0
Address
Write data
Read
data Data
memory
register
M u x 1
Zero
ALU ALU result
X
Instruction
[15–0] Sign X
MemRead
Ins X
truction [20– 16]
X
extend
0
M truction u [15–11] 12 11 x 1
ALU control
ALUOp
Ins 12
10
RegDst
PC
4
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 4
$10, 20($1)
ECE437, S’21 © Vijaykumar and Thottethodi
(67)
4
Pipeline Walkthrough with controls
IF: add $14, $8, $9
ID: or $13, $6, $7
EX: and $12, . . .
MEM: sub $11, . . .
WB: lw $10, . . .
1 M u x 0
Add
IF/ID
EX/MEM
WB
MEM/WB
1 WB 1
0 M u x 1
or
Control
10 000 1100
ID/EX
WB M
10
000 10
10 10 0 00
EX
M
Ad
Add d result
ALUSrc
Shift left 2
$6 $4
Branch
Address
Instruction memory
6 7 10
Read register 1
Read register 2
Regist Write
register
Write data
Read data 1
ers Read data 2
$7 $5
Read data
Write data
Address memory
M u x 1
Z ero
ALU ALU result
Ins
X
Ins
X
truction
[15–0] Sign X
0
0
Data
ALU control
ALUOp
MemRead
truction [20– 16]
X
extend
Ins
13
M truction u
[15–11] 13 12 x 1
11
10
RegDst
PC
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 5
$10, 20($1)
ECE437, S’21 © Vijaykumar and Thottethodi
(68)
17
Instruction
Instruction
MemtoReg
Instruction
Instruction
MemtoReg
MemtoReg
MemWrite
MemWrite
MemWrite
RegWrite
RegWrite
RegWrite
RegWrite
MemtoReg
MemWrite

Pipeline Walkthrough with controls
Pipeline Walkthrough with controls
IF: after<1>
4
PC
$10, 20($1)
ID: add $14, $8, $9
EX: or $13, . . .
MEM: and $12, . . .
WB: sub $11, .
1
M u x 0
Add
0
M u
IF/ID
ID/EX EX/MEM
10
WB
0 0 0
MEM/WB
1 WB 0
x 1
add
Control
10 000 1100
WB M EX
10
000
1 10 0
Shift left 2
$8 $6
Ad
d reAsudldt
ALUSrc
M
Branch
Read register 1
Read register 2
Read data 1
Address
Instruction memory
8 9 11
Registers Read Write data 2
Write data
$9 $7
0
Read data
Write data
Address memory
register
M u x 1
Zero
ALU ALU result
Data
X
Instruction
[15–0] Sign X
extend
truction
[20–16] X
0
M truction u [15–11] 14 13 x 1
ALU control
ALUOp
MemRead
Ins X
Ins 14
12
11
RegDst
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 6
ECE437, S’21 © Vijaykumar and Thottethodi
(69)
IF: after<2>
4
PC
lw
sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 7
ID: after<1>
EX: add $14, . . .
MEM: or $13, . . .
WB: and $12,
1
M u x 0
Add
0
M u
IF/ID
ID/EX EX/MEM
MEM/WB
1 WB 0
x 1
Control
00 000 0000
WB M
10
000
1 10 0
10
WB
0 0 0
EX
M
Shift left 2
$8
Ad
Add d result
ALUSrc
Branch
Address
Instru mem
Read register 1
Read register 2
Regist Write
register
Write data
ction ory
Read data 1
ers Read data 2
$9
12
0
Read data
memory
Write data
Address
Data
M u x 1
Zero
ALU ALU result
truction
[15– 0] Sign
Ins
truction [20–16]
extend
0
M
truction u [15–11] 14 x
ALU control
ALUOp
MemRead
Ins
Ins
1
RegDst
13
12
$10, 20($1)
ECE437, S’21 © Vijaykumar and Thottethodi
(70)
Pipeline Walkthrough with controls
IF: after<3>
4
PC
lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 8
ID: after<2>
EX: after<1>
MEM: add $14, . . .
WB: or $13, . . .
1
M u x 0
Pipeline Walkthrough with controls
Add
0
M u
IF/ID
EX/MEM
10
WB
0 0 0
MEM/WB
1 WB 0
x 1
Control
00 000 0000
ID/EX
WB M EX
00
000
0 00 0
M
Add reAsudldt
ALUSrc
Shift left 2
Branch
Address
Instru mem
Read register 1
Read register 2
Regis
Write register
Write data
Read data 1
ters Read data 2
ction ory
13
0
M u x 1
Zero
ALU ALU result
Read data
Write data
Ins
0
M truction u [15–11] x 1
Address memory
Data
Instruction
[15– 0] Sign
ALU control
ALUOp
MemRead
truction [20–16]
extend
Ins
R
egDst
14
13
ECE437, S’21 © Vijaykumar and Thottethodi
(71)
IF: after<4>
4
PC
lw $10, 20($1) sub $11, $2, $3 and $12, $4, $5 or $13, $6, $7 add $14, $8, $9
Clock 9
ID: after<3>
EX: after<2>
MEM: after<1>
WB: add $14, .
1
M u x 0
Add
0
M u
IF/ID
ID/EX EX/MEM
MEM/WB
1 WB 0
x 1
Control
00 000 0000
WB M EX
00
000
0 00 0
00
WB
0 0 0
M
Add Add result
ALUSrc
Shift left 2
Branch
Address
Instru mem
Read register 1
Read register 2
Regis
Write register
Write data
Read data 1
ters Read data 2
ction ory
14
0
Read data
Write data
Address memory
M u x 1
Zero
ALU ALU result
Data
Instruction
[15– 0] Sign
ALU control
ALUOp
truction [20–16]
Ins
extend
0
M truction u [15–11] x 1
MemRead
Ins
Re
gDst
14
ECE437, S’21 © Vijaykumar and Thottethodi
(72)
18
Instruction
MemtoReg
Instruction
Instruction
MemtoReg
MemtoReg
MemWrite
MemWrite
MemWrite
RegWrite
RegWrite
RegWrite
Instruction
MemtoReg
MemWrite
RegWrite

Implementing Pipeline control
• How do we design the control logic block?
– Similar to single-cycle implementation – Derive logic expressions
• E.g. MemtoReg = lw
• ALUSrc = lw OR sw
• RegWrite = R-type OR lw
– Implement Combinational logic • PLA implementation
• ROM implementation
ECE437, S’21 © Vijaykumar and Thottethodi (73)
Pipeline registers
• Preliminary estimates
– IF/ID : IR (32), PC+4 (32) : 64 bits
– ID/EX: IR (32), PC+4 (32) + RegA + RegB : 128 bits
– EX/MEM: ALUout(32) + zero(1) + PC+4+SX(imm)
(32) : 97
– MEM/WB: ALUout (32) + MemData(32) : 64
• Corrections:
– ALUout and MemData
– Destination register (5 bits)
– Other control bits (IR not going through)
ECE437, S’21 © Vijaykumar and Thottethodi (74)
Implementing Pipeline Control
• Exercise
– Compute required bit-width of pipeline latches
– Before the next lecture
• Keep memory controller (arbiter) SEPARATE from pipeline (as you did for single-cycle)
– In IF and MEM, send request to arbiter
ECE437, S’21 © Vijaykumar and Thottethodi (75)
Summary
• Only slightly more complicated than single cycle
– NOT really, seemingly so because
• We ignored complications of DEPENDENCIES
– Need deeper understanding of dependencies
• need to modify datapath as well
ECE437, S’21 © Vijaykumar and Thottethodi (76)
19

• Definitely:
Hard part of pipelining
– MUST maintain “illusion” of sequential execution – Execution is actually overlapped.
• Pipeline Hazards
– structural hazards: attempt to use the same resource two different
ways at the same time
– data hazards: attempt to use item before it is ready
• instruction depends on result of prior instruction still in the pipeline
– control flow hazards: attempt to make a decision before condition is
evaulated
• branch instructions
ECE437, S’21 © Vijaykumar and Thottethodi (77)
Hazards
• Structural hazards
– Two instructions need the same hardware at the same time
• Data Hazards
– Data not ready
• Control flow Hazards
– Which instruction to fetch? Not known.
ECE437, S’21 © Vijaykumar and Thottethodi (78)
Hazards
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
• Delays
– Pipeline stalls/bubbles – Increase CPI
– Reduce speedup
ECE437, S’21 © Vijaykumar and Thottethodi (79)
Single Memory: Structural Hazard
nI Load Mem s
t Instr 1 r.
O Instr 2 r
de I n s t r 3
Time (clock cycles)
Reg Mem Reg
Mem Reg
Mem Reg
M e m
Mem Reg Mem
Reg M e m
r
Instr 4
R e g Mem
Reg
R e g
Mem Reg
Detection is easy in this case! (right half highlight means read, left half write) ECE437, S’21 © Vijaykumar and Thottethodi (80)
20
ALU
ALU ALU
ALU ALU

Structural Hazards
• Single memory (suppose)
• 1.3 memory accesses per instruction
– How?
– 1 per instruction for instruction fetch – Fraction for data load/store
• Depends on instruction mix • 20% load + 10% store
• 15% load + 15% store
• CPI is at least 1.3 (otherwise memory is used more than 100%)
• Solution?
ECE437, S’21 © Vijaykumar and Thottethodi (81)
Data Hazards
or r8, r1 ,r9 xor r10, r1 ,r11
ECE437, S’21 © Vijaykumar and Thottethodi (82)
add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7
Hazards on r1
• Dependencies backwards in time
Time (clock cycles)
IF ID/RF EX MEM WB add r1,r2,r3 Im Reg Dm Reg
I n
st sub r4,r1,r3 Im Reg
Dm Reg Reg Dm
r. and r6,r1,r7 O
dr or r8,r1,r9
e
r xor r10,r1,r11
ECE437, S’21 © Vijaykumar and Thottethodi
Im
Reg
Dm Reg
Dm Reg
Im
Reg
Im Reg
(83)
Data Hazard Solution1: Stall
• Can always stall until hazard goes away – Delay sub and later instrs till add is in WB – Increase CPI – lose performance
• But performance loss bad only if common case
– Amdahl’s law
• So what about data hazards?
– Think about how code looks
ECE437, S’21 © Vijaykumar and Thottethodi (84)
21
ALU
ALU ALU
ALU ALU

Data Hazard Solution1: Stall
• So what about data hazards? – Think about how code looks
• Keydifferencebetweencircuitdesignerand computer architect
• circuitpeopledonotthinkaboutcode
• Architectsthinkaboutinteractionofcodewith
hardware – common case interactions – VERY COMMON code:
• x=a+b;
• = use x; /* IMMEDIATELY use x */
ECE437, S’21 © Vijaykumar and Thottethodi (85)
Data Hazard Solution2: forwarding (a.k.a. bypassing)
Time (clock cycles)
IF ID/RF EX MEM WB
I n
add r1,r2,r3 Im
Reg Dm Reg
Im Reg Dm Reg
Im Reg Dm Im Reg
Im Reg
st sub r4,r1,r3
r. and r6,r1,r7
O
dr or r8,r1,r9
e
r xor r10,r1,r11
Reg
Dm Reg
Dm Reg
• What is the EARLIEST point where r1 is available SOMEWHERE in the pipe
ECE437, S’21 © Vijaykumar and Thottethodi (86)
Datapath w/Forwarding Unit
ID/EX
EX/MEM MEM/WB
M u x
Registers
Forwar
M u
dA
Forwarding unit
ALU
Data memory
M u x
Rs For
Rt
x
wardB
Rt M
Rd u x
EX/MEM.RegisterRd
MEM/WB.RegisterRd
b. With forwarding
• Forwarded values from pipeline latches
ECE437, S’21 © Vijaykumar and Thottethodi (87)
Time (in clock cycles)
Value of CC 1 register $2: 10
Program execution order
(in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
CC 2 10
Reg
CC 3 10
Reg
CC 4 10
CC 5 10/–20
Reg
Reg
CC 6 –20
Reg
Reg
CC 7 –20
Reg
DM
CC 8 –20
Reg
CC 9 –20
Dependence : Backward in time
IM
DM
IM
DM
IM Reg
DM
IM
IM
DM Reg
ECE437, S’21 © Vijaykumar and Thottethodi
(88)
22
ALU
ALU ALU
ALU ALU

Other types of Data Hazards
• Challenge: maintain illusion of sequential
execution
• Types of data hazards
– RAW (what we have seen), WAR, WAW
RAW (read after write) Data Hazard (what we have seen so far)
IF
DCD
EX
Mem
WB
IF
DCD
EX
Mem
WB
WAW Data Hazard (write after write)
Mem
WAR Data Hazard (write after read)
IF
DCD EX
Mem
WB
IF
DCD OF
Ex
ECE437, S’21 © Vijaykumar and Thottethodi (90)
IF
DCD
OF
Ex
WB
True dependence : Forward in time
Time (in clock cycles)
CC 1 Value of register $2 : 10 ValueofEX/MEM: X ValueofMEM/WB: X
Program execution order (in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
CC 2 10 X X
CC 3 10 X X
Reg
CC 4 10
–20 X
CC 5 CC 6 10/–20 –20 X X –20 X
Reg
Reg
IM Reg DM Reg
CC 7 –20 X X
Reg
CC 8 –20 X X
CC 9 –20 X X
DM
IM Reg
IM
DM
IM Reg
sw $15, 100($2)
ECE437, S’21 © Vijaykumar and Thottethodi
IM
Reg DM Reg
(89)
DM
Data Hazards
RAW Data Hazard (what we have seen) WAW Data Hazard
Mem
WAR Data Hazard
IF
DCD
EX
Mem
WB
IF
DCD
EX
Mem
WB
IF
DCD EX
Mem
WB
IF
DCD OF
Ex
IF
DCD
• Avoidsome“bydesign”(asdoneinMIPSpipeline)
– eliminateWARbyalwaysfetchingoperandsearly(DCD)in
pipe
– eliminateWAWbydoingallWBsinlaststage
• Detectandresolveremainingones(RAW) – stallorforward(ifpossible)
ECE437, S’21 © Vijaykumar and Thottethodi (91)
OF
Ex
WB
Handling RAW Hazards
• Pre-requisite for handling RAW hazard – Detection!
– Need to know:
• Pending writes
– available results that haven’t been written back to
registers
• Operand Reads
– Later instructions that potentially use these values – Some instructions may not write to
register file (store, branch)
ECE437, S’21 © Vijaykumar and Thottethodi (92)
23

Detecting RAW hazards
• Suppose instruction i is in EX and a predecessor instruction j is in a later stage
• A RAW hazard may exist on register if Rregs( i ) Wregs( j )
– Compare pending writes (for inst’s in later stages) with operand
• A WAW hazard may exist on register if Wregs( i ) Wregs( j )
• A WAR hazard may exist on register if Wregs( i ) Rregs( j )
• MIPS: RAW hazards only
ECE437, S’21 © Vijaykumar and Thottethodi (93)
regs of current instruction.
One-minute quiz
• Fill in the blanks:
– Pipelining improves instruction _________ while possibly worsening instruction ________
• Previous quiz – Speedup over single-cycle
– 5-stage pipeline with per-stage delays in terms of the single- cycle clock of 10%, 15%, 20%, 20%, and the remainder
– Ontopofthesedelays,thepipelinelatchoverheadis5%of single-cycle clock
• Real latch overhead is 1-2% – 2.5x
ECE437, S’21 © Vijaykumar and Thottethodi (94)
Registers
Record of pending writes
ID/EX EX/MEM MEM/WB
M u x
M u
xM u x
ForwardA ALU
Rs ForwardB Rt
RtM
Rd u
x
Current operand registers Pending writes
EX/MEM.RegisterRd Forwarding MEM/WB.RegisterRd
unit
b.Withforwarding
• • •
hazard <= ((rs == rwex) ((rs == rwwb) ((rt == rwex) ((rt == rwwb) & regWex) OR ((rs == rwmem) & regWme) OR & regWwb) OR & regWex) OR ((rt == rwmem) & regWme) OR & regWwb) ECE437, S'21 © Vijaykumar and Thottethodi (95) Data memory Logic equations for Hazard Detection • Restatement of equations • Text book version – WB stage is not really a hazard • written in first half of cycle, read in 2nd half – EX/MEM.RegisterRd == ID/EX.RegisterRs – EX/MEM.RegisterRd == ID/EX.RegisterRt – MEM/WB.RegisterRd == ID/EX.RegisterRs – MEM/WB.RegisterRd == ID/EX.RegisterRt ECE437, S'21 © Vijaykumar and Thottethodi (96) 24 Lookahead: Forwarding datapath • We know how to detect RAW hazards • Now, – Modify Datapath to enable forwarding – Desired control behavior ECE437, S'21 © Vijaykumar and Thottethodi (97) Base Pipelined Datapath • Simplifiedrepresentationofpipelineddatapath – Toavoidclutter ID/EX EX/MEM MEM/WB Registers ALU Data memory M u x a. No forwarding ECE437, S'21 © Vijaykumar and Thottethodi (98) Datapath w/Forwarding Unit ID/EX EX/MEM MEM/WB M u x Registers Forwar M u dA Forwarding unit ALU Data memory M u x Rs For Rt x wardB Rt M Rd u x EX/MEM.RegisterRd MEM/WB.RegisterRd b. With forwarding • ForwardA/ForwardB: 01->Mem, 10->EX
ECE437, S’21 © Vijaykumar and Thottethodi (99)
Data Hazards and Forwarding: Walkthrough
• Code snippet
– identify hazards
– identify forwarding paths
ECE437, S’21 © Vijaykumar and Thottethodi
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
(100)
25

Walkthrough
or $4, $4, $2
and $4, $2, $5
IF/ID
sub $2, $1, $3 ID/EX
before<1>
EX/MEM
before<2>
M
10
WB M EX
10
$1
Control
MEM/WB
WB
M u x
Instruction memory
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 3
ALU
PC
u Mx
u x
M u x
• Skip the boring stuff, jump to cycle 3
ECE437, S’21 © Vijaykumar and Thottethodi (101)
Forwarding unit
WB M
2 5
$2
$5
Regis
ters
$3
Data memory
2
4
1 3
2
5
add $9, $4, $2
or $4, $4, $2
and $4, $2, $5 ID/EX
sub $2, . . .
EX/MEM
10
before<1>
10
WB M EX
10
Control
WB M
MEM/WB
IF/ID
WB
4 $4$2
6
Regis
ters
Instruction memory
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 4
ALU
PC
M u x
$2
2 6
4
M u x
M u x
Data memory
$5
2
5
M 4u x
2
• Forward ALUOut to Operand 1
ECE437, S’21 © Vijaykumar and Thottethodi (102)
Forwarding unit
after<1>
PC Instruction memory
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 5
add $9, $4, $2
or $4, $4, $2 ID/EX
and $4, . . .
EX/MEM
10
sub $2, . . .
10
WB M EX
10
Control
WB M
MEM/WB
WB
IF/ID
1
4 $4$4
M 2u x
2
Regis
ters
ALU
M
$2
$2
Data memory
u Mx
u x
M u x
4 2
9
4 2
4
4
2
• Forward ALUout to Op1, Mem to Op2
ECE437, S’21 © Vijaykumar and Thottethodi (103)
Forwarding unit
after<2> after<1>
add $9, $4, $2
or $4, . . .
EX/MEM
10
and $4, . . .
WB Control M EX
MEM/WB
ID/EX
10
IF/ID
PC
sub $2, $1, $3 and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 6
M u x
• Two candidates match, forward the latest
ECE437, S’21 © Vijaykumar and Thottethodi (104)
WB M
1
WB
$4
M u x
M u x
M u x
ALU
Forwarding unit
Instruction memory
4
Regis
ters
$2
Data memory
4
2 9
4
4
26
Instruction
Instruction
Instruction
Instruction

ID/EX
Final Datapath
EX/MEM MEM/WB
M u x
M u x
ALUSrc
ALU
Forwarding unit
Registers
M u x
M u x
M u x
• For “Imm” op, ALUsrc mux (Fig 4.57)
ECE437, S’21 © Vijaykumar and Thottethodi (105)
Data memory
Control for Forwarding
• EXhazard
If (EX/MEM.RegWrite AND // not store or branch EX/MEM.RegsterRd != 0 AND // Result is used EX/MEM.RegisterRd = ID/EX.RegisterRs) ForwardA = 10
If (EX/MEM.RegWrite AND EX/MEM.RegsterRd != 0 AND EX/MEM.RegisterRd = ID/EX.RegisterRt) ForwardB = 10
ECE437, S’21 © Vijaykumar and Thottethodi (106)
Control for Forwarding
• MEMhazard
If (MEM/WB.RegWrite AND MEM/WB.RegsterRd != 0 AND MEM/WB.RegisterRd = ID/EX.RegisterRs) ForwardA = 01
If (MEM/WB.RegWrite AND MEM/WB.RegsterRd != 0 AND MEM/WB.RegisterRd = ID/EX.RegisterRt) ForwardB = 01
• Doesthisfullymeetourrequirements?
ECE437, S’21 © Vijaykumar and Thottethodi (107)
Forwarding – Summary
ID/EX
EX/MEM MEM/WB
M u x
MM uu xx
M u x
Registers
ALUSrc
ALU
Data memory
M u x
Forwarding unit
• Designed forwarding unit to solve RAW hazards for R-type instructions
ECE437, S’21 © Vijaykumar and Thottethodi (108)
27

Pay attention
• Forwardingis
– FROM EX/MEM or MEM/WB pipeline latches – TOINSIDEEX(NOTtoID/EXlatch)
• Othercombos
– egFROMunlatchedALUoutputTOINSIDEEX
– FROM unlatched ALU output TO ID/EX pipeline latch
– or FROM EX/M or M/W pipeline latches to ID/EX latch – etc
– willnotworkorcausestallsorrequirelargerpipelinelatches
ECE437, S’21 © Vijaykumar and Thottethodi (109)
Forwarding does’nt solve all RAW hazards
Time (clock cycles)
lw r1,0(r2) sub r4,r1,r3
IF ID/RF EX MEM WB
Im Reg Dm Reg
Im Reg
Dm Reg
• What is the EARLIEST point where load data is available SOMEWHERE in the pipe
– KEY difference between R-type and ld
– Value available: for R-type after EX, for ld after
MEM
• Can’t solve with forwarding:
– Must delay/stall instruction dependent on loads
ECE437, S’21 © Vijaykumar and Thottethodi (110)
Time (in clock cycles)
Load instruction
Program execution order
(in instructions)
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
CC 1
CC 2
CC 3
CC 4
DM
CC 5
Reg
CC 6
Reg
CC 7
Reg
CC 8
CC 9
IM Reg
IM Reg
D
IM Reg
IM
Reg DM Reg
DM
IM
Reg DM Reg
• True dependence backward in time
ECE437, S’21 © Vijaykumar and Thottethodi
(111)
M
Ld-use hazard Solution
• Catch-all solution for hazards
– Stall if instruction immediately next to a load is dependent on the load
• always works, but hurts performance
• Use as last resort
– Can’t help: load value is unavailable, unlike
R-type instructions • Challenge:
– Modify pipeline to stall for such hazards
– Detect and stall LATER instructions
ECE437, S’21 © Vijaykumar and Thottethodi (112)
28
ALU ALU

Terminology
• Minor change in terminology
– If forwarding can solve it, it is not a hazard!
– “Hazard” refers only to true backward dependencies in time
ECE437, S’21 © Vijaykumar and Thottethodi (113)
Ld-use Detection
• Conditions
– IMMEDIATELY preceding instruction must read memory
• MemRead must be asserted
– Destination of preceding instruction
lw $2, 20($1) and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
(rt) must be one of operands of current instruction
• Logic equations– restate above conditions formally
– If( ID/EX.MemRead AND
( (ID/EX.RegRt = IF/ID.RegRs) OR
(ID/EX.RegRt = IF/ID.RegRt) ) ) STALL
ECE437, S’21 © Vijaykumar and Thottethodi (114)
Stalling the pipeline
• Current instruction cannot proceed
– Later instructions must be stalled too else they will run into each other
• Solution for RAW hazards
– inject NOP into ID/EX pipeline latch
– Prevent writes to PC and IF/ID register – Earlier instructions proceed as usual
• GENERAL solution for all stalling
• Inject nop instead of stalled instr in next stage • “freeze” later instrs behind in previous stages
• “continue” earlier instrs ahead in later stages
ECE437, S’21 © Vijaykumar and Thottethodi (115)
Stalling the pipeline
• GENERAL solution for all stalling • WHILE stalled
– inject nop in place of stalled instr in next stage
• As many nops as stalls
– “freeze” later instrs behind (in earlier
stages)
• As long as stalled
– “continue” earlier instrs ahead (in later stages)
ECE437, S’21 © Vijaykumar and Thottethodi (116)
29

Instruction memory
Registers
0
Datapath
Hazard detection unit
Control
IF/ID.RegisterRs IF/ID.RegisterRt IF/ID.RegisterRt IF/ID.RegisterRd
ID/EX.RegisterRt
EX/MEM
ID/EX.MemRead
ID/EX
WB
M
u M x
WB
M WB
u Mx
Data memory
MEM/WB
IF/ID
EX
M u x
PC
ALU
M
Rt Rd
Rs Rt
u x
M u x
EX
/MEM.RegisterRd
ECE437, S’21 © Vijaykumar and Thottethodi
Forwarding unit
(117)
MEM/WB.RegisterRd
and $4, $2, $5
lw $2, 20($1)
before<1>
before<2>
EX/MEM
WB M
before<3>
MEM/WB
WB
Walk-through (1 of 6)
Hazard detection unit
ID/EX.MemRead
1
ID/EX
WB
M
u M x
X
11
Control
0
IF/ID
EX
1 $1
M u x
Instru mem
ction ory
X
Regis
ters
Data memory
PC
ALU
$X M u Mx
lw
$2, 20($1)
1 X 2
u x
Mu x
and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 2
ID/EX.RegisterRt
Forwarding unit
(118)
• Skiptocycle2–ldisinDecode
ECE437, S’21 © Vijaykumar and Thottethodi
or $4, $4, $2
PC Instruction memory
lw $2,20($1) and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 3
lw $2, 20($1) ID/EX
2
5
2 5
Walk-through (2 of 6)
and $4, $2, $5
Hazard detection unit
IF/ID
• All‘0’s=>NOP(MemWr,RegWr,othersdeasserted)
ECE437, S’21 © Vijaykumar and Thottethodi (119)
before<1>
EX/MEM
WB M
before<2>
ID/EX.MemRead
Control
M u x
00
$2
WB
M EX
11
$1
MEM/WB
WB
0
Regis
ters
M u x
Data memory
ALU
M
$5 $X
21 5 X
u Mx
u x
2M 4u x
ID/EX.RegisterRt
Forwarding unit
or $4, $4, $2
PC
lw $2,20($1)
and $4, $2, $5
bubble ID/EX
lw $2, . . .
EX/MEM
2 5
5
Walk-through (3 of 6)
Hazard detection unit
ID/EX.MemRead
before<1>
WB
M
u Mx
u x
10
WB
00
Control
u x
M
EX
$2 $2
$5 $5
22 5 5
WB M
M 11
MEM/WB
IF/ID
0
2
M u x
Instruction memory
Registers
Da me
ta mory
ALU
nop 44Mu 2
and $4, $2, $5
or $4, $4, $2 add $9, $4, $2
Clock 4
x
ECE437, S’21 © Vijaykumar and Thottethodi
ID/EX.RegisterRt
Forwarding unit
(120)
30
Instruction
Instruction
Instruction
PCWrite
PCWrite
PCWrite
IF/IDWrite
Instruction
IF/IDWrite
IF/IDWrite
PCWrite
IF/IDWrite

add $9, $4, $2
PC
or $4, $4, $2
and $4, $2, $5
WB
bubble
EM
lw $2, . . .
MEM/WB
Walk-through (4 of 6)
4
Hazard detection unit
ID/EX.MemRead
ID/E
10
2
Control
M0 u M
x
0
11
IF/ID
EX M 4 $4$2
WB
M u x
ALU
M
$2 $5
42
u Mx
u x
lw $2,20($1)
nop M 2
and$4,$2,$5 or $4, $4, $2 add $9, $4, $2
Clock 5
2 5
4 4 ux
ID/EX.RegisterRt
Forwarding unit
• LoadvalueforwardedfromMEM/WBregister
ECE437, S’21 © Vijaykumar and Thottethodi (121)
X
10
EX/M
WB
2
Instruction memory
2
Registers
Data memory
after<1>
PC
lw $2,20($1)
add $9, $4, $2
or $4, $4, $2
and $4, . . .
EX/MEM
bubble
0
4 2
2
Walk-through (5 of 6)
Hazard detection unit
ID/EX.MemRead
ID/
WB
EX
10
Control
M 10 u M
x
10
WB
IF/ID
EX
M
WB
0
MEM/WB
4 $4$4
$2 $2
44
2 2
M u x
Instruction memory
Regis
ters
ALU
M
Data memory
u Mx
u x
nop M 4
and $4, $2, $5 or $4, $4, $2 add $9, $4, $2 Clock 6
94u x
ID/EX.RegisterRt
Forwarding unit
• $4valueforwardedfromEX/MEMregister
ECE437, S’21 © Vijaykumar and Thottethodi (122)
after<2>
after<1>
Hazard detection unit
add $9, $4, $2
or $4, . . .
MEM
and $4, . . .
Walk-through (6 of 6)
ID/EX.MemRead
10
ID/EX
WB
10
M 10 u M
x
0
EX
PC M u x
Control
WB M
MEM/WB
1
IF/ID
WB
$4
M u x
EX/
ALU
Instruction memory
4
Registers
$2
Data memory
lw $2, 20($1) nop
and $4, $2, $5 or $4, $4, $2 add $9, $4, $2
Clock 7
• TWOvaluesfor$4,pickmostrecenttoforward
ECE437, S’21 © Vijaykumar and Thottethodi (123)
4
4
9
M u x
2
M u x
4
ID/EX.RegisterRt
Forwarding unit
RAW Hazard with Loads: Summary
• True backward dependencies in time
– Need to stall
– Dependent instruction & later instructions stalled – Earlier instructions can proceed
• Stall achieved by
– Detecting hazard (remember logic equation)
– Inserting NOP (all EX/MEM/WB controls set to 0)
– Preventing IF/ID latch and PC from being
overwritten
ECE437, S’21 © Vijaykumar and Thottethodi (124)
31
Instruction
Instruction
PCWrite
PCWrite
IF/IDWrite
IF/IDWrite
Instruction
PCWrite
IF/IDWrite

Recap : Pipeline Register Widths
IF: add $14, $8, $9
ID: or $13, $6, $7
EX: and $12, . . .
MEM: sub $11, . . .
WB: lw $10, . . .
1 M u x 0
Add
IF/ID = 64 ID/EX = 147 EX/MEM = 107 MEM/WB = 71
Clock 5
4
0
M u
IF/ID
EX/MEM
WB
MEM/WB
1 WB 1
x
1
or
Control
10 000 1100
ID/EX
WB M
10
000 10
10 10 0 00
EX
M
Ad
Add d result
ALUSrc
Shift left 2
$6 $4
Branch
Address
Instruction memory
6 7 10
Read register 1
Read register 2
Regist Write
register
Write data
Read data 1
ers Read data 2
$7 $5
Read data
Write data
Address memory
M u x 1
Zero
ALU ALU result
Ins
X
Ins
X
truction
[15–0] Sign X
0
0
Data
ALU control
ALUOp
MemRead
truction [20– 16]
X
extend
Ins
13
M truction u
[15–11] 13 12 x 1
11
10
RegDst
PC
ECE437, S’21 © Vijaykumar and Thottethodi
(125)
Solution3 for RAW Hazard with Loads
• We can use the compiler to insert nops between ld and use to prevent hazards
– Then hardware need not detect this hazard • But we can do better
– Ld $16, 0[$24]
– Nop (inserted by compiler or h/w) – Add, $17, $16, $18
– Sub $23, $8, $9
– St $17, 0[$25]
• What can you do?
ECE437, S’21 © Vijaykumar and Thottethodi (126)
Solution3 for RAW Hazard with Loads
– Ld $16, 0[$24]
– Nop (inserted by compiler or h/w) – Add, $17, $16, $18
– Sub $23, $8, $9
– St $17, 0[$25]
• What can you do?
– Move sub which is independent of ld and add
ECE437, S’21 © Vijaykumar and Thottethodi (127)
Recall Hazards
• Structural hazards
– Two instructions need the same hardware – Half/half for REG and ch5 for MEM
• Data Hazards
– Data not ready
– Forward/bypass (stall for loads)
• Control flow Hazards – next
– Which instruction to fetch? Not known. – Delayed branch, Predict not taken
ECE437, S’21 © Vijaykumar and Thottethodi (128)
32
Instruction
MemtoReg
MemWrite
RegWrite

When are conditional branches resolved?
0 M u x 1
Add 4
IF/ID
ID/EX
EX/MEM
B
Add
Add result
ZeZreoro
ALU ALU result
ranch
PCSrc
MemWrite
MEM/WB
MemtoReg
1 M u x 0
RegWrite
Shift left 2
RegDst
PC
Address
Instruction memory
Read register 1
Read register 2
Read data 1
Registers Read Write data 2
Write data
Address
Read
register
Write data
data memory
Data
Instruction [15–0] 16
truction [20– 16]
truction [15– 11]
Sign extend
32
ALUSrc
0 M u x 1
6
0 M u x 1
ALU control
Ins
Ins
MemRead
ECE437, S’21 © Vijaykumar and Thottethodi
(129)
ALUOp
Program execution order
(in instructions)
Time (in clock cycles)
Branch Hazards
CC 1
CC 2
CC 3
CC 4
CC 5
Reg
DM
CC 6
Reg
CC 7
CC 8
CC 9
IM Reg DM
IM Reg
IM Reg
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Branch resolved in the MEM stage If taken,
– PC<- PC + 4 + SX(Imm*4) – 40+4+7*4=72 DM Reg IM Reg DM Reg IM Reg (130) DM Reg • • ECE437, S'21 © Vijaykumar and Thottethodi Control/Branch Hazards • Branch resolved in the MEM stage – Which is 3 cycles later but next instruction has to fetched in the next cycle • Solution 1: Stall but penalty is 3 bubbles – Oh but how common is this? • Amdahl’s law – worry about it only if common – Every 6th instruction is a branch • CPI would go from ~1.2 to 1.2+1/6*3 = 1.7 (why 1.2?) • Large performance loss • There are 3 other solutions ECE437, S'21 © Vijaykumar and Thottethodi (131) Hazards in real microprocessors • Some of you may be thinking – 1-3 bubbles for loads and branches ain’t that bad – performance impact is not much • But real pipelines are 10-15 stages deep, so without doing anything each load and branch would incur 10+ bubbles! – Well then why are real pipelines so deep? – To get fast clock via many schemes (taught in 437) ECE437, S'21 © Vijaykumar and Thottethodi (132) 33 Instruction Control flow Hazard: Solution2 • WecanmoveupbranchdecisiontoDecodebyadding h/w to compare registers as they are being read I n st Add r. O Beq dr And e r Time (clock cycles) Mem Reg Mem Mem Reg Mem Reg Mem Reg Reg • Impact:from3bubblesto1bubble ECE437, S'21 © Vijaykumar and Thottethodi (133) Mem Reg One-minute quiz • True or false – Instructions generate the relevant control signals at every pipeline stage • Previous quiz – Pipelining improves instruction throughput while possibly worsening instruction latency ECE437, S'21 © Vijaykumar and Thottethodi (134) IF.Flu sh Datapath for branch hazards ard detection unit Haz ID/EX WB EX/MEM M u x Control M uM WB 0x EX M IF/ID MEM/WB WB 4 PC Registers = M u x M u x M u x Instruction memory Shi left ft 2 Data memory M u x Sign extend ALU Forwarding unit ECE437, S'21 © Vijaykumar and Thottethodi (135) Solution 2 • move branch decision earlier in pipeline – Need additional comparator (r1=r2?) and adder (PC+4+SX(IMM)*4) – Recall branch needs both outcome & target – “flush” IF/ID latch to kill potentially incorrect fetch immediately after branch – Will this work well? ECE437, S'21 © Vijaykumar and Thottethodi (136) 34 ALU ALU ALU Br Solution 2 • move branch decision earlier in pipeline – Whoa! Value needed in earlier stage • what if r1/r2 write is pending? • Add r1, r2, r3 • Beq r1, 0, target // is this common? • Forwarding and/or stalling • Forwarding from EX/MEM and MEM/WB into decode (in addition to forwarding into EX) – 1 bubble if no dependence – If dependent like above, 2 bubbles with forwarding – Is this dependent case common? – Too much! (and better solution exists) ECE437, S'21 © Vijaykumar and Thottethodi (137) Br Solution 2 • Don’t use sol.2 in the lab – it’s ugly – Ignore the textbook • If you insist on using solution2 in the lab (please don’t) then you need to latch the branch target at the end of decode (it is not latched in previous slide) else br target will be lost if you stall fetch for some (other) reason (e.g., due to memory structural hazard) WHILE a br in decode proceeds without stall ECE437, S'21 © Vijaykumar and Thottethodi (138) Can we do anything about br stalls? • RAW hazards a problem for both branches and R-types – Value produced later but needed earlier – R-type: Written in WB and needed in ID – Br: Outcome produced in MEM and needed in ID • But solutions are fundamentally different – There is a key difference between branches and R-types – what? ECE437, S'21 © Vijaykumar and Thottethodi (139) Can we do anything about br stalls? • Solution 3 – Predict branch is always not taken • MUCH more sophisticated prediction possible • Why not predict taken? – Soution 4: Delay slots • Compiler’s problem • Walkthrough example for solution 3 – Predict not taken ECE437, S'21 © Vijaykumar and Thottethodi (140) 35 Control flow Hazard: Solution3 • Predict:guessonedirectionthenbackupifwrong – Predictnottaken I n st Add r. O Beq dr And e r Mem Reg Mem Time (clock cycles) Mem Reg Reg Mem Reg Mem Reg Mem Reg • Impact: 0 bubbles if correct, 1 bubble as before if wrong assuming prediction+sol. #2 (correct say 50%) • Moredynamicscheme(correct90+%) ECE437, S'21 © Vijaykumar and Thottethodi (141) and $12, $2, $5 IF.Flush Walkthrough (1 of 2) beq $1, $3, 7 Hazard detection unit sub $10, $4, $8 before<1>
EX/MEM
before<2>
72
M u
ID
W
/EX
B
48
x
M
Control uM WB
28x ME 0
IF/ID 48
EX M
M/WB
WB
44
72
Shift left 2
7
Sign extend
$1
=
$4
M u x
M u x
Registers
Data memory
ALU
$3 M
72
4
PC
Instruction memory
44
10
Clock 3
u $8 x
(142)
• Flush IF/ID latch
ECE437, S’21 © Vijaykumar and Thottethodi
Forwarding unit
lw $4, 50($7)
IF.Flush
Walkthrough (2 of 2)
bubble (nop)
Hazard detection unit
beq $1, $3, 7
sub $10, . . .
EX/MEM
before<1>
M
u 76 x
ID
/EX
WB
M
Control uM WB
x
0
=
EX M
IF/ID
76
MEM/WB
WB
72
Shift left 2
Sign extend
$1
M u x
M u x
Registers
Data memory
ALU
$3 x
(143)
76
4
PC
Instruction memory
72
M u
10
Clock 4
ECE437, S’21 © Vijaykumar and Thottethodi
Forwarding unit
Visualizing sol. #3 – prediction+sol. #2
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
these addresses are octal
(144)
ECE437, S’21 © Vijaykumar and Thottethodi
36
ALU
ALU ALU

Fetch 24, Decode 20, Exec 14, Mem 10
n
WB Ctrl
em Ctrl
M
IR
45
r2
B
M
10 lw
D
14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
ECE437, S’21 © Vijaykumar and Thottethodi
(145)
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
IR
D
• Fetchori-predictnottaken,resolvebeq ECE437, S’21 © Vijaykumar and Thottethodi (146)
Mem Ctrl
67
r4
r5
WB Ctrl
14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
10 lw
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
IR
D
• Squashmispredictedori–turninto“nop”
ECE437, S’21 © Vijaykumar and Thottethodi (147)
Mem Ctrl
00
r6
r7
WB Ctrl
14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
10 lw
Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20
WB Ctrl
Mem Ctrl
14 15
IR
r9 x
r1=M[r2+35]
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r2 = r2+3
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
D
• Nop’doriflowsthroughpipe
ECE437, S’21 © Vijaykumar and Thottethodi
(148)
37
PC
100
100
beq
3
addI r2, r2, 3
PC
104
17
NOP
sub r3
Next PC
Reg File
Decode
Next PC PC 24
Reg File
Decode
Next PC
Next PC
Mem Access
Mem Access
Data Mem
r2+3
addI r2
Data Mem
Mem Access
Mem Access
Data Mem
r4-r5
sub r3
Data Mem
M[r2+35]
lw r1
Exec
Exec
r4-r5
sub r3
r2+35
lw r1
Reg. File
Reg. File
r1=M[r2+35]
Reg File
Decode
Reg File
Decode
Exec
Exec
xxx
beq
r2+3
addI r2
Reg. File
Reg. File
Inst. Mem
Inst. Mem
NOP
sub r3, r4, r5
Inst. Mem
Inst. Mem
And r13, r14, r15
beq r6, r7 100
PC
30

Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24 n
WB Ctrl
Mem Ctrl
IR
14 15
r15
r1=M[r2+35]
r14
r2 = r2+3 r3 = r4-r5
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
D
ECE437, S’21 © Vijaykumar and Thottethodi
(149)
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30 n
Ctrl
D
WB Ctrl
Mem
IR
r1=M[r2+35]
r2 = r2+3 r3 = r4-r5
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
• NOPdoesnotdoanywrites
ECE437, S’21 © Vijaykumar and Thottethodi
(150)
Br Solution 3 – prediction+sol. #1
• If we assume prediction+sol. #1 then – Sol.#1:Brscause3-cyclestall
– With prediction (solution 3 )
• If correct 0 bubbles and 3 bubbles otherwise (as before)
ECE437, S’21 © Vijaykumar and Thottethodi (151)
How will sol.#1 datapath change for 3-
bubble mispredictions?
PCSrc
MemWrite
0 M u x 1
IF/ID
ID/EX
EX/MEM
Branch
MEM/WB
Add 4
Shift left 2
Add
c
0
Add result
ZeZreoro
PC
Address
Instruction memory
Read register 1
Read register 2
Read data 1
RegWrite
Registers Read Write data 2
Write data
register
ALUSr
ALU ALU result
MemtoReg
1 M u x 0
M u x 1
Address
Read
Write data
data memory
Data
Instruction [15–0] 16
Instruction [20– 16]
Sign extend
32
6
ALU control
ALUOp
0
M Instruction u [15– 11] x 1
ECE437, S’21 © Vijaykumar and Thottethodi (152)
MemRead
RegDst
38
Instruction
Next PC PC 114
Reg File
Decode
Next PC PC 110
Reg File
Decode
Mem Access
Data Mem
xxx
beq
Mem Access
Data Mem
r9 | 17
NOP
Exec
r9 | 17
NOP
Reg. File
Exec
R14.r15
And r13
Reg. File
xx
And r13
xx
Inst. Mem
Inst. Mem

Dynamic Branch Prediction – take solution3 further
• Better (i.e., more accurate) than static prediction of not-taken
– Branches are biased  predictable
– Some are mostly taken and others mostly not-
taken
– Why? Think of code (architects always think of
s/w)
– Give egs of mostly taken and mostly not-taken
ECE437, S’21 © Vijaykumar and Thottethodi (153)
Dynamic Branch Prediction – take solution3 further
– ~90% of program execution time is spent in ~10% of code (inner loops)
– Think of a program loop of N iterations • Taken N-1 times
• Not taken last time
• Then how does h/w learn the bias of each branch?
ECE437, S’21 © Vijaykumar and Thottethodi (154)
Dynamic Branch Prediction
Taken
Predict taken Taken Not taken
Predict not taken Not taken
1-bit saturating counter
Taken
Predict taken Th
Taken
Predict not taken Ns
Not taken Taken Not taken
Taken
Predict taken Ts
Not taken
Predict not taken Nh
Not taken
2-bit saturating counter
• Store each branch instruction’s history ***
– If a branch is taken in “recent” history, assume the branch is biased to taken and predict taken
• One bit saturating counter • Two bit counters
ECE437, S’21 © Vijaykumar and Thottethodi (155)
•
• • • •
•
While (always)
– fori=0to3..
blabla
Outcome: 1-bit (init N):
1-bit vs. 2-bit
top: mov ri, 0 loop:blabla
add ri, ri, 1 blt ri, 3 loop j top
TTTNTTTNTTTNTTTN..
NTTTNTTTNTTTNTTT..
– 2 mistakes per loop: one each at entry, exit
– Ignore init mistake
2-bit (init Nh): NhNsThThTsThThThTsThThThTsThThTh..
– 1 mistake per loop: at exit – Ignore init mistake
ECE437, S’21 © Vijaykumar and Thottethodi (156)
39

One-minute quiz
• Remove the load-use stall: – Add $16, $16, 4
– Ld $24, 0[$16]
– Sub $25,$24,$10
– Or $25, $25, $11 • Previous quiz
– Instructions generate the relevant control signals at every pipeline stage
• False
• Previous outcome vs average bias,
complex control flow in loop?
ECE437, S’21 © Vijaykumar and Thottethodi (157)
Branch Prediction Table
PC range
Branch prediction table 1- or 2-bit entries
• Storeeachbranch’shistory*** – Notreally
• Keepasmalltableindexedbyprogramcounter
• PCislarge(32bitnumber)
• Many-to-fewmappingfromPCtotableentries
– Eachentry1-or2-bitstateofstatemachine
– E.g. 2048-entry branch prediction table
– Mapping:use11bitsofPCWITHOUTlast2bits(why?)
• Problem:Multiplebranchesmaymaptosameentryin table – Aliasing
ECE437, S’21 © Vijaykumar and Thottethodi (158)
Branch Prediction: Branch Target Buffer
PC range
Branch target buffer (BTB)
• Unlikestaticpredictionofalwaysnot-taken,dynamic prediction will predict both taken and not-taken
• Fortakenprediction,wehaveaproblem(what?)
• Thereissomethingaboutbranchtargetsthathelpsus solve the problem (what?)
• KeepasmalltableofbrtargetsindexedbyPC(BTB) – Nextslide
• Many-to-fewmappingfromPCtoBTB
ECE437, S’21 © Vijaykumar and Thottethodi (159)
Fetch with branch prediction
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
Mem
PC
IF/ID
Fetch
Br Pred
BTB
• •
ECE437, S’21 © Vijaykumar and Thottethodi (160)
WHILEfetching(potentially)abranch,usebranchPC to look up prediction AND target of branch so NEXT cycle you can fetch taken instruction if predict taken else fetch not-taken instruction – back-to-back fetch for both taken or not-taken
beq F | D | X | M | W
• and F|D |X nobubble
40

Updating the predictor and BTB
• At fetch, you only look up the predictor/BTB – to make a prediction and get the br target
– Carry prediction/br target down the pipe
• When br is resolved (in Mem)
– You know the true outcome and target – Compare these to predictions
– If correct, do nothing
– If wrong, update predictor and BTB
• Predictor: state machine state change
• BTB: write the target for br PC into BTB
• All this in Mem WHILE squashing mispredicted instrs • Next cycle fetch from correct PC & lookup pred/BTB
• Correct~80%(0bubbles)incorrect20%(3bubbles)
ECE437, S’21 © Vijaykumar and Thottethodi (161)
Prediction accuracy
• Always not-taken – ~30% – – whoa! Why so bad?
• always taken – ~60%
• 1-bit – ~65%,
• 2-bit – ~80%
• correlating predictors (~90+%)
– Come to grad school, 565
ECE437, S’21 © Vijaykumar and Thottethodi (162)
Recap
• Branch instructions
– Control flow hazard
– Static branch prediction
• Predict not taken
• Squash instruction if prediction incorrect
• Dynamic Branch prediction
– 1-bit and 2-bit state machines to track history of branches
– Finite table
• Potential for “aliasing”
• Multiple branches map to the same predictor
ECE437, S’21 © Vijaykumar and Thottethodi (163)
Control flow Hazard: Solution4
• Redefinebranchbehaviortotakeeffectafternext instruction) “delayed branch”
I n
st Add r.
O Beq
Mem Reg Mem
Time (clock cycles)
Mem Reg Reg Mem
Mem Reg
Mem Reg
Reg
Mem Reg
Mem Reg
dr Delay-slot e
r Load
• Impact:0bubblesperbranchifcanfindinstruction to put in “delay slot” ( say 60% of time)
• Aspipelinesgetdeeper,lessuseful–why?
ECE437, S’21 © Vijaykumar and Thottethodi (164)
41
ALU
ALU ALU
ALU

10 lw 14 addI 20 sub 24 ori 30 beq 34 NOP 38 add
100 and
r1, r2(35) r2, r2, 3 r3, r4, r5 r8, r9, 17 r6, r7, 100
r10, r11, r12 r13, r14, 15
10 lw 14 addI 20 sub 24 beq 30 ori 34 add
100 and
r1, r2(35) r2, r2, 3
r3, r4, r5
r6, r7, 100 r8, r9, 17 r10, r11, r12
r13, r14, 15
Easy way*** to hide branch hazard delay
• Delayed branch
– Instruction after branch always executes
independent instruction from before the branch
– Find independent instructions from Taken (target) OR from Not Taken (fall-through) code section
a. From before
Becomes
b. From target
Becomes
(166)
c. From fall through
Becomes
– Find an
add $s1, $s2, $s3 if $s1 = 0 then
Delay slot sub $t4, $t5, $t6
• *** For Architects ECE437, S’21 © Vijaykumar and Thottethodi
add $s1, $s2, $s3 if $s2 = 0 then
Delay slot
if $s2 = 0 then add $s1, $s2, $s3
sub $t4, $t5, $t6 …
add $s1, $s2, $s3 if $s1 = 0 then
Delay slot
add $s1, $s2, $s3 if $s1 = 0 then
sub $t4, $t5, $t6
add $s1, $s2, $s3 if $s1 = 0 then
sub $t4, $t5, $t6
Delayed Branch: Solution 4
• Delayedbranch:instafterbranchALWAYSexecuted – Brancheffectisdelayedby1cycle–hencethename
– Invisibletoprogrammer
– Compiler and/or assembler transforms code
ECE437, S’21 © Vijaykumar and Thottethodi (165)
Ideal – from before branch
R1 <- M[R2] R2==0? Delay slot R2==0? R1 <- M[R2] • Independentinstructionstofilldelayslot • Codetransformationmustpreserveoriginalsemantics – Correctness • Butmaynotfindindependentinstructionfrombefore branch then try from taken or not-taken path ECE437, S'21 © Vijaykumar and Thottethodi (167) From Taken/Not-taken paths R2 <- M[R2] R2==0? N
• Y is more likely (common case)
R2 <- M[R2] R2==0? N R3<-R5+R6 Y
Y
Delay slot

R3<- R5+R6 R4<- R3 • R3 DEAD in “N” path and in“wingding” and beyond – DEAD = overwritten before being read (KEY correctness requirement) • Delay slot – Profitable mostly, not profitable occasionally, but NEVER unsafe ECE437, S'21 © Vijaykumar and Thottethodi (168) (incorrect)
R4<-R3

42

Two cases if we move from Y path
R2 <- M[R2] R2==0? N Y R2 <- M[R2] R2==0? N R3<-R5+R6 Y R3 <- R8-R9 • OK (R3 dead in N path)
R3<-R5+R6 R23 <- R3
NOT OK (R3 alive in N path)

(169)
R4<- R3 ECE437, S'21 © Vijaykumar and Thottethodi R4<-R3 R48 <- R3 Recall Hazards • Structural hazards – Two instructions need the same hardware – Half/half for REG and two copies for MEM • Data Hazards – Data not ready – Forward/bypass (stall for loads) • Control flow Hazards – Which instruction to fetch? Not known. – Move up br. decision, Predict not taken, delayed branch ECE437, S'21 © Vijaykumar and Thottethodi (170) IF.Flu sh Recap: Datapath for branch hazards ard detection unit Haz ID/EX WB EX/MEM M u x Control M uM WB 0x EX M IF/ID MEM/WB WB 4 PC Registers = M u x M u x M u x Instruction memory Shi left ft 2 Data memory M u x Sign extend ALU Forwarding unit ECE437, S'21 © Vijaykumar and Thottethodi (172) Common confusion in CPI calculation • Ideal CPI = 1, 1/4th loads, 1/6th branches, br prediction accuracy 90% and branches resolved at the END of MEM and assume 50% loads are followed immediately by a dependent instr and by independent instr for rest • CPI=1+1/4*1/2*1+1/6*0.1*3 – The first “1” captures no stalls for ALL instructions, so later terms are ONLY for stalls ECE437, S'21 © Vijaykumar and Thottethodi (173) 43 Common confusion in CPI calculation • CPI=1+1/4*1/2*1+1/6*0.1*3 • DO NOT do – 1–(1⁄4+1/6)+1⁄4*1/2*1+1⁄4*1/2 *2+1/6 *0.9*1 + 1/6*0.1*4 • correct but laborious – 1 + 1⁄4*1+1/6*3 • Incorrect because double-counts non-stall cases ECE437, S'21 © Vijaykumar and Thottethodi (174) What next? • Exceptions – Multiple instructions in flight – PC has changed • Advanced topics – VERY briefly – Superscalar, dynamically scheduled processors, etc • Real machines – Pentium 4 pipeline, Niagara Pipeline ECE437, S'21 © Vijaykumar and Thottethodi (175) Exceptions Exception: normal control flow: sequential, jumps, branches, calls, returns • Exception=unprogrammed‘surprise’functioncall – OStakesactiontohandletheexception • must record the address of the offending instruction – returns control to user user program System Exception Handler return from exception – mustsave&restoreuserstate • LetsOSgetin/outwithoutbreakinguserprogram– as if user program had a “virtual CPU” to itself ECE437, S'21 © Vijaykumar and Thottethodi (176) Interrupt, Exception, Trap? • Interrupts – caused by external events (eg I/O) – asynchronous to program execution – may be handled between instructions – suspend user program, handle interrupt, resume user program • Traps – caused by internal events • exceptional conditions (overflow) • errors (parity) • faults (ch. 5) – KEY mechanism in all modern computers – synchronous to program execution – condition must be remedied by the handler for program to proceed – instruction may be retried or simulated and program continued, or program may be aborted • MIPS convention: – External : Interrupts – Internal : Exception ECE437, S'21 © Vijaykumar and Thottethodi (177) 44 Importance of exceptions • No modern OS would work without exceptions and interrupts • Exceptions are uncommon case • Butcommoncaseisnoteverything,uncommoncase must be correct – Especially if uncommon is needed by OS ECE437, S'21 © Vijaykumar and Thottethodi (178) Exception Semantics • MIPS architecture defines the instruction as having no effect if the instruction causes an exception. • When we get to virtual memory (ch 5b) we will see that certain classes of exceptions must prevent the instruction from changing the machine state. • This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard
– Precise interrupts vs Imprecise interrupts
ECE437, S’21 © Vijaykumar and Thottethodi (179)
Exceptions
• Pipeline Semantics
– NO instruction AFTER the excepting instruction may execute
– EVERY instruction BEFORE the excepting instruction must complete execution
• Sounds similar to what?
– Exception handler software saves and
restores registers and other state • Different from 362 microcontroller
ECE437, S’21 © Vijaykumar and Thottethodi (180)
MIPS Exceptions
• All exceptions jump to same handler code – “Cause” register
• We consider
– Illegal instructions
– Arithmetic overflows
• Hardware behavior
– Save PC of offending instruction (How? PC+4 has already been written to PC)
– Use special register EPC(why not use $31 like jal?)
– Set cause register appropriately (0=ILL; 1=OVF)
– Jump to handler at fixed address
ECE437, S’21 © Vijaykumar and Thottethodi (181)
45

Datapath modifications
• Pipeline complications
• Which stage is exception detected?
– Overflow?
• In EX, squash (convert to nop) EX & earlier stages
– Illegal Instruction?
• In ID, squash (convert to nop) ID & earlier stages • Similar to RAW hazard
– What about external interrupts?
• Overflow in instruction i, illegal instruction in
instruction i+1
– Simultaneous exceptions – Hardware sorting
ECE437, S’21 © Vijaykumar and Thottethodi (182)
Walk-through: Code snippet
• Main Code
40 sub $11, $2, $4 44 and $12, $2, $5 48 or $13, $2, $1 4C add $1, $2, $1 50 slt $15, $6, $7
ECE437, S’21 © Vijaykumar and Thottethodi
• Exception Code
[Exception handler PC] sw $25, 1000($0)
(183)
lw $16, 50($7)
4000
Clock 5
Walkthrough (1 of 2)
IF.Flu
sh
slt $15, $6, $7
Shift left 2
Sign extend
add $1, $2, $1
WB x 0
or $13, . . .
EX/MEM
and $12, . . .
ID.Flush
EX.Flush
ard detection
unit
trol
Haz
ID/EX
40000040
M u x
M0 0 10 u
4
IF/ID
58
Con
M0010 M0
uM uWB
0x x /WB
0
$6
=
0
Cause 1 EX M B
50 Except PC
M $2 u
x
54
PC
0040
Instruction memory
54
12
Registers
Data memory
MEM
W
ALU
$7 M u M $1 x
u x
M 13 12 15$1 ux
Forwarding un it
• All three instructions converted to nop
ECE437, S’21 © Vijaykumar and Thottethodi (184)
IF.Flush
40000040
4
40000040
4000
x
Walkthrough (2 of 2)
sw $25, 1000($0)
4000
Clock 6
bubble (nop)
ard tion t
bubble
WB x 0
bubble
EX/MEM
or $13, . . .
Haz
Co
ID.Flush
EX.Flush
M u
detec uni
ID/EX
M 00 0 00 u
ntrol M0000 M 00 uM uWB
0044 IF/ID
0x x /WB 001
MEM
W
Shift left 2
Sign extend
=
EXCause M B Except
PC
M u x
PC 0044
Instruction memory
13
Registers
Data memory
ALU
M
u Mx
u x
M u x
13
• FetchnextinstructionfromhandlerPC(MIPS)
ECE437, S’21 © Vijaykumar and Thottethodi (185)
Forwarding un it
46

WB
MEM/WB
WB
M
PC
Instruction memory
Address
Read data
Data memory
Address Read
data data
Write
MemRead
Understanding Performance
• Iron law: Insts/prog * CPI * cycletime • With pipelining:
– CPI ~ 1 (with ideal memory, good branch prediction and few data hazards)
– Cycletime : determined by critical path of one stage
• Can we do better? – CPI<1? – How about CPI of 0.25? ECE437, S'21 © Vijaykumar and Thottethodi (187) Branch Pipelined Processor EX.Flush IF.Flush 40000040 4 ID.Flush Hazard detection unit M u ID/EX M xu WB x IF/ID EX/MEM 0 MM Control uM u 0xx 0 ALUSrc EX Cause Except PC M u x Shift left 2 Read register 1data 1 Read register 2 Read = data uM x u x ALU control ALUOp Forwarding unit (186) Registers Write ALU register Write Read data 2 32 M x M u x M u 16 Sign extend Instruction [25– 21] Instruction [20– 16] Instruction [20– 16] Instruction [15– 11] RegDst • Phew! ECE437, S'21 © Vijaykumar and Thottethodi Superscalar Processor • What does it mean? – Scalar processors (operate on scalar quantities) – Vector (operate on vectors) – Superscalar: multiple scalar operations in one cycle – More than one instruction per cycle ECE437, S'21 © Vijaykumar and Thottethodi (188) M 40000040 u x PC • • ECE437, S'21 © Vijaykumar and Thottethodi M u x M u x (189) ALU ALU Superscalar Datapath 4 Instruction memory Replicate datapath elements Static Multiple issue datapath Sign extend Sign Registers M u x Write data Data memory Address extend 47 Instruction MemtoReg RegWrite MemWrite Dynamic Scheduling • No need to suffer hazards if other useful work can be achieved • Load hazard results in pipeline stall – But later instructions are ready – “Oh! But we cannot execute instructions out of order” – Not true! Modern CPUs do it all the time!! lw $t0, 20($s2) addu $t1, $t0, $t2 sub $s4, $s4, $t3 slti $t5, $s4, $t3 – Execute sub,slti, & later instrs while lw has not completed! ECE437, S'21 © Vijaykumar and Thottethodi (190) Dynamic Scheduling Instruction fetch and decode unit Reservation station Reser vation station Reservation station Reser vation station Functional units Integer ... Integer ... Floating Load/ point Store In-order issue Out-of-order execute In-order commit Commit unit • Instructions can execute when operands are ready • Instructions “commit” when all preceding instructions have ECE437, S'21 © Vijaykumar and Thottethodi (191) committed Real machines • Let’s examine Pentium 4, Core 2, i7 – Microarchitecture has evolved – Technology has improved – Also ARM A3, Sun Niagara, Apple A14 Firestorm ECE437, S'21 © Vijaykumar and Thottethodi (192) Pentium 4 on 0.18 micron • 42 million transistors • 3GHz • Several parts are clocked at half the speed • In-order front-end, out-of-order execution, in order retire ECE437, S'21 © Vijaykumar and Thottethodi (193) 48 Pentium 4 pipeline • One specific pipeline (misprediction) ECE437, S'21 © Vijaykumar and Thottethodi (194) Core 2 • 45nm • Multiple decodes • 14 stage pipeline – (went as high as ~31 in Pentium 4 line) – Many other considerations • Pipelining for yield • Source: Wikipedia ECE437, S'21 © Vijaykumar and Thottethodi (195) i7 • 6 uops/cycle • 14 stage pipeline – 17 cycles br penalty – 48 loads and 32 stores in-flight ECE437, S'21 © Vijaykumar and Thottethodi (196) ARM A3 • 3-cycle fetch – Addr Generation Unit uses BTB • 5-cycle decode and 6-cycle EX ECE437, S'21 © Vijaykumar and Thottethodi (197) 49 • Nottoo dissimilar from 437 • 4threads • Eightsuch processors • March/April 2005 Issue of IEEE MICRO ECE437, S'21 © Vijaykumar and Thottethodi on a chip Sun Niagara (198) • X86 contender • BeatsIntel i7 1185G7 (28 W) and close to AMD Ryzen 9 5950X (49 W) • Foriphone! • <5W! ECE437, S'21 © Vijaykumar and Thottethodi (199) Apple A14 Firestorm Summary • Exceptions – Know how to handle the easy cases • What to squash, what not to – Know how complicated exceptions can be • Read Chapter 4 NOW – Maximize impact – Study while lecture material is “warm” • 2-3 hours now vs. 8-12 hours later ECE437, S'21 © Vijaykumar and Thottethodi (200) 50

Related Posts