CSE 371 Computer Organization and Design

Unit 6: Pipelining

Based on slides by Profs. , & C.J.
This Unit: Pipelining
Processor performance
Latency vs throughput
Single-cycle & multi-cycle datapaths
Basic pipelining
Data hazards
Software interlocks and scheduling
Hardware interlocks and stalling
Load-use stalling
Pipelined multi-cycle operations
Control hazards
Branch prediction



Welcome to the Laundromat

other pipelining examples?

’s Big Idea:

In-Class Exercise
You have a washer, dryer, and “folding robot”
Each takes 1 unit of time per load
How long for one load in total?
How long for two loads of laundry?
How long for 100 loads of laundry?

Now assume:
Washing takes 30 minutes, drying 60 minutes, and folding 15 min
How long for one load in total?
How long for two loads of laundry?
How long for 100 loads of laundry?


No pipeline: 1 load = 3 units, 2 loads = 6 units, 100 loads = 300 units
With pipeline: 1 load = 3 units, 2 loads = 4 units, 100 loads = 102 units
Second scenario (pipelined): 1 load = 105 minutes, 2 loads = 165 minutes, 100 loads = 30 + 60*100 + 15 minutes = 100h45m
Where is the parallelism?
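These numbers fall out of a simple bottleneck argument: the first load takes the sum of the stage times, and every later load finishes one bottleneck-stage time after the previous one. A minimal Python sketch of that model (stage times are the ones from the exercise; the helper names are just for illustration):

# Minimal sketch: time to finish N loads on a 3-stage laundry "pipeline".
# Assumes each load uses washer, dryer, folder in order and each stage
# holds one load at a time.
def unpipelined_time(n_loads, stage_times):
    # each load finishes completely before the next starts
    return n_loads * sum(stage_times)

def pipelined_time(n_loads, stage_times):
    # first load fills the pipeline; afterwards one load finishes per
    # bottleneck-stage time
    return sum(stage_times) + (n_loads - 1) * max(stage_times)

for n in (1, 2, 100):                      # part 1: 1 unit per stage
    print(n, unpipelined_time(n, [1, 1, 1]), pipelined_time(n, [1, 1, 1]))
# -> 3 vs 3, 6 vs 4, 300 vs 102 units

for n in (1, 2, 100):                      # part 2: wash 30, dry 60, fold 15
    print(n, pipelined_time(n, [30, 60, 15]))
# -> 105 min, 165 min, 6045 min (= 100h45m); drying is the bottleneck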

In-Class Exercise Answers
You have a washer, dryer, and “folding robot”
Each takes 1 unit of time per load
How long for one load in total? 3 units
How long for two loads of laundry? 4 units (pipelined; 6 without)
How long for 100 loads of laundry? 102 units (pipelined; 300 without)

Now assume:
Washing takes 30 minutes, drying 60 minutes, and folding 15 min
How long for one load in total? 30+60+15=1h45m
How long for two loads of laundry? 30+(60*2)+15 = 2h45m
How long for 100 loads of laundry? 30+(60*100)+15= 100h45m


How could we make scenario #2 faster? Get more folding robots?

CIS 240: build something that works
CIS 371: build something that works “well”
“well” means “high-performance” but also cheap, low-power, etc.
Mostly “high-performance”
So, what is the performance of this?
What is performance?

Performance


Processor Performance Equation
Instructions per program: “dynamic instruction count”
Runtime count of instructions executed by the program
Determined by program, compiler, instruction set architecture (ISA)
Cycles per instruction: “CPI” (typical range: 2 to 0.5)
On average, how many cycles does an instruction take to execute?
Determined by program, compiler, ISA, micro-architecture
Seconds per cycle: “clock period” (typical range: 2ns to 0.25ns)
Reciprocal is frequency: 0.5 GHz to 4 GHz (1 Hertz = 1 cycle per second)
Determined by micro-architecture, technology parameters
For minimum execution time, minimize each term
Difficult: often pull against one another

Execution time = “seconds per program” =
(instructions/program) * (seconds/cycle) * (cycles/instruction)
Example: (1 billion instructions) * (1ns per cycle) * (1 cycle per insn) = 1 second
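As a sanity check, the equation can be evaluated directly; a minimal sketch using the example numbers above:

# seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)
def execution_time(insn_count, cpi, clock_period_s):
    return insn_count * cpi * clock_period_s

print(execution_time(1e9, 1.0, 1e-9))   # 1 billion insns, CPI 1, 1ns clock -> 1.0 s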

LC4 doesn’t have subtract-immediate insn: what’s the impact of adding it?

Cycles per Instruction (CPI)
CPI: cycles per instruction, on average
IPC = 1/CPI
Used more frequently than CPI
Favored because “bigger is better”, but harder to compute with
Different instructions have different cycle costs
E.g., “add” typically takes 1 cycle, “divide” takes >10 cycles
Depends on relative instruction frequencies

CPI example
A program executes equal numbers of integer, floating point (FP), and memory ops
Cycles per instruction type: integer = 1, memory = 2, FP = 3
What is the CPI? (33% * 1) + (33% * 2) + (33% * 3) = 2
Caveat: this sort of calculation ignores many effects
Back-of-the-envelope arguments only
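The same back-of-the-envelope calculation in code (instruction mix and cycle counts are the assumptions stated above):

# Average CPI as a weighted sum over instruction types.
mix = {"integer": 1/3, "memory": 1/3, "fp": 1/3}   # fraction of dynamic insns
cycles = {"integer": 1, "memory": 2, "fp": 3}      # cycles per insn type
cpi = sum(mix[t] * cycles[t] for t in mix)
print(cpi)                                         # -> 2.0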

Note: CPI is a measure of throughput, not latency – in a pipeline, each individual instruction actually takes many cycles

Improving Clock Frequency
Faster transistors

Micro-architectural techniques
Multi-cycle processors
Break each instruction into small bits
Less logic delay -> improved clock frequency
Different instructions take different number of cycles
Pipelined processors
As above, but overlap parts of instruction (parallelism!)
Faster clock, but CPI can still be around 1

Single-Cycle Datapath
Single-cycle datapath: true “atomic” fetch/execute loop
Fetch, decode, execute one complete instruction every cycle
Takes 1 cycle to execute any instruction by definition (“CPI” is 1)
Long clock period: to accommodate slowest instruction
(worst-case delay through circuit, must wait this long every time)

(figure: single-cycle datapath; clock period = Tsinglecycle)

Latency versus Throughput
Latency (execution time): time to finish a fixed task
Throughput (bandwidth): number of tasks in fixed time
Different: exploit parallelism for throughput, not latency (e.g., bread)
Often contradictory (latency vs. throughput)
Will see many examples of this
Choose definition of performance that matches your goals
Scientific program? Latency. Web server? Throughput.
Example: move people 10 miles
Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency: car = 10 min, bus = 30 min
Throughput: car = 15 PPH (count return trip), bus = 60 PPH
Fastest way to send 10TB of data? (at 1+ gbits/second)

Answer: Use FedEx overnight to deliver the drive. 10 TB in, say, 15 hours is roughly 1.5 Gbit/s of bandwidth (667 GB/hour)
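Both the car/bus numbers and the "sneakernet" bandwidth can be checked the same way; a rough sketch using the figures above:

# Latency vs. throughput for the 10-mile trip.
distance_mi = 10

def latency_min(speed_mph):
    return distance_mi / speed_mph * 60

def throughput_pph(capacity, speed_mph):
    round_trip_h = 2 * distance_mi / speed_mph     # count the return trip
    return capacity / round_trip_h                 # people per hour

print(latency_min(60), throughput_pph(5, 60))      # car: 10 min, 15 PPH
print(latency_min(20), throughput_pph(60, 20))     # bus: 30 min, 60 PPH

bits = 10e12 * 8                                   # 10 TB as bits
print(bits / (15 * 3600) / 1e9)                    # FedEx: ~1.5 Gbit/s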

Pipelining
Important performance technique
Improves instruction throughput, not instruction latency
Break instruction execution into stages
When insn advances from stage 1 to 2, next insn enters at stage 1
Form of parallelism: “insn-stage parallelism”
Maintains illusion of sequential fetch/execute loop
Individual instruction takes the same number of stages
But instructions enter and leave at a much faster rate
Just like our laundromat
insn0.fetch  insn0.exec
             insn1.fetch  insn1.exec

5 Stage Pipeline: Inter-Insn Parallelism
Pipelining: cut datapath into N stages (here 5)
One insn in each stage in each cycle
Clock period = MAX(Tinsn-mem, Tregfile, TALU, Tdata-mem, Twriteback)
Base CPI = 1: insn enters and leaves every cycle
Actual CPI > 1: pipeline must often “stall”
Individual insn latency increases (pipeline overhead)

Note feedback paths for PC and from Writeback stage into register file.

Five stage: Fetch, Decode, eXecute, Memory, Writeback
Nothing magical about 5 stages (Pentium 4 had 22 stages!)
Pipeline registers named by stages they begin
PC, D, X, M, W

More Terminology & Foreshadowing
Scalar pipeline: one insn per stage per cycle
Alternative: “superscalar” (later)

In-order pipeline: insns enter execute stage in order
Alternative: “out-of-order” (later)

Pipeline depth: number of pipeline stages
Nothing magical about five
Contemporary high-performance cores have ~15 stage pipelines

Instruction Convention
Different ISAs use inconsistent register orders

Some ISAs (for example MIPS)
Instruction destination (i.e., output) on the left
add $1,$2,$3 means $1⟵$2+$3

Other ISAs
Instruction destination (i.e., output) on the right
add r1,r2,r3 means r1+r2⟶r3
ld 8(r5),r4 means mem[r5+8]⟶r4
st r4,8(r5) means r4⟶mem[r5+8]

Will try to specify to avoid confusion, next slides MIPS style

Pipeline Example: Cycle 1
3 instructions

add $3⟵$2,$1

Pipeline Example: Cycle 2

lw $4⟵8($5)
add $3⟵$2,$1

Pipeline Example: Cycle 3

sw $6⟶4($7)
lw $4⟵8($5)
add $3⟵$2,$1

Pipeline Example: Cycle 4

sw $6⟶4($7)
lw $4⟵8($5)
add $3⟵$2,$1


Pipeline Example: Cycle 5

sw $6⟶4($7)
lw $4⟵8($5)

Pipeline Example: Cycle 6

sw $6⟶4($7)

Pipeline Example: Cycle 7
Pipeline Diagram
Pipeline diagram: shorthand for what we just saw
Across: cycles
Down: insns
Convention: X means lw $4⟵8($5) finishes execute stage and writes into M register at end of cycle 4
assuming no stalls (discussed later)

             1  2  3  4  5  6  7  8  9
add $3,$2,$1 F  D  X  M  W
lw  $4,8($5)    F  D  X  M  W
sw  $6,4($7)       F  D  X  M  W
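With no stalls the diagram is mechanical: instruction i occupies stage s during cycle i + s + 1. A minimal sketch that prints such a diagram (assumes the five stages F, D, X, M, W and one insn fetched per cycle):

# Print a no-stall pipeline diagram for a list of instructions.
STAGES = ["F", "D", "X", "M", "W"]

def pipeline_diagram(insns):
    n_cycles = len(insns) + len(STAGES) - 1
    rows = [" " * 16 + " ".join(f"{c:>2}" for c in range(1, n_cycles + 1))]
    for i, insn in enumerate(insns):
        cells = ["  "] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = f" {stage}"      # insn i is in stage s at cycle i+s+1
        rows.append(f"{insn:<16}" + " ".join(cells))
    return "\n".join(rows)

print(pipeline_diagram(["add $3,$2,$1", "lw $4,8($5)", "sw $6,4($7)"]))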

Example Pipeline Perf. Calculation
Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50ns/insn
5-stage pipelined
Clock period = 12ns approx. (50ns / 5 stages) + overheads
CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
Performance = 12ns/insn
Well actually … CPI = 1 + some penalty for pipelining (next)
CPI = 1.5 (on average insn completes every 1.5 cycles)
Performance = 18ns/insn
Much higher performance than single-cycle
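Putting the designs side by side, with the stall penalty folded into CPI (numbers from this slide; a rough sketch only):

# Average time per instruction = CPI * clock period.
def ns_per_insn(cpi, clock_ns):
    return cpi * clock_ns

single_cycle = ns_per_insn(1.0, 50)      # 50 ns/insn
ideal_pipe   = ns_per_insn(1.0, 12)      # 12 ns/insn
real_pipe    = ns_per_insn(1.5, 12)      # 18 ns/insn once stalls are counted
print(single_cycle / real_pipe)          # ~2.8x speedup over single-cycle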

Q1: Why Is Pipeline Clock Period …
… > (delay thru datapath) / (number of pipeline stages)?

Three reasons:
Registers add delay
Pipeline stages have different delays, clock period is max delay
Extra datapaths for pipelining (bypassing paths)

These factors have implications for the ideal number of pipeline stages
Diminishing clock frequency gains for longer (deeper) pipelines

Q2: Why Is Pipeline CPI…
CPI for scalar in-order pipeline is 1 + stall penalties
Stalls used to resolve hazards
Hazard: condition that jeopardizes sequential illusion
Stall: pipeline delay introduced to restore sequential illusion

Calculating pipeline CPI
Frequency of stall * stall cycles
Penalties add (stalls typically don’t overlap in in-order pipelines)
1 + (stall-freq1*stall-cyc1) + (stall-freq2*stall-cyc2) + …

Correctness/performance/make common case fast
Long penalties OK if they are rare, e.g., 1 + (0.01 * 10) = 1.1
Stalls also have implications for ideal number of pipeline stages

Pipeline Control


Control Signals

Instruction Decode

Control signals need to be pipelined

Data Dependences, Pipeline Hazards, and Bypassing

Dependences and Hazards
Dependence: relationship between two insns
Data: two insns use same storage location
Control: one insn affects whether another executes at all
Not a bad thing, programs would be boring without them
Enforced by making older insn go before younger one
Happens naturally in single-/multi-cycle designs
But not in a pipeline

Hazard: dependence & possibility of wrong insn order
Effects of wrong insn order must not be externally visible
Stall: enforce order by keeping the younger insn in the same stage
Hazards are a bad thing: stalls reduce performance

Data Hazards
Let’s forget about branches for now
The three insn sequence we saw earlier executed fine…
But it wasn’t a real program
Real programs have data dependences
They pass values via registers and memory

add $3⟵$2,$1
lw $4⟵8($5)
sw $6⟶4($7)

Dependent Operations
Independent operations

add $3⟵$2,$1
add $6⟵$5,$4

Would this program execute correctly on a pipeline?

add $3⟵$2,$1
add $6⟵$5,$3

What about this program?

add $3⟵$2,$1
lw $4⟵8($3)
addi $6⟵1,$3
sw $3⟶8($7)

Data Hazards
Would this “program” execute correctly on this pipeline?
Which insns would execute with correct inputs?
add is writing its result into $3 in current cycle
lw read $3 two cycles ago ⟶ got wrong value
addi read $3 one cycle ago ⟶ got wrong value
sw is reading $3 this cycle ⟶ maybe (depending on regfile design)
add $3⟵$2,$1
lw $4⟵8($3)
addi $6⟵1,$3
sw $3⟶4($7)

Our registers are all positive-edge triggered, so sw won’t get the result from add
Can play tricks with mixing positive- and negative-edge registers

Fixing Register Data Hazards
Can only read register value three cycles after writing it

Option #1: make sure programs don’t do it
Compiler puts two independent insns between write/read insn pair
If they aren’t there already
Independent means: “do not interfere with register in question”
Do not write it: otherwise meaning of program changes
Do not read it: otherwise create new data hazard
Code scheduling: compiler moves existing insns to do this
If none can be found, must use nops (no-operation)

This is called software interlocks
MIPS: Microprocessor without Interlocked Pipeline Stages

Software Interlock Example
add $3⟵$2,$1
lw $4⟵8($3)
sw $7⟶8($3)
sub $6⟵$2,$8
addi $3⟵$5,4
Can any of the last three insns be scheduled between the first two?
sw $7⟶8($3)? No, creates hazard with add $3⟵$2,$1
sub $6⟵$2,$8? Okay
addi $3⟵$5,4? No, lw would read $3 from it
Still need one more insn, use nop
add $3⟵$2,$1
sub $6⟵$2,$8
nop
lw $4⟵8($3)
sw $7⟶8($3)
addi $3⟵$5,4
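A schedule like this can be checked mechanically: every register read must come at least three insns after the write that produced the value. The sketch below is a simplified checker under that assumption; the (dest, srcs) encoding of each insn is made up for illustration, and it ignores memory and anti-dependences:

# Check that no register is read within 3 insns of its writer
# (5-stage pipeline, no bypassing).
MIN_DIST = 3

def check_schedule(insns):
    last_write = {}                       # reg -> index of most recent writer
    ok = True
    for i, (name, dest, srcs) in enumerate(insns):
        for r in srcs:
            if r in last_write and i - last_write[r] < MIN_DIST:
                print(f"hazard: {name} reads ${r} too soon after its writer")
                ok = False
        if dest is not None:
            last_write[dest] = i
    return ok

scheduled = [
    ("add $3<-$2,$1",  3,    [2, 1]),
    ("sub $6<-$2,$8",  6,    [2, 8]),
    ("nop",            None, []),
    ("lw $4<-8($3)",   4,    [3]),
    ("sw $7->8($3)",   None, [7, 3]),
    ("addi $3<-$5,4",  3,    [5]),
]
print(check_schedule(scheduled))          # -> True (no hazards)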

Software Interlock Performance
Branch: 20%, load: 20%, store: 10%, other: 50%

For software interlocks, let’s assume:
20% of insns require insertion of 1 nop
5% of insns require insertion of 2 nops

CPI is still 1 technically
But now there are more insns
#insns = 1 + 0.20*1 + 0.05*2 = 1.3
30% more insns (30% slowdown) due to data hazards
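The 30% figure is just the weighted nop count; a one-line sketch:

# Relative dynamic insn count with software interlocks, assuming
# 20% of insns need 1 nop and 5% need 2 nops.
rel_insns = 1 + 0.20 * 1 + 0.05 * 2
print(rel_insns)    # -> 1.3, i.e. 30% more dynamic insns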

Hardware Interlocks
Problem with software interlocks? Not compatible
Where does 3 in “read register 3 cycles after writing” come from?
From structure (depth) of pipeline
What if next MIPS version uses a 7-stage pipeline?
Programs compiled assuming 5-stage pipeline will break!

Option #2: hardware interlocks
Processor detects data hazards and fixes them
Resolves the above compatibility concern
Two aspects to this
Detecting hazards
Fixing hazards

Detecting Data Hazards
Compare input register names of insn in D stage
with output register names of older insns in pipeline
(D.IR.RegSrc1 == X.IR.RegDest) ||
(D.IR.RegSrc2 == X.IR.RegDest) ||
(D.IR.RegSrc1 == M.IR.RegDest) ||
(D.IR.RegSrc2 == M.IR.RegDest)

Also need to check X.RegWrite and M.RegWrite to see if those instructions are actually going to write to the register file.
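Folding in that note, the full stall condition also qualifies each comparison with the older insn's register-write enable. The sketch below restates the predicate in Python; the argument names (rs1, rd, reg_write, ...) are illustrative stand-ins for D.IR.RegSrc1, X.IR.RegDest, X.RegWrite, and so on:

# Stall if a source register of the insn in D matches the destination of
# an insn in X or M that will actually write the register file.
def must_stall(d_rs1, d_rs2, x_rd, x_reg_write, m_rd, m_reg_write):
    def conflict(rd, reg_write):
        # a real design would also ignore a hardwired zero register here
        return reg_write and rd is not None and rd in (d_rs1, d_rs2)
    return conflict(x_rd, x_reg_write) or conflict(m_rd, m_reg_write)

# add $3<-$2,$1 in X, lw $4<-0($3) in D: lw's source $3 matches X's dest.
print(must_stall(d_rs1=3, d_rs2=None, x_rd=3, x_reg_write=True,
                 m_rd=None, m_reg_write=False))    # -> True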


Fixing Data Hazards
Prevent D insn from reading (advancing) this cycle
Write nop into X.IR (effectively, insert nop in hardware)
Also reset (clear) the datapath control signals
Disable D register and PC write enables (why?)
Re-evaluate situation next cycle

Now that we know that there is a problem, what do we do?
We only really need to worry about the control signals that update state: RegWrite and MemWrite, and PCwe
Why is waiting for the next cycle a reasonable strategy?
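One way to picture the maneuver is a per-cycle update of the pipeline latches. This is a minimal sketch of pipeline control only (a toy dict-of-stages model, not the real datapath): on a stall, M and W keep draining while F and D hold still and X receives a bubble, which is why simply re-checking the condition next cycle eventually succeeds.

# One control cycle of a 5-stage pipeline with stalling.
# 'pipe' maps stage name -> insn occupying it (or NOP for a bubble).
NOP = "nop"

def advance(pipe, next_insn, stall):
    new = {"W": pipe["M"], "M": pipe["X"]}    # older insns always drain
    if stall:
        new["X"] = NOP           # inject bubble; clear RegWrite/MemWrite/PCwe
        new["D"] = pipe["D"]     # D write-enable off: insn re-decodes
        new["F"] = pipe["F"]     # PC write-enable off: same insn re-fetched
    else:
        new["X"] = pipe["D"]
        new["D"] = pipe["F"]
        new["F"] = next_insn
    return new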

Hardware Interlock Example: cycle 1
(D.IR.RegSrc1 == X.IR.RegDest) ||
(D.IR.RegSrc2 == X.IR.RegDest) ||
(D.IR.RegSrc1 == M.IR.RegDest) ||
(D.IR.RegSrc2 == M.IR.RegDest) = 1

add $3⟵$2,$1
lw $4⟵0($3)

Hardware Interlock Example: cycle 2

add $3⟵$2,$1
lw $4⟵0($3)

(D.IR.RegSrc1 == X.IR.RegDest) ||
(D.IR.RegSrc2 == X.IR.RegDest) ||
(D.IR.RegSrc1 == M.IR.RegDest) ||
(D.IR.RegSrc2 == M.IR.RegDest) = 1

Hardware Interlock Example: cycle 3

add $3⟵$2,$1
lw $4⟵0($3)

(D.IR.RegSrc1 == X.IR.RegDest) ||
(D.IR.RegSrc2 == X.IR.RegDest) ||
(D.IR.RegSrc1 == M.IR.RegDest) ||
(D.IR.RegSrc2 == M.IR.RegDest) = 0

Pipeline Control Terminology
Hardware interlock maneuver is called stall or bubble

Mechanism is called stall logic

Part of more general pipeline control mechanism
Controls advancement of insns through pipeline

Distinguish from pipelined datapath control
Controls datapath within each stage
Pipeline control controls advancement of datapath control

Hardware Interlock Performance
As before:
Branch: 20%, load: 20%, store: 10%, other: 50%

Hardware interlocks: same as software interlocks
20% of insns require a 1-cycle stall (i.e., insertion of 1 nop)
5% of insns require a 2-cycle stall (i.e., insertion of 2 nops)

CPI = 1 + 0.20*1 + 0.05*2 = 1.3
So, either CPI stays at 1 and #insns increases 30% (software)
Or, #insns stays at 1 (relative) and CPI increases 30% (hardware)
Same difference

Anyway, we can do better

Observation!
Technically, this situation is broken
lw $4⟵8($3) has already read $3 from regfile
add $3⟵$2,$1 hasn’t yet written $3 to regfile
But fundamentally, everything is OK
lw $4⟵8($3) hasn’t actually used $3 yet
add $3⟵$2,$1 has already computed $3

add $3⟵$2,$1
lw $4⟵8($3)

Bypassing
Reading a value from an intermediate (microarchitectural) source
Not waiting until it is available from primary source
Here, we are bypassing the register file
Also called forwarding

add $3⟵$2,$1
lw $4⟵8($3)

Why is it called forwarding when the data is going backwards?
