Computer Organization and Design
Unit 6: Pipelining
Based on slides by Profs. , & C.J. IS 501: Comp. Arch. | Dr. | Pipelining 1
Copyright By PowCoder代写 加微信 powcoder
This Unit: Pipelining
System software
• Processor performance • Latency vs throughput
• Single-cycle & multi-cycle datapaths
• Basic pipelining
• Data hazards
• Software interlocks and scheduling
• Hardware interlocks and stalling • Bypassing
• Load-use stalling
• Pipelined multi-cycle operations
• Control hazards
• Branch prediction
CIS 501: Comp. Arch. | Dr. | Pipelining 2
• Chapter 4
CIS 501: Comp. Arch. | Dr. | Pipelining 3
Welcome to the Laundromat
CIS 501: Comp. Arch. | Dr. | Pipelining 4
’s Big Idea:
CIS 501: Comp. Arch. | Dr. | Pipelining 5
In-Class Exercise
• You have a washer, dryer, and “folding robot” • Each takes 1 unit of time per load
• How long for one load in total?
• How long for two loads of laundry?
• How long for 100 loads of laundry?
• Now assume:
• Washing takes 30 minutes, drying 60 minutes, and folding 15 min • How long for one load in total?
• How long for two loads of laundry?
• How long for 100 loads of laundry?
CIS 501: Comp. Arch. | Prof. | Pipelining 6
In-Class Exercise Answers
• You have a washer, dryer, and “folding robot” • Each takes 1 unit of time per load
• How long for one load in total?
• How long for two loads of laundry?
• How long for 100 loads of laundry?
• Now assume:
• Washing takes 30 minutes, drying 60 minutes, and folding 15 min • How long for one load in total? 30+60+15=1h45m
• How long for two loads of laundry? 30+(60*2)+15 = 2h45m
• How long for 100 loads of laundry? 30+(60*100)+15= 100h45m
CIS 501: Comp. Arch. | Prof. | Pipelining 7
• CIS 240: build something that works
• CIS 371: build something that works “well”
• “well” means “high-performance” but also cheap, low-power, etc. • Mostly “high-performance”
• So, what is the performance of this?
• What is performance?
CIS 501: Comp. Arch. | Dr. | Pipelining 8
Performance
CIS 501: Comp. Arch. | Dr. | Pipelining 9
Processor Performance Equation
(1 billion instructions) * (1ns per cycle) * (1 cycle per insn) = 1 second
• Instructions per program: “dynamic instruction count”
• Runtime count of instructions executed by the program
• Determined by program, compiler, instruction set architecture (ISA)
• Cyclesperinstruction:“CPI”(typicalrange:2to0.5)
• On average, how many cycles does an instruction take to execute?
• Determined by program, compiler, ISA, micro-architecture
• Sec.percycle:“clockperiod”(typicalrange:2nsto0.25ns
• Reciprocal is frequency: 0.5 Ghz to 4 Ghz (1 Hertz = 1 cycle per sec) • Determined by micro-architecture, technology parameters
• For minimum execution time, minimize each term
• Difficult: often pull against one another
CIS 501: Comp. Arch. | Dr. | Pipelining 10
Execution time = “seconds per program” = (instructions/program) * (seconds/cycle) * (cycles/instruction)
Cycles per Instruction (CPI)
• CPI: Cycle/instruction on average • IPC = 1/CPI
• Used more frequently than CPI
• Favored because “bigger is better”, but harder to compute with • Different instructions have different cycle costs
• E.g., “add” typically takes 1 cycle, “divide” takes >10 cycles • Depends on relative instruction frequencies
• CPI example
• A program executes equal: integer, floating point (FP), memory ops • Cycles per instruction type: integer = 1, memory = 2, FP = 3
• WhatistheCPI?(33%*1)+(33%*2)+(33%*3)=2
• Caveat:thissortofcalculationignoresmanyeffects
• Back-of-the-envelope arguments only
CIS 501: Comp. Arch. | Dr. | Pipelining 11
Improving Clock Frequency
• Fastertransistors
• Micro-architectural techniques
• Multi-cycleprocessors
• Break each instruction into small bits
• Less logic delay -> improved clock frequency
• Different instructions take different number of cycles
• Pipelined processors
• As above, but overlap parts of instruction (parallelism!) • Faster clock, but CPI can still be around 1
CIS 501: Comp. Arch. | Dr. | Pipelining 12
Single-Cycle Datapath
Register File
Tsinglecycle
• Fetch, decode, execute one complete instruction every cycle
+ Takes 1 cycle to execution any instruction by definition (“CPI” is 1)
– Long clock period: to accommodate slowest instruction (worst-case delay through circuit, must wait this long every time)
• Single-cycledatapath:true“atomic”fetch/executeloop
CIS 501: Comp. Arch. | Dr. | Pipelining 13
Latency versus Throughput
• Latency(executiontime):timetofinishafixedtask
• Throughput(bandwidth):numberoftasksinfixedtime
• Different: exploit parallelism for throughput, not latency (e.g., bread) • Often contradictory (latency vs. throughput)
• Will see many examples of this
• Choose definition of performance that matches your goals
• Scientific program? Latency, web server: throughput?
• Example: move people 10 miles
• Car: capacity = 5, speed = 60 miles/hour
• Bus: capacity = 60, speed = 20 miles/hour
• Latency: car = 10 min, bus = 30 min
• Throughput: car = 15 PPH (count return trip), bus = 60 PPH
• Fastest way to send 10TB of data? (at 1+ gbits/second)
CIS 501: Comp. Arch. | Dr. | Pipelining 14
CIS 501: Comp. Arch. | Dr. | Pipelining 15
Pipelining
insn0.fetch
insn0.exec
insn1.fetch
• Important performance technique
• Improvesinstructionthroughput,notinstructionlatency
• Break instruction execution into stages
• When insn advances from stage 1 to 2, next insn enters at stage 1
• Form of parallelism: “insn-stage parallelism”
• Maintains illusion of sequential fetch/execute loop
• Individual instruction takes the same number of stages
+ But instructions enter and leave at a much faster rate
• Just like our laundromat
CIS 501: Comp. Arch. | Dr. | Pipelining 16
insn1.exec
5 Stage Pipeline: Inter-Insn Parallelism
Tinsn-mem Tregfile TALU Tdata-mem Tregfile
• Pipelining:cutdatapathintoNstages(here5) • One insn in each stage in each cycle
Tsinglecycle
+ Clock period = MAX(Tinsn-mem, Tregfile, TALU, Tdata-mem, Twriteback) + Base CPI = 1: insn enters and leaves every cycle
– Actual CPI > 1: pipeline must often “stall”
• Individual insn latency increases (pipeline overhead)
CIS 501: Comp. Arch. | Dr. | Pipelining 17
Register File
• Five stage: Fetch, Decode, eXecute, Memory, Writeback • Nothing magical about 5 stages (Pentium 4 had 22 stages!)
• Pipeline registers named by stages they begin • PC,D,X,M,W
CIS 501: Comp. Arch. | Dr. | Pipelining 18
More Terminology & Foreshadowing
• Scalarpipeline:oneinsnperstagepercycle • Alternative: “superscalar” (later)
• In-orderpipeline:insnsenterexecutestageinorder • Alternative: “out-of-order” (later)
• Pipelinedepth:numberofpipelinestages
• Nothing magical about five
• Contemporary high-performance cores have ~15 stage pipelines
CIS 501: Comp. Arch. | Dr. | Pipelining 19
Instruction Convention
• Different ISAs use inconsistent register orders • Some ISAs (for example MIPS)
• Instruction destination (i.e., output) on the left • add $1,$2,$3 means $1⟵$2+$3
• Other ISAs
• Instruction destination (i.e., output) on the right add r1,r2,r3 means r1+r2⟶r3
ld 8(r5),r4 means mem[r5+8]⟶r4
st r4,8(r5) means r4⟶mem[r5+8]
• Will try to specify to avoid confusion, next slides MIPS style
CIS 501: Comp. Arch. | Dr. | Pipelining 20
Pipeline Example: Cycle 1
Register File
add $3⟵$2,$1
• 3 instructions
CIS 501: Comp. Arch. | Dr. | Pipelining 21
Pipeline Example: Cycle 2
Register File
lw $4⟵8($5) add $3⟵$2,$1
CIS 501: Comp. Arch. | Dr. | Pipelining 22
Pipeline Example: Cycle 3
Register File
sw $6⟶4($7) lw $4⟵8($5) add $3⟵$2,$1
CIS 501: Comp. Arch. | Dr. | Pipelining
Pipeline Example: Cycle 4
sw $6⟶4($7) lw $4⟵8($5) add $3⟵$2,$1
Register File
CIS 501: Comp. Arch. | Dr. | Pipelining
Pipeline Example: Cycle 5
sw $6⟶4($7) lw $4⟵8($5)
Register File
CIS 501: Comp. Arch. | Dr. | Pipelining
Pipeline Example: Cycle 6
sw $6⟶4($7)
Register File
CIS 501: Comp. Arch. | Dr. | Pipelining
Pipeline Example: Cycle 7
Register File
CIS 501: Comp. Arch. | Dr. | Pipelining 27
Pipeline Diagram
• Pipelinediagram:shorthandforwhatwejustsaw
• Across: cycles
• Down: insns
• Convention:Xmeanslw$4⟵8($5)finishesexecutestageand
writes into M register at end of cycle 4 • assuming no stalls (discussed later)
add $3,$2,$1
lw $4,8($5)
sw $6,4($7)
CIS 501: Comp. Arch. | Dr. | Pipelining 28
Example Pipeline Perf. Calculation
• Single-cycle
• Clock period = 50ns, CPI = 1 • Performance = 50ns/insn
• 5-stage pipelined
• Clock period = 12ns approx. (50ns / 5 stages) + overheads + CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
+ Performance = 12ns/insn
– Well actually … CPI = 1 + some penalty for pipelining (next)
• CPI = 1.5 (on average insn completes every 1.5 cycles) • Performance = 18ns/insn
• Much higher performance than single-cycle
CIS 501: Comp. Arch. | Dr. | Pipelining 29
Q1: Why Is Pipeline Clock Period …
• … > (delay thru datapath) / (number of pipeline stages)?
• Three reasons:
• Registers add delay
• Pipeline stages have different delays, clock period is max delay • Extra datapaths for pipelining (bypassing paths)
• These factors have implications for ideal number pipeline stages
• Diminishing clock frequency gains for longer (deeper) pipelines
CIS 501: Comp. Arch. | Dr. | Pipelining 30
Q2: Why Is Pipeline CPI…
• CPI for scalar in-order pipeline is 1 + stall penalties • Stalls used to resolve hazards
• Hazard: condition that jeopardizes sequential illusion
• Stall: pipeline delay introduced to restore sequential illusion
• Calculating pipeline CPI
• Frequency of stall * stall cycles
• Penalties add (stalls typically don’t overlap in in-order pipelines) • 1 + (stall-freq1*stall-cyc1) + (stall-freq2*stall-cyc2) + …
• Correctness/performance/make common case fast
• Long penalties OK if they are rare, e.g., 1 + (0.01 * 10) = 1.1 • Stalls also have implications for ideal number of pipeline stages
CIS 501: Comp. Arch. | Dr. | Pipelining 31
Pipeline Control
CIS 501: Comp. Arch. | Dr. | Pipelining 32
Control Signals
CIS 501: Comp. Arch. | Dr. | Pipelining 33
Instruction Decode
CIS 501: Comp. Arch. | Dr. | Pipelining 34
Data Dependences, Pipeline Hazards, and Bypassing
CIS 501: Comp. Arch. | Dr. | Pipelining 35
Dependences and Hazards
• Dependence:relationshipbetweentwoinsns
• Data:twoinsnsusesamestoragelocation
• Control:oneinsnaffectswhetheranotherexecutesatall • Not a bad thing, programs would be boring without them • Enforced by making older insn go before younger one
• Happens naturally in single-/multi-cycle designs • But not in a pipeline
• Hazard:dependence&possibilityofwronginsnorder
• Effects of wrong insn order must not be externally visible • Stall: for order by keeping younger insn in same stage
• Hazards are a bad thing: stalls reduce performance
CIS 501: Comp. Arch. | Dr. | Pipelining 36
Data Hazards
Register File
sw $6⟶4($7) lw $4⟵8($5) add $3⟵$2,$1 • Let’s forget about branches for now
• The three insn sequence we saw earlier executed fine… • But it wasn’t a real program
• Real programs have data dependences
• They pass values via registers and memory
CIS 501: Comp. Arch. | Dr. | Pipelining 37
Dependent Operations
• Independent operations
add $3⟵$2,$1 add $6⟵$5,$4
• Would this program execute correctly on a pipeline?
add $3⟵$2,$1 add $6⟵$5,$3
• What about this program?
add $3⟵$2,$1 lw $4⟵8($3) addi $6⟵1,$3 sw $3⟶8($7)
CIS 501: Comp. Arch. | Dr. | Pipelining 38
Data Hazards
Register File
sw $3⟶4($7) addi $6⟵1,$3 lw $4⟵8($3) add $3⟵$2,$1 • Would this “program” execute correctly on this pipeline?
• Which insns would execute with correct inputs?
• addiswritingitsresultinto$3incurrentcycle
– lw read $3 two cycles ago ® got wrong value
– addiread$3onecycleago® gotwrongvalue
• swisreading$3thiscycle®maybe(dependingonregfiledesign)
CIS 501: Comp. Arch. | Dr. | Pipelining 39
Fixing Register Data Hazards
• Can only read register value three cycles after writing it
• Option#1:makesureprogramsdon’tdoit
• Compiler puts two independent insns between write/read insn pair
• If they aren’t there already
• Independent means: “do not interfere with register in question”
• Do not write it: otherwise meaning of program changes
• Do not read it: otherwise create new data hazard
• Codescheduling:compilermovesexistinginsnstodothis • If none can be found, must use nops (no-operation)
• This is called software interlocks
• MIPS: Microprocessor w/out Interlocking Pipeline Stages
CIS 501: Comp. Arch. | Dr. | Pipelining 40
Software Interlock Example
add $3⟵$2,$1 nop
lw $4⟵8($3) sw $7⟶8($3) sub $6⟵$2,$8 addi $3⟵$5,4
• Can any of last three insns be scheduled between first two • sw $7⟶8($3)? No, creates hazard with add $3⟵$2,$1
• sub $6⟵$2,$8? Okay
• addi$3⟵$5,4?No,lwwouldread$3fromit
• Still need one more insn, use nop add $3⟵$2,$1
sub $6⟵$2,$8 nop
lw $4⟵8($3) sw $7⟶8($3) addi $3⟵$5,4
CIS 501: Comp. Arch. | Dr. | Pipelining 41
Software Interlock Performance
• Branch: 20%, load: 20%, store: 10%, other: 50%
• For software interlocks, let’s assume: • 20%ofinsnsrequireinsertionof1nop • 5%ofinsnsrequireinsertionof2nops
• CPI is still 1 technically
• But now there are more insns
• #insns=1+0.20*1+0.05*2=1.3
– 30% more insns (30% slowdown) due to data hazards
CIS 501: Comp. Arch. | Dr. | Pipelining 42
Hardware Interlocks
• Problem with software interlocks? Not compatible
• Where does 3 in “read register 3 cycles after writing” come from?
• From structure (depth) of pipeline
• What if next MIPS version uses a 7-stage pipeline?
• Programs compiled assuming 5-stage pipeline will break!
• Option #2: hardware interlocks
• Processor detects data hazards and fixes them • Resolves the above compatibility concern
• Two aspects to this
• Detecting hazards • Fixing hazards
CIS 501: Comp. Arch. | Dr. | Pipelining 43
Detecting Data Hazards
Register File
• Compare input register names of insn in D stage with output register names of older insns in pipeline
(D.IR.RegSrc1 == X.IR.RegDest) || (D.IR.RegSrc2 == X.IR.RegDest) || (D.IR.RegSrc1 == M.IR.RegDest) || (D.IR.RegSrc2 == M.IR.RegDest)
CIS 501: Comp. Arch. | Dr. | Pipelining 44
Fixing Data Hazards
Register File
• Prevent D insn from reading (advancing) this cycle • WritenopintoX.IR(effectively,insertnopinhardware) • Also reset (clear) the datapath control signals
• Disable D register and PC write enables (why?)
• Re-evaluate situation next cycle
CIS 501: Comp. Arch. | Dr. | Pipelining 45
Hardware Interlock Example: cycle 1
Register File
lw $4⟵0($3)
add $3⟵$2,$1
(D.IR.RegSrc1 == X.IR.RegDest) || (D.IR.RegSrc2 == X.IR.RegDest) || (D.IR.RegSrc1 == M.IR.RegDest) || (D.IR.RegSrc2 == M.IR.RegDest) = 1
CIS 501: Comp. Arch. | Dr. | Pipelining 46
Hardware Interlock Example: cycle 2
Register File
lw $4⟵0($3)
nop add $3⟵$2,$1
(D.IR.RegSrc1 == X.IR.RegDest) || (D.IR.RegSrc2 == X.IR.RegDest) || (D.IR.RegSrc1 == M.IR.RegDest) || (D.IR.RegSrc2 == M.IR.RegDest) = 1
CIS 501: Comp. Arch. | Dr. | Pipelining 47
Hardware Interlock Example: cycle 3
Register File
lw $4⟵0($3)
add $3⟵$2,$1
(D.IR.RegSrc1 == X.IR.RegDest) || (D.IR.RegSrc2 == X.IR.RegDest) || (D.IR.RegSrc1 == M.IR.RegDest) || (D.IR.RegSrc2 == M.IR.RegDest) = 0
CIS 501: Comp. Arch. | Dr. | Pipelining
Pipeline Control Terminology
• Hardware interlock maneuver is called stall or bubble • Mechanism is called stall logic
• Part of more general pipeline control mechanism • Controls advancement of insns through pipeline
• Distinguish from pipelined datapath control
• Controls datapath within each stage
• Pipelinecontrolcontrolsadvancementofdatapathcontrol
CIS 501: Comp. Arch. | Dr. | Pipelining 49
Hardware Interlock Performance
• As before:
• Branch: 20%, load: 20%, store: 10%, other: 50%
• Hardware interlocks: same as software interlocks
• 20%ofinsnsrequire1cyclestall(I.e.,insertionof1nop) • 5%ofinsnsrequire2cyclestall(I.e.,insertionof2nops)
• CPI=1+0.20*1+0.05*2=1.3
• So, either CPI stays at 1 and #insns increases 30% (software)
• Or, #insns stays at 1 (relative) and CPI increases 30% (hardware) • Same difference
• Anyway, we can do better
CIS 501: Comp. Arch. | Dr. | Pipelining 50
Observation!
Register File
lw $4⟵8($3) add $3⟵$2,$1
• Technically, this situation is broken
• lw$4⟵8($3)hasalreadyread$3fromregfile • add$3⟵$2,$1hasn’tyetwritten$3toregfile
• But fundamentally, everything is OK
• lw$4⟵8($3)hasn’tactuallyused$3yet • add$3⟵$2,$1hasalreadycomputed$3
CIS 501: Comp. Arch. | Dr. | Pipelining 51
Register File
lw $4⟵8($3) add $3⟵$2,$1
• Bypassing
• Reading a value from an intermediate (μarchitectural) source • Not waiting until it is available from primary source
• Here, we are bypassing the register file
• Also called forwarding
CIS 501: Comp. Arch. | Dr. | Pipelining 52
WX Bypassing
Register File
add $4⟵$3,$2 add $3⟵$2,$1
• What about this combination?
• Add another bypass path and MUX (multiplexor) input • First one was an MX bypass
• This one is a WX bypass
CIS 501: Comp. Arch. | Dr. | Pipelining 53
ALUinB Bypassing
add $4⟵$2,$3 • Can a
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com