
Module Outline:
• Pipelined Data Path Design
• Pipelined Control Path Design
• Hazards
• Exception Handling
Module 1: Pipelining
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 1

Pipelining is Natural
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 2

Sequential Laundry
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 3

Pipelined Laundry: Start Work ASAP
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 4

Pipelining Lessons
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 5

The Five Stages of Load
• Ifetch: Instruction Fetch
– fetch the instruction from the instruction memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 6

NOTE: These 5 Stages Were There All Along!
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 7

Pipelining
[Figure: pipeline timing diagram, stages P1 through P5 versus time; P1 handles instructions 1 to 8 in successive cycles, and each later stage starts one cycle after the one before it]
• As the instruction execution cycle uses different hardware components for different steps, it is possible to set up a pipeline.
[Figure: five-stage pipeline, P1 to P5: instruction fetch unit, instruction analyzer unit, address calculation unit, data fetch unit, instruction execution unit]
• Suppose each stage takes 1 nsec. Then each instruction still takes 5 nsec. But once the pipeline has been filled, a complete instruction rolls out every 1 nsec. Thus, the speedup is 5.
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 8

Pipelining
• Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
• Pipelining is the key implementation technique that is currently used to make high-performance CPUs.
• Pipelining does not help to lower the latency of a single instruction, but it helps to increase the throughput of the entire program execution.
• The pipeline rate is limited by the slowest pipeline stage.
• The principle: multiple independent tasks are operating simultaneously.
• Potential speedup = number of pipeline stages.
• Unbalanced pipeline stage lengths reduce the speedup; the key is that every stage should take the same amount of time.
• The time needed to “fill” the pipeline and the time needed to “drain” it reduces the speedup.
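As a rough numeric check of the last few points, the sketch below computes pipelined time as (n + k - 1) clock cycles, with the clock set by the slowest stage. The stage delays are made-up illustrative values, not figures from these slides.

/* Sketch: pipeline speedup for n instructions through k stages.
 * Stage delays are illustrative values, not from the slides. */
#include <stdio.h>

int main(void) {
    double stage_ns[] = {1.0, 1.0, 1.2, 1.0, 0.8};    /* unbalanced stages */
    int k = 5;                                         /* number of stages  */
    int n = 100;                                       /* instructions      */

    double clock = 0.0, total = 0.0;
    for (int i = 0; i < k; i++) {
        total += stage_ns[i];                          /* one instruction, unpipelined   */
        if (stage_ns[i] > clock) clock = stage_ns[i];  /* slowest stage sets the clock   */
    }

    double unpipelined = n * total;            /* each instruction runs alone            */
    double pipelined   = (n + k - 1) * clock;  /* fill takes k-1 cycles, then 1 per inst */

    printf("clock = %.1f ns, speedup = %.2f (ideal %d)\n",
           clock, unpipelined / pipelined, k);
    return 0;
}

With one stage 20% slower than the rest, the speedup drops from the ideal 5 to about 4.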
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 9

Pipelining in the CPU
• Pipelining is an implementation technique that exploits parallelism among instructions in a sequential instruction stream.
• A major advantage of pipelining over “parallel processing” is that it is not visible to the programmer (whereas in parallel processing, the program usually needs to specify what kinds of tasks to be executed in parallel).
• In a computer system, each pipeline stage completes a part of the instruction being executed.
• The time between moving an instruction one step down the pipeline is a machine cycle (here, one clock cycle). The length of the machine cycle is determined by the time required by the slowest stage.
• The designer should try to balance the length of the pipeline stages so as to approach the ideal speedup. In practice the stages will not be perfectly balanced and pipelining adds some overhead, so the ideal speedup is not quite reached, but we can get close.
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 10

Pipelining in the CPU (continued)
Design Issues:
• We have to make sure that the same resource (e.g., ALU) is not used in more than one pipeline stage.
• If different pipeline stages use different resources, then overlapping is possible.
• However, to retain the intermediate values produced by an individual instruction across all of its pipeline stages, we must include temporary registers between the pipeline stages.
[Figure: three loads, lw $1,100($0), lw $2,200($0), lw $3,300($0), overlapped across clock cycles CC1 to CC7; each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous one]
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 11

Performance of Pipelining
• Pipelining increases the processor instruction throughput—the number of instructions completed per unit time.
• Remember: pipelining does not reduce the execution time of a single instruction.
• The increase in instruction throughput means that the program runs faster, even though no single instruction runs faster.
• Imbalance among the pipeline stages reduces performance since the clock cannot run faster than the time needed for the slowest pipeline stage.
• Pipelining adds overhead: the pipeline registers between stages add their own delay to every clock cycle, on top of the stage logic.
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 12

Graphically Representing Pipelines
• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 13

Conventional Pipelined Execution Representation
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 14

Single Cycle, Multiple Cycle, vs. Pipeline
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 15

Why Pipelining Actually?
• Suppose we execute 100 instructions
• Single Cycle Machine
– 45 ns/cycle x 1 CPI x 100 instructions = 4500 ns
• Multi-Cycle Machine
– 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 instructions = 4600 ns
• Ideal pipelined machine
– 10 ns/cycle x (1 CPI x 100 instructions + 4 cycles drain) = 1040 ns!!
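The arithmetic above can be checked directly; this is just the slide's numbers plugged into a few lines of C.

/* Sketch: reproduce the timing comparison above for 100 instructions. */
#include <stdio.h>

int main(void) {
    int n = 100;

    double single_cycle = 45.0 * 1.0 * n;            /* 45 ns/cycle, CPI = 1   */
    double multi_cycle  = 10.0 * 4.6 * n;            /* 10 ns/cycle, CPI = 4.6 */
    double pipelined    = 10.0 * (1.0 * n + 4);      /* 1 CPI + 4 cycles drain */

    printf("single-cycle: %.0f ns\n", single_cycle); /* 4500 ns */
    printf("multi-cycle : %.0f ns\n", multi_cycle);  /* 4600 ns */
    printf("pipelined   : %.0f ns\n", pipelined);    /* 1040 ns */
    return 0;
}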
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 16

Why Pipeline? Because We Can!
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 17

Can Pipelining Get Us into Trouble?
• YES: Pipeline Hazards
• Structural Hazards: attempt to use the same resource in two different ways (e.g., by two different instructions) at the same time
– e.g., combined washer/dryer would be a structural hazard or “folder” busy doing something else (e.g., watching TV 😉
• Control Hazards: attempt to make a decision before the condition is evaluated
– branch instructions
– e.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
• Data Hazards: attempt to use item before it is ready
– e.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
– instruction depends on result of prior instruction still in the pipeline
• Can always resolve these hazards by waiting (pipeline stall):
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 18

Single Memory Is a Structural Hazard
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 19

Structural Hazard Limits Performance
• Example: if there are 1.3 memory accesses per instruction and only one memory access can occur per cycle, then:
– average CPI >= 1.3
– otherwise resource is more than 100% utilized (impossible!)
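The bound follows from utilization: accesses per cycle equals accesses per instruction divided by CPI, and a single port cannot deliver more than one access per cycle. A tiny sketch of that check, using the 1.3 figure quoted above:

/* Sketch: a single memory port caps throughput when each instruction
 * makes 1.3 memory accesses on average (instruction fetch + some data). */
#include <stdio.h>

int main(void) {
    double accesses_per_instr = 1.3;  /* figure from the slide         */
    double ports = 1.0;               /* one memory access per cycle   */

    /* Port utilization = accesses_per_instr / CPI must stay <= ports,
     * so CPI >= accesses_per_instr / ports. */
    double min_cpi = accesses_per_instr / ports;
    printf("average CPI >= %.1f\n", min_cpi);        /* prints 1.3 */
    return 0;
}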
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 20

Control Hazard Solution #1: Stall
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 21

Control Hazard Solution #2: Predict
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 22

Control Hazard Solution #3: Delayed Branch
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 23

Data Hazard on r1: Read After Write (RAW)
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 24

Data Hazard on r1: Read After Write (RAW)
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 25

Data Hazard Solution: Forwarding
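The slide's figure is not reproduced here. As a rough sketch of the idea, assuming the standard textbook forwarding paths from the EX/MEM and MEM/WB pipeline latches (the field names below are illustrative, not taken from these slides), the forwarding logic picks the newest pending value of each ALU source:

/* Sketch: forwarding-unit selection for one ALU source register.
 * Assumes a classic 5-stage pipeline with EX/MEM and MEM/WB latches;
 * names are illustrative. Returns which value should feed the ALU input. */
#include <stdio.h>

enum fwd { FROM_REGFILE, FROM_EXMEM, FROM_MEMWB };

struct latch { int reg_write;    /* will this instruction write a register? */
               int dest_reg; };  /* which register it writes                */

enum fwd forward(int src_reg, struct latch exmem, struct latch memwb) {
    if (src_reg != 0) {                                   /* $0 is never forwarded  */
        if (exmem.reg_write && exmem.dest_reg == src_reg)
            return FROM_EXMEM;                            /* newest result wins     */
        if (memwb.reg_write && memwb.dest_reg == src_reg)
            return FROM_MEMWB;
    }
    return FROM_REGFILE;                                  /* no hazard: use regfile */
}

int main(void) {
    /* add r1,r2,r3 followed by sub r4,r1,r3: sub's first source (r1)
     * must come from the EX/MEM latch, not the register file.        */
    struct latch exmem = {1, 1}, memwb = {0, 0};
    printf("%s\n", forward(1, exmem, memwb) == FROM_EXMEM ? "forward from EX/MEM"
                                                          : "no forwarding");
    return 0;
}

Checking EX/MEM before MEM/WB matters: when both latches hold a result for the same register, the more recent instruction's value must win.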
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 26

Forwarding (or Bypassing): What About Loads?
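The corresponding figures are omitted. The key point is that a load's value is only available after the Mem stage, so an instruction that uses it in the very next cycle cannot be fixed by forwarding alone; one bubble is needed. A minimal sketch of the usual decode-stage check, with illustrative latch and field names in the style of the textbook 5-stage pipeline:

/* Sketch: load-use hazard check done in the decode stage.
 * If the instruction in EX is a load and its destination matches either
 * source of the instruction being decoded, stall one cycle (insert a bubble). */
#include <stdio.h>

struct idex { int mem_read;   /* instruction currently in EX is a load   */
              int dest_reg; };/* register the load will write            */

struct ifid { int rs, rt; };  /* source registers of the decoding instr. */

int must_stall(struct idex ex, struct ifid id) {
    return ex.mem_read && (ex.dest_reg == id.rs || ex.dest_reg == id.rt);
}

int main(void) {
    /* lw $2,200($0) followed immediately by add $4,$2,$3 -> stall one cycle */
    struct idex ex = {1, 2};
    struct ifid id = {2, 3};
    puts(must_stall(ex, id) ? "stall: insert bubble, hold PC and IF/ID"
                            : "no stall");
    return 0;
}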
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 27

Forwarding (or Bypassing): What About Loads?
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 28

Designing a Pipelined Processor
• Go back and examine your datapath and control diagram
• Associate resources with states
• Ensure that flows do not conflict, or figure out how to resolve
• Assert control in appropriate stage
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 29

Control and Datapath: Split State Diagram into 5 Pieces
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 30

Summary of Concepts
• Reduce CPI by overlapping many instructions
– average throughput of approximately 1 CPI with fast clock
• Utilize capabilities of the datapath
– start next instruction while working on the current one
– limited by length of longest stage (plus fill/flush)
– detect and resolve hazards
• What makes it easy
– all instructions are of the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 31

Pipelined Processor
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 32

Pipelining the Load Instruction
• The 5 independent functional units in the pipeline datapath are:
– instruction memory for the Ifetch stage
– register file’s Read ports (bus A and bus B) for the Reg/Dec stage
– ALU for the Exec stage
– data memory for the Mem stage
– register file’s Write port (bus W) for the Wr stage
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 33

The Four Stages of R-type
• Ifetch: Instruction Fetch
– fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec:
– ALU operates on the two register operands
– update PC
• Wr: write the ALU output back to the register file
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 34

Pipelining the R-type and Load Instruction
• We have a pipeline conflict, or structural hazard:
– two instructions try to write to the register file at the same time
– only one write port
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 35

Important Observation
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions
• 2 ways to solve this pipeline hazard
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 36

Solution 1: Insert “Bubble” into the Pipeline
• Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle
– the control logic can be complex
– lose instruction fetch and issue opportunity
• No instruction is started in Cycle 6!!
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 37

Solution 2: Delay R-type’s Write by One Cycle
• Delay R-type’s register write by one cycle:
– the R-type instruction then also takes five stages, passing through the Mem stage doing nothing
– every instruction now writes the register file in the same (fifth) stage, so the single write port is never needed twice in one cycle
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 38

Modified Control and Datapath
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 39

The Four Stages of Store
• Ifetch: Instruction Fetch
– fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec: calculate the memory address
• Mem: write the data into the data memory
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 40

The Three Stages of Beq
• Ifetch: Instruction Fetch
– fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec:
– compare the two register operands
– select correct branch target address
– latch into PC
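A minimal sketch of what the Exec stage does for beq, assuming MIPS-style encoding where the 16-bit immediate is sign-extended and shifted left by 2 (the numbers in main are illustrative):

/* Sketch: beq in the Exec stage -- compare the two register operands,
 * select the next PC, and that value is what gets latched into PC. */
#include <stdint.h>
#include <stdio.h>

uint32_t next_pc(uint32_t pc, int32_t rs, int32_t rt, int16_t imm) {
    uint32_t seq    = pc + 4;                       /* fall-through          */
    uint32_t target = pc + 4 + ((int32_t)imm << 2); /* branch target address */
    return (rs == rt) ? target : seq;               /* select correct PC     */
}

int main(void) {
    /* beq $1,$2,+5 at PC 0x100 with $1 == $2: PC becomes 0x100+4+20 = 0x118 */
    printf("next PC = 0x%X\n", next_pc(0x100, 7, 7, 5));
    return 0;
}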
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 41

Control Diagram
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 42

Data Stationary Control
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 43

Datapath + Data Stationary Control
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 44

Let’s Try It Out
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 45

Start: Fetch 10
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 46

Fetch 14, Decode 10
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 47

Fetch 20, Decode 14, Exec 10
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 48

Fetch 24, Decode 20, Exec 14, Mem 10
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 49

Fetch 30, Decode 24, Exec 20, Mem 14, WB 10
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 50

Fetch 100, Decode 30, Exec 24, Mem 20, WB 14
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 51

Fetch 104, Decode 100, Exec 30, Mem 24, WB 20
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 52

Fetch 110, Decode 104, Exec 100, Mem 30, WB 24
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 53

Fetch 114, Decode 110, Exec 104, Mem 100, WB 30
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 54

Pipeline Hazards Again
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 55

Data Hazards
• Avoid some “by design”
– eliminate WAR by always fetching operands early (DCD) in pipe
– eliminate WAW by doing all WBs in order (last stage, static)
• Detect and resolve remaining ones
– stall or forward (if possible)
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 56

Hazard Detection
• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline
• A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j)
– keep a record of pending writes (for instructions in the pipe) and compare with operand regs of current instruction
– when instruction issues, reserve its result register
– when an operation completes, remove its write reservation
• A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j)
• A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j)
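A minimal sketch of these set-intersection tests, representing Rregs and Wregs as 32-bit masks over the register file (the representation is an illustration, not the slides' hardware):

/* Sketch: RAW/WAW/WAR detection between an issuing instruction i and a
 * predecessor j still in the pipe, with read/write register sets kept
 * as 32-bit masks (bit r set => register r is in the set). */
#include <stdint.h>
#include <stdio.h>

struct instr { uint32_t rregs;   /* registers the instruction reads  */
               uint32_t wregs; };/* registers the instruction writes */

int raw(struct instr i, struct instr j) { return (i.rregs & j.wregs) != 0; }
int waw(struct instr i, struct instr j) { return (i.wregs & j.wregs) != 0; }
int war(struct instr i, struct instr j) { return (i.wregs & j.rregs) != 0; }

int main(void) {
    /* j: add r1,r2,r3  (reads r2,r3; writes r1)
       i: sub r4,r1,r3  (reads r1,r3; writes r4)  -> RAW hazard on r1 */
    struct instr j = { (1u<<2) | (1u<<3), (1u<<1) };
    struct instr i = { (1u<<1) | (1u<<3), (1u<<4) };
    printf("RAW=%d WAW=%d WAR=%d\n", raw(i, j), waw(i, j), war(i, j));
    return 0;
}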
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 57

Record of Pending Writes
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 58

Resolve RAW by Forwarding
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 59

What about Memory Operations?
• If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
• What does delaying WB on arithmetic operations cost?
– cycles?
– hardware?
• What about data dependence on loads?
– R1 <-- R4 + R5
– R2 <-- Mem[R2 + I]
– R3 <-- R2 + R1
– ==> delayed loads
• Can recognize this in decode stage and introduce bubble while stalling fetch stage
• Tricky situation:
– R1 <-- Mem[R2 + I]
– Mem[R3 + 34] <-- R1
– handle with bypass in memory stage!
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 60

Compiler Avoiding Load Stalls
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 61

What about interrupts, traps, faults?
• External interrupts:
– allow pipeline to drain
– load PC with interrupt address
• Faults (within instruction, restartable)
– force trap instruction into IF
– disable writes till trap hits WB
– must save multiple PCs or PC + state
• Recall: precise exceptions ==> state of the machine is preserved as if program executed up to the offending instruction
– all previous instructions completed
– offending instruction and all following instructions act as if they have not even started
– same system code will work on different implementations
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 62

Exception Problem
• Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline
– how to stop the pipeline?
– restart?
– who caused the interrupt?

Stage  Problem when interrupts occur
IF     page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     undefined or illegal opcode
EX     arithmetic exception
MEM    page fault on data fetch; misaligned memory access; memory-protection violation; memory error

• Load with data page fault, Add with instruction page fault?
• Solution
1: interrupt vector/instruction
2: interrupt ASAP, restart everything incomplete
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 63

Another look at the exception problem
• Use pipeline to sort this out!
– pass exception status along with instruction
– keep track of PCs for every instruction in pipeline
– don’t act on exception until it reaches WB stage
• Handle interrupts through “faulting no-op” in IF stage
• When instruction reaches WB stage:
– save PC => EPC, interrupt vector address => PC
– turn all instructions in earlier stages into no-ops
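A rough sketch of the "pass it along, act only at WB" idea; the structure and field names are illustrative, not taken from these slides:

/* Sketch: each pipeline latch carries the instruction's PC and any
 * exception it has raised so far; nothing is acted on until WB. */
#include <stdio.h>

enum exc { EXC_NONE, EXC_PAGE_FAULT, EXC_ILLEGAL_OP, EXC_OVERFLOW };

struct stage_latch {
    unsigned pc;       /* PC of the instruction in this stage            */
    enum exc status;   /* first exception raised by this instruction     */
    int      bubble;   /* 1 => this slot has been turned into a no-op    */
};

/* Called when the instruction in WB carries an exception. */
void take_exception(struct stage_latch pipe[5], unsigned *epc, unsigned *pc) {
    *epc = pipe[4].pc;        /* save PC of the offending instruction      */
    *pc  = 0x80000080u;       /* illustrative interrupt-vector address     */
    for (int s = 0; s < 4; s++)
        pipe[s].bubble = 1;   /* squash younger instructions in earlier stages */
}

int main(void) {
    struct stage_latch pipe[5] = { {0x110, EXC_NONE, 0}, {0x10C, EXC_NONE, 0},
                                   {0x108, EXC_NONE, 0}, {0x104, EXC_NONE, 0},
                                   {0x100, EXC_OVERFLOW, 0} };  /* fault reaches WB */
    unsigned epc = 0, pc = 0x114;
    if (pipe[4].status != EXC_NONE) take_exception(pipe, &epc, &pc);
    printf("EPC=0x%X new PC=0x%X\n", epc, pc);
    return 0;
}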
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 64

Exception Handling
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 65

Resolution: Freeze above & Bubble below
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 66

FYI: MIPS R3000 Clocking Discipline
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 67

MIPS R3000 Instruction Pipeline
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 68

Recall: Data Hazard on r1
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 69

MIPS R3000 Multicycle Operations
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 70

Issues in Pipelined Design
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 71

Summary of Concepts
• What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard? HAZARDS!!
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• Pipelines pass control information down the pipe just as data moves down pipe
• Forwarding/stalls handled by local control
• Exceptions stop the pipeline
• MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
• More performance from deeper pipelines, parallelism
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 72

Is CPI = 1 for our pipeline?
• Remember that CPI is an “average # cycles/instruction”
• CPI here is 1, since the average throughput is 1 instruction every cycle
• What if there are stalls or multi-cycle execution?
• Usually CPI > 1. How close can we get to 1??
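A small sketch of how stalls push the average above 1; the stall frequencies and penalties are made-up inputs, not figures from this module:

/* Sketch: effective CPI = ideal CPI + average stall cycles per instruction.
 * The frequencies and penalties below are illustrative numbers. */
#include <stdio.h>

int main(void) {
    double ideal_cpi     = 1.0;
    double load_use_freq = 0.10, load_use_penalty = 1.0;  /* 10% of instrs stall 1 cycle */
    double branch_freq   = 0.15, branch_penalty   = 1.0;  /* taken-branch bubble         */

    double cpi = ideal_cpi
               + load_use_freq * load_use_penalty
               + branch_freq   * branch_penalty;

    printf("effective CPI = %.2f\n", cpi);   /* 1.25 with these inputs */
    return 0;
}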
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 73

Case Study: MIPS R4000 (200 MHz)
• 8-stage pipeline:
– IF—first half of instruction fetch; PC selection happens here, as well as initiation of instruction cache access
– IS—second half of access to instruction cache
– RF—instruction decode and register fetch, hazard checking, and instruction cache hit detection
– EX—execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation
– DF—data fetch, first half of access to data cache
– DS—second half of access to data cache
– TC—tag check, to determine whether the data cache access hit
– WB—write back for loads and register-register operations
• 8 stages: what is the impact on Load delay? Branch delay? Why?
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 74

Case Study: MIPS R4000
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 75

MIPS R4000 Floating Point
• FP adder, FP multiplier, FP divider
• Last step of FP multiplier/divider uses FP adder HW
• 8 kinds of stages in FP units:
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 76

MIPS FP Pipe Stages
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 77

R4000 Performance
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 78

Summary of Concepts
• Hazards limit performance
– structural: need more HW resources
– data: need forwarding, compiler scheduling
– control: early evaluation and PC, delayed branch, prediction
• Data hazards must be handled carefully:
– RAW data hazards handled by forwarding
– WAW and WAR hazards don’t exist in 5-stage pipeline
• MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
• Exceptions in 5-stage pipeline recorded when they occur, but acted on only at WB (end of MEM) stage
– must flush all following instructions (those still in earlier pipeline stages)
• More performance from deeper pipelines, parallelism
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 79