
COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
Chapter 4 The Processor


Introduction
 CPU performance factors
 Instruction count
 Determined by ISA and compiler
 CPI and Cycle time
 Determined by CPU hardware
 We will examine two RISC-V implementations
 A simplified version
 A more realistic pipelined version
 Simple subset, shows most aspects
 Memory reference: ld, sd
 Arithmetic/logical: add, sub, and, or
 Control transfer: beq
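A minimal Python sketch of the Chapter 1 relationship behind these factors; the numbers are made up purely for illustration.

# CPU time = instruction count x CPI x clock cycle time
instruction_count = 10_000_000
cpi = 1.5                  # average clock cycles per instruction
cycle_time_s = 0.5e-9      # 0.5 ns clock period
cpu_time_s = instruction_count * cpi * cycle_time_s
print(cpu_time_s)          # 0.0075 seconds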
Chapter 4 — The Processor — 2
§4.1 Introduction

Instruction Execution
 PC → instruction memory, fetch instruction
 Register numbers → register file, read registers
 Depending on instruction class
 Use ALU to calculate
 Arithmetic result
 Memory address for load/store
 Branch comparison
 Access data memory for load/store
 PC ← target address or PC + 4
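A deliberately tiny Python sketch of these steps for one R-type add, with made-up register contents (real decode and ALU hardware is of course far richer):

regs = {1: 5, 2: 7, 3: 0}        # toy register file
rs1, rs2, rd = 1, 2, 3           # register numbers from the instruction
a, b = regs[rs1], regs[rs2]      # read registers
regs[rd] = a + b                 # ALU computes the arithmetic result
pc = 0
pc = pc + 4                      # not a branch, so PC <- PC + 4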
Chapter 4 — The Processor — 3

CPU Overview
Chapter 4 — The Processor — 4

Multiplexers
 Can’t just join wires together
 Use multiplexers
Chapter 4 — The Processor — 5

Chapter 4 — The Processor — 6

Logic Design Basics
 Information encoded in binary
 Low voltage = 0, High voltage = 1
 One wire per bit
 Multi-bit data encoded on multi-wire buses
 Combinational element
 Operate on data
 Output is a function of input
 State (sequential) elements
 Store information
Chapter 4 — The Processor — 7
§4.2 Logic Design Conventions

Combinational Elements
 AND-gate
 Y = A & B
 Arithmetic/Logic Unit
 Y = F(A, B)
 Multiplexer
 Y = S ? I1 : I0
(Figure: multiplexer with data inputs I0 and I1, select S, output Y)
Chapter 4 — The Processor — 8

Sequential Elements
 Register: stores data in a circuit
 Uses a clock signal to determine when to update the stored value
 Edge-triggered: update when Clk changes from 0 to 1
Chapter 4 — The Processor — 9

Sequential Elements
 Register with write control
 Only updates on clock edge when write control input is 1
 Used when stored value is required later
Chapter 4 — The Processor — 10

Clocking Methodology
 Combinational logic transforms data during clock cycles
 Between clock edges
 Input from state elements, output to state elements
 Longest delay determines clock period
Chapter 4 — The Processor — 11

Building a Datapath
 Datapath
 Elements that process data and addresses in the CPU
 Registers, ALUs, mux’s, memories, …
 We will build a RISC-V datapath incrementally
 Refining the overview design
Chapter 4 — The Processor — 12
§4.3 Building a Datapath

Instruction Fetch
Increment by 4 for next instruction
64-bit register
Chapter 4 — The Processor — 13

R-Format Instructions
 Read two register operands
 Perform arithmetic/logical operation
 Write register result
Chapter 4 — The Processor — 14

Load/Store Instructions
 Read register operands
 Calculate address using 12-bit offset
 Use ALU, but sign-extend offset
 Load: Read memory and update register
 Store: Write register value to memory
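A minimal Python sketch of the address calculation, assuming a small helper that sign-extends the 12-bit offset; the register and offset values are made up.

def sign_extend12(imm12):
    # interpret a 12-bit field as a signed value
    return imm12 - (1 << 12) if imm12 & 0x800 else imm12

base = 0x1000                            # value read from the base register
print(hex(base + sign_extend12(0xFF8)))  # offset 0xFF8 encodes -8 -> 0xff8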
Chapter 4 — The Processor — 15

Branch Instructions
 Read register operands
 Compare operands
 Use ALU, subtract and check Zero output
 Calculate target address
 Sign-extend displacement
 Shift left 1 place (halfword displacement)
 Add to PC value
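A small Python sketch of the target calculation (sign-extend, shift left 1, add to PC); the 12-bit field width and the example values are illustrative only.

def branch_target(pc, displacement, bits=12):
    if displacement & (1 << (bits - 1)):   # sign-extend the encoded displacement
        displacement -= 1 << bits
    return pc + (displacement << 1)        # halfword displacement: shift left 1

print(branch_target(40, 16))               # 40 + 16*2 = 72 (used in a later example)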
Chapter 4 — The Processor — 16

Branch Instructions
Just re-routes wires
Sign-bit wire replicated
Chapter 4 — The Processor — 17

Composing the Elements
 First-cut data path does an instruction in one clock cycle
 Each datapath element can only do one function at a time
 Hence, we need separate instruction and data memories
 Use multiplexers where alternate data sources are used for different instructions
Chapter 4 — The Processor — 18

R-Type/Load/Store Datapath
Chapter 4 — The Processor — 19

Full Datapath
Chapter 4 — The Processor — 20

ALU Control
 ALU used for
 Load/Store: Function = add
 Branch: Function = subtract
 R-type: Function depends on opcode
ALU control (4 bits)
Appendix A: control bits are a-invert, b-negate, operation (and, or, add)
e.g., 1100: a’ AND b’ = (a+b)’ = NOR; 1101: NAND
Chapter 4 — The Processor — 21
§4.4 A Simple Implementation Scheme

ALU Control
 Assume 2-bit ALUOp derived from opcode
 Combinational logic derives ALU control
Opcodes: 0000011 (load), 0100011 (store), 1100011 (branch), 0110011 (R-type, doubleword), 0111011 (R-type, word)
ALUop[0] = b[6]*b[5]*b[4]’*b[3]’*b[2]’*b[1]*b[0]
ALUop[1] = b[6]’*b[5]*b[4]*b[2]’*b[1]*b[0]
 Construct the truth table for the 4 ALU control bits (called Operation)
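A hedged Python sketch of that truth table, assuming the textbook’s 4-bit Operation encodings (0010 add, 0110 subtract, 0000 AND, 0001 OR) and the standard RISC-V funct fields:

def alu_control(aluop, funct7_bit30, funct3):
    if aluop == 0b00:        # ld/sd: address calculation
        return 0b0010        # add
    if aluop == 0b01:        # beq: compare by subtracting
        return 0b0110        # subtract
    # aluop == 0b10: R-type, decode the funct fields
    if funct3 == 0b000:
        return 0b0110 if funct7_bit30 else 0b0010   # sub vs. add
    if funct3 == 0b111:
        return 0b0000        # and
    if funct3 == 0b110:
        return 0b0001        # or
    raise ValueError("funct combination outside this simple subset")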
Chapter 4 — The Processor — 22

The Main Control Unit
 Control signals derived from instruction
sub, sra FP instr.
Op[0] = ALUop[1]*I[30]’*I[14]*I[13]*I[12]’
Op[1] = (ALUop[0]’*ALUop[1]’) + (ALUop[0]) + (ALUop[1]*I[14]’*I[13]’*I[12]’)
Op[2] = ALUop[0] + (ALUop[1]*I[30]*I[14]’*I[13]’*I[12]’)
Chapter 4 — The Processor — 23

Datapath With Control
Chapter 4 — The Processor — 24

R-Type Instruction
Chapter 4 — The Processor — 25

Load Instruction
Chapter 4 — The Processor — 26

BEQ Instruction
Chapter 4 — The Processor — 27

Performance Issues
 Longest delay determines clock period
 Critical path: load instruction
 Instruction memory → register file → ALU → data memory → register file
 Not feasible to vary period for different instructions
 Violates design principle
 Making the common case fast
 We will improve performance by pipelining
Chapter 4 — The Processor — 28

Pipelining Analogy
 Pipelined laundry: overlapping execution
 Parallelism improves performance
 Four loads: Speedup = 8/3.5 = 2.3
 Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
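A quick Python check of these numbers, assuming four 0.5-hour stages as in the textbook’s laundry example:

def speedup(n_loads, stages=4, stage_time=0.5):
    sequential = n_loads * stages * stage_time        # 2n hours
    pipelined = stage_time * (stages + n_loads - 1)   # 0.5n + 1.5 hours
    return sequential / pipelined

print(round(speedup(4), 1))        # 8 / 3.5 = 2.3
print(round(speedup(1_000_000)))   # approaches 4, the number of stages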
Chapter 4 — The Processor — 29
§4.5 An Overview of Pipelining

RISC-V Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Chapter 4 — The Processor — 30

Pipeline Performance
 Assume time for stages is
 100ps for register read or write
 200ps for other stages
 Compare pipelined datapath with single-cycle datapath
(Table: time for instruction fetch, register read, ALU operation, memory access, register write, and total time for each instruction class)
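A sketch of the comparison in Python, using the stage times above (100 ps for register read/write, 200 ps for the other stages): the single-cycle clock must cover the longest instruction (ld), while the pipelined clock only covers the slowest stage.

stage_ps = {"IF": 200, "Reg read": 100, "ALU": 200, "MEM": 200, "Reg write": 100}
single_cycle_tc = sum(stage_ps.values())   # ld uses every stage: 800 ps
pipelined_tc = max(stage_ps.values())      # slowest stage sets the clock: 200 ps
print(single_cycle_tc, pipelined_tc)       # 800 200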
Chapter 4 — The Processor — 31

Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
Chapter 4 — The Processor — 32

Pipeline Speedup
 If all stages are balanced
 i.e., all take the same time
 Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
 If not balanced, speedup is less
 Speedup due to increased throughput
 Latency (time for each instruction) does not decrease
Chapter 4 — The Processor — 33

Pipelining and ISA Design
 RISC-V ISA designed for pipelining
 All instructions are 32 bits
 Easier to fetch and decode in one cycle
 c.f. x86: 1- to 17-byte instructions
 Few and regular instruction formats
 Can decode and read registers in one step
 Load/store addressing
 Can calculate address in 3rd stage, access memory in 4th stage
Chapter 4 — The Processor — 34

Hazards
 Situations that prevent starting the next instruction in the next cycle
 Structure hazards
 A required resource is busy
 Data hazard
 Need to wait for previous instruction to complete its data read/write
 Control hazard
 Deciding on control action depends on previous instruction
Chapter 4 — The Processor — 35

Structure Hazards
 Conflict for use of a resource
 In RISC-V pipeline with a single memory
 Load/store requires data access
 Instruction fetch would have to stall for that cycle
 Would cause a pipeline “bubble”
 Hence, pipelined datapaths require separate instruction/data memories
 Or separate instruction/data caches
Chapter 4 — The Processor — 36

Data Hazards
 An instruction depends on completion of data access by a previous instruction
 add x19, x0, x1
 sub x2, x19, x3
Chapter 4 — The Processor — 37

Forwarding (aka Bypassing)
 Use result when it is computed
 Don’t wait for it to be stored in a register
 Requires extra connections in the datapath
Chapter 4 — The Processor — 38

Load-Use Data Hazard
 Can’t always avoid stalls by forwarding
 If value not computed when needed
 Can’t forward backward in time!
Chapter 4 — The Processor — 39

Code Scheduling to Avoid Stalls
 Reorder code to avoid use of load result in the next instruction
 C code for a = b + e; c = b + f;
Original order (load results used immediately; two stalls):
ld x1, 0(x0)
ld x2, 8(x0)
add x3, x1, x2   // stall: waits for x2
sd x3, 24(x0)
ld x4, 16(x0)
add x5, x1, x4   // stall: waits for x4
sd x5, 32(x0)

Reordered (loads moved up; no stalls):
ld x1, 0(x0)
ld x2, 8(x0)
ld x4, 16(x0)
add x3, x1, x2
sd x3, 24(x0)
add x5, x1, x4
sd x5, 32(x0)
Chapter 4 — The Processor — 40

Control Hazards
 Branch determines flow of control
 Fetching next instruction depends on branch
 Pipeline can’t always fetch correct instruction
 Still working on ID stage of branch
 In RISC-V pipeline
 Need to compare registers and compute target early in the pipeline
 Add hardware to do it in ID stage
Chapter 4 — The Processor — 41

Stall on Branch
 Wait until branch outcome determined before fetching next instruction
Chapter 4 — The Processor — 42

Branch Prediction
 Longer pipelines can’t readily determine branch outcome early
 Stall penalty becomes unacceptable
 Predict outcome of branch
 Only stall if prediction is wrong
 In RISC-V pipeline
 Can predict branches not taken
 Fetch instruction after branch, with no delay
Chapter 4 — The Processor — 43

More-Realistic Branch Prediction
 Static branch prediction
 Based on typical branch behavior
 Example: loop and if-statement branches
 Predict backward branches taken
 Predict forward branches not taken
 Dynamic branch prediction
 Hardware measures actual branch behavior
 e.g., record recent history of each branch
 Assume future behavior will continue the trend
 When wrong, stall while re-fetching, and update history
Chapter 4 — The Processor — 44

Pipeline Summary
The BIG Picture
 Pipelining improves performance by increasing instruction throughput
 Executes multiple instructions in parallel
 Each instruction has the same latency
 Subject to hazards
 Structure, data, control
 Instruction set design affects complexity of pipeline implementation
Chapter 4 — The Processor — 45

RISC-V Pipelined Datapath
Right-to-left flow leads to hazards
Chapter 4 — The Processor — 46
§4.6 Pipelined Datapath and Control

Pipeline registers
 Need registers between stages
 To hold information produced in previous cycle
Chapter 4 — The Processor — 47

Pipeline Operation
 Cycle-by-cycle flow of instructions through the pipelined datapath
 “Single-clock-cycle” pipeline diagram
 Shows pipeline usage in a single cycle
 Highlight resources used
 c.f. “multi-clock-cycle” diagram
 Graph of operation over time
 We’ll look at “single-clock-cycle” diagrams for load & store
Chapter 4 — The Processor — 48

IF for Load, Store, …
Chapter 4 — The Processor — 49

ID for Load, Store, …
Chapter 4 — The Processor — 50

EX for Load
Chapter 4 — The Processor — 51

MEM for Load
Chapter 4 — The Processor — 52

WB for Load
Wrong register number
Chapter 4 — The Processor — 53

Corrected Datapath for Load
Chapter 4 — The Processor — 54

EX for Store
Chapter 4 — The Processor — 55

MEM for Store
Chapter 4 — The Processor — 56

WB for Store
Chapter 4 — The Processor — 57

Multi-Cycle Pipeline Diagram
 Form showing resource usage
Chapter 4 — The Processor — 58

Multi-Cycle Pipeline Diagram
 Traditional form
Chapter 4 — The Processor — 59

Single-Cycle Pipeline Diagram
 State of pipeline in a given cycle
Chapter 4 — The Processor — 60

Pipelined Control (Simplified)
Chapter 4 — The Processor — 61

Pipelined Control
 Control signals derived from instruction
 As in single-cycle implementation
Chapter 4 — The Processor — 62

Pipelined Control
Chapter 4 — The Processor — 63

Data Hazards in ALU Instructions (Compute-Use Data Hazard)
 Consider this sequence:
sub x2, x1,x3
and x12,x2,x5
or x13,x6,x2
add x14,x2,x2
sd x15,100(x2)
 We can resolve hazards with forwarding
 How do we detect when to forward?
Chapter 4 — The Processor — 64
§4.7 Data Hazards: Forwarding vs. Stalling

Dependencies & Forwarding
Chapter 4 — The Processor — 65

Detecting the Need to Forward
 Pass register numbers along pipeline
 e.g., ID/EX.RegisterRs1 = register number for Rs1 sitting in the ID/EX pipeline register
 ALU operand register numbers in EX stage are given by
 ID/EX.RegisterRs1, ID/EX.RegisterRs2
 Data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1 (fwd from EX/MEM pipeline reg)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRs2 (fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs1 (fwd from MEM/WB pipeline reg)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRs2 (fwd from MEM/WB pipeline reg)
Chapter 4 — The Processor — 66

Detecting the Need to Forward
 But only if forwarding instruction will write to a register!
 EX/MEM.RegWrite, MEM/WB.RegWrite
 And only if Rd for that instruction is not x0
 EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
Chapter 4 — The Processor — 67

Forwarding Paths
Chapter 4 — The Processor — 68

Forwarding Conditions
Mux control — Explanation
ForwardA = 00: The first ALU operand comes from the register file.
ForwardA = 10: The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01: The first ALU operand is forwarded from data memory or an earlier ALU result.
ForwardB = 00: The second ALU operand comes from the register file.
ForwardB = 10: The second ALU operand is forwarded from the prior ALU result.
ForwardB = 01: The second ALU operand is forwarded from data memory or an earlier ALU result.
Chapter 4 — The Processor — 69

Double Data Hazard
 Consider the sequence:
add x1,x1,x2
add x1,x1,x3
add x1,x1,x4
Dashed forwarding should be suppressed
 Both hazards occur
 Want to use the most recent
 Revise MEM hazard condition
 Only fwd if EX hazard condition isn’t true
Chapter 4 — The Processor — 70

Revised Forwarding Condition
 MEM hazard
 if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
             and (EX/MEM.RegisterRd = ID/EX.RegisterRs1))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs1))
  ForwardA = 01
 if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
             and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs2))
  ForwardB = 01
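A minimal Python sketch of the ForwardA logic from the last two slides (the EX hazard takes priority over the MEM hazard); the argument names are illustrative stand-ins for the pipeline-register fields, not a specific simulator’s API.

def forward_a(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_rs1):
    # EX hazard: forward the ALU result waiting in EX/MEM
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs1:
        return "10"
    # MEM hazard: forward from MEM/WB only if the EX hazard did not match,
    # so the most recent result wins (the double data hazard case)
    if (mem_wb_regwrite and mem_wb_rd != 0
            and not (ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs1)
            and mem_wb_rd == id_ex_rs1):
        return "01"
    return "00"   # no hazard: operand comes from the register file

ForwardB is identical with Rs2 in place of Rs1.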
Chapter 4 — The Processor — 71

Datapath with Forwarding
Chapter 4 — The Processor — 72

Load-Use Hazard Detection
 Check when using instruction is decoded in ID stage
 ALU operand register numbers in ID stage are given by
 IF/ID.RegisterRs1, IF/ID.RegisterRs2
 Load-use hazard when
 ID/EX.MemRead and
((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
(ID/EX.RegisterRd = IF/ID.RegisterRs2))
 If detected, stall and insert bubble
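A minimal Python sketch of this check; the field names mirror the pipeline registers and are illustrative only.

def load_use_stall(id_ex_memread, id_ex_rd, if_id_rs1, if_id_rs2):
    # stall if the load in EX will write a register the instruction in ID reads
    return id_ex_memread and id_ex_rd in (if_id_rs1, if_id_rs2)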
Chapter 4 — The Processor — 73

How to Stall the Pipeline
 Force control values in ID/EX register to 0
 EX, MEM and WB do nop (no-operation)
 Prevent update of PC and IF/ID register
 Using instruction is decoded again
 Following instruction is fetched again
 1-cycle stall allows MEM to read data for ld
 Can subsequently forward to EX stage
Chapter 4 — The Processor — 74

Load-Use Data Hazard
Stall inserted here
Chapter 4 — The Processor — 75

Datapath with Hazard Detection
Chapter 4 — The Processor — 76

Stalls and Performance
The BIG Picture
 Stalls reduce performance
 But are required to get correct results
 Compiler can arrange code to avoid hazards and stalls
 Requires knowledge of the pipeline structure
Chapter 4 — The Processor — 77

Branch Hazards
 If branch outcome determined in MEM
Chapter 4 — The Processor — 78
Flush these instructions (Set control values to 0)
§4.8 Control Hazards

Reducing Branch Delay
 Move hardware to determine outcome to ID stage
 Target address adder
 Register comparator
 Example: branch taken
36: sub x10, x4, x8
40: beq x1, x3, 16 // PC-relative branch to 40+16*2 = 72
44: and x12, x2, x5
48: or x13, x2, x6
52: add x14, x4, x2
56: sub x15, x6, x7
72: ld x4, 50(x7)
Chapter 4 — The Processor — 79

Example: Branch Taken
Chapter 4 — The Processor — 80

Example: Branch Taken
Chapter 4 — The Processor — 81

Dynamic Branch Prediction
 In deeper and superscalar pipelines, branch penalty is more significant
 Use dynamic prediction
 Branch prediction buffer (aka branch history table)
 Indexed by recent branch instruction addresses
 Stores outcome (taken/not taken)
 To execute a branch
 Check table, expect the same outcome
 Start fetching from fall-through or target
 If wrong, flush pipeline and flip prediction
Chapter 4 — The Processor — 82

1-Bit Predictor: Shortcoming
 Inner loop branches mispredicted twice!
outer: …
inner: …
beq …, …, inner
beq …, …, outer
 Mispredict as taken on last iteration of inner loop
 Then mispredict as not taken on first iteration of inner loop next time around
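A tiny Python simulation of this effect: the inner loop’s backward branch is taken nine times and then not taken, and a 1-bit predictor (which simply remembers the last outcome) misses twice per run of the inner loop after the first. The loop counts are made up.

def one_bit_mispredicts(outcomes):
    pred, misses = True, 0
    for taken in outcomes:
        if pred != taken:
            misses += 1
        pred = taken                  # 1-bit predictor: remember the last outcome
    return misses

inner = [True] * 9 + [False]          # one execution of the inner loop's branch
print(one_bit_mispredicts(inner * 3)) # 5: exit miss, then 2 per later run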
Chapter 4 — The Processor — 83

2-Bit Predictor
 Only change prediction on two successive mispredictions
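An illustrative 2-bit saturating-counter predictor (states 0–3); this is a generic sketch of the scheme, not a particular hardware design.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0                # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # saturate at 0 and 3, so one misprediction never flips a strong prediction
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)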
Chapter 4 — The Processor — 84

Calculating the Branch Target
 Even with predictor, still need to calculate the target address
 1-cycle penalty for a taken branch
 Branch target buffer
 Cache of target addresses
 Indexed by PC when instruction fetched
 If hit and instruction is branch predicted taken, can fetch target immediately
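A sketch of the idea as a small Python dictionary keyed by the fetch PC; a real branch target buffer is a hardware cache with tags, so the structure here is purely illustrative.

btb = {}                                  # fetch PC -> cached target address

def next_fetch_pc(pc, predicted_taken):
    # on a BTB hit for a branch predicted taken, redirect fetch immediately
    if predicted_taken and pc in btb:
        return btb[pc]
    return pc + 4                         # otherwise fall through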
Chapter 4 — The Processor — 85

Exceptions and Interrupts
 “Unexpected” events requiring change in flow of control
 Different ISAs use the terms differently
 Exception
 Arises within the CPU
 e.g., undefined opcode, syscall, …
 Interrupt
 From an external I/O controller
 Dealing with them without sacrificing performance is hard
Chapter 4 — The Processor — 86
§4.9 Exceptions

Handling Exceptions
 Save PC of offending (or interrupted) instruction
 In RISC-V: Supervisor Exception Program Counter (SEPC)
 Save indication of the problem
 In RISC-V: Supervisor Exception Cause Register (SCAUSE)
 64 bits, but most bits unused
Exception code field: 2 for undefined opcode, 12 for hardware malfunction, …
 Jump to handler
 Assume at 0000 0000 1C09 0000hex
Chapter 4 — The Processor — 87

An Alternate Mechanism
 Vectored Interrupts
 Handler address determined by the cause
 Exception vector address to be added to a vector table base register:
 Undefined opcode: 00 0100 0000two
 Hardware malfunction: 01 1000 0000two
 …
 Instructions either
 Deal with the interrupt, or
 Jump to real handler
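A sketch of the dispatch arithmetic: handler address = vector table base + cause-specific offset. The offsets are the slide’s example values; the base address used below is an arbitrary placeholder.

VECTOR_OFFSET = {
    "undefined_opcode":     0b00_0100_0000,   # 0x040
    "hardware_malfunction": 0b01_1000_0000,   # 0x180
}

def handler_address(vector_base, cause):
    return vector_base + VECTOR_OFFSET[cause]

print(hex(handler_address(0x2000_0000, "undefined_opcode")))   # 0x20000040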
Chapter 4 — The Processor — 88

Handler Actions
 Read cause, and transfer to relevant handler
 Determine action required
 If restartable
 Take corrective action
 use SEPC to return to program
 Otherwise
 Terminate program
 Report error using SEPC, SCAUSE, …
Chapter 4 — The Processor — 89

Exceptions in a Pipeline
 Another form of control hazard
 Consider malfunction on add in EX stage
add x1, x2, x1
 Prevent x1 from being clobbered
 Complete previous instructions
 Flush add and subsequent instructions
 Set SEPC and SCAUSE register values
 Transfer control to handler
 Similar to mispredicted branch
 Use much of the same hardware
Chapter 4 — The Processor — 90

Pipeline with Exceptions
Chapter 4 — The Processor — 91

Exception Properties
 Restartable exceptions
 Pipeline can flush the instruction
 Handler executes, then returns to the instruction
 Refetched and executed from scratch
