程序代写代做代考 C clock cache compiler mips UCCD1133

UCCD1133
Introduction to Computer Organisation and Architecture
Chapter 6
Processor, Memory System and Instruction Execution
1

Disclaimer
• This slide may contain copyrighted material of which has not been specifically authorized by the copyright owner. The use of copyrighted materials are solely for educational purpose. If you wish to use this copyrighted material for other purposes, you must first obtain permission from the original copyright owner.
2

Chapter 6-1
Instruction execution cycle and Stages of processor
3

Instruction Execution Cycle
4

Instruction Execution Cycle
5

Instruction Execution Cycle
6

Instruction Execution Cycle
7

Instruction Execution Cycle
8

Instruction Execution Cycle
9

Instruction Execution Cycle
10

Instruction Execution Cycle
11

Instruction Execution Cycle
12

Instruction Execution Cycle
13

Instruction Execution Cycle
Control needs to have circuitry to:
• Decode the instruction
• Issue signals that control the way information flows between datapath
components
• Control what operations the datapath’s functional units perform
• Decide which is the next instruction and obtain it from memory
14

Instruction Execution Cycle
Datapath needs to have circuitry to:
• Execute instructions – functional units (e.g., adder) and storage
locations (e.g., register file)
• Interconnect the functional units accordingly • Load data from and store data to memory
15

Performance of Computer Systems
• Basic performance measure:
• Response time/ Execution time:
The time between the start and the completion of a task
• Throughput:
The total amount of tasks done in a given time period
• Main factors influencing the performance • Processor and memory
• Input/output controllers and peripherals
• Compilers
• Operating system
• Challenge is to satisfy constraints of: • Cost
• Power
• Performance
16

Performance of CPU
• To measure CPU performance
Execution Time / CPU Time
= Instruction countCPIClock cycle time
= (instructions/program)(cycles/instruction)(seconds/clock cycle)
• CPI (Cycles/instruction): Average number of clock cycles per instruction for a program
• clock cycle (seconds/cycle): The time for one clock period • Note: clock cycle = 1/clock rate
• How to improve CPU time:
• Instruction count: ISA and compiler technology
• CPI: organisation, ISA and compiler technology
• Clock rate: hardware technology and organisation
17

Performance of CPU
• Example
Consider the following performance measurements for a program:
Measurement
Computer A
Computer B
Instruction count
10 billion
8 billion
Clock rate
4 GHz
4 GHz
CPI
1.0
1.1
Which computer is faster? • Solution:
Computer A execution time = (10109 )1.0(2.510-10 sec) = 2.5 sec
Computer B execution time = (8109 )1.1(2.510-10 sec) = 2.2 sec
Computer B is faster.
18

Chapter 6-2
Processors:
Single-cycle, Multi-cycle and Pipeline
19

Microarchitecture
• Microarchitecture: how to implement an architecture in hardware
• Processor:
• Datapath: functional blocks • Control: control signals
20

Datapath Abstract Implementation View
• Assemble the components
• Based on stages of “instruction execution cycle”
• Obtain an abstract datapath implementation view.
• Note : one could have different number of stages for different CPU architecture
21

Datapath Abstract Implementation View
• Datapath with control signals
22

Microarchitecture
• Multiple implementations for a single architecture:
• Single-cycle: Each instruction executes in a single cycle
• Multicycle: Each instruction is broken into series of shorter steps
• Pipelined: Each instruction broken up into series of steps & multiple instructions execute at once
23

Single-Cycle Datapath: lw
Example: lw rt, imm(rs) STEP 1: Fetch instruction
CLK
PC’
CLK CLK
PC Instr
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
A RD
Instruction Memory
24

Single-Cycle Datapath: lw Register Read
Example: lw rt, imm(rs)
STEP 2: Read source operands from RF
CLK
PC’ PC Instr
CLK CLK
25:21
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
A RD
Instruction Memory
25

Single-Cycle Datapath: lw Immediate
Example: lw rt, imm(rs) STEP 3: Sign-extend the immediate
CLK
PC’ PC
CLK
CLK
Instr
25:21
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
15:0
SignImm
WE A RD
Data Memory
WD
Sign Extend
26

Single-Cycle Datapath: lw address
Example: lw rt, imm(rs)
STEP 4: Compute the memory address
CLK
PC’ PC
CLK
ALUControl2:0 010
Zero ALUResult
CLK
WE A RD
Data Memory
WD
Instr
25:21
SrcA
SrcB
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
15:0
Sign Extend
SignImm
27
ALU

Single-Cycle Datapath: lw Memory Read Example: lw rt, imm(rs)
STEP 5: Read data from memory and write it back to register file
CLK
PC’ PC
Instr
25:21
20:16
SrcA
SrcB
Zero ALUResult
ReadData
RegWrite
1
CLK
ALUControl2:0
010
CLK
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
15:0
Sign Extend
SignImm
28
ALU

Single-Cycle Datapath: lw PC Increment Example: lw rt, imm(rs)
STEP 6: Determine address of next instruction
RegWrite
1
CLK
ALUControl2:0 010
CLK
CLK
PC’
PC
Instr
25:21
SrcA
SrcB
Zero ALUResult
A1 WE3 RD1
A RD
Instruction Memory
A2 RD2 A3
ReadData
20:16
WD3 Register File
WE A RD
Data Memory
WD
4
SignImm
PCPlus4
15:0
Sign Extend
Result
29
+
ALU

Single-Cycle Datapath: sw
Example: sw rt, imm(rs)
Write data in rt to memory
RegWrite ALUControl2:0 MemWrite 0 010 1
CLK
PC’ PC
CLK CLK
Instr
25:21
SrcA
SrcB
Zero ALUResult
WriteData
ReadData
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
A RD
Instruction Memory
20:16
20:16
4
SignImm
PCPlus4
15:0
Sign Extend
Result
30
+
ALU

Single-Cycle Processor
Control MemtoReg Unit MemWrite
Branch
ALUControl2:0
RegWrite
PCSrc
31:26 Op
5:0
ALUSrc Funct RegDst
CLK PC’ PC
CLK
0 1
Instr
25:21
20:16 15:11
15:0
SrcA 0 SrcB
1
<<2 Zero ALUResult WriteData PCBranch ReadData 0 1 CLK A RD Instruction Memory A1 WE3 RD1 A2 RD2 A3 Register WD3 File WE A RD Data Memory WD 20:16 PCPlus4 4:0 WriteReg Sign Extend 0 1 SignImm 4 31 Result + ALU + Drawback of Single-Cycle Processor ❑ Uses the clock cycle inefficiency – the clock cycle is set to accommodate the slowest instruction, and must be the same length for all instructions Problem: - all instructions take as much time as the slowest instruction than needed - Especially problematic for more complex instructions like floating point multiply ❑ Need to duplicate resources that are used more than once per cycle – waste area and power. Some functional units (eg. adders) must be duplicate since they can not be shared during a clock cycle ❑ Real memory is slower than idealized memory Cannot always get the job done in one (short) cycle Need to further stretch the clock period – more slower. 32 Multicycle MIPS Processor ❑ Single-cycle datapath => CPI=1, CCT => long
❑ Multi-cycle processors
• Require more complex control – control is the hard part.
• Avoid idling – different instructions to take a different number of clock cycles.
• Faster clock rates.
• Data path allows greater sharing of hardware.
❑ MIPS makes control easier
• Instructions same size
• Source registers always in same place
• Immediate same size, location
• Operations always on registers/immediates
❑ Functional units can be used more than once per instruction as long as they are used on different clock cycles, as a result
• Only one memory – holding both instructions and data
• Only one ALU (used almost every cycle) – saves two adders
33

Multicycle MIPS Processor
34

Multicycle State Elements
❑ Replace Instruction and Data memories with a single unified memory – more realistic
CLK CLK
CLK
A1 WE3 RD1 A2 RD2
A3 WD3
Register File
PC’
PC
WE
RD
Instr / Data Memory
WD
EN
A
35

Multicycle Datapath: Instruction Fetch
Example: lw rt, imm(rs) STEP 1: Fetch instruction
IRWrite
CLK CLK CLK PC’b PC
CLK
WE
RD
A
Instr / Data Memory
WD
EN
Instr
A1 WE3 RD1 A2 RD2
A3 WD3
Register File
36

Multicycle Datapath: lw Register Read Example: lw rt, imm(rs)
STEP 2a: Read source operands from RF IRWrite
CLK CLK PC’ b PC
CLK CLK CLK Instr 25:21
A
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
EN
Register File
37

Multicycle Datapath: lw Immediate
Example: lw rt, imm(rs)
STEP 2b: Sign-extend the immediate
CLK PC’ b PC
CLK
IRWrite
CLK
CLK CLK
WE
RD
A
Instr / Data Memory
WD
Instr
25:21
A
A1 WE3 RD1 A2 RD2
A3 WD3
EN
Register File
SignImm
15:0
Sign Extend
38

Multicycle Datapath: lw Address
Example: lw rt, imm(rs)
STEP 3: Compute the memory address
CLK
PC’b PC
CLK
CLK
A
IRWrite
CLK CLK
ALUControl2:0
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
Instr 25:21
SrcA
SrcB
CLK
ALUOut
EN
ALUResult
Register File
15:0
Sign Extend
SignImm
39
ALU

Multicycle Datapath: lw Memory Read
Example: lw rt, imm(rs) STEP 4: Read data from memory
IorD
CLK CLK
PC’b PC 0 Adr 1
IRWrite
CLK
CLK
Data
15:0
ALUControl2:0
CLK
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
Instr 25:21
A
SrcA
SrcB
CLK
ALUOut
EN
ALUResult
Register File
Sign Extend
SignImm
40
ALU

Multicycle Datapath: lw Write Register
Example: lw rt, imm(rs)
STEP 5: Write data back to register file
IorD
CLK CLK
PC’b PC 0 Adr
1
IRWrite
CLK
RegWrite
CLK
ALUControl2:0
WE
RD
A
Instr / Data Memory
WD
Instr 25:21
CLK A
SrcA
SrcB
CLK
ALUOut
EN
ALUResult
CLK
Data
20:16
15:0
Sign Extend
A1 WE3 RD1 A2 RD2
A3 WD3
Register File
SignImm
41
ALU

Multicycle Datapath: Increment PC
Example: lw rt, imm(rs) STEP 6: Increment PC
PCWrite IorD
CLK
IRWrite
RegWrite
ALUSrcA ALUSrcB1:0 ALUControl2:0 0 SrcA
CLK
CLK
PC’ PC
b 0Adr
A 1
00
ALUResult
CLK
ALUOut
CLK
Instr 25:21
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3
WD3
EN
CLK
Data
20:16
Register File
EN 1
4
01 SrcB
10 11
15:0
Sign Extend
SignImm
42
ALU

Multicycle Datapath: sw
Example: sw rt, imm(rs) Write data in rt to memory
PCWrite
CLK
PC’ PC
IorD
MemWrite IRWrite CLK
RegWrite
ALUSrcA ALUSrcB1:0 ALUControl2:0
CLK
b 0Adr EN 1
0 SrcA A 1
CLK
ALUOut
B
00
01 SrcB 10
11
ALUResult
CLK
Instr 25:21
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3
WD3
EN
CLK
Data
20:16
20:16
Register File
4
15:0
Sign Extend
SignImm
43
ALU

Multicycle Processor
CLK
PCWrite
Branch
PCSrc ALUControl2:0 ALUSrcB1:0 ALUSrcA RegWrite
PCEn
IorD Control
Unit
31:26 Op
MemWrite
IRWrite
5:0
Funct
CLK
1
CLK
CLK
PC’
PC
A B
0 SrcA 1
00
4 01 SrcB 10
11 <<2 Zero ALUResult CLK ALUOut 1 0 Adr 0 EN CLK Instr CLK Data 25:21 20:16 CLK WE RD A Instr / Data Memory WD A1 WE3 RD1 A2 RD2 A3 WD3 EN 20:16 0 15:11 1 0 1 Register File 15:0 Sign Extend SignImm 44 MemtoReg RegDst ALU Single Cycle vs. Multiple Cycle Timing 45 Pipelined Processor ❑ Pipelining is an implementation technique in which multiple instructions are overlapped in execution. ❑ Today, pipelining is nearly universal. ❑ MIPS Instruction Set is designed for pipelining. All instructions are of the same length – 32 bits Easier to fetch and decode in one cycle 46 Pipelined Processor • Recall that an instruction can be executed in stages (with each stage partially-completed). This is seen previously in single-cycle and multi- cycle processors. • IF: Instruction Fetch, Increment PC • ID: Instruction Decode, Read Registers • EX: Execution (ALU) • Load/Store: Calculate Address Others: Perform Operation • MEM: • Load: Read Data from Memory Store: Write Data to Memory • WB: Write the result (Data) Back to Register file • The term “stages” implies datapath resources at each stage 47 Pipelined Processor ❑ Pipelined processor start the next instruction before the current one has completed improves throughput - total amount of work done in a given time instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 lw sw R-type IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB clock cycle (pipeline stage time) is limited by the slowest stage for some instructions, some stages are wasted cycles 48 Pipeline Analogy: Doing Laundry 49 Pipeline Analogy: Doing Laundry 50 Pipeline Analogy: Doing Laundry 51 Pipeline Analogy: Doing Laundry Pipelining Lessons: 52 Single Cycle, Multiple Cycle, vs. Pipeline Single Cycle Implementation: Cycle 1 Cycle 2 Clk Multiple Cycle Implementation: Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10 Clk lw sw Waste lw sw Pipeline Implementation: lw sw R-type IFetch Dec Exec Mem WB IFetch Dec Exec Mem IFetch IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB R-type 53 Pipelined Processor Abstraction 1 2 3 4 5 6 7 8 9 10 DM lw$s2,40($0) IM add $s3, $t1, $t2 sub $s4, $s1, $s5 and $s5, $t5, $t6 sw$s6,20($s1) or$s7,$t3,$t4 lw $0 40 + $s2 - IM Time (cycles) R F R F R F R F DM IM add $t1 $t2 sub + IM $s3 R F R F DM IM $s1 $s5 and $s4 $s1 20 + or $t5 $t6 & sw IM Resource usage $s5 $t3 $t4 | $s6 R F R F DM R F DM RF DM $s7 R F R F 54 Single-Cycle vs. Pipelined Datapath CLK 0 PC' PC 1 CLK CLK Instr 25:21 SrcA 0 SrcB 1 WriteReg4:0 SignImm <<2 Zero ALUResult WriteData PCBranch CLK ReadData 0 1 A1 WE3 RD1 WE A RD Data Memory WD A RD Instruction Memory 20:16 A2 RD2 A3 WD3 Register File 20:16 15:11 0 1 PCPlus4 4 15:0 Sign Extend CLK Result ALUOutW 0 ReadDataW 1 CLK CLK CLK CLK 0 PC' PCF 1 InstrD 25:21 SrcAE 0 SrcBE 1 WriteDataE ZeroM ALUOutM CLK A1 WE3 RD1 A2 RD2 A3 WD3 Register File A RD Instruction Memory WE A RD Data Memory WD 20:16 WriteDataM RtE 0 RdE 1 20:16 15:11 WriteRegE4:0 <<2 PCPlus4E 15:0 PCPlus4D Sign Extend SignImmE 4 PCPlus4F PCBranchM Fetch Decode Execute Memory Writeback ResultW 55 + + + ALU + ALU Pipelined Processor Control InstrD Control Unit 31:26 Op 5:0 25:21 BranchM PCSrcM ALUOutM MemtoRegW ALUOutW ReadDataW 0 1 WriteRegW4:0 CLK 0 PC' PCF 1 CLK Funct CLK MemtoRegD MemWriteD BranchD ALUControlD ALUSrcD RegDstD MemtoRegE MemWriteE BranchE ALUControlE2:0 ALUSrcE SrcAE 0 SrcBE 1 WriteDataE RegDstE MemWriteM CLK CLK RegWriteW RegWriteD CLK CLK RegWriteE RegWriteM MemtoRegM ZeroM WE A RD Data Memory WD A RD Instruction Memory A1 WE3 RD1 20:16 15:11 A2 RD2 A3 WD3 Register File WriteDataM RtE 0 RdE 1 20:16 WriteRegE4:0 WriteRegM4:0 15:0 PCPlus4D Sign Extend SignImmE PCPlus4E <<2 PCBranchM 4 PCPlus4F ResultW 56 + ALU + Pipeline Performance • Use T (“time between completion of instructions”) to measure speedup c – Equality only achieved if stages are balance (i.e. take the same amount of time) • If not balanced, speedup is reduced • Speedup due to increased throughput • Latency for each instruction does not decrease 57 Pipeline Performance • • Assume time for stages is – 100ps for register read or write – 200psforotherstages What is pipelined clock rate? • Compare pipelined datapath with single-cycle datapath 58 Pipeline Performance 59 Pipeline Hazards ❑ Problem for pipeline: instruction depends on result from instruction that hasn’t completed ❑ Types: Structural hazard: attempt to use the same resource (hardware unit) two different ways at the same time - E.g., two instructions try to read the same memory at the same time. Data hazard: register value not yet written back to register file - E.g., add $r1, $r2, $r3 sub $r4, $r2, $r1 Control hazard: next instruction not decided yet (caused by branches) - E.g., beq $r1, $r2, loop add $r3, $r4, $r5 ❑ Pipeline control must detect hazard Resolve hazards by waiting / delay action 60 Chapter 6-3 Memory system and hierarchy 61 Introduction • Computer performance depends on: – – Processor performance Memory system performance Memory Interface CLK MemWrite Processor Address WriteData CLK Memory ReadData WE 62 Processor-Memory Gap In prior chapters, assumed access memory in 1 clock cycle – but hasn’t been true since the 1980’s 63 Memory System Challenge • Make memory system appear as fast as processor • Use hierarchy of memories • Ideal memory: – Fast – Cheap(inexpensive) – Large(capacity) But can only choose two! 64 Memory Hierarchy $10,000 1 $10 10 - 50 Cache Main Memory Virtual Memory Capacity Technology Price / GB SRAM DRAM SSD $1 HDD $0.1 Access Bandwidth Time (ns) (GB/s) 25+ 10 100,000 0.5 10,000,000 0.1 • • • Cache: SRAM Physical Memory: DRAM (Main Memory) Virtual Memory: Hard drive – Slow, Large, Cheap 65 Speed Cache ❑ Highest level in memory hierarchy ❑ Fast (typically ~ 1 cycle access time) ❑ Ideally supplies most data to processor ❑ Usually holds most recently accessed data 66 Memory Type • Types of memory: • Random access memory (RAM) • Read only memory (ROM) • RAM is volatile – loses its data when power off. • Historically called random access memory because any data word accessed as easily as any other (in contrast to sequential access memories such as a tape recorder) • The main memory in your computer is dynamic random access memory (DRAM). • DRAM stores data using capacitor • ROM is nonvolatile – it retains data when power off • Flash memory in cameras, thumb drives, and digital cameras are all ROMs. • Historically called read only memory because ROMs were written at manufacturing time or by burning fuses. Once ROM was configured, it could not be written again. This is no longer the case for Flash memory and other types of ROMs. 67 Virtual Memory • Gives the illusion of bigger memory • Main memory (DRAM) acts as cache for hard disk Magnetic Disks Read/Write Head Hard disk takes milliseconds to seek the correct location on disk 68 Principle of Locality Exploit locality to make memory accesses fast • Temporal Locality: – Localityintime – If data used recently, likely to use it again soon – How to exploit: keep recently accessed data in higher levels of memory hierarchy – E.g., instructions in a loop. • Spatial Locality: – Locality in space – If data used recently, likely to use nearby data soon – How to exploit: when access data, bring nearby data into higher levels of memory hierarchy too – E.g., Sequential instruction access, array data. 69 Taking advantage of Locality • Store everything on disk • Copy recently accessed (and nearby) items from disk to smaller DRAM memory – Mainmemory • Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory – CachememoryattachedtoCPU 70