UCCD1133
Introduction to Computer Organisation and Architecture
Chapter 6
Processor, Memory System and Instruction Execution
1
Disclaimer
• This slide may contain copyrighted material of which has not been specifically authorized by the copyright owner. The use of copyrighted materials are solely for educational purpose. If you wish to use this copyrighted material for other purposes, you must first obtain permission from the original copyright owner.
2
Chapter 6-1
Instruction execution cycle and Stages of processor
3
Instruction Execution Cycle
4
Instruction Execution Cycle
5
Instruction Execution Cycle
6
Instruction Execution Cycle
7
Instruction Execution Cycle
8
Instruction Execution Cycle
9
Instruction Execution Cycle
10
Instruction Execution Cycle
11
Instruction Execution Cycle
12
Instruction Execution Cycle
13
Instruction Execution Cycle
Control needs to have circuitry to:
• Decode the instruction
• Issue signals that control the way information flows between datapath
components
• Control what operations the datapath’s functional units perform
• Decide which is the next instruction and obtain it from memory
14
Instruction Execution Cycle
Datapath needs to have circuitry to:
• Execute instructions – functional units (e.g., adder) and storage
locations (e.g., register file)
• Interconnect the functional units accordingly • Load data from and store data to memory
15
Performance of Computer Systems
• Basic performance measure:
• Response time/ Execution time:
The time between the start and the completion of a task
• Throughput:
The total amount of tasks done in a given time period
• Main factors influencing the performance • Processor and memory
• Input/output controllers and peripherals
• Compilers
• Operating system
• Challenge is to satisfy constraints of: • Cost
• Power
• Performance
16
Performance of CPU
• To measure CPU performance
Execution Time / CPU Time
= Instruction countCPIClock cycle time
= (instructions/program)(cycles/instruction)(seconds/clock cycle)
• CPI (Cycles/instruction): Average number of clock cycles per instruction for a program
• clock cycle (seconds/cycle): The time for one clock period • Note: clock cycle = 1/clock rate
• How to improve CPU time:
• Instruction count: ISA and compiler technology
• CPI: organisation, ISA and compiler technology
• Clock rate: hardware technology and organisation
17
Performance of CPU
• Example
Consider the following performance measurements for a program:
Measurement
Computer A
Computer B
Instruction count
10 billion
8 billion
Clock rate
4 GHz
4 GHz
CPI
1.0
1.1
Which computer is faster? • Solution:
Computer A execution time = (10109 )1.0(2.510-10 sec) = 2.5 sec
Computer B execution time = (8109 )1.1(2.510-10 sec) = 2.2 sec
Computer B is faster.
18
Chapter 6-2
Processors:
Single-cycle, Multi-cycle and Pipeline
19
Microarchitecture
• Microarchitecture: how to implement an architecture in hardware
• Processor:
• Datapath: functional blocks • Control: control signals
20
Datapath Abstract Implementation View
• Assemble the components
• Based on stages of “instruction execution cycle”
• Obtain an abstract datapath implementation view.
• Note : one could have different number of stages for different CPU architecture
21
Datapath Abstract Implementation View
• Datapath with control signals
22
Microarchitecture
• Multiple implementations for a single architecture:
• Single-cycle: Each instruction executes in a single cycle
• Multicycle: Each instruction is broken into series of shorter steps
• Pipelined: Each instruction broken up into series of steps & multiple instructions execute at once
23
Single-Cycle Datapath: lw
Example: lw rt, imm(rs) STEP 1: Fetch instruction
CLK
PC’
CLK CLK
PC Instr
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
A RD
Instruction Memory
24
Single-Cycle Datapath: lw Register Read
Example: lw rt, imm(rs)
STEP 2: Read source operands from RF
CLK
PC’ PC Instr
CLK CLK
25:21
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
A RD
Instruction Memory
25
Single-Cycle Datapath: lw Immediate
Example: lw rt, imm(rs) STEP 3: Sign-extend the immediate
CLK
PC’ PC
CLK
CLK
Instr
25:21
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
15:0
SignImm
WE A RD
Data Memory
WD
Sign Extend
26
Single-Cycle Datapath: lw address
Example: lw rt, imm(rs)
STEP 4: Compute the memory address
CLK
PC’ PC
CLK
ALUControl2:0 010
Zero ALUResult
CLK
WE A RD
Data Memory
WD
Instr
25:21
SrcA
SrcB
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
15:0
Sign Extend
SignImm
27
ALU
Single-Cycle Datapath: lw Memory Read Example: lw rt, imm(rs)
STEP 5: Read data from memory and write it back to register file
CLK
PC’ PC
Instr
25:21
20:16
SrcA
SrcB
Zero ALUResult
ReadData
RegWrite
1
CLK
ALUControl2:0
010
CLK
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
15:0
Sign Extend
SignImm
28
ALU
Single-Cycle Datapath: lw PC Increment Example: lw rt, imm(rs)
STEP 6: Determine address of next instruction
RegWrite
1
CLK
ALUControl2:0 010
CLK
CLK
PC’
PC
Instr
25:21
SrcA
SrcB
Zero ALUResult
A1 WE3 RD1
A RD
Instruction Memory
A2 RD2 A3
ReadData
20:16
WD3 Register File
WE A RD
Data Memory
WD
4
SignImm
PCPlus4
15:0
Sign Extend
Result
29
+
ALU
Single-Cycle Datapath: sw
Example: sw rt, imm(rs)
Write data in rt to memory
RegWrite ALUControl2:0 MemWrite 0 010 1
CLK
PC’ PC
CLK CLK
Instr
25:21
SrcA
SrcB
Zero ALUResult
WriteData
ReadData
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE A RD
Data Memory
WD
A RD
Instruction Memory
20:16
20:16
4
SignImm
PCPlus4
15:0
Sign Extend
Result
30
+
ALU
Single-Cycle Processor
Control MemtoReg Unit MemWrite
Branch
ALUControl2:0
RegWrite
PCSrc
31:26 Op
5:0
ALUSrc Funct RegDst
CLK PC’ PC
CLK
0 1
Instr
25:21
20:16 15:11
15:0
SrcA 0 SrcB
1
<<2
Zero
ALUResult WriteData
PCBranch
ReadData 0
1
CLK
A RD
Instruction Memory
A1 WE3 RD1 A2 RD2
A3 Register WD3 File
WE
A RD
Data Memory
WD
20:16
PCPlus4
4:0
WriteReg
Sign Extend
0
1
SignImm
4
31
Result
+
ALU
+
Drawback of Single-Cycle Processor
❑ Uses the clock cycle inefficiency – the clock cycle is set to accommodate the slowest instruction, and must be the same length for all instructions
Problem:
- all instructions take as much time as the slowest instruction than needed
- Especially problematic for more complex instructions like floating point multiply
❑ Need to duplicate resources that are used more than once per cycle – waste area and power.
Some functional units (eg. adders) must be duplicate since they can not be shared during a clock cycle
❑ Real memory is slower than idealized memory Cannot always get the job done in one (short) cycle Need to further stretch the clock period – more slower.
32
Multicycle MIPS Processor
❑ Single-cycle datapath => CPI=1, CCT => long
❑ Multi-cycle processors
• Require more complex control – control is the hard part.
• Avoid idling – different instructions to take a different number of clock cycles.
• Faster clock rates.
• Data path allows greater sharing of hardware.
❑ MIPS makes control easier
• Instructions same size
• Source registers always in same place
• Immediate same size, location
• Operations always on registers/immediates
❑ Functional units can be used more than once per instruction as long as they are used on different clock cycles, as a result
• Only one memory – holding both instructions and data
• Only one ALU (used almost every cycle) – saves two adders
33
Multicycle MIPS Processor
34
Multicycle State Elements
❑ Replace Instruction and Data memories with a single unified memory – more realistic
CLK CLK
CLK
A1 WE3 RD1 A2 RD2
A3 WD3
Register File
PC’
PC
WE
RD
Instr / Data Memory
WD
EN
A
35
Multicycle Datapath: Instruction Fetch
Example: lw rt, imm(rs) STEP 1: Fetch instruction
IRWrite
CLK CLK CLK PC’b PC
CLK
WE
RD
A
Instr / Data Memory
WD
EN
Instr
A1 WE3 RD1 A2 RD2
A3 WD3
Register File
36
Multicycle Datapath: lw Register Read Example: lw rt, imm(rs)
STEP 2a: Read source operands from RF IRWrite
CLK CLK PC’ b PC
CLK CLK CLK Instr 25:21
A
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
EN
Register File
37
Multicycle Datapath: lw Immediate
Example: lw rt, imm(rs)
STEP 2b: Sign-extend the immediate
CLK PC’ b PC
CLK
IRWrite
CLK
CLK CLK
WE
RD
A
Instr / Data Memory
WD
Instr
25:21
A
A1 WE3 RD1 A2 RD2
A3 WD3
EN
Register File
SignImm
15:0
Sign Extend
38
Multicycle Datapath: lw Address
Example: lw rt, imm(rs)
STEP 3: Compute the memory address
CLK
PC’b PC
CLK
CLK
A
IRWrite
CLK CLK
ALUControl2:0
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
Instr 25:21
SrcA
SrcB
CLK
ALUOut
EN
ALUResult
Register File
15:0
Sign Extend
SignImm
39
ALU
Multicycle Datapath: lw Memory Read
Example: lw rt, imm(rs) STEP 4: Read data from memory
IorD
CLK CLK
PC’b PC 0 Adr 1
IRWrite
CLK
CLK
Data
15:0
ALUControl2:0
CLK
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
Instr 25:21
A
SrcA
SrcB
CLK
ALUOut
EN
ALUResult
Register File
Sign Extend
SignImm
40
ALU
Multicycle Datapath: lw Write Register
Example: lw rt, imm(rs)
STEP 5: Write data back to register file
IorD
CLK CLK
PC’b PC 0 Adr
1
IRWrite
CLK
RegWrite
CLK
ALUControl2:0
WE
RD
A
Instr / Data Memory
WD
Instr 25:21
CLK A
SrcA
SrcB
CLK
ALUOut
EN
ALUResult
CLK
Data
20:16
15:0
Sign Extend
A1 WE3 RD1 A2 RD2
A3 WD3
Register File
SignImm
41
ALU
Multicycle Datapath: Increment PC
Example: lw rt, imm(rs) STEP 6: Increment PC
PCWrite IorD
CLK
IRWrite
RegWrite
ALUSrcA ALUSrcB1:0 ALUControl2:0 0 SrcA
CLK
CLK
PC’ PC
b 0Adr
A 1
00
ALUResult
CLK
ALUOut
CLK
Instr 25:21
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3
WD3
EN
CLK
Data
20:16
Register File
EN 1
4
01 SrcB
10 11
15:0
Sign Extend
SignImm
42
ALU
Multicycle Datapath: sw
Example: sw rt, imm(rs) Write data in rt to memory
PCWrite
CLK
PC’ PC
IorD
MemWrite IRWrite CLK
RegWrite
ALUSrcA ALUSrcB1:0 ALUControl2:0
CLK
b 0Adr EN 1
0 SrcA A 1
CLK
ALUOut
B
00
01 SrcB 10
11
ALUResult
CLK
Instr 25:21
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3
WD3
EN
CLK
Data
20:16
20:16
Register File
4
15:0
Sign Extend
SignImm
43
ALU
Multicycle Processor
CLK
PCWrite
Branch
PCSrc ALUControl2:0 ALUSrcB1:0 ALUSrcA RegWrite
PCEn
IorD Control
Unit
31:26 Op
MemWrite
IRWrite
5:0
Funct
CLK
1
CLK
CLK
PC’
PC
A B
0 SrcA 1
00
4 01 SrcB 10
11 <<2
Zero
ALUResult
CLK
ALUOut 1
0 Adr
0
EN
CLK
Instr
CLK
Data
25:21
20:16
CLK
WE
RD
A
Instr / Data Memory
WD
A1 WE3 RD1 A2 RD2
A3 WD3
EN
20:16 0 15:11 1
0
1
Register File
15:0
Sign Extend
SignImm
44
MemtoReg RegDst
ALU
Single Cycle vs. Multiple Cycle Timing
45
Pipelined Processor
❑ Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
❑ Today, pipelining is nearly universal.
❑ MIPS Instruction Set is designed for pipelining. All instructions are of the same length – 32 bits Easier to fetch and decode in one cycle
46
Pipelined Processor
• Recall that an instruction can be executed in stages (with each stage partially-completed). This is seen previously in single-cycle and multi- cycle processors.
• IF: Instruction Fetch, Increment PC
• ID: Instruction Decode, Read Registers
• EX: Execution (ALU)
• Load/Store: Calculate Address Others: Perform Operation
• MEM:
• Load: Read Data from Memory Store: Write Data to Memory
• WB: Write the result (Data) Back to Register file
• The term “stages” implies datapath resources at each stage
47
Pipelined Processor
❑ Pipelined processor start the next instruction before the current one has completed
improves throughput - total amount of work done in a given time
instruction latency (execution time, delay time, response time - time from the
start of an instruction to its completion) is not reduced
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8
lw
sw
R-type
IFetch
Dec
Exec
Mem
WB
IFetch
Dec
Exec
Mem
WB
IFetch
Dec
Exec
Mem
WB
clock cycle (pipeline stage time) is limited by the slowest stage for some instructions, some stages are wasted cycles
48
Pipeline Analogy: Doing Laundry
49
Pipeline Analogy: Doing Laundry
50
Pipeline Analogy: Doing Laundry
51
Pipeline Analogy: Doing Laundry
Pipelining Lessons:
52
Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation:
Cycle 1 Cycle 2
Clk
Multiple Cycle Implementation:
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10 Clk
lw
sw
Waste
lw sw
Pipeline Implementation:
lw
sw
R-type
IFetch
Dec
Exec
Mem
WB
IFetch
Dec
Exec
Mem
IFetch
IFetch
Dec
Exec
Mem
WB
IFetch
Dec
Exec
Mem
WB
IFetch
Dec
Exec
Mem
WB
R-type
53
Pipelined Processor Abstraction
1 2 3 4 5 6 7 8 9 10
DM
lw$s2,40($0) IM add $s3, $t1, $t2 sub $s4, $s1, $s5 and $s5, $t5, $t6 sw$s6,20($s1) or$s7,$t3,$t4
lw
$0
40 +
$s2
-
IM
Time (cycles)
R
F
R
F
R
F
R
F
DM
IM
add
$t1 $t2
sub
+
IM
$s3
R
F
R
F
DM
IM
$s1 $s5
and
$s4
$s1
20 +
or
$t5
$t6 &
sw
IM
Resource usage
$s5
$t3
$t4 |
$s6
R
F
R
F
DM
R
F
DM
RF
DM
$s7
R
F
R
F
54
Single-Cycle vs. Pipelined Datapath
CLK
0 PC' PC
1
CLK
CLK
Instr
25:21
SrcA 0 SrcB
1
WriteReg4:0 SignImm <<2
Zero
ALUResult WriteData
PCBranch
CLK
ReadData 0
1
A1 WE3 RD1
WE
A RD
Data Memory WD
A RD
Instruction Memory
20:16
A2 RD2 A3
WD3 Register File
20:16
15:11
0
1
PCPlus4
4
15:0
Sign Extend
CLK
Result
ALUOutW
0 ReadDataW 1
CLK
CLK
CLK
CLK
0 PC' PCF 1
InstrD
25:21
SrcAE
0 SrcBE 1
WriteDataE
ZeroM
ALUOutM
CLK
A1 WE3 RD1
A2 RD2 A3
WD3 Register File
A RD
Instruction Memory
WE
A RD
Data Memory WD
20:16
WriteDataM
RtE 0 RdE 1
20:16
15:11
WriteRegE4:0
<<2 PCPlus4E
15:0 PCPlus4D
Sign Extend
SignImmE
4 PCPlus4F
PCBranchM
Fetch
Decode
Execute
Memory
Writeback
ResultW
55
+
+
+ ALU + ALU
Pipelined Processor Control
InstrD
Control Unit
31:26 Op
5:0
25:21
BranchM
PCSrcM
ALUOutM
MemtoRegW
ALUOutW
ReadDataW 0
1
WriteRegW4:0
CLK
0 PC' PCF 1
CLK
Funct
CLK
MemtoRegD
MemWriteD
BranchD
ALUControlD
ALUSrcD
RegDstD
MemtoRegE
MemWriteE
BranchE
ALUControlE2:0 ALUSrcE
SrcAE
0 SrcBE 1
WriteDataE
RegDstE
MemWriteM
CLK
CLK RegWriteW
RegWriteD
CLK CLK RegWriteE
RegWriteM
MemtoRegM
ZeroM
WE
A RD
Data Memory
WD
A RD
Instruction Memory
A1 WE3 RD1
20:16
15:11
A2 RD2 A3
WD3 Register File
WriteDataM
RtE 0 RdE 1
20:16
WriteRegE4:0
WriteRegM4:0
15:0 PCPlus4D
Sign Extend
SignImmE PCPlus4E
<<2
PCBranchM
4 PCPlus4F
ResultW
56
+
ALU
+
Pipeline Performance
• Use T (“time between completion of instructions”) to measure speedup c
– Equality only achieved if stages are balance (i.e. take the same amount of time)
• If not balanced, speedup is reduced • Speedup due to increased throughput
• Latency for each instruction does not decrease
57
Pipeline Performance
•
•
Assume time for stages is
– 100ps for register read or write – 200psforotherstages
What is pipelined clock rate?
• Compare pipelined datapath with single-cycle datapath
58
Pipeline Performance
59
Pipeline Hazards
❑ Problem for pipeline:
instruction depends on result from instruction that hasn’t completed
❑ Types:
Structural hazard: attempt to use the same resource (hardware unit) two
different ways at the same time
- E.g., two instructions try to read the same memory at the same time.
Data hazard: register value not yet written back to register file - E.g., add $r1, $r2, $r3
sub $r4, $r2, $r1
Control hazard: next instruction not decided yet (caused by branches) - E.g., beq $r1, $r2, loop
add $r3, $r4, $r5
❑ Pipeline control must detect hazard Resolve hazards by waiting / delay action
60
Chapter 6-3
Memory system and hierarchy
61
Introduction
• Computer performance depends on:
– –
Processor performance Memory system performance
Memory Interface
CLK
MemWrite
Processor Address WriteData
CLK
Memory ReadData
WE
62
Processor-Memory Gap
In prior chapters, assumed access memory in 1 clock cycle – but hasn’t been true since the 1980’s
63
Memory System Challenge
• Make memory system appear as fast as processor
• Use hierarchy of memories
• Ideal memory:
– Fast
– Cheap(inexpensive) – Large(capacity)
But can only choose two!
64
Memory Hierarchy
$10,000
1
$10
10 - 50
Cache Main Memory
Virtual Memory
Capacity
Technology Price / GB
SRAM
DRAM
SSD $1
HDD $0.1
Access Bandwidth Time (ns) (GB/s)
25+ 10
100,000 0.5 10,000,000 0.1
• • •
Cache: SRAM
Physical Memory: DRAM (Main Memory) Virtual Memory: Hard drive
– Slow, Large, Cheap
65
Speed
Cache
❑ Highest level in memory hierarchy
❑ Fast (typically ~ 1 cycle access time)
❑ Ideally supplies most data to processor
❑ Usually holds most recently accessed data
66
Memory Type
• Types of memory:
• Random access memory (RAM) • Read only memory (ROM)
• RAM is volatile – loses its data when power off.
• Historically called random access memory because any data word accessed as easily as any other (in contrast to sequential access memories such as a tape recorder)
• The main memory in your computer is dynamic random access memory (DRAM).
• DRAM stores data using capacitor
• ROM is nonvolatile – it retains data when power off
• Flash memory in cameras, thumb drives, and digital cameras are all ROMs.
• Historically called read only memory because ROMs were written at manufacturing time or by burning fuses. Once ROM was configured, it could not be written again. This is no longer the case for Flash memory and other types of ROMs.
67
Virtual Memory
• Gives the illusion of bigger memory
• Main memory (DRAM) acts as cache for hard disk
Magnetic Disks
Read/Write Head
Hard disk takes milliseconds to seek the correct location on disk
68
Principle of Locality
Exploit locality to make memory accesses fast
• Temporal Locality:
– Localityintime
– If data used recently, likely to use it again soon
– How to exploit: keep recently accessed data in higher levels of
memory hierarchy
– E.g., instructions in a loop.
• Spatial Locality:
– Locality in space
– If data used recently, likely to use nearby data soon
– How to exploit: when access data, bring nearby data into higher levels
of memory hierarchy too
– E.g., Sequential instruction access, array data.
69
Taking advantage of Locality
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to smaller DRAM memory
– Mainmemory
• Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
– CachememoryattachedtoCPU
70