CS4203/EE4363 Computer Organization and Design
Chapter 5
Memory Hierarchy (Virtual Memory)
Prof. Pen
–
Chung Yew
With Slides from Profs. Patterson, Hennessy and Mary Jane Irwin
Review: Major Components of a Computer
Processor
Control Datapath
Memory
Devices Input
Output
IFetch
Dec
Exec
Mem
WB
2
Secondary Memory (Disk)
Main Memory
Cache
Recap: Characteristics of the Memory Hierarchy
Processor
Increasing L1$ distance
from L2$ processor in
Inclusive –
what is in L1$ is a subset of what is in L2$, is a subset of what is in MM, is a subset of what is in SM
access time
Main Memory
Secondary Memory (Disks)
(Relative) size of the memory at each level
3
• •
•
Recap: How is the Hierarchy Managed?
registers « memory
• by compiler (programmer)
cache « main memory
• by cache controller hardware
• Miss penalty can be dozens or hundreds of cycles
main memory « disks (virtual memory)
• miss penalty can be ~million of cycles
• by operating system (virtual memory)
• virtual to physical address mapping assisted by hardware (TLB)
• by programmer (files)
6
Virtual Memory
• Each program is compiled into its own address space – a “virtual” address space (32-bit address or 64-bit address)
• Each virtual address must be assigned (or mapped/translated) to a physical address in main memory (DRAM) at runtime
• Use main memory as a “cache” for secondary memory (i.e. disks)
• Allows efficient and safe sharing of main memory among multiple
programs/processes
• Provides ability to run programs larger than size of physical memory (i.e. DRAM main memory)
• Simplifies loading a program for execution through code relocation (i.e., code can be loaded anywhere in main memory – fully associative)
• What makes it work? – again Principle of Locality !
• A program is likely to repeatedly access a relatively small portion of its address space during any period of time
8
Two Programs Sharing Physical Memory
q A program’s address space is divided into fixed-size pages (i.e. cache blocks) q To completely eliminate conflict misses (to avoid extremely large miss
penalty), it is fully associative
q A page table is used to find each page in main memory.
q Starting address of each page is kept in each page table entry
Program 1 virtual address space
main memory
9
Program 2 virtual address space
Address Translation
q A virtual address is translated (assigned) to a physical address using a combination of hardware and software
Virtual page number
Page offset
Physical page number
Page offset
•
Virtual Address (VA)
3130 … 1211…0
Translation
Physical Address (PA)
29 … 1211 0
Each memory request (load/store) first requires an address translation from a virtual address to a physical address
• A main memory miss (i.e., when the page is not in main memory) is called a page fault (i.e. a disk-cache miss) 10
Address Translation Mechanisms
Virtual page #
Page Offset
Physical page #
Page Offset
Physical page base addr
Main memory
V
1
1
1
1
1
1
0
1
0
1
0
Page Table
(in main memory)
Disk storage
11
Page table register
An additional memory access
Virtual Addressing with a Cache
• It takes an extra memory access to translate a VA to a PA
VA
PA
hit data
miss
CPU
Trans- lation
Cache
Main Memory
q This makes memory (cache) accesses very expensive (if every access needs two memory accesses)
q Hardware fix is to use a small dedicated cache that keeps recently-used address mappings to avoid a page-table lookup to main memory
® called Translation Lookaside Buffer (TLB)
IFetch
Dec
Exec
Mem
WB
Making Address Translation Fast
V Tag TLB
V Physical page #
Physical page #
Virtual page #
1
1
1
0
1
Main memory
1
1
1
1
1
1
0
1
0
1
0
Page Table
(in physical memory)
13
Disk storage
Page table register
Translation Lookaside Buffers (TLBs)
• TLB is just like any other cache
• It can be organized as fully associative, set associative, or
direct mapped
Valid bit
V
Tag
Part of virtual Page #
Physical Page #
Data
Dirty
Ref
Access
q TLB access time is much smaller than cache access time (because TLB is much smaller than cache)
qTLBs typically not more than 512 entries
14
TLB in Memory Hierarchy
VA
miss
1⁄4t hit 3⁄4t PA
hit
miss
CPU
VA: virtual address PA: physical address
TLB Lookup
Cache
Main Memory
•
A TLB miss – could be a page fault or merely a TLB miss • If the page is in main memory, i.e. a true TLB miss
• Load missing entry from page table into TLB (by software or hardware)
•
• Takes 10’s of cycles to find and load a TLB entry
• If the page is not in main memory, i.e. a true page fault
• Takes 1,000,000’s of cycles to service a page fault TLB misses are much more frequent than true page faults
15
Trans- lation
data
TLB Event Combinations
TLB
Page Table
Cache
Possible? Under what circumstances?
Hit
Hit
Hit
Yes – what we want!
Hit
Hit
Miss
Yes – although page table is not checked if TLB hits
Miss
Hit
Hit
Yes – TLB miss, PA in page table
Miss
Hit
Miss
Yes – TLB miss, PA in page table, but data not in cache
Miss
Miss
Miss
Yes – page fault
Hit
Miss
Miss/ Hit
Impossible – TLB translation not possible if page is not present in memory
Miss
Miss
Hit
Impossible – data not allowed in cache if page is not in memory
Handling a TLB Miss
• Consider a true TLB miss (i.e., valid bit in page table is set)
• A TLB miss exception is asserted by the end of memory access stage, so exception handling can begin in next clock cycle
IFetch
Dec
Exec
Mem
WB
Register
CP0 Reg #
Description
EPC
14
Where to restart after exception
Cause
13
Cause of exception
BadVAddr
8
Address that caused exception
Index
0
Location in TLB to be read/written
Random
1
Pseudorandom location in TLB
EntryLo
2
Physical page address and flags
EntryHi
10
Virtual page address
Context
4
Page table address & page number
18
Some Virtual Memory Design Parameters
Paged VM
TLBs
Total page table size
64KB to 1MB
16 to 512 entries
Total memory size (KB)
250,000 to 1,000,000,000
0.25 to 16
Block size (B)
4000 to 64,000
4 to 8
Hit time
0.5 to 1 clock cycle
Miss penalty (clocks)
10,000,000 to 100,000,000
10 to 100
Miss rates
0.00001% to 0.0001%
0.01% to 1%
Address sizes
Page size
TLB organization
(Two-level, L1TLB and L2TLB)
Two Machines’ TLB Parameters
Intel Nehalem
48 bits (vir); 44 bits (phy)
4KB
L1 TLB for instructions and L1 TLB for data per core; both are 4-way set assoc.; LRU
L1 ITLB has 128 entries, L1 DTLB has 64 entries
L2 TLB (unified) is 4-way set assoc.; LRU
L2 TLB has 512 entries
TLB misses handled in hardware
AMD Barcelona
48 bits (vir); 48 bits (phy)
4KB
L1 TLB for instructions and L1 TLB for data per core; both are fully assoc.; LRU
L1 ITLB and DTLB each has 48 entries
Separate L2 TLB for data and instructions per core; 4-way set assoc.; round robin LRU
Both L2 TLBs have 512 entries
TLB misses handled in hardware
Why Not a Virtually Addressed Cache?
• A virtually addressed cache would only require address translation on cache misses
VA CPU
hit
PA
Trans- lation
Main Memory
But,
• Two programs sharing data will have 2 different virtual
addresses for the same physical address – aliasing
• Two copies of shared data in cache and two entries in TLB
àcoherence problem
• Must update all cache entries with the same physical
address or memory becomes inconsistent
• A problem only when cache entries need to be updated (e.g. data cache)
Cache
data
•
Reducing Translation Time
Overlap cache access with TLB access
• Works when high order bits of VA are used to access TLB, while low order bits are used as index into cache
Virtual page #
Page offset
Block offset Index
PA Tag
2-way Associative Cache
VA Tag
PA Tag
Tag
Data
Tag
Data
TLB Hit
•
What if we have a larger cache? • Use a higher associativity
==
26 Cache Hit Desired word
Simple Memory System: Putting it Together
• Addressing Scheme
• Page size = 26 (64 bytes)
• 14-bit virtual addresses = 214 (16 Kbyte) à 214/26 = 28 pages
• 12-bit physical address = 212 (4 Kbytes)à212 /26 = 26 pages
Virtual Address
13 12 11 10 9 8 7 6 5 4 3 2 1 0
VPN VPO Virtual Page Number Virtual Page Offset
11 10 9 8 7 6 5 4 3 2 1 0
PPN PPO
Physical Address
Physical Page Number
Physical Page Offset
Simple Memory System (Page Table)
Page Table – Only show first 16 entries (out of 28 = 256)
VPN
PPN
Valid
00
28
1
01
–
0
02
33
1
03
02
1
04
–
0
05
16
1
06
–
0
07
–
0
VPN
PPN
Valid
08
13
1
09
17
1
0A
09
1
0B
–
0
0C
–
0
0D
2D
1
0E
11
1
0F
0D
1
Virtual Address
13 12 11 10 9 8 7 6 5 4 3 2 1 0
VPN VPO Virtual Page Number Virtual Page Offset
Simple Memory System (TLB)
• TLB – a small cache for address translation
• 16 entries
• 4-way associative
• 4 sets (16/4 = 4 = 22) total in TLB
Virtual address
TLBI = TLB Index TLBT = TLB Tag
TLBT TLBI 13 12 11 10 9 8 7 6
VPN
TLB
543210
VPO
Set
Tag
PPN
Valid
Tag
PPN
Valid
Tag
PPN
Valid
Tag
PPN
Valid
0
03
–
0
09
0D
1
00
–
0
07
02
1
1
03
2D
1
02
–
0
04
–
0
0A
–
0
2
02
–
0
08
–
0
06
–
0
03
–
0
3
07
–
0
03
0D
1
0A
34
1
02
–
0
Simple Memory System (L1 Cache)
• Physically addressed, Direct mapped
• 4-byte block sizeà22 cache block size
• 16 linesà24 lines;
Physical address
L1 Cache
CT CI CO 11 10 9 8 7 6 5 4 3 2 1 0
PPN PPO
Idx
Tag
Valid
B0
B1
B2
B3
0
19
1
99
11
23
11
1
15
0
–
–
–
–
2
1B
1
00
02
04
08
3
36
0
–
–
–
–
4
32
1
43
6D
8F
09
5
0D
1
36
72
F0
1D
6
31
0
–
–
–
–
7
16
1
11
C2
DF
03
Idx
Tag
Valid
B0
B1
B2
B3
8
24
1
3A
00
51
89
9
2D
0
–
–
–
–
A
2D
1
93
15
DA
3B
B
0B
0
–
–
–
–
C
12
0
–
–
–
–
D
16
1
04
96
34
15
E
13
1
83
77
1B
D3
F
14
0
–
–
–
–
Simple Memory System
Page Table – Only show first 16 entries (out of 28 = 256)
Idx
Tag
Valid
B0
B1
B2
B3
0
19
1
99
11
23
11
1
15
0
–
–
–
–
2
1B
1
00
02
04
08
3
36
0
–
–
–
4
32
1
43
6D
8F
09
5
0D
1
36
72
F0
1D
6
31
0
–
–
–
–
7
16
1
11
C2
DF
03
VPN
PPN
Valid
00
28
1
01
–
0
02
33
1
03
02
1
04
–
0
05
16
1
06
–
0
07
–
0
VPN
PPN
Valid
08
13
1
09
17
1
0A
09
1
0B
–
0
0C
–
0
0D
2D
1
0E
11
1
0F
0D
1
L1 Cache
• Physically addressed
• Direct mapped
• 4-byte block sizeà
22 cache block size
• 16 linesà24 lines;
Idx
Tag
Valid
B0
B1
B2
B3
8
24
1
3A
00
51
89
9
2D
0
–
–
–
–
A
2D
1
93
15
DA
3B
B
0B
0
–
–
–
C
12
0
–
–
–
–
D
16
1
04
96
34
15
E
13
1
83
77
1B
D3
F
14
0
–
–
–
–
TLB
•16 entries
•4-way associative
•4 sets (16/4 = 4 = 22) total in TLB
Set
Tag
PPN
Valid
Tag
PPN
Valid
Tag
PPN
Valid
Tag
PPN
Valid
0
03
–
0
09
0D
1
00
–
0
07
02
1
1
03
2d
1
02
–
0
04
–
0
0A
–
0
2
02
–
0
08
–
0
06
–
0
03
–
0
3
07
–
0
03
0D
1
0A
34
1
02
–
0
Address Translation Example #1
Virtual Address: 0x03D4
TLBT TLBI
13 12 11 10 9 8 7 6 5 4 3 2 1 0
0
0
0
0
1
1
1
1
0
1
0
1
0
0
VPN
0x0F 0x3 0x03
Physical Address
VPO
Y N 0x0D TLB Hit? __ Page Fault? __ PPN: ____
CT CI CO
11 10 9 8 7 6 5 4 3 2 1 0
VPN ___ TLBI ___ TLBT ____
0
0
1
1
0
1
0
1
0
1
0
0
PPN
PPO
CO ___
CI___ CT ____
Hit? __
Byte: ____
32
0
0x5 0x0D
Y
0x36
Address Translation Example #2
Virtual Address: 0x0020
TLBT TLBI
13 12 11 10 9 8 7 6 5 4 3 2 1 0 VPN VPO
VPN ___ TLBI ___ TLBT ____ TLB Hit? __ Page Fault? __ PPN: ____
Physical Address
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0x00 0 0x00 N N 0x28
CT CI CO 11 10 9 8 7 6 5 4 3 2 1 0
1
0
1
0
0
0
1
0
0
0
0
0
CO___
PPN
CI___ CT ____
Hit? __
PPO
Byte: ____
33
0
0x8 0x28
N Mem
The Hardware/Software Boundary
• What parts of virtual to physical address translation is assisted by hardware?
• Translation Lookaside Buffer (TLB) that caches recent translations
• TLB access time is part of cache hit time
• May need an extra stage in pipeline for TLB access
• Page table storage, fault detection and updating
• Page faults result in interrupts that are handled by OS
• Hardware must support (i.e., update appropriately) Dirty and Reference bits (e.g., ~LRU) in Page Tables
35
4 Questions for the Memory Hierarchy
• Q1: Where can a entry be placed in the upper level? (Entry placement)
• Q2: How is a entry found if it is in the upper level? (Entry identification)
• Q3: Which entry should be replaced on a miss? (Entry replacement)
• Q4: What happens on a write? (Write strategy)
36
Q1&Q2: Where can a entry be placed/found?
# of sets
Entries per set
Direct mapped
# of entries
1
Set associative
(# of entries)/ associativity
Associativity (typically 2 to 16)
Fully associative
1
# of entries
Location method
# of comparisons
Direct mapped
Index
1
Set associative
Index the set; compare set’s tags
Degree of associativity
Fully associative
Compare all entries’ tags Separate lookup (page) table
# of entries 0
Q3: Which entry should be replaced on a miss?
• Easy for direct mapped – only one choice
• Set associative or fully associative • Random
• LRU (Least Recently Used)
• For a 2-way set associative, random replacement has a miss
rate about 1.1 times higher than LRU
• LRU is too costly to implement for high levels of associativity (> 4-way) since tracking usage information is costly
38
Q4: What happens on a write?
• Write-through – Information is written to the entry in current memory level and to the entry in next level of memory hierarchy
• Always combined with a write buffer so write waits to next level memory can be eliminated (as long as write buffer doesn’t fill)
• Write-back – Information is written only to the entry in current memory level. The modified entry is written to next level of memory only when it is replaced.
• Need a dirty bit to keep track of whether the entry is clean or dirty
• Virtual memory systems always use write-back of dirty pages to disk
• Pros and cons of each?
• Write-through: write misses don’t result in reads (so simpler and cheaper), easier to implement
• Write-back: writes run at the speed of cache; repeated writes require only one write to lower level
39
Summary
• The Principle of Locality:
• Program likely to access a relatively small portion of
address space in a short period of time (i.e. working set).
• Temporal Locality: Locality in Time
• Spatial Locality: Locality in Space
• Caches, TLBs, Virtual Memory all understood by examining how they deal with four questions
1. Where can entry be placed?
2. How is entry found?
3. What entry is replaced on miss?
4. How are writes handled?
• Page tables map virtual address to physical address • TLBs are important for fast translation
40