TLBs
Virtualizing Memory: Faster with TLBs
Questions answered in this lecture:
Review paging…
How can page translations be made faster?
What is the basic idea of a TLB (Translation Lookaside Buffer)?
What types of workloads perform well with TLBs?
How do TLBs interact with context switches?
CSE 2431
Introduction to Operating Systems
Based on slides by Andrea C. Arpaci-Dusseau and
Remzi H. Arpaci-Dusseau
Announcements
Reading for today: Chapter 19
DISADVANTAGES OF PAGING
Although paging has many advantages (discussed further later), which is why all modern OSes use it, it also has disadvantages that must be addressed:
1) Paging doubles the number of memory references (one for the page table access in memory, and one for the data/instruction the process actually needs). Since memory is the system bottleneck, this is a killer if it can't be solved!
2) A page table, which can be a very large data structure, must be stored for every running process in the system.
3) Possibility of internal fragmentation (bytes within a page not needed by the process).
Here, we see how to deal with number 1) above.
Review: Paging

[Figure: physical memory, one 4 KB frame per row from 0x0000 to 0x7000. Frame 0 holds the page tables: P1's at 0x0000, P2's at 0x0800. P1's pages are in frames 1, 4, 5; P2's pages are in frames 2, 3, 6, 7.]

P1 pagetable: 1, 5, 4, …
P2 pagetable: 6, 2, 3, …

Virtual              Physical
(P2 running)
load 0x0000          load 0x0800 (PTE), load 0x6000
load 0x1444          load 0x0808 (PTE), load 0x2444
(P1 running)
load 0x1444          load 0x0008 (PTE), load 0x5444
Assume 4 KB pages (so how many bits for the page offset? 12)
What do we need to know?
Location of page table in memory (PTBR)
Size of each page table entry (assume 8 bytes)
Review: Paging Pros and Cons
Advantages
No external fragmentation
don’t need to find contiguous RAM
All free pages are equivalent
Easy to manage, allocate, and free pages
Disadvantages
Page tables are too big
Must have one entry for every page of address space
Accessing page tables is too slow [today’s focus]
Doubles the number of memory references per instruction (this alone would make paging unusable if it couldn't be solved); today we try to improve this.
Translation Steps

H/W: for each memory reference:
1. extract VPN (virt page num) from VA (virt addr)        (cheap)
2. calculate addr of PTE (page table entry)               (cheap)
3. read PTE from memory                                   (expensive)
4. extract PFN (page frame num)                           (cheap)
5. build PA (phys addr)                                   (cheap)
6. read contents of PA from memory into register          (expensive)

Which steps are expensive? Which expensive step will we avoid in today's lecture?
For step 3: we don't always have to read the PTE from memory if we do things in a better way!
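
As a sketch in C (a minimal model, not any real CPU's circuitry): the six steps above, assuming the 4 KB pages and 8-byte PTEs from the review slide. mem_read() and the PFN field mask are illustrative stand-ins.

#include <stdint.h>

#define PAGE_SHIFT 12                    /* 4 KB pages: 12 offset bits */
#define PTE_SIZE   8                     /* assumed 8-byte page table entries */

extern uint64_t ptbr;                    /* page table base register */
extern uint64_t mem_read(uint64_t pa);   /* stand-in for a physical memory read */

uint64_t translate_and_load(uint64_t va) {
    uint64_t vpn      = va >> PAGE_SHIFT;              /* 1. extract VPN (cheap) */
    uint64_t pte_addr = ptbr + vpn * PTE_SIZE;         /* 2. PTE address (cheap) */
    uint64_t pte      = mem_read(pte_addr);            /* 3. read PTE (expensive!) */
    uint64_t pfn      = pte & 0xFFFFF;                 /* 4. extract PFN (cheap; field layout assumed) */
    uint64_t offset   = va & ((1ULL << PAGE_SHIFT) - 1);
    uint64_t pa       = (pfn << PAGE_SHIFT) | offset;  /* 5. build PA (cheap) */
    return mem_read(pa);                               /* 6. read data (expensive!) */
}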
Example: Array Iterator

int sum = 0;
for (i = 0; i < 2048; i++) {
    sum += a[i];
}

Each a[i] access requires a page table access first: two memory references per element.
Strategy: Cache Page Translations

TLB: Translation Lookaside Buffer (yes, a poor name!)
SO: The TLB is a cache for PTEs; it does not hold the whole page table, only recently/commonly used entries.

[Figure: CPU with a small translation cache holding some popular entries, connected over the memory interconnect to RAM, which holds the full page table]
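
Continuing the sketch above, here is how the TLB changes the flow (tlb_lookup()/tlb_insert() are hypothetical helpers; constants as before). On a hit, the expensive page table read in step 3 disappears entirely.

int  tlb_lookup(uint64_t vpn, uint64_t *pfn);   /* hypothetical: returns 1 on hit */
void tlb_insert(uint64_t vpn, uint64_t pfn);    /* hypothetical: cache a translation */

uint64_t translate(uint64_t va) {
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t pfn;
    if (!tlb_lookup(vpn, &pfn)) {                        /* miss: walk the page table */
        uint64_t pte = mem_read(ptbr + vpn * PTE_SIZE);  /* the expensive read */
        pfn = pte & 0xFFFFF;
        tlb_insert(vpn, pfn);                            /* remember it for next time */
    }                                                    /* hit: no page table access! */
    return (pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
}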
TLB Organization
[Figure: four ways to organize a 16-entry TLB (artificially small): direct mapped (16 sets of 1 entry), two-way set associative (8 sets of 2), four-way set associative (4 sets of 4), fully associative (one set of 16). An index selects the set; entries within the set are matched by tag.]

TLB Entry: Tag (virtual page number) + Physical page number (page table entry)

Lookup:
Calculate set (tag % num_sets)
Search for tag within resulting set
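
A sketch of that lookup for the four-way set-associative case, implementing the tlb_lookup() helper assumed earlier (sizes and entry layout are illustrative):

#include <stdint.h>

#define NUM_ENTRIES 16
#define WAYS        4                       /* four-way set associative */
#define NUM_SETS    (NUM_ENTRIES / WAYS)

struct tlb_entry {
    int      valid;
    uint64_t tag;                           /* virtual page number */
    uint64_t ppn;                           /* physical page number */
};

struct tlb_entry tlb[NUM_SETS][WAYS];

int tlb_lookup(uint64_t vpn, uint64_t *ppn) {
    int set = vpn % NUM_SETS;               /* calculate set (tag % num_sets) */
    for (int way = 0; way < WAYS; way++) {  /* search only within that set */
        if (tlb[set][way].valid && tlb[set][way].tag == vpn) {
            *ppn = tlb[set][way].ppn;
            return 1;                       /* hit */
        }
    }
    return 0;                               /* miss */
}

Direct mapped is WAYS = 1 (no search at all); fully associative is NUM_SETS = 1, so every entry must be compared, which is why fully associative TLBs must stay small.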
TLB Associativity Trade-offs
Higher associativity
+ Better utilization, fewer collisions
– Slower (compared with less associativity)
– More hardware (compared with less associativity)
Lower associativity
+ Fast
+ Simple, less hardware
– Greater chance of collisions
TLBs are usually fully associative; this means they cannot be large (otherwise search time would be too slow).
Array Iterator (w/ TLB)
int sum = 0;
for (i = 0; i < 2048; i++){
sum += a[i];
}
Assume following virtual address stream:
load 0x1000
load 0x1004
load 0x1008
load 0x100C
…
What will TLB behavior look like?
[Figure: same physical memory layout as the review slide, now labeled in KB (frames at 0–28 KB): PT in frame 0, P1's pages in frames 1, 4, 5, P2's pages in frames 2, 3, 6, 7; P1 pagetable: 1, 5, 4, …]
TLB Accesses: Sequential Example

(P1 running; this slide assumes 4-byte PTEs, with P1's page table at physical 0x0000)
Virtual              Physical
load 0x1000          load 0x0004 (PTE), load 0x5000
load 0x1004          load 0x5004 (TLB hit)
load 0x1008          load 0x5008 (TLB hit)
load 0x100C          load 0x500C (TLB hit)
…                    …
load 0x2000          load 0x0008 (PTE), load 0x4000
load 0x2004          load 0x4004 (TLB hit)

CPU's 4-entry TLB afterward (PTBR holds the page table base):
Valid  VPN  PPN
1      1    5
1      2    4
Performance of TLB?
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i];
}
Calculate miss rate of TLB for data:
# TLB misses / # TLB lookups
# TLB lookups?
= number of accesses to a = 2048
# TLB misses?
= number of unique pages accessed
= 2048 / (elements of ‘a’ per 4K page)
= 2K / (4K / sizeof(int)) = 2K / 1K
= 2
Miss rate?
2/2048 = 0.1% (rounded from 0.097%)
Hit rate? (100 – miss rate)
99.9%
Would hit rate get better or worse with smaller pages?
Worse
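
The same arithmetic as a function, making the page-size effect explicit (a back-of-the-envelope sketch; assumes 4-byte ints and that a[] starts on a page boundary):

/* Cold-TLB miss rate for a sequential scan of n ints. */
double miss_rate(int n, int page_size) {
    int ints_per_page = page_size / (int)sizeof(int);            /* 4096/4 = 1024 */
    int unique_pages  = (n + ints_per_page - 1) / ints_per_page; /* one miss per page */
    return (double)unique_pages / n;                             /* misses / lookups */
}

miss_rate(2048, 4096) = 2/2048 ≈ 0.1%; miss_rate(2048, 1024) = 8/2048 ≈ 0.4% — smaller pages, worse hit rate.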
TLB PERFORMANCE
How can system improve TLB performance (hit rate) given fixed number of TLB entries?
Increase page size
Fewer unique page translations needed to access same amount of memory
TLB Reach:
Number of TLB entries * Page Size
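
Example (hypothetical sizes): a 64-entry TLB with 4 KB pages can map 64 × 4 KB = 256 KB at once; with 2 MB superpages, the same 64 entries reach 64 × 2 MB = 128 MB.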
TLB Performance with Workloads
Sequential array accesses almost always hit in TLB
Very fast!
What access pattern will be slow?
Highly random, with no repeat accesses
Workload Access Patterns

Workload A (sequential):
int sum = 0;
for (i = 0; i < 2048; i++) {
    sum += a[i];
}

Workload B (repeated random; the second loop reseeds with the same seed, so it repeats exactly the same access sequence):
int sum = 0;
srand(1234);
for (i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}
srand(1234);
for (i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}

[Figure: address vs. time. Workload A: Sequential Accesses. Workload B: Repeated Random Accesses]
Workload Access Patterns

Workload A (sequential) exhibits spatial locality; Workload B (repeated random) exhibits temporal locality.
Workload Locality
Spatial Locality: future accesses will be to nearby addresses
Temporal Locality: future accesses will repeat earlier accesses to the same data/instructions
What TLB characteristics are best for each type?
Spatial:
Access same page repeatedly; need same vpn->ppn translation
Same TLB entry re-used
Temporal:
Access same address in near future
Same TLB entry re-used in near future
How near in future? How many TLB entries are there?
TLB Replacement Policies

LRU: evict the Least-Recently Used TLB slot when needed
(More on LRU when we cover replacement policies next week)
Random: evict a randomly chosen entry
Which is better?
LRU Troubles

TLB (4 entries, initially empty):
Valid  Virt  Phys
0      ?     ?
0      ?     ?
0      ?     ?
0      ?     ?

Virtual address stream: pages 0, 1, 2, 3, 4, 0, 1, 2, … (the workload repeatedly accesses the same offset across 5 pages, a strided access, but there are only 4 TLB entries).

What will the TLB contents be over time? How will the TLB perform?
With LRU, every access misses: each miss evicts exactly the page that will be needed soonest.

GOOD NEWS: This kind of access pattern is exceedingly rare in software!
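
A toy simulation of the pathology (a sketch, not real hardware; LRU tracked with timestamps):

#define TLB_SLOTS 4
#define NPAGES    5

/* Count misses when LRU serves the cyclic stream 0,1,2,3,4,0,1,2,... */
int count_lru_misses(int accesses) {
    int page[TLB_SLOTS], last_use[TLB_SLOTS], misses = 0;
    for (int i = 0; i < TLB_SLOTS; i++) { page[i] = -1; last_use[i] = 0; }
    for (int t = 0; t < accesses; t++) {
        int vpn = t % NPAGES, hit = 0, victim = 0;
        for (int i = 0; i < TLB_SLOTS; i++)
            if (page[i] == vpn) { hit = 1; last_use[i] = t; }
        if (hit) continue;
        misses++;
        for (int i = 1; i < TLB_SLOTS; i++)   /* prefer an empty slot, else the least recent */
            if (page[i] == -1 || (page[victim] != -1 && last_use[i] < last_use[victim]))
                victim = i;
        page[victim] = vpn;
        last_use[victim] = t;
    }
    return misses;
}

count_lru_misses(1000) returns 1000: LRU always evicts the page that is about to be accessed, so every single access misses. Random replacement would get at least some hits on this stream.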
TLB Replacement Policies

LRU: evict the Least-Recently Used TLB slot when needed
(More on LRU when we cover replacement policies next week)
Random: evict a randomly chosen entry
Sometimes random is better than a "smart" policy! [But rarely!]
Context Switches

What happens if a process uses cached TLB entries from another process? It would translate through the wrong page table and touch another process's physical pages.
Solutions?
Flush TLB on each context switch
Costly; lose all recently cached translations
Track which entries are for which process
Address Space Identifier (ASID): tag each TLB entry with an 8-bit ASID
– how many ASIDs do we get? 2^8 = 256
– why not use PIDs? PIDs are much wider than 8 bits (and there can be more than 256 processes), so they would bloat every TLB entry
TLB Example with ASID

[Figure: physical memory by 4 KB frame, 0 KB–28 KB: PT in frame 0, P1's pages in frames 1, 4, 5, P2's pages in frames 2, 3, 6, 7; PTBR points at the running process's page table]

P1 pagetable (ASID 11): 1, 5, 4, …
P2 pagetable (ASID 12): 6, 2, 3, …

TLB:
Valid  Virt  Phys  ASID
0      1     9     11
1      1     5     11
1      1     2     12
1      0     1     11

(ASID 12, P2 running)  load 0x1444  →  load 0x2444
(ASID 11, P1 running)  load 0x1444  →  load 0x5444
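
As a sketch, an ASID-tagged entry and its match rule (field widths illustrative): an entry hits only if both the VPN and the ASID match, so P1's and P2's translations for the same virtual page can coexist without flushing.

#include <stdint.h>

struct tlb_entry_asid {
    unsigned valid : 1;
    unsigned asid  : 8;        /* 8-bit ASID: 256 distinct address spaces */
    uint64_t vpn;
    uint64_t ppn;
};

extern unsigned current_asid;  /* loaded by the OS at context-switch time */

int entry_matches(const struct tlb_entry_asid *e, uint64_t vpn) {
    return e->valid && e->vpn == vpn && e->asid == current_asid;
}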
TLB Performance
Context switches are expensive
Even with ASIDs, other processes "pollute" the TLB
(process A's entries are discarded to make room for process B's entries)
Architectures can have multiple TLBs
1 TLB for data, 1 TLB for instructions
1 TLB for regular pages, 1 TLB for “super pages”
HW and OS Roles
Who Handles TLB MISS? H/W or OS?
H/W: CPU must know where pagetables are [PTBR]
CR3 register on x86
Pagetable structure fixed and agreed upon between HW and OS
HW “walks” the pagetable and fills TLB
OS: CPU traps into OS upon TLB miss
“Software-managed TLB”
OS interprets pagetables as it chooses
Modifying TLB entries is privileged
– otherwise what could a process do? (insert translations to any physical frame and access memory it doesn't own)
Need same protection bits in TLB as page table
– rwx
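
A rough sketch of such a trap handler (all names here are hypothetical; tlb_write() stands in for the privileged fill instruction, and pte_lookup() is whatever walk the OS implements over its own page table format):

#include <stdint.h>

#define PAGE_SHIFT 12

typedef struct { int valid; unsigned rwx; uint64_t pfn; } pte_t;  /* illustrative layout */

extern void    *current_pagetable;
extern unsigned current_asid;
extern pte_t    pte_lookup(void *pagetable, uint64_t vpn);
extern void     tlb_write(uint64_t vpn, uint64_t pfn, unsigned rwx, unsigned asid);
extern void     raise_segfault(void);

void tlb_miss_handler(uint64_t faulting_va) {
    uint64_t vpn = faulting_va >> PAGE_SHIFT;
    pte_t pte = pte_lookup(current_pagetable, vpn);  /* OS chooses the structure */
    if (!pte.valid)
        raise_segfault();                            /* no translation for this VA */
    tlb_write(vpn, pte.pfn, pte.rwx, current_asid);  /* rwx bits go into the TLB too */
    /* returning from the trap retries the instruction, which now hits */
}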
Summary
Paging is great, but accessing page tables for every memory access is slow
Cache recent page translations in a TLB
Hardware performs TLB lookup on every memory access
TLB performance depends strongly on workload
Sequential workloads perform well
Workloads with temporal locality can perform well
Increase TLB reach by increasing page size
In different systems, hardware or OS handles TLB misses
TLBs increase cost of context switches
Flush TLB on every context switch
Add ASID to every TLB entry