MEMORY: TLBS
Andrea Arpaci-Dusseau CS 537, Fall 2019
ADMINISTRIVIA
– Project 1 graded out of 120 points
– Details in 3 files in your handin/p1 directory
– Contact TA (yifan) if significant grading discrepancy
– Project 2 due last night
– Project 3 available: Shell in Linux
– Subject of tomorrow’s discussion sections (fork() and exec())
– Midterm 1: Thursday, Oct 10th from 7:30-9:30pm
– Fill out Exam Conflict form in Canvas by Thursday Sept 26
– If you filled out the form for the old date of Oct 9th and still have a conflict, fill it out again
– Next discussion sections: lecture review? Sample exam to be posted soon
– Canvas Homeworks
– Due each Tuesday and Thursday
AGENDA / LEARNING OUTCOMES
Memory virtualization
Review paging…
How can page translations be made faster?
What is the basic idea of a TLB (Translation Lookaside Buffer)? What types of workloads perform well with TLBs?
How do TLBs interact with context-switches?
RECAP
FRAGMENTATION
Definition: Free memory that can’t be usefully allocated
Types of fragmentation
External: Visible to allocator (e.g., OS)
Internal: Visible to requester
[Figure: an allocated region shown as partly useful data and partly free space the requester cannot use - internal fragmentation]
Paging
Goal: Eliminate requirement that address space is contiguous
Eliminate external fragmentation
Grow segments as needed
– Don’t allocate pages that aren’t needed
Idea:
Divide address spaces and physical memory into fixed-sized pages
Size: 2^n; Example: 4KB
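For example (assuming 32-bit virtual addresses with the 4KB pages above): the low 12 bits of an address are the offset within a page and the remaining 20 bits are the virtual page number, giving 2^20 virtual pages per address space.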
[Figure: logical view - Process 1, 2, and 3 each see a contiguous address space of pages; physical view - those pages are scattered across physical memory frames]
PAGETABLES
What is a good data structure?
Simple solution: Linear page table aka array
[Figure: the linear page table is an array of PTEs indexed by VPN (VPN 0 at the base, one entry per virtual page), stored in physical memory starting at the address held in the ptbr; 4 KB pages assumed]
REVIEW: PAGING
[Figure: worked translation example. Physical memory frames hold the two page tables plus pages belonging to P1 and P2 (PT, PT, P1, P2, P2, P1, P1, P2). P1's page table maps VPNs 0-3 to PFNs 1, 5, 6, 2; P2's maps VPN 0 to PFN 4 and VPN 2 to PFN 3. Each virtual load (e.g., load 0x0000, load 0x1444) turns into two physical accesses: a load of the PTE from the page table, then a load of the data from the translated physical address.]
What do we need to know?
– Location of page table in memory (ptbr)
– Size of each page table entry (assume 8 bytes)
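For example (VPN chosen just for illustration): with 8-byte entries, the PTE for VPN 3 lives at physical address ptbr + 3 × 8 = ptbr + 0x18.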
1 minute chat: PAGING TRANSLATION STEPS
For each memory reference:
1. Extract VPN (virt page num) from VA (virt addr) (cheap)
2. Calculate addr of PTE (page table entry) (cheap)
3. Read PTE from memory (expensive)
4. Extract PFN (page frame num) (cheap)
5. Build PA (phys addr) (cheap)
6. Read contents of PA from memory into register (expensive)
Which steps are cheap and which are expensive? Steps 3 and 6 each require an extra memory access; the rest is simple arithmetic.
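A minimal sketch of these six steps in C, against a toy physical memory (the phys_mem array, phys_read64() helper, and the example mapping in main() are assumptions for illustration; the simplified PTE here holds only the PFN, with no valid or protection bits):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12                        /* 4 KB pages */
#define OFFSET_MASK ((1ull << PAGE_SHIFT) - 1)
#define PTE_SIZE    8                         /* assume 8-byte PTEs */

static uint8_t  phys_mem[1 << 15];            /* toy 32 KB physical memory */
static uint64_t ptbr = 0x0000;                /* page table base (assumed) */

/* Each call models one physical memory access (the expensive part). */
static uint64_t phys_read64(uint64_t pa) {
    return *(uint64_t *)&phys_mem[pa];
}

/* Translate a virtual address and load from it: the six steps above. */
static uint64_t vload(uint64_t va) {
    uint64_t vpn      = va >> PAGE_SHIFT;             /* 1. extract VPN    (cheap)     */
    uint64_t pte_addr = ptbr + vpn * PTE_SIZE;        /* 2. addr of PTE    (cheap)     */
    uint64_t pte      = phys_read64(pte_addr);        /* 3. read PTE       (expensive) */
    uint64_t pfn      = pte;                          /* 4. extract PFN    (cheap)     */
    uint64_t pa       = (pfn << PAGE_SHIFT) | (va & OFFSET_MASK);  /* 5. build PA (cheap) */
    return phys_read64(pa);                           /* 6. read data      (expensive) */
}

int main(void) {
    *(uint64_t *)&phys_mem[0] = 5;                    /* PTE for VPN 0: PFN 5 (example) */
    *(uint64_t *)&phys_mem[5 << PAGE_SHIFT] = 42;     /* data word in frame 5 */
    printf("%llu\n", (unsigned long long)vload(0x0000));  /* prints 42 */
    return 0;
}

Steps 3 and 6 are each a call to phys_read64(), i.e., a real memory access; the rest is register arithmetic, which is why caching the translation pays off.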
Disadvantages of Paging
Additional memory reference to page table → Very inefficient
– Page table must be stored in memory
– MMU stores only base address of page table
Storage for page tables may be substantial
– Simple page table: Requires PTE for all pages in address space
Entry needed even if page not allocated?
Example: Array Iterator
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i];
}
What virtual addresses?
Observation:
Repeatedly access same PTE because program repeatedly accesses same virtual page
Strategy: Cache Page Translations
[Figure: the CPU holds a small translation cache between the processor and the memory interconnect; the page table itself stays in RAM]
TLB: Translation Lookaside Buffer
TLB: TRANSLATION LOOKASIDE BUFFER
TLB Entry:
– Tag (virtual page number)
– Physical page number (page table entry)
– Protection bits (rwx)
TLB Organization: Fully associative
– Any given translation can be anywhere in the TLB
– Hardware will search the entire TLB in parallel
[Figure: the TLB drawn as a single row of entries (A, B, C, D, E, ..., L, M, N, O, P); a translation may sit in any slot]
Operations:
Lookup and Replacement
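A minimal sketch of a fully associative lookup over a small software model of the TLB (the entry count, field names, and tlb_lookup() helper are assumptions for illustration, not real hardware):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16              /* assumed size; real TLBs are often 32-128 entries */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;                   /* tag: virtual page number */
    uint64_t pfn;                   /* physical page number (from the PTE) */
    uint8_t  prot;                  /* protection bits: rwx */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative: the VPN is compared against every entry.
 * Hardware does the comparisons in parallel; software has to loop. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;  /* hit */
            return true;
        }
    }
    return false;                   /* miss: fall back to the page table walk */
}

int main(void) {
    tlb[3] = (struct tlb_entry){ .valid = true, .vpn = 1, .pfn = 5, .prot = 0x7 };
    uint64_t pfn = 0;
    if (tlb_lookup(1, &pfn))
        printf("TLB hit: VPN 1 -> PFN %llu\n", (unsigned long long)pfn);
    return 0;
}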
int sum = 0;
for (i = 0; i < 2048; i++){
sum += a[i];
}
Assume ‘a’ starts at 0x1000
Ignore instruction fetches and access to ‘i’
Assume following virtual address stream: load 0x1000
load 0x1004
load 0x1008
load 0x100C ...
Array Iterator (w/ TLB)
What will TLB behavior look like?
TLB Accesses: SEQUENTIAL Example
[Figure: physical memory from 0 KB to 28 KB holds the page tables (PT, PT) and frames belonging to P1 and P2; P1's page table, pointed to by the PTBR, maps VPN 1 → PPN 5 and VPN 2 → PPN 4; the CPU's TLB (slots 0-3) starts out empty]
Virtual address stream: load 0x1000, load 0x1004, load 0x1008, load 0x100C, ..., load 0x2000, load 0x2004, ...
Resulting physical accesses:
– load 0x1000 misses in the TLB: first read the PTE (load 0x0004), then the data (load 0x5000)
– load 0x1004, 0x1008, 0x100C, ... are TLB hits: load 0x5004, 0x5008, 0x500C, ... directly
– load 0x2000 misses on the new page: read the PTE (load 0x0008), then the data (load 0x4000)
– load 0x2004, ... are TLB hits again: load 0x4004, ...
TLB contents afterward:
Valid VPN PPN
1     1   5
1     2   4
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i]; }
Calculate miss rate of TLB for data (ignore code and sum):
miss rate = # TLB misses / # TLB lookups
# TLB lookups?
= number of accesses to array a[] = 2048
# TLB misses?
= number of unique pages accessed
= 2048 / (elements of a[] per 4K page)
= 2K / (4KB / sizeof(int)) = 2K / 1K = 2 pages
Miss rate?
2/2048 ≈ 0.1%
Hit rate? (1 – miss rate)
≈ 99.9%
Would hit rate get better or worse with smaller pages?
Worse
Would hit rate get better or worse with more iterations?
Stays the same!
Still miss only the first access to each page: always 1 miss per 1024 accesses
PERFORMANCE OF TLB?
TLB PERFORMANCE
How can system improve TLB performance (hit rate) given fixed number of TLB entries?
Increase page size
Fewer unique page translations needed to access same amount of memory
TLB Reach:
Number of TLB entries * Page Size
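For example (entry count assumed for illustration): a 64-entry TLB with 4 KB pages reaches 64 × 4 KB = 256 KB of memory; with 2 MB superpages the same 64 entries reach 64 × 2 MB = 128 MB.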
TLB PERFORMANCE with Workloads
Sequential array accesses almost always hit in TLB
– Very fast!
What access pattern will be slow?
– Highly random, with no repeat accesses
Workload A
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i];
}
Workload B
int sum = 0; // large N
srand(1234);
for (i=0; i<1024; i++) {
sum += a[rand() % N];
}
srand(1234);
for (i=0; i<1024; i++) {
sum += a[rand() % N];
}
Note: re-seeding with srand(1234) makes the second loop repeat exactly the same sequence of random indices as the first, so its accesses revisit pages touched just before.
WORKLOAD ACCESS PATTERNS
[Figure: address vs. time plots for the two workloads - Workload A's sequential accesses show spatial locality; Workload B's repeated random accesses show temporal locality]
Workload Locality
Spatial Locality: future access will be to nearby addresses
Temporal Locality: future access will be repeats to the same data as past access
What TLB characteristics are best for each type?
Spatial:
– Access same page repeatedly; need same vpn → ppn translation
– Same TLB entry re-used
Temporal:
– Access same address near in future
– Same TLB entry re-used in near future
– How near in future? How many TLB entries are there?
More entries in TLB -> More likely to contain entries accessed in past
TLB Replacement policies
LRU: Evict Least-Recently-Used TLB slot when needed (more on LRU in policies next week)
Random: Evict a randomly-chosen entry
Which is better?
2-Minute Chat: LRU Troubles
Initial TLB: 4 entries, all invalid (Valid = 0, VPN = ?, PFN = ?)
Virtual addresses: pages 0, 1, 2, 3, 4, accessed in order, repeatedly
TLB after 4 accesses: Valid = 1 for VPN 0, 1, 2, 3 (one entry per page touched so far)
Workload repeatedly accesses same offset (0x001) across 5 pages (strided access), but only 4 TLB entries
What will TLB contents be over time?
How will TLB perform?
Always misses! 100% miss rate
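A minimal simulation sketch of this situation (parameters assumed from the slide: a 4-entry TLB and a workload that sweeps pages 0-4 in order, over and over), showing that LRU never hits after the cold start:

#include <stdio.h>

#define ENTRIES 4
#define PAGES   5        /* workload sweeps pages 0..4 in order, repeatedly */
#define ROUNDS  100

int main(void) {
    int vpn[ENTRIES], last_use[ENTRIES];        /* cached VPN and LRU timestamp per slot */
    for (int i = 0; i < ENTRIES; i++) { vpn[i] = -1; last_use[i] = -1; }

    int hits = 0, accesses = 0, now = 0;
    for (int r = 0; r < ROUNDS; r++) {
        for (int p = 0; p < PAGES; p++, now++, accesses++) {
            int slot = -1;
            for (int i = 0; i < ENTRIES; i++)   /* fully associative lookup */
                if (vpn[i] == p) { slot = i; break; }
            if (slot >= 0) { hits++; last_use[slot] = now; continue; }

            int victim = 0;                     /* miss: evict least-recently-used slot */
            for (int i = 1; i < ENTRIES; i++)
                if (last_use[i] < last_use[victim]) victim = i;
            vpn[victim] = p;
            last_use[victim] = now;
        }
    }
    /* LRU always evicts exactly the page that will be needed next,
     * so this prints 0 hits out of 500 accesses. */
    printf("LRU hits: %d / %d\n", hits, accesses);
    return 0;
}

Replacing the victim choice with a random slot (e.g., rand() % ENTRIES) lets some entries survive a full sweep of the five pages, so a fraction of accesses hit; that is the sense in which random beats the "smart" policy on this pattern.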
TLB Replacement policies
LRU: Evict Least-Recently Used TLB slot when needed (More on LRU later in policies next week)
Random: Evict randomly-chosen entry
Sometimes random is better than a “smart” policy!
TLB Performance for Code?
• Code tends to be relatively sequential
– Branches for if statements and while loops often stay in the same page
– Good spatial locality
• Procedure calls depend on temporal locality
• Code usually has reasonable TLB performance
Context Switches
What happens if a process uses TLB entry from another process?
Solutions:
1. Flush TLB on each context switch
Poor performance: lose all recently cached translations, increases miss rate
2. Track which TLB entries are for which process
– Address Space Identifier (ASID) – similar to PID
– Tag each TLB entry with an 8-bit ASID; how many ASIDs do we get?
– Must match ASID for TLB entry to be used
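A minimal sketch of the ASID check, extending the toy TLB model above (field names and the software current_asid variable are assumptions for illustration; in hardware the current ASID lives in a privileged register set by the OS on each context switch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16

struct tlb_entry {
    bool     valid;
    uint8_t  asid;                  /* 8-bit ASID: at most 256 address spaces at once */
    uint64_t vpn;
    uint64_t pfn;
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint8_t current_asid;        /* set by the OS on each context switch */

/* An entry is usable only if both the VPN and the ASID match, so one
 * process can never consume another process's cached translation. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;
            return true;
        }
    }
    return false;
}

int main(void) {
    tlb[0] = (struct tlb_entry){ .valid = true, .asid = 11, .vpn = 1, .pfn = 5 };
    tlb[1] = (struct tlb_entry){ .valid = true, .asid = 12, .vpn = 1, .pfn = 2 };
    current_asid = 11;
    uint64_t pfn = 0;
    if (tlb_lookup(1, &pfn))        /* matches only the ASID-11 entry: PFN 5 */
        printf("ASID %d, VPN 1 -> PFN %llu\n", current_asid, (unsigned long long)pfn);
    return 0;
}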
TLB Example with ASID
[Figure: P1's page table (ASID 11) and P2's page table (ASID 12) both sit in physical memory; P1's maps VPNs 0-3 to PFNs 1, 5, 6, 2, and P2's maps VPN 0 to PFN 4 and VPN 2 to PFN 3. The same virtual access, load 0x1444, translates differently depending on the current ASID: under ASID 11 it uses VPN 1 → PPN 5 (physical load 0x5444); under ASID 12 it uses VPN 1 → PPN 2 (physical load 0x2444).]
TLB contents:
Valid VPN PPN ASID
0     3   9   11
1     1   5   11
1     1   2   12
1     0   1   11
TLB Performance
Context switches are expensive
Even with ASID, other processes “pollute” TLB
– Discard process A’s TLB entries for process B’s entries
Architectures can have multiple TLBs
– 1 TLB for data, 1 TLB for instructions
– 1 TLB for regular pages, 1 TLB for “superpages”
HW and OS Roles
Who handles TLB hit? H/W
Who handles TLB miss? H/W or OS
H/W (“hardware-managed TLB”):
– H/W must know where page tables are stored in memory
– CR3 register on x86
– Page table structure fixed and agreed upon between HW and OS
– HW “walks” known page table structure and fills TLB
HW AND OS ROLES
Who handles TLB MISS? H/W or OS?
OS: “Software-managed TLB”
– CPU traps into OS upon TLB miss
– OS interprets page tables as it chooses; any data structure possible
– Modifying TLB entries is a privileged instruction
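A minimal sketch of what a software-managed miss handler might do (the toy linear page table, the tlb_insert() stand-in for the privileged TLB-write instruction, and the example in main() are assumptions for illustration, not a real architecture's interface):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12               /* 4 KB pages assumed */
#define NUM_VPAGES 16               /* toy address space: 16 virtual pages */

/* Toy per-process state. The OS may use any page-table structure it likes
 * here, since only OS code ever walks it. */
struct proc {
    uint64_t pagetable[NUM_VPAGES]; /* VPN -> PFN, or 0 if unmapped */
    uint8_t  asid;
};

/* Stand-in for the privileged "write a TLB entry" instruction. */
static void tlb_insert(uint64_t vpn, uint64_t pfn, uint8_t asid) {
    printf("TLB insert: ASID %d, VPN %llu -> PFN %llu\n",
           asid, (unsigned long long)vpn, (unsigned long long)pfn);
}

/* Called from the trap handler when hardware reports a TLB miss.
 * On return from the trap, hardware retries the instruction, which now hits. */
static int tlb_miss_handler(struct proc *p, uint64_t faulting_va) {
    uint64_t vpn = faulting_va >> PAGE_SHIFT;
    if (vpn >= NUM_VPAGES || p->pagetable[vpn] == 0)
        return -1;                  /* no valid mapping: OS raises a fault */
    tlb_insert(vpn, p->pagetable[vpn], p->asid);
    return 0;
}

int main(void) {
    struct proc p = { .asid = 11 };
    p.pagetable[1] = 5;             /* map VPN 1 -> PFN 5 */
    tlb_miss_handler(&p, 0x1444);   /* miss on VPN 1: handler fills the TLB */
    return 0;
}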
Characteristics of TLBs
Pages are great, but accessing page tables for every memory access is slow
Cache recent page translations → TLB
– Hardware performs TLB lookup on every memory access
TLB performance depends strongly on workload
– Sequential workloads perform well
– Workloads with temporal locality can perform well (if enough TLB entries)
In different systems, hardware or OS handles TLB misses
TLBs increase cost of context switches
– Flush TLB on every context switch, or
– Add ASID to every TLB entry
Disadvantages of Paging
Additional memory reference to page table → Very inefficient
– Page table must be stored in memory
– MMU stores only base address of page table
Storage for page tables may be substantial
– Simple page table: Requires PTE for all pages in address space
Entry needed even if page not allocated?