MEMORY: TLBS
Andrea Arpaci-Dusseau CS 537, Fall 2019
ADMINISTRIVIA
– Project 1 graded out of 120 points
– Details in 3 files in your handin/p1 directory
– Contact TA (yifan) if significant grading discrepancy
– Project 2 due last night
– Project 3 available: Shell in Linux
– Subject of tomorrow’s discussion sections (fork() and exec())
– Midterm 1: Thursday, Oct 10th from 7:30-9:30pm
– Fill out Exam Conflict form in Canvas by Thursday Sept 26
– If you filled out the form for the old date of Oct 9th and still have a conflict, fill it out again
– Next discussion sections: lecture review? Sample exam to be posted soon
– Canvas Homeworks
– Due each Tuesday and Thursday
AGENDA / LEARNING OUTCOMES
Memory virtualization
Review paging…
How can page translations be made faster?
What is the basic idea of a TLB (Translation Lookaside Buffer)? What types of workloads perform well with TLBs?
How do TLBs interact with context-switches?
RECAP
FRAGMENTATION
Definition: Free memory that can’t be usefully allocated
Types of fragmentation
External: Visible to allocator (e.g., OS)
Internal: Visible to requester
[Figure: an allocated region shown as partly useful data and partly free space the requester cannot use - internal fragmentation]
Paging
Goal: Eliminate requirement that address space is contiguous
Eliminate external fragmentation
Grow segments as needed
– Don’t allocate pages that aren’t needed
Idea:
Divide address spaces and physical memory into fixed-sized pages
Size: 2^n; Example: 4KB
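For example (assuming 32-bit virtual addresses with the 4KB pages above): the low 12 bits of an address are the offset within a page and the remaining 20 bits are the virtual page number, giving 2^20 virtual pages per address space.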
[Figure: logical view - Process 1, 2, and 3 each see a contiguous address space of pages; physical view - those pages are scattered across physical memory frames]
PAGETABLES
What is a good data structure?
Simple solution: Linear page table aka array
[Figure: the linear page table is an array of PTEs indexed by VPN (VPN 0 at the base, one entry per virtual page), stored in physical memory starting at the address held in the ptbr; 4 KB pages assumed]
REVIEW: PAGING
[Figure: worked translation example. Physical memory frames hold the two page tables plus pages belonging to P1 and P2 (PT, PT, P1, P2, P2, P1, P1, P2). P1's page table maps VPNs 0-3 to PFNs 1, 5, 6, 2; P2's maps VPN 0 to PFN 4 and VPN 2 to PFN 3. Each virtual load (e.g., load 0x0000, load 0x1444) turns into two physical accesses: a load of the PTE from the page table, then a load of the data from the translated physical address.]
What do we need to know?
– Location of page table in memory (ptbr)
– Size of each page table entry (assume 8 bytes)
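For example (VPN chosen just for illustration): with 8-byte entries, the PTE for VPN 3 lives at physical address ptbr + 3 × 8 = ptbr + 0x18.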
1 minute chat: PAGING TRANSLATION STEPS
For each memory reference:
1. Extract VPN (virt page num) from VA (virt addr) (cheap)
2. Calculate addr of PTE (page table entry) (cheap)
3. Read PTE from memory (expensive)
4. Extract PFN (page frame num) (cheap)
5. Build PA (phys addr) (cheap)
6. Read contents of PA from memory into register (expensive)
Which steps are cheap and which are expensive? Steps 3 and 6 each require an extra memory access; the rest is simple arithmetic.
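A minimal sketch of these six steps in C, against a toy physical memory (the phys_mem array, phys_read64() helper, and the example mapping in main() are assumptions for illustration; the simplified PTE here holds only the PFN, with no valid or protection bits):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12                        /* 4 KB pages */
#define OFFSET_MASK ((1ull << PAGE_SHIFT) - 1)
#define PTE_SIZE    8                         /* assume 8-byte PTEs */

static uint8_t  phys_mem[1 << 15];            /* toy 32 KB physical memory */
static uint64_t ptbr = 0x0000;                /* page table base (assumed) */

/* Each call models one physical memory access (the expensive part). */
static uint64_t phys_read64(uint64_t pa) {
    return *(uint64_t *)&phys_mem[pa];
}

/* Translate a virtual address and load from it: the six steps above. */
static uint64_t vload(uint64_t va) {
    uint64_t vpn      = va >> PAGE_SHIFT;             /* 1. extract VPN    (cheap)     */
    uint64_t pte_addr = ptbr + vpn * PTE_SIZE;        /* 2. addr of PTE    (cheap)     */
    uint64_t pte      = phys_read64(pte_addr);        /* 3. read PTE       (expensive) */
    uint64_t pfn      = pte;                          /* 4. extract PFN    (cheap)     */
    uint64_t pa       = (pfn << PAGE_SHIFT) | (va & OFFSET_MASK);  /* 5. build PA (cheap) */
    return phys_read64(pa);                           /* 6. read data      (expensive) */
}

int main(void) {
    *(uint64_t *)&phys_mem[0] = 5;                    /* PTE for VPN 0: PFN 5 (example) */
    *(uint64_t *)&phys_mem[5 << PAGE_SHIFT] = 42;     /* data word in frame 5 */
    printf("%llu\n", (unsigned long long)vload(0x0000));  /* prints 42 */
    return 0;
}

Steps 3 and 6 are each a call to phys_read64(), i.e., a real memory access; the rest is register arithmetic, which is why caching the translation pays off.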
Disadvantages of Paging
Additional memory reference to page table → Very inefficient
– Page table must be stored in memory
– MMU stores only base address of page table
Storage for page tables may be substantial
– Simple page table: Requires PTE for all pages in address space
Entry needed even if page not allocated?
Example: Array Iterator
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i];
}
What virtual addresses?
Observation:
Repeatedly access same PTE because program repeatedly accesses same virtual page
Strategy: Cache Page Translations
[Figure: the CPU holds a small translation cache between the processor and the memory interconnect; the page table itself stays in RAM]
TLB: Translation Lookaside Buffer
TLB: TRANSLATION LOOKASIDE BUFFER
TLB Entry:
– Tag (virtual page number)
– Physical page number (page table entry)
– Protection bits (rwx)
TLB Organization: Fully associative
– Any given translation can be anywhere in the TLB
– Hardware will search the entire TLB in parallel
[Figure: the TLB drawn as a single row of entries (A, B, C, D, E, ..., L, M, N, O, P); a translation may sit in any slot]
Operations:
Lookup and Replacement
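A minimal sketch of a fully associative lookup over a small software model of the TLB (the entry count, field names, and tlb_lookup() helper are assumptions for illustration, not real hardware):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16              /* assumed size; real TLBs are often 32-128 entries */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;                   /* tag: virtual page number */
    uint64_t pfn;                   /* physical page number (from the PTE) */
    uint8_t  prot;                  /* protection bits: rwx */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative: the VPN is compared against every entry.
 * Hardware does the comparisons in parallel; software has to loop. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;  /* hit */
            return true;
        }
    }
    return false;                   /* miss: fall back to the page table walk */
}

int main(void) {
    tlb[3] = (struct tlb_entry){ .valid = true, .vpn = 1, .pfn = 5, .prot = 0x7 };
    uint64_t pfn = 0;
    if (tlb_lookup(1, &pfn))
        printf("TLB hit: VPN 1 -> PFN %llu\n", (unsigned long long)pfn);
    return 0;
}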
int sum = 0;
for (i = 0; i < 2048; i++){
sum += a[i];
}
Assume ‘a’ starts at 0x1000
Ignore instruction fetches and access to ‘i’
Assume following virtual address stream: load 0x1000
load 0x1004
load 0x1008
load 0x100C ...
Array Iterator (w/ TLB)
What will TLB behavior look like?
TLB Accesses: SEQUENTIAL Example
[Figure: physical memory from 0 KB to 28 KB holds the page tables (PT, PT) and frames belonging to P1 and P2; P1's page table, pointed to by the PTBR, maps VPN 1 → PPN 5 and VPN 2 → PPN 4; the CPU's TLB (slots 0-3) starts out empty]
Virtual address stream: load 0x1000, load 0x1004, load 0x1008, load 0x100C, ..., load 0x2000, load 0x2004, ...
Resulting physical accesses:
– load 0x1000 misses in the TLB: first read the PTE (load 0x0004), then the data (load 0x5000)
– load 0x1004, 0x1008, 0x100C, ... are TLB hits: load 0x5004, 0x5008, 0x500C, ... directly
– load 0x2000 misses on the new page: read the PTE (load 0x0008), then the data (load 0x4000)
– load 0x2004, ... are TLB hits again: load 0x4004, ...
TLB contents afterward:
Valid VPN PPN
1     1   5
1     2   4
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i]; }
Calculate miss rate of TLB for data (ignore code and sum):
miss rate = # TLB misses / # TLB lookups
# TLB lookups?
= number of accesses to array a[] = 2048
# TLB misses?
= number of unique pages accessed
= 2048 / (elements of a[] per 4K page)
= 2K / (4KB / sizeof(int)) = 2K / 1K = 2 pages
Miss rate?
2/2048 ≈ 0.1%
Hit rate? (1 – miss rate)
≈ 99.9%
Would hit rate get better or worse with smaller pages?
Worse
Would hit rate get better or worse with more iterations?
Stays the same!
Still miss only the first access to each page: always 1 miss per 1024 accesses
PERFORMANCE OF TLB?
TLB PERFORMANCE
How can system improve TLB performance (hit rate) given fixed number of TLB entries?
Increase page size
Fewer unique page translations needed to access same amount of memory
TLB Reach:
Number of TLB entries * Page Size
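For example (entry count assumed for illustration): a 64-entry TLB with 4 KB pages reaches 64 × 4 KB = 256 KB of memory; with 2 MB superpages the same 64 entries reach 64 × 2 MB = 128 MB.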
TLB PERFORMANCE with Workloads
Sequential array accesses almost always hit in TLB
– Very fast!
What access pattern will be slow?
– Highly random, with no repeat accesses
Workload A
int sum = 0;
for (i=0; i<2048; i++) {
sum += a[i];
}
Workload B
int sum = 0; // large N
srand(1234);
for (i=0; i<1024; i++) {
sum += a[rand() % N];
}
srand(1234);
for (i=0; i<1024; i++) {
sum += a[rand() % N];
}
Note: re-seeding with srand(1234) makes the second loop repeat exactly the same sequence of random indices as the first, so its accesses revisit pages touched just before.
WORKLOAD ACCESS PATTERNS
[Figure: address vs. time plots for the two workloads - Workload A's sequential accesses show spatial locality; Workload B's repeated random accesses show temporal locality]
Workload Locality
Spatial Locality: future access will be to nearby addresses
Temporal Locality: future access will be repeats to the same data as past access
What TLB characteristics are best for each type?
Spatial:
– Access same page repeatedly; need same vpn → ppn translation
– Same TLB entry re-used
Temporal:
– Access same address near in future
– Same TLB entry re-used in near future
– How near in future? How many TLB entries are there?
More entries in TLB -> More likely to contain entries accessed in past
TLB Replacement policies
LRU: Evict Least-Recently-Used TLB slot when needed (more on LRU in policies next week)
Random: Evict a randomly-chosen entry
Which is better?
2-Minute Chat: LRU Troubles
Initial TLB: 4 entries, all invalid (Valid = 0, VPN = ?, PFN = ?)
Virtual addresses: pages 0, 1, 2, 3, 4, accessed in order, repeatedly
TLB after 4 accesses: Valid = 1 for VPN 0, 1, 2, 3 (one entry per page touched so far)
Workload repeatedly accesses same offset (0x001) across 5 pages (strided access), but only 4 TLB entries
What will TLB contents be over time?
How will TLB perform?
Always misses! 100% miss rate
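A minimal simulation sketch of this situation (parameters assumed from the slide: a 4-entry TLB and a workload that sweeps pages 0-4 in order, over and over), showing that LRU never hits after the cold start:

#include <stdio.h>

#define ENTRIES 4
#define PAGES   5        /* workload sweeps pages 0..4 in order, repeatedly */
#define ROUNDS  100

int main(void) {
    int vpn[ENTRIES], last_use[ENTRIES];        /* cached VPN and LRU timestamp per slot */
    for (int i = 0; i < ENTRIES; i++) { vpn[i] = -1; last_use[i] = -1; }

    int hits = 0, accesses = 0, now = 0;
    for (int r = 0; r < ROUNDS; r++) {
        for (int p = 0; p < PAGES; p++, now++, accesses++) {
            int slot = -1;
            for (int i = 0; i < ENTRIES; i++)   /* fully associative lookup */
                if (vpn[i] == p) { slot = i; break; }
            if (slot >= 0) { hits++; last_use[slot] = now; continue; }

            int victim = 0;                     /* miss: evict least-recently-used slot */
            for (int i = 1; i < ENTRIES; i++)
                if (last_use[i] < last_use[victim]) victim = i;
            vpn[victim] = p;
            last_use[victim] = now;
        }
    }
    /* LRU always evicts exactly the page that will be needed next,
     * so this prints 0 hits out of 500 accesses. */
    printf("LRU hits: %d / %d\n", hits, accesses);
    return 0;
}

Replacing the victim choice with a random slot (e.g., rand() % ENTRIES) lets some entries survive a full sweep of the five pages, so a fraction of accesses hit; that is the sense in which random beats the "smart" policy on this pattern.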
TLB Replacement policies
LRU: Evict Least-Recently Used TLB slot when needed (More on LRU later in policies next week)
Random: Evict randomly-chosen entry
Sometimes random is better than a “smart” policy!
TLB Performance for Code?
• Code tends to be relatively sequential
– Branches for if statements and while loops often stay in the same page
– Good spatial locality
• Procedure calls depend on temporal locality
• Code usually has reasonable TLB performance
Context Switches
What happens if a process uses TLB entry from another process?
Solutions:
1. Flush TLB on each context switch
Poor performance: lose all recently cached translations, increases miss rate
2. Track which TLB entries are for which process
– Address Space Identifier (ASID) – similar to PID
– Tag each TLB entry with an 8-bit ASID; how many ASIDs do we get?
– Must match ASID for TLB entry to be used
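A minimal sketch of the ASID check, extending the toy TLB model above (field names and the software current_asid variable are assumptions for illustration; in hardware the current ASID lives in a privileged register set by the OS on each context switch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16

struct tlb_entry {
    bool     valid;
    uint8_t  asid;                  /* 8-bit ASID: at most 256 address spaces at once */
    uint64_t vpn;
    uint64_t pfn;
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint8_t current_asid;        /* set by the OS on each context switch */

/* An entry is usable only if both the VPN and the ASID match, so one
 * process can never consume another process's cached translation. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;
            return true;
        }
    }
    return false;
}

int main(void) {
    tlb[0] = (struct tlb_entry){ .valid = true, .asid = 11, .vpn = 1, .pfn = 5 };
    tlb[1] = (struct tlb_entry){ .valid = true, .asid = 12, .vpn = 1, .pfn = 2 };
    current_asid = 11;
    uint64_t pfn = 0;
    if (tlb_lookup(1, &pfn))        /* matches only the ASID-11 entry: PFN 5 */
        printf("ASID %d, VPN 1 -> PFN %llu\n", current_asid, (unsigned long long)pfn);
    return 0;
}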
TLB Example with ASID
[Figure: P1's page table (ASID 11) and P2's page table (ASID 12) both sit in physical memory; P1's maps VPNs 0-3 to PFNs 1, 5, 6, 2, and P2's maps VPN 0 to PFN 4 and VPN 2 to PFN 3. The same virtual access, load 0x1444, translates differently depending on the current ASID: under ASID 11 it uses VPN 1 → PPN 5 (physical load 0x5444); under ASID 12 it uses VPN 1 → PPN 2 (physical load 0x2444).]
TLB contents:
Valid VPN PPN ASID
0     3   9   11
1     1   5   11
1     1   2   12
1     0   1   11
TLB Performance
Context switches are expensive
Even with ASID, other processes “pollute” TLB
– Discard process A’s TLB entries for process B’s entries
Architectures can have multiple TLBs
– 1 TLB for data, 1 TLB for instructions
– 1 TLB for regular pages, 1 TLB for “superpages”
HW and OS Roles
Who handles TLB hit? H/W
Who handles TLB miss? H/W or OS
H/W (“hardware-managed TLB”):
– H/W must know where page tables are stored in memory
– CR3 register on x86
– Page table structure fixed and agreed upon between HW and OS
– HW “walks” known page table structure and fills TLB
HW AND OS ROLES
Who handles TLB MISS? H/W or OS?
OS: “Software-managed TLB”
– CPU traps into OS upon TLB miss
– OS interprets page tables as it chooses; any data structure possible
– Modifying TLB entries is a privileged instruction
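A minimal sketch of what a software-managed miss handler might do (the toy linear page table, the tlb_insert() stand-in for the privileged TLB-write instruction, and the example in main() are assumptions for illustration, not a real architecture's interface):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12               /* 4 KB pages assumed */
#define NUM_VPAGES 16               /* toy address space: 16 virtual pages */

/* Toy per-process state. The OS may use any page-table structure it likes
 * here, since only OS code ever walks it. */
struct proc {
    uint64_t pagetable[NUM_VPAGES]; /* VPN -> PFN, or 0 if unmapped */
    uint8_t  asid;
};

/* Stand-in for the privileged "write a TLB entry" instruction. */
static void tlb_insert(uint64_t vpn, uint64_t pfn, uint8_t asid) {
    printf("TLB insert: ASID %d, VPN %llu -> PFN %llu\n",
           asid, (unsigned long long)vpn, (unsigned long long)pfn);
}

/* Called from the trap handler when hardware reports a TLB miss.
 * On return from the trap, hardware retries the instruction, which now hits. */
static int tlb_miss_handler(struct proc *p, uint64_t faulting_va) {
    uint64_t vpn = faulting_va >> PAGE_SHIFT;
    if (vpn >= NUM_VPAGES || p->pagetable[vpn] == 0)
        return -1;                  /* no valid mapping: OS raises a fault */
    tlb_insert(vpn, p->pagetable[vpn], p->asid);
    return 0;
}

int main(void) {
    struct proc p = { .asid = 11 };
    p.pagetable[1] = 5;             /* map VPN 1 -> PFN 5 */
    tlb_miss_handler(&p, 0x1444);   /* miss on VPN 1: handler fills the TLB */
    return 0;
}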
Characteristics of TLBs
Pages are great, but accessing page tables for every memory access is slow
Cache recent page translations → TLB
– Hardware performs TLB lookup on every memory access
TLB performance depends strongly on workload
– Sequential workloads perform well
– Workloads with temporal locality can perform well (if enough TLB entries)
In different systems, hardware or OS handles TLB misses
TLBs increase cost of context switches
– Flush TLB on every context switch, or
– Add ASID to every TLB entry
Disadvantages of Paging
Additional memory reference to page table → Very inefficient
– Page table must be stored in memory
– MMU stores only base address of page table
Storage for page tables may be substantial
– Simple page table: Requires PTE for all pages in address space
Entry needed even if page not allocated?