Memory and Caches Part 3
Stewart Smith, Digital Systems Design 4
This Section
• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls
Virtual Memory
• Use main memory as a “cache” for secondary (disk) storage
‣ Managed jointly by CPU hardware and the operating system (OS)
• Programs share main memory
‣ Each gets a private virtual address space holding its frequently
used code and data
‣ Protected from other programs
• CPU and OS translate virtual addresses to physical
addresses
‣ VM “block” is called a page
‣ VM translation “miss” is called a page fault
Address Translation
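As a worked example (assuming the common 4 KiB page size): a 32-bit virtual address splits into a 20-bit virtual page number and a 12-bit page offset. Translation replaces the virtual page number with a physical page number; the offset passes through unchanged, so physical address = physical page number × 4096 + offset.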
Page Fault Penalty
• On page fault, the page must be fetched
from disk
‣ Takes millions of clock cycles
‣ Handled by OS code
• Try to minimise page fault rate
‣ Fully associative placement
‣ Smart replacement algorithms
Page Tables
• Stores placement information
‣ Array of page table entries (PTEs), indexed by
virtual page number
‣ Page table register in CPU points to page table in
physical memory
• If page is present in memory (valid = 1)
‣ PTE stores the physical page number
‣ Plus other status bits (referenced, dirty, …)
• If page is not present (valid = 0)
‣ PTE refers to location in swap space on disk (see the translation sketch below)
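A minimal sketch in C of translation through a single-level page table, assuming 4 KiB pages and a 32-bit virtual address; the pte_t layout, the page_table pointer, and the page_fault() hook are hypothetical names for this sketch, not any particular OS's structures.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                 /* 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Hypothetical PTE layout: one entry per virtual page. */
typedef struct {
    uint32_t ppn;        /* physical page number (when valid = 1)   */
    uint32_t disk_block; /* swap-space location  (when valid = 0)   */
    bool     valid;      /* page present in physical memory?        */
    bool     dirty;      /* page modified since it was loaded?      */
    bool     referenced; /* page accessed recently? (for LRU)       */
} pte_t;

extern pte_t *page_table;                 /* pointed to by the page table register */
extern uint32_t page_fault(uint32_t vpn); /* OS handler; returns the new PPN */

uint32_t translate(uint32_t vaddr, bool is_write)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;     /* index into the page table  */
    uint32_t offset = vaddr & (PAGE_SIZE - 1); /* unchanged by translation   */
    pte_t *pte = &page_table[vpn];

    if (!pte->valid) {                /* valid = 0: page lives on disk */
        pte->ppn   = page_fault(vpn); /* OS fetches page, updates PTE  */
        pte->valid = true;
    }

    pte->referenced = true;           /* status bits maintained on access */
    if (is_write)
        pte->dirty = true;

    return (pte->ppn << PAGE_SHIFT) | offset;
}
```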
Translation Using a Page Table
Mapping Pages to Storage
Replacement and Writes
• To reduce page fault rate, prefer least-recently
used (LRU) replacement
‣ Reference bit/use bit in PTE set to 1 on access to page
‣ Periodically cleared to 0 by OS
‣ A page with reference bit = 0 has not been used recently (see the sketch below)
• Disk writes take millions of cycles
‣ Write through is impractical
‣ Must use write-back
‣ Dirty bit in PTE set when page is written
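A minimal sketch of the reference-bit approximation of LRU described above, reusing the hypothetical pte_t from the previous sketch; real systems use more refined scans (e.g. a clock algorithm), so this shows only the idea.

```c
/* Periodically called by the OS: clear reference bits so that pages
 * untouched before the next sweep stand out as replacement candidates. */
void clear_reference_bits(pte_t *table, uint32_t num_pages)
{
    for (uint32_t vpn = 0; vpn < num_pages; vpn++)
        table[vpn].referenced = false;
}

/* On a page fault, prefer a resident page whose reference bit is still 0:
 * it has not been used since the last sweep. */
uint32_t choose_victim(pte_t *table, uint32_t num_pages)
{
    for (uint32_t vpn = 0; vpn < num_pages; vpn++)
        if (table[vpn].valid && !table[vpn].referenced)
            return vpn;
    return 0;   /* fallback: everything was referenced recently */
}
```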
Fast Translation Using a Translation Lookaside Buffer (TLB)
• Address translation would appear to require
extra memory references
‣ One to access the Page Table Entry (PTE)
‣ Then another for the actual memory access
• But access to page tables has good locality
‣ So, use a fast cache of PTEs within the CPU
‣ Called a Translation Look-aside Buffer (TLB)
‣ Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
‣ Misses could be handled by hardware or software (see the lookup sketch below)
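A sketch of the TLB check that precedes a full page-table walk, assuming a small fully associative TLB and reusing PAGE_SHIFT and PAGE_SIZE from the earlier sketch; tlb_entry_t and tlb_lookup are illustrative names.

```c
#define TLB_ENTRIES 32               /* within the typical 16-512 range */

typedef struct {
    uint32_t vpn;                    /* virtual page number (the tag) */
    uint32_t ppn;                    /* cached translation            */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the caller walks the page
 * table (hardware or software) and refills an entry. */
bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {      /* fully associative:  */
        if (tlb[i].valid && tlb[i].vpn == vpn) { /* compare every entry */
            *paddr = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;                                /* miss: walk the page table */
}
```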
Fast Translation Using a TLB
Intrinsity FastMATH TLB
FastMATH TLB process
TLB Misses
• If page is in memory
‣ Load the PTE from memory and retry
‣ Could be handled in hardware
– Can get complex for more complicated page table
structures
‣ Or in software
– Raise a special exception, with optimised handler
• If page is not in memory (page fault)
‣ OS handles fetching the page and updating the page
table
‣ Then restart the faulting instruction
Page Fault Handler
• Use virtual address to find PTE
• Locate page on disk
• Choose page to replace
‣ If dirty, write to disk first
• Read page into memory and update page
table
• Make process runnable again
‣ Restart from faulting instruction (the whole sequence is sketched below)
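The handler steps above, as a hedged sketch; write_page_to_disk and read_page_from_disk stand in for OS disk I/O machinery and are not real APIs, and the PTE structures and choose_victim come from the earlier sketches.

```c
#define NUM_PAGES (1u << 20)          /* 32-bit vaddr, 4 KiB pages */

extern void write_page_to_disk(pte_t *victim);                  /* hypothetical */
extern void read_page_from_disk(uint32_t disk_block, uint32_t ppn);

void handle_page_fault(uint32_t vaddr)
{
    pte_t *pte = &page_table[vaddr >> PAGE_SHIFT]; /* 1. use virtual address to find PTE */
    uint32_t disk_loc = pte->disk_block;           /* 2. locate page on disk             */

    uint32_t victim_vpn = choose_victim(page_table, NUM_PAGES); /* 3. choose page to replace */
    pte_t *victim = &page_table[victim_vpn];
    if (victim->dirty)
        write_page_to_disk(victim);                /*    if dirty, write to disk first   */
    victim->valid = false;

    uint32_t ppn = victim->ppn;                    /* reuse the freed physical frame     */
    read_page_from_disk(disk_loc, ppn);            /* 4. read page into memory...        */
    pte->ppn        = ppn;                         /*    ...and update the page table    */
    pte->valid      = true;
    pte->dirty      = false;
    pte->referenced = true;
    /* 5. make the process runnable again; it restarts the faulting instruction */
}
```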
Memory Protection
• Different tasks can share parts of their virtual
address spaces
‣ But need to protect against errant access
‣ Requires OS assistance
• Hardware support for OS protection
‣ Privileged supervisor mode (aka kernel mode)
‣ Privileged instructions
‣ Page tables and other state information only
accessible in supervisor mode
‣ System call exception (e.g., syscall in MIPS)
Summary – Virtual Memory
• Virtual memory extends hierarchy to disk storage
‣ Page tables hold translations to physical addresses
‣ TLB is a cache for virtual memory translation
• Key point of virtual memory is to allow many programs to share memory with protection!
• Secondary aspect is to allow an address space
that is larger than physical memory.
Multilevel Cache Example
[Figure: Intel Nehalem processor with labelled sections. Four SMT CPU cores (Core 0 to Core 3), each with a 32 KiB L1 instruction cache, a 32 KiB L1 data cache, and a private 256 KiB L2 cache; each core holds a 2 MiB slice of the shared 8 MiB L3 cache. Per-core write buffers connect to the North Bridge and communication switch.]
Multilevel Cache Example
ARM Cortex-A8 uni-core processor
2-Level TLB Organisation
• ARM Cortex-A8
‣ Virtual address: 32 bits; physical address: 32 bits
‣ Page sizes: variable (4, 16, 64 KiB; 1, 16 MiB)
‣ One TLB for instructions and one for data
‣ Both TLBs fully associative, 32 entries, round-robin replacement
‣ TLB misses handled in hardware
• Intel Core i7 (Nehalem)
‣ Virtual address: 48 bits; physical address: 44 bits
‣ Page sizes: variable (4 KiB, 2/4 MiB)
‣ One TLB for instructions and one for data, per core
‣ Both L1 TLBs four-way set associative, LRU replacement
‣ L1 I-TLB: 128 entries for small pages, 7 per thread for large pages
‣ L1 D-TLB: 64 entries for small pages, 32 for large pages
‣ L2 TLB: four-way set associative, LRU replacement, 512 entries
‣ TLB misses handled in hardware
Miss Penalty Reduction
• Return requested word first
‣ Then back-fill rest of block
• Non-blocking miss processing
‣ Hit under miss: allow hits to proceed
‣ Miss under miss: allow multiple outstanding misses
• Hardware prefetch: instructions and data
• Bank interleaved L1 D-caches
‣ Multiple concurrent accesses per cycle
Summary of the Memory Hierarchy
• Common principles apply at all levels of the
memory hierarchy
‣ Based on notions of caching
• At each level in the hierarchy
‣ Block placement (where do we put a block?)
‣ Finding a block (how do we find it?)
‣ Replacement on a miss (which block is replaced?)
‣ Write policy (what happens on a write?)
Block Placement
• Determined by associativity
‣ Direct mapped (1-way associative)
– One choice for placement
‣ n-way set associative
– n choices within a set
‣ Fully associative
– Any location
• Higher associativity reduces miss rate
‣ Increases complexity, cost, and access time
Finding a Block
• Direct mapped
‣ Location method: index
‣ Tag comparisons: 1
• n-way set associative
‣ Location method: set index, then search entries within the set
‣ Tag comparisons: n
• Fully associative
‣ Location method: search all entries (tag comparisons: #entries), or a full lookup table (tag comparisons: 0)
A sketch of these lookups follows.
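A sketch of the lookup methods above, parameterised by associativity; the geometry and names are illustrative. Setting WAYS to 1 gives direct mapped, and setting NUM_SETS to 1 gives fully associative.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_OFFSET_BITS 4          /* example: 16-byte blocks */
#define NUM_SETS          64         /* example geometry        */
#define WAYS              4          /* n-way set associative   */

typedef struct {
    uint32_t tag;
    bool     valid;
} line_t;

static line_t cache[NUM_SETS][WAYS];

/* The set index selects one set; the n entries of that set are then
 * searched by tag (n parallel comparators in hardware, a loop here). */
bool cache_lookup(uint32_t addr)
{
    uint32_t block = addr >> BLOCK_OFFSET_BITS;
    uint32_t set   = block % NUM_SETS;   /* direct mapped: WAYS == 1      */
    uint32_t tag   = block / NUM_SETS;   /* fully assoc.: NUM_SETS == 1   */

    for (int way = 0; way < WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;                 /* hit  */
    return false;                        /* miss */
}
```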
Replacement
• Choice of entry to replace on a miss
‣ Least recently used (LRU)
– Complex and costly hardware for high associativity
‣ Random
– Close to LRU, easier to implement
• Virtual memory
‣ LRU approximation with hardware support
Write Policy
• Write-through
‣ Update both upper and lower levels
‣ Simplifies replacement, probably requires a write
buffer
• Write-back
‣ Update upper level only
‣ Update lower level when block is replaced
‣ Need to keep more state
• Virtual memory
‣ Only write-back is feasible, given disk write latency (both policies are sketched below)
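A sketch contrasting the two policies on a write hit; wline_t and lower_level_write are illustrative names, with the lower-level write standing in for the write buffer path.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t data;    /* one word, for brevity                */
    bool     dirty;   /* write-back only: block modified?     */
} wline_t;

/* Hypothetical lower-level write, e.g. via a write buffer. */
extern void lower_level_write(uint32_t addr, uint32_t value);

/* Write-through: update both the cache line and the lower level. */
void write_through(wline_t *line, uint32_t addr, uint32_t value)
{
    line->data = value;
    lower_level_write(addr, value);  /* usually buffered, not stalled on */
}

/* Write-back: update the cache only and mark the block dirty; the
 * lower level sees the data when the block is eventually replaced. */
void write_back(wline_t *line, uint32_t value)
{
    line->data  = value;
    line->dirty = true;              /* extra state the cache must keep */
}

/* On replacement of a dirty block (write-back policy): */
void evict(wline_t *line, uint32_t addr)
{
    if (line->dirty)
        lower_level_write(addr, line->data);
    line->dirty = false;
}
```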
Sources of Misses
• Compulsory misses (aka cold start misses)
‣ First access to a block
• Capacity misses
‣ Due to finite cache size
‣ A replaced block is later accessed again
• Conflict misses (aka collision misses)
‣ In a non-fully associative cache
‣ Due to competition for entries in a set
‣ Would not occur in a fully associative cache of the
same total size
Cache Design Trade-offs
• Increase cache size
‣ Decreases capacity misses
‣ May increase access time
• Increase associativity
‣ Decreases conflict misses
‣ May increase access time
• Increase block size
‣ Decreases compulsory misses
‣ Increases miss penalty; for very large block sizes, may increase miss rate due to pollution
Cache Control
• Example cache characteristics
‣ Direct-mapped, write-back, write allocate
‣ Block size: 4 words (16 bytes)
‣ Cache size: 16 KiB (1024 blocks)
‣ 32-bit, byte addresses
‣ Valid bit and dirty bit per block
‣ Blocking cache
– CPU waits until access is complete
• Address fields (32-bit byte address): Tag = bits 31–14 (18 bits), Index = bits 13–4 (10 bits), Offset = bits 3–0 (4 bits); the split is sketched below
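A sketch of the field split for this exact geometry (1024 blocks of 16 bytes); split_address is an illustrative helper, not part of any API.

```c
#include <stdint.h>

/* Example cache: direct mapped, 1024 blocks x 16 bytes = 16 KiB.   */
/* 32-bit byte address = | tag 31-14 | index 13-4 | offset 3-0 |    */
typedef struct {
    uint32_t tag;     /* 18 bits                                  */
    uint32_t index;   /* 10 bits: selects one of 1024 blocks      */
    uint32_t offset;  /*  4 bits: byte within the 16-byte block   */
} fields_t;

fields_t split_address(uint32_t addr)
{
    fields_t f;
    f.offset = addr & 0xF;            /* bits 3-0   */
    f.index  = (addr >> 4) & 0x3FF;   /* bits 13-4  */
    f.tag    = addr >> 14;            /* bits 31-14 */
    return f;
}
```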
Interface Signals
[Figure: CPU ↔ Cache ↔ Memory interface signals]
• CPU to cache: Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
• Cache to memory: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready
‣ Memory takes multiple cycles per access
Finite State Machines
• Use an FSM to sequence control steps
• Set of states, transition on each clock edge
‣ State values are binary encoded
‣ Current state stored in a register
‣ Next state = fn(current state, current inputs)
• Control output signals = fo(current state)
Cache Controller FSM
Could partition into separate states to reduce clock cycle time; one conventional state decomposition is sketched below.
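A sketch of one conventional next-state function for a blocking, write-back, write-allocate cache controller (states Idle, Compare Tag, Write-Back, Allocate); the figure on the original slide may differ in detail, so treat the state names and inputs as assumptions.

```c
#include <stdbool.h>

/* FSM sketch: next state = fn(current state, current inputs). */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } state_t;

state_t next_state(state_t s, bool cpu_req, bool hit,
                   bool dirty, bool mem_ready)
{
    switch (s) {
    case IDLE:                        /* wait for a valid CPU request       */
        return cpu_req ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:                 /* hit: done; miss: evict or fetch    */
        if (hit) return IDLE;
        return dirty ? WRITE_BACK : ALLOCATE;
    case WRITE_BACK:                  /* write old (dirty) block to memory  */
        return mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                    /* fetch new block, then retry access */
        return mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}
```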
Pitfalls
• Byte vs. word addressing
‣ Example: 32-byte direct-mapped cache,
4-byte blocks
– Byte 36 maps to block 1 ((36 div 4) mod 8 = 1)
– Word 36 maps to block 4 (36 mod 8 = 4)
• Ignoring memory system effects when writing
or generating code
‣ Example: iterating over rows vs. columns of arrays (see the sketch below)
‣ Large strides result in poor locality
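A sketch of the loop-order pitfall: both functions compute the same sum, but C stores arrays row-major, so the column-order version strides N doubles between accesses and uses only one element per fetched cache block.

```c
#define N 1024
double a[N][N];

double sum_row_order(void)       /* good: consecutive addresses,  */
{                                /* each cache block fully used   */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_order(void)    /* poor: stride of N doubles,    */
{                                /* one element used per block    */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```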
Memory and Caches, A Summary
• Fast memories are small, large memories are slow
‣ We really want fast, large memories ☹
‣ Caching gives this illusion ☺
• Principle of locality
‣ Programs use a small part of their memory space
frequently
• Memory hierarchy
‣ L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
• Memory system design is critical for
multiprocessors