
Digital Systems Design 4: Memory & Caches, Part 3
Stewart Smith

This Section
• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls

Virtual Memory
• Use main memory as a “cache” for secondary (disk) storage
‣ Managed jointly by CPU hardware and the operating system (OS)
• Programs share main memory
‣ Each gets a private virtual address space holding its frequently used code and data
‣ Protected from other programs
• CPU and OS translate virtual addresses to physical addresses
‣ A VM “block” is called a page
‣ A VM translation “miss” is called a page fault

Address Translation
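
A minimal sketch of the address arithmetic behind translation, assuming 4 KiB pages (a 12-bit page offset) and a hypothetical translate_vpn() page-table lookup: the page offset passes through unchanged, while the virtual page number is replaced by a physical page number.

#include <stdint.h>

#define PAGE_SHIFT 12                            /* 4 KiB pages: offset = low 12 bits */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

extern uint32_t translate_vpn(uint32_t vpn);     /* hypothetical page-table lookup */

uint32_t virtual_to_physical(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = vaddr & PAGE_OFFSET_MASK;  /* unchanged by translation */
    uint32_t ppn    = translate_vpn(vpn);        /* physical page number */
    return (ppn << PAGE_SHIFT) | offset;
}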

Page Fault Penalty

• On a page fault, the page must be fetched from disk
‣ Takes millions of clock cycles
‣ Handled by OS code
• Try to minimise the page fault rate
‣ Fully associative placement
‣ Smart replacement algorithms

Page Tables
• Stores placement information
‣ Array of page table entries (PTEs), indexed by virtual page number
‣ Page table register in CPU points to page table in physical memory
• If page is present in memory (valid = 1)
‣ PTE stores the physical page number
‣ Plus other status bits (referenced, dirty, …)
• If page is not present (valid = 0)
‣ PTE refers to location in swap space on disk

Translation Using a Page Table

Mapping Pages to Storage

Replacement and Writes





• To reduce the page fault rate, prefer least-recently used (LRU) replacement
‣ Reference bit (use bit) in PTE set to 1 on access to page
‣ Periodically cleared to 0 by OS
‣ A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
‣ Write-through is impractical
‣ Must use write-back
‣ Dirty bit in PTE set when page is written
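
A sketch of how the OS can use the reference bits to approximate LRU (struct and function names are illustrative): the bits are cleared periodically, so any page whose bit is still 0 when a victim is needed has not been touched since the last sweep.

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid; bool referenced; } page_info_t;

/* Periodic OS sweep (e.g. on a timer tick): clear all reference bits. */
void clear_reference_bits(page_info_t *pt, uint32_t npages)
{
    for (uint32_t i = 0; i < npages; i++)
        pt[i].referenced = false;
}

/* Pick a victim: prefer a valid page not referenced since the last sweep. */
uint32_t choose_victim(const page_info_t *pt, uint32_t npages)
{
    for (uint32_t i = 0; i < npages; i++)
        if (pt[i].valid && !pt[i].referenced)
            return i;
    return 0;   /* fallback: every page was recently used */
}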

Fast Translation Using a Translation Lookaside Buffer (TLB)







• Address translation would appear to require extra memory references
‣ One to access the page table entry (PTE)
‣ Then another for the actual memory access
• But access to page tables has good locality
‣ So, use a fast cache of PTEs within the CPU
‣ Called a translation look-aside buffer (TLB)
• Typical: 16–512 PTEs, 0.5–1 cycle for a hit, 10–100 cycles for a miss, 0.01%–1% miss rate
• Misses could be handled by hardware or software
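
A software model of a small fully associative TLB (a real TLB compares all tags in parallel in hardware; the sizes and names here are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 32

typedef struct {
    bool     valid;
    uint32_t vpn;   /* tag: virtual page number */
    uint32_t ppn;   /* data: physical page number */
} tlb_entry_t;

tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a hit and writes the translation to *ppn;
 * on a miss, the PTE must be fetched from the page table instead. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {   /* done in parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;                      /* TLB hit */
        }
    }
    return false;                             /* TLB miss */
}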

Fast Translation Using a TLB

Intrinsity FastMATH TLB

FastMATH TLB process

TLB Misses

• If page is in memory
‣ Load the PTE from memory and retry
‣ Could be handled in hardware
– Can get complex for more complicated page table structures
‣ Or in software
– Raise a special exception, with an optimised handler
• If page is not in memory (page fault)
‣ OS handles fetching the page and updating the page table
‣ Then restart the faulting instruction

Page Fault Handler
• Use virtual address to find PTE
• Locate page on disk
• Choose page to replace
‣ If dirty, write to disk first
• Read page into memory and update page table
• Make process runnable again
‣ Restart from faulting instruction
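
The same steps as a C sketch; every helper here is hypothetical, standing in for real kernel machinery:

#include <stdint.h>

extern uint32_t find_disk_block(uint32_t vpn);            /* locate page on disk */
extern uint32_t choose_victim_page(void);                 /* replacement policy */
extern int      is_dirty(uint32_t ppn);
extern void     write_page_to_disk(uint32_t ppn);
extern void     read_page_from_disk(uint32_t block, uint32_t ppn);
extern void     update_page_table(uint32_t vpn, uint32_t ppn);
extern void     restart_faulting_instruction(void);

void page_fault_handler(uint32_t vaddr)
{
    uint32_t vpn   = vaddr >> 12;             /* PTE index, assuming 4 KiB pages */
    uint32_t block = find_disk_block(vpn);    /* locate page on disk */
    uint32_t ppn   = choose_victim_page();    /* choose page to replace */
    if (is_dirty(ppn))
        write_page_to_disk(ppn);              /* if dirty, write to disk first */
    read_page_from_disk(block, ppn);          /* read page into memory */
    update_page_table(vpn, ppn);              /* update page table */
    restart_faulting_instruction();           /* make process runnable again */
}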

Memory Protection







• Different tasks can share parts of their virtual address spaces
‣ But need to protect against errant access
‣ Requires OS assistance
• Hardware support for OS protection
‣ Privileged supervisor mode (aka kernel mode)
‣ Privileged instructions
‣ Page tables and other state information only accessible in supervisor mode
‣ System call exception (e.g., syscall in MIPS)

Summary – Virtual Memory

• Virtual memory extends the hierarchy to disk storage
‣ Page tables hold translations to physical addresses
‣ The TLB is a cache for virtual memory translations
• The key point of virtual memory is to allow many programs to share memory with protection!
• A secondary aspect is to allow an address space that is larger than physical memory.

Multilevel Cache Example
(Figure: Intel Nehalem processor with labelled sections. Four SMT CPU cores, each with 32 KiB L1 instruction and 32 KiB L1 data caches, write buffers, and a private 256 KiB L2 cache; each core holds a 2 MiB slice of the shared 8 MiB L3 cache. The cores connect to the north bridge and communication switch.)

Multilevel Cache Example
(Figure: ARM Cortex-A8 uni-core processor.)

2-Level TLB Organisation
ARM Cortex-A8:
‣ Virtual address: 32 bits; physical address: 32 bits
‣ Page sizes: variable (4, 16, 64 KiB; 1, 16 MiB)
‣ One TLB for instructions and one TLB for data
‣ Both TLBs fully associative, 32 entries, round-robin replacement
‣ TLB misses handled in hardware

Intel Core i7 (Nehalem):
‣ Virtual address: 48 bits; physical address: 44 bits
‣ Page sizes: variable (4 KiB, 2/4 MiB)
‣ One L1 TLB for instructions and one for data, per core
‣ Both L1 TLBs four-way set associative, LRU replacement
‣ L1 I-TLB: 128 entries for small pages, 7 per thread for large pages
‣ L1 D-TLB: 64 entries for small pages, 32 for large pages
‣ L2 TLB: four-way set associative, LRU replacement, 512 entries
‣ TLB misses handled in hardware

Miss Penalty Reduction



• Return requested word first
‣ Then back-fill rest of block
• Non-blocking miss processing
‣ Hit under miss: allow hits to proceed
‣ Miss under miss: allow multiple outstanding misses
• Hardware prefetch: instructions and data
• Bank-interleaved L1 D-caches
‣ Multiple concurrent accesses per cycle

Summary of the Memory Hierarchy



• Common principles apply at all levels of the memory hierarchy
‣ Based on notions of caching
• At each level in the hierarchy:
‣ Block placement (where do we put a block?)
‣ Finding a block (how do we find it?)
‣ Replacement on a miss (which block is replaced?)
‣ Write policy (what happens on a write?)

Block Placement






• Determined by associativity
‣ Direct mapped (1-way associative): one choice for placement
‣ n-way set associative: n choices within a set
‣ Fully associative: any location
• Higher associativity reduces the miss rate
‣ But increases complexity, cost, and access time

Finding a Block
Associativity            Location method                                  Tag comparisons
Direct mapped            Index                                            1
n-way set associative    Set index, then search entries within the set    n
Fully associative        Search all entries                               #entries
Fully associative        Full lookup table                                0
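
The index and tag arithmetic behind the table, as a sketch; the cache geometry here is an assumed example (both values powers of two). For a direct-mapped cache each set holds 1 block; for fully associative there is a single set, so the index vanishes and the whole block address becomes the tag.

#include <stdint.h>

#define BLOCK_BYTES 16u    /* bytes per block (example value) */
#define SETS        256u   /* number of sets  (example value) */

uint32_t set_index(uint32_t addr) { return (addr / BLOCK_BYTES) % SETS; }
uint32_t block_tag(uint32_t addr) { return (addr / BLOCK_BYTES) / SETS; }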

Replacement




• Choice of entry to replace on a miss
‣ Least recently used (LRU)
– Complex and costly hardware for high associativity
‣ Random
– Close to LRU, easier to implement
• Virtual memory
‣ LRU approximation with hardware support

Write Policy

• Write-through
‣ Update both upper and lower levels
‣ Simplifies replacement, but probably requires a write buffer
• Write-back
‣ Update upper level only
‣ Update lower level when block is replaced
‣ Need to keep more state
• Virtual memory
‣ Only write-back is feasible, given disk write latency

Sources of Misses
• Compulsory misses (aka cold start misses)
‣ First access to a block
• Capacity misses
‣ Due to finite cache size
‣ A replaced block is later accessed again
• Conflict misses (aka collision misses)
‣ In a non-fully associative cache
‣ Due to competition for entries in a set
‣ Would not occur in a fully associative cache of the same total size

Cache Design Trade-offs
Design change            Effect on miss rate             Negative performance effect
Increase cache size      Decreases capacity misses       May increase access time
Increase associativity   Decreases conflict misses       May increase access time
Increase block size      Decreases compulsory misses     Increases miss penalty; for very large block sizes, may increase miss rate due to pollution

Cache Control







• Example cache characteristics:
‣ Direct-mapped, write-back, write allocate
‣ Block size: 4 words (16 bytes)
‣ Cache size: 16 KB (1024 blocks)
‣ 32-bit byte addresses
‣ Valid bit and dirty bit per block
‣ Blocking cache
– CPU waits until access is complete
• Address layout (bits 31–0): Tag = bits 31–14 (18 bits), Index = bits 13–4 (10 bits), Offset = bits 3–0 (4 bits)
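
Field extraction for this exact layout, as a minimal sketch: with 16-byte blocks the offset is 4 bits, with 1024 blocks the index is 10 bits, and the tag takes the remaining 18 bits.

#include <stdint.h>

uint32_t cache_offset(uint32_t addr) { return addr & 0xFu; }          /* bits 3–0   */
uint32_t cache_index(uint32_t addr)  { return (addr >> 4) & 0x3FFu; } /* bits 13–4  */
uint32_t cache_tag(uint32_t addr)    { return addr >> 14; }           /* bits 31–14 */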

Interface Signals
(Figure: interface signals between CPU, cache, and memory. CPU ↔ Cache: Read/Write, Valid, 32-bit Address, 32-bit Write Data, 32-bit Read Data, Ready. Cache ↔ Memory: Read/Write, Valid, 32-bit Address, 128-bit Write Data, 128-bit Read Data, Ready. Memory takes multiple cycles per access.)

Finite State Machines
• Use an FSM to sequence control steps
• Set of states, with a transition on each clock edge
‣ State values are binary encoded
‣ Current state stored in a register
‣ Next state = fn (current state, current inputs)
• Control output signals = fo (current state)

Cache Controller FSM
Could partition into separate states to reduce clock cycle time
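
A C sketch of the next-state logic, assuming the standard four-state controller for a write-back, write-allocate blocking cache (Idle, Compare Tag, Write-Back, Allocate); the input signal names are illustrative:

typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } state_t;

typedef struct {
    int cpu_request;   /* valid CPU read/write this cycle */
    int hit;           /* tag match and valid bit set */
    int dirty;         /* victim block has been written */
    int mem_ready;     /* memory has finished the transfer */
} inputs_t;

/* Next state = fn (current state, current inputs) */
state_t next_state(state_t s, inputs_t in)
{
    switch (s) {
    case IDLE:        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG: return in.hit   ? IDLE              /* hit: done */
                           : in.dirty ? WRITE_BACK        /* miss, dirty victim */
                                      : ALLOCATE;         /* miss, clean victim */
    case WRITE_BACK:  return in.mem_ready ? ALLOCATE    : WRITE_BACK;
    case ALLOCATE:    return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    default:          return IDLE;
    }
}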

Pitfalls
• Byte vs. word addressing
‣ Example: 32-byte direct-mapped cache with 4-byte blocks (8 blocks)
– Byte address 36 maps to block 1: (36 ÷ 4) mod 8 = 9 mod 8 = 1
– Word address 36 maps to block 4: 36 mod 8 = 4
• Ignoring memory system effects when writing or generating code
‣ Example: iterating over rows vs. columns of arrays (see the sketch below)
‣ Large strides result in poor locality
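
The classic row-vs-column example. Both functions touch every element, but C stores arrays row-major, so the first walks memory with a unit stride while the second jumps N doubles between consecutive accesses, defeating spatial locality:

#define N 1024

/* Good locality: inner loop walks consecutive memory locations. */
void scale_row_major(double a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= 2.0;
}

/* Poor locality: stride of N doubles between consecutive accesses. */
void scale_column_major(double a[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= 2.0;
}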

Memory and Caches, A Summary

• Fast memories are small, large memories are slow
‣ We really want fast, large memories ☹
‣ Caching gives this illusion ☺
• Principle of locality
‣ Programs use a small part of their memory space frequently
• Memory hierarchy
‣ L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk
• Memory system design is critical for multiprocessors

This Section
• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls