
Memory and Caches Part 3

Stewart Smith Digital Systems Design 4

Digital Systems Design 4
Memory & Caches Part 3


This Section

• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls


Virtual Memory
• Use main memory as a “cache” for secondary (disk) storage
‣ Managed jointly by CPU hardware and the operating system (OS)

• Programs share main memory
‣ Each gets a private virtual address space holding its frequently used code and data
‣ Protected from other programs

• CPU and OS translate virtual addresses to physical addresses
‣ VM “block” is called a page
‣ VM translation “miss” is called a page fault


Address Translation


Page Fault Penalty
• On page fault, the page must be fetched from disk

‣ Takes millions of clock cycles

‣ Handled by OS code

• Try to minimise page fault rate
‣ Fully associative placement

‣ Smart replacement algorithms


Page Tables
• Stores placement information
‣ Array of page table entries (PTEs), indexed by virtual page number
‣ Page table register in CPU points to page table in physical memory

• If page is present in memory (valid = 1)
‣ PTE stores the physical page number
‣ Plus other status bits (referenced, dirty, …)

• If page is not present (valid = 0)
‣ PTE refers to location in swap space on disk
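The valid-bit lookup above can be sketched in Python (the page size and table contents here are invented for illustration, not taken from the lecture):

```python
PAGE_SIZE = 4096  # illustrative 4 KiB pages

# Hypothetical page table: virtual page number -> (valid, physical page number).
# A real PTE would also carry referenced/dirty bits and a disk location when invalid.
page_table = {
    0: (1, 7),    # VPN 0 resident in physical page 7
    1: (0, None), # VPN 1 not present: an access would page-fault
}

def translate(vaddr):
    """Translate a virtual address via the page table (sketch)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number, byte offset
    valid, ppn = page_table.get(vpn, (0, None))
    if not valid:
        raise RuntimeError("page fault")     # OS would fetch the page from disk
    return ppn * PAGE_SIZE + offset          # physical address
```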


Translation Using a Page Table


Mapping Pages to Storage


Replacement and Writes
• To reduce page fault rate, prefer least-recently used (LRU) replacement
‣ Reference bit/use bit in PTE set to 1 on access to page
‣ Periodically cleared to 0 by OS
‣ A page with reference bit = 0 has not been used recently

• Disk writes take millions of cycles
‣ Write through is impractical
‣ Must use write-back
‣ Dirty bit in PTE set when page is written
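The reference-bit scheme above can be sketched as a simple not-recently-used victim choice (page names and structure are invented for illustration):

```python
# Sketch: reference bits approximate LRU. Hardware sets the bit on access;
# the OS periodically clears all bits; a page still at 0 was not used recently.
ref_bit = {"A": 1, "B": 0, "C": 1}  # hypothetical resident pages

def touch(page):
    ref_bit[page] = 1           # set to 1 on access to the page

def clear_all():
    for p in ref_bit:           # periodically cleared to 0 by the OS
        ref_bit[p] = 0

def pick_victim():
    # Prefer a page whose reference bit is 0 (not used recently)
    for page, bit in ref_bit.items():
        if bit == 0:
            return page
    return next(iter(ref_bit))  # all recently used: fall back to any page
```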


Fast Translation Using a
Translation Lookaside Buffer (TLB)
• Address translation would appear to require extra memory references
‣ One to access the Page Table Entry (PTE)
‣ Then another for the actual memory access

• But access to page tables has good locality
‣ So, use a fast cache of PTEs within the CPU
‣ Called a Translation Look-aside Buffer (TLB)
‣ Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
‣ Misses could be handled by hardware or software
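A minimal sketch of a TLB in front of the page table (sizes, eviction policy, and table contents are invented; real TLBs are set associative hardware structures):

```python
PAGE_SIZE = 4096
page_table = {0: 5, 1: 9, 2: 3}  # VPN -> PPN, all assumed valid here
tlb = {}                          # small cache of recent translations
MAX_TLB_ENTRIES = 2               # tiny for illustration
stats = {"hits": 0, "misses": 0}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                          # TLB hit: no page-table access
        stats["hits"] += 1
        ppn = tlb[vpn]
    else:                                   # TLB miss: walk page table, refill
        stats["misses"] += 1
        ppn = page_table[vpn]
        if len(tlb) >= MAX_TLB_ENTRIES:
            tlb.pop(next(iter(tlb)))        # evict oldest entry (FIFO sketch)
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset
```

Good locality in page-table accesses is what makes the small TLB hit so often in practice.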


Fast Translation Using a TLB


Intrinsity FastMATH TLB


FastMATH TLB process


TLB Misses
• If page is in memory
‣ Load the PTE from memory and retry
‣ Could be handled in hardware
– Can get complex for more complicated page table structures

‣ Or in software
– Raise a special exception, with optimised handler

• If page is not in memory (page fault)
‣ OS handles fetching the page and updating the page table
‣ Then restart the faulting instruction


Page Fault Handler
• Use virtual address to find PTE
• Locate page on disk
• Choose page to replace
‣ If dirty, write to disk first

• Read page into memory and update page table

• Make process runnable again
‣ Restart from faulting instruction
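The handler steps above can be sketched as follows (all names and data structures are invented; a real handler manipulates hardware page tables):

```python
import collections

def handle_page_fault(vpn, page_table, resident, disk, capacity):
    """Sketch of the page-fault handler steps.

    resident: OrderedDict of VPN -> dirty flag for pages in memory,
              oldest first (stands in for the replacement policy).
    """
    if len(resident) >= capacity:
        victim, dirty = resident.popitem(last=False)  # choose page to replace
        if dirty:
            disk[victim] = "written back"             # if dirty, write to disk first
        page_table[victim] = "on disk"
    resident[vpn] = False                             # read page into memory, clean
    page_table[vpn] = "in memory"                     # update page table
    # ...the process is then made runnable and the faulting instruction restarted
```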


Memory Protection
• Different tasks can share parts of their virtual address spaces
‣ But need to protect against errant access
‣ Requires OS assistance

• Hardware support for OS protection
‣ Privileged supervisor mode (aka kernel mode)
‣ Privileged instructions
‣ Page tables and other state information only accessible in supervisor mode
‣ System call exception (e.g., syscall in MIPS)


Summary – Virtual Memory

• Virtual memory extends hierarchy to disk storage
‣ Page tables hold translations to physical addresses

‣ TLB is a cache for virtual memory translation

• Key point of virtual memory is to allow many programs to share memory with protection!

• Secondary aspect is to allow an address space that is larger than physical memory.


Multilevel Cache Example

Intel Nehalem processor with labelled sections:

[Figure: four SMT CPU cores (Core 0 to Core 3), each with a 32 kiB L1 instruction cache, a 32 kiB L1 data cache, and a private 256 kiB L2 cache; each core holds 2 MiB of the shared 8 MiB L3 cache; per-core write buffers; north bridge and communication switch.]

Multilevel Cache Example

ARM Cortex-A8 uni-core processor


2-Level TLB Organisation
• ARM Cortex-A8
‣ Virtual address: 32 bits
‣ Physical address: 32 bits
‣ Page sizes: variable (4, 16, 64 KiB; 1, 16 MiB)
‣ 1 TLB for instructions and 1 TLB for data
‣ Both TLBs are fully associative, with 32 entries, round-robin replacement
‣ TLB misses handled in hardware

• Intel Core i7 (Nehalem)
‣ Virtual address: 48 bits
‣ Physical address: 44 bits
‣ Page sizes: variable (4 KiB, 2/4 MiB)
‣ 1 TLB for instructions and 1 TLB for data per core
‣ Both L1 TLBs are four-way set associative, LRU replacement
‣ L1 I-TLB: 128 entries for small pages, 7 per thread for large pages
‣ L1 D-TLB: 64 entries for small pages, 32 for large pages
‣ L2 TLB: four-way set associative, LRU replacement, 512 entries
‣ TLB misses handled in hardware


Miss Penalty Reduction
• Return requested word first
‣ Then back-fill rest of block

• Non-blocking miss processing
‣ Hit under miss: allow hits to proceed

‣ Miss under miss: allow multiple outstanding misses

• Hardware prefetch: instructions and data
• Bank interleaved L1 D-caches
‣ Multiple concurrent accesses per cycle


Summary of the Memory Hierarchy

• Common principles apply at all levels of the memory hierarchy

‣ Based on notions of caching

• At each level in the hierarchy
‣ Block placement (where do we put a block?)

‣ Finding a block (how do we find it?)

‣ Replacement on a miss (which block is replaced?)

‣ Write policy (what happens with a Write?)


Block Placement
• Determined by associativity
‣ Direct mapped (1-way associative)
– One choice for placement

‣ n-way set associative
– n choices within a set

‣ Fully associative
– Any location

• Higher associativity reduces miss rate
‣ Increases complexity, cost, and access time
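The placement choices above can be sketched by computing, for a given associativity, which cache block indices may hold a block (cache size here is an invented example):

```python
# Sketch: candidate locations for a block under different associativities.
NUM_BLOCKS = 8  # illustrative cache size in blocks

def candidate_locations(block_addr, ways):
    """Return the cache block indices where this block may be placed."""
    num_sets = NUM_BLOCKS // ways
    s = block_addr % num_sets                  # set index
    return [s * ways + w for w in range(ways)]

# ways = 1           -> direct mapped: one choice
# ways = n           -> n-way set associative: n choices within a set
# ways = NUM_BLOCKS  -> fully associative: any location
```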


Finding a Block

• Direct mapped
‣ Location method: index
‣ Tag comparisons: 1
• n-way set associative
‣ Location method: set index, then search entries within the set
‣ Tag comparisons: n
• Fully associative
‣ Location method: search all entries, or use a full lookup table
‣ Tag comparisons: number of entries (0 with a full lookup table)


Replacement
• Choice of entry to replace on a miss
‣ Least recently used (LRU)

– Complex and costly hardware for high associativity
‣ Random

– Close to LRU, easier to implement

• Virtual memory
‣ LRU approximation with hardware support
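A minimal sketch of true LRU replacement for a small, fully associative cache (the class and its interface are invented for illustration):

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry first

    def access(self, block):
        """Return True on hit; on miss, insert block, evicting the LRU entry."""
        if block in self.entries:
            self.entries.move_to_end(block)   # most recently used goes last
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[block] = True
        return False
```

The bookkeeping here (an ordered structure updated on every access) hints at why exact LRU hardware becomes costly at high associativity.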


Write Policy
• Write-through
‣ Update both upper and lower levels
‣ Simplifies replacement, probably requires a write buffer

• Write-back
‣ Update upper level only
‣ Update lower level when block is replaced
‣ Need to keep more state

• Virtual memory
‣ Only write-back is feasible, given disk write latency
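The write-back behaviour above can be sketched with a single cache line carrying a dirty bit (class and method names are invented for illustration):

```python
# Sketch: write-back policy. Writes update the upper level only and set the
# dirty bit; the lower level is updated when the block is replaced.
class WriteBackLine:
    def __init__(self):
        self.tag = None
        self.data = None
        self.dirty = False

    def write(self, tag, data, memory):
        self._fill(tag, memory)
        self.data = data
        self.dirty = True          # memory is now stale until replacement

    def read(self, tag, memory):
        self._fill(tag, memory)
        return self.data

    def _fill(self, tag, memory):
        if self.tag != tag:        # replacing the block
            if self.dirty:
                memory[self.tag] = self.data  # update lower level on replacement
            self.tag = tag
            self.data = memory.get(tag)
            self.dirty = False
```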


Sources of Misses
• Compulsory misses (aka cold start misses)
‣ First access to a block

• Capacity misses
‣ Due to finite cache size
‣ A replaced block is later accessed again

• Conflict misses (aka collision misses)
‣ In a non-fully associative cache
‣ Due to competition for entries in a set
‣ Would not occur in a fully associative cache of the same total size
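The three categories can be sketched by running a trace through a direct-mapped cache alongside a same-size fully associative LRU reference cache: a repeat miss that the fully associative cache would have hit is a conflict miss (the function and parameters here are invented for illustration):

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    """Classify each block access as hit, compulsory, capacity, or conflict."""
    direct = {}           # direct-mapped cache: set index -> block address
    full = OrderedDict()  # fully associative LRU cache of the same total size
    seen = set()
    kinds = []
    for block in trace:
        # Would an ideal fully associative cache of the same size hit?
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= num_blocks:
                full.popitem(last=False)
            full[block] = True
        # The direct-mapped cache under study
        idx = block % num_blocks
        if direct.get(idx) == block:
            kinds.append("hit")
        elif block not in seen:
            kinds.append("compulsory")      # first access to the block
        elif not full_hit:
            kinds.append("capacity")        # even the ideal cache missed
        else:
            kinds.append("conflict")        # competition within a set
        direct[idx] = block
        seen.add(block)
    return kinds
```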


Cache Design Trade-offs

• Increase cache size
‣ Decreases capacity misses
‣ May increase access time
• Increase associativity
‣ Decreases conflict misses
‣ May increase access time
• Increase block size
‣ Decreases compulsory misses
‣ Increases miss penalty; for very large block sizes, may increase miss rate due to pollution


Cache Control
• Example cache characteristics
‣ Direct-mapped, write-back, write allocate
‣ Block size: 4 words (16 bytes)
‣ Cache size: 16 KB (1024 blocks)
‣ 32-bit, byte addresses
‣ Valid bit and dirty bit per block
‣ Blocking cache
– CPU waits until access is complete

Address breakdown (32-bit byte address):
‣ Tag: bits 31–14 (18 bits)
‣ Index: bits 13–4 (10 bits)
‣ Offset: bits 3–0 (4 bits)
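The field widths for this example cache (4-word blocks give 4 offset bits, 1024 blocks give 10 index bits, leaving 18 tag bits) can be checked with a short sketch:

```python
# Sketch: splitting a 32-bit byte address for the example cache above.
OFFSET_BITS = 4   # 4 words = 16 bytes per block
INDEX_BITS = 10   # 1024 blocks

def split_address(addr):
    """Return (tag, index, offset) for a 32-bit byte address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                  # bits 3-0
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # bits 13-4
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                  # bits 31-14
    return tag, index, offset
```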


Interface Signals

[Figure: CPU, cache, and memory interface signals]
• CPU ↔ Cache: Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
• Cache ↔ Memory: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready
‣ Memory takes multiple cycles per access


Finite State Machines
• Use an FSM to sequence control steps
• Set of states, transition on each clock edge
‣ State values are binary encoded
‣ Current state stored in a register
‣ Next state = fn(current state, current inputs)
• Control output signals = fo(current state)
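A toy Moore FSM in this style, with a next-state function fn and an output function fo (the states, inputs, and signals here are invented, not the cache controller's actual states):

```python
# Sketch: Moore FSM. Outputs depend only on the current state (fo);
# the next state depends on current state and current inputs (fn).
NEXT = {  # fn: (current state, current input) -> next state
    ("Idle", "request"): "Busy",
    ("Idle", "none"): "Idle",
    ("Busy", "done"): "Idle",
    ("Busy", "none"): "Busy",
}
OUTPUT = {  # fo: current state -> control output signals
    "Idle": {"ready": 1},
    "Busy": {"ready": 0},
}

def step(state, inp):
    """One clock edge: produce outputs from the state, then transition."""
    return OUTPUT[state], NEXT[(state, inp)]
```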


Cache Controller FSM
Could partition into separate states to reduce clock cycle time


Pitfalls

• Byte vs. word addressing
‣ Example: 32-byte direct-mapped cache, 4-byte blocks
– Byte 36 maps to block 1
– Word 36 maps to block 4

• Ignoring memory system effects when writing or generating code
‣ Example: iterating over rows vs. columns of arrays
‣ Large strides result in poor locality
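The byte-vs-word mapping from the first pitfall can be checked directly (the cache geometry is the one from the example above, 32 bytes with 4-byte blocks, i.e. 8 blocks):

```python
# Sketch: byte vs word addressing in a 32-byte direct-mapped cache
# with 4-byte blocks (8 blocks in total).
NUM_BLOCKS = 32 // 4

def block_for_byte(byte_addr):
    """Cache block index for a byte address."""
    return (byte_addr // 4) % NUM_BLOCKS

def block_for_word(word_addr):
    """Cache block index for a word address (each word is 4 bytes)."""
    return block_for_byte(word_addr * 4)
```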


Memory and Caches, A Summary

• Fast memories are small, large memories are slow
‣ We really want fast, large memories ☹
‣ Caching gives this illusion ☺

• Principle of locality
‣ Programs use a small part of their memory space frequently

• Memory hierarchy
‣ L1 cache ↔ L2 cache ↔ … ↔ DRAM memory ↔ disk

• Memory system design is critical for multiprocessors


This Section

• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls