Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
How Much Associativity
Increased associativity decreases miss rate
But with diminishing returns
Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%
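
A quick way to see what actually changes with associativity: capacity and block size fix the total number of blocks, so each doubling of associativity halves the number of sets and moves one index bit into the tag. The sketch below is illustrative only (not from the slides) and prints the breakdown for this 64KB, 16-word-block cache:

#include <stdio.h>

/* Illustrative: index/tag split for a 64KB cache with 16-word
   (64-byte) blocks at each associativity from the slide. */
int main(void)
{
    const int cache_bytes = 64 * 1024;
    const int block_bytes = 16 * 4;               /* 16 words x 4 bytes */
    const int blocks = cache_bytes / block_bytes; /* 1024 blocks total */

    for (int ways = 1; ways <= 8; ways *= 2) {
        int sets = blocks / ways;
        int index_bits = 0;
        for (int s = sets; s > 1; s >>= 1)
            index_bits++;
        /* 32-bit address = tag | index | 6-bit block offset */
        printf("%d-way: %4d sets, %2d index bits, %2d tag bits\n",
               ways, sets, index_bits, 32 - index_bits - 6);
    }
    return 0;
}
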
Set Associative Cache Organization (figure)
Replacement Policy
Direct mapped: no choice
Set associative
Prefer non-valid entry, if there is one
Otherwise, choose among entries in the set
Least-recently used (LRU)
Choose the one unused for the longest time
Simple for 2-way, manageable for 4-way, too hard beyond that
Random
Gives approximately the same performance as LRU for high associativity
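
To see why LRU is simple for 2-way, here is a minimal sketch (the structures are illustrative, not from the slides): one bit per set is enough, because "not the most recently used way" identifies the LRU way exactly.

/* Sketch: LRU victim selection in one set of a 2-way cache. */
typedef struct {
    int      valid[2];
    unsigned tag[2];
    int      lru;              /* index of the least-recently used way */
} set_t;

int pick_victim(set_t *s)
{
    if (!s->valid[0]) return 0;    /* prefer a non-valid entry */
    if (!s->valid[1]) return 1;
    return s->lru;                 /* otherwise evict the LRU way */
}

void touch(set_t *s, int way)      /* call on every hit or fill */
{
    s->lru = 1 - way;              /* the other way is now LRU */
}
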
Multilevel Caches
Primary cache attached to CPU
Small, but fast
Level-2 cache services misses from primary cache
Larger, slower, but still faster than main memory
Main memory services L-2 cache misses
Some high-end systems include L-3 cache
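
The lookup flow can be sketched as follows. This is a hedged illustration of the hierarchy, not a real controller: probe_l1, probe_l2, fill_l1, fill_l2, and read_memory are hypothetical helpers standing in for the actual cache arrays.

/* Hypothetical helpers; each probe returns 1 on hit. */
int  probe_l1(unsigned addr, int *data);
int  probe_l2(unsigned addr, int *data);
void fill_l1(unsigned addr, int data);
void fill_l2(unsigned addr, int data);
int  read_memory(unsigned addr);

int load(unsigned addr)
{
    int data;
    if (probe_l1(addr, &data))     /* fast common case */
        return data;
    if (probe_l2(addr, &data)) {   /* L-2 services the L-1 miss */
        fill_l1(addr, data);
        return data;
    }
    data = read_memory(addr);      /* memory services the L-2 miss */
    fill_l2(addr, data);
    fill_l1(addr, data);
    return data;
}
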
Multilevel Cache Example
Given
CPU base CPI = 1, clock rate = 4GHz
Miss rate = 2% (misses per instruction)
Main memory access time = 100ns
With just primary cache
Miss penalty = 100ns/0.25ns = 400 cycles
Effective CPI = 1 + 0.02 × 400 = 9
Example (cont.)
Now add L-2 cache
Access time = 5ns
Global miss rate to main memory = 0.5%
Primary miss with L-2 hit
Penalty = 5ns/0.25ns = 20 cycles
Primary miss with L-2 miss
Extra penalty = 400 cycles (the 100ns main memory access)
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9/3.4 = 2.6
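
The arithmetic above is easy to check; this short program just reproduces the slide's numbers (0.25ns cycle at 4GHz):

#include <stdio.h>

int main(void)
{
    double base_cpi   = 1.0;
    double l1_miss    = 0.02;         /* misses per instruction */
    double mem_cycles = 100.0 / 0.25; /* 400-cycle memory penalty */
    double l2_cycles  = 5.0 / 0.25;   /* 20-cycle L-2 hit penalty */
    double global     = 0.005;        /* misses that reach memory */

    double cpi_l1 = base_cpi + l1_miss * mem_cycles;  /* = 9.0 */
    double cpi_l2 = base_cpi + l1_miss * l2_cycles
                             + global * mem_cycles;   /* = 3.4 */
    printf("L1 only: %.1f  with L2: %.1f  ratio: %.1f\n",
           cpi_l1, cpi_l2, cpi_l1 / cpi_l2);
    return 0;
}
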
Multilevel Cache Considerations
Primary cache
Focus on minimal hit time
L-2 cache
Focus on low miss rate to avoid main memory access
Hit time has less overall impact
Results
L-1 cache usually smaller than a single-level cache would be
L-1 block size smaller than L-2 block size
Interactions with Advanced CPUs
Out-of-order CPUs can execute instructions during cache miss
Pending store stays in load/store unit
Dependent instructions wait in reservation stations
Independent instructions continue
Effect of miss depends on program data flow
Much harder to analyse
Use system simulation
Pop Quiz
Our cache performance depends on:
A: CPU instructions
B: Data access patterns
C: Compilers
D: All of the above
Interactions with Software
Misses depend on memory access patterns
Algorithm behavior
Compiler optimization for memory access
Software Optimization via Blocking
Goal: maximize accesses to data before it is replaced
Consider inner loops of DGEMM:
for (int j = 0; j < n; ++j)
{
  double cij = C[i+j*n];          /* cij = C[i][j] */
  for (int k = 0; k < n; k++)
    cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
  C[i+j*n] = cij;                 /* C[i][j] = cij */
}

DGEMM Access Pattern
(figure: older vs. new accesses in the C, A, and B arrays)

Cache Blocked DGEMM
#define BLOCKSIZE 32
void do_block (int n, int si, int sj, int sk,
               double *A, double *B, double *C)
{
  for (int i = si; i < si+BLOCKSIZE; ++i)
    for (int j = sj; j < sj+BLOCKSIZE; ++j)
    {
      double cij = C[i+j*n];          /* cij = C[i][j] */
      for (int k = sk; k < sk+BLOCKSIZE; k++)
        cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
      C[i+j*n] = cij;                 /* C[i][j] = cij */
    }
}
void dgemm (int n, double* A, double* B, double* C)
{
  for (int sj = 0; sj < n; sj += BLOCKSIZE)
    for (int si = 0; si < n; si += BLOCKSIZE)
      for (int sk = 0; sk < n; sk += BLOCKSIZE)
        do_block(n, si, sj, sk, A, B, C);
}

Blocked DGEMM Access Pattern
(figure: unoptimized vs. blocked access patterns)

Disk Caching
Use main memory as a cache for magnetic disk
We can do this for a number of reasons:
speed up disk access
pretend we have more main memory than we really have
support multiple programs easily (each can pretend it has all the memory)

Our Focus
We will focus on using the disk as a storage area for chunks of main memory that are not being used
The basic concepts are similar to providing a cache for main memory, although we now view part of the hard disk as being the memory

Virtual Memory (§5.7)
Use main memory as a “cache” for secondary (disk) storage
Managed jointly by CPU hardware and the operating system (OS)
Programs share main memory
Each gets a private virtual address space holding its frequently used code and data
Protected from other programs
CPU and OS translate virtual addresses to physical addresses
VM “block” is called a page
VM translation “miss” is called a page fault

Motivation
Pretend we have 4GB; we really have only 512MB
At any time, the processor needs only a small portion of the 4GB memory
only a few programs are active
an active program might not need all the memory that has been reserved for it
We just keep the stuff needed in the main memory, and store the rest on disk

A Program’s View of Memory
We can write programs that address the virtual memory
There is hardware that translates these virtual addresses to physical addresses
The operating system is responsible for managing the movement of memory between disk and main memory, and for keeping the address translation table accurate
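
To make the translation step concrete, here is a minimal sketch assuming 4KB pages and a flat one-level table; page_table, NUM_PAGES, and translate are illustrative names, and a real system would also check valid and protection bits and trap to the OS on a fault.

#include <stdint.h>

#define PAGE_BITS 12              /* 4KB pages */
#define NUM_PAGES 1024            /* illustrative table size */

uint32_t page_table[NUM_PAGES];   /* VPN -> physical page number */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;              /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1); /* offset is unchanged */
    uint32_t ppn    = page_table[vpn];                 /* assumes vpn < NUM_PAGES */
    return (ppn << PAGE_BITS) | offset;
}
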
It doesn’t matter where it is physically loaded!
When a program is loaded, it doesn’t need to be placed in contiguous memory locations
any group of physical memory pages will do fine

Pop Quiz
Each process on our computer has its own set of virtual memory addresses.
A: True
B: False

Terminology
page: the unit of memory transferred between disk and the main memory
page fault: when a program accesses a virtual memory location that is not currently in the main memory
address translation: the process of finding the physical address that corresponds to a virtual address

Address Translation
Fixed-size pages (e.g., 4K)
(figure: virtual-to-physical address translation)

Page Fault Penalty
On page fault, the page must be fetched from disk
Takes millions of clock cycles
Handled by OS code
Try to minimize page fault rate
Fully associative placement
Smart replacement algorithms

Page Fault Penalty (cont.)
A page fault is a disaster!
disk is very, very, very slow compared to memory: millions of cycles!
Minimization of faults is the primary design consideration for virtual memory systems

Page Tables
Stores placement information
Array of page table entries, indexed by virtual page number
Page table register in CPU points to page table in physical memory
If page is present in memory
PTE stores the physical page number
Plus other status bits (referenced, dirty, …)
If page is not present
PTE can refer to location in swap space on disk

Translation Using a Page Table (figure)

Mapping Pages to Storage (figure)

Replacement and Writes
To reduce page fault rate, prefer least-recently used (LRU) replacement
Reference bit (aka use bit) in PTE set to 1 on access to page
Periodically cleared to 0 by OS
A page with reference bit = 0 has not been used recently
Disk writes take millions of cycles
Write a block at once, not individual locations
Write-through is impractical
Use write-back
Dirty bit in PTE set when page is written

Fast Translation Using a TLB
Address translation would appear to require extra memory references
One to access the PTE
Then the actual memory access
But access to page tables has good locality
So use a fast cache of PTEs within the CPU
Called a Translation Look-aside Buffer (TLB)
Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
Misses could be handled by hardware or software

Fast Translation Using a TLB (figure)
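
A minimal sketch of the idea, assuming a small fully associative TLB and a hypothetical walk_page_table slow path (the sizes and the trivial replacement policy are illustrative, not from the slides):

#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_BITS   12

typedef struct { int valid; uint32_t vpn, ppn; } tlb_entry;
tlb_entry tlb[TLB_ENTRIES];

uint32_t walk_page_table(uint32_t vpn);   /* hypothetical slow path */

uint32_t translate_with_tlb(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)     /* TLB hit: no extra access */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;

    uint32_t ppn = walk_page_table(vpn);      /* TLB miss: fetch the PTE */
    tlb[vpn % TLB_ENTRIES] = (tlb_entry){ 1, vpn, ppn };
    return (ppn << PAGE_BITS) | offset;
}
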
TLB Misses
If page is in memory
Load the PTE from memory and retry
Could be handled in hardware
Can get complex for more complicated page table structures
Or in software
Raise a special exception, with optimized handler
If page is not in memory (page fault)
OS handles fetching the page and updating the page table
Then restart the faulting instruction

TLB Miss Handler
TLB miss indicates
Page present, but PTE not in TLB
Page not present
Must recognize TLB miss before destination register overwritten
Raise exception
Handler copies PTE from memory to TLB
Then restarts instruction
If page not present, page fault will occur

Page Fault Handler
Use faulting virtual address to find PTE
Locate page on disk
Choose page to replace
If dirty, write to disk first
Read page into memory and update page table
Make process runnable again
Restart from faulting instruction

TLB and Cache Interaction
If cache tag uses physical address
Need to translate before cache lookup
Alternative: use virtual address tag
Complications due to aliasing
Different virtual addresses for shared physical address

Pop Quiz
If our cache is using only virtual addresses, after a context switch we must:
A: Invalidate cached code
B: Invalidate cached data
C: Invalidate shared libraries
D: Invalidate everything

Memory Protection
Different tasks can share parts of their virtual address spaces
But need to protect against errant access
Requires OS assistance
Hardware support for OS protection
Privileged supervisor mode (aka kernel mode)
Privileged instructions
Page tables and other state information only accessible in supervisor mode
System call exception (e.g., syscall in MIPS)

Possible TLB Outcomes
(table: which combinations of TLB, page table, and cache hits and misses can occur)