Carnegie Mellon
Cache Memories
15-213/18-213/15-513: Introduction to Computer Systems 11th Lecture, June 12, 2020
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
Today
Cache memory organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
Recall: Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality:
▪ Recently referenced items are likely to be referenced again in the near future
Spatial locality:
▪ Items with nearby addresses tend to be referenced close together in time
Recall: Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit below:
L0: Regs. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). L2 cache holds cache lines retrieved from L3 cache.
L3: L3 cache (SRAM). L3 cache holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers)
Recall: General Cache Concepts
Cache: smaller, faster, more expensive memory that caches a subset of the blocks (here it holds copies of blocks 8, 9, 14, and 3).
Memory: larger, slower, cheaper memory viewed as partitioned into “blocks” (numbered 0 through 15 in the figure).
Data is copied between the levels in block-sized transfer units.
General Cache Concepts: Hit
Request: 14
Data in block b is needed.
Block b is in cache (which holds blocks 8, 9, 14, and 3): Hit!
General Cache Concepts: Miss
Request: 12
Data in block b is needed.
Block b is not in cache: Miss!
Block b is fetched from memory.
Block b is stored in cache (here, block 12 replaces block 9).
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
Recall: General Caching Concepts: 3 Types of Cache Misses
Cold (compulsory) miss
▪ Cold misses occur because the cache starts empty and this is the first reference to the block.
Capacity miss
▪ Occurs when the set of active cache blocks (working set) is larger than the cache.
Conflict miss
▪ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
▪ E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
▪ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
▪ E.g., referencing blocks 0, 8, 0, 8, 0, 8, … would miss every time.
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware
▪ Hold frequently accessed blocks of main memory
▪ CPU looks first for data in the cache
Typical system structure: the CPU chip contains the register file, ALU, and cache memory; its bus interface connects through the I/O bridge (via the system bus and memory bus) to main memory.
General Cache Organization (S, E, B)
The cache is organized as S = 2^s sets, each containing E = 2^e lines. Each line holds a valid bit (v), a tag, and a cache block of B = 2^b data bytes (byte offsets 0, 1, 2, …, B-1).
Cache size = S x E x B data bytes
Cache Read
Address of word: t tag bits, s set index bits, b block offset bits.
• Locate set using the s set index bits
• Check if any line in set has matching tag
• Yes + line valid: hit
• Locate data starting at offset (data begins at this offset within the B = 2^b byte block)
Example: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size B = 8 bytes.
Address of int: t tag bits, set index 0…01, block offset 100.
The set index bits select one of the S = 2^s sets; each set holds a single line (valid bit, tag, and 8 data bytes).
Example: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size B = 8 bytes.
Address of int: t tag bits, set index 0…01, block offset 100.
valid? + match: assume yes (= hit)
Example: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size B = 8 bytes.
Address of int: t tag bits, set index 0…01, block offset 100.
valid? + match: assume yes (= hit)
The int (4 bytes) is at block offset 4 (binary 100) within the line’s 8-byte block.
If tag doesn’t match (= miss): old line is evicted and replaced
Direct-Mapped Cache Simulation
t=1 s=2 b=1
4-bit addresses (address space size M=16 bytes)
S=4 sets, E=1 blocks/set, B=2 bytes/block
Address trace (reads, one byte per read):
0 [0000₂], miss
1 [0001₂], hit
7 [0111₂], miss
8 [1000₂], miss
0 [0000₂], miss
Final cache state (v, Tag, Block):
Set 0: v=1, Tag=0, M[0-1] (the access to 8 loaded M[8-9] with Tag=1, which the final access to 0 then evicted)
Set 1: empty
Set 2: empty
Set 3: v=1, Tag=0, M[6-7]
E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size B = 8 bytes.
Address of short int: t tag bits, set index 0…01, block offset 100.
The set index bits select one of the S sets; each set now holds two lines, each with a valid bit, a tag, and 8 data bytes.
E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size B = 8 bytes.
Address of short int: t tag bits, set index 0…01, block offset 100.
valid? + match: yes (= hit). Compare both lines’ tags in the selected set.
E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size B = 8 bytes.
Address of short int: t tag bits, set index 0…01, block offset 100.
valid? + match: yes (= hit). Compare both lines’ tags in the selected set.
The short int (2 bytes) is at block offset 4 within the matching line’s block.
No match or not valid (= miss):
• One line in set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
2-Way Set Associative Cache Simulation
t=2 s=1 b=1
4-bit addresses (M=16 bytes)
S=2 sets, E=2 blocks/set, B=2 bytes/block
Address trace (reads, one byte per read):
0 [0000₂], miss
1 [0001₂], hit
7 [0111₂], miss
8 [1000₂], miss
0 [0000₂], hit
Final cache state (v, Tag, Block):
Set 0: line 0: v=1, Tag=00, M[0-1]; line 1: v=1, Tag=10, M[8-9]
Set 1: line 0: v=1, Tag=01, M[6-7]; line 1: empty
What about writes?
Multiple copies of data exist:
▪ L1, L2, L3, Main Memory, Disk
What to do on a write-hit?
▪ Write-through (write immediately to memory)
▪ Write-back (defer write to memory until replacement of line)
▪ Each cache line needs a dirty bit (set if data has been written to)
What to do on a write-miss?
▪ Write-allocate (load into cache, update line in cache)
▪ Good if more writes to the location will follow
▪ No-write-allocate (writes straight to memory, does not load into cache)
Typical combinations
▪ Write-through + No-write-allocate
▪ Write-back + Write-allocate
Cache line format: valid bit (v), dirty bit (d), tag, and B = 2^b data bytes.
Practical Write-back Write-allocate
A write to address X is issued.
If it is a hit
▪ Update the contents of the block
▪ Set dirty bit to 1 (bit is sticky and only cleared on eviction)
If it is a miss
▪ Fetch block from memory (as for a read miss)
▪ Then perform the write operation (as for a write hit)
If a line is evicted and its dirty bit is set to 1
▪ The entire block of 2^b bytes is written back to memory
▪ Dirty bit is cleared (set to 0)
▪ Line is replaced by new contents
Cache line format: valid bit (v), dirty bit (d), tag, and B = 2^b data bytes.
Intel Core i7 Cache Hierarchy
Processor package: each core (Core 0 … Core 3) has its own registers, L1 i-cache and d-cache, and L2 unified cache; the L3 unified cache is shared by all cores and backed by main memory.
L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 10 cycles
L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
Block size: 64 bytes for all caches.
Cache Performance Metrics
Miss Rate
▪ Fraction of memory references not found in cache (misses / accesses) = 1 – hit rate
▪ Typical numbers (in percentages):
▪ 3-10% for L1
▪ can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
▪ Time to deliver a line in the cache to the processor
▪ includes time to determine whether the line is in the cache
▪ Typical numbers:
▪ 4 clock cycles for L1
▪ 10 clock cycles for L2
Miss Penalty
▪ Additional time required because of a miss
▪ typically 50-200 cycles for main memory (trend: increasing!)
Let’s think about those numbers
Huge difference between a hit and a miss
▪ Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
▪ Consider this simplified example: cache hit time of 1 cycle, miss penalty of 100 cycles
▪ Average access time:
97% hits: 1 cycle + 0.03 x 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 x 100 cycles = 2 cycles
This is why “miss rate” is used instead of “hit rate”
Writing Cache Friendly Code
Make the common case go fast
▪ Focus on the inner loops of the core functions
Minimize the misses in the inner loops
▪ Repeated references to variables are good (temporal locality)
▪ Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories
Today
Cache organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
The Memory Mountain
Read throughput (read bandwidth)
▪ Number of bytes read from memory per second (MB/s)
Memory mountain: Measured read throughput as a function of spatial and temporal locality.
▪ Compact way to characterize memory system performance.
Memory Mountain Test Function
Call test() with many combinations of elems and stride.
For each elems and stride:
1. Call test() once to warm up the caches.
2. Call test() again and measure the read throughput (MB/s).

long data[MAXELEMS]; /* Global array to traverse */

/* test - Iterate over first "elems" elements of
 * array "data" with stride of "stride",
 * using 4x4 loop unrolling. */
int test(int elems, int stride) {
    long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems, limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}

mountain/mountain.c
The Memory Mountain
[Figure: read throughput (MB/s, up to about 16000) plotted against working-set size (32 KB to 128 MB) and stride (x8 bytes, s1 to s11), measured on a Core i7 Haswell at 2.1 GHz with 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache, and 64 B block size. Ridges of temporal locality appear at the L1, L2, L3, and Mem size boundaries; slopes of spatial locality fall off along the stride axis; aggressive prefetching keeps small-stride throughput high.]
Today
Cache organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
Matrix Multiplication Example
Description:
▪ Multiply N x N matrices
▪ Matrix elements are doubles (8 bytes)
▪ O(N^3) total operations
▪ N reads per source element
▪ N values summed per destination
▪ but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Layout of C Arrays in Memory (review)
C arrays are allocated in row-major order
▪ each row in contiguous memory locations
Stepping through columns in one row:
▪ for (j = 0; j < n; j++)
      sum += a[0][j];
▪ accesses successive elements
▪ if block size B > sizeof(aij), exploits spatial locality
▪ miss rate = sizeof(aij) / B
Stepping through rows in one column:
▪ for (i = 0; i < n; i++)
      sum += a[i][0];
▪ accesses distant elements
▪ no spatial locality!
▪ miss rate = 1 (i.e. 100%)
Matrix Multiplication (ijk)
Inner loop:
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}