Carnegie Mellon
Cache Memories
15-213/18-213/14-513/15-513/18-613: Introduction to Computer Systems
11th Lecture, October 6, 2020
Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, Third Edition
Announcements
Lab 3 (attacklab)
▪ Due this Thursday, Oct. 8, 11:59pm ET
Lab 4 (cachelab)
▪ Out Oct. 8, 11:59pm ET
▪ Due Oct. 20, 11:59pm ET
Written Assignment 3 peer grading
▪ Due Wed, Oct. 7, 11:59pm ET
Written Assignment 4 available on Canvas
▪ Due Wed, Oct. 7, 11:59pm ET
Today
Cache memory organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
Recall: Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality:
▪ Recently referenced items are likely to be referenced again in the near future
Spatial locality:
▪ Items with nearby addresses tend to be referenced close together in time
Recall: Memory Hierarchy
[Figure: pyramid of smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices toward the bottom]
L0: Regs. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM). Holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Holds files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers)
Recall: General Cache Concepts
[Figure: a cache holding copies of blocks 8, 9, 14, and 3, above a memory partitioned into blocks 0-15; block 4 is shown being copied into the cache]
Smaller, faster, more expensive memory caches a subset of the blocks.
Data is copied in block-sized transfer units.
Larger, slower, cheaper memory viewed as partitioned into "blocks".
General Cache Concepts: Hit
Request: 14
[Figure: cache holding blocks 8, 9, 14, and 3, above memory blocks 0-15]
Data in block b is needed.
Block b is in cache: Hit!
General Cache Concepts: Miss
Request: 12
[Figure: cache above memory blocks 0-15; block 12 is fetched from memory and replaces block 9, leaving the cache holding blocks 8, 12, 14, and 3]
Data in block b is needed. Block b is not in cache: Miss!
Block b is fetched from memory.
Block b is stored in cache:
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
Recall: General Caching Concepts: 3 Types of Cache Misses
Cold (compulsory) miss
▪ Cold misses occur because the cache starts empty and this is the first reference to the block.
Capacity miss
▪ Occurs when the set of active cache blocks (working set) is larger than the cache.
Conflict miss
▪ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
▪ E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
▪ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
▪ E.g., referencing blocks 0, 8, 0, 8, 0, 8, … would miss every time.
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware
▪ Hold frequently accessed blocks of main memory
▪ CPU looks first for data in cache
Typical system structure:
[Figure: CPU chip containing the register file, ALU, and cache memory behind a bus interface; the system bus connects through an I/O bridge to the memory bus and main memory]
What it Really Looks Like
[Figure: the schematic CPU chip (register file, ALU, cache memory, bus interface) alongside die photos of Intel Nehalem, AMD FX 8150, and Core i7-3960X processors]
What it Really Looks Like (Cont.)
[Figure: Intel Sandy Bridge Processor Die]
L1: 32KB Instruction + 32KB Data
L2: 256KB
L3: 3–20MB
General Cache Organization (S, E, B)
[Figure: cache organized as S = 2^s sets; each set holds E = 2^e lines; each line has a valid bit v, a tag, and a cache block of B = 2^b data bytes (byte positions 0, 1, 2, …, B-1)]
Cache size = S x E x B data bytes
Cache Read
[Figure: address of word divided into t tag bits, s set-index bits, and b block-offset bits; the set index selects one of the S = 2^s sets, the tag is compared against the E = 2^e lines in that set, and the data begins at the offset within the B = 2^b-byte block]
• Locate set
• Check if any line in set has matching tag
• Yes + line valid: hit
• Locate data starting at offset
Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size B=8 bytes
[Figure: S = 2^s sets, each a single line (v, tag, bytes 0-7); address of int shown as (t bits | 0…01 | 100); the set-index bits 0…01 find the set]
Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size B=8 bytes
[Figure: the selected line (v, tag, bytes 0-7) is checked against the address (t bits | 0…01 | 100)]
valid? + match: assume yes (= hit)
block offset: 100
Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size B=8 bytes
[Figure: block offset 100 selects within bytes 0-7 of the line; the int (4 Bytes) is here]
If tag doesn't match (= miss): old line is evicted and replaced
Direct-Mapped Cache Simulation
t=1 s=2 b=1
4-bit addresses (address space size M=16 bytes)
S=4 sets, E=1 blocks/set, B=2 bytes/block
Address trace (reads, one byte per read):
  0 [0000₂], miss (cold)
  1 [0001₂], hit
  7 [0111₂], miss (cold)
  8 [1000₂], miss (cold)
  0 [0000₂], miss (conflict)
[Final cache state: Set 0: v=1, Tag=0, Block=M[0-1] (the last access to 0 evicted M[8-9], Tag=1); Set 1: empty; Set 2: empty; Set 3: v=1, Tag=0, Block=M[6-7]]
E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size B=8 bytes
[Figure: S sets, each holding two lines (v, tag, bytes 0-7); address of short int shown as (t bits | 0…01 | 100); the set-index bits 0…01 find the set]
E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size B=8 bytes
[Figure: both lines of the selected set are checked against the address (t bits | 0…01 | 100): compare both tags]
valid? + match: yes (= hit)
block offset: 100
E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size B=8 bytes
[Figure: compare both lines; valid? + match: yes (= hit); block offset 100; the short int (2 Bytes) is here]
No match or not valid (= miss):
• One line in set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
2-Way Set Associative Cache Simulation
t=2 s=1 b=1
4-bit addresses (M=16 bytes)
S=2 sets, E=2 blocks/set, B=2 bytes/block
Address trace (reads, one byte per read):
  0 [0000₂], miss (cold)
  1 [0001₂], hit
  7 [0111₂], miss (cold)
  8 [1000₂], miss (cold)
  0 [0000₂], hit
[Final cache state: Set 0: line 0 (v=1, Tag=00, Block=M[0-1]), line 1 (v=1, Tag=10, Block=M[8-9]); Set 1: line 0 (v=1, Tag=01, Block=M[6-7]), line 1 empty]
What about writes?
Multiple copies of data exist:
▪ L1, L2, L3, Main Memory, Disk
What to do on a write-hit?
▪ Write-through (write immediately to memory)
▪ Write-back (defer write to memory until replacement of line)
  ▪ Each cache line needs a dirty bit (set if data differs from memory)
  [Figure: cache line with valid bit v, dirty bit d, tag, and B = 2^b data bytes (0, 1, 2, …, B-1)]
What to do on a write-miss?
▪ Write-allocate (load into cache, update line in cache)
  ▪ Good if more writes to the location will follow
▪ No-write-allocate (writes straight to memory, does not load into cache)
Typical
▪ Write-through + No-write-allocate
▪ Write-back + Write-allocate
Why Index Using Middle Bits?
Direct mapped: One line per set
Assume: cache block size 8 bytes
Standard Method: Middle bit indexing
▪ Address of int: (t bits | 0…01 | 100); the middle bits find the set
Alternative Method: High bit indexing
▪ Address of int: (1…11 | t bits | 100); the high bits find the set
[Figure: S = 2^s sets of (v, tag, bytes 0-7), with arrows showing which address bits select the set under each scheme]
Illustration of Indexing Approaches
64-byte memory
▪ 6-bit addresses
16-byte, direct-mapped cache
▪ Block size = 4. (Thus, 4 sets; why?)
▪ 2 bits tag, 2 bits index, 2 bits offset
[Figure: memory blocks 0000xx through 1111xx listed beside a cache with Set 0, Set 1, Set 2, Set 3]
Middle Bit Indexing
Addresses of form TTSSBB
▪ TT: Tag bits
▪ SS: Set index bits
▪ BB: Offset bits
Makes good use of spatial locality
[Figure: consecutive blocks 0000xx, 0001xx, 0010xx, 0011xx, … map cyclically to Set 0, Set 1, Set 2, Set 3]
High Bit Indexing
Addresses of form SSTTBB
▪ SS: Set index bits
▪ TT: Tag bits
▪ BB: Offset bits
Program with high spatial locality would generate lots of conflicts
[Figure: each contiguous quarter of memory (e.g., 0000xx-0011xx) maps entirely to a single set]
Intel Core i7 Cache Hierarchy
[Figure: processor package containing Core 0 through Core 3; each core has its own Regs, L1 d-cache, L1 i-cache, and L2 unified cache; the L3 unified cache is shared by all cores and connects to main memory]
L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
L2 unified cache: 256 KB, 8-way, Access: 10 cycles
L3 unified cache: 8 MB, 16-way, Access: 40-75 cycles
Block size: 64 bytes for all caches.
Example: Core i7 L1 Data Cache
32 kB, 8-way set associative, 64 bytes/block, 47-bit address range
B = ?
S = ?, s = ?
E = ?, e = ?
C = ?
Block offset: ? bits
Set index: ? bits
Tag: ? bits
Stack Address 0x00007f7262a1e010: Block offset 0x??, Set index 0x??, Tag 0x??
Example: Core i7 L1 Data Cache
32 kB, 8-way set associative, 64 bytes/block, 47-bit address range
B = 64
S = 64, s = 6
E = 8, e = 3
C = 64 x 64 x 8 = 32,768
Block offset: 6 bits
Set index: 6 bits
Tag: 35 bits
Stack Address 0x00007f7262a1e010 (low 12 bits: 0000 0001 0000): Block offset 0x10, Set index 0x0, Tag 0x7f7262a1e
Cache Performance Metrics
Miss Rate
▪ Fraction of memory accesses not found in cache (misses / accesses) = 1 - hit rate
▪ Typical numbers (as %):
  ▪ 3-10% for L1
  ▪ can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
▪ Time to deliver a cached block to the processor
  ▪ includes time to determine whether line is in cache
▪ Typical numbers:
  ▪ 4 clock cycles for L1
  ▪ 10 clock cycles for L2
Miss Penalty
▪ Additional time required because of a miss
  ▪ typically 50-200 cycles for main memory (Trend: increasing!)
How Bad Can a Few Cache Misses Be?
Huge difference between a hit and a miss
▪ Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
▪ Consider this simplified example: cache hit time of 1 cycle, miss penalty of 100 cycles
▪ Average access time:
  ▪ 97% hits: 1 cycle + 0.03 x 100 cycles = 4 cycles
  ▪ 99% hits: 1 cycle + 0.01 x 100 cycles = 2 cycles
This is why "miss rate" is used instead of "hit rate"
Writing Cache Friendly Code
Make the common case go fast
▪ Focus on the inner loops of the core functions
Minimize the misses in the inner loops
▪ Repeated references to variables are good (temporal locality)
▪ Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories
Quiz Time!
Check out: https://canvas.cmu.edu/courses/17808
Today
Cache organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
The Memory Mountain
Read throughput (read bandwidth)
▪ Number of bytes read from memory per second (MB/s)
Memory mountain: Measured read throughput as a function of spatial and temporal locality.
▪ Compact way to characterize memory system performance.
Memory Mountain Test Function
Call test() with many combinations of elems and stride.
For each elems and stride:
1. Call test() once to warm up the caches.
2. Call test() again and measure the read throughput (MB/s)

long data[MAXELEMS];      /* Global array to traverse */

/* test - Iterate over first "elems" elements of
 *        array "data" with stride of "stride",
 *        using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems, limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}

mountain/mountain.c
The Memory Mountain
[Figure: 3D surface of read throughput (MB/s, 0 to 16000) versus working-set size (32k to 128m bytes) and stride (s1 to s11, x8 bytes), measured on a Core i7 Haswell, 2.1 GHz, with 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache, and 64 B block size. Labeled features: aggressive prefetching; slopes of spatial locality along the stride axis; ridges of temporal locality, i.e. plateaus where the working set fits in L1, L2, L3, or only Mem]
Today
Cache organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
Carnegie Mellon
Matrix Multiplication Example
Description:
▪ Multiply N x N matrices
▪ Matrix elements are doubles (8 bytes)
▪ O(N3) total operations
▪ N reads per source
element
▪ N values summed per destination
▪ but may be able to hold in register
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
42
/* ijk */
for (i=0; i
Stepping through columns in one row:
▪ for (i = 0; i < n; i++)
    sum += a[0][i];
▪ accesses successive elements
▪ if block size B > sizeof(aij), exploit spatial locality
▪ miss rate = sizeof(aij) / B
Stepping through rows in one column:
▪ for (i = 0; i < n; i++)
    sum += a[i][0];
▪ accesses distant elements
▪ no spatial locality!
▪ miss rate = 1 (i.e. 100%)
Matrix Multiplication (ijk)
Inner loop:

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}