The Memory Hierarchy
Random-Access Memory (RAM)
Key features
• RAM is traditionally packaged as a chip.
• Basic storage unit is normally a cell (one bit per cell).
• Multiple RAM chips form a memory.
RAM comes in two varieties:
• SRAM (Static RAM)
• DRAM (Dynamic RAM)
SRAM vs DRAM Summary
         Trans.    Access   Needs      Needs
         per bit   time     refresh?   EDC?    Cost   Applications
SRAM     4 or 6    1x       No         Maybe   100x   Cache memories
DRAM     1         10x      Yes        Yes     1x     Main memories, frame buffers
Nonvolatile Memories
DRAM and SRAM are volatile memories
• Lose information if powered off.
Nonvolatile memories retain value even if powered off
• Read-only memory (ROM): programmed during production
• Programmable ROM (PROM): can be programmed once
• Erasable PROM (EPROM): can be bulk erased (UV, X-ray)
• Electrically erasable PROM (EEPROM): electronic erase capability
• Flash memory: EEPROMs with partial (block-level) erase capability
  - Wears out after about 100,000 erasings
Uses for Nonvolatile Memories
• Firmware programs stored in a ROM (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems, ...)
• Solid state disks (replace rotating disks in thumb drives, smart phones, mp3 players, tablets, laptops, ...)
• Disk caches
Traditional Bus Structure Connecting CPU and Memory
A bus is a collection of parallel wires that carry address, data, and control signals.
Buses are typically shared by multiple devices.
[Figure: CPU chip containing a register file, ALU, and bus interface; the system bus connects the bus interface to an I/O bridge, and the memory bus connects the I/O bridge to main memory.]
Memory Read Transaction (1)
CPU places address A on the memory bus.
[Figure: load operation movq A, %rax. The CPU's bus interface drives address A onto the system bus; the I/O bridge passes it to the memory bus; main memory holds word x at address A.]
Memory Read Transaction (2)
Main memory reads A from the memory bus, retrieves word x, and places it on the bus.
[Figure: load operation movq A, %rax. Main memory reads A, fetches word x, and drives x onto the memory bus; the I/O bridge passes it to the system bus.]
Memory Read Transaction (3)
CPU reads word x from the bus and copies it into register %rax.
[Figure: load operation movq A, %rax. Word x arrives at the bus interface and is copied into register %rax.]
Memory Write Transaction (1)
CPU places address A on bus. Main memory reads it and waits for the corresponding data word to arrive.
[Figure: store operation movq %rax, A. Register %rax holds word y; address A travels over the I/O bridge to main memory.]
Memory Write Transaction (2)
CPU places data word y on the bus.
[Figure: store operation movq %rax, A. Data word y travels from the bus interface over the I/O bridge toward main memory.]
Memory Write Transaction (3)
Main memory reads data word y from the bus and stores it at address A.
[Figure: store operation movq %rax, A. Main memory reads y from the memory bus and stores it at address A.]
Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality: recently referenced items are likely to be referenced again in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.
Locality Example
int sum_array(int a[], int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
Data references
• Reference array elements in succession (stride-1 reference pattern): spatial locality
• Reference variable sum each iteration: temporal locality
Instruction references
• Reference instructions in sequence: spatial locality
• Cycle through loop repeatedly: temporal locality
Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom.
L0: CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM) holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM) holds cache lines retrieved from main memory.
L4: Main memory (DRAM) holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks) holds files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers).
Caches
Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
• For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work?
• Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
• Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
General Cache Concepts
Cache: smaller, faster, more expensive memory; caches a subset of the blocks.
Data is copied in block-sized transfer units.
Memory: larger, slower, cheaper; viewed as partitioned into "blocks" (numbered 0 through 15 in the figure).
[Figure: a 4-block cache holding blocks such as 8, 9, 14, and 3, above a 16-block memory.]
General Cache Concepts: Hit
Request: 14. Data in block b is needed.
Block b is in cache: Hit!
[Figure: the cache holds blocks 8, 9, 14, and 3, so the request for block 14 is served from the cache.]
General Cache Concepts: Miss
Request: 12. Data in block b is needed.
Block b is not in cache: Miss!
Block b is fetched from memory.
Block b is stored in cache.
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
[Figure: block 12 is fetched from the 16-block memory and placed in the cache, evicting one of the cached blocks.]
General Caching Concepts: Types of Cache Misses
Cold (compulsory) miss
• Cold misses occur because the cache is empty.
Conflict miss
• Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  - E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
• Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  - E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time (see the sketch below).
Capacity miss
• Occurs when the set of active cache blocks (working set) is larger than the cache.
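To make the conflict-miss pattern concrete, here is a minimal C sketch (illustrative code, not from the slides) of a 4-slot direct-mapped placement where block i must live at position i mod 4. Replaying the trace 0, 8, 0, 8, 0, 8 misses on every reference, even though most of the cache stays empty.

#include <stdio.h>

#define NSLOTS 4                     /* block i maps to slot i mod 4 */

int main(void) {
    int slot_holds[NSLOTS];          /* which memory block each slot holds */
    for (int i = 0; i < NSLOTS; i++)
        slot_holds[i] = -1;          /* start empty */

    int trace[] = {0, 8, 0, 8, 0, 8};
    int misses = 0;
    for (int i = 0; i < 6; i++) {
        int b = trace[i];
        if (slot_holds[b % NSLOTS] != b) {   /* cold or conflict miss */
            misses++;
            slot_holds[b % NSLOTS] = b;      /* evict the previous occupant */
        }
    }
    printf("%d misses out of 6 references\n", misses);  /* prints 6 */
    return 0;
}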
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
• Hold frequently accessed blocks of main memory
• CPU looks first for data in cache
Typical system structure:
[Figure: CPU chip containing register file, ALU, cache memory, and bus interface; system bus to I/O bridge; memory bus to main memory.]
General Cache Organization (S, E, B)
• S = 2^s sets
• E = 2^e lines per set
• B = 2^b bytes per cache block (the data)
Each line holds a valid bit v, a tag, and B data bytes (numbered 0, 1, 2, ..., B-1).
Cache size: C = S x E x B data bytes
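As a software picture of this organization, a cache with parameters (S, E, B) might be modeled as follows. This is a hypothetical sketch; the type and field names are ours, not a real API.

#include <stdint.h>

enum { B = 64, E = 8, S = 128 };   /* example: 2^b = 64, 2^e = 8, 2^s = 128 */

typedef struct {
    int      valid;                /* valid bit v */
    uint64_t tag;                  /* tag bits from the address */
    uint8_t  data[B];              /* B = 2^b data bytes, offsets 0..B-1 */
} cache_line;

typedef struct {
    cache_line lines[E];           /* E = 2^e lines per set */
} cache_set;

typedef struct {
    cache_set sets[S];             /* S = 2^s sets; C = S*E*B data bytes */
} cache;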
Cache Read
Address of word: t tag bits | s set index bits | b block offset bits
• Locate set
• Check if any line in set has matching tag
• Yes + line valid: hit
• Locate data starting at offset (data begins at this offset within the block)
[Figure: the s set-index bits select one of S = 2^s sets; each of the E = 2^e lines in the set has a valid bit, a tag, and B = 2^b data bytes.]
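The read steps above map almost line-for-line onto code. A minimal lookup sketch (a hypothetical helper of ours, reusing the cache/cache_set/cache_line model sketched earlier; s and b are the set-index and offset widths):

#include <stddef.h>

/* Split the address into tag | set index | block offset, then search
   the selected set. Returns a pointer to the byte on a hit, NULL on a miss. */
uint8_t *cache_read(cache *c, uint64_t addr, int s, int b) {
    uint64_t offset = addr & ((1ULL << b) - 1);         /* low b bits */
    uint64_t set    = (addr >> b) & ((1ULL << s) - 1);  /* next s bits */
    uint64_t tag    = addr >> (b + s);                  /* remaining t bits */

    cache_set *st = &c->sets[set];                      /* 1. locate set */
    for (int i = 0; i < E; i++) {                       /* 2. check each line */
        cache_line *ln = &st->lines[i];
        if (ln->valid && ln->tag == tag)                /* 3. valid + match: hit */
            return &ln->data[offset];                   /* 4. data starts at offset */
    }
    return NULL;                                        /* miss */
}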
Ex.: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
Address of int: t bits | 0...01 | 100
The set-index bits (0...01) select the set ("find set"); each set is a single line with a valid bit, a tag, and bytes 0-7.
[Figure: S = 2^s one-line sets; the address's set-index field picks one of them.]
Ex.: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
Address of int: t bits | 0...01 | 100 (block offset)
Valid? + tag match: assume yes = hit.
[Figure: the selected line's valid bit and tag are checked against the address's tag field.]
Ex.: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
Address of int: t bits | 0...01 | 100 (block offset)
Valid? + tag match: assume yes = hit. The int (4 bytes) is at block offset 100.
If tag doesn't match: old line is evicted and replaced.
Direct-Mapped Cache Simulation
t=1, s=2, b=1
M=16 bytes (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 line/set

Address trace (reads, one byte per read):
0 [0000₂]  miss
1 [0001₂]  hit
7 [0111₂]  miss
8 [1000₂]  miss
0 [0000₂]  miss

Final cache state:
        v  Tag  Block
Set 0:  1  0    M[0-1]
Set 1:  0  -    -
Set 2:  0  -    -
Set 3:  1  0    M[6-7]
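The trace above can be replayed mechanically. A small sketch (illustrative code, ours) that models exactly this configuration (t=1, s=2, b=1, direct mapped) and prints the same hit/miss sequence:

#include <stdio.h>

int main(void) {
    int valid[4] = {0}, tag[4] = {0};   /* S=4 sets, one line each */
    int trace[] = {0, 1, 7, 8, 0};

    for (int i = 0; i < 5; i++) {
        int a   = trace[i];
        int set = (a >> 1) & 0x3;       /* s=2 set-index bits */
        int t   = a >> 3;               /* t=1 tag bit */
        if (valid[set] && tag[set] == t) {
            printf("%d: hit\n", a);
        } else {
            printf("%d: miss\n", a);    /* fetch block, evicting the old line */
            valid[set] = 1;
            tag[set]   = t;
        }
    }
    return 0;                           /* prints miss, hit, miss, miss, miss */
}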
E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: t bits | 0...01 | 100
The set-index bits select the set ("find set"); each set holds two lines, each with a valid bit, a tag, and bytes 0-7.
[Figure: S two-line sets; the address's set-index field picks one set.]
E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: t bits | 0...01 | 100 (block offset)
Compare both lines in the set: valid? + tag match: yes = hit.
[Figure: the tags of both lines in the selected set are compared in parallel against the address's tag field.]
E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: t bits | 0...01 | 100 (block offset)
Compare both lines in the set: valid? + tag match: yes = hit. The short int (2 bytes) is at the block offset within the matching line.
No match:
• One line in set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), ...
2-Way Set Associative Cache Simulation
t=2, s=1, b=1
M=16 bytes (4-bit addresses), B=2 bytes/block, S=2 sets, E=2 lines/set

Address trace (reads, one byte per read):
0 [0000₂]  miss
1 [0001₂]  hit
7 [0111₂]  miss
8 [1000₂]  miss
0 [0000₂]  hit

Final cache state:
        v  Tag  Block
Set 0:  1  00   M[0-1]
        1  10   M[8-9]
Set 1:  1  01   M[6-7]
        0  -    -
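The same replay for the 2-way configuration shows how associativity saves the final reference: blocks 0 and 8 now coexist in set 0. (Illustrative code, ours. Since this trace never fills a set, the victim is simply the first empty line; a real cache would apply LRU or random replacement here.)

#include <stdio.h>

int main(void) {
    int valid[2][2] = {{0}}, tag[2][2] = {{0}};  /* S=2 sets, E=2 lines */
    int trace[] = {0, 1, 7, 8, 0};

    for (int i = 0; i < 5; i++) {
        int a = trace[i], set = (a >> 1) & 1, t = a >> 2;
        int hit = 0;
        for (int e = 0; e < 2; e++)              /* compare both lines */
            if (valid[set][e] && tag[set][e] == t)
                hit = 1;
        if (hit) {
            printf("%d: hit\n", a);
        } else {
            printf("%d: miss\n", a);
            int victim = valid[set][0] ? 1 : 0;  /* fill an empty line */
            valid[set][victim] = 1;
            tag[set][victim]   = t;
        }
    }
    return 0;                                    /* miss, hit, miss, miss, hit */
}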
Ex.: Fully Associative Cache (S = 1)
Fully associative: one set containing all lines. Assume: cache block size 8 bytes.
Address of int: t bits | 100 (no set-index bits)
[Figure: a single set of E lines, each with a valid bit, a tag, and bytes 0-7; the tags of all lines are compared against the address's tag field.]
Cache associativity
E = 1 (1 line per set): direct mapped
E = n (n lines per set): n-way associative
S = 1 (1 set containing all lines): fully associative
Cache associativity
Requested address gives all information needed to locate data in the cache:
• Set index: which bookshelf?
• Tag: book title
• Block index: what page of the book?
Cache Size
C = S * E * B
• S = # sets
• E = # lines / set (associativity)
• B = # bytes / block (in a line)
Suppose we're given:
• 32 KB cache
• 64 B line
• 4-way set associative
• 32-bit architecture
How many tag, set, block index bits?
Cache Size
Suppose we're given:
• 32 KB cache
• 64 B line
• 4-way set associative
• 32-bit architecture
How many tag, set, block index bits?
64 B per block = ? bits
How many lines in a 32 KB cache? How many sets to hold those lines?
Cache Size
Suppose we're given:
• 32 KB cache
• 64 B line
• 4-way set associative
• 32-bit architecture
How many tag, set, block index bits?
64 B per block = 6 bits
32 KB cache / 64 B per line = 2^9 lines
2^9 lines / 4 lines per set = 2^7 sets
So 6 block index bits, 7 set index bits, ??? tag bits
Cache Size
Suppose we're given:
• 32 KB cache
• 64 B line
• 4-way set associative
• 32-bit architecture
How many tag, set, block index bits?
64 B per block = 6 bits
32 KB cache / 64 B per line = 2^9 lines
2^9 lines / 4 lines per set = 2^7 sets
So 6 block index bits, 7 set index bits
32 - 6 - 7 = 19 tag bits
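The arithmetic on this slide is just logarithms and a subtraction; a quick C sketch (illustrative, ours) that reproduces it:

#include <stdio.h>

/* integer log2, valid for exact powers of two */
static int ilog2(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    unsigned C = 32 * 1024;          /* 32 KB cache */
    unsigned B = 64;                 /* 64 B line */
    unsigned E = 4;                  /* 4-way set associative */
    int addr_bits = 32;              /* 32-bit architecture */

    int b = ilog2(B);                /* 6 block index bits */
    int s = ilog2(C / B / E);        /* 2^9 lines / 4 per set = 2^7 sets -> 7 bits */
    int t = addr_bits - s - b;       /* 32 - 7 - 6 = 19 tag bits */
    printf("block=%d set=%d tag=%d\n", b, s, t);
    return 0;
}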
Cache Size
Read 0x2b74ce2d
Suppose we're given: (19 tag, 7 set, 6 block)
What are the tag, set, block index?
Cache Size
Read 0x2b74ce2d
Suppose we're given: (19 tag, 7 set, 6 block)
What are the tag, set, block index?
0x2b74ce2d =
0010 1011 0111 0100 1100 1110 0010 1101
Cache Size
Read 0x2b74ce2d
Suppose we're given: (19 tag, 7 set, 6 block)
What are the tag, set, block index?
0x2b74ce2d =
0010101101110100110 | 0111000 | 101101
Tag = 0010101101110100110
Set index = 0111000 (56)
Block index = 101101 (45)
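The same decomposition in code: shifting and masking 0x2b74ce2d with the widths derived above recovers the three fields (illustrative sketch, ours):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr = 0x2b74ce2d;
    int b = 6, s = 7;                                /* widths from above */

    uint32_t block = addr & ((1u << b) - 1);         /* low 6 bits  -> 45 */
    uint32_t set   = (addr >> b) & ((1u << s) - 1);  /* next 7 bits -> 56 */
    uint32_t tag   = addr >> (b + s);                /* top 19 bits */
    printf("tag=0x%x set=%u block=%u\n", tag, set, block);
    return 0;
}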
What about writes?
Multiple copies of data exist: L1, L2, L3, main memory, disk.
What to do on a write-hit?
• Write-through (write immediately to memory)
• Write-back (defer write to memory until replacement of line)
  - Need a dirty bit (line different from memory or not)
What to do on a write-miss?
• Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location follow
• No-write-allocate (writes straight to memory, does not load into cache)
Typical combinations:
• Write-through + No-write-allocate
• Write-back + Write-allocate
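A minimal sketch of the write-back + write-allocate combination (illustrative code, ours; the actual memory traffic is left as comments, and the line type extends the valid/tag pair with the dirty bit mentioned above):

typedef struct {
    int valid, dirty;          /* dirty: line differs from memory */
    unsigned tag;
    /* ... B data bytes ... */
} wline;

void write_hit(wline *ln) {
    ln->dirty = 1;             /* write-back: defer the memory update */
}

void write_miss(wline *ln, unsigned tag) {
    if (ln->valid && ln->dirty) {
        /* write the old line back to memory before reuse */
    }
    /* fetch the block for 'tag' from memory (write-allocate) */
    ln->valid = 1;
    ln->tag   = tag;
    ln->dirty = 1;             /* then update the line in cache */
}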
Intel Core i7 Cache Hierarchy
• L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 10 cycles
• L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
• Block size: 64 bytes for all caches
[Figure: processor package with cores 0-3; each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; the L3 unified cache is shared by all cores and connects to main memory.]
Cache Performance Metrics
Miss Rate
• Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
• Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
• Time to deliver a line in the cache to the processor
  - Includes time to determine whether the line is in the cache
• Typical numbers: 4 clock cycles for L1; 10 clock cycles for L2
Miss Penalty
• Additional time required because of a miss
  - Typically 50-200 cycles for main memory (Trend: increasing!)
Cache Performance Metrics
Average memory access time (AMAT)
Each request may hit or miss. What is the average time over many requests?
AMAT = H + MR * AMP
where H = hit latency, MR = miss rate, AMP = average miss penalty.
Cache Performance Metrics
AMAT = H + MR * AMP
(H = hit latency, MR = miss rate, AMP = average miss penalty)
Suppose we're given:
• Hit latency of L1 = 1 cycle
• Hit rate of L1 = 30%
• Latency of main memory = 200 cycles
What is AMAT?
Cache Performance Metrics
AMAT = H + MR * AMP
Suppose we're given:
• Hit latency of L1 = 1 cycle
• Hit rate of L1 = 30%
• Latency of main memory = 200 cycles
What is AMAT?
AMAT = 1 + 0.70 * 200 = 141 cycles
Cache Performance Metrics
What about multi-level caches?
AMAT(L1) = H(L1) + MR(L1) * AMAT(L2)
AMAT(L2) = H(L2) + MR(L2) * AMP
Suppose we're given:
• L1: 1 cycle, 30% hit rate
• L2: 10 cycles, 55% hit rate
• RAM: 200 cycles
What is AMAT?
Cache Performance Metrics
What about multi-level caches?
AMAT(L1) = H(L1) + MR(L1) * AMAT(L2)
AMAT(L2) = H(L2) + MR(L2) * AMP
Suppose we're given:
• L1: 1 cycle, 30% hit rate
• L2: 10 cycles, 55% hit rate
• RAM: 200 cycles
What is AMAT?
AMAT(L1) = 1 + 0.70 * AMAT(L2)
Cache Performance Metrics
What about multi-level caches?
AMAT(L1) = H(L1) + MR(L1) * AMAT(L2)
AMAT(L2) = H(L2) + MR(L2) * AMP
Suppose we're given:
• L1: 1 cycle, 30% hit rate
• L2: 10 cycles, 55% hit rate
• RAM: 200 cycles
What is AMAT?
AMAT = 1 + 0.70 * (10 + 0.45 * 200) = 71 cycles
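The recurrence is easy to evaluate in code. A small C sketch (illustrative, ours) that reproduces both the single-level 141-cycle and the two-level 71-cycle results:

#include <stdio.h>

/* AMAT = H + MR * AMP; for multi-level caches the miss penalty of
   one level is the AMAT of the next. */
static double amat(double h, double mr, double amp) {
    return h + mr * amp;
}

int main(void) {
    double ram = 200.0;                  /* main memory latency */

    printf("L1 only: %g cycles\n",       /* 1 + 0.70*200 = 141 */
           amat(1.0, 0.70, ram));

    double l2 = amat(10.0, 0.45, ram);   /* 10 + 0.45*200 = 100 */
    printf("L1 + L2: %g cycles\n",       /* 1 + 0.70*100 = 71  */
           amat(1.0, 0.70, l2));
    return 0;
}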
Let’s think about those numbers
Huge difference between a hit and a miss
• Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
• Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
• Average access time:
  - 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
  - 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
This is why "miss rate" is used instead of "hit rate".
Writing Cache Friendly Code
Focus on the inner loops of the core functions
• Minimize the misses in the inner loops
Make the common case go fast
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
Key idea: our qualitative notion of locality is quantified through our understanding of cache memories.
Cache Performance
What is the hit rate of this code?
Assume: 64 B block, int is 4 B.

int a[m][n];
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        a[i][j] += 1;
Cache Performance
What is the hit rate of this code?
Assume: 64 B block, int is 4 B.

int a[m][n];
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        a[i][j] += 1;

Each cache line (64 B) holds 16 ints.
The first access to a line will miss; the next 15 will hit.
Then another miss, then another 15 hits, ...
Hit rate = 15/16 (see the sketch below).
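Counting it out confirms the 15/16 figure. A sketch (illustrative, ours; it assumes the array starts on a 64 B block boundary) that models a miss exactly when a 4-byte int is the first one touched in its block:

#include <stdio.h>

int main(void) {
    long n = 1L << 20;                  /* number of stride-1 int accesses */
    long hits = 0, misses = 0;

    for (long i = 0; i < n; i++) {
        if ((i * 4) % 64 == 0)          /* first int in a new 64 B block */
            misses++;
        else
            hits++;
    }
    printf("hit rate = %ld/%ld = %.4f\n",
           hits, n, (double)hits / n);  /* 15/16 = 0.9375 */
    return 0;
}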
Matrix Multiplication Example (ijk)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: row (i,*) of A, column (*,j) of B, element (i,j) of C.
• A: steps through a row with stride 1, so it has spatial locality; miss rate = sizeof(a_ij) / B.
• B: steps down a column, accessing distant elements; no spatial locality! miss rate = 1 (i.e., 100%).
• C: variable sum held in register, so the inner loop makes no references to C.