Caches I
http://xkcd.com/1353/
Roadmap
car *c = malloc(sizeof(car));
c->miles = 100;
c->gals = 17;
float mpg = get_mpg(c);
free(c);
Car c = new Car();
c.setMiles(100);
c.setGals(17);
float mpg =
c.getMPG();
Java:
C:
Assembly language:
Machine code:
0111010000011000
100011010000010000000010
1000100111000010
110000011111101000011111
Computer system:
OS:
Memory & data
Arrays & structs
Integers & floats
RISC-V assembly
Procedures & stacks
Executables
Memory & caches
Processor Pipeline
Performance
Parallelism
How does execution time grow with SIZE?
int array[SIZE];
int sum = 0;
for (int i = 0; i < 200000; i++) {
for (int j = 0; j < SIZE; j++) {
sum += array[j];
}
}
[Plot: execution time vs. SIZE]
What should we expect for this experiment?
Accessing and adding 200,000 × SIZE times
Actual Data
[Plot: measured execution time vs. SIZE]
Knee in the curve
After some size threshold
Memory “wall”
Steeper slope = execution time grows faster
Fits in cache, doesn’t fit in cache
How does execution time vary: ROW-wise vs. COL-wise?
ROW-wise
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
sum += array[i][j];
COL-wise
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
sum += array[j][i];
}
}
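To make the experiment concrete, here is a minimal sketch of how one might time the two traversals. The array size, repeat structure, and use of clock() are my own illustrative choices, not the course's exact setup.

// Sketch: time row-wise vs. column-wise traversal of a 2D array.
// SIZE and the timing method are illustrative, not the slide's exact experiment.
#include <stdio.h>
#include <time.h>

#define SIZE 4096

static int array[SIZE][SIZE];

int main(void) {
    long long sum = 0;
    clock_t start, end;

    start = clock();
    for (int i = 0; i < SIZE; i++)        // row-wise: stride-1 through memory
        for (int j = 0; j < SIZE; j++)
            sum += array[i][j];
    end = clock();
    printf("row-wise:    %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);

    start = clock();
    for (int i = 0; i < SIZE; i++)        // column-wise: stride of SIZE elements
        for (int j = 0; j < SIZE; j++)
            sum += array[j][i];
    end = clock();
    printf("column-wise: %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);

    return (int)(sum & 1);  // use sum so the loops aren't optimized away
}

On most machines the row-wise version runs several times faster, consistent with the row vs. column measurement (191 vs. 1713) shown with the actual data below.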
What should we expect for this experiment?
Accessing and adding 200,000 × SIZE times
Actual Data
[Plot: measured execution time, row-wise vs. column-wise]
Knee in the curve
After some size threshold
Memory “wall”
Steeper slope = execution time grows faster
Fits in cache, doesn’t fit in cache
[Chart data: Row = 191, Column = 1713 (execution time)]
How does execution time vary with stride?
for (int i = 0; i < SIZE; i += stride) {
sum += array[i];
}
Step:  1     2     3     4     8     16    32    64    128   256   1024
Time:  1     0.95  0.9   0.9   1     0.9   2     4     10    15    20
Programs 101
Load/Store Architectures:
Read data from memory (put in registers)
Manipulate it
Store it back to memory
int main (int argc, char* argv[ ]) {
int i;
int m = n;
int sum = 0;
for (i = 1; i <= m; i++) {
sum += i;
}
printf("...", n, sum);
}
C Code
main: addi sp,sp,-48
sw ra,44(sp)
sw fp,40(sp)
move fp,sp
sw a0,-36(fp)
sw a1,-40(fp)
la a5,n
lw a5,0(a5)
sw a5,-28(fp)
sw x0,-24(fp)
li a5,1
sw a5,-20(fp)
L2: lw a4,-20(fp)
lw a5,-28(fp)
blt a5,a4,L3
. . .
RISC-V Assembly
Instructions that read from
or write to memory…
Making memory accesses fast!
Cache basics
Principle of locality
Memory hierarchies
Cache organization
Program optimizations that consider caches
Processor-Memory Gap
Processor-Memory
Performance Gap
(grows 50%/year)
1989 first Intel CPU with cache on chip
1998 Pentium III has two cache levels on chip
[Plot: "Moore's Law" performance gap — µProc improves 55%/year (2X/1.5yr), DRAM improves 7%/year (2X/10yrs)]
How fast you can crunch stuff
How fast you can get stuff
Problem: Processor-Memory Bottleneck
[Figure: CPU (with Reg) connected to Main Memory over a bus]
Processor performance
doubled about
every 18 months
Bus latency / bandwidth
evolved much slower
Core 2 Duo:
Can process at least
256 Bytes/cycle
Core 2 Duo:
Bandwidth
2 Bytes/cycle
Latency
100-200 cycles (30-60ns)
Problem: lots of waiting on memory
cycle: single machine step (fixed-time)
What is a cycle
Only 2 hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
I can process avocados very quickly
It takes me a long time to get them
Problem: Processor-Memory Bottleneck
[Figure: same CPU–Reg–Main Memory diagram and numbers as the previous slide, with a Cache now inserted between the CPU and Main Memory]
Solution: caches
Avocado cache
Cache 💰
Pronunciation: “cash”
We abbreviate this as “$”
English: A hidden storage space
for provisions, weapons, and/or treasures
Computer: Memory with short access time used for the storage of frequently or recently used instructions (i-cache/I$) or data (d-cache/D$)
More generally: Used to optimize data transfers between any system elements with different characteristics (network interface cache, file cache, etc.)
General Cache Mechanics
[Figure: Memory is partitioned into blocks 0–15; the Cache currently holds copies of blocks 7, 9, 14, 3]
Memory: larger, slower, cheaper memory; viewed as partitioned into "blocks"
Cache: smaller, faster, more expensive memory; caches a subset of the blocks
Data is copied between them in block-sized transfer units
Note: the numbers in the figure label blocks, not bytes
General Cache Concepts: Hit
[Figure: Memory blocks 0–15; Cache holds blocks 7, 9, 14, 3]
Data in block b is needed (Request: 14)
Block b is in cache: Hit!
Data is returned to CPU
General Cache Concepts: Miss
[Figure: Memory blocks 0–15; Cache holds blocks 7, 9, 14, 3]
Data in block b is needed (Request: 12)
Block b is not in cache: Miss!
Block b is fetched from memory and stored in cache
Placement policy: determines where b goes
Replacement policy: determines which block gets evicted (victim)
Data is returned to CPU
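A tiny sketch of the hit/miss/fill mechanics just described, using a hypothetical direct-mapped cache. The block count, the modulo placement rule, and the structure names are assumptions for illustration, not the real hardware design.

// Illustrative direct-mapped cache of 4 block slots over a 16-block memory.
// Placement policy: block b always maps to slot b % 4.
// Replacement policy: whatever block currently occupies that slot is the victim.
#include <stdbool.h>
#include <stdio.h>

#define NUM_SLOTS 4

typedef struct {
    bool valid;
    int  block;   // which memory block this slot currently holds
} Slot;

static Slot cache[NUM_SLOTS];

// Returns true on a hit, false on a miss (after filling the slot).
bool access_block(int b) {
    Slot *s = &cache[b % NUM_SLOTS];   // placement: where b goes
    if (s->valid && s->block == b)
        return true;                   // hit: data returned to CPU
    // miss: fetch block b from memory, evicting the current occupant (victim)
    s->valid = true;
    s->block = b;
    return false;
}

int main(void) {
    int refs[] = {14, 12, 14, 3, 12};
    for (int i = 0; i < 5; i++)
        printf("block %2d -> %s\n", refs[i], access_block(refs[i]) ? "hit" : "miss");
    return 0;
}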
Why Caches Work
Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality ("equal"): recently referenced items are likely to be referenced again in the near future
Spatial locality ("near"): items with nearby addresses tend to be referenced close together in time
How do caches take advantage of this?
Example: Any Locality?
Data:
Temporal: sum referenced in each iteration
Spatial: array a[] accessed in stride-1 pattern
Instructions:
Temporal: cycle through loop repeatedly
Spatial: reference instructions in sequence
sum = 0;
for (i = 0; i < n; i++) {
sum += a[i];
}
return sum;
Locality Example #1
Access Pattern:
stride = ?
M = 3, N=4
Note: 76 is just one possible starting address of array a
int sum_array_rows(int a[M][N])
{
int i, j, sum = 0;
for (i = 0; i < M; i++)
for (j = 0; j < N; j++)
sum += a[i][j];
return sum;
}
Layout in Memory (row start addresses: 76, 92, 108)
a[0][0] a[0][1] a[0][2] a[0][3]
a[1][0] a[1][1] a[1][2] a[1][3]
a[2][0] a[2][1] a[2][2] a[2][3]
Contiguous order in memory: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], a[1][2], a[1][3], a[2][0], a[2][1], a[2][2], a[2][3]
1) a[0][0]
2) a[0][1]
3) a[0][2]
4) a[0][3]
5) a[1][0]
6) a[1][1]
7) a[1][2]
8) a[1][3]
9) a[2][0]
10) a[2][1]
11) a[2][2]
12) a[2][3]
Stride of 1 (4 bytes)
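One way to see where the stride-1 pattern comes from: in C's row-major layout, element a[i][j] of an int a[M][N] sits at base + 4 × (i × N + j). The sketch below prints the offsets for the slide's 3 × 4 example; the base address 76 is just the slide's illustrative starting address.

// Sketch: row-major addressing for int a[3][4], using the slide's example base of 76.
#include <stdio.h>

#define M 3
#define N 4

int main(void) {
    int base = 76;  // illustrative starting address from the slide
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            printf("a[%d][%d] at address %d\n", i, j, base + 4 * (i * N + j));
    return 0;
}

Successive row-wise accesses differ by 4 bytes (stride 1); successive column-wise accesses differ by 4 × N = 16 bytes (stride N), which is Example #2's pattern.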
Locality Example #2
int sum_array_cols(int a[M][N])
{
int i, j, sum = 0;
for (j = 0; j < N; j++)
for (i = 0; i < M; i++)
sum += a[i][j];
return sum;
}
Layout in Memory (row start addresses: 76, 92, 108)
Contiguous order in memory: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], a[1][2], a[1][3], a[2][0], a[2][1], a[2][2], a[2][3]
M = 3, N=4
a[0][0] a[0][1] a[0][2] a[0][3]
a[1][0] a[1][1] a[1][2] a[1][3]
a[2][0] a[2][1] a[2][2] a[2][3]
Access Pattern:
stride = ?
1) a[0][0]
2) a[1][0]
3) a[2][0]
4) a[0][1]
5) a[1][1]
6) a[2][1]
7) a[0][2]
8) a[1][2]
9) a[2][2]
10) a[0][3]
11) a[1][3]
12) a[2][3]
Stride of 4 elements (N × 4 = 16 bytes)
Locality Example #3
What is wrong with this code?
How can it be fixed?
int sum_array_3D(int a[M][N][L])
{
int i, j, k, sum = 0;
for (i = 0; i < N; i++)
for (j = 0; j < L; j++)
for (k = 0; k < M; k++)
sum += a[k][i][j];
return sum;
}
[Figure: the 3D array drawn as three stacked grids, one per value of the first index (labeled m = 0, 1, 2), each grid holding N rows of L elements, e.g. grid m = 2 holds a[2][0][0] … a[2][2][3]]
Grids, rows, cols
Stride = N * L
Locality Example #3
int sum_array_3D(int a[M][N][L])
{
int i, j, k, sum = 0;
for (i = 0; i < N; i++)
for (j = 0; j < L; j++)
for (k = 0; k < M; k++)
sum += a[k][i][j];
return sum;
}
What is wrong with this code?
How can it be fixed?
Layout in Memory (M = ?, N = 3, L = 4; addresses 76, 92, 108, 124, 140, 156, 172 mark successive groups of 4 elements)
Contiguous order in memory: a[0][0][0] … a[0][0][3], a[0][1][0] … a[0][1][3], a[0][2][0] … a[0][2][3], then a[1][0][0] … a[1][2][3], and so on
Fix: reorder the loops so that k (the first subscript) is outermost and j (the last subscript) is innermost, matching the row-major a[k][i][j] layout; see the sketch below.
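A minimal sketch of that fix: the same function with the loops reordered so the innermost loop walks stride-1 through memory (M, N, L are the same array dimensions as in the original sum_array_3D).

// Reordered loops: k (first subscript) outermost, j (last subscript) innermost,
// so consecutive iterations touch adjacent memory locations.
int sum_array_3D_fixed(int a[M][N][L])
{
    int i, j, k, sum = 0;

    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < L; j++)
                sum += a[k][i][j];

    return sum;
}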
Cache Performance Metrics
Huge difference between a cache hit and a cache miss
Could be 100x speed difference between accessing cache and main memory (measured in clock cycles)
Miss Rate (MR)
Fraction of memory references not found in cache (misses / accesses) = 1 - Hit Rate
Hit Time (HT)
Time to deliver a block in the cache to the processor
Includes time to determine whether the block is in the cache
Miss Penalty (MP)
Additional time required because of a miss
Hit takes HT
Miss takes HT + MP
Cache Performance
Two things hurt the performance of a cache:
Miss rate and miss penalty
Average Memory Access Time (AMAT): average time to access memory considering both hits and misses
AMAT = Hit time + Miss rate × Miss penalty
(abbreviated AMAT = HT + MR × MP)
99% hit rate can be twice as good as 97% hit rate!
Assume HT of 1 clock cycle and MP of 100 clock cycles
97%: AMAT =
99%: AMAT =
AMAT = always have to pay HT, only sometimes (MR) pay MP
HR = .97 -> 1 + .03 * 100 = 4 cycles
HR = .99 -> 1 + .01 * 100 = 2 cycles
Since MP is high, keep MR loooow
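The arithmetic above is easy to script; here is a small sketch that reproduces the 97% vs. 99% comparison (the helper function and standalone program are mine, not part of the course code).

// Sketch: AMAT = HT + MR * MP, evaluated for the slide's two hit rates.
#include <stdio.h>

static double amat(double ht, double mr, double mp) {
    return ht + mr * mp;   // always pay HT; pay MP only on the MR fraction of accesses
}

int main(void) {
    double ht = 1.0, mp = 100.0;   // clock cycles, from the slide
    printf("97%% hits: AMAT = %.1f cycles\n", amat(ht, 0.03, mp));  // 4 cycles
    printf("99%% hits: AMAT = %.1f cycles\n", amat(ht, 0.01, mp));  // 2 cycles
    return 0;
}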
Peer Instruction Question
Processor specs: 200 ps clock, MP of 50 clock cycles, MR of 0.02 misses/instruction, and HT of 1 clock cycle
AMAT =
Which improvement would be best?
A. 190 ps clock
B. Miss penalty of 40 clock cycles
C. MR of 0.015 misses/instruction
Answer: C
Baseline: AMAT = 1 + 0.02 × 50 = 2 cycles = 400 ps (1 ps = one trillionth of a second)
A: still 2 cycles, but 2 × 190 ps = 380 ps
B: 1 + 0.02 × 40 = 1.8 cycles = 360 ps
C: 1 + 0.015 × 50 = 1.75 cycles = 350 ps
Can we have more than one cache?
Why would we want to do that?
Avoid going to memory!
Typical performance numbers:
Miss Rate
L1 MR = 3-10%
L2 MR = Quite small (e.g. < 1%), depending on parameters, etc.
Hit Time
L1 HT = 4 clock cycles
L2 HT = 10 clock cycles
Miss Penalty
MP = 50-200 cycles for missing in L2 & going to main memory
Trend: increasing!
Smaller cache will have higher MR, lower HT
We have 3 levels now!
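With an L2 in the picture, one common way to extend AMAT treats an L1 miss as paying the L2 lookup, and an L2 miss as additionally paying the trip to memory. The sketch below uses the typical numbers above; the exact formula variant and parameter values are illustrative, not a spec of any particular chip.

// Sketch: two-level AMAT = L1_HT + L1_MR * (L2_HT + L2_MR * MemMP)
#include <stdio.h>

int main(void) {
    double l1_ht = 4, l1_mr = 0.05;   // L1: 4-cycle hit, ~5% miss rate (typical numbers above)
    double l2_ht = 10, l2_mr = 0.01;  // L2: 10-cycle hit, <1% miss rate
    double mem_mp = 100;              // illustrative penalty within the 50-200 cycle range
    double amat = l1_ht + l1_mr * (l2_ht + l2_mr * mem_mp);
    printf("two-level AMAT = %.2f cycles\n", amat);  // 4 + 0.05 * (10 + 1) = 4.55
    return 0;
}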
Memory Hierarchies
Some fundamental and enduring properties of hardware and software systems:
Faster = smaller = more expensive
Slower = bigger = cheaper
The gaps between memory technology speeds are widening
True for: registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.
Well-written programs tend to exhibit good locality
These properties complement each other beautifully
They suggest an approach for organizing memory and storage systems known as a memory hierarchy
For each level k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1
An Example Memory Hierarchy
[Pyramid, from smaller/faster/costlier per byte at the top to larger/slower/cheaper per byte at the bottom:]
registers: CPU registers hold words retrieved from L1 cache
on-chip L1 cache (SRAM): L1 cache holds cache lines retrieved from L2 cache
off-chip L2 cache (SRAM): L2 cache holds cache lines retrieved from main memory
main memory (DRAM): main memory holds disk blocks retrieved from local disks
local secondary storage (local disks): local disks hold files retrieved from disks on remote network servers
remote secondary storage (distributed file systems, web servers)
Implemented with different technologies!
disks, solid-state
DRAM: little wells that drain slowly, refreshed periodically
SRAM: flip-flops, little logic gates that feed back into each other, faster but uses more power!
That’s why they’re different sizes and costs
An Example Memory Hierarchy
[Same memory-hierarchy pyramid as the previous slide, annotated by who manages each level:]
registers: explicitly program-controlled (e.g., code names registers t1, t2 directly)
caches: the program just sees "memory"; hardware manages caching transparently
So, why haven’t we seen caches before now in this class?
Because they’re designed to be architecturally transparent!
Never affect correctness, only performance
Intel Core i7 Cache Hierarchy
[Figure: Processor package containing Cores 0 through 3; each core has its own Regs, L1 I$, L1 D$, and L2 unified cache; an L3 unified cache is shared by all cores; Main memory sits outside the package]
Block size:
64 bytes for all caches
L1 i-cache and d-cache:
32 KiB, 8-way,
Access: 4 cycles
L2 unified cache:
256 KiB, 8-way,
Access: 11 cycles
L3 unified cache:
8 MiB, 16-way,
Access: 30-40 cycles
Summary
Memory Hierarchy
Successively higher levels contain “most used” data from lower levels
Exploits temporal and spatial locality
Caches are intermediate storage levels used to optimize data transfers between any system elements with different characteristics
Cache Performance
Ideal case: found in cache (hit)
Bad case: not found in cache (miss), search in next level
Average Memory Access Time (AMAT) = HT + MR × MP
Hurt by Miss Rate and Miss Penalty
[Chart data (execution time vs. SIZE, from the "Actual Data" slide):
SIZE: 1, 2, 8, 16, 32, 65, 128, 256, 512, 768, 1024, 1536, 2048, 3036, 4096, 8192
Time: 0.01, 0.01, 0.02, 0.02, 0.05, 0.08, 0.15, 0.29, 0.56, 0.83, 1.13, 5.23, 8.61, 14.49, 20.03, 40.8]
Cache Terminology
[Figure: an access X goes to the cache; hits are served in Thit, the %miss fraction are served in Tmiss]
Access: read/write to the cache
Hit: desired data in cache
Miss: desired data not in cache
Fill: data placed in cache
%miss = #misses / #accesses
MPKI = #misses per 1000 instructions
Thit = time to read (write) data from the cache
Tmiss = time to read data into the cache
Tavg = Thit + %miss × Tmiss
Exploiting Locality: Memory Hierarchy
Hierarchy of memory components
Upper components: fast, small, expensive
Lower components: slow, big, cheap
Connected by "buses", which also have latency and bandwidth issues
Most frequently accessed data in M1; M1 plus the next most frequently accessed data in M2, etc.
Move data up and down the hierarchy
Optimize average access time: latency_avg = latency_hit + %miss × latency_miss
Attack each component
[Figure: CPU connected to a hierarchy of memories M1, M2, M3, M4]
Concrete Memory Hierarchy
0th level: registers
1st level: primary caches
  Split instruction (I$) and data (D$)
  Typically 8KB to 64KB each
2nd level: 2nd and 3rd caches (L2, L3)
  On-chip, typically made of SRAM
  2nd level typically ~256KB to 512KB
  "Last level cache" typically 4MB to 16MB
3rd level: main memory
  Made of DRAM ("Dynamic" RAM)
  Typically 1GB to 4GB for desktops/laptops; servers can have 100s of GB
4th level: disk (swap and files)
  Uses magnetic disks
[Figure: registers are compiler-managed; the caches (I$, D$, L2, L3) are hardware-managed; main memory and disk are software-managed (by the OS)]
Library Analogy Revisited
Registers are like books on your desk: actively being used, small capacity
Caches are like bookshelves: moderate capacity, pretty fast to access
Main memory is like the library: big, holds almost all data, but slow
Disk (swap) is like inter-library loan: very slow, but hopefully really uncommon
Evolution of Cache Hierarchies
[Figure: die photos comparing the Intel 486 (single 8KB I/D$) with the dual-core IBM Power5 (64KB I$, 64KB D$, 1.5MB L2, L3 tags on chip), and the Intel 486 with the Intel Penryn (two 4MB cache blocks)]
Chips today are 30–70% cache by area