Many of the following slides are taken with permission from
Complete Powerpoint Lecture Notes for
Computer Systems: A Programmer’s Perspective (CS:APP)
Randal E. Bryant and David R. O’Hallaron
http://csapp.cs.cmu.edu/public/lectures.html
The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.
Cache Memory and Performance
CachePerformance 1
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
– Hold frequently accessed blocks of main memory
CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory. Typical system structure:
[Figure: typical system structure — the CPU chip contains the register file, ALU, cache memories, and a bus interface; the bus interface connects over the system bus to an I/O bridge, which connects over the memory bus to main memory.]
Cache Memories
[Figure: general cache organization — an array of S = 2^s sets (sets 0 .. 2^s−1); each set contains E = 2^e lines (blocks); each line holds a valid bit v, a tag, and B = 2^b bytes of data (bytes 0 .. B−1).]
Cache size: C = S × E × B data bytes
General Cache Organization (S, E, B)
The “geometry” of the cache is defined by:
- S = 2^s — the number of sets in the cache
- E = 2^e — the number of lines (blocks) in a set
- B = 2^b — the number of bytes in a line (block)

direct-mapped cache: E = 1 (e = 0), S > 1
- only one possible location in the cache for each DRAM block
K-way set-associative cache: E = K > 1
- K possible locations (all in the same cache set) for each DRAM block
fully-associative cache: S = 1 (only one set), E = # of cache blocks
- each DRAM block can be at any location in the cache
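As a sketch of how these parameters are used, the Python snippet below splits an address into tag, set index, and block offset; the particular bit widths (s = 6, b = 6, i.e., 64 sets of 64-byte blocks) are illustrative, not taken from the slides.

```python
# Decompose a memory address for a cache with S = 2**s sets
# and B = 2**b bytes per block. The widths s=6, b=6 are
# hypothetical example parameters.
def decompose(addr, s, b):
    offset = addr & ((1 << b) - 1)            # low b bits: byte within block
    set_index = (addr >> b) & ((1 << s) - 1)  # next s bits: which set
    tag = addr >> (s + b)                     # remaining high bits: tag
    return tag, set_index, offset

tag, idx, off = decompose(0x12345, s=6, b=6)
print(hex(tag), idx, off)  # 0x12 13 5
```

A direct-mapped cache compares the tag against the single line in set `idx`; a K-way cache compares it against all K lines in that set.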
Cache Organization Types
miss rate: fraction of memory references not found in the cache (# misses / # accesses) = 1 − hit rate
Typical miss rates:
– 3–10% for L1
– can be quite small (e.g., < 1%) for L2, depending on cache size and locality
hit time: time to deliver a line in the cache to the processor; includes the time to determine whether the line is in the cache
Typical hit times:
- 1–2 clock cycles for L1
- 5–20 clock cycles for L2
miss penalty: additional time required for a data access because of a cache miss; typically 50–200 cycles for main memory
The trend is toward an increasing number of cycles... why?
Cache Performance Metrics
Let’s say that we have two levels of cache, backed by DRAM:
- L1 cache costs 1 cycle to access and has miss rate of 10%
- L2 cache costs 10 cycles to access and has miss rate of 2%
- DRAM costs 80 cycles to access (and has miss rate of 0%)
Then the average memory access time (AMAT) would be:

AMAT = 1                   always access the L1 cache
     + 0.10 × 10           P(miss in L1) × time to access L2
     + 0.10 × 0.02 × 80    P(miss in L1) × P(miss in L2) × time to access DRAM
     = 2.16 cycles
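The same calculation can be written as a short Python check, using exactly the parameters on this slide:

```python
# AMAT for the two-level hierarchy above:
# L1: 1 cycle, 10% miss; L2: 10 cycles, 2% miss; DRAM: 80 cycles.
l1_hit_time, l1_miss_rate = 1, 0.10
l2_access, l2_miss_rate = 10, 0.02
dram_access = 80

amat = (l1_hit_time
        + l1_miss_rate * l2_access
        + l1_miss_rate * l2_miss_rate * dram_access)
print(round(amat, 2))  # 2.16 cycles
```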
Calculating Average Access Time
[Figure: on a memory access, the request flows CPU → L1 → L2 → DRAM:
- L1: we go here 100% of the time*, 1 cycle per visit, average penalty* 1 cycle
- L2: we go here 10% of the time*, 10 cycles per visit, average penalty* 1 cycle
- DRAM: we go here 0.2% of the time*, 80 cycles per visit, average penalty* 0.16 cycles
* on a memory access]
Let’s see that again...
Let’s say that we have two levels of cache, backed by DRAM:
- L1 cache costs 1 cycle to access and has miss rate of 10%
- L2 cache costs 10 cycles to access and has miss rate of 2%
- DRAM costs 80 cycles to access (and has miss rate of 0%)
There can be a huge difference between the cost of a hit and a miss: it could be 100×, with just L1 and main memory.
Would you believe 99% hits is twice as good as 97%?
Consider:
L1 cache hit time of 1 cycle
L1 miss penalty of 100 cycles (to DRAM)
Average access time:
97% L1 hits: 1 cycle + 0.03 × 100 cycles = 4 cycles
99% L1 hits: 1 cycle + 0.01 × 100 cycles = 2 cycles
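A quick Python sketch of this comparison, using the hit time and miss penalty given above:

```python
# Average access time for an L1 with a 1-cycle hit time backed
# directly by memory with a 100-cycle miss penalty.
def avg_access_time(hit_rate, hit_time=1, miss_penalty=100):
    return hit_time + (1 - hit_rate) * miss_penalty

print(round(avg_access_time(0.97), 2))  # 4.0 cycles
print(round(avg_access_time(0.99), 2))  # 2.0 cycles -- twice as fast
```

Cutting misses from 3% to 1% cuts the average access time in half, even though the hit rate "only" improved by two percentage points.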
Let's think about those numbers
Components of CPU time:
- Program execution cycles: includes cache hit time
- Memory stall cycles: mainly from cache misses
With simplifying assumptions:

Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
Measuring Cache Performance
Given
- Instruction-cache miss rate = 2%
- Data-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (with ideal cache performance) = 2
- Loads & stores are 36% of instructions
Miss cycles per instruction
- Instruction-cache: 0.02 × 100 = 2
- Data-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
- The machine with an ideal cache would be 5.44/2 = 2.72 times faster
- We spend 3.44/5.44 = 63% of our execution time on memory stalls!
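The example above can be checked with a few lines of Python, using exactly the given parameters:

```python
# CPI example: base CPI 2, I-cache miss rate 2%, D-cache miss rate 4%,
# loads/stores are 36% of instructions, miss penalty 100 cycles.
base_cpi = 2.0
icache_stalls = 0.02 * 100          # 2 stall cycles per instruction
dcache_stalls = 0.36 * 0.04 * 100   # ~1.44 stall cycles per instruction

actual_cpi = base_cpi + icache_stalls + dcache_stalls
print(round(actual_cpi, 2))                              # 5.44
print(round(actual_cpi / base_cpi, 2))                   # ideal cache: 2.72x faster
print(round((icache_stalls + dcache_stalls) / actual_cpi, 2))  # 0.63 of time stalled
```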
Cache Performance Example
What if we improved the datapath so that the average ideal CPI was reduced?
- Instruction-cache miss rate = 2%
- Data-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (with ideal cache performance) = 1.5
- Loads & stores are 36% of instructions
Miss cycles per instruction will still be the same as before.
Actual CPI = 1.5 + 2 + 1.44 = 4.94
- The machine with an ideal cache would be 4.94/1.5 = 3.29 times faster
- We spend 3.44/4.94 = 70% of our execution time on memory stalls!
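To see the trend, the snippet below holds the memory stall cycles fixed (2 + 1.44 = 3.44 per instruction, as in both examples) and varies the base CPI; the stall fraction grows as the datapath improves.

```python
# Stall cycles per instruction are fixed by the cache (2 + 1.44 = 3.44);
# improving the base CPI makes stalls a larger share of execution time.
stall_cpi = 3.44
for base_cpi in (2.0, 1.5, 1.0):
    total_cpi = base_cpi + stall_cpi
    print(base_cpi, round(stall_cpi / total_cpi, 2))
# base CPI 2.0 -> 63% stalled, 1.5 -> 70%, 1.0 -> 77%
```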
Reduce Ideal CPI
When CPU performance increases
- effect of miss penalty becomes more significant
Decreasing base CPI
- greater proportion of time spent on memory stalls
Increasing clock rate
- memory stalls account for more CPU cycles
Can’t neglect cache behavior when evaluating system performance
Performance Summary
Primary (L1) cache
– Focus on minimal hit time
L2 cache
– Focus on a low miss rate, to avoid main memory accesses
– Hit time has less overall impact
Results
– L1 cache is usually smaller than a single-level cache would be
– L1 block size is smaller than L2 block size
Multilevel Cache Considerations
[Figure: processor package containing Cores 0–3; each core has its own registers, L1 d-cache, L1 i-cache, and unified L2 cache; an L3 unified cache, shared by all cores, sits between the cores and main memory.]
Core i7 2700 series:
- L1 i-cache and d-cache: 32 KB, 8-way; access: 4 cycles
- L2 unified cache: 256 KB, 8-way; access: 11 cycles
- L3 unified cache: 8 MB, 16-way; access: 25–40 cycles
- Block size: 64 bytes for all caches
Intel Core i7 Cache Hierarchy
i7 Sandy Bridge
Multiple copies of data may exist:
- L1
- L2
- DRAM
- Disk
Remember: each level of the hierarchy is a subset of the one below it. Suppose we write to a data block that's in L1.
If we update only the copy in L1, then we will have multiple, inconsistent versions!
If we update all the copies, we'll incur a substantial time penalty!
And what if we write to a data block that's not in L1?
What about writes?
What to do on a write-hit?
- Write-through: write immediately to memory
- Write-back: defer the write to memory until the line is replaced; needs a dirty bit (is the cached line different from memory or not?)
What about writes?
What to do on a write-miss?
- Write-allocate: load the block into the cache, then update the line in the cache; good if more writes to the location follow
- No-write-allocate: write immediately to memory, bypassing the cache
Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate
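To illustrate why these combinations pair up, here is a minimal sketch (a hypothetical one-block "cache", not anything from the slides) counting main-memory writes for a stream of stores:

```python
# Count main-memory writes under the two typical policy combinations.
# The "cache" here holds a single block -- just enough to show the idea.
def write_through_no_allocate(writes):
    return len(writes)  # every store goes straight to memory

def write_back_allocate(writes):
    cached = None        # address of the (one) block currently cached
    mem_writes = 0
    for addr in writes:
        if cached is not None and cached != addr:
            mem_writes += 1   # evict the dirty block: one memory write
        cached = addr         # write-allocate: block is now cached, dirty
    if cached is not None:
        mem_writes += 1       # final write-back of the dirty block
    return mem_writes

stream = [0x40] * 10          # ten stores to the same block
print(write_through_no_allocate(stream))  # 10 memory writes
print(write_back_allocate(stream))        # 1 memory write
```

Write-back + write-allocate wins when stores to the same block cluster together, which is exactly the locality caches are built to exploit.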
What about writes?