Lecture 14 – Memory and Cache 2
Stewart Smith Digital Systems Design 4
Digital System Design 4
Memory & Caches 2
Dr Stewart Smith
Stewart Smith Digital Systems Design 4
Last Section
•Memory Access Locality
•Memory Hierarchy
•Caches
•DRAM technology
Stewart Smith Digital Systems Design 4
This Section
•Cache Performance
•Associative Caches
•Multilevel Caches
•Real examples
Stewart Smith Digital Systems Design 4
Measuring Cache Performance
• Components of CPU time
‣ Program execution cycles
– Includes cache hit time
‣ Memory stall cycles
– Mainly from cache misses
• With simplifying assumptions:
penalty Miss
nInstructio
Misses
Program
nsInstructio
penalty Missrate Miss
Program
accessesMemory
cycles stallMemory
××=
××=
Stewart Smith Digital Systems Design 4
Cache Performance Example
• Given
‣ Instruction-cache miss rate = 2%
‣ Data-cache miss rate = 4%
‣ Miss penalty = 100 cycles
‣ Base CPI (ideal cache) = 2
‣ Load & stores are 36% of instructions
• Miss cycles per instruction
‣ Instruction-cache: 0.02 × 100 = 2
‣ Data-cache: 0.36 × 0.04 × 100 = 1.44
• Actual CPI = 2 + 2 + 1.44 = 5.44
‣ Ideal CPU is 5.44/2 =2.72 times faster
• Amount of time spent on memory stalls:
‣ 3.44/5.44 = 63%
Stewart Smith Digital Systems Design 4
Cache Performance Example
• What happens with a faster processor?
• If the CPI is 1 rather than 2?
• Miss cycles per instruction
‣ Instruction-cache: 0.02 × 100 = 2
‣ Data-cache: 0.36 × 0.04 × 100 = 1.44
• Actual CPI = 1 + 2 + 1.44 = 4.44
‣ Ideal CPU is 4.44/1 =4.44 times faster
• Amount of time spent on memory stalls:
‣ 3.44/4.44 = 77%
‣ Increased from 63%
• Demonstration of Amdahl’s Law
‣ More on that in previous lecture on performance
Stewart Smith Digital Systems Design 4
Average Access Time
• Hit time is also important for performance
• Average memory access time (AMAT)
‣ AMAT = Hit time + Miss rate × Miss penalty
• Example
‣ CPU with 1 ns clock,
‣ hit time = 1 cycle,
‣ miss penalty = 20 cycles,
‣ I-cache miss rate = 5%
‣ AMAT = 1 + 0.05 × 20 = 2ns
– 2 cycles per instruction
Stewart Smith Digital Systems Design 4
Associative Caches
• Fully associative
‣ Allow a given block to go in any cache entry
‣ Requires all entries to be searched at once
‣ Comparator per entry (expensive)
• n-way set associative
‣ Each set contains n entries
‣ Block number determines which set
– (Block number) modulo (#Sets in cache)
‣ Search all entries in a given set at once
‣ n comparators (less expensive)
Stewart Smith Digital Systems Design 4
Associative Cache Example
Stewart Smith Digital Systems Design 4
Spectrum of Associativity
Stewart Smith Digital Systems Design 4
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0
8 0
0 0
6 2
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0
0 0
6 2
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 Mem[0]
0 0
6 2
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0
6 2
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 Mem[8]
6 2
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 Mem[0]
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 Mem[0] Mem[6]
Block
address
Cache
Index
Hit or
Miss?
Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
Associativity Example
• Compare 4-block caches
‣ Direct mapped, 2-way set associative,
fully associative
‣ Block address access sequence: 0, 8, 0, 6, 8
• Direct mapped
‣ Index = (Block address) modulo (number of entries)
Stewart Smith Digital Systems Design 4
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0
8 0
0 0
6 0
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0
0 0
6 0
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 Mem[0]
0 0
6 0
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0
6 0
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 Mem[0] Mem[8]
6 0
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 Mem[0] Mem[8]
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 Mem[0] Mem[6]
Block
address
Set
Index
Hit or
Miss?
Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
Associativity Example
• 2-way set associative
‣ Each set can contain two blocks
‣ Same address Sequence: 0, 8, 0, 6, 8
‣ Set index = (Block)Mod(number of sets)
Stewart Smith Digital Systems Design 4
Block
address
Hit or
Miss? Cache content after access
0
8
0
6
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8
0
6
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 Mem[0]
0
6
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0
6
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 Mem[0] Mem[8]
6
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 Mem[0] Mem[8]
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 Mem[0] Mem[8] Mem[6]
Associativity Example
• Fully associative
‣ Blocks can go anywhere
‣ Same address sequence: 0, 8, 0, 6, 8
Block
address
Hit or
Miss? Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 hit Mem[0] Mem[8] Mem[6]
Stewart Smith Digital Systems Design 4
How Much Associativity
• Increased associativity decreases miss rate
‣ But with diminishing returns
• Simulation of a system with 64KiB data-
cache, 16-word blocks, SPEC2000
benchmarks
‣ 1-way: 10.3%
‣ 2-way: 8.6%
‣ 4-way: 8.3%
‣ 8-way: 8.1%
Stewart Smith Digital Systems Design 4
Dividing Up the Address
Memory Address
Increases with
block size
Decreases with
greater associativity
Everything else…
Tag Index Block Offset
Stewart Smith Digital Systems Design 4
Quick Quiz 1
• If we have a directly mapped cache of size
1 KiB with block size of 4 words, how
many bits are required for the index?
A. 2 bits from the address
B. 4 bits
C. 6 bits
D. 8 bits
!
Stewart Smith Digital Systems Design 4
Quick Quiz 2
• In the same directly mapped cache of size
1 KiB with block size of 4 words, what is
the size of the block offset?
A. Lowest 4 bits (including word/byte offset)
B. 5 bits
C. 6 bits
D. 8 bits
!
Stewart Smith Digital Systems Design 4
Quick Quiz 3
• In the same directly mapped cache of size
1 KiB with block size of 4 words, what is
the tag size if the address is 32 bits wide?
A. Upper 16 bits of the address
B. 18 bits
C. 20 bits
D. 22 bits
!
Stewart Smith Digital Systems Design 4
Quick Quiz 4
• If we change the cache architecture to be
4-way set associative, what happens to
the index if other parameters stay the
same?
A. Increases by 1 bit
B. Stays the same size
C. Decreases by 2 bits
D. Decreases by 4 bits
!
Stewart Smith Digital Systems Design 4
Set Associative Cache Organisation
Stewart Smith Digital Systems Design 4
Replacement Policy
• Direct mapped: no choice
• Set associative
‣ Prefer non-valid entry, if there is one
‣ Otherwise, choose among entries in the set
• Least-recently used (LRU)
‣ Choose the one unused for the longest time
– Simple for 2-way, manageable for 4-way, too hard
beyond that – though there are methods…
• Random
‣ Gives approximately the same performance as LRU
for high associativity
Stewart Smith Digital Systems Design 4
Multilevel Caches
• Primary cache attached to CPU
‣ Small, but fast
• Level-2 cache services misses from primary
cache
‣ Larger, slower, but still faster than main memory
• Main memory services L-2 cache misses
• L-3 cache is now pretty standard on modern
multicore processors.
Stewart Smith Digital Systems Design 4
Multilevel Cache Example
• Given
‣ CPU base CPI = 1, clock rate = 4 GHz
‣ Miss rate/instruction = 2 %
‣ Main memory access time = 100 ns
• With just primary cache
‣ Miss penalty = 100 ns/0.25 ns = 400 cycles
‣ Effective CPI = 1 + 0.02 × 400 = 9
Stewart Smith Digital Systems Design 4
Example (cont.)
• Now add L-2 cache
‣ Access time = 5 ns
‣ Global miss rate to main memory = 0.5 %
• Primary miss with L-2 hit
‣ Penalty = 5 ns/0.25 ns = 20 cycles
• Primary miss with L-2 miss
‣ Extra penalty = 400 cycles
• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9/3.4 = 2.6
Stewart Smith Digital Systems Design 4
Multilevel Cache Considerations
• Primary cache
‣ Focus on minimal hit time
• L-2 cache
‣ Focus on low miss rate to avoid main memory
access
‣ Hit time has less overall impact
• Results
‣ L-1 cache usually smaller than a single cache
‣ L-1 block size may be smaller than L-2 block size
Stewart Smith Digital Systems Design 4
Interactions with Advanced CPUs
• Out-of-order CPUs can execute instructions
during cache miss
‣ Pending store stays in load/store unit
‣ Dependent instructions wait in reservation
stations
– Independent instructions continue
• Effect of miss depends on program data flow
‣ Much harder to analyse
‣ Use system simulation
Stewart Smith Digital Systems Design 4
Interactions with Software
• Misses depend on
memory access patterns
‣ Algorithm behavior
‣ Compiler optimization for
memory access
Stewart Smith Digital Systems Design 4
Interactions with Software
• Misses depend on
memory access patterns
‣ Algorithm behavior
‣ Compiler optimization for
memory access
Stewart Smith Digital Systems Design 4
Interactions with Software
• Misses depend on
memory access patterns
‣ Algorithm behavior
‣ Compiler optimization for
memory access
Stewart Smith Digital Systems Design 4
Multilevel Cache Example
Intel Nehalem processor with labelled sections
SMT
CPU
Core 0
SMT
CPU
Core 1
SMT
CPU
Core 2
SMT
CPU
Core 332kiB
L1D
32kiB
L1I
256 kiB
L2
256 kiB
L2
256 kiB
L2
256 kiB
L2
2 MiB of
8 MiB L3
Cache
2 MiB of
8 MiB L3
Cache
2 MiB of
8 MiB L3
Cache
2 MiB of
8 MiB L3
Cache
W
rite B
uffers
W
rite B
uffers
W
rite B
uffers
W
rite B
uffers
N
orth B
ridge and C
om
m
unication Sw
itch
Stewart Smith Digital Systems Design 4
Multilevel Cache Example
ARM Cortex-A8 uni-core processor
Stewart Smith Digital Systems Design 4
Multilevel Cache Organization
ARM Cortex-A8 Intel Nehalem
L1 caches
(per core)
Split instruction and data caches,
32 KiB each, 64-byte blocks,
4 way set associative,
Random replacement,
Write-back with write allocate,
Hit time: 1 clock cycle
Split instruction and data caches,
32 KiB each, 64-byte blocks,
4 way (Instruction) 8 way (data),
Approximated LRU,
Write-back no write allocate,
Hit time: 4 clock cycles (pipelined)
L2 unified
cache
(per core)
128 KiB to 1 MiB, 64-byte blocks,
8-way set associative, Random
replacement, Write back, Hit time:
11 clock cycles
256KiB, 64-byte blocks, 8-way,
Approximated LRU,
Write-back with write allocate,
Hit time: 10 clock cycles
L3 unified
cache
(shared)
None
8MiB, 64-byte blocks, 16-way
associativity, Approximated LRU,
Write-back with write allocate,
Hit time: 35 clock cycles
Stewart Smith Digital Systems Design 4
ARM A8 Minnespec Benchmarks
Stewart Smith Digital Systems Design 4
ARM A8 Minnespec Benchmarks
Stewart Smith Digital Systems Design 4
Core i7 920 SPEC2006 Benchmarks
Stewart Smith Digital Systems Design 4
Summary 1
• When CPU performance increased
‣ Miss penalty becomes more significant
• Decreasing base CPI
‣ Greater proportion of time spent on memory stalls
• Increasing clock rate
‣ Memory stalls account for more CPU cycles
Stewart Smith Digital Systems Design 4
Summary 2
• Don’t neglect cache behaviour when
evaluating performance
• Higher cache associativity can reduce miss
rate
‣ Each block can go in more than one place in
cache
‣ More complicated: longer hit access times?
‣ Requires a replacement policy: LRU/Random?
• Multilevel caches to reduce miss rates
Stewart Smith Digital Systems Design 4
Next Section
• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls