CS3350B Computer Organization
Chapter 1: CPU and Memory
Part 3: Cache Implementations
Alex Brandt
Department of Computer Science University of Western Ontario, Canada
Monday January 25, 2021
Outline
1 The Basics
2 Cache Organization Schemes
3 Handling Cache Hits & Misses
4 Cache Design & Improvements
How is the Hierarchy Managed?
Recall: CPU ← Registers ← L1 ← ... ← Main Memory ← Disk

Registers ↔ Cache Memory:
ë The compiler (sometimes with the programmer’s help) decides which values are stored in which registers.
Caches ↔ Main Memory:
ë The cache controller (thus the hardware) handles the movement of data into and out of the cache.
Main Memory ↔ Disk:
ë The operating system (which controls the virtual memory).
ë The TLB (thus the hardware), which assists the virtual-to-physical address mapping.
ë The programmer (who organizes the data into files), if the data comes from a file.
Cache Design Questions
Q1 How best to organize the memory blocks (a.k.a. lines) inside the cache?
Q2 To which block (line) of the cache does a given (main) memory address map?
ë Note: since the cache is a subset of the main memory, multiple memory addresses can map to the same cache location.
Q3 How do we know if a block of the main memory currently has a copy in cache?
Q4 How do we quickly find a particular copy of main memory (memory address contents) in the cache?
General Organization of a Cache Memory (1/2)
A cache is an array of 𝑅 = 2^𝑠 sets.
Each set contains 𝑁 ≥ 1 lines.
Each cache line (block) holds 𝐵 = 2^𝑏 bytes of data.
General Organization of a Cache Memory (2/2)
Cache capacity = bytes per line × lines per set × number of sets:
𝐶 = 𝐵 × 𝑁 × 𝑅
All values are a power of 2.
Memory-Cache Mapping (Addressing Cache Memories)
The data word at an 𝑚-bit address A is in the cache if the tag bits of A match the tag stored in one of the valid lines of the set selected by A’s set-index bits.
The word contents begin at the block offset given by the low-order 𝑏 bits of A.

Address Mapping:
ë block address = the byte address with the low 𝑏 bits dropped (i.e. the byte address divided by 𝐵).
ë set index = block address mod 𝑅; since 𝑅 is a power of 2, just take the “s bits” of the address as the set index.

Note: the main memory address space and the cache size/implementation are highly coupled.

𝑏 = log2(𝐵),  𝑅 = 𝐶/(𝐵 × 𝑁),  𝑠 = log2(𝑅),  𝑡 = 𝑚 − 𝑠 − 𝑏
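As a worked example (illustrative numbers, not from the slides): with 𝑚 = 32-bit addresses, a 𝐶 = 32KB cache, 𝐵 = 64-byte lines and 𝑁 = 8 ways, we get 𝑅 = 32768/(64 × 8) = 64 sets, so 𝑏 = 6, 𝑠 = 6, and 𝑡 = 32 − 6 − 6 = 20 tag bits.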
Word Addresses and Byte Addresses
Often, a memory word for a computer is more than 1 byte. In these cases, a byte memory address and a word memory address are not the same.
Consider an 𝑁-byte word with word address 𝑋.
This word address actually refers to byte address 𝑁 × 𝑋; that is, word 𝑋 occupies bytes 𝑁 × 𝑋 through 𝑁 × 𝑋 + (𝑁 − 1).
Example
Consider the word address 4 on a 16-bit computer.
4 in binary is 100, but we must account for the fact that words are 2 bytes long.
The word address 4 is actually the byte address 8 ( = 1000 in binary).
Note: if not explicitly said, an address should be considered a byte address.
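A tiny C sketch of the conversion (the word size is an assumption chosen to match the example above):

    /* Byte address of the first byte of a word, for a fixed word size. */
    #define WORD_BYTES 2u                     /* 2-byte (16-bit) words, as in the example */
    unsigned byte_address(unsigned word_address) {
        return word_address * WORD_BYTES;     /* word address 4 -> byte address 8 */
    }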
2 Cache Organization Schemes
Types of Cache Organization
Direct-Mapped
𝑁 = 1
ë One line per set.
ë Each memory address is mapped to exactly one line in the cache.
ë 𝑏 = log2(𝐵), 𝑅 = 𝐶/𝐵, 𝑠 = log2(𝑅), 𝑡 (tag size) = 𝑚 − 𝑠 − 𝑏.

Fully Associative
𝑅 = 1 (a memory address may be mapped to any cache block).
ë The tag is the whole address except the block offset.
ë 𝑏 = log2(𝐵), 𝑁 = 𝐶/𝐵, 𝑠 = 0, 𝑡 = 𝑚 − 𝑏.

𝑁-way Set Associative
𝑁 is typically 2, 4, 8, or 16.
ë A memory block maps to a specific set but can be placed in any way of that set (so there are 𝑁 choices of mapping).
ë 𝑏 = log2(𝐵), 𝑅 = 𝐶/(𝐵 × 𝑁), 𝑠 = log2(𝑅), 𝑡 = 𝑚 − 𝑠 − 𝑏.
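A rough C sketch of how an address splits into tag / set index / block offset under these formulas (function and variable names are mine, for illustration only; direct-mapped and fully associative fall out as 𝑁 = 1 and 𝑅 = 1):

    #include <stdio.h>

    /* Split a byte address into tag / set index / block offset,
       where b = log2(B) and s = log2(R), as in the formulas above. */
    typedef struct { unsigned tag, set, offset; } parts_t;

    parts_t decompose(unsigned addr, unsigned b, unsigned s) {
        parts_t p;
        p.offset = addr & ((1u << b) - 1);         /* low b bits: block offset */
        p.set    = (addr >> b) & ((1u << s) - 1);  /* next s bits: set index   */
        p.tag    = addr >> (b + s);                /* remaining bits: tag      */
        return p;
    }

    int main(void) {
        /* e.g. B = 4 (b = 2) and R = 4 sets (s = 2): address 0x2C -> tag 2, set 3, offset 0 */
        parts_t p = decompose(0x2C, 2, 2);
        printf("tag=%u set=%u offset=%u\n", p.tag, p.set, p.offset);
        return 0;
    }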
Direct-Mapped Cache Example
Direct-Mapped ⇒ 𝑁 = 1. Let 𝐵 = 1, 𝑅 = 4.
Therefore we have a 2-bit set index. Assume a 2-bit tag ⇒ 𝑚 = 4.
Start with an empty cache – blanks are considered invalid.
Consider the sequence of memory address accesses:
0 (0000), 1 (0001), 2 (0010), 3 (0011), 4 (0100), 3 (0011), 4 (0100), 15 (1111)
8 requests, 2 hits, 6 misses = 25% hit rate
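Tracing it (this follows directly from the setup: set index = low 2 bits, tag = high 2 bits): 0 → set 0 miss, 1 → set 1 miss, 2 → set 2 miss, 3 → set 3 miss, 4 → set 0 miss (evicts 0), 3 → set 3 hit, 4 → set 0 hit, 15 → set 3 miss (evicts 3).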
Why Middle Bits For Set Index?
Consider a 4-line direct-mapped cache; sets are indexed as 00, 01, 10, 11:
High-Order Bit Indexing
ë Adjacent memory lines would map to the same cache entry.
ë Poor use of locality.

Middle-Order Bit Indexing
ë Consecutive memory lines map to different cache lines.
ë Can hold 𝐶 bytes of contiguous memory in the cache at one time.

In this example the addresses are written 0000x, 0001x, . . . ; with an 𝑚-bit address, x stands for the remaining 𝑚 − 4 bits.
Direct-Mapped Cache Circuit Logic Example 1
In this example:
One-word (4-byte) cache lines (𝐵 = 4).
Notice that the byte offset is related to, but not exactly the same as, the block offset.
2^10 = 1024 sets ⇒ 10 bits for set index.
Direct-Mapped Cache Example 2 (𝐵 > 1)
Let the cache line hold two words = 2 bytes (𝐵 = 2).
Direct-Mapped ⇒ 𝑁 = 1. 𝑅 = 2.
Therefore 1-bit set index. Assume a 2-bit tag again.
Start with an empty cache – blanks are considered invalid.
Consider the sequence of memory address accesses:
0 (0000), 1 (0001), 2 (0010), 3 (0011), 4 (0100), 3 (0011), 4 (0100), 15 (1111)
8 requests, 4 hits, 4 misses = 50% hit rate!
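Tracing it (block offset = low bit, set index = next bit, tag = high 2 bits): 0 miss (installs bytes 0–1 in set 0), 1 hit (same block), 2 miss (installs bytes 2–3 in set 1), 3 hit, 4 miss (evicts the block in set 0), 3 hit, 4 hit, 15 miss (evicts the block in set 1). The extra hits on addresses 1 and 3 come from spatial locality within the 2-byte block.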
Direct-Mapped Cache Circuit Logic Example 2
In this example:
Four data words/block (𝐵 = 16).
2^8 = 256 sets ⇒ 8 bits for set index.
Block Size & Cache Performance
Miss rate goes up if the block size becomes a significant fraction of the cache size.
ë For a fixed cache size, ↑ block size ⇒ ↓ number of blocks.
ë Increases the number of capacity misses.
Direct-Mapped Cache Worst-Case
Direct-Mapped ⇒ 𝑁 = 1. Let 𝐵 = 1, 𝑅 = 4. Consider the sequence of memory address accesses:
0 = (0000), 4 = (0100), 0, 4, 0, 4, 0, 4, …
8 requests, 8 misses = 100% miss rate!
ë conflict misses – two addresses map into the same cache block
Thrashing!
Thrashing Fix with Set Associativity
2-way associative cache ⇒ 𝑁 = 2. Let 𝐵 = 1, 𝑅 = 2 now. Consider the sequence of memory address accesses:
0 = (0000), 4 = (0100), 0, 4, 0, 4, 0, 4, …
8 requests, 2 misses = 75% hit rate
Solves thrashing of direct-mapped cache caused by conflict misses
ë Now two memory locations that map to the same block can co-exist.
Four-way Set Associative Cache Circuit Logic Example
2^8 = 256 sets, each with four ways (each way holding one block of one word)
Set Associativity Costs Extra
When a miss occurs, which way do we pick for replacement?
Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.
ë Hardware must keep track of when each way was last used relative to the other blocks in the set.
ë For 2-way set associative, takes one bit per set.
ë Set the bit when a block is referenced (and reset the other way’s bit).
First In First Out (FIFO)
ë Can be implemented as a circular buffer or counter.
𝑁-way set associative cache costs:
𝑁 comparators to check the tags of all ways in a set in parallel (delay and physical area on chip).
MUX delay (selecting the proper way) before data is available.
In a direct-mapped cache, the cache block is available before the hit/miss decision.
ë In a set-associative cache it is not possible to assume a hit and recover later, since the data cannot be selected until the tag comparison completes.
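A rough C sketch of the one-bit-per-set LRU bookkeeping for a 2-way set-associative cache (the structure and names are illustrative assumptions, not from the slides):

    /* One LRU bit per set for a 2-way set-associative cache:
       the bit names the way that was used *least* recently.  */
    #define NUM_SETS 256
    static unsigned char lru_way[NUM_SETS];

    /* Call on every access that hits in (or installs into) `way` of `set`. */
    void touch(unsigned set, unsigned way) {
        lru_way[set] = way ^ 1u;    /* the other way is now the least recently used */
    }

    /* On a miss, the victim is simply the way named by the LRU bit. */
    unsigned victim(unsigned set) {
        return lru_way[set];
    }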
Range of Set Associative Caches
For a fixed size cache (and fixed size of cache lines) each doubling in associativity (ways):
ë Doubles the number of blocks (ways) per set,
ë Halves the number of sets,
ë Decreases the size of the set index by 1 bit,
ë Increases the size of the tag by 1 bit.
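For example (illustrative numbers): a 𝐶 = 16KB cache with 𝐵 = 16-byte lines has 1024 lines. Direct-mapped gives 𝑅 = 1024 sets (𝑠 = 10), 2-way gives 𝑅 = 512 (𝑠 = 9), 4-way gives 𝑅 = 256 (𝑠 = 8); with 32-bit addresses the tag grows from 18 to 19 to 20 bits (𝑡 = 32 − 𝑠 − 4).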
Benefits of Set Associative Caches
The choice between direct mapped (1-way) and set associative depends on the cost of a miss versus the cost of implementation.
ë Largest gains are in going from direct mapped to 2-way (>20% reduction in miss rate).
Cache Structure in Different Processors
What do we know so far?
For each processor we compare, per level: cache size & organization, associativity, and write policy.

Intel Nehalem:
ë L1: 32KB each per core; 64B blocks; split I$ and D$; 4-way (I), 8-way (D) set assoc.; ∼LRU replacement; write-back, write-allocate.
ë L2: 256KB per core; 64B blocks; unified; 8-way set assoc.; ∼LRU; write-back, write-allocate.
ë L3: 8192KB (8MB) shared by all cores; 64B blocks; unified; 16-way set assoc.; write-back, write-allocate.

AMD Barcelona:
ë L1: 64KB each per core; 64B blocks; split I$ and D$; 2-way set assoc.; LRU replacement; write-back, write-allocate.
ë L2: 512KB per core; 64B blocks; unified; 16-way set assoc.; ∼LRU; write-back, write-allocate.
ë L3: 2048KB (2MB) shared by all cores; 64B blocks; unified; 32-way set assoc.; evict the block shared by the fewest cores; write-back, write-allocate.
3 Handling Cache Hits & Misses
Handling Cache Hits
Read Hits (Instructions and Data)
This is what we want! ⇒ Do nothing special.
Write Hits (Data only) – two policies:

Write-through: Require the cache and the backing memory to always be consistent.
ë When writing to the cache, also pass the value to the next lower level cache (or main memory) to update its copy.
ë In a naïve implementation this is very slow: the speed of cache writes is limited by the speed of the next lower level.
ë Use a write buffer between levels and stall only if the buffer is full.

Write-back: Allow the cache and memory to be inconsistent.
ë Write the data into the cache block only; the block becomes a dirty block.
ë Only update the next lower level when a dirty block is evicted.
ë Requires at least two cycles – one to check whether the evicted block is dirty and another to actually do the write – or, again, use a write buffer and take only one cycle.
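A condensed C sketch contrasting the two write-hit policies (the line structure and the write-buffer placeholder are illustrative assumptions, not anything from the slides):

    /* Illustrative cache-line structure. */
    typedef struct {
        unsigned tag;
        int      valid;
        int      dirty;               /* meaningful for write-back only */
        unsigned char data[64];
    } line_t;

    /* Placeholder for the write buffer that drains to the next level. */
    static void write_buffer_push(unsigned addr, unsigned char v) { (void)addr; (void)v; }

    /* Write hit, write-through: update the line and the next level (via the buffer). */
    void write_hit_write_through(line_t *line, unsigned addr, unsigned off, unsigned char v) {
        line->data[off] = v;
        write_buffer_push(addr, v);
    }

    /* Write hit, write-back: update the line only and mark it dirty;
       the next level is updated later, when this block is evicted.  */
    void write_hit_write_back(line_t *line, unsigned off, unsigned char v) {
        line->data[off] = v;
        line->dirty = 1;
    }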
Handling Cache Misses
Read Misses (Instruction and Data)
Stall the execution, fetch the block from the next lower level of memory, write (“install”) it into the cache, pass it to the next higher level of memory (or the processor), and let the processor resume.
Write Misses (Data only) stall execution and perform one of two policies, then let the processor resume:

Write Allocate:
ë Fetch the block from the next level in the memory hierarchy,
ë Install it in the cache and write the updated word into the new block.
ë Installing the block and writing the updated word can be done simultaneously.

No-Write Allocate:
ë Skip fetching/installing the block into the cache; write directly into the lower level of memory (or a write buffer).
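A matching C sketch for the write-miss side (again illustrative; the structure repeats the one from the previous sketch and the memory-side calls are placeholders):

    /* Illustrative cache-line structure, as before. */
    typedef struct { unsigned tag; int valid; int dirty; unsigned char data[64]; } line_t;

    /* Placeholders for the memory-side operations. */
    static void fetch_block_from_next_level(line_t *line, unsigned addr) { (void)line; (void)addr; }
    static void write_to_next_level(unsigned addr, unsigned char v)      { (void)addr; (void)v; }

    /* Write allocate: fetch and install the block, then write the word into it. */
    void write_miss_write_allocate(line_t *line, unsigned addr, unsigned off, unsigned char v) {
        fetch_block_from_next_level(line, addr);  /* install; the write below can overlap with this */
        line->valid = 1;
        line->data[off] = v;
        line->dirty = 1;                          /* assuming a write-back cache at this level */
    }

    /* No-write allocate: skip the cache and write directly to the lower level (or a write buffer). */
    void write_miss_no_allocate(unsigned addr, unsigned char v) {
        write_to_next_level(addr, v);
    }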
Dealing With Cache Misses with Hardware Design
Compulsory Misses:
Caused by cold starts, process migrations, or very first references. Reduce impact by increasing the block size. But this increases the miss penalty and could increase the miss rate.
Capacity Misses:
Caused by the cache becoming full; it cannot hold all blocks referenced by the program.
Reduce impact by increasing the cache size (but this may increase the access time).
Conflict Misses:
Caused by multiple addresses mapping to the same cache block. Reduce impact by increasing associativity or increasing the cache/block size (but this may increase the access time).
ë Larger cache ⇒ more sets ⇒ fewer addresses map to same loc.
4 Cache Design & Improvements
Reducing the Effects of Cache Misses
Reducing Cache Miss Rate ⇒ increase cache size
With increasing technology (especially transistor size and density)
there is more room for larger caches.
Reducing Cache Miss Penalty ⇒ use more levels of cache
L1 caches have been around for a long time.
L2 cache first appeared with Intel’s Celeron processors with only 128KB (1999).
L3 was first seen in 2003 but prohibitively expensive. Became standard with Intel’s Nehalem in 2008.
Recall: New AMAT Example
1 cycle L1 hit time, 2% L1 miss rate, 5 cycle L2 hit time, 5% L2 miss rate, 100 cycle main memory access time.
Without L2 cache: AMAT = 1 + .02*100 = 3
With L2 cache: AMAT = 1 + .02*(5 + .05*100) = 1.2
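A small C sketch that reproduces this arithmetic (purely illustrative):

    #include <stdio.h>

    /* Two-level AMAT: L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory time). */
    double amat_two_level(double l1_hit, double l1_miss_rate,
                          double l2_hit, double l2_miss_rate, double mem_time) {
        return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time);
    }

    int main(void) {
        double without_l2 = 1.0 + 0.02 * 100.0;                          /* = 3.0 cycles */
        double with_l2    = amat_two_level(1.0, 0.02, 5.0, 0.05, 100.0); /* = 1.2 cycles */
        printf("AMAT without L2: %.1f, with L2: %.1f\n", without_l2, with_l2);
        return 0;
    }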
Intel Pentium Memory Hierarchy
Multilevel Cache Design
Design considerations for L1 and L2 caches are very different:
Primary cache should focus on minimizing hit time in support of faster processor clock.
ë Smaller capacity with smaller block sizes.
Secondary cache(s) should focus on reducing miss penalty for L1 cache by reducing miss rate to main memory.
ë Larger capacity with larger block sizes.
ë Higher levels of associativity.
The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache.
ë L1 cache can be smaller and faster (but results in higher miss rate).
Hit time is less important for L2 cache than miss rate.
ë But, L2 hit time determines L1’s miss penalty.
ë Presence of an L3 cache greatly improves the situation, allowing the L2 miss rate to be slightly less important.
It’s all a balancing act.
Improving Cache Performance (1/2)
AMAT = Time for a Hit + Miss Rate * Miss Penalty
(1) Reduce the time to hit in the cache.
ë Smaller cache size, smaller block size.
ë Direct-Mapped organization.
ë For writes, two possible strategies:
– No-Write Allocate: no “hit” on cache, just write to the write buffer. Makes subsequent reads tricky.
– Write Allocate: avoid 2 cycles by using a write buffer to write to the lower cache.

(2) Reduce the miss rate.
ë Larger cache, larger block size (16 – 64 bytes typical).
ë More flexible placement (increase associativity).
ë Use a victim cache – a small buffer holding the most recently discarded blocks.
Improving Cache Performance (2/2)
AMAT = Time for a Hit + Miss Rate * Miss Penalty

(3) Reduce the miss penalty.
ë Smaller blocks.
ë Use a write buffer.
ë Check the write buffer (or the victim cache) on a read miss – you may get lucky!
ë Pre-fetch the critical word first, then the rest of the cache block.
ë Use multiple cache levels.
ë Faster backing store (= main memory).
ë Improved memory bandwidth – amount and speed of memory transfer between levels (e.g. wider buses).
The Cache Design Space
It’s all a balancing act.
Several interacting dimensions:
ë cache size
ë block size
ë associativity
ë replacement policy
ë write-through vs write-back
ë write allocation

The optimal choice is a compromise:
ë depends on access characteristics (what the program is doing),
ë depends on technology / cost.
Simplicity often wins.
Memory Hierarchy: Critical Aspects
The Principle of Locality:
A program is likely to access a relatively small portion of the address space at any instant of time.
ë Temporal locality: locality in time.
ë Spatial locality: locality in space.

Three major types of cache misses:
ë Compulsory/Cold misses: sad facts of life; the very first reference to an address/block.
ë Conflict misses: thrashing! Increase cache size and/or associativity.
ë Capacity misses: it turns out size does matter 🙁

Cache Design Space:
ë Total size, block size, associativity (& replacement policy).
ë Write-hit policy: write-through, write-back.
ë Write-miss policy: (no-)write-allocate. Use write buffers.
Questions to Consider
Q1 Where can an entry be placed or found in the cache?
ë Cache organization, direct mapping, associativity.
Q2 Which entry should be replaced on a miss?
ë Replacement policies: LRU, FIFO.
Q3 What happens on a write?
ë Write hit/miss strategies: write-through, write-back, write-allocate.
Typical Example: Given a list of memory references, describe the cache’s state after each reference.
Q1: Where can an entry be placed?
Direct mapped:
ë # of sets: # of cache lines
ë Entries per set: 1
ë Location method: index
ë # of comparisons: 1

Set associative:
ë # of sets: (# of cache lines) / associativity
ë Entries per set: associativity (typically 2 to 16)
ë Location method: index the set, then compare the set’s tags
ë # of comparisons: degree of associativity

Fully associative:
ë # of sets: 1
ë Entries per set: # of entries
ë Location method: compare all entries’ tags
ë # of comparisons: # of entries
Q2: Which entry should be replaced on a miss?
Easy for direct mapped – only one choice.
Set associative or fully associative:
ë Random
ë LRU (Least Recently Used)
ë FIFO (First In First Out)
For 2-way set associative LRU is easy and ideal.
Comparisons required for LRU can be too costly for high levels of associativity (> 4-way).