CS4203/EE4363 Computer Organization and Design
Chapter 5
Memory Hierarchy (Cache Memory)
Prof. Pen-Chung Yew
With Slides from Profs. Patterson, Hennessy and Mary Jane Irwin
Major Components of a Computer
[Figure: Processor (Control + Datapath), Memory, and Input/Output devices. The memory system consists of the cache, main memory, and secondary memory (disk).]
The "Memory Wall"
• The processor vs. DRAM speed disparity continues to exist.
[Plot: clocks per instruction vs. clocks per DRAM access on a log scale from 0.01 to 1000, spanning the VAX/1980 through the Pentium Pro/1996 to 2010+.]
• Good memory hierarchy (cache) design is increasingly important to overall performance.
The Memory Hierarchy Goal
• Fact: Large memories are slow, and fast memories are small.
• How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?
  • With hierarchy – layers of memories from large to small
  • With parallelism – fetch multiple words at a time
A Typical Memory Hierarchy
• Take advantage of the principle of locality to:
  • Give users as much memory as possible
  • Offer speed using the fastest technology
  • Use the cheapest technology for the bulk of storage
[Figure: on-chip components (Control, Datapath, RegFile, ITLB/DTLB, Instruction and Data Caches), a Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk).]
Typical characteristics from registers out to disk:
  • Speed (cycles): ½'s → 1's → 10's → 100's → 10,000's
  • Size (bytes): 100's → 10K's → M's → G's → T's
  • Cost per byte: highest → lowest
Memory Hierarchy Technologies
• Caches use SRAM for speed and technology compatibility with the CPU
  • Fast – typical access times of 0.5 to 2.5 ns
  • Low density – 6-transistor cell per bit
  • Higher power and more expensive (compared to DRAM)
  • Static: content lasts as long as power is left on
• Main memory uses DRAM for size/capacity (density)
  • Slower – typical access times of 50 to 70 ns
  • High density – 1-transistor cell per bit
  • Lower power and cheaper (compared to SRAM)
  • Dynamic: needs to be "refreshed" regularly (~every 8 ms)
    • Refresh consumes 1% to 2% of the active cycles of the DRAM
The Memory Hierarchy: Why Does it Work?
• Temporal locality (locality in time): data tends to be reused soon.
  • If a memory location is referenced, it tends to be referenced again soon.
  ⇒ Keep the most recently accessed data items closer to the processor.
• Spatial locality (locality in space): accesses tend to be clustered.
  • If a memory location is referenced, locations with nearby addresses tend to be referenced soon.
  ⇒ Move blocks consisting of contiguous words closer to the processor at a time (to amortize the memory access overhead).
The Memory Hierarchy: Terminology
• Block (or line): the minimum unit stored in a cache
• Hit rate: fraction of memory accesses found in a level of the memory hierarchy
• Hit time: time to access that level, consisting of
    Time to access the block + Time to determine hit/miss
• Miss rate: fraction of memory accesses not found in that level
    = 1 – Hit rate
• Miss penalty: time to bring a missed block from the next level, consisting of
    Time to access the block at the next level +
    Time to transmit the block to the current level +
    Time to insert the block in the current level +
    Time to pass the block to the processor
• Hit time << Miss penalty
Characteristics of the Memory Hierarchy
• Levels: Processor (Register File) → L1$ → L2$ → Main Memory (MM) → Secondary Memory (SM)
• Access time (distance from the processor) increases at each level, as does the (relative) size of the memory at each level.
• Inclusion property: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.
How is the Hierarchy Managed?
• Registers ↔ Cache
  • by the compiler (or programmer) – using lw/sw instructions
• Main memory ↔ Disks
  • by the operating system (virtual memory) – via page faults
    • virtual-to-physical address mapping assisted by hardware (Translation Lookaside Buffer, TLB)
  • by the programmer (files)
• Cache ↔ Main memory
  • by the cache controller hardware – invisible to software
Cache Basics
• Two questions to answer (implemented in hardware):
  • Q1: How do we find where a data item is stored in the cache?
  • Q2: How do we know whether the item we want is there (i.e., hit or miss)?
• Direct mapped
  • Each memory block is mapped to exactly one block location in the cache
  • Address mapping (answers Q1):
      (block address) modulo (# of blocks in cache)
  • Many memory blocks will be assigned/mapped to the same cache block location
  • A tag associated with each cache block identifies which memory block is stored there (answers Q2)
    • Use the upper portion of the address as the tag
(block address) modulo (# of blocks in cache)

Decimal examples (a modulus that is a power of 10 keeps the low-order digits):
  • 148957 modulo 100 (= 10^2) = 57
  • 148957 modulo 1000 (= 10^3) = 957
  • 148957 modulo 10000 (= 10^4) = 8957
Binary examples (a modulus that is a power of 2 keeps the low-order bits):
  • 10110110 modulo 100 (= 2^2) = 10
  • 10110110 modulo 1000 (= 2^3) = 110
  • 10110110 modulo 10000 (= 2^4) = 0110
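A minimal sketch (not from the slides) of the same idea in C: because the number of cache blocks is a power of two, the modulo operation simply keeps the low-order bits of the block address as the index, and whatever is left above them becomes the tag.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t block_addr = 0xB6;   /* 10110110 in binary, as in the example above */
    uint32_t num_blocks = 4;      /* a 4-block cache => 2 index bits */
    uint32_t index_bits = 2;

    uint32_t index = block_addr % num_blocks;   /* low-order bits: binary 10 */
    uint32_t tag   = block_addr >> index_bits;  /* remaining upper bits       */

    printf("index = %u, tag = 0x%x\n", index, tag);
    return 0;
}
```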
Binary Representations
• 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32
• 2^6 = 64, 2^7 = 128, 2^8 = 256, 2^9 = 512
• 2^10 = 1024 = 1K
• 32K = 2^5 × 2^10 = 2^15, 256K = 2^8 × 2^10 = 2^18
• 2^20 = 2^10 × 2^10 = 1M
• 64M = 2^6 × 2^20 = 2^26, 128M = 2^7 × 2^20 = 2^27
Memory Hierarchy
Processor (Register File) → L1$ → L2$ → Main Memory (MM) → Secondary Memory (SM)
Inclusion property: each level holds a subset of the level below it.
Caching: A Simple First Example
[Figure: a direct-mapped cache with 4 blocks (each entry holds an index, valid bit, tag, and data word) and a 16-block main memory with 4-bit block addresses 0000–1111.]
• Assume 1-word blocks.
• Q1: How do we find where a block is stored?
  Use the 2 low-order bits of the memory block address – the index:
    (block address) modulo (# of blocks in the cache)
• Q2: Is it there?
  Compare the cache tag at that index against the upper bits of the address.
Direct Mapped Cache
• Consider the main memory word reference string: 0 1 2 3 4 3 4 15
• Start with an empty cache – all blocks initially marked as not valid.

  Reference (binary)   Result   Cache contents afterward (indexes 00..11)
  0   (0000)           miss     Mem(0)
  1   (0001)           miss     Mem(0), Mem(1)
  2   (0010)           miss     Mem(0), Mem(1), Mem(2)
  3   (0011)           miss     Mem(0), Mem(1), Mem(2), Mem(3)
  4   (0100)           miss     Mem(4), Mem(1), Mem(2), Mem(3)   (4 evicts 0 at index 00)
  3   (0011)           hit      Mem(4), Mem(1), Mem(2), Mem(3)
  4   (0100)           hit      Mem(4), Mem(1), Mem(2), Mem(3)
  15  (1111)           miss     Mem(4), Mem(1), Mem(2), Mem(15)  (15 evicts 3 at index 11)

• 8 requests, 6 misses
• Hit rate = 2/8 = 0.25; miss rate = 1 – 0.25 = 0.75
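A minimal sketch (not from the slides): a 4-block, 1-word-block, direct-mapped cache simulated in software, replaying the reference string above and reproducing the 6 misses.

```c
#include <stdio.h>

#define NUM_BLOCKS 4

int main(void) {
    int valid[NUM_BLOCKS] = {0};
    int tag[NUM_BLOCKS];
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};
    int n = sizeof(refs) / sizeof(refs[0]);
    int misses = 0;

    for (int i = 0; i < n; i++) {
        int addr  = refs[i];
        int index = addr % NUM_BLOCKS;   /* low-order bits select the block */
        int t     = addr / NUM_BLOCKS;   /* upper bits form the tag         */
        if (valid[index] && tag[index] == t) {
            printf("%2d: hit\n", addr);
        } else {
            printf("%2d: miss\n", addr);
            valid[index] = 1;            /* fetch block, overwriting old one */
            tag[index]   = t;
            misses++;
        }
    }
    printf("%d requests, %d misses, hit rate = %.2f\n",
           n, misses, (double)(n - misses) / n);
    return 0;
}
```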
MIPS Direct Mapped Cache Example
• 1-word blocks, cache size = 1K words (4KB)
[Figure: the 32-bit address is split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). The index selects one of the 1024 cache entries; the stored tag is compared with the address tag, and the result ANDed with the valid bit produces Hit, while the 32-bit data word is read out in parallel.]
• What kind of locality are we taking advantage of here? Temporal locality.
Multiword Block Direct Mapped Cache
• 4 words/block, cache size = 1K words
[Figure: the 32-bit address is split into a 20-bit tag (bits 31–12), an 8-bit index (bits 11–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0). The index selects one of 256 entries; the tag compare plus valid bit produces Hit; the block offset selects one of the four 32-bit words in the block.]
• What kind of locality are we taking advantage of here? Both temporal and spatial locality.
Taking Advantage of Spatial Locality
• Consider the main memory word reference string: 0 1 2 3 4 3 4 15
• Same total cache capacity (4 words), but now organized as 2 two-word blocks.
• Start with an empty cache – all blocks initially marked as not valid.

  Reference (binary)   Result   Cache contents afterward (index 0, index 1)
  0   (0000)           miss     Mem(1)|Mem(0)
  1   (0001)           hit      Mem(1)|Mem(0)
  2   (0010)           miss     Mem(1)|Mem(0), Mem(3)|Mem(2)
  3   (0011)           hit      Mem(1)|Mem(0), Mem(3)|Mem(2)
  4   (0100)           miss     Mem(5)|Mem(4), Mem(3)|Mem(2)    (block 4/5 evicts block 0/1)
  3   (0011)           hit      Mem(5)|Mem(4), Mem(3)|Mem(2)
  4   (0100)           hit      Mem(5)|Mem(4), Mem(3)|Mem(2)
  15  (1111)           miss     Mem(5)|Mem(4), Mem(15)|Mem(14)  (block 14/15 evicts block 2/3)

• 8 requests, 4 misses
• Hit rate = 4/8 = 0.5; miss rate = 1 – 0.5 = 0.5
Miss Rate vs Block Size vs Cache Size
• Miss rate goes up if the block size becomes too large, because the number of blocks in a fixed-size cache decreases
  ⇒ increasing capacity misses
Cache Field Sizes
• A cache block entry includes the data, a tag, and a valid bit.
• Assume 32-bit byte addressing.
• For a direct-mapped cache with 2^n blocks, n bits are used for the index.
• For a block size of 2^m bytes, m bits are used to address the byte within the block.
• Size of the tag field in a direct-mapped cache: 32 – n – m bits.
• Total number of bits in the cache:
    2^n × (data size + tag field size + valid bit)
• Example address breakdown for 4 (= 2^2) words/block and a 1K (= 2^10) word cache:
    bits 31–12 = tag, bits 11–4 = block index, bits 3–0 = block and byte offset
Example
• Total number of bits in a cache:
    2^n × (block size + tag field size + valid bit)
• How many total bits are required for a direct-mapped cache with 16KB (2^14 bytes) of data and 16-byte (2^4-byte) blocks, assuming a 32-bit address?
  We have 2^14 / 2^4 = 2^10 blocks, and need 32 – 10 – 4 = 18 bits for each tag.
  Total number of bits in the cache:
    2^10 × (4 × 32 + 18 + 1) = 1024 × 147 = 150,528 bits
    (number of blocks × (data bits per block + tag bits + valid bit))
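A minimal sketch (not from the slides) of the same calculation in C, reproducing the 16KB / 16-byte-block example; the index and offset widths are written in directly with comments rather than computed.

```c
#include <stdio.h>

int main(void) {
    long data_bytes   = 16 * 1024;   /* 16KB of data         */
    long block_bytes  = 16;          /* 2^4-byte blocks      */
    int  addr_bits    = 32;

    long num_blocks   = data_bytes / block_bytes;        /* 2^10 = 1024        */
    int  index_bits   = 10;                              /* log2(num_blocks)   */
    int  offset_bits  = 4;                               /* log2(block_bytes)  */
    int  tag_bits     = addr_bits - index_bits - offset_bits;   /* 18 bits      */
    long bits_per_blk = block_bytes * 8 + tag_bits + 1;  /* data + tag + valid = 147 */

    printf("total bits = %ld\n", num_blocks * bits_per_blk);    /* 150528 */
    return 0;
}
```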
Handling Cache Hits
• Read hits (I$ and D$) – this is what we want!
• Write hits (D$ only) – assume we have only one level of cache:
  • Option 1: require the cache and memory to be consistent
    • Write-through – always write the data into both the cache block and the next level of the memory hierarchy
    • Writes then run at the speed of the next level of the hierarchy – very slow!
    • Can use a write buffer (queue) and stall only when the write buffer is full
  • Option 2: allow the cache and memory to be inconsistent
    • Write-back – write the data into the cache block only (not to the next level of memory)
    • Write the "dirty" cache block back to the next memory level only when it is evicted (replaced by a new block due to a conflict)
    • Needs a dirty bit per block
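A minimal sketch (not from the slides, and not the hardware): the write-hit behavior of a write-through cache versus a write-back cache, modeled in software for a single cache line. All names and structures here are illustrative only.

```c
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    bool valid, dirty;
    int  tag;
    int  data;
} cache_line_t;

int memory[1024];   /* stand-in for the next level of the hierarchy */

void write_through_hit(cache_line_t *line, int addr, int value) {
    line->data   = value;       /* update the cache block ...            */
    memory[addr] = value;       /* ... and the next level immediately    */
}

void write_back_hit(cache_line_t *line, int value) {
    line->data  = value;        /* update only the cache block           */
    line->dirty = true;         /* remember to write it back on eviction */
}

void evict(cache_line_t *line, int addr) {
    if (line->valid && line->dirty)
        memory[addr] = line->data;   /* the write-back happens only here */
    line->valid = false;
    line->dirty = false;
}

int main(void) {
    cache_line_t line = { .valid = true, .dirty = false, .tag = 0, .data = 0 };
    write_back_hit(&line, 42);  /* memory[7] is still stale at this point */
    evict(&line, 7);            /* now memory[7] == 42                    */
    printf("memory[7] = %d\n", memory[7]);
    return 0;
}
```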
Sources of Cache Misses
• Compulsory (first reference to a block, due to cold start or process migration):
  • Not much we can do about them.
  • If you are going to run billions of instructions, compulsory misses are insignificant.
  • Remedy: increase block size (but very large blocks can increase conflict misses, because there are fewer blocks in the cache)
• Capacity:
  • The cache cannot contain all the blocks accessed by the program.
  • Remedy: increase cache size (but this may increase access time)
• Conflict (collision):
  • Multiple memory blocks map to the same cache location.
  • Remedy 1: increase cache size
  • Remedy 2: increase associativity (but this may increase access time)
Handling Cache Misses (Single-Word Blocks)
• Read misses (I$ and D$)
  • Stall the pipeline, fetch the block from the next memory level
  • Send the requested word to the processor, then resume the pipeline
• Write misses (D$ only) – three options:
  1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block), write the word into the cache, then resume the pipeline
  2. Write allocate – just write the word into the cache, updating both tag and data; no need to check for a cache hit, no need to stall, but may need to write back a dirty block first
  3. No-write allocate – skip the cache write and just write the word to the write buffer; no need to stall as long as the write buffer isn't full
Multiword Block Considerations
• Read misses (I$ and D$) – multiword blocks
  • Processed the same as for single-word blocks – a miss returns the entire block from memory
  • Miss penalty grows as block size grows; mitigations:
    • Early restart – the processor resumes execution as soon as the requested word of the block is returned
    • Requested word first – the requested word is transferred from memory to the cache (and processor) first, followed by the rest of the block
    • Nonblocking cache – allows the processor to continue accessing the cache while the cache is handling an earlier miss (useful with out-of-order execution)
• Write misses (D$) – multiword blocks
  • If using write allocate, must first fetch the block from memory and then write the word into the block
Measuring Cache Performance
• Assuming the cache (hit) access time is included in the CPU execution time, then
    CPU time = Instruction Count × CPI × Clock Cycle Time
             = Instruction Count × (CPI_ideal + Cache-Stall Cycles per Instruction) × Clock Cycle Time
  where CPI_stall = CPI_ideal + Cache-Stall Cycles per Instruction.
• Cache-stall cycles come from cache misses (the sum of read-miss and write-miss stall cycles):
    Read-Stall Cycles  = reads/program  × read miss rate  × read miss penalty
    Write-Stall Cycles = writes/program × write miss rate × write miss penalty
• For write-allocate caches, we can simplify this to
    Cache-Stall Cycles = memory accesses/program × miss rate × miss penalty
Impacts of Cache Performance
• The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI)
  • Memory speed is unlikely to improve as fast as processor cycle time.
  • When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles.
  • The lower the CPI_ideal, the more pronounced the impact of memory stalls.
• Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, a 2% I$ miss rate, and a 4% D$ miss rate.
    Memory-stall cycles per instruction = 2% × 100 + 36% × 4% × 100 = 3.44
    ⇒ CPI_stall = 2 + 3.44 = 5.44, more than twice CPI_ideal!
• What if CPI_ideal is reduced to 1? 0.5? 0.25?
• What if the D$ miss rate went up by 1%? 2%?
• What if the processor clock rate is doubled (doubling the miss penalty in cycles)?
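A minimal sketch (not from the slides) of the stall arithmetic above, written as a small C helper so the "what if" variations are easy to replay.

```c
#include <stdio.h>

double cpi_stall(double cpi_ideal, double ld_st_frac,
                 double i_miss, double d_miss, double penalty) {
    /* every instruction makes one I$ access; loads/stores add a D$ access */
    double stalls = i_miss * penalty + ld_st_frac * d_miss * penalty;
    return cpi_ideal + stalls;
}

int main(void) {
    /* the slide's example: CPI_ideal 2, 100-cycle penalty, 36% ld/st, 2% I$, 4% D$ */
    printf("CPI_stall = %.2f\n", cpi_stall(2.0, 0.36, 0.02, 0.04, 100));  /* 5.44 */
    /* what if CPI_ideal drops to 1?  the same stalls now dominate even more */
    printf("CPI_stall = %.2f\n", cpi_stall(1.0, 0.36, 0.02, 0.04, 100));  /* 4.44 */
    return 0;
}
```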
Average Memory Access Time (AMAT)
• A larger cache improves the cache hit rate, but has a longer hit time.
  • An increase in hit time may add another stage to the pipeline.
  • Eventually the increase in hit time offsets the improvement in hit rate, leading to a decrease in performance.
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses:
    AMAT = Time for a hit + Miss rate × Miss penalty
• Example: What is the AMAT for a processor with a 20 ps clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?
    AMAT = 20 ps + 0.02 × 50 × 20 ps = 20 ps + 20 ps = 40 ps
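A minimal sketch (not from the slides) of the AMAT calculation above, keeping everything in picoseconds.

```c
#include <stdio.h>

int main(void) {
    double clock_ps    = 20.0;  /* 20 ps clock       */
    double hit_cycles  = 1.0;   /* cache access time */
    double miss_rate   = 0.02;
    double penalty_cyc = 50.0;

    double amat_ps = hit_cycles * clock_ps + miss_rate * penalty_cyc * clock_ps;
    printf("AMAT = %.1f ps\n", amat_ps);   /* 40.0 ps */
    return 0;
}
```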
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
  • In a direct-mapped cache a memory block maps to exactly one cache block:
      Block Index = (block address) modulo (# of blocks in cache)
Direct-Mapped Cache
• Consider the main memory word reference string (4-bit addresses): 0 4 0 4 0 4 0 4
• Start with an empty cache – all blocks initially marked as not valid.
• Words 0 (0000) and 4 (0100) both map to index 00, so every reference misses and evicts the other:
    0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss
• 8 requests, 8 misses
• Ping-pong effect (thrashing) due to conflict misses: two memory locations map into the same cache block.
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
  • In a direct-mapped cache a memory block maps to exactly one cache block:
      Block Index = (block address) modulo (# of blocks in cache)
  • At the other extreme, a memory block could be allowed to map to any cache block – a fully associative cache.
  • A compromise is to divide the cache into sets and allow n cache blocks per set; we call this n-way set associative.
  • Each memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):
      Set Index = (block address) modulo (# of sets in cache)
2-way Set Associative Cache
[Figure: a 2-way set associative cache with 2 sets (each with ways 0 and 1; each way holds a valid bit, tag, and one word of data) and a 16-block main memory with 4-bit block addresses 0000–1111. 1-word blocks.]
• Q1: How do we find it?
  Use the 1 low-order bit of the block address to determine the cache set (i.e., modulo the number of sets in the cache).
• Q2: Is it there?
  Compare all the cache tags in the set to the high-order 3 address bits to tell whether the memory block is in the cache.
    Set Index = (block address) modulo (# of sets in cache)
2-way Set Associative Cache
[Figure: the same 2-way set associative cache, now shown with byte addresses – the "xx" marks the 2 low-order bits that define the byte within the 32-bit word. 1-word blocks.]
• Q1: How do we find it?
  Use the next 1 low-order address bit (above the byte offset) to determine the cache set (i.e., modulo the number of sets in the cache).
• Q2: Is it there?
  Compare all the cache tags in the set to the high-order 3 address bits to tell whether the memory block is in the cache.
    Set Index = (block address) modulo (# of sets in cache)
Caching: A Simple First Example
[Figure: the direct-mapped version again – a 4-block cache (index, valid, tag, data) and a 16-block byte-addressed main memory, where "xx" marks the 2 low-order bits that define the byte within the 32-bit word. 1-word blocks.]
• Q1: How do we find where a block is stored?
  Use the next 2 low-order memory address bits – the index – to determine the cache block (i.e., modulo the number of blocks in the cache):
    (block address) modulo (# of blocks in the cache)
• Q2: Is it there?
  Compare the cache tag to the high-order 2 memory address bits to find out whether the memory block is in the cache.
2-way Set Associative
• Consider the main memory word reference string: 0 4 0 4 0 4 0 4
• Start with an empty cache – all blocks initially marked as not valid.
• Words 0 (0000) and 4 (0100) map to the same set, but can now occupy its two ways (tags 000 and 010) at the same time:
    0 miss, 4 miss, then every subsequent reference to 0 or 4 hits
• 8 requests, 2 misses
• This solves the ping-pong effect caused by conflict misses in a direct-mapped cache, since two memory locations that map into the same cache set can now co-exist!
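A minimal sketch (not from the slides): replaying the 0 4 0 4 ... string against a 2-way set associative cache with 2 sets (1-word blocks), using LRU replacement within each set tracked with a single bit, and reproducing the 2 misses.

```c
#include <stdio.h>

#define SETS 2
#define WAYS 2

int main(void) {
    int valid[SETS][WAYS] = {{0}};
    int tag[SETS][WAYS];
    int lru[SETS] = {0};                 /* index of the least recently used way */
    int refs[] = {0, 4, 0, 4, 0, 4, 0, 4};
    int misses = 0;

    for (int i = 0; i < 8; i++) {
        int set = refs[i] % SETS;
        int t   = refs[i] / SETS;
        int hit_way = -1;
        for (int w = 0; w < WAYS; w++)
            if (valid[set][w] && tag[set][w] == t) hit_way = w;
        if (hit_way < 0) {               /* miss: fill the LRU way of the set */
            hit_way = lru[set];
            valid[set][hit_way] = 1;
            tag[set][hit_way]   = t;
            misses++;
        }
        lru[set] = 1 - hit_way;          /* the other way becomes LRU */
    }
    printf("8 requests, %d misses\n", misses);   /* 2 misses */
    return 0;
}
```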
Four-Way Set Associative Cache
• 2^8 = 256 sets, each with 4 ways (each way holding one 1-word block)
[Figure: the 32-bit address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset. The index selects one set; all four ways' tags are compared in parallel with the address tag, and a 4-to-1 multiplexor selects the 32-bit data word from the way that hits.]
Recap: Multiword Block Direct Mapped Cache
• 4 words/block, cache size = 1K words
[Figure: as before – a 20-bit tag, 8-bit index, 2-bit block offset, and 2-bit byte offset; the block offset selects one of the four words in the block.]
• We can take advantage of spatial locality.
Spectrum of Associativity
• For a cache with 8 entries: direct mapped = 8 sets of 1 way; 2-way = 4 sets; 4-way = 2 sets; fully associative = 1 set of 8 ways.
Range of Set Associative Caches
• For a fixed-size cache, each doubling of the associativity will:
  • double the number of blocks per set (i.e., the number of ways), and
  • halve the number of sets – decreasing the size of the index by 1 bit and increasing the size of the tag by 1 bit.
[Figure: address fields – the tag (used for the tag compare), the index (selects the set), the block offset (selects the word in the block), and the byte offset. Decreasing associativity toward direct mapped (only one way) means smaller tags and only a single comparator; increasing associativity toward fully associative (only one set) means the tag is all the bits except the block and byte offsets.]
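A minimal sketch (not from the slides) of how the field widths shift with associativity for a fixed-size cache; the 4KB capacity and 16-byte block size here are illustrative assumptions.

```c
#include <stdio.h>

int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    int addr_bits   = 32;
    int cache_bytes = 4 * 1024;
    int block_bytes = 16;
    int blocks      = cache_bytes / block_bytes;   /* 256 blocks total        */
    int offset_bits = log2i(block_bytes);          /* block + byte offset = 4 */

    for (int ways = 1; ways <= blocks; ways *= 2) {
        int sets       = blocks / ways;
        int index_bits = log2i(sets);
        int tag_bits   = addr_bits - index_bits - offset_bits;
        printf("%3d-way: %3d sets, index = %2d bits, tag = %2d bits\n",
               ways, sets, index_bits, tag_bits);
    }
    /* each doubling of the ways halves the sets: the index loses a bit and the
       tag gains one; at 256 ways (fully associative) the index is 0 bits wide */
    return 0;
}
```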
Costs of Set Associative Caches
• When a miss occurs, which block in the set do we pick for replacement (i.e., evict)?
  • Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.
    • Requires hardware to keep track of when each way's block was used relative to the other blocks in the set.
    • For 2-way set associative, this takes one bit per set – set the bit when a block is referenced (and reset the other way's bit).
• N-way set associative cache costs:
  • N comparators (increased delay and area)
  • A MUX delay (way selection) before the data is available
  • Data is available only after the way selection (and hit/miss decision).
    • In a direct-mapped cache, the cache block is available before the hit/miss decision, so the processor can assume a hit, continue, and recover later if it was a miss; a set associative cache cannot do this.
Recap: Measuring Cache Performance
• Assuming the cache (hit) access time is included as part of the normal CPU execution cycle, then
    CPU time = Instruction Count × CPI × Clock Cycle Time
             = Instruction Count × (CPI_ideal + Memory-stall cycles per instruction) × Clock Cycle Time
  where CPI_stall = CPI_ideal + Memory-stall cycles per instruction.
• For write-allocate caches, we can simplify this to
    Memory-stall cycles = memory accesses/program × miss rate × miss penalty
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses:
    AMAT = Time for a hit + Miss rate × Miss penalty
Reducing Cache Miss Penalty #2
2. Use multiple levels of caches
  • With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of cache – normally a unified L2 cache (i.e., a UL2$ holding both instructions and data), and in some cases even a unified L3 cache.
  • Example: CPI_ideal of 2, a 100-cycle miss penalty to main memory, a 25-cycle miss penalty to the UL2$, 36% load/stores, a 2% L1 I$ miss rate, a 4% L1 D$ miss rate, and a 0.5% UL2$ miss rate.
    • CPI without the UL2$ (100-cycle miss penalty to memory):
        CPI_w/o_UL2$ = 2 + 2% × 100 + 36% × 4% × 100 = 2 + 3.44 = 5.44
    • CPI_stall with the UL2$:
        CPI_w_UL2$ = 2 + 2% × 25 + 36% × 4% × 25 + 0.5% × 100 = 2 + 1.36 = 3.36
    • Speedup = 5.44 / 3.36 ≈ 1.6
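A minimal sketch (not from the slides) of the two-level cache arithmetic above; as on the slide, the UL2$ miss rate is treated as a global miss rate applied per instruction.

```c
#include <stdio.h>

int main(void) {
    double cpi_ideal = 2.0, ld_st = 0.36;
    double i_miss = 0.02, d_miss = 0.04;
    double mem_penalty = 100.0, l2_penalty = 25.0, l2_miss = 0.005;

    double cpi_no_l2 = cpi_ideal
                     + i_miss * mem_penalty
                     + ld_st * d_miss * mem_penalty;   /* 5.44 */
    double cpi_l2    = cpi_ideal
                     + i_miss * l2_penalty
                     + ld_st * d_miss * l2_penalty
                     + l2_miss * mem_penalty;          /* 3.36 */

    printf("without L2: %.2f  with L2: %.2f  speedup: %.2f\n",
           cpi_no_l2, cpi_l2, cpi_no_l2 / cpi_l2);
    return 0;
}
```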
Multilevel Cache Design Considerations
• Design considerations for L1 and L2 caches are very different.
  • The primary cache (L1$) should focus on minimizing hit time in support of a shorter clock cycle:
    • smaller, with smaller block sizes
  • Secondary cache(s) (L2$, L3$) should focus on reducing miss rate, to reduce the penalty of long main memory access times:
    • larger, with larger block sizes
    • higher levels of associativity
• The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (and thus faster) even if that means a higher L1 miss rate.
• For the L2 cache, hit time is less important than miss rate:
  • the L2$ hit time determines the L1$'s miss penalty
  • the L2$ local miss rate is much higher than its global miss rate
Two Machines’ Cache Parameters
Parameter: Intel Nehalem / AMD Barcelona
• L1 cache organization & size: Split I$ and D$; 32KB each per core; 64B blocks / Split I$ and D$; 64KB each per core; 64B blocks
• L1 associativity: 4-way (I), 8-way (D) set assoc.; ~LRU replacement / 2-way set assoc.; LRU replacement
• L1 write policy: write-back, write-allocate / write-back, write-allocate
• L2 cache organization & size: Unified; 256KB (0.25MB) per core; 64B blocks / Unified; 512KB (0.5MB) per core; 64B blocks
• L2 associativity: 8-way set assoc.; ~LRU / 16-way set assoc.; ~LRU
• L2 write policy: write-back, write-allocate / write-back, write-allocate
• L3 cache organization & size: Unified; 8192KB (8MB) shared by cores; 64B blocks / Unified; 2048KB (2MB) shared by cores; 64B blocks
• L3 associativity: 16-way set assoc. / 32-way set assoc.; evict block shared by fewest cores
• L3 write policy: write-back, write-allocate / write-back, write-allocate
Cache in Multicores – Coherence Problem
• In a multi-core processor, multiple cores share a common physical address space, causing a cache coherence problem.
[Figure: Core 1 and Core 2 each have private L1 I$ and D$ above a unified (shared) L2. Both cores read X (initially 0) into their L1 D$. Core 1 then writes 1 to X, updating its own copy, but Core 2's cached copy still holds the stale value X = 0.]
A Coherent Memory System
• Any read of a data item should return the most recently written value of that data item.
• Coherence
  • Defines what values can be returned by a read.
  • Writes to the same location are serialized: two writes to the same location must be seen in the same order by all cores.
• To enforce coherence, caches must provide:
  • Migration of shared data items to a core's local cache
    • Migration reduces the latency of data access and the bandwidth demand on the shared memory (the L2 in our example).
  • Replication of shared data items in multiple cores' caches
    • Replication reduces both latency and contention for read-shared data items.
Cache Coherence Protocols
• A hardware protocol is needed to ensure cache coherence.
• The most popular approach is based on snooping:
  • Cache controllers monitor (snoop on) the bus to determine whether their cache has a copy of a block that is being requested.
• Write-invalidate protocol:
  • Writes require exclusive access and invalidate all other copies.
  • Exclusive access ensures that no other readable or writable copies of the item exist.
  • If two processors attempt to write the same data at the same time:
    • One of them wins the race, causing the other core's copy to be invalidated.
    • For the other core to complete its write, it must obtain a new copy of the data with the updated value – thus enforcing write serialization.
Handling Writes
Two ways to ensure that all processors sharing the data are informed of writes:
1. Write-update (write-broadcast)
  • The writing processor broadcasts the new data over the bus; all copies are updated.
  • All writes to the same location must go to the bus → higher bus traffic.
  • Since new values appear in the other caches sooner, latency can be reduced.
2. Write-invalidate
  • The writing processor issues an invalidation signal on the bus.
  • Caches snoop and check whether they have a copy of the data; if so, they invalidate their cache block containing the word (this allows multiple readers but only one writer).
  • Only the first write goes to the bus; all later writes to the same location are done locally.
    → lower bus traffic and better use of bus bandwidth
Example of Snooping Invalidation
[Figure: Core 1 and Core 2 above a unified (shared) L2, both initially caching X = 0. Core 1 writes 1 to X; the snooping protocol invalidates Core 2's copy (X = I). When Core 2 reads X again, it misses.]
• When the second miss by Core 2 occurs, Core 1 responds with the value, canceling the response from the L2 cache (and also updating the L2 copy).
A Write-Invalidate CC Protocol (MSI)
The protocol has three block states – Invalid, Shared (clean), and Modified (dirty). Transitions are triggered by signals from the local core and by signals snooped from the bus:
• Invalid → Shared: on a read miss by the local core.
• Shared: reads (hit or miss) by the local core stay in Shared.
• Shared → Modified: on a write (hit or miss) by the local core; send an invalidate on the bus.
• Invalid → Modified: on a write miss by the local core; send an invalidate on the bus.
• Modified: read hits and write hits by the local core stay in Modified.
• Modified → Shared: write-back due to a read miss by another core to this block.
• Shared → Invalid: on receiving an invalidate (a write by another core to this block).
(In the original state diagram, signals from the local core are shown in red, the caching protocol actions in black, and signals from the bus in blue.)
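A minimal sketch (not from the slides): the MSI transitions described above written as a table-driven function for one cache block. Only the state changes are modeled, not the bus or the data; the event names are illustrative.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { CPU_READ, CPU_WRITE,           /* from the local core  */
               BUS_READ_MISS, BUS_INVALIDATE  /* snooped from the bus */ } event_t;

msi_state_t msi_next(msi_state_t s, event_t e) {
    switch (e) {
    case CPU_READ:       return (s == INVALID) ? SHARED : s;   /* a miss fills the block as Shared   */
    case CPU_WRITE:      return MODIFIED;                      /* from S or I, an invalidate is sent */
    case BUS_READ_MISS:  return (s == MODIFIED) ? SHARED : s;  /* write back, then share the block   */
    case BUS_INVALIDATE: return INVALID;                       /* another core is writing this block */
    }
    return s;
}

int main(void) {
    const char *name[] = { "Invalid", "Shared", "Modified" };
    msi_state_t s = INVALID;
    s = msi_next(s, CPU_READ);        printf("after local read:   %s\n", name[s]);
    s = msi_next(s, CPU_WRITE);       printf("after local write:  %s\n", name[s]);
    s = msi_next(s, BUS_READ_MISS);   printf("after remote read:  %s\n", name[s]);
    s = msi_next(s, BUS_INVALIDATE);  printf("after remote write: %s\n", name[s]);
    return 0;
}
```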
Write-Invalidate CC Examples
• M = modified (only one copy), S = shared (possibly many), I = invalid (possibly many)
• Case 1: Core 1 holds A in state S, Core 2 holds A in state I; Core 2 then reads A.
  1. Core 2 has a read miss for A.
  2. C2 sends a read request for A on the bus.
  3. C1 snoops the read request and lets main memory (MM) supply A.
  4. C2 gets A from MM and changes its state to S. (C1 keeps A in S.)
• Case 2: Core 1 holds A in state S, Core 2 holds A in state I; Core 2 then writes A.
  1. Core 2 has a write miss for A.
  2. C2 sends an invalidate for A on the bus.
  3. C1 changes A's state to I.
  4. C2 gets A from MM and changes its state to M.
Write-Invalidate CC Examples (cont.)
• M = modified (only one copy), S = shared (possibly many), I = invalid (possibly many)
• Case 3: Core 1 holds A in state M, Core 2 holds A in state I; Core 2 then reads A.
  1. Core 2 has a read miss for A.
  2. C2 sends a read request for A on the bus.
  3. C1 snoops the read request and writes A back to MM.
  4. C1 changes A's state to S.
  5. C2 gets A from MM and changes its state to S.
• Case 4: Core 1 holds A in state M, Core 2 holds A in state I; Core 2 then writes A.
  1. Core 2 has a write miss for A.
  2. C2 sends an invalidate for A on the bus.
  3. C1 sends A back to MM and changes A's state to I.
  4. C2 gets A from MM and changes its state to M.
Block Size Effects – False Sharing
• A write to one word in a multi-word block means that the full block is invalidated in other caches.
• Multi-word blocks can also result in false sharing: two cores write two different variables that happen to fall in the same cache block.
  [Figure: Core 1 writes variable A and Core 2 writes variable B, where A and B sit in the same 4-word cache block.]
• With write-invalidate, false sharing can significantly increase cache miss rates.
• Compilers can help reduce false sharing by allocating highly correlated data to the same cache block.
Other Coherence Protocols
• There are many variations on cache coherence protocols.
• Another write-invalidate protocol, used in the Pentium 4 (and many other processors), is MESI, with four states:
  • Modified – same as in MSI
  • Exclusive – only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
    • Since there is only one copy of the block, write hits don't need to send an invalidate signal.
  • Shared – multiple copies of the shared data may be cached (i.e., the data is permitted to be cached by more than one processor); memory has an up-to-date copy
  • Invalid – same as in MSI
MESI Cache Coherency Protocol
[State diagram, reconstructed as text. States: Invalid (not a valid block), Shared (clean), Exclusive (clean), Modified (dirty).]
• Invalid → Shared on a processor shared read miss (another cache also holds the block); Invalid → Exclusive on a processor exclusive read miss (no other cache holds it).
• Shared stays Shared on processor shared reads; a processor write from Shared sends an invalidate signal and moves the block to Modified.
• Exclusive stays Exclusive on processor reads; a processor write from Exclusive moves the block to Modified without any bus signal.
• Modified stays Modified on processor writes or read hits; when another core reads the block, it is written back (Modified → Shared).
• A snooped invalidate, or a forced write back of this block due to another processor's exclusive request, returns the block to Invalid.
Random-Access Memory (RAM) Organization (I)
FIGURE B.9.1: A 2M × 16 SRAM, showing the 21 address lines (2M = 2^21), 16 data inputs, 3 control lines, and 16 data outputs.
• For reads, Chip select and Output enable are activated.
• For writes, Chip select and Write enable are activated.
Random-Access Memory (RAM) Organization (III)
[FIGURE B.9.3: the basic structure of a 4 × 2 SRAM – a decoder drives word lines [0..3]; each row contains a pair of cells whose outputs drive bit lines [1] and [0], with per-word write enables selecting which pair of cells to activate.]
• A 2-D array of cells avoids the large decoder needed for big arrays: a 4M × 8 SRAM would need a 22-to-4M decoder, but organized as a 2K × 2K array it needs two 11-to-2K decoders instead.
• The address that selects a cell is sent on one of a set of horizontal address lines, called word lines.
• The activated cells use three-state outputs connected to the vertical bit lines that supply the requested data.
Random-Access Memory (RAM) Organization (IV)
FIGURE B.9.4: Typical organization of a 4M × 8 SRAM as an array of 4K × 1024 arrays.
• The first decoder generates the addresses for the eight 4K × 1024 arrays.
• A set of multiplexors then selects 1 bit from each 1024-bit-wide array.
• This is a much easier design than a single-level decode, which would need either an enormous decoder or a gigantic multiplexor.
• In practice, a modern SRAM would probably use an even larger number of blocks, each somewhat smaller.
DRAM Organization
FIGURE B.9.6: A 4M × 1 DRAM built as a 2048 × 2048 array.
• The row access uses 11 bits to select a row, which is then latched into 2048 1-bit latches.
• An entire row is refreshed together.
• A multiplexor chooses the output bit from these 2048 latches.
• The RAS (row address strobe) and CAS (column address strobe) signals control whether the address lines are sent to the row decoder or the column multiplexor.
• The address lines are shared by the row address and the column address, so the total number of address bits is always even. As a result, each new generation of DRAMs is 4× larger than the previous one.
(DDR) SDRAM Operation
[Figure: an N-row × M-column DRAM array. The row address (RAS) selects a row, which is read into a 1 × M SRAM register; the column address (CAS) then selects bits from that register, with a +1 incrementer stepping through successive column addresses. The timing diagram shows the 1st through 4th M-bit accesses within one row cycle time.]
• After a row is read into the 1 × M SRAM register:
  • The CAS is input as the starting "burst" address, along with a burst length.
  • A burst of data is transferred from a series of sequential addresses within that row.
  • The memory bus clock controls the transfer of successive words in the burst.
• Double-data-rate (DDR) synchronous DRAM transfers data on both edges of the memory bus clock.
• RAS: row access strobe; CAS: column access strobe.
Memory Systems that Support Caches
• The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
[Figure: the on-chip CPU and cache connect over a one-word-wide bus (32-bit data and 32-bit address per cycle) to a one-word-wide DRAM memory.]
• Assume, for the one-word-wide organization:
  • 1 bus clock cycle to send the address
  • 15 bus clock cycles to get the 1st word of the block from DRAM (row cycle time), and 5 bus clock cycles for each of the 2nd, 3rd, and 4th words (column access time)
  • 1 bus clock cycle to return a word of data
• Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per bus clock cycle.
One Word Wide Bus, One-Word Blocks
• If the block size is one word, then on a cache miss the pipeline stalls for the number of cycles required to return one data word from memory:
    1 memory bus clock cycle to send the address
  + 15 memory bus clock cycles to read the DRAM
  + 1 memory bus clock cycle to return the data
  = 17 total clock cycles of miss penalty
• Number of bytes transferred per bus clock cycle (bandwidth) for a single miss:
    4 / 17 = 0.235 bytes per bus clock cycle
One Word Wide Bus, Four-Word Blocks
• What if the block size is four words and each word is in a different DRAM row?
    1 cycle to send the 1st address
  + 4 × 15 = 60 cycles to read the DRAM (four row accesses of 15 cycles each)
  + 1 cycle to return the last data word
  = 62 total clock cycles of miss penalty
• Number of bytes transferred per bus clock cycle (bandwidth) for a single miss:
    (4 × 4) / 62 = 0.258 bytes per clock
One Word Wide Bus, Four-Word Blocks
• What if the block size is four words and all the words are in the same DRAM row?
    1 cycle to send the 1st address
  + 15 + 3 × 5 = 30 cycles to read the DRAM (one row access, then three column accesses)
  + 1 cycle to return the last data word
  = 32 total clock cycles of miss penalty
• Number of bytes transferred per bus clock cycle (bandwidth) for a single miss:
    (4 × 4) / 32 = 0.5 bytes per clock
Interleaved Memory, One-Word Wide Bus
• For a block size of four words, with the words spread across four DRAM banks (bank 0 through bank 3) accessed in parallel:
    1 cycle to send the 1st address
  + 15 cycles to read the DRAM banks (the four 15-cycle accesses overlap)
  + 4 × 1 = 4 cycles to return the four data words over the one-word-wide bus
  = 20 total clock cycles of miss penalty
• Number of bytes transferred per bus clock cycle (bandwidth) for a single miss:
    (4 × 4) / 20 = 0.8 bytes per clock
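A minimal sketch (not from the slides) of the miss-penalty and bandwidth arithmetic for the four organizations above, using the slide's assumptions (1 cycle to send an address, 15-cycle row access, 5-cycle column access, 1 cycle per word returned on a one-word bus).

```c
#include <stdio.h>

static void report(const char *name, int penalty, int bytes) {
    printf("%-34s penalty = %2d cycles, bandwidth = %.3f bytes/cycle\n",
           name, penalty, (double)bytes / penalty);
}

int main(void) {
    int row = 15, col = 5;

    report("1-word blocks",                 1 + row + 1,            4);   /* 17 cycles */
    report("4-word blocks, different rows", 1 + 4 * row + 1,       16);   /* 62 cycles */
    report("4-word blocks, same row",       1 + row + 3 * col + 1, 16);   /* 32 cycles */
    report("4-word blocks, 4 banks",        1 + row + 4 * 1,       16);   /* 20 cycles */
    return 0;
}
```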
DRAM Memory System Summary
• Caches access one block at a time (usually more than one word).
• It is important to match the cache characteristics:
  • with the DRAM characteristics
    • use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
  • with the memory-bus characteristics
    • make sure the memory bus can support the DRAM access rates and patterns
    • with the goal of increasing the memory-bus-to-cache bandwidth
Summary: Improving Cache Performance
1. Reduce hit time in the cache
• Smaller cache
• Direct-mapped cache
• Smaller blocks
• Write allocate – to avoid two cycles (first check for hit, then write)
• Pipeline writes via a write buffer
2. Reduce miss rate
• Bigger cache
• More flexible placement (increase associativity)
• Larger blocks (16 to 64 bytes typical)
• Victim cache – small buffer holding most recently discarded blocks
Summary: Improving Cache Performance
3. Reduce miss penalty
• Smaller blocks
• Use a write buffer to hold dirty blocks being replaced → don't have to wait for the write to complete
• Check write buffer (and/or victim cache) on a read miss – may get lucky
• For large blocks, fetch critical word first
• Use multiple cache levels – L2 cache not tied to CPU clock rate
• Faster backing store/improved memory bandwidth
• Wider buses
• Multiple banks, DDR SDRAMs
Summary: The Cache Design Space
• Several interacting dimensions:
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs. write-back
  • write allocation
• The optimal choice is a compromise:
  • depends on access characteristics
    • workload
    • use (I-cache, D-cache, TLB)
  • depends on technology / cost
• Simplicity often wins.
[Figure: a sketch of the design space – performance ("Good" vs. "Bad") plotted against interacting factors (Factor A less/more vs. Factor B), with cache size, associativity, and block size as the dimensions being traded off.]