Digital Systems Design 4
Memory & Caches Part 1
Dr Stewart Smith
This Section
• Memory Access Locality
• Memory Hierarchy
• Cache Technology
‣ Directly Mapped Caches
‣ Misses and Writes
• Main Memory Technology
Classic Components of a Computer: Memory
The processor gets instructions and data from memory. Input writes data to
memory, and output reads data from memory. Control sends the signals that
determine the operations of the datapath, memory, input, and output.
Full Datapath

[Figure: the full processor datapath.]
Memory Technology
• Static RAM (SRAM)
‣ 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM)
‣ 50ns – 70ns, $20 – $75 per GB
• Magnetic disk
‣ 5ms – 20ms, $0.20 – $2 per GB
• Ideal memory
‣ Access time of SRAM
‣ Capacity and cost/GB of disk
Memory Technology

Current Technology          Speed    Size      Cost ($/bit)
SRAM                        Fastest  Smallest  Highest
DRAM                        (intermediate on all three)
Flash SSD or Magnetic Disk  Slowest  Biggest   Lowest

[Figure: hierarchy from the Processor down through Memory (caches),
Memory (main memory), and Memory/Storage (flash or disk).]
Binary Notation – Giga vs. “Gibi”

Decimal term  Abbrev.  Value   Binary term  Abbrev.  Value  % larger
Kilobyte      KB       10^3    Kibibyte     KiB      2^10   2%
Megabyte      MB       10^6    Mebibyte     MiB      2^20   5%
Gigabyte      GB       10^9    Gibibyte     GiB      2^30   7%
Terabyte      TB       10^12   Tebibyte     TiB      2^40   10%

The same applies to bits as to bytes: a gigabit (Gb) is 10^9 bits,
while a gibibit is 2^30 bits (1,073,741,824!).
Principle of Locality
• Programs access a small proportion of their
address space at any time
• Temporal locality
‣ Items accessed recently are likely to be accessed
again soon
‣ e.g., instructions in a loop, induction variables
• Spatial locality
‣ Items near those accessed recently are likely to be
accessed soon
‣ e.g., sequential instruction access, array data
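Spatial locality can be sketched in Python (illustrative only, not from the slides; pure-Python lists blunt the cache effect that contiguous arrays in C or NumPy would show):

```python
# Illustrative sketch of spatial locality.
# Row-major order visits elements in the order they sit in memory;
# column-major order strides across rows on every access.
N = 500
grid = [[1] * N for _ in range(N)]

row_major = sum(grid[i][j] for i in range(N) for j in range(N))  # sequential
col_major = sum(grid[i][j] for j in range(N) for i in range(N))  # strided

print(row_major == col_major)
# → True
```

Either order computes the same sum; on real hardware with contiguous 2-D arrays the row-major loop is typically much faster, because every cache block fetched is fully used before it is evicted.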
Taking Advantage of Locality
• Memory hierarchy
• Store everything on disk
• Copy recently accessed (and nearby) items
from disk to smaller DRAM memory
‣ Main memory
• Copy more recently accessed (and nearby)
items from DRAM to smaller SRAM memory
‣ Cache memory attached to CPU
Memory Hierarchy

[Figure: pyramid with the CPU at the top, then Level 1, Level 2, …, Level n.
Distance from the CPU in access time increases down the hierarchy, and so
does the size of the memory at each level.]
Memory Hierarchy Levels
• Block (aka line): unit of copying
‣ May be multiple words
• If accessed data is present in upper level:
‣ Hit: access satisfied by upper level
– Hit ratio: hits/accesses
• If accessed data is absent:
‣ Miss: block copied from lower level
– Time taken: miss penalty
– Miss ratio: misses/accesses
= 1 – hit ratio
‣ Then accessed data supplied from upper
level
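The ratio definitions above can be checked with made-up counts (the numbers here are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical counts for one level of the hierarchy
accesses = 1000
hits = 940
misses = accesses - hits

hit_ratio = hits / accesses      # hits / accesses
miss_ratio = misses / accesses   # equivalently 1 − hit ratio

print(hit_ratio, miss_ratio)
# → 0.94 0.06
```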
Cache Memory
• Cache memory
‣ The level(s) of the memory hierarchy closest to the CPU
• Cache: a safe place for hiding or storing things
• Given a sequence of accesses X1, …, Xn−1, Xn:
‣ How do we know if the data is present?
‣ Where do we look?
Direct Mapped Cache
• Location determined by address
• Direct mapped: only one choice
‣ (Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits as the index
‣ e.g. 00101 → 101
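With a power-of-two block count, the modulo mapping is just keeping the low-order address bits, as in the 00101 → 101 example above (a small sketch):

```python
# Direct-mapped index: (block address) modulo (#blocks in cache)
num_blocks = 8            # 2^3 blocks → 3 index bits
block_addr = 0b00101

index = block_addr % num_blocks
assert index == block_addr & (num_blocks - 1)   # modulo == low-order bit mask

print(format(index, "03b"))
# → 101
```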
Tags and Valid Bits
• How do we know which particular block is
stored in a cache location?
‣ Store block address as well as the data
‣ Actually, only need the high-order bits
‣ Called the tag
• What if there is no data in a location?
‣ Valid bit: 1 = present, 0 = not present
‣ Initially 0
Cache Example 1
• 8 blocks, 1 word/block, direct mapped
• Initial state
‣ All valid bits = N (0, not present)
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N
Cache Example 2

Word address  Binary address  Hit/miss  Cache index
22            10 110          Miss      110

Before: all entries invalid, as in Example 1. After:
Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 3

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       ?         ?

Current state:
Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 3

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

After:
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 4

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Cache state (unchanged):
Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 5

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011

After both accesses (the first miss fills index 000, the second index 011):
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 6

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Hit       000

Cache state (unchanged):
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 7

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       ?         ?

Current state:
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 7

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010

Index 010 already held Mem[11010] (tag 11); the old block is evicted and
replaced. After:
Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
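The access sequence of Examples 2–7 can be replayed with a small direct-mapped cache sketch (illustrative Python, not from the slides; 8 one-word blocks, as above):

```python
# Direct-mapped cache simulator mirroring the worked examples.
def simulate(addresses, num_blocks=8):
    valid = [False] * num_blocks     # valid bits, initially 0 (N)
    tags = [None] * num_blocks       # stored tags
    results = []
    for addr in addresses:
        index = addr % num_blocks    # low-order bits select the block
        tag = addr // num_blocks     # high-order bits stored as the tag
        if valid[index] and tags[index] == tag:
            results.append("Hit")
        else:
            results.append("Miss")   # fetch block, update valid bit and tag
            valid[index], tags[index] = True, tag
    return results

# Word addresses from Examples 2–7
print(simulate([22, 26, 22, 26, 16, 3, 16, 18]))
# → ['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Miss', 'Hit', 'Miss']
```

The final miss is address 18 evicting address 26 from index 010, exactly as in Example 7.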
Address Subdivision

[Figure: direct-mapped cache datapath with tag comparator.
1024 blocks = 2^10, so the index is 10 bits.]
Address Structure

• How can you go from the address structure to the cache structure?
‣ What is the block size (in words or bytes)?
‣ How many blocks/lines are in the cache?

Bits:   31 … 14   13 … 6   5 … 2          1 0
Field:  Tag       Index    Block offset   Byte offset
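Reading the field widths off the example address above gives the cache structure directly (a sketch; assumes 4-byte words):

```python
# Field widths from the example address layout:
# tag bits 31–14, index bits 13–6, block-offset bits 5–2, byte-offset bits 1–0
index_bits = 13 - 6 + 1        # 8
offset_bits = 5 - 2 + 1        # 4

num_lines = 2 ** index_bits             # blocks/lines in the cache
words_per_block = 2 ** offset_bits      # block size in words
bytes_per_block = words_per_block * 4   # assuming 4-byte words

print(num_lines, words_per_block, bytes_per_block)
# → 256 16 64
```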
Example of FastMATH Cache

[Figure: Intrinsity FastMATH cache organisation.]
Choice of Block Size
• Larger blocks exploit spatial locality
‣ Contiguous data in arrays for example
• However the benefits reduce as block size
increases
‣ Very large blocks mean fewer blocks in cache
and an increased miss rate
‣ Large blocks also increase the miss penalty of
retrieving the block from the next level
Cache Misses (Reads)
• On cache hit, CPU proceeds normally
• On cache miss
‣ Stall the CPU pipeline
‣ Fetch block from next level of hierarchy
‣ Instruction cache miss
– Restart instruction fetch
‣ Data cache miss
– Complete data access
Write-Through
• On data-write hit, could just update the block in
cache
‣ But then cache and memory would be inconsistent
• Write through: also update memory
• But it makes writes take longer
‣ e.g., if base CPI = 1, 10% of instructions are stores, and a
write to memory takes 100 cycles
– Effective CPI = 1 + 0.1×100 = 11 (more than a 10× slowdown!)
• Solution: write buffer
‣ Holds data waiting to be written to memory
‣ CPU continues immediately
– Only stalls on write if write buffer is already full
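The effective-CPI figure above is a one-line calculation (numbers from the slide):

```python
# Effective CPI with write-through and no write buffer
base_cpi = 1.0
store_fraction = 0.10    # 10% of instructions are stores
write_penalty = 100      # cycles to write to memory

effective_cpi = base_cpi + store_fraction * write_penalty
print(effective_cpi)
# → 11.0
```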
Write-Back
• Alternative: On data-write hit, just update
the block in cache
‣ Keep track of whether each block is dirty
• When a dirty block is replaced
‣ Write it back to memory
‣ A write buffer lets the replacing block be read before the
dirty block is written back
Write Allocation
• What should happen on a write miss?
• For write-back
‣ Usually fetch the block
‣ Need to make sure we don’t lose changed data
• Alternatives for write-through
‣ Allocate on miss: fetch the block
‣ Write around: don’t fetch the block
– Since programs often write a whole block before
reading it (e.g., initialization)
Write Through

[Figure: write-through policies — no allocation (write around) vs. write allocation.]
Write Back

[Figure: write-back policy with write allocation.]
Main Memory Supporting Caches
• Use DRAMs for main memory
‣ Fixed width (e.g., 1 word)
‣ Connected by fixed-width clocked bus
– Bus clock is typically slower than CPU clock
• Example cache block read
‣ 1 bus cycle for address transfer
‣ 15 bus cycles per DRAM access
‣ 1 bus cycle per data transfer
• For 4-word block, 1-word-wide DRAM
‣ Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
‣ Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
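The miss-penalty arithmetic above can be checked directly (cycle counts from the slide):

```python
# Cache-block read over a 1-word-wide DRAM bus, 4-word block
addr_cycles = 1     # one bus cycle to send the address
dram_cycles = 15    # bus cycles per DRAM access
xfer_cycles = 1     # bus cycle per word transferred
words = 4

miss_penalty = addr_cycles + words * dram_cycles + words * xfer_cycles
bandwidth = 16 / miss_penalty   # 16 bytes in the block

print(miss_penalty, round(bandwidth, 2))
# → 65 0.25
```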
DRAM Technology
• Data stored as a charge in a capacitor
‣ Single transistor used to access the charge
‣ Must periodically be refreshed
– Read contents and write back
– Performed on a DRAM “row”
DRAM Performance Factors
• Row buffer
‣ Allows several words to be read and refreshed in parallel
• Synchronous DRAM
‣ Allows for consecutive accesses in bursts without
needing to send each address
‣ Improves bandwidth
• DRAM banking
‣ Allows simultaneous access to multiple DRAMs
‣ Improves bandwidth
Advanced DRAM Organisation
• Bits in a DRAM are organised as a
rectangular array
‣ DRAM accesses an entire row
‣ Burst mode: supply successive words from a row
with reduced latency
• Double data rate (DDR) DRAM
‣ Transfer on rising and falling clock edges
• Quad data rate (QDR) DRAM
‣ Separate DDR inputs and outputs
Row Hammer
• Activating a row in DRAM can cause
fluctuating potentials in adjacent rows
• Rapidly cycling the activation can cause
‘bit flips’
• Worse effects seen in more advanced
memory from DDR3 onwards
• Exploits have been demonstrated which
can bypass security on computers/mobiles
DRAM Generations

[Figure: DRAM access time in ns (log scale, 1–1000) vs. year, 1980–2018,
showing TRAC and TCAC both falling over time.]

Year  Capacity  $/GiB
1980  64 Kib    $6,480,000
1983  256 Kib   $1,980,000
1985  1 Mib     $720,000
1989  4 Mib     $128,000
1992  16 Mib    $30,000
1996  64 Mib    $9,000
1998  128 Mib   $900
2000  256 Mib   $840
2004  512 Mib   $150
2007  1 Gib     $40
2010  2 Gib     $13
2012  4 Gib     $5
2015  8 Gib     $7
2018  16 Gib    $6

TRAC – time to access a new row/column
TCAC – time to access a column in the active row
Increasing Memory Bandwidth
• 4-word wide memory
‣ Miss penalty = 1 + 15 + 1 = 17 bus cycles
‣ Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
• 4-bank interleaved memory
‣ Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
‣ Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
The primary method of achieving higher memory bandwidth is to increase the physical
or logical width of the memory system.
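The three miss penalties above follow from the same three cycle counts, varying only how many DRAM accesses and transfers overlap (a sketch using the slide's numbers):

```python
# Miss-penalty comparison for a 4-word block
def penalty(addr=1, dram=15, xfer=1, dram_accesses=1, transfers=1):
    return addr + dram_accesses * dram + transfers * xfer

narrow = penalty(dram_accesses=4, transfers=4)  # 1-word-wide DRAM, serial
wide = penalty()                                # 4-word-wide memory and bus
interleaved = penalty(transfers=4)              # 4 banks accessed in parallel,
                                                # words transferred one per cycle
print(narrow, wide, interleaved)
# → 65 17 20
```

Interleaving gets most of the benefit of a wide memory while keeping the narrow (cheaper) bus.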
Next Section
• Cache Performance
• Associative Caches
• Multilevel Caches
• Real examples