Digital System Design 4: Memory & Caches, Part 1
Dr Stewart Smith
Stewart Smith Digital Systems Design 4
This Section
• Memory Access Locality
• Memory Hierarchy
• Cache Technology
  ‣ Directly Mapped Caches
  ‣ Misses and Writes
• Main Memory Technology
Classic Components of a Computer: Memory
The processor gets instructions and data from memory. Input writes data to memory, and output reads data from memory. Control sends the signals that determine the operations of the datapath, memory, input, and output.
Full Datapath
Memory Technology
• Static RAM (SRAM)
  ‣ 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM)
  ‣ 50ns – 70ns, $20 – $75 per GB
• Magnetic disk
  ‣ 5ms – 20ms, $0.20 – $2 per GB
• Ideal memory
  ‣ Access time of SRAM
  ‣ Capacity and cost/GB of disk
Memory Technology
[Figure: memory hierarchy between processor and memory/storage. From fastest, smallest, and highest cost per bit to slowest, biggest, and lowest cost per bit, the current technologies are: SRAM, DRAM, then Flash SSD or magnetic disk.]
Binary Notation – Giga v.s. “Gibi”
Decimal term  Abbrev.  Value  | Binary term  Abbrev.  Value  | % larger
Kilobyte      KB       10^3   | Kibibyte     KiB      2^10   | 2%
Megabyte      MB       10^6   | Mebibyte     MiB      2^20   | 5%
Gigabyte      GB       10^9   | Gibibyte     GiB      2^30   | 7%
Terabyte      TB       10^12  | Tebibyte     TiB      2^40   | 10%
Same for bits as for bytes, so a gigabit (Gb) is 10^9 bits and a gibibit (Gib) is 2^30 bits (1,073,741,824!)
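The "% larger" column of the table can be reproduced directly (a quick Python check):

```python
# Difference between decimal (SI) and binary (IEC) units, per the table above.
units = [("Kilobyte", "Kibibyte"), ("Megabyte", "Mebibyte"),
         ("Gigabyte", "Gibibyte"), ("Terabyte", "Tebibyte")]

for n, (dec_name, bin_name) in enumerate(units, start=1):
    decimal = 10 ** (3 * n)   # 10^3, 10^6, 10^9, 10^12
    binary = 2 ** (10 * n)    # 2^10, 2^20, 2^30, 2^40
    pct = 100 * (binary - decimal) / decimal
    print(f"1 {bin_name} is {pct:.0f}% larger than 1 {dec_name}")
# 2%, 5%, 7%, 10%
```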
Principle of Locality
• Programs access a small proportion of their address space at any time
• Temporal locality
  ‣ Items accessed recently are likely to be accessed again soon
  ‣ e.g., instructions in a loop, induction variables
• Spatial locality
  ‣ Items near those accessed recently are likely to be accessed soon
  ‣ e.g., sequential instruction access, array data
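A short sketch of how ordinary code exhibits both kinds of locality (illustrative only):

```python
# Summing an array touches the same loop variables repeatedly (temporal
# locality) while walking through adjacent memory locations (spatial locality).
data = list(range(1000))

total = 0
for i in range(len(data)):   # temporal: i and total are reused every iteration
    total += data[i]         # spatial: data[0], data[1], ... are accessed in sequence
print(total)  # 499500
```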
Taking Advantage of Locality
• Memory hierarchy
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  ‣ Main memory
• Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  ‣ Cache memory attached to CPU
Memory Hierarchy
[Figure: CPU at the top, then Level 1, Level 2, …, Level n below it. The size of the memory at each level grows, and the distance from the CPU in access time increases, moving down the levels in the memory hierarchy.]
Memory Hierarchy Levels
• Block (aka line): unit of copying
  ‣ May be multiple words
• If accessed data is present in upper level:
  ‣ Hit: access satisfied by upper level
    – Hit ratio: hits/accesses
• If accessed data is absent:
  ‣ Miss: block copied from lower level
    – Time taken: miss penalty
    – Miss ratio: misses/accesses = 1 – hit ratio
  ‣ Then accessed data supplied from upper level
Cache Memory
• Cache memory
  ‣ The level/s of the memory hierarchy closest to the CPU
• Cache: a safe place for hiding or storing things
• Given accesses X1, …, Xn–1, Xn:
  ‣ How do we know if the data is present?
  ‣ Where do we look?
Direct Mapped Cache
• Location determined by address
• Direct mapped: only one choice
  ‣ (Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits
  ‣ e.g., 00101 → 101
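The mapping rule can be sketched in a couple of lines — for a power-of-two number of blocks, the modulo is just the low-order bits of the block address:

```python
# Direct-mapped placement: (block address) modulo (#blocks in cache).
NUM_BLOCKS = 8  # 2^3, so the index is the low-order 3 bits

def cache_index(block_address):
    return block_address % NUM_BLOCKS  # equivalently: block_address & (NUM_BLOCKS - 1)

print(f"{0b00101:05b} -> {cache_index(0b00101):03b}")  # 00101 -> 101
```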
Tags and Valid Bits
• How do we know which particular block is stored in a cache location?
  ‣ Store block address as well as the data
  ‣ Actually, only need the high-order bits
  ‣ Called the tag
• What if there is no data in a location?
  ‣ Valid bit: 1 = present, 0 = not present
  ‣ Initially 0
Cache Example 1
• 8 blocks, 1 word/block, direct mapped
• Initial state
  ‣ All valid bits = N (0, not present)

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N
Cache Example 2
Word address  Binary address  Hit/miss  Cache index
22            10 110          Miss      110

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 3
Word addr  Binary addr  Hit/miss  Cache block
26         11 010       ?         ?

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 3
Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 4
Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 5
Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 6
Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Hit       000

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 7
Word addr  Binary addr  Hit/miss  Cache block
18         10 010       ?         ?

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Cache Example 7
Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]   (replaces tag 11, Mem[11010])
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
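The whole worked example can be replayed with a minimal simulator of this 8-block, 1-word-per-block direct-mapped cache (5-bit word addresses: 3-bit index, 2-bit tag):

```python
# Minimal direct-mapped cache simulator for the worked example above.
NUM_BLOCKS = 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_BLOCKS)]

def access(addr):
    index = addr % NUM_BLOCKS   # low-order 3 bits pick the line
    tag = addr // NUM_BLOCKS    # high-order bits are stored/compared as the tag
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return "Hit"
    # Miss: fetch the block, replacing whatever was in this line before
    line.update(valid=True, tag=tag, data=f"Mem[{addr:05b}]")
    return "Miss"

results = [(a, access(a)) for a in [22, 26, 22, 26, 16, 3, 16, 18]]
print(results)
# Address 18 (10 010) evicts address 26 (11 010): same index, different tag.
```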
Address Subdivision
[Figure: cache with 1024 blocks = 2^10 (10-bit index)]
Address Structure

31 … 14 | 13 … 6 | 5 … 2        | 1 0
Tag     | Index  | Block offset | Byte

[Figure: a comparator checks the tag field of the address against the tag stored in the indexed line]
• How can you go from the address structure to the cache structure?
  ‣ What is the block size (words or bytes)?
  ‣ How many blocks/lines in the cache?
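One consistent answer to the questions above (assuming 4-byte words): the 4-bit block offset gives 16 words (64 bytes) per block, and the 8-bit index gives 2^8 = 256 lines. A sketch of extracting the fields with shifts and masks:

```python
# Field extraction for the 32-bit address layout above:
# tag = bits 31-14, index = bits 13-6, block offset = bits 5-2, byte = bits 1-0.
def split_address(addr):
    byte_offset  = addr & 0x3          # bits 1-0: byte within the word
    block_offset = (addr >> 2) & 0xF   # bits 5-2: word within the block (16 words)
    index        = (addr >> 6) & 0xFF  # bits 13-6: which of the 256 cache lines
    tag          = addr >> 14          # bits 31-14: compared with the stored tag
    return tag, index, block_offset, byte_offset

print(split_address(0x1234))  # (0, 72, 13, 0)
```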
Example of FastMATH Cache
Choice of Block Size
• Larger blocks exploit spatial locality
  ‣ Contiguous data in arrays, for example
• However, the benefits reduce as block size increases
  ‣ Very large blocks mean fewer blocks in cache and an increased miss rate
  ‣ Large blocks also increase the miss penalty of retrieving the block from the next level
Cache Misses (Reads)
• On cache hit, CPU proceeds normally
• On cache miss
  ‣ Stall the CPU pipeline
  ‣ Fetch block from next level of hierarchy
  ‣ Instruction cache miss
    – Restart instruction fetch
  ‣ Data cache miss
    – Complete data access
Write-Through
• On a data-write hit, we could just update the block in cache
  ‣ But then cache and memory would be inconsistent
• Write through: also update memory
• But this makes writes take longer
  ‣ e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles
    – Effective CPI = 1 + 0.1×100 = 11 (more than a factor of 10 slowdown!)
• Solution: write buffer
  ‣ Holds data waiting to be written to memory
  ‣ CPU continues immediately
    – Only stalls on a write if the write buffer is already full
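The slide's effective-CPI arithmetic, written out:

```python
# Effective CPI with write-through and no write buffer, using the slide's
# numbers: base CPI 1, 10% of instructions are stores, 100-cycle memory write.
base_cpi = 1.0
store_fraction = 0.10
write_cycles = 100

effective_cpi = base_cpi + store_fraction * write_cycles
print(effective_cpi)  # 11.0 - more than 10x slower than the base CPI
```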
Write-Back
• Alternative: on data-write hit, just update the block in cache
  ‣ Keep track of whether each block is dirty
• When a dirty block is replaced
  ‣ Write it back to memory
  ‣ Can use a write buffer to allow replacing block to be read first
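Not from the slides, but a sketch of the dirty-bit bookkeeping for a single cache line (names and structure are illustrative): writes only mark the line dirty, and memory is updated when a dirty line is evicted.

```python
# Write-back bookkeeping for one direct-mapped line (illustrative sketch).
memory = {}  # stand-in for main memory, keyed by tag

class Line:
    def __init__(self):
        self.valid = self.dirty = False
        self.tag = self.data = None

def write(line, tag, value):
    if line.valid and line.tag == tag:   # write hit: update cache only
        line.data = value
        line.dirty = True
        return
    if line.valid and line.dirty:        # evicting a dirty block:
        memory[line.tag] = line.data     # write it back to memory first
    line.valid, line.dirty = True, True  # then allocate for the new block
    line.tag, line.data = tag, value

line = Line()
write(line, tag=3, value="A")   # miss: allocate
write(line, tag=3, value="B")   # hit: cache updated, memory untouched
write(line, tag=5, value="C")   # evicts dirty block: memory[3] = "B"
print(memory)  # {3: 'B'}
```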
Write Allocation
• What should happen on a write miss?
• For write-back
  ‣ Usually fetch the block
  ‣ Need to make sure we don’t lose changed data
• Alternatives for write-through
  ‣ Allocate on miss: fetch the block
  ‣ Write around: don’t fetch the block
    – Since programs often write a whole block before reading it (e.g., initialization)
Write Through
[Figure: write-through flowcharts, comparing write allocation with write around (no allocation)]

Write Back
[Figure: write-back flowchart with write allocation]
Main Memory Supporting Caches
• Use DRAMs for main memory
  ‣ Fixed width (e.g., 1 word)
  ‣ Connected by fixed-width clocked bus
    – Bus clock is typically slower than CPU clock
• Example cache block read
  ‣ 1 bus cycle for address transfer
  ‣ 15 bus cycles per DRAM access
  ‣ 1 bus cycle per data transfer
• For 4-word block, 1-word-wide DRAM
  ‣ Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  ‣ Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
DRAM Technology
• Data stored as a charge in a capacitor
  ‣ Single transistor used to access the charge
• Must periodically be refreshed
  ‣ Read contents and write back
  ‣ Performed on a DRAM “row”
DRAM Performance Factors
• Row buffer
  ‣ Allows several words to be read and refreshed in parallel
• Synchronous DRAM
  ‣ Allows for consecutive accesses in bursts without needing to send each address
  ‣ Improves bandwidth
• DRAM banking
  ‣ Allows simultaneous access to multiple DRAMs
  ‣ Improves bandwidth
Advanced DRAM Organisation
• Bits in a DRAM are organised as a rectangular array
  ‣ DRAM accesses an entire row
  ‣ Burst mode: supply successive words from a row with reduced latency
• Double data rate (DDR) DRAM
  ‣ Transfer on rising and falling clock edges
• Quad data rate (QDR) DRAM
  ‣ Separate DDR inputs and outputs
Row Hammer
• Activating a row in DRAM can cause fluctuating potentials in adjacent rows
• Rapidly cycling the activation can cause ‘bit flips’
• Worse effects are seen in more advanced memory, from DDR3 onwards
• Exploits have been demonstrated which can bypass security on computers/mobiles
DRAM Generations
[Figure: DRAM access time (ns, log scale) falling across the generations; Trac = time to access a new row/column, Tcac = time to access a column in the active row]

Year  Capacity  $/GiB
1980  64 Kib    $6,480,000
1983  256 Kib   $1,980,000
1985  1 Mib     $720,000
1989  4 Mib     $128,000
1992  16 Mib    $30,000
1996  64 Mib    $9,000
1998  128 Mib   $900
2000  256 Mib   $840
2004  512 Mib   $150
2007  1 Gib     $40
2010  2 Gib     $13
2012  4 Gib     $5
2015  8 Gib     $7
2018  16 Gib    $6
Increasing Memory Bandwidth
• The primary method of achieving higher memory bandwidth is to increase the physical or logical width of the memory system
• 4-word-wide memory
  ‣ Miss penalty = 1 + 15 + 1 = 17 bus cycles
  ‣ Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
• 4-bank interleaved memory
  ‣ Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  ‣ Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
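The three organisations from the slides can be compared side by side (1 bus cycle for the address, 15 cycles per DRAM access, 1 cycle per data transfer, 4-word blocks):

```python
# Miss penalty and bandwidth for a 4-word (16-byte) block under the three
# main-memory organisations discussed in the slides.
ADDR, DRAM, XFER, WORDS, BYTES = 1, 15, 1, 4, 16

narrow      = ADDR + WORDS * DRAM + WORDS * XFER  # 1-word-wide: serial accesses
wide        = ADDR + DRAM + XFER                  # 4-word-wide: one access, one transfer
interleaved = ADDR + DRAM + WORDS * XFER          # 4 banks: parallel access, serial transfer

for name, cycles in [("1-word-wide", narrow), ("4-word-wide", wide),
                     ("4-bank interleaved", interleaved)]:
    print(f"{name}: {cycles} cycles, {BYTES / cycles:.2f} B/cycle")
# 1-word-wide: 65 cycles, 0.25 B/cycle
# 4-word-wide: 17 cycles, 0.94 B/cycle
# 4-bank interleaved: 20 cycles, 0.80 B/cycle
```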
Next Section
• Cache Performance
• Associative Caches
• Multilevel Caches
• Real examples