CSCI-2500:
Computer Organization
Memory Hierarchy (Chapter 5)
Memory Technologies: Speed vs. Cost (1997)
Technology        SRAM         DRAM        Mag. disk
Access Time       5-25 ns      60-120 ns   10-20 million ns
Cost ($/MByte)    $100-$250    $5-$10      $0.10-$0.20
Access Time: the length of time it takes to get a value from memory, given an address.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 2
Memory Technologies: Speed vs. Cost (2004)
Technology        SRAM                  DRAM              Mag. disk
Access Time       0.5-5 ns              50-70 ns          5-20 million ns
Cost ($/GByte)    $4000-$10,000 (25x)   $100-$200 (50x)   $0.50-$2.00 (12x)
Observe: access times have not changed much over the last 7 years, but unit cost per capacity has changed dramatically.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 3
Performance and Memory
- SRAM is fast, but too expensive (we want large memories!).
- Using only SRAM (enough of it) would mean that the memory ends up costing more than everything else combined!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 4
Caching
- The idea is to use a small amount of fast memory near the processor (in a cache).
- The cache holds frequently needed memory locations:
  - when an instruction references a memory location, we want that value to be in the cache!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 5
Principles of Locality
Temporal: if a memory location is referenced, it is likely that it will be referenced again in the near future.
Spatial: if a memory location is referenced, it is likely that nearby items will be referenced in the near future.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 6
Programs and Locality
Programs tend to exhibit a great deal of locality in memory accesses.
- array and structure/record accesses
- subroutines (instructions are near each other)
- local variables (counters, pointers, etc.) are often referenced many times (see the sketch below).
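A concrete (and entirely hypothetical) C illustration: in the sketch below, sum is reused on every iteration (temporal locality), while the array a is walked sequentially, so neighboring elements are accessed close together in time (spatial locality).

    #include <stdio.h>

    int main(void) {
        int a[1024], sum = 0;
        for (int i = 0; i < 1024; i++)
            a[i] = i;            /* sequential writes: spatial locality      */
        for (int i = 0; i < 1024; i++)
            sum += a[i];         /* a[i] and a[i+1] are neighbors (spatial); */
                                 /* sum is touched every iteration (temporal) */
        printf("%d\n", sum);
        return 0;
    }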
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 7
Memory Hierarchy
The general idea is to build a hierarchy:
- At the top is a small, fast memory that is close to the processor.
- In the middle are larger, slower memories.
- At the bottom is massive memory with very slow access time.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 8
[Figure: the levels in the memory hierarchy form a pyramid. The CPU sits at the top next to Level 1; Level 2 through Level n lie below. Distance from the CPU in access time increases downward, and so does the size of the memory at each level.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 9
Cache and Main Memory
- For now we will focus on a 2-level hierarchy:
  - cache (small, fast memory directly connected to the processor).
  - main memory (large, slow memory at level 2 in the hierarchy).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 10
Memory Hierarchy and Data Transfer
Transfer of data is done between adjacent levels in the hierarchy only!
All access by the processor is to the topmost level.
[Figure 7.2: the processor sits at the top of the hierarchy; data are transferred only between adjacent levels.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 11
Terminology
- hit: when the memory location accessed by the processor is in the cache (the upper level).
- miss: when the memory location accessed by the processor is not in the cache.
- block: the minimum unit of information transferred between the cache and the main memory. Typically measured in bytes or words.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 12
Terminology (cont.)
- hit rate: the ratio of hits to total memory accesses.
- miss rate: 1 - hit rate.
- hit time: the time to access an element that is in the cache:
  - the time to find out if it's in the cache.
  - the time to transfer it from the cache to the processor.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 13
Terminology (cont.)
- miss penalty: the time to replace a block in the cache with a block from main memory and to deliver the element to the processor.
- hit time is small compared to the miss penalty (otherwise we wouldn't bother with a memory hierarchy!)
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 14
Simple Cache Model
- Assume that the processor accesses memory one word at a time.
- A block consists of one word.
- When a word is referenced and is not in the cache, it is put in the cache (copied from main memory).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 15
Cache Usage
- At some point in time the cache holds memory items X1, X2, ..., Xn-1.
- The processor next accesses memory item Xn, which is not in the cache.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 16
Cache before and after
[Figure: cache contents before the reference to Xn (it holds X1 through Xn-1) and after (Xn has been copied in as well).]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 17
Issues
- How do we know if an item is in the cache?
- If it is in the cache, how do we know where it is?
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 18
Direct-Mapped Cache
- Each memory location is mapped to a single location in the cache:
  - there is only one place it can be!
- Remember that the cache is smaller than memory, so many memory locations will be mapped to the same location in the cache.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 19
Mapping Function
- The simplest mapping is based on the least-significant (LS) bits of the address.
- For example, all memory locations whose address ends in 000 will be mapped to the same location in the cache.
- This requires a cache size of 2^n locations (a power of 2); a sketch of the mapping follows.
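A minimal C sketch of this mapping, assuming an 8-slot cache of one-word blocks (matching the 8-slot figure that follows); the function names are this sketch's own:

    #include <stdint.h>

    #define SLOT_BITS 3                       /* 2^3 = 8 slots             */
    #define NUM_SLOTS (1u << SLOT_BITS)

    static uint32_t slot_of(uint32_t word_addr) {
        return word_addr & (NUM_SLOTS - 1);   /* LS bits select the slot   */
    }

    static uint32_t tag_of(uint32_t word_addr) {
        return word_addr >> SLOT_BITS;        /* remaining bits: the tag   */
    }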
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 20
A Direct Mapped Cache
[Figure: an 8-slot direct-mapped cache (slots 000 through 111) in front of memory. Addresses 00001, 01001, 10001, and 11001 all map to slot 001; addresses 00101, 01101, 10101, and 11101 all map to slot 101.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 21
Who’s in slot 000?
- We still need a way to find out which of the many possible memory elements is currently in a cache slot.
  - slot: a location in the cache that can hold a block.
- We need to store the address of the item currently using cache slot 000.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 22
Tags
- We don't need to store the entire memory location address, just those bits that are not used to determine the slot number (the mapping).
- We call these bits the tag.
- The tag associated with a cache slot tells who is currently using the slot.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 23
16 word memory, 4 word cache
[Figure: a 16-word memory (addresses 0000 through 1111) and a 4-word direct-mapped cache. Each cache slot holds a data word plus a 2-bit tag recording which memory word currently occupies it.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 24
Initialization Problem
- Initially the cache is empty:
  - all the bits in the cache (including the tags) will have random values.
- After some number of accesses, some of the tags are real and some are still just random junk.
- How do we know which cache slots are junk and which really mean something?
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 25
Valid Bits
- Include one more bit with each cache slot that indicates whether the tag is valid or not.
- Provide hardware to initialize these bits to 0 (one bit per cache slot).
- When checking a cache slot for a specific memory location, ignore the tag if the valid bit is 0.
- Change a slot's valid bit to 1 when putting something in the slot (from main memory).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 26
Revised Cache
[Figure: the same 16-word memory and 4-word cache, now with a valid bit stored in each slot alongside the tag and the data.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 27
Simple Simulation
- We can simulate the operation of our simple direct-mapped cache by listing a sequence of memory locations that are referenced.
- Assume the cache is initialized with all the valid bits set to 0 (to indicate that all the slots are empty).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 28
Memory Access Sequence
Address   Binary Address   Slot     Hit or Miss
3         0011             11 (3)   miss
8         1000             00 (0)   miss
3         0011             11 (3)   hit
2         0010             10 (2)   miss
4         0100             00 (0)   miss
8         1000             00 (0)   miss
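The simulation can be replayed in a few lines of C. This is a sketch of the 4-slot, one-word-block cache assumed above (valid bits cleared at the start); it prints exactly the hit/miss column of the table:

    #include <stdio.h>

    int main(void) {
        int valid[4] = {0};
        unsigned tag[4] = {0};
        unsigned refs[] = {3, 8, 3, 2, 4, 8};

        for (int i = 0; i < 6; i++) {
            unsigned slot = refs[i] & 3;    /* LS 2 bits pick the slot     */
            unsigned t    = refs[i] >> 2;   /* remaining bits are the tag  */
            if (valid[slot] && tag[slot] == t)
                printf("addr %u -> slot %u: hit\n", refs[i], slot);
            else {
                printf("addr %u -> slot %u: miss\n", refs[i], slot);
                valid[slot] = 1;            /* fetch the block and         */
                tag[slot]   = t;            /* remember whose it is        */
            }
        }
        return 0;
    }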
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 29
Hardware
- We need to have hardware that can perform all the operations:
  - find the right slot given an address (perform the mapping).
  - check the valid bit.
  - compare the tag to part of the address.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 30
[Figure: direct-mapped cache hardware for 32-bit addresses. Bits 1:0 are the byte offset, bits 11:2 form the 10-bit index that selects one of 1024 entries, and bits 31:12 form the 20-bit tag. Each entry holds a valid bit, a 20-bit tag, and 32 bits of data; a comparator checks the stored tag against the address tag and, combined with the valid bit, produces the Hit signal while the 32-bit data word is read out.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 31
Possible Test Question
Given the following:
- 32-bit addresses (2^32-byte memory, 2^30 words).
- 64 KB cache (16K words); each slot holds 1 word.
- direct-mapped cache.
- How many bits are needed for each tag?
- How many memory locations are mapped to the same cache slot?
- How many total bits are in the cache (data + tag + valid)?
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 32
Possible Test Answer
- Memory has 2^30 words.
- The cache has 16K = 2^14 slots (words).
- Each cache slot can hold any one of 2^30 ÷ 2^14 = 2^16 memory locations, so the tag must be 16 bits.
- 2^16 is 64K memory locations that map to the same cache slot.
- Add one valid bit for each cache slot.
- Total memory in bits = 2^14 × (32 + 16 + 1) = 49 × 16K = 784 Kbits (98 KBytes!); checked in the sketch below.
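A quick sanity check of that arithmetic in C (parameters as assumed above: 30-bit word addresses, 2^14 one-word slots):

    #include <stdio.h>

    int main(void) {
        int word_addr_bits = 30;            /* 2^30 words of memory        */
        int index_bits     = 14;            /* 16K slots                   */
        int tag_bits       = word_addr_bits - index_bits;        /* 16     */
        long total_bits    = (1L << index_bits) * (32 + tag_bits + 1);
        printf("tag = %d bits, total = %ld Kbits\n",
               tag_bits, total_bits / 1024);                     /* 784    */
        return 0;
    }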
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 33
Handling a Cache Miss
- A miss means the processor must wait until the memory requested is in the cache:
  - a separate controller handles transferring data between the cache and memory.
- In general the processor continuously tries the fetch until it works (until it's a hit):
  - continuously means "once per cycle".
  - in the meantime the pipeline is stalled!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 34
Data vs. Instruction Cache
- Nothing other than a stall can happen if we get a miss when fetching the next instruction!
- It is possible to execute other instructions while waiting for data (we need to detect data hazards); this is called stall on use:
  - the pipeline stalls only when there are no instructions that can execute without the data.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 35
DecStation 3100 Cache
- Simple cache implementation:
  - 64 KB cache (16K words).
  - 16-bit tags.
  - direct mapped.
  - two caches, one for instructions and the other for data.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 36
[Figure: the DecStation 3100 cache. The 32-bit address splits into a 16-bit tag (bits 31:16), a 14-bit index (bits 15:2), and a 2-bit byte offset; each of the 16K entries holds a valid bit, a 16-bit tag, and a 32-bit data word.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 37
Handling Writes
- What happens when a store instruction is executed?
  - what if it's a hit?
  - what if it's a miss?
- The DecStation 3100 does the following:
  - don't bother checking the cache, just write the new value into the cache!
  - also write the word to main memory (this is called write-through).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 38
Write-Through
- Always updating main memory on each store instruction can slow things down!
  - the memory is tied up for a while.
- It is possible to set up a write buffer that holds a number of pending writes.
- If we also update the cache, it is not likely that we need to worry about getting a memory value from the buffer (but it's possible!)
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 39
Write-back
- Another scheme for handling writes:
  - only update the cache.
  - when the memory location is booted out of the cache (someone else is being put into the same slot), write the value to memory (see the sketch below).
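A hedged sketch contrasting the two policies, using a deliberately tiny model of one cached word and its memory copy (this model and its names are the sketch's own, not any real machine's logic):

    #include <stdint.h>

    static uint32_t cache_word, mem_word;  /* one-word "cache" and memory  */
    static int dirty;

    static void store_write_through(uint32_t val) {
        cache_word = val;              /* update the cache ...             */
        mem_word   = val;              /* ... and main memory, every time  */
    }

    static void store_write_back(uint32_t val) {
        cache_word = val;              /* update only the cache            */
        dirty = 1;                     /* remember the block was modified  */
    }

    static void evict(void) {          /* block is being booted out        */
        if (dirty) {
            mem_word = cache_word;     /* write back only if it changed    */
            dirty = 0;
        }
    }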
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 40
Cache Performance
For the simple DecStation 3100 cache:
                      Miss Rate
Program   Instruction   Data   Combined
gcc          6.1%       2.1%     5.4%
spice        1.2%       1.3%     1.2%
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 41
Spatial Locality?
- So far we've only dealt with temporal locality (if we access an item, it is likely we will access it again soon).
- What about space (the final frontier)?
- In general we make a block hold more than a single word.
- Whenever we move data to the cache, we also move its neighbors.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 42
Blocks and Slots
- Each cache slot holds one block.
- Given a fixed cache size (number of bytes), as the block size increases, the number of slots must decrease.
- Reducing the number of slots in the cache increases the number of memory locations that compete for the same slot.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 43
Example multi-word block cache
- 4 words/block:
  - we now use a block address to determine the slot mapping.
  - the block address in this case is the (word) address/4.
  - on a hit we need to extract a single word (we need a multiplexor controlled by the LS 2 bits of the word address).
- 64 KB of data.
- 16 bytes/block.
- 4K slots.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 44
Example multi-word block cache
[Figure: a cache with 4-word blocks. The 32-bit address splits into a 16-bit tag (bits 31:16), a 12-bit index selecting one of 4K entries, a 2-bit block offset, and a 2-bit byte offset. Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; on a hit, a 4-to-1 multiplexor driven by the block offset selects one 32-bit word.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 45
Performance and Block Size
DecStation 3100 cache with block sizes 1 and 4 (words).
                                Miss Rate
Program   Block Size   Instruction   Data   Combined
gcc           1            6.1%      2.1%     5.4%
gcc           4            2.0%      1.7%     1.9%
spice         1            1.2%      1.3%     1.2%
spice         4            0.3%      0.6%     0.4%
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 46
Is bigger always better?
- Eventually increasing the block size will mean that the competition for cache slots is too high:
  - the miss rate will increase.
- Consider the extreme case: the entire cache is a single block!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 47
Miss rate vs. Block Size
[Plot: miss rate (0% to 40%) vs. block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. For the small caches, miss rate climbs again once blocks get large.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 48
Block Size and Miss Time
- As the block size increases, we need to worry about what happens to the miss time.
- The larger a block is, the longer it takes to transfer from main memory to the cache.
- It is possible to design memory systems that transfer an entire block at a time, but only for relatively small block sizes (4 words).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 49
Example Timings
Hypothetical access times:
- 1 cycle to send the address.
- 15 cycles to initiate each access.
- 1 cycle to transfer each word.
- Miss penalty for a 4-word block fetched from one-word-wide memory: 1 + 4×15 + 4×1 = 65 cycles.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 50
Memory Organization Options
[Figure: improving memory bandwidth. (a) one-word-wide memory organization: CPU, cache, bus, and memory are all one word wide. (b) wide memory organization: wide memory and bus, with a multiplexor between the cache and the CPU. (c) interleaved memory organization: four memory banks share a one-word-wide bus.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 51
Improving Cache Performance
- Cache performance is based on two factors:
  - miss rate:
    - depends on both the hardware and on the program being measured (miss rate can vary).
  - miss penalty:
    - the penalty is dictated by the hardware (the organization of memory and memory access times).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 52
Cache and CPU Performance
The total number of cycles it takes for a program is the sum of:
- the number of normal instruction execution cycles.
- the number of cycles stalled waiting for memory:

Memory-stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 53
Cache Calculations
How much faster would this program run with a perfect cache?
- CPI (without memory stalls): 2
- Miss rate: 5%
- Miss penalty: 40 cycles
- % of instructions that are load/store: 30%
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 54
Speedup Calc
Time_perfect = IC × 2 (CPI) × cycle time = IC × 2.0
Time_cache = IC × (0.3 × (2 + 0.05 × 40) + 0.7 × 2) × cycle time = IC × 2.6
Speedup: 2.6/2 = 1.3 times faster with a perfect cache (see the sketch below).
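The same arithmetic, replayed in C with the numbers assumed on the previous slide:

    #include <stdio.h>

    int main(void) {
        double cpi_perfect = 2.0;
        double cpi_real = 0.3 * (2.0 + 0.05 * 40.0)   /* loads/stores    */
                        + 0.7 * 2.0;                  /* everything else */
        printf("speedup = %.2f\n", cpi_real / cpi_perfect);     /* 1.30  */
        return 0;
    }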
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 55
Clock Rate and Cache Performance
- If we double the clock rate of the processor, we don't change:
  - the cache miss rate.
  - the miss penalty in absolute time (the memory is not likely to change!), so the penalty in cycles doubles.
- The cache will not improve, so the speedup is not close to double!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 56
Reducing Miss Rate
- Obviously a larger cache will reduce the miss rate!
- We can also reduce the miss rate by reducing the competition for cache slots:
  - allow a block to be placed in one of many possible cache slots.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 57
An extreme example of how to mess up a direct mapped cache.
- Assume that every 64th memory element maps to the same cache slot.
for (i = 0; i < 10000; i++) {
    a[i] = a[i] + a[i+64] + a[i+128];
    a[i+64] = a[i+64] + a[i+128];
}
a[i], a[i+64] and a[i+128] use the same cache slot!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 58
Fully Associative Cache
- Instead of direct mapped, we allow any memory block to be placed in any cache slot.
- It's harder to check for a hit (hit time will increase).
- Requires lots more hardware (a comparator for each cache slot).
- Each tag will be a complete block address.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 59
Fully Associative Cache
[Figure: a fully associative cache in front of the 16-word memory. Any memory word can occupy any slot, so each slot stores a valid bit and a complete 4-bit block address as its tag.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 60
Tradeoffs
- Fully associative is much more flexible, so the miss rate will be lower.
- Direct mapped requires less hardware (cheaper):
  - it will also be faster, i.e., a better hit time!
- Tradeoff: miss rate vs. hit time.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 61
Middle Ground
- We can also provide more flexibility without going to a fully associative placement policy.
- For each memory location, provide a small number of cache slots that can hold the memory element.
- This is much more flexible than direct-mapped, but requires less hardware than fully associative.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 62
Set Associative
- A fixed number of locations where each block can be placed.
- n-way set associative means there are n places (slots) where each block can be placed.
- Chop the cache up into a number of sets, with each set of size n.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 63
Block Placement Options
[Figure: where memory block 12 may be placed in an 8-block cache. Direct mapped: only slot 12 mod 8 = 4, so the search compares one tag. Set associative (4 sets of 2): either slot of set 12 mod 4 = 0, so the search compares two tags. Fully associative: any of the 8 slots, so the search compares all 8 tags.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 64
Possible 8-block Cache designs
[Figure: the possible 8-block cache designs. One-way set associative (direct mapped): 8 sets of one block. Two-way: 4 sets of two blocks. Four-way: 2 sets of four blocks. Eight-way (fully associative): one set of all 8 blocks. Each slot holds a tag and its data.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 65
Block Addresses & Set Associative Caching
- The LS bits of the block address are used to determine which set the block can be placed in.
- The rest of the bits of the block address must be used for the tag.

A 32-bit byte address is divided into:  | Tag | Index | Block Offset |
(the tag and index together form the block address; the index is the set number; a sketch of the mapping follows)
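A minimal C sketch of this split. The block size (16 bytes) and set count (512) are this sketch's own assumptions, not fixed by the slide:

    #include <stdint.h>

    #define OFFSET_BITS 4                     /* 16-byte blocks            */
    #define SET_BITS    9                     /* 2^9 = 512 sets            */
    #define NUM_SETS    (1u << SET_BITS)

    static uint32_t set_of(uint32_t byte_addr) {
        uint32_t block_addr = byte_addr >> OFFSET_BITS;
        return block_addr & (NUM_SETS - 1);   /* LS bits of block address  */
    }

    static uint32_t block_tag_of(uint32_t byte_addr) {
        return byte_addr >> (OFFSET_BITS + SET_BITS);   /* the rest        */
    }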
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 66
Possible Test Question
- Block size: 4 words.
- Cache size (data only): 64 KBytes.
- 8-way set associative (each set has 8 slots).
- 32-bit (byte) address space.
- How many sets are there in the cache?
- How many memory blocks compete for placement in each set?
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 67
Answer
Cache size:
- 64 KBytes is 2^16 bytes.
- 2^16 bytes is 2^14 words, i.e., 2^12 four-word blocks.
- 2^12 blocks is 2^9 sets of 8 blocks each.
Memory size:
- 2^32 bytes = 2^30 words = 2^28 blocks.
Blocks per set:
- 2^28/2^9 = 2^19 memory blocks compete for placement in each set.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 68
4-way Set Associative Cache
[Figure: a 4-way set-associative cache with 256 sets. The 8-bit index (bits 9:2) selects a set; the 22-bit tag (bits 31:10) is compared against all four slots of the set in parallel, each with its own valid bit and comparator; a 4-to-1 multiplexor passes the data from whichever way hits.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 69
4-way set associative and the extreme example.
for (i = 0; i < 10000; i++) {
    a[i] = a[i] + a[i+64] + a[i+128];
    a[i+64] = a[i+64] + a[i+128];
}
a[i], a[i+64] and a[i+128] belong to the same set – that’s OK, we can hold all 3 in the cache at the same time.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 70
Performance Comparison
DecStation 3100 cache with block size 4 words.

                              Miss Rate
Program   Associativity   Instruction   Data   Combined
gcc       1 (direct)          2.0%      1.7%     1.9%
gcc       2                   1.6%      1.4%     1.5%
gcc       4                   1.6%      1.4%     1.5%
spice     1 (direct)          0.3%      0.6%     0.4%
spice     2                   0.3%      0.6%     0.4%
spice     4                   0.3%      0.6%     0.4%
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 71
A note about set associativity
- Direct mapped is really just 1-way set associative (1 block per set).
- Fully associative is n-way set associative, where n is the number of blocks in the cache.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 72
Question
- Cache size: 4K blocks.
- Block size: 4 words (16 bytes).
- 32-bit address.
- How many bits are needed for storing the tags (for the entire cache), if the cache is:
  - direct mapped
  - 2-way set associative
  - 4-way set associative
  - fully associative
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 73
Answer
Direct mapped:       16 × 4K = 64K bits    (tag 16 | index 12 | offset 4)
2-way:               17 × 4K = 68K bits    (tag 17 | index 11 | offset 4)
4-way:               18 × 4K = 72K bits    (tag 18 | index 10 | offset 4)
Fully associative:   28 × 4K = 112K bits   (tag 28 | offset 4)
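The pattern (each doubling of associativity removes one index bit and adds one tag bit) can be checked with a short C loop; this sketch assumes the same 4K-block, 16-byte-block, 32-bit-address cache:

    #include <stdio.h>

    int main(void) {
        int blocks = 4096, offset_bits = 4, addr_bits = 32;
        for (int ways = 1; ways <= blocks; ways *= 2) {
            int sets = blocks / ways;
            int index_bits = 0;
            while ((1 << index_bits) < sets)
                index_bits++;                 /* log2(sets)                */
            int tag_bits = addr_bits - index_bits - offset_bits;
            printf("%4d-way: tag %2d bits, total %3d Kbits\n",
                   ways, tag_bits, tag_bits * blocks / 1024);
        }
        return 0;
    }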
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 74
Block Replacement Policy
- With a direct-mapped cache there is no choice about which memory element gets removed from the cache when a new element is moved in.
- With a set-associative cache, eventually we will need to choose which element to remove from a set.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 75
Replacement Policy: LRU
LRU: Least recently used.
- keep track of how old each block in the cache is.
- When we need to put a new element in the cache, use the slot occupied by the oldest block.
- Every time a block in the cache is accessed (a hit), set its age to 0.
- Increase the age of all the other blocks in a set whenever a block in the set is accessed (see the sketch below).
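A hedged sketch of that bookkeeping for one 4-way set. The age array and function names are this sketch's own model, not a real hardware implementation:

    #define WAYS 4

    static unsigned age[WAYS];     /* age[i] == 0: most recently used  */

    static void touch(int way) {   /* call on every hit in this set    */
        for (int i = 0; i < WAYS; i++)
            age[i]++;              /* everyone gets older ...          */
        age[way] = 0;              /* ... except the block just used   */
    }

    static int victim(void) {      /* slot to replace on a miss        */
        int oldest = 0;
        for (int i = 1; i < WAYS; i++)
            if (age[i] > age[oldest])
                oldest = i;
        return oldest;
    }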
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 76
LRU in hardware
- We must implement this strategy in hardware!
- 2-way is easy: we need only 1 bit to keep track of which element in the set is older.
- 4-way is tougher (but possible).
- 8-way requires too much hardware (typically LRU is only approximated).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 77
Multilevel Caches
- Most modern processors include an on-chip cache (the cache is part of the processor chip).
- The size of the on-chip cache is restricted by the size of the chip!
- Often, a secondary cache is used between the on-chip cache and the main memory.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 78
Adding a secondary cache
- Typically uses SRAM (fast, expensive); the miss penalty is much lower than for main memory.
- Using a fast secondary cache can change the design of the primary cache:
  - make the on-chip cache hit time as small as possible!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 79
Performance Analysis
- Processor with a CPI of 1 if all memory accesses are handled by the on-chip cache.
- Clock rate 5 GHz (0.2 ns period).
- Main memory access time 100 ns.
- Miss rate for the primary cache is 2%.
- How much faster is the machine if we add a secondary cache with a 5 ns access time that reduces the miss rate (to main memory) to 0.5%?
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 80
Analysis without secondary cache
Without the secondary cache the CPI will be based on:
- the CPI without memory stalls (for all except misses).
- the CPI with a memory stall (just for cache misses).
- Without a stall the CPI is 1, and this happens 98% of the time.
- With a stall the CPI is 1 plus the miss penalty, which is 100/0.2 = 500 cycles. This happens 2% of the time.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 81
CPI Calculation (no secondary cache)
Total CPI = Base CPI + Memory-Stall cycles per instruction
CPI = 1.0 + (2% * 500) = 11.0
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 82
With secondary cache
With the secondary cache the CPI will be based on:
- the CPI without memory stalls (for all except misses).
- the CPI with a stall for accessing the secondary cache (for cache misses that are resolved in the secondary cache).
- the CPI with a stall for accessing the secondary cache and main memory (for accesses that go all the way to main memory).
The stall for accessing the secondary cache is 5/0.2 = 25 cycles.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 83
CPI Calculation (with secondary cache)
Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
= 1 + (2% * 25) + (.5% * 500) = 1 + 0.5 + 2.5
= 4.0
Processor w/ secondary cache is 11/4 = 2.75x faster!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 84
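Both CPI calculations, replayed in C with the assumed numbers (5 GHz clock, 100 ns memory, 5 ns secondary cache, 2% and 0.5% miss rates):

    #include <stdio.h>

    int main(void) {
        double mem_stall = 100.0 / 0.2;   /* 500 cycles to main memory    */
        double l2_stall  = 5.0 / 0.2;     /*  25 cycles to the secondary  */

        double cpi_one_level = 1.0 + 0.02 * mem_stall;            /* 11.0 */
        double cpi_two_level = 1.0 + 0.02 * l2_stall
                                   + 0.005 * mem_stall;           /*  4.0 */
        printf("speedup = %.2f\n",
               cpi_one_level / cpi_two_level);                    /* 2.75 */
        return 0;
    }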
Virtual Memory
Disk caching
- Use main memory as a cache for magnetic disk.
- We can do this for a number of reasons:
  - speed up disk access.
  - pretend we have more main memory than we really have.
  - support multiple programs easily (each can pretend it has all the memory).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 86
Our focus
- We will focus on using the disk as a storage area for chunks of main memory that are not being used.
- The basic concepts are similar to providing a cache for main memory, although we now view part of the hard disk as being the memory.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 87
Virtual memory
Consider a machine with a 32 bit address space:
- it probably doesn't have 2^32 = 4 GB of main memory!
- How do we write programs without knowing how much memory is really available ahead of time?
- Why not pretend we always have 4 GB, and if we use more than we really have, store some blocks on the hard disk?
  - this must happen automatically to be useful.
- Note: 64-bit architectures typically implement something like a 48-bit address space: 2^48 bytes = 262,144 GB, which is 256 TB.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 88
Motivation
- Pretend we have 4 GB; we really have only 512 MB.
- At any time, the processor needs only a small portion of the 4 GB memory:
  - only a few programs are active.
  - an active program might not need all the memory that has been reserved by the program.
- We just keep the stuff needed in main memory, and store the rest on disk.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 89
[Figure: address translation maps virtual addresses either to physical addresses (pages resident in main memory) or to disk addresses (pages currently stored on disk).]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 90
A Program’s view of memory
- We can write programs that address the virtual memory.
- There is hardware that translates these virtual addresses to physical addresses.
- The operating system is responsible for managing the movement of memory between disk and main memory, and for keeping the address translation table accurate.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 91
Terminology
- page: the unit of memory transferred between disk and the main memory.
- page fault: when a program accesses a virtual memory location that is not currently in the main memory.
- address translation: the process of finding the physical address that corresponds to a virtual address.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 92
Virtual Memory & Address Translation
[Figure: the CPU issues a virtual address ("gimme Mem[25]"); the translation hardware turns it into a physical address ("it's really Mem[1049]").]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 93
Translation and Pages
- Only the page number need be translated.
- The offset within the page stays constant.
[Figure: a 32-bit virtual address splits into a virtual page number (bits 31:12) and a page offset (bits 11:0). Translation replaces the virtual page number with a physical page number (bits 29:12 of the physical address); the 12-bit page offset passes through unchanged.]
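A minimal C sketch of this split, assuming 4 KB pages (hence 12 offset bits); it ignores valid bits and page faults, which a real translation must handle:

    #include <stdint.h>

    #define OFFSET_BITS 12

    static uint32_t translate(uint32_t vaddr, const uint32_t *page_table) {
        uint32_t vpn    = vaddr >> OFFSET_BITS;           /* virtual page #   */
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
        uint32_t ppn    = page_table[vpn];                /* physical page #  */
        return (ppn << OFFSET_BITS) | offset;             /* offset unchanged */
    }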
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 94
CPU & address translation
- The CPU doesn't need to worry about address translation; this is handled by the memory system (e.g., the MMU).
- As far as the CPU is concerned, it is using physical addresses.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 95
Advantages of VM
- A program can be written (linked) to use whatever addresses it wants to! It doesn't matter where it is physically loaded!
- When a program is loaded, it doesn't need to be placed in contiguous memory locations:
  - any group of physical memory pages will do fine.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 96
Design Issue
- A page fault is a disaster!
  - disk is very, very, very slow compared to memory – millions of cycles!
- Minimization of faults is the primary design consideration for virtual memory systems.
- This "page" is important! It's your "fault" if you miss this point :-)
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 97
Minimizing faults
- Pages should be big enough to make a transfer from disk worthwhile; 4 KB to 64 KB are typical sizes.
  - some systems have 1 MB to 256 MB page sizes.
- Fully associative placement is the most flexible (it will reduce the rate of faults).
  - software handles the placement of pages.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 98
What about rights writes?
- Write-through is not practical for a virtual memory system (writes to disk are way too slow).
- Write-back is always used:
  - write the entire page to disk only when it is kicked out of main memory.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 99
The dirty bit
- It would be wasteful to always write an entire page to disk if nothing in the page has changed.
- A flag is used to keep track of which pages have been changed in main memory (if no change happens, there is no need to write the page to disk).
- The flag is called the dirty bit.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 100
Address Translation
- Address translation must be fast (it happens on every memory access).
- We need a fully associative placement policy.
- We can't afford to go looking at every virtual page to find the right one:
  - we don't use the tag-bits approach.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 101
Page Table
- We need a large table that holds the physical address for each virtual page.
- Want virtual page 1234? Look at row 1234 in the table:
  - the page table is a big array indexed by virtual page number.
- The table will be huge: 2^32/page size entries!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 102
[Figure: translating with the page table. The page table register points to the table; the 20-bit virtual page number indexes it; each entry holds a valid bit and an 18-bit physical page number (valid = 0 means the page is not present in memory). The physical page number is concatenated with the unchanged 12-bit page offset to form the physical address.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 103
Processes and Page Tables
- Each process has its own page table!
  - each program can pretend it is loaded and running at the same address.
- One page table is huge; now we need to worry about lots of page tables.
- We can't include dedicated hardware that holds all these page tables.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 104
Page Tables memory needs
- Assume a 32-bit virtual address space.
- Assume a 16 KB page size:
  - each page table needs 2^32/2^14 = 2^18 entries.
- We would like to support 256 different processes.
- We need 2^8 × 2^18 = 2^26 page table entries; assume each is 1 word wide.
- Total needed is 256 MBytes! (replayed in the sketch below)
- A solution – "page" the page table.
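That sizing in C, with the assumptions above (32-bit virtual addresses, 16 KB pages, 256 processes, 4-byte entries):

    #include <stdio.h>

    int main(void) {
        long entries_per_table = 1L << (32 - 14);          /* 2^18 entries */
        long total_entries     = 256 * entries_per_table;  /* 2^26 entries */
        printf("%ld MB\n",
               total_entries * 4 / (1024 * 1024));         /* 256 MB       */
        return 0;
    }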
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 105
Page Table Elements
- Each element in the page table needs to include:
  - a valid bit.
  - if the page is in memory, the physical address.
  - if the page is on disk, some indication of where on the disk.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 106
[Figure: a page table whose entries point either to pages in physical memory (valid = 1) or to locations in disk storage (valid = 0).]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 107
I need to go buy more memory!
- Page tables are stored in main memory.
- Most programs are small, so we don't need to actually create the entire page table for each process:
  - just enough to cover the actual pages that have been reserved for use by the program.
  - this number will be quite small (a few thousand pages is enough for a large program).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 108
Speed of address translation
- Page tables are in memory.
- We need to access an element of the page table every time a translation is needed.
- A translation is needed on every memory access!
- Every memory access really requires 2 memory accesses!
- This is very bad... especially for your uber-fast, superscalar, pipelined processor!
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 109
Making address translation fast
- We can create a dedicated cache that holds the most recently used page table entries:
  - the same page table entry is used for all memory locations in a page (spatial locality).
- This cache is called a Translation Lookaside Buffer (TLB); a sketch follows.
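A hedged C sketch of a TLB consulted before the page table walk. This is a tiny fully associative model of the sketch's own design (sizes and names assumed, not the DecStation's hardware):

    #include <stdint.h>

    #define TLB_SLOTS 64

    struct tlb_entry { int valid; uint32_t vpn, ppn; };
    static struct tlb_entry tlb[TLB_SLOTS];

    /* Returns 1 and sets *ppn on a TLB hit; on a miss the caller walks
       the page table and then installs the translation with tlb_fill(). */
    static int tlb_lookup(uint32_t vpn, uint32_t *ppn) {
        for (int i = 0; i < TLB_SLOTS; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;
                return 1;
            }
        return 0;
    }

    static void tlb_fill(int slot, uint32_t vpn, uint32_t ppn) {
        tlb[slot] = (struct tlb_entry){ 1, vpn, ppn };
    }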
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 110
[Figure: the TLB caches recently used page table entries. Each TLB entry holds a valid bit, a tag (the virtual page number), and the physical page address. On a TLB miss, the full page table is consulted; its entries point either to physical memory or to disk storage.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 111
DecStation 3100 TLB
- 32-bit address space.
- 4 KB page size:
  - the virtual page address is 20 bits.
- The TLB has 64 slots:
  - each has a 20-bit tag, a 20-bit physical page address, a valid bit, and a dirty bit.
- Fully associative.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 112
[Figure: the DecStation 3100 TLB and cache in series. The 20-bit virtual page number is compared, fully associatively, against the TLB tags (each entry has valid and dirty bits). On a TLB hit, the 20-bit physical page number joins the 12-bit page offset to form the physical address, whose 16-bit tag, 14-bit cache index, and 2-bit byte offset then drive the direct-mapped cache lookup that delivers the 32-bit data word on a cache hit.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 113
Cache + Virtual Memory
- The DecStation 3100 does address translation before the cache:
  - the cache operates on physical memory addresses.
- It is also possible to cache virtual addresses, although there are some problems:
  - if programs can share pages, a single word from physical memory could end up in the cache twice! (the same physical location could have 2 different virtual addresses).
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 114
Protection
- Virtual memory allows multiple processes to share the same physical memory.
- What if my process tries to write to your process's memory?
  - we don't want this to be possible!
  - we don't even want it to be able to read!
- We can provide protection via the page tables.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 115
Independent Page Tables
- Each process has its own page table.
- All page tables are created by the operating system: your program can't change its own page table.
- Supporting virtual memory requires a combination of hardware and software.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 116
Common Issues
- There are a number of issues that are common to both cache and virtual memory system design:
  - block placement policy.
  - how is a block found?
  - block replacement policy.
  - write policy.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 117
Block Placement Options
- Direct mapped:
  - cheap, easy to implement, relatively high miss rate.
- Set associative:
  - middle ground.
- Fully associative:
  - expensive (lots of hardware or software), minimizes the miss rate.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 118
[Plot: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB. Increasing associativity helps most for the smaller caches.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 119
How is a block found?
This depends on the placement policy.
- Direct mapped: use the index.
- Set associative: the index selects a set, and we need to look at all the elements of the set.
- Fully associative: need to look at all elements.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 120
Replacement Policies
- Direct mapped: not an issue.
- Set and fully associative:
  - LRU (least recently used): hard to implement in hardware for large sets, often approximated.
  - random: easy to implement, does nearly as well as LRU approximations.
- LRU is always used (or approximated) for virtual memory.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 121
Write Policies
- Write-through: update the cache and the lower-level memory.
- Write-back: update the cache only; when the block/page is booted from the cache, write it to the lower-level memory if it has changed.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 122
Where do misses come from?
- Compulsory misses: the first access to a block is always a miss. Can't avoid these.
- Capacity misses: the cache can't hold all the blocks needed.
- Conflict misses: multiple blocks compete for the same cache slot(s) and collide.
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 123
Where do misses come from?
[Plot: miss rate per type vs. cache size (1 KB to 128 KB). Compulsory misses form a baseline of about 0.2%; capacity misses shrink as the cache grows; the bands show the conflict misses for one-way through eight-way associativity.]
CSCI-2500 FALL 2020, Memory Hierarchy (Ch 5) — 124