Chapter 5: Memory Hierarchy, Part 1
Cache – Direct Mapped
Recall: SRAM and DRAM
• SRAM (Static Random Access Memory):
  – value is stored on a pair of inverting gates
  – very fast, but takes up more space than DRAM (4 to 6 transistors)
• DRAM (Dynamic Random Access Memory):
  – value is stored as a charge on a capacitor (must be refreshed)
  – very small, but slower than SRAM (by a factor of 5 to 10)
[Figure: an SRAM cell (a pair of cross-coupled inverting gates) and a DRAM cell (a pass transistor and a capacitor), accessed via a word line and a bit line.]
Why a memory hierarchy?
• Cost, capacity, and speed.
  – Users want the cheapest and fastest memory, with infinite size!
  – In 2008:
    • SRAM: access times of 0.5–2.5 ns, at a cost of $2,000 to $5,000 per GB
    • DRAM: access times of 50–70 ns, at a cost of $20 to $75 per GB
    • Magnetic disk: access times of 5 to 20 million ns, at a cost of $0.20 to $2 per GB
Memory Hierarchy
[Figure: the memory hierarchy.]
The Basic Structure of a Memory Hierarchy
[Figure: the basic structure of a memory hierarchy.]
The principle of locality
• Temporal locality (locality in time): If an item is referenced, it will tend to be referenced again soon.
• Spatial locality (locality in space): If an item is referenced, items whose addresses are close by will tend to be referenced soon.
Analogy: getting books from the bookshelves to your table
• The cache is like your table.
• Main memory is like the bookshelves.
Terminology
• Block: the minimum unit of data/information transferred between levels.
• Hit: the data requested is found in the upper level.
• Miss: the data requested is not found in the upper level.
• Miss rate: the fraction of memory accesses not found in a level of the memory hierarchy.
• Hit rate: the fraction of memory accesses found in a level (e.g., in a cache).
• Hit time: the time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.
• Miss penalty: the time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, insert it in the level that experienced the miss, and pass the block to the requestor.
• Make the average access time small:
  – service most accesses from a small, fast memory
  – keep all data in a big, slow memory
• The goal is to present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory.
• Build a memory hierarchy.
[Figure: levels in the memory hierarchy, from Level 1 nearest the CPU down to Level n; distance from the CPU in access time increases, and the size of the memory at each level grows, moving down the hierarchy.]
Cache
• Cache (Webster’s definition): a safe place for hiding or storing things.
• Cache (our definition): the level of the memory hierarchy between the processor and main memory; the term was coined for the first commercial computer to have this extra level. More generally: any storage managed to take advantage of locality of access.
• Two issues:
  – How do we know if a data item is in the cache?
  – If it is, how do we find it?
• Simplest scheme: the Direct Mapped Cache.
  – Each memory location/address (a block; here one word of data) in the lower level is mapped directly to exactly one location in the upper level.
  – Almost all direct-mapped caches use the mapping:
    (Block address) modulo (# of blocks in the cache)
• If the number of entries in the cache is a power of 2, the modulo can be computed simply by using the low-order log2(cache size in blocks) bits of the address.
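As a minimal C sketch of this trick (assuming an 8-block cache, matching the example that follows; the names are illustrative):

    #include <stdint.h>

    #define NUM_BLOCKS 8u   /* cache size in blocks (2^3) */

    /* Equivalent to block_address % NUM_BLOCKS when NUM_BLOCKS is a
       power of 2: keep the low-order log2(NUM_BLOCKS) = 3 bits. */
    static uint32_t cache_index(uint32_t block_address) {
        return block_address & (NUM_BLOCKS - 1u);
    }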
Direct Mapped Cache – Example
• Mapping: cache index = address modulo the number of blocks in the cache.
[Figure: an 8-block cache with indices 000 through 111; memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 each map to the cache block selected by their low-order 3 bits.]
For example, addresses 00001 (1) and 11101 (29) map to cache blocks 001 (1 mod 8 = 1) and 101 (29 mod 8 = 5), respectively.
• How do we know whether a requested word is in the cache or not?
• We add a set of tags to the cache.
• Tag: a field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word.
• The tag needs to contain only the upper portion of the address, corresponding to the bits that are not used as an index into the cache. (The number of index bits is 3 if the cache size is 2^3 = 8 blocks.)
• In the previous example, we need to use only the upper 2 of the 5 address bits as the tag.
• We also need to recognize that a cache block sometimes does not contain any data.
• Valid bit: a bit added to each entry to indicate whether the entry contains a valid address. If the bit is not set, there cannot be a match for this block (the entry might be empty).
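A minimal C sketch of the resulting lookup, assuming the 8-block example above (3-bit index, 2-bit tag); all names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define INDEX_BITS 3u
    #define NUM_BLOCKS (1u << INDEX_BITS)

    struct cache_entry {
        bool     valid;  /* cleared at power-on: no block stored yet */
        uint32_t tag;    /* upper address bits (2 bits here) */
        uint32_t data;   /* one word of data */
    };

    static struct cache_entry cache[NUM_BLOCKS];

    /* A hit requires both a set valid bit and a matching tag. */
    static bool is_hit(uint32_t block_address) {
        uint32_t index = block_address & (NUM_BLOCKS - 1u);
        uint32_t tag   = block_address >> INDEX_BITS;
        return cache[index].valid && cache[index].tag == tag;
    }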
Accessing a Cache
a) The initial state of the cache (8 entries, 3 index bits) after power-on:

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N
b) After handling a miss of address (10110):

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Memory(10110)
111    N
c) After handling a miss of address (11010):

Index  V  Tag  Data
000    N
001    N
010    Y  11   Memory(11010)
011    N
100    N
101    N
110    Y  10   Memory(10110)
111    N
d) After handling a miss of address (10000):

Index  V  Tag  Data
000    Y  10   Memory(10000)
001    N
010    Y  11   Memory(11010)
011    N
100    N
101    N
110    Y  10   Memory(10110)
111    N
e) After handling a miss of address (00011):

Index  V  Tag  Data
000    Y  10   Memory(10000)
001    N
010    Y  11   Memory(11010)
011    Y  00   Memory(00011)
100    N
101    N
110    Y  10   Memory(10110)
111    N
f) After handling a miss of address (10010), which replaces Memory(11010) in entry 010:

Index  V  Tag  Data
000    Y  10   Memory(10000)
001    N
010    Y  10   Memory(10010)
011    Y  00   Memory(00011)
100    N
101    N
110    Y  10   Memory(10110)
111    N
Direct Mapped Cache – here each block is one word (32 bits)
• For MIPS:
[Figure: a 32-bit address (bit positions 31..0) is split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). The 10-bit index selects one of 1024 entries (0 to 1023), each holding a valid bit, a 20-bit tag, and a 32-bit data word. The stored tag is compared (=) with the tag portion of the address; a match with the valid bit set raises the Hit signal, and the 32-bit data word is returned.]
The lower portion of the address is used to select a cache entry consisting of a data word and a tag.
Direct Mapped Cache with Multi-word Blocks
• Taking advantage of spatial locality:
[Figure: a 32-bit address is split into an 18-bit tag (bits 31–14), an 8-bit index (bits 13–6), a 4-bit block offset (bits 5–2), and a 2-bit byte offset (bits 1–0). The index selects one of 256 entries, each holding a valid bit, an 18-bit tag, and 512 bits of data (16 words of 32 bits). On a hit (tag comparison (=) with the valid bit set), a multiplexor (Mux) uses the block offset to select the requested 32-bit word from the 16 words in the block.]
Here several words are stored for each index.
Mapping an Address to a Multi-word Cache Block – Example
• Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?
The block address is:
  ⎣ Byte address / Bytes per block ⎦  (the integer part of the division)
and the cache block is given by:
  (Block address) modulo (Number of cache blocks)
Note that this block address identifies the block containing all addresses between:
  ⎣ Byte address / Bytes per block ⎦ x Bytes per block
and
  ⎣ Byte address / Bytes per block ⎦ x Bytes per block + (Bytes per block - 1)
Example cont.
Thus, with 16 bytes per block, byte address 1200 corresponds to block address
  ⎣1200/16⎦ = 75,
which maps to cache block number (75 mod 64) = 11.
The block holds all byte addresses between 1200 and 1215, all of which map to cache block 11.
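A small C sketch, using only the numbers from this example, reproduces the arithmetic:

    #include <stdio.h>

    int main(void) {
        unsigned byte_address     = 1200;
        unsigned bytes_per_block  = 16;
        unsigned num_cache_blocks = 64;

        unsigned block_address = byte_address / bytes_per_block;   /* 75   */
        unsigned cache_block   = block_address % num_cache_blocks; /* 11   */
        unsigned first_byte    = block_address * bytes_per_block;  /* 1200 */
        unsigned last_byte     = first_byte + bytes_per_block - 1; /* 1215 */

        printf("block %u -> cache block %u, bytes %u..%u\n",
               block_address, cache_block, first_byte, last_byte);
        return 0;
    }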
Direct Mapped Cache with Multi-word Blocks – Cost in Bits
• For the 256-entry cache above (18-bit tag, 16-word blocks), the total number of bits used is:
  (# of indices) x (data bits + tag bits + valid bit)
  = 2^8 x (32 x 2^4 + 18 + 1)
  = 256 x (512 + 18 + 1) = 256 x 531
  = 135,936 bits
Size of a Direct Mapped Cache
• Assume a 32-bit address and a direct-mapped cache of 2^n blocks with 2^m words (2^(m+2) bytes) per block. Then:
  – the tag field size is 32 - (n + m + 2) bits,
  – n bits are used for the index,
  – m bits are used for the word within the block, and
  – 2 bits are used for the byte part of the address.
The total number of bits in a direct-mapped cache is:
  2^n x (block size + tag size + valid field size)
Since the block size is 2^m words (2^(m+5) bits) and the address size is 32 bits, the number of bits in the cache is:
  2^n x (2^m x 32 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m)
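A minimal C sketch of this formula (the function name is illustrative):

    /* 2^n blocks of 2^m words each, 32-bit addresses:
       total bits = 2^n x (2^m x 32 + (32 - n - m - 2) + 1). */
    static unsigned long cache_total_bits(unsigned n, unsigned m) {
        unsigned long blocks    = 1ul << n;
        unsigned long data_bits = (1ul << m) * 32ul;  /* block size in bits */
        unsigned long tag_bits  = 32ul - (n + m + 2); /* tag field width */
        return blocks * (data_bits + tag_bits + 1ul); /* +1 for the valid bit */
    }
    /* cache_total_bits(10, 2) == 1024 * 147 == 150528 bits,
       matching the 16 KB example on the next slide. */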
Size of a Direct Mapped Cache – Example
Q. How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?

16 KB = 4K words = 2^12 words (= 4096 words).
With 4-word (2^2) blocks, there are 2^10 blocks ((2^12) / (2^2) = 2^10).
Each block has:
  – 4 x 32 = 128 bits of data (a 4-word block),
  – plus a tag of 32 - 10 - 2 - 2 bits (n = 10 index bits, m = 2 word-offset bits),
  – plus a valid bit.
The total cache size (with m = 2, n = 10) is:
  2^10 x (128 + (32 - 10 - 2 - 2) + 1) = 2^10 x 147 = 150,528 bits ≈ 147 Kbits,
or 18.375 KB (= 147/8) for a 16 KB data cache.
The total number of bits in the cache is about 1.15 times (147/128) as many as needed just for the storage of the data.
Reducing the Miss Penalty for Multi-word Blocks
• Early restart: resume execution as soon as the requested word of the block is returned, rather than waiting for the entire block, thereby hiding some of the transfer time so that the miss penalty is effectively smaller.
• Requested word first / critical word first: organize the memory so that the requested word is transferred from the memory to the cache first. The remainder of the block is then transferred, starting with the address after the requested word and wrapping around to the beginning of the block (see the sketch below).
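A minimal sketch of the wrap-around fill order, assuming a 4-word block; fetch_word() and cache_fill() are hypothetical helpers, not a real interface:

    #include <stdint.h>

    #define WORDS_PER_BLOCK 4u

    extern uint32_t fetch_word(unsigned word_in_block);              /* hypothetical */
    extern void     cache_fill(unsigned word_in_block, uint32_t w);  /* hypothetical */

    /* Fill the block starting at the requested word and wrapping
       around, so the processor can restart as soon as the first
       (critical) word arrives. */
    static void fill_block_critical_word_first(unsigned requested_word) {
        for (unsigned i = 0; i < WORDS_PER_BLOCK; i++) {
            unsigned w = (requested_word + i) % WORDS_PER_BLOCK;
            cache_fill(w, fetch_word(w));
        }
    }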
Design Issues
• Large blocks exploit spatial locality to lower miss rates.
• Why not have very large blocks?
  – A larger block size means a larger miss penalty, and (with fewer blocks competing for the cache) the miss rate will eventually go up.
Hits vs. Misses
• Read hits
  – simply read!
• Read misses
  – stall the CPU, fetch the block from memory, deliver it to the cache, restart
• Write hits (both policies are sketched in code after this list)
  – Write-through: update both the cache and memory, ensuring that data is always consistent between the two. May use a write buffer.
  – Write-back: update only the block in the cache, then write the modified block to memory when the block is replaced.
• Write misses
  – read the entire block into the cache, then write the word
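A minimal C sketch contrasting the two write-hit policies; mem_write() and the per-block dirty bit are illustrative, not a specific machine's interface:

    #include <stdbool.h>
    #include <stdint.h>

    extern void mem_write(uint32_t addr, uint32_t word);  /* hypothetical */

    struct block { bool valid, dirty; uint32_t tag, data; };

    /* Write-through: the write goes to the cache block and to memory. */
    static void write_hit_through(struct block *b, uint32_t addr, uint32_t word) {
        b->data = word;
        mem_write(addr, word);
    }

    /* Write-back: only the cache block is updated; the dirty bit defers
       the memory write until the block is replaced. */
    static void write_hit_back(struct block *b, uint32_t word) {
        b->data  = word;
        b->dirty = true;
    }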
Handling Cache Misses – Steps
1. Send the original PC value (current PC – 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will re-fetch the instruction, this time finding it in the cache.
The control of the cache on a data access is identical: on a miss, we stall the processor until the memory responds with the data.
Handling Writes
• If we write into the cache without changing main memory, memory would then hold a different value from the cache => inconsistent.
• Write-through: a scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two.
  – This takes time: a write to main memory costs at least 100 processor clock cycles, and stores ("sw") make up around 10% of instructions, so write-through alone would slow the processor considerably.
• Use a write buffer:
  – When writing the data into the cache, we also write it into a write buffer. The processor then continues executing while the data is written from the write buffer to memory; when that write finishes, the entry in the write buffer is freed.
  – If the processor reaches another write while the write buffer is still full, it must stall until there is an empty position in the buffer. (We can also increase the depth of the write buffer; a sketch follows.)
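A minimal C sketch of a write buffer as a small FIFO; the depth of 4 and all names are assumptions for illustration only:

    #include <stdbool.h>
    #include <stdint.h>

    extern void mem_write(uint32_t addr, uint32_t word);  /* hypothetical */

    #define WB_DEPTH 4u

    struct wb_entry { uint32_t addr, data; };
    static struct wb_entry buf[WB_DEPTH];
    static unsigned head, tail, count;

    /* Processor side: returns false when the buffer is full,
       in which case the processor must stall. */
    static bool wb_push(uint32_t addr, uint32_t data) {
        if (count == WB_DEPTH) return false;
        buf[tail].addr = addr;
        buf[tail].data = data;
        tail = (tail + 1) % WB_DEPTH;
        count++;
        return true;
    }

    /* Memory side: drain one entry when the memory write completes. */
    static void wb_drain_one(void) {
        if (count == 0) return;
        mem_write(buf[head].addr, buf[head].data);
        head = (head + 1) % WB_DEPTH;
        count--;
    }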
Handling Writes – cont.
• Write-back (scheme): when a write occurs, the new value is written only to the block in the cache. The modified block is written to the lower level of the hierarchy when it is replaced.
Steps for a Read Request to the Cache
1. Send the address to the appropriate cache. The address comes either from the PC (for an instruction) or from the ALU (for data).
2. If the cache signals hit, the requested word is available on the data lines. Since there are 16 words in the desired block, we need to select the right one: the block offset field controls the multiplexor, which selects the requested word from the 16 words in the indexed block.
3. If the cache signals miss, we send the address to the main memory. When the memory returns the data, we write it into the cache and then read it to fulfill the request.
Design the Memory System to Support the Cache – Example
• The processor is typically connected to memory over a bus. The clock rate of the bus is usually much slower than the processor's, by as much as a factor of 10. The speed of this bus affects the miss penalty.
• Assume a cache with 4-word blocks, and:
  – 1 memory bus clock cycle to send the address
  – 15 memory bus clock cycles for each DRAM access initiated
  – 1 memory bus clock cycle to send a word of data
• Also assume the three different memory organizations shown on the following page.
• How many clock cycles are required to transfer 4 words?
  1. Case (a)
  2. Case (b) with a two-word memory width
  3. Case (b) with a four-word memory width
  4. Case (c)
Higher Memory Bandwidth
• Make reading multiple words faster by widening the memory or by using banks of memory.
[Figure: three organizations, each with a CPU, a cache, and a bus to memory. (a) One-word-wide memory organization. (b) Wide memory organization, with a multiplexor between the wide cache and the CPU. (c) Interleaved memory organization, with a one-word bus feeding four banks: memory bank 0 through memory bank 3.]
Higher Memory Bandwidth – Case (a)
• Memory is one word wide, and all accesses are made sequentially.
If we have a cache block of 4 words and a one-word-wide bank of DRAMs, the miss penalty is:
  1 + 4 x 15 + 4 x 1 = 65 memory bus clock cycles.
The number of bytes transferred per bus clock cycle for a single miss is:
  4 words / 65 cycles = (4 x 4 bytes) / 65 ≈ 0.25 bytes/cycle.
Higher Memory Bandwidth – Case (b)
• The bandwidth to memory is increased by widening the memory and the bus between the processor and memory. This allows parallel access to all the words of the block.
With a main memory width of 2 words, the miss penalty is:
  1 + 2 x 15 + 2 x 1 = 33 memory bus clock cycles,
and the number of bytes transferred per bus clock cycle for a single miss is:
  4 words / 33 cycles = (4 x 4 bytes) / 33 ≈ 0.48 bytes/cycle.
With a main memory width of 4 words, the miss penalty is:
  1 + 1 x 15 + 1 x 1 = 17 memory bus clock cycles,
and the number of bytes transferred per bus clock cycle for a single miss is:
  4 words / 17 cycles = (4 x 4 bytes) / 17 ≈ 0.94 bytes/cycle.
Higher Memory Bandwidth – Case (c)
• The bandwidth is increased by organizing the memory into banks without widening the interconnection bus. We still pay a cost to transmit each word, but we avoid paying the access latency more than once.
With four one-word-wide banks, the miss penalty is:
  1 + 1 x 15 + 4 x 1 = 20 memory bus clock cycles,
and the number of bytes transferred per bus clock cycle for a single miss is:
  4 words / 20 cycles = (4 x 4 bytes) / 20 = 0.8 bytes/cycle.
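A small C sketch, using only the timing assumptions above, reproduces all four miss penalties; folding the three organizations into one formula is my own packaging, not the textbook's:

    #include <stdio.h>

    /* 1 cycle to send the address, 15 cycles per (possibly overlapped)
       round of DRAM accesses, 1 cycle per bus transfer of one bus width. */
    static unsigned miss_penalty(unsigned block_words, unsigned mem_width,
                                 unsigned banks, unsigned bus_width) {
        unsigned per_round = mem_width * banks;  /* words fetched per 15-cycle round */
        unsigned accesses  = (block_words + per_round - 1) / per_round;
        unsigned transfers = block_words / bus_width;
        return 1 + accesses * 15 + transfers * 1;
    }

    int main(void) {
        printf("a) one-word-wide:       %u\n", miss_penalty(4, 1, 1, 1)); /* 65 */
        printf("b) two-word-wide:       %u\n", miss_penalty(4, 2, 1, 2)); /* 33 */
        printf("b) four-word-wide:      %u\n", miss_penalty(4, 4, 1, 4)); /* 17 */
        printf("c) four one-word banks: %u\n", miss_penalty(4, 1, 4, 1)); /* 20 */
        return 0;
    }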
Performance
• Simplified model:
  execution time = (execution cycles + stall cycles) x cycle time
  stall cycles = # of instructions x miss ratio x miss penalty
• Two ways of improving performance:
  – decreasing the miss ratio
  – decreasing the miss penalty
Measuring and Improving Cache Performance
• Two techniques for improving cache performance:
  1. Reduce the miss rate by reducing the probability that two different memory blocks will contend for the same cache location.
  2. Reduce the miss penalty by adding an additional level to the hierarchy (multi-level caching).
• CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Clock cycle time
• Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
• Read-stall cycles = (Reads/Program) x Read miss rate x Read miss penalty
• Write-stall cycles = (Writes/Program) x Write miss rate x Write miss penalty + Write buffer stalls
• If we assume that write buffer stalls are negligible, we can combine the reads and writes by using a single miss rate and miss penalty:
  Memory-stall clock cycles = (Memory accesses/Program) x Miss rate x Miss penalty
• Equivalently:
  Memory-stall clock cycles = (Instructions/Program) x (Misses/Instruction) x Miss penalty
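A minimal C sketch of the per-instruction form of this formula; it reproduces the 3.44 stall cycles per instruction derived in the example that follows:

    /* Stalls per instruction = instruction-cache term + data-cache term. */
    static double stall_cycles_per_instruction(double i_miss_rate,
                                               double mem_refs_per_instr,
                                               double d_miss_rate,
                                               double miss_penalty) {
        return i_miss_rate * miss_penalty
             + mem_refs_per_instr * d_miss_rate * miss_penalty;
    }
    /* stall_cycles_per_instruction(0.02, 0.36, 0.04, 100.0) == 3.44 */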
Example
• Assume:
  – an instruction cache miss rate for a program of 2%
  – a data cache miss rate of 4%
  – a processor CPI of 2 without memory stalls
  – a miss penalty of 100 cycles for all misses
  – a frequency of loads and stores of 36%
• Determine how much faster the processor would run with a perfect cache that never missed.
Example – cont.
The number of memory miss cycles for instruction fetch = I x 2% x 100 = 2.00 x I, where I is the instruction count.
The number of memory miss cycles for data memory accesses = I x 36% x 4% x 100 = 1.44 x I.
The total number of memory-stall cycles = 2.00 I + 1.44 I = 3.44 I.
The total number of cycles = (2 x I) + 3.44 x I, so the CPI with memory stalls = 2 + 3.44 = 5.44.
The ratio of the CPU execution times is:
  CPU time with stalls / CPU time with perfect cache
  = (I x CPI_stall x Clock cycle) / (I x CPI_perfect x Clock cycle)
  = 5.44 / 2 = 2.72
Example 2 – Cache Performance with CPI Reduced to 1
• We reduce the base CPI from 2 (in the previous example) to 1.
• How much faster would the computer be with a perfect cache, assuming the same miss rates and miss penalty?
Total miss cycles per instruction = (2% x 100) + 36% x (4% x 100) = 3.44.
The computer with cache misses has CPI = 1 + 3.44 = 4.44 (the previous one had CPI = 5.44).
Performance with perfect cache / Performance with real cache
  = Execution time with real cache / Execution time with perfect cache
  = (IC x 4.44 x Clock cycle) / (IC x 1 x Clock cycle)
  = 4.44
The system with the perfect cache would be 4.44 times faster.
The fraction of execution time spent on memory stalls is higher for this system with CPI = 1 (3.44/4.44 ≈ 77%) than for the system with CPI = 2 (3.44/5.44 ≈ 63%).
AMAT: Average Memory Access Time
AMAT (Average Memory Access Time): the average time to access memory, considering both hits and misses and the frequency of different accesses:
  AMAT = Time for a hit + Miss rate x Miss penalty
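A one-function C sketch of this formula, in clock cycles (the name is illustrative); the example below plugs in the numbers:

    static double amat_cycles(double hit_time, double miss_rate,
                              double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }
    /* amat_cycles(1.0, 0.05, 20.0) == 2.0 clock cycles, i.e. 2 ns at 1 ns/cycle */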
Example:
Suppose that we have a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
Compute its AMAT.
AMAT Example
The average memory access time per instruction is:
  AMAT (in clock cycles) = cycles for a hit + Miss rate x Miss penalty in cycles
  = 1 + 0.05 x 20
  = 2 clock cycles
AMAT (in time) = 2 clock cycles x 1 ns/clock cycle = 2 ns.