CO101: Principle of Computer Organization
Lecture 17: Memory 3
Liang Yanyan
澳門科技大學
Macau University of Science and Technology
Memory Performance
• Programmers want unlimited amounts of memory with
low latency.
• Fast memory technology is more expensive per bit than
slower memory.
• Solution: organize memory system into a hierarchy
• Entire addressable memory space available in largest, slowest
memory
• Incrementally smaller and faster memories, each containing a
subset of the memory below it, proceed in steps up toward the
processor
• Temporal and spatial locality ensure that nearly all references can be found in smaller memories (illustrated by the sketch below)
• Gives the illusion of a large, fast memory being presented to the processor
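As a concrete illustration (a minimal sketch, not taken from the lecture), the C loop below exhibits both kinds of locality:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* zero-initialized array */
    int sum = 0;

    /* Spatial locality: a[i] and a[i+1] are adjacent in memory, so one
     * cache block fetched on a miss serves several later accesses.
     * Temporal locality: sum and i are reused on every iteration,
     * so they stay in the fastest levels of the hierarchy. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("%d\n", sum);
    return 0;
}
```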
Handling Cache Hits
• Read hits (I$ and D$)
• this is what we want!
• Write hits (D$ only)
• require the cache and memory to be consistent
• always write the data into both the cache block and the next level in the memory hierarchy (write-through)
• writes run at the speed of the next level in the memory hierarchy –
so slow! – or can use a write buffer and stall only if the write buffer is
full.
• allow cache and memory to be inconsistent
• write the data only into the cache block (write-back the cache block
to the next level in the memory hierarchy when that cache block is
“evicted”).
• need a dirty bit for each data cache block to tell if it needs to be
written back to memory when it is evicted – can use a write buffer to
help “buffer” write-backs of dirty blocks.
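As a concrete illustration of the two write-hit policies, here is a minimal C sketch assuming a hypothetical direct-mapped cache with one word per block; all structure and function names are invented for illustration, not taken from the lecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINES 64

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data;
} Line;

Line     cache[LINES];
uint32_t memory[1 << 20];   /* toy next level of the hierarchy */

/* Write-through: on a write hit, update both the cache block and
 * the next level, so the two are always consistent.  A write buffer
 * would let the processor continue instead of waiting for memory. */
void write_through_hit(Line *ln, uint32_t addr, uint32_t val) {
    ln->data     = val;
    memory[addr] = val;     /* or enqueue into a write buffer */
}

/* Write-back: on a write hit, update only the cache block and mark
 * it dirty; memory is updated later, when the block is evicted. */
void write_back_hit(Line *ln, uint32_t val) {
    ln->data  = val;
    ln->dirty = true;
}

/* On eviction, a dirty block must be written back first. */
void evict(Line *ln, uint32_t old_addr) {
    if (ln->valid && ln->dirty)
        memory[old_addr] = ln->data;
    ln->valid = ln->dirty = false;
}
```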
Handling Cache Misses
• Read misses (I$ and D$)
• stall the pipeline, fetch the block from the next level in the
memory hierarchy, install it in the cache and send the requested
word to the processor, then let the pipeline resume.
• Write misses (D$ only)
• stall the pipeline, fetch the block from the next level in the memory
hierarchy, install it in the cache (which may involve having to
evict a dirty block if using a write-back cache), write the word
from the processor to the cache, then let the pipeline resume.
• Write allocate – just write the word into the cache updating both
the tag and data, no need to stall.
• Write not allocate – skip the cache write (but must invalidate that
cache block since it will now hold stale data) and just write the
word to the write buffer (and eventually to the next memory level),
no need to stall if the write buffer isn’t full.
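Continuing the toy cache structures from the write-hit sketch above, the two write-miss policies could be sketched like this (again hypothetical, illustrative code):

```c
/* Helpers for the toy direct-mapped cache declared earlier. */
uint32_t index_of(uint32_t addr) { return addr % LINES; }
uint32_t tag_of(uint32_t addr)   { return addr / LINES; }

/* Write allocate: fetch the block into the cache, then write it. */
void write_miss_allocate(uint32_t addr, uint32_t val) {
    Line *ln = &cache[index_of(addr)];
    if (ln->valid && ln->dirty)   /* write-back cache: evict the victim */
        memory[ln->tag * LINES + index_of(addr)] = ln->data;
    ln->valid = true;
    ln->tag   = tag_of(addr);
    ln->data  = memory[addr];     /* fetch (matters for multi-word blocks) */
    ln->data  = val;              /* then perform the write */
    ln->dirty = true;
}

/* Write not allocate: bypass the cache and write the word toward the
 * next level (via a write buffer); any stale copy of the block still
 * in the cache must be invalidated. */
void write_miss_no_allocate(uint32_t addr, uint32_t val) {
    Line *ln = &cache[index_of(addr)];
    if (ln->valid && ln->tag == tag_of(addr))
        ln->valid = false;        /* invalidate stale data */
    memory[addr] = val;           /* goes to the next memory level */
}
```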
Cache Misses
• Miss rate
• Fraction of cache accesses that result in a miss
• Causes of misses
• Compulsory (cold start or process migration, first reference):
• First access to a block, a “cold” fact of life, not a whole lot you can do about it.
If you are going to run “millions” of instructions, compulsory misses are
insignificant.
• Solution: increase block size (increases miss penalty; very large blocks
could increase miss rate).
• Capacity:
• Cache cannot contain all blocks accessed by the program.
• Solution: increase cache size (may increase access time).
• Conflict (collision):
• Multiple memory locations mapped to the same cache location.
• Solution 1: increase cache size.
• Solution 2: increase associativity (may increase access time).
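To make conflict misses concrete, consider this small sketch (cache parameters invented for illustration): two addresses that are exactly one cache-size apart map to the same line of a direct-mapped cache, so alternating accesses to them evict each other repeatedly.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical direct-mapped cache: 256 blocks of 64 bytes = 16 KB. */
#define BLOCK_BYTES 64
#define NUM_BLOCKS  256

uint32_t cache_index(uint32_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BLOCKS;
}

int main(void) {
    uint32_t a = 0x40;
    uint32_t b = a + BLOCK_BYTES * NUM_BLOCKS;  /* 16 KB apart */

    /* Same index, different tags: a and b conflict even if the rest of
     * the cache is empty.  A 2-way set-associative cache could hold
     * both blocks at once. */
    printf("index(a) = %u, index(b) = %u\n",
           (unsigned)cache_index(a), (unsigned)cache_index(b));
    return 0;
}
```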
Cache size and miss rate
• The cache size also has a significant impact on
performance.
• A larger cache can reduce address conflicts.
• Again this means the miss rate decreases, so the AMAT and the
number of memory stall cycles are also lower.
Block size and miss rate
• Miss rate also depends on the block size relative to the overall
cache size.
• A smaller block size doesn’t take full advantage of spatial
locality.
• But if blocks are too large, the number of blocks is smaller, which
may increase address conflicts.
Associativity and miss rate
• A higher-associativity cache means more complex
hardware, but a highly-associative cache will also exhibit
a lower miss rate.
• Each set has more blocks, which helps reduce address conflicts.
• Overall, this will reduce AMAT and memory stall cycles.
Memory and overall performance
• Assuming cache hit costs are included as part of the
normal CPU execution cycle, then
• The total number of stall cycles depends on the number
of cache misses and the miss penalty.
Memory stall cycles = Memory accesses × Miss rate × Miss penalty
• To include stalls due to cache misses in CPU performance equations, we have to add them to the “base” number of execution cycles.
CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time
Impacts of Cache Performance
• Assume the total number of instructions in a program is N,
and 33% of the instructions are data accesses. The
cache hit ratio is 97% and the hit time is one cycle, but
the miss penalty is 20 cycles.
Memory stall cycles = Memory accesses × Miss rate × Miss penalty
= 0.33 × N × 0.03 × 20 cycles
≈ 0.2 × N cycles
• The number of wasted cycles is about 0.2N, so the total is
1.2N cycles.
• This code is 1.2 times slower than a program with a
“perfect” CPI of 1!
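The same arithmetic as a small C program (a sketch using only the numbers assumed on this slide):

```c
#include <stdio.h>

int main(void) {
    /* Per the slide: 33% of instructions access data, 3% miss rate,
     * 20-cycle miss penalty, and a base CPI of 1. */
    double accesses_per_instr = 0.33;
    double miss_rate          = 0.03;
    double miss_penalty       = 20.0;

    double stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty;
    double effective_cpi    = 1.0 + stalls_per_instr;

    printf("stall cycles per instruction: %.3f\n", stalls_per_instr); /* ~0.2 */
    printf("effective CPI: %.3f\n", effective_cpi);                   /* ~1.2 */
    return 0;
}
```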
Average memory access time (AMAT)
• Factors that affect memory performance:
• Hit time: the access time on a cache hit, i.e., the cache access
time.
• Miss rate: the percentage of memory accesses that cannot be
handled by the cache.
• Miss penalty: the additional time to load data from main memory for
a cache miss.
• The average memory access time (AMAT) is determined
by the above three factors:
AMAT = Hit time + (Miss rate × Miss penalty)
• E.g. L1 hit time = 2, L2 hit time = 16, Memory time = 200, L1
miss rate = 1%, L2 local miss rate = 5%
AMAT = 2 + 0.01 × 16 + 0.01 × 0.05 × 200 = 2.26 cycles
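The same two-level calculation in C, applying AMAT = Hit time + (Miss rate × Miss penalty) at each level (a sketch; the 5% figure is treated as the L2 local miss rate):

```c
#include <stdio.h>

int main(void) {
    double l1_hit  = 2.0, l2_hit = 16.0, mem_time = 200.0;
    double l1_miss = 0.01, l2_local_miss = 0.05;

    /* The L1 miss penalty is itself an AMAT: the time to access L2
     * plus the chance of having to go all the way to memory. */
    double l1_penalty = l2_hit + l2_local_miss * mem_time;
    double amat       = l1_hit + l1_miss * l1_penalty;

    printf("AMAT = %.2f cycles\n", amat);   /* 2.26 */
    return 0;
}
```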
Multilevel Cache Design Considerations
• Design considerations for L1 and L2 caches are
very different.
• Primary cache should focus on minimizing hit time in
support of a shorter clock cycle.
• Smaller with smaller block sizes
• Direct-mapped
• caches can overlap tag compare and transmission of data
• Lower associativity
• reduces power because fewer cache lines are accessed
• Secondary cache(s) should focus on reducing miss
rate to reduce the penalty of long main memory
access times.
• Larger with larger block sizes
• Higher levels of associativity
Multilevel Cache Design Considerations
• The miss penalty of the L1 cache is significantly reduced
by the presence of an L2 cache – so it can be smaller
(i.e., faster) but have a higher miss rate.
• For the L2 cache, hit time is less important than miss
rate.
• The L2$ hit time determines the L1$’s miss penalty.
• The L2$ local miss rate is much higher (>>) than its global miss
rate, because the L2$ only sees the references that already
missed in the L1$.
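A quick numeric check of the local-vs-global distinction, reusing the AMAT example’s numbers (a sketch; Global miss rate = L1 miss rate × L2 local miss rate follows directly from the definitions):

```c
#include <stdio.h>

int main(void) {
    double l1_miss_rate  = 0.01;   /* misses per processor reference   */
    double l2_local_miss = 0.05;   /* misses per reference reaching L2 */

    /* The L2 sees only the 1% of references that missed in L1, so
     * its global miss rate is far smaller than its local one. */
    double l2_global_miss = l1_miss_rate * l2_local_miss;

    printf("L2 local: %.4f, L2 global: %.4f\n",
           l2_local_miss, l2_global_miss);   /* 0.0500 vs 0.0005 */
    return 0;
}
```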
L1 Size and Associativity
[Figure: access time vs. size and associativity]
[Figure: energy per read vs. size and associativity]
Summary: Improving Cache Performance
• Reduce the time to hit in the cache
• smaller cache
• direct mapped cache
• smaller blocks
• Reduce the miss rate
• bigger cache
• more flexible placement (increase associativity)
• larger blocks (16 to 64 bytes typical)
• victim cache – small buffer holding most recently discarded blocks
• Reduce the miss penalty
• smaller blocks
• use multiple cache levels – L2 cache not tied to CPU clock rate
• faster backing store/improved memory bandwidth
• wider buses
• memory interleaving, DDR SDRAMs
Summary: The Cache Design Space
• Several interacting dimensions
• cache size
• block size
• associativity
• replacement policy
• write-through vs write-back
• write allocation
• The optimal choice is a compromise
• depends on access characteristics
• workload
• use (I-cache, D-cache, TLB)
• depends on technology / cost
• Simplicity often wins
[Figure: the cache design space, showing performance from Bad to Good as each factor (associativity, cache size, block size) varies from Less to More]