Computer Organization and Design
Unit 8: Caches
Slides by Profs. , , C.J. Taylor and
• P&H Chapter 5: Sections 5.1–5.3, 5.5
• Appendix C.9
The Problem
• Using current technologies, memories can be either large or fast, but not both.
• The size or capacity of a memory refers to the number of bits it can store
• The speed of a memory typically refers to the amount of time it takes to access a stored entry: the delay between asking for the value stored at an address and receiving the result.
Memories (SRAM & DRAM)
Types of Memory
• Static RAM (SRAM)
• 6 or 8 transistors per bit
• Two inverters (4 transistors) + transistors for reading/writing
• Optimized for speed (first) and density (second)
• Fast (sub-nanosecond latencies for small SRAMs)
• Latency roughly proportional to the array's side length (~ sqrt(number of bits))
• Mixes well with standard processor logic
• Dynamic RAM (DRAM)
• 1 transistor + 1 capacitor per bit
• Optimized for density (in terms of cost per bit)
• Slow (>30ns internal access, ~50ns pin-to-pin)
• Different fabrication steps (does not mix well with logic)
• Nonvolatile storage: magnetic disk, NAND Flash, phase-change memory, …
Memory Technology Trends
[Figure: memory technology trends; c/o O’Hallaron, Ganger, Kesden, CMU 15-213 / 18-213]
Locality of Memory Technologies
• For many (all?) memory technologies, it takes a long time to get the first element out of the memory, but data in the vicinity can then be fetched very quickly
• It is usually a good idea to ‘buy in bulk’: fetch many elements in the vicinity rather than just one
The Memory Hierarchy
Big Picture Motivation
• Processor can compute only as fast as memory
• A 3GHz processor can execute an “add” operation in 0.33ns
• Today’s “main memory” latency is more than 33ns
• Naïve implementation: loads/stores can be 100x slower than other operations
• Unobtainable goal:
• Memory that operates at processor speeds
• Memory as large as needed for all running programs
• Memory that is cost effective
• Can’t achieve all of these goals at once
Known From the Beginning
“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”
— Burks, Goldstine, and von Neumann, “Preliminary Discussion of the Logical Design of an Electronic Computing Instrument,” IAS memo, 1946
This Unit: Caches
• “Cache”: hardware managed
• Hardware automatically retrieves missing data
• Built from fast SRAM, usually on-chip today
• In contrast to off-chip DRAM “main memory”
• Cache organization
• Speed vs. capacity
• ABCs (associativity, block size, capacity)
• Miss classification
• Some example performance calculations
Key Observation: Locality
• Locality of memory references
• Empirical property of real-world programs, few exceptions
• Temporal locality
• Recently referenced data is likely to be referenced again soon
• Reactive: “cache” recently used data in small, fast memory
• Spatial locality
• More likely to reference data near recently referenced data
• Proactive: “cache” large chunks of data to include nearby data
• Both properties hold for data and instructions
• Cache: finite-sized hashtable of recently used data blocks
• In hardware, transparent to software
Spatial and Temporal Locality Example
• Which memory accesses demonstrate spatial locality?
• Which memory accesses demonstrate temporal locality?
int sum = 0;
int X[1000];
for (int c = 0; c < 1000; c++) {
    sum += c;
    X[c] = 0;
}
(The writes to X[c] touch consecutive addresses: spatial locality. The accesses to sum and c, and the loop’s own instructions, repeat every iteration: temporal locality.)
Library Analogy
• Consider books in a library
• Library has lots of books, but it is slow to access
• Far away (time to walk to the library)
• Big (time to walk within the library)
• How can you avoid these latencies?
• Check out books, take them home with you
• Put them on desk, on bookshelf, etc.
• But desks & bookshelves have limited capacity
• Keep recently used books around (temporal locality)
• Grab books on related topic at the same time (spatial locality)
• Guess what books you'll need in the future (prefetching)
Library Analogy Explained
• Registers ≈ books on your desk
• Actively being used, small capacity
• Caches ≈ bookshelves
• Moderate capacity, pretty fast to access
• Main memory ≈ the library
• Big, holds almost all data, but slow
• Disk (virtual memory) ≈ inter-library loan
• Very slow, but hopefully really uncommon
Exploiting Locality: Memory Hierarchy
• Hierarchy of memory components
• Upper components: fast ↔ small ↔ expensive
• Lower components: slow ↔ big ↔ cheap
• Connected by “buses”
• Which also have latency and bandwidth issues
• Most frequently accessed data in M1
• M1 + next most frequently accessed in M2, etc.
• Move data up and down the hierarchy
• Optimize average access time
• latency_avg = latency_hit + (%miss × latency_miss)
• Attack each component
Concrete Memory Hierarchy
• 0th level: registers (compiler managed)
• 1st level: primary caches (hardware managed)
• Split instruction (I$) and data (D$)
• Typically 8KB to 64KB each
• 2nd level: 2nd- and 3rd-level caches (L2, L3), also hardware managed
• On-chip, typically made of SRAM
• L2 typically ~256KB to 512KB
• “Last-level cache” typically 4MB to 16MB
• 3rd level: main memory
• Made of DRAM (“dynamic” RAM)
• Typically 1GB to 4GB for desktops/laptops
• Servers can have 100s of GB
• 4th level: disk (swap and files), software managed (by the OS)
• Uses magnetic disks or flash drives
Evolution of Cache Hierarchies
[Figure: die photos of the Intel 486 vs. a quad-core Intel Core i7, which has a private 256KB L2 per core and a shared 8MB L3]
• Chips today are 30–70% cache by area
Logical Cache Organization
• Cache is a hardware hashtable
• The setup
• 32-bit ISA → 4G words/addresses, 2^32 B address space
• Logical cache organization
• 4KB, organized as 1K 4B blocks (aka lines)
• Each block can hold a 4-byte word
• Physical cache implementation
• 1K (1024 entries) by 4B SRAM
• Called data array
• 10-bit address input
• 32-bit data input/output
Looking Up A Block
• Q: which 10 of the 32 address bits to use?
• A: bits [11:2]
• 2 least significant (LS) bits [1:0] are the offset bits
• Locate byte within word
• Don't need these to locate word
• Next 10 LS bits [11:2] are the index bits
• These locate the word
• Nothing says the index must be these bits
• But these work best in practice
• Why? (think about it)
Is this the block you’re looking for?
• Each cache row corresponds to 2^20 blocks
• How to know which (if any) is currently there?
• Tag each cache word with the remaining address bits [31:12]
• Build separate and parallel tag array
• 1K by 21-bit SRAM
• 20-bit (next slide) tag + 1 valid bit
• Lookup algorithm
• Read tag indicated by index bits
• If tag matches & valid bit set:
  then: Hit → data is good
  else: Miss → data is garbage, wait...
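
A minimal software model of this lookup algorithm for the 4KB cache above (plain C; array and function names are illustrative, and real hardware reads the tag and data arrays in parallel rather than sequentially):

#include <stdbool.h>
#include <stdint.h>

#define FRAMES 1024                               /* 1K frames of 4B blocks */

static bool     valid_array[FRAMES];
static uint32_t tag_array[FRAMES];
static uint32_t data_array[FRAMES];

bool cache_lookup(uint32_t addr, uint32_t *data) {
    uint32_t index = (addr >> 2) & (FRAMES - 1);  /* index bits [11:2] */
    uint32_t tag   = addr >> 12;                  /* tag bits [31:12] */
    if (valid_array[index] && tag_array[index] == tag) {
        *data = data_array[index];                /* hit: data is good */
        return true;
    }
    return false;                                 /* miss: data is garbage, wait... */
}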
A Concrete Example
• Lookup address 0x000C14B8
• Index = addr[11:2] = (addr >> 2) & 0x3FF = 0x12E
• Tag = addr[31:12] = (addr >> 12) = 0x000C1
• In binary: 0x000C14B8 = 0000 0000 0000 1100 0001 | 0100 1011 10 | 00 (tag | index | offset)
• Stored tag-array entry at index 0x12E: valid = 1, tag = 0000 0000 0000 1100 0001 → match → hit
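
The split can be double-checked with shifts and masks; a small C sketch (not from the slides) that reproduces the values above:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr   = 0x000C14B8;
    uint32_t offset = addr & 0x3;                 /* bits [1:0]  */
    uint32_t index  = (addr >> 2) & 0x3FF;        /* bits [11:2] */
    uint32_t tag    = addr >> 12;                 /* bits [31:12] */
    printf("tag=0x%05X index=0x%03X offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    /* prints: tag=0x000C1 index=0x12E offset=0 */
    return 0;
}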
Handling a Cache Miss
• What if requested data isn’t in the cache?
• How does it get in there?
• Cache controller: finite state machine
• Remembers miss address
• Accesses next level of memory
• Waits for response
• Writes data/tag into proper locations
• Bringing a missing block into the cache is a cache fill
Cache Misses and Pipeline Stalls
• I$ and D$ misses stall the pipeline just like data hazards
• Stall logic driven by miss signal
• Cache “logically” re-evaluates hit/miss every cycle
• Block is filled → miss signal goes low → pipeline restarts
Cache Performance Equation
• For a cache
• Access: read or write to cache
• Hit: desired data found in cache
• Miss: desired data not found in cache
• Must get from another component
• No notion of “miss” in register file
• Fill: action of placing data into cache
• %miss (miss rate): #misses / #accesses
• t_access: time to check the cache. If hit, we’re done.
• t_miss: time to read data into the cache
• Performance metric: average access time
  t_avg = t_access + (%miss × t_miss)
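
The equation transcribes directly into code; a minimal C sketch (the numbers in main are hypothetical, not from the slides):

#include <stdio.h>

/* t_avg = t_access + (%miss * t_miss), all in cycles */
double t_avg(double t_access, double pct_miss, double t_miss) {
    return t_access + pct_miss * t_miss;
}

int main(void) {
    /* e.g., 1-cycle hit check, 10% miss rate, 10-cycle miss penalty */
    printf("t_avg = %.1f cycles\n", t_avg(1.0, 0.10, 10.0));  /* 2.0 */
    return 0;
}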
CPI Calculation with Cache Misses
• Parameters
• Simple pipeline with base CPI of 1
• Instruction mix: 30% loads/stores
• I$: %miss = 2%, t_miss = 10 cycles
• D$: %miss = 10%, t_miss = 10 cycles
• What is the new CPI?
• CPI_I$ = %miss_I$ × t_miss = 0.02 × 10 cycles = 0.2 cycles
• CPI_D$ = %load/store × %miss_D$ × t_miss_D$ = 0.3 × 0.1 × 10 cycles = 0.3 cycles
• CPI_new = CPI + CPI_I$ + CPI_D$ = 1 + 0.2 + 0.3 = 1.5
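
The same arithmetic as a quick C check (parameters copied from the slide):

#include <stdio.h>

int main(void) {
    double base_cpi  = 1.0;
    double ldst_frac = 0.30;                           /* 30% loads/stores */
    double imiss = 0.02, dmiss = 0.10, tmiss = 10.0;
    double cpi_i = imiss * tmiss;                      /* every insn accesses the I$ */
    double cpi_d = ldst_frac * dmiss * tmiss;          /* only loads/stores access the D$ */
    printf("CPI = %.1f\n", base_cpi + cpi_i + cpi_d);  /* 1.5 */
    return 0;
}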
Measuring Cache Performance
• Ultimate metric is t_avg
• Cache capacity and circuits roughly determine t_access
• Lower-level memory structures determine t_miss
• Measure %miss
• Hardware performance counters
• Simulation
Cache operation: 1B block
• 8-bit addresses → 256B memory
• Keeps diagrams simple
• 4B cache, 1B blocks
• Figure out number of sets: 4 (capacity / block size)
• Figure out how the address splits into offset/index/tag bits
• Offset: least-significant log2(block size) = log2(1) = 0 bits
• Index: next log2(number of sets) = log2(4) = 2 bits
• Tag: rest = 8 – 2 = 6 bits
• Address layout: tag (6 bits) | index (2 bits)
Multi-Word Cache Blocks
• In most modern implementations, each cache block holds the data for more than one address (>1 byte per block).
• The number of bytes or words stored in each cache block is referred to as the block size.
• The entries in each block come from a contiguous set of addresses to exploit locality of reference, and to simplify indexing
• Cache blocks are also referred to as cache lines
• Related to cache frames: a frame is the bucket, and the block/line is the data that goes in the bucket
• blocks/lines move around due to fills & evictions
• frames are part of the cache structure and never move
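
The offset/index/tag recipe used on the next few slides generalizes to any power-of-two capacity and block size; a hedged C sketch of the direct-mapped case (helper names are illustrative):

#include <stdint.h>
#include <stdio.h>

/* log2 for powers of two */
static unsigned log2u(uint32_t x) {
    unsigned n = 0;
    while (x >>= 1) n++;
    return n;
}

/* Print the tag/index/offset split of addr for a direct-mapped cache. */
void split(uint32_t addr, uint32_t capacity, uint32_t block_size) {
    uint32_t sets        = capacity / block_size;
    unsigned offset_bits = log2u(block_size);
    unsigned index_bits  = log2u(sets);
    uint32_t offset = addr & (block_size - 1);
    uint32_t index  = (addr >> offset_bits) & (sets - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);
    printf("tag=0x%X index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
}

int main(void) {
    split(0x000C14B8, 4096, 4);  /* the 4KB, 4B-block cache from earlier */
    return 0;
}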
Cache operation: 2B blocks
• 8B cache, 2B blocks
• Figure out number of sets: 4 (capacity / block size)
• Figure out how the address splits into offset/index/tag bits
• Offset: least-significant log2(block size) = log2(2) = 1 bit
• Index: next log2(number of sets) = log2(4) = 2 bits
• Tag: rest = 8 – 1 – 2 = 5 bits
• Address layout: tag (5 bits) | index (2 bits) | block offset (1 bit)
Capacity and Performance
• Simplest way to reduce %miss: increase capacity
+ Miss rate decreases monotonically
• “Working set”: insns/data a program is actively using
• Diminishing returns
– However, t_access increases
• Latency proportional to sqrt(capacity)
[Plot: %miss vs. cache capacity, with diminishing returns past the “working set” size]
• Given capacity, manipulate %miss by changing organization
Block Size
• Given capacity, manipulate %miss by changing organization
• One option: increase block size
• Exploit spatial locality
• Notice index/offset bits change
• Tag remains the same
• Ramifications
+ Reduces %miss (up to a point)
+ Reduces tag overhead (why?)
– Potentially useless data transfer
– Premature replacement of useful data
Block Size and Tag Overhead
• 4KB cache with 1024 4B blocks?
• 4B blocks → 2-bit offset, 1024 frames → 10-bit index
• 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
• 20-bit tag / 32-bit block = ~63% overhead
• 4KB cache with 512 8B blocks?
• 8B blocks → 3-bit offset, 512 frames → 9-bit index
• 32-bit address – 3-bit offset – 9-bit index = 20-bit tag
• 20-bit tag / 64-bit block = ~31% overhead
• Notice: tag size is the same, but the data size is twice as big
• A realistic example: 64KB cache with 64B blocks
• 16-bit tag / 512-bit block = ~3% overhead
• Note: Tags are not optional
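
A quick check of the overhead arithmetic above (tag bits divided by data bits per block; valid bits ignored):

#include <stdio.h>

int main(void) {
    printf("4B blocks:  %.1f%%\n", 100.0 * 20 / 32);   /* 62.5% */
    printf("8B blocks:  %.1f%%\n", 100.0 * 20 / 64);   /* 31.3% */
    printf("64B blocks: %.1f%%\n", 100.0 * 16 / 512);  /*  3.1% */
    return 0;
}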
Effect of Block Size on Miss Rate
• Two effects on miss rate
+ Spatial prefetching (good)
• For blocks with adjacent addresses
• Turns miss/miss pairs into miss/hit pairs
– Interference (bad)
• For blocks with non-adjacent addresses (but in adjacent frames)
• Turns hits into misses by disallowing simultaneous residence
• Consider the entire cache as one big block
• Both effects always present
• Spatial prefetching dominates initially
• Depends on size of the cache
• Good block size is 32–256B
• Program dependent
[Plot: %miss vs. block size — %miss falls, then rises again as interference takes over]
Block Size and Miss Penalty
• Does increasing block size increase t_miss?
• Don’t larger blocks take longer to read, transfer, and fill?
• They do, but…
• t_miss of an isolated miss is not affected
• Critical Word First / Early Restart (CWF/ER)
• Requested word fetched first, pipeline restarts immediately
• Remaining words in block transferred/filled in the background
• t_miss of a cluster of misses will suffer
• Reads/transfers/fills of two misses can’t happen at the same time
• Latencies can start to pile up
• This is a bandwidth problem
Cache Conflicts
[Diagram: main-memory blocks mapping into a 4-frame cache; address = tag (1 bit) | index (2 bits) | offset]
• Pairs like “0010” and “1010” conflict
• Same index!
• Can such pairs simultaneously reside in the cache?
• A: Yes, if we reorganize the cache to do so
Associativity
• Set-associativity
• Block can reside in one of a few frames
• Frame groups called sets
• Each frame in a set called a way
• This example is 2-way set-associative (SA)
• 1-way → direct-mapped (DM)
• 1-set → fully-associative (FA)
+ Reduces conflicts
– Increases latency_hit:
• additional tag match & muxing
Associativity
• Lookup algorithm
• Use index bits to find set
• Read data/tags in all frames in parallel
• Any (tag match and valid bit set) → Hit
• Notice tag/index/offset bits
• Only 9-bit index (versus 10-bit for direct-mapped)
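
A minimal software sketch of this lookup, assuming the 4B blocks and 9-bit index shown here (structure and names are illustrative; real hardware probes all ways of the set in parallel):

#include <stdbool.h>
#include <stdint.h>

#define WAYS 2
#define SETS 512                                 /* 9-bit index */

struct frame { bool valid; uint32_t tag; uint32_t data; };
static struct frame cache[SETS][WAYS];

bool sa_lookup(uint32_t addr, uint32_t *data) {
    uint32_t index = (addr >> 2) & (SETS - 1);   /* 4B blocks: 2 offset bits */
    uint32_t tag   = addr >> 11;                 /* remaining 21 bits */
    for (int way = 0; way < WAYS; way++) {
        if (cache[index][way].valid && cache[index][way].tag == tag) {
            *data = cache[index][way].data;      /* hit */
            return true;
        }
    }
    return false;                                /* miss */
}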
Replacement Policies
• Set-associative caches present a new design choice
• On a cache miss, which block in the set to replace (kick out)?
• Some options
• Random
• FIFO (first-in first-out)
• LRU (least recently used)
• Fits with temporal locality; LRU = least likely to be used in the future
• NMRU (not most recently used)
• An easier-to-implement approximation of LRU
• Equivalent to LRU for 2-way set-associative caches
• Belady’s: replace the block that will be used furthest in the future
• Unachievable optimum
LRU and Miss Handling
• Add an LRU field to each set
• “Least recently used”
• LRU data encodes a “way”
• Hit? Update MRU
• LRU bits updated on each access
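
For 2-way sets a single bit per set suffices (this is the case where NMRU and LRU coincide); a minimal C sketch (names illustrative):

#include <stdint.h>

#define SETS 512                 /* matches the 2-way sketch above */

static uint8_t lru_way[SETS];    /* which way was least recently used */

/* On a hit or fill of 'way', the other way becomes LRU. */
void touch(uint32_t set, int way) {
    lru_way[set] = (uint8_t)(1 - way);
}

/* On a miss, replace the LRU way. */
int victim(uint32_t set) {
    return lru_way[set];
}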
2-way Set-Associative Cache Operation
• 8B cache, 2 ways, 2B blocks
• Figure out number of sets: 2 ((capacity / ways) / block size)
• Figure out how the address splits into offset/index/tag bits
• Offset: least-significant log2(block size) = log2(2) = 1 bit
• Index: next log2(number of sets) = log2(2) = 1 bit
• Tag: rest = 8 – 1 – 1 = 6 bits
• Address layout: tag (6 bits) | index (1 bit) | block offset (1 bit)
Associativity and Performance
• Higher-associativity caches
+ Have better (lower) %miss
• Diminishing returns
– However, t_access increases
• The more associative, the slower
• What about t_avg?
[Plot: %miss vs. associativity — diminishing returns]
• Block size and number of sets should be powers of two
• Makes indexing easier (just rip bits out of the address)
• 3-way set-associativity? No problem
Cache Glossary
[Labeled diagram of cache terminology: data array, block/line, block offset/displacement, …]
What About Stores? Handling Cache Writes
Write Issues
• So far we have looked at reading from the cache
• Instruction fetches, loads
• What about writing into cache?
• Several new issues
• Tag/data access
• Write-through vs. write-back
• Write-allocate vs. write-not-allocate
• Hiding write miss latency
Tag/Data Access
• Reads: read tag and data in parallel
• Tag mismatch → data is wrong (OK, just stall until good data arrives)
• Writes: read tag, write data in parallel? No! Why not?
• Tag mismatch → clobbered data (oops!)
• For associative caches, which way was written into?
• Writes are a pipelined, two-step (multi-cycle) process
• Step 1: match tag
• Step 2: write to matching way
• Bypass (with address check) to avoid load stalls
• May introduce structural hazards
Write Propagation
• When to propagate new value to lower-level caches/memory?
• Option #1: write-through: immediately
• On a hit, update the cache
• Immediately send the write to the next level
• Option #2: write-back: when the block is replaced
• Now we have multiple versions of the same block in various caches and in memory!
• Requires an additional “dirty” bit per block
• Evict clean block: no extra traffic
• There was only one version of the block
• Evict dirty block: extra “writeback” of block
• The dirty block is the most up-to-date version
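
A hedged sketch of the dirty-bit bookkeeping for a write-back cache (structure and names are illustrative, not the slides’ hardware):

#include <stdbool.h>
#include <stdint.h>

struct wb_frame { bool valid, dirty; uint32_t tag, data; };

/* Store hit: update the block in place and mark it dirty;
   propagation to the next level is deferred until eviction. */
void store_hit(struct wb_frame *f, uint32_t value) {
    f->data  = value;
    f->dirty = true;
}

/* Eviction: only dirty blocks generate writeback traffic. */
void evict(struct wb_frame *f) {
    if (f->valid && f->dirty) {
        /* write f->data back to the next level here */
    }
    f->valid = f->dirty = false;
}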
Write-back Cache Operation
• Each cache block has a dirty bit associated with it
• State is either clean or dirty
[Diagram: a block’s valid/dirty state — initial state: invalid; after ld r0 <= [A]: clean and valid; after st r1 => [A]: dirty]
Write-backs across caches
• When a dirty block is evicted to a lower-level cache, it remains dirty
• Writing a block back to memory cleanses it
• There are never dirty blocks in memory, only in caches
[Diagram: a dirty block evicted from L1 stays dirty in L2; when evicted from L2 to memory, the block is cleansed — memory needs no dirty/valid bits or tags]
Optimizing Writebacks
+ Writeback buffer (WBB):
• Hides the latency of writebacks (keeps them off the critical path)
• Step #1: send “fill” request to the next level
• Step #2: while waiting, write the dirty block to the buffer
• Step #3: when the new block arrives, put it into the cache
• Step #4: write buffer contents to the next level
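
The four steps as straight-line code in C (the next-level functions are illustrative stubs, not a real API; in hardware steps 3 and 4 happen asynchronously):

#include <stdbool.h>
#include <stdint.h>

struct wbb { bool full; uint32_t addr, data; };

/* stubs standing in for the next level of the hierarchy */
static void send_fill_request(uint32_t addr) { (void)addr; }
static void write_next_level(uint32_t addr, uint32_t data) { (void)addr; (void)data; }

void miss_with_dirty_victim(struct wbb *buf, uint32_t miss_addr,
                            uint32_t victim_addr, uint32_t victim_data) {
    send_fill_request(miss_addr);            /* step 1: request the new block first  */
    buf->full = true;                        /* step 2: park dirty victim in the WBB */
    buf->addr = victim_addr;
    buf->data = victim_data;
    /* step 3: when the fill returns, install the new block in the cache */
    write_next_level(buf->addr, buf->data);  /* step 4: drain buffer to next level   */
    buf->full = false;
}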
Write Propagation Comparison
• Write-through
– Requires additional bus bandwidth