CSE 371 Computer Organization and Design
Unit 8: Caches
Slides by Profs. , ,
C.J. Taylor and
P&H Chapter 5
5.1-5.3, 5.5
Appendix C.9
The Problem
Using current technologies, memories can be either large or fast, but not both.
The size or capacity of a memory refers to the number of bits it can store
The speed of a memory typically refers to the amount of time it takes to access a stored entry: the delay between asking for the value stored at an address and receiving the result.
Memories (SRAM & DRAM)
Types of Memory
Static RAM (SRAM)
6 or 8 transistors per bit
Two inverters (4 transistors) + transistors for reading/writing
Optimized for speed (first) and density (second)
Fast (sub-nanosecond latencies for small SRAM)
Access latency roughly proportional to sqrt(number of bits), i.e., the side length of the array
Mixes well with standard processor logic
Dynamic RAM (DRAM)
1 transistor + 1 capacitor per bit
Optimized for density (in terms of cost per bit)
Slow (>30ns internal access, ~50ns pin-to-pin)
Different fabrication steps (does not mix well with logic)
Nonvolatile storage: Magnetic disk, NAND Flash, Phase-change memory, …
PCM == Intel 3D XPoint?
A bit stored using a pass transistor and a capacitor.
DRAM Array
Showing a 4Megabit array where the information is stored in a 2D grid.
A bit is accessed by fetching the correct row and then selecting the bit in the column.
Note that once you fetch a row, it's sitting there in the row latch, so you can clock out the entire row.
Memory & Storage Technologies
Cost – what can $200 buy (2009)?
SRAM: 16MB
DRAM: 4,000MB (4GB) – 250x cheaper than SRAM
Flash: 64,000MB (64GB) – 16x cheaper than DRAM
Disk: 2,000,000MB (2TB) – 32x vs. Flash (512x vs. DRAM)
Latency
SRAM: <1 to 2ns (on chip)
DRAM: ~50ns – 100x or more slower than SRAM
Flash: 75,000ns (75 microseconds) – 1500x vs. DRAM
Disk: 10,000,000ns (10ms) – 133x vs Flash (200,000x vs DRAM)
Bandwidth
SRAM: 300GB/sec (e.g., 12-port 8-byte register file @ 3GHz)
DRAM: ~25GB/s
Flash: 0.25GB/s (250MB/s), 100x less than DRAM
Disk: 0.1 GB/s (100MB/s), 250x vs DRAM, sequential access only
Memory Technology Trends
c/o O’Hallaron, Ganger, Kesden, CMU 15-213 / 18-213
Locality of Memory Technologies
For many (all?) memory technologies, it may take a long time to get the first element out of the memory, but you can fetch a lot of data in the vicinity very quickly
It is usually a good idea to 'buy in bulk' by fetching a lot of elements in the vicinity rather than just one.
DDR RAM transfers data on both the rising and falling clock edges.
The Costco phenomenon: buy a lot of toilet paper at once.
The Memory Hierarchy
Big Picture Motivation
Processor can compute only as fast as memory
A 3GHz processor can execute an “add” operation in 0.33ns
Today's “Main memory” latency is more than 33ns
Naïve implementation: loads/stores can be 100x slower than other operations
Unobtainable goal:
Memory that operates at processor speeds
Memory as large as needed for all running programs
Memory that is cost effective
Can't achieve all of these goals at once
Known From the Beginning
“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”
Burks, Goldstine, and von Neumann
“Preliminary discussion of the logical design of an electronic computing instrument”
IAS memo 1946
This Unit: Caches
“Cache”: hardware managed
Hardware automatically retrieves missing data
Built from fast SRAM, usually on-chip today
In contrast to off-chip, DRAM “main memory”
Cache organization
Speed vs. Capacity
Miss classification
Some example performance calculations
Key Observation: Locality
Locality of memory references
Empirical property of real-world programs, few exceptions
Temporal locality
Recently referenced data is likely to be referenced again soon
Reactive: “cache” recently used data in small, fast memory
Spatial locality
More likely to reference data near recently referenced data
Proactive: “cache” large chunks of data to include nearby data
Both properties hold for data and instructions
Cache: finite-sized hashtable of recently used data blocks
In hardware, transparent to software
Spatial and Temporal Locality Example
Which memory accesses demonstrate spatial locality?
Which memory accesses demonstrate temporal locality?
int sum = 0;
int X[1000];
for(int c = 0; c < 1000; c++){
  sum += X[c];
}
Reason for spatial and temporal locality: code is fetched sequentially and has lots of loops, so instruction fetches are very patterned.
Data accesses usually are too, but there are situations where this isn't the case, and you can write code that makes this much worse.
Library Analogy
Consider books in a library
Library has lots of books, but it is slow to access
Far away (time to walk to the library)
Big (time to walk within the library)
How can you avoid these latencies?
Check out books, take them home with you
Put them on desk, on bookshelf, etc.
But desks & bookshelves have limited capacity
Keep recently used books around (temporal locality)
Grab books on related topic at the same time (spatial locality)
Guess what books you'll need in the future (prefetching)
Library Analogy Explained
Registers ≈ books on your desk
Actively being used, small capacity
Caches ≈ bookshelves
Moderate capacity, pretty fast to access
Main memory ≈ library
Big, holds almost all data, but slow
Disk (virtual memory) ≈ inter-library loan
Very slow, but hopefully really uncommon
Exploiting Locality: Memory Hierarchy
Hierarchy of memory components
Upper components
Fast, small, expensive
Lower components
Slow, big, cheap
Connected by “buses”
Which also have latency and bandwidth issues
Most frequently accessed data in M1
M1 + next most frequently accessed in M2, etc.
Move data up-down hierarchy
Optimize average access time
latency_avg = latency_hit + (%miss * latency_miss)
Attack each component
Can't decrease the miss latency by much, so we need to amortize it over many hits per miss
Concrete Memory Hierarchy
0th level: Registers
1st level: Primary caches
Split instruction (I$) and data (D$)
Typically 8KB to 64KB each
2nd level: second- and third-level caches (L2, L3)
On-chip, typically made of SRAM
2nd level typically ~256KB to 512KB
“Last level cache” typically 4MB to 16MB
3rd level: main memory
Made of DRAM (“Dynamic” RAM)
Typically 1GB to 4GB for desktops/laptops
Servers can have 100s of GB
4th level: disk (swap and files)
Uses magnetic disks or flash drives
Evolution of Cache Hierarchies
Intel Core i7 (quad core)
Chips today are 30–70% cache by area
Logical Cache Organization
Cache is a hardware hashtable
32-bit ISA → 4G words/addresses, 2^32 B address space
Logical cache organization
4KB, organized as 1K 4B blocks (aka lines)
Each block can hold a 4-byte word
Physical cache implementation
1K-entry (1024 rows) by 4B SRAM
Called data array
10-bit address input
32-bit data input/output
Looking Up A Block
Q: which 10 of the 32 address bits to use?
A: bits [11:2]
2 least significant (LS) bits [1:0] are the offset bits
Locate byte within word
Don't need these to locate word
Next 10 LS bits [11:2] are the index bits
These locate the word
Nothing says index must be these bits
But these work best in practice
Why? (think about it)
Direct Mapped Cache
In a direct mapped cache every address in main memory is mapped to a unique 'parking space' in the cache.
The easiest way to do this is by considering the low order bits of the address. Those define which cache block you are assigned to.
Note that this means that there will be inevitable conflicts just due to shared cache blocks even if the rest of the cache is empty.
Is this the block you’re looking for?
Each cache row corresponds to 2^20 blocks
How to know which if any is currently there?
Tag each cache word with remaining address bits [31:12]
Build separate and parallel tag array
1K by 21-bit SRAM
20-bit (next slide) tag + 1 valid bit
Lookup algorithm
Read tag indicated by index bits
If tag matches & valid bit set:
then: Hit → data is good
else: Miss → data is garbage, wait…
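A minimal C sketch of this lookup, assuming the 4KB, 4B-block organization above (1024 frames, 2 offset bits, 10 index bits, 20-bit tags); the array and function names are illustrative, not from any real design:

#include <stdint.h>
#include <stdbool.h>

#define NUM_FRAMES 1024                      /* 4KB cache / 4B blocks */

static uint32_t data_array[NUM_FRAMES];      /* one 4B word per frame */
static uint32_t tag_array[NUM_FRAMES];       /* 20-bit tag per frame  */
static bool     valid[NUM_FRAMES];

/* Returns true on a hit and writes the word to *out; on a miss the
   cache controller must fetch the block and fill the frame. */
bool dm_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t index = (addr >> 2) & 0x3FF;    /* bits [11:2]  */
    uint32_t tag   = addr >> 12;             /* bits [31:12] */
    if (valid[index] && tag_array[index] == tag) {
        *out = data_array[index];            /* hit: data is good */
        return true;
    }
    return false;                            /* miss: data is garbage */
}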
Because we have an offset of 2 bits and an index of 10 bits, there are 20 bits left, so 2^20 different addresses can have the same index bits
A Concrete Example
Lookup address x000C14B8
Index = addr [11:2] = (addr >> 2) & x3FF = x12E
Tag = addr [31:12] = (addr >> 12) = x000C1
Address x000C14B8 in binary: 0000 0000 0000 1100 0001 | 0100 1011 10 | 00  (tag | index | offset)
Direct mapped cache – every address in memory has one designated parking spot in the cache. Note that multiple addresses share the same parking spot so only one can be in the cache at a time.
Handling a Cache Miss
What if requested data isn’t in the cache?
How does it get in there?
Cache controller: finite state machine
Remembers miss address
Accesses next level of memory
Waits for response
Writes data/tag into proper locations
Bringing a missing block into the cache is a cache fill
Cache Misses and Pipeline Stalls
I$ and D$ misses stall pipeline just like data hazards
Stall logic driven by miss signal
Cache “logically” re-evaluates hit/miss every cycle
Block is filled → miss signal goes low → pipeline restarts
Cache Performance Equation
For a cache
Access: read or write to cache
Hit: desired data found in cache
Miss: desired data not found in cache
Must get from another component
No notion of “miss” in register file
Fill: action of placing data into cache
%miss (miss-rate): #misses / #accesses
t_access: time to check cache. If hit, we're done.
t_miss: time to read data into cache
Performance metric: average access time
t_avg = t_access + (%miss * t_miss)
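For example (hypothetical numbers): with t_access = 1 cycle, %miss = 5%, and t_miss = 20 cycles, t_avg = 1 + 0.05 * 20 = 2 cycles.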
CPI Calculation with Cache Misses
Parameters
Simple pipeline with base CPI of 1
Instruction mix: 30% loads/stores
I$: %miss = 2%, t_miss = 10 cycles
D$: %miss = 10%, t_miss = 10 cycles
What is new CPI?
CPI_I$ = %miss_I$ * t_miss = 0.02 * 10 cycles = 0.2 cycles
CPI_D$ = %load/store * %miss_D$ * t_miss_D$ = 0.3 * 0.1 * 10 cycles = 0.3 cycles
CPI_new = CPI + CPI_I$ + CPI_D$ = 1 + 0.2 + 0.3 = 1.5
Note the whopping increase in CPI – 50%
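The same arithmetic in a few lines of C (a sketch; the variable names are made up):

#include <stdio.h>

int main(void)
{
    double base_cpi   = 1.0;
    double mem_frac   = 0.30;                      /* 30% loads/stores       */
    double imiss_rate = 0.02, dmiss_rate = 0.10;
    double t_miss     = 10.0;                      /* miss penalty in cycles */

    double cpi_i = imiss_rate * t_miss;            /* 0.2 cycles */
    double cpi_d = mem_frac * dmiss_rate * t_miss; /* 0.3 cycles */
    printf("CPI = %.1f\n", base_cpi + cpi_i + cpi_d);  /* prints 1.5 */
    return 0;
}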
Measuring Cache Performance
Ultimate metric is tavg
Cache capacity and circuits roughly determine t_access
Lower-level memory structures determine t_miss
Measure %miss
Hardware performance counters
Simulation
Cache operation: 1B block
8-bit addresses → 256B memory
Keeps diagrams simple
4B cache, 1B blocks
Figure out number of sets: 4 (capacity / block-size)
Figure out how address splits into offset/index/tag bits
Offset: least-significant log2(block-size) = log2(1) = 0
Index: next log2(number-of-sets) = log2(4) = 2
Tag: rest = 8 – 2 = 6
Address split: tag (6 bits) | index (2 bits)
Multi-Word Cache Blocks
In most modern implementations, each cache block stores data from more than one address (more than 1 byte per block).
The number of bytes or words stored in each cache block is referred to as the block size.
The entries in each block come from a contiguous set of addresses to exploit locality of reference, and to simplify indexing
Cache blocks are also referred to as cache lines
Related to cache frames – a frame is the bucket, and the block/line is the data that goes in the bucket
blocks/lines move around due to fills & evictions
frames are part of the cache structure and never move
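A minimal C sketch of the block arithmetic, assuming a hypothetical 64B block size: every address with the same block base lives in the same block, which is how a single fill captures nearby data. For example, addresses 0x1000 through 0x103F all share block base 0x1000.

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per block (assumed for illustration) */

static inline uint32_t block_base(uint32_t addr)   { return addr & ~(uint32_t)(BLOCK_SIZE - 1); }
static inline uint32_t block_offset(uint32_t addr) { return addr &  (uint32_t)(BLOCK_SIZE - 1); }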
Cache operation: 2B blocks
8B cache, 2B blocks
Figure out number of sets: 4 (capacity / block-size)
Figure out how address splits into offset/index/tag bits
Offset: least-significant log2(block-size) = log2(2) = 1
Index: next log2(number-of-sets) = log2(4) = 2
Tag: rest = 8 – 1 – 2 = 5
Address split: tag (5 bits) | index (2 bits) | block offset (1 bit)
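These splits can be computed mechanically; a small C sketch for the 8-bit-address, 8B-cache, 2B-block case above (the log2u helper is made up for illustration):

#include <stdio.h>

static unsigned log2u(unsigned x) {              /* log2 of a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned addr_bits  = 8;                     /* 8-bit addresses, 256B memory    */
    unsigned capacity   = 8, block_size = 2;     /* 8B cache, 2B blocks             */
    unsigned sets       = capacity / block_size; /* direct-mapped: 1 block per set  */

    unsigned offset_bits = log2u(block_size);                     /* 1 */
    unsigned index_bits  = log2u(sets);                           /* 2 */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;  /* 5 */
    printf("offset=%u index=%u tag=%u\n", offset_bits, index_bits, tag_bits);
    return 0;
}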
4-bit Address, 8B Cache, 2B Blocks
[Figure: cache contents (Set | Tag | bytes at offsets 0 and 1), the 4-bit address split tag (1 bit) | index (2 bits) | offset (1 bit), and main memory contents]
In this cache, what does the offset bit tell us?
Load: 1110 → Miss
Q is filled into the cache as well even though we didn’t ask for it. Why?
Go through how you find stuff in the cache by figuring out which cache block to look at and which entry within the cache block
Capacity and Performance
Simplest way to reduce %miss: increase capacity
Miss rate decreases monotonically
“Working set”: insns/data program is actively using
Diminishing returns
However, t_access increases
Latency roughly proportional to sqrt(capacity)
Given capacity, manipulate %miss by changing organization
[Figure: %miss vs. cache capacity, flattening out around the program's "working set" size]
Block Size
Given capacity, manipulate %miss by changing organization
One option: increase block size
Exploit spatial locality
Notice: index/offset bits change
Tag remains the same
Ramifications
Reduce %miss (up to a point)
Reduce tag overhead (why?)
Potentially useless data transfer
Premature replacement of useful data
[Figure: data array drawn as 512 frames of 512 bits each, with the block size indicated]
Block Size and Tag Overhead
4KB cache with 1024 4B blocks?
4B blocks → 2-bit offset, 1024 frames → 10-bit index
32-bit address – 2-bit offset – 10-bit index = 20-bit tag
20-bit tag / 32-bit block = 63% overhead
4KB cache with 512 8B blocks
8B blocks → 3-bit offset, 512 frames → 9-bit index
32-bit address – 3-bit offset – 9-bit index = 20-bit tag
20-bit tag / 64-bit block = ~31% overhead
Notice: tag size is same, but data size is twice as big
A realistic example: 64KB cache with 64B blocks
16-bit tag / 512-bit block = ~3% overhead
Note: Tags are not optional
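A quick C check of these overhead numbers (illustrative only; it counts tag bits against data bits and ignores the valid bit):

#include <stdio.h>

int main(void)
{
    struct { unsigned tag_bits, block_bytes; } cfg[] = {
        {20, 4}, {20, 8}, {16, 64}       /* the three configurations above */
    };
    for (int i = 0; i < 3; i++) {
        double overhead = 100.0 * cfg[i].tag_bits / (8.0 * cfg[i].block_bytes);
        printf("%u-bit tag / %u-bit block: %.1f%% overhead\n",
               cfg[i].tag_bits, 8 * cfg[i].block_bytes, overhead);
    }
    return 0;
}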
4-bit Address, 8B Cache, 4B Blocks
Address split: tag (1 bit) | index (1 bit) | offset (2 bits)
Cache contents before the access (Set | Tag | bytes at offsets 00, 01, 10, 11):
Set 0: tag 0, data A B C D
Set 1: tag 0, data E F G H
Load: 1110 → Miss
Cache contents after the fill:
Set 0: tag 0, data A B C D
Set 1: tag 1, data M N P Q
Effect of Block Size on Miss Rate
Two effects on miss rate
Spatial prefetching (good)
For blocks with adjacent addresses
Turns miss/miss into miss/hit pairs
Interference (bad)
For blocks with non-adjacent addresses (but in adjacent frames)
Turns hits into misses by disallowing simultaneous residence
Consider entire cache as one big block
Both effects always present
Spatial prefetching dominates initially
Depends on size of the cache
Good block size is 32–256B
Program dependent
[Figure: %miss as a function of block size]
Block Size and Miss Penalty
Does increasing block size increase t_miss?
Don’t larger blocks take longer to read, transfer, and fill?
They do, but…
t_miss of an isolated miss is not affected
Critical Word First / Early Restart (CWF/ER)
Requested word fetched first, pipeline restarts immediately
Remaining words in block transferred/filled in the background
t_miss of a cluster of misses will suffer
Reads/transfers/fills of two misses can’t happen at the same time
Latencies can start to pile up
This is a bandwidth problem
Cache Conflicts
[Figure: 8B cache with 2B blocks, the address split tag (1 bit) | index (2 bits) | offset (1 bit), and main memory contents]
Pairs like “0010” and “1010” conflict
Same index!
Can such pairs simultaneously reside in the cache?
A: Yes, if we reorganize cache to do so
What does a hash-table do in this situation?
Adding Associativity to Caches
Instead of designated parking slots you just park in any one that is open – fully associative. Making this happen is tricky and costs hardware but we can do it.
A halfway house is a set-associative cache, where the cache is divided into regions but within a region you can park anywhere you want; think parking zones keyed to license plates ending in 0, 1, 2, 3.
Associativity
Set-associativity
Block can reside in one of few frames
Frame groups called sets
Each frame in set called a way
This is 2-way set-associative (SA)
1-way → direct-mapped (DM)
1-set → fully-associative (FA)
Reduces conflicts
Increases latency_hit:
additional tag match & muxing
Equivalent to using a linked list to chain entries in a hash table (but limited to 2 links)
Now you need to check every car in the set to see if yours is there. That requires hardware, which makes full associativity tricky and expensive.
Associativity
Lookup algorithm
Use index bits to find set
Read data/tags in all frames in parallel
Any (match and valid bit) → Hit
Notice tag/index/offset bits
Only 9-bit index (versus 10-bit for direct mapped)
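A C sketch of the 2-way lookup under this organization (4B blocks, 512 sets, 9-bit index); the loop stands in for what the hardware does with parallel comparators:

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 512
#define WAYS     2

static uint32_t tags[NUM_SETS][WAYS];    /* illustrative arrays; really parallel SRAMs */
static uint32_t data[NUM_SETS][WAYS];
static bool     valid[NUM_SETS][WAYS];

bool sa_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t index = (addr >> 2) & (NUM_SETS - 1);   /* 9 index bits                   */
    uint32_t tag   = addr >> 11;                     /* 2 offset + 9 index bits removed */
    for (int way = 0; way < WAYS; way++) {
        if (valid[index][way] && tags[index][way] == tag) {
            *out = data[index][way];
            return true;                             /* hit  */
        }
    }
    return false;                                    /* miss */
}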
Replacement Policies
Set-associative caches present a new design choice
On cache miss, which block in set to replace (kick out)?
Some options
FIFO (first-in first-out)
LRU (least recently used)
Fits with temporal locality, LRU = least likely to be used in future
NMRU (not most recently used)
An easier to implement approximation of LRU
NMRU is the same as LRU for 2-way set-associative caches
Belady’s: replace block that will be used furthest in future
Unachievable optimum
LRU and Miss Handling
Add LRU field to each set
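For 2-way sets a single LRU bit per set suffices; a minimal C sketch (names are illustrative):

#include <stdint.h>

#define NUM_SETS 512
static uint8_t lru_way[NUM_SETS];          /* names the least recently used way (0 or 1) */

void lru_touch(uint32_t set, uint32_t way) /* call on every hit or fill      */
{
    lru_way[set] = way ^ 1u;               /* the other way is now LRU       */
}

uint32_t lru_victim(uint32_t set)          /* which way to replace on a miss */
{
    return lru_way[set];
}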