CSE 371 Computer Organization and Design
Unit 8: Caches
Slides by Profs. , ,
C.J. Taylor and
P&H Chapter 5
5.1-5.3, 5.5
Appendix C.9
The Problem
Using current technologies, memories can be either large or fast, but not both.
The size or capacity of a memory refers to the number of bits it can store
The speed of a memory typically refers to the amount of time it takes to access a stored entry: the delay between asking for the value stored at an address and receiving the result.
Memories (SRAM & DRAM)
Types of Memory
Static RAM (SRAM)
6 or 8 transistors per bit
Two inverters (4 transistors) + transistors for reading/writing
Optimized for speed (first) and density (second)
Fast (sub-nanosecond latencies for small SRAM)
Access latency roughly proportional to sqrt(number of bits), i.e., the side length of the array
Mixes well with standard processor logic
Dynamic RAM (DRAM)
1 transistor + 1 capacitor per bit
Optimized for density (in terms of cost per bit)
Slow (>30ns internal access, ~50ns pin-to-pin)
Different fabrication steps (does not mix well with logic)
Nonvolatile storage: Magnetic disk, NAND Flash, Phase-change memory, …
PCM == Intel 3D XPoint?
A bit stored using a pass transistor and a capacitor.
DRAM Array
Showing a 4Megabit array where the information is stored in a 2D grid.
A bit is accessed by fetching the correct row and then selecting the bit in the column.
Note that once you fetch a row, it's sitting there in the row latch, so you can clock out the entire row.
Memory & Storage Technologies
Cost – what can $200 buy (2009)?
SRAM: 16MB
DRAM: 4,000MB (4GB) – 250x cheaper than SRAM
Flash: 64,000MB (64GB) – 16x cheaper than DRAM
Disk: 2,000,000MB (2TB) – 32x vs. Flash (512x vs. DRAM)
Latency
SRAM: <1 to 2ns (on chip)
DRAM: ~50ns – 100x or more slower than SRAM
Flash: 75,000ns (75 microseconds) – 1500x vs. DRAM
Disk: 10,000,000ns (10ms) – 133x vs Flash (200,000x vs DRAM)
Bandwidth
SRAM: 300GB/sec (e.g., 12-port 8-byte register file @ 3GHz)
DRAM: ~25GB/s
Flash: 0.25GB/s (250MB/s), 100x less than DRAM
Disk: 0.1 GB/s (100MB/s), 250x vs DRAM, sequential access only
Memory Technology Trends
c/o O’Hallaron, Ganger, Kesden, CMU 15-213 / 18-213
Locality of Memory Technologies
For many (all?) memory technologies, it may take a long time to get the first element out of the memory, but you can fetch a lot of data in the vicinity very quickly
It is usually a good idea to 'buy in bulk' by fetching a lot of elements in the vicinity rather than just one.
DDR RAM transfers data on both the rising and falling clock edges.
The Costco phenomenon: buy a lot of toilet paper at once.
The Memory Hierarchy
Big Picture Motivation
Processor can compute only as fast as memory
A 3GHz processor can execute an “add” operation in 0.33ns
Today's “Main memory” latency is more than 33ns
Naïve implementation: loads/stores can be 100x slower than other operations
Unobtainable goal:
Memory that operates at processor speeds
Memory as large as needed for all running programs
Memory that is cost effective
Can't achieve all of these goals at once
Known From the Beginning
“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”
Burks, Goldstine, and von Neumann
“Preliminary discussion of the logical design of an electronic computing instrument”
IAS memo 1946
This Unit: Caches
“Cache”: hardware managed
Hardware automatically retrieves missing data
Built from fast SRAM, usually on-chip today
In contrast to off-chip, DRAM “main memory”
Cache organization
Speed vs. Capacity
Miss classification
Some example performance calculations
Key Observation: Locality
Locality of memory references
Empirical property of real-world programs, few exceptions
Temporal locality
Recently referenced data is likely to be referenced again soon
Reactive: “cache” recently used data in small, fast memory
Spatial locality
More likely to reference data near recently referenced data
Proactive: “cache” large chunks of data to include nearby data
Both properties hold for data and instructions
Cache: finite-sized hashtable of recently used data blocks
In hardware, transparent to software
Spatial and Temporal Locality Example
Which memory accesses demonstrate spatial locality?
Which memory accesses demonstrate temporal locality?
int sum = 0;
int X[1000];
for(int c = 0; c < 1000; c++){
  sum += X[c];
}
Reason for spatial and temporal locality: code is fetched sequentially and has lots of loops, so instruction fetches are very patterned.
Data accesses usually are too, but there are situations where this isn't the case, and you can write code that makes this much worse.
Library Analogy
Consider books in a library
Library has lots of books, but it is slow to access
Far away (time to walk to the library)
Big (time to walk within the library)
How can you avoid these latencies?
Check out books, take them home with you
Put them on desk, on bookshelf, etc.
But desks & bookshelves have limited capacity
Keep recently used books around (temporal locality)
Grab books on related topic at the same time (spatial locality)
Guess what books you'll need in the future (prefetching)
Library Analogy Explained
Registers ≈ books on your desk
Actively being used, small capacity
Caches ≈ bookshelves
Moderate capacity, pretty fast to access
Main memory ≈ library
Big, holds almost all data, but slow
Disk (virtual memory) ≈ inter-library loan
Very slow, but hopefully really uncommon
Exploiting Locality: Memory Hierarchy
Hierarchy of memory components
Upper components
Fast, small, expensive
Lower components
Slow, big, cheap
Connected by “buses”
Which also have latency and bandwidth issues
Most frequently accessed data in M1
M1 + next most frequently accessed in M2, etc.
Move data up-down hierarchy
Optimize average access time
latency_avg = latency_hit + (%miss * latency_miss)
Attack each component
Can't decrease the miss latency by much, so we need to amortize it over many hits per miss
Concrete Memory Hierarchy
0th level: Registers
1st level: Primary caches
Split instruction (I$) and data (D$)
Typically 8KB to 64KB each
2nd level: second- and third-level caches (L2, L3)
On-chip, typically made of SRAM
2nd level typically ~256KB to 512KB
“Last level cache” typically 4MB to 16MB
3rd level: main memory
Made of DRAM (“Dynamic” RAM)
Typically 1GB to 4GB for desktops/laptops
Servers can have 100s of GB
4th level: disk (swap and files)
Uses magnetic disks or flash drives
Evolution of Cache Hierarchies
Intel Core i7 (quad core)
Chips today are 30–70% cache by area
Logical Cache Organization
Cache is a hardware hashtable
32-bit ISA → 4G words/addresses, 2^32 B address space
Logical cache organization
4KB, organized as 1K 4B blocks (aka lines)
Each block can hold a 4-byte word
Physical cache implementation
1K-entry (1024 rows) by 4B SRAM
Called data array
10-bit address input
32-bit data input/output
Looking Up A Block
Q: which 10 of the 32 address bits to use?
A: bits [11:2]
2 least significant (LS) bits [1:0] are the offset bits
Locate byte within word
Don't need these to locate word
Next 10 LS bits [11:2] are the index bits
These locate the word
Nothing says index must be these bits
But these work best in practice
Why? (think about it)
Direct Mapped Cache
In a direct mapped cache every address in main memory is mapped to a unique 'parking space' in the cache.
The easiest way to do this is by considering the low order bits of the address. Those define which cache block you are assigned to.
Note that this means that there will be inevitable conflicts just due to shared cache blocks even if the rest of the cache is empty.
Is this the block you’re looking for?
Each cache row corresponds to 2^20 blocks
How to know which if any is currently there?
Tag each cache word with remaining address bits [31:12]
Build separate and parallel tag array
1K by 21-bit SRAM
20-bit (next slide) tag + 1 valid bit
Lookup algorithm
Read tag indicated by index bits
If tag matches & valid bit set:
then: Hit → data is good
else: Miss → data is garbage, wait…
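A minimal C sketch of this lookup, assuming the 4KB, 4B-block organization above (1024 frames, 2 offset bits, 10 index bits, 20-bit tags); the array and function names are illustrative, not from any real design:

#include <stdint.h>
#include <stdbool.h>

#define NUM_FRAMES 1024                      /* 4KB cache / 4B blocks */

static uint32_t data_array[NUM_FRAMES];      /* one 4B word per frame */
static uint32_t tag_array[NUM_FRAMES];       /* 20-bit tag per frame  */
static bool     valid[NUM_FRAMES];

/* Returns true on a hit and writes the word to *out; on a miss the
   cache controller must fetch the block and fill the frame. */
bool dm_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t index = (addr >> 2) & 0x3FF;    /* bits [11:2]  */
    uint32_t tag   = addr >> 12;             /* bits [31:12] */
    if (valid[index] && tag_array[index] == tag) {
        *out = data_array[index];            /* hit: data is good */
        return true;
    }
    return false;                            /* miss: data is garbage */
}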
Because we have an offset of 2 bits and an index of 10 bits, there are 20 bits left, so 2^20 different addresses can have the same index bits
A Concrete Example
Lookup address x000C14B8
Index = addr [11:2] = (addr >> 2) & x3FF = x12E
Tag = addr [31:12] = (addr >> 12) = x000C1
Address x000C14B8 in binary: 0000 0000 0000 1100 0001 | 0100 1011 10 | 00  (tag | index | offset)
Direct mapped cache – every address in memory has one designated parking spot in the cache. Note that multiple addresses share the same parking spot so only one can be in the cache at a time.
Handling a Cache Miss
What if requested data isn’t in the cache?
How does it get in there?
Cache controller: finite state machine
Remembers miss address
Accesses next level of memory
Waits for response
Writes data/tag into proper locations
Bringing a missing block into the cache is a cache fill
Cache Misses and Pipeline Stalls
I$ and D$ misses stall pipeline just like data hazards
Stall logic driven by miss signal
Cache “logically” re-evaluates hit/miss every cycle
Block is filled → miss signal goes low → pipeline restarts
Cache Performance Equation
For a cache
Access: read or write to cache
Hit: desired data found in cache
Miss: desired data not found in cache
Must get from another component
No notion of “miss” in register file
Fill: action of placing data into cache
%miss (miss-rate): #misses / #accesses
t_access: time to check cache. If hit, we're done.
t_miss: time to read data into cache
Performance metric: average access time
t_avg = t_access + (%miss * t_miss)
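For example (hypothetical numbers): with t_access = 1 cycle, %miss = 5%, and t_miss = 20 cycles, t_avg = 1 + 0.05 * 20 = 2 cycles.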
CPI Calculation with Cache Misses
Parameters
Simple pipeline with base CPI of 1
Instruction mix: 30% loads/stores
I$: %miss = 2%, t_miss = 10 cycles
D$: %miss = 10%, t_miss = 10 cycles
What is new CPI?
CPI_I$ = %miss_I$ * t_miss = 0.02 * 10 cycles = 0.2 cycles
CPI_D$ = %load/store * %miss_D$ * t_miss_D$ = 0.3 * 0.1 * 10 cycles = 0.3 cycles
CPI_new = CPI + CPI_I$ + CPI_D$ = 1 + 0.2 + 0.3 = 1.5
Note the whopping increase in CPI – 50%
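The same arithmetic in a few lines of C (a sketch; the variable names are made up):

#include <stdio.h>

int main(void)
{
    double base_cpi   = 1.0;
    double mem_frac   = 0.30;                      /* 30% loads/stores       */
    double imiss_rate = 0.02, dmiss_rate = 0.10;
    double t_miss     = 10.0;                      /* miss penalty in cycles */

    double cpi_i = imiss_rate * t_miss;            /* 0.2 cycles */
    double cpi_d = mem_frac * dmiss_rate * t_miss; /* 0.3 cycles */
    printf("CPI = %.1f\n", base_cpi + cpi_i + cpi_d);  /* prints 1.5 */
    return 0;
}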
Measuring Cache Performance
Ultimate metric is tavg
Cache capacity and circuits roughly determine t_access
Lower-level memory structures determine t_miss
Measure %miss
Hardware performance counters
Simulation
Cache operation: 1B block
8-bit addresses → 256B memory
Keeps diagrams simple
4B cache, 1B blocks
Figure out number of sets: 4 (capacity / block-size)
Figure out how address splits into offset/index/tag bits
Offset: least-significant log2(block-size) = log2(1) = 0
Index: next log2(number-of-sets) = log2(4) = 2
Tag: rest = 8 – 2 = 6
Address split: tag (6 bits) | index (2 bits)
Multi-Word Cache Blocks
In most modern implementations, each cache block stores data from more than one address (more than 1 byte per block).
The number of bytes or words stored in each cache block is referred to as the block size.
The entries in each block come from a contiguous set of addresses to exploit locality of reference, and to simplify indexing
Cache blocks are also referred to as cache lines
Related to cache frames – a frame is the bucket, and the block/line is the data that goes in the bucket
blocks/lines move around due to fills & evictions
frames are part of the cache structure and never move
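A minimal C sketch of the block arithmetic, assuming a hypothetical 64B block size: every address with the same block base lives in the same block, which is how a single fill captures nearby data. For example, addresses 0x1000 through 0x103F all share block base 0x1000.

#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per block (assumed for illustration) */

static inline uint32_t block_base(uint32_t addr)   { return addr & ~(uint32_t)(BLOCK_SIZE - 1); }
static inline uint32_t block_offset(uint32_t addr) { return addr &  (uint32_t)(BLOCK_SIZE - 1); }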
Cache operation: 2B blocks
8B cache, 2B blocks
Figure out number of sets: 4 (capacity / block-size)
Figure out how address splits into offset/index/tag bits
Offset: least-significant log2(block-size) = log2(2) = 1
Index: next log2(number-of-sets) = log2(4) = 2
Tag: rest = 8 – 1 – 2 = 5
Address split: tag (5 bits) | index (2 bits) | block offset (1 bit)
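These splits can be computed mechanically; a small C sketch for the 8-bit-address, 8B-cache, 2B-block case above (the log2u helper is made up for illustration):

#include <stdio.h>

static unsigned log2u(unsigned x) {              /* log2 of a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned addr_bits  = 8;                     /* 8-bit addresses, 256B memory    */
    unsigned capacity   = 8, block_size = 2;     /* 8B cache, 2B blocks             */
    unsigned sets       = capacity / block_size; /* direct-mapped: 1 block per set  */

    unsigned offset_bits = log2u(block_size);                     /* 1 */
    unsigned index_bits  = log2u(sets);                           /* 2 */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;  /* 5 */
    printf("offset=%u index=%u tag=%u\n", offset_bits, index_bits, tag_bits);
    return 0;
}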
4-bit Address, 8B Cache, 2B Blocks
[Figure: cache contents (Set | Tag | bytes at offsets 0 and 1), the 4-bit address split tag (1 bit) | index (2 bits) | offset (1 bit), and main memory contents]
In this cache, what does the offset bit tell us?
Load: 1110 → Miss
Q is filled into the cache as well even though we didn’t ask for it. Why?
Go through how you find stuff in the cache by figuring out which cache block to look at and which entry within the cache block
Capacity and Performance
Simplest way to reduce %miss: increase capacity
Miss rate decreases monotonically
“Working set”: insns/data program is actively using
Diminishing returns
However, t_access increases
Latency roughly proportional to sqrt(capacity)
Given capacity, manipulate %miss by changing organization
[Figure: %miss vs. cache capacity, flattening out around the program's "working set" size]
Block Size
Given capacity, manipulate %miss by changing organization
One option: increase block size
Exploit spatial locality
Notice: index/offset bits change
Tag remains the same
Ramifications
Reduce %miss (up to a point)
Reduce tag overhead (why?)
Potentially useless data transfer
Premature replacement of useful data
[Figure: data array drawn as 512 frames of 512 bits each, with the block size indicated]
Block Size and Tag Overhead
4KB cache with 1024 4B blocks?
4B blocks → 2-bit offset, 1024 frames → 10-bit index
32-bit address – 2-bit offset – 10-bit index = 20-bit tag
20-bit tag / 32-bit block = 63% overhead
4KB cache with 512 8B blocks
8B blocks → 3-bit offset, 512 frames → 9-bit index
32-bit address – 3-bit offset – 9-bit index = 20-bit tag
20-bit tag / 64-bit block = ~31% overhead
Notice: tag size is same, but data size is twice as big
A realistic example: 64KB cache with 64B blocks
16-bit tag / 512-bit block = ~3% overhead
Note: Tags are not optional
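A quick C check of these overhead numbers (illustrative only; it counts tag bits against data bits and ignores the valid bit):

#include <stdio.h>

int main(void)
{
    struct { unsigned tag_bits, block_bytes; } cfg[] = {
        {20, 4}, {20, 8}, {16, 64}       /* the three configurations above */
    };
    for (int i = 0; i < 3; i++) {
        double overhead = 100.0 * cfg[i].tag_bits / (8.0 * cfg[i].block_bytes);
        printf("%u-bit tag / %u-bit block: %.1f%% overhead\n",
               cfg[i].tag_bits, 8 * cfg[i].block_bytes, overhead);
    }
    return 0;
}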
4-bit Address, 8B Cache, 4B Blocks
Address split: tag (1 bit) | index (1 bit) | offset (2 bits)
Cache contents before the access (Set | Tag | bytes at offsets 00, 01, 10, 11):
Set 0: tag 0, data A B C D
Set 1: tag 0, data E F G H
Load: 1110 → Miss
Cache contents after the fill:
Set 0: tag 0, data A B C D
Set 1: tag 1, data M N P Q
Effect of Block Size on Miss Rate
Two effects on miss rate
Spatial prefetching (good)
For blocks with adjacent addresses
Turns miss/miss into miss/hit pairs
Interference (bad)
For blocks with non-adjacent addresses (but in adjacent frames)
Turns hits into misses by disallowing simultaneous residence
Consider entire cache as one big block
Both effects always present
Spatial prefetching dominates initially
Depends on size of the cache
Good block size is 32–256B
Program dependent
[Figure: %miss as a function of block size]
Block Size and Miss Penalty
Does increasing block size increase t_miss?
Don’t larger blocks take longer to read, transfer, and fill?
They do, but…
t_miss of an isolated miss is not affected
Critical Word First / Early Restart (CWF/ER)
Requested word fetched first, pipeline restarts immediately
Remaining words in block transferred/filled in the background
t_miss of a cluster of misses will suffer
Reads/transfers/fills of two misses can’t happen at the same time
Latencies can start to pile up
This is a bandwidth problem
Cache Conflicts
[Figure: 8B cache with 2B blocks, the address split tag (1 bit) | index (2 bits) | offset (1 bit), and main memory contents]
Pairs like “0010” and “1010” conflict
Same index!
Can such pairs simultaneously reside in the cache?
A: Yes, if we reorganize cache to do so
What does a hash-table do in this situation?
Adding Associativity to Caches
Instead of designated parking slots you just park in any one that is open – fully associative. Making this happen is tricky and costs hardware but we can do it.
A halfway house is a set-associative cache, where the cache is divided into regions but within a region you can park anywhere you want; think parking zones keyed to license plates ending in 0, 1, 2, 3.
Associativity
Set-associativity
Block can reside in one of few frames
Frame groups called sets
Each frame in set called a way
This is 2-way set-associative (SA)
1-way → direct-mapped (DM)
1-set → fully-associative (FA)
Reduces conflicts
Increases latency_hit:
additional tag match & muxing
Equivalent to using a linked list to chain entries in a hash table (but limited to 2 links)
Now you need to check every car in the set to see if yours is there. That requires hardware, which makes full associativity tricky and expensive.
Associativity
Lookup algorithm
Use index bits to find set
Read data/tags in all frames in parallel
Any (match and valid bit) → Hit
Notice tag/index/offset bits
Only 9-bit index (versus 10-bit for direct mapped)
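A C sketch of the 2-way lookup under this organization (4B blocks, 512 sets, 9-bit index); the loop stands in for what the hardware does with parallel comparators:

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 512
#define WAYS     2

static uint32_t tags[NUM_SETS][WAYS];    /* illustrative arrays; really parallel SRAMs */
static uint32_t data[NUM_SETS][WAYS];
static bool     valid[NUM_SETS][WAYS];

bool sa_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t index = (addr >> 2) & (NUM_SETS - 1);   /* 9 index bits                   */
    uint32_t tag   = addr >> 11;                     /* 2 offset + 9 index bits removed */
    for (int way = 0; way < WAYS; way++) {
        if (valid[index][way] && tags[index][way] == tag) {
            *out = data[index][way];
            return true;                             /* hit  */
        }
    }
    return false;                                    /* miss */
}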
Replacement Policies
Set-associative caches present a new design choice
On cache miss, which block in set to replace (kick out)?
Some options
FIFO (first-in first-out)
LRU (least recently used)
Fits with temporal locality, LRU = least likely to be used in future
NMRU (not most recently used)
An easier to implement approximation of LRU
NMRU is the same as LRU for 2-way set-associative caches
Belady’s: replace block that will be used furthest in future
Unachievable optimum
LRU and Miss Handling
Add LRU field to each set
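For 2-way sets a single LRU bit per set suffices; a minimal C sketch (names are illustrative):

#include <stdint.h>

#define NUM_SETS 512
static uint8_t lru_way[NUM_SETS];          /* names the least recently used way (0 or 1) */

void lru_touch(uint32_t set, uint32_t way) /* call on every hit or fill      */
{
    lru_way[set] = way ^ 1u;               /* the other way is now LRU       */
}

uint32_t lru_victim(uint32_t set)          /* which way to replace on a miss */
{
    return lru_way[set];
}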