Memory Hierarchy Design
Four Memory Hierarchy Questions
Q1 (block placement): where can a block be placed in cache?
Q2 (block identification): how is a block found if it is in cache?
Q3 (block replacement): which block should be replaced on a miss?
Q4 (write strategy): what happens on a write?
Q1: Block Placement
Figure: an 8-block cache and a 32-block memory, illustrating the candidate placements for a memory block.
Q2: Block Identification
• Caches have an address tag on each block frame that gives the block address
  ● The tag is checked to see if it matches the block address from the processor
• Each cache block has a valid bit
  ● Indicates whether the entry holds a valid address
Two questions to answer (in hardware):
• Question 1: If a data item is in the cache, how do we find it?
• Question 2: How do we know whether a data item is in the cache?
Direct mapped
• Each memory block is mapped to exactly one block in the cache
  ● Many memory blocks must share each cache block
• Address mapping (to answer Q1):
  (block address) modulo (# of blocks in the cache)
• A tag associated with each cache block holds the address information (the upper portion of the address) required to identify the block (to answer Q2)
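A minimal sketch (not from the slides) of how a byte address decomposes under this mapping; the 8-block cache and 4-byte (one-word) blocks mirror the running example, and both are parameters:

```python
def split_address(addr, num_blocks=8, block_bytes=4):
    """Decompose a byte address for a direct mapped cache.

    The low-order bits select the byte within the block, the next bits
    are the index (block address modulo # of blocks), and the remaining
    high-order bits form the tag stored with the cache block.
    """
    block_addr = addr // block_bytes   # drop the byte-offset bits
    index = block_addr % num_blocks    # Q1: which cache block to look in
    tag = block_addr // num_blocks     # Q2: identifies which memory block is there
    return tag, index

# Memory block 13 (byte address 52) maps to index 13 % 8 = 5, tag 13 // 8 = 1.
```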
Caching: A Simple First Example
Figure: a direct-mapped cache with four one-word blocks caching a 16-block byte-addressable main memory (block addresses 0000xx through 1111xx).
• One-word blocks; the two low-order bits define the byte within the (32-bit) word.
Q1: How do we find it?
Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)
Q2: Is it there?
Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache
(block address) modulo (# of blocks in the cache)
Direct Mapped Cache Example
Figure: a direct-mapped cache with one-word blocks and cache size = 1K words (4 KB); the 32-bit address splits into a 20-bit tag (bits 31..12), a 10-bit index (bits 11..2) selecting one of the 1024 blocks, and a 2-bit byte offset (bits 1..0).
Multiword Block Direct Mapped Cache
Figure: a direct-mapped cache with four-word blocks and cache size = 1K words (256 blocks); the 32-bit address splits into a 20-bit tag (bits 31..12), an 8-bit index (bits 11..4) selecting one of the 256 blocks, a 2-bit block offset (bits 3..2) selecting the word within the block, and a 2-bit byte offset (bits 1..0).
Set Associative Mapped
Allow more flexible block placement
• In a direct mapped cache, a memory block maps to exactly one cache block
• At the other extreme, a fully associative cache allows a memory block to be mapped to any cache block
• A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative); a memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):
  (block address) modulo (# sets in the cache)
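The same mapping in code; the block address 12 and the set counts below are illustrative. Note that a direct mapped cache is just the 1-way extreme and a fully associative cache the 1-set extreme:

```python
def set_index(block_addr, num_sets):
    # A memory block maps to exactly one set, but may be placed in any
    # of the n ways within that set.
    return block_addr % num_sets

# An 8-block cache viewed three ways:
#   direct mapped:     8 sets of 1 way  -> block 12 maps to set 4
#   2-way associative: 4 sets of 2 ways -> block 12 maps to set 0
#   fully associative: 1 set of 8 ways  -> block 12 maps to set 0
```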
Set Associative Cache Example
Figure: the same 16-block byte-addressable main memory (block addresses 0000xx through 1111xx), now cached by a set associative cache with two sets (the index is a single bit).
• One-word blocks; the two low-order bits define the byte within the (32-bit) word.
Q1: How do we find it?
Use the next low-order memory address bit – the index – to determine which cache set (i.e., modulo the number of sets in the cache)
Q2: Is it there?
Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache
Four-Way Set Associative Cache
Figure: a four-way set associative cache with 2^8 = 256 sets, each with four ways (each way holding one block). The 8-bit index selects a set, all four tags in the set are compared against the 22-bit tag in parallel, and a 4-to-1 multiplexor selects the data from the matching way.
Q3: Block Replacement
When a miss occurs, the cache controller must select a block to be replaced with the desired data. Common policies are least recently used (LRU), random, and first in, first out (FIFO).
Figure: data cache misses per 1000 instructions comparing LRU, random, and FIFO replacement.
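To make LRU (one of the three policies in the comparison) concrete, here is a toy fully associative simulator; the access trace and capacity are made up:

```python
from collections import OrderedDict

def simulate_lru(accesses, capacity):
    """Count misses for a fully associative cache under LRU replacement.
    (A software sketch; real hardware only approximates LRU.)"""
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # hit: mark most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = None
    return misses

# simulate_lru([0, 1, 2, 0, 3, 0], 3): three cold misses, a hit on 0,
# a miss on 3 that evicts 1 (the LRU block), then a final hit on 0.
```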
Q4: Write Strategy
Write data to the cache; there are two write policies:
• Write-through: the information is written to both the block in the cache and the block in the lower-level memory.
• Write-back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.
Handling Cache Hits
• Read hits (I$ and D$)
  ● This is what we want!
• Write hits (D$ only)
  ● Require the cache and memory to be consistent:
    – Always write the data into both the cache block and the next level in the memory hierarchy (write-through)
    – Writes run at the speed of the next level in the memory hierarchy – so slow! Or can use a write buffer and stall only if the write buffer is full
  ● Allow cache and memory to be inconsistent:
    – Write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is "evicted")
    – Need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted; can use a write buffer to help buffer write-backs of dirty blocks
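The dirty-bit bookkeeping can be sketched as a toy direct mapped, write-back, write-allocate cache; everything here (one-word blocks, Python lists standing in for the tag and dirty-bit arrays) is illustrative:

```python
class WriteBackCache:
    """Toy direct mapped write-back cache with one-word blocks."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.tags = [None] * num_blocks
        self.dirty = [False] * num_blocks
        self.memory_writes = 0  # dirty blocks written back on eviction

    def access(self, block_addr, is_write):
        idx = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks
        if self.tags[idx] != tag:        # miss
            if self.dirty[idx]:          # evicted block is dirty:
                self.memory_writes += 1  # write it back first
                self.dirty[idx] = False
            self.tags[idx] = tag         # fetch and install the new block
        if is_write:
            self.dirty[idx] = True       # write only into the cache
```

A write-through cache would instead count one memory write per store; here memory traffic occurs only when a dirty block is evicted.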
Handling Cache Misses (Single Word Blocks)
• Read misses (I$ and D$)
  ● Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
• Write misses (D$ only)
  1. Write allocate – stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume; or
  2. No-write allocate – skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full
Handling Cache Misses (Multiword Blocks)
• Read misses (I$ and D$)
  ● Processed the same as for single-word blocks – a miss returns the entire block from memory
  ● Miss penalty grows as block size grows; mitigations:
    – Early restart – the processor resumes execution as soon as the requested word of the block is returned
    – Requested word first – the requested word is transferred from memory to the cache (and processor) first
  ● Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss
• Write misses (D$)
  ● If using write allocate, must first fetch the block from memory and then write the word to the block
Memory hierarchy design becomes more crucial with recent multi-core processors
Aggregate peak bandwidth grows with # cores:
• Intel Core i7 can generate two references per core per clock
• With four cores and a 3.2 GHz clock:
  ● 25.6 billion 64-bit data references/second, plus
  ● 12.8 billion 128-bit instruction references/second
  ● = 409.6 GB/s!
• DRAM bandwidth is only 6% of this (25 GB/s)
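The arithmetic on this slide can be checked directly (assuming one 128-bit instruction reference per core per clock, which is what reproduces the 12.8 billion figure):

```python
cores, clock_ghz = 4, 3.2
data_refs = cores * 2 * clock_ghz   # 25.6 billion 64-bit data refs/s
inst_refs = cores * 1 * clock_ghz   # 12.8 billion 128-bit instruction refs/s

# Convert references/s (in billions) to GB/s via bytes per reference.
peak_gbs = data_refs * 8 + inst_refs * 16   # 204.8 + 204.8 = 409.6 GB/s
dram_fraction = 25 / peak_gbs               # DRAM covers only ~6% of the peak
```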
Cache Performance
Measuring Memory Hierarchy Performance
Average Memory Access Time (AMAT)
= Hit Time + Miss Rate × Miss Penalty
• Hit time: the time to hit in the cache
• Miss penalty: the time to get the data from main memory
• Miss rate: the fraction of memory accesses not found in the cache (= 1 − hit ratio)
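The formula as code, with made-up example values (1-cycle hit, 5% miss rate, 100-cycle miss penalty):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

# amat(1, 0.05, 100) -> 1 + 0.05 * 100 = 6 cycles on average
```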
Then, Processor Performance?
Previously:
CPU Time = Instruction Count × CPI × Cycle Time
         = CPU Execution Clock Cycles × Cycle Time
Now:
CPU Time = (CPU Execution Clock Cycles + Memory Stall Clock Cycles) × Cycle Time
Thus, cache behavior has a great impact on processor performance!
CPU Time as a function of CPI
= (CPU Execution Clock Cycles + Memory Stall Clock Cycles) × Cycle Time
= (IC × CPI + IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty) × Cycle Time
= IC × (CPI + (Memory Accesses / Instruction) × Miss Rate × Miss Penalty) × Cycle Time
The lower the CPI, the greater the relative impact of the misses.
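The final form of the equation in code; the numbers in the comment are illustrative, chosen only to show how memory stalls inflate the effective CPI:

```python
def cpu_time(ic, cpi, accesses_per_inst, miss_rate, miss_penalty, cycle_time):
    """CPU Time = IC * (CPI + Memory Accesses/Instruction
                        * Miss Rate * Miss Penalty) * Cycle Time"""
    stall_cpi = accesses_per_inst * miss_rate * miss_penalty
    return ic * (cpi + stall_cpi) * cycle_time

# With CPI = 1, 2 accesses/instruction, 5% miss rate, and a 100-cycle
# penalty, memory stalls add 10 cycles per instruction: misses dominate.
```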
• Cache Organization
• Virtual Memory
• Six Basic Cache Optimizations
• Ten Advanced Optimizations of Cache Performance
• Memory Technology and Optimizations
• Virtual Memory and Protection
• Protection: Virtual Memory and Virtual Machines
Review: Major Components of a Computer
Figure: the major components of a computer – the processor (control and datapath), memory, and input/output devices – with main memory backed by secondary memory (disk).
How is the Hierarchy Managed?
• Registers ↔ memory
  ● Managed by the compiler (or programmer)
• Cache ↔ main memory
  ● Managed by the cache controller hardware
• Main memory ↔ disks
  ● Managed by the operating system (virtual memory)
  ● Physical address mapping assisted by hardware (TLB)
Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
Figure: the hierarchy, with access time increasing with distance from the processor and (relative) memory size growing at each level:
• Processor ↔ L1$: 4-8 bytes (word)
• L1$ ↔ L2$: 8-32 bytes (block)
• L2$ ↔ Main Memory: 1 to 4 blocks
• Main Memory ↔ Secondary Memory: 1,024+ bytes (disk sector = page)
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.
Virtual Memory - Why and How?
• Why virtual memory?
  1. Provides the ability to easily run programs larger than the size of physical memory
  2. Simplifies loading a program for execution by providing for code relocation (i.e., the code can be loaded anywhere in main memory)
  3. Allows efficient and safe sharing of memory among multiple processes
• What makes it work? Again, the principle of locality
  ● A program is likely to access a relatively small portion of its address space during any period of time
• Each program is compiled into its own address space – a "virtual" address space
  ● During run-time each virtual address must be translated to a physical address (an address in main memory)
Two Programs Sharing Physical Memory
• A program's address space is divided into pages (all one fixed size) or segments (variable sizes)
• Each process has a page table containing the mapping from logical (virtual) addresses to physical addresses
Figure: two virtual address spaces mapped into the same main memory.
Address Translation
• A virtual address is translated to a physical address by a combination of hardware and software
Figure: a 32-bit virtual address (VA) – virtual page number (bits 31..12) plus page offset (bits 11..0) – is translated to a 30-bit physical address (PA) – physical page number (bits 29..12) plus the unchanged page offset.
• So each memory request first requires an address translation from the virtual space to the physical space
  ● A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault
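A sketch of the translation step with 4 KB pages (12 offset bits, matching the figure); the page table is modeled as a plain dict, and a missing entry stands in for a page fault:

```python
PAGE_BITS = 12  # 4 KB pages: bits 11..0 are the page offset

def translate(va, page_table):
    """Split a VA into (virtual page number, offset), look the VPN up,
    and splice the physical page number onto the unchanged offset."""
    vpn = va >> PAGE_BITS
    offset = va & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        raise KeyError(f"page fault: virtual page {vpn} not in memory")
    return (page_table[vpn] << PAGE_BITS) | offset
```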
Address Translation Mechanisms
Figure: the page table register points to the page table held in main memory; the virtual page number of the reference indexes the table, and a valid entry supplies the physical page base address, which is combined with the page offset to form the physical address. Invalid entries refer to pages held in disk storage.
Virtual Addressing with a Cache
• Thus it takes an extra memory access to translate a VA to a PA
Figure: the CPU issues a VA, translation produces a PA, and the PA is then presented to the cache (hit) or main memory (miss).
• This makes memory (cache) accesses very expensive (if every access were really two accesses)
• The hardware fix is to use a Translation Lookaside Buffer (TLB) – a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup
Making Address Translation Fast
Figure: the TLB is a small cache of page table entries, each holding a valid bit, a tag, and a physical page base address. The virtual page number is looked up in the TLB first; only on a TLB miss is the page table (in physical memory, located via the page table register) consulted. Pages not in memory reside in disk storage.
Translation Lookaside Buffers (TLBs)
• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
• Each TLB entry maps a virtual page # to a physical page #
• What is used as the tag for identification? (The virtual page number)
• TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches)
  ● TLBs are typically not more than 512 entries even on high-end machines
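A fully associative lookup can be sketched as a parallel tag match (serialized here in software); the (valid, tag, physical page #) entries below are made up:

```python
def tlb_lookup(vpn, tlb):
    """Return the physical page # for a TLB hit, or None on a miss.
    The tag of each entry is the virtual page number itself."""
    for valid, tag, ppn in tlb:
        if valid and tag == vpn:
            return ppn          # hit: no page table access needed
    return None                 # miss: fall back to the page table
```

On a miss, hardware or a software handler would refill the TLB from the page table.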
A TLB in the Memory Hierarchy
Figure: a TLB lookup precedes the cache access; a TLB hit (the common case, costing only a fraction of the access time) yields the PA directly, while a miss falls back to translation via the page table in main memory.
• A TLB miss – is it a page fault or merely a TLB miss?
  ● If the page is loaded in main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB
    – Takes 10s of cycles to find and load the translation info into the TLB
  ● If the page is not in main memory, then it is a true page fault
    – Takes 1,000,000s of cycles to service a page fault
• TLB misses are much more frequent than true page faults
TLB Event Combinations
TLB    Page Table   Cache   Possible? Under what circumstances?
Hit    Hit          Hit     Yes – what we want!
Hit    Hit          Miss    Yes – although the page table is not checked if the TLB hits
Miss   Hit          Hit     Yes – TLB miss, PA in page table
Miss   Hit          Miss    Yes – TLB miss, PA in page table, but data not in cache
Miss   Miss         Miss    Yes – page fault
Hit    Miss         any     Impossible – TLB translation not possible if page is not present in memory
Miss   Miss         Hit     Impossible – data not allowed in cache if page is not in memory
Review: with and w/o TLB
Figure: without a TLB, every reference goes VA → translation (page table in main memory) → PA → cache/main memory; with a TLB, most references resolve the VA → PA mapping in the TLB lookup and skip the page table entirely.