Memory Hierarchy Design

Four Memory Hierarchy Questions
Q1 (block placement): where can a block be placed in cache?
Q2 (block identification): how is a block found if it is in cache?


Q3 (block replacement): which block should be replaced on a miss?
Q4 (write strategy): what happens on a write?

Q1: Block Placement
[Figure: a cache of 8 blocks and a memory of 32 blocks – where can a memory block be placed in the cache?]

Q2: Block Identification
• Caches have an address tag on each block frame that gives the block address
  - The tag is checked to see if it matches the block address from the processor
• Each cache block has a valid bit
  - It indicates whether the entry contains a valid address

Two questions to answer (in hardware):
• Question 1: If a data item is in the cache, how do we find it?
• Question 2: How do we know if a data item is in the cache?

Direct Mapped Cache
• Each memory block is mapped to exactly one block in the cache
  - Many memory blocks must therefore share each block in the cache
• Address mapping (to answer Q1):
  (block address) modulo (# of blocks in the cache)
• A tag associated with each cache block contains the address information (the upper portion of the address) required to identify the block (to answer Q2)
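The index and tag of a block address can be computed directly; a minimal Python sketch (`NUM_BLOCKS` is an assumed cache size, not a value from the slides):

```python
NUM_BLOCKS = 8  # assumed cache size in blocks (hypothetical)

def placement(block_address):
    """Direct-mapped placement: the index answers Q1, the tag answers Q2."""
    index = block_address % NUM_BLOCKS   # which cache block the address maps to
    tag = block_address // NUM_BLOCKS    # upper address bits stored for identification
    return index, tag

print(placement(12))  # (4, 1): memory block 12 maps to cache block 4 with tag 1
```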

Caching: A Simple First Example
Main memory: sixteen one-word blocks, byte addressable, at addresses 0000xx through 1111xx
• One-word blocks
• Two low-order bits define the byte in the word (32-bit words)
Cache: four blocks, each with an index (00–11), a valid bit, and a tag
Q1: How do we find it?
Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)
Q2: Is it there?
Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache
(block address) modulo (# of blocks in the cache)
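The field extraction for this small example can be sketched as follows (an illustration assuming 6-bit byte addresses, split as 2-bit tag, 2-bit index, 2-bit byte offset, matching the figure above):

```python
def split_address(addr):
    """Split a 6-bit byte address into (tag, index, byte offset)."""
    byte_offset = addr & 0b11     # two low-order bits: byte within the word
    index = (addr >> 2) & 0b11    # next two bits: which cache block (modulo 4)
    tag = (addr >> 4) & 0b11      # high-order bits: identify the memory block
    return tag, index, byte_offset

print(split_address(0b110100))  # (3, 1, 0)
```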

Direct Mapped Cache Example
[Figure: 1024-entry direct-mapped cache; each entry (index 0–1023) holds a valid bit, a 20-bit tag, and one word of data; the address supplies a byte offset, a 10-bit index, and a 20-bit tag]
One-word blocks, cache size = 1K words (or 4 KB)

Multiword Block Direct Mapped Cache
[Figure: 32-bit address split into a 20-bit tag (bits 31–12), an 8-bit index (bits 11–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0); 256 cache entries (index 0–255), each with a valid bit and tag]
Four words/block, cache size = 1K words
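Under this layout a 32-bit address splits into a 20-bit tag, an 8-bit index, a 2-bit block (word) offset, and a 2-bit byte offset; a sketch:

```python
def fields(addr):
    """Decompose a 32-bit address for a 256-entry, four-word-per-block cache."""
    byte_off = addr & 0x3          # bits 1-0: byte within the word
    word_off = (addr >> 2) & 0x3   # bits 3-2: word within the block
    index = (addr >> 4) & 0xFF     # bits 11-4: which cache entry
    tag = addr >> 12               # bits 31-12: stored tag
    return tag, index, word_off, byte_off

print(fields(0x12345))  # (18, 52, 1, 1)
```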

Set Associative Caches
Allow more flexible block placement:
• In a direct mapped cache, a memory block maps to exactly one cache block
• At the other extreme, a memory block could be allowed to map to any cache block – a fully associative cache
• A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)
• Set mapping (to answer Q1):
  (block address) modulo (# sets in the cache)

Set Associative Cache Example
Main memory: sixteen one-word blocks, byte addressable, at addresses 0000xx through 1111xx
• One-word blocks
• Two low-order bits define the byte in the word (32-bit words)
Cache: two-way set associative, two sets
Q1: How do we find it?
Use the next low-order memory address bit – the set index – to determine which cache set (i.e., modulo the number of sets in the cache)
Q2: Is it there?
Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache

Four-Way Set Associative Cache
[Figure: 2^8 = 256 sets, each with four ways (each way holding one block); the index selects a set, the four tags in the set are compared in parallel with the address tag, and a 4-to-1 multiplexor selects the data from the hit way]

Q3: Block Replacement
When a miss occurs, the cache controller must select a block to be replaced with the desired data
[Table: data cache misses per 1000 instructions comparing LRU, Random, and FIFO replacement]
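The policies can be compared on a small trace; a sketch of LRU and FIFO for one set (the trace and sizes below are made up, not the data from the table above):

```python
from collections import OrderedDict, deque

def misses_lru(trace, ways):
    """Count misses with least-recently-used replacement in one set."""
    blocks, misses = OrderedDict(), 0
    for b in trace:
        if b in blocks:
            blocks.move_to_end(b)            # mark most recently used
        else:
            misses += 1
            if len(blocks) == ways:
                blocks.popitem(last=False)   # evict least recently used
            blocks[b] = True
    return misses

def misses_fifo(trace, ways):
    """Count misses with first-in-first-out replacement in one set."""
    blocks, order, misses = set(), deque(), 0
    for b in trace:
        if b not in blocks:
            misses += 1
            if len(blocks) == ways:
                blocks.remove(order.popleft())  # evict oldest-installed block
            blocks.add(b)
            order.append(b)
    return misses

trace = [1, 2, 3, 1, 4, 1, 2]
print(misses_lru(trace, 3), misses_fifo(trace, 3))  # 5 6: LRU keeps the hot block
```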

Q4: Write Strategy
Write data to cache – there are two write policies:
• Write-through: the information is written to both the block in the cache and the block in the lower-level memory
• Write-back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced
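The traffic difference between the two policies can be sketched with a minimal single-block model (illustrative only, not a full cache simulator):

```python
def simulate(policy, n_writes, n_evictions=1):
    """Count writes that reach the lower-level memory for one cache block."""
    mem_writes, dirty = 0, False
    for _ in range(n_writes):
        if policy == "write-through":
            mem_writes += 1          # cache and memory are updated together
        else:                        # write-back
            dirty = True             # only the cache copy is updated
    for _ in range(n_evictions):
        if policy == "write-back" and dirty:
            mem_writes += 1          # dirty block written back on eviction
            dirty = False
    return mem_writes

print(simulate("write-through", 100))  # 100 memory writes
print(simulate("write-back", 100))     # 1 memory write for the same stream
```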

Handling Cache Hits
• Read hits (I$ and D$)
  - This is what we want!
• Write hits (D$ only)
  - Require the cache and memory to be consistent:
    - Always write the data into both the cache block and the next level in the memory hierarchy (write-through)
    - Writes then run at the speed of the next level in the memory hierarchy – slow! Or use a write buffer and stall only if the write buffer is full
  - Or allow cache and memory to be inconsistent:
    - Write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is "evicted")
    - Need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – a write buffer can help "buffer" write-backs of dirty blocks

Handling Cache Misses (Single Word Blocks)
• Read misses (I$ and D$)
  - Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
• Write misses (D$ only)
  1. Write allocate – stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume; or
  2. No-write allocate – skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full
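A hedged sketch of the two write-miss policies, modeling the next memory level as a dict of block → word list (all names here are illustrative, not from the slides):

```python
def write_miss_allocate(cache, memory, block, word, value):
    """Write allocate: fetch the whole block, install it, then write the word."""
    cache[block] = list(memory[block])  # fetch the block from the next level
    cache[block][word] = value          # write the word into the cache copy

def write_miss_no_allocate(write_buffer, block, word, value):
    """No-write allocate: bypass the cache and queue the write."""
    write_buffer.append((block, word, value))  # drained to memory later

memory = {7: [0, 0, 0, 0]}  # next-level memory: block 7 holds four words
cache, buf = {}, []
write_miss_allocate(cache, memory, 7, 2, 99)
write_miss_no_allocate(buf, 7, 3, 42)
print(cache[7], buf)  # [0, 0, 99, 0] [(7, 3, 42)]
```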

Handling Cache Misses (Multiword Blocks)
• Read misses (I$ and D$)
  - Processed the same as for single word blocks – a miss returns the entire block from memory
  - Miss penalty grows as block size grows; mitigations:
    - Early restart – the processor resumes execution as soon as the requested word of the block is returned
    - Requested word first – the requested word is transferred from memory to the cache (and processor) first
  - Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss
• Write misses (D$)
  - If using write allocate, must first fetch the block from memory and then write the word to the block

Memory Hierarchy Design
Memory hierarchy design becomes even more crucial with recent multi-core processors, since aggregate peak bandwidth grows with the number of cores:
• The Intel Core i7 can generate two references per core per clock
• With four cores and a 3.2 GHz clock:
  - 25.6 billion 64-bit data references/second, plus
  - 12.8 billion 128-bit instruction references/second
  - = 409.6 GB/s!
• DRAM bandwidth is only 6% of this (25 GB/s)
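A quick check of the peak-bandwidth arithmetic above (figures from the slide; 64-bit = 8 bytes, 128-bit = 16 bytes):

```python
cores, clock = 4, 3.2e9                # four cores at 3.2 GHz
data_refs = 2 * cores * clock          # 25.6e9 64-bit data refs/s
inst_refs = 1 * cores * clock          # 12.8e9 128-bit instruction refs/s
peak = data_refs * 8 + inst_refs * 16  # peak demand in bytes/s
print(peak / 1e9)                      # 409.6 GB/s
print(round(25e9 / peak * 100, 1))     # DRAM's 25 GB/s covers ~6.1% of peak
```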

Cache Performance
Measuring memory hierarchy performance:

Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty

• Hit Time: the time to hit in the cache
• Miss Penalty: the time to get the data from main memory
• Miss Rate: the fraction of memory accesses not found in the cache (i.e., 1 − hit ratio)
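A worked example of the formula (the numbers are illustrative, not from the slides): a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time in cycles."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 100))  # 6.0 cycles: a 5% miss rate dominates the 1-cycle hit
```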

Then, What About Processor Performance?
Previously:
CPU Time = Instruction Count × CPI × Cycle Time
         = CPU Execution Clock Cycles × Cycle Time
Now:
CPU Time = (CPU Execution Clock Cycles + Memory Stall Clock Cycles) × Cycle Time
Thus, cache behavior has a great impact on processor performance!

CPU Time as a Function of CPI
CPU Time = (CPU Execution Clock Cycles + Memory Stall Clock Cycles) × Cycle Time
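Expanding the stall term per instruction (memory accesses per instruction × miss rate × miss penalty) gives a quick numeric sketch; all the numbers below are illustrative, not from the slides:

```python
def cpu_time(ic, cpi, accesses_per_instr, miss_rate, miss_penalty, cycle_time):
    """CPU time = IC * (CPI + accesses/instr * miss rate * miss penalty) * cycle time."""
    return ic * (cpi + accesses_per_instr * miss_rate * miss_penalty) * cycle_time

# 1e9 instructions, base CPI 1.0, 1.5 accesses/instr, 2% miss rate,
# 100-cycle miss penalty, 1 ns cycle time:
t = cpu_time(1e9, 1.0, 1.5, 0.02, 100, 1e-9)
print(t)  # ~4.0 s, versus 1.0 s with a perfect memory system
```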
= (IC × CPI_execution + IC × (Memory Accesses / Instruction) × Miss Rate × Miss Penalty) × Cycle Time
= IC × (CPI_execution + (Memory Accesses / Instruction) × Miss Rate × Miss Penalty) × Cycle Time
The lower the base CPI, the greater the relative impact of the misses.

Memory Hierarchy Design – Outline
• Cache Organization
• Virtual Memory
• Six Basic Cache Optimizations
• Ten Advanced Optimizations of Cache Performance
• Memory Technology and Optimizations
• Virtual Memory and Protection
• Protection: Virtual Memory and Virtual Machines

Review: Major Components of a Computer
[Figure: processor (control and datapath), main memory, secondary memory (disk), and input/output devices]

How is the Hierarchy Managed?
• registers ↔ memory: by the compiler (programmer?)
• cache ↔ main memory: by the cache controller hardware
• main memory ↔ disks: by the operating system (virtual memory); physical address mapping assisted by the hardware (TLB)

Review: The Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Figure: increasing distance from the processor in access time – registers (4–8 bytes per word), L1$ (8–32 bytes per block), L2$ (1 to 4 blocks per transfer), main memory, secondary memory (disk sector = page, 1,024+ bytes); the relative size of the memory grows at each level]
Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

Virtual Memory – Why and How?
• Why virtual memory?
  1. Provides the ability to easily run programs larger than the size of physical memory
  2. Simplifies loading a program for execution by providing for code relocation (i.e., the code can be loaded anywhere in main memory)
  3. Allows efficient and safe sharing of memory among multiple programs
• What makes it work?
  - Again, the Principle of Locality: a program is likely to access a relatively small portion of its address space during any period of time
• Each program is compiled into its own address space – a "virtual" address space
  - During run-time each virtual address must be translated to a physical address (an address in main memory)

Two Programs Sharing Physical Memory
• A program's address space is divided into pages (all one fixed size) or segments (variable sizes)
• Each process has a page table containing the mapping from virtual addresses to physical addresses
[Figure: two virtual address spaces mapped into a shared main memory]

Address Translation
• A virtual address is translated to a physical address by a combination of hardware and software
[Figure: Virtual Address (VA), bits 31…0, split into a virtual page number and a page offset; translation maps the virtual page number to a physical page number, which joins the page offset to form the Physical Address (PA), bits 29…0]
• So each memory request first requires an address translation from the virtual space to the physical space
  - A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault

Address Translation Mechanisms
[Figure: the virtual page number of the reference indexes the page table; the selected entry supplies the physical page base address, which together with the page offset forms the physical address]
[Figure, continued: the page table resides in main memory, located by the page table register; pages not present in main memory are kept on disk storage]

Virtual Addressing with a Cache
• It takes an extra memory access to translate a VA to a PA
• This makes memory (cache) accesses very expensive (if every access were really two accesses)
• The hardware fix is to use a Translation Lookaside Buffer (TLB) – a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup

Making Address Translation Fast
[Figure: a TLB (a cache whose entries hold a valid bit, a tag, and a physical page base address) sits in front of the page table in physical memory; the page table register locates the page table, and unmapped pages reside on disk storage]

Translation Lookaside Buffers (TLBs)
• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
• TLB entries hold a valid bit, a virtual page #, and a physical page #
• What is used as the tag for identification?
• TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches)
  - TLBs are typically not more than 512 entries even on high-end machines

A TLB in the Memory Hierarchy
[Figure: VA → TLB lookup (hit: physical address obtained directly; miss: translation via the page table) → cache → main memory]
• A TLB miss – is it a page fault or merely a TLB miss?
  - If the page is loaded into main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB
    - Takes 10's of cycles to find and load the translation info into the TLB
  - If the page is not in main memory, then it's a true page fault
    - Takes 1,000,000's of cycles to service a page fault
• TLB misses are much more frequent than true page faults

TLB Event Combinations

TLB  | Page Table | Cache | Possible? Under what circumstances?
Hit  | Hit        | Hit   | Yes – what we want!
Hit  | Hit        | Miss  | Yes – although the page table is not checked if the TLB hits
Miss | Hit        | Hit   | Yes – TLB miss, PA in page table
Miss | Hit        | Miss  | Yes – TLB miss, PA in page table, but data not in cache
Miss | Miss       | Miss  | Yes – page fault
Hit  | Miss       | Hit/Miss | Impossible – TLB translation not possible if page is not present in memory
Miss | Miss       | Hit   | Impossible – data not allowed in cache if page is not in memory

Review: Memory Access with and without a TLB
[Figure: without a TLB, every access goes VA → translation (page table in main memory) → PA → cache/main memory; with a TLB, most accesses hit in the TLB lookup and skip the page-table walk]