Memory Hierarchy Design
• Cache Organization
• Virtual Memory
• Six Basic Cache Optimizations
• 2.4 Ten Advanced Optimizations of Cache Performance
• Memory Technology and Optimizations
• Virtual Memory and Protection
• Protection: Virtual Memory and Virtual Machines
Ten Advanced Optimizations
• The previous six basic optimizations try to reduce miss rate, miss penalty, and hit time
• The ten advanced optimizations add two more metrics:
  – Cache bandwidth
  – Power consumption
Ten Advanced Optimizations
• Reduce hit time
• Increase cache bandwidth
• Reduce the miss penalty
• Reduce the miss rate
• Reduce the miss penalty or miss rate via parallelism
O1: Small and simple first level caches
• Critical timing path in a cache hit:
  – addressing the tag memory using the index portion of the address
  – comparing the read tag value to the address
  – selecting the correct data item (set associative)
• A smaller, simpler L1 cache reduces hit time! (See the sketch below.)
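As a minimal sketch of the critical path above, the C fragment below splits an address into the index that addresses the tag memory and the tag that is compared against it. The geometry (a 32 KiB, 4-way cache with 64-byte blocks, giving 128 sets) is an illustrative assumption, not any specific processor's design.

```c
#include <stdint.h>

/* Assumed geometry: 32 KiB / (4 ways * 64 B) = 128 sets
 * -> 6 offset bits, 7 index bits. */
#define OFFSET_BITS 6
#define INDEX_BITS  7

static inline uint32_t cache_index(uint32_t addr) {
    /* addresses the tag memory */
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

static inline uint32_t cache_tag(uint32_t addr) {
    /* compared against the stored tags */
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```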
Access Time versus L1 Size and Associativity
[Figure: access time as a function of cache size and associativity]
Energy versus L1 Size and Associativity
[Figure: energy per read as a function of cache size and associativity]
Higher Associativity in L1 Cache
In recent designs, three factors have led to the use of higher associativity in L1 caches:
• Many processors take at least two clock cycles to access the cache
  – Thus, the impact of a slightly longer hit time may not be critical
• To keep the TLB out of the critical path, many L1 caches are virtually indexed (but physically tagged)
  – This limits the cache size to the page size times the associativity; e.g., with 4 KiB pages, an 8-way cache can be at most 32 KiB
• Higher associativity reduces conflict misses
O2: Way Prediction
• To improve hit time, predict the way in order to preset the multiplexer
  – Aim: reduce conflict misses while keeping the hit speed of a direct-mapped cache
  – A misprediction gives a longer hit time
• Prediction accuracy
  – > 90% for two-way
  – > 80% for four-way
  – The I-cache predicts better than the D-cache
• First used on the MIPS R10000 in the mid-90s; also used on the ARM Cortex-A8
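A back-of-envelope model of what way prediction buys: the C snippet below computes the average hit time from the slide's two-way accuracy figure, under assumed timings (1 cycle for a correct prediction, 1 extra cycle to try the remaining way after a misprediction).

```c
#include <stdio.h>

int main(void) {
    double accuracy = 0.90;           /* two-way accuracy from the slide  */
    double fast = 1.0, penalty = 1.0; /* cycles; assumed timings          */
    double avg = accuracy * fast + (1.0 - accuracy) * (fast + penalty);
    printf("average hit time = %.2f cycles\n", avg);  /* prints 1.10 */
    return 0;
}
```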
O3: Pipelining Cache
• Pipeline cache access to improve cache bandwidth
  – Pipelining spreads a cache access over more clock cycles, so the cache can accept a new access every cycle
  – Examples (evolution of Intel instruction-cache access cycles):
    · Pentium: 1 cycle
    · Pentium Pro – Pentium III: 2 cycles
    · Pentium 4 – Core i7: 4 cycles
  – Allowing more pipeline stages has two consequences:
    · It increases the branch misprediction penalty
    · It makes it easier to increase associativity (more time for performing parallel comparisons)
O4: Nonblocking Caches
• Allow hits before previous misses complete
  – "Hit under miss"
  – "Hit under multiple miss"
• Thus, the effective miss penalty can be reduced!
• Rationale: pipelined computers allow out-of-order execution, so the processor need not stall on a data miss
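The bookkeeping behind a nonblocking cache is a set of Miss Status Handling Registers (MSHRs) that track in-flight misses. The C sketch below shows an illustrative entry layout and the check for whether a new miss can be accepted; the field names and the 8-entry count are assumptions, not any particular design.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 8   /* assumed: up to 8 outstanding misses */

typedef struct {
    bool     valid;      /* entry tracks an in-flight miss      */
    uint64_t block_addr; /* which block is being fetched        */
    uint8_t  dest_reg;   /* where to deliver the data on return */
} mshr_t;

/* A hit can proceed under a pending miss as long as it hits in the
 * cache; a new miss can proceed only while an MSHR remains free. */
static bool can_accept_miss(const mshr_t mshrs[NUM_MSHRS]) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].valid) return true;
    return false;  /* all MSHRs busy: the cache must stall */
}
```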
O5: Multibanked Caches
• Organize the cache as independent banks to support simultaneous accesses
  – The ARM Cortex-A8 supports 1–4 banks for L2
  – The Intel i7 supports 4 banks for L1 and 8 banks for L2
• Banking works best when the accesses naturally spread themselves across the banks
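A minimal sketch of sequential interleaving, the usual mapping that spreads consecutive blocks across consecutive banks so streaming accesses hit different banks. The parameters (4 banks, 64-byte blocks, echoing the i7's 4-bank L1) are illustrative.

```c
#include <stdint.h>

#define BLOCK_BITS 6  /* 64-byte blocks */
#define NUM_BANKS  4

/* Sequential interleaving: block number modulo the number of banks. */
static inline unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr >> BLOCK_BITS) % NUM_BANKS);
}
```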
O6: Critical Word First, Early Restart
• Observation: the processor normally needs just one word of the block at a time!
• 1) Critical word first
  – Request the missed word from memory first
  – Send it to the processor as soon as it arrives
• 2) Early restart
  – Request the words in normal order
  – Send the missed word to the processor as soon as it arrives
• Miss penalty can be reduced!
• Effectiveness of these strategies depends on two factors. What are they?
  – Block size
  – Likelihood of another access to the portion of the block that has not yet been fetched
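A rough worked example of the payoff, under assumed timings (100 cycles to the first transfer, 10 cycles per additional 8-byte beat, 64-byte blocks); with critical word first the processor restarts after the first transfer instead of waiting for the whole block.

```c
#include <stdio.h>

int main(void) {
    int first = 100, per_beat = 10;   /* cycles; assumed timings      */
    int beats = 64 / 8;               /* 64-byte block, 8-byte beats  */
    int whole_block   = first + (beats - 1) * per_beat; /* wait for all */
    int critical_word = first;        /* restart on the missed word   */
    printf("full block: %d cycles, critical word first: %d cycles\n",
           whole_block, critical_word);  /* 170 vs. 100 */
    return 0;
}
```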
O7: Merging Write Buffer
• When storing to a block that is already pending in the write buffer, update the existing write-buffer entry
• Reduces stalls due to a full write buffer
• Thus, miss penalty can be reduced!
[Figure: a write buffer without merging vs. with merging; with merging, one entry holds multiple words]
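A toy C sketch of the merging check, assuming entries with per-word valid bits as in the figure; the entry size (4 words of 8 bytes) and field names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_ENTRY 4

typedef struct {
    bool     valid;                        /* entry in use            */
    uint64_t block_addr;                   /* block this entry covers */
    uint64_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];  /* which words are present */
} wbuf_entry_t;

/* Returns true if the store merged into an existing entry,
 * avoiding the allocation of a new one. */
static bool try_merge(wbuf_entry_t *e, uint64_t block_addr,
                      unsigned word, uint64_t value) {
    if (!e->valid || e->block_addr != block_addr)
        return false;
    e->data[word] = value;   /* overwrite or fill the word */
    e->word_valid[word] = true;
    return true;
}
```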
O8: Compiler Optimizations
• Loop interchange: swap nested loops to access memory in sequential order
• Goal: reduce the miss rate
• Example (see the C version below): suppose x is a two-dimensional array of size [5000, 100], laid out so that x[i][j] and x[i][j+1] are adjacent
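The interchange in C, assuming the row-major layout stated on the slide (x[i][j] and x[i][j+1] adjacent):

```c
#define ROWS 5000
#define COLS 100

void scale_before(double x[ROWS][COLS]) {
    /* Before interchange: the inner loop strides through memory in
     * steps of COLS elements -> poor spatial locality, many misses. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

void scale_after(double x[ROWS][COLS]) {
    /* After interchange: the inner loop walks memory sequentially,
     * using every word of each cache block before moving on. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```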
O9: Hardware Prefetching
• Prefetch items before the processor requests them!
• Both instructions and data can be prefetched
  – into caches or into an external buffer
  – Typically the processor fetches two blocks on a miss: the requested block and the next one (see the sketch below)
• Goal: reduce miss penalty or miss rate
• Prefetching relies on utilizing memory bandwidth that would otherwise be unused!
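A minimal sketch of the next-block policy described above. fetch_block and prefetch_block are hypothetical memory-system hooks (stubbed here for illustration), not a real API.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical memory-system hooks -- stubs for illustration only. */
static void fetch_block(uint64_t b)    { printf("demand fetch %llu\n", (unsigned long long)b); }
static void prefetch_block(uint64_t b) { printf("prefetch     %llu\n", (unsigned long long)b); }

/* On a miss to block b: fetch b, and prefetch b+1 into the cache
 * or a stream buffer. */
static void on_miss(uint64_t block) {
    fetch_block(block);
    prefetch_block(block + 1);
}

int main(void) { on_miss(42); return 0; }
```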
O10: Compiler Prefetching
• The compiler inserts prefetch instructions before the data is needed
  – 1) Register prefetch: loads the data into a register
  – 2) Cache prefetch: loads the data into the cache
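A cache-prefetch example using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 8 iterations ahead is a tuning assumption, not a recommended constant.

```c
/* Sum an array, prefetching a few iterations ahead so the data is
 * already in the cache when the loop reaches it. */
void sum_with_prefetch(const double *a, long n, double *out) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], /*rw=*/0, /*locality=*/3);
        s += a[i];
    }
    *out = s;
}
```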
Summary of Impacts
[Table: the ten optimizations and their effect on hit time, bandwidth, miss penalty, miss rate, and power]
• Review of Cache Organization
• Review of Virtual Memory
• Six Basic Cache Optimizations
• Ten Advanced Optimizations of Cache Performance
• 2.5 Memory Technology and Optimizations
• Virtual Memory and Protection
• Protection: Virtual Memory and Virtual Machines
Memory Technology
• Performance metrics
  – Latency is the concern of the cache
    · Access time: time between a read request and when the desired word arrives
    · Cycle time: minimum time between unrelated requests to memory
  – Bandwidth is the concern of multiprocessors and I/O
• DRAM is used for main memory; SRAM is used for caches
SRAM Review
• Requires 6 transistors/bit
• Requires low power to retain the bit
• Access time is close to the cycle time (no refresh needed)
DRAM Review
• One transistor/bit
• Address lines are multiplexed (sketched below):
  – Upper half of the address: row access strobe (RAS)
  – Lower half of the address: column access strobe (CAS)
• Reads are destructive: a row must be re-written after being read
• Must also be periodically refreshed
  – Every ~8 ms
  – All bits in a row can be refreshed simultaneously
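A small sketch of the multiplexed address split described above; the 24-bit address with a 12/12 row/column split is an illustrative assumption, not a specific device's geometry.

```c
#include <stdint.h>

#define COL_BITS 12  /* assumed: 12-bit row, 12-bit column */

/* Upper half of the address travels with RAS... */
static inline uint32_t dram_row(uint32_t addr) {
    return addr >> COL_BITS;
}

/* ...and the lower half travels with CAS on the same pins. */
static inline uint32_t dram_col(uint32_t addr) {
    return addr & ((1u << COL_BITS) - 1);
}
```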
DRAM Technology
• Memory capacity should grow linearly with processor speed
  – Unfortunately, memory capacity and speed have not kept pace with processors
• Some optimizations:
  – Multiple accesses to the same row
  – Synchronous DRAM
    · Added a clock signal to the DRAM interface
    · Burst mode with critical word first
  – Wider interfaces
  – Double data rate (DDR)
  – Multiple banks on each DRAM device
DRAM generations
[Table: capacity and access times by DRAM generation]
Row access time has improved by only about 5% per year
Relations of Different Numbers and Names
[Table: relations among DDR clock rates, transfer rates, and DRAM/DIMM names]
DDR Generations
• DDR2: lower power (2.5 V -> 1.8 V); higher clock rates (266 MHz, 333 MHz, 400 MHz)
• Later generations (DDR3/DDR4): rates up to 1600 MHz
Graphics memory
• Achieves 2–5X the bandwidth per DRAM of DDR3
  – Wider interfaces (32 vs. 16 bits)
  – Higher clock rate
    · Possible because the chips are attached to the GPU by soldering instead of via socketed DIMM modules
• GDDR5 is graphics memory based on DDR3
Flash Memory
• A type of EEPROM
• Feature: can hold data without any power (non-volatile)
• Must be erased (in blocks, not a single word) before being overwritten
• Limited number of write cycles, typically 100,000
• Cheaper than SDRAM, more expensive than disk
• Slower than SDRAM, faster than disk
New Trends: NVM
• You may already know: read-only memory, flash memory
• You probably do not know: emerging non-volatile memories (NVM) such as FeRAM, PCM, and MRAM
  – Byte addressable
  – Access time comparable to DRAM