Large and Fast: Exploiting Memory Hierarchy

§5.1 Introduction

Principle of Locality
Programs access a small proportion of their address space at any time
Temporal locality
Items accessed recently are likely to be accessed again soon
e.g., instructions in a loop, induction variables
Spatial locality
Items near those accessed recently are likely to be accessed soon
e.g., sequential instruction access, array data
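As an illustration (not from the original slides), a tiny C fragment exhibits both kinds of locality:

#include <stdio.h>

int main(void) {
    int a[1024], sum = 0;
    /* Spatial locality: a[i] and a[i+1] are adjacent in memory, so one
       cache block fetched on a miss serves several subsequent accesses */
    for (int i = 0; i < 1024; i++)
        a[i] = i;
    /* Temporal locality: the loop instructions, the induction variable i,
       and the accumulator sum are reused on every iteration */
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    printf("%d\n", sum);
    return 0;
}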

Taking Advantage of Locality
Memory hierarchy
Store everything on disk
Copy recently accessed (and nearby) items from disk to smaller DRAM memory
Main memory
Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
Cache memory attached to CPU

Memory Hierarchy Levels
Block (aka line): unit of copying
May be multiple words
If accessed data is present in upper level
Hit: access satisfied by upper level
Hit ratio: hits/accesses
If accessed data is absent
Miss: block copied from lower level
Time taken: miss penalty
Miss ratio: misses/accesses
= 1 – hit ratio
Then accessed data supplied from upper level

§5.2 Memory Technologies

Memory Technology
Static RAM (SRAM)
0.5ns – 2.5ns, $2000 – $5000 per GB
Dynamic RAM (DRAM)
50ns – 70ns, $20 – $75 per GB
Magnetic disk
5ms – 20ms, $0.20 – $2 per GB
Ideal memory
Access time of SRAM
Capacity and cost/GB of disk

DRAM Technology
Data stored as a charge in a capacitor
Single transistor used to access the charge
Must periodically be refreshed
Read contents and write back
Performed on a DRAM “row”

Advanced DRAM Organization
Bits in a DRAM are organized as a rectangular array
DRAM accesses an entire row
Burst mode: supply successive words from a row with reduced latency
Double data rate (DDR) DRAM
Transfer on rising and falling clock edges
Quad data rate (QDR) DRAM
Separate DDR inputs and outputs

DRAM Generations
Year Capacity $/GB
1980 64Kbit $1,500,000
1983 256Kbit $500,000
1985 1Mbit $200,000
1989 4Mbit $50,000
1992 16Mbit $15,000
1996 64Mbit $10,000
1998 128Mbit $4,000
2000 256Mbit $1,000
2004 512Mbit $250
2007 1Gbit $50

DRAM Performance Factors
Row buffer
Allows several words to be read and refreshed in parallel
Synchronous DRAM
Allows for consecutive accesses in bursts without needing to send each address
Improves bandwidth
DRAM banking
Allows simultaneous access to multiple DRAMs
Improves bandwidth

Increasing Memory Bandwidth
Bus timing as in the "Main Memory Supporting Caches" slide: 1 cycle for the address, 15 cycles per DRAM access, 1 cycle per word transferred
4-word-wide memory
Miss penalty = 1 + 15 + 1 = 17 bus cycles
Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
4-bank interleaved memory
Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle

Flash Storage
Nonvolatile semiconductor storage
100× – 1000× faster than disk
Smaller, lower power, more robust
But more $/GB (between disk and DRAM)

Flash Types
NOR flash: bit cell like a NOR gate
Random read/write access
Used for instruction memory in embedded systems
NAND flash: bit cell like a NAND gate
Denser (bits/area), but block-at-a-time access
Cheaper per GB
Used for USB keys, media storage, …
Flash bits wear out after 1000s of accesses
Not suitable for direct RAM or disk replacement
Wear leveling: remap data to less used blocks

Disk Storage
Nonvolatile, rotating magnetic storage

Disk Sectors and Access
Each sector records
Data (512 bytes, 4096 bytes proposed)
Error correcting code (ECC)
Used to hide defects and recording errors
Synchronization fields and gaps
Access to a sector involves
Queuing delay if other accesses are pending
Seek: move the heads
Rotational latency
Data transfer
Controller overhead

Disk Access Example
512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
Average read time
4ms seek time
+ ½ rotation / (15,000/60 rotations per second) = 2ms rotational latency
+ 512B / (100MB/s) = 0.005ms transfer time
+ 0.2ms controller delay
= 6.2ms total
If actual average seek time is 1ms
Average read time = 3.2ms
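A minimal C check of this arithmetic, using only the assumptions stated on the slide:

#include <stdio.h>

int main(void) {
    double seek_ms = 4.0;                          /* average seek time */
    double rot_ms  = 0.5 / (15000.0 / 60.0) * 1e3; /* half a rotation at 15,000rpm */
    double xfer_ms = 512.0 / 100e6 * 1e3;          /* 512B at 100MB/s */
    double ctrl_ms = 0.2;                          /* controller overhead */
    printf("average read time = %.2f ms\n",
           seek_ms + rot_ms + xfer_ms + ctrl_ms);  /* prints 6.21 (~6.2ms) */
    return 0;
}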

Disk Performance Issues
Manufacturers quote average seek time
Based on all possible seeks
Locality and OS scheduling lead to smaller actual average seek times
A smart disk controller allocates physical sectors on disk
Present logical sector interface to host
SCSI, ATA, SATA
Disk drives include caches
Prefetch sectors in anticipation of access
Avoid seek and rotational delay

§5.3 The Basics of Caches

Cache Memory
Cache memory
The level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn–1, Xn
How do we know if the data is present?
Where do we look?

Direct Mapped Cache
Location determined by address
Direct mapped: only one choice
(Block address) modulo (#Blocks in cache)

#Blocks is a power of 2
Use low-order address bits

Tags and Valid Bits
How do we know which particular block is stored in a cache location?
Store block address as well as the data
Actually, only need the high-order bits
Called the tag
What if there is no data in a location?
Valid bit: 1 = present, 0 = not present
Initially 0
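A hedged C sketch of this lookup; the 4-bit offset and 8-bit index widths here are illustrative choices, not from the slides:

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 4                    /* 16-byte blocks (illustrative) */
#define INDEX_BITS  8                    /* 256 blocks     (illustrative) */
#define NBLOCKS     (1u << INDEX_BITS)

static uint32_t tag[NBLOCKS];
static bool     valid[NBLOCKS];          /* initially all 0 = not present */

bool cache_hit(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NBLOCKS - 1);
    uint32_t t     = addr >> (OFFSET_BITS + INDEX_BITS);  /* high-order bits */
    return valid[index] && tag[index] == t;
}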

Cache Example
8 blocks, 1 word/block, direct mapped
Initial state

Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N

Cache Example
Index V Tag Data
110 Y 10 Mem[10110]

Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110

Cache Example
Index V Tag Data
010 Y 11 Mem[11010]
110 Y 10 Mem[10110]

Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010

Cache Example
Index V Tag Data
010 Y 11 Mem[11010]
110 Y 10 Mem[10110]

Word addr Binary addr Hit/miss Cache block
22 10 110 Hit 110
26 11 010 Hit 010

Cache Example
Index V Tag Data
000 Y 10 Mem[10000]
010 Y 11 Mem[11010]
011 Y 00 Mem[00011]
110 Y 10 Mem[10110]

Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000
3 00 011 Miss 011
16 10 000 Hit 000

Cache Example
Index V Tag Data
000 Y 10 Mem[10000]
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
110 Y 10 Mem[10110]

Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010
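The whole sequence above can be replayed in a few lines of C (a sketch: 8 one-word blocks, so index = address mod 8 and tag = address / 8):

#include <stdio.h>

int main(void) {
    int tag[8], valid[8] = {0};          /* initial state: all blocks invalid */
    int addrs[] = {22, 26, 22, 26, 16, 3, 16, 18};
    for (int i = 0; i < 8; i++) {
        int a = addrs[i], idx = a % 8, t = a / 8;
        int hit = valid[idx] && tag[idx] == t;
        printf("addr %2d -> block %d: %s\n", a, idx, hit ? "hit" : "miss");
        valid[idx] = 1;                  /* on a miss the block is fetched, */
        tag[idx] = t;                    /* replacing whatever was there */
    }
    return 0;
}

Running it reproduces the slides: misses on 22, 26, 16, 3, and 18 (which evicts 26 from block 010), hits on the repeated 22, 26, and 16.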

Address Subdivision

Example: Larger Block Size
64 blocks, 16 bytes/block
To what block number does address 1200 map?
Block address = 1200/16 = 75
Block number = 75 modulo 64 = 11

Block Size Considerations
Larger blocks should reduce miss rate
Due to spatial locality
But in a fixed-sized cache
Larger blocks → fewer of them
More competition → increased miss rate
Larger blocks → pollution
Larger miss penalty
Can override benefit of reduced miss rate
Early restart and critical-word-first can help

Cache Misses
On cache hit, CPU proceeds normally
On cache miss
Stall the CPU pipeline
Fetch block from next level of hierarchy
Instruction cache miss
Restart instruction fetch
Data cache miss
Complete data access

Write-Through
On data-write hit, could just update the block in cache
But then cache and memory would be inconsistent
Write through: also update memory
But makes writes take longer
e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
Effective CPI = 1 + 0.1×100 = 11
Solution: write buffer
Holds data waiting to be written to memory
CPU continues immediately
Only stalls on write if write buffer is already full
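A sketch of the write-buffer bookkeeping in C (the FIFO depth and the names here are illustrative assumptions, not from the slides):

#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 4                        /* illustrative buffer depth */

static struct { uint32_t addr, data; } wb[WB_DEPTH];
static int wb_head, wb_tail, wb_count;

/* Called on a store: true means the CPU continues immediately,
   false means it must stall until memory drains an entry */
bool write_buffer_put(uint32_t addr, uint32_t data) {
    if (wb_count == WB_DEPTH)
        return false;                     /* buffer full: stall */
    wb[wb_tail].addr = addr;
    wb[wb_tail].data = data;
    wb_tail = (wb_tail + 1) % WB_DEPTH;
    wb_count++;
    return true;
}

/* Called when the memory bus is free: retire one entry to memory */
void write_buffer_drain(void) {
    if (wb_count > 0) {
        /* memory_write(wb[wb_head].addr, wb[wb_head].data); -- hypothetical */
        wb_head = (wb_head + 1) % WB_DEPTH;
        wb_count--;
    }
}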

Write-Back
Alternative: On data-write hit, just update the block in cache
Keep track of whether each block is dirty
When a dirty block is replaced
Write it back to memory
Can use a write buffer to allow replacing block to be read first
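A sketch of the dirty-bit bookkeeping (illustrative; writeback_to_memory is a hypothetical helper standing in for the bus transaction):

#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 256                       /* illustrative */
static uint32_t tag[NBLOCKS], data[NBLOCKS];
static bool valid[NBLOCKS], dirty[NBLOCKS];

static void writeback_to_memory(int idx) {
    /* hypothetical: send block idx's address (from tag[idx]) and its
       data back to the next level of the hierarchy */
    (void)idx;
}

void write_hit(int idx, uint32_t value) {
    data[idx] = value;
    dirty[idx] = true;                    /* cache now differs from memory */
}

void replace_block(int idx, uint32_t new_tag, uint32_t new_data) {
    if (valid[idx] && dirty[idx])
        writeback_to_memory(idx);         /* write old contents back first */
    tag[idx] = new_tag;
    data[idx] = new_data;
    valid[idx] = true;
    dirty[idx] = false;                   /* fresh copy matches memory */
}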

Write Allocation
What should happen on a write miss?
Alternatives for write-through
Allocate on miss: fetch the block
Write around: don’t fetch the block
Since programs often write a whole block before reading it (e.g., initialization)
For write-back
Usually fetch the block

Example: Intrinsity FastMATH
Embedded MIPS processor
12-stage pipeline
Instruction and data access on each cycle
Split cache: separate I-cache and D-cache
Each 16KB: 256 blocks × 16 words/block
D-cache: write-through or write-back
SPEC2000 miss rates
I-cache: 0.4%
D-cache: 11.4%
Weighted average: 3.2%

Example: Intrinsity FastMATH

Main Memory Supporting Caches
Use DRAMs for main memory
Fixed width (e.g., 1 word)
Connected by fixed-width clocked bus
Bus clock is typically slower than CPU clock
Example cache block read
1 bus cycle for address transfer
15 bus cycles per DRAM access
1 bus cycle per data transfer
For 4-word block, 1-word-wide DRAM
Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
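The three memory organizations in this chapter all follow mechanically from these bus assumptions; a small C check (the results match the slides):

#include <stdio.h>

int main(void) {
    int addr = 1, dram = 15, xfer = 1;    /* bus cycles, per the slide */
    int words = 4, bytes = words * 4;     /* 4-word (16-byte) block */

    int narrow = addr + words * dram + words * xfer;  /* 1-word-wide:  65 */
    int wide   = addr + dram + xfer;                  /* 4-word-wide:  17 */
    int banked = addr + dram + words * xfer;          /* 4-bank:       20 */

    printf("1-word-wide: %2d cycles, %.2f B/cycle\n", narrow, (double)bytes / narrow);
    printf("4-word-wide: %2d cycles, %.2f B/cycle\n", wide,   (double)bytes / wide);
    printf("4-bank:      %2d cycles, %.2f B/cycle\n", banked, (double)bytes / banked);
    return 0;
}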

§5.4 Measuring and Improving Cache Performance

Measuring Cache Performance
Components of CPU time
Program execution cycles
Includes cache hit time
Memory stall cycles
Mainly from cache misses
With simplifying assumptions:
Memory stall cycles
= (Memory accesses / Program) × Miss rate × Miss penalty
= (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
I-cache miss rate = 2%
D-cache miss rate = 4%
Miss penalty = 100 cycles
Base CPI (ideal cache) = 2
Loads & stores are 36% of instructions
Miss cycles per instruction
I-cache: 0.02 × 100 = 2
D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
Ideal CPU is 5.44/2 = 2.72 times faster
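The same calculation as a C sketch (all constants from the example above):

#include <stdio.h>

int main(void) {
    double base_cpi = 2.0, penalty = 100.0;
    double i_miss = 0.02, d_miss = 0.04, ls_frac = 0.36;

    double i_stall = i_miss * penalty;            /* 2.00 cycles/instruction */
    double d_stall = ls_frac * d_miss * penalty;  /* 1.44 cycles/instruction */
    double cpi = base_cpi + i_stall + d_stall;    /* 5.44 */

    printf("actual CPI = %.2f (ideal CPU %.2fx faster)\n", cpi, cpi / base_cpi);
    return 0;
}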

Average Access Time
Hit time is also important for performance
Average memory access time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty
CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
AMAT = 1 + 0.05 × 20 = 2ns
2 cycles per instruction
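And the AMAT arithmetic, for completeness (constants from the slide):

#include <stdio.h>

int main(void) {
    double clock_ns = 1.0, hit_time = 1.0;              /* cycles */
    double miss_rate = 0.05, miss_penalty = 20.0;       /* cycles */
    double amat = hit_time + miss_rate * miss_penalty;  /* 2 cycles */
    printf("AMAT = %.1f ns\n", amat * clock_ns);        /* 2.0 ns */
    return 0;
}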

Performance Summary
When CPU performance increases
Miss penalty becomes more significant
Decreasing base CPI
Greater proportion of time spent on memory stalls
Increasing clock rate
Memory stalls account for more CPU cycles
Can’t neglect cache behavior when evaluating system performance

Associative Caches
Fully associative
Allow a given block to go in any cache entry
Requires all entries to be searched at once
Comparator per entry (expensive)
n-way set associative
Each set contains n entries
Block number determines which set
(Block number) modulo (#Sets in cache)
Search all entries in a given set at once
n comparators (less expensive)
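A hedged C sketch of an n-way lookup (the set and way counts are illustrative); hardware compares all n ways in parallel, which the loop here only models sequentially:

#include <stdint.h>
#include <stdbool.h>

#define NSETS 64                          /* illustrative */
#define NWAYS 4                           /* 4-way set associative */

static uint32_t tag[NSETS][NWAYS];
static bool     valid[NSETS][NWAYS];

bool sa_hit(uint32_t block_number) {
    uint32_t set = block_number % NSETS;  /* block number picks the set */
    uint32_t t   = block_number / NSETS;  /* remaining high bits are the tag */
    for (int way = 0; way < NWAYS; way++)
        if (valid[set][way] && tag[set][way] == t)
            return true;                  /* one of the n comparators matched */
    return false;
}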

Associative Cache Example
