
COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
Large and Fast: Exploiting Memory Hierarchy

Principle of Locality
 Programs access a small proportion of their address space at any time
 Temporal locality
 Items accessed recently are likely to be accessed again soon
 e.g., instructions in a loop, induction variables
 Spatial locality
 Items near those accessed recently are likely to be accessed soon
 E.g., sequential instruction access, array data
§5.1 Introduction
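
To make the two kinds of locality concrete, here is a minimal C sketch (the array name and size are illustrative): the loop reuses sum and i on every iteration (temporal locality) and walks the array sequentially (spatial locality), just as the loop's own instructions are re-fetched each pass.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    long sum = 0;

    /* Spatial locality: a[0], a[1], ... are adjacent in memory, so one
       fetched cache block serves several consecutive iterations. */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        /* Temporal locality: sum and i are touched on every iteration,
           and the loop's instructions are re-fetched on every pass. */
        sum += a[i];
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```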

Taking Advantage of Locality
 Memory hierarchy
 Store everything on disk
 Copy recently accessed (and nearby) items from disk to smaller DRAM memory
 Main memory
 Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
 Cache memory attached to CPU

Memory Hierarchy Levels
 Block (aka line): unit of copying
 May be multiple words
 If accessed data is present in upper level
 Hit: access satisfied by upper level
 Hit ratio: hits/accesses
 If accessed data is absent
 Miss: block copied from lower level
 Time taken: miss penalty
 Miss ratio: misses/accesses = 1 – hit ratio
 Then accessed data supplied from upper level

Memory Technology
 Static RAM (SRAM)
 0.5ns – 2.5ns, $2000 – $5000 per GB
 Dynamic RAM (DRAM)
 50ns – 70ns, $20 – $75 per GB
 Magnetic disk
 5ms – 20ms, $0.20 – $2 per GB
 Ideal memory
 Access time of SRAM
 Capacity and cost/GB of disk
§5.2 Memory Technologies

DRAM Technology
 Data stored as a charge in a capacitor
 Single transistor used to access the charge
 Must periodically be refreshed
 Read contents and write back
 Performed on a DRAM “row”

Advanced DRAM Organization
 Bits in a DRAM are organized as a rectangular array
 DRAM accesses an entire row
 Burst mode: supply successive words from a row with reduced latency
 Double data rate (DDR) DRAM
 Transfer on rising and falling clock edges
 Quad data rate (QDR) DRAM
 Separate DDR inputs and outputs

DRAM Generations
[Table: DRAM capacity and cost per GB by generation, ’80–’07]


DRAM Performance Factors
 Row buffer
 Allows several words to be read and refreshed in parallel
 Synchronous DRAM
 Allows for consecutive accesses in bursts without needing to send each address
 Improves bandwidth
 DRAM banking
 Allows simultaneous access to multiple DRAMs
 Improves bandwidth

Increasing Memory Bandwidth
 4-word wide memory
 Miss penalty = 1 + 15 + 1 = 17 bus cycles
 Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
 4-bank interleaved memory
 Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
 Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle

Flash Storage
 Nonvolatile semiconductor storage
 100× – 1000× faster than disk
 Smaller, lower power, more robust
 But more $/GB (between disk and DRAM)

Flash Types
 NOR flash: bit cell like a NOR gate
 Random read/write access
 Used for instruction memory in embedded systems
 NAND flash: bit cell like a NAND gate
 Denser (bits/area), but block-at-a-time access
 Cheaper per GB
 Used for USB keys, media storage, …
 Flash bits wear out after 1000s of accesses
 Not suitable for direct RAM or disk replacement
 Wear leveling: remap data to less used blocks
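
As a toy illustration of the wear-leveling idea only (real flash translation layers are far more elaborate; every name below is hypothetical), this sketch steers each rewrite of a hot logical block to the least-erased physical block, spreading wear roughly evenly:

```c
#include <stdio.h>

#define NBLOCKS 8

static int erase_count[NBLOCKS];   /* wear accumulated per physical block */
static int map[NBLOCKS];           /* logical block -> physical block */

/* Pick the least-worn physical block. */
static int least_worn(void) {
    int best = 0;
    for (int p = 1; p < NBLOCKS; p++)
        if (erase_count[p] < erase_count[best])
            best = p;
    return best;
}

/* Rewriting a logical block is remapped to a fresh physical block
   instead of erasing the same one over and over. */
static void rewrite(int logical) {
    int p = least_worn();
    map[logical] = p;
    erase_count[p]++;              /* the erase is what wears flash out */
}

int main(void) {
    for (int i = 0; i < 100; i++)
        rewrite(0);                /* one "hot" logical block */
    for (int p = 0; p < NBLOCKS; p++)
        printf("physical %d erased %d times\n", p, erase_count[p]);
    printf("logical 0 currently at physical %d\n", map[0]);
    return 0;
}
```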

Disk Storage
 Nonvolatile, rotating magnetic storage

Disk Sectors and Access
 Each sector records
 Sector ID
 Data (512 bytes, 4096 bytes proposed)
 Error correcting code (ECC)
 Used to hide defects and recording errors
 Synchronization fields and gaps
 Access to a sector involves
 Queuing delay if other accesses are pending
 Seek: move the heads
 Rotational latency
 Data transfer
 Controller overhead

Disk Access Example
 512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk
 Average read time
= 4ms seek time
+ ½ / (15,000/60) = 2ms rotational latency
+ 512 B / 100MB/s = 0.005ms transfer time
+ 0.2ms controller delay
= 6.2ms
 If actual average seek time is 1ms
 Average read time = 1 + 2 + 0.005 + 0.2 = 3.2ms
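
The same arithmetic as a short, runnable C check, using the slide's parameters (half a rotation is the average rotational latency):

```c
#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;                         /* average seek */
    double rpm           = 15000.0;
    double rotate_ms     = 0.5 / (rpm / 60.0) * 1000.0; /* half a rotation: 2 ms */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;      /* 512 B at 100 MB/s */
    double controller_ms = 0.2;

    printf("average read time = %.3f ms\n",
           seek_ms + rotate_ms + transfer_ms + controller_ms);  /* ~6.2 ms */
    return 0;
}
```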

Disk Performance Issues
 Manufacturers quote average seek time
 Based on all possible seeks
 Locality and OS scheduling lead to smaller actual average seek times
 Smart disk controllers allocate physical sectors on disk
 Present logical sector interface to host
 SCSI, ATA, SATA
 Disk drives include caches
 Prefetch sectors in anticipation of access
 Avoid seek and rotational delay

Cache Memory
 Cache memory
 The level of the memory hierarchy closest to the CPU
 Given accesses X1, …, Xn–1, Xn
 How do we know if the data is present?
 Where do we look?
§5.3 The Basics of Caches

Direct Mapped Cache
 Location determined by address
 Direct mapped: only one choice
 (Block address) modulo (#Blocks in cache)
 #Blocks is a power of 2
 Use low-order address bits

Tags and Valid Bits
 How do we know which particular block is stored in a cache location?
 Store block address as well as the data
 Actually, only need the high-order bits
 Called the tag
 What if there is no data in a location?
 Valid bit: 1 = present, 0 = not present
 Initially 0
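
Putting the tag, index, and valid bit together, a minimal C sketch of a direct-mapped lookup (struct and function names are illustrative; sized for the 8-block example on the following slides):

```c
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS 8                  /* must be a power of 2 */

struct line {
    bool     valid;                /* 0 initially: nothing cached yet */
    uint32_t tag;                  /* high-order bits of the block address */
    uint32_t data;                 /* one word per block in this example */
};

static struct line cache[NBLOCKS];

/* Returns true on a hit; on a miss the caller fetches the block
   from the lower level and fills cache[index]. */
bool lookup(uint32_t block_addr, uint32_t *out) {
    uint32_t index = block_addr % NBLOCKS;   /* low-order address bits */
    uint32_t tag   = block_addr / NBLOCKS;   /* remaining high-order bits */
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag) {
        *out = l->data;                      /* hit */
        return true;
    }
    return false;                            /* miss */
}
```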

Cache Example
 8 blocks, 1 word/block, direct mapped
 Initial state: all entries invalid (V = 0)

Cache Example
Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | Miss     | 110

Cache state: index 110 holds Tag=10, Mem[10110]; all other entries invalid

Cache Example
Word addr | Binary addr | Hit/miss | Cache block
26        | 11 010      | Miss     | 010

Cache state: index 010 holds Tag=11, Mem[11010]; index 110 holds Tag=10, Mem[10110]

Cache Example
Word addr | Binary addr | Hit/miss | Cache block
22        | 10 110      | Hit      | 110
26        | 11 010      | Hit      | 010

Cache state unchanged: index 010 holds Tag=11, Mem[11010]; index 110 holds Tag=10, Mem[10110]

Cache Example
Word addr | Binary addr | Hit/miss | Cache block
16        | 10 000      | Miss     | 000
3         | 00 011      | Miss     | 011
16        | 10 000      | Hit      | 000

Cache state: index 000 Tag=10 Mem[10000]; index 010 Tag=11 Mem[11010]; index 011 Tag=00 Mem[00011]; index 110 Tag=10 Mem[10110]

Cache Example
Word addr | Binary addr | Hit/miss | Cache block
18        | 10 010      | Miss     | 010 (replaces Mem[11010])

Cache state: index 000 Tag=10 Mem[10000]; index 010 Tag=10 Mem[10010]; index 011 Tag=00 Mem[00011]; index 110 Tag=10 Mem[10110]
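
A short C sketch that replays this access sequence on the same 8-block direct-mapped cache and reproduces the hit/miss pattern above (variable names are illustrative):

```c
#include <stdio.h>

#define NBLOCKS 8

int main(void) {
    int valid[NBLOCKS] = {0}, tag[NBLOCKS] = {0};
    int addrs[] = {22, 26, 22, 26, 16, 3, 16, 18};
    int n = sizeof addrs / sizeof addrs[0];

    for (int i = 0; i < n; i++) {
        int a = addrs[i];
        int idx = a % NBLOCKS, t = a / NBLOCKS;
        if (valid[idx] && tag[idx] == t) {
            printf("word addr %2d: hit  (block %d)\n", a, idx);
        } else {
            printf("word addr %2d: miss (block %d)\n", a, idx);
            valid[idx] = 1;        /* copy the block from the lower level, */
            tag[idx]   = t;        /* possibly replacing the old occupant  */
        }
    }
    return 0;
}
```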

Address Subdivision

Example: Larger Block Size
 64 blocks, 16 bytes/block
 To what block number does address 1200 map?
 Block address = 1200/16 = 75
 Block number = 75 modulo 64 = 11
Address layout (32-bit address): Tag = bits 31–10 (22 bits) | Index = bits 9–4 (6 bits) | Offset = bits 3–0 (4 bits)

Block Size Considerations
 Larger blocks should reduce miss rate
 Due to spatial locality
 But in a fixed-sized cache
 Larger blocks ⇒ fewer of them
 More competition ⇒ increased miss rate
 Larger blocks ⇒ pollution
 Larger miss penalty
 Can override benefit of reduced miss rate
 Early restart and critical-word-first can help

Cache Misses
 On cache hit, CPU proceeds normally
 On cache miss
 Stall the CPU pipeline
 Fetch block from next level of hierarchy
 Instruction cache miss
 Restart instruction fetch
 Data cache miss
 Complete data access

Write-Through
 On data-write hit, could just update the block in cache
 But then cache and memory would be inconsistent
 Write through: also update memory
 But makes writes take longer
 e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
 Effective CPI = 1 + 0.1×100 = 11
 Solution: write buffer
 Holds data waiting to be written to memory
 CPU continues immediately
 Only stalls on write if write buffer is already full

Write-Back
 Alternative: On data-write hit, just update the block in cache
 Keep track of whether each block is dirty
 When a dirty block is replaced
 Write it back to memory
 Can use a write buffer to allow replacing block to be read first
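
A minimal C sketch contrasting the two policies on a write hit and at eviction (all names are illustrative, not from the text; memory_write stands in for the next level of the hierarchy):

```c
#include <stdbool.h>
#include <stdint.h>

struct line { bool valid, dirty; uint32_t addr, data; };

/* Stand-in for updating the next level of the hierarchy. */
static void memory_write(uint32_t addr, uint32_t v) { (void)addr; (void)v; }

/* Write-through: update the cache and memory on every write hit.
   A write buffer lets the CPU continue unless the buffer is full. */
void write_through_hit(struct line *l, uint32_t v) {
    l->data = v;
    memory_write(l->addr, v);
}

/* Write-back: update only the cache and remember the block is dirty. */
void write_back_hit(struct line *l, uint32_t v) {
    l->data  = v;
    l->dirty = true;
}

/* Under write-back, a dirty block must be flushed when replaced. */
void evict(struct line *l) {
    if (l->dirty)
        memory_write(l->addr, l->data);
    l->valid = l->dirty = false;
}
```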

Write Allocation
 What should happen on a write miss?
 Alternatives for write-through
 Allocate on miss: fetch the block
 Write around: don’t fetch the block
 Since programs often write a whole block before reading it (e.g., initialization)
 For write-back
 Usually fetch the block

Main Memory Supporting Caches
 Use DRAMs for main memory
 Fixed width (e.g., 1 word)
 Connected by fixed-width clocked bus
 Bus clock is typically slower than CPU clock
 Example cache block read
 1 bus cycle for address transfer
 15 bus cycles per DRAM access
 1 bus cycle per data transfer
 For 4-word block, 1-word-wide DRAM
 Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
 Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
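
The three organizations side by side as a runnable C check, using the bus-cycle costs above (this also reproduces the numbers on the earlier "Increasing Memory Bandwidth" slide):

```c
#include <stdio.h>

int main(void) {
    int words = 4, bytes = 16;   /* one 4-word cache block */

    /* bus cycles: 1 address + DRAM accesses + word transfers */
    int one_word_wide  = 1 + words * 15 + words * 1;  /* 65 */
    int four_word_wide = 1 + 15 + 1;                  /* 17 */
    int four_bank      = 1 + 15 + words * 1;          /* 20: the 4 banks
                                                         overlap their
                                                         DRAM accesses */

    printf("1-word-wide: %2d cycles, %.2f B/cycle\n",
           one_word_wide, (double)bytes / one_word_wide);
    printf("4-word-wide: %2d cycles, %.2f B/cycle\n",
           four_word_wide, (double)bytes / four_word_wide);
    printf("4-bank:      %2d cycles, %.2f B/cycle\n",
           four_bank, (double)bytes / four_bank);
    return 0;
}
```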

Measuring Cache Performance
 Components of CPU time
 Program execution cycles
 Includes cache hit time
 Memory stall cycles
 Mainly from cache misses
 With simplifying assumptions:
Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
§5.4 Measuring and Improving Cache Performance

Cache Performance Example
 I-cache miss rate = 2%
 D-cache miss rate = 4%
 Miss penalty = 100 cycles
 Base CPI (ideal cache) = 2
 Loads & stores are 36% of instructions
 Miss cycles per instruction
 I-cache: 0.02 × 100 = 2
 D-cache: 0.36 × 0.04 × 100 = 1.44
 Actual CPI = 2 + 2 + 1.44 = 5.44
 Ideal CPU is 5.44/2 = 2.72 times faster
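
The same arithmetic as a runnable check (values taken directly from the slide):

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 2.0, miss_penalty = 100.0;
    double i_miss = 0.02, d_miss = 0.04, mem_frac = 0.36;

    double i_stall = i_miss * miss_penalty;            /* 2.00 per instruction */
    double d_stall = mem_frac * d_miss * miss_penalty; /* 1.44 per instruction */
    double cpi = base_cpi + i_stall + d_stall;         /* 5.44 */

    printf("actual CPI = %.2f; ideal CPU is %.2fx faster\n",
           cpi, cpi / base_cpi);
    return 0;
}
```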

Average Access Time
 Hit time is also important for performance
 Average memory access time (AMAT)
 AMAT = Hit time + Miss rate × Miss penalty
 Example
 CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
 AMAT = 1 + 0.05 × 20 = 2ns
 2 cycles per instruction
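
As a one-function C sketch of the AMAT formula with the slide's numbers:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (all in cycles here). */
static double amat(double hit, double miss_rate, double penalty) {
    return hit + miss_rate * penalty;
}

int main(void) {
    double cycles = amat(1.0, 0.05, 20.0);   /* 2 cycles */
    printf("AMAT = %.1f cycles = %.1f ns at a 1 ns clock\n",
           cycles, cycles * 1.0);
    return 0;
}
```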

Performance Summary
 When CPU performance increases
 Miss penalty becomes more significant
 Decreasing base CPI
 Greater proportion of time spent on memory stalls
 Increasing clock rate
 Memory stalls account for more CPU cycles
 Can’t neglect cache behavior when evaluating system performance

Associative Caches
 Fully associative
 Allow a given block to go in any cache entry
 Requires all entries to be searched at once
 Comparator per entry (expensive)
 n-way set associative
 Each set contains n entries
 Block number determines which set
 (Block number) modulo (#Sets in cache)
 Search all entries in a given set at once
 n comparators (less expensive)
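
A minimal C sketch of an n-way lookup (names and sizes are illustrative): hardware searches the n ways of the selected set in parallel with n comparators; the loop below models that search sequentially.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSETS 4
#define NWAYS 2

struct way { bool valid; uint32_t tag, data; };
static struct way sets[NSETS][NWAYS];

/* Hardware compares all NWAYS tags of the selected set at once;
   this loop models that parallel search sequentially. */
bool lookup(uint32_t block_addr, uint32_t *out) {
    uint32_t set = block_addr % NSETS;   /* block number selects the set */
    uint32_t tag = block_addr / NSETS;

    for (int w = 0; w < NWAYS; w++) {
        if (sets[set][w].valid && sets[set][w].tag == tag) {
            *out = sets[set][w].data;    /* hit in way w */
            return true;
        }
    }
    return false;                        /* miss: pick a way to replace */
}
```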

Associative Cache Example

Spectrum of Associativity
 For a cache with 8 entries

Associativity Example
 Compare 4-block caches
 Direct mapped, 2-way set associative, fully associative
 Block access sequence: 0, 8, 0, 6, 8
 Direct mapped
Block addr | Cache index | Hit/miss | Cache content after access
0          | 0           | Miss     | Mem[0]
8          | 0           | Miss     | Mem[8]
0          | 0           | Miss     | Mem[0]
6          | 2           | Miss     | Mem[0], Mem[6]
8          | 0           | Miss     | Mem[8], Mem[6]
5 misses

Associativity Example
 2-way set associative
Block addr | Cache index | Hit/miss | Cache content after access (set 0)
0          | 0           | Miss     | Mem[0]
8          | 0           | Miss     | Mem[0], Mem[8]
0          | 0           | Hit      | Mem[0], Mem[8]
6          | 0           | Miss     | Mem[0], Mem[6]  (LRU evicts Mem[8])
8          | 0           | Miss     | Mem[8], Mem[6]  (LRU evicts Mem[0])
4 misses
 Fully associative
Block addr | Hit/miss | Cache content after access
0          | Miss     | Mem[0]
8          | Miss     | Mem[0], Mem[8]
0          | Hit      | Mem[0], Mem[8]
6          | Miss     | Mem[0], Mem[8], Mem[6]
8          | Hit      | Mem[0], Mem[8], Mem[6]
3 misses
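
A short C sketch that replays the sequence 0, 8, 0, 6, 8 on the 2-way cache with LRU replacement and reproduces the middle table (names are illustrative):

```c
#include <stdio.h>

#define NSETS 2
#define NWAYS 2

int main(void) {
    int valid[NSETS][NWAYS] = {0}, tag[NSETS][NWAYS] = {0};
    int lru[NSETS] = {0};                /* which way is least recently used */
    int seq[] = {0, 8, 0, 6, 8};

    for (int i = 0; i < 5; i++) {
        int b = seq[i], set = b % NSETS, t = b / NSETS, hit_way = -1;
        for (int w = 0; w < NWAYS; w++)
            if (valid[set][w] && tag[set][w] == t)
                hit_way = w;
        if (hit_way >= 0) {
            printf("block %d: hit\n", b);
            lru[set] = 1 - hit_way;      /* the other way becomes LRU */
        } else {
            int w = !valid[set][0] ? 0   /* prefer an invalid way */
                  : !valid[set][1] ? 1
                  : lru[set];            /* otherwise evict the LRU way */
            printf("block %d: miss (set %d, way %d)\n", b, set, w);
            valid[set][w] = 1;
            tag[set][w]   = t;
            lru[set] = 1 - w;            /* just-filled way is most recent */
        }
    }
    return 0;
}
```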

How Much Associativity
 Increased associativity decreases miss rate
 But with diminishing returns
 Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
 1-way: 10.3%
 2-way: 8.6%
 4-way: 8.3%
 8-way: 8.1%

Set Associative Cache Organization

Replacement Policy
 Direct mapped: no choice
 Set associative
 Prefer non-valid entry, if there is one
 Otherwise, choose among entries in the set
 Least-recently used (LRU)
 Choose the one unused for the longest time
 Simple for 2-way, manageable for 4-way, too hard beyond that
 Random
 Gives approximately the same performance as LRU for high associativity

Multilevel Caches
 Primary cache attached to CPU
 Small, but fast
 Level-2 cache services misses from primary cache
 Larger, slower, but still faster than main memory
 Main memory services L-2 cache misses
 Some high-end systems include L-3 cache

Multilevel Cache Example
 CPU base CPI = 1, clock rate = 4GHz
 Miss rate/instruction = 2%
 Main memory access time = 100ns
 With just primary cache
 Miss penalty = 100ns/0.25ns = 400 cycles
 Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
 Now add L-2 cache
 Access time = 5ns
 Global miss rate to main memory = 0.5%
 Primary miss with L-2 hit
 Penalty = 5ns/0.25ns = 20 cycles
 Primary miss with L-2 miss
 Extra penalty = 400 cycles
 CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
 Performance ratio = 9/3.4 = 2.6
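
The two-level arithmetic as a runnable check (values from the slides):

```c
#include <stdio.h>

int main(void) {
    double base = 1.0;
    double l1_miss_rate = 0.02,  l2_hit_penalty = 20.0;  /* 5 ns / 0.25 ns   */
    double global_miss  = 0.005, mem_penalty    = 400.0; /* 100 ns / 0.25 ns */

    double one_level = base + l1_miss_rate * mem_penalty;  /* 9.0 */
    double two_level = base + l1_miss_rate * l2_hit_penalty
                            + global_miss * mem_penalty;   /* 3.4 */

    printf("CPI %.1f -> %.1f, ratio %.1f\n",
           one_level, two_level, one_level / two_level);
    return 0;
}
```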

Multilevel Cache Considerations
 Primary cache
 Focus on minimal hit time
 L-2 cache
 Focus on low miss rate to avoid main memory access
 Hit time has less overall impact
 L-1 cache usually smaller than a single cache
 L-1 block size smaller than L-2 block size

Concluding Remarks
 Fast memories are small, large memories are slow
 We really want fast, large memories
 Caching gives this illusion
 Principle of locality
 Programs use a small part of their memory space frequently
 Memory hierarchy
 L1 cache  L2 cache  …  DRAM memory  disk
 Memory system design is critical for multiprocessors
§5.17 Concluding Remarks
