
The Memory Hierarchy
1

Random-Access Memory (RAM)
Key features
- RAM is traditionally packaged as a chip.
- Basic storage unit is normally a cell (one bit per cell).
- Multiple RAM chips form a memory.
RAM comes in two varieties:
- SRAM (Static RAM)
- DRAM (Dynamic RAM)
2

SRAM vs DRAM Summary

       Trans.   Access  Needs     Needs
       per bit  time    refresh?  EDC?   Cost  Applications
SRAM   4 or 6   1X      No        Maybe  100x  Cache memories
DRAM   1        10X     Yes       Yes    1X    Main memories, frame buffers
3

Nonvolatile Memories
DRAM and SRAM are volatile memories
- Lose information if powered off.
Nonvolatile memories retain their value even if powered off
- Read-only memory (ROM): programmed during production
- Programmable ROM (PROM): can be programmed once
- Erasable PROM (EPROM): can be bulk erased (UV, X-ray)
- Electrically erasable PROM (EEPROM): electronic erase capability
- Flash memory: EEPROMs with partial (block-level) erase capability
  - Wears out after about 100,000 erasings
Uses for Nonvolatile Memories
- Firmware programs stored in a ROM (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems, ...)
- Solid state disks (replace rotating disks in thumb drives, smart phones, mp3 players, tablets, laptops, ...)
- Disk caches
4

Traditional Bus Structure Connecting CPU and Memory
A bus is a collection of parallel wires that carry address, data, and control signals.
Buses are typically shared by multiple devices.

[Figure: the CPU chip (register file, ALU) connects through a bus interface to the system bus; an I/O bridge links the system bus to the memory bus and main memory.]
5

Memory Read Transaction (1)
Load operation: movq A, %rax
CPU places address A on the memory bus.

[Figure: the bus interface drives address A onto the system bus; main memory holds word x at address A.]
6

Memory Read Transaction (2)
Load operation: movq A, %rax
Main memory reads A from the memory bus, retrieves word x, and places it on the bus.

[Figure: main memory drives word x onto the memory bus toward the I/O bridge.]
7

Memory Read Transaction (3)
Load operation: movq A, %rax
CPU reads word x from the bus and copies it into register %rax.

[Figure: word x arrives at the bus interface and is written into %rax.]
8

Memory Write Transaction (1)
Store operation: movq %rax, A
CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive.

[Figure: %rax holds word y; the bus interface drives address A onto the bus.]
9

Memory Write Transaction (2)
Store operation: movq %rax, A
CPU places data word y on the bus.

[Figure: the bus interface drives word y onto the bus toward main memory.]
10

Memory Write Transaction (3)
Store operation: movq %rax, A
Main memory reads data word y from the bus and stores it at address A.

[Figure: word y is written into main memory at address A.]
11

Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.
12

Locality Example

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data references
- Reference array elements in succession (stride-1 reference pattern): spatial locality
- Reference variable sum each iteration: temporal locality
Instruction references
- Reference instructions in sequence: spatial locality
- Cycle through loop repeatedly: temporal locality
13

Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices at the bottom.
- L0: Regs - CPU registers hold words retrieved from the L1 cache.
- L1: L1 cache (SRAM) - holds cache lines retrieved from the L2 cache.
- L2: L2 cache (SRAM) - holds cache lines retrieved from the L3 cache.
- L3: L3 cache (SRAM) - holds cache lines retrieved from main memory.
- L4: Main memory (DRAM) - holds disk blocks retrieved from local disks.
- L5: Local secondary storage (local disks) - holds files retrieved from disks on remote servers.
- L6: Remote secondary storage (e.g., Web servers)
14

Caches
Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
- For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work?
- Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
- Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
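The stride-1 pattern above can be made concrete by counting how many distinct cache blocks a scan touches. This is a minimal sketch under an assumed 64 B block model (the names `BLOCK_BYTES` and `blocks_touched` are illustrative, not from the slides): a sequential scan touches each block only once, while a scan that jumps a whole block per access gets no spatial locality.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_BYTES 64   /* assumed block size, matching later slides */

/* Scan n ints starting at byte address base, visiting every stride-th
   element, and count the distinct 64 B blocks referenced. Accesses are
   monotonically increasing, so comparing against the previous block
   suffices to detect a new one. */
size_t blocks_touched(size_t base, size_t n, size_t stride) {
    size_t count = 0, last_block = (size_t)-1;
    for (size_t i = 0; i < n; i += stride) {
        size_t block = (base + i * sizeof(int)) / BLOCK_BYTES;
        if (block != last_block) {
            count++;
            last_block = block;
        }
    }
    return count;
}
```

For 256 ints (1024 B), a stride-1 scan makes 256 accesses but touches only 16 blocks (one cold miss per 16 accesses), while a stride-16 scan makes 16 accesses and touches 16 blocks (every access lands in a new block).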
15

General Cache Concepts
Cache: smaller, faster, more expensive memory; caches a subset of the blocks.
Memory: larger, slower, cheaper; viewed as partitioned into "blocks" (numbered 0-15 in the example).
Data is copied between cache and memory in block-sized transfer units.
16

General Cache Concepts: Hit
Request: 14. Data in block b is needed.
Block b is in cache: Hit!
17

General Cache Concepts: Miss
Request: 12. Data in block b is needed.
Block b is not in cache: Miss!
Block b is fetched from memory and stored in cache.
- Placement policy: determines where b goes
- Replacement policy: determines which block gets evicted (victim)
18

General Caching Concepts: Types of Cache Misses
Cold (compulsory) miss
- Cold misses occur because the cache is empty.
Conflict miss
- Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  - E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
- Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  - E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
- Occurs when the set of active cache blocks (working set) is larger than the cache.
19

Example Memory Hierarchy (repeated)
Smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) devices at the bottom.
- L0: Regs - CPU registers hold words retrieved from the L1 cache.
- L1: L1 cache (SRAM) - holds cache lines retrieved from the L2 cache.
- L2: L2 cache (SRAM) - holds cache lines retrieved from the L3 cache.
- L3: L3 cache (SRAM) - holds cache lines retrieved from main memory.
- L4: Main memory (DRAM) - holds disk blocks retrieved from local disks.
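The conflict-miss example in the slides (blocks placed at position i mod 4, trace 0, 8, 0, 8, ...) can be sketched directly. This is an illustrative model, not the deck's code; `count_misses` and `SLOTS` are assumed names.

```c
#include <assert.h>

#define SLOTS 4   /* the slides' example: block i must go in slot (i mod 4) */

/* Count misses for a trace of block numbers in a 4-slot direct-mapped
   placement: a reference misses when its slot holds a different block
   (or is empty), and the referenced block is then installed. */
int count_misses(const int *trace, int n) {
    int slot[SLOTS];
    for (int i = 0; i < SLOTS; i++) slot[i] = -1;  /* -1 = empty (cold) */
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int b = trace[i];
        if (slot[b % SLOTS] != b) {   /* not present: miss, install */
            misses++;
            slot[b % SLOTS] = b;
        }
    }
    return misses;
}
```

The trace 0, 8, 0, 8, 0, 8 misses all six times because blocks 0 and 8 both map to slot 0 and keep evicting each other, even though three slots stay empty, which is exactly what makes it a conflict miss rather than a capacity miss.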
- L5: Local secondary storage (local disks) - holds files retrieved from disks on remote servers.
- L6: Remote secondary storage (e.g., Web servers)
20

General Cache Concept (repeated)
Cache: smaller, faster, more expensive memory; caches a subset of the blocks. Memory: larger, slower, cheaper; viewed as partitioned into "blocks". Data is copied in block-sized transfer units.
21

Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
- Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
Typical system structure: the cache memory sits on the CPU chip next to the register file and ALU; the bus interface connects it via the system bus, I/O bridge, and memory bus to main memory.
22

General Cache Organization (S, E, B)
- S = 2^s sets
- E = 2^e lines per set
- B = 2^b bytes per cache block (the data)
Each line holds a valid bit v, a tag, and a B-byte block (bytes 0, 1, 2, ..., B-1).
Cache size: C = S x E x B bytes
23

Cache Read
Address of word: t tag bits | s set index bits | b block offset bits
- Locate set
- Check if any line in set has matching tag
- Yes + line valid: hit
- Locate data starting at offset (data begins at this offset within the block)
24

Ex.: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
Address of int: t bits | 0...01 | 100
The set index bits (0...01) find the set; there are S = 2^s sets in total.
25

Ex.: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
valid? + match (assume yes) = hit; the block offset (100) locates the data.
26

Ex.: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume: cache block size 8 bytes.
Valid?
+ match (assume yes) = hit; the int (4 bytes) is at the block offset.
If tag doesn't match: old line is evicted and replaced.
27

Direct-Mapped Cache Simulation
t=1, s=2, b=1; M=16 bytes (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 block/set
Address trace (reads, one byte per read):
  0 [0000], miss
  1 [0001], hit
  7 [0111], miss
  8 [1000], miss
  0 [0000], miss
Final state: Set 0 holds tag 0, M[0-1] (8 evicted it, then 0 reloaded it); Set 3 holds tag 0, M[6-7].
28

E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: t bits | 0...01 | 100
The set index bits find the set; each set holds two (v, tag, block) lines.
29

E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
Compare both lines in the set: valid? + tag match: yes = hit; the block offset locates the data.
30

E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
valid? + match: yes = hit; the short int (2 bytes) is at the block offset.
No match:
- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...
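The direct-mapped simulation above can be sketched in a few lines of C. This is an illustrative model of the slide's parameters (t=1, s=2, b=1), not code from the deck; `Line` and `cache_access` are assumed names.

```c
#include <assert.h>

/* One line per set: valid bit + tag. 4-bit addresses, S=4 sets. */
typedef struct { int valid; int tag; } Line;

/* Returns 1 on hit, 0 on miss; on a miss the block is installed. */
int cache_access(Line cache[4], int addr) {
    int set = (addr >> 1) & 0x3;   /* s=2 set-index bits (skip b=1 offset bit) */
    int tag = (addr >> 3) & 0x1;   /* t=1 tag bit */
    if (cache[set].valid && cache[set].tag == tag)
        return 1;                  /* hit */
    cache[set].valid = 1;          /* miss: evict/install */
    cache[set].tag = tag;
    return 0;
}
```

Running the slide's trace 0, 1, 7, 8, 0 reproduces miss, hit, miss, miss, miss: addresses 0 and 8 share set 0 but differ in the tag bit, so the final read of 0 misses again.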
31

2-Way Set Associative Cache Simulation
t=2, s=1, b=1; M=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/set
Address trace (reads, one byte per read):
  0 [0000], miss
  1 [0001], hit
  7 [0111], miss
  8 [1000], miss
  0 [0000], hit
Final state: Set 0 holds {tag 00, M[0-1]} and {tag 10, M[8-9]}; Set 1 holds {tag 01, M[6-7]}.
32

Ex.: Fully Associative Cache (S = 1)
One set containing all lines. Assume: cache block size 8 bytes.
Address of int: t bits | 100 (no set index bits; the tag is compared against every line).
33

Cache associativity
- Direct mapped: E = 1 (1 line per set)
- n-way associative: n lines per set
- Fully associative: S = 1 (1 set containing all lines)
34

Cache associativity
Requested address gives all information needed to locate data in the cache:
- Set index: which bookshelf?
- Tag: book title
- Block index: what page of the book?
35

Cache Size
C = S * E * B
- S = #sets
- E = #lines / set (associativity)
- B = #bytes / block (in a line)
Suppose we're given:
- 32 KB cache
- 64 B line
- 4-way set associative
- 32-bit architecture
How many tag, set, block index bits?
36

Cache Size
Suppose we're given: 32 KB cache, 64 B line, 4-way set associative, 32-bit architecture.
How many tag, set, block index bits?
- 64 B per block = ? bits
- How many lines in a 32 KB cache?
- How many sets to hold those lines?
37

Cache Size
Suppose we're given: 32 KB cache, 64 B line, 4-way set associative, 32-bit architecture.
How many tag, set, block index bits?
- 64 B per block = 6 bits
- 32 KB cache / 64 B per line = 2^9 lines
- 2^9 lines / 4 lines per set = 2^7 sets
So 6 block index bits, 7 set index bits, ??? tag bits
38

Cache Size
Suppose we're given:
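The 2-way simulation with LRU replacement can likewise be sketched. This is an assumed model of the slide's parameters (t=2, s=1, b=1), with illustrative names (`Way`, `sa_access`).

```c
#include <assert.h>

/* Each set has two ways; lru is 1 for the older way, 0 for the newer. */
typedef struct { int valid, tag, lru; } Way;

/* Returns 1 on hit, 0 on miss; misses fill an empty way or evict LRU. */
int sa_access(Way cache[2][2], int addr) {
    int set = (addr >> 1) & 0x1;   /* s=1 set bit (skip b=1 offset bit) */
    int tag = (addr >> 2) & 0x3;   /* t=2 tag bits */
    Way *s = cache[set];
    for (int w = 0; w < 2; w++)
        if (s[w].valid && s[w].tag == tag) {   /* hit: mark recently used */
            s[w].lru = 0; s[1 - w].lru = 1;
            return 1;
        }
    /* miss: prefer an invalid way, otherwise evict the LRU way */
    int victim = !s[0].valid ? 0 : (!s[1].valid ? 1 : (s[0].lru ? 0 : 1));
    s[victim].valid = 1; s[victim].tag = tag;
    s[victim].lru = 0; s[1 - victim].lru = 1;
    return 0;
}
```

On the same trace 0, 1, 7, 8, 0 this gives miss, hit, miss, miss, hit: with two ways per set, blocks 0 and 8 coexist in set 0, so the final read of 0 now hits, which is the point of the comparison with the direct-mapped run.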
32 KB cache, 64 B line, 4-way set associative, 32-bit architecture.
How many tag, set, block index bits?
- 64 B per block = 6 bits
- 32 KB cache / 64 B per line = 2^9 lines
- 2^9 lines / 4 lines per set = 2^7 sets
- So 6 block index bits, 7 set index bits
- 32 - 6 - 7 = 19 tag bits
39

Cache Size
Read 0x2b74ce2d (19 tag, 7 set, 6 block).
What are the tag, set, block index?
40

Cache Size
Read 0x2b74ce2d (19 tag, 7 set, 6 block).
0x2b74ce2d = 0010 1011 0111 0100 1100 1110 0010 1101
41

Cache Size
Read 0x2b74ce2d (19 tag, 7 set, 6 block).
0x2b74ce2d = 0010101101110100110 | 0111000 | 101101
- Tag = 0010101101110100110
- Set index = 0111000 (56)
- Block index = 101101 (45)
42

What about writes?
Multiple copies of data exist: L1, L2, L3, main memory, disk.
What to do on a write-hit?
- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line)
  - Need a dirty bit (line different from memory or not)
What to do on a write-miss?
- Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location follow
- No-write-allocate (writes straight to memory, does not load into cache)
Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate
43

Intel Core i7 Cache Hierarchy
Processor package: Core 0 ... Core 3, each with Regs, L1 i-cache, L1 d-cache, and an L2 unified cache.
- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 10 cycles
- L3 unified cache: 8 MB, 16-way, access: 40-75 cycles (shared by all cores)
Block size: 64 bytes for all caches.
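The bit-slicing worked through above can be checked mechanically. A minimal sketch for the assumed 32 KB / 64 B / 4-way cache (6 offset bits, 7 set bits, 19 tag bits); the helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths derived on the slides: 64 B block -> 6 offset bits,
   2^7 sets -> 7 set bits, remaining 32-6-7 = 19 bits are the tag. */
enum { OFFSET_BITS = 6, SET_BITS = 7 };

uint32_t block_offset(uint32_t a) { return a & ((1u << OFFSET_BITS) - 1); }
uint32_t set_index(uint32_t a)    { return (a >> OFFSET_BITS) & ((1u << SET_BITS) - 1); }
uint32_t tag_bits(uint32_t a)     { return a >> (OFFSET_BITS + SET_BITS); }
```

Applying these to the slide's address 0x2b74ce2d gives block offset 45, set index 56, and tag 0b0010101101110100110 (0x15ba6), matching the hand decomposition.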
Main memory lies below the shared L3 unified cache.
44

Cache Performance Metrics
Miss rate
- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit time
- Time to deliver a line in the cache to the processor
  - Includes time to determine whether the line is in the cache
- Typical numbers: 4 clock cycles for L1; 10 clock cycles for L2
Miss penalty
- Additional time required because of a miss
  - Typically 50-200 cycles for main memory (Trend: increasing!)
45

Cache Performance Metrics
Average memory access time (AMAT)
- Each request may hit or miss. What is the average time over many requests?
- AMAT = H + MR * AMP
  - H = hit latency
  - MR = miss rate
  - AMP = average miss penalty
46

Cache Performance Metrics
AMAT = H + MR * AMP
Suppose we're given:
- Hit latency of L1 = 1 cycle
- Hit rate of L1 = 30%
- Latency of main memory = 200 cycles
What is AMAT?
47

Cache Performance Metrics
AMAT = H + MR * AMP
With the numbers above:
1 + 0.70 * 200 = 141 cycles
48

Cache Performance Metrics
What about multi-level caches?
- AMAT(L1) = H(L1) + MR(L1) * AMAT(L2)
- AMAT(L2) = H(L2) + MR(L2) * AMP
Suppose we're given:
- L1: 1 cycle, 30% hit rate
- L2: 10 cycles, 55% hit rate
- RAM: 200 cycles
What is AMAT?
49

Cache Performance Metrics
What about multi-level caches?
AMAT(L1) = 1 + 0.70 * AMAT(L2)
50

Cache Performance Metrics
What about multi-level caches?
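The AMAT formula composes naturally: the miss penalty of one level is the AMAT of the level below it. A one-function sketch using the slides' numbers (the function name `amat` is an assumption):

```c
#include <assert.h>
#include <math.h>

/* AMAT = H + MR * AMP: hit latency plus miss rate times miss penalty.
   For a multi-level hierarchy, pass the lower level's AMAT as the
   miss penalty. */
double amat(double hit_latency, double miss_rate, double miss_penalty) {
    return hit_latency + miss_rate * miss_penalty;
}
```

With the slides' numbers: single level, amat(1, 0.70, 200) = 141 cycles; two levels, amat(1, 0.70, amat(10, 0.45, 200)) = 1 + 0.70 * 100 = 71 cycles.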
With the numbers above:
AMAT = 1 + 0.70 * (10 + 0.45 * 200) = 71 cycles
51

Let's think about those numbers
Huge difference between a hit and a miss
- Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
- Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
- Average access time:
  - 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
  - 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
- This is why "miss rate" is used instead of "hit rate"
52

Writing Cache Friendly Code
- Make the common case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)
Key idea: our qualitative notion of locality is quantified through our understanding of cache memories.
53

Cache Performance
What is the hit rate of this code? (64 B block, int is 4 B)

int a[m][n];
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
        a[i][j] += 1;

54

Cache Performance
What is the hit rate of this code? (64 B block, int is 4 B)
- Each cache line (64 B) holds 16 ints
- First access will miss, next 15 will hit
- Then another miss, then another 15 hits, ...
- Hit rate = 15/16
55

Matrix Multiplication Example
/* ijk */
Description: variable sum held in register
[code truncated in source: for (j=0; j ...]

Per-iteration miss rates:
- If block size B > sizeof(aij) bytes: exploit spatial locality
  - miss rate = sizeof(aij) / B
- Accesses to distant elements: no spatial locality!
  - miss rate = 1 (i.e. 100%)
58
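The 15/16 hit rate above follows from the sequential access pattern: only the first int of each 64 B block misses. A small sketch under the slides' assumptions (cold cache, block-aligned array, 4 B ints); `row_major_hits` and `INTS_PER_BLOCK` are illustrative names, not a full cache simulator.

```c
#include <assert.h>

enum { INTS_PER_BLOCK = 64 / 4 };   /* 64 B block holds 16 ints */

/* Count hits for m*n sequential int accesses: with a purely sequential
   scan, an access misses exactly when it is the first int of a block. */
long row_major_hits(long m, long n) {
    long hits = 0;
    for (long k = 0; k < m * n; k++)
        if (k % INTS_PER_BLOCK != 0)
            hits++;
    return hits;
}
```

For a 4 x 16 array (64 accesses) this gives 60 hits and 4 misses, i.e. a hit rate of 60/64 = 15/16, matching the slide's analysis.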

Matrix Multiplication (ijk)
/* ijk */
Inner loop: (i,*)
[Figure: access patterns in matrices A, B, C]
for (i=0; i