

Caches I

http://xkcd.com/1353/

CMPT 295
L14: Caches I

Roadmap
C:
car *c = malloc(sizeof(car));
c->miles = 100;
c->gals = 17;
float mpg = get_mpg(c);
free(c);

Java:
Car c = new Car();
c.setMiles(100);
c.setGals(17);
float mpg = c.getMPG();

Assembly language:

Machine code:
0111010000011000
100011010000010000000010
1000100111000010
110000011111101000011111

Computer system:

OS:

Memory & data
Arrays & structs
Integers & floats
RISC-V assembly
Procedures & stacks
Executables
Memory & caches
Processor Pipeline
Performance
Parallelism



How does execution time grow with SIZE?
int array[SIZE];
int sum = 0;

for (int i = 0; i < 200000; i++) {
    for (int j = 0; j < SIZE; j++) {
        sum += array[j];
    }
}

[Plot: execution time vs. SIZE]

What should we expect for this experiment? We are accessing and adding 200,000 x SIZE times.

Actual data [plot: time vs. SIZE]: there is a knee in the curve after some size threshold, the memory "wall". A steeper slope means execution time grows faster: below the threshold the array fits in cache, above it it doesn't.

How does execution time compare ROW-wise vs. COL-wise?

ROW-wise:
for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++)
        sum += array[i][j];

COL-wise:
for (int i = 0; i < SIZE; i++) {
    for (int j = 0; j < SIZE; j++) {
        sum += array[j][i];
    }
}

What should we expect for this experiment?

Actual data [bar chart]: row-wise about 191, column-wise about 1713 (same time units); the column-wise traversal is roughly 9x slower.

How does execution time vary with stride?

for (int i = 0; i < SIZE; i += stride) {
    sum += array[i];
}

[Plot: time vs. stride. Measured points: stride 1, 2, 3, 4, 8, 16, 32, 64, 128, 256, 1024 -> time 1, 0.95, 0.9, 0.9, 1, 0.9, 2, 4, 10, 15, 20]

Programs 101
Load/store architectures:
- Read data from memory (put it in registers)
- Manipulate it
- Store it back to memory

C code:
int main(int argc, char *argv[]) {
    int i;
    int m = n;
    int sum = 0;
    for (i = 1; i <= m; i++) {
        sum += i;
    }
    printf("...", n, sum);
}

RISC-V assembly:
main:
    addi  sp,sp,-48
    sw    ra,44(sp)
    sw    fp,40(sp)
    move  fp,sp
    sw    a0,-36(fp)
    sw    a1,-40(fp)
    la    a5,n
    lw    a5,0(x15)
    sw    a5,-28(fp)
    sw    x0,-24(fp)
    li    a5,1
    sw    a5,-20(fp)
L2:
    lw    a4,-20(fp)
    lw    a5,-28(fp)
    blt   a5,a4,L3
    . . .

Instructions that read from or write to memory…

Making memory accesses fast!
- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches

Processor-Memory Gap
The processor-memory performance gap grows about 50% per year ("Moore's Law": µProc 55%/year, i.e. 2x every 1.5 years; DRAM 7%/year, i.e. 2x every 10 years).
- 1989: first Intel CPU with a cache on chip
- 1998: Pentium III has two cache levels on chip
Processor speed is how fast you can crunch stuff; memory speed is how fast you can get stuff.

Problem: Processor-Memory Bottleneck
- Processor performance doubled about every 18 months; bus latency/bandwidth evolved much slower.
- Core 2 Duo: can process at least 256 bytes/cycle, but memory bandwidth is only 2 bytes/cycle, with a latency of 100-200 cycles (30-60 ns).
- Problem: lots of waiting on memory. (A cycle is a single machine step, fixed time.)
- Solution: caches.
(Analogy: I can process avocados very quickly; it just takes me a long time to get them. Also: "There are only 2 hard problems in computer science: cache invalidation, naming things, and off-by-one errors.")

Cache 💰
- Pronunciation: "cash"; we abbreviate it as "$".
- English: a hidden storage space for provisions, weapons, and/or treasures.
- Computer: memory with short access time used for the storage of frequently or recently used instructions (i-cache/I$) or data (d-cache/D$).
- More generally: used to optimize data transfers between any system elements with different characteristics (network interface cache, file cache, etc.).
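As a concrete reference, here is a minimal, self-contained sketch of the SIZE-sweep experiment plotted above. It is my own harness, not the course's: the array bound, the element initialization, and the use of clock() for timing are assumptions; only the 200,000-iteration outer loop comes from the slide.

#include <stdio.h>
#include <time.h>

#define REPS 200000              /* same outer-loop count as the slide        */
#define MAX_ELEMS (1 << 16)      /* assumed upper bound on SIZE (64 Ki ints)  */

static int array[MAX_ELEMS];

int main(void) {
    /* Initialize the array so the compiler cannot fold the sum away. */
    for (int j = 0; j < MAX_ELEMS; j++)
        array[j] = j;

    /* Sweep SIZE and time the same summing loop as in the slide
       (the largest sizes take a little while). */
    for (int size = 1 << 6; size <= MAX_ELEMS; size <<= 1) {
        long long sum = 0;
        clock_t start = clock();
        for (int i = 0; i < REPS; i++)
            for (int j = 0; j < size; j++)
                sum += array[j];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("SIZE = %6d   time = %7.3f s   (sum = %lld)\n", size, secs, sum);
    }
    return 0;
}

Dividing each time by REPS x SIZE gives the time per element, which typically stays flat until the array outgrows a cache level; that is the "knee" the plots refer to.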
General Cache Mechanics
- Cache: smaller, faster, more expensive memory; holds a subset of the blocks (in the picture, blocks 7, 9, 14, and 3).
- Memory: larger, slower, cheaper memory, viewed as partitioned into "blocks" (numbered 0-15 in the picture).
- Data is copied between them in block-sized transfer units.
(Note: the numbers label blocks, not bytes.)

General Cache Concepts: Hit
- Data in block b is needed (request: 14).
- Block b is in the cache: hit! The data is returned to the CPU.

General Cache Concepts: Miss
- Data in block b is needed (request: 12).
- Block b is not in the cache: miss! Block b is fetched from memory and stored in the cache.
  - Placement policy: determines where b goes.
  - Replacement policy: determines which block gets evicted (the victim).
- The data is then returned to the CPU.

Why Caches Work
- Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently.
- Temporal locality ("equal"): recently referenced items are likely to be referenced again in the near future.
- Spatial locality ("near"): items with nearby addresses tend to be referenced close together in time.
- How do caches take advantage of this?

Example: Any Locality?

sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}
return sum;

- Data: temporal locality, since sum is referenced in each iteration; spatial locality, since array a[] is accessed in a stride-1 pattern.
- Instructions: temporal locality, since we cycle through the loop repeatedly; spatial locality, since we reference instructions in sequence.

Locality Example #1

int sum_array_rows(int a[M][N]) {
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Layout in memory (M = 3, N = 4; 76 is just one possible starting address of array a, with rows starting at 76, 92, 108):
a[0][0] a[0][1] a[0][2] a[0][3] a[1][0] a[1][1] a[1][2] a[1][3] a[2][0] a[2][1] a[2][2] a[2][3]

Access pattern:
1) a[0][0]  2) a[0][1]  3) a[0][2]  4) a[0][3]  5) a[1][0]  6) a[1][1]
7) a[1][2]  8) a[1][3]  9) a[2][0]  10) a[2][1]  11) a[2][2]  12) a[2][3]
Stride of 1 (4 bytes).

Locality Example #2

int sum_array_cols(int a[M][N]) {
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Same layout in memory (M = 3, N = 4). Access pattern: stride = ?
1) a[0][0]  2) a[1][0]  3) a[2][0]  4) a[0][1]  5) a[1][1]  6) a[2][1]
7) a[0][2]  8) a[1][2]  9) a[2][2]  10) a[0][3]  11) a[1][3]  12) a[2][3]
Stride of 4 (N elements, i.e. 16 bytes).

Locality Example #3
What is wrong with this code? How can it be fixed?

int sum_array_3D(int a[M][N][L]) {
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

Layout in memory (M = ?, N = 3, L = 4): the array is stored grid by grid (first index 0, then 1, then 2), each grid row by row:
a[0][0][0] … a[0][2][3], then a[1][0][0] … a[1][2][3], then a[2][0][0] … a[2][2][3]
(in the picture, addresses 76, 92, 108, 124, 140, 156, 172 mark successive rows)

The innermost loop varies k, the first subscript, so consecutive accesses are N x L elements apart. Fix: reorder the loops so that k is the outermost loop and j (the last subscript) is the innermost loop, giving stride-1 accesses.

Cache Performance Metrics
- There is a huge difference between a cache hit and a cache miss: it could be a 100x speed difference between accessing cache and main memory (measured in clock cycles).
- Miss Rate (MR): fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
- Hit Time (HT): time to deliver a block in the cache to the processor, including the time to determine whether the block is in the cache.
- Miss Penalty (MP): additional time required because of a miss.
A hit takes HT; a miss takes HT + MP.

Cache Performance
- Two things hurt the performance of a cache: miss rate and miss penalty.
- Average Memory Access Time (AMAT): the average time to access memory considering both hits and misses:
  AMAT = Hit Time + Miss Rate × Miss Penalty   (abbreviated AMAT = HT + MR × MP)
  You always pay HT; only on the fraction MR of accesses do you also pay MP.
- A 99% hit rate can be twice as good as a 97% hit rate! Assume HT of 1 clock cycle and MP of 100 clock cycles:
  97% hit rate: AMAT = 1 + 0.03 × 100 = 4 cycles
  99% hit rate: AMAT = 1 + 0.01 × 100 = 2 cycles
- Since MP is high, keep MR low.
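To double-check the arithmetic above, here is a tiny C sketch of mine (not from the slides) that evaluates AMAT = HT + MR × MP for the two hit rates; the HT and MP values are the ones assumed on the slide.

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (all times in clock cycles). */
static double amat(double ht, double mr, double mp) {
    return ht + mr * mp;
}

int main(void) {
    /* Slide's assumptions: HT = 1 cycle, MP = 100 cycles. */
    printf("97%% hit rate: AMAT = %.0f cycles\n", amat(1.0, 0.03, 100.0));
    printf("99%% hit rate: AMAT = %.0f cycles\n", amat(1.0, 0.01, 100.0));
    return 0;
}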


Peer Instruction Question
Processor specs: 200 ps clock, MP of 50 clock cycles, MR of 0.02 misses/instruction, and HT of 1 clock cycle
AMAT = ?
Which improvement would be best?
A. 190 ps clock
B. Miss penalty of 40 clock cycles
C. MR of 0.015 misses/instruction

Answer: C
Baseline: AMAT = 1 + 0.02 × 50 = 2 cycles = 400 ps (a picosecond is a trillionth of a second!)
A -> still 2 cycles, but at 190 ps/cycle -> 380 ps
B -> AMAT = 1 + 0.02 × 40 = 1.8 cycles -> 360 ps
C -> AMAT = 1 + 0.015 × 50 = 1.75 cycles -> 350 ps

Can we have more than one cache?
Why would we want to do that?
Avoid going to memory!
Typical performance numbers:
Miss Rate
L1 MR = 3-10%
L2 MR = quite small (e.g. < 1%), depending on parameters, etc.
Hit Time
L1 HT = 4 clock cycles
L2 HT = 10 clock cycles
Miss Penalty
MP = 50-200 cycles for missing in L2 and going to main memory (trend: increasing!)
(A smaller cache will have a higher MR but a lower HT. With L1, L2, and main memory, we have 3 levels now!)

Memory Hierarchies
Some fundamental and enduring properties of hardware and software systems:
- Faster = smaller = more expensive; slower = bigger = cheaper.
- The gaps between memory technology speeds are widening (true for registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.).
- Well-written programs tend to exhibit good locality.
These properties complement each other beautifully. They suggest an approach for organizing memory and storage systems known as a memory hierarchy: for each level k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.

An Example Memory Hierarchy
From smaller, faster, and costlier per byte (top) to larger, slower, and cheaper per byte (bottom):
- registers: CPU registers hold words retrieved from the L1 cache
- on-chip L1 cache (SRAM): holds cache lines retrieved from the L2 cache
- off-chip L2 cache (SRAM): holds cache lines retrieved from main memory
- main memory (DRAM): holds disk blocks retrieved from local disks
- local secondary storage (local disks, solid-state): local disks hold files retrieved from disks on remote network servers
- remote secondary storage (distributed file systems, web servers)
The levels are implemented with different technologies. DRAM: little wells that drain slowly and are refreshed periodically. SRAM: flip-flops, little logic gates that feed back into each other; faster, but uses more power. That's why the levels have different sizes and costs.
Registers are explicitly program-controlled (you refer to exactly t1, t2, …); below that, the program just sees "memory" and the hardware manages caching transparently.
So, why haven't we seen caches before now in this class? Because they're designed to be architecturally transparent: they never affect correctness, only performance.
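To see how two cache levels combine, here is a small sketch of mine (not from the slides) that extends AMAT to an L1/L2 hierarchy. The specific miss rates and penalties are assumed values picked from the typical ranges quoted above.

#include <stdio.h>

/* Two-level AMAT: an L1 miss pays the L2 hit time, and an L2 miss
   additionally pays the main-memory miss penalty:
   AMAT = L1_HT + L1_MR * (L2_HT + L2_MR * MP)                         */
int main(void) {
    double l1_ht = 4.0,  l1_mr = 0.05;   /* L1: 4 cycles, 5% miss rate (assumed)    */
    double l2_ht = 10.0, l2_mr = 0.01;   /* L2: 10 cycles, 1% of L1 misses also miss */
    double mp    = 100.0;                /* main-memory penalty in cycles (assumed)  */

    double amat = l1_ht + l1_mr * (l2_ht + l2_mr * mp);
    printf("AMAT = %.2f cycles\n", amat);   /* 4 + 0.05 * (10 + 1) = 4.55 */
    return 0;
}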
Intel Core i7 Cache Hierarchy
- Each core (core 0 … core 3) has its own registers, L1 D$, L1 I$, and a unified L2 cache; an L3 unified cache is shared by all cores and backed by main memory (all within the processor package).
- Block size: 64 bytes for all caches.
- L1 i-cache and d-cache: 32 KiB, 8-way, access: 4 cycles.
- L2 unified cache: 256 KiB, 8-way, access: 11 cycles.
- L3 unified cache: 8 MiB, 16-way, access: 30-40 cycles.

Summary
Memory hierarchy:
- Successively higher levels contain the "most used" data from lower levels.
- Exploits temporal and spatial locality.
- Caches are intermediate storage levels used to optimize data transfers between any system elements with different characteristics.
Cache performance:
- Ideal case: found in cache (hit). Bad case: not found in cache (miss), search in the next level.
- Average Memory Access Time (AMAT) = HT + MR × MP; hurt by miss rate and miss penalty.

[Chart data behind the "Actual Data" time-vs-SIZE plot (232_cache_data): SIZE 1, 2, 8, 16, 32, 65, 128, 256, 512, 768, 1024, 1536, 2048, 3036, 4096, 8192 -> time 0.01, 0.01, 0.02, 0.02, 0.05, 0.08, 0.15, 0.29, 0.56, 0.83, 1.13, 5.23, 8.61, 14.49, 20.03, 40.8]

Cache Terminology (from CIS 501 (Martin): Caches)
- Access: a read/write to the cache. Hit: desired data in the cache. Miss: desired data not in the cache. Fill: data placed in the cache.
- %miss = #misses / #accesses; MPKI = #misses / 1000 instructions.
- Thit = time to read (write) data from the cache; Tmiss = time to read data into the cache.
- Tavg = Thit + %miss × Tmiss.

Exploiting Locality: Memory Hierarchy
- Hierarchy of memory components: upper components are fast, small, expensive; lower components are slow, big, cheap; connected by "buses", which also have latency and bandwidth issues.
- Most frequently accessed data lives in M1; M1 plus the next most frequently accessed data in M2, and so on; data moves up and down the hierarchy.
- Optimize average access time: latency_avg = latency_hit + %miss × latency_miss; attack each component.

Concrete Memory Hierarchy
- 0th level: registers.
- 1st level: primary caches, split instruction (I$) and data (D$), typically 8 KB to 64 KB each.
- 2nd level: L2 and L3 caches, on-chip, typically made of SRAM; L2 typically ~256 KB to 512 KB, "last level cache" typically 4 MB to 16 MB.
- 3rd level: main memory, made of DRAM ("dynamic" RAM); typically 1 GB to 4 GB for desktops/laptops, servers can have 100s of GB.
- 4th level: disk (swap and files), using magnetic disks.
(In the figure, the registers are labeled compiler managed, the caches hardware managed, and main memory and disk software managed, by the OS.)

Library Analogy Revisited
- Registers -> books on your desk: actively being used, small capacity.
- Caches -> bookshelves: moderate capacity, pretty fast to access.
- Main memory -> the library: big, holds almost all data, but slow.
- Disk (swap) -> inter-library loan: very slow, but hopefully really uncommon.

Evolution of Cache Hierarchies
- Chips today are 30–70% cache by area.
- [Die photos: Intel 486 with 8 KB I/D$; IBM Power5 (dual core) with 64 KB I$ and 64 KB D$, a 1.5 MB L2, and L3 tags on chip; Intel Penryn with 4 MB + 4 MB of cache.]
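To connect the CIS 501 terminology above to numbers, here is a tiny sketch of mine; the access, miss, and instruction counts, as well as the hit and miss times, are purely illustrative assumptions.

#include <stdio.h>

int main(void) {
    /* Assumed counts, for illustration only. */
    long long accesses = 1000000, misses = 20000, instructions = 5000000;
    double thit = 4.0, tmiss = 100.0;              /* cycles, assumed */

    double pct_miss = (double)misses / accesses;   /* %miss = #misses / #accesses   */
    double mpki = 1000.0 * misses / instructions;  /* misses per 1000 instructions  */
    double tavg = thit + pct_miss * tmiss;         /* Tavg = Thit + %miss * Tmiss   */

    printf("%%miss = %.3f, MPKI = %.1f, Tavg = %.1f cycles\n", pct_miss, mpki, tavg);
    return 0;
}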