代写 C data structure algorithm html Many of the following slides are taken with permission from

Many of the following slides are taken with permission from
Complete Powerpoint Lecture Notes for
Computer Systems: A Programmer’s Perspective (CS:APP)
Randal E. Bryant and David R. O’Hallaron
http://csapp.cs.cmu.edu/public/lectures.html
The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.
Cache Memory and Performance Code and Caches 1
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain

Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
Question: Whichofthesefunctionshasgoodlocality?
int sumarrayrows(int a[M][N])
{
int i, j, sum = 0;
for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } int sumarraycols(int a[M][N]) { int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; } Locality Example (1) Code and Caches 2 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain C arrays allocated in contiguous memory locations with addresses ascending with the array index: int32_t A[10] = {0, 1, 2, 3, 4, ..., 8, 9}; 7FFF99702320 7FFF99702324 7FFF99702328 7FFF9970232C 7FFF99702330 ... 7FFF99702340 7FFF99702344 0 1 2 3 4 ... 8 9 Layout of C Arrays in Memory Code and Caches 3 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain In C, a two-dimensional array is an array of arrays: int32_t A[3][5] = { { 0, 1, 2, 3, 4}, {10, 11, 12, 13, 14}, {20, 21, 22, 23, 24} }; A[0] A[1] A[2] In fact, if we print the values as pointers, we see something like this: A: 0x7fff22e41d30 A[0]: 0x7fff22e41d30 0x14 A[1]: 0x7fff22e41d44 0x14 A[2]: 0x7fff22e41d58 Two-dimensional Arrays in C Code and Caches 4 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Two-dimensional C arrays allocated in row-major order - each row in contiguous memory locations: int32_t A[3][5] = {{0, 1, 2, 3, 4}, {10, 11, 12, 13, 14}, {20, 21, 22, 23, 24} }; 7FFF22E41D30 7FFF22E41D34 7FFF22E41D38 7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68 0 1 2 3 4 10 11 12 13 14 20 21 22 23 24 Layout of C Arrays in Memory Code and Caches 5 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain int32_t A[3][5] = { { 0, 1, 2, 3, 4}, {10, 11, 12, 13, 14}, {20, 21, 22, 23, 24}, }; Stepping through columns in one row: for (i = 0; i < 3; i++) for (j = 0; j < 5; j++) sum += A[i][j]; - accesses successive elements in memory - if cache block size B > 4 bytes, exploit spatial locality compulsory miss rate = 4 bytes / B
i = 0
7FFF22E41D30 7FFF22E41D34 7FFF22E41D38 7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68
0
1
2
3
4
10
11
12
13
14
20
21
22
23
24
i = 1
i = 2
Layout of C Arrays in Memory Code and Caches 6
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain

int32_t A[3][5] =
{ { 0, 1, 2, 3, 4},
7FFF22E41D30
7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68
0
1
2
3
4
10
11
12
13
14
20
21
22
23
24
j=0
j=1
}; 7FFF22E41D38
{10, 11, 12, 13, 14},
7FFF22E41D34
{20, 21, 22, 23, 24},
Stepping through rows in one column:
for (j = 0; i < 5; i++) for (i = 0; i < 3; i++) sum += a[i][j]; accesses distant elements no spatial locality! compulsory miss rate = 1 (i.e. 100%) Layout of C Arrays in Memory Code and Caches 7 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain 0 1 2 3 4 10 11 12 13 14 20 21 22 23 24 Stride 1 7FFF22E41D30 7FFF22E41D34 7FFF22E41D38 7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68 Stride 4 Stride and Array Accesses Code and Caches 8 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Repeated references to variables are good (temporal locality) Stride-1 reference patterns are good (spatial locality) Assume an initially-empty cache with 16-byte cache blocks. 0 1 2 3 4 5 6 7 i = 0, j = 0 to i = 0, j = 3 i = 0, j = 4 to i = 1, j = 2 int sumarrayrows(int a[M][N]) { int row, col, sum = 0; for (row = 0; row < M; row++) for (col = 0; col < N; col++) sum += a[row][col]; return sum; } Miss rate = 1/4 = 25% Writing Cache Friendly Code Code and Caches 9 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain 0 1 2 3 4 5 6 ... 7 8 9 10 15 16 Consider the previous slide, but assume that the cache uses a block size of 64 bytes instead of 16 bytes.. int sumarrayrows(int a[M][N]) { int row, col, sum = 0; for (row = 0; row < M; row++) for (col = 0; col < N; col++) sum += a[row][col]; return sum; } i = 0, j = 0 to i = 3, j = 1 Miss rate = 1/16 = 6.25% Writing Cache Friendly Code Code and Caches 10 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain "Skipping" accesses down the rows of a column do not provide good locality: int sumarraycols(int a[M][N]) { int row, col, sum = 0; for (col = 0; col < N; col++) for (row = 0; row < M; row++) sum += a[row][col]; return sum; } Miss rate = 100% (That's actually somewhat pessimistic... depending on cache geometry.) Writing Cache Friendly Code Code and Caches 11 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain It's easy to write an array traversal and see the addresses at which the array elements are stored: int A[5] = {0, 1, 2, 3, 4}; for (i = 0; i < 5; i++) printf("%d: %p\n", i, &A[i]); We see there that for a 1D array, the index varies in a stride-1 pattern. i address ----------- 0: 28ABE0 1: 28ABE4 2: 28ABE8 3: 28ABEC 4: 28ABF0 stride-1: addressesdifferbythesizeof an array cell (4 bytes, here) Layout of C Arrays in Memory Code and Caches 12 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain int B[3][5] = { ... }; for (i = 0; i < 3; i++) for (j = 0; j < 5; j++) printf("%d %3d: %p\n", i, j, &B[i][j]); We see that for a 2D array, the second index varies in a stride-1 pattern. But the first index does not vary in a stride-1 pattern. j-i order: i j address ---------------- i-j order: i j address ---------------- 0 0: 0 1: 0 2: 0 3: 0 4: 1 0: 1 1: 1 2: 28ABA4 28ABA8 28ABAC 28ABB0 28ABB4 28ABB8 28ABBC 28ABC0 s tride-1 0 0: 1 0: 2 0: 0 1: 1 1: 2 1: 0 2: 1 2: 28CC9C 28CCB0 28CCC4 28CCA0 28CCB4 28CCC8 28CCA4 28CCB8 stride-5 (0x14/4) Layout of C Arrays in Memory Code and Caches 13 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain int32_t A[2][3][5] = { { { 0, 1, 2, 3, 4}, { 10, 11, 12, 13, 14}, { 20, 21, 22, 23, 24}}, { { 0, 1, 2, 3, 4}, {110, 111, 112, 113, 114}, {220, 221, 222, 223, 224}} }; 3D Arrays in C Code and Caches 14 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Question: Can you permute the loops so that the function scans the 3D array a[][][] with a stride-1 reference pattern (and thus has good spatial locality)? int sumarray3d(int a[N][N][N]) { int row, col, page, sum = 0; for (row = 0; row < N; row++) for (col = 0; col < N; col++) for (page = 0; page < N; page++) sum += a[page][row][col]; return sum; } Locality Example (2) Code and Caches 15 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain int C[2][3][5] = { ... }; for (i = 0; i < 2; i++) for (j = 0; j < 3; j++) for (k = 0; k < 5; k++) printf("%3d %3d %3d: %p\n", i, j, k, &C[i][j][k]); i-j-k order: i j k address ------------------ 0 0 0: 0 0 1: 0 0 2: 0 0 3: 0 0 4: 0 1 0: 0 1 1: 0 1 2: 28CC1C 28CC20 28CC24 28CC28 28CC2C 28CC30 28CC34 28CC38 0x 0x 0x Layout of C Arrays in Memory Code and Caches 16 We see that for a 3D array, the third index varies in a stride-1 pattern: But... if we change the order of access, we no longer have a stride-1 pattern: 4 4 4 0 0 0: 1 0 0: 0 1 0: 1 1 0: 0 2 0: 1 2 0: 0 0 1: 1 0 1: 28CC24 0x3C 28CC60 0x28 28CC38 0x3C 28CC74 28CC4C 28CC88 28CC28 28CC64 0 0 0: 0 1 0: 0 2 0: 0 0 1: 0 1 1: 0 2 1: 0 0 2: 0 1 2: 28CC24 0x14 28CC38 0x14 28CC4C 0x14 28CC28 28CC3C 28CC50 28CC2C 28CC40 k-j-i order: i j k address ------------------ i-k-j order: i j k address ------------------ CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Question: Can you permute the loops so that the function scans the 3D array a[] with a stride-1 reference pattern (and thus has good spatial locality)? int sumarray3d(int a[N][N][N]) { int i, j, k, sum = 0; for (i = 0; i < N; i++) for (j = 0; j < N; j++) for (k = 0; k < N; k++) sum += a[k][i][j]; return sum; } This code does not yield good locality at all. The inner loop is varying the first index, worst case! Locality Example (2) Code and Caches 17 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain // struct of arrays struct soa { float *x; float *y; float *z; float *r; }; compute_r(struct soa s) { for (i = 0; ...) { s.r[i] = s.x[i] * s.x[i] + s.y[i] * s.y[i] } } + s.z[i] * s.z[i]; // array of structs struct aos { float x; float y; float z; float r; }; compute_r(struct aos *s) { for (i = 0; ...) { s[i].r = s[i].x * s[i].x + s[i].y * s[i].y } } + s[i].z * s[i].z; Locality Example (3) Question: Which of these two exhibits better spatial locality? Code and Caches 18 For the following discussions assume a cache block size of 32 bytes, and that the cache is not capable of holding all the blocks of the relevant structure at once. CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain // struct of arrays struct soa { float *x; float *y; float *z; float *r; }; struct soa s; s.x = malloc(1000 * sizeof(float)); ... ... ... ... ... Locality Example (3) Code and Caches 19 x y z r 16 bytes 4 bytes per cell, 1000 cells per array CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain // array of structs struct aos { float x; float y; float z; float r; }; struct aos s[1000]; x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r ... x y z r x y z r x y z r 16 bytes per cell, 1000 cells Locality Example (3) Code and Caches 20 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Describe the locality exhibited by this algorithm: s.x[0] miss s.y[0] miss s.z[0] miss s.r[0] miss s.x[1] hit s.y[1] hit s.z[1] hit s.r[1] hit // struct of arrays compute_r(struct soa s) { for (int i = 0; i < 1000; i++) { s.r[i] = s.x[i] * s.x[i] + s.y[i] * s.y[i] } } + s.z[i] * s.z[i]; 8 cells ... s.x[7] hit s.y[7] hit s.z[7] hit s.r[7] hit s.x[8] miss s.y[8] miss s.z[8] miss s.r[8] miss ... x y z r ... ... ... 32 bytes 4 bytes per cell, 1000 cells per array Locality Example (3) Code and Caches 21 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain s.x[8] miss s.y[8] miss s.z[8] miss s.r[8] miss s.x[9] hit s.y[9] hit s.z[9] hit s.r[9] hit Describe the locality exhibited by this algorithm: // struct of arrays compute_r(struct soa s) { for (int i = 0; i < 1000; i++) { s.r[i] = s.x[i] * s.x[i] + s.y[i] * s.y[i] } } + s.z[i] * s.z[i]; 8 cells 8 cells ... For the arrays: Misses= 4*1*125 Hits = 4*7*125 Hit rate = 87.5% ... x y z r ... ... ... 32 bytes 4 bytes per cell, 1000 cells per array Locality Example (3) Code and Caches 22 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Describe the locality exhibited by this algorithm: s[0].x miss // array of structs compute_r(struct aos *s) { for (int i = 0; i < 1000; i++) { s[i].r = s[i].x * s[i].x + s[i].y * s[i].y } } + s[i].z * s[i].z; s[0].y hit s[0].z hit s[0].r hit s[1].x hit s[2].y hit s[3].z hit s[4].r hit ... x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r ... x y z r x y z r x y z r Hit rate: 7/8 or 87.5% Locality Example (3) Code and Caches 23 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Describe the locality exhibited by this algorithm: // struct of arrays sum_r(struct soa s) { sum = 0; for (int i = 0; i < 1000; i++) { sum += s.r[i]; } } ... x y z r ... ... ... Locality Example (4) Code and Caches 24 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Describe the locality exhibited by this algorithm: // array of structs sum_r(struct aos *s) { sum = 0; for (int i = 0; i < 1000; i++) { sum += s[i].r; } } x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r x y z r ... x y z r x y z r x y z r Locality Example (4) Code and Caches 25 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain QTP: How would this compare to the previous two? // array of pointers to structs struct aops { float x; float y; float z; float r; }; struct *aops apos[1000]; for (i = 0; i < 1000; i++) apos[i] = malloc(sizeof(struct aops)); Locality Example (5) Code and Caches 26 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Make the common case go fast – Focus on the inner loops of the core functions Minimize the misses in the inner loops – Repeated references to variables are good (temporal locality) – Stride-1 reference patterns are good (spatial locality) Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories. Writing Cache Friendly Code Code and Caches 27 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Assume: Line size = 32B (big enough for four 64-bit words) Matrix dimension (N) is very large Approximate 1/N as 0.0 Cache is not even big enough to hold multiple rows Analysis Method: Look at access pattern of inner loop kjj iki AB C Miss Rate Analysis for Matrix Multiply Code and Caches 28 CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain Description: Multiply N x N matrices O(N3) total operations N reads per source element N values summed per destination /* ijk */ for (i=0; i