Many of the following slides are taken with permission from
Complete Powerpoint Lecture Notes for
Computer Systems: A Programmer’s Perspective (CS:APP)
Randal E. Bryant and David R. O’Hallaron
http://csapp.cs.cmu.edu/public/lectures.html
The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.
Cache Memory and Performance Code and Caches 1
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
Question: Whichofthesefunctionshasgoodlocality?
int sumarrayrows(int a[M][N])
{
int i, j, sum = 0;
for (i = 0; i < M; i++) for (j = 0; j < N; j++)
sum += a[i][j]; return sum;
}
int sumarraycols(int a[M][N])
{
int i, j, sum = 0;
for (j = 0; j < N; j++) for (i = 0; i < M; i++)
sum += a[i][j];
return sum;
}
Locality Example (1)
Code and Caches 2
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
C arrays allocated in contiguous memory locations with addresses ascending with the array index:
int32_t A[10] = {0, 1, 2, 3, 4, ..., 8, 9};
7FFF99702320 7FFF99702324 7FFF99702328 7FFF9970232C 7FFF99702330 ... 7FFF99702340 7FFF99702344
0
1
2
3
4
...
8
9
Layout of C Arrays in Memory Code and Caches 3
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
In C, a two-dimensional array is an array of arrays:
int32_t A[3][5] = {
{ 0, 1, 2, 3, 4}, {10, 11, 12, 13, 14}, {20, 21, 22, 23, 24}
};
A[0]
A[1]
A[2]
In fact, if we print the values as pointers, we see something like this:
A: 0x7fff22e41d30
A[0]: 0x7fff22e41d30
0x14
A[1]: 0x7fff22e41d44
0x14
A[2]: 0x7fff22e41d58
Two-dimensional Arrays in C Code and Caches 4
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Two-dimensional C arrays allocated in row-major order - each row in contiguous memory locations:
int32_t A[3][5] =
{{0, 1, 2, 3, 4},
{10, 11, 12, 13, 14},
{20, 21, 22, 23, 24}
};
7FFF22E41D30 7FFF22E41D34 7FFF22E41D38 7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68
0
1
2
3
4
10
11
12
13
14
20
21
22
23
24
Layout of C Arrays in Memory Code and Caches 5
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
int32_t A[3][5] =
{ { 0, 1, 2, 3, 4},
{10, 11, 12, 13, 14},
{20, 21, 22, 23, 24}, };
Stepping through columns in one row:
for (i = 0; i < 3; i++) for (j = 0; j < 5; j++)
sum += A[i][j];
- accesses successive elements in memory
- if cache block size B > 4 bytes, exploit spatial locality compulsory miss rate = 4 bytes / B
i = 0
7FFF22E41D30 7FFF22E41D34 7FFF22E41D38 7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68
0
1
2
3
4
10
11
12
13
14
20
21
22
23
24
i = 1
i = 2
Layout of C Arrays in Memory Code and Caches 6
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
int32_t A[3][5] =
{ { 0, 1, 2, 3, 4},
7FFF22E41D30
7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68
0
1
2
3
4
10
11
12
13
14
20
21
22
23
24
j=0
j=1
}; 7FFF22E41D38
{10, 11, 12, 13, 14},
7FFF22E41D34
{20, 21, 22, 23, 24},
Stepping through rows in one column:
for (j = 0; i < 5; i++)
for (i = 0; i < 3; i++)
sum += a[i][j];
accesses distant elements
no spatial locality!
compulsory miss rate = 1 (i.e. 100%)
Layout of C Arrays in Memory Code and Caches 7
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
0
1
2
3
4
10
11
12
13
14
20
21
22
23
24
Stride 1
7FFF22E41D30 7FFF22E41D34 7FFF22E41D38 7FFF22E41D3C 7FFF22E41D40 7FFF22E41D44 7FFF22E41D48 7FFF22E41D4C 7FFF22E41D50 7FFF22E41D54 7FFF22E41D58 7FFF22E41D5C 7FFF22E41D60 7FFF22E41D64 7FFF22E41D68
Stride 4
Stride and Array Accesses
Code and Caches 8
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Assume an initially-empty cache with 16-byte cache blocks.
0
1
2
3
4
5
6
7
i = 0, j = 0
to
i = 0, j = 3
i = 0, j = 4
to
i = 1, j = 2
int sumarrayrows(int a[M][N]) {
int row, col, sum = 0;
for (row = 0; row < M; row++)
for (col = 0; col < N; col++)
sum += a[row][col]; return sum;
}
Miss rate = 1/4 = 25%
Writing Cache Friendly Code
Code and Caches 9
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
0
1
2
3
4
5
6
...
7
8
9
10
15
16
Consider the previous slide, but assume that the cache uses a block size of 64 bytes instead of 16 bytes..
int sumarrayrows(int a[M][N]) {
int row, col, sum = 0;
for (row = 0; row < M; row++)
for (col = 0; col < N; col++)
sum += a[row][col]; return sum;
}
i = 0, j = 0 to
i = 3, j = 1
Miss rate = 1/16 = 6.25%
Writing Cache Friendly Code
Code and Caches 10
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
"Skipping" accesses down the rows of a column do not provide good locality:
int sumarraycols(int a[M][N]) {
int row, col, sum = 0;
for (col = 0; col < N; col++) for (row = 0; row < M; row++)
sum += a[row][col];
return sum;
}
Miss rate = 100%
(That's actually somewhat pessimistic... depending on cache geometry.)
Writing Cache Friendly Code
Code and Caches 11
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
It's easy to write an array traversal and see the addresses at which the array elements are stored:
int A[5] = {0, 1, 2, 3, 4};
for (i = 0; i < 5; i++)
printf("%d: %p\n",
i, &A[i]);
We see there that for a 1D array, the index varies in a stride-1 pattern.
i address
-----------
0: 28ABE0
1: 28ABE4
2: 28ABE8
3: 28ABEC
4: 28ABF0
stride-1: addressesdifferbythesizeof an array cell (4 bytes, here)
Layout of C Arrays in Memory Code and Caches 12
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
int B[3][5] = { ... }; for (i = 0; i < 3; i++)
for (j = 0; j < 5; j++) printf("%d %3d: %p\n",
i, j, &B[i][j]);
We see that for a 2D array, the second index varies in a stride-1 pattern.
But the first index does not vary in a stride-1 pattern.
j-i order:
i j address
----------------
i-j order:
i j address ----------------
0 0:
0 1:
0 2:
0 3:
0 4:
1 0:
1 1: 1 2:
28ABA4 28ABA8 28ABAC 28ABB0 28ABB4 28ABB8 28ABBC 28ABC0
s
tride-1
0 0: 1 0: 2 0: 0 1: 1 1: 2 1: 0 2: 1 2:
28CC9C 28CCB0 28CCC4 28CCA0 28CCB4 28CCC8 28CCA4 28CCB8
stride-5 (0x14/4)
Layout of C Arrays in Memory Code and Caches 13
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
int32_t A[2][3][5] = {
{ { 0, 1, 2, 3, 4},
{ 10, 11, 12, 13, 14},
{ 20, 21, 22, 23, 24}}, { { 0, 1, 2, 3, 4}, {110, 111, 112, 113, 114}, {220, 221, 222, 223, 224}}
};
3D Arrays in C Code and Caches 14
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Question: Can you permute the loops so that the function scans the 3D array a[][][] with a stride-1 reference pattern (and thus has good spatial
locality)?
int sumarray3d(int a[N][N][N]) {
int row, col, page, sum = 0;
for (row = 0; row < N; row++)
for (col = 0; col < N; col++)
for (page = 0; page < N; page++) sum += a[page][row][col];
return sum; }
Locality Example (2) Code and Caches 15
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
int C[2][3][5] = { ... };
for (i = 0; i < 2; i++) for (j = 0; j < 3; j++)
for (k = 0; k < 5; k++) printf("%3d %3d %3d: %p\n",
i, j, k, &C[i][j][k]);
i-j-k order:
i j k address
------------------
0 0 0: 0 0 1: 0 0 2: 0 0 3: 0 0 4: 0 1 0: 0 1 1: 0 1 2:
28CC1C 28CC20 28CC24 28CC28 28CC2C 28CC30 28CC34 28CC38
0x 0x 0x
Layout of C Arrays in Memory Code and Caches 16
We see that for a 3D array, the third index varies in a stride-1 pattern:
But... if we change the order of access, we no longer have a stride-1 pattern:
4 4 4
0 0 0:
1 0 0:
0 1 0:
1 1 0:
0 2 0:
1 2 0:
0 0 1:
1 0 1:
28CC24 0x3C 28CC60 0x28 28CC38 0x3C 28CC74 28CC4C 28CC88 28CC28 28CC64
0 0 0: 0 1 0: 0 2 0: 0 0 1: 0 1 1: 0 2 1: 0 0 2: 0 1 2:
28CC24 0x14 28CC38 0x14 28CC4C 0x14 28CC28 28CC3C 28CC50 28CC2C 28CC40
k-j-i order:
i j k address
------------------
i-k-j order:
i j k address ------------------
CS@VT
Computer Organization II
©2005-2017 CS:APP & McQuain
Question: Can you permute the loops so that the function scans the 3D array a[] with a stride-1 reference pattern (and thus has good spatial locality)?
int sumarray3d(int a[N][N][N]) {
int i, j, k, sum = 0;
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
sum += a[k][i][j];
return sum; }
This code does not yield good locality at all.
The inner loop is varying the first index, worst case!
Locality Example (2) Code and Caches 17
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
// struct of arrays struct soa {
float *x;
float *y;
float *z;
float *r;
};
compute_r(struct soa s) {
for (i = 0; ...) {
s.r[i] = s.x[i] * s.x[i]
+ s.y[i] * s.y[i]
} }
+ s.z[i] * s.z[i];
// array of structs struct aos {
float x;
float y;
float z;
float r;
};
compute_r(struct aos *s) {
for (i = 0; ...) {
s[i].r = s[i].x * s[i].x
+ s[i].y * s[i].y
} }
+ s[i].z * s[i].z;
Locality Example (3)
Question: Which of these two exhibits better spatial locality?
Code and Caches 18
For the following discussions assume a cache block size of 32 bytes, and that the cache is not capable of holding all the blocks of the relevant structure at once.
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
// struct of arrays
struct soa {
float *x; float *y; float *z; float *r;
};
struct soa s;
s.x = malloc(1000 * sizeof(float)); ...
...
...
...
...
Locality Example (3)
Code and Caches 19
x y
z
r
16 bytes
4 bytes per cell, 1000 cells per array
CS@VT
Computer Organization II
©2005-2017 CS:APP & McQuain
// array of structs
struct aos {
float x; float y; float z; float r;
};
struct aos s[1000];
x y
z r
x y
z r
x y
z r
x y z r
x y
z r
x y
z r
x y
z r
x y
z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
...
x y
z r
x y z r
x y z r
16 bytes per cell, 1000 cells
Locality Example (3)
Code and Caches 20
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Describe the locality exhibited by this algorithm:
s.x[0] miss s.y[0] miss s.z[0] miss s.r[0] miss
s.x[1] hit s.y[1] hit s.z[1] hit s.r[1] hit
// struct of arrays compute_r(struct soa s) {
for (int i = 0; i < 1000; i++) {
s.r[i] = s.x[i] * s.x[i]
+ s.y[i] * s.y[i]
} }
+ s.z[i] * s.z[i];
8 cells
...
s.x[7] hit s.y[7] hit s.z[7] hit s.r[7] hit
s.x[8] miss s.y[8] miss s.z[8] miss s.r[8] miss
...
x
y
z r
...
...
...
32 bytes
4 bytes per cell, 1000 cells per array
Locality Example (3)
Code and Caches 21
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
s.x[8] miss s.y[8] miss s.z[8] miss s.r[8] miss
s.x[9] hit s.y[9] hit s.z[9] hit s.r[9] hit
Describe the locality exhibited by this algorithm:
// struct of arrays compute_r(struct soa s) {
for (int i = 0; i < 1000; i++) {
s.r[i] = s.x[i] * s.x[i]
+ s.y[i] * s.y[i]
} }
+ s.z[i] * s.z[i];
8 cells
8 cells
...
For the arrays: Misses= 4*1*125 Hits = 4*7*125
Hit rate = 87.5%
...
x
y
z r
...
...
...
32 bytes
4 bytes per cell, 1000 cells per array
Locality Example (3)
Code and Caches 22
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Describe the locality exhibited by this algorithm:
s[0].x miss
// array of structs compute_r(struct aos *s) {
for (int i = 0; i < 1000; i++) {
s[i].r = s[i].x * s[i].x
+ s[i].y * s[i].y
} }
+ s[i].z * s[i].z;
s[0].y hit
s[0].z hit
s[0].r hit s[1].x hit s[2].y hit s[3].z hit s[4].r hit
...
x y
z r
x y
z r
x y
z r
x y z r
x y
z r
x y
z r
x y
z r
x y
z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
...
x y
z r
x y z r
x y z r
Hit rate: 7/8 or 87.5%
Locality Example (3)
Code and Caches 23
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Describe the locality exhibited by this algorithm:
// struct of arrays sum_r(struct soa s) {
sum = 0;
for (int i = 0; i < 1000; i++) {
sum += s.r[i];
}
}
...
x y
z r
...
...
...
Locality Example (4)
Code and Caches 24
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Describe the locality exhibited by this algorithm:
// array of structs sum_r(struct aos *s) {
sum = 0;
for (int i = 0; i < 1000; i++) {
sum += s[i].r;
}
}
x y
z r
x y
z r
x y
z r
x y z r
x y
z r
x y
z r
x y
z r
x y
z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
x y z r
...
x y
z r
x y z r
x y z r
Locality Example (4)
Code and Caches 25
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
QTP: How would this compare to the previous two?
// array of pointers to structs struct aops {
float x;
float y;
float z;
float r;
};
struct *aops apos[1000];
for (i = 0; i < 1000; i++)
apos[i] = malloc(sizeof(struct aops));
Locality Example (5)
Code and Caches 26
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Make the common case go fast
– Focus on the inner loops of the core functions
Minimize the misses in the inner loops
– Repeated references to variables are good (temporal locality) – Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
Writing Cache Friendly Code Code and Caches 27
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Assume:
Line size = 32B (big enough for four 64-bit words) Matrix dimension (N) is very large
Approximate 1/N as 0.0
Cache is not even big enough to hold multiple rows
Analysis Method:
Look at access pattern of inner loop
kjj iki
AB
C
Miss Rate Analysis for Matrix Multiply Code and Caches 28
CS@VT Computer Organization II ©2005-2017 CS:APP & McQuain
Description:
Multiply N x N matrices
O(N3) total operations
N reads per source element
N values summed per destination
/* ijk */
for (i=0; i