18-646 – How to Write Fast Code II
Outline
CPU Achievable Peak Performance
Matrix-Multiplication Discussion
Overview of Mini-Project 1
A Quick Review: SIMD Parallelism
OpenMP Pragmas
CPU Memory Hierarchy
False Sharing
Mini-Project 1.1 – Matrix Multiply
Mini-Project 1.2 – K-Means
Peak Single-precision Performance
Intel Sandy Bridge, 3.5 GHz, 4 cores * 2 sockets, 2 ALUs per core

[GFLOPS]         1 core   2 cores   4 cores   8 cores
None, scalar          7        14        28        56
SSE                  28        56       112       224
AVX                  56       112       224       448

Most CPU applications achieve a fraction of this.
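These entries follow from clock rate × SIMD width × floating-point units per core × number of cores. A quick check of two of them (my arithmetic, assuming the two ALUs noted above each retire one single-precision operation per cycle):

  scalar, 1 core:  3.5 GHz × 1 lane  × 2 ALUs           =   7 GFLOPS
  AVX, 8 cores:    3.5 GHz × 8 lanes × 2 ALUs × 8 cores = 448 GFLOPS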
In addition …
Deep memory hierarchy:
  Registers
  L1 cache
  L2 cache
  TLB (Translation Lookaside Buffer: a table used to translate virtual memory addresses to physical memory addresses)
Other forms of parallelism:
  Pipelining
  Instruction-level parallelism (multiple functional units)
“Free operations” are not free
Outline
CPU Achievable Peak Performance
Matrix-Multiplication Discussion
Overview of Mini-Project 1
A Quick Review: SIMD Parallelism
OpenMP Pragmas
CPU Memory Hierarchy
False Sharing
Mini-Project 1.1 – Matrix Multiply
Mini-Project 1.2 – K-Means
Naïve Matrix Multiply
{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
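For reference, the same triple loop written in C over flat row-major arrays (a minimal sketch; the mini-project's matrix_mul.cpp defines its own interface):

  /* Naive C = C + A*B for n x n matrices stored row major in 1-D arrays. */
  void matmul_naive(int n, const float *A, const float *B, float *C) {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              float cij = C[i*n + j];              /* C(i,j) */
              for (int k = 0; k < n; k++)
                  cij += A[i*n + k] * B[k*n + j];  /* A(i,k) * B(k,j) */
              C[i*n + j] = cij;
          }
  }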
Matrix-Matrix Multiplication Speed
[Figure: MMM (square, real, double precision) on a Core 2 Duo 3 GHz; performance in Gflop/s vs. matrix side from 1 to 2048, comparing the theoretical peak, ATLAS-generated code, and the triple loop]
• Huge performance difference for large sizes
• Great case study to learn memory hierarchy optimization
Note on Matrix Storage
A matrix is a 2-D array of elements, but memory addresses are “1-D”. Conventions for matrix layout:
  by column, or “column major” (Fortran default): A[i,j] at i + j*n
  by row, or “row major” (C default): A[i,j] at i*n + j

[Figure: the memory addresses 0–19 of a small matrix under column-major vs. row-major layout; in the column-major case, the yellow row of the matrix is stored in the red cachelines, i.e. scattered across many cachelines]

Column major (for now)
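The only difference between the two layouts is the index arithmetic; a small sketch (the macro names are mine):

  /* Index an n x n matrix stored as a flat 1-D array. */
  #define IDX_ROW_MAJOR(i, j, n)  ((i)*(n) + (j))   /* C default       */
  #define IDX_COL_MAJOR(i, j, n)  ((i) + (j)*(n))   /* Fortran default */

Whichever layout is chosen, one of the two input operands in the naïve loop is walked with stride n, which is what the next slides quantify.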
Naïve Matrix Multiply
{implements C = C + A*B}
for i = 1 to n                                  // O(n^3) slow loads
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Naïve Matrix Multiply
Number of slow memory references on unblocked matrix multiply:
  m = n^3    (to read each column of B n times)
    + n^2    (to read each row of A once)
    + 2n^2   (to read and write each element of C once)
    = n^3 + 3n^2
So q = f / m = 2n^3 / (n^3 + 3n^2) ≈ 2 for large n: no improvement over matrix-vector multiply.
The inner two loops are just a matrix-vector multiply of row i of A times B. Similar for any other order of the 3 loops.

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Naïve Matrix Multiply on RS/6000
[Figure: log cycles/flop vs. log problem size; the measured time grows as T = N^4.7; size 2000 took 5 days, and size 12000 would take 1095 years]
O(N^3) performance would have constant cycles/flop.
Slide source: Larry Carter, UCSD
Naïve Matrix Multiply on RS/6000
[Figure: log cycles/flop vs. log problem size, annotated with the regimes where performance drops: a cache miss every 16 iterations, a page miss every 512 iterations, a TLB miss every iteration, and a page miss every iteration]
Slide source: Larry Carter, UCSD
Blocked (Tiled) Matrix Multiply
A, B, C = N x N matrices of b x b sub-blocks, N = n / b
  (the blocks should be smaller than the cache)
for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}      // O(N^3) slow loads
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)         {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: block C(i,j) = C(i,j) + A(i,k) * B(k,j)]
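The same idea in C over flat row-major arrays (a sketch that assumes the block size b evenly divides n; this is not the reference implementation):

  /* Blocked C = C + A*B: work on b x b tiles so that the three tiles being
     combined stay resident in cache while they are reused. */
  void matmul_blocked(int n, int b, const float *A, const float *B, float *C) {
      for (int ii = 0; ii < n; ii += b)
          for (int jj = 0; jj < n; jj += b)
              for (int kk = 0; kk < n; kk += b)
                  /* mini matrix multiply on the C(ii,jj), A(ii,kk), B(kk,jj) tiles */
                  for (int i = ii; i < ii + b; i++)
                      for (int j = jj; j < jj + b; j++) {
                          float cij = C[i*n + j];
                          for (int k = kk; k < kk + b; k++)
                              cij += A[i*n + k] * B[k*n + j];
                          C[i*n + j] = cij;
                      }
  }

Three b x b tiles are live at once, so b is chosen so that 3 * b * b * sizeof(float) stays comfortably below the size of the cache being targeted.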
Further Matrix-Multiply Optimizations in Real-world Libraries
Block size adaptation for appropriate caches
Register-level blocking
Copy optimization (data layout)
Optimizing the mini-matrix-multiply (base case)
Multi-level blocking
Multi-level copying
Register-level blocking
[Figure: a 2-D slice of a 3-D register-tile search space, plotting the number of rows vs. the number of columns in the register block; the dark blue region was pruned]
Slide source: Jim Demmel, UCB
Copy optimization
Copy input operands or blocks:
  Reduce cache conflicts
  Constant array offsets for fixed size blocks
  Expose page-level locality
Alternative: use different data structures from the start (if users are willing)

[Figure: a 4x4 matrix in column-major order (numbers are memory addresses) vs. the same matrix reorganized into contiguous 2x2 blocks]

Slide source: Jim Demmel, UCB
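One way to implement the copy step is to pack each b x b block into a contiguous scratch buffer right before the mini-multiply runs on it; a sketch (the helper name and layout are mine, not from the slide):

  #include <string.h>

  /* Pack the b x b block of the row-major n x n matrix A whose top-left corner
     is (i0, j0) into a contiguous buffer. Rows of the block that were n floats
     apart become b floats apart, so the mini-multiply sees unit-stride data,
     fixed offsets, and far fewer cache-conflict misses. */
  static void pack_block(int n, int b, const float *A, int i0, int j0, float *buf) {
      for (int i = 0; i < b; i++)
          memcpy(&buf[i*b], &A[(i0 + i)*n + j0], b * sizeof(float));
  }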
Outline
CPU Achievable Peak Performance
Matrix-Multiplication Discussion
Overview of Mini-Project 1
A Quick Review: SIMD Parallelism
OpenMP Pragmas
CPU Memory Hierarchy
False Sharing
Mini-Project 1.1 – Matrix Multiply
Mini-Project 1.2 – K-Means
Mini-Project 1
The goal of this project is to use your understanding of the parallel computing resources in a multicore microprocessor to optimize two fully functional applications.
The applications are Matrix Multiplication and K-Means Clustering. For a functional description of the applications, please refer to:
  http://en.wikipedia.org/wiki/Matrix_multiplication
  http://en.wikipedia.org/wiki/K-means
The code optimization techniques you may want to explore include:
  OpenMP pragmas (omp parallel, omp for, omp atomic, omp reduction, …)
  Cache blocking
  SIMD and sequential optimizations
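As a quick illustration of how those pragmas compose on a simple loop (a generic sketch, not code from the project handout):

  #include <omp.h>

  /* omp parallel + omp for split the iterations across threads;
     reduction(+:sum) gives every thread a private partial sum and combines
     them at the end, instead of forcing an omp atomic on every update. */
  double parallel_sum(const double *x, int n) {
      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += x[i];
      return sum;
  }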
Mini-Project 1
Mini-Project Due: Monday, 3/8 @ 11:59PM EST
Mini-Project One Review: Thursday, 3/11 @ 6:00PM EST
Handout: https://cmu.box.com/s/uzfkghtkzfs0qc8acwwwzgdjep0lu9o4
Gradescope Links:
  Matrix Multiply: https://www.gradescope.com/courses/241050/assignments/998372
  K-Means: https://www.gradescope.com/courses/241050/assignments/1025865
  Project Writeup: https://www.gradescope.com/courses/241050/assignments/997551
Mini-Project 1: gradescope
[Screenshots of the Gradescope submission flow]

Mini-Project 1: Grading Criteria
[Grading criteria slide]

Mini-Project 1: Write-Up
[Write-up instructions slide]
Mini-Project 1 – Setup
Matrix Multiply
$ cd ~/18646_MP1/matrix_mul/omp
$ make
$ ./matrix_mul -i ../matrix_mul_03.dat -o
…
Test Case 10: 1114.42 milliseconds

K-Means
$ cd ~/18646_MP1/kmeans
$ make
$ time ./omp_main -i ~/18646_MP1/data/kmeans03.dat -n 184
Writing coordinates of K=184 cluster centers…
Writing membership of N=191681 data objects…
115.659u 0.100s 0:15.57 743.4%
Mini-Project 1 – Makefile
The only two files you will submit are “matrix_mul.cpp” and “omp_kmeans.c”.
However!!!
For your REPORT you can explore the use of different flags during compilation and report on their effect!
  -mavx (code changes may be required)
  -O3
If you make changes to the Makefile for your fastest performing version, please include the code in your project report!!!
Outline
CPU Achievable Peak Performance
Matrix-Multiplication Discussion
Overview of Mini-Project 1
A Quick Review: SIMD Parallelism
OpenMP Pragmas
CPU Memory Hierarchy
False Sharing
Mini-Project 1.1 – Matrix Multiply
Mini-Project 1.2 – K-Means
Exploiting SIMD Parallelism
Applied to finding distance between two points:

Scalar:
  ld r1, x1
  ld r2, x2
  ld r3, x3
  ld r4, y1
  ld r5, y2
  ld r6, y3
  sub r1, r4, r1
  sub r2, r5, r2
  sub r3, r6, r3
  mul r1, r1, r1
  mul r2, r2, r2
  mul r3, r3, r3
  add r1, r1, r2
  add r1, r1, r3
  sqrt r1, r1
  st d, r1

SIMD:
  V1 = _MM_LOAD(&X1[0])
  V2 = _MM_LOAD(&Y1[0])
  V1 = _MM_SUB(V2, V1)
  V1 = _MM_MUL(V1, V1)
  _MM_STORE(&res[0], V1)
  add r1, &res[0], &res[1]
  add r1, r1, &res[2]
  sqrt r1, r1
  st d, r1
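With actual SSE intrinsics, the vector half of this computation could be written as below (a sketch that assumes each point is stored as 4 floats, the 3 coordinates plus a zero pad; this is my rendering, not the slide's code):

  #include <math.h>
  #include <xmmintrin.h>

  /* Euclidean distance between two 3-D points X and Y, each padded to 4 floats
     so that a single 128-bit load covers the whole point. */
  float distance3(const float *X, const float *Y) {
      __m128 vx = _mm_loadu_ps(X);             /* x1 x2 x3 0   */
      __m128 vy = _mm_loadu_ps(Y);             /* y1 y2 y3 0   */
      __m128 d  = _mm_sub_ps(vy, vx);          /* differences  */
      __m128 sq = _mm_mul_ps(d, d);            /* squares      */
      float res[4];
      _mm_storeu_ps(res, sq);                  /* spill to memory */
      return sqrtf(res[0] + res[1] + res[2]);  /* scalar reduce + sqrt */
  }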
SIMD Parallelism
#include <xmmintrin.h>   /* provides __m128 and the _mm_* intrinsics used below */
#define k 32
int main(){
    float x[k]; float y[k];            // vectors of length k
    __m128 X, Y;                       // 128-bit values
    __m128 acc = _mm_setzero_ps();     // set to (0, 0, 0, 0)
    float inner_prod, temp[4];
    int i, j;
    for (j=0; j …
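The listing above is cut off mid-loop. One plausible completion of the SSE inner product, assuming the loop advances 4 floats per iteration and using unaligned loads so no alignment guarantee is required (my reconstruction, not the original listing):

    for (j = 0; j < k; j += 4) {
        X   = _mm_loadu_ps(&x[j]);                /* 4 elements of x        */
        Y   = _mm_loadu_ps(&y[j]);                /* 4 elements of y        */
        acc = _mm_add_ps(acc, _mm_mul_ps(X, Y));  /* 4 partial dot products */
    }
    _mm_storeu_ps(temp, acc);                     /* spill the partial sums */
    inner_prod = temp[0] + temp[1] + temp[2] + temp[3];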
Example Code (prepare for next iteration)
…
        clusters[i][j] = newClusters[i][j] / newClusterSize[i];
        newClusters[i][j] = 0.0;       /* set back to 0 */
    }
    newClusterSize[i] = 0;             /* set back to 0 */
}
…
Example Code (Maximization)
…
    /* update new cluster centers: sum of objects located within */
    newClusterSize[index]++;
    for (j=0; j …
…
        clusters[i][j] = newClusters[i][j] / newClusterSize[i];
        newClusters[i][j] = 0.0;       /* set back to 0 */
    }
    newClusterSize[i] = 0;             /* set back to 0 */
}
…
Possible Concurrencies:
  N (histogram computation into k bins)
  D (independent)
K-Means: Parallelization
#pragma omp parallel for          \
    private(i, ???)               \
    firstprivate(numObjs, ???)    \
    shared(objects, ???)          \
    reduction(???)
• How can we avoid false sharing in OMP?
• Could we introduce additional local variables in each thread? (one common pattern is sketched below)
• Can we parallelize this further?
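One common answer to the questions above is to give each thread its own private accumulators and merge them once at the end, so threads never write to neighboring elements of the shared arrays; a hedged sketch of the pattern (names are generic, not the project's omp_kmeans.c interface):

  #include <omp.h>
  #include <stdlib.h>

  /* Accumulate per-cluster sums and counts with per-thread copies, then merge.
     Each thread touches only its own heap-allocated buffers inside the loop,
     which avoids false sharing on newClusters / newClusterSize. */
  void accumulate_clusters(int numObjs, int numCoords, int numClusters,
                           float **objects, const int *membership,
                           float **newClusters, int *newClusterSize)
  {
      #pragma omp parallel
      {
          float *localSums = calloc((size_t)numClusters * numCoords, sizeof(float));
          int   *localSize = calloc((size_t)numClusters, sizeof(int));

          #pragma omp for nowait
          for (int i = 0; i < numObjs; i++) {
              int c = membership[i];                  /* cluster of object i */
              localSize[c]++;
              for (int j = 0; j < numCoords; j++)
                  localSums[c * numCoords + j] += objects[i][j];
          }

          #pragma omp critical                        /* one merge per thread */
          for (int c = 0; c < numClusters; c++) {
              newClusterSize[c] += localSize[c];
              for (int j = 0; j < numCoords; j++)
                  newClusters[c][j] += localSums[c * numCoords + j];
          }

          free(localSums);
          free(localSize);
      }
  }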
K-Means: Cost Reduction
for (i=0; i < … ) {
    tmp = 1.0 / newClusterSize[i];               /* one divide per cluster */
    for (j=0; j < … )
        clusters[i][j] = newClusters[i][j] * tmp;
}
• Remove the dividing operation and shift it to the outer loop
K-Means: Other Ideas?
• Predict “stable” points: if a point doesn’t change its cluster for “m” consecutive rounds, regard it as “stable” and skip its computation in the next round (one possible sketch follows below)
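A sketch of what that heuristic could look like (names are illustrative; note it is an approximation, since a skipped point keeps a possibly stale assignment for a round):

  /* Skip re-assignment for points whose cluster has not changed for
     STABLE_ROUNDS consecutive rounds. stable_count[i] counts how many rounds
     in a row point i stayed put; find_nearest is supplied by the caller. */
  #define STABLE_ROUNDS 3

  void assign_with_stable_points(int numObjs, int *membership, int *stable_count,
                                 int (*find_nearest)(int))
  {
      for (int i = 0; i < numObjs; i++) {
          if (stable_count[i] >= STABLE_ROUNDS)
              continue;                        /* regard point i as stable: skip it */

          int c = find_nearest(i);             /* full distance computation */
          if (c == membership[i]) {
              stable_count[i]++;               /* unchanged for one more round */
          } else {
              stable_count[i] = 0;             /* it moved: reset and update */
              membership[i] = c;
          }
      }
  }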