18-646 – How to Write Fast Code II

Outline
— CPU Achievable Peak Performance
— Matrix-Multiplication Discussion
— Overview of Mini-Project 1
— A Quick Review:
  — SIMD Parallelism
  — OpenMP Pragmas
  — CPU Memory Hierarchy
  — False Sharing
— Mini-Project 1.1 – Matrix Multiply
— Mini-Project 1.2 – K-Means

Peak Single-precision Performance
— Intel Sandy Bridge, 3.5 GHz, 4 cores × 2 sockets (2 floating-point units per core)

[GFLOPS]        1 core   2 cores   4 cores   8 cores
None (scalar)        7        14        28        56
SSE                 28        56       112       224
AVX                 56       112       224       448

Most CPU applications achieve only a fraction of this peak.
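Each entry is the product of clock rate, flops per cycle, SIMD width, and core count. As a check, assuming (as the note about two ALUs suggests) that one add and one multiply can issue per cycle per core:

Scalar, 1 core: 3.5 GHz × 2 flops/cycle × 1 lane × 1 core = 7 GFLOPS
AVX, 8 cores:   3.5 GHz × 2 flops/cycle × 8 single-precision lanes × 8 cores = 448 GFLOPS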

In addition …
— Deep memory hierarchy
  — Registers
  — L1 cache
  — L2 cache
  — TLB (Translation Lookaside Buffer): a table used to translate virtual memory addresses to physical addresses
— Other forms of parallelism
  — Pipelining
  — Instruction-level parallelism (multiple functional units)
— “Free operations” are not free

Outline
— CPU Achievable Peak Performance
— Matrix-Multiplication Discussion
— Overview of Mini-Project 1
— A Quick Review:
  — SIMD Parallelism
  — OpenMP Pragmas
  — CPU Memory Hierarchy
  — False Sharing
— Mini-Project 1.1 – Matrix Multiply
— Mini-Project 1.2 – K-Means

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
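For reference, the same triple loop in C, as a minimal sketch assuming square n×n row-major arrays (the function name is illustrative):

/* C = C + A*B: naive triple loop, same i-j-k order as the pseudocode */
void naive_matmul(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}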

Matrix-Matrix Multiplication Speed

[Plot: MMM (square, real, double) on a 3 GHz Core 2 Duo; performance in Gflop/s vs. matrix side (1 to 2048); curves for the theoretical peak, ATLAS-generated code, and the naïve triple loop]

• Huge performance difference for large sizes
• Great case study to learn memory hierarchy optimization

Note on Matrix Storage
— A matrix is a 2-D array of elements, but memory addresses are “1-D”
— Conventions for matrix layout:
  — by column, or “column major” (Fortran default): A[i,j] at i + j*n
  — by row, or “row major” (C default): A[i,j] at i*n + j
— Column major (for now)

[Figure: a column-major matrix in memory. Addresses run down each column (0–4 in column 0, 5–9 in column 1, …), so cachelines hold column segments; a single row of the matrix is scattered across many cachelines]
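The two addressing formulas, made concrete as C macros (a small sketch; the macro names are illustrative):

/* address of A[i][j] for an n x n matrix under each layout */
#define IDX_COL_MAJOR(i, j, n) ((i) + (j)*(n))  /* Fortran default */
#define IDX_ROW_MAJOR(i, j, n) ((i)*(n) + (j))  /* C default */

Walking along a row of a column-major matrix therefore strides by n elements, touching a different cacheline on nearly every access once n is large.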

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}    // O(n^3) slow loads
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Naïve Matrix Multiply

Number of slow memory references on unblocked matrix multiply:

m = n^3   (read each column of B n times)
  + n^2   (read each row of A once)
  + 2n^2  (read and write each element of C once)
  = n^3 + 3n^2

With f = 2n^3 flops, the computational intensity is

q = f / m = 2n^3 / (n^3 + 3n^2) ≈ 2 for large n,

no improvement over matrix-vector multiply. The inner two loops are just a matrix-vector multiply of row i of A times B; the count is similar for any other order of the three loops.

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Naïve Matrix Multiply on RS/6000

[Plot: log cycles/flop vs. log problem size. Measured time grows as T = N^4.7, whereas O(N^3) performance would show constant cycles/flop. Size 2000 took 5 days; size 12000 would take 1095 years]

(Slide source: Larry Carter, UCSD)

Naïve Matrix Multiply on RS/6000

[Plot: the same log cycles/flop vs. log problem size curve, annotated with where each level of the memory hierarchy is exhausted: a cache miss every 16 iterations, a page miss every 512 iterations, then a TLB miss every iteration, and finally a page miss every iteration]

(Slide source: Larry Carter, UCSD)

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be N×N matrices of b×b sub-blocks, where N = n / b:

for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}    // O(N^3) slow block loads
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

The blocks should be smaller than the cache.

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
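As a concrete reference, a minimal C sketch of the blocked loop nest, assuming square row-major n×n matrices and a block size BS that divides n evenly (BS is a tuning knob, chosen so three BS×BS blocks fit in cache):

#define BS 64  /* block size: tune so 3*BS*BS*sizeof(float) fits in cache */

/* C = C + A*B for row-major n x n matrices; assumes n % BS == 0 */
void blocked_matmul(int n, const float *A, const float *B, float *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                /* mini matrix multiply on the BS x BS blocks */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        float a = A[i*n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}

The i-k-j order inside the mini multiply keeps the innermost accesses to B and C unit-stride for row-major storage.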

Further Matrix-Multiply Optimizations in Real-world Libraries
— Block size adaptation for appropriate caches
— Register-level blocking
— Copy optimization (data layout)
— Optimizing the mini-matrix-multiply (base case)
— Multi-level blocking
— Multi-level copying

Register-level blocking

[Figure: a 2-D slice of a 3-D register-tile search space, number of rows vs. number of columns in the register block; the dark blue region was pruned]

(Slide source: Jim Demmel, UCB)

Copy optimization
— Copy input operands or blocks
  — Reduce cache conflicts
  — Constant array offsets for fixed size blocks
  — Expose page-level locality
— Alternative: use different data structures from the start (if users are willing)

[Figure: a 4×4 matrix with elements labeled by memory address, shown in its original column-major layout (0, 4, 8, 12 across the first row) and reorganized so that each 2×2 block is contiguous in memory]

(Slide source: Jim Demmel, UCB)
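A minimal sketch of the copy step, assuming row-major storage and a b×b block copied into a contiguous scratch buffer before the mini matrix multiply (the function name is illustrative):

/* Pack block A(ii..ii+b-1, kk..kk+b-1) of a row-major n x n matrix
   into a contiguous b x b buffer: the block then has fixed offsets and
   better cacheline and page locality for the mini matrix multiply. */
void pack_block(int n, int b, const float *A, int ii, int kk, float *Ablk)
{
    for (int i = 0; i < b; i++)
        for (int k = 0; k < b; k++)
            Ablk[i*b + k] = A[(ii + i)*n + (kk + k)];
}

The O(n^2) copy cost is amortized over the O(n^3) flops that reuse each packed block.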

Outline
— CPU Achievable Peak Performance
— Matrix-Multiplication Discussion
— Overview of Mini-Project 1
— A Quick Review:
  — SIMD Parallelism
  — OpenMP Pragmas
  — CPU Memory Hierarchy
  — False Sharing
— Mini-Project 1.1 – Matrix Multiply
— Mini-Project 1.2 – K-Means

Mini-Project 1

The goal of this project is to use your understanding of parallel computing resources in a multicore microprocessor to optimize two fully functional applications.

The applications are Matrix Multiplication and K-Means Clustering. For a functional description of the applications, please refer to:
— http://en.wikipedia.org/wiki/Matrix_multiplication
— http://en.wikipedia.org/wiki/K-means

The code optimization techniques you may want to explore include:
— OpenMP pragmas (omp parallel, omp for, omp atomic, omp reduction, …)
— Cache blocking
— SIMD and sequential optimizations

Mini-Project 1
— Mini-Project Due: Monday, 3/8 @ 11:59PM EST
— Mini-Project One Review: Thursday, 3/11 @ 6:00PM EST

Handout: https://cmu.box.com/s/uzfkghtkzfs0qc8acwwwzgdjep0lu9o4

Gradescope Links:
— Matrix Multiply: https://www.gradescope.com/courses/241050/assignments/998372
— K-Means: https://www.gradescope.com/courses/241050/assignments/1025865
— Project Writeup: https://www.gradescope.com/courses/241050/assignments/997551

Mini-Project 1: gradescope

[Screenshots: walkthrough of the Gradescope submission interface]

Mini-Project 1: Grading Criteria

[Screenshot: grading criteria]

Mini-Project 1: Write-Up

[Screenshot: write-up instructions]

Mini-Project 1 – Setup
— Matrix Multiply

$ cd ~/18646_MP1/matrix_mul/omp
$ make
$ ./matrix_mul -i ../matrix_mul_03.dat -o
Test Case 10 1114.42 milliseconds

— K-Means

$ cd ~/18646_MP1/kmeans
$ make
$ time ./omp_main -i ~/18646_MP1/data/kmeans03.dat -n 184
Writing coordinates of K=184 cluster centers…
Writing membership of N=191681 data objects…
115.659u 0.100s 0:15.57 743.4%

Mini-Project 1 – Makefile
— The only two files you will submit are “matrix_mul.cpp” and “omp_kmeans.c”
— However!!!
— For your REPORT you can explore the use of different flags during compilation and report on their effect!
  — -mavx (code changes may be required)
  — -O3

If you make changes to the Makefile for your fastest performing version, please include the code in your project report!!!
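For example, one such experiment (illustrative only; the handout's Makefile layout may differ) is to add the flags above to the compile line and measure their effect:

# Illustrative Makefile experiment for the report
CFLAGS += -O3        # aggressive sequential optimization
CFLAGS += -mavx      # allow AVX codegen (code changes may be required)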

Outline
— CPU Achievable Peak Performance
— Matrix-Multiplication Discussion
— Overview of Mini-Project 1
— A Quick Review:
  — SIMD Parallelism
  — OpenMP Pragmas
  — CPU Memory Hierarchy
  — False Sharing
— Mini-Project 1.1 – Matrix Multiply
— Mini-Project 1.2 – K-Means

Exploiting SIMD Parallelism
— Applied to finding distance between two points:

Scalar:
ld r1, x1
ld r2, x2
ld r3, x3
ld r4, y1
ld r5, y2
ld r6, y3
sub r1, r4, r1
sub r2, r5, r2
sub r3, r6, r3
mul r1, r1, r1
mul r2, r2, r2
mul r3, r3, r3
add r1, r1, r2
add r1, r1, r3
sqrt r1, r1
st d, r1

SIMD:
V1 = _MM_LOAD(&X1[0])
V2 = _MM_LOAD(&Y1[0])
V1 = _MM_SUB(V2, V1)
V1 = _MM_MUL(V1, V1)
_MM_STORE(&res[0], V1)
add r1, &res[0], &res[1]
add r1, r1, &res[2]
sqrt r1, r1
st d, r1

SIMD Parallelism

#include <stdio.h>
#include <pmmintrin.h>  // header for SSE3
#define k 32

int main() {
  float x[k] __attribute__((aligned(16)));  // vectors of length k
  float y[k] __attribute__((aligned(16)));
  __m128 X, Y;                              // 128-bit values
  __m128 acc = _mm_setzero_ps();            // set to (0, 0, 0, 0)
  float inner_prod;
  float temp[4] __attribute__((aligned(16)));
  int i, j;
  for (j = 0; j < k; j += 4) {              // process 4 floats per iteration
    X = _mm_load_ps(&x[j]);                 // load 4 floats from x
    Y = _mm_load_ps(&y[j]);                 // load 4 floats from y
    acc = _mm_add_ps(acc, _mm_mul_ps(X, Y)); // acc += x*y in all 4 lanes
  }
  _mm_store_ps(temp, acc);                  // move the 4 partial sums to memory
  inner_prod = temp[0] + temp[1] + temp[2] + temp[3];
  printf("%f\n", inner_prod);
  return 0;
}

Example Code (Maximization)

….
/* update new cluster centers: sum of objects located within each cluster */
newClusterSize[index]++;
for (j = 0; j < numCoords; j++)
    newClusters[index][j] += objects[i][j];
….
/* average the sums and replace old cluster centers with newClusters,
   preparing for the next iteration */
for (i = 0; i < numClusters; i++) {
    for (j = 0; j < numCoords; j++) {
        if (newClusterSize[i] > 0)
            clusters[i][j] = newClusters[i][j] / newClusterSize[i];
        newClusters[i][j] = 0.0;    /* set back to 0 */
    }
    newClusterSize[i] = 0;          /* set back to 0 */
}

Possible concurrencies:
— Over the N objects: a histogram computation into k bins
— Over the D coordinates: independent

K-Means: Parallelization

#pragma omp parallel for \
    private(i, ???) \
    firstprivate(numObjs, ???) \
    shared(objects, ???) \
    reduction(???)

• How can we avoid false sharing in OMP?
• Could we introduce additional local variables in each thread?
• Can we parallelize this further?
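One possible shape of an answer, as a hedged sketch rather than the reference solution: give each thread its own copy of the cluster sums and sizes, then merge the partial results after the parallel region. The padded per-thread arrays avoid both atomics and false sharing. The function and the local_* arrays are illustrative names, not part of the handout; find_nearest_cluster stands in for the distance search used earlier.

#include <omp.h>

extern int find_nearest_cluster(int numClusters, int numCoords,
                                float *object, float **clusters);

void assign_and_accumulate(int numObjs, int numClusters, int numCoords,
                           float **objects, float **clusters,
                           int *membership,
                           int **local_newClusterSize,  /* [nthreads][numClusters] */
                           float ***local_newClusters)  /* [nthreads][numClusters][numCoords] */
{
    int i, j;
    #pragma omp parallel for private(j) \
        firstprivate(numObjs, numClusters, numCoords) \
        shared(objects, clusters, membership, \
               local_newClusterSize, local_newClusters)
    for (i = 0; i < numObjs; i++) {
        int tid = omp_get_thread_num();
        int index = find_nearest_cluster(numClusters, numCoords,
                                         objects[i], clusters);
        membership[i] = index;
        local_newClusterSize[tid][index]++;     /* thread-private bin */
        for (j = 0; j < numCoords; j++)
            local_newClusters[tid][index][j] += objects[i][j];
    }
    /* afterwards: merge the per-thread partial sums into newClusters /
       newClusterSize, O(nthreads * k * D) serial work */
}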

K-Means: Cost Reduction

for (i = 0; i < numClusters; i++) {
    tmp = 1.0 / newClusterSize[i];
    for (j = 0; j < numCoords; j++) {
        if (newClusterSize[i] > 1)
            clusters[i][j] = newClusters[i][j] * tmp;
    }
}

• Remove the dividing operation and shift it to the outer loop: one divide per cluster instead of one per coordinate

K-Means: Other Ideas?
• Predict “stable” points: if a point does not change its cluster for “m” consecutive rounds, regard it as “stable” and skip its distance computation in the next round
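A hedged sketch of how that heuristic could slot into the assignment loop (stable_count and M are hypothetical additions, not part of the handout code; find_nearest_cluster is the distance search used earlier):

/* Skip the distance search for points that have kept the same cluster
   for M consecutive rounds. stable_count is a hypothetical per-point
   counter, zero-initialized before the first iteration. */
for (i = 0; i < numObjs; i++) {
    int index;
    if (stable_count[i] >= M) {
        index = membership[i];            /* assume it stays put */
    } else {
        index = find_nearest_cluster(numClusters, numCoords,
                                     objects[i], clusters);
        stable_count[i] = (index == membership[i]) ? stable_count[i] + 1 : 0;
        membership[i] = index;
    }
    /* ... accumulate into newClusterSize / newClusters as before ... */
}

The trade-off is accuracy: a skipped point can no longer migrate that round, so m controls how aggressively quality is traded for speed.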