Data Access Optimization
CMPSC 450
Review: Memory Hierarchy
• Latency increases with distance
• Memory size increases with distance
CMPSC 450
Vector Triad Benchmark
A Simple streaming benchmark:
double A[N], B[N], C[N], D[N]
do j=1,R {
do i = 1,N {
A[i] = B[i] + C[i] * D[i]
}
3 loads, 1 store 2FLOPS
if (something_that_is_never_true) dummy (A,B,C,D)
}
This kernel is limited by data transfer performance for all memory levels on all current architectures.
CMPSC 450
A(:)=B(:) + C(:) * D(:) on one Sandy Bridge core (3 GHz)
CMPSC 450
Throughput capabilities on the Intel Sandy Bridge
• Per cycle with AVX
• 1 load instruction (256 bits) AND 1⁄2 store instruction (128 bits)
• 1 AVX MULT and 1 AVX Add instruction • (4DP/8SPflopseach)
• Overall maximum of 4 micro-ops
• Per cycle with SSE or scalar
• 2 load instruction OR 1 load and 1 store instruction • 1 MULT and 1 ADD instruction
• Overall maximum of 4 micro-ops
• Data transfer between cache levels
• 256 bits per cycle, half-duplex (i.e. full CL transfer == 2 cy)
CMPSC 450
Balance analysis
Machine Balance:
• Ratio of possible memory bandwidth to peak
performance.
• Each memory level has a new ratio.
Code Balance:
• Ratio of data traffic (loads and stores) to
computational operations
Lightspeed:
• Maximum fraction of peak performance of a code
with Balance Bc on a machine with balance Bm. Performance:
Bm= Memorybandwidth[Gwords/sec]= bmax Peak performance [Gflops/sec] Pmax
Data traffic [Words] Floating point ops [Flops]
L = min( 1, Bm / Bc )
P = L * Pmax = min( 1, bmax / Bc )
Bc =
CMPSC 450
Code Balance
Intel i7-2720QM (Sandy Bridge):
• 2.2 GHZ (3.3 GHz Max)
• 6M Cache
• 66400 MFLOPS (16600 MFLOPS per core)
• 25.6 GB/s = 3.2 GWords/sec
• Bm=3.2/16.6=0.192
Lightspeed:
• L=min(1,Bm/Bc) • L=0.096
Vector Triad Demo A[0] = B[0] + C[0] * D[0]; • 3 Loads, 1 store
• 2 Flops
• Bc = 2
Performance:
• P=L*Pmax
• P=0.096*16.6=1.6GFLOPS
• Vector Triad was at approx. 0.8 GFLOPS!
CMPSC 450
Jacobi Stencil
• Four loads, one store, 3 maths.
• One element is new. The rest may still be in cache.
pg = 0; pg2 = 1;
for (ii = 0; ii < iter; ii ++) {
for (y = 1; y < ymax-1; y++) for(x = 1; x < xmax-1; x++)
phi[pg][y][x] = phi[pg2][y][x+1] + phi[pg2][y][x-1] + phi[pg2][y-1][x] + phi[pg2][y+1][x];
pg = (pg + 1) % 2;
pg2 = (pg2 + 1) % 2 }
CMPSC 450
Data access – general guidelines
O(N) / O(N) Algorithms
• O(N) arithmetic operations vs. O(N) data access operations
• Examples: Scalar product, vector addition, sparse MVM etc.
• Performance limited by memory BW for large N (“memory bound”) • Limited optimization potential for single loops
• Example: successive vector additions
do i=1,N
a(i) = b(i) + c(i)
enddo
do i=1,N
z(i) = b(i) + e(i)
enddo
do i=1,N
a(i) = b(i) + c(i)
z(i) = b(i) + e(i)
enddo
Loop fusion
No optimization potential for either loop
Fusing different loops allows O(N) data reuse from registers
CMPSC 450
Data access – general guidelines
O(N2) / O(N2) algorithms
• Nested loops
• Examples: dense matrix-vector multiply, matrix addition, dense matrix transposition etc.
• Memory bound for large N
• Some optimization potential
• Canoftenenhancecodebalancebyouterloopunrollingorspatialblocking
• Example: dense matrix vector multiplication
CMPSC 450
Data access – general guidelines
O(N2) / O(N2) algorithms: Outer loop unrolling (unroll and jam)
do i=1,N
do j=1,N
c(i)=c(i) + a(i, j) * b(j)
enddo
enddo
do i=1,N,2
do j=1,N
c(i)=c(i) + a(i, j) * b(j)
c(i+1)=c(i+1) + a(i+1, j) * b(j)
enddo
enddoc
b(j) can be re-used
CMPSC 450
Data access – general guidelines
O(N2) / O(N2) algorithms: spatial blocking
• Split up inner loop in small chunks (pref. multiple cache lines) • Add outer loop pointing to the start of the innermost loop
• bs: blocking size, CLS: Cache line size
Reads from a and b now fit into cache lines. b is now loaded only once.
However, c is loaded and stored nb times.
bs=m*CLS, nb=N/bs
do k=1,nb
do i=1,N
do j=(k-1)*bs+1,k*bs
c(i)=c(i) + a(i, j) * b(j)
enddo
enddo enddo
CMPSC 450
Spatial Blocking Visualized
CMPSC 450
Data access – general guidelines
Case 3: O(N3) / O(N2) algorithms
• Computation outweighs data traffic by factor of N
• Example: Dense Matrix-Matrix multiplication
• Possible approach, apply O(N2) / O(N2) algorithms to inner loops.
do i=1,N
do j=1,N
do k=1,N
c(j,i)=c(j,i) + a(k,i) * b(k,j)
enddo enddo
enddo
Core task: dense MVM (O(N2)/O(N2)) Memory bound
CMPSC 450
Sources:
• Introduction to High Performance Computing for Scientists and Engineers, Chapter 3
• Single Core Issues presentation by J. Treilbig, HiPerCH 3, 23./24.03.2015
• http://www.7-cpu.com/cpu/SandyBridge.html
CMPSC 450