CS计算机代考程序代写 cache algorithm Data Access Optimization

Data Access Optimization
CMPSC 450

Review: Memory Hierarchy
• Latency increases with distance
• Memory size increases with distance
CMPSC 450

Vector Triad Benchmark
A Simple streaming benchmark:
double A[N], B[N], C[N], D[N]
do j=1,R {
do i = 1,N {
A[i] = B[i] + C[i] * D[i]
}
3 loads, 1 store 2FLOPS
if (something_that_is_never_true) dummy (A,B,C,D)
}
This kernel is limited by data transfer performance for all memory levels on all current architectures.
CMPSC 450

A(:)=B(:) + C(:) * D(:) on one Sandy Bridge core (3 GHz)
CMPSC 450

Throughput capabilities on the Intel Sandy Bridge
• Per cycle with AVX
• 1 load instruction (256 bits) AND 1⁄2 store instruction (128 bits)
• 1 AVX MULT and 1 AVX Add instruction • (4DP/8SPflopseach)
• Overall maximum of 4 micro-ops
• Per cycle with SSE or scalar
• 2 load instruction OR 1 load and 1 store instruction • 1 MULT and 1 ADD instruction
• Overall maximum of 4 micro-ops
• Data transfer between cache levels
• 256 bits per cycle, half-duplex (i.e. full CL transfer == 2 cy)
CMPSC 450

Balance analysis
Machine Balance:
• Ratio of possible memory bandwidth to peak
performance.
• Each memory level has a new ratio.
Code Balance:
• Ratio of data traffic (loads and stores) to
computational operations
Lightspeed:
• Maximum fraction of peak performance of a code
with Balance Bc on a machine with balance Bm. Performance:
Bm= Memorybandwidth[Gwords/sec]= bmax Peak performance [Gflops/sec] Pmax
Data traffic [Words] Floating point ops [Flops]
L = min( 1, Bm / Bc )
P = L * Pmax = min( 1, bmax / Bc )
Bc =
CMPSC 450

Code Balance
Intel i7-2720QM (Sandy Bridge):
• 2.2 GHZ (3.3 GHz Max)
• 6M Cache
• 66400 MFLOPS (16600 MFLOPS per core)
• 25.6 GB/s = 3.2 GWords/sec
• Bm=3.2/16.6=0.192
Lightspeed:
• L=min(1,Bm/Bc) • L=0.096
Vector Triad Demo A[0] = B[0] + C[0] * D[0]; • 3 Loads, 1 store
• 2 Flops
• Bc = 2
Performance:
• P=L*Pmax
• P=0.096*16.6=1.6GFLOPS
• Vector Triad was at approx. 0.8 GFLOPS!
CMPSC 450

Jacobi Stencil
• Four loads, one store, 3 maths.
• One element is new. The rest may still be in cache.
pg = 0; pg2 = 1;
for (ii = 0; ii < iter; ii ++) { for (y = 1; y < ymax-1; y++) for(x = 1; x < xmax-1; x++) phi[pg][y][x] = phi[pg2][y][x+1] + phi[pg2][y][x-1] + phi[pg2][y-1][x] + phi[pg2][y+1][x]; pg = (pg + 1) % 2; pg2 = (pg2 + 1) % 2 } CMPSC 450 Data access – general guidelines O(N) / O(N) Algorithms • O(N) arithmetic operations vs. O(N) data access operations • Examples: Scalar product, vector addition, sparse MVM etc. • Performance limited by memory BW for large N (“memory bound”) • Limited optimization potential for single loops • Example: successive vector additions do i=1,N a(i) = b(i) + c(i) enddo do i=1,N z(i) = b(i) + e(i) enddo do i=1,N a(i) = b(i) + c(i) z(i) = b(i) + e(i) enddo Loop fusion No optimization potential for either loop Fusing different loops allows O(N) data reuse from registers CMPSC 450 Data access – general guidelines O(N2) / O(N2) algorithms • Nested loops • Examples: dense matrix-vector multiply, matrix addition, dense matrix transposition etc. • Memory bound for large N • Some optimization potential • Canoftenenhancecodebalancebyouterloopunrollingorspatialblocking • Example: dense matrix vector multiplication CMPSC 450 Data access – general guidelines O(N2) / O(N2) algorithms: Outer loop unrolling (unroll and jam) do i=1,N do j=1,N c(i)=c(i) + a(i, j) * b(j) enddo enddo do i=1,N,2 do j=1,N c(i)=c(i) + a(i, j) * b(j) c(i+1)=c(i+1) + a(i+1, j) * b(j) enddo enddoc b(j) can be re-used CMPSC 450 Data access – general guidelines O(N2) / O(N2) algorithms: spatial blocking • Split up inner loop in small chunks (pref. multiple cache lines) • Add outer loop pointing to the start of the innermost loop • bs: blocking size, CLS: Cache line size Reads from a and b now fit into cache lines. b is now loaded only once. However, c is loaded and stored nb times. bs=m*CLS, nb=N/bs do k=1,nb do i=1,N do j=(k-1)*bs+1,k*bs c(i)=c(i) + a(i, j) * b(j) enddo enddo enddo CMPSC 450 Spatial Blocking Visualized CMPSC 450 Data access – general guidelines Case 3: O(N3) / O(N2) algorithms • Computation outweighs data traffic by factor of N • Example: Dense Matrix-Matrix multiplication • Possible approach, apply O(N2) / O(N2) algorithms to inner loops. do i=1,N do j=1,N do k=1,N c(j,i)=c(j,i) + a(k,i) * b(k,j) enddo enddo enddo Core task: dense MVM (O(N2)/O(N2)) Memory bound CMPSC 450 Sources: • Introduction to High Performance Computing for Scientists and Engineers, Chapter 3 • Single Core Issues presentation by J. Treilbig, HiPerCH 3, 23./24.03.2015 • http://www.7-cpu.com/cpu/SandyBridge.html CMPSC 450