
OPTIMISING MEMORY STRUCTURES II

What can go wrong
• Poor cache/page use

  • Lack of spatial locality
  • Lack of temporal locality
  • Cache thrashing
• Unnecessary memory accesses
  • Pointer chasing
  • Array temporaries
• Aliasing problems
  • Use of pointers can inhibit code optimisation

Reducing memory accesses
• Memory accesses are often the most important limiting factor for code performance.
• Many older codes were written when memory access was relatively cheap.
• Things to look for:
  • Unnecessary pointer chasing
    • Pointer arrays that could be simple arrays
    • Linked lists that could be arrays (see the sketch after this list)
  • Unnecessary temporary arrays
  • Tables of values that would be cheap to re-calculate
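A minimal sketch of the pointer-chasing point above, assuming Fortran 95-style pointers; the NODE type and the SUM_LIST/SUM_ARRAY routines are illustrative names, not code from the course. Each linked-list step must load NEXT before the following VAL can even be addressed, while the array version streams through contiguous memory.

MODULE LIST_DEMO
   TYPE NODE
      REAL :: VAL
      TYPE(NODE), POINTER :: NEXT => NULL()
   END TYPE NODE
CONTAINS
   ! Pointer chasing: successive nodes may be scattered through memory and
   ! each NEXT load must complete before the next element can be fetched.
   REAL FUNCTION SUM_LIST(HEAD)
      TYPE(NODE), POINTER :: HEAD, P
      SUM_LIST = 0.0
      P => HEAD
      DO WHILE (ASSOCIATED(P))
         SUM_LIST = SUM_LIST + P%VAL
         P => P%NEXT
      END DO
   END FUNCTION SUM_LIST
   ! Same data held in a simple array: contiguous, cache friendly, vectorisable.
   REAL FUNCTION SUM_ARRAY(V, N)
      INTEGER :: N, I
      REAL :: V(N)
      SUM_ARRAY = 0.0
      DO I = 1, N
         SUM_ARRAY = SUM_ARRAY + V(I)
      END DO
   END FUNCTION SUM_ARRAY
END MODULE LIST_DEMO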

Vector temporaries
• Old vector code often had many simple loops with intermediate results in temporary arrays
REAL V(1024,3), S(1024), U(3)
DO I=1,1024
   S(I) = U(1)*V(I,1)
END DO
DO I=1,1024
   S(I) = S(I) + U(2)*V(I,2)
END DO
DO I=1,1024
   S(I) = S(I) + U(3)*V(I,3)
END DO
DO J=1,3
   DO I=1,1024
      V(I,J) = S(I) * U(J)
   END DO
END DO

• Can merge loops and use a scalar
REAL V(1024,3), S, U(3)
DO I=1,1024
   S = U(1)*V(I,1) + U(2)*V(I,2) + U(3)*V(I,3)
   DO J=1,3
      V(I,J) = S * U(J)
   END DO
END DO
• Compilers are reasonably good at turning scalars into vector temporaries if necessary, but the reverse operation is harder.
• Particularly if temporary is not local.

• May want hybrid scheme to increase locality while still allowing vectorisation
REAL V(1024,3), S(8), U(3)
DO K=0,1023,8
   DO I=1,8
      S(I) = U(1)*V(I+K,1) + U(2)*V(I+K,2) + U(3)*V(I+K,3)
   END DO
   DO J=1,3
      DO I=1,8
         V(I+K,J) = S(I) * U(J)
      END DO
   END DO
END DO
• Hopefully compiler will generate this from scalar version

Other temporaries
• Similar problem can occur at higher levels of the code.
• Temporary arrays used to pass data between different parts of the program.
• Merging routines might allow these arrays to be removed, improving performance (a sketch of the pattern follows below)
• Danger of damaging code maintainability
• However the refactored code structure might be just as good, if not better.
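A minimal sketch of that pattern (the routine names SCALE and SHIFT, the array size and the arithmetic are assumptions for illustration, not code from the course): the array T exists only to carry data from one routine to the next, and every element of it has left the cache by the time it is read back.

PROGRAM TEMP_DEMO
   REAL A(100000), T(100000), B(100000)
   A = 1.0
   ! T is written in full by SCALE and read in full by SHIFT;
   ! it has no other purpose in the program.
   CALL SCALE(A, T, 100000)
   CALL SHIFT(T, B, 100000)
END PROGRAM TEMP_DEMO

SUBROUTINE SCALE(X, Y, N)
   INTEGER N, I
   REAL X(N), Y(N)
   DO I=1,N
      Y(I) = 2.0*X(I)
   END DO
END SUBROUTINE SCALE

SUBROUTINE SHIFT(Y, Z, N)
   INTEGER N, I
   REAL Y(N), Z(N)
   DO I=1,N
      Z(I) = Y(I) + 1.0
   END DO
END SUBROUTINE SHIFT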

High level data flow analysis
Programmer’s perspective:
• The high level program structure is the programmer's responsibility
• Compiler will not make any changes to this.
• Mostly compilers process each subroutine independently.
• The high level program architecture and data structures are completely determined by the programmer.
• Changing the high level structure can significantly change the data locality of the program.

Subroutine merging
• Try to understand the data lifetimes in your application.
• What are the lifetimes of the data structures
• When are they created
• When are they destroyed
• However several generations of data may inhabit the data structures during this lifetime.
• Also need to understand the life cycle of the data generations
  • When is it first written
  • When is it last read
• Also need to understand the iteration space of subroutines
• If two routines have the same iteration space and a data array only exists to pass data between them, then consider merging them (as sketched below)
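Continuing the illustrative SCALE/SHIFT sketch from the earlier slide (still an assumption, not code from the course), the two routines share the iteration space 1..N and T only passes data between them, so they can be merged and the array temporary collapses to a scalar that stays in a register:

SUBROUTINE SCALE_AND_SHIFT(X, Z, N)
   INTEGER N, I
   REAL X(N), Z(N), T
   DO I=1,N
      T = 2.0*X(I)       ! intermediate value consumed immediately
      Z(I) = T + 1.0
   END DO
END SUBROUTINE SCALE_AND_SHIFT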

• Caches rely on temporal and spatial locality
• Caches are divided into blocks
• Blocks are organized as sets
• A memory location is mapped to a set depending on its address (a small sketch follows below)
• It can occupy any block within that set
[Figure: the full address divided into set, block index and offset fields]
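A minimal sketch of this mapping, using word addresses and assumed cache parameters (8-word blocks, 128 sets) that are illustrative rather than values from the course:

PROGRAM SET_MAPPING
   INTEGER, PARAMETER :: BLOCK_WORDS = 8     ! assumed words per cache block
   INTEGER, PARAMETER :: NSETS = 128         ! assumed number of sets
   INTEGER :: ADDR, OFFSET, SET
   ADDR   = 123456
   OFFSET = MOD(ADDR, BLOCK_WORDS)           ! low bits: word within the block
   SET    = MOD(ADDR/BLOCK_WORDS, NSETS)     ! next bits: which set it maps to
   ! The remaining high bits form the tag stored alongside the block.
   PRINT *, 'offset =', OFFSET, '  set =', SET
END PROGRAM SET_MAPPING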

Utilizing caches
• Want to avoid cache conflicts
• This happens when too much related data maps to the same cache set.
• Arrays or array dimensions proportional to (cache size / set size) can cause this.
• However modern caches tend to have a large number of sets, which reduces the impact of this problem (a worked example with assumed numbers follows below).
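As a worked illustration, the sketch below computes the critical spacing for an assumed 32 KB, 8-way cache with 64-byte blocks; these numbers are not from the course, and real targets will differ:

PROGRAM CRITICAL_STRIDE
   INTEGER, PARAMETER :: CACHE_BYTES = 32*1024   ! assumed cache size
   INTEGER, PARAMETER :: WAYS        = 8         ! assumed set size (associativity)
   INTEGER, PARAMETER :: BLOCK_BYTES = 64        ! assumed block size
   INTEGER :: NSETS, STRIDE_BYTES
   NSETS        = CACHE_BYTES/(WAYS*BLOCK_BYTES) ! 64 sets
   STRIDE_BYTES = NSETS*BLOCK_BYTES              ! 4096 bytes between same-set addresses
   ! Addresses a multiple of STRIDE_BYTES apart land in the same set, so more
   ! than WAYS such streams start evicting each other.
   PRINT *, 'sets =', NSETS, '  critical stride (bytes) =', STRIDE_BYTES
END PROGRAM CRITICAL_STRIDE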

Conflict example
• Assume a 1024 word direct mapped cache
REAL A(1024), B(1024), C(1024), X
COMMON /DAT/ A,B,C   ! Contiguous
DO I=1,1024
   A(I) = B(I) + X*C(I)
END DO
• Corresponding elements are exactly 1024 words apart so map to the same block.
• Loop accesses corresponding elements of each array in turn.
• Every access causes a cache miss.

Cache padding
• Insert padding (assume 8 REALs per cache block)
REAL A(1024), B(1024), C(1024), T(8), T2(8), X
COMMON /DAT/ A,T,B,T2,C
DO I=1,1024
   A(I) = B(I) + X*C(I)
END DO
• Corresponding elements are now 1032 words apart so map to different cache blocks.
• Only 1 in 8 accesses is a cache miss.

Internal conflict
• Conflicts can also occur within a single array
REAL A(1024,4), B(1024)
DO I=1,1024
   DO J=1,4
      B(I) = B(I) + A(I,J)
   END DO
END DO
• Corresponding elements of each row of A (and B) map to the same cache block

Array extension
• Fix by extending array declaration
REAL A(1032,4), B(1024)
DO I=1,1024
   DO J=1,4
      B(I) = B(I) + A(I,J)
   END DO
END DO
• Compiler may be able to do this for local arrays but would need explicit permission via command line flag or pragma.

Loop re-ordering
• Conflicts are only a problem if the program is going to revisit data ejected from the cache.
• Impact depends on data access order
REAL A(1024,4), T(8), B(1024)
DO J=1,4
   DO I=1,1024
      B(I) = B(I) + A(I,J)
   END DO
END DO
• Loop does not revisit earlier rows of A so the conflict has no impact
• We have increased the locality of access to A

Index re-ordering
• In general you want array index order and loop order to match to ensure locality
• Either can be changed
REAL A(4,1024), B(1024)
DO I=1,1024
   DO J=1,4
      B(I) = B(I) + A(J,I)
   END DO
END DO

Index re-ordering II
• This is a global edit
• Every use of array A needs to be edited
• However this can often be scripted or performed by tools

Introduce temporaries
• Could also unroll the loop and copy the cache block to temporaries
• Cache block does not need to be re-visited
REAL A(1024,4), B(1024), T1, T2, …
DO I=1,1024,8
   DO J=1,4
      T1 = A(I,J)
      T2 = A(I+1,J)
      …
      B(I)   = B(I)   + T1
      B(I+1) = B(I+1) + T2
      …
   END DO
END DO
• Only works if unroll is multiple of block size

Conflicts and set associative caches
• Set associative caches reduce the impact of cache conflicts.
• Depending on the set size, multiple pieces of conflicting data can co-exist in the cache (illustrated below).
• Though not as critical as with direct mapped caches, it is still frequently worth removing conflicts where you can.
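A minimal sketch of that point, assuming a 1024-word, 2-way set associative, write-allocate cache with LRU replacement and 8-word blocks (parameters chosen for illustration, not from the course): two conflicting streams now co-exist, but the three-array loop from the earlier conflict example still thrashes because three blocks contend for only two ways.

REAL A(1024), B(1024), C(1024), X
COMMON /DAT/ A,B,C          ! corresponding elements map to the same set

! Two conflicting streams: the blocks holding A(I) and B(I) share a set
! but fit in its two ways, so only 1 in 8 accesses misses.
DO I=1,1024
   A(I) = X*B(I)
END DO

! Three conflicting streams: A(I), B(I) and C(I) all want the same set;
! three blocks cannot share two ways, so every access still misses.
DO I=1,1024
   A(I) = B(I) + X*C(I)
END DO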
