OPTIMISING MEMORY STRUCTURES III

Utilizing caches II
• Want to use all of the data in a cache line

• Loading unwanted values is a waste of memory bandwidth.
• Structures are good for this.
• Or loop fastest over the corresponding index of an array.
• Place variables that are used together close together.
• Also have to worry about alignment with cache block boundaries.
• Avoid “gaps” in structures
• In C, structures may contain gaps to ensure the address of each member is aligned with its size.
• Compilers usually place structure elements in the order they appear in the definition, so re-order elements to reduce padding (see the sketch below).
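
A minimal sketch of the re-ordering point (the struct and member names are invented for illustration; sizes assume a typical 64-bit ABI):

#include <stdio.h>

/* Members ordered small/large/small: the compiler inserts gaps so each
   member is aligned with its size. */
struct padded {
    char   flag;   /* 1 byte + 7 bytes padding before the double */
    double value;  /* 8 bytes */
    int    count;  /* 4 bytes + 4 bytes tail padding */
};                 /* typically 24 bytes */

/* Same members re-ordered largest first: no internal gaps. */
struct reordered {
    double value;  /* 8 bytes */
    int    count;  /* 4 bytes */
    char   flag;   /* 1 byte + 3 bytes tail padding */
};                 /* typically 16 bytes */

int main(void) {
    printf("%zu vs %zu bytes\n",
           sizeof(struct padded), sizeof(struct reordered));
    return 0;
}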

Data writes
• Reads are more common, so caches are optimised for reads.
• Writes can still impact code performance though.
• Data writes also consume resources.
• Memory systems are often optimised for transferring blocks (e.g. loading cache lines)
• On some architectures it is better if write operations also come in blocks.

Caches and writes
• Caches may be:
• write-through
• Information written to cache block and to lower memory layer
• write-back
• Information is only written to the cache. Lower layers updated when the cache block is replaced.
• Caches may also be:
• write allocate
• If the write location is not in cache the enclosing block is loaded into the cache (usual for write-back)
• no write allocate
• If the write location is not in cache only the underlying layer is modified (usual for write-through)

Write stalls
• Write stalls are when the CPU must wait for a write operation to complete.
• Data must have left the register before a new value can replace it.
• One hardware solution is to introduce write buffers
• Processor continues as soon as data written to buffer
• write from buffer continues in parallel.
• Multiple write buffers allow multiple writes to be active at one time.
• If write buffers are wider than a word multiple writes can be merged into a single block write.
• If the writes happen close enough together that is.
• Write buffers often same size as level 1 cache blocks

Problems with writes
• Array initialization
• Large array initializations may be particularly slow when using write allocate caches.
• We only want to perform lots of writes to overwrite junk data.
• The cache will carefully load all the junk data before overwriting it.
• Especially nasty if the array is sized generously but every element is initialized regardless.
• Work arounds
• Use special HW features to zero the array (compiler directives).
• Combine initialization with the first access loop (sketched below).
• This increases the chance of a programming error, so have a debugging option to perform the original initialization as well.
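
A minimal sketch of combining initialization with the first access loop (the array names and update loop are invented for illustration):

#define N 1000000
static double sum[N], contrib[N];

/* Separate passes: with a write-allocate cache the zeroing loop loads a
   cache line of junk from memory for every store it performs. */
void init_then_accumulate(void) {
    for (int i = 0; i < N; i++)
        sum[i] = 0.0;
    for (int i = 0; i < N; i++)
        sum[i] += contrib[i];
}

/* Combined: the first access writes the initial value directly, so the
   array is swept through the cache once instead of twice. */
void init_with_first_access(void) {
    for (int i = 0; i < N; i++)
        sum[i] = contrib[i];   /* the first "+=" becomes "=" */
}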

More write problems
• Running out of write buffers
• If the program writes to many widespread locations in quick succession the CPU may run out of write buffers and stall.
• Even if it does not stall this may force partially filled write buffers to be written out.
• Solution
• Cluster writes to close locations together in time. Try to avoid interleaving writes to different areas of memory.
• Introduce an additional scalar if necessary.
• Re-order loops to write data contiguously (see the sketch below).
• Re-order elements in structures to optimize write buffer use.
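
A sketch of re-ordering loops to write contiguously, assuming a row-major C array (names invented for illustration):

#define N 512
static double grid[N][N];

/* Strided writes: consecutive stores are N*8 bytes apart, so each one
   touches a different cache line and write buffer. */
void fill_strided(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            grid[i][j] = 1.0;
}

/* Contiguous writes: with the loops swapped, consecutive stores fall in
   the same cache line and can merge into block writes. */
void fill_contiguous(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = 1.0;
}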

Writes and vectorisation
• SIMD/Vector store operations usually work best when writing well-aligned contiguous blocks of data.
• These are the same requirements that write buffers have.
• The same code transformations that allow efficient vectorisation should also help write-buffer usage.

Prefetching
• Cannot avoid capacity cache misses.
• We may be able to reduce their impact.
• Most processors can continue with computation while a cache miss takes place
• Provided no instruction attempts to use the result of the load that caused the cache miss.
• Usually a limit on the number of outstanding cache misses being processed as well.
• Want to initiate a load well in advance of use so data is in cache before it is actually used

Load instructions
• Within a basic block compilers should schedule loads early.
• Copying data to a local scalar early in the basic block may help if the compiler has problems.
• The compiler may pipeline loops and load data ready for the next iteration.
• You can do this by hand if necessary (see the sketch below).
• Also possible to pre-load the cache by issuing additional load instructions
• Hard to do by hand as compiler will probably optimize them away. There may be compiler directives to do this.
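
A sketch of hand-pipelining a loop so each iteration's load is issued one iteration ahead of its use (names invented; a good compiler may perform this transformation itself):

/* Sum of squares, with the load for iteration i+1 issued before the
   value for iteration i is used, hiding some load latency.
   Assumes n >= 1. */
double sum_squares(const double *x, int n) {
    double sum = 0.0;
    double next = x[0];           /* prime the pipeline */
    for (int i = 0; i < n - 1; i++) {
        double cur = next;        /* value loaded last iteration */
        next = x[i + 1];          /* issue the next load early */
        sum += cur * cur;         /* compute overlaps with the load */
    }
    return sum + next * next;     /* drain the last element */
}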

Prefetch instructions
• Many processors have special prefetch instructions to request data to be loaded into cache.
• Compilers will try to insert these automatically
• For best results will probably need compiler directives to be inserted.
• Read the compiler manual.
• Write-allocate caches may have instructions to zero cache lines
• Useful for array initialization
• Probably need directives again.

Example prefetch
• void __builtin_prefetch (const void *addr, int rw, int locality)
• https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
• rw: 0 means read, 1 means write (default 0)
• locality: 0, 1, 2, or 3 (default 3), the degree of temporal locality. 0 means no locality (don’t keep in cache), 3 means high locality (keep in all levels of the cache if possible)
/* k is the prefetch distance, chosen to cover the memory latency */
for (int i = 0; i < 1024; i++) {
    __builtin_prefetch(&array1[i+k], 1, 0);
    array1[i] = 2 * array1[i];
}

Hardware prefetching
• Some processors prefetch automatically
• Regular access patterns are recognized and cache lines fetched in advance.
• Usually only works for a contiguous sequence of cache misses.
• Processor has a set of stream buffers
• Each holds the address of an active stream
• Loads to the current block cause the next block to be prefetched and the stream address to be updated.
• Streams are established by a series of cache misses to consecutive locations

Using streams
• To utilize stream hardware use linear access patterns where possible
• Only the order of cache block accesses needs to be linear, not each word access.
• Most loops will require multiple streams
• If the loop requires more streams than are supported in hardware no prefetching will take place because the hardware stream engines will be continually reallocated to new streams.
• Consider splitting the loop.
• Writes may activate a stream if the cache is write-allocate.

Non-temporal operations
• Some processors/compilers allow non-temporal modifiers for instructions
• #pragma vector nontemporal
• void __builtin_nontemporal_store(T value, T *addr);
• T __builtin_nontemporal_load(T *addr);
• Allow loads/stores to be sent directly to memory/registers, rather than cache
• Bypass cache for reduced cache pollution
• Bypass cache to avoid cache instruction limits
• Primarily focused on store instructions
• Benefit can be mixed
• If the workload fits into cache it can reduce performance
• Intel compiler can also generate these directly
• -qopt-streaming-stores=keyword
• always
• never
• auto
• Fencing may be required for correct memory semantics
• _mm_sfence(); // OR _mm_mfence()

Alignment problems
• Not all load operations are the same size
• Single precision loads 32 bits
• Double precision loads 64 bits
• Instruction sets with SIMD extensions may support 128-bit or higher multi-word loads
• Different instructions implement the different types of load.
• In many cases wider loads can perform better, but they perform badly (or cannot be used at all) unless the address being loaded is aligned on the size of the load

Compilers and alignment
• Compilers usually do well with
• Intrinsic types (alignment constraints are part of the language specification)
• Local variables (alignment is under the compiler’s control)
• Do less well with
• Subroutine arguments
• Pointers
• Structure members
• If the compiler cannot determine alignment at compile time
• Use lower performance un-aligned instructions
• Dynamically test alignment and branch to different implementations

Alignment optimisation
• Compiler may support directives to provide additional information about memory alignment
• Still need to ensure that memory is aligned correctly.
• Particular problem with malloc
• Used for all types of data
• Usually only guarantees alignment for the largest intrinsic type (64-bit)
• Some systems provide alternative calls with greater alignment guarantees (see the sketch below).
• Alternative is to wrap malloc/free with functions that adjust alignment.
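
A minimal sketch of one such alternative call, the POSIX posix_memalign function (the 64-byte alignment is just an example cache-line size):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double *a = NULL;
    /* Request 64-byte alignment so aligned (e.g. SIMD) loads and stores
       can be used on the array. */
    if (posix_memalign((void **)&a, 64, 1024 * sizeof(double)) != 0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    printf("%p is 64-byte aligned\n", (void *)a);
    free(a);
    return 0;
}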
Pointer aliasing
• Pointers are variables containing memory addresses.
• Pointers are useful but can seriously inhibit code performance.
• Compilers try very hard to reduce memory accesses.
• Only loading data from memory once.
• Keep variables in registers and only update the memory copy when necessary.
• Pointers could point anywhere, so to be safe the compiler will:
• Reload all values after a write through a pointer
• Synchronize all variables with memory before a read through a pointer
• Even if the lack of overlap is obvious to you the compiler may not be able to prove this.

Pointers and Fortran
• F77 had no pointers
• Arguments passed by reference (address)
• Subroutine arguments are effectively pointers
• But it is illegal Fortran if two arguments overlap
• Programmers can (and do) break this rule but the compiler does not have to consider the possibility when generating code.
• F90/F95 has restricted pointers
• Pointers can only point at variables declared as a “target” or at the target of another pointer
• Compiler therefore knows more about possible aliasing problems
• Try to avoid F90 pointers for performance critical data structures.

Pointers and C
• In C pointers are unrestricted
• Can therefore seriously inhibit performance
• Almost impossible to do without pointers
• malloc requires the use of pointers.
• Pointers used for call by reference. Alternative is call by value where all data is copied!
• Compilers may have #pragma extensions or compiler flags to assert pointers do not overlap
• Usually not portable between platforms
• Explicit use of local scalar temporaries may reduce the problem (a sketch follows at the end of this section).
• Compiler has visibility of the entire lifetime of the variable and can prove nothing overlaps with it.

Pointers and caches
• Pointers of some sort are essential for dynamic memory structures
• Heap allocated memory
• Linked lists
• Dictionaries/Maps
• Addresses returned from the heap allocator are essentially random so there is less possibility of cache conflicts
• Unless the heap allocator creates pools of commonly requested object sizes
• Greater chance of TLB misses as there is no page locality either.
• Dynamic memory structures can also consume more memory
• In a doubly-linked list of complex numbers 50% of the stored data is pointers
• 2 words of complex data.
• 2 words of forward/back pointers.
• Higher ratio of cache misses
• Also very hard to vectorise dynamic memory structures
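
Returning to the local scalar temporaries point from the Pointers and C slide, a minimal sketch (names invented for illustration):

/* The compiler must assume sum may alias an element of x, so it
   reloads and stores *sum on every iteration. */
void accumulate(double *x, double *sum, int n) {
    for (int i = 0; i < n; i++)
        *sum += x[i];
}

/* With a local scalar whose address is never taken, the compiler can
   keep the total in a register and touch memory once at the end.
   (Note: results differ from the version above if sum really does
   alias an element of x, which is exactly the case the compiler
   had to allow for.) */
void accumulate_local(double *x, double *sum, int n) {
    double s = *sum;
    for (int i = 0; i < n; i++)
        s += x[i];
    *sum = s;
}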