
Squishy Maps for Soft Body Modelling Using Generalised Chain Mail

KIT308/408 (Advanced) Multicore Architecture and Programming

Optimisation
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1

Arrays of Structures Versus Structures of Arrays

2

Arrays of Structures (AoS) are what we are used to working with in OO programming
We define a struct/class
We make an array of them
Or some other data structure that uses an array underneath
Or even linked lists or trees or whatever
For example:
struct Vector4 {
float x, y, z, w;
};

Vector4 array[100];

3
Arrays of Structures Versus Structures of Arrays
Structures of Arrays (SoA) store arrays of each of the values we care about
Each value that would normally be a field in a struct is stored in an array of its own
For example:
struct Vector4_SOA {
float x[100], y[100], z[100], w[100];
};

Vector4_SOA array;

For example:
struct Vector4 {
float x, y, z, w;
};

Vector4 array[100];

For example:
struct Vector4_SOA {
float x[100], y[100], z[100], w[100];
};

Vector4_SOA array;
1. https://software.intel.com/en-us/articles/ticker-tape-part-2
4
AoS Versus SoA: Data Layout

What happens in memory with a structure like the following?
struct Vector3 {
float x, y, z;
};

Vector3 array[100];
How many bytes in memory does array take?
What about something like the following?
struct Paddington {
char x;
int y;
};

Paddington array[100];
How many bytes in memory does array take?
1. https://software.intel.com/en-us/articles/ticker-tape-part-2
5
Aside: Data Layout and Padding

Performing some calculation on some aspect of an AoS type is likely to access non-contiguous parts of memory
Lots of values are read into the cache and not used
Weak locality of reference
Performing some calculation on some aspect of a SoA type is likely to access only contiguous parts of memory
Strong locality of reference
1. http://asc.ziti.uni-heidelberg.de/en/node/18
6
AoS Versus SoA: Cache

For example:
struct Vector4 {
float x, y, z, w;
};

Vector4 array[100];

for (int i = 0; i < 100; ++i) {
if (array[i].z < 0.0f) {
array[i].w = -array[i].w;
}
}
Structures of Arrays require an unfamiliar approach
But it’s relatively simple to translate AoS code to SoA
For example:
struct Vector4_SOA {
float x[100], y[100], z[100], w[100];
};

Vector4_SOA array;

for (int i = 0; i < 100; ++i) {
if (array.z[i] < 0.0f) {
array.w[i] = -array.w[i];
}
}
7
AoS Versus SoA: Accessing Values

Structures of Arrays can also be dynamically sized
For example:
struct Vector4_SOA {
float *x, *y, *z, *w;
};

Vector4_SOA array;
...
array.x = new float[100];
array.y = new float[100];
array.z = new float[100];
array.w = new float[100];
1. https://software.intel.com/en-us/articles/ticker-tape-part-2
8
Dynamic Structure of Arrays (SoA)

Where There is One, There is Many

9

Consider the following typical C++ style OO code:
struct Vec {
float x, y, z;
virtual void calc() { x = z * z; }
};
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].calc();
}
10
OO-Method Call Overhead

What is happening here when this is executed?
How are virtual functions (AKA run-time polymorphic methods) actually implemented?
What cost is there to function calls?
No virtual function call this time:
struct Vec {
float x, y, z;
void calc() { x = z * z; }
};
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].calc();
}
11
Function Call Overhead

Or the equivalent in C-style code:
struct Vec {
float x, y, z;
};

float calc(float b) { return b * b; }
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].x = calc(array[i].z);
}
Using the inline keyword:
struct Vec {
float x, y, z;
};

inline float calc(float b) { return b * b; }
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].x = calc(array[i].z);
}
12
Inlining

The inline keyword allows us to ask the compiler (nicely) to not actually do a function call
Instead the code inside the function is spliced into the site of the call (i.e. inside the loop)
This isn’t like a macro; the compiler does this safely (without changing any normal function call semantics)
WTIOTIM suggests never writing functions that work on a single element:
struct Vec {
float x, y, z;
};

void calcs(Vec* a) {
for (int i = 0; i < 100; ++i) {
a[i].x = a[i].z * a[i].z;
}
}
...
Vec array[100];
calcs(array);
13
Where There is One, There is Many (WTIOTIM)

When working on lots of values, doesn’t it make sense to perform calculations on all of them?
This is a shift in mentality
From: “design for a single object” and make a list of them later if we need lots
To: “design for lots of things right from the start”
As equivalent SoA code:
struct Vec_SOA {
float x[100], y[100], z[100];
};

void calc(float* a, float* b) {
for (int i = 0; i < 100; ++i) {
a[i] = b[i] * b[i];
}
}
...
Vec_SOA array;
calc(array.x, array.z);
14
Where There is One, There is Many (WTIOTIM)

This version of calc even explicitly states which data from Vec_SOA it cares about
We know xs and zs are involved but no other data is touched
The function “doesn’t know” about any other values in the structure
(It’d be much better if I had room for longer descriptive variable names :P)

Loop Unrolling

15

Loops can be “unrolled”:
struct Vec_SOA {
float x[100], y[100], z[100];
};

void calc(float* a, float* b) {
for (int i = 0; i < 100; i += 2) {
a[i] = b[i] * b[i];
a[i + 1] = b[i + 1] * b[i + 1];
}
}
...
Vec_SOA array;
calc(array.x, array.z);
16
Loop Unrolling

What would the advantage of this loop be?
Fewer iterations
Less testing of i
Less branching
Fewer increments of i
More opportunity for independent instructions to be executed between branches
i.e. before having to check the value of i and decide whether to loop or not
Loops can be unrolled more:
struct Vec_SOA {
float x[100], y[100], z[100];
};

void calc(float* a, float* b) {
for (int i = 0; i < 100; i += 4) {
a[i] = b[i] * b[i];
a[i + 1] = b[i + 1] * b[i + 1];
a[i + 2] = b[i + 2] * b[i + 2];
a[i + 3] = b[i + 3] * b[i + 3];
}
}
...
Vec_SOA array;
calc(array.x, array.z);
17
Loop Unrolling

Would this be faster?
What’s happening to the code here?

Single Instruction Multiple Data

18

SIMD instructions work on multiple values at the same time
e.g. not X * Y, but X[0..3] * Y[0..3]
On the CPU this means having special execution units that do multiple calculations at once
If the SIMD unit works on 4 values at once you can approach 4x the execution speed
1. https://software.intel.com/en-us/articles/ticker-tape-part-2
19
Single Instruction Multiple Data (SIMD)

Underlying the approach of SIMD is the recognition that lots of extraneous work is being done to execute a single instruction
The idea is to make that “Data Operation” circle do more useful work
20
Why SIMD?

From a programming perspective we need
Special SIMD vector types
Operations / functions that perform SIMD basic operations
Functions that allow for rewriting of loops and branches
In C, on x86/x64 machines, using Windows
There’s a number of SIMD types, e.g. __m128 holds 4 float values
Note: that’s two leading underscores
This is a 128-bit value
Can be used as a single value
Can also access single elements like an array
e.g. pseudocode:
__m128 value;
value[2] = 80;
Visual Studio:
__m128 value;
value.m128_f32[2] = 80;
Mac+Linux:
__m128 value;
value[2] = 80;
Accessing individual elements this way is slower than doing something on all of them at once
21
Programming SIMD

From a programming perspective we need
Special SIMD vector types
Operations / functions that perform SIMD basic operations
Functions that allow for rewriting of loops and branches
In C, on x86/x64 machines, using Windows+Mac+Linux
Operations are provided through compiler intrinsic functions, e.g.
_mm_set1_ps allows one value to initialise all the elements in the vector
_mm_load_ps and _mm_store_ps allow reading/writing between memory and a SIMD vector
_mm_add_ps allows addition of two SIMD vectors
22
Programming SIMD

From a programming perspective we need
Special SIMD vector types
Operations / functions that perform SIMD basic operations
Functions that allow for rewriting of loops and branches
Branches and certain kinds of loops mean we need to rethink how the code is structured
For example, how can we translate this into SIMD?
float a[16];
for (int i = 0; i < 16; i++) {
if (a[i] > 0)
a[i] = a[i] * 2;
}
Or even:
float a[16];
float b;
for (int i = 0; i < 16; i++) {
if (a[i] > 0) {
b = a[i];
break;
}
}
We’ll come back to this idea in a later lecture / practical

23
Programming SIMD

Most architectures require SIMD types to be aligned exactly in multiples of their size
e.g. a 128-bit type (16 bytes) must be stored in a memory address that is exactly divisible by 16
Simplifies reads and writes
To get the VS compiler to do this use
__declspec(align(X))
Where X is the number of bytes the address will be divisible by
e.g. __declspec(align(16)) float f;
Ensures f is aligned to a 16-byte boundary
On Mac+Linux using GCC
__attribute__((aligned (16))) float f;
1. https://i0.wp.com/media.boingboing.net/wp-content/uploads/2019/08/antones.jpg?w=1666&ssl=1
24
Memory Alignment


There’s a technique used sometimes that combines AoS and SoA
Make a small SoA structure and then have an array of these
The SoA structure is created to either
Contain exactly one SIMD type; or
Fit nicely into a cache line
Normal SoA:
struct Vector4_SOA {
float x[100], y[100], z[100], w[100];
};

Vector4_SOA array;
AoSoA:
struct Vector4_AOSOA {
__m128 x, y, z, w;
};

Vector4_AOSOA array[25];

25
Aside: AoSoA (AKA Tiled AoS)