Squishy Maps for Soft Body Modelling Using Generalised Chain Mail
KIT308/408 (Advanced) Multicore Architecture and Programming
Optimisation
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1
Arrays of Structures Versus Structures of Arrays
2
Arrays of Structures (AoS) are what we are used to working with in OO programming
We define a struct/class
We make an array of them
Or some other data structure that uses an array underneath
Or even linked lists or trees or whatever
For example:
struct Vector4 {
float x, y, z, w;
};
Vector4 array[100];
3
Array of Structures Versus Structure of Arrays
Structures of Arrays (SoA) store arrays of each of the values we care about
Each value that would normally be a field in a struct is stored in an array of its own
For example:
struct Vector4_SOA {
float x[100], y[100], z[100], w[100];
};
Vector4_SOA array;
For example:
struct Vector4 {
float x, y, z, w;
};
Vector4 array[100];
1. https://software.intel.com/en-us/articles/ticker-tape-part-2
4
AoS Versus SoA: Data Layout
What happens in memory with a structure like the following?
struct Vector3 {
float x, y, z;
};
Vector3 array[100];
How many bytes in memory does array take?
What about something like the following?
struct Paddington {
char x;
int y;
};
Paddington array[100];
How many bytes in memory does array take?
5
Aside: Data Layout and Padding
Performing a calculation on a single field of an AoS type is likely to access non-contiguous parts of memory
Lots of values are read into the cache and not used
Weak locality of reference
Performing a calculation on a single field of an SoA type is likely to access only contiguous parts of memory
Strong locality of reference
1. http://asc.ziti.uni-heidelberg.de/en/node/18
6
AoS Versus SoA: Cache
For example:
struct Vector4 {
float x, y, z, w;
};
Vector4 array[100];
for (int i = 0; i < 100; ++i) {
if (array[i].z < 0.0f) {
array[i].w = -array[i].w;
}
}
Structures of Arrays require an unfamiliar approach
But it’s relatively simple to translate to SoA from AoS code
For example:
struct Vector4 {
float x[100], y[100], z[100], w[100];
};
Vector4 array;
for (int i = 0; i < 100; ++i) {
if (array.z[i] < 0.0f) {
array.w[i] = -array.w[i];
}
}
7
AoS Versus SoA: Accessing Values
Structures of Arrays can also be dynamically sized
For example:
struct Vector4_SOA {
float *x, *y, *z, *w;
};
Vector4_SOA array;
...
array.x = new float[100];
array.y = new float[100];
array.z = new float[100];
array.w = new float[100];
1. https://software.intel.com/en-us/articles/ticker-tape-part-2
8
Dynamic Structure of Arrays (SoA)
Where There is One, There is Many
9
Consider the following typical C++ style OO code:
struct Vec
{
float x, y, z;
virtual void calc() { x = z * z; }
};
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].calc();
}
10
OO-Method Call Overhead
What is happening here when this is executed?
How are virtual functions (AKA run-time polymorphic methods) actually implemented?
What cost is there to function calls?
No virtual function call this time:
struct Vec
{
float x, y, z;
void calc() { x = z * z; }
};
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].calc();
}
11
Function Call Overhead
Or the equivalent in C-style code:
struct Vec
{
float x, y, z;
};
float calc(float b) {
return b * b;
}
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].x = calc(array[i].z);
}
Using the inline keyword:
struct Vec
{
float x, y, z;
};
inline float calc(float b) {
return b * b;
}
...
Vec array[100];
for (int i = 0; i < 100; ++i) {
array[i].x = calc(array[i].z);
}
12
Inlining
The inline keyword allows us to ask the compiler (nicely) to not actually do a function call
Instead the code inside the function is spliced into the site of the call (i.e. inside the loop)
This isn’t like a macro; the compiler does this safely (without changing any normal function call semantics)
WTIOTIM suggests never writing functions that work on a single element:
struct Vec
{
float x, y, z;
};
void calcs(Vec* a) {
for (int i = 0; i < 100; ++i) {
a[i].x = a[i].z * a[i].z;
}
}
...
Vec array[100];
calcs(array);
13
Where There is One, There is Many (WTIOTIM)
When working on lots of values, doesn’t it make sense to perform calculations on all of them?
This is a shift in mentality
From: “design for a single object” and make a list of them later if we need lots
To: “design for lots of things right from the start”
As equivalent SoA code:
struct Vec_SOA
{
float x[100], y[100], z[100];
};
void calc(float* a, float* b) {
for (int i = 0; i < 100; ++i) {
a[i] = b[i] * b[i];
}
}
...
Vec_SOA array;
calc(array.x, array.z);
14
Where There is One, There is Many (WTIOTIM)
This version of calc even explicitly states which data from Vec_SOA it cares about
We know xs and zs are involved but no other data is touched
The function “doesn’t know” about any other values in the structure
(It’d be much better if I had room for longer descriptive variable names :P)
Loop Unrolling
15
Loops can be “unrolled”:
struct Vec_SOA
{
float x[100], y[100], z[100];
};
void calc(float* a, float* b) {
for (int i = 0; i < 100; i += 2) {
a[i] = b[i] * b[i];
a[i + 1] = b[i + 1] * b[i + 1];
}
}
...
Vec_SOA array;
calc(array.x, array.z);
16
Loop Unrolling
What would the advantage of this loop be?
Fewer iterations
Fewer tests of i
Less branching
Fewer increments of i
More opportunity for independent instructions to be executed between branches
i.e. before having to check the value of i and decide whether to loop or not
Loops can be unrolled more:
struct Vec_SOA
{
float x[100], y[100], z[100];
};
void calc(float* a, float* b) {
for (int i = 0; i < 100; i += 4) {
a[i] = b[i] * b[i];
a[i + 1] = b[i + 1] * b[i + 1];
a[i + 2] = b[i + 2] * b[i + 2];
a[i + 3] = b[i + 3] * b[i + 3];
}
}
...
Vec_SOA array;
calc(array.x, array.z);
17
Loop Unrolling
Would this be faster?
What’s happening to the code here?
Single Instruction Multiple Data
18
SIMD instructions work on multiple values at the same time
e.g. not X * Y, but X[0..3] * Y[0..3]
On the CPU this means having special execution units that do multiple calculations at once
If the SIMD unit works on 4 values at once you can approach 4x the execution speed
19
Single Instruction Multiple Data (SIMD)
Underlying the approach of SIMD is the recognition that lots of extraneous work is being done to execute a single instruction
The idea is to make that “Data Operation” circle do more useful work
20
Why SIMD?
From a programming perspective we need
Special SIMD vector types
Operations / functions that perform SIMD basic operations
Functions that allow for rewriting of loops and branches
In C, on x86/x64 machines, using Windows
There’s a number of SIMD types, e.g.
__m128 holds 4 float values
Note: that’s two leading underscores
This is a 128-bit value
Can be used as a single value
Can also access single elements like an array
e.g. pseudocode __m128 value; value[2] = 80;
Visual Studio:
__m128 value; value.m128_f32[2] = 80;
Mac+Linux:
__m128 value; value[2] = 80;
Accessing individual elements like this is slower than operating on all of them at once
21
Programming SIMD
From a programming perspective we need
Special SIMD vector types
Operations / functions that perform SIMD basic operations
Functions that allow for rewriting of loops and branches
In C, on x86/x64 machines, using Windows+Mac+Linux
Operations are provided through compiler intrinsic functions, e.g.
_mm_set1_ps allows for 1 value to initialise all the elements in the vector
_mm_load_ps and _mm_store_ps allow for reading/writing between memory and a SIMD vector
_mm_add_ps allows for addition of two SIMD vectors
22
Programming SIMD
From a programming perspective we need
Special SIMD vector types
Operations / functions that perform SIMD basic operations
Functions that allow for rewriting of loops and branches
Branches and certain kinds of loops mean we need to rethink how the code is structured
For example, how can we translate this into SIMD?
float a[16];
for (int i = 0; i < 16; i++) {
if (a[i] > 0) a[i] = a[i] * 2;
}
Or even
float a[16];
float b;
for (int i = 0; i < 16; i++) {
if (a[i] > 0) {
b = a[i];
break;
}
}
We’ll come back to this idea in a later lecture / practical
23
Programming SIMD
Most architectures require SIMD types to be aligned to a multiple of their size
e.g. a 128-bit type (16 bytes) must be stored at a memory address that is exactly divisible by 16
Simplifies reads and writes
To get the VS compiler to do this use
__declspec(align(X))
Where X is the number of bytes the address will be divisible by
e.g. __declspec(align(16)) float f;
Ensures f is aligned to a 16-byte boundary
On Mac+Linux using GCC
__attribute__((aligned (16))) float f;
24
Memory Alignment
There’s a technique used sometimes that combines AoS and SoA
Make a small SoA structure and then have an array of these
The SoA structure is created to either
Contain exactly one SIMD type; or
Fit nicely into a cache line
Normal SoA:
struct Vector4_SOA {
float x[100], y[100], z[100], w[100];
};
Vector4_SOA array;
AoSoA:
struct Vector4_AOSOA {
__m128 x, y, z, w;
};
Vector4_AOSOA array[25];
25
Aside: AoSoA (AKA Tiled AoS)