Squishy Maps for Soft Body Modelling Using Generalised Chain Mail
KIT308/408 (Advanced) Multicore Architecture and Programming
SIMD Challenges
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1
Certain things are hard (/hardish) when translating to SIMD
Dynamic memory allocation / alignment
Conditional code
Changing data-structures breaks everything
It’s a really good idea to translate to SIMD as slowly as possible
Literally as close to one line at a time as you can
2
SIMD Challenges
A Worked Example
SIMD Challenges
3
Again, a pretty nonsense piece of code
But enough for our purposes
Interesting features:
Dynamically allocated arrays
Arrays looped over
Conditional expression in loop
void func(float* result, const float* v, unsigned int len) {
for (unsigned int i = 0; i < len; ++i) {
float val = v[i];
if (val >= 0.0f)
result[i] = std::sqrt(val);
else
result[i] = val;
}
}
…
float* dest = new float[SIZE];
float* src = new float[SIZE];
// not shown: fill src with a bunch of values
func(dest, src, SIZE);
4
Starting Scalar Code
Getting everything aligned correctly is awkward
No aligned form of C++ new (until C++17 added aligned operator new)
Have to use a special version of C-style malloc instead
_aligned_malloc will allocate memory aligned at the specified byte boundary
And return it as a void* which needs to be cast to the desired type
void* _aligned_malloc(
size_t size,
size_t alignment);
For example, to allocate 100 128-bit SIMD vectors with the first aligned at a 128-byte boundary:
__m128* array = (__m128*) _aligned_malloc(sizeof(__m128) * 100, 128);
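A caveat worth noting: _aligned_malloc is MSVC-specific and must be paired with _aligned_free (not free or delete). As a rough sketch of a portable stand-in (an assumption, not from the slides), C++17's std::aligned_alloc does the same job and pairs with std::free:

```cpp
#include <cstdlib>
#include <cstdint>

// Sketch (hypothetical helper): allocate `count` floats at a 32-byte boundary.
// std::aligned_alloc requires the total size to be a multiple of the alignment,
// so round the byte count up first. Free the result with std::free.
inline float* alloc_floats_aligned32(std::size_t count) {
    std::size_t bytes = ((count * sizeof(float) + 31) / 32) * 32;  // round up to 32
    return static_cast<float*>(std::aligned_alloc(32, bytes));
}
```

Note that std::aligned_alloc is itself missing from MSVC's standard library, which is exactly why _aligned_malloc exists there.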
5
Aligned Dynamically Allocated Memory
Convert the data structures to SIMD types
Deal with the fallout
Changed function specification
Added an extra for loop to iterate over SIMD vector elements
All accesses to individual scalar elements use SIMD union accessors
Rough edges
Here the len8 variable is set to be SIZE/8
This is assuming that original data structure size was evenly divisible by 8
Can’t do this in general
Aligned values to 32-byte boundary
Appropriate for a 256-bit type
Ignores potential larger alignment sizes that might be more cache appropriate
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
for (unsigned int j = 0; j < 8; ++j) {
float val = v[i].m256_f32[j];
if (val >= 0.0f)
result[i].m256_f32[j] = std::sqrt(val);
else
result[i].m256_f32[j] = val;
}
}
}
…
int SIZE8 = SIZE / 8;
__m256* dest = (__m256*) _aligned_malloc(sizeof(__m256) * SIZE8, 32);
__m256* src = (__m256*) _aligned_malloc(sizeof(__m256) * SIZE8, 32);
// not shown: fill src with a bunch of values
func(dest, src, SIZE8);
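The "can't do this in general" rough edge above has a common workaround (an assumption on my part, not shown in the slides): round the vector count up with a ceiling division and pad the leftover lanes with a value that is harmless for every operation performed (0.0f is safe here, since sqrt(0) is 0):

```cpp
// Sketch: handle lengths that aren't a multiple of 8 by rounding the
// SIMD vector count up, so e.g. 17 floats occupy 3 whole __m256 vectors.
unsigned int padded_vector_count(unsigned int len) {
    return (len + 7) / 8;  // ceiling division by 8
}
```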
6
Inner Loop Using SIMD Types (as Scalars)
Take the first line from inside the inner SIMD loop (j here) and translate it to SIMD code
Translation goes outside the inner loop
It’s doing the work of the version inside the loop all in one go
Replace the version in the loop with an access to the result of the SIMD calculation
No testing approach shown here
Can be really helpful to calculate everything twice
The SIMD translation and the scalar original
Compare the outputs
I’d recommend using printf liberally
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 vals = v[i];
for (unsigned int j = 0; j < 8; ++j) {
float val = vals.m256_f32[j];
if (val >= 0.0f)
result[i].m256_f32[j] = std::sqrt(val);
else
result[i].m256_f32[j] = val;
}
}
}
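The "calculate everything twice" testing approach suggested above could be sketched like this (a hypothetical checker, not from the slides): run the scalar original and the SIMD translation on the same input, then compare the outputs element by element, printf-ing any position that disagrees.

```cpp
#include <cstdio>

// Sketch: compare a SIMD translation's output against the scalar original's,
// printing every mismatching element so the bad lane is easy to spot.
bool outputs_match(const float* simd_out, const float* scalar_out,
                   unsigned int len) {
    bool ok = true;
    for (unsigned int i = 0; i < len; ++i) {
        if (simd_out[i] != scalar_out[i]) {
            std::printf("mismatch at %u: simd=%f scalar=%f\n",
                        i, simd_out[i], scalar_out[i]);
            ok = false;
        }
    }
    return ok;
}
```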
7
Translation to SIMD Code One Line at a Time
Same process as the last slide, but this translation resulted in two SIMD calculations
Needed a conversion of a scalar value into a SIMD vector (splatting)
Conversion of a comparison
Can still be handled by this approach
Normal C/C++ “true” is usually 1, whereas a SIMD comparison sets every bit of the lane (-1 read as an integer, a NaN read as a float)
Neither of these is the “false” value of 0, so the if-statement still works
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 vals = v[i];
__m256 zeros = _mm256_set1_ps(0.0f);
__m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
for (unsigned int j = 0; j < 8; ++j) {
float val = vals.m256_f32[j];
if (gts.m256_f32[j])
result[i].m256_f32[j] = std::sqrt(val);
else
result[i].m256_f32[j] = val;
}
}
}
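A scalar illustration of the truth-value point above (my sketch, not from the slides): a SIMD comparison fills a lane with all-one bits for "true", which reads back as -1 through an integer view, so `if (lane)` behaves the same as the original scalar test.

```cpp
#include <cstdint>
#include <cstring>

// Sketch: build the per-lane bit pattern that _mm256_cmp_ps produces
// (all ones for true, all zeros for false) and read it as a 32-bit int.
std::int32_t lane_truth(bool condition) {
    std::uint32_t mask = condition ? 0xFFFFFFFFu : 0u;
    std::int32_t as_int;
    std::memcpy(&as_int, &mask, sizeof(as_int));  // reinterpret the bits
    return as_int;  // -1 for true, 0 for false; both behave correctly in an if
}
```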
8
Translation of Boolean Conditions
Next part of the loop is an if-statement
Note that this is an “easy” if-statement
No full block, just one line for each of the true/false statements
Both statements put a value into the same destination
Multiple steps required for translation
First step, perform the calculations of the true and false expressions
After performing this step, the scalar val temporary is no longer needed
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 vals = v[i];
__m256 zeros = _mm256_set1_ps(0.0f);
__m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
__m256 trues = _mm256_sqrt_ps(vals);
__m256 falses = vals;
for (unsigned int j = 0; j < 8; ++j) {
float val = vals.m256_f32[j];
if (gts.m256_f32[j])
result[i].m256_f32[j] = trues.m256_f32[j];
else
result[i].m256_f32[j] = falses.m256_f32[j];
}
}
}
9
Translation of If-Statement Calculations to SIMD
Second step, convert our “easy” if-statement into a ternary if-expression
This makes it clear that we are just doing a calculation and storing it into result
Rough edges:
What if we can’t? For example:
No else part to the if-statement
True/false parts set different variables
If-statement changes flow-of-control, e.g. break, continue, return, etc.
Complicated multi-line block
Nested if
With work, all of these things can be refactored into a series of independent simple if-expressions if needed
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 vals = v[i];
__m256 zeros = _mm256_set1_ps(0.0f);
__m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
__m256 trues = _mm256_sqrt_ps(vals);
__m256 falses = vals;
for (unsigned int j = 0; j < 8; ++j) {
result[i].m256_f32[j] = gts.m256_f32[j] ?
trues.m256_f32[j] :
falses.m256_f32[j];
}
}
}
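One of the rough edges listed above, the if-statement with no else part, has a simple refactoring worth sketching (a hypothetical example, not from the slides): make the missing branch re-store the value that is already in the destination.

```cpp
#include <cmath>

// Sketch: the else-less statement
//     if (v >= 0.0f) r = std::sqrt(v);
// becomes a ternary if-expression whose false branch keeps the old value:
float sqrt_or_keep(float r, float v) {
    return (v >= 0.0f) ? std::sqrt(v) : r;  // false branch re-stores r
}
```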
10
Refactor If-Statements to If-Expressions
Third step, convert if-expression into sequence of SIMD instructions
Other SIMD architectures have a single instruction for this, often called SELECT (AVX itself provides _mm256_blendv_ps, which can replace the whole and/andnot/or sequence)
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 vals = v[i];
__m256 zeros = _mm256_set1_ps(0.0f);
__m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
__m256 trues = _mm256_sqrt_ps(vals);
__m256 falses = vals;
__m256 value = _mm256_or_ps(
_mm256_and_ps(gts, trues),
_mm256_andnot_ps(gts, falses));
for (unsigned int j = 0; j < 8; ++j) {
result[i].m256_f32[j] = value.m256_f32[j];
}
}
}
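The and/andnot/or combination above is easier to see one lane at a time. A scalar sketch of the same select idiom (my illustration, not from the slides):

```cpp
#include <cstdint>

// Sketch: per 32-bit lane, keep the true-value bits where the mask is all
// ones and the false-value bits where it is all zeros:
//     result = (mask & if_true) | (~mask & if_false)
std::uint32_t select_lane(std::uint32_t mask, std::uint32_t if_true,
                          std::uint32_t if_false) {
    return (mask & if_true) | (~mask & if_false);
}
```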
11
Translation of If-Operators to SIMD
Once the final parts of the temporary inner SIMD loop are translated, the loop isn’t needed anymore
All code converted to SIMD
Except the outer for-loop remains as scalar code
(In some circumstances, e.g. where the index of each element is needed, this could be made SIMD too)
Rough edges:
Possibly more efficient to move the calculation of zeros outside the loop
SIMD array accesses could be written using _mm256_load_ps and _mm256_store_ps
This shouldn’t be necessary
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 vals = v[i];
__m256 zeros = _mm256_set1_ps(0.0f);
__m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
__m256 trues = _mm256_sqrt_ps(vals);
__m256 falses = vals;
__m256 value = _mm256_or_ps(
_mm256_and_ps(gts, trues),
_mm256_andnot_ps(gts, falses));
result[i] = value;
}
}
12
Refactor If-Statements to If-Operators
Really, please don’t do this
Unless it’s faster
And everything still works
void func(__m256* result, const __m256* v, unsigned int len8) {
for (unsigned int i = 0; i < len8; ++i) {
__m256 gts = _mm256_cmp_ps(v[i], _mm256_set1_ps(0.0f),
_CMP_GE_OQ);
result[i] = _mm256_or_ps(
_mm256_and_ps(gts, _mm256_sqrt_ps(v[i])),
_mm256_andnot_ps(gts, v[i]));
}
}
13
Now Obfuscate Everything (No, Please Don’t)