
Squishy Maps for Soft Body Modelling Using Generalised Chain Mail

KIT308/408 (Advanced) Multicore Architecture and Programming

SIMD Challenges
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1

Certain things are hard (/hardish) when translating to SIMD
Dynamic memory allocation / alignment
Conditional code
Changing data-structures breaks everything
It’s a really good idea to translate to SIMD as slowly as possible
Literally as close to one line at a time as you can

2
SIMD Challenges

A Worked Example
SIMD Challenges

3

Again, a pretty nonsensical piece of code
But enough for our purposes
Interesting features:
Dynamically allocated arrays
Arrays looped over
Conditional expression in loop
void func(float* result, const float* v, unsigned int len) {
  for (unsigned int i = 0; i < len; ++i) {
    float val = v[i];
    if (val >= 0.0f)
      result[i] = std::sqrt(val);
    else
      result[i] = val;
  }
}

float* dest = new float[SIZE];
float* src = new float[SIZE];
// not shown: fill src with a bunch of values
func(dest, src, SIZE);

4
Starting Scalar Code

The requirement that everything be aligned correctly is awkward
No aligned form of the C++ new operator (prior to C++17)
Have to use a special version of C-style malloc instead
_aligned_malloc will allocate memory aligned at a specified byte boundary
And return it as a void* which needs to be cast to the desired type
void* _aligned_malloc(
size_t size,
size_t alignment);
For example, to allocate 100 128-bit SIMD vectors with the first aligned at a 128-byte boundary:
__m128* array = (__m128*) _aligned_malloc(sizeof(__m128) * 100, 128);

5
Aligned Dynamically Allocated Memory

Convert the data structures to SIMD types
Deal with the fallout
Changed function specification
Added an extra for loop to iterate over SIMD vector elements
All accesses to individual scalar elements use SIMD union accessors
Rough edges
Here the len8 parameter is passed the value SIZE/8
This assumes the original data structure's size was evenly divisible by 8
Can’t do this in general
Aligned values to 32-byte boundary
Appropriate for a 256-bit type
Ignores potential larger alignment sizes that might be more cache appropriate
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    for (unsigned int j = 0; j < 8; ++j) {
      float val = v[i].m256_f32[j];
      if (val >= 0.0f)
        result[i].m256_f32[j] = std::sqrt(val);
      else
        result[i].m256_f32[j] = val;
    }
  }
}

int SIZE8 = SIZE / 8;
__m256* dest = (__m256*) _aligned_malloc(sizeof(__m256) * SIZE8, 32);
__m256* src = (__m256*) _aligned_malloc(sizeof(__m256) * SIZE8, 32);
// not shown: fill src with a bunch of values
func(dest, src, SIZE8);

6
Inner Loop Using SIMD Types (as Scalars)

Take the first line from inside the inner SIMD loop (j here) and translate it to SIMD code
Translation goes outside the inner loop
It’s doing the work of the version inside the loop all in one go
Replace the version in the loop with an access to the result of the SIMD calculation
No testing approach shown here
Can be really helpful to calculate everything twice
The SIMD translation and the scalar original
Compare the outputs
I’d recommend using printf liberally
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 vals = v[i];
    for (unsigned int j = 0; j < 8; ++j) {
      float val = vals.m256_f32[j];
      if (val >= 0.0f)
        result[i].m256_f32[j] = std::sqrt(val);
      else
        result[i].m256_f32[j] = val;
    }
  }
}

7
Translation to SIMD Code One Line at a Time

Same process as the last slide, but this translation resulted in two SIMD calculations
Needed a conversion of a scalar value into a SIMD vector (splatting)
Conversion of a comparison
Can still be handled by this approach
In normal C/C++ code “true” is usually 1, whereas SIMD “true” is all bits set (-1 as a signed integer)
Neither of these is the “false” value of 0, so the if-statement still works
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 vals = v[i];
    __m256 zeros = _mm256_set1_ps(0.0f);
    __m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
    for (unsigned int j = 0; j < 8; ++j) {
      float val = vals.m256_f32[j];
      if (gts.m256_f32[j])
        result[i].m256_f32[j] = std::sqrt(val);
      else
        result[i].m256_f32[j] = val;
    }
  }
}

8
Translation of Boolean Conditions

Next part of the loop is an if-statement
Note that this is an “easy” if-statement
No full block, just one line for each of the true/false statements
Both statements put a value into the same destination
Multiple steps required for translation
First step, perform the calculations of the true and false expressions
After performing this step, the scalar val temporary is no longer needed
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 vals = v[i];
    __m256 zeros = _mm256_set1_ps(0.0f);
    __m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
    __m256 trues = _mm256_sqrt_ps(vals);
    __m256 falses = vals;
    for (unsigned int j = 0; j < 8; ++j) {
      if (gts.m256_f32[j])
        result[i].m256_f32[j] = trues.m256_f32[j];
      else
        result[i].m256_f32[j] = falses.m256_f32[j];
    }
  }
}

9
Translation of If-Statement Calculations to SIMD

Second step, convert our “easy” if-statement into a ternary if-expression
This makes it clear that we are just doing a calculation and storing it into result
Rough edges: What if we can’t? For example:
No else part to the if-statement
True/false parts set different variables
If-statement changes flow-of-control, e.g. break, continue, return, etc.
Complicated multi-line block
Nested if
With work, all of these things can be refactored into a series of independent simple if-expressions if needed
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 vals = v[i];
    __m256 zeros = _mm256_set1_ps(0.0f);
    __m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
    __m256 trues = _mm256_sqrt_ps(vals);
    __m256 falses = vals;
    for (unsigned int j = 0; j < 8; ++j) {
      result[i].m256_f32[j] = gts.m256_f32[j] ? trues.m256_f32[j] : falses.m256_f32[j];
    }
  }
}

10
Refactor If-Statements to If-Expressions

Third step, convert the if-expression into a sequence of SIMD instructions
Other SIMD architectures have a single instruction for this called SELECT
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 vals = v[i];
    __m256 zeros = _mm256_set1_ps(0.0f);
    __m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
    __m256 trues = _mm256_sqrt_ps(vals);
    __m256 falses = vals;
    __m256 value = _mm256_or_ps(
      _mm256_and_ps(gts, trues),
      _mm256_andnot_ps(gts, falses));
    for (unsigned int j = 0; j < 8; ++j) {
      result[i].m256_f32[j] = value.m256_f32[j];
    }
  }
}

11
Translation of If-Operators to SIMD

Once the final part of the temporary inner SIMD loop is translated, we don’t need the loop anymore
All code converted to SIMD
Except the outer for-loop remains as scalar code
(In some circumstances, e.g. where the index of each element is needed, this could be made SIMD too)
Rough edges:
Possibly more efficient to move the calculation of zeros outside the loop
SIMD array accesses could be written using _mm256_load_ps and _mm256_store_ps
This shouldn’t be necessary
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 vals = v[i];
    __m256 zeros = _mm256_set1_ps(0.0f);
    __m256 gts = _mm256_cmp_ps(vals, zeros, _CMP_GE_OQ);
    __m256 trues = _mm256_sqrt_ps(vals);
    __m256 falses = vals;
    __m256 value = _mm256_or_ps(
      _mm256_and_ps(gts, trues),
      _mm256_andnot_ps(gts, falses));
    result[i] = value;
  }
}

12
Refactor If-Statements to If-Operators

Really, please don’t do this
Unless it’s faster
And everything still works
void func(__m256* result, const __m256* v, unsigned int len8) {
  for (unsigned int i = 0; i < len8; ++i) {
    __m256 gts = _mm256_cmp_ps(v[i], _mm256_set1_ps(0.0f), _CMP_GE_OQ);
    result[i] = _mm256_or_ps(
      _mm256_and_ps(gts, _mm256_sqrt_ps(v[i])),
      _mm256_andnot_ps(gts, v[i]));
  }
}

13
Now Obfuscate Everything (No, Please Don’t)