
KIT308/408 (Advanced) Multicore Architecture and Programming

x86 (and x64) SIMD
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1

Intel has been introducing SIMD instructions into the x86 architecture since 1997
With MMX
Intel kept adding SIMD with the SSE and AVX extensions
AMD tried to introduce its own custom SIMD instruction set, called 3DNow!
It never got much traction
For a while AMD CPUs supported both 3DNow! and MMX/SSE

2
Intel SIMD

MMX has only one version
“Extended MMX” was part of SSE
SSE has heaps
SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2
AVX has three
AVX/AVX2/AVX-512
There are a number of other extensions too
FMA, KNC, SVML
These are all browseable at
software.intel.com/sites/landingpage/IntrinsicsGuide/

Versions
3
1. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&cats=Logical

SIMD Datatypes
x86 SIMD

4

__m64, 64-bit SIMD integer type
1 64-bit integer
2 32-bit integers
4 16-bit integers
8 8-bit integers
First implementation of SIMD for x86
Not really relevant now

5
MMX (MultiMedia eXtensions)

__m128, 128-bit SIMD floating point type
4 32-bit floats
__m128d, 128-bit SIMD floating point type
2 64-bit floats
__m128i, 128-bit SIMD integer type
1 128-bit integer
2 64-bit integers
4 32-bit integers
8 16-bit integers
16 8-bit integers
This is what the tutorials primarily describe and work with

6
SSE (Streaming SIMD Extensions)
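
A quick compile-time sanity check of these widths (a minimal sketch; it assumes only a compiler that provides immintrin.h):

#include <immintrin.h>

// Every SSE type is 128 bits = 16 bytes, however it's sliced into elements
static_assert(sizeof(__m128)  == 16, "__m128 should be 128 bits");
static_assert(sizeof(__m128d) == 16, "__m128d should be 128 bits");
static_assert(sizeof(__m128i) == 16, "__m128i should be 128 bits");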

__m256, 256-bit SIMD floating point type
8 32-bit floats
__m256d, 256-bit SIMD floating point type
4 64-bit floats
__m256i, 256-bit SIMD integer type
1 256-bit integer
2 128-bit integers
4 64-bit integers
8 32-bit integers
16 16-bit integers
32 8-bit integers
This is available on the lab machines and what the tutorials ask you to use as a final step
For Assignment 2 you need to use these
I really hope that everyone’s CPU has these
“Core i3/i5/i7” support them
“Pentium” and “Celeron” CPUs don’t
AVX (Advanced Vector Extensions)
7
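
Since not every CPU supports AVX, it's worth checking at runtime before calling any _mm256 intrinsics. A minimal sketch using GCC/Clang's __builtin_cpu_supports (a compiler assumption: MSVC doesn't have this builtin and would need __cpuid from <intrin.h> instead):

#include <cstdio>

int main()
{
  // Returns non-zero if the CPU running the program supports the named feature
  if (__builtin_cpu_supports("avx"))
    printf("AVX available\n");
  else
    printf("No AVX here - SSE only\n");
}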

MMX introduced 8 64-bit registers that overlapped the normal floating-point ones
MM0–7
Expensive context switch going between FP and MMX
SSE introduced 8 new 128-bit registers
XMM0–7
Once x86-64 was introduced this was extended to 16
XMM0–15
AVX introduced 16 256-bit registers that overlapped the SSE ones
YMM0–15
The “upcoming” AVX-512 increases this to 32 512-bit registers
ZMM0–31

Registers for SIMD
8

To get any speed increase from using the SIMD datatypes you need to use special instructions
Covered next in the slides
But sometimes it’s useful to be able to access individual elements
On Windows, the SIMD types are provided as unions, so the individual elements can be accessed easily
For example, for SSE:
__m128 provides access:
to its 4 floats via the m128_f32 field
It also lets you “break” the type and access it as any integer-style type too
__m128i has the same fields, minus the float access (they’re named m128i_… instead)
__m128d only has access to its doubles, via the m128d_f64 field
typedef union __declspec(intrin_type) __declspec(align(16)) __m128
{
    float            m128_f32[4];
    unsigned __int64 m128_u64[2];
    __int8           m128_i8[16];
    __int16          m128_i16[8];
    __int32          m128_i32[4];
    __int64          m128_i64[2];
    unsigned __int8  m128_u8[16];
    unsigned __int16 m128_u16[8];
    unsigned __int32 m128_u32[4];
} __m128;

9
Accessing SIMD Vector Elements
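
A small sketch of element access through these union fields (Windows/MSVC-specific, as noted above; _mm_set_ps is introduced on a later slide):

#include <immintrin.h>
#include <cstdio>

int main()
{
  __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);

  // Read the four floats directly via the m128_f32 field
  for (int i = 0; i < 4; ++i)
    printf("element %d = %f\n", i, v.m128_f32[i]);

  // Or "break" the type and inspect the raw bits of element 0
  // (1.0f has the bit pattern 0x3F800000)
  printf("element 0 bits = 0x%08X\n", v.m128_u32[0]);
}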

SSE Instructions
x86 SIMD

10

The intrinsic names are a little odd
The types, as we’ve seen:
Start with a double underscore and a single “m”
Specify the size of the SIMD vector in bits
End with a type, or default to single-precision (32-bit) floats
e.g. __m128, __m128i, etc.
The SSE instructions
Start with a single underscore and two “m”s
Have the instruction name (with underscore separators either side)
End with a type
“ps”: single-precision (32-bit) floats
“pd”: double-precision (64-bit) floats
“si128”: effectively treat it as 128 bits
“epXY”, where:
X: “u” (unsigned) or “i” (signed)
Y: 8-, 16-, 32-, 64-bits
e.g. _mm_abs_epi32, _mm_addsub_ps, _mm_or_si128, _mm_test_all_ones

11
Naming Conventions
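
To see the naming scheme in action, here’s the same 128-bit add under three different element interpretations (a sketch; the set/add intrinsics themselves are covered on the following slides):

#include <immintrin.h>

int main()
{
  // _ps: four single-precision floats
  __m128 f = _mm_add_ps(_mm_set1_ps(1.0f), _mm_set1_ps(2.0f));

  // _epi32: four signed 32-bit integers
  __m128i i = _mm_add_epi32(_mm_set1_epi32(1), _mm_set1_epi32(2));

  // _epi8: sixteen signed 8-bit integers
  __m128i c = _mm_add_epi8(_mm_set1_epi8(1), _mm_set1_epi8(2));

  (void)f; (void)i; (void)c;
}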

Basic operations
SIMD float versions
__m128 _mm_and_ps(__m128 a, __m128 b)
__m128 _mm_andnot_ps(__m128 a, __m128 b)
__m128 _mm_or_ps(__m128 a, __m128 b)
SIMD int versions
__m128i _mm_and_si128(__m128i a, __m128i b)
__m128i _mm_andnot_si128(__m128i a, __m128i b)
__m128i _mm_or_si128(__m128i a, __m128i b)
These can be combined to simulate the use of the ?: operator (see the select sketch after the first Arithmetic slide below)
We’ll do that in this week’s tutorial
e.g. given:
__m128 a, b, c;
The intrinsic operation:
a = _mm_and_ps(b, c);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  a.m128_u32[i] = b.m128_u32[i] & c.m128_u32[i];
}
(the loop works on the underlying bits via the union, since C won’t allow & directly on floats)
But the intrinsic is much, much faster

Logical
12

A couple of special intrinsics that work across the whole vector and produce an accumulated result
int _mm_test_all_ones(__m128i a)
Returns true if all bits are 1
This intrinsic actually maps to two assembly instructions
i.e. it’s not an assembly instruction itself
int _mm_test_all_zeros(__m128i a, __m128i mask)
ANDs its two arguments together and then returns true if all bits of the result are zero
In Intel Intrinsics Guide form:
IF (a[127:0] AND mask[127:0] == 0)
  ZF := 1
ELSE
  ZF := 0
FI
RETURN ZF

Logical
13

Scalar promotion
Set all SIMD vector values to be equal to the same scalar value
__m128 _mm_set1_ps(float a)
__m128i _mm_set1_epi32(int a)
Vector initialisation
Set each value in a SIMD vector to be a different scalar
__m128 _mm_set_ps(float e3, float e2, float e1, float e0)
__m128i _mm_set_epi32(int e3, int e2, int e1, int e0)
Note the order of the parameters here (little-endian style)
Zeroing
__m128 _mm_setzero_ps()
__m128i _mm_setzero_si128()
e.g. given:
__m128 a;
The intrinsic operation:
a = _mm_set1_ps(14.3f);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  a.m128_f32[i] = 14.3f;
}

Set
14

A huge number of compare functions for float SIMD vectors
__m128 _mm_cmpXX_ps(__m128 a, __m128 b)
Where XX can be “eq”, “ge”, “gt”, “le”, “lt”, “neq”, “nge”, “ngt”, “nle”, “nlt” (and a number of others)
Corresponding to: equal, greater than or equal, greater than, less than, etc.
They all map to the same assembly instruction: cmpps
This instruction takes three arguments; the last one specifies which compare operation to perform
There is no intrinsic to access this directly
(The AVX instructions work the other way)
e.g. given:
__m128 a, b, c;
The intrinsic operation:
a = _mm_cmplt_ps(b, c);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  a.m128_u32[i] = (b.m128_f32[i] < c.m128_f32[i]) ? 0xFFFFFFFF : 0;
}
Note here that the result is effectively stored as an unsigned int, NOT a float
Additionally, “true” is stored as all ones (i.e. 0xFFFFFFFF in this case) and false is zero

Compare
15

Not as many compare functions for int SIMD vectors
__m128i _mm_cmpeq_epi32(__m128i a, __m128i b)
__m128i _mm_cmpgt_epi32(__m128i a, __m128i b)
__m128i _mm_cmplt_epi32(__m128i a, __m128i b)
Need to be careful about what you are calculating so that any selections make sense
i.e. might need to logically reverse if statements or perform two compares

Compare
16

Basic arithmetic for float SIMD vectors is exactly as you’d expect
__m128 _mm_add_ps(__m128 a, __m128 b)
__m128 _mm_sub_ps(__m128 a, __m128 b)
__m128 _mm_mul_ps(__m128 a, __m128 b)
__m128 _mm_div_ps(__m128 a, __m128 b)
Also a few weird ones
__m128 _mm_addsub_ps(__m128 a, __m128 b)
__m128 _mm_hadd_ps(__m128 a, __m128 b)
__m128 _mm_hsub_ps(__m128 a, __m128 b)
(and there’s more still)
e.g. given:
__m128 a, b, c;
The intrinsic operation:
a = _mm_add_ps(b, c);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  a.m128_f32[i] = b.m128_f32[i] + c.m128_f32[i];
}

Arithmetic
17
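
As promised on the Logical slide, the compare and logical intrinsics combine to simulate ?:. A minimal sketch (select_ps is an illustrative helper name, not a real intrinsic; _mm_storeu_ps is a store intrinsic not otherwise covered in these slides):

#include <immintrin.h>
#include <cstdio>

// Per-element (mask ? x : y): keep x where mask is all-ones, y where it's zero
static __m128 select_ps(__m128 mask, __m128 x, __m128 y)
{
  return _mm_or_ps(_mm_and_ps(mask, x), _mm_andnot_ps(mask, y));
}

int main()
{
  __m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
  __m128 c = _mm_set1_ps(2.5f);
  __m128 mask = _mm_cmplt_ps(b, c);   // all-ones where b < c
  __m128 r = select_ps(mask, b, c);   // per element: b < c ? b : c, i.e. min(b, c)

  float out[4];
  _mm_storeu_ps(out, r);
  printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 1 2 2.5 2.5
}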
__m128 _mm_addsub_ps(__m128 a, __m128 b)
“Alternatively add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.”
FOR j := 0 to 3
  i := j*32
  IF (j is even)
    dst[i+31:i] := a[i+31:i] - b[i+31:i]
  ELSE
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
  FI
ENDFOR
e.g. given:
__m128 a, b, c;
The intrinsic operation:
a = _mm_addsub_ps(b, c);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  if (i % 2 == 0) // if i is even
    a.m128_f32[i] = b.m128_f32[i] - c.m128_f32[i];
  else
    a.m128_f32[i] = b.m128_f32[i] + c.m128_f32[i];
}

Arithmetic
18

__m128 _mm_hadd_ps(__m128 a, __m128 b)
__m128 _mm_hsub_ps(__m128 a, __m128 b)
For _mm_hsub_ps: “Horizontally subtract adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.” (_mm_hadd_ps is the same, but adds)
dst[31:0] := a[31:0] - a[63:32]
dst[63:32] := a[95:64] - a[127:96]
dst[95:64] := b[31:0] - b[63:32]
dst[127:96] := b[95:64] - b[127:96]
e.g. given:
__m128 a, b, c;
The intrinsic operation:
a = _mm_hsub_ps(b, c);
Is equivalent to:
a.m128_f32[0] = b.m128_f32[0] - b.m128_f32[1];
a.m128_f32[1] = b.m128_f32[2] - b.m128_f32[3];
a.m128_f32[2] = c.m128_f32[0] - c.m128_f32[1];
a.m128_f32[3] = c.m128_f32[2] - c.m128_f32[3];

Arithmetic
19

Arithmetic for integer SIMD vectors is a little more complicated
Add/subtract as normal
__m128i _mm_add_epi32(__m128i a, __m128i b)
__m128i _mm_sub_epi32(__m128i a, __m128i b)
Multiple multiplication options
__m128i _mm_mullo_epi32(__m128i a, __m128i b)
“Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.”
__m128i _mm_mul_epi32(__m128i a, __m128i b)
“Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.”
e.g. given:
__m128i a, b, c;
The intrinsic operation:
a = _mm_mullo_epi32(b, c);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  a.m128i_i32[i] = b.m128i_i32[i] * c.m128i_i32[i];
}

Arithmetic
20

Cast operations
__m128 _mm_castpd_ps(__m128d a)
__m128i _mm_castpd_si128(__m128d a)
__m128d _mm_castps_pd(__m128 a)
__m128i _mm_castps_si128(__m128 a)
__m128d _mm_castsi128_pd(__m128i a)
__m128 _mm_castsi128_ps(__m128i a)
Casts leave the underlying bits unchanged
i.e. they are just for making the C/C++ type system happy
This is different from what happens when you cast, e.g. a float to an int
e.g. x = (int) 6.4f; // x = 6

Casts
21

Huge number of these, e.g.
__m128 _mm_cvtepi32_ps(__m128i a)
FOR j := 0 to 3
  i := 32*j
  dst[i+31:i] := Convert_Int32_To_FP32(a[i+31:i])
ENDFOR
Conversions try to preserve the meaning of each SIMD value
This is like normal casts in C/C++
e.g. given:
__m128 a;
__m128i b;
The intrinsic operation:
a = _mm_cvtepi32_ps(b);
Is equivalent to:
for (int i = 0; i < 4; ++i)
{
  a.m128_f32[i] = (float) b.m128i_i32[i];
}

Conversions
22

AVX Instructions
x86 SIMD

23

The intrinsic names follow the same pattern as SSE
The types, as we’ve seen:
Start with a double underscore and a single “m”
Specify the size of the SIMD vector in bits
End with a type, or default to single-precision (32-bit) floats
e.g. __m256, __m256i, etc.
The AVX instructions:
Start with “_mm256”
Have the instruction name (with underscore separators either side)
End with a type
“ps”: single-precision (32-bit) floats
“pd”: double-precision (64-bit) floats
“si256”: effectively treat it as 256 bits
“epXY”, where:
X: “u” (unsigned) or “i” (signed)
Y: 8-, 16-, 32-, 64-, 128-bits
e.g. _mm256_abs_epi32, _mm256_addsub_ps, _mm256_or_si256

24
Naming Conventions
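
Before the AVX versions of these operations, a small sketch contrasting the cast and convert intrinsics from the Casts and Conversions slides above (_mm_storeu_ps is again an assumed store intrinsic, not covered in these slides):

#include <immintrin.h>
#include <cstdio>

int main()
{
  __m128i ints = _mm_set_epi32(4, 3, 2, 1);

  // Conversion preserves the values: 1, 2, 3, 4 become 1.0f, 2.0f, 3.0f, 4.0f
  __m128 converted = _mm_cvtepi32_ps(ints);

  // Cast preserves the bits: the integer 1 reinterpreted as a float
  // bit pattern is a tiny denormal, nothing like 1.0f
  __m128 cast = _mm_castsi128_ps(ints);

  float v[4], w[4];
  _mm_storeu_ps(v, converted);
  _mm_storeu_ps(w, cast);
  printf("converted: %g %g %g %g\n", v[0], v[1], v[2], v[3]);
  printf("cast:      %g %g %g %g\n", w[0], w[1], w[2], w[3]);
}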
Basic operations
SIMD float versions
__m256 _mm256_and_ps(__m256 a, __m256 b)
__m256 _mm256_andnot_ps(__m256 a, __m256 b)
__m256 _mm256_or_ps(__m256 a, __m256 b)
SIMD integer versions
__m256i _mm256_and_si256(__m256i a, __m256i b)
__m256i _mm256_andnot_si256(__m256i a, __m256i b)
__m256i _mm256_or_si256(__m256i a, __m256i b)
e.g. given:
__m256 a, b, c;
The intrinsic operation:
a = _mm256_and_ps(b, c);
Is equivalent to:
for (int i = 0; i < 8; ++i)
{
  a.m256_u32[i] = b.m256_u32[i] & c.m256_u32[i];
}
(again working on the underlying bits, since C won’t allow & on floats)
But the intrinsic is much, much faster

Logical
25

A couple of special intrinsics that work across the whole vector and produce an accumulated result
int _mm256_testz_ps(__m256 a, __m256 b)
ANDs its two arguments together and returns true if no element of the result has its sign bit set
e.g. _mm256_testz_ps(a, a) returns true if every value in a is non-negative
int _mm256_testz_si256(__m256i a, __m256i b)
ANDs its two arguments together and then returns true if all bits of the result are zero

Logical
26

Scalar promotion
Set all SIMD vector values to be equal to the same scalar value
__m256 _mm256_set1_ps(float a)
__m256i _mm256_set1_epi32(int a)
Vector initialisation
Set each value in a SIMD vector to be a different scalar
__m256 _mm256_set_ps(float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)
__m256i _mm256_set_epi32(int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
Note the order of the parameters here (little-endian style)
Zeroing
__m256 _mm256_setzero_ps()
__m256i _mm256_setzero_si256()
e.g. given:
__m256 a;
The intrinsic operation:
a = _mm256_set1_ps(14.3f);
Is equivalent to:
for (int i = 0; i < 8; ++i)
{
  a.m256_f32[i] = 14.3f;
}

Set
27

A single compare function for float SIMD vectors
__m256 _mm256_cmp_ps(__m256 a, __m256 b, const int imm8)
Where imm8 can be any value from 0..31, but there are constants provided, e.g.
_CMP_EQ_OQ (0), equal
_CMP_NEQ_OQ (12), not equal
_CMP_LT_OQ (17), less than
_CMP_LE_OQ (18), less than or equal
_CMP_GE_OQ (29), greater than or equal
_CMP_GT_OQ (30), greater than
e.g. given:
__m256 a, b, c;
The intrinsic operation:
a = _mm256_cmp_ps(b, c, _CMP_LT_OQ);
Is equivalent to:
for (int i = 0; i < 8; ++i)
{
  a.m256_u32[i] = (b.m256_f32[i] < c.m256_f32[i]) ? 0xFFFFFFFF : 0;
}
Note here that the result is effectively stored as an unsigned int, NOT a float
Additionally, “true” is stored as all ones (i.e. 0xFFFFFFFF in this case) and false is zero

Compare
28

“O” and “U” correspond to “ordered” and “unordered”
This specifies what happens if there is a NaN in one of the floating-point values
“Q” and “S” correspond to “quiet” and “signaling”
Whether an “exception” is raised if something is a NaN
Use “Q”

29
Compare Values

(Table: the 32 compare predicates — predicate name, imm8 value, description, and the result for A>B, A<B, A=B, and unordered operands)
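
The AVX compare and logical intrinsics combine into a select in exactly the same way as the SSE version sketched earlier (select_ps is again an illustrative helper name; _mm256_storeu_ps is an assumed store intrinsic). AVX also provides _mm256_blendv_ps, which performs the same selection in a single intrinsic:

#include <immintrin.h>
#include <cstdio>

// Per-element (mask ? x : y) using the AVX logical intrinsics from the slides above
static __m256 select_ps(__m256 mask, __m256 x, __m256 y)
{
  return _mm256_or_ps(_mm256_and_ps(mask, x), _mm256_andnot_ps(mask, y));
}

int main()
{
  __m256 b = _mm256_set_ps(8.0f, 7.0f, 6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f);
  __m256 c = _mm256_set1_ps(4.5f);
  __m256 mask = _mm256_cmp_ps(b, c, _CMP_LT_OQ);  // all-ones where b < c
  __m256 r = select_ps(mask, b, c);               // per element: min(b, c)

  float out[8];
  _mm256_storeu_ps(out, r);
  for (int i = 0; i < 8; ++i)
    printf("%g ", out[i]);
  printf("\n");
}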