• x86 SIMD instruction sets:
• AVX: register width = 256 bits
• 4 double precision floating point operands
• AVX-512: register width = 512 bits
• 8 double precision floating point operands
for(i=0;i<n;i++){ c[i] = a[i] + b[i]; }
• FMA instructions
Intel AVX512
• 8x double
• 16x float
• KNL processor had 2 x AVX512 vector units per core
• Symmetrical units
• Only one supports some of the legacy stuff (x87, MMX, some of the SSE stuff)
• Vector instructions have a latency of 6 cycles
KNL AVX-512
• AVX512 has extensions to help with vectorisation
• Conflict detection (AVX-512CD)
• Should improve vectorisation of loops that have dependencies
vpconflict instructions
• If loops don’t have dependencies, telling the compiler will still improve performance (i.e. #pragma ivdep)
• Exponential and reciprocal functions (AVX-512ER)
• Fast maths functions for transcendental sequences
• Prefetch (AVX-512PF)
• Gather/Scatter sparse vectors prior to calculation
• Pack/Unpack
Compiler vs explicit vectorisation
• Compilers will automatically try to vectorise code
• Implicit vectorisation
• Can help them to do this
• Compiler always chooses correctness rather than performance
• Will often make an automatic decision about when to vectorise
• There are programming constructs/features that let you write explicit vector code
• Can be less portable/more machine specific
• Defined code will always be vectorised (even if slower)
When does the compiler vectorise?
• What can be vectorised
• Only loops
• Usually only one loop in a loop nest is vectorisable
• And most compilers only consider the inner loop
• Optimising compilers will use vector instructions
• Relies on code being vectorisable
• Or in a form that the compiler can convert to be vectorisable
• Some compilers are better at this than others
• Check the compiler output listing and/or assembler listing
• Look for packed AVX/AVX2/AVX-512 instructions
• i.e. instructions using registers zmm0-zmm31 (512-bit), ymm0-ymm31 (256-bit), xmm0-xmm31 (128-bit)
• Instructions like vaddps, vmulps, etc.
Intel compiler
• Intel compiler requires
• Optimisation enabled (generally is by default)
• -O2
• To know what hardware it’s compiling for
• -xCORE-AVX512
• This is added automatically for you on a Cray like ARCHER2
• Can disable vectorisation
• Useful for checking performance
• Intel compiler will provide vectorisation information
• -qopt-report=[n] (i.e. -qopt-report=5)
Helping vectorisation
• Does the loop have dependencies?
• information carried between iterations
• e.g. counter: total = total + a(i)
• Tell the compiler that it is safe to vectorise
• Rewrite code to use algorithm without dependencies, e.g.
• promote loop scalars to vectors (single dimension array)
• use calculated values (based on loop index) rather than iterated counters, e.g.
• Replace: count = count + 2; a(count) = …
• By: a(2*i) = …
• move if statements outside the inner loop
• may need temporary vectors to do this (otherwise use masking operations)
• Is there a good reason for this?
• There is an overhead in setting up vectorisation; maybe it’s not worth it
• Could you unroll the inner (or outer) loop to provide more work?
Vectorisation example
• Compiler cannot easily vectorise:
• Loops with pointers
• Non-unit stride loops
• Funny memory patterns
• Unaligned data accesses
• Conditionals/Function calls in loops
• Data dependencies between loop iterations
int *loop_size;
void problem_function(float *data1, float *data2, float *data3, int *index){
  int i, j;
  for(i=0;i<*loop_size;i++){
    j = index[i];
    data1[j] = data2[i] * data3[i];
  }
}
Vectorisation example
• Can help compiler
• Tell it loops are independent
• #pragma ivdep
• !dir$ ivdep
• Tell it that variables or arrays are unique
• restrict
• Align arrays to cache line boundaries
• Tell the compiler the arrays are aligned
• Make loop sizes explicit to the compiler
• Ensure loops are big enough to vectorise
int *loop_size;
void problem_function(float * restrict data1, float * restrict data2, float * restrict data3, int * restrict index){
  int i, j, n;
  n = *loop_size;
#pragma ivdep
  for(i=0;i<n;i++){
    j = index[i];
    data1[j] = data2[i] * data3[i];
  }
}