
• x86 SIMD instruction sets:
• AVX: register width = 256 bits
• 4 double precision floating point operands
• AVX-512: register width = 512 bits
• 8 double precision floating point operands
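
The operand counts above follow directly from the register width divided by the element size. A minimal sketch in C, where the names simd_lanes and vector_add are illustrative and not part of the original slides:

```c
#include <stddef.h>

/* Number of elements of a given size that fit in one SIMD register. */
size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    return register_bits / (8 * element_bytes);
}

/* A loop with independent iterations. With AVX a compiler could process
 * 4 doubles per instruction here, and with AVX-512 8 doubles per
 * instruction, provided it decides to vectorise the loop. */
void vector_add(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

For example, simd_lanes(512, sizeof(double)) gives the 8 double precision operands quoted for AVX-512, assuming the usual 8-byte double.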


for(i=0;i 256 bit) • Gather instructions
• FMA instructions

Intel AVX512
• 8x double • 16x float
• KNL processor had 2 x AVX512 vector units per core • Symmetrical units
• Only one supports some of the legacy stuff (x87, MMX, some of the SSE stuff)
• Vector instructions have a latency of 6 cycles

KNL AVX-512
• AVX512 has extensions to help with vectorisation
• Conflict detection (AVX-512CD)
• Should improve vectorisation of loops that have dependencies
• vpconflict instructions
• If loops don’t have dependencies, telling the compiler will still improve performance (e.g. #pragma ivdep)
• Exponential and reciprocal functions (AVX-512ER) • Fast maths functions for transcendental sequences
• Prefetch (AVX-512PF)
• Gather/Scatter sparse vectors prior to calculation • Pack/Unpack
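
As an illustration of the kind of loop conflict detection is aimed at, here is a minimal histogram sketch in C. The function name and arguments are illustrative. Whether a particular compiler actually vectorises this loop using vpconflict instructions depends on the compiler and the target, so treat it as an example of the dependency pattern rather than a guarantee.

```c
#include <stddef.h>

/* Several iterations may update the same bin. If a few consecutive
 * iterations were packed into one vector, lanes holding the same
 * idx value would conflict, so the hardware or compiler has to
 * detect and serialise those lanes to keep the result correct. */
void histogram(int *bins, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bins[idx[i]] += 1;
    }
}
```

If the index values are known to be unique, the loop has no real dependency and the compiler can be told so instead.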

Compiler vs explicit vectorisation
• Compilers will automatically try to vectorise code
• Implicit vectorisation
• Can help them to do this
• Compiler always chooses correctness rather than performance
• Will often make an automatic decision about when to vectorise
• There are programming constructs/features that let you write explicit vector code
• Can be less portable/more machine specific
• Defined code will always be vectorised (even if slower)
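
The difference between the two approaches can be sketched as follows. The slides do not name a specific construct, so the use of the OpenMP simd directive here is an assumption. It is one portable way to request vectorisation explicitly, and it normally needs a compiler option such as -fopenmp-simd (GCC/Clang) to take effect. Without it the pragma is ignored and the loop is treated like the implicit version.

```c
#include <stddef.h>

/* Implicit: the compiler decides whether vectorising is safe and
 * worthwhile. restrict promises that out and in do not overlap. */
void scale_implicit(float *restrict out, const float *restrict in,
                    float s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}

/* Explicit: the programmer asks for a vector loop. The compiler no
 * longer weighs up whether it is profitable, so this can end up
 * slower than the scalar loop for small n. */
void scale_explicit(float *restrict out, const float *restrict in,
                    float s, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}
```

Both functions compute the same result; only the amount of control handed to the compiler differs.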

When does the compiler vectorize • What can be vectorized
• Only loops
• Usually only one loop is vectorisable in a loop nest
• And most compilers only consider the inner loop
• Optimising compilers will use vector instructions
• Relies on code being vectorisable
• Or in a form that the compiler can convert to be vectorisable
• Some compilers are better at this than others
• Check the compiler output listing and/or assembler listing • Look for packed AVX/AVX2/AVX512 instructions
i.e. instructions using registers zmm0-zmm31 (512-bit), ymm0-ymm31 (256-bit), xmm0-xmm31 (128-bit)
Instructions like vaddps, vmulps, etc.
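
A small loop to try this on, as a sketch. The file name saxpy.c and the GCC command in the comment are assumptions for illustration; the slides themselves use the Intel compiler, which has its own flags.

```c
#include <stddef.h>

/* Hypothetical file name: saxpy.c
 * One possible way to produce an assembler listing with GCC:
 *     gcc -O3 -mavx2 -S saxpy.c
 * Then search saxpy.s for packed instructions such as vmulps and
 * vaddps operating on ymm registers. The exact instructions emitted
 * depend on the compiler version and flags. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

If only scalar forms (for example vmulss, vaddss) appear, the loop was not vectorised.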

Intel compiler
• Intel compiler requires
• Optimisation enabled (generally is by default) • -O2
• To know what hardware it’s compiling for
• -xCORE-AVX512
• This is added automatically for you on a Cray like ARCHER2
• Can disable vectorisation
• Useful for checking performance
• Intel compiler will provide vectorisation information • -qopt-report=[n] (e.g. -qopt-report=5)

Helping vectorisation
• Does the loop have dependencies?
• information carried between iterations
• e.g. counter: total = total + a(i)
• Tell the compiler that it is safe to vectorise
• Rewrite code to use algorithm without dependencies, e.g.
• promote loop scalars to vectors (single dimension array)
• use calculated values (based on loop index) rather than iterated counters, e.g.
• Replace: count = count + 2; a(count) = … • By: a(2*i) = …
• move if statements outside the inner loop
• may need temporary vectors to do this (otherwise use masking operations)
• If the compiler still does not vectorise the loop, is there a good reason for this?
• There is an overhead in setting up vectorisation; maybe it’s not worth it • Could you unroll inner (or outer) loop to provide more work?
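
Two of the rewrites listed above can be sketched in C as before/after pairs. The function names are illustrative, and the slide's 1-based a(count) example is adapted to 0-based C indexing, so the counter starts at 0 here. Modern compilers can often work out simple counters like this on their own; the point is the pattern.

```c
#include <stddef.h>

/* Before: an iterated counter. Each iteration needs the value of
 * count left by the previous one. a must hold at least 2*n floats. */
void fill_even_counter(float *a, const float *b, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        a[count] = b[i];
        count = count + 2;
    }
}

/* After: the index is calculated from the loop index, so iterations
 * are independent of each other. */
void fill_even_indexed(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[2 * i] = b[i];
    }
}

/* Before: a loop-invariant if statement inside the inner loop. */
void scale_or_copy_inner_if(float *out, const float *in,
                            int do_scale, float s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (do_scale) {
            out[i] = s * in[i];
        } else {
            out[i] = in[i];
        }
    }
}

/* After: the test is moved outside, leaving two branch-free loops. */
void scale_or_copy_hoisted(float *out, const float *in,
                           int do_scale, float s, size_t n)
{
    if (do_scale) {
        for (size_t i = 0; i < n; i++) {
            out[i] = s * in[i];
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            out[i] = in[i];
        }
    }
}
```

Hoisting only works like this when the condition does not change inside the loop; a condition that depends on i needs temporary vectors or masking instead, as the bullet above notes.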

Vectorisation example
• Compiler cannot easily vectorise:
• Loops with pointers
• Non-unit stride loops
• Funny memory patterns
• Unaligned data accesses
• Conditionals/Function calls in loops
• Data dependencies between loop iterations
int *loop_size;
void problem_function(float *data1, float *data2, float *data3, int *index){
    int i,j;
    for(i=0;i<*loop_size;i++){
        j = index[i];
        data1[j] = data2[i] * data3[i];
    }
}

Vectorisation example
• Can help compiler
• Tell it loops are independent
• #pragma ivdep
• !dir$ ivdep
• Tell it that variables or arrays are unique
• restrict
• Align arrays to cache line boundaries
• Tell the compiler the arrays are aligned
• Make loop sizes explicit to the compiler
• Ensure loops are big enough to vectorise

int *loop_size;
void problem_function(float * restrict data1, float * restrict data2, float * restrict data3, int * restrict index){
    int i,j,n;
    n = *loop_size;
    #pragma ivdep
    for(i=0;i<n;i++){
        /* same loop body as the first version above */
        j = index[i];
        data1[j] = data2[i] * data3[i];
    }
}
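
The two alignment hints in the list (align arrays to cache line boundaries, and tell the compiler they are aligned) can be sketched as below. This is an illustration under stated assumptions: a 64-byte cache line, C11 aligned_alloc for the allocation, and the GCC/Clang extension __builtin_assume_aligned for the hint. Other compilers provide their own spelling of the alignment hint. The function names are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

/* Allocate room for n floats starting on a cache line boundary.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    if (padded == 0) {
        padded = CACHE_LINE;
    }
    return aligned_alloc(CACHE_LINE, padded);
}

/* All three arrays must really be 64-byte aligned; passing an
 * unaligned pointer here is undefined behaviour. */
void multiply_aligned(float *restrict out, const float *restrict x,
                      const float *restrict y, size_t n)
{
    /* GCC/Clang extension: promise the compiler the alignment. */
    float *o = __builtin_assume_aligned(out, CACHE_LINE);
    const float *a = __builtin_assume_aligned(x, CACHE_LINE);
    const float *b = __builtin_assume_aligned(y, CACHE_LINE);

    for (size_t i = 0; i < n; i++) {
        o[i] = a[i] * b[i];
    }
}
```

Copying the loop count into the local parameter n, as the second version of problem_function does, serves the "make loop sizes explicit" point: the compiler can see that the bound does not change during the loop.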