AVX512

VECTORISATION III

Vectorisation
• Same operation on multiple data items

• Wide registers
• SIMD needed to approach peak FLOP performance, but your code must be capable of vectorisation
for(i=0;i
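
As an illustration of "same operation on multiple data items", here is a minimal sketch of a loop that a compiler can usually vectorise. The function name vec_add is hypothetical and not taken from the slides.

```c
#include <stddef.h>

/* Every iteration touches a different element and does not depend on
   the result of any other iteration, so a single SIMD instruction can
   process several values of i at once. */
void vec_add(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Whether the loop is actually vectorised depends on the compiler and its flags, so the report still needs checking.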

Did my loop get vectorised?
• Always check the compiler output to see what it did
  • Cray: -hlist=a
  • GNU: -fdump-tree-vect-all=
  • Intel: -opt-report3
• or (for the hard core) check the assembler generated
  • Look to see which registers are in use
• Clues from hardware counters, e.g. CrayPAT’s HWPC measurements
  • export PAT_RT_HWPC=13 or 14 # Floating point operations SP,DP
  • Complicated, but look for a ratio of operations/instructions > 1
  • expect 4 for pure AVX with double precision floats
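
The expected ratio of 4 follows from the register width: a 256-bit AVX register holds four 64-bit doubles. A small sketch of that arithmetic, using a hypothetical helper simd_lanes:

```c
/* Number of elements a single SIMD instruction operates on:
   register width divided by element width, both in bits. */
unsigned simd_lanes(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}
```

For example, simd_lanes(256, 64) gives 4 for AVX with doubles, simd_lanes(256, 32) gives 8 for AVX with single precision, and simd_lanes(512, 64) gives 8 for AVX512 with doubles.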

Did my loop get vectorised?
• GNU offers other options for checking:
  • -fopt-info
  • -O3 -fopt-info-missed=missed.all
  • -O2 -ftree-vectorize -fopt-info-vec-missed
  • -fopt-info-loop-optimized
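
To try these options, a small test file with one easy loop and one awkward loop is enough. The sketch below is an assumption for illustration only: the file name vec_report.c and both function names are hypothetical.

```c
/* vec_report.c (hypothetical file name)
 *
 * Compile without linking and ask GCC what it could not vectorise:
 *   gcc -O2 -ftree-vectorize -fopt-info-vec-missed -c vec_report.c
 * or print everything the optimiser reports:
 *   gcc -O3 -fopt-info -c vec_report.c
 */

/* Independent iterations: a good candidate for vectorisation. */
void scale(int n, float *x, float s)
{
    for (int i = 0; i < n; i++) {
        x[i] *= s;
    }
}

/* Loop-carried dependence: iteration i needs the value that
   iteration i-1 has just written, so this loop is typically
   reported as not vectorised. */
void smooth(int n, float *x)
{
    for (int i = 1; i < n; i++) {
        x[i] = 0.5f * (x[i] + x[i - 1]);
    }
}
```

The exact wording of the report varies between GCC versions, so look for the source line numbers of the two loops rather than a particular message.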

Example vectorisation reports
LOOP BEGIN at vectorisation.c(20,3)
   remark #15388: vectorization support: reference A[i] has aligned access [ vectorisation.c(21,5) ]
   remark #15388: vectorization support: reference B[i] has aligned access [ vectorisation.c(22,5) ]
   remark #15388: vectorization support: reference A[i] has aligned access [ vectorisation.c(22,12) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 0.536
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 9
   remark #15477: vector cost: 1.750
   remark #15478: estimated potential speedup: 4.480
   remark #15487: type converts: 1
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=32

Example vectorisation reports
LOOP BEGIN at vectorisation.c(20,3)
   remark #25015: Estimate of max trip count of loop=32
LOOP BEGIN at vectorisation.c(20,3)
   remark #15388: vectorization support: reference A[i] has aligned access [ vectorisation.c(21,5) ]
   remark #15389: vectorization support: reference B[i] has unaligned access [ vectorisation.c(22,5) ]
   remark #15388: vectorization support: reference A[i] has aligned access [ vectorisation.c(22,12) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.933
   remark #15301: REMAINDER LOOP WAS VECTORIZED

Example vectorisation reports
LOOP BEGIN at vectorisation.c(7,4)
   remark #25015: Estimate of max trip count of loop=3
LOOP END
LOOP BEGIN at vectorisation.c(7,4)
   remark #25228: Loop multiversioned for Data Dependence
   remark #15388: vectorization support: reference C[i] has aligned access [ vectorisation.c(8,7) ]
   remark #15389: vectorization support: reference A[i] has unaligned access [ vectorisation.c(8,14) ]
   remark #15388: vectorization support: reference B[i] has aligned access [ vectorisation.c(8,21) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 1.417
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8
   remark #15477: vector cost: 1.500
   remark #15478: estimated potential speedup: 4.860
   remark #15488: --- end vector cost summary ---
LOOP BEGIN at vectorisation.c(7,4)
LOOP END
LOOP BEGIN at vectorisation.c(7,4)
LOOP END
LOOP BEGIN at vectorisation.c(7,4)
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
   remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at vectorisation.c(7,4)
LOOP END

Vectorisation example
• Can help compiler
  • Tell it loops are independent
    • #pragma ivdep
    • #pragma GCC ivdep
    • !dir$ ivdep
    • !GCC$ ivdep
  • Tell it that variables or arrays are unique
    • restrict
  • Align arrays to cache line boundaries
  • Tell the compiler the arrays are aligned
  • Make loop sizes explicit to the compiler
  • Ensure loops are big enough to vectorise

int *loop_size;
void problem_function(float * restrict data1, float * restrict data2,
                      float * restrict data3, int * restrict index){
  int i,j,n;
  n = *loop_size;
#pragma ivdep
  for(i=0;i
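
The hints on this slide (restrict, cache line alignment, alignment promises, ivdep) might be combined as in the sketch below. It is not the code from the slide: the names alloc_aligned_floats, scaled_add and indexed_add are hypothetical, a 64-byte cache line is assumed, aligned_alloc needs C11, and __builtin_assume_aligned and #pragma GCC ivdep are GCC-specific (Clang also accepts the builtin).

```c
#include <stdlib.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Allocate n floats starting on a cache line boundary.
   aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t padded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, padded);
}

/* restrict promises that out and in do not overlap; the builtin
   promises that both start on a 64-byte boundary. Breaking either
   promise is undefined behaviour. */
void scaled_add(int n, float *restrict out, const float *restrict in, float s)
{
    float *o = __builtin_assume_aligned(out, CACHE_LINE);
    const float *p = __builtin_assume_aligned(in, CACHE_LINE);
    for (int i = 0; i < n; i++) {
        o[i] += s * p[i];
    }
}

/* With indirect addressing the compiler cannot prove that iterations
   are independent. ivdep tells it to assume they are, so the caller
   must make sure index[] holds no repeated values. */
void indexed_add(int n, float *restrict out, const float *restrict in,
                 const int *restrict index)
{
#pragma GCC ivdep
    for (int i = 0; i < n; i++) {
        out[index[i]] += in[i];
    }
}
```

Passing n as a plain argument, rather than reading it through a global pointer, is one way of making the loop size explicit to the compiler.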