
Vectorisation

Introduction
• The current trend is for microprocessor designs to increase performance by increasing parallelism.


• Compilers therefore need to be able to recognise parallelism within codes.
• This technology is called vectorisation as it was originally developed for vector architectures in the 70s/80s
• The term is now used more generally for other kinds of compiler generated parallelism
• SIMD instructions
• Large overlap with analysis needed for compiler generated threading.

Vector Instructions (Vectorisation)
• Modern CPUs can perform multiple operations each cycle
• Use special SIMD (Single Instruction Multiple Data) instructions
• e.g. SSE, AVX
• Operate on a “vector” of data
• typically 2 or 4 double precision floats (on )
• Potentially gives speedup in floating point operations
• Usually only one loop in a loop nest is vectorisable
• And most compilers only consider the inner loop
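• For illustration (not from the slides; the name add2d and size N are made up), a minimal sketch of a loop nest in which only the inner, stride-1 loop is the natural SIMD candidate:

#define N 1024

/* Illustrative sketch: the inner i loop walks memory with stride 1,
   so it is the loop a compiler will normally try to vectorise;
   the outer j loop stays scalar. */
void add2d(double a[N][N], double b[N][N], double c[N][N])
{
    for (int j = 0; j < N; j++)          /* outer loop: scalar          */
        for (int i = 0; i < N; i++)      /* inner loop: SIMD candidate  */
            c[j][i] = a[j][i] + b[j][i];
}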

Cray-1 (1976)
• 80 MHz
• Built from discrete logic Integrated Circuits, ~200,000 gates
• Pipeline parallelism
• Vector instructions started the pipeline
• One element produced/consumed per cycle
• Once started pipelines did not stall. OK to use a vector register as soon as first element available.
• Separate pipelines for each instruction. “Chained” instructions increased performance: 80 Mflop/s per pipeline.
• 8 vector registers
• 64 elements at 64-bit/8-byte precision (4096 bit in total)
• Later Cray vector processors kept the same architecture but increased the number of ALUs
• Compilers treated these as longer vector lengths

• Cray vector architectures did not use caches.
• Most vector load/stores would cycle through available banks
• Load/stores were also pipelined
• Memory system saw a series of accesses.
• Vector load stores could use any memory stride
• Could also use indirect addressing A(IDX(I)), though compiler directives were needed to tell the compiler the loop iterations were independent (a modern sketch of this is shown after this list)
• Other historical vector systems were less flexible
• ETA10 typically needed stride-1 memory access to vectorise.
• Heavily banked memory
• Successive memory addresses mapped to different memory banks
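• A minimal sketch of the indirect-addressing case in modern terms, assuming OpenMP's simd directive as the way to assert independence (the original Cray directives were different; the function name scatter_add is made up):

/* Sketch: an a(idx(i))-style scatter update.  If idx contains repeated
   values the iterations are NOT independent; the directive is the
   programmer's promise to the compiler that they are.
   (Assumes compilation with OpenMP SIMD support, e.g. -fopenmp-simd.) */
void scatter_add(int n, double *a, const double *b, const int *idx)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[idx[i]] += b[i];       /* indirect (scattered) store */
}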

Modern vector instructions
• Most modern microprocessors support some form of SIMD instructions.
• x86 processors are no exception
• These have been added as extensions to the base instruction set
• Many generations of these, some now obsolete
• Microprocessors have many more gates
• Vector instructions implemented with replicated hardware, not pipelining
• Microprocessors make extensive use of caches
• Vector load/stores need to map onto cache-lines
• Contiguous, well aligned data important.
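• A minimal sketch of keeping data contiguous and cache-line aligned, assuming C11 aligned_alloc and a 64-byte cache line (the sizes chosen are just examples):

#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 20;
    /* 64-byte alignment matches a typical x86 cache line, so aligned,
       stride-1 vector loads/stores map cleanly onto whole cache lines. */
    double *a = aligned_alloc(64, n * sizeof *a);
    double *b = aligned_alloc(64, n * sizeof *b);
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++) b[i] = (double)i;
    for (size_t i = 0; i < n; i++) a[i] = 2.0 * b[i];   /* contiguous accesses */

    free(a);
    free(b);
    return 0;
}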

X86 vector extensions
• MMX instructions
• Multimedia Extensions (1997)
• Integer operations only
• 8 x 64-bit registers MM0 … MM7
• Aliases for the x87 FP registers
• SIMD for 32, 16 or 8-bit integer types
• SSE instructions
• Streaming SIMD Extensions
• Multiple generations
• 8 x 128-bit registers xmm0 … xmm7 (xmm8 … xmm15 in 64-bit mode)
• Entirely new registers
• SIMD for 64 & 32-bit floating point
• SIMD for 64, 32,16 or 8-bit integer types
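• For reference, a sketch of how these register widths appear as C types when Intel's intrinsics headers are used (GCC/Clang on x86; the variable names are made up):

#include <immintrin.h>

__m64   mmx_reg;    /* 64-bit MMX register : e.g. 8 x 8-bit integers   */
__m128i sse_ints;   /* 128-bit SSE register: e.g. 4 x 32-bit integers  */
__m128d sse_dbls;   /* 128-bit SSE register: 2 x double                */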

• AVX instructions
• SIMD registers extended to 256-bits
• ymm0 … ymm15
• SSE instructions target lower half
• AVX2 instructions
• Introduced in Haswell (also Broadwell)
• Expanded instruction set, same vector length
• Fused Multiply Add (FMA)
• Gather instructions
• AVX-512 instructions
• Newest iteration, so far only in Intel KNL and Skylake
• SIMD registers extended to 512-bits
• zmm0 … zmm31
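• A small sketch of checking at run time which of these x86 extensions the CPU supports, using GCC/Clang built-ins (__builtin_cpu_init and __builtin_cpu_supports):

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();   /* initialise CPU feature detection */
    printf("SSE2     : %d\n", __builtin_cpu_supports("sse2"));
    printf("AVX      : %d\n", __builtin_cpu_supports("avx"));
    printf("AVX2     : %d\n", __builtin_cpu_supports("avx2"));
    printf("AVX-512F : %d\n", __builtin_cpu_supports("avx512f"));
    return 0;
}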

• AVX-512 Foundation (F)
• expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control; supported by KNL and Skylake
• AVX-512 Conflict Detection Instructions (CD)
• conflict detection to allow more loops to be vectorised; supported by KNL and Skylake (see the sketch after this list)
• AVX-512 Exponential and Reciprocal Instructions (ER)
• exponential and reciprocal operations designed to help implement transcendental operations, supported by KNL
• AVX-512 Prefetch Instructions (PF)
• prefetch capabilities, supported by KNL
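• To make the Conflict Detection case concrete, a sketch of the kind of loop AVX-512CD targets (plain C; the CD instructions themselves are normally emitted by the compiler, and the function name is made up):

/* A histogram update: two SIMD lanes may hit the same bin in the
   same vector iteration, so without conflict detection a compiler
   cannot vectorise this safely. */
void histogram(int n, const int *key, long *bins)
{
    for (int i = 0; i < n; i++)
        bins[key[i]]++;
}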

Vectorisation on other processors
• AMD EPYC processor line
• 256-bit SIMD – AVX2 support (Rome)
• 512-bit SIMD (AVX-512) support coming in next-generation AMD processors
• Most Arm processors have no SIMD support
• Current HPC processors have 128-bit SIMD through the NEON instruction set
• SVE: Scalable Vector Extensions
• To allow chip manufacturers to build chips with different vector lengths
• A64FX 512-bit SVE
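• A minimal sketch of a vector-length-agnostic loop using the Arm ACLE SVE intrinsics (assumes arm_sve.h and an SVE-enabled compiler; the function name add_sve is made up). The same code processes 8 doubles per iteration on the 512-bit A64FX, or fewer on narrower implementations:

#include <arm_sve.h>
#include <stdint.h>

void add_sve(int64_t n, const double *b, const double *c, double *a)
{
    for (int64_t i = 0; i < n; i += svcntd()) {        /* svcntd() = doubles per vector   */
        svbool_t    pg = svwhilelt_b64_s64(i, n);      /* predicate also covers the tail */
        svfloat64_t vb = svld1_f64(pg, &b[i]);
        svfloat64_t vc = svld1_f64(pg, &c[i]);
        svst1_f64(pg, &a[i], svadd_f64_m(pg, vb, vc));
    }
}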

Vectorisation
for(i=0; i<N; i++)
    a[i] = b[i] + c[i];
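• Taking the loop above (its body, an element-wise add, is assumed here because the slide text is truncated), a hand-written sketch of what AVX vectorisation amounts to, using Intel intrinsics (four doubles per instruction; assumes n is a multiple of 4 and the arrays do not overlap; the name vec_add is made up):

#include <immintrin.h>

/* Sketch only: manual AVX version of the scalar loop above.
   A compiler would also generate a scalar remainder loop
   for any left-over iterations. */
void vec_add(int n, const double *b, const double *c, double *a)
{
    for (int i = 0; i < n; i += 4) {
        __m256d vb = _mm256_loadu_pd(&b[i]);   /* load 4 doubles      */
        __m256d vc = _mm256_loadu_pd(&c[i]);
        __m256d va = _mm256_add_pd(vb, vc);    /* 4 additions at once */
        _mm256_storeu_pd(&a[i], va);
    }
}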

Did my loop get vectorised?
• Intel: -opt-report3
• AMD/Clang: -Rpass-analysis=.*
• or (for the hard core) check the assembler generated
• Look to see which registers are in use.
• Clues from hardware profiling, i.e. CrayPAT’s HWPC measurements
• export PAT_RT_HWPC=13 or 14 # Floating point operations SP,DP
• Complicated, but look for ratio of operations/instructions > 1
• expect 4 for pure AVX with double precision floats
• Clues from profilers such as Perf

Did my loop get vectorised?
• GNU offers other options for checking:
• -fopt-info
• -O3 -fopt-info-missed=missed.all
• -O2 -ftree-vectorize -fopt-info-vec-missed
• -fopt-info-loop-optimized
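• A small file to try those flags on, assuming a command line such as gcc -O2 -ftree-vectorize -fopt-info-vec-missed vec.c -c (the file name vec.c and function scale are just examples):

/* GCC reports which loops were vectorised and why others were missed. */
void scale(int n, double *x, double s)
{
    for (int i = 0; i < n; i++)
        x[i] *= s;     /* expected to vectorise at -O2 with -ftree-vectorize */
}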

16. + 1-------<       do j = 1,N
17.   1                 x = xinit
18. + 1 r4----<         do i = 1,N
19.   1 r4                x = x + vexpr(i,j)
20.   1 r4                y(i) = y(i) + x
21.   1 r4---->         end do
22.   1------->       end do
ftn-6254 ftn: VECTOR File = bufpack.F90, Line = 16
A loop starting at line 16 was not vectorized because a recurrence was found on “y” at line 20.
ftn-6005 ftn: SCALAR File = bufpack.F90, Line = 18
A loop starting at line 18 was unrolled 4 times.
ftn-6254 ftn: VECTOR File = bufpack.F90, Line = 18
A loop starting at line 18 was not vectorized because a recurrence was found on “x” at line 19.

38. Vf------<         do i = 1,N
39. Vf                  x(i) = xinit
40. Vf------>         end do
41.
42. ir4-----<         do j = 1,N
43. ir4 if--<           do i = 1,N
44. ir4 if                x(i) = x(i) + vexpr(i,j)
45. ir4 if                y(i) = y(i) + x(i)
46. ir4 if-->           end do
47. ir4----->         end do
x promoted to vector: trade slightly more memory for better performance
ftn-6007 ftn: SCALAR File = bufpack.F90, Line = 42
A loop starting at line 42 was interchanged with the loop starting at line 43.
ftn-6004 ftn: SCALAR File = bufpack.F90, Line = 43
A loop starting at line 43 was fused with the loop starting at line 38.
ftn-6204 ftn: VECTOR File = bufpack.F90, Line = 38
A loop starting at line 38 was vectorized.
ftn-6208 ftn: VECTOR File = bufpack.F90, Line = 42
A loop starting at line 42 was vectorized as part of the loop starting at line 38.
ftn-6005 ftn: SCALAR File = bufpack.F90, Line = 42
A loop starting at line 42 was unrolled 4 times.
1.089 ms (-37%)

Vectorisation requirements
• Multi-stage process to vectorise
• Check vectorisation is safe
• Is the code independent
• Is data independent
• Check vectorisation worthwhile
• Is there enough work in the loop to be vectorised
• What is the cost of the extra instructions added
• Are there enough registers to stop pipeline stalls
• Vectorise
• Convert the code to vector functionality
• Add additional code to protect against different code paths/data sets
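• A sketch of that "additional code": the compiler typically strip-mines the loop into a vectorised main loop over full vector-width chunks plus a scalar remainder loop protecting trip counts that are not a multiple of the vector length. A hand-written equivalent, assuming 4 doubles per vector as for AVX (the function name daxpy is just an example):

void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    int nvec = n - (n % 4);            /* iterations handled 4 at a time */

    for (i = 0; i < nvec; i += 4)      /* vectorisable main loop         */
        for (int k = 0; k < 4; k++)
            y[i + k] += a * x[i + k];

    for (; i < n; i++)                 /* scalar remainder loop          */
        y[i] += a * x[i];
}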
