This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
What is Vectorization/SIMD and Why do We Care?
Performance!
Many hardware architectures today, both CPU and GPU, allow you to perform arithmetic operations on multiple array elements simultaneously.
(Thus the label, “Single Instruction Multiple Data”.)
We care about this because many problems, especially scientific and engineering, can be cast this way. Examples include convolution, Fourier transform, power spectrum, autocorrelation, etc.
[Figure: Sine and Cosine values = Fourier products]
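As one illustration of how such a problem can be cast as array multiplication, here is a minimal sketch of autocorrelation in C. The function name Autocorrelate and its parameters are hypothetical and do not come from these notes. For each shift, the work is an element-by-element multiply followed by a sum.

```c
// Hypothetical helper (not from the slides): for each shift s,
//     sums[s] = a[0]*a[s] + a[1]*a[s+1] + ... + a[len-1-s]*a[len-1]
// The inner loop is an array*array multiply followed by a sum,
// which is the shape of problem that SIMD hardware handles well.
void
Autocorrelate( const float *a, int len, float *sums, int maxShift ) {
    for( int s = 0; s < maxShift; s++ ) {
        float sum = 0.f;
        for( int i = 0; i < len - s; i++ )
            sum += a[i] * a[i+s];
        sums[s] = sum;
    }
}
```

For a = { 1, 2, 3, 4 } and three shifts, this produces the sums 30, 20, and 11.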
Computer Graphics
mjb – March 15, 2022
Vector Processing
(aka, Single Instruction Multiple Data, or SIMD)
simd.vector.pptx
SIMD in Intel Chips
[Table: SIMD in Intel chips. Columns: Year Released, Width (bits), Width (FP words).]
Xeon Phi. Note: one complete cache line! Note: also a 4×4 transformation matrix!
If you care:
• MMX stands for “MultiMedia Extensions”
• SSE stands for “Streaming SIMD Extensions”
• AVX stands for “Advanced Vector Extensions”
Intel SSE
Intel and AMD CPU architectures support vectorization. The best-known form is called Streaming SIMD Extensions, or SSE. It allows four floating-point operations to happen simultaneously.
Normally a scalar floating point multiplication instruction happens like this:
mulss r1, r0
“ATT form”: mulss src, dst
Intel SSE
The SSE version of the multiplication instruction happens like this:
mulps xmm1, xmm0
“ATT form”: mulps src, dst
[Figure: each of the four floats in xmm0 is multiplied by the corresponding float in xmm1.]
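To make the four-at-a-time multiply concrete, here is a minimal sketch in C using the SSE intrinsics from <xmmintrin.h>. It assumes an x86 compiler with SSE support. The helper name Mul4 is hypothetical. On such a compiler, _mm_mul_ps typically compiles to a single mulps instruction.

```c
#include <xmmintrin.h>   // SSE intrinsics (assumes an x86 target with SSE)

// Hypothetical helper: multiply four floats by four floats
// using one SSE multiply instead of four scalar multiplies.
void
Mul4( const float *a, const float *b, float *c ) {
    __m128 va = _mm_loadu_ps( a );      // load a[0..3], no alignment required
    __m128 vb = _mm_loadu_ps( b );      // load b[0..3]
    __m128 vc = _mm_mul_ps( va, vb );   // four multiplies at once
    _mm_storeu_ps( c, vc );             // store the four products into c[0..3]
}
```

Each __m128 value holds four 32-bit floats, which matches the 128-bit width of the xmm registers.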
SIMD Multiplication
Array * Array
void
SimdMul( float *a, float *b, float *c, int len ) {
    c[0:len] = a[0:len] * b[0:len];
}
Note that the construct:
a[ 0 : ArraySize ]
is meant to be read as:
“The set of elements in the array a starting at index 0 and going for ArraySize elements”,
not as:
“The set of elements in the array a starting at index 0 and going through index ArraySize”.
void
SimdMul( float *a, float *b, float *c, int len ) {
    #pragma omp simd
    for( int i = 0; i < len; i++ )
        c[i] = a[i]*b[i];
}
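The start-and-count reading of the array notation can be sketched in plain C as follows. The notation itself is a compiler extension rather than standard C, so this is only an equivalent loop. The helper name SectionMul and its start and count parameters are hypothetical.

```c
// Hypothetical helper: plain-C equivalent of
//     c[start:count] = a[start:count] * b[start:count];
// The section covers 'count' elements, that is,
// indices start through start+count-1 (not start through count).
void
SectionMul( const float *a, const float *b, float *c, int start, int count ) {
    for( int i = start; i < start + count; i++ )
        c[i] = a[i] * b[i];
}
```

With start = 2 and count = 3, only elements 2, 3, and 4 of c are written.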
SIMD Multiplication
Array * Scalar
void
SimdMul( float *a, float b, float *c, int len ) {
    c[0:len] = a[0:len] * b;
}

void
SimdMul( float *a, float b, float *c, int len ) {
    #pragma omp simd
    for( int i = 0; i < len; i++ )
        c[i] = a[i]*b;
}
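For comparison, the array-times-scalar case could also be written by hand with SSE intrinsics. This is a sketch, assuming an x86 compiler that provides <xmmintrin.h>. The function name SimdMulScalar is hypothetical. The scalar is copied into all four lanes of a register once, before the loop, with _mm_set1_ps.

```c
#include <xmmintrin.h>   // SSE intrinsics (assumes an x86 target with SSE)

#define SSE_WIDTH 4      // four floats fit in one 128-bit SSE register

// Hypothetical intrinsics version of Array * Scalar.
void
SimdMulScalar( const float *a, float b, float *c, int len ) {
    int limit = ( len/SSE_WIDTH ) * SSE_WIDTH;   // largest multiple of 4 that is <= len
    __m128 vb = _mm_set1_ps( b );                // b copied into all four lanes
    for( int i = 0; i < limit; i += SSE_WIDTH )
        _mm_storeu_ps( &c[i], _mm_mul_ps( _mm_loadu_ps( &a[i] ), vb ) );

    // multiply any leftover elements one at a time:
    for( int i = limit; i < len; i++ )
        c[i] = a[i] * b;
}
```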
Array*Array Multiplication Speed
[Figure: Array*Array multiplication speed versus Array Size (M).]
Array*Array Multiplication Speedup
[Figure: Array*Array multiplication speedup versus Array Size (M).]
You would think it would always be 4.0 ± noise effects, but it’s not. Why?
SIMD in OpenMP 4.0
#pragma omp simd
for( int i = 0; i < ArraySize; i++ ) {
    c[ i ] = a[ i ] * b[ i ];
}
Requirements for a For-Loop to be Vectorized
• If there are nested loops, the one to vectorize must be the inner one.
• There can be no jumps or branches. “Masked assignments” (an if-statement-controlled assignment) are OK, e.g.,
if( A[ i ] > 0. )
B[ i ] = 1.;
• The total number of iterations must be known at runtime when the loop starts.
• There can be no data dependencies between loop iterations (“inter-loop” dependencies) such as:
a[ i ] = a[ i-1 ] + 1.;
a[100] = a[99] + 1.;    // 101st element = 100th element + 1.: this crosses an SSE boundary, so it is OK
a[101] = a[100] + 1.;   // 102nd element = 101st element + 1.: this is within one SSE operation, so it is not OK
• It helps performance if the elements have contiguous memory addresses.
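The requirements above can be illustrated with two small loops. This is a sketch. The function names MaskedSet and RunningCount are hypothetical. The first loop has a masked assignment and independent iterations, so it is a candidate for vectorization. The second has a dependency between iterations, so it is left as an ordinary scalar loop.

```c
// Candidate for vectorization: single loop, iteration count known when
// the loop starts, contiguous data, and only a masked assignment inside.
void
MaskedSet( const float *A, float *B, int len ) {
    #pragma omp simd
    for( int i = 0; i < len; i++ ) {
        if( A[i] > 0. )
            B[i] = 1.;
    }
}

// Not a candidate as written: each iteration reads the value that the
// previous iteration wrote, so the iterations cannot run side by side.
void
RunningCount( float *a, int len ) {
    for( int i = 1; i < len; i++ )
        a[i] = a[i-1] + 1.;
}
```

If the compiler is not run with OpenMP enabled, the pragma is ignored and both loops still give the same results.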
Prefetching
Prefetching is used to place a cache line in cache before it is to be used, thus hiding the latency of fetching from off-chip memory.
There are two key issues here:
1. Issuing the prefetch at the right time
2. Issuing the prefetch at the right distance
The right time:
If the prefetch is issued too late, then the memory values won’t be back when the program wants to use them, and the processor has to wait anyway.
If the prefetch is issued too early, then there is a chance that the prefetched values could be evicted from cache by another need before they can be used.
The right distance:
The “prefetch distance” is how far the memory being prefetched is ahead of the memory we are using right now.
Too far, and the values sit in cache for too long, and possibly get evicted.
Too near, and the program is ready for the values before they have arrived.
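One way these two issues might look in code is sketched below, using the _mm_prefetch intrinsic from <xmmintrin.h> on an x86 compiler. The function name MulWithPrefetch is hypothetical. The distance of 64 floats and the 16-floats-per-cache-line figure are assumptions chosen for illustration, not measured values. The right distance has to be found by timing on the actual machine.

```c
#include <xmmintrin.h>   // _mm_prefetch and _MM_HINT_T0 (assumes an x86 target)

#define FLOATS_PER_LINE     16   // assumption: 64-byte cache lines, 4-byte floats
#define PREFETCH_DISTANCE   64   // assumption: how many floats ahead to prefetch

// Hypothetical array*array multiply that asks for data ahead of its use.
void
MulWithPrefetch( const float *a, const float *b, float *c, int len ) {
    for( int i = 0; i < len; i++ ) {
        // about once per cache line, request the data needed
        // PREFETCH_DISTANCE elements from now (only if still inside the arrays):
        if( ( i % FLOATS_PER_LINE ) == 0  &&  i + PREFETCH_DISTANCE < len ) {
            _mm_prefetch( (const char *)&a[ i + PREFETCH_DISTANCE ], _MM_HINT_T0 );
            _mm_prefetch( (const char *)&b[ i + PREFETCH_DISTANCE ], _MM_HINT_T0 );
        }
        c[i] = a[i] * b[i];
    }
}
```

The prefetch is only a hint, so the results are the same with or without it. Only the timing can change.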
Speedup of SIMD over Non-SIMD
[Figure: Speed (MFLOPS).]
for( int i = 0; i …
SimdMul using Intel Intrinsics
#include <xmmintrin.h>
#define SSE_WIDTH 4
void
SimdMul( float *a, float *b, float *c, int len ) {
    int limit = ( len/SSE_WIDTH ) * SSE_WIDTH;
    register float *pa = a;
    register float *pb = b;
    register float *pc = c;
    for( int i = 0; i < limit; i += SSE_WIDTH ) {
        _mm_storeu_ps( pc, _mm_mul_ps( _mm_loadu_ps( pa ), _mm_loadu_ps( pb ) ) );
        pa += SSE_WIDTH;
        pb += SSE_WIDTH;
        pc += SSE_WIDTH;
    }
    // multiply any leftover elements one at a time:
    for( int i = limit; i < len; i++ ) {
        c[i] = a[i] * b[i];
    }
}
SimdMulSum using Intel Intrinsics
float
SimdMulSum( float *a, float *b, int len ) {
    float sum[4] = { 0., 0., 0., 0. };
    int limit = ( len/SSE_WIDTH ) * SSE_WIDTH;
    register float *pa = a;
    register float *pb = b;
    __m128 ss = _mm_loadu_ps( &sum[0] );
    for( int i = 0; i < limit; i += SSE_WIDTH ) {
        ss = _mm_add_ps( ss, _mm_mul_ps( _mm_loadu_ps( pa ), _mm_loadu_ps( pb ) ) );
        pa += SSE_WIDTH;
        pb += SSE_WIDTH;
    }
    _mm_storeu_ps( &sum[0], ss );
    // add in any leftover elements one at a time:
    for( int i = limit; i < len; i++ ) {
        sum[0] += a[ i ] * b[ i ];
    }
    return sum[0] + sum[1] + sum[2] + sum[3];
}
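For comparison, the same multiply-and-sum could be requested through OpenMP instead of intrinsics. This is a sketch, assuming a compiler that supports the OpenMP 4.0 simd construct and its reduction clause. The function name OmpSimdMulSum is hypothetical. The reduction clause asks the compiler to keep per-lane partial sums and combine them at the end, which is what the intrinsics version does by hand with its four-element sum array.

```c
// Hypothetical OpenMP counterpart to the intrinsics-based SimdMulSum.
float
OmpSimdMulSum( const float *a, const float *b, int len ) {
    float sum = 0.f;
    // reduction(+:sum) lets each SIMD lane accumulate its own partial sum;
    // the partial sums are added together when the loop finishes.
    #pragma omp simd reduction(+:sum)
    for( int i = 0; i < len; i++ )
        sum += a[i] * b[i];
    return sum;
}
```

Because the partial sums can be added in a different order than in a scalar loop, the floating-point result may differ slightly from the scalar one for general data.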
Intel Intrinsics
[Figure: SpeedUp versus Array Size.]
Why do the Intrinsics do so well with a small dataset size?
It’s not due to the code in the inner-loop:
for( int i = 0; i < len; i++ ) {
    c[ i ] = a[ i ] * b[ i ];
}
Inner-loop assembly for the for-loop and for the Intrinsics:
    movups (%r8), %xmm0
    movups (%rcx), %xmm1
    mulps %xmm1, %xmm0
    movups %xmm0, (%rdx)
    addq $16, %r8
    addq $16, %rcx
    addq $16, %rdx
    addl $4, -4(%rbp)

    movups (%r10), %xmm0
    movups (%r9), %xmm1
    mulps %xmm1, %xmm0
    movups %xmm0, (%r11)
    addq $16, %r9
    addq $16, %r10
    addq $16, %r11
    addl $4, %r8d
It’s actually due to the setup time. The intrinsics have a tighter coupling to the setting up of the registers. A smaller setup time makes the small dataset size speedup look better.
A preview of things to come: OpenCL and CUDA have SIMD Data Types
When we get to OpenCL, we could compute projectile physics like this:
float4 pp; // p'
pp.x = p.x + v.x*DT;
pp.y = p.y + v.y*DT + .5*DT*DT*G.y;
pp.z = p.z + v.z*DT;
But, instead, we will do it like this:
float4 pp = p + v*DT + .5*DT*DT*G; // p'
We do it this way for two reasons:
1. Convenience and clean coding
2. Some hardware can do multiple arithmetic operations simultaneously
A preview of things to come: OpenCL and CUDA have SIMD Data Types
The whole thing will look like this:
constant float4 G = (float4) ( 0., -9.8, 0., 0. );
constant float DT = 0.1;
kernel
void
Particle( global float4 * dPobj, global float4 * dVel, global float4 * dCobj ) {
    int gid = get_global_id( 0 );          // particle #
    float4 p = dPobj[gid];                 // particle #gid's position
    float4 v = dVel[gid];                  // particle #gid's velocity
    float4 pp = p + v*DT + .5*DT*DT*G;     // p'
    float4 vp = v + G*DT;                  // v'
    dPobj[gid] = pp;
    dVel[gid] = vp;
}
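The arithmetic in that kernel can be tried out on the CPU with a plain-C stand-in for float4. This is only an emulation sketch. The struct and the helper names Add, Scale, and ParticleStep are hypothetical and are not part of OpenCL. Standard C has no built-in float4 with overloaded operators, so each operator becomes a small function.

```c
// Stand-in for OpenCL's float4 (hypothetical, for host-side experiments only).
typedef struct { float x, y, z, w; } float4;

static const float4 G  = { 0.f, -9.8f, 0.f, 0.f };   // gravity
static const float  DT = 0.1f;                        // time step

// component-by-component add:
static float4 Add( float4 a, float4 b ) {
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

// multiply every component by the same scalar:
static float4 Scale( float4 a, float s ) {
    float4 r = { a.x * s, a.y * s, a.z * s, a.w * s };
    return r;
}

// One time step for one particle:  p' = p + v*DT + .5*DT*DT*G,  v' = v + G*DT
void
ParticleStep( float4 *p, float4 *v ) {
    float4 pp = Add( Add( *p, Scale( *v, DT ) ), Scale( G, .5f*DT*DT ) );   // p'
    float4 vp = Add( *v, Scale( G, DT ) );                                  // v'
    *p = pp;
    *v = vp;
}
```

Each call to Add or Scale touches all four components, which is the work that float4 hardware can do in one operation.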
• SIMD is an important way to achieve speed-ups on a CPU
• For now, you might have to write in assembly language or use Intel intrinsics to get to all of it
• I suspect that #pragma omp simd will eventually catch up
• Prefetching can really help SIMD