SEC204
Computer architectures and low level programming
Dr. Vasilios Kelefouras
Email: v.kelefouras@plymouth.ac.uk Website: https://www.plymouth.ac.uk/staff/vasilios-kelefouras
School of Computing (University of Plymouth)
Date
04/11/2019
Outline
Different ways of writing assembly code
Using intrinsic functions in C/C++
Writing C/C++ programs using Intel SSE intrinsics
Writing C/C++ programs using Intel AVX intrinsics
Different ways of writing assembly
1. Writing an entire function in assembly
2. Using inline assembly in C/C++
3. Using intrinsic functions in C/C++
highly recommended – much easier and safer
All the compilers support intrinsic functions
An intrinsic function is equivalent to an assembly instruction
Combines the strengths of C++ (development time, portability, maintainability, etc.) with the strength of assembly (execution time)
C and C++ are the most easily combined languages with assembly code
Different ways of writing assembly
Using intrinsic functions in C/C++
Main advantages
Classes, if conditions, loops and functions are very easy to implement
Portability to almost all x86 architectures
Compatibility with different compilers
Main disadvantages
Not all assembly instructions have intrinsic function equivalents
Unskilled use of intrinsic functions can make the code less efficient than simple C++ code
Using intrinsic functions in C/C++
For the rest of this lecture, you will be learning how to use intrinsic functions in C/C++
Normally, “90% of a program’s execution time is spent in executing 10% of the code” – loops
What programmers normally do to improve performance is to analyze the code and find the computationally intensive functions
Then optimize those instead of the whole program
This saves time and money
Rewriting loop kernels in C++ using SIMD intrinsics is an excellent choice
Compilers can auto-vectorize the code (though not always), but manually using SIMD intrinsics can really boost performance
Single Instruction Multiple Data (SIMD) – Vectorization
Vectorization on Arm Cortex series NEON technology
Arm Neon technology is an advanced SIMD architecture extension for the Arm Cortex-A series and Cortex-R52 processors
128-bit wide
They are widely used in embedded systems
Neon instructions allow up to:
16×8-bit, 8×16-bit, 4×32-bit, 2×64-bit integer operations
8×16-bit, 4×32-bit, 2×64-bit floating-point operations
Vectorization on Intel Processors
Intel MMX technology (old – limited usage nowadays)
8 mmx registers of 64 bit
extension of the floating point registers
can be handled as 8×8-bit, 4×16-bit, 2×32-bit or 1×64-bit operations
An entire L1 cache line is loaded to the RF in 1-3 cycles
Intel SSE technology
8/16 xmm registers of 128 bit (32-bit architectures support 8 registers only)
Can be handled from 16×8-bit to 1×128-bit operations
An entire L1 cache line is loaded to the RF in 1-3 cycles
Intel AVX technology
8/16 ymm registers of 256 bit (32-bit architectures support 8 registers only)
Can be handled from 32×8-bit to 1×256-bit operations
Intel AVX-512 technology
32 ZMM 512-bit registers
Vectorization on Intel Processors (3)
The developer can use either SSE or AVX or both
AVX instructions improve throughput
SSE instructions are preferred for less data parallel algorithms
Vector instructions work only on data stored in consecutive main memory addresses
Aligned load/store instructions are faster than the unaligned ones
Memory and arithmetic instructions can be executed in parallel
All the Intel intrinsics can be found here : https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
[Figure: the memory hierarchy – array elements A[0], A[1], … travel from Main memory to the L2 unified cache, then to the L1 data cache (the L1 instruction cache is separate) and finally to the RF; levels closer to the CPU are faster and smaller]

Basic SSE Instructions (1)

__m128 _mm_load_ps(float * p) – Loads four SP FP values. The address must be 16-byte-aligned
__m128 _mm_loadu_ps(float * p) – Loads four SP FP values. The address need not be 16-byte-aligned

[Figure: aligned vs misaligned loads in L1 – an aligned load reads its four words from a single cache line, while a misaligned load straddles two cache lines]
Basic SSE Instructions (2)

__m128 _mm_load_ps(float * p) – Loads four SP FP values. The address must be 16-byte-aligned
__m128 _mm_loadu_ps(float * p) – Loads four SP FP values. The address need not be 16-byte-aligned

float A[N] __attribute__((aligned(16))); – the array then starts at an address with Modulo(Address, 16) = 0, so the aligned load can be used

[Figure: an aligned load fetches A[0..3] from a single cache line in L1, while misaligned loads straddle two cache lines on the way from Main Memory]
Basic SSE Instructions (3)
void _mm_store_ps(float * p, __m128 a) – Stores four SP FP values. The address must be 16-byte-aligned
void _mm_storeu_ps(float * p, __m128 a) – Stores four SP FP values. The address need not be 16-byte-aligned
__m128 _mm_mul_ps(__m128 a, __m128 b) – Multiplies the four SP FP values of a and b
__m128 _mm_mul_ss(__m128 a, __m128 b) – Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.
XMM1=_mm_mul_ss(XMM1, XMM0)
XMM1=_mm_mul_ps(XMM1, XMM0)
Basic SSE Instructions (4)
__m128 _mm_unpackhi_ps (__m128 a, __m128 b) – Selects and interleaves the upper two SP FP values from a and b.
__m128 _mm_unpacklo_ps (__m128 a, __m128 b) – Selects and interleaves the lower two SP FP values from a and b.
XMM0=_mm_unpacklo_ps(XMM0, XMM1)
XMM0=_mm_unpackhi_ps(XMM0, XMM1)
Basic SSE Instructions (5)
__m128 _mm_hadd_ps (__m128 a, __m128 b) – Adds adjacent vector elements
void _mm_store_ss (float * p, __m128 a) – Stores the lower SP FP value
float A[N][N]; float X[N], Y[N]; int i, j;
__m128 num0, num1, num2, num3;

for (i = 0; i < N; i++) {
    num3 = _mm_setzero_ps();
    for (j = 0; j < N; j += 4) {
        num0 = _mm_loadu_ps(&A[i][j]);  /* four elements of row i of A */
        num1 = _mm_loadu_ps(&X[j]);     /* four elements of X */
        num2 = _mm_mul_ps(num0, num1);
        num3 = _mm_add_ps(num3, num2);  /* accumulate partial sums */
    }
    num3 = _mm_hadd_ps(num3, num3);
    num3 = _mm_hadd_ps(num3, num3);
    _mm_store_ss(&Y[i], num3);          /* Y[i] = dot product of row i and X */
}

after the 1st hadd -> num3=[ya+yb, yc+yd, ya+yb, yc+yd]
after the 2nd hadd -> num3=[ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd]
[Figure: matrix-vector multiplication Y = A × X – each element yi of Y is the dot product of row i of the NxN matrix A with the vector X; num0 holds four elements of a row of A, num1 four elements of X, and num3 accumulates the result]
float A[N][N]; float X[N], Y[N]; int i,j;
for (i=0; i