
SEC204
Computer architectures and low level programming
Dr. Vasilios Kelefouras
Email: v.kelefouras@plymouth.ac.uk Website: https://www.plymouth.ac.uk/staff/vasilios- kelefouras
School of Computing (University of Plymouth)
Date
04/11/2019

Outline
 Different ways of writing assembly code
 Using intrinsic functions in C/C++
 Writing C/C++ programs using Intel SSE intrinsics
 Writing C/C++ programs using Intel AVX intrinsics

Different ways of writing assembly
1. Writing an entire function in assembly
2. Using inline assembly in C/C++
3. Using intrinsic functions in C/C++
 highly recommended – much easier and safer
 All the compilers support intrinsic functions
 An intrinsic function is equivalent to an assembly instruction
 Mixes the good things of C++ (development time, portability, maintainability etc) with the good things of assembly (execution time)
C and C++ are the languages most easily combined with assembly code
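As a minimal sketch of the idea (the function name `add4` is illustrative, not from the slides): the SSE intrinsic `_mm_add_ps()` is equivalent to the single `addps` assembly instruction, but is called like an ordinary C function.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds four packed single-precision floats; the _mm_add_ps() call
   compiles to a single addps instruction. */
void add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);            /* load a[0..3] */
    __m128 vb = _mm_loadu_ps(b);            /* load b[0..3] */
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* c[i] = a[i] + b[i] */
}
```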

Different ways of writing assembly
Using intrinsic functions in C/C++
 Main advantages
 Classes, if conditions, loops and functions are very easy to implement
 Portability to almost all x86 architectures
 Compatibility with different compilers
 Main disadvantages
 Not all assembly instructions have intrinsic function equivalents
 Unskilled use of intrinsic functions can make the code less efficient than simple C++ code

Using intrinsic functions in C/C++
 For the rest of this lecture, you will be learning how to use intrinsic functions in C/C++
 Normally, “90% of a program’s execution time is spent in executing 10% of the code” – loops
 What programmers normally do to improve performance is to analyze the code and find the computationally intensive functions
 Then optimize those instead of the whole program
 This saves time and money
 Rewriting loop kernels in C++ using SIMD intrinsics is an excellent choice
 Compilers vectorize the code (but not always); manually using SIMD intrinsics can really boost performance
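A minimal sketch of such a rewrite (function names are illustrative, and `n` is assumed to be a multiple of 4): the scalar loop below processes one float per iteration, while the SSE version processes four.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar loop kernel: c[i] = a[i] * b[i] */
void mul_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* SSE loop kernel: 4 multiplications per iteration
   (n is assumed to be a multiple of 4) */
void mul_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_mul_ps(va, vb));
    }
}
```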

Single Instruction Multiple Data (SIMD) – Vectorization

Vectorization on Arm Cortex series NEON technology
 Arm Neon technology is an advanced SIMD architecture extension for the Arm Cortex-A series and Cortex-R52 processors
 128-bit wide
 They are widely used in embedded systems
 Neon instructions allow up to:
 16×8-bit, 8×16-bit, 4×32-bit, 2×64-bit integer operations
 8×16-bit, 4×32-bit, 2×64-bit floating-point operations

Vectorization on Intel Processors
 Intel MMX technology (old – limited usage nowadays)
 8 mmx registers of 64 bit
 extension of the floating point registers
 can be handled as 8 8-bit, 4 16-bit, 2 32-bit and 1 64-bit operations
 An entire L1 cache line is loaded to the RF in 1-3 cycles
 Intel SSE technology
 8/16 xmm registers of 128 bit (32-bit architectures support 8 registers only)
 Can be handled as anything from 16 8-bit to 1 128-bit operations
 An entire L1 cache line is loaded to the RF in 1-3 cycles
 Intel AVX technology
 8/16 ymm registers of 256 bit (32-bit architectures support 8 registers only)
 Can be handled as anything from 32 8-bit to 1 256-bit operations
 Intel AVX-512 technology
 32 ZMM 512-bit registers

Vectorization on Intel Processors (2)

Vectorization on Intel Processors (3)
 The developer can use either SSE or AVX or both
 AVX instructions improve throughput
 SSE instructions are preferred for algorithms with less data parallelism
 Vector instructions work only on data stored in consecutive main memory addresses
 Aligned load/store instructions are faster than unaligned ones
 Memory and arithmetic instructions are executed in parallel
 All the Intel intrinsics can be found here : https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

Basic SSE Instructions (1)
 __m128 _mm_load_ps (float * p) – Loads four SP FP values. The address must be 16-byte-aligned
 __m128 _mm_loadu_ps (float * p) – Loads four SP FP values. The address need not be 16-byte-aligned
[Figure: memory hierarchy – main memory, L2 unified cache, L1 data and instruction caches, and the register file (RF), which is faster and smaller. An aligned load of A[0..7] fetches one cache line, whereas a misaligned load of A[1..8] straddles two cache lines and is slower]

Basic SSE Instructions (2)
 __m128 _mm_load_ps(float * p) – Loads four SP FP values. The address must be 16-byte-aligned
 __m128 _mm_loadu_ps(float * p) – Loads four SP FP values. The address need not be 16-byte-aligned
 A 16-byte-aligned array satisfies Modulo(Address, 16) = 0 and can be declared as:
float A[N] __attribute__((aligned(16)));
[Figure: an aligned load of A[0..3] from main memory into L1 and the RF, contrasted with misaligned loads that straddle cache-line boundaries]
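A minimal sketch putting the declaration above to use (the function name is illustrative; `__attribute__((aligned(16)))` is the GCC/Clang syntax shown on the slide):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* 16-byte-aligned array: its address is a multiple of 16
   (Modulo(Address, 16) == 0), so the faster _mm_load_ps can be used. */
float A[8] __attribute__((aligned(16)));

/* Sums A[0..3] after an aligned load. */
float sum_first_four(void) {
    __m128 v = _mm_load_ps(A);  /* aligned load of A[0..3] */
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```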

Basic SSE Instructions (3)
 void _mm_store_ps(float * p, __m128 a) – Stores four SP FP values. The address must be 16-byte-aligned
 void _mm_storeu_ps(float * p, __m128 a) – Stores four SP FP values. The address need not be 16-byte-aligned
 __m128 _mm_mul_ps(__m128 a, __m128 b) – Multiplies the four SP FP values of a and b
 __m128 _mm_mul_ss(__m128 a, __m128 b) – Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.
XMM1 = _mm_mul_ss(XMM1, XMM0)
XMM1 = _mm_mul_ps(XMM1, XMM0)
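A small sketch contrasting the two multiplies (function name illustrative): `_mm_mul_ps` multiplies all four lanes, while `_mm_mul_ss` multiplies only the lowest lane and passes the upper three through from the first operand.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Writes both results so the lane behaviour can be inspected. */
void mul_demo(const float *a, const float *b, float *ps, float *ss) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(ps, _mm_mul_ps(va, vb)); /* [a0*b0, a1*b1, a2*b2, a3*b3] */
    _mm_storeu_ps(ss, _mm_mul_ss(va, vb)); /* [a0*b0, a1,    a2,    a3   ] */
}
```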

Basic SSE Instructions (4)
 __m128 _mm_unpackhi_ps (__m128 a, __m128 b) – Selects and interleaves the upper two SP FP values from a and b.
 __m128 _mm_unpacklo_ps (__m128 a, __m128 b) – Selects and interleaves the lower two SP FP values from a and b.
XMM0 = _mm_unpacklo_ps(XMM0, XMM1)
XMM0 = _mm_unpackhi_ps(XMM0, XMM1)
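A minimal sketch of the interleaving (function name illustrative): for a = [a0, a1, a2, a3] and b = [b0, b1, b2, b3], `unpacklo` yields [a0, b0, a1, b1] and `unpackhi` yields [a2, b2, a3, b3].

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Stores the interleaved lower and upper halves of a and b. */
void unpack_demo(const float *a, const float *b, float *lo, float *hi) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(lo, _mm_unpacklo_ps(va, vb)); /* [a0, b0, a1, b1] */
    _mm_storeu_ps(hi, _mm_unpackhi_ps(va, vb)); /* [a2, b2, a3, b3] */
}
```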

Basic SSE Instructions (5)
 __m128 _mm_hadd_ps (__m128 a, __m128 b) – Adds adjacent vector elements
 void _mm_store_ss (float * p, __m128 a) – Stores the lower SP FP value
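A sketch of the horizontal-sum pattern these two intrinsics enable. Note that `_mm_hadd_ps` requires SSE3 (`#include <pmmintrin.h>`); this sketch performs the same two-step reduction with SSE1 shuffles so it compiles on any x86-64 target.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of [a0, a1, a2, a3]. Two _mm_hadd_ps calls (SSE3)
   would do the same reduction.
   After step 1: [a0+a1, a0+a1, a2+a3, a2+a3]
   After step 2: every lane holds a0+a1+a2+a3 */
float hsum_ps(const float *a) {
    __m128 v    = _mm_loadu_ps(a);
    __m128 swap = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* [a1,a0,a3,a2] */
    __m128 sums = _mm_add_ps(v, swap);
    swap = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));  /* swap halves */
    sums = _mm_add_ps(sums, swap);
    float r;
    _mm_store_ss(&r, sums);  /* store only the lowest lane */
    return r;
}
```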

float A[N][N]; float X[N], Y[N]; int i, j;
for (i = 0; i < N; i++) { … }
after the 1st hadd -> num3 = [ya+yb, yc+yd, ya+yb, yc+yd]
after the 2nd hadd -> num3 = [ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd]
[Figure: matrix-vector multiplication Y = A × X, with A an N×N matrix (elements a00..aNN), X = [x0..xN] and Y = [y0..yN]; num0 holds four elements of a row of A, num1 four elements of X, and num3 accumulates the dot product for one yi]
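The loop body on the slide is elided above; a hedged sketch of the matrix-vector kernel it describes follows (`N` is an illustrative size, assumed to be a multiple of 4; the slide reduces the partial sums with two `_mm_hadd_ps` calls, while SSE1 shuffles are used here so the sketch compiles on any x86-64 target).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

#define N 8  /* illustrative size, assumed to be a multiple of 4 */

/* Y = A * X one row at a time: multiply 4 element pairs per step,
   accumulate, then reduce the 4 partial sums of the row to one float. */
void matvec(float A[N][N], const float *X, float *Y) {
    for (int i = 0; i < N; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int j = 0; j < N; j += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(&A[i][j]),
                                             _mm_loadu_ps(&X[j])));
        /* horizontal sum: after the 1st step each half holds a pair sum,
           after the 2nd every lane holds the full dot product */
        __m128 t = _mm_add_ps(acc, _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1)));
        t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
        _mm_store_ss(&Y[i], t);  /* store only the lowest lane */
    }
}
```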
float A[N][N]; float X[N], Y[N]; int i, j;
for (i = 0; i < N; i++) {
    if (x[i] > 2 || x[i] < -2) a[i] += x[i];
}
[Figure: SSE masking of the condition – x = [5, -3, 0, 1] is compared against [2, 2, 2, 2] and [-2, -2, -2, -2], giving masks [1, 0, 0, 0] and [0, 1, 0, 0]; their OR [1, 1, 0, 0] selects the lanes to update: a[i]+x[i], a[i+1]+x[i+1], a[i+2]+0, a[i+3]+0]
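A hedged sketch of how the masking in the figure can be written with SSE intrinsics (function name illustrative; `n` assumed a multiple of 4): the compares produce all-ones lanes where the condition holds, and ANDing that mask with x zeroes out the lanes that must not be added.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Vectorized form of: if (x[i] > 2 || x[i] < -2) a[i] += x[i]; */
void cond_add(float *a, const float *x, int n) {
    const __m128 upper = _mm_set1_ps(2.0f);   /* [ 2,  2,  2,  2] */
    const __m128 lower = _mm_set1_ps(-2.0f);  /* [-2, -2, -2, -2] */
    for (int i = 0; i < n; i += 4) {
        __m128 vx   = _mm_loadu_ps(&x[i]);
        __m128 mask = _mm_or_ps(_mm_cmpgt_ps(vx, upper),   /* x[i] >  2 */
                                _mm_cmplt_ps(vx, lower));  /* x[i] < -2 */
        __m128 add  = _mm_and_ps(mask, vx);  /* x[i] where true, 0 elsewhere */
        _mm_storeu_ps(&a[i], _mm_add_ps(_mm_loadu_ps(&a[i]), add));
    }
}
```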