CS4961 Parallel Programming
Lecture 10: Introduction to SIMD
Mary Hall
September 27, 2011
Today’s Lecture
Conceptual understanding of SIMD
Practical understanding of SIMD in the context of multimedia extensions
Slide sources:
Sam Larsen, PLDI 2000, http://people.csail.mit.edu/slarsen/
Jaewook Shin, my former PhD student
SIMD and MIMD Architectures: What's the Difference?
Slide source: Grama et al., Introduction to Parallel Computing, http://www.users.cs.umn.edu/~karypis/parbook
Overview of SIMD Programming
Vector architectures
Early examples of SIMD supercomputers
Multimedia extensions such as SSE and AltiVec (today's focus, mostly)
Graphics and games processors (CUDA, stay tuned)
Accelerators (e.g., ClearSpeed)
Is there a dominant SIMD programming model?
Unfortunately, NO!!!
Why not?
Vector architectures were programmed by scientists
Multimedia extension architectures are programmed by systems programmers (almost assembly language!)
GPUs are programmed by games developers (domain-specific libraries)
Scalar vs. SIMD in Multimedia Extensions
Slide source: Sam Larsen
Multimedia Extension Architectures
At the core of multimedia extensions
SIMD parallelism
Variable-sized data fields:
Vector length = register width / type size (e.g., a 128-bit register holds four 32-bit floats, eight 16-bit shorts, or sixteen 8-bit bytes)
Slide source: Jaewook Shin
Multimedia / Scientific Applications
Image
Graphics: 3D games, movies
Image recognition
Video encoding/decoding : JPEG, MPEG4
Sound
Encoding/decoding: IP phone, MP3
Speech recognition
Digital signal processing: Cell phones
Scientific applications
Array-based data-parallel computation, another level of parallelism
Characteristics of Multimedia Applications
Regular data access pattern
Data items are contiguous in memory
Short data types
8, 16, 32 bits
Data streaming through a series of processing stages
Some temporal reuse for such data streams
Sometimes …
Many constants
Short iteration counts
Requires saturation arithmetic
Why SIMD?
+More parallelism
+When parallelism is abundant
+SIMD in addition to ILP
+Simple design
+Replicated functional units
+Small die area
+No heavily ported register files
+Die area: MAX-2 (HP): 0.1%, VIS (Sun): 3.0%
-Must be explicitly exposed to the hardware
-By the compiler or by the programmer
Programming Multimedia Extensions
Language extension
Programming interface similar to function call
C: built-in functions, Fortran: intrinsics
Most native compilers support their own multimedia extensions
GCC: -faltivec, -msse2
AltiVec: dst = vec_add(src1, src2);
SSE2: dst = _mm_add_ps(src1, src2); (Visual Studio)
BG/L: dst = __fpadd(src1, src2);
No standard!
Need automatic compilation
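For concreteness, here is a minimal sketch of this style in C with SSE intrinsics (the function and array names are hypothetical; compile with gcc -msse and include xmmintrin.h):
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Add two float arrays four elements at a time.
   _mm_loadu_ps/_mm_storeu_ps tolerate unaligned addresses. */
void vec_add(float *dst, const float *src1, const float *src2, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(&src1[i]);
        __m128 b = _mm_loadu_ps(&src2[i]);
        _mm_storeu_ps(&dst[i], _mm_add_ps(a, b));
    }
    /* a scalar cleanup loop would handle the remaining n % 4 elements */
}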
Rest of Lecture
Understanding multimedia execution
Understanding the overheads
Understanding how to write code to deal with the overheads
Note:
A few people write very low-level code to access this capability.
Today the compilers do a pretty good job, at least on relatively simple codes.
Exploiting SLP (Superword Level Parallelism) with SIMD Execution
Benefit:
Multiple ALU ops → one SIMD op
Multiple ld/st ops → one wide memory op
What are the overheads?
Packing and unpacking:
rearrange data so that it is contiguous
Alignment overhead
Accessing data from the memory system so that it is aligned to a “superword” boundary
Control flow
Control flow may require executing all paths
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835
Slide source: Sam Larsen
becomes one SIMD multiply and one SIMD add:
[R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
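A sketch of how these three independent updates could map onto SSE (names such as update_rgb are hypothetical); note the explicit packing and unpacking around the single SIMD multiply-add:
#include <xmmintrin.h>

void update_rgb(float *r, float *g, float *b,
                float xr, float xg, float xb)
{
    /* pack the scalars; the fourth lane is unused padding */
    __m128 rgb   = _mm_set_ps(0.0f, *b, *g, *r);
    __m128 x     = _mm_set_ps(0.0f, xb, xg, xr);
    __m128 coeff = _mm_set_ps(0.0f, 1.29835f, 1.89234f, 1.08327f);
    rgb = _mm_add_ps(rgb, _mm_mul_ps(x, coeff));  /* one mul + one add */
    float out[4];
    _mm_storeu_ps(out, rgb);                      /* unpack the results */
    *r = out[0]; *g = out[1]; *b = out[2];
}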
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
Slide source: Sam Larsen
becomes one wide load feeding one SIMD add:
[R G B] = [R G B] + X[i:i+2]
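A sketch under the assumption that rgb[] holds four floats (three values plus padding) and X has at least i+4 elements, so reading the extra fourth lane is safe:
#include <xmmintrin.h>

/* X[i], X[i+1], X[i+2] are contiguous, so one wide load
   replaces three scalar loads (the fourth lane is padding). */
void update_rgb_adj(float rgb[4], const float *X, int i)
{
    __m128 v   = _mm_loadu_ps(&X[i]);   /* loads X[i..i+3] in one op */
    __m128 acc = _mm_loadu_ps(rgb);     /* R, G, B, pad */
    _mm_storeu_ps(rgb, _mm_add_ps(acc, v));
}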
3. Vectorizable Loops
for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]
Slide source: Sam Larsen
3. Vectorizable Loops
for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]
Slide source: Sam Larsen
becomes
for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
Nuts and Bolts
What does a piece of code really look like?
for (i=0; i<100; i+=4) {
  __m128 btmp = _mm_load_ps(&B[i]);
  __m128 ctmp = _mm_load_ps(&C[i]);
  __m128 atmp = _mm_add_ps(btmp, ctmp);
  _mm_store_ps(&A[i], atmp);
}
which implements the vector operation
for (i=0; i<100; i+=4)
  A[i:i+3] = B[i:i+3] + C[i:i+3]
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
Slide source: Sam Larsen
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
Slide source: Sam Larsen
becomes
for (i=0; i<16; i+=2)
  [L0 L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
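One way this could look in SSE, unrolled by four to fill a 4-lane register (the function name sad16 is hypothetical): the subtraction vectorizes, while abs() and the accumulation into D remain scalar and force an unpack each iteration:
#include <xmmintrin.h>
#include <math.h>

float sad16(const float *A, const float *B)
{
    float D = 0.0f, tmp[4];
    for (int i = 0; i < 16; i += 4) {
        __m128 l = _mm_sub_ps(_mm_loadu_ps(&A[i]),
                              _mm_loadu_ps(&B[i]));   /* vector part */
        _mm_storeu_ps(tmp, l);                        /* unpack      */
        for (int j = 0; j < 4; j++)
            D += fabsf(tmp[j]);                       /* scalar part */
    }
    return D;
}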
Programming Complexity Issues
High level: Use compiler; may not always be successful
Low level: Use intrinsics or inline assembly; tedious and error prone
Data must be aligned, and adjacent in memory
Unaligned data may produce incorrect results
May need to copy to get adjacency (overhead)
Control flow introduces complexity and inefficiency
Exceptions may be masked
Packing/Unpacking Costs
C = A + 2
D = B + 3
Slide source: Sam Larsen
[C D] = [A B] + [2 3]
Packing/Unpacking Costs
Packing source operands
Copying into contiguous memory
A = f()
B = g()
C = A + 2
D = B + 3
pack: [A B] ← A, B (copy the scalars into contiguous storage)
[C D] = [A B] + [2 3]
Packing/Unpacking Costs
Packing source operands
Copying into contiguous memory
Unpacking destination operands
Copying back to location
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
pack: [A B] ← A, B
[C D] = [A B] + [2 3]
unpack: C, D ← [C D] (copy back for the scalar uses E = C / 5 and F = D * 7)
Slide source: Sam Larsen
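A sketch of these costs in SSE (all names hypothetical): the single SIMD add is bracketed by an explicit pack of the sources and an explicit unpack of the results:
#include <xmmintrin.h>

void pack_unpack_demo(float A, float B, float *E, float *F)
{
    __m128 src = _mm_set_ps(0.0f, 0.0f, B, A);       /* pack A, B         */
    __m128 k   = _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f);
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(src, k));          /* [C D]=[A B]+[2 3] */
    float C = out[0], D = out[1];                    /* unpack C, D       */
    *E = C / 5.0f;                                   /* scalar uses force */
    *F = D * 7.0f;                                   /* the unpacking     */
}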
Alignment Code Generation
Aligned memory access
The address is always a multiple of 16 bytes
Just one superword load or store instruction
float a[64];
for (i=0; i<64; i+=4)
Va = a[i:i+3];
…
(Figure: a[] laid out in memory, with 16-byte boundaries marked at byte offsets 0, 16, 32, 48, …)
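A sketch assuming the GCC-style aligned(16) attribute to force a[] onto a 16-byte boundary; each superword access is then a single aligned load:
#include <xmmintrin.h>

static __attribute__((aligned(16))) float a[64];

float sum_aligned(void)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 64; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(&a[i]));  /* Va = a[i:i+3] */
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}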
Alignment Code Generation (cont.)
Misaligned memory access
The address is always a non-zero constant offset away from the 16-byte boundaries.
Static alignment: For a misaligned load, issue two adjacent aligned loads followed by a merge.
float a[64];
for (i=0; i<60; i+=4)
Va = a[i+2:i+5];
…
(Figure: same layout; each access a[i+2:i+5] straddles a 16-byte boundary.)
With static alignment this becomes:
float a[64];
for (i=0; i<60; i+=4)
V1 = a[i:i+3];
V2 = a[i+4:i+7];
Va = merge(V1, V2, 8);
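One way the two-loads-plus-merge could look in SSE; this is a sketch, using a shuffle that keeps the upper half of V1 and the lower half of V2 as the 8-byte merge:
#include <xmmintrin.h>

static __attribute__((aligned(16))) float a[64];

float sum_misaligned(void)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 60; i += 4) {
        __m128 V1 = _mm_load_ps(&a[i]);      /* a[i  :i+3], aligned */
        __m128 V2 = _mm_load_ps(&a[i + 4]);  /* a[i+4:i+7], aligned */
        /* merge(V1, V2, 8): lanes V1[2], V1[3], V2[0], V2[1] */
        __m128 Va = _mm_shuffle_ps(V1, V2, _MM_SHUFFLE(1, 0, 3, 2));
        acc = _mm_add_ps(acc, Va);
    }
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}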
Statically align loop iterations
float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];
becomes, after peeling the two leading elements so every remaining access starts on a 16-byte boundary:
float a[64];
Sa2 = a[2]; Sa3 = a[3];
for (i=2; i<62; i+=4)
  Va = a[i+2:i+5];
Alignment Code Generation (cont.)
Unaligned memory access
The offset from 16-byte boundaries varies, or not enough information is available.
Dynamic alignment: The merging point is computed during run time.
float a[64];
start = read();
for (i=start; i<60; i++)
Va = a[i:i+3];
float a[64];
start = read();
for (i=start; i<60; i++)
V1 = a[i:i+3];
V2 = a[i+4:i+7];
align = (&a[i]) % 16;
Va = merge(V1, V2, align);
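A sketch of the run-time test this implies (names hypothetical): the alignment is computed from the address at run time; on hardware with unaligned load support, _mm_loadu_ps can stand in for the explicit merge:
#include <xmmintrin.h>
#include <stdint.h>

static __attribute__((aligned(16))) float a[64];

__m128 load_superword(int i)
{
    uintptr_t align = (uintptr_t)&a[i] % 16;  /* known only at run time */
    if (align == 0)
        return _mm_load_ps(&a[i]);    /* aligned: one load           */
    return _mm_loadu_ps(&a[i]);       /* misaligned: hardware merges */
}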