CS4961 Parallel Programming

Lecture 10: Introduction to SIMD

Mary Hall
September 27, 2011

Today’s Lecture
Conceptual understanding of SIMD
Practical understanding of SIMD in the context of multimedia extensions
Slide sources:
Sam Larsen, PLDI 2000, http://people.csail.mit.edu/slarsen/
Jaewook Shin, my former PhD student

SIMD and MIMD Architectures: What's the Difference?
Slide source: Grama et al., Introduction to Parallel Computing, http://www.users.cs.umn.edu/~karypis/parbook

Overview of SIMD Programming
Vector architectures
Early examples of SIMD supercomputers
TODAY, mostly:
Multimedia extensions such as SSE and AltiVec
Graphics and games processors (CUDA, stay tuned)
Accelerators (e.g., ClearSpeed)
Is there a dominant SIMD programming model?
Unfortunately, NO!!!
Why not?
Vector architectures were programmed by scientists
Multimedia extension architectures are programmed by systems programmers (almost assembly language!)
GPUs are programmed by games developers (domain-specific libraries)

Scalar vs. SIMD in Multimedia Extensions

Slide source: Sam Larsen

Multimedia Extension Architectures
At the core of multimedia extensions
SIMD parallelism
Variable-sized data fields:
Vector length = register width / type size

Slide source: Jaewook Shin
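
To make the vector-length formula concrete, this small sketch (an illustrative addition, assuming a 128-bit register as in SSE or AltiVec) computes how many lanes each C type gets:

#include <stdio.h>

/* Vector length = register width / type size.
   Assumes a 128-bit (16-byte) SIMD register, as in SSE/AltiVec. */
#define REG_BYTES 16

int main(void) {
    printf("char:   %zu lanes\n", REG_BYTES / sizeof(char));   /* 16 */
    printf("short:  %zu lanes\n", REG_BYTES / sizeof(short));  /*  8 */
    printf("int:    %zu lanes\n", REG_BYTES / sizeof(int));    /*  4 */
    printf("float:  %zu lanes\n", REG_BYTES / sizeof(float));  /*  4 */
    printf("double: %zu lanes\n", REG_BYTES / sizeof(double)); /*  2 */
    return 0;
}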

Multimedia / Scientific Applications
Image
  Graphics: 3D games, movies
  Image recognition
  Video encoding/decoding: JPEG, MPEG-4
Sound
  Encoding/decoding: IP phone, MP3
  Speech recognition
  Digital signal processing: cell phones
Scientific applications
  Array-based data-parallel computation, another level of parallelism

Characteristics of Multimedia Applications
Regular data access pattern
  Data items are contiguous in memory
Short data types
  8, 16, 32 bits
Data streaming through a series of processing stages
  Some temporal reuse for such data streams
Sometimes …
  Many constants
  Short iteration counts
  Requires saturation arithmetic
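
Saturation arithmetic clamps an out-of-range result to the minimum or maximum representable value instead of wrapping around, which is what pixel math needs. A minimal sketch of the difference, using the SSE2 saturated-add intrinsic (the input values are just illustrative):

#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdio.h>

int main(void) {
    /* Sixteen unsigned 8-bit lanes per 128-bit register. */
    __m128i a = _mm_set1_epi8((char)200);
    __m128i b = _mm_set1_epi8((char)100);

    __m128i wrap = _mm_add_epi8(a, b);   /* modular add:   200+100 -> 44  */
    __m128i sat  = _mm_adds_epu8(a, b);  /* saturated add: 200+100 -> 255 */

    unsigned char w[16], s[16];
    _mm_storeu_si128((__m128i *)w, wrap);
    _mm_storeu_si128((__m128i *)s, sat);
    printf("wrapping: %u  saturating: %u\n", w[0], s[0]);
    return 0;
}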

Why SIMD?
+ More parallelism
  When parallelism is abundant
  SIMD in addition to ILP
+ Simple design
  Replicated functional units
  Small die area
  No heavily ported register files
  Die area cost: MAX-2 (HP): 0.1%, VIS (Sun): 3.0%
- Must be explicitly exposed to the hardware
  By the compiler or by the programmer

Programming Multimedia Extensions
Language extension
Programming interface similar to function call
C: built-in functions, Fortran: intrinsics
Most native compilers support their own multimedia extensions
GCC: -faltivec, -msse2
AltiVec: dst = vec_add(src1, src2);
SSE2: dst = _mm_add_ps(src1, src2); (Visual Studio)
BG/L: dst = __fpadd(src1, src2);
No standard!
Need automatic compilation
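
To make the function-call interface concrete, here is a minimal compilable sketch of the SSE version (illustrative only; compile with something like gcc -msse on x86):

#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* dst[0:3] = src1[0:3] + src2[0:3] in one 4-wide SIMD add.
   All three pointers are assumed 16-byte aligned. */
void add4(float *dst, const float *src1, const float *src2) {
    __m128 a = _mm_load_ps(src1);
    __m128 b = _mm_load_ps(src2);
    _mm_store_ps(dst, _mm_add_ps(a, b));
}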

Rest of Lecture
Understanding multimedia execution
Understanding the overheads
Understanding how to write code to deal with the overheads

Note:
A few people write very low-level code to access this capability.
Today the compilers do a pretty good job, at least on relatively simple codes.
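
For example (a sketch; exact flags and diagnostics vary by compiler and version), GCC will typically vectorize a loop this simple on its own:

/* Compile with, e.g., gcc -O3 -msse2 (-O3 enables the tree vectorizer);
   a vectorization report can be requested with -fopt-info-vec on newer
   GCCs. The restrict qualifiers tell the compiler the arrays do not
   overlap, which it needs to prove vectorization is safe. */
void vecadd(float * restrict a, const float * restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}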
Exploiting SLP (Superword-Level Parallelism) with SIMD Execution
Benefit:
  Multiple ALU ops → one SIMD op
  Multiple ld/st ops → one wide memory op

What are the overheads?
  Packing and unpacking:
    rearranging data so that it is contiguous
  Alignment overhead:
    accessing data from the memory system so that it is aligned to a "superword" boundary
  Control flow:
    control flow may require executing all paths
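
The control-flow overhead is easiest to see in the standard if-conversion idiom: there is no per-lane branching, so both arms are computed for every lane and then merged with a mask. A sketch using SSE compare/logical intrinsics (the condition and arithmetic are made up for illustration):

#include <xmmintrin.h>

/* Vectorized form of: r = (a > 0) ? a * 2 : a * 0.5, per lane.
   Both arms execute for all four lanes; a mask selects the result. */
__m128 select_demo(__m128 a) {
    __m128 mask = _mm_cmpgt_ps(a, _mm_setzero_ps()); /* all-ones where a > 0 */
    __m128 then = _mm_mul_ps(a, _mm_set1_ps(2.0f));
    __m128 els  = _mm_mul_ps(a, _mm_set1_ps(0.5f));
    /* blend: (mask & then) | (~mask & els) */
    return _mm_or_ps(_mm_and_ps(mask, then), _mm_andnot_ps(mask, els));
}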

1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

Slide source: Sam Larsen

[R]   [R]   [XR]   [1.08327]
[G] = [G] + [XG] * [1.89234]
[B]   [B]   [XB]   [1.29835]
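
A hedged sketch of how the three independent multiply-adds map onto one 4-wide SSE operation (the fourth lane is padding; the names are illustrative):

#include <xmmintrin.h>

/* rgb[0..2] += x[0..2] * coefficient, using one SIMD multiply and
   one SIMD add. The fourth lane is unused padding. */
void update_rgb(float rgb[4], const float x[4]) {
    /* _mm_set_ps lists lanes high-to-low: lane 0 gets 1.08327f. */
    const __m128 coeff = _mm_set_ps(0.0f, 1.29835f, 1.89234f, 1.08327f);
    __m128 v  = _mm_loadu_ps(rgb);
    __m128 xv = _mm_loadu_ps(x);
    _mm_storeu_ps(rgb, _mm_add_ps(v, _mm_mul_ps(xv, coeff)));
}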

2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

Slide source: Sam Larsen

[R]   [R]
[G] = [G] + X[i:i+2]
[B]   [B]
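
In SSE this becomes one wide load instead of three scalar ones; a sketch (illustrative, and assuming X has at least i+4 valid elements, since the fourth lane is loaded and then ignored):

#include <xmmintrin.h>

/* One unaligned wide load fetches X[i], X[i+1], X[i+2] (and X[i+3],
   which is ignored) in a single instruction. */
void accum_rgb(float rgb[4], const float *X, int i) {
    __m128 v = _mm_loadu_ps(rgb);
    __m128 x = _mm_loadu_ps(&X[i]);   /* X[i : i+3] */
    _mm_storeu_ps(rgb, _mm_add_ps(v, x));
}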

3. Vectorizable Loops
for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]

Slide source: Sam Larsen

3. Vectorizable Loops (cont.)
Unrolled by 4:
for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

Vectorized:
for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]

Slide source: Sam Larsen

Nuts and Bolts
What does a piece of code really look like?

for (i=0; i<100; i+=4)
  A[i:i+3] = B[i:i+3] + C[i:i+3]

In SSE intrinsics:
for (i=0; i<100; i+=4) {
  __m128 btmp = _mm_load_ps(&B[i]);
  __m128 ctmp = _mm_load_ps(&C[i]);
  __m128 atmp = _mm_add_ps(btmp, ctmp);
  _mm_store_ps(&A[i], atmp);
}

4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)

Slide source: Sam Larsen

4. Partially Vectorizable Loops (cont.)
Unrolled by 2:
for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)

Partially vectorized (the subtraction is SIMD; the abs/accumulate stays scalar):
for (i=0; i<16; i+=2)
  [L0; L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)

Slide source: Sam Larsen

Programming Complexity Issues
High level: use the compiler
  May not always be successful
Low level: use intrinsics or inline assembly
  Tedious and error prone
Data must be aligned, and adjacent in memory
  Unaligned data may produce incorrect results
  May need to copy to get adjacency (overhead)
Control flow introduces complexity and inefficiency
Exceptions may be masked

Packing/Unpacking Costs
C = A + 2
D = B + 3

[C]   [A]   [2]
[D] = [B] + [3]

Slide source: Sam Larsen

Packing/Unpacking Costs (cont.)
Packing source operands
  Copying into contiguous memory
A = f()
B = g()
C = A + 2
D = B + 3

A and B are produced separately, so they must first be copied into adjacent locations (packed) to form the vector [A; B].

Packing/Unpacking Costs (cont.)
Packing source operands
  Copying into contiguous memory
Unpacking destination operands
  Copying back to location
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

The result vector [C; D] must be copied back (unpacked) into the scalars C and D before E and F can use them.

Slide source: Sam Larsen

Alignment Code Generation
Aligned memory access
  The address is always a multiple of 16 bytes
  Just one superword load or store instruction

float a[64];
for (i=0; i<64; i+=4)
  Va = a[i:i+3];

[diagram: a[] begins on a 16-byte boundary; the loads fall at byte offsets 0, 16, 32, 48, …]

Alignment Code Generation (cont.)
Misaligned memory access
  The address is always a non-zero constant offset away from the 16-byte boundaries.
  Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge.

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];

[diagram: the accesses a[i+2:i+5] straddle the 16-byte boundaries at 0, 16, 32, 48, …]

becomes

float a[64];
for (i=0; i<60; i+=4)
  V1 = a[i:i+3];
  V2 = a[i+4:i+7];
  Va = merge(V1, V2, 8);

Alternatively, statically align the loop iterations: peel off the first two elements so the vector loads start on a 16-byte boundary.

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];

becomes

float a[64];
Sa2 = a[2]; Sa3 = a[3];
for (i=4; i<64; i+=4)
  Va = a[i:i+3];

Alignment Code Generation (cont.)
Unaligned memory access
  The offset from 16-byte boundaries is varying, or not enough information is available.
  Dynamic alignment: the merging point is computed during run time.

float a[64];
start = read();
for (i=start; i<60; i++)
  Va = a[i:i+3];

becomes

float a[64];
start = read();
for (i=start; i<60; i++)
  V1 = a[i:i+3];
  V2 = a[i+4:i+7];
  align = (&a[i]) % 16;
  Va = merge(V1, V2, align);
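
As a concrete sketch of the static-alignment transformation (illustrative; _mm_shuffle_ps stands in for the merge(V1, V2, 8) above, assuming a is 16-byte aligned and i is a multiple of 4):

#include <xmmintrin.h>

/* Load the misaligned vector a[i+2 : i+5] using two aligned loads
   plus a merge, instead of one slow (or illegal) misaligned load. */
__m128 load_offset2(const float *a, int i) {
    __m128 v1 = _mm_load_ps(&a[i]);      /* a[i   : i+3] */
    __m128 v2 = _mm_load_ps(&a[i + 4]);  /* a[i+4 : i+7] */
    /* Take lanes 2,3 of v1 and lanes 0,1 of v2 -> a[i+2 : i+5]. */
    return _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(1, 0, 3, 2));
}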