18-646 – How to Write Fast Code II
Carnegie Mellon University
Outline
Compilation with CUDA
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Mini-Project One – Submission Date
The submission date for mini-project 1 will be postponed by one week due to the issues with the k-means assignment.
Mini-Project One Due: Monday, 03/15 @ 11:59PM
Mini-Project One Review: Thursday, 03/18
Mini-Project One – K-Means
Depending on when you ran the command:
$ cp /afs/andrew.cmu.edu/course/18/646/MP1/18646_MP1.tar.gz ~/
you may have retrieved a slightly different version of omp_kmeans.c
Version 1: 480s to run the 10 test cases on a GHC machine
Version 2: 78s to run the 10 test cases on a GHC machine
Mini-Project One – K-Means
Option 1: DO NOT USE _inline functions within your submitted code
Baseline: 480s to run the 10 test cases on a GHC machine
Goal: 5x speedup compared to baseline implementation (96s on a GHC machine)
90% of performance grade with 4x speedup (120s on a GHC machine)
80% of performance grade with 3x speedup (180s on a GHC machine)
70% of performance grade with 2x speedup (240s on a GHC machine)
Option 2: DO USE _inline functions within your submitted code
Baseline: 78s to run the 10 test cases on a GHC machine
Goal: 2x speedup compared to baseline implementation (38s on a GHC machine)
90% of performance grade with 1.5x speedup (56s on a GHC machine)
80% of performance grade with 1.2x speedup (64s on a GHC machine)
70% of performance grade with 1.1x speedup (70s on a GHC machine)
Updates to Course Schedule
Mini-Project 1.1 – Matrix Multiply
Mini-Project 1.2 – K-Means
Outline
Compilation with CUDA
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Compilation
Any source file containing CUDA language extensions must be compiled with NVCC
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
NVCC outputs:
C code (host CPU Code)
Must then be compiled with the rest of the application using another tool
PTX
Object code directly
Or, PTX source, interpreted at runtime
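As a rough illustration of this split, here is a minimal sketch (not from the slides; the file name, kernel name, and launch parameters are made up) of a .cu source file and how NVCC might be invoked on it. NVCC separates the device code from the host code, compiles the device side to PTX/object code, and hands the host side to the host compiler:

    // scale.cu – a minimal CUDA source file (hypothetical example)
    __global__ void scale(int n, float* x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            x[i] *= a;                                   // device code, compiled to PTX
    }

    int main() {
        // host code, compiled with the host compiler (g++, cl, ...)
        // ... allocate and copy data with cudaMalloc/cudaMemcpy, then launch:
        // scale<<<(n + 255)/256, 256>>>(n, d_x, 2.0f);
        return 0;
    }

    // Typical invocations (adjust to your setup):
    //   $ nvcc -o scale scale.cu     # build an executable
    //   $ nvcc -ptx scale.cu         # emit PTX source only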
The CUDA Platform
[Figure: CUDA compilation flow. A C/C++ CUDA application (e.g., float4 me = gx[gtid]; me.x += me.y * me.z;) is compiled by NVCC into host CPU code and virtual PTX code. PTX (Parallel Thread eXecution) is a virtual machine and ISA that defines the programming model and the execution resources and state. A PTX-to-target compiler then translates the virtual PTX (e.g., mad.f32 $f1, $f5, $f3, $f1;) into physical GPU target code (e.g., ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; for G80).]
Outline
Compilation with CUDA
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
When to Use Shared Memory?
Find the absolute differences between neighboring elements of an array:
in:  5 2 3 4 1 8 3 1
out:   3 1 1 3 7 5 2   (starting at out[1])
CUDA function:
__global__ void absolute_diff( int n, float* in, float* out) {
    int idx = blockIdx.x*256 + threadIdx.x;
    if ((idx != 0) && (idx < n)) {
        out[idx] = abs(in[idx] - in[idx-1]);
    }
}
Opportunity: avoid loading elements of in from global memory twice
Synchronization: Caveats
__syncthreads()
waits until all threads in the thread block have reached this point, and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
used to coordinate communication between the threads of the same block
What’s wrong with this code?
__global__ void absolute_diff( int n, float* in, float* out) {
    int idx = blockIdx.x*256 + threadIdx.x;
    __shared__ float local_mem[256];
    if ((threadIdx.x != 0) && (idx < n)) {
        local_mem[threadIdx.x] = in[idx];
        __syncthreads();
        out[idx] = abs(local_mem[threadIdx.x] - local_mem[threadIdx.x-1]);
    } else if (threadIdx.x == 0) {
        // first thread: in[idx-1] is only in global memory
        out[idx] = abs(local_mem[threadIdx.x] - in[idx-1]);
    }
}
Synchronization
__syncthreads()
waits until all threads in the thread block have reached this point, and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
used to coordinate communication between the threads of the same block
Corrected code:
__global__ void absolute_diff( int n, float* in, float* out) {
    int idx = blockIdx.x*256 + threadIdx.x;
    __shared__ float local_mem[256];
    if ((threadIdx.x != 0) && (idx < n)) {
        local_mem[threadIdx.x] = in[idx];
    }
    __syncthreads();
    if ((threadIdx.x != 0) && (idx < n)) {
        out[idx] = abs(local_mem[threadIdx.x] - local_mem[threadIdx.x-1]);
    } else if (threadIdx.x == 0) {
        out[idx] = abs(local_mem[threadIdx.x] - in[idx-1]);
    }
}
Synchronization
[Figure: scopes of the memory fence functions. __threadfence_block() fences within one thread block; __threadfence() fences across the whole device (GPU); __threadfence_system() extends across the device and the host system (CPU). Here we will focus on a single GPU.]
Fence Synchronization
__threadfence_block()
waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block
__threadfence()
waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:
All threads in the thread block for shared memory accesses
All threads in the device for global memory accesses
__threadfence_system()
waits until all global and shared memory accesses made by the calling thread prior to __threadfence_system() are visible to
All threads in the thread block for shared memory accesses
All threads in the device for global memory accesses
Host threads for page-locked host memory accesses (see Section 3.2.4.3).
only supported by devices of compute capability 2.x.
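To make the fence semantics concrete, here is a minimal sketch (not from the slides; the variable and kernel names are made up, and it assumes both blocks are resident on the GPU at the same time, which holds here with only two small blocks) of a producer/consumer handoff through global memory. The fence guarantees the data write is visible device-wide before the flag write:

    __device__ volatile int result;       // data produced by one thread
    __device__ volatile int ready = 0;    // flag polled by another thread

    __global__ void handoff(int* out) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            result = 42;        // 1. produce the data
            __threadfence();    // 2. make it visible to all threads in the device ...
            ready = 1;          // 3. ... before publishing the flag
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (ready == 0) { }   // poll until the producer publishes
            *out = result;           // sees 42, not a stale value
        }
    }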
Atomics
An atomic function performs a read-modify-write atomic operation on a 32-bit or 64-bit word
Residing in global or shared memory
atomicCAS() atomicAdd() atomicSub() atomicExch() atomicMin() atomicMax()
atomicInc() atomicDec() atomicAnd() atomicOr() atomicXor()
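For example, here is a minimal sketch (not from the slides) of a histogram kernel where many threads may update the same bin; atomicAdd() makes each read-modify-write indivisible:

    __global__ void histogram(int n, const unsigned char* in, unsigned int* bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[in[i]], 1u);   // safe even when threads collide on a bin
    }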
Outline
Synchronization
Data Parallel Algorithms - Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Data Parallel Algorithms - Map
Map:
A function that applies a given function to each element of a list and returns a list of results
Example: applying Fn: y = 2x to each element
x: 0 1 2 ... 127 128 129 ... 255
y: 0 2 4 ... 254 256 258 ... 510
Two important properties:
Side-effect free: only returns a value; no modification of state shared with the rest of the application
Independent: each element is an independent piece of work, whose input does not depend on another element
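A minimal CUDA sketch of this map (not from the slides; the kernel name is made up), with one thread per element:

    __global__ void map_double(int n, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n)
            y[i] = 2.0f * x[i];   // side-effect free and independent per element
    }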
Data Parallel Algorithms - Reduce
Reduce:
A function that takes in a list of objects and builds up a return value
Important properties for parallel reduction:
Associativity: a+(b+c) == (a+b)+c
Allows elements to be reduced in parallel in a “tree”
In CUDA, the synchronization has to be managed by the programmer
a+b+c+d+e+f+g+h
= ( (a+b)+(c+d) )+( (e+f)+(g+h) )
= (a+b+c+d) + (e+f+g+h)
How Best to Implement Reduce?
What is an issue with this approach?
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride = stride >> 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
Minimize branch divergence
Elimination of __syncthreads()
__device__ void sum(float* g_idata, float* g_odata) {
    unsigned int tid = threadIdx.x;
    extern __shared__ float s_data[];
    // Assign initial value
    s_data[tid] = g_idata[...];
    __syncthreads();
    if (tid < 128)
        s_data[tid] += s_data[tid + 128];
    __syncthreads();
    if (tid < 64)
        s_data[tid] += s_data[tid + 64];
    __syncthreads();
    if (tid < 32) {
        // No __syncthreads(): 32 threads in each
        // warp execute in lock-step with each other
        volatile float* s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }
    // [Figure: tree-based reduction of a 256-element array in shared memory;
    //  each step halves the number of active threads: 256 -> 128 -> 64 -> 32 -> ... -> 1]
// Write result for this thread block to global memory
if (tid == 0)
g_odata[blockIdx.x] = s_data[0];
}
Data Parallel Algorithms - Scan
Scan (prefix-sum):
Takes a binary associative operator ⊕ with identity I, and an array of n elements
[a0, a1, ..., an-1]
and returns the ordered set
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]
Example: if ⊕ is addition, then scan on the set
[3 1 7 0 4 1 6 3]
returns the set
[0 3 4 11 11 15 16 22]
How fast can we do that?
To Implement Scan – Revisit Reduce
Any techniques for creating an O(log N) implementation?
Sequential Reduction
sum[1] = v[1]
For k = 2 to N
    sum[k] = sum[k-1] + v[k]
Parallel Reduction
How do you do a scan?
Scan Techniques
Note that in the vector reduction, at least half of the processors were idle in any step. Let's have them compute something!
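One way to realize this idea is the step-efficient (Hillis-Steele) scan, in which every thread does useful work at every step. Below is a minimal single-block sketch (not from the slides; kernel and variable names are made up), assuming the block size is a power of two and the kernel is launched with 2 * blockDim.x floats of dynamic shared memory:

    // Inclusive scan of one block of data using double-buffered shared memory.
    __global__ void scan_block(int n, const float* in, float* out) {
        extern __shared__ float temp[];          // 2 * blockDim.x floats
        int tid = threadIdx.x;
        int pout = 0, pin = 1;                   // which buffer is written this step
        temp[tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();
        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            pout = 1 - pout;                     // swap buffers
            pin  = 1 - pout;
            if (tid >= offset)
                temp[pout*blockDim.x + tid] = temp[pin*blockDim.x + tid]
                                            + temp[pin*blockDim.x + tid - offset];
            else
                temp[pout*blockDim.x + tid] = temp[pin*blockDim.x + tid];
            __syncthreads();
        }
        if (tid < n)
            out[tid] = temp[pout*blockDim.x + tid];
    }
    // Launch (hypothetical): scan_block<<<1, 256, 2*256*sizeof(float)>>>(n, d_in, d_out);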
Scan Libraries
Like sort, there exist many optimizations for scan
For writing fast scan implementations – use the Thrust library
Thrust library: C++ template library for CUDA
Now part of CUDA 4.0
#include <thrust/scan.h>
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}
http://code.google.com/p/thrust/wiki/QuickStartGuide#Prefix-Sums
Outline
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Data Parallel Algorithms – Compact
Compaction:
Removing elements from an array – take in an array and produce a shorter array
Removing: [1] [5] [6] [8]
A B C D E F G H I J
A C D E H J
How do we perform removal in parallel?
Data Parallel Algorithms – Compact
Compaction:
Removing elements from an array – take in an array and produce a shorter array
How do we perform removal in parallel?
Map – create flags (“1” keep, “0” remove)
Scan – compute index
Map – copy to new array
if (flag[i] == 1){
    dst[ scanIdx[i] ] = src[i];
}
Removing: [1] [5] [6] [8]
index:   0 1 2 3 4 5 6 7 8 9
src:     A B C D E F G H I J
Flags:   1 0 1 1 1 0 0 1 0 1
scanIdx: 0 1 1 2 3 3 3 4 4 5
dest:    A C D E H J
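A minimal CUDA sketch of these steps (not from the slides; here the predicate simply keeps nonzero elements, and scanIdx is assumed to be the exclusive scan of flag, e.g., produced with thrust::exclusive_scan):

    // Map: build the flag array (1 = keep, 0 = remove)
    __global__ void build_flags(int n, const int* src, int* flag) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            flag[i] = (src[i] != 0) ? 1 : 0;
    }

    // Map: scatter the kept elements to their compacted positions
    __global__ void scatter_kept(int n, const int* src, const int* flag,
                                 const int* scanIdx, int* dst) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flag[i] == 1)
            dst[scanIdx[i]] = src[i];   // scanIdx[i] = number of kept elements before i
    }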
Data Parallel Algorithms – FindUniq
FindUniq:
Removing duplicates from an array – take in a set, produce an equal or smaller set of unique values
M I S S I S S I P P I
M I S P
How do we perform “find unique” in parallel?
How do we “find unique” sequentially?
Sort
Iterate through and copy
I I I I M P P S S S S
I M P S
Data Parallel Algorithms – FindUniq
FindUniq:
Removing duplicates from an array – take in an array, produce an equal or shorter array
How do we perform “find unique” in parallel?
Sort
M I S S I S S I P P I
I I I I M P P S S S S
Map – flag when the i-th and (i-1)-th elements differ (flag[0] = 1)
1 0 0 0 1 1 0 1 0 0 0
Scan – create compaction index
0 1 1 1 1 2 2 3 3 3 3
Map – copy to new array
I M P S
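On the host side, the same sort-then-deduplicate pipeline can be expressed compactly with Thrust (a sketch, not from the slides; it uses thrust::sort and thrust::unique in place of the explicit flag/scan/gather kernels):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/unique.h>

    // Returns the unique characters of the input, e.g., "MISSISSIPPI" -> {I, M, P, S}
    thrust::device_vector<char> find_unique(const char* input, int n) {
        thrust::device_vector<char> d(input, input + n);      // copy to the GPU
        thrust::sort(d.begin(), d.end());                      // I I I I M P P S S S S
        d.erase(thrust::unique(d.begin(), d.end()), d.end());  // keep first of each run
        return d;                                              // I M P S
    }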
Find Unique – Special case
Traditional Approach: 0.349 seconds
List Sorting: Sort (0.310)
Duplicate Removal: Cluster-boundary Detection (0.007), Unique-index Prefix-scan (0.025), Unique-list Gathering (0.007)
Special case: what if we know all the possible values the elements can take – such as the 26 letters of the alphabet?
Sorting is the most expensive step
How can we avoid sorting?
Find Unique – Special case
Special case: what if we know all the possible values the elements can take – such as the 26 letters of the alphabet?
Setup a hash table, or lookup table:
A B C … I J … M … P … S
0 0 0 … 1 0 … 1 … 1 … 1
Traditional Approach: 0.349 seconds
List Sorting: Sort (0.310)
Duplicate Removal: Cluster-boundary Detection (0.007), Unique-index Prefix-scan (0.025), Unique-list Gathering (0.007)
Alternative Approach: 0.055 seconds
Hash insertion: Hash write (0.030)
Duplicate Removal: Unique-index Prefix-scan (0.020), Unique-list Gathering (0.005)
Sorting is the most expensive step
How can we avoid sorting?
Populating a Hash Flag Array
Hash insertion:
Leverage the semantics of conflicting writes for non-atomic memory accesses
At least one conflicting write to a device memory location is guaranteed to succeed
Order of execution is undefined
NVIDIA CUDA C Programming Guide Version 4.0, Chapter 4.2 (page 86)
For setting flags, map the hash insertion to threads
Success of insertion in any order can achieve the goal
M I S S I S S I P P I
A B C … I J … M … P … S
0 0 0 … 1 0 … 1 … 1 … 1
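A minimal sketch of this hash insertion for the 26-letter special case (not from the slides; the kernel name and the "hash" are made up – here the hash is simply the letter's position in the alphabet):

    // Each thread writes a 1 into the flag slot of its element.
    // Conflicting writes of the same value need no atomics: at least one succeeds.
    __global__ void set_flags(int n, const char* in, int* flag) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            flag[in[i] - 'A'] = 1;   // e.g., 'M' -> slot 12
    }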
How is this relevant to writing fast code?
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Introduced the manycore platform HW and SW mental models
Introduced the terminologies for you to start FLIRTing with the technology
Introduced design trade-offs in data structures
Introduced parallel algorithms
Software Architecture