
18-646 – How to Write Fast Code II
1
Carnegie Mellon University

Outline
— Compilation with CUDA
— Synchronization
— Data Parallel Algorithms – Map, Reduce, and Scan
— Compositions of Data Parallel Algorithms
— Compact
— Find Unique
— Building a Flag Array
18-646 – How to Write Fast Code II 2

18-646 – How to Write Fast Code II 3

Mini-Project One – Submission Date
The submission date for mini-project 1 will be postponed by one week due to the issues with the k-means assignment.
— Mini-Project One Due: Monday, 03/15 @ 11:59PM
— Mini-Project One Review: Thursday, 03/18
18-646 – How to Write Fast Code II
4

Mini-Project One – K-Means
Depending on when you ran the command:
$ cp /afs/andrew.cmu.edu/course/18/646/MP1/18646_MP1.tar.gz ~/
you may have retrieved a slightly different version of omp_kmeans.c
Version 1: 480s to run the 10 testcases on a GHC machine
Version 2: 78s to run the 10 testcases on a GHC machine
18-646 – How to Write Fast Code II 5

Mini-Project One – K-Means
Option 1: DO NOT USE _inline functions within your submitted code
— Baseline: (480s to run the 10 testcases on a GHC machine)
— Goal: 5x speed up compared to baseline implementation (96s on a GHC machine)
— 90% of performance grade with 4x speedup (120s on a GHC machine)
— 80% of performance grade with 3x speedup (180s on a GHC machine)
— 70% of performance grade with 2x speedup (240s on a GHC machine)
Option 2: DO USE _inline functions within your submitted code
— Baseline: (78s to run the 10 testcases on a GHC machine)
— Goal: 2x speed up compared to baseline implementation (38s on a GHC machine)
— 90% of performance grade with 1.5x speedup (56s on a GHC machine)
— 80% of performance grade with 1.2x speedup (64s on a GHC machine)
— 70% of performance grade with 1.1x speedup (70s on a GHC machine)
18-646 – How to Write Fast Code II 6

Updates to Course Schedule
18-646 – How to Write Fast Code II 7

Mini-Project 1.1 – Matrix Multiply
18-646 – How to Write Fast Code II 8

Mini-Project 1.2 – K-Means
18-646 – How to Write Fast Code II 9

18-646 – How to Write Fast Code II 10

Outline
— Compilation with CUDA
— Synchronization
— Data Parallel Algorithms – Map, Reduce, and Scan
— Compositions of Data Parallel Algorithms
— Compact
— Find Unique
— Building a Flag Array
18-646 – How to Write Fast Code II 11

Compilation
— Any source file containing CUDA language extensions must be compiled with NVCC
— NVCC is a compiler driver
  — Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
— NVCC outputs:
  — C code (host CPU code)
    — Must then be compiled with the rest of the application using another tool
  — PTX
    — Object code directly
    — Or, PTX source, interpreted at runtime
  (see the build-flow sketch below)
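A minimal sketch of the resulting build flow (the file name and flags here are illustrative, not from the slides):

  // scale.cu – contains CUDA extensions, so it must go through NVCC
  __global__ void scale(int n, float a, float *x)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;   // simple element-wise kernel
  }

  // Typical invocations:
  //   nvcc -c scale.cu -o scale.o          # NVCC drives the device/host compilers
  //   g++  -c main.cpp -o main.o           # host-only sources use the regular host compiler
  //   g++  main.o scale.o -lcudart -o app  # link against the CUDA runtime
  //   nvcc -ptx scale.cu -o scale.ptx      # or emit PTX, to be compiled at runtime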
18-646 – How to Write Fast Code II 12

The CUDA Platform
— Parallel Thread eXecution (PTX)
  — Virtual Machine and ISA
  — Programming model
  — Execution resources and state
[Figure: compilation flow –
  C/C++ CUDA Application (e.g.,  float4 me = gx[gtid];  me.x += me.y * me.z;)
  → NVCC → CPU Code + PTX Code (virtual)
  → PTX-to-Target Compiler → GPU Target code for a physical GPU (e.g., G80):
      ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
      mad.f32 $f1, $f5, $f3, $f1;
      … ]
18-646 – How to Write Fast Code II 13


Outline
— Compilation with CUDA
— Synchronization
— Data Parallel Algorithms – Map, Reduce, and Scan
— Compositions of Data Parallel Algorithms
— Compact
— Find Unique
— Building a Flag Array
18-646 – How to Write Fast Code II 15

When to Use Shared Memory?
— Find absolute differences in neighboring elements in an array:
  in:   5 2 3 4 1 8 3 1
  out:  – 3 1 1 3 7 5 2

__global__ void absolute_diff( int n, float* in, float* out) {
  int idx = blockIdx.x*256 + threadIdx.x;
  if ((idx != 0) && (idx < n)) {
    out[idx] = abs(in[idx] - in[idx-1]);
  }
}

— Opportunity to not load elements of in twice
18-646 – How to Write Fast Code II 16

Synchronization: Caveats
__syncthreads()
— waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
— used to coordinate communication between the threads of the same block

What's wrong with this code?

__global__ void absolute_diff( int n, float* in, float* out) {
  int idx = blockIdx.x*256 + threadIdx.x;
  __shared__ float local_mem[256];
  if ((threadIdx.x != 0) && (idx < n)) {
    local_mem[threadIdx.x] = in[idx];
    __syncthreads();
    out[idx] = abs(local_mem[threadIdx.x] - local_mem[threadIdx.x-1]);
  } else if (threadIdx.x == 0) {
    out[idx] = abs(local_mem[threadIdx.x] - in[idx-1]);
  }
}

18-646 – How to Write Fast Code II 17

Synchronization
__syncthreads()
— waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
— used to coordinate communication between the threads of the same block

Corrected code:

__global__ void absolute_diff( int n, float* in, float* out) {
  int idx = blockIdx.x*256 + threadIdx.x;
  __shared__ float local_mem[256];
  if ((threadIdx.x != 0) && (idx < n)) {
    local_mem[threadIdx.x] = in[idx];
  }
  __syncthreads();
  if ((threadIdx.x != 0) && (idx < n)) {
    out[idx] = abs(local_mem[threadIdx.x] - local_mem[threadIdx.x-1]);
  } else if (threadIdx.x == 0) {
    out[idx] = abs(local_mem[threadIdx.x] - in[idx-1]);
  }
}

18-646 – How to Write Fast Code II 18

Synchronization
[Figure: fence scopes – __threadfence_block() covers the threads of one block; __threadfence() covers one Device System (GPU); __threadfence_system() spans multiple Device Systems (GPUs) and the Host System (CPU); we will focus on one GPU]
18-646 – How to Write Fast Code II 19

Fence Synchronization
__threadfence_block()
— waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block
__threadfence()
— waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:
  — All threads in the thread block for shared memory accesses
  — All threads in the device for global memory accesses
__threadfence_system()
— waits until all global and shared memory accesses made by the calling thread prior to __threadfence_system() are visible to:
  — All threads in the thread block for shared memory accesses
  — All threads in the device for global memory accesses
  — Host threads for page-locked host memory accesses (see Section 3.2.4.3)
— only supported by devices of compute capability 2.x.
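As a rough illustration of the device-wide fence above, a minimal producer-side sketch (the variable names are illustrative and the consumer side is omitted):

  __device__ float result;          // data produced by one thread
  __device__ volatile int ready;    // flag that other blocks poll

  __global__ void produce(float x)
  {
      if (blockIdx.x == 0 && threadIdx.x == 0) {
          result = x * x;      // 1. write the data to global memory
          __threadfence();     // 2. make the write visible to all threads in the device
          ready = 1;           // 3. only then publish the flag
      }
  }

Without the __threadfence(), another block could observe ready == 1 before the write to result becomes visible.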
18-646 – How to Write Fast Code II 20

Atomics
— An atomic function performs a read-modify-write atomic operation on a word
  — 32-bit or 64-bit word
  — Residing in global or shared memory
atomicCAS()  atomicAdd()  atomicSub()  atomicExch()  atomicMin()  atomicMax()
atomicInc()  atomicDec()  atomicAnd()  atomicOr()  atomicXor()
18-646 – How to Write Fast Code II 21

Outline
— Synchronization
— Data Parallel Algorithms – Map, Reduce, and Scan
— Compositions of Data Parallel Algorithms
— Compact
— Find Unique
— Building a Flag Array
18-646 – How to Write Fast Code II 22

Data Parallel Algorithms – Map
— Map: A function that applies a given function to each element of a list, returning a list of results
  Example:  V = 0 1 2 … 127 128 129 … 255,  Fn: y = 2x  ⇒  X = 0 2 4 … 254 256 258 … 510
— Two important properties:
  — Side-effect free: only returning a value, no modifications of state with the rest of the application
  — Independent: an independent piece of work, where its input does not depend on another function
18-646 – How to Write Fast Code II 23

Data Parallel Algorithms – Reduce
— Reduce: A function that takes in a list of objects and builds up a return value
— Important properties for parallel reduction:
  — Associativity: a+(b+c) == (a+b)+c
  — Allows elements to be reduced in parallel in a “tree”
  — In CUDA, the synchronization has to be managed by the programmer
a+b+c+d+e+f+g+h = ( (a+b)+(c+d) )+( (e+f)+(g+h) ) = (a+b+c+d) + (e+f+g+h)
18-646 – How to Write Fast Code II 24

How Best to Implement Reduce?
— What is an issue with this approach?

  unsigned int t = threadIdx.x;
  for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
  {
    __syncthreads();
    if (t % (2*stride) == 0)
      partialSum[t] += partialSum[t+stride];
  }

18-646 – How to Write Fast Code II 25

How Best to Implement Reduce?
— Minimize branch divergence

  unsigned int t = threadIdx.x;
  for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride = stride >> 1)
  {
    __syncthreads();
    if (t < stride)
      partialSum[t] += partialSum[t+stride];
  }

18-646 – How to Write Fast Code II 26

Elimination of __syncthreads()
— Within a warp (32 threads), no synchronization is needed

__device__ void sum(float* g_idata, float* g_odata){
  unsigned int tid = threadIdx.x;
  extern __shared__ float s_data[];
  // Assign initial value
  s_data[tid] = g_idata[...];
  __syncthreads();
  if (tid < 128) s_data[tid] += s_data[tid + 128];
  __syncthreads();
  if (tid < 64)  s_data[tid] += s_data[tid + 64];
  __syncthreads();
  if (tid < 32) {
    // No __syncthreads(): 32 threads in each
    // warp execute in lock-step with each other
    volatile float* s_ptr = s_data;
    s_ptr[tid] += s_ptr[tid + 32];
    s_ptr[tid] += s_ptr[tid + 16];
    s_ptr[tid] += s_ptr[tid + 8];
    s_ptr[tid] += s_ptr[tid + 4];
    s_ptr[tid] += s_ptr[tid + 2];
    s_ptr[tid] += s_ptr[tid + 1];
  }
  // Write result for this thread block to global memory
  if (tid == 0) g_odata[blockIdx.x] = s_data[0];
}

[Figure: active thread indices halve at each step – 0…255, 0…127, 0…63, 0…31, 0…15, 0…7, 0…3, 0…1, 0]
18-646 – How to Write Fast Code II 27

Data Parallel Algorithms – Scan
— Scan (prefix-sum): Takes a binary associative operator ⊕ with identity I, and an array of n elements
  [a0, a1, ..., an-1]
  and returns the ordered set
  [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]
— Example: if ⊕ is addition, then scan on the set [3 1 7 0 4 1 6 3] returns the set [0 3 4 11 11 15 16 22]
— How fast can we do that?
18-646 – How to Write Fast Code II 28

To Implement Scan – Revisit Reduce
— Any techniques for getting to O(log N)?
  Sequential Reduction:
    sum[1] = v[1]
    For k = 2 to N: sum[k] = sum[k-1] + v[k]
  Parallel Reduction: [Figure – tree of pairwise partial sums]
— How do you do a scan?
18-646 – How to Write Fast Code II 29

Scan Techniques
— Note that in the vector reduction, at least half of the processors were idle in any step. Let's have them compute something! (a simple kernel sketch follows the Thrust example below)
18-646 – How to Write Fast Code II 30

Scan Libraries
— Like sort, there exist many optimizations for scan
— For writing fast scan implementations – use the Thrust library
— Thrust library: C++ template library for CUDA
  — Now part of CUDA 4.0

#include <thrust/scan.h>
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}
http://code.google.com/p/thrust/wiki/QuickStartGuide#Prefix-Sums
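For intuition, a minimal single-block exclusive-scan kernel in the "have the idle processors compute something" style (a naive double-buffered scan; this assumes n == blockDim.x, n is a power of two, and 2*n floats of dynamic shared memory – a sketch, not the optimized formulation Thrust uses):

  __global__ void scan_naive(float *g_out, const float *g_in, int n)
  {
      extern __shared__ float temp[];   // 2 * n floats, used as a double buffer
      int tid  = threadIdx.x;
      int pout = 0, pin = 1;

      // Exclusive scan: shift the input right by one; element 0 gets the identity
      temp[pout * n + tid] = (tid > 0) ? g_in[tid - 1] : 0.0f;
      __syncthreads();

      for (int offset = 1; offset < n; offset *= 2) {
          pout = 1 - pout;              // swap the double buffers
          pin  = 1 - pout;
          if (tid >= offset)
              temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
          else
              temp[pout * n + tid] = temp[pin * n + tid];
          __syncthreads();
      }
      g_out[tid] = temp[pout * n + tid];   // one scanned value per thread
  }

  // Example launch for n = 256:
  //   scan_naive<<<1, 256, 2 * 256 * sizeof(float)>>>(d_out, d_in, 256);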
18-646 – How to Write Fast Code II 31

Outline
— Synchronization
— Data Parallel Algorithms – Map, Reduce, and Scan
— Compositions of Data Parallel Algorithms
— Compact
— Find Unique
— Building a Flag Array
18-646 – How to Write Fast Code II 32

Data Parallel Algorithms – Compact
— Compaction:
Removing elements from an array – takes in an array and produces a shorter array
Removing: [1] [5] [6] [8]
A B C D E F G H I J
A C D E H J
— How do we perform removal in parallel?
18-646 – How to Write Fast Code II 33

Data Parallel Algorithms – Compact
— Compaction:
Removing elements from an array – takes in an array and produces a shorter array
— How do we perform removal in parallel?
— Map – create flags (“1” keep, “0” remove)
— Scan – compute index
— Map – copy to new array
if (flag[i] == 1){
dst[ scanIdx[i] ] = src[i];
}
Removing: [1] [5] [6] [8]
index:    0 1 2 3 4 5 6 7 8 9
src:      A B C D E F G H I J
flags:    1 0 1 1 1 0 0 1 0 1
scanIdx:  0 1 1 2 3 4 4 4 5 5
dst:      A C D E H J
(a kernel sketch of these three steps follows this slide)
18-646 – How to Write Fast Code II 34
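A minimal sketch of the map / scan / map composition above (kernel names and launch parameters are illustrative; the scan step would use a library call such as thrust::exclusive_scan from the Scan Libraries slide):

  // Map: build the keep ("1") / remove ("0") flags
  __global__ void build_flags(int n, const int *remove, int *flag)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) flag[i] = remove[i] ? 0 : 1;
  }

  // Scan: exclusive scan of flag[] into scanIdx[] (e.g., thrust::exclusive_scan)

  // Map: scatter the kept elements to their compacted positions
  __global__ void compact(int n, const float *src, const int *flag,
                          const int *scanIdx, float *dst)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n && flag[i] == 1)
          dst[scanIdx[i]] = src[i];
  }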

Data Parallel Algorithms – FindUniq
— FindUniq:
Removing duplicates from an array – takes in a set and produces an equal or smaller set of unique values
MISSISSIPPI
MISP
— How do we perform “find unique” in parallel?
— How do we “find unique” sequentially?
— Sort
— Iterate through and copy
IIIIMPPSSSS
IMPS
18-646 – How to Write Fast Code II
35

Data Parallel Algorithms – FindUniq
— FindUniq:
Removing duplicates from an array – takes in an array and produces an equal or shorter array
— How do we perform “find unique” in parallel?
— Sort
MISSISSIPPI
IIIIMPPSSSS
— Map – flag when the i-th and (i-1)-th elements differ (flag[0] = 1):
  1 0 0 0 1 1 0 1 0 0 0
— Scan – create compaction index:
  0 1 1 1 1 2 3 3 4 4 4
— Map – copy to new array: I M P S
(a kernel sketch of the flag step follows this slide)
18-646 – How to Write Fast Code II 36
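A minimal sketch of the flag-building map step above, run over the already-sorted array (names are illustrative; the scan and copy steps then proceed exactly as in the compaction sketch):

  // flag[i] = 1 when element i starts a new run of values in the sorted array
  __global__ void mark_unique(int n, const char *sorted, int *flag)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          flag[i] = (i == 0 || sorted[i] != sorted[i - 1]) ? 1 : 0;
  }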

Find Unique – Special case
— Special case: What if we know all the possible values the elements can take – such as the 26 letters of the alphabet?
— Sorting is the most expensive step
— How can we avoid sorting?

Traditional Approach (0.349 seconds)
— List Sorting: Sort (0.310)
— Duplicate Removal:
  — Cluster-boundary Detection (0.007)
  — Unique-index Prefix-scan (0.025)
  — Unique-list Gathering (0.007)
18-646 – How to Write Fast Code II 37

Find Unique – Special case
— Special case: What if we know all the possible values the elements can take – such as the 26 letters of the alphabet?
— Sorting is the most expensive step
— How can we avoid sorting?
— Setup a hash table, or lookup table:
  A B C … I J … M … P … S
  0 0 0 … 1 0 … 1 … 1 … 1

Traditional Approach (0.349 seconds)
— List Sorting: Sort (0.310)
— Duplicate Removal:
  — Cluster-boundary Detection (0.007)
  — Unique-index Prefix-scan (0.025)
  — Unique-list Gathering (0.007)

Alternative Approach (0.055 seconds)
— Hash insertion: Hash write (0.030)
— Duplicate Removal:
  — Unique-index Prefix-scan (0.020)
  — Unique-list Gathering (0.005)
18-646 – How to Write Fast Code II 38

Populating a Hash Flag Array
— Hash insertion:
  — Leverage the semantics of conflicting writes for non-atomic memory accesses
    — At least one conflicting write to a device memory location is guaranteed to succeed
    — Order of execution is undefined
    (NVIDIA CUDA C Programming Guide Version 4.0, Chapter 4.2, page 86)
— For setting flags, map the hash insertion to threads (a minimal kernel sketch follows this slide)
  — success of insertion in any order can achieve the goal

  Input: M I S S I S S I P P I
  Flags: A B C … I J … M … P … S
         0 0 0 … 1 0 … 1 … 1 … 1

Alternative Approach (0.055 seconds)
— Hash insertion: Hash write (0.030)
— Duplicate Removal:
  — Unique-index Prefix-scan (0.020)
  — Unique-list Gathering (0.005)
18-646 – How to Write Fast Code II 39
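A minimal sketch of the hash-insertion step above for the 26-letter special case (this assumes uppercase input and a zero-initialized flag array of 26 entries; conflicting non-atomic writes all store the same value, so whichever write wins is correct):

  __global__ void hash_insert(int n, const char *input, int *flag /* 26 entries, zeroed */)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          flag[input[i] - 'A'] = 1;   // plain (non-atomic) write; any winner is fine
  }

The unique list is then recovered with the same prefix-scan and gather steps as before, but over the 26 flag entries instead of the full sorted input.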

How is this relevant to writing fast code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms

Good Techniques
— Data structures
— Software Architecture
— Algorithms

— Introduced the manycore platform HW and SW mental models
— Introduced the terminologies for you to start FLIRTing with the technology
— Introduced design trade-offs in data structures
— Introduced parallel algorithms
18-646 – How to Write Fast Code II 40

18-646 – How to Write Fast Code II 41