18-646 – How to Write Fast Code II
Carnegie Mellon University
Outline
Compilation with CUDA
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Mini-Project One – Submission Date
The submission date for mini-project 1 will be postponed by one week due to the issues with the k-means assignment.
Mini-Project One Due: Monday, 03/15 @ 11:59PM
Mini-Project One Review: Thursday, 03/18
Mini-Project One – K-Means
Depending on when you ran the command:
$ cp /afs/andrew.cmu.edu/course/18/646/MP1/18646_MP1.tar.gz ~/
you may have retrieved a slightly different version of omp_kmeans.c
Version 1: 480s to run the 10 test cases on a GHC machine
Version 2: 78s to run the 10 test cases on a GHC machine
Mini-Project One – K-Means
Option 1: DO NOT USE _inline functions within your submitted code
Baseline: 480s to run the 10 test cases on a GHC machine
Goal: 5x speedup compared to baseline implementation (96s on a GHC machine)
90% of performance grade with 4x speedup (120s on a GHC machine)
80% of performance grade with 3x speedup (180s on a GHC machine)
70% of performance grade with 2x speedup (240s on a GHC machine)
Option 2: DO USE _inline functions within your submitted code
Baseline: 78s to run the 10 test cases on a GHC machine
Goal: 2x speedup compared to baseline implementation (38s on a GHC machine)
90% of performance grade with 1.5x speedup (56s on a GHC machine)
80% of performance grade with 1.2x speedup (64s on a GHC machine)
70% of performance grade with 1.1x speedup (70s on a GHC machine)
Updates to Course Schedule
Mini-Project 1.1 – Matrix Multiply
Mini-Project 1.2 – K-Means
Outline
Compilation with CUDA
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Compilation
Any source file containing CUDA language extensions must be compiled with NVCC
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like cudacc, g++, cl, …
NVCC outputs:
C code (host CPU Code)
Must then be compiled with the rest of the application using another tool
PTX
Object code directly
Or, PTX source, interpreted at runtime
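As a rough illustration of this split, here is a minimal sketch (not from the slides; the file name, kernel name, and launch parameters are made up) of a .cu source file and how NVCC might be invoked on it. NVCC separates the device code from the host code, compiles the device side to PTX/object code, and hands the host side to the host compiler:

    // scale.cu – a minimal CUDA source file (hypothetical example)
    __global__ void scale(int n, float* x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            x[i] *= a;                                   // device code, compiled to PTX
    }

    int main() {
        // host code, compiled with the host compiler (g++, cl, ...)
        // ... allocate and copy data with cudaMalloc/cudaMemcpy, then launch:
        // scale<<<(n + 255)/256, 256>>>(n, d_x, 2.0f);
        return 0;
    }

    // Typical invocations (adjust to your setup):
    //   $ nvcc -o scale scale.cu     # build an executable
    //   $ nvcc -ptx scale.cu         # emit PTX source only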
The CUDA Platform
[Figure: CUDA compilation flow. A C/C++ CUDA application (e.g., float4 me = gx[gtid]; me.x += me.y * me.z;) is compiled by NVCC into host CPU code and virtual PTX code. PTX (Parallel Thread eXecution) is a virtual machine and ISA that defines the programming model and the execution resources and state. A PTX-to-target compiler then translates the virtual PTX (e.g., mad.f32 $f1, $f5, $f3, $f1;) into physical GPU target code (e.g., ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; for G80).]
Outline
Compilation with CUDA
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
When to Use Shared Memory?
Find the absolute differences between neighboring elements of an array:
in:  5 2 3 4 1 8 3 1
out:   3 1 1 3 7 5 2   (starting at out[1])
CUDA function:
__global__ void absolute_diff( int n, float* in, float* out) {
    int idx = blockIdx.x*256 + threadIdx.x;
    if ((idx != 0) && (idx < n)) {
        out[idx] = abs(in[idx] - in[idx-1]);
    }
}
Opportunity: avoid loading elements of in from global memory twice
Synchronization: Caveats
__syncthreads()
waits until all threads in the thread block have reached this point, and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
used to coordinate communication between the threads of the same block
What’s wrong with this code?
__global__ void absolute_diff( int n, float* in, float* out) {
    int idx = blockIdx.x*256 + threadIdx.x;
    __shared__ float local_mem[256];
    if ((threadIdx.x != 0) && (idx < n)) {
        local_mem[threadIdx.x] = in[idx];
        __syncthreads();
        out[idx] = abs(local_mem[threadIdx.x] - local_mem[threadIdx.x-1]);
    } else if (threadIdx.x == 0) {
        // first thread: in[idx-1] is only in global memory
        out[idx] = abs(local_mem[threadIdx.x] - in[idx-1]);
    }
}
Synchronization
__syncthreads()
waits until all threads in the thread block have reached this point, and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
used to coordinate communication between the threads of the same block
Corrected code:
__global__ void absolute_diff( int n, float* in, float* out) {
    int idx = blockIdx.x*256 + threadIdx.x;
    __shared__ float local_mem[256];
    if ((threadIdx.x != 0) && (idx < n)) {
        local_mem[threadIdx.x] = in[idx];
    }
    __syncthreads();
    if ((threadIdx.x != 0) && (idx < n)) {
        out[idx] = abs(local_mem[threadIdx.x] - local_mem[threadIdx.x-1]);
    } else if (threadIdx.x == 0) {
        out[idx] = abs(local_mem[threadIdx.x] - in[idx-1]);
    }
}
Synchronization
[Figure: scopes of the memory fence functions. __threadfence_block() fences within one thread block; __threadfence() fences across the whole device (GPU); __threadfence_system() extends across the device and the host system (CPU). Here we will focus on a single GPU.]
Fence Synchronization
__threadfence_block()
waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block
__threadfence()
waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:
All threads in the thread block for shared memory accesses
All threads in the device for global memory accesses
__threadfence_system()
waits until all global and shared memory accesses made by the calling thread prior to __threadfence_system() are visible to
All threads in the thread block for shared memory accesses
All threads in the device for global memory accesses
Host threads for page-locked host memory accesses (see Section 3.2.4.3).
only supported by devices of compute capability 2.x.
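To make the fence semantics concrete, here is a minimal sketch (not from the slides; the variable and kernel names are made up, and it assumes both blocks are resident on the GPU at the same time, which holds here with only two small blocks) of a producer/consumer handoff through global memory. The fence guarantees the data write is visible device-wide before the flag write:

    __device__ volatile int result;       // data produced by one thread
    __device__ volatile int ready = 0;    // flag polled by another thread

    __global__ void handoff(int* out) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            result = 42;        // 1. produce the data
            __threadfence();    // 2. make it visible to all threads in the device ...
            ready = 1;          // 3. ... before publishing the flag
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (ready == 0) { }   // poll until the producer publishes
            *out = result;           // sees 42, not a stale value
        }
    }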
Atomics
An atomic function performs a read-modify-write atomic operation on a 32-bit or 64-bit word
Residing in global or shared memory
atomicCAS() atomicAdd() atomicSub() atomicExch() atomicMin() atomicMax()
atomicInc() atomicDec() atomicAnd() atomicOr() atomicXor()
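For example, here is a minimal sketch (not from the slides) of a histogram kernel where many threads may update the same bin; atomicAdd() makes each read-modify-write indivisible:

    __global__ void histogram(int n, const unsigned char* in, unsigned int* bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[in[i]], 1u);   // safe even when threads collide on a bin
    }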
Outline
Synchronization
Data Parallel Algorithms - Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Data Parallel Algorithms - Map
Map:
A function that applies a given function to each element of a list and returns a list of results
Example: applying Fn: y = 2x to each element
x: 0 1 2 ... 127 128 129 ... 255
y: 0 2 4 ... 254 256 258 ... 510
Two important properties:
Side-effect free: only returns a value; no modification of state shared with the rest of the application
Independent: each element is an independent piece of work, whose input does not depend on another element
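A minimal CUDA sketch of this map (not from the slides; the kernel name is made up), with one thread per element:

    __global__ void map_double(int n, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n)
            y[i] = 2.0f * x[i];   // side-effect free and independent per element
    }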
Data Parallel Algorithms - Reduce
Reduce:
A function that takes in a list of objects and builds up a return value
Important properties for parallel reduction:
Associativity: a+(b+c) == (a+b)+c
Allows elements to be reduced in parallel in a “tree”
In CUDA, the synchronization has to be managed by the programmer
a+b+c+d+e+f+g+h
= ( (a+b)+(c+d) )+( (e+f)+(g+h) )
= (a+b+c+d) + (e+f+g+h)
How Best to Implement Reduce?
What is an issue with this approach?
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x/2; stride >= 1; stride = stride >> 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
Minimize branch divergence
Elimination of __syncthreads()
__device__ void sum(float* g_idata, float* g_odata) {
    unsigned int tid = threadIdx.x;
    extern __shared__ float s_data[];
    // Assign initial value
    s_data[tid] = g_idata[...];
    __syncthreads();
    if (tid < 128)
        s_data[tid] += s_data[tid + 128];
    __syncthreads();
    if (tid < 64)
        s_data[tid] += s_data[tid + 64];
    __syncthreads();
    if (tid < 32) {
        // No __syncthreads(): 32 threads in each
        // warp execute in lock-step with each other
        volatile float* s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }
    // [Figure: tree-based reduction of a 256-element array in shared memory;
    //  each step halves the number of active threads: 256 -> 128 -> 64 -> 32 -> ... -> 1]
// Write result for this thread block to global memory
if (tid == 0)
g_odata[blockIdx.x] = s_data[0];
}
Data Parallel Algorithms - Scan
Scan (prefix-sum):
Takes a binary associative operator ⊕ with identity I, and an array of n elements
[a0, a1, ..., an-1]
and returns the ordered set
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]
Example: if ⊕ is addition, then scan on the set
[3 1 7 0 4 1 6 3]
returns the set
[0 3 4 11 11 15 16 22]
How fast can we do that?
To Implement Scan – Revisit Reduce
Any techniques for creating an O(log N) implementation?
Sequential Reduction
sum[1] = v[1]
For k = 2 to N
    sum[k] = sum[k-1] + v[k]
Parallel Reduction
How do you do a scan?
Scan Techniques
Note that in the vector reduction, at least half of the processors were idle in any step. Let's have them compute something!
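One way to realize this idea is the step-efficient (Hillis-Steele) scan, in which every thread does useful work at every step. Below is a minimal single-block sketch (not from the slides; kernel and variable names are made up), assuming the block size is a power of two and the kernel is launched with 2 * blockDim.x floats of dynamic shared memory:

    // Inclusive scan of one block of data using double-buffered shared memory.
    __global__ void scan_block(int n, const float* in, float* out) {
        extern __shared__ float temp[];          // 2 * blockDim.x floats
        int tid = threadIdx.x;
        int pout = 0, pin = 1;                   // which buffer is written this step
        temp[tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();
        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            pout = 1 - pout;                     // swap buffers
            pin  = 1 - pout;
            if (tid >= offset)
                temp[pout*blockDim.x + tid] = temp[pin*blockDim.x + tid]
                                            + temp[pin*blockDim.x + tid - offset];
            else
                temp[pout*blockDim.x + tid] = temp[pin*blockDim.x + tid];
            __syncthreads();
        }
        if (tid < n)
            out[tid] = temp[pout*blockDim.x + tid];
    }
    // Launch (hypothetical): scan_block<<<1, 256, 2*256*sizeof(float)>>>(n, d_in, d_out);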
Scan Libraries
Like sort, there exist many optimizations for scan
For writing fast scan implementations – use the Thrust library
Thrust library: C++ template library for CUDA
Now part of CUDA 4.0
#include <thrust/scan.h>
int data[6] = {1, 0, 2, 2, 1, 3};
thrust::exclusive_scan(data, data + 6, data); // in-place scan
// data is now {0, 1, 1, 3, 5, 6}
http://code.google.com/p/thrust/wiki/QuickStartGuide#Prefix-Sums
Outline
Synchronization
Data Parallel Algorithms – Map, Reduce, and Scan
Compositions of Data Parallel Algorithms
Compact
Find Unique
Building a Flag Array
Data Parallel Algorithms – Compact
Compaction:
Removing elements from an array – take in an array and produce a shorter array
Removing: [1] [5] [6] [8]
A B C D E F G H I J
A C D E H J
How do we perform removal in parallel?
Data Parallel Algorithms – Compact
Compaction:
Removing elements from an array – take in an array and produce a shorter array
How do we perform removal in parallel?
Map – create flags (“1” keep, “0” remove)
Scan – compute index
Map – copy to new array
if (flag[i] == 1){
    dst[ scanIdx[i] ] = src[i];
}
Removing: [1] [5] [6] [8]
index:   0 1 2 3 4 5 6 7 8 9
src:     A B C D E F G H I J
Flags:   1 0 1 1 1 0 0 1 0 1
scanIdx: 0 1 1 2 3 3 3 4 4 5
dest:    A C D E H J
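A minimal CUDA sketch of these steps (not from the slides; here the predicate simply keeps nonzero elements, and scanIdx is assumed to be the exclusive scan of flag, e.g., produced with thrust::exclusive_scan):

    // Map: build the flag array (1 = keep, 0 = remove)
    __global__ void build_flags(int n, const int* src, int* flag) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            flag[i] = (src[i] != 0) ? 1 : 0;
    }

    // Map: scatter the kept elements to their compacted positions
    __global__ void scatter_kept(int n, const int* src, const int* flag,
                                 const int* scanIdx, int* dst) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flag[i] == 1)
            dst[scanIdx[i]] = src[i];   // scanIdx[i] = number of kept elements before i
    }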
Data Parallel Algorithms – FindUniq
FindUniq:
Removing duplicates from an array – take in a set, produce an equal or smaller set of unique values
M I S S I S S I P P I
M I S P
How do we perform “find unique” in parallel?
How do we “find unique” sequentially?
Sort
Iterate through and copy
I I I I M P P S S S S
I M P S
Data Parallel Algorithms – FindUniq
FindUniq:
Removing duplicates from an array – take in an array, produce an equal or shorter array
How do we perform “find unique” in parallel?
Sort
M I S S I S S I P P I
I I I I M P P S S S S
Map – flag when the i-th and (i-1)-th elements differ (flag[0] = 1)
1 0 0 0 1 1 0 1 0 0 0
Scan – create compaction index
0 1 1 1 1 2 2 3 3 3 3
Map – copy to new array
I M P S
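On the host side, the same sort-then-deduplicate pipeline can be expressed compactly with Thrust (a sketch, not from the slides; it uses thrust::sort and thrust::unique in place of the explicit flag/scan/gather kernels):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/unique.h>

    // Returns the unique characters of the input, e.g., "MISSISSIPPI" -> {I, M, P, S}
    thrust::device_vector<char> find_unique(const char* input, int n) {
        thrust::device_vector<char> d(input, input + n);      // copy to the GPU
        thrust::sort(d.begin(), d.end());                      // I I I I M P P S S S S
        d.erase(thrust::unique(d.begin(), d.end()), d.end());  // keep first of each run
        return d;                                              // I M P S
    }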
Find Unique – Special case
Traditional Approach: 0.349 seconds
List Sorting: Sort (0.310)
Duplicate Removal: Cluster-boundary Detection (0.007), Unique-index Prefix-scan (0.025), Unique-list Gathering (0.007)
Special case: what if we know all the possible values the elements can take – such as the 26 letters of the alphabet?
Sorting is the most expensive step
How can we avoid sorting?
Find Unique – Special case
Special case: what if we know all the possible values the elements can take – such as the 26 letters of the alphabet?
Setup a hash table, or lookup table:
A B C … I J … M … P … S
0 0 0 … 1 0 … 1 … 1 … 1
Traditional Approach: 0.349 seconds
List Sorting: Sort (0.310)
Duplicate Removal: Cluster-boundary Detection (0.007), Unique-index Prefix-scan (0.025), Unique-list Gathering (0.007)
Alternative Approach: 0.055 seconds
Hash insertion: Hash write (0.030)
Duplicate Removal: Unique-index Prefix-scan (0.020), Unique-list Gathering (0.005)
Sorting is the most expensive step
How can we avoid sorting?
Populating a Hash Flag Array
Hash insertion:
Leverage the semantics of conflicting writes for non-atomic memory accesses
At least one conflicting write to a device memory location is guaranteed to succeed
Order of execution is undefined
NVIDIA CUDA C Programming Guide Version 4.0, Chapter 4.2 (page 86)
For setting flags, map the hash insertion to threads
Success of insertion in any order can achieve the goal
M I S S I S S I P P I
A B C … I J … M … P … S
0 0 0 … 1 0 … 1 … 1 … 1
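A minimal sketch of this hash insertion for the 26-letter special case (not from the slides; the kernel name and the "hash" are made up – here the hash is simply the letter's position in the alphabet):

    // Each thread writes a 1 into the flag slot of its element.
    // Conflicting writes of the same value need no atomics: at least one succeeds.
    __global__ void set_flags(int n, const char* in, int* flag) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            flag[in[i] - 'A'] = 1;   // e.g., 'M' -> slot 12
    }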
How is this relevant to writing fast code?
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Introduced the manycore platform HW and SW mental models
Introduced the terminologies for you to start FLIRTing with the technology
Introduced design trade-offs in data structures
Introduced parallel algorithms
Software Architecture