Parallel Computing with GPUs: Sorting and Libraries
Dr Paul Richmond
http://paulrichmond.shef.ac.uk/teaching/COM4521/
Last Week
We learnt about Performance optimisation
APOD cycle
Use of guided analysis to find important kernels
Use of guided analysis to find optimisation routes for code
Important Reminder
Guest lecture next week
MOLE Quiz next week 9.00am
Followed by 1 hour lab (assignment help and lab catchup)
Week 11:
No lecture (bank holiday Monday)
Lab for assignment help and GPU visualisation
Week 12:
No lecture or lab
Sorting Networks
Merge and Bitonic sort
Thrust Parallel Primitives Library
Applications of sorting (binning)
Serial Sorting Examples
Insertion Sort
Insert a new element into a sorted list.
E.g. [ 1 6 3 4 2 5 ]
[1] -> [1 6] -> [1 3 6] -> [1 3 4 6] -> [1 2 3 4 6] -> [1 2 3 4 5 6]
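A minimal serial sketch of insertion sort (illustrative C++, not taken from the slides):
void insertionSort(int *a, int n)
{
	for (int i = 1; i < n; i++) {
		int key = a[i];			//next element to insert
		int j = i - 1;
		while (j >= 0 && a[j] > key) {	//shift larger elements right
			a[j + 1] = a[j];
			j--;
		}
		a[j + 1] = key;			//insert into the gap
	}
}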
Bubble Sort
Sweep through the list, exchanging each pair of adjacent elements that is out of order
O(n²) worst case and average case, O(n) best case.
E.g. [ 1 6 3 4 2 5 ]
[1 6 3 4 2 5] -> [1 3 6 4 2 5] -> [1 3 4 6 2 5] -> [1 3 4 2 6 5] -> [1 3 4 2 5 6]
[1 3 2 4 5 6]
[1 2 3 4 5 6]
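A minimal serial sketch of bubble sort (illustrative C++, not taken from the slides); after sweep k the largest k elements are in their final positions:
void bubbleSort(int *a, int n)
{
	for (int sweep = 0; sweep < n - 1; sweep++) {
		for (int i = 0; i < n - 1 - sweep; i++) {
			if (a[i] > a[i + 1]) {	//compare and swap adjacent pair
				int tmp = a[i];
				a[i] = a[i + 1];
				a[i + 1] = tmp;
			}
		}
	}
}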
Classifying Sort Techniques/Implementations
Data driven
Each step of the algorithm depends on the result of the previous step
Highly serial
Data independent
The algorithm performs fixed steps and does not change its processing based on the data
Well suited to parallel implementations
Can be expressed as a sorting network…
Sorting Networks
A sorting network is a comparator network that sorts all input sequences
by following the same fixed sequence of comparison stages
Consider the previous Bubble Sort [ 1 6 3 4 2 5 ]
[1 6 3 4 2 5] -> [1 3 6 4 2 5] -> [1 3 4 6 2 5] -> [1 3 4 2 6 5] -> [1 3 4 2 5 6]
[1 3 4 2 5 6] -> [1 3 4 2 5 6] -> [1 3 2 4 5 6] -> [1 3 2 4 5 6]
[1 3 2 4 5 6] -> [1 2 3 4 5 6] -> [1 2 3 4 5 6]
[1 3 2 4 5 6] -> [1 2 3 4 5 6]
[1 2 3 4 5 6]
[Figure: bubble sort as a sorting network for input [1 6 3 4 2 5]; sweeps 1–5 of compare-and-swap units produce the sorted output [1 2 3 4 5 6]. Legend: not considered, compared not swapped, compared and swapped.]
Sorting Networks
And Insertion Sort…
[1 6 3 4 2 5]
[1 3 6 4 2 5] -> [1 3 6 4 2 5]
[1 3 4 6 2 5] -> [1 3 4 6 2 5] -> [1 3 4 6 2 5]
[1 3 4 2 6 5] -> [1 3 2 4 6 5] -> [1 2 3 4 6 5] -> [1 2 3 4 6 5]
[1 2 3 4 5 6] -> [1 2 3 4 5 6] -> [1 2 3 4 5 6] -> [1 2 3 4 5 6] -> [1 2 3 4 5 6]
[Figure: insertion sort as a sorting network for input [1 6 3 4 2 5]; sweeps 1–5 produce the sorted output [1 2 3 4 5 6]. Legend: not considered, compared not swapped, compared and swapped.]
Parallel Sorting Networks
Bubble and Insertion networks in parallel: [1 6 3 4 2 5]
[1 3 6 4 2 5]
[1 3 4 6 2 5]
[1 3 4 2 6 5]
[1 3 2 4 5 6]
[1 2 3 4 5 6]
[1 2 3 4 5 6]
[1 2 3 4 5 6]
[1 2 3 4 5 6]
Sweeps = 9
Parallel Bubble and Insertion sorting
network is still not very efficient
2𝑛 − 3 sweeps
𝑛(𝑛 − 1)/2 comparisons – O(𝑛²) complexity
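A closely related data independent formulation is odd-even transposition sort, which performs the adjacent compare-and-swaps of a sweep in parallel. A minimal CUDA sketch (kernel name and launch configuration are assumptions, not the lecture's code):
//one parallel sweep: even sweeps compare pairs (0,1),(2,3),..., odd sweeps compare (1,2),(3,4),...
__global__ void oddEvenSweep(int *data, int n, int parity)
{
	int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2 + parity;
	if (i + 1 < n) {
		int a = data[i];
		int b = data[i + 1];
		if (a > b) {			//compare and swap the adjacent pair
			data[i] = b;
			data[i + 1] = a;
		}
	}
}
//host side: n sweeps guarantee a sorted result
//for (int s = 0; s < n; s++)
//	oddEvenSweep<<<(n / 2 + 255) / 256, 256>>>(d_data, n, s % 2);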
Sorting Networks
Merge and Bitonic sort
Thrust Parallel Primitives Library
Applications of sorting (binning)
Merge Sort
To reduce the 𝑂(𝑛²) overhead we need a better sorting network
The odd-even merge sort network (for power of 2 𝑛)
Sort all odd and even keys separately and then merge them
Merge a sorted sequence of elements on lines < 𝑎1, … , 𝑎𝑛 > with those on
lines < 𝑎𝑛+1, … , 𝑎2𝑛 >
Each merge requires log(𝑛) passes
Total complexity of 𝑂(𝑛 log²(𝑛)) comparisons
[Figure: merging two sorted sequences as a network; 𝑛 = 1 needs one pass, 𝑛 = 2 needs two passes, 𝑛 = 4 needs three passes]
Merge Sorting (n=8)
[Figure: three-stage merge sorting network for n = 8; stages 1 and 2 produce two sorted lists of four elements, stage 3 merges them into a single sorted output]
Merge Sorting (n=8) example
[Figure: worked example; the input [8 1 5 3 6 2 4 9] passes through stages 1–3 to give the sorted output [1 2 3 4 5 6 8 9]]
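For reference, a host-side C++ sketch of Batcher's odd-even merge sort (illustrative, not the lecture's code; assumes n is a power of two), showing the fixed, data independent compare-exchange pattern the network diagrams describe:
#include <algorithm>
#include <vector>
#include <cstdio>

static void compareExchange(std::vector<int> &a, int i, int j)
{
	if (a[i] > a[j]) std::swap(a[i], a[j]);	//comparator: order the pair
}

static void oddEvenMerge(std::vector<int> &a, int lo, int n, int stride)
{
	int m = stride * 2;
	if (m < n) {
		oddEvenMerge(a, lo, n, m);		//merge the even subsequence
		oddEvenMerge(a, lo + stride, n, m);	//merge the odd subsequence
		for (int i = lo + stride; i + stride < lo + n; i += m)
			compareExchange(a, i, i + stride);
	} else {
		compareExchange(a, lo, lo + stride);
	}
}

static void oddEvenMergeSort(std::vector<int> &a, int lo, int n)
{
	if (n > 1) {
		int m = n / 2;
		oddEvenMergeSort(a, lo, m);		//sort the left half
		oddEvenMergeSort(a, lo + m, m);		//sort the right half
		oddEvenMerge(a, lo, n, 1);		//merge the two sorted halves
	}
}

int main()
{
	std::vector<int> a = {8, 1, 5, 3, 6, 2, 4, 9};
	oddEvenMergeSort(a, 0, (int)a.size());
	for (int v : a) printf("%d ", v);		//prints 1 2 3 4 5 6 8 9
	return 0;
}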
Limitations of Merge Sort?
What is potentially wrong with a merge sort GPU implementation?
Hint: Think about workload per thread
Limitations of Merge Sort
What is potentially wrong with a merge sort GPU implementation?
Irregular memory accesses
Not all values are compared in each pass (uneven workload per thread)
Solution: Bitonic Sort
Bitonic sorting network
Iterative splitting and merging of inputs into increasingly large bitonic sequences
A sequence is bitonic if
There is an 𝑖 such that 𝑎0, … , 𝑎𝑖 is monotonically increasing and 𝑎𝑖 , … , 𝑎𝑛 is monotonically decreasing
[Figure: a bitonic sequence, increasing up to index 𝑖 and decreasing afterwards]
Bitonic Sorting Network
Sorting and merging increasingly large bitonic sequences
When 𝑛 = 2^𝑘 there are 𝑘 levels with 𝑛/2 comparisons each
GPU Implementation
Regular access strides 🙂
Efficiently balanced workload 🙂
Requires multiple kernel launches to merge over 𝑛 > block size
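A minimal CUDA sketch of one bitonic compare-and-swap step over global memory (illustrative; kernel name and launch configuration are assumptions, not the lecture's code). The full sort launches this kernel once per (k, j) pair, which is where the multiple kernel launches mentioned above come from:
//assumes n is a power of two and n threads are launched
__global__ void bitonicStep(int *data, unsigned int j, unsigned int k)
{
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int partner = i ^ j;			//element to compare against
	if (partner > i) {
		bool ascending = ((i & k) == 0);	//direction of this subsequence
		if ((ascending && data[i] > data[partner]) ||
		    (!ascending && data[i] < data[partner])) {
			int tmp = data[i];		//swap the out-of-order pair
			data[i] = data[partner];
			data[partner] = tmp;
		}
	}
}
//host side:
//for (unsigned int k = 2; k <= n; k <<= 1)	//size of bitonic sequences
//	for (unsigned int j = k >> 1; j > 0; j >>= 1)	//comparison stride
//		bitonicStep<<<n / 256, 256>>>(d_data, j, k);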
Sorting Networks
Merge and Bitonic sort
Thrust Parallel Primitives Library
Applications of sorting (binning)
CUDA libraries
Abstract CUDA model away from programmer
Highly optimised implementations of common tools
Mainly focused on linear algebra
[Figure: library stack; applications build on CUDA either directly or through libraries such as Thrust/CUB, CUSP, cuBLAS, cuRAND, cuFFT and cuSPARSE]
Thrust
Template Library for CUDA
Implements many parallel primitives (scan, sort, reduction
etc.)
Part of standard CUDA release
Level of Abstraction which hides kernels, mallocs and
memcpy’s
Designed for C++ programmers
Similar in design and operation to the C++ Standard Template Library (STL)
Only a small amount of C++ is required
Thrust containers
Thrust uses only high level vector containers
host_vector: on host
device_vector: on GPU
Other STL containers include
deque
list
stack
queue
priority_queue
set
multiset
map
multimap
bitset
STL containers can be used to initialise a Thrust vector
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
	//create a vector on the host
	thrust::host_vector<int> h_vec(10);
	//create a vector on the device
	thrust::device_vector<int> d_vec = h_vec;
	//device data manipulated directly from host
	for (int i = 0; i < 10; i++)
		d_vec[i] = i;
	//vector memory automatically released
	return 0;
}

Thrust Iterators
They point to regions of a vector
Can be used like pointers
Explicit cast when dereferencing very important

//d_vec contains {0, 1, 2, ...} as in the previous example
thrust::device_vector<int>::iterator begin = d_vec.begin();
thrust::device_vector<int>::iterator end = d_vec.end() - 1;
printf("d_vec at begin=%d", (int)*begin);
begin++;	//move on a single position
printf("d_vec at begin++=%d", (int)*begin);
*end = 88;
printf("d_vec at end=%d", (int)*end);

Output: d_vec at begin=0 d_vec at begin++=1 d_vec at end=88

Thrust Iterators
BUT not exactly the same as a pointer
Can be converted to a raw pointer

int * d_ptr = thrust::raw_pointer_cast(&d_vec[0]);
kernel<<<blocks, threads>>>(d_ptr);	//e.g. pass to a kernel taking an int* argument

Raw pointers can be used in Thrust

int * d_ptr;
cudaMalloc((void**)&d_ptr, N * sizeof(int));
thrust::device_ptr<int> dev_ptr = thrust::device_ptr<int>(d_ptr);
//or
thrust::device_ptr<int> dev_ptr(d_ptr);
cudaFree(d_ptr);

Thrust Algorithms
Transformations: application of a function to each element within the range of a vector
Reduction: reduction of a set of values to a single value using a binary associative operator
Can also be used to count occurrences of a value
Prefix Sum: both inclusive and exclusive scans
Sort: can sort keys or key value pairs
Binary Search: position of a target value

Thrust Transformations
Some examples of the many transformations

//copy the first 10 elements of d_vec to d_vec_cpy
thrust::copy(d_vec.begin(), d_vec.begin() + 10, d_vec_cpy.begin());
//set the first 10 elements of d_vec to 0
thrust::fill(d_vec.begin(), d_vec.begin() + 10, 0);
//fill the first 10 elements with random values (rand is a predefined Thrust generator)
thrust::generate(d_vec.begin(), d_vec.begin() + 10, rand);
//fill d_vec with {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
thrust::sequence(d_vec.begin(), d_vec.begin() + 10);
//all occurrences of the value 1 are replaced with the value 10
thrust::replace(d_vec.begin(), d_vec.end(), 1, 10);

Thrust Algorithms
Either in-place or to an output vector

thrust::device_vector<int> d_vec(10);
thrust::device_vector<int> d_vec_out(10);
//fill d_vec with {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
thrust::sequence(d_vec.begin(), d_vec.begin() + 10);
//inclusive scan to output vector
thrust::inclusive_scan(d_vec.begin(), d_vec.end(), d_vec_out.begin());
//inclusive scan in place
thrust::inclusive_scan(d_vec.begin(), d_vec.end(), d_vec.begin());
//generate random data (actually a transformation)
thrust::generate(d_vec.begin(), d_vec.end(), rand);
//sort in place
thrust::sort(d_vec.begin(), d_vec.end());

Custom Transformations

thrust::device_vector<int> d_vec(10);
thrust::device_vector<int> d_vec_out(10);
//fill d_vec with {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
thrust::sequence(d_vec.begin(), d_vec.begin() + 10);
//declare a custom operator
struct add_5{
	__host__ __device__ int operator()(int a){
		return a + 5;
	}
};
add_5 func;
//apply custom transformation
thrust::transform(d_vec.begin(), d_vec.end(), d_vec_out.begin(), func);
//d_vec_out is now {5, 6, 7, 8, 9, 10, 11, 12, 13, 14}

Thrust Fusion
For best performance it is necessary to fuse operations

//unfused: a transform followed by a separate reduction
struct absolute{
	__host__ __device__ int operator()(int a){
		return a < 0 ? -a : a;
	}
};
absolute func;
//custom transformation to calculate absolute value
thrust::transform(d_vec.begin(), d_vec.end(), d_vec.begin(), func);
//apply reduction, maximum binary associative operator
int result = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::maximum<int>());

//fused: the same result in a single pass
struct absolute{
	__host__ __device__ int operator()(int a){
		return a < 0 ? -a : a;
	}
};
absolute func;
//apply transform and reduction with the maximum binary associative operator in one step
int result = thrust::transform_reduce(d_vec.begin(), d_vec.end(), func, 0, thrust::maximum<int>());
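Putting these pieces together, a small self-contained sketch (illustrative; the initial values and variable names are assumptions) that fills a vector, takes absolute values and finds the maximum with a fused transform_reduce:
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

struct absolute {
	__host__ __device__ int operator()(int a) {
		return a < 0 ? -a : a;
	}
};

int main()
{
	thrust::device_vector<int> d_vec(10);
	//fill d_vec with {-5, -4, ..., 4}
	thrust::sequence(d_vec.begin(), d_vec.end(), -5);
	//fused transform (absolute value) and reduce (maximum)
	int result = thrust::transform_reduce(d_vec.begin(), d_vec.end(),
	                                      absolute(), 0, thrust::maximum<int>());
	printf("max absolute value = %d\n", result);	//prints 5
	return 0;
}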
Sorting Networks
Merge and Bitonic sort
Thrust Parallel Primitives Library
Applications of sorting (binning)

Sorting and parallel primitives
Can be very useful for building data structures
Remember Gather vs Scatter
Very common in particle simulations etc.
We can use prefix sum for writing multiple values per element
What if our outputs are scattered to output memory locations?
Outputs might represent spatial bins
Scatter operation
Write to a number of locations
Random access write?
E.g. for location 5, multiple threads want to write to this index
How to read multiple values afterwards?
[Figure: threads (threadIdx.x 0–7) scattering values to memory locations 0–7]

Binning and Sorting
Desired Write_Index for each thread (threadIdx.x 0–7): 0 2 1 4 6 5 5 7
thrust::sort(Write_Index): 0 1 2 4 5 5 6 7
Count(Write_Index): 1 1 1 0 1 2 1 1
thrust::inclusive_scan(count): 0 1 2 3 3 4 6 7
We can now read varying values from each bin
The scan gives a starting index of 4 for write_index 5
Iterate from index 4 for a count of 2 to find all values of write_index 5

Build a data structure
Unique write indices, i.e. how many values fall in each bin, plus the scanned start index of each bin

Particle interaction example
As with the previous slide, use sorting
Divide the environment according to some interaction radius
Output particle key value pairs (keys are location, determined through some hash function)
Sort keys
Reorder particles based on key pairs
Generate a partition boundary table (histogram count and prefix sum)
Each particle needs to read all particles in its own location and any neighbouring location
Guarantees particle interactions within the interaction radius
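To make the binning concrete, a small Thrust sketch (illustrative; variable names are assumptions, and a vectorised binary search is used here in place of the histogram count, but it reproduces the start indices and counts shown above, i.e. the scan column is equivalent to an exclusive scan of the counts):
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/binary_search.h>

int main()
{
	const int NUM_BINS = 8;
	//desired write index (bin) for each of the 8 values, as in the slide example
	int h_idx[8] = {0, 2, 1, 4, 6, 5, 5, 7};
	thrust::device_vector<int> write_index(h_idx, h_idx + 8);

	//sort so that values destined for the same bin are contiguous
	thrust::sort(write_index.begin(), write_index.end());

	//for every bin, find where its values start and end in the sorted list
	thrust::device_vector<int> bins(NUM_BINS);
	thrust::device_vector<int> bin_start(NUM_BINS);
	thrust::device_vector<int> bin_end(NUM_BINS);
	thrust::sequence(bins.begin(), bins.end());	//bins = {0, 1, ..., 7}
	thrust::lower_bound(write_index.begin(), write_index.end(),
	                    bins.begin(), bins.end(), bin_start.begin());
	thrust::upper_bound(write_index.begin(), write_index.end(),
	                    bins.begin(), bins.end(), bin_end.begin());
	//bin_start is {0, 1, 2, 3, 3, 4, 6, 7}, the scanned start indices from the slide
	//bin_end[b] - bin_start[b] gives the count for bin b, e.g. 2 values in bin 5
	return 0;
}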
Summary
Sorting networks allow data independent sort algorithms to map easily to parallel architectures
Choice of a sorting network will dictate the memory access pattern and hence the performance on a GPU
Merge sort and Bitonic sort are popular choices for GPUs
Thrust implements many parallel primitives
Thrust is based on the idea of containers, iterators, transformations and algorithms
Sorting can be used to improve complex problems such as particle systems over a fixed range
Acknowledgements and Further Reading
Comparison of sorting approaches on the GPU: http://arxiv.org/ftp/arxiv/papers/1511/1511.03404.pdf
Expressive algorithmic programming with Thrust: https://devblogs.nvidia.com/parallelforall/expressive-algorithmic-programming-thrust/