Microsoft PowerPoint – Chapter 9 – Sorting
Introduction to
Parallel Computing
George Karypis
Sorting
Outline
Background
Sorting Networks
Quicksort
Bucket-Sort & Sample-Sort
Background
Input Specification
Each processor has n/p elements
A ordering of the processors
Output Specification
Each processor will get n/p consecutive elements of
the final sorted array.
The “chunk” is determined by the processor ordering.
Variations
Unequal number of elements on output.
In general, this is not a good idea and it may require a shift to
obtain the equal size distribution.
Basic Operation:
Compare-Split Operation
Single element per processor
Multiple elements per processor
Sorting Networks
Sorting is one of the fundamental problems in
Computer Science
For a long time researchers have focused on the
problem of “how fast can we sort n elements”?
Serial
nlog(n) lower-bound for comparison-based sorting
Parallel
O(1), O(log(n)), O(???)
Sorting networks
Custom-made hardware for sorting!
Hardware & algorithm
Mostly of theoretical interest but fun to study!
Elements of Sorting Networks
Key Idea:
Perform many comparisons in
parallel.
Key Elements:
Comparators:
Consist of two-input, two-output
wires
Take two elements on the input
wires and outputs them in sorted
order in the output wires.
Network architecture:
The arrangement of the
comparators into interconnected
comparator columns
similar to multi-stage networks
Many sorting networks have been
developed.
Bitonic sorting network
Θ(log2(n)) columns of
comparators.
Bitonic Sequence
Bitonic sequences are
graphically represented
by lines as follows:
1
2
7
0
4
6
Why Bitonic Sequences?
A bitonic sequence can be “easily” sorted in
increasing/decreasing order.
s s1 s2
• Every element of s1 will be less than or equal to every element of s2
• Both s1 and s2 are bitonic sequences.
• So how can a bitonic sequence be sorted?
Bitonic
Split
An example
Bitonic Merging Network
A comparator network that
takes as input a bitonic
sequence and performs a
sequence of bitonic splits
to sort it.
+BM[n]
A bitonic merging
network for sorting in
increasing order an n-
element bitonic
sequence.
-BM[n]
Similar sort in decreasing
order.
Are we done?
Given a set of elements, how do we re-arrange them into
a bitonic sequence?
Key Idea:
Use successively larger bitonic networks to transform the set into
a bitonic sequence.
An example
Complexity
How many columns of
comparators are required
to sort n=2l elements?
i.e., depth d(n) of the
network?
Bitonic Sort on a Hypercube
One-element-per-processor case
How do we map the algorithm onto a hypercube?
What is the comparator?
How do the wires get mapped?
What can you say about the
pairs of wires that are inputs
to the various comparators?
Illustration
Communication Pattern
Algorithm
Complexity?
Bitonic Sort on a Mesh
One-element-per-processor case
How do the wires get mapped?
Which one is better?
Why?
Row-Major Shuffled Mapping
Complexity?
Can we do better?
What is the lowest bound of sorting on a mesh?
communication performed by each process
More than one element per
processor
Hypercube
Mesh
Bitonic Sort Summary
Quicksort
Parallel Formulation
How about recursive decomposition?
Is it a good idea?
We need to do the partitioning of the array around
a pivot element in parallel.
What is the lower bound of parallel
quicksort?
What will it take to achieve this lower bound?
Optimal for CRCW PRAM
One element per processor
Arbitrary resolution of the concurrent writes.
Views the sorting as a two-step process:
(i) Constructing a binary tree of pivot elements
(ii) Obtaining the sorted sequence by performing an inorder
traversal of this binary tree.
Building the Binary Tree
Complexity?
Practical Quicksort
Shared-memory
Data resides on a shared array.
During a partitioning each
processor is responsible for a
certain portion.
Array Partitioning:
Select & Broadcast pivot.
Local re-arrangement.
Is this required?
Global re-arrangement.
Efficient Global Rearrangement
Practical Quicksort
Complexity
Complexity for message-passing is similar assuming that the all-to-all
personalized communication is not cross-bisection bandwidth limited.
A word on Pivot Selection
Selecting pivots that lead to balanced
partitions is importance
height of the tree
effective utilization of processors
Sample Sort
Generalization of bucket sort with data-driven sampling
n/p elements per-processor.
Each processor sorts is local elements.
Each processor selects p-1 equally spaced elements from its
own list.
The combined p(p-1) set of elements are sorted and p-1 equally
spaced elements are selected from that list.
Each processor splits its own list according to these splitters into
p buckets.
Each processor sends its ith bucket to the ith processor.
Each processor merges the elements that it receives.
Done.
Sample Sort Illustration
Sample Sort Complexity
Assumes
a serial sort