PowerPoint Presentation
Sorting and Hashing
See R&G Chapters:
9.1, 13.1-13.3, 13.4.2
1
Why Sort?
“Rendezvous”
Eliminating duplicates (DISTINCT)
Grouping for summarization (GROUP BY)
Upcoming sort-merge join algorithm
Ordering
Sometimes, output must be ordered (ORDER BY)
e.g., return results ranked in decreasing order of relevance
First step in bulk-loading tree indexes
Problem: sort 100GB of data with 1GB of RAM.
why not virtual memory?
Out-of-Core Algorithms
Two themes
Single-pass streaming data through RAM
Divide (into RAM-sized chunks) and Conquer
Single-pass Streaming
Simple case: “Map”.
Goal: Compute f(x) for each record, write out the result
Challenge: minimize RAM, call read/write rarely
Approach
Read a chunk from INPUT to an Input Buffer
Write f(x) for each item to an Output Buffer
When Input Buffer is consumed, read another chunk
When Output Buffer fills, write it to OUTPUT
f(x)
RAM
Input
Buffer
Output
Buffer
OUTPUT
INPUT
4
Better: Double Buffering pt 1
Main thread runs f(x) on one pair I/O bufs
2nd I/O thread drains/fills unused I/O bufs in parallel
Why is parallelism available?
Theme: I/O handling usually deserves its own thread
Main thread ready for a new buf? Swap!
RAM
Input
Buffers
Output
Buffers
f(x)
OUTPUT
INPUT
I/O
5
Better: Double Buffering pt 2
Main thread runs f(x) on one pair I/O bufs
2nd I/O thread drains/fills unused I/O bufs in parallel
Why is parallelism available?
Theme: I/O handling usually deserves its own thread
Main thread ready for a new buf? Swap!
RAM
Input
Buffers
Output
Buffers
OUTPUT
INPUT
I/O
f(x)
6
Double Buffering applies to all streams
Usable in any of the subsequent discussion
Assuming you have RAM buffers to spare!
But for simplicity we won’t bring this up again.
RAM
Input
Buffers
Output
Buffers
OUTPUT
INPUT
I/O
f(x)
7
Sorting & Hashing: Formal Specs
Given:
A file F:
containing a multiset of records R
consuming N blocks of storage
Two “scratch” disks
each with >> N blocks of free storage
A fixed amount of space in RAM
memory capacity equivalent to B blocks of disk
Sorting
Produce an output file FS
with contents R stored in order by a given sorting criterion
Hashing
Produce an output file FH
with contents R, arranged on disk so that no 2 records that have the same hash value are separated by a record with a different hash value.
I.e. matching records are always “stored consecutively” in FH.
8
Sorting: 2-Way (a strawman)
Pass 0 (conquer a batch):
read a page, sort it, write it.
only one buffer page is used
a repeated “batch job”
RAM
I/O
Buffer
OUTPUT
INPUT
Sort in place
OUTPUT
INPUT
9
5
Sorting: 2-Way (a strawman), cont
Pass 0 (conquer a batch):
read a page, sort it, write it.
only one buffer page is used
a repeated “batch job”
Pass 1, 2, 3, …, etc. (merge via streaming):
requires 3 buffer pages
note: this has nothing to do with double buffering!
merge pairs of runs into runs twice as long
a streaming algorithm, as in the previous slide!
Drain/fill buffers as the data streams through them
3
Output
Buffer
OUTPUT
INPUT
1
Input Buffer
2
Input Buffer
10
5
Two-Way External Merge Sort
Conquer and Merge:
sort subfiles and merge
Each pass we read + write each page in file (2N)
N pages in the file.
So, the number of passes is:
So total cost is:
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS 0
PASS 1
PASS 2
PASS 3
9
3,4
6,2
9,4
8,7
5,6
3,1
2
3,4
5,6
2,6
4,9
7,8
1,3
2
2,3
4,6
4,7
8,9
1,3
5,6
2
2,3
4,4
6,7
8,9
1,2
3,5
6
1,2
2,3
3,4
4,5
6,6
7,8
11
6
General External Merge Sort
More than 3 buffer pages. How can we utilize them?
Big batches in pass 0, many streams in merge passes
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages. Produce sorted runs of B pages each.
Pass 1, 2, …, etc.: merge B-1 runs at a time.
1
…
B-1
B
1
…
⌈N/B⌉
Conquer
Merge
Sorted Runs
length B
Sorted Runs
Length B(B-1)
Pass 0
Pass 1, …
12
7
Cost of External Merge Sort
Number of passes:
Cost = 2N * (# of passes)
E.g., with 5 buffer pages, to sort 108 page file:
Pass 0: = 22 sorted runs of 5 pages each
last run is only 3 pages
Pass 1: = 6 sorted runs of 20 pages each
last run is only 8 pages
Pass 2: 2 sorted runs, 80 pages and 28 pages
Pass 3: Sorted file of 108 pages
Formula check: 1+┌log4 22┐= 1+3 4 passes √
13
8
# of Passes of External Sort
( I/O cost is 2N times number of passes)
14
9
Memory Requirement for External Sorting
How big of a table can we sort in two passes?
Each “sorted run” after Phase 0 is of size B
Can merge up to B-1 sorted runs in Phase 1
Answer: B(B-1).
Sort N pages of data in about space
1
…
B-1
B
1
…
⌈N/B⌉
Conquer
Merge
Sorted Runs
length B
Sorted Runs
Length B(B-1)
Pass 0
Pass 1, …
15
Alternative: Hashing
Idea:
Many times we don’t require order
E.g.: removing duplicates
E.g.: forming groups
Often just need to rendezvous matches
Hashing does this
But how to do it out-of-core??
Divide
Streaming Partition (divide):
Use a hash function hp to stream records to disk partitions
All matches rendezvous in the same partition.
Each partition a mix of values
Streaming alg to create partitions on disk:
“Spill” partitions to disk via output buffers
Conquer
ReHash (conquer):
Read partitions into RAM hash table one at a time, using hash f’n hr
Each bucket contains a small number of distinct values
Then read out the RAM hash table buckets and write to disk
Ensuring that duplicate values are contiguous
Two Phases: Divide
Partition:
(Divide)
Original
Relation
OUTPUT
2
INPUT
1
hash
function
hp
B-1
Partitions
1
2
B-1
. . .
B main memory buffers
19
Two Phases: Conquer
Rehash:
(Conquer)
Partitions
Hash table for partition
Ri (k <= B pages)
Output
Relation
Hash partitions hp of
size ~N/(B-1)
B main memory buffers
Hash partitions hp of
size ~N/(B-1)
hash
function
hr
Hash partitions hr
Fully hashed!
20
Cost of External Hashing
cost = 2*N*(#passes) = 4*N IO’s
(includes initial read, final write)
Conquer (hr)
Divide (hp)
Hash partitions hp of
size ~N/(B-1)
Hash partitions hr
Fully hashed!
B-1
1
…
B
21
Memory Requirement
How big of a table can we hash in two passes?
B-1 “partitions” result from Pass 1
Each should be no more than B pages in size
Answer: B(B-1).
We can hash a table of size N pages in about space
Note: assumes hash function distributes records evenly!
Have a bigger table? Recursive partitioning!
Conquer (hr)
Divide (hp)
Hash partitions hp of size ~N/(B-1)
Hash partitions hr
Fully hashed!
B-1
1
…
B
22
Recursive Partitioning, Pt 1
Divide (hp1)
B big!
Divide (hp)
Recursive Partitioning, Pt 2
Conquer (hr)
B big!
Divide (hp)
Divide (hp1)
Recursive Partitioning, Pt 3
B big!
Divide (hp)
Divide (hp1)
Conquer (hr)
B big!
A Wrinkle: Duplicates
Consider a dataset with a very frequent key
E.g. in a big table, consider the gender column
What happens during recursive partitioning?
M
F
other
Divide (hp)
M
Divide (hp1)
26
Question…
How does external hashing compare
with external sorting?
Cost of External Hashing
cost = 4*N IO’s
(including initial read, final write)
Divide
Conquer
28
Cost of External Sorting
cost = 4*N IO’s
(including initial read, final write)
Divide
Conquer
29
Parallelize me! Hashing Phase 1
Phase 1: shuffle data across machines (hn)
streaming out to network as it is scanned
which machine for this record?
use (yet another) independent hash function hn
hn
Parallelize me! Hashing Phase 2
Phase 1: shuffle data across machines (hn)
Receivers proceed with phase 1as data streams in
from local disk and network
hp
hr
hn
Parallelize me! Sorting
Pass 0: shuffle data across machines
streaming out to network as it is scanned
which machine for this record?
Split on value range (e.g. [-∞,10], [11,100], [101, ∞]).
range
Parallelize me! Sorting, cont
Pass 0: shuffle data across machines
Receivers proceed with pass 0as the data streams in
A Wrinkle: How to ensure ranges are the same #pages?!
i.e. avoid data skew?
range
So which is better ??
Simplest analysis:
Same memory requirement for 2 passes
Same I/O cost
But we can dig a bit deeper…
34
Sorting vs Hashing
Sorting pros:
Great if we need output to be sorted anyway
Not sensitive to duplicates or “bad” hash functions
Hashing pros:
For duplicate elimination, scales with # of values
Delete dups in first pass while partitioning on hp
Vs. sort which scales with # of items!
Easy to shuffle equally in parallel case
Summary
Sort/Hash Duality
Hashing is Divide & Conquer
Sorting is Conquer & Merge
Sorting is overkill for rendezvous
But sometimes a win anyhow
Don’t forget one pass streaming and double buffering
Can “hide” the latency of I/O behind CPU work
é
ù
1
log
2
+
=
N
é
ù
(
)
1
log
2
2
+
N
N
N / B⎡⎢ ⎤⎥
N/B
é
ê
ù
ú
⎡ ⎤⎡ ⎤1 1+ −log /B N B
é ù
é ù
1
1
+
-
log /
B
NB
⎡ ⎤108 5/
é ù
1085/
⎡ ⎤22 4/
éù
224/
N B=3 B=5 B=9 B=17 B=129 B=257
100 7 4 3 2 1 1
1,000 10 5 4 3 2 2
10,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
N B=3 B=5 B=9 B=17 B=129 B=257
100 7 4 3 2 1 1
1,000 10 5 4 3 2 2
10,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
N
B=3
B=5
B=9
B=17
B=129
B=257
100
7
4
3
2
1
1
1,000
10
5
4
3
2
2
10,000
13
7
5
4
2
2
100,000
17
9
6
5
3
3
1,000,000
20
10
7
5
3
3
10,000,000
23
12
8
6
4
3
100,000,000
26
14
9
7
4
4
1,000,000,000
30
15
10
8
5
4
/docProps/thumbnail.jpeg