
Sorting and Hashing
See R&G Chapters:
9.1, 13.1-13.3, 13.4.2


Why Sort?
“Rendezvous”
Eliminating duplicates (DISTINCT)
Grouping for summarization (GROUP BY)
Upcoming sort-merge join algorithm
Ordering
Sometimes, output must be ordered (ORDER BY)
e.g., return results ranked in decreasing order of relevance
First step in bulk-loading tree indexes
Problem: sort 100GB of data with 1GB of RAM.
Why not just use virtual memory?

Out-of-Core Algorithms
Two themes
Single-pass streaming data through RAM
Divide (into RAM-sized chunks) and Conquer

Single-pass Streaming
Simple case: “Map”.
Goal: Compute f(x) for each record, write out the result
Challenge: minimize RAM use and issue read/write calls rarely
Approach
Read a chunk from INPUT to an Input Buffer
Write f(x) for each item to an Output Buffer
When Input Buffer is consumed, read another chunk
When Output Buffer fills, write it to OUTPUT

[Figure: records stream from INPUT through an Input Buffer in RAM, f(x) is applied to each, and results collect in an Output Buffer before being written to OUTPUT.]
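A minimal Python sketch of this streaming pattern (the block size, the newline-delimited record format, and the function name are illustrative assumptions, not from the slides):

BLOCK_SIZE = 4096  # stand-in for one disk page

def stream_map(in_path, out_path, f):
    # Read block-sized chunks, apply f per record, drain the output buffer when full.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        out_buf = bytearray()
        leftover = b""
        while True:
            chunk = src.read(BLOCK_SIZE)        # fill the input buffer
            if not chunk:
                break
            records = (leftover + chunk).split(b"\n")
            leftover = records.pop()            # partial record at chunk boundary
            for rec in records:
                out_buf += f(rec) + b"\n"       # compute f(x) for each record
                if len(out_buf) >= BLOCK_SIZE:  # output buffer full: write it out
                    dst.write(out_buf)
                    out_buf.clear()
        if leftover:
            out_buf += f(leftover) + b"\n"
        dst.write(out_buf)                      # flush the final partial buffer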

Better: Double Buffering
Main thread runs f(x) on one pair of I/O buffers
A 2nd I/O thread drains/fills the unused pair of I/O buffers in parallel
Why is parallelism available?
Theme: I/O handling usually deserves its own thread
Main thread ready for a new buf? Swap!

[Figure: RAM holds two Input Buffers and two Output Buffers; the main thread applies f(x) to one pair while the I/O thread fills/drains the other pair against INPUT and OUTPUT.]
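A simplified double-buffering sketch in Python (names and the two-slot queue are illustrative assumptions; a fuller version would also drain output buffers on the I/O thread):

import threading, queue

BLOCK_SIZE = 4096

def double_buffered_map(in_path, out_path, f):
    # The I/O thread prefetches the next input buffer while the main thread
    # runs f on the current one; a 2-slot queue models the swapped buffer pair.
    filled = queue.Queue(maxsize=2)

    def reader():
        with open(in_path, "rb") as src:
            while chunk := src.read(BLOCK_SIZE):
                filled.put(chunk)   # blocks when both buffers are full
        filled.put(None)            # end-of-input marker

    io = threading.Thread(target=reader)
    io.start()
    with open(out_path, "wb") as dst:
        while (buf := filled.get()) is not None:
            dst.write(f(buf))       # CPU work overlaps the reader's I/O
    io.join()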


Double Buffering applies to all streams
It can be used in any of the algorithms discussed below
Assuming you have RAM buffers to spare!
But for simplicity we won’t bring this up again.


Sorting & Hashing: Formal Specs
Given:
A file F:
containing a multiset of records R
consuming N blocks of storage
Two “scratch” disks
each with >> N blocks of free storage
A fixed amount of space in RAM
memory capacity equivalent to B blocks of disk
Sorting
Produce an output file FS
with contents R stored in order by a given sorting criterion
Hashing
Produce an output file FH
with contents R, arranged on disk so that no two records with the same hash value are separated by a record with a different hash value.
I.e. matching records are always “stored consecutively” in FH.
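To make the hashing spec concrete, a small Python predicate that checks FH's property (the function name and record model are illustrative assumptions):

def is_hash_clustered(records, h):
    # True iff equal-hash records are stored consecutively: once a hash
    # value's run ends, that value must never reappear later in the file.
    seen, prev = set(), object()
    for r in records:
        hv = h(r)
        if hv != prev:
            if hv in seen:
                return False    # value reappeared after being interrupted
            seen.add(hv)
            prev = hv
    return True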


Sorting: 2-Way (a strawman)
Pass 0 (conquer a batch):
read a page, sort it, write it.
only one buffer page is used
a repeated “batch job”

[Figure: pass 0 streams one page at a time from INPUT into a single RAM buffer, sorts it in place, and writes it back to OUTPUT.]
Sorting: 2-Way (a strawman), cont
Pass 0 (conquer a batch):
read a page, sort it, write it.
only one buffer page is used
a repeated “batch job”
Pass 1, 2, 3, …, etc. (merge via streaming):
requires 3 buffer pages
note: this has nothing to do with double buffering!
merge pairs of runs into runs twice as long
a streaming algorithm, as in the previous slide!
Drain/fill buffers as the data streams through them

[Figure: a merge pass uses two Input Buffers (one per run) and one Output Buffer between INPUT and OUTPUT.]
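To make the streaming merge concrete, a small Python sketch (pages are modeled as sorted lists; PAGE_SIZE and the write_page callback are illustrative assumptions):

import heapq

PAGE_SIZE = 2  # records per page, matching the 2-records-per-page example on the next slide

def merge_two_runs(run_a, run_b, write_page):
    # run_a / run_b: iterables of pages; each page is a sorted list of keys.
    def records(run):               # refills a 1-page input buffer on demand
        for page in run:
            yield from page
    out = []                        # the single output buffer
    for key in heapq.merge(records(run_a), records(run_b)):
        out.append(key)
        if len(out) == PAGE_SIZE:   # output buffer full: drain it
            write_page(out)
            out = []
    if out:
        write_page(out)             # final partial page

# e.g. merging runs [[2,6]] and [[3,4]] emits pages [2,3] then [4,6]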

Two-Way External Merge Sort
Conquer and Merge:
sort subfiles and merge
Each pass we read + write each page in file (2N I/Os)
N pages in the file.
So, the number of passes is: ⌈log₂ N⌉ + 1
So total cost is: 2N (⌈log₂ N⌉ + 1)

[Figure: two-way merge sort of a 7-page input file, one pass per level]
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
Pass 0 → 1-page runs:  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
Pass 1 → 2-page runs:  2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
Pass 2 → 4-page runs:  2,3,4,4,6,7,8,9 | 1,2,3,5,6
Pass 3 → 8-page run:   1,2,2,3,3,4,4,5,6,6,7,8,9
General External Merge Sort
More than 3 buffer pages. How can we utilize them?
Big batches in pass 0, many streams in merge passes
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
Pass 1, 2, …, etc.: merge B-1 runs at a time.

[Figure: Pass 0 (Conquer) fills all B buffers, sorts, and spills ⌈N/B⌉ sorted runs of length B; Pass 1, … (Merge) streams B-1 runs through B-1 input buffers and 1 output buffer, producing sorted runs of length B(B-1).]
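A Python sketch of the full algorithm with runs held in memory for brevity (fan_in plays the role of B-1; heapq.merge performs the multi-way streaming merge):

import heapq

def merge_pass(runs, fan_in):
    # One merge pass: combine groups of fan_in (= B-1) runs into longer runs.
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]

def external_sort(batches, fan_in):
    # batches: the file pre-split into B-page chunks.
    runs = [sorted(b) for b in batches]   # pass 0: sort each B-page batch
    while len(runs) > 1:                  # passes 1, 2, ...
        runs = merge_pass(runs, fan_in)
    return runs[0] if runs else []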

Cost of External Merge Sort
Number of passes: 1 + ⌈log_{B-1}(⌈N/B⌉)⌉
Cost = 2N * (# of passes)
E.g., with 5 buffer pages, to sort 108 page file:
Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each
last run is only 3 pages
Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each
last run is only 8 pages
Pass 2: 2 sorted runs, 80 pages and 28 pages
Pass 3: Sorted file of 108 pages

Formula check: 1 + ⌈log₄ 22⌉ = 1 + 3 = 4 passes ✓


# of Passes of External Sort
(I/O cost is 2N times number of passes)

N              B=3  B=5  B=9  B=17  B=129  B=257
100              7    4    3     2      1      1
1,000           10    5    4     3      2      2
10,000          13    7    5     4      2      2
100,000         17    9    6     5      3      3
1,000,000       20   10    7     5      3      3
10,000,000      23   12    8     6      4      3
100,000,000     26   14    9     7      4      4
1,000,000,000   30   15   10     8      5      4
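The table can be regenerated from the pass formula; a small Python check (a sketch — beware float rounding in log at exact powers):

from math import ceil, log

def n_passes(N, B):
    runs = ceil(N / B)              # sorted runs left after pass 0
    if runs <= 1:
        return 1                    # file sorted in a single pass
    return 1 + ceil(log(runs, B - 1))

print([n_passes(100, B) for B in (3, 5, 9, 17, 129, 257)])    # [7, 4, 3, 2, 1, 1]
print([n_passes(10**9, B) for B in (3, 5, 9, 17, 129, 257)])  # [30, 15, 10, 8, 5, 4]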

Memory Requirement for External Sorting
How big of a table can we sort in two passes?
Each “sorted run” after Pass 0 is of size B
Can merge up to B-1 sorted runs in Pass 1
Answer: B(B-1).
Sort N pages of data in about √N pages of memory

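A quick sanity check of the B(B-1) bound in Python (a sketch; assumes B ≥ 3 so a real merge is possible):

from math import isqrt

def min_buffers_two_pass(N):
    # Smallest B with B*(B-1) >= N; roughly sqrt(N).
    B = isqrt(N) + 1
    while B * (B - 1) < N:
        B += 1
    return B

assert min_buffers_two_pass(108) == 11   # 11 * 10 = 110 >= 108 pages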

Alternative: Hashing
Idea:
Many times we don’t require order
E.g.: removing duplicates
E.g.: forming groups
Often just need to rendezvous matches
Hashing does this
But how to do it out-of-core??

Divide
Streaming Partition (divide):
Use a hash function hp to stream records to disk partitions
All matches rendezvous in the same partition.
Each partition contains a mix of values
A streaming algorithm creates the partitions on disk:
“Spill” partitions to disk via output buffers

Conquer
ReHash (conquer):
Read partitions into RAM hash table one at a time, using hash f’n hr
Each bucket contains a small number of distinct values
Then read out the RAM hash table buckets and write to disk
Ensuring that duplicate values are contiguous
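A compact Python sketch of the two phases (the hash functions derived from Python's built-in hash and the in-memory partitions are stand-ins; a real system spills partitions to disk):

from collections import defaultdict

def hp(key, n):                 # partitioning hash function
    return hash(("hp", key)) % n

def hr(key):                    # independent rehash function
    return hash(("hr", key))

def external_hash(records, num_partitions):
    # Divide: stream every record to one of B-1 partitions via hp
    parts = defaultdict(list)
    for r in records:
        parts[hp(r, num_partitions)].append(r)   # on disk: spill via output buffers
    # Conquer: load one partition at a time, rehash it with hr
    out = []
    for p in parts.values():                     # assumes each partition fits in RAM
        table = defaultdict(list)
        for r in p:
            table[hr(r)].append(r)
        for bucket in table.values():            # buckets written out consecutively,
            out.extend(bucket)                   # so matching records end up adjacent
    return out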

Two Phases: Divide
Partition (Divide):
[Figure: records from the original relation stream through one input buffer; hash function hp routes each record to one of B-1 output buffers, which spill to partitions 1 … B-1 on disk. B main memory buffers in total.]

Two Phases: Conquer
Rehash (Conquer):
[Figure: each partition Ri (k ≤ B pages) streams through an input buffer into a RAM hash table built with hash function hr, using the B main memory buffers; the table is then written to the output relation, fully hashed.]

Cost of External Hashing
cost = 2*N*(#passes) = 4*N I/Os (includes initial read, final write)
[Figure: Divide (hp) turns the input into B-1 hash partitions of size ~N/(B-1); Conquer (hr) reads each partition back and writes it out fully hashed.]

Memory Requirement
How big of a table can we hash in two passes?
B-1 “partitions” result from Pass 1
Each should be no more than B pages in size
Answer: B(B-1).
We can hash a table of size N pages in about √N pages of memory
Note: assumes hash function distributes records evenly!
Have a bigger table? Recursive partitioning!

Recursive Partitioning, Pts 1-3
[Figure sequence: when N is too big for B, any partition from Divide (hp) that is still larger than B pages is divided again with a fresh hash function hp1, and only then conquered with hr.]

A Wrinkle: Duplicates
Consider a dataset with a very frequent key
E.g. in a big table, consider the gender column
What happens during recursive partitioning?
[Figure: Divide (hp) yields partitions M, F, other; re-dividing the M partition with hp1 cannot shrink it, since all of its records share one key.]

Question…
How does external hashing compare with external sorting?

Cost of External Hashing
cost = 4*N I/Os (including initial read, final write): Divide + Conquer

Cost of External Sorting
cost = 4*N I/Os (including initial read, final write): Conquer + Merge

Parallelize me! Hashing Phase 1
Phase 1: shuffle data across machines (hn)
streaming out to network as it is scanned
which machine for this record?
use (yet another) independent hash function hn

Parallelize me! Hashing Phase 2
Phase 1: shuffle data across machines (hn)
Receivers proceed with phase 1 as data streams in from local disk and network

Parallelize me! Sorting
Pass 0: shuffle data across machines
streaming out to network as it is scanned
which machine for this record?
Split on value range (e.g. [-∞,10], [11,100], [101,∞]).

Parallelize me! Sorting, cont
Pass 0: shuffle data across machines
Receivers proceed with pass 0 as the data streams in
A Wrinkle: How to ensure the ranges hold the same # of pages, i.e. avoid data skew?

So which is better??
Simplest analysis:
Same memory requirement for 2 passes
Same I/O cost
But we can dig a bit deeper…

Sorting vs Hashing
Sorting pros:
Great if we need output to be sorted anyway
Not sensitive to duplicates or “bad” hash functions
Hashing pros:
For duplicate elimination, scales with # of values
Delete dups in first pass while partitioning on hp
Vs. sort, which scales with # of items!
Easy to shuffle equally in parallel case
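The duplicate-elimination advantage can be sketched in a few lines of Python (in-memory partitions stand in for spill files; per the slide, dups could already be dropped during the hp pass):

def distinct(records, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[hash(("hp", r)) % num_partitions].append(r)  # divide on hp
    for p in parts:                 # conquer: one partition at a time
        yield from set(p)           # RAM hash table keeps one copy per value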
Summary
Sort/Hash Duality
Hashing is Divide & Conquer
Sorting is Conquer & Merge
Sorting is overkill for rendezvous
But sometimes a win anyhow
Don’t forget one-pass streaming and double buffering
Can “hide” the latency of I/O behind CPU work