
Chapter 13: External Sorting
Ed Knorr, Fall 2020
Based on:
(a) Chapter 13 (pages 421-436) of Ramakrishnan & Gehrke (textbook), but skip Section 13.3.1 on “Minimizing the Number of Runs” (pages 428-430) and Section 13.4.2 on “Double Buffering” (page 432);
(b) Garcia-Molina, Ullman, & Widom (reference text); (c) Some CPSC 221 Review Material
1

Learning Goals
 Justify the importance of external sorting for database applications.
 Argue that, for many database applications, I/O activity dominates CPU
time when measuring complexity using elapsed time.
 Compute the number of I/Os (assuming one I/O per disk request) that are required to sort a file of size N using general multiway (k-way) external mergesort (GMEMS), where N is the number of pages in the file, and k ≥ 2.
 Compute the number of sorted runs (or passes) that are required to sort a large file using GMEMS, including the sizes of the sorted runs as we progress.
 Analyze the complexity and scalability of sorting a large file using GMEMS.
 Using the parameters for a given disk geometry, compute the number of I/Os, or estimate the elapsed time, for sorting a large file with GMEMS, including two-phase multiway mergesort (2PMMS).
2

Learning Goals (cont.)
 Suggest several optimizations that can be exploited when sorting large files via GMEMS (including 2PMMS)—e.g., cylindrification, larger block sizes, double buffering, disk scheduling, and multiple smaller disks.
 Show how our earlier concepts of disk geometry, buffer pool management, disk scheduling, etc., directly relate to the elapsed time for sorting large datasets.
 Argue that sorting will continue to be a bottleneck for many applications, regardless of how much memory computers may have in the future.
 Explain why the sorting performance of large files is data dependent.
 Estimate the number of short seeks and the number of long seeks that are required in a data dependent sort of large files, where a “short seek” is to a neighbouring cylinder, and a “long seek” is a seek that’s further away than one cylinder.
 Argue for or against: Sorting performance has linear complexity for large datasets.
3

Motivation
 Sorting is a classic problem in computer science – A large number of cycles is devoted to sorting.
 Database reasons:
– Data is often requested in sorted order.
– Elimination of duplicates in a collection of records
– First step in bulk loading (i.e., populating, reorganizing) a
table and its index(es)
– Aggregation types of queries (GROUP BY)
– Efficient joins of relations (e.g., sort-merge join algorithm)
4

Mergesort (from CPSC 221 or Equivalent)
 Mergesort is fast and efficient
– Works well on memory-resident data (e.g., on a vector or an array), although Quicksort often outperforms it
– O(n log n) average case time for Mergesort
– O(n log n) worst case time for Mergesort
 Quicksort is O(n log n) on average, but degrades to O(n2) in the worst
case
 Merge = take two sorted lists and repeatedly choose the smaller of the “heads” of the lists
 Example: Merge sorted list #1: 1,3,4,8 with sorted list #2: 2,5,7,9
Result = a merged, sorted list: 1,2,3,4,5,7,8,9
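The merge step above can be sketched in Python. This is an illustrative sketch (the function name is ours, not from the course materials): repeatedly choose the smaller of the two list heads, then append whatever remains.

```python
# Merge two sorted lists by repeatedly taking the smaller "head".
def merge(a, b):
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # at most one of these two
    out.extend(b[j:])   # extends is non-empty
    return out
```

Each element is copied exactly once, so merging two lists of total length n takes O(n) time.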
5

Mergesort (cont.)
 Here’s an important database-related problem: How can we sort 100 GB (or 100 TB!) of data with only a small fraction of that amount of RAM?
– CPU issues? Not a problem.
– I/O issues? Important consideration.
 Common in-memory sorting algorithms (like Quicksort) don’t optimize disk I/Os.
– However, External Mergesort does.
6

CPSC 404’s General Multiway External Mergesort Algorithm (GMEMS)
 Suppose we need to sort a disk-resident file of size N pages with B pages of RAM (buffer pool space) available, where N > B.
 Algorithm:
1. Read B blocks/pages of data into memory, sort that data, and write it to a temporary file on disk. This file is called a sorted run (SR) or sublist of length B.
2. Repeat Step 1 until all N pages have been read in, and therefore the file has possibly many SRs of size B (except for the last SR which may be shorter than B).
a) Each SR is itself sorted; but two side-by-side SRs are likely not sorted.
b) The sublists are stored using temporary disk space.
7

CPSC 404’s GMEMS Algorithm (cont.)
3. Until the whole file is sorted, do:
• Merge up to B – 1 of the sorted runs (why not B?) to form an
even longer SR. This is called a (B – 1)-way mergesort.
• This longer SR is also written out as a temporary file on disk (unless it’s the final product, in which case you probably want to store it permanently).
4. Delete the consumed (smaller, input) sorted runs (except for the original file, which we probably don’t want to delete).
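The steps above can be sketched as an in-memory simulation in Python, where lists of pages stand in for disk files and B is the number of buffer pages. This is an illustrative sketch, not the course's reference implementation; the names are our own.

```python
import heapq

def external_mergesort(pages, B):
    # Pass 0: read B pages at a time, sort them, emit a sorted run (SR).
    runs = []
    for i in range(0, len(pages), B):
        chunk = [rec for page in pages[i:i + B] for rec in page]
        runs.append(sorted(chunk))
    # Later passes: merge up to B - 1 runs at a time (one page of the
    # buffer pool is reserved for output) until one run remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), B - 1):
            merged.append(list(heapq.merge(*runs[i:i + B - 1])))
        runs = merged
    return runs[0] if runs else []
```

On disk, each "run" would be a temporary file and each merge would stream pages through the buffer pool; the control flow is the same.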
8

GMEMS Diagram of Merging
 We need at least B = 3 buffer pages.
 To sort a file with N pages using B buffer pages:
– Pass 0: Use all B buffer pages.
Produce ⌈N / B⌉ sorted runs of B pages each.
– Passes 1, 2, etc.: Now, merge B – 1 sorted runs:

  Disk                 B Main Memory Buffers              Disk

  SR  --> INPUT 1    \
  SR  --> INPUT 2     +-->  OUTPUT  -->  longer SR
  ...     ...        /
  SR  --> INPUT B-1
9

GMEMS: The Merge Phase
 Manage the buffers as needed. Use one or more pages per input block—the bigger the block, the better.
– If an input block is exhausted, get the next block from the same sorted sublist.
– If an output block is full, write it to disk.
– At the end of the algorithm, write all remaining (partially filled) output blocks to disk.
 The cost of the sort is a function of the number of passes required.
 To be consistent with the textbook, let us define one pass as one complete read PLUS one complete write of all the data in the file.
10

GMEMS Example
 To sort a 108-page file using B = 5 buffer pages, we need 4 passes overall:
– Pass 0: ⌈108 / 5⌉ = 22 sorted runs (SRs) of 5 pages each (but the last SR is only 3 pages)
– Pass 1: ⌈22 / 4⌉ = 6 sorted runs of 20 pages each (but the last SR is only 8 pages)
– Pass 2: 2 SRs: 80 pages and 28 pages
– Pass 3: Final product: A sorted file of 108 pages.
 We’ll fill in the next slide with more details including a diagram.
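The pass count follows directly from the algorithm: Pass 0 makes ⌈N/B⌉ runs, and every later pass divides the number of runs by B – 1 (rounding up). A small Python helper (our own, for illustration) computes it:

```python
import math

def num_passes(N, B):
    """Passes needed by GMEMS to sort N pages with B buffer pages (B >= 3)."""
    runs = math.ceil(N / B)      # Pass 0: initial sorted runs
    passes = 1
    while runs > 1:              # each merge pass is a (B-1)-way merge
        runs = math.ceil(runs / (B - 1))
        passes += 1
    return passes
```

Equivalently, the total is 1 + ⌈log₍B₋₁₎ ⌈N/B⌉⌉. For the example above, `num_passes(108, 5)` gives 4, matching the pass-by-pass trace.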
11

GMEMS Example (cont.)
12

Two-Phase Multiway Mergesort (2PMMS)—i.e., GMEMS in 2 Passes
 If B is sufficiently large, then many cases of “General Multiway External Mergesort” can be completed in only 2 passes!
– We want to use as much memory as we can get, up to the size of the file (which would be ideal because then we can finish in just one pass).
– Nevertheless, even a 2-pass scenario has major performance benefits.
 Compare this to the 3-buffer case, which requires too many passes!
 We call the 2-pass version of GMEMS: Two-Phase Multiway Mergesort (2PMMS).
13

Number of Passes Required for GMEMS with B Buffer Pages and N Data Pages
N               B=3   B=5   B=9   B=17  B=129  B=257
100               7     4     3     2      1      1
1,000            10     5     4     3      2      2
10,000           13     7     5     4      2      2
100,000          17     9     6     5      3      3
1,000,000        20    10     7     5      3      3
10,000,000       23    12     8     6      4      3
100,000,000      26    14     9     7      4      4
1,000,000,000    30    15    10     8      5      4
14

Sorting Large Files in Only 2 Passes?
Given 40 records/page and 12,800 4K pages for the buffer pool:
How many records (max.) can you sort with 2PMMS?
External mergesort runs in “O(N)” time for “most” big files that IT shops work with.
What would you do if there were even more records?
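As a worked answer to the first question (our own arithmetic, under the slide's assumptions): Phase 1 makes sorted runs of B pages, and Phase 2 can merge up to B – 1 of them, so 2PMMS handles files of up to about B × (B – 1) pages.

```python
# Maximum file size sortable by 2PMMS with B buffer pages:
# Phase 1 runs are B pages long; Phase 2 merges up to B - 1 of them.
B = 12_800               # 50 MB buffer pool / 4 KB pages
RECORDS_PER_PAGE = 40

max_pages = B * (B - 1)                  # 163,827,200 pages
max_records = max_pages * RECORDS_PER_PAGE
print(f"{max_records:,} records")        # about 6.55 billion records
```

With even more records, a third pass (or more memory, or a larger block size) would be needed.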
15

Improving the Running Time of External Mergesort
Here are some techniques that can make
secondary memory algorithms more efficient:
 1. Group blocks by cylinder (“cylindrification”)
 2. One big disk → several smaller disks
 3. Mirrored disks = multiple copies of the same data
 4. Disk scheduling: e.g., the Elevator Algorithm
 5. Double Buffering (we won’t cover this concept in this version of the course)
16

Analysis of Naïve Implementation
We’ll use the same “Megatron 747” disk drive from our
unit on Disks earlier in the term.
 We want to sort these records by some field(s).
– We might do this as a step before bulk loading records in
clustering order.
– External sorting is extremely common for many kinds of large files, and they don’t have to involve databases.
 Disk geometry (from earlier in the term): – Suppose the average access time is 15.49 ms.
 In our examples, we’ll use 10 ms for average seek time, 5 ms for avg. rotational latency, and (a very generous) 0.49 ms/page for the transfer time.
– Suppose 1 cyl. = 1 MB, and suppose there are 256 pages/cyl.
 Yes, this is very small; but, it’s good enough for our example.
17

Analysis (cont.)
 Assume blocks are stored at random.
– This is unlikely to be the case in general, but it will give us an
appreciation for how time-consuming a pass can be.
 Suppose we have 10M (10,000,000) records of 100 bytes, that is, approximately a 1 GB file size.
– We’ll store the file on our Megatron 747 disk, with a 4 KB block or page size, with each page holding 40 records of 100 bytes.
– The entire file consumes 250,000 4K pages. How did we get this?
 This equates to 977 cylinders. How did we get this?
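The two “How did we get this?” questions are straightforward arithmetic from the slide's parameters; a sketch of the calculation:

```python
import math

records = 10_000_000
records_per_page = 40            # 40 x 100-byte records per 4 KB page
pages_per_cyl = 256              # 1 MB cylinder / 4 KB pages

pages = records // records_per_page           # 10,000,000 / 40 = 250,000
cylinders = math.ceil(pages / pages_per_cyl)  # 250,000 / 256, rounded up = 977
```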
18

Analysis (cont.)
 Naïve case for 2PMMS:
– 2 passes × (1 read + 1 write) × 250,000 pages = 1,000,000 page I/Os
– 1,000,000 page I/Os * 15.49 ms/page = 15,490 seconds = 258.17 minutes = 4.3 hours!
 Let us consider a more intelligent approach to reduce the number of seeks, namely, cylindrification.
 Suppose 50 MB of main memory is available.
– This is 12,800 blocks of size 4K = about 1/20th of the file. How did we get 12,800?
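Both numbers on this slide come straight from the given parameters; a sketch of the arithmetic:

```python
pages = 250_000
ios = 4 * pages                        # 2 passes, each a full read + full write
avg_access_ms = 15.49                  # random-placement access time per page
seconds = ios * avg_access_ms / 1000   # 15,490 s, i.e., about 4.3 hours

buffer_pages = 50 * 1024 * 1024 // (4 * 1024)   # 50 MB / 4 KB = 12,800 pages
```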
19

Cylindrification
When reading or writing blocks in a known order, place them together on the same cylinder. Once we reach that cylinder (with one seek), we can read or write block after block, with negligible rotational latency, and no extra seeks (unless we need > 1 cylinder for the given sorted run).
Application to Phase 1 of 2PMMS:
1. Initially, records are on 977 consecutive cylinders.
2. Load main memory with 50 consecutive cylinders.
 The order of blocks read is unimportant, so the only time besides transfer time is: one random seek and 49 1-cylinder seeks (each of minimal seek time).
 Time to transfer 12,800 pages (50 MB) at 0.49 ms/page = 6.33 sec. Note that the last sorted run (SR) may be shorter.
3. Repeat. That is, write each SR onto 50 consecutive cylinders.
 Is this realistic?
 Write time is also about 6.33 sec/SR
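Putting these figures together gives a rough Phase 1 estimate (our own back-of-the-envelope number, slightly high because the last sorted run is shorter than 50 cylinders):

```python
import math

cylinders = 977
cyls_per_load = 50
loads = math.ceil(cylinders / cyls_per_load)   # 20 sorted runs

read_or_write_s = 6.33                         # per 50-cylinder SR, from the slide
phase1_s = loads * 2 * read_or_write_s         # each SR is read once, written once
print(f"Phase 1: about {phase1_s:.0f} s")      # ~253 s, i.e., roughly 4 minutes
```

Compare this with the 4.3 hours of the naïve random-placement estimate: cylindrification alone removes almost all of the seek and rotational-latency cost from Phase 1.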
20

Cylindrification Details, Phase 1
See page 1 of the separate PDF download to save yourself a lot of writing.
21

Cylindrification Details, Phase 2
But in Phase 2: Blocks are read from the 20 SRs in a data dependent order, so we can’t gain much of an advantage in terms of cylinder seeks.
Why 20 SRs? There are 977 cylinders of data.
See page 2 of the separate PDF download.
22

Summary
 External sorting is very important.
 The DBMS may dedicate part of its buffer pool to sorting.
 External Mergesort minimizes disk I/O cost:
– Pass 0: Produces sorted runs of size B (# of buffer pages). In later passes: merge the runs.
– # of runs merged at a time depends on B and on the block size:
 Larger block size means less disk access cost per page, when amortizing seek and rotational latency
 Larger block size means a smaller number of runs is merged
– In practice, the number of passes is rarely more than two.
 In practice, External Mergesort runs in O(N) time, where N is the number of pages.
23