Query Evaluation 2
It is sometimes necessary or useful to sort data as it is scanned (read)
▪ To satisfy a query with an ORDER BY clause
▪ Or because an algorithm requires sorted input
▪ Such as projection or some join algorithms
There are a number of ways in which a sort scan can be performed
▪ Main memory sorting, but only if R fits in main memory: B(R) < M
▪ B+ tree index
▪ Multi-way mergesort
The cost to scan a file is B(R); if B(R) < M the cost to sort the file is also B(R)
Sorting a collection of records that fit within main memory can be performed efficiently
▪ There are a number of sorting algorithms that run in O(n log2 n) time
▪ That is, with O(n log2 n) comparisons, e.g. Mergesort, Quicksort
Many DB tables are too large to fit into main memory at one time
▪ So they cannot simply be read into main memory and sorted
▪ The focus in external sorting is to reduce the number of disk I/Os
▪ As it is with optimization in general
Consider the Merge sort algorithm
▪ Input sub-arrays are repeatedly halved
▪ Until they contain only one element
Sub-arrays are then merged into sorted sub-arrays by repeated merge operations
▪ Merging two sorted sub-arrays can be performed in O(n)
mergesort(arr, start, end)
    if (start < end)                      // at least two elements
        mid = (start + end) / 2           // midpoint: the parentheses are required
        mergesort(arr, start, mid)
        mergesort(arr, mid+1, end)
        merge(arr, start, mid, mid+1, end)
The overall cost is O(n log2 n)
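As a concrete illustration, here is a minimal runnable version of the pseudocode above (Python is used purely for illustration, and the merge helper's signature is simplified slightly from the pseudocode):

# Minimal in-memory mergesort sketch mirroring the pseudocode above (illustrative only).
def mergesort(arr, start, end):
    if start < end:                              # at least two elements
        mid = (start + end) // 2
        mergesort(arr, start, mid)
        mergesort(arr, mid + 1, end)
        merge(arr, start, mid, end)

def merge(arr, start, mid, end):
    # Merge the sorted sub-arrays arr[start..mid] and arr[mid+1..end] in O(n).
    left, right = arr[start:mid + 1], arr[mid + 1:end + 1]
    i = j = 0
    for k in range(start, end + 1):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            arr[k], i = left[i], i + 1
        else:
            arr[k], j = right[j], j + 1

data = [31, 64, 11, 87, 23, 15]
mergesort(data, 0, len(data) - 1)                # data is now [11, 15, 23, 31, 64, 87]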
Convert main memory merge sort to work on disk data
Initial step - read 2 pages of data from the file
▪ Sort them and write them to disk
▪ This results in B/2 sorted "runs" of size 2
Merge the first two sorted runs of size 2
▪ Read the first page of the first two runs into input pages
▪ Merge to a single output page, and write it out when full
▪ When all records in an input page have been merged, read in the second page of that run
▪ Repeat for each pair of runs of size 2
▪ There are now B/4 sorted runs of size 4
Repeatedly merge runs until the file is sorted
Note: sorting this way does not make much sense in practice, but it is included for illustration
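To make the page-level mechanics concrete, here is a small Python sketch of merging two sorted two-page runs with only two input buffers and one output buffer (pages are simulated as lists, "reading" and "writing" a page is just list access, and the record values match the example figure that follows):

# Merge two sorted runs using three main-memory frames (illustrative sketch only).
PAGE_SIZE = 3                                        # the example pages hold three records

def merge_two_runs(run_a, run_b):
    """run_a, run_b: lists of sorted pages; returns the pages of the merged run."""
    out_pages, output = [], []
    buf_a, buf_b = list(run_a[0]), list(run_b[0])    # read the first page of each run
    next_a, next_b = 1, 1
    while buf_a or buf_b:
        # take the smaller leading record, or drain whichever buffer is non-empty
        if not buf_b or (buf_a and buf_a[0] <= buf_b[0]):
            output.append(buf_a.pop(0))
            if not buf_a and next_a < len(run_a):    # input page exhausted: read the next one
                buf_a, next_a = list(run_a[next_a]), next_a + 1
        else:
            output.append(buf_b.pop(0))
            if not buf_b and next_b < len(run_b):
                buf_b, next_b = list(run_b[next_b]), next_b + 1
        if len(output) == PAGE_SIZE:                 # output buffer full: write it out
            out_pages.append(output)
            output = []
    if output:
        out_pages.append(output)
    return out_pages

print(merge_two_runs([[11, 15, 23], [31, 64, 87]], [[6, 10, 14], [41, 55, 76]]))
# -> [[6, 10, 11], [14, 15, 23], [31, 41, 55], [64, 76, 87]]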
[Figure: the file on disk after the first sort pass; each disk page holds three records, and the pages form sorted two-page runs such as (11, 15, 23)(31, 64, 87), (6, 10, 14)(41, 55, 76) and (3, 25, 28)(39, 53, ...)]
After the first sort pass the file consists of B/2 sorted runs each of two pages
Read in the first page of each of the first two sorted runs
Leaving a third page free as an output buffer
[Figure: main memory with three frames, two input buffers holding pages (11, 15, 23) and (6, 10, 14), and an empty output buffer]
Records from the input pages are merged into the output buffer
Once the output buffer is full, its contents are written out to disk to form the first page of the first sorted run of length 4
Eventually all of the records from one of the input pages have been processed
The next page of that sorted run is then read into that input buffer
And the process continues
▪ Note that the merge uses only three main memory frames!
Assume that B = 2^k
▪ After the first pass there are 2^(k-1) sorted runs
▪ Each is two pages in size
▪ After the second pass there are 2^(k-2) sorted runs, of length 4
▪ After the kth pass there is one sorted run of length B
The number of passes is therefore log2(B)
In each pass all the pages of the file are read and written (B pages read and B pages written)
▪ for a total cost of 2B * log2(B)
▪ Note that only 3 frames of main memory are required!
▪ Also note that main memory costs are ignored
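For example, with the B = 1,000,000 page file used later in these notes, log2(B) is roughly 20, so the naive two-way approach performs about 20 passes and costs roughly 2 * 1,000,000 * 20 = 40,000,000 page I/Os.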
The algorithm can be improved in two ways
In the first stage of the naive process pairs of pages are read into main memory, sorted and written out
▪ Resulting in B/2 runs of size 2
To make effective use of the M main memory pages available, read M pages at a time and sort them
▪ After the first pass there will be B/M sorted runs, each of length M
▪ This reduces the number of subsequent merge passes
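For instance (illustrative numbers), with B = 1,000,000 and M = 1,000 the first pass produces 1,000 sorted runs of 1,000 pages each, rather than 500,000 two-page runs.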
In the merge passes perform an (M-1)-way merge
▪ M-1 input pages, one for each of M-1 sorted runs, and
▪ 1 page for an output buffer
The first items in each of the M-1 input partitions are compared to determine the smallest
Each merge pass merges M-1 runs
▪ The runs were of size M after the first pass, so after the first merge pass the runs are of size (M-1)*M
▪ This results in fewer merge passes, and less disk I/O
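A sketch of a single (M-1)-way merge in Python (the min-heap is just one convenient way to repeatedly find the smallest leading record; a DBMS would work page by page through buffer frames rather than on whole in-memory lists):

import heapq

def merge_runs(runs):
    """Merge several sorted runs (each a list of records) into one sorted run."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)                  # the smallest leading record of any run is on top
    output = []
    while heap:
        record, run_id, pos = heapq.heappop(heap)
        output.append(record)            # in a DBMS this record goes to the output buffer
        if pos + 1 < len(runs[run_id]):  # replace it with the next record from the same run
            heapq.heappush(heap, (runs[run_id][pos + 1], run_id, pos + 1))
    return output

print(merge_runs([[3, 25, 39], [6, 10, 41], [11, 15, 23]]))   # an (M-1)-way merge with M = 4
# -> [3, 6, 10, 11, 15, 23, 25, 39, 41]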
The initial pass produces B/M sorted runs of size M
Each merge pass reduces the number of runs by a factor of M-1
▪ The number of merge passes is logM-1(B/M)
Each pass requires that the entire file is read and then written
▪ The total cost is therefore 2B * (logM-1(B/M) + 1), where the extra 1 accounts for the first (sort) pass
M is typically relatively large, so this reduction over two-way merge is considerable
B = 1,000,000

M        log2(B)    logM-1(B/M) + 1
3        20         20
5        20         10
200      20         3
2,000    20         2
Even a large file can usually be sorted in two passes (a cost of 4B I/Os to sort and write out) assuming a reasonable size for M
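As a rough justification (standard reasoning, not stated on the slides): two passes suffice whenever the B/M initial runs can all be merged in a single (M-1)-way merge, i.e. when B/M ≤ M-1, or roughly M ≥ √B; for B = 1,000,000 pages that is only about 1,000 buffer pages.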
[Figure: replacement sort example; the file on disk consists of pages (10, 43, 23), (1, 64, 87), (35, 50, 19), (41, 5, 86), (12, 24, 94), (41, 26, 13), and main memory is divided into the current set, an input buffer and an output buffer]
In the first pass of external mergesort B/M sorted runs of size M are produced
▪ A larger initial run size means fewer merge steps
Replacement sort increases the initial run size
▪ To 2 * M on average
The algorithm uses the available buffers as follows
▪ M-2 pages to sort the file – the current set
▪ One page for input
▪ One page for output
First the current set is filled
▪ ... then sorted
Once the current set is sorted the next page of the file is read into the input buffer
The smallest record from the current set and the input buffer is put in the output buffer
The first element of the current set is now free and is replaced with the first record from the input buffer
This process is repeated until the output buffer is full and all the values in the input buffer are in the current set
When the output buffer is full
▪ The next page of the file is read into the input buffer
▪ And the output buffer is written to disk as the first page of the first sorted run
▪ As the process continues the current set is periodically re-sorted
When a value is read that is less than the values in the output buffer, the current set must be written out
▪ The process starts again to generate the next run
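A sketch of replacement sort in Python (a min-heap stands in for the sorted current set, plain lists stand in for the input and output buffers, and the function name and record values are illustrative only):

import heapq

def replacement_sort(records, set_size):
    """Yield sorted runs; their average length is about twice set_size."""
    current_set = records[:set_size]                  # fill the current set ...
    heapq.heapify(current_set)                        # ... then "sort" it (as a min-heap)
    rest = iter(records[set_size:])
    run, frozen = [], []                              # frozen: records too small for this run
    while current_set:
        smallest = heapq.heappop(current_set)
        run.append(smallest)                          # goes to the output buffer
        nxt = next(rest, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(current_set, nxt)      # can still join the current run
            else:
                frozen.append(nxt)                    # must wait for the next run
        if not current_set:                           # current run is finished
            yield run
            run, current_set, frozen = [], frozen, []
            heapq.heapify(current_set)

for r in replacement_sort([10, 43, 23, 1, 64, 87, 35, 50, 19, 41, 5, 86], 4):
    print(r)
# -> [1, 10, 23, 35, 43, 50, 64, 87] then [5, 19, 41, 86]: the first run is twice set_size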
In practice it may be more efficient to make the input and output buffers larger than one page
▪ This reduces the number of runs that can be merged at one time, so may increase the number of passes required
▪ But, it allows a sequence of pages to be read or written to the buffers, decreasing the actual access time per page
We have also ignored CPU costs
▪ If double buffering is used, the CPU can process one part of a run while the next is being loaded into main memory
▪ Double buffering also reduces the amount of main memory available for the sort
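A toy sketch of the double-buffering idea (Python threads and plain list access stand in for asynchronous disk reads; the function names are illustrative, not part of any real DBMS API):

import threading

def read_page(file_pages, i):
    return list(file_pages[i])                     # stands in for a slow disk read

def scan_with_double_buffering(file_pages, process):
    """Process one page while the next page is being 'read' in the background."""
    if not file_pages:
        return
    buffers = [read_page(file_pages, 0), None]     # two buffer frames
    for i in range(len(file_pages)):
        current = buffers[i % 2]
        loader = None
        if i + 1 < len(file_pages):                # start fetching the next page now
            def load(slot=(i + 1) % 2, idx=i + 1):
                buffers[slot] = read_page(file_pages, idx)
            loader = threading.Thread(target=load)
            loader.start()
        process(current)                           # CPU work overlaps the background read
        if loader:
            loader.join()

scan_with_double_buffering([[6, 10, 11], [14, 15, 23], [31, 41, 55]], print)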
Primary B+ tree index
▪ The index can be used to find the first page, but
▪ Note that the file is already sorted!
Secondary B+ tree index
▪ Leaves point to data records that are not in sort order
▪ In the worst case, each data entry could point to a different page from its adjacent entries
▪ Retrieving the records in order requires reading all of the index leaf pages, plus one disk read for each record!
▪ In practice external sort is likely to be much more efficient than using a secondary index
Secondary indices are not useful for retrieving large selections
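For a rough sense of scale (illustrative figures): with B(R) = 1,000,000 data pages and, say, 10 records per page, the worst case for an unclustered secondary index is on the order of 10,000,000 random record reads plus the leaf pages, compared with about 4B(R) = 4,000,000 largely sequential page I/Os for a two-pass external sort.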