
Query Evaluation 3

 Projections
 Nested Loop Joins
John Edgar 2

3.1

 Naively, projection with duplicate removal entails
▪ Scan the table, remove unwanted attributes, and write the result back
SELECT DISTINCT fname, lname FROM Customer
▪ cost ≈ 2B disk I/Os, as the cost to write the result is less than B
▪ Sort the result, using all of its attributes as a compound sort key
▪ cost ≈ 4B, possibly more if the file is very large; again, less than 4B since the relation being sorted is smaller than B pages
▪ Scan the sorted result, removing adjacent duplicates as they are encountered; cost ≈ B, and again the relation size is now less than B
▪ The cost to write out the result of this last stage is not included; it may be the last operation or may be pipelined into another operation
 It appears that the total cost is ≈ 7B disk I/Os, but this process can be much improved by combining multiple steps

 The initial scan is performed as follows
▪ Read M pages and remove unwanted attributes
▪ Sort the records, and remove any duplicates
▪ Write the sorted run to disk
▪ Repeat for the rest of the file, for a total cost of 2B
▪ Actually less than 2B, since the result is smaller than B; the final result size can be estimated from attribute sizes and other statistics
 Perform merge passes as required on the output from the first stage
▪ Remove any duplicates as they are encountered
▪ If only one merge pass is required the cost is ≈ 1B
 For a total cost of ≈ 3B, unless more merge passes are required
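The combined algorithm above can be sketched as follows. This is a minimal model, not the actual implementation: pages are modeled as lists of record dicts, `sort_projection`, `pages`, `attrs`, and `M` are illustrative names, and the M-page memory limit is simulated by batching pages.

```python
import heapq

def sort_projection(pages, attrs, M):
    """Sort-based projection with duplicate removal (sketch).

    pages: iterable of pages, each a list of record dicts.
    attrs: attributes to project onto.  M: buffer pages in memory.
    """
    # First stage: read M pages at a time, project away unwanted
    # attributes, sort, and deduplicate each in-memory run.
    runs, buffer = [], []
    for i, page in enumerate(pages, 1):
        buffer.extend(tuple(rec[a] for a in attrs) for rec in page)
        if i % M == 0:
            runs.append(sorted(set(buffer)))
            buffer = []
    if buffer:
        runs.append(sorted(set(buffer)))

    # Merge pass: merge the sorted runs, dropping adjacent duplicates
    # as they are encountered.
    result, prev = [], object()
    for rec in heapq.merge(*runs):
        if rec != prev:
            result.append(rec)
            prev = rec
    return result
```

The single `heapq.merge` call corresponds to the case where the first stage produces fewer than M-1 runs, so one merge pass suffices.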

 Duplicates can also be identified by using hashing
 Duplicate removal by hashing has two stages
▪ Partitioning and probing
 In the partitioning stage
▪ Partition the table into M-1 partitions using a hash function, h
▪ With an output buffer for each partition, and one input buffer
▪ The file is read into main memory one page at a time, with each record being hashed to the appropriate output buffer
▪ Output buffers are written out when full
 Records in different partitions must have different attribute values, so duplicates always fall in the same partition
▪ Duplicates are eliminated in the next stage

 The duplicate elimination stage uses a second hash function h2 (h2 ≠ h) to reduce main memory costs
▪ An in-memory hash table is built using h2
▪ If two records hash to the same location they are checked to see if they are duplicates
▪ Duplicates can, instead, be removed using in-memory sorting
 If each partition produced in the partitioning stage fits in main memory the cost is
▪ Partitioning stage: 2B
▪ Duplicate elimination stage: B, for a total cost of 3B
▪ This is approximate: the actual cost is less, since the result is smaller than the original file
 This is the same cost as projection using sorting
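The two stages can be sketched as below. This is an in-memory model under stated assumptions: `hash_dedup` and `num_partitions` are illustrative names, Python's built-in `hash` stands in for h, and the per-partition `set` stands in for the hash table a real system would build with h2 ≠ h.

```python
def hash_dedup(records, num_partitions):
    """Hash-based duplicate removal (sketch).

    Stage 1: partition the records using a hash function h
    (num_partitions stands in for the M-1 output buffers).
    Stage 2: eliminate duplicates within each partition using an
    in-memory hash table (here a set, standing in for a table
    built with a second hash function h2).
    """
    # Partitioning stage: duplicates always land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(rec) % num_partitions].append(rec)

    # Duplicate elimination stage: one partition in memory at a time.
    result = []
    for part in partitions:
        seen = set()
        for rec in part:
            if rec not in seen:
                seen.add(rec)
                result.append(rec)
    return result
```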

 Sort and hash projection have the same cost (3B)
 If M > √B, sort projection can be performed in two passes
▪ The first pass produces B/M sorted runs
▪ If there are fewer than M-1 of them only one merge pass is required
 Hash projection partitions may be of different sizes
▪ If just one partition is greater than M-1 pages, further partitioning of that partition is required
▪ Regardless of the overall size of the file
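The two-pass condition can be checked numerically. The helper below (an illustrative sketch, not part of the slides) counts sort passes: pass 0 produces ⌈B/M⌉ runs, and each merge pass combines up to M-1 runs.

```python
import math

def sort_passes(B, M):
    """Number of passes to sort B pages with M buffer pages."""
    runs = math.ceil(B / M)   # pass 0 produces ceil(B/M) sorted runs
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (M - 1))  # each merge pass merges M-1 runs
        passes += 1
    return passes

# With M > sqrt(B), pass 0 yields B/M < M-1 runs, so one merge pass
# suffices: e.g. B = 10,000 pages and M = 101 buffers gives 100 runs.
```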

 Aggregations without groups are simple to compute
▪ Scan the file and calculate the aggregate amount
▪ Requires one input buffer and a variable for the result
▪ Aggregations can usually be pipelined from a previous operation
SELECT MIN(gpa) FROM Student
 Aggregations with groups require more memory
▪ To keep track of the grouped data
▪ They can be calculated by sorting or hashing on the group attribute(s)
▪ Or by using an index with all the required attributes
SELECT AVG(income) FROM Doctor GROUP BY specialty
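The ungrouped case really does need only one input buffer and one variable, as a sketch of the MIN query shows (pages modeled as lists of dicts; `scan_min` is an illustrative name):

```python
def scan_min(pages, attr):
    """Ungrouped MIN: a single scan with one page in memory at a
    time and a single running variable for the result."""
    running = None
    for page in pages:            # one input buffer
        for rec in page:
            v = rec[attr]
            if running is None or v < running:
                running = v       # the only state required
    return running
```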

 In the sort-based approach the table is sorted on the group attribute(s)
 The result of the sort is scanned and the aggregate operation computed
▪ These two processes can be combined in a similar way to the sort-based projection algorithm
 The cost is driven by the sort cost
▪ 3B(R) if the table can be sorted in one merge pass
▪ The final result is typically much smaller than the sorted table
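A sketch of the sort-then-scan approach for the AVG query, assuming the table fits in memory so Python's `sorted` stands in for the external sort (`sort_group_avg` and its parameters are illustrative names):

```python
from itertools import groupby

def sort_group_avg(records, group_attr, agg_attr):
    """Sort-based grouped AVG: sort on the grouping attribute, then
    scan the sorted records computing the aggregate per group."""
    ordered = sorted(records, key=lambda r: r[group_attr])
    result = {}
    # After sorting, each group's records are adjacent.
    for key, grp in groupby(ordered, key=lambda r: r[group_attr]):
        vals = [r[agg_attr] for r in grp]
        result[key] = sum(vals) / len(vals)
    return result
```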

 In the hash-based approach an in-memory hash table is built on the grouping attribute(s)
▪ Hash table entries consist of ⟨grouping-value, running-information⟩
▪ e.g. count and sum for AVG
 The table is scanned and for each record
▪ Probe the hash table to find the entry for the group that the record belongs to, and
▪ Update the running information for that group
 Once the table has been scanned the grouped results are computed using the hash table entries
▪ If the hash table fits in main memory the cost is B(R)
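The same AVG query as a hash-based sketch, with a dict standing in for the in-memory hash table and each entry holding the (count, sum) running information (`hash_group_avg` and its parameters are illustrative names):

```python
def hash_group_avg(records, group_attr, agg_attr):
    """Hash-based grouped AVG: one scan, probing and updating
    <grouping-value, running-information> entries."""
    table = {}                      # in-memory hash table
    for rec in records:             # single scan: cost B(R)
        key = rec[group_attr]
        count, total = table.get(key, (0, 0))   # probe
        table[key] = (count + 1, total + rec[agg_attr])  # update
    # Compute the grouped results from the hash table entries.
    return {k: s / c for k, (c, s) in table.items()}
```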

 It may be possible to satisfy an aggregate query using just the data entries of an index
▪ The search key must include all of the attributes required for the query
▪ The data entries may be sorted or hashed, and
▪ No access to the actual records is required
▪ If the GROUP BY clause is a prefix of a tree index's search key, the data entries can be retrieved in the grouping order
▪ The actual records may also be retrieved in this order
 This is an example of an index-only plan
▪ This may seem unlikely, but an index may be created specifically for this use

3.2 / 4.1

SELECT *
FROM Customer NATURAL INNER JOIN Account

Customer ⋈ Account = σC.sin = A.sin(Customer × Account)

 A join is defined as a Cartesian product followed by a selection
▪ Where the selection is the join condition
▪ A natural join's condition is equality on all attributes in common
 Cartesian products typically result in much larger tables than joins
▪ It is important to be able to efficiently implement joins
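The definition translates directly into code, taking relations as lists of dicts (`natural_join` is an illustrative name). This product-then-select formulation is correct but deliberately naive; the join algorithms that follow avoid materializing the product.

```python
def natural_join(R, S):
    """Natural join as a Cartesian product followed by a selection,
    where the selection is equality on all common attributes."""
    common = set(R[0]) & set(S[0]) if R and S else set()
    return [{**r, **s}                              # concatenated record
            for r in R for s in S                   # Cartesian product
            if all(r[a] == s[a] for a in common)]   # selection
```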

 How do we find joined records without searching the entire space?
 Example: T(R) = 10,000 and T(S) = 4,000
▪ Assume S has a foreign key that references R
▪ So records in S relate to at most one record in R
 The sizes of the Cartesian product and the join are
▪ Cartesian product – 40,000,000 records
▪ Natural join – 4,000 (if every s in S relates to an r in R)
 Algorithms
▪ Nested loop joins
▪ Sort-merge join
▪ Hash join

 There are three nested loop join algorithms that compare each record in one relation to each record in the other
▪ They differ in how often the inner relation is read
 Tuple nested loop join, R ⋈R.i=S.j S
▪ Read one page of R at a time
▪ For each record in R, scan S and compare to all S records
▪ The result has the same ordering as R
▪ Cost = B(R) + (T(R) * B(S))

for each record r ∈ R
    for each record s ∈ S
        if r.i = s.j then add ⟨r,s⟩ to result

 Improved nested loop join
▪ As tuple nested loop join, but scan and compare S for records of R one page at a time
▪ Cost = B(R) + (B(R) * B(S))
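The improved (page-at-a-time) variant can be sketched as follows, with pages modeled as lists of dicts and `ri`, `sj` naming the join attributes (all names are illustrative):

```python
def page_nested_loop_join(R_pages, S_pages, ri, sj):
    """Page-oriented nested loop join: S is rescanned once per *page*
    of R rather than once per *record*, so the cost drops from
    B(R) + T(R)*B(S) to B(R) + B(R)*B(S)."""
    result = []
    for r_page in R_pages:            # read one page of R at a time
        for s_page in S_pages:        # rescan S for each page of R
            for r in r_page:
                for s in s_page:
                    if r[ri] == s[sj]:
                        result.append((r, s))
    return result
```

Note the result is still produced in the order of R, since R drives the outer loop.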

 The simple nested loop join algorithms do not make effective use of main memory
▪ Both require only two input buffers and one output buffer
 The algorithm can be improved by making the input buffer for R as large as possible
▪ Use M – 2 pages as an input buffer for the outer relation, R
▪ 1 page as an input buffer for the inner relation, S, and
▪ 1 page as an output buffer
▪ If the smaller relation fits in M – 2 pages the cost is B(R) + B(S)
▪ CPU costs are reduced by building an in-memory hash table on R, using the join attribute for the hash function

 What if the smaller relation is larger than M – 2 pages?
▪ Break R, the outer relation, into blocks of M – 2 pages
▪ I refer (somewhat flippantly) to these blocks as clumps; M – 2 is the clump size
▪ Scan S once for each clump of R
▪ Insert concatenated records ⟨r,s⟩ that match the join condition into the output buffer
▪ S is read ⌈B(R)/(M-2)⌉ times
▪ The total cost is B(R) + (⌈B(R)/(M-2)⌉ * B(S))
 Disk seek time can be reduced by increasing the size of the input buffer for S
▪ Which may increase the number of times that S is scanned
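The clump-based algorithm, including the in-memory hash table on R mentioned earlier, can be sketched as below (pages as lists of dicts; `block_nested_loop_join` and its parameters are illustrative names):

```python
def block_nested_loop_join(R_pages, S_pages, ri, sj, M):
    """Block nested loop join: read R in clumps of M-2 pages, hash each
    clump on the join attribute, and scan S once per clump, giving a
    total cost of B(R) + ceil(B(R)/(M-2)) * B(S)."""
    result, clump_size = [], M - 2
    for i in range(0, len(R_pages), clump_size):
        # Build an in-memory hash table on the clump to cut CPU cost.
        table = {}
        for page in R_pages[i:i + clump_size]:
            for r in page:
                table.setdefault(r[ri], []).append(r)
        for s_page in S_pages:            # one full scan of S per clump
            for s in s_page:
                for r in table.get(s[sj], []):
                    result.append((r, s))
    return result
```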

 Indexes can be used to compute a join where one relation has an index on the join attribute
▪ The indexed relation is made the inner relation (call it S)
▪ Scan the outer relation, R – B(R) reads
▪ While retrieving matching records of S using the index
 The inner relation is never scanned
▪ Only records that satisfy the join condition are retrieved
▪ Unlike the other nested loop joins, this algorithm does not compare every record in R to every record in S
 Cost depends on the size of R and the type of index
▪ B(R) + (T(R) * index cost per probe)

 The cost of index nested loop join is dependent on the type of index and the number of matching records
 The outer relation is scanned, and for each record of R the matching records of S are retrieved using the index
▪ Search the index for matching RIDs – access a leaf or bucket; the system catalog records the data needed to estimate this cost
▪ If there are no matching records, move on to the next record of R
▪ Retrieve the matching records
▪ One disk read if a single S record matches one R record
▪ If multiple S records match a single R record, the cost depends on the number of records and whether the index is primary or secondary
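The probe-per-record structure can be sketched as follows; the dict `S_index`, mapping join values to lists of matching S records, is an illustrative stand-in for a real hash or tree index:

```python
def index_nested_loop_join(R_pages, S_index, ri):
    """Index nested loop join: scan the outer relation R and probe an
    index on S's join attribute for each record.  S itself is never
    scanned; only matching S records are retrieved."""
    result = []
    for page in R_pages:                        # B(R) page reads
        for r in page:
            # One index probe per R record; records of R with no
            # match cost only the probe.
            for s in S_index.get(r[ri], []):
                result.append((r, s))
    return result
```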