CS5481 Data Engineering Tutorial 8
1. Consider the relations r1(A,B,C), r2(C, D, E), and r3(E, F) in which r1 has 1,000 tuples, r2 has 1,500 tuples, and r3 has 750 tuples.
a) Suppose the primary keys of r1, r2 and r3 are A, C, and E, respectively. Estimate the size of r1 r2 r3.
b) Assume that there are no primary keys, except the entire schema. Let V(C, r1) be 900, V(C, r2) be 1100, V(E, r2) be 50, and V(E, r3) be 100. Estimate the size of r1 r2 r3.
a) The relation resulting from the join of r1, r2, and r3 will be the same no matter which way we join them, due to the associative and commutative properties of joins. So, we will consider the size based on the strategy of ((r1 r2) r3). Joining r1 with r2 will yield a relation of at most 1000 tuples, since C is a key for r2. Likewise, joining that result with r3 will yield a relation of at most 1000 tuples because E is a key for r3. Therefore the final relation will have at most 1000 tuples.
b) The estimated size of the relation can be determined by calculating the average number of tuples which would be joined with each tuple of the second relation. In this case, for each tuple in r1, 1500/V(C, r2) = 15/11 tuples (on the average) of r2 would join with it. The intermediate relation would have 15000/11 tuples. This relation is joined with r3 to yield a result of approximately 10,227 tuples (15000/11 × 750/100 = 10227).
2. Consider the following schema, where the keys are underlined:
ENGINEER (ID, Name, Salary) PROJECT (PID, ICEngID, Budget)
The ICEngID attribute in PROJECT is the ID of the engineer in charge of the project. The PID is the ID of the project. Both are sequential files in which records are stored in primary- key order. Consider the query
SELECT FROM WHERE
*
ENGINEER E, PROJECT P E.ID=P.ICEngID AND P.Budget > 30
a) Write an unoptimized relational expression that might initially generate from the SQL query translator.
Budget > 30 (PROJECT ICEngID=IDENGINEER)
b) Write an equivalent expression by fully pushing the selection. (Budget > 30 (PROJECT)) ICEngID=IDENGINEER
CS5481 Data Engineering Tutorial 8
c) Base on b) and use the following information, suggest an evaluation plan and estimate the cost in number of disk block transfers (ignore seek time here).
o M=5
o nENGINEER = 10,000
o bENGINEER = 2,000
o nPROJECT = 2,000
o bPROJECT = 500
o min (Budget, PROJECT) = 10
o max (Budget, PROJECT) = 60
o 4-level primary B+-tree index on ID for ENGINEER
o 2-level secondary B+-tree index on Budget for PROJECT
ENGINEER
PROJECT
o Number of tuples satisfying the selection condition = 2000 * (60-30)/(60-10) = 1200 o Cost of selection = 500
o Pipeline the selection output to the join operator.
o Cost of indexed nested loop join of the output of selection and ENGINEER = 1200 *
(4+1) = 6000
o Totalcost=500+6000=6500
o Memory allocation: at the same time, 1 block for selection input (PROJECT), 1 block
for selection output, 1 block for ID index / join input (ENGINEER), 1 block for join output
ICEngID=ID
(use index on ID)
pipeline
Budget>30 (linear search)