CS 186/286 Fall 2017 Midterm 2
• Do not turn this page until instructed to start the exam.
• You should receive 1 single-sided answer sheet and a 16-page exam packet.
• All answers should be written on the answer sheet. The exam packet will be collected but not graded. • You have 80 minutes to complete the midterm.
• The midterm has 6 questions, each with multiple parts.
• For each question, place only your final answer on the answer sheet; do not show work.
• Use the blank spaces in your exam for scratch paper.
• You are allowed one 8.5” × 11” double-sided page of notes.
• No electronic devices are allowed.
1 Relational Algebra
For the following relational algebra questions, please use the schema defined below.
Students(sid, sname, year)
Companies(cid, cname, valuation) Recruitment(sid, cid, position, salary, status)
For each question, indicate which of the expressions gives the desired output. Note that the following table abbreviations are being used: Students → S, Companies → C, Recruitment → R. Also note that some of the expressions are invalid, in which case they should definitely not be marked in your answer.
1. (2 points) Find the names of all students who have received at least one recruitment offer. Note that if a student has received an offer, the status field in the Recruitment table will be the text “offer”. Mark all that apply.
A. πsname(σstatus=“offer′′ (S ◃▹ R))
B. πsname(S ◃▹ (σstatus=“offer′′ (R)))
C. σstatus=“offer′′ ((πsname(S)) ◃▹ R)
D. πsname(σstatus=“offer′′ (S ◃▹ (πsid(R))))
2. (2 points) Find the names of all students who have not received an offer from any company. Mark all that apply.
A. πsname((σstatus=“offer′′ (πsid(S) − πsid(R))) ◃▹ S) B. πsname((πsid(S) − πsid(σstatus=“offer′′ (R))) ◃▹ S) C. πsname(S) − πsname(σstatus=“offer′′ (R ◃▹ S))
D. σstatus=“offer′′ ((πsname(πsid(S) − πsid(R))) ◃▹ S)
3. (2 points) For every record in Recruitment, output the name of the student, name of the company he/she is being recruited for, and the position the student is being recruited for. Mark all that apply.
A. (πsname(S)) ◃▹ (πcname(C)) ◃▹ (πposition(R)) B. πsname,cname,position(S ◃▹ C ◃▹ R)
C. πsname,cname,position((πsid,sname(S)) ◃▹ (πcid,cname(C)) ◃▹ (πposition(R))) D. πsname,cname,position(S ◃▹ C ◃▹ (πsid,cid,position(R)))
4. (1.5 points) Apple is an American computing company. Orange is a European telecommunications company. Consider the following two expressions:
Solution: A, B
Solution: B
Solution: B, D
Page 2 of 16
ρ(Apples(1 → sid), πsid(σcname=′Apple′ (C ◃▹ R))) ρ(Oranges(1 → sid), πsid(σcname=′Orange′ (C ◃▹ R)))
Using those expressions, which of the expressions on the answer sheet computes the sids of students being recruited by both companies? Mark all that apply.
Solution: √ Apples ◃▹ Oranges ⃝ Apples − Oranges √ Apples ∩ Oranges ⃝ Apples ∪ Oranges √ Apples−(Apples−Oranges) ⃝ Oranges−(Apples−Oranges)
Page 3 of 16
2 Joins
2.1 T/F Questions
1. (2 points) Mark the statements that are true.
A. Grace Hash Join will always perform better than Sort Merge Join
2.2
B. It is possible that a hash function will not partition data evenly during the building phase.
C. If one of the joining relations fits in RAM, Naive Hash Join can outperform Grace Hash Join.
D. A two-pass Grace Hash join requires memory equivalent to the square root of the larger input relation.
Young Justice
Solution: F,T,T,F
For the following questions in this section, assume that we are streaming our query output to a terminal. (Do not consider the cost of writing the final output)
CREATE TABLE JusticeLeague {
member_id INTEGER PRIMARY KEY,
code_name CHAR(33),
birth_planet CHAR(20),
power_level INTEGER,
}
CREATE TABLE Teaches {
member_id INTEGER PRIMARY KEY,
teacher_id INTEGER REFERENCES JusticeLeague
since DATE
}
We assume that
• JusticeLeague has [J] = 100 pages
• Teaches has [T ] = 200 pages
• Buffer size = 12 pages
• Half the JusticeLeague records have birth planet = ’Earth’
2. (2 points) What is the best possible I/O cost of using Page Nested Loops Join to perform a natural join of Teaches and JusticeLeague? Consider both join orders!
3. (3 points) Consider the following query:
Solution:
[J] + [T][J] = 100 + 100*200 = 20100 I/Os ← This is the optimal. [T] + [T][J] = 200 + 100*200 = 20200 I/Os
Page 4 of 16
— Find all members who have a earth-born teacher.
SELECT T.member_id
FROM JusticeLeague J, Teaches T
WHERE T.teacher_id = J.member_id
AND J.birth_planet = ’Earth’;
Given the fact that half of the JusticeLeague members are earth-born, what is the best possible I/O cost to evaluate the query using Block Nested Loop Join? Assume you evaluate the selection operator on the fly (nomaterialization). ConsiderbothscanT ◃▹BNLJ (σ(scanJ))and(σ(scanJ))◃▹BNLJ scanT.
4. (3 points) What is the optimal I/O cost to evaluate the previous query using Grace Hash Join, again considering both join orders with selection on the fly?
Solution: Notice we will read in [J] amount of JusticeLeague no matter what. However, if we were to evaluate J.birth planet=’Earth’ before performing the join, we will only need to save half of the original number of pages into the buffer.
[J] + 1/2 * [T][J]/(10) = 100 + 100*200/(2*10) = 1100 I/Os ⇐ This is the optimal. [T] + [T][J]/(10) = 200 + 10*200 = 2200 I/Os
Solution:
• Pass 1 read: [J] + [T]
• Pass 1 write partitions: [J]/2 + [T]
• Pass 2 build: [J]/2
• Pass 2 probe: [T]
• Total=2[J]+3[T]=200+600=800I/O
Page 5 of 16
1 2 3 4 5 6 7
3 Parallel Query Evaluation
A local animal shelter that houses dogs, cats, and ducks maintains a database of their animals in the following relations:
Dogs(dog id, name, birthyear) Cats(cat id, name, birthyear) Ducks(duck id, name id, birthyear) Names(name id, name)
These relations are distributed across 3 different machines as follows:
• Dogs is stored solely on Machine 1.
• Cats is range-partitioned on birthyear over Machines 2 and 3.
• Ducks is hash-partitioned on name id over Machines 1, 2, and 3. • Names is round-robin partitioned over Machines 1, 2, and 3.
The network our database uses supports only “point-to-point” (“unicast”) messages. That is, for Machine 1 to send the same message m of b bytes to each of Machines 2 and 3, it must send m to Machine 2, and then separately send m to Machine 3; a total of 2b bytes sent.
The following questions consider the following SQL query to get the names of all dogs, cats, and ducks:
(SELECT name FROM Dogs UNION ALL
SELECT name FROM Cats) UNION
(SELECT name
FROM Names INNER JOIN Ducks
ON Ducks.name_id = Names.name_id)
1. (2 points) Which of the following operations can be pipeline breakers in this query? Assume that we have very little memory relative to the amount of data. Mark all that apply.
A. Full Table Scan on Dogs (line 1)
B. Parallel Hash Join (lines 6, 7)
C. UNION ALL of Dogs and Cats (line 2)
D. UNION of the two subqueries (line 4)
Solution: The union between the two subqueries is a pipeline breaker: UNION filters out duplicates, and because we are memory-constrained, we cannot filter duplicates by keeping track of previously seen values, and must resort to external sorting or hashing.
The parallel hash join is a pipeline breaker: we have little memory and therefore a symmetric hash join will not work and we must first partition/hash our data locally, which means reading in everything before returning a tuple.
The union between Dogs and Cats is not a pipeline breaker since UNION ALL does not filter duplicates, so we can just immediately yield any record obtained from scanning either Dogs or Cats.
The full table scans are not pipeline breakers, since they yield records immediately and do not require everything to be read in first.
2. (1 point) Using parallel Grace hash join, what is the lowest network cost (the number of bytes sent over the network by any node) possible while performing the join of Names and Ducks?
Page 6 of 16
[N] denotes the number of pages in Names and [D] denotes the number of pages in Ducks. Each page is 4KB.
A. [N] * 4KB
B. [D] * 4KB
C. 1 ([N]+[D])*4KB 3
D. 2 ([N]+[D])*4KB 3
E. ([N] + [D]) * 4KB
F. 2 [N]*4KB 3
G. None of the above
Solution: The join between Names and Ducks is a hash join with asymmetric shuffle, since Ducks
is already partitioned over the join key (name id). We therefore do not need to partition Ducks.
The network cost of shuffling Names over the machines is 2 * [N] * 4KB, since we have to send 2 33
of the entire relation over the network once, and since we only send data over the network while shuffling, this is the total network cost. The 2 factor comes from the fact that a third of all the
3
data will be partitioned to the machine it is already on (since we have 3 machines), and therefore is not sent over the network.
3. (0.5 points) True or False? There is an extra disk I/O cost for parallel Grace Hash Join due to reparti- tioning (shuffling) the data across machines.
4. (1 point) Suppose we range-partition cats by birthyear, so that Machine 2 stores cats born before 1950, and Machine 3 stores cats born 1950 or later. Assume both machines perform scans at the same rate. If 40% of cats in our data were born before 1950, how long will it take for a parallel scan of cats to complete?
[C] denotes the number of pages in Cats, and D represents the time taken to perform a single I/O. A. 1/3*[C]*D
B. 0.4*[C]*D
C. 0.6*[C]*D
D. [C]*D
E. None of the above
5. (2 points) Now suppose we wish to compare cats and dogs with the same name, and perform the following query:
Solution: False. Tuples are streamed from the full table scans into the local hash join operators on each of the machines, so the repartitioning does not incur any additional disk I/O cost (there is a network cost, as calculated in the previous question, but no additional disk I/O cost).
Solution: Machine 1 has no pages of Cats and does nothing.
Machine 2 performs just a full table scan on the 0.4 * [C] pages of cats it has. Machine 3 performs just a full table scan on the 0.6 * [C] pages of cats it has.
The scans Machine 2 and 3 perform run in parallel, so we are finished once the longer scan finishes, which is 0.6 * [C] * r.
Page 7 of 16
SELECT * FROM Cats INNER JOIN Dogs ON Cats.name = Dogs.name
Assuming that [Dogs] = 10 pages and [Cats] = 1000 pages, what is the least number of pages that must be transferred across the network to perform this join?
6. (1.5 points) In the following questions, select the correct choice to fill in the blank.
(a) (0.5 points) The term captures the improvement in throughput as the parallelism (number of machines) is increased.
A. speedup
B. scaleup
(b) (0.5 points) The term captures the ability to maintain throughput as both the parallelism and the size of the data are increased.
A. speedup
B. scaleup
(c) (0.5 points) Given: 50% of the Ducks in our database all share the same name id. Suppose we run a series of experiments, each time increasing the number of machines over which we hash-partition the Ducks table, and measuring the time to complete a parallel scan of Ducks each time. Across experiments, the completion time for parallel scan will be a function of the number of machines used.
A. constant
B. logarithmic C. linear
D. exponential
Solution: We can perform a broadcast join since Dogs is a small table. We send all of Dogs to Machines 2 and 3, and then we join Dogs and Cats locally, resulting in a network cost of 2 * [Dogs] = 20 pages.
Solution: Ducks is hash-partitioned on name id, so a single machine will get every record with the popular name id. As we increase the number of machines, we remain bottlenecked by the single machine with the popular name id (which always has to scan half the records), so the completion time is constant.
Page 8 of 16
4 Query Optimization
For the following questions, assume the following:
• The System R assumptions about uniformity and independence from • We use System R defaults when selectivity estimation is not possible • Primary key IDs are sequential, starting from 1
• Our optimizer does not consider interesting orders
lecture hold
Table Schema
Records
Pages
Indices
CREATE TABLE Student (
sid INTEGER PRIMARY KEY, name VARCHAR(32),
major VARCHAR(64)), semesters_completed INTEGER
)
25,000
500
• Index 1: Clustered(major). There are 130 unique majors
• Index 2: Unclustered(semesters completed). There are 11 unique values in the range [0, 10]
CREATE TABLE Application (
sid INTEGER REFERENCES Student, cid INTEGER REFERENCES Company, status TEXT,
(sid, cid) PRIMARY KEY
)
100,000
10,000
• Index 3: Clustered(cid, sid).
• Given: status has 10 unique values
CREATE TABLE Company (
cid INTEGER PRIMARY KEY, open_roles INTEGER)
)
500
100
• Index 4: Unclustered(cid)
• Index 5: Clustered(open roles). There are 500 unique values in the range [1, 500]
SELECT Student.name, Company.open_roles, Application.referral FROM Student, Application, Company
WHERE Student.sid = Application.sid
AND Application.cid = Company.cid
AND Student.semesters_completed > 6
AND (Student.major=’EECS’ OR Company.open_roles <= 50) AND NOT Application.status = ’limbo’
ORDER BY Company.open_roles;
-- (Selectivity 1)
-- (Selectivity 2)
-- (Selectivity 3)
-- (Selectivity 4)
-- (Selectivity 5)
For the first 5 questions, write the reduction factor for each clause in the answer box as a fully-reduced fraction. For example, 1/2 should be written as:
Note: Answers that are not reduced fractions or are formatted incorrectly will receive 0 points. 1. (0.5 points) Student.sid = Application.sid
2. (0.5 points) Application.cid = Company.cid
1/2
Solution:
1=1 max(25000, 25000) 25, 000
There are exactly 25000 values in Student.sid, and due to the foreign key, there are at most 25000 values in Application.sid.
Page 9 of 16
Solution:
1=1 max(500, 500) 500
There are exactly 500 values in Company.cid, and due to the foreign key, there are at most 500 values in Application.cid
3. (0.5 points) Student.semesters completed > 6
4. (1 point) Student.major = ’EECS’ OR Company.open roles <= 50
Solution:
10−6 =4 10−0+1 11
We have a histogram with 11 unique values. Therefore we use the equation for less than or equal to which is (high key - value) / (high key - low key + 1).
Solution:
(1 +1)−(1 ∗1)= 10 +130− 1 =139 130 10 130 10 1300 1300 1300 1300
We can find the selectivity that they are an EECS major by using the equation 1/distinct values. Next, we find the selectivity that open positions are less than or equal to 50 using the equation (v - low key) / ((high key - low key + 1) + (1 / number distinct)). Lastly we combine these two selectivities using S(p1) + S(p2) - S(p1)S(p2) to determine the selectivity of having one or the other.
5. (0.5 points) NOT Application.status = ’limbo’
Solution:
1−1=9 10 10
Given 10 unique values, the non-negated predicate has selectivity 1 , so we can use the equation for 10
NOT which is 1 - selectivity of the predicate.
Page 10 of 16
6. (2.5 points) For each predicate, mark the first pass of Selinger’s algorithm that uses its selectivity to estimate output size.
(a) Selectivity 1: Pass 1, Pass 2, or Pass 3. (b) Selectivity 2: Pass 1, Pass 2, or Pass 3. (c) Selectivity 3: Pass 1, Pass 2, or Pass 3. (d) Selectivity 4: Pass 1, Pass 2, or Pass 3. (e) Selectivity 5: Pass 1, Pass 2, or Pass 3.
7. (1 point) Mark the choices for all access plans that would be considered in pass 2 of the Selinger algo- rithm.
A. Student ◃▹ Application (800 IOs) B. Application ◃▹ Student (750 IOs) C. Student ◃▹ Company (470 IOs)
D. Company ◃▹ Student (525 IOs)
E. Application ◃▹ Company (600 IOs) F. Company ◃▹ Application (575 IOs)
8. (1 point) Mark the choices from the previous question for all access plans that would be chosen at the end of pass 2 of the Selinger algorithm.
9. (1 point) Mark the choices for all plans that would be considered in pass 3. Note carefully the nested expressions: you may want to draw them out.
A. Company ◃▹ (Application ◃▹ Student) (175,000 IOs) B. Company ◃▹ (Student ◃▹ Application) (150,000 IOs) C. Application ◃▹ (Company ◃▹ Student) (155,000 IOs) D. Application ◃▹ (Company ◃▹ Student) (160,000 IOs) E. Student ◃▹ (Company ◃▹ Application) (215,000 IOs) F. (Company ◃▹ Application) ◃▹ Student (180,000 IOs) G. (Application ◃▹ Company) ◃▹ Student (200,000 IOs) H. (Application ◃▹ Student) ◃▹ Company (194,000 IOs)
I. (Student ◃▹ Application) ◃▹ Company (195,000 IOs) J. (Student ◃▹ Company) ◃▹ Application (165,000 IOs)
Solution: Pass 2, Pass 2, Pass 1, Pass 3, Pass 1. Note that (d)—the OR predicate—is over 2 tables that have no associated join predicate, so the selection is postponed along with the cross-product, until after 3-way joins are done.
Solution: A, B, E, and F will be considered because they are not cross products.
Solution: B and F will be chosen because they have the lower cost for joining the two tables tables.
Page 11 of 16
Solution: Considers F, H
10. (1 point) Mark the choices from the previous question for all plans that would be chosen at the end of pass 3.
Solution: Chooses F.
Page 12 of 16
5 Database Design 5.1 ER Diagrams
Consider the following ER diagrams labelled i, ii, and iii. rsrsrs
(i) (ii) (iii) LLL
1. (0.5 points) Which ER diagram do the following SQL statements correspond to? Note that there is exactly one correct answer.
CREATE TABLE S(s INT, PRIMARY KEY (s));
CREATE TABLE R(r INT, s INT REFERENCES S, PRIMARY KEY (r));
2. (0.5 points) Which ER diagram do the following SQL statements correspond to? Note that there is exactly one correct answer.
CREATE TABLE R(r INT, PRIMARY KEY (r));
CREATE TABLE S(s INT, PRIMARY KEY (s));
CREATE TABLE L(r INT REFERENCES R, s INT REFERENCES S);
3. (0.5 points) Which ER diagram do the following SQL statements correspond to? Note that there is exactly one correct answer.
CREATE TABLE S(s INT, PRIMARY KEY (s));
CREATE TABLE R(r INT, s INT NOT NULL REFERENCES S, PRIMARY KEY (r));
R
S
R
S
R
S
Solution: (ii). By embedding a reference to S in the R relation, we ensure that every tuple in R can only be linked to at most one tuple in S. Because R.s can be NULL, it is also possible that a tuple in R doesn’t link any tuple in S.
This question was graded all or nothing. The correct answer received 0.5 points; every other answer received 0 points.
Solution: (i). By creating a third L relation, a tuple in R can match many tuples in S, and a tuple in S can match many tuples in R. Moreover, it’s possible that tuples in R and tuples in S don’t link with any other tuple.
This question was graded all or nothing. The correct answer received 0.5 points; every other answer received 0 points.
Solution: (iii). By embedding a reference to S in the R relation, we ensure that every tuple in R can only be linked to at most one tuple in S. Because R.s is marked NOT NULL, every tuple is also guaranteed to be linked to some tuple in S.
This question was graded all or nothing. The correct answer received 0.5 points; every other answer received 0 points.
Page 13 of 16
5.2 Functional Dependencies and Normalization
4. (1 point) True or False? For all sets of functional dependencies F and G, F + ∪ G+ = (F ∪ G)+. That is, closure is homomorphic.
5. (2 points) Which of the following functional dependencies are in the closure of {T → UV,TV → WX,UX→Y}? Markallthatapply.
• UX → W • V → X • T → X • T → Y
Solution: False. Consider a relation with three attributes x, y, and z. Let X = {x}, Y = {y}, and Z = {z}. Let F = {X → Y}, and let G = {Y → Z}. X → Z ∈/ F+ and X → Z ∈/ G+, but by transitivity X → Z ∈ (F ∪ G)+. Thus, F+ ∪ G+ ̸= (F ∪ G)+.
This question was graded all or nothing. The correct answer received 1 point; every other answer received 0 points.
Solution: UX → W ∈/ F+ and V → X ∈/ F+. T → UV and TV → WX, so T → WX. Thus, by decomposition, T → X. Moreover, UX → Y , so T → Y .
This question was graded as 4 independent true/false questions, each worth 0.5 points. That is, you got 0.5 points for every correct choice that you selected and 0.5 points for every incorrect choice that you did not select.
6. (2 points) Given the relation R and functional dependencies above, which of the following attribute sets are superkeys? Mark all that apply.
• T • UV • TV • UX
Solution: An attribute set X is a superkey if X+ = R. T+ = TUVWXY, so T is a superkey. (UV)+ =UV,soUV isnotasuperkey. TV+ =TUVWXY,soitisasuperkey,andUX+ =UXY, so UX is not a superkey.
This question was graded as 4 independent true/false questions, each worth 0.5 points. That is, you got 0.5 points for every correct choice that you selected and 0.5 points for every incorrect choice that you did not select.
7. (1 point) Given the relation R and functional dependencies above, which of the following attribute sets are candidate keys? Mark all that apply.
• T • UV • TV • UX
Page 14 of 16
Solution: An attribute set is a candidate key if it is a minimal superkey. T+ = TUV WXY , and T isminimal,soT isacandidatekey. (UV)+ =UV,soUV isnotasuperkey. TV+ =TUVWXY but TV is not minimal, so it is a superkey but not a candidate key, and UX+ = UXY , so UX is not a superkey.
This question was graded as 4 independent true/false questions, each worth 0.25 points. That is, you got 0.25 points for every correct choice that you selected and 0.25 points for every incorrect choice that you did not select.
8. (1 point) Is R in BCNF?
9. (2 points) Which of the following are lossless-join decompositions of R? Mark all that apply. A. TVW andUXY
B. T andUVWXY
C. TUVWX and UXY D. TUVWX and TUXY
Solution: No. UX is not a superkey, so the non-trivial dependency UX → Y violates BCNF.
This question was graded all or nothing. The correct answer received 1 point; every other answer received 0 points.
Solution: A and B are not lossless-join decompositions.
C is produced by following the decomposition algorithm presented in class. We use the UX → Y functional dependency to separate TUV WXY into TUV WXY −Y and UX ∪Y . This is a lossless- join decomposition.
D is a lossless join decomposition because both TUV WX and TUXY contain T which is a candidate key.
This question was graded as 4 independent true/false questions, each worth 0.5 points. That is, you got 0.5 points for every correct choice that you selected and 0.5 points for every incorrect choice that you did not select.
10. (1 point) Which of the following are lossless-join decompositions of R into BCNF? Mark all that apply. A. TVW andUXY
B. T andUVWXY
C. TUVWX and UXY
D. TUVWX and TUXY
Solution: A and B are not lossless-join decompositions.
C is produced by following the decomposition algorithm presented in class. We use the UX → Y functional dependency to separate TUV WXY into TUV WXY −Y and UX ∪Y . This is a lossless- join decomposition. Moreover, both TUV W and UXY are in BCNF.
D is a lossless join decomposition because both TUV WX and TUXY contain T which is a candidate key. However, TUXY is not in BCNF because UX is not a superkey in TUXY , but UX → Y .
Page 15 of 16
This question was graded as 4 independent true/false questions, each worth 0.25 points. That is, you got 0.25 points for every correct choice that you selected and 0.25 points for every incorrect choice that you did not select.
Page 16 of 16