CS 186 Introduction to Database Systems Spring 2021
INSTRUCTIONS
Midterm 2
This is your exam. Complete it either at exam.cs61a.org or, if that doesn’t work, by emailing course staff with your solutions before the exam deadline.
This exam is intended for the student with email address
For questions with circular bubbles, you should select exactly one choice. You must choose either this option
Or this one, but not both!
For questions with square checkboxes, you may select multiple choices. You could select this choice.
You could select this one too!
You may start your exam now. Your exam is due at
Do not share this exam until solutions are released.
1.0.1 Contents:
• The midterm has 5 questions, each with multiple parts, and worth a total of 100 points.
1.0.2 Taking the exam:
• You have 110 minutes to complete the midterm.
• You may print this exam to work on it.
• For each question, submit only your final answer on Examtool.
• For numerical answers, do not input any suffixes (i.e. if your answer is 5 I/Os, only input 5 and not 5 I/Os
or 5 IOs).
• Make sure to save your answers in Examtool at the end of the exam, although the website should autosave
your answers as well.
1.0.3 Aids:
• You may use two pages (double sided) of handwritten notes as well as a calculator.
• You must work individually on this exam.
1.0.4 Grading Notes:
• All I/Os must be written as integers. There is no such thing as 1.02 I/Os – that is actually 2 I/Os.
• 1 KB = 1024 bytes. We will be using powers of 2, not powers of 10.
• Unsimplified answers, like those left in log format, will receive a point penalty.
(a) What is your full name?
(b) What is your student ID number?
1. (6 points)
We want to model boba franchises, their affiliated locations, and customers they have sold boba to. Consider the ER diagram below.
Assumptions:
• Each franchise’s id is a unique identifier.
• Each location is affiliated with exactly one franchise, and each franchise must have at least 1 location.
• Each franchise has at most one location in a given zipcode, but multiple franchises can have locations in
the same zipcode.
• Customers can buy boba from multiple locations and can buy any number of drinks, including no drinks.
(a) (1 pt) What type of edge should be drawn at 1 from Franchise to Affiliates?
Thin line
Bold line
Thin arrow
Bold arrow
(b) (1 pt) What type of edge should be drawn at 2 from Location to Affiliates?
Thin line
Bold line
Thin arrow
Bold arrow
(c) (1 pt) What type of edge should be drawn at 3 from Location to Buys from?
Thin line
Bold line
Thin arrow
Bold arrow
(d) (1 pt) What type of edge should be drawn at 4 from Customer to Buys from?
Thin line
Bold line
Thin arrow
Bold arrow
(e) (2 pt) What attribute(s) should be part of the Location relation’s primary key? If there are multiple attributes, separate them with commas.
We store affiliates as part of Location as it is a weak entity.
zipcode, Franchise.id
2. (14 points) Functional What?
(a) (8 points)
Consider the relation R = ABCDE.
Determine whether each of the following sets of functional dependencies leads to R having exactly two different candidate keys (the two keys do not need to contain the same number of attributes). If so, write “Yes” below and write out what those candidate keys are. For instance, if the keys are AB and C, then write “Yes. AB, C”. Otherwise, write “No”. No explanations needed.
i. (2 pt) A -> B, B -> C
No
ii. (2 pt) A -> BC, B -> D, C -> DE
No
iii. (2 pt) AB -> DE, B -> C, D -> E, CD -> A
Yes. AB, BD
iv. (2 pt) B -> CDE, C -> B, D -> A
Yes. B, C
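Not part of the original exam: the answers above can be checked mechanically with an attribute-closure computation. The sketch below (class and method names are my own, shown for the FDs in part iii) computes closures and enumerates candidate keys.

import java.util.*;

public class ClosureDemo {
    // Closure of ATTRS under FDS, where each FD maps a left-hand side such as "AB"
    // to a right-hand side such as "DE". Repeatedly apply FDs until nothing changes.
    static Set<Character> closure(Set<Character> attrs, Map<String, String> fds) {
        Set<Character> result = new HashSet<>(attrs);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, String> fd : fds.entrySet()) {
                boolean lhsHeld = fd.getKey().chars().allMatch(c -> result.contains((char) c));
                if (lhsHeld) {
                    for (char c : fd.getValue().toCharArray()) {
                        changed |= result.add(c);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String rel = "ABCDE";
        Map<String, String> fds = new LinkedHashMap<>();
        fds.put("AB", "DE"); fds.put("B", "C"); fds.put("D", "E"); fds.put("CD", "A");

        // A set of attributes is a superkey iff its closure is all of ABCDE.
        List<Set<Character>> superkeys = new ArrayList<>();
        for (int mask = 1; mask < (1 << rel.length()); mask++) {
            Set<Character> attrs = new HashSet<>();
            for (int i = 0; i < rel.length(); i++) {
                if ((mask & (1 << i)) != 0) attrs.add(rel.charAt(i));
            }
            if (closure(attrs, fds).size() == rel.length()) superkeys.add(attrs);
        }
        // Candidate keys are superkeys that contain no smaller superkey.
        for (Set<Character> key : superkeys) {
            boolean minimal = superkeys.stream()
                    .noneMatch(other -> other.size() < key.size() && key.containsAll(other));
            if (minimal) System.out.println("Candidate key: " + key);  // AB and BD for part iii
        }
    }
}

The same closure routine drives the BCNF checks in part (b): a listed dependency X -> Y is a BCNF violation exactly when the closure of X is not all of ABCDE.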
(b) (6 points)
Consider the relation R = ABCDE.
For each of the following sets of functional dependencies, determine whether it leaves R in BCNF without any further decomposition. If so, write “Yes” below. Otherwise, write “No” and the functional dependency that is violated. For instance, if it violates the dependency A → B, then write “No. A -> B”. You only need to write one of the violations even if multiple exist.
i. (2 pt) A -> BC, B -> CD, C -> DE, D -> EA
Yes
ii. (2 pt) B -> CDE, C -> B, D -> A
No. D -> A
iii. (2 pt) AB -> CD, B -> CDE, C -> DE, CE -> AB
Yes
3. (35 points) There’s no Transaction Fees!
(a) (6 points) True or False
For the following 6 questions, indicate whether the statement is true or false.
i. (1 pt) For a given schedule, it is possible for the same transaction to abort under both wait-die and wound-wait deadlock prevention policies.
True
False
Consider the following schedule: T1 – X(A), T3 – S(B), T3 – S(A), T2 – X(B) with priorities 1 > 2 > 3. T3 will abort under wait-die when it tries to acquire S(A), since T1 has higher priority than T3. T3 will also abort under wound-wait when T2 tries to acquire X(B), since T2 has higher priority than T3.
ii. (1 pt) Guaranteeing view serializability is sufficient for guaranteeing isolation from the ACID properties.
True
False
View serializability ensures serializability, which is how isolation is enforced.
iii. (1 pt) The waits-for graph is the conflict dependency graph with edges reversed.
True
False
The dependency graph does not consider which locks each transaction holds at each timestep, whereas
the waits-for graph does.
iv. (1 pt) A view serializable schedule will always satisfy strict 2PL requirements.
True
False
A read-only schedule will satisfy view serializability, but it may still violate strict 2PL if all locks are
not released together.
v. (1 pt) During deadlock no transactions in the database will make progress.
True
False
Transactions not involved in deadlock can still proceed.
vi. (1 pt) Multiple locking granularity provides no throughput benefit for a workload consisting entirely of full table scans compared to table-only locking.
True
False
Table scans consist simply of read-only operations, which would not block each other. Multiple locking granularity would in fact reduce throughput by adding additional overhead.
(b) (3 points)
Consider the following schedule. The top row is the timestep of each operation.
[Schedule table — transactions T1–T4, timesteps 1–10. The operations appearing in the table are R(A), W(C), W(B), R(C), W(C), W(D), W(A), R(B), R(D), W(D); their transaction and timestep positions were lost in extraction.]
i. (2 pt) How many edges in the conflict dependency graph of the above schedule point to T2?
See the conflict dependency graph.
2
ii. (1 pt) Select all of the following serial schedules that are conflict equivalent to the above schedule.
T1, T2, T3, T4
T1, T4, T3, T2
T4, T2, T3, T1
T3, T2, T4, T1
None of the above
The schedule is not conflict serializable, so it will not be conflict equivalent to any serial schedule.
(c) (9 points) Deadlock Detection and Prevention
Consider the following schedule. The top row is the timestep of each operation.
This schedule uses multigranularity locks as presented in lecture. The resources are: Tables A and B with pages A1, A2 and B1, B2 respectively. Assume every transaction has an IX lock on the database.
There is queue skipping for this problem. Queue skipping allows a transaction to acquire any compatible locks, even if the wait queue for the resource is not empty.
[Lock-request table — transactions T1–T4, timesteps 1–11. The rows begin T1: IS(A); T2: IS(B); T3: X(A); T4: IX(B), SIX(A). The remaining requests appearing in the table are S(A1), S(A), S(B2), SIX(B), X(B1), X(A1); their transaction and timestep positions were lost in extraction.]
i. (2 pt) Select all of the following transactions that point towards T1 in the waits-for graph for the schedule above.
T2
T3
T4
None of the above
See the waits-for graph.
ii. (1 pt) Select all of the following transactions that are involved in deadlock for the schedule above.
T1
T2
T3
T4
No deadlock
See the waits-for graph in the previous solution. There are no cycles so there is no deadlock.
iii. (3 pt) Select all of the following transactions that will abort under the wait-die deadlock prevention policy for the schedule above.
The priorities in descending order are: T1, T2, T3, T4 (i.e. T1 has the greatest priority and T4 has the least priority).
T1
T2
T3
T4
None of the above
• T3 aborts at TS (timestep) 3, because X(A) conflicts with IS(A) and T3 has lower priority than T1
• T2 waits for T4 at TS 6, because S(A) conflicts with SIX(A), but S(A) does not conflict with IS(A). T2 has higher priority than T4, so T2 waits.
• T4 aborts at TS 10, because X(A1) conflicts with S(A1) and T4 has lower priority than T1.
iv. (3 pt) Select all of the following transactions that will abort under the wound-wait deadlock prevention policy for the schedule above.
The priorities are the same as the previous question.
T1
T2
T3
T4
None of the above
• T3 waits at TS 3, because X(A) conflicts with IS(A) and T3 has lower priority than T1.
• T4 aborts at TS 6, because S(A) conflicts with SIX(A) and T2 has higher priority than T4.
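Not part of the exam: both policies reduce to a single priority comparison at the moment of a lock conflict. A minimal sketch (names are mine; a higher number means higher priority, so T1 > T2 > T3 > T4 here):

public class DeadlockPrevention {
    // Wait-die: a lower-priority requester "dies" (aborts); a higher-priority requester waits.
    static boolean requesterAbortsWaitDie(int requesterPriority, int holderPriority) {
        return requesterPriority < holderPriority;
    }

    // Wound-wait: a higher-priority requester "wounds" (aborts) the holder; otherwise it waits.
    static boolean holderAbortsWoundWait(int requesterPriority, int holderPriority) {
        return requesterPriority > holderPriority;
    }

    public static void main(String[] args) {
        int t1 = 4, t2 = 3, t3 = 2, t4 = 1;
        // Timestep 3: T3 requests X(A) against T1's IS(A).
        System.out.println(requesterAbortsWaitDie(t3, t1));   // true  -> T3 aborts (wait-die)
        System.out.println(holderAbortsWoundWait(t3, t1));    // false -> T3 waits (wound-wait)
        // Timestep 6: T2 requests S(A) against T4's SIX(A).
        System.out.println(requesterAbortsWaitDie(t2, t4));   // false -> T2 waits (wait-die)
        System.out.println(holderAbortsWoundWait(t2, t4));    // true  -> T4 aborts (wound-wait)
    }
}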
(d) (13 points) Serializability
Consider the following incomplete Java code that checks whether or not a schedule is serializable.
/* Returns whether or not the Schedule S is serializable */
public static boolean is_serializable(Schedule s) {
    /* BEGIN BLOCK A */
    if (_____(1)_____) {
        return true;
    }
    if (is_view_serializable(s)) {
        _____(2)_____;
    }
    /* END BLOCK A */
    for (Schedule ss: serial_schedules(s)) {
        if (_____(3)_____) {
            return true;
        }
    }
    return false;
}

/* Returns whether or not the Schedule S is conflict serializable */
public static boolean is_conflict_serializable(Schedule s) {
    Graph graph = new Graph(s.transaction_ids);
    Edges edges = _____(4)_____;
    graph.add_edges(edges);
    return _____(5)_____;
}
The functions shown below may be helpful to complete the blanks above. Implementations are not shown on purpose.
/* Returns whether or not the Schedule S is view serializable */
public static boolean is_view_serializable(Schedule s) {…}
/* Returns a List of all possible Schedules if S were serial */
public static List<Schedule> serial_schedules(Schedule s) {…}
/* Returns whether or not Schedules S1 and S2 are equivalent */
public static boolean equivalent(Schedule s1, Schedule s2) {…}
public class Schedule {
    // List of transaction IDs
    public List transaction_ids;
    // Conflicting operations where 1st is a read and 2nd is a write
    public Edges rw_edges() {…}
    // Conflicting operations where 1st is a write and 2nd is a read
    public Edges wr_edges() {…}
    // Conflicting operations where 1st is a write and 2nd is a write
    public Edges ww_edges() {…}
}
public class Graph {
    public Graph(List transaction_ids) {…}
    public void add_edges(Edges edges) {…} // Adds EDGES to this graph
    public boolean is_cyclic() {…} // Returns whether or not this graph contains a cycle
}
// A collection of edges in a graph.
public class Edges {
    // Result of concatenating this Edges object with OTHER
    public Edges concat(Edges other) {…}
}
The next 5 questions involve filling in the blanks above.
Exact Java syntax is not necessary to get full credit, but try to be as accurate as possible.
i. (2 pt) What belongs in the blank labeled (1)?
is_conflict_serializable(s)
ii. (2 pt) What belongs in the blank labeled (2)?
return true
iii. (2 pt) What belongs in the blank labeled (3)?
equivalent(s, ss)
iv. (3 pt) What belongs in the blank labeled (4)?
s.rw_edges().concat(s.wr_edges()).concat(s.ww_edges())
v. (3 pt) What belongs in the blank labeled (5)?
!graph.is_cyclic()
vi. (1 pt) Consider a change to the function is_serializable above, where the code within BLOCK A is removed. Would this modified code still correctly check for serializability?
Yes
No
Yes, it would still be correct, because we are just removing checks for conflict and view serializability. Any conflict or view serializable schedule is still serializable, and the serializability check is handled within the for loop. The function’s best case will now be slower, because conflict serializability is cheaper to check than serializability.
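For reference (not part of the exam text), here is how the two methods read with the answers above dropped into the blanks; this relies on the Schedule, Graph, and Edges helpers declared in the problem, so it is a sketch rather than standalone runnable code.

public static boolean is_serializable(Schedule s) {
    /* BEGIN BLOCK A */
    if (is_conflict_serializable(s)) {              // (1) cheap sufficient check first
        return true;
    }
    if (is_view_serializable(s)) {
        return true;                                // (2)
    }
    /* END BLOCK A */
    for (Schedule ss: serial_schedules(s)) {
        if (equivalent(s, ss)) {                    // (3) compare against every serial order
            return true;
        }
    }
    return false;
}

public static boolean is_conflict_serializable(Schedule s) {
    Graph graph = new Graph(s.transaction_ids);
    // (4) every kind of conflicting pair (RW, WR, WW) contributes edges
    Edges edges = s.rw_edges().concat(s.wr_edges()).concat(s.ww_edges());
    graph.add_edges(edges);
    return !graph.is_cyclic();                      // (5) conflict serializable iff acyclic
}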
(e) (4 points)
i. (4 pt) Consider a change where a transaction must release all of its held locks when it is placed on the wait queue for any lock. It must then re-acquire any necessary locks it had previously released upon leaving the wait queue.
In no more than 4 sentences explain whether or not this change prevents all deadlock.
Yes, this prevents deadlocks, because no transaction will wait for a transaction that is in a wait queue. Transactions may however take a lot longer to complete and isolation properties are easily violated.
4. (21 points) Que-rum Op-tea-miza-sugar
Consider the following tables:
CREATE TABLE T1 (
x INTEGER,
y FLOAT,
z FLOAT );
CREATE TABLE T2 (
x INTEGER,
y FLOAT,
z FLOAT );
CREATE TABLE T3 (
x INTEGER,
y FLOAT,
z FLOAT );
Number of pages and records/page (assume all pages are full of records):
T1: 100 pages, 50 records/page,
T2: 200 pages, 50 records/page,
T3: 150 pages, 50 records/page
Indexes:
T1: Alt 1 index on T1.x with h = 2 and 120 leaf pages
T2: Alt 2 unclustered index on T2.y with h = 3 and 50 leaf pages
Table Stats (assume all are uniformly distributed and min/max are inclusive):
T1.x: min = 0, max = 50, 51 unique values
T2.y: min = 0, max = 100, 500 unique values
T2.z: min = 0, max = 200, 200 unique values
T3.y: min = 50, max = 150, 250 unique values
Buffer pages
B = 12
(a) (4 points) Selectivity Exercises
Estimate the number of pages of output for each of the following queries.
i. (1 pt) SELECT * FROM T1 WHERE x < 40;
Selectivity: (40 - 0) / (50 - 0 + 1) = 40/51
Number of tuples: floor(40/51 * 100 * 50) = 3921
Number of pages: ceil(3921 / 50) = 79
79
Selectivity: 1 / max(num distinct T2.y, num distinct T3.y) * (max(T2.z) – 20) / (max(T2.z) – min(T2.z))
Selectivity: 1 / 500 * (200 – 20) / 200 = 0.0018
Number of tuples (cartesian product): [T2] * T2 records/page * [T3] * T3 records/page = 200 * 50 * 150 * 50 = 75,000,000
Number of tuples (final): floor(75,000,000 * 0.0018) = 135000
Number of pages (final): ceil(135000 / 50) = 2700
120
2700
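Not part of the exam: a small self-contained sketch (class and method names are mine) of the estimation formulas these answers use — range selectivity on a uniform column with inclusive integer bounds, the 1 / max(distinct) equijoin estimate, and the floor/ceil conversion from tuples to output pages.

import static java.lang.Math.*;

public class SelectivityDemo {
    // Selectivity of "col < v" for a uniform column with inclusive integer bounds [min, max].
    static double ltSelectivity(double v, double min, double max) {
        return (v - min) / (max - min + 1);
    }

    // Equijoin selectivity estimate: 1 / max(distinct values on either side).
    static double joinSelectivity(int distinctLeft, int distinctRight) {
        return 1.0 / max(distinctLeft, distinctRight);
    }

    // Output pages: floor the estimated tuple count, then round up to whole pages.
    static long outputPages(double selectivity, double inputTuples, int recordsPerPage) {
        double tuples = floor(selectivity * inputTuples);
        return (long) ceil(tuples / recordsPerPage);
    }

    public static void main(String[] args) {
        // Part i: T1 has 100 pages * 50 records/page, x uniform on [0, 50] (51 values).
        System.out.println(outputPages(ltSelectivity(40, 0, 50), 100 * 50, 50));   // 79

        // The join worked above: T2 join T3 on y, plus the range filter on T2.z.
        double sel = joinSelectivity(500, 250) * ((200.0 - 20) / (200 - 0));
        System.out.println(outputPages(sel, 200.0 * 50 * 150 * 50, 50));           // 2700
    }
}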
(b) (17 points) Query Optimizers
Again consider the following tables:
CREATE TABLE T1 (
x INTEGER,
y FLOAT,
z FLOAT );
CREATE TABLE T2 (
x INTEGER,
y FLOAT,
z FLOAT );
CREATE TABLE T3 (
x INTEGER,
y FLOAT,
z FLOAT );
Number of pages and records/page:
T1: 100 pages, 50 records/page,
T2: 200 pages, 50 records/page,
T3: 150 pages, 50 records/page
Indexes:
T1: Alt 1 index on T1.x with h = 2 and 120 leaf pages
T2: Alt 2 unclustered index on T2.y with h = 3 and 50 leaf pages
Table Stats (assume all are uniformly distributed):
T1.x: min = 0, max = 50, 51 unique values
T2.y: min = 0, max = 100, 500 unique values
T2.z: min = 0, max = 200, 200 unique values
T3.y: min = 50, max = 150, 250 unique values
Buffer pages
B = 12
We want to execute the following query:
SELECT *
FROM T1 INNER JOIN T2 ON T1.x = T2.x
INNER JOIN T3 ON T2.y = T3.y
WHERE T2.y >= 20
ORDER BY T2.y;
i. (1 pt) How many I/Os would a full scan of T1 take?
Full scan is all 100 pages.
100
ii. (2 pt) Using the index on T1.x, how many I/Os would it take to perform a full index scan of T1?
2 inner nodes + 120 leaf pages
122 or 120 (leaving out inner nodes)
iii. (2 pt) Using the index on T2.y, how many I/Os would it take to perform an index scan of T2 that only returns tuples which match the condition T2.y >= 20?
Sel(T2.y >= 20) = 0.8
3 inner nodes
0.8 * 50 = 40 leaf nodes
0.8 * 200 * 50 = 8000 records (1 I/O per matching record)
3 + 40 + 8000 = 8043
8043
iv. (2 pt) Assume in the first pass of the Selinger Optimizer that T2 and T3 were both accessed with a full scan and streamed directly into the second pass. How many I/Os then would T3 BNLJ T2 (T3 as outer relation) take, including the cost of the 1st pass? Remember that the number of buffer pages B = 12.
[T3] + ([T3] / (B – 2)) * [T2]
150 + (150 / 10) * 200 = 3150
3150
v. (3 pt) Now, assume the full scan on T2 filtered T2.y >= 20 and was materialized before the same T3 BNLJ T2 join. How many I/Os would the join take with this one modification, again including the cost of the 1st pass?
Sel(T2.y >= 20) = 0.8
[T2] (read T2) + 0.8[T2] (materialize T2) + [T3] + ([T3] / (B – 2)) * 0.8[T2]
200 + 0.8(200) + 150 + (150 / 10) * 0.8(200) = 2910
2910
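Not part of the exam: a tiny helper (names are mine) for the BNLJ formula used in parts iv and v — read the outer once, and scan the inner once per (B - 2)-page block of the outer.

public class BnljCost {
    // [outer] + ceil([outer] / (B - 2)) * [inner]
    static long bnljCost(long outerPages, long innerPages, int bufferPages) {
        long blockSize = bufferPages - 2;
        long blocks = (outerPages + blockSize - 1) / blockSize;   // ceiling division
        return outerPages + blocks * innerPages;
    }

    public static void main(String[] args) {
        // Part iv: T3 BNLJ T2 with B = 12.
        System.out.println(bnljCost(150, 200, 12));               // 3150
        // Part v: read T2 (200), materialize the 160 filtered pages, then join T3 against them.
        System.out.println(200 + 160 + bnljCost(150, 160, 12));   // 2910
    }
}

The same formula reappears in Question 5(a), e.g. bnljCost(60, 40, 15) = 260 and bnljCost(40, 60, 15) = 280.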
SELECT *
FROM T1 INNER JOIN T2 ON T1.x = T2.x
INNER JOIN T3 ON T2.y = T3.y
WHERE T2.y >= 20
ORDER BY T2.y;
vi. Which of the following query plans will be returned by the second pass of the Selinger Optimizer, given the total estimated I/O cost and output order of each plan shown below?
A: T1 BNLJ T2   I/Os: 3,000   Order: None
B: T2 BNLJ T1   I/Os: 2,000   Order: None
C: T1 SMJ T2    I/Os: 5,000   Order: T1.x
D: T1 BNLJ T3   I/Os: 6,000   Order: None
E: T3 BNLJ T1   I/Os: 2,000   Order: None
F: T3 SMJ T1    I/Os: 1,500   Order: T3.z
G: T2 BNLJ T3   I/Os: 4,000   Order: None
H: T3 BNLJ T2   I/Os: 2,500   Order: None
I: T2 SMJ T3    I/Os: 3,500   Order: T2.y
A
B
C
D
E
F
G
H
I
Cheapest T1,T2 and T2,T3 joins are kept. T2 SMJ T3 is kept since it has an interesting order.
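Not part of the exam: a sketch of the pruning rule that explanation applies — for each set of joined tables, keep the cheapest plan plus any plan whose output order is interesting to the rest of the query. The Plan record and names below are illustrative only; the T1–T3 plans (D, E, F) are omitted because there is no join condition between T1 and T3.

import java.util.*;

public class SelingerPrune {
    record Plan(String name, Set<String> tables, long cost, String order) {}

    // Keep, for each table set, the cheapest plan plus any plan with an interesting order.
    static List<Plan> prune(List<Plan> candidates, Set<String> interestingOrders) {
        Map<Set<String>, Plan> cheapest = new HashMap<>();
        for (Plan p : candidates) {
            cheapest.merge(p.tables(), p, (a, b) -> a.cost() <= b.cost() ? a : b);
        }
        List<Plan> kept = new ArrayList<>();
        for (Plan p : candidates) {
            if (p == cheapest.get(p.tables()) || interestingOrders.contains(p.order())) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Plan> pass2 = List.of(
            new Plan("A: T1 BNLJ T2", Set.of("T1", "T2"), 3000, "None"),
            new Plan("B: T2 BNLJ T1", Set.of("T1", "T2"), 2000, "None"),
            new Plan("C: T1 SMJ T2",  Set.of("T1", "T2"), 5000, "T1.x"),
            new Plan("G: T2 BNLJ T3", Set.of("T2", "T3"), 4000, "None"),
            new Plan("H: T3 BNLJ T2", Set.of("T2", "T3"), 2500, "None"),
            new Plan("I: T2 SMJ T3",  Set.of("T2", "T3"), 3500, "T2.y"));
        // T2.y is interesting because of the ORDER BY T2.y in the query.
        prune(pass2, Set.of("T2.y")).forEach(p -> System.out.println(p.name()));  // B, H, I
    }
}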
vii. (3 pt) Consider the following modification to the Selinger Optimizer: after we determine our final query plan, we test if materializing the output of each operator (single access or join) before inputting it to the next operator will reduce the I/O cost. For our query above, how many query plans would we have to consider after the final step of the Selinger Optimizer if we consider all possible combinations of materialization/no materialization?
There are 5 operators for this query (3 single access, 2 joins), so in total there are 2^5 = 32 combinations of materialization/no materialization.
32
viii. True or False: pushing down a selection will always decrease the final I/O cost of the query plan outputted by the Selinger Optimizer.
True
False
Counterexample: Pushing down the selection on the right relation of a BNLJ will not decrease I/O cost.
See the Discussion 7 example.
Also, more simply: consider a predicate that evaluates to true for all tuples.
5. (24 points)
For all parts, assume that table R has 60 pages and table S has 40 pages. Exclude the cost of the final write unless instructed otherwise.
(a) (2 points) What’s your favorite join?
Dylan wants to join table R with table S using Block Nested Loop Join. For this part only, assume we have 15 buffers and answer the following questions below.
i. (1 pt) What is the I/O cost of R BNLJ S (R is the outer relation)?
60 + 5 * 40 = 260 I/Os
260
ii. (1 pt) What is the I/O cost of S BNLJ R (S is the outer relation)?
40 + 4 * 60 = 280 I/Os
280
(b) (6 points) Block Nested Lakshya Join (BNLJ)
[R] = 60, [S] = 40
Lakshya also wants to join table R with table S. He has 20 pages in his buffer, but there’s a catch... Each time a block from the outer relation is evicted and a new block is read in, the available memory shrinks by 5 pages until we hit a floor of B = 5 total buffer pages.
In other words, when the first block was read in there were 20 pages in the buffer. When the next block was read in, there were only 20 – 5 = 15 pages in the buffer.
i. (3 pt) What is the I/O cost for executing BNLJ here, if R was the outer relation?
[1] 18 pg block + 40 pages
[2] 13 pg block + 40 pages
[3] 8 pg block + 40 pages
[4] 3 pg block + 40 pages
[5] 3 pg block + 40 pages
[6] 3 pg block + 40 pages
[7] 3 pg block + 40 pages
[8] 3 pg block + 40 pages
[9] 3 pg block + 40 pages
[10] 3 pg block + 40 pages
Total: 60 (reading R) + 10 * 40 = 460 I/Os
460
ii. (3 pt) If instead, the available memory was to expand by 5 pages (starting at B = 20) every time a block from the outer relation is evicted and a new one read in, what would be the cost for BNLJ, if R was the outer relation?
[1] 18 pg block + 40 pages
[2] 23 pg block + 40 pages
[3] 19 pg block + 40 pages
Total: 60 + 3 * 40 = 180 I/Os
180
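Not part of the exam: a small simulation (names are mine) of the bookkeeping behind these two answers — every outer block uses B - 2 buffer pages, and B changes after each block is read in.

public class ShrinkingBnlj {
    // BNLJ cost when the buffer pool changes by DELTA pages after each outer block,
    // never dropping below FLOORB pages. The outer relation is read exactly once.
    static long cost(long outerPages, long innerPages, int startB, int delta, int floorB) {
        long ios = outerPages;
        long remaining = outerPages;
        int b = startB;
        while (remaining > 0) {
            remaining -= Math.min(remaining, b - 2);   // one block of the outer relation
            ios += innerPages;                         // one full scan of the inner per block
            b = Math.max(floorB, b + delta);
        }
        return ios;
    }

    public static void main(String[] args) {
        System.out.println(cost(60, 40, 20, -5, 5));   // part i: 460 (10 outer blocks)
        System.out.println(cost(60, 40, 20, +5, 5));   // part ii: 180 (3 outer blocks)
    }
}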
(c) (10 points) Sorting things out
[R] = 60, [S] = 40
A new resource management model means available buffer pages can change throughout query execution. For each pass in the sort phase of SMJ, Lakshya has fewer buffer pages than the previous pass! The allocation is detailed in the figures below.
Lakshya wants to use SMJ to join his tables (R SMJ S). Consider the following figure for the number of buffer pages available for each pass and answer the questions below. Assume no optimization is used.
SMJ            Pass 0   Pass 1   Pass 2
Buffer pages     10        5        ?
i. (1 pt) What is the I/O cost for pass 0 of sorting table R?
60 * 2 = 120
120
ii. (1 pt) What is the I/O cost for pass 0 of sorting table S?
40 * 2 = 80
80
iii. (1 pt) How many sorted runs do we have at the end of pass 0 for table R?
ceil(60/10) = 6
6
iv. (1 pt) How many sorted runs do we have at the end of pass 0 for table S?
ceil(40/10) = 4
4
v. (1 pt) How many pages are in our smallest run from R at the end of Pass 1?
We first merge 4 runs of 10 pages each and get 1 run of 40 pages. We then merge the remaining 2 runs of 10 pages each to get 1 run of 20 pages.
20
vi. (1 pt) What’s the minimum number of buffer pages we must have in order to finish sorting after pass 2?
We have 2 runs of table R after pass 1 (the previous solution has the breakdown), so we need B - 1 >= 2, i.e. at least 3 buffer pages.
3
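Not part of the exam: a short sketch (names are mine) of the run-count arithmetic used above — pass 0 produces ceil(N / B) sorted runs of B pages, and each later pass merges up to B - 1 runs, with B allowed to differ per pass.

public class SortRuns {
    // Sorted runs remaining after running pass 0 through pass PASSES, where
    // buffersPerPass[i] is the number of buffer pages available in pass i.
    static int runsAfter(int pages, int[] buffersPerPass, int passes) {
        int runs = ceilDiv(pages, buffersPerPass[0]);         // pass 0: B-page sorted runs
        for (int pass = 1; pass <= passes; pass++) {
            runs = ceilDiv(runs, buffersPerPass[pass] - 1);   // (B - 1)-way merge
        }
        return runs;
    }

    static int ceilDiv(int a, int b) { return (a + b - 1) / b; }

    public static void main(String[] args) {
        int[] buffers = {10, 5, 3};                        // pass 0, pass 1, pass 2 (answer to vi)
        System.out.println(runsAfter(60, buffers, 0));     // R after pass 0: 6 runs
        System.out.println(runsAfter(40, buffers, 0));     // S after pass 0: 4 runs
        System.out.println(runsAfter(60, buffers, 1));     // R after pass 1: 2 runs
        System.out.println(runsAfter(60, buffers, 2));     // R after pass 2: 1 run, so B = 3 suffices
        // Every pass reads and writes each page once, e.g. pass 0 over R costs 60 * 2 = 120 I/Os.
    }
}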
vii. (1 pt) What is the I/O cost of the merge pass of SMJ?
[R] + [S] = 100
100
viii. (3 pt) Suppose we have the following records in table R and table S:
[Record table for R and S — the extracted values are 11, 12, 22, 33, 34; their assignment to the two tables was lost in extraction.]
How many times will we have to backtrack when performing the merge pass of SMJ (i.e. resetting the right iterator over table S)?
4 or 5 (depending on when you reset; either is accepted)
4
(d) (6 points) Direct Deposit: $1400, Me: give me all the buffer pages
[R] = 60, [S] = 40
Lakshya wants to try using GHJ. Like before, for each pass of GHJ, Lakshya has fewer buffer pages than the previous pass. Consider the following figure for the number of buffer pages available for each pass and answer the questions below. Assume that our pass 1 hash function hashes 1/2 of the pages to the first bucket, 1/4 of the pages to the second bucket, and perfectly hashes the remaining pages uniformly across the remaining buckets.
GHJ             Partitioning Pass 1   Partitioning Pass 2   Build and Probe
Buffer pages            20                     ?                   12
i. (3 pt) What is the total I/O cost of pass 1?
Table R: 60 (read) + 30 (bucket 1) + 15 (bucket 2) + 1 * 17 (buckets 3 to 19 combined) = 60 + 62 = 122 I/Os
Table S: 40 (read) + 20 (bucket 1) + 10 (bucket 2) + 1 * 17 (buckets 3 to 19) = 40 + 47 = 87 I/Os
Table R + Table S = 122 + 87 = 209 I/Os
Partial credit: 109 (did not include the first read)
209
ii. (3 pt) Assuming we use a uniform hash function in pass 2, what is the minimum number of buffer pages that must be available in pass 2 in order to ensure that this is the last partitioning phase?
To fit everything in the build and probe we only need to ensure that all partitions have one table of size 10 pages or less (B – 2 = 10) – this enables us to build an in-memory hash table. As we use a uniform hash function in pass 2, it is sufficient to ensure that the largest bucket can be hashed into subpartitions – we just need to hash S into a small enough partition to build an in-memory table on it and then probe with R. S’s largest partition from pass 1 is 20 pages, so with B buffer pages in pass 2 (giving B – 1 output partitions) we need 20 / (B – 1) <= 10, i.e. B = 3.
3
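Not part of the exam: a sketch (names are mine) of the pass-1 accounting behind the 209-I/O answer, using the skewed hash described above — half of a table's pages to the first partition, a quarter to the second, and the rest spread (rounded up to at least one page each) across the other B - 3 partitions.

public class GhjPass1 {
    // Pass-1 partitioning cost for one table: read every page, then write the
    // partitions back out under the skewed hash described in the problem.
    static int pass1Cost(int pages, int bufferPages) {
        int partitions = bufferPages - 1;              // one buffer page reserved for input
        int p1 = ceilDiv(pages, 2);                    // first partition: 1/2 of the pages
        int p2 = ceilDiv(pages, 4);                    // second partition: 1/4 of the pages
        int perOther = ceilDiv(pages - p1 - p2, partitions - 2);
        return pages + p1 + p2 + perOther * (partitions - 2);
    }

    static int ceilDiv(int a, int b) { return (a + b - 1) / b; }

    public static void main(String[] args) {
        System.out.println(pass1Cost(60, 20));                     // R: 122 I/Os
        System.out.println(pass1Cost(40, 20));                     // S: 87 I/Os
        System.out.println(pass1Cost(60, 20) + pass1Cost(40, 20)); // total: 209 I/Os
    }
}

The perOther term mirrors the official solution's assumption that each of the 17 remaining partitions receives at least one (possibly partial) page.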
No more questions.