CS W4111.001 Introduction to Databases Fall 2020
Computer Science Department Columbia University
Overview of Query Optimization
• So far, we studied choices—and their costs—for plans for individual relational operators (selections, projections, joins, …)
• We will now cover more complex queries
45
Query Optimization: Outline
Given a SQL query, ideally we would:
1. Consider all possible execution plans and their
estimated cost
2. Pick fastest plan and execute it
However:
• Far too many possible execution plans are available
• Should not spend more time finding best plan than executing query with a rough but OK plan
46
Focus on Two Problems
• Decide which plans we will consider in analysis (not all)
• Design ways of estimating the execution cost of a plan
We will follow the IBM System R approach to query optimization
47
• Most widely used; works well for up to ~10 joins
• Cost estimation is an approximate art at best:
• Based on statistics maintained in database catalogs, to estimate cost of operations and result sizes
• Based on combination of CPU and I/O costs; we will focus just on I/O costs in our class
• Space of possible execution plans is far too large, so it must be pruned, as we will see
• Each execution plan is represented as a relational algebra tree, with each relational operator in the tree annotated with a choice of implementation algorithm
Highlights of System R Optimizer
48
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
Sailors(sid, sname, rating, age) Boats(bid, bname, color) Reserves(sid, bid, day)
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
Sailors(sid, sname, rating, age) Boats(bid, bname, color) Reserves(sid, bid, day)
One possible relational algebra tree, without annotations (so not a complete plan)
Tuples “flow” from the leaves of the tree up to the root, where the output of the query is produced
sname σbid=100 ⋀ rating > 5
sid
Reserves
Sailors
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
Sailors(sid, sname, rating, age) Boats(bid, bname, color) Reserves(sid, bid, day)
We can annotate the relational algebra tree to specify an execution plan fully
The left child of the join operator is the outer relation in a nested loops execution, by convention
Cost? Refer to previous analysis of block nested loops join; 𝜎 and π don’t incur further I/Os in this plan: they are
sname
ON-THE-FLY
σbid=100 ⋀ rating > 5
ON-THE-FLY
BLOCK NESTED LOOPS
sid
evaluated “on the fly” as join tuples flow up the tree
Reserves Sailors
Other Plans for Query?
53
Other Plans for Query?
Two main sources of alternative plans:
• Different annotations for same relational algebra tree, with different algorithms for same operators (e.g., “SORT MERGE” instead of “BLOCK NESTED LOOPS” join)
54
Two main sources of alternative plans:
• Different annotations for same relational algebra tree, with different algorithms for same operators (e.g., “SORT MERGE” instead of “BLOCK NESTED LOOPS” join)
• Different trees (and annotations), with different ordering of operations, or swapping inner and outer relations, but of course still producing the same query results
Useful principle: push selections and projections “down” the relational algebra tree, to reduce size of intermediate results as early as possible
Other Plans for Query?
55
Cost Analysis of Plans?
• Need to estimate size of “intermediate” relations: they are the input of operators higher in relational algebra tree
56
Joins of Three or More Relations?
SELECT S.sname, B.bname
FROM Reserves R, Sailors S, Boats B WHERE R.sid=S.sid AND R.bid=B.bid AND
R.bid=100 AND S.rating>5
System R restricts the family of plans it considers to plans with relational algebra trees that:
• don’tinvolveanycrossproductsofrelationsif possible, to avoid generating very large intermediate relations
• are“left-deeptrees,”wheretherightchildofajoin
operator is never another join operator (i.e., join
operators always occur on the left subtree of a node);
this restriction reduces the number of trees to
consider 57
Left-Deep Trees (or Plans)
• Can be “pipelined,” without having to materialize on disk the left-side relations (tuples “flow” from the outer relation—on the left—up the tree)
• Still up to N! (N factorial) relational algebra tree options to consider for an N-way join (N! ways to assign N relations to the N leaves of the only left-deep tree “shape” with N leaves)
58
Summary
Query optimization studied in detail in
CS W4112-Database System Implementation
• Severalalternativeevaluationalgorithmsforeachrelational operator
• Aqueryisevaluatedbyconvertingittoarelationalalgebra tree of operators and evaluating the operators in the tree
• Mustunderstandqueryoptimizationinordertofully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries)
• Twopartstooptimizingaquery:
• Considerasetofalternativeplans
• Must prune search space; typically, left-deep plans/trees only
• Mustestimatecostofeachplanthatisconsidered
• Size of result and cost for each plan node
• Key issues: Statistics, indexes, operator implementations
59
Transaction Processing Overview
Transaction processing studied in depth in
CS W4112-Database System Implementation
Transactions
• A user program may carry out many operations on the data retrieved from the database, but the DBMS is only concerned about what data is read from, and written to, the database
• A transaction is the DBMS’s abstract view of a user program, as simply a sequence of reads and writes to the database
Transactions
A transaction is a series of actions (Reads and Writes) on a database that form a “logical unit”
Example: all database actions required to transfer money from one bank account to another
A transaction is a series of actions (Reads and Writes) on a database that form a “logical unit”
Example: all database actions required to transfer money from one bank account to another
ACID properties of transactions:
• Atomicity(enforcedvialog+recoveryprotocol)
• Consistency(enforcedbyDBMS)
• Isolation(enforcedbyconcurrencycontrolprotocol) • Durability(enforcedvialog+recoveryprotocol)
Transactions either commit (when they complete successfully) or abort (when they don’t)
Transactions
• A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions
Atomicity of Transactions
• A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions
• A very important property guaranteed by the DBMS for all transactions is that they are atomic, so we can think of a transaction as always either:
• Executing all its actions in one step, or • Not executing any actions at all
Atomicity of Transactions
• A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions
• A very important property guaranteed by the DBMS for all transactions is that they are atomic, so we can think of a transaction as always either:
• Executing all its actions in one step, or • Not executing any actions at all
• The DBMS logs all actions so that it can undo the actions of aborted transactions
Atomicity of Transactions
Consistency of Transactions
• The DBMS doesn’t understand the semantics of the data beyond the constraints that are defined as part of the database schema
• The DBMS makes sure all such constraints hold
Consistency of Transactions
• The DBMS doesn’t understand the semantics of the data beyond the constraints that are defined as part of the database schema
• The DBMS makes sure all such constraints hold
• A transaction can cause temporary violations of database constraints while it is still in progress, but everything should be back to normal when the transaction commits
• Specifically, if the database is in a consistent state (i.e., all constraints hold) when a transaction starts executing, then the database is back in a consistent state after the transaction commits
• Concurrent execution of transactions is essential for good DBMS performance
• Concurrency is achieved by the DBMS, which interleaves actions (reads and writes of database objects) of various transactions
• Users submit transactions, and can think of each transaction as executing by itself, logically speaking
Isolation of Transactions
• Concurrent execution of transactions is essential for good DBMS performance
• Concurrency is achieved by the DBMS, which interleaves actions (reads and writes of database objects) of various transactions
• Users submit transactions, and can think of each transaction as executing by itself, logically speaking
• The DBMS supports this “illusion” via a concurrency control protocol such as Strict 2-Phase Locking
Isolation of Transactions
• If a transaction commits (i.e., it completes successfully), then its effects on the database have to persist forever and cannot be forgotten
• Any changes to the database have to survive system crashes and even natural disasters
Durability of Transactions
• If a transaction commits (i.e., it completes successfully), then its effects on the database have to persist forever and cannot be forgotten
• Any changes to the database have to survive system crashes and even natural disasters
• The DBMS logs all actions so that it can redo the actions of committed transactions if needed, to guarantee transaction durability
Durability of Transactions
Concurrency Control, for Isolation
Consider two transactions T1 and T2:
T1: BEGIN A=A+100, B=B-100 END T2: BEGIN A=1.02*A, B=1.02*B END
● T1 transfers $100 from account B to account A
● T2 gives each account 2% interest
Concurrency Control, for Isolation
Consider two transactions T1 and T2:
T1: BEGIN A=A+100, B=B-100 END T2: BEGIN A=1.02*A, B=1.02*B END
● T1 transfers $100 from account B to account A
● T2 gives each account 2% interest
If T1 and T2 are submitted to the database at about the same time, there is no guarantee that T1 will execute before T2 or vice versa
Concurrency Control, for Isolation
Consider two transactions T1 and T2:
T1: BEGIN A=A+100, B=B-100 END T2: BEGIN A=1.02*A, B=1.02*B END
● T1 transfers $100 from account B to account A
● T2 gives each account 2% interest
If T1 and T2 are submitted to the database at about the same time, there is no guarantee that T1 will execute before T2 or vice versa
However, the net effect must be equivalent to T1 and T2 running serially in some order (i.e., T1 followed by T2, or T2 followed by T1)
T1: A=A+100, B=B-100 T2: A=1.02*A,
B=1.02*B
Is this a “good” schedule?
One possible interleaving (or “schedule”) of the actions of T1 and T2
T1: A=A+100, B=B-100 T2: A=1.02*A,
B=1.02*B
The schedule is OK, because it’s logically equivalent—in terms of its effects on the database contents—to executing T1 fully, followed by executing T2 fully (i.e., it is equivalent to serial schedule T1; T2):
T1: A=A+100, B=B-100
T2: A=1.02*A, B=1.02*B
The two schedules above are equivalent because we can get from one to the other by swapping the order of B=B-100 and A=1.02*A, which are nonconflicting actions affecting different objects in the database (i.e., B and A, respectively)
One possible interleaving (or “schedule”) of the actions of T1 and T2
T1: A=A+100, B=B-100 T2: A=1.02*A, B=1.02*B
Is this a “good” schedule?
Another possible interleaving (or “schedule”) of the actions of T1 and T2
T1: A=A+100, B=B-100 T2: A=1.02*A, B=1.02*B
The schedule is not OK, because it’s not logically equivalent—in terms of its effects on the database contents—to either T1; T2 or T2; T1
Another possible interleaving (or “schedule”) of the actions of T1 and T2
• Serial schedule is a schedule that does not interleave the actions of its transactions (i.e., each transaction in the schedule is fully executed before the next transaction is fully executed, and so on)
Concurrency Control: Some Definitions
• Serial schedule is a schedule that does not interleave the actions of its transactions (i.e., each transaction in the schedule is fully executed before the next transaction is fully executed, and so on)
• Two schedules are equivalent if, for any database state, the effect on the database of executing the first schedule is identical to the effect of executing the second schedule
Concurrency Control: Some Definitions
• Serial schedule is a schedule that does not interleave the actions of its transactions (i.e., each transaction in the schedule is fully executed before the next transaction is fully executed, and so on)
• Two schedules are equivalent if, for any database state, the effect on the database of executing the first schedule is identical to the effect of executing the second schedule
• Serializable schedule is a schedule that, if we only consider its committed transactions, the schedule is equivalent to a serial schedule of the committed transactions
Concurrency Control: Some Definitions
Transaction Schedules as Series of Reads and Writes
From now on, we will view transactions and schedules as sequences of reads and writes, without worrying about the actual values that are read or written
T1: A=A+100, B=B-100 T2: A=1.02*A,
B=1.02*B
simply as reads R and writes W, where the number next to R or W identifies the corresponding transaction (e.g., R1(A) is a read of object A by transaction T1)
T1: R1(A) W1(A) R1(B) W1(B) T2: R2(A) W2(A)
R2(B) W2(B)
or sometimes, equivalently, in one line as:
R1(A) W1(A) R2(A) W2(A) R1(B) W1(B) R2(B) W2(B)
Transaction Schedules as Series of Reads and Writes
From now on, we will view transactions and schedules as sequences of reads and writes, without worrying about the actual values that are read or written
So we will write a schedule such as:
• Reading Uncommitted Data (“dirty reads”):
T1: R1(A) W1(A) R1(B) W1(B) Abort1 T2: R2(A) W2(A) Commit2
Anomalies with Interleaved Executions of Transactions
• Reading Uncommitted Data (“dirty reads”):
T1: R1(A) W1(A) R1(B) W1(B) Abort1 T2: R2(A) W2(A) Commit2
T1: R1(A) R1(A) W1(A) Commit1 T2: R2(A) W2(A) Commit2
• Unrepeatable Reads:
Anomalies with Interleaved Executions of Transactions
Anomalies (Continued)
• Overwriting Uncommitted Data:
All of these schedules are problematic (because they are not serializable or have other anomalies)
How can we avoid them?
T1: W1(A) W1(B) Commit1 T2: W2(A) W2(B) Commit2
Strict 2-Phase Locking (Strict 2PL) Protocol
Strict 2PL allows only serializable schedules
• BeforereadinganobjectA,atransactionmusthave
requested and hold a shared lock on A (i.e., S(A))
• Multipletransactionscanholdasharedlockonanobject simultaneously