Week 08 Lectures
Assignment 2 1/141
Aim: experimental analysis of signature-based indexing
tuple-level superimposed codeword signatures
page-level superimposed codeword signatures
bit-sliced superimposed codeword signatures
Large numbers of tuples, inserted into a relation:
implemented as one data file plus three signature files
Produce several instances of (data+signature) relations
Run a set of benchmark PMR queries on these relations
Measure query costs and analyse
… Assignment 2 2/141
File structures:
… Assignment 2 3/141
We supply:
a program to generate pseudo-random tuples (lots of them)
where tuples look like: (id, name, numeric-attributes)
a hash function (from PostgreSQL)
You write:
a program to build (data+signature) files from tuples
a program to run queries (val,?,?,val,?,…) against the data
Your programs must be parameterized:
builder accepts different values for signature parameters
query processor specifies what kind of signatures to use
… Assignment 2 4/141
Experiments:
run each query using each signature type
for each sig type
read appropriate signature pages (maybe all)
determine which data pages to read
fetch pages and search for matching tuples
determine total cost (#signature pages + #data pages)
record costs and analyse/compare
Query Processing So Far 5/141
Steps in processing an SQL statement
parse, map to relational algebra (RA) expression
transform to more efficient RA expression
instantiate RA operators to DBMS operations
execute DBMS operations (aka query plan)
Cost-based optimisation:
generate possible query plans (via rewriting/heuristics)
estimate cost of each plan (sum costs of operations)
choose the lowest-cost plan (… and choose quickly)
Expression Rewriting Rules 6/141
Since RA is a well-defined formal system
there exist many algebraic laws on RA expressions
which can be used as a basis for expression rewriting
in order to produce equivalent (more-efficient?) expressions
Expression transformation based on such rules can be used
to simplify/improve SQL→RA mapping results
to generate new plan variations to check in query optimisation
Relational Algebra Laws 7/141
Commutative and Associative Laws:
R ⋈ S ↔ S ⋈ R, (R ⋈ S) ⋈ T ↔ R ⋈ (S ⋈ T) (natural join)
R ∪ S ↔ S ∪ R, (R ∪ S) ∪ T ↔ R ∪ (S ∪ T)
R ⋈Cond S ↔ S ⋈Cond R (theta join)
σc ( σd (R)) ↔ σd ( σc (R))
Selection splitting (where c and d are conditions):
σc∧d(R) ↔ σc ( σd (R))
σc∨d(R) ↔ σc(R) ∪ σd(R)
… Relational Algebra Laws 8/141
Selection pushing ( σc(R ∪ S) and σc(R ∩ S) ):
σc(R ∪ S) ↔ σc(R) ∪ σc(S), σc(R ∩ S) ↔ σc(R) ∩ σc(S)
Selection pushing with join …
σc (R ⋈ S) ↔ σc(R) ⋈ S (if c refers only to attributes from R )
σc (R ⋈ S) ↔ R ⋈ σc(S) (if c refers only to attributes from S )
If condition contains attributes from both R and S:
σc′∧c″ (R ⋈ S) ↔ σc′(R) ⋈ σc″(S)
c′ contains only R attributes, c″ contains only S attributes
… Relational Algebra Laws 9/141
Rewrite rules for projection …
All but last projection can be ignored:
πL1 ( πL2 ( … πLn (R))) → πL1 (R)
Projections can be pushed into joins:
πL (R ⋈c S) ↔ πL ( πM(R) ⋈c πN(S) )
where
M and N must contain all attributes needed for c
M and N must contain all attributes used in L (L ⊆ M ∪ N)
… Relational Algebra Laws 10/141
Subqueries ⇒ convert to a join
Example (on schema Courses(id,code,…), Enrolments(cid,sid,…), Students(id,name,…)):
select c.code, count(*)
from Courses c
where c.id in (select cid from Enrolments)
group by c.code
becomes
select c.code, count(*)
from Courses c join Enrolments e on c.id = e.cid
group by c.code
but not
select e.sid as student_id, e.cid as course_id
from Enrolments e
where e.sid = (select max(id) from Students)
Query Optimisation
Query Optimisation 12/141
Query optimiser: RA expression → efficient evaluation plan
… Query Optimisation 13/141
Query optimisation is a critical step in query evaluation.
The query optimiser
takes relational algebra expression from SQL compiler
produces sequence of RelOps to evaluate the expression
query execution plan should provide efficient evaluation
“Optimisation” is a misnomer since query optimisers
aim to find a good plan … but maybe not optimal
Observed Query Time = Planning time + Evaluation time
… Query Optimisation 14/141
Why do we not generate optimal query execution plans?
Finding an optimal query plan …
requires exhaustive search of a space of possible plans
for each possible plan, need to estimate cost (not cheap)
Even for relatively small query, search space is very large.
Compromise:
do limited search of query plan space (guided by heuristics)
quickly choose a reasonably efficient execution plan
Approaches to Optimisation 15/141
Three main classes of techniques developed:
algebraic (equivalences, rewriting, heuristics)
physical (execution costs, search-based)
semantic (application properties, heuristics)
All driven by aim of minimising (or at least reducing) “cost”.
Real query optimisers use a combination of algebraic+physical.
Semantic QO is good idea, but expensive/difficult to implement.
… Approaches to Optimisation 16/141
Example of optimisation transformations:
For join, may also consider sort/merge join and hash join.
Cost-based Query Optimiser 17/141
Approximate algorithm for cost-based optimisation:
translate SQL query to RAexp
MinCost = ∞   // cost of best plan found so far
for enough transformations RA' of RAexp {
   while (more choices for RelOps) {
      plan = {}; i = 0
      for each node e of RA' (recursively) {
         select RelOp method for e
         plan[i++] = RelOp method for e
      }
      cost = 0
      for each op in plan[] { cost += Cost(op) }
      if (cost < MinCost)
         { MinCost = cost; BestPlan = plan }
   }
}
Heuristics: push selections down, consider only left-deep join trees.
Exercise 1: Alternative Join Plans 18/141
Consider the schema
Students(id,name,....) Enrol(student,course,mark)
Staff(id,name,...) Courses(id,code,term,lic,...)
the following query on this schema
select c.code, s.id, s.name
from Students s, Enrol e, Courses c, Staff f
where s.id=e.student and e.course=c.id
and c.lic=f.id and c.term='11s2'
and f.name='John Shepherd'
Show some possible evaluation orders for this query.
Cost Models and Analysis 19/141
The cost of evaluating a query is determined by:
size of relations (database relations and temporary relations)
access mechanisms (indexing, hashing, sorting, join algorithms)
size/number of main memory buffers (and replacement strategy)
Analysis of costs involves estimating:
size of intermediate results
number of secondary storage accesses
Choosing Access Methods (RelOps) 20/141
Performed for each node in RA expression tree ...
Inputs:
a single RA operation (σ, π, ⋈)
information about file organisation, data distribution, ...
list of operations available in the database engine
Output:
specific DBMS operation to implement this RA operation
... Choosing Access Methods (RelOps) 21/141
Example:
RA operation: Sel[name='John' ∧ age>21](Student)
Student relation has B-tree index on name
database engine (obviously) has B-tree search method
giving
tmp[i] := BtreeSearch[name='John'](Student)
tmp[i+1] := LinearSearch[age>21](tmp[i])
Where possible, use pipelining to avoid storing tmp[i] on disk.
… Choosing Access Methods (RelOps) 22/141
Rules for choosing σ access methods:
σA=c(R) and R has index on A ⇒ indexSearch[A=c](R)
σA=c(R) and R is hashed on A ⇒ hashSearch[A=c](R)
σA=c(R) and R is sorted on A ⇒ binarySearch[A=c](R)
σA ≥ c(R) and R has clustered index on A
⇒ indexSearch[A=c](R) then scan
σA ≥ c(R) and R is hashed on A
⇒ linearSearch[A>=c](R)
… Choosing Access Methods (RelOps) 23/141
Rules for choosing ⋈ access methods:
R ⋈ S and R fits in memory buffers ⇒ bnlJoin(R,S)
R ⋈ S and S fits in memory buffers ⇒ bnlJoin(S,R)
R ⋈ S and R,S sorted on join attr ⇒ smJoin(R,S)
R ⋈ S and R has index on join attr ⇒ inlJoin(S,R)
R ⋈ S and no indexes, no sorting ⇒ hashJoin(R,S)
(bnl = block nested loop; inl = index nested loop; sm = sort merge)
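These rules are easy to express as a decision procedure. A minimal sketch, assuming a made-up Rel descriptor carrying the relevant statistics (none of these names come from a real engine):

#include <stdbool.h>

typedef struct Rel {
    int  npages;              // size of relation in pages
    bool sortedOnJoinAttr;    // stored sorted on the join attribute
    bool hasIndexOnJoinAttr;  // index available on the join attribute
} Rel;

typedef enum { BNL_JOIN, SM_JOIN, INL_JOIN, HASH_JOIN } JoinMethod;

// apply the rules above, in order, for R ⋈ S
JoinMethod chooseJoinMethod(Rel *R, Rel *S, int nBuffers)
{
    // a relation that fits in the buffers becomes the outer of a
    // block nested loop join: bnlJoin(R,S) or bnlJoin(S,R)
    if (R->npages <= nBuffers || S->npages <= nBuffers)
        return BNL_JOIN;
    if (R->sortedOnJoinAttr && S->sortedOnJoinAttr)
        return SM_JOIN;       // both sorted on join attribute: smJoin(R,S)
    // an indexed relation becomes the inner: inlJoin(S,R)
    if (R->hasIndexOnJoinAttr || S->hasIndexOnJoinAttr)
        return INL_JOIN;
    return HASH_JOIN;         // no indexes, no sorting: hashJoin(R,S)
}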
Cost Estimation 24/141
Without executing a plan, cannot always know its precise cost.
Thus, query optimisers estimate costs via:
cost of performing operation (dealt with in earlier lectures)
size of result (which affects cost of performing next operation)
Result size determined by statistical measures on relations, e.g.
rS cardinality of relation S
RS avg size of tuple in relation S
V(A,S) # distinct values of attribute A
min(A,S) min value of attribute A
max(A,S) max value of attribute A
Estimating Projection Result Size 25/141
Straightforward, since we know:
number of tuples in output
rout = | πa,b,..(T) | = | T | = rT (in SQL, because of bag semantics)
size of tuples in output
Rout = sizeof(a) + sizeof(b) + … + tuple-overhead
Assume page size B, bout = ⌈rT / cout⌉, where cout = ⌊B/Rout⌋ (worked example below)
If using select distinct …
| πa,b,..(T) | depends on proportion of duplicates produced
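A quick worked check of the page-count formula above, with made-up values for B, rT and the attribute sizes:

#include <stdio.h>

int main(void)
{
    int B    = 8192;            // page size (bytes)
    int rT   = 100000;          // tuples in T
    int Rout = 4 + 20 + 8;      // sizeof(a) + sizeof(b) + tuple-overhead
    int cout = B / Rout;                 // floor(B/Rout) = 256 tuples/page
    int bout = (rT + cout - 1) / cout;   // ceil(rT/cout) = 391 pages
    printf("cout = %d, bout = %d\n", cout, bout);
    return 0;
}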
Estimating Selection Result Size 26/141
Selectivity = fraction of tuples expected to satisfy a condition.
Common assumption: attribute values uniformly distributed.
Example: Consider the query
select * from Parts where colour='Red'
If V(colour,Parts)=4, r=1000 ⇒ |σcolour=red(Parts)|=250
In general, | σA=c(R) | ≅ rR / V(A,R)
Heuristic used by PostgreSQL: | σA=c(R) | ≅ r/10
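As a sketch (the r and V(A,R) arguments are assumed to come from catalog statistics):

// |σ[A=c](R)| ≅ r / V(A,R), under the uniform-distribution assumption
double estEqualitySize(double r, double distinctVals)
{
    return r / distinctVals;
}
// e.g. estEqualitySize(1000, 4) gives 250, as in the Parts example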
… Estimating Selection Result Size 27/141
Estimating size of result for e.g.
select * from Enrolment where year > 2005;
Could estimate by using:
uniform distribution assumption, r, min/max years
Assume: min(year)=2000, max(year)=2009, |Enrolment|=10^5
10^5 enrolments from 2000-2009 means approx 10,000 enrolments/year
this suggests approx 40,000 enrolments with year > 2005 (i.e. 2006-2009)
Heuristic used by some systems: | σA>c(R) | ≅ r/3
… Estimating Selection Result Size 28/141
Estimating size of result for e.g.
select * from Enrolment where course <> 'COMP9315';
Could estimate by using:
uniform distribution assumption, r, domain size
e.g. V(course,Enrolment) = 2000, | σA<>c(E) | = r * 1999/2000
Heuristic used by some systems: | σA<>c(R) | ≅ r
Exercise 2: Selection Size Estimation 29/141
Assuming that
all attributes have uniform distribution of data values
attributes are independent of each other
Give formulae for the number of expected results for
1. select * from R where not A=k
2. select * from R where A=k and B=j
3. select * from R where A in (k,l,m,n)
where j, k, l, m, n are constants.
Assume: V(A,R) = 10 and V(B,R)=100 and r=1000
… Estimating Selection Result Size 30/141
How to handle non-uniform attribute value distributions?
collect statistics about the values stored in the attribute/relation
store these as e.g. a histogram in the meta-data for the relation
So, for part colour example, might have distribution like:
White: 35% Red: 30% Blue: 25% Silver: 10%
Use histogram as basis for determining # selected tuples.
Disadvantage: cost of storing/maintaining histograms.
… Estimating Selection Result Size 31/141
Summary: analysis relies on operation and data distribution:
E.g. select * from R where a = k;
Case 1: uniq(R.a) ⇒ 0 or 1 result
Case 2: rR tuples && size(dom(R.a)) = n ⇒ rR / n results
E.g. select * from R where a < k;
Case 1: k ≤ min(R.a) ⇒ 0 results
Case 2: k > max(R.a) ⇒ ≅ rR results
Case 3: min(R.a) ≤ k ≤ max(R.a) ⇒ some fraction of rR, based on where k falls in min(R.a) … max(R.a) (see the sketch below)
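One plausible way to fill in Case 3 is linear interpolation under the uniform-distribution assumption; a sketch, not any particular system's method:

// estimated |σ[a<k](R)| given r, k, min(R.a) and max(R.a)
double estLessThanSize(double r, double k, double minA, double maxA)
{
    if (k <= minA) return 0.0;               // Case 1: no tuples qualify
    if (k >  maxA) return r;                 // Case 2: all tuples qualify
    return r * (k - minA) / (maxA - minA);   // Case 3: interpolate
}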
Estimating Join Result Size 32/141
Analysis relies on semantic knowledge about data/relations.
Consider equijoin on common attr: R ⋈a S
Case 1: values(R.a) ∩ values(S.a) = {} ⇒ size(R ⋈a S) = 0
Case 2: uniq(R.a) and uniq(S.a) ⇒ size(R ⋈a S) ≤ min(|R|, |S|)
Case 3: pkey(R.a) and fkey(S.a) ⇒ size(R ⋈a S) ≤ |S|
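For the general case (no keys, overlapping value sets), a common textbook estimate, added here as an aside rather than taken from the slide, divides by the larger number of distinct values:

// |R ⋈a S| ≅ rR * rS / max(V(a,R), V(a,S)), assuming uniform values
double estJoinSize(double rR, double rS, double vR, double vS)
{
    double vmax = (vR > vS) ? vR : vS;
    return rR * rS / vmax;
}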
Exercise 3: Join Size Estimation 33/141
How many tuples are in the output from:
1. select * from R, S where R.s = S.id
where S.id is a primary key and R.s is a foreign key referencing S.id
2. select * from R, S where R.s <> S.id
where S.id is a primary key and R.s is a foreign key referencing S.id
3. select * from R, S where R.x = S.y
where R.x and S.y have no connection except that dom(R.x)=dom(S.y)
Under what conditions will the first query have maximum size?
Cost Estimation: Postscript 34/141
Inaccurate cost estimation can lead to poor evaluation plans.
Above methods can (sometimes) give inaccurate estimates.
To get more accurate cost estimates:
more time … complex computation of selectivity
more space … storage for histograms of data values
Either way, optimisation process costs more (more than query?)
Trade-off between optimiser performance and query performance.
PostgreSQL Query Optimiser
Overview of QOpt Process 36/141
Input: tree of Query nodes returned by parser
Output: tree of Plan nodes used by query executor
wrapped in a PlannedStmt node containing state info
Intermediate data structures are trees of Path nodes
a path tree represents one evaluation order for a query
All Node types are defined in include/nodes/*.h
… Overview of QOpt Process 37/141
QOpt Data Structures 38/141
Generic Path node structure:
typedef struct Path
{
NodeTag type; // scan/join/…
NodeTag pathtype; // specific method
RelOptInfo *parent; // output relation
PathTarget *pathtarget; // list of Vars/Exprs, width
// estimated execution costs for path …
Cost startup_cost; // setup cost
Cost total_cost; // total cost
List *pathkeys; // sort order
} Path;
PathKey = (opfamily:Oid, strategy:SortDir, nulls_first:bool)
… QOpt Data Structures 39/141
Specialised Path nodes (simplified):
typedef struct IndexPath
{
Path path;
IndexOptInfo *indexinfo; // index for scan
List *indexclauses; // index select conditions
…
ScanDirection indexscandir; // used by planner
Selectivity indexselectivity; // estimated #results
} IndexPath;
typedef struct JoinPath
{
Path path;
JoinType jointype; // inner/outer/semi/anti
Path *outerpath; // outer part of the join
Path *innerpath; // inner part of the join
List *restrictinfo; // join condition(s)
} JoinPath;
Query Optimisation Process 40/141
Query optimisation proceeds in two stages (after parsing)…
Rewriting:
uses PostgreSQL’s rule system
query tree is expanded to include e.g. view definitions
Planning and optimisation:
using cost-based analysis of generated paths
via one of two different path generators
chooses least-cost path from all those considered
Then produces a Plan tree from the selected path.
Top-down Trace of QOpt 41/141
Top-level of query execution: backend/tcop/postgres.c
exec_simple_query(const char *query_string)
{
// lots of setting up … including starting xact
parsetree_list = pg_parse_query(query_string);
foreach(parsetree, parsetree_list) {
// Query optimisation
querytree_list = pg_analyze_and_rewrite(parsetree,…);
plantree_list = pg_plan_queries(querytree_list,…);
// Query execution
portal = CreatePortal(…plantree_list…);
PortalRun(portal,…);
}
// lots of cleaning up … including close xact
}
Assumes the query string may contain multiple queries (i.e. SQL statements)
… Top-down Trace of QOpt 42/141
pg_analyze_and_rewrite()
take a parse tree (from SQL parser)
transforms Parse tree into Query tree (SQL → RA)
applies rewriting rules (e.g. views)
returns a list of Query trees
Code in: backend/tcop/postgres.c
… Top-down Trace of QOpt 43/141
pg_plan_queries()
takes a list of parsed/re-written queries
plans each one via planner()
which invokes subquery_planner() on each query
returns a list of query plans
Code in: backend/optimizer/plan/planner.c
… Top-down Trace of QOpt 44/141
subquery_planner()
performs algebraic transformations/simplifications, e.g.
simplifies conditions in where clauses
converts sub-queries in where to top-level join
moves having clauses with no aggregate into where
flattens sub-queries in join list
simplifies join tree (e.g. removes redundant terms), etc.
sets up canonical version of query for plan generation
invokes grouping_planner() to produce best path
Code in: backend/optimizer/plan/planner.c
… Top-down Trace of QOpt 45/141
grouping_planner() produces plan for one SQL statement
preprocesses target list for INSERT/UPDATE
handles “planning” for extended-RA SQL constructs:
set operations: UNION/INTERSECT/EXCEPT
GROUP BY, HAVING, aggregations
ORDER BY, DISTINCT, LIMIT
invokes query_planner() for select/join trees
Code in: backend/optimizer/plan/planmain.c
… Top-down Trace of QOpt 46/141
query_planner() produces plan for a select/join tree
make list of tables used in query
split where qualifiers (“quals”) into
restrictions (e.g. r.a=1) … for selections
joins (e.g. s.id=r.s) … for joins
search for quals to enable merge/hash joins
invoke make_one_rel() to find best path/plan
Code in: backend/optimizer/plan/planmain.c
… Top-down Trace of QOpt 47/141
make_one_rel() generates possible plans, selects best
generate scan and index paths for base tables
using the restrictions list generated above
generate access paths for the entire join tree
recursive process, controlled by make_rel_from_joinlist()
returns a single “relation”, representing result set
Code in: backend/optimizer/path/allpaths.c
Join-tree Generation 48/141
make_rel_from_joinlist() arranges path generation
switches between two possible path tree generators
path tree generators finally return best cost path
Standard path tree generator (standard_join_search()):
“exhaustively” generates join trees (like System R)
starts with 2-way joins, finds best combination
then adds extra table to give 3-table join, etc.
Code in: backend/optimizer/path/{allpaths.c,joinrels.c}
… Join-tree Generation 49/141
Genetic query optimiser (geqo):
uses genetic algorithm (GA) to generate path trees
handles joins involving > geqo_threshold relations
goals of this approach:
find near-optimal solution
examine far less than entire search space
Code in: backend/optimizer/geqo/*.c
… Join-tree Generation 50/141
Basic idea of genetic algorithm:
Optimize(join)
{
p = initialState(join) // pool of (random) join orders
for (t = 0; t < #generations; t++) {
p' = recombination(p) // get parts of good join orders
p'' = mutation(p') // generate new variations
p = selection(p'',p) // choose best join orders
}
}
#generations determined by size of initial pool of join orders
Query Execution
Query Execution 52/141
Query execution: applies evaluation plan → result tuples
... Query Execution 53/141
Example of query translation:
select s.name, s.id, e.course, e.mark
from Student s, Enrolment e
where e.student = s.id and e.semester = '05s2';
maps to
πname,id,course,mark(Stu ⋈e.student=s.id (σsemester='05s2'(Enr)))
maps to
Temp1 = BtreeSelect[semester=05s2](Enr)
Temp2 = HashJoin[e.student=s.id](Stu,Temp1)
Result = Project[name,id,course,mark](Temp2)
... Query Execution 54/141
A query execution plan:
consists of a collection of RelOps
executing together to produce a set of result tuples
Results may be passed from one operator to the next:
materialization ... writing results to disk and reading them back
pipelining ... generating and passing via memory buffers
Materialization 55/141
Steps in materialization between two operators
first operator reads input(s) and writes results to disk
next operator treats tuples on disk as its input
in essence, the Temp tables are produced as real tables
Advantage:
intermediate results can be placed in a file structure
(which can be chosen to speed up execution of subsequent operators)
Disadvantage:
requires disk space/writes for intermediate results
requires disk access to read intermediate results
Pipelining 56/141
How pipelining is organised between two operators:
blocks execute "concurrently" as producer/consumer pairs
first operator acts as producer; second as consumer
structured as interacting iterators (open; while(next); close)
Advantage:
no requirement for disk access (results passed via memory buffers)
Disadvantage:
higher-level operators access inputs via linear scan, or
requires sufficient memory buffers to hold all outputs
Iterators (reminder) 57/141
Iterators provide a "stream" of results:
iter = startScan(params)
set up data structures for iterator (create state, open files, ...)
params are specific to operator (e.g. reln, condition, #buffers, ...)
tuple = nextTuple(iter)
get the next tuple in the iteration; return null if no more
endScan(iter)
clean up data structures for iterator
Other possible operations: reset to specific point, restart, ...
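A minimal sketch of this interface in C; the operation names follow the slide, while the types and the consumer loop are illustrative only:

typedef struct Scan  Scan;    // opaque per-scan iterator state
typedef struct Tuple Tuple;

Scan  *startScan(void *params);  // set up state, open files
Tuple *nextTuple(Scan *iter);    // next tuple, or NULL if no more
void   endScan(Scan *iter);      // clean up state

// typical consumer: drives the producer one tuple at a time
void consume(void *params)
{
    Scan *iter = startScan(params);
    for (Tuple *t = nextTuple(iter); t != NULL; t = nextTuple(iter)) {
        // ... process t, e.g. hand it to the parent operator ...
    }
    endScan(iter);
}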
Pipelining Example 58/141
Consider the query:
select s.id, e.course, e.mark
from Student s, Enrolment e
where e.student = s.id and
e.semester = '05s2' and s.name = 'John';
which maps to the RA expression
Proj[id,course,mark](Join[student=id](Sel[05s2](Enr),Sel[John](Stu)))
which could be represented by the RA expression tree
... Pipelining Example 59/141
Modelled as communication between RA tree nodes:
Note: likely that projection is combined with join in real DBMSs.
Disk Accesses 60/141
Pipelining cannot avoid all disk accesses.
Some operations use multiple passes (e.g. merge-sort, hash-join).
data is written by one pass, read by subsequent passes
Thus ...
within an operation, disk reads/writes are possible
between operations, no disk reads/writes are needed
PostgreSQL Query Execution
PostgreSQL Query Execution 62/141
Defs: src/include/executor and src/include/nodes
Code: src/backend/executor
PostgreSQL uses pipelining ...
query plan is a tree of Plan nodes
each type of node implements one kind of RA operation
(node implements specific access method via iterator interface)
node types e.g. Scan, Group, Indexscan, Sort, HashJoin
execution is managed via a tree of PlanState nodes
(mirrors the structure of the tree of Plan nodes; holds execution state)
PostgreSQL Executor 63/141
Modules in src/backend/executor fall into two groups:
execXXX (e.g. execMain, execProcnode, execScan)
implement generic control of plan evaluation (execution)
provide overall plan execution and dispatch to node iterators
nodeXXX (e.g. nodeSeqscan, nodeNestloop, nodeGroup)
implement iterators for specific types of RA operators
typically contains ExecInitXXX, ExecXXX, ExecEndXXX
... PostgreSQL Executor 64/141
Much simplified view of PostgreSQL executor:
ExecutePlan(execState, planStateNode, ...) {
process "before each statement" triggers
for (;;) {
tuple = ExecProcNode(planStateNode)
if (no more tuples) return END
check tuple validity // MVCC
if (got a tuple) break
}
process "after each statement" triggers
return tuple
}
...
... PostgreSQL Executor 65/141
Executor overview (cont):
...
ExecProcNode(node) {
switch (nodeType(node)) {
case SeqScan:
result = ExecSeqScan(node); break;
case NestLoop:
result = ExecNestLoop(node); break;
...
}
return result;
}
Example PostgreSQL Execution 66/141
Consider the query:
-- get manager's age and # employees in Shoe department
select e.age, d.nemps
from Departments d, Employees e
where e.name = d.manager and d.name ='Shoe'
and its execution plan tree
... Example PostgreSQL Execution 67/141
The execution plan tree
contains three nodes:
NestedLoop with join condition (Outer.manager = Inner.name)
IndexScan on Departments with selection (name = 'Shoe')
SeqScan on Employees
... Example PostgreSQL Execution 68/141
Initially InitPlan() invokes ExecInitNode() on plan tree root.
ExecInitNode() sees a NestedLoop node ...
so dispatches to ExecInitNestLoop() to set up iterator
then invokes ExecInitNode() on left and right sub-plans
in left subPlan, ExecInitNode() sees an IndexScan node
so dispatches to ExecInitIndexScan() to set up iterator
in right sub-plan, ExecInitNode() sees a SeqScan node
so dispatches to ExecInitSeqScan() to set up iterator
Result: a plan state tree with same structure as plan tree.
... Example PostgreSQL Execution 69/141
Execution: ExecutePlan() repeatedly invokes ExecProcNode().
ExecProcNode() sees a NestedLoop node ...
so dispatches to ExecNestedLoop() to get next tuple
which invokes ExecProcNode() on its sub-plans
in left sub-plan, ExecProcNode() sees an IndexScan node
so dispatches to ExecIndexScan() to get next tuple
if no more tuples, return END
for this tuple, invoke ExecProcNode() on right sub-plan
ExecProcNode() sees a SeqScan node
so dispatches to ExecSeqScan() to get next tuple
check for match and return joined tuples if found
continue scan until end
reset right sub-plan iterator
Result: stream of result tuples returned via ExecutePlan()
Query Performance
Performance Tuning 71/141
How to make a database perform "better"?
Good performance may involve any/all of:
making applications using the DB run faster
lowering response time of queries/transactions
improving overall transaction throughput
Remembering that, to some extent ...
the query optimiser removes choices from DB developers
by making its own decision on the optimal execution plan
... Performance Tuning 72/141
Tuning requires us to consider the following:
which queries and transactions will be used?
(e.g. check balance for payment, display recent transaction history)
how frequently does each query/transaction occur?
(e.g. 90% withdrawals; 10% deposits; 50% balance check)
are there time constraints on queries/transactions?
(e.g. EFTPOS payments must be approved within 7 seconds)
are there uniqueness constraints on any attributes?
(define indexes on attributes to speed up insertion uniqueness check)
how frequently do updates occur?
(indexes slow down updates, because must update table and index)
... Performance Tuning 73/141
Performance can be considered at two times:
during schema design
typically towards the end of schema design process
requires schema transformations such as denormalisation
outside schema design
typically after application has been deployed/used
requires adding/modifying data structures such as indexes
Difficult to predict what query optimiser will do, so ...
implement queries using methods which should be efficient
observe execution behaviour and modify query accordingly
PostgreSQL Query Tuning 74/141
PostgreSQL provides the explain statement to
give a representation of the query execution plan
with information that may help to tune query performance
Usage:
EXPLAIN [ANALYZE] Query
Without ANALYZE, EXPLAIN shows plan with estimated costs.
With ANALYZE, EXPLAIN executes query and prints real costs.
Note that runtimes may show considerable variation due to buffering.
EXPLAIN Examples 75/141
Database
course_enrolments(student, course, mark, grade, ...)
courses(id, subject, semester, homepage)
people(id, family, given, title, name, ..., birthday)
program_enrolments(id, student, semester, program, wam, ...)
students(id, stype)
subjects(id, code, name, longname, uoc, offeredby, ...)
where
table_name | n_records
---------------------------+-----------
course_enrolments | 525688
courses | 73220
people | 55767
program_enrolments | 193456
students | 31361
subjects | 17779
... EXPLAIN Examples 76/141
Example: Select on non-indexed attribute
uni=# explain
uni=# select * from Students where stype='local';
QUERY PLAN
----------------------------------------------------
Seq Scan on students
(cost=0.00..562.01 rows=23544 width=9)
Filter: ((stype)::text = 'local'::text)
where
Seq Scan = operation (plan node)
cost=StartUpCost..TotalCost
rows=NumberOfResultTuples
width=SizeOfTuple (# bytes)
... EXPLAIN Examples 77/141
More notes on explain output:
each major entry corresponds to a plan node
e.g. Seq Scan, Index Scan, Hash Join, Merge Join, ...
some nodes include additional qualifying information
e.g. Filter, Index Cond, Hash Cond, Buckets, ...
cost values in explain are estimates (notional units)
explain analyze also includes actual time costs (ms)
costs of parent nodes include costs of all children
estimates of #results based on sample of data
... EXPLAIN Examples 78/141
Example: Select on non-indexed attribute with actual costs
uni=# explain analyze
uni=# select * from Students where stype='local';
QUERY PLAN
----------------------------------------------------------
Seq Scan on students
(cost=0.00..562.01 rows=23544 width=9)
(actual time=0.052..5.792 rows=23551 loops=1)
Filter: ((stype)::text = 'local'::text)
Rows Removed by Filter: 7810
Planning time: 0.075 ms
Execution time: 6.978 ms
... EXPLAIN Examples 79/141
Example: Select on indexed, unique attribute
uni=# explain analyze
uni-# select * from Students where id=100250;
QUERY PLAN
-------------------------------------------------------
Index Scan using student_pkey on student
(cost=0.00..8.27 rows=1 width=9)
(actual time=0.049..0.049 rows=0 loops=1)
Index Cond: (id = 100250)
Total runtime: 0.1 ms
... EXPLAIN Examples 80/141
Example: Select on indexed, unique attribute
uni=# explain analyze
uni-# select * from Students where id=1216988;
QUERY PLAN
-------------------------------------------------------
Index Scan using students_pkey on students
(cost=0.29..8.30 rows=1 width=9)
(actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (id = 1216988)
Planning time: 0.066 ms
Execution time: 0.026 ms
... EXPLAIN Examples 81/141
Example: Join on a primary key (indexed) attribute (2016)
uni=# explain analyze
uni-# select s.id,p.name
uni-# from Students s, People p where s.id=p.id;
QUERY PLAN
----------------------------------------------------------
Hash Join (cost=988.58..3112.76 rows=31048 width=19)
(actual time=11.504..39.478 rows=31048 loops=1)
Hash Cond: (p.id = s.id)
-> Seq Scan on people p
(cost=0.00..989.97 rows=36497 width=19)
(actual time=0.016..8.312 rows=36497 loops=1)
-> Hash (cost=478.48..478.48 rows=31048 width=4)
(actual time=10.532..10.532 rows=31048 loops=1)
Buckets: 4096 Batches: 2 Memory Usage: 548kB
-> Seq Scan on students s
(cost=0.00..478.48 rows=31048 width=4)
(actual time=0.005..4.630 rows=31048 loops=1)
Total runtime: 41.0 ms
… EXPLAIN Examples 82/141
Example: Join on a primary key (indexed) attribute (2018)
uni=# explain analyze
uni-# select s.id,p.name
uni-# from Students s, People p where s.id=p.id;
QUERY PLAN
----------------------------------------------------------
Merge Join (cost=0.58..2829.25 rows=31361 width=18)
(actual time=0.044..25.883 rows=31361 loops=1)
Merge Cond: (s.id = p.id)
-> Index Only Scan using students_pkey on students s
(cost=0.29..995.70 rows=31361 width=4)
(actual time=0.033..6.195 rows=31361 loops=1)
Heap Fetches: 31361
-> Index Scan using people_pkey on people p
(cost=0.29..2434.49 rows=55767 width=18)
(actual time=0.006..6.662 rows=31361 loops=1)
Planning time: 0.259 ms
Execution time: 27.327 ms
… EXPLAIN Examples 83/141
Example: Join on a non-indexed attribute (2016)
uni=# explain analyze
uni=# select s1.code, s2.code
uni-# from Subjects s1, Subjects s2
uni=# where s1.offeredBy=s2.offeredBy;
QUERY PLAN
---------------------------------------------------------------
Merge Join (cost=4449.13..121322.06 rows=7785262 width=18)
(actual time=29.787..2377.707 rows=8039979 loops=1)
Merge Cond: (s1.offeredby = s2.offeredby)
-> Sort (cost=2224.57..2271.56 rows=18799 width=13)
(actual time=14.251..18.703 rows=18570 loops=1)
Sort Key: s1.offeredby
Sort Method: external merge Disk: 472kB
-> Seq Scan on subjects s1
(cost=0.00..889.99 rows=18799 width=13)
(actual time=0.005..4.542 rows=18799 loops=1)
-> Sort (cost=2224.57..2271.56 rows=18799 width=13)
(actual time=15.532..1100.396 rows=8039980 loops=1)
Sort Key: s2.offeredby
Sort Method: external sort Disk: 552kB
-> Seq Scan on subjects s2
(cost=0.00..889.99 rows=18799 width=13)
(actual time=0.002..3.579 rows=18799 loops=1)
Total runtime: 2767.1 ms
… EXPLAIN Examples 84/141
Example: Join on a non-indexed attribute (2018)
uni=# explain analyze
uni=# select s1.code, s2.code
uni-# from Subjects s1, Subjects s2
uni-# where s1.offeredBy = s2.offeredBy;
QUERY PLAN
---------------------------------------------------------------
Hash Join (cost=1286.03..108351.87 rows=7113299 width=18)
(actual time=8.966..903.441 rows=7328594 loops=1)
Hash Cond: (s1.offeredby = s2.offeredby)
-> Seq Scan on subjects s1
(cost=0.00..1063.79 rows=17779 width=13)
(actual time=0.013..2.861 rows=17779 loops=1)
-> Hash (cost=1063.79..1063.79 rows=17779 width=13)
(actual time=8.667..8.667 rows=17720 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1087kB
-> Seq Scan on subjects s2
(cost=0.00..1063.79 rows=17779 width=13)
(actual time=0.009..4.677 rows=17779 loops=1)
Planning time: 0.255 ms
Execution time: 1191.023 ms
… EXPLAIN Examples 85/141
Example: Join on a non-indexed attribute (2018)
uni=# explain analyze
uni=# select s1.code, s2.code
uni-# from Subjects s1, Subjects s2
uni-# where s1.offeredBy = s2.offeredBy and s1.code < s2.code;
QUERY PLAN
---------------------------------------------------------------
Hash Join (cost=1286.03..126135.12 rows=2371100 width=18)
(actual time=7.356..6806.042 rows=3655437 loops=1)
Hash Cond: (s1.offeredby = s2.offeredby)
Join Filter: (s1.code < s2.code)
Rows Removed by Join Filter: 3673157
-> Seq Scan on subjects s1
(cost=0.00..1063.79 rows=17779 width=13)
(actual time=0.009..4.602 rows=17779 loops=1)
-> Hash (cost=1063.79..1063.79 rows=17779 width=13)
(actual time=7.301..7.301 rows=17720 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1087kB
-> Seq Scan on subjects s2
(cost=0.00..1063.79 rows=17779 width=13)
(actual time=0.005..4.452 rows=17779 loops=1)
Planning time: 0.159 ms
Execution time: 6949.167 ms
Exercise 4: EXPLAIN examples 86/141
Using the database described earlier …
Course_enrolments(student, course, mark, grade, …)
Courses(id, subject, semester, homepage)
People(id, family, given, title, name, …, birthday)
Program_enrolments(id, student, semester, program, wam, …)
Students(id, stype)
Subjects(id, code, name, longname, uoc, offeredby, …)
create view EnrolmentCounts as
select s.code, c.semester, count(e.student) as nstudes
from Courses c join Subjects s on c.subject=s.id
join Course_enrolments e on e.course = c.id
group by s.code, c.semester;
predict how each of the following queries will be executed …
Check your prediction using the EXPLAIN ANALYZE command.
1. select max(birthday) from People
2. select max(id) from People
3. select family from People order by family
4. select distinct p.id, p.name
from People p, Course_enrolments e
where p.id=e.student and e.grade='FL'
5. select * from EnrolmentCounts where code='COMP9315';
Examine the effect of adding ORDER BY and DISTINCT.
Add indexes to improve the speed of slow queries.
Transaction Processing
Transaction Processing 89/141
Transaction (tx) = application-level atomic op, made up of multiple DB ops
Concurrent transactions are
desirable, for improved performance
problematic, because of potential unwanted interactions
To ensure problem-free concurrent transactions:
Atomic … whole effect of tx, or nothing
Consistent … individual tx’s are “correct” (wrt application)
Isolated … each tx behaves as if no concurrency
Durable … effects of committed tx’s persist
… Transaction Processing 90/141
Transaction states:
COMMIT ⇒ all changes preserved, ABORT ⇒ database unchanged
… Transaction Processing 91/141
Transaction processing:
the study of techniques for realising ACID properties
Consistency is the property:
a tx is correct with respect to its own specification
a tx performs a mapping that maintains all DB constraints
Ensuring this must be left to application programmers.
Our discussion focusses on: Atomicity, Durability, Isolation
… Transaction Processing 92/141
Atomicity is handled by the commit and abort mechanisms
commit ends tx and ensures all changes are saved
abort ends tx and undoes changes “already made”
Durability is handled by implementing stable storage, via
redundancy, to deal with hardware failures
logging/checkpoint mechanisms, to recover state
Isolation is handled by concurrency control mechanisms
two possibilities: lock-based, timestamp-based
various levels of isolation are possible (e.g. serializable)
… Transaction Processing 93/141
Where transaction processing fits in the DBMS:
Transaction Terminology 94/141
To describe transaction effects, we consider:
READ – transfer data from “disk” to memory
WRITE – transfer data from memory to “disk”
ABORT – terminate transaction, unsuccessfully
COMMIT – terminate transaction, successfully
Relationship between the above operations and SQL:
SELECT produces READ operations on the database
UPDATE and DELETE produce READ then WRITE operations
INSERT produces WRITE operations
… Transaction Terminology 95/141
More on transactions and SQL
BEGIN starts a transaction
the begin keyword in PLpgSQL is not the same thing
COMMIT commits and ends the current transaction
some DBMSs e.g. PostgreSQL also provide END as a synonym
the end keyword in PLpgSQL is not the same thing
ROLLBACK aborts the current transaction, undoing any changes
some DBMSs e.g. PostgreSQL also provide ABORT as a synonym
In PostgreSQL, tx’s cannot be defined inside stored procedures (e.g. PLpgSQL)
… Transaction Terminology 96/141
The READ, WRITE, ABORT, COMMIT operations:
occur in the context of some transaction T
involve manipulation of data items X, Y, … (READ and WRITE)
The operations are typically denoted as:
RT(X) read item X in transaction T
WT(X) write item X in transaction T
AT abort transaction T
CT commit transaction T
Schedules 97/141
A schedule gives the sequence of operations from ≥ 1 tx
Serial schedule for a set of tx’s T1 .. Tn
all operations of Ti complete before Ti+1 begins
E.g. RT1(A) WT1(A) RT2(B) RT2(A) WT3(C) WT3(B)
Concurrent schedule for a set of tx’s T1 .. Tn
operations from individual Ti’s are interleaved
E.g. RT1(A) RT2(B) WT1(A) WT3(C) WT3(B) RT2(A)
… Schedules 98/141
Serial schedules guarantee database consistency
each Ti commits before Ti+1 starts
prior to Ti database is consistent
after Ti database is consistent (assuming Ti is correct)
before Ti+1 database is consistent …
Concurrent schedules interleave operations arbitrarily
and may produce a database that is not consistent
after all of the transactions have committed
Transaction Anomalies 99/141
What problems can occur with uncontrolled concurrent transactions?
The set of phenomena can be characterised broadly under:
dirty read:
reading data item currently in use by another tx
nonrepeatable read:
re-reading data item, since changed by another tx
phantom read:
re-reading result set, since changed by another tx
… Transaction Anomalies 100/141
Dirty read: a transaction reads data written by a concurrent uncommitted transaction
Example:
Transaction T1 Transaction T2
(1) select a into X
from R where id=1
(2) select a into Y
from R where id=1
(3) update R set a=X+1
where id=1
(4) commit
(5) update R set a=Y+1
where id=1
(6) commit
Effect: T1’s update on R.a is lost.
… Transaction Anomalies 101/141
Nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction
(that committed since the initial read)
Example:
Transaction T1 Transaction T2
(1) select * from R
where id=5
(2) update R set a=8
where id=5
(3) commit
(4) select * from R
where id=5
Effect: T1 runs same query twice; sees different data
… Transaction Anomalies 102/141
Phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows
satisfying the condition has changed due to another recently-committed transaction
Example:
Transaction T1 Transaction T2
(1) select count(*)
from R where a=5
(2) insert into R(id,a,b)
values (2,5,8)
(3) commit
(4) select count(*)
from R where a=5
Effect: T1 runs same query twice; sees different result set
Example of Transaction Failure 103/141
Above examples assumed that all transactions commit.
Additional problems can arise when transactions abort.
Consider the following schedule where transaction T1 fails:
T1: R(X) W(X) A
T2: R(X) W(X) C
Abort will rollback the changes to X, but …
Consider three places where rollback might occur:
T1: R(X) W(X) A [1] [2] [3]
T2: R(X) W(X) C
… Example of Transaction Failure 104/141
Abort / rollback scenarios:
T1: R(X) W(X) A [1] [2] [3]
T2: R(X) W(X) C
Case [1] is ok
all effects of T1 vanish; final effect is simply from T2
Case [2] is problematic
some of T1’s effects persist, even though T1 aborted
Case [3] is also problematic
T2’s effects are lost, even though T2 committed
Transaction Isolation
Transaction Isolation 106/141
Simplest form of isolation: serial execution (T1 ; T2 ; T3 ; …)
Problem: serial execution yields poor throughput.
Concurrency control schemes (CCSs) aim for “safe” concurrency
Abstract view of DBMS concurrency mechanisms:
Serializability 107/141
Consider two schedules S1 and S2 produced by
executing the same set of transactions T1..Tn concurrently
but with a non-serial interleaving of R/W operations
S1 and S2 are equivalent if StateAfter(S1) = StateAfter(S2)
i.e. final state yielded by S1 is same as final state yielded by S2
S is a serializable schedule (for a set of concurrent tx’s T1 ..Tn) if
S is equivalent to some serial schedule Ss of T1 ..Tn
Under these circumstances, consistency is guaranteed
(assuming no aborted transactions and no system failures)
… Serializability 108/141
Two formulations of serializability:
conflict serializability
i.e. conflicting R/W operations occur in the “right order”
check via precedence graph; look for absence of cycles (sketch below)
view serializability
i.e. read operations see the correct version of data
checked via VS conditions on likely equivalent schedules
View serializability is strictly weaker than conflict serializability.
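A sketch of the precedence-graph test for conflict serializability: add an edge Ti → Tj for each pair of conflicting operations where Ti's operation comes first, then look for cycles (the fixed-size graph and DFS below are illustrative):

#include <stdbool.h>
#define MAXTX 16

static bool edge[MAXTX][MAXTX];  // edge[i][j]: Ti must precede Tj
static int  state[MAXTX];        // 0 = unvisited, 1 = on stack, 2 = done

static bool cycleFrom(int i, int n)
{
    state[i] = 1;
    for (int j = 0; j < n; j++) {
        if (!edge[i][j]) continue;
        if (state[j] == 1) return true;              // back edge ⇒ cycle
        if (state[j] == 0 && cycleFrom(j, n)) return true;
    }
    state[i] = 2;
    return false;
}

// true iff the precedence graph over n tx's is acyclic
bool conflictSerializable(int n)
{
    for (int i = 0; i < n; i++) state[i] = 0;
    for (int i = 0; i < n; i++)
        if (state[i] == 0 && cycleFrom(i, n)) return false;
    return true;
}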
Exercise 5: Serializability Checking 109/141
Is the following schedule view/conflict serializable?
T1: W(B) W(A)
T2: R(B) W(A)
T3: R(A) W(A)
Is the following schedule view/conflict serializable?
T1: W(B) W(A)
T2: R(B) W(A)
T3: R(A) W(A)
Transaction Isolation Levels 110/141
SQL programmers’ concurrency control mechanism …
set transaction
read only — so weaker isolation may be ok
read write — suggests stronger isolation needed
isolation level (from weakest isolation/maximum concurrency to strongest isolation/minimum concurrency)
read uncommitted
read committed
repeatable read
serializable
Applies to current tx only; affects how scheduler treats this tx.
… Transaction Isolation Levels 111/141
Implication of transaction isolation levels:
Isolation Level     Dirty Read     Nonrepeatable Read   Phantom Read
Read Uncommitted    Possible       Possible             Possible
Read Committed      Not Possible   Possible             Possible
Repeatable Read     Not Possible   Not Possible         Possible
Serializable        Not Possible   Not Possible         Not Possible
… Transaction Isolation Levels 112/141
For transaction isolation, PostgreSQL
provides syntax for all four levels
treats read uncommitted as read committed
repeatable read behaves like serializable
default level is read committed
Note: cannot implement read uncommitted because of MVCC
For more details, see PostgreSQL Documentation section 13.2
extensive discussion of semantics of UPDATE, INSERT, DELETE
… Transaction Isolation Levels 113/141
A PostgreSQL tx consists of a sequence of SQL statements:
BEGIN S1; S2; … Sn; COMMIT;
Isolation levels affect view of DB provided to each Si:
in read committed …
each Si sees snapshot of DB at start of Si
in repeatable read and serializable …
each Si sees snapshot of DB at start of tx
serializable checks for extra conditions
Transactions fail if the system detects violation of isolation level.
… Transaction Isolation Levels 114/141
Example of repeatable read vs serializable
table R(class,value) containing (1,10) (1,20) (2,100) (2,200)
T1: X = sum(value) where class=1; insert R(2,X); commit
T2: X = sum(value) where class=2; insert R(1,X); commit
with repeatable read, both transactions commit, giving
updated table: (1,10) (1,20) (2,100) (2,200) (1,300) (2,30)
with serial transactions, only one transaction commits
T1;T2 gives (1,10) (1,20) (2,100) (2,200) (2,30) (1,330)
T2;T1 gives (1,10) (1,20) (2,100) (2,200) (1,300) (2,330)
PG recognises that committing both gives serialization violation
Implementing Concurrency Control
Concurrency Control 116/141
Approaches to concurrency control:
Lock-based
Synchronise tx execution via locks on relevant part of DB.
Version-based (multi-version concurrency control)
Allow multiple consistent versions of the data to exist.
Each tx has access only to version existing at start of tx.
Validation-based (optimistic concurrency control)
Execute all tx’s; check for validity problems on commit.
Timestamp-based
Organise tx execution via timestamps assigned to actions.
Lock-based Concurrency Control 117/141
Locks introduce additional mechanisms in DBMS:
The Lock Manager
manages the locks requested by the scheduler
… Lock-based Concurrency Control 118/141
Lock table entries contain (struct sketch below):
object being locked (DB, table, tuple, field)
type of lock: read/shared, write/exclusive
FIFO queue of tx’s requesting this lock
count of tx’s currently holding lock (max 1 for write locks)
Lock and unlock operations must be atomic.
Lock upgrade:
if a tx holds a read lock, and it is the only tx holding that lock
then the lock can be converted into a write lock
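A sketch of one lock table entry matching the description above; all the types are placeholders, not from any real lock manager:

typedef int ObjectID;                 // id of DB/table/tuple/field
typedef struct TxQueue TxQueue;       // FIFO queue of waiting tx's

typedef enum { READ_LOCK, WRITE_LOCK } LockMode;

typedef struct LockEntry {
    ObjectID  obj;        // object being locked
    LockMode  mode;       // read/shared or write/exclusive
    int       nHolders;   // tx's currently holding (max 1 for WRITE_LOCK)
    TxQueue  *waiters;    // tx's blocked waiting for this lock
} LockEntry;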
… Lock-based Concurrency Control 119/141
Synchronise access to shared data items via following rules:
before reading X, get read (shared) lock on X
before writing X, get write (exclusive) lock on X
a tx attempting to get a read lock on X is blocked if another tx already has a write lock on X
a tx attempting to get a write lock on X is blocked if another tx has any kind of lock on X
These rules alone do not guarantee serializability.
… Lock-based Concurrency Control 120/141
Consider the following schedule, using locks:
T1(a): Lr(Y) R(Y) continued
T2(a): Lr(X) R(X) U(X) continued
T1(b): U(Y) Lw(X) W(X) U(X)
T2(b): Lw(Y)….W(Y) U(Y)
(where Lr = read-lock, Lw = write-lock, U = unlock)
Locks correctly ensure controlled access to X and Y.
Despite this, the schedule is not serializable. (Ex: prove this)
Two-Phase Locking 121/141
To guarantee serializability, we require an additional constraint:
in every tx, all lock requests precede all unlock requests
Each transaction is then structured as:
growing phase where locks are acquired
action phase where “real work” is done
shrinking phase where locks are released
Clearly, this reduces potential concurrency …
Problems with Locking 122/141
Appropriate locking can guarantee correctness.
However, it also introduces potential undesirable effects:
Deadlock
No transactions can proceed; each waiting on lock held by another.
Starvation
One transaction is permanently “frozen out” of access to data.
Reduced performance
Locking introduces delays while waiting for locks to be released.
Deadlock 123/141
Deadlock occurs when two transactions are waiting for a lock on an item held by the other.
Example:
T1: Lw(A) R(A) Lw(B) ……
T2: Lw(B) R(B) Lw(A) …..
How to deal with deadlock?
prevent it happening in the first place
let it happen, detect it, recover from it
… Deadlock 124/141
Handling deadlock involves forcing a transaction to “back off”.
select process to “back off”
choose on basis of how far transaction has progressed, # locks held, …
roll back the selected process
how far does it need to be rolled back? (less roll-back is better)
worst-case scenario: abort one transaction
prevent starvation
need methods to ensure that same transaction isn’t always chosen
… Deadlock 125/141
Methods for managing deadlock
timeout : set max time limit for each tx
waits-for graph : records Tj waiting on lock held by Tk
prevent deadlock by checking for new cycle ⇒ abort some tx in the cycle
detect deadlock by periodic check for cycles ⇒ abort some tx in the cycle
timestamps : use tx start times as basis for priority
scenario: Tj tries to get lock held by Tk …
wait-die: if Tj < Tk, then Tj waits, else Tj rolls back
wound-wait: if Tj < Tk, then Tk rolls back, else Tj waits
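Both policies reduce to a small decision function. A sketch using integer timestamps, where a smaller timestamp means an older tx; Tj is the requester, Tk the holder:

typedef enum { WAIT, ROLLBACK_REQUESTER, ROLLBACK_HOLDER } Action;

// wait-die: an older requester waits; a younger one "dies"
Action waitDie(int tsTj, int tsTk)
{
    return (tsTj < tsTk) ? WAIT : ROLLBACK_REQUESTER;
}

// wound-wait: an older requester "wounds" the holder; a younger one waits
Action woundWait(int tsTj, int tsTk)
{
    return (tsTj < tsTk) ? ROLLBACK_HOLDER : WAIT;
}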
... Deadlock 126/141
Properties of deadlock handling methods:
both wait-die and wound-wait are fair
wait-die tends to
roll back tx's that have done little work
but rolls back tx's more often
wound-wait tends to
roll back tx's that may have done significant work
but rolls back tx's less often
timestamps easier to implement than waits-for graph
waits-for minimises roll backs because of deadlock
Exercise 6: Deadlock Handling 127/141
Consider the following schedule on four transactions:
T1: R(A) W(C) W(D)
T2: R(B) W(C)
T3: R(D) W(B)
T4: R(E) W(A)
Assume that: each R acquires a shared lock; each W uses an exclusive lock; two-phase locking is used.
Show how the waits-for graph for the locks evolves.
Show how any deadlocks might be resolved via this graph.
Optimistic Concurrency Control 128/141
Locking is a pessimistic approach to concurrency control:
limit concurrency to ensure that conflicts don't occur
Costs: lock management, deadlock handling, contention.
In scenarios where there are far more reads than writes ...
don't lock (allow arbitrary interleaving of operations)
check just before commit that no conflicts occurred
if problems, roll back conflicting transactions
... Optimistic Concurrency Control 129/141
Transactions have three distinct phases:
Reading: read from database, modify local copies of data
Validation: check for conflicts in updates
Writing: commit local copies of data to database
Timestamps are recorded at points noted:
... Optimistic Concurrency Control 130/141
Data structures needed for validation:
A ... set of txs that are reading data and computing results
V ... set of txs that have reached validation (not yet committed)
F ... set of txs that have finished (committed data to storage)
for each Ti, timestamps for when it reached A, V, F
R(Ti) set of all data items read by Ti
W(Ti) set of all data items to be written by Ti
Use the V timestamps as ordering for transactions
assume serial tx order based on ordering of V(Ti)'s
... Optimistic Concurrency Control 131/141
Validation check for transaction T
for all transactions Ti ≠ T
if V(Ti) < A(T) < F(Ti), then check W(Ti) ∩ R(T) is empty
if V(Ti) < V(T) < F(Ti), then check W(Ti) ∩ W(T) is empty
If this check fails for any Ti, then T is rolled back.
Prevents: T reading dirty data, T overwriting Ti's changes
Problems: rolls back "complete" tx's, cost to maintain A,V,F sets
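The check above, written out as a sketch; the Tx record, its A/V/F timestamps and its read/write sets are assumed to be maintained by the scheduler:

#include <stdbool.h>

typedef int Timestamp;
typedef struct ItemSet ItemSet;          // set of data item ids
typedef struct {
    Timestamp A, V, F;                   // phase timestamps
    ItemSet  *R, *W;                     // read set, write set
} Tx;

bool overlaps(ItemSet *a, ItemSet *b);   // assumed set-intersection test

// true iff T passes validation against the other tx's
bool validate(Tx *T, Tx *other[], int n)
{
    for (int i = 0; i < n; i++) {
        Tx *Ti = other[i];
        if (Ti->V < T->A && T->A < Ti->F && overlaps(Ti->W, T->R))
            return false;    // T may have read Ti's uncommitted writes
        if (Ti->V < T->V && T->V < Ti->F && overlaps(Ti->W, T->W))
            return false;    // T could overwrite Ti's changes
    }
    return true;             // on failure, caller rolls T back
}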
Multi-version Concurrency Control 132/141
Multi-version concurrency control (MVCC) aims to
retain benefits of locking, while getting more concurrency
by providing multiple (consistent) versions of data items
Achieves this by
readers access an "appropriate" version of each data item
writers make new versions of the data items they modify
Main difference between MVCC and standard locking:
read locks do not conflict with write locks ⇒
reading never blocks writing, writing never blocks reading
... Multi-version Concurrency Control 133/141
WTS = timestamp of last writer; RTS = timestamp of last reader
Chained tuple versions: tup_oldest → tup_older → tup_newest
When a reader Ti is accessing the database
ignore any data item D created after Ti started (WTS(D) > TS(Ti))
use only newest version V satisfying WTS(V) < TS(Ti)
When a writer Tj attempts to change a data item
find newest version V satisfying WTS(V) < TS(Tj)
if no later versions exist, create new version of data item
otherwise, abort Tj
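A sketch of the reader's version-selection rule, assuming a version chain ordered oldest to newest with a WTS stored in each version:

typedef int Timestamp;

typedef struct Version {
    Timestamp       wts;     // WTS(V): tx that wrote this version
    struct Version *newer;   // chain: oldest -> ... -> newest
} Version;

// reader Ti: newest version V with WTS(V) < TS(Ti); NULL if none
Version *versionForRead(Version *oldest, Timestamp tsTi)
{
    Version *v = NULL;
    for (Version *p = oldest; p != NULL; p = p->newer)
        if (p->wts < tsTi)
            v = p;           // versions with WTS(V) > TS(Ti) are ignored
    return v;
}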
... Multi-version Concurrency Control 134/141
Advantage of MVCC
locking needed for serializability considerably reduced
Disadvantages of MVCC
visibility-check overhead (on every tuple read/write)
reading an item V causes an update of RTS(V)
storage overhead for extra versions of data items
overhead in removing out-of-date versions of data items
Despite apparent disadvantages, MVCC is very effective.
... Multi-version Concurrency Control 135/141
Removing old versions:
Vj and Vk are versions of same item
WTS(Vj) and WTS(Vk) precede TS(Ti) for all Ti
remove version with smaller WTS(Vx) value
When to make this check?
every time a new version of a data item is added?
periodically, with fast access to blocks of data
PostgreSQL uses the latter (vacuum).
Concurrency Control in PostgreSQL 136/141
PostgreSQL uses two styles of concurrency control:
multi-version concurrency control (MVCC)
(used in implementing SQL DML statements (e.g. select))
two-phase locking (2PL)
(used in implementing SQL DDL statements (e.g. create table))
From the SQL (PLpgSQL) level:
can let the lock/MVCC system handle concurrency
can handle it explicitly via LOCK statements
... Concurrency Control in PostgreSQL 137/141
PostgreSQL provides read committed and serializable isolation levels.
Using the serializable isolation level, a select:
sees only data committed before the transaction began
never sees changes made by concurrent transactions
Using the serializable isolation level, an update fails:
if it tries to modify an "active" data item
(active = affected by some other tx, either committed or uncommitted)
The transaction containing the update must then roll back and re-start.
... Concurrency Control in PostgreSQL 138/141
Implementing MVCC in PostgreSQL requires:
a log file to maintain current status of each Ti
in every tuple:
xmin ID of the tx that created the tuple
xmax ID of the tx that replaced/deleted the tuple (if any)
xnew link to newer versions of tuple (if any)
for each transaction Ti :
a transaction ID (timestamp)
SnapshotData: list of active tx's when Ti started
... Concurrency Control in PostgreSQL 139/141
Rules for a tuple to be visible to Ti :
the xmin (creation transaction) value must
be committed in the log file
have started before Ti's start time
not be active at Ti's start time
the xmax (delete/replace transaction) value must
be blank or refer to an aborted tx, or
have started after Ti's start time, or
have been active at SnapshotData time
For details, see: utils/time/tqual.c
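A much-simplified sketch of these rules; the real logic in utils/time/tqual.c handles many more cases, and the types and helper tests here are assumptions:

#include <stdbool.h>

typedef int TxID;
typedef struct { TxID xmin, xmax; } TupleHeader;
typedef struct Snapshot Snapshot;        // Ti's start time + active tx's

bool committed(TxID x);                  // committed, per the log file
bool startedBefore(TxID x, Snapshot *s); // x began before Ti started
bool activeAtStart(TxID x, Snapshot *s); // x in Ti's SnapshotData

bool tupleVisible(TupleHeader *t, Snapshot *s)
{
    // creating tx: committed, started before Ti, not active at Ti's start
    if (!committed(t->xmin) || !startedBefore(t->xmin, s)
            || activeAtStart(t->xmin, s))
        return false;
    // deleting/replacing tx: absent, aborted, later, or still active
    if (t->xmax == 0) return true;       // 0 used here for "blank"
    if (!committed(t->xmax)) return true;
    if (!startedBefore(t->xmax, s)) return true;
    if (activeAtStart(t->xmax, s)) return true;
    return false;                        // deletion visible: hide tuple
}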
... Concurrency Control in PostgreSQL 140/141
Tx's always see a consistent version of the database.
But may not see the "current" version of the database.
E.g. T1 does select, then concurrent T2 deletes some of T1's selected tuples
This is OK unless tx's communicate outside the database system.
E.g. T1 counts tuples while T2 deletes then counts; then counts are compared
Use locks if application needs every tx to see same current version
LOCK TABLE locks an entire table
SELECT FOR UPDATE locks only the selected rows
Exercise 7: Locking in PostgreSQL 141/141
How could we solve this problem via locking?
create or replace function
allocSeat(paxID int, fltID int, seat text)
returns boolean
as $$
declare
pid int;
begin
select pax into pid from SeatingAlloc
where flightID = fltID and seatNum = seat;
if (pid is not null) then
return false; -- someone else already has seat
else
update SeatingAlloc set pax = paxID
where flightID = fltID and seatNum = seat;
return true;
end if;
end;
$$ language plpgsql;
Produced: 13 Sep 2018