Relational Algebra
RELATIONAL ALGEBRA
Database Theory & Applications (M)
Dr Chris Anagnostopoulos
ROADMAP
Objective: explore what is inside a Database System
…from what to do going to how to do
List different strategies for SQL processing
Relational Algebra
Challenge: Transform SQL into Relational Expression
Challenge: Transform Relational Expression into Relational Tree
Background for Query Optimization
Notation: Cardinality of a set R, notated by |R|, is the number of
elements in R
R = {a, b, c} then |R| = 3
R = { } then |R| = 0
2
RELATIONAL ALGEBRA
Algebra: set of operations: select, project, join, aggregate, grouping
SELECT E.DNO, D.DNAME, E.AVG(Salary)
FROM EMPLOYEE E, DEPARTMENT D
WHERE E.DNO = D.DNUMBER
GROUP BY E.DNO
Relational Expression: is a sequence of operations which defines
the execution order, i.e., algorithm.
Objective: Find the optimal sequence of operations that minimizes
the computational complexity of a SQL query.
What to do How to do How to optimally do
3
RELATIONAL ALGEBRA SYMBOLS
Unary operations
SELECT: σ (sigma) //selecting tuples
PROJECT: π (pi) //selecting columns
Binary operations
JOIN (symbol: ): join two relations w.r.t., FK = PK
Cartesian Product (symbol: X ): combine two relations
Aggregation operations
Aggregation functions: γ (gamma), e.g., COUNT, AVG, …
Visit: https://en.wikipedia.org/wiki/Relational_algebra
4
SELECT ()
SELECT: (sigma); select tuples w.r.t. condition.
SELECT *
FROM EMPLOYEE
WHERE DNO = 4
DNO=4(EMPLOYEE)
SELECT *
FROM EMPLOYEE
WHERE Salary < 30000 AND Salary > 10000
Salary<30000 AND Salary>1000(EMPLOYEE)
Syntax:
5
NESTED SELECTIONS ()
σ is commutative, i.e., we can change the selection order
Salary>30000(DNO=4(EMPLOYEE)) = DNO=4(Salary>30000(EMPLOYEE))
Q: Which one do you prefer most?
A: Depends on size of the intermediate result
Objective: choose the sequence with the minimum intermediate results
size.
Challenge: Without executing a query, can you predict the size of an
intermediate result? [*]
[*] Anagnostopoulos, C. et al. (2017) Query-driven learning for predictive analytics of data
subspace cardinality. ACM Transactions on Knowledge Discovery from Data, 11(4), 47.
6
Lemma 1: 0 ≤ |
|DNO=4(EMPLOYEE)| ≤ |EMPLOYEE|
Selectivity: percentage (%) of tuples retrieved:
|
Challenge: For any query, predict the selectivity without query execution*
Decision: choose the strategy with the minimum predicted size:
Salary>30000(DNO=4(EMPLOYEE))
DNO=4(Salary>30000(EMPLOYEE))
[*] Anagnostopoulos, C. et al. (2017) Efficient Scalable Accurate Regression Queries in In-DBMS
Analytics. In: IEEE International Conference on Data Engineering (ICDE), CA, USA, Apr 2017
SELECT ()
7
[A1] WORKED EXAMPLE
Task 1: SQL → Algebra
JobType=‘Faculty’(EMPLOYEE)
JobType=‘Faculty’ AND Dept=‘CS’(EMPLOYEE)
(EMPLOYEE)
Task 2: Selectivity of each SQL.
• 2/4 = 50%
• 1/4 = 25%
• 4/4 = 100%
SELECT *
FROM EMPLOYEE
WHERE JobType = ‘Faculty’
SELECT *
FROM EMPLOYEE
WHERE JobType = ‘Faculty’
AND Dept = ‘CS’
SELECT *
FROM EMPLOYEE
8
PROJECT ()
PROJECT: (pi) show only specified columns.
SELECT LNAME, Salary
FROM EMPLOYEE
LNAME, Salary(EMPLOYEE)
Syntax:
9
[A2] WORKED EXAMPLE
Task 1: SQL→ Algebra.
ID(EMPLOYEE)
ID, Dept(EMPLOYEE)
Task 2: Selectivity of each SQL.
• 100%
• 100%
SELECT ID
FROM EMPLOYEE
SELECT ID, Dept
FROM EMPLOYEE
10
SELECTION & PROJECTION EXPRESSION
SELECT FNAME, LNAME, Salary, DNO
FROM EMPLOYEE
WHERE DNO = 5;
FNAME, LNAME, SALARY,DNO(DNO=5(EMPLOYEE)) //select-then-project
R1 DNO=5(EMPLOYEE)
R2 FNAME, LNAME, SALARY,DNO(R1)
DNO=5(FNAME, LNAME, SALARY,DNO(EMPLOYEE)) //project-then-select
R1 FNAME, LNAME, SALARY,DNO(EMPLOYEE)
R2 DNO=5(R1)
11
PROJECT OR SELECT, FIRST?
Strategy 1: first select employees in DNO=5, then project onto the attributes
i.e., reduce number of retrieved tuples, and then reduce dimensions
FNAME, LNAME, SALARY,DNO(DNO=5(EMPLOYEE))
Q: Size of intermediate result?
Strategy 2: first project onto attributes, then select the employees in DNO=5
i.e., reduce dimensions, and then reduce number of tuples
DNO=5(FNAME, LNAME, SALARY,DNO(EMPLOYEE))
Q: Size of intermediate results?
What if only one tuple satisfies DNO = 5?
Strategy 1 is better than Strategy 2, why?
What if all employees satisfy DNO = 5?
Strategy 1 is as good as Strategy 2, why?
12
[A3] WORKED EXAMPLE
SELECT W.ESSN, W.HOURS
FROM WORKS_ON AS W
WHERE W.PNO = 1
Task 1: Propose two alternative strategies.
Task 2: Propose the best alternative for each Case.
13
ESSN PNO HOURS
1 1 20
1 2 10
2 2 5
3 2 30
ESSN PNO HOURS
1 1 20
1 2 10
2 1 5
3 1 30
Case I Case II
[A3] WORKED EXAMPLE
SELECT W.ESSN, W.HOURS
FROM WORKS_ON AS W
WHERE W.PNO = 1
A1: W.ESSN, W.HOURS(W.PNO=1(WORKS_ON))
A2: W.PNO=1(W.ESSN, W.HOURS(WORKS_ON))
14
ESSN PNO HOURS
1 1 20
1 2 10
2 2 5
3 2 30
ESSN PNO HOURS
1 1 20
1 2 10
2 1 5
3 1 30
Case I Case II
LESSON LEARNT SO FAR…
Heuristic Rule 1: first select and then project π(σ)
15
JOIN & CARTESIAN PRODUCT
JOIN operation ( ) concatenates only associated tuples from R and S:
R FK = PK S
Cartesian product ( X ) concatenates all tuples from R and S: R X S
SELECT *
FROM EMPLOYEE, DEPARTMENT
WHERE MGR_SSN = SSN
EMPLOYEE SSN = MGR_SSN DEPARTMENT
SELECT *
FROM EMPLOYEE, DEPARTMENT
EMPLOYEE X DEPARTMENT
Lemma 2: |R PK = FK S| ≤ |R|
. |S| = |R X S|
Why? Because, only associated tuples appear in the result.
16
REASONING ABOUT JOIN
Lemma 3: R FK = PK S = σFK=PK (R X S)
Strategy 1: EMPLOYEE SSN = MGR_SSNDEPARTMENT
Strategy 2: σSSN=MGR_SSN(EMPLOYEE X DEPARTMENT)
Context:
EMPLOYEE: 1000 tuples. Some employees are managers of departments.
DEPARTMENT : 5 tuples. Each department is managed by one manager.
Q1: Which is the expected number of tuples in Strategy 1?
A1: Only 5 employees will be retrieved; we have only 5 departments
Q2: Which is the expected size of the intermediate result in Strategy 2?
A2: 1000 x 5 = 5000 combined tuples (including fictitious)
But: only 5 tuples satisfy the condition, i.e., 5/5000 = 0.1% of tuples should
have been matched…rest 99.9% is noise
17
LESSON LEARNT SO FAR…
Heuristic Rule 1: first select and then project π(σ)
Heuristic Rule 2: always use over FK and PK to
join two tables and never the Cartesian product.
18
SELECT FNAME, LNAME
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME = ‘Research’ AND DNUMBER = DNO
◼ There are 10 departments and ‘Research’ has 15 employees
◼ There are 1000 employees
Strategy 1:
FNAME,LNAME(σDNAME=‘Research’ AND DNUMBER=DNO(DEPARTMENT X EMPLOYEE))
//10,000 tuples
Strategy 2:
FNAME, LNAME(σDNAME=‘Research’ (DEPARTMENT DNUMBER=DNOEMPLOYEE))
//1,000 tuples
R1 DEPARTMENT DNUMBER=DNO EMPLOYEE
R2 σDNAME=‘Research’ (R1)
RESULT π FNAME, LNAME (R2)
PUT ALL TOGETHER
19
SELECT FNAME, LNAME
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME = ‘Research’ AND DNUMBER = DNO
◼ There are 10 departments and ‘Research’ has 15 employees
◼ There are 1000 employees
Strategy 3:
FNAME, LNAME(EMPLOYEE DNO = DNUMBER (σDNAME = ‘Research’ (DEPARTMENT)))
//1 + 15 = 16 tuples
//Optimal is: 15 tuples…very close!
R1 σDNAME=‘Research’(DEPARTMENT)
R2 (R1 DNUMBER= DNOEMPLOYEE)
RESULT πFNAME, LNAME(R2)
10,000 tuples → 1,000 tuples (90% reduction) → 16 tuples (99.84% reduction)
PUT ALL TOGETHER
20
LESSON LEARNT SO FAR…
Heuristic Rule 1: first select and then project π(σ)
Heuristic Rule 2: always use over FK and PK to
join two tables and never the Cartesian product.
Heuristic Rule 3: apply the select operation before
joining tables to significantly reduce the
intermediate result size. 21
[A4] WORKED EXAMPLE
Task 1: Propose a good strategy/expression
PAPER(AID, PAPERID, URL)
AUTHOR(ID,NAME)
//PAPER.AID references AUTHOR.ID
Context:
There are 500 authors and 10,000 papers
Chris is unique and has written 100 papers
Each author has written at least one paper
SELECT P.URL
FROM PAPER AS P, AUTHOR AS A
WHERE A.NAME = ‘Chris’ AND A.ID = P.AID;
22
[A4] WORKED EXAMPLE
SELECT P.URL
FROM PAPER AS P, AUTHOR AS A
WHERE A.NAME = ‘Chris’ AND A.ID = P.AID;
π
P.URL
(σ
Α.ΝΑΜΕ=‘Chris’AND A.ID = P.AID
(PAPER X AUTHOR))
//500 x 10,000 = 5,000,000 tuples
π
P.URL
(σ
Α.ΝΑΜΕ=‘Chris’
(PAPER
P.AID=A.ID
AUTHOR))
//retrieve papers of each author; search for Chris’ papers’ URLs
//10,000 tuples
π
P.URL
(PAPER
P.AID=A.ID
(σ
Α.ΝΑΜΕ=‘Chris’
(AUTHOR))
//retrieve only Chris; get his papers’ URLs.
//1 + 100 = 101 tuples; 99.998% reduction of complexity! 23
RELATIONAL TREE
A data-structure representing a query via algebra:
Tree Nodes: intermediate results, e.g., selection, projection, join,
Leaf Nodes: base relations.
Root: the final query result.
Interpretation:
visual feel of the complexity of the query
execution order of the operations—start from leaves to root
Transform a tree into another optimal tree: Heuristic Optimization 24
SELECT E.DNO, E.LNAME
FROM EMPLOYEE AS E
WHERE E.DNO = 5;
EMPLOYEE
E
σE.DNO=5
πE.LNAME, E.DNOroot
leaves
operator
πE.LNAME,E.DNO(σE.DNO=5(EMPLOYEE))
σE.DNO=5(πE.LNAME,E.DNO(EMPLOYEE))
EMPLOYEE
E
πE.LNAME, E.DNO
σE.DNO=5
IN-CLASS EXAMPLE: TREES
25
SELECT FNAME, LNAME
FROM EMPLOYEE E, DEPARTMENT D
WHERE D.DNAME=‘Research’ AND D.DNUMBER=E.DNO
FNAME, LNAME(σDNAME=‘Research’ (DEPARTMENT DNUMBER=DNOEMPLOYEE))
DEPARTMENT
D
EMPLOYEE
E
D.DNUMBER=E.DNO
σD.DNAME= ‘Research’
πFNAME, LNAME
root
leaves
operator
operator
26
[A5] WORKED EXAMPLE
Task 1: Draw the two relational trees of the SQL:
SELECT P.URL
FROM PAPER AS P, AUTHOR AS A
WHERE A.NAME = ‘Chris’ AND A.ID = P.AID;
π
P.URL
(σ
Α.ΝΑΜΕ=‘Chris’
(PAPER
P.AID=A.ID
AUTHOR))
π
P.URL
(PAPER
P.AID=A.ID
(σ
Α.ΝΑΜΕ=‘Chris’
(AUTHOR))
27
[A5] WORKED EXAMPLE
SELECT P.URL
FROM PAPER AS P, AUTHOR AS A
WHERE A.NAME = ‘Chris’ AND A.ID = P.AID;
28
PAPER
P
AUTHOR
A
P.AID = A.ID
σA.Name= ‘Chris’
πP.URL
PAPER
P
AUTHOR
A
P.AID = A.ID
σA.Name= ‘Chris’
πP.URL
first join and the select… first select and the join…