File Structures
ADVANCED SQL & ANALYTICS
Database Theory & Applications (M)
Dr Chris Anagnostopoulos
ROADMAP
Join Query
Dealing with NULL FKs
Analytics Query
Complex Un/Correlated Query using Aggregation Functions
over Groups of Tuples;
Objective: Extract knowledge from tuples and not just
retrieving tuples…
Modification Query
2
INNER JOIN
INNER JOIN matches tuples using FK and PK (THETA-JOIN).
The matching operator is, usually, the equality ‘ = ’, thus, it is
referred to as EQUIJOIN: R1.PK = R2.FK
Query 0: Show the employees who are working in the department
‘Research’.
3
INNER JOIN
SELECT Fname, Lname, Address
FROM EMPLOYEE, DEPARTMENT
WHERE Dname = ‘Research’ %selection condition
AND DNO = DNUMBER; %equi-join condition
Note: Join and selection conditions are both in the WHERE clause
4
INNER AND OUTER JOINS
INNER JOIN (versus OUTER JOIN)
A tuple is retrieved if and only if there exists a matching tuple;
i.e., FK is not NULL.
LEFT OUTER JOIN (LR LEFT OUTER JOIN RR)
Every tuple in the left relation LR must appear in result
If no matching tuple exists, just add NULL values for attributes of
right relation RR
RIGHT OUTER JOIN (LR RIGHT OUTER JOIN RR)
Every tuple in the right relation RR must appear in result
If no matching tuple exists, just add NULL values for attributes of left
relation LR
5
LEFT OUTER JOIN
Query 1: Show the last name of an employee and the last name of their
supervisor, if there exists!
SELECT E.Lname, S.Lname
FROM (EMPLOYEE AS E LEFT OUTER JOIN EMPLOYEE AS S
ON E.Super_SSN = S.SSN)
SELECT E.Lname, S.Lname
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.Super_SSN = S.SSN
6
E.Lname S.Lname
Smith Wong
Borg NULL
Franklin Jennifer
IN-CLASS QUIZ
SELECT E.Fname, E.Minit, E.Lname, D.Dname
FROM EMPLOYEE AS E LEFT OUTER JOIN DEPARTMENT AS D
ON E.SSN = D.MGR_SSN
What are we expecting? What knowledge do we extract?
7
AGGREGATE FUNCTION
Aggregation function: statistical summary/value over group of tuples.
Built-in aggregation functions over attribute X:
COUNT( * ): How many employees are working in Dept 5?
SUM ( X ): Sum up all salaries of employees in Dept 5.
MAX( X ) / MIN ( X ): Who is the youngest employee in Dept. 5?
AVG( X ): Average salary of employees in Dept. 5
CORR(X, Y): Correlation between Age and Salary of employees in
Dept. 5
…
Note: NULL values are discarded apart from COUNT(*).
Note: Define our own function,
e.g., calculate the GPA as a weighted sum of grades;
e.g., calculate the Euclidean distance between two geo-locations in Location-
base Services applications…
8
AGGREGATE FUNCTION
Query 2: Show the maximum, minimum and average salary of those
employees who work in Dept. 5.
SELECT MAX (Salary) AS Highest_Sal,
MIN (Salary) AS Lowest_Sal,
AVG (Salary) AS Average_Sal
FROM EMPLOYEE
WHERE DNO = 5;
9
ANALYTICS: GROUPING TUPLES
Partition a relation into groups based on grouping attribute, i.e.,
clustering tuples having the same value in the grouping attribute.
GROUP BY {grouping attribute}
Which is the grouping attribute and the expected number of groups?
1: Group of employees with the same last-name
2: Group of employees working in the same department
3: Group of dependents of the same employee
4: Group of employees with the same salary…
Then, we apply aggregation functions to each group. 10
ANALYTICS: GROUPING TUPLES
aggregation function f, e.g., COUNT(*)
3
3
4
11
#tuples (whole) #tuples per group
IN-CLASS EXAMPLE: GROUP BY
Query 3: Show the number of employees per department & average salary
per department.
SELECT DNO, COUNT (*), AVG (Salary)
FROM EMPLOYEE
GROUP BY DNO;
Step 1: Partition EMPLOYEE into separate groups w.r.t. department.
Step 2: For each group, calculate its cardinality and average salary.
Note 1: The grouping attribute must appear in SELECT
Note 2: If the grouping attribute has NULL values, then a separate group is
created for the NULL value.
12
13
Actual Analytics Query is: Which is the most populated department?
IN-CLASS EXAMPLE: HISTOGRAM
Query 4: Show the number of
employees per age.
SELECT E.AGE, COUNT (*)
FROM EMPLOYEE AS E
GROUP BY E.AGE;
Step 1: Partition EMPLOYEE w.r.t. age.
Step 2: For each group, calculate its
cardinality.
Note: Use this analytics to approximate the
histogram of AGE, i.e., how the AGE is
distributed over the tuples…
14
#
E
m
p
lo
y
e
e
s
IN-CLASS EXAMPLE: REGRESSION ANALYTICS
Query 5: How is the average salary of
employees distributed along the age?
SELECT E.AGE, AVG(E.Salary)
FROM EMPLOYEE AS E
GROUP BY E.AGE;
Note: Use this analytics to approximate the
dependency of Salary on AGE.
Note: Predictive Analytics: given a age, predict
the expected salary
15
AGE
A
V
G
(S
a
la
r
y
)
IN-CLASS EXAMPLE
Query 6: How many employees are working in each project? Include the
project name at the results.
SELECT P.PNAME, COUNT (*) //employees per project
FROM PROJECT AS P, WORKS_ON AS W
WHERE P.PNUMBER = W.PNO
GROUP BY P.PNAME;
Step 1: Associate a PROJECT with the WORKS_ON
Step 2: Group associated tuples together w.r.t. PNAME
Step 3: For each group, count the number of tuples (employees)
16
17
2 members
3 members
2 members
3 members
3 members
3 members
PNAME COUNT(*)
Product X 2
Product Y 3
Product Z 2
[A1] WORKED EXAMPLE
Task 1: Which is the average salary of employees per department? Include
the department name at the results.
18
SSN Salary DNO
1 £30K 1
2 £60K 1
3 £20K 2
4 $25K 3
DNUMBER DNAME
1 D1
2 D2
3 D3
EMPLOYEE DEPARTMENT
DNAME AVG(Salary)
D1 £45K
D2 £20K
D3 £25K
[A1] WORKED EXAMPLE
Task 1: Which is the average salary of employees per department? Include
the department name at the results.
SELECT D.DNAME, AVG (E.SALARY) //avg salary/dept
FROM DEPARTMENT AS D, EMPLOYEE AS E
WHERE D.DNUMBER = E.DNO
GROUP BY D.DNAME;
Step 1: Associate EMPLOYEE with DEPARTMENT
Step 2: Group associated tuples together w.r.t. DNAME
Step 3: For each group, take the average of salaries
19
GROUP BY & HAVING
HAVING: condition to select/reject a group after grouping!
Query 7: Show the number of employees per project only from those projects
with more than 2 employees. Include the project name in the results.
SELECT P.PNAME, COUNT (*)
FROM PROJECT AS P, WORKS_ON AS W
WHERE P.PNUMBER = W.PNO
GROUP BY P.PNAME
HAVING COUNT(*) > 2
20
ESSN PNO
1 P1
2 P1
3 P2
4 P1
WORKS_ON
PName PNO
GLA P1
EDI P2
PROJECT
PName COUNT(*)
GLA 3
3
1
2
4
o
r
d
e
r
IN-CLASS EXAMPLE
Task: Who are the employees (SSN, last name) with more than two
dependents.
21
EMPLOYEE
DEPENDENT
IN-CLASS EXAMPLE
Task: Who are the employees (SSN, last name) with more than two
dependents.
Idea: Group dependents of the same employee; count.
SELECT E.SSN, E.LNAME, COUNT (*)
FROM DEPENDENT AS D, EMPLOYEE AS E
WHERE E.SSN = D.ESSN
GROUP BY E.SSN
HAVING COUNT(*) > 2 22
[A2] WORKED EXAMPLE
Task: Who are the managers (last names) of those departments
with more than 100 employees?
23
EMPLOYEE
DEPARTMENT
[A2] WORKED EXAMPLE
Task: Which are the managers (last names) of those departments
with more than 100 employees?
SELECT M.LNAME
FROM EMPLOYEE M, DEPARTMENT P
WHERE M.SSN = P.MGR_SSN
AND P.DNUMBER IN (
SELECT E.DNO
FROM EMPLOYEE AS E
GROUP BY E.DNO
HAVING COUNT(*) > 100);
24
TRICKY ANALYTICS QUERY
Query 8: For each department with more than 5 employees, tell
me how many of them are making more than £40K.
SELECT DNO, COUNT (*)
FROM EMPLOYEE
WHERE Salary > 40000
GROUP BY DNO
HAVING COUNT (*) > 5;
Step 1: Identify the departments with more than 5 employees.
Step 2: Check if employees of those departments earn more
than £40K; count
But: WHERE filters out employees with Salary <= £40K before grouping…thus, the group sizes (employees per department) are not correct… 25 26 ‘Department 1’ ‘Department 2’ ‘Department 3’ Legend Employee with more than £40K Employee with less than £40K TRICKY ANALYTICS QUERY ‘Department 1’ ‘Department 2’ ‘Department 3’ SELECT DNO, COUNT (*) FROM EMPLOYEE WHERE Salary > 40000
GROUP BY DNO
HAVING COUNT (*) > 5;
‘Department 4’
‘Department 4’
Correct: we want to count the total number of employees whose
salaries exceed £40K in those departments with more than five
employees.
27
‘Department 1’
‘Department 2’
‘Department 3’
Legend
Employee with more than £40K
Employee with less than £40K
TRICKY ANALYTICS QUERY
‘Department 4’
DNO COUNT(*)
2 2
4 3
SELECT DNO, COUNT(*)
FROM EMPLOYEE
WHERE Salary > 40000 AND DNO IN
(SELECT A.DNO
FROM EMPLOYEE A
GROUP BY A.DNO
HAVING COUNT(*) > 5)
GROUP BY DNO
First, find the
departments with
more than 5
employees
Second, for each
department, check if
members earn more
than £40K.
28
Then, group
and count
the £40K-
employees
per
department
‘Department 1’
‘Department 2’
‘Department 3’
‘Department 4’
2 4
[A3] WORKED EXAMPLE
EMPLOYEE(SSN, …, DNO)
Task: Show the department(s) with the maximum number of
employees.
Note: It might be the case that more than one department
has the maximum number of employees.
29
[A3] WORKED EXAMPLE
EMPLOYEE(SSN, …, DNO)
Task: Show the department(s) with the maximum number of
employees. It might be the case that more than one department has the
maximum number of employees.
SELECT DNO, COUNT(*)
FROM EMPLOYEE
GROUP BY DNO
HAVING COUNT(*) = (SELECT MAX(A.members)
FROM
(SELECT D.DNO, COUNT(*) AS members
FROM EMPLOYEE D
GROUP BY D.DNO) AS A
);
30
SQL: INSERT
Key, integrity and referential constraints are automatically enforced!
31
INSERT INTO EMPLOYEE (SSN, FNAME, LNAME, …)
VALUES (‘1234567’, ‘Chris’, ‘McReader’, …)
SQL: DELETE
Get a relation and include a WHERE to specify the tuple(s) to be deleted:
Note: A missing WHERE specifies that all tuples in the relation are to be
deleted; the table then becomes an empty table.
Referential Integrity: tuples are deleted from only one table at a time unless
ON DELETE CASCADE is specified on a constraint!
32
SQL: UPDATE
Modify attribute values of tuples, which satisfy WHERE before modification!
Query 9: Change the location and controlling department number of project
number 10 to ‘Bellaire’ and 5, respectively.
UPDATE PROJECT
SET Plocation = ‘Bellaire’, Dnum = 5
WHERE Pnumber = 10
Referential Integrity: tuples are updated from only one table at a time
unless ON UPDATE CASCADE is specified on a constraint!
33
IN-CLASS EXAMPLE
Task: Give all employees in the department 5 a 10% raise in salary.
UPDATE EMPLOYEE
SET SALARY = SALARY *1.1
WHERE DNO = 5
Modified SALARY depends on the original SALARY in each tuple:
SALARY on the right of = refers to the old salary before modification
SALARY on the left of = refers to the new salary after modification
34
DROP (OPTIONAL)
DROP is used to drop named schema elements, such as tables,
domains, or constraint
Example: DROP SCHEMA COMPANY CASCADE;
It removes the schema and all its elements including tables,
constraints, etc.
Example: DROP TABLE EMPLOYEE;
It drops the existing table EMPLOYEE and all of its tuples.
Example: DROP TABLE EMPLOYEE CASCADE CONSTRAINTS;
It drops the existing table EMPLOYEE, all of its tuples and drop
the FOREIGN KEY constraints of the tables referring to EMPLOYEE
(but not those tables). 35
ALTER (OPTIONAL)
ALTER includes:
Adding or dropping a column (attribute)
Changing a column definition
Adding or dropping table constraints
Example: ALTER TABLE COMPANY.EMPLOYEE ADD COLUMN Job VARCHAR(12);
Example: ALTER TABLE COMPANY.EMPLOYEE DROP CONSTRAINT EMPSUPERFK;
Example: ALTER TABLE COMPANY.DEPARTMENT ADD CONSTRAINT NEW_UNIQUE
UNIQUE (Dname);
Example: ALTER TABLE COMPANY.EMPLOYEE DROP COLUMN Address;
Example: ALTER TABLE COMPANY.DEPARTMENT ALTER COLUMN Mgr_ssn DROP
DEFAULT;
Example: ALTER TABLE COMPANY.DEPARTMENT ALTER COLUMN Mgr_ssn SET
DEFAULT ‘333445555’;
36