SQL 3: SELECT, part 2
SQL 2.3: SELECT
Jianjun Chen
1
Contents
Joins
Cross, Inner, Natural, Outer
ORDER BY to produce ordered output
Aggregate functions
MIN, MAX, SUM, AVG, COUNT
GROUP BY
HAVING
UNION
Missing Information
Joins
JOINs can be used to combine tables in a SELECT query
CROSS JOIN: Returns all pairs of rows from A and B, the same as Cartesian Product
INNER JOIN: Returns pairs of rows satisfying a condition
NATURAL JOIN: Returns pairs of rows with common values in identically named columns
OUTER JOIN: Returns pairs of rows satisfying a condition (as INNER JOIN), BUT ALSO handles NULLS
CROSS JOIN
Syntax:
SELECT * FROM A CROSS JOIN B;
Same as:
SELECT * FROM A, B;
Usually needs WHERE to filter out unrelated tuples
INNER JOIN
INNER JOIN specifies a condition that pairs of rows must satisfy.
SELECT * FROM A INNER JOIN B
ON condition
Can also use a USING clause that will output rows with equal values in the specified columns
SELECT * FROM A INNER JOIN B USING (col1, col2)
col1 and col2 must appear in both A and B.
INNER JOIN: Example
SELECT * FROM
Buyer INNER JOIN
Property
ON Price <= Budget
CREATE TABLE Buyer (name varchar(60), budget integer);
CREATE TABLE Property (addr varchar(60), price integer);
INSERT INTO Buyer VALUES ('Smith',100000),('Jones',150000),('Green',80000);
INSERT INTO Property VALUES ('15 high st',85000),('12 Queen st',125000),('87 Oak lane',175000);
Result is the same as:
SELECT * FROM
Buyer,
Property WHERE
Price <= Budget;
6
INNER JOIN: Practice
SELECT * FROM
Student INNER JOIN Enrolment
USING (ID)
?
INNER JOIN: Practice
A single ID row will be output representing the equal values from both Student.ID and Enrolment.ID
SELECT * FROM
Student INNER JOIN Enrolment
USING (ID)
NATURAL JOIN
SELECT * FROM A NATURAL JOIN B;
A NATURAL JOIN is effectively a special case of an INNER JOIN where the USING clause has specified all identically named columns.
Same as the A ⋈ B in relational algebra.
The data type of these columns can be different but will not always work. Strings can be automatically converted into integer or double, so they might work. However, data/times won’t work usually.
9
NATURAL JOIN
SELECT * FROM
Student NATURAL JOIN Enrolment;
JOINs vs WHERE Clauses
Inner/Natural JOINs are not absolutely necessary
You can obtain the same results by selecting from multiple tables and using appropriate WHERE clauses
Should you use them?
Yes
They often lead to concise and elegant queries
NATURAL JOINs are extremely common
No
Support for JOINs can vary between DBMSs
OUTER JOIN
SELECT cols
FROM table1 type-OUTER-JOIN table2
ON condition;
Where type is one of LEFT, RIGHT or FULL.
Example: Left Outer Join
SELECT * FROM
Student LEFT OUTER JOIN Enrolment
ON Student.ID = Enrolment.ID;
INSERT INTO student VALUES (123,'John'),(124,'Mary'),(125,'Mark'),(126,'Jane');
INSERT INTO enrolment VALUES (123,'DBS',60),(124,'PRG',70),(125,'DBS',50),(128,'DBS',80);
13
Example: Right Outer Join
Example: Full Outer Join
Full Outer Join in MySQL
Only Left and Right outer joins are supported in MySQL. If you really want a FULL outer join:
SELECT *
FROM Student FULL OUTER JOIN Enrolment
ON Student.ID = Enrolment.ID;
Can be achieved using:
(SELECT * FROM Student LEFT OUTER JOIN
Enrolment ON Student.ID = Enrolment.ID)
UNION
(SELECT * FROM Student RIGHT OUTER JOIN
Enrolment ON Student.ID = Enrolment.ID);
16
Why Using Outer Joins?
Sometimes an outer join is the most practical approach. We may encounter NULL values, but may still wish to see the existing information.
The next few slides will consider this problem:
For students graduating in absentia, find a list of all student IDs, names, addresses, phone numbers and their final degree classifications.
For students graduating in absentia, find a list of all student IDs, names, addresses, phone numbers and their final degree classifications.
Problems with INNER JOINS
An Inner Join with Student and Address will ignore Student 127, who doesn’t have an address record
An Inner Join with Student and Phone will ignore student 125, who doesn’t have a phone record
19
For students graduating in absentia, find a list of all student IDs, names, addresses, phone numbers and their final degree classifications. What is your solution?
Next table to combine
Next table to combine
Solution Using OURTER JOIN
SELECT ID, Name, aStreet, aTown, aPostcode, pNumber, Classification FROM
((Student LEFT OUTER JOIN Phone
ON Student.pID = Phone.pID)
LEFT OUTER JOIN Address
ON Student.aID = Address.aID)
INNER JOIN Degree
ON Student.ID = Degree.ID
WHERE Grad = ‘A’;
Solution Using OURTER JOIN
Student LEFT OUTER JOIN Phone
ON Student.pID = Phone.pID
Address
LEFT OUTER JOIN
ON Student.aID = Address.aID
INNER JOIN ON Student.ID = Degree.ID
WHERE Grad = ‘A’;
Final Result Using OUTER JOIN
The records for students 125 and 127 have been preserved despite missing information
SQL SELECT Overview
SELECT
[DISTINCT | ALL] column-list
FROM table-names
[WHERE condition]
[GROUP BY column-list]
[HAVING condition]
[ORDER BY column-list]
([] optional, | or)
ORDER BY
The ORDER BY clause sorts the results of a query
You can sort in ascending (default) or descending order
Multiple columns can be given
You cannot order by a column which isn’t in the result (Really so? Check Notes under this slide)
SELECT columns FROM tables
WHERE condition
ORDER BY cols [ASC | DESC]
SELECT y / 100 as y2 FROM a ORDER BY y DESC;
This DOES work in MySQL.
SELECT y / 100 + x as why FROM a ORDER BY y DESC;
It WORKS too!
SELECT y / 100 + x AS y1, y AS y2 FROM a ORDER BY y DESC;
Is sorted by y2 in descending order.
SELECT y AS y2 , y / 100 + x AS y1 FROM a ORDER BY y DESC;
Is still sorted by y2 in descending order.
Check this:
SELECT y + 1 AS y1 , - y + 2 AS y2, y FROM a ORDER BY y DESC
So apparently, Result is first sorted by y, then y + 1 and –y + 2 are calculated and added to the final result.
Thus, “order by” is evaluated first, followed by Select xxx.
27
ORDER BY: Example 1
SELECT * FROM Grades
ORDER BY Mark;
ORDER BY: Example 2
SELECT * FROM Grades
ORDER BY
Code ASC, Mark DESC;
Arithmetic
As well as columns, a SELECT statement can also be used to
Compute arithmetic expressions
Evaluate functions
Often helpful to use an alias when dealing with expressions or functions.
SELECT Mark / 100 FROM Grades;
SELECT Salary + Bonus FROM Employee;
SELECT 1.20 * Price
AS ‘Price inc. VAT’
FROM Products;
Updated select flow:
Once result returned from “from … where …”, each expression in select will be evaluated and included into the final table.
30
Aggregate Functions
Aggregate functions compute summaries of data in a table.
Most aggregate functions (except COUNT (*)) work on a single column of numerical data
Aggregate functions:
COUNT: The number of rows
SUM: The sum of the entries in the column
AVG: The average entry in a column
MIN,MAX: The minimum/maximum entries in a column
Again, it’s best to use an alias to name the result
31
COUNT
SELECT
COUNT(*) AS Count
FROM Grades;
SELECT
COUNT(Code) AS Count
FROM Grades;
SELECT
COUNT(DISTINCT Code)
AS Count
FROM Grades;
How about “SELECT COUNT(y), y FROM a”?
There will only be one value for y,
Which means the aggregation function is first done, and then all tuples with index larger than 0 is removed!
32
COUNT
SELECT
COUNT(*) AS Count
FROM Grades;
SELECT
COUNT(Code) AS Count
FROM Grades;
SELECT
COUNT(DISTINCT Code)
AS Count
FROM Grades;
SUM, MIN/MAX and AVG
SELECT
SUM(Mark) AS Total
FROM Grades;
SELECT
MAX(Mark) AS Best
FROM Grades;
SELECT
AVG(Mark) AS Mean
FROM Grades;
SUM, MIN/MAX and AVG
SELECT
SUM(Mark) AS Total
FROM Grades;
SELECT
MAX(Mark) AS Best
FROM Grades;
SELECT
AVG(Mark) AS Mean
FROM Grades;
Combining Aggregate Functions
You can combine aggregate functions using arithmetic
SELECT
MAX(Mark) - MIN(Mark)
AS Range_of_marks
FROM Grades;
Combining AF: Example
Find John’s average mark, weighted by the credits of each module
Combining AF: Example
Find John’s average mark, weighted by the credits of each module
SELECT SUM(Mark*Credits) / SUM (Credits)
AS ‘Final Mark’
FROM Modules, Grades
WHERE Modules.Code = Grades.Code
AND Grades.Name = ‘John’;
GROUP BY
Sometimes we want to apply aggregate functions to groups of rows
Example: find the average mark of each student individually
The GROUP BY clause achieves this.
SELECT column_set1 FROM tables
WHERE predicate
GROUP BY column_set2;
Every entry in ‘column_set2’ should be in ‘column_set1’, be a constant, or be an aggregate function
GROUP BY: Example
SELECT Name,
AVG(Mark) AS Average
FROM Grades
GROUP BY Name;
GROUP BY: Example
SELECT Name,
AVG(Mark) AS Average
FROM Grades
GROUP BY Name;
GROUP BY: Example
Find the total value of the sales for each department in each month
Can group by Month then Department or Department then Month
GROUP BY: Example
Same results, but produced in a different order
HAVING
HAVING is like a WHERE clause, except that it only applies to the results of a GROUP BY query
It can be used to select groups which satisfy a given condition
SELECT Name,
AVG(Mark) AS Average
FROM Grades
GROUP BY Name
HAVING
AVG(Mark) >= 40;
WHERE and HAVING
WHERE refers to the rows of tables, so cannot make use of aggregate functions.
HAVING refers to the groups of rows, and so cannot use columns which are not in the GROUP BY or an aggregate function.
Think of a query being processed as follows:
Tables are joined
WHERE clauses
GROUP BY clauses and aggregates
Column selection
HAVING clauses
ORDER BY
SQL SELECT Overview
SELECT
[DISTINCT | ALL] column-list
FROM table-names
[WHERE condition]
[GROUP BY column-list]
[HAVING condition]
[ORDER BY column-list]
([] optional, | or)
SET operations
UNION, INTERSECT and EXCEPT
These treat the tables as sets and are the usual set operators of union, intersection and difference
We’ll be concentrating on UNION
They all combine the results from two select statements
The results of the two selects should have the same columns and corresponding data types
UNION: Example
Find, in a single query, the average mark for each student and the average mark overall.
UNION
The average for each student:
SELECT Name, AVG(Mark) AS Average
FROM Grades
GROUP BY Name;
The average overall:
SELECT ‘Total’ AS Name,
AVG(Mark) AS Average
FROM Grades;
UNION
SELECT Name,
AVG(Mark) AS Average
FROM Grades
GROUP BY Name
UNION
SELECT
‘Total’ AS Name,
AVG(Mark) AS Average
FROM Grades;
That’s Everything for SELECT 🙂
Missing Information
Missing Information
Sometimes we don’t know what value an entry in a relation should have
We know that there is a value, but don’t know what it is
There is no value at all that makes any sense
Two main methods have been proposed to deal with this
NULLs can be used as markers to show that information is missing
A default value can be used to represent the missing value
Null
Represents a state for an attribute that is currently unknown or is not applicable for this tuple.
Nulls are a way to deal with incomplete or exceptional data.
NULL is a placeholder for missing or unknown value of an attribute. It is not itself a value.
E.g. A new staff is just added, but hasn’t been decided which branch he belongs to.
NULLs
Codd proposed to distinguish two types of NULLs:
A-marks: data Applicable but not known (for example, someone’s age)
I-marks: data is Inapplicable (telephone number for someone who does not have a telephone, or spouse’s name for someone who is not married)
Problems with NULLs
Problems extending relational algebra operations to NULLs:
Selection operation: if we check tuples for “Mark > 40” and for some tuple Mark is NULL, do we include it?
Comparing tuples in two relations: are two tuples and the same or not?
Additional problems for SQL:
NULLs treated as duplicates?
Inclusion of NULLs in count, sum, average?
If yes, how?
Arithmetic operations behaviour with argument NULL?
Theoretical Solutions
Use three-valued logic instead of classical two valued logic to evaluate conditions.
When there are no NULLs around, conditions evaluate to true or false, but if a null is involved, a condition might evaluate to the third value (‘undefined’, or ‘unknown’)
3-valued logic
If the condition involves a Boolean combination, we evaluate it as follows:
SQL NULLs in Conditions
SELECT *
FROM Employee
Where
Salary > 15,000;
WHERE clause of SQL SELECT uses three-valued logic: only tuples where the condition evaluates to true are returned.
Salary > 15,000 evaluates to ‘unknown’ on the last tuple – not included
59
SQL NULLs in Conditions
SELECT * FROM Employee
Where
Salary > 15,000
OR
Name = ‘Chris’;
Salary > 15,000 OR Name = ‘Chris’ Is essentially Unknown OR TRUE on the last tuple
SQL NULLs in Arithmetic
SELECT Name,
Salary * 0.05 AS Bonus
FROM Employee;
Arithmetic operations applied to NULLs result in NULLS
SQL NULLs in Aggregation
SELECT
AVG(Salary) AS Average,
COUNT(Salary) AS Count,
SUM(Salary) AS Sum
FROM Employee;
Average = 20,000
Count = 3
Sum = 60,000
Using COUNT(*) would give 4, even if the name of Chris is changed to NULL.
SQL NULLs in GROUP BY
SELECT Salary,
COUNT(Name) AS Count
FROM Employee
GROUP BY Salary;
NULLs are treated as equivalents in GROUP BY clauses
SQL NULLs in ORDER BY
SELECT *
FROM Employee
ORDER BY Salary;
NULLs are considered and reported in ORDER BY clauses
Dealing with Missing Information
Sometimes we don’t know what value an entry in a relation should have
We know that there is a value, but don’t know what it is
There is no value at all that makes any sense
Two main methods have been proposed to deal with this
NULLs can be used as markers to show that information is missing
A default value can be used to represent the missing value
Default Values
Default values are an alternative to the use of NULLs
You can choose a value that makes no sense in normal circumstances:
age INT DEFAULT -1,
These are actual values
Default values can have more meaning than NULLs
‘none’
‘unknown’
‘not supplied’
‘not applicable’
Not all defaults represent missing information. It depends on the situation
Default Values Example
Default values are
“Unknown” for Name
-1 for Weight and Quantity
-1 is used for Wgt and Qty as it is not sensible otherwise
There are still problems:
UPDATE Parts SET Quantity = Quantity + 5
SQL Support
SQL allows both NULLs and defaults:
A table to hold data on employees
All employees have a name
All employees have a salary (default 10000)
Some employees have phone numbers, if not we use NULLs
CREATE TABLE Employee
(
Name VARCHAR(50)
NOT NULL,
Salary INT
DEFAULT 10000
NOT NULL,
Phone VARCHAR(15)
NULL
);
68
SQL Support
SQL allows you to insert NULLs
INSERT INTO Employee
VALUES (‘John’,
12000, NULL);
UPDATE Employee
SET Phone = NULL
WHERE Name = ‘Mark’;
You can also check for NULLs
SELECT Name FROM
Employee WHERE
Phone IS NULL;
SELECT Name FROM
Employee WHERE
Phone IS NOT NULL;
SQL: The Final Example
The Final Example
Examiners’ reports
We want a list of students and their average mark
For first and second years the average is for that year
For finalists (third year) it is 40% of the second year plus 60% of the final year averages
We want the results:
Sorted by year (desc), then by average mark (high to low) then by last name, first name and finally ID
To take into account of the number of credits each module is worth
Produced by a single query
Example Output
Getting Started
Finalists should be treated differently to other years
Write one SELECT for the finalists
Write a second SELECT for the first and second years
Merge the results using a UNION
QUERY FOR FINALISTS
UNION
QUERY FOR OTHERS
Table Joins
Both subqueries need information from all the tables
The student ID, name and year
The marks for each module and the year taken
The number of credits for each module
This is a natural join operation
But because we’re practicing, we’re going to use a standard CROSS JOIN and WHERE clause
Exercise: repeat the query using natural join
The Query so Far
SELECT some-information
FROM Student, Module, Grade
WHERE Student.ID = Grade.ID
AND Module.Code = Grade.Code
AND student-is-in-third-year
UNION
SELECT some-information
FROM Student, Module, Grade
WHERE Student.ID = Grade.ID
AND Module.Code = Grade.Code
AND student-is-in-first-or-second-year;
Information for Finalists
We must retrieve
Computed average mark,weighted 40-60 across years 2 and 3
First year marks must be ignored
The ID, Name and Year are needed as they are used for ordering
The average is difficult
We don’t have any statements to separate years 2 and 3 easily
We can exploit the fact that 40 = 20 * 2 and 60 = 20 * 3, so YearTaken and the weighting have the same relationship
The Query so Far
SELECT some-information
FROM Student, Module, Grade
WHERE Student.ID = Grade.ID
AND Module.Code = Grade.Code
AND student-is-in-third-year
UNION
…
Information for Finalists
SELECT Year, Student.ID, Last, First,
SUM(((20*YearTaken)/100)*Mark*Credits)/120
AS AverageMark
FROM
Student, Module, Grade
WHERE
Student.ID = Grade.ID
AND
Module.Code = Grade.Code
AND
YearTaken IN (2,3)
AND
Year = 3
GROUP BY
Year, Student.ID, First, Last
Information for Others
Other students are easier than finalists
We just need their average marks where YearTaken and Year are the same
As before, we need ID, Name and Year for ordering
…
UNION
SELECT some-information
FROM Student, Module, Grade
WHERE Student.ID = Grade.ID
AND Module.Code = Grade.Code
AND student-is-in-first-or-second-year;
Information for Finalists
SELECT Year, Student.ID, Last, First,
SUM(Mark*Credits)/120 AS AverageMark
FROM
Student, Module, Grade
WHERE
Student.ID = Grade.ID
AND
Module.Code = Grade.Code
AND
YearTaken = Year
AND
Year IN (1,2)
GROUP BY
Year, Student.ID, First, Last
SELECT Year, Student.ID, Last, First,
SUM(((20*YearTaken)/100)*Mark*Credits)/120
AS AverageMark
FROM Student, Module, Grade
WHERE Student.ID = Grade.ID
AND Module.Code = Grade.Code
AND YearTaken IN (2,3)
AND Year = 3
GROUP BY Year, Student.ID, Last, First
UNION
SELECT Year, Student.ID, Last, First,
SUM(Mark*Credits)/120 AS AverageMark
FROM Student, Module, Grade
WHERE Student.ID = Grade.ID
AND Module.Code = Grade.Code
AND YearTaken = Year
AND Year IN (1,2)
GROUP BY Year, Student.ID, Last, First
ORDER BY
Year desc, AverageMark desc, Last, First, ID;
Questions?