Database: Normalisation
Database: Normalisation
Jianjun Chen
What is Normalisation?
“Like E/R diagram. Normalisation is another way of creating accurate representations of the data, relationships between the data, and constraints on the data that is pertinent to the enterprise.”
“It is a design technique for producing a set of suitable relations that support the data requirements of an enterprise.”
“Suitable Relations”
Characteristics of a suitable set of relations include:
The minimal number of attributes necessary to support the data requirements of the enterprise;
Attributes with a close logical relationship are found in the same relation;
Minimal redundancy with each attribute represented only once with the important exception of attributes that form all or part of foreign keys.
From textbook
3
Minimal Redundancy?
Given the example from E/R diagram, department and lecturer can be designed as:
Department
Employs
Lecturer
dName lecName
CS Matt
CS Andrew
EEE Thomas
EEE John
lecName
Matt
Andrew
Thomas
John
Department
Lecturer
dName
CS
EEE
Department
Lecturer
lecName dName
Matt CS
Andrew CS
Thomas EEE
John EEE
The design of Department table on the left has the following problems:
dName is not primary key or unique key. If any table wants to refer to dName (foreign key), then we have to include such information in the department table.
For example, student table wants to indicate what department a student belongs to.
At the same time, department table also needs to include the building names assigned to departments.
If you natural join department and student together, you will see duplicated dnames for each student, which is not something desirable.
4
Minimal Redundancy?
However, the department table in the left version has duplications of dName in the department table.
Also, the number of rows is the same as Lecturer’s
Why not combining the table with Lecturer table then?
If we combine them, we will have the same problems as the staffbranch table (see next slide).
dName lecName
CS Matt
CS Andrew
EEE Thomas
EEE John
lecName
Matt
Andrew
Thomas
John
Department
Lecturer
dName
CS
EEE
Department
Lecturer
lecName dName
Matt CS
Andrew CS
Thomas EEE
John EEE
Attributes lecName and dName has a M:1 relationship.
Thus department and lecturer on the left is a 1:1 relationship
5
Data Redundancy
Data Redundancy
StaffBranch relation has redundant data; the details of a branch are repeated for every member of staff.
In contrast, the branch information appears only once for each branch in the Branch relation and only the branch number (branchNo) is repeated in the Staff relation, to represent where each member of staff is located.
Data Redundancy and Update Anomalies
These examples illustrate the problems of data redundancy. Let’s take StaffBranch for example:
Insert anomalies: Must input the correct branch information every time a new staff is added.
Delete anomalies: Deleting the last staff of a branch also deletes the information of that branch.
Modification anomalies: To change one branch information, all staffs (rows) in that branch must also be updated.
How to Split Tables?
As a result, StaffBranch is better split into staff and branch.
And we know the answer beforehand.
What information can help us to redesign tables when we don’t know the answer?
How about working out the E/R model in the reversed way?
An entity is associated with its attributes.
M:1 relationships will become foreign keys.
We will be talking about this until slide 26
9
How to Identify Entities?
“An entity is associated with its attributes.”
That means all attributes of an entity must fit into one context.
For example, a lecturer table can have:
LecID
LName
LEmail
Attributes like these won’t fit:
StudentID
SName
How to Identify Entities?
Apart from the attribute names, what else is critical when we tries to classify these attributes?
A lecturer ID is associated with a lecturer name and his email address.
Does lecturer ID has the same relationship with student name? No!
There’s a term for this: functional dependency
LecturerStudent{LecID, LName, LEmail, StudentID, SName}
Functional Dependency
A method for finding relationships between attributes within a table
Functional Dependencies
We want to find relationship between attributes within a table so that we can regroup attributes based on their context and split the big table.
Introducing functional dependency (FD):
“If A and B are attribute sets of relation R, B is functionally dependent on A (denoted A → B), if each value of A in R is associated with exactly one value of B in R.”
A is called determinant.
Functional Dependencies
For the attributes below:
LecID → LName, LEmail
Each LecID of a lecturer is associated with exactly one lecturer name and his email address.
StudentID → SName
Each studentID of a student is associated with exactly one student name.
LecturerStudent{LecID, LName, LEmail, StudentID, SName}
Functional Dependencies
Why not LName → LecID ?
Two staffs may have the same name. Thus a name will probably be associated with two IDs.
Why not LecID → SName?
They don’t seem to have connections at all.
How about LEmail → LecID, LName ?
This is a valid FD. One staff only has one assigned email address.
It looks strange though.
LecturerStudent{LecID, LName, LEmail, StudentID, SName}
Functional Dependencies
Observe these functional dependencies carefully, you will realise that:
If these attributes are put together, they can form a relation, with the determinant being the unique key or primary key of the relation
For example:
LecID → LName, LEmail
StudentID → SName
LecturerStudent{LecID, LName, LEmail, StudentID, SName}
Functional Dependencies
However, we have a big problem here, the FD below is also true:
LecID, LName, LEmail → LName, LEmail
Is (LecID, LName, LEmail) a good primary key?
Recall the definition of super key, candidate key and primary key.
Full Functional Dependency
Full functional dependency indicates that if A and B are two sets of attributes of a relation, B is fully functionally dependent on A, if B is functionally dependent on A, but not on any proper subset of A.
In other words, determinants should have the minimal number of attributes necessary to maintain the functional dependency with the attribute(s) on the right hand-side.
Partial Functional Dependency
Partial FDs:
A FD, A → B, is a partial FD, if some attribute of A can be removed and the FD still holds
Formally, there is some proper subset of A, C ⊂ A, such that C → B
Examples
Full functional dependency in the Staff relation:
staffNo → branchNo
How about:
staffNo, sName → branchNo?
This is a partial dependency since branchNo is also functionally dependent on a subset of (staffNo, sName), namely staffNo.
Yes, partial FD has more determinants than Full FD
20
Determinant in Full/Partial FDs
Determinants in full functional dependencies:
Can become candidate keys if we split the table
Determinants in partial functional dependencies:
Can only become super keys.
FDs in Normalisation
We only care about full FDs in normalisation. Why?
Partial FDs results in super keys, if you use a foreign key to refer to a super key in another table, you will need extra columns in the referencing table.
branchNo VS (branchNo, bAddress) inside Staff table.
Unnecessary!
staffNo branchNo
staffNo branchNo bAddress
branchNo bAddress
branchNo bAddress
22
More about Determinants
If you observe the tuples in the Staff and Branch table, you can find out that the determinant has a M:1 or 1:1 relationship with other attributes in FDs.
Two staffs may have the same name.
Thus, one staff name can be associated with more than one staff number. (M:1)
Two staffs may have the same position.
Thus, one position may be associated with more than one staff number (M:1)
More about Determinants
“The determinant has a M:1 or 1:1 relationship with other attributes in FDs”
Two staffs may work in the same branch.
Thus, one branchNo may be associated with more than one staffNo. (M:1)
A branch address is associated with exactly one branchNo
1:1 relationship
More about Determinants
“The determinant has a M:1 or 1:1 relationship with other attributes in FDs”
Remember in ER Modelling, M:1 relationships between tables will become foreign keys.
We can infer that, For attributes that have a M:1 relationships, if they belong to the same context, they will be grouped into one relation, otherwise, they will be split into two tables and linked with a foreign key.
“If they belong to the same context, they will be grouped into one relation, otherwise, they will be split into two tables and linked with a foreign key”
Branch
Staff
in
sName
StaffNo
branchNo
1:m
1:m
m:1
bAddress
branchNo
1:1
That’s why these databases are called
Relational databases:
They are really about relationships!
Full FDs in StaffBranch
staffNo → sName, position, salary, branchNo, bAddress
branchNo → bAddress
Observe this dependency chain carefully:
staffNo → branchNo → bAddress
This looks like a relationship between staff and branch.
Transitive Dependency
The previous example is a transitive dependency.
“Transitive dependency describes a condition where A, B, and C are attributes of a relation such that if A → B and B → C, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C).”
Why is the underlined constraint needed? (Explained later)
Normal Forms
How can we use FDs and TDs to redesign tables
The Process of Normalisation
Steps of redesigning tables to make them “suitable”
Normalisation: Some Notes
In relational database, we always try to assign tables with primary keys (or at least candidate keys).
Why? Because relational databases are about relations.
Attributes within one context forms an entity (table).
Primary keys make referencing possible.
Attributes connecting entities become foreign keys.
Foreign keys form these connections.
Can I insist that no unique keys are used?
Then its beyond the topic of this module 🙂
Remember, we are learning one of the possible design decisions of managing data, don’t limit your imagination.
First Normal Form
In most definitions of the relational model
All data values should be atomic
This means that table entries should be single values, not sets or composite objects
A relation is said to be in first normal form (1NF):
if all data values are atomic
Unnormalised Form (UNF)
Primary key is Module
Problems With UNF
Look at “Texts” attribute:
To update the textbook name “T1” to something else, you need to manually update all “T1”s.
To delete T1 from the pool of textbooks, you need to manually do so, too.
You cannot add a textbook without giving the information of Module (because it’s a Primary Key).
Method 1:
Remove the repeating group by entering appropriate data into the empty columns of rows containing the repeating data (also called ‘flattening’ the table).
Assign a new primary/unique key to the new table.
Method 2:
Identify primary/unique key.
Place the repeating data along with a copy of the original (unique) key attribute(s) into a separate relation.
Module Texts
M1 T1
M1 T2
M2 T1
M2 T3
… …
1NFa
Module Dept Lecturer
M1 D1 L1
M2 D1 L1
M3 D1 L2
… … …
1NFb
Problems in 1NF
Look at the Primary key (Module, Text)
Changing module code from “M1” to something else requires you to check the whole table.
Adding a new lecturer with no modules and text is impossible
If a (department, lecturer) pair is modified, the change must be made to all (Module, Text) pairs.
For example, to change (D1, L1) to some other value, you must manually change the first four rows.
If (M3, T4) is deleted, L2 will be permanently lost!
Second Normal Form
Definition of second normal form 2NF:
A relation is in second normal form (2NF) if it is in 1NF and
no non-key attribute is partially dependent on the primary key
In other words, no C → B where C is a strict subset of a primary key and B is a non-key attribute.
Second Normal Form: Example
R is in 1NF but not in 2NF
We have the FD:
{Module, Text} →{Lecturer, Dept}
But also
{Module} → {Lecturer, Dept}
And so Lecturer and Dept are partially dependent on the primary key
1NF to 2NF
Identify the primary key(s) for the 1NF relation.
Identify the functional dependencies in the relation.
If partial dependencies exist on the primary key, remove them by placing attributes of the corresponding full FD in a new relation and leave the its determinant in the original table.
This mistake exists since 2006
40
1NF to 2NF
{Module, Text} → {Lecturer, Dept}
But also
{Module} → {Lecturer, Dept}
Problems in 2NF
Look at the table 2NFa
We cannot add a lecturer who is not assigned with any module. Because module is the primary key.
This is a theoretical discussion. Similar issues can be a real problem in some other tables.
By deleting M3, we lose L2 forever.
To change the department for L1, we need to change multiple rows manually
Third Normal Form (3NF)
Based on the concept of transitive dependency.
A relation that is in 1NF and 2NF and in which no non-key attribute is transitively dependent on the primary key.
2NF to 3NF
Identify the primary key in the 2NF relation.
Identify functional dependencies in the relation.
If transitive dependencies exist on the primary key, remove them by placing them in a new relation along with a copy of their determinant.
2NF to 3NF – Example
{Module} → {Lecturer}
{Lecturer} → {Dept}
So there is a transitive FD from the primary key {Module} to {Dept}
Problems Resolved in 3NF
Problems in 2NF
INSERT: Can’t add lecturers who teach no modules
UPDATE: To change the department for L1 we must alter two rows
DELETE: If we delete M3 we delete L2 as well
In 3NF all of these are resolved (for this relation – but 3NF can still have anomalies!)
46
Transitive Dependency and FK
Let’s go back and check the definition of the transitive dependency:
“Transitive dependency describes a condition where A, B, and C are attributes of a relation such that if A → B and B → C, then C is transitively dependent on A via B. Provided that A is not functionally dependent on B or C”
Translation:
“Provided that B → A or C → A is not true”
Transitive Dependency
Example:
staffNo (A) → branchNo (B) → bAddress (C)
This is transitive dependency.
LecID (A) → LEmail (B) → LName (C)
Not a transitive dependency. Because LEmail is functionally dependent on LecID.
Why the second example should not be considered as a transitive dependency (should not be split into two tables)?
Transitive Dependency and FK
LEmail (B) → LecID (A)
For each Email address, you can find one LecID.
Below are valid values of email and id:
→ 001
→ 001
But LecID is a candidate key, it is not allowed to have duplicates.
Thus, for each Email address, you can find a unique LecID.
Thus LEmail and LecID has a 1:1 relationship.
Transitive Dependency and FK
After splitting them:
Table1: LecID, LEmail
Table2: LEmail, LName
Table1 and Table2 has 1:1 relationship, with LEmail being the foreign key.
You have learned E/R modelling, this is not worth splitting.
Should talk about these AFTER finishing normalisation
50
Transitive Dependency and FK
LecID (A) → LEmail (B) → LName (C)
The confusion comes from the name “transitive”.
The above dependency chain is indeed “transitive”.
But it is not the type of “transitive” we are looking for.
The type of transitive dependencies we are looking for should support relationships, or foreign keys.
Normalisation: Practice
Normalisation Example
We have a table representing orders in an online store, Each entry in the table represents an item on a particular order.
{order, product, customer, address, quantity, unitPrice}
For example: (001, Laptop, Jianjun, SEB435 UNNC, 1, $500)
Primary key is {order, product}
No other candidate key
Task: Normalise it to 3NF.
Functional Dependencies
Each order is for a single customer and each customer has a single address.
FD1: {order} →{customer, address}
FD2: see FD 4 below
Each product has a single price
FD3: {product} →{unitPrice}
Each order transitively determines address via customer.
FD4: {order} → {customer} → {address}
Functional Dependencies
Normalisation to 2NF
2NF: no partial dependencies on candidate keys
{order} → {customer, address}
{product} → {unitPrice}
To remove the first FD we move {customer, address} to another relation, along with a copy of its determinant, order.
R1: {order, customer, address}
With remaining relation
R2: {order, product, quantity, unitPrice}
Normalisation to 2NF
Current Tables
R1: {order, customer, address} 2NF
R2: {order, product, quantity, unitPrice}
Next:
There is a partial FD in R2: {Product} → {UnitPrice}
To remove this we move it to another relation R3
R3: {Product, unitPrice}
and with remaining relation
R4: {order, product, quantity}
Normalisation to 3NF
Current Tables:
R1: {order, customer, address}
R3: {Product, unitPrice} 3NF
R4: {order, product, quantity} 3NF
Next:
R1 has a transitive FD on its key
To remove {order} → {customer} → {Address}
we decompose R1 over
R5: {order, customer}
R6: {customer, address}
58
Normalisation
1NF:
{order, product, customer, address, quantity, unitPrice}
2NF:
R1: {order, customer, address} 2NF
R3: {product, unitPrice} 3NF
R4: {order, product, quantity} 3NF
3NF:
R3: {product, unitPrice} 3NF
R4: {order, product, quantity}, 3NF
R5: {order, customer} 3NF
R6: {customer, address} 3NF