File Structures
FUNCTIONAL DEPENDENCY &
NORMALISATION THEORY
Database Theory & Applications (M)
Dr Chris Anagnostopoulos
ROADMAP
When a relational schema is good or bad…
…judging the efficiency of a schema.
Functional Dependency Theory for quantifying the degree of
goodness.
Normalization Theory for transforming a relational schema
into a set of good and efficient relations.
Transforming any relation to the Boyce-Codd Normal Form
(BCNF) dealing with fictitious tuples.
Fundamental Theorem in BCNF
2
RELATIONAL MODELING REVISIT
Challenge: develop a theory and list guidelines to assess a relational model in
terms of (performance metrics):
goodness, i.e., whether the attributes (e.g., PK, FK) form a good relation;
quantify the degree of goodness;
what pitfalls exist;
efficiency in insert/delete/update operations;
convey as much information as possible minimizing redundancy (minimize
repetition of data)…
3
GUIDELINES FOR A GOOD DESIGN
Guideline 1: The attributes of a relation should make sense
Attributes of different entities, e.g., students, employees, courses,
should not be in the same relation.
Objective: minimize the similarity between relations!
Any relationship between relations should be represented only
through Foreign Keys & Primary Keys
e.g., an employee works on a specific project for 2 hours:
WORKS_ON(SSN (FK), PNO(FK), Hours).
4
Employee Project
GUIDELINES FOR A GOOD DESIGN
Guideline 2: Avoid redundant tuples (repetition of the same information)
Impact of repetition:
storage cost: replication of tuples wastes storage.
inconsistency cost & operation anomalies:
replicas must be kept consistent (more checks) during insertion,
deletion, update of tuples
replicas result to consistency anomalies…
5
EXAMPLE: REDUNDANCY & ANOMALY
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)
Context: Employee is working to a project for specific hours/week.
6
EXAMPLE: UPDATE ANOMALY
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)
Context: EMP_PROJ contains 600 employees working on project
Pnumber = 2 with Pname = ‘ProductY’.
Update anomaly:
If the name of Project 2 changes from ‘ProductY’ to
‘ProductZ’ then we need to enforce consistency:
We should update all 600 tuples referring to Project 2,
Otherwise, the relation is inconsistent, since some employees
would appear to work for a non-existent project;
7
EXAMPLE: DELETE ANOMALY
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)
Context: When Project 2 is deleted, all the employees who work on
that project will be deleted as well!
Delete all tuples with Pnumber = 2 to ensure consistency…
8
GUIDELINES FOR A GOOD DESIGN
Guideline 3: Relations should have as a few NULL values as possible:
Reasons for NULL
a value is not applicable or invalid
a value is unknown (may exist; who knows?)
a value is known to exist, but unavailable
Statistics: Attributes that are frequently NULL should be placed in
separate relations to avoid wasting storage & reducing uncertainty!
9
GUIDELINES FOR A GOOD DESIGN
Fact: if only 10 out of 1600 employees have three contact numbers, then the
attribute Phone#3 contains 99.37% NULL values…(1590 nulls).
Solution: Remove #Phone3 and define a new relation:
3rdPhone(SSN, Phone#3) Employee(SSN, Name, …, Phone#1, Phone#2)
10
SSN Name … Phone#1 Phone#2 Phone#3
NULL
…
NULL
Employee
GUIDELINES FOR A GOOD DESIGN
Guideline 4: Design relations to avoid fictitious tuples after join
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)
This is a bad design w.r.t. delete/update anomalies.
Idea: break relation into two smaller (sub)relations that share a common
attribute (E. Codd’s intuition):
R1(SSN, Ename, Pnumber, Pname, Plocation)
R2(Hours, Plocation)
Query: Show the total working hours/week for each employee
Key insight: to recover all information, join R1 and R2 on the common
attribute Plocation.
Fact: this creates tuples which do not exist in EMP_PROJ (fictitious
tuples)!
11
[A1] WORKED EXAMPLE
R(SSN, Pname, Plocation, Hours) breaks into two w.r.t. Plocation
Q(SSN, Pname, Plocation)
P(Hours, Plocation)
R Instances:
r1=(111, ‘PR1’, Glasgow, 20)
r2=(222, ‘PR1’, Glasgow, 10)
r3=(333, ‘PR2’, Edinburg, 23)
Q Instances:
q1=(111, ‘PR1’, Glasgow)
q2=(222, ‘PR1’, Glasgow)
q3=(333, ‘PR2’, Edinburgh)
P Instances:
p1=(20, Glasgow)
p2=(10, Glasgow)
p3=(23, Edinburgh)
12
Join Algorithm:
For each tuple q ∈ Q
Join with each tuple p ∈ P
iff q.Plocation=p.Plocation
Result: r = (p, q)
End For
Example:
q1=(111, ‘PR1’, Glasgow) matches p1=(20, Glasgow)
Result: (p1, q1) = (111, ‘PR1’, Glasgow, 20)
[A1] WORKED EXAMPLE
R Instances:
r1=(111, ‘PR1’, Glasgow, 20)
r2=(222, ‘PR1’, Glasgow, 10)
r3=(333, ‘PR2’, Edinburg, 23)
Q Instances:
q1=(111, ‘PR1’, Glasgow)
q2=(222, ‘PR1’, Glasgow)
q3=(333, ‘PR2’, Edinburgh)
P Instances:
p1=(20, Glasgow)
p2=(10, Glasgow)
p3=(23, Edinburgh)
13
q1 matches p1
Result: (111, ‘PR1’, Glasgow, 20)
q1 matches p2
Result: (111, ‘PR1’, Glasgow, 10)
q2 matches p1
Result: (222, ‘PR1’, Glasgow, 20)
q2 matches p2
Result: (222, ‘PR1’, Glasgow, 10)
q3 matches p3
Result: (333, ‘PR2’, Edinburgh, 23)
Generate 2 fictitious tuples in our attempt to avoid anomalies ☺
Lesson Learnt: Split and join relations is challenging…find the best splitting
attribute that does not generate fictitious tuples…a Theory is imperative!
THEORY OF FUNCTIONAL DEPENDENCY
Functional Dependency (FD) is a formal metric of the degree of
goodness of relational schema:
➢ FD is a constraint derived from the relationship between attributes
“Given a relation, an attribute X functionally determines an attribute
Y, if a value of X determines a unique value for Y.”, Codd, 1970
In other words: give me a value of X and I’ll tell you which is the value
of Y in a specific tuple! (X determines Y)
14
THEORY OF FUNCTIONAL DEPENDENCY
FD: X → Y (X uniquely determines Y) holds if whenever two tuples
have the same value for X, they must have the same value for Y
Definition: for any two tuples t1 and t2:
If t1[X] = t2[X] then t1[Y] = t2[Y]
X → Y in R specifies a constraint on all instances, i.e., principle.
15
[A2] WORKED EXAMPLE
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)
FD1: the social security number (SSN) determines the employee name:
SSN → Ename
FD2: the project number determines the project name and project location
Pnumber → {Pname, Plocation}
FD3: SSN and project number determine the hours per week:
{SSN, Pnumber} → Hours
16
SSN Pnumber Hours Ename Pname Plocation
1 5 10 Chris PX G12
2 5 30 Stella PX G12
1 7 15 Chris PY G45
FUNCTIONAL DEPENDENCY PRINCIPLES
Lemma 1: If K is a Candidate Key, then K functionally determines
all attributes in relation R, i.e.,
FD: K → {R}
Lemma 2: William W. Armstrong’s inference rules for FDs (1974):
(Reflexive) If Y ⊆ X then X → Y
X = {SSN, Ename}; X → {SSN}, X → {Ename}
(Augmentation) If X → Y then X U {Z} → Y U {Z}
(Transitive) If X → Y and Y → Z then X → Z
17
[A3] WORKED EXAMPLE
Consider the relation: R(B, O, I, S, Q, D). It holds true that:
FD1: S → D
FD2: I → B
FD3: {I, S} → Q
FD4: B → O
Q: Which are the candidate keys for relation R?
{I, B}
{I, S}
{B, O}
{S, D}
18
[A3] WORKED EXAMPLE
In R(B, O, I, S, Q, D) holds true:
FD1: S → D
FD2: I → B
FD3: {I, S} → Q
FD4: B → O
….using transitivity:
{I, B}
{I, S}
{B, O}
{S, D}
19
1. {I,S} → Q //assertion
2. I → B → O then I → O //transitivity
3. Hence, I → {B, O}
4. S→D //assertion
5. Hence, {I,S} → {R}, i.e., candidate key //reasoning
1. I → B //assertion
2. I → B → O then I → O //transitivity
3. Hence, I → {B, O} and {I, B} → B → O
4. D , Q, and S are not dependent on {I, B}
5. Hence, {I, B} is not a candidate key…
FUNCTIONAL DEPENDENCY & NORMALIZATION
Idea: exploit the FDs to specify which attributes can become PKs and FKs!
Algorithm:
1. Assert which are the FDs among attributes (no other semantics required)
2. Take a pool and put into all the asserted FDs.
3. Create a universal big relation with all attributes
4. Recursively decompose the big relation based on the FDs into many
smaller ones, such that, when we re-join them, it guarantees that no
information is lost and re-constructs the original big relation without
fictitious tuples.
This is the normalization process.
Challenge: find, via progressive decomposition, the basic relations, which can
reconstruct the entire information efficiently avoiding redundancy and avoiding
fictitious tuples after their composition.
20
THEORY OF NORMALIZATION
Progressive decomposition of unsatisfactory (bad) relations by breaking
up their attributes into smaller good relations;
The degree of decomposition is referred to as Normal Form (NF).
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
Sixth Normal Form (6NF)
…
21
SOME DEFINITIONS
A prime attribute is an attribute that belongs to some candidate key
of the relation
e.g., SSN and Pnumber are prime attributes.
A non-prime attribute is not a prime attribute, i.e., it is not a
member of any candidate key
e.g., Hours, Ename, Pname, Plocation are non-prime attributes
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)
22
FIRST NORMAL FORM (1NF)
The domain Di of each attribute Ai in a relation R refers only to
atomic (simple/indivisible) values.
1NF disallows:
nested attributes,
multivalued attributes
Note: a relation in 1NF is expected to have redundant & repeated
values…
23
1NF
24
1NF
25
SECOND NORMAL FORM (2NF)
verbosity in the primary key X
Definition: full FD X → Y means that if we remove any prime attribute A
from the primary key X, then:
X \ {A} → Y
i.e., X \ {A} does not functionally determine Y anymore.
Example:
{SSN, Pnumber} → Hours is a full FD, since neither SSN → Hours nor
Pnumber → Hours holds true.
{SSN, Pnumber} → Ename is not a full FD (partial dependency), since
SSN → Ename holds true;
i.e., Pnumber does not need to be part of the primary key; it is verbose…
26
SECOND NORMAL FORM (2NF)
Definition: A relation R is in 2NF if every non-prime attribute A in R is
fully functionally dependent on the primary key of R.
Target from 1NF to 2NF: remove all prime attributes from the primary
key, which cause partial dependencies.
Methodology:
1. Identify all the partial FDs in the original relation (already in 1NF)
2. For each partial FD, create a new relation such that all non-prime
attributes in there are fully functionally dependent on the new
primary key:
i.e., the prime attribute in the original relation causing partial FDs.
3. The new relation(s) will be in 2NF.
27
Primary key is: {SSN, Pnumber}
• FD1: full FD
• FD2: Ename (non-prime)
depends only on SSN and not on
Pnumber; partial FD
• FD3: Pname and Plocation (non-
prime) depend only on Pnumber
and not on SSN; partial FD
Relation EP1:
• FD1: full FD
Relation EP2:
• FD2: Ename (non-prime) fully
depends only on SSN
Relation EP3:
• FD3: Pname and Plocation (non-
prime) fully depend only on
Pnumber
2NF
28
PROBLEM IDENTIFICATION
EMP_DEPT(SSN, Ename, Bdate, Address, Dnumber, Dname, Dmgr_Ssn)
29
SSN Ename Bdate Address Dnumber Dname Dmgr_ssn
1 Chris 1970 A1 3 SoCS 12
2 Stella 1988 A2 3 SoCS 12
3 Philip 2001 A3 3 SoCS 12
4 John 1966 A4 3 SoCS 12
5 Chris 1955 A5 3 SoCS 12
6 Anna 1999 A6 4 Maths 44
7 Thalia 2006 A7 4 Maths 44
1NF?
2NF?
THIRD NORMAL FORM (3NF)
Definition: transitive FD means that given a Primary Key X and non-
prime attributes Z and Y such that: X → Z and Z → Y, then the non-prime Y
is transitively dependent on the primary key X via the non-prime Z.
X → Z and Z → Y then X → Y.
[CourseID, Lecturer, School]
FD1: CourseID → Lecturer
FD2: Lecturer → School
FD3: CourseID → School (via the non-prime Lecturer)
Definition: A relation R is in 3NF (being already in 2NF) if there is no non-
prime attribute which is transitively dependent on the primary key;
That is, all non-prime attributes should be directly dependent on the
primary key.
30
X Z Y
THIRD NORMAL FORM (3NF)
EMP_DEPT(SSN, Ename, Bdate, Address, Dnumber, Dname, Dmgr_Ssn)
FD1: SSN → Dmgr_Ssn is transitive
SSN → Dnumber and Dnumber → Dmgr_Ssn
Dnumber is non-prime transitive attribute
FD2: SSN → Ename is non-transitive
there is no attribute Z, where SSN → Z and Z → Ename.
FD3: SSN → Dname is transitive FD or not?
FD4: SSN → Address is transitive FD or not?
SSN Ename Bdate Address Dnumber Dname Dmgr_Ssn
EMP_DEPT
31
THIRD NORMAL FORM (3NF)
Methodology: Split the original relation into two relations: the non-prime transitive
attribute (Dnumber)
is the PK to the new relation
is the FK in the original relation referencing to the new relation.
SSN Ename Bdate Address Dnumber Dname Dmgr_Ssn
EMP_DEPT
32
THIRD NORMAL FORM (3NF)
SSN Ename Bdate Address Dnumber Dname Dmgr_Ssn
EMP_DEPT
SSN Ename Bdate Address Dnumber
ED1
Dnumber Dname Dmgr_Ssn
ED2
33
A GOOD SCHEMA NOW…
34
Dnumber Dname Dmgr_ssn
3 SoCS 12
4 Maths 44
1NF
2NF
3NF
SSN Ename Bdate Address Dnumber
1 Chris 1970 A1 3
2 Stella 1988 A2 3
3 Philip 2001 A3 3
4 John 1966 A4 3
5 Chris 1955 A5 3
6 Anna 1999 A6 4
7 Thalia 2006 A7 4
EMPLOYEE DEPARTMENT
GENERALIZED THIRD NORMAL FORM (3NF)
X → Z and Z → Y, with X as the primary key, we consider this as a problem if
only if Z is non-prime attribute.
Violation of 3NF: when a non-prime attribute (Z) determines another non-
prime attribute (Y) and Y is transitively dependent on the prime key (X).
Case: If Z is a candidate key, there is no problem in 3NF!
EMPLOYEE(SSN, PassportNo, Salary)
FD1: SSN → PassportNo
FD2: PassportNo → Salary,
FD3: SSN → Salary is not a transitive FD since PassportNo is a prime
attribute (candidate key).
Generalized 3NF: Every non-prime attribute A in relation R:
is fully functionally dependent on every candidate key in R.
is non-transitively dependent on every candidate key in R.
Ok, now, what is happening with the prime attributes? BCNF…
35
BOYCE-CODD NORMAL FORM (BCNF)
Idea: remove all inherent dependencies: any attribute should be
functionally dependent only on the Primary Key.
A relation is in BCNF iff whenever there exists a FD: X → A then X
is a PK, i.e., the left-hand side should be a PK.
A B C
R
Primary key: {A,B}
Relation is in 3NF: there is no transitivity of a non-prime attribute to the key, thus,
in 3NF.
FD1: {A,B} → C
FD2: C → B; the non-prime attribute C determines B, however, C is not a PK.
FD1
FD2
36
Primary key: {Student, Course}
FD1: {Student, Course} → Instructor
FD2: Instructor → Course /*each instructor teaches only one course*/
Is it in 3NF?
• Yes, there is no transitive dependency of a non-prime attribute on the PK
Is it in BCNF?
• No, in FD2 the left-hand side of the dependency (Instructor) is not a PK.
Challenge: How to decompose the relation? 37
DECOMPOSITION IN BCNF
Possibilities for decomposing relation TEACH
{student, instructor} and {student, course}
{course, instructor } and {course, student}
{instructor, course } and {instructor, student}
…
Objective:
Can we reconstruct TEACH after joining?
Which is the best decomposition to avoid fictitious tuples?
Which is the splitting attribute in 3NF relations?
Answer?
38
DECOMPOSITION IN BCNF
Examine & Join: {course, instructor } and {course, student}
39
R1
R2
student course instructor
Narayan Database Mark
Smith Database Mark
Narayan Database Navathe
Smith Database Navathe
Smith Operating System Ammar
n=256; k=2, then check 32640 split relations!
Combinations n choose k; n! = 1.2.3…n
BCNF DECOMPOSITION THEOREM
Theorem 1: Let relation R not in BCNF and let X → A be the FD which
causes a violation in BCNF.
Then, the relation R should be decomposed into two relations:
R1 with attributes: R\{A} (all attributes in R apart from A)
R2 with attributes: {X} U {A} (put together X and A)
If either R1 or R2 is not in BCNF, repeat the process.
Proof: see [*].
Experiment: http://www.ict.griffith.edu.au/~jw/normalization/ind.php
[*] C. Zaniolo, ‘A New Normal Form for the Design of Relational Database
Schemata’, ACM Transactions on Database Systems 7(3):493, Sept. 1982
40
THEOREM APPLICATION
FD1: {Student, Course} → Instructor
FD2: Instructor → Course (violation)
X → A refers to Instructor → Course, which violates the BCNF.
Hence, X is Instructor and A is Course.
Given R is TEACH(Student, Instructor, Course).
Split R into two relations:
R1 with attributes: R\{A} = {Student, Course, Instructor} \ {Course} =
{Student, Instructor}
R1(Student, Instructor)
R2 with attributes: {X} U {A} = {Instructor} U {Course} = {Instructor,
Course}
R2(Instructor, Course)
Note: If you join R1 and R2 w.r.t. Instructor common attribute then we
obtain the original TEACH relation with no fictitious tuples!
41
OPTIMAL DECOMPOSITION IN BCNF
Check: {student, instructor} and {instructor, course}
42
R2
student course instructor
Narayan Database Mark
Smith Database Navathe
Smith Operating System Ammar
R1