CS计算机代考程序代写 database flex AI algorithm File Structures

File Structures

FUNCTIONAL DEPENDENCY &

NORMALISATION THEORY

Database Theory & Applications (M)

Dr Chris Anagnostopoulos

ROADMAP

 When a relational schema is good or bad…

 …judging the efficiency of a schema.

 Functional Dependency Theory for quantifying the degree of

goodness.

 Normalization Theory for transforming a relational schema

into a set of good and efficient relations.

 Transforming any relation to the Boyce-Codd Normal Form

(BCNF) dealing with fictitious tuples.

 Fundamental Theorem in BCNF

2

RELATIONAL MODELING REVISIT

Challenge: develop a theory and list guidelines to assess a relational model in

terms of (performance metrics):

 goodness, i.e., whether the attributes (e.g., PK, FK) form a good relation;

 quantify the degree of goodness;

 what pitfalls exist;

 efficiency in insert/delete/update operations;

 convey as much information as possible minimizing redundancy (minimize

repetition of data)…

3

GUIDELINES FOR A GOOD DESIGN

Guideline 1: The attributes of a relation should make sense

 Attributes of different entities, e.g., students, employees, courses,

should not be in the same relation.

 Objective: minimize the similarity between relations!

 Any relationship between relations should be represented only

through Foreign Keys & Primary Keys

 e.g., an employee works on a specific project for 2 hours:

 WORKS_ON(SSN (FK), PNO(FK), Hours).

4
Employee Project

GUIDELINES FOR A GOOD DESIGN

Guideline 2: Avoid redundant tuples (repetition of the same information)

 Impact of repetition:

 storage cost: replication of tuples wastes storage.

 inconsistency cost & operation anomalies:

 replicas must be kept consistent (more checks) during insertion,

deletion, update of tuples

 replicas result to consistency anomalies…

5

EXAMPLE: REDUNDANCY & ANOMALY

EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)

Context: Employee is working to a project for specific hours/week.

6

EXAMPLE: UPDATE ANOMALY

EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)

 Context: EMP_PROJ contains 600 employees working on project

Pnumber = 2 with Pname = ‘ProductY’.

Update anomaly:

 If the name of Project 2 changes from ‘ProductY’ to

‘ProductZ’ then we need to enforce consistency:

 We should update all 600 tuples referring to Project 2,

 Otherwise, the relation is inconsistent, since some employees

would appear to work for a non-existent project;

7

EXAMPLE: DELETE ANOMALY

EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)

Context: When Project 2 is deleted, all the employees who work on

that project will be deleted as well!

 Delete all tuples with Pnumber = 2 to ensure consistency…

8

GUIDELINES FOR A GOOD DESIGN

Guideline 3: Relations should have as a few NULL values as possible:

 Reasons for NULL

 a value is not applicable or invalid

 a value is unknown (may exist; who knows?)

 a value is known to exist, but unavailable

 Statistics: Attributes that are frequently NULL should be placed in

separate relations to avoid wasting storage & reducing uncertainty!

9

GUIDELINES FOR A GOOD DESIGN

Fact: if only 10 out of 1600 employees have three contact numbers, then the

attribute Phone#3 contains 99.37% NULL values…(1590 nulls).

Solution: Remove #Phone3 and define a new relation:

3rdPhone(SSN, Phone#3) Employee(SSN, Name, …, Phone#1, Phone#2)

10

SSN Name … Phone#1 Phone#2 Phone#3

NULL

NULL

Employee

GUIDELINES FOR A GOOD DESIGN

Guideline 4: Design relations to avoid fictitious tuples after join

EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)

 This is a bad design w.r.t. delete/update anomalies.

Idea: break relation into two smaller (sub)relations that share a common

attribute (E. Codd’s intuition):

 R1(SSN, Ename, Pnumber, Pname, Plocation)

 R2(Hours, Plocation)

Query: Show the total working hours/week for each employee

Key insight: to recover all information, join R1 and R2 on the common

attribute Plocation.

Fact: this creates tuples which do not exist in EMP_PROJ (fictitious

tuples)!
11

[A1] WORKED EXAMPLE
 R(SSN, Pname, Plocation, Hours) breaks into two w.r.t. Plocation

 Q(SSN, Pname, Plocation)

 P(Hours, Plocation)

R Instances:

 r1=(111, ‘PR1’, Glasgow, 20)

 r2=(222, ‘PR1’, Glasgow, 10)

 r3=(333, ‘PR2’, Edinburg, 23)

Q Instances:

 q1=(111, ‘PR1’, Glasgow)

 q2=(222, ‘PR1’, Glasgow)

 q3=(333, ‘PR2’, Edinburgh)

P Instances:

 p1=(20, Glasgow)

 p2=(10, Glasgow)

 p3=(23, Edinburgh)
12

Join Algorithm:

For each tuple q ∈ Q
Join with each tuple p ∈ P

iff q.Plocation=p.Plocation

Result: r = (p, q)

End For

Example:

q1=(111, ‘PR1’, Glasgow) matches p1=(20, Glasgow)

Result: (p1, q1) = (111, ‘PR1’, Glasgow, 20)

[A1] WORKED EXAMPLE
R Instances:

 r1=(111, ‘PR1’, Glasgow, 20)

 r2=(222, ‘PR1’, Glasgow, 10)

 r3=(333, ‘PR2’, Edinburg, 23)

Q Instances:

 q1=(111, ‘PR1’, Glasgow)

 q2=(222, ‘PR1’, Glasgow)

 q3=(333, ‘PR2’, Edinburgh)

P Instances:

 p1=(20, Glasgow)

 p2=(10, Glasgow)

 p3=(23, Edinburgh)

13

q1 matches p1

Result: (111, ‘PR1’, Glasgow, 20)

q1 matches p2

Result: (111, ‘PR1’, Glasgow, 10)

q2 matches p1

Result: (222, ‘PR1’, Glasgow, 20)

q2 matches p2

Result: (222, ‘PR1’, Glasgow, 10)

q3 matches p3

Result: (333, ‘PR2’, Edinburgh, 23)

Generate 2 fictitious tuples  in our attempt to avoid anomalies ☺

Lesson Learnt: Split and join relations is challenging…find the best splitting

attribute that does not generate fictitious tuples…a Theory is imperative!

THEORY OF FUNCTIONAL DEPENDENCY

Functional Dependency (FD) is a formal metric of the degree of

goodness of relational schema:

➢ FD is a constraint derived from the relationship between attributes

“Given a relation, an attribute X functionally determines an attribute

Y, if a value of X determines a unique value for Y.”, Codd, 1970

In other words: give me a value of X and I’ll tell you which is the value

of Y in a specific tuple! (X determines Y)

14

THEORY OF FUNCTIONAL DEPENDENCY

 FD: X → Y (X uniquely determines Y) holds if whenever two tuples

have the same value for X, they must have the same value for Y

Definition: for any two tuples t1 and t2:

If t1[X] = t2[X] then t1[Y] = t2[Y]

X → Y in R specifies a constraint on all instances, i.e., principle.

15

[A2] WORKED EXAMPLE
EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)

FD1: the social security number (SSN) determines the employee name:

SSN → Ename

FD2: the project number determines the project name and project location

Pnumber → {Pname, Plocation}

FD3: SSN and project number determine the hours per week:

{SSN, Pnumber} → Hours
16

SSN Pnumber Hours Ename Pname Plocation

1 5 10 Chris PX G12

2 5 30 Stella PX G12

1 7 15 Chris PY G45

FUNCTIONAL DEPENDENCY PRINCIPLES

Lemma 1: If K is a Candidate Key, then K functionally determines

all attributes in relation R, i.e.,

FD: K → {R}

Lemma 2: William W. Armstrong’s inference rules for FDs (1974):

 (Reflexive) If Y ⊆ X then X → Y

 X = {SSN, Ename}; X → {SSN}, X → {Ename}

 (Augmentation) If X → Y then X U {Z} → Y U {Z}

 (Transitive) If X → Y and Y → Z then X → Z
17

[A3] WORKED EXAMPLE

Consider the relation: R(B, O, I, S, Q, D). It holds true that:

FD1: S → D

FD2: I → B

FD3: {I, S} → Q

FD4: B → O

Q: Which are the candidate keys for relation R?

 {I, B}

 {I, S}

 {B, O}

 {S, D}

18

[A3] WORKED EXAMPLE

In R(B, O, I, S, Q, D) holds true:

FD1: S → D

FD2: I → B

FD3: {I, S} → Q

FD4: B → O

….using transitivity:

 {I, B}

 {I, S}

 {B, O}

 {S, D}

19

1. {I,S} → Q //assertion

2. I → B → O then I → O //transitivity

3. Hence, I → {B, O}

4. S→D //assertion

5. Hence, {I,S} → {R}, i.e., candidate key //reasoning

1. I → B //assertion

2. I → B → O then I → O //transitivity

3. Hence, I → {B, O} and {I, B} → B → O

4. D , Q, and S are not dependent on {I, B}

5. Hence, {I, B} is not a candidate key…

FUNCTIONAL DEPENDENCY & NORMALIZATION

Idea: exploit the FDs to specify which attributes can become PKs and FKs!

Algorithm:

1. Assert which are the FDs among attributes (no other semantics required)

2. Take a pool and put into all the asserted FDs.

3. Create a universal big relation with all attributes

4. Recursively decompose the big relation based on the FDs into many

smaller ones, such that, when we re-join them, it guarantees that no

information is lost and re-constructs the original big relation without

fictitious tuples.

This is the normalization process.

Challenge: find, via progressive decomposition, the basic relations, which can

reconstruct the entire information efficiently avoiding redundancy and avoiding

fictitious tuples after their composition.
20

THEORY OF NORMALIZATION

Progressive decomposition of unsatisfactory (bad) relations by breaking

up their attributes into smaller good relations;

The degree of decomposition is referred to as Normal Form (NF).

 First Normal Form (1NF)

 Second Normal Form (2NF)

 Third Normal Form (3NF)

 Boyce-Codd Normal Form (BCNF)

 Fourth Normal Form (4NF)

 Fifth Normal Form (5NF)

 Sixth Normal Form (6NF)

 …

21

SOME DEFINITIONS

 A prime attribute is an attribute that belongs to some candidate key

of the relation

 e.g., SSN and Pnumber are prime attributes.

 A non-prime attribute is not a prime attribute, i.e., it is not a

member of any candidate key

 e.g., Hours, Ename, Pname, Plocation are non-prime attributes

EMP_PROJ(SSN, Pnumber, Hours, Ename, Pname, Plocation)

22

FIRST NORMAL FORM (1NF)

 The domain Di of each attribute Ai in a relation R refers only to

atomic (simple/indivisible) values.

 1NF disallows:

 nested attributes,

 multivalued attributes

Note: a relation in 1NF is expected to have redundant & repeated

values…
23

1NF

24

1NF

25

SECOND NORMAL FORM (2NF)

verbosity in the primary key X

Definition: full FD X → Y means that if we remove any prime attribute A
from the primary key X, then:

X \ {A} → Y

i.e., X \ {A} does not functionally determine Y anymore.

Example:

 {SSN, Pnumber} → Hours is a full FD, since neither SSN → Hours nor
Pnumber → Hours holds true.

 {SSN, Pnumber} → Ename is not a full FD (partial dependency), since
SSN → Ename holds true;

 i.e., Pnumber does not need to be part of the primary key; it is verbose…
26

SECOND NORMAL FORM (2NF)

Definition: A relation R is in 2NF if every non-prime attribute A in R is

fully functionally dependent on the primary key of R.

Target from 1NF to 2NF: remove all prime attributes from the primary

key, which cause partial dependencies.

Methodology:

1. Identify all the partial FDs in the original relation (already in 1NF)

2. For each partial FD, create a new relation such that all non-prime

attributes in there are fully functionally dependent on the new

primary key:

 i.e., the prime attribute in the original relation causing partial FDs.

3. The new relation(s) will be in 2NF.

27

Primary key is: {SSN, Pnumber}

• FD1: full FD

• FD2: Ename (non-prime)

depends only on SSN and not on

Pnumber; partial FD

• FD3: Pname and Plocation (non-

prime) depend only on Pnumber

and not on SSN; partial FD

Relation EP1:

• FD1: full FD

Relation EP2:

• FD2: Ename (non-prime) fully

depends only on SSN

Relation EP3:

• FD3: Pname and Plocation (non-

prime) fully depend only on

Pnumber

2NF

28

PROBLEM IDENTIFICATION

EMP_DEPT(SSN, Ename, Bdate, Address, Dnumber, Dname, Dmgr_Ssn)

29

SSN Ename Bdate Address Dnumber Dname Dmgr_ssn

1 Chris 1970 A1 3 SoCS 12

2 Stella 1988 A2 3 SoCS 12

3 Philip 2001 A3 3 SoCS 12

4 John 1966 A4 3 SoCS 12

5 Chris 1955 A5 3 SoCS 12

6 Anna 1999 A6 4 Maths 44

7 Thalia 2006 A7 4 Maths 44

1NF?

2NF?

THIRD NORMAL FORM (3NF)

Definition: transitive FD means that given a Primary Key X and non-

prime attributes Z and Y such that: X → Z and Z → Y, then the non-prime Y

is transitively dependent on the primary key X via the non-prime Z.

X → Z and Z → Y then X → Y.

[CourseID, Lecturer, School]

FD1: CourseID → Lecturer

FD2: Lecturer → School

FD3: CourseID → School (via the non-prime Lecturer)

Definition: A relation R is in 3NF (being already in 2NF) if there is no non-

prime attribute which is transitively dependent on the primary key;

 That is, all non-prime attributes should be directly dependent on the

primary key.

30

X Z Y

THIRD NORMAL FORM (3NF)

EMP_DEPT(SSN, Ename, Bdate, Address, Dnumber, Dname, Dmgr_Ssn)

 FD1: SSN → Dmgr_Ssn is transitive
 SSN → Dnumber and Dnumber → Dmgr_Ssn

 Dnumber is non-prime transitive attribute

 FD2: SSN → Ename is non-transitive
 there is no attribute Z, where SSN → Z and Z → Ename.

 FD3: SSN → Dname is transitive FD or not?

 FD4: SSN → Address is transitive FD or not?

SSN Ename Bdate Address Dnumber Dname Dmgr_Ssn

EMP_DEPT

31

THIRD NORMAL FORM (3NF)

Methodology: Split the original relation into two relations: the non-prime transitive

attribute (Dnumber)

 is the PK to the new relation

 is the FK in the original relation referencing to the new relation.

SSN Ename Bdate Address Dnumber Dname Dmgr_Ssn

EMP_DEPT

32

THIRD NORMAL FORM (3NF)

SSN Ename Bdate Address Dnumber Dname Dmgr_Ssn

EMP_DEPT

SSN Ename Bdate Address Dnumber

ED1

Dnumber Dname Dmgr_Ssn

ED2

33

A GOOD SCHEMA NOW…

34

Dnumber Dname Dmgr_ssn

3 SoCS 12

4 Maths 44

1NF

2NF

3NF

SSN Ename Bdate Address Dnumber

1 Chris 1970 A1 3

2 Stella 1988 A2 3

3 Philip 2001 A3 3

4 John 1966 A4 3

5 Chris 1955 A5 3

6 Anna 1999 A6 4

7 Thalia 2006 A7 4

EMPLOYEE DEPARTMENT

GENERALIZED THIRD NORMAL FORM (3NF)

X → Z and Z → Y, with X as the primary key, we consider this as a problem if

only if Z is non-prime attribute.

 Violation of 3NF: when a non-prime attribute (Z) determines another non-

prime attribute (Y) and Y is transitively dependent on the prime key (X).

Case: If Z is a candidate key, there is no problem in 3NF!

EMPLOYEE(SSN, PassportNo, Salary)

 FD1: SSN → PassportNo

 FD2: PassportNo → Salary,

 FD3: SSN → Salary is not a transitive FD since PassportNo is a prime

attribute (candidate key).

Generalized 3NF: Every non-prime attribute A in relation R:

 is fully functionally dependent on every candidate key in R.

 is non-transitively dependent on every candidate key in R.

Ok, now, what is happening with the prime attributes? BCNF…

35

BOYCE-CODD NORMAL FORM (BCNF)
Idea: remove all inherent dependencies: any attribute should be

functionally dependent only on the Primary Key.

A relation is in BCNF iff whenever there exists a FD: X → A then X

is a PK, i.e., the left-hand side should be a PK.

A B C

R

Primary key: {A,B}

Relation is in 3NF: there is no transitivity of a non-prime attribute to the key, thus,

in 3NF.

FD1: {A,B} → C

FD2: C → B; the non-prime attribute C determines B, however, C is not a PK.

FD1

FD2

36

Primary key: {Student, Course}

FD1: {Student, Course} → Instructor

FD2: Instructor → Course /*each instructor teaches only one course*/

Is it in 3NF?

• Yes, there is no transitive dependency of a non-prime attribute on the PK

Is it in BCNF?

• No, in FD2 the left-hand side of the dependency (Instructor) is not a PK.

Challenge: How to decompose the relation? 37

DECOMPOSITION IN BCNF

 Possibilities for decomposing relation TEACH

 {student, instructor} and {student, course}

 {course, instructor } and {course, student}

 {instructor, course } and {instructor, student}

 …

 Objective:

 Can we reconstruct TEACH after joining?

 Which is the best decomposition to avoid fictitious tuples?

 Which is the splitting attribute in 3NF relations?

Answer?

38

DECOMPOSITION IN BCNF

Examine & Join: {course, instructor } and {course, student}

39

R1

R2

student course instructor

Narayan Database Mark

Smith Database Mark

Narayan Database Navathe

Smith Database Navathe

Smith Operating System Ammar

n=256; k=2, then check 32640 split relations!

Combinations n choose k; n! = 1.2.3…n

BCNF DECOMPOSITION THEOREM

Theorem 1: Let relation R not in BCNF and let X → A be the FD which
causes a violation in BCNF.

Then, the relation R should be decomposed into two relations:

 R1 with attributes: R\{A} (all attributes in R apart from A)

 R2 with attributes: {X} U {A} (put together X and A)

If either R1 or R2 is not in BCNF, repeat the process.

Proof: see [*].

Experiment: http://www.ict.griffith.edu.au/~jw/normalization/ind.php

[*] C. Zaniolo, ‘A New Normal Form for the Design of Relational Database
Schemata’, ACM Transactions on Database Systems 7(3):493, Sept. 1982

40

THEOREM APPLICATION

 FD1: {Student, Course} → Instructor

 FD2: Instructor → Course (violation)

X → A refers to Instructor → Course, which violates the BCNF.

Hence, X is Instructor and A is Course.

Given R is TEACH(Student, Instructor, Course).

Split R into two relations:

 R1 with attributes: R\{A} = {Student, Course, Instructor} \ {Course} =
{Student, Instructor}

 R1(Student, Instructor)

 R2 with attributes: {X} U {A} = {Instructor} U {Course} = {Instructor,
Course}

 R2(Instructor, Course)

Note: If you join R1 and R2 w.r.t. Instructor common attribute then we
obtain the original TEACH relation with no fictitious tuples!

41

OPTIMAL DECOMPOSITION IN BCNF

Check: {student, instructor} and {instructor, course}

42

R2

student course instructor

Narayan Database Mark

Smith Database Navathe

Smith Operating System Ammar

R1