PowerPoint Presentation
Database Systems Infrastructure
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Remember, this slide…
We are looking at Normalisation. Normalisation and ERD go together when designing a database. We have look at conceptual data modelling, i.e. ERD.
Remember, Entity-Relationship (ER) modelling is Top-down approach. Begins by looking for the data groups in the system.
On the other hand, Normalisation is a Bottom-up approach. Begins by looking at the smallest individual items of data recorded by the system.
When I was in my doing my undergraduate, I asked myself who will use normalisation. Three weeks into my first job, my boss asked me to evaluate all the databases the company had at that point in time to see if we need to do anything about normalisation.
So normalisation is an important part of database design.
INFS5710 Week 1
Normalization
Evaluating and correcting table structures to minimize data redundancies
Reduces data anomalies
Assigns attributes to tables based on determination
Normal forms
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
functional dependency
(including update, insertion, and deletion anomalies, see Section 6-2)
normalisation
denormalisation
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
The keys point are to reduce or minimised data redundancies, and prevent data inconsistencies. We examine the table structures to see if we could minimise or further minimise data redundancies.
Data redundancy can be generated across different tables if they are not controlled, i.e. normalised.
If data redundancy exists, then there will be anomalies.
For example, if tables are not normalised, then data redundancies will happen, and you will have issues with anonmalies.
Let’s say, your have a private email attribute/column in your student table. You also have your private email address in course_enrol table for all the courses you have enrolled. Thus, you have data redundancy because you have entered private email address column in two tables, and you have to enter the private email address every time you enrol a new course.
Thus, an anomaly refers to the situation when there is a change of an employee information, there are many columns/rows about this employee that must be manually changed/updated.
Normalization
Structural point of view of normal forms
Higher normal forms are better than lower normal forms
Properly designed 3NF structures meet the requirement of fourth normal form (4NF)
Denormalization: Produces a lower normal form
Results in increased performance and greater data redundancy
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
“For most purposes in business database design, 3NF is as high as you need to go in the normalization process. However, you will discover that properly designed 3NF structures also meet the requirements of 4NF”
Need for Normalization
Used while designing a new database structure
Analyzes the relationship among the attributes within each entity
Determines if the structure can be improved
Improves the existing data structure and creates an appropriate database design
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Normalization Process
Objective is to ensure that each table conforms to the concept of well-formed relations
Each table represents a single subject
No data item will be unnecessarily stored in more than one table
All nonprime attributes in a table are dependent on the primary key
Each table is void of insertion, update, and deletion anomalies
not in any primary key
(Consider a spreadsheet about employees. An anomaly
refers to the situation when there is a change of an employee information, there are many cells about this employee that must be manually changed / updated.)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Normalization Process
Ensures that all tables are in at least 3NF
Higher forms are not likely to be encountered in business environment
Works one relation at a time
Starts by:
Identifying the dependencies of a relation (table)
Progressively breaking the relation into new set of relations
(tables)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Table 6.2 – Normal Forms
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Table 6.3 – Functional Dependence Concepts
X
Y
functionally determines
determinant
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Types of Functional Dependencies
Partial dependency: Functional dependence in which the determinant is only part of the primary key
Assumption – One candidate key
Straight forward
Easy to identify
Transitive dependency: An attribute functionally depends on another nonkey attribute
(A, B C, D)
B C
Partial dependency
X Y, Y Z
Then X Z is transitive via a nonprime Y
Z
Y
Y Z is a transitive dependency (to be removed to get to 3NF)
(nonkey to nonkey)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Partial dependency is a functional dependence in which the determinant is only part of the primary key.
For example, (A, B C, D) is Functional dependence whereas B C is Partial dependency because only part of the primary key (B) is needed to determine the value of C.
Transitive dependency is when an attribute, e.g. Z, functionally depends on another non-key attribute, e.g. Y., i.e. nonkey to nonkey)
For instance, X Y, Y Z, then X Z is transitive via a nonprime Y, so Y Z (i.e. nonkey to nonkey) is a transitive dependency (to be removed to get to 3NF)
2052 Randwick and Randwick NSW, then 2052NSW
Conversion to First Normal Form
1NF describes tabular format in which:
All key attributes are defined
There are no repeating groups in the table
All attributes are dependent on the primary key
All relational tables satisfy 1NF requirements
Some tables contain partial dependencies
Subject to data redundancies and various anomalies
“row merge”
1NF
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
So, which are the PK and dependencies?
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
So what are the PKs and their dependencies?
The dependencies are built on data shown above. You have examine the data carefully and see if there are any association between them.
If you are good at spot the difference” between two pictures or numbers or Sudoku, then you should have no problem of picking up differences and similarities. Otherwise, not to worry, you still can do it but probably it will take you a bit longer.
Have a look at project number and project name – they are unique!
Have another look at:
(a) employee number and employee name;
(b) employee_number, job_class and charge/hour.
(c) etc.
ALL_IN_ONE (PROJ_NUM, PROJ_NAME, EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS)
Examine the Similarities and Differences of the Data
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
If the data come in a spreadsheet, then you can examine the similarities and differences of the Data to build up your PK and dependencies using the functions and features under the Data tab in Excel.
Figure 6.3 – First Normal Form (1NF) Dependency Diagram
PK
All attributes are dependent on the primary key
Please note the notation
(PROJ_NUM, EMP_NUM PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS)
per
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Conversion to Second Normal Form
Steps
Make new tables to eliminate partial dependencies
Reassign corresponding dependent attributes
Table is in 2NF when it:
Is in 1NF
Includes no partial dependencies
2NF
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
No partial dependencies (Each non-key field is functionally dependent on the entire PK ).
The key process in 2NF is get rid of partial dependencies and assign other dependent attributes.
Hint: Look for values that occur multiple times in non-key fields. This tells you that you have too many fields in a single table. In a well-designed database, the only data that is duplicated is in key fields used to connect tables.
Figure 6.4 – Second Normal Form (2NF) Conversion Results
HOURS
1NF
not even 1NF
per
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
So, we want to get rid of partial dependencies!
Let’s go back to the 1NF, so what can we do?
The partial dependency of proj_num and proj_name can become a table, with proj_num the PK.
The partial dependency of emp_num, emp_name, job_class and chg_hour can become another table, with emp_num is the PK.
As for the remaining, we left with proj_num, emp_num and hours, eliminating all columns allocated to (2), and this can form the last table. The composite PK is proj_num and emp_num.
So in 2NF,
We have a table Product table formed from the partial dependency in (1).
We have Employee table formed from the partial dependency in (2).
We have the remaining table to form Assignment table, and we change the column name from hours to assign_hours to make it more meaningful.
Conversion to Third Normal Form
Steps
Make new tables to eliminate transitive dependencies
Determinant: Any attribute whose value determines other values within a row
Reassign corresponding dependent attributes
Table is in 3NF when it:
Is in 2NF
Contains no transitive dependencies
3NF
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
A relation / table is in 3NF if it has no transitive dependencies (no non-key attributes determined by other non-candidate-key attributes).
Figure 6.5 – Third Normal Form (3NF) Conversion Results
?
per
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Let’s go back to the 2NF, so what else we need to do?
Project and Assignment tables are OK, and nothing we need to do.
We remove transitive dependency from the employee table into a new table called job. The transitive dependency of job_class and chg_hour are added to Job table, with job_class is the PK.
The remaining columns are emp_num and emp_name, but we need job_class column because it is associated with the employee to say what job class the employee belongs to. Thus, job_class column is an FK in the Employee table.
Improving the Design
Evaluate PK assignments and naming conventions
Refine attribute atomicity
Atomic attribute: Cannot be further subdivided
Atomicity: Characteristic of an atomic attribute
Identify new attributes and new relationships
Refine primary keys as required for data granularity
Granularity: Level of detail represented by the values stored in a table’s row
Maintain historical accuracy and evaluate using derived attributes
previously called simple attribute, as opposed to composite attribute
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
As stated in the slide…
Normalisation is valuable because its use helps eliminate data redundancies, next we want to see if we can improve the design.
More information can be found in Textbook 6-4.
Figure 6.6 – The Completed Database
granularity low
over what time frame?
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
As stated in the slide…
For the Project table, you add emp_num for the roject leader.
For the Job table, you add job description to make it more meaningful to the user.
Etc…
The Assignment table should be the most active table, but it is lean and does not provide too much information. Furthermore, the Assign_Hours needs further clarification.
Figure 6.6 – The Completed Database (cont’d.)
granularity high
x
=
Using a surrogate
key is better
What if there is a rate change?
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
In this slide, it provides an improved design from the Assignment Table on the right lower corner (involving three attributes: PROJ_NUM, EMP_NUM, and ASSIGN_HOURS)
With ASSIGN_CHG_HOUR coming from JOB_CHG_HOUR from the previous slide under JOB table and the ASSIGN_CHARGE calculated automatically.
Surrogate Key Considerations
Used by designers when the primary key is considered to be unsuitable
System-defined attribute
Created a managed via the DBMS
Have a numeric value which is automatically incremented for each new row
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
The Boyce-Codd Normal Form (BCNF)
Every determinant in the table should be a candidate key
Candidate key – Same characteristics as primary key but not chosen to be the primary key
Equivalent to 3NF when the table contains only one candidate key
Violated only when the table contains more than one candidate key
Considered to be a special case of 3NF
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Figure 6.8 – A Table That is in 3NF and not in BCNF
Transitive? Partial?
Not transitive, because it involves a PK.
why not in BCNF?
every determinant is a candidate key
STU_ID
STAFF_ID
CLASS_
CODE
EN_GRADE
Transitive dependency: An attribute functionally depends on another nonkey attribute (nonkey to nonkey)
(A nonkey attribute)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Figure 6.9 – Decomposition to BCNF
change PK
STU_ID
STAFF_ID
CLASS_
CODE
EN_GRADE
STU_ID
CLASS_
CODE
STAFF_ID
EN_GRADE
STU_ID
CLASS_
CODE
EN_GRADE
CLASS_
CODE
STAFF_ID
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
What next is interesting!
You swap the attributes in the composite PK, that is Staff Id or B, and Class Code or C are changed and swapped! Now, it becomes Partial Dependency.
You then create a new table for Partial Dependency, and the remaining as the remaining table.
Sometimes, it might not reach here if ERD is done first and attributes/columns were assigned before normalisation is performed to ensure data consistency.
Fourth Normal Form (4NF)
Table is in 4NF when it:
Is in 3NF
Has no multivalued dependencies
Rules
All attributes must be dependent on the primary key, but they must be independent of each other
No row may contain two or more multivalued facts about an entity
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Figure 6.11 – Tables with Multivalued Dependencies
An employee 10123 has 2 ORG codes and 3 assignment #s, who is a volunteer.
How should both attributes be recorded in
a table for volunteers?
Not in 4NF!
many nulls
confusing
PK = all attributes
Repeats = redundancy
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Figure 6.12 – A Set of Tables in 4NF
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
4th Normal Form Example (1)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
When you have a table with multivalues, you have a combination of values such as the Pizza Chain.
The table is in 3NF and BCNF. However, if I were to add a new pizza called Super Supreme to Pizza Hut chain, I have to add to all three stores, namely Kingsford, Randwick and Kensington, i.e. I have to get the combination correct!
The issue here is Pizza Chain depends on Pizza and Store, but we can break it down to two tables (Pizza and Store) instead of one table (Pizza Chain)!!!
4th Normal Form Example
One table into Two tables
Pizza Chain Pizza Chain Pizza
Pizza Chain Store
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
The issue here is Pizza Chain depends on Pizza and Store, but we can break it down to two tables (Pizza and Store) instead of one table (Pizza Chain)!!!
Based on Pizza Chain table, I created two tables: Pizza Chain Pizza, and Store!
Pizza Chain Pizza Chain Pizza and
Pizza Chain Store
Store table has Pizza Chain and Store.
Pizza Chain Pizza table has Pizza Chain and Pizza.
4th Normal Form Example
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Now, I only have to add Super Supreme to Pizza Hut in Pizza Chain Pizza table. Thus, I just have to add one row to one table instead of trying to get the combinations correct like previously!
Normalization and Database Design
Normalization should be part of the design process
Proposed entities must meet required the normal form before table structures are created
Principles and normalization procedures to be understood to redesign and modify databases
ERD is created through an iterative process
Normalization focuses on the characteristics of specific entities
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Denormalization
Design goals
Creation of normalized relations
Processing requirements and speed
Number of database tables expands when tables are decomposed to conform to normalization requirements
Joining a larger number of tables:
Takes additional input/output (I/O) operations and processing logic
Reduces system speed
is not easy
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Data is redundant but access will be much faster – this is in data warehousing and big data!
Denormalization
Defects in unnormalized tables
Data updates are less efficient because tables are larger
Indexing is more cumbersome
No simple strategies for creating virtual tables known as views
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Table 6.6 – Common Denormalization Examples
(ZIP, CITY)
(Course, credit hr)
storing avg grade
point
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Oracle
Oracle
Database
Flat Files
Machine Learning
Prepared by Vincent Pang, Feb. 2021 Note: In-Memory Database (e.g. SAP Hana) is an alternative data model not shown here
ETL (Data
Cleansing)
Data (DW to BD, or
vice versa or both)
RelationalDatabaseBig DataDataWarehouseNormalisationReporting (Business Intelligence and Visualisation) and Business Analysis (End Users)(De-Normalised)External Data (e.g. Excel)Hadoop Distributed File System (HDFS) and MapReduceEntity Relationship Model (ERM)Unstructured Data(Social Media)Structured Data (Internet of Things (IOT))HadoopSQLSQLSpark and NoSQL (and other tools)Data StreamingData Streaming(“Not Normalised”)External Data (e.g. Excel)Star Schema
Pizza ChainPizzaStore
DominoHot & SpicyKingsford
DominoSuper SupremeKingsford
DominoHawaiian PizzaKingsford
DominoHawaiian PizzaRandwick
Pizza HutHot & SpicyRandwick
Pizza HutHot & SpicyKensington
Pizza HutHot & SpicyKingsford
Pizza HutHawaiian PizzaRandwick
Pizza HutHawaiian PizzaKensington
Pizza HutHawaiian PizzaKingsford
Vincenzo’s PizzaEmperor SupremeRandwick
Vincenzo’s PizzaEmperor SupremeKensington
Vincenzo’s PizzaHot & SpicyRandwick
Vincenzo’s PizzaHot & SpicyKensington
Pizza Chain Table
Pizza ChainPizzaStore
DominoHot & SpicyKingsford
DominoSuper SupremeKingsford
DominoHawaiian PizzaKingsford
DominoHawaiian PizzaRandwick
Pizza HutHot & SpicyRandwick
Pizza HutHot & SpicyKensington
Pizza HutHot & SpicyKingsford
Pizza HutHawaiian PizzaRandwick
Pizza HutHawaiian PizzaKensington
Pizza HutHawaiian PizzaKingsford
Vincenzo’s PizzaEmperor SupremeRandwick
Vincenzo’s PizzaEmperor SupremeKensington
Vincenzo’s PizzaHot & SpicyRandwick
Vincenzo’s PizzaHot & SpicyKensington
………
Pizza HutSuper SupremeRandwick
Pizza HutSuper SupremeKensington
Pizza HutSuper SupremeKingsford
Pizza Chain Table
Pizza ChainPizza
DominoHot & Spicy
DominoSuper Supreme
DominoHawaiian Pizza
Pizza HutHot & Spicy
Pizza HutHawaiian Pizza
Vincenzo’s PizzaEmperor Supreme
Vincenzo’s PizzaHot & Spicy
Pizza Chain – Pizza
Pizza ChainStore
DominoKingsford
DominoRandwick
Pizza HutKensington
Pizza HutKingsford
Pizza HutRandwick
Vincenzo’s PizzaRandwick
Vincenzo’s PizzaKensington
Store
Pizza ChainPizza
DominoHot & Spicy
DominoSuper Supreme
DominoHawaiian Pizza
Pizza HutHot & Spicy
Pizza HutHawaiian Pizza
Vincenzo’s PizzaEmperor Supreme
Vincenzo’s PizzaHot & Spicy
……
Pizza HutSuper Supreme
Pizza Chain – Pizza
/docProps/thumbnail.jpeg