PowerPoint Presentation
Database Fundamentals
Introduction to Databases
Database Overview
What is a database?
A handful of tables containing data that are used to retrieve information
A collection of potentially useful data that models real-world or conceptual objects and their relationships
Objects/Entities like People, items, Purchases, Students etc
Relationships like Jim purchased a book
Conceptual like songs, movies
It is managed by a Database Management System (DBMS)
This system may be manual or computerised
Card Catalogue vs modern Computerised equivalent
For us, the DBMS is software designed to store and manage data across one or more databases
A typical example of a database is a Library
Books = objects containing data
Catalogue system = DBMS
2
Database Overview
3
Problems with Traditional Physical Data Storage
Limited by Physical size
MS SQL will store ~2 trillion records in a fraction of the space!
Limited Multi-access
A borrowed book can be viewed by only one person
Limited search methods (Title, Author, Subject)
What if the data has 2 Authors?
What if the title starts with “The” or “As” (eg “As Time Goes By”)
What if I want to find a word in the Title?
What if you don’t want to search by title 1st word, Author or Subject?
Slow turn-around
Would you wait 7 minutes?
Google can pull up a book in < 7 seconds
4
Split anomalies up onto separate slides
Show the example
Good example what happens if there are two authors how do you represent that so you can search for it
4
Problems with traditional Physical methods
Multiple indexes that must be rigorously maintained
Anomalies: - What IFs
What if a card is misplaced?
An item is deleted in one index but not the others (Deletion Anomaly)
An item is entered into one index but not the others (Insertion Anomaly)
An item was incorrectly entered in all indexes but only corrected in one? (Update Anomaly!)
What if a catalogue card is misplaced? What if there are different versions of the same record?
Complex Query Limitations:
How quickly can you count the number of books on a topic or by an author?
What if a book is on many subjects – it must appear multiple times in the subject catalogue (prone to above anomalies)
5
Problems with Traditional Physical Data Storage
Split anomalies up onto separate slides
Show the example
Good example what happens if there are two authors how do you represent that so you can search for it
5
Modern Data Storage
Data can be stored in files such as text files, Excel sheets, Word documents.
Advantages
low cost (is this really true? – think time and effort!)
No extra software/hardware is needed (or open source, standard Mac/PC), easy to use?
6
6
Problems with Modern Data Storage
Data can be stored in files such as text files, Excel sheets, Word documents.
Disadvantages
Size - Retrieval of data is slow when the volume of data is large (but better than physical data)
Updates/Synchronisation – files are hard to share. Multi-user updates often end up with multiple versions of the same file or overwriting someone's changes!
Accuracy – Duplication is wasteful and can cause confusion (inconsistent data)
Security – restriction to some fields is more difficult (eg TFN, studentID + username)
Programming is required for automatic processing: different files require different programs for processing
Inability to process complex queries (at least with any efficiency)
7
7
Modern Data Storage
Did you know data is still stored on Tape?
Tape is the most affordable form of data storage (costing < 10₵/GB )
Tape cartridges can store up to 6.25TB on a single cartridge
More than 6 Peta Bytes (6 x1000TB) can be stored every 1.3m2 in a data centre
Has lower error rates that HDD and less prone to viruses
Lasts longer
But ~90% of business data is rarely accessed again! - so why bother?
Compliance with laws and company regulations
Data Mining – maybe the data holds some hidden detail?
8
Why Store Data?
So we can extract facts (information) and use it for decision making
Car airbags kill 1 person for every 22 lives that they save
A new baby usually deprives each of it’s parents around 350-400 hours of sleep in the first year
On average, 100 people choke to death on ballpoint pens every year
Annoy friends and family with facts inferred from data
9
Why Store Data?
10
Why Store Data?
http://tylervigen.com/page?page=1
11
Why Store Data?
http://tylervigen.com/page?page=3
12
Why Store Data?
13
Database Overview
Why store data and not information?
Data is simple re-usable facts
Is non-specific and does not carry any meaning
Can be processed, organised, analysed to give it meaning (to provide information)
This makes data a valuable resource with a very long life cycle
Data + Context = Information
Different questions of the same data will provide different information
This makes data more flexible
Data can be reused!
What can you find out?
Data = the future
Data management = increased productivity
Better customer relations (sometimes!)
Knowing how to manage and use data = $$
Do YOU own a store card?
14
Databases - The Data Model
Data Model
The data model dictates how the data will be structured and stored
The Relational Model (most common)
Introduced in the 70s
Structured data stored in tables with relationships
Good for up to 100TB (scales up)
Alternatives exist and these will be covered in Enterprise Databases next year
Objects, XML, Hierarchical, Network
NoSQL/key-value and others
Good for 100TB+ (scales out)
Semi-structured data
Source: wikipedia.org
15
15
Relational databases are good for up to 10-100TB of data. These types of databases scale up by increasing the size of the server or splitting the load among two or more servers (although this is complicated and far from perfect)
They capture Structured Data – you need to know exactly what data you want to collect as it is difficult to modify the structure once it is implace.
NoSQL scales out and can handle much larger datasets (peta bytes) it scales by adding whole new nodes or servers that operate in parallel and can perform load balancing.
Database Overview
How do we ask questions (query) of our database?
Need a Structured Query Language to Ask Questions of Structured Data.
Structured Query Language (SQL)
Remember this you will use it forever!!
A Structured language uses keywords in a logical pattern to perform tasks
It’s easier to understand than other programming languages
A DBMS (database management system) interprets SQL and actions it
Determines what needs to be done, how and then returns the result
16
Databases in Computers – The DBMS
A DataBase Management System performs all the maintenance and management functions for the data model:
Managing disk space (shrink/expansion)
Communication of data from the computer to human users (and vice versa)
Managing the creation of tables (and various indexes)
Managing the insertion, updating and deletion of data records – including performing any necessary checks
A DBMS Performs complex query answering (searching and matching) of records
With greater efficiency and accuracy than us!
The DBMS and its users communicate using a mutually understandable language: SQL (Structured Query Language!)
Popular free DBMS include:
SQLite
MySQL
MS-SQL Express
PostgreSQL
Commercial: MS-SQL and Oracle, DB2 + the rest
17
Databases in Computers – DBMS Properties
A DBMS is an application that manages collections of data:
It can handle a large amount of data (more than RAM can hold)
It can share the data between various applications and users
Web interfaces, desktop/console applications incl. command line via the DBMS!
A DBMS ensures
Data is persistent (it can be retrieved once it is created)
The data is reliable (it can be retrieved after hardware/software failures)
It can also ensure data meets specific requirements (“constraints”: eg the student mark is between 0 and 100)
Privacy is maintained (it controls access and manipulation)
A DBMS must also
Be Efficient (using the appropriate amount of resources, not flooding CPU/RAM)
Preferably no 7 min queries please!
18
18
Structure of a DBMS
Web Form
Application Front End
Management Studio SQL Interface
SQL Commands
Query
Evaluation
Machine
DBMS
Index files
Data files
System Catalogue
Physical Database on Disk
Files and Access methods
Buffer manager
Disk Space manager
Recovery Manager
Transaction
Manager
Lock
Manager
Adapted from Database Management Systems
Ramakrishnan & Gehrke
Operator Evaluator
Plan executor
Parser
Optimiser
Ensures that queries are carried out in the correct order.
If a specific query fails the database data will not be left in an inconsistent state
Controls concurrent access to records
May reject access if a record is being accessed/modified by another user
Uses logs to retain data integrity in the event of a system failure
Ensures incomplete transactions are finalised or rolled back leaving data in the original state
Also called Data Dictionary. Stores info about:
names, types, size of data
Names of constraints/relationships
Names of authorised users
Access statistics
19
19
The Transaction Manager ensures that queries (or actions involving data) are carried out in the correct order.
It ensures that if a specific query fails (eg and update or insert) that the database will not have data left in an inconsistent state
The Lock Manager is provided to control access to records when there are multiple users involved.
If both users are updating a record, or one is trying to retrieve a record that is being modified the Lock Manager may restrict or reject access to the record
The Recovery Manager
Uses system logs created by the transaction manager to retain data integrity in the event of a system failure
The System Catalogue is a data dictionary (metadata) that holds information about schemas, users, applications etc.
It stores
names, types, sizes of data items
Names of relationships
Names of authorised users
Usage statistics etc
The Lock Manager is used to help share the database resources amongst several users. It
Why Study Databases?
Gain knowledge and skills to interact with a DBMS
Example: working in MS-SQL/Oracle
Migrating existing data from legacy systems ($$ data is money! $$)
Gain skills for using a database driven application
Data insertion and deletion with or without GUI
Data queries with or without GUI
Why things work the way they do (or fail to perform up to expectations)
20
Database Fundamentals
The Relational Model
The dominant method for managing data in todays world
21
Relational Database Modelling Overview
Conceptual Model Translation Process for Relational Model
Conceptual Model
Logical Translation
(DBMS Tool)
Relational DBMS
Physical Implementation
ER
UML
Week 2
Week 3
Week 3+
22
Relational Database Modelling Overview
Conceptual Modelling
Database requirements are collected and visualised as a UML diagram (Unified Modelling Language)
Uses higher level modelling to abstract away the complexities of implementation
A graphical way of conceptualising data capture requirements to ensure the required data is collected by the database design!
Some examples:
Week 2
23
Relational Database Modelling Overview
Logical Modelling
The next phase is to create functional relational schemas based on the conceptual design.
A written description of how the relational database will be implemented
This includes deciding which candidate key(s) will become the primary key - the value used to locate a specific record (like your studentID or emailID)
Example:
Project(ProjectName, Budget, Manager)
PK(ProjectName)
FK(Manager) Manager(Name)
ProjectMembers (ProjectName, Employee)
PK(ProjectName, Employee)
FK(projectName) Project(ProjectName)
FK(Employee) Employee(Name)
Employee(Name, BirthDate)
PK(Name)
Manager(Name, BirthDate, Title)
PK(Name)
PK = Primary Key
FK = Foreign Key
Week 3
24
24
Relational Database Modelling Overview
Physical Design
Take the relational schemas and implement them using a DBMS
Structured Query Language (SQL) is used to create the relational database and manage data within the database
CREATE TABLE Employee (
Name varchar(200),
BirthDate date,
CONSTRAINT employeePk PRIMARY KEY (Name)
);
CREATE TABLE Manager (
Name varchar(200),
CONSTRAINT managerPk PRIMARY KEY (Name)
);
CREATE TABLE Project(
ProjectName varchar(200),
Budget decimal(6,2),
Manager varchar(200),
CONSTRAINT projectPk PRIMARY KEY (ProjectName),
CONSTRAINT managerFk FOREIGN KEY (Manager) REFERENCES Manager(Name)
);
Week 3+
25
26
Relational Concepts
Jargon for Table Terminology
Relational Model – Key Concepts
In a Relational DB data is held in a series of Relations
A Relation is a table that consists of columns and rows of data
What makes a table with rows and columns a Relation?
Within a Table every column name must be UNIQUE
Within a Table every row must be UNIQUE (no duplicate data!)
Every Row must have a unique Primary Key that can identify that data row only!
Relational Name Common Name Alternative
Relation Relation Table Table
Attribute Column Field
Tuple Row Record
28
28
Relational Model – Key Concepts
A Relation defines a real world or conceptual object we collect information about
When a relation is implemented in a DBMS, it is often called a Table
A relational database consists of relations with distinct names
Eg, a table that stores Customer Information, Bank accounts, Uni Courses
An Attribute is a name for a column in a relational table
Each Attribute must have a unique name in a given relational table
Every attribute has a domain
Eg, the Customer table has columns (attributes) for collecting the customer’s Family Name, First Name, Address, Age
29
29
Eg a relation
A customer buys books
An attribute
Customer has a family name
Domain
Family name is a string of characters
Schema is a group of relationships,
i ) A relation is a set of tuples. When a relation is implemented in a DBMS, it is a table.
Eg Employees, Departments (rectangles). Also referred to as a relation
ii ) An attribute is a name for a column of a relation. Every attribute has a domain.
Relational Model – Key Concepts
A Domain dictates the acceptable values of an attribute
E.g. int, decimal(5,2), varchar(200), bit
We will refer to domains as Data Types (consistent with Programming Fundamentals)
Eg. FirstName is a string of characters of varying length (varchar), or the customer age may be an integer number (int)
A Table Schema defines the structure of a relational table in a relational database (table name + attributes)
Customer(CustomerID, FirstName, FamilyName, Address, Age)
Relation Name + attributes
30
30
Eg a relation
A customer buys books
An attribute
Customer has a family name
Domain
Family name is a string of characters
Schema is a group of relationships,
i ) A relation is a set of tuples. When a relation is implemented in a DBMS, it is a table.
Eg Employees, Departments (rectangles). Also referred to as a relation
ii ) An attribute is a name for a column of a relation. Every attribute has a domain.
Relational Database – Key Concepts
Tuples are instances of a relation or entity and contain the actual raw data
A Tuple is a row of data
A relation consists of one or more unordered tuples
We use special commands to order the records returned by the DBMS when necessary
A Key is a set of one or more attributes whose values can uniquely identify a given row of data in a relation
The Primary Key (PK) is a value that identifies a particular table row of data
Candidate (name given to all keys)
Unique
Surrogate
Natural
31
Relational Database – Key Concepts
Terms and Notation
studID emailID studName
50001 leej001 John Lee
50002 bump001 Peter Buman
50003 brod001 David Browns
Attribute name
Relation name
Students
schema
a tuple
Students
studID (PK)
emailID
studName
Students
studID (PK)(varchar)
emailID(varchar)
studName(varchar)
Equivalent Conceptual Design UML:
Attribute domain
(data type)
Logical Schema:
Students(studID, emailID, studName)
Physical Implementation:
32
Cardinality = # of tuples of a given Schema
Relational Instance = a set of tuples from the same schema (entity type)
32
Key Concepts
More Keys than Diablo III
Relational Database – Key Concepts
Candidate Key(s) – a minimum combination of attributes that could become the Primary Key when designing a database
If there is only one Candidate Key (CK) it is generally selected as the Primary Key
Other times, the developer must choose which CK will be used as the Primary Key
Primary Key – the key chosen from the candidates to be implemented in the relation
All relations require a Primary Key (PK)
It is a set of one or more attributes chosen to uniquely identify each tuple (or row) of data
If you are given a PK, you can locate the specific data
Sometimes it is an automatically generated number (eg the current row number in excel)
34
34
A key is a minimal set of attributes whose combination of values are unique in the table.
Minimal means as few as possible attributes (often there is only one)
The key is used to uniquely identify a specific tuple in the relation or table.
No two tuples in the relation should have the same key value.
Eg, a StudentID should be a key – each student has a unique studentID which is not repeated or replicated for any other student.
Relational Database – Key Concepts
Primary Key
Every relation has a special attribute or a set of attributes whose values can be used to uniquely identify a single row of data (a tuple)
It serves to distinguish one row from every other
The value(s) must be unique for each row
StudentID StudentName
1011 Jane Lane
1012 Jeff Wilson
1013 Jane Pain
CourseID CourseName
INFS1091 Proj Dev
COMP2064 FIT
LMIF3621 Math
Student
Course
Unique value for every row
35
35
Relational Database – Key Concepts
Composite key
A Key (Primary or Other!) that is composed of more than one attribute
Example shows a Composite Primary Key!
Course Student Mark
INFS1091 1011 87
COMP2064 1013 94
LMIF3621 1012 68
Enrolment
Unique values for every row
36
36
Relational Database – Key Concepts
A Unique Key is a set of attribute(s) whose values are always unique
Any candidate key that is not selected as the Primary Key can be implemented as a Unique Key constraint in the DBMS
The data is checked to ensure each entry is a unique value and does not already exist in other rows
Natural Key is a minimum set of one or more attributes that exist in the table data that can be used or combined to uniquely identify each row:
Bank account/Telephone numbers, email address etc, Name + Birth Date
Surrogate Key – ultimate lazy approach to implementing a Primary Key
An incremental number stored as an ID column in a table
A randomly generated alpha-numeric value
Has nothing to do with the table data
37
37
A key is a minimal set of attributes whose combination of values are unique in the table.
Minimal means as few as possible attributes (often there is only one)
The key is used to uniquely identify a specific tuple in the relation or table.
No two tuples in the relation should have the same key value.
Eg, a StudentID should be a key – each student has a unique studentID which is not repeated or replicated for any other student.
What Key is That?
UserName (PK) FirstName LastName
davop001 Paul Davos
adamj003 Jennifer Adams
duffl001 Lenny Duff
Candidate Key
Natural Key
Surrogate Key
Unique Key
Primary Key
userName:
38
What Key is That?
studentID (PK) userName FirstName LastName
10011 davop001 Paul Davos
10093 adamj003 Jennifer Adams
10178 duffl001 Lenny Duff
Candidate Key
Natural Key
Surrogate Key
Unique Key
Primary Key
studentID:
Candidate Key
Natural Key
Surrogate Key
Unique Key
Primary Key
userName:
39
Relational Database – Key Concepts
Foreign Keys are attributes whose values are the Unique Key of another Table (primary or unique!)
The PK values are copied and stored in the other table as a Foreign Key value
Foreign keys allow one table to be linked to one or more tuples in another table.
hence the term “Relational Database”
Example - A course mark is of no use if it doesn’t relate back to a student and a course:
Enrolment(StudentID) is a foreign key
referencing Student(StudentID)
Enrolment(CourseID) is a foreign key
referencing Course(CourseID)
PK
FK
PK
FK
40
40
Foreign keys and other constraints are used to help fortify the accuracy of data collected.
To prevent invalid data from entering the database there are several options:
Define constraints to limit choices/entries to valid options
Ask the DBMS to enforce the constraints so that no bad tuples can be input into the tables.
EG, prevent a non-existant student being enrolled
Prevent a student getting 110/100 or Good/Pass for a mark
In this example we have an Enrolment table that borrows information from two other tables. Often when representing a foreign key, the attribute is named after the table from which the PK is borrowed. In this case, enrollment borrows the studentID from the student relation, hence the attribute “student” is used to reflect the studentID.
Similarly, the courseID is borrowed from the course relation hence has been denoted as course.
This is not entirely necessary and different naming conventions can be used. Some common conventions include:
studentID could be used in both tables as the PK and FK
fkStudentID could be used
Student_ID could be used to denote the FK
Regardless. A Foreign key MUST reference a PK or a attribute whose values are unique. The foreign key must always point back to only one record.
Relational Database – Key Concepts
Foreign Key Violation:
Student 50002 took course 12550 which is not a valid code for any course.
Course 12529 was taken by Student 50009 which is not a valid id for any student.
Foreign Keys are used to prevent these scenarios from taking place
These are “orphan” records that have lost their meaning
Student
StudID EmailID StudName
50001 leej001 John Lee
50002 bump001 Peter Buman
Course
CrsID CrsName
12529 Data modelling
12510 Java Programming
Enrolment
Student Course Mark
50001 12529 80
50002 12550 75
50009 12529 50
??
??
41
A constraint is a restriction implemented by the DBMS to ensure the integrity of the tuple data being entered.
It may involve ensuring values are provided
It may involve ensuring a value is within a given range or of a specific data type.
Foreign key constraints ensure that a record exists in another table.
In this case student 50009 would not be able to be entered into the enrolment because it violates the FK constraint. FK constraints prevent the referencing of a record that does not exist.
41
Database Fundamentals
Logical Schema Design
Deciding what you need to capture and how it should be stored
42
Databases – Relational Model Schema
Database Designs
A Database design should capture all the required information
Both current and future requirements
More than one design may be plausible
One design may be better than another
A good design in one situation may perform poorly in another
A good design should take into account how the data will be used
Frequent user searches
Auditing purposes
Predictions/data mining
43
43
Database – Normalisation
Database Designs
Consider what data needs to be stored and how it will be used
44
Databases – Relational Model Schema
Bad Database Designs can create data quality issues
Tracking customer shopping habits
Customer account No
Customer Name
Customer Address
Products purchased
Stores they shop
Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName)
acctNo custName custAddress product storeName
1101 Mary Jane 123 Lane Coffee maker Hardly Normal
3311 James 424 Sa Milk Foodville
1101 Mary Jane 123 Lane Knife and fork set Hardly Norma
1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things
2211 Bob 111 Drive Coffee maker Hardly Normal
3311 James 424 Sa Milk Coz
45
45
Databases – Relational Model Schema
Bad Database Designs can create data quality issues
Tracking customer shopping habits
Customer account No
Customer Name
Customer Address
Products purchased
Stores they shop
Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName)
acctNo custName custAddress product storeName
1101 Mary Jane 123 Lane Coffee maker Hardly Normal
3311 James 424 Sa Milk Foodville
1101 Mary Jane 123 Lane Knife and fork set Hardly Norma
1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things
2211 Bob 111 Drive Coffee maker Hardly Normal
3311 James 424 Sa Milk Coz
Redundancy
Every time our user creates a new record
They need to re-enter all this information
again and again
46
46
Databases – Relational Model Schema
Bad Database Designs can create data quality issues
Tracking customer shopping habits
Customer account No
Customer Name
Customer Address
Products purchased
Stores they shop
Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName)
acctNo custName custAddress product storeName
1101 Mary Jane 123 Lane Coffee maker Hardly Normal
3311 James 424 Sa Milk Foodville
1101 Mary Jane 123 Lane Knife and fork set Hardly Norma
1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things
2211 Bob 111 Drive Coffee maker Hardly Normal
3311 James 424 Sa Milk Coz
Update Anomaly
If our user notices a mistake in one record
and corrects it, do they remember to correct
all the others?
47
47
Databases – Relational Model Schema
Bad Database Designs can create data quality issues
Tracking customer shopping habits
Customer account No
Customer Name
Customer Address
Products purchased
Stores they shop
Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName)
acctNo custName custAddress product storeName
1101 Mary Jane 123 Lane Coffee maker Hardly Normal
3311 James 424 Sa Milk Foodville
1101 Mary Jane 123 Lane Knife and fork set Hardly Norma
1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things
2211 Bob 111 Drive Coffee maker Hardly Normal
3311 James 424 Sa Milk Coz
Deletion Anomaly
If our user deletes this purchase record we
loose all our information we had on Bob!
48
48
Databases – Relational Model Schema
A Good relational schema design should take into account usage, avoid redundancies and anomalies
Redundancy
Information is recorded unnecessarily in multiple tuples (rows)
Update Anomaly
Information can be changed in one tuple leaving old information in another
Insertion Anomaly
A bad design that forces unrelated information to be recorded as well and this information may be wrong
Deletion Anomaly
Deletion of a record results in complete loss of other information
49
49
Databases – Relational Model Schema
Example – A Better Design?
Ensure a Relation only captures one type of information about an entity or object
Question: Is the new design any good for looking up customer records?
customer(acctNo, custName, custAddress)
customerPurchases(acctNo, product, storeName)
acctNo custName custAddress
1101 Mary Jane 123 Lane
3311 James 424 Sa
2211 Bob 111 Drive
acctNo product storeName
1101 Coffee maker Hardly Normal
3311 Milk Foodville
1101 Knife and fork set Hardly Normal
1101 Quilt set Quilts ’n ’Things
2211 Coffee maker Hardly Normal
3311 Milk Coz
customer
customerPurchases
50
50
Student Course Room
Mary CS145 B01
Joe CS145 B01
Sam CS145 B01
.. .. ..
If every course is in only one room, contains redundant information!
A poorly designed database causes anomalies:
If we update the room number for just one tuple (Update anomaly)
Suppose everyone drops the course suddenly… we lose information about where the course is! (Delete Anomaly)
We may not be able to create a room reservation without students. Need to know every detail (Insert anomaly)
Databases – Relational Model Schema
http://web.stanford.edu/class/cs145/
51
Database – Normalisation
Normalisation is the process of decomposing large relational schemas into smaller more efficient schemas
Smaller schemas that contain only one copy of any given record
Avoids issues that result from having multiple copies of the same record (typos etc)
Smaller schemas that contain only relevant data
Smaller schemas that are more easily searched
Normalisation prevents update anomalies and data inconsistencies
Works well where data is updated frequently but at the expense of information retrieval
We will see this in coming weeks
52
Database – Normalisation
The Problem of Un-Normalised Data:
Imagine the table has 1000s of rows
How do you find the franchises with stores in London?
How do you ensure data integrity if “storeLocations” is an unlimited text field?
Can you prevent a person writing Londn or L’dn intentionally or by accident?
franchiseID storeLocations
101 Chicago, London, New York, Paris, San Francisco
102 Seattle
103 London, Munich, Paris, Athens
104 New York, London
53
Database – Normalisation
The Problem of Poorly Normalised Data:
What if there are stores in other cities?
How do you get a list of the cities for a particular store?
You’d need to check each column to see if it contained data!
FranchiseID City1 City2 City3 City4 City5
101 Chicago London New York Paris San Francisco
102 Seattle NULL NULL NULL NULL
103 London Munich Paris Athens NULL
104 New York London NULL NULL NULL
54
Database – Normalisation
How should we represent this information?
FranchiseID City1 City2 City3 City4 City5
101 Chicago London New York Paris San Francisco
102 Seattle NULL NULL NULL NULL
103 London Munich Paris Athens NULL
104 New York London NULL NULL NULL
55
Database – Normalisation
How should we represent this information?
FranchiseID City1 City2 City3 City4 City5
101 Chicago London New York Paris San Francisco
102 Seattle NULL NULL NULL NULL
103 London Munich Paris Athens NULL
104 New York London NULL NULL NULL
Franchises
101
102
103
104 …
Cities
Chicago
London
New York
Paris …
FranchiseCities
101 Chicago
101 London
101 New York
… …
103 … London
56
Database – Normalisation
1st Normal Form
All tables should be “flat”
All occurrences of a record type must contain the same number of fields
All values in a given column must be of the same data type (what does this mean?)
A value should NOT be composed of multiple values
Student Courses
Mary {CS145,CS229}
Joe {CS145,CS106}
… …
Student Course
Mary CS145
Mary CS229
Joe CS145
Joe CS106
Not in 1st NF
1st NF Equivalent
http://web.stanford.edu/class/cs145/
57
Database – Normalisation
1st Normal Form
All tables should be “flat”
All occurrences of a record type must contain the same number of fields
All values in a given column must be of the same data type (what does this mean?)
A table should have NO repeating groups of values
Student Course1 Course2
Mary CS145 CS229
Joe CS145 CS106
… …
Student Course
Mary CS145
Mary CS229
Joe CS145
Joe CS106
Not in 1st NF
1st NF Equivalent
http://web.stanford.edu/class/cs145/
58
Database – Normalisation
Student Postcode Suburb Course CourseName Room Mark
Mary 5000 Adel CS145 FIT B01 D
Mary 5000 Adel CS229 PF B01 C
Jeff 5001 Adel CS229 PF B01 C
Jane 5092 Modbury Nth CS106 DBF B02 HD
Joe 5092 Modbury CS145 FIT B01 P
Joe 5092 Modbury CS106 DBF B02 HD
How can we improve this?
Student Enrolments
1st NF
Student: Student → Suburb + Postcode
Enrolments: Student + Course → Mark
Courses: Course → Room
Optional:
Suburbs: Suburb → Postcode ??
Separate information into relations that represent real-world or conceptual objects
Student, Courses, Enrolments, Suburbs?
Look for attributes that depend on the values of other attributes
59
Database – Normalisation
How can we improve this?
Now:
Student → Postcode, Suburb
The Student always determines the Suburb and the Postcode
Neither the Suburb or the Postcode is always unique
Student Postcode Suburb
Mary 5000 Adel
Jeff 5001 Adel
Jane 5092 Modbury Nth
Joe 5092 Modbury
Student
60
60
Database – Normalisation
Course CourseName Room
CS145 FIT B01
CS229 PF B01
CS106 DBF B02
Now:
Student + Course →Mark
Course → CourseName, Room
The Course the student enrols in determines the mark
The Course always determines the room
Student Course CourseName Room Mark
Mary CS145 FIT B01 D
Mary CS229 PF B01 C
Jeff CS229 PF B01 C
Jane CS106 DBF B02 HD
Joe CS145 FIT B01 P
Joe CS106 DBF B02 HD
Student Course Mark
Mary CS145 D
Mary CS229 C
Jeff CS229 C
Jane CS106 HD
Joe CS145 P
Joe CS106 HD
How can we improve this?
Student Enrolments
Enrolment
Courses
61
61
Database – Normalisation
62
(Original mega table)
Database – Normalisation
How can we improve this?
Now:
Student → Suburb
Suburb → Postcode
The Student always determines the Student Name and address
The Suburb always determines the Postcode
3rd NF Equivalent
Student Suburb
Mary Adel
Jeff Adel
Jane Modbury Nth
Joe Modbury
Suburb Postcode
Adel 5000
Modbury Nth 5092
Modbury 5092
Student
Student
Suburbs
This example does not work for all suburbs!
63
Student Postcode Suburb
Mary 5000 Adel
Jeff 5001 Adel
Jane 5092 Modbury Nth
Joe 5092 Modbury
63
Database – Normalisation
How can we improve this?
Now:
Student → Postcode, Suburb
The Student always determines the Suburb and the Postcode
Neither the Suburb or the Postcode is always unique
Student Postcode Suburb
Mary 5000 Adel
Jeff 5001 Adel
Jane 5092 Modbury Nth
Joe 5092 Modbury
Student Postcode Suburb
Mary 5000 Adel
Jeff 5001 Adel
Jane 5092 Modbury Nth
Joe 5092 Modbury
Student
Student
Neither the Postcode or Suburb can uniquely identify the other:
Sometimes Adelaide is 5000, sometimes it is 5001
Sometimes 5092 is Modbury sometimes it is Modbury North
64
64
Database – Normalisation
Original “mega” Logical Schema:
Student(Student, Postcode, Suburb, Course, CourseName, Room, Mark)
PK(Student)
Relations should represent real world or theoretical objects
A relation should represent one type of object or a relationship between two objects, not every everything you think you need to capture
New Logical Schemas:
Student(student, postcode, suburb, postcode)
PK(student)
Courses(course, courseName, room)
PK(course)
Enrolments(student, course, mark)
PK(student, course)
Is a student name a good Primary Key (PK)?
A better schema using a surrogate key:
Students(studentID, studentName, suburb, postcode)
Courses(courseID, courseName, room)
Enrolments(studentID, courseID, mark)
65
Note the inconsistencies with Naming here. You need to decide if you label your tables using Plural or Singular naming and then stick with it!
65
Logical Schema Design
So what does this all mean?
When you are designing a database you should:
Create a List of all the real-world AND conceptual objects you need to store data about
To each of those items, list the attributes required to capture the desired data
– ALL of them!
Check that each of your objects only captures only relevant information to that object
(relevant to that object and only to that object)
Check to see if you have any candidate keys
Do they apply to ALL Records?
If so, pick the best one (usually the smallest one) as the Primary Key.
If not, add an artificial (surrogate) Primary key – a standard ID field with a number that increases for each new row.
66
Database Fundamentals
Next Week: Conceptual Design - Unified Modelling Language (UML)
Lavf57.36.100
/docProps/thumbnail.jpeg