CS计算机代考程序代写 SQL data mining ER information retrieval Java flex gui Excel database PowerPoint Presentation

PowerPoint Presentation

Database Fundamentals
Introduction to Databases

Database Overview
What is a database?
A handful of tables containing data that are used to retrieve information
A collection of potentially useful data that models real-world or conceptual objects and their relationships
Objects/Entities like People, items, Purchases, Students etc
Relationships like Jim purchased a book
Conceptual like songs, movies

It is managed by a Database Management System (DBMS)
This system may be manual or computerised
Card Catalogue vs modern Computerised equivalent

For us, the DBMS is software designed to store and manage data across one or more databases

A typical example of a database is a Library

Books = objects containing data
Catalogue system = DBMS
2

Database Overview

3

Problems with Traditional Physical Data Storage
Limited by Physical size
MS SQL will store ~2 trillion records in a fraction of the space!

Limited Multi-access
A borrowed book can be viewed by only one person

Limited search methods (Title, Author, Subject)
What if the data has 2 Authors?
What if the title starts with “The” or “As” (eg “As Time Goes By”)
What if I want to find a word in the Title?
What if you don’t want to search by title 1st word, Author or Subject?

Slow turn-around
Would you wait 7 minutes?
Google can pull up a book in < 7 seconds 4 Split anomalies up onto separate slides Show the example Good example what happens if there are two authors how do you represent that so you can search for it 4 Problems with traditional Physical methods Multiple indexes that must be rigorously maintained Anomalies: - What IFs What if a card is misplaced? An item is deleted in one index but not the others (Deletion Anomaly) An item is entered into one index but not the others (Insertion Anomaly) An item was incorrectly entered in all indexes but only corrected in one? (Update Anomaly!) What if a catalogue card is misplaced? What if there are different versions of the same record? Complex Query Limitations: How quickly can you count the number of books on a topic or by an author? What if a book is on many subjects – it must appear multiple times in the subject catalogue (prone to above anomalies) 5 Problems with Traditional Physical Data Storage Split anomalies up onto separate slides Show the example Good example what happens if there are two authors how do you represent that so you can search for it 5 Modern Data Storage Data can be stored in files such as text files, Excel sheets, Word documents. Advantages low cost (is this really true? – think time and effort!) No extra software/hardware is needed (or open source, standard Mac/PC), easy to use? 6 6 Problems with Modern Data Storage Data can be stored in files such as text files, Excel sheets, Word documents. Disadvantages Size - Retrieval of data is slow when the volume of data is large (but better than physical data) Updates/Synchronisation – files are hard to share. Multi-user updates often end up with multiple versions of the same file or overwriting someone's changes! Accuracy – Duplication is wasteful and can cause confusion (inconsistent data) Security – restriction to some fields is more difficult (eg TFN, studentID + username) Programming is required for automatic processing: different files require different programs for processing Inability to process complex queries (at least with any efficiency) 7 7 Modern Data Storage Did you know data is still stored on Tape? Tape is the most affordable form of data storage (costing < 10₵/GB ) Tape cartridges can store up to 6.25TB on a single cartridge More than 6 Peta Bytes (6 x1000TB) can be stored every 1.3m2 in a data centre Has lower error rates that HDD and less prone to viruses Lasts longer But ~90% of business data is rarely accessed again! - so why bother? Compliance with laws and company regulations Data Mining – maybe the data holds some hidden detail? 8 Why Store Data? So we can extract facts (information) and use it for decision making Car airbags kill 1 person for every 22 lives that they save A new baby usually deprives each of it’s parents around 350-400 hours of sleep in the first year On average, 100 people choke to death on ballpoint pens every year Annoy friends and family with facts inferred from data 9 Why Store Data? 10 Why Store Data? http://tylervigen.com/page?page=1 11 Why Store Data? http://tylervigen.com/page?page=3 12 Why Store Data? 13 Database Overview Why store data and not information? Data is simple re-usable facts Is non-specific and does not carry any meaning Can be processed, organised, analysed to give it meaning (to provide information) This makes data a valuable resource with a very long life cycle Data + Context = Information Different questions of the same data will provide different information This makes data more flexible Data can be reused! What can you find out? Data = the future Data management = increased productivity Better customer relations (sometimes!) Knowing how to manage and use data = $$ Do YOU own a store card? 14 Databases - The Data Model Data Model The data model dictates how the data will be structured and stored The Relational Model (most common) Introduced in the 70s Structured data stored in tables with relationships Good for up to 100TB (scales up) Alternatives exist and these will be covered in Enterprise Databases next year Objects, XML, Hierarchical, Network NoSQL/key-value and others Good for 100TB+ (scales out) Semi-structured data Source: wikipedia.org 15 15 Relational databases are good for up to 10-100TB of data. These types of databases scale up by increasing the size of the server or splitting the load among two or more servers (although this is complicated and far from perfect) They capture Structured Data – you need to know exactly what data you want to collect as it is difficult to modify the structure once it is implace. NoSQL scales out and can handle much larger datasets (peta bytes) it scales by adding whole new nodes or servers that operate in parallel and can perform load balancing. Database Overview How do we ask questions (query) of our database? Need a Structured Query Language to Ask Questions of Structured Data. Structured Query Language (SQL) Remember this you will use it forever!! A Structured language uses keywords in a logical pattern to perform tasks It’s easier to understand than other programming languages A DBMS (database management system) interprets SQL and actions it Determines what needs to be done, how and then returns the result 16 Databases in Computers – The DBMS A DataBase Management System performs all the maintenance and management functions for the data model: Managing disk space (shrink/expansion) Communication of data from the computer to human users (and vice versa) Managing the creation of tables (and various indexes) Managing the insertion, updating and deletion of data records – including performing any necessary checks A DBMS Performs complex query answering (searching and matching) of records With greater efficiency and accuracy than us! The DBMS and its users communicate using a mutually understandable language: SQL (Structured Query Language!) Popular free DBMS include: SQLite MySQL MS-SQL Express PostgreSQL Commercial: MS-SQL and Oracle, DB2 + the rest 17 Databases in Computers – DBMS Properties A DBMS is an application that manages collections of data: It can handle a large amount of data (more than RAM can hold) It can share the data between various applications and users Web interfaces, desktop/console applications incl. command line via the DBMS! A DBMS ensures Data is persistent (it can be retrieved once it is created) The data is reliable (it can be retrieved after hardware/software failures) It can also ensure data meets specific requirements (“constraints”: eg the student mark is between 0 and 100) Privacy is maintained (it controls access and manipulation) A DBMS must also Be Efficient (using the appropriate amount of resources, not flooding CPU/RAM) Preferably no 7 min queries please! 18 18 Structure of a DBMS Web Form Application Front End Management Studio SQL Interface SQL Commands Query Evaluation Machine DBMS Index files Data files System Catalogue Physical Database on Disk Files and Access methods Buffer manager Disk Space manager Recovery Manager Transaction Manager Lock Manager Adapted from Database Management Systems Ramakrishnan & Gehrke Operator Evaluator Plan executor Parser Optimiser Ensures that queries are carried out in the correct order. If a specific query fails the database data will not be left in an inconsistent state Controls concurrent access to records May reject access if a record is being accessed/modified by another user Uses logs to retain data integrity in the event of a system failure Ensures incomplete transactions are finalised or rolled back leaving data in the original state Also called Data Dictionary. Stores info about: names, types, size of data Names of constraints/relationships Names of authorised users Access statistics 19 19 The Transaction Manager ensures that queries (or actions involving data) are carried out in the correct order. It ensures that if a specific query fails (eg and update or insert) that the database will not have data left in an inconsistent state The Lock Manager is provided to control access to records when there are multiple users involved. If both users are updating a record, or one is trying to retrieve a record that is being modified the Lock Manager may restrict or reject access to the record The Recovery Manager Uses system logs created by the transaction manager to retain data integrity in the event of a system failure The System Catalogue is a data dictionary (metadata) that holds information about schemas, users, applications etc. It stores names, types, sizes of data items Names of relationships Names of authorised users Usage statistics etc The Lock Manager is used to help share the database resources amongst several users. It Why Study Databases? Gain knowledge and skills to interact with a DBMS Example: working in MS-SQL/Oracle Migrating existing data from legacy systems ($$ data is money! $$) Gain skills for using a database driven application Data insertion and deletion with or without GUI Data queries with or without GUI Why things work the way they do (or fail to perform up to expectations) 20 Database Fundamentals The Relational Model The dominant method for managing data in todays world 21 Relational Database Modelling Overview Conceptual Model Translation Process for Relational Model Conceptual Model Logical Translation (DBMS Tool) Relational DBMS Physical Implementation ER UML Week 2 Week 3 Week 3+ 22 Relational Database Modelling Overview Conceptual Modelling Database requirements are collected and visualised as a UML diagram (Unified Modelling Language) Uses higher level modelling to abstract away the complexities of implementation A graphical way of conceptualising data capture requirements to ensure the required data is collected by the database design! Some examples: Week 2 23 Relational Database Modelling Overview Logical Modelling The next phase is to create functional relational schemas based on the conceptual design. A written description of how the relational database will be implemented This includes deciding which candidate key(s) will become the primary key - the value used to locate a specific record (like your studentID or emailID) Example: Project(ProjectName, Budget, Manager) PK(ProjectName) FK(Manager)  Manager(Name) ProjectMembers (ProjectName, Employee) PK(ProjectName, Employee) FK(projectName)  Project(ProjectName) FK(Employee)  Employee(Name) Employee(Name, BirthDate) PK(Name) Manager(Name, BirthDate, Title) PK(Name) PK = Primary Key FK = Foreign Key Week 3 24 24 Relational Database Modelling Overview Physical Design Take the relational schemas and implement them using a DBMS Structured Query Language (SQL) is used to create the relational database and manage data within the database CREATE TABLE Employee ( Name varchar(200), BirthDate date, CONSTRAINT employeePk PRIMARY KEY (Name) ); CREATE TABLE Manager ( Name varchar(200), CONSTRAINT managerPk PRIMARY KEY (Name) ); CREATE TABLE Project( ProjectName varchar(200), Budget decimal(6,2), Manager varchar(200), CONSTRAINT projectPk PRIMARY KEY (ProjectName), CONSTRAINT managerFk FOREIGN KEY (Manager) REFERENCES Manager(Name) ); Week 3+ 25 26 Relational Concepts Jargon for Table Terminology Relational Model – Key Concepts In a Relational DB data is held in a series of Relations A Relation is a table that consists of columns and rows of data What makes a table with rows and columns a Relation? Within a Table every column name must be UNIQUE Within a Table every row must be UNIQUE (no duplicate data!) Every Row must have a unique Primary Key that can identify that data row only! Relational Name Common Name Alternative Relation Relation Table Table Attribute Column Field Tuple Row Record 28 28 Relational Model – Key Concepts A Relation defines a real world or conceptual object we collect information about When a relation is implemented in a DBMS, it is often called a Table A relational database consists of relations with distinct names Eg, a table that stores Customer Information, Bank accounts, Uni Courses An Attribute is a name for a column in a relational table Each Attribute must have a unique name in a given relational table Every attribute has a domain Eg, the Customer table has columns (attributes) for collecting the customer’s Family Name, First Name, Address, Age 29 29 Eg a relation A customer buys books An attribute Customer has a family name Domain Family name is a string of characters Schema is a group of relationships, i ) A relation is a set of tuples. When a relation is implemented in a DBMS, it is a table. Eg Employees, Departments (rectangles). Also referred to as a relation ii ) An attribute is a name for a column of a relation. Every attribute has a domain. Relational Model – Key Concepts A Domain dictates the acceptable values of an attribute E.g. int, decimal(5,2), varchar(200), bit We will refer to domains as Data Types (consistent with Programming Fundamentals) Eg. FirstName is a string of characters of varying length (varchar), or the customer age may be an integer number (int) A Table Schema defines the structure of a relational table in a relational database (table name + attributes) Customer(CustomerID, FirstName, FamilyName, Address, Age) Relation Name + attributes 30 30 Eg a relation A customer buys books An attribute Customer has a family name Domain Family name is a string of characters Schema is a group of relationships, i ) A relation is a set of tuples. When a relation is implemented in a DBMS, it is a table. Eg Employees, Departments (rectangles). Also referred to as a relation ii ) An attribute is a name for a column of a relation. Every attribute has a domain. Relational Database – Key Concepts Tuples are instances of a relation or entity and contain the actual raw data A Tuple is a row of data A relation consists of one or more unordered tuples We use special commands to order the records returned by the DBMS when necessary A Key is a set of one or more attributes whose values can uniquely identify a given row of data in a relation The Primary Key (PK) is a value that identifies a particular table row of data Candidate (name given to all keys) Unique Surrogate Natural 31 Relational Database – Key Concepts Terms and Notation studID emailID studName 50001 leej001 John Lee 50002 bump001 Peter Buman 50003 brod001 David Browns Attribute name Relation name Students schema a tuple Students studID (PK) emailID studName Students studID (PK)(varchar) emailID(varchar) studName(varchar) Equivalent Conceptual Design UML: Attribute domain (data type) Logical Schema: Students(studID, emailID, studName) Physical Implementation: 32 Cardinality = # of tuples of a given Schema Relational Instance = a set of tuples from the same schema (entity type) 32 Key Concepts More Keys than Diablo III Relational Database – Key Concepts Candidate Key(s) – a minimum combination of attributes that could become the Primary Key when designing a database If there is only one Candidate Key (CK) it is generally selected as the Primary Key Other times, the developer must choose which CK will be used as the Primary Key Primary Key – the key chosen from the candidates to be implemented in the relation All relations require a Primary Key (PK) It is a set of one or more attributes chosen to uniquely identify each tuple (or row) of data If you are given a PK, you can locate the specific data Sometimes it is an automatically generated number (eg the current row number in excel) 34 34 A key is a minimal set of attributes whose combination of values are unique in the table. Minimal means as few as possible attributes (often there is only one) The key is used to uniquely identify a specific tuple in the relation or table. No two tuples in the relation should have the same key value. Eg, a StudentID should be a key – each student has a unique studentID which is not repeated or replicated for any other student. Relational Database – Key Concepts Primary Key Every relation has a special attribute or a set of attributes whose values can be used to uniquely identify a single row of data (a tuple) It serves to distinguish one row from every other The value(s) must be unique for each row StudentID StudentName 1011 Jane Lane 1012 Jeff Wilson 1013 Jane Pain CourseID CourseName INFS1091 Proj Dev COMP2064 FIT LMIF3621 Math Student Course Unique value for every row 35 35 Relational Database – Key Concepts Composite key A Key (Primary or Other!) that is composed of more than one attribute Example shows a Composite Primary Key! Course Student Mark INFS1091 1011 87 COMP2064 1013 94 LMIF3621 1012 68 Enrolment Unique values for every row 36 36 Relational Database – Key Concepts A Unique Key is a set of attribute(s) whose values are always unique Any candidate key that is not selected as the Primary Key can be implemented as a Unique Key constraint in the DBMS The data is checked to ensure each entry is a unique value and does not already exist in other rows Natural Key is a minimum set of one or more attributes that exist in the table data that can be used or combined to uniquely identify each row: Bank account/Telephone numbers, email address etc, Name + Birth Date Surrogate Key – ultimate lazy approach to implementing a Primary Key An incremental number stored as an ID column in a table A randomly generated alpha-numeric value Has nothing to do with the table data 37 37 A key is a minimal set of attributes whose combination of values are unique in the table. Minimal means as few as possible attributes (often there is only one) The key is used to uniquely identify a specific tuple in the relation or table. No two tuples in the relation should have the same key value. Eg, a StudentID should be a key – each student has a unique studentID which is not repeated or replicated for any other student. What Key is That? UserName (PK) FirstName LastName davop001 Paul Davos adamj003 Jennifer Adams duffl001 Lenny Duff Candidate Key Natural Key Surrogate Key Unique Key Primary Key userName: 38 What Key is That? studentID (PK) userName FirstName LastName 10011 davop001 Paul Davos 10093 adamj003 Jennifer Adams 10178 duffl001 Lenny Duff Candidate Key Natural Key Surrogate Key Unique Key Primary Key studentID: Candidate Key Natural Key Surrogate Key Unique Key Primary Key userName: 39 Relational Database – Key Concepts Foreign Keys are attributes whose values are the Unique Key of another Table (primary or unique!) The PK values are copied and stored in the other table as a Foreign Key value Foreign keys allow one table to be linked to one or more tuples in another table. hence the term “Relational Database” Example - A course mark is of no use if it doesn’t relate back to a student and a course: Enrolment(StudentID) is a foreign key referencing Student(StudentID) Enrolment(CourseID) is a foreign key referencing Course(CourseID) PK FK PK FK 40 40 Foreign keys and other constraints are used to help fortify the accuracy of data collected. To prevent invalid data from entering the database there are several options: Define constraints to limit choices/entries to valid options Ask the DBMS to enforce the constraints so that no bad tuples can be input into the tables. EG, prevent a non-existant student being enrolled Prevent a student getting 110/100 or Good/Pass for a mark In this example we have an Enrolment table that borrows information from two other tables. Often when representing a foreign key, the attribute is named after the table from which the PK is borrowed. In this case, enrollment borrows the studentID from the student relation, hence the attribute “student” is used to reflect the studentID. Similarly, the courseID is borrowed from the course relation hence has been denoted as course. This is not entirely necessary and different naming conventions can be used. Some common conventions include: studentID could be used in both tables as the PK and FK fkStudentID could be used Student_ID could be used to denote the FK Regardless. A Foreign key MUST reference a PK or a attribute whose values are unique. The foreign key must always point back to only one record. Relational Database – Key Concepts Foreign Key Violation: Student 50002 took course 12550 which is not a valid code for any course. Course 12529 was taken by Student 50009 which is not a valid id for any student. Foreign Keys are used to prevent these scenarios from taking place These are “orphan” records that have lost their meaning Student StudID EmailID StudName 50001 leej001 John Lee 50002 bump001 Peter Buman Course CrsID CrsName 12529 Data modelling 12510 Java Programming Enrolment Student Course Mark 50001 12529 80 50002 12550 75 50009 12529 50 ?? ?? 41 A constraint is a restriction implemented by the DBMS to ensure the integrity of the tuple data being entered. It may involve ensuring values are provided It may involve ensuring a value is within a given range or of a specific data type. Foreign key constraints ensure that a record exists in another table. In this case student 50009 would not be able to be entered into the enrolment because it violates the FK constraint. FK constraints prevent the referencing of a record that does not exist. 41 Database Fundamentals Logical Schema Design Deciding what you need to capture and how it should be stored 42 Databases – Relational Model Schema Database Designs A Database design should capture all the required information Both current and future requirements More than one design may be plausible One design may be better than another A good design in one situation may perform poorly in another A good design should take into account how the data will be used Frequent user searches Auditing purposes Predictions/data mining 43 43 Database – Normalisation Database Designs Consider what data needs to be stored and how it will be used 44 Databases – Relational Model Schema Bad Database Designs can create data quality issues Tracking customer shopping habits Customer account No Customer Name Customer Address Products purchased Stores they shop Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName) acctNo custName custAddress product storeName 1101 Mary Jane 123 Lane Coffee maker Hardly Normal 3311 James 424 Sa Milk Foodville 1101 Mary Jane 123 Lane Knife and fork set Hardly Norma 1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things 2211 Bob 111 Drive Coffee maker Hardly Normal 3311 James 424 Sa Milk Coz 45 45 Databases – Relational Model Schema Bad Database Designs can create data quality issues Tracking customer shopping habits Customer account No Customer Name Customer Address Products purchased Stores they shop Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName) acctNo custName custAddress product storeName 1101 Mary Jane 123 Lane Coffee maker Hardly Normal 3311 James 424 Sa Milk Foodville 1101 Mary Jane 123 Lane Knife and fork set Hardly Norma 1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things 2211 Bob 111 Drive Coffee maker Hardly Normal 3311 James 424 Sa Milk Coz Redundancy Every time our user creates a new record They need to re-enter all this information again and again 46 46 Databases – Relational Model Schema Bad Database Designs can create data quality issues Tracking customer shopping habits Customer account No Customer Name Customer Address Products purchased Stores they shop Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName) acctNo custName custAddress product storeName 1101 Mary Jane 123 Lane Coffee maker Hardly Normal 3311 James 424 Sa Milk Foodville 1101 Mary Jane 123 Lane Knife and fork set Hardly Norma 1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things 2211 Bob 111 Drive Coffee maker Hardly Normal 3311 James 424 Sa Milk Coz Update Anomaly If our user notices a mistake in one record and corrects it, do they remember to correct all the others? 47 47 Databases – Relational Model Schema Bad Database Designs can create data quality issues Tracking customer shopping habits Customer account No Customer Name Customer Address Products purchased Stores they shop Example Logical Schema: custHabbits(acctNo, custName, custAddress, product, storeName) acctNo custName custAddress product storeName 1101 Mary Jane 123 Lane Coffee maker Hardly Normal 3311 James 424 Sa Milk Foodville 1101 Mary Jane 123 Lane Knife and fork set Hardly Norma 1101 Mary Jane 123 Lane Quilt set Quilts ’n’ Things 2211 Bob 111 Drive Coffee maker Hardly Normal 3311 James 424 Sa Milk Coz Deletion Anomaly If our user deletes this purchase record we loose all our information we had on Bob! 48 48 Databases – Relational Model Schema A Good relational schema design should take into account usage, avoid redundancies and anomalies Redundancy Information is recorded unnecessarily in multiple tuples (rows) Update Anomaly Information can be changed in one tuple leaving old information in another Insertion Anomaly A bad design that forces unrelated information to be recorded as well and this information may be wrong Deletion Anomaly Deletion of a record results in complete loss of other information 49 49 Databases – Relational Model Schema Example – A Better Design? Ensure a Relation only captures one type of information about an entity or object Question: Is the new design any good for looking up customer records? customer(acctNo, custName, custAddress) customerPurchases(acctNo, product, storeName) acctNo custName custAddress 1101 Mary Jane 123 Lane 3311 James 424 Sa 2211 Bob 111 Drive acctNo product storeName 1101 Coffee maker Hardly Normal 3311 Milk Foodville 1101 Knife and fork set Hardly Normal 1101 Quilt set Quilts ’n ’Things 2211 Coffee maker Hardly Normal 3311 Milk Coz customer customerPurchases 50 50 Student Course Room Mary CS145 B01 Joe CS145 B01 Sam CS145 B01 .. .. .. If every course is in only one room, contains redundant information! A poorly designed database causes anomalies: If we update the room number for just one tuple (Update anomaly) Suppose everyone drops the course suddenly… we lose information about where the course is! (Delete Anomaly) We may not be able to create a room reservation without students. Need to know every detail (Insert anomaly) Databases – Relational Model Schema http://web.stanford.edu/class/cs145/ 51 Database – Normalisation Normalisation is the process of decomposing large relational schemas into smaller more efficient schemas Smaller schemas that contain only one copy of any given record Avoids issues that result from having multiple copies of the same record (typos etc) Smaller schemas that contain only relevant data Smaller schemas that are more easily searched Normalisation prevents update anomalies and data inconsistencies Works well where data is updated frequently but at the expense of information retrieval We will see this in coming weeks 52 Database – Normalisation The Problem of Un-Normalised Data: Imagine the table has 1000s of rows How do you find the franchises with stores in London? How do you ensure data integrity if “storeLocations” is an unlimited text field? Can you prevent a person writing Londn or L’dn intentionally or by accident? franchiseID storeLocations 101 Chicago, London, New York, Paris, San Francisco 102 Seattle 103 London, Munich, Paris, Athens 104 New York, London 53 Database – Normalisation The Problem of Poorly Normalised Data: What if there are stores in other cities? How do you get a list of the cities for a particular store? You’d need to check each column to see if it contained data! FranchiseID City1 City2 City3 City4 City5 101 Chicago London New York Paris San Francisco 102 Seattle NULL NULL NULL NULL 103 London Munich Paris Athens NULL 104 New York London NULL NULL NULL 54 Database – Normalisation How should we represent this information? FranchiseID City1 City2 City3 City4 City5 101 Chicago London New York Paris San Francisco 102 Seattle NULL NULL NULL NULL 103 London Munich Paris Athens NULL 104 New York London NULL NULL NULL 55 Database – Normalisation How should we represent this information? FranchiseID City1 City2 City3 City4 City5 101 Chicago London New York Paris San Francisco 102 Seattle NULL NULL NULL NULL 103 London Munich Paris Athens NULL 104 New York London NULL NULL NULL Franchises 101 102 103 104 … Cities Chicago London New York Paris … FranchiseCities 101 Chicago 101 London 101 New York … … 103 … London 56 Database – Normalisation 1st Normal Form All tables should be “flat” All occurrences of a record type must contain the same number of fields All values in a given column must be of the same data type (what does this mean?) A value should NOT be composed of multiple values Student Courses Mary {CS145,CS229} Joe {CS145,CS106} … … Student Course Mary CS145 Mary CS229 Joe CS145 Joe CS106 Not in 1st NF 1st NF Equivalent http://web.stanford.edu/class/cs145/ 57 Database – Normalisation 1st Normal Form All tables should be “flat” All occurrences of a record type must contain the same number of fields All values in a given column must be of the same data type (what does this mean?) A table should have NO repeating groups of values Student Course1 Course2 Mary CS145 CS229 Joe CS145 CS106 … … Student Course Mary CS145 Mary CS229 Joe CS145 Joe CS106 Not in 1st NF 1st NF Equivalent http://web.stanford.edu/class/cs145/ 58 Database – Normalisation Student Postcode Suburb Course CourseName Room Mark Mary 5000 Adel CS145 FIT B01 D Mary 5000 Adel CS229 PF B01 C Jeff 5001 Adel CS229 PF B01 C Jane 5092 Modbury Nth CS106 DBF B02 HD Joe 5092 Modbury CS145 FIT B01 P Joe 5092 Modbury CS106 DBF B02 HD How can we improve this? Student Enrolments 1st NF Student: Student → Suburb + Postcode Enrolments: Student + Course → Mark Courses: Course → Room Optional: Suburbs: Suburb → Postcode ?? Separate information into relations that represent real-world or conceptual objects Student, Courses, Enrolments, Suburbs? Look for attributes that depend on the values of other attributes 59 Database – Normalisation How can we improve this? Now: Student → Postcode, Suburb The Student always determines the Suburb and the Postcode Neither the Suburb or the Postcode is always unique Student Postcode Suburb Mary 5000 Adel Jeff 5001 Adel Jane 5092 Modbury Nth Joe 5092 Modbury Student 60 60 Database – Normalisation Course CourseName Room CS145 FIT B01 CS229 PF B01 CS106 DBF B02 Now: Student + Course →Mark Course → CourseName, Room The Course the student enrols in determines the mark The Course always determines the room Student Course CourseName Room Mark Mary CS145 FIT B01 D Mary CS229 PF B01 C Jeff CS229 PF B01 C Jane CS106 DBF B02 HD Joe CS145 FIT B01 P Joe CS106 DBF B02 HD Student Course Mark Mary CS145 D Mary CS229 C Jeff CS229 C Jane CS106 HD Joe CS145 P Joe CS106 HD How can we improve this? Student Enrolments Enrolment Courses 61 61 Database – Normalisation 62 (Original mega table) Database – Normalisation How can we improve this? Now: Student → Suburb Suburb → Postcode The Student always determines the Student Name and address The Suburb always determines the Postcode 3rd NF Equivalent Student Suburb Mary Adel Jeff Adel Jane Modbury Nth Joe Modbury Suburb Postcode Adel 5000 Modbury Nth 5092 Modbury 5092 Student Student Suburbs This example does not work for all suburbs! 63 Student Postcode Suburb Mary 5000 Adel Jeff 5001 Adel Jane 5092 Modbury Nth Joe 5092 Modbury 63 Database – Normalisation How can we improve this? Now: Student → Postcode, Suburb The Student always determines the Suburb and the Postcode Neither the Suburb or the Postcode is always unique Student Postcode Suburb Mary 5000 Adel Jeff 5001 Adel Jane 5092 Modbury Nth Joe 5092 Modbury Student Postcode Suburb Mary 5000 Adel Jeff 5001 Adel Jane 5092 Modbury Nth Joe 5092 Modbury Student Student Neither the Postcode or Suburb can uniquely identify the other: Sometimes Adelaide is 5000, sometimes it is 5001 Sometimes 5092 is Modbury sometimes it is Modbury North 64 64 Database – Normalisation Original “mega” Logical Schema: Student(Student, Postcode, Suburb, Course, CourseName, Room, Mark) PK(Student) Relations should represent real world or theoretical objects A relation should represent one type of object or a relationship between two objects, not every everything you think you need to capture New Logical Schemas: Student(student, postcode, suburb, postcode) PK(student) Courses(course, courseName, room) PK(course) Enrolments(student, course, mark) PK(student, course) Is a student name a good Primary Key (PK)? A better schema using a surrogate key: Students(studentID, studentName, suburb, postcode) Courses(courseID, courseName, room) Enrolments(studentID, courseID, mark) 65 Note the inconsistencies with Naming here. You need to decide if you label your tables using Plural or Singular naming and then stick with it! 65 Logical Schema Design So what does this all mean? When you are designing a database you should: Create a List of all the real-world AND conceptual objects you need to store data about To each of those items, list the attributes required to capture the desired data – ALL of them! Check that each of your objects only captures only relevant information to that object (relevant to that object and only to that object) Check to see if you have any candidate keys Do they apply to ALL Records? If so, pick the best one (usually the smallest one) as the Primary Key. If not, add an artificial (surrogate) Primary key – a standard ID field with a number that increases for each new row. 66 Database Fundamentals Next Week: Conceptual Design - Unified Modelling Language (UML) Lavf57.36.100 /docProps/thumbnail.jpeg