Chapter 1
Database Systems Infrastructure
Week 1 – Week 5
Week 6
Week 8 – Week 9
Copyright © 2012, SAS Institute Inc. All rights reserved.
This diagram covers infrastructure for database systems, data warehouse and big data. We cannot cover everything in this course, and we can only touch the surface of infrastructure as shown.
You will learn about normalisation design of the database using ERD, i.e. Entity-Relationship Diagram.
You will learn how to use Oracle to create an ERD. You will learn SQL or Sequel in the Lab workshops.
We will touch on how you can design a Data Warehouse using Stars schema
In the last two weeks we will talk about big data system management and its infrastructure. We will briefly discuss Hadoop, MapReduce, and Sparks behind the building of the big data.
INFS5710 Week 1
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
So, where do you find database? Today, data is everywhere. It is not limited to numbers; you can also have text, or even taking a photo is data.
If you look at a day Susan’s life, she spends time on using social media such as Facebook, Instagram, Tik Tok, Twitter, WeChat, QQ, WhatsApp, and so on. It will create tonnes and tonnes of data over times. It can include text, pictures, videos and so on. After work, when she goes to do some shopping, data will be captured by the supermarket, and credit card she uses will know where, when and how she spends her money. The marketing people like to know as a customer what she spends on, so they target her with new products and promotions in the future.
All these databases have their purpose.
INFS5710 Week 1
Data versus Information
Data
Information
Raw facts
Have not yet been processed to reveal their meaning to the end user
Building blocks of information
Produced by processing raw data to reveal its meaning
Requires context
Bedrock of knowledge
Should be accurate, relevant, and timely to enable good decision making
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Data consists of raw facts whereas Information is the result from processing raw data to reveal its meaning. Information creates knowledge and it is the foundation of decision making.
For example, data can be your marks for your quiz, marks for your assignment, and so on. For your other course such as e-commerce, you also have marks for quiz, participation, exam and so on. These are all data.
Information is when you put everything together in a report or a certificate, such as adding all your marks for this course, adding all your marks for e-commerce. We then give you a grade based on your mark for each of the courses you have done. You also have WAM, and this will tell you and your potential employer how you did at UNSW.
As for the point of decision making, using the example of your grades. For example, when you apply for a job, the potential employer will look at your certificate to decide if they will employ you along with your interview and so on.
INFS5710 Week 1
Introducing the Database
Shared, integrated computer structure that stores a collection of:
End-user data – Raw facts of interest to end user
Metadata: “Data about data”, which the end-user data are integrated and managed
Describe data characteristics and relationships
Database management system (DBMS)
Collection of programs
Manages the database structure
Controls access to data stored in the database
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Metadata is “data about data”, that is you describe data characteristics and relationships. This is important when you receive a set of data, you want to know what the numbers mean. If not, you have to ask what the numbers mean. For instance, you have a few dates on a table or in a Spreadsheet, you want to know what each date means. When you do analysis, you have to use the correct dates.
Database Management System or DBMS is the management of data of how and where the data are stored, how the data can be retrieved, and who can retrieve the data.
INFS5710 Week 1
Database
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
This is the first infrastructure you encounter in this course.
Infrastructure is important for data analytics because it is built as the foundation for storing and managing data. The infrastructure can be viewed as physical as well as logical. For instance, looking at the diagram above, a PC can logically represent as a normal PC, a laptop or even a tablet or a mobile phone.
This is the infrastructure of a normal database system, i.e. the DBMS system, which manages the interaction between the users on the left-hand side and database on the right-hand side.
So how does it work? An end user on the left-hand side wants to find who bought this new product such as a new Coca Cola Vegemite flavour? Vegemite is a popular spread for kids in Australia if you don’t know what it is.
A request is sent to DBMS (Database Management Systems), which, in turn, processes the request and send a request using SQL language to retrieve all the customers who bought this new Coca Cola Vegemite flavour. You will start learning SQL language in this week’s workshop.
Once the data is received, it will send back to DBMS which then processes and send back to the end user. This usually becomes information. The users might use the information in their decision making.
in the textbook, it talks about different types of databases. It covers single user database, and multi-user database date. It also talks about structured data, and unstructured data. Have a read in the textbook.
INFS5710 Week 1
Why Database Design is important?
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Why Database Design is important?
If you look at the diagram, what do you see? You see many numbers and text. You have to make sense what they mean. It might look very confusing for the first time.
Let’s go into details – if you look at row #9 and row 10, why are they blank? Next, look at the Name column, do you need to separate into surname and first name or are they mixed? For example, I want to find every employees with surname “Smith”, or surname “Zhang”.
If I want to find all employees with skills for “Basic Database Manipulation”, you have to go to column Skill1 to look for “Basic Database Manipulation”, and then in column Skill2 for “Basic Database Manipulation”, and then in column Skill3. So, is there a better way to do this?
INFS5710 Week 1
1
2
3
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
The slide shows what you will define as a good database design. You have three tables instead of one table. You will learn how to do this in ERD and Normalisation which will cover in the next few weeks.
Instead of having one gigantic table, we now have three smaller and different tables namely employee, skill, and certified. Now, if you want to find which employee who did the course Basic Database Manipulation, you can find all the employees much easier using these tables.
In the Skill table, basic Database Manipulation is Skill_Id 210. Now let’s go to Certified table, Skill_id is the middle column. For 210, we have two 06234 and 09382. Now, we can go to Employee table; you can see 06234 is Jasmine Patel and, 09382 is Jessica Johnson.
If you want an easier way, you will write SQL statements to link these tables together to get the results. This is something you can look forward to doing in the lab.
INFS5710 Week 1
Database Design
Focuses on the design of the database structure that will be used to store and manage end-user data
Well-designed database
Facilitates data management
Generates accurate and valuable information
Poorly designed database causes difficult-to-trace errors
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
In summary, to generate an accurate and valuable information, you need to have a well-designed database.
This is all I want to cover for Chapter One. You may want to read other topics I didn’t cover in lecture. You may want to read through the file system. SAS is based on a file system because it started in the 1970s,, that is when computer hardware was very expensive!
INFS5710 Week 1
Data Modeling and Data Models
Model – Abstraction of a real-world object or event
Data modeling: Iterative and progressive process of creating a specific data model for a determined problem domain
Data models: Simple representations of complex real-world data structures
Useful for supporting a specific problem domain
What would be the “data model” for UNSW like?
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Data modelling is to model and translate the business requirements into a data model they can be used to store data which to business can use.
Data model is designed to capture data in the real world. The data can then be used in the business.
So, what would be the data model like for UNSW?
Firstly, you have to think of what data you need to capture. The university needs to capture your details as a student – your name, your address, your telephone number, your gender, and your past academic records.
The university will give you a unique Student Id which identifies who you are. When you do a course like INFS5710, it will capture your marks, your attendance records, your assignment marks, your exam results and your final mark. The system will store all the courses you have attempted, passed or failed.
The system also needs to schedule a class, i.e. where and when INFS5710 class will take place. Can it fit all the students in one class?
Actually, the UNSW system is made up of several systems. Each system has its own data model, and they all linked together. For example, you can have a student system, generate profit and loss statements from the financial system, new student recruitment system, marketing system, student payment system, human resource and payroll systems, university shop rental systems, room booking system, library systems and so on and so forth. All these Systems are joined together; that is all different data models are actually linked together to create one complex data model.
So, a data model can be very complex. The UNSW systems took years to develop, and the systems will continue to be modified. Thus, the data models will be modified to fit the future needs. For instance, we now have online system as well as face-to-face classes for the same course and are running at the same time.
INFS5710 Week 1
Importance of Data Models
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
The importance of data models…
However, getting the data model right or near what it supposes to do is not easy. People sometimes deliberately do not want to tell you some facts. Alternatively, an IT personnel did not ask the right question, or the question is misinterpreted. On the other hand, it might be the question is so simple that no one has thought about it.
For example, in one of the projects when I was working in the industry. The project is about transferring data between two different systems. A document circulated and went through by a few consultants and a few people until we had a telephone meeting, I asked a question, ”how do I know if the dollar amount is a credit or a debit, i.e. positive or negative…” There was a silenc, for about a minute, I then asked, “hello, anyone there?” I said to my boss, ”the connection must have dropped out.” A voice came from the other end, “No, we are still here. We did not look at that, and we need to go and check, and then come back to you…”
INFS5710 Week 1
Are a communication tool
Give an overall view of the database
Organize data for various users
Are an abstraction for the creation of good database
Data Model Basic Building Blocks
Entity: Unique and distinct object used to collect and store data
Attribute: Characteristic of an entity
Relationship: Describes an association among entities
One-to-many (1:M)
Many-to-many (M:N or M:M)
One-to-one (1:1)
Constraint: Set of rules to ensure data integrity
people, thing, event, …
Consider the following entities: professors, students, courses, departments, research centres, etc.
Violation examples:
Enter an SID to STUDENT table, it finds two students (Entity integrity)
STUDENT table says that student X’s department code is Y. But in the DEPARTMENT Table, there is no code Y. (Referential integrity)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
They are two parts of database design. One is logical, and the other is physical. In the past, it might take a few months to do the logical design, and then it might take 1 to 2 months to do the physical design. It will take time to make any changes. However, today, logical design and physical design are linked such as Oracle SQL developer. At the initial design phase, this makes life much easier as you can easily change the logical design, and it will reflect in the physical design.
The basic building blocks of all data models are entities, attributes, relationships, and constraints. If I were to ask you to design a data model, I will get 50 (assuming 50 students here) different designs and notations, and the ones which are the same is because the person sits next to you copies yours.
An entity is a person, place, thing, or event about which data will be collected and stored; for example, a student entity.
An attributes is a characteristics of an entity; for example, for our student entity, you have attributes such as a student last name, student first name, student address, student status, for instance, a local student or an international student.
A relationship describes an association among entities. In our student example, you have one student belongs to one faculty only, such as the Business School, or one student has to do several courses to be graduated. The relationship, we have cardinalities such as one-to-many, many-to-many. and one-to-one.
For example, for the case of one-to-one, you can only belong to one faculty. One-to-many can be one student can enrol in many courses.
As for many-to-many, it is something you must address in your logical data modelling because you cannot save the data in the physical form.
Many-to-many must be resolved at the logical design stage, and we will come back later when we do the Entity Relationship Diagram or ERD.
A constraint is a restriction put in placed on the data. Constraints are important because they help to ensure data integrity. In our student example, we can have a grade of High Distinction which is above 85%, a Distinction grade is between 75% and 84%, a Credit between 65 to 74%, and a pass above 50%,. And below that you fail the course. The data integrity here is to ensure all the grades are matched to the marks.
INFS5710 Week 1
Business Rules
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
We have business rules to design a data model!
INFS5710 Week 1
Brief, precise, and unambiguous description of a policy, procedure, or principle
Enable defining the basic building blocks
Describe main and distinguishing characteristics of the data
Translating Business Rules into Data Model Components
Nouns translate into entities
Verbs translate into relationships among entities
Relationships are bidirectional
Questions to identify the relationship type
How many instances of B are related to one instance of A?
How many instances of A are related to one instance of B?
Each student can take at most 3 courses each semester.
Each research student must have two supervisors at UNSW.
1:1, 1:M, or M:N
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Nouns such as student or course can be translated into entities. For example, Each student can take at most 3 courses each semester.
Verbs such as take can be a relationship.
When it comes to relationship type, it can be one of the three – 1:1, 1:M, or M:N. You have to work it out yourself based on the three business rules. The business rules drive the relationship between entities.
For example, if you have two entities, one is taxi driver and the other one is car, i.e. taxi and car.
If a taxi driver can only drive one car only and cannot drive another car, then the relationship between the entities is one to one (1:1)
However, if the taxi driver can drive any cars, then the relationship between taxi drive entity and car entity is one to many (1:M)
INFS5710 Week 1
Naming Conventions
Entity names – Required to:
Be descriptive of the objects in the business environment
Use terminology that is familiar to the users
Attribute name – Required to be descriptive of the data represented by the attribute
Proper naming:
Facilitates communication between parties
Promotes self-documentation
STUDENT, EMPLOYEE, DEPARTMENT, etc.
STUDENT: SID, DOB, GENDER, FIRST_NAME, LAST_NAME, …
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Companies usually have standard related to naming conventions.
INFS5710 Week 1
The Relational Model (Ch 3)
Produced an “automatic transmission” database that replaced “standard (manual) transmission” databases
Based on a relation
Relation or table: Matrix composed of intersecting tuple and attribute
Tuple: Rows
Attribute: Columns
Describes a precise set of data manipulation constructs
By Codd in 1970
Can you see why a “table” reflects a “relation”?
Each intersecting tuple crossing attributes 1, 2, …, n means this “entity” meets
the values of these attributes in the tuple.
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
The relation model was introduced in 1970 by E.F. Codd. He worked at IBM when he wrote the paper on relational model.
The relational model is based on mathematics and it is break through for the end users and database designers.
To use an analogy, the relational model produced an automatic transmission database to replace the standard transmission databases that was before it.
The conceptual idea of having a relation changes the way database is designed.
Till today, we still use relational models but now we have other big data models because we have too much data. We will cover relational model in more details next week.
Read the textbook, it explains more…
INFS5710 Week 1
Tuple
Entity !
in Relational DB
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
Here we have two tables. In the first diagram we have agent code links two tables, namely table agent and table customer, together.
The second diagram shows the relation between the two tables.
Agent code is a primary key or unique in table Agent, and agent code in the customer table links to agent.
In Figure 2.2, you can see an agent code must exist in agent table before it can be stored in customer table.
As for the cardinality, you can see 1 next to Agent Code in Agent table and many in customer table. The reason is you can only have one unique agent code in agent table, but in the Customer table, you can have many same agent code stored many times.
INFS5710 Week 1
The Entity Relationship Model (Ch 4)
Graphical representation of entities and their relationships in a database structure
Entity relationship diagram (ERD)
Uses graphic representations to model database components
Entity “instance” or entity “occurrence”
Rows in the relational table
Connectivity: Term used to label the relationship types
An entity is a table in ER model!
A relationship refers to the associations between tables.
By Chen in 1976
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
As for ER model, we will chat about it next week.
INFS5710 Week 1
Figure 2.3 – The ER Model Notations
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
There are a few notations around. The book uses Crow’s foot notation, so we will use that notation. My background is using Chen Notation. There is other notations in Oracle which you will expose in the lab in a few weeks time.
INFS5710 Week 1
The Evolution of Data Models
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
In the textbook, it explains very well what these data models are, what is a hierarchical, what is a network and what is a NoSQL. We will focus and discuss more on relational, entity relationship, and NoSQL in the last few weeks.
INFS5710 Week 1
Levels of Data Abstraction
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
From the definition, Data abstraction is the reduction of a particular body of data to a simplified representation of the whole.
Abstraction, in general, is the process of taking away or removing characteristics from something in order to reduce it to a set of essential characteristic.
In the textbook, it talks about how a car designer designs a car.
An engineer then designs to see how it can be transferred from concept of the car to a car to be produced.
The engineer needs to design specifically down to the screw and size of the light bulb to be used in the car.
You can see from the diagram the viewpoint of getting the business rules form the end user all the way to designing a physical model.
As for the data abstraction, it is high at the end users and when it comes down to a low level at the physical model.
INFS5710 Week 1
Physical Model
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
This shows how the model is transferred from one level of model to another. From External model to Conceptual mode to internal model and then to physical model. In Oracle, you can do that.
Nowadays, in reality, due to time constraints and money, conception model, will be built as the first model and then click of a button, it translates into a physical model.
INFS5710 Week 1
Oracle
Oracle
Database
Flat Files
Machine Learning
Prepared by Vincent Pang, Feb. 2021 Note: In-Memory Database (e.g. SAP Hana) is an alternative data model not shown here
ETL (Data
Cleansing)
Data (DW to BD, or
vice versa or both)
RelationalDatabaseBig DataDataWarehouseNormalisationReporting (Business Intelligence and Visualisation) and Business Analysis (End Users)(De-Normalised)External Data (e.g. Excel)Hadoop Distributed File System (HDFS) and MapReduceEntity Relationship Model (ERM)Unstructured Data(Social Media)Structured Data (Internet of Things (IOT))HadoopSQLSQLSpark and NoSQL (and other tools)Data StreamingData Streaming(“Not Normalised”)External Data (e.g. Excel)Star Schema