How this course is designed
● Practical, hands on approach
● Do not memorize! Practice, practice, practice!!!
4
Course Schedule
5
Module Topic
1 Introduction to Database Systems
2 SQL Basics
3 Entity-Relationship (E-R) Model
4 Intermediate SQL
5 Advanced SQL
6 Database Design and Normal Form
7 ** Midterm Exam **
8 SQL for Python and R, Data Wrangling and ETL
9 Database Systems for Big Data Analytics
10 Database Systems for Big Data Analytics (cont.)
11 Data Warehouses
12 Database Systems for Business Intelligence (BI), Metabase
13 Final Project Presentations
Textbook
Required:
● Silberschatz, A., Korth, H. F., and Sudarshan, S . (2019).
Database System Concepts (7th Edition). McGraw-Hill.
ISBN-13: 9780078022159
Optional:
● Kimball, R., and Ross, M. (2013). The Data Warehouse
Toolkit: The Definitive Guide to Dimensional Modeling.
John Wiley & Sons. ISBN-13: 978-1118530801
● Perkins, L., Redmond, E., and Wilson, J. (2018). Seven
Databases in Seven Weeks (2nd Edition). The Pragmatic
Programmers. ISBN-13: 978-1680502534
● Visochek, A. (2017). Practical Data Wrangling: Expert
techniques for transforming your raw data into a valuable
source for analytics. Packt Publishing Ltd. ISBN-13:
978-1787286139
6
Evaluation/Grading
7
Assignment Weight (%)
Class Session Attendance and Participation 5
Homework Assignments 40
Midterm Exam 25
Project Checkpoints 10
Class Project 20
● Contact CUIT for Canvas/Zoom support.
○ Note: Unless we are notified by IT for possible service
interruptions, we will not accept “technical difficulties” as an
excuse for not submitting your work on time
Lecture Format
8
● Hands-on Sessions due end of day Monday @ 10pm
● Webcasts with 3 pt short quizzes
● 6:10 pm – 8:00 pm – Lectures on Tuesday
● Please be on-time, present for the entire class
● Pay attention, ask/answer questions
● You are expected to read ahead, reading assignments are
due before the class session.
Ask questions!
Homework Assignments
9
● 7 Homework assignments due
● Released right after the class session
● One week to attempt, plan accordingly if you need to ask questions
● Solutions released shortly after due date/time
● First homework is in quiz-format, the remaining are .sql, .py, .R or
.txt file submissions
● No late submissions, no exceptions, no make-up HW
● 40% of course grade
● Will require you to practice beyond what is presented in class session
Midterm Exam
10
● Covers everything in Modules 1 through 6 (including M6)
● Code, multiple choice, critical thinking
● Will require knowledge beyond what is presented in class session, but
always included in your main textbook
● 25% of course grade
Class Project
11
● You will get to work with unstructured data, design a schema, build a
database, populate a database, and build connectors to produce reports
and visualizations
● Group effort, teams will be randomly assigned, but members will be
individually graded
● Start after the midterm exam
● Weekly activities (checkpoints) due every week to help you stay on track
● Last week: presentations
● 20% of course grade
Policies
12
● Late assignments will not be accepted.
● Canvas has been programmed to release the solutions shortly after the
deadline.
● Falsifying attendance is a form of academic dishonesty, you risk being
expelled from the class and possibly the program.
● Submitting work that is not yourown is a form of academic dishonesty, you
risk being expelled from the class and possibly the program.
● Sharing assignment/exam solutions with others is a form of academic
dishonesty, you risk being expelled from the class and possibly the program.
● Be polite, courteous, and respectful.
Introduction
Chapter 1
Poll
14
How familiar are you with SQL and Relational Databases?
A. Expert: I use SQLalmost everyday
B. Advanced: I am comfortable with SQLbut not a pro
C. Beginner: I know the basics
D. Not a clue: What is SQL?
Database Management System (DBMS)
15
● “Databases” are more formally called
Database Management Systems (DBMS)
● DBMS facilitate efficient, secure and accurate
information storage and retrieval across multiple
users and platforms
● DBMS can be very large
● DBMS contain information about a particular
enterprise
○ Collection of interrelated data
○ Set of programs to access the data
○ An environment that is both convenient and
efficient to use
Database Management System (DBMS)
16
● Database systems are used to manage collections of data that are:
○ Highly valuable
○ Relatively large
○ Accessed by multiple users and applications, often at the same time
● A modern database system is a complex software system whose task
is to manage a large, complex collection of data.
Database Management System (DBMS)
17
● Enterprise Information:
○ Sales: customers, products, purchases
○ Accounting: payments, receipts, assets
○ Human Resources: Information about employees, salaries,
payroll taxes
● Manufacturing:
○ Management of production, inventory, orders, supply chain
● Banking:
○ Customer information, accounts, loans, and banking
transactions
○ Credit card transactions
Database Management System (DBMS)
18
● Finance:
○ sales and purchases of financial instruments (e.g., stocks and
bonds; storing real-time market data)
● Universities:
○ registration, grades
● Airlines:
○ reservations, schedules
● Telecommunication:
○ records of calls, texts, and data usage, generating monthly bills,
maintaining balances on prepaid calling cards
Database Management System (DBMS)
19
● Web-based services
○ Online retailers:
■ order tracking, customized recommendations
○ Online advertisements
● Document databases
● Navigation systems: For maintaining the locations of various places of
interest along with the exact routes of roads, train systems, buses, etc.
● Databases touch all aspects of our lives
University Database Textbook Example
20
● Application program examples
○ Add new students, instructors, and courses
○ Register students for courses, and generate class rosters
○ Assign grades to students, compute grade point averages (GPA)
and generate transcripts
● In the early days, database applications were built directly on top of
file systems
University Database Schema (textbook example)
22
Drawbacks of using file systems to store data
In the early days, database applications were built directly on top of file
systems, which leads to:
● Data redundancy and inconsistency: data is stored in multiple file
formats resulting in duplication of information in different files
● Difficulty in accessing data
○ Need to write a new program to carry out each new task
● Data isolation
○ Multiple files and formats
● Integrity problems
○ Integrity constraints (e.g., account balance > 0) become
“buried” in program code rather than being stated explicitly
○ Hard to add new constraints or change existing ones
23
Drawbacks of using file systems to store data (cont.)
● Atomicity of updates
○ Failures may leave databases in an inconsistent state with
partial updates carried out
○ Example: Transfer of funds from one account to another should
either complete or not happen at all
● Concurrent access by multiple users
○ Concurrent access needed for performance
○ Uncontrolled concurrent accesses can lead to inconsistencies
■ Ex: Two people reading a balance (say 100) and updating it
by withdrawing money (say 50 each) at the same time
● Security problems
○ Hard to provide user access to some, but not all, data
History of Database Systems
24
● 1950s and early 1960s:
○ Data processing using magnetic tapes for storage
■ Tapes provided only sequential access
○ Punched cards for input
● Late 1960s and 1970s:
○ Hard disks allowed direct access to data
○ Network and hierarchical data models in widespread use
○ Ted Codddefines the relational data model
■ Won the ACM Turing Award for this work
○ IBM Research begins System R prototype
○ UC Berkeley (Michael Stonebraker) begins Ingres prototype
○ Oracle releases first commercial relational database
○ High-performance (for the era) transaction processing
History of Database Systems (cont.)
25
● 1980s:
○ Research relational prototypes evolve into commercial systems
■ SQL becomes industrial standard
○ Parallel and distributed database systems
■ Wisconsin, IBM, Teradata
○ Object-oriented database systems
● 1990s:
○ Large decision support and data-mining applications
○ Large multi-terabyte data warehouses
○ Emergence of Web commerce
History of Database Systems (cont.)
26
● 2000s:
○ Big data storage systems
■ Google BigTable, Yahoo PNuts, Amazon,
■ “NoSQL” systems.
○ Big data analysis: beyond SQL
■ MapReduce
● 2010s:
○ SQL reloaded
■ SQL front end to MapReduce systems
■ Massively parallel database systems
■ Multi-core main-memory databases
A Familiar System… Spreadsheets
● Intuitive table-type structure
with columns and rows
● Manageable for few data but slow and
cumbersome for a lot of data
● Difficult for another user to “decipher”
complicated analyses
● Good to quickly produce charts
● Limited protection and data integrity
constraints
● Limited options to “plug-in” to other
applications
27
Levels of Abstraction
28
● Physical level: describes how a record (e.g., instructor) is stored.
● Logical level: describes data stored in database, and the
relationships among the data.
type instructor = record
ID : string;
name : string;
dept_name : string;
salary : integer;
end;
● View level: application programs hide details of data types. Views
can also hide information (such as an employee’s salary) for
security purposes.
Data Models
29
● A collection of tools for describing:
○ Data
○ Data relationships
○ Data semantics
○ Data constraints
● Relational model
● Entity-Relationship data model (mainly for database design)
● Object-based data models (Object-oriented and Object-relational)
● Semistructured data model (XML)
● Other older models:
○ Network model
○ Hierarchical model
The Relational Model
● Data is stored in various tables
● Example of tabular data in the relational model: Columns
or
Attributes
30
Rows
or
Tuples
Sneak Peak:
This is the instructor table from
the uni_small database. You can
get the same output with:
SELECT * FROM instructor;
DataDefinition Language (DDL)
31
● Specification notation for defining the database schema
Example:
CREATE TABLE instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2)
);
DataDefinition Language (DDL) (cont.)
32
● DDL compiler generates a set of table templates stored in a data
dictionary
● Data dictionary contains metadata (i.e., data about data)
○ Database schema
○ Integrity constraints:
■ Primary key (ID uniquely identifies instructors)
○ Authorization
■ Who can access what
Data Manipulation Language (DML)
33
● Language for accessing and manipulating the data organized by the
appropriate data model
○ DML also known as query language
● Two classes of languages:
○ Pure – used for proving properties about computational power
and for optimization
■ Relational Algebra
■ Tuple relational calculus
■ Domain relational calculus
○ Commercial – used in commercial systems
■ SQL (pronounced ess-que-ell, or sequel) is the most widely
used commercial language
Structured Query Language (SQL)
34
● The most widely used commercial language
● SQL is NOT a Turing machine equivalent language
● To be able to compute complex functions SQL is usually embedded
in some higher-level language
● Application programs generally access databases through one of
○ Language extensions to allow embedded SQL
○ Application program interface (e.g., ODBC/JDBC) which allow
SQL queries to be sent to a database
Database Design
35
The process of designing the general structure of the database:
● Logical Design – Deciding on the database schema. Database
design requires that we find a “good” collection of relation
schemas.
○ Business decision – What attributes should we record in the
database?
○ Computer Science decision – What relational schemas should we
have and how should the attributes be distributed among the
various relation schemas?
● Physical Design – Deciding on the physical layout of the database
Design Approaches
36
● We need to come up with a methodology to ensure that each of the
relations in the database is “good”
● There are two ways of doing so:
○ Entity Relationship Model (Chapter 7)
■ Models an enterprise as a collection of entities and
relationships
■ Represented diagrammatically by an entity-relationship
diagram:
○ Normalization Theory (Chapter 8)
■ Formalize what designs are bad, and test for them
37
Example
Assume this is the `faculty` table. Is there anything wrong with it?
A. It looks OK to me
B. The ID column has
inconsistent values
C. Information is
repeating for
building and budget
for some depts.
D. name column is
missing first names
Sneak Peak:
You can get the same output from the uni_small database with:
SELECT i.id, i.name, i.salary, d.dept_name, d.building, d.budget
FROM instructor i JOIN department d ON i.dept_name = d.dept_name;
38
Example
Assume this is the `faculty` table. Is there anything wrong with it?
A. It looks OK to me
B. The ID column has
inconsistent values
C. Information is
repeating for
building and budget
for some depts.
D. name column is
missing first names
Sneak Peak:
You can get the same output from the uni_small database with:
SELECT i.id, i.name, i.salary, d.dept_name, d.building, d.budget
FROM instructor i JOIN department d ON i.dept_name = d.dept_name;
Transaction Management
39
● What if the system fails?
● What if more than one user is concurrently updating the same data?
● A transaction is a collection of operations that performs a single
logical function in a database application
● Transaction-management component ensures that the database
remains in a consistent (correct) state despite system failures
(e.g., power failures and operating system crashes) and
transaction failures.
● Concurrency-control manager controls the interaction among the
concurrent transactions, to ensure the consistency of the database.
Database Architecture
40
The architecture of database systems is greatly influenced by the
underlying computer system on which the database is running.
● Centralized databases
○ One to a few cores, shared memory
● Client-server
○ One server machine executes work on behalf of multiple clients
● Parallel databases
○ Many core shared memory
○ Shared disk
○ Shared nothing
● Distributed databases
○ Geographical distribution
○ Schema/data heterogeneity
CHAPTER 2
Introduction to the
Relational Model
Attribute (Column)
42
● The set of allowed values for each attribute is called the domain of
the attribute
● Attribute values are (normally) required to be atomic; that
is, indivisible
● The special value NULL is a member of every domain. Indicates that
the value is “unknown”
● The null value causes complications in the definition of
many operations
Relation Schema and Instance
43
● A , A , …, A are attributes (columns)
1 2 n
● R = (A , A , …, A ) is a relation schema
1 2 n
Example:
instructor = (ID, name, dept_name, salary)
● Formally, given sets D1, D2, …, Dn a relation r is a subset of D1 x
D2 x … x Dn. Thus, a relation is a set of n-tuples (a1, a2, …, an)
where each ai ∈ Di
● The current values (relation instance) of a relation are specified by a
table
● An element t of r is a tuple, represented by a row in a table
Database Schema
● Database schema: the logical structure of the database.
● Database instance: a snapshot of the data in the database at a
given instant in time
Example:
● schema: instructor (ID, name, dept_name, salary)
● instance:
44
Relations are Unordered
● Order of tuples is irrelevant (tuples may be stored in an arbitrary
order)
● Example: instructor relation with unordered tuples
Sneak Peak:
This is the instructor table from
the uni_small database. You can
get the same output with:
SELECT * FROM instructor
ORDER BY random();
45
Keys
46
● Let K ⊆ R
● K is a superkey of R if values for K are sufficient to identify a unique
tuple of each possible relation r(R)
i.e: {ID} and {ID, name} are both superkeys of instructor.
● Superkey K is a candidate key if K is minimal
i.e: {ID} is a candidate key for instructor
● One of the candidate keys is selected to be the primarykey.
● Foreign key constraint: Value in one relation must appear in another
○ Referencing relation
○ Referenced relation
i.e.: dept_name in instructor is a foreign key from
instructor referencing department
Relational Query Languages
47
● Procedural vs. non-procedural, or declarative
● “Pure” languages:
○ Relational algebra
○ Tuple relational calculus
○ Domain relational calculus
● The above 3 pure languages are equivalent in computing power
● We will briefly introduce relational algebra
○ Not turing-machine equivalent
○ Consists of 6 basic operations
Relational Algebra
48
● A procedural language consisting of a set of operations that take
one or two relations as input and produce a new relation as their
result.
● Six basic operators:
○ Select:
○ Project:
○ Union:
○ Set difference:
○ Cartesian product:
○ Rename:
σ
Π
∪
−
x
ρ
Select Operation – Selection of Rows (Tuples)
● Given relation r
●
A B C D
α α 1 7
α β 5 7
β β 12 3
β β 23 10
A B C D
α α 1 7
β β 23 10
49
Select Operation – Selection of Columns (Attributes)
● Given relation r
●
A B C
α 10 1
α 20 1
β 30 1
β 40 2
A C
α 1
α 1
β 1
β 2
A C
α 1
β 1
β 2
=
50
Union of Two Relations (Tables)
● Given relations r, s
●
A B
α 1
α 2
β 1
A B
α 2
β 3
A B
α 1
α 2
β 1
β 3
s
r
51
SetDifference of Two Relations (Tables)
● Given relations r, s
●
A B
α 1
α 2
β 1
A B
α 2
β 3
A B
α 1
β 1
s
r
52
Set Intersection of Two Relations (Tables)
● Given relations r, s
●
Note:
A B
α 1
α 2
β 1
A B
α 2
β 3
A B
α 2
s
r
53
Joining Two Relations (Tables) – Cartesian-Product
● Given relations r, s
●
A B
α 1
β 2
C D E
α 10 a
β 10 a
β 20 b
γ 10 bA B C D E
α 1 α 10 a
α 1 β 10 a
α 1 β 20 b
α 1 γ 10 b
β 2 α 10 a
β 2 β 10 a
β 2 β 20 b
β 2 γ 10 b
r
s
54
Cartesian-Product – Naming Issue
● Given relations r, s
●
A B
α 1
β 2
B D E
α 10 a
β 10 a
β 20 b
γ 10 bA r.B s.B D E
α 1 α 10 a
α 1 β 10 a
α 1 β 20 b
α 1 γ 10 b
β 2 α 10 a
β 2 β 10 a
β 2 β 20 b
β 2 γ 10 b
r
s
55
Renaming a Table
56
● Allows us to refer to a relation, (say E) by more than one name.
ρ x (E)
returns the expression E under the name X
● Given relation r
● r x ρ (r)
s
A B
α 1
β 2
r.A r.B s.A s.B
α 1 α 1
α 1 β 2
β 2 α 1
β 2 β 2
Composition of Operations
● Can build expressions using multiple operations
●
●
A B C D E
α 1 α 10 a
α 1 β 10 a
α 1 β 20 b
α 1 γ 10 b
β 2 α 10 a
β 2 β 10 a
β 2 β 20 b
β 2 γ 10 b
A B C D E
α 1 α 10 a
β 2 β 10 a
β 2 β 20 b
57
Joining Two Relations – Natural Join
58
● Let r and s be relations on schemas R and S respectively. Then, the
“natural join” of relations R and S is a relation on schema R ∪ S
obtained as follows:
○ Consider each pair of tuples tr from r and ts from s
○ If tr and ts have the same value on each of the attributes in
R ∩ S, add a tuple t to the result, where:
■ t has the same value as tr on r
■ t has the same value as ts on s
Natural Join Example
● Given relations r, s:
● Natural Join:
A B C D E
α 1 α a α
α 1 α a γ
α 1 γ a α
α 1 γ a γ
δ 2 β b δ
A B C D
α 1 α a
β 2 γ a
γ 4 β b
α 1 γ a
δ 2 β b
B D E
1 a α
3 a β
1 a γ
2 b δ
3 b ε
r s
59
Notes about Relational Languages
60
● Each query input is a table (or set of tables)
● Each query output is a table
● All data in the output table appears in one of the input tables
● Relational Algebra is not Turing complete
● We can compute:
○ SUM
○ AVG
○ MAX
○ MIN
Summary of Relational Algebra Operators
61
Acknowledgements
62
● Some content adapted or modified from the course assigned
textbook and its relevant resources by:
Silberschatz, A., Korth, H. F., and Sudarshan, S. (2019). Database
System Concepts, 7th Edition. McGraw-Hill.
How this course is designed
Course Schedule
Textbook
Evaluation/Grading
Lecture Format
Homework Assignments
Midterm Exam
Class Project
Policies
Introduction Chapter 1
Poll
Database Management System (DBMS)
Database Management System (DBMS)
Database Management System (DBMS)
Database Management System (DBMS)
Database Management System (DBMS)
University Database Textbook Example
University Database Schema (textbook example)
Drawbacks of using file systems to store data
Drawbacks of using file systems to store data (cont.)
History of Database Systems
History of Database Systems (cont.)
History of Database Systems (cont.)
A Familiar System… Spreadsheets
Levels of Abstraction
Data Models
The Relational Model
Data Definition Language (DDL)
Data Definition Language (DDL) (cont.)
Data Manipulation Language (DML)
Structured Query Language (SQL)
Database Design
Design Approaches
Example
Example
Transaction Management
Database Architecture
Slide Number 41
Attribute (Column)
Relation Schema and Instance
Database Schema
Relations are Unordered
Keys
Relational Query Languages
Relational Algebra
Select Operation – Selection of Rows (Tuples)
Select Operation – Selection of Columns (Attributes)
Union of Two Relations (Tables)
Set Difference of Two Relations (Tables)
Set Intersection of Two Relations (Tables)
Joining Two Relations (Tables) – Cartesian-Product
Cartesian-Product – Naming Issue
Renaming a Table
Composition of Operations
Joining Two Relations – Natural Join
Natural Join Example
Notes about Relational Languages
Summary of Relational Algebra Operators
Acknowledgements