Introduction to Databases
CSC 343 Winter 2021
MICHAEL LIUT (MICHAEL.LIUT@UTORONTO.CA) ILIR DEMA (ILIR.DEMA@UTORONTO.CA)
DEPARTMENT OF MATHEMATICAL AND COMPUTATIONAL SCIENCES UNIVERSITY OF TORONTO MISSISSAUGA
Let’s look at the course syllabus!
What does this course look like?!
2
Topics
• Relational Model
• ER Model
• SQL
• Aggregation and Joins
• Constraints and Triggers • Relational Algebra
• Views and Indexes • Database Design
• Transactions
• Concurrency
• Intro to NoSQL and MongoDB (time permitting) • Hadoop vs. GFS vs. Cassandra (time permitting)
3
4
5
6
7
Big Data AND Trends
During a hurricane warning/prior to a hurricane occurring, Walmart found an increase in sales in:
Strawberry Pop-Tarts 7 times the “norm”
8
Big Data AND Trends
9
Data Science
Empirical Science: collect and systemize facts.
Theoretical Science: formulate theories and empirically test them.
Computational Science: run automatic proofs, run simulations.
Data Science: collect data and find patterns within the data. Think statistics meets mathematics meets
computer science and programming.
10
11
Question?
How do you think databases run your life?
12
How do databases run your life?
• Cloud Storage (e.g. Dropbox, Google Drive, iCloud, etc…)
o Where is the data? How is it categorized and quickly accessible?
• Online Streaming Applications (e.g. Netflix, YouTube, HBO Now, etc…)
o Generating lists of videos based on searches and tracking users preferences. “Recommended”
videos.
• Finances (e.g. Chequing/Savings Account, Stock Market, Credit Cards, etc…) o VISA processes an average of 150 million transactions per day.
13
How do databases run your life?
• Social Media (e.g. Instagram, Facebook, Twitter, etc…) o Storing personal information and multimedia content. o “Suggestions For You” or “People You May Know”.
Huge area of Data Analytics!
• E-commerce (e.g. Amazon, eBay, Alibaba, etc…)
o Online business that store and catalogue items.
o Organize their product’s details, pricing information, and sellers.
o Store users’ purchase history, payment information/preferences, and search history.
14
Data Analytics
• The science of examining and interpreting raw data to find patterns and deduce conclusions.
.
• Applying algorithms, mathematical techniques, and mechanical processes to form a conclusion about the information being analyzed.
• By 2020, there will be over $200 Billion spent annually in the ‘Big Data and Business Analytics’ market.
15
What is a Database?
Can you define this?
16
What is a Database?
Naively defined as…
◦ a collection of information that exists over a long period of time.
17
What is a Database?
DATABASE
A very large, integrated collection of data (i.e. records or files).
Models a real-world enterprise
◦ Entities (e.g. teams, games)
◦ Relationships (e.g. Barack Obama received The Nobel Peace Prize) ◦ Constraints (e.g. at least one doctor on duty during off-hours)
DATABASE MANAGEMENT SYSTEM (DBMS)
A software system designed to store, manage, and facilitate access to data.
18
Is the WWW a DBMS?
WWW = World Wide Web
Fairly sophisticated searches available ◦ Web Crawlers index pages
◦ Keyword-based search for pages
Currently data is unstructured and untyped Search ONLY
◦ Can’t modify the data
◦ Can’t get summaries or complex combinations of data
19
Is the WWW a DBMS?
Few (zero) guarantees provide for: ◦ Freshness of data
◦ Consistency across data items
◦ Fault Tolerance
Web sites (e.g. e-commerce sites such as Amazon or E-Bay) typically have a DBMS in the background to provide these functions.
20
“Search” vs. “Query”
What if you wanted to lookup all of the countries apart of the European Union (EU)?
Try “countries in the eu” in a search engine (e.g. Google)
21
Search
Based on keyword matching
◦ Our search matches countries that belong to the European Union (EU)
◦ Results are ranked based on: ◦ Popularity
◦ Reputation
◦ Paid Advertisements
◦ Web documents
◦ Limited structure
22
“Search” vs. “Query”
“Search” returns a document as is.
23
Query
A request of information from a Database.
◦ In a DBMS, a specialized language (Query Language) is used.
The ease of which this information can be obtained from a database often determines its value to a user.
The questions posed in a Query are generally designed for a more specific result than those in a search.
24
Query
Think of a University Database, some questions asked may be:
1. What is the name of the student with student ID #123456?
2. How many students are enrolled in CSC 343?
3. What fraction of students in CSC 343 received a grade better than B?
25
Is a File a DBMS?
Thought Experiment 1
You and a friend are both editing a file at the same time.
You and your friend both save the file at the exact same time. Whose change survived?
A) Yours B) Your Friends C) Both D) Neither E) Not A, B, C, or D
26
Is a File a DBMS?
Thought Experiment 2
You and a friend are updating a file The power goes out
Whose change survived?
A)All B)None C)AllSinceLastSave D)NotA,B,orC
Q: How do you write programs over a subsystem when it promises you No Options?
A: VERY, VERY CAREFULLY!!
27
Why Use a DBMS?
• Data independence and efficient access.
• Reduce application development time.
• Data integrity and security.
• Concurrent access, recovery from crashes.
28
Why Study Databases?
Shift from computation to information
◦ Always true for corporate computing
◦ Web made this point for personal computing ◦ More and more true for scientific computing
Need for DMBS has exploded!
◦ Corporate: retail swipe/clickstreams, “customer relationship management”, “supply chain management”, ”data warehouses”, Big Data, etc..
◦ Scientific: digital libraries, Human Genome Project, Sloan Digital Sky Survey, physical sensors, etc…
DMBS encompasses much of CS is a practical discipline ◦ OS, languages, theory, machine learning, logic
◦ Yet traditional focus on real-world apps
29
What is Intellectual Content?
Representing Information ◦ data modelling
Languages and Systems for Querying Data ◦ complex queries with realistic semantics*
◦ over massive data sets
Concurrency Control for Data Manipulation ◦ controlling concurrent access
◦ Ensuring transactional semantics*
Reliable Data Storage
◦ maintain data semantics* even if you pull the plug
*semantics: the meaning or relationship of meaning of a sign or set of signs.
30
Describing Data: Data Models
A data model is a collection of concepts for describing data.
A scheme is a description of a particular collection of data, using a given data model.
The relational data model is the most widely used model today. ◦ Main Concepts: relation, basically a table with rows and columns.
◦ Every relation has a schema, which describes the columns, or fields.
31
Data Independence
Applications insulated from how data is structures and stored.
Logical data independence
• Protection from changes in logical structure of data.
• i.e. the ability to change the conceptual (logical) schema without changing the external schema (user view).
• e.g. addition/removal of an entity or relationship. Physical data independence
• Protection from changes in physical structure of data. • e.g. hardware-level considerations, system designs, etc…
Q: Why is this particularly important for DBMS? A: Rate of change of DB applications are slow!
More Generally: dapp/dt << dplatform/dt
32
Concurrency Control
Concurrent execution of user programs: key to good DBMS performance. ◦ Frequent disk accesses.
◦ Keep the CPU working on several programs concurrently.
Interleaving actions of different programs: trouble!
◦ e.g., account-transfer and print statement at the same time.
DBMS ensures such problems don’t arise.
◦ Users/programmers can pretend they are using a single-user system (“Isolation”). ◦ Thank goodness! You don’t have to program “very, very carefully”.
33
Database Structure
Typically has a layered architecture.
The figure doesn’t show: ◦ Concurrency Control
◦ Recovery Components
Each system has its own variations.
DB
These layers must consider concurrency control and recovery!
Query Optimization and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
34
Why Don’t We Always Use a DBMS?
1. Expensive/complicated to setup and maintain
2. Cost and complexity must be offset by need
3. General-purpose, not suited for special-purpose tasks (e.g. text search!)
35
The ACID Approach
1. Atomicity: all changes take effect, or none do.
2. Consistency: the database is transferred from one valid state to another
valid state.
3. Isolation: the results of a transaction are invisible to other transactions until the transaction is complete.
4. Durability: once committed, the results of a transaction are permanent and survive future system and media failures.
36
Databases Make These Folks Happy...
DBMS Vendors and Programmers ◦ Oracle, IBM, Microsoft, ...
End-Users in many fields
◦ Business, Education, Science, ...
Database Application Programmers
◦ Build enterprise applications on top of DBMSs ◦ Build web services that run off DBMSs
37
Databases Make These Folks Happy...
Database Administrators (DBAs) ◦ Handle security and authorization
◦ Data availability and crash recovery ◦ Database tuning as needs evolve
Data Scientists and Analysts
38
Summary
DBMS used to maintain, query large datasets. ◦ Can manipulate data and exploit semantics
Other benefits include:
◦ Data Independence
◦ Quick application development ◦ Data integrity and security
◦ Recovery from system crashes ◦ Concurrent access
39
Summary
Levels of abstraction provide data independence ◦ Key when dapp/dt << dplatform/dt
In this course we will explore:
1. How to be a sophisticated user of DBMS technology.
2. What goes on inside the DBMS.
40
Citations, Images and Resources
Database Management Systems (3rd Ed.), Ramakrishnan & Gehrke
http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/ https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html http://www.nytimes.com/2004/11/14/business/yourmoney/what-walmart-knows-about-customers-habits.html https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/forums/t/13299/predicting-strawberry-pop-tarts http://truthaboutguns-zippykid.netdna-ssl.com/wp-content/uploads/2014/12/Strawberry_Pop_Tarts.jpg http://gurupk.com/wp-content/uploads/2016/03/GTSEO.png http://www.npr.org/sections/alltechconsidered/2016/06/24/480949383/britains-google-searches-for-what-is-the-eu-spike-after-brexit-vote https://www.gov.uk/eu-eea
http://www.mjhaccountants.co.uk/wp-content/uploads/cartoon-filing-cabinet-l-e4b53be1891574f1.gif http://www.cs.toronto.edu/~ryanjohn/teaching/cscc43-s12/lectures/c43-intro-v03.pdf http://www.tenouk.com/ModuleV_files/image002.png http://www.quoteslike.com/images/1480/love-girl-lyrics-and-leave-a-suggestion-at-the-bottom-of-the-page-SiPB6f-quote.jpg https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article
Data science process flowchart from "Doing Data Science", Cathy O'Neil and Rachel Schutt, 2013
41