CS W186
Introduction to Database Systems
Prof. Joe Hellerstein
Operated this semester by:
Prof. Josh Hug
Lakshya Jain
Essential Queries
Why take this class?
What is this class all about?
Who is running this?
How will this class work?
Why? Reason #1: Utility
This class is very, very useful
Data processing backs essentially every app
Databases of one form or another back most apps
The principles taught in this class back nearly everything in computing
Where shall I eat, Database?
Each ratings star added on a Yelp restaurant review translated to anywhere from a 5% to a 9% effect on revenues.
—Harvard Business School, 2011
http://hbswk.hbs.edu/item/the-yelp-factor-are-consumer-reviews-good-for-business
What am I missing, Database?
https://blog.bufferapp.com/instagram-analytics
https://instagram-press.com/blog/2018/07/02/introducing-youre-all-caught-up-in-feed/
What am I missing, Database?
Hey Instagram: that’s Chez Panisse in Berkeley, CA!
Who should I be with, Database?
https://www.gotinder.com/press
How does Science work? Database.
Jim Gray
Turing Award Winner
First Berkeley CS PhD
How does Science work? Database.
Experimental
Theoretical
Simulation
Data-Intensive
Astronomy in the 4th Paradigm
Sloan Digital Sky Survey (SDSS) + Database Systems
Sky Server
http://skyserver.sdss.org
Science in the 4th Paradigm
Astronomy
Connectomics
Cosmological Physics
Genomics
Oceanography
Your career…
The fundamentals of this class are (and will remain) central to participating in this new and more data-centric world
Many of the details and technologies will change in the coming years
Be prepared to generalize from what you learn here
Keep learning new things
Why? Reason #1: Utility
This class is very, very useful
Data processing backs essentially every app
Databases of one form or another back most apps
The principles taught in this class back nearly everything in computing
This material will empower you.
Why? Reason #2: Centrality
Data is at the center of modern society
Data is unique in its nature and significance
Particular and voluminous
Often asymmetric
low value in isolation, high value when aggregated
Difficult to protect
At the center of major issues
Privacy
National Security
Online Misinformation (including Fake News)
National Security Data: 2010
Numbers from The Guardian
XKeyscore, the latest system (built on federated MySQL servers), replaced Marianas
National Security Data: 2018
Data Integrity: Not all Data is Correct
“Any user can change any entry, and if enough users agree with them, it becomes true.”
– Colbert Report 7/31/2007
Colbert asked viewers to update the Wikipedia page on elephants to say that the population had tripled, forcing Wikipedia to lock the page.
Yet a 2005 Nature study found Wikipedia science articles to be similar in accuracy to Encyclopedia Britannica.
COMEDY CENTRAL VIDEO ARCHIVE VIA WIKIPEDIA
https://en.wikipedia.org/wiki/Reliability_of_Wikipedia
http://www.nature.com/nature/journal/v438/n7070/full/438900a.html
http://www.cc.com/video-clips/z1aahs/the-colbert-report-the-word---wikiality
Data Integrity: Not all Data is Correct
(From The Guardian, Dec 2016)
A Syllogism of Quotes
“information is knowledge”
— Albert Einstein
“knowledge is power”
— Sir Francis Bacon
“with great power comes great responsibility”
— Uncle Ben (Spiderman)
“I could go on and on about all of the amazing work that is happening around the world using data to make lives better everyday, but we also have to address where data is causing more harm than good.”
“Data is such an incredible lever arm for change, we need to make sure that the change that is coming, is the one we all want to see.
So how do we do it? First, there is no single voice that determines these choices. This MUST be community effort.”
https://medium.com/@dpatil/a-code-of-ethics-for-data-science-cda27d1fac1
https://www.oreilly.com/ideas/doing-good-data-science
Berkeley’s New Data Science Major
https://data.berkeley.edu/degrees/data-science-ba
Why? Reason #2: Centrality
Data is at the center of modern society.
Unprecedented in its nature and significance
Particular and voluminous
Often asymmetric
low value in isolation, high value when aggregated
Difficult to protect
The infrastructure determines what’s possible
Why? Reason #3: The Core of Computing
Data growth will continue to outpace computation
Systems for Data at Scale: the core of modern computing
https://www.domo.com/learn/data-never-sleeps-5
Every Minute!
Scale of Scientific Data
Large Hadron Collider, CERN
Raw data: 1 MB/event × 600,000,000 events/sec
= 1.9×10^22 bytes/year = 19 ZettaBytes/year
Downsampled: 25 GB/sec = 7.88×10^17 bytes/year = 788 PetaBytes/year
Downsampled further: 1050 MB/sec = 3.3×10^16 bytes/year = 33 PetaBytes/year
https://home.cern/about/computing/processing-what-record
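The yearly volumes above can be rechecked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 MB = 10^6 bytes) and a 365-day year, and the function and variable names are made up for illustration.

```python
# Back-of-the-envelope check of the yearly LHC data volumes.
# Assumptions: decimal units and a 365-day year.
MB = 10**6
GB = 10**9
PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def bytes_per_year(bytes_per_second):
    """Convert a sustained data rate into a yearly volume in bytes."""
    return bytes_per_second * SECONDS_PER_YEAR


raw = bytes_per_year(1 * MB * 600_000_000)  # 1 MB/event, 600M events/sec
downsampled = bytes_per_year(25 * GB)       # 25 GB/sec
further = bytes_per_year(1050 * MB)         # 1050 MB/sec

for label, volume in [("raw", raw),
                      ("downsampled", downsampled),
                      ("downsampled further", further)]:
    print(f"{label}: {volume:.2e} bytes/year ({volume / PB:,.0f} PB/year)")
```

The raw rate comes out four orders of magnitude above the first downsampled rate, which is why filtering happens before anything is stored.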
Forces Driving Data Growth
Ubiquitous sensors and reporting:
Cameras, mobile computing, social media, …
Large collaborative science projects
Philosophy: More Data More Value?
Enabling Technology
Cheap, Scalable Data Management Systems
http://hyperboleandahalf.blogspot.com
Why? Reason #3: The Core of Computing
Data growth will continue to outpace computation
Systems for Data at Scale: the core of modern computing
Techniques you learn in this class underlie many topics in computing
Essential Queries
Why take this class?
What is this class all about?
Who is running this?
How will this class work?
What is this class all about?
Databases?
What is a database?
Database Management Systems?
Universal Symbol for a Database
Why the Symbol?
Looks Like?
Platters on a Disk Drive
Why the Symbol?
1956: IBM MODEL 350 RAMAC
First Commercial Disk Drive
5MB @ 1 ton
http://www.computerhistory.org/storageengine/first-commercial-hard-disk-drive-shipped
“…We must immediately…attack accounting problems under the philosophy of handling each business transaction as it occurs, rather than under the present condition of batching techniques….”
— F. J. Wesley IBM Senior Manager
Looks Like?
Is This a Database?
Rolodex
Alphabetically ordered cards
Indexed access by first letter
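A Rolodex's letter tabs can be mimicked in a few lines. This is a rough sketch, not a real index structure: the contact cards are invented, and `lookup` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical contact cards; names and numbers are made up.
cards = [("Alice", "555-0100"), ("Bob", "555-0101"), ("Amir", "555-0102")]

# Build the "tabs": one bucket of cards per first letter, kept in order.
index = defaultdict(list)
for name, phone in cards:
    index[name[0].upper()].append((name, phone))
for bucket in index.values():
    bucket.sort()


def lookup(name):
    """Jump to the first-letter tab, then scan only the cards behind it."""
    for card_name, phone in index.get(name[0].upper(), []):
        if card_name == name:
            return phone
    return None


print(lookup("Amir"))
```

The tab narrows the search to one bucket instead of the whole deck, which is the basic job of an index.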
Is This a Database?
A database + “business logic” + user interface?
Most of Tinder’s value is the database itself.
Is This a Database?
Airline reservation systems were one of the earliest pervasive consumer uses of database systems.
IBM/American Airlines’ SABRE system, 1964.
“Semi-Automated Business Research Environment”
Travelocity.com is a direct descendant of SABRE
Acquired by Expedia, 1/2015
What is a Database?
Let’s not split hairs.
A database is a large, organized collection of data.
Sometimes confused with a Database Management System (DBMS)
A DBMS is software that stores, manages, and facilitates access to data.
Berkeley Roots!
Ingres / Postgres
Sybase
Informix
UC Berkeley
Oracle
IBM
Relational DBMSs
Traditionally, “DBMS” meant a relational DBMS
RDBMS is the more precise term
SQL: a data description and manipulation language
ACID transaction guarantees
Durable writes (prevent data loss)
Mature technologies …
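The RDBMS features listed above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the `accounts` table and the `transfer` helper are hypothetical. It uses an in-memory database, so it illustrates atomicity and consistency but not durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Data description (DDL): declare a table and an integrity constraint.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)

# Data manipulation (DML): insert rows, then commit them.
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


ok = transfer("alice", "bob", 30)       # fits within alice's balance
failed = transfer("alice", "bob", 500)  # would overdraw alice

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The credit is deliberately issued before the debit. When the debit violates the constraint, the rollback also undoes the credit.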
Ranking of DBMS Technologies 2019
http://db-engines.com/en/ranking
Based on number of mentions (e.g., on Stack Overflow), Google Trends, job postings, LinkedIn profile data, tweets …
Relational Database Market
Big Market > 41B
http://www.infoworld.com/article/2916057/open-source-software/open-source-threatens-to-eat-the-database-market.html
What is happening here?
Hadoop & NoSQL
Relational Database Market
http://www.infoworld.com/article/2916057/open-source-software/open-source-threatens-to-eat-the-database-market.html
Market Trends
Cloud DBMS disrupting on-premises vendors
Cloud is less relational-centric
But fastest-growing services at AWS are RDBMSs
“One size doesn’t fit all”
Main-memory DBMS
Graph DBMS
Time-Series DBMS
Key-Value Stores (NoSQL)
Analytics Platforms (Spark, Hadoop)
Tools for working with data
Business Intelligence (charting tools)
ML/Data Science platforms
Data preparation and next-gen data integration (ETL)
Reasons for Change
Hardware trends: RAM, SSDs, NVRAM, GPUs, …
Platform trends: cloud and elastic computing
Need to scale: storage and transactions
New data types: text, JSON, image, video…
New workloads: machine learning & advanced analytics
Change = Opportunity!
The DBMS world is rapidly changing
We will discuss these changes toward the end of the course
Our textbook is rather out of date (2003!)
Opportunity!
You can shape the future of DBMSs
We won’t follow the textbook slavishly.
Instead…
Focus: Foundational System Principles
Basic ideas and components
How to compose those components into a technology stack
Goal:
You will be able to use existing & build new DBMS technologies!
You will learn…
Data-Oriented Programming with SQL (à la 61A)
Foundations of Data System Design
Storage, indexing
Query processing and optimization
Transactions
Concurrency, Consistency, Recovery
Data Modeling
Application-level representations of data
Principles
Data Independence
Declarative Programming
Rendezvous in Time and Space
Isolation and consistency
Data representations
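Two of these principles, declarative programming and data independence, can be shown side by side. This is a small sketch using Python's built-in sqlite3 module: the ratings data, table name, and function names are invented for illustration.

```python
import sqlite3

# Hypothetical restaurant ratings, made up for illustration.
rows = [("Chez Panisse", 5), ("Taqueria", 4), ("Diner", 2), ("Cafe", 4)]


# Imperative: we spell out HOW to scan, filter, and sort.
def well_rated_imperative(data, min_stars):
    result = []
    for name, stars in data:
        if stars >= min_stars:
            result.append(name)
    result.sort()
    return result


# Declarative: we state WHAT we want; the DBMS chooses the access path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (name TEXT, stars INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)", rows)
QUERY = "SELECT name FROM ratings WHERE stars >= ? ORDER BY name"


def well_rated_declarative(min_stars):
    return [name for (name,) in conn.execute(QUERY, (min_stars,))]


before = well_rated_declarative(4)

# Data independence: changing the physical layout (adding an index)
# does not require any change to the query text.
conn.execute("CREATE INDEX ratings_by_stars ON ratings (stars)")
after = well_rated_declarative(4)

print(before, after, well_rated_imperative(rows, 4))
```

The query string stays the same before and after the index is created. Only the system's choice of how to answer it may change.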
Systems
We will examine various levels of a DBMS
Layers of a Database Management System, top to bottom:
Query Parsing & Optimization
Relational Operators
Files and Index Management
Buffer Management
Disk Space Management
Database
Alongside the layers: Concurrency Control, Recovery
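The bottom two layers can be sketched as a toy buffer manager sitting on a disk space manager. This is an illustration under loud assumptions: the class names, the dictionary standing in for a disk, and the LRU eviction policy are all choices made here, not taken from the slides. Real systems also handle dirty pages and pinning, which this sketch omits.

```python
from collections import OrderedDict


class DiskSpaceManager:
    """Bottom layer: reads and writes whole pages by page number."""

    def __init__(self):
        self.pages = {}  # stands in for blocks on a disk
        self.reads = 0   # number of simulated disk reads

    def read_page(self, page_no):
        self.reads += 1
        return self.pages.get(page_no, b"")

    def write_page(self, page_no, data):
        self.pages[page_no] = data


class BufferManager:
    """Layer above: caches up to `capacity` pages, evicting by LRU."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.frames = OrderedDict()  # page_no -> data, least recent first

    def get_page(self, page_no):
        if page_no in self.frames:
            self.frames.move_to_end(page_no)     # hit: now most recent
        else:
            if len(self.frames) >= self.capacity:
                self.frames.popitem(last=False)  # evict least recent
            self.frames[page_no] = self.disk.read_page(page_no)
        return self.frames[page_no]


disk = DiskSpaceManager()
for n in range(3):
    disk.write_page(n, f"page {n}".encode())

pool = BufferManager(disk, capacity=2)
for n in [0, 1, 0, 2, 1]:
    pool.get_page(n)

print(disk.reads, list(pool.frames))
```

Each layer talks only to the one below it: callers ask the buffer manager for pages and never touch the disk directly.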
What is this class all about?
Databases?
What is a database?
Database Management Systems?
Implementation?
Big Ideas in Database Management Systems
Principles and Algorithms
System Designs
The heart of scalable CS