• • •
Please read the Course Outline (syllabus) carefully. I’ll keep it updated on Canvas.
CPSC 404: Advanced Relational Databases
Instructor: Dr. Ed Knorr Department of Computer Science University of British Columbia
September-December 2020
I’ll make an announcement in class, or I’ll post a note, when any significant change has been made.
1
Learning Goals for This Unit
List some job responsibilities of various DB personnel.
Gain an appreciation for, and explain, the scope and complexity
of DB-related jobs in an organization.
Explain the benefits of logical and physical data independence brought about by the relational model.
Identify some database challenges in providing 24×7 operations.
Explain why DBMSs are so hard to configure “properly” for an
organization.
Justify the use of self-managing (sometimes called “autonomic” or self-tuning) database systems.
Give examples of quantifiable components that determine the “all in” cost of delivering IT or database services.
2
Skylight and CWSEI
Part of the Faculty of Science’s mandate to improve teaching and learning in Science
Best practices from evidence-based research
Skylight = Science Centre for Learning and Teaching (SCLT but pronounced and referred to as “Skylight”)
CWSEI = Carl Wieman Science Education Initiative (pronounced C-W- see)
www.cwsei.ubc.ca has some great resources; you may wish to explore the site, and especially the Resources page and its links
3
Skylight and CWSEI (cont.)
In this course, there will be an emphasis on deliberate practice and problem solving.
What is “deliberate practice”?
We’re want students to develop expert-like thinking about a task.
We’ll use clickers, in-class exercises, analogies, worked examples, more frequent testing, etc.
4
e.g., e.g., e.g.,
Tiger Woods in golf
Mozart in music
Gretzky, Crosby, and McDavid in hockey
Bill Gates in computing
Beatles in pop music
a good doctor/specialist perhaps you in your degree program and computing career
e.g., e.g., e.g., e.g.,
Skylight and CWSEI (cont.)
Just about any world-class “expert” at a complex profession has spent about 10,000 hours of deliberate practice to achieve this level.
If you actually spend 10,000 hours on something, will that give you world-class success? How about on something that you don’t like doing? 5
Level #
% Questions
Level/Category
Some Verbs Used for Questions
6 5 4 3 2 1
(highest)
< 2% <2% 2-5% 12-15% 20% 60%
Evaluation Synthesis Analysis Application Comprehension Knowledge
Evaluate, Justify, Argue, Judge, ... Create, Generate, Propose, ... Analyze, Determine, ...
Apply, Use, Choose, ...
(lowest)
Explain, Show, Extrapolate, ... Describe, List, Name, Identify, ...
Skylight and CWSEI (cont.)
Learning Goals—very important!
Less of a focus on memorizing facts, and more of an emphasis on:
applying concepts and problem solving
Bloom’s Taxonomy (with a % estimate of typical exam questions in first-year, post-secondary, science courses, shown below)
• However, CS tends to have more questions in levels 3-6 than other sciences.
6
Bloom’s Taxonomy
7
Course Objectives: We’ll Study ...
The role of an RDBMS in an organization’s data management strategy.
The relationship among bytes, pages, disks, buffer pools, data tables, indexes, metadata, etc.
Indexing strategies
SQL query evaluation and optimization
Performance issues
Transaction Processing and Concurrency Control
Schedules, Serializability, Deadlock, Locking Protocols, Isolation Levels
Crash Recovery and Application Recovery Logging
Backups (Image Copies) and Recoveries
8
Historically, 3 Major Types of DBMSs
From oldest to more recent ... Network DBMSs
e.g., IDMS from Computer Associates (CA)
Has lots of pointers, similar in spirit to the WWW More difficult to program, less flexible
Hierarchical DBMSs
e.g., IBM’s IMS (Information Management System)
Many commercial applications have natural hierarchies in their data.
Many production systems throughout the world (e.g., banking, government— Government of Ontario), but nowhere near as many as for relational systems
Relational DBMSs
IBM’s DB2, Oracle, Microsoft’s SQL Server, MySQL, etc.
Heavily used throughout the world. Bruce Lindsay, IBM: “Relational databases form the bedrock of Western civilization”.
The focus of this course
9
Newer Technologies
Object-Oriented DBMSs
They haven’t really caught on, at least not the way some people had
predicted
Very small market share (e.g., 1%)
RDBMSs are a huge workhorse, and contain a lot of the data in the Deep Web (of which 96% is estimated to be generally inaccessible to search engines like Google or Bing).
10
Newer Technologies: NoSQL
NoSQL
Not Only SQL (or in some circles: No SQL at all)
• A better name might be “non-relational” because some systems have SQL-like querying.
• They support distributed data, easily. Consider the idea of hash functions that determine which machine data gets sent to/from.
Under the NoSQL banner come: • Key-value stores
• Column stores
• Document stores
• Graph databases
• XML object-based stores
An RDBMS spends a lot of time on keeping data consistent, and making sure the data obeys the schema rules.
An RDBMS may have limitations on size and performance (e.g., for
answering mobile queries from a large number of users).
11
Newer Technologies: NoSQL (cont.)
NoSQL (cont.)
NoSQL has an “eventually consistent” capability rather than the much stronger ACID properties of atomicity, consistency, isolation, and durability
There’s the lack of a schema.
Big Data often deals with semi-structured data, or data without
structure, including sensor data, images, audio, etc.
Documents are often represented using the JSON standard (JavaScript Object Notation).
Examples of NoSQL systems: Cassandra, CouchDB, DynamoDB (from Amazon), HBase, and MongoDB
You can choose to explore NoSQL as one of the options in this course, using our zyBooks feature.
Need more info about NoSQL vs. an RDBMS? Check out Principles of Database Management by Lemahieu, Vanden Broucke, and
Baesens, Cambridge University Press, 2018.
12
Some DB-Related Jobs
DB-related staff in a large IT shop (e.g., a major corporation) or as part of a consulting company):
Systems / Technical Support
• Installation, configuration, customization, problem determination, interaction with vendor, patches, monitoring, performance & tuning, availability, disk storage, capacity planning, chargeback, etc.
Database Administration (e.g., DBA = Database Administrator or Database Analyst)
• Requirements analysis, data modeling (logical DB design including Entity-Relationship Diagrams), data dictionary, standards, documentation, physical DB design, capacity planning, etc.
• SQL analysis (e.g., identify performance bottlenecks, recommend changes, add indexes)
• Support business cases; interact with users, programmers, technical support, management, business partners, and other
stakeholders
13
Some DB-Related Jobs (cont.)
DB-related staff (cont.)
Database Support (often combined with Database
Administration as part of a DBA’s role)
• Set up production jobs including backups and recoveries, utilities, authorizations, binds/rebinds, etc.
• Set up and load DBs in test and production environments
• Interact with other DB and technical staff
• Nightly and weekend callouts are not uncommon
Programming
• Historically, each area within a corporation had its own programming group (e.g., for an oil & gas company: payroll, human resources, accounting, land, drilling, exploration, production, ...)
• DBA interacts with each group regularly
14
Some DB-Related Jobs (cont.)
Software Vendors
They write and maintain a DBMS.
• e.g., Oracle, IBM, Microsoft, ...
Application/Middleware Vendors
They write software that integrates with a DBMS, and which many
companies purchase to help support their DBMS.
Some vendors specialize in HR software, financial services, data warehouses for a given industry (e.g., airlines), etc.
15
Some DB-Related Jobs (cont.)
Thus, there are many DB-related jobs.
Some large IT consulting shops employ people in each of the categories on the previous 2 pages to service multiple clients, although some customers are so big that multiple consultants are often devoted to—and are permanently stationed at—a single client’s site.
In a small shop, a single employee is often responsible for many of the above tasks.
• Do you want to be a “jack of all trades”—or a specialist?
16
Outsourcing
Companies often save money by outsourcing their IT departments (or major parts of it).
e.g., BC Hydro → Westech Information Systems and Western Integrated Technologies → BC Hydro → Accenture and others
e.g., Does each corporation really need to develop, maintain, and support its own proprietary payroll, HR, and tax programs?
• e.g., BC Hydro moves to PeopleSoft HR software → later Oracle buys PeopleSoft
17
Outsourcing (cont.)
Software, hardware, and networks change frequently; networks, mobile solutions, and security are “moving targets”.
Some multinationals are consolidating operations and working from a distance (e.g., Shell Bangalore).
Consider software as a service (SaaS) and cloud storage (e.g., Amazon Web Services, Microsoft Azure, Google Cloud).
Examples of such products (courtesy of Nerdio, 2016): • Salesforce.com
• Microsoft Office 365
• Box
• Google Apps
• Amazon Web Services • Concur
• Zendesk • Docusign • Dropbox • Slack
18
19
Outsourcing (cont.)
The all-in cost of an employee is approximately double his/her salary. Why?
Training costs
• e.g., $3000-5000 for a 5-day DB tuning course (or DB conference) in Toronto, plus one-week’s loss of an employee
There are trade-offs to outsourcing:
costs, security, privacy, stability, knowledge of the business
area, competitiveness, time to respond
Data Independence
Joe Hellerstein’s Inequality:
d application / dt << d environment / dt
What are the implications for applications that use databases?
From CPSC 304, recall some major benefits of a relational DBMS:
Sharing, Redundancy, Integrity, Concurrency, Backup, Recovery
Physical independence (disks and other hardware)
Logical independence (schema views, no pointer manipulation, a declarative query language)
CPSC 304 dealt with usage of a DBMS; in CPSC 404, we look under the hood (i.e., at the internals)
20
Availability & Performance
Here are some desirable characteristics of a DBMS to permit high-levels of availability (e.g., 24×7) and performance:
Perform a schema change without taking the database offline.
• Note that there could be complex changes that affect queries, updates,
application programs, optimizations, backup/recovery, etc.
• Creating a “schema evolution tool” is a very hard problem. • Important area of research
Online backup (i.e., while the data is still “live”)
Online reorganization (of a table and its indexes)
Add/drop indexes on the fly.
Optimize the buffer pool to help minimize the number of disk I/Os. Be proactive about poorly performing queries or resource hogs.
21
Self-Managing DBMSs
Skilled DBAs are hard to find, especially in small shops where a single person cannot be a jack of all trades.
Performance and tuning are important DBMS tasks that require substantial know-how.
DBAs tend to work longer hours than most IT people, including some evenings and weekends.
Many DBMSs have hundreds of tuning parameters, such as:
Checkpoint frequency, log size, sizes and types of buffer pools, rollback wait time, how often to check for deadlocks, max # of connections, max # of concurrent users, lock escalation levels, sizes of work files, # of concurrent users, security issues, etc.
Hard: What defaults should be provided to DBMS customers?
So ... DBMS vendors are putting more effort into self- managing database systems
e.g., IBM’s “Autonomic Computing” initiative
22
Self-Managing DBMSs (cont.)
Example 1: When should tables be backed up? Common Practice:
Preferably:
Example 2: When should tables be reorganized, and indexes be rebuilt?
Common Practice: Preferably:
A big part of this course is about performance. How do we quantify “good performance”?
How do we design for good performance?
How can we fix bad performance?
23
Self-Managing DBMSs (cont.)
Example 3: How do we reorganize a table and its indexes?
Traditionally, a reorg involved taking the database or table offline (making it unavailable to users), unloading the data, sorting it, reloading the data, rebuilding its index structures, capturing metadata, rebinding application plans, and bringing the data objects back online
Better:
• Make a duplicate of the data and reorganize the duplicate, while
keeping the original data online for read and write access.
• When the copy is reorganized, use the database log to apply the latest changes (that were made to the production data) to the new copy .
• Iterate this update process, until there are no more changes. • Switch the two sets of objects
There are lots of vendor-supplied or third-party tools to help a DBA.
24
Availability
Experienced DBA and author Craig Mullins (2013) wrote:
A major vendor reports that 70% of database downtime is due to human error, such as DBA mistakes.
Planned outages represent up to 70% of downtime.
Up to half of the unplanned outages are due to problems encountered during the planned downtime.
25
Availability (cont.)
To help keep systems running around the clock, use automated DBA tools as much as possible.
Be sure to have well-trained and experienced staff.
Use clusters of machines and storage area networks (SANs).
Can run standby systems
• Redundant hardware with copies of DB log changes being
propagated to other systems, ready to take over, if needed
Can provide fast hardware failover support
Can take a machine offline for service while the DBMS continues to run
26
Closing Thoughts
Databases are critical parts of an organization’s data assets.
Much of the data on the Deep Web is stored in databases which are (usually) inaccessible to search engines.
RDBMSs have stood the test of time—and have a great future.
The Fortune 500 corporations use RDBMSs.
Most corporations have several RDBMSs (from different
vendors).
DB-related jobs are worth considering in your future.
DB principles (e.g., logging, backup, recovery, transactions, concurrency, metadata, indexing, the I/O cost model, security, etc.) are worth knowing, even if you don’t plan to make DBs part of your career!
27