程序代写代做代考 algorithm go concurrency database compiler INFO20003 Database Systems

INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 11 Query Processing Part I
Week 6
1
INFO20003 Database Systems © University of Melbourne

Remember this? Components of a DBMS
Will briefly touch upon …
DBMS
Query processing module
Parser/ Compiler
Optimizer Executor
TODAY & Next time
Concurrency control module
Transaction mgr.
Lock mgr.
Crash recovery module
Concurrency
control module
Log mgr.
Storage module
File and access methods mgr.
Buffer pool mgr. Disk space mgr.
This is one of several possible architectures; each system has its own slight variations.
Index files Heap files
Database
INFO20003 Database Systems © University of Melbourne
2

Coverage
• Query Processing Overview • Selections
• Projections
Readings: Chapter 12 and 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
3

Query processing overview
• Some database operations are EXPENSIVE
• DBMSs can greatly improve performance by being ‘smart’
– e.g., can speed up 1,000,000x over naïve approach
• Main weapons are:
1. clever implementation techniques for operators
2. exploiting ‘equivalencies’ of relational operators
3. using cost models to choose among alternatives
INFO20003 Database Systems © University of Melbourne
4

Query processing workflow
Query
Usually there is a heuristics-based rewriting step before the cost-based steps.
Next week
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Schema
Statistics
Query Plan Evaluator
Select *
From Blah B
Where B.blah = “foo”
INFO20003 Database Systems © University of Melbourne
5

Relational Operations
• We will consider how to implement:
–Selection (σ) Selects a subset of rows from relation –Projection (π) Deletes unwanted columns from relation –Join () Allows us to combine two relations
• Operators can be then be composed creating query plans
INFO20003 Database Systems © University of Melbourne
6

Query Processing
• Query Processing Overview • Selections
• Projections
Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
7

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Sailors (S):
–Each tuple is 50 bytes long, 80 tuples per page, 500 pages –N = NPages(S) = 500, pS=NTuplesPerPage(S) = 80 –NTuples(S) = 500*80 = 40000
• Reserves (R):
–Each tuple is 40 bytes long, 100 tuples per page, 1000 pages –M= NPages(R) = 1000, pR=NTuplesPerPage(R) =100 –NTuples(R) = 100000
INFO20003 Database Systems © University of Melbourne
8

Simple Selections
• Of the form • Example:
σR.attropvalue (R)
SELECT *
FROM Reserves R WHERE R.BID > 20;
• The best way to perform a selection depends on:
1. available indexes/access paths
2. expected size of the result (number of tuples and/or number of pages)
INFO20003 Database Systems © University of Melbourne
9

Estimate result size (reduction factor)
• Size of result approximated as:
size of relation * ∏ (reduction factors)
• Reduction factor is usually called selectivity. It estimates what portion of the relation will qualify for the given predicate, i.e. satisfy the given condition.
− This is estimated by the optimizer (will be taught next week) − E.g. 30% of records qualify, or 5% of records qualify
INFO20003 Database Systems © University of Melbourne
10

Alternatives for Simple Selections
1. Withnoindex,unsorted:
–Must scan the whole relation, i.e. perform Heap Scan –Cost = Number of Pages of Relation, i.e. NPages(R) –Example: Reserves cost(R)= 1000 IO (1000 pages)
2. Withnoindex,butfileissorted:
–cost = binary search cost + number of pages containing results –Cost = log2(NPages(R)) + (RF*NPages(R))
–Example: Reserves cost(R)= 10 I/O + (RF*NPages(R))
3. Withanindexonselectionattribute: –Use index to find qualifying data entries, –Then retrieve corresponding data records –Discussed next….
INFO20003 Database Systems © University of Melbourne
11

Index Clustering: Review
Clustered vs. unclustered
Data entries
Data entries
(Index File)
(Data file)
CLUSTERED
Data Records
Data Records
UNCLUSTERED
INFO20003 Database Systems © University of Melbourne
12

Using an Index for Selections
• Cost depends on the number of qualifying tuples
• Clustering is important when calculating the total cost
• Steps to perform:
1. Find qualifying data entries:
– Go through the index: height typically small, 2-4 I/O in case of B+tree, 1.2 I/O
in case of hash index (negligible if many records retrieved)
– Once data entries are reached, go through data entries one by one and look
up corresponding data records (in the data file)
2. Retrieve data records (in the data file)
• Cost:
1. Clustered index:
Cost = (NPages(I) + NPages(R))*RF 2. Unclustered index:
Cost = (NPages(I) + NTuples(R))*RF
INFO20003 Database Systems © University of Melbourne
13

Our example
• Example: Let’s say that 10% of Reserves tuples qualify, and let’s say that index occupies 50 pages
• RF = 10% = 0.1, NPages(I) = 50, NPages(R) = 1000, NTuplesPerPage(R) = 100
• Cost:
1. 2. 3.
Clustered index:
Cost = (NPages(I) + NPages(R))*RF
Cost = (50+ 1000)*0.1 = 105 (I/O)
Unclustered index:
Cheapest access path
Cost = (NPages(I) + NTuples(R))*RF Cost = (50+ 100000)*0.1 = 10005 (I/O)
Heap Scan:
Cost = NPages(R) = 1000 (I/O)
INFO20003 Database Systems © University of Melbourne
14

General Selection Conditions
• Typically queries have multiple predicates (conditions)
• Example: day<8/9/94 AND rname=‘Paul’ AND bid=5 AND sid=3 • A B-tree index matches (a combination of) predicates that involve only attributes in a prefix of the search key –Index on matches predicates on: (a,b,c), (a,b) and (a)
–Index on matches a=5 AND b=3, but will not used to answer b=3
–This implies that only reduction factors of predicates that are part of the prefix will be used to determine the cost (they are called matching predicates (or primary conjuncts))
INFO20003 Database Systems © University of Melbourne
15

Selections approach
1. 2.
3.
Find the cheapest access path
– An index or file scan with the least estimated page I/O
Retrieve tuples using it
– Predicates that match this index reduce the number of
tuples retrieved (and impact the cost)
Apply the predicates that don’t match the index (if any) later
on
– These predicates are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched (nor the total cost)
– In this case selection over other predicates is said to be done “on-the-fly”
INFO20003 Database Systems © University of Melbourne
16

Cheapest Access Path: Example
• Example: day < 8/9/94 AND bid=5 AND sid=3 • A B+ tree index on day can be used; –RF = RF(day) –Then, bid=5 and sid=3 must be checked for each retrieved tuple on the fly • Similarly, a hash index on could be used; –∏ 𝑅𝑅𝑅𝑅 = RF(bid)*RF(sid)
–Then, day<8/9/94 must be checked on the fly • How about a B+tree on ? (Y/N)
• How about a B+tree on ? (Y/N)
• How about a Hash index on ? (Y/N)
INFO20003 Database Systems © University of Melbourne
17

Query Processing
• Overview
• Selections • Projections
Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
18

The Projection Operation
• Issue with projection is removing duplicates
SELECT DISTINCT R.sid, R.bid FROM Reserves R
• Projection can be done based on hashing or sorting
INFO20003 Database Systems © University of Melbourne
19

The Projection Operation
• Basic approach is to use sorting
–1. Scan R, extract only the needed attributes
–2. Sort the result set (typically using external merge sort) –3. Remove adjacent duplicates
11,80
12,10
12,10
12,75
13,20
13,20
13,75
INFO20003 Database Systems © University of Melbourne
20

External Merge Sort
• If data does not fit in memory do several passes
• Sort runs: Make each B pages sorted (called runs)
• Merge runs: Make multiple passes to merge runs
–Pass 2: Produce runs of length B(B-1) pages –Pass 3: Produce runs of length B(B-1)2 pages –…
–Pass P: Produce runs of length B(B-1)P pages
. . .
Disk
We will let you know
how many passes there are
INPUT 1
INPUT 2
.. .
.. .
Disk
OUTPUT
INPUT B-1
B Memory buffers
Readings: Chapter 13, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
21

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records, sorting on a single attribute (just showing the attribute value)
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Pass 2
Main Memory
INFO20003 Database Systems © University of Melbourne
22

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Pass 2
Main Memory
1,1
INFO20003 Database Systems © University of Melbourne
23

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Main Memory
1,1
2,3
Pass 2
INFO20003 Database Systems © University of Melbourne
24

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Main Memory
1,1
2,3
Pass 2
INFO20003 Database Systems © University of Melbourne
25

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Pass 2
Main Memory
1
1
2,3
INFO20003 Database Systems © University of Melbourne
26

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Main Memory
1
2,3
Pass 2
INFO20003 Database Systems © University of Melbourne
27

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Pass 2
Main Memory
2,3
1
2,3
INFO20003 Database Systems © University of Melbourne
28

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
2,3
Pass 2
Main Memory
2,3
1
2,3
INFO20003 Database Systems © University of Melbourne
29

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
3
Pass 2
Main Memory
2,3
1,2
2,3
INFO20003 Database Systems © University of Melbourne
30

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
3
Pass 2
Main Memory
1,2
3
2,3
INFO20003 Database Systems © University of Melbourne
31

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
3
Pass 2
Main Memory
1,2
3
3
INFO20003 Database Systems © University of Melbourne
32

External Merge Sort: Example
# buffer pages in memory B = 4, each page 2 records
Input file
4-page runs
3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3
Pass 1
8,9
6,7
4,4
2,3
6,9
5,6
2,3
1,1
6,8
5,5
3,4
2,3
3
Pass 2
Main Memory
3
3
1,2
INFO20003 Database Systems © University of Melbourne
33

The Projection Operation Cost
• Sorting with external sort:
–1. Scan R, extract only the needed attributes –2. Sort the result set using EXTERNAL SORT –3. Remove adjacent duplicates
Cost =
ReadTable + Read the entire table and keep only projected attributes WriteProjectedPages + Write pages with projected attributes to disk
SortingCost +
ReadProjectedPages Read sorted projected pages to discard adjacent
Sort pages with projected attributes with external sort
duplicates
WriteProjectedPages = NPages(R)* PF
PF: Projection Factor says how much are we projecting, ratio with respect to all attributes (e.g. keeping 1⁄4 of attributes, or 10% of all attributes)
Every time we read and write
SortingCost = 2*NumPasses*ReadProjectedPages
INFO20003 Database Systems © University of Melbourne
34

Our example
• Example: Let’s say that we project 1⁄4 of all attributes, and let’s say that we have 20 pages in memory
• PF = 1/4 = 0.25, NPages(R) = 1000
• With 20 memory pages we can sort in 2 passes
Cost = ReadTable + WriteProjectedPages +
SortingCost +
ReadProjectedPages
= 1000 + 0.25 * 1000 + 2*2*250 + 250 = 2500 (I/O)
INFO20003 Database Systems © University of Melbourne
35

Projection based on Hashing
• Hashing-based projection
–1. Scan R, extract only the needed attributes
–2. Hash data into buckets
•Apply hash function h1 to choose one of B output buffers
–3. Remove adjacent duplicates from a bucket
•2 tuples from different partitions guaranteed to be distinct
INFO20003 Database Systems © University of Melbourne
37

Projection Based on Hashing
Original Relation
.. .
Disk
B main memory buffers
INPUT
hash function
h1
Buckets
1
2
B-1
INFO20003 Database Systems © University of Melbourne
38

Projection based on External Hashing
1. Partition data into B partitions with h1 hash function
2. Load each partition, hash it with another hash function (h2) and eliminate
duplicates
Original Relation
Partitions
1 2
B-1
INPUT
hash function
h1
OUTPUT
1
2
B-1
.. .
Disk B main memory buffers Disk
INFO20003 Database Systems © University of Melbourne
39

Projection based on External Hashing
1. Partitioningphase:
–Read R using one input buffer –For each tuple:
•Discard unwanted fields
•Apply hash function h1 to choose one of B-1 output buffers –Result is B-1 partitions (of tuples with no unwanted fields)
•2 tuples from different partitions guaranteed to be distinct
2. Duplicateeliminationphase: –For each partition
•Read it and build an in-memory hash table –using hash function h2 (<> h1) on all fields
•while discarding duplicates –If partition does not fit in memory
•Apply hash-based projection algorithm recursively to this partition (we will not do this…)
INFO20003 Database Systems © University of Melbourne
40

Cost of External Hashing
Cost = ReadTable + Read the entire table and project attributes WriteProjectedPages + Write projected pages into corresponding partitions
ReadProjectedPages
Our example:
Read partitions one by one, create another hash table and discard duplicates within a bucket
Cost = ReadTable + WriteProjectedPages +
ReadProjectedPages
= 1000 + 0.25 * 1000 + 250 = 1500 (I/O)
INFO20003 Database Systems © University of Melbourne
41

What’s examinable
• Understand the logic behind relational operators
• Learn alternatives for selections and projections (for now)
–Be able to calculate the cost of alternatives • Important for Assignment 3 as well
INFO20003 Database Systems © University of Melbourne
42

Next Lecture
• Query Processing Part II ‒ Join alternatives
INFO20003 Database Systems © Univers-it-y of Melbourne
43