程序代写代做代考 concurrency database algorithm compiler INFO20003 Database Systems

INFO20003 Database Systems

INFO20003 Database Systems 1© University of Melbourne 2018

INFO20003 Database Systems

Lecture 11

Query Processing Part I

Semester 2 2018, Week 6

Dr Renata Borovica-Gajic

INFO20003 Database Systems 2© University of Melbourne 2018

Remember this? Components of a DBMS

DBMS

Index files

Heap files

Database

Concurrency

control module

File and access

methods mgr.

Buffer pool mgr.

Disk space mgr.

Storage module Concurrency

control module

Transaction

mgr.

Lock mgr.

Crash recovery

module

Log mgr.

Query processing module

Parser/

Compiler
Optimizer Executor

This is one of several possible architectures;

each system has its own slight variations.

TODAY &

Next time

Will briefly

touch upon …

INFO20003 Database Systems 3© University of Melbourne 2018

Coverage

• Query Processing Overview

• Selections

• Projections

Readings: Chapter 12 and 14, Ramakrishnan & Gehrke, Database Systems

INFO20003 Database Systems 4© University of Melbourne 2018

Query processing overview

• Some database operations are EXPENSIVE

• DBMSs can greatly improve performance by being ‘smart’

– e.g., can speed up 1,000,000x over naïve approach

• Main weapons are:

1. clever implementation techniques for operators

2. exploiting ‘equivalencies’ of relational operators

3. using cost models to choose among alternatives

INFO20003 Database Systems 5© University of Melbourne 2018

Query processing workflow

Query Parser

Query Optimizer

Plan

Generator

Plan Cost

Estimator

Query Plan Evaluator

Catalog Manager

Usually there is a
heuristics-based
rewriting step before
the cost-based steps.

Schema Statistics

Select *

From Blah B

Where B.blah = “foo”

Query

Next week

INFO20003 Database Systems 6© University of Melbourne 2018

Relational Operations

• We will consider how to implement:

–Selection () Selects a subset of rows from relation

–Projection () Deletes unwanted columns from relation

–Join ( ) Allows us to combine two relations

• Operators can be then be composed creating query plans



INFO20003 Database Systems 7© University of Melbourne 2018

Query Processing

• Query Processing Overview

• Selections

• Projections

Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems

INFO20003 Database Systems 8© University of Melbourne 2018

Schema for Examples

• Sailors (S):

–Each tuple is 50 bytes long, 80 tuples per page, 500 pages

–N = NPages(S) = 500, pS=NTuplesPerPage(S) = 80

–NTuples(S) = 500*80 = 40000

• Reserves (R):

–Each tuple is 40 bytes long, 100 tuples per page, 1000 pages

–M= NPages(R) = 1000, pR=NTuplesPerPage(R) =100

–NTuples(R) = 100000

Sailors (sid: integer, sname: string, rating: integer, age: real)

Reserves (sid: integer, bid: integer, day: dates, rname: string)

INFO20003 Database Systems 9© University of Melbourne 2018

Simple Selections

• Of the form

• Example:

• The best way to perform a selection depends on:

1. available indexes/access paths

2. expected size of the result (number of tuples and/or

number of pages)

SELECT *

FROM Reserves R

WHERE R.rname LIKE ‘C%’


R attr valueop R. ( )

INFO20003 Database Systems 10© University of Melbourne 2018

Estimate result size (reduction factor)

• Size of result approximated as:

size of relation * ς(reduction factors)

• Reduction factor is usually called selectivity. It estimates

what portion of the relation will qualify for the given

predicate, i.e. satisfy the given condition.

 This is estimated by the optimizer (will be taught next week)

 E.g. 30% of records qualify, or 5% of records qualify

INFO20003 Database Systems 11© University of Melbourne 2018

Alternatives for Simple Selections

1. With no index, unsorted:

–Must scan the whole relation, i.e. perform Heap Scan

–Cost = Number of Pages of Relation, i.e. NPages(R)

–Example: Reserves cost(R)= 1000 IO (1000 pages)

2. With no index, but file is sorted:

–cost = binary search cost + number of pages containing results

–Cost = log2(NPages(R)) + (RF*NPages(R))

–Example: Reserves cost(R)= 10 I/O + (RF*NPages(R))

3. With an index on selection attribute:

–Use index to find qualifying data entries,

–Then retrieve corresponding data records

–Discussed next….

INFO20003 Database Systems 12© University of Melbourne 2018

Index Clustering: Review

Clustered vs. unclustered

Data entries

(Index File)

(Data file)

Data Records

Data entries

Data Records
CLUSTERED UNCLUSTERED

INFO20003 Database Systems 13© University of Melbourne 2018

Using an Index for Selections

• Cost depends on the number of qualifying tuples

• Clustering is important when calculating the total cost

• Steps to perform:

1. Find qualifying data entries:

– Go through the index: height typically small, 2-4 I/O in case of B+tree, 1.2 I/O

in case of hash index (negligible if many records retrieved)

– Once data entries are reached, go through data entries one by one and look

up corresponding data records (in the data file)

2. Retrieve data records (in the data file)

• Cost:

1. Clustered index:

Cost = (NPages(I) + NPages(R))*RF

2. Unclustered index:

Cost = (NPages(I) + NTuples(R))*RF

INFO20003 Database Systems 14© University of Melbourne 2018

Our example

• Example: Let’s say that 10% of Reserves tuples qualify, and let’s

say that index occupies 50 pages

• RF = 10% = 0.1, NPages(I) = 50, NPages(R) = 1000

• Cost:

1. Clustered index:

Cost = (NPages(I) + NPages(R))*RF

Cost = (50+ 1000)*0.1 = 105 (I/O)

2. Unclustered index:

Cost = (NPages(I) + NTuples(R))*RF

Cost = (50+ 100000)*0.1 = 10005 (I/O)

3. Heap Scan:

Cost = NPages(R) = 1000 (I/O)

Cheapest access path

INFO20003 Database Systems 15© University of Melbourne 2018

General Selection Conditions

• Typically queries have multiple predicates (conditions)

• Example: day<8/9/94 AND rname=‘Paul’ AND bid=5 AND sid=3 • A B-tree index matches (a combination of) predicates that involve only attributes in a prefix of the search key –Index on matches predicates on: (a,b,c), (a,b) and
(a)

–Index on matches a=5 AND b=3, but will not used
to answer b=3

–This implies that only reduction factors of predicates that are
part of the prefix will be used to determine the cost (they are
called matching predicates (or primary conjucts))

INFO20003 Database Systems 16© University of Melbourne 2018

Selections approach

1. Find the cheapest access path

– An index or file scan with the least estimated page I/O

2. Retrieve tuples using it

– Predicates that match this index reduce the number of

tuples retrieved (and impact the cost)

3. Apply the predicates that don’t match the index (if any) later

on

– These predicates are used to discard some retrieved tuples, but

do not affect number of tuples/pages fetched (nor the total cost)

– In this case selection over other predicates is said to be done

“on-the-fly”

INFO20003 Database Systems 17© University of Melbourne 2018

Cheapest Access Path: Example

• Example: day < 8/9/94 AND bid=5 AND sid=3 • A B+ tree index on day can be used; –RF = RF(day) –Then, bid=5 and sid=3 must be checked for each retrieved tuple on the fly • Similarly, a hash index on could be used;

–ς𝑅𝐹 = RF(bid)*RF(sid)

–Then, day<8/9/94 must be checked on the fly • How about a B+tree on ? (Y/N)

• How about a B+tree on ? (Y/N)

• How about a Hash index on ? (Y/N)

INFO20003 Database Systems 18© University of Melbourne 2018

Query Processing

• Overview

• Selections

• Projections

Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems

INFO20003 Database Systems 19© University of Melbourne 2018

• Issue with projection is removing duplicates

• Projection can be done based on hashing or sorting

The Projection Operation

SELECT DISTINCT R.sid, R.bid

FROM Reserves R

INFO20003 Database Systems 20© University of Melbourne 2018

• Basic approach is to use sorting

–1. Scan R, extract only the needed attributes

–2. Sort the result set (typically using external merge sort)

–3. Remove adjacent duplicates

The Projection Operation

12,10

12,10

11,80

12,75

13,20

13,20

13,75

INFO20003 Database Systems 21© University of Melbourne 2018

External Merge Sort

• If data does not fit in memory do several passes

• Sort runs: Make each B pages sorted (called runs)

• Merge runs: Make multiple passes to merge runs

–Pass 2: Produce runs of length B(B-1) pages

–Pass 3: Produce runs of length B(B-1)2 pages

–…

–Pass P: Produce runs of length B(B-1)P pages

B Memory buffers

INPUT 1

INPUT B-1

OUTPUT

DiskDisk

INPUT 2

. . . . . .. . .

Readings: Chapter 13, Ramakrishnan & Gehrke, Database Systems

We will let you know

how many passes there are

INFO20003 Database Systems 22© University of Melbourne 2018

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records,

sorting on a single attribute (just showing the attribute value)

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
Pass 2

Main Memory

INFO20003 Database Systems 23© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
1,1

Pass 2

Main Memory

# buffer pages in memory B = 4, each page 2 records

External Merge Sort: Example

INFO20003 Database Systems 24© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
1,1 2,3

Pass 2

Main Memory

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 25© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
1,1 2,3

Pass 2

Main Memory

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 26© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
1 2,3

Pass 2

Main Memory

1

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 27© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
2,3

Pass 2

Main Memory

1

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 28© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
2,3 2,3

Pass 2

Main Memory

1

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 29© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

2,3
2,3 2,3

Pass 2

Main Memory

1

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 30© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

3
2,3 2,3

Pass 2

Main Memory

1,2

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 31© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

3
3 2,3

Pass 2

Main Memory

1,2

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 32© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

3
3 3

Pass 2

Main Memory

1,2

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 33© University of Melbourne 2018

Input file

4-page runs

3,4 6,2 9,4 8,7 5,6 3,1 9,2

2,3

5,66,7

4,4

8,9

1,1

2,3

6,1

6,9

8,2 3,4 5,5

5,5

6,8

2,3

6,3

3,4

Pass 1

3
3 3

Pass 2

Main Memory
1,2

External Merge Sort: Example

# buffer pages in memory B = 4, each page 2 records

INFO20003 Database Systems 34© University of Melbourne 2018

The Projection Operation Cost

• Sorting with external sort:

–1. Scan R, extract only the needed attributes

–2. Sort the result set using EXTERNAL SORT

–3. Remove adjacent duplicates

Cost = ReadTable +

WriteProjectedPages +

SortingCost +

ReadProjectedPages

WriteProjectedPages = NPages(R)* PF

PF: Projection Factor says how much are we projecting, ratio with respect to
all attributes (e.g. keeping ¼ of attributes, or 10% of all attributes)

SortingCost = 2*NumPasses*ReadProjectedPages

Read the entire table and keep only projected attributes

Write pages with projected attributes to disk

Sort pages with projected attributes with external sort

Read sorted projected pages to discard adjacent

duplicates

Every time we read and write

INFO20003 Database Systems 35© University of Melbourne 2018

Our example

• Example: Let’s say that we project ¼ of all attributes, and let’s say

that we have 20 pages in memory

• PF = 1/4 = 0.25, NPages(R) = 1000

• With 20 memory pages we can sort in 2 passes

Cost = ReadTable +

WriteProjectedPages +

SortingCost +

ReadProjectedPages

= 1000 + 0.25 * 1000 + 2*2*250 + 250 = 2500 (I/O)

INFO20003 Database Systems 37© University of Melbourne 2018

Projection based on Hashing

• Hashing-based projection

–1. Scan R, extract only the needed attributes

–2. Hash data into buckets

•Apply hash function h1 to choose one of B output buffers

–3. Remove adjacent duplicates from a bucket

•2 tuples from different partitions guaranteed to be distinct

INFO20003 Database Systems 38© University of Melbourne 2018

Projection Based on Hashing

B main memory buffersDisk

Original

Relation Buckets

2
INPUT

1

hash
function

h1
B-1

. . .

INFO20003 Database Systems 39© University of Melbourne 2018

1. Partition data into B partitions with h1 hash function

2. Load each partition, hash it with another hash function (h2) and eliminate

duplicates

Projection based on External Hashing

B main memory buffers DiskDisk

Original

Relation OUTPUT

2
INPUT

1

hash
function

h1
B-1

Partitions

1

2

B-1

. . .

INFO20003 Database Systems 40© University of Melbourne 2018

1. Partitioning phase:

–Read R using one input buffer

–For each tuple:

•Discard unwanted fields

•Apply hash function h1 to choose one of B-1 output buffers

–Result is B-1 partitions (of tuples with no unwanted fields)

•2 tuples from different partitions guaranteed to be distinct

2. Duplicate elimination phase:

–For each partition

•Read it and build an in-memory hash table

–using hash function h2 (<> h1) on all fields

•while discarding duplicates

–If partition does not fit in memory

•Apply hash-based projection algorithm recursively to this

partition (we will not do this…)

Projection based on External Hashing

INFO20003 Database Systems 41© University of Melbourne 2018

Our example:

Cost = ReadTable +

WriteProjectedPages +

ReadProjectedPages

= 1000 + 0.25 * 1000 + 250 = 1500 (I/O)

Cost of External Hashing

Cost = ReadTable +
WriteProjectedPages +
ReadProjectedPages

Read the entire table and project attributes

Write projected pages into corresponding partitions

Read partitions one by one, create another hash

table and discard duplicates within a bucket

INFO20003 Database Systems 42© University of Melbourne 2018

What’s examinable

• Understand the logic behind relational operators

• Learn alternatives for selections and projections (for now)

–Be able to calculate the cost of alternatives

• Important for Assignment 3 as well

INFO20003 Database Systems 43© University of Melbourne 2018

Next Lecture

– –

• Query Processing Part II

‒ Join alternatives