CS W4111.001 Introduction to Databases Fall 2020
Computer Science Department Columbia University
1
Overview of Storage and Indexing
2
Data on External Storage
• Disks: Unit for reading and writing data is a “page” or “block”
Reading several physically consecutive pages/blocks is much cheaper than reading them in random order
3
Data on External Storage
• Disks: Unit for reading and writing data is a “page” or “block”
Reading several physically consecutive pages/blocks is much cheaper than reading them in random order
• File organization: Method for arranging a file of records (such as tuples) on external storage
• Can think of each relation as stored in a dedicated file
• A “record id”—or “rid”—is sufficient to physically locate a record on storage
• An “index” allows us to identify efficiently which records have a given value of the “search key” for the index
4
Alternative File Organizations
Many alternatives for organizing files, each well suited for some access patterns but not others:
• Heap(randomorder)files:Usefulwhentypicalaccessisa
file scan retrieving all tuples
SELECT * FROM Employees;
SELECT AVG(E.sal) FROM Employees E;
5
Alternative File Organizations
Many alternatives for organizing files, each well suited for some access patterns but not others:
• Heap(randomorder)files:Usefulwhentypicalaccessisa
file scan retrieving all tuples
SELECT * FROM Employees;
SELECT AVG(E.sal) FROM Employees E;
• Sortedfiles:Usefulwhentuplesmustberetrievedinorder for the “search key” field, or when only a “range” of tuples is needed
SELECT * FROM Employees E WHERE E.sal > 50000 ORDER BY E.sal
6
Alternative File Organizations
Many alternatives for organizing files, each well suited for some access patterns but not others:
• Heap(randomorder)files:Usefulwhentypicalaccessisa
file scan retrieving all tuples
SELECT * FROM Employees;
SELECT AVG(E.sal) FROM Employees E;
• Sortedfiles:Usefulwhentuplesmustberetrievedinorder for the “search key” field, or when only a “range” of tuples is needed
SELECT * FROM Employees E WHERE E.sal > 50000 ORDER BY E.sal
• Indexes:Usefulwhentuplesmustberetrievedbasedon
certain criteria on the “search key” fields used to build index SELECT * FROM Employees E WHERE E.ssn=“123456789”
7
Organizing Relation as Heap File
name sal age Peter 40K 25 Jane 45K 25 Mary 50K 30 Josh 40K 32 Joyce 60K 30
Assume we can fit 2 tuples per disk block
Find all employees with salary ≥50k
8
Organizing Relation as Sorted File
name sal age Peter 40K 25 Jane 45K 25 Mary 50K 30 Josh 40K 32 Joyce 60K 30
Assume we can fit 2 tuples per disk block
9
Organizing Relation as Hashed File
name sal age Peter 40K 25 Jane 45K 25 Mary 50K 30 Josh 40K 32 Joyce 60K 30
Assume we can fit 2 tuples per disk block
10
Comparing File Organizations on Performance: Analysis
• We ignore CPU costs, for simplicity, and focus on number of disk I/Os (i.e., on the number of disk pages/blocks that are read/written)
• We make several simplifying assumptions
• However, analysis is sufficient to show general overall performance of alternatives
• Main parameters:
• B: Number of data pages the relation occupies on disk
• D: Average time to read (write) one data page from (to) disk
11
Alternative File Organizations
• Heap file, with tuples stored in arbitrary order; insertions at “end” of file
• Sorted file, according to search key attribute(s)
• Hashed file, according to search key attribute(s) •…
12
Operations for Comparison
• Scan: Read all tuples
• Equality search: Find all tuples that have one specific value of an attribute
• Range search: Find all tuples that have a value of an attribute within a given range
• Insertion of a tuple
• Deletion of a tuple
13
Assumptions in Performance Analysis
• Heapfile:
• Equalitysearchisonanattributethatistheprimarykeyora candidate key for relation, so at most one match possible
14
Assumptions in Performance Analysis
• Heapfile:
• Equalitysearchisonanattributethatistheprimarykeyora candidate key for relation, so at most one match possible
• Sortedfile:
• Equalityandrangesearchareoverthesearchkeyattribute(s)on which file is sorted
• Fileiscompactedafterdeletions
15
Assumptions in Performance Analysis
• Heapfile:
• Equalitysearchisonanattributethatistheprimarykeyora candidate key for relation, so at most one match possible
• Sortedfile:
• Equalityandrangesearchareoverthesearchkeyattribute(s)on which file is sorted
• Fileiscompactedafterdeletions
• Hashedfile:
• Enoughbucketssothatinitiallywehave,onaverage,80%page occupancy (i.e., every page/bucket is only 80% full); file size is then 1.25*B (i.e., we need 25% more space than if we stored all tuples compactly)
• Wethenassumenooverflowpagesneededforthebuckets
• Equalitysearchisoverthesearchkeyattribute(s)onwhichthe hashed file is organized
16
Cost of Operations
Heap file
Scan
Equality search Range search Insertion Deletion
B: Number of data pages the relation occupies on disk
D: Average time to read (write) one data page from (to) disk 17
Sorted file
Hashed file
Heap file
Sorted file
Hashed file
Scan
B*D
B*D
1.25*B*D
Equality search
0.5*B*D
(log2B)*D
D
Range search
B*D
(log2B + # of pages with matches)*D
1.25*B*D
Insertion
2*D
cost of equality search + B*D
2*D
Deletion
cost of equality search + D
cost of equality search + B*D
2*D
Cost of Operations
B: Number of data pages the relation occupies on disk
D: Average time to read (write) one data page from (to) disk
18