File Organizations and Indexes
1
COST MODEL
D: average time to read or write a disk page. C : average time to process a record.
H : the time required to apply a hash function to a record.
3 File Organizations: Heap Files.
Sorted Files. Hashed Files.
2
Operations to be investigated Scan: fetch all records in a file.
Search with equality selection. (SWES) (“Find the students record with sid = 23”)
Search with Range Selection. (SWRS)
(“Find all students with name alphabetically after ‘Smith’”)
Insert: Insert a given record into the file.
Delete: Delete a record with given rid.
Below, we examine the costs of these operations with respect to the 3 different file organizations.
3
Scan:
Heap Files
B(D + RC ) where
• B is the number of pages, and
• R is the average number of records in a page (block).
SWES:
• 0.5B(D + RC ) on average if the selection is specified on a key.
• Otherwise B(D + RC ).
4
Heap Files
SWRS: B(D + RC ).
Insert: 2D + C . (Always insert to the end of the file)
Delete:
•
Only one record is involved.
qThe average cost is0.5B(D+RC)+ D ifrid isnot given;
qotherwise (D+C)+D.
Several records are involved. Expensive.
•
5
Sorted Files
Sorted on a search key – a combination of one or more fields.
If the following query is made against the search key, then:
1. Scan: B(D + RC ).
2. SWES:
• O(D log2 B + C log2 R) if single record.
• O(D log2 B + C log2 R + #matches).
3. SWRS: O (D log2 B + C log2 R + #matches).
4. Insert: expensive.
• Search cost plus 2 ∗ (0.5B(D + RC )).
5. Delete: expensive.
• Search cost plus 2 ∗ (0.5B(D + RC )).
6
Hashed Files
§ The pages in a file are grouped into buckets. § The buckets are defined by a hash function. § Pages are kept at about 80% occupancy.
Assume the data manipulation is based on the hash key.
§ Scan: 1.25B(D + RC ).
§ SWES: H + D + 0.5RC if each hash bucket contains
only one page.
§ SWRS: 1.25B(D + RC ). (No help from the hash structure)
§ Insert: Search cost plus C + D if one block involved.
§ Delete: Search cost plus C + D if one block involved.
7
Summary
File Type
Scan
Equality Search
Range Search
Insert
Delete
Heap
BD
0.5 BD
BD
Search + D
Search +D
Sorted
BD
D log B
D log B +
# matches
Search + BD
Search + BD
Hashed
1.25 BD
D
1.25 BD
2D
Search + BD
A Comparison of I/O Costs
8
Indexes
Basic idea behind index is as for books.
INDEXES aardvark 25,36 bat . . . . . . . 12 cat . . . . 1,5,12 dog . . . . . . . . 3 elephant . . 17 emu . . . . . . 28
lion . . . . . . . 18 llama 17,21,22
sloth . . . . . . 18 tiger . . . . . . 18
wombat . . . 27 zebra . . . . . 19
§ A table of key values, where each entry gives places where key is used.
§ Aim: efficient access to records via key values.
9
Index
Indexing Structure
k1
k2
k3
k4
k5
k6
k7
k8
Data File
10
……
Indexing Structure Index is collection of data entries k∗.
Each data entry k∗contains enough information to retrieve (one or more) records with search key value k.
Indexing:
§ How are data entries organized in order to support efficient retrieval of data entries with a given search key value?
§ Exactly what is stored as a data entry?
11
Alternatives for Data Entries in an Index
§ A data entry k∗ is an actual data record (with search key value k).
§ A data entry is (k, rid) pair (rid is the record id of a data record with search key value k).
§ A data entry isa(k,rid−list)pair (rid−list isthe list of record ids of data records with search key value k).
Example: (Xuemin Lin, page 12), (Xuemin Lin, page 100)
VS
(Xuemin Lin, page 12, page 100)
12
Clustered Index
§ Clustered: a file is organized of data records is the same as or close to the ordering of data entries in some index.
§ Typically, the search key of file is the same as the search key of index.
Index Entries
Data Entries
Index
Data File
13
Index Entries
Unclustered Index
Index
Data Entries
……
§ Clustered indexes are relatively expensive to maintain. § A data file can be clustered on at most one search key.
Data File
14
Dense VS Sparse Indexes
§ Dense: it contains (at least) one data entry for every search key value.
§ Sparse: otherwise.
Q: Can we build a sparse index that is not clustered?
15
Ashby, 25, 6003
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Dense, 44, 6003
Ashby
Cass
Smith
Smith, 44, 3000
22
25
30
33
40
44
44
50
Tracy, 44, 5004
Sparse Index VS Dense Index
16
Primary and Secondary Indexes
§ Primary: Indexing fields include primary key.
§ Secondary: otherwise.
There may be at most one primary index for a
file.
Composite search keys: search key contains several fields.
17