File Organizations and Indexes
1
COST MODEL
D: average time to read or write a disk page.
C : average time to process a record.
H : the time required to apply a hash function
to a record.
3 File Organizations:
Heap Files.
Sorted Files.
Hashed Files.
2
Operations to be investigated
Scan: fetch all records in a file.
Search with equality selection. (SWES) (“Find the students record
with sid = 23”)
Search with Range Selection. (SWRS)
(“Find all students with name alphabetically after ‘Smith’”)
Insert: Insert a given record into the file.
Delete: Delete a record with given rid.
Below, we examine the costs of these operations with respect to
the 3 different file organizations.
3
Heap Files
Scan:
B(D + RC ) where
• B is the number of pages, and
• R is the average number of records in a
page (block).
SWES:
• 0.5B(D + RC ) on average if the selection is
specified on a key.
• Otherwise B(D + RC ).
4
Heap Files
SWRS: B(D + RC ).
Insert: 2D + C . (Always insert to the end of the file)
Delete:
• Only one record is involved.
q The average cost is 0.5B(D + RC ) + D if rid is not
given;
q otherwise (D + C ) + D.
• Several records are involved. Expensive.
5
Sorted Files
Sorted on a search key – a combination of one or more
fields.
If the following query is made against the search key,
then:
1. Scan: B(D + RC ).
2. SWES:
• O(D log2 B + C log2 R) if single record.
• O(D log2 B + C log2 R + #matches).
3. SWRS: O (D log2 B + C log2 R + #matches).
4. Insert: expensive.
• Search cost plus 2 ∗ (0.5B(D + RC )).
5. Delete: expensive.
• Search cost plus 2 ∗ (0.5B(D + RC )).
6
Hashed Files
§ The pages in a file are grouped into buckets.
§ The buckets are defined by a hash function.
§ Pages are kept at about 80% occupancy.
Assume the data manipulation is based on the hash
key.
§ Scan: 1.25B(D + RC ).
§ SWES: H + D + 0.5RC if each hash bucket contains
only one page.
§ SWRS: 1.25B(D + RC ). (No help from the hash
structure)
§ Insert: Search cost plus C + D if one block involved.
§ Delete: Search cost plus C + D if one block involved.
7
Summary
File
Type Scan
Equality
Search
Range
Search
Insert Delete
8
Heap BD 0.5 BD BD Search + D
Search
+ D
Sorted B D D log B
D log B
+
# matches
Search
+ BD
Search
+ BD
Hashed
1.25 BD D 1.25 BD
2 D
Search
+ BD
A Comparison of I/O Costs
Indexes
Basic idea behind index is as for books.
9
INDEXES
aardvark 25,36 lion . . . . . . . 18
bat . . . . . . . 12 llama 17,21,22
cat . . . . 1,5,12 sloth . . . . . . 18
dog . . . . . . . . 3 tiger . . . . . . 18
elephant . . 17 wombat . . . 27
emu . . . . . . 28 zebra . . . . . 19
§ A table of key values, where each entry gives places where key
is used.
§ Aim: efficient access to records via key values.
Indexing Structure
10
k1
k2
k3
k4
k5
k6
k7
k8
Index
Data File
…
…
Indexing Structure
Index is collection of data entries k∗.
Each data entry k∗ contains enough information
to retrieve (one or more) records with search
key value k.
Indexing:
§ How are data entries organized in order to
support efficient retrieval of data entries with a
given search key value?
§ Exactly what is stored as a data entry?
11
Alternatives for Data Entries in an Index
§ A data entry k∗ is an actual data record (with
search key value k).
§ A data entry is (k, rid) pair (rid is the record id of
a data record with search key value k).
§ A data entry is a (k, rid − list) pair (rid − list is the
list of record ids of data records with search key
value k).
Example: (Xuemin Lin, page 12), (Xuemin Lin, page 100)
VS
(Xuemin Lin, page 12, page 100)
12
Clustered Index
§ Clustered: a file is organized of data records is the same as
or close to the ordering of data entries in some index.
§ Typically, the search key of file is the same as the search key
of index.
13
Index
Data File
Index Entries
Data Entries
Unclustered Index
14
……
Index
Data File
Index Entries
Data Entries
§ Clustered indexes are relatively expensive to maintain.
§ A data file can be clustered on at most one search key.
Dense VS Sparse Indexes
§ Dense: it contains (at least) one data entry
for every search key value.
§ Sparse: otherwise.
Q: Can we build a sparse index that is not
clustered?
15
16
Ashby
Cass
Smith
22
25
30
33
Ashby, 25, 6003
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Dense, 44, 6003
Smith, 44, 3000
Tracy, 44, 5004
40
44
44
50
Sparse Index VS Dense Index
Primary and Secondary Indexes
§ Primary: Indexing fields include primary key.
§ Secondary: otherwise.
There may be at most one primary index for a
file.
Composite search keys: search key contains
several fields.
17