Introduction to Indexes
File organizations
Introduction to indexes
Consider a simple SQL query
Assume data is stored on HDD to make access table
efficient
▪ Store the table on adjacent blocks
▪ On the same cylinder, then
▪ Adjacent cylinders
▪ Use RAID
But many queries include WHERE clauses
▪ File organizations should support the efficient retrieval of records within a file
SELECT * FROM Customer
and SELECT clauses that specify attributes
Database data must be persistent, so must be stored on secondary memory, such as an HDD or SSD
Disk access is relatively inefficient, in the order of 10 to
15 milliseconds to access a single page
▪ Hundreds of thousands of times more than access to a main memory location
SSD access is faster
▪ The cost of disk I/O dominates the cost of database operations The unit that data is read from or written to disk is a
block, often 16 kilobytes
▪ Reading several pages in sequence from a disk takes much less time than reading several pages scattered across the disk
But still slower than main memory
1.1
Records in a database are stored in data files
▪ Each file usually represents a single table A file organization is a method of
arranging records in a file
▪ Making some operations more efficient ▪ A file can have only one organization
Files can be indexed
▪ To provide efficient access based on
some criterion
▪ Files can have multiple indexes
▪ Index files contain search key values and references to records – record IDs
ID = 41
ID
name
…
11
kate
21
sue
27
lee
35
bob
41
ann
53
naomi
54
kim
… name = ‘sue’
RIDs are page address, slot position pairs
ID index
name index
The minimum set of file operations is:
▪ Create and destroy files
▪ Insert and delete records ▪ And modify records
▪ Scan an entire file
▪ A scan brings all of the records in the file into main memory
▪ Find records This is a very inefficient method of finding a few records ▪ Which can be achieved by scanning the file
There are different file organizations
▪ Which perform operations more or less efficiently
▪ The simplest file structure is an unordered file, or heap file
ID
name
…
36
bob
11
kate
20
sue
55
kim
61
ying
27
lee
41
ann
52
naomi
Heap files are unordered
▪ The only way to find records is to scan the file
New records are inserted where there is room
▪ Either in slots that contained previously deleted
records, or
▪ At the end of the file
▪ Insertion is very efficient
When a record is deleted no reorganization is required
▪ But finding the record to delete is inefficient as a sequential search is required
insert
32
marie
ID
name
…
11
kate
20
sue
27
lee
36
bob
41
ann
52
naomi
55
kim
61
ying
Records in a sorted file are stored in order ▪ Based on the sort key of the file
▪ The record attribute(s) that the file is sorted on ▪ Also known as a sequential file
▪ A good organization for range searches ▪ Since matching records are adjacent
The basic organization assumes that file’s pages are filled to conserve space
▪ Pages should be maintained in sequence ▪ To allow for more efficient disk access
insert
32
marie
Insertions result in records being shuffled up
Deletions result in records being shuffled down
There are a number of techniques to avoid inefficiencies caused by inserting and deleting records
▪ Initially pages have only partial occupancy
▪ i.e. space for future insertions is left in each page
▪ Overflow pages can be attached (by pointers) to full pages
▪ Records can be locally reorganized in adjacent pages if possible
▪ Records are inserted in an unordered overflow file
▪ Which is periodically sorted and merged with the main file
▪ Some of these techniques require that files are periodically re-ordered Records can be found using binary search
▪ Although in practice, sorted files typically have a primary index
ID
name
…
52
naomi
27
lee
41
ann
55
kim
20
sue
61
ying
36
bob
11
kate
If fast access to individual records is a priority a file can be organized as a hash file
▪ A hash function maps a record’s field to the block where the record is stored
▪ Generically referred to as a bucket
h = ID % 13
▪ Hash files support efficient search, insertion and deletion of records based on the hash key
▪ But do not efficiently support searches on non hash key fields , or range searches on the hash key
There are a variety of hash file organizations
▪ Similar to hash index techniques discussed in this section
insert
32
marie
Files organized as B trees or other data structures
▪ Discussed in this section in the context of indexes
▪ The same organization can be applied to records in a file
▪ The bottom level of the index contains records rather than data entries
Mixed record files
▪ Where records of two related files are stored together to improve the efficiency of queries with joins between the tables
▪ e.g. records of entities connected by a one to many relationship
▪ Sometimes referred to as clustered files
▪ Asrecordsofthetablesarephysicallyclusteredondisk
unfortunately – the word is over-used
owner 1
owner 2
owner 3
owner 4
…
Good for queries on the joined relation but less efficient for queries on just one of the tables
Pets of owner1
Pets of owner2
Pets of owner3
Pets of owner4
1.2
An index is a data structure that organizes data to improve the retrieval of records on some criteria
▪ An index is said to provide an access path to records
▪ Based on the search key of the index
▪ An index speeds up searches that are not efficiently supported by the file’s organization
▪ A file can have more than one index
An index is a collection of data entries which must contain
▪ A search key value and
▪ Information to find matching data records
▪ The RID
SQL Server allows up to 999 non-clustered indices per table
Indexes are smaller than files and can be searched must faster
And are organized to support searches on their search keys
We refer to a number of different kinds of key
▪ A key is a field or a set of fields used for some purpose
Keys to identify records in a table
▪ Primary key, which is a
▪ Candidate key, which is a
▪ Superkey There can be many of these – not identified in table design
A table must have exactly one of these Should be identified by a UNIQUE constraint
A key used to order a sequential file ▪ Sort key
A key used to find records in an index ▪ Search key
There is a lack of consensus of the precise definition of the terms primary index and clustered index
Our text (Elmasri et al) defines
▪ A primary index as an index on an ordered file where the search key is the same as the sort key
▪ Where every record must have a unique value for the sort key field
▪ A clustered index is defined the same as a primary index except that multiple records may have the same value for the sort key field
This lecture presentation (and various other texts) define
▪ A primary index as an index where the search key is the same as the
sort key and records’ sort key fields may have non-unique values ▪ Where possible I avoid referring to clustered indexes
A primary index does not have to refer to the same attribute(s) as the primary key
A primary index is an index on a sequential file where ▪ The search key of the index is the same set of attributes as
the sort key of the file
▪ SQL Server refers to primary indexes as clustered
▪ A file may have only one primary index
Primary indices can be either dense or sparse
Dense indices have one data entry for each record
▪ Data entries are in the same order as the file records ▪ And can be searched using binary search
In practice an index is some data structure that supports efficient searches
In this example two data records or four index records fit on one block
In reality indexes are much smaller than files
10
20
10
20
30
41
30
41
file
53
60
53
60
77
91
77
91
index
A sparse index contains one data entry
for each block of records in a data file
▪ It is only possible to use a sparse index if the data file is sorted by the search key of the index
▪ i.e. if it is a primary index
▪ Sparse indexes are smaller than dense
indexes
Sparse indexes are searched in much
the same way as dense indexes
▪ Except that the index is searched for the largest key less than or equal to the target
▪ The RID is then followed to a block of the data file
10
bob
20
dave
10
30
53
77
30
ann
41
dave
53
kim
60
ann
…
77
dave
91
lee
index file
10
bob
20
dave
An index on a large data
may span many blocks
▪ May need multiple disk I/Os to find a record
▪ Even using binary search ▪ An alternative is to build
a multiple level index
The bottom index level
may be dense or sparse
▪ Subsequent levels of the index are sparse indexes on the preceding lower level of the index
10
20
30
41
30
ann
41
dave
10
53
…
53
kim
60
ann
53
60
77
91
77
dave
91
lee
file
10
bob
20
dave
The search key of a secondary index is not the sort key of the file
▪ Also referred to as unclustered
▪ Files with secondary indices may still be sorted
▪ On some other key
▪ Secondary indexes must be dense
▪ Why?
Secondary indexes are less
efficient for range searches
▪ As matching records reside on multiple unrelated blocks
index file
ann
ann
bob
dave
30
ann
41
dave
53
kim
60
ann
dave
dave
kim
lee
77
dave
91
lee
10
bob
20
dave
A secondary index may waste space if the search key is not a candidate key
▪ One option is to introduce a level of indirection
Create a bucket for each search key value
▪ Index contains pointers to buckets
▪ Bucket contains RIDs of matching records
Saves space if search key values appear at least twice on average
▪ And if they as large as RIDs
index
…
buckets file
ann
bob
dave
kim
lee
…
30
ann
41
dave
53
kim
60
ann
77
dave
91
lee
Multiple secondary indexes using indirection can improve efficiency on queries with complex criteria
▪ i.e. a series of conjunctions
▪ Collect all the rids from the buckets that meet each of the
criteria
▪ Then intersect them
▪ And only retrieve records in the result
This avoids retrieving records that match some, but
not all, of the criteria
SELECT id, name
FROM Customer
WHERE name = ‘kate’ AND city = ‘Vancouver’
Keyword searches on documents are common Consider a document as a record in a table
▪ With Boolean attributes for each word in the document ▪ An attribute is true if and only if the word is in the document
Create secondary indices for each attribute (word)
▪ Indices only contain data entries for attributes that are true
▪ Wordsthatexistinadocument ▪ Index RIDs lead to documents
Combine document word indices into a single index ▪ That uses indirect buckets for space efficiency
▪ Known as an inverted index
Indexes can maintain data to match words in document sections and track the number of occurrences of words
Inverted Index
Word
Song
Word
Song
Word
Song
a
1, 3
gun
1
screaming
1
all
2
hand
2
shadows
1, 2
am
3
here
3
shining
1
and
1, 3
his
1
sirens
1
are
1, 2
howling
1
so
1, 2
as
3
hunger
3
Surrender
2
baby
3
I
3
take
2, 3
banquet
3
in
1, 2
take
2
be
2
is
3
the
1, 2, 3
blade
1
is
3
there’s
1
breathe
3
it’ll
2
they’ll
2
bright
1
Love
3
to
2
close
3
man
1
tonight
1, 2
come
2
me
2
true
2
desire
3
night
3
try
3
down
1
now
3
understand
3
dreams
2
of
2
valley
1
end
2
oh
1
way
1
eye
1
on
3
we
2, 3
feed
3
pull
3
which
3
fire
3
right
2
with
1, 2
fires
1
running
2
your
2
Documents (Songs)
ID
Song Contents
1
The sirens are screaming and the fires are howling, way down in the valley tonight.
There’s a man in the shadows with a gun in his eye, and a blade shining oh so bright.
2
We’re running with the shadows of the night So baby take my hand, it’ll be all right Surrender all your dreams to me tonight They’ll come true in the end
3
Take me now baby here as I am Pull me close, try and understand Desire is hunger is the fire I breathe Love is a banquet on which we feed
An index’s search key can contain several fields
▪ Such search keys are referred to as composite search keys or concatenated keys
▪ e.g.{fName,lName}
For searches on equality the values for each field in the
search key must match the values in the record
▪ e.g. Joe Smith does not match Joe Jones or Fred Smith
For range queries, ranges may be specified for any fields
in the search key.
▪ If no values are specified for a field it implies that any value is acceptable for that field
Indices can be created from an existing file by repeatedly calling the index’s insert function
▪ This process is likely to be inefficient
▪ An alternative is to sort the data and fill the index pages
▪ Referred to as bulk loading
▪ Building additional levels of a multi-level index as necessary
▪ And leaving space to accommodate future insertions
The size of an index is determined by the size (and number) of search key entries
▪ It may be possible to compress search keys
▪ For example by truncating strings such that different values can still be distinguished.