代写代考 CSE 214: Data Structures for Information Systems

CSE 214: Data Structures for Information Systems

CSE 3241: Database Systems I
Indexes and Indexing

Copyright By PowCoder代写 加微信 powcoder

Where is Chapter? (look online)
What is an Index?
Indexes in SQL
Types of Indexes
Joins and Indexes
Costs vs. Benefits
Query Optimization

What are indexes (indices)?

What is an Index?
An index is a data structure that is used to speed up retrieval of information from a database
Think of it like the index of a book
Data in the book is in a particular order
The index allows us to use a different ordering of the same data
By word in alphabetical order, instead of by “topic” or “chapter”

Why use indexes?
Indexes can make our data access orders of magnitude faster
How will our database system resolve the query below?
Without an index, we fall back on a table scan
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Table Scan
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Table Scan
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Table Scan
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Table Scan
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Table Scan
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Table Scan
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

Imagine if n(EMPLOYEE) = 10,000,000
FROM EMPLOYEE
WHERE Dept=‘Mech’;

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’;
We can use an index to avoid these table scans

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’;
We can use an index to avoid these table scans

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’;
We can use an index to avoid these table scans

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’;
We can use an index to avoid these table scans

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’;
We can use an index to avoid these table scans

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000

FROM EMPLOYEE
WHERE Dept=‘Mech’
AND Salary>=20000;
Or a different index for a different query

Indexes in relational databases
Indexes are stored in our database alongside our data
A persistent, stored data structure
Indexes contain references to our data
References stored in a way to make accessing our data easier
We can define arbitrary indexes
Most DBMS software automatically indexes on the primary key of a relation
We need to provide indexes if we want to index on things other than the primary key

Indexing Schemes
There are three main approaches to indexing data in a relational database system
Clustering (not really an index)
Hash-based indexes
Tree-based indexes
Each of these approaches has their benefits and their drawbacks

Clustering
With clustering, the data in the table is re-arranged
Ordered according to the key field of the index
Not really an index as such
Used like indexes to speed up queries
Advantages:
Easy lookup for queries that use a range of items
Everything in sorted order
Disadvantages:
Only one per table allowed
Requires us to rearrange the table on inserts, deletes, etc.

An example EMPLOYEE table clustered around Dept
ID Dept Salary
1 CSE 40000
2 CSE 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
3 Mech 20000
4 Mech 10000
8 Nuclear 20000

Hash-based Index
We can build an index as a hash table
Hash tables associates a key to a record (or list of records) using a hash function
A hash function is a function that takes a value and returns an associated integer value
We can use this idea to build an index using a hash table

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
We start with a hash function that maps Strings to Integers

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
We start with a hash function that maps Strings to Integers

Suppose we want to build an index on Dept
We start with a hash function that maps Strings to Integers
Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

hash(‘CSE’) = 2

Suppose we want to build an index on Dept
We start with a hash function that maps Strings to Integers
Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

hash(‘CSE’) = 2

Suppose we want to build an index on Dept
We start with a hash function that maps Strings to Integers
Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

hash(‘CSE’) = 2, hash(‘Mech’)=6

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
We start with a hash function that maps Strings to Integers
hash(‘CSE’) = 2, hash(‘Mech’)=6

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
Next we use these values to build a hash table as an array of lists
We can talk about each element of this table as a “bin” of values
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
Next we add references to our hash table based on our hash function
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
Next we add references to our hash table based on our hash function
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
Next we add references to our hash table based on our hash function
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
Next we add references to our hash table based on our hash function
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
Next we add references to our hash table based on our hash function
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Suppose we want to build an index on Dept
When we are done, we have a complete index
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Now how do we use our index?
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Now how do we use our index?
List in hash table gives us references to our underlying table
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

FROM EMPLOYEE
WHERE Dept=‘Mech’;

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

What about deleting records?
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

What about deleting records?
When we delete from an indexed table, we need to update all of its indexes
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

DELETE FROM EMPLOYEE
WHERE Dept=‘Mech’
AND Salary>15000;

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

What about deleting records?
When we delete from an indexed table, we need to update all of its indexes
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

DELETE FROM EMPLOYEE
WHERE Dept=‘Mech’
AND Salary>15000;

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000

4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

What about deleting records?
When we delete from an indexed table, we need to update all of its indexes
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

DELETE FROM EMPLOYEE
WHERE Dept=‘Mech’
AND Salary>15000;

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

What about deleting records?
When we delete from an indexed table, we need to update all of its indexes
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

DELETE FROM EMPLOYEE
WHERE Dept=‘Mech’
AND Salary>15000;

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Inserting records works like the reverse of a delete
Add the record to the table and update the indexes to include it
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

We can also index on more than one attribute
This requires a new hash function that uses both attributes to determine the hash “bin”
hash(‘CSE’) = 2, hash(‘Mech’)=6, hash(‘Nuclear’)=0
idx contents
2 [1,2,5,6]

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

We can also index on more than one attribute
This requires a new hash function that uses both attributes to determine the hash “bin”
‘CSE’,40000

Hash-based Index
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

We can also index on more than one attribute
This requires a new hash function that uses both attributes to determine the hash “bin”
‘CSE’,10000

Hash-based Index
We can also index on more than one attribute
This requires a new hash function that uses both attributes to determine the hash “bin”
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

‘Mech’,20000
… and so on

Hash-based Indexes
Properties
Worst-case: roughly constant time performance to look up elements in the hash table
If your hash function is good and there are few collisions
i.e. few mappings of different keys to the same hash bin
Advantages
Good for looking up values based on equality tests
Equijoins win big with hash-based indexes
Disadvantages
Not so good for range based tests

Tree-based Index
Hash-based indexes are great for equality tests
Even with multiple conditions:
FROM EMPLOYEE
WHERE Dept = ‘CSE’
AND LastName = ‘Smith’;
Not so good with range tests
FROM EMPLOYEE
WHERE Salary >= 15000;
Can we do something intelligent to index in these cases?

Tree-based Index
Can we do something intelligent to index in these cases?
What we want is a system for keeping references to our data in sorted order
Trees are an approach to indexing that lets us do exactly this
Recall the idea of a tree data structure
Each element of the tree is a node
Nodes will contain references to our data and pointers to the node’s children
Each node has some number of children
A binary tree – each node has no more than 2 children
An n-ary tree – each node has no more than n children

Tree-based Indexes (B-Trees)
In databases, we typically talk about B-Trees as our tree data structure of choice
A B-Tree is a type of self-balancing tree
A tree that tries to keep its height as small as possible
B-Trees keep their height small by following strict algorithms for inserting and deleting elements
Guarantee that elements of the tree will be kept in sorted order according to a key value
Nice for us – we can build an index using a B-Tree!

Tree-based Indexes (B-Trees)
Each node of a B-Tree holds a number of elements and a number of references to children
Maximum number of child nodes is called the order of a B-Tree
A B-Tree of order 4 can have up to 4 children per node
The number of data elements in a node is one less than its order

Tree-based Indexes (B-Trees)
This is an example of a B-Tree of order 3
Each node has a maximum of 3 children
Each node has a maximum of 2 elements

Tree-based Indexes (B-Trees)
This is another example of a B-Tree of order 3
Each node has a maximum of 3 children
Each node has a maximum of 2 elements
This tree is not “full” the way the last one was, but it is still a B-Tree of order 3

Tree-based Indexes (B-Trees)
Here is an example of a filled B-Tree and the visit order of elements stored in its nodes
Note that we follow the element from left to right inside the node
First visiting child 0, then element 0, then child 1, then element 1, then child 2

Tree-based Indexes (B-Trees)
This tree would be a good one for visiting nodes based on ID
i.e., WHERE ID > 4

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Tree-based Indexes (B-Trees)
What do we do when our key field is not unique?

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Same thing we did with a hash table – our node holds a list of references instead of just one reference

Tree-based Indexes (B-Trees)
Here’s a B-Tree index using Salary as our key
Each node has a list of nodes with that Salary as key

What do we do when our key field is not unique?
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Tree-based Indexes (B-Trees)
What do we do when we want to use 2 keys?

ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

We have to pick an ordering for our keys
Then position in the tree is determined by that ordering

Tree-based Indexes (B-Trees)
What do we do when we want to use 2 keys?
i.e., Salary and ID
ID Dept Salary
1 CSE 40000
2 CSE 10000
3 Mech 20000
4 Mech 10000
5 CSE 10000
6 CSE 20000
7 Mech 10000
8 Nuclear 20000

Tree-based Indexes (B-Trees)
Create the tree with nodes in order of salary, then ID
Table is not changed

ID Dept Salary
2 CSE 10000
4 Mech 10000
5 CSE 10000
7 Mech 10000
3 Mech 20000
6 CSE 20000
8 Nuclear 20000
1 CSE 40000

Tree-based Indexes (B+Trees)

One enhancement to the idea of a B-Tree is the B+ Tree
In this tree all references to data elements are stored only in the leaf nodes
Internal nodes used only to store references to keys
Provide the structure for quickly getting to a data node based on position
The leaf nodes are linked together as a linked list
This structure is better for managing file I/O (like we need for databases)

Tree-based Indexes
Properties
Worst-case: O(log n) time to search, insert or delete
Advantages
Good for looking up values based on range tests
Disadvantages
Slightly slower on the equality tests
Many databases built on B-Trees alone!
SQLite stores data in B-Trees, as well as indexes!

Joins and Indexes
We can use indexes to speed simple queries
But we often see a lot of benefit when using them in more complex queries
Indexes can be used to speed up our join conditions
Easy to see why if we trace a basic query
FROM EMP, DEPT
WHERE EMP.DeptID = DEPT.ID

Joins and Indexes
Simple Query with Equality Condition

Pseudocode:

For each record E in EMP
For each record D in DEPT
IF E.deptID == D.ID

FROM EMP, DEPT
WHERE EMP.DeptID = DEPT.ID

Nested loop join
Say there were 100,000 employees, 1000 depts

For each record E in EMP
For each record D in DEPT
IF E.deptID == D.ID

Nested loop join
Say there were 100,000 employees, 1000 depts
100,000 x 1000 = 100,000,000!

For each record E in EMP
For each record D in DEPT
IF E.deptID == D.ID

Nested loop join
Say there were 100,000 employees, 1000 depts
100,000 x Index lookup time
We restrict ourselves to just the records where the IDs already match by using the index
This cuts the time for our join down by a few orders of magnitude!

For each recor

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com