Faculty of Information Technology,
Monash University
FIT2004: Algorithms and Data Structures
Week 5: Efficient Lookup Structures
These slides are prepared by M. A. Cheema and are based on the material developed by Arun Konagurthu and Lloyd Allison.
Things to note/remember
Assignment 2 due 1 May 2020 – 23:55:00
FIT2004: Lec-5: Efficient Lookup Structures
Recommended Reading
Unit Notes – chapters 7 and 8 (FIT2004 S1/2016)
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Lookup Table
A lookup table allows inserting, searching and deleting values by their keys.
The idea of a lookup table is very general and important in information processing systems.
The database that Monash University maintains on students is an example of a table. This table might contain information about:
Student ID
Authcate
First name
Last name
Course(s) enrolled
Lookup Table
Elements of a table, in some cases, contain a key plus some other attributes or data
The key might be a number or a string (e.g., Student ID or authcate)
It is something unique that unambiguously identifies an element
Elements can be looked up (or searched) using this key
Sorting based lookup
Keep the elements sorted on their keys in an array (e.g., sort by student ID)
Searching:
O(log N)
use Binary search to find the key – O(log N)
Insertion:
O(N)
Use Binary search to find the sorted location of new element – O(log N)
Insert the new element and shift all larger elements to the right – O(N)
Deletion:
O(N)
Search the key – O(log N)
Delete the key – O(1)
Shift all the larger elements to the left – O(N)
Is it possible to do better?
Yes! Hash Tables and Balanced Binary Search trees (to name a few)
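The sorted-array approach above can be sketched with Python's standard bisect module (an illustrative sketch; the SortedTable class is not from the unit material):

```python
import bisect

class SortedTable:
    """Lookup table kept as a sorted list of keys."""
    def __init__(self):
        self.keys = []

    def search(self, key):
        # Binary search: O(log N)
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        # Binary search for the position, then shift larger elements right: O(N)
        bisect.insort(self.keys, key)

    def delete(self, key):
        # Binary search in O(log N), then shift larger elements left: O(N)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.keys.pop(i)
```

The O(N) cost of insert/delete comes from the shifting inside insort/pop, exactly as in the bullet points above.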
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Binary Search Tree (BST)
The empty tree is a BST
If the tree is not empty
the elements in the left subtree are LESS THAN the element in the root
the elements in the right subtree are GREATER THAN the element in the root
the left subtree is a BST
the right subtree is a BST
Note! Don’t forget last two conditions!
[Diagram: example BST — 24(20(15(14, 17), 22(21, 23)), 28(26, 30))]
Searching a key in BST
# BST implemented here as a tree data structure
def search(current, key):
    if current is None:
        return False  # not present!
    elif key < current.e:                 # search key in the left subtree
        return search(current.left, key)
    elif key > current.e:                 # search key in the right subtree
        return search(current.right, key)
    else:
        return True
[Diagram: searching for 21 — path 24 → 20 → 22 → 21]
Insert a key x in BST
# BST implemented here as a tree data structure
def insert(current, x):
    if current is None:
        return Node(x)                          # insert here as a leaf node
    elif x < current.e:                         # traverse and insert along the left subtree
        current.left = insert(current.left, x)
    elif x > current.e:                         # traverse and insert along the right subtree
        current.right = insert(current.right, x)
    else:  # x == current.e
        pass  # it depends: e.g., store duplicates in a list at the node, or replace the data
    return current  # caller re-attaches: root = insert(root, x)
[Diagram: inserting 27 — path 24 → 28 → 26; 27 becomes the right child of 26]
Delete a key from BST
First lookup key in BST
If the key node has no children // Case 1
delete the key node (set the parent's link to it to nil)
[Diagram: deleting 27 — a leaf node, simply removed]
Delete a key from BST
First lookup key in BST
If the key node has one child // Case 2
delete the key node
replace the key node with its child
[Diagram: deleting 28 — it has one child (26), so 26 takes its place]
Delete a key from BST
First lookup key in BST
If the key node has two children // Case 3
Find successor (or predecessor)
Replace key node value with the value of successor (or predecessor)
Delete successor (or predecessor)
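The three deletion cases can be combined into one recursive routine (a sketch using the same node representation as the earlier pseudocode; the Node class here is illustrative):

```python
class Node:
    def __init__(self, e, left=None, right=None):
        self.e, self.left, self.right = e, left, right

def delete(current, key):
    """Delete key from the BST rooted at current; return the new subtree root."""
    if current is None:
        return None
    if key < current.e:
        current.left = delete(current.left, key)
    elif key > current.e:
        current.right = delete(current.right, key)
    else:
        # Cases 1 and 2: zero or one child -- splice the node out
        if current.left is None:
            return current.right
        if current.right is None:
            return current.left
        # Case 3: two children -- copy the successor's value, then delete it
        succ = current.right
        while succ.left is not None:
            succ = succ.left
        current.e = succ.e
        current.right = delete(current.right, succ.e)
    return current
```

Using the predecessor (rightmost node of the left subtree) instead of the successor works symmetrically.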
[Diagram: deleting 24 using its predecessor — 23 replaces 24]
[Diagram: deleting 24 using its successor — 26 replaces 24]
Worst-case of BST
A BST is not necessarily balanced and, in the worst case, may degenerate to a linked list
E.g., when elements are inserted in sorted order (ascending or descending) – insert 24, 20, 15, 14.
Worst-case time complexity
Insert: O(N)
Delete: O(N)
Search: O(N)
Can we improve the performance by keeping the tree balanced?
[Diagram: degenerate BST after inserting 24, 20, 15, 14 — each node has only a left child]
Outline
Introduction
Binary Search Tree
AVL Tree
Introduction
Balancing AVL tree
Complexity Analysis
Hash tables
AVL Trees: Introduction
An Adelson-Velskii and Landis (AVL) tree is a height-balanced BST
The heights of left and right subtrees of every node differ by at most one
If at any time they differ by more than one, rebalancing is done to restore this property.
Is the following tree balanced according to the above definition?
[Diagram: a balanced tree — 24(20(15(14, 17), 22), 26(25, 27))]
AVL Trees: Introduction
Is it still balanced after deleting 25?
[Diagram: the same tree without 25]
AVL Trees: Introduction
Is it still balanced after deleting 17?
[Diagram: the tree without 25 and 17]
Quiz time!
https://flux.qa – RFIBMB
AVL Trees: Introduction
Is it still balanced after deleting 22?
[Diagram: the tree without 25, 17 and 22 — node 20's subtree heights now differ by 2]
Defining AVL Tree
T is an AVL Tree if T is a binary search tree, and . . .
Every node of T has a balance factor of 1, 0, or -1
[Diagram: the example tree annotated with each node's balance factor]
Balance factor = height(L) – height(R)
Outline
Introduction
Binary Search Tree
AVL Tree
Introduction
Balancing AVL tree
Complexity Analysis
Hash tables
Balancing AVL Tree after insertion/deletion
The tree becomes unbalanced after deleting 22.
[Diagram: balance factors after deleting 22 — node 20 now has balance factor 2]
Balancing AVL Tree after insertion/deletion
The tree becomes unbalanced after deleting 22.
The tree may also become unbalanced after insertion.
How to balance it after deletion/insertion?
We have four possible scenarios:
Left-left
Left-right
Right-right
Right-left
[Diagram: the unbalanced tree with its balance factors — this is a left-left case]
Left-Left Case
A left-left case occurs when
a node has a balance factor of +2; and
its left child has a balance factor of 0 or more
[Diagram: 24 has balance factor 2 and its left child 20 has balance factor 1 — a left-left case]
Handling Left-left case
[Diagram: x has balance factor 2, its left child y has balance factor 0 or 1, and z hangs below y; subtrees A, B, C, D hang off x, y, z. After the rotation, y is the root of the subtree with x as its right child]
Rotate y to be the parent of the subtree
What about B?
Values in B are > y
Values in B are < x
B can be the left child of x!
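In code, the left-left fix is a single right rotation at x (a sketch; the Node class is illustrative, and heights/balance factors would also be updated in a full implementation):

```python
class Node:
    def __init__(self, e, left=None, right=None):
        self.e, self.left, self.right = e, left, right

def rotate_right(x):
    """Fix a left-left case: make y = x.left the root of this subtree.
    y's old right subtree (values > y but < x) becomes x's left child."""
    y = x.left
    x.left = y.right   # subtree B moves across
    y.right = x
    return y           # y is the new root of this subtree
```

The caller re-attaches the returned node in place of x, e.g. `parent.left = rotate_right(parent.left)`.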
Example: Left-left case
[Diagram: applying the left-left rotation to the example — 20 is rotated up to become the subtree root, with 15 (child 14) on its left and 24 (children 22 and 26) on its right; all balance factors are back in {-1, 0, 1}]
Left-Left case: Does it work?
y could have balance factor 0 or 1
Let's keep it simple. Assume balance(y) = 0 and balance(z) = 0
[Diagram: let h = height(D); the assumed balance factors force height(B) = h + 1 and height(A) = height(C) = h. After the rotation, balance(z) = 0, balance(x) = 1 and balance(y) = -1 — all legal]
Left-Left case: Does it work?
Now, y could have had balance factor 1
z could have been 0, 1 or -1
We can make similar arguments in each case, and show that the rotation gives us the balance factors we need
Can you come up with an argument that does not require 6 cases?
Left-Right Case
A left-right case occurs when
a node has a balance factor of +2; and
its left child has a negative balance factor
[Diagram: 24 has balance factor 2 and its left child 20 has balance factor -1 — a left-right case]
Left-Right Case
IMPORTANT
Left-right cases are handled in two steps
DO NOT stop in between
After step 1, you may have created other problems with balance factors, but keep going and do step 2 before checking the state of the AVL tree
Handling Left-right case
Left Right Case
Convert the Left Right case to a Left Left case by rotating 20 up (a left rotation at its parent 15)
Now we have a left-left case!
Handle the Left Left case as described earlier
[Diagram: before — 24 (balance factor 2) with left child 15 (balance factor -1) and 15's right child 20, subtrees A–D below; after rotating 20 up — 24's left child is 20 and 20's left child is 15: a left-left case]
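The two steps can be sketched as a left rotation at the left child followed by the right rotation from before (an illustrative sketch; the Node class and function names are not from the unit material):

```python
class Node:
    def __init__(self, e, left=None, right=None):
        self.e, self.left, self.right = e, left, right

def rotate_left(x):
    """Mirror image of rotate_right: y = x.right becomes the subtree root."""
    y = x.right
    x.right = y.left
    y.left = x
    return y

def rotate_right(x):
    y = x.left
    x.left = y.right
    y.right = x
    return y

def fix_left_right(x):
    # Step 1: rotate the left child left -> this creates a left-left case
    x.left = rotate_left(x.left)
    # Step 2: rotate x right, as in the left-left case
    return rotate_right(x)
```

As the slide warns, both rotations must be performed before checking balance factors again.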
Example: Left-right case – Step 1
[Diagram: rotating 22 up (left rotation at 20) — 24's left child is now 22, with 20 (children 14 and 21) on 22's left and 23 on its right. This is a left-left case now]
Example: Left-right case – Step 2
[Diagram: right rotation at 24 — 22 becomes the root, with 20 (children 14 and 21) on its left and 24 (children 23 and 26) on its right]
Right-Right and Right-Left
These two cases are mirror images of the Left-Left and Left-Right cases
Left as an exercise
Outline
Introduction
Binary Search Tree
AVL Tree
Introduction
Balancing AVL tree
Complexity Analysis
Hash tables
Search, Insertion, Deletion in AVL Tree
The search algorithm in an AVL tree is exactly the same as in a BST
Worst-Case time complexity
O(log N) because the tree is balanced (A proof of this is examined in the supplementary questions of tute 6)
Insertion/Deletion in AVL Tree
Insert/Delete the element in the same way as in BST (as described earlier)
Balance the tree if it has become unbalanced (as described earlier)
Worst-case insertion/deletion time complexity: ??
Complexity of Balancing AVL tree after insertion/deletion
Tree is balanced before insertion/deletion
An insertion/deletion can affect the balance factors of at most O(log N) nodes
E.g., Insert 13
[Diagram: the example tree with balance factors before inserting 13]
The balancing is done from bottom to top
E.g., first balance node 15 and so on
[Diagram: after inserting 13, the balance factors on the path from 13 to the root change; nodes 15, 20 and 24 now have balance factor 2]
Complexity of Balancing the AVL Tree
The tree is balanced in a bottom up fashion starting from the lowest node which has a balance factor NOT in {0, 1, -1}
Balancing at each node takes constant time (1 or 2 rotations)
We need to balance at most O(log N) nodes in the worst-case
So total cost of balancing after insertion/deletion is O(log N) in worst-case
[Diagram: a rotation at the lowest unbalanced node may still leave an ancestor unbalanced, so we keep moving up toward the root]
Computing balance factors
How do we update balance factors quickly?
[Diagram: inserting 25 below 26 changes 26's balance factor from -1 to 0, but 24's new balance factor cannot be determined from its children's balance factors alone — we would need to know whether the subtree's height changed]
Computing balance factors
Instead, store heights at each node
Balance can then be computed as needed by taking height(current.left) – height(current.right)
Heights are easy to update
[Diagram: the example tree annotated with heights; after inserting 13, only the heights on the path from the new leaf to the root need updating]
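Storing heights makes both the update and the balance query cheap (a sketch; the helper names are illustrative):

```python
def height(node):
    """Height of a subtree; an empty tree has height 0."""
    return 0 if node is None else node.height

class Node:
    def __init__(self, e, left=None, right=None):
        self.e, self.left, self.right = e, left, right
        self.height = 1 + max(height(left), height(right))

def balance(node):
    # Balance factor computed on demand: height(L) - height(R)
    return height(node.left) - height(node.right)

def update_height(node):
    # O(1) per node after a child's subtree changes
    node.height = 1 + max(height(node.left), height(node.right))
```

After an insertion or deletion, calling update_height on each node along the path back to the root keeps all heights correct in O(log N) total.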
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Introduction
Chaining
Probing
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Note: Non-integer keys
How do we index them?
In hash tables, we always consider keys to be integers
If you have a data set with non-integer keys, first apply some conversion process to the keys (a prehash)
This prehash might be different for different data
Example: 1111-222-333 (a phone number) → 1111222333
Direct-Addressing
Assume that we have N students in a class and the student IDs range from 0 to N-1. How can we store the data to delete/search/insert in O(1)-time?
Create an array of size N
Store a student with ID x at index x
Note: Each student is indexed at a unique index in the array
Searching the record with ID x
Return array[x]
Fixing the Problem with Direct-Addressing
Hash function: takes elements of a big (possibly infinite) universe and maps them to some finite range
i.e. indices in [0..M-1]
One way to do this in practice is to use modulus (%)
Note that just taking keys % m is a bad hash function!
a%m = the remainder when a is divided by m
Always gives a value in the range [0..m-1]
[Diagram: keys from the ranges 0…M-1, M…2M-1, 2M…3M-1, … all map into table indices 0…M-1]
Collisions
A consequence of having more keys than indices in our table
Many values “belong” (are hashed) to the same slot
[Diagram: several keys from different ranges hash to the same table index]
Collision probability
N items
Table size = M
Chance of collision?
Similar to a famous “paradox”, the birthday paradox
How many people do you need in a room before 2 of them share a birthday?
In this situation, “table size” is 365, and N is the number of people
Visit https://pudding.cool/2018/04/birthday-paradox/ for an interactive explanation of the birthday paradox
Probability of Collision
prob(collision) = 1 – prob(no collision)
The probability of collision for 23 people is ~ 50%
The probability of collision for 50 people is 97%
The probability of collision for 70 people is 99.9%
Handling Collisions
Two strategies to address collisions!
Chaining: for each hash value, we have a separate data structure which stores all items with that hash value
Probing: items which collide get inserted somewhere else in the array
Note: These two general strategies have many other names, but they are confusing
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Introduction
Chaining
Probing
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Chaining
If there are already some elements at the hash index
Add the new element to a list at array[index]
Example: Suppose the hash table size M is 13 and the hash function is key % 13.
Insert 29: index = 29 % 13 = 3
Insert 38: index = 38 % 13 = 12
Insert 16: index = 16 % 13 = 3
Insert 15: index = 15 % 13 = 2
Insert 42: index = 42 % 13 = 3
Insert 25: index = 25 % 13 = 12
[Diagram: index 2 → 15; index 3 → 29, 16, 42; index 12 → 38, 25]
Lookup/searching an element:
index = hash(key)
Search in list at Array[index]
Deleting an element:
Search the element
Delete it
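The chaining scheme above can be sketched as follows (illustrative; M = 13 and hash = key % 13 as in the example):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining (a list per slot)."""
    def __init__(self, m=13):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _hash(self, key):
        return key % self.m

    def insert(self, key):
        # Append to the chain at the hashed index
        self.table[self._hash(key)].append(key)

    def search(self, key):
        # Scan only the chain at the hashed index
        return key in self.table[self._hash(key)]

    def delete(self, key):
        chain = self.table[self._hash(key)]
        if key in chain:
            chain.remove(key)
```

Search and delete cost is proportional to the chain length, which is why keeping the load factor constant matters.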
Chaining
Assume that M is the size of the hash table and N is the number of records already inserted.
Assume that our hash function distributes our keys uniformly over our hash table
Assume that the load factor of the table is less than some constant c
Then there are at most cM items in the table
These cM items are distributed among M “chains”
The average chain has fewer than c items in it — a constant
The average time complexity of an operation is therefore O(1)
Complexity of resizing - intuition
What about the insert that triggers a resize?
Table size | Total work for insertions (half the table size) | Total work for resize (double the table, reinsert)
m | m/2 | 2m + m/2
2m | m | 4m + 3m/2
4m | 2m | 8m + 7m/2
… | … | …
2^i·m | 2^(i-1)·m | 2^(i+1)·m + O(2^i·m)
Complexity of resizing - intuition
Imagine spreading out the resize work over the insertions
This concept is called “amortized analysis” (not examinable)
Table size | Total work for insertions | Total work for resize
m | m/2 | 2m + m/2
2m | m | 4m + m
4m | 2m | 8m + 2m
… | … | …
2^i·m | 2^(i-1)·m | 2^(i+1)·m + 2^(i-1)·m
The amortized cost of each insert is O(1), even though most of the work occurs on one specific insert (the one which triggers a resize)
Other data structures as “chains”
We could use anything as our “chains”
We want something with fast insert, lookup and delete
I wonder what that could be?
Balanced binary search trees!
Or what about…
Hash tables? (Look at the tute problem on FKS hashing)
Probing
In probing schemes, each index in the hash table contains at most one element.
How to avoid collision in this case?
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Introduction
Chaining
Probing
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Linear Probing
In case of collision, sequentially look at the next indices until you find an empty index
h(k, i) = (h’(k) + i) % M
h’(k) is some hash function
i is how many times we have probed
For example, suppose h’(k)= k % 12
[Diagram: empty table with indices 0…11]
// pseudocode for linear probing (insert)
index = hash(key)
i = 1
while array[index] is not empty and i != M
    index = (hash(key) + i) % M
    i = i + 1
if i != M
    insert element at array[index]
Linear Probing
h’(k) = k % 12
h(k, i) = (k % 12 + i) % 12
Insert 5
Insert 13: 13 % 12 = 1
Insert 2
Insert 26: 26 % 12 = 2, probe once to 3
Insert 38: 38 % 12 = 2, probe twice to 4
[Diagram: index 1 → 13, index 2 → 2, index 3 → 26, index 4 → 38, index 5 → 5]
Linear Probing
Searching:
Look at index = h’(key). If element not found at index, sequentially look into next indices until
you find the element
or reach an index which is NIL (which implies that the element is not in the hash table)
Worst-case Search Complexity?
In general, for probing based tables, we resize (double) the table size once the load factor exceeds a certain value (different for different hashing schemes)
Load factor = number of elements in table / size of table
This means the table is never full
Still O(M) for search, since the table could have load factor * M elements and they could all be in one cluster
Linear Probing
Deletion:
Search the element
Delete it
Is that all?
Example: Suppose the hash function is h(k, i) = (k % 12 + i) % 12
Delete 2
2 is located at position hash(2), so no need to probe
Delete it
Now what happens if we look up 38?
hash(38) = 2 and slot 2 is now empty, so the search reports “not found”. But that is wrong!
Solution: Re-insert everything after the deleted position, up to the next empty position
Re-insert 26 (moves to index 2)
Re-insert 38 (moves to index 3)
Re-insert 5 (stays at index 5)
[Diagram: index 1 → 13, index 2 → 26, index 3 → 38, index 5 → 5]
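A linear-probing table with this delete-and-re-insert repair can be sketched as follows (illustrative; M = 12 as in the example):

```python
class LinearProbingTable:
    def __init__(self, m=12):
        self.m = m
        self.table = [None] * m

    def _probe(self, key):
        # Yield h'(key), h'(key)+1, ... (mod m), at most m indices
        h = key % self.m
        for i in range(self.m):
            yield (h + i) % self.m

    def insert(self, key):
        for idx in self._probe(key):
            if self.table[idx] is None:
                self.table[idx] = key
                return True
        return False  # table full

    def search(self, key):
        for idx in self._probe(key):
            if self.table[idx] is None:
                return False           # empty slot ends the probe chain
            if self.table[idx] == key:
                return True
        return False

    def delete(self, key):
        for idx in self._probe(key):
            if self.table[idx] is None:
                return                 # not present
            if self.table[idx] == key:
                self.table[idx] = None
                # Re-insert everything up to the next empty slot
                j = (idx + 1) % self.m
                while self.table[j] is not None:
                    k, self.table[j] = self.table[j], None
                    self.insert(k)
                    j = (j + 1) % self.m
                return
```

Without the re-insertion loop, the deleted slot would break probe chains passing through it, exactly as in the lookup-38 example above.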
Linear Probing
The previous example probed with an increment of 1
We can increment by any constant: h(k, i) = (h’(k) + c*i) % M
E.g., if c=3 and index = 2 is a collision, we will look at index 5, and then index 8, then 11 and so on …
The problem with linear probing is that collisions from nearby hash values tend to merge into big blocks, and therefore the lookup can degenerate into a linear O(N) search. This is called primary clustering.
[Diagram: a primary cluster — a contiguous run of occupied slots that any colliding key must scan past]
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Introduction
Chaining
Probing
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Quadratic Probing
Unlike linear probing that uses fixed incremental jumps, quadratic probing uses quadratic jumps.
Linear probing: h(k, i) = (h’(k) + c*i) % M
Quadratic probing: h(k, i) = (h’(k) + c*i + d*i²) % M where c and d are constants
Quadratic probing is not guaranteed to probe every location in the table.
An insert could fail while there is still an empty location
It can be shown that with prime table size and load factor < 0.5, quadratic probing always finds an empty spot
Therefore, make sure we resize when load factor = 0.5!
The same probing strategy is used in the associated search and delete routine!
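The probe sequence can be sketched as follows (illustrative, taking c = 0 and d = 1, i.e. h(k, i) = (h'(k) + i²) % M):

```python
def quadratic_probe_sequence(key, m, probes):
    """First `probes` indices visited by quadratic probing
    with h(k, i) = (k % m + i*i) % m."""
    return [(key % m + i * i) % m for i in range(probes)]
```

With m prime, the first ⌈m/2⌉ probe indices are all distinct, which is why a load factor below 0.5 guarantees the insert finds an empty slot.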
[Diagram: quadratic probing jumps over a cluster instead of scanning through it]
Quadratic Probing - Delete
In linear probing, each element k may be part of a probe chain, but the previous position in the chain will be 1 (or c) positions before k, and the next position in the chain will be 1 (or c) positions after k
For quadratic probing, an element may be part of many, separate probe chains
Which elements do we re-insert?
Quadratic Probing - Delete
Instead of reinserting, we use “lazy deletion”
Simply mark the element as “deleted”
When we do lookups, we do not return marked elements (but we can probe past them)
When we do inserts, we are allowed to insert over the top of marked elements
Problem with Quadratic Probing
Quadratic probing avoids primary clustering
However, if two elements have the same hash index (e.g., h’(k1) = h’(k2)), the jumps are the same for both elements.
This leads to a milder form of clustering called secondary clustering.
Is there a way to have different “jumping” for elements that hash to same indexing?
Yes! Double hashing – two hash functions
Or Cuckoo hashing – two hash functions and two hash tables
[Diagram: two keys with the same initial index follow the same probe sequence — secondary clustering]
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Introduction
Chaining
Probing
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Double hashing
Use two different hash functions: one to determine the initial index and the other to determine the size of the jump
h(k, i) = (hash1(key) + i*hash2(key)) % M
E.g., suppose hash1() is key % 13 and hash2() is key % 7
Insert 40: hash1(40) = 1, hash2(40) = 5
i = 0: index = 1
i = 1: index = (1 + 5) % 13 = 6
i = 2: index = (1 + 10) % 13 = 11
i = 3: index = (1 + 15) % 13 = 3
Insert 66: hash1(66) = 1, hash2(66) = 3
i = 0: index = 1
i = 1: index = (1 + 3) % 13 = 4
i = 2: index = (1 + 6) % 13 = 7
i = 3: index = (1 + 9) % 13 = 10
i = 4: index = (1 + 12) % 13 = 0
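The probe sequence above can be sketched directly (illustrative; note that in practice hash2 must never evaluate to 0, or the probe would never move):

```python
def double_hash_sequence(key, m=13, m2=7, probes=5):
    """First `probes` indices visited by double hashing:
    (hash1(key) + i * hash2(key)) % m,
    with hash1 = key % 13 and hash2 = key % 7 as in the example."""
    h1, h2 = key % m, key % m2
    return [(h1 + i * h2) % m for i in range(probes)]
```

Because the jump size depends on the key, two keys with the same initial index usually follow different sequences, avoiding secondary clustering.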
[Diagram: table of size 13 — 40 probes indices 1, 6, 11, … and 66 probes 1, 4, 7, 10, 0 until an empty slot is found]
Outline
Introduction
Binary Search Tree
AVL Tree
Hash tables
Introduction
Chaining
Probing
Linear Probing
Quadratic Probing
Double Hashing
Cuckoo Hashing
Cuckoo hashing
None of the hashing schemes seen earlier can guarantee O(1) searching in the worst case.
Cuckoo Hashing:
Searching – O(1) in worst-case
Deletion – O(1) in worst-case
Insertion cost may be significantly higher – but average cost is O(1)
Idea:
Use two different hash functions hash1() and hash2() and two different hash tables T1 and T2 of possibly different sizes.
hash1() determines the index in T1 and hash2() determines the index in T2.
Each key will be indexed in only one of the two tables
Handle collision by “Cuckoo approach”
kick the other key out to the other table until every key has its own “nest” (i.e., table)
[Diagram: two empty tables — T1 with indices 0…12 (hash1 = key % 13) and T2 with indices 0…6 (hash2 = key % 7)]
Cuckoo hashing
Insert 23: hash1(23) = 23 % 13 = 10
Insert 36: hash1(36) = 36 % 13 = 10; hash2(23) = 23 % 7 = 2
Insert 114: hash1(114) = 114 % 13 = 10; hash2(36) = 36 % 7 = 1
Insert 49: hash1(49) = 49 % 13 = 10; hash2(114) = 114 % 7 = 2; hash1(23) = 23 % 13 = 10; hash2(49) = 49 % 7 = 0
[Figure: state after the insertions traced below: T1[10] = 23; T2[0] = 49, T2[1] = 36, T2[2] = 114]
Insert 23 at T1[10]
Insert 36 at T1[10] and kick away 23 to T2
Insert 23 at T2[2]
Insert 114 at T1[10] and kick away 36 to T2
Insert 36 at T2[1]
Insert 49 at T1[10] and kick away 114 to T2
Insert 114 at T2[2] and kick away 23 to T1
Insert 23 at T1[10] and kick away 49 to T2
Insert 49 at T2[0]
Cuckoo hashing
Insert 140: hash1(140) = 140 % 13 = 10
hash2(23) = 23 % 7 = 2
hash1(114) = 114 % 13 = 10
hash2(140) = 140 % 7 = 0
hash1(49) = 49 % 13 = 10
hash2(114) = 114 % 7 = 2
…
Cuckoo hashing gives up after MAXLOOP iterations
It then picks new hash functions (and may resize both tables) and rehashes everything
Thus insertion may be quite costly
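The insertion procedure with the MAXLOOP cut-off can be sketched as follows (a minimal sketch: the name cuckoo_insert is my own, the hash functions are the slide's key % table size, and a real implementation would rehash rather than leave a failed insert's displacements in place):

```python
def cuckoo_insert(T1, T2, key, max_loop=16):
    """Insert `key`, kicking resident keys back and forth between T1 and T2.

    Returns True on success. Returns False after max_loop displacements
    (MAXLOOP in the slides): a cycle is likely, so the caller should pick
    new hash functions (possibly resizing both tables) and rehash everything.
    Note: this sketch does not undo the displacements of a failed insert.
    """
    cur, table = key, 0              # start by trying to place `key` in T1
    for _ in range(max_loop):
        if table == 0:
            i = cur % len(T1)        # hash1
            if T1[i] is None:
                T1[i] = cur
                return True
            T1[i], cur = cur, T1[i]  # evict the resident key to T2
            table = 1
        else:
            i = cur % len(T2)        # hash2
            if T2[i] is None:
                T2[i] = cur
                return True
            T2[i], cur = cur, T2[i]  # evict the resident key back to T1
            table = 0
    return False                     # give up: rehash with new functions
```

Replaying the slide's example (23, 36, 114, 49 into a size-13 T1 and a size-7 T2) ends with T1[10] = 23 and T2 holding 49, 36 and 114; a subsequent insert of 140 cycles and fails.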
[Figure: T1 (indices 0–12) and T2 (indices 0–6); the keys 23, 36, 114, 49 and 140 keep displacing one another]
Insert 140 at T1[10] and kick away 23 to T2
Insert 23 at T2[2] and kick away 114 to T1
Insert 114 at T1[10] and kick away 140 to T2
Insert 140 at T2[0] and kick away 49 to T1
Insert 49 at T1[10] and kick away 114 to T2
Insert 114 at T2[2] and kick away 23 to T1
Cuckoo hashing
Observation: A key is either at index = hash1(key) in T1 or at index = hash2(key) in T2
Searching:
Look at T1[ hash1(key)] and T2[hash2(key)]
E.g.,
Search 36:
36 % 13 = 10 → look at T1[10]. Not there, so look in T2
36 % 7 = 1 → look at T2[1]. Found!
Search 10:
10 % 13 = 10 → look at T1[10]. Not there, so look in T2
10 % 7 = 3 → look at T2[3]. Not found!
What is worst-case time complexity for searching?
O(1)
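The two-probe lookup is easy to express in code (a sketch; the name cuckoo_search is mine and the hash functions are the slide's key % table size):

```python
def cuckoo_search(T1, T2, key):
    """Worst-case O(1): `key` can only live at T1[hash1(key)] or T2[hash2(key)]."""
    return T1[key % len(T1)] == key or T2[key % len(T2)] == key
```

With the tables from the example (T1[10] = 23; T2 holding 49, 36, 114), searching for 36 succeeds via T2[1] and searching for 10 fails after both probes, exactly as traced above.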
[Figure: T1[10] = 23; T2[0] = 49, T2[1] = 36, T2[2] = 114]
Cuckoo hashing
Observation: A key is either at index = hash1(key) in T1 or at index = hash2(key) in T2
Deletion:
Search key
Delete it if found
What is worst-case time complexity for deletion?
O(1)
What about insertion? (NON-EXAMINABLE)
In the worst-case, it could be arbitrarily bad
It has been shown that the expected cost to insert N items is O(N)
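Deletion is the same two-slot check followed by clearing the match (a sketch; the name cuckoo_delete is mine):

```python
def cuckoo_delete(T1, T2, key):
    """Worst-case O(1): probe the only two possible slots and clear a match."""
    i1, i2 = key % len(T1), key % len(T2)
    if T1[i1] == key:
        T1[i1] = None
        return True
    if T2[i2] == key:
        T2[i2] = None
        return True
    return False   # key was not in either table
```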
Table setups vs. better hashing
So far we have talked about ways to arrange our table
We can also have better hash functions!
The game:
You make a hash table algorithm
I (an evil person) pick some inputs
You make the table and put the inputs into your table
If you choose any deterministic hash, I can break it as follows:
Table setups vs. better hashing
Suppose h(k) is your hash function
I choose values x1, x2, …, xn such that h(xi) = c for all i
All xi will collide
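As a concrete instance of this attack against the fixed hash h(k) = k % 13 used throughout this lecture (the residue 5 is an arbitrary choice of mine):

```python
# Adversarial inputs against the deterministic hash h(k) = k % 13:
# every key of the form 13*j + 5 hashes to slot 5, so all n keys collide
# and the table degenerates into a single chain / probe sequence.
M, c = 13, 5
keys = [M * j + c for j in range(100)]
assert all(k % M == c for k in keys)
```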
If you try to fight this by making a random hash function, I cannot reverse engineer it in this way
But you will be unable to find items (since when you hash them again, you will end up with a different index)
Since you have to pick your hash function when you make the table, there is seemingly no way around this
Table setups vs. better hashing
Solution: Use a random family of hash functions
Whenever you make a table, choose a function at random from an infinite family
Example: h(k) = (ak + b) mod p
p is a prime larger than table size
a and b are chosen randomly, but
a can’t be 0 mod p (a multiple of p)
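This family can be sketched in code (assumptions: the prime 2^31 − 1 stands in for p, and make_universal_hash is my own name; for the universality guarantee p should also exceed the largest key):

```python
import random

def make_universal_hash(M, p=2_147_483_647):
    """Pick a random h(k) = ((a*k + b) mod p) mod M from the family.

    p is a prime larger than the table size M (here 2^31 - 1);
    a is drawn from 1..p-1 so it is never 0 mod p, and b from 0..p-1.
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % M
```

Calling make_universal_hash(13) fixes one deterministic function for the lifetime of the table, so repeated lookups of the same key agree, yet an adversary cannot predict which member of the family was chosen.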
Table setups vs. better hashing
Universal family: for any two keys, the chance of them being hashed to the same position by a randomly chosen function from the universal family is at most 1/tablesize
Table setups vs. better hashing
There are better properties than just universality, like k-independence, but they are beyond the scope of this unit
For some hashing schemes, we need certain mathematical properties of our hash functions to guarantee our time complexities
Summary of Hashing
It is hard to design good hash functions
Imitating randomness is the ultimate goal of a good hash function
But it still needs to be deterministic!
The examples shown in this lecture use very simple hash functions (see the notes!)
In practice hash tables give quite good performance, i.e. O(1) on average
Hash tables are unordered data structures, so certain operations become expensive, e.g.:
Finding the maximum or minimum of a set of elements (keys).
Summary
Take home message
Hash tables provide O(1) lookup in practice (although the worst-case complexity may still be O(N))
AVL trees guarantee worst-case time complexity of O(log N)
Things to do (this list is not exhaustive)
Read more about hash tables and hash functions
Practice balancing AVL trees using pen and paper
Implement BST and AVL trees
Coming Up Next
Retrieval Data Structures for Strings