Copyright © 2003 by Hanan Samet
hp1
PRELIMINARIES
� File ≡ collection of records (N)
� Each record contains several attributes or keys (k)
Queries:
1. Point query
2. Range query (includes partial match)
3. Boolean query ≡ combine 1 and 2 with AND, OR, NOT
Search methods
1. Organize data to be stored
� boundaries of regions in the search space are
determined by the data
� e.g., binary search tree
2. Organize the embedding space from which the data is
drawn
� region boundaries in the search space are fixed
� e.g., address computation methods such as digital
searching
3. Hybrid
� use 1 for some attributes and 2 for others
Extreme solution:
� Bitmap representation where one bit is reserved for
every possible record in the multidimensional point space
whether or not it is present
� Problems:
1. large number of attributes
2. continuous data (non-discrete)
Copyright © 2003 by Hanan Samet
hp2
SIMPLE NON-HIERARCHICAL DATA STRUCTURES
1. Sequential list
2. Inverted List
� 2 sorted lists
� Data is pointers
� Enables pruning the search with respect to one key
NAME X Y
Chicago 35 42
Mobile 52 10
Toronto 62 77
Buffalo 82 65
Denver 5 45
Omaha 27 35
Atlanta 85 15
Miami 90 5
X Y
Denver Miami
Omaha Mobile
Chicago Atlanta
Mobile Omaha
Toronto Chicago
Buffalo Denver
Atlanta Buffalo
Miami Toronto
Copyright © 2003 by Hanan Samet
hp3
GRID METHOD
� Divide space into squares of width equal to the search
region
� Each cell contains a list of all points within it
� Assume L∞ distance metric (i.e., chessboard)
� Assume C = uniform distribution of points per cell
� Average search time for k-dimensional space is O(F•2k)
F = number of records found = C since query region has
the width of a cell
2k = number of cells examined
(0,100) (100,100)
(100,0) (0,0)
(62,77)
Toronto (82,65)
Buffalo
(5,45)
Denver (35,42)
Chicago
(27,35)
Omaha
(52,10)
Mobile
(85,15)
Atlanta
(90,5)
Miami
Copyright © 2003 by Hanan Samet
hp4
POINT QUADTREE (Finkel/Bentley)
� Marriage between a uniform grid and a binary search tree
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
Chicago
1
b
Mobile
(52,10)
Mobile
2
r
(5,45)
Denver
Denver
5
v
(62,77)
Toronto
3
z
Toronto
(82,65)
Buffalo
4
g
Buffalo
(27,35)
Omaha
Omaha
6
g
(85,15)
Atlanta
7
v
Atlanta
(90,5)
Miami
Miami
8
z
Copyright © 2003 by Hanan Samet
hp5
� Optimal solution is to use H as the new root since the
shaded region is empty
3
z
4. I is the new root
3. G is the new root
2. D is the new root
1. B is the new root
PROBLEM OF DELETION IN POINT QUADTREES
� Delete node A
� Conventional algorithm takes one son as the new root
and reinserts the remaining subtrees
E
D
F
I
A
H
G
C
B
1
b
2
r
� Problems:
1. must search for H
2. a node such as H may possibly not exist as is the
case if node J is present
J
4
g
Copyright © 2003 by Hanan Samet
hp6
� at most one of the remaining candidates will be in the
shaded region
8
v
MECHANICS OF DELETION IN POINT QUADTREES
� Ideally, want to replace deleted
node (A) with a node (B) that
leaves an empty shaded region
B
A
1
b
� Involves search and instead settle on a set of candidate
nodes obtained using a method analogous to searching a
binary search tree
A
B
C
D
A
D
C
B
2
z
D is the closest node
3
r
� Set of candidates = “closest” node in each quadrant
1. choose the candidate node that is closer to each of
its bordering axes than any other candidate node
which is on the same side of those axes
4
v
A
B C
D
E
condition does not always hold
5
g
A
B C
D E
more than one node satisfies it
6
z
2. Use the L1 metric to break ties or deadlocks
� L1 metric is the sum of the displacements from the
bordering x and y axes
� rationale: area of shaded region is Lx•dy+Ly•dx-dx•dy
which can be approximated by 2dx•(Lx+Ly) assuming
dx≅dy and dx being very small
Ly
Lx
dx
dy
7
r
Copyright © 2003 by Hanan Samet
hp7
ALGORITHM FOR DELETION IN POINT QUADTREES
� Select a node satisfying the “closest” criteria to the
deleted node A to serve as the new root (B in the NE
quadrant
A
B
C D
E
F
G H
I
J K
L
M
N
1
b
2
z
1. no reinsertion in opposite quadrant (SW)
3
v
2. ADJ: apply to adjacent quadrants (NW,SE):
if root remains in the quadrant (J) then
� no reinsertion in 2 subquadrants (NW,NE)
� apply ADJ to remaining 2 subquadrants (SW,SE)
else reinsert entire quadrant
4
r
� Comparison with conventional reinsertion algorithm
1. 5/6 reduction in number of nodes requiring
reinsertion (2/3 if pick a candidate at random)
2. number of comparisons during reinsertion was
observed to be ~log4n vs. a much larger factor
3. (average total path length)/(optimal total path length)
was observed to be constant vs. an increase
6
b
3. NEWROOT: apply to quadrant containing replacement
node (NE)
� same subquadrant (EN): no reinsertion
� adjacent subquadrants (NW,SE): ADJ
� opposite subquadrant (SW): NEWROOT
5
g
Copyright © 2003 by Hanan Samet
hp8
MX QUADTREE (Hunter)
� Points are like BLACK pixels in a region quadtree
� Useful for raster to vector conversion
� Empty cells are merged to form larger empty cells
� Only good for discrete data
� Good for sparse matrix applications
� Assume that the point is associated with the lower
left corner of each cell
� Ex: assume an 8 x 8 array
divide coordinate values by 12.5
(0,8) (8,8)
(8,0) (0,0)
1
b
(2,3)
Chicago
2
r
(4,0)
Mobile
3
z
(4,6)
Toronto
4
g
(6,5)
Buffalo
5
v
(0,3)
Denver
6
z
(2,2)
Omaha
7
g
(6,1)
Atlanta
8
r
(7,0)
Miami
9
v
Copyright © 2003 by Hanan Samet
hp9
PR QUADTREE (Orenstein)
1. Regular decomposition point representation
2. Decomposition occurs whenever a block contains more
than one point
3. Useful when the domain of data points is not discrete but
finite
4. Maximum level of decomposition depends on the
minimum separation between two points
� if two points are very close, then decomposition can
be very deep
� can be overcome by viewing blocks as buckets with
capacity c and only decomposing the block when it
contains more than c points
Ex: c = 1
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
1
b
(52,10)
Mobile
2
r
(62,77)
Toronto
3
z
(82,65)
Buffalo
4
g
(5,45)
Denver
5
v
(27,35)
Omaha
6
g
(85,15)
Atlanta
7
z
(90,5)
Miami
8
r
Copyright © 2003 by Hanan Samet
hp10
REGION SEARCH
� Ex: Find all points within radius r of point A
� Use of quadtree results in pruning the search space
A
r
1
b
1 2 3
4
5
6 7 8
9 10
11 12
13
� If a quadrant subdivision point p lies in a region l, then
search the quadrants of p specified by l
1. SE 6. NE 11. All but SW
2. SE, SW 7. NE, NW 12. All but SE
3. SW 8. NW 13. All
4. SE, NE 9. All but NW
5. SW, NW 10. All but NE
2
r
3
z
Copyright © 2003 by Hanan Samet
hp11
A
FINDING THE NEAREST OBJECT
� Ex: find the nearest object to P
� Assume PR quadtree for points (i.e., at most one point per
block)
� Search neighbors of block 1 in counterclockwise order
� Points are sorted with respect to the space they occupy
which enables pruning the search space
� Algorithm:
1 9 13 4 5
2 3
7 8 12 6
11 10
E
D
C
P
B
F
1
b
1. start at block 2 and compute distance to P from A
2
r
2. ignore block 3 whether or not it is empty as A is
closer to P than any point in 3
3
z
3. examine block 4 as distance to SW corner is shorter
than the distance from P to A; however, reject B as it is
further from P than A
4
g
4. ignore blocks 6, 7, 8, 9, and 10 as the minimum
distance to them from P is greater than the distance
from P to A
5
v
5. examine block 11 as the distance from P to the
southern border of 1 is shorter than the distance from
P to A; however, reject F as it is further from P than A
6
z
new F
� If F was moved, a better order would have started with
block 11, the southern neighbor of 1, as it is closest
7
r
Copyright © 2003 by Hanan Samet
hp12
COMPARISON OF POINT, MX, AND PR QUADTREES
FEATURE MX PR POINT
Regular
decomposition
Yes Yes No
Type of nodes Data stored in
leaf nodes and
non-leaf nodes
are for control
Data stored in leaf
nodes and non-
leaf nodes are for
control
Data stored in
leaf nodes and
non-leaf nodes
Shape of tree
depends on odrder
of inserting nodes
No No Yes
Deletion Simple but may
have to collapse
WHITE nodes
Simple but may
have to collapse
WHITE nodes
Complex
Size of space
represented
Finite Finite Unbounded
Type of data
represented
Discrete Continuous Continuous
Shape of space
represented
Square Rectangle Unbounded
Stores coordinates No Yes Yes
Depth of tree (d);
assume M points
All nodes are at
the same depth,
n for a 2n by 2n
region
For square region
with side length L
and minimum
separation S
between two
points,
⎡log4(M-1)⎤ ≤ d
and
d ≤ ⎡log2((L/S)2.5)⎤
⎡log4(3M)⎤ ≤ d
and d ≤ M-1
Copyright © 2003 by Hanan Samet
hp13
APPLICATION OF THE MX QUADTREE (Hunter)
� Represent the boundary as a sequence of BLACK
pixels in a region quadtree
� Useful for a simple digitized polygon (i.e., non-intersecting
edges)
� Three types of nodes
1. interior – treat like WHITE nodes
2. exterior – treat like WHITE nodes
3. boundary – the edge of the polygon passes through
them and treated like BLACK nodes
� Disadvantages
1. a thickness is associated with the line segments
2. no more than 4 lines can meet at a point
1
b
2
r
Copyright © 2003 by Hanan Samet
hp14
MX-CIF QUADTREE (Kedem)
1. Collections of small rectangles for VLSI applications
2. Each rectangle is associated with its minimum enclosing
quadtree block
3. Like hashing: quadtree blocks serve as hash buckets
A {2,6,7,8,9,10}
B {1} C {} D {11}
E {3,4,5} F {12}
1
b
4. Collision = more than one rectangle in a block
� resolve by using two one-dimensional MX-CIF trees
to store the rectangle intersecting the lines passing
through each subdivision point
2
r
� one for y-axis
Y1 8
Y2 Y3
10
Y4 Y5
6
Y6 Y7
2
Binary tree for y-
axis through A
3
g
� if a rectangle intersects both x and y axes, then
associate it with the y axis
4
v
� one for x-axis
X1
X3 X2
9
X5 X4
X6
7
Binary tree for x-
axis through A
5
z
Copyright © 2003 by Hanan Samet
hp15
K-D TREE (Bentley)
� Test one attribute at a time instead of all simultaneously
as in the point quadtree
� Usually cycle through all the attributes
� Shape of the tree depends on the order in which the data
is encountered
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
Chicago
1
b
(52,10)
Mobile
Mobile
x test
2
r
(62,77)
Toronto
Toronto
y test
3
z
(82,65)
Buffalo
Buffalo
x test
4
g
(5,45)
Denver
Denver
5
v
(27,35)
Omaha
Omaha
6
g
(85,15)
Atlanta
Atlanta
y test
7
r
(90,5)
Miami
Miami
8
v
Copyright © 2003 by Hanan Samet
hp16
K-D TREE DELETION
� Similar to deletion in binary search trees
� Assume branch to HISON if test value is ≥ root value
� Assume root discriminates on the x coordinate and
subsequent alternation with the y coordinate value
� Algorithm:
1. replace root by node in HISON with the minimum x
coordinate value
2. repeat the process for the position of the replacement
node using the x or y coordinate values depending on
whether the replacement node is an x or a y
discriminator
Copyright © 2003 by Hanan Samet
hp17
EXAMPLE OF K-D TREE DELETION
Ex: A (20,20)
B (10,30) D (40,50)
C (10,20) E (30,40) F (35,60)
G (25,30)
H (29,40)
I (27,35)
J (28,45)
x
y
x
y
x
y
x
1
b
min x
2
r
min y
3
z
G (25,30)
B (10,30) D (40,50)
C (10,20) E (30,40) F (35,60)
I (27,35)
H (29,40)
J (28,45)
min y
4
g
� Analogy with binary search trees implies that we can
replace the root by the node in LOSON with the max x
coordinate value. However, if there is more than one
node with the same max value, then after replacement
there would be a node in LOSON which is not strictly less
than the new root (e.g., nodes B and C)
5
v
� What if the HISON of the root is empty?
T’
A
6
r
1. replace root node A with the node B in LOSON having the
minimum x coordinate value and set the HISON pointer
of A to be the old LOSON pointer of A
2. repeat process for the position of the replacement node
T’
B
B
B is the node with
the minimum x
coordinate value
and replaces A
7
z
Copyright © 2003 by Hanan Samet
hp18
K-D TREE SEARCH
� Search space can be pruned by testing if the search
region is completely contained in one of the partitions of
the node as now only one subtree must be examined
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
(52,10)
Mobile
(62,77)
Toronto (82,65)
Buffalo
(5,45)
Denver
(27,35)
Omaha
(85,15)
Atlanta
(90,5)
Miami
1
b
1. search the region entirely to the left of Chicago (x=35)
3
z
2. search the region entirely below Denver (y=45)
4
g
3. search yields Omaha
5
v
� Ex: find all points within 10 of (20,30)
2
r
(20,30)
Copyright © 2003 by Hanan Samet
hp19
PR K-D TREE (Knowlton)
� A region contains at most one data point
� Analogous to EXCELL with bucket size of 1
(0,100) (100,100)
(100,0) (0,0)
1
b
(35,42)
Chicago
Chicago
2
r
(52,10)
Mobile
Chicago Mobile
3
z
(62,77)
Toronto
4
g
Mobile Toronto
(82,65)
Buffalo
5
v
Toronto Buffalo
(5,45)
Denver
6
z
Chicago
Denver
(27,35)
Omaha
7
g
Omaha Chicago
(85,15)
Atlanta
8
r
Mobile
Atlanta
(90,5)
Miami
9
v
Atlanta Miami
Copyright © 2003 by Hanan Samet
hp20
ADAPTIVE K-D TREE
� Data is only stored in terminal nodes
� An interior node contains the median of the set as the
discriminator
� The discriminator key is the one for which the spread of
the values of the key is a maximum
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
(52,10)
Mobile
(62,77)
Toronto
(82,65)
Buffalo
(5,45)
Denver
(27,35)
Omaha
(85,15)
Atlanta
(90,5)
Miami
1
b
57,X
2
v
31,X
3
r
16,X
Denver Omaha
4
z
26,Y
Mobile Chicago
5
g
40,Y
6
g
10,Y
Miami Atlanta
7
v
72,X
Toronto Buffalo
8
r
Copyright © 2003 by Hanan Samet
hp21
GRID FILE (Nievergelt,Hinterberger,Sevcik)
� Two level grid for storing points
� Uses a grid directory (a 2d array of grid blocks) on disk
that contains the address of the bucket (i.e., page) that
contains the data associated with the grid block
� Linear scales (a pair of 1d arrays) in core that access the
grid block in the grid directory (on disk) that is associated
with a particular point thereby enabling the decomposition
of the space to be arbitrary
� Guarantees access to any record with two disk operations
— one for each level of the grid
1. access the grid block
2. access the bucket
� Each bucket has finite capacity
� Partition upon overflow
1. bucket partition — overflowing bucket is associated
with more than one grid block
2. grid partition — overflowing bucket is associated with
just one grid block
� Splitting policies
1. split at midpoint and cycle through attributes
2. adaptive
� increases granularity of frequently queried
attributes
� favors some attributes over others
Copyright © 2003 by Hanan Samet
hp22
GRID FILE EXAMPLE
� Assume bucket size = 2
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
(52,10)
Mobile
A
Linear scales
x: y:
0 100 0 100
1. Initially, Chicago and Mobile in bucket A
1
b
(62,77)
Toronto
x=45
B
45
2. Insert Toronto causing a grid partition yielding bucket B
2
z
(82,65)
Buffalo
A
C
y=70
70
3. Insert Buffalo causing a grid partition yielding bucket C
3
r
(5,45)
Denver
4. Insert Denver causing no change
4
v
(90,5)
Miami
x=70
E
B
F
70
7. Insert Miami causing a grid partition yielding bucket F
7
v
43
5. Insert Omaha causing a grid partition yielding bucket D
5
g
(27,35)
Omaha
D
D
C
y=43
6. Insert Atlanta causing a bucket partition yielding bucket E
6
r
(85,15)
Atlanta
E
Copyright © 2003 by Hanan Samet
hp23
EXCELL (Tamminen)
� Uses regular decomposition
� Like grid file, guarantees access to any record with two
disk operations
� Differentiated from grid file by absence of linear scales
which enable decomposition of space to be arbitrary
� Grid partition results in doubling the size of grid directory
� Ex: bucket size = 2
(0,100) (100,100)
(100,0) (0,0)
(35,42)
Chicago
(52,10)
Mobile A
1. Initially, Chicago and Mobile in bucket A
1
b
(62,77)
Toronto
B
2. Insert Toronto causing a grid partition yielding bucket B
2
r
(5,45)
Denver
4. Insert Denver causing no change
4
r
(85,15)
Atlanta
7. Insert Atlanta causing no change
7
z
3. Insert Buffalo causing a grid partition yielding bucket C
3
z
(82,65)
Buffalo
A C
5. Insert Omaha; bucket A overflows; split A yielding bucket D
5
v
(27,35)
Omaha
D
6. Bucket A is still too full, so perform a grid partition
6
g
D C
E
B
(90,5)
Miami
F
8. Insert Miami causing a bucket partition yielding bucket F
8
v
Copyright © 2003 by Hanan Samet
hp24
SUMMARY
� Data structures can be grouped:
N = No data organization
D = Organize data to be
stored (e.g., binary
search tree)
E = Organize the embedding
space from which the
data is drawn (e.g.,
digital searching)
H = Hybrid (combines at
least two of N, D, or E)
1
b
N
0: no
organization
Sequential
List
2
r
H
1: one attribute
Inverted
List
3
z
E
2: all attributes
(fixed number
of cells)
Grid
Method
4
g
E D
3: permit the
number of
cells to vary
Excell
Grid
File
5
v
D E E
4: only partition
the
overflowing
cells
Point
Quadtree
PR
Quadtree
MX
Quadtree
6
r
D H
5: only partition
the
overflowing
cells into 2 cells
PR
K-d tree K-d tree
7
z
D
6: partition using
attribute having
greatest spread
across the cell
Adaptive
K-d tree
8
g
Copyright © 2003 by Hanan Samet
hp25
DRAWBACKS OF MOST HASHING METHODS
� Require rehashing all the data when the hash table
becomes too full
� Goal: only move a few records
� Solutions:
1. Knott
� use a trie in the form of a binary tree
2. extendible hashing
(Fagin, Nievergelt, Pippenger, Strong)
� like a trie except that all buckets are at same level
� buckets are accessed by use of a directory
� directory elements are NOT the same as buckets
3. Linear hashing (Litwin)
� provides for linear growth in the number of
buckets (i.e., the hash table grows at a rate of one
bucket at a time)
� does not make use of a directory
A
B
C
D
1
b
� bucket overflow is solved by splitting
� drawback: accessing a bucket at
level m requires m operations
E F
2
r
� implemented as a directory of pointers to buckets
� accessing a bucket requires 1 operation
A D A A A B C D
3
z
� bucket overflow may cause doubling the directory
� e.g., EXCELL
A A A A A A A A B B E F D D D D
4
g
Copyright © 2003 by Hanan Samet
hp26
MECHANICS OF LINEAR HASHING
� Assume a file with m buckets
� Use two hashing functions hi(k) = f(k) mod 2i+1 for i = n
and i = n+1
1. compute hn+1(k) = x and use the result if x < m
2. otherwise use hn(k)
� Such a file is said to be at level n,n+1
� There exist primary and overflow buckets
� When a record hashes to a full primary bucket, then it is
inserted into an overflow bucket corresponding to the
primary bucket
� τ : storage utilization factor
τ = number of records in file divided by the total of
available slots in primary and overflow buckets
� When τ > a given load α, then one of the buckets is split
� When bucket i is split, it and its overflow bucket’s records
are rehashed using hn+1 and distributed into buckets i
and i + 2n as is appropriate
Copyright © 2003 by Hanan Samet
hp27
LINEAR HASHING INSERTION ALGORITHM
� Let s denote the identity of the next bucket to be split and
cycles from 0 to 2n-1
� Insertion algorithm
1. compute bucket address i for record r
2. insert r in bucket i
3. if τ > α, then split bucket i creating bucket i + 2n and
reinsert in buckets i and i + 2n
4. if s = 2n, then all buckets have been split
� increment n
� reset s to 0
5. if buckets i or i+1 overflow, then allocate an overflow
bucket
6. if rehashing causes some overflow buckets to be
reclaimed, repeat steps 3-5
� Notes
1. a bucket split need not necessarily occur when a
record hashes to a full bucket, nor does the bucket
being split need to be full
2. key principle is that eventually every bucket will be
split and ideally all overflow buckets will be emptied
and reclaimed
3. if the storage utilization gets too low, then buckets
should be reclaimed
Copyright © 2003 by Hanan Samet
hp28
BIT INTERLEAVING HASHING FUNCTIONS
� Bit interleaving takes one bit from the binary
representation of the x coordinate value and one bit from
the binary representation of the y coordinate value and
alternates them
� Use city coordinate values and divide by 12.5 so that
each coordinate value requires just three binary digits
� Example with y being more significant than x
City x y f(z)=z div 12.5 Bit Interleaved
x y Value
Chicago 35 42 2 3 14
Mobile 52 10 4 0 16
Toronto 62 77 4 6 56
Buffalo 82 65 6 5 54
Denver 5 45 0 3 10
Omaha 27 35 2 2 12
Atlanta 85 15 6 1 22
Miami 90 5 7 0 21
1
b
� Ex: Atlanta (6,1)
1 1 0
0 0 1
x
y
2
r
0 1 0 1 1 0 = 22
3
z
Copyright © 2003 by Hanan Samet
hp29
0
EXAMPLE OF LINEAR HASHING
� Assume primary and overflow bucket capacity is 2
� A bucket is split whenever τ ≥ α = 0.66
� Initially, only bucket 0 exists
1
b
� Insert Chicago (14) and Mobile (16):
τ=1 and split bucket 0 creating bucket 1
0 1
Chicago
Mobile
2
r
� Insert Toronto (56) into bucket 0:
τ=0.75 and split bucket 0 creating bucket 2; move
Chicago to bucket 2
2
Chicago
Toronto
3
z
� Insert Denver (10) into bucket 2 which causes it to overflow
Denver
5
v
� Insert Omaha (12) into bucket 0 which causes it to overflow
Omaha
6
g
� Insert Atlanta (22) into bucket 2’s overflow area
Atlanta
7
r
� Insert Buffalo (54) into bucket 2:
τ=0.67 and split bucket 1 creating bucket 3
4
g
3
Buffalo
0
� Insert Miami (21) into bucket 1:
τ=0.67 and split bucket 0 creating bucket 4; move Omaha to
bucket 4
8
z
4
Miami Omaha
� Reclaim the overflow area of bucket 0:
τ=0.67 again and split bucket 1 creating bucket 5; move
Miami to bucket 5
9
v
5
Miami
Copyright © 2003 by Hanan Samet
hp30
ORDER PRESERVING LINEAR HASHING (OPLH)
� Problem: the hash function hn(k) = k mod 2n implies that
all records in a given bucket agree in the least
n significant bits
1. OK for random access
2. unacceptable for sequential access as each record will
be in a different bucket
� Solution: use the hash function
hn(k) = (reverse(k)) mod 2n
1. tests the n most significant bits
2. all records in a bucket are within a given range
� Shortcomings
1. records may not be scattered too well
� overflow is much more common than with traditional
hashing methods
� random access is slower since several overflow
buckets may have to be examined
2. creates a large number of sparsely filled buckets
� sequential access may be slower as may have to
examine many empty buckets
City x y f(z)=z mod 12.5 Bit Interleaved Value
x y x most sig y most sig
Chicago 35 42 2 3 44 28
Mobile 52 10 4 0 1 2
Toronto 62 77 4 6 11 7
Buffalo 82 65 6 5 39 27
Denver 5 45 0 3 40 20
Omaha 27 35 2 2 12 12
Atlanta 85 15 6 1 37 26
Miami 90 5 7 0 21 42
Copyright © 2003 by Hanan Samet
hp31
EXAMPLE OF ORDER PRESERVING LINEAR HASHING
� Assume primary and overflow bucket capacity is 2
� A bucket is split whenever τ ≥ α = 0.66
� Initially, only bucket 0 exists
(0,100) (100,100)
(100,0) (0,0)
0
1
b
(35,42)
Chicago
(52,10)
Mobile
1
� Insert Chicago (28) and Mobile (2): τ=1 and split bucket
0 creating bucket 1
2
r
(62,77)
Toronto
2
� Insert Toronto (7) into bucket 1: τ=0.75 and split bucket
0 creating bucket 2; move Mobile to bucket 2
3
z
(82,65)
Buffalo 3
� Insert Buffalo (27) into 1: τ=0.67, split bucket 1 creating
bucket 3; move Toronto and Buffalo to bucket 3
4
g
(5,45)
Denver
� Insert Denver (20) into bucket 0
5
v
(27,35)
Omaha 4
� Insert Omaha (12) into 0: τ=0.75 and split bucket 0 creating
bucket 4; move Denver and Chicago to bucket 4
6
g
(85,15)
Atlanta
� Insert Atlanta (26) into bucket 2
7
v
(90,5)
Miami
5
� Insert Miami (42) into 2: τ=0.67 and split bucket 1
creating bucket 5; move Miami to bucket 2’s overflow
area
8
z
Copyright © 2003 by Hanan Samet
hp32
COMPARISON OF OPLH WITH EXCELL
OPLH
1. Implicit directory
� one directory element
per primary bucket
� all buckets stored at
one of two levels
� overflow buckets
2. Reverse bit interleaving
3. Bucket overflow
� allocate at most two
additional buckets
(one for the bucket
that has been split
and one for the
overflowing bucket)
4. Retrieval of a record
requires examining
primary and overflow
buckets
EXCELL
1. Explicit directory
� set of primary buckets
2. Reverse bit interleaving
3. Bucket overflow
� triggers bucket split or
directory doubling
4. Retrieve any record with
two disk accesses
� Summary: order preserving linear hashing (OPLH) yields
a more gradual growth in the size of the
directory at the expense of the loss of the
guarantee of retrieval of any record with just
two disk accesses
Copyright © 2003 by Hanan Samet
hp33
EXAMPLES OF COMPARISON OF OPLH WITH EXCELL
1. Reversed bit interleaving with the y coordinate value as
the most significant (i.e., y is split first)
2. Reversed bit interleaving with the x coordinate value as
the most significant (i.e., x is split first)
OPLH:
� 6 primary buckets
� 2 overflow buckets
EXCELL:
� 7 buckets
� 16 directory elements
OPLH:
� 7 primary buckets
� no overflow buckets
EXCELL:
� 6 buckets
� 8 directory elements
Copyright © 2003 by Hanan Samet
hp34
SPIRAL HASHING (Martin)
� Drawbacks of linear hashing:
1. order in which buckets are split is unrelated to the
probability of the occurrence of overflow
2. all buckets that are candidates for a split have the
same probability of overflowing
� Central idea is the existence of an ever-changing (and
growing) address space of active bucket addresses
1. records are distributed in the active buckets in an
uneven manner
2. split the bucket with the highest probability of
overflowing
� When a bucket s is split
1. create d new buckets
2. rehash the contents of s into the d new buckets
3. bucket s is no longer used
� If [s,t] are the active buckets, then [s+1,t+d] are the active
buckets after the split
� Ex: assume that initially there are d – 1 active buckets
starting at address 1
� After s bucket splits, there are (s+1)•(d+1) active buckets
starting at address s + 1 (prove by induction)
d = 2: 1
1
b
2 3
2
r
3 4 5
3
z
4 5 6 7
4
g
Copyright © 2003 by Hanan Samet
hp35
SPIRAL HASHING FUNCTION
� Assume that initially there are d-1 active buckets starting
at address 1
� Key idea is the behavior of the function y = dx
1. dx+1 – dx = dx • (d -1)
2. bucket s has just been split
3. let dx = s+1 = address of first active bucket
4. implies dx+1 – dx = (s+1) • (d-1) = number of active buckets
5. last active bucket is at address dx+1 – 1
6. [dx,dx+1) is range of active buckets
� Use two hashing functions
1. h(k) maps key k uniformly into [0,1) which is the range of the
difference in exponent values of addresses in y = dx+1 – dx
2. y(k) maps h(k) into an address in [s+1,(s+1)+(s+1)•(d-1))
� y(k) = ⎣dx(k)⎦ with x(k) in range [logd(s+1),logd(s+1)+1)
� before split, active buckets lay in range [logds,logds+1)
� want to make sure that all key values previously in
bucket s – i.e., x(k) in [logds,logd(s+1)) are rehashed
into one of the new buckets with an x(k) value in
[logds+1,logd(s+1)+1)
� leave other key values in [logd(s+1),logds+1) unchanged
� difficult to choose x(k)
a. x(k) = logd(s+1) + h(k)
� drawback: must rehash all keys when a bucket is
split
b. x(k) = ⎡logd(s+1) – h(k)⎤ + h(k)
� guarantees that if k is hashed into bucket b (≥ s+1)
then it continues to hash there until bucket b is split
� implies that x(k) is a number in the range
[logd(s+1),logd(s+1)+1) whose fractional part is h(k)
Copyright © 2003 by Hanan Samet
hp36
BEHAVIOR OF THE SPIRAL HASHING FUNCTION
� Ex: d = 2
Bucket
address
Hash interval Relative
load
1 [0.0000,1.0000
)
1.0000
2 [0.0000,0.5849
)
0.5849
3 [0.5849,1.0000
)
0.4151
4 [0.0000,0.3219
)
0.3219
5 [0.3219,0.5849
)
0.2630
6 [0.5849,0.8073
)
0.2224
7 [0.8073,1.0000
)
0.1927
8 [0.0000,0.1699
)
0.1699
9 [0.1699,0.3219
)
0.1520
10 [0.3219,0.4594
)
0.1375
11 [0.4594,0.5849
)
0.1255
12 [0.5849,0.7004
)
0.1155
13 [0.7004,0.8073
)
0.1069
14 [0.8073,0.9068
)
0.0995
15 [0.9068,1.0000
)
0.0932
The function y = 2x Relative load of buckets 1-15
1
b
� Split bucket 3
2
r
� Yields buckets 6 and 7
3
z
Copyright © 2003 by Hanan Samet
hp37
TABLES FOR EXAMPLE OF SPIRAL HASHING
� Use bit interleaving to form a value k – i.e., take one bit
from the binary representation of the x coordinate value
and one bit from the binary representation of the
y coordinate value and alternate them
� Use city coordinate values and divide by 12.5 so that
each coordinate value requires just three binary digits
� Use h(k) = k/64 which has same effect as reverse bit
interleaving and behavior is analogous to OPLH
� Example with y being more significant than x
City x y f(z)=z div 12.5
x y k/64
Chicago 35 42 2 3 14 .21875
Mobile 52 10 4 0 16 .25
Toronto 62 77 4 6 56 .875
Buffalo 82 65 6 5 54 .84375
Denver 5 45 0 3 10 .15625
Omaha 27 35 2 2 12 .1875
Atlanta 85 15 6 1 22 .34375
Miami 90 5 7 0 21 .328125
1
b
� Ex: Atlanta (6,1)
1 1 0
0 0 1
x
y
2
r
0 1 0 1 1 0 = 22
3
z
Copyright © 2003 by Hanan Samet
hp38
MECHANICS OF EXAMPLE OF SPIRAL HASHING
� Assume d = 2 and primary and overflow bucket capacity of 2
� A bucket is split whenever τ ≥ α = 0.66
� Initially, only bucket 1 exists
1
1
1
b
2 3
CHI
MOB
2
3
Chicago
Mobile
� Insert Chicago (.22) and Mobile (.25): τ=1 and split bucket 1
creating buckets 2 and 3; Chicago and Mobile are moved to
bucket 2
2
r
4 5
TOR CHI
MOB
4
5
5
Toronto
� Insert Toronto (.87) into bucket 3: τ=0.75 and split bucket 2
creating buckets 3 and 4; Chicago and Mobile are moved to
bucket 4
3
z
6 7
TOR
BUF
6 7
Buffalo
� Insert Buffalo (.84) into bucket 3: τ=0.67 and split bucket 3
creating buckets 6 and 7; Toronto and Buffalo are moved to
bucket 7
4
g
Denver
Omaha
� Insert Denver (.16) and Omaha (.19) into bucket 4’s overflow
area
DEN
OMA
5
g
8 9
ATL DEN CHI
MOB
OMA
� Insert Atlanta (.34) into bucket 5: τ=0.7 and split bucket 4
creating buckets 8 and 9; Denver is moved to bucket 8,
while Chicago, Mobile, and Omaha are moved to bucket 9
which causes it to overflow
6
v
8
9
9
Atlanta
10 11
ATL
MIA
11
10 11
Miami
� Insert Miami (.33) into bucket 5: τ=0.67 and split bucket 5
creating buckets 10 and 11; Move Atlanta and Miami to
bucket 10
7
v
Copyright © 2003 by Hanan Samet
hp39
COMPARISON OF SPIRAL AND LINEAR HASHING
� Main advantage of spiral hashing over linear hashing is
that the bucket being split is the one most likely to
overflow
� Disadvantages of spiral hashing:
1. the buckets that have been split are not reused
� overcome by using a mapping between logical
and physical addresses
2. expensive to calculate function y = dx
� overcome by use of an approximation
Copyright © 2003 by Hanan Samet
hp40
WHY SPIRAL?
� Can rewrite y = ⎣dx⎦ as ⎣ρ = ejθ⎦ using polar coordinates
which yields the equation of a spiral
� Ex: ρ = e(ln 2)• θ/2π
� Polar coordinates mean that the active buckets are
always within one complete arc of the spiral – i.e., θ = 2π
� Mechanics of a bucket split
1. let first active bucket be at a = ⎣ej•b⎦ (i.e., θ = b)
2. last active bucket is at c = ⎡ej•(b+2π)⎤ – 1
3. bucket split means that the contents of the active
bucket at ρ = a (i.e., θ = b) are distributed into
buckets c + 1 through g where g = ⎣ej•(b+2π +φ)⎦ and φ is
the solution of a + 1 = ej•(b+φ) – i.e., φ = (ln (a + 1))/j – b
4. buckets a + 1 through g are now the active buckets
� Ex: d = 2
10
10
-10
-5
5
5 -5 -10 10 15
11
9
8
5
12
13
6
3
7
14
15
16
1
b
1. active buckets 6 through 11
2
r
2. split bucket 6 to yield buckets 12 and 13
3
z
3. active buckets 7 through 13
4
g
� Observations
1. insead of h(k) uniformly mapping key k into [0,1), use
hθ(k) to uniformly map k into [0,2π)
2. length of arc has constant value between successive
integer values of ρ
5
v