Copyright © 2003 by Hanan Samet

PRELIMINARIES
•  File ≡ collection of records (N)
•  Each record contains several attributes or keys (k)
Queries:
1.  Point query
2.  Range query (includes partial match)
3.  Boolean query ≡ combine 1 and 2 with AND, OR, NOT
Search methods:
1.  Organize the data to be stored
    •  boundaries of regions in the search space are determined by the data
    •  e.g., binary search tree
2.  Organize the embedding space from which the data is drawn
    •  region boundaries in the search space are fixed
    •  e.g., address computation methods such as digital searching
3.  Hybrid
    •  use 1 for some attributes and 2 for others
Extreme solution:
•  Bitmap representation where one bit is reserved for every possible record in the multidimensional point space, whether or not it is present
•  Problems:
   1.  large number of attributes
   2.  continuous data (non-discrete)

SIMPLE NON-HIERARCHICAL DATA STRUCTURES
1.  Sequential list
2.  Inverted list
    •  2 sorted lists
    •  Data is pointers
    •  Enables pruning the search with respect to one key

NAME     X   Y          X (sorted)   Y (sorted)
Chicago  35  42         Denver       Miami
Mobile   52  10         Omaha        Mobile
Toronto  62  77         Chicago      Atlanta
Buffalo  82  65         Mobile       Omaha
Denver    5  45         Toronto      Chicago
Omaha    27  35         Buffalo      Denver
Atlanta  85  15         Atlanta      Buffalo
Miami    90   5         Miami        Toronto

GRID METHOD

•  Divide space into squares of width equal to the search region
•  Each cell contains a list of all points within it
•  Assume L∞ distance metric (i.e., chessboard)
•  Assume C = uniform distribution of points per cell
•  Average search time for k-dimensional space is O(F·2^k)
   F = number of records found = C, since the query region has the width of a cell
   2^k = number of cells examined
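•  A minimal sketch of the scheme above (not from the original notes): points are bucketed by cell, and a query square whose side equals the cell width touches at most 2^k cells. The cell width, point set, and function names are illustrative assumptions.

    from collections import defaultdict

    def build_grid(points, cell_width):
        """Assign each (x, y) point to the grid cell that contains it."""
        grid = defaultdict(list)
        for x, y in points:
            grid[(int(x // cell_width), int(y // cell_width))].append((x, y))
        return grid

    def grid_search(grid, cell_width, qx, qy, half):
        """Report points within L-infinity distance `half` of (qx, qy).

        When the query square's side (2 * half) equals the cell width,
        at most 2 cells per axis (2^k cells in all) are examined.
        """
        found = []
        for cx in range(int((qx - half) // cell_width), int((qx + half) // cell_width) + 1):
            for cy in range(int((qy - half) // cell_width), int((qy + half) // cell_width) + 1):
                for x, y in grid.get((cx, cy), []):
                    if max(abs(x - qx), abs(y - qy)) <= half:
                        found.append((x, y))
        return found

    cities = [(35, 42), (52, 10), (62, 77), (82, 65), (5, 45), (27, 35), (85, 15), (90, 5)]
    grid = build_grid(cities, cell_width=25)
    print(grid_search(grid, 25, 30, 40, 12.5))   # [(35, 42), (27, 35)]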

[Figure: the 100×100 space divided into grid cells, containing Chicago (35,42), Mobile (52,10), Toronto (62,77), Buffalo (82,65), Denver (5,45), Omaha (27,35), Atlanta (85,15), and Miami (90,5)]

POINT QUADTREE (Finkel/Bentley)
•  Marriage between a uniform grid and a binary search tree
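•  A minimal sketch of point quadtree insertion (an illustration under assumed names, not Finkel and Bentley's code): each node stores one point and the four subtrees correspond to the NE, NW, SW, and SE quadrants around it.

    class Node:
        def __init__(self, x, y, name):
            self.x, self.y, self.name = x, y, name
            self.child = {}                      # keyed by quadrant: 'NE', 'NW', 'SW', 'SE'

    def quadrant(node, x, y):
        """Quadrant of (x, y) relative to the node's point (ties go east/north)."""
        if x >= node.x:
            return 'NE' if y >= node.y else 'SE'
        return 'NW' if y >= node.y else 'SW'

    def insert(root, x, y, name):
        if root is None:
            return Node(x, y, name)
        q = quadrant(root, x, y)
        root.child[q] = insert(root.child.get(q), x, y, name)
        return root

    root = None
    for name, x, y in [('Chicago', 35, 42), ('Mobile', 52, 10), ('Toronto', 62, 77),
                       ('Buffalo', 82, 65), ('Denver', 5, 45), ('Omaha', 27, 35),
                       ('Atlanta', 85, 15), ('Miami', 90, 5)]:
        root = insert(root, x, y, name)
    print(root.name, sorted(root.child))         # Chicago ['NE', 'NW', 'SE', 'SW']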

[Figure: the point quadtree obtained by inserting Chicago (35,42), Mobile (52,10), Toronto (62,77), Buffalo (82,65), Denver (5,45), Omaha (27,35), Atlanta (85,15), and Miami (90,5) in that order; Chicago is the root]

PROBLEM OF DELETION IN POINT QUADTREES
•  Delete node A
•  Conventional algorithm takes one son as the new root and reinserts the remaining subtrees
   1.  B is the new root
   2.  D is the new root
   3.  G is the new root
   4.  I is the new root
•  Optimal solution is to use H as the new root since the shaded region is empty
•  Problems:
   1.  must search for H
   2.  a node such as H may possibly not exist, as is the case if node J is present

[Figure: point quadtree whose root A has descendants B, C, D, E, F, G, H, I (and possibly J); the shaded region around A is empty when H becomes the new root]

MECHANICS OF DELETION IN POINT QUADTREES
•  Ideally, want to replace the deleted node (A) with a node (B) that leaves an empty shaded region
•  This involves search, so instead settle on a set of candidate nodes obtained using a method analogous to searching a binary search tree
•  Set of candidates = "closest" node in each quadrant (e.g., D is the closest node)
•  At most one of the remaining candidates will be in the shaded region
   1.  choose the candidate node that is closer to each of its bordering axes than any other candidate node which is on the same side of those axes
       •  the condition does not always hold
       •  more than one node may satisfy it
   2.  use the L1 metric to break ties or deadlocks
       •  the L1 metric is the sum of the displacements from the bordering x and y axes
       •  rationale: the area of the shaded region is Lx·dy + Ly·dx - dx·dy, which can be approximated by 2·dx·(Lx+Ly) assuming dx ≅ dy and dx very small

[Figures: candidate nodes A, B, C, D, E around the deleted node, the cases where the axis condition fails or several nodes satisfy it, and the quantities Lx, Ly, dx, dy used by the L1 tie-breaking rule]

ALGORITHM FOR DELETION IN POINT QUADTREES

•  Select a node satisfying the "closest" criteria to the deleted node A to serve as the new root (B in the NE quadrant)

[Figure: point quadtree rooted at A with nodes B through N; B, in the NE quadrant, becomes the new root]

1.  no reinsertion in the opposite quadrant (SW)
2.  ADJ: apply to the adjacent quadrants (NW, SE):
    if the root remains in the quadrant (J) then
    •  no reinsertion in 2 subquadrants (NW, NE)
    •  apply ADJ to the remaining 2 subquadrants (SW, SE)
    else reinsert the entire quadrant
3.  NEWROOT: apply to the quadrant containing the replacement node (NE)
    •  same subquadrant (NE): no reinsertion
    •  adjacent subquadrants (NW, SE): ADJ
    •  opposite subquadrant (SW): NEWROOT
•  Comparison with the conventional reinsertion algorithm
   1.  5/6 reduction in the number of nodes requiring reinsertion (2/3 if a candidate is picked at random)
   2.  the number of comparisons during reinsertion was observed to be ~log4 n vs. a much larger factor
   3.  (average total path length)/(optimal total path length) was observed to be constant vs. an increase

MX QUADTREE (Hunter)
•  Points are like BLACK pixels in a region quadtree
•  Useful for raster to vector conversion
•  Empty cells are merged to form larger empty cells
•  Only good for discrete data
•  Good for sparse matrix applications
•  Assume that the point is associated with the lower left corner of each cell
•  Ex: assume an 8 x 8 array and divide coordinate values by 12.5
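•  A small sketch of the scaling just described (my own illustration, with an assumed function name): each city is mapped to its 8 x 8 MX grid cell by dividing the coordinate values by 12.5 and truncating.

    cities = {'Chicago': (35, 42), 'Mobile': (52, 10), 'Toronto': (62, 77),
              'Buffalo': (82, 65), 'Denver': (5, 45), 'Omaha': (27, 35),
              'Atlanta': (85, 15), 'Miami': (90, 5)}

    def mx_cell(x, y, width=100, cells=8):
        """Cell whose lower-left corner the point is associated with (width/cells = 12.5)."""
        scale = width / cells
        return int(x // scale), int(y // scale)

    for name, (x, y) in cities.items():
        print(name, mx_cell(x, y))   # e.g., Chicago -> (2, 3), Miami -> (7, 0)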

[Figure: the 8 x 8 MX quadtree containing the scaled cities Chicago (2,3), Mobile (4,0), Toronto (4,6), Buffalo (6,5), Denver (0,3), Omaha (2,2), Atlanta (6,1), and Miami (7,0)]

PR QUADTREE (Orenstein)
1.  Regular decomposition point representation
2.  Decomposition occurs whenever a block contains more than one point
3.  Useful when the domain of data points is not discrete but finite
4.  Maximum level of decomposition depends on the minimum separation between two points
    •  if two points are very close, then decomposition can be very deep
    •  can be overcome by viewing blocks as buckets with capacity c and only decomposing a block when it contains more than c points (a sketch of this bucket variant follows)
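•  A minimal sketch of the bucket variant (an illustration with assumed class names, not the original notes' code): a block keeps its points until it holds more than c of them, and only then splits into four congruent quadrants.

    class Block:
        def __init__(self, cx, cy, half, capacity):
            self.cx, self.cy, self.half, self.capacity = cx, cy, half, capacity
            self.points = []            # bucket contents while this block is a leaf
            self.children = None        # four sub-blocks after a split

        def insert(self, x, y):
            if self.children is not None:
                return self._child_for(x, y).insert(x, y)
            self.points.append((x, y))
            if len(self.points) > self.capacity:        # decompose only on bucket overflow
                self.children = [Block(self.cx + dx * self.half / 2,
                                       self.cy + dy * self.half / 2,
                                       self.half / 2, self.capacity)
                                 for dx in (-1, 1) for dy in (-1, 1)]
                for px, py in self.points:
                    self._child_for(px, py).insert(px, py)
                self.points = []

        def _child_for(self, x, y):
            return self.children[(2 if x >= self.cx else 0) + (1 if y >= self.cy else 0)]

    root = Block(50, 50, 50, capacity=1)
    for x, y in [(35, 42), (52, 10), (62, 77), (82, 65), (5, 45), (27, 35), (85, 15), (90, 5)]:
        root.insert(x, y)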

Ex: c = 1

[Figure: the PR quadtree (bucket capacity 1) for Chicago (35,42), Mobile (52,10), Toronto (62,77), Buffalo (82,65), Denver (5,45), Omaha (27,35), Atlanta (85,15), and Miami (90,5) in the (0,0)-(100,100) space]

REGION SEARCH
•  Ex: Find all points within radius r of point A
•  Use of the quadtree results in pruning the search space

[Figure: the thirteen regions 1-13 into which the plane around the circular query region of radius r centered at A is divided]

•  If a quadrant subdivision point p lies in a region l, then search the quadrants of p specified by l:
   1. SE         6. NE          11. All but SW
   2. SE, SW     7. NE, NW      12. All but SE
   3. SW         8. NW          13. All
   4. SE, NE     9. All but NW
   5. SW, NW    10. All but NE

FINDING THE NEAREST OBJECT
•  Ex: find the nearest object to P
•  Assume a PR quadtree for points (i.e., at most one point per block)
•  Search the neighbors of block 1 in counterclockwise order
•  Points are sorted with respect to the space they occupy, which enables pruning the search space

[Figure: PR quadtree blocks 1-13 containing points A-F; P lies in block 1 and A is its nearest neighbor]

•  Algorithm:
   1.  start at block 2 and compute the distance to P from A
   2.  ignore block 3, whether or not it is empty, as A is closer to P than any point in 3
   3.  examine block 4 as the distance from P to its SW corner is shorter than the distance from P to A; however, reject B as it is further from P than A
   4.  ignore blocks 6, 7, 8, 9, and 10 as the minimum distance to them from P is greater than the distance from P to A
   5.  examine block 11 as the distance from P to the southern border of 1 is shorter than the distance from P to A; however, reject F as it is further from P than A
•  If F was moved (the "new F" position in the figure), a better order would have started with block 11, the southern neighbor of 1, as it is closest

COMPARISON OF POINT, MX, AND PR QUADTREES

FEATURE                            MX                       PR                        POINT
Regular decomposition              Yes                      Yes                       No
Type of nodes                      Data stored in leaf      Data stored in leaf       Data stored in leaf
                                   nodes; non-leaf nodes    nodes; non-leaf nodes     nodes and non-leaf
                                   are for control          are for control           nodes
Shape of tree depends on order     No                       No                        Yes
of inserting nodes
Deletion                           Simple but may have to   Simple but may have to    Complex
                                   collapse WHITE nodes     collapse WHITE nodes
Size of space represented          Finite                   Finite                    Unbounded
Type of data represented           Discrete                 Continuous                Continuous
Shape of space represented         Square                   Rectangle                 Unbounded
Stores coordinates                 No                       Yes                       Yes
Depth of tree (d);                 All nodes are at the     For a square region with  ⌈log4(3M)⌉ ≤ d
assume M points                    same depth, n for a      side length L and         and d ≤ M-1
                                   2^n by 2^n region        minimum separation S
                                                            between two points,
                                                            ⌈log4(M-1)⌉ ≤ d and
                                                            d ≤ ⌈log2((L/S)·2.5)⌉

APPLICATION OF THE MX QUADTREE (Hunter)
•  Represent the boundary as a sequence of BLACK pixels in a region quadtree
•  Useful for a simple digitized polygon (i.e., non-intersecting edges)
•  Three types of nodes
   1.  interior – treated like WHITE nodes
   2.  exterior – treated like WHITE nodes
   3.  boundary – the edge of the polygon passes through them; treated like BLACK nodes
•  Disadvantages
   1.  a thickness is associated with the line segments
   2.  no more than 4 lines can meet at a point

MX-CIF QUADTREE (Kedem)
1.  Collections of small rectangles for VLSI applications
2.  Each rectangle is associated with its minimum enclosing quadtree block
3.  Like hashing: quadtree blocks serve as hash buckets

[Figure: quadtree blocks A {2,6,7,8,9,10}, B {1}, C {}, D {11}, E {3,4,5}, F {12} and the rectangles associated with them]

4.  Collision = more than one rectangle in a block
    •  resolve by using two one-dimensional MX-CIF trees to store the rectangles intersecting the lines passing through each subdivision point
    •  one for the y-axis and one for the x-axis
    •  if a rectangle intersects both the x and y axes, then associate it with the y axis

[Figure: the binary tree for the y-axis through A (nodes Y1-Y7, holding rectangles 8, 10, 6, 2) and the binary tree for the x-axis through A (nodes X1-X6, holding rectangles 9, 7)]

K-D TREE (Bentley)
•  Test one attribute at a time instead of all simultaneously as in the point quadtree
•  Usually cycle through all the attributes
•  Shape of the tree depends on the order in which the data is encountered
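•  A minimal sketch of 2-d k-d tree insertion (an illustration with assumed names, not Bentley's code): x is tested at even depths and y at odd depths, branching to HISON when the test value is greater than or equal to the node's value.

    class KDNode:
        def __init__(self, point, name):
            self.point, self.name = point, name
            self.lo = self.hi = None               # LOSON / HISON

    def insert(node, point, name, depth=0):
        if node is None:
            return KDNode(point, name)
        axis = depth % 2                           # 0: x test, 1: y test
        if point[axis] >= node.point[axis]:        # branch to HISON on >=
            node.hi = insert(node.hi, point, name, depth + 1)
        else:
            node.lo = insert(node.lo, point, name, depth + 1)
        return node

    root = None
    for name, p in [('Chicago', (35, 42)), ('Mobile', (52, 10)), ('Toronto', (62, 77)),
                    ('Buffalo', (82, 65)), ('Denver', (5, 45)), ('Omaha', (27, 35)),
                    ('Atlanta', (85, 15)), ('Miami', (90, 5))]:
        root = insert(root, p, name)
    print(root.name, root.lo.name, root.hi.name)   # Chicago Denver Mobile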

[Figure: the k-d tree obtained by inserting Chicago (35,42), Mobile (52,10), Toronto (62,77), Buffalo (82,65), Denver (5,45), Omaha (27,35), Atlanta (85,15), and Miami (90,5) in that order; the root (Chicago) applies an x test, the next level a y test, and so on]

K-D TREE DELETION
•  Similar to deletion in binary search trees
•  Assume branch to HISON if the test value is ≥ the root value
•  Assume the root discriminates on the x coordinate, with subsequent alternation with the y coordinate value
•  Algorithm:
   1.  replace the root by the node in HISON with the minimum x coordinate value
   2.  repeat the process for the position of the replacement node, using the x or y coordinate values depending on whether the replacement node is an x or a y discriminator
EXAMPLE OF K-D TREE DELETION
Ex: delete A (20,20)

[Figure: k-d tree with A (20,20) at the root (x discriminator), sons B (10,30) and D (40,50), and descendants C (10,20), E (30,40), F (35,60), G (25,30), H (29,40), I (27,35), J (28,45); G, the minimum-x node in HISON, replaces A, and the replacement process repeats down the tree (I eventually fills G's old position)]

•  The analogy with binary search trees implies that we could instead replace the root by the node in LOSON with the maximum x coordinate value. However, if there is more than one node with the same maximum value, then after the replacement there would be a node in LOSON which is not strictly less than the new root (e.g., nodes B and C)
•  What if the HISON of the root A is empty?
   1.  replace root node A with the node B in LOSON having the minimum x coordinate value, and set the HISON pointer of A's position to be the old LOSON pointer of A
   2.  repeat the process for the position of the replacement node

[Figure: A with an empty HISON and LOSON subtree T'; B, the node of T' with the minimum x coordinate value, replaces A, and T' is moved to the HISON side]

K-D TREE SEARCH
•  The search space can be pruned by testing whether the search region is completely contained in one of the partitions of the node, as then only one subtree must be examined
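•  A minimal sketch of that pruning for a rectangular search region (an illustration with assumed names; the node layout follows the insertion sketch earlier): a subtree is visited only when the query rectangle reaches its side of the splitting line.

    def range_search(node, lo_corner, hi_corner, depth=0, found=None):
        """Report the names of points inside the rectangle [lo_corner, hi_corner]."""
        if found is None:
            found = []
        if node is None:
            return found
        point, name, lo, hi = node
        axis = depth % 2
        if all(lo_corner[i] <= point[i] <= hi_corner[i] for i in (0, 1)):
            found.append(name)
        if lo_corner[axis] < point[axis]:        # query reaches the LOSON side
            range_search(lo, lo_corner, hi_corner, depth + 1, found)
        if hi_corner[axis] >= point[axis]:       # query reaches the HISON side
            range_search(hi, lo_corner, hi_corner, depth + 1, found)
        return found

    # nodes are (point, name, LOSON, HISON); the root tests x, the next level tests y
    omaha   = ((27, 35), 'Omaha',   None,   None)
    denver  = ((5, 45),  'Denver',  omaha,  None)
    mobile  = ((52, 10), 'Mobile',  None,   None)
    chicago = ((35, 42), 'Chicago', denver, mobile)
    print(range_search(chicago, (10, 20), (30, 40)))   # ['Omaha'] (square of half-width 10 around (20,30))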

•  Ex: find all points within 10 of (20,30)

[Figure: the k-d tree partition of the eight cities with the circular query region of radius 10 centered at (20,30)]

1.  search the region entirely to the left of Chicago (x=35)
2.  search the region entirely below Denver (y=45)
3.  the search yields Omaha

PR K-D TREE (Knowlton)
•  A region contains at most one data point
•  Analogous to EXCELL with a bucket size of 1

[Figure: the PR k-d tree for the eight cities; the space is halved alternately on x and y until each region contains at most one point, with Chicago (35,42), Mobile (52,10), Toronto (62,77), Buffalo (82,65), Denver (5,45), Omaha (27,35), Atlanta (85,15), and Miami (90,5) inserted in that order]

ADAPTIVE K-D TREE
•  Data is only stored in terminal nodes
•  An interior node contains the median of the set as the discriminator
•  The discriminator key is the one for which the spread of the values of the key is a maximum
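•  A minimal sketch of such a construction (assumed names and leaf size; an illustration, not the exact tree of the figure below): at each interior node, pick the attribute with the maximum spread and split the point set at its median.

    def build(points, leaf_size=2):
        """Interior nodes are (axis, median, left, right); leaves are lists of points."""
        if len(points) <= leaf_size:
            return points
        spreads = [max(p[a] for p in points) - min(p[a] for p in points) for a in (0, 1)]
        axis = spreads.index(max(spreads))                 # attribute with maximum spread
        values = sorted(p[axis] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[axis] < median]
        right = [p for p in points if p[axis] >= median]
        if not left or not right:                          # all points equal on this axis
            return points
        return (axis, median, build(left, leaf_size), build(right, leaf_size))

    cities = [(35, 42), (52, 10), (62, 77), (82, 65), (5, 45), (27, 35), (85, 15), (90, 5)]
    print(build(cities))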
[Figure: the adaptive k-d tree for the eight cities; the root discriminates on x (splitting at 57), and the leaves pair Denver/Omaha, Mobile/Chicago, Miami/Atlanta, and Toronto/Buffalo]

GRID FILE (Nievergelt,Hinterberger,Sevcik)
•  Two-level grid for storing points
•  Uses a grid directory (a 2-d array of grid blocks) on disk that contains the address of the bucket (i.e., page) that contains the data associated with the grid block
•  Linear scales (a pair of 1-d arrays) in core access the grid block in the grid directory (on disk) that is associated with a particular point, thereby enabling the decomposition of the space to be arbitrary
•  Guarantees access to any record with two disk operations — one for each level of the grid (see the sketch after this list)
   1.  access the grid block
   2.  access the bucket
•  Each bucket has finite capacity
•  Partition upon overflow
   1.  bucket partition – the overflowing bucket is associated with more than one grid block
   2.  grid partition – the overflowing bucket is associated with just one grid block
•  Splitting policies
   1.  split at the midpoint and cycle through the attributes
   2.  adaptive
       •  increases the granularity of frequently queried attributes
       •  favors some attributes over others
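•  A minimal sketch of the two-step access path (class name, scale representation, and the toy bucket contents are my own assumptions, not the exact state of the example that follows): the in-core linear scales locate a directory element, whose value is the address of the bucket to read.

    from bisect import bisect_right

    class GridFile:
        def __init__(self, x_scale, y_scale, directory, buckets):
            self.x_scale = x_scale        # linear scales kept in core, e.g. [45, 70]
            self.y_scale = y_scale
            self.directory = directory    # grid directory: 2-d array of bucket addresses (disk access 1)
            self.buckets = buckets        # bucket address -> records on that page (disk access 2)

        def lookup(self, x, y):
            col = bisect_right(self.x_scale, x)      # which column of the grid directory
            row = bisect_right(self.y_scale, y)      # which row of the grid directory
            return self.buckets[self.directory[col][row]]

    gf = GridFile([45, 70], [43, 70],
                  directory=[['A', 'A', 'C'], ['D', 'B', 'B'], ['D', 'B', 'B']],
                  buckets={'A': [('Omaha', 27, 35), ('Chicago', 35, 42)],
                           'B': [('Toronto', 62, 77), ('Buffalo', 82, 65)],
                           'C': [], 'D': [('Mobile', 52, 10)]})
    print(gf.lookup(62, 77))   # the bucket page holding Toronto and Buffalo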

GRID FILE EXAMPLE
•  Assume bucket size = 2

[Figure: the grid file after each insertion; the final linear scales are x: 0, 45, 70, 100 and y: 0, 43, 70, 100, and the buckets are A-F]

1.  Initially, Chicago (35,42) and Mobile (52,10) are in bucket A
2.  Insert Toronto (62,77), causing a grid partition at x=45 yielding bucket B
3.  Insert Buffalo (82,65), causing a grid partition at y=70 yielding bucket C
4.  Insert Denver (5,45), causing no change
5.  Insert Omaha (27,35), causing a grid partition at y=43 yielding bucket D
6.  Insert Atlanta (85,15), causing a bucket partition yielding bucket E
7.  Insert Miami (90,5), causing a grid partition at x=70 yielding bucket F

EXCELL (Tamminen)
•  Uses regular decomposition
•  Like the grid file, guarantees access to any record with two disk operations
•  Differentiated from the grid file by the absence of the linear scales which enable the decomposition of space to be arbitrary
•  A grid partition results in doubling the size of the grid directory
•  Ex: bucket size = 2

[Figure: the EXCELL decomposition of the eight cities into buckets A-F after each insertion]

1.  Initially, Chicago (35,42) and Mobile (52,10) are in bucket A
2.  Insert Toronto (62,77), causing a grid partition yielding bucket B
3.  Insert Buffalo (82,65), causing a grid partition yielding bucket C
4.  Insert Denver (5,45), causing no change
5.  Insert Omaha (27,35); bucket A overflows; split A, yielding bucket D
6.  Bucket A is still too full, so perform a grid partition (yielding bucket E)
7.  Insert Atlanta (85,15), causing no change
8.  Insert Miami (90,5), causing a bucket partition yielding bucket F

SUMMARY
•  Data structures can be grouped:
   N = No data organization
   D = Organize the data to be stored (e.g., binary search tree)
   E = Organize the embedding space from which the data is drawn (e.g., digital searching)
   H = Hybrid (combines at least two of N, D, or E)

   0: no organization – Sequential List (N)
   1: one attribute – Inverted List (H)
   2: all attributes (fixed number of cells) – Grid Method (E)
   3: permit the number of cells to vary – EXCELL (E), Grid File (D)
   4: only partition the overflowing cells – Point Quadtree (D), PR Quadtree (E), MX Quadtree (E)
   5: only partition the overflowing cells into 2 cells – K-d tree (D), PR K-d tree (H)
   6: partition using the attribute having the greatest spread across the cell – Adaptive K-d tree (D)

DRAWBACKS OF MOST HASHING METHODS
•  Require rehashing all the data when the hash table becomes too full
•  Goal: only move a few records
•  Solutions:
   1.  Knott
       •  use a trie in the form of a binary tree
       •  bucket overflow is solved by splitting
       •  drawback: accessing a bucket at level m requires m operations
   2.  extendible hashing (Fagin, Nievergelt, Pippenger, Strong)
       •  like a trie except that all buckets are at the same level
       •  buckets are accessed by use of a directory
       •  directory elements are NOT the same as buckets
       •  implemented as a directory of pointers to buckets
       •  accessing a bucket requires 1 operation
       •  bucket overflow may cause doubling of the directory
       •  e.g., EXCELL
   3.  Linear hashing (Litwin)
       •  provides for linear growth in the number of buckets (i.e., the hash table grows at a rate of one bucket at a time)
       •  does not make use of a directory

[Figure: a binary-tree trie with buckets A-F, and extendible-hashing directories of pointers to buckets, e.g. A D A A A B C D and, after doubling, A A A A A A A A B B E F D D D D]

MECHANICS OF LINEAR HASHING
•  Assume a file with m buckets
•  Use two hashing functions h_i(k) = f(k) mod 2^i for i = n and i = n+1
   1.  compute h_{n+1}(k) = x and use the result if x < m
   2.  otherwise use h_n(k)
•  Such a file is said to be at level n,n+1
•  There exist primary and overflow buckets
•  When a record hashes to a full primary bucket, it is inserted into an overflow bucket corresponding to the primary bucket
•  τ: storage utilization factor = number of records in the file divided by the total number of available slots in the primary and overflow buckets
•  When τ > a given load factor α, one of the buckets is split
•  When bucket i is split, its records and its overflow bucket's records are rehashed using h_{n+1} and distributed into buckets i and i + 2^n as appropriate
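•  A minimal sketch of this address computation (function and variable names are mine): try h_{n+1} first and fall back to h_n when the result is not yet an existing bucket. The level and bucket count below match the state just before Buffalo is inserted in the worked example a few slides ahead.

    def bucket_address(key, n, m):
        """Linear-hashing address for a file at level n,n+1 with m buckets."""
        x = key % (2 ** (n + 1))      # h_{n+1}(k)
        if x < m:
            return x
        return key % (2 ** n)         # h_n(k)

    # level 1,2 with m = 3 buckets (0, 1, and 2)
    for city, k in [('Mobile', 16), ('Toronto', 56), ('Chicago', 14), ('Buffalo', 54)]:
        print(city, bucket_address(k, n=1, m=3))   # Mobile 0, Toronto 0, Chicago 2, Buffalo 2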

LINEAR HASHING INSERTION ALGORITHM
•  Let s denote the identity of the next bucket to be split; s cycles from 0 to 2^n - 1
•  Insertion algorithm
   1.  compute bucket address i for record r
   2.  insert r in bucket i
   3.  if τ > α, then split bucket s, creating bucket s + 2^n; reinsert its records in buckets s and s + 2^n, and increment s
   4.  if s = 2^n, then all buckets have been split
       •  increment n
       •  reset s to 0
   5.  if buckets i or i+1 overflow, then allocate an overflow bucket
   6.  if rehashing causes some overflow buckets to be reclaimed, repeat steps 3-5
•  Notes
   1.  a bucket split need not necessarily occur when a record hashes to a full bucket, nor does the bucket being split need to be full
   2.  the key principle is that eventually every bucket will be split and ideally all overflow buckets will be emptied and reclaimed
   3.  if the storage utilization gets too low, then buckets should be reclaimed

BIT INTERLEAVING HASHING FUNCTIONS
•  Bit interleaving takes one bit from the binary representation of the x coordinate value and one bit from the binary representation of the y coordinate value and alternates them
•  Use the city coordinate values and divide by 12.5 so that each coordinate value requires just three binary digits
•  Example with y being more significant than x

City      x   y   x div 12.5   y div 12.5   Bit-interleaved value
Chicago   35  42      2            3                14
Mobile    52  10      4            0                16
Toronto   62  77      4            6                56
Buffalo   82  65      6            5                54
Denver     5  45      0            3                10
Omaha     27  35      2            2                12
Atlanta   85  15      6            1                22
Miami     90   5      7            0                21

•  Ex: Atlanta (6,1): x = 110, y = 001; interleaving with y more significant gives 010110 = 22
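•  A minimal sketch of the interleaving computation (assumed function name): three bits per coordinate, taking the y bit before the x bit at each position.

    def interleave(x, y, bits=3):
        """Alternate the bits of y and x, with y the more significant of each pair."""
        z = 0
        for i in reversed(range(bits)):
            z = (z << 1) | ((y >> i) & 1)
            z = (z << 1) | ((x >> i) & 1)
        return z

    print(interleave(2, 3))   # Chicago: 14
    print(interleave(6, 1))   # Atlanta: 0b010110 = 22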

EXAMPLE OF LINEAR HASHING
•  Assume primary and overflow bucket capacity is 2
•  A bucket is split whenever τ ≥ α = 0.66
•  Initially, only bucket 0 exists
•  Insert Chicago (14) and Mobile (16): τ=1, so split bucket 0, creating bucket 1
•  Insert Toronto (56) into bucket 0: τ=0.75, so split bucket 0, creating bucket 2; move Chicago to bucket 2
•  Insert Buffalo (54) into bucket 2: τ=0.67, so split bucket 1, creating bucket 3
•  Insert Denver (10) into bucket 2, which causes it to overflow
•  Insert Omaha (12) into bucket 0, which causes it to overflow
•  Insert Atlanta (22) into bucket 2's overflow area
•  Insert Miami (21) into bucket 1: τ=0.67, so split bucket 0, creating bucket 4; move Omaha to bucket 4
•  Reclaim the overflow area of bucket 0: τ=0.67 again, so split bucket 1, creating bucket 5; move Miami to bucket 5

[Figure: the primary buckets 0-5 and their overflow buckets after each step]

ORDER PRESERVING LINEAR HASHING (OPLH)
•  Problem: the hash function h_n(k) = k mod 2^n implies that all records in a given bucket agree in the n least significant bits
   1.  OK for random access
   2.  unacceptable for sequential access, as each record will be in a different bucket
•  Solution: use the hash function h_n(k) = reverse(k) mod 2^n
   1.  tests the n most significant bits
   2.  all records in a bucket are within a given range
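•  A minimal sketch of this order preserving hash (assumed names; the 6-bit width matches the interleaved city keys): the key's bits are reversed before taking the remainder, so the bucket is determined by the n most significant bits.

    def reverse_bits(k, width=6):
        r = 0
        for _ in range(width):
            r = (r << 1) | (k & 1)
            k >>= 1
        return r

    def oplh_bucket(k, n, width=6):
        """Order preserving linear hashing: h_n(k) = reverse(k) mod 2^n."""
        return reverse_bits(k, width) % (2 ** n)

    # Chicago's interleaved key is 14 = 001110; its 6-bit reversal is 011100 = 28
    print(reverse_bits(14), oplh_bucket(14, n=2))   # 28 0 -- chosen by the two leading bits of 001110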

•  Shortcomings
   1.  records may not be scattered too well
       •  overflow is much more common than with traditional hashing methods
       •  random access is slower since several overflow buckets may have to be examined
   2.  creates a large number of sparsely filled buckets
       •  sequential access may be slower as many empty buckets may have to be examined

City      x   y   x div 12.5   y div 12.5   Reversed bit-interleaved value
                                            x most sig.      y most sig.
Chicago   35  42      2            3             44              28
Mobile    52  10      4            0              1               2
Toronto   62  77      4            6             11               7
Buffalo   82  65      6            5             39              27
Denver     5  45      0            3             40              20
Omaha     27  35      2            2             12              12
Atlanta   85  15      6            1             37              26
Miami     90   5      7            0             21              42

EXAMPLE OF ORDER PRESERVING LINEAR HASHING
•  Assume primary and overflow bucket capacity is 2
•  A bucket is split whenever τ ≥ α = 0.66
•  Initially, only bucket 0 exists
•  Insert Chicago (28) and Mobile (2): τ=1, so split bucket 0, creating bucket 1
•  Insert Toronto (7) into bucket 1: τ=0.75, so split bucket 0, creating bucket 2; move Mobile to bucket 2
•  Insert Buffalo (27) into bucket 1: τ=0.67, so split bucket 1, creating bucket 3; move Toronto and Buffalo to bucket 3
•  Insert Denver (20) into bucket 0
•  Insert Omaha (12) into bucket 0: τ=0.75, so split bucket 0, creating bucket 4; move Denver and Chicago to bucket 4
•  Insert Atlanta (26) into bucket 2
•  Insert Miami (42) into bucket 2: τ=0.67, so split bucket 1, creating bucket 5; move Miami to bucket 2's overflow area

[Figure: the corresponding decomposition of the (0,0)-(100,100) space into buckets 0-5 after each split]

COMPARISON OF OPLH WITH EXCELL
OPLH
1.  Implicit directory
    •  one directory element per primary bucket
    •  all buckets stored at one of two levels
    •  overflow buckets
2.  Reverse bit interleaving
3.  Bucket overflow
    •  allocate at most two additional buckets (one for the bucket that has been split and one for the overflowing bucket)
4.  Retrieval of a record requires examining primary and overflow buckets

EXCELL
1.  Explicit directory
    •  set of primary buckets
2.  Reverse bit interleaving
3.  Bucket overflow
    •  triggers a bucket split or directory doubling
4.  Retrieve any record with two disk accesses

•  Summary: order preserving linear hashing (OPLH) yields a more gradual growth in the size of the directory at the expense of the loss of the guarantee of retrieval of any record with just two disk accesses

EXAMPLES OF COMPARISON OF OPLH WITH EXCELL
1.  Reversed bit interleaving with the y coordinate value as the most significant (i.e., y is split first)
    •  OPLH: 6 primary buckets, 2 overflow buckets
    •  EXCELL: 7 buckets, 16 directory elements
2.  Reversed bit interleaving with the x coordinate value as the most significant (i.e., x is split first)
    •  OPLH: 7 primary buckets, no overflow buckets
    •  EXCELL: 6 buckets, 8 directory elements

SPIRAL HASHING (Martin)
•  Drawbacks of linear hashing:
   1.  the order in which buckets are split is unrelated to the probability of the occurrence of overflow
   2.  all buckets that are candidates for a split have the same probability of overflowing
•  Central idea is the existence of an ever-changing (and growing) address space of active bucket addresses
   1.  records are distributed among the active buckets in an uneven manner
   2.  split the bucket with the highest probability of overflowing
•  When a bucket s is split
   1.  create d new buckets
   2.  rehash the contents of s into the d new buckets
   3.  bucket s is no longer used
•  If [s,t] are the active buckets, then [s+1,t+d] are the active buckets after the split
•  Ex: assume that initially there are d-1 active buckets starting at address 1
•  After s bucket splits, there are (s+1)·(d-1) active buckets starting at address s+1 (prove by induction)
•  Ex: d = 2: the active buckets after successive splits are {1}, {2,3}, {3,4,5}, and {4,5,6,7}

SPIRAL HASHING FUNCTION
•  Assume that initially there are d-1 active buckets starting at address 1
•  Key idea is the behavior of the function y = d^x
   1.  d^(x+1) - d^x = d^x · (d-1)
   2.  bucket s has just been split
   3.  let d^x = s+1 = address of the first active bucket
   4.  implies d^(x+1) - d^x = (s+1) · (d-1) = number of active buckets
   5.  the last active bucket is at address d^(x+1) - 1
   6.  [d^x, d^(x+1)) is the range of active buckets
•  Use two hashing functions
   1.  h(k) maps key k uniformly into [0,1), which is the range of the difference in the exponent values of the addresses in y = d^(x+1) - d^x
   2.  y(k) maps h(k) into an address in [s+1, (s+1) + (s+1)·(d-1))
       •  y(k) = ⌊d^x(k)⌋ with x(k) in the range [logd(s+1), logd(s+1)+1)
       •  before the split, the active buckets lay in the range [logd(s), logd(s)+1)
       •  want to make sure that all key values previously in bucket s – i.e., x(k) in [logd(s), logd(s+1)) – are rehashed into one of the new buckets, with an x(k) value in [logd(s)+1, logd(s+1)+1)
       •  leave other key values, in [logd(s+1), logd(s)+1), unchanged
       •  difficult to choose x(k)
           a.  x(k) = logd(s+1) + h(k)
               •  drawback: must rehash all keys when a bucket is split
           b.  x(k) = ⌈logd(s+1) - h(k)⌉ + h(k)
               •  guarantees that if k is hashed into bucket b (≥ s+1), then it continues to hash there until bucket b is split
               •  implies that x(k) is a number in the range [logd(s+1), logd(s+1)+1) whose fractional part is h(k) (a sketch follows)
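•  A minimal sketch of choice (b) for d = 2 (function and variable names are mine): the fractional part of x(k) is h(k), and the resulting address stays fixed until the bucket it names is split.

    from math import ceil, floor, log

    def spiral_address(h, s, d=2):
        """Bucket address for a key with uniform hash h in [0,1), after bucket s has been split."""
        x = ceil(log(s + 1, d) - h) + h        # choice (b): fractional part of x is h
        return floor(d ** x)

    # after bucket 3 is split the active buckets are 4..7
    for h in (0.1, 0.4, 0.7, 0.9):
        print(h, spiral_address(h, s=3))       # 4, 5, 6, 7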

BEHAVIOR OF THE SPIRAL HASHING FUNCTION
•  Ex: d = 2

Bucket address   Hash interval        Relative load
      1          [0.0000, 1.0000)        1.0000
      2          [0.0000, 0.5849)        0.5849
      3          [0.5849, 1.0000)        0.4151
      4          [0.0000, 0.3219)        0.3219
      5          [0.3219, 0.5849)        0.2630
      6          [0.5849, 0.8073)        0.2224
      7          [0.8073, 1.0000)        0.1927
      8          [0.0000, 0.1699)        0.1699
      9          [0.1699, 0.3219)        0.1520
     10          [0.3219, 0.4594)        0.1375
     11          [0.4594, 0.5849)        0.1255
     12          [0.5849, 0.7004)        0.1155
     13          [0.7004, 0.8073)        0.1069
     14          [0.8073, 0.9068)        0.0995
     15          [0.9068, 1.0000)        0.0932

[Figure: the function y = 2^x and the relative load of buckets 1-15]

•  Splitting bucket 3 yields buckets 6 and 7

TABLES FOR EXAMPLE OF SPIRAL HASHING
•  Use bit interleaving to form a value k – i.e., take one bit from the binary representation of the x coordinate value and one bit from the binary representation of the y coordinate value and alternate them
•  Use the city coordinate values and divide by 12.5 so that each coordinate value requires just three binary digits
•  Use h(k) = k/64, which has the same effect as reverse bit interleaving, so the behavior is analogous to OPLH
•  Example with y being more significant than x

City      x   y   x div 12.5   y div 12.5   k (bit-interleaved)   h(k) = k/64
Chicago   35  42      2            3                14              .21875
Mobile    52  10      4            0                16              .25
Toronto   62  77      4            6                56              .875
Buffalo   82  65      6            5                54              .84375
Denver     5  45      0            3                10              .15625
Omaha     27  35      2            2                12              .1875
Atlanta   85  15      6            1                22              .34375
Miami     90   5      7            0                21              .328125

•  Ex: Atlanta (6,1): x = 110, y = 001; interleaving with y more significant gives 010110 = 22

MECHANICS OF EXAMPLE OF SPIRAL HASHING
•  Assume d = 2 and primary and overflow bucket capacity of 2
•  A bucket is split whenever τ ≥ α = 0.66
•  Initially, only bucket 1 exists
•  Insert Chicago (.22) and Mobile (.25): τ=1, so split bucket 1, creating buckets 2 and 3; Chicago and Mobile are moved to bucket 2
•  Insert Toronto (.87) into bucket 3: τ=0.75, so split bucket 2, creating buckets 4 and 5; Chicago and Mobile are moved to bucket 4
•  Insert Buffalo (.84) into bucket 3: τ=0.67, so split bucket 3, creating buckets 6 and 7; Toronto and Buffalo are moved to bucket 7
•  Insert Denver (.16) and Omaha (.19) into bucket 4's overflow area
•  Insert Atlanta (.34) into bucket 5: τ=0.7, so split bucket 4, creating buckets 8 and 9; Denver is moved to bucket 8, while Chicago, Mobile, and Omaha are moved to bucket 9, which causes it to overflow
•  Insert Miami (.33) into bucket 5: τ=0.67, so split bucket 5, creating buckets 10 and 11; move Atlanta and Miami to bucket 10

[Figure: the active buckets and their contents after each step]

COMPARISON OF SPIRAL AND LINEAR HASHING
•  Main advantage of spiral hashing over linear hashing is that the bucket being split is the one most likely to overflow
•  Disadvantages of spiral hashing:
   1.  the buckets that have been split are not reused
       •  overcome by using a mapping between logical and physical addresses
   2.  expensive to calculate the function y = d^x
       •  overcome by use of an approximation

WHY SPIRAL?
•  Can rewrite y = ⌊d^x⌋ as ρ = ⌊e^(j·θ)⌋ using polar coordinates, which yields the equation of a spiral
•  Ex: ρ = e^((ln 2)·θ/2π)
•  Polar coordinates mean that the active buckets are always within one complete arc of the spiral – i.e., θ = 2π
•  Mechanics of a bucket split
   1.  let the first active bucket be at a = ⌊e^(j·b)⌋ (i.e., θ = b)
   2.  the last active bucket is at c = ⌈e^(j·(b+2π))⌉ - 1
   3.  a bucket split means that the contents of the active bucket at ρ = a (i.e., θ = b) are distributed into buckets c+1 through g, where g = ⌊e^(j·(b+2π+φ))⌋ and φ is the solution of a + 1 = e^(j·(b+φ)) – i.e., φ = (ln(a+1))/j - b
   4.  buckets a+1 through g are now the active buckets
•  Ex: d = 2

[Figure: the spiral ρ = e^((ln 2)·θ/2π) with buckets 1-16 laid out along successive arcs]

   1.  active buckets 6 through 11
   2.  split bucket 6 to yield buckets 12 and 13
   3.  active buckets 7 through 13
•  Observations
   1.  instead of h(k) uniformly mapping key k into [0,1), use hθ(k) to uniformly map k into [0,2π)
   2.  the length of the arc has a constant value between successive integer values of ρ