
An introductory tutorial on kd-trees

Andrew W. Moore
Carnegie Mellon University
awm@cs.cmu.edu

Extract from Andrew Moore's PhD Thesis: Efficient Memory-based Learning for Robot Control. PhD Thesis; Technical Report No. 209, Computer Laboratory, University of Cambridge, 1991.

Chapter 6

Kd-trees for Cheap Learning

This chapter gives a specification of the nearest neighbour algorithm. It also gives both an informal and formal introduction to the kd-tree data structure. Then there is an explicit, detailed account of how the nearest neighbour search algorithm is implemented efficiently, which is followed by an empirical investigation into the algorithm's performance. Finally, there is discussion of some other algorithms related to the nearest neighbour search.

6.1 Nearest Neighbour Specification

Given two multi-dimensional spaces Domain = ℜ^k_d and Range = ℜ^k_r, let an exemplar-set, E, be a finite set of exemplars, where an exemplar is a member of Domain × Range. Given a target domain vector, d, a nearest neighbour of d is any exemplar (d', r') ∈ E such that None-nearer(E, d, d'). Notice that there might be more than one suitable exemplar. This ambiguity captures the requirement that any nearest neighbour is adequate. None-nearer is defined thus:

    None-nearer(E, d, d')  ⇔  ∀(d'', r'') ∈ E : |d − d'| ≤ |d − d''|        (6.1)

In Equation 6.1 the distance metric is Euclidean, though any other L_p-norm could have been used:

    |d − d'| = √( Σ_{i=1}^{k_d} (d_i − d'_i)² )        (6.2)

where d_i is the ith component of vector d.

In the following sections I describe some algorithms to realize this abstract specification, with the additional informal requirement that the computation time should be relatively short.

Algorithm: Nearest Neighbour by Scanning.

Data Structures:
    domain-vector    A vector of k_d floating point numbers.
    range-vector     A vector of k_r floating point numbers.
    exemplar         A pair: (domain-vector, range-vector)

Input:  exlist, of type list of exemplar
        dom, of type domain-vector
Output: nearest, of type exemplar
Preconditions: exlist is not empty
Postconditions: if nearest represents the exemplar (d', r'), and exlist represents the exemplar set E, and dom represents the vector d, then (d', r') ∈ E and None-nearer(E, d, d').

Code:
1.  nearest-dist := infinity
2.  nearest := undefined
3.  for ex := each exemplar in exlist
    3.1  dist := distance between dom and the domain of ex
    3.2  if dist < nearest-dist then
         3.2.1  nearest-dist := dist
         3.2.2  nearest := ex

Table 6.1: Finding Nearest Neighbour by scanning a list.
6.2 Naive Nearest Neighbour

This operation could be achieved by representing the exemplar-set as a list of exemplars. In Table 6.1, I give the trivial nearest neighbour algorithm, which scans the entire list. This algorithm has time complexity O(N) where N is the size of E. By structuring the exemplar-set more intelligently, it is possible to avoid making a distance computation for every member.
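For concreteness, the scan of Table 6.1 can be rendered in a few lines of Python. This is a sketch only: the representation of an exemplar as a (domain-vector, range-vector) tuple and the function name are my own choices, not the thesis's.

    import math

    def nearest_by_scanning(exlist, dom):
        # Table 6.1: scan every exemplar, keeping the closest so far.
        # exlist: non-empty list of (domain_vector, range_vector) tuples.
        nearest, nearest_dist = None, math.inf          # Steps 1-2
        for ex in exlist:                               # Step 3
            dist = math.dist(dom, ex[0])                # Step 3.1: Equation 6.2
            if dist < nearest_dist:                     # Step 3.2
                nearest_dist, nearest = dist, ex
        return nearest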
􏴈􏴂􏴅 Intro duction to kd􏴁trees
A kd􏴁tree is a data structure for storing a 􏴀nite set of p oints from a k 􏴁dimensional space􏴂 It was examined in detail by J􏴂 Bentley 􏴜Bentley􏴔 􏴃􏴗􏴖􏴕􏴙 Friedman et al􏴂􏴔 􏴃􏴗􏴉􏴉􏴝􏴂 Recently􏴔 S􏴂 Omohundro has recommended it in a survey of p ossible techniques to increase the sp eed of neural network learning 􏴜Omohundro􏴔 􏴃􏴗􏴖􏴉􏴝􏴂
A kd􏴁tree is a binary tree􏴂 The contents of each no de are depicted in Table 􏴈􏴂􏴄􏴂 Here I provide an informal description of the structure and meaning of the tree􏴔 and in the following subsection I
􏴈􏴁􏴄

Field Name   Field Type      Description
dom-elt      domain-vector   A point from k_d-dimensional space
range-elt    range-vector    A point from k_r-dimensional space
split        integer         The splitting dimension
left         kd-tree         A kd-tree representing those points to the left of the splitting plane
right        kd-tree         A kd-tree representing those points to the right of the splitting plane

Table 6.2: The fields of a kd-tree node.
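In a modern language the node record of Table 6.2 might be sketched as follows. The field names follow the table; the class itself, the tuple vector type, and the use of None for an empty tree are my additions:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    Vector = Tuple[float, ...]

    @dataclass
    class KDNode:
        dom_elt: Vector            # a point from k_d-dimensional space
        range_elt: Vector          # a point from k_r-dimensional space
        split: int                 # the splitting dimension (irrelevant at a leaf)
        left: Optional["KDNode"]   # points to the left of the splitting plane
        right: Optional["KDNode"]  # points to the right of the splitting plane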
The exemplar-set E is represented by the set of nodes in the kd-tree, each node representing one exemplar. The dom-elt field represents the domain-vector of the exemplar and the range-elt field represents the range-vector. The dom-elt component is the index for the node. It splits the space into two subspaces according to the splitting hyperplane of the node. All the points in the "left" subspace are represented by the left subtree, and the points in the "right" subspace by the right subtree. The splitting hyperplane is a plane which passes through dom-elt and which is perpendicular to the direction specified by the split field. Let i be the value of the split field. Then a point is to the left of dom-elt if and only if its ith component is less than the ith component of dom-elt. The complementary definition holds for the right field. If a node has no children, then the splitting hyperplane is not required.
Figure 6.1 demonstrates a kd-tree representation of the four dom-elt points (2, 5), (3, 8), (6, 3) and (8, 9). The root node, with dom-elt (2, 5), splits the plane along y = 5 into two subspaces. The point (6, 3) lies in the lower subspace, that is {(x, y) | y < 5}, and so is in the left subtree. Figure 6.2 shows how the nodes partition the plane.
6.3.1 Formal Specification of a kd-tree

The reader who is content with the informal description above can omit this section. I define a mapping

    exset-rep : kd-tree → exemplar-set        (6.3)

which maps the tree to the exemplar-set it represents:

    exset-rep(empty) = ∅
    exset-rep(⟨d, r, −, empty, empty⟩) = {(d, r)}
    exset-rep(⟨d, r, split, tree_left, tree_right⟩) =
        exset-rep(tree_left) ∪ {(d, r)} ∪ exset-rep(tree_right)        (6.4)

Figure 6.1: A 2d-tree of four elements. The splitting planes are not indicated. The [2,5] node splits along the y = 5 plane and the [3,8] node splits along the x = 3 plane.

Figure 6.2: How the tree of Figure 6.1 splits up the x,y plane.

The invariant is that subtrees only ever contain dom-elts which are on the correct side of all ancestors' splitting planes:
    Is-legal-kdtree(empty).
    Is-legal-kdtree(⟨d, r, −, empty, empty⟩).
    Is-legal-kdtree(⟨d, r, split, tree_left, tree_right⟩) ⇔
        ∀(d', r') ∈ exset-rep(tree_left) : d'_split ≤ d_split
        ∧ ∀(d', r') ∈ exset-rep(tree_right) : d'_split > d_split
        ∧ Is-legal-kdtree(tree_left) ∧ Is-legal-kdtree(tree_right)        (6.5)

6.3.2 Constructing a kd-tree

Given an exemplar-set E, a kd-tree can be constructed by the algorithm in Table 6.3. The pivot-choosing procedure of Step 2 inspects the set and chooses a "good" domain vector from this set to use as the tree's root. The discussion of how such a root is chosen is deferred to Section 6.7. Whichever exemplar is chosen as root will not affect the correctness of the kd-tree, though the tree's maximum depth and the shape of the hyperregions will be affected.
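Equations 6.4 and 6.5 translate almost verbatim into recursive code. The sketch below, which assumes the KDNode record above (with vectors stored as tuples), is handy for testing any construction routine:

    def exset_rep(tree):
        # Equation 6.4: the exemplar-set represented by a (sub)tree.
        if tree is None:
            return set()
        return (exset_rep(tree.left) | {(tree.dom_elt, tree.range_elt)}
                | exset_rep(tree.right))

    def is_legal_kdtree(tree):
        # Equation 6.5: every dom-elt lies on the correct side of its
        # ancestors' splitting planes.
        if tree is None or (tree.left is None and tree.right is None):
            return True
        s = tree.split
        return (all(d[s] <= tree.dom_elt[s] for d, _ in exset_rep(tree.left))
                and all(d[s] > tree.dom_elt[s] for d, _ in exset_rep(tree.right))
                and is_legal_kdtree(tree.left) and is_legal_kdtree(tree.right))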
6.4 Nearest Neighbour Search

In this section, I describe the nearest neighbour algorithm which operates on kd-trees. I begin with an informal description and worked example, and then give the precise algorithm.

A first approximation is initially found at the leaf node which contains the target point. In Figure 6.3 the target point is marked X and the leaf node of the region containing the target is coloured black. As is exemplified in this case, this first approximation is not necessarily the nearest neighbour, but at least we know any potential nearer neighbour must lie closer, and so must lie within the circle centred on X and passing through the leaf node. We now back up to the parent of the current node. In Figure 6.4 this parent is the black node. We compute whether it is possible for a closer solution to that so far found to exist in this parent's other child. Here it is not possible, because the circle does not intersect with the (shaded) space occupied by the parent's other child. If no closer neighbour can exist in the other child, the algorithm can immediately move up a further level, else it must recursively explore the other child. In this example, the next parent which is checked will need to be explored, because the area it covers (i.e. everywhere north of the central horizontal line) does intersect with the best circle so far.

Table 6.4 describes my actual implementation of the nearest neighbour algorithm. It is called with four parameters: the kd-tree, the target domain vector, a representation of a hyperrectangle in Domain, and a value indicating the maximum distance from the target which is worth searching. The search will only take place within those portions of the kd-tree which lie both in the hyperrectangle and within the maximum distance to the target.

Algorithm: Constructing a kd-tree

Input:  exset, of type exemplar-set
Output: kd, of type kdtree
Pre:  None
Post: exset = exset-rep(kd) ∧ Is-legal-kdtree(kd)

Code:
1.   If exset is empty then return the empty kdtree.
2.   Call pivot-choosing procedure, which returns two values:
     ex := a member of exset, and split := the splitting dimension.
3.   d := domain vector of ex
4.   exset' := exset with ex removed
5.   r := range vector of ex
6.   exset_left := {(d', r') ∈ exset' | d'_split ≤ d_split}
7.   exset_right := {(d', r') ∈ exset' | d'_split > d_split}
8.   kd_left := recursively construct kd-tree from exset_left
9.   kd_right := recursively construct kd-tree from exset_right
10.  kd := ⟨d, r, split, kd_left, kd_right⟩

Proof: By induction on the length of exset and the definitions of exset-rep and Is-legal-kdtree.

Table 6.3: Constructing a kd-tree from a set of exemplars.
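A possible Python transcription of Table 6.3 follows. It is a sketch: Step 2 needs a concrete pivot rule, and purely for illustration I use the "median of the most spread dimension" rule discussed in Section 6.7.2; the thesis's preferred rule is also described there.

    def build_kdtree(exset):
        # exset: list of (domain_vector, range_vector) tuples.
        if not exset:
            return None                                        # Step 1
        k = len(exset[0][0])
        # Step 2 (illustrative rule): split on the dimension of maximum
        # spread, pivot on the exemplar with the median component there.
        split = max(range(k), key=lambda i: max(d[i] for d, _ in exset)
                                          - min(d[i] for d, _ in exset))
        exset = sorted(exset, key=lambda ex: ex[0][split])
        mid = len(exset) // 2
        (d, r), rest = exset[mid], exset[:mid] + exset[mid + 1:]   # Steps 3-5
        left = [e for e in rest if e[0][split] <= d[split]]        # Step 6
        right = [e for e in rest if e[0][split] > d[split]]        # Step 7
        return KDNode(d, r, split,
                      build_kdtree(left), build_kdtree(right))     # Steps 8-10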
Figure 6.3: The black dot is the dot which owns the leaf node containing the target (the cross). Any nearer neighbour must lie inside this circle.

The caller of the routine will generally specify the infinite hyperrectangle which covers the whole of Domain, and the infinite maximum distance.
Before discussing its execution, I will explain how the operations on the hyperrectangles can be implemented. A hyperrectangle is represented by two arrays: one of its minimum coordinates, the other of its maximum coordinates. To cut the hyperrectangle, so that one of its edges is moved closer to its centre, the appropriate array component is altered. To check whether a hyperrectangle hr intersects with a hypersphere of radius r centred at point t, we find the point p in hr which is closest to t. Write hr_i^min for the minimum extreme of hr in the ith dimension and hr_i^max for the maximum extreme. Then p_i, the ith component of this closest point, is computed thus:

            ⎧ hr_i^min   if t_i ≤ hr_i^min
    p_i =   ⎨ t_i        if hr_i^min < t_i < hr_i^max        (6.6)
            ⎩ hr_i^max   if t_i ≥ hr_i^max

The objects intersect only if the distance between p and t is less than or equal to r.

The search is depth first, and uses the heuristic of searching first the child node which contains the target. Step 1 deals with the trivial empty tree case, and Steps 2 and 3 assign two important local variables. Step 4 cuts the current hyperrectangle into the two hyperrectangles covering the space occupied by the child nodes. Steps 5-7 determine which child contains the target. After Step 8, when this initial child is searched, it may be possible to prove that there cannot be any closer point in the hyperrectangle of the further child. In particular, the point at the current node must be out of range. The test is made in Steps 9 and 10. Step 9 restricts the maximum radius in which any possible closer point could lie, and then the test in Step 10 checks whether there is any space in the hyperrectangle of the further child which lies within this radius.

Figure 6.4: The black dot is the parent of the closest found so far. In this case the black dot's other child (shaded grey) need not be searched.
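Returning to Equation 6.6: it is simply a componentwise clamp, so the closest-point computation and the intersection test take only a few lines. A sketch, with the hyperrectangle held as the two arrays just described (function names mine):

    import math

    def closest_point_in_hr(hr_min, hr_max, t):
        # Equation 6.6: clamp each component of t to the hyperrectangle.
        return [min(max(ti, lo), hi) for ti, lo, hi in zip(t, hr_min, hr_max)]

    def hr_intersects_sphere(hr_min, hr_max, t, r):
        # The objects intersect iff the closest point is within distance r of t.
        return math.dist(closest_point_in_hr(hr_min, hr_max, t), t) <= r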

Algorithm: Nearest Neighbour in a kd-tree

Input:  kd, of type kdtree
        target, of type domain vector
        hr, of type hyperrectangle
        max-dist-sqd, of type float
Output: nearest, of type exemplar
        dist-sqd, of type float
Pre:  Is-legal-kdtree(kd)
Post: Informally, the postcondition is that nearest is a nearest exemplar to target which also lies both within the hyperrectangle hr and within distance √(max-dist-sqd) of target. √(dist-sqd) is the distance of this nearest point. If there is no such point then dist-sqd contains infinity.

Code:
1.   if kd is empty then set dist-sqd to infinity and exit.
2.   s := split field of kd
3.   pivot := dom-elt field of kd
4.   Cut hr into two sub-hyperrectangles left-hr and right-hr.
     The cut plane is through pivot and perpendicular to the s dimension.
5.   target-in-left := target_s ≤ pivot_s
6.   if target-in-left then
     6.1  nearer-kd := left field of kd and nearer-hr := left-hr
     6.2  further-kd := right field of kd and further-hr := right-hr
7.   if not target-in-left then
     7.1  nearer-kd := right field of kd and nearer-hr := right-hr
     7.2  further-kd := left field of kd and further-hr := left-hr
8.   Recursively call Nearest Neighbour with parameters
     (nearer-kd, target, nearer-hr, max-dist-sqd), storing the results
     in nearest and dist-sqd
9.   max-dist-sqd := minimum of max-dist-sqd and dist-sqd
10.  A nearer point could only lie in further-kd if there were some part
     of further-hr within distance √(max-dist-sqd) of target.
     if this is the case then
     10.1  if (pivot − target)² < dist-sqd then
           10.1.1  nearest := (pivot, range-elt field of kd)
           10.1.2  dist-sqd := (pivot − target)²
           10.1.3  max-dist-sqd := dist-sqd
     10.2  Recursively call Nearest Neighbour with parameters
           (further-kd, target, further-hr, max-dist-sqd), storing the
           results in temp-nearest and temp-dist-sqd
     10.3  If temp-dist-sqd < dist-sqd then
           10.3.1  nearest := temp-nearest and dist-sqd := temp-dist-sqd

Proof: Outlined in text.

Table 6.4: The Nearest Neighbour Algorithm.

If it is not possible then no further search is necessary. If it is possible, then Step 10.1 checks if the point associated with the current node of the tree is closer than the closest yet. Then, in Step 10.2, the further child is recursively searched. The maximum distance worth examining in this further search is the distance to the closest point yet discovered.

The proof that this will find the nearest neighbour within the constraints is by induction on the size of the kd-tree. If the cutoff were not made in Step 10, then the proof would be straightforward: the point returned is the closest out of (i) the closest point in the nearer child, (ii) the point at the current node and (iii) the closest point in the further child. If the cutoff were made in Step 10, then the point returned is the closest point in the nearer child, and we can show that neither the current point, nor any point in the further child, can possibly be closer.

Many local optimizations are possible which, while not altering the asymptotic performance of the algorithm, will multiply the speed by a constant factor. In particular, it is in practice possible to hold almost all of the search state globally, instead of passing it as recursive parameters.
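For readers who prefer running code, here is one way Table 6.4 might be transcribed into Python. It is a sketch rather than the thesis implementation: results come back as a (nearest, dist_sqd) pair instead of output parameters, and it reuses the KDNode record and the closest_point_in_hr helper from the earlier sketches.

    def nn_search(kd, target, hr_min, hr_max, max_dist_sqd):
        if kd is None:
            return None, math.inf                                  # Step 1
        s, pivot = kd.split, kd.dom_elt                            # Steps 2-3
        # Step 4: cut hr at pivot[s], perpendicular to dimension s.
        left_max, right_min = list(hr_max), list(hr_min)
        left_max[s], right_min[s] = pivot[s], pivot[s]
        if target[s] <= pivot[s]:                                  # Steps 5-7
            nearer = (kd.left, hr_min, left_max)
            further = (kd.right, right_min, hr_max)
        else:
            nearer = (kd.right, right_min, hr_max)
            further = (kd.left, hr_min, left_max)
        nearest, dist_sqd = nn_search(nearer[0], target,
                                      nearer[1], nearer[2],
                                      max_dist_sqd)                # Step 8
        max_dist_sqd = min(max_dist_sqd, dist_sqd)                 # Step 9
        # Step 10: does further-hr come within sqrt(max_dist_sqd) of target?
        p = closest_point_in_hr(further[1], further[2], target)
        if math.dist(p, target) ** 2 < max_dist_sqd:
            d2 = math.dist(pivot, target) ** 2
            if d2 < dist_sqd:                                      # Step 10.1
                nearest, dist_sqd = (pivot, kd.range_elt), d2
                max_dist_sqd = dist_sqd
            temp, temp_d2 = nn_search(further[0], target,
                                      further[1], further[2],
                                      max_dist_sqd)                # Step 10.2
            if temp_d2 < dist_sqd:                                 # Step 10.3
                nearest, dist_sqd = temp, temp_d2
        return nearest, dist_sqd

The top-level call supplies the infinite hyperrectangle and infinite radius, e.g. nn_search(tree, target, [-math.inf] * k, [math.inf] * k, math.inf), where k is the domain dimensionality.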
6.5 Theoretical Behaviour

Given a kd-tree with N nodes, how many nodes need to be inspected in order to find the proven nearest neighbour using the algorithm in Section 6.4? It is clear at once that on average at least O(log N) inspections are necessary, because any nearest neighbour search requires traversal to at least one leaf of the tree. It is also clear that no more than N nodes are searched: the algorithm visits each node at most once.

Figure 6.5 graphically shows why we might expect considerably fewer than N nodes to be visited: the shaded areas correspond to areas of the kd-tree which were cut off.

The important values are (i) the worst case number of inspections and (ii) the expected number of inspections. It is actually easy to construct worst case distributions of the points which will force nearly all the nodes to be inspected. In Figure 6.6, the tree is two-dimensional, and the points are scattered along the circumference of a circle. If we request the nearest neighbour with the target close to the centre of the circle, it will therefore be necessary for each rectangle, and hence each leaf, to be inspected (this is in order to ensure that there is no point lying inside the circle in any rectangle).

Calculation of the expected number of inspections is difficult, because the analysis depends critically on the expected distribution of the points in the kd-tree, and the expected distribution of the target points presented to the nearest neighbour algorithm.

The analysis is performed in [Friedman et al., 1977]. This paper considers the expected number of hyperrectangles corresponding to leaf nodes which will provably need to be searched. Such hyperrectangles intersect the volume enclosed by a hypersphere centred on the query point whose surface passes through the nearest neighbour. For example, in Figure 6.5 the hypersphere (in this case a circle) is shown, and the number of intersecting hyperrectangles is two.

Figure 6.5: Generally during a nearest neighbour search only a few leaf nodes need to be inspected.

Figure 6.6: A bad distribution which forces almost all nodes to be inspected.

The paper shows that the expected number of intersecting hyperrectangles is independent of N, the number of exemplars. The asymptotic search time is thus logarithmic, because the time to descend from the root of the tree to the leaves is logarithmic (in a balanced tree), and then an expected constant amount of backtracking is required.

However, this reasoning was based on the assumption that the hyperrectangles in the tree tend to be hypercubic in shape. Empirical evidence in my investigations has shown that this is not generally the case for their tree building strategy. This is discussed and demonstrated in Section 6.7.

A second danger is that the cost, while independent of N, is exponentially dependent on k, the dimensionality of the domain vectors.¹

¹ This was pointed out to the author by N. Maclaren.

Thus theoretical analysis provides some insight into the cost but, here, empirical investigation will be used to examine the expense of nearest neighbour in practice.

6.6 Empirical Behaviour

In this section I investigate the empirical behaviour of the nearest neighbour searching algorithm. We expect that the number of nodes inspected in the tree varies according to the following properties of the tree:

• N, the size of the tree.

• k_dom, the dimensionality of the domain vectors in the tree. This value is the k in kd-tree.

• d_distrib, the distribution of the domain vectors. This can be quantified as the "true" dimensionality of the vectors. For example, if the vectors had three components, but all lay on the surface of a sphere, then the underlying dimensionality would be 2. In general, discovery of the underlying dimensionality of a given sample of points is extremely difficult, but for these tests it is a straightforward matter to generate such points. To make a kd-tree with underlying dimensionality d_distrib, I use randomly generated k_dom-dimensional domain vectors which lie on a d_distrib-dimensional hyperelliptical surface. The random vector generation algorithm is as follows: generate d_distrib random angles θ_i ∈ [0, 2π) where 0 ≤ i < d_distrib. Then let the jth component of the vector be Π_{i=0}^{d_distrib−1} sin(θ_i + φ_ij). The phase angles φ_ij are defined as φ_ij = π/2 if the jth bit of the binary representation of i is 1, and zero otherwise. (A code sketch of this generator follows the list.)

• d_target, the probability distribution from which the search target vector will be selected. I shall assume that this distribution is the same as that which determines the domain vectors. This is indeed what will happen when the kd-tree is used for learning control.
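The generation recipe is mechanical enough to state as code; a sketch (the function name is mine):

    import math, random

    def random_vector(k_dom, d_distrib):
        # A point on a d_distrib-dimensional hyperelliptical surface
        # embedded in k_dom-dimensional space.
        theta = [random.uniform(0.0, 2.0 * math.pi) for _ in range(d_distrib)]
        def phase(i, j):
            # phi_ij = pi/2 if the jth bit of i is 1, else 0.
            return 0.5 * math.pi if (i >> j) & 1 else 0.0
        return tuple(math.prod(math.sin(theta[i] + phase(i, j))
                               for i in range(d_distrib))
                     for j in range(k_dom))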

In the following sections I investigate how performance depends on each of these properties.

6.6.1 Performance against Tree Size

Figures 6.7 and 6.8 graph the number of nodes inspected against the number of nodes in the entire kd-tree. Each value was derived by generating a random kd-tree, and then requesting 500 random nearest neighbour searches on the kd-tree. The average number of inspected nodes was recorded. A node was counted as being inspected if the distance between it and the target was computed. Figure 6.7 was obtained from a 4d-tree with an underlying distribution d_distrib = 3. Figure 6.8 used an 8d-tree with an underlying distribution d_distrib = 8.

It is immediately clear that after a certain point, the expense of a nearest neighbour search has no detectable increase with the size of the tree. This agrees with the proposed model of search cost: logarithmic with a large additive constant term.

Figure 6.7: Number of inspections required during a nearest neighbour search against the size of the kd-tree. In this experiment the tree was four-dimensional and the underlying distribution of the points was three-dimensional.

6.6.2 Performance against the "k" in kd-tree

Figure 6.9 graphs the number of nodes inspected against k_dom, the number of components in the kd-tree's domain vectors, for a 10,000 node tree. The underlying dimensionality was also k_dom. The number of inspections per search rises very quickly, possibly exponentially, with k_dom. This behaviour, the massive increase in cost with dimension, is familiar in computational geometry.

Figure 6.8: Number of inspections against kd-tree size for an eight-dimensional tree with an eight-dimensional underlying distribution.
Figure 6.9: Number of inspections graphed against tree dimension. In these experiments the points had an underlying distribution with the same dimensionality as the tree.

Figure 6.10: Number of inspections graphed against underlying dimensionality for a fourteen-dimensional tree.

6.6.3 Performance against the Distribution Dimensionality

This experiment confirms that it is d_distrib, the distribution dimension from which the points were selected, rather than k_dom, which critically affects the search performance. The trials for Figure 6.10 used a 10,000 node kd-tree with domain dimension of 14, for various values of d_distrib. The important observation is that for 14d-trees, the performance does improve greatly if the underlying distribution-dimension is relatively low. Conversely, Figure 6.11 shows that for a fixed (4-d) underlying dimensionality, the search expense does not seem to increase any worse than linearly with k_dom.

6.6.4 When the Target is not Chosen from the kd-tree's Distribution

In this experiment the points were distributed on a three dimensional elliptical surface in ten-dimensional space. The target vector was, however, chosen at random from a ten-dimensional distribution. The kd-tree contained 10,000 points. The average number of inspections over 50 searches was found to be 8,396. This compares with another experiment in which both points and target were distributed in ten dimensions and the average number of inspections was only 248. The reason for the appalling performance was exemplified in Figure 6.6: if the target is very far from its nearest neighbour then very many leaf nodes must be checked.
Figure 6.11: Number of inspections graphed against tree dimension, given a constant four dimensional underlying distribution.

6.6.5 Conclusion

The speed of the search (measured as the number of distance computations required) seems to vary:

• ... only marginally with tree size. If the tree is sufficiently large with respect to the number of dimensions, it is essentially constant.

• ... very quickly with the dimensionality of the distribution of the datapoints, d_distrib.

• ... linearly with the number of components in the kd-tree's domain (k_dom), given a fixed distribution dimension (d_distrib).

There is also evidence to suggest that unless the target vector is drawn from the same distribution as the kd-tree points, performance can be greatly worsened.

These results support the belief that real time searching for nearest neighbours is practical in a robotic system where we can expect the underlying dimensionality of the data points to be low, roughly less than 10. This need not mean that the vectors in the input space should have less than ten components. For data points obtained from robotic systems it will not be easy to decide what the underlying dimensionality is. However, Chapter 10 will show that the data does tend to lie within a number of low dimensional subspaces.

6.7 Further kd-tree Operations

In this section I discuss some other operations on kd-trees which are required for use in the SAB learning system. These include incrementally adding a point to a kd-tree, range searching, and selecting a pivot point.

6.7.1 Range Searching a kd-tree

The abstract range search operation on an exemplar-set finds all exemplars whose domain vectors are within a given distance of a target point:

    range-search : exemplar-set × Domain × ℜ → exemplar-set
    range-search(E, d, r) = {(d', r') ∈ E | (d − d')² < r²}

This is implemented by a modified nearest neighbour search. The modifications are that (i) the initial distance is not reduced as closer points are discovered and (ii) all discovered points within the distance are returned, not just the nearest. The complexity of this operation is shown, in [Preparata and Shamos, 1985], to still be logarithmic in N (the size of E) for a fixed range size.
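In code, the two modifications amount to carrying a fixed radius and accumulating every match. A sketch in the style of the earlier ones, using the distance from the target to the splitting plane as the cutoff test:

    def range_search(kd, target, r_sqd, found=None):
        # Collect all exemplars with squared distance below r_sqd from target.
        if found is None:
            found = []
        if kd is None:
            return found
        if math.dist(kd.dom_elt, target) ** 2 < r_sqd:
            found.append((kd.dom_elt, kd.range_elt))
        s, r = kd.split, math.sqrt(r_sqd)
        if target[s] - r <= kd.dom_elt[s]:   # ball reaches the left half-space
            range_search(kd.left, target, r_sqd, found)
        if target[s] + r > kd.dom_elt[s]:    # ball reaches the right half-space
            range_search(kd.right, target, r_sqd, found)
        return found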
6.7.2 Choosing a Pivot from an Exemplar Set

The tree building algorithm of Section 6.3 requires that a pivot and a splitting plane be selected from which to build the root of a kd-tree. It is desirable for the tree to be reasonably balanced, and also for the shapes of the hyperregions corresponding to leaf nodes to be fairly equally proportioned. The first criterion is important because a badly unbalanced tree would perhaps have O(N) accessing behaviour instead of O(log N). The second criterion is in order to maximize cutoff opportunities for the nearest neighbour search. This is difficult to formalize, but can be motivated by an illustration. In Figure 6.12 is a perfectly balanced kd-tree in which the leaf regions are very non-square. Figure 6.13 illustrates a kd-tree representing the same set of points, but which promotes squareness at the expense of some balance.

One pivoting strategy which would lead to a perfectly balanced tree, and which is suggested in [Omohundro, 1987], is to pick the splitting dimension as that with maximum variance, and let the pivot be the point with the median split component. This will, it is hoped, tend to promote square regions because, having split in one dimension, the next level in the tree is unlikely to find that the same dimension has maximum spread, and so will choose a different dimension. For uniform distributions this tends to perform reasonably well, but for badly skewed distributions the hyperregions tend to take long thin shapes. This is exemplified in Figure 6.12, which has been balanced using this standard median pivot choice.

To avoid this bad case, I choose a pivot which splits the exemplar set in the middle of the range of the most spread dimension. As can be seen in Figure 6.13, this tends to favour squarer regions at the expense of a slight imbalance in the kd-tree. This means that large empty areas of space are filled with only a few hyperrectangles which are themselves large. Thus, the number of leaf nodes which need to be inspected in case they contain a nearer neighbour is smaller than for the original case, which had many small thin hyperrectangles.

Figure 6.12: A 2d tree balanced using the 'median of the most spread dimension' pivoting strategy.

Figure 6.13: A 2d tree balanced using the 'closest to the centre of the widest dimension' pivoting strategy.

My pivot choice algorithm is to firstly choose the splitting dimension as the longest dimension of the current hyperrectangle, and then choose the pivot as the point closest to the middle of the hyperrectangle along this dimension. Occasionally, this pivot may even be an extreme point along its dimension, leading to an entirely unbalanced node. This is worth it, because it creates a large empty leaf node. It is possible but extremely unlikely that the points could be distributed in such a way as to cause the tree to have one empty child at every level. This would be unacceptable, and so above a certain depth threshold, the pivots are chosen using the standard median technique. Selecting the median as the split and selecting the closest to the centre of the range are both O(N) operations, and so either way a tree rebalance is O(N log N).
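A sketch of this pivot rule in Python, assuming the hyperrectangle extremes passed in are finite (e.g. the bounding box of the data at the root), with the depth-threshold fallback omitted:

    def choose_pivot(exset, hr_min, hr_max):
        # Split on the longest dimension of the current hyperrectangle, and
        # pivot on the exemplar closest to the middle of that range.
        split = max(range(len(hr_min)), key=lambda i: hr_max[i] - hr_min[i])
        middle = 0.5 * (hr_min[split] + hr_max[split])
        pivot = min(exset, key=lambda ex: abs(ex[0][split] - middle))
        return pivot, split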
6.7.3 Incrementally Adding a Point to a kd-tree

Firstly, the leaf node which contains the new point is computed. The hyperrectangle corresponding to this leaf is also obtained. See Section 6.4 for the hyperrectangle implementation. When the leaf node is found it may either be (i) empty, in which case it is simply replaced by a new singleton node, or (ii) it contains a singleton node. In case (ii) the singleton node must be given a child, and so its previously irrelevant split field must be defined. The split field should be chosen to preserve the squareness of the new subhyperrectangles. A simple heuristic is used. The split dimension is chosen as the dimension in which the hyperrectangle is longest. This heuristic is motivated by the same requirement as for tree balancing: that the regions should be as square as possible, even if this means some loss of balance.

This splitting choice is just a heuristic, and there is no guarantee that a series of points added in this way will preserve the balance of the kd-tree, nor that the hyperrectangles will be well shaped for nearest neighbour search. Thus, on occasion (such as when the depth exceeds a small multiple of the best possible depth) the tree is rebuilt. Incremental addition costs O(log N).

6.7.4 Q Nearest Neighbours

This uses a modified version of the nearest neighbour search. Instead of only searching within a sphere whose radius is the closest distance yet found, the search is within a sphere whose radius is the Qth closest yet found. Until Q points have been found, this distance is infinity.
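One common realization keeps the Q best candidates in a bounded max-heap. The following sketch is not from the thesis, and it uses the distance to the splitting plane as its cutoff, a simplification of the hyperrectangle test of Table 6.4:

    import heapq

    def q_nearest(kd, target, q, heap=None):
        # heap holds (-dist_sqd, exemplar); its root is the Qth-closest so far.
        if heap is None:
            heap = []
        if kd is None:
            return heap
        d2 = math.dist(kd.dom_elt, target) ** 2
        entry = (-d2, (kd.dom_elt, kd.range_elt))
        if len(heap) < q:
            heapq.heappush(heap, entry)
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, entry)
        s = kd.split
        nearer, further = ((kd.left, kd.right) if target[s] <= kd.dom_elt[s]
                           else (kd.right, kd.left))
        q_nearest(nearer, target, q, heap)
        # The search radius is infinite until q points have been found.
        radius_sqd = math.inf if len(heap) < q else -heap[0][0]
        if (target[s] - kd.dom_elt[s]) ** 2 < radius_sqd:
            q_nearest(further, target, q, heap)
        return heap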
6.7.5 Deleting a Point from a kd-tree

If the point is at a leaf, this is straightforward. Otherwise, it is difficult, because the structures of both trees below this node are pivoted around the point we wish to remove. One solution would be to rebuild the tree below the deleted point, but on occasion this would be very expensive. My solution is to mark the point as deleted with an extra field in the kd-tree node, and to ignore deletion nodes in nearest neighbour and similar searches. When the tree is next rebuilt, all deletion nodes are removed.

Bibliography

[Bentley, 1980] J. L. Bentley. Multidimensional Divide and Conquer. Communications of the ACM, 23(4):214–229, 1980.

[Friedman et al., 1977] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.

[Omohundro, 1987] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. Journal of Complex Systems, 1(2):273–347, 1987.

[Preparata and Shamos, 1985] F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985.