Searching Trajectories by Locations – An Efficiency Study
Zaiben Chen†, Heng Tao Shen†, Xiaofang Zhou†, Yu Zheng‡, Xing Xie‡ † School of Information Technology & Electrical Engineering
The University of Queensland, QLD 4072 Australia
‡Microsoft Research Asia, Beijing 100080 China
{zaiben, shenht, zxf}@itee.uq.edu.au, {yuzheng, xingx}@microsoft.com
ABSTRACT
Trajectory search has long been an attractive and challeng- ing topic which blooms various interesting applications in spatial-temporal databases. In this work, we study a new problem of searching trajectories by locations, in which con- text the query is only a small set of locations with or without an order specified, while the target is to find the k Best- Connected Trajectories (k-BCT) from a database such that the k-BCT best connect the designated locations geographi- cally. Different from the conventional trajectory search that looks for similar trajectories w.r.t. shape or other criteria by using a sample query trajectory, we focus on the goodness of connection provided by a trajectory to the specified query locations. This new query can benefit users in many novel applications such as trip planning.
In our work, we firstly define a new similarity function for measuring how well a trajectory connects the query loca- tions, with both spatial distance and order constraint being considered. Upon the observation that the number of query locations is normally small (e.g. 10 or less) since it is imprac- tical for a user to input too many locations, we analyze the feasibility of using a general-purpose spatial index to achieve efficient k-BCT search, based on a simple Incremental k- NN based Algorithm (IKNN). The IKNN effectively prunes and refines trajectories by using the devised lower bound and upper bound of similarity. Our contributions mainly lie in adapting the best-first and depth-first k-NN algorithms to the basic IKNN properly, and more importantly ensur- ing the efficiency in both search effort and memory usage. An in-depth study on the adaption and its efficiency is pro- vided. Further optimization is also presented to accelerate the IKNN algorithm. Finally, we verify the efficiency of the algorithm by extensive experiments.
Categories and Subject Descriptors
H.2.8 [Database Applications]: Spatial databases and GIS
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD’10, June 6–11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM 978-1-4503-0032-2/10/06 …$10.00.
General Terms
Algorithms, Performance
Keywords
Trajectory Search, Locations, Efficiency, Optimization
1. INTRODUCTION
The proliferation of mobile devices enables people to log their geographical positions and to trace historical move- ments, which has spawned various novel applications. An emerging one is trajectory sharing and searching. Different from the conventional similarity search [22, 21, 8, 20] over trajectory databases that uses a sample query trajectory for full matching according to shape or other criteria, new tra- jectory search applications demand to find trajectories that connect a few selected locations, i.e., are close to all the loca- tions. As exemplified in Figure 1, given a set of coordinates of locations (represented by dots) acquired by clicking on map or image geo-coding, we retrieve from a database the raw trajectory (indicated by the marked track) that best connects these locations. This query can benefit travelers when they are planning a trip to multiple places of interest in an unfamiliar city by providing similar routes traveled by other people for reference.
Figure 1: Trajectory search by locations – an example
Moreover, by utilizing this new query, a tourist/travel agency can further survey popular routes through some at- tractions; a zoologist is capable of finding the nearest tra- jectories of animals to some stationary points (e.g. source of food) [11]; a transport department is able to investigate the driving habits of local residence within a suburb and onwards analyze the traffic causes of some critical road in- tersections, etc. This work is based on the Microsoft Geo- Life project [1], and some example websites for sharing and searching trajectories include Bikely, GPS-Waypoints and
Share My Routes1, that all provide a platform for people to share trajectories of driving, hiking, boating etc., and some very simple search functions have already been developed.
In this work, we propose a new type of trajectory query called the k Best-Connected Trajectory (k-BCT) query for searching trajectories by multiple geographical locations. Gen- erally, the k-BCT query is a set of locations indicated by coordinates like (latitude,longitude), which could be a fa- mous attraction, a nameless beach, or any arbitrary place approximated by the center location. The user may also specify a preferred visiting order for the intended places, in which case the order of a trajectory needs to be taken into account as well. For example, a query with three locations A, B and C appears as
{A(37.2601,122.0941) , B(37.2721,122.0648) , C(37.3344,122.1538) }
where (37.2601, 122.0941) denotes the (latitude, longitude) of A. If an order constraint further applies, the found trajecto- ries connecting A, B and C should also preserve the visiting order (A → B → C). Obviously, the conventional similar- ity search [22, 21, 8, 20, 9] can not address this problem since the similarity measure is different in our applications, which should reveal the connectivity between a trajectory and the query locations rather than the similarity in shape. Furthermore, the query is no longer a full trajectory or any sub-sequence, but a few arbitrary locations, which makes the query more flexible. In essence, the k-BCT query can be treated as an extension of the traditional single point- based query [11, 18] which looks for the nearest trajectories to only one location, while the k-BCT query searches the ‘nearest’ trajectories to multiple locations.
To answer the k-BCT query, we can separately issue at each query location an independent single point-based query to find out the nearest trajectories w.r.t. each location, and then the trajectories within the intersection of query results, if exist, are supposed to be close to all the query locations and thus with a good connectivity. Based on this basic idea, we propose an Incremental k-NN based Algorithm (IKNN), which retrieves the nearest trajectory points w.r.t. each query location incrementally and examines the k-BCT from the trajectory points discovered so far. In this method, the pruning and refinement of the search are conducted by using the lower bound and upper bound of trajectory simi- larity that are derived from the found trajectory points, and the retrieval of nearest trajectory points is based on the tra- ditional best-first and depth-first k-NN algorithms [19, 14] over an R-tree [13] index. Note that the utilization of the commonly used R-tree is only for browsing distance between points. Other spatial indexes can also be easily adapted.
However, the problem of the basic IKNN algorithm that we concern is about the search efficiency, which originates from the efficiency of retrieving the nearest trajectory points incrementally. For the best-first k-NN algorithm [14], it pro- vides an I/O optimal solution but causes a lack of guarantee in memory usage, which means it may cause an overlarge memory consumption that could potentially bring down the system or severely affect the number of concurrent queries. For the depth-first k-NN algorithm, it’s consumption of mem- ory is ascertained, but it performs very poorly in the IKNN algorithm as it involves too much repetition of search regions
1Bikely (http://www.bikely.com/)
GPS-Waypoints (http://www.gps-waypoints.net/) Share My Routes (http://www.sharemyroutes.com/)
in our case. Therefore, besides proposing the new query as well as a new similarity function, we theoretically analyze the cost of the IKNN adopting these two k-NN algorithms respectively, and a major part of our contribution lies in the adaption of the k-NN algorithms to the IKNN. Through the adaption mechanism, the IKNN adopting the depth-first strategy achieves a good search performance comparable to the best-first strategy in terms of R-tree node access and query time, while it also guarantees a fixed low memory consumption. Moreover, we also present an optimization solution to conduct the IKNN search over multiple query locations smartly, and it achieves notable improvement.
The remainder of the paper is organized as follows. In Sec- tion 2 we discuss the related work. In Section 3, the formal definition of the k-BCT query and its similarity function are given. The IKNN algorithm is presented in Section 4 with detailed analysis. Finally we show our experiment results in Section 5 and draw a conclusion in Section 6.
2.
REL ATED WORK
Our work shares a common methodology with similarity search on trajectory/time series data. Basically the first step is to define a similarity/distance function by some kind of aggregation of distances between trajectory points, and then efficient query processing algorithms are needed to address the problem of searching over a large set of candidate tra- jectories. A considerable amount of related work has been proposed before. Several typical similarity functions for dif- ferent applications include Euclidean Distance [2], Dynamic Time Warping (DTW) [22], Longest Common Subsequence (LCSS) [21], Edit Distance with Real Penalty (ERP) [7], Edit Distance on Real Sequences (EDR) [8], and enhanced techniques for evaluating the similarity of time series are also studied in [16, 20].
The pioneering work by Agrawal et al. in [2] utilizes Dis- crete Fourier Transform (DFT) to transform trajectories to multi-dimensional points, and then the comparison between trajectories can be done by comparing the Euclidean dis- tance between the points in the feature space, which should match or under-estimate the real distance. Faloutsos et al. extend this work in [10] to allow subsequence matching. Other transform functions such as Discrete Wavelet Trans- form (DWT) [6] are also available. Cai et al. [5] utilize the Chebyshev polynomials for approximating and indexing tra- jectories. However, those methods require the time series to have the same length and the transform functions are not applicable for our query.
Frentzos et al. define in [12] a distance measure as the area of the region between two trajectories. Nevertheless, their method does not work if the two trajectories are of differ- ent lengths. In contrast, DTW [22] adopts time-shifting in the comparison of trajectories and the basic idea is to allow ‘repeating’ some points as many times as needed to achieve the best alignment. Yet, all points including noises have to be matched, and therefore it may accumulate an over large DTW distance merely due to a single noisy point. Com- pared with DTW, LCSS [21] allows to ‘skip’ some points other than re-arrange the order of these points. As a con- sequence, far-away points would be ignored and thus it is robust to noises. However, it needs a matching threshold to determine whether to take a point into account. Chen et al. propose in [8] the EDR distance, which is similar to LCSS in using a threshold parameter to determine if two points are
matched in comparison, while considering penalties to gaps. In [7], Chen et al. also propose the ERP distance aiming to combine the merits of DTW and EDR, by using a constant reference point for computing distance.
A similarity function is normally designed by consider- ing the actual application, and none of the similarity func- tions above satisfies the requirements of our applications, in which the query is broken into a small set of locations, and we concern more on whether a trajectory provides a good connection to query locations rather than whether the tra- jectory is similar to the query in shape. Therefore, due to the new application requirements, we need to define a new similarity function. For query processing, this is the first work on trajectory search by multiple query locations. As the number of query locations is typically small, we are able to use a spatial method for the search. Here, our IKNN al- gorithm utilizes the k-NN algorithms in [19] and [14] over an R-tree [13] index of all the trajectory points. Notice that we do not consider the STR-tree or TB-tree [18] for trajectory indexing and searching since they focus more on trajectory preservation and leave other spatial properties like spatial proximity aside, while in the IKNN algorithm, the R-tree index is only for fast retrieval of nearest points.
3. THE K-BCT QUERY
A trajectory database contains a large collection of raw trajectories recorded by GPS devices in the form of a series of locations/points {p1, p2, · · · , pl}, where pi is a trajectory point represented by (latitude, longitude), and l is the num- ber of points in the trajectory. The k Best-Connected Tra- jectory (k-BCT) query Q, according to our applications, is represented by a set of locations:
Q = {q1,q2,··· ,qm}
m is the number of query locations, and the user may further assign a visiting order to the locations, in which case Q is treated as a sequence of locations from q1 to qm. Notice that we use the term ‘location’ and ‘point’ interchangeably.
In order to search for the best-connected trajectories from a database, firstly we need a similarity function to score how well a trajectory connects the query locations, which should consider the distance from the trajectory to each query loca- tion. We simply define the distance Distq between a query locationqi andatrajectoryR={p1,p2,···,pl}as:
Distq(qi,R) = min{Diste(qi,pj)} pj ∈R
Diste (qi , pj ) is the Euclidean distance between qi and pj , so actually Distq(qi,R) is the shortest distance from qi to any point on R. Here we call the < qi,pj > a matched pair where pj is the nearest point on R to qi. Then for a query Q that consists of {q1 , q2 , · · · , qm } without an order constraint, we define the similarity Sim(Q, R) between Q and R based on the distance of each matched pair as follows:
m
Sim(Q, R) = X e−Distq (qi ,R) (1)
i=1
The intuition of using the exponential function e−Distq (qi ,R) to measure the contribution of each matched pair to Sim(Q, R) is that we would like to assign a larger contribution to a closer matched pair of points while giving a much lower value
to those far away pairs, which results in an exponential de-
creasing of contribution as Distq(qi,R) linearly increases.
Therefore, only a trajectory that is really close to all the
query locations is considered to be ‘similar’, which agrees
with the perception of human beings. A matching example
is illustrated in Figure 2(a), where the query locations q1,
q2 and q3 are matched to the closest points p6, p4 and p7
on R respectively. Therefore, Sim(Q,R) = e−Diste(q1,p6) + e−Diste(q2,p4) + e−Diste(q3,p7) = e−1.5 + e−0.1 + e−0.1. The
dashed ellipses here indicate the matched pairs.
p1 p2 p3
q2
p4
q2
p4
p5
q p6 1
(a)
p5
q p6 1
(b)
q3 R
Diste(q1,p6)= 1.5 Diste(q2,p4)= 0.1 Diste(q3,p7)= 0.1
Diste(q1,p3)= 2.0 Diste(q2,p4)= 0.1 Diste(q3,p7)= 0.1
p7
p8
p1 p2 p3
q3 R
p7 p8
Figure 2: Matching a trajectory
When a visiting order is further specified for the query locations, the matched points on R have to preserve the order. Thus, an adjustment of matching may be necessary, and the matched point pj for location qi may not be the nearest point to qi any longer. Considering the example in Figure 2(a) again, suppose the user wishes to make visits in the order of q1 → q2 → q3. The actual order of the matched pointsonRisp4 →p6 →p7 (assumeRgoesfromleftto right), rather than p6 → p4 → p7, therefore this matching does not conform with the user-specified order, and hence we need to adjust the matching of trajectory points to satisfy the order constraint. As shown in Figure 2(b), q1 is re- matched with p3 and then the new visiting order is p3 → p4 → p7 which correctly preserves the user-specified order. Here, our target is to maximize the sum of the contribution of each matched pair, while still keeping the order of visits, i.e., the sum of the contribution of < q1,p3 >, < q2,p4 > and < q3 , p7 > pairs is maximum among all the possible combinations that preserve the order.
However, the calculation of similarity is not straightfor- ward when considering a visiting order. Given a sequence of query locations Q = {q1, q2, · · · , qm}, and a trajectory R = {p1, p2, · · · , pl}, we define the similarity Simo(Q, R) for ordered query locations in a recursive way as follows:
8><> e−Diste(Head(Q),Head(R)) +Simo(Rest(Q), R) (2)
: Simo(Q,Rest(R))
where Head(∗) is the first point of ∗, e.g., Head(Q) = q1, and Rest(∗) indicates the rest part of ∗ after remov- ing the first point, e.g., Rest(Q) = {q2, q3, · · · , qm}. The idea of Equation 2 is to define the maximal solution to Simo(Q,R) recursively in terms of the maximal solutions to subproblems: Simo(Rest(Q), R) and Simo(Q, Rest(R)). Therefore, once H ead(Q) and H ead(R) match, we sum up e−Diste(Head(Q),Head(R)) to the similarity and shift to the matching of the rest of Q by calling Simo(Rest(Q),R). In
Simo(Q, R) = max
Notation
Table 1: A list of notations
Explanation
The number of all the trajectory points The number of query locations
The Euclidean distance between qi, pj
simple k-NN based method that utilizes the R-tree for the search of k-BCT, followed by an in-depth study on effective adaption and further optimization. Table 1 shows a partial list of symbols used in this paper.
4.1 The IKNN Algorithm
By indexing trajectory points in an R-tree, we are able to efficiently find the closest trajectory point with respect to a query location by using the k-Nearest Neighbor (k-NN) search [19, 14]. Assume there is a query Q comprising of m locations {q1,q2,··· ,qm} without an order constraint, we firstly retrieve the λ-NN of each query location (λ > 0):
N
m
Diste(qi,pj)
Distq(qi,R) The shortest distance from qi to R
C
c
ǫ
nb(ǫ, ‘shape’)
D2, D0
r σj
Distlast MBR1
ρ ξ(qi) μ, ν
The candidate set
The fanout of the R-tree
The radius of a ‘shape’
The average number of neighbors within a region of ‘shape’ and radius ǫ
The Fractal Dimensions, D2 = D0 = 2 for a uniform distribution
The radius of a λ-NN search region The average side of MBR at level j
The distance of the previous λth NN The first visited MBR that contains not less than λ points in a depth-first search The density of trajectory points
The contribution of qi to upper bound The weights for optimization
λ-NN(q1) = {p1,p21,··· ,pλ1} λ-NN(q2) = {p1,p2,··· ,pλ}
222 ···12λ
λ-NN(qm) = {pm,pm,··· ,pm}
this case, Head(R) is still retained for the next round of comparison as a trajectory point may be matched with more than one query locations. Besides that, we can also skip tra- jectory points by calling Simo(Q, Rest(R)). Essentially this is a dynamic programming way to figure out the similarity, while keeping the matched trajectory points in the same or- der as the query locations. Obviously, Equation 2 combines the merits of DTW [22] which can repeat some trajectory points, and LCSS [21] that can skip un-matched trajectory points including outliers in the matching process. Formally, we define the k-BCT query as below:
Definition 1. (k-BCT query) Given a set of trajecto-
ries T = {R1,R2,··· ,Rn} (n ≥ k) , a set of locations Q, and
The set of scanned trajectories that contain at least one point in λ-NN(qi) then form a candidate set Ci for the k- BCT results. Note that the cardinality |Ci| ≤ λ, as there may be several λ-NN points belong to the same trajectory. By merging the candidate sets generated by all the λ-NN(qi), we get totally f different trajectories as candidates:
C = C1 ∪ C2 ∪ · · · ∪ Cm = {R1 , R2 , · · · , Rf }
For each trajectory Rx (x ∈ [1, f]) within C, it must contain at least one point whose distance to the corresponding query location is determined. For example, if Rx ∈ Ci (Ci ⊆ C), then the λ-NN of qi must include at least one point on Rx, and the shortest distance from Rx to qi is known. Therefore, at least one matched pair of points between Rx and some qi is discovered, and then a lower bound LB of similarity for each candidate Rx (x ∈ [1,f]) can thereafter be computed by using the found matched pairs:
a corresponding similarity function, the k-BCT query is to LB(Rx) = ′ii
find the k trajectories T from T, such that: Similarity(Q,Ri)Ri∈T′ ≥ Similarity(Q,Rj)Rj∈T−T′
where Similarity(Q, Ri) = Sim(Q, Ri) if no order is speci- fied for Q, otherwise Similarity(Q, Ri) = Simo(Q, Ri).
4. QUERY PROCESSING
To answer the k-BCT query, we take advantage of the ob- servation that the number of query locations is usually small. Thus we are able to adopt a spatial index for the search of close-by trajectories for each query location separately, and then merge the results for the exact k-BCT. The total cost of this method is expected to be the cost of searching over the index multiplying the number of query locations, which is a small constant. Here, since knowing the closest points on a trajectory to the query locations is sufficient to estimate the lower bound and upper bound of the proposed trajec- tory similarity (Equation 1 & 2) for pruning, we adopt the commonly used R-tree [13] index for the search of closest trajectory points. With points of all the database trajecto- ries indexed by one single R-tree, once we find the nearest point w.r.t. a query location, the trajectory that contains this point must be the closest trajectory to the query lo- cation. Besides, to support the whole trajectory retrieval, points from the same trajectory are further connected by a double linked list. In the following, we firstly introduce a
X
max {e−Diste(q ,pj )}! j∈[1,λ]∧pj∈Rx
i∈[1,m]∧Rx ∈Ci
Here {qi|i ∈ [1, m] ∧ Rx ∈ Ci} denotes the set of query lo-
(3) cations that have already been matched with some point on
Rx, and the pji which achieves the maximum e−Diste(qi,pji) with respect to qi is the point on Rx that is closest to qi, i.e., maxj∈[1,λ]∧pji∈Rx{e−Diste(qi,pji)} = e−Distq(qi,Rx). So LB(Rx) = Pi∈[1,m]∧Rx∈Ci(e−Distq(qi,Rx)) and obviously it is no larger than Pmi=1 e−Distq(qi,Rx), because it only takes those matched pairs found so far into account. Thus LB(Rx) must lowerbound the exact similarity Sim(Q, Rx) defined in Equation 1. On the other hand, if Rx ∈/ Ci, then none of it’s trajectory points has been scanned by λ-NN(qi) yet.
For those trajectories that are not contained in C, they have not been scanned by any of the λ-NN search, and any point on them must have a distance to qi no less than the distance of the λth NN of qi (i.e. Diste (qi , pλi )). Therefore, we can determine an upper bound UBn of similarity for all the non-scanned trajectories as follows:
Xm λ
UBn = e−Diste(qi,pi ) (4)
i=1
By using these two bounds, we can set up a pruning mech- anism for the k-BCT query to avoid scanning the whole trajectory database and thus restrict the search space.
i
Theorem 1. For a k-BCT query without an order con- ′
< qi , closestP oint >, which is the same as Equation 3. Oth- erwise, for a qi that λ-NN(qi) has not covered any point on Rx (i.e. Rx ∈/ Ci), we consider the current λth NN of qi (i.e. pλi ) that must be closer than the matched point, and accu- mulate the contribution of the < qi,pλi > pair to UB(Rx), which is similar to Equation 4. Straightforwardly, we have:
Sim(Q, Rx) − UB(Rx) =
Pm e−Distq(qi,Rx) − P (e−Distq(qi,Rx))
i=1 i∈[1,m]∧Rx ∈Ci
− Pi∈[1,m]∧Rx∈/Ci (e−Diste(qi,pλi )) =Pi∈[1,m]∧Rx∈/Ci(e−Distq(qi,Rx) −e−Diste(qi,pλi )) ≤0
Thus, for any candidate Rx within C, we have Sim(Q, Rx) ≤ UB(Rx). Algorithm 2 shows the refinement process.
straint, if we can get a subset of k trajectories C from the candidate set C after searching the λ-NN of each query lo- cation, and we have minRx∈C′ {LB(Rx)} ≥ UBn, then the k best-connected trajectories must be included in C.
′
Proof. For any Rx ∈ C , Sim(Q,Rx) ≥ LB(Rx), while for any Ry ∈/ C (i.e. Ry ∈ C), UBn ≥ Sim(Q,Ry). When
Theorem 1 is satisfied, which is min
′ {LB(Rx)} ≥ UBn,
Rx ∈C
we can conclude that ∀Rx∀Ry(Rx ∈ C′ ∧ Ry ∈ C) →
(Sim(Q,Rx) ≥ Sim(Q,Ry)). Therefore, no k-BCT result can be from C, and they must be from C.
′
Notice that C is not necessarily to be the k-BCT, and we can only guarantee the k -BCT results are included in C . Apparently, Theorem 1 provides a way to decide when the k-BCT results are found during the λ-NN search.
Here, one critical question is that what λ we should choose in the search to guarantee that the k-BCT results are con- tained in the candidate set. If λ is set to be a very large value, it is probably that the k-BCT results will all be re- trieved, but the search space will be huge as well. Neverthe- less, a smaller λ may not be enough to make sure the k-BCT results are included in C and thus leads to false dismissal. Instead of choosing a fixed λ, we devise an Incremental k-NN based Algorithm (IKNN) for efficient retrieval of the candi- date trajectories in a filter-and-refine fasion. The basic idea is to search for the λ-NN of each query location first (initially λ can be any positive integer, e.g., λ = k). If the generated candidate set C does not satisfy Theorem 1, we increase the search region by some ∆ and search for the (λ+∆)-NN. This process keeps going until the k-BCT is found. Algorithm 1 shows the details.
In each round of the IKNN (the while loop), we first of all retrieve the λ-NN of qi by KNN(qi, λ) at line 7, and then construct the candidate set C through line 8 and 9. If we have got enough candidates, then the lower bounds LB[] for all candidates and the upper bound UBn defined in Formula 3 and 4 respectively are computed at line 11 and 12. The k maximal lower bounds k-LB[] ⊂ LB[] are updated at line 13. If Theorem 1 is satisfied at line 14, which means the k- BCT has already been included in C and all the non-scanned trajectories beyond C can be safely filtered out, then the refinement procedure is triggered to examine the exact sim- ilarity of candidate trajectories at line 15, after which the k best-connected trajectories are returned. Otherwise we increase λ by ∆ and go to the next round.
The refine() function in Algorithm 1 performs a refine- ment step and it further prunes un-qualified candidates from C to get the final k-BCT results. To achieve this, we further define an upper bound UB of similarity for candidate tra- jectories in C only, following the same rationale of Equation 3 and 4.
1 2 3 4 5 6 7 8
9
10
11
12
13
14
15
16
17
1 2 3 4 5 6 7 8 9
10
11 12
Algorithm 1: IKNN()
input :k,Q
output: k-BCT
Candidate Set C; Upperbound UBn; Lowerbounds LB[], k-LB[]; Integerλ←k;
while true do
foreachqi ∈Qfromq1 toqm do
λ-NN(qi) ← KNN(qi, λ);
Ci ← trajectories scanned by λ-NN(qi);
C ← C1 ∪ C2 · · · ∪ Cm; if |C| ≥ k then
compute LB[] for all trajectories in C; compute UBn;
k-LB[] ← LB[].topK();
if k-LB[].min ≥ UBn then
k-BCT ← refine(C); return k-BCT;
λ ← λ + ∆;
Algorithm 2: refine(Candidate Set C)
k-BCT ← SortedList(k);
compute UB for each candidate in C;
sort candidates in C by UB in descending order; forx=1to|C|do
compute Sim(Q, Rx) by traversing Rx;
if x ≤ k then k-BCT.insert(Rx, Sim(Q,Rx)); else
if Sim(Q, Rx) > k-BCT.min then k-BCT.removeLast(); k-BCT.insert(Rx, Sim(Q, Rx));
if x = |C| or k-BCT.min ≥ UB(Rx+1) then return k-BCT;
UB(Rx) = X i∈[1,m]∧Rx∈Ci
+ X
i ∈ [ 1 , m ] ∧ R x ∈/ C i
{e−Diste(qi,pji )}) (e−Diste(qi,pλi ))
In Algorithm 2, the k-BCT maintains a list of k best-connected trajectories found so far, together with the exact similarity. As the algorithm examines the candidates from C in the de- scending order of UB, once the current minimum similarity of the k-BCT is even larger than or equal to the UB of the next candidate Rx+1 (line 11), the k-BCT is safely returned as the final result. Here the exact similarity of a trajectory is computed by traversing all the points on it through the double linked list (line 5), and the k-BCT is updated at line 8-10 whenever a more similar trajectory is discovered.
( max
(5)
j j∈[1,λ]∧pi ∈Rx
In Equation 5, Rx ∈ C = {C1 ∪C2 ···∪Cm}. For a query location within {qi|i ∈ [1, m] ∧ Rx ∈ Ci}, the closest point on Rx to qi is found by the λ-NN(qi) search, and thus we accumulate to UB(Rx) the contribution of the matched pair
4.2 Adaption of the λ-NN Algorithm
The λ-NN search is a major component of the IKNN as it
determines the efficiency of retrieving the λ nearest trajec- tory points. Basically, we have two options in designing the λ-NN function, i.e., KNN(qi, λ), invoked at line 7 of the Al- gorithm 1. The first one is the best-first k-NN algorithm in [14], and the second choice is the depth-first k-NN approach in [19]. They are tree traversal methods and apply to R-tree index. The best-first approach, although provides an I/O op- timal solution, has a risk in huge memory usage, while the depth-first one does have a guarantee of low memory usage but is sub-optimal in query processing. In the following, we elaborate how to adapt them to the IKNN separately, and how to significantly improve the performance of the depth- first strategy. Specifically, we use existing cost models [3, 15] to estimate the cost of the best-first strategy, and develop a new upper bound of cost for the depth-first strategy.
4.2.1 The Best-First Strategy
The best-first approach in [14], maintains a priority queue to store all the R-tree entries that have yet to be visited, using the MINDIST from the query location to an MBR as a key. Initially, the queue only contains the entries of the root. When deciding which entry to traverse next, it picks the entry with the least MINDIST (i.e. head of the queue). After the selected entry is visited, it is removed from the queue and it’s children entries are subsequently put on the queue. In such a manner, an entry is not examined until it reaches the head of the queue. Therefore, the first data point that comes to the head must be the 1st nearest neighbor since all entries closer to the query location have already been examined, and then the data point is dequeued. As this process proceeds, the next nearest neighbor is reported incrementally whenever it reaches the head of the queue.
Adopting this algorithm, we can simply record the pri- ority queue of each λ-NN search, and resume the search at a later time for the (λ + ∆)-NN by using the previ- ous priority queue. Thus, each time the IKNN further re- quires ∆ more trajectory points because the Theorem 1 has not been met, the KNN(qi,λ) function in Algorithm 1 can restart the traversal from the previous queue to get the (λ + 1)th,(λ + 2)th,··· ,(λ + ∆)th nearest points. The advantage is that there is no revisit to any processed entry and each λ-NN point is examined only once. We call the IKNN using this best-first strategy the IKNNbf Algorithm.
The complexity of the IKNNbf is obviously the cost of the best-first λ-NN search multiplying a constant m. Here we borrow some formulas from existing work to show the R-tree node access when a k-BCT query is returned. Specifically, we estimate the total leaf access when the m query locations all involve a λ-NN search. Firstly, given N points in an E- dimensional unit space, the average number of neighbors nb(ǫ,‘shape’) of a point within a region of regular ‘shape’ and radius ǫ is given by [3]:
and vol(r, ) = (2r)2, Equation 6 is transformed to:
πr2 D2 nb(r,‘circle’)=(4r2 ) 2
= (N − 1)(r√
D2
×(N−1)×(2r)
(7)
π)D2
By substituting nb(r,‘circle’) with λ and some simple alge- braic manipulations, we get the radius r of a λ-NN search region in Equation 8, where r is actually the distance from the λth NN to the query location.
1rλ
r=√ D2 (8)
For a best-first λ-NN search, only MBRs that intersected by the circle of radius r are visited. Hence it turns into a specific range query problem asking the number of node accesses w.r.t a search circle of radius r. To estimate it, we adopt the formulas in [15]. Firstly, assume the average side of MBR at level j of the R-tree is σj , and σj is given using the Hausdorff Dimension D0 [15].
σj =(
N
)D0 (9)
π N−1
ch−j 1
where h = logc(N) is the height of the R-tree and c is the fanout. Then the average node access P(λ) of R-tree for answering a range query with radius r is given by [15]:
X (N −1)(σj +4rσj +πr ) 2 +1
P (λ) = ch−j (10)
h−1 2 2 D2
j=0
The idea of Equation 10 is to sum up the access probability of every single node from the root (j = 0) to the leaf level (j = h − 1). For simplicity, we just take the node access Pleaf (λ) of leaves into consideration (j = h − 1 only), and assume a uniform distribution of trajectory points (i.e. D0 = D2 = 2, σh−1 = pc/N ). Therefore, by simplifying Equation 10, we get the leaf access of a best-first λ-NN search:
2 2 D2
Pleaf(λ)= (N −1)(σh−1 +4rσh−1 +πr ) 2 +1 (11)
c
N − 1 r λ(N − 1) λ + 1
= N +4 πcN + c ≈1+4rλ +λ+1=O(√λ+λ)
„
vol(ǫ, ‘shape’) nb(ǫ, ‘shape’) = vol(ǫ, )
«D2 E
× (N − 1) × (2ǫ)
D2
(6)
λ + λ) (12) priority queue used by the IKNNbf could potentially be-
πc c
For the IKNNbf algorithm, there is no overlap between the search region of the λ-NN and the region of the (λ + ∆)-NN. Therefore, the total leaf access for IKNNbf is:
√
LeafAccessbf = m × O(
However, one major concern here is that the size of the
D2 is the Correlation Fractal Dimension [3] and vol(ǫ, ‘shape’) indicates the volume of a shape of radius ǫ. In a two- dimensional space (E = 2), vol(ǫ,) is the volume of a square, and firstly we want to estimate the radius r of a circle that encloses λ trajectory points. Since vol(r, ‘circle’) = πr2
come very large. In an extreme case, it has to keep all the entries of the R-tree in the queue [14], which could easily oc- cupy the main memory and possibly bring down the system. Consequently, no guarantee of memory usage is given by the IKNNbf , and that is why we introduce the second option: the depth-first strategy, which can theoretically guarantee the usage of memory.
4.2.2 The Depth-First Strategy
In the depth-first k-NN algorithm [19], the basic idea is to recursively traverse the R-tree level by level in a depth-first manner, while maintaining a global list of k nearest candi- dates. Therefore, starting from the root of the R-tree, the entries of an internal node (organized as a BranchList) are sorted according to the MINDIST from the query location, and the entry with the smallest MINDIST is visited first. This downward process repeats recursively until reaching a leaf node where a potential nearest neighbor may be found. During the backtracking to upper levels, it only visits the entries whose MINDIST is smaller than the distance of the kth nearest candidate in the global list found so far. With this method, we need a recursion stack to keep track of what entries have yet to be visited. There are totally logc(N) lev- els, and the BranchList at each level has a size of c, where c is the fanout of the R-tree, so the size of the stack is as- certained to be O(clogc(N)). Here, we mention the IKNN using this depth-first strategy the IKNNdf Algorithm.
Nevertheless, the price paid for fixed memory usage by
this method is that we can no longer simply record the state
of the traversal where the λ-NN search stopped previously
and resume it later for the (λ + ∆)-NN, because some valid
candidates for the (λ+∆)-NN may have been skipped during
the previous search. As a result, we have to restart the
search from the very beginning and the previous λ-NN points
will be re-scanned by the (λ + ∆)-NN search. Assume λ is
initially k and ∆ = k as well for ease of proof. When a
depth-first λ-NN search returns, each of the 1st , 2nd , · · · , kth
nearest points has been visited by λ times, and each of the k
by using the MAXDIST that is defined as the distance from the query location qi to the furthest position on a MBR as illustrated in Figure 3.
MAXDIST(qi,MBR) =
max {Diste(qi,pj)} (13) pj ∈MBR
By using MAXDIST, when the depth-first algorithm is traversing the R-tree for the (λ + ∆)-NN, all encountered MBRs whose MAXDIST is smaller than the distance Distlast of the λth NN found in the previous round can be safely pruned, since all the points they contain must have been processed already. As exemplified in Figure 3, MBR1 and MBR2 can be skipped since they are totally covered by the circle with center qi and radius Distlast. However, MBRs intersect with the circle partially, e.g., MBR3 and MBR4, still need to be processed even though they were visited be- fore. The modified depth-first traversal method is shown in Algorithm 3 which is simplified from the one in [19] for an easier illustration. Here the only change is at line 6-7, where it skips the MBR contained by the circle of radius Distlast.
MBR1
Distlast
MAXDIST
MBR5
qi
MBR3
MBR2
MBR4
(k+1)th, (k+2)th, · · · , (2k)th nearest points has been visited by λ − 1 times, and so on. Thus, the total number of visits
1 2 3 4 5 6 7
8 9
10
Figure 3: Pruning by MAXDIST Algorithm 3: DepthTraversal(Node)
if Node = Leaf then update result; else
BranchList ← Node.entries; sort BranchList by MINDIST; for each e ∈ BranchList do
if MAXDIST(qi, e.MBR) < Distlast then skip e;
else if MINDIST(qi, e.MBR) < result.max then DepthTraversal(e);
else break;
k
Vsum of all the λ-NN points is:
λ
Vsum = P k (k × i) = k + 2k + 3k + · · · + λ k
i=1 k =k(λ +(λ)2)=O(λ2)
2kkk
In the case that λ ≫ k, the performance degrades dramati-
cally, and indeed this phenomenon happens frequently in a large dataset which is point-intensive as we need to search for a comparatively large λ before determining the k-BCT. In contrast, the Vsum of the best-first strategy is always λ.
Intuitively if we can guess correctly how large λ will even- tually be, this depth-first algorithm can go directly for the λ- NN without any repetition of search. However, this is nearly impossible even though the trajectory distribution is known in advance. To save cost, we suppose to reduce the number of repetitions. Here, instead of increasing λ by a constant ∆, we double λ at each round, that is, λ = k × 2round, with round = 0,1,2,···, and hence ∆ = k×2round. By doing so, the Vsum of all the λ-NN points is reduced to:
To estimate the leaf access of the IKNNdf , however, we can not use Equation 11 directly, since in a depth-first algorithm probably some leaf nodes that are further than the λth NN will be visited as well. Due to the difficulty in estimating it’s exact cost [4], here we just derive an upper bound of leaf access. Note that an upper bound for 1-NN search is given in [17], but it is not applicable for the λ-NN case.
In our derivation, denote by MBR1 the first visited MBR (at any level) that contains not less than λ points during the depth-first traversal (MBR1 is not necessarily to center at the query location). Obviously, a full global list with λ candidates is generated after visiting MBR1, and any MBR whose MINDIST is larger than the distance of the furthest point in the global list will be pruned. Assume rmax is the distance from the furthest point to the query location as illustrated in Figure 4. Only MBRs intersected by the circle with center qi and radius rmax can be visited after MBR1 is examined. Therefore, we can estimate an upper bound of the exact leaf access by using the leaf access w.r.t. the search circle of radius rmax. Then Equation 11 can be used directly for calculation by setting r = rmax.
2k Vsum =Plog (λ)(k×2)=k(2 +2 +···+2 λ )
This is similar to allocating space for dynamic tables. Theoretically, the Vsum drops from quadratic to linear, al- though λ may be over larger than necessary. As a conse- quence, now we are able to use the depth-first λ-NN algo- rithm directly for the IKNNdf with the line 17 in Algorithm 1 changed to ‘λ ← λ×2’. Although it is a very simply modi- fication, it brings down Vsum to the same order of magnitude as that of the best-first strategy.
Practically, we can further improve the search efficiency
2k i 01 log()
i=0
= 2λ − k = O(λ)
MAXDIST
furthest point rmax
qi MINDIST
also see that both the best-first and depth-first λ-NN search involve the same order of magnitude of leaf access.
For the IKNNdf strategy, the (λ + ∆)-NN will re-scan the search region of the λ-NN more or less even though MAXDIST is used for pruning. However, the order of complexity is ex- pected to be unchanged, if we double λ at each round.
r2 r1
MBR1
Figure 4: Estimating MAXDIST(qi,MBR1)
Although rmax is not easy to be determined, we can up- perbound rmax by rmax ≤MAXDIST(qi, MBR1) because the furthest point is enclosed by MBR1. Thus, now the problem is how to estimate MAXDIST(qi,MBR1). As shown in Fig- ure 4, denote by r1 the distance between the closest position and the furthest position to qi on the boundary of MBR1, and by r2 the length of the diagonal of MBR1 (obviously r2 ≥ r1). According to triangle inequality, we have:
L e a f A c c e s s df
Plog (λ)
≤ m × 2 k P l e a f ( k × 2 i )
i=0
= m × O(√λ + λ)
(21)
MAXDIST(qi , MBR1 ) ≤ r1 + MINDIST(qi , MBR1 ) ≤ r2 + MINDIST(qi, MBR1)
(14)
As a result, we prove that the cost of the IKNNdf and IKNNbf tend to be similar through doubling λ, while the IKNNdf also avoids potential high memory usage that can not be guaranteed by the IKNNbf. However, it does not mean that the IKNNbf is worse. In a normal case, the prior- ity queue can still be accommodated in main memory easily and typically the IKNNbf is faster than the IKNNdf as it involves no repetition.
4.3 Optimization
In the previous setting of the IKNN algorithm, query lo- cations are treated equally by increasing λ by the same ∆ for each λ-NN(qi). However, note that not all the query lo- cations are of equal importance as different λ-NN(qi) prob- ably have different contributions in constructing the can- didate set and determining lower/upper bounds of similar- ity. For instance, given two query locations qi and qj, if Diste(qi,pλi ) > Diste(qj,pλj ) for the same λ, which means the
We further notice that:
MINDIST(qi, MBR1) ≤ Diste(qi, 1stNN) (15)
The proof of Inequation 15 is as follows: If the 1st nearest
point of qi locates within MBR1, then MINDIST(qi,MBR1) ≤
Diste(qi,1stNN) as MINDIST is the shortest distance to qi.
Otherwise, if the 1st NN is enclosed by some other MBR′
at the same level, then we must have MINDIST(qi , MBR′ ) ≤
Diste(qi,1stNN). Since MBR1 is the first visited MBR that
has the minimum MINDIST among all the MBRs at it’s
λ
th −Diste(q ,pλ)
NN of qi is farther, then e i i is smaller than
level, we have MINDIST(qi , MBR1 ) ≤ MINDIST(qi , MBR Diste(qi, 1stNN). Therefore, Inequation 15 is proved.
′
) ≤ Combining Formula 14 and Formula 15, MAXDIST(qi , MBR1 )
−Diste(q ,pλ) −Diste(q ,pλ) −Diste(q ,pλ)
e jj.SinceUBn=e 11+e 22+
···+e−Diste(qm,pλm) according to Formula 4, we may say the λth NN of qi helps more than the λth NN of qj does in low- ering the UBn. Apparently, the lower the UBn is, the easier Theorem 1 would be satisfied, and thus the IKNN would re- turn results more quickly. In the following, we analyze how the contribution of qi is affected by λ. Firstly, we define the contribution of qi to UBn by ξ(qi):
ξ(qi) = e−Diste(qi,pλi )
Obviously the smaller contribution q1 to qm holds, the smaller UBn will be. Denote the density of trajectory points by ρ and the radius of a λ-NN search by r = Diste(qi, pλi ). Within the region of a λ-NN search, we roughly estimate ρ as
is further upperbounded as follows:
MAXDIST(qi, MBR1) ≤ r2 + Diste(qi, 1stNN) (16)
In the following, we reckon r2 and the distance of the 1st NN. Considering Equation 8, we set λ = 1 and get:
1r1 Diste(qi,1stNN) = √ D2
For estimating r2, again, we use Equation 6 by setting ‘shape = MBR’. Here, we take the same assumption from [17] that all MBRs are squares with side 2ǫ. Let nb(ǫ, ‘MBR’) = λ. The radius of a MBR that contains λ points is estimated as:
π N−1
(17)
1rλ ρ=λ ǫ= D2 (18) πr2
√2N−1s
We have r2 = 2 2ǫ, D2 = 2 and consequently:
r2λ 1 MAXDIST(qi, MBR1) ≤ N − 1 + π(N − 1) (19)
Let r = MAXDIST(qi , MBR1 ) in Equation 11. Through similar algebraic manipulations, we finally estimate an upper bound of leaf access for the depth-first λ-NN search by:
Easily, we have r = q λ and ξ(qi) is rewritten as πρ
πρ ξ(qi)=e−r =e−q λ
Here, our target is to figure out how fast ξ(qi) decays as λ
increases, and then assign different ∆ for q1 to qm according
to it’s decay rate when proceeding to the (λ+∆)-NN search.
We take the derivative dξ of ξ(qi) as the decay rate Dec(qi): dλ
qq dξd−qλ11−qλ π)2λ+2πλ+2+ 1+1 = e πρ =− (πρλ)−2 ×e πρ (22)
≈ (4 + 2√
√ c c πc (20) dλdλ 2
Pleaf (λ)
In a depth-first traversal, the actual leaf access should be smaller than the Pleaf (λ) in Equation 20, since not all the leaves within range r need to be visited with the global list being updated during the search. From this analysis, we can
=O( λ+λ) Each time of calculating dξ , again, we approximate ρ in
dλ
Equation 22 by πr2 using the current λ and r. Thus, the
current decay rate is approximated by:
Dec(qi)=|dξ|= r e−r (23) dλ 2λ
λ
We can see that, given a fixed λ, the decay rate rises first
and then drops gradually as r grows from 0 to ∞. Thus, in
the beginning we should assign a higher priority in exploring
a query location with a sparser distribution of trajectories
(i.e. with a larger r). However, after r reaches some value,
instead, a denser distribution (i.e. with a comparatively
smaller r) generates a larger decay rate. Therefore, more
effort should be made for query locations with dense trajec-
tories around. By doing so, we can decrease the UBn more
efficiently. Nevertheless, when r and λ are all large enough,
a further search for more trajectory points doesn’t help any
more in reducing UBn because dξ is already close to 0. dλ
Another potential way to accelerate the IKNN algorithm is to increase the lower bound LB of candidates as fast as possible, because larger lower bounds also make Theorem 1 easier to be satisfied. However, the lower bound of a tra- jectory is derived from the search results of more than one query locations, and it is not easy to predict when and where a λ-NN(qi) will meet with a point of a given trajectory. Con- sequently, regarding LB, we face difficulties in quantitatively estimating the contribution of a query location. As an al- ternative solution, we define the retrieval ratio Rat(qi) as a search heuristic, based on the number of new trajectories found by the λ-NN(qi) at each round.
N um(qi )
Rat(qi) = λ − λ′ (24)
where λ′ indicates the λ of the previous round, and Num(qi) is the number of new trajectories found by the λ-NN. The idea is that the more new trajectories are retrieved, the larger candidate set C will be, and it is likely to discover more points on the k best-connected trajectories. Thus the lower bounds probably rise more quickly.
By considering both the decay rate and retrieval ratio, we increase ∆ for different query locations accordingly.
by repeating/skipping some pi in order to best-match with the query, and finally figures out Simo(Q,R).
1 2 3 4 5 6
7
8
9
10
Algorithm 4: DP(Q,R)
Matrix M[i,j];
∀i∈ [1, m], M[i,0] ← 0; ∀j∈[1,l],M[0,j]←0; fori=1tomdo
for j = 1 to l do
if e−Diste(Head(Q),Head(R))+M[i-1,j] > M[i,j-1]
then
// match qi with pj and repeat pj M[i,j] ← e−Diste(Head(Q),Head(R))+M[i − 1,j];
else
// skip pj
M[i,j] ← M[i,j − 1];
return M[m,l];
„ Dec(qi) Rat(qi) « ∆(qi)=δ μPmi=1Dec(qi)+νPmi=1Rat(qi)
(25)
Here M[i,j] records the similarity of a subproblem, i.e., Simo({q1,··· ,qi},{p1,··· ,pj}). Once we got M[i − 1,j] and M[i,j − 1], we compare e−Diste(Head(Q),Head(R))+M[i − 1, j] with M[i, j − 1] at line 6 and take the maximum as M[i,j]. If the former one is larger, we say qi is matched with pj and accumulate e−Diste(Head(Q),Head(R)) to Simo, otherwise pj is skipped. In such a bottom-up manner, Algorithm 4 figuresoutM[i,j]withigoesfrom1tomandjfrom1tol, where m is number of query locations and l is the number of trajectory points. Finally the M[m,l] returned by DP (Q, R) at line 10 is the exact Simo(Q,R). The complexity of the algorithm is O(l · m) and it tends to be linear since m is a small constant.
Now the problem is how to adapt the IKNN algorithm to find the k-BCT with respect to Simo, observing that the lower bound LB and upper bound UB are not applicable any more as they are designed for Sim. Firstly, for a candidate trajectory Rx ∈ C generated by the IKNN, some of it’s trajectory points are scanned by the λ-NN search. Denote the set of scanned points on Rx by Rx′ , and we get:
In Equation 25, μ and ν are weights, and δ = m·k·2round for the IKNNdf, while δ is equal to m multiplying some constant ∆ for the IKNNbf . As a result, instead of increasing λ for all query locations equally at line 17 of Algorithm 1, the optimized IKNN determines λ for each qi separately according to Equation 25 (i.e. λ(qi ) = λ(qi ) + ∆(qi )), and therefore more effort is put on exploring important query locations which may potentially accelerates the searching of the k-BCT. Note that the optimized IKNN explores totally Pmi=1 ∆(qi) NN points at each round, and we have:
= δ · (μ + ν)
To guarantee the IKNN searches for a certain number of pointsperround,wesetμ+ν=1(e.g. μ=ν=0.5),so the total number of new points got per round is always δ.
4.4 Extension to Queries with an Order
Provided the query is with an order constraint, as dis- cussed in Section 3, we further consider the visiting order in the matching between query locations and trajectories. In this case, the similarity is measured by Simo as formulated in Equation 2. Typically, Simo can be solved by a Dy- namic Programming (DP) paradigm as shown in Algorithm 4, which conducts a shifting on trajectory R = {p1, p2, . . . , pl}
′
Rx ={pi|pi ∈Rx ∧pi ∈S}
Pm “ Pm Dec(qi) Pm Rat(qi) ” i=1 i=1
where S = λ-NN(q1)∪λ-NN(q2)∪···∪λ-NN(qm). Actually Rx is a sub-trajectory that includes only a subset of points on Rx . Let Rx′ still follow the order of Rx . we have:
′
Simo(Q, Rx) ≥ Simo(Q, Rx) (26) The proof of Inequation 26 is as follows: Suppose on the
′
contrary there exists a matching between Q and Rx with Simo(Q, R′x) > Simo(Q, Rx). Assume the matched pairs be- tween Q and R′x are {< q1,p(w1) >,< q2,p(w2) >,··· ,< qm, p(wm) >} where {p(w1), p(w2), · · · , p(wm)} are the matched points on Rx′ , and Simo(Q,Rx′ ) = Pmi=1 e−Diste(qi,p(wi)). Since R′x ⊆ Rx, {p(w1), p(w2), · · · , p(wm)} must be contained in Rx. Therefore the above matching can be applied on Rx as well and consequently Simo(Q,Rx) should be at least Simo(Q,R′x), which is a contradiction with the assumption.
Based on Inequation 26, we define a new lower bound LBo of similarity for ordered query locations by using the partially retrieved trajectory points of Rx.
′′
LBo(Rx) = Simo(Q, Rx) = DP (Q, Rx) (27)
∆(qi)=δ μ m
i=1 Pi=1 Dec(qi)
+ν m
Pi=1 Rat(qi)
′
where DP (Q, R′x) is calculated using Algorithm 4. For the refinement step, similar to the definition of UB in Equation 5, we define a new upper bound UBo for candidate trajec- tories within the candidate set C only.
5.1 Different Number of Query Locations
The number of query locations is a critical assumption in our applications, in which we assume the number is small as it’s impractical to input tens of locations for a query and importantly a few locations are enough to search a trajec- tory around some places of interest. In this part, we fix k to 15 and compare the IKNN variants as the number of query locations ranges from 2 to 10. In effect, the algorithm works without any problem even though the number of query lo- cations goes up to 100 or more, but in this case the problem itself is no longer very meaningful in our applications.
5.1.1 Best-first vs. Depth-first
As shown in Figure 6(a) and 6(b), the best-first method BF notably outperforms the depth-first method DF-C with- out any optimization. As the number of query locations goes up to 10, the Query Time and Node Access of the DF-C rise quickly to more than 800 ms and 14, 000 respectively, while the BF just raises the cost smoothly with Query Time≈ 200 ms and Node Access≈ 800. The root cause is that the DF- C involves an excessive number of re-visits to previous λ- NN search regions and consequently produces a much larger number of Node Access compared with the BF that intrinsi- cally avoids any overlap of search regions. However, the price paid by the BF to achieve high performance is high memory usage. As shown in Figure 6(c), the Queue Size of the BF is approximately 10 times larger than that of all the depth- first variants. With a very skewed dataset distribution, it is believed that much more queue capacity is required by the BF. Nevertheless, even though the Queue Size peaks to above 700, it still can be fitted in main memory easily (but a smaller number of concurrent queries). Therefore, neglect- ing very bad cases, the best-first strategy is the best choice for processing the k-BCT queries.
5.1.2 Effect of Adaption
To achieve a guarantee of low memory usage, we can con- sider the depth-first strategy. However, as observed, the DF-C doesn’t scale well and further adaption is necessary. Firstly, we study how ∆ affects the performance. As shown in Figure 6(a) and 6(b), the DF-D that doubles λ at each round presents smoother curves of the Query Time and Node Access than the DF-C does, which confirms the fact that by doubling λ, the DF-D reduces the number of rounds from O(λ/∆) to O(log2 λ), and thus the number of repetitions is reduced greatly. In essence, increasing λ exponentially is a fast way to achieve a necessarily large λ with a small number of rounds. Although DF-D still has to cover all the previous search regions, it already brings the search cost to the same order of magnitude as that of the BF. Pruning by MAXDIST is another adaption for the depth-first strategy. Again, as shown in Figure 6(a) and 6(b), DF-D-M further lowers the cost as it reduces the area of repetition remained with the DF-D. An interesting observation is that the Node Access of the DF-D-M drops slightly as the number of locations rises from 8 to 10. That is because more query locations may help in raising the lower bound of trajectories greatly in some cases and thus make Theorem 1 easier to be satis- fied. Besides, note that the Query Time of the BF is even larger than that of the DF-D-M as the BF involves frequent queue operations, which increases the cost notably when the Queue Size is large, and the BF also has to compute lower and upper bounds for many rounds (O(λ/∆)).
UBo(Rx) = LBo(Rx) + X (e−Diste(qi,pλi )) i ∈ [ 1 , m ] ∧ R x ∈/ C i
(28)
Therefore, we replace the LB in Algorithm 1 and the UB in Algorithm 2 with LBo and UBo respectively, and then the IKNN algorithm can be seamlessly adapted to k-BCT queries with an order constraint.
5. EXPERIMENTS
In this section, we conduct experiments on the Beijing dataset which consists of 12,653 GPS trajectories (1,147,116 points) collected by the Microsoft GeoLife Project [1], and the distribution of the dataset is illustrated in Figure 5. The IKNN algorithm is implemented in Java and examined on a windows platform with Intel Core 2 CPU (2.13GHz) and 1.0GB Memory. All the trajectory points are indexed by one single R-tree. In the experiments, k-BCT queries are collected manually by selecting a sequence of coordinates of places of interest that complies with a reasonable visiting order. We don’t use random generation for query locations as it may cause a sudden jump from one location to another far-away location that probably won’t happen in real life.
Figure 5: The Beijing dataset
The main metric we adopt for measuring the performance is the Query Time that reflects how fast a query is returned, and the Node Access that indicates the number of visits to R-tree nodes. Here we simply record how many times the internal nodes and leaf nodes of the R-tree are accessed dur- ing a query. Besides, we also compare the memory usage by using the Queue Size which is the number of elements of the m priority queues in the IKNNbf algorithm, or the total size of the branch lists in the IKNNdf algorithm. As this is the first work on searching trajectory by locations, we only com- pare the performance of the IKNN with different settings. The IKNNbf with the best-frist strategy is denoted by BF, while the IKNNbf with optimization is mentioned as BF-O. Similarly, the IKNNdf that uses a constant ∆ is indicated by DF-C, and DF-D is referred to as the IKNNdf that dou- bles λ at each round. DF-D-M further considers MAXDIST for pruning, and DF-D-M-O also includes the optimization mechanism. The experiment setting is as follows:
the number of k :
the number of locations :
the constant ∆ for BF and DF-C : the μ, ν for optimization :
1 to 25, default 15 2 to 10, default 8 50
0.5
Note that adjusting ∆, μ and ν also affects the perfor- mance perceptibly although the trend is still similar. The figures are not shown due to the limit of space.
1000
800
600
400
200
BF
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
16000 14000 12000 10000
8000 6000 4000 2000
1200
BF BF
2000
1500
1000
500
BF
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
20000 1200
BF BF
3000 2500 2000 1500 1000
BF
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
30000
25000
20000
15000
10000
5000
40000 35000 30000 25000 20000 15000 10000
BF
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
1200
1000
800
600
400
200
BF
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
5000
4000
3000
2000
1000
BF
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
1200
BF BF
000
2 3 4 5 6 7 8 9 10
Number of Query Points
2 3 4 5 6 7 8 9 10
Number of Query Points
(a) (b)
15000
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
1000
800
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
000
5 10 15 20 25
5 10 15 20 25
Number of K
(e)
5 10 15 20 25
Number of K
(f)
Number of K
(d)
2 3 4 5 6 7 8 9 10
Number of Query Points
(a)
2 3 4 5 6 7 8 9 10
Number of Query Points
(b)
2 3 4 5 6 7 8 9 10
Number of Query Points
(c)
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
1000
800
600
400
200
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
2 3 4 5 6 7 8 9 10
Number of Query Points
(c)
10000 600
5000
400
200
Figure 6: Performance for queries without an order
500 000
Query Time (ms)
Query Time (ms)
Query Time (ms)
Query Time (ms)
Node Access
Node Access
Node Access
Node Access
Queue Size
Queue Size
Queue Size
Queue Size
5000 000
5 10 15 20 25
Number of K
(d)
5 10 15 20 25
Number of K
(e)
Figure 7: Performance for queries with an order
5 10 15 20 25
Number of K
(f)
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
1000
800
600
400
200
BF-O DF-C DF-D
DF-D-M
DF-D-M-O
5.1.3 Effect of Optimization
The purpose of adjusting ∆ separately for different query locations by considering their importance is to put more effort on exploring more important query locations. As ex- pected, a further improvement over the DF-D-M is achieved by the DF-D-M-O. In Figure 6(b), the Node Access is further reduced by nearly 1/4 ∼ 1/2, while in Figure 6(a) the Query Time is reduced by about 1/5. For the best-first strategy, a more significant improvement is observed. The BF-O re- quires only 295 Node Access which is about 50% less than that of the BF when the number of locations is 10. Obvi- ously, the reason of improvement is that both strategies visit fewer trajectory points during the query through optimiza- tion, and from Figure 6(c) it is seen that the Queue Size of the BF-O is lowered by up to 2/3 compared with the BF.
5.2 Different Number of k
Another concern about the IKNN algorithm is how it scales with different number of k. Here, we show the results in Figure 6(d), 6(e) and 6(f) with the number of locations fixed to 8. It is observed that the BF-O exposes an amaz- ingly stable performance without any significant fluctuation in both Query Time and Node Access as k increases from 1 to 25, while the Queue Size is kept at a comparatively low number (≈ 200). Besides that, the DF-D-M and DF-D- M-O also appear to be scalable with k, and no significant rise in terms of Query Time and Node Access is shown. On the other hand, the figures of the DF-C boost drastically to more than 1.5 second and 20,000 respectively, which reveals that a pure depth-first algorithm does not work well and the adaption and optimization are very necessary.
5.3 Queries with an Order
In comparison with queries without an order, when a query is with an order constraint, more computation is needed to figure out the lower bound and the exact similarity of a trajectory by using the Algorithm 4, and furthermore a query location may not be matched with the nearest point on a trajectory any longer which consequently requires a scanning of more trajectory points to get the best match- ing. Therefore, more Node Access and Query Time are in- troduced, as confirmed in Figure 7(a), 7(b), 7(d) and 7(e) where all the variants have brought up the cost more or less. However, the trends of the figures are still similar to that of the queries without an order in Figure 6. Regarding the size of queue, it does not change perceptibly in Figure 7(c) and 7(f) compared with Figure 6(c) and 6(f).
6. CONCLUSIONS
In this paper, we study a new problem of searching the k Best-Connected Trajectories from a database by using a set of locations with or without an order constraint. Since the number of query locations is typically small, it enables us to adopt a spatial method for answering a similarity search query. We start the study based on a simple IKNN algorithm and then analyze the efficiency of different variants. As a conclusion, we would say that the BF-O achieves the best query performance although involving a risk of high mem- ory usage. The pure DF-C algorithm, although guarantees a low memory consumption, performs poorly in efficiency. Therefore, we further devise the DF-D-M and DF-D-M-O to improve the DF-C for fewer R-tree node access and shorter query time, and finally their performance are theoretically and experimentally confirmed to be close to that of the BF.
7. REFERENCES
[1] http://research.microsoft.com/en-us/projects/geolife/. [2] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient
similarity search in sequence databases. In FODO,
pages 69–84, 1993.
[3] A. Belussi and C. Faloutsos. Estimating the selectivity
of spatial queries using the ‘correlation’ fractal
dimension. In VLDB, pages 299–310, 1995.
[4] C. Bo ̈hm. A cost model for query processing in high
dimensional data spaces. TODS, 25(2):129–178, 2000. [5] Y. Cai and R. Ng. Indexing spatio-temporal
trajectories with chebyshev polynomials. In SIGMOD,
pages 599–610, 2004.
[6] K.-P. Chan and A. W.-C. Fu. Efficient time series
matching by wavelets. In ICDE, pages 126–133, 1999. [7] L. Chen and R. Ng. On the marriage of lp-norms and
edit distance. In VLDB, pages 792–803, 2004.
[8] L. Chen, M. T. O ̈ zsu, and V. Oria. Robust and fast
similarity search for moving object trajectories. In
SIGMOD, pages 491–502, 2005.
[9] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang,
and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, pages 1542–1552, 2008.
[10] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419–429, 1994.
[11] E. Frentzos, K. Gratsias, N. Pelekis, and
Y. Theodoridis. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 11(2):159–193, 2007.
[12] E. Frentzos, K. Gratsias, and Y. Theodoridis. Index-based most similar trajectory search. In ICDE, pages 816–825, 2007.
[13] A. Guttman. R-trees: a dynamic index structure for spatial searching. In SIGMOD, pages 47–57, 1984.
[14] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. TODS, 24(2):265–318, 1999.
[15] F. Korn, B.-U. Pagel, and C. Faloutsos. On the ‘dimensionality curse’ and the ’self-similarity blessing’. TKDE, 13(1):96–111, 2001.
[16] M. D. Morse and J. M. Patel. An efficient and accurate method for evaluating time series similarity. In SIGMOD, pages 569–580, 2007.
[17] A. Papadopoulos and Y. Manolopoulos. Performance of nearest neighbor queries in r-trees. In ICDT, pages 394–408, 1997.
[18] D. Pfoser, C. S. Jensen, and Y. Theodoridis. Novel approaches in query processing for moving object trajectories. In VLDB, pages 395–406, 2000.
[19] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71–79, 1995.
[20] R. Sherkat and D. Rafiei. On efficiently searching trajectories and archival data for historical similarities. PVLDB, pages 896–908, 2008.
[21] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In ICDE, pages 673–684, 2002.
[22] B.-K. Yi, H. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In ICDE, pages 201–208, 1998.