Better than next()
• What’s the worst case for sequential merge-based
intersection?
• {52, 1} è move k2’s cursor
– To the position whose id is at least 52 è skipTo(52)
– Essentially, asking the first i, such that K2[i] >= 52 (K2’s list
is sorted).
– Takes many sequential call of next()
– Could use binary search in the rest of the list
– Cost: ⎡log2(Nremainder) ⎤
… … … 79
52 54 56 58
1 3 5 … K2:
K1: 1
Can you use a hashtable to implement skipTo()?
skipTo(id)
• Galloping search (gambler’s strategy)
– [Stage 1] Doubling the search range until you overshoot
– [Stage 2] Perform binary search in the last range
• Performance analysis (worst case)
– Let the destination position be n positions away.
– ≈ log2 n probes in Stage 1 + ≈ log2 n probes in Stage 2
– Total = 2⎡ log2 (n+1) ⎤= O(log2 n)
9 11 13 79 1 3 5 7 K1:
52?
Inc=1 Inc=2 Inc=4
Binary
search
2
Total Cost
• Galloping search (gambler’s strategy)
– Cost of the i-th probe: ≈ 2 log2(ni)
– Total cost of K1 probes: ≈ 2 log2(∏1|K1| ni)
≤ 2 log2( ((∑1|K1| ni) / |K1|)|K1| ) ≤ 2|K1|*log2(|K2|/|K1|)
• Asymptotically, resembles linear merge when
|K2|/|K1| = O(1), resembles binary search
when |K1| = O(1)
What about list intersection using binary search?
3
Multiple Term Conjunctive
Queries
• K1 AND K2 AND … AND Kn
• SvS does not perform well if none of the
associated lists are short
• In addition, it is blocking
• Can you design non-blocking multiple
sorted array intersection algorithm?
4
Generalization
• Generalize the 2-way intersection
algorithm
• 2-way:
– {1, 2} è move k1’s cursor
– skipTo(2)
• 3-way:
– {1, 2, 3} è move k1,k2’s cursor
– skipTo(3)
1 3
2 4 6
3 9 27 81
K1:
K2:
K3:
eliminator = Max1<=i<=n(ki.cursor)
5
Optimization
• Mismatch found even before accessing K3’s
cursor
• Choice 1: continue to get
cursors of other list
• Choice 2: settle the
dispute within the first two lists è max
algorithm [Culpepper & Moffat: Efficient set intersection for
inverted indexing]
– Better locality of access è fewer cache misses
– Similar to SvS
1 3
2 4 6
3 9 27 81
K1:
K2:
K3:
6
Pseudo-Code for the Max Algorithm
(Wrong)
• Input: K1, K2, …, Kn in increasing size
x := K1[1]; startAt := 2
while x is defined do
for i = startAt to n do
y := Ki.skipTo(x)
if y > x then
x := K1.next()
if y > x then startAt := 1; x := y else startAt := 2 end if
break
elsif i = n then
Output x
x := K1.next()
end if
end for
end while
//mismatch
//match
in
all
lists
//restart_1
//restart_2
//y
=
x
//x
is
the
eliminator
7
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
Pseudo-Code for the Max Algorithm (Fixed)
• Input: K1, K2, …, Kn in increasing size
x := K1[1]; startAt := 2
while x is defined do
for i = startAt to n do
y := Ki.skipTo(x)
if y > x then
x := K1.next()
if y > x then startAt := 1; x := y else startAt := 2 end if
break
elsif i = n then
Output x
x := K1.next()
end if
end for
end while
8
The original code has a bug when in restart_1 cases
(4.1) if i = 1 then
(4.2) if y > x then
(4.3) x := y
(4.4) break
(4.5) end if
(4.6) end if
References
• J. Shane Culpepper, Alistair Moffat: Efficient
set intersection for inverted indexing. ACM
Trans. Inf. Syst. 29(1): 1 (2010)
• F.K. Hwang and S. Lin, A simple algorithm for
merging two disjoint linearly ordered sets.
SIAM J. Comput. 1 1 (1972), pp. 31–39.
• Stefan Buettcher, Charles L. A. Clarke,
Gordon V. Cormack, Information Retrieval:
Implementing and Evaluating Search
Engines, 2010 [Chapter 5] 9