COMP6714 ASSIGNMENT 1
DUE ON 20:59 4 NOV, 2020 (WED)
Q1. (25 marks)
Some Boolean retrieval systems (e.g., Westlaw) support the proximity operator /S, which restricts the occurrences matches to be within the same sentence.
Assume that we have created an additional positional list for $, which records the posi- tions of the end of the sentences. E.g., for the document
A B C. D E.
the position list for $ is [4, 7].
You are required to engine an algorithm to support the query A /S B. To make the task
easier, we further constrain the semantics of the query to satisfy both conditions: • the occurrences of A and B must be within the same sentence.
• the occurrence of A must precede that of B.
For example, the above example document matches the query A /S C, but not C /S A.
You need to
• make simple modifications to the pseudocode shown in Algorithm 1, which is ex- actly the algorithm in Figure 2.12 in the textbook. Note that we modify the algorithm slightly so that array indexes start from 1 instead of 0. Specifically,
– you need to insert some code between Lines 6 and 7, and perform some mod- ifications to some lines afterwards.
– In your submitted algorithm pseudocode (named Q1(p1, p2, p$)), clearly mark the modifications using color or boxes.
• You can assume that there is a function skipTo(p,docID,pos), which move the cursor of list p to the first position such that (1) the position belongs to a document docID, and (2) the position is no smaller than pos.
Q2. (25 marks)
Consider the scenario of dynamic inverted index construction. Assume that t sub-indexes
(each of M pages) will be created if one chooses the no-merge strategy.
(1) Show that if the logarithmic merge strategy is used, it will result in at most ⌈log2 t⌉ sub-indexes.
(2) Prove that the total I/O cost of the logarithmic merge is O(t · M · log2 t). 1
2 DUE ON 20:59 4 NOV, 2020 (WED) Algorithm 1: PositionalIntersect(p1, p2, k)
1 2 3 4 5 6
7 8 9
10 11 12
13
14 15
16 17
18
19
20
21
22
23
24
25
answer ← ∅;
whilep1 ̸=nil∧p2 ̸=nildo
if docID(p1) = docID(p2) then l ← [ ];
pp1 ← positions(p1); pp2 ← positions(p2); while pp1 ̸= nil do
while pp2 ̸= nil do
if |pos(pp1) − pos(pp2)| ≤ k then
add(l, pos(pp2)); else
if pos(pp2) > pos(pp1) then break;
pp2 ← next(pp2);
while l ̸= [ ] ∧ |l[1] − pos(pp1)| > k do
delete(l[1]);
for each ps ∈ l do
answer ← answer ∪ [docID(p1), pos(pp1), ps]; pp1 ← next(pp1);
p1 ← next(p1); p2 ← next(p2); else
if docID(p1) < docID(p2) then p1 ← next(p1);
else
p2 ← next(p2); return answer ;
Q3. (25 marks)
After the δ encoding, the compressed non-positional inverted list is 01000101 11110001 01110000 00110000 11110110 11011
• Decode the sequence of numbers the compressed list represents. • List the document IDs in this list.
Q4. (25 marks)
Consider the WAND algorithm described in Section 2.4 in the original paper.1. There
is a typo in the algorithm in Line 21: it should use “terms[0 .. (pTerm-1)]”.
1Efficient Query Evaluation using a Two-Level Retrieval Process.
COMP6714 ASSIGNMENT 1 3
However, even with this fixed, there is a bug in the algorithm (Figure 2) in which the algorithm will end up in an infinite loop.
You need to
• Identify the single lines in Figure 2 that causes the bug and describe concisely why this will lead to a bug.
• Give a simple example illustrating this bug. You should use three terms (named A, B, C) and k = 1. Do not include unncessary entries in the lists.
ABC
⟨1, 3⟩ ⟨1, 4⟩ ⟨1, 4⟩
UB
List
You need to write your solutions to the questions in a pdf file named ass1.pdf. You must
• include your name and student ID in the file, and • the file can be opened correctly on CSE machines.
You need to show the key steps to get the full mark.
Note: Collaboration is allowed. However, each person must independently write up
his/her own solution.
You can then submit the file by give cs6714 ass1 ass1.pdf. The file size is limited
to 5MB.
Late Penalty: -10% per day for the first two days, and -20% per day for the following
days.
Submission Instructions