CS计算机代考程序代写 crawler algorithm information retrieval Name: ,

Name: ,
(Family name)
(Given name)
Student ID:
THE UNIVERSITY OF NEW SOUTH WALES Final Exam
COMP6714
Information Retrieval and Web Search
TERM T3, 2020
• Time allowed: 10 minutes reading time + 2 hours
– Exception: students with extra exam time approved by Equitable Learning Services (ELS) can make submissions after 14:30, 2 December 2020 within their approved extra time.
• Total number of questions: 8.
• Total number of marks: 100
• Total number of pages: 8 excluding this cover page
• This is an open-book exam. You are allowed to use textbook(s), lecture notes and other
study materials. However, you are not allowed to (1) communicate with anyone else or (2)
use the Internet during the exam.
• Items allowed: UNSW approved calculators.
• You can answer the questions in any order.
• Start each question on a new page.
• Answers must be written in ink on A4 papers and scanned into a PDF file. Alternatively,
you can use any software to directly generate the answers in a PDF file.

Question 1 (10 marks) Write down the pseudo-code of answering the Boolean query not A and B. You need to
use the function skipTo(id) wherever it is possible.
You can refer to the following skeleton code. Note that it is by no means the complete pseudo-code, and you should add multiple lines or modify existing line(s) to complete this task.
Algorithm 1: Q1(p1, p2)
1 answer ← ∅;
2 while…do
3 if docID(p1) > docID(p2) then 4;
5 else
6 if docID(p1) < docID(p2) then 7; 8 else 9; 10 return answer; COMP6714 Page 1 of 8 Question 2 (14 marks) Consider applying γ-encoding to a posting list of n document IDs (within the range of [1, N ]). Prove that: • For a value x, its γ-encoded value takes at most 2 log2(x) + 1 bits. • The compressed posting list (using γ codes on the gaps) takes at most n·log 􏰃2N2 􏰄 bits. COMP6714 Page 2 of 8 2 n2 Question 3 (14 marks) (a) Consider a dictionary that contains v terms, and each term has exactly l characters. Assume we build both a permuterm index and a bi-gram index for the dictionary. What are the sizes of these two indexes (note, do not include the size of the dic- tionary and the inverted lists), respectively? You may assume that a pointer (to a term in the dictionary) is 4-bytes and all terms only contain characters from a to z. (b) Consider a query of P*Q*R, where P, Q, and R are a sequence of characters. Briefly describe how a permuterm index can be used to efficiently answer this query. (c) The above query can also be answer without using list interaction. Briefly describe how this can be done, and give a reason why this might be more efficient than the previous query processing method. COMP6714 Page 3 of 8 Question 4 (14 marks) Consider the logarithmic merge algorithm for dynamic index construction. The sub- indexes created on the disk are: I3, I2, I1, and I0 (note: I0 is the smallest sub-index on the disk). Assume the current in-memory index is full and needs to be dumped to the disk. (a) What are the sub-indexes after dumping the current in-memory index to the disk? (b) How many sub-indexes will the algorithm create to index a document collection of size |C| when the memory size is M. (c) Let |C| = 14 · M. What is the total number of times sub-indexes are merged? You need to include merges of a sub-index on the disk and the index in the memory. COMP6714 Page 4 of 8 Question 5 (10 marks) The figure below shows the output of two information retrieval systems on the same two queries in a competitive evaluation. The top 10 ranks are shown. Crosses correspond to a document which has been judged relevant by a human judge; dashes correspond to irrelevant documents. Assume that System 1 retrieved all the relevant documents for both queries. System 1: System 2: Rank Q1 Q2 1 2 3 4 5 6 7 8 9 10 X X X X - - - - X X X - - - - X - - X X Rank Q1 Q2 1 2 3 4 5 6 7 8 9 10 X X X - X X - - X - X - - X X - - - - - (a) Explain the following evaluation metrics and give results for query Q2 for both systems. 1. Precision at rank 8. 2. Recall at precision 1 . 3 (b) Give the formula for mean average precision (MAP), and calculating MAP for both systems. (c) Consider Q1 for System 1. Compute the interpolated precisions at recall levels 0.5 and 0.8, respectively. COMP6714 Page 5 of 8 Question 6 (14 marks) Consider the following documents and the ground-truth of their relevance to the query {x1,x3}. Give the order that these documents will be ordered under the BIM model with add 1 smoothing? You need to justify your answer. 2 Document Relevant? x1 x2 x3 D1 T111 D2 T010 D3 F100 D4 T001 D5 F010 COMP6714 Page 6 of 8 Question 7 (10 marks) Suppose we have a document collection with an extremely small vocabulary with only 6 words w1, w2, . . . , w6. The following table shows the estimated background language model p(w|C) using the whole collection of documents (2nd column) and the word counts for document d1 (3rd column) and d2 (4th column), where c(w,di) is the count of word w in document di. Let Q = {w1, w2, w3, w4, w5, w6} be a query. Word p(w|C) c(w,d1) c(w,d2) w1 0.800 2 7 w2 0.100 3 1 w3 0.025 1 1 w4 0.025 2 1 w5 0.025 2 0 w6 0.025 0 0 (a) Suppose we do not smooth the language model for d1 and d2. Compute the likeli- hood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute the log-likelihood. You should use the scientific notation (e.g., 0.0061 should be 6.1 × 10−3) Which document would be ranked higher? (b) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer smoothing method with λ = 0.8 (i.e., p(w|d) = λ·pmle(w|Md)+(1−λ)·pmle(w|Mc)). Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute the log-likelihood. You should use the scientific notation) Which document would be ranked higher? COMP6714 Page 7 of 8 Question 8 (14 marks) Consider the basic architecture of a crawler in the figure below. (a) Explain why the “content seen?” module is needed before the “Dup URL elim” module? (b) Assume that the search engine uses MinHash algorithm to detect near-duplicate documents. Given a document with following hashed shingles: { 1, 7, 15, 81 }, and two universal hashing functions: • h1(x)=(7x+1mod31)mod13 • h2(x)=(18x+26mod31)mod13 What are the resulting MinHash signatures of the document? COMP6714 Page 8 of 8 END OF EXAM PAPER

Related Posts