
Name: ____________________ , ____________________
      (Family name)          (Given name)

Student ID: ____________________
THE UNIVERSITY OF NEW SOUTH WALES Final Exam
COMP6714
Information Retrieval and Web Search
SESSION 2, 2011
• Time allowed: 10 minutes reading time + 3 hours
• Total number of questions: 10+1
• Total number of marks: 100+5
• Only UNSW approved calculators are allowed in this exam.
• Answer all questions.
• You can answer the questions in any order.
• Start each question on a new page.
• Answers must be written in ink.
• Answer these questions in the script book provided.
• Do not write your answer in this exam paper.
• If you use more than one script book, fill in your details on the front of each book.
• You may not take this question paper out of the exam.

Question 1 (20 marks)
Briefly answer the following questions (1-4 sentences) in your script book. Lengthy but irrelevant answers will be penalized.
(a) How does stemming typically affect recall? Why?
(b) Give at least two reasons why language identification is important when indexing documents.
(c) Why are specialized algorithms needed to construct inverted indexes for large document collections?
(d) What is the largest gap that can be encoded in 2 bytes using variable byte code? You also need to show the two encoded bytes. (See the encoding sketch after this question.)
(e) Why is cosine a better similarity metric than the inverse of Euclidean distance in the vector space model?
(f) Why is the vector space model generally considered a better retrieval model than the Boolean model?
(g) List the advantage(s) of using NDCG to evaluate Web search results over measures such as MAP.
(h) List one problem with the probabilistic ranking principle.
(i) In the early age of Web search engines (definitely pre-Google era), some systems used the term frequency weighting 2·tf / (2 + tf) to fight spam Web pages. Explain why this worked.
(j) What is a “shingle”? Briefly describe the shingling method to detect near-duplicate documents. (See the shingling sketch after this question.)
(k) List at least three requirements that complicate the design and implementation of an industrial strength crawler.
(l) Define the terms “hub” and “authority” in the context of the HITS algorithm. Can a page be both a hub and an authority page at the same time?
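For reference (related to part (d) above): a minimal Python sketch of variable byte encoding, assuming the usual textbook convention of 7 payload bits per byte with the high (continuation) bit set only on the last byte of a code.

    def vb_encode(gap):
        """Variable-byte encode a single positive gap: 7 payload bits per byte,
        high bit set only on the final byte (textbook convention)."""
        out = []
        while True:
            out.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        out[-1] += 128  # mark the last byte with the continuation bit
        return out

    # Under this convention two bytes carry 14 payload bits, so the largest
    # two-byte gap is 2**14 - 1 = 16383:
    print([format(b, "08b") for b in vb_encode(16383)])  # ['01111111', '11111111']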
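Related to part (j): a small sketch of word-level k-shingling with hashed shingles and a Jaccard overlap test; k = 4 and the two sample strings are illustrative choices only.

    def shingle_set(text, k=4):
        """Return the set of hashed k-word shingles of a document."""
        words = text.split()
        return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

    def jaccard(s1, s2):
        """Jaccard coefficient of two shingle sets; values near 1 suggest near-duplicates."""
        return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0

    a = shingle_set("a rose is a rose is a rose")
    b = shingle_set("a rose is a rose is a flower")
    print(jaccard(a, b))  # 0.75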

Question 2 (5 marks)
Consider the algorithm (from the textbook) to intersect two postings lists p1 and p2.
Algorithm 1: Intersect(p1, p2)
    answer ← ∅
    while p1 ≠ nil and p2 ≠ nil do
        if docID(p1) = docID(p2) then
            Add(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
        else if docID(p1) < docID(p2) then
            p1 ← next(p1)
        else
            p2 ← next(p2)
    return answer
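For reference, a direct Python transcription of Algorithm 1; here the postings lists are modelled simply as sorted Python lists of docIDs rather than linked lists.

    def intersect(p1, p2):
        """Merge-style intersection of two sorted postings lists (Algorithm 1)."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 3, 5, 9, 11], [2, 3, 9, 10]))  # [3, 9]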
(a) What is the time complexity of the algorithm?
(b) Modify the algorithm so that it can answer queries like A AND NOT B in time O(|p1| + |p2|), where A and B are two terms.
(c) Is it possible to modify the algorithm so that it can answer queries like A OR NOT B in time O(|p1| + |p2|)? If not, what complexity can you achieve?

Question 3 (10 marks)
Consider a casual user who inputs the Boolean query “A OR B AND C”. Our system deems the query ambiguous, as either the OR or the AND operator can be executed first. To be on the safe side, the system decides to retrieve only those results that belong to both interpretations (i.e., no matter which interpretation the user intended, the intended result set will include our system’s results). Describe how to support such a query efficiently by accessing the inverted lists of tokens A, B, and C at most once.

Question 4 (10 marks)
From the following sequence of γ-coded gaps, reconstruct first the gap sequence and then the postings sequence (assume that docids start from 1). Note that the spaces were deliberately added for clarity only. You need to illustrate your steps.

1110 1101 1111 1001 0111 1111 1110 1000 1111 1001

Question 5 (10 marks)
The table below shows the output of two information retrieval systems on the same two queries in a competitive evaluation. The top 15 ranks are shown. Crosses (X) correspond to documents which have been judged relevant by a human judge; dashes (-) correspond to irrelevant documents. There are no relevant documents in lower ranks.

          System 1          System 2
Rank      Q1      Q2        Q1      Q2
  1       -       X         X       X
  2       X       -         X       -
  3       X       -         X       -
  4       X       -         -       X
  5       -       -         X       X
  6       -       -         X       -
  7       -       -         -       -
  8       X       -         -       -
  9       X       -         -       -
 10       X       -         -       -
 11       X       -         X       -
 12       -       -         X       -
 13       -       X         -       -
 14       -       X         -       -
 15       X       -         X       -

(a) Explain the following evaluation metrics and give results for query Q1 for both systems.
    1. Precision at rank 10.
    2. Recall at precision 0.5.
(b) The metrics in part (a) above are not adequate measures of system performance for arbitrary queries. Why not? What other disadvantages do these metrics have?
(c) Give the formula for mean average precision (MAP), and calculate MAP for both systems.
(d) For each system, draw a precision-recall curve. Explain how you arrived at your result.

Question 6 (10 marks)
Determine the new query vector computed by the Rocchio relevance feedback algorithm (α = β = γ = 1.0), given that the initial query is “t1 t3” and we have the following documents and user feedback. Note: “R” stands for relevant and “NR” stands for non-relevant.

docid   feedback   t1   t2   t3   t4
  1        R        2    1    0    0
  2        NR       3    2    1    0
  3        R        0    3    0    3
  4        NR       2    1    2    2
  5        NR       0    1    2    3
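For reference, a small sketch of the Rocchio update with α = β = γ = 1.0; the query and document vectors below are made-up toy data, not the table from Question 6, and clipping negative weights to zero is a common (but optional) convention.

    import numpy as np

    def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=1.0, gamma=1.0):
        """Move the query towards the centroid of relevant documents and away
        from the centroid of non-relevant ones."""
        q0 = np.asarray(q0, dtype=float)
        rel = np.mean(rel_docs, axis=0) if len(rel_docs) else np.zeros_like(q0)
        nonrel = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else np.zeros_like(q0)
        q_new = alpha * q0 + beta * rel - gamma * nonrel
        return np.maximum(q_new, 0.0)  # clip negative weights to zero

    q0 = [1, 0, 1, 0]                  # toy query vector over t1..t4
    rel = [[2, 1, 0, 3]]               # toy relevant document(s)
    nonrel = [[0, 2, 1, 0]]            # toy non-relevant document(s)
    print(rocchio(q0, rel, nonrel))    # [3. 0. 0. 3.]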
Question 7 (10 marks)
(a) State and briefly justify the assumptions made to derive Equation (3) from (2) and Equation (6) from (5) in the Binary Independence Model.
(b) State which values need to be estimated for a document collection in the final Equation (8) (i.e., the other parts can be discarded safely without affecting the ranking).

Let x⃗ be the binary term incidence vector representing document D, O(p) be the odds ratio of probability p, Q be the query, R and NR stand for “relevant” and “non-relevant”, respectively, and V be the vocabulary. In addition, we use the shorthand notations p_i = p(x_i = 1 | R, Q) and r_i = p(x_i = 1 | NR, Q).

\begin{align}
O(R \mid Q, \vec{x}) &= \frac{p(R \mid Q, \vec{x})}{p(NR \mid Q, \vec{x})} && (1) \\
&= \frac{p(R \mid Q)}{p(NR \mid Q)} \cdot \frac{p(\vec{x} \mid R, Q)}{p(\vec{x} \mid NR, Q)} && (2) \\
&= O(p(R \mid Q)) \cdot \prod_{i=1}^{|V|} \frac{p(x_i \mid R, Q)}{p(x_i \mid NR, Q)} && (3) \\
&= O(p(R \mid Q)) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, Q)}{p(x_i = 1 \mid NR, Q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, Q)}{p(x_i = 0 \mid NR, Q)} && (4) \\
&= O(p(R \mid Q)) \cdot \prod_{x_i = 1} \frac{p_i}{r_i} \cdot \prod_{x_i = 0} \frac{1 - p_i}{1 - r_i} && (5) \\
&= O(p(R \mid Q)) \cdot \prod_{x_i = 1,\, x_i \in Q} \frac{p_i}{r_i} \cdot \prod_{x_i = 0,\, x_i \in Q} \frac{1 - p_i}{1 - r_i} && (6) \\
&= O(p(R \mid Q)) \cdot \prod_{x_i = 1,\, x_i \in Q} \left( \frac{p_i}{r_i} \cdot \frac{1 - r_i}{1 - p_i} \right) \cdot \prod_{x_i \in Q} \frac{1 - p_i}{1 - r_i} && (7) \\
&= O(p(R \mid Q)) \cdot \prod_{x_i = 1,\, x_i \in Q} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{x_i \in Q} \frac{1 - p_i}{1 - r_i} && (8)
\end{align}

Question 8 (5 marks)
Suppose we have a document collection with an extremely small vocabulary of only 6 words w1, w2, ..., w6. The following table shows the estimated background language model p(w|C) using the whole collection of documents (2nd column) and the word counts for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word w in document di. Let Q = {w1, w2, w3, w4} be a query.

Word   p(w|C)   c(w,d1)   c(w,d2)
 w1    0.800       2         7
 w2    0.100       3         1
 w3    0.025       1         1
 w4    0.025       2         1
 w5    0.025       2         0
 w6    0.025       0         0

(a) Suppose we do not smooth the language model for d1 and d2. Compute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2). (Do not compute the log-likelihood. You should use scientific notation, e.g., 0.0061 should be written as 6.1 × 10^-3.) Which document would be ranked higher?
(b) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer smoothing method with λ = 0.8 (i.e., p(w|d) = λ·pmle(w|Md) + (1−λ)·pmle(w|Mc)). Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2). (Do not compute the log-likelihood. You should use scientific notation.) Which document would be ranked higher?

Question 9 (10 marks)
Consider the following web graph: Page A points to pages B, C, and D. Page B points to C and D. Page C points to A and E. Page D points to E and F. Page E points to G. Page F points to G and H. Consider a crawler that starts from page A.

(a) Give the order of the indexing, assuming the crawler uses a URL frontier with duplicate detection, and all the pages are at different web sites.
(b) Assume pages B, C, F, H are on web site α, pages D, E, G are on web site β, and page A is on web site γ. The politeness policies on these three web sites all specify at least 3 seconds between each visit (i.e., if the crawler visits a web site at second i, the earliest time it can revisit that web site is second i + 3). We assume that (1) the crawler can only fetch one page per second, and all the processing (including physically getting the page, extracting and processing the links, etc.) can be completed before the next fetch; (2) the crawler processes links in the order mentioned above. The crawler still uses a URL frontier with duplicate detection, and also uses back queues to adhere to the politeness policies. Give the order of the indexing. (If two pages can be visited at the same time, we always choose the smaller one according to alphabetical order.)

Question 10 (10 marks)
[Figure: a web graph over five pages V, W, X, Y, Z; the edges are shown in the original figure.]

(a) Explain the concept of PageRank, and how it is calculated in practice.
(b) Why is it relevant for Web search?
(c) Give, and briefly explain, the corresponding matrix notation of the PageRank computation.
(d) Show the final matrix that will be used for the PageRank calculation for the above graph, if the random teleporting probability is 0.2.
(e) Perform two iterations starting from the initial probability distribution vector of (0.2, 0.2, 0.2, 0.2, 0.2).
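For reference, a minimal power-iteration sketch of PageRank with teleporting probability 0.2; the 5-node adjacency matrix below is purely hypothetical and is not the graph from the figure in Question 10.

    import numpy as np

    def pagerank_matrix(adj, teleport=0.2):
        """Row-normalise out-links (dangling rows jump uniformly), then mix with
        uniform teleporting to obtain the final transition matrix."""
        n = len(adj)
        P = np.zeros((n, n))
        for i, row in enumerate(adj):
            s = row.sum()
            P[i] = row / s if s > 0 else np.full(n, 1.0 / n)
        return (1 - teleport) * P + teleport * np.full((n, n), 1.0 / n)

    # Hypothetical link structure: adj[i, j] = 1 means page i links to page j.
    adj = np.array([[0, 1, 1, 0, 0],
                    [0, 0, 1, 0, 0],
                    [1, 0, 0, 1, 0],
                    [0, 0, 0, 0, 1],
                    [1, 0, 0, 0, 0]], dtype=float)

    P = pagerank_matrix(adj, teleport=0.2)
    x = np.full(5, 0.2)      # initial distribution (0.2, 0.2, 0.2, 0.2, 0.2)
    for _ in range(2):       # two power iterations, as asked for in part (e)
        x = x @ P
    print(np.round(x, 4))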
Question 11 (5 marks) BONUS
Explain analytically why galloping search (a.k.a. double binary search) is preferred to normal binary search when implementing the skipTo(docid) method on a sorted list of docids. Make sure you state clearly the meaning of the variables and any assumptions you use.

END OF EXAM PAPER