CS7800 Information Retrieval
Final Exam Spring 2019
A few notes:
1. Some questions are open-ended. Do your best to answer based on your own understanding.
2. You are welcome to search the web for background knowledge. However, you CANNOT copy and paste answers directly from your sources or from other students; use your own words. Any significant duplication will result in a 0 for that question.
3. You must include the references (papers or URLs) you used in your answers.
4. Keep your answers concise (typically one page, and no more than two pages per question). Do not include irrelevant information.
1. Vector Space model.
Consider the following document collection D = {D1, D2, D3} (given as one document per line):
D1 => Betty Botter bought some butter
D2 => But the butter's bitter
D3 => The bitter butter makes the batter bitter
Assume that the stopword list contains the words {some, but, the}. The words will be stemmed, i.e., {bought -> buy, butter's -> butter, makes -> make}.
(1.1) Show the uncompressed dictionary and the postings lists, including the raw tf and idf values: raw tf is the raw term count, and raw idf is (number of documents)/(document frequency). Terms are sorted in dictionary order, and each postings list is sorted by document id.
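For (1.1), a small sketch that can serve as a sanity check for the hand-built index; the tokenization, stopping, and stemming rules are exactly the ones given above:

```python
from collections import Counter

docs = {
    "D1": "Betty Botter bought some butter",
    "D2": "But the butter's bitter",
    "D3": "The bitter butter makes the batter bitter",
}
stopwords = {"some", "but", "the"}
stems = {"bought": "buy", "butter's": "butter", "makes": "make"}

index = {}  # term -> list of (doc id, raw tf), sorted by doc id
for doc_id in sorted(docs):
    tokens = [t.lower() for t in docs[doc_id].split()]
    tokens = [stems.get(t, t) for t in tokens if t not in stopwords]
    for term, tf in sorted(Counter(tokens).items()):
        index.setdefault(term, []).append((doc_id, tf))

N = len(docs)
for term in sorted(index):          # dictionary order
    raw_idf = N / len(index[term])  # (number of documents)/(document frequency)
    print(term, raw_idf, index[term])
```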
(1.2) If the terms in the query are sorted by IDF = log(raw idf), what are the terms likely used in scoring for the query "bitter butter in the batter"? Briefly justify the answer.
(1.3) If a query-independent document quality scoring scheme g(d) gives 1, 1.5, and 0.5 for D1, D2, and D3, respectively, all posting lists will be sorted by the quality score in descending order and cut off at 0.8. Let the document vector weighting function be g(d) + raw tfidf. What are the relevance scores and ranking of the documents for the query "bitter batter"?
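One way to check a hand computation for (1.3) is the sketch below. It assumes that "cutoff by 0.8" means postings whose g(d) falls below 0.8 are dropped, and that the relevance score is a dot product with a binary query vector; both are readings of the question, not stated in it:

```python
from collections import Counter

# Toy collection after stopping and stemming (from question 1)
docs = {
    "D1": ["betty", "botter", "buy", "butter"],
    "D2": ["butter", "bitter"],
    "D3": ["bitter", "butter", "make", "batter", "bitter"],
}
g = {"D1": 1.0, "D2": 1.5, "D3": 0.5}
N = len(docs)

# Build postings, sort each list by g(d) descending, then apply the cutoff
postings = {}
for d in docs:
    for t, tf in Counter(docs[d]).items():
        postings.setdefault(t, []).append((d, tf))
df = {t: len(pl) for t, pl in postings.items()}  # df taken before the cutoff
for t in postings:
    postings[t] = [(d, tf)
                   for d, tf in sorted(postings[t], key=lambda e: -g[e[0]])
                   if g[d] >= 0.8]  # assumed reading of "cutoff by 0.8"

# Score each document: sum of g(d) + raw tf * raw idf over surviving postings
scores = {d: 0.0 for d in docs}
for t in ["bitter", "batter"]:
    idf = N / df[t]
    for d, tf in postings.get(t, []):
        scores[d] += g[d] + tf * idf
print(scores)
```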
2. Learning to Rank
(2.1) What are the advantages of pairwise learning-to-rank algorithms?
(2.2) If you are asked to evaluate the ranking quality of two learning-to-rank algorithms, X and Y, how will you design the experiment? Briefly describe the key steps.
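As background for (2.1): pairwise methods learn from relative preferences between document pairs rather than absolute relevance labels. A minimal, hypothetical RankNet-style update on a single pair (the feature vectors and learning rate below are made up for illustration):

```python
import math

def score(w, x):
    """Linear scoring function: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def pairwise_step(w, x_pos, x_neg, lr=0.1):
    """One gradient step on the pairwise logistic loss log(1 + exp(-(s+ - s-)))."""
    margin = score(w, x_pos) - score(w, x_neg)
    g = -1.0 / (1.0 + math.exp(margin))  # dLoss/dmargin
    return [wi - lr * g * (xp - xn) for wi, xp, xn in zip(w, x_pos, x_neg)]

w = [0.0, 0.0]
x_pos, x_neg = [1.0, 0.2], [0.2, 1.0]  # hypothetical feature vectors
for _ in range(200):
    w = pairwise_step(w, x_pos, x_neg)
print(score(w, x_pos) > score(w, x_neg))  # -> True
```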
3. Document Clustering
(3.1) Can normalized mutual information (NMI) be used to determine the optimal number of clusters? Briefly justify your answer.
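For (3.1) it may help to recall how NMI is computed against a reference labeling. A sketch using the common sqrt-normalized form (one of several normalizations in use):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_info(a, b):
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def nmi(a, b):
    # sqrt normalization; guard against the degenerate single-cluster case
    denom = math.sqrt(entropy(a) * entropy(b)) or 1.0
    return mutual_info(a, b) / denom

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # a perfect match scores 1.0
```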
(3.2) If you want to use cosine similarity as the similarity measure in clustering, can you still use the k-means clustering algorithm? Briefly justify your answer.
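Relevant to (3.2): one standard adaptation is spherical k-means, in which vectors are normalized so cosine similarity becomes a dot product, and centroids are renormalized after each mean step. A sketch with deterministic initialization (an illustrative choice, not part of the standard algorithm):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def spherical_kmeans(points, k, iters=20):
    pts = [normalize(p) for p in points]
    # Deterministic seeding for the sketch: evenly spaced input points
    centroids = [pts[i * len(pts) // k] for i in range(k)]
    labels = [0] * len(pts)
    for _ in range(iters):
        # Assign each point to the centroid with highest dot product
        for i, p in enumerate(pts):
            labels[i] = max(range(k),
                            key=lambda c: sum(a * b for a, b in zip(p, centroids[c])))
        # Recompute centroids and project them back onto the unit sphere
        for c in range(k):
            members = [p for i, p in enumerate(pts) if labels[i] == c]
            if members:
                mean = [sum(xs) / len(members) for xs in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

print(spherical_kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 2))  # -> [0, 0, 1, 1]
```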
4. Document vectors
We have discussed a simple method in class for deriving a document vector from word vectors (e.g., word2vec). Please check the literature to find other possible methods for representing documents with vectors. Compare and discuss the methods you have found.
5. Conversational search
Read the paper "Filip Radlinski and Nick Craswell. A Theoretical Framework for Conversational Search. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval (CHIIR '17)", and write a research summary that includes: (1) the research problem and its significance, (2) the research challenges the paper addresses, (3) a brief description of the approach, (4) unique contributions, and (5) strengths and weaknesses.