Lecture Review and Questions for Week 5
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
Retrieval models provide a framework for defining new tasks and explaining assumptions. For a given query Q, an Information Retrieval (IR) model finds relevant documents to answer Q.
Copyright By PowCoder代写 加微信 powcoder
It is quite difficult for a person to explain why one document is more relevant than another, and when people are asked to judge the relevance of documents for a given query, they often disagree to use such systems.
People usually discuss two key aspects of relevance: topical and user relevance. A document is topically relevant to a query if it is judged to be on the same topic, i.e., the query and the document are about the same thing. User relevance considers all the other factors that go into a user’s judgment of relevance. Also, relevance is possible binary (either relevant or irrelevant) or multivalued.
Many IR models have been proposed over the years. Two of the oldest are the Boolean and vector space models.
An IR model should provide a clear statement about the assumptions upon which it is based. Boolean and vector space models make implicit assumptions about relevance and textual representations that influence the design and effectiveness1 of ranking algorithms.
One early theoretical statement about effectiveness, known as the
Probability Ranking Principle encouraged the development of
probabilistic IR models.
These models have achieved this status because probability theory is a strong foundation for representing and manipulating the uncertainty in text data.
1 Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query. Efficiency (speed): We want to process queries from users as quickly as possible.
The binary independence model is the basis for one of the most effective and popular ranking algorithms, known as BM25. BM25 has performed very well in TREC retrieval experiments and has influenced the ranking algorithms of commercial search engines, including web search engines.
Question 1. Which of the following descriptions is wrong? and justify your answer.
(1) The Boolean retrieval model is also known as exact-match retrieval since documents are retrieved if they exactly match the query specification, and otherwise are not retrieved.
(2) The process of developing queries with a focus on the size of the retrieved set has been
called searching by numbers and is a consequence of the limitations of the Boolean
retrieval model.
(3) In a vector space model, documents and queries are assumed to be part of a t-dimensional
vector space, where t is the number of index terms (words, stems, phrases, etc.).
(4) There is an explicit definition of relevance in the vector space model.
Question 2. (Probabilistic Models)
Let N=|D| be the total number of documents in a training set D and R be the number of relevant documents in D. The following is a term weighting formula of probabilistic methods:
w1(t) = log[(r(t) / R) / (n(t) / N)] Explain the meaning of r(t) and n(t) in the above formula.
Language Model and Relevance Feedback
The simplest form of language model, known as a unigram language model, is a probability distribution over the words in the language. This means that the language model associates a probability of occurrence with every word in the index vocabulary for a collection.
For example, if the documents in a collection contained just five different words (“girl”, “cat”, “the”, “boy”, and “touched”), a possible language model for that collection might be (0.2, 0.1, 0.35, 0.25, 0.1), where each number is the probability of a word occurring. So, we can treat each document as a sequence of words, e.g., the sequence “girl cat” (with probability 0.02=0.2*0.1).
An n-gram model predicts a word based on the previous n−1 words. The most common n-gram models are bigram (predicting based on the previous word) and trigram (predicting based on the previous two words) models.
Language models can be used to represent the topical content of a document. A topic is something that is talked about often. It is defined as a probability distribution over words or terms (in other words, a language model).
Question 3. Which of the following descriptions is wrong? and justify your answer.
(1) In the query likelihood retrieval model, documents are ranked by the probability that the query text could be generated by the document language model.
(2) To use the unigram language model to estimate P(Q|D) – the document D’s score for the given query Q ={q1, q2, …, qn}, the language model probabilities P(qi|D) (i =1, 2, …, n) are needed to be estimated.
(3) The major problem with the estimate of P(qi|D) is that if any of the query words are missing from the document, the score given by the query likelihood model for P(Q|D) will be zero.
(4) Kullback-Leibler (KL) divergence is a technique for avoiding this estimation problem and overcoming data sparsity, which means that we typically do not have large amounts of text to use for the language model probability estimates.
Question 4. (Language Models)
Let Q = {US, ECONOM, ESPIONAG} be a query, and C = {D1, D2, D3, D4, D5, D6} be a collection of documents, where
D1 = {GERMAN, VW}
D2 = {US, US, ECONOM, SPY}
D3 = {US, BILL, ECONOM, ESPIONAG}
D4 = {US, ECONOM, ESPIONAG, BILL}
D5 = {GERMAN, MAN, VW, ESPIONAG}
D6 = {GERMAN, GERMAN, MAN, VW, SPY}
Jelinek-Mercer smoothing method uses the following equation to calculate a score for a document D
where n is the number of query terms in Q, and
for all query term qi in Q, where fqi,D is the number of times query word qi occurs in document D, |D| is the number of word occurrences in D, cqi is the number of times query word qi occurs in
collection C, and |C| is the total number of word occurrences in collection C. Assume parameter λ = 0.4, calculate P(Q|D2). You are required to write the finding process.
Question 5. (Language IR Model)
For a given query Q, you are going to find the relevant documents in a given folder, e.g., a folder “Test_Doc”. Assume the documents in the folder are indexed by the python function index_docs(), i.e., index_docs(“Test_Doc”, stop_words) which returns a dictionary with the following data structure:
{term: {docID1: frequency1, DocID2: frequency2, …}, …}
(1) Let Q = {q1:1, q2:1, …, qn:1} be a dictionary, please define a function likelihood_IR(I, Q)
to estimate P(Q|D), i.e., return the score of document D for the given query Q by using
where fqi,D is the number of times word qi occurs in document D, and |D| is the number of words in D.
(2) Assume you have obtained another folder, e.g., “Collection_C”. Extend function likelihood_IR(I, Q) to a Jelinek-Mercer smoothing function likelihood_JM(I_C, I_D, Q, lambda) by using the following equation:
where cqi is the number of times query term qi occurs in I_C, and |C| is the total number of term occurrences in I_C. The function estimates a score for documents in I_D.
You can assume that both the query and the document have the same text preprocessing applied. So, each word in D is a stem, and |D| and |C| simply count the number of terms (stems) in D or C (including repeated occurrences).
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com