Week 6 Question Solutions
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
Evaluation overview
Evaluation is the key to making progress in building better search systems or IR models. One of the primary distinctions made in the evaluation of search systems is between effectiveness and efficiency. Effectiveness measures the ability of the search system to find the right information, and efficiency measures how quickly this is done.
The retrieval and indexing techniques have many parameters that can be adjusted to optimize performance, both in terms of effectiveness and efficiency. Typically, the best values for these parameters are determined using training data and a cost function. Training data is a sample of the real data, and the cost function is the quantity, computed from that data, that is to be maximized (or minimized).
The optimization process uses the training data to learn parameter settings for the ranking algorithm that maximize the chosen effectiveness measure.
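As a minimal sketch of this idea (not from the lecture notes), the loop below tries a set of candidate parameter values and keeps the one that maximizes an effectiveness score on the training queries. The function evaluate is a hypothetical stand-in for "run the ranking algorithm with this parameter value and measure effectiveness (e.g., MAP) on the training data".

def tune_parameter(candidate_values, training_queries, evaluate):
    # Pick the parameter value that maximizes effectiveness on the training data.
    best_value, best_score = None, float("-inf")
    for value in candidate_values:
        score = evaluate(value, training_queries)  # hypothetical evaluation call
        if score > best_score:
            best_value, best_score = value, score
    return best_value

# Illustrative usage (hypothetical): sweep a retrieval parameter over a small grid.
# best_b = tune_parameter([0.0, 0.25, 0.5, 0.75, 1.0], training_queries, evaluate)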
One of the basic requirements for evaluation is that the results from different techniques can be compared. To do this comparison fairly and to ensure that experiments are repeatable, the experimental settings and data used must be fixed.
We usually need to find a suitable data collection to test proposed models or algorithms. For example, the AP (Associated Press) newswire collection, consisting of documents from 1988–1990, was created as part of the TREC conference series sponsored by the National Institute of Standards and Technology (NIST).
The queries for the AP collection are based on TREC topics, which were created by government information analysts employed by NIST. The queries are the title fields from TREC topics 51–150; both the topics and the relevance judgments were generated by these analysts.
Example of a TREC topic
Description: How are pets or animals used in therapy for humans and what are the benefits?
Narrative: Relevant documents must include details of how pet- or animal-assisted therapy is or has been used. Relevant details include information about pet therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it.
Relevance Judgments
The relevance judgments depend on the task that is being evaluated. For example, TREC analysts judged a document as relevant if it contained information that could be used to help write a report on the query topic (here we primarily focused on topical relevance).
Relevance judgments are normally binary, meaning that a document is either relevant or non-relevant. For some tasks, multiple levels of relevance may be appropriate, e.g., three levels of relevance (relevant, non-relevant, and neutral or unknown).
A data collection can be divided into three parts: a training set, a validation set, and/or a test set. We will discuss how data collections are partitioned for information filtering and text classification.
Creating a new test collection can be a time-consuming task since relevance judgments require a considerable investment of manual effort.
Question 1. Which of the following statements are false? Justify your answer.
(1) For a very large data collection, the pooling technique is used to select the top-k results from the rankings obtained by different retrieval algorithms. The results are merged into a pool, duplicates are removed, and the documents are presented in some random order to the people doing the relevance judgments.
(2) Pooling produces a large number of relevance judgments for each query. However, this list is incomplete, and for a new retrieval algorithm that had not contributed documents to the original pool, this could potentially be a problem.
(3) Many user actions (e.g., Query log data) can be considered implicit relevance judgments. The main drawback with this data is that it is not as precise as explicit relevance judgments.
(4) A typical query log does not contain user identifier or user session identifier because of the privacy issue.
Solution: (4)
A typical query log contains a user identifier or a user session identifier, the query terms, the list of result URLs, their ranks on the result list, whether each result was clicked on, and timestamp(s).
Privacy is particularly an issue when query logs are shared, distributed for research, or used to construct user profiles.
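For illustration only, one query log record with the fields listed above could be represented as a small Python dictionary; all field names and values here are hypothetical, not taken from any real log.

# Hypothetical query log record: anonymised session identifier, query terms,
# ranked result URLs with click information, and a timestamp.
query_log_record = {
    "session_id": "anon-12345",
    "query_terms": ["pet", "therapy"],
    "results": [
        {"rank": 1, "url": "http://example.org/doc1", "clicked": True},
        {"rank": 2, "url": "http://example.org/doc2", "clicked": False},
    ],
    "timestamp": "2024-03-01T10:15:30Z",
}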
Question 2. (Effectiveness Metrics)
Assume Table 1 gives the relevance judgments for documents 110 to 124 on topic 101, where 1 means the corresponding document is relevant and 0 means non-relevant, and Table 2 shows IR model1's output, with documents sorted by their weights. Both tables appear below, after the sub-questions.
Assume A is the set of relevant documents for topic 101 and A' is the set of non-relevant documents. We also assume IR model1 selects the top-6 documents as relevant; this retrieved set is B, and B' is the set of documents that are not retrieved.
(1) List all sets’ elements and enclose them in braces.
(2) Calculate recall and precision of IR model1.
(3) Calculate false positive and false negative of IR model1.
Table 1. Relevance Judgements
TOPIC DocNo Rel
101 110 1
101 111 1
101 112 0
101 113 1
101 114 1
101 115 0
101 116 1
101 117 1
101 118 0
101 119 1
101 120 0
101 121 0
101 122 0
101 123 1
101 124 0
Table 2. IR Model1's Output (ranked by weight)
TOPIC DocNo Weight
101 111
101 112
101 113
101 110
101 114
101 119
101 115
101 122
101 118
101 116
101 123
101 121
101 120
101 117
101 124
(The top-6 documents in Table 2 form the set of retrieved documents B.)
(4) Calculate IR model1’s F1 measure.
Solution: (1)
A= {110, 111, 113, 114, 116, 117, 119, 123}
B = {111, 112, 113, 110, 114, 119}
A’ = {112, 115, 118, 120, 121, 122, 124}
B’ = {115, 122, 118, 116, 123, 121, 120, 117, 124}
(2) Recall = |A∩B| / |A| = |{110, 111, 113, 114, 119}| / |A| = 5/8 = 62.5%
Precision = |A∩B| / |B| = |{110, 111, 113, 114, 119}| / |B| = 5/6 = 83.33%
(3) FP = Fallout = |A'∩B| / |A'| = |{112}| / |A'| = 1/7
FN = |A∩B'| / |A| = |{116, 117, 123}| / |A| = 3/8
(4) F1 = 2*Recall*Precision / (Recall + Precision) = 2*(5/8)*(5/6) / (5/8 + 5/6) = 10/14 = 71.43%
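These set-based calculations can be checked with a few lines of Python; the sketch below simply re-encodes the sets from the solution above and prints the same five values.

# Relevant set A and retrieved set B for topic 101 (taken from the solution above).
A = {110, 111, 113, 114, 116, 117, 119, 123}        # relevant documents
B = {111, 112, 113, 110, 114, 119}                  # top-6 retrieved documents
all_docs = set(range(110, 125))                     # documents 110..124
A_comp = all_docs - A                               # A': non-relevant documents
B_comp = all_docs - B                               # B': documents not retrieved

recall = len(A & B) / len(A)                        # 5/8 = 0.625
precision = len(A & B) / len(B)                     # 5/6 ≈ 0.833
fallout = len(A_comp & B) / len(A_comp)             # 1/7 ≈ 0.143 (false positive rate)
fn_rate = len(A & B_comp) / len(A)                  # 3/8 = 0.375 (false negative rate)
f1 = 2 * recall * precision / (recall + precision)  # ≈ 0.714

print(recall, precision, fallout, fn_rate, f1)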
Ranking Effectiveness
Most retrieval models produce ranked output. For a fair comparison of recall and precision measures, the set of retrieved documents should be defined in terms of the ranking. One possibility is to calculate recall and precision values at every rank position p.
Several techniques have been developed to summarize the effectiveness of a ranking. The first is simply to calculate recall-precision values at a small number of predefined rank positions. To compare two or more rankings for a given query, the precision at the predefined rank positions needs to be calculated. This effectiveness measure is known as precision at rank p. The most common versions are precision at 10 and precision at 20 or 25.
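For example (the helper below is only a sketch, not a standard library function), precision at rank p can be computed directly from a ranked list of document ids and a set of relevant documents:

def precision_at(p, ranking, relevant):
    # Fraction of the top-p ranked documents that are relevant.
    return sum(1 for doc in ranking[:p] if doc in relevant) / p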
Another method of summarizing the effectiveness of a ranking is to calculate precision at fixed (standard) recall levels from 0.0 to 1.0 in increments of 0.1. Each ranking is then represented using 11 numbers. This idea can be extended to average effectiveness across queries and to generate recall-precision graphs (using interpolation).
This method has the advantage of summarizing the effectiveness of the ranking of all relevant documents, rather than just those in the top ranks.
The third method, and the most popular, is to summarize the ranking by averaging the precision values from the rank positions where a relevant document was retrieved (i.e., when recall increases).
Average precision has several advantages. It is a single number that is based on the ranking of all the relevant documents, but the value depends heavily on the highly ranked relevant documents. This means it is an appropriate measure for evaluating the task of finding as many relevant documents as possible while still reflecting the intuition that the top-ranked documents are the most important.
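A minimal sketch of average precision, assuming binary relevance judgments and dividing by the total number of relevant documents (so relevant documents that are never retrieved contribute zero):

def average_precision(ranking, relevant):
    # Average the precision values at the ranks where a relevant document occurs
    # (i.e., where recall increases), divided by the number of relevant documents.
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0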
Question 3. Given the relevance judgments (Table 1) and the ranked output of IR model1 (Table 2):
(1) Calculate IR model1’s precision at rank 6 (position 6) and precision at rank 10 (position 10).
(2) Calculate IR model1’s average precision.
Solution: (1)
precision at rank 6 = 83.33% (see question 2 solution)
precision at rank 10 = |{110, 111, 113, 114, 119, 116}| / 10 = 6/10 = 60%
IR model1’s ranked output:
111 112 113 110 114 119 115 122 118 116 123 121 120 117 124
P@p: 1, 1/2, 2/3, 3/4, 4/5, 5/6, 5/7, 5/8, 5/9, 6/10, 7/11, 7/12, 7/13, 8/14, 8/15
Average Precision = (1 + 2/3 + 3/4 + 4/5 + 5/6 + 6/10 + 7/11 + 8/14) / 8 = (1 + 0.667 + 0.75 + 0.8 + 0.833 + 0.6 + 0.636 + 0.571) / 8 ≈ 0.732 (73.2%)
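The numbers above can be reproduced with a short, self-contained script; the data is taken directly from Tables 1 and 2.

ranking = [111, 112, 113, 110, 114, 119, 115, 122, 118, 116, 123, 121, 120, 117, 124]
relevant = {110, 111, 113, 114, 116, 117, 119, 123}

# Precision at ranks 6 and 10.
print(sum(1 for d in ranking[:6] if d in relevant) / 6)    # 0.8333... (5/6)
print(sum(1 for d in ranking[:10] if d in relevant) / 10)  # 0.6 (6/10)

# Average precision: mean of the precision values at ranks where a relevant document appears.
hits, precisions = 0, []
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        precisions.append(hits / rank)
print(sum(precisions) / len(relevant))                     # ≈ 0.732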
The aim of an averaging technique is to summarize the effectiveness of a specific ranking algorithm across multiple queries (40-50 queries are commonly recommended for evaluating IR models) rather than a single query, since different queries will often have different numbers of relevant documents.
Given that the average precision provides a number for each ranking, the simplest way to summarize the effectiveness of rankings from multiple queries would be to average these numbers. This effectiveness measure, mean average precision (MAP), is used in most research papers and some system evaluations.
Also, an average recall-precision graph is plotted by averaging the interpolated precision values across queries at each standard recall level and joining the resulting points.
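In code, MAP is simply the mean of the per-query average precision values. The sketch below uses made-up AP values purely for illustration (they are not the answer to Question 4).

ap_per_query = {"query1": 0.70, "query2": 0.45, "query3": 0.60}  # illustrative values only
map_score = sum(ap_per_query.values()) / len(ap_per_query)       # mean average precision
print(map_score)                                                 # 0.5833...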
Question 4. Fig. 1 shows the recall and precision values at rank positions 1, 2, ..., 10 for two different queries, query 1 and query 2. The precision P at any standard recall level R is defined as
P(R) = max{P' : R' ≥ R and (R', P') ∈ S},
where S is the set of observed (R, P) points.
(1) Calculate MAP of the two queries.
(2) For query 1, list the elements of S in curly braces.
(3) Calculate query 1’s precision P at any standard recall level R.
Fig 1. Recall and precision values for rankings from two different queries
(4) Assume S is represented as a list of recall-precision pairs, e.g., S = [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.4, 0.5), (0.4, 0.4), (0.6, 0.5), (0.6, 0.43), (0.6, 0.38), (0.8, 0.44), (1.0, 0.5)]. Define a list comprehension to calculate precisions at all standard recall levels.
Solution:
(2) S = {(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.4, 0.5), (0.4, 0.4), (0.6, 0.5), (0.6, 0.43), (0.6, 0.38), (0.8, 0.44), (1.0, 0.5)}
(3)
For recall level 0.0: {P' : R' ≥ 0.0 and (R', P') ∈ S} = {1.0, 0.5, 0.67, 0.4, 0.43, 0.38, 0.44}
For recall level 0.1: {P' : R' ≥ 0.1 and (R', P') ∈ S} = {1.0, 0.5, 0.67, 0.4, 0.43, 0.38, 0.44}
For recall level 0.2: {P' : R' ≥ 0.2 and (R', P') ∈ S} = {1.0, 0.5, 0.67, 0.4, 0.43, 0.38, 0.44}
max{1.0, 0.5, 0.67, 0.4, 0.43, 0.38, 0.44} = 1.0
So, P(0) = P(0.1) = P(0.2) = 1.0
For recall level 0.3: {P' : R' ≥ 0.3 and (R', P') ∈ S} = {0.67, 0.5, 0.4, 0.43, 0.38, 0.44}
For recall level 0.4: {P' : R' ≥ 0.4 and (R', P') ∈ S} = {0.67, 0.5, 0.4, 0.43, 0.38, 0.44}
max{0.67, 0.5, 0.4, 0.43, 0.38, 0.44} = 0.67
So, P(0.3) = P(0.4) = 0.67
For recall level 0.5: {P' : R' ≥ 0.5 and (R', P') ∈ S} = {0.5, 0.43, 0.38, 0.44}
For recall level 0.6: {P' : R' ≥ 0.6 and (R', P') ∈ S} = {0.5, 0.43, 0.38, 0.44}
max{0.5, 0.43, 0.38, 0.44} = 0.5
So, P(0.5) = P(0.6) = 0.5
For recall level 0.7: {P' : R' ≥ 0.7 and (R', P') ∈ S} = {0.44, 0.5}
For recall level 0.8: {P' : R' ≥ 0.8 and (R', P') ∈ S} = {0.44, 0.5}
max{0.44, 0.5} = 0.5
So, P(0.7) = P(0.8) = 0.5
For recall level 0.9: {P' : R' ≥ 0.9 and (R', P') ∈ S} = {0.5}
For recall level 1.0: {P' : R' ≥ 1.0 and (R', P') ∈ S} = {0.5}
max{0.5} = 0.5
So, P(0.9) = P(1.0) = 0.5
(4)
[max([p for (r, p) in S if r >= i/10]) for i in range(11)]
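Run on the list S given in the question, this comprehension returns the eleven interpolated precision values computed in part (3); a runnable version, with the expected output shown as a comment:

S = [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.4, 0.5), (0.4, 0.4),
     (0.6, 0.5), (0.6, 0.43), (0.6, 0.38), (0.8, 0.44), (1.0, 0.5)]

# For each standard recall level 0.0, 0.1, ..., 1.0, take the maximum precision
# over all observed (recall, precision) points whose recall is >= that level.
print([max([p for (r, p) in S if r >= i / 10]) for i in range(11)])
# [1.0, 1.0, 1.0, 0.67, 0.67, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]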