Search Engines
Text, Web And Media Analytics
Evaluation
1. Evaluation overview
2. Relevance Judgments
Query Logs
Filtering Clicks
3. Effectiveness Measures
Precision, recall, F measure
4. Ranking Effectiveness
Average Precision
Mean Average Precision (MAP)
Average recall-precision graph
Discounted Cumulative Gain (DCG) & NDCG
5. Efficiency Metrics
6. Significance Tests
1. Evaluation Overview
Evaluation is key to building effective and efficient search systems
measurement usually carried out in controlled laboratory experiments
online testing can also be done
Effectiveness, efficiency, and cost are related
e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
efficiency and cost targets may impact effectiveness
Evaluation Corpus
Test collections consisting of documents, queries, and relevance judgments, e.g.,
Test Collections
TREC Topic Example
2. Relevance Judgments
Obtaining relevance judgments is an expensive, time-consuming process
who does it?
what are the instructions?
what is the level of agreement?
TREC judgments
depend on task being evaluated
generally binary relevance judgments
agreement good because of “narrative”
Pooling
Exhaustive judgments for all documents in a collection is not practical
Pooling technique is used in TREC
top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool
duplicates are removed
documents are presented in some random order to the relevance judges
Produces a large number of relevance judgments for each query, although still incomplete
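A minimal sketch of this pooling procedure, using hypothetical result lists from three systems:

import random

# Sketch: build a judging pool from the top-k results of several runs (hypothetical data).
runs = {
    "systemA": ["d3", "d7", "d1", "d9", "d4"],
    "systemB": ["d7", "d2", "d3", "d8", "d5"],
    "systemC": ["d1", "d3", "d6", "d7", "d2"],
}
k = 5   # TREC used k between 50 and 200; 5 keeps the example small

pool = set()
for ranking in runs.values():
    pool.update(ranking[:k])      # merge top-k results; the set removes duplicates

pool = list(pool)
random.shuffle(pool)              # present documents to the judges in random order
print(pool)                       # each pooled document then receives a relevance judgment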
Query Logs
Used for both tuning and evaluating search engines
also for various techniques such as query suggestion
Typical contents
User identifier or user session identifier
Query terms – stored exactly as user entered
List of URLs of results, their ranks on the result list, and whether they were clicked on
Timestamp(s) – records the time of user events such as query submission, clicks
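A hypothetical query-log record, sketched as a Python dictionary (the field names are illustrative, not a real engine's log schema):

# One (hypothetical) query-log entry.
log_entry = {
    "session_id": "u1234-s56",                # user or user-session identifier
    "query": "cheap flights to brisbane",     # stored exactly as the user entered it
    "results": [                              # result URLs, their ranks, and whether they were clicked
        {"rank": 1, "url": "http://example.com/a", "clicked": False},
        {"rank": 2, "url": "http://example.com/b", "clicked": True},
        {"rank": 3, "url": "http://example.com/c", "clicked": False},
    ],
    "timestamp": "2021-05-01T10:15:30Z",      # time of query submission; clicks carry their own timestamps
}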
Query Logs
Clicks are not relevance judgments
although they are correlated
biased by a number of factors such as rank on result list
Can use clickthrough data to predict preferences between pairs of documents
appropriate for tasks with multiple levels of relevance, focused on user relevance
various “policies” used to generate preferences
Example Click Policy
Skip Above and Skip Next
click data
generated preferences
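A minimal sketch of one reading of the Skip Above and Skip Next policy (a clicked result is preferred over every unclicked result ranked above it, and over the unclicked result immediately below it); the click data here are made up:

# Sketch of the "Skip Above and Skip Next" click policy.
def skip_above_and_next(results, clicks):
    """results: doc ids in rank order; clicks: set of clicked doc ids.
    Returns (preferred, non_preferred) pairs."""
    prefs = []
    for i, doc in enumerate(results):
        if doc not in clicks:
            continue
        for other in results[:i]:             # Skip Above: beats unclicked documents ranked above
            if other not in clicks:
                prefs.append((doc, other))
        if i + 1 < len(results) and results[i + 1] not in clicks:
            prefs.append((doc, results[i + 1]))   # Skip Next: beats the next unclicked document
    return prefs

print(skip_above_and_next(["d1", "d2", "d3", "d4", "d5"], {"d3"}))
# -> [('d3', 'd1'), ('d3', 'd2'), ('d3', 'd4')]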
Query Logs cont.
Click data can also be aggregated to remove noise
Click distribution information
can be used to identify clicks that have a higher frequency than would be expected
high correlation with relevance
e.g., using click deviation to filter clicks for preference-generation policies
Filtering Clicks
Click deviation CD(d, p) for a result d in position p: CD(d, p) = O(d, p) − E(p), where
O(d,p): observed click frequency for a document in a rank position p over all instances of a given query
E(p): expected click frequency at rank p averaged across all queries
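A short sketch of filtering clicks by deviation, assuming the CD(d, p) = O(d, p) − E(p) definition above and made-up click frequencies:

# Keep only clicks whose observed frequency exceeds the expected frequency at that rank.
O = {("d7", 1): 0.32, ("d2", 2): 0.25, ("d9", 3): 0.05}   # observed frequencies (made up)
E = {1: 0.40, 2: 0.15, 3: 0.10}                           # expected frequencies per rank (made up)

def click_deviation(doc, pos):
    return O.get((doc, pos), 0.0) - E[pos]

threshold = 0.05   # hypothetical cutoff
filtered = [(d, p) for (d, p) in O if click_deviation(d, p) > threshold]
print(filtered)    # -> [('d2', 2)]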
3. Effectiveness Measures
A is the set of relevant documents (the part of U the user wants)
B is the set of retrieved documents
U is the set of all documents; U − A is the set of non-relevant documents (the part the user does not want)
The recall and precision formulas can be read as percentages:
Recall = |A ∩ B| / |A|, Precision = |A ∩ B| / |B|
In certain settings, all sets under discussion are considered to be subsets of a given universal set U.
In such cases, U − A is called the absolute complement, or simply the complement, of A, and is denoted by Ā or A^c
Classification Errors
False Positive (Type I error)
a non-relevant document is retrieved
False Negative (Type II error)
a relevant document is not retrieved
the false negative rate = 1 − Recall
Precision is used when the probability that a positive result is correct is important
There is a trade-off: pursuing higher precision generally lowers recall, and vice versa
Harmonic mean of recall and precision:
F = 2RP / (R + P)
harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large
More general form:
F_β = (β² + 1)RP / (R + β²P)
β is a parameter that determines the relative importance of recall and precision. Values of β > 1 emphasize recall.
The common F measure is in fact F1 (β = 1), where recall and precision have equal importance.
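A minimal sketch of these set-based measures and the F measure, with made-up relevant and retrieved sets:

A = {"d1", "d3", "d5", "d7"}     # relevant documents (made up)
B = {"d1", "d2", "d3", "d4"}     # retrieved documents (made up)

hits = len(A & B)                # relevant documents that were retrieved
recall = hits / len(A)           # 2/4 = 0.5
precision = hits / len(B)        # 2/4 = 0.5

def f_measure(p, r, beta=1.0):
    """General F measure; beta = 1 gives F1, beta > 1 emphasizes recall."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(recall, precision, f_measure(precision, recall))   # 0.5 0.5 0.5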
4. Ranking Effectiveness
Different rankings:
(1) For the same A, that is, the same query, different IR models may produce different rankings.
(2) A given IR model produces multiple rankings for different queries (multiple As).
Summarizing a Ranking
Calculating recall and precision at fixed rank positions (or cutoff)
Calculating precision at standard recall levels, from 0.0 to 1.0
requires interpolation
Averaging the precision values from the rank positions where a relevant document was retrieved
Average Precision Example
For the same A, that is, the same query, different IR models may produce different rankings.
Here, we use average precision to compare the effectiveness of two IR models that produce Ranking #1 and Ranking #2, respectively.
Averaging Across Queries Example
(2) A given IR model produces multiple rankings for different queries (multiple As).
Here, we want to measure the effectiveness (MAP) of an IR model based on two queries (query 1 and query 2).
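A short sketch of average precision per query and MAP over queries, using made-up binary relevance lists (not the rankings from the original slide figures):

def average_precision(rels, num_relevant):
    """rels: 0/1 relevance flags in rank order; average the precision values
    at the ranks where a relevant document was retrieved."""
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i          # precision at this rank position
    return total / num_relevant

q1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]    # made-up ranking for query 1 (5 relevant docs exist)
q2 = [0, 1, 0, 0, 1, 1, 0, 0, 0, 0]    # made-up ranking for query 2 (3 relevant docs exist)

ap1, ap2 = average_precision(q1, 5), average_precision(q2, 3)
map_score = (ap1 + ap2) / 2            # mean average precision over the two queries
print(round(ap1, 2), round(ap2, 2), round(map_score, 2))   # 0.62 0.47 0.54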
Mean Average Precision (MAP)
summarize rankings from multiple queries by averaging average precision
most commonly used measure in research papers
assumes user is interested in finding many relevant documents for each query
requires many relevance judgments in text collection
Recall-precision graphs are also useful summaries
Recall-Precision Graph
Interpolation
To average graphs, calculate precision at standard recall levels:
where S is the set of observed (R,P) points
Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
produces a step function
defines precision at recall 0.0
P(R) = max{P′ : R′ ≥ R ∧ (R′, P′) ∈ S}
>>> # observed (recall, precision) points from a ranking
>>> S = [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.4, 0.5), (0.4, 0.4), (0.6, 0.5), (0.6, 0.43), (0.6, 0.38), (0.8, 0.44), (1.0, 0.5)]
>>> R = 0.5
>>> PR = [p for (r, p) in S if r >= R]   # precisions at observed recall levels >= R
>>> print(PR)
[0.5, 0.43, 0.38, 0.44, 0.5]
>>> max(PR)                              # interpolated precision at recall 0.5
0.5
>>> # interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0
>>> All_PR = [max([p for (r, p) in S if r >= i / 10]) for i in range(11)]
>>> print(All_PR)
[1.0, 1.0, 1.0, 0.67, 0.67, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
Interpolation cont.
Average Precision at Standard Recall Levels
Recall-precision graph plotted by simply joining the average precision points at the standard recall levels
Average Recall-Precision Graph
Graph for 50 Queries
Focusing on Top Documents
Users tend to look at only the top part of the ranked result list to find relevant documents
Some search tasks have only one relevant document
e.g., navigational search, question answering
Recall not appropriate
instead need to measure how well the search engine does at retrieving relevant documents at very high ranks
Focusing on Top Documents cont.
Precision at Rank R
R typically 5, 10, 20
easy to compute, average, understand
not sensitive to rank positions less than R
Reciprocal Rank
reciprocal of the rank at which the first relevant document is retrieved
Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries
very sensitive to rank position
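A small sketch of precision at rank R and mean reciprocal rank, again over made-up 0/1 relevance lists:

def precision_at(rels, R):
    """Fraction of the top R results that are relevant."""
    return sum(rels[:R]) / R

def reciprocal_rank(rels):
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1 / i
    return 0.0

rankings = [[0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 1]]   # three made-up queries
print([precision_at(r, 5) for r in rankings])                    # [0.4, 0.2, 0.4]
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(round(mrr, 2))                                             # (1/2 + 1 + 1/4) / 3 = 0.58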
Discounted Cumulative Gain
Popular measure for evaluating web search and related tasks
Two assumptions:
Highly relevant documents are more useful than marginally relevant documents
the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
Discounted Cumulative Gain cont.
Uses graded relevance as a measure of the usefulness, or gain, from examining a document
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
Typical discount is 1/log (rank)
With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
Discounted Cumulative Gain:
DCG function
DCG is the total gain accumulated at a particular rank p:
DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)
Alternative formulation:
DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(1 + i)
used by some web search companies
emphasis on retrieving highly relevant documents
DCG Example
10 ranked documents judged on 0-3 relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
DCG (cumulative sum): 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
Normalized DCG
DCG numbers are averaged across a set of queries at specific rank values
e.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61
DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking
makes averaging easier for queries with different numbers of relevant documents
NDCG Example
Perfect ranking:
3, 3, 3, 2, 2, 2, 1, 0, 0, 0
ideal DCG values:
3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
NDCG values (divide actual by ideal):
1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
NDCG ≤ 1 at any rank position
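A sketch of DCG and NDCG using the first DCG formulation above and the graded judgments from the example:

import math

def dcg(rels):
    """Cumulative DCG value at every rank position: rel_1 + sum of rel_i / log2(i)."""
    values, total = [], 0.0
    for i, rel in enumerate(rels, start=1):
        total += rel if i == 1 else rel / math.log2(i)
        values.append(total)
    return values

judged = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]     # graded judgments from the example
ideal = sorted(judged, reverse=True)        # perfect ranking of the same judgments

actual, best = dcg(judged), dcg(ideal)
ndcg = [a / b for a, b in zip(actual, best)]
print([round(v, 2) for v in actual])        # ~ [3, 5, 6.89, ..., 9.61]
print([round(v, 2) for v in ndcg])          # ~ [1.0, 0.83, 0.87, ..., 0.88]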
Using Preferences
Two rankings described using preferences can be compared using the Kendall tau coefficient (τ):
τ = (P − Q) / (P + Q)
where P is the number of preferences that agree and Q is the number that disagree
For preferences derived from binary relevance judgments, can use BPREF
User preferences can be inferred from query logs.
Preferences have been used for training ranking algorithms, and have been suggested as an alternative to relevance judgments for evaluation
For example, if there were 15 preferences learned from clickthrough data, and a ranking agreed with 10 of these, the τ measure would be
(10 − 5)/15 = 0.33
Although this seems reasonable, no studies are available that show that this effectiveness measure is useful for comparing systems.
For a query with R relevant documents, only the first R non-relevant documents are considered:
BPREF = (1/R) Σ_{d_r} (1 − N_{d_r} / R)
where d_r is a relevant document, and N_{d_r} gives the number of non-relevant documents (from the set of R non-relevant documents that are considered) that are ranked higher than d_r.
Alternative definition
The BPREF measure has been shown to be robust with partial information and to give similar results (in terms of system comparisons) to recall-precision measures such as MAP.
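A small sketch of τ from preference counts, plus BPREF under the definition above (the ranking and judgments are made up):

def kendall_tau(P, Q):
    """P preferences agree, Q disagree: tau = (P - Q) / (P + Q)."""
    return (P - Q) / (P + Q)

print(round(kendall_tau(10, 5), 2))    # the 15-preference example: 0.33

def bpref(ranking, relevant):
    """ranking: doc ids in rank order; relevant: set of R relevant doc ids.
    Only the first R non-relevant documents retrieved are counted."""
    R = len(relevant)
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc in relevant:
            total += 1 - min(nonrel_seen, R) / R
        elif nonrel_seen < R:
            nonrel_seen += 1
    return total / R

print(round(bpref(["d1", "d2", "d3", "d4", "d5", "d6"], {"d1", "d4", "d6"}), 2))   # 0.44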
5. Efficiency Metrics
6. Significance Tests
Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
The null hypothesis – there is no difference in effectiveness between the two algorithms.
The alternative hypothesis is that there is a difference.
A significance test enables us to
reject the null hypothesis
in favor of the alternative hypothesis (B is better than A)
Otherwise, we say that the null hypothesis cannot be rejected (i.e., B might not be better than A).
We can assume that A is a baseline algorithm (or model) and B is the proposed new algorithm (or model).
Significance Tests cont.
A Type I error is when the null hypothesis is rejected when it is in fact true.
A Type II error is when the null hypothesis is accepted when it is in fact false.
Significance tests are often described by their power – the probability that a test will reject the null hypothesis correctly
A test with high power will reduce the chance of a Type II error.
Increasing the number of queries in the experiment also increases the power of a test, and reduces the chance of a Type I error as well.
If the probability of observing these results under the null hypothesis is very small (e.g., less than 5%), we reject that hypothesis and conclude that ranking algorithm B is more effective than the baseline algorithm A.
We obviously cannot conclude that B is better than A on the basis of the results of a single query, since A may be better than B on all other queries. So how many queries do we have to look at to make a decision about which is better? If, for example, B is better than A for 90% of 200 queries in a test collection, we should be more confident that B is better for that effectiveness measure, but how confident? Significance tests allow us to quantify the confidence we have in any judgment about effectiveness.
Significance Tests
One-Sided or Two-sided Tests
A one-sided or one-tailed test is used to establish that B is better than the baseline A; a two-sided or two-tailed test is used to establish that there is a difference between B and A,
where the “side” or “tail” referred to is the tail of a probability distribution.
The following distribution is for the possible values of a test statistic assuming the null hypothesis
shaded area is the region of rejection
The significance tests most commonly used are the t-test, the Wilcoxon signed-rank test, and the sign test.
Example Experimental Results
Assumption is that the difference between the effectiveness values is a sample from a normal distribution
Null hypothesis is that the mean of the distribution of differences is zero
Test statistic (paired t-test):
t = d̄ / (σ_d / √N)
where d̄ is the mean of the differences, σ_d is the standard deviation of the differences, and N is the number of queries; for the example data, t is computed from the per-query differences between B and A.
If the p-value is very small, then either the null hypothesis is false or something unlikely has occurred. In a formal significance test, the null hypothesis H0 is rejected if the p-value is less than a predefined threshold value α, which is commonly set to 0.05.
The following formulae show how to calculate the p-value from a t-test. By cdf_{t,d} we denote the cumulative distribution function of the Student's t distribution with d degrees of freedom:
p-value from a left-tailed t-test:
p-value = cdf_{t,d}(t_score)
p-value from a right-tailed t-test:
p-value = 1 − cdf_{t,d}(t_score)
p-value from a two-tailed t-test:
p-value = 2 · cdf_{t,d}(−|t_score|)
or equivalently: p-value = 2 − 2 · cdf_{t,d}(|t_score|)
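A sketch of the paired t-test on per-query effectiveness scores (the numbers are illustrative; scipy.stats.ttest_rel computes the paired t statistic and a two-sided p-value):

from scipy import stats

# Paired t-test comparing algorithm B against baseline A over the same queries.
a_scores = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]   # per-query scores for A (illustrative)
b_scores = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]   # per-query scores for B (illustrative)

t_score, p_two_sided = stats.ttest_rel(b_scores, a_scores)
p_one_sided = p_two_sided / 2 if t_score > 0 else 1 - p_two_sided / 2     # one-tailed: is B better than A?

print(round(t_score, 2), round(p_one_sided, 3))
if p_one_sided < 0.05:
    print("Reject the null hypothesis: B appears more effective than A.")
else:
    print("The null hypothesis cannot be rejected.")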
Setting Parameter Values
Retrieval models often contain parameters that must be tuned to get best performance for specific types of data and queries
For experiments:
Use training and test data sets
If less data available, use cross-validation by partitioning the data into K subsets
Using training and test data avoids overfitting – when parameter values do not generalize well to other data
Evaluation results can also be used to improve the new algorithm by setting better parameter values.
Finding Parameter Values
Many techniques used to find optimal parameter values given training data
standard problem in machine learning
In IR, often explore the space of possible parameter values by brute force
requires large number of retrieval runs with small variations in parameter values (parameter sweep)
SVM optimization is an example of an efficient procedure for finding good parameter values with large numbers of parameters
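A minimal sketch of a brute-force parameter sweep over training queries; run_retrieval and mean_average_precision are hypothetical placeholders for the retrieval run and the effectiveness measure used in a real experiment:

# Brute-force parameter sweep, e.g. over BM25's k1 and b (hypothetical helper functions).
def parameter_sweep(train_queries, judgments, run_retrieval, mean_average_precision):
    best_params, best_score = None, -1.0
    for k1 in [0.8, 1.0, 1.2, 1.5, 2.0]:          # candidate k1 values
        for b in [0.0, 0.25, 0.5, 0.75, 1.0]:     # candidate b values
            rankings = run_retrieval(train_queries, k1=k1, b=b)
            score = mean_average_precision(rankings, judgments)
            if score > best_score:
                best_params, best_score = (k1, b), score
    # The chosen values should then be evaluated once on held-out test queries
    # (or averaged over folds in cross-validation) to avoid overfitting.
    return best_params, best_score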
Online Testing
Test (or even train) using live traffic on a search engine
real users, less biased, large amounts of test data
Drawbacks:
noisy data, can degrade user experience
Often done on small proportion (1-5%) of live traffic
No single measure is the correct one for any application
choose measures appropriate for task
use a combination
shows different aspects of the system effectiveness
Use significance tests (t-test)
Analyze performance of individual queries
Chapter 8 in W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010.
Text Mining and User Information Needs