CE306/CE706 Information Retrieval
Evaluation
Dr Alba García Seco de Herrera
Brief Module Outline (Reminder)
§ Motivation + introduction
§ Processing pipeline / indexing + query processing
§ Large-scale open-source search tools
§ Information Retrieval models
§ Evaluation
§ Log analysis
§ User profiles / personalisation / contextualisation
§ IR applications, e.g. enterprise search
Information Retrieval Evaluation
§ Questions that arise:
§ How good is a particular IR system?
§ How does it compare to alternative systems?
§ What is the best system for a particular application?
Information Retrieval Evaluation
§ Some measures that might help:
§ Accuracy
§ User satisfaction
§ Price
§ Response time
§ Search time
§ ???
Information Retrieval Evaluation
§ Evaluation is key to building effective and efficient search engines
§ measurement usually carried out in controlled
laboratory experiments
§ online testing can also be done
§ Effectiveness, efficiency and cost are related
§ e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
§ efficiency and cost targets may impact effectiveness
Information Retrieval
§ Given a query and a corpus, find relevant documents
§ query: user’s textual description of their information need
§ corpus: a repository of textual documents
§ relevance: satisfaction of the user’s information need
… we will focus on effectiveness measures
IR Effectiveness
§ Precision = tp/(tp+fp)
§ Recall = tp/(tp+fn)
IR Effectiveness (put differently)
§ A is a set of relevant documents
§ B is a set of retrieved documents
Precision and Recall
§ Precision = (no. relevant docs retrieved) / (no. docs retrieved)
§ Recall = (no. relevant docs retrieved) / (total no. relevant docs)
§ Precision indicates what proportion of the documents returned are relevant
§ Recall indicates what proportion of all relevant documents are retrieved
§ Both precision and recall vary between zero and one
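To make the two definitions concrete, here is a minimal Python sketch (not part of the original slides); the document-ID sets used in the example are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant  # relevant documents that were retrieved
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 5 documents are relevant in total
print(precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5"}))
# -> (0.75, 0.6)
```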
Should we use accuracy?
§ Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”
§ The accuracy of an engine: the fraction of these classifications that are correct
§ (tp + tn) / ( tp + fp + fn + tn)
§ Accuracy is a commonly used evaluation measure in machine learning classification work
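For completeness, a small sketch of the accuracy computation above; the counts are invented and chosen to illustrate a collection in which almost every document is non-relevant, which is the typical IR setting:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/non-relevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical collection of 10,000 documents, only 20 of them relevant.
# An engine that retrieves nothing at all still scores very highly:
print(accuracy(tp=0, fp=0, fn=20, tn=9980))  # 0.998, even though recall is 0
```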
Information Retrieval Evaluation
§ Evaluation is a fundamental issue of information retrieval
§ an area of IR research in its own right (much of it due to uncertainty/ambiguity)
§ Often the only way to show that A is better than B, is through extensive experimentation
§ Evaluation methods:
§ batch evaluation
§ user-study evaluation
§ online evaluation
§ Each method has advantages and disadvantages
Batch Evaluation Overview
§ Collect a set of queries (to test average performance)
§ Construct a more complete description of the information being sought for each query
Batch Evaluation
Overview: query + description (example)
§ QUERY: pet therapy
§ DESCRIPTION: Relevant documents must include details of how pet- and animal-assisted therapy is or has been used. Relevant details include information about pet therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it.
... a TREC example (more on that later)
Batch Evaluation Overview
§ Using these descriptions, have human judges determine which documents are relevant for each query
§ Evaluate systems based on their ability to retrieve the relevant documents for these queries
§ evaluation metric: a measurement that quantifies the quality of a particular ranking of results with known relevant/non-relevant documents
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ rank of the first relevant document
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ precision at rank 10
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ precision at rank 1
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ recall at rank 10
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ precision at rank 30
Batch Evaluation Overview: trade-offs
§ Advantages:
§ inexpensive (once the test collection is constructed)
§ the experimental condition is fixed; same queries, and same relevance judgements
§ evaluations are reproducible; keeps us “honest”
§ by experimenting on the same set of queries and judgements, we can better understand how system A is better than B
Batch Evaluation Overview: trade-offs
§ Disadvantages:
§ high initial cost. human assessors (the ones who judge documents relevant/non-relevant) are expensive
§ human assessors are not the users; judgements are made “out of context”
§ assumes that relevance is the same, independent of the user and the user’s context
Batch Evaluation Overview: trade-offs
§ Many factors affect whether a document satisfies a particular user’s information need
§ Topicality, novelty, freshness, formatting, reading level, assumed level of expertise (remember?)
§ Topical relevance: the document is on the same topic as the query
§ User relevance: everything else
§ Which kind of relevance does batch evaluation address?
§ Whether the document contains the sought-after information
User-Study Evaluation Overview
§ Provide a small set of users with several retrieval systems
§ Ask them to complete several (potentially different) search tasks
§ Learn about system performance by:
§ observing what they do
§ asking about their actions and thought processes
§ measuring success (task completion, time, etc.)
§ measuring perceived success (questionnaire data)
§ We will present an Essex-based example soon ...
User-Study Evaluation Overview: trade-offs
§ Advantages:
§ very detailed data about users’ reactions to systems
§ in reality, a search is done to accomplish a higher-level task
§ in user studies, this task can be manipulated and studied
§ in other words, the experimental ‘starting-point’ need not be the query
User-Study Evaluation Overview: trade-offs
§ Disadvantages:
§ user studies are expensive (paying users/subjects, scientists’ time, data coding)
§ difficult to generalize from small studies to broad populations
§ the laboratory setting is not the user’s normal environment
§ need to re-run experiment every time a new system is considered
Online Evaluation Overview
§ Given a search service with an existing user population (e.g., Google, Yandex, Bing) ...
§ Have x% of query traffic use system A and y% of query traffic use system B
§ Compare system effects on logged user interactions (implicit feedback)
§ clicks: surrogates for perceived relevance (good)
§ skips: surrogates for perceived non-relevance (bad)
§ Happens every time you submit a Web search query!
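As an illustration only, one common way such a traffic split can be implemented is deterministic hashing of a user or session identifier into buckets; the function below is a hypothetical sketch, not a description of any particular search engine:

```python
import hashlib

def assign_system(user_id: str, percent_b: float = 10.0) -> str:
    """Deterministically route a user to system A or B based on a hash of their ID."""
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "B" if bucket < percent_b else "A"

print(assign_system("user-42"))  # the same user always sees the same system
```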
Implicit feedback
click!
can we say that
the first result is more relevant than the second?
Implicit feedback
skip!
click!
can we say that the second result is more relevant than the first?
Implicit feedback
skip!
skip! skip!
click!
can we say that the fourth result is more relevant than the first/second/third?
Implicit feedback
click!
a click is a noisy surrogate for relevance!
Implicit feedback
user sees the results and closes the browser
Implicit feedback
the absence of a click is a noisy surrogate for non- relevance
Online Evaluation Overview: trade-offs
§ Advantages:
§ system usage is naturalistic; users are situated in their natural context and often don’t know that a test is being conducted
§ evaluation can include lots of users
Online Evaluation Overview: trade-offs
§ Disadvantages:
§ requires a service with lots of users (enough of them to potentially hurt performance for some)
§ requires a good understanding of how different implicit feedback signals correlate with positive and negative user experiences
§ experiments are difficult to repeat
Information Retrieval Evaluation
§ Evaluation is a fundamental issue of IR
§ an area of IR research in its own right
§ Evaluation methods:
§ batch evaluation
§ user-study evaluation
§ online evaluation
§ Each method has advantages and disadvantages
Evaluation Metrics
§ Remember: we focus on effectiveness measures
§ Many of the metrics are ultimately derived from precision and recall
§ For example, the F-measure (F1), the harmonic mean of precision and recall: F1 = 2PR / (P + R)
Question 1: What is the trade-off between P & R ?
Question 2: Why use the harmonic mean for F1 rather than the arithmetic mean?
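As a hint towards question 2, a short sketch (illustrative, with made-up numbers) comparing the harmonic mean (F1) with the arithmetic mean for a system whose precision and recall are very unbalanced:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Near-perfect precision but almost no recall:
p, r = 0.99, 0.01
print((p + r) / 2)  # arithmetic mean = 0.5, looks respectable
print(f1(p, r))     # harmonic mean ~= 0.02, exposes the poor recall
```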
Ranked Retrieval Precision and recall
§ In most situations, the system outputs a ranked list of documents rather than an unordered set
§ User behaviour assumption:
§ The user examines the output ranking from top to bottom until he/she is satisfied or gives up
§ Precision/Recall @ rank K
Ranked Retrieval Precision and recall: example
§ Assume 20 relevant documents
K    P@K             R@K
1    (1/1)  = 1.00   (1/20) = 0.05
2    (1/2)  = 0.50   (1/20) = 0.05
3    (2/3)  = 0.67   (2/20) = 0.10
4    (3/4)  = 0.75   (3/20) = 0.15
5    (4/5)  = 0.80   (4/20) = 0.20
6    (5/6)  = 0.83   (5/20) = 0.25
7    (6/7)  = 0.86   (6/20) = 0.30
8    (6/8)  = 0.75   (6/20) = 0.30
9    (7/9)  = 0.78   (7/20) = 0.35
10   (7/10) = 0.70   (7/20) = 0.35
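A minimal sketch of P@K and R@K for a ranked list; the 0/1 relevance pattern below is the one implied by the table above, and 20 is the total number of relevant documents assumed in the example:

```python
def precision_at_k(relevance, k):
    """Proportion of the top-k results that are relevant."""
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    """Proportion of all relevant documents found in the top-k results."""
    return sum(relevance[:k]) / total_relevant

ranking = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0]  # relevance of the top 10 results
for k in (1, 5, 10):
    print(k, precision_at_k(ranking, k), recall_at_k(ranking, k, total_relevant=20))
# 1 1.0 0.05 / 5 0.8 0.2 / 10 0.7 0.35
```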
Ranked Retrieval P/R @ K
§ Advantages:
§ easy to compute
§ easy to interpret
§ Disadvantages:
§ the value of K has a huge impact on the metric
§ how do we pick K?
Ranked Retrieval motivation: average precision
§ Ideally, we want the system to achieve high precision for varying values of K
§ The metric average precision accounts for precision and recall without having to set K
Ranked Retrieval average precision
1. Go down the ranking one rank at a time
2. If the document at rank K is relevant, measure P@K
§ proportion of top-K documents that are relevant
3. Finally, take the average of the P@K values
§ the number of P@K values will equal the number of relevant documents
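A minimal sketch of this procedure, assuming a binary 0/1 relevance list for the ranking and the total number of relevant documents for the query:

```python
def average_precision(relevance, total_relevant):
    """Average of P@K taken at every rank K that holds a relevant document."""
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # P@K at this relevant rank
    # relevant documents that are never retrieved contribute a precision of 0
    return sum(precisions) / total_relevant if total_relevant else 0.0

# The ranking used in the worked examples that follow (10 relevant documents in total)
ranking = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(round(average_precision(ranking, total_relevant=10), 2))  # 0.76
```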
Ranked Retrieval average precision
[Worked example: a ranking of 20 documents for a query with 10 relevant documents; a running count of the relevant documents retrieved so far is kept at each rank K.]
Ranked Retrieval average precision
[Worked example, annotated: go down the ranking one rank at a time; if recall goes up at rank K (i.e. the document at rank K is relevant), calculate P@K; when recall reaches 1.0, average the P@K values.]
Ranked Retrieval average precision
rank K  relevant?  R@K   P@K
1       yes        0.10  1.00
2       no         0.10  0.50
3       yes        0.20  0.67
4       yes        0.30  0.75
5       yes        0.40  0.80
6       yes        0.50  0.83
7       yes        0.60  0.86
8       no         0.60  0.75
9       yes        0.70  0.78
10      no         0.70  0.70
11      yes        0.80  0.73
12      no         0.80  0.67
13      no         0.80  0.62
14      yes        0.90  0.64
15      no         0.90  0.60
16      no         0.90  0.56
17      no         0.90  0.53
18      no         0.90  0.50
19      no         0.90  0.47
20      yes        1.00  0.50
average precision = mean of P@K at the 10 relevant ranks = 0.76
Ranked Retrieval average precision
[Worked example, best case: all 10 relevant documents appear at ranks 1-10, so P@K = 1.00 at every relevant rank and average precision = 1.00.]
Ranked Retrieval average precision
[Worked example, worst case: the 10 relevant documents appear at ranks 11-20, so the P@K values at the relevant ranks range from 0.09 to 0.50 and average precision = 0.33.]
Ranked Retrieval average precision
[Worked example: the original ranking again (average precision = 0.76), shown for comparison with the next example.]
Ranked Retrieval average precision
[Worked example: ranks 2 and 3 swapped, so a relevant document moves from rank 3 up to rank 2; average precision rises from 0.76 to 0.79.]
Ranked Retrieval average precision
[Worked example: the original ranking again (average precision = 0.76), shown for comparison with the next example.]
Ranked Retrieval average precision
[Worked example: ranks 8 and 9 swapped, so a relevant document moves from rank 9 up to rank 8; average precision rises from 0.76 to only 0.77, a smaller change because the swap happens further down the ranking.]
Ranked Retrieval average precision
§ Advantages:
§ no need to choose K
§ accounts for both precision and recall
§ mistakes at the top are more influential
§ mistakes at the bottom are still accounted for
§ Disadvantages:
§ not quite as easy to interpret as P/R@K
§ recall can be difficult to assess in large collections
Ranked Retrieval
MAP: mean average precision
§ So far, we’ve talked about average precision for a single query
§ Mean Average Precision (MAP): average precision averaged across a set of queries
§ MAP is one of the most common metrics in IR evaluation
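A minimal sketch of MAP, reusing the average_precision function sketched earlier; per_query_runs is assumed to hold one (relevance list, total relevant) pair per query:

```python
def mean_average_precision(per_query_runs):
    """per_query_runs: list of (relevance_list, total_relevant) pairs, one per query."""
    return sum(average_precision(rel, n) for rel, n in per_query_runs) / len(per_query_runs)
```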
Ranked Retrieval precision-recall curves
§ In some situations, we want to understand the trade-off between precision and recall
§ A precision-recall (PR) curve expresses precision as a function of recall
Ranked Retrieval precision-recall curves: general idea
§ Different tasks require different levels of recall
§ Sometimes, the user wants a few relevant documents
§ Other times, the user wants most of them
§ Suppose a user wants some level of recall R
§ The goal for the system is to minimize the number of non-relevant documents (false positives) the user must look at in order to achieve a level of recall R
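A hedged sketch of how the points of a (non-interpolated) precision-recall curve can be computed from a single ranking, taking one (recall, precision) point at each relevant document; interpolation variants are not covered here:

```python
def precision_recall_points(relevance, total_relevant):
    """(recall, precision) points measured at each relevant document in the ranking."""
    points, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / k))
    return points
```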
Ranked Retrieval precision-recall curves
Ranked Retrieval precision-recall curves
which system is better?
Ranked Retrieval other effectiveness metrics
§ In some retrieval tasks, we really want to focus on precision at the top of the ranking
§ A classic example is Web search
§ users rarely care about recall
§ users rarely navigate beyond the first page of results
§ Discounted Cumulative Gain (DCG) gives more weight to the top-ranked results
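A minimal sketch of one common DCG formulation, in which the gain at rank i is discounted by log2(i + 1); the graded relevance values in the example are invented for illustration:

```python
import math

def dcg_at_k(gains, k):
    """DCG@k with the gain at rank i discounted by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

# Graded relevance of the top 5 results (0 = non-relevant, higher = more relevant)
print(round(dcg_at_k([3, 2, 3, 0, 1], k=5), 2))  # ~6.15
```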
Metric review
§ set-retrieval evaluation: we want to evaluate the set of documents retrieved by the system, without considering the ranking
§ ranked-retrieval evaluation: we want to evaluate the ranking of documents returned by the system
Metric Review set-retrieval evaluation
§ precision: the proportion of retrieved documents that are relevant
§ recall: the proportion of relevant documents that are retrieved
§ f-measure: harmonic-mean of precision and recall
§ a difficult metric to “cheat” by getting very high precision and abysmal recall (and vice versa)
Metric Review ranked-retrieval evaluation
§ P@K: precision under the assumption that the top-K results is the ‘set’ retrieved
§ R@K: recall under the assumption that the top-K results is the ‘set’ retrieved
§ average-precision: average of P@K values for every K where recall increases
§ DCG: ignores recall, considers multiple levels of relevance, and focuses on the top ranks
Evaluation Campaigns
§ Several large-scale (annual) evaluation campaigns exist that allow you to enter your own system and compare against others
§ Metrics tend to be the ones we discussed (effectiveness)
§ Examples:
§ Text REtrieval Conference (TREC) ... US-based
§ Conference and Labs of the Evaluation Forum (CLEF) ... European
§ MediaEval Benchmarking Initiative for Multimedia Evaluation ... European
TREC
§ Annual conference to evaluate IR systems (since 1992)
§ Forum for IR researchers
§ TREC data collections are large (e.g. ClueWeb09, ClueWeb12 - about 1 billion Web pages)
§ Number of different tracks, examples at TREC 2021:
§ Conversational Assistance Track
§ Deep Learning Track
§ Health Misinformation Track
§ Podcasts Track
CLEF
§ Annual conference to evaluate cross-language IR systems (since 2000)
§ Forum for IR researchers
§ Number of different tracks, examples at CLEF 2021:
§ Answer Retrieval for Questions on Math
§ Living Labs for Academic Search
§ ImageCLEF: Multimedia Retrieval in Medicine, Lifelogging, and Internet
§ Touché: Argument Retrieval
§ LifeCLEF: Multimedia Retrieval in Nature
MediaEval
§ Annual conference to evaluate multimedia IR systems (since 2010)
§ The "multi" in multimedia: speech, audio, visual content, tags, users, context
§ Number of different tracks, examples at MediaEval 2020:
§ FakeNews: Corona virus and 5G conspiracy
§ Predicting Media Memorability
§ Insight for Wellbeing: Multimodal personal health lifelog data analysis
§ Emotion and Theme recognition in music using Jamendo
Caption task
@ImageCLEF2021
§ Goal: image understanding by aligning visual content and textual descriptors to interpret medical images
§ Example concepts: C0006141 Breast; C0024671 Mammography; C0006826 Malignant Neoplasms; C0772294 Epinastine Hydrochloride; C0242114 Suspicion; C4282132 Malignancy; C0221198 Lesion
Mailing list: groups.google.com/d/forum/imageclefcaption
Coral task
@ImageCLEF2021
§ Goal: Coral Reef Image Annotation and Localisation
Mailing list: https://groups.google.com/d/forum/imageclefcoral
Predicting media memorability
@Mediaeval2020
§ Goal: predicting how memorable a video is to viewers
§ Textual caption
https://multimediaeval.github.io/editions/2020/tasks/memorability/
Evaluation Campaigns
§ Systematic and quantitative evaluation
§ Shared tasks on shared resources
§ Uniform scoring procedures, i.e. same conditions for each participant:
§ Collection of “documents”
§ Example information requests
§ Relevant “documents” for each request
Evaluation Campaign Tools we need
§ Uniform scoring procedures, i.e. same conditions for each participant:
§ Collection of documents (the “dataset”)
§ Example information requests
§ A set of questions/queries/topics
§ Relevant documents for each request
§ a decision: relevant or not relevant
Evaluation Campaigns Cycle
Task definition
Data preparation
Results analysis
Topic definition
Results evaluation
Experiments submission
Scientific production
Evaluation Campaign Relevance assessment
§ Who says which documents are relevant and which not?
§ Ideally: sit down and look at all documents
§ Pooled assessment
§ Each system finds some relevant documents
§ Different systems find different relevant
documents
§ Together, enough systems will find most of them
§ Assumption: if there are enough systems being tested and not one of them returned a document – the document is not relevant
Evaluation Campaign Relevance assessment: pooling
§ Combine the results retrieved by all systems
§ Choose a parameter k (typically 100)
§ Choose the top k documents as ranked in
each submitted run
§ The pool is the union of these sets of docs
§ Give pool to judges for relevance assessments
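A minimal sketch of the pooling step, assuming each run is simply a ranked list of document IDs; the resulting pool is what would be handed to the human assessors:

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from each submitted run."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool

# Two toy runs pooled with k=2: the judges would assess {'d1', 'd2', 'd5'}
print(build_pool([["d1", "d2", "d3"], ["d5", "d1", "d4"]], k=2))
```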
Evaluation Campaign Relevance assessment: pooling
§ Advantages of pooling:
§ Fewer documents must be manually
assessed for relevance
§ Disadvantages of pooling:
§ Can’t be certain that all documents satisfying
the query are found (recall values may not be
accurate)
§ Runs that did not participate in the pooling
may be disadvantaged
§ If only one run finds certain relevant
documents, but ranked lower than 100, it will not get credit for these.
Evaluation Campaign Relevance assessment
§ Not only are we incomplete, but we might also be inconsistent in our judgments!
Evaluation Campaign Relevance assessment
§ Good news:
§ “idiosyncratic nature of relevance judgments
does not affect comparative results” (E.
Voorhees)
§ Similar results held for:
§ Different query sets
§ Different evaluation measures
§ Different assessor types
§ Single opinion vs. group opinion judgments
...wait for the 2nd assignment
Summary
§ We have seen how to evaluate IR systems
§ We have seen evaluation methods (batch, user-study, online) and metrics (precision, recall,...)
§ We have explored the cycle of evaluation campaigns
§ Relevance assessments are needed for the consistent evaluation of IR systems
Reading
§ Chapter 8
§ Fuhr, N. "Some Common Mistakes In IR Evaluation, And How They Can Be Avoided”. SIGIR Forum 51(3), December 2017.
§ Next week:
§ Chapter 3
§ Sections 4.4 and 4.5
§ Section 7.5
Henrik Nordmark
§ 4pm!
§ Profusion
§ MSc Data Science
§ Information on paid placement opportunities
Acknowledgements
§ Thanks again to Udo Kruschwitz and Fernando Diaz, whose material these slides substantially draw on
§ Additional material as provided by Croft et al. (2015)