CE306/CE706 Information Retrieval
Evaluation
Dr Alba García Seco de Herrera
Brief Module Outline (Reminder)
§ Motivation + introduction
§ Processing pipeline / indexing + query processing
§ Large-scale open-source search tools
§ Information Retrieval models
§ Evaluation
§ Log analysis
§ User profiles / personalisation / contextualisation
§ IR applications, e.g. enterprise search
Information Retrieval Evaluation
§ Questions that arise:
§ How good is a particular IR system?
§ How does it compare to alternative systems?
§ What is the best system for a particular application?
Information Retrieval Evaluation
§ Some measures that might help:
§ Accuracy
§ User satisfaction
§ Price
§ Response time
§ Search time
§ ???
Information Retrieval Evaluation
§ Evaluation is key to building effective and efficient search engines
§ measurement usually carried out in controlled
laboratory experiments
§ online testing can also be done
§ Effectiveness, efficiency and cost are related
§ e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
§ efficiency and cost targets may impact effectiveness
Information Retrieval
§ Given a query and a corpus, find relevant documents
§ query: user’s textual description of their information need
§ corpus: a repository of textual documents
§ relevance: satisfaction of the user’s information need
… we will focus on effectiveness measures
IR Effectiveness
§ Precision = tp/(tp+fp)
§ Recall = tp/(tp+fn)
IR Effectiveness (put differently)
§ A is a set of relevant documents
§ B is a set of retrieved documents
Precision and Recall
§ Precision = (no. relevant docs retrieved) / (no. docs retrieved)
§ Recall = (no. relevant docs retrieved) / (total no. relevant docs)
§ Precision indicates what proportion of the documents returned are relevant
§ Recall indicates what proportion of all relevant documents are retrieved
§ Both precision and recall vary between zero and one
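To make the two definitions concrete, here is a minimal Python sketch (not part of the original slides); the document-ID sets used in the example are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant  # relevant documents that were retrieved
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 5 documents are relevant in total
print(precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5"}))
# -> (0.75, 0.6)
```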
Should we use accuracy?
§ Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”
§ The accuracy of an engine: the fraction of these classifications that are correct
§ (tp + tn) / ( tp + fp + fn + tn)
§ Accuracy is a commonly used evaluation measure in machine learning classification work
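For completeness, a small sketch of the accuracy computation above; the counts are invented and chosen to illustrate a collection in which almost every document is non-relevant, which is the typical IR setting:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/non-relevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical collection of 10,000 documents, only 20 of them relevant.
# An engine that retrieves nothing at all still scores very highly:
print(accuracy(tp=0, fp=0, fn=20, tn=9980))  # 0.998, even though recall is 0
```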
Information Retrieval Evaluation
§ Evaluation is a fundamental issue of information retrieval
§ an area of IR research in its own right (much of it due to uncertainty/ambiguity)
§ Often the only way to show that A is better than B, is through extensive experimentation
§ Evaluation methods:
§ batch evaluation
§ user-study evaluation
§ online evaluation
§ Each method has advantages and disadvantages
Batch Evaluation Overview
§ Collect a set of queries (to test average performance)
§ Construct a more complete description of the information being sought for each query
Batch Evaluation
Overview: query + description (example)
§ QUERY: pet therapy
§ DESCRIPTION: Relevant documents must include details of how pet- and animal-assisted therapy is or has been used. Relevant details include information about pet therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it.
... a TREC example (more on that later)
Batch Evaluation Overview
§ Using these descriptions, have human judges determine which documents are relevant for each query
§ Evaluate systems based on their ability to retrieve the relevant documents for these queries
§ evaluation metric: a measurement that quantifies the quality of a particular ranking of results with known relevant/non-relevant documents
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ rank of the first relevant document
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ precision at rank 10
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ precision at rank 1
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ recall at rank 10
Batch Evaluation Overview: metrics
§ Which ranking is better?
§ precision at rank 30
Batch Evaluation Overview: trade-offs
§ Advantages:
§ inexpensive (once the test collection is constructed)
§ the experimental condition is fixed; same queries, and same relevance judgements
§ evaluations are reproducible; keeps us “honest”
§ by experimenting on the same set of queries and judgements, we can better understand how system A is better than B
Batch Evaluation Overview: trade-offs
§ Disadvantages:
§ high initial cost. human assessors (the ones who judge documents relevant/non-relevant) are expensive
§ human assessors are not the users; judgements are made “out of context”
§ assumes that relevance is the same, independent of the user and the user’s context
Batch Evaluation Overview: trade-offs
§ Many factors affect whether a document satisfies a particular user’s information need
§ Topicality, novelty, freshness, formatting, reading level, assumed level of expertise (remember?)
§ Topical relevance: the document is on the same topic as the query
§ User relevance: everything else
§ Which kind of relevance does batch evaluation address?
§ Whether the document contains the sought-after information
User-Study Evaluation Overview
§ Provide a small set of users with several retrieval systems
§ Ask them to complete several (potentially different) search tasks
§ Learn about system performance by:
§ observing what they do
§ asking about their actions and thought processes
§ measuring success (task completion, time, etc.)
§ measuring perceived success (questionnaire data)
§ We will present an Essex-based example soon ...
User-Study Evaluation Overview: trade-offs
§ Advantages:
§ very detailed data about users’ reactions to systems
§ in reality, a search is done to accomplish a higher-level task
§ in user studies, this task can be manipulated and studied
§ in other words, the experimental ‘starting-point’ need not be the query
User-Study Evaluation Overview: trade-offs
§ Disadvantages:
§ user studies are expensive (paying users/subjects, scientists’ time, data coding)
§ difficult to generalize from small studies to broad populations
§ the laboratory setting is not the user’s normal environment
§ need to re-run experiment every time a new system is considered
Online Evaluation Overview
§ Given a search service with an existing user population (e.g., Google, Yandex, Bing) ...
§ Have x% of query traffic use system A and y% of query traffic use system B
§ Compare system effects on logged user interactions (implicit feedback)
§ clicks: surrogates for perceived relevance (good)
§ skips: surrogates for perceived non-relevance (bad)
§ Happens every time you submit a Web search query!
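As an illustration only, one common way such a traffic split can be implemented is deterministic hashing of a user or session identifier into buckets; the function below is a hypothetical sketch, not a description of any particular search engine:

```python
import hashlib

def assign_system(user_id: str, percent_b: float = 10.0) -> str:
    """Deterministically route a user to system A or B based on a hash of their ID."""
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "B" if bucket < percent_b else "A"

print(assign_system("user-42"))  # the same user always sees the same system
```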
Implicit feedback
click!
can we say that
the first result is more relevant than the second?
Implicit feedback
skip!
click!
can we say that the second result is more relevant than the first?
Implicit feedback
skip!
skip! skip!
click!
can we say that the fourth result is more relevant than the first/second/third?
Implicit feedback
click!
a click is a noisy surrogate for relevance!
Implicit feedback
user sees the results and closes the browser
Implicit feedback
the absence of a click is a noisy surrogate for non- relevance
Online Evaluation Overview: trade-offs
§ Advantages:
§ system usage is naturalistic; users are situated in their natural context and often don’t know that a test is being conducted
§ evaluation can include lots of users
Online Evaluation Overview: trade-offs
§ Disadvantages:
§ requires a service with lots of users (enough of them to potentially hurt performance for some)
§ requires a good understanding of how different implicit feedback signals correlate with positive and negative user experiences
§ experiments are difficult to repeat
Information Retrieval Evaluation
§ Evaluation is a fundamental issue of IR
§ an area of IR research in its own right
§ Evaluation methods:
§ batch evaluation
§ user-study evaluation
§ online evaluation
§ Each method has advantages and disadvantages
Evaluation Metrics
§ Remember: we focus on effectiveness measures
§ Many of the metrics are ultimately derived from precision and recall
§ For example, the F-measure (F1), the harmonic mean of precision and recall: F1 = 2PR / (P + R)
Question 1: What is the trade-off between P & R ?
Question 2: Why use the harmonic mean for F1 rather than the arithmetic mean?
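As a hint towards question 2, a short sketch (illustrative, with made-up numbers) comparing the harmonic mean (F1) with the arithmetic mean for a system whose precision and recall are very unbalanced:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Near-perfect precision but almost no recall:
p, r = 0.99, 0.01
print((p + r) / 2)  # arithmetic mean = 0.5, looks respectable
print(f1(p, r))     # harmonic mean ~= 0.02, exposes the poor recall
```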
Ranked Retrieval Precision and recall
§ In most situations, the system outputs a ranked list of documents rather than an unordered set
§ User behaviour assumption:
§ The user examines the output ranking from top to bottom until he/she is satisfied or gives up
§ Precision/Recall @ rank K
Ranked Retrieval Precision and recall: example
§ Assume 20 relevant documents
K    P@K             R@K
1    (1/1)  = 1.00   (1/20) = 0.05
2    (1/2)  = 0.50   (1/20) = 0.05
3    (2/3)  = 0.67   (2/20) = 0.10
4    (3/4)  = 0.75   (3/20) = 0.15
5    (4/5)  = 0.80   (4/20) = 0.20
6    (5/6)  = 0.83   (5/20) = 0.25
7    (6/7)  = 0.86   (6/20) = 0.30
8    (6/8)  = 0.75   (6/20) = 0.30
9    (7/9)  = 0.78   (7/20) = 0.35
10   (7/10) = 0.70   (7/20) = 0.35
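A minimal sketch of P@K and R@K for a ranked list; the 0/1 relevance pattern below is the one implied by the table above, and 20 is the total number of relevant documents assumed in the example:

```python
def precision_at_k(relevance, k):
    """Proportion of the top-k results that are relevant."""
    return sum(relevance[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    """Proportion of all relevant documents found in the top-k results."""
    return sum(relevance[:k]) / total_relevant

ranking = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0]  # relevance of the top 10 results
for k in (1, 5, 10):
    print(k, precision_at_k(ranking, k), recall_at_k(ranking, k, total_relevant=20))
# 1 1.0 0.05 / 5 0.8 0.2 / 10 0.7 0.35
```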
Ranked Retrieval P/R @ K
§ Advantages:
§ easy to compute
§ easy to interpret
§ Disadvantages:
§ the value of K has a huge impact on the metric
§ how do we pick K?
Ranked Retrieval motivation: average precision
§ Ideally, we want the system to achieve high precision for varying values of K
§ The metric average precision accounts for precision and recall without having to set K
Ranked Retrieval average precision
1. Go down the ranking one rank at a time
2. If the document at rank K is relevant, measure P@K
§ proportion of top-K documents that are relevant
3. Finally, take the average of the P@K values
§ the number of P@K values will equal the number of relevant documents
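A minimal sketch of this procedure, assuming a binary 0/1 relevance list for the ranking and the total number of relevant documents for the query:

```python
def average_precision(relevance, total_relevant):
    """Average of P@K taken at every rank K that holds a relevant document."""
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # P@K at this relevant rank
    # relevant documents that are never retrieved contribute a precision of 0
    return sum(precisions) / total_relevant if total_relevant else 0.0

# The ranking used in the worked examples that follow (10 relevant documents in total)
ranking = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(round(average_precision(ranking, total_relevant=10), 2))  # 0.76
```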
Ranked Retrieval average precision
[Worked example: a ranking of 20 documents for a query with 10 relevant documents; a running count of the relevant documents retrieved so far is kept at each rank K.]
Ranked Retrieval average precision
[Worked example, annotated: go down the ranking one rank at a time; if recall goes up at rank K (i.e. the document at rank K is relevant), calculate P@K; when recall reaches 1.0, average the P@K values.]
Ranked Retrieval average precision
rank K  relevant?  R@K   P@K
1       yes        0.10  1.00
2       no         0.10  0.50
3       yes        0.20  0.67
4       yes        0.30  0.75
5       yes        0.40  0.80
6       yes        0.50  0.83
7       yes        0.60  0.86
8       no         0.60  0.75
9       yes        0.70  0.78
10      no         0.70  0.70
11      yes        0.80  0.73
12      no         0.80  0.67
13      no         0.80  0.62
14      yes        0.90  0.64
15      no         0.90  0.60
16      no         0.90  0.56
17      no         0.90  0.53
18      no         0.90  0.50
19      no         0.90  0.47
20      yes        1.00  0.50
average precision = mean of P@K at the 10 relevant ranks = 0.76
Ranked Retrieval average precision
[Worked example, best case: all 10 relevant documents appear at ranks 1-10, so P@K = 1.00 at every relevant rank and average precision = 1.00.]
Ranked Retrieval average precision
[Worked example, worst case: the 10 relevant documents appear at ranks 11-20, so the P@K values at the relevant ranks range from 0.09 to 0.50 and average precision = 0.33.]
Ranked Retrieval average precision
[Worked example: the original ranking again (average precision = 0.76), shown for comparison with the next example.]
Ranked Retrieval average precision
[Worked example: ranks 2 and 3 swapped, so a relevant document moves from rank 3 up to rank 2; average precision rises from 0.76 to 0.79.]
Ranked Retrieval average precision
[Worked example: the original ranking again (average precision = 0.76), shown for comparison with the next example.]
Ranked Retrieval average precision
[Worked example: ranks 8 and 9 swapped, so a relevant document moves from rank 9 up to rank 8; average precision rises from 0.76 to only 0.77, a smaller change because the swap happens further down the ranking.]
Ranked Retrieval average precision
§ Advantages:
§ no need to choose K
§ accounts for both precision and recall
§ mistakes at the top are more influential
§ mistakes at the bottom are still accounted for
§ Disadvantages:
§ not quite as easy to interpret as P/R@K
§ recall can be difficult to assess in large collections
Ranked Retrieval
MAP: mean average precision
§ So far, we’ve talked about average precision for a single query
§ Mean Average Precision (MAP): average precision averaged across a set of queries
§ MAP is one of the most common metrics in IR evaluation
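A minimal sketch of MAP, reusing the average_precision function sketched earlier; per_query_runs is assumed to hold one (relevance list, total relevant) pair per query:

```python
def mean_average_precision(per_query_runs):
    """per_query_runs: list of (relevance_list, total_relevant) pairs, one per query."""
    return sum(average_precision(rel, n) for rel, n in per_query_runs) / len(per_query_runs)
```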
Ranked Retrieval precision-recall curves
§ In some situations, we want to understand the trade-off between precision and recall
§ A precision-recall (PR) curve expresses precision as a function of recall
Ranked Retrieval precision-recall curves: general idea
§ Different tasks require different levels of recall
§ Sometimes, the user wants a few relevant documents
§ Other times, the user wants most of them
§ Suppose a user wants some level of recall R
§ The goal for the system is to minimize the number of non-relevant documents (false positives) the user must look at in order to achieve a level of recall R
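A hedged sketch of how the points of a (non-interpolated) precision-recall curve can be computed from a single ranking, taking one (recall, precision) point at each relevant document; interpolation variants are not covered here:

```python
def precision_recall_points(relevance, total_relevant):
    """(recall, precision) points measured at each relevant document in the ranking."""
    points, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / k))
    return points
```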
Ranked Retrieval precision-recall curves
Ranked Retrieval precision-recall curves
which system is better?
Ranked Retrieval other effectiveness metrics
§ In some retrieval tasks, we really want to focus on precision at the top of the ranking
§ A classic example is Web search
§ users rarely care about recall
§ users rarely navigate beyond the first page of results
§ Discounted Cumulative Gain (DCG) gives more weight to the top-ranked results
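A minimal sketch of one common DCG formulation, in which the gain at rank i is discounted by log2(i + 1); the graded relevance values in the example are invented for illustration:

```python
import math

def dcg_at_k(gains, k):
    """DCG@k with the gain at rank i discounted by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

# Graded relevance of the top 5 results (0 = non-relevant, higher = more relevant)
print(round(dcg_at_k([3, 2, 3, 0, 1], k=5), 2))  # ~6.15
```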
Metric review
§ set-retrieval evaluation: we want to evaluate the set of documents retrieved by the system, without considering the ranking
§ ranked-retrieval evaluation: we want to evaluate the ranking of documents returned by the system
Metric Review set-retrieval evaluation
§ precision: the proportion of retrieved documents that are relevant
§ recall: the proportion of relevant documents that are retrieved
§ f-measure: harmonic-mean of precision and recall
§ a difficult metric to “cheat” by getting very high precision and abysmal recall (and vice versa)
Metric Review ranked-retrieval evaluation
§ P@K: precision under the assumption that the top-K results is the ‘set’ retrieved
§ R@K: recall under the assumption that the top-K results is the ‘set’ retrieved
§ average-precision: average of P@K values for every K where recall increases
§ DCG: ignores recall, considers multiple levels of relevance, and focuses on the top ranks
Evaluation Campaigns
§ Several large-scale (annual) evaluation campaigns exist that allow you to enter your own system and compare against others
§ Metrics tend to be the ones we discussed (effectiveness)
§ Examples:
§ Text REtrieval Conference (TREC) ... US-based
§ Conference and Labs of the Evaluation Forum (CLEF) ... European
§ MediaEval Benchmarking Initiative for Multimedia Evaluation ... European
TREC
§ Annual conference to evaluate IR systems (since 1992)
§ Forum for IR researchers
§ TREC data collections are large (e.g. ClueWeb09, ClueWeb12 - about 1 billion Web pages)
§ Number of different tracks, examples at TREC 2021:
§ Conversational Assistance Track
§ Deep Learning Track
§ Health Misinformation Track
§ Podcasts Track
CLEF
§ Annual conference to evaluate cross-language IR systems (since 2000)
§ Forum for IR researchers
§ Number of different tracks, examples at CLEF 2021:
§ Answer Retrieval for Questions on Math
§ Living Labs for Academic Search
§ ImageCLEF: Multimedia Retrieval in Medicine, Lifelogging, and Internet
§ Touché: Argument Retrieval
§ LifeCLEF: Multimedia Retrieval in Nature
MediaEval
§ Annual conference to evaluate multimedia IR systems (since 2010)
§ The "multi" in multimedia: speech, audio, visual content, tags, users, context
§ Number of different tracks, examples at MediaEval 2020:
§ FakeNews: Corona virus and 5G conspiracy
§ Predicting Media Memorability
§ Insight for Wellbeing: Multimodal personal health lifelog data analysis
§ Emotion and Theme recognition in music using Jamendo
Caption task
@ImageCLEF2021
§ Goal: image understanding by aligning visual content and textual descriptors to interpret medical images
§ Example concepts: C0006141 Breast; C0024671 Mammography; C0006826 Malignant Neoplasms; C0772294 Epinastine Hydrochloride; C0242114 Suspicion; C4282132 Malignancy; C0221198 Lesion
Mailing list: groups.google.com/d/forum/imageclefcaption
Coral task
@ImageCLEF2021
§ Goal: Coral Reef Image Annotation and Localisation
Mailing list: https://groups.google.com/d/forum/imageclefcoral
Predicting media memorability
@Mediaeval2020
§ Goal: predicting how memorable a video is to viewers
§ Textual caption
https://multimediaeval.github.io/editions/2020/tasks/memorability/
Evaluation Campaigns
§ Systematic and quantitative evaluation
§ Shared tasks on shared resources
§ Uniform scoring procedures, i.e. same conditions for each participant:
§ Collection of “documents”
§ Example information requests
§ Relevant “documents” for each request
Evaluation Campaign Tools we need
§ Uniform scoring procedures, i.e. same conditions for each participant:
§ Collection of documents (the “dataset”)
§ Example information requests
§ A set of questions/queries/topics
§ Relevant documents for each request
§ a decision: relevant or not relevant
Evaluation Campaigns Cycle
Task definition
Data preparation
Results analysis
Topic definition
Results evaluation
Experiments submission
Scientific production
Evaluation Campaign Relevance assessment
§ Who says which documents are relevant and which not?
§ Ideally: sit down and look at all documents
§ Pooled assessment
§ Each system finds some relevant documents
§ Different systems find different relevant
documents
§ Together, enough systems will find most of them
§ Assumption: if there are enough systems being tested and not one of them returned a document – the document is not relevant
Evaluation Campaign Relevance assessment: pooling
§ Combine the results retrieved by all systems
§ Choose a parameter k (typically 100)
§ Choose the top k documents as ranked in
each submitted run
§ The pool is the union of these sets of docs
§ Give pool to judges for relevance assessments
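A minimal sketch of the pooling step, assuming each run is simply a ranked list of document IDs; the resulting pool is what would be handed to the human assessors:

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from each submitted run."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool

# Two toy runs pooled with k=2: the judges would assess {'d1', 'd2', 'd5'}
print(build_pool([["d1", "d2", "d3"], ["d5", "d1", "d4"]], k=2))
```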
Evaluation Campaign Relevance assessment: pooling
§ Advantages of pooling:
§ Fewer documents must be manually
assessed for relevance
§ Disadvantages of pooling:
§ Can’t be certain that all documents satisfying
the query are found (recall values may not be
accurate)
§ Runs that did not participate in the pooling
may be disadvantaged
§ If only one run finds certain relevant
documents, but ranked lower than 100, it will not get credit for these.
Evaluation Campaign Relevance assessment
§ Not only are we incomplete, but we might also be inconsistent in our judgments!
Evaluation Campaign Relevance assessment
§ Good news:
§ “idiosyncratic nature of relevance judgments
does not affect comparative results” (E.
Voorhees)
§ Similar results held for:
§ Different query sets
§ Different evaluation measures
§ Different assessor types
§ Single opinion vs. group opinion judgments
...wait for the 2nd assignment
Summary
§ We have seen how to evaluate IR systems
§ We have seen evaluation methods (batch, user-study, online) and metrics (precision, recall,...)
§ We have explored the cycle of evaluation campaigns
§ Relevance assessments are needed for the consistent evaluation of IR systems
Reading
§ Chapter 8
§ Fuhr, N. "Some Common Mistakes In IR Evaluation, And How They Can Be Avoided”. SIGIR Forum 51(3), December 2017.
§ Next week:
§ Chapter 3
§ Sections 4.4 and 4.5
§ Section 7.5
Henrik Nordmark
§ 4pm!
§ Profusion
§ MSc Data Science
§ Information on paid placement opportunities
Acknowledgements
§ Thanks again to Udo Kruschwitz and Fernando Diaz, whose material these slides substantially draw on
§ Additional material as provided by Croft et al. (2015)