COMP6714: Information Retrieval & Web Search
Introduction to Information Retrieval
Lecture 8: Evaluation
This lecture
§ How do we know if our results are any good?
§ Evaluating a search engine
§ Benchmarks
§ Precision and recall
Sec. 6.2
EVALUATING SEARCH ENGINES
Measures for a search engine
§ How fast does it index
§ Number of documents/hour
§ (Average document size)
§ How fast does it search
§ Latency as a function of index size
§ Expressiveness of query language
§ Ability to express complex information needs
§ Speed on complex queries
§ Uncluttered UI
§ Is it free?
Sec. 8.6
Measures for a search engine
§ All of the preceding criteria are measurable: we can
quantify speed/size
§ we can make expressiveness precise
§ The key measure: user happiness
§ What is this?
§ Speed of response/size of index are factors
§ But blindingly fast, useless answers won’t make a user
happy
§ Need a way of quantifying user happiness
Sec. 8.6
Measuring user happiness
§ Issue: who is the user we are trying to make happy?
§ Depends on the setting
§ Web engine:
§ User finds what they want and returns to the engine
§ Can measure rate of return users
§ User completes their task – search as a means, not an end
§ See Russell http://dmrussell.googlepages.com/JCDL-talk-June-2007-short.pdf
§ eCommerce site: user finds what they want and buys it
§ Is it the end-user, or the eCommerce site, whose happiness we measure?
§ Measure time to purchase, or fraction of searchers who become buyers?
Sec. 8.6.2
Measuring user happiness
§ Enterprise (company/govt/academic): care about “user productivity”
§ How much time do my users save when looking for information?
§ Many other criteria having to do with breadth of access, secure access, etc.
Sec. 8.6.2
Happiness: elusive to measure
§ Most common proxy: relevance of search results
§ But how do you measure relevance?
§ We will detail a methodology here, then examine
its issues
§ Relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
§ Some work on more-than-binary, but not the standard
Sec. 8.1
Evaluating an IR system
§ Note: the information need is translated into a query
§ Relevance is assessed relative to the information need, not the query
§ E.g., Information need: I’m looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
§ Query: wine red white heart attack effective
§ You evaluate whether the doc addresses the information need, not whether it has these words
Sec. 8.1
Standard relevance benchmarks
§ TREC – National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
§ Reuters and other benchmark doc collections used
§ “Retrieval tasks” specified
§ sometimes as queries
§ Human experts mark, for each query and for each doc, Relevant or Nonrelevant
§ or at least for a subset of docs that some system returned for that query
Sec. 8.2
Unranked retrieval evaluation: Precision and Recall
§ Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
§ Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
§ Precision P = tp/(tp + fp)
§ Recall R = tp/(tp + fn)

                   Relevant   Nonrelevant
  Retrieved           tp           fp
  Not Retrieved       fn           tn
Sec. 8.3
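A minimal Python sketch of the definitions above, using invented document IDs and relevance judgments:

```python
# Precision and recall from sets of retrieved and relevant documents.
# The document IDs and judgments below are invented for illustration.

retrieved = {"d1", "d2", "d3", "d4"}          # docs returned by the engine
relevant  = {"d1", "d3", "d7", "d9", "d11"}   # docs judged relevant to the query

tp = len(retrieved & relevant)   # relevant and retrieved
fp = len(retrieved - relevant)   # retrieved but not relevant
fn = len(relevant - retrieved)   # relevant but missed

precision = tp / (tp + fp)       # = tp / |retrieved|  -> 2/4 = 0.50
recall    = tp / (tp + fn)       # = tp / |relevant|   -> 2/5 = 0.40
print(f"P = {precision:.2f}, R = {recall:.2f}")
```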
Should we instead use the accuracy measure for evaluation?
§ Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant”
§ The accuracy of an engine: the fraction of these classifications that are correct
§ (tp + tn) / (tp + fp + fn + tn)
§ Accuracy is a commonly used evaluation measure in machine learning classification work
§ Why is this not a very useful evaluation measure in IR?
Sec. 8.3
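To make the next slide's point concrete, a small sketch with assumed numbers (a collection of 1,000,000 documents, only 5 of them relevant): an engine that retrieves nothing is almost perfectly accurate yet useless to the searcher.

```python
# Accuracy on a highly skewed collection (the numbers are assumptions).
total_docs, num_relevant = 1_000_000, 5

tp, fp = 0, 0                     # the "engine" retrieves nothing
fn = num_relevant                 # every relevant doc is missed
tn = total_docs - num_relevant    # every nonrelevant doc is correctly not retrieved

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.999995 -- looks superb
recall   = tp / (tp + fn)                    # 0.0      -- finds nothing
print(accuracy, recall)
```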
Why not just use accuracy?
§ How to build a 99.9999% accurate search engine on
a low budget….
§ People doing information retrieval want to find something and have a certain tolerance for junk.
Search for:
0 matching results found.
Sec. 8.3
Precision/Recall
§ You can get high recall (but low precision) by
retrieving all docs for all queries!
§ Recall is a non-decreasing function of the number of docs retrieved
§ In a good system, precision decreases as either the number of docs retrieved or recall increases
§ This is not a theorem, but a result with strong empirical confirmation
Sec. 8.3
Difficulties in using precision/recall
§ Should average over large document collection / query ensembles
§ Need human relevance assessments
§ People aren’t reliable assessors
§ Assessments have to be binary
§ Nuanced assessments?
§ Heavily skewed by collection/authorship
§ Results may not translate from one domain to another
Sec. 8.3
A combined measure: F
§ Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):

F = 1 / (α·(1/P) + (1−α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),  where α = 1/(β² + 1)

§ People usually use balanced F1 measure
§ i.e., with β = 1 or α = ½
§ Harmonic mean is a conservative average
§ See C. J. van Rijsbergen, Information Retrieval
Sec. 8.3
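A small sketch of the F formula above; the toy P and R values carry over from the earlier precision/recall sketch.

```python
# Weighted F measure: F = (beta^2 + 1) P R / (beta^2 P + R),
# equivalently 1 / (alpha/P + (1 - alpha)/R) with alpha = 1 / (beta^2 + 1).

def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.4))           # balanced F1 = 2PR/(P+R) = 0.444...
print(f_measure(0.5, 0.4, beta=2.0)) # beta > 1 weights recall more heavily
```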
F1 and other averages
[Figure: “Combined Measures” – minimum, maximum, arithmetic mean, geometric mean, and harmonic mean plotted against precision (0–100), with recall fixed at 70%]
Sec. 8.3
Evaluating ranked results
§ Evaluation of ranked results:
§ The system can return any number of results
§ By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve
Sec. 8.4
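A sketch of how a ranked result list yields the points of such a curve; the relevance flags and the number of relevant documents are invented.

```python
# Precision-recall points from a ranked list (1 = relevant, 0 = nonrelevant).
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0]   # top-8 results, in rank order
total_relevant = 5                            # relevant docs in the collection

points, hits = [], 0
for k, rel in enumerate(ranked_relevance, start=1):
    hits += rel
    points.append((hits / total_relevant, hits / k))   # (recall, precision) at rank k

for r, p in points:
    print(f"recall={r:.2f}  precision={p:.2f}")
```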
A precision-recall curve
[Figure: precision-recall curve for one query; x-axis Recall (0.0–1.0), y-axis Precision (0.0–1.0)]
Sec. 8.4
Averaging over queries
§ A precision-recall graph for one query isn’t a very
sensible thing to look at
§ You need to average performance over a whole
bunch of queries.
§ But there’s a technical issue:
§ Precision-recall calculations place some points on the graph
§ How do you determine a value (interpolate) between the
points?
Sec. 8.4
Interpolated precision
§ Idea: If locally precision increases with increasing
recall, then you should get to count that…
§ So you take the max of the precisions to the right of that recall value
Sec. 8.4
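A sketch of this rule; the (recall, precision) points are the ones produced by the previous sketch.

```python
# Interpolated precision at recall level r: the maximum precision observed
# at any recall >= r ("max of precisions to the right").

def interpolated_precision(points, r):
    """points: (recall, precision) pairs for one query; r: recall level in [0, 1]."""
    return max((p for recall, p in points if recall >= r), default=0.0)

pts = [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75),
       (0.6, 0.6), (0.6, 0.5), (0.8, 0.57), (0.8, 0.5)]

for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(r, interpolated_precision(pts, r))
```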
Evaluation
§ Graphs are good, but people want summary measures!
§ Precision at fixed retrieval level
§ Precision-at-k: precision of the top k results
§ Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages
§ But: averages badly and has an arbitrary parameter k
§ 11-point interpolated average precision
§ The standard measure in the early TREC competitions: take the precision at 11 recall levels varying from 0 to 1 by tenths (0.0, 0.1, …, 1.0), using interpolation (the value at recall 0 is always interpolated!), and average them
§ Evaluates performance at all recall levels
Sec. 8.4
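A sketch of the two summary measures above, reusing the invented ranking and interpolation idea from the earlier sketches.

```python
# Precision-at-k and 11-point interpolated average precision for one query.
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0]              # 1 = relevant, rank order
pts = [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75),
       (0.6, 0.6), (0.6, 0.5), (0.8, 0.57), (0.8, 0.5)]  # (recall, precision)

def precision_at_k(rels, k):
    return sum(rels[:k]) / k                             # e.g. P@5 = 3/5 = 0.6

def interp(r):                                           # max precision at recall >= r
    return max((p for rec, p in pts if rec >= r), default=0.0)

# Average the interpolated precision at recall = 0.0, 0.1, ..., 1.0
eleven_point_ap = sum(interp(i / 10) for i in range(11)) / 11
print(precision_at_k(ranked_relevance, 5), round(eleven_point_ap, 3))
```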
Typical (good) 11 point precisions
§ SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point interpolated precision-recall curve; x-axis Recall (0–1), y-axis Precision (0–1)]
Sec. 8.4
Yet more evaluation measures…
§ Mean average precision (MAP)
§ Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved
§ Avoids interpolation and use of fixed recall levels
§ MAP for a query collection is the arithmetic average
§ Macro-averaging: each query counts equally
§ R-precision
§ If we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate the precision of the top Rel docs returned
§ Perfect system could score 1.0.
Sec. 8.4
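A sketch of MAP and R-precision as defined above; the rankings and relevance sets are invented, and each relevant document that is never retrieved contributes a precision of zero.

```python
# Mean average precision (MAP) and R-precision over a small invented query set.

def average_precision(ranked_ids, relevant):
    """Average of the precision at the rank of each relevant doc retrieved;
    relevant docs never retrieved contribute 0 (divide by |relevant|)."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def r_precision(ranked_ids, relevant):
    """Precision of the top |Rel| returned docs."""
    r = len(relevant)
    return sum(1 for doc in ranked_ids[:r] if doc in relevant) / r

queries = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}),
    (["d7", "d5", "d6"],       {"d5"}),
]
mean_ap = sum(average_precision(rk, rel) for rk, rel in queries) / len(queries)
print(round(mean_ap, 3), [r_precision(rk, rel) for rk, rel in queries])
```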
Variance
§ For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7)
§ Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query.
§ That is, there are easy information needs and hard ones!
Sec. 8.4
CREATING TEST COLLECTIONS FOR IR EVALUATION
Test Collections
Sec. 8.5
From document collections to test collections
§ Still need
§ Test queries
§ Relevance assessments
§ Test queries
§ Must be germane to docs available
§ Best designed by domain experts
§ Random query terms generally not a good idea
§ Relevance assessments
§ Human judges, time-consuming
§ Are human panels perfect?
Sec. 8.5
Unit of Evaluation
§ We can compute precision, recall, F, and ROC curve
for different units.
§ Possible units
§ Documents (most common)
§ Facts (used in some TREC evaluations)
§ Entities (e.g., car companies)
§ May produce different results. Why?
Sec. 8.5
Kappa measure for inter-judge (dis)agreement
§ Kappa measure
§ Agreement measure among judges
§ Designed for categorical judgments
§ Corrects for chance agreement
§ Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
§ P(A) – proportion of time judges agree
§ P(E) – what agreement would be by chance
§ Kappa = 0 for chance agreement, 1 for total agreement.
Sec. 8.5
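A sketch of this computation using pooled marginals for P(E); the 2x2 counts are the ones used in the worked example on the next two slides.

```python
# Kappa for two judges from a 2x2 agreement table, using pooled marginals.

def kappa(rr, rn, nr, nn):
    """rr: both say relevant; rn: judge1 relevant / judge2 nonrelevant;
    nr: judge1 nonrelevant / judge2 relevant; nn: both say nonrelevant."""
    n = rr + rn + nr + nn
    p_agree = (rr + nn) / n
    # Pooled marginals: pool both judges' relevant/nonrelevant counts (2n judgments).
    p_rel = (2 * rr + rn + nr) / (2 * n)
    p_nonrel = (2 * nn + rn + nr) / (2 * n)
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 4))   # 0.7759, as in the worked example
```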
Kappa Measure: Example P(A)? P(E)?
Sec. 8.5
                        Judge 2: Relevant   Judge 2: Nonrelevant
  Judge 1: Relevant            300                  20
  Judge 1: Nonrelevant          10                  70

Total assessments: 400
Kappa Example
§ P(A) = 370/400 = 0.9250
§ P(nonrelevant) = (10+20+70+70)/800 = 0.2125
§ P(relevant) = (10+20+300+300)/800 = 0.7875
§ P(E) = 0.2125^2 + 0.7875^2 = 0.6653
§ Kappa = (0.9250 – 0.6653)/(1-0.6653) = 0.7759
§ Kappa > 0.8 = good agreement
§ 0.67 < Kappa < 0.8 -> “tentative conclusions” (Carletta ’96)
§ Depends on purpose of study
§ For >2 judges: average pairwise kappas
Sec. 8.5
(Note: P(E) above is computed using pooled marginals, i.e., pooling both judges’ relevant/nonrelevant counts.)
TREC
§ The TREC Ad Hoc task from the first 8 TRECs is the standard IR task
§ 50 detailed information needs a year
§ Human evaluation of pooled results returned
§ More recently, other related tracks: Web track, HARD
§ A TREC query (TREC 5):
What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
Sec. 8.2
Standard relevance benchmarks: Others
§ GOV2
§ Another TREC/NIST collection
§ 25 million web pages
§ Largest collection that is easily available
§ But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index
§ NTCIR
§ East Asian language and cross-language information retrieval
§ Cross Language Evaluation Forum (CLEF)
§ This evaluation series has concentrated on European languages and cross-language information retrieval.
§ Many others
Sec. 8.2
Interjudge Agreement: TREC 3
Sec. 8.5
Impact of Inter-judge Agreement
§ Impact on absolute performance measures can be significant (0.32 vs 0.39)
§ Little impact on ranking of different systems or relative performance
§ Suppose we want to know if algorithm A is better than algorithm B
§ A standard information retrieval experiment will give us a reliable answer to this question.
Sec. 8.5
Critique of pure relevance
§ Relevance vs Marginal Relevance
§ A document can be redundant even if it is highly relevant
§ Duplicates
§ The same information from different sources
§ Marginal relevance is a better measure of utility for the user.
§ Using facts/entities as evaluation units more directly measures true relevance.
§ But it is harder to create the evaluation set
§ See the Carbonell reference
Sec. 8.5.1
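The slides point to the Carbonell reference for marginal relevance; below is a rough sketch of the greedy MMR (maximal marginal relevance) re-ranking idea from that paper. The similarity functions, the lambda weight, and the toy scores are placeholder assumptions, not part of the lecture.

```python
# Greedy MMR selection: prefer documents that are relevant to the query but
# not redundant with what has already been selected.

def mmr_select(candidates, sim_to_query, sim_between, k, lam=0.7):
    """Pick k docs; the next pick maximises
    lam * sim(d, query) - (1 - lam) * max_{s in selected} sim(d, s)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            redundancy = max((sim_between(d, s) for s in selected), default=0.0)
            return lam * sim_to_query(d) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: d2 is a near-duplicate of d1, so MMR picks the less redundant d3.
q_sim = {"d1": 0.9, "d2": 0.85, "d3": 0.6}
pair_sim = {("d1", "d2"): 0.95, ("d2", "d1"): 0.95}
print(mmr_select(["d1", "d2", "d3"],
                 lambda d: q_sim[d],
                 lambda a, b: pair_sim.get((a, b), 0.1),
                 k=2))                      # -> ['d1', 'd3']
```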
Can we avoid human judgment?
§ No
§ Makes experimental work hard
§ Especially on a large scale
§ In some very specific settings, can use proxies
§ E.g.: for approximate vector space retrieval, we can compare how close (in cosine distance) the docs found by an approximate retrieval algorithm are to the truly closest docs
§ But once we have test collections, we can reuse them (so long as we don’t overtrain too badly)
Sec. 8.6.3
Evaluation at large search engines
§ Search engines have test collections of queries and hand-ranked results
§ Recall is difficult to measure on the web
§ Search engines often use precision at top k, e.g., k = 10
§ . . . or measures that reward you more for getting rank 1 right than for getting rank 10 right
§ NDCG (Normalized Discounted Cumulative Gain)
§ Search engines also use non-relevance-based measures.
§ Clickthrough on first result
§ Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.
§ Studies of user behavior in the lab
§ A/B testing
Sec. 8.6.3
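NDCG is only named above; here is a hedged sketch of one common formulation (gain discounted by log2 of the rank, normalized by the ideal ordering). The relevance grades are invented, and some systems use a 2^grade − 1 gain instead.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain of the top-k graded results (rank 1 first)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg(gains, k):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 0, 1], k=4), 3))   # graded relevance 0-3, in rank order
```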
A/B testing
§ Purpose: Test a single innovation
§ Prerequisite: You have a large search engine up and running.
§ Have most users use the old system
§ Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
§ Evaluate with an “automatic” measure like clickthrough on first result
§ Now we can directly see if the innovation does improve user happiness.
§ Probably the evaluation methodology that large search engines trust most
§ In principle less powerful than doing a multivariate regression analysis, but easier to understand
Sec. 8.6.3
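A rough sketch of the split described above: deterministically divert about 1% of users to the new system and compare clickthrough on the first result. The hashing scheme, the traffic share, and the log format are illustrative assumptions.

```python
import hashlib

def bucket(user_id: str, treatment_share: float = 0.01) -> str:
    """Hash the user id to a stable value in [0, 1); below the share -> new system."""
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % 10_000
    return "new" if h / 10_000 < treatment_share else "old"

# Aggregate clickthrough-on-first-result per bucket from (user, clicked) records.
logs = [("u1", True), ("u2", False), ("u3", True)]   # invented log records
stats = {"old": [0, 0], "new": [0, 0]}               # [clicks, impressions]
for user, clicked_first in logs:
    b = bucket(user)
    stats[b][1] += 1
    stats[b][0] += int(clicked_first)

for b, (clicks, n) in stats.items():
    print(b, clicks / n if n else 0.0)
```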
RESULTS PRESENTATION
Sec. 8.7
Result Summaries
§ Having ranked the documents matching a query, we
wish to present a results list
§ Most commonly, a list of the document titles plus a short summary, aka “10 blue links”
Sec. 8.7
Resources for this lecture
§ IIR 8
§ MIR Chapter 3
§ MG 4.5
§ Carbonell and Goldstein 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 21.