
COMP6714: Information Retrieval & Web Search

Introduction to
Information Retrieval

Lecture 8: Evaluation


This lecture
§  How do we know if our results are any good?

§  Evaluating a search engine
§  Benchmarks
§  Precision and recall

Sec. 6.2


EVALUATING SEARCH ENGINES



Measures for a search engine
§  How fast does it index

§  Number of documents/hour
§  (Average document size)

§  How fast does it search
§  Latency as a function of index size

§  Expressiveness of query language
§  Ability to express complex information needs
§  Speed on complex queries

§  Uncluttered UI
§  Is it free?

Sec. 8.6


Measures for a search engine
§  All of the preceding criteria are measurable: we can
quantify speed/size
§  we can make expressiveness precise

§  The key measure: user happiness
§  What is this?
§  Speed of response/size of index are factors
§  But blindingly fast, useless answers won’t make a user
happy

§  Need a way of quantifying user happiness

Sec. 8.6


Measuring user happiness
§  Issue: who is the user we are trying to make happy?

§  Depends on the setting
§  Web engine:

§  User finds what they want and returns to the engine
§  Can measure rate of return users

§  User completes their task – search as a means, not end
§  See Russell http://dmrussell.googlepages.com/JCDL-talk-June-2007-short.pdf

§  eCommerce site: user finds what they want and buys
§  Is it the end-user, or the eCommerce site, whose
happiness we measure?

§  Measure time to purchase, or fraction of searchers who
become buyers?

Sec. 8.6.2


Measuring user happiness
§  Enterprise (company/govt/academic): Care about
“user productivity”
§  How much time do my users save when looking for
information?

§  Many other criteria having to do with breadth of access,
secure access, etc.

Sec. 8.6.2


Happiness: elusive to measure

§  Most common proxy: relevance of search results
§  But how do you measure relevance?
§  We will detail a methodology here, then examine

its issues
§  Relevance measurement requires 3 elements:

1.  A benchmark document collection
2.  A benchmark suite of queries
3.  A usually binary assessment of either Relevant or

Nonrelevant for each query and each document
§  Some work on more-than-binary, but not the standard

Sec. 8.1


Evaluating an IR system
§  Note: the information need is translated into a
query

§  Relevance is assessed relative to the information
need, not the query

§  E.g., Information need: I’m looking for information on
whether drinking red wine is more effective at
reducing your risk of heart attacks than white wine.

§  Query: wine red white heart attack effective
§  You evaluate whether the doc addresses the
information need, not whether it has these words

Sec. 8.1


Standard relevance benchmarks
§  TREC – National Institute of Standards and
Technology (NIST) has run a large IR test bed for
many years

§  Reuters and other benchmark doc collections used
§  “Retrieval tasks” specified

§  sometimes as queries
§  Human experts mark, for each query and for each
doc, Relevant or Nonrelevant
§  or at least for a subset of docs that some system returned
for that query

Sec. 8.2


Unranked retrieval evaluation:
Precision and Recall

§  Precision: fraction of retrieved docs that are relevant
= P(relevant|retrieved)

§  Recall: fraction of relevant docs that are retrieved =
P(retrieved|relevant)

§  Precision P = tp/(tp + fp)
§  Recall R = tp/(tp + fn)

                 Relevant    Nonrelevant
Retrieved           tp           fp
Not Retrieved       fn           tn
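
A minimal sketch of how both measures fall out of the contingency table above (the counts here are made up for illustration):

def precision_recall(tp, fp, fn):
    """Compute precision and recall from contingency-table counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical query: 40 relevant docs retrieved, 10 nonrelevant retrieved,
# 20 relevant docs missed.
print(precision_recall(tp=40, fp=10, fn=20))  # (0.8, 0.666...)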

Sec. 8.3


Should we instead use the accuracy
measure for evaluation?

§  Given a query, an engine classifies each doc as
“Relevant” or “Nonrelevant”

§  The accuracy of an engine: the fraction of these
classifications that are correct
§  (tp + tn) / ( tp + fp + fn + tn)

§  Accuracy is a commonly used evaluation measure in
machine learning classification work

§  Why is this not a very useful evaluation measure in
IR?

Sec. 8.3


Why not just use accuracy?
§  How to build a 99.9999% accurate search engine on
a low budget….

§  People doing information retrieval want to find
something and have a certain tolerance for junk.

[Mock search box: whatever you search for, the engine answers “0 matching results found.”]
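
To make the point concrete, a small sketch (with invented numbers) of why accuracy is uninformative when almost every document is nonrelevant: an engine that retrieves nothing at all is still almost perfectly “accurate”.

# Hypothetical collection: 1,000,000 docs, of which only 10 are relevant.
total_docs, n_relevant = 1_000_000, 10

# An engine that returns nothing: tp = 0, fp = 0, fn = 10, tn = 999,990.
tp, fp, fn, tn = 0, 0, n_relevant, total_docs - n_relevant

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.99999  -- looks superb
recall = tp / (tp + fn)                     # 0.0      -- finds nothing
print(accuracy, recall)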

Sec. 8.3


Precision/Recall

§  You can get high recall (but low precision) by
retrieving all docs for all queries!

§  Recall is a non-decreasing function of the number
of docs retrieved

§  In a good system, precision decreases as either the
number of docs retrieved or recall increases
§  This is not a theorem, but a result with strong empirical
confirmation

Sec. 8.3


Difficulties in using precision/recall
§  Should average over large document collection/
query ensembles

§  Need human relevance assessments
§  People aren’t reliable assessors

§  Assessments have to be binary
§  Nuanced assessments?

§  Heavily skewed by collection/authorship
§  Results may not translate from one domain to another

Sec. 8.3


A combined measure: F

§  Combined measure that assesses precision/recall
tradeoff is F measure (weighted harmonic mean):

§  People usually use balanced F1 measure
§  i.e., with β = 1 or α = ½

§  Harmonic mean is a conservative average
§  See CJ van Rijsbergen, Information Retrieval

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R),   where β² = (1 − α)/α
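
A minimal sketch of the weighted F measure exactly as defined above; the precision/recall values fed to it are invented:

def f_measure(P, R, beta=1.0):
    """Weighted harmonic mean of precision P and recall R.
    beta = 1 (alpha = 1/2) gives the balanced F1."""
    if P == 0 and R == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * P * R / (b2 * P + R)

print(f_measure(0.8, 0.4))            # balanced F1 ~= 0.533
print(f_measure(0.8, 0.4, beta=2.0))  # recall-weighted F2 ~= 0.444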

Sec. 8.3


F1 and other averages

[Chart: Combined Measures — the minimum, maximum, arithmetic, geometric, and harmonic means of precision and recall, plotted against precision (0–100) with recall fixed at 70%]

Sec. 8.3


Evaluating ranked results
§  Evaluation of ranked results:

§  The system can return any number of results
§  By taking various numbers of the top returned documents
(levels of recall), the evaluator can produce a precision-
recall curve
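
A minimal sketch of how such a curve can be traced from one ranked result list; the 0/1 relevance judgments here are invented:

def pr_points(ranked_rel, n_relevant):
    """ranked_rel: 0/1 relevance judgments in rank order for one query.
    Returns (recall, precision) after each retrieved document."""
    points, hits = [], 0
    for k, rel in enumerate(ranked_rel, start=1):
        hits += rel
        points.append((hits / n_relevant, hits / k))
    return points

# Hypothetical ranking for a query with 3 relevant docs in the collection:
print(pr_points([1, 0, 1, 0, 0, 1], n_relevant=3))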

Sec. 8.4


A precision-recall curve

[Figure: a precision-recall curve with the typical sawtooth shape; precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]
Sec. 8.4


Averaging over queries
§  A precision-recall graph for one query isn’t a very
sensible thing to look at

§  You need to average performance over a whole
bunch of queries.

§  But there’s a technical issue:
§  Precision-recall calculations place some points on the
graph

§  How do you determine a value (interpolate) between the
points?

Sec. 8.4


Interpolated precision
§  Idea: If locally precision increases with increasing
recall, then you should get to count that…

§  So you take the maximum of the precisions to the right of each recall value
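
A minimal sketch of this interpolation rule, applied to (recall, precision) points like those produced by the previous sketch (the points are invented):

def interpolated_precision(points, recall_level):
    """Interpolated precision at recall_level: the maximum precision
    observed at any recall >= recall_level (0 if there is none)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)

points = [(1/3, 1.0), (1/3, 0.5), (2/3, 2/3), (1.0, 0.5)]
print(interpolated_precision(points, 0.5))  # 2/3, not the raw 0.5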

Sec. 8.4


Evaluation

§  Graphs are good, but people want summary measures!
§  Precision at fixed retrieval level

§  Precision-at-k: Precision of top k results
§  Perhaps appropriate for most of web search: all people want are
good matches on the first one or two results pages

§  But: averages badly and has an arbitrary parameter of k
§  11-point interpolated average precision

§  The standard measure in the early TREC competitions: you take
the precision at 11 levels of recall varying from 0 to 1 in steps
of 0.1, using interpolation (the value for 0 is always
interpolated!), and average them (see the sketch after this list)

§  Evaluates performance at all recall levels
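
A minimal sketch of the 11-point measure for a single query, using the same max-to-the-right interpolation as on the previous slide (the curve points are invented):

def eleven_point_avg_precision(points):
    """points: (recall, precision) pairs for one query, in rank order.
    Average the interpolated precision at recall = 0.0, 0.1, ..., 1.0."""
    def interp(level):
        return max((p for r, p in points if r >= level), default=0.0)
    return sum(interp(i / 10) for i in range(11)) / 11

print(eleven_point_avg_precision([(1/3, 1.0), (2/3, 2/3), (1.0, 0.5)]))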

Sec. 8.4


Typical (good) 11 point precisions

§  SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

[Figure: the 11-point interpolated precision-recall curve for this run; precision on the y-axis, recall on the x-axis, both from 0 to 1]

Sec. 8.4


Yet more evaluation measures…
§  Mean average precision (MAP)

§  Average of the precision value obtained for the top k
documents, each time a relevant doc is retrieved

§  Avoids interpolation and the use of fixed recall levels
§  MAP for a query collection is the arithmetic average of the
per-query values (see the sketch after this list)

§  Macro-averaging: each query counts equally

§  R-precision
§  If we have a known (though perhaps incomplete) set of relevant
documents of size Rel, then calculate the precision of the top Rel
docs returned

§  Perfect system could score 1.0.
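
A minimal sketch of both measures, assuming that for each query we have its ranked 0/1 judgments and the size of its (possibly incomplete) relevant set; all numbers are invented:

def average_precision(ranked_rel, n_relevant):
    """Average of precision@k over the ranks k at which a relevant doc
    appears; relevant docs never retrieved contribute 0."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / n_relevant if n_relevant else 0.0

def mean_average_precision(runs):
    """Macro-average of per-query average precision: each query counts equally."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

def r_precision(ranked_rel, n_relevant):
    """Precision of the top |Rel| documents returned."""
    return sum(ranked_rel[:n_relevant]) / n_relevant if n_relevant else 0.0

runs = [([1, 0, 1, 0, 1], 3), ([0, 1, 0, 0, 0], 2)]  # two hypothetical queries
print(mean_average_precision(runs), r_precision([1, 0, 1, 0, 1], 3))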

Sec. 8.4


Variance
§  For a test collection, it is usual that a system does
crummily on some information needs (e.g., MAP =
0.1) and excellently on others (e.g., MAP = 0.7)

§  Indeed, it is usually the case that the variance in
performance of the same system across queries is
much greater than the variance of different systems
on the same query.

§  That is, there are easy information needs and hard
ones!

Sec. 8.4


CREATING TEST COLLECTIONS
FOR IR EVALUATION



Test Collections

Sec. 8.5


From document collections
to test collections
§  Still need

§  Test queries
§  Relevance assessments

§  Test queries
§  Must be germane to docs available
§  Best designed by domain experts
§  Random query terms generally not a good idea

§  Relevance assessments
§  Human judges, time-consuming
§  Are human panels perfect?

Sec. 8.5


Unit of Evaluation
§  We can compute precision, recall, F, and ROC curve
for different units.

§  Possible units
§  Documents (most common)
§  Facts (used in some TREC evaluations)
§  Entities (e.g., car companies)

§  May produce different results. Why?

Sec. 8.5


Kappa measure for inter-judge
(dis)agreement

§  Kappa measure
§  Agreement measure among judges
§  Designed for categorical judgments
§  Corrects for chance agreement

§  Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
§  P(A) – proportion of time judges agree
§  P(E) – what agreement would be by chance
§  Kappa = 0 for chance agreement, 1 for total agreement.

Sec. 8.5


Kappa Measure: Example P(A)? P(E)?

Sec. 8.5

                        Judge 2:      Judge 2:
                        Relevant      Nonrelevant
Judge 1: Relevant         300             20
Judge 1: Nonrelevant       10             70

Total assessments: 400



Kappa Example

§  P(A) = 370/400 = 0.9250
§  P(nonrelevant) = (10+20+70+70)/800 = 0.2125
§  P(relevant) = (10+20+300+300)/800 = 0.7875
§  P(E) = 0.2125^2 + 0.7875^2 = 0.6653
§  Kappa = (0.9250 – 0.6653)/(1-0.6653) = 0.7759

§  Kappa > 0.8 = good agreement
§  0.67 < Kappa < 0.8 -> “tentative conclusions” (Carletta ’96)
§  Depends on purpose of study
§  For >2 judges: average pairwise kappas

Sec. 8.5

(In the example above, P(E) is estimated using pooled marginals.)
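
A minimal sketch that reproduces the numbers on the example slide from the 2x2 judgment table, with P(E) estimated from the pooled marginals:

def kappa(n11, n10, n01, n00):
    """n11: both judges say Relevant; n00: both say Nonrelevant;
    n10, n01: the two kinds of disagreement."""
    n = n11 + n10 + n01 + n00
    p_agree = (n11 + n00) / n
    # Pooled marginals: average the two judges' rates for each label.
    p_rel = (2 * n11 + n10 + n01) / (2 * n)
    p_nonrel = (2 * n00 + n10 + n01) / (2 * n)
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(n11=300, n10=20, n01=10, n00=70))  # ~0.776, as on the slide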


TREC
§  TREC Ad Hoc task from first 8 TRECs is standard IR task

§  50 detailed information needs a year
§  Human evaluation of pooled results returned
§  More recently other related things: Web track, HARD

§  A TREC query (TREC 5)

Number: 225
Description:
What is the main function of the Federal Emergency Management
Agency (FEMA) and the funding level provided to meet emergencies?
Also, what resources are available to FEMA such as people,
equipment, facilities?

Sec. 8.2


Standard relevance benchmarks:
Others
§  GOV2

§  Another TREC/NIST collection
§  25 million web pages
§  Largest collection that is easily available
§  But still 3 orders of magnitude smaller than what Google/
Yahoo/MSN index

§  NTCIR
§  East Asian language and cross-language information retrieval

§  Cross Language Evaluation Forum (CLEF)
§  This evaluation series has concentrated on European languages
and cross-language information retrieval.

§  Many others


Sec. 8.2


Interjudge Agreement: TREC 3

Sec. 8.5


Impact of Inter-judge Agreement

§  Impact on absolute performance measure can be significant
(0.32 vs 0.39)

§  Little impact on ranking of different systems or relative
performance

§  Suppose we want to know if algorithm A is better than
algorithm B

§  A standard information retrieval experiment will give us a
reliable answer to this question.

Sec. 8.5


Critique of pure relevance
§  Relevance vs Marginal Relevance

§  A document can be redundant even if it is highly relevant
§  Duplicates
§  The same information from different sources
§  Marginal relevance is a better measure of utility for the
user.

§  Using facts/entities as evaluation units more directly
measures true relevance.

§  But harder to create evaluation set
§  See Carbonell reference

Sec. 8.5.1


Can we avoid human judgment?
§  No
§  Makes experimental work hard

§  Especially on a large scale
§  In some very specific settings, can use proxies

§  E.g.: for approximate vector space retrieval, we can
compare, by cosine distance, the true closest docs with
those found by an approximate retrieval algorithm

§  But once we have test collections, we can reuse
them (so long as we don’t overtrain too badly)

Sec. 8.6.3


Evaluation at large search engines
§  Search engines have test collections of queries and hand-ranked
results
§  Recall is difficult to measure on the web
§  Search engines often use precision at top k, e.g., k = 10
§  . . . or measures that reward you more for getting rank 1 right than
for getting rank 10 right.
§  NDCG (Normalized Discounted Cumulative Gain); see the sketch after this list

§  Search engines also use non-relevance-based measures.
§  Clickthrough on first result

§  Not very reliable if you look at a single clickthrough … but pretty
reliable in the aggregate.

§  Studies of user behavior in the lab
§  A/B testing
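
NDCG is only name-checked above; a minimal sketch of one common formulation (log2 rank discount on graded relevance, normalized by the ideal ordering; the judgments are invented) looks like this:

import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded relevance values."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Hypothetical graded judgments (0-3) for the top 5 results of one query:
print(ndcg([3, 2, 3, 0, 1]))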


Sec. 8.6.3


A/B testing
§  Purpose: Test a single innovation
§  Prerequisite: You have a large search engine up and running.
§  Have most users use old system
§  Divert a small proportion of traffic (e.g., 1%) to the new
system that includes the innovation
§  Evaluate with an “automatic” measure like clickthrough on
first result
§  Now we can directly see if the innovation does improve user
happiness.
§  Probably the evaluation methodology that large search
engines trust most
§  In principle less powerful than doing a multivariate regression
analysis, but easier to understand
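
One way the clickthrough comparison might be assessed is a two-proportion z-test on the two buckets; this is a sketch under that assumption, not a description of what any particular engine does, and all counts are invented:

import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for the difference in clickthrough rate between the
    old system (bucket A) and the small bucket running the new system (B)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: 99% of queries on the old system, 1% on the new one.
z = two_proportion_z(clicks_a=39_600, n_a=990_000, clicks_b=430, n_b=10_000)
print(z)  # |z| > 1.96 would suggest a real difference at the 5% level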


Sec. 8.6.3


RESULTS PRESENTATION


Sec. 8.7


Result Summaries
§  Having ranked the documents matching a query, we
wish to present a results list

§  Most commonly, a list of the document titles plus a
short summary, aka “10 blue links”

Sec. 8.7


Resources for this lecture
§  IIR 8
§  MIR Chapter 3
§  MG 4.5
§  Carbonell and Goldstein 1998. The use of MMR,
diversity-based reranking for reordering documents
and producing summaries. SIGIR 21.