
IR H/M Course
Why Evaluate?
• There are three main reasons for evaluating IR systems
– Economic reasons: if people are going to buy the technology, they want to know how effective it is


– Scientific progress:
• Researchers want to know if progress is being made or not. So, we
need a measure for progress
• Scientists want to show that their method is better than someone
else’s. Therefore, we need a measure for better
– Verification: if you built a system you need to verify its performance
Architecture of an IR system
How+well+the+ system+actually+ perform?
Query+features
How+well+for+the+given+ document+collection+and+ user+population+could+ any+system+perform?
s1>s2>s3>+…
How+well+could+a+system+ with+a+generic+design+ perform?
Index+features
Term+1 Term+2 Term+3
Doc++++Score

What to Evaluate?
• Efficiency
– E.g. query latency (time lag), query throughput
• Coverage
– Does this system cover the domain very well?
• How many pages does a web search engine index?
• Presentation of results / query formulation
– Effort of the user
– User engagement aspects; cognition; user attention
• Effectiveness
– How effective is the system?
– Correctness
Evaluation
• Evaluation is key to building effective and
efficient search engines
– Usually carried out in controlled laboratory
experiments
– Online testing/evaluation can also be done
• Effectiveness, efficiency and cost are related
– High efficiency may be obtained at the price of effectiveness
• e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
• Efficiency and cost targets may impact effectiveness

Test Collections
• Collections of documents that also have associated with them a set of queries for which relevance assessments are available (the GROUND TRUTH)
– Relevance assessments are often called “qrels” for short
• Documents? What are they?
– Genuine documents like email messages, journal articles, memos, etc. – the kind of material in which we tend to look for information
• Queries? What are they?
– The kind of questions users of such a collection would ask. How do we collect them? Co-operation of the user population, query logs, etc.
TREC: Text Retrieval Conference
• Started in 1992, organised by NIST, USA and
funded by the US government.
• Introduced a new standard for retrieval system evaluation
• Developed diverse test collections of different sizes (recent collections have > 1 billion docs)
– News articles, web documents, blogs, tweets, etc.
• Avoids exhaustive assessment of documents by using the pooling method

Evaluation Corpus
• Test collections consist of documents, queries, and relevance judgments (e.g., the TREC collections)
TREC Topic Example

Relevance Judgments
• Obtaining relevance judgments is an expensive,
time-consuming process
– Who does it?
– What are the instructions?
– What is the level of agreement?
• TREC judgments
– Depend on task being evaluated
– Binary vs. Graded relevance
– Reasonable agreement because of “narrative”
• Exhaustive judgments for all documents in a collection are not practical
– i.e. we can’t judge 1 billion documents
• The pooling technique is used in TREC
– Where do the top k results come from? How are they generated?
– Top k results (k varied between 50 and 200) from the rankings obtained by different search engines are merged into a pool
– Duplicates are removed
– Documents are presented in some random order to the
relevance judges
• Produces a large number of relevance judgments for each query, although possibly incomplete
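As a rough sketch (not TREC's actual tooling), the pooling step could look like the following in Python; the run lists and the value of k here are hypothetical:

    import random

    def build_pool(runs, k=100):
        """Merge the top-k results of several runs into a single pool for judging.
        runs: list of ranked lists of document ids (one list per search engine)."""
        pool = set()
        for ranking in runs:
            pool.update(ranking[:k])      # take the top-k documents from each run
        pool = list(pool)                 # the set has already removed duplicates
        random.shuffle(pool)              # present documents to the judges in random order
        return pool

    run1 = ["d1", "d2", "d100", "d75"]
    run2 = ["d43", "d2", "d501", "d75"]
    run3 = ["d1", "d13", "d100", "d75"]
    print(build_pool([run1, run2, run3], k=4))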

Pooling Technique
[Figure: from a collection of 1,000,000 docs, the top results of each run are merged into a pool "to be judged"]
Search engine 1: d1 d2 d100 d75
Search engine 2: d43 d2 d501 d75
Search engine 3: d1 d13 d100 d75
Relevance Assessments @ TREC
Assessors inspecting documents for relevance at TREC: a thankless job!

IR Experimental Setup
• A test collection: docs, queries, & relevance assessments (Ground Truth)
• Measures of performance
– Precision, Recall, …
• Systems to compare
– System A vs System B
– E.g. A uses TF weighting; B uses TF-IDF
• An experimental design
– Traditionally, the same queries and documents are used repeatedly, with different systems or variants of systems
Evaluation Method
• Run your experiment:
– Input the documents to the systems
– Issue each query to the systems
• Collect the output
– Then you need to evaluate the output using the ground truth
– Analyse the results
• Quantitatively and statistically!
– How to do this?
Describe the classical IR experimental setup

Comparing Systems
• Given a query, the IR system provides a ranked list after searching the underlying collection of documents
• The assumption is
– The better system will provide a better ranked list
– A better ranked list satisfies the users better overall
– How to compare (2 or more) ranked lists of documents?
– How to verify whether one system is more effective than another?
– We need a measure (metric)!
Effectiveness Measures
• The function of an IR system is to:
– Retrieve all relevant documents
• Measured by recall
– Retrieve no non-relevant documents
• Measured by precision
• Classical effectiveness measures
– Recall & Precision

Effectiveness Measures
A is the set of relevant documents, B is the set of retrieved documents
Relevant vs. Retrieved
For a given query and a collection

Precision & Recall (set-based measures)

Precision = |A ∩ B| / |B|
Recall = |A ∩ B| / |A|

[Venn diagram: the Retrieved set (B) overlaps the Relevant set; the intersection is the set of relevant retrieved documents]

Relevant vs. Retrieved
[Diagram: very high precision, very low recall — the retrieved set lies almost entirely inside the relevant set but covers only a small part of it]

Relevant vs. Retrieved
[Diagram: very low precision (zero), very low recall (zero)]

Relevant vs. Retrieved
[Diagram: high recall, but low precision]

Relevant vs. Retrieved
[Diagram: high recall, high precision]

Ranking Effectiveness (1)
Ranks: 1 2 3 4 5 6 7 8 9 10
Precision = 2/3 = 0.67   Recall = 2/6 = 0.33

Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|
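A minimal sketch of the set-based measures above, assuming the relevant and retrieved documents are given as sets of ids; the example sets are hypothetical but reproduce the 0.67 / 0.33 figures:

    def precision_recall(relevant, retrieved):
        """Set-based precision and recall for a single query."""
        hits = len(relevant & retrieved)                     # |A ∩ B|
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}          # 6 relevant documents
    retrieved = {"d1", "d3", "d7"}                           # 3 retrieved, 2 of them relevant
    print(precision_recall(relevant, retrieved))             # ~ (0.67, 0.33)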

Ranking Effectiveness (2)
Ranks: 1 2 3 4 5 6 7 8 9 10
Precision = 5/6 = 0.83   Recall = 5/6 = 0.83

Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|
Precision @ Rank R
[Figure: two rankings of 10 documents, each containing 6 relevant documents]
• Popular metric in Web search
• Precision @ Rank 10 of both rankings is the same (0.6)
• If Precision @ rank R is higher for Ranking 1 than for Ranking 2,
recall will also be higher
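A sketch of Precision @ Rank R over a ranking given as 0/1 relevance labels; the ranking used here is hypothetical, with 6 relevant documents in the top 10:

    def precision_at(ranking, r):
        """ranking: 0/1 relevance labels in rank order; returns P@r."""
        return sum(ranking[:r]) / r

    ranking = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]   # hypothetical ranking
    print(precision_at(ranking, 10))           # 0.6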

Summarising a Ranking
• Method 1: Calculating recall and precision at
fixed rank positions (Precision @ Rank R)
• Method 2: Calculating precision at standard recall levels, from 0.0 to 1.0
– Requires interpolation (discussed later)
• Method 3: Average Precision
– Averaging the precision values from the rank positions where a relevant document was retrieved
– Set the precision value to zero for relevant documents that are not retrieved
Average Precision (1)
NB: 6 is the TOTAL number of relevant documents, not just those that we have observed.
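A sketch of Average Precision for one query, dividing by the total number of relevant documents (6 here) as the note above stresses; the ranking itself is hypothetical:

    def average_precision(ranking, total_relevant):
        """ranking: 0/1 relevance labels in rank order;
        total_relevant: number of relevant documents in the collection for this query."""
        hits, precisions = 0, []
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)    # precision at each relevant document
        # relevant documents that are never retrieved contribute zero precision
        return sum(precisions) / total_relevant

    print(average_precision([1, 0, 1, 1, 1, 0, 1, 0, 1, 0], total_relevant=6))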

Average Precision (2)
Mean Average Precision (MAP)

Mean Average Precision (MAP)
• Summarise rankings from multiple queries by
averaging average precision
• A widely used measure in research papers
• Assumes user is interested in finding many
relevant documents for each query
• Requires many relevance judgments in test collection
• Recall-precision graphs are also useful summaries
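MAP is then simply the mean of the per-query average precision values; a self-contained sketch with hypothetical query rankings:

    def mean_average_precision(rankings, totals):
        """rankings: one 0/1 ranked list per query; totals: total relevant docs per query."""
        aps = []
        for ranking, total in zip(rankings, totals):
            hits, precisions = 0, []
            for rank, rel in enumerate(ranking, start=1):
                if rel:
                    hits += 1
                    precisions.append(hits / rank)
            aps.append(sum(precisions) / total)    # average precision for this query
        return sum(aps) / len(aps)                 # mean over queries

    q1 = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]            # hypothetical query rankings
    q2 = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    print(mean_average_precision([q1, q2], totals=[6, 3]))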
Recall-Precision Graph
How can we compare?

Interpolation
• To average graphs, calculate precision at standard recall levels:
P(R) = max { P' : R' ≥ R, (R', P') ∈ S }
– where S is the set of observed (R, P) points
• Defines precision at any recall level as the maximum precision observed in any recall- precision point at a higher recall level
– Produces a step function
– Defines precision at recall 0.0
Interpolation

Recall level:            0.0  0.1  0.2  0.3   0.4   0.5  0.6  0.7  0.8  0.9  1.0
Interpolated precision:  1.0  1.0  1.0  0.67  0.67  0.5  0.5  0.5  0.5  0.5  0.5
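A sketch of the interpolation rule: precision at recall level R is the maximum precision over all observed points with recall ≥ R. The observed (recall, precision) points below are hypothetical but consistent with the table above (5 relevant documents, retrieved at ranks 1, 3, 6, 9 and 10):

    def interpolated_precision(points, levels=None):
        """points: observed (recall, precision) pairs for one query."""
        if levels is None:
            levels = [i / 10 for i in range(11)]               # 0.0, 0.1, ..., 1.0
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in levels]

    # relevant docs at ranks 1, 3, 6, 9, 10 out of 5 relevant in total
    points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
    print(interpolated_precision(points))
    # [1.0, 1.0, 1.0, 0.67, 0.67, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]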

Average Precision at Standard Recall Levels
• Only consider standard recall levels, varying from 0.0 to 1.0 in increments of 0.1
• The recall-precision graph is plotted by simply joining the average precision points at the standard recall levels

Average Recall-Precision Graph
Graph for 50 Queries

Trade-off between Recall and Precision
High precision end: returns mostly relevant documents, but misses many useful ones too
High recall end: returns all the relevant documents, but includes lots of junk
Comparing Two or More Systems
• The curve closest to the upper right-hand corner of the graph indicates the best performance
[Recall-precision graph comparing two runs, NoStem vs. Stem]
It seems that stemming is better than no stemming, especially towards the top of the ranking (i.e. where recall is low)

Van Rijsbergen's F Measure
• Harmonic mean of recall and precision
F = 2 · R · P / (R + P)
– Why harmonic mean?
– The harmonic mean emphasises the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large
• More general form
F_β = (β² + 1) · R · P / (R + β² · P)
– β is a parameter that determines the relative importance of recall and precision
F Measure Example (1)
Recall = 2/6 = 0.33
Precision = 2/3 = 0.67
F = 2*Recall*Precision/(Recall + Precision)
= 2*0.33*0.67/(0.33 + 0.67) = 0.44

F Measure Example (2)
Recall = 5/6 = 0.83
Precision = 5/6 = 0.83
F = 2*Recall*Precision/(Recall + Precision)
= 2*0.83*0.83/(0.83 + 0.83) = 0.83
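A sketch of the F computation used in the two examples, and of the more general F_β (with β = 1 it reduces to the harmonic mean):

    def f_measure(recall, precision, beta=1.0):
        """F_beta: beta > 1 weights recall more, beta < 1 weights precision more."""
        if recall == 0 and precision == 0:
            return 0.0
        b2 = beta ** 2
        return (b2 + 1) * recall * precision / (recall + b2 * precision)

    print(f_measure(2/6, 2/3))   # example (1): ~0.44
    print(f_measure(5/6, 5/6))   # example (2): ~0.83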
Problems with Precision/Recall
• Can't know the true recall value
– Except in small collections
– See the pooling technique
• Precision/Recall are related
– A combined measure is sometimes more appropriate
• Does not take into account degree of relevance
• Assumes batch mode
– Interactive IR is important and has different criteria for successful searches (more on this in a later lecture)
Discuss some of the issues with P-R

Focusing on Top Documents
• Users tend to look at only the top part of the ranked result list to find relevant documents
• Some search tasks have only one relevant document
– e.g., navigational search, question answering
• Recall not appropriate
– Instead need to measure how well the search engine does at retrieving relevant documents at very high ranks
Focusing on Top Documents
• Precision at Rank R (P@5, P@10, etc.)
– R typically 5, 10, 20
– Easy to compute, average, understand
– Not sensitive to rank positions less than R
• Reciprocal Rank
– Reciprocal of the rank at which the first relevant
document is retrieved
– Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries
– Very sensitive to rank position

Mean Reciprocal Rank (MRR)
RR = 1/2 = 0.5 (first relevant document at rank 2)
RR = 1/5 = 0.2 (first relevant document at rank 5)
MRR = (0.5 + 0.2) / 2 = 0.35
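A sketch of RR and MRR over rankings given as 0/1 relevance labels; the two rankings are hypothetical but match the example above:

    def reciprocal_rank(ranking):
        """RR = 1 / rank of the first relevant document (0 if none is retrieved)."""
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    q1 = [0, 1, 0, 0, 0]    # first relevant at rank 2 -> RR = 0.5
    q2 = [0, 0, 0, 0, 1]    # first relevant at rank 5 -> RR = 0.2
    print((reciprocal_rank(q1) + reciprocal_rank(q2)) / 2)   # MRR = 0.35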
Discounted Cumulative Gain (DCG)
• Popular measure for evaluating web search and related tasks
– Uses graded relevance
• Two assumptions:
– Highly relevant documents are more useful than marginally relevant documents
– The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined

Discounted Cumulative Gain
• Uses graded relevance as a measure of the usefulness, or gain, from examining a document
• Gain is accumulated starting at the top of the ranking and is discounted at lower ranks
– Typical discount is 1/log (rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
DCG Example
• 10 ranked documents judged on 0-3 relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
• Discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
• DCG (cumulative sum of the discounted gains):
3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61

Normalized DCG
• DCG numbers are averaged across a set of queries at specific rank values
– e.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61
• DCG values are often normalised by comparing the DCG at each rank with the DCG value for the perfect ranking
– makes averaging easier for queries with different numbers of relevant documents
NDCG Example
• Perfect ranking:
3, 3, 3, 2, 2, 2, 1, 0, 0, 0
• Ideal DCG values:
3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
• NDCG values (divide actual by ideal):
1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
– NDCG ≤ 1 at any rank position
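A sketch reproducing the DCG and NDCG numbers above, with a discount of 1/log2(rank) for ranks ≥ 2; the ideal ranking is obtained by sorting the judgments in decreasing order:

    import math

    def dcg(gains):
        """gains: graded relevance values in rank order; returns DCG at each rank."""
        total, out = 0.0, []
        for rank, g in enumerate(gains, start=1):
            total += g if rank == 1 else g / math.log2(rank)
            out.append(total)
        return out

    def ndcg(gains):
        actual = dcg(gains)
        ideal = dcg(sorted(gains, reverse=True))    # DCG of the perfect ranking
        return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]

    gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    print(dcg(gains))    # ..., 6.89 at rank 5, ..., 9.61 at rank 10
    print(ndcg(gains))   # 1.0, 0.83, 0.87, ...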

Evaluation Results

Evaluation tools produce many measures; trec_eval is the NIST/TREC standard.

Queryid (Num):    49
Total number of documents over all queries
  Retrieved:    49000
  Relevant:      1670
  Rel_ret:       1258
Interpolated Recall - Precision Averages:
  at 0.00    0.6880
  at 0.10    0.5439
  at 0.20    0.4773
  at 0.30    0.4115
  at 0.40    0.3741
  at 0.50    0.3174
  at 0.60    0.2405
  at 0.70    0.1972
  at 0.80    0.1721
  at 0.90    0.1337
  at 1.00    0.1113
Average precision (non-interpolated) for all rel docs (averaged over queries)
Precision:
  At    5 docs:    0.3837
  At   10 docs:    0.3408
  At   15 docs:    0.3102
  At   20 docs:    0.2806
  At   30 docs:    0.2422
  At  100 docs:    0.1365
  At  200 docs:    0.0883
  At  500 docs:    0.0446
  At 1000 docs:    0.0257
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact:    0.3068
Analysing Results
• Quantitative evaluation (MAP, etc.) gives overall performance
• We also introduced recall-precision graphs as a mechanism to analyse
the query results
• Per-query evaluation results wrt a baseline are also very useful
– How many queries were improved/degraded with respect to a baseline?
– How per-query performance changes wrt baseline (per-query bar chart) – are there recognisable patterns about what queries are improved or not
– Complemented with statistical significance testing (discussed later)

Significance Tests
• Given the results from a number of queries, how can we conclude that ranking algorithm A is better than algorithm B?
• A significance test
– Null hypothesis: no difference between A and B
– Alternative hypothesis: B is better than A
– The power of a test is the probability that the test will reject the null hypothesis correctly
– Increasing the number of queries in the experiment also increases power of a test
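A minimal sketch of such a comparison, assuming we have per-query average precision scores for the baseline A and for system B (the scores below are made up for illustration); it uses SciPy's paired t-test, with the one-sided alternative available in SciPy ≥ 1.6:

    from scipy import stats

    # hypothetical per-query AP scores for baseline A and system B
    ap_a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
    ap_b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]

    # paired t-test; alternative hypothesis: B is better than A
    res = stats.ttest_rel(ap_b, ap_a, alternative="greater")
    print(f"t = {res.statistic:.2f}, p-value = {res.pvalue:.3f}")
    if res.pvalue < 0.05:
        print("Reject the null hypothesis: B is significantly better than A")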
Example Experimental Results
Significance level: α = 0.05
Is System B better than System A (baseline)?
t = 2.33, p-value = 0.02
p-value = 0.02 < 0.05 → reject the null hypothesis and conclude that B is better than A

Summary
• No single measure is the correct one for every application
– Choose measures appropriate for the task
– Use a combination
– Different measures show different aspects of system effectiveness
• Analyse the performance of individual queries
– Number of queries that improved/degraded relative to a baseline
– Type of queries that benefited, etc.
• Use significance tests (t-test)