
IR H/M Course
Quest for Efficiency
Remember the scale of Web-scale search engines
• Microsoft Bing uses “hundreds of thousands of servers” in their search engine


• Any new feature/retrieval technique should not increase the response time of the system – as this might lead to a loss in revenues
• Hence, deploying a retrieval technique that causes a 1% increase in response time implies 1000 additional servers (1% of hundreds of thousands) must be activated in their data centre(s) to maintain the same throughput
−This has significant cost and “Green” impact
IR infrastructures are concerned with making retrieval both effective and efficient
Efficient infrastructure techniques
• Static Pruning
• Query Evaluation (TAAT & DAAT), and Dynamic Pruning
• Index Compression
Distributed IR infrastructure
• Distributed Retrieval
• Query Efficiency Prediction

Increasing Efficiency
SMALLER, BETTER, FASTER
Recall: The Format of an Index
An index normally contains three sub-structures
− Lexicon: Records the list of all unique terms and their statistics
− Document Index: Records the list of all documents and their statistics
− Inverted Index: Records the mapping between terms and documents
[Diagram: documents pass through the Term Pipeline into the three structures below]
− Lexicon: each entry records a term's id, document frequency (df), collection frequency (cf), and a pointer (p) into the inverted index
− Document Index: could also contain other document information, e.g. PageRank
− Inverted Index: each entry (posting) represents a document as an (id, tf) pair; could also contain other occurrence information, e.g. term positions, fields (title, URL)
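
To make these sub-structures concrete, here is a minimal in-memory sketch in Python; the dict layout and field names are illustrative assumptions, not any engine's actual (compressed, on-disk) format:

from collections import defaultdict

lexicon = {}                     # term -> {"df": int, "cf": int}
inverted = defaultdict(list)     # term -> [(docid, tf), ...] in ascending docid order
doc_index = {}                   # docid -> {"doclen": int}  (could also hold e.g. PageRank)

def index_document(docid, tokens):
    """Add one document's (already pipeline-processed) tokens to the index."""
    doc_index[docid] = {"doclen": len(tokens)}
    counts = defaultdict(int)
    for t in tokens:
        counts[t] += 1
    for term, tf in counts.items():
        stats = lexicon.setdefault(term, {"df": 0, "cf": 0})
        stats["df"] += 1          # one more document contains this term
        stats["cf"] += tf         # total occurrences across the collection
        inverted[term].append((docid, tf))   # one posting per document

index_document(0, ["white", "house", "white", "house"])
index_document(1, ["white", "wine"])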

Recall: Search Efficiency
It is important to make retrieval as fast as possible
− Research indicates that even slightly slower retrieval (0.2s-0.4s) can lead to a dramatic drop in the perceived quality of the results [1]
So what is the most costly part of a (classical) search system?
[Diagram: query → Term Pipeline → Document Retrieval Model → Re-Ranking → Top Results]
It is the Document Retrieval Model:
− Scoring each document for the user query
[1] Teevan et al. Slow Search: Information Retrieval without Time Constraints. HCIR’13
Why is Document Scoring Expensive?
The largest reason for the expense of document scoring is that there are lots of documents:
− A Web search index can contain billions of documents
• Google currently indexes trillions of pages [1]
More specifically, the cost of a search is dependent on:
− Query length (the number of search terms)
− Posting list length for each query term
• i.e. The number of documents containing each term
[Diagram: term1's lexicon entry (id, df, cf, p) points to a posting list of three (id, tf) postings; term2's entry points to a list of two]
[1] http://www.statisticbrain.com/total-number-of-pages-indexed-by-google/
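
In other words, the cost of an exhaustive evaluation is roughly the sum of the query terms' posting list lengths. A sketch, reusing the hypothetical lexicon layout from earlier (the df values below are made up):

def scoring_cost(query_terms, lexicon):
    """Postings traversed by an exhaustive evaluation: the sum of each
    query term's document frequency (its posting list length)."""
    return sum(lexicon[t]["df"] for t in query_terms if t in lexicon)

lexicon = {"white": {"df": 2_000_000}, "house": {"df": 3_500_000}}  # hypothetical
print(scoring_cost(["white", "house"], lexicon))   # 5,500,000 postings to score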

Strategies to Speed Up Retrieval
There are several enhancements to the search architecture that can make search more efficient
Search result/Term caching
• Where possible, avoid the scoring process altogether
Pruning
• Static Pruning: Skip the indexing of documents that are not likely to make the first few search result pages
• Dynamic Pruning: Skip the scoring of documents that will not make the first few search result pages
Index compression
• Reduce the time it takes to read a posting list
Discuss a number of strategies to speed-up the retrieval process along with their pros and cons.

Search Result/Term Caching
Caching strategies are built on the idea that we should store answers to past queries
• Past answers can be used to bypass the scoring process for subsequent queries
• For popular queries, caching is very beneficial
There are two types of caching strategy
• Search Result Caching stores the final ranked list of documents for a query
• Term Caching stores the posting lists for each of the query terms in memory
Caching is beneficial to efficiency [1]:
• Search Result caching can avoid scoring for 50% of queries – so-called head queries
• Term caching can avoid loading one or more posting lists for 88% of queries
[1] Baeza-Yates et al. The impact of caching on search engines. SIGIR'07
Search Result/Term Caching
Search Result Caching Architecture
[Diagram: the query passes through the Term Pipeline to the Result Cache; on a Cache Miss it continues to the Document Retrieval Model and Re-Ranking, and the results for the query are then cached]
Term Caching Architecture
[Diagram: the query passes through the Term Pipeline to a Get Term Posting List(s) step backed by the Posting Cache, then on to the Document Retrieval Model and Re-Ranking; either all term posting lists are cached, or only some are, with additional term posting lists cached as queries arrive]
Discuss the pros & cons of the 2 caching techniques.
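
A minimal sketch of search result caching in Python, assuming an LRU eviction policy (real engines would also bound staleness and memory; all names here are hypothetical):

from collections import OrderedDict

class ResultCache:
    """Tiny LRU cache mapping a query string to its final ranked list."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.cache = OrderedDict()          # query -> ranked list of docids

    def get(self, query):
        if query in self.cache:
            self.cache.move_to_end(query)   # mark as recently used
            return self.cache[query]        # cache hit: no scoring needed
        return None

    def put(self, query, results):
        self.cache[query] = results
        self.cache.move_to_end(query)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used query

def search(query, cache, score_fn):
    results = cache.get(query)
    if results is None:                     # cache miss: fall through to scoring
        results = score_fn(query)
        cache.put(query, results)
    return results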

More on Caching
Memory is cheap…
• The logical consequence is that many search engines keep the entire index in memory
SSDs and hard drives offer slower storage tiers
For many queries, phrases can be used to help the ranking
• If we had bigram posting lists, we could score these queries much quicker
• But we cannot store postings for all bigrams
Instead, frequent bigrams from the query log can be selected, and then SSD and disk space can be used to cache and store these “term pair” posting lists [1]
• Then decide on a per-query basis to use them or not [2]
[1] Yan et al. Efficient Term Proximity Search with Term-Pair Indexes. CIKM 2010
[2] Risvik et al. Maguro, a system for indexing and searching over very large text collections. WSDM 2013
Paired Posting Lists
Consider the example query white house from Lect. 8
• We can retrieve for the phrase "white house" by intersecting the inverted index posting lists for 'white' and 'house',
− calculating a 'pair frequency' (pf) for each document:
white: (0, <1,5>) (5, <3>) (6, <4>)
house: (0, <2,6>) (3, <2,4,6>) (6, <5>)
'white house': (0, pf=2) (6, pf=1)
• In essence, we can simulate a posting list for “white house”, without indexing it as a bigram
Then if ‘white house’ occurs frequently in the query stream, it can be cached, e.g. to SSD

STATIC PRUNING
K is usually very small compared to the size of the collection (N)
• K << 1000, i.e. 20 for the first page of results
• Even if we are re-ranking results, K << N, e.g. K=5000
Question: do we really need to keep ALL of the information in the index for a good-quality top-K search for common queries?
• There should be a way to remove some of the less important data, while (hopefully) retaining the quality of top-K results!
What is (not) important?
• Will some documents never be retrieved, for any query?
• Will some terms never help retrieval?

Static Pruning Concepts
Document-based pruning:
• Discards terms from a document that are less representative of a document's content
• Can be applied on-the-fly during indexing, IF we have reasonable collection statistics
Term-based pruning:
• Discards term occurrences that are less likely to affect the retrieval performance of a specific weighting model (e.g. BM25)
• Done after the index is completed

Pruning Algorithm
Foreach common query:
• Run it against the full index
• Record the top-K matching documents
• Foreach document:
− Record the terms and term positions that contributed to the score
Finally: remove all non-recorded postings and terms
First proposed by D. Carmel (2001) for single term queries
What is static pruning? Discuss 2 possible static pruning techniques.

Static Pruning Advantages & Disadvantages
Some results claim a modest negative impact on effectiveness when pruning up to 60% of postings
• Static pruning is a lossy compression of the inverted index
• Index is smaller; retrieval is faster; pruning is done offline
An application of static pruning is a multi-tier architecture:
• Most common queries are handled by a heavily pruned 1st tier index that fits wholly in RAM
• Less common queries handled by a 2nd tier index (on SSD?)
• Remaining queries handled by a full index on disk

QUERY EVALUATION & DYNAMIC PRUNING

Query Evaluation
Even when using caching and/or static pruning, retrieval still needs to score many documents in the index
• i.e. when the query/query terms are not in the cache
Normal strategies make a pass on the posting lists for each query term
• This can be done Term-at-a-Time (TAAT) – one query term at a time
• Or Document-at-a-Time (DAAT) – all query terms in parallel
I will explain these, before showing how we can improve them
Compare and contrast the TAAT and DAAT query evaluation techniques.

Term-at-a-Time (TAAT)
[Diagram, over two slides: TAAT traverses term1's posting list, e.g. (21,2) (25,2), accumulating partial scores for each document, before traversing term2's list, e.g. (21,2), in a second pass]
Advantages:
• Simple
Disadvantages:
• Requires lots of memory to contain partial scores for all documents
• Difficult to do Boolean or phrase queries, as we don't have all postings for a given document at the same time

Document-at-a-Time (DAAT)
[Diagram, over two slides: DAAT advances a pointer in each of term1's and term2's posting lists in parallel, fully scoring one document at a time]
Advantages:
• Reduced memory compared to TAAT (and hence faster)
• Supports Boolean query operators, phrases, etc.
Disadvantages:
• Slightly more complex to implement
Most commercial search engines are reported to use DAAT

Dynamic Pruning during Query Evaluation
Dynamic pruning strategies aim to make scoring faster by only scoring a subset of the documents
• The core assumption of these approaches is that the user is only interested in the top K results, say K=20
• During query scoring, it is possible to determine if a document cannot make the top K ranked results
• Hence, the scoring of such documents can be terminated early, or skipped entirely, without damaging retrieval effectiveness to rank K
We call this "safe-to-rank K"

Dynamic Pruning Techniques
The two most well-known methods for DAAT dynamic pruning are MaxScore and WAND
MaxScore [1]:
− Early termination: does not compute scores for documents that won't be retrieved, by comparing term upper bounds with a score threshold
WAND [2]:
− Approximate evaluation: does not consider documents with approximate scores (sum of upper bounds) lower than the threshold
− Therefore, it focuses on the combinations of terms needed (wAND) for a document to be retrieved
[1] H. Turtle & J. Flood. Query Evaluation: Strategies and Optimisations. IPM: 31(6). 1995.
[2] A. Broder et al. Efficient Query Evaluation using a Two-Level Retrieval Process. CIKM 2003.

MaxScore Example
[Diagram, over three slides: require K=20 documents, weighting model BM25. A candidate document's term scores are first estimated using upper bounds; if it could make the top K, its exact score is calculated using BM25; if even the upper bounds show it can't make the top K, it is PRUNEd]

WAND Example
Require K=20 documents, weighting model BM25
WAND focuses on the combinations of terms needed (c.f. Weighted AND) to reach the threshold
− With threshold 4.5, any document without term1 cannot make the retrieved set. Hence, we can skip docid 25 in the term2 posting list
− Hence, it will focus retrieval using term1, and only score term2 for documents that could exceed the threshold
For both MaxScore & WAND, smaller K => faster retrieval
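
To make DAAT concrete, below is a sketch of an exhaustive DAAT top-K scorer in Python (the score function is a stand-in for e.g. BM25). Note that the heap minimum is exactly the threshold that MaxScore and WAND compare against term upper bounds when deciding what to skip; this sketch itself does no pruning:

import heapq

def daat_topk(posting_lists, score, K):
    """posting_lists: one [(docid, tf), ...] list per query term, each
    sorted by ascending docid. Returns the top-K (score, docid) pairs."""
    pointers = [0] * len(posting_lists)
    heap = []   # min-heap of (score, docid); heap[0][0] is the current threshold
    while True:
        # The smallest docid under any pointer is the next document to score.
        current = min((pl[p][0] for pl, p in zip(posting_lists, pointers)
                       if p < len(pl)), default=None)
        if current is None:
            break
        s = 0.0
        for t, (pl, p) in enumerate(zip(posting_lists, pointers)):
            if p < len(pl) and pl[p][0] == current:
                s += score(t, pl[p][1])     # this term's contribution
                pointers[t] += 1            # advance past the scored posting
        # MaxScore/WAND would skip documents whose summed upper bounds
        # cannot beat heap[0][0]; exhaustive DAAT scores every candidate.
        if len(heap) < K:
            heapq.heappush(heap, (s, current))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, current))
    return sorted(heap, reverse=True)

# e.g. two query terms, scoring by raw tf:
print(daat_topk([[(21, 2), (25, 2)], [(21, 2), (26, 5)]],
                lambda t, tf: float(tf), K=2))   # [(5.0, 26), (4.0, 21)]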

Some numbers…
To demonstrate the benefit of dynamic pruning, we report experiments from [1]
• Retrieve K=1000 documents for BM25 on the ClueWeb09 collection
• 1000 real search engine queries
This is for “safe-to-rank 1000”. Both WAND & MaxScore can be configured to be faster, but unsafe, i.e. permit losses in effectiveness above rank K
• This is achieved by over-inflating the threshold
There are also unsafe dynamic pruning techniques for TAAT
Overall, dynamic pruning is an important component of modern search engine deployments
[Table: mean response time and number of postings scored per query evaluation strategy, comparing exhaustive DAAT against the MaxScore and WAND dynamic pruning strategies; pruning reduces both]
[1] N Tonellotto, C Macdonald, and I Ounis. Effect of different docid orderings on dynamic pruning retrieval strategies. SIGIR 2011.
INDEX COMPRESSION

Index Compression
The previous approaches make retrieval faster by scoring fewer documents
• However, it is also possible to make the scoring of each document faster!
This can be achieved by applying index compression [1]
−Motivation: it physically takes time to read the term posting lists, particularly if they are stored on a (slow) hard disk
−Using lossless compressed layouts for the term posting lists can save space (on disk or in memory) and reduce the amount of time spent doing IO
−But decompression can also be expensive, so efficient decompression is key!
Example: 1 integer = 32 bits = 4 bytes
term2: (21,2) (25,2) (26,5)
6 integers = 24 bytes total
Do we need 32 bits?
[1] Witten et al. Managing Gigabytes: Compressing and Indexing Documents and Images.
Delta Gaps
One component of the information stored in a posting list is the docids…
• …in ascending order!
term2: (21,2) (25,2) (26,5)
We can make smaller numbers by taking the differences
term2: (21,2) (4,2) (1,5) – the docids 25 and 26 become gaps 4 and 1
So each docid in the posting lists could be represented using fewer bits
• How to represent these numbers?
• 32 bits has a range -2147483648 .. 2147483647
• Using a fixed number of bits is wasteful (192 bits = 24 bytes)
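
A minimal sketch of delta-gap encoding and decoding in Python (assuming strictly ascending docids, as above):

def to_gaps(docids):
    """[21, 25, 26] -> [21, 4, 1]: first docid kept, then the differences."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Inverse: a cumulative sum restores the original ascending docids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

assert to_gaps([21, 25, 26]) == [21, 4, 1]
assert from_gaps([21, 4, 1]) == [21, 25, 26]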

Unary & Gamma Encoding
Unary encoding:
• Use as many 0s as the input value x, followed by a 1
• E.g.: 5 is 000001
Gamma encoding:
• Let N=⌊log2 x⌋ be the highest power of 2 that x contains
• Write N out in unary representation, followed by the remainder (x – 2^N) in binary
• E.g.: 5 is represented as 00101 (N=2 gives 001, remainder 1 in 2 bits gives 01)
Let's represent docids (as delta-gaps) as gamma, and tf as unary
• (This is the default compression used by Terrier)
term2: (21,2) (4,2) (1,5) encodes as 000010101 001 00100 001 1 000001
= 27 bits (< 4 bytes), down from 24 bytes!
Achieved Compression Rate = 4.5 bits/int
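
A sketch of the two encoders in Python, keeping bits as strings for readability (a real codec would pack them into bytes):

def unary(x):
    """x zeros followed by a 1, e.g. unary(5) == '000001'."""
    return "0" * x + "1"

def gamma(x):
    """N = floor(log2 x) in unary, then (x - 2^N) in N binary bits."""
    n = x.bit_length() - 1                               # floor(log2 x), x >= 1
    bits = format(x - (1 << n), "b").zfill(n) if n > 0 else ""
    return unary(n) + bits

assert unary(5) == "000001" and gamma(5) == "00101"

# docid gaps encoded with gamma, term frequencies with unary:
postings = [(21, 2), (4, 2), (1, 5)]     # term2's (gap, tf) pairs after delta-gapping
encoded = "".join(gamma(g) + unary(tf) for g, tf in postings)
print(encoded, len(encoded))             # 27 bits, versus 6 * 32 = 192 uncompressed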
Consider the following posting list:
term4: 14 3, 15 2, 17 1 (id=14 tf=3, id=15 tf=2, id=17 tf=1)
Encode the posting list using Unary encoding, and calculate the achieved compression rate
• Firstly take the delta-gaps: 14, 3, 1, 2, 2, 1
• Encode each in unary: 000000000000001 0001 01 001 001 01
• Encoding 6 integers as 32-bit integers = 192 bits (i.e. 32 bits per integer)
• Encoding 6 integers using Unary = 29 bits =~ 4.8 bits per integer

Other Compression Schemes
Unary & Gamma are moderately expensive to decode: lots of bit twiddling!
• Other schemes are byte-aligned, e.g.
− Variable byte [1]
− Simple family [2]
Documents are often clustered in the inverted index (e.g. by URL ordering)
• Compression can be more effective in blocks of numbers
• List-adaptive techniques work on blocks of numbers
− Frame of reference (FOR) [3]
− Patched frame of reference (PFOR) [4]
[1] H.E. Williams and J. Zobel. Compressing Integers for Fast File Access. Comput. J. 1999
[2] V. Anh & A. Moffat. Inverted Index Compression using Word-aligned Binary Codes. INRT. 2005
[3] J. Goldstein et al. Compressing relations and indexes. ICDE 1998.
[4] M. Zukowski et al. Super-scalar RAM-CPU cache compression. ICDE 2006.

Frame of Reference
Idea: pick the minimum m and the maximum M values of the block of numbers that you are compressing.
• Then, any value x can be represented using b bits, where b = ⌈log2(M-m+1)⌉
Example: To compress numbers in range {2000,...,2063}
• ⌈log2(64)⌉ = 6
• So we use 6 bits per value, stored as offsets after a block header recording the base (2000) and width (6)
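
A sketch of the FOR computation in Python (the packing of the b-bit offsets into machine words is elided):

import math

def for_encode(block):
    """Frame of Reference: store base m and width b, plus (x - m) per value."""
    m, M = min(block), max(block)
    b = max(1, math.ceil(math.log2(M - m + 1)))   # at least 1 bit, for simplicity
    return m, b, [x - m for x in block]           # each offset fits in b bits

m, b, offsets = for_encode(list(range(2000, 2064)))
print(m, b)   # 2000 6  ->  6 bits per value instead of 32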
Compression: Some numbers [1]
ClueWeb09 corpus – 50 million Web documents
• 12,725,738,385 postings => 94.8GB inverted file uncompressed – NO retrieval numbers: WAY TOO SLOW!
• Terrier's standard Gamma/Unary compression = 15GB
Compression is essential for an efficient IR system
• List adaptive compression: slightly larger indices, markedly faster
[1] M Catena, C Macdonald, and I Ounis. On Inverted Index Compression for Search Engine Efficiency. ECIR 2014. Award.
[Table: mean retrieval time (s) and index size for each compression scheme, comparing Gamma/Unary and Variable Byte]
Efficient Query Evaluation
Caching, Pruning & Compression all form essential aspects of an efficient IR system
• Each provides important improvements to efficiency
• Some techniques like static pruning or unsafe dynamic pruning can degrade effectiveness
− A search engine implementer must be aware of the tradeoffs to achieve their desired effectiveness within cost constraints
Other state-of-the-art approaches I haven’t included:
• Impact-ordered posting lists: an alternative index layout
− Requires an efficiency/effectiveness tradeoff
• Block-Max WAND: integrates WAND more tightly with the index compression format

Scaling Up
WIDE SYSTEMS
Scaling Up Information Retrieval
Even with the aforementioned efficiency improvements, indexing and search is computationally and (disk) IO intensive
To satisfy high query loads, the retrieval process needs to be spread over many CPUs and hard disks
There are 2 main paradigms to scale up:
− Vertical: Buy a large mainframe machine with lots of CPU cores and storage
− Horizontal: Buy many machines and distribute the search process over them

Vertical vs. Horizontal Scaling
Vertical Scaling
− Advantages
• All resources are local to the processing
• Some applications do not lend themselves to a distributed computing model
− Disadvantages
• Expensive infrastructure
• Fault tolerance is hard to achieve
Horizontal Scaling
− Advantages
• Nodes can be added in an ad-hoc manner as processing power is needed
• Multi-core processing nodes are inexpensive
− Disadvantages
• Additional communication and coordination overheads are incurred
Parallelised Indexing and Retrieval
Horizontal scaling is used by large search engines to parallelise the indexing process and the index itself
− Spread the index out into shards running on many machines
− Replicate each shard multiple times to allow for multiple queries to be processed in parallel and for fault tolerance
[Diagram: the index is split into shards, and each shard is replicated across several machines]
In the following, we cover distributed retrieval

Distributed Retrieval Architectures (1)
So how do we partition data between nodes?
Option 1: Term Partitioning
[Diagram: a term-document matrix with columns t1 t2 t3 t4; each row represents a document dj and each column represents an indexing term t. Term partitioning assigns each column, i.e. each term's posting list, to a different node]
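
A minimal sketch of term partitioning in Python, assuming a hypothetical hash-based assignment of terms to nodes:

import zlib

def term_partition(term, num_nodes):
    """Option 1 (term partitioning): each term's whole posting list
    (a column of the matrix) lives on exactly one node."""
    return zlib.crc32(term.encode()) % num_nodes   # stable, hypothetical assignment

# A query's terms may live on different nodes; each node returns its posting
# lists, and a broker combines them to score documents.
for t in ["white", "house"]:
    print(t, "-> node", term_partition(t, num_nodes=4))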
