Search Engines
Text, Web And Media Analytics
Copyright By PowCoder代写 加微信 powcoder
Indexing Text and Ranking Documents
1. Abstract Model of Ranking
2. Indexes
Inverted Index
Index construction
3. Query processing
Document-at-a-time
Term-at-a-time
Optimization techiques
1. Abstract Model of Ranking
More Concrete Model
2. Indexes
Indexes are data structures designed to make search faster
Text search has unique requirements, which leads to unique data structures
Most common data structure is inverted index
general name for a class of structures
“inverted” because documents are associated with words, rather than words with documents
Indexes and Ranking
Indexes are designed to support search or text analysis
faster response time, supports updates
Text search engines use a particular form of search: ranking
documents are retrieved in sorted order according to a score computing using the document representation, the query, and a ranking algorithm
What is a reasonable abstract model for ranking?
enables discussion of indexes without details of retrieval model
Inverted Index
Each index term is associated with an inverted list
Contains lists of documents, or lists of word occurrences in documents, and other information
Each entry is called a posting
The part of the posting that refers to a specific document or location is called a pointer
Each document in the collection is given a unique number
Lists are usually document-ordered (sorted by document number)
Example “Collection”
Simple Inverted
Data structures for the simple
inverted index
Linked List representation
Dictionary representation (in Python)
I = {‘fish’: [1, 2, 3, 4], ‘fishkeepers’: [2], ‘found’: [1], … }
Inverted Index
with counts
supports better ranking algorithms
Inverted Index
with positions
proximity matches
Proximity Matches
Matching phrases or words within a window
e.g., “tropical fish”, or “find tropical within 5 words of fish”
Word positions in inverted lists make these types of query features efficient
Fields and Extents
Document structure is useful in search
field restrictions
e.g., date, from:, etc.
some fields more important
e.g., title
separate inverted lists for each field type
add information about fields to postings
use extent lists
Extent Lists
An extent is a contiguous region of a document
represent extents using word positions
inverted list records all extents for a given field type
e.g., (1,3) means S1 has a title that contains the first two words (ending just before 3rd word)
extent list
Other Issues
Precomputed scores in inverted list
e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is total feature value for document 1
improves speed but reduces flexibility
Score-ordered lists
query processing engine can focus only on the top part of each inverted list, where the highest-scoring documents are recorded
very efficient for single-word queries
Auxiliary Structures
Inverted lists usually stored together in a single file for efficiency
Inverted file
Vocabulary or lexicon
Contains a lookup table from index terms to the byte offset of the inverted list in the inverted file
Either hash table in memory or B-tree for larger vocabularies
Term statistics stored at start of inverted lists
Collection statistics stored in separate file
Index Construction
Simple in-memory indexer
It is classic way to solve the memory problem if the inverted list is very big (or in distributed index framework).
Merging addresses limited memory problem to
Build the inverted list structure until memory runs out
Then write the partial index to disk, and start making a new one again.
At the end of this process, the disk is filled with many partial indexes, which are then merged to a single result.
Partial lists must be designed so they can be merged in a small memory
e.g., storing in alphabetical order
Read two words w1 and w2 from both Index A and Index B respectively.
If w1 = w2, merging w1 and w2 and then append it to Index;
If w1 < w2, append w1 to Index; then read a new word from Index A as w1 then repeat the process;
If w2 < w1, append w2 to Index; then read a new word from Index B as w2 then repeat the process;
Result Merging
Index merging is a good strategy for handling updates when they come in large batches
For small updates this is very inefficient
instead, create separate index for new documents, merge results from both searches
could be in-memory, fast to update and search
Deletions handled using delete list
Modifications done by putting old version on delete list, adding new version to new documents index
3. Query Processing
Once an index is built, we need to know how to process it to get query results.
Clever algorithm can boost query processing speed
Document-at-a-time
Calculates complete scores for documents by processing all term lists, one document at a time
Term-at-a-time
Accumulates scores for documents by processing term lists one at a time
Both approaches have optimization techniques that significantly reduce time required to generate scores
每次处理一个文档,通过处理所有术语列表计算文档的完整分数
通过每次处理一个术语列表来累积文档的分数
这两种方法都具有显著减少生成分数所需时间的优化技术
Document-At-A-Time
Handle some documents is better by this way
Pseudocode Function Descriptions
getCurrentDocument()
Returns the document number of the current posting of the inverted list.
skipForwardToDocument(d)
Moves forward in the inverted list until getCurrentDocument() <= d. This function may read to the end of the list.
movePastDocument(d)
Moves forward in the inverted list until getCurrentDocument() < d.
moveToNextDocument()
Moves to the next document in the list. Equivalent to movePastDocument(getCurrentDocument()).
getNextAccumulator(d)
returns the first document number d' >= d that has already has an accumulator.
removeAccumulatorsBetween(a, b)
Removes all accumulators for documents numbers between a and b. Ad will be removed iff a < d < b.
removeAccumulatorsFrom(d’)
Removes all accumulators after d’ (including d’).
Document-At-A-Time
\\Select inverted lists based on the query
f document features function
g query feature function
k return top result
Empty array
Result list
Example 1:
Q = {salt = l1, water=l2, tropical=l3}
Documents = {d1, d2, d3, d4}
sd = sd + 1 = 1
sd = sd + 1 = 2
sd = sd + 2 = 4
sd = sd + 1 = 1
sd = sd + 2 = 3
Term-At-A-Time
Term-At-A-Time
Example 2:
Q = {salt = l1, water=l2, tropical=l3}
Documents = {d1, d2, d3, d4}
HashTable A
A1 = 0+1 =1
A4 = 0+1 =1
A1 = A1+1 =2
A2 = 0+1 =1
A4 = A4+1 = 2
A1 = A1+2 = 4
A2 = A2 +2 = 3
A3 = 0+1 =1
Optimization Techniques
Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently
Two classes of optimization
Read less data from inverted lists
e.g., skip lists
better for simple feature functions
Calculate scores for fewer documents
e.g., conjunctive processing
better for complex feature functions
每次一项会为累加器使用更多的内存,但访问磁盘更有效
从倒排列表中读取更少的数据,例如,跳跃列表更适合简单的特性函数
对于较少的文档计算分数,例如,对于复杂的特征函数更好地进行合并处理
Conjunctive processing
It means that every document returned to the user contains all of the query terms.
It is a default mode for many Web search engines.
It works very well when one of the query terms is rare (i.e., the output size is not very big).
Conjunctive
Term-at-a-Time
if (d is the end node of the list) then A.removeAccumulatorsFrom(d’)
Example 2:
Q = {salt = l1, water=l2, tropical=l3}
Documents = {d1, d2, d3, d4}
i=0, d0=-1
A1 = 1, A4 = 1 (Accumulator)
i=1, d0=-1, d=1, d’=1
A1 = A1+1 =2, A4=1
d=2, d’ = 4,
d =4, d’=4
A4 = A4+1 = 2
i=2, d0=-1, d =1, d’=1
A1 = A1+2 = 4
d=2, d’=4 then dd’,
d=3, d’=4, then dd’, and d is the end node, so remove A4.
The output is A1 only.
Threshold Methods
Threshold methods use number of top-ranked documents needed (k) to optimize query processing
for most applications, k is small
For any query, there is a minimum score that each document needs to reach before it can be shown to the user
score of the kth-highest scoring document
the threshold τ
The true value is hard to calculate, but we can approximate it by using τ′ τ, so that we can ignore any document with a score less than τ′.
Threshold Methods cont.
Estimating threshold
For document-at-a-time processing, use score of lowest-ranked document in R so far for τ′
For term-at-a-time, have to use kth-largest score in the accumulator table A because we don’t have full document scores until the query processing is finished.
MaxScore method
It can ignore parts of the inverted lists that will not generate document scores above τ′.
Please note “safe optimization” in that ranking will be the same without optimization.
MaxScore Example
Assume we are almost certain to find a set of top k documents that contain both words.
Suppose the indexer computes μtree
The maximum score for any document containing just “tree”
Assume k =3, τ′ is lowest score after first three docs
Highly likely that τ ′ > μtree
As τ ′ is the score of a document that contains both query terms
Can safely skip over all gray postings
Other Approaches
Early termination of query processing
ignore high-frequency word lists in term-at-a-time
ignore documents at end of lists in doc-at-a-time
unsafe optimization
List ordering
order inverted lists by quality metric (e.g., PageRank) or by partial score
makes unsafe (and fast) optimizations more likely to produce good documents
Structured Queries
Query language can support specification of complex features
similar to SQL for database systems (support feature combination)
query translator converts the user’s input into the structured query representation
Galago query language is the example used here
e.g., Galago query:
Evaluation Tree for Structured Query
This query indicates that the document score should be a combination of the scores from three subqueries.
where the #od:1 operator means that the terms inside it need to appear next to each other, in that order, in a matching document.
Chapter 5 in book – W. , Search Engines – Information retrieval in Practice; Pearson, 2010.
/docProps/thumbnail.jpeg
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com