CS代考 Search Engines

Search Engines

Text, Web And Media Analytics

Copyright By PowCoder代写 加微信 powcoder

Indexing Text and Ranking Documents

1. Abstract Model of Ranking
2. Indexes
Inverted Index
Index construction
3. Query processing
Document-at-a-time
Term-at-a-time
Optimization techiques

1. Abstract Model of Ranking

More Concrete Model

2. Indexes
Indexes are data structures designed to make search faster
Text search has unique requirements, which leads to unique data structures
Most common data structure is inverted index
general name for a class of structures
“inverted” because documents are associated with words, rather than words with documents

Indexes and Ranking
Indexes are designed to support search or text analysis
faster response time, supports updates
Text search engines use a particular form of search: ranking
documents are retrieved in sorted order according to a score computing using the document representation, the query, and a ranking algorithm
What is a reasonable abstract model for ranking?
enables discussion of indexes without details of retrieval model

Inverted Index
Each index term is associated with an inverted list
Contains lists of documents, or lists of word occurrences in documents, and other information
Each entry is called a posting
The part of the posting that refers to a specific document or location is called a pointer
Each document in the collection is given a unique number
Lists are usually document-ordered (sorted by document number)

Example “Collection”

Simple Inverted

Data structures for the simple
inverted index
Linked List representation

Dictionary representation (in Python)

I = {‘fish’: [1, 2, 3, 4], ‘fishkeepers’: [2], ‘found’: [1], … }

Inverted Index
with counts

supports better ranking algorithms

Inverted Index
with positions

proximity matches

Proximity Matches
Matching phrases or words within a window
e.g., “tropical fish”, or “find tropical within 5 words of fish”
Word positions in inverted lists make these types of query features efficient

Fields and Extents
Document structure is useful in search
field restrictions
e.g., date, from:, etc.
some fields more important
e.g., title
separate inverted lists for each field type
add information about fields to postings
use extent lists

Extent Lists
An extent is a contiguous region of a document
represent extents using word positions
inverted list records all extents for a given field type
e.g., (1,3) means S1 has a title that contains the first two words (ending just before 3rd word)

extent list

Other Issues
Precomputed scores in inverted list
e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is total feature value for document 1
improves speed but reduces flexibility
Score-ordered lists
query processing engine can focus only on the top part of each inverted list, where the highest-scoring documents are recorded
very efficient for single-word queries

Auxiliary Structures
Inverted lists usually stored together in a single file for efficiency
Inverted file
Vocabulary or lexicon
Contains a lookup table from index terms to the byte offset of the inverted list in the inverted file
Either hash table in memory or B-tree for larger vocabularies
Term statistics stored at start of inverted lists
Collection statistics stored in separate file

Index Construction
Simple in-memory indexer

It is classic way to solve the memory problem if the inverted list is very big (or in distributed index framework).
Merging addresses limited memory problem to
Build the inverted list structure until memory runs out
Then write the partial index to disk, and start making a new one again.
At the end of this process, the disk is filled with many partial indexes, which are then merged to a single result.
Partial lists must be designed so they can be merged in a small memory
e.g., storing in alphabetical order

Read two words w1 and w2 from both Index A and Index B respectively.
If w1 = w2, merging w1 and w2 and then append it to Index;
If w1 < w2, append w1 to Index; then read a new word from Index A as w1 then repeat the process; If w2 < w1, append w2 to Index; then read a new word from Index B as w2 then repeat the process; Result Merging Index merging is a good strategy for handling updates when they come in large batches For small updates this is very inefficient instead, create separate index for new documents, merge results from both searches could be in-memory, fast to update and search Deletions handled using delete list Modifications done by putting old version on delete list, adding new version to new documents index 3. Query Processing Once an index is built, we need to know how to process it to get query results. Clever algorithm can boost query processing speed Document-at-a-time Calculates complete scores for documents by processing all term lists, one document at a time Term-at-a-time Accumulates scores for documents by processing term lists one at a time Both approaches have optimization techniques that significantly reduce time required to generate scores 每次处理一个文档,通过处理所有术语列表计算文档的完整分数 通过每次处理一个术语列表来累积文档的分数 这两种方法都具有显著减少生成分数所需时间的优化技术 Document-At-A-Time Handle some documents is better by this way Pseudocode Function Descriptions getCurrentDocument() Returns the document number of the current posting of the inverted list. skipForwardToDocument(d) Moves forward in the inverted list until getCurrentDocument() <= d. This function may read to the end of the list. movePastDocument(d) Moves forward in the inverted list until getCurrentDocument() < d. moveToNextDocument() Moves to the next document in the list. Equivalent to movePastDocument(getCurrentDocument()). getNextAccumulator(d) returns the first document number d' >= d that has already has an accumulator.
removeAccumulatorsBetween(a, b)
Removes all accumulators for documents numbers between a and b. Ad will be removed iff a < d < b. removeAccumulatorsFrom(d’) Removes all accumulators after d’ (including d’). Document-At-A-Time \\Select inverted lists based on the query f document features function g query feature function k return top result Empty array Result list Example 1: Q = {salt = l1, water=l2, tropical=l3} Documents = {d1, d2, d3, d4} sd = sd + 1 = 1 sd = sd + 1 = 2 sd = sd + 2 = 4 sd = sd + 1 = 1 sd = sd + 2 = 3 Term-At-A-Time Term-At-A-Time Example 2: Q = {salt = l1, water=l2, tropical=l3} Documents = {d1, d2, d3, d4} HashTable A A1 = 0+1 =1 A4 = 0+1 =1 A1 = A1+1 =2 A2 = 0+1 =1 A4 = A4+1 = 2 A1 = A1+2 = 4 A2 = A2 +2 = 3 A3 = 0+1 =1 Optimization Techniques Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently Two classes of optimization Read less data from inverted lists e.g., skip lists better for simple feature functions Calculate scores for fewer documents e.g., conjunctive processing better for complex feature functions 每次一项会为累加器使用更多的内存,但访问磁盘更有效 从倒排列表中读取更少的数据,例如,跳跃列表更适合简单的特性函数 对于较少的文档计算分数,例如,对于复杂的特征函数更好地进行合并处理 Conjunctive processing It means that every document returned to the user contains all of the query terms. It is a default mode for many Web search engines. It works very well when one of the query terms is rare (i.e., the output size is not very big). Conjunctive Term-at-a-Time if (d is the end node of the list) then A.removeAccumulatorsFrom(d’) Example 2: Q = {salt = l1, water=l2, tropical=l3} Documents = {d1, d2, d3, d4} i=0, d0=-1 A1 = 1, A4 = 1 (Accumulator) i=1, d0=-1, d=1, d’=1 A1 = A1+1 =2, A4=1 d=2, d’ = 4, d =4, d’=4 A4 = A4+1 = 2 i=2, d0=-1, d =1, d’=1 A1 = A1+2 = 4 d=2, d’=4 then dd’, d=3, d’=4, then dd’, and d is the end node, so remove A4. The output is A1 only. Threshold Methods Threshold methods use number of top-ranked documents needed (k) to optimize query processing for most applications, k is small For any query, there is a minimum score that each document needs to reach before it can be shown to the user score of the kth-highest scoring document the threshold τ The true value is hard to calculate, but we can approximate it by using τ′  τ, so that we can ignore any document with a score less than τ′. Threshold Methods cont. Estimating threshold For document-at-a-time processing, use score of lowest-ranked document in R so far for τ′ For term-at-a-time, have to use kth-largest score in the accumulator table A because we don’t have full document scores until the query processing is finished. MaxScore method It can ignore parts of the inverted lists that will not generate document scores above τ′. Please note “safe optimization” in that ranking will be the same without optimization. MaxScore Example Assume we are almost certain to find a set of top k documents that contain both words. Suppose the indexer computes μtree The maximum score for any document containing just “tree” Assume k =3, τ′ is lowest score after first three docs Highly likely that τ ′ > μtree
As τ ′ is the score of a document that contains both query terms
Can safely skip over all gray postings

Other Approaches
Early termination of query processing
ignore high-frequency word lists in term-at-a-time
ignore documents at end of lists in doc-at-a-time
unsafe optimization
List ordering
order inverted lists by quality metric (e.g., PageRank) or by partial score
makes unsafe (and fast) optimizations more likely to produce good documents

Structured Queries
Query language can support specification of complex features
similar to SQL for database systems (support feature combination)
query translator converts the user’s input into the structured query representation
Galago query language is the example used here
e.g., Galago query:

Evaluation Tree for Structured Query

This query indicates that the document score should be a combination of the scores from three subqueries.
where the #od:1 operator means that the terms inside it need to appear next to each other, in that order, in a matching document.

Chapter 5 in book – W. , Search Engines – Information retrieval in Practice; Pearson, 2010.

/docProps/thumbnail.jpeg

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com