
COMP6714: Information Retrieval & Web Search
Introduction to Information Retrieval
Lecture 7: Scoring and results assembly
1

Ch. 6
Recap: tf-idf weighting
§ The tf-idf weight of a term is the product of its tf weight and its idf weight:
$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
§ Best known weighting scheme in information retrieval
§ Increases with the number of occurrences within a document
§ Increases with the rarity of the term in the collection
2
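As a concrete illustration (mine, not from the slides), a minimal Python sketch of this weighting, assuming the raw term frequency tf, the document frequency df and the collection size N are already known:

```python
import math

def tf_idf_weight(tf: int, df: int, N: int) -> float:
    """Log-weighted tf times idf: w = (1 + log10 tf) * log10(N / df)."""
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df)

# Example: a term occurring 3 times in a doc, appearing in 10 of 1,000,000 docs.
print(tf_idf_weight(tf=3, df=10, N=1_000_000))  # (1 + log10 3) * 5 ≈ 7.39
```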

Ch. 6
Recap: Queries as vectors
§ Key idea 1: Do the same for queries: represent them as vectors in the space
§ Key idea 2: Rank documents according to their proximity to the query in this space
§ proximity = similarity of vectors
3

Ch. 6
Recap: cosine(query,document)
$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$
(the numerator is the dot product of q and d; dividing by the vector lengths compares the corresponding unit vectors)
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
4
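A minimal sketch of the formula above (my illustration), assuming queries and documents are represented as sparse term-to-weight dictionaries:

```python
import math

def cosine(q: dict, d: dict) -> float:
    """Cosine similarity between two sparse tf-idf weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())      # dot product
    norm_q = math.sqrt(sum(w * w for w in q.values()))      # |q|
    norm_d = math.sqrt(sum(w * w for w in d.values()))      # |d|
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

print(cosine({"jealous": 0.5, "gossip": 0.3}, {"jealous": 0.2, "wuthering": 0.8}))
```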

Ch. 7
This lecture
§ Speeding up vector space ranking
§ Putting together a complete search system
§ Will require learning about a number of miscellaneous topics and heuristics
Question: Why don’t we just use the query processing methods for Boolean queries?
5

Sec. 6.3.3
Computing cosine scores
Term-at-a-time
6
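The algorithm figure for this slide is not reproduced above; as a rough term-at-a-time sketch in its spirit (my illustration, not the lecture's code), assuming postings already carry tf-idf weights w_{t,d} and document vector lengths are precomputed:

```python
import heapq

def cosine_score_taat(query_weights, index, doc_length, k):
    """Term-at-a-time scoring: accumulate partial scores one term list at a time.

    query_weights: {term: w_{t,q}}
    index:         {term: [(docID, w_{t,d}), ...]}
    doc_length:    {docID: length of the document's weight vector}
    """
    scores = {}                               # accumulators, one per candidate doc
    for term, wq in query_weights.items():
        for doc_id, wd in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + wq * wd
    # normalize by document length and return the K largest
    return heapq.nlargest(k, ((s / doc_length[d], d) for d, s in scores.items()))
```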

Sec. 7.1
Efficient cosine ranking
§ Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.
§ Efficient ranking:
§ Computing a single cosine efficiently.
§ Choosing the K largest cosine values efficiently.
§ Can we do this without computing all N cosines?
7

Sec. 7.1
Efficient cosine ranking
§ What we’re doing in effect: solving the K-nearest-neighbor problem for a query vector
§ In general, we do not know how to do this efficiently for high-dimensional spaces (the “curse of dimensionality”)
§ But it is solvable for short queries, and standard indexes support this well
8

Sec. 7.1
Special case – unweighted queries
§ No weighting on query terms
§ Assume each query term occurs only once
§ Then for ranking, we don’t need to normalize the query vector
§ Slight simplification of the algorithm from Lecture 6
9

Sec. 7.1
Faster cosine: unweighted query
10

Sec. 7.1
Computing the K largest cosines: selection vs. sorting
§ Typically we want to retrieve the top K docs (in the cosine ranking for the query)
§ not to totally order all docs in the collection
§ Can we pick off the docs with the K highest cosines?
§ Let n = number of docs with nonzero cosines
§ We seek the K best of these n
11

Sec. 7.1
Use heap for selecting top K
§ Binary tree in which each node’s value > the values of its children
§ Takes 2n operations to construct, then each of the K “winners” is read off in 2 log n steps.
§ Total time is O(n + K*log(n))
§ For n = 1M, K = 100, this is about 10% of the cost of sorting.
(Figure: example heap with node values 1, .9, .3, .3, .8, .1, .1)
12
http://en.wikipedia.org/wiki/Binary_heap
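For illustration (my code, using Python's heapq as a max-heap via negation), selecting the K largest of n scores in the O(n + K*log(n)) fashion described above:

```python
import heapq

def top_k_heap(scores, k):
    """Heapify all n scores in O(n), then pop the K winners in O(K log n)."""
    heap = [(-s, doc_id) for doc_id, s in enumerate(scores)]   # max-heap via negation
    heapq.heapify(heap)                                        # O(n) build
    return [(doc_id, -neg) for neg, doc_id in
            (heapq.heappop(heap) for _ in range(min(k, len(heap))))]

print(top_k_heap([0.1, 0.9, 0.3, 0.3, 0.8, 0.1, 1.0], k=3))
# [(6, 1.0), (1, 0.9), (4, 0.8)]
```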

Quick Select
§ QuickSelect is similar to QuickSort and can be used to find the top-K elements of an array
§ Takes O(n) expected time
§ Sorting the top-K items takes O(K*log(K)) time
§ Total time is O(n + K*log(K))
13
http://en.wikipedia.org/wiki/Quickselect
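A rough randomized quickselect sketch for the same task (my illustration, not the lecture's code): expected O(n) to move the K largest values to the front, plus O(K*log(K)) to sort just those winners:

```python
import random

def top_k_quickselect(scores, k):
    """Partition so the k largest values occupy the first k slots, then sort them."""
    a = list(scores)
    lo, hi, target = 0, len(a) - 1, min(k, len(a))
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                       # Hoare partition, large values to the left
            while a[i] > pivot: i += 1
            while a[j] < pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        if target - 1 <= j:                 # boundary lies in the left part
            hi = j
        elif target - 1 >= i:               # boundary lies in the right part
            lo = i
        else:                               # boundary falls among pivot-equal values
            break
    return sorted(a[:target], reverse=True)

print(top_k_quickselect([0.1, 0.9, 0.3, 0.3, 0.8, 0.1, 1.0], k=3))  # [1.0, 0.9, 0.8]
```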

Query Processing
§ Document-at-a-time
§ Calculates complete scores for documents by processing all term lists, one document at a time
§ Term-at-a-time
§ Accumulates scores for documents by processing term lists one at a time
§ Both approaches have optimization techniques that significantly reduce the time required to generate scores
§ Distinguish between safe and heuristic optimizations
[CMS09].begin

Document-At-A-Time

Document-At-A-Time
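The [CMS09] pseudocode figures are not reproduced here; as a simplified document-at-a-time sketch (my illustration, using the same toy index format as the term-at-a-time code earlier). A real implementation would instead merge the docID-sorted postings lists with per-list pointers rather than materialising the docID union:

```python
import heapq

def cosine_score_daat(query_weights, index, doc_length, k):
    """Document-at-a-time: finish one document's full score before the next docID."""
    postings = {t: dict(index.get(t, [])) for t in query_weights}   # term -> {docID: w_{t,d}}
    candidates = sorted({d for plist in postings.values() for d in plist})
    top = []                                                        # min-heap of (score, docID)
    for doc_id in candidates:
        score = sum(wq * postings[t].get(doc_id, 0.0)
                    for t, wq in query_weights.items()) / doc_length[doc_id]
        heapq.heappush(top, (score, doc_id))
        if len(top) > k:
            heapq.heappop(top)               # keep only the best k seen so far
    return sorted(top, reverse=True)
```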

Term-At-A-Time

Term-At-A-Time
(Figure from [CMS09]: term-at-a-time pseudocode; the accumulators A_d hold each document's partial score)

Optimization Techniques
§ Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently
§ Two classes of optimization
§ Read less data from inverted lists
§ e.g., skip lists
§ better for simple feature functions
§ Calculate scores for fewer documents
§ e.g., conjunctive processing
§ better for complex feature functions

Conjunctive Processing
§ Requires the result document to contain all the query terms (i.e., conjunctive Boolean queries)
§ More efficient
§ Can also be more effective for short queries
§ Default for many search engines
§ Can be combined with both DAAT and TAAT
20

Conjunctive Term-at-a-Time

Conjunctive Document-at-a-Time
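The conjunctive pseudocode figures are likewise not reproduced; a minimal conjunctive document-at-a-time sketch (my illustration, same toy index format as above), where only docIDs present in every query term's postings are scored:

```python
def conjunctive_daat(query_weights, index, doc_length, k):
    """Score only documents that contain all of the query terms."""
    postings = {t: dict(index.get(t, [])) for t in query_weights}
    if not postings:
        return []
    doc_ids = set.intersection(*(set(p) for p in postings.values()))  # candidate docs
    scored = [(sum(wq * postings[t][d] for t, wq in query_weights.items())
               / doc_length[d], d) for d in doc_ids]
    return sorted(scored, reverse=True)[:k]
```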

Threshold Methods
§ Threshold methods use the number of top-ranked documents needed (k) to optimize query processing
§ for most applications, k is small
§ For any query, there is a minimum score that a document needs to reach before it can be shown to the user
§ the score of the kth-highest scoring document
§ gives the threshold τ
§ optimization methods estimate τʹ and use it to ignore documents

Threshold Methods
§ For document-at-a-time processing, use the score of the lowest-ranked document so far for τʹ
§ for term-at-a-time, have to use the kth-largest score in the accumulator table
§ The MaxScore method compares the maximum score that remaining documents could have to τʹ
§ safe optimization, in that the ranking will be the same as without the optimization
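A rough sketch of the MaxScore partition step (my illustration, not the [CMS09] pseudocode): given the current threshold τʹ, query terms whose maximum scores sum to at most τʹ become "optional", so only the remaining "required" terms need to drive the postings traversal. The term names below are hypothetical, echoing the "tree" example on the next slide:

```python
def maxscore_partition(term_max_scores, tau_prime):
    """Split query terms into required and optional sets.

    term_max_scores: {term: mu_t}, the maximum possible contribution of each term.
    A doc containing only optional terms cannot score above tau_prime, so it can
    never enter the current top-k; its postings need not drive query processing.
    """
    terms = sorted(term_max_scores, key=term_max_scores.get, reverse=True)
    optional, total = [], 0.0
    for t in reversed(terms):                 # grow the optional set from the smallest mu_t
        if total + term_max_scores[t] <= tau_prime:
            total += term_max_scores[t]
            optional.append(t)
        else:
            break
    required = [t for t in terms if t not in optional]
    return required, optional

# Hypothetical max term scores: "tree" has the smallest mu and becomes optional.
print(maxscore_partition({"eucalyptus": 3.0, "gum": 2.5, "tree": 0.4}, tau_prime=1.0))
# (['eucalyptus', 'gum'], ['tree'])
```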

MaxScore Example
(Figure: postings lists of the query terms, annotated with their maximum term scores μt)
§ Compute the max term score μt of each list and sort the lists in decreasing order of μt (fixed during query processing)
§ Assume k = 3; τʹ is the lowest score after the first three docs
§ If μ_tree < τʹ → any doc that scores higher than τʹ must contain at least one of the first two keywords (aka the required term set)
§ Use the postings lists of the required term set to “drive” the query processing
§ Will only check some of the white postings in the list of “tree” to compute the score → at least all gray postings are skipped. Better than the example in the textbook. See my Note 2 too.

Other Approaches
§ Early termination of query processing
§ ignore high-frequency word lists in term-at-a-time
§ ignore documents at the end of lists in doc-at-a-time
§ unsafe optimization
§ List ordering
§ order inverted lists by a quality metric (e.g., PageRank) or by partial score
§ makes unsafe (and fast) optimizations more likely to produce good documents
[CMS09].end

Sec. 7.1.1
Bottlenecks
§ Primary computational bottleneck in scoring: cosine computation
§ Can we avoid all this computation?
§ Yes, but may sometimes get it wrong
§ a doc not in the top K may creep into the list of K output docs
§ Is this such a bad thing?
27

Sec. 7.1.1
Cosine similarity is only a proxy
§ Justifications
§ User has a task and a query formulation
§ Cosine matches docs to query
§ Thus cosine is anyway a proxy for user happiness
§ Approximate query processing
§ If we get a list of K docs “close” to the top K by the cosine measure, that should be ok
28

Sec. 7.1.1
Generic approach
§ Find a set A of contenders, with K < |A| << N
§ A does not necessarily contain the top K, but has many docs from among the top K
§ Return the top K docs in A
§ Think of A as pruning non-contenders
§ The same approach is also used for other (non-cosine) scoring functions
§ Will look at several schemes following this approach
29

Sec. 7.1.2
Index elimination
§ Basic algorithm FastCosineScore of Fig 7.1 only considers docs containing at least one query term
§ Take this further:
§ Only consider high-idf query terms
§ Only consider docs containing many query terms
30

Sec. 7.1.2
High-idf query terms only
§ For a query such as catcher in the rye
§ Only accumulate scores from catcher and rye
§ Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much
§ Benefit:
§ Postings of low-idf terms have many docs → these (many) docs get eliminated from the set A of contenders
31

Sec. 7.1.2
Docs containing many query terms
§ Any doc with at least one query term is a candidate for the top K output list
§ For multi-term queries, only compute scores for docs containing several of the query terms
§ Say, at least 3 out of 4
§ Imposes a “soft conjunction” on queries, as seen on web search engines (early Google)
§ Easy to implement in postings traversal (see the sketch after the example below)
32

Sec. 7.1.2
3 of 4 query terms
Can generalize to the WAND method (safe)
Antony:    3 4 8 16 32 64 128
Brutus:    2 4 8 16 32 64 128
Caesar:    1 2 3 5 8 13 21 34
Calpurnia: 13 16 32
Scores only computed for docs 8, 16 and 32.
33
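The "at least 3 of 4 terms" pruning is easy to implement with a per-document count of matched terms; this illustration (mine) reuses the toy postings from the example above and reproduces the result that only docs 8, 16 and 32 get scored:

```python
from collections import Counter

def candidates_with_min_terms(postings, min_terms):
    """Return docIDs that appear in at least min_terms of the query postings lists."""
    counts = Counter(doc for plist in postings.values() for doc in plist)
    return sorted(d for d, c in counts.items() if c >= min_terms)

postings = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}
print(candidates_with_min_terms(postings, min_terms=3))   # [8, 16, 32]
```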
Sec. 7.1.3
Champion lists
§ Precompute, for each dictionary term t, the r docs of highest weight in t’s postings
§ Call this the champion list for t
§ (aka fancy list or top docs for t)
§ Note that r has to be chosen at index build time
§ Thus, it’s possible that r < K
§ At query time, only compute scores for docs in A = ∪_{t∈Q} ChampionList(t)
§ Pick the K top-scoring docs from amongst these
Inspired by the “fancy lists” of Google: http://infolab.stanford.edu/~backrub/google.html
34

Sec. 7.1.3
Exercises
§ How do Champion Lists relate to Index Elimination? Can they be used together?
§ How can Champion Lists be implemented in an inverted index?
§ Note that the champion list has nothing to do with small docIDs
35

Sec. 7.1.4
Static quality scores
§ We want top-ranking documents to be both relevant and authoritative
§ Relevance is being modeled by cosine scores
§ Authority is typically a query-independent property of a document
§ Examples of authority signals
§ Wikipedia among websites
§ Articles in certain newspapers
§ A paper with many citations
§ Many diggs, Y!buzzes or del.icio.us marks
§ (PageRank)
(all quantitative signals)
36

Sec. 7.1.4
Modeling authority
§ Assign to each document d a query-independent quality score in [0,1]
§ Denote this by g(d)
§ Thus, a quantity like the number of citations is scaled into [0,1]
§ Exercise: suggest a formula for this.
37

Sec. 7.1.4
Net score
§ Consider a simple total score combining cosine relevance and authority
§ net-score(q,d) = g(d) + cosine(q,d)
§ Can use some other linear combination than an equal weighting
§ Indeed, any function of the two “signals” of user happiness – more later
§ Now we seek the top K docs by net score
38

Sec. 7.1.4
Top K by net score – fast methods
§ First idea: Order all postings by g(d)
§ Key: this is a common ordering for all postings
§ Thus, can concurrently traverse query terms’ postings for
§ Postings intersection
§ Cosine score computation
§ Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)
39

Sec. 7.1.4
Why order postings by g(d)?
§ Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal
§ In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop the postings traversal early
§ Short of computing scores for all docs in the postings
40

Sec. 7.1.4
Champion lists in g(d)-ordering
§ Can combine champion lists with g(d)-ordering
§ Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
§ Seek top-K results from only the docs in these champion lists
41

Sec. 7.1.4
High and low lists
§ For each term, we maintain two postings lists called high and low
§ Think of high as the champion list
§ When traversing postings on a query, only traverse the high lists first
§ If we get more than K docs, select the top K and stop
§ Only union the high lists
§ Else proceed to get docs from the low lists
§ Can be used even for simple cosine scores, without global quality g(d)
§ A means for segmenting the index into two tiers
42
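A minimal sketch of the high/low split and the query-time fallback just described (my illustration; postings again carry precomputed weights w_{t,d}):

```python
import heapq

def split_high_low(index, r):
    """Split each term's postings into a 'high' list (its champion list: the r
    postings with the largest w_{t,d}) and a 'low' list (everything else)."""
    high, low = {}, {}
    for term, plist in index.items():                      # plist: [(docID, w_{t,d}), ...]
        top = heapq.nlargest(r, plist, key=lambda p: p[1])
        top_ids = {d for d, _ in top}
        high[term] = top
        low[term] = [p for p in plist if p[0] not in top_ids]
    return high, low

def high_low_candidates(query_terms, high, low, k):
    """Traverse only the high lists; fall back to the low lists if they yield < k docs."""
    cand = {d for t in query_terms for d, _ in high.get(t, [])}
    if len(cand) < k:
        cand |= {d for t in query_terms for d, _ in low.get(t, [])}
    return cand          # cosine (or net) scores are then computed only for these docs
```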
Sec. 7.1.5
Impact-ordered postings
§ We only want to compute scores for docs for which wf_{t,d} is high enough
§ We sort each postings list by wf_{t,d}
§ Now: not all postings are in a common order!
§ How do we compute scores in order to pick off the top K?
§ Two ideas follow
43

Sec. 7.1.5
1. Early termination
§ When traversing t’s postings, stop early after either
§ a fixed number of r docs
§ wf_{t,d} drops below some threshold
§ Take the union of the resulting sets of docs
§ One from the postings of each query term
§ Compute only the scores for docs in this union
44

Sec. 7.1.5
2. idf-ordered terms
§ When considering the postings of query terms
§ Look at them in order of decreasing idf
§ High-idf terms likely to contribute most to the score
§ As we update the score contribution from each query term
§ Stop if doc scores are relatively unchanged
§ Can apply to cosine or some other net scores
45

Sec. 7.1.6
Cluster pruning: preprocessing (why √N leaders?)
§ Pick √N docs at random: call these leaders
§ For every other doc, pre-compute its nearest leader
§ Docs attached to a leader: its followers
§ Likely: each leader has ~ √N followers.
46

Sec. 7.1.6
Cluster pruning: query processing
§ Process a query as follows:
§ Given query Q, find its nearest leader L.
§ Seek the K nearest docs from among L’s followers.
47

Sec. 7.1.6
Visualization
(Figure: documents in vector space; legend: Query, Leader, Follower)
48

Sec. 7.1.6
Why use random sampling
§ Fast
§ Leaders reflect the data distribution
49

Sec. 7.1.6
General variants
§ Have each follower attached to b1=3 (say) nearest leaders.
§ From the query, find b2=4 (say) nearest leaders and their followers.
§ Can recur on the leader/follower construction.
50

Sec. 7.1.6
Exercises
§ To find the nearest leader in step 1, how many cosine computations do we do?
§ Why did we have √N in the first place?
§ Hint: write down the algorithm, model its cost, and minimize the cost.
§ What is the effect of the constants b1, b2 on the previous slide?
§ Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.
§ Likely under random sampling.
51

Sec. 6.1
Parametric and zone indexes
§ Thus far, a doc has been a sequence of terms
§ In fact documents have multiple parts, some with special semantics:
§ Author
§ Title
§ Date of publication
§ Language
§ Format
§ etc.
§ These constitute the metadata about a document
52

Sec. 6.1
Fields
§ We sometimes wish to search by these metadata
§ E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
§ Year = 1601 is an example of a field
§ Also, author last name = shakespeare, etc.
§ Field or parametric index: postings for each field value
§ Sometimes build range trees (e.g., for dates)
§ Field query typically treated as conjunction
§ (doc must be authored by shakespeare)
53

Sec. 6.1
Zone
§ A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
§ Title
§ Abstract
§ References ...
§ Build inverted indexes on zones as well to permit querying
§ E.g., “find docs with merchant in the title zone and matching the query gentle rain”
54
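A minimal sketch of a zone-aware inverted index (my illustration): each posting records the zones in which the term occurs, so a query like the one above can be restricted to the title zone.

```python
from collections import defaultdict

def build_zone_index(docs):
    """docs: {docID: {zone_name: text}} -> index: {term: {docID: set of zones}}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in text.lower().split():
                index[term][doc_id].add(zone)
    return index

docs = {1: {"title": "The Merchant of Venice", "body": "gentle rain from heaven"}}
index = build_zone_index(docs)
# docs with "merchant" in the title zone:
print([d for d, zones in index["merchant"].items() if "title" in zones])   # [1]
```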
Sec. 6.1
Example zone indexes
Encode zones in the dictionary vs. in the postings.
55

Sec. 7.2.1
Tiered indexes
§ Break postings up into a hierarchy of lists
§ Most important
§ ...
§ Least important
§ Can be done by g(d) or another measure
§ Inverted index thus broken up into tiers of decreasing importance
§ At query time use the top tier unless it fails to yield K docs
§ If so, drop to lower tiers
56

Sec. 7.2.1
Example tiered index
57

Sec. 7.2.2
Query term proximity
§ Free text queries: just a set of terms typed into the query box – common on the web
§ Users prefer docs in which query terms occur within close proximity of each other
§ Let w be the smallest window in a doc containing all query terms, e.g.,
§ For the query strained mercy, the smallest window in the doc The quality of mercy is not strained is 4 (words)
§ Would like the scoring function to take this into account – how?
58

Sec. 7.2.3
Query parsers
§ Free text query from the user may in fact spawn one or more queries to the indexes, e.g. for the query rising interest rates
§ Run the query as a phrase query
§ If