COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval

Lecture 7: Scoring and results assembly


Recap: tf-idf weighting

§  The tf-idf weight of a term is the product of its tf weight and its idf weight.

§  Best known weighting scheme in information retrieval
§  Increases with the number of occurrences within a document
§  Increases with the rarity of the term in the collection

$w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$

Ch. 6
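A minimal Python sketch of this weighting (the function name and toy numbers are illustrative, not from the lecture):

```python
import math

def tf_idf_weight(tf, df, N):
    """tf-idf weight: (1 + log10(tf)) * log10(N / df).
    tf: term frequency in the doc; df: number of docs containing
    the term; N: total number of docs in the collection."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 3 times in a doc, appearing in 1,000 of 1M docs:
print(tf_idf_weight(tf=3, df=1_000, N=1_000_000))  # ~4.43
```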


Recap: Queries as vectors
§  Key idea 1: Do the same for queries: represent them as vectors in the space

§  Key idea 2: Rank documents according to their proximity to the query in this space

§  proximity = similarity of vectors

Ch. 6


Recap: cosine(query,document)

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

(the numerator is the dot product of $\vec{q}$ and $\vec{d}$; the middle form is the dot product of the corresponding unit vectors)

cos(q,d) is the cosine similarity of q and d … or,
equivalently, the cosine of the angle between q and d.

Ch. 6
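A pure-Python sketch of this computation (names are illustrative):

```python
import math

def cosine(q, d):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    len_q = math.sqrt(sum(qi * qi for qi in q))
    len_d = math.sqrt(sum(di * di for di in d))
    if len_q == 0 or len_d == 0:
        return 0.0
    return dot / (len_q * len_d)

# Vectors pointing in the same direction score 1.0:
print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0
```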


This lecture

§  Speeding up vector space ranking
§  Putting together a complete search system
§  Will require learning about a number of miscellaneous topics and heuristics

Ch. 7


Question: Why don’t we just use the query processing methods for Boolean queries?


Computing cosine scores

Sec. 6.3.3

(Figure: the CosineScore algorithm, computed term-at-a-time; see the term-at-a-time sketches later in this lecture)


Efficient cosine ranking

§  Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.

§  Efficient ranking:
§  Computing a single cosine efficiently.
§  Choosing the K largest cosine values efficiently.

§  Can we do this without computing all N cosines?

Sec. 7.1


Efficient cosine ranking

§  What we’re doing in effect: solving the K-nearest neighbor problem for a query vector

§  In general, we do not know how to do this efficiently for high-dimensional spaces

§  But it is solvable for short queries, and standard indexes support this well

Sec. 7.1

Curse of dimensionality

Special case – unweighted queries

§  No weighting on query terms
§  Assume each query term occurs only once

§  Then for ranking, we don’t need to normalize the query vector
§  Slight simplification of the algorithm from Lecture 6

Sec. 7.1


Faster cosine: unweighted query

(Figure: the FastCosineScore algorithm of Fig 7.1 — term-at-a-time scoring with unit query weights)

Sec. 7.1


Computing the K largest cosines: selection vs. sorting

§  Typically we want to retrieve the top K docs (in the cosine ranking for the query)
§  not to totally order all docs in the collection

§  Can we pick off the docs with the K highest cosines?
§  Let n = the number of docs with nonzero cosines
§  We seek the K best of these n

Sec. 7.1


Use heap for selecting top K

§  Binary tree in which each node’s value > the values of its children

§  Takes 2n operations to construct; then each of the K “winners” is read off in 2 log n steps.

§  Total time is O(n + K*log(n))
§  For n=1M, K=100, this is about 10% of the cost of sorting.

(Figure: example binary max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1)

Sec. 7.1


http://en.wikipedia.org/wiki/Binary_heap
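A short Python sketch of heap-based top-K selection using the standard heapq module (heapq.nlargest keeps a size-K heap and runs in O(n log K) — a slightly different trade-off from the build-heap-then-pop analysis above; data is illustrative):

```python
import heapq

def top_k(doc_scores, k):
    """Return the k (doc_id, score) pairs with the largest scores,
    without fully sorting the input."""
    return heapq.nlargest(k, doc_scores, key=lambda pair: pair[1])

docs = [("d1", 0.9), ("d2", 0.3), ("d3", 0.8), ("d4", 0.1), ("d5", 1.0)]
print(top_k(docs, 3))  # [('d5', 1.0), ('d1', 0.9), ('d3', 0.8)]
```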


Quick Select

§  QuickSelect is similar to QuickSort; it finds the top-K elements of an array
§  Takes O(n) time

§  Sorting the top-K items takes O(K*log(K)) time
§  Total time is O(n + K*log(K))


http://en.wikipedia.org/wiki/Quickselect
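A hedged, in-place Quickselect sketch for the top-K by score (descending partition; one of several possible variants, see the Wikipedia article above):

```python
import random

def quickselect(arr, k):
    """Partition arr in place so the k largest elements occupy
    arr[:k] (unsorted). Expected O(n) time."""
    lo, hi = 0, len(arr) - 1
    while lo < hi:
        pivot = arr[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while arr[i] > pivot:   # '>' puts larger elements first
                i += 1
            while arr[j] < pivot:
                j -= 1
            if i <= j:
                arr[i], arr[j] = arr[j], arr[i]
                i, j = i + 1, j - 1
        # Recurse only into the side that contains position k-1.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break  # position k-1 lies in the equal-to-pivot middle
    return arr[:k]

scores = [0.3, 0.9, 0.1, 0.8, 0.5, 1.0]
print(sorted(quickselect(scores, 3), reverse=True))  # [1.0, 0.9, 0.8]
```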


Query Processing

§  Document-at-a-time
§  Calculates complete scores for documents by processing all term lists, one document at a time

§  Term-at-a-time
§  Accumulates scores for documents by processing term lists one at a time

§  Both approaches have optimization techniques that significantly reduce the time required to generate scores
§  Distinguish between safe and heuristic optimizations

[CMS09].begin

Document-At-A-Time

(Figure: document-at-a-time scoring pseudocode, [CMS09])
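Since the original figure is from [CMS09] and not reproduced here, the following is a minimal Python sketch of the idea under illustrative assumptions: each postings list is a list of (doc_id, weight) pairs sorted by doc_id, and a document’s score is simply the sum of its term weights:

```python
import heapq

def daat_score(postings_lists, k):
    """Document-at-a-time: advance all term lists in parallel,
    fully scoring one document before moving to the next."""
    cursors = [0] * len(postings_lists)
    top = []  # min-heap of (score, doc_id), size <= k
    while True:
        # Smallest unprocessed doc_id across all lists.
        current = min((pl[c][0] for pl, c in zip(postings_lists, cursors)
                       if c < len(pl)), default=None)
        if current is None:
            break
        score = 0.0
        for t, pl in enumerate(postings_lists):
            c = cursors[t]
            if c < len(pl) and pl[c][0] == current:
                score += pl[c][1]   # this term's contribution
                cursors[t] = c + 1  # advance past the scored posting
        heapq.heappush(top, (score, current))
        if len(top) > k:
            heapq.heappop(top)      # keep only the k best
    return sorted(top, reverse=True)
```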

Term-At-A-Time

(Figure: term-at-a-time scoring pseudocode, [CMS09]; an accumulator A_d holds each document’s partial score)
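The matching term-at-a-time sketch under the same illustrative assumptions; the accumulator table plays the role of A_d above:

```python
import heapq
from collections import defaultdict

def taat_score(postings_lists, k):
    """Term-at-a-time: process one complete term list at a time;
    accumulators[d] is A_d, the partial score of document d."""
    accumulators = defaultdict(float)
    for pl in postings_lists:              # one term list at a time
        for doc_id, weight in pl:
            accumulators[doc_id] += weight  # update partial score
    # All terms processed: pick off the top k accumulators.
    return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

antony = [(3, 0.5), (8, 1.25)]
brutus = [(2, 0.75), (8, 0.5)]
print(taat_score([antony, brutus], k=2))  # [(8, 1.75), (2, 0.75)]
```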

Optimization Techniques

§  Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently

§  Two classes of optimization
§  Read less data from inverted lists
§  e.g., skip lists
§  better for simple feature functions
§  Calculate scores for fewer documents
§  e.g., conjunctive processing
§  better for complex feature functions


Conjunctive Processing

§  Requires the result document to contain all the query terms (i.e., conjunctive Boolean queries)
§  More efficient
§  Can also be more effective for short queries
§  Default for many search engines

§  Can be combined with both DAAT and TAAT (a DAAT sketch follows the next two slides)


Conjunctive Term-at-a-Time

(Figure: conjunctive term-at-a-time pseudocode, [CMS09])


Conjunctive Document-at-a-Time

(Figure: conjunctive document-at-a-time pseudocode, [CMS09])
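A hedged sketch of conjunctive document-at-a-time processing, under the same illustrative postings layout as before (sorted (doc_id, weight) pairs, toy additive scores); it is not the textbook’s exact pseudocode:

```python
def conjunctive_daat(postings_lists, k):
    """Score only docs that appear in ALL postings lists."""
    iters = [iter(pl) for pl in postings_lists]
    currents = [next(it, None) for it in iters]
    results = []
    while all(p is not None for p in currents):
        candidate = max(p[0] for p in currents)
        # Advance every list to the candidate (or past it).
        for i, it in enumerate(iters):
            while currents[i] is not None and currents[i][0] < candidate:
                currents[i] = next(it, None)
        if any(p is None for p in currents):
            break
        if all(p[0] == candidate for p in currents):
            results.append((sum(p[1] for p in currents), candidate))
            currents = [next(it, None) for it in iters]
    return sorted(results, reverse=True)[:k]
```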


Threshold Methods

§  Threshold methods use the number of top-ranked documents needed (k) to optimize query processing
§  for most applications, k is small

§  For any query, there is a minimum score that each document needs to reach before it can be shown to the user
§  the score of the kth-highest scoring document
§  gives the threshold τ
§  optimization methods estimate τʹ to ignore documents


Threshold Methods

§  For document-at-a-time processing, use the score of the lowest-ranked document so far for τʹ
§  for term-at-a-time, we have to use the kth-largest score in the accumulator table

§  The MaxScore method compares the maximum score that remaining documents could have against τʹ
§  a safe optimization, in that the ranking will be the same as without the optimization (see the sketch below)
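A hedged Python sketch of just the MaxScore pruning test (not the full algorithm): mu_sorted is a hypothetical list of (term, μt) pairs sorted by decreasing μt, and threshold plays the role of τʹ:

```python
def required_terms(mu_sorted, threshold):
    """Split terms (sorted by decreasing max score mu) into a
    required prefix and an optional suffix: a doc containing only
    suffix terms can never reach the threshold, so postings of the
    required prefix alone can drive query processing."""
    suffix_bound = 0.0
    split = len(mu_sorted)
    # Grow the suffix from the end while its best possible
    # total score still stays below the threshold.
    for i in range(len(mu_sorted) - 1, -1, -1):
        term, mu = mu_sorted[i]
        if suffix_bound + mu >= threshold:
            break
        suffix_bound += mu
        split = i
    return mu_sorted[:split], mu_sorted[split:]

mus = [("information", 3.1), ("retrieval", 2.4), ("tree", 0.9)]
print(required_terms(mus, threshold=3.0))
# required: information, retrieval; optional: tree
```

Documents found only in the optional suffix lists can be skipped outright, since their best possible score is below τʹ.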


MaxScore Example

§  Compute the max term score, μt, of each postings list, and sort the lists in decreasing order of μt (fixed during query processing)

§  Assume k = 3; τʹ is the lowest score after the first three docs
§  If μ_tree < τʹ ⇒ any doc that scores higher than τʹ must contain at least one of the first two keywords (aka the required term set)
§  Use the postings lists of the required term set to “drive” the query processing
§  Will only check some of the white postings in the list of “tree” to compute scores ⇒ at least all gray postings are skipped.

Better than the example in the textbook. See my Note 2 too.

Other Approaches

§  Early termination of query processing
§  ignore high-frequency word lists in term-at-a-time
§  ignore documents at the end of lists in doc-at-a-time
§  unsafe optimization

§  List ordering
§  order inverted lists by a quality metric (e.g., PageRank) or by partial score
§  makes unsafe (and fast) optimizations more likely to produce good documents

[CMS09].end

Bottlenecks

§  Primary computational bottleneck in scoring: cosine computation
§  Can we avoid all this computation?
§  Yes, but we may sometimes get it wrong
§  a doc not in the top K may creep into the list of K output docs
§  Is this such a bad thing?

Sec. 7.1.1

Cosine similarity is only a proxy

§  Justifications
§  The user has a task and a query formulation
§  Cosine matches docs to the query
§  Thus cosine is anyway a proxy for user happiness
§  Approximate query processing
§  If we get a list of K docs “close” to the top K by the cosine measure, that should be OK

Sec. 7.1.1

Generic approach

§  Find a set A of contenders, with K < |A| << N
§  A does not necessarily contain the top K, but has many docs from among the top K
§  Return the top K docs in A
§  Think of A as pruning non-contenders

§  The same approach is also used for other (non-cosine) scoring functions
§  Will look at several schemes following this approach

Sec. 7.1.1

Index elimination

§  The basic algorithm FastCosineScore of Fig 7.1 only considers docs containing at least one query term
§  Take this further:
§  Only consider high-idf query terms
§  Only consider docs containing many query terms

Sec. 7.1.2

High-idf query terms only

§  For a query such as catcher in the rye
§  Only accumulate scores from catcher and rye

§  Intuition: in and the contribute little to the scores and so don’t alter the rank-ordering much

§  Benefit:
§  Postings of low-idf terms have many docs → these (many) docs get eliminated from the set A of contenders

Sec. 7.1.2

Docs containing many query terms

§  Any doc with at least one query term is a candidate for the top K output list
§  For multi-term queries, only compute scores for docs containing several of the query terms
§  Say, at least 3 out of 4
§  Imposes a “soft conjunction” on queries, as seen on web search engines (early Google)
§  Easy to implement in postings traversal (see the sketch below)

Sec. 7.1.2

3 of 4 query terms

Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32

Scores only computed for docs 8, 16 and 32.
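A small Python sketch of this soft conjunction: count, per document, how many of the query terms’ postings lists contain it, and score only docs meeting the cutoff (illustrative code, using the example above):

```python
from collections import Counter

def soft_conjunction_candidates(postings_lists, min_terms):
    """Return doc_ids appearing in at least min_terms of the lists."""
    counts = Counter()
    for pl in postings_lists:
        counts.update(pl)  # each list contributes once per doc
    return sorted(d for d, c in counts.items() if c >= min_terms)

antony    = [3, 4, 8, 16, 32, 64, 128]
brutus    = [2, 4, 8, 16, 32, 64, 128]
caesar    = [1, 2, 3, 5, 8, 13, 21, 34]
calpurnia = [13, 16, 32]

# At least 3 of the 4 query terms: only docs 8, 16 and 32 get scored.
print(soft_conjunction_candidates([antony, brutus, caesar, calpurnia], 3))
# [8, 16, 32]
```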
Sec. 7.1.2

Can generalize to the WAND method (safe)

Champion lists

§  Precompute, for each dictionary term t, the r docs of highest weight in t’s postings
§  Call this the champion list for t
§  (aka fancy list or top docs for t)

§  Note that r has to be chosen at index build time
§  Thus, it’s possible that r < K

§  At query time, only compute scores for docs in A = ∪t∈Q ChampionList(t)
§  Pick the K top-scoring docs from amongst these

Sec. 7.1.3

Inspired by the “fancy lists” of Google: http://infolab.stanford.edu/~backrub/google.html

Exercises

§  How do Champion Lists relate to Index Elimination? Can they be used together?
§  How can Champion Lists be implemented in an inverted index?
§  Note that the champion list has nothing to do with small docIDs

Sec. 7.1.3

Static quality scores

§  We want top-ranking documents to be both relevant and authoritative
§  Relevance is being modeled by cosine scores
§  Authority is typically a query-independent property of a document
§  Examples of (quantitative) authority signals
§  Wikipedia among websites
§  Articles in certain newspapers
§  A paper with many citations
§  Many diggs, Y!buzzes or del.icio.us marks
§  (PageRank)

Sec. 7.1.4

Modeling authority

§  Assign a query-independent quality score in [0,1] to each document d
§  Denote this by g(d)
§  Thus, a quantity like the number of citations is scaled into [0,1]
§  Exercise: suggest a formula for this.

Sec. 7.1.4

Net score

§  Consider a simple total score combining cosine relevance and authority
§  net-score(q,d) = g(d) + cosine(q,d)
§  Can use some other linear combination than an equal weighting
§  Indeed, any function of the two “signals” of user happiness – more later
§  Now we seek the top K docs by net score

Sec. 7.1.4

Top K by net score – fast methods

§  First idea: Order all postings by g(d)
§  Key: this is a common ordering for all postings
§  Thus, we can concurrently traverse query terms’ postings for
§  Postings intersection
§  Cosine score computation
§  Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

Sec. 7.1.4

Why order postings by g(d)?

§  Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal
§  In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop the postings traversal early
§  Short of computing scores for all docs in the postings

Sec. 7.1.4

Champion lists in g(d)-ordering

§  Can combine champion lists with g(d)-ordering
§  Maintain for each term a champion list of the r docs with highest g(d) + tf-idft,d
§  Seek top-K results from only the docs in these champion lists

Sec. 7.1.4

High and low lists

§  For each term, we maintain two postings lists called high and low
§  Think of high as the champion list
§  When processing a query, traverse only the high lists first
§  If we get more than K docs, select the top K and stop
§  Only union the high lists
§  Else proceed to get docs from the low lists
§  Can be used even for simple cosine scores, without a global quality g(d)
§  A means for segmenting the index into two tiers (see the sketch below)
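A minimal sketch of the high/low strategy just described; `high` and `low` are hypothetical dicts mapping each term to a set of doc IDs:

```python
def two_tier_candidates(query_terms, high, low, k):
    """Take candidate docs from the high (champion) lists first;
    fall back to the low lists only if fewer than k docs result."""
    candidates = set()
    for t in query_terms:
        candidates |= high.get(t, set())  # union of high lists only
    if len(candidates) < k:               # top tier failed to yield k docs
        for t in query_terms:
            candidates |= low.get(t, set())
    return candidates  # score these and pick the top k
```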
Sec. 7.1.4

Impact-ordered postings

§  We only want to compute scores for docs whose wft,d is high enough
§  We sort each postings list by wft,d
§  Now: not all postings are in a common order!
§  How do we compute scores in order to pick off the top K?
§  Two ideas follow

Sec. 7.1.5

1. Early termination

§  When traversing t’s postings, stop early after either
§  a fixed number of r docs, or
§  wft,d drops below some threshold
§  Take the union of the resulting sets of docs
§  one from the postings of each query term
§  Compute scores only for docs in this union

Sec. 7.1.5

2. idf-ordered terms

§  When considering the postings of query terms
§  Look at them in order of decreasing idf
§  High-idf terms are likely to contribute most to the score
§  As we update the score contribution from each query term
§  Stop if doc scores are relatively unchanged
§  Can apply to cosine or other net scores

Sec. 7.1.5

Cluster pruning: preprocessing

§  Pick √N docs at random: call these leaders
§  For every other doc, pre-compute the nearest leader
§  Docs attached to a leader: its followers
§  Likely: each leader has ~ √N followers

Sec. 7.1.6

Why √N leaders?

Cluster pruning: query processing

§  Process a query as follows:
§  Given query Q, find its nearest leader L.
§  Seek the K nearest docs from among L’s followers.

Sec. 7.1.6

Visualization

(Figure: a query point, its nearest leader, and that leader’s followers)

Sec. 7.1.6

Why use random sampling

§  Fast
§  Leaders reflect the data distribution

Sec. 7.1.6

General variants

§  Have each follower attached to b1=3 (say) nearest leaders.
§  From the query, find b2=4 (say) nearest leaders and their followers.
§  Can recur on the leader/follower construction.

Sec. 7.1.6

Exercises

§  To find the nearest leader in step 1, how many cosine computations do we do?
§  Why did we have √N in the first place?
§  Hint: write down the algorithm, model its cost, and minimize the cost.
§  What is the effect of the constants b1, b2 on the previous slide?
§  Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.
§  Likely under random sampling.

Sec. 7.1.6

Parametric and zone indexes

§  Thus far, a doc has been a sequence of terms
§  In fact, documents have multiple parts, some with special semantics:
§  Author
§  Title
§  Date of publication
§  Language
§  Format
§  etc.
§  These constitute the metadata about a document

Sec. 6.1

Fields

§  We sometimes wish to search by this metadata
§  E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
§  Year = 1601 is an example of a field
§  Also, author last name = shakespeare, etc.

§  Field or parametric index: postings for each field value
§  Sometimes build range trees (e.g., for dates)

§  A field query is typically treated as a conjunction (see the sketch below)
§  (doc must be authored by shakespeare)
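A toy sketch of a field query treated as a conjunction, with hypothetical dicts mapping terms and (field, value) pairs to sets of doc IDs:

```python
def field_query(term_postings, field_postings, terms, fields):
    """Docs matching every query term AND every field constraint.
    term_postings: term -> set of doc_ids
    field_postings: (field, value) -> set of doc_ids"""
    sets = [term_postings.get(t, set()) for t in terms]
    sets += [field_postings.get(fv, set()) for fv in fields]
    return set.intersection(*sets) if sets else set()

term_postings = {"alas": {1, 7}, "poor": {1, 7, 9}, "yorick": {1, 7}}
field_postings = {("author", "shakespeare"): {1, 4, 7},
                  ("year", "1601"): {1, 3}}
print(field_query(term_postings, field_postings,
                  ["alas", "poor", "yorick"],
                  [("author", "shakespeare"), ("year", "1601")]))  # {1}
```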
Sec. 6.1

Zone

§  A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
§  Title
§  Abstract
§  References …
§  Build inverted indexes on zones as well to permit querying
§  E.g., “find docs with merchant in the title zone and matching the query gentle rain”

Sec. 6.1

Example zone indexes

(Figure: encoding zones in the dictionary vs. in the postings)

Sec. 6.1

Tiered indexes

§  Break postings up into a hierarchy of lists
§  Most important
§  …
§  Least important
§  Can be done by g(d) or another measure
§  The inverted index is thus broken into tiers of decreasing importance
§  At query time, use the top tier unless it fails to yield K docs
§  If so, drop to lower tiers

Sec. 7.2.1

Example tiered index

(Figure: an example index broken into tiers of decreasing importance)

Sec. 7.2.1

Query term proximity

§  Free text queries: just a set of terms typed into the query box – common on the web
§  Users prefer docs in which query terms occur within close proximity of each other
§  Let w be the smallest window in a doc containing all query terms, e.g.,
§  for the query strained mercy, the smallest window in the doc The quality of mercy is not strained is 4 (words)
§  Would like the scoring function to take this into account – how?

Sec. 7.2.2

Query parsers

§  A free text query from the user may in fact spawn one or more queries to the indexes, e.g., for the query rising interest rates:
§  Run the query as a phrase query
§  If …