LECTURE 14
Introduction to Information Retrieval
Arkaitz Zubiaga, 21st February, 2018
What is Information Retrieval (IR)?
Indexing Documents.
Query Processing.
Positional Indices.
Other Challenges in Information Retrieval.
LECTURE 14: CONTENTS
Information Retrieval (IR): the task of obtaining, from a large collection, documents that satisfy an information need.
The collection (e.g. the Web) may include images and videos.
In this module we’ll focus on text.
INFORMATION RETRIEVAL
Web search, e.g. Google.
Vertical search, web search on a particular topic, e.g. Yelp.
Email search.
Searching content in large databases or hard drives.
EXAMPLES OF INFORMATION RETRIEVAL
EXAMPLE OF AN IR SYSTEM?
The information need that an IR system has to satisfy is usually expressed as a (short) text query, e.g. hotel in Coventry.
Many queries are vague: the average search query has 2-4 keywords (Arampatzis & Kamps, 2008).
INFORMATION NEED
Queries can express 3 types of information need:
Navigational (looking for a website, e.g. Facebook).
Informational (looking for information, e.g. Thai food).
Transactional (buying or looking for a product/service, e.g. iPhone 10).
INFORMATION NEED
THE CLASSIC SEARCH MODEL
The search process can be iterative.
A query gives results; if unhappy with the results, the user refines the query.
With a good IR system, we aim to minimise the number of times the user refines the query.
In IR, documents in a collection are deemed relevant (R) or non-relevant (N) with respect to a particular query.
The objective of an IR system is to present, for a given query, as many relevant results as possible (and as few non-relevant results as possible).
58% of users never go to the 2nd page of search results (Jansen et al., 1998).
RELEVANCE TO A QUERY
In a small collection, we can process files on the fly.
For a given query, go through all files and check if the query text appears in each file.
It would take ages for large collections.
For large collections, indexing is the solution, i.e. pre-generating lists of word-document associations.
SEARCHING THROUGH LARGE COLLECTIONS
INDEXING DOCUMENTS
Reduced sample of Shakespeare’s works:
We can search for: +Brutus +Caesar -Calpurnia (2 results)
TERM-DOCUMENT INCIDENCE MATRICES
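A minimal sketch in Python of Boolean retrieval over the incidence matrix. The plays and 0/1 vectors follow the classic Shakespeare example; the boolean_query helper is purely illustrative:

docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term: entry i is 1 if the term occurs in docs[i].
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_query(must_have, must_not_have):
    # Bitwise AND of the vectors of required terms, ANDed with the
    # complement of the vectors of excluded terms.
    result = [1] * len(docs)
    for term in must_have:
        result = [a & b for a, b in zip(result, incidence[term])]
    for term in must_not_have:
        result = [a & (1 - b) for a, b in zip(result, incidence[term])]
    return [doc for doc, bit in zip(docs, result) if bit]

print(boolean_query(["Brutus", "Caesar"], ["Calpurnia"]))
# ['Antony and Cleopatra', 'Hamlet']  -- the 2 results for +Brutus +Caesar -Calpurnia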
Collection of N = 1M documents, each with 1K words.
Say there are M = 500K distinct terms among these.
A 500K x 1M matrix has half a trillion 0’s and 1’s.
Very sparse: only one billion 1’s (that’s 0.2% of the values).
Alternative: record only the 1’s → use an inverted index.
EVEN BIGGER COLLECTIONS
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
For each term t, store list of documents that contain t.
List needs to have variable size.
Word ‘Brutus’ occurs in documents ID 1, ID 2, ID 4, etc.
INVERTED INDEX
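As a sketch, the inverted index can be held as a dictionary that maps each term to its sorted, variable-length postings list (doc IDs taken from the figure above):

inverted_index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Postings are kept sorted by doc ID so that Boolean queries can be answered
# with a linear merge rather than repeated membership tests.
print(inverted_index["Brutus"])   # documents in which 'Brutus' occurs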
List (token, Document ID) pairs.
INDEXING DOCUMENTS: EXAMPLE
I did enact Julius
Caesar I was killed
i’ the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Sort (alphabetically) by tokens and then doc ID.
INDEXING DOCUMENTS: EXAMPLE
Merge entries + add frequency counts: dictionary and postings.
INDEXING DOCUMENTS: EXAMPLE
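A minimal sketch of this indexing pipeline on the two example documents. The lowercase regular-expression tokeniser is an assumption for illustration; a real system would normalise tokens more carefully:

from collections import defaultdict
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: list (token, doc ID) pairs.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

# Step 2: sort alphabetically by token, then by doc ID.
pairs.sort()

# Step 3: merge duplicate entries into postings lists plus frequency counts.
postings = defaultdict(list)   # term -> sorted list of doc IDs
doc_freq = defaultdict(int)    # term -> number of documents containing it
for token, doc_id in pairs:
    if not postings[token] or postings[token][-1] != doc_id:
        postings[token].append(doc_id)
        doc_freq[token] += 1

print(postings["caesar"], doc_freq["caesar"])   # [1, 2] 2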
QUERY PROCESSING
Search query: Brutus AND Caesar
Retrieve the postings for Brutus.
Retrieve the postings for Caesar.
Get the intersection as the set of results (docs 2 and 8).
QUERY PROCESSING: AND
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
GETTING THE INTERSECTION
Complexity of the algorithm: O(m+n), as we iterate through all items, where m is the length of p1 and n is the length of p2.
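A sketch of that linear merge, run on the Brutus and Caesar postings from the AND-query example:

def intersect(p1, p2):
    # Walk both sorted postings lists with one pointer each: O(m + n).
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]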
SKIP POINTERS
Can we improve on the linear time O(m+n) and compute the intersection in sublinear time?
● We can use skip pointers, i.e.:
● If p1 has: 1, 3, 5, 15, 40 and p2 has: 35, 40, 90
● Pointers will initially be pp1 = 1 and pp2 = 35
● We can skip all values smaller than 35 in p1.
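A sketch of intersection with skip pointers. Placing a skip every ~√(list length) entries is a common heuristic; the exact layout is an implementation choice, not something fixed by the slides:

import math

def advance(postings, idx, skip, target):
    # Take skip pointers while they do not overshoot target, else step by one.
    if idx % skip == 0 and idx + skip < len(postings) and postings[idx + skip] <= target:
        while idx % skip == 0 and idx + skip < len(postings) and postings[idx + skip] <= target:
            idx += skip
    else:
        idx += 1
    return idx

def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer

# The slide's example: the skip from the start of p1 jumps over 1 and 3,
# avoiding some of the comparisons with values smaller than 35.
print(intersect_with_skips([1, 3, 5, 15, 40], [35, 40, 90]))   # [40]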
I want to search for “University of Warwick” – as a phrase.
For this query, “You can live in Warwick if you are a student at the university” is NOT a match.
If we want to do this kind of search, our index so far is not enough.
PHRASE QUERIES
Index bigrams instead of single words.
For the document “I went to the University of Warwick”, index:
I went, went to, to the, the University, etc.
We can easily search for bigrams now, but we can’t look for “University of Warwick” yet!
We can search for “University of” AND “of Warwick”, but there is no guarantee that they occur together in the document.
HOW ABOUT BIGRAM INDICES?
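As a sketch, a bigram index for the example sentence (doc ID 1 is assumed for illustration):

text = "I went to the University of Warwick"
tokens = text.lower().split()

bigram_index = {}
for i in range(len(tokens) - 1):
    bigram = f"{tokens[i]} {tokens[i + 1]}"
    bigram_index.setdefault(bigram, set()).add(1)   # this bigram occurs in doc 1

print(sorted(bigram_index))
# ['i went', 'of warwick', 'the university', 'to the', 'university of', 'went to']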
Longer n-gram indices are not feasible: too many possible n-grams.
Bigrams could be used when we don’t need longer search queries, but they are generally not enough.
Using bigrams to look for longer n-grams can lead to false positives (as with the “University of” AND “of Warwick” example).
LONGER N-GRAM INDICES?
POSITIONAL INDICES
In the postings, store for each term the position(s) at which its tokens appear:
THE SOLUTION: POSITIONAL INDICES
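As a sketch, positional postings can be held as a nested mapping term → doc ID → sorted positions (values taken from the worked example a couple of slides below):

positional_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

# Positions of 'to' within document 4.
print(positional_index["to"][4])   # [8, 16, 190, 429, 433]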
The search query could be: “to be or not to be”
Which documents contain it?
For phrase queries, we use a merge algorithm recursively at the
document level.
But we now need to deal with more than just equality.
POSITIONAL INDEX: EXAMPLE
POSITIONAL INDEX: EXAMPLE
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
Get the inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists.
to:
2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; …
be:
1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; …
or: …
Look for occurrences where the positions match the sequence of our query: “to be or not to be”.
POSITIONAL INDEX: EXAMPLE
Match in doc 4: 429: to, 430: be, 431: or, 432: not, 433: to, 434: be
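A sketch of phrase matching over a positional index. The postings below are cut down to the doc-4 entries needed to reproduce the match at positions 429–434; phrase_query is an illustrative helper, not a library function:

positional_index = {
    "to":  {4: [8, 16, 190, 429, 433]},
    "be":  {4: [17, 191, 291, 430, 434]},
    "or":  {4: [431]},
    "not": {4: [432]},
}

def phrase_query(index, phrase):
    terms = phrase.lower().split()
    # Candidate documents must contain every term of the phrase.
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= set(index[term])
    matches = {}
    for doc in docs:
        # A match starts at position p if term i occurs at position p + i.
        starts = [p for p in index[terms[0]][doc]
                  if all(p + i in index[t][doc] for i, t in enumerate(terms))]
        if starts:
            matches[doc] = starts
    return matches

print(phrase_query(positional_index, "to be or not to be"))   # {4: [429]}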
OTHER CHALLENGES IN INFORMATION RETRIEVAL
We may still want to do more sophisticated queries, e.g. proximity queries.
Search for two phrases which occur within a maximum distance of K words of each other.
PROXIMITY QUERIES
Query: Student AROUND(5) Warwick.
We can do this with positional indices.
With bigram indices we can’t.
The algorithm for getting the intersection here is more complex:
We need to get the intersection of “student” and “Warwick”, with the restriction that there can be a maximum of 5 words in between.
PROXIMITY QUERIES
PROXIMITY QUERIES: ALGORITHM
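A sketch of a proximity check in the spirit of the positional-intersection algorithm: unlike the textbook version, it simply compares all position pairs per document (quadratic, but easier to read), and the student/Warwick postings are made up for illustration:

def proximity_query(index, term1, term2, k):
    # Report, per document containing both terms, the position pairs that
    # are at most k words apart.
    results = {}
    p1, p2 = index[term1], index[term2]
    for doc in set(p1) & set(p2):
        pairs = [(a, b) for a in p1[doc] for b in p2[doc] if abs(a - b) <= k]
        if pairs:
            results[doc] = pairs
    return results

# Hypothetical postings: 'student' and 'warwick' occur within 5 words only in doc 3.
index = {
    "student": {3: [4, 60], 9: [12]},
    "warwick": {3: [7], 9: [40]},
}
print(proximity_query(index, "student", "warwick", 5))   # {3: [(4, 7)]}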
A positional index uses more space than a binary index.
However, positional indices are today’s standard approach to indexing documents, given their flexibility for searching.
POSITIONAL INDEX SIZE
Rules of thumb:
A positional index is 2–4 times as large as a non-positional index.
Positional index size is 35–50% of the volume of the original text.
Caveat: all of this holds for “English-like” languages.
POSITIONAL INDEX SIZE
Positional indices and bigram indices can be combined to get the most out of each approach:
For frequent phrases (“Michael Jackson”, “Britney Spears”, “The Who”) it is inefficient to keep on merging positional postings lists.
For very popular search queries, you can also cache the results.
COMBINATION SCHEMES
Williams et al. (2004) evaluated a mixed indexing scheme.
A typical web query mixture was executed in ¼ of the time of using just a positional index.
It required 26% more space than having a positional index alone.
COMBINATION SCHEMES
Case-sensitive search: if we lowercase everything before indexing, we can’t consider the case when searching.
Searching page titles only: we need an additional flag to indicate that a word position belongs to the title.
Searching numeric ranges: e.g. £50..£100. If numbers are just another string in our indices, we won’t be able to search for this.
MORE POSSIBLE SEARCHES
For all these searches, the challenge lies in implementing them efficiently:
Case: we could store both the original and lowercased words, but that increases the index size substantially.
We could also have separate indices for page titles and for numeric values, but again, that’s much more data to store.
MORE CHALLENGES
With Google, you can even search for related pages:
e.g. related:warwick.ac.uk
returns a bunch of university websites
They may also be storing content similarity between websites.
Again, that’s much more data for the index.
MORE CHALLENGES
So far, we can retrieve documents that match a query.
That’s fine, but we often get many results.
And we want to rank them.
Weigh them based on relevance, not just a binary match.
Ranking is an additional challenge in information retrieval.
Next lecture!
MORE CHALLENGES
RESOURCES
Apache Lucene (open source search software, Java):
https://lucene.apache.org/
Python wrapper for Lucene:
http://lucene.apache.org/pylucene/
Apache Nutch (web crawler, can be integrated with Lucene to build a
search engine):
http://nutch.apache.org/
REFERENCES
Arampatzis, A., & Kamps, J. (2008, July). A study of query length. In
Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval (pp. 811-812).
ACM.
Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998, April).
Real life information retrieval: A study of user queries on the web. In
ACM SIGIR Forum (Vol. 32, No. 1, pp. 5-17). ACM.
ASSOCIATED READING
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
information retrieval (Vol. 1, No. 1, p. 496). Cambridge: Cambridge
university press. Chapters 1-2.
https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf