
LECTURE 14

Introduction to Information Retrieval

Arkaitz Zubiaga, 21st February, 2018

 What is Information Retrieval (IR)?

 Indexing Documents.

 Query Processing.

 Positional Indices.

 Other Challenges in Information Retrieval.

LECTURE 14: CONTENTS

 Information Retrieval (IR): the task of obtaining, from a large collection, documents that satisfy an information need.

 The collection (e.g. the Web) may include images and videos.

 In this module we’ll focus on text.

INFORMATION RETRIEVAL

 Web search, e.g. Google.

 Vertical search: web search on a particular topic, e.g. Yelp.

 Email search.

 Searching content in large databases or hard drives.

EXAMPLES OF INFORMATION RETRIEVAL

EXAMPLE OF AN IR SYSTEM?


 The information need that an IR system has to satisfy is usually expressed as a (short) text query, e.g. hotel in Coventry.

 Many queries are vague: the average search query has only 2-4 keywords (Arampatzis & Kamps, 2008).

INFORMATION NEED

 Queries can express 3 types of information need:

 Navigational (looking for a website, e.g. Facebook).

 Informational (looking for info, e.g. Thai food).

 Transactional (buying/looking for a product/service, e.g. iPhone 10).

INFORMATION NEED

THE CLASSIC SEARCH MODEL

 The search process can be iterative.

 A query gives results.

 If unhappy with the results, refine the query.

 With a good IR system, we aim to minimise the number of times the user refines the query.

 In IR, documents in a collection are deemed relevant (R) or non-relevant (N) with respect to a particular query.

 The objective of an IR system is to present, for a given query, as many relevant results as possible (and as few non-relevant results as possible).

 58% of users never go to the 2nd page of search results (Jansen et al., 1998).

RELEVANCE TO A QUERY

 In a small collection, we can process files on the fly.

 For a given query, go through all files and check whether the query text appears in each file.

 This would take ages for large collections.

 For large collections, indexing is the solution, i.e. pre-generating lists of word-document associations.

SEARCHING THROUGH LARGE COLLECTIONS

INDEXING DOCUMENTS


 Reduced sample of Shakespeare’s works:

 We can search for: +Brutus +Caesar -Calpurnia (2 results)

TERM-DOCUMENT INCIDENCE MATRICES
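
 The incidence matrix itself appears on the slide as a figure. As a minimal sketch, assuming the matrix values from the IR textbook’s Shakespeare example (they are not reproduced in this text), each term’s row becomes a bit vector over the plays, and the query becomes bitwise operations:

    # Bit vectors over 6 plays, in the order: Antony & Cleopatra, Julius
    # Caesar, The Tempest, Hamlet, Othello, Macbeth (textbook values).
    matrix = {
        "brutus":    0b110100,
        "caesar":    0b110111,
        "calpurnia": 0b010000,
    }

    # +Brutus +Caesar -Calpurnia == Brutus AND Caesar AND NOT Calpurnia.
    result = matrix["brutus"] & matrix["caesar"] & ~matrix["calpurnia"]
    print(f"{result & 0b111111:06b}")  # 100100 -> Antony & Cleopatra and Hamlet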

 Collection of N = 1M documents, each with 1K words.

 Say there are M = 500K distinct terms among these.

 A 500K x 1M matrix has half a trillion 0’s and 1’s.

 Very sparse: at most one billion 1’s (that’s 0.2% of the values).

 Alternative: record only the 1’s → use an inverted index.

EVEN BIGGER COLLECTIONS

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

 For each term t, store the list of documents that contain t.

 The list needs to have variable size.

 The word ‘Brutus’ occurs in documents ID 1, ID 2, ID 4, etc.

INVERTED INDEX
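
 As a minimal sketch (the representation is mine, not from the slides), the postings above map naturally onto a dictionary from term to a sorted list of doc IDs:

    # A toy inverted index: each term maps to its sorted postings list.
    index = {
        "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
        "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
        "calpurnia": [2, 31, 54, 101],
    }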


 List (token, Document ID) pairs.

INDEXING DOCUMENTS: EXAMPLE

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

 Sort (alphabetically) by token and then by doc ID.

INDEXING DOCUMENTS: EXAMPLE

 Merge identical entries and add frequency counts: this gives the dictionary and the postings.

INDEXING DOCUMENTS: EXAMPLE
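
 A minimal sketch of this three-step pipeline over Doc 1 and Doc 2, assuming simple lowercasing and regex tokenisation (helper names are mine):

    from collections import defaultdict
    import re

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
    }

    # Step 1: list (token, doc ID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            pairs.append((token, doc_id))

    # Step 2: sort by token, then by doc ID.
    pairs.sort()

    # Step 3: merge duplicates into dictionary + postings with frequencies.
    postings = defaultdict(list)  # term -> [(doc_id, term_freq), ...]
    for token, doc_id in pairs:
        if postings[token] and postings[token][-1][0] == doc_id:
            last_doc, tf = postings[token][-1]
            postings[token][-1] = (last_doc, tf + 1)
        else:
            postings[token].append((doc_id, 1))

    print(postings["caesar"])  # [(1, 1), (2, 2)]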

QUERY PROCESSING

 Search query: Brutus AND Caesar

 Retrieve the postings list for Brutus.

 Retrieve the postings list for Caesar.

 Get the intersection as the set of results (docs 2 and 8).

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

QUERY PROCESSING: AND

GETTING THE INTERSECTION

 Complexity of the algorithm: O(m+n), as we iterate through all the items, where m is the length of postings list p1 and n is the length of p2.
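
 A minimal sketch of this linear-time merge (the two-pointer algorithm from the associated textbook; variable names are mine):

    def intersect(p1, p2):
        """Intersect two sorted postings lists in O(m+n) time."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]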

SKIP POINTERS

 Can we improve on the linear time O(m+n) and compute the intersection in sublinear time?

● We can use skip pointers, i.e.:

● If p1 has: 1, 3, 5, 15, 40 and p2 has: 35, 40, 90,

● the pointers will initially be at p1 = 1 and p2 = 35.

● We can skip all values smaller than 35 in p1.
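
 A minimal sketch of intersection with skip pointers, assuming skips every √L positions (a common heuristic; the slides do not fix a placement):

    import math

    def intersect_with_skips(p1, p2):
        """Intersect sorted postings lists, following skip pointers when safe."""
        skip1 = max(1, int(math.sqrt(len(p1))))  # skip length for p1
        skip2 = max(1, int(math.sqrt(len(p2))))
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif p1[i] < p2[j]:
                # Take the skip only if it doesn't overshoot p2[j].
                if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                    i += skip1
                else:
                    i += 1
            else:
                if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                    j += skip2
                else:
                    j += 1
        return answer

    print(intersect_with_skips([1, 3, 5, 15, 40], [35, 40, 90]))  # [40]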

 I want to search for “University of Warwick” – as a phrase.

 For this query, “You can live in Warwick if you are a student at the university” is NOT a match.

 If we want to support this kind of search, our index so far is not enough.

PHRASE QUERIES

 Index bigrams instead of single words.

 For the document “I went to the University of Warwick”, index:
I went, went to, to the, the University, etc.

 We can easily search for bigrams now, but we still can’t look for “University of Warwick”!

 We can search for “University of” AND “of Warwick”, but there is no guarantee that the two bigrams occur together in the document.

HOW ABOUT BIGRAM INDICES?
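
 A minimal sketch of bigram indexing (the helper and the structure are mine, not from the slides):

    def index_bigrams(doc_id, text, index):
        """Add each consecutive word pair of `text` to a bigram index."""
        tokens = text.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            index.setdefault((w1, w2), set()).add(doc_id)

    bigram_index = {}
    index_bigrams(1, "I went to the University of Warwick", bigram_index)
    print(bigram_index[("university", "of")])  # {1}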

 Longer n-gram indices are not feasible: there are too many possible n-grams.

 Bigrams could be used when we don’t need longer search queries, but they are generally not enough.

 Using bigrams to look for longer n-grams can lead to false positives (as with the “University of” AND “of Warwick” example).

LONGER N-GRAM INDICES?

POSITIONAL INDICES

 In the postings, store for each term the position(s) at which its tokens appear:

THE SOLUTION: POSITIONAL INDICES
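
 A minimal sketch of the structure (names are mine; the position data is taken from the example below): each term maps doc IDs to sorted lists of token positions.

    # Positional index: term -> {doc_id: [positions]}
    positional_index = {
        "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
        "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
    }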

 The search query could be: “to be or not to be”.

 Which documents contain it?

 For phrase queries, we use the merge algorithm recursively, at the document level.

 But we now need to deal with more than just equality: matching positions must be consecutive.

POSITIONAL INDEX: EXAMPLE

POSITIONAL INDEX: EXAMPLE

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?


 Get the inverted index entries for each distinct term: to, be, or, not.

 Merge their doc:position lists.

 to: 2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; …

 be: 1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; …

 or: …

 Look for occurrences where the positions match the sequence of our query: “to be or not to be”.

POSITIONAL INDEX: EXAMPLE

Match in doc 4: 429: to, 430: be, 431: or, 432: not, 433: to, 434: be
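
 A minimal sketch of the consecutive-position check (a simple nested scan, assuming the positional_index structure sketched earlier; production systems use the textbook’s pairwise positional intersect instead):

    def phrase_query(index, terms):
        """Return doc IDs where `terms` occur at consecutive positions."""
        # Candidate docs must contain every term.
        docs = set(index[terms[0]])
        for term in terms[1:]:
            docs &= set(index[term])
        results = []
        for doc in sorted(docs):
            # The phrase starts at p if term k sits at position p + k for all k.
            if any(all(p + k in index[terms[k]][doc] for k in range(len(terms)))
                   for p in index[terms[0]][doc]):
                results.append(doc)
        return results

    # With the toy positional_index above: phrase_query(positional_index,
    # ["to", "be"]) -> [4], since "to" at 16/429/433 is followed by "be".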

OTHER CHALLENGES IN
INFORMATION RETRIEVAL

 We may still want to run more sophisticated queries, e.g. proximity queries.

 Search for two terms or phrases that occur within a maximum distance of K words of each other.

PROXIMITY QUERIES

 Query: Student AROUND(5) Warwick.

 We can do this with positional indices.

 With bigram indices we can’t.

 The algorithm for getting the intersection here is more complex:

 We need to get the intersection of “student” and “Warwick”,

 with the restriction that there can be at most 5 words in between.

PROXIMITY QUERIES

PROXIMITY QUERIES: ALGORITHM
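
 The algorithm appears on the slide as a figure. A minimal sketch of the idea, assuming positional postings of the form {doc_id: [positions]} (a simplified version of the textbook’s positional intersect that only collects matching doc IDs):

    def proximity_intersect(pp1, pp2, k):
        """Doc IDs where the two terms occur within k positions of each other."""
        results = []
        for doc in sorted(set(pp1) & set(pp2)):
            if any(abs(pos1 - pos2) <= k
                   for pos1 in pp1[doc]
                   for pos2 in pp2[doc]):
                results.append(doc)
        return results

    # e.g. proximity_intersect({1: [3, 20]}, {1: [7]}, 5) -> [1], since |3 - 7| <= 5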

 A positional index uses more space than a binary index.

 However, positional indices are today’s standard approach to indexing documents, given their flexibility for searching.

POSITIONAL INDEX SIZE

 Rules of thumb:

 A positional index is 2–4 times as large as a non-positional index.

 A positional index’s size is 35–50% of the volume of the original text.

 Caveat: all of this holds for “English-like” languages.

POSITIONAL INDEX SIZE

 Positional indices and bigram indices can be combined to get the most out of each approach:

 For frequent phrases (“Michael Jackson”, “Britney Spears”, “The Who”) it is inefficient to keep merging positional postings lists.

 For very popular search queries, you can also cache the results.

COMBINATION SCHEMES

 Williams et al. (2004) evaluated a mixed indexing scheme.

 A typical web query mixture was executed in ¼ of the time of using just a positional index.

 It required 26% more space than having a positional index alone.

COMBINATION SCHEMES

 Case-sensitive search: if we lowercase everything before indexing, we can’t take case into account when searching.

 Searching page titles only: we need an additional flag to indicate that a word position belongs to the title.

 Searching numeric ranges: e.g. £50..£100. If numbers are just another string in our indices, we won’t be able to search for this.

MORE POSSIBLE SEARCHES

 For all these searches, the challenge lies in implementing them efficiently:

 Case: we could store both the original and the lowercased words, but that increases the index size substantially.

 We could also have separate indices for page titles and for numeric values, but again, that’s much more data to store.

MORE CHALLENGES

 With Google, you can even search for related pages:
e.g. related:warwick.ac.uk returns a bunch of university websites.

 They may also be storing content similarity between websites.

 Again, that’s much more data for the index.

MORE CHALLENGES

 So far, we can retrieve documents that match a query.

 That’s fine, but we often get many results.

 And we want to rank them.

 Weight documents based on relevance, not just a binary match.

 Ranking is an additional challenge in information retrieval.

 Next lecture!

MORE CHALLENGES


RESOURCES

 Apache Lucene (open source search software, Java):
https://lucene.apache.org/

 Python wrapper for Lucene:
http://lucene.apache.org/pylucene/

 Apache Nutch (web crawler, can be integrated with Lucene to build a
search engine):
http://nutch.apache.org/


REFERENCES

 Arampatzis, A., & Kamps, J. (2008, July). A study of query length. In
Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval (pp. 811-812).
ACM.

 Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998, April).
Real life information retrieval: A study of user queries on the web. In
ACM SIGIR Forum (Vol. 32, No. 1, pp. 5-17). ACM.


ASSOCIATED READING

 Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
information retrieval (Vol. 1, No. 1, p. 496). Cambridge: Cambridge
university press. Chapters 1-2.
https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf