
LECTURE 14

Introduction to Information Retrieval

Arkaitz Zubiaga, 21st February, 2018

 What is Information Retrieval (IR)?

 Indexing Documents.

 Query Processing.

 Positional Indices.

 Other Challenges in Information Retrieval.

LECTURE 14: CONTENTS

 Information Retrieval (IR): from a large collection, the task of
obtaining documents that satisfy an information need.

 The collection (e.g. the Web) may include images and videos.

 In this module we’ll focus on text.

INFORMATION RETRIEVAL

 Web search, e.g. Google.

 Vertical search, i.e. web search on a particular topic, e.g. Yelp.

 Email search.

 Searching content in large databases or hard drives.

EXAMPLES OF INFORMATION RETRIEVAL


EXAMPLE OF AN IR SYSTEM?

 The information need that an IR system has to satisfy is usually
expressed as a (short) text query, e.g. hotel in Coventry.

 Many queries are vague: the average search query has only 2-4
keywords (Arampatzis & Kamps, 2008).

INFORMATION NEED

 Queries can express 3 types of information need:

 Navigational (looking for a website, e.g. Facebook).

 Informational (looking for information, e.g. Thai food).

 Transactional (buying/looking for a product/service, e.g.
iPhone 10).

INFORMATION NEED

THE CLASSIC SEARCH MODEL

 The search process can be iterative.

 The query gives results.

 If unhappy with the results, refine the query.

 With a good IR system, we aim to minimise the number of times the
user refines the query.

 In IR, documents in a collection are deemed relevant (R) or
non-relevant (N) with respect to a particular query.

 The objective of an IR system is to present, for a given query, as
many relevant results as possible (and as few non-relevant results
as possible).

 58% of users never go to the 2nd page of search results
(Jansen et al., 1998).

RELEVANCE TO A QUERY

 In a small collection, we can process files on the fly.

 For a given query, go through all files and check if the query
text appears in each file.

 This would take ages for large collections.

 For large collections, indexing is the solution,
i.e. pregenerating lists of word-document associations.

SEARCHING THROUGH LARGE COLLECTIONS

INDEXING DOCUMENTS


 Reduced sample of Shakespeare’s works:

 We can search for: +Brutus +Caesar -Calpurnia (2 results)

TERM-DOCUMENT INCIDENCE MATRICES
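A minimal Python sketch of Boolean retrieval over such an incidence matrix. The 0/1 vectors below follow the classic Shakespeare example in Manning et al. (2008), Chapter 1 (an assumption, since the matrix itself appears only as a figure in the slides):

# Term-document incidence: one 0/1 vector per term, one column per play.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_search(include, exclude):
    """Docs that contain every term in `include` and no term in `exclude`."""
    hits = []
    for doc_id, play in enumerate(plays):
        if all(incidence[t][doc_id] for t in include) and \
           not any(incidence[t][doc_id] for t in exclude):
            hits.append(play)
    return hits

# +Brutus +Caesar -Calpurnia
print(boolean_search(["Brutus", "Caesar"], ["Calpurnia"]))
# ['Antony and Cleopatra', 'Hamlet'] -- the 2 results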

 Collection of N = 1M documents, each with 1K words.

 Say there are M = 500K distinct terms among these.

 The 500K x 1M matrix has half a trillion 0’s and 1’s.

 Very sparse: only one billion 1’s (that’s 0.2% of the values).

 Alternative: record only the 1’s → use an inverted index.

EVEN BIGGER COLLECTIONS

Brutus → 1, 2, 4, 11, 31, 45, 173, 174

Caesar → 1, 2, 4, 5, 6, 16, 57, 132

Calpurnia → 2, 31, 54, 101

 For each term t, store the list of documents that contain t.

 The list needs to have variable size.

 The word ‘Brutus’ occurs in documents ID 1, ID 2, ID 4, etc.

INVERTED INDEX
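A minimal Python sketch of this structure: each term maps to a sorted, variable-length postings list of document IDs (terms and docIDs taken from the slide):

# Dictionary of terms -> sorted postings lists (variable length).
inverted_index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def postings(term):
    """Look up the postings list for a term (empty if the term is unseen)."""
    return inverted_index.get(term, [])

print(postings("Brutus"))  # [1, 2, 4, 11, 31, 45, 173, 174]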

 List (token, Document ID) pairs.

INDEXING DOCUMENTS: EXAMPLE

Doc 1: “I did enact Julius Caesar I was killed i’ the Capitol;
Brutus killed me.”

Doc 2: “So let it be with Caesar. The noble Brutus hath told you
Caesar was ambitious.”

 Sort (alphabetically) by token and then by doc ID.

INDEXING DOCUMENTS: EXAMPLE

 Merge entries and add frequency counts: dictionary and postings
(see the sketch below).

INDEXING DOCUMENTS: EXAMPLE
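A minimal Python sketch of the whole pipeline on the two documents above. The whitespace/punctuation tokenisation is an assumption for illustration, and document frequency is taken as the length of the postings list:

import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# 1) List (token, docID) pairs.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

# 2) Sort by token, then by docID.
pairs.sort()

# 3) Merge into dictionary + postings; repeats within a doc collapse.
index = defaultdict(list)  # term -> sorted postings list
for token, doc_id in pairs:
    if not index[token] or index[token][-1] != doc_id:
        index[token].append(doc_id)

for term in ("brutus", "caesar", "killed"):
    print(term, "df =", len(index[term]), index[term])
# brutus df = 2 [1, 2] / caesar df = 2 [1, 2] / killed df = 1 [1]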

QUERY PROCESSING

 Search query: Brutus AND Caesar

 Retrieve the postings list for Brutus.

 Retrieve the postings list for Caesar.

 Get the intersection as the set of results (docs 2 and 8).

QUERY PROCESSING: AND

Brutus → 2, 4, 8, 16, 32, 64, 128

Caesar → 1, 2, 3, 5, 8, 13, 21, 34

GETTING THE INTERSECTION

 Complexity of the algorithm: O(m+n), as we iterate through all the
items, where m is the length of p1 and n is the length of p2.
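A minimal Python sketch of this linear-time merge, assuming both postings lists are sorted (docIDs from the Brutus/Caesar example above):

def intersect(p1, p2):
    """Merge two sorted postings lists in O(m+n) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:       # docID in both lists -> a result
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:      # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
# [2, 8]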


SKIP POINTERS

 Can we improve on the linear-time O(m+n) merge and compute the
intersection in sublinear time?

● We can use skip pointers, i.e.:

● If p1 has: 1, 3, 5, 15, 40
  and p2 has: 35, 40, 90

● The pointers will initially be pp1 = 1 and pp2 = 35.

● We can skip all values smaller than 35 in p1.
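A minimal Python sketch of intersection with skip pointers. Placing a skip every √L positions is a common heuristic (an assumption here; the slides do not fix the skip spacing):

import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, skipping ahead where possible."""
    s1 = max(1, math.isqrt(len(p1)))  # skip lengths: ~sqrt of list length
    s2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it does not overshoot the other value.
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 3, 5, 15, 40], [35, 40, 90]))  # [40]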


 I want to search for “University of Warwick” – as a phrase.

 For this query, “You can live in Warwick if you are a student at the
university” is NOT a match.

 If we want to support this kind of search, our index so far is not
enough.

PHRASE QUERIES

 Index bigrams instead of single words (see the sketch below).

 For the document, “I went to the University of Warwick”, index:
I went, went to, to the, the University, etc.

 We can easily search for bigrams now, but we can’t look for
“University of Warwick” yet!

 We can search for “University of AND of Warwick”, but there
is no guarantee that they occur together in the document.

HOW ABOUT BIGRAM INDICES?
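A minimal Python sketch of a bigram index and of the false-positive problem; the second toy document is invented for illustration:

from collections import defaultdict

docs = {
    1: "I went to the University of Warwick",
    2: "the University of Sheffield is north of Warwick",  # not the phrase!
}

# Map each consecutive word pair (bigram) to the set of docs containing it.
bigram_index = defaultdict(set)
for doc_id, text in docs.items():
    words = text.lower().split()
    for bigram in zip(words, words[1:]):
        bigram_index[bigram].add(doc_id)

# "University of Warwick" becomes: "university of" AND "of warwick".
hits = bigram_index[("university", "of")] & bigram_index[("of", "warwick")]
print(hits)  # {1, 2} -- doc 2 is a false positive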

 Longer n-gram indices are not feasible: there are too many possible
n-grams.

 Bigrams could be used when we don’t need longer search queries,
but they are generally not enough.

 Using bigrams to look for longer n-grams can lead to false
positives (as with the “University of AND of Warwick” example).

LONGER N-GRAM INDICES?

POSITIONAL INDICES

 In the postings, store for each term the position(s) at which it
appears in each document:

THE SOLUTION: POSITIONAL INDICES


 The search query could be: “to be or not to be”

 Which documents contain it?

 For phrase queries, we use a merge algorithm recursively at the
document level.

 But we now need to deal with more than just equality.

POSITIONAL INDEX: EXAMPLE

POSITIONAL INDEX: EXAMPLE

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

 Get the inverted index entries for each distinct term: to, be, or, not.

 Merge their doc:position lists.

 to:
2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; …

 be:
1:17,19; 4:17,191,291,430,434; 5:14,19,101; …

 or…

 Look for occurrences where the positions match the sequence of our
query: “to be or not to be”.

POSITIONAL INDEX: EXAMPLE


 A match in doc 4: 429: to, 430: be, 431: or, 432: not, 433: to, 434: be.
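A minimal Python sketch of phrase matching over a positional index. The postings are abridged from the slide; the “or” and “not” positions in doc 4 are assumptions consistent with the match shown above:

# term -> {docID: sorted list of positions}
positional_index = {
    "to":  {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be":  {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
    "or":  {4: [431]},   # assumed
    "not": {4: [432]},   # assumed
}

def phrase_search(terms):
    """Return {docID: start positions} where the terms occur consecutively."""
    common = set(positional_index[terms[0]])
    for t in terms[1:]:                      # docs containing every term
        common &= set(positional_index[t])
    matches = {}
    for doc in sorted(common):
        starts = [p for p in positional_index[terms[0]][doc]
                  if all(p + k in positional_index[t][doc]
                         for k, t in enumerate(terms))]
        if starts:
            matches[doc] = starts
    return matches

print(phrase_search(["to", "be", "or", "not", "to", "be"]))  # {4: [429]}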

OTHER CHALLENGES IN
INFORMATION RETRIEVAL

 We may still want to run more sophisticated queries, e.g.
proximity queries.

 Search for two phrases that occur within a maximum distance of K
words of each other.

PROXIMITY QUERIES


 Query: Student AROUND(5) Warwick.

 We can do this with positional indices.

 With bigram indices we can’t.

 The algorithm for getting the intersection here is more complex:

 We need to get the intersection of “student” and “Warwick”,

 with the restriction that there can be at most 5 words in
between (see the sketch below).

PROXIMITY QUERIES


PROXIMITY QUERIES: ALGORITHM
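The original slide shows the algorithm as a figure; below is a minimal Python sketch of positional intersection with a proximity constraint k, in the spirit of POSITIONALINTERSECT from Manning et al. (2008). The simple nested scan and the reading of k as the maximum position difference are assumptions for illustration:

def proximity_intersect(index, term1, term2, k):
    """(doc, pos1, pos2) triples where the terms occur within k positions."""
    p1, p2 = index[term1], index[term2]
    answer = []
    for doc in sorted(set(p1) & set(p2)):    # docs containing both terms
        for pos1 in p1[doc]:
            for pos2 in p2[doc]:
                if abs(pos1 - pos2) <= k:
                    answer.append((doc, pos1, pos2))
    return answer

# Toy positional index, term -> {docID: positions} (invented for illustration).
index = {
    "student": {1: [3, 40], 2: [7]},
    "warwick": {1: [6, 90], 2: [50]},
}
print(proximity_intersect(index, "student", "warwick", 5))
# [(1, 3, 6)]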

 A positional index uses more space than a binary index.

 However, positional indices are today’s standard approach to
indexing documents, given their flexibility for searching.

POSITIONAL INDEX SIZE

 Rules of thumb:

 A positional index is 2–4 times as large as a non-positional index.

 A positional index’s size is 35–50% of the volume of the original
text.

 Caveat: all of this holds for “English-like” languages.

POSITIONAL INDEX SIZE

 Positional indices and bigram indices can be combined to get the
best of each approach:

 For frequent phrases (“Michael Jackson”, “Britney Spears”,
“The Who”) it is inefficient to keep on merging positional
postings lists.

 For very popular search queries, you can also cache the results.

COMBINATION SCHEMES

 Williams et al. (2004) evaluated a mixed indexing scheme.

 A typical web query mixture was executed in ¼ of the time of
using just a positional index.

 It required 26% more space than having a positional index alone.

COMBINATION SCHEMES

 Case-sensitive search: if we lowercase everything before indexing,
we can’t take case into account when searching.

 Searching page titles only: we need an additional flag to indicate
that a word position belongs to the title.

 Search numeric ranges: e.g. £50..£100. If numbers are just
another string in our indices, we won’t be able to search for this.

MORE POSSIBLE SEARCHES

 For all these searches, the challenge lies in implementing them
efficiently:

 Case: we could store both original and lowercased words, but that
increases the index size substantially.

 We could also have separate indices for page titles and for
numeric values, but again, that’s much more data to store.

MORE CHALLENGES


 With Google, you can even search for related pages:
e.g. related:warwick.ac.uk
returns a bunch of university websites

 They may also be storing content similarity between websites.

 Again, that’s much more data for the index.

MORE CHALLENGES

 So far, we can retrieve documents that match a query.

 That’s fine, but we often get many results.

 And we want to rank them.

 Weight them based on relevance, not just a binary match.

 Ranking is an additional challenge in information retrieval.

 Next lecture!

MORE CHALLENGES


RESOURCES

 Apache Lucene (open source search software, Java):
https://lucene.apache.org/

 Python wrapper for Lucene:
http://lucene.apache.org/pylucene/

 Apache Nutch (web crawler, can be integrated with Lucene to build a
search engine):
http://nutch.apache.org/


REFERENCES

 Arampatzis, A., & Kamps, J. (2008, July). A study of query length. In
Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval (pp. 811-812).
ACM.

 Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998, April).
Real life information retrieval: A study of user queries on the web. In
ACM SIGIR Forum (Vol. 32, No. 1, pp. 5-17). ACM.



ASSOCIATED READING

 Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
information retrieval (Vol. 1, No. 1, p. 496). Cambridge: Cambridge
University Press. Chapters 1-2.
https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
