LECTURE 14
Introduction to Information Retrieval
Arkaitz Zubiaga, 21st February, 2018
What is Information Retrieval (IR)?
Indexing Documents.
Query Processing.
Positional Indices.
Other Challenges in Information Retrieval.
LECTURE 14: CONTENTS
Information Retrieval (IR): the task of obtaining, from a large collection, documents that satisfy an information need.
The collection (e.g. the Web) may include images and videos.
In this module we’ll focus on text.
INFORMATION RETRIEVAL
Web search, e.g. Google.
Vertical search, web search on a particular topic, e.g. Yelp.
Email search.
Searching content in large databases or hard drives.
EXAMPLES OF INFORMATION RETRIEVAL
EXAMPLE OF AN IR SYSTEM?
The information need that an IR system has to satisfy is usually expressed as a (short) text query, e.g. hotel in Coventry.
Many queries are vague: the average search query has 2-4 keywords (Arampatzis & Kamps, 2008).
INFORMATION NEED
Queries can express 3 types of information need:
Navigational (looking for a website, e.g. Facebook).
Informational (looking for information, e.g. Thai food).
Transactional (buying or looking for a product/service, e.g. iPhone 10).
INFORMATION NEED
THE CLASSIC SEARCH MODEL
The search process can be iterative.
A query gives results; if unhappy with the results, the user refines the query.
With a good IR system, we aim to minimise the number of times the user refines the query.
In IR, documents in a collection are deemed relevant (R) or non-relevant (N) with respect to a particular query.
The objective of an IR system is to present, for a given query, as many relevant results as possible (and as few non-relevant results as possible).
58% of users never go to the 2nd page of search results (Jansen et al., 1998).
RELEVANCE TO A QUERY
In a small collection, we can process files on the fly.
For a given query, go through all files and check if the query text appears in each file.
It would take ages for large collections.
For large collections, indexing is the solution, i.e. pre-generating lists of word-document associations.
SEARCHING THROUGH LARGE COLLECTIONS
INDEXING DOCUMENTS
Reduced sample of Shakespeare’s works:
We can search for: +Brutus +Caesar -Calpurnia (2 results)
TERM-DOCUMENT INCIDENCE MATRICES
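A minimal sketch in Python of Boolean retrieval over the incidence matrix. The plays and 0/1 vectors follow the classic Shakespeare example; the boolean_query helper is purely illustrative:

docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term: entry i is 1 if the term occurs in docs[i].
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_query(must_have, must_not_have):
    # Bitwise AND of the vectors of required terms, ANDed with the
    # complement of the vectors of excluded terms.
    result = [1] * len(docs)
    for term in must_have:
        result = [a & b for a, b in zip(result, incidence[term])]
    for term in must_not_have:
        result = [a & (1 - b) for a, b in zip(result, incidence[term])]
    return [doc for doc, bit in zip(docs, result) if bit]

print(boolean_query(["Brutus", "Caesar"], ["Calpurnia"]))
# ['Antony and Cleopatra', 'Hamlet']  -- the 2 results for +Brutus +Caesar -Calpurnia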
Collection of N = 1M documents, each with 1K words.
Say there are M = 500K distinct terms among these.
A 500K x 1M matrix has half a trillion 0’s and 1’s.
Very sparse: only one billion 1’s (that’s 0.2% of the values).
Alternative: record only the 1’s → use an inverted index.
EVEN BIGGER COLLECTIONS
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
For each term t, store list of documents that contain t.
List needs to have variable size.
Word ‘Brutus’ occurs in documents ID 1, ID 2, ID 4, etc.
INVERTED INDEX
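As a sketch, the inverted index can be held as a dictionary that maps each term to its sorted, variable-length postings list (doc IDs taken from the figure above):

inverted_index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Postings are kept sorted by doc ID so that Boolean queries can be answered
# with a linear merge rather than repeated membership tests.
print(inverted_index["Brutus"])   # documents in which 'Brutus' occurs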
List (token, Document ID) pairs.
INDEXING DOCUMENTS: EXAMPLE
I did enact Julius
Caesar I was killed
i’ the Capitol;
Brutus killed me.
Doc 1
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Doc 2
Sort (alphabetically) by tokens and then doc ID.
INDEXING DOCUMENTS: EXAMPLE
Merge entries + add frequency counts: dictionary and postings.
INDEXING DOCUMENTS: EXAMPLE
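A minimal sketch of this indexing pipeline on the two example documents. The lowercase regular-expression tokeniser is an assumption for illustration; a real system would normalise tokens more carefully:

from collections import defaultdict
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: list (token, doc ID) pairs.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

# Step 2: sort alphabetically by token, then by doc ID.
pairs.sort()

# Step 3: merge duplicate entries into postings lists plus frequency counts.
postings = defaultdict(list)   # term -> sorted list of doc IDs
doc_freq = defaultdict(int)    # term -> number of documents containing it
for token, doc_id in pairs:
    if not postings[token] or postings[token][-1] != doc_id:
        postings[token].append(doc_id)
        doc_freq[token] += 1

print(postings["caesar"], doc_freq["caesar"])   # [1, 2] 2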
QUERY PROCESSING
Search query: Brutus AND Caesar
Retrieve the postings for Brutus.
Retrieve the postings for Caesar.
Get the intersection as the set of results (docs 2 and 8).
QUERY PROCESSING: AND
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
GETTING THE INTERSECTION
Complexity of the algorithm: O(m+n), as we iterate through all items, where m is the length of p1 and n is the length of p2.
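A sketch of that linear merge, run on the Brutus and Caesar postings from the AND-query example:

def intersect(p1, p2):
    # Walk both sorted postings lists with one pointer each: O(m + n).
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]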
SKIP POINTERS
Can we improve on the linear time O(m+n) and compute the intersection in sublinear time?
● We can use skip pointers, i.e.:
● If p1 has: 1, 3, 5, 15, 40 and p2 has: 35, 40, 90
● Pointers will initially be pp1 = 1 and pp2 = 35
● We can skip all values smaller than 35 in p1.
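A sketch of intersection with skip pointers. Placing a skip every ~√(list length) entries is a common heuristic; the exact layout is an implementation choice, not something fixed by the slides:

import math

def advance(postings, idx, skip, target):
    # Take skip pointers while they do not overshoot target, else step by one.
    if idx % skip == 0 and idx + skip < len(postings) and postings[idx + skip] <= target:
        while idx % skip == 0 and idx + skip < len(postings) and postings[idx + skip] <= target:
            idx += skip
    else:
        idx += 1
    return idx

def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer

# The slide's example: the skip from the start of p1 jumps over 1 and 3,
# avoiding some of the comparisons with values smaller than 35.
print(intersect_with_skips([1, 3, 5, 15, 40], [35, 40, 90]))   # [40]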
I want to search for “University of Warwick” – as a phrase.
For this query, “You can live in Warwick if you are a student at the university” is NOT a match.
If we want to do this kind of search, our index so far is not enough.
PHRASE QUERIES
Index bigrams instead of single words.
For the document “I went to the University of Warwick”, index:
I went, went to, to the, the University, etc.
We can easily search for bigrams now, but we can’t look for “University of Warwick” yet!
We can search for “University of” AND “of Warwick”, but there is no guarantee that they occur together in the document.
HOW ABOUT BIGRAM INDICES?
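As a sketch, a bigram index for the example sentence (doc ID 1 is assumed for illustration):

text = "I went to the University of Warwick"
tokens = text.lower().split()

bigram_index = {}
for i in range(len(tokens) - 1):
    bigram = f"{tokens[i]} {tokens[i + 1]}"
    bigram_index.setdefault(bigram, set()).add(1)   # this bigram occurs in doc 1

print(sorted(bigram_index))
# ['i went', 'of warwick', 'the university', 'to the', 'university of', 'went to']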
Longer n-gram indices are not feasible: too many possible n-grams.
Bigrams could be used when we don’t need longer search queries, but they are generally not enough.
Using bigrams to look for longer n-grams can lead to false positives (as with the “University of” AND “of Warwick” example).
LONGER N-GRAM INDICES?
POSITIONAL INDICES
In the postings, store for each term the position(s) at which its tokens appear:
THE SOLUTION: POSITIONAL INDICES
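As a sketch, positional postings can be held as a nested mapping term → doc ID → sorted positions (values taken from the worked example a couple of slides below):

positional_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

# Positions of 'to' within document 4.
print(positional_index["to"][4])   # [8, 16, 190, 429, 433]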
The search query could be: “to be or not to be”
Which documents contain it?
For phrase queries, we use a merge algorithm recursively at the
document level.
But we now need to deal with more than just equality.
POSITIONAL INDEX: EXAMPLE
POSITIONAL INDEX: EXAMPLE
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
Get the inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists.
to:
2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; …
be:
1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; …
or: …
Look for occurrences where the positions match the sequence of our query: “to be or not to be”.
POSITIONAL INDEX: EXAMPLE
Match in doc 4: 429: to, 430: be, 431: or, 432: not, 433: to, 434: be
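A sketch of phrase matching over a positional index. The postings below are cut down to the doc-4 entries needed to reproduce the match at positions 429–434; phrase_query is an illustrative helper, not a library function:

positional_index = {
    "to":  {4: [8, 16, 190, 429, 433]},
    "be":  {4: [17, 191, 291, 430, 434]},
    "or":  {4: [431]},
    "not": {4: [432]},
}

def phrase_query(index, phrase):
    terms = phrase.lower().split()
    # Candidate documents must contain every term of the phrase.
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= set(index[term])
    matches = {}
    for doc in docs:
        # A match starts at position p if term i occurs at position p + i.
        starts = [p for p in index[terms[0]][doc]
                  if all(p + i in index[t][doc] for i, t in enumerate(terms))]
        if starts:
            matches[doc] = starts
    return matches

print(phrase_query(positional_index, "to be or not to be"))   # {4: [429]}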
OTHER CHALLENGES IN INFORMATION RETRIEVAL
We may still want to do more sophisticated queries, e.g. proximity queries.
Search for two phrases which occur within a maximum distance of K words of each other.
PROXIMITY QUERIES
Query: Student AROUND(5) Warwick.
We can do this with positional indices.
With bigram indices we can’t.
The algorithm for getting the intersection here is more complex:
We need to get the intersection of “student” and “Warwick”, with the restriction that there can be a maximum of 5 words in between.
PROXIMITY QUERIES
PROXIMITY QUERIES: ALGORITHM
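A sketch of a proximity check in the spirit of the positional-intersection algorithm: unlike the textbook version, it simply compares all position pairs per document (quadratic, but easier to read), and the student/Warwick postings are made up for illustration:

def proximity_query(index, term1, term2, k):
    # Report, per document containing both terms, the position pairs that
    # are at most k words apart.
    results = {}
    p1, p2 = index[term1], index[term2]
    for doc in set(p1) & set(p2):
        pairs = [(a, b) for a in p1[doc] for b in p2[doc] if abs(a - b) <= k]
        if pairs:
            results[doc] = pairs
    return results

# Hypothetical postings: 'student' and 'warwick' occur within 5 words only in doc 3.
index = {
    "student": {3: [4, 60], 9: [12]},
    "warwick": {3: [7], 9: [40]},
}
print(proximity_query(index, "student", "warwick", 5))   # {3: [(4, 7)]}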
A positional index uses more space than a binary index.
However, positional indices are today’s standard approach to indexing documents, given their flexibility for searching.
POSITIONAL INDEX SIZE
Rules of thumb:
A positional index is 2–4 times as large as a non-positional index.
Positional index size is 35–50% of the volume of the original text.
Caveat: all of this holds for “English-like” languages.
POSITIONAL INDEX SIZE
Positional indices and bigram indices can be combined to get the most out of each approach:
For frequent phrases (“Michael Jackson”, “Britney Spears”, “The Who”) it is inefficient to keep on merging positional postings lists.
For very popular search queries, you can also cache the results.
COMBINATION SCHEMES
Williams et al. (2004) evaluated a mixed indexing scheme.
A typical web query mixture was executed in ¼ of the time of using just a positional index.
It required 26% more space than having a positional index alone.
COMBINATION SCHEMES
Case-sensitive search: if we lowercase everything before indexing, we can’t consider the case when searching.
Searching page titles only: we need an additional flag to indicate that a word position belongs to the title.
Searching numeric ranges: e.g. £50..£100. If numbers are just another string in our indices, we won’t be able to search for this.
MORE POSSIBLE SEARCHES
For all these searches, the challenge lies in implementing them efficiently:
Case: we could store both the original and lowercased words, but that increases the index size substantially.
We could also have separate indices for page titles and for numeric values, but again, that’s much more data to store.
MORE CHALLENGES
With Google, you can even search for related pages:
e.g. related:warwick.ac.uk
returns a bunch of university websites
They may also be storing content similarity between websites.
Again, that’s much more data for the index.
MORE CHALLENGES
So far, we can retrieve documents that match a query.
That’s fine, but we often get many results.
And we want to rank them.
Weigh them based on relevance, not just a binary match.
Ranking is an additional challenge in information retrieval.
Next lecture!
MORE CHALLENGES
RESOURCES
Apache Lucene (open source search software, Java):
https://lucene.apache.org/
Python wrapper for Lucene:
http://lucene.apache.org/pylucene/
Apache Nutch (web crawler, can be integrated with Lucene to build a
search engine):
http://nutch.apache.org/
REFERENCES
Arampatzis, A., & Kamps, J. (2008, July). A study of query length. In
Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval (pp. 811-812).
ACM.
Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998, April).
Real life information retrieval: A study of user queries on the web. In
ACM SIGIR Forum (Vol. 32, No. 1, pp. 5-17). ACM.
ASSOCIATED READING
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
information retrieval (Vol. 1, No. 1, p. 496). Cambridge: Cambridge
university press. Chapters 1-2.
https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf