LECTURE 14: Introduction to Information Retrieval
Arkaitz Zubiaga, 21st February, 2018
LECTURE 14: CONTENTS
What is Information Retrieval (IR)?
Indexing Documents.
Query Processing.
Positional Indices.
Other Challenges in Information Retrieval.
INFORMATION RETRIEVAL
Information Retrieval (IR): the task of obtaining, from a large
collection, documents that satisfy an information need.
The collection (e.g. the Web) may include images and videos.
In this module we'll focus on text.
EXAMPLES OF INFORMATION RETRIEVAL
Web search, e.g. Google.
Vertical search, web search on a particular topic, e.g. Yelp.
Email search.
Searching content in large databases or hard drives.
EXAMPLE OF AN IR SYSTEM?
INFORMATION NEED
The information need that an IR system has to satisfy is usually
expressed as a (short) text query, e.g. hotel in Coventry.
Many queries are vague: the average search query has 2-4
keywords (Arampatzis & Kamps, 2008).
INFORMATION NEED
Queries can reflect 3 types of information need:
Navigational (looking for a website, e.g. Facebook).
Informational (looking for information, e.g. Thai food).
Transactional (buying/looking for a product/service, e.g. iPhone 10).
THE CLASSIC SEARCH MODEL
The search process can be iterative.
A query gives results.
If unhappy with the results, the user refines the query.
With a good IR system, we aim to minimise the number of times
the user refines the query.
RELEVANCE TO A QUERY
In IR, documents in a collection are deemed relevant (R) or
non-relevant (N) with respect to a particular query.
The objective of an IR system is to present, for a given query, as
many relevant results as possible (and as few non-relevant
results as possible).
58% of users never go to the 2nd page of search results
(Jansen et al., 1998).
SEARCHING THROUGH LARGE COLLECTIONS
In a small collection, we can process files on the fly:
for a given query, go through all files and check if the query
text appears in each file.
This would take ages for large collections.
For large collections, indexing is the solution:
i.e. pregenerating lists of word-document associations.
INDEXING DOCUMENTS
TERM-DOCUMENT INCIDENCE MATRICES
Reduced sample of Shakespeare's works:
TERM-DOCUMENT INCIDENCE MATRICES
Reduced sample of Shakespeare's works:
We can search for: +Brutus +Caesar -Calpurnia (2 results).
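A Boolean query over an incidence matrix can be sketched as below. The 0/1 vectors are illustrative values for a reduced sample of Shakespeare's works (the matrix on the slide is not reproduced here), but they do yield the 2 results of the example query.

```python
# Boolean retrieval over a term-document incidence matrix:
# each term maps to a 0/1 vector saying which documents contain it.
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_query(plus, minus):
    """Documents containing every `plus` term and no `minus` term."""
    result = []
    for i, doc in enumerate(docs):
        if all(incidence[t][i] for t in plus) and \
           not any(incidence[t][i] for t in minus):
            result.append(doc)
    return result

print(boolean_query(["Brutus", "Caesar"], ["Calpurnia"]))
# -> ['Antony and Cleopatra', 'Hamlet']  (the 2 results)
```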
EVEN BIGGER COLLECTIONS
Collection of N = 1M documents, each with 1K words.
Say there are M = 500K distinct terms among these.
A 500K x 1M matrix has half a trillion 0's and 1's.
It is very sparse: only one billion 1's (that's 0.2% of the values).
Alternative: record only the 1's → use an inverted index.
INVERTED INDEX
For each term t, store the list of documents that contain t.
The list needs to have variable size.
The word 'Brutus' occurs in documents ID 1, ID 2, ID 4, etc.:
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
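A minimal sketch of this structure in Python: a dict mapping each term to a sorted, variable-length postings list (the doc IDs are those from the slide's example).

```python
# Inverted index: term -> sorted list of IDs of documents containing it.
# Lists naturally have variable size, one entry per containing document.
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Looking up a term is a single dictionary access:
print(index["Brutus"][:3])  # -> [1, 2, 4]
```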
INDEXING DOCUMENTS: EXAMPLE
List (token, Document ID) pairs.
Doc 1: "I did enact Julius Caesar I was killed i' the Capitol;
Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you
Caesar was ambitious"
INDEXING DOCUMENTS: EXAMPLE
Sort (alphabetically) by token and then by doc ID.
INDEXING DOCUMENTS: EXAMPLE
Merge entries and add frequency counts: this yields the dictionary
and the postings.
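The three steps above (list pairs, sort, merge) can be sketched as follows, using the two example documents. The tokenisation is deliberately naive, just enough for this example.

```python
# Indexing pipeline sketch: (token, docID) pairs -> sort -> merge
# into a dictionary of postings lists.
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Naive tokenisation: split on spaces, strip edge punctuation, lowercase.
    return [tok.strip(".,;:'").lower() for tok in text.split()]

# 1) List (token, docID) pairs.
pairs = [(tok, doc_id) for doc_id, text in docs.items()
         for tok in tokenize(text)]

# 2) Sort by token, then by doc ID.
pairs.sort()

# 3) Merge: each term maps to a sorted postings list of unique doc IDs.
postings = defaultdict(list)
for tok, doc_id in pairs:
    if not postings[tok] or postings[tok][-1] != doc_id:
        postings[tok].append(doc_id)

print(postings["brutus"])   # -> [1, 2]
print(postings["capitol"])  # -> [1]
```

A real indexer would also keep the frequency counts (e.g. the length of each postings list) in the dictionary.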
QUERY PROCESSING
QUERY PROCESSING: AND
Search query: Brutus AND Caesar
Retrieve the postings for Brutus.
Retrieve the postings for Caesar.
Get the intersection as the set of results (docs 2 and 8):
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
GETTING THE INTERSECTION
Complexity of the algorithm: O(m+n), as we iterate through all
items, where m is the length of p1 and n is the length of p2.
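The O(m+n) merge can be sketched as a two-pointer walk over the sorted postings lists, advancing the pointer at the smaller value:

```python
# Linear-time intersection of two sorted postings lists.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller doc ID
        else:
            j += 1
    return answer

# The Brutus AND Caesar example:
print(intersect([2, 4, 8, 16, 32, 64, 128],
                [1, 2, 3, 5, 8, 13, 21, 34]))
# -> [2, 8]
```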
SKIP POINTERS
Can we improve on the linear time O(m+n) and compute the
intersection in sublinear time?
We can use skip pointers, i.e.:
If p1 has: 1, 3, 5, 15, 40
and p2 has: 35, 40, 90
the pointers will initially be pp1=1 and pp2=35.
We can skip all values smaller than 35 in p1.
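A sketch of intersection with skip pointers, using the common heuristic of a skip length of about sqrt(n) (an assumption here, not stated on the slide): when one list's current value is behind, take the skip pointer as long as it does not overshoot the other list's value.

```python
# Intersection with skip pointers: runs of non-matching doc IDs
# can be jumped over instead of stepped through one by one.
import math

def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))  # skip length ~ sqrt(len)
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip pointer only if it does not overshoot p2[j].
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

# The slide's example: values 3 and 5 in p1 are skipped over.
print(intersect_with_skips([1, 3, 5, 15, 40], [35, 40, 90]))
# -> [40]
```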
PHRASE QUERIES
I want to search for "University of Warwick" – as a phrase.
For this query, "You can live in Warwick if you are a student at the
university" is NOT a match.
If we want to do this search, then our index as built so far is not
enough.
HOW ABOUT BIGRAM INDICES?
Index bigrams instead of single words.
For the document "I went to the University of Warwick", index:
I went, went to, to the, the University, etc.
We can easily search for bigrams now, but we still can't look for
"University of Warwick"!
We can search for "University of" AND "of Warwick", but there
is no guarantee that they occur together in the document.
LONGER N-GRAM INDICES?
Longer n-gram indices are not feasible: there are too many
possible n-grams.
Bigrams could be used when we don't need longer search
queries, but they are generally not enough.
Using bigrams to look for longer n-grams can lead to false
positives (as with the "University of" AND "of Warwick" example).
POSITIONAL INDICES
THE SOLUTION: POSITIONAL INDICES
In the postings, store for each term the position(s) in which
tokens of it appear:
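Building such an index can be sketched as below: each term maps to a dict of docID → sorted list of positions. The two tiny documents are made up for illustration.

```python
# Positional index sketch: term -> {docID: [positions]}.
from collections import defaultdict

docs = {
    1: "to be or not to be".split(),
    2: "be that as it may".split(),
}

positional_index = defaultdict(dict)
for doc_id, tokens in docs.items():
    for pos, tok in enumerate(tokens, start=1):
        # Append this occurrence's position to the term's list for this doc.
        positional_index[tok].setdefault(doc_id, []).append(pos)

print(positional_index["to"])  # -> {1: [1, 5]}
print(positional_index["be"])  # -> {1: [2, 6], 2: [1]}
```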
POSITIONAL INDEX: EXAMPLE
The search query could be: "to be or not to be".
Which documents contain it?
For phrase queries, we use a merge algorithm recursively at the
document level.
But we now need to deal with more than just equality.
POSITIONAL INDEX: EXAMPLE
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
POSITIONAL INDEX: EXAMPLE
Get the inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists:
to:  2: 1,17,74,222,551;  4: 8,16,190,429,433;  7: 13,23,191; …
be:  1: 17,19;  4: 17,191,291,430,434;  5: 14,19,101; …
or: …
Look for occurrences where the positions match the sequence of
our query, "to be or not to be":
429: to, 430: be, 431: or, 432: not, 433: to, 434: be
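The matching step can be sketched as below. The position lists for "to" and "be" follow the slide; the lists for "or" and "not" are assumed here, chosen so that doc 4 matches at positions 429-434 as in the example.

```python
# Phrase matching over a positional index: a doc matches if some
# start position p has term k of the phrase at position p + k.
index = {
    "to":  {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be":  {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
    "or":  {4: [431]},    # assumed positions
    "not": {4: [432]},    # assumed positions
}

def phrase_match(index, phrase):
    terms = phrase.split()
    # Only documents containing every term can possibly match.
    candidates = set.intersection(*(set(index.get(t, {})) for t in terms))
    hits = []
    for doc in sorted(candidates):
        for start in index[terms[0]][doc]:
            # Check the whole phrase at consecutive positions.
            if all(start + k in index[t][doc] for k, t in enumerate(terms)):
                hits.append((doc, start))
    return hits

print(phrase_match(index, "to be or not to be"))
# -> [(4, 429)]
```

A production system would merge the position lists pairwise in sorted order rather than probing with `in`, but the logic is the same.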
OTHER CHALLENGES IN INFORMATION RETRIEVAL
PROXIMITY QUERIES
We may still want to do more sophisticated queries, e.g.
proximity queries:
Search for two phrases which occur within a maximum
distance of K words of each other.
PROXIMITY QUERIES
Query: student AROUND(5) Warwick.
We can do this with positional indices.
With bigram indices we can't.
The algorithm for getting the intersection here is more complex:
we need to get the intersection of "student" and "Warwick"
with the restriction that there can be a maximum of 5
words in between.
PROXIMITY QUERIES: ALGORITHM
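A simplified sketch of such a proximity intersection (the position data below is made up for illustration): intersect the two terms' postings at the document level, then keep only documents where some pair of positions is at most k apart.

```python
# Proximity matching for queries like `student AROUND(5) warwick`.
def proximity_match(pos1, pos2, k):
    """pos1, pos2: {docID: sorted positions}. Return doc IDs where the
    two terms occur within k positions of each other."""
    matches = []
    for doc in sorted(set(pos1) & set(pos2)):  # document-level intersection
        if any(abs(a - b) <= k for a in pos1[doc] for b in pos2[doc]):
            matches.append(doc)
    return matches

student = {3: [12, 57], 9: [4]}
warwick = {3: [60, 200], 9: [40]}
print(proximity_match(student, warwick, k=5))
# -> [3]  (positions 57 and 60 are 3 apart; doc 9's are 36 apart)
```

The nested position check here is quadratic per document; the textbook algorithm instead merges the two sorted position lists linearly, in the same spirit as the postings intersection.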
POSITIONAL INDEX SIZE
A positional index uses more space than a binary index.
However, positional indices are today's standard approach to
indexing documents, given their flexibility for searching.
POSITIONAL INDEX SIZE
Rules of thumb:
A positional index is 2-4 times as large as a non-positional index.
Positional index size is 35-50% of the volume of the original text.
Caveat: all of this holds for "English-like" languages.
COMBINATION SCHEMES
Positional indices and bigram indices can be combined to get the
most out of each approach:
For frequent phrases ("Michael Jackson", "Britney Spears",
"The Who") it is inefficient to keep merging positional
postings lists.
For very popular search queries, you can also cache the results.
COMBINATION SCHEMES
Williams et al. (2004) evaluated a mixed indexing scheme:
A typical web query mixture was executed in ¼ of the time of
using just a positional index.
It required 26% more space than having a positional index alone.
MORE POSSIBLE SEARCHES
Case-sensitive search: if we lowercase everything before
indexing, we can't consider the case when searching.
Searching page titles only: we need an additional flag to indicate
that a word position belongs to the title.
Searching numeric ranges, e.g. £50..£100: if numbers are just
another string in our indices, we won't be able to search for this.
MORE CHALLENGES
For all these searches, the challenge lies in implementing them
efficiently:
Case: we could store both original and lowercased words, but
that increases the index size substantially.
We could also have separate indices for page titles and for
numeric values, but again, that's much more data to store.
MORE CHALLENGES
With Google, you can even search for related pages:
e.g. related:warwick.ac.uk
returns a bunch of university websites.
They may also be storing content similarity between websites.
Again, that's much more data for the index.
MORE CHALLENGES
So far, we can retrieve documents that match a query.
That's fine, but we often get many results,
and we want to rank them:
weigh them by relevance, not just binary match.
Ranking is an additional challenge in information retrieval.
Next lecture!
RESOURCES
Apache Lucene (open source search software, Java):
https://lucene.apache.org/
Python wrapper for Lucene:
http://lucene.apache.org/pylucene/
Apache Nutch (web crawler, can be integrated with Lucene to build a
search engine):
http://nutch.apache.org/
REFERENCES
Arampatzis, A., & Kamps, J. (2008, July). A study of query length. In
Proceedings of the 31st annual international ACM SIGIR conference
on Research and development in information retrieval (pp. 811-812).
ACM.
Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998, April).
Real life information retrieval: A study of user queries on the web. In
ACM SIGIR Forum (Vol. 32, No. 1, pp. 5-17). ACM.
ASSOCIATED READING
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
information retrieval (Vol. 1, No. 1, p. 496). Cambridge: Cambridge
University Press. Chapters 1-2.
https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf