COMP3220 — Document Processing and the Semantic Web
Week 02 Lecture 1: Searching for Information
Diego Mollá
Department of Computer Science, Macquarie University
COMP3220 2021H1
Programme
1 Information Retrieval
2 Evaluation
    Precision and Recall
3 Indexing and Retrieval
    Indexing
    Boolean Retrieval
    Vector Retrieval
    Vector Retrieval in Python
Reading
Essential Reading
NLTK chapter 6 section 3.3 (precision and recall).
Manning et al. IR book, chapter 1 (Boolean retrieval), chapter 6 section 2 (tf.idf), chapter 8 section 3 (precision and recall). http://nlp.stanford.edu/IR-book/
Additional Reading
“Information Retrieval”, section 23.1 of Jurafsky & Martin’s book: https://web.stanford.edu/~jurafsky/slp3/23.pdf
Information Retrieval
Need for Search
The Problem
The Web can be seen as a very large, unstructured data store.
There are hundreds of millions of Web pages, but there is no central index.
Even worse: It is unknown where all the Web servers are.
The Solution
Search engines.
Information Retrieval
Information Retrieval (IR)
IR is about searching for information.
IR typically means “document retrieval”.
IR is one of the core components of Web search.
Stages in an IR System
1: Indexing
This stage is done off-line, prior to running any searches.
The goal is to reduce the documents to a description: the indices.
We want to optimise the representation: for example, ignore terms that do not help discriminate between documents.
2: Retrieval
Use the indices to retrieve the documents (ignore the remaining information in the documents).
We want retrieval to be fast.
Evaluation
Why Evaluate?
Document processing systems almost never give 100% correct results.
When you develop a document processing system, you want to know how good it is.
You want to know if a modification in a system is an improvement.
Human evaluations are expensive to produce.
In this lecture we will focus on automatic evaluations.
Of course, in addition you have to debug the system.
Training vs. Test Data
For pretty much all evaluations, you want to divide your data into at least two sets: training and test.
Training data is what you use to develop your models.
You only look at the training data.
For statistical models (coming later in this course), this is what you use to calculate your statistics.
Test data is separate, used for evaluation.
You may also have a third set of data to help develop your system (DevTest).
You’ll see the use of the DevTest set when we look at statistical models.
Golden Rule
You don’t ever, ever, look at the test data (you only look at its evaluation results).
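As a minimal sketch of such a split (using scikit-learn's train_test_split, which we will meet later in this lecture; the documents and labels below are placeholders):

from sklearn.model_selection import train_test_split

docs = ['document %i' % i for i in range(100)]    # placeholder documents
labels = [i % 2 for i in range(100)]              # placeholder labels

# hold out 20% of the data as the test set; fix the seed for reproducibility
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.2, random_state=42)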
Precision and Recall
Positives and Negatives
Whenever a system needs to make a binary choice, it is classifying the text into two classes: a positive and a negative class.
Positive: what we want to select.
Negative: what we do not want to select.
The choice of what counts as the positive or the negative class depends on the application.
In information retrieval, documents relevant to the query belong to the positive class.
In spam filtering, spam documents belong to the positive class.
We can see that the concept “positive” may not agree with our intuitions!
True, False, Positives, Negatives
We can group results of the system into four categories:
True positive (tp) : The system correctly detects a positive.
True negative (tn) : The system correctly detects a negative.
False positive (fp) : The system wrongly classifies a negative as a positive.
False negative (fn) : The system wrongly classifies a positive as a negative.
                      actual case
system decision       positive    negative
positive              tp          fp
negative              fn          tn
Example: Positives and Negatives in Information Retrieval
In IR, “relevant” documents belong to the positive class.
                      actual case
system decision       relevant    not relevant
retrieved             tp          fp
not retrieved         fn          tn
If our retrieval system fails to retrieve a relevant document, this is a false negative.
Example: Spam Filtering
In spam filtering, “spam” emails belong to the positive class.
                      actual case
system decision       spam        not spam
marked spam           tp          fp
not marked spam       fn          tn

If our spam filter classifies a legitimate email as spam, this is a false positive.
Question
False positives in spam filtering are usually more dangerous than false negatives; why?
Precision and Recall
Formulas
precision = tp / (tp + fp)    (out of everything marked as positive by the system)
recall = tp / (tp + fn)    (out of everything that should be marked as positive)

Example
From a total collection of 200 documents, a retrieval system returned 30 documents, but 5 were not relevant. It also missed 12 documents.
                      actual case
system decision       relevant    not relevant
retrieved             25          5
not retrieved         12          158
Values of measures

precision = 25 / (25 + 5) = 25/30 ≈ 0.83
recall = 25 / (25 + 12) = 25/37 ≈ 0.68
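The same computation as a minimal Python sketch (the helper functions are ours, not from a library):

def precision(tp, fp):
    # out of everything marked as positive by the system
    return tp / (tp + fp)

def recall(tp, fn):
    # out of everything that should be marked as positive
    return tp / (tp + fn)

precision(25, 5)    # 0.833...
recall(25, 12)      # 0.675...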
F-Measure
The F-measure combines precision and recall into a single measure.

Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall)

The most common combination is when β = 1, referred to as F1:

F1 = 2 × (precision × recall) / (precision + recall)

This is the harmonic mean of precision and recall. For our previous example, F1 = 0.746.
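Continuing the sketch above, the F-measure as a small Python function (again, our own helper):

def f_measure(p, r, beta=1.0):
    # weighted harmonic mean of precision and recall
    return (1 + beta**2) * p * r / (beta**2 * p + r)

f_measure(25/30, 25/37)    # 0.746..., the F1 of the previous example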
Accuracy
Accuracy is the proportion of correct decisions out of the whole set:

accuracy = correct decisions / all data = (tp + tn) / (tp + fp + tn + fn)

For the previous example, accuracy is 183/200 = 0.915.

The term is sometimes used (wrongly) to refer to precision.

Beware of unbalanced data

What happens if you have unbalanced classes, e.g. there are 100 documents, 90 of them belong to the negative class, and the system classifies everything as negative?

recall = 0/10 = 0
precision = 0/0 = NaN (undefined)
accuracy = 90/100 = 0.9
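A minimal sketch of this pitfall (the counts follow the example above):

tp, fp, fn, tn = 0, 0, 10, 90    # everything classified as negative

accuracy = (tp + tn) / (tp + fp + tn + fn)    # 0.9, which looks good!
recall = tp / (tp + fn)                       # 0.0
# precision = tp / (tp + fp) would raise ZeroDivisionError: 0/0 is undefined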
Exercise: Spam Filtering
Exercise
Assume your system processes 1000 emails. It classifies 640 as spam, of which 480 are actually spam. It missed 120 spam emails. What are the precision and recall of the spam detection? What is the accuracy?
Indexing and Retrieval

Indexing
Bag of Words Representation
Bag of words (BoW)
At indexing time, a compact representation of the document is built.
The document is seen as a bag of words.
Information about word position is (often) discarded.
Only the important words are kept.
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Recently, the bag-of-words model has also been used for computer vision.
⇒ {bag, bag-of-words, computer, disregarding, document, grammar, information, IR, keeping, language, model, multiplicity, multiset, natural, order, processing, representation, represented, retrieval, sentence, simplifying, text, vision, word, words}
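A minimal sketch of building such a bag in Python, keeping multiplicity with collections.Counter (the example sentence is ours):

from collections import Counter

text = 'a text is represented as the bag of its words, a bag is a multiset'
bow = Counter(text.lower().replace(',', '').split())
bow['a']      # 3: multiplicity is kept
bow['bag']    # 2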
Stop Words
Stop words
A simple solution to determine important words is to keep a list of non-important words: the stop words.
All stop words in a document are ignored. Stop words are language-specific.
Typically, stop words are connecting words.
Stop words in NLTK
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> stop[:5]
['i', 'me', 'my', 'myself', 'we']
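Continuing the session above, stop words can then be filtered out of a token list (the sentence is our own example):

>>> words = 'the quick brown fox jumps over the lazy dog'.split()
>>> [w for w in words if w not in stop]
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']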
Term Frequency
Term Frequency
Usually, words that are not frequent are not important.
Words that are too frequent may occur in most documents and therefore can't be used to discriminate among documents.
Usually, the important words lie in the middle range of frequencies.
Zipf's Law for term frequency
A small percentage of words are very frequent.
A large percentage of words occur very rarely.
The relation approximates a Zipfian distribution, also referred to as a "long-tailed" distribution.
Zipf’s Law in Action
Python code
import nltk
import collections
import matplotlib.pyplot as plt

words = nltk.corpus.gutenberg.words('austen-emma.txt')
fd = collections.Counter(words)
data = [f for w, f in fd.most_common()]
plt.plot(data[:1000])
plt.show()

[Figure: frequencies of the 500 most frequent words, a rapidly decaying, long-tailed curve]
tf.idf
tf.idf
Term frequency: if a word is very frequent in a document, it is important for the document.

tf(t, d) = frequency of word t in document d

Inverse document frequency: if a word appears in many documents, it is not important for any of the documents.

idf(t) = log( number of documents / number of documents that contain t )

tf.idf combines these two characteristics:

tf.idf(t, d) = tf(t, d) × idf(t)
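A from-scratch sketch of these formulas (the helper functions are ours; the toy documents are the ones used in the examples later in this lecture):

import math

docs = ['computer software information language'.split(),
        'computer document retrieval library'.split(),
        'computer information filtering retrieval'.split()]

def tf(t, d):
    return d.count(t)

def idf(t, docs):
    df = sum(1 for d in docs if t in d)    # document frequency of t
    return math.log(len(docs) / df)

def tf_idf(t, d, docs):
    return tf(t, d) * idf(t, docs)

tf_idf('retrieval', docs[1], docs)   # log(3/2) > 0: discriminates
tf_idf('computer', docs[1], docs)    # log(3/3) = 0: occurs everywhere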
Problems with Bag-of-Words Representations
BoW representations ignore important information such as:
Word position: “Australia beat New Zealand” is not the same as “New Zealand beat Australia”
Morphology: If you search for “table”, a webpage that uses the word “tables” might be relevant.
Words with similar meanings: If you search for “truck”, a webpage that uses the word “lorry” might be relevant.
Ambiguity: If you search for “Apple” you might be interested in the company and not in the fruit.
Still, BoW representations are very simple, fast, and often surprisingly good.
Beyond BoW Representations
A simple way to account for (some) information about word positions is to use n-grams:
Bigrams, trigrams, 4-grams (usually there is no need for longer n-grams).
Thus, instead of representing a text as a bag of words, it can be represented as a bag of n-grams.
However, there are many more possible n-grams than words, and this can create problems for large vocabularies:
m words yield up to m² possible bigrams, m³ possible trigrams, ...
Using n-grams instead of words may also introduce other kinds of problems (we will see some of these problems in a future lecture).
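A minimal sketch using NLTK's ngrams helper (the example phrase is from the previous slide):

from nltk import ngrams

tokens = 'australia beat new zealand'.split()
list(ngrams(tokens, 2))
# [('australia', 'beat'), ('beat', 'new'), ('new', 'zealand')]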
Accounting for Word Meaning I
Ambiguity
Word disambiguation attempts to determine the sense of a word.
A word like “Apple” could be disambiguated as “apple1” or “apple2” to account for its several meanings.
Word disambiguation systems usually look at the “context” of the word:
Yesterday I ate an apple1.
Apple2 reported a benefit last fiscal year.
Accounting for Word Meaning II
Synonymy
There are lexical resources such as thesauri (singular: thesaurus) that list words with related meanings.
WordNet is a popular resource (https://wordnet.princeton.edu/).
Recent innovations include the use of distributional semantics to map words to vectors of numbers called word embeddings.
Word2Vec and GloVe are two popular early systems.
https://www.springboard.com/blog/introduction-word-embeddings/
Inverted Indices

Index (document → terms)

D1 → {computer, software, information, language}
D2 → {computer, document, retrieval, library}
D3 → {computer, information, filtering, retrieval}

Inverted Index (term → documents)

computer → {D1, D2, D3}
software → {D1}
information → {D1, D3}
language → {D1}
document → {D2}
retrieval → {D2, D3}
library → {D2}
filtering → {D3}
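A minimal sketch of inverting such an index in Python (the dictionaries mirror the example above):

from collections import defaultdict

index = {'D1': {'computer', 'software', 'information', 'language'},
         'D2': {'computer', 'document', 'retrieval', 'library'},
         'D3': {'computer', 'information', 'filtering', 'retrieval'}}

inverted = defaultdict(set)
for doc, terms in index.items():
    for term in terms:
        inverted[term].add(doc)

inverted['retrieval']    # {'D2', 'D3'}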
Boolean Retrieval
Retrieval
In the retrieval stage, the index is searched. This enables fast retrieval.
Note that the index does not contain the full information from the documents.
For example, searching for a stop word will be useless.
Boolean Retrieval
Use Boolean operations among the search terms:

x AND y: documents that contain both terms.
x OR y: documents that contain at least one of the terms.
NOT x: documents that do not contain the term.

The use of inverted indices simplifies this method:

x AND y: set intersection.
x OR y: set union.
NOT x: set complement.
Example of Boolean Retrieval
Keywords
D1: {computer, software, information, language}
D2: {computer, document, retrieval, library}
D3: {computer, information, filtering, retrieval}

Inverted Index

computer → {D1, D2, D3}, software → {D1}, information → {D1, D3}, language → {D1}, document → {D2}, retrieval → {D2, D3}, library → {D2}, filtering → {D3}
Boolean Query
(information OR document) AND retrieval
Result
({D1,D3} ∪ {D2}) ∩ {D2,D3} = {D2,D3}
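The same query as a minimal Python sketch, with the inverted index stored as sets:

inverted = {'computer': {'D1', 'D2', 'D3'}, 'software': {'D1'},
            'information': {'D1', 'D3'}, 'language': {'D1'},
            'document': {'D2'}, 'retrieval': {'D2', 'D3'},
            'library': {'D2'}, 'filtering': {'D3'}}

# (information OR document) AND retrieval
(inverted['information'] | inverted['document']) & inverted['retrieval']
# {'D2', 'D3'}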
Vector Retrieval
Boolean retrieval and ranking
There are no obvious methods to rank the results of Boolean retrieval.
A very successful attempt to handle this is Google's PageRank, but we will not cover it here.
An easy method to rank documents is to represent them as vectors and use well-established methods for vector comparison.
From Documents to Vectors
We can define a document as a vector where each word in the entire vocabulary corresponds to one element of the vector.
The resulting document matrix will typically be sparse, since each document contains only a small fraction of all possible words.
Possible information to store in the vector:
The occurrence of a word/stem/n-gram (1) or not (0) ← see example below.
The word frequency.
tf .idf ← a popular choice.
Distributional semantics ← a hot research topic.
Example of Vector Space Model
Template:
{computer, software, information, document, retrieval, language, library, filtering}

Initial documents

D1: {computer, software, information, language}
D2: {computer, document, retrieval, library}
D3: {computer, information, filtering, retrieval}

Document vectors

D1: (1,1,1,0,0,1,0,0)
D2: (1,0,0,1,1,0,1,0)
D3: (1,0,1,0,1,0,0,1)

Document matrix (typically a sparse matrix)

    ( 1 1 1 0 0 1 0 0 )
D = ( 1 0 0 1 1 0 1 0 )
    ( 1 0 1 0 1 0 0 1 )
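A minimal numpy sketch of building this binary document matrix (the variable names are ours):

import numpy as np

vocab = ['computer', 'software', 'information', 'document',
         'retrieval', 'language', 'library', 'filtering']
docs = [{'computer', 'software', 'information', 'language'},    # D1
        {'computer', 'document', 'retrieval', 'library'},       # D2
        {'computer', 'information', 'filtering', 'retrieval'}]  # D3

# one row per document, one column per vocabulary word
D = np.array([[1 if w in d else 0 for w in vocab] for d in docs])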
Cosine Similarity
Cosine Method
This is a popular approach to compare vectors.
We calculate the cosine of the angle between vectors. If the angle is zero, then the cosine is 1.
[Figure: documents D1, D2, D3 and a query Q1 drawn as vectors in two dimensions; cos(D1, Q1) = cos(α), cos(D2, Q1) = cos(0) = 1, cos(D3, Q1) = cos(β), where α and β are the angles between each document vector and the query]
Cosine Similarity: Formulas
General Formula
cos(Dj, Qk) = (Dj · Qk) / (‖Dj‖₂ ‖Qk‖₂) = Σᵢ Dj,i Qk,i / ( √(Σᵢ Dj,i²) × √(Σᵢ Qk,i²) )

(with the sums running over i = 1 ... N, the vocabulary size)

If the vectors are normalised:

cos(Dj, Qk) = Σᵢ Dj,i Qk,i = Dj · Qk
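A numpy sketch of the general formula (cos_sim is our own helper, not a library function):

import numpy as np

def cos_sim(d, q):
    # cosine of the angle between two vectors
    return np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q))

d1 = np.array([1, 1, 1, 0, 0, 1, 0, 0])    # D1 from the earlier example
q = np.array([0, 0, 1, 0, 1, 0, 0, 0])     # query: 'information retrieval'
cos_sim(d1, q)    # 0.3535...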
Vector Retrieval in Python
Vectors and Matrices in Python
numpy
numpy is a Python library for the manipulation of vectors and matrices.
http://www.numpy.org/
It’s pre-loaded in the Anaconda distribution.
Manipulating Vectors
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4])
>>> a[0]
1
>>> a[1:3]            # slicing
array([2, 3])
>>> a + 1             # add a constant to a vector
array([2, 3, 4, 5])
>>> b = np.array([2, 3, 4, 5])
>>> a + b             # add two vectors
array([3, 5, 7, 9])
>>> a * b             # pairwise multiplication
array([ 2,  6, 12, 20])
>>> np.dot(a, b)      # dot product between vectors, a . b
40
Manipulating Matrices
>>> x = np.array([[1, 2, 3], [4, 5, 6]])
>>> x
array([[1, 2, 3],
       [4, 5, 6]])
>>> y = np.array([[1, 1, 1], [2, 2, 2]])
>>> x + y             # add two matrices
array([[2, 3, 4],
       [6, 7, 8]])
>>> x * y             # pairwise multiplication
array([[ 1,  2,  3],
       [ 8, 10, 12]])
>>> x.T               # transpose
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.dot(x.T, y)    # dot product
array([[ 9,  9,  9],
       [12, 12, 12],
       [15, 15, 15]])
Scikit-learn I
http://scikit-learn.org/
Incorporates an extensive set of machine learning algorithms into Python.
It has a consistent and intuitive interface. The documentation is very complete.
Includes generic tutorials on the main machine learning algorithms.
To install Scikit-learn:
https://scikit-learn.org/stable/install.html
tf.idf with scikit-learn

>>> import glob
>>> files = glob.glob('enron1/ham/*.txt')
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(input='filename', stop_words='english')
>>> tfidf.fit(files)
>>> tfidf_values = tfidf.transform(files)
>>> len(tfidf.get_feature_names())    # get_feature_names_out() in newer scikit-learn
19892
>>> tfidf.get_feature_names()[10000:10005]
['grandma', 'grandpa', 'grandsn', 'grandsons', 'grant']
>>> type(tfidf_values)
scipy.sparse.csr.csr_matrix
>>> type(tfidf_values.toarray())
numpy.ndarray
>>> tfidf_values.shape
(3672, 19892)
Normalised tf.idf and cosine similarity in Python
>>> tfidf_norm = TfidfVectorizer(input='filename',
...                              stop_words='english',
...                              norm='l2')
>>> tfidf_norm_values = tfidf_norm.fit_transform(files).toarray()
>>> import numpy as np
>>> def cosine_similarity(X, Y):
...     return np.dot(X, Y)
...
>>> cosine_similarity(tfidf_norm_values[0,:], tfidf_norm_values[1,:])
0.017317648885111028
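A possible follow-up (a sketch assuming tfidf_norm_values from above): since the rows are L2-normalised, a single matrix-vector product ranks all documents against one of them.

>>> sims = np.dot(tfidf_norm_values, tfidf_norm_values[0, :])
>>> np.argsort(sims)[::-1][1:6]    # the five most similar documents to file 0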
Some Open Source Search Engines
If you don't want to implement your search engine from scratch, try these (http://www.mytechlogy.com/IT-blogs/8685/top-5-open-source-search-engines/):
1 Elasticsearch: https://www.elastic.co/
2 Solr: https://lucene.apache.org/solr/
3 Lucene: https://lucene.apache.org/
4 Sphinx: http://sphinxsearch.com/
5 Xapian: https://xapian.org/
6 Indri: https://www.lemurproject.org/
7 Zettair: http://www.seg.rmit.edu.au/zettair/
The following is a Python library that can be used for indexing and
retrieving documents (among many other things):
1 Gensim: https://radimrehurek.com/gensim/
Take-home Messages
1 What is indexing? What is retrieval?
2 What is an inverted index?
3 Perform Boolean retrieval by hand.
4 Implement Boolean retrieval in Python.
5 Use sklearn to build a vector model with tf.idf.
6 Use sklearn to implement cosine similarity.
What’s Next
Week 3
Introduction to Statistical Classification.
Submit Assignment 1 by Friday week 3.
Reading
NLTK Chapter 6 “Learning to Classify Text”.