
Practical Week 02

Demonstration¶
The following demonstration will use the training set of the OHSUMED corpus. This training set was used in the Filtering Track of the 9th edition of the Text REtrieval Conference (TREC-9). We will use it for the information retrieval exercises of this workshop. Download ohsumed.zip into the same folder as this notebook. The file is part of the git repository, so if you have cloned or downloaded the entire repository you will have the file in the right folder.

The following code unzips the file:

In [1]:

import zipfile
zip_ref = zipfile.ZipFile('ohsumed.zip', 'r')
zip_ref.extractall('.')
zip_ref.close()

To help you read the data, we provide the file ohsumed.py (included in the zip file above), which offers a simple API to the data. When you import it at the Python prompt, it provides the following variables:

index: a dictionary with document IDs as keys, and document text as values.
questions: a dictionary with query IDs as keys, and query text as values.
answers: a dictionary with query IDs as keys, and a set with the IDs of known relevant documents as values. This information is used for evaluation.

Below are some examples:

In [2]:

import ohsumed

Reading OHSUMED data

In [3]:

len(ohsumed.index)

Out[3]:

54710

In [4]:

sorted(list(ohsumed.index.keys()))[:10]

Out[4]:

['87049087',
'87049088',
'87049089',
'87049090',
'87049091',
'87049092',
'87049093',
'87049094',
'87049095',
'87049096']

In [5]:

ohsumed.index['87049087']

Out[5]:

'Some patients converted from ventricular fibrillation to organized rhythms by defibrillation-trained ambulance technicians (EMT-Ds) will refibrillate before hospital arrival. The authors analyzed 271 cases of ventricular fibrillation managed by EMT-Ds working without paramedic back-up. Of 111 patients initially converted to organized rhythms, 19 (17%) refibrillated, 11 (58%) of whom were reconverted to perfusing rhythms, including nine of 11 (82%) who had spontaneous pulses prior to refibrillation. Among patients initially converted to organized rhythms, hospital admission rates were lower for patients who refibrillated than for patients who did not (53% versus 76%, P = NS), although discharge rates were virtually identical (37% and 35%, respectively). Scene-to-hospital transport times were not predictively associated with either the frequency of refibrillation or patient outcome. Defibrillation-trained EMTs can effectively manage refibrillation with additional shocks and are not at a significant disadvantage when paramedic back-up is not available.'

In [6]:

len(ohsumed.questions)

Out[6]:

63

In [7]:

sorted(list(ohsumed.questions.keys()))[:10]

Out[7]:

['OHSU1',
'OHSU10',
'OHSU11',
'OHSU12',
'OHSU13',
'OHSU14',
'OHSU15',
'OHSU16',
'OHSU17',
'OHSU18']

In [8]:

ohsumed.questions['OHSU1']

Out[8]:

'60 year old menopausal woman without hormone replacement therapy Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy'

In [9]:

len(ohsumed.answers)

Out[9]:

63

In [10]:

ohsumed.answers['OHSU1']

Out[10]:

{'87097544', '87157536', '87157537', '87202778', '87316316', '87316326'}

Inverted index¶
We are going to build an inverted index of the non-stop words with frequency higher than 5.
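
As a small illustration of the data structure (the two toy documents below are made up and are not part of OHSUMED), an inverted index maps each word to the set of IDs of the documents that contain it:

# Toy illustration only: two made-up documents, not from the OHSUMED corpus.
toy_docs = {'d1': 'aspirin reduces fever', 'd2': 'fever and chills'}

toy_inverted = dict()
for docid, text in toy_docs.items():
    for word in text.split():
        toy_inverted.setdefault(word, set()).add(docid)

# toy_inverted['fever'] is now {'d1', 'd2'}: the documents that contain "fever".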

The following code reads the files and creates a counter of all words in the corpus (including stop words). We will use NLTK’s word tokeniser (read the beginning of chapter 3 of NLTK’s book) to convert each document into a list of tokens. Note that this code may take some time to run.

In [11]:

import nltk, collections
nltk.download('stopwords')
nltk.download('punkt')
stop = nltk.corpus.stopwords.words('english')
wordcounter = collections.Counter([w.lower() for k in ohsumed.index
                                   for s in nltk.sent_tokenize(ohsumed.index[k])
                                   for w in nltk.word_tokenize(s)])

[nltk_data] Downloading package stopwords to /home/diego/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/diego/nltk_data...
[nltk_data] Package punkt is already up-to-date!

In [12]:

wordcounter.most_common(10)

Out[12]:

[('the', 305806),
('of', 271953),
('.', 254858),
(',', 239656),
('and', 179604),
('in', 172449),
('to', 107431),
(')', 96259),
('(', 95948),
('a', 95281)]

The following code creates the inverted index of all non-stop words with frequency higher than 5. Note that this code may take some time to run.

In [13]:

inverted = dict()
for d in ohsumed.index:
    for w in nltk.word_tokenize(ohsumed.index[d]):
        w = w.lower()
        if w in stop or wordcounter[w] <= 5:
            continue
        if w in inverted:
            inverted[w].add(d)
        else:
            inverted[w] = set([d])

In [14]:

sorted(list(inverted.keys()))[3000:3010]

Out[14]:

['accentuation', 'accept', 'acceptability', 'acceptable', 'acceptably', 'acceptance', 'accepted', 'accepting', 'acceptor', 'acceptors']

In [15]:

inverted['acceptability']

Out[15]:

{'87057543', '87067994', '87073895', '87074134', '87114326', '87119697', '87121859', '87129900', '87149032', '87153185', '87193350', '87223625', '87223856', '87224779', '87232524', '87251875', '87273001', '87282178', '87295871', '87297008'}

The following code saves the inverted index into a pickle file so that we do not need to compute it again. Read Python's documentation on pickle files for more detail. Note that the file is opened for writing in binary mode, following the advice of this Stack Overflow post about saving pickle files.

In [16]:

import pickle
with open('inverted.pickle', 'wb') as f:
    pickle.dump(inverted,f)

Boolean retrieval¶
The following code reads the pickle file and returns the documents that match this Boolean query:

(menopausal OR pregnant) AND woman AND NOT healthy

In [17]:

import pickle
with open('inverted.pickle', 'rb') as f:
    inverted = pickle.load(f)

In [18]:

(inverted['menopausal'] | inverted['pregnant']) & inverted['woman'] - inverted['healthy']

Out[18]:

{'87060673', '87066899', '87097274', '87097518', '87099263', '87114245', '87117852', '87128881', '87134330', '87138205', '87153548', '87153568', '87169457', '87185313', '87226668', '87231479', '87235637', '87251241', '87252385', '87261426', '87281235', '87290433', '87296136', '87316210', '87316220', '87316328', '87324028', '87325497'}

Note that the query took very little time to run. In general, creating the index may take some time, but it needs to be done only once as long as the files do not change. Queries on the index are very fast.

Your Turn¶

1. Vector Retrieval¶

Exercise 1.1: Boolean Information Retrieval¶
Create an inverted index of the NLTK Gutenberg corpus and save it into a file "gutenbergindex.pickle". To create this index there is no need to remove stop words or filter by word frequency, since the corpus is not that large; simply use all the words. Use this index to find the documents that match the following Boolean queries (one possible approach is sketched after the exercise cells below):

Brutus OR Caesar
Brutus AND NOT Caesar
(Brutus AND Caesar) OR Calpurnia

In [19]:

import pickle
import nltk
nltk.download("gutenberg")

# Write your code here

[nltk_data] Downloading package gutenberg to /home/diego/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!

Saving index into file gutenbergindex.pickle...
Done

In [20]:

with open('gutenbergindex.pickle','rb') as f:
    gutenbergindex = pickle.load(f)

In [21]:

# Write your code for searching for Brutus OR Caesar

Out[21]:

{'bible-kjv.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt'}

In [22]:

# Write your code for searching for Brutus AND NOT Caesar

Out[22]:

set()

In [1]:

# Write your code for searching for (Brutus AND Caesar) OR Calpurnia
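
The following sketch shows one possible approach; it is not the only correct solution, and the loop structure is illustrative. It maps every token of the Gutenberg corpus (as returned by nltk.corpus.gutenberg.words) to the set of file IDs that contain it, then answers the Boolean queries with set operations, just as in the OHSUMED demonstration above.

# A possible sketch, not the official solution: index every token of the
# Gutenberg corpus, mapping it to the set of file IDs that contain it.
import pickle
import nltk
nltk.download("gutenberg")

gutenbergindex = dict()
for fileid in nltk.corpus.gutenberg.fileids():
    for w in nltk.corpus.gutenberg.words(fileid):
        if w in gutenbergindex:
            gutenbergindex[w].add(fileid)
        else:
            gutenbergindex[w] = set([fileid])

with open('gutenbergindex.pickle', 'wb') as f:
    pickle.dump(gutenbergindex, f)

# The queries then become set operations:
gutenbergindex['Brutus'] | gutenbergindex['Caesar']                                   # Brutus OR Caesar
gutenbergindex['Brutus'] - gutenbergindex['Caesar']                                   # Brutus AND NOT Caesar
(gutenbergindex['Brutus'] & gutenbergindex['Caesar']) | gutenbergindex['Calpurnia']   # (Brutus AND Caesar) OR Calpurnia
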
Exercise 1.2: tf.idf¶
Using scikit-learn, compute the tf.idf of all words in the OHSUMED corpus. Use the English list of stop words, and leave all other settings at their default values. In particular, do not stem the words. Pickle the resulting tf.idf vectoriser into a file tfidf.pickle. Note that in this exercise you should use the sklearn functions, not nltk; in particular, do not use NLTK's list of stop words or its tokeniser. (One possible approach is sketched after the exercise cells below.)

In [24]:

# Write your code to compute the tf.idf

Out[24]:

TfidfVectorizer(stop_words='english')

In [25]:

# Write your code to save the results in a pickle file
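
A minimal sketch of one way to do this is shown below; the variable names documents and docids are illustrative and not required by the exercise. It fits a TfidfVectorizer with the English stop word list on all OHSUMED documents and pickles the fitted vectoriser.

# A possible sketch: fit a TfidfVectorizer (English stop words, default settings)
# on all OHSUMED documents and save the fitted vectoriser to tfidf.pickle.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

docids = sorted(ohsumed.index.keys())            # fix a document order for later use
documents = [ohsumed.index[d] for d in docids]

tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(documents)

with open('tfidf.pickle', 'wb') as f:
    pickle.dump(tfidf, f)
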
Exercise 1.3: Sort by tf.idf¶
Write a program that returns the words of a document with the highest tf.idf scores. The resulting list of words should be sorted by tf.idf score in descending order (a possible implementation is sketched after the stub below).

In [26]:

def best_tfidf(tfidf, docID, numwords=10):
    """Return the words with highest tf.idf, in descending order
    >>> best_tfidf(tfidf, '87049087', numwords=3)
    ['rhythms', 'refibrillation', 'organized']
    """
    # Write your code here
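
As a sketch (not necessarily the intended solution), the implementation below assumes that tfidf is the vectoriser fitted in Exercise 1.2; it transforms the requested document and sorts the vocabulary by the resulting tf.idf scores. The call to get_feature_names_out assumes a recent version of scikit-learn (older versions use get_feature_names).

# A possible sketch, assuming `tfidf` is the vectoriser fitted in Exercise 1.2.
import numpy as np

def best_tfidf(tfidf, docID, numwords=10):
    """Return the words with highest tf.idf in the document, in descending order"""
    vector = tfidf.transform([ohsumed.index[docID]]).toarray()[0]
    words = tfidf.get_feature_names_out()        # get_feature_names() in older scikit-learn
    best = np.argsort(vector)[::-1][:numwords]
    return [words[i] for i in best if vector[i] > 0]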

In [27]:

best_tfidf(tfidf,'87049087')

Out[27]:

['rhythms',
'refibrillation',
'organized',
'refibrillated',
'converted',
'emt',
'paramedic',
'ds',
'defibrillation',
'hospital']

Optional exercise: tf.idf cosine similarity¶
Use the OHSUMED collection for the following exercise. Write a function that takes a string as a parameter, plus an optional parameter $n$ for the number of results, and returns the IDs of the $n$ documents that are most relevant according to tf.idf and cosine similarity. The results should be sorted in descending order of the cosine similarity score. (A sketch using scikit-learn's implementation of cosine similarity follows the stub below.)

In [30]:

# The following function implements cosine similarity by using the formulas we have seen in the lectures.
# Feel free to use sklearn's implementation of cosine similarity instead.

def best_documents(querystring, n=10):
    """Return the indices of the best n documents using cosine similarity
    >>> best_documents(ohsumed.questions['OHSU1'], n=3)
    ['87285549', '87162574', '87068356']"""
    # Write your code here
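
As a sketch, the version below uses scikit-learn's cosine_similarity instead of the lecture formulas, which the comment above explicitly allows. It assumes the tfidf, documents and docids variables from the Exercise 1.2 sketch; those names are illustrative and not part of the original notebook.

# A possible sketch using sklearn's cosine_similarity rather than the manual formula.
# Assumes `tfidf`, `documents` and `docids` as defined in the Exercise 1.2 sketch.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

doc_matrix = tfidf.transform(documents)          # tf.idf vectors of all documents

def best_documents(querystring, n=10):
    """Return the IDs of the n documents most similar to the query string"""
    query_vector = tfidf.transform([querystring])
    scores = cosine_similarity(query_vector, doc_matrix)[0]
    best = np.argsort(scores)[::-1][:n]
    return [docids[i] for i in best]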

In [31]:

best_documents(ohsumed.questions['OHSU1'])

Out[31]:

['87052846',
'87053030',
'87057603',
'87057561',
'87054719',
'87053640',
'87053630',
'87055106',
'87057550',
'87053614']

In [ ]: