
wk5_solutions

Inverted Index and Query Processing
© Professor Yuefeng Li (QUT)

For the two given XML documents (you can download them from the week 3 workshop and save them in a folder, e.g., 'data'), design a Python function index_docs() to index them (remove stop words and index stems only).

The returned index should be a dictionary {term:{docID1:freq1, docID2:freq2}, …}
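As a small illustration of this data structure, the sketch below builds the nested dictionary for two made-up documents. The document IDs, the token lists and the helper name build_toy_index are invented for the example, and no stop word removal or stemming is applied here.

```python
from collections import defaultdict

# Hypothetical toy collection: docID -> already tokenized text.
toy_docs = {
    "D1": ["formula", "one", "race", "formula"],
    "D2": ["race", "result"],
}

def build_toy_index(docs):
    """Return {term: {docID: freq}} for pre-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            # Count how often the term occurs in this document.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

toy_index = build_toy_index(toy_docs)
print(toy_index["formula"])  # {'D1': 2}
print(toy_index["race"])     # {'D1': 1, 'D2': 1}
```

Each key is a term, and each value is that term's inverted list, mapping a document ID to the term's frequency in that document.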

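import glob, os
import string
from stemming.porter2 import stem

def index_docs(inputpath, stop_words):
    Index = {}  # initialize the index
    os.chdir(inputpath)  # note: this changes the working directory
    for file_ in glob.glob("*.xml"):
        start_end = False  # becomes True once the <text> element has started
        for line in open(file_):
            line = line.strip()
            if start_end == False:
                # Assumes RCV1-style markup: <newsitem itemid="..."> carries the
                # document ID and <text>...</text> wraps the body paragraphs.
                if line.startswith("<newsitem "):
                    for part in line.split():
                        if part.startswith("itemid="):
                            docid = part.split("=")[1].split('"')[1]
                            break
                if line.startswith("<text>"):
                    start_end = True
            elif line.startswith("</text>"):
                break
            else:
                line = line.replace("<p>", "").replace("</p>", "")
                # delete digits, then replace every punctuation mark with a space
                line = line.translate(str.maketrans('', '', string.digits)).translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
                for term in line.split():
                    term = stem(term.lower())
                    if len(term) > 2 and term not in stop_words:
                        try:
                            try:
                                Index[term][docid] += 1
                            except KeyError:  # term is indexed, but not for this document
                                Index[term][docid] = 1
                        except KeyError:  # term is not in the index yet
                            Index[term] = {docid: 1}
    return Index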

# Note that text preprocessing (cleaning, stemming and stop word removal) happens before terms are indexed, so the index keys are stems.
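The preprocessing step can be tried on its own with the sketch below. The function name tokenize and its stem_fn parameter are assumptions made for this example. To keep the block free of third-party dependencies, stem_fn defaults to an identity function, so pass a real stemmer (for instance stem from stemming.porter2) to get stems.

```python
import string

# Translation tables: delete digits, map each punctuation mark to a space.
DROP_DIGITS = str.maketrans("", "", string.digits)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize(line, stop_words, stem_fn=str):
    """Clean, lowercase and filter one line of text.

    stem_fn is a placeholder (identity by default), not a real stemmer.
    """
    cleaned = line.translate(DROP_DIGITS).translate(PUNCT_TO_SPACE)
    terms = (stem_fn(t.lower()) for t in cleaned.split())
    # Keep terms longer than two characters that are not stop words.
    return [t for t in terms if len(t) > 2 and t not in stop_words]

tokens = tokenize("The 1997 Formula One season, round 3.", {"the"})
print(tokens)  # ['formula', 'one', 'season', 'round']
```

Digits are removed rather than replaced by spaces, so a token such as "1997" disappears entirely, while punctuation becomes whitespace so that adjacent words stay separated.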

Design a Python function doc_at_a_time(I, Q), where index I is a dictionary of term:dictionary of (itemId:freq), which returns a dictionary of docId:relevance for the given query Q (a term:freq dictionary).

def doc_at_a_time(I, Q):  # index I is a dictionary of term:dictionary of (itemId:freq)
    L = {}  # L holds the selected inverted lists
    R = {}  # R is a dictionary of docId:relevance
    for (term, postings) in I.items():
        for d in postings:  # initialise every document ID with relevance 0
            R[d] = 0
        if term in Q:  # select inverted lists based on the query
            L[term] = postings
    for d in R:  # score one document completely before moving to the next
        sd = 0
        for (term, f) in L.items():
            if d in f:
                sd = sd + f[d]*Q[term]
        R[d] = sd  # store the score back in R
    return R
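To see document-at-a-time scoring on concrete numbers, here is a minimal self-contained sketch over a made-up index. The name score_doc_at_a_time and the toy data are assumptions for illustration. It follows the same idea as the workshop function (the outer loop runs over documents, the inner loop over the query's inverted lists) but is written independently.

```python
def score_doc_at_a_time(index, query):
    """Score each document completely before moving to the next one."""
    # Inverted lists for the query terms that occur in the index.
    lists = {t: index[t] for t in query if t in index}
    # Every document that appears anywhere in the index.
    doc_ids = {d for postings in index.values() for d in postings}
    scores = {}
    for d in doc_ids:
        # Sum freq * query weight over the lists that contain document d.
        scores[d] = sum(postings[d] * query[t]
                        for t, postings in lists.items() if d in postings)
    return scores

# Hypothetical toy index: term -> {docID: freq}.
toy_index = {
    "formula": {"D1": 2, "D3": 1},
    "one": {"D1": 1, "D2": 4},
    "race": {"D2": 1},
}
scores = score_doc_at_a_time(toy_index, {"formula": 1, "one": 1})
print(sorted(scores.items()))  # [('D1', 3), ('D2', 4), ('D3', 1)]
```

For D1 the score is 2*1 + 1*1 = 3, since both query terms occur in it. D2 and D3 each match only one term.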

Design a Python function term_at_a_time(I, Q), where index I is a dictionary of term:dictionary of (itemId:freq), which returns a dictionary of docId:relevance for the given query Q (a term:freq dictionary).

def term_at_a_time(I, Q):  # index I is a dictionary of term:dictionary of (itemId:freq)
    L = {}  # L holds the selected inverted lists
    R = {}  # R is a dictionary of docId:relevance
    for (term, postings) in I.items():
        for d in postings:  # initialise every document ID with relevance 0
            R[d] = 0
        if term in Q:  # select inverted lists based on the query
            L[term] = postings
    for (term, li) in L.items():  # traverse the selected inverted lists one term at a time
        for (d, f) in li.items():  # for each occurrence of a document, update R
            R[d] = R[d] + f*Q[term]
    return R
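The accumulator idea behind term-at-a-time processing can be sketched as follows, again on made-up data. The name score_term_at_a_time is hypothetical. Unlike the workshop function, this variant only creates accumulators for documents that contain at least one query term, so documents with relevance 0 are left out of the result.

```python
from collections import defaultdict

def score_term_at_a_time(index, query):
    """Process one inverted list at a time, adding into accumulators."""
    accumulators = defaultdict(int)
    for term, q_weight in query.items():
        # A query term that is not in the index contributes nothing.
        for doc_id, freq in index.get(term, {}).items():
            accumulators[doc_id] += freq * q_weight
    return dict(accumulators)

# Hypothetical toy index: term -> {docID: freq}.
toy_index = {
    "formula": {"D1": 2, "D3": 1},
    "one": {"D1": 1, "D2": 4},
    "race": {"D2": 1, "D4": 2},
}
scores = score_term_at_a_time(toy_index, {"formula": 1, "one": 1, "missing": 1})
print(scores)          # {'D1': 3, 'D3': 1, 'D2': 4}
print("D4" in scores)  # False
```

After the list for "formula" is processed the accumulators hold partial scores (D1: 2, D3: 1). The list for "one" then raises D1 to 3 and adds D2. The final scores for matching documents agree with document-at-a-time scoring.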

Design a Python main program to call the above three functions for a query, e.g., Query = {'formula':1, 'one':1}.

#if __name__ == '__main__':

import sys

#if len(sys.argv) != 2:
#    sys.stderr.write("USAGE: %s \n" % sys.argv[0])
#    sys.exit()

curr_path = os.getcwd()
print(curr_path)

stopwords_f = open('common-english-words.txt', 'r')
stop_words = stopwords_f.read().split(',')
stopwords_f.close()
#Index = index_docs(sys.argv[1], stop_words)
#for term in Index:
#    print("Term -- %s" % term)
#    for (id, freq) in Index[term].items():
#        print("   Document ID: %s and frequency: %d" % (id, freq))
#Query = {'leaderboard':1, 'british':1}
#print(Index)

data_path = os.path.join(curr_path, 'data')
Index = index_docs(data_path, stop_words)  # create an index for all terms, data structure {'w1':{'ID1':2, 'ID2':1}, 'w2':{'ID3':1, 'ID1':3}}
os.chdir(curr_path)  # index_docs changed the working directory, so go back

Query = {'formula':1, 'one':1}
result1 = doc_at_a_time(Index, Query)
result2 = term_at_a_time(Index, Query)
# rank documents by relevance weight, highest first
x1 = sorted(result1.items(), key=lambda x: x[1], reverse=True)
x2 = sorted(result2.items(), key=lambda x: x[1], reverse=True)
print('Document_at_a_time result--------')
for (id, w) in x1:
    print('Document ID: ' + id + ' and relevance weight: ' + str(w))
print('Term_at_a_time result --------')
for (id, w) in x2:
    print('Document ID: ' + id + ' and relevance weight: ' + str(w))

/Users/li3/Desktop/QUT2020_22/teaching/Sem1_2022/IFN647/workshop/week5
Document_at_a_time result--------
Document ID: 741299 and relevance weight: 2
Term_at_a_time result --------
Document ID: 741299 and relevance weight: 2

# We assume Jupyter is started from your working directory.
# We use os methods to find the current working directory 'curr_path', and then the data directory 'data_path'.
# Note that we need to go back to the current working directory after calling index_docs, because it changes the working directory.
# You may change 'Query' to test more queries.
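One way to avoid switching directories back and forth is to build the search pattern from the folder path instead of calling os.chdir. The sketch below shows only that file-listing step. The helper name list_xml_files is an assumption, and a temporary folder stands in for 'data'.

```python
import glob
import os
import tempfile

def list_xml_files(input_path):
    """Return sorted paths of *.xml files without changing directory."""
    return sorted(glob.glob(os.path.join(input_path, "*.xml")))

# Demonstration with a temporary folder standing in for 'data'.
start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml", "b.xml", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("<text></text>\n")
    found = [os.path.basename(p) for p in list_xml_files(tmp)]

print(found)                     # ['a.xml', 'b.xml']
print(os.getcwd() == start_dir)  # True
```

Because the returned paths include the folder, each file can be opened directly and the working directory never changes.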
