COSC 472/572 Big Data

COSC 472/572 Big Data

Assignment 4 – Create an Inverted Index

Due: Monday April 13, 2015

This is a paired assignment. You must have one (and only one partner). You must submit a paper with both partners names on it by Monday March 9.

Inverted indexes are one of the main data structures employed by a search engine. Basically, an inverted index has as a key value a specific word (e.g. “cat”). It will then contain every files (html document) that contains the word “cat”. A more elaborate version of an inverted index would contain a file name and word location. So for example, we might have something like the following:

cat-
a.html , carnivores.html , jungle_animals.html , …………

The more elaborate model would look like:

cat-
a.html (3,7,103) , carnivores.html (10,23,44,56,1234) , etc

In the second instance, the list indicates the position of the word “cat” within the file. Your assignment is to create an inverted index. This will be done in several steps:

1) Use heritrix to crawl the web. You must collect at least a gigabyte of data. This will take a long time (10-20 hours) . Instructions on using heritrix , including an excellent video done by graduate student Cade Sperlich, can be found below:

Wikipedia – Inverted Indexes
Heritrix home page
Cade Sperlich’s excellent Heritrix web crawl video

2) Heritrix creates .warc files . “.warc” stands for “web archive file”. But we will need text files as output. Again, Cade has provided excellent documentation on this conversion, as well as a Python script:

Cade’s Python script for converting .warc files to text files Some notes from Cade

3) Now that you have text files, create a Hadoop/MapReduce job to create the inverted index (again, refer to Cade’s notes). This will take some time.

4) Import your results into sqlite

5) Create a simple search engine! Write your program in either Python or Java and interface to your sqlite database. The search engine will have two input options:

– input a single word . We define word as and contiguous collection of non-blank roman letters like cat, dog, eastern, michigan. In this case, the program should return all documents containing the word

– input a pound sign (#) followed by a word, for example #cat , #dog, #eastern . In this case, the

program should return with the total number of documents containing the word.

References for calling sqlite from Python or java can be found here:

Java and SQLite Python and SQLite

What to hand in:
A paper copy of your mapper and reducer programs. Make sure this code is well documented and

has both members names on each paper. You will also need to demonstrate your program to me.