CMPT 456 Course Project 1
Due: 11:59 pm, July 17, 2020
100 points in total
Please submit your assignment in Coursys.
Every student must complete the project independently. You are encouraged to learn through discussion with the instructor, the TAs, and your peers, but plagiarism is a serious violation of the university’s academic integrity policy. We have zero tolerance for such behavior.
This project introduces you to working with the Lucene library. We will walk you through a common codebase that we have built to help you become as familiar with Lucene as possible.
Codebase
● The codebase is available in our GitLab: https://csil-git1.cs.surrey.sfu.ca/pia5/cmpt456-project1-starter-code
● You should be able to clone it to your own workspace in order to do the programming tasks described in the next sections.
● For this assignment, we use the latest version of Lucene, branch 6.6 (6.6.7), with Java 8. You can browse the detailed API documentation here: https://lucene.apache.org/core/6_6_6/index.html
● The codebase is a fork from Lucene/Solr open source code (https://github.com/apache/lucene-solr) with some customizations in order to allow you to run it inside a Docker environment (https://www.docker.com/). So, you need to have Docker installed on your machine. The CSIL computers have been equipped with Docker already, so feel free to use them.
● The purpose of running Lucene/Solr inside a Docker container is to let you work on this assignment on almost any OS you prefer: Linux, Mac, or Windows. If you are curious about how the Docker container is built, look at the Dockerfile in the source code.
Project Data
● We are going to use the Wiki Small data (6043 documents) from our textbook.
● We have included the data for you within the codebase, at lucene/demo/data. In the subsequent sections, you will use it to demonstrate the indexing and querying process.
Compiling
● Check out the codebase to your local machine with git:
git clone https://csil-git1.cs.surrey.sfu.ca/pia5/cmpt456-project1-starter-code
cd cmpt456-project1-starter-code
● Build the Docker image from the source code (make sure to include "." (i.e., the current location) at the end of the command):
docker build -t cmpt456-lucene-solr:6.6.7 .
NOTE: Since Docker is not freely available for Windows, we recommend that you use VirtualBox with Ubuntu or the Windows Subsystem for Linux (WSL).
● Run the Docker image we just built in order to activate the Docker container:
docker run -it cmpt456-lucene-solr:6.6.7
Demo
In this section, we help you get familiar with Lucene's basic components by running two simple programs:
● Index Files: this program uses standard analyzers to create tokens from input text files, convert them to lowercase, and then filter out a predefined list of stop-words.
The source code is stored in this file within the codebase: lucene/demo/src/java/org/apache/lucene/demo/IndexFiles.java
Index demo data with the following command inside the Docker container:
ant -f lucene/demo/build.xml \
-Ddocs=lucene/demo/data/wiki-small/en/articles/ run-indexing-demo
● Search Files: this program uses a query parser to parse the input query text, then passes it to the index searcher to look for matching results.
The source code is stored in this file within the codebase: lucene/demo/src/java/org/apache/lucene/demo/SearchFiles.java
Search demo data with the following command inside the Docker container:
ant -f lucene/demo/build.xml run-search-index-demo
You are expected to run these examples and understand the Lucene components used in the indexing and querying process, so that you can extend them in the programming tasks below.
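To make the two demo stages concrete, here is a minimal plain-Java sketch of the same pipeline: analysis (tokenize, lowercase, drop stop-words), indexing into an inverted index, and searching. It is a conceptual illustration only and does not use or reproduce Lucene's classes. The class name ToyIndexDemo, the stop-word list, and the sample documents are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyIndexDemo {
    // Hypothetical stop-word list for illustration; Lucene ships its own.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Analysis: split on non-alphanumeric characters, lowercase, drop stop-words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Indexing: map each token to the sorted set of document ids containing it.
    static Map<String, TreeSet<Integer>> index(List<String> docs) {
        Map<String, TreeSet<Integer>> postings = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : analyze(docs.get(id))) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Searching: analyze the query the same way, then union the posting lists (OR semantics).
    static Set<Integer> search(Map<String, TreeSet<Integer>> postings, String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(query)) {
            hits.addAll(postings.getOrDefault(token, new TreeSet<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The history of Vancouver",
            "Lucene is a search library",
            "Searching the Vancouver library");
        Map<String, TreeSet<Integer>> postings = index(docs);
        System.out.println(analyze("The History of Vancouver")); // [history, vancouver]
        System.out.println(search(postings, "the LIBRARY"));     // [1, 2]
    }
}
```

The key point the sketch shares with the demo programs is that the query passes through the same analysis as the documents, which is why "LIBRARY" matches "library". Unlike Lucene, this toy version does no scoring or ranking and keeps everything in memory.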
Text Parsing (30 pts)
In the first part of the assignment, you will learn how to use Lucene to build search capabilities for documents in various formats, such as HTML, XML, PDF, and Word. Lucene itself does not parse these or other document formats; it is the responsibility of the application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.
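As a rough illustration of "convert to plain text first", the following plain-Java sketch strips markup from an HTML string with regular expressions. It is a deliberately naive stand-in, not the parser this assignment expects: it handles only a handful of entities and will mishandle comments, malformed markup, and many other cases that a real HTML parser covers. The class name HtmlToText and the method toPlainText are hypothetical.

```java
import java.util.regex.Pattern;

public class HtmlToText {
    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, e.g. <p>, </b>, <a href="...">.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Naive conversion: drop script/style blocks, drop tags, decode a few entities.
    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // decoded last to avoid double-decoding
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Wiki</title><style>p { color: red; }</style></head>"
                    + "<body><p>Lucene &amp; <b>Solr</b></p></body></html>";
        System.out.println(toPlainText(html)); // Wiki Lucene & Solr
    }
}
```

The resulting string is what an application would hand to Lucene for analysis and indexing, so that tag names and attributes do not end up as searchable terms.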
In the class IndexFiles.java within the Demo section, you can see that it indexes the content of html files, including all html tags (e.g.,