CE306 – Information Retrieval
Assignment 1: Indexing for Web Search
Udo Kruschwitz
6th February 2017
Plagiarism
You are reminded that this work is for credit towards the composite mark in CE306, and that the work you
submit must therefore be your own. Any material you make use of, whether it be from textbooks, the Web or
any other source, must be acknowledged as a comment in the program, and the extent of the reference clearly
indicated.
The Context of your Task
Here is part of a current job advert.1 It is entitled NLP Software Engineer and the job will be with Babylon
Health, a tech start-up which happens to be the host of the next Text Analytics Meetup later this month (some
slots are still available, so you had better be quick if you want to attend):
We are looking for a software engineer with natural language processing (NLP) and information extraction (IE)
experience to join our NLP team.
…
Requirements
• Someone with experience in natural language processing and the handling of unstructured text.
• A natural team player, who enjoys working collaboratively with colleagues.
• Strong development skills in at least one of Java or Python essential.
• Excellent system design with solid testing and an eye towards scalability and robustness.
• Demonstrated experience with using NLP technologies as a software engineer.
• Demonstrable experience of major NLP libraries and tools, e.g. SpaCy, OpenNLP, NLTK, GATE, UIMA,
Stanford CoreNLP.
• Scaling NLP solutions using large-scale data processing engines such as Apache Spark.
…
That looks exactly like the profile I expect you to have after finishing this module!
The Task
Your task is to apply your IR skills to build a processing pipeline that turns a Web site into structured knowledge
(thus enhancing your chances to get the job outlined above). Your system should take HTML pages as input,
process them using the kind of techniques that we have been looking at in the module, and output an index of
terms identified in the documents.
1 https://babylon-health.workable.com/jobs/299414
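To make the notion of an "index of terms" concrete, here is a minimal sketch of one possible output structure: a mapping from each term to the documents it occurs in, with occurrence counts. The function name `build_index`, the file names and the term lists are hypothetical and serve only as illustration; the assignment does not prescribe this format.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {document name: number of occurrences}."""
    index = defaultdict(dict)
    for name, terms in docs.items():
        for term in terms:
            index[term][name] = index[term].get(name, 0) + 1
    return index

# Hypothetical, already-extracted terms for two documents.
docs = {
    "index.html": ["retrieval", "search", "retrieval"],
    "syllabus.html": ["search", "index"],
}

index = build_index(docs)
for term in sorted(index):
    # One line per term, listing the documents that contain it.
    print(term, dict(sorted(index[term].items())))
```

Each processing stage described below would then decide which terms end up in the lists passed to such a function.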
This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages.
You might also implement a system that does not strictly follow the stages but will work in the same way. The
stages are as follows:
• Input/Output (10%) The system must be able to read Web pages (a small number will do here and they
can be stored locally) and produce appropriately formatted output. The Web pages should be processed
one at a time using the steps outlined below.
• HTML Parsing (10%) Before the text can be analyzed it is necessary to get rid of the HTML tags.
The result will be plain text. Note that if you simply delete all HTML tags, you will lose information
such as meta tag keywords. Therefore, I strongly suggest that you use some tool to perform this task.
• Pre-processing: Sentence Splitting, Tokenization and Normalization (10%) The next step
should be to transform the input text into a normal form of your choice.
• Part-of-Speech Tagging (10%) The input should be tagged with a suitable part-of-speech tagger, so
that the result can then be processed in the next steps.
• Selecting Keywords (20%) One aim of your system is to identify the words or phrases in the text
that are most useful for indexing purposes. Your system should remove words which are not useful, such
as very frequent words or stopwords. You should develop a selection method, possibly using POS tags
(e.g. nouns and noun phrases) in combination with statistical/frequency information (e.g. using term
frequency).
• Stemming or Morphological Analysis (10%) Writing word stems to the database rather than words
allows various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the
same thing even though they are different words.
• Engineering a Complete System (10%) The final system should have control over all the individual
components so that there is a single call and all the above steps will be performed.
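As an illustration of the HTML Parsing stage above, the following sketch uses Python's standard `html.parser` module to strip tags while keeping the content of the keywords meta tag, which would be lost if tags were simply deleted. The class name `TextExtractor`, the helper `html_to_text` and the sample page are assumptions made for this example; dedicated open-source HTML tools may well be a better choice for real Web pages.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and the content of <meta name="keywords">."""

    SKIP = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.keywords = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script> and <style>.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return (plain text, list of meta keywords) for one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.keywords

# Hypothetical sample page, used only to exercise the sketch.
sample = """<html><head><title>Demo</title>
<meta name="keywords" content="information retrieval, indexing">
<style>p { color: red; }</style></head>
<body><h1>Web Search</h1><p>Indexing &amp; retrieval.</p>
<script>var x = 1;</script></body></html>"""

text, keywords = html_to_text(sample)
print(text)
print(keywords)
```

The plain text and the keyword list are returned separately so that later stages can decide how much weight to give to the meta keywords.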
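The pre-processing, keyword selection and stemming stages above could be sketched as follows, using only the Python standard library. This is a deliberately simplified illustration: the stopword list is a tiny made-up sample, `naive_stem` only strips plural endings and is not a real stemmer, and part-of-speech tagging is left out because the standard library has no tagger. All function names and the sample text are assumptions for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "for", "in", "is",
             "of", "on", "the", "to", "with"}

def tokenize(text):
    """Normalize to lower case and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude plural stripping only; NOT a full stemming algorithm."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if (len(token) > 3 and token.endswith("s")
            and not token.endswith(("ss", "us", "is"))):
        return token[:-1]
    return token

def select_keywords(text, top_n):
    """Single call: tokenize, drop stopwords, stem, rank by term frequency."""
    tokens = [t for t in tokenize(text)
              if t not in STOPWORDS and len(t) > 2]
    counts = Counter(naive_stem(t) for t in tokens)
    return counts.most_common(top_n)

# Hypothetical input text, used only to exercise the sketch.
text = ("Search engines index web pages. A search engine ranks the pages "
        "for a query, and queries are matched against the index.")

for term, freq in select_keywords(text, 5):
    print(term, freq)
```

Replacing `naive_stem` with an established stemmer and adding a part-of-speech filter before counting are the obvious next steps, and a wrapper that chains HTML parsing with `select_keywords` would correspond to the single call asked for in the final stage.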
You will have noticed that the percentages above only add up to 80%. This is because one of the important
aspects of the project is that your work should be well documented and your code well commented. 20% of
your mark will come from this. You should submit:
• A description of your implementation: what the code does, and the software you used
• Unedited and commented output from a run of the code submitted using these two Web pages:
http://csee.essex.ac.uk/staff/udo/index.html
http://orb.essex.ac.uk/CE/CE306/syllabus.html
(feel free to submit other runs as well, i.e. using Web pages of your own choice)
• A short discussion of your solution focussing on functionality implemented and possible improvements and
extensions.
You may work in pairs (remember the job advert: “A natural team player, who enjoys working collaboratively
with colleagues”). If you do, you only need to submit one report. Both members of a pair will get the same
mark unless there is reason to do otherwise.
You can implement your system on either the Linux or the Windows machines. Perl, Java, Python, C/C++,
and shell scripts are good choices for this project, but you are by no means restricted to those languages.
Identify suitable open-source tools that help you build your pipeline.
Submission
The assignment, which counts for 20% of the overall mark, should be submitted as a single zip file via the
electronic submission system by Friday, 24 February 2017, 11:59 (mid-day). The guidelines about late
assignments are explained in the students’ handbook.