CE306/CE706 – Spring 2021 Laboratory Worksheet 2
Regular Expressions + Tools for the IR Indexing Pipeline
This lab aims to familiarise you with tools that can be employed in the pre-processing pipeline of an Information Retrieval (IR) application. It should also help you get started with the first assignment.
The first part covers regular expressions and tokenization, continuing from the material already introduced in the lecture and the class. The remainder points you towards state-of-the-art open-source tools. Remember that one of the beauties of IR is that there are many different ways to solve a problem, so we would not be surprised to see many different solutions to the assignment.
1 Regular Expressions in Java
For those of you who are familiar with regular expressions, this is a bit of a revision as well as an illustration as to how they are applied in the indexing pipeline of an information retrieval system (for very basic pre-processing tasks). For everybody else, this will be a good time to get used to regular expressions as they are essential in IR applications (though they might be hidden away in a tool that does all the indexing for you). If you do not manage to do the lab script within the allocated time, then please do go through it in your own time.
There are several regular expression libraries for Java, but in this lab we will use the default package that comes with Java, java.util.regex. The lab is based on the regular expressions tutorial, which you can find here:
http://download.oracle.com/javase/tutorial/essential/regex/
The two main classes of the java.util.regex API are Pattern and Matcher. In the tutorial, you start by creating a Java file, RegexTestHarness.java, that reads regular expressions from console input. Each regular expression read from the keyboard is compiled into a pattern using the compile method of the Pattern class. The matcher method of the same class then creates a Matcher object for a given input string, and the Matcher's find method locates the parts of the input that match the regular expression.
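The compile/matcher/find sequence described above can be sketched as follows. This is a minimal illustration rather than the tutorial's RegexTestHarness: the class name RegexMatchSketch and the helper findAll are hypothetical names chosen for this example, and the regular expression is passed in as a string instead of being read from the console.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name for this sketch; it is not part of the Oracle tutorial.
class RegexMatchSketch {

    // Returns one entry per match, in the form "text@start-end".
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // compile the regular expression
        Matcher matcher = pattern.matcher(input);  // create a matcher for this input
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {                   // move to the next match, if there is one
            matches.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String match : findAll("foo", "foofoofoo")) {
            System.out.println(match);
        }
    }
}
```

Run on the input "foofoofoo" with the expression "foo", the sketch reports three matches, at index ranges 0-3, 3-6 and 6-9. Note that the end index is exclusive.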
Exercise: Go to the Regex tutorial page, download RegexTestHarness.java into your folders, and make sure you can compile it.
The tutorial then covers increasingly complex types of regular expressions: from the simplest form of RE (a string of characters), to metacharacters, disjunction, ranges, negation, predefined characters, and quantifiers.
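As a quick preview of these categories, the sketch below tries one small expression of each kind. The class name RegexKindsSketch, the helper found and the sample strings are all made up for illustration; the tutorial's own examples may differ.

```java
import java.util.regex.Pattern;

// Hypothetical class name for this sketch; the sample expressions are illustrative only.
class RegexKindsSketch {

    // True if the regular expression matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("cat", "concatenate"));  // plain string of characters
        System.out.println(found("c.t", "cut"));          // metacharacter: . matches any character
        System.out.println(found("cat|dog", "hotdog"));   // disjunction: either alternative
        System.out.println(found("[a-c]at", "bat"));      // range: a, b or c followed by "at"
        System.out.println(found("[^bc]at", "hat"));      // negation: any character except b or c
        System.out.println(found("\\d", "room 42"));      // predefined character class: a digit
        System.out.println(found("a{3}", "baaad"));       // quantifier: exactly three a's in a row
    }
}
```

Every call in main returns true. Changing an input so that it no longer fits (for example "rat" for the range, or "bat" for the negation) is a simple way to check your understanding before starting the tutorial exercises.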
Exercise: Go through the tutorial and do the exercises.
2 Tokenization in Java
As discussed in the lecture, tokenization is the task of extracting tokens from the input text. The definition of 'token' depends on the application, but in most cases complete words count as tokens; sometimes punctuation marks do as well. Finite-state methods are typically used for tokenization because of their efficiency. In Java, the class java.util.StringTokenizer can be used for a very basic form of tokenization. For example, the code:
import java.util.StringTokenizer;

StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
prints the following output:
this
is
a
test
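If punctuation marks should count as tokens too, one possible approach is the three-argument StringTokenizer constructor, which takes a set of delimiter characters and a flag saying whether the delimiters themselves are returned as tokens. The sketch below assumes that spaces separate tokens and that commas, full stops and exclamation marks are tokens in their own right; the class name, the delimiter set and the sample sentence are chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical class name for this sketch; the delimiter set is an assumption.
class PunctuationTokenizerSketch {

    // Splits on spaces and on , . ! and keeps the punctuation marks as tokens.
    static List<String> tokenize(String text) {
        // The third argument (true) asks StringTokenizer to return delimiters as tokens.
        StringTokenizer st = new StringTokenizer(text, " ,.!", true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!token.trim().isEmpty()) {  // drop the space delimiters, keep punctuation
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hello, world! This is a test.")) {
            System.out.println(token);
        }
    }
}
```

For the sample sentence this yields nine tokens: Hello, the comma, world, the exclamation mark, This, is, a, test and the final full stop. Whether punctuation should be kept at all depends on the application, as noted above.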
More sophisticated types of tokenization, allowing for different types of delimiting characters, can be specified using the split method of String or the java.util.regex package. The following example illustrates how the String.split method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x