Project 1: Effectiveness of Approximate String Search

COMP90049 Knowledge Technologies, Semester 1 2015

Due:                  5pm, Tue 21 April, 2015
Submission mechanism: To be described on the LMS.
Submission materials: Source code, README (see below); Report (in PDF; also below)
Assessment criteria:  Analysis (10 marks), Soundness (5 marks), Report Quality (5 marks)
Marks:                The project will be marked out of 20, and will contribute 20% of your total mark.

Overview

For this project, we will be working with tweets from Twitter. Your objective is to develop software that identifies location names in these tweets.

However, there will be no direct assessment of the technical merits of your software: the focus of this project is a critical analysis of methodologies — which is to say, what you have learned about the problem and can pass along to another person attempting it. Impeccably coded software with little insight will not receive a strong mark.

Resources

You will be given a set of query strings and a set of tweets in which to search for them.
The query strings will take the format of a (fragment of a) gazetteer of locations in the United States of America, preprocessed so that each entry consists of a single location per line. Location names may have more than one word. For example, the gazetteer might start as follows:

    ACGC Elementary School
    Acorn Corner
    Acorn Hill Childrens Center
    ...

The set of tweets will take the form of a (long) document, with one tweet per line, prefixed by a tab-delimited tweet ID (which is not significant for the purposes of this project). We have pre-processed the tweets to remove line breaks, and any characters other than the upper and lower case English alphabetical letters and spaces [1]. The nature of the data means that this renders certain elements of the data (Twitter handles, hashtags, URLs) unrecognisable. For example, the tweet file might start as follows:

[1] Using the regular expression s/[^a-zA-Z ]//. We have also removed queries which did not follow this format; there were only a small number of these in the gazetteer.

    6039701829 AndrewWindham Leaving at am to drive to Akron OH to
    spend holiday with cousins anntaylor
    5515350829 CherriPRBuzz hey how u doing
    ...
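
For reference, reading this format might be done with something like the following Python sketch (the function names are ours, purely illustrative; the supplied tweets are already cleaned, so clean_text merely restates the footnoted substitution):

    import re

    def parse_tweet(line):
        # Each line is a tab-delimited tweet ID followed by the tweet text.
        tweet_id, _, text = line.rstrip('\n').partition('\t')
        return tweet_id, text

    def clean_text(text):
        # The substitution from the footnote: remove every character
        # other than upper/lower case English letters and spaces.
        return re.sub(r'[^a-zA-Z ]', '', text)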

The entire gazetteer has about 1M entries, and there are about 3M tweets in total. We will also make available a selection of random sub-samples of these files, in case you wish to start with a smaller collection, or if the entire collection takes too much time or space to process. These are indicated by an infix in the filename: queries.1M.txt (the entire set of queries), queries.100K.txt (10% of the query collection), tweets.3M.txt (the entire set of tweets), tweets.300K.txt (10% of the tweet collection), and so on.

The directory /home/subjects/comp90049/2015-sm1/project1/ on the CIS servers (dimefox and nutmeg.eng.unimelb.edu.au) contains the files listed above. The full-sized files are quite large, and have been gzipped. All of the files have been archived into a single tar-ball, for easy downloading (project1-data.tgz, about 170MB).

Terms of Use

As part of the terms of use of Twitter, in using the data you must agree to the following:

• You are strictly forbidden from redistributing (sharing) the dataset with others or reusing it for any purpose other than this project.

• You are strictly forbidden from reproducing documents in the document collection in any publication, other than in the form of isolated examples.

If, for some reason, you object to these terms of use, contact Jeremy (nj@unimelb.edu.au) as soon as possible.

Note that the document collection is a sub-sample of actual data posted to Twitter, without any filtering whatsoever. As such, the opinions expressed within the documents in no way express the official views of The University of Melbourne or any of its employees, and our use of them does not constitute endorsement of the views expressed within. It is likely that you will find some of the tweets to be insulting, or in poor taste; please use your discretion when considering the data, and try to look beyond the content to the task at hand. The University of Melbourne accepts no responsibility for offence caused by any content contained in the documents.

Tasks

Given a set of target strings (in this case, the tweets) and a set of queries (US locations), you will find the best approximate match (or matches) to each query in the texts. The location names are therefore the “queries”, and the tweets are the target texts. Please note that many tweets will not contain any location names at all, and many location names will not be present in the tweet collection.

For example, for the location name Acorn Corner, you might observe an approximate match in a tweet such as the following:

    123456789 my new hat is a scorn cornet its awesome

Obviously, this is a spurious match that is not an instance of the location, but that is the nature of the problem! Preferably, your system will output the tweet ID where the approximate match is hypothesised to occur (in this case, 123456789); if you find processing the IDs to be inconvenient, you can instead output the raw substring of the text which corresponds to the approximate match (scorn cornet).

Approximate Matching

We have discussed a selection of methods for approximate matching in this subject:

• Neighbourhood search (e.g. agrep or regular expressions)
• N-gram distance
• Edit distance
• Phonetic matching (e.g. Soundex or carney)

Other approximate matching strategies have been discussed in the literature; you might wish to try them out instead.
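
To make two of these concrete, here are minimal Python sketches (all names and conventions are ours, not supplied code): an n-gram distance computed as |Gx| + |Gy| - 2|Gx ∩ Gy|, and a simplified Soundex that omits the full algorithm's h/w separator rule:

    from collections import Counter

    def ngrams(s, n=2):
        # Character n-grams of s, padded so that every character
        # contributes to n full-length grams.
        s = '#' * (n - 1) + s + '#' * (n - 1)
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def ngram_distance(a, b, n=2):
        # |Ga| + |Gb| - 2|Ga intersect Gb|, counting grams with multiplicity.
        ga, gb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
        overlap = sum((ga & gb).values())
        return sum(ga.values()) + sum(gb.values()) - 2 * overlap

    def soundex(word):
        # Simplified Soundex: keep the first letter, code the remaining
        # consonants, collapse adjacent repeats, pad/truncate to 4 characters.
        codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
                 'c': '2', 'g': '2', 'j': '2', 'k': '2',
                 'q': '2', 's': '2', 'x': '2', 'z': '2',
                 'd': '3', 't': '3', 'l': '4',
                 'm': '5', 'n': '5', 'r': '6'}
        word = word.lower()
        out, last = word[0].upper(), codes.get(word[0], '')
        for ch in word[1:]:
            code = codes.get(ch, '')
            if code and code != last:
                out += code
            last = code
        return (out + '000')[:4]

For example, ngram_distance('Akron', 'Acron') is 4, and soundex('Akron') == soundex('Acron') == 'A265': both methods treat these spellings as close, which is the behaviour they are designed to exploit.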

One immediate problem that will surface will be that many of the queries comprise multiple words (for example, Acorn Corner). You will need to find a way of dealing with this; there are two main options:

  1. You may process the texts (and possibly the queries) into separate tokens (“words”). This approach dovetails best with an N-gram distance or Global Edit Distance matching strategy. You will then need some strategy for aggregating matches in sequence (a sketch of the per-token comparison follows this list).
  2. You may treat the query as a string (possibly with spaces), and compare it against each tweet, suitably interpreted as a string with spaces. (Or, if you’re very ambitious, you might treat the entire collection as a string with spaces!) This approach dovetails best with a Neighbourhood search or Local Edit Distance matching strategy.
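
As a sketch of option 1, the following Python computes Global Edit Distance (Levenshtein, unit costs) and uses it to compare one query token against the tokens of a tweet; the helper best_token_matches is purely illustrative, and aggregating per-token matches into multi-word matches is still up to you:

    def edit_distance(s, t):
        # Global edit distance by dynamic programming, with unit
        # costs for insertion, deletion and replacement.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (cs != ct)))   # match/replace
            prev = curr
        return prev[-1]

    def best_token_matches(query_token, tweet_tokens):
        # Compare a single query token against each token of a tokenised
        # tweet, returning the closest distance and the tokens achieving it.
        scored = [(edit_distance(query_token, t), t) for t in tweet_tokens]
        best = min(d for d, _ in scored)
        return best, [t for d, t in scored if d == best]

For example, best_token_matches('corner', 'my new hat is a scorn cornet'.split()) returns (1, ['cornet']).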

The task will be to implement and compare at least two approximate matching strategies. There are two main points of comparison:

  1. In terms of effectiveness: different methods will propose different types of approximate matches. Some of the matches will be meaningful, and some will not.
  2. In terms of efficiency: different methods have different overheads, which is a relevant concern in terms of processing a large amount of data. You should be aware of these issues — and you will certainly become aware of them, if you attempt to run all of the queries on the entire data set — however, this issue will not be our main focus for this project.

Since implementation is not our main concern in this project, you may use external packages or supplied code if you wish; if you do so, the source of the package should be clearly indicated in your README.

Evaluating Effectiveness

Your main analysis will assess whether the approximate matches returned by the two (or more) approximate matching methods you have tried are meaningful. In your report, you will be asked to identify some particular instances of good and bad approximate matching, and to assess the methods globally, as to whether they are appropriate or not for solving the problem of finding approximate matches to location names within a set of Twitter data.

There are two problems, however: finding meaningful queries for comparison, and the lack of a formal definition of what a “good” approximate match is.

  • For many queries (e.g. Chicago), the best “approximate” matches will be exact matches. The systems will probably return the same results, and will be impossible to compare in this context. One possibility is to ignore the exact matches, and only compare results that are not exact; this might make the problem artificially difficult, but the location names that are most likely to be mis-spelled are the ones that appear the most often in the collection!
  • Many other queries (e.g. Acorn Hill Childrens Center) will not appear in the collection at all. There may not even be any “close” matches with which to compare the algorithms — you should probably consider “thresholding” your output to help trim obvious spurious results. (For example, if the best edit distance is greater than some value, then there are no approximate matches according to the Edit Distance.) If you do this, you should discuss the threshold and its significance in your report; a sketch of this idea follows the list.
  • Even where there are close approximate matches, it may not be obvious whether the matches are correct or not. You will need to examine the results of a selection of queries (by hand), to determine (to the best of your ability) whether the approximate match is a relevant instance of the query string — there will probably be some subjectivity involved here [2]. You should attempt to:
    – Find some examples of good (relevant) approximate matches conjectured by each of your systems, as well as poor (irrelevant) matches;
    – Systematically compare the results of your systems over a selection of queries: of the approximate matches found, what proportion of results were actually relevant;
    – Hypothesise the reasons for the above points, and conclude which of the systems is more effective at finding approximate matches for locations;
    – Discuss the above in your report.
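
A thresholding step might look like the following sketch, where distance is any scoring function from the methods above and threshold is a hypothetical cut-off whose value (and effect on your results) you would need to choose and justify:

    def thresholded_matches(query, tweets, distance, threshold):
        # Yield only tweets whose score is within the threshold;
        # everything else is treated as having no approximate match.
        for tweet_id, text in tweets:
            score = distance(query, text)
            if score <= threshold:
                yield tweet_id, score

Counting, by hand, how many of the surviving matches are relevant then gives the proportion asked for in the second point above.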

Report

You will write a report detailing your analysis, which should aim to discuss:

  1. A basic description of the problem and data set;
  2. Some indication of how you dealt with queries of multiple words, and the advantages or disadvantages that it introduces to the string search problem;
  3. A comparison of the effectiveness of the approximate matching methods you chose, with illustrative examples, and some reasoning as to why this is the case;
  4. Possibly some discussion of the scalability of your methods to larger data collections;
  5. Some conclusions about the problem of searching for approximate matches to location names within a corpus of Twitter data.

[2] There is also the separate issue as to whether the approximate match is indeed a location — that problem is even more difficult, and you can gloss over it if you wish to.

The report should consist of about 750–1500 words. This is quite short: you will need to be concise, and you should use tables or graphs to present the data more compactly where appropriate. You should not discuss the technical details of your implementation, unless they are novel or crucial to your analysis of the methods; you can assume that the marker is familiar with the methods we have discussed in this subject. Overly long reports will be penalised.

Note that we will be looking for evidence that you have thought about the task and have determined reasons for the performance of the methods involved. A report that simply records data without corresponding analysis will not receive a strong mark.

You should include a bibliography and citations to relevant research papers. A good place to begin is probably:

Zobel, Justin and Philip Dart. (1996). Phonetic String Matching: Lessons from Information Retrieval. In Proceedings of the Nineteenth International ACM SIGIR Conference on Research and Development in Information Retrieval. Zürich, Switzerland. pp. 166–173.

Note that, in general, you should not cite Wikipedia. For more information, see the Caution in http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia and more generally http://en.wikipedia.org/wiki/Wikipedia:Academic_use.

We will post some sample reports on the LMS, for you to get a sense of style for your report. We will also make report templates available, in Rich Text Format (for MS Word or similar) and LaTeX; we would prefer that you use these templates when writing your report. Please do not include a title page, abstract, or table of contents. Your report must be submitted in PDF format. We reserve the right to return reports submitted in any other format with a mark of 0.

Submission

Submission will entail two parts:

  1. An archive which contains your code and a README file which briefly explains how to compile and run your submission (preferably as a Makefile or run.sh) and the location and format of the output
  2. Your written report, as a single file in Portable Document Format (PDF)

Details of how to submit will be given on the LMS.

Your software can be implemented in any programming language or combination of programming languages. There is no requirement that it runs on the CIS servers; please be considerate of other users if you run your system there — the task may be very memory-intensive.

Changes/Updates to the Project Specifications

If we require any (hopefully small-scale) changes or clarifications to the project specifications, they will be posted on the LMS. Any addenda will supersede information included in this document.

Academic Misconduct

For most people, collaboration will form a natural part of the undertaking of this project. However, it is still an individual task, and so reuse of code or excessive influence in algorithm choice and development will be considered cheating. We will be checking submissions for originality and will invoke the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html) where inappropriate levels of collusion or plagiarism are deemed to have taken place.