Named Entity Recognition
Due date: 2018-05-25
This assignment is worth 20% of your final assessment.
In this assignment, you will build and evaluate a named entity recognition (NER) system via sequence tagging, or you will perform a systematic comparison of existing NER approaches.
3.0.1 Option I: Implementation
• Implement the Viterbi algorithm for predicting the best tag sequence according to a learnt model;
• Use this to construct a Maximum Entropy Markov Model;
• Explore possible feature sets and perform experiments comparing them;
• Evaluate the performance of your system on English and German;
• Describe your experiments, results and analysis in a report.
3.0.2 Option II: Application
• Find existing NER systems and apply them to given text;
• Critically compare NER system descriptions;
• Systematically analyse and compare the errors made by NER systems;
• Describe your experiments, results and analysis in a report.
3.1.3 Data Set
The main dataset (eng) is a collection of Reuters newswire articles (1996–97) developed for the Conference on Computational Natural Language Learning (CoNLL) 2003 shared task. It uses the typical set of four entity types: person (PER), organisation (ORG), location (LOC) and miscellaneous (MISC). The second dataset (deu) is the German data from CoNLL 2003, which has an extra column (the second) holding the lemma (base form) of each word. Your system should be developed primarily for English, but also tested on German for comparison.
The prepared data can be downloaded from here. By downloading the data you agree 1) to only use it for this assignment, 2) to delete all copies after you are finished with the assignment, and 3) not to distribute it to anybody. The data is split into subsets:
• training (eng.train) – to be used for training your system;
• development (eng.testa) – to be used for development experiments;
• held-out test (eng.testb) – to be used only once features and algorithms are finalised.
For further information, see http://www.clips.ua.ac.be/conll2003/ner/.
Do not download and build the data set from the above URL. It does not include the text.
3.2.2 Training the MaxEnt model
To train the MaxEnt model (i.e. estimate the model weights in the equations above), calculate features for each token, including one or more tag-history features, and use standard MaxEnt / logistic regression training to learn the association between each token’s features and its class. That is, tokens and their context here take the place of documents and their content in Assignment 2.
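As a rough illustration only, the sketch below assumes scikit-learn as the logistic regression / MaxEnt implementation (any equivalent library is fine, provided you cite it); the token_features helper, its feature names and the <START> marker are illustrative assumptions, not part of the specification.

    # A minimal MaxEnt training sketch, assuming scikit-learn.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def token_features(sent, i, prev_tag):
        """Features for the i-th token of a sentence, including one tag-history feature."""
        word = sent[i]
        return {
            "word": word.lower(),
            "is_capitalised": word[0].isupper(),
            "suffix3": word[-3:].lower(),
            "prev_tag": prev_tag,              # tag-history feature
        }

    def train(sentences, tag_sequences):
        """Fit the classifier from tokenised sentences and their gold IOB tag
        sequences (as read from eng.train); gold previous tags provide the
        history at training time."""
        X, y = [], []
        for sent, tags in zip(sentences, tag_sequences):
            prev = "<START>"
            for i, tag in enumerate(tags):
                X.append(token_features(sent, i, prev))
                y.append(tag)
                prev = tag
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(X), y)
        return vec, clf

At prediction time the previous tag is not known in advance, which is exactly why the per-token probabilities (e.g. from predict_proba) are combined with Viterbi decoding rather than greedy tagging.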
3.3.1 Other Features
You should explore some other possible features for NER. For example, you might consider a few of the following questions:
• how could you capture the fact that words may be capitalised because they are names, but also due to other conventions (e.g. the start of a sentence, month names, etc.)?
• how could you handle infrequent words?
• how could you incorporate external lists of common locations or common person names?
• how could you capture common word types (e.g. how capitalisation and punctuation appear) in a general way?
• how could you capture inflectional and derivational morphemes?
• how could you model sentence-level information?
Hint: the lectures on sequence tagging and named entity recognition contain a larger list of example NER features.
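To make a few of these ideas concrete, here is a hedged sketch of some possible extra features (word shape, affixes, a gazetteer lookup and a rare-word back-off); the feature names, the frequency threshold and the source of the gazetteer are illustrative assumptions, not requirements.

    # A sketch of a few possible extra features, using the same feature-dict
    # representation as the training sketch above.
    import re

    def word_shape(word):
        """Coarse orthographic shape, e.g. 'Sydney2018' -> 'Xxd'."""
        shape = re.sub(r"[A-Z]", "X", word)
        shape = re.sub(r"[a-z]", "x", shape)
        shape = re.sub(r"[0-9]", "d", shape)
        return re.sub(r"(.)\1+", r"\1", shape)   # collapse repeated characters

    def extra_features(sent, i, vocab_counts, gazetteer):
        word = sent[i]
        feats = {
            "shape": word_shape(word),
            "prefix2": word[:2].lower(),
            "suffix2": word[-2:].lower(),
            # distinguish sentence-initial capitalisation from 'real' capitalisation
            "cap_not_first": word[0].isupper() and i > 0,
            "in_gazetteer": word in gazetteer,   # e.g. a list of common names or places
        }
        if vocab_counts.get(word, 0) < 2:        # back rare words off to a marker
            feats["rare_word"] = True
        return feats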
Another challenging alternative would be to make your system handle a two-word Viterbi history, rather than just the previous word.
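Below is a minimal first-order Viterbi sketch. It assumes a hypothetical prob(features, prev_tag) callable that wraps your trained MaxEnt model (for example, by adding the candidate previous tag to the feature dict and calling predict_proba); handling a longer history would enlarge the Viterbi state space accordingly.

    import math

    def viterbi(token_feature_dicts, tags, prob, start_tag="<START>"):
        """Most probable tag sequence for one sentence under an MEMM.

        token_feature_dicts: per-token feature dicts (without the history feature).
        tags: the NER tag set (B-PER, I-PER, ..., O).
        prob: callable(features, prev_tag) -> {tag: P(tag | features, prev_tag)}.
        """
        if not token_feature_dicts:
            return []
        n = len(token_feature_dicts)
        delta = [{} for _ in range(n)]   # best log-probability of a path ending in each tag
        back = [{} for _ in range(n)]    # backpointers

        first = prob(token_feature_dicts[0], start_tag)
        for t in tags:
            delta[0][t] = math.log(first.get(t, 1e-12))
            back[0][t] = None

        for i in range(1, n):
            # one distribution over current tags for each candidate previous tag
            dists = {p: prob(token_feature_dicts[i], p) for p in tags}
            for t in tags:
                best_p = max(tags, key=lambda p: delta[i - 1][p]
                             + math.log(dists[p].get(t, 1e-12)))
                delta[i][t] = delta[i - 1][best_p] + math.log(dists[best_p].get(t, 1e-12))
                back[i][t] = best_p

        tag = max(delta[-1], key=delta[-1].get)   # trace back from the best final tag
        path = [tag]
        for i in range(n - 1, 0, -1):
            tag = back[i][tag]
            path.append(tag)
        return list(reversed(path))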
3.4 Precision, Recall and F-score for Phrases
Precision, Recall and F-score are a family of metrics for comparing a predicted set of results against a gold-standard set. In NER, however, the relevant set is the set of extracted phrases (entity mentions), not merely token classifications.
Thus, calculating P, R and F for NER is not the same as just calculating the P, R and F of each token classification decision. For NER we take sets of (start offset, end offset, entity type) triples for each entity mention (by interpreting the IOB notation), and compare the sets between the predicted and gold standard annotations. (Token-wise performance would consider the set of (offset, class) pairs, but we’re not interested in that.)
For example, overall precision is the number of entity mentions (i.e. phrases) with start, end and type matched between predicted and gold annotations divided by the number of entity mentions predicted.
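For intuition only (the provided conlleval script, described below, remains the authoritative scorer), here is a sketch of that set comparison; it assumes a simple reading of the IOB tags in which a B- tag, an O, or a change of type closes the current mention.

    # Convert an IOB tag sequence to (start, end, type) spans and compare
    # predicted spans against gold spans.

    def iob_to_spans(tags):
        spans, start, etype = set(), None, None
        for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes the last span
            if tag.startswith("B-") or tag == "O" or \
               (tag.startswith("I-") and tag[2:] != etype):
                if etype is not None:
                    spans.add((start, i, etype))          # end offset is exclusive
                start, etype = (i, tag[2:]) if tag != "O" else (None, None)
            # a continuing I- tag of the same type simply extends the current span
        return spans

    def phrase_prf(gold_tags, pred_tags):
        gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
        correct = len(gold & pred)                        # start, end and type all match
        p = correct / len(pred) if pred else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f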
Along with the data set, we have provided the conlleval perl script for scoring your predicted tag sequences against the gold standard. This script expects data on stdin in the format described above. The second-to-last column should contain the gold standard NER tag and the last column should contain your predicted NER tag.
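A minimal sketch of producing such a file (here just the token, its gold tag and your predicted tag per line, with a blank line between sentences); the function and file names are placeholders.

    def write_conll_output(path, sentences, gold_tags, pred_tags):
        """Write one token per line: token, gold NER tag, then predicted NER tag."""
        with open(path, "w") as out:
            for sent, gold, pred in zip(sentences, gold_tags, pred_tags):
                for token, g, p in zip(sent, gold, pred):
                    out.write(f"{token} {g} {p}\n")
                out.write("\n")                           # sentence boundary

The resulting file can then be fed to the script on stdin, for example as perl conlleval < eng.testa.pred (adjust to however the script is invoked in your environment).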
3.5 Option I: Implementation
The assessment for Option I is about how well you can:
• implement Viterbi on top of probabilistic MaxEnt predictions;
• implement and clearly/concisely describe basic features;
• implement and clearly/concisely describe additional features;
• devise and clearly/concisely describe sound experiments;
• devise and clearly/concisely describe insightful error analysis.
The assessment is not about the quality of your code. However, your code may be consulted to verify that it is original and consistent with your report.
3.5.1 Implementation
You are free to use a programming language of your choice to implement the assignment. You may use any implementation of logistic regression / maximum entropy with citation.
3.5.2 Report
The report should describe which features you included, and identify which types of features were most important for your classifier’s accuracy. It should also characterise the kinds of errors the system made.
The report should not focus on your implementation of the Viterbi algorithm.
Features:
• What features does your tagger implement?
• Why are they important and what information do you expect them to capture?
• Are these new features or can you attribute them to the literature?
Experiments / results:
• Which features are most important?
e.g., performance of different feature combinations
e.g., subtractive analysis for many feature subsets
• What is the impact of sequence modelling?
e.g., performance with and without Viterbi
• How does your system perform with respect to a simple baseline?
e.g., majority class
e.g., classification using only obvious features, such as the words themselves
• How does your system perform with respect to the best results reported at the shared task?
• How does your system perform training and testing on German?
• Which of your features are language-independent?
Error analysis:
• What are the typical tag confusions made by your system?
• What are the characteristic errors made by your system?
e.g., manually characterise the errors on a sample of your predictions
e.g., hypothesise and/or implement some possible solutions
• Are there problems or inconsistencies in the data set that affect your system’s performance?
• You only need to do error analysis on English (unless you speak German!)
Although in general the choice of how to present your results is up to you, you must include micro-averaged Precision, Recall and F-measure statistics. Development experiments should use the development data subset, and final results should be reported on the held-out test set.
3.6 Option II: Application
The assessment for Option II is about how well you can:
• apply existing NER systems;
• clearly/concisely demonstrate understanding of features and models;
• devise and clearly/concisely describe sound experiments for analysing errors;
• report an insightful analysis of errors.
3.6.1 Task: NER System Comparison
Compare two publicly available NER systems to each other and to one of the CoNLL 2003 submissions.
1. Download a recent Named Entity Recognition system that you can train, or which includes a CoNLL 2003-style English NER model.
• A system is recent enough if it has been developed or improved since 2008.
• If using a pre-trained model, make sure that its outputs are using the same style of IOB notation as in our version of the CoNLL data.
• Make sure you can find a technical description of the system, which you can use for step 6 below.
• The system should be trained on (at least) the CoNLL 2003 eng.train data provided above.
• The system should not be trained on eng.testb.
2. Use that model to label CoNLL’s eng.testb data.
3. Use the conlleval script to quantify performance of the system.
4. Repeat from step 1 for a second system.
5. Select one of the top-performing systems from CoNLL 2003 (FIJZ, KSNM, ZJ, CMP and MMP all did well on both English and German). Read its system description paper from the CoNLL 2003 proceedings, and download its output. Output for CoNLL 2003 shared task systems can be found on the task web page (http://www.clips.ua.ac.be/conll2003/ner/) under the CoNLL-2003 Shared Task Papers heading. Follow the system output links.
6. Describe and compare the three system descriptions, focusing on: How does each incorporate the information needed to perform NER well? What are the key differences in how they represent context?
7. Systematically analyse the errors made by each system on eng.testb, to answer: What kinds of errors do both systems make? What kinds of errors does one system make, but the others do not?
3.6.2 Report Structure
Your report should contain the following sections:
Systems Which systems are being analysed, and what publications are they described in? Summarise and critically compare the approaches. How are their approaches to NER similar or different? Do they capture the same or different contextual cues (e.g. features)? Do they model the same kinds of information in a different way?
Results Use conlleval to report performance on eng.testb and remark briefly on key results.
Method How did you systematically compare the outputs of different systems? (Don’t give results here.)
Analysis Report your analysis of kinds of error. Think broadly when considering kinds of errors, and make sure you refer to examples from the data. The quantitative analysis gives precision and recall for each class at the level of matched phrases. It does not, however, tell you whether there were boundary discrepancies (where the tag was agreed, but a word was included or excluded relative to the other system or gold standard), or whether there were tag discrepancies despite the mention start and end being agreed. It does not tell you which entity types or token-level tags were confused by a system, nor whether some part-of-speech sequences are particularly troublesome. Nothman et al. (EACL 2009), sections 3–4, might provide some ideas about systematic comparisons of NER taggings, though your approach need not be as sophisticated.
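One possible, purely illustrative, way to start such a systematic comparison is sketched below; it assumes each system’s mentions have already been collected as a set of (start, end, type) triples (end offset exclusive), and splits its disagreements with the gold standard into type confusions, boundary discrepancies and wholly spurious or missed mentions.

    def categorise_errors(gold, pred):
        """Bucket the disagreements between predicted and gold mention sets."""
        errors = {"type": [], "boundary": [], "spurious": [], "missed": []}
        for span in pred - gold:
            start, end, etype = span
            if any(g[:2] == (start, end) for g in gold):
                errors["type"].append(span)          # same boundaries, wrong type
            elif any(g[2] == etype and g[0] < end and start < g[1] for g in gold):
                errors["boundary"].append(span)      # overlapping span, same type
            else:
                errors["spurious"].append(span)      # no corresponding gold mention
        for span in gold - pred:
            if not any(p[0] < span[1] and span[0] < p[1] for p in pred):
                errors["missed"].append(span)        # gold mention with no overlap at all
        return errors

Cross-tabulating these buckets with entity type for each system, and inspecting the mentions where the two systems disagree with each other, gives concrete material for the error discussion.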
Discussion Consider at least one of the following questions, or your own. Please write the question you are answering. You do not need to answer all questions; it is better to answer one well:
• Are some kinds of errors more problematic than others in application?
• Can you explain differences in the systems’ errors in terms of how their models (features etc.) differ?
• Are there problems or inconsistencies in the data set that affect your system’s performance?