Lab 7: Named entity recognition with the structured perceptron
Andreas Vlachos
The goal of this lab session is to learn a named entity recognizer (NER) using the structured perceptron. The named entity recognizer will need to predict one of the following labels for each word:
- O: not a named entity
- PER: part of a person’s name
- LOC: part of a location’s name
- ORG: part of an organization’s name
- MISC: part of a name of a different type (miscellaneous)
The training and the test data are available from this folder (use your university Google ID to access it): https://drive.google.com/drive/folders/1JOvM7GHF15tW5ufwWlBjA7NAhOJfOkKF?usp=sharing. The data consists of sentences of up to 5 words in length, drawn from the data used in this shared task on NER: https://www.clips.uantwerpen.be/conll2003/ner/. The tags in that task distinguish between tokens starting an entity (e.g. B-PER) and those continuing one (e.g. I-PER); we ignore this distinction for the purposes of this assignment.
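As a starting point, a loader might look like the sketch below. It assumes each non-blank line holds a token first and its tag last (separated by whitespace), with blank lines delimiting sentences; check the downloaded files and adjust if the actual format differs.

```python
# Sketch of a data loader: assumes one token per line with its tag as the
# last whitespace-separated field, and blank lines between sentences.
def load_dataset(path):
    sentences, words, labels = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if words:
                    sentences.append((words, labels))
                    words, labels = [], []
                continue
            parts = line.split()
            token, tag = parts[0], parts[-1]
            if tag.startswith(('B-', 'I-')):
                tag = tag[2:]  # collapse B-PER / I-PER into PER, per the brief
            words.append(token)
            labels.append(tag)
    if words:  # catch a final sentence without a trailing blank line
        sentences.append((words, labels))
    return sentences
```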
After downloading the data, implement, train, and evaluate a structured perceptron with the following features:
- current word-current label
- current word-current label and previous label-current label
- as above, but adding at least two more features of your choice. Ideas: sub-word features, previous/next words, label trigrams, etc. (one possible arrangement of these templates is sketched after this list)
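A minimal sketch of how the three feature sets could be organised as templates, assuming feature vectors are represented as Counter objects keyed by tuples; the suffix and previous-word templates below are merely examples of the "two more features of your choice":

```python
from collections import Counter

def phi(words, labels, version):
    """Feature counts for a sentence/label-sequence pair.
    version 1: current word-current label
    version 2: adds previous label-current label bigrams
    version 3: adds two illustrative extra templates"""
    feats = Counter()
    prev = '<s>'  # pseudo-label marking the sentence start
    for i, (w, l) in enumerate(zip(words, labels)):
        feats[('word', w, l)] += 1
        if version >= 2:
            feats[('bigram', prev, l)] += 1
        if version >= 3:
            feats[('suffix3', w[-3:], l)] += 1  # sub-word feature
            feats[('prevword', words[i - 1] if i else '<s>', l)] += 1  # context
        prev = l
    return feats
```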
Your report should discuss the following:
- What are the most positively-weighted features for each version of the perceptron? Give the top 10 for each class and comment on whether they make sense. (If they don't, you might have a bug.) A helper for extracting them is sketched after this list.
- Are the differences in micro-F1 score among the three versions what you expected? Did the features you proposed improve the results? Why?
- Can you propose other features beyond the ones you implemented?
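To extract the top-weighted features per class, you can filter and sort the learned weight dictionary; a sketch assuming the tuple-keyed features from the phi sketch above:

```python
def top_features(weights, label, n=10):
    # Assumes tuple-valued feature keys whose last element is the label,
    # as in the phi sketch above.
    scored = [(f, w) for f, w in weights.items() if f[-1] == label]
    return sorted(scored, key=lambda fw: fw[1], reverse=True)[:n]
```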
In implementing the above, you are advised to use multiple passes over the training data in randomized order, with weight averaging (use Python's random library to fix the random seed so that your results are reproducible!); a sketch of such a training loop follows the evaluation snippet below. As the dataset is imbalanced (most words are not part of a name), the evaluation metric to use is the micro-averaged F1 score from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html. You should call it in your code as:
f1_micro = f1_score(correct, predicted, average='micro', labels=['ORG', 'MISC', 'PER', 'LOC'])
where correct and predicted are arrays containing the correct and the predicted labels.
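Building on the phi sketch above, here is an illustrative training loop with shuffling, a fixed seed, and (naive) weight averaging; all names are hypothetical. Because sentences here are at most 5 tokens long, it decodes by brute force over all label sequences; Viterbi is the standard, more scalable alternative, and the lazy-update trick makes averaging much faster.

```python
import random
from itertools import product
from collections import Counter

LABELS = ['O', 'PER', 'LOC', 'ORG', 'MISC']

def predict(words, weights, version):
    # Brute-force argmax over all label sequences: feasible only because
    # sentences have at most 5 tokens (5**5 = 3125 candidates); use
    # Viterbi for longer inputs.
    best, best_score = None, float('-inf')
    for labels in product(LABELS, repeat=len(words)):
        score = sum(weights[f] * c
                    for f, c in phi(words, labels, version).items())
        if score > best_score:
            best, best_score = list(labels), score
    return best

def train(data, version, epochs=10, seed=1):
    random.seed(seed)         # fixed seed => reproducible shuffles
    weights, totals = Counter(), Counter()
    seen = 0
    for _ in range(epochs):
        random.shuffle(data)  # randomized order on every pass
        for words, gold in data:
            pred = predict(words, weights, version)
            if pred != gold:
                # Standard perceptron update: + gold features, - predicted
                delta = phi(words, gold, version)
                delta.subtract(phi(words, pred, version))
                for f, c in delta.items():
                    weights[f] += c
            seen += 1
            for f, w in weights.items():  # naive averaging; the lazy-update
                totals[f] += w            # trick is much faster in practice
    return Counter({f: w / seen for f, w in totals.items()})
```

At test time, run predict over each test sentence with the averaged weights and flatten the per-token labels into single correct/predicted lists before calling f1_score as above.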
You should submit a Python file (lab7.py) that can be executed as:
python3 lab7.py train.txt test.txt
and prints the micro-F1 for the three versions of the structured perceptron, together with the top 10 most positive features for each.
You are advised to have separate train and test functions, and you should comment your code to explain it. You also need to accompany it with a lab7.pdf (no more than two A4 pages) answering the questions above. Make sure your code is Python 3 compatible. There is no single correct score to aim for, but correct implementations of the structured perceptron with averaging, 10 passes, and the current word-current label and previous label-current label features usually achieve around 70% micro-F1. The quality of the analysis in your report is as important as the score itself.
This lab will be marked out of 6, 3 for the code and 3 for the report. It is worth 6.8% of your final grade in the module.
The deadline for this assignment is the beginning of the next lab and it needs to be submitted via MOLE. Standard departmental penalties for lateness will be applied.