Class project: Complex Word Identification
Andreas Vlachos and Fernando E Alva Manchego
The goal of the lab sessions in weeks 3, 5, 6, 8 and 10 is to develop a system for the Complex Word Identification shared task https://sites.google.com/view/cwisharedtask2018/. The demonstrators and I will be available during the lab sessions to help you if you get stuck, as well as to discuss ideas.
The data and all information about the task are available from this webpage: https://github.com/sheffieldnlp/cwisharedtask2018-teaching. HINT1: If the files get large, use your non-backed-up scratch space for anything that takes up space and is easy to regenerate. HINT2: Use your backed-up space for everything else!
For this assignment you are welcome to use any software or toolkit that you find useful. Note that initially you will only have access to the training and the development data. The test data will only be released to you in the last lab session, in week 10.
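As a starting point for working with the training and development data, here is a minimal loading sketch. It assumes the files are tab-separated with eleven columns (sentence in column 2, character offsets in columns 3 and 4, target word in column 5, binary gold label in column 10); check this against the task webpage before relying on it. The names `Instance` and `read_instances` and the two sample lines are made up for illustration.

```python
import csv
import io
from typing import List, NamedTuple


class Instance(NamedTuple):
    sentence: str
    start: int   # character offset where the target begins
    end: int     # character offset where the target ends (exclusive)
    target: str
    label: int   # 1 = complex, 0 = simple (assumed binary gold label)


def read_instances(handle) -> List[Instance]:
    """Parse tab-separated lines under the ASSUMED 11-column layout."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    instances = []
    for row in reader:
        if len(row) < 10:
            continue  # skip blank or malformed lines
        instances.append(
            Instance(row[1], int(row[2]), int(row[3]), row[4], int(row[9]))
        )
    return instances


# Two invented lines in the assumed format (not taken from the real data).
sample = (
    "ID1\tThe cat sat on the mat.\t4\t7\tcat\t10\t10\t0\t0\t0\t0.0\n"
    "ID2\tThe ubiquitous cat sat.\t4\t14\tubiquitous\t10\t10\t6\t7\t1\t0.65\n"
)
instances = read_instances(io.StringIO(sample))
```

With a real file you would pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample. Checking that `sentence[start:end]` equals `target` is a cheap sanity test for the assumed column order.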
The assessment (14.8% of the total grade) will be based on the report, which should be 4 pages long (excluding bibliographic references) and should address the following questions:
- Introduction: What is the task and why is it important?
- Baseline system description: Describe your baseline models in enough detail for the reader to be able to reimplement them and to appreciate why they are suitable for the task at hand.
- Improved system motivation and description.
- Experiments on the development set: Does your idea work as expected? Evaluate both the baseline and the improved system on the test set: does the improvement still hold? Identify examples in the development data that help showcase why the improved system works better.
- Plot learning curves for the trainable systems you experiment with. Are some systems better than others when less training data is available?
- Identify examples where your improved system fails to predict correctly and propose ideas for future work to address them.
- Conclusions: What have we learnt from your experiments that could inform future work?
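For the learning curves mentioned above, one possible procedure is to train on increasingly large portions of the training data and score each model on the development set. The sketch below is only an illustration: the word-length threshold classifier is a toy stand-in for whatever baseline you choose, macro-averaged F1 is an assumed metric (use whichever the task specifies), and the function names and word lists are invented.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (target word, binary label: 1 = complex)


def fit_length_threshold(train: List[Example]) -> int:
    """Pick the word-length threshold that maximises training accuracy."""
    best_t, best_correct = 0, -1
    max_len = max(len(word) for word, _ in train)
    for t in range(max_len + 2):
        correct = sum(int(len(word) >= t) == label for word, label in train)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def predict(threshold: int, words: List[str]) -> List[int]:
    """Label a word as complex (1) if it is at least `threshold` long."""
    return [int(len(word) >= threshold) for word in words]


def macro_f1(gold: List[int], pred: List[int]) -> float:
    """Unweighted mean of the F1 scores of the two classes."""
    scores = []
    for cls in (0, 1):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def learning_curve(train, dev, fractions, seed=0):
    """Return (training size, dev macro-F1) for each training fraction."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    dev_words = [word for word, _ in dev]
    dev_gold = [label for _, label in dev]
    curve = []
    for fraction in fractions:
        n = max(1, int(len(shuffled) * fraction))
        threshold = fit_length_threshold(shuffled[:n])
        curve.append((n, macro_f1(dev_gold, predict(threshold, dev_words))))
    return curve


# Invented toy data, only to show the call pattern.
train = [("cat", 0), ("dog", 0), ("house", 0), ("tree", 0),
         ("ubiquitous", 1), ("ephemeral", 1), ("conundrum", 1), ("paradigm", 1)]
dev = [("sun", 0), ("meticulous", 1), ("book", 0), ("anomaly", 1)]
curve = learning_curve(train, dev, [0.25, 0.5, 1.0])
```

The resulting `(size, score)` pairs can be passed to any plotting tool. Running the same loop for each system on the same shuffled subsets makes the curves directly comparable, and averaging over several seeds reduces noise at small training sizes.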
Your final report should be written using the LaTeX or Word conference paper templates available at http://acl2018.org/downloads/acl18-latex.zip and at http://acl2018.org/downloads/acl18-word.zip respectively. If you like or want to learn LaTeX, I strongly recommend that you use the online editing platform Overleaf, which already has a template for it: https://www.overleaf.com/latex/templates/instructions-for-acl-2018-proceedings/xzmhqgnmkppc#.WrPylnXFK6o
The code, while not evaluated per se, should also be made available (on GitHub) together with instructions on how to reproduce the results reported in the paper.
Don't leave the writing to the last minute; it would be a shame to achieve good results but have no time to write them up. I recommend the following time plan:
- Weeks 3, 5 and 6: develop a baseline system and evaluate it on the dev data. Within a week after the session of week 6 (i.e. before the lab of week 7), you need to submit a short report (1 A4 page in PDF) to me (via e-mail) describing what you have done so far. This will be graded to give you an indication of how good your progress is, but the grade will not be included in the final grade calculation. Its purpose is to give you feedback and help as needed.
- Weeks 8 and 10: develop an improved system and evaluate it on the dev and the test data.
- Weeks 11-12: write your report and submit it via MOLE.
The deadline is midnight (UK time) on the Friday of week 12. The assignment will be graded out of 15 points: 10 for the quality of the report and 5 for the accuracy of the system developed in the shared task. While an accurate system matters, it is twice as important to have a well-written report that justifies the choices made in its development, documents its results and gives insights that would be useful to researchers working on this task.