Document Analysis Assignment 5: Information Extraction
Your Information
Please fill in the following information:
Name: [Your name]
Uni id: [Your uid]
Overview
In this assignment, the task is to code a Named Entity Recognizer (NER) application in Python using the CRFsuite library.
To complete this task, follow the tutorial NamedEntityExtraction.ipynb and the instructions in ie-assignment-instructions.ipynb, both posted on Wattle.
The following items summarise the assignment tasks:
Build an NER classifier following the tutorial.
Write a Python NER application that uses your classifier.
Submit your results to Kaggle.
Answer three written questions.
We will check the correctness of your code, but the programming part will be graded on your performance in the Kaggle competition.
Write your code after ### Your code here, and remove raise NotImplementedError after implementation.
Written assignments should be written in the given notebook cells. Please write them directly into the designated cells, and upload the notebook file to the Wattle page.
Write your answers in this notebook file, and upload the file to the Wattle submission site. Please rename the Jupyter notebook file (Assignment5.ipynb) to your_uid.ipynb (e.g. u6000001.ipynb) and submit it with your written answers therein. Do not upload any files to Wattle other than this notebook file.
For the Kaggle competition
Join the competition here.
Before submitting your results, first go to the team menu and change your team name to your university ID.
You need to upload the generated result file to Kaggle.
Note that you are only allowed to upload 5 copies of your results to Kaggle per day. Make every upload count, and don’t waste your opportunities!
You should use cross-validation instead of relying on the public set – this is what the daily limit is for!
Note: you need to fill in the cells below with your code. Failure to provide your code nullifies your Kaggle grade (meaning you get zero for the coding part).
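The cross-validation advice above can be sketched in a few lines. This is only a minimal illustration using the standard library: the function name `k_fold_splits`, the choice of 5 folds, and the placeholder `sentences` list are assumptions made for the example, not part of the tutorial code.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical data: `sentences` stands in for your labelled training sentences.
sentences = [f"sentence {i}" for i in range(10)]
for train_idx, val_idx in k_fold_splits(len(sentences), k=5):
    # Train on the sentences at train_idx, score on those at val_idx,
    # then average the validation scores over all folds.
    print(len(train_idx), len(val_idx))
```

Averaging the validation score over the folds gives an estimate of how a model generalises without spending one of the five daily Kaggle uploads.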
1. Build an NER model (4 points)
You can use the code provided in the tutorial sheet.
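As a rough reminder of what a CRF-based tagger consumes, the sketch below turns each token of a sentence into a dictionary of features. The function names and the particular features are assumptions chosen for illustration; the tutorial's own feature set and input format may differ, so treat this as a starting point rather than the expected solution.

```python
def word2features(sentence, i):
    """Build a feature dict for the token at position i of a tokenised sentence."""
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalised words are often names
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features: neighbouring tokens, or sentence-boundary markers.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence2features(sentence):
    """Feature dicts for every token in the sentence, in order."""
    return [word2features(sentence, i) for i in range(len(sentence))]


# Hypothetical example sentence, already tokenised.
example = ["Canberra", "is", "in", "Australia"]
features = sentence2features(example)
print(features[0])
```

Each sentence becomes a list of feature dictionaries paired with a list of tags (B-LOC, I-PER, O, and so on), which is the general shape of training data for a sequence labeller.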
In [ ]:
### Your code here
The output of the above cell should look something like this (ignore the numbers):
             precision    recall  f1-score   support
      B-LOC       0.68      0.47      0.55      1084
      I-LOC       0.52      0.25      0.34       325
     B-MISC       0.54      0.11      0.19       339
     I-MISC       0.54      0.22      0.32       557
      B-ORG       0.76      0.51      0.61      1400
      I-ORG       0.67      0.44      0.53      1104
      B-PER       0.73      0.68      0.71       735
      I-PER       0.78      0.82      0.80       634
avg / total       0.68      0.48      0.55      6178
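To make the columns of such a report concrete, here is a small sketch of how precision, recall, F1 and support can be computed for one tag from flat lists of gold and predicted tags. The function name `per_label_scores` and the toy tag sequences are invented for this example; the tutorial most likely relies on a library routine to produce its report.

```python
def per_label_scores(gold, predicted, label):
    """Return (precision, recall, f1, support) for one tag, counted per token."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    support = tp + fn  # number of gold tokens carrying this tag
    return precision, recall, f1, support


# Toy sequences, invented for illustration only.
gold = ["B-LOC", "O", "B-PER", "I-PER", "B-LOC", "O"]
pred = ["B-LOC", "O", "B-PER", "O", "O", "B-LOC"]
print(per_label_scores(gold, pred, "B-LOC"))  # (0.5, 0.5, 0.5, 2)
```

In this toy case B-LOC is predicted twice with one hit and appears twice in the gold tags with one hit, so precision, recall and F1 are all 0.5 with a support of 2.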
2. Generate your result file for Kaggle
In [1]:
### Your code here
3. Written part (6 pts)
Answer the following questions briefly and concisely.
Provide answers as a bullet list with 2 to 3 items.
Check this if you are not familiar with Markdown syntax.
Each question is worth 2 points (20% of the total grade for the IE assignment).
Question 1 (2 pts)
Think about three relevant baselines for the Named Entity Classification task.
Provide your answer as a bullet list with 3 items. Give a short description of each baseline.
Your answer here
Question 2 (2 pts)
How does Maximal Marginal Relevance (MMR) address redundancy issues? (1 point)
How can you tell MMR that “Sydney” and “Melbourne” are cities? (0.5 points)
How can you tell MMR that “solar panels” and “photovoltaic cells” have similar meaning? (0.5 points)
Your answer here
Question 3 (2 pts)
Imagine you are developing an extractive text summarization tool using an HMM.
What are the hidden states and the observations of the HMM?
Which algorithm is used to compute the probability of a particular observation sequence?
Your answer here