
Deadline + Late Penalty¶
$\textbf{Note:}$ This project will take you quite some time to complete; therefore, we earnestly recommend that you start working as early as possible. You should read the specs carefully at least 2-3 times before you start coding.
• $\textbf{Submission deadline for the Project (Part-2) is 20:59:59 (08:59:59 PM) on 18th Nov, 2019}$
• $\textbf{LATE PENALTY: 10% on day-1 and 20% on each subsequent day.}$

Instructions¶
1. This notebook contains instructions for $\textbf{COMP6714-Project (Part-2)}$. We have already released the instructions for $\textbf{Part-1 of the Project}$ in a separate notebook.

2. You are required to complete your implementation for part-2 in the file project_part2.py provided along with this notebook. Please $\textbf{DO NOT ALTER}$ the name of the file.

3. You are not allowed to print out unnecessary stuff. We will not consider any output printed out on the screen. All results should be returned in appropriate data structures via corresponding functions.

4. You can submit your implementation for Project (Part-2) via submission system: http://kg.cse.unsw.edu.au/submit/ . We have already sent out the invitations for you to join the submission system. In case of problems please post your request @ Piazza.

5. For each question, we have provided you with detailed instructions along with question headings. In case of problems, you can post your query @ Piazza.

6. You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. Only functions are allowed in project_part2.py

7. You should not import unnecessary and non-standard modules/libraries. Loading such libraries at test time will lead to errors and hence ZERO score for your project. If you are not sure, please ask @ Piazza.

8. We will provide immediate feedback on your submission. You can access your scores using the online submission portal on the same day.

9. For the Final Evaluation, we will be using different data sets, so your final scores may vary.

10. You are allowed a limited number of Feedback Attempts $\textbf{(15 Attempts for each student)}$; we will use your LAST submission for the Final Evaluation.

Allowed Libraries:¶
You are required to write your implementation for the project (part-2) using Python 3.6.5. You are only allowed to use the following python libraries:
• $\textbf{spacy (v2.1.8)}$
• $\textbf{XGBoost (v0.90)}$

Q2: Named Entity Disambiguation/Named Entity Linking (20 Points)¶

For $Q2$, you are required to use the experience gained in $Q1$ (from project-part1) to solve a well-known problem in natural language processing, i.e., Named Entity Disambiguation (NED), aka Named Entity Linking (NEL). It aims at assigning unique identities (i.e., entities such as persons, locations, organizations, etc.) to the mentions (i.e., substrings/spans of a sentence that refer to an entity) identified in the text.
For example, consider the sentence: Olympia is the capital of Washington. The mention Washington refers to the entity Washington (state) (https://en.wikipedia.org/wiki/Washington_(state)), rather than other possible entities with similar names, such as: (i) The Washington Post (an American daily newspaper) (https://en.wikipedia.org/wiki/The_Washington_Post), (ii) George Washington (the first U.S. president) (https://en.wikipedia.org/wiki/George_Washington), etc.
For this project, we provide you with text documents in which the mentions are pre-identified, along with a list of possible candidate entities for each mention. Your task is to come up with a learning-to-rank model in order to disambiguate each mention, i.e., map the mention to the correct entity.
Inputs:¶
Inputs to your model are as follows. The file formats are explained in the next sub-section.
1. $men\_docs.pickle$ 
A python dictionary of the documents with mentions pre-identified.
2. $parsed\_candidate\_entities.pickle$ 
A dictionary containing textual description for each candidate entity (pages from Wikipedia). Note that we have already pre-processed and parsed the candidate entity pages for you.
3. $train.pickle$ 
Training data.
4. $train\_labels.pickle$ 
Labels corresponding to the training data.
5. $dev.pickle$ 
Development data. Note: we will use this $dev$ data to provide Feedback for $Q2$. For final evaluation, we will be using completely different $test$ data sets.
6. $dev\_labels.pickle$ 
Labels corresponding to the dev data.

File Formats¶
1. $men\_docs.pickle$¶
It is a python dictionary containing the documents pertaining to the mentions, with $key:$ document title, $value:$ document text. Each document contains free text.
2. $parsed\_candidate\_entities.pickle$¶
It is a python dictionary storing the Wikipedia description pages for each candidate entity with:
• $key:$ entity name,
• $value:$ text corresponding to the entity’s description. We use the entity’s wikipedia page to capture its description. Parsing large text files may take considerable time, so we provide you with $spacy$’s parsing results. We consider each document as a large paragraph and store the parsing results as a list of tuples, with each tuple of the form:
$(id, token, lemma, pos\text{-}tag, entity\text{-}tag)$, where 

• $id$: corresponds to a unique token id in the paragraph 

• $token$: corresponds to the original text token 

• $lemma$: corresponds to the token’s lemma 

• $pos\text{-}tag$: is the token’s part-of-the-speech tag 

• $entity\text{-}tag$: is the entity tag detected by spacy.

For a detailed description of spacy’s parsing results, please check the following link: https://spacy.io/usage/linguistic-features
A small subset of the parsed text for the entity Cartoon_Network_Nordic from the file $parsed\_candidate\_entities.pickle$ is shown below:
[(1, 'Cartoon', 'Cartoon', 'PROPN', 'B-ORG'), (2, 'Network', 'Network', 'PROPN', 'I-ORG'), (3, 'television', 'television', 'NOUN', 'O'), (4, 'channel', 'channel', 'NOUN', 'O'), (5, 'broadcasting', 'broadcast', 'VERB', 'O'), (6, 'youth', 'youth', 'NOUN', 'O'), (7, 'children', 'child', 'NOUN', 'O'), (8, 'programmes', 'programme', 'NOUN', 'O'), (9, 'Sweden', 'Sweden', 'PROPN', 'B-GPE'), (10, 'Norway', 'Norway', 'PROPN', 'B-GPE')]
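As an illustration (this is not part of the provided test code), a minimal sketch of how these tuples can be turned into a per-entity bag of lemmas, which may later feed TF-IDF style features; the helper name and the counting choice are illustrative assumptions, not requirements:

import pickle
from collections import Counter

# Load the parsed candidate entity pages (same path as used in the test code below).
with open('./Data/parsed_candidate_entities.pickle', 'rb') as f:
    parsed_entity_pages = pickle.load(f)

def entity_lemma_counts(entity_name):
    # Each parsed token is a tuple: (id, token, lemma, pos-tag, entity-tag).
    counts = Counter()
    for _id, _token, lemma, _pos, _ent in parsed_entity_pages[entity_name]:
        counts[lemma.lower()] += 1
    return counts

# e.g., entity_lemma_counts('Cartoon_Network_Nordic').most_common(5)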

3. $train.pickle$¶
A python dictionary containing the training data. It consists of the following fields: 

• $key:$ A unique integer containing the mention_id,
• $value:$ A dictionary containing the mention’s description as following $key\text{-}value$ pairs:
▪ $doc\_title:$ Title of the document containing the mention. It will be one of the keys of the dictionary: $men\_docs.pickle$ 

▪ $mention:$ Token span within the document $doc\_title$ indicating the mention.

▪ $offset:$ Mention’s offset position in the document $doc\_title$

▪ $length:$ Length of the mention’s tokens

▪ $candidate\_entities:$ A list of candidate entities corresponding to the mention. Each entity candidate corresponds to a key in the file: $parsed\_candidate\_entities.pickle$ 


An example mention from the file $train.pickle$ is shown below:
{1: {'doc_title': '1_GOLF',                                        ## Mention's document title
     'mention': 'PGA Tour',                                        ## Mention's tokens
     'offset': 2046,                                               ## Mention's offset position in the document
     'length': 8,                                                  ## Length of mention tokens
     'candidate_entities': ['Professional_Golfers_Association',    ## Candidate Entities for the mention
                            'PGA_Tour',
                            'Golf_Channel_on_NBC',
                            '2009_PGA_Tour',
                            '2011_PGA_Tour',
                            '2008_PGA_Tour',
                            'PGA_Tour_on_CBS']}}
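In the example above, the mention 'PGA Tour' has length 8, which suggests that offset and length are character positions in the document text. Assuming this holds, a small sketch for recovering the mention span and a surrounding context window from men_docs (the function name and window size are illustrative only):

def mention_context(mention_info, men_docs, window=200):
    # Assumes 'offset'/'length' are character positions, consistent with the
    # example above ('PGA Tour' -> length 8); verify this on the actual data.
    doc_text = men_docs[mention_info['doc_title']]
    start = mention_info['offset']
    end = start + mention_info['length']
    span = doc_text[start:end]                  # should match mention_info['mention']
    context = doc_text[max(0, start - window):start] + ' ' + doc_text[end:end + window]
    return span, context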

4. $train\_labels.pickle$¶
A python dictionary containing the labels corresponding to the training mentions. It consists of the following fields: 

• $key:$ A unique integer containing the mention_id,
• $value:$ A dictionary containing the following key-value pairs:
1. $doc\_title:$ Title of the document containing the mention. It will be one of the keys of the dictionary: $men\_docs.pickle$ 

2. $mention:$ Token span within the document $doc\_title$ indicating the mention.

3. $label:$ Mention’s Ground Truth Entity Label. It also corresponds to a key in the file: $parsed\_candidate\_entities.pickle$ 


$\textbf{Note:}$ for each mention, we use the same mention_id in both files: (i) $train.pickle$, and (ii) $train\_labels.pickle$.

An example from the file $train\_labels.pickle$ is shown as follows:
{1: {'doc_title': '1_GOLF',      ### Mention's document title
     'mention': 'PGA Tour',      ### Mention's tokens
     'label': 'PGA_Tour'}}       ### Mention's True Entity Label
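Since $train.pickle$ and $train\_labels.pickle$ share the same mention_id keys, each training mention can be expanded into one row per candidate entity, with relevance 1 for the ground-truth entity and 0 otherwise; the resulting group sizes (number of candidates per mention) are what a pairwise ranking objective needs. A minimal sketch, where extract_features is a hypothetical helper you would implement yourself (it is not part of the provided files):

def build_training_rows(train_mentions, train_labels, extract_features):
    # extract_features(mention_info, candidate) is assumed to return a list of numbers.
    X, y, groups = [], [], []
    for mid, mention_info in train_mentions.items():
        gold = train_labels[mid]['label']
        candidates = mention_info['candidate_entities']
        for cand in candidates:
            X.append(extract_features(mention_info, cand))
            y.append(1 if cand == gold else 0)     # relevance label per candidate
        groups.append(len(candidates))             # one group per mention
    return X, y, groups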

5. $dev.pickle$¶
It follows the same format as that of the file: $train.pickle$
6. $dev\_labels.pickle$¶
It follows the same format as that of the file: $train\_labels.pickle$
$\textbf{Note:}$ The dev data set is meant to provide the Project Feedback and facilitate your implementation. For final evaluation, we will be using a totally different test data set.

$\textbf{TASK:}$
Given a document $men\_doc = [w_1, w_2, \ldots, w_Q]$; a mention span $\{m_i\}$ within the document; and a collection of candidate entities for each mention along with the corresponding entity description pages: $\{e_i\}_{i=1}^{n}$. You are required to use:

1. Mention
2. Mention’s document (i.e., men_doc)
3. Entity description page for each candidate entity.
to come up with a learning-to-rank model to rank the candidate entities corresponding to each mention in such a way that the Ground Truth Entity is ranked higher than the false candidates. You are only allowed to:
• Use the $XGBoost$ classifier to build your learning-to-rank model.
$\textbf{HINTS:}$
1. As a baseline model, you can use your experience gained in the Project (Part-1) to compute the TF-IDF statistics for words appearing in the mention and/or entity description pages. You can consider different ways to generate useful features for your learning-to-rank model. Your model should be able to achieve more than 70% accuracy on the dev set using basic TF-IDF features.

2. Later, you may think of advanced features to further enhance the performance of your model.

3. In order to train your XGBoost ranking classifier, you can start with the following parameters, and keep improving them later (a minimal training sketch using values in these ranges is given after this list):
• objective: rank:pairwise
• max_depth: 7-9
• n_estimators: 4500-5500 

• eta: 0.01-0.09
• lambda: ~100
• min_child_weight: 0.01-0.02
4. For XGBoost parameter descriptions, see the following URL: https://xgboost.readthedocs.io/en/latest/parameter.html
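For illustration only (not the required solution), a minimal sketch of training a pairwise ranker with the low-level xgboost API, using parameter values inside the suggested ranges; X, y and groups are assumed to come from your own feature extraction (e.g., rows built per candidate as sketched earlier), and numpy is used here only because it ships as a dependency of xgboost (confirm @ Piazza if you are unsure whether importing it directly is allowed):

import numpy as np
import xgboost as xgb

def train_ranker(X, y, groups):
    dtrain = xgb.DMatrix(np.asarray(X), label=np.asarray(y))
    dtrain.set_group(groups)                 # candidates are grouped per mention
    params = {
        'objective': 'rank:pairwise',
        'max_depth': 8,                      # suggested range: 7-9
        'eta': 0.05,                         # suggested range: 0.01-0.09
        'lambda': 100,
        'min_child_weight': 0.01,            # suggested range: 0.01-0.02
    }
    return xgb.train(params, dtrain, num_boost_round=5000)   # suggested range: 4500-5500

def predict_best_candidate(model, mention_info, extract_features):
    # Score every candidate of one mention and return the top-ranked one.
    candidates = mention_info['candidate_entities']
    feats = np.asarray([extract_features(mention_info, c) for c in candidates])
    scores = model.predict(xgb.DMatrix(feats))
    return candidates[int(np.argmax(scores))]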

NOTE¶
1. $\textbf{YOU SHOULD NOT HARD CODE}$ the Ground Truth in your implementation. Violations in this regard will get $\textbf{ZERO SCORE}$ for the $\textbf{Project (Part-1 + Part-2)}$.


2. For the final evaluation of $Q2$, we will be using a different set of test data sets (with different numbers of test mentions), which will follow the same format as that of the dev data.


3. Your model should not overfit the provided dev data set. Overfitting may lead to worse performance on the test data sets and you may get a LOW SCORE.

4. For ranking the entity candidates for each mention, you are $\textbf{NOT ALLOWED}$ to use Additional Information and/or External Resources other than those provided in the files corresponding to $Q2$.


5. You are $\textbf{NOT ALLOWED}$ to use deep learning and/or pre-trained embedding models.

6. In order to come up with a learning-to-rank model, you are only allowed to use the $XGBoost$ classifier (v0.90). You may check out the documentation of the $XGBoost$ classifier via the following URL: https://xgboost.readthedocs.io/en/latest/python/python_api.html


Output Format (Q2):¶
Your output should be a dict() of the form:

{mid: 'Entity_Label'}, where 

• mid corresponds to the mention id in the dev and/or test data set.
• Entity_Label corresponds to the mention’s most relevant entity label amongst the candidate labels.
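As shown in the test code below, project_part2.py must expose a function disambiguate_mentions(train_mentions, train_labels, dev_mentions, men_docs, parsed_entity_pages) that returns a dictionary in this format. A minimal skeleton is sketched here; the candidate chosen below is just a placeholder, and the real implementation would plug in your trained ranker:

def disambiguate_mentions(train_mentions, train_labels, dev_mentions, men_docs, parsed_entity_pages):
    # 1. Build features and relevance labels from the training mentions.
    # 2. Train the XGBoost pairwise ranker.
    # 3. Score the candidates of every dev/test mention and keep the best one.
    result = {}
    for mid, mention_info in dev_mentions.items():
        result[mid] = mention_info['candidate_entities'][0]   # placeholder, not a real prediction
    return result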
Running Time:¶
• On a CSE machine, your implementation should return the result within 10 minutes (600 sec) (USER + SYSTEM).

How we test your implementation¶
In [1]:
## Import Necessary Modules…
import pickle
import project_part2 as project_part2
In [2]:
## Read the data sets…

### Read the Training Data
train_file = './Data/train.pickle'
train_mentions = pickle.load(open(train_file, 'rb'))

### Read the Training Labels…
train_label_file = './Data/train_labels.pickle'
train_labels = pickle.load(open(train_label_file, 'rb'))

### Read the Dev Data… (For Final Evaluation, we will replace it with the Test Data)
dev_file = './Data/dev.pickle'
dev_mentions = pickle.load(open(dev_file, 'rb'))

### Read the Parsed Entity Candidate Pages…
fname = './Data/parsed_candidate_entities.pickle'
parsed_entity_pages = pickle.load(open(fname, 'rb'))

### Read the Mention docs…
mens_docs_file = "./Data/men_docs.pickle"
men_docs = pickle.load(open(mens_docs_file, 'rb'))
In [3]:
## Result of the model…
result = project_part2.disambiguate_mentions(train_mentions, train_labels, dev_mentions, men_docs, parsed_entity_pages)
In [4]:
## Here, we print out sample result of the model for illustration…
for key in list(result)[:5]:
    print('KEY: {} \t VAL: {}'.format(key, result[key]))

KEY: 1 VAL: 1998_FIFA_World_Cup
KEY: 2 VAL: Bucharest
KEY: 3 VAL: Romania_national_football_team
KEY: 4 VAL: Lithuania_national_football_team
KEY: 5 VAL: 1998_FIFA_World_Cup
In [5]:
## We will be using the following function to compute the accuracy…
def compute_accuracy(result, data_labels):
    assert set(list(result.keys())) - set(list(data_labels.keys())) == set()
    TP = 0.0
    for id_ in result.keys():
        if result[id_] == data_labels[id_]['label']:
            TP += 1
    assert len(result) == len(data_labels)
    return TP/len(result)
In [6]:
### Read the Dev Labels… (For Final Evaluation, we will replace it with the Test Data)
dev_label_file = './Data/dev_labels.pickle'
dev_labels = pickle.load(open(dev_label_file, 'rb'))

accuracy = compute_accuracy(result, dev_labels)
print("Accuracy = ", accuracy)

Accuracy = 0.897887323943662

Evaluation Metric + Scoring Function¶
• We will compute the accuracy of your model on the test and/or dev data sets, as shown in the function: compute_accuracy(result, data_labels) given above. Later, we will be using the following piece-wise linear scoring function to compute your scores (0-20) for $Q2$.
$$ Score(x) = \left\{ \begin{array}{ll} 0 & x < 0.70\\ 10\cdot(20x-14) & 0.70 < x \leq 0.75 \\ 10\cdot(5x-2.75) & 0.75 < x < 0.85 \\ 10\cdot(12.5x-9.125) & 0.85 \leq x < 0.89 \\ 20 & x \geq 0.89 \end{array} \right. $$

Project Submission and Feedback¶
For project submission, you are required to submit the following files:
1. Your implementation in a python file project_part2.py.
2. A report project_part2.pdf. You need to write a concise and simple report illustrating:
• Implementation details of $Q2$.
• Especially, your approach for extending your implementation in ($Q1$) for ($Q2$).

Note: Every student will be entitled to 15 Feedback Attempts (use them wisely); we will use the last submission for the final evaluation of the Project (Part-2).

Bonus Points (10 points)¶
We will award $\textbf{BONUS POINTS to the TOP-10}$ best performing students in decreasing order of their performance on the Project (Part-2), i.e., the best performing student will get 10 points, the second-best will get 9 points, and so on.
NOTE:
• We will not consider Project (Part-1) to award BONUS scores.
• We will not receive any separate submission for the BONUS points.
• Your project implementation for Part-2 will be automatically considered for the Bonus scores.