NLE Coursework 2


Important Information

Submission format:
You should submit just one file which should either be a Jupyter notebook, or a Jupyter notebook that has been zipped up with any files that it loads (e.g. images). Note that when marking your work, we will only look at the contents of the Jupyter notebook.

Due date:
Your notebook should be submitted on Study Direct before 4pm on Thursday 15th January, 2018. The standard late penalties will apply.

Return date:
Marks and feedback will be provided on Sussex Direct on Friday February 5th for all submissions that are submitted by the due date.

Weighting:
This assessment contributes 50% of the mark for the module.

Word limit:
Your Jupyter notebook should contain no more than 3000 words. The standard penalties apply for deviations from this limit (see below). You must specify the number of words in your report.

Failure to observe limits of length

The following is taken from the Examination and Assessment Regulations Handbook:

The maximum length for each assessment is publicised to students. The limits as stated include quotations in the text, but do not include the bibliography, footnotes/endnotes, appendices, abstracts, maps, illustrations, transcriptions of linguistic data, or tabulations of numerical or linguistic data and their captions. Any excess in length should not confer an advantage over other students who have adhered to the guidance. Students are requested to state the word count on submission. Where a student has marginally (within 10%) exceeded the word length the Marker should penalise the work where the student would gain an unfair advantage by exceeding the word limit. In excessive cases (>10%) the Marker need only consider work up to the designated word count, and discount any excessive word length beyond that to ensure equity across the cohort. Where an assessment is submitted and falls significantly short (>10%) of the word length, the Marker must consider in assigning a mark, if the argument has been sufficiently developed and is sufficiently supported and not assign the full marks allocation where this is not the case.

Note that code and the content of cell outputs and examples of data are excluded from the word count.


Overview

For this assignment, you are asked to write a report on the activities covered in the notebooks for Topic 7 and Topic 8.

In particular, your report should consist of a Jupyter notebook containing the following four sections (further details regarding the content of each section are given below).

Section 1:
A report on what you discovered while undertaking an assessment of the accuracy of spaCy’s entity extraction capability on a personalised sample of sentences.

Section 2:
A report on your gender classifier.

Section 3:
A report on your code that extracted feature sets for the characters in a novel.

Section 4:
A report on your investigation into differences in the way male and female characters are portrayed.


Details of Requirements

Your submission will be marked out of 100. Please read the following guidelines carefully.

Section 1: Assessing spaCy’s Entity Extraction Capacity

There are 20 marks available for this section.

  • You should run the entity extractor on a sample of 100 sentences and report on how accurately it has performed.
  • The sample you use must be produced using your candidate number to generate a personalised set of sentences for this evaluation.
  • You do not need to describe your assessment of each sentence individually, but you should give overall statistics and use illustrative examples from your sample.
  • Your analysis should be broken down by the type of named entity.
  • When an error occurs, you should describe the nature of the error. You should distinguish the following cases:
    • where the wrong entity type is assigned to a span
    • where the wrong span is identified
    • where an entity is missed altogether
  • A confusion matrix can be used here to summarise what you have found.
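
The error cases listed above can be tallied mechanically once each sentence has been annotated by hand. The following is a minimal sketch only, not the method from the Topic notebooks: it assumes that both your own annotations and the extractor's output have been converted to (start, end, label) tuples, and the function names `overlaps` and `compare_entities` are hypothetical.

```python
from collections import Counter


def overlaps(a, b):
    """Return True if two (start, end, label) spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def compare_entities(gold, predicted):
    """Tally the outcome for each hand-annotated entity in one sentence.

    gold, predicted: lists of (start, end, label) tuples (an assumed format).
    Returns a Counter keyed by (gold_label, outcome).
    """
    tally = Counter()
    for g in gold:
        # Predictions covering exactly the same span as the gold entity.
        exact = [p for p in predicted if p[:2] == g[:2]]
        # Predictions that overlap the gold entity but have different boundaries.
        partial = [p for p in predicted if overlaps(g, p) and p[:2] != g[:2]]
        if exact and exact[0][2] == g[2]:
            tally[(g[2], "correct")] += 1
        elif exact:
            # Right span, wrong entity type: record which type was assigned.
            tally[(g[2], "wrong_type:" + exact[0][2])] += 1
        elif partial:
            tally[(g[2], "wrong_span")] += 1
        else:
            tally[(g[2], "missed")] += 1
    return tally
```

Summing the Counters returned for all 100 sentences gives a sparse confusion matrix broken down by entity type. Note that this sketch does not count spurious entities, i.e. predictions that overlap no annotated entity; those would need a separate tally.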

Section 2: Gender Classifier

There are 20 marks available for this section.

  • You should present the code of your gender classifier and explain how it works.
  • You should use the names.csv data or, if you prefer, some other comparable source of information about the gender of names.
  • Your code should deal with cases where a character is referred to by more than just their first name (e.g. “John Jones”).
  • Your code should deal with cases where a character is referred to using a title (e.g. “Mrs Smith”).
  • By running your gender classifier on a sample of data and reviewing the results, provide an indication of how accurate your gender classifier is. What proportion of names are being correctly analysed?
  • Bonus marks will be awarded if you successfully deal with situations where just a surname is used (e.g. “Smith”) after the gender of that character has been revealed (e.g. “Mrs Smith”).
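
One possible shape for such a classifier is sketched below. It is an illustration under stated assumptions rather than a model answer: the title lists are incomplete, `NAME_GENDER` is a tiny hypothetical stand-in for a lookup built from names.csv (whose column layout is not specified here), and the class and function names are invented for this example.

```python
MALE_TITLES = {"mr", "sir", "lord"}
FEMALE_TITLES = {"mrs", "miss", "ms", "lady"}

# Hypothetical stand-in for a first-name lookup loaded from names.csv.
NAME_GENDER = {"john": "male", "mary": "female"}


class GenderClassifier:
    def __init__(self, name_gender):
        self.name_gender = name_gender
        # Remembers the gender revealed for a surname, e.g. by "Mrs Smith".
        self.surname_gender = {}

    def classify(self, mention):
        """Return "male", "female" or "unknown" for a character mention."""
        tokens = [t.strip(".,").lower() for t in mention.split()]
        if not tokens:
            return "unknown"
        first = tokens[0]
        if first in MALE_TITLES:
            gender = "male"
        elif first in FEMALE_TITLES:
            gender = "female"
        elif first in self.name_gender:
            gender = self.name_gender[first]
        elif len(tokens) == 1:
            # A bare surname: fall back on what earlier mentions revealed.
            gender = self.surname_gender.get(first, "unknown")
        else:
            gender = "unknown"
        if gender != "unknown" and len(tokens) > 1:
            # Treat the last token as the surname and remember its gender.
            self.surname_gender[tokens[-1]] = gender
        return gender


def accuracy(classifier, labelled_mentions):
    """Proportion of (mention, expected_gender) pairs classified correctly."""
    correct = sum(classifier.classify(m) == g for m, g in labelled_mentions)
    return correct / len(labelled_mentions)
```

The surname memory is deliberately naive: if both “Mr Smith” and “Mrs Smith” appear in the same novel, the later mention overwrites the earlier one, so a bare “Smith” is resolved to whichever was seen most recently. A report should discuss how often this kind of ambiguity arises in the sample.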

Section 3: Building feature sets that characterise the way a character is portrayed

There are 30 marks available for this section.

  • You should explore a number of alternative ways of characterising the way a person is portrayed by a novelist. This should include implementing the suggestions that appear in the Topic 8 notebook. You are encouraged to go beyond the suggestions made in the Topic 7 notebook.
  • You should describe the code that you have written to create feature sets for characters.
  • You should describe the code that shows how you were able to extract features in situations where one of the pronouns “he”, “she”, “his” and “her” is used in a novel.
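
As a rough illustration of pronoun-based feature extraction, the sketch below pairs each of the four pronouns with the word that follows it. This is an assumed, deliberately simple scheme and not the approach from the Topic notebooks: a dependency parse (for example from spaCy) would give more reliable features, and the function name and the "subj:" and "poss:" feature prefixes are hypothetical.

```python
import re
from collections import Counter

PRONOUN_GENDER = {"he": "male", "his": "male", "she": "female", "her": "female"}


def pronoun_features(text):
    """Count the word following each gendered pronoun, grouped by gender.

    Returns {"male": Counter, "female": Counter} with keys such as
    "subj:laughed" (word after he/she) or "poss:hat" (word after his/her).
    """
    features = {"male": Counter(), "female": Counter()}
    tokens = re.findall(r"[a-z']+", text.lower())
    # Stop one token early so that tokens[i + 1] always exists.
    for i, tok in enumerate(tokens[:-1]):
        gender = PRONOUN_GENDER.get(tok)
        if gender is None:
            continue
        kind = "subj" if tok in ("he", "she") else "poss"
        features[gender][kind + ":" + tokens[i + 1]] += 1
    return features
```

Two limitations are worth stating in a report: “her” can be an object as well as a possessive (“he saw her”), and because the text is tokenised without regard to sentence boundaries, a pronoun at the end of one sentence is paired with the first word of the next.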

Section 4: Investigating differences in the way genders are portrayed

There are 30 marks available for this section.

  • You should make it clear how you have aggregated feature sets across the male and female characters appearing in at least two collections of novels. These collections could be novels by different authors, different sets of authors, or sets of novels that were written at different periods in history.
  • You should discuss the result of measuring the cosine similarity of the aggregated male and female feature sets. The reason to consider different sets of novels is to look at differences in gender-based cosine similarity in the different collections.
  • In Section 3 you should have considered a number of alternative ways of deriving feature sets for characters. In this section, you should present the results of using these alternative approaches.
  • You should explain what you have done to assess the cosine similarity of pairs of feature sets that are aggregated over randomly selected characters (i.e. characters that aren’t split up on the basis of gender). This should provide an indication as to whether the differences you find when making a gender-based comparison are meaningful.
  • You should explain how you went about assessing what the impact would be of an imbalance in the number of male and female characters. Is there a gender imbalance in your gender-based comparison?
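
The aggregation, cosine similarity and random-split comparison described above could be sketched as follows. This assumes each character's feature set is a `collections.Counter` mapping feature names to counts; the function names and the `size_a`, `trials` and `seed` parameters are hypothetical choices for this illustration.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two Counters treated as sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def aggregate(feature_sets):
    """Sum a list of per-character Counters into one Counter."""
    total = Counter()
    for fs in feature_sets:
        total.update(fs)
    return total


def random_split_similarity(feature_sets, size_a, trials=100, seed=0):
    """Mean cosine similarity between two random groups of characters.

    size_a is the number of characters placed in the first group, so the
    split can mirror the male/female proportions found in the real data.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(trials):
        shuffled = list(feature_sets)
        rng.shuffle(shuffled)
        group_a = aggregate(shuffled[:size_a])
        group_b = aggregate(shuffled[size_a:])
        sims.append(cosine(group_a, group_b))
    return sum(sims) / len(sims)
```

Setting `size_a` to the number of male characters makes the random baseline share the same group-size imbalance as the gender-based split, which helps separate the effect of gender from the effect of unequal group sizes. If the gender-based similarity is not clearly lower than the random-split average, the difference is unlikely to be meaningful.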