Python NLP assignment help

Assessment Description

Text documents, such as long recordings and meeting transcripts, usually consist of topically coherent segments, each of which contains some number of text passages. Within each topically coherent segment, word usage is expected to follow a more consistent lexical distribution than it does across segments. A linear partition of a text into topic segments can be used for text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, and discourse analysis. In this assessment, you are required to write Python code to preprocess a set of meeting transcripts and convert them into numerical representations suitable for input into topic segmentation algorithms.

The detailed tasks are as follows:

• Task 1: Reconstruct meeting transcripts with topical boundaries. The original meeting transcripts are stored in three types of XML files, whose names end with “.words.xml”, “.topic.xml” and “.segments.xml”. (Details about the three file types can be found in Section 3 below.) The task is to reconstruct the original meeting transcripts, with the corresponding topical and paragraph boundaries, from these files. Please note that

o A meeting transcript must be generated for each “*.topic.xml” file. For example, “ES2002a.txt” will be generated for “ES2002a.topic.xml”.

o All the generated meeting transcripts with the “.txt” file extension must be saved in the folder “txt_files”.

o The topical boundaries must be denoted with “**********” (i.e., 10 asterisks).

o All tokens, including punctuation marks, must be separated by a white space. For example, “Alright , okay . Okay .”

o Besides the topical boundaries, the paragraph boundaries must also be reconstructed with the “*.segments.xml” file.

o The input files to your notebook “task_1.ipynb” must be the three types of XML files. The output must be the meeting transcripts saved in a set of txt files.

o A sample meeting transcript is provided in the “txt_files” folder.
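The output format required by Task 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the required solution: it assumes the XML files have already been parsed into a nested list (topic segments, each a list of paragraphs, each a list of token strings), that every paragraph is written on its own line, and that the asterisk line follows every topic segment, including the last one. The name `format_transcript` and the sample tokens in the second topic are hypothetical.

```python
TOPIC_BOUNDARY = "*" * 10  # ten asterisks, as required for topical boundaries


def format_transcript(topics):
    """Render parsed topic segments as transcript text.

    `topics` is a list of topic segments; each segment is a list of
    paragraphs; each paragraph is a list of token strings.
    Assumptions: one paragraph per line, one boundary line after every topic.
    """
    lines = []
    for paragraphs in topics:
        for tokens in paragraphs:
            # Tokens, including punctuation, are separated by a single space.
            lines.append(" ".join(tokens))
        lines.append(TOPIC_BOUNDARY)
    return "\n".join(lines) + "\n"


# Hypothetical sample: two topic segments with two and one paragraphs.
sample = [
    [["Alright", ",", "okay", "."], ["Okay", "."]],
    [["Right", "."]],
]
print(format_transcript(sample), end="")
```

The returned string could then be written to “txt_files/<meeting>.txt”; check the sample transcript provided with the assessment to confirm where the boundary lines belong.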

• Task 2: Generate sparse representations for the meeting transcripts. The aim of this task is to build sparse representations for the meeting transcripts generated in Task 1. This involves word tokenization, vocabulary generation, and the generation of the sparse representations themselves. Please note that

o Word tokenization must use the following regular expression, “\w+(?:[- ‘]\w+)?”, and all words must be converted to lower case.

o The stop word list (i.e., stopwords_en.txt) provided in the zip file must be used.

o Words whose document frequency is greater than 132 must be removed.

o Generating multi-word phrases (i.e., collocations) is not needed.

o The output of this task must contain the following files:

▪ vocab.txt: It contains the unigram vocabulary in the format word_string:integer_index. Words in the vocabulary must be sorted in alphabetical order. For example, “absolute:22” in the following figure means that the 23rd word in the vocabulary is “absolute”.

▪ topic_seg.txt: It contains the topic boundaries encoded as boolean vectors. For example, if a meeting transcript “ES2018d.txt” contains 10 paragraphs in total after being preprocessed, and there are topic boundaries after the 2nd, 5th, and 7th paragraphs, the boolean vector must be “ES2018d:0,1,0,0,1,0,1,0,0,1”. Every line in topic_seg.txt corresponds to one meeting transcript.

▪ ./sparse_files/*.txt: Each txt file in the “sparse_files” folder corresponds to one of the meeting transcripts in the “txt_files” folder and has the same file name. For example, “./sparse_files/ES2002a.txt” corresponds to “./txt_files/ES2002a.txt”. Each file in “sparse_files” contains the sparse representations of all its paragraphs, where 1) each line is a paragraph, and the order of the lines must match the paragraph order in the corresponding meeting transcript; 2) the integer before “:” is the word index in the vocabulary and the one after is the frequency of the word in the corresponding paragraph; 3) empty paragraphs after preprocessing must be excluded.
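The Task 2 steps can be sketched roughly as below. This is an illustrative sketch rather than the expected solution, and it rests on several assumptions: the tokenization pattern is taken to use a plain hyphen and a straight apostrophe, a "document" for the document-frequency count is taken to be one meeting transcript, index:count pairs are joined with commas, and a topic's boundary flag moves to its last non-empty paragraph when later paragraphs become empty. All function names and the demo data are hypothetical.

```python
import re
from collections import Counter

# Assumption: the quoted pattern with a plain hyphen and a straight apostrophe;
# check it against the original handout before relying on it.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)?")
MAX_DF = 132  # words with a document frequency above this are removed


def tokenize(paragraph, stopwords):
    """Lower-case a paragraph, tokenize it, and drop stop words."""
    return [t for t in TOKEN_RE.findall(paragraph.lower()) if t not in stopwords]


def build_vocab(docs, max_df=MAX_DF):
    """Map each kept word to its index in alphabetical order.

    `docs` maps a meeting name to its list of tokenized paragraphs.
    """
    df = Counter()
    for paragraphs in docs.values():
        # Count each word once per meeting transcript (assumed document unit).
        df.update({tok for para in paragraphs for tok in para})
    kept = sorted(w for w, n in df.items() if n <= max_df)
    return {w: i for i, w in enumerate(kept)}


def sparse_line(tokens, vocab):
    """Encode one paragraph as 'index:count' pairs ('' if nothing is left)."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    # Assumption: pairs are comma-separated and ordered by word index.
    return ",".join(f"{i}:{c}" for i, c in sorted(counts.items()))


def encode_topic(paragraph_lines):
    """Boundary flags for one topic: 1 marks its last non-empty paragraph."""
    kept = [p for p in paragraph_lines if p]
    return [0] * (len(kept) - 1) + [1] if kept else []


# Hypothetical demo: one topic with three paragraphs, one of which becomes empty.
stopwords = {"the", "is"}
paragraphs = ["The budget is well-known .", "The", "Budget meeting"]
tokens = [tokenize(p, stopwords) for p in paragraphs]
vocab = build_vocab({"demo": tokens})
lines = [sparse_line(t, vocab) for t in tokens]
flags = encode_topic(lines)
print(vocab, [line for line in lines if line], flags)
```

Concatenating the flags of all topics in a meeting gives one line of the boundary file, and the non-empty sparse lines give the corresponding file in “sparse_files”.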

3. Assessment Resources

Before you start writing your code, you will need to download the file

• meeting_transcripts

After unzipping the file, you will find the following:

• There are three types of XML files in the given folder:

  1. ./topics/*.topic.xml contains the information about topic segments. Each topic tag directly linked to the root indicates one topic segment that is required in the text segmentation task. Each topic segment can contain a number of paragraphs given by different meeting attendees. It can also contain sub-topics.
  2. ./words/*.words.xml contains the word tokens generated with the forced alignment technique. Each word is associated with its start time and end time in the meeting transcript.
  3. ./segments/*.segments.xml contains the paragraph boundaries, the start and end of which are denoted by the corresponding word IDs.
  •   ./sparse_files: the file folder used to store the generated sparse representations for all the meeting transcripts.
  •   ./txt_files: the file folder used to save the reconstructed meeting transcripts.
  •   ./stopwords_en.txt: the stopword list used in word tokenization.
  •   ./topic_segs.txt: the file used to save the topical boundaries.
  •   ./vocab.txt: the file used to save the vocabulary.
  •   ./task_1.ipynb: the Python code you are going to write for Task 1.
  •   ./task_2.ipynb: the Python code you are going to write for Task 2.
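Before parsing, it can help to collect the XML files that belong to each meeting. The sketch below is one possible way to do this, assuming the folder layout listed above and assuming that the meeting ID is the part of the file name before the first dot (e.g., “ES2002a” for “ES2002a.topic.xml”). The function name `group_by_meeting` is hypothetical, and the internal structure of the XML files is not covered here.

```python
from collections import defaultdict
from pathlib import Path

# Sub-folder name -> file pattern, as listed in the resources above.
PATTERNS = {
    "topics": "*.topic.xml",
    "words": "*.words.xml",
    "segments": "*.segments.xml",
}


def group_by_meeting(root):
    """Collect the XML files under `root`, keyed by meeting ID.

    Assumption: the meeting ID is the file name up to the first dot.
    """
    root = Path(root)
    meetings = defaultdict(lambda: {kind: [] for kind in PATTERNS})
    for kind, pattern in PATTERNS.items():
        for path in sorted((root / kind).glob(pattern)):
            meeting_id = path.name.split(".")[0]
            meetings[meeting_id][kind].append(path)
    return dict(meetings)


if __name__ == "__main__":
    # Report how many files of each kind were found per meeting.
    for meeting_id, files in sorted(group_by_meeting(".").items()):
        print(meeting_id, {kind: len(paths) for kind, paths in files.items()})
```

Keeping a list per file type allows for more than one words or segments file per meeting, should the data be organised that way.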

4. Assessment Criteria

The following outlines the criteria which you will be assessed against.

4.1 Mark allocation and general marking criteria

1. The submitted scripts in the notebook should work without any errors and must give the correct results. If the submitted notebook cannot be run by the assessor, which will be double-checked by the head tutor and the lecturer, zero marks will then be given to the corresponding task.

o Task 1: 14 out of 30

o Task 2: 14 out of 30

o Task 2 will be assessed if and only if Task 1 is successfully finished and receives a full mark (i.e., 14).

2. The code should be well structured and properly commented. (1 out of 30)

3. The notebook should be structured in a logical way so that it clearly shows how students finish the tasks in the assessment. (1 out of 30)

4. Criteria 2 and 3 will be assessed if and only if the mark for criterion 1 is greater than or equal to 25.

5. How to Submit

Once you have completed your work, take the following steps to submit your work.

1. Only one zip file needs to be submitted: once you have finished the tasks, please zip the folder, which must contain only the files specified in Section 3, including the original XML files.