3. Key tasks
● Task 1: you will be given the opportunity to apply the concepts of classes and methods in Python; in particular you are required to define two user-defined data types – String and List – as a form of class.
● Task 2: you will be given the opportunity to manipulate data from text files and to conduct some basic data analysis using external Python packages.
3.1. Task 1
3.1.2. Instructions & Requirements
In this task you will be assessed on how to implement two user-defined data types – String and List – and the associated methods for each class that are useful for data processing. You are required to implement the required methods for each class with your own algorithms, without using any of the built-in functions provided by Python. The data stored in these data types are organised in an array-based structure.
Part A: The String Class
In part A, you are required to implement a number of user-defined methods that are useful for processing or manipulating strings (data in textual form). Although Python has provided a good collection of string methods (that you would have used in the implementation of your first assessment), the purpose of part A is to assess your programming knowledge in developing algorithms for handling strings.
Your task is to create a Python class that contains a collection of methods defined for manipulating strings. An attribute or instance variable is required for this String class to represent each individual string data. It should be stored in an array-based structure. For the purpose of this part, you should define this instance variable by using a Python list to represent each string data. (With this implementation, we are assuming that each character in a string is represented as one element in the Python list.)
The following is a list of methods that can be applied on the strings represented by this String class. You are required to implement each of them. We have suggested the method header with the method name as well as the argument(s) that each method requires.
1. __init__(self, str_value) : This is the constructor method that is required for creating string objects from this String class. It takes the Python string ( str_value ) as an argument and assigns it to an instance variable defined by a Python list as mentioned above. (You may name the instance variable as str_data .)
2. search(self, target_char) : This method checks whether a specific character ( target_char ) exists within the string. As long as the character exists in the string (i.e. once you have found its first occurrence), return True; otherwise return False. You are expected to implement this using the “linear search” algorithm.
3. frequency(self, target_char) : This method works in a similar way to the search method, except that it returns the number of occurrences of the specific character ( target_char ) in the string.
4. replace(self, target_char, new_char) : This method will search for the specific character ( target_char ) within the string and replace all its occurrences with a new character ( new_char ).
5. lowercase(self) : This method converts or normalises the string into lower cases. No arguments are required for this method. (Note that you should not attempt to use the built-in lower() method provided by Python.)
6. uppercase(self) : Similar to the lowercase method, this method converts the string into upper cases. (Again, you should not attempt to use the built-in upper() method provided by Python.)
7. tokenise(self, the_delimiter) : This method tokenises/splits the string based on a specific character, referred to as the delimiter ( the_delimiter ). The delimiter could be a space character “ ” or a punctuation mark such as “,”. This method will return a list of tokens (sub-strings) where each of the tokens can be represented as either an individual Python list or an object of this String class.
8. __eq__(self, other) : This is one of the overloaded methods in Python that enables us to check for equality. Here, we want to compare whether the data represented by the argument other is the same as the data represented in the current string object referred by self . (Note that you will have to first ensure that the argument other is an object of this String class before attempting to compare the contents of the two objects.)
9. __str__(self) : This is another overloaded method that is useful for formatting the output of the string data represented in this String class. Re-build the instance variable ( str_data ) into a Python string and return it.
You should name the Python class as StringClass and save the Python source file as Task1_PartA.py .
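To illustrate the array-based representation, the following is a minimal sketch of a few of the methods listed above. It is not a model solution. The lookup tables `_UPPER` and `_LOWER` are hypothetical names introduced here so that lower() is not needed, the sketch assumes only the 26 unaccented English letters require case conversion, and it assumes lowercase() modifies the object in place rather than returning a new one.

```python
class StringClass:
    """Partial sketch of the array-based string class (not a full solution)."""

    # Hypothetical lookup tables: the letter at position i in _UPPER
    # corresponds to the letter at position i in _LOWER.
    _UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    _LOWER = "abcdefghijklmnopqrstuvwxyz"

    def __init__(self, str_value):
        # One character per list element.
        self.str_data = [ch for ch in str_value]

    def search(self, target_char):
        # Linear search: stop as soon as the first occurrence is found.
        for ch in self.str_data:
            if ch == target_char:
                return True
        return False

    def frequency(self, target_char):
        # Same scan as search, but visit every element and count matches.
        count = 0
        for ch in self.str_data:
            if ch == target_char:
                count += 1
        return count

    def lowercase(self):
        # Assumption: the conversion is done in place.
        converted = []
        for ch in self.str_data:
            new_ch = ch
            position = 0
            for upper in StringClass._UPPER:
                if ch == upper:
                    new_ch = StringClass._LOWER[position]
                position += 1
            converted = converted + [new_ch]
        self.str_data = converted

    def __str__(self):
        # Re-build the list of characters into a Python string.
        text = ""
        for ch in self.str_data:
            text = text + ch
        return text
```

For example, after `s = StringClass("Hello World")`, calling `s.frequency("l")` scans all eleven characters, whereas `s.search("W")` stops at the seventh.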
Part B: The StringList Class
In part B, you are required to implement another class that is responsible for handling a collection of strings, which is essentially a List abstract data type (ADT) that we have discussed. For the purpose of this part, we are (again) using the Python list to represent the string collection as an array-based structure. This means that each element in the Python list is holding an object from the String class defined in Part A. As such, we need to define a Python list as the instance variable (or attribute) for this StringList class that we are going to implement. You are required to implement each of the following methods for this StringList class. The method header as well as the argument(s) needed by each of the methods are suggested for your consideration.
1. __init__(self, size) : This is the constructor method that is required for constructing an initial empty list which will hold a collection of objects from the String class. You may name this instance variable as str_list . Given that we are adopting an array-based structure for the implementation, you should decide on an initial size for the collection, as indicated by the argument size.
2. add(self, new_item) : This method will add a new item which is a StringClass object to the collection represented in this StringList class (str_list). For the purpose of this task, the new item should just be added to the end of the collection. (Note that duplication of an existing item is allowed here; and you should not attempt to use the built-in append() method provided by Python.)
3. remove(self, target_item) : This method will remove all the occurrences of the specific item (as indicated by target_item ) from the string collection represented in this StringList class ( str_list ).
4. search(self, target_item) : The method is for searching for a specific item ( target_item ) in the string collection ( str_list ). Again, so long as an occurrence of the target item is encountered, return a True value; otherwise a False value. For the implementation of this method, you are required to use the “linear search” algorithm. (Note that the assumption here is that the string collection ( str_list ) has been sorted before this method can be applied to perform the search.)
5. __len__(self) : This is another overloaded method that is commonly implemented to return the number of items in the collection. (Again, you should not attempt to use the built-in len() method provided by Python.)
6. __str__(self) : This is again the overloaded method that is useful for formatting the output of the string collection represented in this StringList class. Construct a Python string which organises each item from the string collection ( str_list ) as a separate line in the output.
You should name the Python class as StringListClass and save the Python source file as Task1_PartB.py .
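The sketch below shows one possible way to add items to an array-based collection without append() and to report its length without len(). It is an illustration under stated assumptions, not a model solution: the attributes `count` and `capacity` are hypothetical additions, and doubling the array when it is full is just one growth strategy. To keep the sketch self-contained, the usage shown afterwards stores plain Python strings rather than StringClass objects.

```python
class StringListClass:
    """Partial sketch of an array-based string collection (not a full solution)."""

    def __init__(self, size):
        # Pre-allocate `size` empty slots.
        self.str_list = [None] * size
        self.capacity = size   # hypothetical: number of slots allocated
        self.count = 0         # hypothetical: number of slots in use

    def add(self, new_item):
        if self.count == self.capacity:
            # The array is full: allocate a larger one and copy items across.
            new_capacity = self.capacity * 2 if self.capacity > 0 else 1
            bigger = [None] * new_capacity
            i = 0
            while i < self.count:
                bigger[i] = self.str_list[i]
                i += 1
            self.str_list = bigger
            self.capacity = new_capacity
        # Place the new item in the first unused slot.
        self.str_list[self.count] = new_item
        self.count += 1

    def search(self, target_item):
        # Linear search over the slots that are in use.
        i = 0
        while i < self.count:
            if self.str_list[i] == target_item:
                return True
            i += 1
        return False

    def __len__(self):
        # The running counter makes len() on the list unnecessary.
        return self.count
```

With `lst = StringListClass(2)`, adding a third item triggers one resize, after which `lst.capacity` is 4 and `len(lst)` is 3.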
Part C: Creating Instances
In the final part of this task, you will be assessed on how to make use of the two user-defined classes implemented in the first two parts (i.e., Part A and Part B).
The task here is to construct a Python program for creating instances or objects of each class by importing the two classes. You should attempt to apply “all” the methods defined for each class on the corresponding objects to “test” the implementation for each of the methods. Note that the design and organisation of the program for this task is of your own decision.
You should name your program for this last part as Task1_PartC.py .
3.2. Task 2
3.2.1 Instructions & Requirements
Building upon the programming knowledge and skills acquired from Task 1, you will be assessed in this task on how to conduct pre-processing and formatting tasks on datasets provided as text files. You are required to perform some basic data analysis on the cleaned datasets (i.e. after pre-processing) by using external Python packages (such as NumPy, SciPy, Pandas, and Matplotlib). In addition, you are also required to make your programs robust by handling any potential errors or exceptions.
The Dataset: Conti-Ramsden 4
Before you get started with any of the programming tasks, you should read through the description of the dataset that we will be using for the purpose of this assignment. The dataset is known as the Conti-Ramsden 4 Corpus, which is a collection of narrative transcripts gathered for a clinical study carried out in the United Kingdom to study children with language disorders. Two sets of data were collected: the first set is from children diagnosed with Specific Language Impairment (SLI), one type of language disorder; and the second set is from children with typical development (TD). A subset of the original corpus is used in this assignment, with 10 selected transcripts for each group of children.
Each of the narrative transcripts is a record of the story-telling task performed by each child (for both groups), under the supervision of an investigator. The story is based on the wordless 24-picture storybook authored by Mercer Mayer, ‘Frog, where are you?’. Below is an excerpt extracted from the transcript produced by an SLI child.
You should note that there are many details recorded in each of these transcripts. However, for the purpose of this assignment, the data required for processing and analysis is the narrative produced by the children, which are the statements (or lines) indicated by the label of ‘*CHI:’ in the transcripts (as highlighted in the excerpt). As a side note, the format of the transcripts is based on the CHAT Transcription Format. You may want to refer to the manual [ http://talkbank.org/manuals/CHAT.pdf ] for the explanation of the various CHAT symbols, such as [//], [/], [*], (.), <…>, etc. (Note: Please download the dataset (linked below under Assessment Resources) before attempting the following parts. The SLI transcripts are organised under the folder ‘SLI’ and the TD transcripts are under the folder ‘TD’.)
Part A: Handling File Contents and Preprocessing
In this part, you will begin by reading in all the transcripts of the dataset given, for both the SLI and TD groups. You will then conduct a number of pre-processing tasks to extract only the relevant contents or texts needed for analysis in the subsequent part (i.e. Part B). Upon completing the pre-processing tasks, each of the cleaned transcripts should be saved as an output file. This is a more efficient approach, because the cleaned dataset can then be manipulated without having to repeat the pre-processing task.
As mentioned earlier, for the purpose of this assignment, the data required for processing and analysis is the narrative produced by the children, which are the statements (or lines) indicated by the label of ‘ *CHI: ’ in the transcripts. The first step is that, for each original transcript, extract only the statements which are prefixed or begin with ‘ *CHI: ’. (Note that there are some statements that extend to the next line, you should ensure that you take those into account.)
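The extraction step described above can be sketched as follows. This is a minimal illustration rather than a required design: the function name `extract_child_statements` is hypothetical, and the sketch assumes that a statement carried over to the next line shows up as a line beginning with a tab or a space, which you should verify against the transcripts themselves.

```python
def extract_child_statements(lines):
    """Return the *CHI: statements, joining any continuation lines."""
    statements = []
    in_child = False  # True while the most recent tier seen was *CHI:
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("*CHI:"):
            # Start of a new child statement: drop the label.
            statements.append(line[len("*CHI:"):].strip())
            in_child = True
        elif in_child and line.startswith(("\t", " ")) and line.strip():
            # Assumed continuation of the previous *CHI: statement.
            statements[-1] += " " + line.strip()
        else:
            # Any other line (e.g. *INV:, %mor:, @ headers) ends the statement.
            in_child = False
    return statements
```

The function accepts any iterable of lines, so an open file object can be passed to it directly.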
The next step is to perform a set of pre-processing or filtering tasks. We want to remove certain words (generally referred to as tokens) in each statement that have the CHAT symbols as either prefixes or suffixes, while retaining certain symbols and words for analysis in Part B. For this part of the implementation, you should consider splitting each statement into a list of words or tokens before you begin the filtering process. Below is a list of symbols that you should filter out of each of the child statements extracted.
● Remove those words that have prefixes of ‘ & ’ or ‘ + ’
○ Example:
Before filtering: *CHI: and he fell out into &-er the window .
After filtering: and he fell out into the window .
● Remove those words that have either ‘ [’ as prefix or ‘] ’ as suffix but retain these three symbols: [//] , [/] , and [*]
○ Example:
Before filtering: *CHI: and he’s [/-] the jar smashes [//] smashed
After filtering: and he’s the jar smashes [//] smashed
● Retain those words that have either ‘ <’ as prefix or ‘> ’ as suffix but these two symbols should be removed
○ Example:
Before filtering: *CHI:
After filtering: they were [//] he goes to bed .
● Retain those words that have either ‘ (’ as prefix or ‘) ’ as suffix but these two symbols should also be removed
○ Example:
Before filtering: *CHI: a + boy an(d) a dog h(ad) (.)
After filtering: a + boy and a dog had (.)
Note: The ‘(’ and ‘)’ symbols could also appear as an infix, i.e. in the middle of a word; you should remove them from the word or token that they are attached to. However, you should not remove the symbols (.), (..), and (…), as these symbols should be retained for data analysis. Also note that a ‘+’ symbol that appears on its own between two words should be retained (i.e. do not remove that symbol).
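One possible reading of the filtering rules above is sketched below. The function name `filter_statement` and the two constant sets are hypothetical, the input is assumed to be a statement with the ‘*CHI:’ label already removed, and a ‘+’ is treated as removable only when it is attached to other characters. Tokens that combine several symbols are not given any special treatment here.

```python
# Symbols that must survive the filtering step.
KEEP_BRACKETED = {"[//]", "[/]", "[*]"}
PAUSES = {"(.)", "(..)", "(...)", "(…)"}


def filter_statement(statement):
    """Apply the filtering rules to one statement and return the cleaned text."""
    kept = []
    for token in statement.split():
        # Retained symbols are copied across untouched.
        if token in KEEP_BRACKETED or token in PAUSES:
            kept.append(token)
            continue
        # Drop tokens prefixed with '&', or with '+' followed by other characters.
        if token[0] == "&" or (token[0] == "+" and len(token) > 1):
            continue
        # Drop tokens that start with '[' or end with ']'.
        if token[0] == "[" or token[-1] == "]":
            continue
        # Keep the word but strip '<', '>', '(' and ')' wherever they occur.
        for symbol in "<>()":
            token = token.replace(symbol, "")
        if token:
            kept.append(token)
    return " ".join(kept)
```

Applied to `"a + boy an(d) a dog h(ad) (.)"`, this sketch keeps the standalone ‘+’ and the pause marker and restores the two words containing parentheses.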
Important: You should also take note of the following additional requirements.
1. Pauses:
○ In addition to (.), there are two other symbols indicating longer pauses: (..) and (…). Please don’t remove these symbols.
2. Linked words:
○ An underscore ‘_’ is used to indicate linked words. These words should be retained, thus no processing is needed.
3. Special form markers:
○ You may encounter words ending with the ‘@’ symbol. Just retain these words, no processing is needed. (Refer to CHAT Manual Section 8.3 for more details.)
4. Special utterance terminators:
○ These special terminators should have been removed when you attempt to remove words prefixed with ‘+’. Hence, the statement delimiters are restricted to either a full stop ‘.’, a question mark ‘?’, or an exclamation mark ‘!’.
5. Nested symbols:
○ You may also want to pay attention to words that have more than one symbol, e.g.
Finally, once you have completed the filtering process for all the unwanted symbols, you should now save each of the cleaned child transcripts as an output file. You should produce a separate output file for each cleaned transcript and each statement should be written as a separate line in the output file. You may also want to organise your cleaned dataset into two groups: save the cleaned SLI transcripts under a folder named ‘SLI_cleaned’, and the cleaned TD transcripts under another new folder named ‘TD_cleaned’.
One additional requirement for the implementation for this first part is that you should handle any potential errors or exceptions that might occur by implementing the appropriate handling code. You should consider using the try-except clauses and/or the assert statements.
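As a hedged illustration of the saving step combined with error handling, the sketch below writes one statement per line and reports failures instead of crashing. The function name `save_cleaned` and its return convention are assumptions made for this example; how you report or recover from errors is your own design decision.

```python
import os


def save_cleaned(statements, out_dir, filename):
    """Write one statement per line into out_dir/filename; return True on success."""
    # An assert guards against a programming mistake in the caller.
    assert isinstance(filename, str) and filename, "filename must be a non-empty string"
    try:
        # Create the output folder (e.g. 'SLI_cleaned') if it does not exist yet.
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, filename)
        with open(path, "w", encoding="utf-8") as handle:
            for statement in statements:
                handle.write(statement + "\n")
    except OSError as error:
        # Covers problems such as missing permissions or an invalid path.
        print(f"Could not write {filename}: {error}")
        return False
    return True
```

Catching `OSError` rather than a bare `except` keeps unrelated bugs visible while still handling file-system failures.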
You should name your program for this first part as Task2_PartA.py .
Part B: Working with Basic Data Analysis
In this part, we are going to perform some basic data analysis by using some of the external Python packages (such as NumPy, SciPy, Pandas, and Matplotlib). The main task is to produce a number of statistics for the two groups of children transcripts. The statistics might serve as good indicators for distinguishing between the children with SLI and the typically developed children.
Amongst the statistics of each child that we are interested in are the following:
● Length of the transcript – indicated by the number of statements
● Size of the vocabulary – indicated by the number of unique words
● Number of repetitions of certain words or phrases – indicated by the CHAT symbol [/]
● Number of retracings of certain words or phrases – indicated by the CHAT symbol [//]
● Number of grammatical errors made – indicated by the CHAT symbol [*]
● Number of pauses made – indicated by the CHAT symbols (.), (..), and (…)
(Note: Given that the length of each child transcript is indicated by the number of statements, the end of each statement can be determined based on the punctuation marks of either a full stop ‘.’, a question mark ‘?’, or an exclamation mark ‘!’.)
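The counts listed above could be extracted along the following lines. This is a sketch under several assumptions that you should check against your own cleaned files: the function name `transcript_statistics` and the dictionary keys are hypothetical, terminators are assumed to appear as separate tokens, and the vocabulary is counted case-insensitively after excluding terminators, CHAT markers, and pause symbols.

```python
def transcript_statistics(statements):
    """Return a dictionary of counts for one cleaned transcript."""
    tokens = []
    for statement in statements:
        tokens.extend(statement.split())

    terminators = {".", "?", "!"}
    markers = {"[/]", "[//]", "[*]"}
    pauses = {"(.)", "(..)", "(...)", "(…)"}
    not_words = terminators | markers | pauses

    # Assumption: unique words are compared case-insensitively.
    words = {t.lower() for t in tokens if t not in not_words}

    return {
        "length": sum(1 for t in tokens if t in terminators),
        "vocabulary": len(words),
        "repetition": tokens.count("[/]"),
        "retracing": tokens.count("[//]"),
        "errors": tokens.count("[*]"),
        "pauses": sum(1 for t in tokens if t in pauses),
    }
```

Returning a dictionary per transcript makes it straightforward to collect the results for a whole group into a tabular structure later.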
To begin with the implementation, you should first read in the cleaned dataset that you have prepared in Part A. Implement a program that, for each of the cleaned child transcripts from both groups (SLI and TD), extracts the count for each of the statistics mentioned above. You should carefully consider a suitable data type or data structure from either Pandas or NumPy for the representation of the statistics extracted for each child group in your program. (Note that the data type/data structure chosen should allow you to represent the statistics in a tabular format, where the columns denote the statistic types and the rows denote the statistic counts of each child transcript.) Then, by using the functions provided by Matplotlib, create a visualisation to present these statistics for each child group.
In addition, you should produce the average or mean of these statistics for the two groups, and plot another graph to demonstrate the mean difference for each of the statistics considered. (You may want to consider the functions of Pandas for this part of the implementation.)
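A minimal sketch of the tabular representation and the group means is given below, using a Pandas DataFrame per group. The numbers are made-up placeholders, not values from the corpus, and the function names `group_means` and `plot_group_means`, the row labels, and the column names are all hypothetical. A grouped bar chart is only one of several reasonable visualisations.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

COLUMNS = ["length", "vocabulary", "repetition", "retracing", "errors", "pauses"]

# Placeholder counts for illustration only (NOT taken from the corpus).
# Rows are child transcripts; columns are the statistic types.
sli = pd.DataFrame([[30, 90, 4, 6, 5, 8], [25, 80, 2, 3, 7, 5]],
                   columns=COLUMNS, index=["SLI-1", "SLI-2"])
td = pd.DataFrame([[40, 120, 1, 2, 1, 3], [45, 130, 2, 1, 0, 2]],
                  columns=COLUMNS, index=["TD-1", "TD-2"])


def group_means(sli_stats, td_stats):
    """Return a table with one row per statistic and one column per group."""
    return pd.DataFrame({"SLI": sli_stats.mean(), "TD": td_stats.mean()})


def plot_group_means(means, path):
    """Draw a grouped bar chart of the means and save it to `path`."""
    ax = means.plot(kind="bar", title="Mean of each statistic by group")
    ax.set_ylabel("Mean count")
    ax.figure.tight_layout()
    ax.figure.savefig(path)
    plt.close(ax.figure)


means = group_means(sli, td)
```

Calling `plot_group_means(means, "group_means.png")` would then write the chart to a file that can be included in the submission.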
You should name your program for this second part as Task2_PartB.py . Also, you should save all the graphs produced in this part for the submission.