Cardiff School of Computer Science and Informatics Coursework Assessment Pro-forma
Module Code: CMT309
Module Title: Computational Data Science
Lecturer: Dr. Matthias Treder, Dr. Luis Espinosa-Anke
Assessment Title: CMT309 Programming Exercises
Assessment Number: 3
Date set: 06-03-2020
Submission date and time: 08-05-2020 at 9:30 am
Return date:
This assignment is worth 40% of the total marks available for this module. If coursework is submitted late (and where there are no extenuating circumstances):
1 – If the assessment is submitted no later than 24 hours after the deadline, the mark for the assessment will be capped at the minimum pass mark;
2 – If the assessment is submitted more than 24 hours after the deadline, a mark of 0 will be given for the assessment.
Your submission must include the official Coursework Submission Cover sheet, which can be found here:
https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf
Submission Instructions
Your coursework should be submitted via Learning Central by the above deadline. You have to upload the following files:
Description                 |            | Type                   | Name
Cover sheet                 | Compulsory | One PDF (.pdf) file    | Student_number.pdf
Your solution to question 1 | Compulsory | One Python (.py) file  | Q1.py
Your solution to question 2 | Compulsory | One Python (.py) file  | Q2.py
Your solution to question 3 | Compulsory | One Word (.docx) file  | Q3.docx
For the filename of the Cover Sheet, replace ‘Student_number’ by your student number, e.g. “C1234567890.pdf”. Make sure to include your student number as a comment in all of the Python files! Any deviation from the submission instructions (including the number and types of files submitted) may result in a reduction of marks for the assessment or question part.

You can submit multiple times on Learning Central. ONLY files contained in the last attempt will be marked, so make sure that you upload all files in the last attempt.
Assignment
Start by downloading the following files from Learning Central:
• Q1.py
• acronym_example1.txt, acronym_example2.txt, acronym_example3.txt,
acronym_example4.txt, acronym_tuples.txt
• Q2.py
• Q3.py
• Q3.docx
Then answer the following questions. You can use any Python expression or package that was used in the lectures. Additional packages are not allowed unless instructed in the question.
Question 1 – What is the long form of the acronym? (Total 35 marks)
In this question, your task is to implement several functions that together parse a text for acronyms and find their long forms. Acronyms are abbreviations typically formed from the initial letters of multiple words and pronounced as a word. For instance, the acronym “GPU” stands for the long form “graphics processing unit”. In this question, an acronym is defined as a character sequence of at least two successive capital letters.
As an example text, let us define the string
s = “A GPU, which stands for graphics processing unit, is different from CPUs, says the IT expert. For some operations, a GPU is faster than a CPU. GPUs are not always faster though.”
Q1 a) Parse acronyms (10 marks)
Write a function read_file(filename) that receives as input a filename. The filename includes the filepath. The function returns the entire content of the file as a single string.
Write a function find_acronyms(s) that receives as input a string s representing the text. The function returns a list of acronyms. For our example above, find_acronyms(s) returns the list [‘GPU’, ‘CPU’, ‘IT’]. Note: It is not important in which order the acronyms appear in the returned list.
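As a rough illustration of Q1 a), the sketch below reads a file and extracts acronyms with a regular expression. It assumes UTF-8 encoded files and plain ASCII capital letters, neither of which is stated in the brief, and it is only one possible approach.

```python
import re


def read_file(filename):
    """Return the entire content of the file as a single string."""
    # Assumption: the example files are UTF-8 encoded.
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()


def find_acronyms(s):
    """Return the distinct acronyms (runs of two or more capital letters) in s."""
    matches = re.findall(r"[A-Z]{2,}", s)
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    return list(dict.fromkeys(matches))


s = ("A GPU, which stands for graphics processing unit, is different from "
     "CPUs, says the IT expert. For some operations, a GPU is faster than a "
     "CPU. GPUs are not always faster though.")
print(find_acronyms(s))
```

Note that the plural “CPUs” still yields the acronym “CPU”, because only the run of capital letters is matched.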

Q1 b) Find the long forms (15 marks)
In this question the hard work is done: given the acronyms, your task is to find their long form in the text. To this end, write a function find_long_forms(s, acronyms). It receives as input a string s representing the text and a Python list of acronyms. The function returns a dictionary d with key-value pairs, where the key is the acronym and the value is its long form. For instance, in our example above the output is the dictionary d = {‘GPU’ : ‘graphics processing unit’, ‘CPU’ : None, ‘IT’ : None}.
You can make the following assumptions:
• The long form is found in the same sentence as the acronym itself.
• If the acronym occurs multiple times in a text, its long form is found in the first
sentence that contains the acronym.
• Every ‘.’ (dot) marks the end of a sentence. Sentences like “I talked to the Dr. and
raised my concerns.” where dots are contained within the sentence will not occur.
• The first letter of the acronym is the same letter as the first letter of the first word of
the long form. All of the letters in the acronym need to appear in the long form.
• If no long form can be found for an acronym, it is set to None (Python’s None type)
as in the dictionary above.
Four examples for texts with acronyms are given in the example files acronym_example1.txt, acronym_example2.txt etc. The corresponding tuples of (acronym, long form) are specified in the file acronym_tuples.txt.
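The assumptions above leave some freedom in how a long form is matched. The sketch below uses one deliberately simple heuristic, which is an assumption and not part of the brief: the long form has exactly one word per acronym letter, and the word initials spell the acronym. Long forms that do not fit this heuristic would need a more flexible search.

```python
import re


def find_long_forms(s, acronyms):
    """Map each acronym to a long form found in its first sentence, else None."""
    sentences = s.split(".")  # every '.' ends a sentence, as stated above
    d = {}
    for acronym in acronyms:
        d[acronym] = None
        # First sentence that contains the acronym.
        sentence = next((t for t in sentences if acronym in t), None)
        if sentence is None:
            continue
        words = re.findall(r"[A-Za-z]+", sentence)
        k = len(acronym)
        # Slide a window of k words over the sentence.
        for i in range(len(words) - k + 1):
            window = words[i:i + k]
            if any(w.startswith(acronym) for w in window):
                continue  # skip windows containing the acronym itself
            initials = "".join(w[0] for w in window).upper()
            if initials == acronym:
                d[acronym] = " ".join(window)
                break
    return d


s = ("A GPU, which stands for graphics processing unit, is different from "
     "CPUs, says the IT expert. For some operations, a GPU is faster than a "
     "CPU. GPUs are not always faster though.")
print(find_long_forms(s, ["GPU", "CPU", "IT"]))
```

A known weakness of this heuristic is false positives: a short acronym such as “IT” would match any two adjacent words starting with “i” and “t”.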
Q1 c) Replace acronyms by long forms (10 marks)
Assume we want to make the document more self-explanatory and replace its acronyms with their corresponding long forms. To this end, write a function replace_acronyms(s, d). It receives as input a string s representing the text, and a dictionary d which contains key-value pairs as defined in Q1b). The function returns another string as output. In this output, all acronyms in s have been replaced with their long forms. The following rules apply:
– If an acronym has a long form, the sentence wherein the long form was defined remains unchanged. For any other sentence, the acronym is replaced by the long form.
– If an acronym has no long form, it is not replaced anywhere.
– If you add the long form at the beginning of a sentence, make sure that its first word is
capitalised.
For instance, in our example above the output of the function is the string:
“A GPU, which stands for graphics processing unit, is different from CPUs, says the IT expert. For some operations, a graphics processing unit is faster than a CPU. Graphics processing units are not always faster though.”
As a starting point, use Q1.py from Learning Central. Do not rename the file or the function.
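The replacement rules can be sketched as follows. This is a minimal illustration, assuming that the defining sentence is the first sentence containing the acronym (as in Q1 b) and that sentences are split on every dot; it is not the only valid design.

```python
import re


def replace_acronyms(s, d):
    """Replace acronyms by their long forms outside the defining sentence."""
    original = s.split(".")
    sentences = list(original)
    for acronym, long_form in d.items():
        if long_form is None:
            continue  # acronyms without a long form are never replaced
        # Defining sentence: the first sentence that contains the acronym.
        first = next((i for i, t in enumerate(original) if acronym in t), None)
        for i, t in enumerate(sentences):
            if i == first:
                continue
            # A function replacement avoids escape handling in long_form.
            t = re.sub(r"\b" + re.escape(acronym), lambda m: long_form, t)
            stripped = t.lstrip()
            if stripped.startswith(long_form):
                # The long form opens the sentence, so capitalise its first word.
                lead = t[:len(t) - len(stripped)]
                t = lead + stripped[0].upper() + stripped[1:]
            sentences[i] = t
    return ".".join(sentences)


s = ("A GPU, which stands for graphics processing unit, is different from "
     "CPUs, says the IT expert. For some operations, a GPU is faster than a "
     "CPU. GPUs are not always faster though.")
d = {"GPU": "graphics processing unit", "CPU": None, "IT": None}
print(replace_acronyms(s, d))
```

Matching only a word boundary before the acronym means that a plural such as “GPUs” becomes “graphics processing units”.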

Question 2 – Statistics (Total 35 marks)
In this question, your task is to implement several statistical functions that perform t-tests, linear regression, and variable selection.
Q2 a) Mass t-tests (10 marks)
In this question, your task is to implement two functions that perform dependent and independent t-tests on input data. You can use the corresponding t-test functions in scipy.stats.
Write a function mass_paired_ttest(X) that performs a series of paired-samples t-tests. It receives as input a numpy array X with dimensions 𝑛×𝑝, where 𝑛 is the number of rows and
𝑝 is the number of columns. Each of the 𝑝 columns represents one sample. Your function
should find the pair of columns that yields the lowest p-value i.e. it is the ‘most significant’. Then the function returns a tuple with three elements (index of the first column from the pair, index of the second column from the pair, corresponding p-value).
Example: imagine your dataset has shape (100, 3), i.e. it has three columns. Assume the p-values for the three pairs of columns are p = 0.4 (col 0 vs col 1), p = 0.12 (col 0 vs col 2), p = 0.08 (col 1 vs col 2). The lowest p-value is obtained for col 1 vs col 2 and its value is 0.08, so the tuple that is returned by the function is t = (1, 2, 0.08).
Write a similar function mass_independent_ttest(*X) that performs a series of independent t-tests. It takes multiple inputs: Each input is a vector (1-D array) representing a single sample, so X is a list of Numpy arrays. The arrays can have different lengths. You can access each array using its index, e.g. X[0] is the first array, X[1] is the second array etc. Like for the paired-samples t-test, find the most significant pair of columns and return the tuple of three elements.
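One way to organise both functions is to loop over all pairs with itertools.combinations and keep the pair with the smallest p-value. The sketch below uses scipy.stats.ttest_rel and scipy.stats.ttest_ind with their default settings (for ttest_ind this means equal variances are assumed, which the brief does not specify).

```python
from itertools import combinations

import numpy as np
from scipy import stats


def mass_paired_ttest(X):
    """Return (i, j, p) for the pair of columns of X with the lowest p-value."""
    best = None
    for i, j in combinations(range(X.shape[1]), 2):
        p = stats.ttest_rel(X[:, i], X[:, j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)
    return best


def mass_independent_ttest(*X):
    """Return (i, j, p) for the pair of samples with the lowest p-value."""
    best = None
    for i, j in combinations(range(len(X)), 2):
        # Default settings: equal variances assumed (an assumption here).
        p = stats.ttest_ind(X[i], X[j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)
    return best


# Toy data (made up for illustration): the last sample has a shifted mean.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
data[:, 2] += 2.0
print(mass_paired_ttest(data))
print(mass_independent_ttest(rng.normal(size=30), rng.normal(size=40),
                             rng.normal(loc=3.0, size=25)))
```

Because combinations yields pairs with i < j, the first index in the returned tuple is always the smaller one.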
Q2 b) Ridge regression (10 marks)
In this question your task is to implement ridge regression from scratch using Numpy. Do not use statsmodels or scipy for this question. Ridge regression is a slightly modified version of linear regression which is more stable for collinear data.
Let us first develop the theory behind linear regression: Assume you have a vector of responses $y \in \mathbb{R}^n$, where $n$ is the number of samples. Let $x_1, x_2, x_3, \ldots, x_p \in \mathbb{R}^n$ be our predictors, where $p$ is the number of predictors. Then our linear regression model is
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_p x_p$$
with $\beta_0$ being the intercept and $\beta_1, \ldots, \beta_p$ being the slopes for the predictors. For convenience, we store our predictors in a matrix $X = [x_1, x_2, x_3, \ldots, x_p, \mathbf{1}] \in \mathbb{R}^{n \times (p+1)}$. In other words, each column of $X$ represents one predictor. The last column consists entirely of ones; it represents the intercept. We also store all $\beta$'s in a vector $B = [\beta_1, \beta_2, \ldots, \beta_p, \beta_0] \in \mathbb{R}^{p+1}$. To calculate $B$ we can use the equation
$$B = (X^\top X)^{-1} X^\top y$$
where $X^\top$ is the matrix transpose of $X$ and the superscript $(\cdot)^{-1}$ refers to the matrix inverse.
Unfortunately the inverse can be unstable or even undefined if $X^\top X$ is not well-conditioned. As a fix, we will use a different formula called ridge regression which adds a so-called regularization term $aI$:
$$B = (X^\top X + aI)^{-1} X^\top y$$
where $I \in \mathbb{R}^{(p+1) \times (p+1)}$ is an identity matrix and $a$ is a positive number that represents the regularization strength. The inverse then always exists as long as $a > 0$. The parameter $a$ has to be provided by the user.
Your task: Write a function fit_ridge(y, X, a) that implements ridge regression as defined above. It receives the following inputs:
– The response vector y is a numpy array with shape (n,1).
– The matrix X is a numpy array of predictors with shape (n, p). Note that X does not
contain the column of 1’s, so you need to add it yourself.
– The input a represents the strength of regularization. a can be either a single number
(e.g. a = 1) or a list with multiple numbers (e.g. a = [1, 5, 10]).
If a is a single number, the function returns $B$, the ridge regression coefficients using a for the regularization. If a is a list with multiple numbers, separate ridge regression solutions should be calculated for each value of a. In this case, the function returns a Python list of vectors of regression coefficients $[B_0, B_1, B_2, \ldots]$, where $B_0$ is the vector of regression coefficients using the first value of a, $B_1$ is the vector of regression coefficients using the second value of a, and so on.
Tip: remember that the * operator operates element-wise on Numpy arrays. If you want proper matrix or vector multiplication like in linear algebra, you can use the @ operator.
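The ridge formula above translates almost directly into Numpy. The sketch below is one possible reading of the task; instead of forming the matrix inverse explicitly it calls np.linalg.solve on the same linear system, which is a substitution made here for numerical stability and gives the same result mathematically.

```python
import numpy as np


def fit_ridge(y, X, a):
    """Ridge coefficients B = (X^T X + a I)^(-1) X^T y, intercept stored last."""
    n = X.shape[0]
    # Append the column of ones as the LAST column, matching the definition of X.
    X1 = np.hstack([X, np.ones((n, 1))])
    identity = np.eye(X1.shape[1])

    def solve(alpha):
        # Solves (X^T X + alpha I) B = X^T y without an explicit inverse.
        return np.linalg.solve(X1.T @ X1 + alpha * identity, X1.T @ y)

    if isinstance(a, (list, tuple)):
        return [solve(alpha) for alpha in a]
    return solve(a)


# Toy data (made up for illustration): y = 3*x1 - 2*x2 + 5, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, [0]] - 2 * X[:, [1]] + 5
print(fit_ridge(y, X, 1e-8).ravel())
print(len(fit_ridge(y, X, [1, 5, 10])))
```

With a very small a the solution approaches the ordinary least-squares coefficients; larger values of a shrink the coefficients, including the intercept, because the formula regularizes all p+1 entries.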

Q2 c) Variable selection in linear regression (15 marks)
In this question, your task is to use statsmodels to implement two variable selection functions for standard linear regression (a.k.a. OLS regression). The motivation is that regression models can have dozens or even hundreds of predictors. This can make it difficult to interpret the relationship between the predictors and the response variable y. Ideally, one wants to identify a subset of the predictors that carries most of the information about y. A possible approach is variable selection. In variable selection (‘variable’ means the same as ‘predictor’), variables get iteratively added or removed from the regression model. Once finished, the model typically contains only a subset of the original variables.
In the following, we will call a predictor “significant” if the p-value of its coefficient is smaller than or equal to a given threshold. Your approach operates in two stages: In stage 1, you iteratively remove predictors that are not significant. This leaves you with a subset of the original predictors. In stage 2, you iteratively add interaction terms and keep them in the model if they are significant. Remember what an interaction term is: if $x_1$ and $x_2$ are two predictors, then the variable $z = x_1 \cdot x_2$ is their corresponding interaction term. We will split the two stages into two functions:
Stage 1 (remove variables)
Write a function remove_variables(y, X, threshold = 0.05, variable_names = None). The function receives the following inputs:
• y and X are numpy arrays like in Q2b).
• threshold is the cut-off value that determines whether a p-value is significant. If a p-value <= threshold, it counts as significant.
• variable_names is a Python list of variable names that a user can provide. These are the names for the columns of X (e.g. ['TV', 'radio', 'newspaper'] for the advertisement dataset discussed in the lecture). If no variable names are provided, your function should create the variable names ['x1', 'x2', 'x3', ...] where 'x1' is the name for the first column of X, 'x2' is the name for the second column of X, and so on.
The function returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after non-significant variables have been removed. It should not include the column of 1’s corresponding to the intercept.
• new_variable_names is a list of strings containing the variable names for the columns of new_X.
Use the statsmodels function add_constant to make sure that X contains a column of 1's for the intercept, and use the intercept in all fits.
Next, these are the details on how to implement the two stages of variable selection:
• To start, fit an OLS model using all of the predictors in X.
• Identify the predictor whose coefficient has the largest p-value. If it is not significant, remove it and fit the model again.
• Repeat this process until either all predictors have been removed or all predictors left are significant.
• Never remove the intercept irrespective of whether or not it is significant.
• If no predictors are left after stage 1, return the tuple (None, None).
Tip: It might be useful to use Boolean arrays to select subsets of columns of X.
Stage 2 (add interaction terms)
Write a function add_interaction_terms(y, X, threshold = 0.05, variable_names = None). The inputs have the same meaning as in remove_variables. The function returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after the interaction terms have been added. Hence, it contains the predictors in X plus the interaction terms that have been added as new columns to the right. It should not contain the column of 1’s corresponding to the intercept term.
• new_variable_names is a list of strings containing the variable names for the columns of new_X. For interaction terms, use names that combine the two variable names with a ‘*’ sign. For instance, if you add the interaction term for ‘tv’ and ‘radio’, then call their interaction variable ‘tv*radio’.
The function implements the following algorithm:
• To start, fit an OLS model using all of the predictors in X.
• Test whether it is useful to add interaction terms: For each pair of predictors, add their interaction term into the model. If the interaction term is significant, keep it in the model. If it is not significant, remove it again.
• Continue this until you have checked every pair of predictors.
• Never add an interaction term involving the intercept.
• It can happen that when you add new interaction terms, predictors that you previously added become non-significant. You can ignore this issue.
• Add the interaction terms in order, starting from the leftmost predictor in X. For instance, if you have predictors with column indices 1, 2, 3, and 4, you first add the [1, 2] interaction term, then [1, 3], [1, 4], [2, 3], [2, 4], and finally [3, 4].
• After you have checked the interaction terms for all pairs of predictors, you are finished. Return the new set of predictors and variable names as defined above.
Finally, note that it should be possible to run both functions one after the other. For instance, given y and X, the following two lines of code
(new_X,new_variable_names)=remove_variables(y, X)
(new_X,new_variable_names)=add_interaction_terms(y, new_X, variable_names=new_variable_names)
should first perform removal of variables and then add interaction terms.
As a starting point, use Q2.py from Learning Central. Do not rename the file or the function.
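The pair ordering and the naming convention of stage 2 can be illustrated without any model fitting. The helper below, candidate_interactions, is a hypothetical name introduced only for this sketch; it generates the candidate interaction columns in the required left-to-right order, and the significance test with statsmodels would still have to be applied to each candidate.

```python
from itertools import combinations

import numpy as np


def candidate_interactions(X, variable_names=None):
    """Yield (name, column) for every pair of predictors, leftmost pairs first."""
    p = X.shape[1]
    if variable_names is None:
        # Default names 'x1', 'x2', ... for the columns of X.
        variable_names = [f"x{k + 1}" for k in range(p)]
    # combinations(range(p), 2) yields (0, 1), (0, 2), ..., (p-2, p-1).
    for i, j in combinations(range(p), 2):
        name = f"{variable_names[i]}*{variable_names[j]}"
        yield name, X[:, i] * X[:, j]  # element-wise product = interaction term


# Toy matrix (made up for illustration) with four predictors.
X = np.arange(12.0).reshape(3, 4)
for name, column in candidate_interactions(X):
    print(name, column)
```

Only the original columns of X are paired, so an interaction term involving the intercept is never produced.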
Question 3 – Ethics (Total 30 Marks)
In this question you will investigate bias in text corpora (document collections). You are provided with two datasets from a recent data science competition on Hyperpartisan News Detection [1]. These datasets are
- bias_corpus.txt: a corpus of news articles from media that have been classified as exhibiting right or left political bias.
- nobias_corpus.txt: a corpus of news articles that have been classified as neutral.
These newspaper articles are mostly written in the context of US politics. They could be used for building targeted political ads (reader of newspaper X will prefer to see ads of party Y), user or community profiling, etc. However, some articles may depict certain protected communities (women, immigrants or LGBT) in a negative way. This may bias any data science model built on top of this data.
In this question you implement a 'pattern matching' procedure for investigating how protected communities are depicted in both corpora (biased vs non-biased). As an inspiration, you can start experimenting with Hearst patterns [2], which are often used to identify word pairs in which a type-of relationship holds. An example for a Hearst pattern is ‘X is a type of Y’. The slots X and Y will be filled with matches in corpora, e.g., ‘cat is a type of animal’ or ‘sofa is a type of furniture’. Such patterns can also be used to reveal how certain communities are depicted. For example, ‘immigrants and other x’ would reveal how immigrants are depicted in these media. A neutral example could be ‘immigrants and other communities’, whereas a (negatively) biased example could be ‘immigrants and other criminals’.
An initial list of patterns is provided below (with actual examples from the data). However, you are free and encouraged to experiment with text patterns of your own design. You can experiment using only X, only Y, or both X and Y as empty slots (regex groups).
Pattern                      | Example occurrence
X is a Y                     | Obama is a citizen
X is Y                       | Trump is threatening
X and other Y                | Refugees and other criminals
marginalized Y, especially X | marginalized groups, especially gays
X works as a Y               | He works as a manager or She works as a hairdresser
Your tasks:
• Download and uncompress the text corpora from this url: https://drive.google.com/drive/folders/1ATp_zALwRRG5-rd9o0WEcP9IXKOGADSd?usp=sharing
• Decide on a person or community of interest. Define a pattern which you hypothesize is likely to reveal how this person/community is depicted. This pattern could include regular expressions and group matching for a slot x to fill. For example, the pattern ‘Trump is x’ is likely to match more verbs in non-biased media because they talk more about what he does (‘Trump is speaking’ or ‘Trump is attending’). In biased media, however, we could find more adjectives (‘Trump is arrogant’ or ‘Trump is great’).
• Retrieve and count the hits you get for each value of x in the biased and the non-biased corpora separately and store those in the dbias and dnobias dictionaries. For example, if ‘Trump is speaking’ occurs twice in the non-biased corpus, then the dictionary entry dnobias['speaking'] has the value 2.
• Do this pattern extraction process for three different persons/communities to obtain a total of three case studies. You can use the same or different patterns.
Then, report your results in the template document Q3.docx as follows: For each of the three case studies, write a short justification (up to 300 words for each) with
1. Your initial hypothesis, why you chose the pattern and no other, how many patterns you tried for the case you wanted to test, etc.
2. Provide the comparison match frequency table for the two corpora (see example, provided as a comment, at the end of Q3.py).
3. Discuss differences, if any, between the results obtained from the two corpora, and highlight the stereotypical or negative depictions you found.
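The counting step described in the tasks above can be sketched with a regex group and collections.Counter. The helper name count_pattern and the toy sentences are made up for this illustration and do not come from the corpora; a real run would pass the contents of bias_corpus.txt and nobias_corpus.txt instead.

```python
import re
from collections import Counter


def count_pattern(text, pattern):
    """Count how often each (lower-cased) slot filler x matches the pattern."""
    # With exactly one group in the pattern, re.findall returns the group text.
    return Counter(match.lower() for match in re.findall(pattern, text))


# 'Trump is x' with the slot x captured as a single word.
pattern = r"\bTrump is (\w+)"

# Toy stand-in for the non-biased corpus (invented, not real data).
toy_nobias = ("Trump is speaking today. Later, Trump is speaking again. "
              "Trump is attending a summit.")
dnobias = count_pattern(toy_nobias, pattern)
print(dnobias["speaking"])   # number of hits for x = 'speaking'
print(dnobias.most_common())
```

Counter behaves like a dictionary, so dnobias['speaking'] can be read exactly as in the task description, and missing fillers return 0 instead of raising an error.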
As a starting point, use Q3.py and Q3.docx from Learning Central. Your submission should only include the Word document, not the Python script.
References:
[1] Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., ... & Potthast, M. (2019, June). Semeval-2019 Task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation (pp. 829-839). (Available at https://www.aclweb.org/anthology/S19-2145/)
[2] Hearst, M. A. (1992, August). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational Linguistics-Volume 2 (pp. 539-545). Association for Computational Linguistics. (Available at https://www.aclweb.org/anthology/C92-2082/)
Learning Outcomes Assessed
• Carry out data analysis and statistical testing using code
• Critically analyse and discuss methods of data collection, management and storage
• Reflect upon the legal, ethical and social issues relating to data science and its applications
Criteria for assessment
Credit will be awarded against the following criteria. The score in each implemented function is judged by its functionality. For Q1 and Q2, the functions you have implemented will be tested against different data sets to judge their functionality. Additionally, quality and efficiency (Q1) will be assessed. For Q3, marks are based on the written report. The table below explains the criteria for each grade band: Distinction (70-100%), Merit (60-69%), Pass (50-59%) and Fail (0-50%).
Q1
Functionality (70%)
– Distinction (70-100%): Fully working application that demonstrates an excellent understanding of the assignment problem using a relevant Python approach.
– Merit (60-69%): All required functionality is met, and the application is working properly with some minor errors.
– Pass (50-59%): Some of the functionality developed, with incorrect output and major errors.
– Fail (0-50%): Faulty application with wrong implementation and wrong output.
Efficiency (15%)
– Distinction (70-100%): Excellent performance using a concise and elegant solution.
– Merit (60-69%): Good performance using a concise and appropriate solution.
– Pass (50-59%): Partial performance showing an appropriate approach.
– Fail (0-50%): Incorrect or highly inefficient approach.
Quality (15%)
– Distinction (70-100%): Excellent documentation with usage of docstrings and comments.
– Merit (60-69%): Good documentation with a few comments missing.
– Pass (50-59%): Fair documentation.
– Fail (0-50%): No comments or documentation at all.
Q2
Functionality (85%)
– Distinction (70-100%): Excellent working condition with no errors.
– Merit (60-69%): Mostly correct. Minor errors in output.
– Pass (50-59%): Major problem. Errors in output.
– Fail (0-50%): Mostly wrong or hardly implemented.
Quality (15%)
– Distinction (70-100%): Excellent documentation with usage of docstrings and comments.
– Merit (60-69%): Good documentation with a few comments missing.
– Pass (50-59%): Fair documentation.
– Fail (0-50%): No comments or documentation at all.
Q3
– Distinction (70-100%): All patterns and associated reflections implemented. Strong presentation of the hypothesis and discussion of the results.
– Merit (60-69%): All patterns and associated reflections implemented. Strong presentation of the hypothesis and discussion of the results, minor degree of overlap between the issues encountered and discussed. Several protected communities addressed.
– Pass (50-59%): All patterns and associated reflections implemented. Weak presentation of the hypothesis and discussion of the results, high degree of overlap between the issues encountered and discussed. Overall report reflects poor understanding of the task.
– Fail (0-50%): No or incomplete patterns, or with major flaws, or missing reflection. Discussion of results based on factually incorrect or non-existing data.
Feedback and suggestion for future learning
Feedback on your coursework will address the above criteria. Feedback and marks will be returned within 4 weeks of your submission date via Learning Central.
In case you require further details, you are welcome to schedule a one-to-one meeting. Feedback from this assignment will be useful for next year’s version of this module as well as the Python for Data Analysis module.