(50 pts) Part 1: Missing Data Analysis and Dimensionality Reduction¶
To answer the first part of Lab 1, use the data described in the Jupyter tutorial presented in class and derived from the following paper:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019. (https://arxiv.org/abs/1909.09638).
The data can also be found at https://www.kaggle.com/sobhanmoosavi/us-accidents.
Objective of this Lab:¶
• Inform you about best practices pertaining to missing data.
• Reduce the number of attributes.

(25 pts) Question 1: Missing Data¶
Missing data occurs when no value is recorded for one or more attributes of an observation. This can significantly affect the inferences drawn from the data and must be addressed carefully to minimize that impact. The “accident risk” dataset contains a certain amount of missing data, and this problem needs to be addressed. Fill in the missing values for every column with continuous data using the following three strategies:
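Before choosing a strategy, it helps to quantify how much data is missing per column. A minimal sketch on a small synthetic stand-in for the accident data (the counting logic is identical on the full CSV):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for ./data/US_Accidents_May19.csv,
# mimicking a few of its columns and missingness patterns.
df = pd.DataFrame({
    "TMC": [241.0, 201.0, np.nan, 201.0, 201.0],
    "End_Lat": [np.nan, np.nan, 39.99384, np.nan, np.nan],
    "Severity": [2, 2, 4, 2, 2],
})

# Fraction of missing values per column, worst first.
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)
```

Columns with a large missing fraction are candidates for imputation (or removal); columns with none can be used directly as predictors later.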
In [2]:
import pandas as pd

# be sure to make use of the full dataset.
df = pd.read_csv("./data/US_Accidents_May19.csv")
df.head()
Out[2]:

|   | ID | Source | TMC | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | … | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight |
|---|----|--------|-----|----------|------------|----------|-----------|-----------|---------|---------|---|------------|---------|------|-----------------|----------------|--------------|----------------|----------------|-------------------|-----------------------|
| 0 | A-603309 | MapQuest | 241.0 | 2 | 2019-02-15 16:56:38 | 2019-02-15 17:26:16 | 34.725163 | -86.596359 | NaN | NaN | … | False | False | False | False | False | False | Day | Day | Day | Day |
| 1 | A-676455 | MapQuest | 201.0 | 2 | 2019-01-26 15:54:00 | 2019-01-26 16:25:00 | 32.883579 | -80.019722 | NaN | NaN | … | False | False | False | False | False | False | Day | Day | Day | Day |
| 2 | A-2170526 | Bing | NaN | 4 | 2018-01-06 10:51:40 | 2018-01-06 16:51:40 | 39.979172 | -82.983870 | 39.99384 | -82.98502 | … | False | False | False | False | False | False | Day | Day | Day | Day |
| 3 | A-1162086 | MapQuest | 201.0 | 2 | 2018-05-25 16:12:02 | 2018-05-25 17:11:49 | 34.857208 | -82.256157 | NaN | NaN | … | False | False | False | False | True | False | Day | Day | Day | Day |
| 4 | A-309058 | MapQuest | 201.0 | 2 | 2016-10-26 19:42:11 | 2016-10-26 20:26:58 | 47.662689 | -117.357658 | NaN | NaN | … | False | False | False | False | True | False | Night | Night | Night | Night |

5 rows × 49 columns


(10 pts) Q 1.1: Use the mean to fill the missing values for any given attribute.¶
First compute the mean of the observed values for each attribute with missing data. Then use that per-attribute mean to fill in the missing entries.
In [3]:
# code here
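A minimal sketch of this strategy, on a tiny synthetic frame rather than the full accident data (with the real data, apply the same fill to every continuous column):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one continuous column with a missing value.
df = pd.DataFrame({"TMC": [241.0, 201.0, np.nan, 201.0],
                   "Severity": [2, 2, 4, 2]})

# Fill each numeric column's NaNs with that column's observed mean.
num_cols = df.select_dtypes(include="number").columns
df_mean = df.copy()
df_mean[num_cols] = df_mean[num_cols].fillna(df_mean[num_cols].mean())
```

`DataFrame.mean()` skips NaNs by default, so each fill value is computed only from the observed entries of that column.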

(15 pts) Q 1.2: Randomly sample based on the probability distribution of the given column data.¶
Generate and fill missing values using random samples drawn from the probability distribution estimated from the observed values of each column. Verify your estimated pdf by comparing the histogram of the observed values against that of the sampled values. Remember to fill all columns containing missing data.
Steps to generate random samples by estimating the probability distribution:
• First estimate the probability distribution of the attribute from the existing observations.
▪ Draw a histogram of the observed values.
▪ Estimate the distribution using KDE.
• Using the pdf, compute the cdf (make sure the cdf is normalized to the range $[0, 1]$).
• Construct the inverse cdf function, and pass a random value drawn from a uniform distribution on $[0, 1]$ to the inverse function to obtain random samples from the estimated pdf.
i.e.,
$$r = \mathit{Uniform}(0, 1)$$ $$x = \mathit{cdf}^{-1}(r)$$
In [4]:
# code here
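The steps above can be sketched as follows, using a synthetic observed column as a stand-in for any continuous column of the accident data, `scipy.stats.gaussian_kde` for the density estimate, and `np.interp` as a numerical inverse cdf:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in for the observed (non-missing) values of a column.
observed = rng.normal(loc=60.0, scale=10.0, size=500)

# 1) Estimate the pdf with a Gaussian KDE.
kde = gaussian_kde(observed)

# 2) Evaluate the pdf on a grid and accumulate into a normalized cdf.
grid = np.linspace(observed.min(), observed.max(), 1000)
pdf = kde(grid)
cdf = np.cumsum(pdf)
cdf = cdf / cdf[-1]          # normalize so the cdf spans [0, 1]

# 3) Inverse-transform sampling: r ~ Uniform(0, 1), x = cdf^{-1}(r),
#    implemented by interpolating the grid against the cdf.
r = rng.uniform(0.0, 1.0, size=1000)
samples = np.interp(r, cdf, grid)
```

`samples` can then be used to fill the column's missing entries; comparing `np.histogram(observed)` with `np.histogram(samples)` is one way to verify the estimated pdf, as the question asks.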

(Bonus – 15 pts) Q 1.3: Use linear regression on other continuous columns to fill in the missing data.¶
Make sure to use multiple attributes (columns) that do not contain too many missing values ($< 5\%$) as training features. Say you select four such training attributes, each with some possible missing values. Select only those rows that are fully populated for all four attributes, then use multivariate regression to predict the missing values.
In [ ]:
# code here
import numpy as np
from sklearn.linear_model import LinearRegression

# create 2d array X here
# X = all the other predictors as a 2d array,
# where rows are the samples and columns are training features,
# i.e., attributes that have < 5% missing values

# create 1d array y here
# y = the missing-data column that you wish to fill; it must
# contain only the observed values for training

# train the model
reg = LinearRegression().fit(X, y)

# predict the values using the same set of training features;
# here the rows are the samples whose y column is missing,
# and the columns are the same training features for those samples
y_miss = reg.predict(X_miss)

# now repeat this for all columns containing missing data

(25 points) Question 2: Dimensionality Reduction¶
Often, data is collected with a certain design and purpose. After collection, it is often determined that certain attributes have no impact on the sought inferences and/or predictions. It is therefore of value (storage, computing time, etc.) to reduce the amount of data; think of compression schemes. We ask you to do the same for the accident risk data. There are many methods to reduce data size. For this laboratory, we ask you to use a dimensionality-reduction technique called Principal Component Analysis (PCA). Conduct dimensionality reduction using PCA on the two datasets generated after answering Question 1.1 and Question 1.2, respectively. Follow these steps:
1. Conduct exploratory analysis in the following manner¶
1.1.
Consider only the columns with continuous data (there should be 10 weather-related columns, nine of which are continuous).¶
In [9]:
# code here
1.2. Choose five attributes you think are important for predicting severity. Please feel free to consult the paper if necessary.¶
In [ ]:
# code here
1.3. Plot a scatter-plot matrix of all five attributes for 1000 randomly selected points.¶
In [ ]:
# code here
1.4. Comment on what you observe.¶
1.5. Plot histograms for each of the five data columns (all in one figure).¶
In [10]:
# code here
2. Now, conduct a PCA on all 9 weather attributes and retain the first two principal components.¶
In [ ]:
import numpy as np
from sklearn.decomposition import PCA

def pca_model(X):
    # code here
    pca = PCA(n_components=2)
    pca.fit(X)
    return pca

# build X
# X is a 2d matrix, where rows are the samples
# and the columns are the 9 weather-related
# continuous features.
pca = pca_model(X)
3. Retain only two columns of the transformed data.¶
In [ ]:
# now transform the data
X_2col = pca.transform(X)
# code to display or visualize the two columns
4. Next, visualize the transformed data in two dimensions.¶
4.1. Draw a scatter plot using the two dimensions. Color each datapoint by its severity category, using four different colors.¶
In [11]:
def pca_plotter(X, severity, s_colors):
    """
    X is the data to be transformed.
    severity is the data column containing severity information.
    s_colors is a list containing the colors to assign to each severity value.
    """
    # code here
4.2. What differences do you observe between the two or three datasets (from the answers to Q1.1, Q1.2 and/or Q1.3) after dimensionality reduction?¶
In [12]:
# code here
# X_q1 = PCA transformed data after Q1.1
# X_q2 = PCA transformed data after Q1.2
# X_q3 = PCA transformed data after Q1.3 if you answered the bonus question.
for x in [X_q1, X_q2, X_q3]:
    pca_plotter(x, severity, s_colors)

(50 pts) Part 2: Visualizations and Explorations of Unstructured Text Data¶
Text data is hard to work with and requires special tools to analyze and visualize. We will analyze text corpora using parsers and tokenizers. Then, using word clouds, we will visualize the distributions of word tokens in a perceptually meaningful way, which will let us extract and glean patterns from unstructured text data. We will use this data collection for this part of the laboratory: https://www.kaggle.com/stackoverflow/pythonquestions

(25 pts) Question 1: Create a word cloud for tags and answer the following questions.¶
Please visit the following data source: https://www.kaggle.com/stackoverflow/pythonquestions#Tags.csv. This data consists of the full text of all questions and answers from Stack Overflow that are tagged with the python tag. It is useful for developing natural language processing methods pertaining to “Q&A” and community response. It is often of interest to determine which words (or tags) occur either very frequently or very rarely. Our goal is to find those words and assign each such tag/word a measure of strength. To weigh each word’s strength:
1. Use frequency counts for each tag to highlight the most frequent tags.¶
In [ ]:
df = pd.read_csv("./data/Tags.csv")
tags = # code here to extract tags as a list

def freq_counter(tags):
    # code here
    pass

d = freq_counter(tags)
# d = dict where keys are the tags,
# and values are their corresponding frequency counts.
print(d)
2. Collect the results in a table and save it as an alltags.csv file.¶
In [ ]:
import pandas as pd
# code here to convert to a dataframe and save the csv file
3. Plot the histogram of the tags. Choose an appropriate number to plot.¶
In [ ]:
def plot_histogram(freq_dict):
    # code here

plot_histogram(freq_dict)
4. What distribution would best describe the occurrence of tags? What do you observe?¶
5.
Now use the inverse of the frequency counts (1/f) to highlight rarely occurring tags.¶
In [ ]:
def inverse_freq_counter(tags):
    # code here

invf = inverse_freq_counter(tags)
plot_histogram(invf)
6. Collect the results in a table and save it as a mostfrequent.csv file.¶
In [15]:
# code here

(25 pts) Question 2: Use word clouds to answer the following questions.¶
Find the other two data files needed for this part of the laboratory:
https://www.kaggle.com/stackoverflow/pythonquestions#Answers.csv;
https://www.kaggle.com/stackoverflow/pythonquestions#Questions.csv.
Here you will tokenize the text and extract frequency counts. Make sure to also remove stop words. These occur frequently in English and include articles, prepositions, auxiliary verbs, etc. (e.g., “an”, “a”, “the”, “is”, “at”). Typically, in text pre-processing, stop words are filtered out before the actual processing of any natural language text. A typical word cloud shows all the words in a collection placed in such a way that frequently occurring words appear prominent (large, colorful, etc.). Word clouds can help identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. Visit https://en.wikipedia.org/wiki/Tag_cloud for some explanation. Use nltk (PyPI)(GitHub)(docs) for tokenization and general text-processing needs. Make use of the python package wordcloud (PyPI)(GitHub)(docs) to do the heavy lifting for the actual visualization.
1. Build a word cloud for the python tags and compare the visualization with the histogram. Do you observe any significant differences between the two types of visualizations?¶
In [ ]:
from wordcloud import WordCloud

def plot_word_cloud(freq_dict):
    # code here
    wc = WordCloud(background_color="black")
    wc.generate_from_frequencies(freq_dict)

# run this to plot them both
plot_histogram(freq_dict)
plot_word_cloud(freq_dict)
2. What are the most frequent words in Questions.csv? What are the most frequent words in Answers.csv?
Plot adjacent word clouds.¶
In [ ]:
import nltk
import pandas as pd

ans_df = pd.read_csv("./data/Answers.csv")
ques_df = pd.read_csv("./data/Questions.csv")

def text_freq_counter(texts):
    # code here
    # be sure to remove all stop words in the question and answer data
    from nltk.corpus import stopwords
    print(set(stopwords.words('english')))

# Generate word cloud here
plot_word_cloud(freq_dict)
3. What are the most frequent words combined (StackOverflow Questions + Answers)? Comment on the differences you observe between the three!¶
In [18]:
# code here
4. Assign a different color to words depending on where they belong: shades of red for words that only exist in StackOverflow Questions, blue for words only in StackOverflow Answers, and purple for the combined set. What do you observe?¶
In [19]:
# code here
5. Do you find that the same set of words appears as highly frequent in all three collections? If yes, highlight them. Describe how you decided which words would be deemed highly frequent. What criteria did you use? How many did you choose to display? Why?¶
In [21]:
# code here
6. Can you find words that are frequent in StackOverflow Answers but rare in StackOverflow Questions? What about vice versa? What can you say about the differences?¶
In [ ]:
# code here

Appendix A¶
To decide how to handle missing data, we must first understand why exactly the data are missing and what the causes could be. Then we can deploy strategies to handle the missing data. For more information, please peruse this manuscript. Some introduction can also be found on this Wiki page (https://en.wikipedia.org/wiki/Missing_data).
Mechanics of Missing Data:¶
(Rubin, Donald B. "Inference and missing data." Biometrika 63.3 (1976): 581-592.)
There are some well-understood mechanisms or assumptions you can make about your missing data to reach an informed decision on how to handle it:
1.
Missing Completely At Random (MCAR): If an attribute's data points are missing completely at random, the probability of the attribute missing its value is uniformly the same for every sample. For example: every time the die rolls a $6$, do not enter the value. In such scenarios, if the missing data is $< 5\%$, removing the samples containing missing data will not bias your inference, since the observed distribution is the same as the unobserved one. If the missing data is more substantial ($5-40\%$), one can impute using linear (numerical data) or logistic (categorical data) regression.
 2. Missing At Random (MAR): In reality, nothing is truly random; a variable is usually dependent on, or conditional on, existing attributes. This dependency is often hidden and cannot be easily discerned. For example, if blood pressure data are missing at random conditional on age, gender and ethnicity, then the distributions of missing and observed blood pressures will be similar among people of the same age, gender and ethnicity. Thus, the probability of data being missing is somewhat dependent on other attributes and is not truly random. In such cases, if the missing data is proportionally small, we can simply remove those samples. If, however, they are substantial, we can interpolate using linear or logistic regression against the conditional attributes.
 3. Missing Not At Random (MNAR): When the data is neither MCAR nor MAR, and the missingness depends on the value itself, it is harder to fix. For example, individuals rarely provide their weight or their age, given personal apprehensions and cultural stigmas. In these cases missing data drastically affects the distribution and hence the final inference. Bias and inaccuracies are common, and the popular press is rife with such mishaps. The best technical solution is to model the missing data along with your responses or targets.

Practical Handling of Missing Data:¶
The following are various ways to deal with missing data:
1. You can eliminate rows that contain the missing data points. If the missing data occurs in $< 5\%$ of rows, and you know the missingness is completely random (MCAR) or conditionally random (MAR), then you can choose to simply eliminate those rows.
 2. You can also choose to eliminate columns if you think the attribute will not provide any useful information in your analysis. This approach can also be taken if a substantial amount of the data is missing ($50$ to $60\%$ or greater) and including the attribute would heavily bias your inference.
 3. Estimate missing values: A. In the case of categorical data, you can use logistic regression to predict the missing classes/labels. However, this only works if the reason for missingness is MCAR or MAR. In the case of MNAR you cannot use existing predictors, and many of the missing classes will not have sufficient ground truth; in such cases you may have to jointly model the missing values along with the missingness.
 B. In the case of continuous data, you can use simple aggregate measures like the mean, median or mode, as these typically do not affect the overall distribution of the attribute. However, this only works when the missing values are not substantial ($< 5\%$). For a larger number of missing values, you can apply linear regression using existing predictors (use all of them in the MCAR case, or add feature-subset selection to the linear regression for MAR).
 4. Another simple strategy is random sampling based on the probability distribution of the given data. This works really well when you know the missingness is due to MCAR.
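As a rough sketch of strategies 1 and 3B combined, on synthetic data (the 5% threshold and the drop-vs-impute choice follow the discussion above; column names are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 40

# Synthetic stand-in: "pressure" has 20% missing (too much to drop),
# "temp" has 2.5% missing (safe to drop rows under the < 5% rule).
pressure = rng.normal(30.0, 0.2, n)
pressure[:8] = np.nan
temp = rng.normal(60.0, 3.0, n)
temp[0] = np.nan
df = pd.DataFrame({"pressure": pressure, "temp": temp})

cleaned = df.copy()
for col in cleaned.columns:
    if cleaned[col].isna().mean() < 0.05:
        # Few missing values (MCAR/MAR assumed): drop the affected rows.
        cleaned = cleaned.dropna(subset=[col])
    else:
        # Substantial missing values: impute with an aggregate (here, the mean).
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
```

The result has no missing values: "pressure" is mean-imputed, while the single row missing "temp" is dropped. On the real data, a regression or distribution-sampling imputer (as in Q1.2/Q1.3) can replace the mean fill.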