IE 7275 Data Mining in Engineering Homework 1
Deadline: 1/30/2018
Note:
o All the files you create need to be submitted along with your solution sheets on
Blackboard.
o Excel is not allowed. You can use R, Python, MATLAB, even SPSS, SAS.
Problem 1 (Twitter Accounts) [20 points]
Twitter is a social news website. It can be viewed as a hybrid of email, instant messaging, and SMS messaging all rolled into one neat and simple package. It is a new and easy way to discover the latest news related to subjects you care about.
The data set was crawled in July 2009. BlogCatalog is a social blog directory website. The data set contains the crawled friendship network. For easier understanding, all the contents and variables are organized in CSV file format.
First load the file M01_quasi_twitter.csv, next perform the following tasks:
a. How are the data distributed for the friend_count variable?
b. Compute the summary statistics (min, 1Q, mean, median, 3Q, max) of
friend_count.
c. How is the data quality of the friend_count variable? Interpret your answer.
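As a starting point for parts a to c, the summary statistics and a few basic quality checks can be sketched in Python with pandas. This is only a sketch under assumptions: the helper names summarize and quality_report are hypothetical, the demo values are made up and are not taken from M01_quasi_twitter.csv, and the 1.5 x IQR fence is one common outlier convention rather than a required method.

```python
import pandas as pd


def summarize(values: pd.Series) -> dict:
    """Return the six summary statistics asked for in part b."""
    return {
        "min": values.min(),
        "1Q": values.quantile(0.25),
        "mean": values.mean(),
        "median": values.median(),
        "3Q": values.quantile(0.75),
        "max": values.max(),
    }


def quality_report(values: pd.Series) -> dict:
    """Count common data-quality problems for a count-type variable (part c)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey-style fence (an assumed convention)
    return {
        "missing": int(values.isna().sum()),
        "negative": int((values < 0).sum()),  # a count should never be negative
        "above_upper_fence": int((values > upper_fence).sum()),
        "skewness": values.skew(),  # large positive value suggests a long right tail
    }


# Made-up values standing in for friend_count. The real column would come from
# pd.read_csv("M01_quasi_twitter.csv")["friend_count"].
demo = pd.Series([0, 1, 2, 3, 4, 100, None, -5], dtype="float64")
print(summarize(demo))
print(quality_report(demo))
```

A mean far above the median, together with a large positive skewness, points to a right-skewed distribution, which is worth checking with a histogram before answering part a.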
- Produce a 3D scatter plot, with highlighting to give an impression of depth, for the
following variables of the M01_quasi_twitter.csv dataset: created_at_year, education, age.
Title the scatter plot “3D scatter plot”.
- Consider that 650, 1000, 900, 300, and 14900 Twitter accounts are in the UK, Canada,
India, Australia, and the US, respectively. Plot a percentage pie chart that shows the percentage and the country name next to each slice, and also plot a 3D pie chart for those countries alongside the percentage pie chart. Hint: Use C=(1, 2) matrix form to plot the charts together.
- Create a kernel density plot of the created_at_year variable and interpret the result.
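For the kernel density task, it can help to see what a density estimate computes before calling a library routine. The sketch below evaluates a Gaussian kernel density estimate with NumPy only. The function name gaussian_kde_1d is hypothetical, the years are made-up stand-ins for created_at_year, and the bandwidth uses the normal reference rule of thumb, which is an assumption (other bandwidth choices are equally valid).

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal reference rule of thumb (assumed default, not prescribed).
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, then averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)


# Made-up years standing in for the created_at_year column.
years = np.array([2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009, 2009, 2009])
grid = np.linspace(2002, 2013, 500)
density = gaussian_kde_1d(years, grid)

# Sanity check: a density should integrate to roughly 1 over a wide enough grid.
area = (density[:-1] + density[1:]).sum() * (grid[1] - grid[0]) / 2
print(round(area, 3), grid[density.argmax()])
```

Plotting density against grid with any plotting library gives the kernel density plot; the location and number of peaks are what the interpretation should discuss.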
Problem 2 (Cereals Analysis) [20 points]
Download the dataset from the link below and see Problem 4.1 in the textbook.
http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
Problem 3 (House Median Price in Boston) [20 points]
Consider the prediction of house price in Boston.
- Preprocess all features to the same scale. Explain your method for different types
(e.g., numerical, categorical) of data.
- Implement correlation analysis for all features compared to the class (MEDV > 30)
in the last column. Rank features based on their correlation coefficients and
identify the top 5 features.
- Apply PCA to the “original” data and the preprocessed data (from part a.).
Compare results (by showing samples in a 2-D space of the first two PCs).
- Continuing from part c, how many new features (PCs) are needed to account for
more than 90% of the variance in each case?
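The four steps of Problem 3 can be sketched end to end in Python with NumPy and pandas. Everything here is illustrative and rests on assumptions: the helper names (preprocess, rank_by_correlation, pca, components_needed) are hypothetical, the demo table is randomly generated and only borrows the column names RM, TAX, and MEDV, z-scoring plus one-hot encoding is one possible preprocessing choice, and "contribute more than 90%" is read as cumulative explained variance.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score numeric columns and one-hot encode categorical ones."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)
    if categorical.shape[1] == 0:
        return scaled
    dummies = pd.get_dummies(categorical, dtype=float)
    return pd.concat([scaled, dummies], axis=1)


def rank_by_correlation(features: pd.DataFrame, target: pd.Series, top: int = 5) -> pd.Series:
    """Rank features by absolute Pearson correlation with a binary target."""
    corr = features.corrwith(target.astype(float))
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top)


def pca(X: np.ndarray):
    """Return the PC scores and the explained-variance ratio of each PC."""
    centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return centered @ vt.T, explained


def components_needed(explained: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest number of leading PCs whose cumulative ratio exceeds threshold."""
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, threshold, side="right") + 1)


# Randomly generated stand-in data; NOT the Boston housing values.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, n),        # small-scale feature
    "TAX": rng.normal(400.0, 170.0, n),   # large-scale feature
    "NOISE": rng.normal(0.0, 1.0, n),     # unrelated feature
})
demo["MEDV"] = 5.0 * demo["RM"] - 4.0 + rng.normal(0.0, 2.0, n)

target = demo["MEDV"] > 30            # the binary class from the last column
features = demo.drop(columns="MEDV")
scaled = preprocess(features)

print(rank_by_correlation(scaled, target, top=3))
raw_scores, raw_ratio = pca(features.to_numpy())
scaled_scores, scaled_ratio = pca(scaled.to_numpy())
print(components_needed(raw_ratio), components_needed(scaled_ratio))
```

The first two columns of raw_scores and scaled_scores are the coordinates for the 2-D comparison plot. On unscaled data the feature with the largest variance tends to dominate the first PC, which is why the two cases can need very different numbers of components.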
Problem 4 (Chemical Features of Wine) [20 points]
See Problem 4.4 in the textbook
Problem 5 (PCA) [20 points]
Consider the following data matrix D:
- Compute the mean 𝜇 and covariance matrix Σ for D.
- Compute the eigenvalues of Σ.
- What is the “intrinsic” dimensionality of this dataset (discounting some small
amount of variance)?
- Compute the first principal component.
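The quantities requested in Problem 5 can be written out in general form. The entries of D are not used here, so these are only the standard definitions, under the assumption that D has n rows x_i, each a point in d dimensions, and that the covariance is normalized by 1/n (some textbooks use 1/(n-1) instead, which rescales every eigenvalue by the same factor and leaves the principal components unchanged).

```latex
\begin{aligned}
\mu &= \frac{1}{n}\sum_{i=1}^{n} x_i, \\
\Sigma &= \frac{1}{n}\sum_{i=1}^{n} (x_i-\mu)(x_i-\mu)^{T}, \\
\det(\Sigma - \lambda I) &= 0
  \quad\Rightarrow\quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, \\
\Sigma u_1 &= \lambda_1 u_1, \qquad \lVert u_1 \rVert = 1, \\
f(r) &= \frac{\sum_{j=1}^{r}\lambda_j}{\sum_{j=1}^{d}\lambda_j}.
\end{aligned}
```

The first principal component is the unit eigenvector u_1 belonging to the largest eigenvalue, and one reasonable reading of the "intrinsic" dimensionality is the smallest r for which the variance fraction f(r) is close to 1.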