IE 7275 Data Mining in Engineering Homework 1
Deadline: 1/30/2018

Note:
o All the files you create need to be submitted along with your solution sheets on Blackboard.
o Excel is not allowed. You can use R, Python, MATLAB, even SPSS, SAS.

Problem 1 (Twitter Accounts) [20 points]

Twitter is a social news website. It can be viewed as a hybrid of email, instant messaging and SMS messaging, all rolled into one neat and simple package. It’s a new and easy way to discover the latest news related to subjects you care about.

The data set was crawled in July 2009. BlogCatalog is a social blog directory website. The data set contains the crawled friendship network. For easier understanding, all the contents and variables are organized in CSV file format.

First load the file M01_quasi_twitter.csv, then perform the following tasks:
a. How are the data distributed for the friend_count variable?
b. Compute the summary statistics (min, 1Q, mean, median, 3Q, max) for friend_count.
c. How is the data quality of the friend_count variable? Interpret your answer.
d. Produce a 3D scatter plot, with highlighting to give an impression of depth, for the following variables of the M01_quasi_twitter.csv data set: created_at_year, education, age. Title the plot “3D scatter plot”.
e. Suppose that 650, 1000, 900, 300 and 14900 Twitter accounts are in the UK, Canada, India, Australia and the US, respectively. Plot a percentage pie chart that shows each percentage with the country name adjacent to it, and also plot a 3D pie chart for those countries alongside the percentage pie chart. Hint: Use C=(1, 2) matrix form to plot the charts together.
f. Create a kernel density plot of the created_at_year variable and interpret the result.
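
As a rough illustration of the summary-statistics and pie-chart tasks in Problem 1, the following Python sketch computes the six requested statistics and the percentage labels. The `sample` list is made up for illustration and is not taken from M01_quasi_twitter.csv. The quartiles use the "inclusive" interpolation method, which is only one of several common conventions. The names `summary`, `sample` and `labels` are hypothetical. The country counts are those stated in the problem.

```python
import statistics


def summary(values):
    """Return min, 1Q, mean, median, 3Q, max for a list of numbers."""
    # "inclusive" interpolates between data points and treats the
    # smallest and largest values as the 0th and 100th percentiles.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "1Q": q1,
        "mean": statistics.fmean(values),
        "median": med,
        "3Q": q3,
        "max": max(values),
    }


# Hypothetical friend_count values, NOT taken from M01_quasi_twitter.csv.
sample = [0, 3, 10, 25, 40, 120, 5000]
stats = summary(sample)

# Account counts per country, as stated in the problem.
counts = {"UK": 650, "Canada": 1000, "India": 900, "Australia": 300, "US": 14900}
total = sum(counts.values())

# Label text for each pie slice: country name next to its percentage.
labels = {c: f"{c} {100 * n / total:.1f}%" for c, n in counts.items()}

for key, value in stats.items():
    print(key, value)
for text in labels.values():
    print(text)
```

In the made-up sample the mean is far above the median, which is the kind of right skew worth checking for in a count variable such as friend_count. The label strings can be passed to whichever plotting library is used for the pie charts.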

Problem 2 (Cereals Analysis) [20 points]

Download the dataset from the link below and see Problem 4.1 in the textbook.

http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html

Problem 3 (House Median Price in Boston) [20 points]

Consider the prediction of house price in Boston.

a. Preprocess all features so that they are on the same scale. Explain your method for the different types of data (e.g., numerical, categorical).
b. Implement correlation analysis for all features compared to the class (MEDV > 30) in the last column. Rank the features based on their correlation coefficients and identify the top 5 features.
c. Apply PCA to the “original” data and to the preprocessed data (from part a.). Compare the results (by showing the samples in a 2-D space of the first two PCs).
d. Continue part c. How many new features (PCs) are needed to contribute more than 90% of the variance in each case?
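
The scaling, correlation-ranking and 90% questions in Problem 3 can be sketched in plain Python as below. This is a minimal sketch under several assumptions. Z-score standardization is only one possible way to bring numerical features onto the same scale. The toy rows are invented and are not the Boston housing data. The helper names `zscore`, `pearson` and `components_needed` are hypothetical. Categorical features would need separate handling, which is not shown.

```python
import math


def zscore(column):
    """Standardize a numeric column to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of leading PCs whose cumulative variance share exceeds threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total > threshold:
            return k
    return len(eigenvalues)


# Hypothetical toy rows, NOT the Boston housing data.
features = {
    "rooms": [5.0, 6.0, 7.0, 8.0],
    "crime": [4.0, 3.0, 1.0, 0.5],
}
medv = [18.0, 22.0, 33.0, 45.0]
target = [1 if v > 30 else 0 for v in medv]  # class label: MEDV > 30

# Rank features by the absolute value of their correlation with the class.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], target)),
    reverse=True,
)
print(ranking)
print(zscore(features["rooms"]))
print(components_needed([8.0, 1.5, 0.5]))
```

Ranking by absolute value treats strong negative and strong positive correlations alike; whether that is the intended ranking is a choice to state in the answer. Pearson correlation is unchanged by z-scoring, so the ranking is the same on raw and standardized columns, whereas PCA results generally differ between the two.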

Problem 4 (Chemical Features of Wine) [20 points]

See Problem 4.4 in the textbook

Problem 5 (PCA) [20 points]

Consider the following data matrix D:

a. Compute the mean 𝜇 and covariance matrix Σ for D.
b. Compute the eigenvalues of Σ.
c. What is the “intrinsic” dimensionality of this dataset (discounting some small amount of variance)?
d. Compute the first principal component.
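
The steps of Problem 5 can be illustrated on a small two-column example. The matrix below is hypothetical and stands in for D, which is not reproduced here. The sketch assumes the sample covariance (divisor n - 1); a course convention using divisor n would scale the eigenvalues but leave the principal directions unchanged. The closed-form eigenvalue formula applies only to a symmetric 2x2 matrix.

```python
import math

# Hypothetical 2-column data matrix; NOT the matrix D from the problem.
D = [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0), (1.0, 3.0), (3.0, 1.0)]
n = len(D)

# Mean vector.
mu = (sum(x for x, _ in D) / n, sum(y for _, y in D) / n)

# Sample covariance matrix entries (divisor n - 1).
sxx = sum((x - mu[0]) ** 2 for x, _ in D) / (n - 1)
syy = sum((y - mu[1]) ** 2 for _, y in D) / (n - 1)
sxy = sum((x - mu[0]) * (y - mu[1]) for x, y in D) / (n - 1)

# Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]],
# obtained from its characteristic polynomial.
half_trace = (sxx + syy) / 2
radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# Eigenvector for lam1: (sxy, lam1 - sxx), valid when sxy != 0, then normalized.
vx, vy = sxy, lam1 - sxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Share of the total variance captured by the first principal component.
explained = lam1 / (lam1 + lam2)

print("mean:", mu)
print("covariance:", [[sxx, sxy], [sxy, syy]])
print("eigenvalues:", lam1, lam2)
print("first PC:", pc1)
print("variance share of PC1:", explained)
```

For this made-up matrix the first component captures 80% of the variance, so whether the data count as "intrinsically" one-dimensional depends on how much variance one is willing to discount.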