MS6711 Data Mining
Exercise 1
• Explain the differences between statistical and machine-learning approaches to the analysis of large data sets.
• Enumerate the tasks that a data warehouse may solve as a part of the data mining process.
• What are the differences between data mining and OLAP?
• Explain why some large data sets cannot be analyzed using classical modeling techniques.
• Determine whether or not each of the following activities is a data-mining task. Discuss your answer.
  • Dividing the customers of a company according to their age and sex.
  • Classifying the customers of a company according to the level of their debt.
  • Analyzing the total sales of a company in the next month based on current-month sales.
  • Classifying a student database based on a department and sorting based on student identification numbers.
  • Estimating the future stock price of a company using historical records.
  • Predicting the outcome of tossing a pair of dice.
• Outline the CRISP-DM process.
• For each of the following meetings, explain which phase in the CRISP-DM process is represented:
  • Managers want to know by next week whether deployment will take place. Therefore, analysts meet to discuss how useful and accurate their model is.
  • The data mining project manager meets with the data warehousing manager to discuss how the data will be collected.
  • The data mining consultant meets with the Vice President for Marketing, who says that he would like to move forward with customer relationship management.
  • The data mining project manager meets with the production line supervisor to discuss implementation of changes and improvements.
  • The analysts meet to discuss whether the neural network or decision tree models should be applied.
• Go to the web site www.kdnuggets.com.
  • Locate articles about how data mining has been applied to solve real-world problems.
  • Follow the DATASETS link and browse the UCI KDD Database Repository for interesting datasets.
Exercise for SAS Enterprise Guide and SAS Programming
• Download the following SAS data sets from Canvas and place them all in one folder (say, EGProject1) on a local hard disk: Customer_Profile_Sample.sas7bdat, Broadband_Subscribers_Sample.sas7bdat, Billing_Accounts_Sample.sas7bdat, and Requests_Sample.sas7bdat.
• In EG, create a new project. Then save the new project as Exe1_Q9 in the folder EGProject1.
• Define a SAS library named MYDATA whose path is the folder EGProject1.
• Open the above data sets one by one inside this project via the library MYDATA.
• For the data set Broadband_Subscribers_Sample, extract the records with Activation_Date <= 31 May 2011 and 1 July 2011 <= Termination_Date <= 30 September 2011. Name the new data set Terminate_July and place it under the library MYDATA.
• The file Billing_Accounts_Sample contains the amount to pay in each month for each billing account, with one record for each service identity and billing month. Create a new data set that contains only one observation for each service identity. Each observation in the new data set should show the amount to pay for each month from Jan 2011 to May 2011. Name the new data set Billing_Jan_May and place it under the library MYDATA. The first few observations of Billing_Jan_May should look like these:
• The file Requests_Sample contains the number of requests or complaints made by each customer in each month, with one observation for the total number of requests or complaints of the same type made in each month. Create a new data set that contains only one observation for each customer. Each observation in the new data set should show the total number of requests or complaints made in each month from Jan 2011 to May 2011, regardless of the type of request or complaint. Name the new data set Requests_Jan_May and place it under the library MYDATA. The first few observations of Requests_Jan_May should look like these:
• Merge the data sets Terminate_July and Billing_Jan_May by the column Service_ID into a single data set. Only the Service_ID values that appear in Terminate_July should appear in the merged data set. Name the data set Temp1.
• Merge the data sets Customer_Profile_Sample, Requests_Jan_May, and Temp1 by the column Customer_ID into a single data set. Only the Customer_ID values that appear in Temp1 should appear in the merged data set. If a customer did not make any requests or complaints between Jan 2011 and May 2011, set the respective monthly totals of requests or complaints to 0. Name the data set ModelData_July and place it under the library MYDATA.
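The logic of the last five steps (filter by dates, reshape billing from long to wide, total the requests per month, merge keeping only terminated services, fill missing request counts with 0) can be checked against a small language-neutral sketch. The Python below is not SAS and is not the intended solution; it only illustrates the expected behaviour on a few made-up rows. The column names Billing_Month, Amount, Request_Month, Request_Type, and Count, the "YYYY-MM" month labels, and all sample values are assumptions, since the exercise does not list the actual columns. The join with Customer_Profile_Sample is omitted for brevity.

```python
from collections import defaultdict
from datetime import date

MONTHS = ["2011-01", "2011-02", "2011-03", "2011-04", "2011-05"]

# Hypothetical sample rows. Only Service_ID, Customer_ID, Activation_Date and
# Termination_Date are column names taken from the exercise text.
subscribers = [
    {"Service_ID": "S1", "Customer_ID": "C1",
     "Activation_Date": date(2010, 3, 1), "Termination_Date": date(2011, 7, 15)},
    {"Service_ID": "S2", "Customer_ID": "C2",
     "Activation_Date": date(2011, 6, 1), "Termination_Date": date(2011, 8, 1)},
    {"Service_ID": "S3", "Customer_ID": "C3",
     "Activation_Date": date(2009, 1, 1), "Termination_Date": date(2011, 9, 30)},
]
billing = [  # long format: one row per service identity and billing month
    {"Service_ID": "S1", "Billing_Month": "2011-01", "Amount": 100.0},
    {"Service_ID": "S1", "Billing_Month": "2011-02", "Amount": 120.0},
    {"Service_ID": "S3", "Billing_Month": "2011-05", "Amount": 80.0},
]
requests = [  # long format: one row per customer, month and request type
    {"Customer_ID": "C1", "Request_Month": "2011-01", "Request_Type": "A", "Count": 2},
    {"Customer_ID": "C1", "Request_Month": "2011-01", "Request_Type": "B", "Count": 1},
]

def filter_terminated(rows):
    """Keep services activated by 31 May 2011 and terminated in Jul-Sep 2011."""
    return [r for r in rows
            if r["Activation_Date"] <= date(2011, 5, 31)
            and date(2011, 7, 1) <= r["Termination_Date"] <= date(2011, 9, 30)]

def to_wide(rows, key, month_col, value_col, prefix):
    """Sum value_col per (key, month); return {key: {prefix_month: total}}."""
    wide = defaultdict(dict)
    for r in rows:
        if r[month_col] in MONTHS:
            col = f"{prefix}_{r[month_col]}"
            wide[r[key]][col] = wide[r[key]].get(col, 0) + r[value_col]
    return wide

def build_model_data(subscribers, billing, requests):
    billing_wide = to_wide(billing, "Service_ID", "Billing_Month", "Amount", "Bill")
    requests_wide = to_wide(requests, "Customer_ID", "Request_Month", "Count", "Req")
    result = []
    for sub in filter_terminated(subscribers):  # only terminated services survive
        row = dict(sub)
        bills = billing_wide.get(sub["Service_ID"], {})
        reqs = requests_wide.get(sub["Customer_ID"], {})
        for m in MONTHS:
            row[f"Bill_{m}"] = bills.get(f"Bill_{m}")  # stays None if no bill
            row[f"Req_{m}"] = reqs.get(f"Req_{m}", 0)  # 0 if no requests made
        result.append(row)
    return result

model_data = build_model_data(subscribers, billing, requests)
```

In this sketch, service S2 is dropped because it was activated after 31 May 2011, the two request rows for customer C1 in January are summed into a single monthly total, and customer C3, who made no requests, receives 0 for every month. Missing billing amounts are left empty because the exercise asks for zero-filling of request counts only.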