2/19/2021 Assignment #2 – kNN, Neural Networks & k-means Clustering
Assignment #2 – kNN, Neural Networks & k-means Clustering
Due: Thursday by 10:30am
Points: 100
Submitting: a website URL or a file upload
File types: pptx, knwf, and mp4
Available: after Feb 3 at 6pm
In this exercise we will see whether we can use kNN to predict loan default for 30,000 customers of a bank. We’ll also segment the customers to see whether default rates differ by age segment.
The data comes from this research: Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: (1 = male; 2 = female).
EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (year).
PAY_0 – PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: PAY_0 = the repayment status in September, 2005; PAY_1 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 – BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
PAY_AMT1 – PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.
Here is the data: default_of_credit_card_clients.csv
The column labeled DEFAULT is coded as follows:
DEFAULT = 0 means no default
DEFAULT = 1 means default
In class we learned about the k-Nearest Neighbor classification technique and the k-means clustering method. We will use both techniques in this assignment.
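As a reminder of the idea behind kNN, here is a minimal Python sketch (the assignment itself is done in KNIME). It classifies a query row by majority vote among its k nearest training rows using Euclidean distance. The function name `knn_predict` and the toy rows are made up for illustration, and the features are assumed to be already scaled to comparable ranges, since distance-based methods are sensitive to units.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row.
    dists = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    # Keep the k closest rows and count their labels.
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up rows (NOT from the assignment data): two features already scaled
# to [0, 1]; label 1 = default, 0 = no default.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.7)]
train_y = [1, 1, 1, 0, 0, 0]

prediction_near_defaulters = knn_predict(train_X, train_y, (0.12, 0.18), k=3)
prediction_near_non_defaulters = knn_predict(train_X, train_y, (0.88, 0.82), k=3)
```

With two classes, an odd k avoids tied votes, which is one practical consideration when choosing k.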
Your submission will, as usual, be through Canvas and will include a recording of a Zoom call presentation to senior executives, a PowerPoint presentation (*.pptx), and an exported KNIME workflow (*.knwf). Assume that the audience for your presentation will not remember the business questions that prompted the study. So, in your PowerPoint deck, remind us of those questions along with your findings.
Remember to do a thorough EDA. Then tell a narrative of your analysis in your presentation. Give some information about the source of the data and what you think the variables mean.
Q1: Slice and Dice
Q1.1 How many customers are in the sample?
Q1.2 What is the most common sex in the sample?
Q1.3 How many distinct values does marriage take on?
Q2: Histograms and Boxplots
Q2.1 How is BILL_AMT1 distributed by sex?
Q2.2 Are they normally distributed and how do you know?
Q2.3 Which sex has the most defaults?
Q3: Hunt for Relationships
Q3.1 Does there appear to be any relationship between default and AGE?
Q4: kNN Model
Q4.1 Build a model of default using kNN. Randomly partition the data into a training set (70%) and a validation set (30%). What value of k did you decide to use and why?
https://smu.instructure.com/courses/84418/assignments/482044
Q4.2 Score the validation data (predict) using the model. Produce a confusion table and an ROC for the scored validation data.
Q4.3 From the confusion table calculate the following metrics: accuracy, misclassification rate, true positive rate, false positive rate, specificity, precision, and prevalence.
Q4.4 Use k-means clustering to segment the customers on AGE. What value of k did you decide to use and why?
Q4.5 Build a model of default using kNN for each segment. Randomly partition the data into a training set (70%) and a validation set (30%) for each segment. What value of k did you decide to use and why?
Q4.6 Score the validation data (predict) using the models. Produce a confusion table for the scored validation data for each segment. How do they compare?
Q4.7 From the confusion tables for each segment calculate the following metrics: accuracy, misclassification rate, true positive rate, false positive rate, specificity, precision, and prevalence. How do they compare?
Q4.8 Produce an ROC curve for each AGE segment and report the AUCs.
Q4.9 Do any of the models built on the AGE segments have a better classification performance than the non-segmented population model? How much better or worse?
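For Q4.4, one common way to justify a value of k is to run k-means for several k and look for the "elbow" where the within-cluster sum of squares (WCSS) stops dropping sharply. The following is a minimal Python sketch of that idea for a single numeric variable such as AGE, not the KNIME workflow itself. The function name `kmeans_1d`, the deterministic initialization, and the ages are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on a list of numbers; returns (centroids, wcss)."""
    ordered = sorted(values)
    # Deterministic start (an illustrative choice): spread the initial
    # centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squares: squared distance to the nearest centroid.
    wcss = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, wcss


# Made-up ages (NOT from the assignment data) with three visible groups.
ages = [22, 23, 25, 26, 38, 40, 41, 43, 60, 62, 65]

# WCSS for k = 1..5; plotting these against k gives the elbow chart.
wcss_by_k = {k: kmeans_1d(ages, k)[1] for k in range(1, 6)}
centroids_k3, wcss_k3 = kmeans_1d(ages, 3)
```

On real data the elbow is often less clear-cut than in this toy example, so the business interpretability of the resulting age segments is also a reasonable criterion.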
Q5: Neural Network Model
Q5.1 Build a model of default using ANN. Randomly partition the data into a training set (70%) and a validation set (30%).
Q5.2 Score the validation data (predict) using the model. Produce a confusion table and an ROC for the scored validation data.
Q5.3 From the confusion table calculate the following metrics: accuracy, misclassification rate, true positive rate, false positive rate, specificity, precision, and prevalence.
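The metrics requested in Q4.3, Q4.7, and Q5.3 all follow from the four cells of a binary confusion table. The Python sketch below shows the standard definitions, assuming that "positive" means DEFAULT = 1. The function name `confusion_metrics` and the counts are made up for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion table (positive class = default)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,                # share of correct predictions
        "misclassification_rate": (fp + fn) / total,  # 1 - accuracy
        "true_positive_rate": tp / (tp + fn),         # sensitivity / recall
        "false_positive_rate": fp / (fp + tn),        # 1 - specificity
        "specificity": tn / (fp + tn),                # true negative rate
        "precision": tp / (tp + fp),                  # correct share of predicted defaults
        "prevalence": (tp + fn) / total,              # actual share of defaults
    }


# Made-up counts (NOT results from the assignment data).
metrics = confusion_metrics(tp=60, fp=40, fn=140, tn=760)
```

Note that when prevalence is low, a model can reach high accuracy while still having a low true positive rate, as in this toy example, so accuracy alone is a weak basis for comparison.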
Q6: Compare Models
Q6.1 Of the three models, which do you prefer to use and why?
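One possible basis for the comparison in Q6.1 is the area under the ROC curve (AUC), which equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The Python sketch below computes AUC directly from that definition. The function name `auc_from_scores`, the labels, and both sets of scores are made up for illustration.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up validation labels and predicted default probabilities from two
# hypothetical models (NOT results from the assignment data).
labels = [1, 1, 0, 0, 1, 0]
scores_model_a = [0.9, 0.8, 0.3, 0.4, 0.35, 0.1]
scores_model_b = [0.6, 0.4, 0.5, 0.7, 0.3, 0.2]

auc_a = auc_from_scores(labels, scores_model_a)
auc_b = auc_from_scores(labels, scores_model_b)
```

The pairwise loop is quadratic in the number of rows, which is acceptable for a sketch but slow for large validation sets. An AUC of 0.5 corresponds to random ranking.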
Logit & CART Rubric
Criteria | Ratings | Pts
Submitted on time | 20 pts Full Marks / 0 pts No Marks | 20 pts
Count of the number of observations (synonymous with cases or rows) in each data set. | 1 pt Full Marks / 0 pts No Marks | 1 pt
Make a list or table of all variables (synonymous with columns or features) and their data type (numeric or text). Includes detection of numeric variables which actually represent categorical concepts (e.g., Zip codes) or text variables that are really numeric concepts (e.g., date and time). | 2 pts Full Marks / 0 pts No Marks | 2 pts
Count missing values in each column and identify the rows that have missing values. | 2 pts Full Marks / 0 pts No Marks | 2 pts
Describe the missing value pattern and what action is taken to deal with it (imputation or case deletion). | 1 pt Full Marks / 0 pts No Marks | 1 pt
Count and identify outliers. Includes the identification of “odd” values such as nonsensical categorical values. | 1 pt Full Marks / 0 pts No Marks | 1 pt
Describe how outliers have been dealt with (imputation or case deletion). | 1 pt Full Marks / 0 pts No Marks | 1 pt
For numeric variables, calculate the four moments: mean, variance (standard deviation), skewness and kurtosis. State whether each numeric variable is normally distributed or not and how you determined that. | 4 pts Full Marks / 0 pts No Marks | 4 pts
For numeric variables, visualize the distributions with box plots or histograms. | 4 pts Full Marks / 0 pts No Marks | 4 pts
For numeric variables, do a correlational analysis that includes calculation of pairwise correlation statistics between ALL numeric variables and visualizations such as scatterplot matrices or tabulated heat maps. | 4 pts Full Marks / 0 pts No Marks | 4 pts
For categorical variables, calculate the frequencies of unique case values. | 5 pts Full Marks / 0 pts No Marks | 5 pts
KNIME nodes named by function in the workflow and neatly arranged. | 15 pts Full Marks / 0 pts No Marks | 15 pts
PowerPoint presentation: EDA findings presented in plain English with presentation-grade graphics suitable for technical and non-technical audiences. Pay particular attention to graph labeling. Senior executives will not read complex tables produced by analytic software. If tables and graphs are used, they must be interpreted for the audience and annotated either directly or with text comments such as bullet lists. | 10 pts Full Marks / 0 pts No Marks | 10 pts
Q1 | 5 pts Full Marks / 0 pts No Marks | 5 pts
Q2 | 5 pts Full Marks / 0 pts No Marks | 5 pts
Q3 | 5 pts Full Marks / 0 pts No Marks | 5 pts
Q4 | 5 pts Full Marks / 0 pts No Marks | 5 pts
Q5 | 5 pts Full Marks / 0 pts No Marks | 5 pts
Q6 | 5 pts Full Marks / 0 pts No Marks | 5 pts
Total Points: 100