Homework Assignment #5

 

 


All data sets relevant for this assignment can be found at the UCI Machine Learning Repository, at http://archive.ics.uci.edu/ml/

Problem 1 (45 points) Download three different data sets (of your choice) for binary classification. Plan on taking only the data with real-valued features; however, if some features are categorical, say with c attribute values, convert them to c binary features. You can also use any of the data sets for regression where the median value for the target can be used to generate classes 0 and 1. If you use the Iris data you can group two of the flowers as class 0 and the third as class 1.
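The two conversions described above can be sketched as follows. This is a minimal illustration in plain Python, not the Matlab code from class; the function names `one_hot` and `binarize_by_median` are hypothetical, and assigning values strictly above the median to class 1 is an assumption (the assignment does not say how to treat values equal to the median).

```python
from statistics import median


def one_hot(values):
    """Map a categorical column with c distinct values to c binary columns."""
    categories = sorted(set(values))  # fixed, reproducible column order
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


def binarize_by_median(targets):
    """Turn a regression target into classes 0/1 by thresholding at the median."""
    m = median(targets)
    # Assumption: values strictly above the median become class 1.
    return [1 if t > m else 0 for t in targets]


# Toy inputs made up for illustration only.
rows, cats = one_hot(["red", "green", "red", "blue"])
print(cats)  # ['blue', 'green', 'red']
print(rows)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(binarize_by_median([3.0, 10.0, 7.5, 1.2]))  # [0, 1, 1, 0]
```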

 

(a) (20 points) Implement 10-fold cross-validation and apply it to each of the three data sets. Use the code from class to compare logistic regression and the k-nearest neighbor algorithm (K = 5) in terms of classification accuracy and area under the ROC curve (AUC). Use Matlab’s function perfcurve to calculate AUCs.
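For orientation, the overall structure of part (a) might look like the sketch below. It is written in plain Python rather than Matlab, so `auc` here is a hand-written rank-based stand-in and not perfcurve. All function names are hypothetical, and pooling the out-of-fold scores before computing a single AUC is an assumption (averaging per-fold AUCs is another reasonable choice).

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and split into k disjoint test folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(labels, scores):
    """Probability that a random positive outscores a random negative,
    counting ties as 1/2 (the Mann-Whitney statistic)."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def knn_scores(X_train, y_train, X_test, K=5):
    """Score = fraction of the K nearest training points labeled 1."""
    out = []
    for x in X_test:
        dist = sorted((sum((a - b) ** 2 for a, b in zip(x, xt)), t)
                      for xt, t in zip(X_train, y_train))
        out.append(sum(t for _, t in dist[:K]) / K)
    return out


def cross_validate(X, y, fit_predict, k=10, seed=0):
    """Collect out-of-fold scores, then return (accuracy, AUC).
    fit_predict(X_train, y_train, X_test) must return scores in [0, 1]."""
    scores = [None] * len(y)
    for test in kfold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        out = fit_predict([X[i] for i in train], [y[i] for i in train],
                          [X[i] for i in test])
        for i, s in zip(test, out):
            scores[i] = s
    # Assumption: a score of 0.5 or more is predicted as class 1.
    acc = sum((s >= 0.5) == (t == 1) for s, t in zip(scores, y)) / len(y)
    return acc, auc(y, scores)


# Synthetic, well-separated toy data (not one of the UCI sets).
X = [[float(v)] for v in list(range(20)) + list(range(30, 50))]
y = [0] * 20 + [1] * 20
acc, area = cross_validate(X, y, knn_scores)
print(acc, area)
```

Any classifier that fits the `fit_predict` signature can be passed in, which keeps the fold logic identical across the models being compared.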

(b) (5 points) Repeat the procedure from (a) with several different random partitions of the data. Observe the variation in the performance measures and comment on it.
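One simple way to quantify the variation is to report the mean and standard deviation of each measure over the repeated runs. The sketch below (plain Python, hypothetical names) shows only the bookkeeping: `toy_run` is a deliberately trivial stand-in for one full cross-validation run and would be replaced by the real procedure from (a).

```python
import random
from statistics import mean, stdev


def summarize_runs(run_once, seeds):
    """Evaluate run_once(seed) for each seed; return (mean, std, raw values)."""
    values = [run_once(seed) for seed in seeds]
    return mean(values), stdev(values), values


def toy_run(seed):
    """Stand-in for one cross-validation run: accuracy of a majority-class
    predictor on one random hold-out split of made-up toy labels."""
    labels = [0] * 12 + [1] * 8
    random.Random(seed).shuffle(labels)
    test, train = labels[:5], labels[5:]
    majority = 1 if sum(train) * 2 > len(train) else 0
    return sum(t == majority for t in test) / len(test)


avg, spread, values = summarize_runs(toy_run, seeds=range(5))
print(f"accuracy = {avg:.3f} +/- {spread:.3f} over {len(values)} partitions")
```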

(c) (20 points) Repeat steps (a) and (b), but this time use ensembles of 10 neural networks (5 hidden neurons each) and 100 classification trees (feel free to use code from class). Compare the accuracy of all four types of classifiers.
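The assignment does not say how the ensemble members are built or combined. One common choice, assumed here, is bagging: fit each member on a bootstrap sample of the training partition and average the members' scores. The sketch below is plain Python with hypothetical names; `stump_scores` is a trivial stand-in base learner, not a neural network or a classification tree.

```python
import random
from statistics import mean


def bagged_scores(X_train, y_train, X_test, fit_predict, n_members, seed=0):
    """Average the scores of n_members models, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(y_train)
    totals = [0.0] * len(X_test)
    for _ in range(n_members):
        # Bootstrap: draw n training indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        out = fit_predict([X_train[i] for i in sample],
                          [y_train[i] for i in sample], X_test)
        totals = [t + s for t, s in zip(totals, out)]
    return [t / n_members for t in totals]


def stump_scores(X_train, y_train, X_test):
    """Hypothetical stand-in base learner: threshold the first feature at the
    midpoint between the two class means (assumes both classes are present)."""
    m0 = mean(x[0] for x, t in zip(X_train, y_train) if t == 0)
    m1 = mean(x[0] for x, t in zip(X_train, y_train) if t == 1)
    cut = (m0 + m1) / 2
    return [1.0 if (x[0] > cut) == (m1 > m0) else 0.0 for x in X_test]


# Synthetic, well-separated toy data (not one of the UCI sets).
X_train = [[float(v)] for v in list(range(20)) + list(range(30, 50))]
y_train = [0] * 20 + [1] * 20
ensemble = bagged_scores(X_train, y_train, [[5.0], [45.0]], stump_scores,
                         n_members=10)
print(ensemble)
```

The averaged scores can be fed to the same accuracy and AUC computations used for the single classifiers.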

 

Problem 2 (30 points) Consider the same three data sets from Problem 1. This time incorporate a data normalization step, then repeat the 10-fold cross-validation for each of the four classifiers.

 

(a) (15 points) Use z-score normalization.

(b) (15 points) Use min-max normalization where the new minimum for each feature is 0 and the maximum is 1.

 

Note that the normalization parameters must be estimated on the training partition only. For z-score normalization, collect the mean and standard deviation of each feature on the training partition, then normalize both the training and test sets using those training-set parameters. Use the same approach for min-max normalization, taking each feature's minimum and maximum from the training partition.
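The fit-on-training, apply-to-both pattern can be sketched as follows (plain Python, hypothetical function names). Using the sample standard deviation (n - 1 denominator) and mapping constant features to a fixed value are assumptions made for this sketch.

```python
from statistics import mean, stdev


def fit_zscore(X_train):
    """Per-feature mean and standard deviation, from the training rows only."""
    cols = list(zip(*X_train))
    mus = [mean(c) for c in cols]
    # Assumption: constant features get sigma = 1 to avoid division by zero.
    sigmas = [stdev(c) or 1.0 for c in cols]
    return mus, sigmas


def apply_zscore(X, mus, sigmas):
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X]


def fit_minmax(X_train):
    """Per-feature minimum and maximum, from the training rows only."""
    cols = list(zip(*X_train))
    return [min(c) for c in cols], [max(c) for c in cols]


def apply_minmax(X, lows, highs):
    # Assumption: constant features map to 0. Test values can fall outside
    # [0, 1] because the range comes from the training partition.
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in X]


# Toy partitions made up for illustration only.
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[4.0, 20.0]]

mus, sigmas = fit_zscore(X_train)
train_z, test_z = apply_zscore(X_train, mus, sigmas), apply_zscore(X_test, mus, sigmas)

lows, highs = fit_minmax(X_train)
train_mm, test_mm = apply_minmax(X_train, lows, highs), apply_minmax(X_test, lows, highs)

print(test_z)   # [[2.0, 0.0]]
print(test_mm)  # [[1.5, 0.5]]
```

Note that the test row's first feature maps to 1.5 under min-max scaling: the training range is never re-estimated on the test data.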

 

Problem 3 (30 points) Consider the same three data sets from Problem 1. Now implement an algorithm to evaluate the importance of each feature in the training set. The importance of each feature should be evaluated by reducing the entire training set to a single feature at a time, then constructing a predictor and calculating the area under the ROC curve (using cross-validation). The features with the largest AUC should be considered the most important. Provide the top few features for each of the classification models above and comment on the differences. Finally, look at the best features and comment on whether your results make sense in terms of what you originally thought would be important features.
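The single-feature ranking described above might be organized as in the sketch below. It is plain Python with hypothetical names, uses a small k-nearest-neighbor scorer as the example model, and pools out-of-fold scores into one AUC per feature (an assumption; per-fold averaging would also work).

```python
import random


def auc(labels, scores):
    """Rank-based AUC; ties between a positive and a negative count as 1/2."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def knn_scores(X_train, y_train, X_test, K=5):
    """Score = fraction of the K nearest training points labeled 1."""
    out = []
    for x in X_test:
        dist = sorted((sum((a - b) ** 2 for a, b in zip(x, xt)), t)
                      for xt, t in zip(X_train, y_train))
        out.append(sum(t for _, t in dist[:K]) / K)
    return out


def cv_auc(X, y, fit_predict, k=10, seed=0):
    """AUC of out-of-fold scores from k-fold cross-validation."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    scores = [None] * len(y)
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        out = fit_predict([X[i] for i in train], [y[i] for i in train],
                          [X[i] for i in test])
        for i, s in zip(test, out):
            scores[i] = s
    return auc(y, scores)


def rank_features(X, y, fit_predict, k=10, seed=0):
    """Return (AUC, feature index) pairs, best feature first."""
    ranking = []
    for j in range(len(X[0])):
        X_j = [[row[j]] for row in X]  # reduce the data to feature j only
        ranking.append((cv_auc(X_j, y, fit_predict, k, seed), j))
    return sorted(ranking, reverse=True)


# Synthetic toy data: feature 0 separates the classes, feature 1 is constant.
X = [[float(v), 7.0] for v in list(range(10)) + list(range(20, 30))]
y = [0] * 10 + [1] * 10
print(rank_features(X, y, knn_scores))  # [(1.0, 0), (0.5, 1)]
```

Using the same seed for every feature means all features are scored on identical partitions, so differences in AUC reflect the features rather than the split.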
