Data Mining Assignment
Take any reasonable data set you like with at least 200 observations.
Next, shuffle the data set and split it into three sets: a training set, a validation set, and a test set. You are free to use any appropriate split, but I recommend 50% for training and 30% for validation, holding out the remaining 20% of observations until the very end to see how you did. You may choose to do k-fold cross-validation instead; in that case, use 80% of the data for cross-validation and hold out the remaining 20% until the end.
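As a rough illustration of the shuffle-and-split step, here is a minimal sketch using only the Python standard library. The function name `three_way_split`, the fixed seed, and the stand-in data (the integers 0 to 199) are assumptions made for the example; substitute your own observations and any split you can justify.

```python
import random


def three_way_split(rows, train_frac=0.5, val_frac=0.3, seed=0):
    """Shuffle a copy of `rows` and cut it into train, validation, and test lists.

    Whatever is left after the training and validation shares becomes the test set.
    """
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in data: 200 placeholder observations.
rows = list(range(200))
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

Fixing the seed means you get the same split every time you rerun your work, which makes your results easier to reproduce and grade.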
Next, build a decision (classification) tree or a regression tree with your training data, subject to two rules. These rules could be something like maximum tree depth, minimum number of observations in a leaf, minimum decrease in impurity, etc. Use your validation data to find the optimal settings of these rules, e.g. maximum tree depth = 3 if that gives the best validation accuracy. Finally, use your 20% hold-out set to see how these optimal settings would do in production, that is, on data the model has never seen.
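One possible way to carry out the tuning step is sketched below, assuming scikit-learn and NumPy are installed. The data set (scikit-learn's built-in breast cancer data, 569 observations), the two rules chosen (`max_depth` and `min_samples_leaf`), and the candidate values are all assumptions for illustration; you are not required to use any of them.

```python
from itertools import product

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Example data only: any data set with at least 200 observations would do.
X, y = load_breast_cancer(return_X_y=True)

# Shuffle the row indices, then cut them roughly 50% / 30% / 20%.
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
n_train = round(0.5 * len(y))
n_val = round(0.3 * len(y))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# The two rules being tuned, with illustrative candidate values.
depths = [2, 3, 4, 5, 6]
leaf_sizes = [1, 5, 10, 20]

best_score, best_params = -1.0, None
for depth, leaf in product(depths, leaf_sizes):
    tree = DecisionTreeClassifier(
        max_depth=depth, min_samples_leaf=leaf, random_state=0
    )
    tree.fit(X[train_idx], y[train_idx])          # fit on training data only
    score = tree.score(X[val_idx], y[val_idx])    # accuracy on validation data
    if score > best_score:
        best_score, best_params = score, (depth, leaf)

# Refit with the winning settings and touch the hold-out set exactly once.
final_tree = DecisionTreeClassifier(
    max_depth=best_params[0], min_samples_leaf=best_params[1], random_state=0
)
final_tree.fit(X[train_idx], y[train_idx])
test_accuracy = final_tree.score(X[test_idx], y[test_idx])

print("best (max_depth, min_samples_leaf):", best_params)
print("validation accuracy:", best_score)
print("hold-out accuracy:", test_accuracy)
```

The hold-out set is scored only after the settings are fixed, so its accuracy is not influenced by the search. For a regression tree, `DecisionTreeRegressor` could be swapped in, in which case `score` reports R² rather than accuracy.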
Guidelines are the same as prior weeks.