For this problem you will be working with two datasets:
• Iris: a three-class task in which the goal is to accurately predict the sub-type of an Iris flower from four physical features: the length and width of the sepals and of the petals. There are 150 instances in total, with 50 instances per class.
• Spambase: a binary classification task in which the objective is to classify email messages as spam or not spam. The dataset represents each email message with 57 text-based features. There are about 4600 instances.
Since both datasets have continuous features, you will implement decision trees that have binary splits. To determine the optimal threshold for a split, you will need to search over all possible thresholds for a given feature (refer to class notes and discussion for an efficient search strategy). In your implementation, use information gain to select the splitting features and values, and measure node impurity using entropy.
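As an illustration of the threshold search described above, the following is a minimal sketch of one common strategy, not necessarily the one from the class notes: sort the feature values once, then sweep left to right while updating class counts, so that each candidate threshold is evaluated without recounting. The function names `entropy` and `best_threshold`, the use of midpoints between distinct values as candidates, and the integer class labels `0..n_classes-1` are all assumptions made for this example.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def best_threshold(values, labels, n_classes):
    """Return (threshold, information_gain) for the best split `value <= threshold`.

    Assumes labels are integers in 0..n_classes-1. Returns (None, 0.0) when
    no split yields a positive information gain.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = [0] * n_classes
    for _, label in pairs:
        total[label] += 1
    parent = entropy(total)

    left = [0] * n_classes          # class counts of instances moved to the left child
    best_t, best_gain = None, 0.0
    for i in range(n - 1):
        left[pairs[i][1]] += 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                # no threshold fits between equal values
        right = [t - l for t, l in zip(total, left)]
        n_left = i + 1
        # Weighted average entropy of the two children.
        children = (n_left * entropy(left) + (n - n_left) * entropy(right)) / n
        gain = parent - children
        if gain > best_gain:
            best_t = (pairs[i][0] + pairs[i + 1][0]) / 2
            best_gain = gain
    return best_t, best_gain
```

Running this for every feature at a node and keeping the feature with the largest gain gives the split for that node. The sort dominates the cost, so one feature takes O(n log n) time rather than the O(n²) of recounting at every candidate threshold.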
Instead of growing full trees, you will use an early stopping strategy. To this end, we will impose a limit on the minimum number of instances at a leaf node. Let this threshold be denoted as ηmin, where 0 ≤ ηmin ≤ 1 is a ratio relative to the size of the training dataset. For example, if the size of the training dataset is 150 and ηmin = 0.05, then the limit is 0.05 × 150 = 7.5, rounded up to eight, and a node will only be split further if it has more than eight instances.
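The stopping rule above can be sketched as a small helper. This is only one reading of the example: it assumes the product ηmin × (training set size) is rounded up to an integer, which is what reproduces the "more than eight instances" figure for 150 instances and ηmin = 0.05. The name `can_split` and the small rounding guard against floating-point noise are assumptions of this sketch.

```python
import math

def can_split(n_node, n_train, eta_min):
    """Return True if a node with n_node instances may be split further.

    Assumption: the minimum size is eta_min * n_train rounded up, and a node
    is split only when it holds strictly more instances than that.
    """
    # round(..., 9) guards against products such as 15.000000000000002.
    min_instances = math.ceil(round(eta_min * n_train, 9))
    return n_node > min_instances
```

For instance, `can_split(9, 150, 0.05)` is true while `can_split(8, 150, 0.05)` is false. Note that within ten-fold cross-validation the training set of each fold is smaller than the full dataset, so the limit should be computed from the fold's training size.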
(a) For the Iris dataset, use ηmin ∈ {0.05, 0.10, 0.15, 0.20} and calculate the accuracy using ten-fold cross-validation for each value of ηmin.
(b) For the Spambase dataset, use ηmin ∈ {0.05, 0.10, 0.15, 0.20, 0.25} and calculate the accuracy using ten-fold cross-validation for each value of ηmin.
You can summarize your results in two separate tables, one for each dataset (report the average accuracy and standard deviation across the folds).
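The evaluation loop for one value of ηmin can be sketched as follows. This is a minimal example under stated assumptions: `fit` and `predict` are hypothetical callables standing in for your decision tree's training and prediction routines, folds are formed by a seeded shuffle without stratification, and the sample standard deviation is reported.

```python
import random
import statistics

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return (mean accuracy, standard deviation) across k folds.

    fit(X_train, y_train) -> model and predict(model, x) -> label are
    placeholders for the learner being evaluated.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Calling `cross_validate` once per value of ηmin, with the same seed each time, yields one row of the results table per value and keeps the folds identical across rows so that the accuracies are comparable.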
(a) Select the best value of ηmin for the Iris dataset, and create a class confusion matrix using ten-fold cross-validation (use only the test set for populating the confusion matrix). How do you interpret the confusion matrix, and why?
(b) Select the best value of ηmin for the Spambase dataset, and create a class confusion matrix using ten-fold cross-validation (use only the test set for populating the confusion matrix). How do you interpret the confusion matrix, and why?
(c) How do different values of ηmin impact classifier performance for both datasets, and why? Support your claims and insights with your results.
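For the confusion matrices requested above, one possible approach is to build a matrix from each fold's test-set predictions and sum the matrices over the ten folds, so that every instance is counted exactly once. The sketch below assumes integer class labels `0..n_classes-1`, with rows indexing the true class and columns the predicted class; the function names are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count matrix where entry [t][p] is the number of instances of true
    class t that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def add_matrices(a, b):
    """Element-wise sum, used to accumulate per-fold matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
```

Diagonal entries are correct predictions, and off-diagonal entries show which pairs of classes the tree confuses.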
