STAT318/462
STAT 318/462: Data Mining Assignment 3
Due Date: 4pm, 2nd June, 2021
Your assignment must be submitted electronically to the STAT318/462 Learn page, under Assignment 3.
You may do the assignment by yourself or with one other person from the same cohort (300-level students cannot work with 400-level students). If you hand in a joint assignment, you will each be given the same mark. Marks will be lost for unexplained, poorly presented and incomplete answers. Whenever you are asked to do computations with data, feel free to do them any way that is convenient. If you use R (recommended), please provide your code. All figures and plots must be clearly labelled.
1. (10 marks) In this question, you will fit regression trees to predict sales using the Carseats data. This dataset has been divided into training and testing sets: carseatsTrain.csv and carseatsTest.csv (download these sets from Learn). Use the tree(), randomForest() and gbm() R functions to answer this question (see Section 8.3 of the course textbook).
(a) Fit a regression tree to the training set (do not prune the tree). Plot the tree and interpret the results. What are the test and training MSEs for your tree?
(b) Use the cv.tree() R function to prune your tree (use your judgement here). Does the pruned tree perform better?
(c) Fit a bagged regression tree and a random forest to the training set. What are the test and training MSEs for each model? Was decorrelating trees an effective strategy for this problem?
(d) Fit a boosted regression tree to the training set. Experiment with different tree depths, shrinkage parameters and the number of trees. What are the test and training MSEs for your best tree? Comment on your results.
(e) Which model performed best and which predictors were the most important in this model?
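Parts (a) to (d) each ask for training and test MSEs. As a reminder of the quantity being reported, here is a minimal sketch in Python (the assignment recommends R, so treat this as illustrative only). The function name `mse` and the toy numbers are assumptions made for this example and have nothing to do with the Carseats data.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted responses (not real data).
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Squared residuals are 1, 0 and 4, so the result is their average.
print(mse(observed, predicted))
```

The training MSE applies this to predictions on the data used to fit the model; the test MSE applies it to predictions on the held-out test set.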
2. (4 marks) Using the itemset lattice in Figure 1 (on page 3) and the transactions given in Table 1, answer the following questions. Assume minsup = 30%.
(a) Label each node in the itemset lattice with the following letter(s):
M: if the node is a maximal frequent itemset;
C: if the node is a closed frequent itemset;
F: if the node is frequent, but neither maximal nor closed;
I: if the node is infrequent.
(b) Find the confidence and lift for the rule {d, e} → {b}. Comment on what you find.
University of Canterbury, , 2021

Transaction ID   Items Bought    Transaction ID   Items Bought
1                {a,d,e}         6                {a,b,d}
2                {b,c,d}         7                {b,d}
3                {a,d,e}         8                {a,b}
4                {a,b,c,d,e}     9                {a,b,d}
5                {b,d,e}         10               {b,d,e}
Table 1: Market basket transactions for Question 2.
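The quantities in Question 2 follow from the standard definitions: the support of an itemset is the fraction of transactions containing it, the confidence of X → Y is support(X ∪ Y) / support(X), and the lift is that confidence divided by support(Y). A frequent itemset is maximal if it has no frequent proper superset, and closed if no proper superset has the same support. The Python sketch below illustrates these definitions on a small hypothetical dataset (deliberately not Table 1). The function names, the toy transactions and the choice to label maximal itemsets as "MC" are assumptions made for this example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Lift of the rule lhs -> rhs (1 means lhs and rhs look independent)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


def label(itemset, transactions, items, minsup):
    """Return 'I', 'F', 'C' or 'MC' for one node of the itemset lattice.

    Checking only the immediate supersets is enough, because support
    can never increase when an item is added to an itemset.
    """
    itemset = frozenset(itemset)
    s = support(itemset, transactions)
    if s < minsup:
        return "I"
    sups = [support(itemset | {i}, transactions) for i in items if i not in itemset]
    if all(v < minsup for v in sups):
        return "MC"  # a maximal frequent itemset is always closed as well
    if all(v < s for v in sups):
        return "C"
    return "F"


# Hypothetical transactions over items x, y, z (not the assignment data).
toy = [{"x", "y"}, {"x", "y", "z"}, {"x", "z"}, {"y"}, {"x", "y"}]

print(support({"x", "y"}, toy))
print(confidence({"x"}, {"y"}, toy), lift({"x"}, {"y"}, toy))
for node in [{"x"}, {"z"}, {"x", "y"}, {"y", "z"}]:
    print(sorted(node), label(node, toy, "xyz", minsup=0.4))
```

A lift below 1 indicates that the antecedent and consequent occur together less often than would be expected if they were independent; a lift above 1 indicates the opposite.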
3. (6 marks) This question considers clustering the A3data2 data (download from the Learn page). Use x1 and x2 for clustering; the third variable, 'Cluster', is the actual cluster label of each point. This variable is included so that you can plot the clusters and assess the performance of each clustering method.
(a) Perform k-means clustering using k = 3. Plot the clustering using a different colour for each cluster.
(b) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the data, provide a dendrogram and plot the clustering with 3 clusters. Repeat using single linkage.
(c) Comment on your results from parts (a) and (b). Provide possible explanations for each clustering obtained.
(d) Rescale your data using the R function scale(Data, center=TRUE, scale=TRUE) and repeat parts (a) and (b). Does rescaling improve your results? Comment on your results and provide possible explanations for each clustering obtained.
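With center=TRUE and scale=TRUE, R's scale() subtracts each column's mean and divides by its sample standard deviation (the n - 1 version). The Python sketch below reproduces that transformation for a single column, purely as an illustration. The function name `standardize` and the toy columns are assumptions made for this example and are not the A3data2 data.

```python
from statistics import mean, stdev


def standardize(column):
    """Centre a column on its mean and divide by its sample standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in column]


# Hypothetical columns measured on very different scales.
x1 = [1.0, 2.0, 3.0]
x2 = [100.0, 200.0, 300.0]

# After standardizing, both columns have mean 0 and standard deviation 1,
# so neither dominates a Euclidean distance purely because of its units.
print(standardize(x1))
print(standardize(x2))
```

Both k-means and the hierarchical methods in part (b) rely on Euclidean distance, which is why the scale of each variable can affect the clustering.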
Question 4 is for students taking STAT462. STAT318 students will NOT receive additional credit if they choose to answer this question. This is an independent research question (you will not be taught this material in class), but you will find Section 9.6 of the course textbook very useful.
4. (4 marks) In this question, you will fit support vector machines to the Banknote data from Assignment 2 (on the Learn page). Only use the predictors x1 and x3 to fit your classifiers.
(a) Is it possible to find a separating hyperplane for the training data? Explain.
(b) Fit a support vector classifier to the training data using tune() to find the best cost value. Plot the best classifier and produce a confusion matrix for the testing data. Comment on your results.
(c) Fit a support vector machine (SVM) to the training data using the radial kernel. Use tune() to find the best cost and gamma values. Plot the best SVM and produce a confusion matrix for the testing data. Compare your results with those obtained in part (b).
null
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Figure 1: Itemset lattice for Question 2.