Mid-year Examinations, 2019
STAT318-19S1 (C) / STAT462-19S1 (C)
Family Name: _____________________
First Name: _____________________
Student Number: |__|__|__|__|__|__|__|__|
Venue: ____________________
Seat Number: ________
No electronic/communication devices are permitted. No exam materials may be removed from the exam room.
Mathematics and Statistics EXAMINATION
Mid-year Examinations, 2019
STAT318-19S1 (C) Data Mining
STAT462-19S1 (C) Data Mining
Examination Duration: 120 minutes
Exam Conditions:
Closed Book exam: Students may not bring in any written or printed materials. Calculators with a ‘UC’ sticker approved.
Materials Permitted in the Exam Venue:
None
Materials to be Supplied to Students:
1 x Standard 16-page UC answer book
Instructions to Students:
Answer all FOUR questions.
All questions carry equal marks.
Show all working.
Use black or blue ink only.
Write your answers in the answer booklet provided.

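1. (a) Sketch the tree corresponding to the CART partition given below. The word in each box indicates the class label for that region.

[Figure: CART partition of the (X1, X2) feature space. Region labels: Cat, Dog, Sheep, Cat, Rabbit. X1 axis ticks: 3, 4, 6. X2 axis ticks: 0, 1.]
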
(b) Create a diagram similar to that given in part (a) using the CART tree below. Indicate the class label in each region of the partitioned feature space.

[Figure: CART tree. Split conditions: X2 < 1, X1 < 1, X2 < 0, X2 < 2, X1 < 2. Leaf labels: Banana, Orange, Grapes, Apple.]

(c) The predictive performance of a single tree can be substantially improved by aggregating many decision trees.
i. Briefly explain the random forest method for classification.
ii. Should pruned or un-pruned trees be used in random forests? Explain.
iii. Briefly explain the out-of-bag error rate for random forests.
(d) Briefly explain how K-fold cross-validation can be used to approximate a testing error rate. Describe one advantage of cross-validation over the validation set approach.

2. (a) Suppose that we have the following 100 market basket transactions.

Transaction     Frequency
{a}                 9
{a, c}             10
{a, b, c}          27
{a, b, d}          11
{a, b, c, d}       21
{b, d}              3
{b, c, d}          14
{c, d}              5
Total             100

For example, there are 10 transactions of the form {a, c}.
i. Compute the support of {d}, {b, c}, and {b, c, d}.
ii. Compute the confidence of the association rules {b, c} → {d} and {d} → {b, c}. Is confidence a symmetric measure? Justify your answer.
iii. Find the 3-itemset(s) with the largest support.
iv. If minsup = 0.2, is {b, c, d} a maximal frequent itemset? Justify your answer.
v. Would we generally expect the number of closed frequent itemsets to be less than or greater than the number of maximal frequent itemsets? Justify your answer.
(b) This question examines linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for a 3-class classification problem.
i. Under what conditions are the Bayes classifier and LDA equivalent classifiers?
ii. Briefly explain the difference between LDA and QDA.
iii. If we have a relatively small training data set, would we expect LDA or QDA to perform better? Discuss.

3. (a) Using one or two sentences, explain the main difference between regression and classification problems.
(b) What do we mean by the bias and the variance of a statistical learning method?
(c) What is the bias-variance trade-off for statistical learning methods?
(d) Provide a sketch of typical training error, testing error, and Bayes error curves, on a single plot, against the flexibility of a statistical learning method. The x-axis should represent the flexibility and the y-axis should represent the error. Make sure the plot is clearly labelled. Explain why each of the three curves has the shape displayed in your plot.
(e) Indicate on your plot in part (d) where we would generally expect to find an un-pruned classification tree, a random forest, and k-nearest neighbour classification with a relatively large k value. Explain why you have placed these methods where you have.
(f) Briefly explain what is meant by under-fitting and over-fitting the training data.

4. (a) Using one or two sentences, explain the difference between supervised learning and unsupervised learning.
(b) Suppose that we have five points, x1, ..., x5, with the following distance matrix:

      x1   x2   x3   x4   x5
x1     0    6    6    9   12
x2     6    0    2    7   10
x3     6    2    0    5    8
x4     9    7    5    0    3
x5    12   10    8    3    0

For example, the distance between x1 and x2 is 6 and the distance between x3 and x5 is 8.
i. Briefly explain single linkage and complete linkage hierarchical clustering.
ii. Using the distance matrix above, sketch the dendrogram that results from hierarchically clustering these points using single linkage. Clearly label your dendrogram and include all merging distances.
iii. Suppose we want a clustering with two clusters. Which points are in each cluster for single linkage?
(c) This question examines the k-means clustering algorithm.
i. Briefly explain the k-means clustering algorithm.
ii. After several iterations, k-means (using Euclidean distance with k = 3) has produced the following cluster assignment:
Cluster 1 = {(1, 2), (2, 3), (2, 1)}
Cluster 2 = {(6, 2), (5, 3), (5, 2), (4, 4)}
Cluster 3 = {(3, 4)}.
Has the k-means algorithm converged? Justify your answer.
iii. Would we generally expect to get the same clustering if we run the k-means algorithm several times? Explain.

End of Examination