
End-of-year Examinations, 2017
STAT318-17S2 (C) / STAT462-17S2 (C)
Family Name: _____________________
First Name: _____________________
Student Number: |__|__|__|__|__|__|__|__|
Venue: ____________________
Seat Number: ________
No electronic/communication devices are permitted. No exam materials may be removed from the exam room.
Mathematics and Statistics EXAMINATION
End-of-year Examinations, 2017
STAT318-17S2 (C) Data Mining STAT462-17S2 (C) Data Mining
Examination Duration: 120 minutes
Exam Conditions:
Closed Book exam: Students may not bring in any written or printed materials. Any scientific/graphics/basic calculator is permitted.
Materials Permitted in the Exam Venue:
None.
Materials to be Supplied to Students:
1 x Standard 16-page UC answer book
Instructions to Students:
Answer all FOUR questions
All questions carry equal marks.
Use black or blue ink only.
Show all working.
Write your answers in the answer booklet provided.
Page 1 of 6

Questions Start on Page 3

1. (a) What do we mean by the variance and the bias of a statistical learning method?
(b) What is the bias-variance trade-off for statistical learning methods?
(c) On a single plot, provide a typical sketch of the training error, testing error, and Bayes error against the flexibility of a statistical learning method. The x-axis should represent the flexibility and the y-axis should represent the error. Make sure the plot is clearly labelled. Explain why each of the three curves has the shape displayed in your plot.
(d) Indicate on your plot in part (c) where we would generally expect to find linear regression, a pruned regression tree, and an un-pruned regression tree. Explain why you have placed these methods where you have.
(e) For each of parts (i) through (iii), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
i. The relationship between the predictors and the response is highly non-linear.
ii. The sample size n is extremely large, and the number of predictors is small.
iii. The variance of the error terms (σ² = V(ε)) is extremely high.
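For revision, the error behaviour asked about above can be reproduced numerically. A minimal sketch in Python, assuming NumPy is available; the sinusoidal truth, noise level, and sample sizes are illustrative choices only, with polynomial degree standing in for flexibility:

```python
# Sketch: training error vs. testing error as model flexibility grows.
# Flexibility is represented by polynomial degree; the data are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)  # non-linear truth plus noise
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(500)

train_mse, test_mse = [], []
for degree in range(1, 11):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse.append(float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2)))
    test_mse.append(float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2)))

# Training error can only decrease as the polynomial family grows (nested
# models); testing error is typically U-shaped: high bias at low degree,
# high variance at high degree.
assert train_mse[-1] <= train_mse[0]
best_degree = int(np.argmin(test_mse)) + 1
print("best degree by test MSE:", best_degree)
```

Plotting train_mse and test_mse against degree reproduces the usual picture: a monotonically decreasing training error and a U-shaped testing error sitting above the irreducible (Bayes) error.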

2. (a) Sketch the tree corresponding to the CART partition given below. The numbers in the boxes indicate the mean response within each region.
[Figure: a CART partition of the (X1, X2) feature space into rectangular regions. The mean responses shown in the regions are 11, 5, 10, 7, 17, 4, and 2; the X1 axis is labelled at 2 and 3, and the X2 axis at 0 and 1.]
(b) Create a diagram similar to that given in part (a) using the CART tree below. Indicate the mean response in each region of the partitioned feature space.

[Figure: a CART tree with internal node splits X2 < 1, X1 < 2, X1 < 1, and X2 < 0, and leaf mean responses 20, 15, 11, 25, and 12.]

(c) Briefly explain the method of bagging trees. Would we generally expect pruning to improve the performance of bagging trees? Explain.

(d) Briefly explain the random forests method. What is the potential advantage of random forests over bagging trees?

(e) Briefly explain how k-fold cross-validation can be used to approximate a testing mean squared error (MSE). What is an advantage of k-fold cross-validation relative to leave-one-out cross-validation (LOOCV)?

3. Suppose that we have the following 100 market basket transactions.

Transaction     Frequency
{a}             9
{a, b}          8
{a, d}          11
{c, d}          5
{a, b, c}       17
{a, b, d}       6
{a, b, c, d}    10
{b, c, d}       34
Total           100

For example, there are 8 transactions of the form {a, b}.

(a) Compute the support of {d}, {a, c}, and {a, c, d}.

(b) Compute the confidence of the association rules {a, c} → {d} and {d} → {a, c}. Is confidence a symmetric measure? Justify your answer.

(c) If minsup = 0.2, is {a, b, c} a maximal frequent itemset? Justify your answer.

(d) Lift is defined as

Lift(X → Y) = s(X ∪ Y) / (s(X) s(Y)),

where s(·) denotes support. Compute the Lift of the rules {b} → {c} and {b} → {d}. For each rule, determine whether the items are independent, positively correlated, or negatively correlated. Justify your answer.

(e) Would we generally expect the number of closed frequent itemsets to be less than or more than the number of maximal frequent itemsets? Justify your answer.

(f) State the Apriori principle. Use this principle to show that c({x, y} → {z}) ≥ c({x} → {y, z}), where c(·) denotes confidence.

4. (a) Using one or two sentences, explain the difference between supervised learning and unsupervised learning.

(b) Suppose that we have four points with the following similarity matrix:

    1     0.7   0.6   0.3
    0.7   1     0.5   0.2
0.6 0.5 1 0.55 0.3 0.2 0.55 1 For example, the similarity between the first and second points is 0.7 and the similarity between the third and fourth points is 0.55. i. Briefly explain complete linkage hierarchical clustering. ii. Using the similarity matrix above, sketch the dendrogram that results from hierarchically clustering these points using complete linkage. Clearly label your dendrogram and include all merging similarities. iii. Suppose we want a clustering with two clusters. Which points are in each cluster for complete linkage? (c) Briefly discuss one strength and one weakness of hierarchical clustering algo- rithms. (d) Consider applying the k-means clustering algorithm to two-dimensional points using Euclidean distance with k = 2. After two iterations, the following cluster assignment is obtained: Cluster 1 = {(1, 2), (2, 3), (5, 2), (2, 1)} Cluster 2 = {(6, 2), (5, 3), (7, 4)}. i. Calculate the cluster means. ii. Has the k-means algorithm converged? Justify your answer. iii. Would we generally expect to get the same clustering if we run the k-means algorithm several times? Explain. End of Examination Page 6 of 6