End-of-year Examinations, 2020 STAT318/STAT462-20S2 (C)
Page 1 of 6
No electronic/communication devices are permitted.
No exam materials may be removed from the exam room.
Mathematics and Statistics
EXAMINATION
End-of-year Examinations, 2020
STAT318-20S2 (C) / STAT462-20S2 (C) Data Mining
Examination Duration: 120 minutes
Exam Conditions:
Closed Book exam: Students may not bring in any written or printed materials.
Calculators with a ‘UC’ sticker approved.
Materials Permitted in the Exam Venue:
None.
Materials to be Supplied to Students:
1 x Standard 16-page UC answer book.
Instructions to Students:
Answer all FOUR questions.
All questions carry equal marks.
Show all working.
Write your answers in the answer booklet provided.
Use black or blue ink only.
Family Name _____________________
First Name _____________________
Student Number |__|__|__|__|__|__|__|__|
Venue ____________________
Seat Number ________
End-of-year Examinations, 2020 STAT318/STAT462-20S2 (C)
Page 2 of 6
Questions Start on Page 3
End-of-year Examinations, 2020 STAT318/STAT462-20S2 (C)
1. (a) Suppose that we have the following 100 market basket transactions.
Transaction Frequency
{apple} 9
{apple, carrot} 10
{apple, banana, carrot} 27
{apple, banana, carrot, orange} 21
{banana, orange} 3
{apple, banana, orange} 11
{carrot, orange} 5
{banana, carrot, orange} 14
100
For example, there are 10 transactions of the form {apple, carrot}
i. Compute the support of {orange}, {banana, carrot}, and {banana, carrot, orange}.
ii. Compute the confidence of the association rules:
{banana, carrot} → {orange}; and
{orange} → {banana, carrot}.
Is confidence a symmetric measure? Justify your answer.
iii. Find the 3-itemset(s) with the largest support.
iv. If minsup = 0.2, is {banana, carrot, orange} a maximal frequent itemset?
Justify your answer.
v. Lift is defined as
Lift(X → Y ) =
s(X ∪ Y )
s(X)s(Y )
,
where s( ) denotes support. What does it mean if Lift(X → Y ) < 1. (b) This question examines linear discriminant analysis (LDA) and quadratic discrimi- nant analysis (QDA) for a 2-class classification problem. i. Under what conditions are Bayes classifier and LDA equivalent classifiers? ii. Explain the difference between LDA and QDA. iii. In general, as the sample size increases, do we expect the test prediction ac- curacy of QDA relative to LDA to improve, decline, or be unchanged? Why? Page 3 of 6 TURN OVER End-of-year Examinations, 2020 STAT318/STAT462-20S2 (C) 2. (a) Describe two potential advantages of regression trees over other statistical learning methods. (b) This question uses the CART partition given below. The points in each box denote the training data and the numbers next to them are their response values. X X2 1 1 4 0 3 6 2 10 10 14 1 3 5 4 16 12 14 . . .. . . . . . . . 5 7 8 9 102 -1 2 3 1. i. Sketch the decision tree corresponding to the CART partition. ii. Predict the response values for the following observations using your tree, where x = (x1, x2): a = (5, 1); b = (3.5, 2); and c = (11,−2.) iii. What is the training MSE for your tree? (c) The predictive performance of a single regression tree can be substantially improved by aggregating many decision trees. i. Briefly explain the method of bagging regression trees. ii. Explain the difference between bagging and random forest. iii. Briefly explain two differences between boosted regression trees and random forest. Page 4 of 6 End-of-year Examinations, 2020 STAT318/STAT462-20S2 (C) 3. (a) Using one or two sentences, explain the main difference between regression and classification problems. (b) The expected test MSE, for a given x0, can be decomposed into the sum of three fundamental quantities: E[y0 − f̂(x0)]2 = V (f̂(x0)) + [Bias(f̂(x0))]2 + V (�). Briefly explain each of these three quantities. (c) Briefly explain what is meant by over-fitting and under-fitting the training data. (d) Provide a sketch typical of training error, testing error, and the irreducible error, on a single plot, against the flexibility of a statistical learning method. The x-axis should represent the flexibility and the y-axis should represent the error. Make sure the plot is clearly labelled. Explain why each of the three curves has the shape displayed in your plot. (e) Indicate on your plot in part (d) where we would generally expect to find a regression tree with five splits, a random forest, and k-nearest neighbour regression with k = 3. Explain why you have placed these methods where you have. (f) Would we generally expect the training MSE of an un-pruned regression tree to be better or worse than a pruned regression tree? Why? Page 5 of 6 TURN OVER End-of-year Examinations, 2020 STAT318/STAT462-20S2 (C) 4. (a) Suppose that we have five points, x1, . . . , x5, with the following distance matrix: x1 x2 x3 x4 x5 x1 0 6 6 9 12 x2 6 0 2 7 10 x3 6 2 0 5 8 x4 9 7 5 0 3 x5 12 10 8 3 0 For example, the distance between x1 and x2 is 6 and the distance between x3 and x5 is 8. i. Briefly explain the single linkage and complete linkage hierarchical clustering algorithms. ii. Using the distance matrix above, sketch the dendrogram that results from hierarchically clustering these points using complete linkage. Clearly label your dendrogram and include all merging distances. iii. Suppose we want a clustering with two clusters. Which points are in each cluster for complete linkage? iv. Briefly explain one strength and one weakness of hierarchical clustering. (b) Consider applying the k-means clustering algorithm to two-dimensional points using Euclidean distance with k = 2. i. Briefly explain the k-means clustering algorithm. ii. After two passes of the main loop of the k-means algorithm, the following cluster assignment was obtained: Cluster 1 = {(1, 2), (2, 3), (5, 2), (2, 1)} Cluster 2 = {(6, 2), (5, 3), (7, 4)}. Has k-means converged to a stable clustering? Justify your answer. iii. Would we generally expect to get the same clustering if we run the k-means algorithm several times? Explain. iv. Briefly explain one strength and one weakness of the k-means clustering algo- rithm. End of Examination Page 6 of 6