Mid-year Examinations, 2018
STAT318-18S1 (C) / STAT462-18S1 (C)
Family Name | First Name | Student Number | Venue | Seat Number
No electronic/communication devices are permitted. No exam materials may be removed from the exam room.
Mathematics and Statistics EXAMINATION
Mid-year Examinations, 2018
STAT318-18S1 (C) Data Mining
STAT462-18S1 (C) Data Mining
Examination Duration: 120 minutes
Exam Conditions:
Closed Book exam: Students may not bring in any written or printed materials. Calculators with a ‘UC’ sticker are approved.
Materials Permitted in the Exam Venue:
None
Materials to be Supplied to Students:
1 x Standard 16-page UC answer book
Instructions to Students:
Answer all FOUR questions.
All questions carry equal marks.
Show all working.
Use black or blue ink only.
Write your answers in the answer booklet provided.
1. (a) What do we mean by the variance and the bias of a statistical learning method?
(b) What is the bias-variance trade-off for statistical learning methods?
(c) Provide a sketch of typical training error, testing error, and Bayes error curves, on a single plot, against the flexibility of a statistical learning method. The x-axis should represent flexibility and the y-axis should represent error. Make sure the plot is clearly labelled. Explain why each of the three curves has the shape displayed in your plot.
(d) Indicate on your plot in part (c) where we would generally expect to find logistic regression, a random forest, and a k-nearest neighbour classifier with k = 3. Explain why you have placed these methods where you have.
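A minimal Python sketch (an editor's illustration, not part of the exam) of the behaviour described in parts (c) and (d): training error falls monotonically with flexibility, while test error is U-shaped and bottoms out near the irreducible (Bayes) error. The simulated data, noise level, and the use of kNN regression (where flexibility grows as k shrinks) are all illustrative assumptions.

```python
# Sketch: training vs test error as model flexibility increases.
# Assumes scikit-learn; data and settings are illustrative only.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)  # noise sd 0.3
    return X, y

X_tr, y_tr = make_data(100)
X_te, y_te = make_data(2000)

# Smaller k = more flexible fit; k = 1 interpolates the training data.
for k in (50, 20, 10, 5, 2, 1):
    fit = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    mse_tr = np.mean((fit.predict(X_tr) - y_tr) ** 2)
    mse_te = np.mean((fit.predict(X_te) - y_te) ** 2)
    print(f"k={k:2d}  train MSE={mse_tr:.3f}  test MSE={mse_te:.3f}")

# The Bayes (irreducible) error here is the noise variance, 0.3**2 = 0.09:
# test MSE can approach but not beat it.
```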
(e) This question examines differences between linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Justify your answers.
i. If the Bayes decision boundary is approximately linear, do we expect LDA or QDA to perform better on the training data set?
ii. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the testing data set?
iii. In general, as the sample size increases, do we expect the test prediction accuracy of LDA relative to QDA to improve, decline, or be unchanged? Why?
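A minimal simulation sketch of the comparison in part (e), assuming scikit-learn. The two Gaussian classes share a covariance matrix, so the Bayes boundary is linear and LDA's assumptions hold exactly; all simulation settings are illustrative choices.

```python
# Sketch: LDA vs QDA test accuracy when the Bayes boundary is linear.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def simulate(n_per_class):
    # Two spherical Gaussian classes with equal covariance:
    # the Bayes decision boundary between them is linear.
    X0 = rng.normal(loc=[0.0, 0.0], size=(n_per_class, 2))
    X1 = rng.normal(loc=[1.5, 1.5], size=(n_per_class, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

X_train, y_train = simulate(50)   # small training set
X_test, y_test = simulate(5000)   # large test set

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(type(model).__name__, round(acc, 3))
# QDA estimates a separate covariance matrix per class, so with a small
# training set it typically pays a variance penalty relative to LDA here.
```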
2. (a) Sketch the tree corresponding to the CART partition given below. The numbers in the boxes indicate the mean response within each region.
[Figure: a CART partition of the (X1, X2) feature space into rectangular regions, with X1 on the horizontal axis (ticks 0–3) and X2 on the vertical axis (ticks 0–2); the mean response for each region is shown in a box.]
(b) Create a diagram similar to that given in part (a) using the CART tree below. Indicate the mean response in each region of the partitioned feature space.
[Figure: a CART tree whose internal nodes split on X1 < 1, X2 < 1, X1 < 0, X1 < 2 and X2 < 2, with terminal-node mean responses 5, 20, 10, 7, 10 and 13.]
(c) Describe one advantage of decision trees for regression or classification over other statistical learning methods.
(d) When fitting a decision tree, we tend to grow a large tree and then prune the large tree to obtain our final model. Why do we prune large trees?
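A minimal sketch of the grow-then-prune workflow described in part (d), using scikit-learn's cost-complexity pruning. The simulated data and the particular alpha chosen from the pruning path are illustrative assumptions.

```python
# Sketch: grow a large regression tree, then prune it back.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 3, size=(200, 2))
y = np.where(X[:, 0] < 1, 5.0, 10.0) + rng.normal(scale=1.0, size=200)

# Grow a deliberately large (low-bias, high-variance) tree.
big = DecisionTreeRegressor(min_samples_leaf=2, random_state=0).fit(X, y)

# Cost-complexity pruning path: increasing alpha trades training fit
# for a smaller, lower-variance tree.
path = big.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[-3]  # a fairly aggressive (illustrative) choice
pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)

print("leaves before pruning:", big.get_n_leaves())
print("leaves after pruning: ", pruned.get_n_leaves())
```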
(e) The predictive performance of a single tree can be substantially improved by aggregating many decision trees. Briefly explain the random forests method.
(f) Briefly explain the out-of-bag (OOB) error estimation method for random forests.
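A minimal sketch of OOB error estimation for part (f), assuming scikit-learn's RandomForestRegressor; the simulated regression data are an illustrative assumption.

```python
# Sketch: out-of-bag (OOB) error estimation for a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)

# Each tree is fit to a bootstrap sample, which leaves out roughly a
# third of the observations; oob_score=True predicts each observation
# using only the trees that did NOT see it, giving an internal
# test-error-like estimate without a separate validation set.
rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                           random_state=0).fit(X, y)
print("OOB R^2 estimate:", round(rf.oob_score_, 3))
```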
3. (a) Suppose that we have the following 100 market basket transactions.
Transaction     Frequency
{a}                     9
{a, b}                  8
{a, d}                 11
{c, d}                  5
{a, b, c}              17
{a, b, d}               6
{a, b, c, d}           10
{b, c, d}              34
Total                 100
For example, there are 8 transactions of the form {a, b}.
i. Compute the support of {d}, {a, c}, and {a, c, d}.
ii. Compute the confidence of the association rules {a, c} → {d} and {d} → {a, c}.
Is confidence a symmetric measure? Justify your answer.
iii. If minsup = 0.25, is {a,b,c} a maximal frequent itemset? Justify your answer.
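A minimal Python sketch (an illustration, not a model answer) of how support and confidence are computed from a frequency table like the one in part (a); encoding the table as a dictionary of frozensets is an illustrative choice.

```python
# Sketch: support and confidence from transaction frequencies.
freq = {frozenset("a"): 9,     frozenset("ab"): 8,
        frozenset("ad"): 11,   frozenset("cd"): 5,
        frozenset("abc"): 17,  frozenset("abd"): 6,
        frozenset("abcd"): 10, frozenset("bcd"): 34}
n = sum(freq.values())  # 100 transactions in total

def support(items):
    """Fraction of transactions containing every item in `items`."""
    items = frozenset(items)
    return sum(count for t, count in freq.items() if items <= t) / n

def confidence(lhs, rhs):
    """c(lhs -> rhs) = s(lhs union rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support("d"), support("ac"), support("acd"))
print(confidence("ac", "d"), confidence("d", "ac"))
```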
(b) Is it possible to have more closed frequent itemsets than maximal frequent itemsets? Justify your answer.
(c) Lift is defined as

Lift(X → Y) = s(X ∪ Y) / ( s(X) s(Y) ),

where s(·) denotes support. What does it mean if Lift(X → Y) < 1?
(d) Suppose that the Apriori algorithm has found five frequent 3-itemsets
{v, w, x}, {v, w, y}, {v, x, y}, {v, x, z}, {w, x, y}.
i. Use the Fk−1 × Fk−1 method to generate all possible candidate 4-itemsets.
ii. Are any candidate 4-itemsets infrequent? Explain.
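A minimal Python sketch of the Fk−1 × Fk−1 join used in part (d) i.: two frequent (k−1)-itemsets are merged only when their first k−2 items agree in sorted order. The tuple encoding is an illustrative choice, and the sketch shows only the join step, not the subsequent subset pruning.

```python
# Sketch: F_{k-1} x F_{k-1} candidate generation (the join step).
from itertools import combinations

# Frequent 3-itemsets, each stored in sorted order.
F3 = [("v", "w", "x"), ("v", "w", "y"), ("v", "x", "y"),
      ("v", "x", "z"), ("w", "x", "y")]

candidates = set()
for a, b in combinations(F3, 2):
    if a[:-1] == b[:-1]:  # first k-2 = 2 items must match
        candidates.add(tuple(sorted(set(a) | set(b))))

print(sorted(candidates))
```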
(e) State the Apriori principle. Use this principle to show that
c({x, y} → {z}) ≥ c({x} → {y, z}), where c(·) denotes confidence.
4. (a) Using one or two sentences, explain the difference between supervised learning and unsupervised learning.
(b) Suppose that we have the following training data with six observations, three predictors and a response variable Y:
Observation   X1   X2   X3    Y
1              0    3    0    4
2              2    0    0    3
3              0    1    3    1
4              0    1    2   12
5             -1    0    1   10
6              1    1    1    2
We wish to make a prediction for a test point X1 = X2 = X3 = 0 using k-nearest neighbours.
i. Compute the Euclidean distance between each training observation and the test point X1 = X2 = X3 = 0. (Hint: ∥x∥₂ = √( ∑_{i=1}^{p} |x_i|² ).)
ii. What is our prediction for the test point with k = 1? Why?
iii. What is our prediction for the test point with k = 3? Why?
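A minimal Python sketch (an illustration, not a model answer) of the distance computation in part (b), using the training table as printed above; NumPy and the row ordering are illustrative choices.

```python
# Sketch: kNN regression predictions at the origin.
import numpy as np

X = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3],
              [0, 1, 2], [-1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([4, 3, 1, 12, 10, 2], dtype=float)
test = np.zeros(3)  # the test point X1 = X2 = X3 = 0

dists = np.linalg.norm(X - test, axis=1)  # Euclidean distance per row
order = np.argsort(dists)                 # nearest first

print("distances:", dists.round(3))
print("k=1 prediction:", y[order[:1]].mean())  # mean of 1 nearest response
print("k=3 prediction:", y[order[:3]].mean())  # mean of 3 nearest responses
```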
(c) This question examines the k-means clustering algorithm.
i. Briefly explain the k-means clustering algorithm.
ii. After several iterations, k-means (using Euclidean distance with k = 2) has produced the following cluster assignment:
Cluster 1 = {(1, 2), (2, 3), (5, 2), (2, 1)} Cluster 2 = {(6, 2), (5, 3), (7, 4)}.
Has the k-means algorithm converged? Justify your answer.
iii. Would we generally expect to get the same clustering if we run the k-means algorithm several times? Explain.
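A minimal Python sketch of a convergence check for part (c) ii.: under Euclidean k-means, the algorithm has converged only if every point is at least as close to its own cluster's centroid as to the other's. The NumPy encoding is an illustrative choice.

```python
# Sketch: has this k-means (k = 2) assignment converged?
import numpy as np

cluster1 = np.array([(1, 2), (2, 3), (5, 2), (2, 1)], dtype=float)
cluster2 = np.array([(6, 2), (5, 3), (7, 4)], dtype=float)
m1, m2 = cluster1.mean(axis=0), cluster2.mean(axis=0)  # centroids

# If any point is strictly closer to the other centroid, the next
# assignment step would move it, so the algorithm has not converged.
for name, points in (("cluster 1", cluster1), ("cluster 2", cluster2)):
    for p in points:
        nearer = "m1" if np.linalg.norm(p - m1) < np.linalg.norm(p - m2) else "m2"
        print(name, tuple(p), "-> nearest centroid:", nearer)
```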
(d) What are the advantages and disadvantages of k-fold cross-validation relative to the validation set approach?
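A minimal sketch contrasting the two approaches in part (d), assuming scikit-learn; the linear model and simulated data are illustrative assumptions.

```python
# Sketch: validation set approach vs 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)

# Validation set approach: one random split; the estimate is highly
# variable and only half the data is used for fitting.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5,
                                          random_state=0)
print("validation R^2:",
      round(LinearRegression().fit(X_tr, y_tr).score(X_va, y_va), 3))

# 5-fold CV: every observation is used for both fitting and validation,
# giving a steadier estimate at roughly 5x the computational cost.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold CV R^2:", round(scores.mean(), 3))
```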
End of Examination