Time allowed: TWO HOURS Number of Pages: 37
Read these instructions carefully. • Answer all FOUR questions.
• All questions carry equal marks. • Calculators are permitted.
• Use black or blue ink only.
• Show all working.
• Write your answers in the spaces pro-
vided.
• You may use the left-hand pages for rough working.
UNIVERSITY OF CANTERBURY
END OF YEAR EXAMINATIONS
Prescription Number(s): STAT318/462-16S2 (C) Paper Title: Data Mining
Name:
Student ID number:
Question 1 2 3 4 number
Marks
Total marks:
2 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
3 STAT318/462-16S2 (C)
1. (a) What do we mean by the variance and bias of a statistical learning method? What is the bias-variance trade-off in statistical learning?
TURN OVER
4 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
5 STAT318/462-16S2 (C)
(b) Suppose we have the following training data with six observations, three pre- dictors and one qualitative response variable Y :
Observation Number 1 2 3 4 5 6
X1 X2 X3 Y
0 3 0 Red 2 0 0 Red 0 1 3 Red 0 1 2 Blue
-1 0 1 Blue 1 1 1 Red
WewishtomakeapredictionforatestpointX1 =X2 =X3 =0using K-nearest neighbors.
i. Compute the Euclidean distance between each training observation and the test point. (Hint: ∥x∥2 = pi=1 |xi|2 )
TURN OVER
6 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
7 STAT318/462-16S2 (C) ii. What is our prediction for the test point with K = 1? Why?
iii. What is our prediction for the test point with K = 3? Why?
TURN OVER
8 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
9 STAT318/462-16S2 (C)
(c) Briefly explain how K-fold cross-validation can be used to approximate a test- ing mean squared error. What is an advantage of cross-validation over the validation set approach?
TURN OVER
10 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
11 STAT318/462-16S2 (C)
2. (a) Sketch the tree corresponding to the CART partition given below. The num- bers in the boxes indicate the mean response within each region.
15
5
3
0
10
X2 1 0
01 X1
TURN OVER
12 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
13 STAT318/462-16S2 (C)
(b) Create a diagram similar to what was given in part (a) using the CART tree given below. Indicate the mean response in each region of the partitioned feature space.
X1 < 1
-2.0
X2 < 2
0.5
-1.1
X2 < 1
X1 < 0
2.5
0.2
TURN OVER
14 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
15 STAT318/462-16S2 (C)
(c) Briefly explain the method of bootstrap aggregation (bagging) for statistical learning methods. Why is bagging particularly useful for decision trees?
TURN OVER
16 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
17 STAT318/462-16S2 (C)
(d) Briefly explain the random forests method. What is the advantage of random forests over bagged decision trees?
TURN OVER
18 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
19 STAT318/462-16S2 (C)
(e) Decision tree methods often grow large trees and then prune them back into smaller trees. Why are decision trees pruned? Would you recommend using pruned trees in a random forest? Why?
TURN OVER
20 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
21 STAT318/462-16S2 (C) 3. Suppose that we have the following market basket transactions:
Transaction Frequency Transaction Frequency {a} 9 {a,b,c} 27 {a,c} 10 {a,b,d} 11 {b,d} 3 {a,b,c,d} 21 {c,d} 5 {b,c,d} 14
For example, there are 10 transactions of the form {a, c} and 100 in total. (a) Compute the support of {d}, {b, c}, and {b, c, d}.
(b) Compute the confidence of the association rules {b, c} → {d} and {d} → {b, c}. Is confidence a symmetric measure? Justify your answer.
TURN OVER
22 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
23 STAT318/462-16S2 (C) (c) Find the 2-itemset(s) with the largest support.
(d) If minsup = 0.2, is {a, b, d} a maximal frequent itemset? Justify your answer.
TURN OVER
24 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
(e) Lift is defined as
25 STAT318/462-16S2 (C)
Lift(X → Y ) = s(X ∪ Y ) . s(X)s(Y )
Compute the Lift of the rules {a} → {b} and {a} → {d}. For each rule, deter- mine whether the items are independent, positively correlated, or negatively correlated. Justify your answer.
TURN OVER
26 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
27 STAT318/462-16S2 (C)
(f) Suppose we add 1000 transactions of the form {c,f,g} to our set of market basket transactions. Would the Lift of the rules {a} → {b} and {a} → {d} change? Justify your answer.
(g) Let ∅ denote the empty set. Find the confidence of the rules {b, c, f } → ∅ and ∅ → {b,c,f}.
TURN OVER
28 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
29 STAT318/462-16S2 (C) 4. (a) Briefly explain the steps of the k-means clustering algorithm.
TURN OVER
30 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
31 STAT318/462-16S2 (C)
(b) Briefly discuss one strength and one weakness of the k-means clustering algo- rithm.
TURN OVER
32 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
33 STAT318/462-16S2 (C)
(c) Briefly explain single linkage hierarchical clustering and complete linkage hier- archical clustering.
TURN OVER
34 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
35 STAT318/462-16S2 (C) (d) Suppose that we have four points with the following distance matrix:
0 0.30.4 0.7 0.30 0.5 0.8 0.4 0.5 0 0.45
0.7 0.8 0.45 0
For example, the distance between the first and second point is 0.3 and the
distance between the third and fourth point is 0.45.
i. Using the distance matrix above, sketch the dendrogram that results from hierarchically clustering these points using single linkage. Clearly label your dendrogram and include all merging distances.
ii. Suppose we want a clustering with two clusters. Which points are in each cluster for single linkage?
TURN OVER
36 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK
37 STAT318/462-16S2 (C) iii. Repeat part (i) using complete linkage.
iv. Suppose we want a clustering with two clusters. Which points are in each cluster for complete linkage?
END OF TEST