CS代考 Time allowed: TWO HOURS Number of Pages: 37

Time allowed: TWO HOURS Number of Pages: 37
Read these instructions carefully. • Answer all FOUR questions.
• All questions carry equal marks. • Calculators are permitted.
• Use black or blue ink only.
• Show all working.
• Write your answers in the spaces pro-
vided.
• You may use the left-hand pages for rough working.
UNIVERSITY OF CANTERBURY
END OF YEAR EXAMINATIONS
Prescription Number(s): STAT318/462-16S2 (C) Paper Title: Data Mining
Name:
Student ID number:
Question 1 2 3 4 number
Marks
Total marks:

2 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK

3 STAT318/462-16S2 (C)
1. (a) What do we mean by the variance and bias of a statistical learning method? What is the bias-variance trade-off in statistical learning?
TURN OVER

4 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK

5 STAT318/462-16S2 (C)
(b) Suppose we have the following training data with six observations, three pre- dictors and one qualitative response variable Y :
Observation Number 1 2 3 4 5 6
X1 X2 X3 Y
0 3 0 Red 2 0 0 Red 0 1 3 Red 0 1 2 Blue
-1 0 1 Blue 1 1 1 Red
WewishtomakeapredictionforatestpointX1 =X2 =X3 =0using K-nearest neighbors.
i. Compute the Euclidean distance between each training observation and the test point. (Hint: ∥x∥2 = 􏰎􏰊pi=1 |xi|2 )
TURN OVER

6 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK

7 STAT318/462-16S2 (C) ii. What is our prediction for the test point with K = 1? Why?
iii. What is our prediction for the test point with K = 3? Why?
TURN OVER

8 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK

9 STAT318/462-16S2 (C)
(c) Briefly explain how K-fold cross-validation can be used to approximate a test- ing mean squared error. What is an advantage of cross-validation over the validation set approach?
TURN OVER

10 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK

11 STAT318/462-16S2 (C)
2. (a) Sketch the tree corresponding to the CART partition given below. The num- bers in the boxes indicate the mean response within each region.
15
5

3
0
10
X2 1 0
01 X1
TURN OVER

12 STAT318/462-16S2 (C)
THIS PAGE IS INTENTIONALLY LEFT BLANK

13 STAT318/462-16S2 (C)
(b) Create a diagram similar to what was given in part (a) using the CART tree given below. Indicate the mean response in each region of the partitioned feature space.
X1 < 1 -2.0 X2 < 2 0.5 -1.1 X2 < 1 X1 < 0 2.5 0.2 TURN OVER 14 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 15 STAT318/462-16S2 (C) (c) Briefly explain the method of bootstrap aggregation (bagging) for statistical learning methods. Why is bagging particularly useful for decision trees? TURN OVER 16 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 17 STAT318/462-16S2 (C) (d) Briefly explain the random forests method. What is the advantage of random forests over bagged decision trees? TURN OVER 18 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 19 STAT318/462-16S2 (C) (e) Decision tree methods often grow large trees and then prune them back into smaller trees. Why are decision trees pruned? Would you recommend using pruned trees in a random forest? Why? TURN OVER 20 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 21 STAT318/462-16S2 (C) 3. Suppose that we have the following market basket transactions: Transaction Frequency Transaction Frequency {a} 9 {a,b,c} 27 {a,c} 10 {a,b,d} 11 {b,d} 3 {a,b,c,d} 21 {c,d} 5 {b,c,d} 14 For example, there are 10 transactions of the form {a, c} and 100 in total. (a) Compute the support of {d}, {b, c}, and {b, c, d}. (b) Compute the confidence of the association rules {b, c} → {d} and {d} → {b, c}. Is confidence a symmetric measure? Justify your answer. TURN OVER 22 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 23 STAT318/462-16S2 (C) (c) Find the 2-itemset(s) with the largest support. (d) If minsup = 0.2, is {a, b, d} a maximal frequent itemset? Justify your answer. TURN OVER 24 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK (e) Lift is defined as 25 STAT318/462-16S2 (C) Lift(X → Y ) = s(X ∪ Y ) . s(X)s(Y ) Compute the Lift of the rules {a} → {b} and {a} → {d}. For each rule, deter- mine whether the items are independent, positively correlated, or negatively correlated. Justify your answer. TURN OVER 26 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 27 STAT318/462-16S2 (C) (f) Suppose we add 1000 transactions of the form {c,f,g} to our set of market basket transactions. Would the Lift of the rules {a} → {b} and {a} → {d} change? Justify your answer. (g) Let ∅ denote the empty set. Find the confidence of the rules {b, c, f } → ∅ and ∅ → {b,c,f}. TURN OVER 28 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 29 STAT318/462-16S2 (C) 4. (a) Briefly explain the steps of the k-means clustering algorithm. TURN OVER 30 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 31 STAT318/462-16S2 (C) (b) Briefly discuss one strength and one weakness of the k-means clustering algo- rithm. TURN OVER 32 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 33 STAT318/462-16S2 (C) (c) Briefly explain single linkage hierarchical clustering and complete linkage hier- archical clustering. TURN OVER 34 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 35 STAT318/462-16S2 (C) (d) Suppose that we have four points with the following distance matrix: 0 0.30.4 0.7 0.30 0.5 0.8 0.4 0.5 0 0.45 0.7 0.8 0.45 0 For example, the distance between the first and second point is 0.3 and the distance between the third and fourth point is 0.45. i. Using the distance matrix above, sketch the dendrogram that results from hierarchically clustering these points using single linkage. Clearly label your dendrogram and include all merging distances. ii. Suppose we want a clustering with two clusters. Which points are in each cluster for single linkage? TURN OVER 36 STAT318/462-16S2 (C) THIS PAGE IS INTENTIONALLY LEFT BLANK 37 STAT318/462-16S2 (C) iii. Repeat part (i) using complete linkage. iv. Suppose we want a clustering with two clusters. Which points are in each cluster for complete linkage? END OF TEST