Time allowed: TWO HOURS
Number of Pages: 7
Read these instructions carefully.
• Answer all FOUR questions
• All questions carry equal marks.
• Calculators are permitted.
• Use black or blue ink only.
• Show all working.
• Write your answers in the booklet provided.
UNIVERSITY OF CANTERBURY
END OF YEAR EXAMINATIONS
Prescription Number(s): STAT318/462-15S2 (C)
Paper Title: Data Mining
1. Suppose a course had four key internal elements (with labels in brackets):
• total assignment mark (Assign);
• skills test (Skill);
• mid-course test mark (Test);
• project marks (Project).
You are interested in investigating whether there is any association between the final exam marks and these four explanatory variables.
(a) Why would you consider a regression with a subset of the explanatory variables?
(b) One subset selection method is called forward stepwise selection. Please describe the procedure for forward stepwise selection.
(c) Table 1 is a summary of the p-values for the F-test when considering adding the relevant variable to the model. Carry out forward stepwise selection to determine the best model, using the p-value as the variable inclusion criterion, until all variables in the model are significant. Use an entry significance level of α = 0.05. Explain the stepwise process undertaken at ALL steps.
(d) Another subset selection method is called backward stepwise selection. Please describe the procedure for backward stepwise selection.
(e) Please use Table 1 to carry out backward stepwise selection to determine the best model, using the p-value as the variable exclusion criterion, until all variables in the model are significant. Use a significance level of α = 0.05. Explain the stepwise process undertaken at ALL steps.
Current Model                               Term to Enter   p-value
y = β0                                      Assign          7e-15
                                            Skill           7e-10
                                            Project         4e-18
                                            Test            2e-17
y = β0 + β1 Assign                          Skill           1e-4
                                            Project         6e-8
                                            Test            1e-6
y = β0 + β1 Skill                           Assign          1e-9
                                            Project         1e-10
                                            Test            6e-10
y = β0 + β1 Project                         Assign          1e-4
                                            Skill           0.05
                                            Test            9e-5
y = β0 + β1 Test                            Assign          3e-3
                                            Skill           0.34
                                            Project         9e-5
y = β0 + β1 Assign + β2 Skill               Project         3e-5
                                            Test            3e-3
y = β0 + β1 Assign + β2 Project             Skill           0.09
                                            Test            0.01
y = β0 + β1 Assign + β2 Test                Skill           0.04
                                            Project         3e-3
y = β0 + β1 Skill + β2 Project              Assign          3e-3
                                            Test            5e-3
y = β0 + β1 Skill + β2 Test                 Assign          6e-3
                                            Project         1e-3
y = β0 + β1 Project + β2 Test               Assign          6e-3
                                            Skill           1e-3
y = β0 + β1 Assign + β2 Skill + β3 Project  Test            0.02
y = β0 + β1 Assign + β2 Skill + β3 Test     Project         0.002
y = β0 + β1 Assign + β2 Project + β3 Test   Skill           0.31
y = β0 + β1 Skill + β2 Project + β3 Test    Assign          0.01

Table 1: Table for question 1
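The forward stepwise procedure referred to in question 1 can be sketched in a few lines of Python. This is a minimal illustration only, under several assumptions: the function name `forward_stepwise`, the callback `p_value_to_enter`, the predictors A, B and C and their p-values are all hypothetical and unrelated to Table 1, and a term is treated as significant only when its p-value is strictly below α.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_stepwise(
    candidates: Iterable[str],
    p_value_to_enter: Callable[[FrozenSet[str], str], float],
    alpha: float = 0.05,
) -> List[str]:
    """Greedy forward selection using an entry p-value threshold."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        current = frozenset(selected)
        # p-value for adding each remaining term to the current model
        scores = {term: p_value_to_enter(current, term) for term in remaining}
        best = min(scores, key=scores.get)
        # Assumption: stop when the best candidate is not strictly below alpha
        if scores[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical p-values for three made-up predictors (not taken from Table 1)
toy_table = {
    (frozenset(), "A"): 0.001,
    (frozenset(), "B"): 0.020,
    (frozenset(), "C"): 0.400,
    (frozenset({"A"}), "B"): 0.030,
    (frozenset({"A"}), "C"): 0.600,
    (frozenset({"A", "B"}), "C"): 0.700,
}

chosen = forward_stepwise(["A", "B", "C"], lambda model, term: toy_table[(model, term)])
print(chosen)  # ['A', 'B']
```

The lookup is keyed on the current model and the candidate term, which mirrors how a table of entry p-values is read one step at a time.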
2. (a) The number of Christchurch days with frosts in winter is critical for farmers trying to estimate the amount of feed that is needed. Let X denote the number of frost days per year measured at the Botanic Gardens. The last 8 measurements are:
x = {17, 46, 27, 24, 26, 27, 26, 31}
Which of the examples below are NOT plausible bootstrap samples from x? (There may be more than one.)
xboot1 = {17, 46, 27, 24, 26, 27, 26, 31}
xboot2 = {46, 27, 24, 26, 27, 26}
xboot3 = {27, 46, 27, 24, 46, 27, 26, 31}
xboot4 = {17, 27, 25, 26, 17, 17, 46, 31}
xboot5 = {31, 31, 31, 31, 31, 31, 31, 31}
(b) Explain how k-fold cross-validation is implemented.
(c) What are the advantages and disadvantages of k-fold cross-validation relative to
i. the validation set approach?
ii. leave-one-out cross-validation?
(d) Briefly explain the bagging method for trees.
(e) Would you expect bagged trees to be correlated or uncorrelated? Please explain.
(f) Briefly explain the random forest method.
(g) Would you expect trees in a random forest to be more correlated than an ensemble of bagged trees? Please explain.
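The resampling ideas in question 2 can be illustrated with a short Python sketch. It is one possible implementation, not a prescribed one: the helper names `bootstrap_sample` and `k_fold_indices`, the seed and the choice k = 4 are assumptions made for illustration. A bootstrap sample draws n observations with replacement from the n original observations, and k-fold cross-validation holds out each of k disjoint folds in turn.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_sample(data: Sequence[float], rng: random.Random) -> List[float]:
    """Draw len(data) observations from data with replacement."""
    return rng.choices(list(data), k=len(data))


def k_fold_indices(n: int, k: int, rng: random.Random) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index pairs that together cover range(n)."""
    order = list(range(n))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k roughly equal-sized folds
    splits = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one currently held out
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
x = [17, 46, 27, 24, 26, 27, 26, 31]
sample = bootstrap_sample(x, rng)
splits = k_fold_indices(len(x), k=4, rng=rng)

print(sample)
for train, held_out in splits:
    print(sorted(train), sorted(held_out))
```

In practice a model would be fitted on each `train` index set and its error measured on the matching held-out fold, with the k error estimates then averaged.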
3. We want to predict the outcome of the next tennis match between two top-ranked tennis players: Williams (SW) and Sharapova (MS). Data collected on these players is given in the following table.
Time        Match Type    Court Surface   Outcome
Morning     Premier       Grass           SW
Afternoon   Grand Slam    Clay            SW
Night       Friendly      Hard            SW
Afternoon   Friendly      Clay            MS
Afternoon   Premier       Clay            MS
Afternoon   Grand Slam    Grass           SW
Afternoon   Grand Slam    Hard            SW
Afternoon   Grand Slam    Hard            SW
Morning     Premier       Grass           SW
Afternoon   Premier       Clay            MS
Night       Friendly      Hard            SW
Night       Premier       Clay            MS
Afternoon   Premier       Clay            MS
Afternoon   Premier       Grass           SW
Afternoon   Grand Slam    Hard            SW
Afternoon   Grand Slam    Clay            SW
(a) List the features and class labels for this problem.
(b) Calculate the impurity at the root node using the Gini impurity measure.
($I(\text{Parent}) = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of observations in the $i$th class.)
(c) The best multiway split for a non-terminal node is the one that maximizes the change in impurity, given by
$$\Delta(I) = I(\text{Parent}) - \sum_j \frac{|R_j|}{|\text{Parent}|} I(R_j),$$
where $|R_j|$ is the number of observations at the $j$th descendent node. $\Delta(I)$ has been calculated for Time and Match Type, giving the values 0.028 and 0.132, respectively. Calculate $\Delta(I)$ for Court Surface.
(d) Using your results from part (c), sketch the best split and label each descendent node.
(e) Continue growing the tree from part (d) by finding the best multiway split for all impure descendent nodes. Plot your tree, labeling terminal nodes and queries.
(f) The next match between Williams and Sharapova is a friendly match on a clay court surface, and will be played in the afternoon. Predict the outcome of the match using your decision tree. Does your prediction change if the match is played at night?
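The Gini impurity and the change in impurity for a multiway split, as defined in question 3, can be computed along the following lines. This is a minimal sketch: the function names `gini` and `impurity_decrease` and the weather data are made up for illustration and are unrelated to the tennis table.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Change in Gini impurity for a multiway split on one categorical feature."""
    # One descendant node per distinct feature value
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    # Impurity of each descendant, weighted by its share of the parent
    weighted = sum(len(group) / n * gini(group) for group in groups.values())
    return gini(labels) - weighted


# Made-up toy data, unrelated to the tennis table
weather = ["sun", "sun", "rain", "rain"]
played = ["yes", "yes", "no", "yes"]

print(gini(played))                        # 0.375
print(impurity_decrease(weather, played))  # 0.125
```

Comparing `impurity_decrease` across candidate features and taking the largest value corresponds to choosing the best multiway split at a node.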
4. The simple matching coefficient (SMC) and the Jaccard coefficient (JC) are similarity measures that can be used to compare binary-valued vectors. They are defined as
$$\text{SMC} = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}} \quad \text{and} \quad \text{JC} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$
Consider the following binary-valued vectors:
x = (0,0,0,0,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1)
y = (0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,1,0,0,0,1).
(a) Define what the notation f00 means.
(b) Calculate the SMC and JC for x and y.
(c) If these binary-valued vectors are asymmetric, would you measure their similarity using SMC or JC? Explain why you have made your choice.
(d) If the cosine similarity of two binary-valued vectors is zero, what is the value of f11? (cosine(x, y) = x · y / (∥x∥ ∥y∥))
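The two coefficients defined at the start of question 4 can be evaluated with a short Python sketch. It assumes the usual convention that the first subscript refers to the first vector and the second subscript to the second vector. The helper names and the vectors `u` and `v` are hypothetical, and returning 0 for JC when both vectors are all zeros is a choice made here, not something the definition specifies.

```python
from typing import Sequence, Tuple


def match_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (f00, f01, f10, f11); f_ab counts positions where x is a and y is b."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        counts[(a, b)] += 1
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]


def smc(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching coefficient: matches of either kind over all positions."""
    f00, f01, f10, f11 = match_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient: 1-1 matches over positions that are not 0-0."""
    _, f01, f10, f11 = match_counts(x, y)
    denominator = f01 + f10 + f11
    # Assumption: report 0.0 when both vectors are all zeros
    return f11 / denominator if denominator else 0.0


# Hypothetical short vectors, not the x and y of the question
u = (1, 0, 0, 1, 0)
v = (1, 1, 0, 0, 0)

print(match_counts(u, v))        # (2, 1, 1, 1)
print(smc(u, v))                 # 0.6
print(round(jaccard(u, v), 3))   # 0.333
```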
The following questions consider hierarchical clustering algorithms.
(e) Explain (using words and/or pictures and/or equations) the single linkage and complete linkage measures.
(f) Using Manhattan distance as a measure of dissimilarity, sketch the dendrograms for the points given in Figure 1 using single linkage and complete linkage hierarchical clustering. Label the cluster merging distances on your plot. (Hint: use the grid to calculate the Manhattan distances.)
Figure 1: Points to hierarchically cluster. (Points A, B, C, D and E plotted on a grid; horizontal axis ticks 1 to 10, vertical axis ticks 1 to 6.)
(g) Explain what the cophenetic distance measures in a hierarchical clustering. What is the cophenetic distance between C and E for single linkage hierarchical clustering?
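The single linkage and complete linkage measures used in the hierarchical clustering parts of question 4 can be sketched in Python with Manhattan distance as the dissimilarity. The coordinates below are invented for illustration and are not the points of Figure 1. The function names are likewise hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple

Point = Tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    """Manhattan (city block) distance between two points in the plane."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def single_linkage(c1: Sequence[Point], c2: Sequence[Point]) -> float:
    """Smallest pairwise distance between a point in c1 and a point in c2."""
    return min(manhattan(p, q) for p, q in product(c1, c2))


def complete_linkage(c1: Sequence[Point], c2: Sequence[Point]) -> float:
    """Largest pairwise distance between a point in c1 and a point in c2."""
    return max(manhattan(p, q) for p, q in product(c1, c2))


# Invented clusters, not the points of Figure 1
left = [(1, 1), (2, 3)]
right = [(5, 1), (6, 4)]

print(single_linkage(left, right))    # 4
print(complete_linkage(left, right))  # 8
```

Agglomerative clustering repeatedly merges the two clusters with the smallest linkage value, and the value at each merge is the height drawn on the dendrogram.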
END OF TEST