CITY UNIVERSITY OF HONG KONG
Module code & title: MS6711 Data Mining
Session: Semester B, 2018-2019
Time allowed: Three hours
Student EID: ___________________ Student Number: __________________
Seat Number: ______________
Instructions to students:
• Write down the student EID, student number, and seat number in the spaces provided.
• Do not turn over the question paper until you are instructed to do so.
• This paper has 25 pages (including this page).
• This paper consists of 11 questions. Answer all questions.
• Write your answers in the spaces provided.
• Show all calculations clearly. Display all non-integer numeric values in at least 2 decimal
places.
• Pages 3, 9, and 25 are intentionally left blank. You may write on them as needed.
Supplemental materials:
• Figures
Allowed materials:
• Hard copy of printed materials
• Approved calculator
For official use only:
1 (9)
2 (9): a, b, c
3 (6): a, b
4 (8)
5 (10): a, b, c, d
6 (20): a, b, c, d
7 (8): a, b, c, d
8 (10): a, b, c, d
9 (6): a, b
10 (7): a, b, c
11 (7)
Total
Question 1 [9 marks]
Suppose you are employed as a data mining consultant for an Internet-based retailer (e.g. Amazon, or Taobao). Describe briefly how the following data mining techniques can help the retailer: clustering, classification, and association analysis.
Answer:
Clustering analysis can be used to discover clusters of users who exhibit similar information needs. By analysing the characteristics of the clusters, web users can be understood better and thus provided with more suitable and customized services.
Association analysis can be used to identify the sets of items viewed by the same customer in each session. Association rules can be formed from these groups of identified items, so that when a customer has browsed (or bought) an item, the associated items can be recommended immediately. This will improve the success rate of cross-selling.
Keeping existing customers informed is a very important business strategy for internet-based retailers. However, the retailer must contact customers with great care; otherwise customers may feel the retailer just keeps sending them irrelevant messages and decide to leave. A classification model can be used to select the group of customers who are most likely to respond to the message sent.
Question 2 [9 marks]
Describe the following approaches in handling missing values with one or two sentences, and then highlight two drawbacks of each approach.
(a) Case deletion
[3 marks]
Answer:
It excludes all samples for which the target or any of the inputs are missing.
Two problems:
1. If the units with missing values differ systematically from the completely observed cases, this could bias the complete-case analysis.
2. If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded.
(b) Mean imputation
[3 marks]
Answer:
It replaces each missing value with the mean of the observed values for that variable.
Two problems:
1. It may severely distort the distribution of the variable, leading to complications with summary measures, including, notably, underestimates of the standard deviation.
2. It may distort relationships between variables by pulling estimates of the correlation toward zero.
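Both drawbacks can be demonstrated with a short Python sketch (the data are simulated purely for illustration; the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a hypothetical Income variable correlated with Age.
age = rng.normal(40, 10, 1000)
income = 1000 * age + rng.normal(0, 5000, 1000)

# Knock out 30% of the income values at random.
income_missing = income.copy()
mask = rng.random(1000) < 0.3
income_missing[mask] = np.nan

# Mean imputation: replace each missing value with the observed mean.
imputed = np.where(np.isnan(income_missing),
                   np.nanmean(income_missing), income_missing)

# Drawback 1: the standard deviation is underestimated.
print(np.std(income), np.std(imputed))

# Drawback 2: the correlation with Age is pulled toward zero.
print(np.corrcoef(age, income)[0, 1], np.corrcoef(age, imputed)[0, 1])
```

Running the sketch shows a smaller standard deviation and a weaker correlation after imputation, as described above.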
(c) Decision tree imputation
[3 marks]
Answer:
It creates a decision tree model to estimate values that will substitute for the missing values. The attribute with missing data is used as the target variable, and the remaining attributes are used as inputs to the decision tree model.
Two problems:
(1) The model-estimated values are usually better behaved than the true values would be.
(2) The computational cost will be high, since a large number of models need to be built (one per attribute with missing values).
Question 3 [6 marks]
(a) Explain briefly the uses of the following statistics in determining the number of clusters for a given data set: Semi-partial R-square (SPRSQ), and R-squared (RSQ).
[4 marks]
Answer:
Semi-partial R-square: measures the loss of homogeneity due to combining two clusters to form a new cluster. It takes values from 0 to 1, where 0 means the new cluster is obtained by merging two perfectly homogeneous clusters and 1 means the new cluster is formed by combining two heterogeneous clusters.
R-squared: measures the extent to which clusters are different from each other. It takes values from 0 to 1, where 0 indicates no differences among clusters and 1 indicates maximum differences among clusters.
(b) Figure Q3b shows the statistics that are obtained from a hierarchical clustering analysis. Determine the possible number of clusters and state your reasons.
[2 marks]
Answer:
There is a larger increase in the semi-partial R-square value from 6 clusters to 5 clusters. The RSQ value also shows a larger decrease from 6 clusters to 5 clusters. This suggests the possible number of clusters is 6.
Question 4 [8 marks]
Figure Q4 shows the price and quality rating data for four brands of beer. Use the centroid method to perform a bottom-up hierarchical clustering of the brands. Show your computations and the centroids of each cluster solution (4-cluster, 3-cluster, 2-cluster, and 1-cluster) clearly. You do not need to standardize the values of the variables. You may use the squared Euclidean distance to measure the distance between two objects.
First level:
4-cluster solution: A, B, C, D and the centroids are (8,10), (5,4), (8,9) and (6,7) respectively.
Second level:
D2(A,B) = 9 + 36 = 45
D2(A,C) = 0 + 1 = 1
D2(A,D) = 4 + 9 = 13
D2(B,C) = 9 + 25 = 34
D2(B,D) = 1 + 9 = 10
D2(C,D) = 4 + 4 = 8
3-cluster solution: AC, B, and D with the centroids (8,9.5), (5,4), and (6,7) respectively.
Third level:
D2(AC,B) = 9+30.25 = 39.25
D2(AC,D) = 4+6.25 = 10.25
D2(B,D) = 1+9 = 10
2-cluster solution: AC, BD with centroids (8,9.5) and (5.5,5.5) respectively.
Fourth level:
1-cluster solution: ACBD with centroid (6.75,7.5).
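The whole agglomeration can be reproduced with a short Python sketch of the centroid method (cluster names are formed by concatenating the brand letters; this is illustrative, not part of the required answer):

```python
import numpy as np

# Price and quality ratings for the four brands (from Figure Q4).
points = {"A": (8, 10), "B": (5, 4), "C": (8, 9), "D": (6, 7)}

# Each cluster keeps the list of its member points; its centroid is their mean.
clusters = {name: [np.array(p, float)] for name, p in points.items()}

def centroid(members):
    return np.mean(members, axis=0)

while len(clusters) > 1:
    # Find the pair of clusters whose centroids have the smallest
    # squared Euclidean distance, then merge them.
    names = list(clusters)
    a, b = min(((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
               key=lambda ab: np.sum((centroid(clusters[ab[0]])
                                      - centroid(clusters[ab[1]])) ** 2))
    clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    print("merged", a, "and", b, "-> centroid",
          tuple(centroid(clusters[a + b])))
```

The merge order (A with C, then B with D, then both) and the final centroid (6.75, 7.5) match the hand computations above.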
Question 5 [10 marks]
Churn is the action of a customer leaving the company for some reason. We can broadly categorize churn by who initiates the action: the company or the customer. We call it voluntary churn if the customer initiates the action. If the company decides to terminate its service to the customer, it is called involuntary churn. Suppose you are now in the month of May and you are asked to predict which customers are likely to churn voluntarily in the coming July. You have been given the historical data of all customers from January of last year until April of the current year.
(a) Describe briefly how you will divide the provided data for inputs and output of your model. There is no need to consider seasonal variation or calendar effect in this question.
[3 marks]
Answer:
The task requires one to predict whether a customer is likely to churn voluntarily 2 months ahead. This time gap must be taken into account when preparing the data.
Since we are given the data from January of last year to April of the current year, we can divide the data into two parts: the first 13 months (January of last year to January of this year) will be used for the inputs, and the status of the customer in April will be the target. The recent months of February and March are latent months; they will not be used in the model.
(b) How would you define the target variable from the provided data set?
[2 marks]
Answer:
The target variable of the model is the status of the customer in the month of April. If a customer churned voluntarily in this month, a value of 1 is assigned to the target variable for this customer. Otherwise a value of 0 is assigned.
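A minimal pandas sketch of constructing this target variable, assuming a hypothetical monthly snapshot table with customer, month, and status columns (the column names and status labels are illustrative):

```python
import pandas as pd

# Hypothetical snapshot table: one row per customer per month, with a
# status flag ("active", "vol_churn", "invol_churn").
snapshots = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 2],
    "month":    ["2019-01", "2019-02", "2019-04"] * 2,
    "status":   ["active", "active", "vol_churn",
                 "active", "active", "active"],
})

# Inputs: months up to January; February and March form the latent gap;
# April supplies the target.
inputs = snapshots[snapshots["month"] <= "2019-01"]
target = (snapshots[snapshots["month"] == "2019-04"]
          .assign(churn=lambda d: (d["status"] == "vol_churn").astype(int))
          [["customer", "churn"]])
print(target)
```

Here customer 1, who churned voluntarily in April, gets target 1, and customer 2 gets target 0.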
(c) What kind of records in the provided data set should be excluded from the analysis?
[2 marks]
Answer:
Customers who churned involuntarily in April or earlier should be excluded, as these churned customers are not the focus of the analysis.
Customers who churned voluntarily before February should be excluded, as they cannot be customers in April.
(d) Explain how seasonal variation (or calendar effect) may affect the performance of the developed model for the above described purpose.
[3 marks]
Answer:
Suppose an input attribute has a high value in last January and a low value in last June due to seasonal variation. When the model is developed for identifying the churners in the coming August, the corresponding input months of the same attribute will be last April and last September respectively. The seasonal pattern of these two months will not necessarily be the same as that in January and June. Since the pattern in the input attributes has changed, the developed model may not perform as consistently as one may expect.
Furthermore, seasonal variation may also have affected the number of churners in last April, the target month of the training data set. When the developed model is deployed for the coming August, it may not perform as consistently as one may expect due to the change of seasonal impact.
Question 6 [20 marks]
Figure Q6 shows the decision tree for classifying potential donors as Response (1) or Non- Response (0):
(a) How much information is gained from the split of "Gift Count Card 36 Months" in terms of Entropy (i.e. from Node A to Node B and Node C)?
Answer:
The Entropy in Node A is:
-(0.482 log2(0.482) + 0.518 log2(0.518)) = -(0.482 (-1.0529) + 0.518 (-0.9489))
= -(-0.5075 -0.4915) = 0.999.
The Entropy in Node B is:
-(0.702 log2(0.702) + 0.298 log2(0.298)) = -(0.702 (-0.5105) + 0.298 (-1.7466))
= 0.8789
The Entropy in Node C is:
-(0.473 log2(0.473) + 0.527 log2(0.527)) = -(0.473 (-1.0801) + 0.527 (-0.9241))
= 0.9979
The average Entropy due to the split is: 124 / 3142 (0.8789) + 3018 / 3142 (0.9979) = 0.0347 + 0.9585 = 0.9932
The information gain is: 0.999 – 0.9932 = 0.0058
[7 marks]
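The entropy and information-gain computation above can be checked with a short Python sketch (the node proportions and case counts are taken from the decision tree in Figure Q6):

```python
import math

def entropy(p):
    """Binary entropy of a node with event proportion p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

e_a = entropy(0.482)               # parent node A
e_b = entropy(0.702)               # child node B (124 cases)
e_c = entropy(0.473)               # child node C (3018 cases)

# Case-weighted average entropy of the children, then the gain.
weighted = (124 / 3142) * e_b + (3018 / 3142) * e_c
gain = e_a - weighted
print(round(e_a, 4), round(weighted, 4), round(gain, 4))
```

The sketch reproduces the hand computation: parent entropy about 0.999, weighted child entropy about 0.9932, and an information gain of roughly 0.006.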
(b) Write all the rules and the respective score of Response (1) that can be derived from each leaf node of the above decision tree.
[6 marks]
Answer:
Rule 1: If Gift Amount Average Card < 14.15. Score of Response (1): 0.567
Rule 2: If Gift Amount Average Card >= 14.15 and Median Home Value Region < 68150. Score of Response (1): 0.377
Rule 3: If Gift Amount Average Card >= 14.15 and Median Home Value Region >= 68150 and Gift Count Card 36 Months >= 3.5. Score of Response (1): 0.702
Rule 4: If Gift Amount Average Card >= 14.15 and Median Home Value Region >= 68150 and Gift Count Card 36 Months < 3.5. Score of Response (1): 0.473
(c) Suppose the target event level in the above decision tree is Response (1). Complete the following confusion matrix with the frequency of all events for the above decision tree at a cutoff of 0.5.
[4 marks]
Answer: (Allow rounding errors)

              Classified Response (1)   Classified Non-Response (0)   Total
Actual 1      2570 + 87 = 2657          759 + 1428 = 2187             4843
Actual 0      1962 + 37 = 1999          1253 + 1590 = 2843            4843
Total         4656                      5030                          9686
(d) Suppose the target event level in the above decision tree is Response (1). What is the score of Response (1) for each of the following donors? If the cutoff score for the target level of the decision tree is set at 0.6, what is the classification decision for each donor?

Variables                     Donor A   Donor B   Donor C
Gift Amount Average Card      10        15        15
Median Home Value Region      70000     70000     70000
Gift Count Card 36 Months     4         4         3
Answer:
Donor A:
Score = 0.567.
Decision: Non-Response.
Donor B:
Score = 0.702.
Decision: Response.
Donor C:
Score = 0.473.
Decision: Non-Response.
[3 marks]
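The leaf rules from part (b) can be expressed as a small scoring function to verify these decisions (the function and argument names are illustrative):

```python
def response_score(gift_avg_card, median_home_value, gift_count_36m):
    """Score of Response (1) from the leaf rules of the Figure Q6 tree."""
    if gift_avg_card < 14.15:
        return 0.567
    if median_home_value < 68150:
        return 0.377
    if gift_count_36m >= 3.5:
        return 0.702
    return 0.473

cutoff = 0.6
for donor, args in {"A": (10, 70000, 4),
                    "B": (15, 70000, 4),
                    "C": (15, 70000, 3)}.items():
    score = response_score(*args)
    print(donor, score, "Response" if score >= cutoff else "Non-Response")
```

At the 0.6 cutoff only Donor B (score 0.702) is classified as Response, matching the answers above.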
Question 7 [8 marks]
Consider a neural network with four input nodes, one hidden layer with one hidden node, and an output layer containing one node for the target event level only. Assume all weights and biases of the links are equal to 0.2. Apply the input instance [0.8, 0.5, 0.2, 1.0] to the neural network.
(a) Using a linear combination function, calculate the input value to the hidden node.
Answer:
Input value: 0.2 + 0.2(0.8)+0.2(0.5)+0.2(0.2)+0.2(1.0)=0.7
[2 marks]
(b) Using a logistic activation function, calculate the output value of the hidden node.
Answer:
Output value: 1/[1+exp(-0.7)] = 0.6682
[1 mark]
(c) Use the output value calculated in (b) to compute the input and output values for the output node. Use a linear combination function and a logistic activation function for your computations.
Answer:
Input value: 0.2 + 0.2(0.6682) = 0.33364
Output value: 1/[1 + exp(-0.33364)] = 0.5826
[2 marks]
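Parts (a) to (c) can be verified with a short Python sketch of the forward pass:

```python
import math

def logistic(x):
    """Logistic activation function."""
    return 1 / (1 + math.exp(-x))

weight = bias = 0.2                    # all weights and biases equal 0.2
inputs = [0.8, 0.5, 0.2, 1.0]

# Hidden node: linear combination, then logistic activation.
hidden_in = bias + sum(weight * x for x in inputs)
hidden_out = logistic(hidden_in)

# Output node: same combination and activation applied to the hidden output.
out_in = bias + weight * hidden_out
out_out = logistic(out_in)
print(hidden_in, hidden_out, out_in, out_out)
```

The sketch reproduces the hand computations: hidden input 0.7, hidden output 0.6682, output-node input 0.33364, and final output 0.5826.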
(d) Describe briefly how to use validation data to prevent a neural network model from overfitting the training data.
[3 marks]
Answer:
The training error goes down as the number of iterations increases. We can’t determine when to stop training by looking at the error on the training sample. We can spot overfitting by watching the error on a validation sample. When the error on the validation sample goes up, overfitting has begun. Consequently, we stop training, go back to the weights that produced the lowest error on the validation sample and use those weights for our model.
Question 8 [10 marks]
Figure Q8 shows the results of a logistic regression model that was developed to predict the response (Y = 1) of customers to a mail promotion:
(a) Explain the meaning of each estimated coefficient of the model.
[3 marks]
Answer:
For 0 income and 0 age, the odds that a customer will respond are e^(-5) = 0.0067.
As Income increases by 1 unit, holding the Age variable constant, the odds of responding increase by a multiplicative factor of e^(0.0025) = 1.0025.
As Age increases by 1 unit, holding the Income variable constant, the odds of responding change by a multiplicative factor of e^(0.046 - 0.005 - 0.01*Age). The exact amount of change depends on the base value of Age.
(b) Assume the cutoff value of the model is 0.5. What is the classification of a customer with an income of $5000 who is 40 years old?
[3 marks]
Answer:
Odds of response = exp(-5 + 0.0025(5000) + 0.046(40) - 0.005(40)^2) = exp(1.34) = 3.8190.
P(response) = 3.8190 / (1 + 3.8190) = 0.7925.
Since the score is higher than the cutoff value, the customer is classified as a response.
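The score in (b) can be checked with a short Python sketch of the fitted model (the coefficient values are as read from Figure Q8; the function name is illustrative):

```python
import math

# Coefficients from Figure Q8: intercept -5, Income 0.0025,
# Age 0.046, Age squared -0.005.
def response_probability(income, age):
    log_odds = -5 + 0.0025 * income + 0.046 * age - 0.005 * age ** 2
    odds = math.exp(log_odds)
    return odds / (1 + odds)

p = response_probability(5000, 40)
classification = "response" if p >= 0.5 else "non-response"
print(round(p, 4), classification)
```

This reproduces the probability 0.7925 and the classification as a response at the 0.5 cutoff.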
(c) To increase the percentage of correctly classified true responses (Y = 1), should the cutoff value of the model be increased or decreased? Justify your answer.
[2 marks]
Answer:
The cutoff value should be decreased so that more customers are classified as responses. Consequently, more true responses will be classified as responses by the model. Hence the classification rate for the true responses increases.
(d) Briefly explain why an ordinary linear regression model is inappropriate for a categorical target variable.
[2 marks]
Answer:
Using a linear regression model to classify a categorical target variable yields scores outside the range of 0 to 1.
The assumption that the target variable follows a normal distribution is violated.
The constant variance assumption for the target variable is violated.
Question 9 [6 marks]
Figure Q9 lists out all possible 3-item frequent sequences that are derived from a set of transaction records.
(a) List all 4-item frequent sequence candidates produced by the candidate generation step of the Apriori-like algorithm. State clearly the sequence IDs of each pair of 3-item frequent sequences that are to be merged for forming each 4-item frequent sequence candidate. Note that only the sequences that are produced by the candidate generation step of Apriori-like algorithm will be considered as the correct candidates.
Answer:
[3 marks]

ID of Merged Sequences    4-item Frequent Sequence Candidate
A,G                       <{1,2,3} {3}>
A,H                       <{1,2,3} {4}>
B,I                       <{1,2} {3} {3}>
B,J                       <{1,2} {3} {4}>
C,G                       <{1} {2,3} {3}>
C,H                       <{1} {2,3} {4}>
(b) Which of the 4-item frequent sequence candidates derived in (a) above are to be pruned, assuming no timing constraints? Justify your answer.
Answer:
The pruned candidates are:
< {1, 2, 3} {3}> because < {1, 3} {3}> is not a 3-item frequent sequence
< {1, 2} {3} {3}> because < {1} {3} {3}> is not a 3-item frequent sequence
< {1, 2} {3} {4}> because < {1} {3} {4}> is not a 3-item frequent sequence
< {1} {2, 3} {3}> because < {1} {3} {3}> is not a 3-item frequent sequence
< {1} {2, 3} {4}> because < {1} {2} {4}> is not a 3-item frequent sequence
[3 marks]
Question 10 [7 marks]
Consider the transaction records as shown in Figure Q10.
(a) Compute the support % for itemsets {i5}, {i2, i4}, and {i2, i4, i5} by treating each transaction ID as a market basket.
Answer:
{i5}: 10 / 10 = 1, 100%
{i2, i4}: 2 / 10 = 0.2, 20%
{i2, i4, i5}: 2 / 10 = 0.2, 20%
[3 marks]
(b) Use the results in (a) above to compute the confidence for these two association rules: {i2, i4} ⇒ {i5}, and {i5} ⇒ {i2, i4}.
Answer:
{i2, i4} ⇒ {i5}: 0.2 / 0.2 = 100%
{i5} ⇒ {i2, i4}: 0.2 / 1 = 20%
[2 marks]
(c) Are the rules specified in (b) above useful or not? Justify your answer.
Answer:
Lift value for rule {i2, i4} ⇒ {i5}: 100% / 100% = 1. The rule is not useful.
Lift value for rule {i5} ⇒ {i2, i4}: 20% / 20% = 1. The rule is not useful.
[2 marks]
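The support, confidence, and lift computations in (a) to (c) can be verified with a short Python sketch (the supports are hard-coded from part (a)):

```python
# Supports from part (a), computed over the 10 transactions in Figure Q10.
supp = {("i2", "i4"): 0.2, ("i5",): 1.0, ("i2", "i4", "i5"): 0.2}

def confidence(antecedent, consequent):
    """Confidence = support(antecedent and consequent) / support(antecedent)."""
    joint = supp[tuple(sorted(antecedent + consequent))]
    return joint / supp[antecedent]

def lift(antecedent, consequent):
    """Lift = confidence of the rule / support of the consequent."""
    return confidence(antecedent, consequent) / supp[consequent]

print(confidence(("i2", "i4"), ("i5",)), lift(("i2", "i4"), ("i5",)))
print(confidence(("i5",), ("i2", "i4")), lift(("i5",), ("i2", "i4")))
```

Both rules have a lift of 1, confirming that neither rule is useful: the antecedent gives no information about the consequent beyond its baseline support.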
Question 11 [7 marks]
Figure Q11 shows a set of Age values and the respective rank index in the data set. Complete the following tables for creating 5 quantile bins (equal frequency bins) for the Age values.
Index of each quantile bin Qk, where k = 1, 2, 3, and 4:
Quantile Bin    Index of Quantile Bin      Upper limit of Quantile Bin
Q1 = 1/5        27 × 1/5 = 5.4 ≈ 6         < 20
Q2 = 2/5        27 × 2/5 = 10.8 ≈ 11       < 25
Q3 = 3/5        27 × 3/5 = 16.2 ≈ 17       < 33
Q4 = 4/5        27 × 4/5 = 21.6 ≈ 22       < 36
Age values in the data set that fall into each bin are:

Quantile Bin    Age Values
Q1              13, 15, 16, 16, 19
Q2              20, 20, 21, 22, 22
Q3              25, 25, 25, 25, 30
Q4              33, 33, 35, 35, 35, 35
Q5              36, 40, 45, 46, 52, 70
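The binning procedure above can be reproduced with a short Python sketch (the sorted Age values are taken from Figure Q11; the index of bin k is rounded up, as in the table):

```python
import math

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]   # sorted, n = 27
n = len(ages)

# Upper limit of bin k is the age at rank ceil(n * k / 5), 1-based,
# matching the index calculations in the table above.
limits = [ages[math.ceil(n * k / 5) - 1] for k in range(1, 5)]
print(limits)

# Assign each age to the first bin whose upper limit it falls below;
# ages at or above the last limit go to bin Q5.
bins = [[] for _ in range(5)]
for a in ages:
    k = next((i for i, lim in enumerate(limits) if a < lim), 4)
    bins[k].append(a)
print([len(b) for b in bins])
```

The computed limits (20, 25, 33, 36) and bin contents match the two tables above.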