代写 algorithm The Chinese University of Hong Kong

The Chinese University of Hong Kong
Dept of Decision Sciences & Managerial Economics DSME6650: Individual Exercise
Date Due: Submission:
Question 1
10 Feb 2019 (23:55)
The exercise is in individual basis. Every student should submit his/her own assignment. You should follow the University guidelines as mentioned in http://www.cuhk.edu.hk/policy/academichonesty/ to have an electronic submission. Please submit it to (i) VeriGuide and (ii) the blackboard. Please contact your tutor if you have any problem during your submission.
(a)
15 data points are arranged and settled into three clusters as follows:
Cluster 1
Cluster 2
Cluster 3
Centroid at (15.5, 28.5)
Centroid at (43.75, 19.75)
Centroid at (10.4, 10.6)
(11, 34)
(42, 25)
(5, 12)
(16, 31)
(47, 21)
(10, 18)
(18, 24)
(45, 13)
(15, 8)
(20, 23)
(41, 20)
(9, 4)
(10, 27)
(13, 11)
(18, 32)
(i) Which is the nearest cluster of Cluster 1?
(ii) What are the average Silhouette coefficients of the three clusters?
(iii) What is the average Silhouette coefficient of the entire dataset?
(iv) Which cluster is more cohesive than others?
(v) Justify whether the choice k=3 for the clustering is appropriate or not.
(b) Apply the fuzzy c-means technique on the dataset with three initial centroids, (15, 28), (40, 20) and (10, 8) for three iterations. Compare your result in (a).
Question 2
Consider the training instances shown in Table Q2 for a dichotomous decision problem.
(a) What is the entropy of this collection of customer instances with respect to the target
attribute with value ¡°Stay¡±?
(b) What are the information gains of the two attributes x1 and x2 relative to the customer
instances?
(c) For the attribute x3, identify the best threshold by computing the information gain for
every possible split.
(d) Whatisthebestsplit,amongx1,x2,andx3,accordingtotheinformationgain?
(e) Whatisthebestsplit,betweenx1andx2,accordingtotheGiniindex?
(f) Devise two different but complete decision trees for the instances shown in the table using
information gain.
CID
x1
x2
x3
Target attribute
01
M
HIGH
1.0
Stay
02
M
HIGH
6.0
Stay
03
M
LOW
5.0
Churn
04
F
LOW
4.0
Stay
1

05
F
HIGH
7.0
Churn
06
F
HIGH
3.0
Churn
07
F
LOW
8.0
Churn
08
M
LOW
7.0
Stay
09
F
HIGH
5.0
Churn
Question 3
Table Q2: Instances show whether the customers stay and churn in a brand
Risk assessment is important for financial institution, especially in loan applications. Some have already implemented their own credit-scoring mechanisms to evaluate their clients¡¯ risk and make decisions based on this indicator. In fact, the data gathered by financial institutions is valuable source of information to create information assets, from which credit-scoring mechanisms can be developed. The purpose of this question is to, from information assets, create a decision mechanism that is able to evaluate a client¡¯s risk. The attached data files contain 700 training instances and 300 test instances. The training instances are used to train decision models while the test instances are used for the evaluation. The data set itself is a combination of personal, social and financial information about past bank clients. The complete list of attributes is shown in Table Q3.
No.
Attribute
Description
1
Status
Status of checking account
2
Duration
Loan duration in months
3
Credit history
Client¡¯s credit history
4
Purpose
Purpose of the loan
5
Credit amount
Loan amount
6
Savings
Saving accounts or bonds
7
Employment duration
Duration of the present employment
8
Instalment rate
Instalment rate of disposable income
9
Personal status
Personal marital status and sex
10
Debtors
Other debtors/guarantors
11
Residence
Present residence since
12
Property
Property owned
13
Age
Age in years
14
Instalment plans
Other instalment plans
15
Housing
House status
16
Existing credits
Number of existing credits at the bank
17
Job
Job category
18
Liable people
Number of people liable to
19
Telephone
Existence of telephone number
20
Foreign worker
Native or foreign worker
21
Classification
Credit classification
Table Q3: Description of attributes
(a) Explainwhatarethepre-processingprocedure(s),ifany,thatshouldbetakenbeforethe data mining.
(b) Without any data preprocessing, compare the training and test results on the following major classification algorithms using Weka. You have to produce a short report, with not more than 800 words, to explain your findings.
(i) Logistic regression
(ii) Ridge logistic regression for the optimal ridge 2

(iii) k-NN, for the optimal value of k.
(iv) Decision Tree, using J48
(v) Random Forest
Question 4
Study the business case ¡°Customer Segmentation¡± carefully.
(a) Produce a short summary, not more than 300 words, explain how the customer
segmentation can be achieved.
(b) Using not more than 500 words, list out all the major steps in the KDD approach. In each
step, describe clearly, in form of a table,
(i) what is/are the objective(s) of the step.
(ii) what is/are the major technique(s) or data mining algorithm(s) involved.
(iii) what is/are the precaution(s) that should be taken when the techniques or algorithms
are applied.
*** End ***
3