MS6711 Data Mining
Exercise 3

• Two objects are represented by the points (22, 1, 42, 10) and (20, 0, 36, 8). Compute the Euclidean distance between the two objects.
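
As a check on the hand calculation, the distance can be computed directly from its definition (the square root of the sum of squared coordinate differences). The following is a minimal Python sketch; the function name `euclidean_distance` is introduced here for illustration only and is not part of the exercise.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared differences coordinate by coordinate, then take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p = (22, 1, 42, 10)
q = (20, 0, 36, 8)
distance = euclidean_distance(p, q)
print(round(distance, 4))  # squared differences are 4, 1, 36, 4, so sqrt(45)
```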

• Given the following measurements for the variable age:

18, 22, 25, 42, 28, 43, 33, 35, 56, 28

standardize the variable by:
• the range method, so that the values fall in the range [0, 1];
• the standard deviation method, so that the standard deviation of the transformed variable equals 1.
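
Both transformations can be sketched in a few lines of Python. This is one possible reading of the question, not a prescribed solution: the standard deviation method below also subtracts the mean (a z-score) and uses the sample standard deviation (divisor n - 1); both choices are assumptions. The function names `range_scale` and `sd_scale` are hypothetical.

```python
import statistics

ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

def range_scale(values):
    """Map values linearly so that the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sd_scale(values):
    """Centre on the mean and divide by the sample standard deviation.

    Assumption: sample standard deviation (n - 1). Use statistics.pstdev
    instead if the population formula is intended.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

range_scaled = range_scale(ages)
sd_scaled = sd_scale(ages)

print([round(v, 3) for v in range_scaled])
print([round(v, 3) for v in sd_scaled])
print(round(statistics.stdev(sd_scaled), 6))  # standard deviation after scaling
```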

• Explain the importance of attribute scaling when the similarity of objects is measured. Give an example to illustrate your points.

• The following table presents price (P) and quality rating (Q) data for six brands of beer:

Brand  Price  Quality
A      7.89   10
B      4.79   4
C      7.65   9
D      6.39   7
E      4.50   3
F      6.25   6

(a) Plot the data in two-dimensional space and perform a visual clustering of the brands from the plot.
(b) Perform an agglomerative hierarchical clustering of the brands using the Euclidean distance and the centroid method.
(c) Repeat (b), but use the complete linkage method to form the clusters.
(d) Repeat (b), but use the Ward method to form the clusters.
(e) Compare your solutions with the one obtained in part (a).
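
The merge sequence of an agglomerative procedure can be traced with a short program. The sketch below covers only the complete linkage variant, written in plain Python; the centroid and Ward methods need a different between-cluster distance and are not shown. It works on the raw price and quality values as given, which is an assumption (the two variables are on different scales), and the names `complete_linkage` and `agglomerate` are hypothetical.

```python
import math
from itertools import combinations

# Price and quality for the six brands, taken from the table in the exercise.
brands = {
    "A": (7.89, 10), "B": (4.79, 4), "C": (7.65, 9),
    "D": (6.39, 7), "E": (4.50, 3), "F": (6.25, 6),
}

def complete_linkage(points, c1, c2):
    """Largest pairwise Euclidean distance between members of two clusters."""
    return max(math.dist(points[a], points[b]) for a in c1 for b in c2)

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(name,) for name in sorted(points)]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: complete_linkage(points, *pair))
        history.append((c1, c2, complete_linkage(points, c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return history

history = agglomerate(brands)
for left, right, dist in history:
    print(f"{'+'.join(left)} joins {'+'.join(right)} at distance {dist:.3f}")
```

A large increase in the merge distance from one step to the next is a common sign that two dissimilar groups have just been joined.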

• Suppose that a data mining task is to cluster the following eight points (with (x, y) representing location) into three clusters.

P1 (2,10), P2 (2,5), P3 (8,4), P4 (5,8), P5 (7,5), P6 (6,4), P7 (1,2), P8 (4,9).

The distance function is Euclidean distance. Suppose we initially assign P1, P4, and P7 as the centers of the three clusters, respectively. Use the k-means algorithm to find the final three clusters. You must show clearly, at each round of iteration, the points in each cluster and the respective centroids.
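
The rounds of k-means can be traced with a small program that alternates between assigning each point to its nearest center and recomputing the centroids. The following is a minimal sketch in plain Python. It takes P5 as (7, 5), which is an assumption about the intended coordinates, and the function name `kmeans` is hypothetical. Empty clusters are not handled, since none arise for these data.

```python
import math

points = {
    "P1": (2, 10), "P2": (2, 5), "P3": (8, 4), "P4": (5, 8),
    "P5": (7, 5),  # assumption: P5 is the point (7, 5)
    "P6": (6, 4), "P7": (1, 2), "P8": (4, 9),
}

def kmeans(points, centers):
    """Run k-means until assignments stop changing.

    Returns one (clusters, centroids) pair per round in which
    at least one assignment changed.
    """
    rounds = []
    assignment = None
    while True:
        # Assignment step: each point goes to its nearest current center.
        new_assignment = {
            name: min(range(len(centers)),
                      key=lambda k: math.dist(p, centers[k]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            return rounds
        assignment = new_assignment
        clusters = [[n for n in points if assignment[n] == k]
                    for k in range(len(centers))]
        # Update step: each center becomes the mean of its cluster members.
        centers = [
            tuple(sum(points[n][d] for n in members) / len(members)
                  for d in range(2))
            for members in clusters
        ]
        rounds.append((clusters, centers))

initial = [points["P1"], points["P4"], points["P7"]]
rounds = kmeans(points, initial)
for i, (clusters, centers) in enumerate(rounds, start=1):
    shown = [tuple(round(c, 2) for c in center) for center in centers]
    print(f"Round {i}: clusters={clusters} centroids={shown}")
```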

• Clustering has been popularly recognized as an important data mining task with broad applications. Give one application example for each of the following cases:
• An application that uses clustering as a major data mining function.
• An application that takes clustering as a preprocessing tool for data preparation for other data mining tasks.

• Explain the uses of the following statistics in determining the number of clusters: Semi-partial R-square (SPRSQ), and R-squared (RSQ).

• A company wants to segment its customer base to better understand customer behaviour. Suppose that the following statistics are obtained from a hierarchical clustering analysis. Determine the possible number of clusters and state your reasons.

The CLUSTER Procedure
Ward’s Minimum Variance Cluster Analysis

Cluster History
NCL  --Clusters Joined--  FREQ  SPRSQ   RSQ
 20  OB31      OB36        111  0.0022  .962
 19  CL31      CL36        161  0.0024  .959
 18  CL21      OB9         117  0.0028  .956
 17  CL23      CL33        212  0.0030  .953
 16  CL24      CL28        220  0.0032  .950
 15  OB18      OB37        222  0.0033  .947
 14  CL30      CL20        192  0.0038  .943
 13  CL29      OB16        246  0.0038  .939
 12  OB10      CL25        265  0.0040  .935
 11  CL26      CL18        223  0.0048  .930
 10  CL19      CL32        200  0.0049  .926
  9  CL16      CL17        432  0.0063  .919
  8  CL14      CL11        415  0.0064  .913
  7  CL22      CL15        331  0.0065  .906
  6  CL7       CL12        596  0.0077  .899
  5  CL13      CL27        357  0.0264  .872
  4  CL10      CL8         615  0.0587  .813
  3  CL9       CL6        1028  0.0627  .751
  2  CL3       CL5        1385  0.1321  .619
  1  CL2       CL4        2000  0.6186  .000
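
One common heuristic for reading such a history is to look for merges whose SPRSQ is much larger than that of the preceding merge: a sharp jump at the row for NCL = k suggests that the solution with k + 1 clusters is worth considering. The sketch below ranks the rows by that ratio. It illustrates the heuristic only and is not a definitive rule; the function name `sprsq_jumps` is hypothetical, and the values are copied from the table above.

```python
# (NCL, SPRSQ, RSQ) rows copied from the cluster history.
history = [
    (20, 0.0022, 0.962), (19, 0.0024, 0.959), (18, 0.0028, 0.956),
    (17, 0.0030, 0.953), (16, 0.0032, 0.950), (15, 0.0033, 0.947),
    (14, 0.0038, 0.943), (13, 0.0038, 0.939), (12, 0.0040, 0.935),
    (11, 0.0048, 0.930), (10, 0.0049, 0.926), (9, 0.0063, 0.919),
    (8, 0.0064, 0.913), (7, 0.0065, 0.906), (6, 0.0077, 0.899),
    (5, 0.0264, 0.872), (4, 0.0587, 0.813), (3, 0.0627, 0.751),
    (2, 0.1321, 0.619), (1, 0.6186, 0.000),
]

def sprsq_jumps(rows):
    """Rank merges by how much SPRSQ grew relative to the previous merge.

    Returns (NCL, ratio) pairs, largest ratio first, where ratio is
    SPRSQ at NCL divided by SPRSQ at NCL + 1.
    """
    sprsq = {ncl: value for ncl, value, _ in rows}
    jumps = [(ncl, sprsq[ncl] / sprsq[ncl + 1])
             for ncl in sorted(sprsq) if ncl + 1 in sprsq]
    return sorted(jumps, key=lambda item: item[1], reverse=True)

jumps = sprsq_jumps(history)
for ncl, ratio in jumps[:4]:
    print(f"merge to {ncl} clusters: SPRSQ ratio {ratio:.2f}, "
          f"candidate solution with {ncl + 1} clusters")
```

The ratios should be read together with RSQ, which shows how much of the total variance each candidate solution still explains.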

Exercises for SAS EM
• A new title, “The Art History of Florence”, is ready for release. CBC has sent a test mailing to a random sample of 4,000 customers from its customer base. The customer responses have been collated with past purchase data and stored in a data file named booksales.sas7bdat. Each row (or case) in the data file corresponds to one market test customer. The variable names and descriptions are given in the table below:
Variable          Description
ID#               Identification number in the test data set
Gender            0 = Male, 1 = Female
M                 Monetary: total money spent on books
R                 Recency: months since last purchase
F                 Frequency: total number of purchases
FirstPurch        Months since first purchase
ChildBks          Number of purchases from the category: Child books
YouthBks          Number of purchases from the category: Youth books
CookBks           Number of purchases from the category: Cookbooks
DoItYBks          Number of purchases from the category: Do It Yourself books
RefBks            Number of purchases from the category: Reference books (atlases, encyclopedias, dictionaries)
ArtBks            Number of purchases from the category: Art books
GeoBks            Number of purchases from the category: Geography books
ItalCook          Number of purchases of book title: “Secrets of Italian Cooking”
ItalAtlas         Number of purchases of book title: “Historical Atlas of Italy”
ItalArt           Number of purchases of book title: “Italian Art”
Florence          1 = “The Art History of Florence” was bought, 0 = not bought
Related purchase  Number of related books purchased
• Create clusters of the customers based on the type of books they had purchased before the test mailing.
• Interpret the clusters you have created.

• Consider the data set CUSTOMER_JOIN.SAS7BDAT discussed in Chapter 2.
• Cluster the data set without using the CHURN_REASON column.
• Interpret the clusters with respect to the variables that were used in forming the clusters.
• Remove a random 10% of the data, and repeat the analysis. Does the same picture emerge?
• Which clusters contain mostly churners, and which contain mostly non-churners?