CLUSTERING
Chapter 3:
Cluster Analysis
Objectives
Introduction to cluster analysis
Prepare data for clustering
Clustering algorithms
SAS EM Clustering Node and examples
Introduction
What is cluster analysis?
Clustering is the process of using statistical (or machine learning) algorithms to identify groups based on many variables.
Not quite the same as segmentation.
Segmentation is the process of putting customers into groups based on pre-specified criteria. For example, a fashion retail marketer may want to target women who have a high annual household income, are over twenty-one years old, live in a particular region, etc.
On the other hand, based on all the available variables, cluster analysis may find a group of women who purchased a dress in the first two months of the year and were in their early twenties.
Once clusters are found, the records can be segmented according to the results found in the clustering process.
Introduction
What is cluster analysis?
One of the ways to determine the similarity between two records is to measure the distance between them.
The most frequently used distance measurement is called the Euclidean distance.
The distance between two records (x, y) measured on 2 variables (A, B) is defined as:
distance(x, y) = sqrt( (Ax – Ay)^2 + (Bx – By)^2 )
The distance between two records (x, y) measured on variables (A, B, …, M) is defined as:
distance(x, y) = sqrt( (Ax – Ay)^2 + (Bx – By)^2 + … + (Mx – My)^2 )
Other distance measurements exist.
[Figure: records x and y plotted on axes A and B, with coordinates (Ax, Bx) and (Ay, By).]
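A minimal sketch of this calculation in Python (the function name and example records are illustrative, not part of the course materials):

import math

def euclidean_distance(x, y):
    """Distance between two records given as equal-length lists of numeric values."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two records described by three variables (A, B, M)
record_x = [2.0, 5.0, 1.0]
record_y = [4.0, 1.0, 3.0]
print(euclidean_distance(record_x, record_y))  # sqrt(4 + 16 + 4) = sqrt(24) ≈ 4.899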
Introduction
What is cluster analysis?
Once clusters have been identified, one can try to understand the similarities in and differences between clusters.
Example:
Cluster 1 – Income: High, Children: 1, Car: Luxury
Cluster 2 – Income: Low, Children: 0, Car: Compact
Cluster 3 – Income: Medium, Children: 3, Car: Sedan
Introduction
What is good clustering?
Quality of a clustering solution can be measured objectively using the following statistics:
Within-group similarity, which should be as high as possible.
Between-group similarity, which should be as low as possible.
Some statistics can be used to determine these similarities, but they do not guarantee that the derived clusters are useful.
There is no objective method to determine the usefulness of a clustering solution. The quality is often determined subjectively by the end users.
Most importantly, a good clustering solution should have these features:
It contains a manageable number of clusters.
Each cluster contains a manageable number of observations.
Each cluster (or most of the derived clusters) reveals some interesting features relating to the data mining goal that are not found in other clusters.
Introduction
Basic steps
3. Data understanding
4. Data preparation
5. Verifying the resultant data
7. Analyzing the DM model
Finding clusters in data
Interpretation of clusters
Data Understanding: Choosing The Variables
Avoid using every variable that happens to be available.
Results can be very difficult to interpret.
Focus on the identified data mining goals.
Make use of (but do not rely totally upon) existing business rules or domain knowledge.
Selected variables should not be highly correlated.
Correlated variables place extra emphasis on the dimension they represent.
Consider using only one of the highly correlated variables.
Or use Principal Component Analysis to combine linearly correlated variables.
Clustering results could then be more difficult to interpret.
Data Understanding: Choosing The Variables
Variables with a single value, an almost-single value, or very little variation should be excluded.
Avoid using categorical variables if possible.
Most clustering algorithms are biased towards categorical variables.
Choose an appropriate window width for different variables.
A narrower window width for frequent events, a wider window width for less active events.
Select a suitable historic period.
For most customer-focused applications, 1–2 years is sufficient.
Data Understanding: Choosing The Variables
At least five types of data can be used in most customer focused clustering:
Product data
Transactional data
Customer relationship data
Demographic data
Additional data
Data Understanding: Choosing The Variables
Product data
Usually stored in a form organized around the Universal Product Code (UPC).
Mostly includes information on product name, brand, pricing, and so on.
May have a product hierarchy structure.
For example, available variables of a bank are: ownership of current account, saving products, loan products, investment products, pension products, mutual funds, credit card, youth products, etc.
Data Understanding: Choosing The Variables
Transactional data
Typically includes:
date and time of transaction
store identifier
UPC or other item code
quantity of the product purchased
price paid
method of payment
customer identifier
For example, available variables of a bank are: teller usage, ATM usage, web bank usage, call center usage, number of current account transactions, etc.
Data Understanding: Choosing The Variables
Customer relationship data
Describes the data that is generated within an organization as a direct result of customer relationship management. This includes promotions, segments to which the customer has previously been assigned, product preferences, channel utilization, complaints, etc.
For example: Available variables of a bank are: time as customer, total deposits, total liabilities, total disposable funds, defaulted or not, number of automatic payments, salary deposited in bank, etc.
Data Understanding: Choosing The Variables
Customer demographic data
Descriptive data related to the individual customer that does not arise from their being a customer.
For example, available variables of a bank are: age, sex, address, annual income, occupation, car ownership, home ownership, marital status, number of children, etc.
Additional data
Any other data that may be useful for understanding the interested objects.
For example, available variables of a bank are: credit score, profitability measure, etc.
Data Preparation
All data have to be aggregated to the object level.
E.g. Transactional data are often recorded at the transaction level, i.e. one transaction per record.
It is common to match-merge records across multiple data sets.
Transactional records → Object (customer) level data
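A sketch of this aggregation step in pandas, using the customer/product records from the example table at the end of the chapter (the column names and spend amounts are illustrative assumptions):

import pandas as pd

# Hypothetical transaction-level records: one row per transaction
transactions = pd.DataFrame({
    "customer_id": ["A", "B", "A", "C", "B", "A"],
    "product":     ["P1", "P2", "P3", "P1", "P3", "P1"],
    "amount":      [10.0, 25.0, 5.0, 12.0, 8.0, 10.0],
})

# Aggregate to one row per customer: purchase counts per product plus total spend
counts = pd.crosstab(transactions["customer_id"], transactions["product"]).add_prefix("Prod_")
totals = transactions.groupby("customer_id")["amount"].sum().rename("total_spend")
customer_level = counts.join(totals)
print(customer_level)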
Data Preparation
Missing values
Decide whether it is necessary to impute the missing values or leave them as they are.
Most clustering algorithms cannot handle missing values.
Categorical variables with a large number of levels
Most clustering algorithms are biased toward categorical variables with a high number of levels.
Proper grouping of category levels can reduce the impact of the categorical column on the discovered pattern.
Consider grouping categories into a higher level, or grouping levels with small numbers of observations together.
Data Preparation
Outliers
Clusters with a very small number of observations are often formed around outliers.
Consider excluding the outliers from the analysis.
Interval variables with high skewness
High skewness makes it more difficult to distinguish observations with values at opposite ends of the tail of these variables.
Small clusters are often formed around the tails of these variables.
Consider applying a transformation such as log or square root to reduce the data range, or binning the variables.
Data Preparation
Creating new variables
Such as summing variables into a total, converting counts to percentages, or calculating averages.
E.g. A customer who shops frequently also tends to spend more on many products.
The number of transactions made by a customer within a period of time tends to be correlated with the amount of spending. The joint analysis of these variables will simply identify clusters with high transaction numbers and high spending amounts, or low transaction numbers and low spending amounts.
Converting the amount of spending on each type (or group) of products into the proportion of spending on each type (or group) of products may reveal the customer's preferences.
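A hedged sketch of such derived variables in pandas (the customer IDs, product groups, and amounts are hypothetical):

import pandas as pd

# Hypothetical spend per product group, one row per customer
spend = pd.DataFrame(
    {"grocery": [120.0, 40.0], "clothing": [30.0, 160.0], "electronics": [50.0, 0.0]},
    index=["cust_1", "cust_2"],
)

total = spend.sum(axis=1)                # total spend per customer
shares = spend.div(total, axis=0)        # proportion of spend per product group
shares = shares.add_suffix("_share")
shares["total_spend"] = total            # keep the overall level as a separate variable
print(shares.round(3))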
Data Preparation
Converting categorical column to dummy columns
Most data mining software will perform this task automatically.
Binary categorical column
One binary variable is created that contains a value of 0 or 1.
Nominal categorical column
One binary variable per level.
Data Preparation
Converting categorical column to dummy columns
Ordinal categorical column
Most commonly used encoding method is the Index Encoding.
One numeric variable of interval type for each variable.
The smallest ordered value is mapped to 0; each successive level is increased by 1/(number of levels − 1); the largest ordered value is mapped to 1.
Other kinds of encoding methods are available.
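A minimal sketch of index encoding (the level names are hypothetical):

# Index encoding of an ordinal variable: map ordered levels onto [0, 1] in equal steps.
def index_encode(levels_in_order):
    m = len(levels_in_order)
    return {level: i / (m - 1) for i, level in enumerate(levels_in_order)}

# Hypothetical ordinal variable with four ordered levels
mapping = index_encode(["low", "medium", "high", "very high"])
print(mapping)  # {'low': 0.0, 'medium': 0.333..., 'high': 0.666..., 'very high': 1.0}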
Data Preparation
Standardizing numeric variables
Variables on different measurement units are difficult to compare.
Since the scale of a variable acts as an implicit weight in the Euclidean distance calculation, most distance measures adopted by clustering algorithms are quite sensitive to the measurement unit of a variable.
Expressing a variable in smaller measuring units will lead to a larger range for the variable, which will then have a larger effect on the resulting cluster structure.
Data Preparation
Standardizing numeric variables
E.g. Consider the distance between each object pair (A, B, C).
Distance based on minutes of viewing time:
Order of similarity: B-C, A-C, A-B.
Distance based on seconds of viewing time:
Order of similarity: B-C, A-B, A-C.
Data Preparation
Standardizing numeric variables
Standardize the ith value of attribute j by: z_ij = (x_ij – a_j) / b_j
Z-score standardization (Z-score normalization)
aj is the mean, and bj is the standard deviation of attribute j.
Range standardization (Min-max normalization )
aj is the minimum, and bj is the range of attribute j.
The actual value of aj is not important in some clustering algorithms as its value will be cancelled in computing the Euclidean distance between two observations.
Data Preparation
Standardizing numeric variables
Range standardization or z-score standardization?
No obvious answer.
Standardizing changes the importance of variables in clustering; a smaller bj will make variable j more important and a larger bj will make it less important in the distance computations. Normally, we would like to treat each variable equally.
Be aware of the presence of extremely large (or small) values, i.e. outliers. Standardization in the presence of outliers will push the smaller values much closer to each other, making them more difficult to separate.
Consider applying a transformation (e.g. log) to make the extreme values smaller before standardization.
Data Preparation
Standardizing numeric variables
E.g. After range standardization (or z-score standardization), the measuring scale is not an issue any more:
Standardizing gives all variables an equal weight; it may be used when there is no prior knowledge about which variables are more important.
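A small sketch using the purchase-probability/viewing-time example, showing that standardization removes the minutes-versus-seconds scale issue (NumPy assumed; function names are illustrative):

import numpy as np

# Purchase probability (%) and commercial viewing time for objects A, B, C
data = np.array([
    [60.0, 3.0, 180.0],   # A
    [65.0, 3.5, 210.0],   # B
    [63.0, 4.0, 240.0],   # C
])

def range_standardize(x):
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def zscore_standardize(x):
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# After range standardization the minutes and seconds columns become identical,
# so the measurement unit of viewing time no longer matters.
print(range_standardize(data))
print(zscore_standardize(data))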
Data Preparation
Rescaling nominal binary columns
A categorical variable with m (m > 2) levels is often converted to m binary (0,1) numeric columns.
For m = 2, it is common to convert it to a single binary (0,1) numeric column.
This increases the weight of the variable in distance measures, as the same variable contributes more than once to the Euclidean distance calculation.
Example: colour preference data for five customers (ID, sex, age, colour preference, and rank on a 1–5 scale); see the colour-preference example tables at the end of the chapter.
Data Preparation
Rescaling nominal binary columns
Example: the colour preference variable is converted to three dummy columns (Colour_Red, Colour_Blue, Colour_Green), and the numeric variables are range standardized.
Data Preparation
Rescaling nominal binary columns
E.g.
Squared Euclidean (A, B)
= (1 − 1)^2 + (0.4583 − 1)^2 + (1 − 0)^2 + (0 − 1)^2 + (0 − 0)^2 + (0 − 0.25)^2
= 2.3559
The variable ‘colour preference’ dominates the value of the distance measure.
To restore the original equal weighting of each variable, rescale the jth dummy column derived from an m-level categorical variable by dividing it by sqrt(m), i.e. z'_j = z_j / sqrt(m). For m = 3 this multiplies each dummy by 1/sqrt(3) ≈ 0.5774.
Data Preparation
Rescaling nominal binary columns
E.g.
Squared Euclidean (A, B)
= (1 − 1)^2 + (0.4583 − 1)^2 + (0.5774 − 0)^2 + (0 − 0.5774)^2 + (0 − 0)^2 + (0 − 0.25)^2
= 1.0227
Total impact of colour preference on the distance is reduced.
This approach may apply to both range and z-score standardization.
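A short sketch of this rescaling, using the range-standardized values of records A and B from the example (NumPy assumed):

import numpy as np

# Range-standardized records A and B: sex, age, three colour dummies, rank
# (values taken from the colour-preference example)
a = np.array([1.0, 0.4583, 1.0, 0.0, 0.0, 0.00])
b = np.array([1.0, 1.0000, 0.0, 1.0, 0.0, 0.25])

def squared_euclidean(x, y):
    return float(np.sum((x - y) ** 2))

print(squared_euclidean(a, b))             # ≈ 2.356: the colour dummies dominate

# Rescale the three dummy columns of the m = 3 level variable by 1 / sqrt(m)
m = 3
a_rescaled, b_rescaled = a.copy(), b.copy()
a_rescaled[2:5] /= np.sqrt(m)
b_rescaled[2:5] /= np.sqrt(m)
print(squared_euclidean(a_rescaled, b_rescaled))   # ≈ 1.023: colour contribution reduced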
Data Preparation
Variable reduction
It is not always easy to determine the correct relationships between variables, particularly when hundreds of variables are available.
Including highly correlated variables in the modeling process increases the difficulty of finding a clustering solution that meets the end users' needs.
Apart from using domain knowledge to select variables, the variables can be grouped into similar clusters. A few variables can then be selected from each cluster, thus reducing the number of variables used for clustering.
This process is sometimes called variable clustering.
Data Preparation
Variable reduction
The variable clustering process is closely related to principal component analysis (PCA).
PCA is a technique for forming new variables which are linear composites of the original variables. The maximum number of new variables that can be formed is equal to the number of original variables, and the new variables are uncorrelated among themselves. The total variance of the new variables is identical to that of the original variables.
Each new variable (which is called a principal component (PC)) is a linear combination of all original variables. The PC that accounts for most of the total variance is called the first principal component.
Original variables are correlated to each PC in different degrees.
Categorical variables are often not used when deriving the principal components.
Data Preparation
Variable reduction
Variable clustering begins with all variables in a single cluster. If the variance (eigenvalue) of the second PC is smaller than a pre-specified threshold, the process stops; otherwise the following steps are repeated:
The variable cluster with the largest second eigenvalue is chosen for splitting.
Split the set of variables into two clusters by assigning each variable to the first or the second principal component, whichever it has the higher squared correlation with.
The process stops splitting when either of the following conditions holds:
The number of clusters is equal to the pre-specified maximum number of clusters.
All second eigenvalues from all clusters are smaller than the pre-specified threshold.
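A simplified sketch of one splitting step, assuming plain PCA on the correlation matrix (the real SAS variable-clustering procedure is more elaborate; the threshold, data, and function names here are illustrative):

import numpy as np

def split_variable_cluster(X, threshold=1.0):
    """One splitting step of divisive variable clustering (simplified sketch).

    X: array of shape (n observations, p variables) for one variable cluster.
    Returns (indices_first, indices_second), or None if the second eigenvalue
    of the correlation matrix is below the threshold (no split needed).
    """
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
    if eigvals[-2] < threshold:
        return None
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize before scoring
    scores = Xs @ eigvecs[:, [-1, -2]]               # scores on the first two PCs
    # assign each variable to the PC with which it has the higher squared correlation
    r2 = np.array([[np.corrcoef(X[:, j], scores[:, k])[0, 1] ** 2 for k in (0, 1)]
                   for j in range(X.shape[1])])
    return np.where(r2[:, 0] >= r2[:, 1])[0], np.where(r2[:, 0] < r2[:, 1])[0]

rng = np.random.default_rng(0)
base1, base2 = rng.normal(size=(200, 1)), rng.normal(size=(200, 1))
# three variables driven by one underlying factor, two by another
X = np.hstack([base1 + 0.1 * rng.normal(size=(200, 3)),
               base2 + 0.1 * rng.normal(size=(200, 2))])
print(split_variable_cluster(X))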
Data Preparation
Variable reduction
Example: Consider the following set of variables and the respective pairwise correlations:
Data Preparation
Variable reduction
Example: Cont’d
The following variable clusters are derived:
Variables that have high correlation with their own cluster are likely to be highly correlated with each other. Select one of these variables.
Variables that have low correlation with their own cluster are likely to be uncorrelated with other variables. Select these variables.
Select the variable in a single variable cluster.
Clustering Algorithms
What procedure should be used to place similar observations into groups or clusters?
Hundreds of algorithms are available.
Essentially, all algorithms attempt to maximize the differences between clusters and minimize the variation within a cluster.
Two general categories: hierarchical and non-hierarchical.
Clustering Algorithms
Hierarchical clustering procedures
The construction of a hierarchy of a treelike structure.
Two types:
Agglomerative (bottom-up)
Each observation starts out as its own cluster.
Merge two clusters iteratively.
Divisive (top-down)
One large cluster containing all observations.
Split a cluster into two iteratively.
Clustering Algorithms
Hierarchical clustering procedures
[Dendrogram: five observations a, b, c, d, e. Agglomerative clustering (levels 0 to 4) merges a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering performs the same splits in reverse (levels 4 down to 0).]
Clustering Algorithms
Hierarchical clustering procedures
Popular methods to compute the distances between two clusters are:
Centroid method: not as sensitive to outliers but may not perform as well as other methods.
Average-linkage method: tends to produce clusters with similar variances.
Single-linkage (Nearest-neighbour) method: biased towards clusters in elongated or irregular shape.
Complete-linkage (Farthest-neighbour) method: biased towards clusters with roughly equal diameters, and sensitive to outliers.
Ward's method: biased towards clusters with the same number of observations, and sensitive to outliers.
In practice, we will normally not know the actual shape of a cluster.
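An illustrative comparison of these linkage methods with scipy (the simulated data and the 2-cluster cut are assumptions for the demo, not part of the course materials):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Two hypothetical groups of points in 2-D
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([5, 5], 0.5, size=(20, 2))])

# Each method differs only in how the distance between two clusters is computed
for method in ["centroid", "average", "single", "complete", "ward"]:
    Z = linkage(X, method=method)            # hierarchical merge history
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes for the 2-cluster cut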
Clustering Algorithms
Hierarchical clustering procedures
Centroid method
Each group (cluster) is replaced by an Average Subject, which is the centroid of the group.
Input data (6 clusters) → five clusters → four clusters → three clusters.
[Figure: the input data are reduced step by step; at each step the merged subjects (e.g. S3 and S4) are replaced by their centroid. See the centroid-method example tables at the end of the chapter.]
Clustering Algorithms
Hierarchical clustering procedures
Average-linkage method
Calculate the average distance (or squared distance) between all pairs of subjects in the two clusters.
Clustering Algorithms
Hierarchical clustering procedures
Single-Linkage method
The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters.
Two clusters with the shortest distance will be merged.
Complete-Linkage method
The distance between two clusters is represented by the maximum of the distances between all possible pairs of subjects in the two clusters.
Two clusters with the shortest distance will be merged.
Clustering Algorithms
Hierarchical clustering procedures
Ward’s method
Does not compute distances between clusters.
Clusters are formed at each step such that the resulting cluster solution has the smallest total within-cluster sum of squares (SSW).
Within-cluster sum of squares measures the variability of the observations within each cluster.
For the jth cluster formed from m observations on d variables, the respective within-cluster sum of squares is:
SSWj = (X11 – C1j)^2 + (X12 – C2j)^2 + . . . + (X1d – Cdj)^2
+ (X21 – C1j)^2 + (X22 – C2j)^2 + . . . + (X2d – Cdj)^2 + . . .
+ (Xm1 – C1j)^2 + (Xm2 – C2j)^2 + . . . + (Xmd – Cdj)^2,
where Ckj is the centroid (mean) of variable k in cluster j.
Choose the K-cluster solution that minimizes SSW = SSW1 + SSW2 + … + SSWj + … + SSWK.
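A small sketch computing SSW for a given partition, using the Income/Education example data for subjects S1–S6:

import numpy as np

def within_cluster_ss(X, labels):
    """Total within-cluster sum of squares for a partition of the rows of X."""
    total = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]
        centroid = cluster.mean(axis=0)
        total += np.sum((cluster - centroid) ** 2)
    return total

# Income and education for subjects S1–S6 (from the centroid-method example)
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])        # {S1,S2}, {S3,S4}, {S5,S6}
print(within_cluster_ss(X, labels))          # 1 + 1 + 13 = 15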
Clustering Algorithms
Hierarchical clustering procedures
Ward’s method
At each step (Step 1, Step 2, …), merge the pair of clusters whose merger gives the smallest total SSW; repeat until the 1-cluster solution is reached.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
R-Squared (RSQ)
At a step of the agglomerative approach, RSQ measures the extent to which clusters are different from each other, expressed as the ratio of the total between-cluster sum of squares (SSB) to the total sum of squares (SST).
For a data set with n observations on d variables, the SST is:
SST = (X11 – X̄1)^2 + (X12 – X̄2)^2 + . . . + (X1d – X̄d)^2
+ (X21 – X̄1)^2 + (X22 – X̄2)^2 + . . . + (X2d – X̄d)^2 + . . .
+ (Xn1 – X̄1)^2 + (Xn2 – X̄2)^2 + . . . + (Xnd – X̄d)^2,
where X̄j is the overall mean of variable j.
SST is fixed, and note that SST = SSB + SSW, thus the smaller the SSW, the bigger the SSB will be.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
R-Squared (RSQ)
RSQ = SSB / SST
RSQ takes values from 0 to 1, and it is scale independent.
0 indicating no differences among clusters.
1 indicating maximum differences among clusters.
Under agglomerative approach, RSQ decreases as number of clusters (K) decreases.
Choose the K so that the next smaller K leads to a large reduction of RSQ. A large reduction of RSQ implies that two distinct clusters are forced merging, hence we should stop the process before it happens.
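A hedged sketch of tracking RSQ against K with scikit-learn's K-means (KMeans.inertia_ is the total SSW; the simulated data are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical data with three well-separated groups
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 4])])

sst = np.sum((X - X.mean(axis=0)) ** 2)          # total sum of squares (fixed)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rsq = 1.0 - km.inertia_ / sst                # RSQ = SSB / SST = 1 - SSW / SST
    print(k, round(rsq, 3))
# Look for the K where going to K - 1 clusters causes a large drop in RSQ.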
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Semi-partial R-Squared (SPRSQ)
It measures the loss of homogeneity due to combining two clusters for forming a new cluster under agglomerative approach. It is good to combine two similar (homogeneous) clusters, but it is bad to combine two distinct (heterogeneous) clusters.
Loss of homogeneity is defined as the SSW of the newly formed cluster minus the sum of SSW of the two joined clusters.
If the loss of homogeneity is zero, the new cluster is obtained by merging two perfectly homogeneous clusters, which is good. If the loss of homogeneity is large, the new cluster is obtained by merging two heterogeneous clusters, which is bad.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Semi-partial R-Squared (SPRSQ)
SPRSQ = Loss of homogeneity divided by SST.
SPRSQ takes values from 0 to 1, and it is scale independent.
Under agglomerative approach, SPRSQ increases as the number of clusters (K) decreases.
Choose the K so that a smaller K leads to a large increase in SPRSQ.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Example:
RSQ and SPRSQ often show a big change in value when combining the final two clusters into one. This is natural and does not necessarily indicate that the 2-cluster solution is appropriate.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Cubic Clustering Criterion (CCC)
At a step of an agglomerative approach, it computes a normalized RSQ with reference to a null hypothesis that the observations are uniformly distributed in a hyper-cube, against the alternative hypothesis that the data have been sampled from a mixture of spherical multivariate normal distributions with equal variances and equal sampling probabilities.
A positive CCC means that the observed RSQ is greater than expected and indicates the possible presence of clusters.
Local peaks (> 3) on a plot of CCC against the number of clusters are supposed to correspond to appropriate numbers of clusters.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Cubic Clustering Criterion (CCC)
If no obvious local maximum is detected, then a good partition into K clusters will show a dip for K−1 clusters and a peak for K clusters, possibly followed either by a gradual decrease in the CCC or by a smaller rise.
If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably long tailed.
Very negative values of the CCC, say, -30, might be due to outliers.
If the CCC increases continually as the number of clusters increases, the distribution of the variables might be grainy.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Cubic Clustering Criterion (CCC)
E.g.:
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Aligned Box Criterion (ABC)
Similar to CCC, but instead of simulating uniformly distributed sample data for testing the hypotheses, the hypothesis testing is done by sampling from a box that is aligned with the dimensions of the actual observations; in theory this should increase the power of the test for K.
It works with interval variables only.
Clustering Algorithms
Hierarchical clustering procedures
Determining the number of clusters, K
Aligned Box Criterion (ABC)
Under ABC, a prespecified number of samples (more samples can be taken if the computer power is high enough) can be simulated for each K value.
Plot ABC against K. A local peak indicates a possible K-cluster solution for the data set.
The following video gives some explanations about CCC and ABC:
Clustering Algorithms
Non-hierarchical clustering procedures
To organize the objects into the best K clusters.
Value of K is pre-determined.
General steps (K-means method):
Select K seeds as initial centroids of the K clusters.
Assign each observation to the centroid to which it is the closest.
After the assignments for all the observations have been made, calculate the new centroids of the K clusters (fixed-centroids approach).
Re-assign each observation to one of the K new centroids.
Stop if there is no reallocation of data points or some stopping rule (such as new centroids are almost identical to the old centroids, or the maximum number of iterations is reached) is satisfied. Otherwise go back to Step 3.
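A minimal sketch of these steps in NumPy (random seeds as in Method 2 below; empty clusters are not handled in this sketch):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) with fixed centroids within each pass."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial seeds
    for _ in range(max_iter):
        # assign each observation to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids after all assignments have been made
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # stopping rule
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.3, size=(30, 2)),
               rng.normal([3, 3], 0.3, size=(30, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels), centroids.round(2))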
Clustering Algorithms
Non-hierarchical clustering procedures
[Figure: three initial seeds (Seed 1, Seed 2, Seed 3) define the initial clusters; new centroids are then computed and the observations are re-assigned to form new clusters.]
Clustering Algorithms
Determining the number of clusters for K-means clustering
No standard, objective selection procedure exists.
Some suggested approaches:
Specify a manageable number of clusters based on practical considerations.
Use some statistics, such as overall R-Squared, to guide the choice.
Try out a number of solutions and choose the one that is most meaningful.
Clustering Algorithms
Variations on the K-means method
Selecting initial seeds methods:
Method 1: Select the first K complete observations as seeds or cluster centroids.
Not a good approach if the data set is arranged in some order.
Method 2: Randomly select K complete observations as cluster centroids.
Method 3: As Method 2, but replace one of the closest pair of seeds with one of the remaining observations if the distance between this observation and its nearest seed is greater than the distance between that pair of seeds. The seed of the pair that is closer to the candidate observation is the one replaced. This process is repeated until no more replacements occur.
Clustering Algorithms
Variations on the K-means method
Selecting initial seeds methods:
Method 4: If an observation fails the criterion in Method 3, then a second criterion is applied: The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest seed is greater than the shortest distance from the nearest seed to all other seeds. If the observation fails to meet this criterion, move on to the next observation.
In addition to the above methods, one can also specify the minimum distance between two seeds. If this minimum distance is too large, fewer than K seeds may be selected, or outliers may be selected as seeds.
Clustering Algorithms
Variations on the K-means method
Variation in assignment rules:
Drift centroids
For the assignment of each observation, re-compute the centroid of the cluster to which the observation is assigned and of the cluster from which it is removed. Reassignment continues until the change in cluster centroids is less than the selected convergence criterion.
Clustering Algorithms
Hierarchical or non-hierarchical?
Number of clusters
The hierarchical method does not need the number of clusters to be pre-determined, but its value still has to be decided after the computation.
Re-assigning observations to another cluster
An observation cannot be re-assigned with the hierarchical method.
Computation
The hierarchical method is computationally intensive.
Clustering Algorithms
Hierarchical or non-hierarchical?
Steps of a hybrid approach:
Randomly select a sample of observations, or use the entire data set if sufficient computing resources are available.
Perform a non-hierarchical clustering for m clusters, where m is chosen to be large, such as 40.
Perform hierarchical clustering on these m clusters in an agglomerative (or divisive) manner.
Use the statistics (R-squared, semi-partial R-squared, CCC, or ABC) derived from Step 3 to determine the appropriate value of K.
Apply non-hierarchical clustering to the entire data set for a K-cluster solution.
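A simplified sketch of this hybrid approach with scikit-learn and scipy (sampling, weighting of the preliminary clusters by size, and the statistics-based choice of K are omitted; K is fixed by hand here):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.5, size=(300, 2)) for c in ([0, 0], [5, 0], [0, 5], [5, 5])])

# Step 2: non-hierarchical clustering into a large number of preliminary clusters
m = 40
pre = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)

# Step 3: agglomerative (Ward) clustering of the m preliminary centroids
Z = linkage(pre.cluster_centers_, method="ward")

# Step 4: choose K (here fixed by hand; in practice use RSQ/SPRSQ/CCC/ABC plots)
k = 4

# Step 5: final K-means on the entire data set, seeded with the K merged centroids
groups = fcluster(Z, t=k, criterion="maxclust")
seeds = np.array([pre.cluster_centers_[groups == g].mean(axis=0) for g in range(1, k + 1)])
final = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(np.bincount(final.labels_))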
Clustering Validation
Features of a good cluster solution
For overall result:
Heterogeneous between clusters
The overall R-Squared value is large.
Homogeneous within cluster
The ratio of the overall Within STD (pooled across all the variables in each cluster) to the overall Total STD (pooled across all the variables in the entire data set) should be small.
If one is not sure about the number of clusters, one can rerun the analysis to obtain solutions for different numbers of clusters and use the R-squared and overall Within STD to determine the number of clusters. Look for the number of clusters that gives a big improvement in these statistics.
Clustering Validation
Features of a good cluster solution
For individual variable considered:
The observations within each cluster are similar in values with respect to each variable.
The Within STD value pooled across all clusters for such a variable should be relatively low compared with the Total STD value of the same variable in the entire data set.
The clusters are well separated with respect to each variable.
The R-Squared value pooled across all clusters for such a variable should be high.
Possible causes for a variable with a low pooled R-Squared value and a high Within STD to Total STD ratio:
Other variables may dominate the analysis due to their measurement scales or correlation among themselves.
Insufficient number of clusters considered.
The variance of this variable is too small.
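A sketch of these per-variable diagnostics for a K-means solution (the data and variable names are hypothetical):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Hypothetical data: var_a separates the groups well, var_b is pure noise
X = pd.DataFrame({
    "var_a": np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)]),
    "var_b": rng.normal(0, 1, 200),
})
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

n, k = len(X), 2
for col in X.columns:
    sst = ((X[col] - X[col].mean()) ** 2).sum()
    ssw = sum(((X[col][labels == j] - X[col][labels == j].mean()) ** 2).sum()
              for j in range(k))
    rsq = 1 - ssw / sst
    within_std = np.sqrt(ssw / (n - k))      # pooled within-cluster STD
    total_std = np.sqrt(sst / (n - 1))       # total STD
    print(col, round(rsq, 3), round(within_std / total_std, 3))
# var_a should show a high R-squared and a low Within/Total STD ratio; var_b the opposite.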
Clustering Validation
Features of a good cluster solution
Example
The overall R-squared value is not very high but acceptable.
The overall Within STD / Total STD = 0.6243 which is on the higher side, a lower value is preferred.
The R-squared value for variable PURCHTOT is low and the respective Within STD is large: some clusters are not well separated by this variable.
If it is considered important to segment the data by PURCHTOT, one may want to repeat the analysis with a higher number of clusters.
Clustering Validation
Features of a good cluster solution
Reliability
The data set is first split into two halves.
Cluster analysis is done on the first half of the sample.
Observations in the second half of the sample are assigned to the cluster centroid that has the smallest Euclidean distance.
Degree of agreement between the assignment of the observations and a separate cluster analysis of the second sample is an indicator of reliability.
The results must make sense from a business point of view.
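A hedged sketch of this split-half reliability check, using the adjusted Rand index as the measure of agreement (the data are simulated; a real analysis would use the business data and the chosen K):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 0.6, size=(200, 2)) for c in ([0, 0], [4, 0], [2, 4])])

half_a, half_b = train_test_split(X, test_size=0.5, random_state=0)

# Cluster the first half, then assign the second half to the nearest centroid
km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half_a)
assigned_b = km_a.predict(half_b)

# Cluster the second half independently and compare the two labelings
km_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit(half_b)
agreement = adjusted_rand_score(assigned_b, km_b.labels_)
print(round(agreement, 3))   # close to 1 indicates a reliable cluster structure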
Interpretation of the Clusters
What makes a cluster special?
Mean comparisons
If many variables are used in the analysis, focus on the variables with high R-squared value or relatively low within STD, or variables with high logworth value.
Variables that are not used in the analysis can also be used for interpretation.
Compare the average of each variable in a cluster to the respective population average; the values should be compared in the original scale of each variable.
Interpretation of the Clusters
What makes a cluster special?
Mean comparisons can sometimes be misleading
Variables with opposite skewness may have similar mean values.
Distribution of selected variables in a cluster can be compared to its overall distribution.
It is often more meaningful to examine the distribution in the original scale of the variable.
[Figure: the distribution of a selected variable in Segments 4, 5 and 6 compared with its overall distribution.]
Interpretation of the Clusters
What makes a cluster special?
Use a classification model (e.g. a decision tree) to classify whether an object belongs to a given cluster or not, then another model for another cluster, and so on: one model per cluster. Each model may reveal the most important set of variables for classifying objects into its cluster. These variables may provide some insight into how the clusters differ from each other.
It is also possible to use a single multiclass classification model to classify the objects into one of the clusters. A set of most important variables can be derived, but further analyses are required to see how the clusters differ from each other on these variables.
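An illustrative sketch of the one-model-per-cluster idea with a shallow decision tree (the customer variables and data are hypothetical):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(7)
# Hypothetical customer data: income, number of children, age
X = rng.normal(size=(500, 3)) * [20, 1.2, 12] + [60, 1.5, 40]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

feature_names = ["income", "children", "age"]

# One shallow tree per cluster: cluster members versus everyone else
for k in np.unique(labels):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels == k)
    print(f"Cluster {k}:")
    print(export_text(tree, feature_names=feature_names))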
SAS EM Clustering Node
Major features
Accepts binary, nominal, ordinal and interval data.
Allows the user to select variables for clustering.
Provides an option for range or z-score standardization of numeric attributes.
Encodes categorical variables automatically.
Measures the distance between observations by Euclidean distance.
Provides an option to perform a hybrid of agglomerative hierarchical and K-means clustering automatically.
Provides an option to perform K-means clustering with different settings.
Certain characteristics of the clusters can be examined graphically.
Assigns a cluster label to each observation.
Generates a cluster profile tree and classification rules.
SAS Clustering Node
How SAS Clustering Node works
User specified method: Perform K-means clustering with a user-specified K value.
Automatic : Three stages procedure
First stage:
Use K-means procedure to find m clusters.
Some important options (suggested settings):
Numeric variables standardization (range or z-score).
Maximum number of clusters (default m= 40) for this stage.
Seed Initialization (Full replacement).
Maximum number of iterations (at least 50 for large data set and large number of variables).
SAS Clustering Node
How SAS Clustering Node works
Automatic : Three stages procedure
Second stage:
Use agglomerative hierarchical algorithm to combine and reduce the number of clusters derived from the first stage.
Determine the ‘best’ K based on the CCC criterion (this does not always work).
Some important options (suggested settings):
Clustering method (Ward).
CCC cutoff value (3).
SAS Clustering Node
How SAS Clustering Node works
Automatic : Three stages procedure
Third stage:
Use K-means method to allocate the whole data set into the K chosen groups.
The automatic procedure should be adopted only as a base solution and as a way to obtain statistics for determining K, the number of clusters.
Always fine-tune the solution by forming clusters manually, with numbers of clusters both smaller and larger than the one determined above.
SAS Clustering Node
Example 1 – Simulated data Clustersim2.sas7bdat
Contains 8000 simulated observations with 4 variables.
Nominal variable Class indicates the true cluster class.
Import the data set into an EM project.
Use the StatExplore node to obtain descriptive statistics for the other three variables, Col1, Col2, and Col3.
The magnitude of Col2 is higher than that of the other two variables.
SAS Clustering Node
Example 1 (Cont’d)
Use the Variable Clustering node (set Print Option to All; keep the defaults for the other settings) to check the pairwise correlations.
From Results, View | Model | Variable Correlation: some positive correlations exist between Col1 and Col3, and between Col2 and Col3.
SAS Clustering Node
Example 1 (Cont’d)
Using the default settings (automatic) of the Cluster node.
Set the Use status for variable Class to No.
Result:
It found 4 clusters.
Centroids:
Segment size plot:
SAS Clustering Node
Example 1 (Cont’d)
Why did the automatic procedure choose the 4-cluster solution?
At Results of Cluster node, select View | Summary Statistics | CCC Plot.
This solution is selected by the system because its CCC score exceeds 3 and is the first identified local maximum among all solutions.
SAS Clustering Node
Example 1 (Cont’d)
Other statistics also suggest a 4-cluster solution.
SAS Clustering Node
Example 1 (Cont’d)
How good is the result, statistically?
At Results of Cluster node, select View | Summary Statistics | Cluster Statistics.
Overall RSQ = 0.89026, very high.
Within_STD / Total_STD = 0.3313 / 1 = 0.3313, not very low but acceptable.
RSQ for Col2 is very high, but RSQ for Col1 and Col3 is also high.
Statistically, this is a good cluster solution.
SAS Clustering Node
Example 1 (Cont’d)
Other useful results:
View | Summary Statistics | Input Means Plot
If the mean plot is not displayed correctly, right click the plot and select Data Option to specify the role of INPUT under Variables Tab.
SAS Clustering Node
Example 1 (Cont’d)
Other useful results:
View | Cluster Profile |Tree
View | Cluster Profile | Node Rules
SAS Clustering Node
Example 1 (Cont’d)
Other useful results:
Each observation is assigned to one of the identified segments.
From Exported Data property of the Cluster node, select Train data set and click the Browse button.
SAS Clustering Node
Example 1 (Cont’d)
Segment Profile Node
For comparing the distribution of a variable in a segment to the distribution of the variable in the population.
Drag the Segment Profile node from the Assess menu into the diagram and connect the Cluster node to it.
SAS Clustering Node
Example 1 (Cont’d)
Segment Profile Node
Change Use status of variable Class to Yes.
In this example, we can use Class to assess the quality of the results.
Specify the number of midpoints that will be used to compute the distribution of interval variables. Select 16 for this example.
Set Profile All in General property to Yes to profile all of the segments.
When the property is set to No, small segments will be combined into the _Other_ category.
Use the Cutoff Percentage property to specify the threshold level for combining segments.
Run the node and select Results.
SAS Clustering Node
Example 1 (Cont’d)
Cluster Profile: Segment Profile Node
Variable Worth window shows the relative worth of each variable in characterizing each segment.
Charts are arranged in the order of the size of each segment, from large to small.
Within each chart, worth values are displayed from the highest to the lowest.
SAS Clustering Node
Example 1 (Cont’d)
Cluster Profile: Segment Profile Node
Profile window shows the distribution of a variable in a segment and its overall distribution.
Segments are arranged in the order of the size of each segment, from the largest to the smallest.
Within each segment, charts are arranged in the order of the relative worth, from the highest to the lowest.
SAS Clustering Node
Example 1 (Cont’d)
Other possible settings
After a possible K (=4) value is determined, re-run the node with the following settings:
Set Internal Standardization to Range.
Set Specification Method to User Specify.
Set Maximum Number of Clusters to 4 (also try a larger or smaller number for comparison purpose).
Set Seed Initialization Method to Full Replacement.
In Training options, Set Use Defaults to ‘No’. In Settings, change the Maximum Number of Iterations to 50 (or higher for more complex data set and higher number of clusters).
SAS HP Cluster Node
Example 1 (Cont’d)
HP Cluster node
The HP Cluster node in HPDM is designed to run faster on multi-processor computers. It applies the K-means clustering method with automatic or manual settings.
It creates clusters based on the interval variables only. All non-interval variables are ignored during the cluster creation process.
Connect the Input Data node to an HP Cluster node. For this example, set Use of variable Class to No in the Variables property.
HP Cluster node settings:
Set Standardize Inputs to None (or Range/Standardization, for comparison purpose).
In Stop Criterion property, set Maximum number of Iterations to 10 (or higher value for complex data set), and set Cluster Change to 0 (the lower the value, the more iterations are required).
Cluster Change value specifies the minimum percentage of observations change clusters before termination.
SAS HP Cluster Node
Example 1 (Cont’d)
HP Cluster node
HP Cluster node settings:
In Aligned Box Criterion Options, set the following:
Number of Reference Data Sets to 5.
This depends on data complexity; a higher number requires more computing power.
Align Reference Data Sets to PCA for specifying the boundary of the simulated data sets.
Set Maximum Number of Clusters to 10 (or other acceptable number).
SAS Clustering Node (Self-Study)
Example 2: Data Simulate.sas7bdat
A small data set of 200 observations with 1 interval variable and 1 nominal variable.
Var1: Interval, min=502.487, max = 3995.61, std = 1511.689
Var2: Nominal, number of levels = 3.
Without standardization, the interval variable with large magnitude dominates the determination of similarity.
As the range of Var1 is much larger than 1, and its STD is relatively small, Var1 still dominates after z-score standardization.
After range standardization, the set of binary numeric variables for Var2 may dominate the determination of similarity.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 1: Clustering with range standardization. Use the default settings but change the following:
Set Internal Standardization to Range.
Results: As expected, Var2 dominated the analysis.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 2: Clustering with manual standardization and rescaling
Apply range standardization to Var1 and create dummy columns for each level of Var2 manually.
Connect the Data Source node to a Transform Variables node.
Set Default Methods | Interval Inputs to Range.
Set Default Methods | Class Inputs to Dummy indicators to create dummy variable for each level of Var2.
Set Hide and Reject of Original Variables of the Score property to No, so that the original variables can be passed to subsequent nodes for further analyses.
Run the node and view the results.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 2: Clustering with manual standardization and rescaling
Rescale the created dummy columns
Connect the Transform variables node to another Transform variables node.
Click Formula Builder icon in the property panel to activate Formula Builder.
Click Create icon to create scaled value for the dummies of Var2:
MS_Var2A = TI_Var21 / sqrt(3)
MS_Var2B = TI_Var22 / sqrt(3)
MS_Var2C = TI_Var23 / sqrt(3)
It is important to set the Level of the created variables as Interval.
Run the node and view the results.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 2: Clustering with manual standardization and rescaling
Drag a Cluster node into the diagram and connect the last Transform Variables node to it.
Set Use | No for variables Var1 and Var2.
Do not apply Internal Standardization as it will undo the manual transformations.
Set Specification Method to Automatic.
Set Maximum Number of iterations to 50.
Run the node.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 2: Clustering with manual standardization and rescaling
Results:
6 segments are found.
All RSQs are high.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 2: Clustering with manual standardization and rescaling
Segment profiling with original variables:
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 3: Clustering with HP Cluster node
Use Metadata node to change the New Level of the three binary dummies from Default to Interval.
Connect the first Transform Variables node to Metadata node.
Click Train button in Variables property and set the following:
Run the Metadata node.
SAS Clustering Node (Self-Study)
Example 2 (Cont’d)
Trial 3: Clustering with HP Cluster node
Connect the Metadata node to HP Cluster node and set the following properties of HP Cluster node:
In Variables, set Use of var1 and var2 to No.
In Stop Criterion, set
Maximum Number of Iterations to 20.
Cluster Change to 0.
In Aligned Box Criterion Options, set
Number of Reference Data Sets to 5.
Maximum Number of Clusters to 10.
Align Reference Data Sets to PCA.
The ABC method suggests 8 clusters, but the final results found 6 clusters (reason: the node failed to sample 8 initial seeds).
SAS Clustering Node
Example 3: EastWestAirlinesCluster.sas7bdat
For each passenger the data include information on their mileage history and on different ways they accrued miles in the past 12 months. The goal is to try to identify clusters of passengers that have similar characteristics in earning miles for the purpose of targeting different segments for different types of mileage offers.
SAS Clustering Node
Example 3 (Cont’d)
Variables contained in the data set:
Import data to a new project or diagram.
Set the level of CC1_miles, CC2_miles, and CC3_miles to Ordinal, and the level of Award to Binary.
Set the Role of ID to ID.
Variable Description
ID Unique ID
Balance Number of miles eligible for award travel
Qual_miles Number of miles counted as qualifying for Topflight status
CC1_miles Number of miles earned with freq. flyer credit card in the past 12 months:
CC2_miles Number of miles earned with Rewards credit card in the past 12 months:
CC3_miles Number of miles earned with Small Business credit card in the past 12 months:
note: miles bins: 0 = 0; 1 = under 5,000; 2 = 5,000 – 10,000; 3 = 10,001 – 25,000;
4 = 25,001 – 50,000; 5 = over 50,000
Bonus_miles Number of miles earned from non-flight bonus transactions in the past 12 months
Bonus_trans Number of non-flight bonus transactions in the past 12 months
Flight_miles_12mo Number of flight miles in the past 12 months
Flight_trans_12 Number of flight transactions in the past 12 months
Days_since_enroll Number of days since Enroll_date
Award Claim any award before (1 Yes; 0 No)
SAS Clustering Node
Example 3 (Cont’d)
Obtain summary statistics (StatExplore node):
No missing value.
Over 96% of cc2_miles and cc3_miles values are equal to 0.
Some interval variables have large measurement scale.
Some interval variables have very high kurtosis.
SAS Clustering Node
Example 3 (Cont’d)
Variables bonus_miles and bonus_trans, flight_miles_12mo and flight_trans_12 are likely to be correlated.
Variable Clustering reports that the correlations between
bonus_miles & bonus_trans = 0.6032
flight_miles_12mo & flight_trans_12 = 0.8692
One needs to be chosen between bonus_miles and bonus_trans, and one between flight_miles_12mo and flight_trans_12.
SAS Clustering Node
Example 3 (Cont’d)
Connect the Input Data node to a Transform Variables node. Use the Transform Variables node to create the following variables:
AveBonus_miles = sum(Bonus_miles / Bonus_trans, 0)
AveFlight_miles = sum(Flight_miles_12mo / Flight_trans_12, 0)
Trn_CC2_Miles = sum(CC2_miles / CC2_miles, 0) *0.7071
Trn_Award = sum(Award/Award, 0)*0.7071
where sum is a SAS function that adds up the values of its arguments, ignoring missing values. (The factor 0.7071 ≈ 1/sqrt(2) is the rescaling factor for the dummy of a two-level variable, as discussed in the data preparation section.)
Show the original variables and do not reject them in the new data table.
SAS Clustering Node
Example 3 (Cont’d)
Connect the Transform Variables node to another Transform Variables node.
Apply the Log transformations to the following variables:
Balance, Qual_miles, Bonus_trans, Flight_trans_12, days_since_enroll, AveBonus_miles, and AveFlight_miles.
Show the original variables and reject them in the new data table.
SAS Clustering Node
Example 3 (Cont’d)
For illustrative purpose, use Variable Cluster node to identify variable groups.
Connect the last Transform Variables node to a Variable Clustering node. Apply the following settings:
Use only these variables:
Set Two Stage Clustering to Yes.
Set Print Option to All.
Set Variable Selection to Best Variables.
Set Hides Rejected Variables to No.
SAS Clustering Node
Example 3 (Cont’d)
Variable Clustering results:
Two variable clusters are found. Around 54.68% of the total variation is explained by the two clusters.
Not all variables are highly correlated to their respective assigned cluster.
SAS Clustering Node
Example 3 (Cont’d)
Click Interactive Selection from the property panel of the Variable Clustering node. Select the following variables from each cluster:
Rerun the node.
Connect the Variable Clustering node to a Cluster node. Name this Cluster node as Cluster A.
SAS Clustering Node
Example 3 (Cont’d)
Set the following properties of the Cluster node:
Set the Use of variables Award, CC2_miles, and CC3_miles to No.
Set Internal Standardization to None.
Set Ordinal Encoding to Index.
Set Initialization Method to Full Replacement.
Set Use Defaults to No and Maximum Number of Iterations to 50.
Run the node.
SAS Clustering Node
Example 3 (Cont’d)
The CCC plot suggests the data set contains 37 clusters.
Unless we are looking for a solution with a large number of clusters, this is not a very useful result.
From the Semi-partial R-Squared list, there is a relatively large increase in the SPRSQ value from the 11-cluster to the 10-cluster solution, and also from the 6-cluster to the 5-cluster solution.
The R-squared list also shows larger reductions in value from the 11-cluster solution downwards.
SAS Clustering Node
Example 3 (Cont’d)
For the same Cluster node (or use a new Cluster node), set the Specification Method and Maximum Number of Clusters to User Specify and 11 respectively.
Run the node.
Results:
Trn_CC2_miles has a low RSQ; this is not unexpected. The variable can be removed in the next run of the node.
SAS Clustering Node
Example 3 (Cont’d)
Results:
Segment ID = 3 contains only 2 observations. They have very few days since enrollment and no other activity at all. These observations should be removed from the data set.
SAS Clustering Node
Example 3 (Cont’d)
Segment Profile node
Connect the Cluster node to a Segment Profile node. Use the default variables or selected variables for profiling.
Set Profile All to Yes for profiling all segments.
Set the Minimum Worth to 0.0001 so that almost all input variables (not more than 10) are included in the profile report.
SAS Clustering Node
Example 3 (Cont’d)
Graph Explore node
Connect the Cluster node to a Graph node. Use the default variables or selected variables to show the scatter plot of each pair of selected variables.
Set the Sample Role of _Segment_ to Stratification so that each segment is sampled when the data size is large.
Set the Method of Sample property to Stratify, and set Size to Max.
Run the node.
SAS Clustering Node
Example 3 (Cont’d)
Graph Explore node
From the Graph Explore Results window, select Sample Table, and click View | Plot. Select Matrix from Select a Chart Type window. Select these variables:
Assign the Group role to the _SEGMENT_ variable.
SAS Clustering Node
Example 3 (Cont’d)
Graph Explore node
Right-click the chart and select Graph property. Select the Plot Matrix feature and choose Histogram as the Diagonal Type. Click the OK button.
Click a segment legend or multiple segment legends (Control key + click) to highlight the points of the selected segments in the chart.
Example tables referenced in this chapter:

Aggregating transactional records to the customer level.
Transactional records:
Customer ID | Product
A | P1
B | P2
A | P3
C | P1
B | P3
A | P1

Customer-level data:
Customer ID | Prod_P1 | Prod_P2 | Prod_P3
A | 2 | 0 | 1
B | 0 | 1 | 1
C | 1 | 0 | 0
Converting a nominal column to dummy columns. A Region column with values East, North, South, West, East is converted to four dummy columns; each level maps to one dummy:
Region | Dummy_East | Dummy_North | Dummy_South | Dummy_West
East | 1 | 0 | 0 | 0
North | 0 | 1 | 0 | 0
South | 0 | 0 | 1 | 0
West | 0 | 0 | 0 | 1
Standardizing numeric variables example (purchase probability and commercial viewing time):
Object | Purchase Probability (%) | Viewing Time (minutes) | Viewing Time (seconds)
A | 60 | 3.0 | 180
B | 65 | 3.5 | 210
C | 63 | 4.0 | 240

Distance based on minutes of viewing time:
Pair | Euclidean | City-Block | Max of dimension
A-B | 5.03 | 5.50 | 5
A-C | 3.16 | 4.00 | 3
B-C | 2.06 | 2.50 | 2

Distance based on seconds of viewing time:
Pair | Euclidean | City-Block | Max of dimension
A-B | 30.41 | 35.00 | 30
A-C | 60.07 | 63.00 | 60
B-C | 30.06 | 32.00 | 30
After range standardization, z_ij = (x_ij – a_j) / b_j:
Object | Purchase Probability (%) | Viewing Time (minutes) | Viewing Time (seconds)
A | 0 | 0 | 0
B | 1 | 0.5 | 0.5
C | 0.6 | 1 | 1
Rescaling nominal binary columns example (colour preference data):
ID | Sex | Age | Colour Preference | Rank (1–5)
A | M | 45 | Red | 1
B | M | 58 | Blue | 2
C | F | 34 | Green | 5
D | F | 48 | Blue | 3
E | F | 40 | Red | 4

After converting to dummy columns:
ID | Sex | Age | Colour_Red | Colour_Blue | Colour_Green | Rank
A | 1 | 45 | 1 | 0 | 0 | 1
B | 1 | 58 | 0 | 1 | 0 | 2
C | 0 | 34 | 0 | 0 | 1 | 5
D | 0 | 48 | 0 | 1 | 0 | 3
E | 0 | 40 | 1 | 0 | 0 | 4

Range standardized:
ID | Sex | Age | Colour_Red | Colour_Blue | Colour_Green | Rank
A | 1 | 0.4583 | 1 | 0 | 0 | 0.00
B | 1 | 1.0000 | 0 | 1 | 0 | 0.25
C | 0 | 0.0000 | 0 | 0 | 1 | 1.00
D | 0 | 0.5833 | 0 | 1 | 0 | 0.50
E | 0 | 0.2500 | 1 | 0 | 0 | 0.75
Range standardized and re-scaled (each dummy column divided by sqrt(m) = sqrt(3), i.e. multiplied by 0.5774):
ID | Sex | Age | Colour_Red | Colour_Blue | Colour_Green | Rank
A | 1 | 0.4583 | 0.5774 | 0.0000 | 0.0000 | 0.00
B | 1 | 1.0000 | 0.0000 | 0.5774 | 0.0000 | 0.25
C | 0 | 0.0000 | 0.0000 | 0.0000 | 0.5774 | 1.00
D | 0 | 0.5833 | 0.0000 | 0.5774 | 0.0000 | 0.50
E | 0 | 0.2500 | 0.5774 | 0.0000 | 0.0000 | 0.75
Centroid method example (Income and Education for subjects S1–S6):

Input data (6 clusters):
Subject | Income | Education
S1 | 5 | 5
S2 | 6 | 6
S3 | 15 | 14
S4 | 16 | 15
S5 | 25 | 20
S6 | 30 | 19

Five clusters (S1 and S2 merged):
Cluster | Income | Education
S1 S2 | 5.5 | 5.5
S3 | 15 | 14
S4 | 16 | 15
S5 | 25 | 20
S6 | 30 | 19

Four clusters (S3 and S4 merged):
Cluster | Income | Education
S1 S2 | 5.5 | 5.5
S3 S4 | 15.5 | 14.5
S5 | 25 | 20
S6 | 30 | 19

Three clusters (S5 and S6 merged):
Cluster | Income | Education
S1 S2 | 5.5 | 5.5
S3 S4 | 15.5 | 14.5
S5 S6 | 27.5 | 19.5

Squared Euclidean distances between the original subjects:
 | S1 | S2 | S3 | S4 | S5 | S6
S1 | 0 | 2 | 181 | 221 | 625 | 821
S2 | | 0 | 145 | 181 | 557 | 745
S3 | | | 0 | 2 | 136 | 250
S4 | | | | 0 | 106 | 212
S5 | | | | | 0 | 26
S6 | | | | | | 0

Squared Euclidean distances after S1 and S2 are replaced by their centroid (5.5, 5.5):
 | S1 S2 | S3 | S4 | S5 | S6
S1 S2 | 0 | 163 | 201 | 591 | 783
S3 | | 0 | 2 | 136 | 250
S4 | | | 0 | 106 | 212
S5 | | | | 0 | 26
S6 | | | | | 0
Ward's method example (same Income/Education data):

Five-cluster solutions (total within-cluster sum of squares):
Clusters | SSW
{S1,S2} S3 S4 S5 S6 | 1
{S1,S3} S2 S4 S5 S6 | 90.5
…
{S5,S6} S1 S2 S3 S4 | 13

Four-cluster solutions (total within-cluster sum of squares):
Clusters | SSW
{S1,S2,S3} S4 S5 S6 | 109.33
{S1,S2,S4} S3 S5 S6 | 134.67
…
{S1,S2} {S3,S4} S5 S6 | 2.00
{S1,S2} {S3,S5} S4 S6 | 69.00
…
{S1,S2} {S5,S6} S3 S4 | 14.00

Centroids of the clusters {S1,S2}, {S5,S6}, S3, S4: (5.5, 5.5), (27.5, 19.5), (15, 14), (16, 15).
Agglomerative clustering history (example used for RSQ and SPRSQ):
Number of clusters | Clusters joined | SPRSQ | RSQ | PSF | PST2
20 | CL28, OB27 | 0.0000 | 1.000 | 3.90E+05 | 543
19 | CL31, OB20 | 0.0000 | 1.000 | 3.60E+05 | 868
18 | CL30, OB17 | 0.0000 | 1.000 | 3.30E+05 | 757
17 | OB7, CL39 | 0.0000 | 1.000 | 3.10E+05 | 8839
16 | CL17, CL27 | 0.0001 | 0.999 | 2.60E+05 | 502
15 | OB10, CL21 | 0.0001 | 0.999 | 2.30E+05 | 519
14 | CL20, OB31 | 0.0001 | 0.999 | 2.00E+05 | 557
13 | OB8, CL19 | 0.0002 | 0.999 | 1.70E+05 | 1119
12 | CL18, OB13 | 0.0007 | 0.998 | 1.10E+05 | 3938
11 | CL23, OB23 | 0.0008 | 0.998 | 8.10E+04 | 5868
10 | OB6, CL13 | 0.0008 | 0.997 | 6.90E+04 | 853
9 | CL16, CL33 | 0.0014 | 0.995 | 5.40E+04 | 2728
8 | CL11, CL38 | 0.0017 | 0.994 | 4.50E+04 | 771
7 | CL25, CL8 | 0.0035 | 0.990 | 3.40E+04 | 690
6 | CL10, CL14 | 0.0040 | 0.986 | 2.90E+04 | 1562
5 | CL9, CL22 | 0.0042 | 0.982 | 2.70E+04 | 1308
4 | CL12, CL15 | 0.0049 | 0.977 | 2.80E+04 | 2756
3 | CL6, CL5 | 0.0972 | 0.880 | 7319 | 8597
2 | CL7, CL4 | 0.1017 | 0.778 | 7014 | 8808
1 | CL2, CL3 | 0.7783 | 0.000 | . | 7014
Cluster profile example (purchase total and proportion of purchases in product groups C1–C7, by cluster and overall):
Cluster | Purchase total | C1 | C2 | C3 | C4 | C5 | C6 | C7
1 | 18.2000 | 0.9415 | 0.0585 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
2 | 9.2500 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000
3 | 12.5000 | 0.0000 | 0.0000 | 0.5375 | 0.0375 | 0.4250 | 0.0000 | 0.0000
4 | 60.1807 | 0.0238 | 0.0266 | 0.0191 | 0.0008 | 0.0066 | 0.8758 | 0.0474
5 | 94.5572 | 0.0141 | 0.0048 | 0.2188 | 0.3285 | 0.1093 | 0.3205 | 0.0039
6 | 89.7459 | 0.0188 | 0.3715 | 0.0144 | 0.0009 | 0.0138 | 0.5726 | 0.0079
Overall | 79.863 | 0.0248 | 0.1579 | 0.0766 | 0.0879 | 0.042 | 0.5866 | 0.0243