
BASIC AND ADVANCED CLUSTERING
Applied Analytics: Frameworks and Methods 2

Outline
■ The Concept of Clustering
■ Market Segmentation
■ Input to Predictive modeling
■ Process of Clustering
■ Hierarchical Clustering
■ k-Means
■ Model-based Clustering

Cluster Analysis
■ Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.
■ Thus, clustering places similar objects in the same cluster and dissimilar objects in different clusters.
■ It groups observations based on their similarity on a set of variables (in this sense, cluster analysis differs from dimension reduction techniques like principal component analysis, which group variables rather than observations).

Cluster Analysis
■ We are all accustomed to organizing our belongings. Clustering is about organizing data.

Cluster Analysis
■ Classification technique used to group together observations, typically based on spatial distance
■ No statistical basis to this grouping
■ Used primarily as an exploratory technique
– So, no hypotheses
– No unique solution
■ Number of clusters is usually guided by theory and inter-cluster distance

MARKET SEGMENTATION

Market Segmentation
■ “You can have the Ford in any color, as long as it is black.” – Henry Ford

Available Sizes: One size fits all
Available Styles: One style fits all
Available Colors: One color fits all
Suitable for: Girls and Boys; Women and Men; Teens and Senior Citizens; Americans and Non-Americans

One Size Doesn’t Fit All

Market Segmentation
■ In general
– Customers are happier when they get what they like
– Generic one-size offerings lead to lower satisfaction
■ On the other hand, cost of catering to individual customers is very high
■ Segmentation is a compromise between serving individual customers’ needs and offering all customers the same product

Market Segmentation
■ Market Segmentation involves dividing a market into distinct groups with
– distinct needs, characteristics, or behavior, and
– who respond differently to a marketing action

Segmentation and Targeting
1. Break the market down
2. Group into segments
3. Choose the target market

Segmentation (for Carpet Fibers)
Perceptions/ratings for one respondent: customer values
[Figure: scatterplot of respondents on Strength (Importance) vs. Water Resistance (Importance), showing segment centers A, B, C, and D and the distance between segment centers C and D.]
A, B, C, D: location of segment centers. Typical members: A: schools; B: light commercial; C: indoor/outdoor carpeting; D: health clubs.

[Figure: Coffee buyers → Cluster Analysis → Profile Clusters.]

Market Segmentation Approaches
■ Heuristic Approach
– Use observable variables to segment the market
– Variables selected to segment are expected to be the ones on which customer needs vary
– This is the traditional approach and does not involve cluster analysis
■ Needs-based Clusters
– Use data on customer needs to generate clusters of individuals that differ in their needs
– Profile clusters on observable variables (e.g., gender, age, location)

Market Segmentation – Heuristic Approach
■ Here are some ways markets may be segmented using the Heuristic Approach
■ Consumer markets
– Geography (region, country, state, city)
■ E.g., Walmart, McDonald’s
– Demographic (age, gender, income, lifecycle, occupation)
■ E.g., Gap, Abercrombie and Fitch, American Eagle, Toyota, Target vs. Walmart, Canon PowerShot
– Psychographic (personality traits such as conscientiousness, openness to experience; athleticism, innovativeness)
■ E.g., GNC, Paragon, EMS
– Behavioral (loyalty, purchase quantity, frequency)
■ All airlines, Costco, credit cards
■ Business markets
– Demographics (industry, company size)
– Operating variables (technology, usage status)
– Purchasing approaches
– Situational factors (urgency, order size)
– Personal characteristics (buyer-seller similarity, loyalty)

Needs-based Clusters: Process
■ Articulate a strategic rationale for segmentation (i.e., why are we segmenting this market?).
■ Select a set of needs-based segmentation variables most useful for achieving the strategic goals.
■ Select a cluster analysis procedure for aggregating (or disaggregating) customers into segments.
■ Group customers into a defined number of different segments.
■ Choose the segments that will best serve the firm’s strategy, given its capabilities and the likely reactions of competitors.

INPUT TO PREDICTIVE MODELING

Cluster before Predictive Modeling
■ Clustering data before predictive modeling can improve prediction quality
■ Cluster Data → Predictive Model for each Cluster → Combine Predictions

Cluster before Predictive Modeling
[Figure: Data is split into Cluster 1, Cluster 2, and Cluster 3; each cluster feeds its own predictive model (Predictive Models 1–3), producing Predictions 1–3, which are then combined.]
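The sketch below illustrates this flow in R with simulated data, k-means for the clustering step (k-means is covered later in the deck; any clustering method could be used), and a separate linear model per cluster standing in for the predictive models. All names and choices here are illustrative, not prescribed by the slides.

set.seed(617)
dat   <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
dat$y <- dat$x1 + rnorm(300)                   # simulated outcome

# Step 1: cluster the observations (illustrative choice of 3 clusters)
cl <- kmeans(scale(dat[, c("x1", "x2")]), centers = 3, nstart = 25)$cluster

# Step 2: fit one predictive model per cluster; Step 3: combine predictions
preds <- numeric(nrow(dat))
for (k in 1:3) {
  idx        <- cl == k
  model_k    <- lm(y ~ x1 + x2, data = dat[idx, ])
  preds[idx] <- predict(model_k, newdata = dat[idx, ])
}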

PROCESS OF CLUSTER ANALYSIS

Cluster Analysis
Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis or numerical taxonomy.

Steps in Cluster Analysis
1. Select variables for analysis
2. Prepare data
3. Compute similarity using a metric
4. Apply a clustering technique
5. Interpret results to determine number of segments (for some methods)
6. Evaluate: eliminate outliers and validate the clustering scheme with another sample
7. Profile clusters and evaluate the usefulness of the resulting segments

Select variables for analysis
■ Clusters are only as good as the variables chosen to define them.
■ Software cannot offer guidance on the practical use of the variables.
■ Consider the relevance of the chosen variables to the problem
– In market segmentation, variables are chosen based on customer needs. For a fitness club, these could include visit frequency and workout goals.
– On the other hand, is number of visits per week to the gym relevant in segmenting the market for Folgers coffee? What about variety-seeking behavior? What about innovativeness?
■ Using multiple variables that tap the same dimension may overweight that dimension’s impact in defining the clusters.
■ The nature of the variables selected influences the choice of clustering algorithm

Prepare Data
■ Transform format of variables
– Clustering algorithms prefer all variables to be of the same class (e.g., numeric, factor)
■ Impute missing data
– Clustering algorithms use data on all variables. A missing observation on one variable will cause the entire row of data to be ignored for analysis. Therefore, it is important to impute missing values.
■ Scale
– For distance-based clustering methods, the scale of the variable (e.g., seconds vs. minutes) affects the weight assigned to the variable. Therefore, it is best to standardize variables.
■ Redundant variables
– Multiple variables measuring the same dimension should be replaced by the underlying factor or a representative variable.
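A minimal sketch of these preparation steps in R; the data frame and the simple mean-imputation strategy are illustrative placeholders, not a recommendation from the slides.

# Hypothetical data with a missing value
df <- data.frame(visits  = c(2, 5, NA, 3),
                 minutes = c(30, 45, 60, 20))

# Impute missing values with the column mean so no rows are dropped
df[] <- lapply(df, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

# Standardize so each variable is on the same scale
df_scaled <- scale(df)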

Compute Similarity
Similarity between observations is computed for distance-based methods but not for model-based methods.
■ Distance-based methods
– Find groups that minimize the distance between members within the group while maximizing the distance from members of other groups
– Popular methods are hierarchical clustering and k-means
– Computing the similarity between observations is a prerequisite
■ Model-based methods
– View the data as a mixture of groups sampled from different distributions
– In this approach, similarity between observations is not computed

Compute Similarity
■ Almost always a distance-based measure (as opposed to a correlational measure)
■ Distance-based similarity measures differ in how distance is computed
– Euclidean distance
– Maximum
– Manhattan
– Canberra (weighted Manhattan distance, Wikipedia)
– Minkowski (Wikipedia)
– Mahalanobis distance (accounts for correlations among variables)
■ Distance-based similarity measures are sensitive to scale (e.g., minutes and seconds). Therefore, it is desirable to standardize.
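To illustrate the scale sensitivity noted above, a small sketch with two hypothetical variables measured in very different units:

# Hypothetical customers: time in seconds and income in dollars
x <- data.frame(seconds = c(10, 20, 30),
                income  = c(50000, 52000, 90000))

dist(x)         # Euclidean distances dominated by the income variable
dist(scale(x))  # after standardizing, both variables contribute comparably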

Apply a Clustering Technique
■ Hierarchical Clustering
■ k-means clustering
■ Model-based clustering
■ Latent class analysis

Interpret Results:
Select Number of Clusters
■ Domain knowledge
– Whenever possible, the decision on the number of clusters should be based on an understanding of the problem.
■ Statistical criterion
– Some algorithms require specifying the number of clusters up front (e.g., kmeans and poLCA), while others require one to infer the number of clusters from the output (e.g., hclust)

Validation
■ In the absence of an outcome variable, there is no “right answer” against which to evaluate a clustering scheme
■ It is best to validate the clustering scheme with another sample or a holdout sample
■ To examine the potential bias from outliers, the cluster results with outliers can be compared to those without outliers.

Evaluate Meaningfulness
■ The usefulness of results will depend on the meaningfulness of clusters in the context of the problem.

TYPES

Cluster Analysis Types
■ Hierarchical Clustering
■ k-means clustering
■ Model-based clustering methods
– Finite mixture models

HIERARCHICAL CLUSTERING

Hierarchical Clustering
■ Popular method that groups observations according to their similarity
■ Distance-based algorithm that operates on a dissimilarity matrix
■ Algorithm begins with each observation in its own cluster.
■ Successively joins neighboring observations or clusters one at a time according to their distances from one another, and continues until all observations are linked
■ This process of repeatedly joining observations is called the agglomerative method
■ hclust()

Distance Measure
■ Distance between observations may be measured as
– Euclidean distance
– Maximum
– Manhattan
– Canberra (weighted Manhattan distance, Wikipedia)
– Minkowski (Wikipedia)
■ dist(x = data, method = "euclidean")
■ Most distance metrics are only defined when observations are numeric. If the dataset contains a mix of variable types, consider
– library(cluster); daisy(x = data, metric = "gower")
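A short sketch of both calls on a hypothetical data frame; note that daisy() uses Gower’s coefficient when the columns are not all numeric, so its metric argument only takes effect for purely numeric input.

library(cluster)

# Hypothetical mixed-type data: one numeric variable, one factor
dat <- data.frame(spend   = c(120, 80, 300, 150),
                  channel = factor(c("web", "store", "web", "store")))

dist(scale(dat$spend))        # dist() handles numeric data only
daisy(dat, metric = "gower")  # daisy() accommodates mixed variable types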


Clustering Method
■ The distance between a pair of points measures how similar those two points are.
■ As observations are combined into clusters, distances must also be evaluated between clusters, or between a point and a cluster.
■ There are different approaches to grouping observations; these are called clustering methods.

Clustering Method
■ Linkage methods generate clusters based on the distance between observations
– single
– complete
– average
■ Variance methods generate clusters to minimize the within-cluster variance. Ward’s minimum variance method aims to find compact, spherical clusters
– ward.D2 (the original Ward (1963) clustering criterion)
– ward.D
■ Centroid methods compute the distance between cluster centroids
– median
– centroid
■ hclust(d, method = "complete")
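A minimal end-to-end sketch in R, using the built-in mtcars data and Ward’s method purely for illustration:

# Standardize, compute pairwise distances, then cluster hierarchically
d  <- dist(scale(mtcars), method = "euclidean")
hc <- hclust(d, method = "ward.D2")

plot(hc)  # dendrogram; cluster membership is extracted later with cutree()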

Clustering Methods
Illustration of Linkage Methods
[Figure: the distance between cluster “1-2” and point “3” under each linkage method.]

Illustration: Average Linkage Method for Clustering
Credit: Formatting suggestions by Rohan Lala.

Interpret Results: Dendrogram

Interpret Results:
Determine Number of Clusters
■ A hierarchical dendrogram is interpreted primarily by height and by where observations are joined. The height represents the dissimilarity between the elements being joined.
■ The cophenetic correlation coefficient (CPC) is a goodness-of-fit statistic for hierarchical clustering that assesses how well the dendrogram preserves the original pairwise distances. It is interpreted like a correlation coefficient; CPC > 0.7 indicates a relatively strong fit.
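A sketch of computing the CPC in R; the data and the choice of average linkage are illustrative:

d  <- dist(scale(mtcars))
hc <- hclust(d, method = "average")

cor(cophenetic(hc), d)  # CPC; > 0.7 suggests the dendrogram fits the distances well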

Interpret Results:
Determine Number of Clusters
■ Number of clusters may be determined by examining the height between branch splits.
■ In this dendrogram, a cut between heights 8 and 12 would generate two clusters, and a cut between heights 4 and 7 would create a four-cluster solution. Based on the distances, a three-cluster solution does not seem viable.
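Continuing the sketch from earlier, cutree() converts a chosen cut, specified either by a height read off the dendrogram or by a number of clusters, into cluster assignments. The data and heights here are illustrative:

hc <- hclust(dist(scale(mtcars)), method = "ward.D2")  # illustrative fit

clusters_by_height <- cutree(hc, h = 10)  # cut at a chosen height
clusters_by_k      <- cutree(hc, k = 4)   # or request a fixed number of clusters
table(clusters_by_height, clusters_by_k)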


Meaningfulness of Clusters
■ Hierarchical clustering will seldom converge on a single cluster solution.
■ Ultimate decision on the number of clusters to pick should be made based on distances and meaningfulness of clusters
■ The characteristics of the clusters should be examined based on the features used for clustering and other variables not used for clustering.
■ In conducting market segmentation, it is common to segment based on needs-based variables (e.g., attitude, preference, and product usage), but to profile segments based on demographic variables (e.g., age, gender, income).

Hierarchical Clustering
■ Hierarchical clustering is simple and intuitive.
■ However, this technique does not scale well to large datasets because it requires computing the distance between every pair of observations. For large datasets this is very expensive: for n observations there are n(n-1)/2 distances.
■ The computation of the distance matrix can consume significant computing resources, and the distance matrix itself can overwhelm system memory.
■ Let us turn to another technique that is less resource intensive.

K-MEANS CLUSTERING

k-means Clustering
■ Attempts to find groups that are most compact, in terms of the mean sum-of-squares deviation of each observation from the multivariate center (centroid) of its assigned group
■ Non-hierarchical process that reaches a solution through iteration
■ k-means begins by arbitrarily placing centroids in the data and then iterates from that point to the final solution. Despite this seemingly haphazard start, it produces clusters of similar quality to hierarchical clustering, and much faster.

k-means Process
■ Requires an a priori definition of number of clusters
■ Begins with an arbitrary assignment of observations to cluster centroids. Seed is important for reproducibility.
■ Iterative process involves
– Updating cluster centers
– Reassigning observations
■ Updating is done to minimize sum of squares from observations to cluster centers
■ Since it explicitly computes a mean deviation, k-means clustering relies on Euclidean distance and is therefore only suitable for numeric data
■ The method of updating is defined by algorithms including
– "Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"
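A minimal kmeans() sketch; the data, seed, and choice of k are illustrative:

dat <- scale(mtcars)   # standardize; k-means is sensitive to scale
set.seed(617)          # the seed fixes the random initial centers

km <- kmeans(dat, centers = 3, nstart = 25, algorithm = "Hartigan-Wong")

km$centers       # cluster centroids
km$cluster       # cluster assignment for each observation
km$tot.withinss  # the objective that the iterations minimize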

k-means
■ Initial assignment of centers is influenced by seed
■ Sensitive to scale of variables as similarity is based on Euclidean distance

Visualizing k-means
The k-means clustering algorithm involves the following steps:
1. Identify number of centers, k, to be used
2. Randomly place centers in data
3. Assign observations to nearest center
4. Update cluster centers
5. Reassign cases
6. Repeat steps 4 and 5 until convergence
Next few slides visually illustrate the process until convergence is reached in iteration 5.


Entire iterative process
Take a look at k_means_process.gif for an animated version of the iterative process.

k: Number of Clusters
■ For k-means, the decision of k must be made before running the algorithm
■ Initial choice of k must be well thought-out and rely on domain knowledge as it drives the cluster solution
■ There are also data-driven approaches to determining the number of clusters. Two of them are
– Total within cluster sum of squares
– Silhouette Method

k: Number of Clusters – Within Sum of Squares
■ Total within sum of squares plot
– Compute the total within sum of squares for a number of values of k. Plot a line graph of k (on the x-axis) against total within sum of squares (on the y-axis). The ideal number of clusters is inferred from a sudden change in the line graph, commonly known as the “elbow”.
■ Ratio plot
– An alternative that generates a similar conclusion is to compute the ratio of between-cluster sum of squares to total sum of squares for a number of values of k. The ideal number of clusters is again inferred from the elbow in the graph.
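A sketch of both plots in R; the data, seed, and the range of k are illustrative:

dat <- scale(mtcars)
set.seed(617)
fits  <- lapply(1:10, function(k) kmeans(dat, centers = k, nstart = 25))
wss   <- sapply(fits, function(f) f$tot.withinss)
ratio <- sapply(fits, function(f) f$betweenss / f$totss)

plot(1:10, wss,   type = "b", xlab = "k", ylab = "Total within SS")       # elbow plot
plot(1:10, ratio, type = "b", xlab = "k", ylab = "Between SS / Total SS") # ratio plot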

[Figure: within sum of squares plot and ratio plot for a range of k.]

k: Number of Clusters – Silhouette Method
■ Silhouette analysis reflects how well points fit in their respective clusters
■ It involves computing the silhouette width, s(i), for each point
■ s(i) = (b(i) – a(i)) / max{a(i), b(i)}, where
– a(i) is the average distance of point i to all other observations within its own cluster
– b(i) is the average distance of point i to all observations in the nearest neighboring cluster
■ -1 <= s(i) <= 1
– s(i) = 1 implies i is in the right cluster
– s(i) = 0 implies i lies between two clusters
– s(i) = -1 implies i should be in the nearest neighboring cluster
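A sketch of silhouette analysis using cluster::silhouette(); the data and choice of k are illustrative:

library(cluster)

dat <- scale(mtcars)
set.seed(617)
km <- kmeans(dat, centers = 3, nstart = 25)

sil <- silhouette(km$cluster, dist(dat))  # s(i) for each observation
mean(sil[, "sil_width"])                  # average silhouette width for k = 3
plot(sil)                                 # silhouette plot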
k: Number of Clusters using Silhouette Method
[Figure: silhouette plot for candidate values of k.]

k-means
■ Relatively low computational cost is the most important benefit of k-means over hierarchical clustering
■ However, k-means
– is sensitive to starting values, which depend on the seed
– only allows the use of one distance metric: Euclidean
– lacks the benefit of a dendrogram for inferring the number of clusters

MODEL-BASED CLUSTERING

Model-Based Clustering
■ The key idea of model-based clustering is that observations come from groups with different statistical distributions (such as different means and variances). The algorithms try to find the best set of such underlying distributions to explain the observed data.
■ These methods are also known as mixture models because the data are assumed to reflect a mixture of observations drawn from different populations.
■ Mclust() (in the mclust package) models such clusters as draws from a mixture of normal (also known as Gaussian) distributions.
■ Only useful when data are numeric

Model-Based Clustering
■ Infers the ideal number of clusters by comparing a variety of different mixture shapes.
■ It is also possible to specify the number of clusters
■ Different solutions may be compared using log-likelihood or BIC

Number of Clusters based on BIC
[Figure: BIC plotted against the number of clusters for different mixture shapes.]

Other Clustering Techniques
■ DBSCAN
■ OPTICS

INPUT TO PREDICTIVE MODELING

Cluster before Predictive Modeling
■ Clustering data before predictive modeling can improve prediction quality
■ This approach involves generating clusters of observations and fitting a different predictive model for each cluster. The predictions are then combined into a single vector.
■ Cluster Data → Predictive Model for each Cluster → Combine Predictions

Cluster before Predictive Modeling
[Figure: Data is split into Clusters 1–3; each cluster feeds its own predictive model, producing Predictions 1–3.]

Conclusion
■ In this module, we
– discussed the concept of clustering
– examined the application of clustering for market segmentation and as an input to predictive modeling
– reviewed the process of clustering
– discussed three clustering algorithms: hierarchical clustering, k-means, and model-based clustering