
CS699 Lecture 10 Clustering

What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
2

Clustering for Data Understanding and Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth observation database
 Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 City-planning: Identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
 Climate: understanding earth climate, finding patterns in the atmosphere and ocean
 Economic science: market research
3

Clustering as a Preprocessing Tool (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any cluster
4

Quality: What Is Good Clustering?
 A good clustering method will produce high quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 its ability to discover some or all of the hidden patterns
5

Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
 The definitions of distance functions are usually different for interval-scaled, Boolean, categorical, ordinal, and vector variables
 Weights should be associated with different variables based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
6

Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
7

Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixtures of these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
8

Major Clustering Approaches
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
9

Major Clustering Approaches
 Model-based
 Grid-based
 Frequent pattern-based
 User-guided or constraint-based
 Link-based clustering
10

Partitioning Algorithms: Basic Concept
 Partitioning method: Partitions a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where p is an object and ci is the centroid or medoid of cluster Ci).
Ek  (pc)2 i 1 pCi i
 This is also called within-cluster variation or SSE (sum of squared errors).
11

Partitioning Algorithms: Basic Concept
 More about SSE:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
 The outer sum runs over all k clusters, the inner sum runs over all objects in a cluster, and each term is the squared distance between an object and the centroid of its cluster.
12
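A minimal sketch (not from the slides) of computing E in Python with NumPy, given a dataset, a cluster assignment, and the centroids; the example reuses the small 2-D dataset from the k-means illustration later in this lecture, with one possible assignment of its objects.

import numpy as np

def sse(X, labels, centroids):
    # Sum over clusters of squared Euclidean distances between each
    # object and the centroid of the cluster it is assigned to.
    total = 0.0
    for i, c in enumerate(centroids):
        members = X[labels == i]              # objects in cluster C_i
        total += np.sum((members - c) ** 2)   # squared distances to c_i
    return total

X = np.array([(1, 3), (3, 5), (4, 1), (5, 3), (5, 4), (7, 2), (7, 4)], dtype=float)
labels = np.array([0, 0, 1, 1, 0, 1, 1])      # one possible assignment
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(sse(X, labels, centroids))              # within-cluster variation E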

Partitioning Algorithms: Basic Concept
 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
13

The K-Means Clustering Method
 Given k, the k-means algorithm works as follows:
1. Arbitrarily choose k points as the initial centroids (the centroid is the center, i.e., mean point, of the cluster). Each centroid represents a cluster.
2. Assign each object to the cluster with the nearest centroid.
3. Compute new centroids.
4. Go back to Step 2. Stop when the membership assignment does not change or another criterion is met.
14
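A minimal Python/NumPy sketch of the four steps above (not the only possible implementation; the random choice of initial centroids and the convergence test on the centroids are implementation choices, not prescribed by the slides):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: compute new centroids (keep the old one if a cluster is empty).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop when the centroids (and hence the memberships) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([(1, 3), (3, 5), (4, 1), (5, 3), (5, 4), (7, 2), (7, 4)], dtype=float)
print(k_means(X, k=2))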

The K-Means Clustering Method
 Other stopping criteria
 After each reassignment, E is computed; stop when E falls below a predefined threshold.
 If the decrease in E, between two consecutive iterations, falls below a predefined threshold.
 Run for a predetermined number of iterations (e.g., run 20 iterations).
15

Outline of K-Means
[Figure: starting from the initial data set with k = 2, choose initial centroids arbitrarily, assign each object to the closest centroid, compute new centroids, reassign objects, and repeat until there is no change.]
16

K-Means Illustration
Initial dataset, D = {(1, 3), (3, 5), (4, 1), (5, 3), (5, 4), (7, 2), (7, 4)}
Two objects are randomly chosen as the initial centroids.
17

K-Means Illustration
Objects are assigned to the cluster with the closest centroid.
New centroids are computed:
C1.x = (1+3+5)/3 = 3, C1.y = (3+4+5)/3 = 4
C2.x = (4+5+7+7)/4 = 5.75, C2.y = (1+2+3+4)/4 = 2.5
18

K-Means Illustration
Objects are reassigned based on the distances to the new centroids. Note that object (5, 4) moved to the cluster of C2.
New centroids are computed:
C1.x = (1+3)/2 = 2, C1.y = (3+5)/2 = 4
C2.x = (4+5+5+7+7)/5 = 5.6, C2.y = (1+2+3+4+4)/5 = 2.8
19
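A quick check (not part of the slides) of the centroid arithmetic above, after object (5, 4) has moved to the cluster of C2:

import numpy as np

cluster1 = np.array([(1, 3), (3, 5)], dtype=float)
cluster2 = np.array([(4, 1), (5, 3), (5, 4), (7, 2), (7, 4)], dtype=float)
print(cluster1.mean(axis=0))   # [2.  4. ]  -> C1 = (2, 4)
print(cluster2.mean(axis=0))   # [5.6 2.8]  -> C2 = (5.6, 2.8)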

K-Means Illustration
Objects are reassigned based on the distances to new centroids.
There is no membership change.
So, stop here.
20

Comments on the K-Means Method
 Strength: Efficient (roughly O(tkn) time, where n is the number of objects, k the number of clusters, and t the number of iterations)
 Weakness
 Initial random selection of centroids affects the results (i.e., the algorithm may converge to a poor local optimum rather than the global optimum)
 Run k-means multiple times with different initial centroids
 Applicable only when the mean of objects can be defined
 Use the k-modes method for categorical data
 Need to specify k, the number of clusters, in advance
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with arbitrary shapes
21

Comments on the K-Means Method
 Weakness (continued)
 Selection of initial centroids (good choice)
[Figure: six scatter plots (Iterations 1-6) showing k-means converging to a good clustering when the initial centroids are well chosen.]
22

Comments on the K-Means Method
 Weakness (continued)
 Selection of initial centroids (bad choice)
[Figure: five scatter plots (Iterations 1-5) showing k-means converging to a poor clustering when the initial centroids are badly chosen.]

Comments on the K-Means Method
 Weakness (continued)
 Sensitive to noisy data and outliers
 Consider one-dimensional objects: {1, 2, 3, 8, 9, 10, 25}
 Reasonable clustering:
Two clusters {1, 2, 3} and {8, 9, 10}, and an outlier 25.
 Run k-means with k = 2:
Two clusters such as {1, 2, 3} and {8, 9, 10, 25}, where the outlier 25 pulls the second centroid away from 8, 9, and 10.
 K-medoids method is more robust in the presence of noise/outliers
24
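A small sketch of this effect; the two initializations below are my own choices, not from the slides. With either start, k-means fails to recover {1, 2, 3} and {8, 9, 10} as two separate clusters: the outlier 25 either drags a centroid toward itself or takes a cluster of its own.

import numpy as np

def lloyd_1d(x, centroids, max_iter=100):
    # Plain Lloyd iterations on one-dimensional data.
    x = np.asarray(x, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        labels = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        new = np.array([x[labels == i].mean() for i in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

x = [1, 2, 3, 8, 9, 10, 25]
print(lloyd_1d(x, centroids=[2, 9]))    # converges to {1,2,3} and {8,9,10,25}
print(lloyd_1d(x, centroids=[3, 25]))   # converges to {1,...,10} and {25}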

Comments on the K-Means Method
 Weakness (continued)
 Not suitable to discover clusters with arbitrary shapes
[Figure: the initial dataset; the result of k-means with k = 3; the natural clustering, which is the output of the DBSCAN algorithm.]
25

Variations of the K-Means Method
 Most of the variants of the k-means differ in
 Selection of the initial centroids
 Dissimilarity calculations
 Strategies to calculate cluster means
26

Variations of the K-Means Method
 Bisecting k-means
1. Initially a single cluster includes all objects.
2. The cluster is partitioned into two clusters using k-means. (This step can be repeated multiple times and the split with the smallest SSE chosen.)
3. A cluster is selected (based on a certain criterion), and go to step 2.
4. Stop when we have k clusters. (This becomes a hierarchical clustering; refer to later slides.)
27
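A possible sketch of bisecting k-means, using scikit-learn's KMeans as the two-way splitter (assumes scikit-learn is installed; splitting the cluster with the largest SSE is one common choice for the selection criterion in step 3, not something the slides prescribe):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]                 # step 1: one cluster with all objects
    while len(clusters) < k:                       # step 4: stop at k clusters
        # Step 3's criterion here: pick the cluster with the largest SSE to split next.
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        # Step 2: split it into two clusters with 2-means.
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

X = np.random.default_rng(0).normal(size=(100, 2))   # synthetic data, for illustration only
print(bisecting_kmeans(X, k=4))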

Variations of the K-Means Method
 Bisecting k-means illustration
[Figure: the initial dataset, and the clustering after the 1st, 2nd, and 3rd bisecting steps.]
28

Hierarchical Clustering
 Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging objects a, b, c, d, e into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0, splitting {a,b,c,d,e} down to individual objects.]
29

AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing AGNES progressively merging nearby objects into larger clusters.]
30

Weakness of Agglomerative Clustering
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n2), where n is the number of total objects
31

Dendrogram: Shows How Clusters are Merged
• Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
32

Dendrogram: Shows How Clusters are Merged
33

Hierarchical Clustering
 Example (agglomerative): nine objects with two attributes

Obj  A1  A2
1    2   6
2    3   4
3    3   8
4    4   5
5    4   7
6    6   2
7    7   2
8    7   4
9    8   4
34
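A sketch of how this dataset could be clustered programmatically (assumes SciPy; single link with Manhattan/city-block distance is an assumption chosen to match the distance examples later in this lecture, not necessarily what the SPSS output below used):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[2, 6], [3, 4], [3, 8], [4, 5], [4, 7],
              [6, 2], [7, 2], [7, 4], [8, 4]], dtype=float)

Z = linkage(X, method="single", metric="cityblock")   # merge history
print(Z)                                        # each row: the two clusters merged, their distance, new size
print(fcluster(Z, t=3, criterion="maxclust"))   # cut the tree into 3 clusters (3 is an arbitrary choice)
# dendrogram(Z) would draw the corresponding tree (requires matplotlib).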

Hierarchical Clustering
• SPSS hierarchical clustering output
35

Hierarchical Clustering
• SPSS hierarchical clustering output
36

Hierarchical Clustering
37

Hierarchical Clustering
38

DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
[Figure: three scatter plots showing DIANA progressively splitting the dataset into smaller clusters.]
39

Distance between Clusters
 Minimum distance (single link)
 Maximum distance (complete link)  Average distance
 Mean distance
40

Distance between Clusters
 Minimum distance (single link): smallest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} \mathrm{dist}(t_{ip}, t_{jq})$
minimum distance = 3 (using Manhattan distance)
41

Distance between Clusters
 Maximum distance (complete link): largest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \max_{t_{ip} \in K_i,\, t_{jq} \in K_j} \mathrm{dist}(t_{ip}, t_{jq})$
maximum distance = 10 (using Manhattan distance)
42

Distance between Clusters
 Average distance: average of the distances between all pairs of elements, one from each cluster, i.e., $\mathrm{dist}(K_i, K_j) = \frac{1}{|K_i|\,|K_j|} \sum_{t_{ip} \in K_i} \sum_{t_{jq} \in K_j} \mathrm{dist}(t_{ip}, t_{jq})$
d(a,e) = 6 d(a,f) = 7 d(a,g) = 8
d(b,e) = 6 d(b,f) = 5 d(b,g) = 8
d(c,e) = 7 d(c,f) = 10 d(c,g) = 7
d(d,e) = 3 d(d,f) = 6 d(d,g) = 5
average distance = average of all the above = 78 / 12 = 6.5
43

Distance between Clusters
 Mean distance: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
C1 = (1.75, 4.25) C2 = (6.67, 4.33)
mean distance = 5.0
(using Manhattan distance)
44
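The four inter-cluster distances can be computed directly; the sketch below uses made-up coordinates (the slides give only the pairwise distances and the centroids, not the points themselves) and Manhattan distance.

import numpy as np
from scipy.spatial.distance import cdist

K1 = np.array([[1, 3], [1, 5], [2, 4], [3, 5]], dtype=float)   # hypothetical cluster 1
K2 = np.array([[6, 4], [7, 4], [7, 5]], dtype=float)           # hypothetical cluster 2

D = cdist(K1, K2, metric="cityblock")    # all pairwise distances between the clusters
print("single link  :", D.min())         # minimum distance
print("complete link:", D.max())         # maximum distance
print("average      :", D.mean())        # average over all pairs
print("mean distance:", cdist(K1.mean(axis=0, keepdims=True),
                              K2.mean(axis=0, keepdims=True),
                              metric="cityblock")[0, 0])   # distance between centroids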

Evaluation of Clustering
 Assessing clustering tendency
 See whether there are non-random structures in the dataset (i.e., whether there are natural clusters in the dataset)
 Measure the probability that the dataset was generated by a uniform data distribution. If so, there may not be any natural clusters at all.
45

Evaluation of Clustering
 Determining the number of clusters
 Not trivial because, in part, we don’t know the “right” number of clusters
 A simple method: set $k \approx \sqrt{n/2}$ for a dataset of n points
 Elbow method: as we increase k, find the point where the marginal reduction in SSE becomes insignificant (typically a turning point, the “elbow”, in the plot of SSE versus k)
 Cross-validation method: While changing the value of k, perform cross-validation and select k that gives the best result.
46
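A minimal elbow-method sketch (assumes scikit-learn; the three-blob dataset is synthetic and only for illustration):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])   # three well-separated groups

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ is the within-cluster SSE
# The SSE drops sharply up to k = 3 (the true number of groups here) and then
# flattens; the turning point ("elbow") suggests k = 3.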

Evaluation of Clustering
 Cross-validation
 Divide the dataset into m parts; build a clustering on m − 1 parts and compute the SSE of the objects in the remaining part with respect to the resulting centroids
 Repeat this m times and take the average of the SSEs
 Do this for k = 2, 3, 4, … and select the k with the smallest average SSE
47
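One way to implement this idea (a sketch assuming scikit-learn; using the squared distance of each held-out object to its nearest learned centroid as the test SSE is an interpretation, not something spelled out on the slide):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_sse(X, k, m=5, seed=0):
    fold_sse = []
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True, random_state=seed).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[train_idx])
        # Squared distance from each held-out object to its nearest centroid.
        d2 = ((X[test_idx, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
        fold_sse.append(d2.min(axis=1).sum())
    return np.mean(fold_sse)              # average SSE over the m folds

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
for k in range(2, 7):
    print(k, round(cv_sse(X, k), 1))      # pick the k with the smallest average SSE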

Evaluation of Clustering
 Measuring clustering quality
 Extrinsic method
 When ground truth is available
 BCubed precision, BCubed recall
 Intrinsic method
 When ground truth is not available
 Measures how well clusters are separated and how compact each cluster is
 Silhouette coefficient
48

Evaluation of Clustering
 Silhouette coefficient of an object o
 Measures how well clusters are separated and how compact each cluster is.
 Assume objects are partitioned into k clusters, C1, C2, …, Ck.
 The silhouette coefficient is calculated as:
$s(o) = \dfrac{b(o) - a(o)}{\max\{a(o),\, b(o)\}}$
49

Evaluation of Clustering
 a(o)
 Represents the compactness of the cluster o belongs to (say o ∈ Ci).
 Calculates the average distance between the object o and all other objects in the same cluster:
$a(o) = \dfrac{\sum_{o' \in C_i,\, o' \neq o} \mathrm{dist}(o, o')}{|C_i| - 1}$
 Smaller values are better.
50

Evaluation of Clustering
 b(o)
 Represents how far o is from the other clusters.
 Calculates the minimum, over all other clusters, of the average distance from o to the objects of that cluster:
$b(o) = \min_{C_j:\, 1 \le j \le k,\, j \neq i} \left\{ \dfrac{\sum_{o' \in C_j} \mathrm{dist}(o, o')}{|C_j|} \right\}$
 Larger values are better.
51

Evaluation of Clustering
s(o)  b(o)  a(o) max{a(o), b(o)}
 s(o) is between -1 and 1
 Closer to 1 means it is better
 Negative: not good; o is closer to objects in other clusters than to objects in its own cluster.
 Overall cluster quality:
 Compute the average silhouette coefficient of all objects
 Use the average to evaluate the quality of the clustering
52
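A direct implementation of a(o), b(o), and s(o) as defined above (a sketch; Euclidean distance is assumed, and the tiny dataset is the one from the k-means illustration):

import numpy as np

def silhouette(X, labels):
    s = np.zeros(len(X))
    for i, (x, li) in enumerate(zip(X, labels)):
        d = np.linalg.norm(X - x, axis=1)            # distances from o to every object
        same = (labels == li)
        # a(o): average distance to the other objects of o's own cluster
        a = d[same].sum() / (same.sum() - 1) if same.sum() > 1 else 0.0
        # b(o): smallest average distance from o to the objects of another cluster
        b = min(d[labels == lj].mean() for lj in set(labels) if lj != li)
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([(1, 3), (3, 5), (4, 1), (5, 3), (5, 4), (7, 2), (7, 4)], dtype=float)
labels = np.array([0, 0, 1, 1, 1, 1, 1])
s = silhouette(X, labels)
print(s, s.mean())   # the average silhouette coefficient measures overall quality
# sklearn.metrics.silhouette_score(X, labels) should give the same average here.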

References
• Han, J., Kamber, M., Pei, J., “Data mining: concepts and techniques,” 3rd Ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/
• Tan, P., Steinbach, M., Kumar, V., “Introduction to Data Mining,” Addison Wesley, 2006.
53