Data Mining: Concepts and Techniques
— Chapter 10 —
Qiang (Chan) Ye, Faculty of Computer Science, Dalhousie University
Chapter 10. Cluster Analysis: Basic Concepts and Methods
n Cluster Analysis: Basic Concepts
n Partitioning Methods
n Hierarchical Methods
n Density-Based Methods
What is Cluster Analysis?
n Cluster: A collection of data objects
n similar (or related) to one another within the same group
n dissimilar (or unrelated) to the objects in other groups
n Clustering (or cluster analysis, data segmentation, …)
n Finding similarities among data objects and grouping similar objects into clusters
n It is also known as unsupervised learning because there are no predefined classes. Note that classification is known as supervised learning.
n Typical applications
n As a stand-alone tool to get insight into data distribution (e.g.
placing customers into different groups)
n As a preprocessing step for other algorithms (e.g. outlier detection – find clusters and thereafter find outliers)
Requirements of Clustering Algorithms
n Scalability
n When the scale of the problem is large, the algorithm should still work efficiently.
n Ability to deal with different types of attributes
n Numerical, binary, categorical, ordinal, and mixture of these
n Ability to discover arbitrary-shape clusters
n Many clustering algorithms determine clusters based on Euclidean distance.
n Algorithms based on this type of distance tend to find circular clusters (or spherical clusters in the 3D scenario) with similar size and density.
n It is important to develop algorithms that can detect clusters of arbitrary shape.
Requirements of Clustering Algorithms
n No input parameters should be required
n Many clustering algorithms require users to provide domain knowledge in the form of input parameters, such as the desired number of clusters.
n Requiring the specification of domain knowledge not only burdens users, but also makes the quality of clustering difficult to control.
n Ability to deal with noisy data
n Clustering algorithms can be sensitive to noisy data
and consequently produce poor-quality clusters.
n We need clustering methods that are robust to noise.
Requirements of Clustering Algorithms
n Incremental clustering and insensitivity to input order
n In many applications, incremental updates (representing newer
data) may arrive at any time.
n Some clustering algorithms cannot incorporate incremental updates into existing clustering structures and, instead, have to recompute everything from scratch.
n Clustering algorithms may also be sensitive to the input data order.
n That is, given a set of data objects, clustering algorithms may return dramatically different clusters depending on the order in which the objects are presented, which is not an ideal behavior.
n Capability of clustering high-dimensionality data
n Finding clusters of data objects in a high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
Requirements of Clustering Algorithms
n Constraint-based clustering
n Suppose that your job is to choose the locations for a given
number of new automatic teller machines (ATMs) in a city.
n To finalize the choice, you may cluster households while considering constraints such as the city’s rivers and highway networks and the types/number of customers per cluster.
Requirements of Clustering Algorithms
n Interpretability and usability
n Users prefer clustering results to be interpretable and usable. That is, clustering may need to be tied in with specific semantic interpretations and applications.
n It is important to study how an application goal may influence the selection of clustering features and clustering methods.
Major Clustering Approaches
n Partitioning approach:
n Given a set of n objects, a partitioning method constructs k partitions, where each partition represents a cluster and k <= n.
n Typical methods: k-means, k-medoids
n Hierarchical approach:
n Construct a hierarchy of clusters
n Typical methods: AGNES, DIANA, BIRCH, CHAMELEON
n Density-based approach:
n Based on the density of objects
n Typical methods: DBSCAN, OPTICS, DenClue
n Grid-based approach:
n Quantize the object space into a finite number of cells that form a grid; all the clustering operations are performed on the grid.
n Typical methods: STING, CLIQUE
Chapter 10. Cluster Analysis: Basic Concepts and Methods
n Cluster Analysis: Basic Concepts
n Partitioning Methods
n Hierarchical Methods
n Density-Based Methods
Partitioning Algorithms: An Overview
n Formally, given a data set D of n objects and the final cluster number k (i.e. the number of clusters to form), a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.
n The partitioning process is optimized using an objective function, such as a dissimilarity function based on distance, so that:
n The objects within a partition are “similar” to one another.
n The objects in a partition are “dissimilar” to objects in other partitions.
k-means: A Centroid-based Partitioning Alg.
n Suppose a data set D contains n objects in a Euclidean space.
n Partitioning methods distribute the objects in D into k clusters C1, ..., Ck. That is, Ci ⊂ D and Ci ∩ Cj = ∅ (for 1 <= i, j <= k and i ≠ j).
n An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.
n The centroid is the mean position of all the objects under investigation.
n Typically, each cluster has one centroid. A centroid-based partitioning technique uses the centroid (i.e. ci) of a cluster (i.e. Ci) to represent the corresponding cluster.
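n As a quick sketch (NumPy assumed; Ci_points is a hypothetical (m, d) array holding the m objects of cluster Ci):

import numpy as np

ci = Ci_points.mean(axis=0)   # centroid ci: the component-wise mean position of the objects in Ci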
k-means: A Centroid-based Partitioning Alg.
n The difference between an object p∈Ci and ci (i.e. the representative of the cluster) is denoted as dist(p, ci), where dist(x,y) is the Euclidean distance between two points x and y.
n The quality of the cluster Ci can be measured using the within-cluster variation, which is the sum of squared errors (each error being the distance between an object in Ci and the centroid ci), defined as:
E(Ci) = Σ_{p ∈ Ci} dist(p, ci)²
n The quality of the clusters resulting from a clustering algorithm can be measured using the sum of the within-cluster variations over all k clusters:
E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²
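n To make the objective concrete, here is a minimal Python sketch (NumPy assumed; the function names are illustrative, not from the text) of the within-cluster variation and its sum over all k clusters:

import numpy as np

def within_cluster_variation(points, centroid):
    # Sum of squared Euclidean distances between each object in one
    # cluster Ci and its centroid ci.
    diffs = points - centroid
    return float(np.sum(diffs ** 2))

def total_variation(D, labels, centroids):
    # Sum of the within-cluster variations over all k clusters.
    return sum(within_cluster_variation(D[labels == i], c)
               for i, c in enumerate(centroids))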
k-means: A Centroid-based Partitioning Alg.
n Minimizing the sum of within-cluster variation is computationally challenging. It has been shown that the problem is NP-hard.
n To solve the NP-hard problem, greedy approaches are often used in practice.
n k-means algorithm is one of the greedy approaches, which is simple and commonly used.
k-means: A Centroid-based Partitioning Alg.
Step 1 of k-means:
n k-means randomly selects k objects in D, each of which is considered a centroid (also known as the cluster mean or cluster center; these terms are interchangeable).
n For each of the remaining objects, the object is assigned to the cluster to which it is most similar, according to the Euclidean distance between the object and the centroid.
k-means: A Centroid-based Partitioning Alg.
Step 2 of k-means:
n The k-means algorithm then iteratively reduces the sum of within-cluster variation.
n For each cluster, it computes the new centroid using the objects assigned to the cluster in the previous iteration.
n All the objects are then reassigned using the updated centroids as the new cluster centers.
n The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round.
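n The two steps above can be put together in a short sketch (Python with NumPy; this is an illustrative implementation, and for simplicity it assumes no cluster ever becomes empty):

import numpy as np

def k_means(D, k, rng=None):
    # D is an (n, d) array of n objects in d-dimensional Euclidean space.
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1: randomly select k objects of D as the initial centroids.
    centroids = D[rng.choice(len(D), size=k, replace=False)]
    while True:
        # Assign every object to the cluster whose centroid is nearest
        # in Euclidean distance.
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of the objects
        # assigned to its cluster in this iteration.
        new_centroids = np.array([D[labels == i].mean(axis=0)
                                  for i in range(k)])
        # Iterate until the assignment is stable, i.e. the centroids
        # no longer change.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids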
k-means: A Centroid-based Partitioning Alg.
n k-means example: Let k = 3, that is, the user would like the objects to be partitioned into three clusters. Initially, we arbitrarily choose three objects as the three initial cluster centroids (marked by ”+”).
k-means: A Centroid-based Partitioning Alg.
n Here is a YouTube video illustrating how k-means works:
https://www.youtube.com/watch?v=5I3Ei69I40s&list=PLU-SOU_FA5awgAxx1xHS84qNX8VQpPq-f
k-means Algorithm: Weaknesses
n No. 1: The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum.
n The results may depend on the initial random selection of cluster centroids.
n To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centroids.
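n A minimal sketch of this multi-restart heuristic (reusing the k_means and total_variation sketches above; the seeding scheme is just an example):

import numpy as np

def k_means_best_of(D, k, n_runs=10):
    best = None
    for seed in range(n_runs):
        # A different seed gives a different random set of initial centroids.
        labels, centroids = k_means(D, k, rng=np.random.default_rng(seed))
        sse = total_variation(D, labels, centroids)
        # Keep the run with the smallest sum of within-cluster variation.
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best   # (best SSE, labels, centroids)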
k-means Algorithm: Weaknesses
n No. 2: The necessity for users to specify k, the number of clusters, in advance can be seen as a weakness.
n Determining the number of clusters is far from easy because the “right” number is often ambiguous.
n The right number of clusters depends on the data distribution and the clustering resolution required by the user.
n There are many possible ways to estimate the number of clusters. Here, we briefly introduce a few simple yet effective methods.
k-means Algorithm: Determining k
n Method # 1: A simple way is to set the number of clusters to about √(n/2) for a data set of n points.
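n As a one-line sketch (NumPy assumed; n is the number of data points):

import numpy as np

k = int(np.sqrt(n / 2))   # rule-of-thumb estimate of the number of clusters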
n Method # 2: Elbow Method
n It is based on the observation that increasing the number of clusters can help to reduce the sum of within-cluster variation of each cluster.
n This is because having more clusters allows one to capture finer groups of data objects that are more similar to each other.
n However, the marginal effect of reducing the sum of within-cluster variation may drop if too many clusters are formed, because splitting a cohesive cluster into two gives only a small reduction.
n Consequently, a heuristic for selecting the right number of clusters is to use the turning point in the curve of the sum of within-cluster variation with respect to the number of clusters.
k-means Algorithm: Determining k
n Technically, given a number k, we can form k clusters and calculate the sum of within-cluster variation var(k).
n We can then plot the curve of var(k) with respect to k.
n Then we can choose the k corresponding to the turning point.
n In this example, k should be set to 4.
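n As a sketch, the elbow curve can be produced by running k-means for a range of k values and plotting var(k) (Matplotlib assumed; reuses the k_means and total_variation sketches above, with D being an (n, d) NumPy array of the data set):

import matplotlib.pyplot as plt

ks = list(range(1, 11))          # candidate numbers of clusters
variations = []
for k in ks:
    labels, centroids = k_means(D, k)
    variations.append(total_variation(D, labels, centroids))

plt.plot(ks, variations, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("var(k): sum of within-cluster variation")
plt.show()                        # choose the k at the bend ("elbow") of the curve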