Chapter 10. Cluster Analysis: Basic Concepts and Methods
• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Hierarchical Clustering: An Overview
• Partitioning methods meet the basic clustering requirement of organizing a set of objects into a number of exclusive groups.
• In some situations, however, we may want to partition our data into groups at different levels of a hierarchy.
• A hierarchical clustering method works by grouping data objects into a hierarchy of clusters.
• For example:
  • Level 1: “Staff in FCS”
  • Level 2: “Admin Assistants” and “Technicians”
• Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
Agglomerative vs. Divisive
• A hierarchical clustering method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) manner.
• An agglomerative hierarchical clustering method uses a bottom-up strategy.
  • It typically starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination conditions are satisfied.
  • The single cluster becomes the hierarchy’s root. For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure) and combines them to form one cluster.
Agglomerative vs. Divisive
• A divisive hierarchical clustering method employs a top-down strategy.
• It starts by placing all objects in one cluster, which is the hierarchy’s root.
• It then divides the root cluster into several smaller subclusters, and recursively partitions those clusters into smaller ones.
• The partitioning process continues until each cluster at the lowest level is coherent enough, either containing only one object or consisting of objects that are sufficiently similar to each other.
Agglomerative vs. Divisive: Example
[Figure 10.6: Agglomerative (e.g., AGNES) and divisive (e.g., DIANA) hierarchical clustering on data set {a, b, c, d, e}. The agglomerative steps (Step 0 to Step 4) and the divisive steps (Step 4 to Step 0) run in opposite directions.]
Agglomerative vs. Divisive: AGNES
• Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
• The clusters are then merged step by step according to some criterion.
• For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any pair of objects from different clusters.
• This is a single-linkage approach: each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the distance of the closest pair of data points belonging to different clusters.
• The cluster-merging process repeats until all the objects are eventually merged to form one cluster, as in the sketch below.
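As a hedged illustration, single-linkage agglomerative clustering in the AGNES style can be run with SciPy; the sample points and the cut level below are assumptions for demonstration, not from the slides:

```python
# Minimal sketch of single-linkage (AGNES-style) agglomerative clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points: two tight groups and one outlier.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# method='single' merges the two clusters whose closest pair of points
# (one from each cluster) has the minimum Euclidean distance.
Z = linkage(X, method='single', metric='euclidean')

# Cut the hierarchy to obtain, say, 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```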
Agglomerative vs. Divisive: DIANA
• DIANA, the divisive method, proceeds in the opposite direction:
  • Step 1: All the objects are used to form one initial cluster.
  • Step 2: The cluster is split according to some mechanism.
  • Step 3: The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.
Details of DIANA can be found in the following book:
Kaufman, L. and Rousseeuw, P. J., “Finding Groups in Data: An Introduction to Cluster Analysis”, John Wiley & Sons, 1990.
Agglomerative vs. Divisive: DIANA
How to split a cluster into two clusters in Step 2 of DIANA?
• One possible mechanism somewhat resembles the way a political party splits due to inner conflict:
  • a) The most unhappy member leaves the party and starts a new one.
  • b) Thereafter, others from the old party follow until some kind of equilibrium is attained.
Agglomerative vs. Divisive: DIANA
• Translated back to the algorithm, the splitting procedure needs to:
  • a) Look for the object that is most dissimilar to all other objects in the original cluster C1, then create a new cluster C2 and move that object from C1 to C2.
  • b) Look for the object that is most dissimilar to the other objects in the updated C1 and move it to C2.
  • c) Repeat step b) until C1 and C2 are balanced (i.e., they consist of a similar number of objects).
Agglomerative vs. Divisive: DIANA
• Theoretically, we need to define the dissimilarity between an object and the other objects in the same cluster.
• Average distance: We can use the average of the distances between an object and all other objects in the same cluster as the dissimilarity metric.
• Let us see how DIANA splits the data set {a, b, c, d, e} from Figure 10.6.
Agglomerative vs. Divisive: DIANA
• 1) The initial cluster C1 corresponds to {a, b, c, d, e}.
• 2) For each member of C1 = {a, b, c, d, e}, we calculate the average distance to the other members.
  • We use $dist_{avg,a}$, $dist_{avg,b}$, $dist_{avg,c}$, $dist_{avg,d}$, and $dist_{avg,e}$ to denote these average distances.
  • After sorting them, suppose that we have: $dist_{avg,a} > dist_{avg,b} > dist_{avg,c} > dist_{avg,d} > dist_{avg,e}$.
  • Thus object a is the most dissimilar, so we create a new cluster C2 and move a to C2; now C1 = {b, c, d, e} and C2 = {a}.
Agglomerative vs. Divisive: DIANA
• 3) For each member of the updated C1 = {b, c, d, e}, we calculate the average distance again.
  • After sorting them, suppose that we have: $dist_{avg,b} > dist_{avg,c} > dist_{avg,d} > dist_{avg,e}$.
  • Thus object b is the most dissimilar, so we move b to C2; now C1 = {c, d, e} and C2 = {a, b}.
• 4) Since the size of C1 is similar to that of C2 at this moment (namely, C1 and C2 are balanced), the split task is completed.
• Note that:
  • In the next round of splitting, C1 will be selected because |C1| = 3 is greater than |C2| = 2.
  • The split process is repeated until each cluster includes only one object.
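The balance-based split just illustrated can be sketched in Python; this is a minimal illustration under the slides’ stopping rule, with hypothetical points, not the full DIANA algorithm of Kaufman and Rousseeuw:

```python
# Minimal sketch of one DIANA-style split using average dissimilarity.
# The stopping rule (balanced cluster sizes) follows the slides; the
# original DIANA algorithm uses a different criterion.
import numpy as np

def split_cluster(X):
    """Split the objects in X (n x d array) into two index lists C1, C2."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    c1, c2 = list(range(n)), []
    # Repeatedly move the object with the largest average distance to the
    # rest of C1 over to C2, until the two clusters are roughly balanced.
    while len(c1) - len(c2) > 1:
        avg = [dist[i, [j for j in c1 if j != i]].mean() for i in c1]
        worst = c1[int(np.argmax(avg))]
        c1.remove(worst)
        c2.append(worst)
    return c1, c2

# Hypothetical points: one loose group and one tight group.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.4, 0.1], [5.0, 5.0], [5.2, 5.1]])
print(split_cluster(X))
```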
Agglomerative vs. Divisive: Dendrogram
• A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering, such as the clustering of {a, b, c, d, e} in Figure 10.6.
• It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step by step.
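As a hedged sketch, such a dendrogram can be drawn with SciPy from a linkage matrix; the coordinates standing in for objects a through e are made up for illustration:

```python
# Sketch: plot a dendrogram for five objects with SciPy and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates standing in for objects a, b, c, d, e.
X = np.array([[0.0, 0.0], [0.3, 0.1], [2.0, 2.0], [2.2, 2.1], [4.0, 0.0]])
Z = linkage(X, method='single')

dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.ylabel('merge distance')
plt.show()
```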
Distance Between Two Clusters
• Four widely used measures of the distance between clusters $C_i$ and $C_j$ are:
  • Minimum distance: $dist_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
  • Maximum distance: $dist_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
  • Mean distance: $dist_{mean}(C_i, C_j) = |m_i - m_j|$
  • Average distance: $dist_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
Note that:
1) $m_i$ is the mean of $C_i$;
2) $|p - p'|$ is the distance between two objects $p$ and $p'$;
3) $n_i$ is the number of objects in cluster $C_i$.
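A small sketch of these four measures in Python, using Euclidean distance (the two example clusters are made up):

```python
# Sketch: the four inter-cluster distance measures, with Euclidean distance.
import numpy as np

def pairwise(Ci, Cj):
    # All |p - p'| for p in Ci, p' in Cj, as an |Ci| x |Cj| matrix.
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)

def dist_min(Ci, Cj):  return pairwise(Ci, Cj).min()
def dist_max(Ci, Cj):  return pairwise(Ci, Cj).max()
def dist_mean(Ci, Cj): return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
def dist_avg(Ci, Cj):  return pairwise(Ci, Cj).mean()   # (1/(n_i n_j)) * sum

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 1.0]])
print(dist_min(Ci, Cj), dist_max(Ci, Cj), dist_mean(Ci, Cj), dist_avg(Ci, Cj))
```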
Distance Measures in Algorithmic Methods
• When an algorithm uses the minimum distance, $dist_{min}(C_i, C_j)$, to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm.
• Moreover, if the clustering process is terminated when the distance between the nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm.
• When an algorithm uses the maximum distance, $dist_{max}(C_i, C_j)$, to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
• If the clustering process is terminated when the maximum distance between the nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
Chapter 10. Cluster Analysis: Basic Concepts and Methods
• Cluster Analysis: Basic Concepts
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Density-Based Clustering Methods
• Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
• They have difficulty finding clusters of arbitrary shape, such as the “S”-shaped and oval clusters in Figure 10.13.
Density-Based Clustering Methods
• To find clusters of arbitrary shape, we can instead model clusters as dense regions in the data space, separated by sparse regions.
• This is the main strategy behind density-based clustering methods, which can discover clusters of non-spherical shape.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the best-known density-based approaches.
• How can we find dense regions using density-based clustering?
  • The density of the neighborhood of an object o can be measured by the number of objects close to o.
  • Objects that have dense neighborhoods are called core objects.
• DBSCAN completes the clustering task in two phases:
  • Phase 1: Find the core objects.
  • Phase 2: Connect core objects and their neighborhoods to form dense regions as clusters.
• How does DBSCAN formally quantify the density of the neighborhood of an object?
  • A user-specified parameter ε > 0 specifies the radius of the neighborhood considered for every object.
  • The ε-neighborhood of an object o is the space within radius ε centered at o.
• Due to the fixed neighborhood size (determined by ε), the density of a neighborhood can be measured simply by the number of objects in the neighborhood.
• To determine whether a neighborhood is dense or not, DBSCAN uses a second user-specified parameter, MinPts, which specifies the density threshold of dense regions.
• An object is a core object if its ε-neighborhood contains at least MinPts objects.
• Given a set of objects, D, we can identify all core objects with respect to the given parameters ε and MinPts.
• The clustering task is thereafter converted to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.
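As a minimal sketch (not the original paper’s pseudocode), the core objects of a small data set can be found by counting neighbors within ε; the data and parameter values below are assumptions:

```python
# Sketch: identify DBSCAN core objects for given eps and MinPts.
import numpy as np

def core_objects(X, eps, min_pts):
    # Pairwise Euclidean distances; an object counts as its own neighbor,
    # matching the convention that o lies in its own eps-neighborhood.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbor_counts = (dist <= eps).sum(axis=1)
    return np.where(neighbor_counts >= min_pts)[0]

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
print(core_objects(X, eps=0.5, min_pts=3))   # indices of the core objects
```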
• For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ε and MinPts) if p is within the ε-neighborhood of q.
• Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighborhood of q.
• Using the directly density-reachable relation, a core object can “bring” all objects in its ε-neighborhood into a dense region.
• How can we assemble a large dense region using small dense regions centered at core objects?
• In DBSCAN, p is density-reachable from q (with respect to ε and MinPts in D) if there is a chain of objects $p_1, \ldots, p_n \in D$ such that $p_1 = q$, $p_n = p$, and $p_{i+1}$ is directly density-reachable from $p_i$ with respect to ε and MinPts, for $1 \le i < n$.
• Note that density-reachability is not an equivalence relation because it is not symmetric.
  • If 1) both o1 and o2 are core objects, and 2) o1 is density-reachable from o2, then o2 is also density-reachable from o1.
  • However, if o2 is a core object but o1 is not, then o1 may be density-reachable from o2, but not vice versa.
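Density-reachability can be checked with a simple breadth-first expansion from q through core objects; a minimal sketch on object indices (all names and data here are illustrative assumptions):

```python
# Sketch: is object index p density-reachable from index q?
import numpy as np
from collections import deque

def density_reachable(X, q, p, eps, min_pts):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(len(X))]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    if p == q:
        return True          # trivial chain of length one
    seen, queue = {q}, deque([q])
    while queue:
        i = queue.popleft()
        if not is_core[i]:   # only a core object can reach other objects
            continue
        for j in neighbors[i]:   # j is directly density-reachable from i
            if j == p:
                return True
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return False
```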
• To connect core objects as well as their neighbors into a dense region, DBSCAN uses the notion of density-connectedness.
• Two objects $p_1, p_2 \in D$ are density-connected with respect to ε and MinPts if there is an object $q \in D$ such that both $p_1$ and $p_2$ are density-reachable from q with respect to ε and MinPts.
• Unlike density-reachability, density-connectedness is symmetric.
• Furthermore, for objects o1, o2, and o3, if o1 and o2 are density-connected via q, and o2 and o3 are density-connected via q, then o1 and o3 are also density-connected via q.
• Consider the figure below for a given ε, represented by the radius of the circles, and let MinPts = 3.
[Figure: objects m, p, o, q, r, s with ε-radius circles; not reproduced here.]
• m, p, o, and r are core objects because each has an ε-neighborhood containing at least three points.
• q and s are not core objects.
• m is directly density-reachable from p, and vice versa.
• Object q is (indirectly) density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p.
• However, p is not density-reachable from q, because q is not a core object.
• Objects r and s are density-connected via o.
DBSCAN – Cluster Definition
• Let D be the data set under investigation.
• A cluster C with respect to ε and MinPts is a non-empty subset of D satisfying the following conditions:
  • 1) Maximality: ∀p, ∀q, if p ∈ C and q is density-reachable from p (with respect to ε and MinPts), then q ∈ C.
  • 2) Connectivity: ∀p ∈ C, ∀q ∈ C, p is density-connected to q (with respect to ε and MinPts).
• How does DBSCAN find clusters?
• Step 1: Initially, all objects in the given data set D are marked as “unvisited.”
• Step 2: DBSCAN randomly selects an unvisited object p, marks p as “visited,” and checks whether the ε-neighborhood of p contains at least MinPts objects.
  • 1) If not, p is marked as a noise point.
  • 2) Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a candidate set, N. DBSCAN iteratively adds the objects in N that do not yet belong to any cluster to C.
  • In this adding process, for an object p′ in N that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its ε-neighborhood.
  • If the ε-neighborhood of p′ has at least MinPts objects, those objects are added to N.
  • DBSCAN continues adding objects to C until C can no longer be expanded, that is, until N is empty.
  • At this point, cluster C is complete and is output.
• Step 3: To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones.
• The clustering process continues until all objects are visited. A sketch of this procedure appears below.
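The following is a minimal, hedged sketch of these steps in Python, with plain O(n²) neighborhood queries and hypothetical data (production implementations use spatial indexes):

```python
# Sketch of the DBSCAN procedure described above.
# Labels: -1 = noise, 0, 1, 2, ... = cluster ids.
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)            # everything starts as noise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):                 # Step 2: pick an unvisited object
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                   # p stays marked as noise (for now)
        # Otherwise start a new cluster C and expand it via candidate set N.
        labels[p] = cluster_id
        N = deque(neighbors[p])
        while N:                       # grow C until N is empty
            q = N.popleft()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:   # q is a core object
                    N.extend(neighbors[q])
            if labels[q] == -1:        # not yet in any cluster
                labels[q] = cluster_id
        cluster_id += 1                # Step 3: move on to the next cluster
    return labels

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 5.2], [9, 9]], float)
print(dbscan(X, eps=0.5, min_pts=3))
```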
• Visualizing the clustering process of DBSCAN:
  https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
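In practice, a library implementation is typically used; for instance, scikit-learn’s DBSCAN, where min_samples plays the role of MinPts (the parameter values below are illustrative only):

```python
# Usage sketch: DBSCAN from scikit-learn on the same hypothetical data.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3], [5, 5], [5.1, 5.2], [9, 9]], float)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # -1 marks noise points
```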
• Note that if two clusters, C1 and C2, are very close to each other, it might happen that some object p belongs to both C1 and C2.
• In this case, object p must be a non-core point in both clusters, because otherwise C1 would be equal to C2.
• Such an object p is assigned to whichever cluster is discovered first.
• The details of this scenario can be found on page 299 of the following paper:
  Ester, M., Kriegel, H.-P., Sander, J., and Xu, X., “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proceedings of KDD 1996.