
Clustering
Chapter 10

Clustering


• Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster
• have high similarity
• are very dissimilar to objects in other clusters
• Clustering is also called data segmentation, because it partitions large data sets into groups according to their similarity
• Clustering can be used for outlier detection
• Clustering is known as unsupervised learning: class label information is not present
• a form of learning by observation, rather than learning by examples

Requirements of clustering
• Scalability
• Ability to deal with different types of attributes (not only numerical)
• Discovery of clusters with arbitrary shape (not only spherical)
• Minimal requirements for domain knowledge to determine input parameters (the quality of the result depends on these parameters)
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• Capability of clustering high-dimensionality data
• Constraint-based clustering
• Interpretability and usability of clustering results

Comparison criteria
• Partitioning criteria
• Single level partitioning
• Hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region)
• Non-exclusive (e.g., one document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidean, road network, vector)
• Connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional)
• Subspaces (often in high-dimensional clustering)

Overview of major clustering methods (1)
• Partitioning methods
• Find mutually exclusive clusters of spherical shape
• Distance-based
• May use mean or medoid (etc.) to represent cluster center
• Effective for small to medium-size datasets
• Typical methods: k-means, k-medoids
• Hierarchical methods
• Clustering is a hierarchical decomposition (i.e., multiple levels)
• Cannot correct erroneous merges or splits
• Typical methods: Diana, Agnes, BIRCH, CAMELEON

Overview of major clustering methods (2)
• Density-based methods
• Can find arbitrarily shaped clusters
• Clusters are dense regions of objects in space that are separated by low-density regions
• Cluster density: each point must have a minimum number of points within its “neighborhood”
• May filter out outliers
• Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based methods
• Use a multiresolution grid data structure
• Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
• Typical methods: STING, WaveCluster, CLIQUE

Partitioning methods
• a data set D of n objects
• the number k (k ≤ n) of clusters ← known in advance
• A partitioning algorithm organizes the objects into k partitions, where each partition represents a cluster
• The clusters are formed to optimize an objective partitioning criterion
• Objects in a cluster are similar to one another
• Objects in a cluster are dissimilar to objects in other clusters

Centroid-based technique (1)
• Distribute the objects in D into k clusters C1,...,Ck with
• Ci ⊆ D
• Ci ∩ Cj = ∅ for i ≠ j
• Each cluster Ci is represented by its centroid ci (e.g., computed as the mean or medoid of the objects)
• The quality of cluster Ci is measured as the within-cluster variation, that is, the sum of squared errors between the objects and the centroid:
E = Σ_{i=1..k} Σ_{p∈Ci} dist(p, ci)²

Centroid-based technique (2)
• dist(p, ci) is the Euclidean distance between object p and centroid ci
• Objective: make the resulting k clusters as compact and as separate as possible
• Finding the global optimal solution is an NP-hard problem and computationally challenging
• Greedy approaches are used in practice

k-means
1. arbitrarily choose k objects from D as the initial cluster centers
2. repeat
3. (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
4. update the cluster means, that is, calculate the mean value of the objects for each cluster
5. until no change

k-means: example

k-means: pros and cons
• Usually identifies a local optimum
• Advantages
• The algorithm is efficient: O(tkn), with n objects, k clusters, and t iterations; normally k, t << n
• Disadvantages
• Can be applied only when the mean of a cluster is defined
• Need to specify k, the number of clusters, in advance
• Sensitive to noisy data and outliers (they can influence the mean)
• Not suitable to discover clusters with non-convex shapes

k-medoid (1)
• k-means is sensitive to outliers
• Need to cluster 1, 2, 3, 8, 9, 10, 25 with k=2
• Ideal solution: {1,2,3}, {8,9,10}, and 25 is discarded as an outlier
• Solution {1,2,3}, {8,9,10,25} has E=196
• Solution {1,2,3,8}, {9,10,25} has E=189.67 ← selected by k-means!
• k-medoids adapts k-means so that it is not sensitive to outliers

k-medoid (2)
• Each cluster Ci is represented by one of its objects (representative object – medoid oi)
• The quality of cluster Ci is measured using the absolute-error criterion:
E = Σ_{i=1..k} Σ_{p∈Ci} dist(p, oi)
• The k-medoid problem is NP-hard

Partitioning Around Medoids (PAM)
1. arbitrarily choose k objects in D as the initial representative objects
2. repeat
3. assign each remaining object to the cluster with the nearest representative object
4. randomly select a non-representative object, orandom
5. compute the total cost S of swapping representative object oj with orandom
6. if S < 0, then swap oj with orandom to form the new set of k representative objects
7. until no change
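Before detailing how PAM evaluates a swap, here is a minimal Python/NumPy sketch of the k-means loop listed above (steps 1-5). It is an illustration, not a library implementation; the random initialization, the empty-cluster guard, and the 1-D example data (reused from the k-medoid slide) are assumptions for illustration.

import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means sketch: D is an (n, d) array of objects."""
    rng = np.random.default_rng(seed)
    # 1. arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    labels = np.full(len(D), -1)
    for _ in range(max_iter):
        # 3. (re)assign each object to the cluster whose mean it is closest to
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 5. stop when no assignment changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 4. update the cluster means
        for i in range(k):
            if np.any(labels == i):      # guard against an empty cluster
                centers[i] = D[labels == i].mean(axis=0)
    return labels, centers

# Example usage on the 1-D data from the k-medoid slide (as column vectors)
data = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0], [25.0]])
labels, centers = kmeans(data, k=2)
E = sum(np.sum((data[labels == i] - centers[i]) ** 2) for i in range(2))

The stopping test mirrors step 5: iteration ends as soon as no object changes cluster.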
PAM: swap evaluation
• Current representative objects {o1,...,ok}
• Replace oj with orandom and consider {o1,..., oj-1, orandom, oj+1,...,ok}
• For each p in D, find the closest representative
• Compute the error E with the new set of representative objects
• Lower → swap to orandom
• Higher → keep the original set of representatives
• The complexity of each iteration is O(k(n−k)²)

k-means vs k-medoid
• k-medoid is more robust than k-means in the presence of noise and/or outliers
• k-medoid is more costly than k-means
• k-medoid and k-means both require selecting k in advance
• PAM does not scale well with the dataset size
• Alternative, more scalable solutions select the medoids among a random sample of the data items (e.g., CLARA, CLARANS)

Hierarchical clustering
• Group data objects into a tree of clusters
• Hierarchical methods can be
• Agglomerative: bottom-up approach
• Divisive: top-down approach
• Important: if a merge or split turns out to be a poor choice, the methods cannot correct it

Agglomerative and divisive clustering (1)
• Agglomerative: use a bottom-up strategy
• Each object initially forms a cluster
• Clusters are iteratively merged into larger and larger clusters
• Merge clusters according to a similarity measure
• Merging terminates when all objects are in a single cluster
• Divisive: use a top-down approach
• All the objects initially belong to the same (root) cluster
• Clusters are recursively partitioned into smaller clusters
• Partitioning continues until each cluster is coherent enough

Agglomerative and divisive clustering (2)
• agglomerative (AGNES) vs. divisive (DIANA)

Distance measure: minimum distance
• Agglomerative and divisive methods need to measure the distance between clusters
• Minimum distance
d_min(Ci, Cj) = min_{p∈Ci, p'∈Cj} |p − p'|
• Algorithms that use minimum distance are called nearest-neighbor clustering algorithms
• If the clustering process terminates when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm
• An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm

Distance measure: maximum distance
• Maximum distance
d_max(Ci, Cj) = max_{p∈Ci, p'∈Cj} |p − p'|
• Algorithms that use maximum distance are called farthest-neighbor clustering algorithms
• If the clustering process terminates when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm

Distance measure: mean and average distance
• Minimum and maximum distance tend to be overly sensitive to outliers or noisy data
• Mean distance
d_mean(Ci, Cj) = |mi − mj|
• with mi the mean of cluster Ci
• Average distance
d_avg(Ci, Cj) = (1 / (ni · nj)) Σ_{p∈Ci} Σ_{p'∈Cj} |p − p'|
• with ni the number of objects in cluster Ci
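To make the four inter-cluster distance measures concrete, here is a small NumPy sketch that computes all of them for two clusters given as point arrays; the two example clusters are made-up data, not taken from the slides.

import numpy as np

def cluster_distances(Ci, Cj):
    """Compute d_min, d_max, d_mean, and d_avg between two clusters
    given as (n_i, d) and (n_j, d) arrays of points."""
    # Pairwise Euclidean distances |p - p'| for p in Ci, p' in Cj
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    d_min = pairwise.min()        # single linkage
    d_max = pairwise.max()        # complete linkage
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
    d_avg = pairwise.mean()       # (1 / (n_i * n_j)) * sum over all pairs
    return d_min, d_max, d_mean, d_avg

# Example usage with two small, made-up 2-D clusters
Ci = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])
Cj = np.array([[6.0, 5.0], [7.0, 6.5]])
print(cluster_distances(Ci, Cj))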
Multiphase clustering
• It is difficult to select merge/split points
• No backtracking (decisions are definitive)
• Hierarchical clustering does not scale well
• Solution: combine hierarchical clustering with other clustering techniques
• Multi-phase clustering

BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• Phase 1: BIRCH scans the database to build an initial in-memory CF-tree (a multilevel compression of the data that tries to preserve the data's inherent clustering structure)
• Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF-tree
• Scales linearly
• finds a good clustering with a single scan
• improves the quality with a few additional scans
• Weakness
• handles only numeric data
• is sensitive to the order of the data records

Clustering feature
• Clustering Feature (CF): CF = (N, LS, SS)
• N: number of data points
• LS: linear sum of the N points
• SS: square sum of the N points
• 3-D vector summarizing information about clusters of objects
• summary of the statistics for the given cluster
• avoids storing detailed information about individual objects or points
• CFs are additive
• Example: the 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190))

CF-tree
• A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
• Non-leaf nodes store the sums of the CFs of their children (they summarize the clustering information about their children)
• A CF-tree has two parameters
• Branching factor: max # of children
• Threshold: max diameter of the sub-clusters stored at the leaf nodes

CF-tree: example

BIRCH algorithm
• Diameter of a sub-cluster: D = √( Σ_{i=1..n} Σ_{j=1..n} (xi − xj)² / (n(n−1)) )
• For each point in the input
• Find the closest leaf entry
• Add the point to the leaf entry and update the CF
• If entry_diameter > max_diameter, then split the leaf, and possibly the parents
• Algorithm has complexity O(n)
• Concerns
• Sensitive to insertion order of data points
• Since we fix the size of leaf nodes, the resulting clusters may not be natural
• Clusters tend to be spherical given the radius and diameter measures on which the algorithm is based
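A clustering feature is cheap to compute and to merge. The sketch below (an illustration, not BIRCH itself) builds a CF from points, adds two CFs, and derives the centroid and the diameter of a sub-cluster from (N, LS, SS) alone; the closed-form diameter expression is standard BIRCH algebra, assumed here rather than taken from the slides. The point data reuses the CF example above.

import numpy as np

def make_cf(points):
    """Clustering Feature CF = (N, LS, SS) for an (n, d) array of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def add_cf(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds their CFs."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def diameter(cf):
    """Diameter derived from (N, LS, SS):
    D = sqrt( (2*N*sum(SS) - 2*||LS||^2) / (N*(N-1)) )."""
    n, ls, ss = cf
    return np.sqrt((2 * n * ss.sum() - 2 * (ls ** 2).sum()) / (n * (n - 1)))

# The five points from the CF example above
cf = make_cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)            # N=5, LS=(16, 30), SS=(54, 190), as on the slide
print(centroid(cf))  # [3.2 6. ]
print(diameter(cf))  # ~2.53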

Density-based clustering
• Clusters are modeled as dense regions in the data space, separated by sparse regions
• Features
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition

Basic concepts (1)
• ε > 0: user-defined radius parameter
• ε-neighborhood of object p: the space within radius ε centered at p
• Density of a neighborhood: number of objects in the neighborhood
• MinPts: density threshold of a dense region
• Core object: an object whose ε-neighborhood contains at least MinPts objects
• An object q is directly density-reachable from a core object p if q is within the ε-neighborhood of p
• A core object can bring all the objects in its ε-neighborhood into a dense region
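A minimal sketch of these definitions, assuming Euclidean distance and a small made-up point set; eps and min_pts stand for the ε and MinPts parameters above.

import numpy as np

def eps_neighborhood(D, p_idx, eps):
    """Indices of all objects within the eps-neighborhood of object p (p included)."""
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    return np.where(dists <= eps)[0]

def is_core(D, p_idx, eps, min_pts):
    """Core object: its eps-neighborhood contains at least MinPts objects."""
    return len(eps_neighborhood(D, p_idx, eps)) >= min_pts

def directly_density_reachable(D, q_idx, p_idx, eps, min_pts):
    """q is directly density-reachable from p if p is a core object
    and q lies within the eps-neighborhood of p."""
    return is_core(D, p_idx, eps, min_pts) and q_idx in eps_neighborhood(D, p_idx, eps)

# Tiny made-up example
D = np.array([[0.0, 0.0], [0.5, 0.2], [0.3, 0.4], [5.0, 5.0]])
print(is_core(D, 0, eps=1.0, min_pts=3))                       # True
print(directly_density_reachable(D, 3, 0, eps=1.0, min_pts=3)) # False: outside the neighborhood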

Basic concepts (2)
• p is density-reachable from q if there is a chain of objects p1,…,pn, such that p1=q, pn=p, and pi+1 is directly density-reachable from pi for 1 ≤ i < n
• Not an equivalence relationship!
• Not symmetric
• q is directly density-reachable from m
• q is (indirectly) density-reachable from p
• p is not density-reachable from q, because q is not a core object

Basic concepts (3)
• p1, p2 are density-connected if there is an object q such that both p1 and p2 are density-reachable from q
• It is an equivalence relationship!
Example:
• p and q are density-connected, thanks to the presence of an intermediate object from which both are density-reachable

DBSCAN (1)
• A cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise

DBSCAN (2)
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. ε and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
• Continue the process until all the points have been processed
• Complexity: O(n log n)
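Putting the definitions together, the following is a compact DBSCAN sketch. It uses a brute-force neighborhood query, so it runs in O(n²) rather than the O(n log n) achievable with a spatial index; the data set and the parameter values in the usage example are assumptions for illustration.

import numpy as np

NOISE, UNVISITED = -1, 0

def region_query(D, p_idx, eps):
    """All points within the eps-neighborhood of point p (including p)."""
    return np.where(np.linalg.norm(D - D[p_idx], axis=1) <= eps)[0].tolist()

def dbscan(D, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (-1 = noise)."""
    labels = [UNVISITED] * len(D)
    cluster_id = 0
    for p in range(len(D)):
        if labels[p] != UNVISITED:
            continue
        neighbors = region_query(D, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE           # border points may be relabeled later
            continue
        cluster_id += 1                 # p is a core point: start a new cluster
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:                    # expand the cluster via density-reachability
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id  # border point joins the cluster
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(D, q, eps)
            if len(q_neighbors) >= min_pts:   # q is also a core point
                seeds.extend(q_neighbors)
    return labels

# Example usage on made-up 2-D data: two dense blobs plus one outlier
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[10.0, 10.0]]])
print(dbscan(D, eps=1.0, min_pts=5))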

Evaluation of clustering
• Assessing clustering tendency
• Clustering analysis on a data set is meaningful only when there is a non-random structure in the data
• Determining the number of clusters in a data set
• the number of clusters is an interesting and important summary statistic of a data set
• it should be estimated in advance
• Measuring clustering quality
• how well do the clusters fit the data set?
• how good are the resulting clusters?
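As one simple example of an internal quality measure, the sketch below computes the within-cluster sum of squared errors E defined earlier for partitioning methods; note that E is only one of many possible quality measures, and comparing E across different values of k always favors larger k.

import numpy as np

def within_cluster_sse(D, labels):
    """E = sum over clusters Ci of sum over p in Ci of dist(p, ci)^2,
    with ci the mean (centroid) of cluster Ci."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    E = 0.0
    for c in np.unique(labels):
        members = D[labels == c]
        centroid = members.mean(axis=0)
        E += ((members - centroid) ** 2).sum()
    return E

# Example: the two candidate partitions of {1,2,3,8,9,10,25} from the k-medoid slide
data = np.array([[1.], [2.], [3.], [8.], [9.], [10.], [25.]])
print(within_cluster_sse(data, [0, 0, 0, 1, 1, 1, 1]))  # {1,2,3} vs {8,9,10,25} -> 196.0
print(within_cluster_sse(data, [0, 0, 0, 0, 1, 1, 1]))  # {1,2,3,8} vs {9,10,25} -> ~189.67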
