Hierarchical Clustering and Expectation-Maximization
Shuo I Taxonomy
§ Clustering concepts
Copyright By PowCoder代写 加微信 powcoder
§ Hierarchical clustering
§ Gaussian mixture models using EM
Clustering
• Segment data into clusters, such that there is
• high intra-cluster similarity
• low inter-cluster similarity
• Informally, finding natural groupings
among objects.
What is the natural grouping of these objects?
Picture from CMU
What is the natural grouping of these objects?
Simpson’s family School employees Females Males Picture from CMU
Clustering set-up
§ Ourdataare:D={x1,…,xN}.
§ Each data point is m-dimensional, i.e.
xi =
§ Define a distance function (i.e. similarity measures) between data, d(xi, xj)
§ Goal: segment xn into k groups {z1,…,zN}wherezi ∈{1,…,K}
Similarity Measures
§ Between any two data samples p and q, we can calculate their distance d(p,q) using a number of measurements:
Types of Clustering Algorithms
§ Partitional clustering, e.g. K-means, K-medoids
§ Hierarchical clustering Bottom-up (agglomerative) Top-down
§ Density-based clustering, e.g. DBScan
§ Mixture density based clustering
§ Fuzzy theory based, graph theory based, grid based, etc.
Hierarchical clustering
§ Create a hierarchical decomposition of the set of objects using some criterion
§ Produce a dendrogram
(Bovine, (Spider Monkey, (Gibbon, (Orang, (Gorilla, (Chimp, Human)))));
Hierarchical clustering
Agglomerative clustering illustration
Dendrogram: A visual representation of output
Agglomerative clustering algorithm
1. Place each data point into its own singleton group
2. Repeat: iteratively merge the two closest groups
3. Until: all the data are merged into a single cluster
§ Output: a dendrogram
§ Reply on: a distance metric between clusters
Measuring distance between clusters
§ Single linkage
the similarity of the closest pair
§ Complete linkage
the similarity of the furthest pair
§ Group average
the average similarity of all pairs more widely used
robust against noise
Try your own data here:
http://jydelort.appspot.com/resources/figue/demo.html
Strengths, weaknesses, caveats
§ Strengths
Ø provides deterministic results
Ø no need to specify number of clusters beforehand Ø can create clusters of arbitrary shapes
§ Weakness
Ø does not scale up for large datasets, time complexity at least O(n2)
Ø Different decisions about group similarities can lead to vastly different
dendrograms.
Ø The algorithm imposes a hierarchical structure on the data, even data for which such structure is not appropriate.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com