COMP9321:
Data services engineering
Week 9: Clustering
Term1, 2021
By Mortada Al-Banna, CSE UNSW
Supervised learning
[Diagram: labelled training examples are fed into a model/predictor]
Supervised learning: given labeled examples
Unsupervised learning
Unsupervised learning: given data, i.e. examples, but no labels
Unsupervised Learning
Definition of Unsupervised Learning:
Learning useful structure without labeled classes, optimization criterion, feedback signal, or any other information beyond the raw data
Unsupervised Learning
• Unsupervised learning involves operating on datasets without labelled responses or target values.
• The goal is to capture structure of interest or useful information (e.g., relationships)
• Unsupervised learning can be used for:
❑ Visualizing the structure of a complex dataset
❑ Compressing and summarising the data (e.g., image compression)
❑ Extracting features for supervised learning
❑ Discovering groups or outliers
Clustering
– Unsupervised learning
– Requires data, but no labels
– Detects patterns
Motivations of Clustering
• Exploratory data analysis
– understanding general characteristics of the data
– visualizing the data
• Generalization
– inferring something about an instance (e.g., a gene) based on how it relates to other instances
Paradigms
Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
– K-means clustering
– Model-based clustering
• Spectral clustering
Hierarchical algorithms
• Bottom-up, agglomerative
• Top-down, divisive
Paradigms
Hard clustering: Each example belongs to exactly one cluster
Soft clustering: An example can belong to more than one cluster (probabilistic)
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
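To make the distinction concrete, here is a minimal scikit-learn sketch (the toy blob data is an assumption; a Gaussian mixture model is one common way to obtain soft, probabilistic memberships):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Hard clustering: each example gets exactly one cluster label
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each example gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)   # shape (200, 3), each row sums to 1
```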
Clustering: Image Segmentation
Break up the image into meaningful or perceptually similar regions
Figure from: James Hayes
Clustering: Edge Detection
Figure from: James Hayes
Basic Idea of Clustering
Group together similar data points (instances)
• How to measure the similarity?
✓What could similar mean?
• How many clusters do we need?
K-means
Most well-known and popular clustering algorithm:
Step 1. Start with some initial cluster centers (k random points)
Step 2. Iterate:
• Assign/cluster each example to closest center
• Recalculate and change centers as the mean of the points in the cluster.
Step 3. Stop when no points’ assignments change
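To make these steps concrete, here is a minimal from-scratch sketch in NumPy (illustrative only: the function, variable names, and toy data are assumptions; in practice scikit-learn's KMeans would normally be used):

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # Step 2a: assign each point to its closest center (hard clustering)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Step 3: stop when no point changes its assignment
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 2b: move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(assignments == j):
                centers[j] = points[assignments == j].mean(axis=0)
    return centers, assignments

# Toy usage: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centers, labels = kmeans(X, k=2)
```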
K-means: an example
1. Initialize centers randomly
2. Assign points to nearest center
3. Readjust centers
4. Repeat steps 2–3 until no assignments change: done
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
How do we do this?
K-means
Iterate:
◼ Assign/cluster each example to closest center
iterate over each point:
– get distance to each cluster center
– assign to closest center (hard cluster)
◼ Recalculate centers as the mean of the points in a cluster
What distance measure should we use?
Distance measures
Euclidean:
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
good for spatial data
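As a quick illustration in NumPy (the two points are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
# Euclidean distance: square root of the sum of squared coordinate differences
d = np.sqrt(np.sum((x - y) ** 2))   # 5.0
# equivalently
d = np.linalg.norm(x - y)
```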
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
Where are the cluster centers?
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
How do we calculate these?
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
e.g., for a set of instances that have been assigned to a cluster $C_j$, we re-compute the mean of the cluster as:

$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$
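A minimal NumPy illustration of this update (the array of assigned points is an assumption):

```python
import numpy as np

# points assigned to cluster j (here, three 2-D instances)
cluster_j = np.array([[1.0, 1.0], [2.0, 0.0], [3.0, 2.0]])
# new center = component-wise mean of the assigned points
mu_j = cluster_j.mean(axis=0)   # array([2., 1.])
```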
K-means
Let's run through an example together.
Initialization: 4 points, 2 clusters, and a distance function
Properties of K-means
Guaranteed to converge in a finite number of iterations
Running time per iteration
1. Assign data points to the closest cluster center: O(KN) time
2. Change the cluster centers to the average of their assigned points: O(N) time
K-means variations/parameters
Start with some initial cluster centers
Iterate:
◼ Assign/cluster each example to closest center
◼ Recalculate centers as the mean of the points in a cluster
What are some other variations/parameters we haven’t specified?
K-means variations/parameters
Initial (seed) cluster centers
Convergence
• A fixed number of iterations
• Partitions unchanged
• Cluster centers don't change
The number of clusters, K!
K-means Choosing K
Using the Elbow Method:
• The elbow method is one of the most popular ways to select the number of clusters: fit the K-means model for a range of values of K.
• Draw a line plot of the SSE (sum of squared errors, also called inertia) against the number of clusters and find the "elbow point": the point after which the SSE starts decreasing in a roughly linear fashion (see the sketch below).
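A minimal sketch of this procedure with scikit-learn (the toy blob data and the range of K values are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with a "true" structure of 4 clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)          # inertia_ is the SSE for this K

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("SSE (inertia)")
plt.show()                           # look for the "elbow" in the curve
```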
Choosing K with Elbow Method
K-means: Initialize centers randomly
What would happen here? Seed selection ideas?
Seed choice
Results can vary drastically based on random seed selection
Some seeds can result in poor convergence rate, or convergence to sub-optimal clustering
Common heuristics
• Random centers in the space
• Randomly pick examples
• Points least similar to any existing center (furthest centers heuristic)
• Try out multiple starting points
• Initialize with the results of another clustering method
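Several of these heuristics map directly onto scikit-learn's KMeans parameters; a minimal sketch (toy data assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Pure random seeding, single run: the result depends heavily on the seed
km_random = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)

# Try out multiple starting points and keep the run with the lowest SSE
km_multi = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# k-means++ seeding: spreads the initial centers out (a randomized relative of
# the furthest-centers heuristic on the next slide)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
```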
Furthest centers heuristic
μ1 = pick a random point
for i = 2 to K:
    μi = the point that is furthest from any previously chosen center

Formally, $\mu_i = \arg\max_{x} \min_{1 \le j < i} d(x, \mu_j)$, i.e., the point whose distance to its closest existing center is largest.
– Makes it non-deterministic, which helps with multiple random runs
– Nice theoretical guarantees!
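A minimal NumPy sketch of this heuristic (the function and variable names are assumptions; scikit-learn's init="k-means++" is a closely related, randomized variant):

```python
import numpy as np

def furthest_centers_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # mu_1: pick a random point
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        # distance from every point to its closest existing center
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        # mu_i: the point furthest from any previously chosen center
        centers.append(points[np.argmax(dists)])
    return np.array(centers)

X = np.array([[0.0, 0.0], [0.5, 0.5], [10.0, 10.0], [10.5, 9.5], [5.0, 0.0]])
init_centers = furthest_centers_init(X, k=3)
```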
What Is A Good Clustering?
Internal criterion: A good clustering will produce high quality clusters in which:
• the intra-class (that is, intra-cluster) similarity is high
• the inter-class similarity is low
• The measured quality of a clustering depends on both the data representation and the similarity measure used
Clustering Evaluation
• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids should be far away from one another.
• In most applications, expert judgments are still the key
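As a sketch of how cohesion and separation can be quantified in practice (scikit-learn, toy data assumed; the silhouette score is not covered above but combines both notions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

sse = km.inertia_                        # cohesion: lower means more compact clusters
sil = silhouette_score(X, km.labels_)    # combines cohesion and separation,
                                         # ranges from -1 to 1 (higher is better)
```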
Web Clustering Examples
https://www.coursera.org/learn/python-machine-learning
Limitations of k-means
• Sometimes the number of clusters is difficult to determine
• Does not do well with irregular or complex clusters.
• Has a problem with data containing outliers
http://arogozhnikov.github.io/2017/07/10/opera-clustering.html
Hierarchical Clustering
• What is it?
➢ Unsupervised machine learning
➢ It essentially builds a hierarchy of clusters
• Types of hierarchical clustering
➢ Agglomerative hierarchical clustering (bottom-up)
➢ Divisive hierarchical clustering (top-down)
Linkage Criteria
• It is necessary to determine how the distance between two clusters is computed.
• Your options (a scikit-learn sketch follows this list):
➢ the distance between the two most similar members of the clusters (single linkage)
➢ the two least similar members (complete linkage)
➢ the centres of the clusters (mean or average linkage)
➢ or some other criterion
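A minimal illustration of selecting the linkage criterion in scikit-learn (toy data assumed):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Same data, different linkage criteria; the cluster assignments may differ
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
```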
Linkage Criteria Comparison
Agglomerative Clustering Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: Merge the two closest clusters and update the proximity matrix
4. Until only a single cluster remains
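A minimal sketch of this bottom-up merging with SciPy (the toy points are assumptions; each row of the linkage matrix records one merge):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 2-D points; initially each point is its own cluster
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])

# Agglomerative clustering with complete linkage; Z has one row per merge:
# [cluster_i, cluster_j, distance between them, size of the new cluster]
Z = linkage(X, method="complete")
print(Z)
```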
Agglomerative Clustering Example
Proximity Matrix
How Can we Choose the Number of Clusters?
• Using a Dendrogram
• A dendrogram is a tree-like diagram that records the sequences of merges or splits
• Whenever two clusters are merged, we will join them in this dendrogram and the height of the join will be the distance between these points
• We set a threshold distance and draw a horizontal line (try to set the threshold in such a way that it cuts the tallest vertical line)
• The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold (see the sketch below)
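A minimal sketch with SciPy (the data and the threshold value are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
Z = linkage(X, method="complete")

# Draw the dendrogram, then cut it at a chosen threshold distance
dendrogram(Z)
plt.axhline(y=4.0, linestyle="--")   # the horizontal threshold line
plt.show()

labels = fcluster(Z, t=4.0, criterion="distance")
print(labels)   # number of distinct labels = number of clusters at this cut
```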
Advantages and Disadvantages of Hierarchical Clustering
• Advantages
➢ Easy to implement
➢ No need to decide the number of clusters beforehand
• Disadvantages
➢ Not suitable for large datasets
➢ Sensitive to outliers
➢ Initial seeds have a strong impact on the final results
➢ The linkage criterion and distance measure are often selected arbitrarily
Useful Resources
• https://www.datacamp.com/community/tutorials/k-means-clustering-python
• https://realpython.com/k-means-clustering-python/
• https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages
• https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
Q&A