COMP3308/3608, Lecture 12
ARTIFICIAL INTELLIGENCE
Unsupervised Learning (Clustering)
• Introduction to clustering
• Clustering algorithms
• K-Means
• K-Medoids (COMP3608 only)
• Nearest neighbour
• Hierarchical
What is Clustering?
• Clustering – the process of partitioning the data into a set of groups (called clusters) so that the items from the same cluster are:
• similar to each other
• dissimilar to the items in other clusters
• Similarity is defined in terms of distance measure
Ex. Star clustering based on temperature and brightness (Hertzsprung-Russell diagram)
• 3 well-defined clusters
• now we understand that the groups represent stars in 3 different phases of their life
from “Data Mining Techniques”, M. Berry and G. Linoff, John Wiley and Sons Publ.
Clustering Example – Fitting Troops
• Re-designing the uniforms for female soldiers in the US army
• Goal: reduce the number of uniform sizes to be kept in inventory while still providing good fit
• Researchers from Cornell University used clustering and designed a new set of sizes
• Traditional clothing system: ordered set of sizes where all dimensions increase together
• The new system: sizes that fit body types
• E.g. one size for women with short legs, small waist, long torsos, average arms, broad shoulders and skinny necks
[Figure: scatter plot where each point is an example (= a vector of body measurements), grouped into 3 clusters corresponding to 3 body types]
Supervised vs Unsupervised Learning
• Clustering is unsupervised learning: no labels • Given:
• A set of unlabeled examples (input vectors) xi
• k – desired number of clusters (may not be given)
• Task: Cluster (group) the examples into k clusters (groups)
• Supervised vs unsupervised learning
• Supervised: We know the class labels and the number of classes. We want to build a classifier that can be used to predict the class of new (unlabelled) examples.
• Unsupervised: We do not know the class labels and may not know the number of classes. We want to group similar examples together.
Clustering Task – Formal Definition
Given:
• a dataset P = {p1,…, pn} of input vectors (examples, instances, points, data objects, items, records)
• an integer k – the number of clusters we are looking for
Find:
• a mapping f: P -> {1,…,k} where each pi is assigned to one cluster Kj, 1 ≤ j ≤ k
• a set of clusters K = {K1, K2,…, Kk}
• Note: According to this definition:
• each example is assigned to exactly 1 cluster. Some clustering algorithms (e.g. fuzzy and probabilistic) assign each example to more than 1 cluster – e.g. to each cluster with a certain probability.
• not all clustering algorithms require k to be specified (e.g. agglomerative, SOM)
Typical Clustering Applications
• As a stand-alone tool to group data
• As a building block for other algorithms
• e.g. pre-processing tool for dimensionality reduction – using the cluster center to represent all data points in the cluster (also called vector quantization)
More Examples of Clustering Applications
• Marketing
• find distinct groups of customers, and then use this knowledge to develop targeted marketing programs (e.g. potential customers of a new product)
• Biology
• derive plant and animal taxonomies
• find genes with similar expression patterns; often they have similar function
• Land use
• find areas of similar land use in an earth observation database
• Insurance
• find groups of insurance policy holders with a high average claim cost
• City-planning
• identify groups of houses according to their house type, value, and geographical location
Centroid and Medoid of a Cluster
• Consider a cluster K of N points {p1,…, pN}
• Centroid C (means) – the “middle” of the cluster: $C = \frac{1}{N}\sum_{i=1}^{N} p_i$
• no need to be an actual data point in the cluster
• Medoid M – the centrally located data point in the cluster
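Both notions are easy to compute directly; below is a minimal numpy sketch (the toy points are invented for illustration):

```python
import numpy as np

def centroid(points):
    # the mean of the points; need not be an actual data point
    return points.mean(axis=0)

def medoid(points):
    # the actual point with the smallest total distance to all other points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

pts = np.array([[1.0, 1.0], [2.0, 1.0], [6.0, 4.0]])
print(centroid(pts))  # [3. 2.] -- not one of the points
print(medoid(pts))    # [2. 1.] -- an actual data point
```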
Distance Between Clusters
• Different ways to define it:
[Figure: two clusters with arrows illustrating the single link, complete link and centroid distances]
• Centroid – the distance between the centroids
• Medoid – the distance between the medoids
• Single link (MIN) – The smallest pairwise distance between elements from each cluster
• Complete link (MAX) – the largest pairwise distance between elements from each cluster
• Average link – the average pairwise distance between elements from each cluster
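Four of these definitions reduce to a few numpy lines (medoid distance is analogous, using medoids instead of centroids); the toy clusters are invented for illustration:

```python
import numpy as np

def pairwise(a, b):
    # all pairwise Euclidean distances between points of clusters a and b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_link(a, b):   return pairwise(a, b).min()    # MIN
def complete_link(a, b): return pairwise(a, b).max()    # MAX
def average_link(a, b):  return pairwise(a, b).mean()
def centroid_link(a, b): return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(single_link(A, B), complete_link(A, B),
      average_link(A, B), centroid_link(A, B))  # 3.0 5.0 4.0 4.0
```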
What is a Good Clustering?
• A good clustering will produce clusters with
• High cohesion ( i.e. high similarity within the cluster)
• High separation (i.e. low similarity between the clusters)
• Cohesion and separation are measured with a distance function
• Various ways to combine them in 1 measure
• Davies-Bouldin (DB) index
• Silhouette coefficient
Davies-Bouldin (DB) index
• A heuristic measure of the quality of the clustering
• Combines cohesion and separation
• Each pair of clusters i and j from the resulting clustering is compared:
[Figure: a clustering consisting of 3 clusters]
$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{dist(x,c_i)+dist(x,c_j)}{dist(c_i,c_j)}$
k – number of clusters
ci and cj – centroids of clusters i and j
dist(x, ci) – mean-squared distance from each item x in cluster i to its centroid
dist(ci, cj) – distance between the centroids of clusters i and j
What is the DB index for a good clustering – big or small?
A Closer Look at the DB-index
The numerator measures the spread of the items within clusters i and j; the denominator measures the distance between the clusters (centroid distance):
$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{dist(x,c_i)+dist(x,c_j)}{dist(c_i,c_j)}$
• The summation is pair-wise (for pairs of clusters from the clustering)
• meaning of “max” – for each cluster i, find the cluster j that maximizes the fraction in brackets:
– denominator is low: small distance between the centroids of i and j, i.e.
possibly overlapping clusters i and j AND/OR
– numerator is high: big spread within each of the clusters
• => DB index is sum of max-es (worst pairing for each cluster) and we would like to keep it minimum
• Small DB index = clusters have small spread and are far from each other = good clustering
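A direct transcription of the definition into numpy; the spread dist(x, ci) is taken as the mean-squared distance to the centroid, as defined above, and the toy clusters are invented:

```python
import numpy as np

def db_index(clusters):
    # clusters: list of (n_i, d) arrays, one array of points per cluster
    cents = [c.mean(axis=0) for c in clusters]
    spread = [np.mean(np.linalg.norm(c - m, axis=1) ** 2)   # dist(x, c_i)
              for c, m in zip(clusters, cents)]
    k = len(clusters)
    return sum(                                   # sum of worst pairings ...
        max((spread[i] + spread[j]) / np.linalg.norm(cents[i] - cents[j])
            for j in range(k) if j != i)          # ... max over j != i
        for i in range(k)) / k

# two tight, well-separated clusters -> small DB index (good clustering)
c1 = np.array([[0.0, 0.0], [0.0, 1.0]])
c2 = np.array([[10.0, 0.0], [10.0, 1.0]])
print(db_index([c1, c2]))  # 0.05
```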
Taxonomy of Clustering Algorithms
• Partitional – k-means, k-medoids, nearest neighbor, SOM
• create only one set of clusters
• The number of clusters k is required for most algorithms (e.g. k-means and k-medoids) and not required for some (e.g. nearest neighbour and SOM)
• Hierarchical – agglomerative and divisive
• Creates a nested set of clusters
• k does not need to be specified
• Density-based – DBSCAN
• regions of high density form clusters
• k does not need to be specified directly
• Model-based (generative) – EM
• Fuzzy clustering
K-Means Clustering Algorithm
K-Means Clustering Algorithm
• Partitional clustering algorithm; simple and very popular
• Iterative and distance-based
• Requires the number of clusters k to be specified in advance
• Can be implemented in 3 steps:
1. Choose k examples as the initial centroids (seeds) of the clusters
2. Form k clusters by assigning each example to the closest centroid
3. At the end of each epoch:
• Re-compute the centroid of the clusters
• Check if the stopping criterion is satisfied: centroids do not change. If yes – stop; otherwise, repeat steps 2 and 3 using the new centroids.
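A minimal numpy sketch of these three steps (toy data invented; assumes Euclidean distance and that no cluster becomes empty during the run):

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    # Step 1: choose k examples as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 2: assign each example to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centroids; stop when they no longer change
        # (assumes no cluster ends up empty)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            return labels, centroids
        centroids = new

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]], dtype=float)
print(k_means(X, k=2))
```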
K-Means – Example
[Figure: six panels tracing k-means on 2-D points (both axes 0–10): select centroids → find clusters → recompute centroids → find clusters (end of epoch 1) → recompute centroids, giving centroids (3.4, 6) and (6.2, 3.6) → find clusters (end of epoch 2). End of epoch 3: recompute centroids, no change – stop. The last panel shows the final clusters.]
K-means Clustering – Example
• Given: 5 items with the distance between them
• Task: Cluster them into 2 groups using the k-means algorithm. Assume that the initial centroids (seeds) are B and C. Show the clusters after the first epoch.
From Tan, Steinbach, Karpatne & Kumar, Introduction to Data Mining
K-means – Issues
• Typical values of k: 2 to 10
• Different distance measures can be used
• typically Euclidean distance
• Data should be normalized
• Nominal data need to be converted into numeric
• The number of epochs for convergence is typically much smaller than the number of points, i.e. converges quickly
• Often the stopping criterion is changed to “until relatively few points change clusters”, e.g. <1%
• Sensitive to the choice of initial seeds (centroids)
• different runs of k-means produce different clusters (see next slide)
• there are several approaches for choosing “good” initial centroids
Images from Tan, Steinbach, Karpatne & Kumar, Introduction to Data Mining
Good Initial Centroids
[Figure: six panels (Iteration 1–6) showing k-means gradually converging to the natural clusters when started from good initial centroids]
Images from Tan, Steinbach, Karpatne & Kumar, Introduction to Data Mining
Poor Initial Centroids
[Figure: five panels (Iteration 1–5) showing k-means converging to a poor grouping when started from poor initial centroids]
Time and Space Complexity
• Space complexity - modest
• All examples and centroids need to be stored
• O( (m+k)n ) where: m – number of examples, n – number of features, k – number of clusters
• Time complexity – expensive
• O(tkmn), t - number of iterations
• Involves finding the distance from each example to each cluster center at each iteration
• t is often small and can be safely bounded as most changes occur in the first few iterations
• Not practical for large datasets
K-means as an Optimization Problem
• k-means clustering can be viewed as an optimization problem: find k clusters that minimize the sum-squared error (SSE)
$SSE = \sum_{i=1}^{k}\sum_{x \in K_i} dist(c_i, x)^2$
• ci are the centroids, k – number of clusters, x – the examples in cluster Ki
• Not guaranteed to find the global minimum, i.e. may find a local minimum
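Once an assignment is fixed, the SSE is a one-liner; a small sketch assuming numpy arrays for the data, labels and centroids:

```python
import numpy as np

def sse(X, labels, centroids):
    # sum over clusters of squared distances from each point to its centroid
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))
```

Running k-means several times with different seeds and keeping the run with the smallest SSE is the usual workaround for local minima.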
K-means – Strengths and Weaknesses
• Simple and very popular
• Relatively efficient
• Modest space complexity
• Fast convergence (typically)
• Not guaranteed to find the optimal solution
• Sensitive to initialization
• In practice: run it several times with different initialization, select the best clustering found
• Not sensitive to the order of input examples
• Does not work well for clusters with
• non-spherical shape
• non-convex shape
• Does not work well with data containing outliers
• Pre-processing is needed - outlier detection and removal
K-means – Other Issues
• The best number of clusters k is not known in advance
• There is more than one correct answer to a clustering problem
• Domain expert may be required to suggest a good number of clusters and also to evaluate a solution
• Interpreting the semantic meaning of each cluster is difficult
• Applies for all clustering algorithms, not only k-means
• Why are the items in the same cluster – what are their common characteristics? What are the distinct characteristics of the items in each cluster?
• Domain expert is needed
K-means – Variations
• Improving the chances of k-means to find the global minimum
• Different ways to initialize the centroids (seeds)
• Using weights based on how close the example is to the cluster center – Gaussian mixture models
• Allowing clusters to split and merge
• Split if the variance within a cluster is large
• Merge if the distance between cluster centers is smaller than a threshold
• Make it scale better
• Save distance information from one iteration to the next, thus reducing the number of calculations
• K-means can be used for hierarchical clustering
• Start with k=2 and repeat recursively within each cluster
Example and images from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
K-means – Image Compression Example
• Compression by Vector Quantization (VQ) using k-means
• Original image 1024 x 1024 pixels
• Each pixel is a grayscale value [0, 255]
• How many bytes are needed to store the original image?
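(For reference: each grayscale value in [0, 255] fits in 8 bits = 1 byte, so the original image takes 1024 × 1024 × 1 byte = 1,048,576 bytes ≈ 1 MB.)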
Image Compression Example (2)
• Break the image into 2x2 blocks of pixels
• Each of the 512x512 blocks of 4 numbers is regarded as a vector (4 dim)
• Cluster these vectors using k-means (e.g. k=10 )
• In the compressed image represent each block with the centroid of its cluster (i.e. replace each vector with the centroid)
[Figure: the image is split into 512 x 512 input vectors for clustering, e.g. blocks [23, 54, 47, 73] and [39, 56, 89, 95]; clustering into 10 clusters gives centroid1: [95, 56, 78, 56], centroid2: [256, 250, 196, 175], centroid3: [115, 168, 130, 178], ..., centroid10: [23, 45, 67, 89]]
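A sketch of this encode/decode process using scipy's vector-quantization k-means (scipy.cluster.vq.kmeans2); the reshape-based block splitting is my own illustration of the idea, not necessarily how the slide's figures were produced:

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # k-means routine for vector quantization

def vq_compress(img, k=10):
    # img: (1024, 1024) grayscale array -> 512 x 512 blocks of 2x2 pixels,
    # each block flattened into a 4-dimensional vector
    h, w = img.shape
    blocks = (img.reshape(h // 2, 2, w // 2, 2)
                 .swapaxes(1, 2).reshape(-1, 4).astype(float))
    codebook, labels = kmeans2(blocks, k, minit='points')
    return codebook, labels          # this is all that needs to be stored

def vq_decompress(codebook, labels, h=1024, w=1024):
    # replace every block with the centroid of its cluster and reassemble
    blocks = codebook[labels].reshape(h // 2, w // 2, 2, 2)
    return blocks.swapaxes(1, 2).reshape(h, w)
```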
Image Compression Example (3)
• For k=10 and 2x2 blocks
• What is the number of centroids?
• In the new image, what is the maximum number of
• unique grayscale vectors?
• unique grayscale values?
[Figure: the original image and the same block/centroid illustration as on the previous slide]
Image Compression Example (4)
• The clustering process is called encoding
• The centroids are called codewords
• The collection of all centroids is called codebook
[Figure: original image vs. compressed1 (k=200) vs. compressed2 (k=4)]
Storage Requirements for the Compressed Image
For our example (when the image is split into 2x2 blocks):
1) All centroids
• k centroids with dimensionality 4 = k x 4 x 8 bits
• If k is small (as in this example), this is a very small storage requirement and we can ignore it
2) For each block, the index of the closest centroid that will replace its values: log2(k) bits per block
• Why log2(k) bits? We need to distinguish between k centroid indices => log2(k) bits to encode an index
Compression rate (compressed/original) = (k·4·8 + 512·512·log2(k)) / (1024·1024·8)
= 0.239 for k = 200
= 0.063 for k = 4
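The same calculation in a few lines of Python:

```python
import math

def compression_rate(k, blocks=512 * 512, dims=4, orig_bits=1024 * 1024 * 8):
    codebook_bits = k * dims * 8        # k centroids x 4 values x 8 bits each
    index_bits = blocks * math.log2(k)  # log2(k) bits per block index
    return (codebook_bits + index_bits) / orig_bits

print(compression_rate(200))  # ~0.24  (0.239 on the slide)
print(compression_rate(4))    # ~0.063
```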
K-Medoids Clustering Algorithm
• Similar to K-means, but reduces the sensitivity to outliers
• Uses cluster medoid instead of cluster means
• Similarly to K-means, K-medoids minimizes the distance between a point in a cluster and the reference point (medoid in K-medoids and means in K-means)
• This algorithm is also called PAM (Partitioning Around the Medoids)
For COMP3608 only
K-medoids – Pseudo code
• Select K points (items) as the initial medoids
• Form K clusters by assigning all items to the closest medoid
• Re-compute the medoids for each cluster (search for better medoids):
• Initial (parent) state = current set of medoids
• Generate the children states by swapping each medoid with each other non-medoid
• Evaluate the children states – are they better than the parent? Evaluation function: cost = sum of absolute distances (item, closest medoid)
• Choose the best state (set of medoids with the smallest cost)
• Until medoids don't change or cost doesn't decrease
Hill-climbing search for the set of medoids minimizing the distance from an example to a medoid
For COMP3608 only
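A compact sketch of this search on a precomputed distance matrix D; this is my own hill-climbing implementation of the pseudocode above, so details such as initialisation and tie-breaking may differ from any reference code:

```python
import numpy as np

def cost(D, medoids):
    # sum of distances from every item to its closest medoid
    return D[:, medoids].min(axis=1).sum()

def k_medoids(D, k):
    n = len(D)
    medoids = list(range(k))            # initial medoids (first k items)
    best = cost(D, medoids)
    improved = True
    while improved:                     # until cost doesn't decrease
        improved = False
        for i in range(k):              # swap each medoid ...
            for p in range(n):          # ... with each non-medoid
                if p in medoids:
                    continue
                child = medoids[:i] + [p] + medoids[i + 1:]
                c = cost(D, child)
                if c < best:            # keep the improving child state
                    best, medoids, improved = c, child, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```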
K-medoids – Example
• Given: 5 items with the distance between them
• Task: Cluster them using the K-medoids, k=2
• Breaking ties: place examples randomly in clusters if they have identical distances to the medoids
For COMP3608 only
Adapted example and pictures from Dunham, Data Mining
K-medoids – Solution
For COMP3608 only
1) Randomly select 2 items as initial medoids, e.g. A (K1) and B (K2)
2) Assign all items to the closest medoid: K1 = {A,C,D}, K2 = {B,E}
3) Consider swapping medoids A and B with each non-medoid and evaluate the cost of each resulting state (best = smallest cost):
A, B (current state): Cost = 7
A->C, B=B (swap A with C, don't change B; medoids C, B): Cost = 5
A->D, B=B (medoids D, B): Cost = 5
A->E, B=B (medoids E, B): Cost = 6
A=A, B->C (medoids A, C): Cost = 5
A=A, B->D (medoids A, D): Cost = 5
A=A, B->E (medoids A, E): Cost = 5
• Given n items and k desired clusters, how many swaps need to be considered?
Evaluating Children States
• Evaluate child 1 (C, B) – the new medoids are C and B, resulting from the swap: A->C, B=B
• Assign each non-medoid to the closest medoid and calculate the total cost:
– From A to the new medoids: dist(A,C)=2 > dist(A,B)=1, so A joins B's cluster (cost 1)
– From D to the new medoids: dist(D,C)=1 < dist(D,B)=4, so D joins C's cluster (cost 1)
– From E to the new medoids: dist(E,C)=5 > dist(E,B)=3, so E joins B's cluster (cost 3)
cost = 1 + 1 + 3 = 5
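This bookkeeping is easy to check in code, using only the six distances quoted above (the example's full distance table is not reproduced here):

```python
# distances quoted on the slide
dist = {('A', 'B'): 1, ('A', 'C'): 2, ('D', 'B'): 4,
        ('D', 'C'): 1, ('E', 'B'): 3, ('E', 'C'): 5}
d = lambda x, m: dist.get((x, m)) or dist.get((m, x))

medoids = ['C', 'B']                        # child state after the swap A->C
cost = sum(min(d(x, m) for m in medoids)    # each non-medoid goes to its
           for x in ['A', 'D', 'E'])        # closest medoid
print(cost)                                 # 1 + 1 + 3 = 5
```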