COMP3308/3608, Lecture 12
ARTIFICIAL INTELLIGENCE
Unsupervised Learning (Clustering)
• Introduction to clustering
• Clustering algorithms
• K-Means
• K-Medoids (COMP3608 only)
• Nearest neighbour
• Hierarchical
What is Clustering?
• Clustering – the process of partitioning the data into a set of groups (called clusters) so that the items from the same cluster are:
• similar to each other
• dissimilar to the items in other clusters
• Similarity is defined in terms of distance measure
Ex. Star clustering based on temperature and brightness (Hertzsprung-Russell diagram)
• 3 well-defined clusters
• now we understand that the groups represent stars in 3 different phases of their life
from “Data Mining Techniques”, M. Berry and G. Linoff, John Wiley and Sons Publ.
Clustering Example – Fitting Troops
• Re-designing the uniforms for female soldiers in the US army
• Goal: reduce the number of uniform sizes to be kept in inventory while still providing good fit
• Researchers from Cornell University used clustering and designed a new set of sizes
• Traditional clothing system: ordered set of sizes where all dimensions increase together
• The new system: sizes that fit body types
• E.g. one size for women with short legs, small waist, long torsos, average arms, broad shoulders and skinny necks
[Figure: scatter plot where each point is an example (= a vector of body measurements), grouped into 3 clusters corresponding to 3 body types]
Supervised vs Unsupervised Learning
• Clustering is unsupervised learning: no labels • Given:
• A set of unlabeled examples (input vectors) xi
• k – desired number of clusters (may not be given)
• Task: Cluster (group) the examples into k clusters (groups)
• Supervised vs unsupervised learning
• Supervised: We know the class labels and the number of classes. We want to build a classifier that can be used to predict the class of new (unlabelled) examples.
• Unsupervised: We do not know the class labels and may not know the number of classes. We want to group similar examples together.
Clustering Task – Formal Definition
Given:
• a dataset P = {p1,…, pn} of input vectors (examples, instances, points, data objects, items, records)
• an integer k – the number of clusters we are looking for
Find:
• a mapping f: P -> {1,…,k} where each pi is assigned to one cluster Kj, 1 ≤ j ≤ k
• a set of clusters K = {K1, K2,…, Kk}
• Note: According to this definition:
• each example is assigned to exactly 1 cluster. Some clustering algorithms (e.g. fuzzy and probabilistic) assign each example to more than 1 cluster – e.g. to each cluster with a certain probability.
• not all clustering algorithms require k to be specified (e.g. agglomerative, SOM)
Typical Clustering Applications
• As a stand-alone tool to group data
• As a building block for other algorithms
• e.g. pre-processing tool for dimensionality reduction – using the cluster center to represent all data points in the cluster (also called vector quantization)
More Examples of Clustering Applications
• Marketing
• find distinct groups of customers, and then use this knowledge to develop targeted marketing programs (e.g. potential customers of a new product)
• Biology
• derive plant and animal taxonomies
• find genes with similar expression patterns; often they have similar function
• Land use
• find areas of similar land use in an earth observation database
• Insurance
• find groups of insurance policy holders with a high average claim cost
• City-planning
• identify groups of houses according to their house type, value, and geographical location
Centroid and Medoid of a Cluster
• Consider a cluster K of N points {p1,…, pN}
• Centroid C (means) – the “middle” of the cluster: $C = \frac{1}{N}\sum_{i=1}^{N} p_i$
• no need to be an actual data point in the cluster
• Medoid M – the centrally located data point in the cluster
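Both notions are easy to compute directly; below is a minimal numpy sketch (the toy points are invented for illustration):

```python
import numpy as np

def centroid(points):
    # the mean of the points; need not be an actual data point
    return points.mean(axis=0)

def medoid(points):
    # the actual point with the smallest total distance to all other points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

pts = np.array([[1.0, 1.0], [2.0, 1.0], [6.0, 4.0]])
print(centroid(pts))  # [3. 2.] -- not one of the points
print(medoid(pts))    # [2. 1.] -- an actual data point
```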
Distance Between Clusters
• Different ways to define it:
[Figure: two clusters with arrows illustrating the single link, complete link and centroid distances]
• Centroid – the distance between the centroids
• Medoid – the distance between the medoids
• Single link (MIN) – The smallest pairwise distance between elements from each cluster
• Complete link (MAX) – the largest pairwise distance between elements from each cluster
• Average link – the average pairwise distance between elements from each cluster
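Four of these definitions reduce to a few numpy lines (medoid distance is analogous, using medoids instead of centroids); the toy clusters are invented for illustration:

```python
import numpy as np

def pairwise(a, b):
    # all pairwise Euclidean distances between points of clusters a and b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_link(a, b):   return pairwise(a, b).min()    # MIN
def complete_link(a, b): return pairwise(a, b).max()    # MAX
def average_link(a, b):  return pairwise(a, b).mean()
def centroid_link(a, b): return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(single_link(A, B), complete_link(A, B),
      average_link(A, B), centroid_link(A, B))  # 3.0 5.0 4.0 4.0
```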
What is a Good Clustering?
• A good clustering will produce clusters with
• High cohesion ( i.e. high similarity within the cluster)
• High separation (i.e. low similarity between the clusters)
• Cohesion and separation are measured with a distance function
• Various ways to combine them in 1 measure
• Davies-Bouldin (DB) index
• Silhouette coefficient
Davies-Bouldin (DB) index
• A heuristic measure of the quality of the clustering
• Combines cohesion and separation
• Each pair of clusters i and j from the resulting clustering is compared:
[Figure: a clustering consisting of 3 clusters]
$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{dist(x,c_i)+dist(x,c_j)}{dist(c_i,c_j)}$
k – number of clusters
ci and cj – centroids of clusters i and j
dist(x, ci) – mean-squared distance from each item x in cluster i to its centroid
dist(ci, cj) – distance between the centroids of clusters i and j
What is the DB index for a good clustering – big or small?
A Closer Look at the DB-index
The numerator measures the spread of the items within clusters i and j; the denominator measures the distance between the clusters (centroid distance):
$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{dist(x,c_i)+dist(x,c_j)}{dist(c_i,c_j)}$
• The summation is pair-wise (for pairs of clusters from the clustering)
• meaning of “max” – for each cluster i, find the cluster j that maximizes the fraction in brackets:
– denominator is low: small distance between the centroids of i and j, i.e.
possibly overlapping clusters i and j AND/OR
– numerator is high: big spread within each of the clusters
• => DB index is sum of max-es (worst pairing for each cluster) and we would like to keep it minimum
• Small DB index = clusters have small spread and are far from each other = good clustering
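A direct transcription of the definition into numpy; the spread dist(x, ci) is taken as the mean-squared distance to the centroid, as defined above, and the toy clusters are invented:

```python
import numpy as np

def db_index(clusters):
    # clusters: list of (n_i, d) arrays, one array of points per cluster
    cents = [c.mean(axis=0) for c in clusters]
    spread = [np.mean(np.linalg.norm(c - m, axis=1) ** 2)   # dist(x, c_i)
              for c, m in zip(clusters, cents)]
    k = len(clusters)
    return sum(                                   # sum of worst pairings ...
        max((spread[i] + spread[j]) / np.linalg.norm(cents[i] - cents[j])
            for j in range(k) if j != i)          # ... max over j != i
        for i in range(k)) / k

# two tight, well-separated clusters -> small DB index (good clustering)
c1 = np.array([[0.0, 0.0], [0.0, 1.0]])
c2 = np.array([[10.0, 0.0], [10.0, 1.0]])
print(db_index([c1, c2]))  # 0.05
```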
Taxonomy of Clustering Algorithms
• Partitional – k-means, k-medoids, nearest neighbor, SOM
• create only one set of clusters
• The number of clusters k is required for most algorithms (e.g. k-means and k-medoids) and not required for some (e.g. nearest neighbour and SOM)
• Hierarchical – agglomerative and divisive
• Creates a nested set of clusters
• k does not need to be specified
• Density-based – DBSCAN
• regions of high density form clusters
• k does not need to be specified directly
• Model-based (generative) – EM
• Fuzzy clustering
K-Means Clustering Algorithm
K-Means Clustering Algorithm
• Partitional clustering algorithm; simple and very popular
• Iterative and distance-based
• Requires the number of clusters k to be specified in advance
• Can be implemented in 3 steps:
1. Choose k examples as the initial centroids (seeds) of the clusters
2. Form k clusters by assigning each example to the closest centroid
3. At the end of each epoch:
• Re-compute the centroid of the clusters
• Check if the stopping criterion is satisfied: centroids do not change. If yes – stop; otherwise, repeat steps 2 and 3 using the new centroids.
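A minimal numpy sketch of these three steps (toy data invented; assumes Euclidean distance and that no cluster becomes empty during the run):

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    # Step 1: choose k examples as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 2: assign each example to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centroids; stop when they no longer change
        # (assumes no cluster ends up empty)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            return labels, centroids
        centroids = new

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]], dtype=float)
print(k_means(X, k=2))
```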
K-Means – Example
[Figure: six panels tracing k-means on 2-D points (both axes 0–10): select centroids → find clusters → recompute centroids → find clusters (end of epoch 1) → recompute centroids, giving centroids (3.4, 6) and (6.2, 3.6) → find clusters (end of epoch 2). End of epoch 3: recompute centroids, no change – stop. The last panel shows the final clusters.]
K-means Clustering – Example
• Given: 5 items with the distance between them
• Task: Cluster them into 2 groups using the k-means algorithm. Assume that the initial centroids (seeds) are B and C. Show the clusters after the first epoch.
From Tan, Steinbach, Karpatne & Kumar, Introduction to Data Mining
K-means – Issues
• Typical values of k: 2 to 10
• Different distance measures can be used
• typically Euclidean distance
• Data should be normalized
• Nominal data need to be converted into numeric
• The number of epochs for convergence is typically much smaller than the number of points, i.e. converges quickly
• Often the stopping criterion is changed to “until relatively few points change clusters”, e.g. <1%
• Sensitive to the choice of initial seeds (centroids)
• different runs of k-means produce different clusters (see next slide)
• there are several approaches for choosing “good” initial centroids
Images from Tan, Steinbach, Karpatne & Kumar, Introduction to Data Mining
Good Initial Centroids
[Figure: six panels (Iteration 1–6) showing k-means gradually converging to the natural clusters when started from good initial centroids]
Images from Tan, Steinbach, Karpatne & Kumar, Introduction to Data Mining
Poor Initial Centroids
[Figure: five panels (Iteration 1–5) showing k-means converging to a poor grouping when started from poor initial centroids]
Time and Space Complexity
• Space complexity - modest
• All examples and centroids need to be stored
• O( (m+k)n ) where: m – number of examples, n – number of features, k – number of clusters
• Time complexity – expensive
• O(tkmn), t - number of iterations
• Involves finding the distance from each example to each cluster center at each iteration
• t is often small and can be safely bounded as most changes occur in the first few iterations
• Not practical for large datasets
K-means as an Optimization Problem
• k-means clustering can be viewed as an optimization problem: find k clusters that minimize the sum-squared error (SSE)
$SSE = \sum_{i=1}^{k}\sum_{x \in K_i} dist(c_i, x)^2$
• ci are the centroids, k – number of clusters, x – the examples in cluster Ki
• Not guaranteed to find the global minimum, i.e. may find a local minimum
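Once an assignment is fixed, the SSE is a one-liner; a small sketch assuming numpy arrays for the data, labels and centroids:

```python
import numpy as np

def sse(X, labels, centroids):
    # sum over clusters of squared distances from each point to its centroid
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))
```

Running k-means several times with different seeds and keeping the run with the smallest SSE is the usual workaround for local minima.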
K-means – Strengths and Weaknesses
• Simple and very popular
• Relatively efficient
• Modest space complexity
• Fast convergence (typically)
• Not guaranteed to find the optimal solution
• Sensitive to initialization
• In practice: run it several times with different initialization, select the best clustering found
• Not sensitive to the order of input examples
• Does not work well for clusters with
• non-spherical shape
• non-convex shape
• Does not work well with data containing outliers
• Pre-processing is needed - outlier detection and removal
K-means – Other Issues
• The best number of clusters k is not known in advance
• There is more than one correct answer to a clustering problem
• Domain expert may be required to suggest a good number of clusters and also to evaluate a solution
• Interpreting the semantic meaning of each cluster is difficult
• Applies for all clustering algorithms, not only k-means
• Why are the items in the same cluster – what are their common characteristics? What are the distinct characteristics of the items in each cluster?
• Domain expert is needed
K-means – Variations
• Improving the chances of k-means to find the global minimum
• Different ways to initialize the centroids (seeds)
• Using weights based on how close the example is to the cluster center – Gaussian mixture models
• Allowing clusters to split and merge
• Split if the variance within a cluster is large
• Merge if the distance between cluster centers is smaller than a threshold
• Make it scale better
• Save distance information from one iteration to the next, thus reducing the number of calculations
• K-means can be used for hierarchical clustering
• Start with k=2 and repeat recursively within each cluster
Example and images from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
K-means – Image Compression Example
• Compression by Vector Quantization (VQ) using k-means
• Original image 1024 x 1024 pixels
• Each pixel is a grayscale value [0, 255]
• How many bytes are needed to store the original image?
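(For reference: each grayscale value in [0, 255] fits in 8 bits = 1 byte, so the original image takes 1024 × 1024 × 1 byte = 1,048,576 bytes ≈ 1 MB.)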
Image Compression Example (2)
• Break the image into 2x2 blocks of pixels
• Each of the 512x512 blocks of 4 numbers is regarded as a vector (4 dim)
• Cluster these vectors using k-means (e.g. k=10 )
• In the compressed image represent each block with the centroid of its cluster (i.e. replace each vector with the centroid)
[Figure: the image is split into 512 x 512 input vectors for clustering, e.g. blocks [23, 54, 47, 73] and [39, 56, 89, 95]; clustering into 10 clusters gives centroid1: [95, 56, 78, 56], centroid2: [256, 250, 196, 175], centroid3: [115, 168, 130, 178], ..., centroid10: [23, 45, 67, 89]]
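A sketch of this encode/decode process using scipy's vector-quantization k-means (scipy.cluster.vq.kmeans2); the reshape-based block splitting is my own illustration of the idea, not necessarily how the slide's figures were produced:

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # k-means routine for vector quantization

def vq_compress(img, k=10):
    # img: (1024, 1024) grayscale array -> 512 x 512 blocks of 2x2 pixels,
    # each block flattened into a 4-dimensional vector
    h, w = img.shape
    blocks = (img.reshape(h // 2, 2, w // 2, 2)
                 .swapaxes(1, 2).reshape(-1, 4).astype(float))
    codebook, labels = kmeans2(blocks, k, minit='points')
    return codebook, labels          # this is all that needs to be stored

def vq_decompress(codebook, labels, h=1024, w=1024):
    # replace every block with the centroid of its cluster and reassemble
    blocks = codebook[labels].reshape(h // 2, w // 2, 2, 2)
    return blocks.swapaxes(1, 2).reshape(h, w)
```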
Image Compression Example (3)
• For k=10 and 2x2 blocks
• What is the number of centroids?
• In the new image, what is the maximum number of
• unique grayscale vectors?
• unique grayscale values?
[Figure: the original image and the same block/centroid illustration as on the previous slide]
Image Compression Example (4)
• The clustering process is called encoding
• The centroids are called codewords
• The collection of all centroids is called codebook
[Figure: original image vs. compressed1 (k=200) vs. compressed2 (k=4)]
Storage Requirements for the Compressed Image
For our example (when the image is split into 2x2 blocks):
1) All centroids
• k centroids with dimensionality 4 = k x 4 x 8 bits
• If k is small (as in this example), this is a very small storage requirement and we can ignore it
2) For each block, the index of the closest centroid that will replace its values: log2(k) bits per block
• Why log2(k) bits? We need to distinguish between k centroid indices => log2(k) bits to encode an index
Compression rate (compressed/original) = (k·4·8 + 512·512·log2(k)) / (1024·1024·8)
= 0.239 for k = 200
= 0.063 for k = 4
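The same calculation in a few lines of Python:

```python
import math

def compression_rate(k, blocks=512 * 512, dims=4, orig_bits=1024 * 1024 * 8):
    codebook_bits = k * dims * 8        # k centroids x 4 values x 8 bits each
    index_bits = blocks * math.log2(k)  # log2(k) bits per block index
    return (codebook_bits + index_bits) / orig_bits

print(compression_rate(200))  # ~0.24  (0.239 on the slide)
print(compression_rate(4))    # ~0.063
```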
K-Medoids Clustering Algorithm
• Similar to K-means, but reduces the sensitivity to outliers
• Uses cluster medoid instead of cluster means
• Similarly to K-means, K-medoids minimizes the distance between a point in a cluster and the reference point (medoid in K-medoids and means in K-means)
• This algorithm is also called PAM (Partitioning Around the Medoids)
For COMP3608 only
K-medoids – Pseudo code
• Select K points (items) as the initial medoids
• Form K clusters by assigning all items to the closest medoid
• Re-compute the medoids for each cluster (search for better medoids):
• Initial (parent) state = current set of medoids
• Generate the children states by swapping each medoid with each other non-medoid
• Evaluate the children states – are they better than the parent? Evaluation function: cost = sum of absolute distances (item, closest medoid)
• Choose the best state (set of medoids with the smallest cost)
• Until medoids don't change or cost doesn't decrease
Hill-climbing search for the set of medoids minimizing the distance from an example to a medoid
For COMP3608 only
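A compact sketch of this search on a precomputed distance matrix D; this is my own hill-climbing implementation of the pseudocode above, so details such as initialisation and tie-breaking may differ from any reference code:

```python
import numpy as np

def cost(D, medoids):
    # sum of distances from every item to its closest medoid
    return D[:, medoids].min(axis=1).sum()

def k_medoids(D, k):
    n = len(D)
    medoids = list(range(k))            # initial medoids (first k items)
    best = cost(D, medoids)
    improved = True
    while improved:                     # until cost doesn't decrease
        improved = False
        for i in range(k):              # swap each medoid ...
            for p in range(n):          # ... with each non-medoid
                if p in medoids:
                    continue
                child = medoids[:i] + [p] + medoids[i + 1:]
                c = cost(D, child)
                if c < best:            # keep the improving child state
                    best, medoids, improved = c, child, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```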
K-medoids – Example
• Given: 5 items with the distance between them
• Task: Cluster them using the K-medoids, k=2
• Breaking ties: place examples randomly in clusters if they have identical distances to the medoids
For COMP3608 only
Adapted example and pictures from Dunham, Data Mining
K-medoids – Solution
For COMP3608 only
1) Randomly select 2 items as initial medoids, e.g. A (K1) and B (K2)
2) Assign all items to the closest medoid: K1 = {A,C,D}, K2 = {B,E}
3) Consider swapping medoids A and B with each non-medoid and evaluate the cost of each resulting state (best = smallest cost):
A, B (current state): Cost = 7
A->C, B=B (swap A with C, don't change B; medoids C, B): Cost = 5
A->D, B=B (medoids D, B): Cost = 5
A->E, B=B (medoids E, B): Cost = 6
A=A, B->C (medoids A, C): Cost = 5
A=A, B->D (medoids A, D): Cost = 5
A=A, B->E (medoids A, E): Cost = 5
• Given n items and k desired clusters, how many swaps need to be considered?
Evaluating Children States
• Evaluate child 1 (C, B) – the new medoids are C and B, resulting from the swap: A->C, B=B
• Assign each non-medoid to the closest medoid and calculate the total cost:
– From A to the new medoids: dist(A,C)=2 > dist(A,B)=1, so A joins B's cluster (cost 1)
– From D to the new medoids: dist(D,C)=1 < dist(D,B)=4, so D joins C's cluster (cost 1)
– From E to the new medoids: dist(E,C)=5 > dist(E,B)=3, so E joins B's cluster (cost 3)
cost = 1 + 1 + 3 = 5
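This bookkeeping is easy to check in code, using only the six distances quoted above (the example's full distance table is not reproduced here):

```python
# distances quoted on the slide
dist = {('A', 'B'): 1, ('A', 'C'): 2, ('D', 'B'): 4,
        ('D', 'C'): 1, ('E', 'B'): 3, ('E', 'C'): 5}
d = lambda x, m: dist.get((x, m)) or dist.get((m, x))

medoids = ['C', 'B']                        # child state after the swap A->C
cost = sum(min(d(x, m) for m in medoids)    # each non-medoid goes to its
           for x in ['A', 'D', 'E'])        # closest medoid
print(cost)                                 # 1 + 1 + 3 = 5
```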