Elements of Data Processing
Semester 1, 2022
Clustering analysis – Introduction
© University of Melbourne 2022
Copyright By PowCoder代写 加微信 powcoder
Clustering
J Novembre et al. Nature 000, 1-4 (2008) doi:10.1038/nature07331
© University of Melbourne 2022
• What is clustering
• Clustering algorithms • K-means
• Visualisation of clustering tendency (VAT) • Hierarchical clustering
© University of Melbourne 2022
What is Cluster Analysis?
We will be looking at two classic clustering algorithms • K-means
• Hierarchical clustering
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Figure from Tan, Steinbach and Kumar 2004
© University of Melbourne 2022
How to obtain the clusters?
• Need to assign each object to exactly one cluster
• Each cluster can be summarised by its centroid (the average of all its objects)
© University of Melbourne 2022 7
Clustering applications
Clustering is a major task in data analysis and visualization
• Market segmentation
• Image analysis
• Search engine result presentation • Outlier detection
• Personality types
© University of Melbourne 2022 8
Cambridge Analytica
• Infer people’s personality from their Facebook likes (correlating likes with an online personality test) – psychographics
• “Kosinski proved that on the basis of an average of 68 Facebook “likes” by a user, it was possible to predict their skin color (with 95 percent accuracy), their sexual orientation (88 percent accuracy), and their affiliation to the Democratic or Republican party (85 percent). But it didn’t stop there. Intelligence, religious affiliation, as well as alcohol, cigarette and drug use, could all be determined.”
© University of Melbourne 2022 9
Cambridge Analytica
• “before long, he was able to evaluate a person better than the average work colleague, merely on the basis of ten Facebook “likes”. Seventy “likes” were enough to outdo what a person’s friends knew, 150 what their parents knew, and 300 “likes” what their partner knew. More “likes” could even surpass what a person thought they knew about themselves.”
• Might then target clusters (segments) of people with certain personality traits, with appropriate political messages
• “For a highly neurotic and conscientious audience the threat of a burglary—and the insurance policy of a gun.”
• Quotes taken from following article https://motherboard.vice.com/en_us/article/how-our-likes-helped-trump-win
© University of Melbourne 2022 10
Pre-requisite: Distance functions
• Clustering methods are typically distance based. Represent each object / instance as a vector (a row in the data) and then compute Euclidean distance between pairs of vectors
• Commonly normalise (scale) each attribute into range [0,1] via a pre-processing step before computing distances
Given X = (x1,x2,x3,…,xn) and Y = (y1,y2,y3,…,yn)
D(X,Y)=p(x1 y1)2 +(x2 y2)2 +(x3 y3)2 +…(xn yn)2
© University of Melbourne 2022 11
Good Clustering
• Quality:
• Objects within same cluster are close together
• Objects in different clusters are far apart • Output:
• Needed to assign each object to exactly one cluster
• Each cluster can be summarised by its centroid (the average of all its objects)
© University of Melbourne 2022 12
Elements of Data Processing
Semester 1, 2022
Clustering analysis – K-Means Clustering
© University of Melbourne 2022
K-Means Clustering
© University of Melbourne 2022 3
K-Means – example
© University of Melbourne 2022
1. Ask user how many clusters they would like (e.g. K=5)
(Example from http://www.autonlab.org/tutorials/kmeans11.pdf)
K-Means – example
© University of Melbourne 2022
Ask user how many clusters they would like (e.g. K=5)
Randomly pick K cluster centroids
K-Means – example
1. Ask user how many clusters they would like (e.g. K=5)
2. Randomly pick K cluster centroids
3. For each data point find out which centroid it is closest to. (Thus each centroid “owns” a set of datapoints)
© University of Melbourne 2022
K-Means – example
1. Ask user how many clusters they would like (e.g. K=5)
2. Randomly pick K cluster centroids
3. For each data point find out which centroid it is closest to. (Thus each centroid “owns” a cluster of datapoints)
4. Each cluster finds the new centroid of the points in it
© University of Melbourne 2022
K-Means – example
© University of Melbourne 2022
Ask user how many clusters they would like (e.g. K=5)
Randomly pick K cluster centroids
For each data point find out which centroid it is closest to. (Thus each centroid “owns” a cluster of datapoints)
Each cluster finds the new centroid of the points in it
New centroids => new boundaries Repeat 3-5 until no further changes
Given parameter k, the k-means algorithm is implemented in four steps:
1. Randomlyselectkseedpointsastheinitialclustercentroids
2. Assigneachobjecttotheclusterwiththenearestcentroid
3. Computethecentroidsoftheclustersofthecurrentpartitioning(the centroid is the center, i.e., mean point, of the cluster)
4. GobacktoStep2,stopwhentheassignmentdoesnotchange
© University of Melbourne 2022 9
Understanding the Algorithm
For which dataset does k-means require a fewer number of iterations? k= 2
Dataset 1 Dataset 2
© University of Melbourne 2022
K-Means: further details
• Typically choose the initial seed points randomly
• Different runs of the algorithm will produce different results
• Closeness measured by Euclidean distance
• Can also use other distance functions (e.g. correlation)
• Algorithm can be shown to converge (to a local optimum) • Typicallydoesnotrequiremanyiterations
© University of Melbourne 2022 11
K-Means interactive demos
• http://syskall.com/kmeans.js/
• http://tech.nitoyon.com/en/blog/2013/11/07/k-means/
• https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
© University of Melbourne 2022 12
An application:
cluster-based outlier detection
• An outlier is expected to be far away from any groups of normal objects
• Each object is associated with exactly one cluster and its outlier score is equal to the distance from its cluster centre
• Score = size of circle in following example
© University of Melbourne 2022 13
Clustering based outlier detection 3 clusters
© University of Melbourne 2022 14
Clustering based outlier detection 6 clusters
© University of Melbourne 2022 15
Reflections: K-Means (1)
Original Points
K-means (3 Clusters)
Limitations of K-means – result quality varies based on size and density
Original Points
K-means (3 Clusters)
© University of Melbourne 2022 17
Reflections: K-Means (2)
Original Points
K-means (2 Clusters)
© University of Melbourne 2022
Limitations of K-means – not strong for non-globular shapes
Clustering Reflections: K-Means (3)
An idea: Use many clusters, merge together
© University of Melbourne 2022 19
Finding K – the elbow method
• Within Cluster Sum of Squared distance (WCSS) $
!”## $ =&& ” − ‘! (,’! *+,h./.0,12*324/56+,.1/
!”# %!∈! https://www.kaggle.com/abhishekyadav5/kmeans-clustering-with-elbow-method-and-silhouette
• Sum of squared distances of data points to their closest cluster centre.
• Also referred to as SSE (Sum of Squared Error)
© University of Melbourne 2022 20
Elements of Data Processing
Semester 1, 2022 Clustering analysis – VAT
© University of Melbourne 2022
Visual Assessment of Clustering Tendency (VAT)
© University of Melbourne 2022 2
Visual Assessment of Clustering Tendency (VAT)
• How many clusters are in the data? How big are they? What is the likely membership of objects in each cluster?
• The k parameter for k-means
• One solution: visually determine the clustering structure by inspecting a heat map
• Represent datasets in an n*n image format
• Applicable for many different types of object data
© University of Melbourne 2022 3
What is a (dis)similarity matrix?
Compute all pairwise distances between objects. This gives a dissimilarity matrix.
© University of Melbourne 2022 4
Visualising a dissimilarity matrix
We can visualise a dissimilarity matrix as a heat map, where the colour of each cell indicates that cell’s value
© University of Melbourne 2022 5
Properties of a dissimilarity matrix D
• The diagonal of D is all zeros
• D is symmetric about its leading diagonal
• D(i,j)=D(j,i) for all i and j
• Objects follow the same order along rows and columns
• Visualising the (raw) dissimilarity matrix may not reveal enough useful information
• Further processing often needed
© University of Melbourne 2022 6
Reordering a Dissimilarity matrix
Example dataset with 16 Random order of the 16 objects for the objects dissimilarity matrix
© University of Melbourne 2022 7
Reordering the matrix reveals the clusters
A better ordering of the 16 objects. Nearby objects in the ordering are similar to each
other, producing large dark blocks. We can see four clusters along the diagonal
© University of Melbourne 2022 8
Good VAT images
• A good VAT image suggests both the number of clusters and approximate members of each cluster
• A diagonal dark block appears in the VAT image only when a tight group exists in the data (low within-cluster dissimilarities)
© University of Melbourne 2022 9
Another example:
dissimilarity matrix for Iris dataset
Random order for 150 objects: Where are the clusters?
© University of Melbourne 2022 10
Dissimilarity matrix for Iris data (reordered)
© University of Melbourne 2022 11
Exercise – Match the datasets with the VAT images
From: An algorithm for clustering Tendency Assessment, Hu, Y., Hathaway, R. WSEAS Transactions on Mathematics, 2008.
© University of Melbourne 2022 12
VAT ordering example
© University of Melbourne 2022 13
VAT ordering example
© University of Melbourne 2022 14
VAT ordering example
© University of Melbourne 2022 15
VAT ordering example
© University of Melbourne 2022 16
VAT ordering example
© University of Melbourne 2022 17
VAT ordering example
© University of Melbourne 2022 18
VAT ordering example
© University of Melbourne 2022 19
VAT ordering example
© University of Melbourne 2022 20
VAT ordering example
© University of Melbourne 2022
The VAT algorithm:
Visual assessment for clustering tendency
(Bezdek and Hathaway 2002)
Given an N*N dissimilarity matrix D
Let K={1,…N}, I=J={} ### {} is a set (collection of objects) Pick the two least similar objects oa and ob from D
P(1)=a; I={a}; J=K-{a}
For r = 2, …., N
Select (i,j): pair of most similar objects oi and oj from D
Such that i ∈ I, j ∈ J ### ’∈’ means ‘member of’
P(r) = j; I = I ∪ {j}; J=J – {j}; ### ‘∪’ means ‘union’ i.e. add two collections together
• Obtain reordered dissimilarity matrix D* from permutation (ordering) P
• E.g. P(1)=2, P(2)=5, P(3)=7
• The first object in the ordering is 2, the second is 5, the third is 7 ….
• VAT algorithm python code – you will practice in workshops
© University of Melbourne 2022 22
The VAT algorithm – problems
VAT algorithm is not effective in every situation
• For complex shaped datasets (either significant overlap or irregular geometries between different clusters), the quality of the VAT image may significantly degrade
© University of Melbourne 2022 23
Elements of Data Processing
Semester 1, 2022
Clustering analysis – Hierarchical Clustering
© University of Melbourne 2022
Hierarchical Clustering
© University of Melbourne 2022 2
Hierarchical clustering
• Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram
• A tree like diagram that records the sequences of merges (y-axis is distance)
0.2 0.15 0.1 0.05 0
© University of Melbourne 2022
Strengths of hierarchical clustering
• No assumption of any particular number of clusters
• Any desired number of clusters can be obtained by ʻcuttingʼ the dendrogram
at the appropriate level
• They may correspond to meaningful taxonomies
• Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
© University of Melbourne 2022 4
Number of clusters
0.2 0.15 0.1 0.05 0
2 clusters ? clusters
© University of Melbourne 2022
Tree of life
(http://www.talkorigins.org/faqs/comdesc/phylo.html)
© University of Melbourne 2022 6
Hierarchical clustering algorithms
• Twomaintypesofhierarchicalclustering
• Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
• Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditionalhierarchicalalgorithmsuseadis-similarityordistancematrixor proximity matrix
• Merge or split one cluster at a time
© University of Melbourne 2022 7
Agglomerative hierarchical clustering
• More popular hierarchical clustering technique • Basic algorithm is straightforward
• Compute the dis-similarity matrix
• Let each data point be a cluster
• Merge the two closest clusters
• Update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the similarity of two clusters
• Different approaches to defining the distance between clusters distinguish the different algorithms
© University of Melbourne 2022 8
Agglomerative clustering example (1)
Start with clusters of individual points and a proximity matrix
p1 p2 p3 p4 p5 p1
© University of Melbourne 2022
p1 p2 p3 p4
p9 p10 p11 p12
Proximity Matrix
Agglomerative clustering example (2)
After some merging, we have clusters
Proximity Matrix
© University of Melbourne 2022
p1 p2 p3 p4
p9 p10 p11 p12
Agglomerative clustering example (3)
We merge the two closest clusters (C2 and C5) and update the proximity matrix
Proximity Matrix
© University of Melbourne 2022
p1 p2 p3 p4
p9 p10 p11 p12
Agglomerative clustering example (4)
How do we update the dis-similarity (proximity) matrix?
Proximity Matrix
© University of Melbourne 2022
p1 p2 p3 p4
p9 p10 p11 p12
Agglomerative clustering – distance
Similarity?
• Min (Single Linkage)
• Define the similarity to be the minimum distance between the clusters
• Max (Complete Linkage)
• Define the similarity to be the maximum distance between the clusters
Proximity Matrix
© University of Melbourne 2022
Agglomerative clustering example (5)
MIN (Single linkage: the similarity of two clusters is based on the two most similar (closest) points in the different clusters
• Determined by one pair of points, i.e., by one link in the proximity graph. 15
0.2 0.15 0.1 0.05 0
Nested Clusters Dendrogram
© University of Melbourne 2022
Clustering Reflections: Agglomerative (1)
Reflections: hierarchical clustering (1)
Original Points
Two Clusters
Strength of MIN: Single linkage: Can handle non-similar shapes
Original Points
Two Clusters
Limitations of MIN: Single linkage: Sensitive to noise and outliers
© University of Melbourne 2022 16
Clustering Reflections: Agglomerative (1)
Reflections: hierarchical clustering (2)
Original Points Two Clusters
Strength of MAX: Complete linkage: Less susceptible to noise and outliers
Original Points
Two Clusters
Limitations of Max: Complete linkage: Tends to create spherical clusters with consistent diameter. Tends to break large clusters
© University of Melbourne 2022 17
Clustering Reflections: Agglomerative (3)
Reflections: hierarchical clustering (3)
• Once a decision is made to combine two clusters, it cannot be undone, i.e., an object that is in the wrong cluster will always stay there.
• No objective function is directly minimised, i.e., no ‘relative’ quality measure
© University of Melbourne 2022 18
Acknowledgements
• Materials are partially adopted from …
• Previous COMP2008 slides including material produced by ,
, , and others
• Data Mining Concepts and Techniques”, Han et al, 2nd edition 2006. • “Introduction to Data Mining”, Tan et al 2005.
© University of Melbourne 2022 19
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com