RECORD THE LECTURE
CLUSTERING
COMP2420/COMP6420 INTRODUCTION TO DATA MANAGEMENT, ANALYSIS AND SECURITY
WEEK 5 – LECTURE 2 Wednesday 23 March 2022
School of Computing
College of Engineering and Computer Science
Credit: (previous course convenor)
HOUSEKEEPING
Class Feedback form
• Survey developed by course reps
• Course reps will give quick overview
Lab-test (week 04)
Ambiguities/Error – 1
If we want to draw a graph to compare the average amount of time women of different age groups (i.e., 20-30, 30-40, 40-50) spend on make-up and dressing, which of the following types of graphs would be the best?
A. Histogram
> B. Box Plot
C. Scatter Plot
D. Pie Chart
We acknowledge the error in this question's wording. It should have said 'the number of women of different age groups' for the answer 'Histogram' to be correct. A histogram, as students have flagged, should have a count on the y-axis.
The correct answer is awarded to 'box plot', given the wording of the question, as it is the best option for comparing the averages of different groups. This was advised by Dr Hwan-Jin, who is a statistician and gave our lectures on data analysis.
Lab-test (week 04)
Ambiguities/Error – 2
Which of the following descriptions of data describes qualitative data?
A. Responses from a survey on customers' opinions towards the dramatic increase in the price of beef sold in supermarkets over the last two years.
>B. University IDs of all students in College of Engineering and Computer Science.
C. Customer satisfaction with the service of RAKU Japanese Restaurant on a scale of 1 to 5.
>D. Options A and B.
We acknowledge the ambiguity in this question wording.
Dr Hwan-Jin, who gave our data analysis lectures, was consulted on this question, as was A/Prof Patrick L', an expert on qualitative research and research methods from CBE.
Patrick pointed to the ambiguity around the use of the word ‘survey’ without explaining what type of survey or type of questions it had.
Option A should have stated that the survey had ‘open-ended’ questions for it to be considered qualitative as a survey can capture quantitative data even on opinions.
Option B was unanimously agreed to be qualitative data by both Hwan-Jin and Patrick. University ID is nominal data (categorical) and therefore qualitative. This is in the lecture and notes Hwan-Jin provided.
As a result, I have decided that both options B and D will be awarded as correct. Answer A by itself would leave out B, which is qualitative;
B by itself is acceptable, as A could consist entirely of quantitative questions.
Assignment 1
• If you have not forked the repo yet, start now, urgently!
• Don't wait until the last minute
Learning Outcomes
01 Recap unsupervised machine learning
02 Explain what clustering is
03 Apply clustering using the k-means algorithm
04 Evaluate the performance of a clustering algorithm
05 Evaluate how many clusters to implement
INTRODUCTION
Unsupervised Learning
Supervised Learning: Given a set of features and labels $\{x_i, y_i\}$, learn a mapping from features to target
Unsupervised Learning: We only have features $x_i$; the goal is to "summarize" or "find patterns" or "structure". No way to get feedback on our output
Use cases:
• Find clusters
• Model data density
• Dimensionality reduction
Often useful in exploratory data analysis, and as a pre-processing step for supervised learning
• Ask the class: what is clustering?
What is clustering?
• The organization of unlabeled data into similarity groups called clusters
• A cluster is a collection of data items that are 'similar' to each other and 'dissimilar' to data items in other clusters.
Why use clustering?
Clustering is grouping similar objects together
• To establish prototypes, or detect outliers
• To simplify data for further analysis/learning
• To visualize data (in conjunction with dimensionality reduction)
Clustering algorithms:
• Employ some notion of distance between objects (a minimal sketch follows this list)
• Have an explicit or implicit criterion defining what a good cluster is
• Heuristically optimize that criterion to determine the clustering
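For instance, with Euclidean distance (a minimal NumPy sketch; the points are made up):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
print(np.linalg.norm(a - b))  # 5.0
```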
Intra-cluster and Inter-cluster distances
Ambiguity with number of clusters (k)
Attribution:
Types of clustering
Flat vs Hierarchical
• Flat: divide data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical: A set of nested clusters organized as a hierarchical tree
Hard vs Soft
• Hard: Each point in exactly one cluster
• Soft: a point belongs to every cluster with some degree of membership (a sketch follows)
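A hedged sketch of the hard/soft distinction, assuming scikit-learn: KMeans assigns each point exactly one label, while a Gaussian mixture model (one common soft-clustering method, not covered further here) returns a membership probability for every cluster. The toy data is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy data: two loose groups (values are illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)

# Hard clustering: each point gets exactly one cluster label
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard)  # e.g. [0 0 0 1 1 1]

# Soft clustering: each point gets a probability for every cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X).round(3))  # each row sums to 1
```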
Flat clustering
Attribution: (EMIS7331, Data Mining course at SMU)
Hierarchical clustering
Attribution: (EMIS7331, Data Mining course at SMU)
K-Means Clustering
What is k-means clustering?
• One of the most commonly-used clustering algorithms, because it is easy to implement and quick to run
• Assumes the objects (instances) to be clustered are $n$-dimensional vectors $x_i$
• Uses a distance measure between the instances (typically Euclidean distance)
• The goal is to partition the data into K disjoint subsets
Definition of centroid:
Each cluster in K-means is defined by a centroid
$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$
where we use $\omega$ to denote a cluster
Centroids act as centers of gravity, abstractions of the cluster
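Concretely, the centroid is just the coordinate-wise mean of the cluster's member vectors (a NumPy sketch with made-up points):

```python
import numpy as np

# Three points assigned to one cluster (illustrative values)
cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

# Centroid = mean of the member vectors, computed per dimension
centroid = cluster.mean(axis=0)
print(centroid)  # [3. 2.]
```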
Algorithm K-Means
• Objective/partitioning criterion: minimize the average squared difference from the centroid
• We try to find the minimum average squared difference by iterating two steps:
– reassignment: assign each vector to its closest centroid
– recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
Until stopping criterion met
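A minimal from-scratch sketch of this two-step loop, assuming NumPy; the names and the convergence check are our own, not the lecture's:

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Reassignment: assign each vector to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recomputation: each centroid becomes the mean of its assigned vectors
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `labels, centroids = kmeans(X, 3)` on an (n, d) array X returns one label per row plus the final k centroids.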
Example of one iteration
Attribution:
Formal Algorithm
The annotated steps of the pseudocode:
• Centroids are selected at random initially
• Iterate through every point in the data set
• Iterate through the k clusters, where k is pre-determined
• Find the centroid which is closest to the given data point
• Assign that data point (vector) to the cluster of the closest centroid
Attribution: IR book
Worked example
Initial Data
assign into 3 clusters randomly
compute centroids
reassign clusters
recompute centroids
reassign clusters
recompute centroids – done!
What if we do not know the right number of clusters?
assign into 4 clusters randomly
compute centroids
reassign clusters
recompute centroids
reassign clusters
recompute centroids – done!
Visualisation
K-means visualisation demo site (try it out)
Salient points
• K-means is guaranteed to converge (proof beyond scope)
• It might take many iterations for convergence
• In practice we can always stop after 20-50 iterations (for the problem sizes we are handling)
• Is convergence optimal?
• It doesn't converge to the same points with different initializations!
Problems with convergence
• What is the optimal clustering for K = 2?
• Do we converge on this clustering for arbitrary seeds $d_i, d_j$?
(1) d2 and d5 as initial seeds (2) d2 and d3 as initial seeds
How good are your clusters?
• If the clustering is used as a pre-processing step for supervised learning, measure the performance of the supervised learner
• Measure the “tightness” of the clusters: points in the same cluster should be close together, points in different clusters should be far apart
• Tightness can be measured by (a rough sketch follows this list):
– the minimum distance between points in different clusters
– the maximum distance between points in the same cluster
– the average distance between points in the same cluster (can be normalized by the average distance between clusters)
Problem: these measures usually favour large numbers of clusters, so some form of regularization or description-length penalty is necessary
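A rough sketch of the last measure, average intra-cluster distance normalized by average inter-cluster distance (NumPy; the O(n²) pairwise matrix and the function name are our own choices):

```python
import numpy as np

def tightness(X, labels):
    """Average same-cluster distance divided by average different-cluster
    distance; lower means tighter clusters. O(n^2), for illustration only."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones(D.shape, dtype=bool), k=1)  # count each pair once
    return D[same & upper].mean() / D[~same & upper].mean()
```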
Initialisation of K-means
• Random seed selection is not very robust – it’s easy to get a suboptimal clustering
• Better ways of computing initial centroids:
– Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the input space)
– Use hierarchical clustering to find good seeds
– Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the best tightness (a scikit-learn sketch follows)
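In practice, libraries already bundle these fixes. For example, scikit-learn's KMeans uses the k-means++ seeding heuristic by default and reruns the algorithm n_init times, keeping the run with the lowest within-cluster sum of squares (the blob data below is synthetic, purely for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means++ seeding plus 10 restarts; the best run (lowest inertia) is kept
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # within-cluster sum of squares of the best run
```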
Choosing the number of clusters
• A difficult problem, no one correct way
• Delete clusters that cover too few points
• Split clusters that cover too many points
• Add extra clusters for “outliers”
• Minimize loss with a penalty for each extra cluster suggested
– just minimizing loss will always favor K = N clusters, where N is the total number of points
– penalize the algorithm for each extra cluster formed at a rate λ
Some commonly used approaches
Assume that the data set has been clustered into 𝑘 clusters
We will look briefly at the following three approaches:
• Elbow method
• Silhouette method
• Gap statistic
Elbow method
Find the total within-cluster sum of squares (WCSS) for a range of numbers of clusters:
$$\text{Total WCSS} = \sum_{i=1}^{k} \sum_{j \in C_i} d(j, \hat{c}_i)^2$$
Here, $\hat{c}_i$ stands for the centroid of cluster $C_i$.
Choose the number of clusters that corresponds to the point where the graph tapers or elbows off (the elbow point)
Optimal k = 3
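A sketch of the elbow computation with scikit-learn, whose inertia_ attribute is exactly the total within-cluster sum of squares (synthetic blob data again):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

# Total WCSS (inertia) for a range of candidate k values
ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("total within-cluster sum of squares")
plt.show()  # look for the k where the curve elbows off
```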
Silhouette method
• Determines how well each data point lies within its cluster
• High average silhouette width indicates good clustering
For a data point $i$ in cluster $C_I$, let
$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)$$
where $d(i, j)$ is the distance between points $i$ and $j$ within the cluster $C_I$, and
$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$
where $d(i, j)$ is the distance between point $i$ in cluster $C_I$ and point $j$ in the cluster $C_J$.
Now, the silhouette value $s(i)$ of $i$ is given by:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \ \text{if } |C_I| > 1, \qquad s(i) = 0 \ \text{if } |C_I| = 1$$
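scikit-learn implements these formulas directly; a brief sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))    # mean s(i) over all points
print(silhouette_samples(X, labels))  # per-point silhouette values s(i)
```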
Silhouette plot
What happened to the values to the left of zero?
Attribution: Three types of animals from zoo dataset
Silhouette width vs # clusters
Attribution: Silhouette width
Gap statistic
Compares the within cluster variation with their expected values under a null reference distribution of the data (i.e. a distribution with no obvious clustering), for different values of 𝑘
The gap statistic for a given $k$ is defined by:
$$\text{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$$
where $E_n^*$ denotes expectation under a sample of size $n$ from the reference distribution and $W_k$ is the pooled within-cluster variation. Use the $k$ that maximizes the gap statistic
Ref: Estimating the number of clusters in a data set via the gap statistic (2001), Tibshirani, Walther and Hastie
Gap statistic (intuition)
• Setting a Null Hypothesis: There is no obvious clustering in the data
• Research hypothesis: k Clusters exist in the data
Generate a reference distribution of the data (i.e. a distribution with no obvious clustering), e.g. using Monte Carlo simulations
Compare the actual distribution for different k with this reference distribution.
Take value of k which is most different
Gap statistic algorithm
Attribution: Gap statistic
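A rough Monte Carlo sketch of the gap statistic, using a uniform reference distribution over the data's bounding box (one of the reference choices in Tibshirani et al.; the implementation details here are our own, with KMeans inertia standing in for the pooled within-cluster variation):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)

    def log_wk(data):
        # log of the within-cluster dispersion (KMeans inertia) for k clusters
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        return np.log(km.inertia_)

    # Reference datasets: uniform samples over the bounding box of X
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_logs = [log_wk(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_refs)]

    # Gap(k) = E*[log W_k] - log W_k; choose the k that maximizes this
    return np.mean(ref_logs) - log_wk(X)
```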
Example application of clustering: Color quantization
• Suppose you have an image stored with 24 bits per pixel
• You want to compress it so that you use only 8 bits per pixel (256 colours)
• You want the compressed image to look as similar as possible to the original image
Attribution: Bishop
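A hedged sketch of colour quantization with k-means, assuming scikit-learn and Pillow are available; the file names are placeholders and fitting on a pixel sample is our own speed shortcut:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Placeholder file name; any RGB image will do
img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float) / 255.0
pixels = img.reshape(-1, 3)  # one row per pixel, 3 colour channels

# Cluster pixel colours into 256 groups (8 bits per pixel);
# fit on a random sample of pixels for speed
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), size=10_000, replace=False)]
km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(sample)

# Replace every pixel with the centroid colour of its nearest cluster
quantized = km.cluster_centers_[km.predict(pixels)].reshape(img.shape)
Image.fromarray((quantized * 255).astype(np.uint8)).save("photo_8bit.png")
```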
Advantages and Disadvantages of K-means
Advantages
• Simple algorithm, easy to understand
• Deals efficiently with large datasets
Disadvantages
• Need to specify k in advance
• Sensitive to outliers
• Can produce different results depending on how it is seeded
WHAT WE LEARNT TODAY
• What is clustering?
• Applications of clustering: 𝐾 −means algorithm
• Evaluation of clustering
• How many clusters?
End of the Machine Learning part of this course
Next week: Data Ethics and the start of Databases
Live coding
Mindika will do the live coding session on Clustering next.