Pattern Analysis & Machine Intelligence Research Group
ECE 657A: Data and Knowledge Modelling and Analysis
Lecture 8 – Clustering
Mark Crowley
February 29, 2016
Today's Class
• Announcements
• Association Rule Mining
• Unsupervised Learning: The Clustering Problem
• Classic Algorithms
• Spectral and Kernel Clustering
• Document Clustering
Announcements
• Assignment 2: Due March 19, 11:59pm
• Final Exam: 12:30pm–3:00pm, April 7, 2017, Room RCH 302
• Project Presentations…
• Project Report: Due April 2, 11:59pm
• Course Evaluations: March 13 – 24
  • During next class we will do the evaluation during the break.
  • Can also be done anytime online at https://evaluate.uwaterloo.ca/
Project Presentations
Schedule: ~14 minutes each (plus a couple of minutes for questions/discussion while the next group sets up).

Presentation Outline (each group member must speak for part of it):
• Problem Statement: explain the algorithms and/or dataset you focused on, and why it is interesting.
• Literature Review: two papers you read and what they found.
• Experimental Design: what did you implement, how did you acquire and prepare the data, how did you remove bias and ensure quality of results?
• Something New:
• Summary of your results (so far)

Peer Evaluation: You will each be given presentations to evaluate on the week you are not presenting. Your evaluations will be graded for quality and used as input, along with the TAs', for your presentation grade.
Where are we?

Supervised Learning: p(y | x, θ)
• Learn from examples x to predict labels y.
  • Prediction/Regression
  • Classification

Unsupervised Learning: p(x, θ)
• Learn a possible distribution that generated the data x.
• Or find the "best" representation of the data automatically.
  • Density estimation
  • Synthesis
  • Denoising
  • Clustering

People also talk about:
• Semi-supervised Learning: only some datapoints have a label; learn a distribution or classifier for the remaining labels.
• Multi-instance Learning: a set X carries a label Y = y, but you don't know which datapoints are responsible for it.

Note: none of these are rigid definitions; they are just useful categories for algorithms.
Unsupervised Learning

Find the "best" representation:
• A representation that preserves as much information about the data as possible
• Yet obeys given constraints
• Hopefully the representation is simpler than the data itself.

Common ways to make a simpler representation:
1. Dimensionality Reduction
2. Sparse Representation – most entries are 0 for most datapoints x
3. Independent Representation
These are not mutually exclusive.

What ways have we seen so far to do this?

Note: dimensionality reduction – PCA does this; it learns a simpler representation, based on correlation alone, that is a rotation of the axes. It also learns an independent representation, so it does 1 and 3.
Data Clustering
• Given:
  • a set of data instances (e.g., documents, images)
  • a representation for the data instances
• Goal: divide the data instances into groups such that
  • data instances within the same group are similar
  • data instances in different groups are dissimilar
• Constraint: no prior knowledge of the labels or classes of the patterns (unsupervised)
Why Data Clustering?
• Discover natural groups in the data based on the
similarity between data instances
• Identify the groups of customers based on their
purchase history
• Identify the groups of users in a social network based
on their connections
• Identify the groups of proteins based on their structures
• Usually used for preliminary domain exploration /
exploratory data analytics
• K-means is the most commonly used algorithm
for data clustering
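To make this concrete, here is a minimal sketch (not from the slides) of running k-means with scikit-learn on synthetic data; the blob locations and parameter values are illustrative assumptions.

# Minimal k-means sketch using scikit-learn (illustrative; not part of the original slides).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Assumed toy data: three Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)          # cluster index for every data instance
print(km.cluster_centers_)          # one centroid per cluster
print(km.inertia_)                  # sum of squared distances to the nearest centroid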
Clustering vs. Classification

Classification
• Supervised
• Uses labeled data
• Requires a training phase
• Domain sensitive
• Easy to evaluate
• Examples: Naive Bayes, KNN, SVM, Decision Trees

Clustering
• Unsupervised
• Uses unlabeled data
• Organizes patterns w.r.t. an optimization criterion and a notion of similarity
• Hard to evaluate
• Examples: K-means, Fuzzy C-means, Hierarchical
Applications of Data Clustering
• Data analysis
• Bioinformatics data
• Image segmentation
• GIS and spatial data
• Data and web mining
• Document clustering
• Medical data
• Economic and financial data
Issues
• How should we represent the patterns?
• What should we use as a similarity measure?
• How do we decide on the number of clusters?
• How do we actually group the patterns, i.e. what is the clustering criterion?
• How should we assess the results (performance measures)?
• How do we interpret the resulting clusters?
Representation
• Patterns can be represented as multidimensional vectors, where each dimension is a single feature:
  x = (f1, f2, …, fd)
• The features can be quantitative or qualitative (qualitative features can be handled by the appropriate measures).
• Other representations include structures (tree-like) or symbolic objects (logical conjunctions of events).
Similarity Measures
• Distance can be used as a measure of dissimilarity (review our earlier discussion of distance measures).
• There are other applications where similarity could be defined based on concepts other than distance or proximity, e.g. conceptual similarity or density.
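As a small illustration (my own sketch, not part of the slides), dissimilarity over a whole dataset is often summarized as a pairwise distance matrix; the metric here is just an example and can be swapped for the other measures discussed earlier.

# Sketch: pairwise distance matrix as a dissimilarity measure (illustrative example).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
D = squareform(pdist(X, metric='euclidean'))   # n x n symmetric distance matrix
print(D.round(2))
# Other metrics (e.g. 'cityblock', 'cosine') can be swapped in for different notions of similarity.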
Clustering Criteria
An objective (cost) function or rule to be evaluated during the clustering process.
Examples:
• Minimize the sum of squared errors from a reference or statistic (means, modes)
• Minimize intra-cluster distances
• Maximize inter-cluster distances
• Maximize the number of common neighbors
• Distance density
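For instance, the first criterion (minimum sum of squared errors from the cluster means) can be written in a few lines; this is my own sketch with assumed toy data and labels.

# Sketch: sum-of-squared-errors clustering criterion (smaller is better).
import numpy as np

def sse(X, labels):
    """Sum of squared distances from each point to the mean of its assigned cluster."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)
        total += ((members - center) ** 2).sum()
    return total

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
print(sse(X, np.array([0, 0, 1, 1])))   # good grouping -> small SSE
print(sse(X, np.array([0, 1, 0, 1])))   # bad grouping  -> large SSE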
Clustering Approaches
• Hierarchical Approach: Single Link, Complete Link, Average Link, Ward's Algorithm, BIRCH, CURE, Chameleon
• Partitioning Approach: K-means, PAM, Fuzzy C-means, CLARA, CLARANS
• Density Based Approach: DBSCAN, DENCLUE
• Others: NN, Evolutionary, Spectral, Kernel, Affinity Propagation
(BIRCH, CURE, CLARA, and CLARANS can handle larger data sets.)
Hierarchical Approach: nested sets of clusters
a) Agglomerative: start with each pattern in its own cluster and merge close clusters until a stopping criterion is satisfied.
b) Divisive: start with all patterns in one cluster and keep dividing off the dissimilar ones.
Agglomerative Algorithms
From the current clusters, merge the two clusters that have the minimum distance between them.

Single Link
The distance between two clusters is the minimum of the distances between all pairs of patterns (one from each cluster; similar to the NN algorithm):
  d(C_i, C_j) = min_{p ∈ C_i, p′ ∈ C_j} d(p, p′)
Complete Link
The distance between two clusters is the maximum of all pairwise distances between the patterns:
  d(C_i, C_j) = max_{p ∈ C_i, p′ ∈ C_j} d(p, p′)
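Both linkage definitions reduce to a min or a max over the pairwise distance matrix between the two clusters; the following is a small sketch of my own (the helper names are made up).

# Sketch: single-link and complete-link distances between two clusters.
import numpy as np
from scipy.spatial.distance import cdist

def single_link(Ci, Cj):
    return cdist(Ci, Cj).min()    # minimum over all pairs (one point from each cluster)

def complete_link(Ci, Cj):
    return cdist(Ci, Cj).max()    # maximum over all pairs

A = np.array([[0., 0.], [0., 1.]])
B = np.array([[0., 2.], [4., 4.]])
print(single_link(A, B), complete_link(A, B))   # min and max pairwise distances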
Single and Complete Link: Example
(Figure: four groups of points A, B, C, D.)
  d_single(A, B) = minimum pairwise distance;  d_complete(A, B) = maximum pairwise distance
  d_single(C, D) > d_single(A, B)  but  d_complete(C, D) < d_complete(A, B)
So the single link algorithm will select A, B to merge, while complete link will select C, D to merge.

Hierarchical Algorithms Produce a Dendrogram
Merging (agglomerative) or splitting (divisive) results in a dendrogram or a minimum spanning tree (MST). Cut the dendrogram at the desired number of clusters.
(Figure: dendrogram over points x1–x7, built bottom-up for agglomerative and top-down for divisive.)

Average Link
Take the average of the pairwise distances. In both cases, the merge is done based on the minimum distance between clusters:
  d(C_i, C_j) = (1 / (n_i n_j)) Σ_{p ∈ C_i} Σ_{p′ ∈ C_j} d(p, p′)
Space and time are O(n²): the proximity matrix may be accessed at each iteration.

Divisive Algorithm
• Reverse the process by splitting off elements that are not sufficiently close, using a threshold and a minimum spanning tree (MST).
• One can look at it from a graph point of view (nodes are patterns, links are distances).
• Find the minimum spanning tree (which corresponds to minimum distances) and merge based on the order of distances in the MST.

Single Link (agglomerative)
Merge two clusters if the minimum distance between the points in the two clusters is below the distance threshold.
(Figure: MST over points A–E with edge weights; thresholds 2, 3, and 4 give different clusterings.)
Note: build a graph connecting all the points and repeatedly choose the closest pair.

Single Link (divisive)
We can use the MST to do a divisive version:
• Start with one cluster {A, B, C, D, E}.
• Cut the edge with the largest distance (E–D). Now we have 2 clusters: {E}, {A, B, C, D}.
• Cut edge B–C. We have {A, B}, {C, D}, {E}.
• Next: {A}, {B}, {C}, {D}, {E}.

Other agglomerative versions include:
• Median linkage: the distance between clusters is the distance between their medians,
  d(C_i, C_j) = d(Med_i, Med_j)
• Centroid linkage: the distance between clusters is the distance between their centroids (means),
  d(C_i, C_j) = d(m_i, m_j),  where m_k = (1 / n_k) Σ_{x_l ∈ C_k} x_l

Ward's Algorithm
Minimum-variance based. Compute the squared error for each cluster (from its mean),
  e_k² = Σ_{x ∈ C_k} ||x − m_k||²
and the total over all k clusters,
  E² = Σ_k e_k²
Merge the pair of clusters p, q that minimizes ΔE²_{p,q}.
Note: Ward's algorithm is really a general approach to hierarchical clustering: define an objective function and choose the clusters whose merge optimizes it; his example was minimizing variance.

Ward's Algorithm
If merging p, q produces cluster t, then ΔE²_{p,q} = e_t² − e_p² − e_q². With some algebra we get
  ΔE²_{p,q} = (n_p n_q / (n_p + n_q)) Σ_{j=1}^{d} [m_j^(p) − m_j^(q)]²

Ward's Algorithm
Instead of computing new distances between all new clusters, just update the distance between the new cluster and the other clusters. For example, for single link, merging c_p and c_q into c_t:
  d(c_t, c) = min( d(c_p, c), d(c_q, c) )  for each existing cluster c
Update the proximity matrix by deleting the rows and columns corresponding to c_p and c_q and adding a row for the new cluster c_t.
Note: store the distances as computed; it is quite efficient once the initial distances are computed.
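All of the linkages above (single, complete, average, centroid, median, Ward) are available in SciPy's hierarchical clustering module; this is a minimal usage sketch with assumed toy data.

# Sketch: agglomerative clustering and dendrogram cutting with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method='ward')          # also: 'single', 'complete', 'average', 'centroid', 'median'
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram at 2 clusters
print(np.bincount(labels))             # cluster sizes
# dendrogram(Z) would draw the merge tree if a plotting backend is available.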
Hierarchical Clustering
• Sometimes called connectivity-based clustering.
• The most common criticism of the previous HC algorithms is their lack of robustness (hence sensitivity to noise and outliers) and their O(n²) complexity in time and space.
• With the need to handle large data, some newer algorithms have been developed.

Partitional Approach
Obtain a single partition of the patterns that optimizes a criterion function. One common criterion is the sum of squared errors between patterns and cluster centroids:
  minimize J = Σ_{j=1}^{k} Σ_{x ∈ C_j} ||x − m_j||²
where C_j is the set of patterns in cluster j, n_j is the number of patterns in C_j, and m_j = (1/n_j) Σ_{x ∈ C_j} x.
Note: a different way of doing the same thing but without building a hierarchy; the result is not a tree but a set of clusters. K-means is in this family.

Partitional Approach
• This is a tough optimization problem (discrete, non-convex objective function; Ismail & Kamel, SMC 1985).
• Theoretically it is easy: try all possible partitions, evaluate J, and select the one that minimizes it. Not practical though; the number of ways to partition n patterns into k clusters is
  N(n, k) = (1/k!) Σ_{m=1}^{k} (−1)^{k−m} C(k, m) m^n
  For n = 5, k = 2: 15 partitions. For n = 10, k = 4: 34,105 partitions. For n = 50, k = 4: ≈ 5.3×10^28. For n = 100, k = 4: N ≈ 6.7×10^58.
• Solution: approximate iterative solutions.

K-means Clustering Algorithm
The most commonly used clustering algorithm of this type.
1. Choose k cluster centers [m_1, …, m_k] by randomly selecting k patterns.
2. Repeat:
3.   Distribute the patterns among the clusters using the similarity measure:
       x_j ∈ C_w if ||x_j − m_w|| < ||x_j − m_i||, for j = 1, …, n and all i ≠ w, i = 1, …, k
4.   Compute new cluster centers using the current memberships.
   Until convergence.

K-means Clustering Algorithm
Convergence can be defined as:
1. No change in the centers
2. A minimum decrease in the objective function J
3. A minimum number of reassignments of patterns to new cluster centers
The version above is Forgy's variant. Other versions:
• MacQueen: update the centers after each pattern assignment.
• ISODATA: the resulting clusters are changed by splitting and merging.

K-means Clustering Algorithm
• Sensitive to step 1, the choice of initial centers.
• Complexity is O(t·k·n) (iterations × clusters × patterns).
• It may find a local optimum and get stuck in a bad solution.
• Does not work on categorical data due to the use of the mean.
• Sensitive to outliers.

K-means Clustering Algorithm
There are variants of k-means that permit merging and splitting of the resulting clusters, or a different objective function. One variant that can handle categorical data is k-modes, which uses modes rather than means.

Sensitivity of K-means to Outliers
(Figure: a single outlier pulls a cluster centroid away from the rest of its points.)

K-means: Pros and Cons
Pros:
• Simple
• Fast for low-dimensional data
Cons:
• Can't handle non-globular clusters
• Will not identify outliers
• Restricted to data that has a notion of a center
• May converge to a bad solution and produce empty clusters (clusters with only a centroid and no other members)

K-means Clustering as a Sparse Representation
K-means divides the data into k clusters of data points that are "near" each other. This gives a k-dimensional one-hot code:
• For x_i ∈ cluster c we have a vector h with h_c = 1 and h_d = 0 for all d ≠ c.
• So most entries of h are zero.
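To make the iteration explicit, here is a compact sketch of Forgy's variant following Steps 1–4 above (my own code; the initialization and stopping rule are the simple choices from the slides).

# Sketch: the k-means loop (Forgy's variant) as described above.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 1: pick k random patterns
    for _ in range(max_iter):
        # Step 3: assign each pattern to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute centers from the current memberships.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence: no change in centers
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)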
(Figure: two non-globular clusters c1 and c2 over axes x1, x2, a shape k-means cannot separate.)

K-medoids Algorithm: PAM (Partitioning Around Medoids)
• Uses medoids to represent the centers of the clusters. The medoid is the pattern in the dataset that is most centrally located in the cluster (as opposed to the mean point).
• The idea: initially select k medoids at random, then examine all the patterns to see which should replace existing medoids to improve the objective function. Keep repeating until convergence, then group the patterns based on the new medoids.
• Complexity is high, O(t·k·(n − k)²), since each iteration examines k(n − k) pairs; not good for large sets.
Note: more robust to outliers and noise, just as the median is vs. the mean, because it uses actual datapoints rather than calculated means. The medoid has the lowest average distance/dissimilarity to all other points in the cluster.

Fuzzy Clustering
In hard clustering, each pattern belongs to one and only one cluster. Fuzzy clustering extends this by allowing each pattern to have a membership in each cluster to a different degree. The result is overlapping clusters, not really partitions.
Example (Jain, Murty & Flynn, ACM Computing Surveys, 1999): points 1–9 assigned to two fuzzy clusters with memberships
  F1 = {(1,1.0), (2,1.0), (3,1.0), (4,0.6), (5,0.55), (6,0.2), (7,0.2), (8,0.0), (9,0.0)}
  F2 = {(1,0.0), (2,0.0), (3,0.0), (4,0.4), (5,0.45), (6,0.8), (7,0.8), (8,1.0), (9,1.0)}

Fuzzy C-Means (Bezdek, 1981)
The algorithm is an extension of k-means and uses a squared-error criterion that handles fuzzy membership:
  minimize J(W, M) = Σ_{i=1}^{n} Σ_{j=1}^{k} w_ij^m d_ij²   (1)
  subject to w_ij ≥ 0, 1 ≤ i ≤ n, 1 ≤ j ≤ k   (2)
  and Σ_{j=1}^{k} w_ij = 1, 1 ≤ i ≤ n   (3)
where m is a scalar, m > 1,
d_ij is the Euclidean distance between point x_i and center m_j, w_ij is the grade of association of pattern i with cluster j, M = [m_1, m_2, …, m_k] is a d × k matrix whose columns m_j ∈ R^d, 1 ≤ j ≤ k, are the cluster centers, and W = {w_ij} is an n × k matrix.
• If m = 1 and w_ij only takes the values 1 or 0, then we get hard clustering.
• The problem has local minimum points which are not the global solution.

Algorithm FCM (Bezdek, 1981)
Step 0: Select initial membership functions W(1). Set ℓ = 1.
Step 1: Compute M(ℓ) using
  m_j(ℓ) = Σ_{i=1}^{n} w_ij(ℓ)^m x_i / Σ_{i=1}^{n} w_ij(ℓ)^m,  ∀ j
Step 2: Compute W(ℓ+1) using
  w_ij(ℓ+1) = 1 / Σ_{r=1}^{k} ( d_ij(ℓ) / d_ir(ℓ) )^{2/(m−1)},  for d_ir(ℓ) > 0, ∀ i, j
  If d_ir(ℓ) = 0, then set w_ir(ℓ+1) = 1 and w_ij(ℓ+1) = 0 for j ≠ r.
Step 3: If ||W(ℓ) − W(ℓ+1)|| < ε, stop, where ε > 0 is a small scalar. Otherwise set ℓ = ℓ + 1 and go to Step 1.
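The update formulas in Steps 1 and 2 translate directly into array operations; the following is a sketch of my own (the toy data, default m = 2, and the ε value are assumptions).

# Sketch: fuzzy c-means following Steps 0-3 above (toy defaults; not the official course code).
import numpy as np

def fcm(X, k, m=2.0, eps=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)                 # Step 0: random memberships, each row sums to 1
    for _ in range(max_iter):
        Wm = W ** m
        M = (Wm.T @ X) / Wm.sum(axis=0)[:, None]      # Step 1: m_j = sum_i w_ij^m x_i / sum_i w_ij^m
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                      # guard for the d_ir = 0 special case
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        W_new = 1.0 / ratio.sum(axis=2)               # Step 2: w_ij = 1 / sum_r (d_ij/d_ir)^(2/(m-1))
        if np.linalg.norm(W_new - W) < eps:           # Step 3: stop when W barely changes
            W = W_new
            break
        W = W_new
    return W, M

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
W, M = fcm(X, k=2)
print(M)                     # fuzzy cluster centers
print(W[:3].round(2))        # soft memberships of the first few points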
• FCM can be used to produce a hard clustering at different levels by thresholding the membership values (semi-fuzzy; Kamel and Selim, PR 1993).
• It still suffers from some of the limitations of k-means (non-globular shapes, needs a choice of k, …).
• Fuzzy C-shells and other variants detect circular and elliptical boundaries with hollow interiors, and handle missing values and different data types.
• Fuzzy C-means has better convergence properties but can still get stuck at local minima.
Clustering Approaches
• Hierarchical Approach: Single Link, Complete Link, Average Link, Ward's Algorithm, BIRCH, CURE, Chameleon
• Partitioning Approach: K-means, PAM, Fuzzy C-means, CLARA, CLARANS
• Density Based Approach: DBSCAN, DENCLUE
• Others: NN, Evolutionary, Spectral, Kernel, Affinity Propagation
(The highlighted methods – BIRCH, CURE, CLARA, CLARANS – handle large data sets.)
Density Based Clustering
• Group points based on their density
• Handles outliers
• Can detect clusters with arbitrary shapes and densities
DBSCAN(X, ε, m): Density Based Spatial Clustering of Applications with Noise
• Dense region: a neighbourhood of radius ε containing at least m points.
• Classify the points according to their density:
  • Core points: points in the interior of a dense region.
  • Border points: points on the edge of a dense region.
  • Noise points: points that are neither core nor border.
DBSCAN
• Eliminate the noise points.
• Core points that are within each other's neighbourhoods belong to the same cluster.
• Assign each border point to a cluster from which it is reachable via one of that cluster's core points.

• DBSCAN can find clusters of different shapes and sizes.
• DBSCAN can't handle clusters of varying densities well.
• It can be expensive for large data due to the neighbourhood (distance) calculations.
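A minimal usage sketch with scikit-learn's DBSCAN (the data and parameter values are illustrative assumptions); noise points are reported with the label -1.

# Sketch: DBSCAN with scikit-learn; noise points get the label -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (10, 2))])          # a few scattered "noise" points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)            # eps = neighbourhood radius, min_samples = m
labels = db.labels_
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int((labels == -1).sum()))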
DBSCAN: Core, Border and Noise Points
(Figure: original points and their point types – core, border, and noise – with Eps = 10, MinPts = 4. © Tan, Steinbach, Kumar, Introduction to Data Mining.)

When DBSCAN Works Well
(Figure: original points and the resulting clusters.)
• Resistant to noise
• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well
(Figures: the same original points clustered with MinPts = 4, Eps = 9.75 and with MinPts = 4, Eps = 9.92.)
• Varying densities
• High-dimensional data
Large Data Sets
Needs:
• One scan of the data
• Out of core, or only part in core
• Incremental
Hierarchical: BIRCH, CURE
Partitioning: CLARA, CLARANS
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
• Uses a data structure called the Clustering Feature (CF) tree, which keeps summaries of the clusters (N = number of points, LS = sum of the points, SS = square sum of the points).
• The CF tree is like a dendrogram, but it stores information that facilitates the operations needed to form a hierarchical clustering.
• The CF of a merger between two clusters i and j is the sum CF_i + CF_j.
Note: from the same era as DBSCAN, but aimed at large data sets that cannot fit in memory; it builds a tree to divide the data into smaller sets and then clusters those sets.
BIRCH
• The algorithm clusters incrementally and therefore handles large data sets.
• The tree structure uses two parameters: the number of entries (subclusters) in a leaf node (L), and a threshold (T) on the diameter of each subcluster in the leaf node.
• A point is inserted into the tree by reaching the leaf node that contains the cluster closest to it (starting from the root and traversing down to a leaf node).
Note: each leaf contains many datapoints, with its statistics kept up to date.
BIRCH
• If the point can be added without violating the threshold, then the CFs of the subcluster, the node cluster, and all parent nodes are updated.
• If not, create a new subcluster for the point and keep it in the same node, provided this does not violate the number-of-entries parameter (L).
• Otherwise, split the node so that it becomes a nonleaf node pointing to two leaf nodes.
• Nonleaf nodes can also have a limit on their branching factor (B).
BIRCH: CF Tree
(Figure: a CF tree with branching factor B = 3 and leaf size L = 5. Nonleaf entries summarize their children, e.g. CF1 = CF1.1 + CF1.2 + CF1.3 and CF1.1 = CF1.1.1 + … + CF1.1.5. Each subcluster in a leaf has maximum diameter < T and contains one or more points.)
Note: since statistics are stored for every node and leaf, and the datapoints are organized into the leaves, agglomerative clustering can then be done easily on the subclusters.
BIRCH
• The algorithm adapts to the available memory by changing the threshold T and the branching factor B: larger T and B produce a smaller (shorter) CF tree. T can be increased if the tree can't fit in memory.
• A smaller tree can be obtained by performing agglomerative clustering on the subclusters.
• A final optional step refines the clustering by reassigning the points.
• Complexity is O(n) in general.
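scikit-learn provides a BIRCH implementation whose threshold and branching_factor parameters play the roles of T and B above; this is a minimal sketch with assumed values.

# Sketch: BIRCH with scikit-learn; threshold ~ T, branching_factor ~ B.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, (200, 2)) for c in (0, 4, 8)])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)   # final global clustering into 3 clusters
labels = model.fit_predict(X)
print(np.bincount(labels))                 # points per cluster
print(len(model.subcluster_centers_))      # number of CF subclusters kept in the leaves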
Evaluation Measures
Internal
• Density of the clusters
• Inherent properties of the data and classes
External
• Evaluated against the known, correct clusters
Evaluation Measures: External
If we know the labels of the true clusters, accuracy or the F-measure can be calculated by constructing a contingency table that compares the a priori clustering to the resulting clustering.
Suppose we know a priori the clusters
  U = {U_1, U_2, …, U_R}
and the resulting clustering is
  V = {V_1, …, V_K}
Evaluation Measures: Characteristic Functions
For each pair of objects (x_i, x_j):
  I_U(i, j) = 1 if x_i ∈ U_r and x_j ∈ U_r for some 1 ≤ r ≤ R, and 0 otherwise
  I_V(i, j) = 1 if x_i ∈ V_s and x_j ∈ V_s for some 1 ≤ s ≤ K, and 0 otherwise
Evaluation Measures
Count the number of pairs of objects in the dataset that match in these ways:

                     same cluster    different cluster
  same class         a (TP)          b (FN)
  different class    c (FP)          d (TN)
Evaluation Measures
The total number of pairs of objects is
  M = a + b + c + d = n(n − 1)/2
Contingency table over pairs (rows: I_U, columns: I_V):

            I_V = 1    I_V = 0
  I_U = 1   a          b          m_1
  I_U = 0   c          d          M − m_1
            m_2        M − m_2    M
Evaluation Measures
  Rand Index = (a + d) / M
  Jaccard = a / (a + b + c)    (excludes the 0–0 agreements d)
  Fowlkes & Mallows = a / √(m_1 · m_2)
where m_1 = # of pairs of objects in the same group in U and m_2 = # of pairs of objects in the same group in V. Larger values mean the two clusterings are closer.
Evaluation Measures
• Rand: focused on relative accuracy; the number of agreements, normalized.
• Jaccard: the size of the intersection divided by the size of the union; a similarity between sets.
• Fowlkes & Mallows: normalizes the agreement count by the "bunchiness" of the data, i.e. how easy it is to cluster.
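A small sketch of my own that counts the pair types a, b, c, d directly and computes the three indices; the two labelings are made-up examples.

# Sketch: pair-counting external evaluation (Rand, Jaccard, Fowlkes & Mallows).
from itertools import combinations
from math import sqrt

def pair_count_indices(true_labels, cluster_labels):
    a = b = c = d = 0
    for (i, j) in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            a += 1      # TP: same class, same cluster
        elif same_class:
            b += 1      # FN: same class, different cluster
        elif same_cluster:
            c += 1      # FP: different class, same cluster
        else:
            d += 1      # TN: different class, different cluster
    M = a + b + c + d                                  # = n(n-1)/2
    rand = (a + d) / M
    jaccard = a / (a + b + c)
    fm = a / sqrt((a + b) * (a + c))                   # m1 = a + b, m2 = a + c
    return rand, jaccard, fm

U = [0, 0, 0, 1, 1, 1]          # "true" grouping
V = [0, 0, 1, 1, 2, 2]          # clustering result
print(pair_count_indices(U, V))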
Evaluation Measures
If we can't specify the class of each cluster, we can compute the F-measure of the cluster with regard to each class. For cluster i and class j:
  precision(i, j) = m_ij / n_i
  recall(i, j) = m_ij / m_j
where m_ij = # of objects of class j in cluster i, n_i = # of objects in cluster i, and m_j = # of objects in class j.
Evaluation Measures
The F-measure of cluster i with respect to class j is
  F(i, j) = 2 · precision(i, j) · recall(i, j) / (precision(i, j) + recall(i, j))
For k clusters and C classes, the overall F-measure is
  F = Σ_{j=1}^{C} (m_j / n) · max_{i ∈ 1..k} F(i, j)
Entropy
The degree to which each cluster consists of objects of a single class:
  e_i = − Σ_{j=1}^{C} P_ij log₂ P_ij
where e_i is the entropy of cluster i and P_ij = m_ij / n_i is the probability that a member of cluster i belongs to class j.
Total entropy over the k clusters (n = total number of points):
  e = Σ_{i=1}^{k} (n_i / n) e_i
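Given the contingency counts m_ij (objects of class j in cluster i), both the overall F-measure and the total entropy can be computed in a few lines; this is my own sketch with a made-up 2×2 table.

# Sketch: overall F-measure and total entropy from a cluster-by-class contingency matrix.
import numpy as np

m = np.array([[45,  5],        # m[i, j] = # of objects of class j in cluster i (made-up counts)
              [10, 40]])
n_i = m.sum(axis=1)            # objects per cluster
m_j = m.sum(axis=0)            # objects per class
n = m.sum()

precision = m / n_i[:, None]   # precision(i, j) = m_ij / n_i
recall = m / m_j[None, :]      # recall(i, j)    = m_ij / m_j
denom = np.where(precision + recall > 0, precision + recall, 1.0)
F = 2 * precision * recall / denom
overall_F = np.sum((m_j / n) * F.max(axis=0))                 # F = sum_j (m_j/n) * max_i F(i, j)

P = m / n_i[:, None]                                          # P_ij = m_ij / n_i
e_i = -np.sum(P * np.log2(np.where(P > 0, P, 1.0)), axis=1)   # entropy of each cluster
total_entropy = np.sum((n_i / n) * e_i)                       # e = sum_i (n_i/n) * e_i

print(round(float(overall_F), 3), round(float(total_entropy), 3))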
Internal Measures
Evaluate how well the results of a clustering algorithm perform without reference to external information (i.e. when we don't have the ground truth).
Two types of measures:
1. Cluster Cohesion (compactness, tightness): how closely related the patterns in the same cluster are.
2. Cluster Separation (isolation): how well separated a cluster is from the other clusters.
Internal Measures
  cohesion(C_i) = Σ_{x ∈ C_i} Σ_{y ∈ C_i} proximity(x, y)
Proximity can be a function of the similarity between patterns (e.g. cosine similarity). For prototype-based clusters,
  cohesion(C_i) = Σ_{x ∈ C_i} proximity(x, m_i)
where m_i is the prototype (or center) of cluster i.

  overall cohesion = Σ_{i=1}^{k} w_i · cohesion(C_i)   over the k clusters
where w_i is a weight that depends on the size of the cluster (e.g. 1 or 1/n_i) or on the distances within it (e.g. 1/max d or 1/Σ d²).
Internal Measures
The separation between two clusters can be measured as the distance between them:
  d(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)    (single-link like)
or
  d(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} d(x, y)    (complete-link like)
The overall separation can be the minimum pairwise separation between the clusters:
  overall separation = min_{i = 1..k, j = i+1..k} d(C_i, C_j)
Ratios & Indices
Separation Index (SI):
  SI = [ Σ_{i=1}^{k} Σ_{x_j ∈ C_i} d(x_j, m_i)² ]  /  [ n · min_{C_r, C_s ∈ C} d(C_r, C_s) ]
The numerator measures the within-cluster scatter. Smaller SI indicates more separate clusters.
Dunn Index
  DC = min_{i = 1..k} min_{j = i+1..k} [ d(C_i, C_j) / max_{l = 1..k} diam(C_l) ]
where
  d(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)    (distance between clusters: separation)
  diam(C) = max_{x, y ∈ C} d(x, y)    (dispersion of the cluster: the inverse of cohesion)
Large values indicate compact and well separated clusters.
Other measures: see Jain and Dubes (Chapter 4) and Halkidi et al., Journal of Intelligent Information Systems, Dec. 2001.
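A direct transcription of the definition above (my own sketch; the helper name and toy data are assumptions).

# Sketch: Dunn index = min inter-cluster distance / max cluster diameter.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(clusters):
    """clusters: list of (n_i, d) arrays, one per cluster."""
    diameters = [pdist(C).max() if len(C) > 1 else 0.0 for C in clusters]
    min_sep = min(cdist(clusters[i], clusters[j]).min()
                  for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    return min_sep / max(diameters)

A = np.random.randn(30, 2)
B = np.random.randn(30, 2) + 6
print(dunn_index([A, B]))    # large value: compact, well separated clusters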