
Pattern Analysis & Machine Intelligence Research Group


ECE 657A : Data and Knowledge Modelling and Analysis
Lecture 8 – Clustering

Mark Crowley

February 29, 2016


Today's Class

• Announcements
• Association Rule Mining
• Unsupervised Learning: The Clustering Problem
• Classic Algorithms
• Spectral and Kernel Clustering
• Document Clustering


Announcements

• Assignment 2: Due March 19, 11:59pm
• Final Exam: 12:30pm–3pm, April 7, 2017, Room RCH 302
• Project Presentations…
• Project Report: Due April 2, 11:59pm
• Course Evaluations: March 13 – 24
  • During next class we'll do the evaluation during the break.
  • Can also be done anytime online at https://evaluate.uwaterloo.ca/


Project Presentations

Schedule: ~14 minutes each (plus a couple of minutes for questions/discussion while the next group sets up).

Presentation Outline (each group member must speak for part of it):
• Problem Statement: explain the algorithms and/or dataset you focused on and why it's interesting.
• Literature Review: two papers you read and what they found.
• Experimental Design: what you implemented; how you acquired and prepared the data; how you removed bias and ensured the quality of the results.
• Something New:
• Summary of your results (so far).

Peer Evaluation: You will each be given presentations to evaluate on the week you are not presenting. Your evaluations will be graded for quality and used, along with the TAs' evaluations, as input to your presentation grade.


Where are we?

Supervised Learning: p(y | x, θ)
• Learn from examples x to predict labels y.
  • Prediction/Regression
  • Classification

Unsupervised Learning: p(x | θ)
• Learn a possible distribution that generated the data x.
• Or find the "best" representation of the data automatically.
  • Density estimation
  • Synthesis
  • Denoising
  • Clustering

People also talk about:
• Semi-supervised Learning: only some datapoints have a label; learn the distribution or a classifier for the remaining labels.
• Multi-instance learning: the set X contains the label Y = y, but you don't know which datapoints carry it.

None of these are rigid or formal definitions; they are just useful categories for algorithms.


Unsupervised Learning

Find the "best" representation:
• A representation that preserves as much information about the data as possible
• Yet obeys given constraints
• Hopefully the representation is simpler than the data itself.

Common ways to make a simpler representation:
1. Dimensionality Reduction
2. Sparse Representation – most entries are 0 for most datapoints x
3. Independent Representation

These are not mutually exclusive.
What ways have we seen so far to do this?

Presenter
Presentation Notes
Dimensionality reduction – PCA does this: it learns a simpler representation, based on correlation alone, that is a rotation of the axes. It also learns an independent representation, so it does 1 and 3.


Data Clustering

• Given:
  • a set of data instances (e.g., documents, images)
  • a representation for the data instances
• Goal: divide the data instances into groups such that
  • data instances in the same group are similar
  • data instances in different groups are dissimilar
• Constraint: no prior knowledge of the labels or classes of the patterns (unsupervised)


Why Data Clustering?

• Discover natural groups in the data based on the similarity between data instances
  • Identify groups of customers based on their purchase history
  • Identify groups of users in a social network based on their connections
  • Identify groups of proteins based on their structures
• Usually used for preliminary domain exploration / exploratory data analytics
• K-means is the most commonly used algorithm for data clustering


Clustering vs. Classification

Classification
• Supervised
• Uses labeled data
• Requires a training phase
• Domain sensitive
• Easy to evaluate
• Examples: Naive Bayes, KNN, SVM, Decision Trees

Clustering
• Unsupervised
• Uses unlabeled data
• Organizes patterns w.r.t. an optimization criterion
• Notion of similarity
• Hard to evaluate
• Examples: K-means, Fuzzy C-means, Hierarchical


Applications of data clustering

• Data analysis
• Bioinformatics data
• Image segmentation
• GIS and spatial data
• Data and web mining
• Document clustering
• Medical data
• Economic and financial data


Issues

• How should we represent the patterns?
• What should we use as a similarity measure?
• How do we decide on the number of clusters?
• How do we actually group the patterns, i.e. what is the clustering criterion?
• How should we assess the results (performance measures)?
• How do we interpret the resulting clusters?


Representation

• Patterns can be represented as multidimensional vectors, where each dimension is a single feature:
  x = (f1, f2, …, fd)
• The features can be quantitative or qualitative (qualitative features can be handled with appropriate measures).
• Other representations include structures (tree-like) or symbolic objects (logical conjunctions of events).
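As a small, hypothetical illustration of this vector representation (not from the slides), the sketch below builds a pattern matrix in Python, one-hot encoding a qualitative feature so that every pattern becomes a numeric vector x = (f1, …, fd); the feature names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical patterns with two quantitative features and one qualitative feature.
raw = pd.DataFrame({
    "height_cm": [170.0, 182.5, 160.2],
    "weight_kg": [65.0, 90.1, 55.4],
    "blood_type": ["A", "B", "A"],   # qualitative feature
})

# One common option: one-hot encode the qualitative feature so that every
# pattern becomes a fixed-length numeric vector x = (f1, ..., fd).
X = pd.get_dummies(raw, columns=["blood_type"]).to_numpy(dtype=float)
print(X.shape)   # (3 patterns, d features)
```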


Similarity Measures

Distance can be used as a measure of dissimilarity (review what we covered earlier on distance measures).
There are other applications where similarity could be defined based on concepts other than distance or proximity, e.g. conceptual similarity or density.

(Figure: two patterns that are very close in terms of distance.)
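As a minimal sketch (not from the slides), the snippet below computes pairwise dissimilarities under two different measures using SciPy; the toy points are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])

# Different measures give different notions of "close".
d_euclidean = squareform(pdist(X, metric="euclidean"))
d_cosine = squareform(pdist(X, metric="cosine"))
print(d_euclidean.round(3))
print(d_cosine.round(3))
```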


Clustering Criteria

An objective or cost function (or rule) to be evaluated during the process of clustering.

Examples:
• Min sum of squared errors from a reference or statistic (means, modes)
• Min intra-cluster distances
• Max inter-cluster distances
• Max number of common neighbors
• Distance density
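As a hedged sketch of how two of these criteria could be evaluated for a given grouping (the function names and helper structure are mine, not from the slides):

```python
import numpy as np

def sum_squared_error(X, labels, centers):
    """Sum of squared errors of patterns from their cluster centers (to minimize)."""
    return sum(float(np.sum((X[labels == j] - c) ** 2))
               for j, c in enumerate(centers))

def min_inter_cluster_distance(centers):
    """Minimum distance between cluster centers (to maximize)."""
    centers = np.asarray(centers)
    return min(np.linalg.norm(centers[i] - centers[j])
               for i in range(len(centers)) for j in range(i + 1, len(centers)))
```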

Clustering Approaches

• Hierarchical Approach: Single Link, Complete Link, Average Link, Ward's Algorithm, BIRCH, CURE, Chameleon
• Partitioning Approach: K-means, PAM, Fuzzy C-means, CLARA, CLARANS
• Density Based Approach: DBSCAN, DENCLUE
• Others: NN, Evolutionary, Spectral, Kernel, Affinity Propagation

(BIRCH, CURE, CLARA and CLARANS are designed to handle larger data sets.)

Hierarchical Approach: nested sets of clusters

a) Agglomerative: start with each pattern in its own cluster and merge close clusters until a stopping criterion is satisfied.

b) Divisive: start with all patterns in one cluster and keep dividing off the dissimilar ones.

Agglomerative Algorithms

From the current clusters, merge the two clusters that have the minimum distance between them.

Single Link
The distance between two clusters is the minimum of the distances between all pairs of patterns (one from each cluster; similar to the NN algorithm):

  d(C_i, C_j) = min_{p∈C_i, p′∈C_j} d(p, p′)

Presenter
Presentation Notes
Single link and complete link. Example coming up.

Complete Link
The distance between two clusters is the maximum of all pairwise distances between the patterns:

  d(C_i, C_j) = max_{p∈C_i, p′∈C_j} d(p, p′)

Presenter
Presentation Notes
Example coming up.

Single and complete link

(Figure: two pairs of clusters, A and B, and C and D. d_single(A, B) is the minimum pairwise distance and d_complete(A, B) is the maximum pairwise distance. In this example d_single(C, D) > d_single(A, B) while d_complete(C, D) < d_complete(A, B).)

The single link algorithm will select A, B to merge; complete link will select C, D to merge.

Hierarchical algorithms produce a Dendrogram
Merging or splitting results in a dendrogram or minimum spanning tree (MST); cut it at the desired number of clusters. (Figure: a dendrogram over patterns x1–x7, read bottom-up for agglomerative clustering and top-down for divisive clustering.)

Average Link
Take the average of the pairwise distances:

  d(C_i, C_j) = (1 / (n_i n_j)) Σ_{p∈C_i} Σ_{p′∈C_j} d(p, p′)

In both cases, the merge is done based on the minimum distance between clusters. Space and time are O(n²): a proximity matrix is stored and each iteration may access it.

Presenter
Presentation Notes
Another way to do it.

Divisive Algorithm
• Reverse the process by splitting off elements that are not sufficiently close, using a threshold and a minimum spanning tree (MST).
• One can look at it from a graph point of view (nodes are patterns, links are distances).
• Find the minimum spanning tree (which corresponds to minimum distance) and merge based on the order of distances in the MST.

Presenter
Presentation Notes
We'll talk about graphs later.

Single Link (agglomerative)
Merge two clusters if the minimum distance between the points in the two clusters is below the distance threshold. (Figure: a pairwise distance matrix over points A–E, the corresponding MST, and the clusterings obtained at thresholds 2, 3, and 4.)

Presenter
Presentation Notes
How the algorithms work: build a graph connecting all the points and choose the closest.

Single Link (divisive)
• We can use the MST to do a divisive version.
• Start with one cluster {A, B, C, D, E}.
• Cut the edge with the largest distance, ED. Now we have 2 clusters: {E}, {A, B, C, D}.
• Cut edge BC. We have {A, B}, {C, D}, {E}.
• Next: {A}, {B}, {C}, {D}, {E}.

Other agglomerative versions include:
• Median linkage: the distance between clusters is the distance between their medians, d(C_i, C_j) = d(Med_i, Med_j).
• Centroid linkage: the distance between clusters is the distance between their centroids (means), d(C_i, C_j) = d(m_i, m_j), where m_k = (1/n_k) Σ_{x_l∈C_k} x_l.

Ward's Algorithm
• Minimum-variance based.
• Computes the squared error for each cluster ℓ (deviations from its mean):
  e_ℓ² = Σ_{i=1}^{n_ℓ} Σ_{j=1}^{d} [x_ij − m_ℓj]²
• Total for all k clusters: E² = Σ_{ℓ=1}^{k} e_ℓ²
• Merge the pair of clusters p, q that minimizes ΔE²_{p,q}.

Presenter
Presentation Notes
Ward's algorithm is really a general approach to hierarchical clustering: define an objective function and choose clusters to merge so as to optimize that function. His example was minimizing variance.

Ward's Algorithm
• If merging p and q produces cluster t, then ΔE²_{p,q} = e_t² − e_p² − e_q².
• With some algebra we get
  ΔE²_{p,q} = (n_p n_q / (n_p + n_q)) Σ_{j=1}^{d} [m_pj − m_qj]²

Instead of computing new distances between all clusters from scratch, just update the distance between the new cluster and the other clusters; e.g. for single link, when merging c_p and c_q into c_t,
  d(c_t, c) = min( d(c_p, c), d(c_q, c) )  for each existing cluster c.
Update the proximity matrix by deleting the rows and columns corresponding to c_p and c_q and adding a row corresponding to the new cluster c_t.

Presenter
Presentation Notes
Store the distances as computed, so it's pretty efficient once the initial distances are computed.
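A minimal SciPy sketch (not from the slides) of the linkages discussed above; the synthetic data is made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),    # two well-separated blobs
               rng.normal(5.0, 0.5, (20, 2))])

Z_single = linkage(X, method="single")      # min pairwise distance
Z_complete = linkage(X, method="complete")  # max pairwise distance
Z_average = linkage(X, method="average")    # average pairwise distance
Z_ward = linkage(X, method="ward")          # Ward's minimum-variance criterion

# Cut the dendrogram at the desired number of clusters.
labels = fcluster(Z_ward, t=2, criterion="maxclust")
# dendrogram(Z_ward)   # plot the dendrogram with matplotlib, if desired
```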
Hierarchical Clustering
• Sometimes called connectivity-based clustering.
• The most common criticisms of the previous HC algorithms are their lack of robustness (and hence sensitivity to noise and outliers) and their O(n²) complexity in time and space.
• With the need to handle large data, some new algorithms have been developed.

Partitional Approach
Obtain a single partition of the patterns that optimizes a criterion function. One common criterion is the sum of squared errors between patterns and cluster centroids:

  minimize J = Σ_{j=1}^{k} Σ_{x∈C_j} ||x − m_j||²,  where m_j = (1/n_j) Σ_{x∈C_j} x,

C_j = the set of patterns in cluster j and n_j = the number of patterns in C_j.

Presenter
Presentation Notes
A different way of doing the same thing but without building a hierarchy: not a tree but sets. K-means is in this family.

Partitional Approach
• This is a tough optimization problem (a discrete, non-convex objective function; Ismail & Kamel, SMC 1985).
• Theoretically it is easy: try all possible partitions, evaluate J and select the one that minimizes it. Not practical though, since the number of partitions of n patterns into k clusters is
  N(n, k) = (1/k!) Σ_{m=1}^{k} (−1)^{k−m} C(k, m) m^n.
  For n = 5, k = 2: 15 partitions. For n = 10, k = 4: 34,105 partitions. For n = 50, k = 4: N ≈ 5.3×10^28. For n = 100, k = 4: N ≈ 6.7×10^58.
• Solution: approximate iterative solutions.

K-means Clustering Algorithm
The most commonly used clustering algorithm of this type.
1. Choose K cluster centers [m_1, …, m_K] by randomly selecting K patterns.
2. Repeat:
3.   Distribute the patterns among the clusters using the similarity measure: x_j ∈ c_w if ||x_j − m_w|| < ||x_j − m_i|| for j = 1, …, n and all i ≠ w, i = 1, …, K.
4.   Compute new cluster centers using the current memberships.
   Until convergence.

Convergence can be:
1. No change in the centers
2. Minimum decrease in the objective function J
3. Minimum reassignment of patterns to new cluster centers

The above version is Forgy's variant. Other versions:
• MacQueen: update centers after each pattern assignment.
• ISODATA: the resulting clusters are changed by splitting and merging.

K-means Clustering Algorithm
• Sensitive to step 1, the choice of initial centers.
• Complexity O(t k n), for t iterations, k clusters, n patterns.
• It may find a local optimum and get stuck in a bad solution.
• Does not work on categorical data, due to the use of the mean.
• Sensitive to outliers.

There are variants of k-means that permit merging and splitting of the resulting clusters, or a different objective function. One variant that can handle categorical data is k-modes, which uses modes rather than means.

Sensitivity of K-means to Outliers
(Figure: a set of points o with a single outlier x.)

K-means
Pros:
• Simple
• Fast for low-dimensional data
Cons:
• Can't handle non-globular clusters
• Will not identify outliers
• Restricted to data which has a notion of center
• May converge to a bad solution and produce empty clusters (clusters with only a centroid and no other members)

K-means Clustering as a Sparse Representation
K-means divides the data into k clusters of data points that are "near" each other. K-dimensional one-hot code:
• For x_i ∈ c we have a vector h with h_c = 1 and h_d = 0 for all d ≠ c.
• So most entries of h are zero.
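A numpy sketch of Forgy's variant described above, plus the one-hot code; the function name and data are mine, not from the slides:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Forgy-style k-means: random initial centers, then alternate assignment
    and mean updates until no pattern is reassigned."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # convergence: no reassignment
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):                 # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.random.default_rng(1).normal(size=(100, 2))
labels, centers = kmeans(X, k=3)
H = np.eye(3)[labels]   # one-hot code: each row has a single 1 in its cluster's column
```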
(Figure: non-globular shapes — two clusters c1 and c2 in the x1–x2 plane that k-means cannot separate.)

K-medoids Algorithm — PAM (Partitioning Around Medoids)
• Uses medoids to represent the centers of the clusters.
• A medoid is the pattern in the dataset that is centrally located in the cluster (as opposed to the mean point).
• The idea is to initially select K medoids at random, then examine all the patterns to see which should replace the existing medoids to improve the objective function. Keep repeating until convergence, then group the patterns based on the new medoids.
• Complexity is high, O(t k (n − k)²), since each iteration examines k(n − k) pairs; not good for large sets.

Presenter
Presentation Notes
More robust to outliers and noise, just like the median vs. the mean. Uses datapoints, not calculated means. The medoid has the lowest average distance/dissimilarity to all other points in the set.

Fuzzy Clustering
In hard clustering, each pattern belongs to one and only one cluster. Fuzzy clustering extends this by allowing each pattern to have a membership in each cluster with a different degree. The result is overlapping clusters, not really partitions.

Example (Jain, Murty & Flynn, ACM Computing Surveys, 1999) — fuzzy memberships of patterns 1–9 in two fuzzy clusters F1 and F2 (compared to the hard clusters H1 and H2 in the figure):
F1 = {(1, 1.0), (2, 1.0), (3, 1.0), (4, 0.6), (5, 0.55), (6, 0.2), (7, 0.2), (8, 0.0), (9, 0.0)}
F2 = {(1, 0.0), (2, 0.0), (3, 0.0), (4, 0.4), (5, 0.45), (6, 0.8), (7, 0.8), (8, 1.0), (9, 1.0)}

Fuzzy C-Means (Bezdek, 1981)
The algorithm is an extension of k-means and uses a squared-error criterion to handle fuzzy memberships:

  minimize J(W, M) = Σ_{i=1}^{n} Σ_{j=1}^{k} w_ij^m d_ij²    (1)

  subject to  w_ij ≥ 0, 1 ≤ i ≤ n, 1 ≤ j ≤ k    (2)
              Σ_{j=1}^{k} w_ij = 1, 1 ≤ i ≤ n    (3)

where m is a scalar, m > 1,
d_ij is the Euclidean distance between point x_i and center m_j,
w_ij is the grade of association of pattern i with cluster j,
m_j ∈ R^d, 1 ≤ j ≤ k, are the cluster centers,
M = [m_1, m_2, …, m_k] is a d × k matrix, and
W = {w_ij} is an n × k matrix.

• If m = 1 and w_ij takes only the values 1 or 0, then we get hard clustering.
• The problem has local minimum points which are not the global solution.

Algorithm FCM (Bezdek, 1981)

Step 0: Select an initial membership matrix W(1). Set ℓ = 1.

Step 1: Compute the centers M(ℓ) using

  m_j(ℓ) = Σ_{i=1}^{n} w_ij^m(ℓ) x_i / Σ_{i=1}^{n} w_ij^m(ℓ),  for all j.

Step 2: Compute W(ℓ + 1) using

  w_ij(ℓ + 1) = 1 / Σ_{r=1}^{k} [ d_ij(ℓ) / d_ir(ℓ) ]^{2/(m−1)},  for d_ir(ℓ) > 0, for all i, j.

  If d_ir(ℓ) = 0, then set w_ir(ℓ + 1) = 1 and w_ij(ℓ + 1) = 0 for j ≠ r.

Step 3: If ||W(ℓ) − W(ℓ + 1)|| < ε, stop, where ε > 0 is a small scalar; otherwise set ℓ = ℓ + 1 and go to Step 1.

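A numpy sketch of the FCM updates above (the function name, initialization and stopping details are mine):

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, iters=100, eps=1e-5, seed=0):
    """Alternate the center update (Step 1) and membership update (Step 2)
    until the membership matrix W stops changing (Step 3)."""
    rng = np.random.default_rng(seed)
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)        # memberships of each pattern sum to 1
    for _ in range(iters):
        Wm = W ** m
        M = (Wm.T @ X) / Wm.sum(axis=0)[:, None]                  # centers m_j
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                                      # avoid division by zero
        # w_ij = 1 / sum_r (d_ij / d_ir)^(2/(m-1))
        W_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.max(np.abs(W_new - W)) < eps:
            return W_new, M
        W = W_new
    return W, M
```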

• It can be used to produce hard clusterings at different levels by thresholding the membership values (semi-fuzzy; Kamel and Selim, PR 1993).
• It still suffers from some of the limitations of k-means (non-globular shapes, needs the number of clusters K to be chosen, …).
• Fuzzy C-shells and other variants: detect circular and elliptical boundaries with hollow interiors, handle missing values and different data types.
• Fuzzy C-means has better convergence properties but can still get stuck at local minima.

Clustering Approaches

• Hierarchical Approach: Single Link, Complete Link, Average Link, Ward's Algorithm, BIRCH, CURE, Chameleon
• Partitioning Approach: K-means, PAM, Fuzzy C-means, CLARA, CLARANS
• Density Based Approach: DBSCAN, DENCLUE
• Others: NN, Evolutionary, Spectral, Kernel, Affinity Propagation

(BIRCH, CURE, CLARA and CLARANS are designed to handle large data sets.)

Density Based Clustering

• Group points based on their density
• Handles outliers
• Can detect clusters with arbitrary shapes and densities

Presenter
Presentation Notes
From the data mining literature.

DBScan(X, ε, m): Density Based Spatial Clustering of Applications with Noise

Dense region: a neighbourhood of radius ε that contains at least m points.

Classify the points according to their density:
• Core points: points in the interior of a dense region.
• Border points: points on the edge of a dense region.
• Noise points: points that are neither border nor core.


DBScan

• Eliminate noise points.
• Core points which are within the neighborhood of each other belong to the same cluster.
• Assign each border point to the cluster whose core points it is reachable from.
• DBSCAN can find clusters of different shapes and sizes.
• DBSCAN can't handle clusters of varying densities well.
• Can be expensive for large data due to the neighborhood (distance) calculations.
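A minimal scikit-learn sketch (assuming sklearn is available; the toy data is made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(4.0, 0.3, (50, 2)),
               rng.uniform(-2.0, 6.0, (10, 2))])   # a few scattered noise points

db = DBSCAN(eps=0.5, min_samples=4).fit(X)   # eps plays the role of ε, min_samples of m
labels = db.labels_                          # cluster ids; -1 marks noise points
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True      # core points; clustered non-core points are border points
```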

DBSCAN: Core, Border and Noise Points

(Figure: the original points and their point types — core, border and noise — with Eps = 10, MinPts = 4.)
© Tan, Steinbach, Kumar, Introduction to Data Mining, 2004

When DBSCAN Works Well

(Figure: the original points and the resulting clusters.)
© Tan, Steinbach, Kumar, Introduction to Data Mining, 2004

• Resistant to Noise

• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well

• Varying densities
• High-dimensional data

(Figure: the original points and two clusterings with MinPts = 4, Eps = 9.75 and MinPts = 4, Eps = 9.92.)
© Tan, Steinbach, Kumar, Introduction to Data Mining, 2004


Large data sets

Needs:
• One scan of the data
• Out of core, or only part in core
• Incremental

Algorithms designed for large data:
• Hierarchical: BIRCH, CURE
• Partitioning: CLARA, CLARANS


BIRCH


(Balanced Iterative Reducing and Clustering using Hierarchies)

• Uses a data structure called the Clustering Feature (CF) Tree, which keeps summaries of the clusters (N = number of points, LS = linear sum of the points, SS = sum of squared points).
• The CF tree is like a dendrogram but stores information that facilitates the operations needed to form a hierarchical clustering.
• The CF of a merger between two clusters i and j is simply the addition CF_i + CF_j.

Presenter
Presentation Notes
Around the same time as DBSCAN, but designed to handle large data that couldn't fit in memory. Builds a tree to divide the data into smaller sets, then does clustering on those sets.
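A small sketch of the CF summary and its additivity (the class and helper names are mine, not BIRCH's actual implementation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CF:
    """Clustering Feature: (N, LS, SS) summary of a subcluster."""
    n: int            # number of points
    ls: np.ndarray    # linear sum of the points
    ss: float         # sum of squared norms of the points

    def __add__(self, other):                 # CF of a merger is simply CF_i + CF_j
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

def cf_of(points):
    pts = np.asarray(points, dtype=float)
    return CF(len(pts), pts.sum(axis=0), float((pts ** 2).sum()))

# cf_of(a) + cf_of(b) summarizes the union of a and b without revisiting the points.
```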


BIRCH


• The algorithm allows clustering incrementally and therefore handles large data sets.
• The tree structure uses two parameters: the number of entries (subclusters) in a leaf node (L), and a threshold (T) on the diameter of each subcluster in the leaf node.
• A point is inserted into the tree by reaching the leaf node that contains the cluster closest to it (starting from the root and traversing down to a leaf node).

Presenter
Presentation Notes
Each leaf contains many datapoints, statistics kept up to date


BIRCH


• If the point can be added (without violating the threshold), then the CF of the subcluster, the node cluster, and all parent nodes are updated.
• If not, create a new subcluster for the point and keep it in the same node, if this does not violate the number-of-entries parameter (L).
• Otherwise, split the node so it becomes a nonleaf node pointing to two leaf nodes.
• Nonleaf nodes can also have a limit on the branching factor (B).
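A hedged scikit-learn sketch (assuming sklearn's Birch implementation, whose threshold and branching_factor parameters play roughly the roles of T and B/L above):

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(1000, 2))

brc = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)
labels = brc.predict(X)   # final clusters obtained from the CF subclusters
```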


BIRCH: CF Tree

(Figure: a CF tree with branching factor B = 3 and L = 5 entries per leaf. The root holds CF1, CF2, CF3; CF1 points to a node with CF1.1, CF1.2, CF1.3; CF1.1 points to a leaf with subclusters CF1.1.1 … CF1.1.5.)

• CF1 = CF1.1 + CF1.2 + CF1.3, and CF1.1 = CF1.1.1 + CF1.1.2 + … + CF1.1.5.
• Each subcluster has maximum diameter < T, and each subcluster holds one or more points.

Presenter
Presentation Notes
Now, since we have statistics for each node and leaf, and the datapoints are organized in leaves, you can do agglomerative clustering easily.

BIRCH
• The algorithm adapts to the memory size by changing the threshold T and the branching factor B.
• Larger T and B will produce a smaller (shorter) CF tree.
• T can be changed to a larger value if the tree can't fit in memory.
• A smaller tree can be obtained by performing agglomerative clustering on the subclusters.
• A final optional step refines the clustering by reassigning the points.
• Complexity is O(n) in general.

Evaluation Measures

Internal
• Density of clusters
• Inherent properties of the data, classes

External
• Evaluated based on the correct (known) clusters

External Measures
If we know the labels of the true clusters, accuracy and F-measure can be calculated by constructing a contingency table that compares the a priori clustering to the resulting clustering. Suppose we know a priori the clusters U = {U1, U2, …, UR} and the resulting clustering is V = {V1, …, VK}.

Characteristic functions:
  I_U(i, j) = 1 if x_i and x_j are in the same group U_r, 1 ≤ r ≤ R; 0 otherwise.
  I_V(i, j) = 1 if x_i and x_j are in the same group V_s, 1 ≤ s ≤ K; 0 otherwise.

For every pair of objects, count:
  a (TP): same class, same cluster
  b (FN): same class, different cluster
  c (FP): different class, same cluster
  d (TN): different class, different cluster

Presenter
Presentation Notes
Count the number of pairs of items in the dataset that match in these ways.

The contingency table over I_U and I_V has entries a, b, c, d, with row totals m1 and M − m1 and column totals m2 and M − m2, where M = a + b + c + d = n(n − 1)/2 is the total number of pairs of objects.

  Rand Index = (a + d) / M
  Jaccard = a / (a + b + c)      (excludes the 0–0 agreements d)
  Fowlkes & Mallows = a / sqrt(m1 · m2)

where m1 = the number of pairs of objects in the same group in U and m2 = the number of pairs in the same group in V. Larger means closer.

Presenter
Presentation Notes
• Rand: focused on relative accuracy — the number of agreements, normalized.
• Jaccard: size of the intersection divided by the size of the union; similarity between sets.
• F&M: the agreement count normalized by the "bunchiness" of the data — how easy it is to cluster.
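A sketch computing the Rand, Jaccard and Fowlkes–Mallows indices from the pair counts a, b, c, d defined above (the toy labelings are made up; scikit-learn also offers adjusted_rand_score and fowlkes_mallows_score):

```python
from itertools import combinations
import numpy as np

def pair_counts(u, v):
    """Count pairs of objects: a = same group in both U and V, b = same in U only,
    c = same in V only, d = different in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:
            a += 1
        elif same_u:
            b += 1
        elif same_v:
            c += 1
        else:
            d += 1
    return a, b, c, d

u = [0, 0, 0, 1, 1, 2]   # hypothetical true clustering U
v = [0, 0, 1, 1, 1, 2]   # clustering result V
a, b, c, d = pair_counts(u, v)
M = a + b + c + d                                   # n(n - 1) / 2 pairs
rand = (a + d) / M
jaccard = a / (a + b + c)
fowlkes_mallows = a / np.sqrt((a + b) * (a + c))    # m1 = a + b, m2 = a + c
```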
Evaluation Measures: F-measure
If we can't specify the class of each cluster, we can compute the F-measure of each cluster with respect to each class. For cluster i and class j:

  precision(i, j) = m_ij / n_i
  recall(i, j) = m_ij / m_j

where n_i = the number of objects in cluster i, m_j = the number of objects in class j, and m_ij = the number of objects of class j in cluster i.

The F-measure of cluster i with respect to class j is

  F(i, j) = 2 · precision(i, j) · recall(i, j) / (precision(i, j) + recall(i, j)).

For k clusters and C classes, the overall F-measure is

  F = Σ_{j=1}^{C} (m_j / n) · max_{i=1..k} F(i, j).

Entropy
The degree to which each cluster consists of objects of a single class:

  e_i = −Σ_{j=1}^{C} P_ij log2 P_ij       (entropy of cluster i)
  P_ij = m_ij / n_i                       (probability that a member of cluster i belongs to class j)
  e = Σ_{i=1}^{k} (n_i / n) e_i           (total entropy, for k clusters and n points)

Internal Measures
Evaluate how well the results of a clustering algorithm perform without reference to external information. Two types of measures:
1. Cluster cohesion (compactness, tightness): how closely related the patterns in the same cluster are.
2. Cluster separation (isolation): how well separated a cluster is from the other clusters.

Presenter
Presentation Notes
Used if we don't have the ground truth.

Cohesion:

  cohesion(C_i) = Σ_{x∈C_i} Σ_{y∈C_i} proximity(x, y)

Proximity can be a function of the similarity between patterns (an example is cosine similarity). For prototype-based clusters,

  cohesion(C_i) = Σ_{x∈C_i} proximity(x, m_i),

where m_i is the prototype (or center) of cluster i. The overall cohesion is

  overall cohesion = Σ_{i=1}^{k} w_i · cohesion(C_i)   (k clusters),

where w_i is a weight that is a function of the size of the cluster, e.g. 1/n_i, 1/n, or 1/k.

Separation:
The separation between two clusters can be the distance between them, e.g.

  d(C_i, C_j) = min_{x∈C_i, y∈C_j} d(x, y)   (single-link like)
  d(C_i, C_j) = max_{x∈C_i, y∈C_j} d(x, y)   (complete-link like)

The overall separation can be the minimum pairwise separation between the clusters:

  overall separation = min_{i=1..k, j=i+1..k} d(C_i, C_j).

Ratios & Indices — Separation Index (SI):

  SI = [ Σ_{i=1}^{k} Σ_{x_j∈C_i} d²(x_j, m_i) ] / [ n · min_{C_r, C_s ∈ C} d(C_r, C_s) ]

The numerator measures the within-cluster scatter (the dispersion of the clusters, the inverse of cohesion); the denominator measures the distance between clusters (separation). Smaller values indicate more separated clusters.

Dunn Index:

  D = min_{i=1..k, j=i+1..k} [ d(C_i, C_j) / max_{ℓ=1..k} diam(C_ℓ) ]

where d(C_i, C_j) = min_{x∈C_i, y∈C_j} d(x, y) and diam(C) = max_{x,y∈C} d(x, y). Large values indicate compact and well-separated clusters.

For other measures, see Jain and Dubes' book, Chapter 4, and the article by Halkidi et al., Journal of Intelligent Information Systems, Dec. 2001.
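A numpy sketch of the Dunn index as defined above (function name mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Min inter-cluster (single-link) distance divided by the maximum cluster
    diameter; larger means compact, well-separated clusters."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    min_separation = min(cdist(ci, cj).min()
                         for i, ci in enumerate(clusters)
                         for cj in clusters[i + 1:])
    max_diameter = max(cdist(c, c).max() for c in clusters)
    return min_separation / max_diameter
```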