COMP9318: Data Warehousing and Data Mining
— L8: Clustering —
COMP9318: Data Warehousing and Data Mining 1
n What is Cluster Analysis?
COMP9318: Data Warehousing and Data Mining 2
Cluster Analysis
n Motivations
n Arranging objects into groups is a natural and necessary skill that we all share
Human Being’s Approach
Computer’s Approach
#    sex  glasses  moustache  smile  hat
1    m    y        n          y      n
2    f    n        n          y      n
3    m    y        n          n      n
4    m    n        n          n      n
5    m    n        n          y?     n
6    m    n        y          n      y
7    m    y        n          y      n
8    m    n        n          y      n
9    m    y        y          y      n
10   f    n        n          n      n
11   m    n        y          n      n
12   f    n        n          n      n
Object Tracking in Computer Vision
Colour Histograms (Swain & Ballard, 1990)
▪ Create a target model
We see a person; computers see numbers
▪ Find matching model in subsequent video frames
Model the target statistically
Find Similarity
Bucket   1    2    …    12
1        40   70   …    70
2        36   71   …    69
Example 1: Clustering of Flickr Tags
Example 2: Clustering of Social Tags
Grigory Begelman et al.: Automated Tag Clustering: Improving search and exploration in the tag space
http://www.pui.ch/phred/automated_tag_clustering/
Example 3: Clustering in Social
Networks
Jeffrey Heer, Danah Boyd: Vizster: Visualizing Online Social Networks. INFOVIS 2005
Example 4: Clustering Search Results for Visualization and Navigation
http://www.grokker.com/
What is Cluster Analysis?
n Cluster: a collection of data objects
n Similar to one another within the same cluster
n Dissimilar to the objects in other clusters
n Cluster analysis
n Grouping a set of data objects into clusters
n Clustering belongs to unsupervised classification: no predefined classes
n Typical applications
n As a stand-alone tool to get insight into data distribution
n As a preprocessing step for other algorithms
COMP9318: Data Warehousing and Data Mining 14
General Applications of Clustering
n Pattern Recognition
n Spatial Data Analysis
n create thematic maps in GIS by clustering feature spaces
n detect spatial clusters and explain them in spatial data mining
n Image Processing
n Economic Science (especially market research)
n WWW
n Document classification
n Cluster Weblog data to discover groups of similar access patterns
COMP9318: Data Warehousing and Data Mining 15
Examples of Clustering Applications
n Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
n Land use: Identification of areas of similar land use in an earth observation database
n Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
n City-planning: Identifying groups of houses according to their house type, value, and geographical location
n Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
COMP9318: Data Warehousing and Data Mining 16
What Is Good Clustering?
n A good clustering method will produce high quality clusters with
n high intra-class similarity
n low inter-class similarity
n The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
n The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
COMP9318: Data Warehousing and Data Mining 17
Requirements of Clustering in Data Mining
n Scalability
n Ability to deal with different types of attributes
n Discovery of clusters with arbitrary shape
n Minimal requirements for domain knowledge to determine input parameters
n Able to deal with noise and outliers
n Insensitive to order of input records
n High dimensionality
n Incorporation of user-specified constraints
n Interpretability and usability
COMP9318: Data Warehousing and Data Mining 18
Chapter 8. Cluster Analysis
n Preliminaries
COMP9318: Data Warehousing and Data Mining 19
Typical Inputs
n Data matrix
n N objects, each represented by an m-dimensional feature vector
n Dissimilarity matrix
n A square matrix giving distances between all pairs of objects.
n If similarity functions are used → similarity matrix

Data matrix ($n \times m$):
$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1m}\\
\vdots &        & \vdots &        & \vdots\\
x_{i1} & \cdots & x_{if} & \cdots & x_{im}\\
\vdots &        & \vdots &        & \vdots\\
x_{n1} & \cdots & x_{nf} & \cdots & x_{nm}
\end{bmatrix}$$

Dissimilarity matrix ($n \times n$):
$$\begin{bmatrix}
0      &        &        &        & \\
d(2,1) & 0      &        &        & \\
d(3,1) & d(3,2) & 0      &        & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$

COMP9318: Data Warehousing and Data Mining 20
Key component for clustering: the dissimilarity/similarity metric:d(i, j)
Comments
n The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
n Weights should be associated with different variables based on applications and data semantics, or appropriate preprocessing is needed.
n There is a separate “quality” function that measures the “goodness” of a cluster.
n It is hard to define "similar enough" or "good enough"
n the answer is typically highly subjective
COMP9318: Data Warehousing and Data Mining 21
Type of data in clustering analysis
n Interval-scaled variables:
n Binary variables:
n Nominal, ordinal, and ratio variables:
n Variables of mixed types:
COMP9318: Data Warehousing and Data Mining 22
Interval-valued variables
n Standardize data
n Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
n Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
n Using mean absolute deviation is more robust than using standard deviation
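A minimal numpy sketch of this standardization (the function name and sample data are illustrative, not from the lecture code):

```python
import numpy as np

def standardize(X):
    """Standardize each column f of X: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation of column f."""
    m = X.mean(axis=0)               # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
    return (X - m) / s

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
print(standardize(X))
```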
COMP9318: Data Warehousing and Data Mining 23
Similarity and Dissimilarity Between Objects
n Distances are normally used to measure the similarity or dissimilarity between two data objects
n A popular choice is the Minkowski distance, i.e., the Lp norm of the difference vector
n Special cases:
n if p = 1, d is the Manhattan distance
n if p = 2, d is the Euclidean distance
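For reference, the standard Minkowski (Lp) distance between objects i and j with m attributes is:
$$d(i, j) = \left( \sum_{f=1}^{m} |x_{if} - x_{jf}|^{p} \right)^{1/p}$$
with p = 1 giving the Manhattan distance and p = 2 the Euclidean distance; as p → ∞ it approaches the maximum (Chebyshev) distance.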
COMP9318: Data Warehousing and Data Mining 24
Similarity and Dissimilarity Between Objects (Cont.)
n Other similarity/distance functions:
n Mahalanobis distance
n Jaccard, Dice, cosine similarity, Pearson correlation coefficient
n Metric distance
n Properties:
n d(i, j) ≥ 0 (positiveness)
n d(i, i) = 0 (reflexivity)
n d(i, j) = d(j, i) (symmetry)
n d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
n The first three properties are common to all distance functions; a metric distance additionally satisfies the triangle inequality.
COMP9318: Data Warehousing and Data Mining 25
Areas within a unit distance from q under different Lp distances
[Figure: unit balls centred at q under the L1, L2, and L∞ distances]
COMP9318: Data Warehousing and Data Mining 26
Binary Variables
n A contingency table for binary data (object i vs. object j):

                Object j
                1      0      sum
Object i   1    a      b      a+b
           0    c      d      c+d
         sum    a+c    b+d    p

n Vector representation (example):
  i = [0, 1, 0, 1, 0, 0, 1, 0]
  j = [0, 0, 0, 0, 1, 0, 1, 1]
n Simple matching coefficient (invariant, if the binary variable is symmetric):
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
n Jaccard coefficient (non-invariant, if the binary variable is asymmetric):
$$d(i, j) = \frac{b + c}{a + b + c}$$
COMP9318: Data Warehousing and Data Mining 27
Dissimilarity between Binary Variables
n Example
$$d(i, j) = \frac{b + c}{a + b + c}$$
Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N
n gender is a symmetric attribute
n the remaining attributes are asymmetric binary
n let the values Y and P be set to 1, and the value N be set to 0

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
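A small Python check of these three values, encoding Y/P as 1 and N as 0 as on the slide (the helper name is illustrative):

```python
def binary_dissim(i, j, asymmetric=True):
    """Simple matching (symmetric) or Jaccard-style (asymmetric) dissimilarity
    between two 0/1 vectors, via the contingency counts a, b, c, d."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return (b + c) / (a + b + c) if asymmetric else (b + c) / (a + b + c + d)

#       Fever Cough T1 T2 T3 T4   (Y/P -> 1, N -> 0)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissim(jack, mary), 2))  # 0.33
print(round(binary_dissim(jack, jim), 2))   # 0.67
print(round(binary_dissim(jim, mary), 2))   # 0.75
```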
COMP9318: Data Warehousing and Data Mining 28
Nominal Variables
n A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
n Method 1: Simple matching
n m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
n Method 2: One-hot encoding
n creating a new binary variable for each of the M nominal states
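A brief Python sketch of both methods (the attribute values and state list below are made up for illustration):

```python
import numpy as np

def simple_matching_dissim(i, j):
    """Method 1: d(i, j) = (p - m) / p, with m matching attributes out of p."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

def one_hot(values, states):
    """Method 2: one binary variable per nominal state."""
    return np.array([[1 if v == s else 0 for s in states] for v in values])

print(simple_matching_dissim(["red", "small"], ["red", "large"]))   # 0.5
print(one_hot(["red", "blue"], ["red", "yellow", "blue", "green"]))
```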
COMP9318: Data Warehousing and Data Mining 29
Ordinal Variables
n An ordinal variable can be discrete or continuous
n Order is important, e.g., rank
n Can be treated like interval-scaled:
n replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
n map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
n compute the dissimilarity using methods for interval-scaled variables
n e.g., a rank of 3 out of $M_f = 5$ maps to $z_{if} = (3 - 1)/(5 - 1) = 0.5$
COMP9318: Data Warehousing and Data Mining 30
Ratio-Scaled Variables
n Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale,
such as $Ae^{Bt}$ or $Ae^{-Bt}$
n Methods:
1. treat them like interval-scaled variables—not a good choice! (why?—the scale can be distorted), or
2. apply logarithmic transformation yif = log(xif)
3. or treat them as continuous ordinal data; treat their rank as interval-scaled
COMP9318: Data Warehousing and Data Mining 31
n A Categorization of Major Clustering Methods
COMP9318: Data Warehousing and Data Mining 32
Major Clustering Approaches
n Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
n Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
n Graph-based algorithms: Spectral clustering
n Density-based: based on connectivity and density functions
n Grid-based: based on a multiple-level granularity structure
n Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model
COMP9318: Data Warehousing and Data Mining 33
n Partitioning Methods
COMP9318: Data Warehousing and Data Mining 34
Partitioning Algorithms: Problem Definition
n Partitioning method: Construct a “good” partition of a database of n objects into a set of k clusters
n Input: an n × m data matrix X
n How to measure the "goodness" of a given partitioning scheme?
n Cost of a cluster:
$$\mathrm{cost}(C_i) = \sum_{x_j \in C_i} \lVert x_j - \mathrm{center}(C_i) \rVert^2$$
Note: L2 distance used
n Cost of k clusters: sum of cost(Ci)
n How to choose the center of a cluster?
n Analogy with binning?
n Centroid (i.e., avg) of the xj → minimizes cost(Ci)
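A small numpy sketch of this cost function and of the centroid that minimizes it (the example points are made up):

```python
import numpy as np

def cluster_cost(points, center):
    """cost(Ci) = sum of squared L2 distances from each point to the center."""
    return np.sum(np.linalg.norm(points - center, axis=1) ** 2)

Ci = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
centroid = Ci.mean(axis=0)   # the centroid minimizes cost(Ci)
print(centroid, cluster_cost(Ci, centroid))
```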
35
Example (2D)
COMP9318: Data Warehousing and Data Mining 36
Partitioning Algorithms: Basic Concept
n It's an optimization problem!
n Global optimal:
n NP-hard (for a wide range of cost functions)
n Requires exhaustively enumerating all $\Theta\!\left(\frac{k^n}{k!}\right)$ partitions (Stirling numbers of the second kind)
n Heuristic methods:
n k-means: an instance of the EM (expectation-maximization) algorithm
n Many variants
COMP9318: Data Warehousing and Data Mining 37
Cost function: total squared distance of points to their cluster representative
The K-Means Clustering Method
n Lloyd's Algorithm (see the sketch below):
1. Initialize k centers randomly
2. While the stopping condition is not met:
   i. Assign each object to the cluster with the nearest center
   ii. Compute the new center for each cluster
n Stopping condition = ?
n What are the final clusters?
COMP9318: Data Warehousing and Data Mining 38
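A minimal numpy sketch of Lloyd's algorithm as described above (function and variable names are my own, not from the course code; initialization picks k distinct objects at random and the loop stops when assignments no longer change):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # 1. random init
    labels = None
    for _ in range(max_iter):                                      # 2. iterate
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)                           # i. assignment
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                  # stop: no change
        labels = new_labels
        for c in range(k):                                         # ii. update centers
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
print(centers)
```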
The K-Means Clustering Method
n Example
[Figure: K = 2 example on 2-D points. Arbitrarily choose K objects as the initial cluster centers → assign each object to the most similar center → update the cluster means → reassign objects → update the cluster means again, until the assignment no longer changes.]
COMP9318: Data Warehousing and Data Mining 39
Comments on the K-Means Method
n Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
n Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
n Comment:
n Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
n No guarantee on the quality. Use k-means++.
n Weakness
n Applicable only when the mean is defined; what about categorical data?
n Need to specify k, the number of clusters, in advance
n Unable to handle noisy data and outliers
n Not suitable to discover clusters with non-convex shapes
COMP9318: Data Warehousing and Data Mining 40
Variations of the K-Means Method
n A few variants of the k-means which differ in
n Selection of the initial k means
n Dissimilarity calculations
n Strategies to calculate cluster means
n Handling categorical data: k-modes (Huang '98)
n Replacing means of clusters with modes
n Using new dissimilarity measures to deal with categorical objects
n Using a frequency-based method to update modes of clusters
n A mixture of categorical and numerical data: k-prototype method
COMP9318: Data Warehousing and Data Mining 41
k-Means++
n A simple initialization routine that guarantees, in expectation, a solution that is O(log k)-competitive with the optimal k-means solution.
n Algorithm:
1. Find first center uniformly at random
2. For each data point x, compute D(x) as the distance to its nearest center
3. Randomly sample one point as the new center, with probability proportional to D²(x)
4. Goto 2 if less than k centers
5. Run the normal k-means with the k centers
[Arthur and Vassilvitskii, SODA 2007]
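A short numpy sketch of this seeding procedure (illustrative only; the resulting centers would then be passed to the Lloyd's-algorithm sketch shown earlier):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: pick centers with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]               # 1. first center, uniformly at random
    while len(centers) < k:                           # 4. until k centers are chosen
        d2 = np.min(                                  # 2. D(x)^2 to the nearest center
            [np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])  # 3. sample new center ~ D^2
    return np.array(centers)
```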
COMP9318: Data Warehousing and Data Mining 42
k-means: Special Matrix Factorization
n $X_{n \times d} \approx U_{n \times k} V_{k \times d}$
n Loss function: $\lVert X - UV \rVert_F^2$
n Squared Frobenius norm
n Constraints:
n Rows of U must be a one-hot encoding
n Alternative view
n $X_{j,*} \approx U_{j,*} V$ → $X_{j,*}$ can be explained as a "special" linear combination of rows in V
[Illustration: $x_2 \approx [1\ 0\ 0] \cdot V$, where the rows of V are the cluster centroids $c_1, c_2, c_3$]
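A toy numpy illustration of this view (the data X, the assignments, and k = 2 are made up for the example):

```python
import numpy as np

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])   # n x d data matrix
labels = np.array([0, 0, 1, 1])                                   # cluster of each row
U = np.eye(2)[labels]                                  # n x k, rows are one-hot encodings
V = np.array([X[labels == c].mean(axis=0) for c in range(2)])     # k x d centroid rows
loss = np.linalg.norm(X - U @ V, ord="fro") ** 2                  # || X - UV ||_F^2
print(loss)   # equals the k-means cost: sum of squared distances to centroids
```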
43
Expectation Maximization Algorithm
n $X_{n \times d} \approx U_{n \times k} V_{k \times d}$
n Loss function: $\lVert X - UV \rVert_F^2$
n Finding the best U and V simultaneously is hard, but
n Expectation step:
n Given V, find the best U → easy
n Maximization step:
n Given U, find the best V → easy
n Iterate until converging to a local minimum.
44
COMP9318: Data Warehousing and Data Mining 45
What is the problem of k-Means Method?
n The k-means algorithm is sensitive to outliers !
n Since an object with an extremely large value may substantially distort the distribution of the data.
n K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
COMP9318: Data Warehousing and Data Mining 46
K-medoids (PAM)
n k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
COMP9318: Data Warehousing and Data Mining 47
The K-Medoids Clustering Method
n Find representative objects, called medoids, in clusters
n PAM (Partitioning Around Medoids, 1987)
n starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
n PAM works effectively for small data sets, but does not scale well for large data sets
n CLARA (Kaufmann & Rousseeuw, 1990)
n CLARANS (Ng & Han, 1994): Randomized sampling
n Focusing + spatial data structure (Ester et al., 1995)
COMP9318: Data Warehousing and Data Mining 48
Typical k-medoids algorithm (PAM)
[Figure: K = 2 example. Arbitrarily choose k objects as initial medoids (total cost = 20) → assign each remaining object to the nearest medoid → for each non-medoid object O_h, compute the total cost of swapping it with a medoid O_i (e.g., total cost = 26) → swap O_i and O_h if the quality is improved → repeat until no change.]
COMP9318: Data Warehousing and Data Mining 49
PAM (Partitioning Around Medoids) (1987)
n PAM (Kaufman and Rousseeuw, 1987), built in S-plus
n Use real objects to represent the clusters (a sketch follows below):
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
3. For each pair of i and h, if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
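A simplified Python sketch of this swap-based search (it recomputes the total cost from scratch for each candidate swap rather than maintaining TC_ih incrementally, as full PAM does; names are illustrative):

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of dissimilarities from each object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """D: n x n dissimilarity matrix. Greedy medoid swaps until no improvement."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for i in range(k):                     # each selected medoid
            for h in range(n):                 # each non-selected object
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(D, candidate) < total_cost(D, medoids):  # TC_ih < 0
                    medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)      # assign to the nearest medoid
    return medoids, labels
```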
COMP9318: Data Warehousing and Data Mining 50
What is the problem with PAM?
n PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
n PAM works efficiently for small data sets but does not scale well for large data sets.
n O(k(n-k)²) for each iteration, where n is # of data points and k is # of clusters
→ Sampling-based method: CLARA (Clustering LARge Applications)
COMP9318: Data Warehousing and Data Mining 51
Gaussian Mixture Model for Clustering
n k-means can be deemed as a special case of the EM algorithm for GMM
n GMM
n allows “soft” cluster assignment:
n model Pr(C | x)
n also a good example of
n Generative model
n Latent variable model
n Use the Expectation-Maximization (EM) algorithm to obtain a local optimal solution
COMP9318: Data Warehousing and Data Mining 52
n Hierarchical Methods
COMP9318: Data Warehousing and Data Mining 53
Hierarchical Clustering
n Produces a set of nested clusters organized as a hierarchical tree
n Can be visualized as a dendrogram
n A tree-like diagram that records the sequences of merges or splits
n A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
[Figure: a dendrogram over objects 1-6, with the vertical axis showing the distance at which clusters are merged]
Strengths of Hierarchical Clustering
n Do not have to assume any particular number of clusters
n Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
n They may correspond to meaningful taxonomies
n Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)
Hierarchical Clustering
n Two main types of hierarchical clustering
n Agglomerative:
n Start with the points as individual clusters
n At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
n Divisive:
n Start with one, all-inclusive cluster
n At each step, split a cluster until each cluster contains a point (or there are k clusters)
n Traditional hierarchical algorithms use a similarity or distance matrix
n Merge or split one cluster at a time
Agglomerative Clustering Algorithm
n More popular hierarchical clustering technique
n Basic algorithm is straightforward (see the sketch after this list):
1. Compute the proximity matrix (i.e., matrix of pair-wise distances)
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
n Key operation is the computation of the proximity of two clusters ← different from that of two points
n Different approaches to defining the distance between clusters distinguish the different algorithms
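As an illustration of the same loop (assuming SciPy is available; the data and parameter choices here are arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(10, 2)
d = pdist(X)                                       # 1. pair-wise distance (proximity) matrix
Z = linkage(d, method="single")                    # 3-6. repeatedly merge the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
```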
Starting Situation
n Start with clusters of individual points and a proximity matrix
[Figure: each of the points p1 ... p12 starts as its own cluster, together with the full pair-wise proximity matrix]
Intermediate Situation
n After some merging steps, we have some clusters
[Figure: clusters C1-C5 and the proximity matrix over them]
Intermediate Situation
n We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1-C5; C2 and C5 are the closest pair and are about to be merged]
After Merging
n The question is "How do we update the proximity matrix?"
[Figure: after merging C2 and C5 into C2 ∪ C5, the proximity-matrix entries involving the merged cluster (marked "?") must be recomputed]
How to Define Inter-Cluster Distance
[Figure: two clusters of points and the pair-wise proximity matrix; which entries define the cluster-to-cluster distance?]
! MIN
! MAX
! Centroid-based
! Group Average
! Other methods driven by an objective function
– Ward's Method uses squared error
Note: Group Average is not the simple average distance between the clusters (see the GAAC definition below)
Cluster Similarity: MIN or Single Link/LINK
n Similarity of two clusters is based on the two most similar (closest) points in the different clusters
n i.e., $\mathrm{dissim}(C_i, C_j) = \min_{p_x \in C_i,\ p_y \in C_j} \mathrm{dissim}(p_x, p_y)$, equivalently $\mathrm{sim}(C_i, C_j) = \max_{p_x \in C_i,\ p_y \in C_j} \mathrm{sim}(p_x, p_y)$
n Determined by one pair of points, i.e., by one link in the proximity graph.
     P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

[Dendrogram over points 1-5]
Usually called single-link to avoid confusion (min of similarity or of dissimilarity?)
Single-Link Example
Initial similarity matrix:

     P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

Step 1: merge P1 and P2 (sim 0.90):

      {12}  P3    P4    P5
{12}  1.00  0.70  0.65  0.50
P3    0.70  1.00  0.40  0.30
P4    0.65  0.40  1.00  0.80
P5    0.50  0.30  0.80  1.00

Step 2: merge P4 and P5 (sim 0.80):

      {12}  P3    {45}
{12}  1.00  0.70  0.65
P3    0.70  1.00  0.40
{45}  0.65  0.40  1.00

Step 3: merge {12} and P3 (sim 0.70); finally merge {123} and {45} (sim 0.65).
sim(Ci, Cj) = max(sim(px, py))
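A quick SciPy check of this merge order (an illustrative sketch; similarities are converted to dissimilarities as 1 - sim, so maximum similarity corresponds to minimum distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = 1.0 - S                        # dissimilarity matrix
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D), method="single")
print(Z[:, :3])   # merge order: (P1,P2), (P4,P5), ({12},P3), ({123},{45})
```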
Hierarchical Clustering: MIN
[Figure: nested clusters and dendrogram produced by MIN (single link) on points 1-6]
Strength of MIN
Original Points Two Clusters
• Can handle non-elliptical shapes
Limitations of MIN
Original Points    Two Clusters
• Sensitive to noise and outliers
Cluster Similarity: MAX or Complete Link (CLINK)
n Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
n i.e., $\mathrm{dissim}(C_i, C_j) = \max_{p_x \in C_i,\ p_y \in C_j} \mathrm{dissim}(p_x, p_y)$, equivalently $\mathrm{sim}(C_i, C_j) = \min_{p_x \in C_i,\ p_y \in C_j} \mathrm{sim}(p_x, p_y)$
n Determined by all pairs of points in the two clusters
     P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

[Dendrogram over points 1-5]
Complete-Link Example
Initial similarity matrix:

     P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

Step 1: merge P1 and P2 (sim 0.90):

      {12}  P3    P4    P5
{12}  1.00  0.10  0.60  0.20
P3    0.10  1.00  0.40  0.30
P4    0.60  0.40  1.00  0.80
P5    0.20  0.30  0.80  1.00

Step 2: merge P4 and P5 (sim 0.80):

      {12}  P3    {45}
{12}  1.00  0.10  0.20
P3    0.10  1.00  0.30
{45}  0.20  0.30  1.00

Step 3: merge P3 and {45} (sim 0.30); finally merge {12} and {345} (sim 0.10).
sim(Ci, Cj) = min(sim(px, py))
Hierarchical Clustering: MAX
[Figure: nested clusters and dendrogram produced by MAX (complete link) on points 1-6]
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise and outliers
Limitations of MAX
Original Points Two Clusters
• Tends to break large clusters
• Biased towards globular clusters
Cluster Similarity: Group Average
n GAAC (Group Average Agglomerative Clustering)
n Similarity of two clusters is the average of pair-wise similarities between points in the two clusters:
$$\mathrm{sim}(C_i, C_j) = \frac{\sum_{p_x, p_y \in C_i \cup C_j,\ p_x \neq p_y} \mathrm{sim}(p_x, p_y)}{(|C_i| + |C_j|)\,(|C_i| + |C_j| - 1)}$$
n Why not use the simple average distance? This method guarantees that no inversions can occur.
[Example points: (1, 1), (3, 4.5), (5, 1)]
Group Average Example
Initial similarity matrix:

     P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

Step 1: merge P1 and P2 (sim 0.90):

      {12}   P3     P4     P5
{12}  1.00   0.567  0.717  0.533
P3    0.567  1.00   0.40   0.30
P4    0.717  0.40   1.00   0.80
P5    0.533  0.30   0.80   1.00

Step 2: merge P4 and P5 (sim 0.80):

      {12}   P3     {45}
{12}  1.00   0.567  0.608
P3    0.567  1.00   0.50
{45}  0.608  0.50   1.00

Sim(12, 3)  = 2 × (0.1 + 0.7 + 0.9) / 6 = 0.567
Sim(12, 45) = 2 × (0.9 + 0.65 + 0.2 + 0.6 + 0.5 + 0.8) / 12 = 0.608

Step 3: merge {12} and {45} (sim 0.608); finally merge {1245} and P3.
sim(Ci, Cj) = avg(sim(pi, pj))
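These two values can be reproduced directly from the GAAC formula; a small Python check using the similarity matrix above (the helper name is illustrative):

```python
import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

def gaac_sim(ci, cj):
    """Average pair-wise similarity over all distinct pairs in Ci ∪ Cj."""
    u = list(ci) + list(cj)
    total = sum(S[x, y] for x in u for y in u if x != y)
    return total / (len(u) * (len(u) - 1))

print(round(gaac_sim([0, 1], [2]), 3))     # Sim(12, 3)  = 0.567
print(round(gaac_sim([0, 1], [3, 4]), 3))  # Sim(12, 45) = 0.608
```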
Hierarchical Clustering: Centroid-based and Group Average
n Compromise between Single and Complete Link
n Strengths
n Less susceptible to noise and outliers
n Limitations
n Biased towards globular clusters
More on Hierarchical Clustering Methods
n Major weakness of agglomerative clustering methods
n do not scale well: time complexity of at least O(n²), where n is the number of total objects
n can never undo what was done previously
n Integration of hierarchical with distance-based clustering
n BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
n CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
n CHAMELEON (1999): hierarchical clustering using dynamic modeling
COMP9318: Data Warehousing and Data Mining 80
Spectral Clustering
n See additional slides.
COMP9318: Data Warehousing and Data Mining 81