COMP9318: Data Warehousing
and Data Mining
— L8: Clustering —
n What is Cluster Analysis?
What is Cluster Analysis?
n Cluster: a collection of data objects
n Similar to one another within the same cluster
n Dissimilar to the objects in other clusters
n Cluster analysis
n Grouping a set of data objects into clusters
n Clustering belongs to unsupervised classification: no
predefined classes
n Typical applications
n As a stand-alone tool to get insight into data
distribution
n As a preprocessing step for other algorithms
General Applications of Clustering
n Pattern Recognition
n Spatial Data Analysis
n create thematic maps in GIS by clustering feature
spaces
n detect spatial clusters and explain them in spatial data
mining
n Image Processing
n Economic Science (especially market research)
n WWW
n Document classification
n Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering Applications
n Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
n Land use: Identification of areas of similar land use in an
earth observation database
n Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
n City-planning: Identifying groups of houses according to
their house type, value, and geographical location
n Earthquake studies: observed earthquake epicenters should be clustered along continental faults
What Is Good Clustering?
n A good clustering method will produce high quality
clusters with
n high intra-class similarity
n low inter-class similarity
n The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
n The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Requirements of Clustering in Data
Mining
n Scalability
n Ability to deal with different types of attributes
n Discovery of clusters with arbitrary shape
n Minimal requirements for domain knowledge to
determine input parameters
n Able to deal with noise and outliers
n Insensitive to order of input records
n High dimensionality
n Incorporation of user-specified constraints
n Interpretability and usability
Chapter 8. Cluster Analysis
n Preliminaries
Typical Inputs
n Data matrix
n Each of the n objects is represented by an m-dimensional feature vector
n Dissimilarity matrix
n A square matrix giving
distances between all
pairs of objects.
n If similarity functions are used → similarity matrix
Data matrix (n × m):

    [ x_11  ...  x_1f  ...  x_1m ]
    [  ...  ...   ...  ...   ... ]
    [ x_i1  ...  x_if  ...  x_im ]
    [  ...  ...   ...  ...   ... ]
    [ x_n1  ...  x_nf  ...  x_nm ]

Dissimilarity matrix (n × n):

    [   0                            ]
    [ d(2,1)    0                    ]
    [ d(3,1)  d(3,2)    0            ]
    [   :       :       :            ]
    [ d(n,1)  d(n,2)   ...  ...   0  ]
Key component for clustering: the dissimilarity/similarity metric d(i, j)
Comments
n The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
n Weights should be associated with different variables
based on applications and data semantics, or appropriate
preprocessing is needed.
n There is a separate “quality” function that measures the
“goodness” of a cluster.
n It is hard to define “similar enough” or “good enough”
n the answer is typically highly subjective.
Type of data in clustering analysis
n Interval-scaled variables:
n Binary variables:
n Nominal, ordinal, and ratio variables:
n Variables of mixed types:
Interval-valued variables
n Standardize data
n Calculate the mean absolute deviation:
where
n Calculate the standardized measurement (z-score)
n Using mean absolute deviation is more robust than using standard
deviation
Mean absolute deviation:
    s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),
    where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
Standardized measurement (z-score):
    z_if = (x_if − m_f) / s_f
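A minimal NumPy sketch of this standardization (the function name standardize_mad and the toy data are illustrative, not from the slides):

import numpy as np

def standardize_mad(X):
    # Column-wise z-scores using the mean absolute deviation s_f
    # instead of the standard deviation, as defined above.
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)      # s_f: mean absolute deviation
    return (X - m) / s                  # z_if = (x_if - m_f) / s_f

# toy data: 3 objects, 2 interval-scaled variables
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
print(standardize_mad(X))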
Similarity and Dissimilarity Between
Objects
n Distances are normally used to measure the similarity or
dissimilarity between two data objects
n A popular choice is the Minkowski distance, i.e., the Lp norm of the difference vector:
    d(i, j) = ( Σ_f |x_if − x_jf|^p )^(1/p)
n Special cases:
n if p = 1, d is the Manhattan distance
n if p = 2, d is the Euclidean distance
n if p = ∞, d is the maximum (L∞) distance
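A minimal sketch of the Minkowski distance for different values of p (function name and test vectors are illustrative):

import numpy as np

def minkowski(x, y, p):
    # L_p distance between two feature vectors; p = np.inf gives the maximum distance.
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))        # Manhattan: 5.0
print(minkowski(x, y, 2))        # Euclidean: about 3.606
print(minkowski(x, y, np.inf))   # maximum: 3.0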
Similarity and Dissimilarity Between
Objects (Cont.)
n Other similarity/distance functions:
n Mahalanobis distance
n Jaccard, Dice, cosine similarity, Pearson correlation
coefficient
n Metric distance
n Properties
n d(i,j) ≥ 0 (positiveness)
n d(i,i) = 0 (reflexivity)
n d(i,j) = d(j,i) (symmetry)
n d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality)
n The first three properties are common to all distance functions; the triangle inequality is what makes a distance function a metric.
Areas within a unit distance from q under different Lp distances
[Figure: the set of points within unit distance of a query point q is a diamond under L1, a circle under L2, and a square under L∞.]
Binary Variables
n A contingency table for binary data
            j = 1   j = 0   sum
    i = 1     a       b     a + b
    i = 0     c       d     c + d
    sum     a + c   b + d     p

n Simple matching coefficient (invariant if the binary variable is symmetric):
    d(i, j) = (b + c) / (a + b + c + d)
n Jaccard coefficient (non-invariant if the binary variable is asymmetric):
    d(i, j) = (b + c) / (a + b + c)
n Example vectors:
    Obj   Vector representation
    i     [0, 1, 0, 1, 0, 0, 1, 0]
    j     [0, 0, 0, 0, 1, 0, 1, 1]
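A minimal sketch that derives a, b, c, d from the two example vectors above and computes both coefficients (the function name is illustrative):

import numpy as np

def binary_dissim(i, j):
    # Simple matching and Jaccard dissimilarity from the a/b/c/d counts above.
    i, j = np.asarray(i), np.asarray(j)
    a = np.sum((i == 1) & (j == 1))
    b = np.sum((i == 1) & (j == 0))
    c = np.sum((i == 0) & (j == 1))
    d = np.sum((i == 0) & (j == 0))
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (a + b + c)
    return float(simple_matching), float(jaccard)

i = [0, 1, 0, 1, 0, 0, 1, 0]
j = [0, 0, 0, 0, 1, 0, 1, 1]
print(binary_dissim(i, j))   # (0.5, 0.8), since a=1, b=2, c=2, d=3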
Dissimilarity between Binary
Variables
n Example
n gender is a symmetric attribute
n the remaining attributes are asymmetric binary
n let the values Y and P be set to 1, and the value N be set to 0
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Using the Jaccard coefficient d(i, j) = (b + c) / (a + b + c) on the asymmetric attributes:
    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Nominal Variables
n A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
n Method 1: Simple matching
n m: # of matches, p: total # of variables
    d(i, j) = (p − m) / p
n Method 2: One-hot encoding
n creating a new binary variable for each of the M nominal states
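A minimal sketch of both methods (the function names and colour values are illustrative):

import numpy as np

def simple_matching_dissim(x, y):
    # d(i, j) = (p - m) / p for two vectors of nominal values.
    x, y = np.asarray(x), np.asarray(y)
    p = len(x)
    m = np.sum(x == y)
    return float((p - m) / p)

def one_hot(values, states):
    # One new binary variable per nominal state.
    return np.array([[1 if v == s else 0 for s in states] for v in values])

print(simple_matching_dissim(["red", "blue", "green"], ["red", "green", "green"]))  # 1/3
print(one_hot(["red", "blue"], ["red", "yellow", "blue", "green"]))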
Ordinal Variables
n An ordinal variable can be discrete or continuous
n Order is important, e.g., rank
n Can be treated like interval-scaled variables
n replace x_if by its rank r_if ∈ {1, ..., M_f}
n map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if − 1) / (M_f − 1)
n compute the dissimilarity using methods for interval-scaled variables
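A minimal sketch of this rank-to-[0, 1] mapping (the function name and ranks are illustrative):

import numpy as np

def ordinal_to_interval(ranks, M):
    # Map ranks r_if in {1, ..., M_f} onto [0, 1] via z_if = (r_if - 1) / (M_f - 1).
    ranks = np.asarray(ranks, dtype=float)
    return (ranks - 1.0) / (M - 1.0)

# e.g. an ordinal variable with M_f = 3 states: fair < good < excellent
ranks = [1, 2, 3, 2]
print(ordinal_to_interval(ranks, 3))   # [0.  0.5 1.  0.5]
# the transformed values can now be fed into any interval-scaled distance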
Ratio-Scaled Variables
n Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt)
n Methods:
n treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
n apply a logarithmic transformation: y_if = log(x_if)
n treat them as continuous ordinal data and treat their ranks as interval-scaled
n A Categorization of Major Clustering Methods
Major Clustering Approaches
n Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
n Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
n Graph-based algorithms: Spectral clustering
n Density-based: based on connectivity and density functions
n Grid-based: based on a multiple-level granularity structure
n Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the hypothesized models
n Partitioning Methods
Partitioning Algorithms: Problem Definition
n Partitioning method: Construct a “good” partition of a
database of n objects into a set of k clusters
n Input: an n × m data matrix
n How to measure the “goodness” of a given partitioning scheme?
n Cost of a cluster:
    cost(C_i) = Σ_{x_j ∈ C_i} ‖ x_j − center(C_i) ‖²
n Note: L2 distance used
n Analogy with binning?
n How to choose the center of a cluster?
n The centroid (i.e., average) of the x_j in C_i minimizes cost(C_i)
n Cost of k clusters: the sum of cost(C_i) over all clusters
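A minimal sketch of this cost function (the function name and toy cluster are illustrative); the center is taken as the centroid, as stated above:

import numpy as np

def cluster_cost(points):
    # cost(C_i): sum of squared L2 distances from each point to the cluster centroid.
    X = np.asarray(points, dtype=float)
    center = X.mean(axis=0)                      # the centroid minimizes this cost
    return float(np.sum((X - center) ** 2))

C = [[1.0, 1.0], [2.0, 1.0], [3.0, 4.0]]
print(cluster_cost(C))   # 8.0 (centroid is (2, 2))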
Example (2D)
Partitioning Algorithms: Basic Concept
n It’s an optimization problem!
n Global optimum:
n NP-hard (for a wide range of cost functions)
n Requires exhaustively enumerating all partitions
n The number of partitions of n objects into k non-empty clusters is the Stirling number of the second kind, S(n, k) ≈ k^n / k!
n Heuristic methods:
n k-means: an instance of the EM (expectation-maximization) algorithm
n Many variants
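For a sense of scale, a small sketch that computes the exact Stirling numbers of the second kind via the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1) and compares them with the k^n / k! approximation (the chosen n and k are illustrative):

from math import factorial

def stirling2(n, k):
    # Number of ways to partition n objects into k non-empty clusters.
    if n == k or k == 1:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))          # 9330 exact partitions
print(3 ** 10 // factorial(3))   # 9841, the k^n / k! approximation
print(stirling2(25, 4))          # already larger than 4 * 10^13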
The K-Means Clustering Method
n Lloyd's Algorithm:
1. Initialize k centers randomly
2. While stopping condition is not met
i. Assign each object to the cluster with the nearest
center
ii. Compute the new center for each cluster.
n Stopping condition = ?
n What are the final clusters?
Cost function: total squared distance from each point to its cluster representative (see the code sketch below)
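A minimal NumPy sketch of Lloyd's algorithm as outlined above, stopping when no object changes cluster (the function name, toy data, and initialization details are illustrative):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's algorithm: alternate assignment and centroid update.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random init
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # step 2(i): assign each object to the cluster with the nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                            # stopping condition: no reassignment
        labels = new_labels
        # step 2(ii): recompute the mean of each cluster
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

X = [[1, 1], [1.5, 2], [8, 8], [9, 8.5], [1, 0.5], [8.5, 9]]
print(kmeans(X, k=2))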
The K-Means Clustering Method
n Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means again; repeat until no reassignment occurs.
[Figure: four scatter plots on a 0 to 10 grid illustrating the assign / update / reassign iterations.]
Comments on the K-Means Method
n Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
n Compared with PAM: O(k(n−k)²) and CLARA: O(ks² + k(n−k))
n Comment:
n Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
n No guarantee on the quality. Use k-means++.
n Weakness
n Applicable only when mean is defined, then what about categorical
data?
n Need to specify k, the number of clusters, in advance
n Unable to handle noisy data and outliers
n Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
n A few variants of the k-means which differ in
n Selection of the initial k means
n Dissimilarity calculations
n Strategies to calculate cluster means
n Handling categorical data: k-modes (Huang’98)
n Replacing means of clusters with modes
n Using new dissimilarity measures to deal with categorical objects
n Using a frequency-based method to update modes of clusters
n A mixture of categorical and numerical data: k-prototype method
k-Means++ [Arthur and Vassilvitskii, SODA 2007]
n A simple initialization routine that guarantees (in expectation) a solution that is O(log k)-competitive with the optimal k-means solution.
n Algorithm:
1. Pick the first center uniformly at random
2. For each data point x, compute D(x), the distance to its nearest chosen center
3. Randomly sample one point as the new center, with probability proportional to D²(x)
4. Go to step 2 if fewer than k centers have been chosen
5. Run the normal k-means with the k centers
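A minimal sketch of the seeding steps above (the function name and data are illustrative); the returned centers would then be handed to ordinary k-means:

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    # k-means++ seeding: sample each new center with probability proportional to D(x)^2.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # step 1: first center uniformly at random
    while len(centers) < k:                      # step 4: until k centers are chosen
        # step 2: squared distance of every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # step 3: sample the next center with probability proportional to D^2(x)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.random.default_rng(1).random((50, 2))
print(kmeans_pp_init(X, k=3))                    # step 5: run the normal k-means from these centers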
k-means: Special Matrix Factorization
n X_{n×d} ≈ U_{n×k} V_{k×d}
n Loss function: ‖ X − UV ‖_F²
n Squared Frobenius norm
n Constraints:
n Each row of U must be a one-hot encoding
n Alternative view
n X_{j,*} ≈ U_{j,*} V → X_{j,*} can be explained as a “special” linear combination of the rows of V
[Figure: row x_2 of X is approximated by U_{2,*} V = [1 0 0] · V, i.e., by the first row c_1 of V, where the rows c_1, c_2, c_3 of V are the cluster centroids.]
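A small sketch of this view (names are illustrative): given cluster labels, build the one-hot matrix U and the centroid matrix V, and evaluate the squared Frobenius loss, which coincides with the k-means cost:

import numpy as np

def factorization_view(X, labels, k):
    # Express a clustering as X ~ U V with one-hot rows in U.
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    U = np.zeros((n, k))
    U[np.arange(n), labels] = 1.0                                    # each row of U is one-hot
    V = np.array([X[labels == c].mean(axis=0) for c in range(k)])    # rows of V are the centroids
    loss = float(np.sum((X - U @ V) ** 2))                           # squared Frobenius norm
    return U, V, loss

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])
U, V, loss = factorization_view(X, labels, k=2)
print(V)      # the two centroids
print(loss)   # equals the total within-cluster squared distance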
Expectation Maximization Algorithm
n X_{n×d} ≈ U_{n×k} V_{k×d}
n Loss function: ‖ X − UV ‖_F²
n Finding the best U and V simultaneously is hard, but
n Expectation step:
n Given V, find the best U → easy
n Maximization step:
n Given U, find the best V → easy
n Iterate until converging to a local minimum.
What is the problem with the k-Means method?
n The k-means algorithm is sensitive to outliers!
n An object with an extremely large value may substantially distort the distribution of the data.
n K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
K-medoids (PAM)
n k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
The K-Medoids Clustering Method
n Find representative objects, called medoids, in clusters
n PAM (Partitioning Around Medoids, 1987)
n starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
n PAM works effectively for small data sets, but does not scale
well for large data sets
n CLARA (Kaufmann & Rousseeuw, 1990)
n CLARANS (Ng & Han, 1994): Randomized sampling
n Focusing + spatial data structure (Ester et al., 1995)
Typical k-medoids algorithm (PAM)
[Figure: scatter plots on a 0 to 10 grid illustrating PAM with K = 2: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); for each non-medoid object O_a, compute the total cost of swapping a medoid O with O_a (total cost = 26 in the example); if the quality is improved, swap O and O_a; repeat the loop until no change.]
PAM (Partitioning Around Medoids) (1987)
n PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
n Uses real objects to represent the clusters
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
3. For each pair of i and h, if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
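A compact sketch of the swap loop described above, using the total distance of every object to its nearest medoid as the cost (the function names and toy data, including the outlier, are illustrative):

import numpy as np

def total_cost(D, medoids):
    # Sum over all objects of the distance to the nearest medoid (D is an n x n distance matrix).
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    n = len(D)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for i in list(medoids):                             # selected object i
            for h in range(n):                              # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # a swap with negative TC_ih means the total cost goes down
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]])  # last point is an outlier
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
print(pam(D, k=2))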
What is the problem with PAM?
n PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
n PAM works efficiently for small data sets but does not
scale well for large data sets.
n O(k(n−k)²) for each iteration, where n is # of data and k is # of clusters
→ Sampling-based method: CLARA (Clustering LARge Applications)
Gaussian Mixture Model for Clustering
n k-means can be deemed as a special case of the
EM algorithm for GMM
n GMM
n allows “soft” cluster assignment:
n model Pr(C | x)
n also a good example of
n Generative model
n Latent variable model
n Use the Expectation-Maximization (EM) algorithm to obtain a locally optimal solution (a sketch of the E-step follows)
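A small sketch of the soft assignment Pr(C | x) for a Gaussian mixture with spherical components (the E-step); the weights, means, and variance below are illustrative parameters rather than fitted values:

import numpy as np

def responsibilities(X, weights, means, var):
    # E-step of a spherical-Gaussian mixture: Pr(C = c | x) for every point and component.
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    log_p = []
    for w, mu in zip(weights, means):
        sq = ((X - np.asarray(mu, float)) ** 2).sum(axis=1)
        # log Pr(c) + log Pr(x | c) for component c (spherical covariance var * I)
        log_p.append(np.log(w) - 0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var))
    log_p = np.array(log_p).T                       # shape (n, k)
    log_p -= log_p.max(axis=1, keepdims=True)       # for numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)         # soft assignments, rows sum to 1

X = [[0.0, 0.0], [0.5, 0.2], [5.0, 5.0]]
print(responsibilities(X, weights=[0.5, 0.5], means=[[0, 0], [5, 5]], var=1.0))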
n Hierarchical Methods
Hierarchical Clustering
n Produces a set of nested clusters organized as a hierarchical tree
n Can be visualized as a dendrogram
n A tree like diagram that records the sequences of merges or splits
n A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected component
forms a cluster.
[Figure: six objects shown as nested clusters in the plane, together with the corresponding dendrogram; the vertical axis gives the merge distance (0 to 0.2).]
Strengths of Hierarchical Clustering
n Do not have to assume any particular number of
clusters
n Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
n They may correspond to meaningful taxonomies
n Example in biological sciences (e.g., animal
kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
n Two main types of hierarchical clustering
n Agglomerative:
n Start with the points as individual clusters
n At each step, merge the closest pair of clusters until only one
cluster (or k clusters) left
n Divisive:
n Start with one, all-inclusive cluster
n At each step, split a cluster until each cluster contains a point (or
there are k clusters)
n Traditional hierarchical algorithms use a similarity or
distance matrix
n Merge or split one cluster at a time
Agglomerative Clustering Algorithm
n More popular hierarchical clustering technique
n Basic algorithm is straightforward
1. Compute the proximity matrix (i.e., matrix of pair-wise distances)
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
n Key operation is the computation of the proximity of two clusters, which is different from the proximity of two points
n Different approaches to defining the distance between clusters distinguish the different algorithms (a sketch of the generic procedure is given below)
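A naive sketch of the generic procedure above, parameterized by the inter-cluster linkage (min gives single link, max gives complete link); a practical implementation would update the proximity matrix incrementally instead of recomputing it (names and the toy distance matrix are illustrative):

import numpy as np

def agglomerative(D, num_clusters, linkage=min):
    # Merge the two closest clusters until num_clusters remain.
    # D: n x n pairwise distance matrix; linkage: min (single link) or max (complete link).
    clusters = [[i] for i in range(len(D))]       # step 2: each point starts as its own cluster
    merges = []
    while len(clusters) > num_clusters:           # steps 3-6
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        clusters[a] = clusters[a] + clusters[b]   # step 4: merge the two closest clusters
        del clusters[b]                           # step 5: shrink the proximity structure
    return clusters, merges

D = np.array([[0.0, 1.0, 9.0, 8.0],
              [1.0, 0.0, 8.5, 7.5],
              [9.0, 8.5, 0.0, 1.5],
              [8.0, 7.5, 1.5, 0.0]])
print(agglomerative(D, num_clusters=2))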
Starting Situation
n Start with clusters of individual points and a
proximity matrix
[Figure: individual points p1, ..., p12 and their pairwise proximity matrix.]
Intermediate Situation
n After some merging steps, we have some clusters
[Figure: five clusters C1-C5 and the cluster-level proximity matrix.]
Intermediate Situation
n We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C1-C5, with C2 and C5 highlighted as the closest pair, and the current proximity matrix.]
After Merging
n The question is “How do we update the proximity matrix?”
[Figure: after merging C2 and C5, the row and column for C2 ∪ C5 in the proximity matrix must be recomputed (entries marked “?”).]
How to Define Inter-Cluster Distance
[Figure: points p1-p5 and their proximity matrix; which entries determine the similarity between two clusters?]
● MIN
● MAX
● Centroid-based
● Group Average (note: not the simple average distance between the two clusters)
● Other methods driven by an objective function
– Ward’s Method uses squared error
Cluster Similarity: MIN or Single Link (SLINK)
n Similarity of two clusters is based on the two most similar (closest) points in the different clusters
n i.e., dissim(Ci, Cj) = min(dissim(px, py)) over px ∈ Ci, py ∈ Cj, or equivalently sim(Ci, Cj) = max(sim(px, py))
n Determined by one pair of points, i.e., by one link in the proximity graph.
Usually called single-link to avoid confusion (is the "min" over similarity or dissimilarity?)
P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00
Single-Link Example
sim(Ci, Cj) = max(sim(px, py)) over px ∈ Ci, py ∈ Cj

Step 0 (pairwise similarities):
      P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

Step 1: merge {P1, P2} (similarity 0.90):
      12    P3    P4    P5
12   1.00  0.70  0.65  0.50
P3         1.00  0.40  0.30
P4               1.00  0.80
P5                     1.00

Step 2: merge {P4, P5} (similarity 0.80):
      12    P3    45
12   1.00  0.70  0.65
P3         1.00  0.40
45               1.00

Step 3: merge {12, P3} (similarity 0.70), then merge the result with {P4, P5} (similarity 0.65).
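The merge order in this example can be reproduced mechanically; a small sketch that repeatedly merges the pair of clusters with the largest single-link similarity on the matrix above:

import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

clusters = [[0], [1], [2], [3], [4]]              # indices 0..4 stand for P1..P5
while len(clusters) > 1:
    # single link: cluster similarity = max pairwise similarity across the two clusters
    best = max(((max(S[i][j] for i in clusters[a] for j in clusters[b]), a, b)
                for a in range(len(clusters)) for b in range(a + 1, len(clusters))))
    sim, a, b = best
    print("merge", clusters[a], clusters[b], "at similarity", sim)
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]
# prints merges at 0.90, 0.80, 0.70, 0.65, the same order as the tables above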
Hierarchical Clustering: MIN
[Figure: nested single-link clusters over points 1-6 and the corresponding dendrogram; merge heights range from 0 to about 0.2.]
Strength of MIN
[Figure: original points (left) and the two clusters found (right).]
• Can handle non-elliptical shapes
Limitations of MIN
[Figure: original points (left) and the two clusters found (right).]
• Sensitive to noise and outliers
Cluster Similarity: MAX or Complete Link (CLINK)
n Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
n i.e., dissim(Ci, Cj) = max(dissim(px, py)) over px ∈ Ci, py ∈ Cj, or equivalently sim(Ci, Cj) = min(sim(px, py))
n Determined by all pairs of points in the two clusters
P1 P2 P3 P4 P5
P1 1.00 0.90 0.10 0.65 0.20
P2 0.90 1.00 0.70 0.60 0.50
P3 0.10 0.70 1.00 0.40 0.30
P4 0.65 0.60 0.40 1.00 0.80
P5 0.20 0.50 0.30 0.80 1.00
Complete-Link Example
sim(Ci, Cj) = min(sim(px, py)) over px ∈ Ci, py ∈ Cj

Step 0 (pairwise similarities):
      P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

Step 1: merge {P1, P2} (similarity 0.90):
      12    P3    P4    P5
12   1.00  0.10  0.60  0.20
P3         1.00  0.40  0.30
P4               1.00  0.80
P5                     1.00

Step 2: merge {P4, P5} (similarity 0.80):
      12    P3    45
12   1.00  0.10  0.20
P3         1.00  0.30
45               1.00

Step 3: merge {P3, 45} (similarity 0.30), then merge the result with {P1, P2} (similarity 0.10).
Hierarchical Clustering: MAX
[Figure: nested complete-link clusters over points 1-6 and the corresponding dendrogram; merge heights range from 0 to about 0.4.]
Strength of MAX
[Figure: original points (left) and the two clusters found (right).]
• Less susceptible to noise and outliers
Limitations of MAX
[Figure: original points (left) and the two clusters found (right).]
•Tends to break large clusters
•Biased towards globular clusters
Cluster Similarity: Group Average
n GAAC (Group Average Agglomerative Clustering)
n Similarity of two clusters is the average of pair-wise similarity
between points in the two clusters.
n Why not use the simple average distance between the two clusters? This formulation guarantees that no inversions can occur.

    similarity(Cluster_i, Cluster_j) =
        Σ_{p_x, p_y ∈ Cluster_i ∪ Cluster_j, p_y ≠ p_x} similarity(p_x, p_y)
        / ( (|Cluster_i| + |Cluster_j|) · (|Cluster_i| + |Cluster_j| − 1) )

[Figure: three points (1, 1), (3, 4.5), (5, 1) illustrating the difference.]
Group Average Example
sim(Ci, Cj) = average pairwise similarity over the union of the two clusters

Step 0 (pairwise similarities):
      P1    P2    P3    P4    P5
P1   1.00  0.90  0.10  0.65  0.20
P2   0.90  1.00  0.70  0.60  0.50
P3   0.10  0.70  1.00  0.40  0.30
P4   0.65  0.60  0.40  1.00  0.80
P5   0.20  0.50  0.30  0.80  1.00

Step 1: merge {P1, P2} (similarity 0.90):
      12     P3     P4     P5
12   1.00   0.567  0.717  0.533
P3          1.00   0.40   0.30
P4                 1.00   0.80
P5                        1.00
e.g., sim(12, P3) = 2 · (0.90 + 0.10 + 0.70) / (3 · 2) = 0.567

Step 2: merge {P4, P5} (similarity 0.80):
      12     P3     45
12   1.00   0.567  0.608
P3          1.00   0.50
45                 1.00
e.g., sim(12, 45) = 2 · (0.90 + 0.65 + 0.20 + 0.60 + 0.50 + 0.80) / (4 · 3) = 0.608

Step 3: merge {12, 45} (similarity 0.608), then merge the result with P3.
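The entries 0.567 and 0.608 above can be checked against the formula directly; a small sketch (the cluster index lists are illustrative):

import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

def group_average(ci, cj):
    # Average similarity over all distinct pairs in the union of the two clusters.
    union = ci + cj
    m = len(union)
    total = sum(S[x][y] for x in union for y in union if x != y)   # counts each pair twice
    return total / (m * (m - 1))

print(round(group_average([0, 1], [2]), 3))      # sim(12, P3) = 0.567
print(round(group_average([0, 1], [3, 4]), 3))   # sim(12, 45) = 0.608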
Hierarchical Clustering: Centroid-
based and Group Average
n Compromise between Single and Complete
Link
n Strengths
n Less susceptible to noise and outliers
n Limitations
n Biased towards globular clusters
More on Hierarchical Clustering Methods
n Major weakness of agglomerative clustering methods
n do not scale well: time complexity of at least O(n²), where n is the number of total objects
n can never undo what was done previously
n Integration of hierarchical with distance-based clustering
n BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
n CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
n CHAMELEON (1999): hierarchical clustering using
dynamic modeling
Spectral Clustering
n See additional slides.