
Hierarchical Clustering and k-Means Clustering
– Khasha Dehnad
1

Clustering Task

– Clustering refers to grouping records, observations, or cases into classes of similar objects
– A cluster is a collection of records that are similar to one another
– Records in one cluster are dissimilar to records in other clusters
– Clustering is an unsupervised data mining task
  – Therefore, no target variable is specified
– Clustering algorithms segment records so that homogeneity within each subgroup is maximized
  – Similarity to records outside the cluster is minimized
2

Clustering Task (cont’d)

– For example, Claritas, Inc. provides demographic profiles of geographic areas, according to zip code
– The PRIZM segmentation system clusters zip codes in terms of lifestyle types
– Recall the clusters identified for 90210 Beverly Hills, CA:
  – Cluster 01: Blue Blood Estates – “Established executives, professionals, and ‘old money’ heirs that live in America’s wealthiest suburbs…”
  – Cluster 02: Winner’s Circle
  – Cluster 07: Money and Brains
  – Cluster 08: Young Literati
  – Cluster 10: Bohemian Mix
http://www.claritas.com/MyBestSegments/Default.jsp?ID=20#
3

Clustering Task (cont’d)

Clustering Tasks in Business and Research
– Target marketing for a niche product, without a large marketing budget
– Segmenting financial behavior into benign and suspicious categories
– Gene expression clustering, where genes exhibit similar characteristics
– Clustering is often performed as a preliminary step in the data mining process
  – Clustering results are then used as input to other data mining techniques
4

Clustering Task (cont’d)

– Clustering identifies groups of highly similar records
– Algorithms construct clusters where between-cluster variation (BCV) is large compared to within-cluster variation (WCV)
– Analogous to the concept behind analysis of variance
[Figure: schematic contrasting between-cluster variation with within-cluster variation]
5

Clustering Task (cont’d)

– Applying cluster analysis to enormous databases is helpful
  – Reduces the search space for downstream algorithms
– Cluster analysis addresses issues similar to those encountered in classification:
  – Similarity measurement
  – Recoding categorical variables
  – Standardizing and normalizing variables
  – Number of clusters
6

Distance Function

– How is similarity defined between an unclassified record and its neighbors?
– A distance metric is a real-valued function d, used to measure the similarity between coordinates x, y, and z, with the properties:
  1. $d(x, y) \geq 0$, and $d(x, y) = 0$ if and only if $x = y$
  2. $d(x, y) = d(y, x)$
  3. $d(x, z) \leq d(x, y) + d(y, z)$
– Property 1: Distance is always non-negative
– Property 2: Commutative; the distance from “A to B” is the distance from “B to A”
– Property 3: The triangle inequality holds; the distance from “A to C” must be less than or equal to the distance from “A to B to C”
7

Clustering Task (cont’d)

Measuring Similarity
– Euclidean distance measures the distance between records:
  $d_{\text{Euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$,
  where $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$ represent the m attribute values of two records
– Other distance measurements include City-Block distance and Minkowski distance (see the code sketch below):
  $d_{\text{City-Block}}(x, y) = \sum_i |x_i - y_i|$
  $d_{\text{Minkowski}}(x, y) = \left( \sum_i |x_i - y_i|^q \right)^{1/q}$
8
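
As an illustration (not part of the original slides), these distance measures can be sketched in Python; the function names below are my own, and City-Block and Euclidean are simply the q = 1 and q = 2 cases of Minkowski distance.

```python
import math

def minkowski_distance(x, y, q=2):
    """Minkowski distance of order q between two equal-length records."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

def euclidean_distance(x, y):
    """q = 2: straight-line distance between records."""
    return minkowski_distance(x, y, q=2)

def city_block_distance(x, y):
    """q = 1: sum of absolute attribute differences."""
    return minkowski_distance(x, y, q=1)

# Example with two of the records used later in the k-means example
print(euclidean_distance((1, 3), (2, 1)))   # ≈ 2.24
print(city_block_distance((1, 3), (2, 1)))  # 3.0
```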

Clustering Task (cont’d)

– The “Different From” function measures similarity between categorical attributes:
  $\text{different}(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$
– Substitute different(x_i, y_i) for each categorical attribute in the Euclidean distance function
– Normalizing the data enhances the performance of clustering algorithms
– Use Min-Max Normalization or Z-Score Standardization (sketched in code below):
  Min-Max Normalization: $\dfrac{X - \min(X)}{\max(X) - \min(X)}$
  Z-Score Standardization: $\dfrac{X - \text{mean}(X)}{\text{standard deviation}(X)}$
9
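
A small sketch (not from the slides) of these two rescaling options in Python, using an illustrative list of values of my own choosing:

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Center values at 0 with (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

ages = [25, 30, 45, 60]
print(min_max_normalize(ages))    # [0.0, 0.14..., 0.57..., 1.0]
print(z_score_standardize(ages))  # roughly [-0.95, -0.63, 0.32, 1.26]
```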

Hierarchical Clustering Methods

– Clustering algorithms are either hierarchical or non-hierarchical
– Hierarchical: a tree-like cluster structure (dendrogram) is created through recursive partitioning (Divisive Methods) or combining (Agglomerative Methods) of existing clusters
– Divisive Methods
  – All records are initialized into a single cluster
  – At each iteration, the most dissimilar record is split off into a separate cluster
  – Continues until each record represents a single cluster
10

Hierarchical Clustering Methods (cont’d)

– Agglomerative Methods
  – Each observation is initialized as its own cluster
  – At each iteration, the two closest clusters are aggregated together
  – The number of clusters is reduced by one at each step
  – Eventually, all records are combined into a single cluster
– Agglomerative is the more popular hierarchical method
  – Therefore, the focus remains on this approach
– Measuring distance between records is straightforward once recoding and normalization are applied
– However, how is the distance between clusters determined?
11

Hierarchical Clustering Methods (cont’d)

Distance Between Clusters
– Several criteria are examined to determine the distance between two clusters, A and B
– Single Linkage
  – Known as the Nearest-Neighbor approach
  – Minimum distance between any record in cluster A and any record in cluster B
  – Cluster similarity based on the most similar records from each cluster
  – Tends to form long, slender clusters
  – Sometimes heterogeneous records are clustered together
12

Hierarchical Clustering Methods (cont’d)

– Average Linkage
  – Measure is the average distance of records in cluster A from records in cluster B
  – Resulting clusters have approximately equal within-cluster variability
– Next, the linkage methods are examined using a small data set:
  2  5  9  15  16  18  25  33  33  45
13

Single-Linkage Clustering

– To begin, each record is assigned to its own cluster
– Single linkage seeks the minimum distance between any two records in separate clusters
– Step 1: Minimum cluster distance is between clusters {33} and {33}. Distance = 0, so the clusters are combined
– Step 2: Clusters {15} and {16} are combined, where distance = 1
– Step 3: Cluster {15, 16} is combined with cluster {18}
– Step 4: Clusters {2} and {5} are combined
[Dendrogram: data values 2, 5, 9, 15, 16, 18, 25, 33, 33, 45, with clusters {2, 5}, {15, 16}, {15, 16, 18}, and {33, 33} formed in these first steps]
14

Single-Linkage Clustering (cont’d)

[Dendrogram over the data values 2, 5, 9, 15, 16, 18, 25, 33, 33, 45, showing the successive clusters {2, 5}, {15, 16}, {33, 33}, {15, 16, 18}, {2, 5, 9}, {2, 5, 9, 15, 16, 18}, {2, 5, 9, 15, 16, 18, 25}, {2, 5, 9, 15, 16, 18, 25, 33, 33}, and {2, 5, 9, 15, 16, 18, 25, 33, 33, 45}]
– Agglomeration continues similarly through Steps 4–9 (see the code sketch below)
– Above, the last cluster {2, 5, 9, 15, 16, 18, 25, 33, 33, 45} contains all records in the data set
15
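
A minimal Python sketch (mine, not the slides’ code) of single-linkage agglomeration on this one-dimensional data set; it starts with every value in its own cluster and repeatedly merges the two clusters whose closest members are nearest, reproducing the merge sequence shown above.

```python
def single_linkage(values, target_clusters=1):
    """Agglomerative clustering with single (nearest-neighbor) linkage."""
    clusters = [[v] for v in values]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print(f"merge {clusters[i]} and {clusters[j]} at distance {d}")
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

single_linkage([2, 5, 9, 15, 16, 18, 25, 33, 33, 45])
```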

k-Means Clustering

– k-Means is effective at finding clusters in data
– k-Means Algorithm (a code sketch follows this slide)
  – Step 1: Analyst specifies k = the number of clusters to partition the data into
  – Step 2: k records are randomly assigned to be the initial cluster centers
  – Step 3: For each record, find the nearest cluster center. Each cluster center “owns” a subset of records, so we have a partition of the data set into k clusters, C1, C2, …, Ck
  – Step 4: For each of the k clusters, find the cluster centroid and update the cluster center location to the centroid
  – Step 5: Repeat Steps 3–5 until convergence or termination
16
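
A minimal Python sketch of the five steps above (my own code, not from the slides); it uses Euclidean distance as the “nearest” criterion in Step 3 and stops once the centroids no longer move.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means; returns the final centroids and each point's cluster index."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # Step 2: k records as initial centers
    members = []
    for _ in range(max_iter):
        # Step 3: assign each record to its nearest cluster center
        members = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                   for p in points]
        # Step 4: move each center to the centroid of the records it owns
        new_centroids = []
        for i in range(k):
            owned = [p for p, m in zip(points, members) if m == i]
            new_centroids.append(tuple(sum(c) / len(owned) for c in zip(*owned))
                                 if owned else centroids[i])
        if new_centroids == centroids:              # Step 5: stop at convergence
            break
        centroids = new_centroids
    return centroids, members

# The eight points a–h from the worked example later in the slides
pts = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
print(kmeans(pts, k=2))   # final centroids and memberships (depend on the random start)
```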

k-Means Clustering (cont’d)

– The “nearest” criterion in Step 3 is typically Euclidean distance
– Determining the Cluster Centroid (see the code sketch below)
  – The centroid of a set of points is the center of gravity of those points
  – Assume n data points (a1, b1, c1), (a2, b2, c2), …, (an, bn, cn)
  – The centroid is located at the point $\left( \sum_i a_i / n, \; \sum_i b_i / n, \; \sum_i c_i / n \right)$
– For example, the points (1, 1, 1), (1, 2, 1), (1, 3, 1), and (2, 1, 1) have centroid
  $\left( \dfrac{1+1+1+2}{4}, \dfrac{1+2+3+1}{4}, \dfrac{1+1+1+1}{4} \right) = (1.25, 1.75, 1.00)$
17
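
The centroid calculation above is a one-liner in Python (an illustrative sketch, not from the slides):

```python
def centroid(points):
    """Component-wise mean (center of gravity) of a set of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

print(centroid([(1, 1, 1), (1, 2, 1), (1, 3, 1), (2, 1, 1)]))  # (1.25, 1.75, 1.0)
```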

k-Means Clustering (cont’d)

– The k-Means algorithm terminates when the centroids no longer change
  – For k clusters, C1, C2, …, Ck, all records “owned” by each cluster remain in that cluster
– A convergence criterion may also cause termination
  – For example, no significant reduction in SSE (see the code sketch below):
  $\text{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$, where
  $p \in C_i$ is each data point in cluster i, and $m_i$ represents the centroid of cluster i
18
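
SSE is equally direct to compute from a clustering result. A small sketch, assuming the kmeans helper and pts list from the earlier sketch:

```python
import math

def sse(points, centroids, members):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[m]) ** 2 for p, m in zip(points, members))

centroids, members = kmeans(pts, k=2)
print(sse(pts, centroids, members))   # SSE of the final partition
```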

Example of k-Means Clustering at Work

– Assume k = 2 to cluster the following data points:

  a (1, 3)   b (3, 3)   c (4, 3)   d (5, 3)
  e (1, 2)   f (4, 2)   g (1, 1)   h (2, 1)

– Step 1: k = 2 specifies the number of clusters to partition
– Step 2: Randomly assign k = 2 cluster centers
  – For example, m1 = (1, 1) and m2 = (2, 1)
19

Example of k-Means Clustering at Work

First Pass (copied from book)

Centroid   d1   d2
m1         1    1
m2         2    1

Clustering
Point   d1   d2   Distance from m1   Distance from m2   Cluster Membership   SE
a       1    3
b       3    3
c       4    3
d       5    3
e       1    2
f       4    2
g       1    1
h       2    1

SSE
BCV
BCV/WCV
20

Example of k-Means Clustering at Work

First Pass (copied from book)

Centroid   d1   d2
m1         1    1
m2         2    1

Clustering
Point   d1   d2   Distance from m1   Distance from m2   Cluster Membership   SE
a       1    3    2.00               2.24               C1                   4
b       3    3    2.83               2.24               C2                   5
c       4    3    3.61               2.83               C2                   8
d       5    3    4.47               3.61               C2                   13
e       1    2    1.00               1.41               C1                   1
f       4    2    3.16               2.24               C2                   5
g       1    1    0.00               1.00               C1                   0
h       2    1    1.00               0.00               C2                   0

SSE       36.00
BCV       1
BCV/WCV   0.028

Centroid (newly calculated)   d1    d2
m1                            1     2
m2                            3.6   2.4

(A short code check of this first pass follows.)
21
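
The first-pass numbers in this table can be reproduced with a few lines of Python (a sketch of my own, reusing the eight points and the initial centers m1 = (1, 1), m2 = (2, 1)):

```python
import math

pts = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
       "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = (1, 1), (2, 1)

total_se = 0.0
for name, p in pts.items():
    d1, d2 = math.dist(p, m1), math.dist(p, m2)
    cluster = "C1" if d1 < d2 else "C2"
    total_se += min(d1, d2) ** 2
    print(f"{name}: d1 = {d1:.2f}  d2 = {d2:.2f}  -> {cluster}")
print("SSE =", total_se)                          # ≈ 36
print("BCV/WCV =", math.dist(m1, m2) / total_se)  # ≈ 0.028
```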

Example of k-Means Clustering at Work

Second Pass

Centroid   d1    d2
m1         1     2
m2         3.6   2.4

Clustering
Point   d1   d2   Distance from m1   Distance from m2   Cluster Membership   SE
a       1    3    1.00               2.67               C1                   1
b       3    3    2.24               0.85               C2                   0.72
c       4    3    3.16               0.72               C2                   0.52
d       5    3    4.12               1.52               C2                   2.32
e       1    2    0.00               2.63               C1                   0
f       4    2    3.00               0.57               C2                   0.32
g       1    1    1.00               2.95               C1                   1
h       2    1    1.41               2.13               C1                   2

SSE       7.88
BCV       2.631
BCV/WCV   0.334

Centroid (newly calculated)   d1     d2
m1                            1.25   1.75
m2                            4      2.75
22

Example of k-Means Clustering at Work

Third Pass (copied from book)

Centroid   d1     d2
m1         1.25   1.75
m2         4      2.75

Clustering
Point   d1   d2   Distance from m1   Distance from m2   Cluster Membership   SE
a       1    3    1.27               3.01               C1                   1.625
b       3    3    2.15               1.03               C2                   1.0625
c       4    3    3.02               0.25               C2                   0.0625
d       5    3    3.95               1.03               C2                   1.0625
e       1    2    0.35               3.09               C1                   0.125
f       4    2    2.76               0.75               C2                   0.5625
g       1    1    0.79               3.47               C1                   0.625
h       2    1    1.06               2.66               C1                   1.125

SSE       6.25
BCV       2.926
BCV/WCV   0.468

Centroid (newly calculated)   d1     d2
m1                            1.25   1.75
m2                            4      2.75
23

Clustering Task (repeated slide)

– Clustering identifies groups of highly similar records
– Algorithms construct clusters where between-cluster variation (BCV) is large compared to within-cluster variation (WCV)
– Analogous to the concept behind analysis of variance
[Figure: schematic contrasting between-cluster variation with within-cluster variation]
24

Example of k-Means Clustering at Work

– Assume k = 2 to cluster the following data points:

  a (1, 3)   b (3, 3)   c (4, 3)   d (5, 3)
  e (1, 2)   f (4, 2)   g (1, 1)   h (2, 1)

– Step 1: k = 2 specifies the number of clusters to partition
– Step 2: Randomly assign k = 2 cluster centers
  – For example, m1 = (1, 1) and m2 = (2, 1)
– First Iteration
– Step 3: For each record, find the nearest cluster center
  – Euclidean distance from each point to m1 and m2 shown below

Point                a      b      c      d      e      f      g      h
Distance from m1     2.00   2.83   3.61   4.47   1.00   3.16   0.00   1.00
Distance from m2     2.24   2.24   2.83   3.61   1.41   2.24   1.00   0.00
Cluster Membership   C1     C2     C2     C2     C1     C2     C1     C2
25

Example of k-Means Clustering at Work (cont’d)

– Cluster membership assigned, now SSE is calculated
  – Cluster m1 contains {a, e, g} and cluster m2 has {b, c, d, f, h}

  $\text{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 = 2^2 + 2.24^2 + 2.83^2 + 3.61^2 + 1^2 + 2.24^2 + 0^2 + 0^2 = 36$

– Recall clusters are constructed so that between-cluster variation (BCV) is large compared to within-cluster variation (WCV)

  $\dfrac{\text{BCV}}{\text{WCV}} = \dfrac{d(m_1, m_2)}{\text{SSE}} = \dfrac{1}{36} = 0.0278$,
  where $d(m_1, m_2)$ is a surrogate for BCV and SSE is a surrogate for WCV

– The ratio BCV/WCV is expected to increase over successive iterations
26

Example of k-Means Clustering at Work (cont’d)

– Step 4: For each of the k clusters, find the cluster centroid and update its location
  – Cluster 1 centroid = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2)
  – Cluster 2 centroid = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
– Figure shows the movement of cluster centers m1 and m2 (triangles) after the first iteration of the algorithm
[Figure: scatter plot of points a–h with the updated centroids m1 and m2 marked as triangles]
27


Example of k-Means Clustering at Work (cont’d)

– Step 5: Repeat Steps 3–4 until convergence or termination
– Second Iteration
  – Repeat the procedure for Steps 3–4
  – Again, for each record find the nearest cluster center, m1 = (1, 2) or m2 = (3.6, 2.4)
  – Cluster m1 contains {a, e, g, h} and m2 has {b, c, d, f}
  – SSE = 7.88, and BCV/WCV = 0.334
  – Note 0.334 has increased compared to the First Iteration value of 0.0278
– Between-cluster variation is increasing with respect to within-cluster variation
28


Example of k-Means Clustering at Work (cont’d)

– Cluster centroids updated to m1 = (1.25, 1.75) and m2 = (4, 2.75)
– After the Second Iteration, the cluster centroids are shown to move only slightly
[Figure: scatter plot of points a–h with the updated centroids m1 and m2]
29

Example of k-Means Clustering at Work (cont’d)

– Third (Final) Iteration
  – Repeat the procedure for Steps 3–4
  – Now, for each record find the nearest cluster center, m1 = (1.25, 1.75) or m2 = (4, 2.75)
  – SSE = 6.25, and BCV/WCV = 0.468
  – Again, BCV/WCV has increased compared to the previous value of 0.334
  – This time, no records shift cluster membership
  – The centroids remain unchanged, therefore the algorithm terminates
30


Example of k-Means Clustering at Work (cont’d)

Summary
– k-Means is not guaranteed to find the global minimum SSE
  – Instead, a local minimum is found
  – Invoking the algorithm with a variety of initial cluster centers improves the probability of achieving a global minimum
  – One approach places the first cluster center at a random point, with the remaining centers placed far from the previous centers (Moore)
– Potential problem in applying k-Means: what is an appropriate value for k?
  – The analyst may have a priori knowledge of k
31


Example of k-Means Clustering at Work (cont’d)

– An outer loop to the algorithm is possible (see the sketch below)
  – Cycle through different k values
  – Compare the results, selecting the solution with the smallest SSE
– What attributes should be used as input?
  – Some attributes are likely more relevant than others
  – Apply the axis-stretching methods for quantifying attribute relevance discussed in Chapter 5
32
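
A sketch of such an outer loop (my own, assuming the kmeans and sse helpers defined in the earlier sketches); it tries several values of k with a few random starts each and reports the smallest SSE found for each k, leaving the final choice to the analyst:

```python
def compare_k(points, k_values=(2, 3, 4), restarts=5):
    """Run k-means for several k values and random starts; report the best SSE per k."""
    results = {}
    for k in k_values:
        best = None
        for seed in range(restarts):
            centroids, members = kmeans(points, k, seed=seed)
            err = sse(points, centroids, members)
            if best is None or err < best:
                best = err
        results[k] = best
    return results

pts = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
print(compare_k(pts))   # best SSE found for each candidate k
```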