IT enabled Business Intelligence, CRM, Database Applications
Sep-18
Clustering
Prof. Vibs Abhishek
The Paul Merage School of Business
University of California, Irvine
BANA 273 Session 8
1
Agenda
Assignment 4 due on Canvas soon
Please work on your projects
Clustering using k-means algorithm
2
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in the same cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
3
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
Euclidean Distance Based Clustering in 3-D space.
4
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
5
Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same set of points grouped as two, four, or six clusters]
6
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
7
Partitional Clustering
Original Points
A Partitional Clustering
8
K-Means Clustering
1. Begin by specifying K, the number of clusters
2. Select K points as initial cluster centroids
3. Assign each point to the cluster whose centroid is closest, using a similarity measure (e.g., Euclidean distance)
4. Re-compute the centroids of the clusters
5. Repeat steps 3 and 4 until points stop moving between clusters
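A minimal sketch of these steps in Python with NumPy (my own illustrative code, not from the course, which uses Weka); the stopping rule is the one above: iterate until no point changes cluster.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Steps 2-5 above: pick K centroids, assign, re-compute, repeat."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: select K of the points as the initial cluster centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignments = np.full(len(points), -1)
    for _ in range(max_iters):
        # Step 3: assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # points stopped moving between clusters
        assignments = new_assignments
        # Step 4: re-compute each centroid as the mean of its cluster's points
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = points[assignments == j].mean(axis=0)
    return centroids, assignments

# The ten points from the 2-Means example later in the deck:
X = [[3, 4], [3, 6], [7, 3], [4, 7], [3, 8], [8, 5], [4, 5], [5, 1], [7, 4], [5, 5]]
centroids, labels = kmeans(X, k=2)
```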
9
Similarity Measure
Need a distance measure
Example of a distance measure:
Manhattan distance: D(X, Y) = Σ_{i=1}^{n} |x_i − y_i|
10
Similarity Metric
Example of a distance measure:
Euclidean distance: D(X, Y) = √( Σ_{i=1}^{n} (x_i − y_i)² )
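Both measures as plain Python functions (a sketch; the function names are mine):

```python
def manhattan(x, y):
    # D(X, Y) = sum over i of |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # D(X, Y) = sqrt(sum over i of (x_i - y_i)^2)
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
```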
11
Example of Euclidean Distance
Distance(John, Rachel) = √[(35 − 41)² + (95 − 215)² + (3 − 2)²] = √14437 ≈ 120.15
John:
Age=35
Income=95K
no. of credit cards=3
Rachel:
Age=41
Income=215K
no. of credit cards=2
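A quick check of this arithmetic with Python's standard library (math.dist computes the Euclidean distance between two points):

```python
import math

john = (35, 95, 3)      # age, income in $K, number of credit cards
rachel = (41, 215, 2)
print(math.dist(john, rachel))  # ~120.15 -- the income term dominates
```

Note how income swamps the other attributes because of its scale; this is the motivation for the scaling discussed later in the deck.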
12
K-Means
Three cluster members: (Age = 41, Income = 54K), (43, 51K), (36, 48K)
The cluster center is the mean of the members:
Age = (36 + 41 + 43)/3 = 40
Income = (48K + 51K + 54K)/3 = 51K
Center: (Age = 40, Income = 51K)
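The same centroid computation with NumPy (a sketch; the array is just the three points above):

```python
import numpy as np

members = np.array([[41, 54], [43, 51], [36, 48]])  # (age, income in $K)
print(members.mean(axis=0))                          # [40. 51.] -- the cluster center
```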
13
Example: 2-Means
[Figure: ten points (3,4), (3,6), (7,3), (4,7), (3,8), (8,5), (4,5), (5,1), (7,4), (5,5) partitioned into two clusters over successive iterations]
14
K-means clustering
1. Select inputs
2. Select k cluster centers
3. Assign cases to the closest center (need to define "close")
4. Update cluster centers
5. Re-assign cases
6. Repeat steps 4 and 5 until changes in cluster centers and assigned cases are insignificant
20
“k” in k-means clustering
Generally, k is set in advance.
If it is not known, a good approach is to try several values of k near the number of clusters you expect in the data and observe how the within-cluster sum of distances falls as k grows.
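A sketch of that procedure with scikit-learn (an assumption on my part; the course itself uses Weka). The inertia_ attribute is the within-cluster sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: the ten points from the 2-Means example
X = np.array([[3, 4], [3, 6], [7, 3], [4, 7], [3, 8],
              [8, 5], [4, 5], [5, 1], [7, 4], [5, 5]])

# Try several k's and watch how the total within-cluster distance falls
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```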
21
Cluster Validity
Compute the ratio
ratio = [sum of squared distances for a given k] / [sum of squared distances to the mean of all the records, i.e., k = 1]
If the ratio is near 1.0, the clustering has not been very effective.
If it is small, we have well-separated groups.
Weka reports the sum of squared errors (intra-cluster distance).
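The same ratio in code (again a scikit-learn sketch; the inertia at k = 1 is exactly the sum of squared distances to the overall mean):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 4], [3, 6], [7, 3], [4, 7], [3, 8],
              [8, 5], [4, 5], [5, 1], [7, 4], [5, 5]])

sse_k = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).inertia_
sse_1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X).inertia_
print(sse_k / sse_1)  # near 1.0: weak clustering; much smaller: well-separated groups
```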
22
Example
Customer   Age    Income
John       0.55   0.175
Rachel     0.34   0.25
Hannah     1.00   1.00
Tom        0.93   0.85
Nellie     0.39   0.20
David      0.58   0.25

[Scatter plot: Income vs. Age]
Note: Both Age and Income are normalized.
23
K-Means Algorithm: Example
Step 1: Nellie and David are selected as cluster centers A and B, respectively.

Customer   Distance from David   Distance from Nellie
John       0.08                  0.16
Rachel     0.24                  0.07
Hannah     0.86                  1.01
Tom        0.69                  0.85

(Nellie and David are the centers themselves.)
[Scatter plot: Income vs. Age, with centers A (Nellie) and B (David)]
24
K-Means Algorithm: Example
Calculate the cluster centers:
Cluster A center: Age = 0.37, Income = 0.23
Cluster B center: Age = 0.77, Income = 0.57
Assign customers to clusters based on the new cluster centers.
[Scatter plot: Income vs. Age]
25
K-Means Algorithm: Example

Customer   Distance from A   Distance from B
John       0.19              0.45
Rachel     0.04              0.54
Hannah     0.99              0.49
Tom        0.84              0.32
Nellie     0.04              0.53
David      0.21              0.37

Cluster A is now {John, Rachel, Nellie, David}; cluster B is {Hannah, Tom}.
[Scatter plot: Income vs. Age, with centers A and B]
26
K-Means Algorithm: Example
Calculate the cluster centers:
Cluster A center: Age = 0.47, Income = 0.22
Cluster B center: Age = 0.97, Income = 0.93
Cluster assignments do not change, so the algorithm stops.
[Scatter plot: Income vs. Age]
27
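The whole worked example can be replayed in a few lines of NumPy (a sketch; the array is the normalized customer table from the Example slide):

```python
import numpy as np

# Normalized (age, income) for John, Rachel, Hannah, Tom, Nellie, David
pts = np.array([[0.55, 0.175], [0.34, 0.25], [1.0, 1.0],
                [0.93, 0.85], [0.39, 0.2], [0.58, 0.25]])
centers = pts[[4, 5]]  # Nellie and David as the initial centers A and B

for _ in range(10):  # more than enough iterations to converge
    # distance of every customer to each center, then re-assign and re-center
    d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([pts[labels == j].mean(axis=0) for j in range(2)])

print(labels)   # [0 0 1 1 0 0] -> A = {John, Rachel, Nellie, David}, B = {Hannah, Tom}
print(centers)  # A ~ (0.47, 0.22), B ~ (0.97, 0.93), matching the final slide
```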
Scale and Weigh Data
Scaling makes sure that the distance is not biased by units (1K vs. 1M, etc.).
Weighting adds the information that one variable is more (or less) important than the others.
After scaling to remove the bias caused by different units, use weights to introduce bias based on knowledge of the business context.
(e.g., two households with the same income are more similar than two households with the same number of pets.)
Common way to scale:
Range: (value − min)/(max − min), which maps values into [0, 1]
E.g., {11, 8, 4, 6, 10, 1} → {1, 0.7, 0.3, 0.5, 0.9, 0}
Range scaling: not as sensitive to outliers
Standard-deviation scaling: sensitive to outliers (because it squares the difference between each observation and the mean)
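The range scaling above as a small Python function (the output matches the slide's example):

```python
def range_scale(values):
    """(value - min) / (max - min): maps every value into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(range_scale([11, 8, 4, 6, 10, 1]))  # [1.0, 0.7, 0.3, 0.5, 0.9, 0.0]
```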
28
What is a “Good” cluster?
A. Inter-cluster distance is maximized and intra-cluster distance is maximized
B. Inter-cluster distance is minimized and intra-cluster distance is maximized
C. Inter-cluster distance is maximized and intra-cluster distance is minimized
D. Inter-cluster distance is minimized and intra-cluster distance is minimized
E. None of the Above
29
Clustering in Weka
Utility Example
East West Airlines
http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/k-means.html
30
Clustering Exercise
Subject   A     B
1         1.0   1.0
2         1.5   2.0
3         3.0   4.0
4         5.0   7.0
5         3.5   5.0
6         4.5   5.0
7         3.5   4.5
Start with subjects 1 and 4 as the initial centroids.
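One way to check a hand-worked answer, assuming scikit-learn is available: pass subjects 1 and 4 explicitly as the initial centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

subjects = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                     [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
init = subjects[[0, 3]]  # subjects 1 and 4 as the initial centroids

km = KMeans(n_clusters=2, init=init, n_init=1).fit(subjects)
print(km.labels_)           # cluster assignment for each subject
print(km.cluster_centers_)  # final centroids
```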
Strengths and Weaknesses of the K-Means
Strength
Relatively efficient
Simple implementation
Weakness
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers well
Euclidean distance does not work for nominal variables.
Categorical and rank data can, however, be converted to binary (0/1) variables, which can then be used as numeric values in the distance calculation.
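For instance, with pandas (an illustrative sketch; the column name is made up), get_dummies turns a nominal attribute into 0/1 indicator columns:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
print(pd.get_dummies(df, columns=["color"], dtype=int))  # one 0/1 column per category
```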
32
Applications of Clustering
Marketing: Customer segmentation (discovery of distinct groups of customers) for target marketing. Create product differentiation: different offers for different segments (full personalization is not always possible).
Car insurance: Identify customer groups with high average claim cost
Property: Identify houses in the same city with similar characteristics
Image recognition
Creating document collections, or grouping web pages
33
Review of Assignments
Next Session
Review of Assignment 4
Review of sample final exam
Other Data mining techniques
Text Mining
Collaborative Filtering
35