
IT enabled Business Intelligence, CRM, Database Applications

Sep-18
Clustering
Prof. Vibs Abhishek
The Paul Merage School of Business
University of California, Irvine
BANA 273 Session 8

1

Agenda
Assignment 4 due on Canvas soon
Please work on your projects
Clustering using k-means algorithm

2

Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.

3

What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized
Euclidean Distance Based Clustering in 3-D space.

4

Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

5

Notion of a Cluster can be Ambiguous

How many clusters?

Four Clusters

Two Clusters

Six Clusters

6

Types of Clusterings
A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical clustering
A set of nested clusters organized as a hierarchical tree

7

Partitional Clustering

Original Points

A Partitional Clustering

8

K-Means Clustering
Begin by specifying K, the number of clusters
Select K points as initial cluster centroids
Assign each point to the cluster whose centroid is closest under the similarity measure (e.g., Euclidean distance)
Re-compute the centroids of the clusters
Repeat steps 3 and 4 until points stop moving between clusters
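A minimal sketch of this loop in Python (illustrative only; plain NumPy, not the Weka implementation, and it assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Toy k-means: pick k initial centroids, then alternate between
    assigning points to the nearest centroid and re-computing centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: re-compute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and hence the assignments) settle
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centers, labels = k_means(data, k=2)
print(centers, labels)  # note: a poor initialization can still end in a local optimum
```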

9

Similarity Measure
Need a distance measure
Example of a distance measure:
Manhattan distance: D(X, Y) = Σ |xᵢ - yᵢ|  (sum over attributes i = 1 … n)

10

Similarity Metric
Example of a distance measure:
Euclidean distance: D(X, Y) = √( Σ (xᵢ - yᵢ)² )  (sum over attributes i = 1 … n)

11

Example of Euclidean Distance

Distance(John, Rachel) = √[ (35 - 41)² + (95 - 215)² + (3 - 2)² ]

John:
Age=35
Income=95K
no. of credit cards=3

Rachel:
Age=41
Income=215K
no. of credit cards=2
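Checking the arithmetic with a short snippet (values from the slide):

```python
import math

john = (35, 95, 3)      # age, income in $K, number of credit cards
rachel = (41, 215, 2)

dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(john, rachel)))
print(round(dist, 2))   # 120.15 -- the unscaled income difference dominates
```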

12

K-Means
Three points: (Age = 36, Income = 48K), (41, 54K), (43, 51K)
Cluster center:
Age = (36 + 41 + 43)/3 = 40
Income = (48K + 54K + 51K)/3 = 51K
Center: (40, 51K)
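The same center, computed directly (a small sketch with NumPy):

```python
import numpy as np

points = np.array([[36, 48], [41, 54], [43, 51]])  # (Age, Income in $K)
print(points.mean(axis=0))                         # [40. 51.] -> center (40, 51K)
```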

13

Example: 2-Means
[Figure: 2-means iterations on ten (x, y) points; centroids move from (3.67, 5.17) and (6.75, 4.25) to (3.67, 5.83) and (6.17, 3.83)]
14

K-means clustering
Select inputs

15

K-means clustering
Select inputs
Select k cluster centers

16

K-means clustering
Select inputs
Select k cluster centers
Assign cases to closest center
Need to define “close”

17

K-means clustering
Select inputs
Select k cluster centers
Assign cases to closest center
Need to define “close”
Update cluster centers

18

K-means clustering
Select inputs
Select k cluster centers
Assign cases to closest center
Need to define “close”
Update cluster centers
Re-assign cases

19

K-means clustering
Select inputs
Select k cluster centers
Assign cases to closest center
Need to define “close”
Update cluster centers
Re-assign cases
Repeat steps 4 and 5 until changes in cluster centers & assigned cases are insignificant

20

“k” in k-means clustering
Generally, k is set in advance
If k is not known, a good idea is to try several values of k near the number of clusters you expect from the data and observe how the within-cluster sum of distances shrinks as k grows

21

Cluster Validity
Compute the ratio
= [sum of squared distances for a given k] / [sum of squared distances to the mean of all records (k = 1)]
If the ratio is near 1.0, the clustering has not been very effective
If it is small, we have well-separated groups
Weka reports the sum of squared errors (intra-cluster distance)
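A sketch of this ratio for several values of k. It assumes scikit-learn rather than Weka (`inertia_` is scikit-learn's within-cluster sum of squared distances), and the data here is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data; substitute your own

# k = 1: total squared distance of all records to the overall mean
sse_total = KMeans(n_clusters=1, n_init=10).fit(X).inertia_

for k in range(2, 7):
    sse_k = KMeans(n_clusters=k, n_init=10).fit(X).inertia_
    # near 1.0 -> weak clustering; small -> well-separated groups
    print(k, round(sse_k / sse_total, 3))
```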

22

Example
Customer   Age    Income
John       0.55   0.175
Rachel     0.34   0.25
Hannah     1.00   1.00
Tom        0.93   0.85
Nellie     0.39   0.20
David      0.58   0.25

[Figure: the six customers plotted on Age vs. Income]
Note: Both Age and Income are normalized.

23

Step 1:
Nellie and David are selected as cluster centers A and B respectively

Customer   Distance from David   Distance from Nellie
John       0.08                  0.16
Rachel     0.24                  0.07
Hannah     0.86                  1.01
Tom        0.69                  0.85
Nellie     (center)              (center)
David      (center)              (center)

K-Means Algorithm: Example

[Figure: Age vs. Income scatter with initial centers A (Nellie) and B (David)]

24

Calculate cluster centers:
Cluster A center:
Age = 0.37, Income = 0.23
Cluster B center:
Age = 0.77, Income = 0.57

Assign customers to clusters based on new cluster centers

K-Means Algorithm: Example
[Figure: Age vs. Income scatter with the updated centers]

25

Customer   Distance from A   Distance from B
John       0.19              0.45
Rachel     0.04              0.54
Hannah     0.99              0.49
Tom        0.84              0.32
Nellie     0.04              0.53
David      0.21              0.37

K-Means Algorithm: Example

[Figure: Age vs. Income scatter with centers A and B after re-assignment]

26

Calculate cluster centers:
Cluster A center:
Age = 0.47, Income = 0.22
Cluster B center:
Age = 0.97, Income = 0.93

Cluster assignments no longer change, so the algorithm stops
K-Means Algorithm: Example

[Figure: Age vs. Income scatter with the final centers]
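The whole worked example can be reproduced with a short script (a sketch; it reuses the normalized values from slide 23 and seeds the centers with Nellie and David):

```python
import numpy as np

names = ["John", "Rachel", "Hannah", "Tom", "Nellie", "David"]
X = np.array([[0.55, 0.175], [0.34, 0.25], [1.00, 1.00],
              [0.93, 0.85], [0.39, 0.20], [0.58, 0.25]])  # (Age, Income), normalized

centers = X[[4, 5]].copy()                # cluster A = Nellie, cluster B = David
for _ in range(10):
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)             # 0 -> cluster A, 1 -> cluster B
    new = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
    if np.allclose(new, centers):         # assignments stopped changing
        break
    centers = new

print(np.round(centers, 2))               # approx. [[0.47 0.22] [0.97 0.93]], as on the slide
for name, lab in zip(names, labels):
    print(name, "AB"[lab])                # John, Rachel, Nellie, David -> A; Hannah, Tom -> B
```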

27

Scale and Weigh Data
Scaling makes sure that the distance is not biased by units (1K, 1M, etc.)

Weighting can add the information that one variable is more (or less) important than others.

After scaling to get rid of biases caused by different units, use weights to introduce bias based on knowledge of the business context.
(e.g., two households with the same income are more similar than two households with the same number of pets.)

Common way to scale:
Range: (value-min)/(max-min); [0,1]
E.g. {11, 8, 4, 6, 10, 1} → {1, 0.7, 0.3, 0.5, 0.9, 0}

Range scaling: not as sensitive to outliers
Standard deviation (z-score) scaling: sensitive to outliers (because it squares the difference between each observation and the mean)
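Range scaling from this slide as a tiny sketch (the weighting step simply multiplies a scaled column by a business-chosen weight; the weight of 2 below is an illustrative choice):

```python
def range_scale(values):
    """Min-max scaling: (value - min) / (max - min), mapping into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(range_scale([11, 8, 4, 6, 10, 1]))   # [1.0, 0.7, 0.3, 0.5, 0.9, 0.0]

# Weighting after scaling: bias the distance toward a more important variable
income_weighted = [2.0 * v for v in range_scale([11, 8, 4, 6, 10, 1])]
```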
28

What is a “Good” cluster?
A. Inter-cluster distance is maximized and intra-cluster distance is maximized
B. Inter-cluster distance is minimized and intra-cluster distance is maximized
C. Inter-cluster distance is maximized and intra-cluster distance is minimized
D. Inter-cluster distance is minimized and intra-cluster distance is minimized
E. None of the Above

29

Clustering in Weka
Utility Example

East West Airlines

http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/k-means.html

30

Clustering Exercise
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

Start with individuals 1 and 4 as the initial centroids
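A sketch for checking your result (the same assign/update loop as before, seeded with subjects 1 and 4):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])  # subjects 1-7, attributes A and B

centers = X[[0, 3]].copy()  # subjects 1 and 4 as the initial centroids
for _ in range(10):
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    new = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
    if np.allclose(new, centers):
        break
    centers = new

# Expect subjects {1, 2} in one cluster and {3, 4, 5, 6, 7} in the other,
# with centers (1.25, 1.5) and (3.9, 5.1)
print(labels, np.round(centers, 2))
```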

Strengths and Weaknesses of the K-Means
Strength
Relatively efficient
Simple implementation

Weakness
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers well
Euclidean distance does not work for nominal variables.

Categorical and rank data can be converted to binary (0/1) variables, which can then be used as numeric values in distance calculations.
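For instance, a nominal attribute can be expanded into 0/1 indicator columns (a sketch with pandas; the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "east", "north"]})
print(pd.get_dummies(df, columns=["region"], dtype=int))
# -> region_east, region_north, region_west columns of 0s and 1s
```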
32

Applications of Clustering
Marketing: Customer segmentation (discovery of distinct groups of customers) for target marketing. Create product differentiation: different offers for different segments (full personalization is not always possible.)
Car insurance: Identify customer groups with high average claim cost
Property: Identify houses in the same city with similar characteristics
Image recognition
Creating document collections, or grouping web pages

33

Review of Assignments

Next Session
Review of Assignment 4
Review of sample final exam
Other Data mining techniques
Text Mining
Collaborative Filtering

35
