
EM623-Week5

Carlo Lipizzi
clipizzi@stevens.edu

SSE

2016

Machine Learning and Data Mining
Clustering and association analysis using kMeans and basket analysis

Machine learning and our focus

• Like human learning from past experiences
• A computer does not have “experiences”
• A computer system learns from data, which represent some

“past experiences” of an application domain
• Our focus: learn a target function that can be used to predict the values
of a discrete class attribute, e.g. approved or not approved, high risk or low risk

• The task is commonly called supervised learning, classification, or inductive learning

2

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
– Supervision: The training data (observations, measurements,

etc.) are accompanied by labels indicating the class of the
observations

– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of data, the task is to establish the existence of

classes or clusters in the data

3

Clustering

• Clustering is a technique for finding similarity groups in data,
called clusters

– it groups data instances that are similar to (near) each other
in one cluster and data instances that are very different (far
away) from each other into different clusters

• Clustering is an unsupervised learning task as no class values
denoting an a priori grouping of the data instances are given

• Due to historical reasons, clustering is often considered
synonymous with unsupervised learning

4

Clustering

• Cluster analysis addresses similar issues to those in
classification

– Similarity measurement
– Recoding categorical variables
– Standardizing and normalizing variables
– Number of clusters

5

Clustering

• The art of finding groups in data
• Objective: gather items from a database into sets

according to (unknown) common characteristics
• Much more difficult than classification since the classes

are not known in advance (no training)
• Technique: unsupervised learning

6

Examples

• The data set on the left has three natural groups of
data points, i.e., 3 natural clusters

• Marketing: finding groups of customers with similar
behavior given a large database of customer data
containing their properties and past buying records

• Biology: classification of plants and animals given
their features

7

• Insurance: identifying groups of motor insurance policy holders with a high
average claim cost; identifying frauds

• City-planning: identifying groups of houses according to their house type,
value and geographical location

• Earthquake studies: clustering observed earthquake epicenters to identify
dangerous zones

• WWW: document classification; clustering weblog data to discover groups of
similar access patterns

Aspects of clustering

• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
– …

• A distance (similarity, or dissimilarity) function
• Clustering quality

– Inter-cluster distance ⇒ maximized
– Intra-cluster distance ⇒ minimized

• The quality of a clustering result depends on the algorithm, the
distance function, and the application

8

Aspects of clustering

[Figure: a high-complexity separator with a low error rate vs. a low-complexity separator with a high error rate]

9

k-Means Clustering

• The K-means clustering algorithm is a simple method for estimating the
means (vectors) of a set of K groups

• The simplicity of the algorithm can also lead to bad solutions

10

The K-Means Clustering Method

[Figure: the k-means method illustrated with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update again until the assignments stabilize]

11

k-Means Algorithm

12

k-Means Algorithm

• Step 1: Begin with a decision on the value of k = the number of clusters

• Step 2: Choose any initial (e.g., random) partition that classifies the data
into k clusters

• Step 3: Take each sample in sequence and compute its distance from the
centroid of each cluster. If a sample is not currently in the cluster with the
closest centroid, switch it to that cluster and update the centroids of the
cluster gaining the sample and the cluster losing it

• Step 4: Repeat Step 3 until convergence is achieved, that is, until
a pass through the training samples causes no new assignments
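A minimal Python sketch of these four steps (illustrative only, not part of the original slides; it uses Euclidean distance and random initial centers):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: pick k initial centers, assign, update, repeat."""
    random.seed(seed)
    centers = random.sample(points, k)                 # Steps 1-2: k initial centers
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old center if empty)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when a full pass changes no assignments/centers
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Running it several times with different seeds, as the summary slide later recommends, guards against poor local optima.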

13

Euclidean Distance

• Euclidean Distance

dist(p, q) = sqrt( Σ_{k=1..n} (p_k − q_k)² )

– where n is the number of dimensions (attributes) and p_k and q_k are,
respectively, the kth attributes (components) of data objects p and q

• Standardization is necessary if scales differ

14

Euclidean Distance

[Figure: four points p1–p4 plotted in two dimensions]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
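The distance matrix can be reproduced with a few lines of Python (an illustration, not part of the slides):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Pairwise Euclidean distances, rounded to 3 decimals as in the table
for name, p in points.items():
    row = [round(math.dist(p, q), 3) for q in points.values()]
    print(name, row)
# p1 [0.0, 2.828, 3.162, 5.099]
# p2 [2.828, 0.0, 1.414, 3.162]
# p3 [3.162, 1.414, 0.0, 2.0]
# p4 [5.099, 3.162, 2.0, 0.0]
```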

15

Data Description and Clustering

• Euclidean distance is a possible metric: a possible criterion is to
assume that samples belong to the same cluster if their distance is less
than a threshold d0

16

Data Description and Clustering

• Clusters defined by Euclidean distance are invariant to
translations and rotation of the feature space, but not invariant
to general transformations that distort the distance relationship

• To achieve invariance, one can normalize the data, e.g., such
that they all have zero means and unit variance
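A small sketch of that normalization (column-wise z-scores; assumes a plain list-of-rows matrix with at least two rows per column):

```python
import statistics

def zscore(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]

print(zscore([[1, 200], [2, 400], [3, 600]]))
# [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```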

17

Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance

– where r is a parameter, n is the number of dimensions (attributes) and
p_k and q_k are, respectively, the kth attributes (components) of data
objects p and q

dist(p, q) = ( Σ_{k=1..n} |p_k − q_k|^r )^(1/r)

18

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is

just the number of bits that are different between two binary
vectors

• r = 2. Euclidean distance

• r → ∞: “supremum” (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of

the vectors

• Do not confuse r with n, i.e., all these distances are defined for all
numbers of dimensions
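A short illustration (not from the slides) of the three special cases for a single pair of points:

```python
def minkowski(p, q, r):
    """Minkowski distance; r = 1 gives city block, r = 2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def supremum(p, q):
    """The r -> infinity limit: the largest per-component difference."""
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (0, 2), (5, 1)
print(minkowski(p, q, 1))   # 6.0         (city block / Manhattan)
print(minkowski(p, q, 2))   # 5.0990...   (Euclidean, as in the matrix above)
print(supremum(p, q))       # 5           (supremum / L-infinity)
```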

19

• For the entire set of objects, the Error Sum of Squares is calculated
by:

• The K-means algorithm tries to find a minimum of the Error Sum of
Squares (SSE)

• A single run of the algorithm commonly does not reach the global minimum

20

SSE = Σ_{i=1..k} Σ_{p ∈ Ci} d(p, m_i)²

where m_i is the centroid of cluster C_i

Error Sum of Squares

k-Means Clustering – Example
– Assume k = 2 to cluster following data points

– Step 1: k = 2 specifies number of clusters to partition
– Step 2: Randomly assign k = 2 cluster centers
– For example, m1 = (1, 1) and m2 = (2, 1)
– Step 3: For each record, find nearest cluster center
– Euclidean distance from points to m1 and m2 has been used

Point               a      b      c      d      e      f      g      h
Coordinates        (1,3)  (3,3)  (4,3)  (5,3)  (1,2)  (4,2)  (1,1)  (2,1)
Distance from m1    2.00   2.83   3.61   4.47   1.00   3.16   0.00   1.00
Distance from m2    2.24   2.24   2.83   3.61   1.41   2.24   1.00   0.00
Cluster membership  C1     C2     C2     C2     C1     C2     C1     C2

21
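The distance and membership rows can be checked with a short script (an illustration, not part of the slides):

```python
import math

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = (1, 1), (2, 1)

for name, p in points.items():
    d1, d2 = math.dist(p, m1), math.dist(p, m2)
    print(name, round(d1, 2), round(d2, 2), "C1" if d1 <= d2 else "C2")
# a 2.0 2.24 C1 ... g 0.0 1.0 C1, h 1.0 0.0 C2
```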

k-Means Clustering – Example

– Cluster m1 contains {a, e, g} and m2 has {b, c, d, f, h}
– Cluster membership assigned, now SSE calculated

– Recall clusters constructed where between-cluster variation
(BCV) large, as compared to within-cluster variation (WCV)

– Ratio BCV/WCV expected to increase for successive iterations

SSE = Σ_{i=1..2} Σ_{p ∈ Ci} d(p, m_i)²
    = 2² + 2.24² + 2.83² + 3.61² + 1² + 2.24² + 0² + 0² ≈ 36

d(m1, m2) = 1 is the surrogate for BCV
SSE is the surrogate for WCV

BCV / WCV = d(m1, m2) / SSE = 1/36 = 0.0278

22
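These two numbers can be verified directly (a sketch; distances are taken to the current centers m1 = (1, 1) and m2 = (2, 1)):

```python
import math

m1, m2 = (1, 1), (2, 1)
cluster1 = [(1, 3), (1, 2), (1, 1)]                    # a, e, g
cluster2 = [(3, 3), (4, 3), (5, 3), (4, 2), (2, 1)]    # b, c, d, f, h

sse = (sum(math.dist(p, m1) ** 2 for p in cluster1)
       + sum(math.dist(p, m2) ** 2 for p in cluster2))
bcv = math.dist(m1, m2)           # = 1, the surrogate for between-cluster variation
print(round(sse, 2))              # 36.08 -- the slide rounds this to 36
print(round(bcv / sse, 4))        # 0.0277 -- the slide's 1/36 = 0.0278
```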

k-Means Clustering – Example

– Step 4: For k clusters, find cluster centroid, update location
– Cluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2)
– Cluster 2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
– Figure shows movement of clusters m1 and m2 after the first

iteration of the algorithm

23

k-Means Clustering – Example

– Step 5: Repeat Steps 3–4 until convergence or termination

• Second Iteration
– Repeat procedure for Steps 3 – 4
– Again, for each record find nearest cluster center m1 = (1,

2) or m2 = (3.6, 2.4)
– Cluster m1 contains {a, e, g, h} and m2 has {b, c, d, f}
– SSE = 7.86, and BCV/WCV = 0.3346
– Note 0.3346 has increased compared to First Iteration

value = 0.0278
– Between-cluster variation increasing with respect to Within-

cluster variation

24

k-Means Clustering – Example

– Cluster centroids updated to m1 = (1.25, 1.75) and m2 = (4, 2.75)
– After Second Iteration, cluster centroids shown to move slightly


25

k-Means Clustering – Example

• Third (Final) Iteration
– Repeat procedure for Steps 3 – 4
– Now, for each record find nearest cluster center m1 = (1.25,

1.75) or m2 = (4, 2.75)
– SSE = 6.23, and BCV/WCV = 0.4703
– Again, BCV/WCV has increased compared to previous =

0.3346
– This time, no records shift cluster membership
– Centroids remain unchanged, therefore algorithm

terminates
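The whole worked example can be replayed in a few lines (illustrative; it starts from the same initial centers m1 = (1, 1) and m2 = (2, 1) and stops when the centers no longer move):

```python
import math

points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
centers = [(1, 1), (2, 1)]           # same starting centers as the example

while True:
    clusters = [[], []]
    for p in points:
        i = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
        clusters[i].append(p)
    new = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    if new == centers:               # convergence: no centroid moved
        break
    centers = new

print(centers)                       # [(1.25, 1.75), (4.0, 2.75)]
```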

26

k-Means Clustering – Exercise

Using the dataset above, implement the k-Means algorithm with k = 3

27

k-Means Clustering – Exercise: solution

[Figure: solution plots for Step 1 and Step 2]

28

k-Means Clustering – Exercise: solution

29

k-Means Clustering – Rattle Exercise

• Using the weather.csv data set, apply the k-means algorithm to
create clusters

• Optimize the clusters by varying:
• the number of clusters
• the seed value
• variable normalization

30

k-Means Clustering – Summary

• K-means is a nice method to quickly sort your data into clusters
• All you need to specify is the number of clusters you seek to find
• Local optima in K-means can derail your results if you are not careful
• Run the process many times with differing starting values
• What is an appropriate value for k?
• The analyst may have a priori knowledge of k

31

Rule Induction

• Try to find rules of the form
IF (left-hand side) THEN (right-hand side)

– This is the reverse of a rule-based agent, where the rules
are given and the agent must act. Here the actions are
given and we have to discover the rules!

• Prevalence = probability that the left-hand side and the right-hand side
occur together (sometimes called the “support factor” or simply “support”)

• Predictability = probability of right-hand-side given left-hand-
side (sometimes called “confidence” or “strength”)

32

[Example rules (antecedent → consequent shown graphically on the original slide):
• prevalence = 4.99%, predictability = 22.89%
• prevalence = 0.94%, predictability = 28.14%
• prevalence = 1.36%, predictability = 41.02%
• prevalence = 1.16%, predictability = 38.01%]

33

Association Rules from Market
Basket Analysis

• Coupons, discounts
Don’t give discounts on 2 items that are frequently bought
together. Use the discount on 1 to “pull” the other

• Product placement
Offer correlated products to the customer at the same time.
Increases sales

• Timing of cross-marketing
Send camcorder offer to VCR purchasers 2-3 months after VCR
purchase

• Discovery of patterns
People who bought X, Y and Z (but not any pair) bought W over
half the time

34

Use of Rule Associations

• Example: grocery shopping
• For each item, count # of occurrences (say out of 100,000)

apples 1891, caviar 3, ice cream 1088, …
• Drop the ones that are below a minimum support level

apples 1891, ice cream 1088, pet food 2451, …
• Make a table of each item against each other item:

• Discard cells below support threshold. Now make a cube
for triples, etc. Add 1 dimension for each product on LHS

              apples   ice cream   pet food
apples        1891     685         24
ice cream     —        1088        322
pet food      —        —           2451

35

Finding Rule Associations Algorithm

Learning Associations

• Market basket analysis
– To find associations between products bought by

customers
• Learning a conditional probability

– P (Y | X )
– probability that somebody

who buys X also buys Y where
X and Y are products/services

• Example
– P ( chips | beer ) = 0.7
– 70 percent of customers who buy beer also buy chips

36

Market Basket Analysis (MBA)

37

Barbie® ⇒ Candy

1. Put them closer together in the store
2. Put them far apart in the store
3. Package candy bars with the dolls
4. Package Barbie + candy + poorly selling item
5. Raise the price on one, lower it on the other
6. Barbie accessories for proofs of purchase
7. Do not advertise candy and Barbie together
8. Offer candies in the shape of a Barbie Doll

38

Market Basket Analysis

• MBA in retail setting
– Find out what are bought together
– Cross-selling
– Optimize shelf layout
– Product bundling
– Timing promotions
– Discount planning (avoid double-discounts)
– Product selection under limited space
– Targeted advertisement, Personalized coupons, item

recommendations
• Usage beyond Market Basket

– Medical (one symptom after another)
– Financial (customers with mortgage acct also have saving

acct)

39

Rules Discovered from MBA

• Actionable Rules
– Wal-Mart customers who purchase Barbie dolls

have a 60% likelihood of also purchasing one of
three types of candy bars

• Trivial Rules
– Customers who purchase large appliances are

very likely to purchase maintenance agreements
• Inexplicable Rules

– When a new hardware store opens, one of the
most commonly sold items is toilet bowl cleaners

40

Association Rule Discovery: Definition

• Given a set of records each of which contain some number of
items from a given collection;

– Produce dependency rules which will predict occurrence
of an item based on occurrences of other items.

TID Items

1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Rules Discovered:
{Milk} –> {Coke}
{Diaper, Milk} –> {Beer}

41

Association Rule Discovery: Application 1

• Marketing and Sales Promotion:
– Let the rule discovered be
– {Bagels, … } –> {Potato Chips}
– Potato Chips as consequent => Can be used to determine

what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which

products would be affected if the store discontinues selling
bagels.

– Bagels in antecedent and Potato chips in consequent =>
Can be used to see what products should be sold with
Bagels to promote sale of Potato chips!

42

Association Rule Discovery: Application 2

• Supermarket shelf management
– Goal: To identify items that are bought together by sufficiently

many customers
– Approach: Process the point-of-sale data collected with

barcode scanners to find dependencies among items
– A classic rule —

• If a customer buys diaper and milk, then he is very likely to
buy beer

• So, don’t be surprised if you find six-packs stacked next to
diapers!

43

Association Rule Discovery: Application 3

• Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature
of repairs on its consumer products and keep the service vehicles equipped
with the right parts to reduce the number of visits to consumer households.

– Approach: Process the data on tools and parts
required in previous repairs at different consumer
locations and discover the co-occurrence patterns.

44

Evaluation of Association Rules

• What rules should be considered valid?

• An association rule is valid if it satisfies some evaluation
measures

If {Diaper} (LHS) Then {Beer} (RHS)

45

Rule Evaluation – Support

• Support:
The frequency in which the items in LHS and RHS co-occur

• E.g., the support of the {Diaper} → {Beer} rule is 3/5:
60% of the transactions contain both items

Support = (No. of transactions containing the items in LHS and RHS)
/ (Total no. of transactions in the dataset)

Transaction No. Item 1 Item 2 Item 3 …

100 Beer Diaper Chocolate
101 Milk Chocolate Shampoo
102 Beer Wine Vodka
103 Beer Cheese Diaper
104 Ice Cream Diaper Beer
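A tiny Python sketch of this calculation (illustrative; the transactions are the five rows above, stored as sets):

```python
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]

def support(lhs, rhs):
    """Fraction of transactions containing every item in LHS and RHS."""
    items = set(lhs) | set(rhs)
    return sum(items <= t for t in transactions) / len(transactions)

print(support({"Diaper"}, {"Beer"}))    # 0.6, i.e. 3 of 5 transactions
```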

46

Rule Evaluation – Confidence
• Is Beer leading to Diaper purchase or Diaper leading to Beer

purchase?
– Among the transactions with Diaper, 100% have Beer
– Among the transactions with Beer, 75% have Diaper

Transaction No. Item 1 Item 2 Item 3 …

100 Beer Diaper Chocolate
101 Milk Chocolate Shampoo

102 Beer Wine Vodka
103 Beer Cheese Diaper

104 Ice Cream Diaper Beer

Confidence = (No. of transactions containing both LHS and RHS)
/ (No. of transactions containing LHS)

• Confidence for {Diaper} → {Beer}: 3/3
When Diaper is purchased, the likelihood of a Beer purchase is 100%

• Confidence for {Beer} → {Diaper}: 3/4
When Beer is purchased, the likelihood of a Diaper purchase is 75%

• So, {Diaper} → {Beer} is the more important rule according to confidence
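The same comparison in code (illustrative, self-contained; counts over the five transactions above):

```python
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]

def count(items):
    """Number of transactions containing every item in the set."""
    return sum(set(items) <= t for t in transactions)

# confidence(LHS -> RHS) = count(LHS and RHS) / count(LHS)
print(count({"Diaper", "Beer"}) / count({"Diaper"}))   # 1.0   {Diaper} -> {Beer}
print(count({"Diaper", "Beer"}) / count({"Beer"}))     # 0.75  {Beer} -> {Diaper}
```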

47

Rule Evaluation – Lift

Transaction No. Item 1 Item 2 Item 3 Item 4 …

100 Beer Diaper Chocolate
101 Milk Chocolate Shampoo
102 Beer Milk Vodka Chocolate
103 Beer Milk Diaper Chocolate
104 Milk Diaper Beer

What’s the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5, Confidence = 3/4
Very high support and confidence.
Does Chocolate really lead to Milk purchase?
No! Because Milk occurs in 4 out of 5 transactions. Chocolate is even
decreasing the chance of a Milk purchase (3/4 < 4/5)

Lift = (3/4) / (4/5) = 0.9375 < 1

The lift of a rule is the ratio of the support of the items on the LHS of the
rule co-occurring with the items on the RHS, divided by the probability that
the LHS and RHS would co-occur if the two were independent

48

Rule Evaluation – Lift (cont.)

• Measures how much more likely the RHS is given the LHS than the RHS alone
• Lift = confidence of the rule / frequency of the RHS
• Example: {Diaper} → {Beer}
• Total number of customers in the database: 1000
• No. of customers buying Diaper: 200
• No. of customers buying Beer: 50
• No. of customers buying Diaper & Beer: 20
• Frequency of Beer = 50/1000 (5%)
• Confidence = 20/200 (10%)
• Lift = 10% / 5% = 2
• A lift higher than 1 implies people have a higher chance of buying Beer when
they buy Diaper; a lift lower than 1 implies a lower chance, as with Milk
when Chocolate is bought in the earlier example

49

Finding Association Rules from Data

The association rule discovery problem is decomposed into two sub-problems:

• Find all sets of items (itemsets) whose support is above the minimum
support – called frequent itemsets or large itemsets
• From each frequent itemset, generate rules whose confidence is above the
minimum confidence
– Given a frequent (large) itemset Y and a subset X of Y
– Calculate the confidence of the rule X ⇒ (Y − X)
– If its confidence is above the minimum confidence, then X ⇒ (Y − X) is an
association rule we are looking for

50

Example

• A data set with 5 transactions
• Minimum support = 40%, minimum confidence = 80%

Transaction No. Item 1 Item 2 Item 3

100 Beer Diaper Chocolate
101 Milk Chocolate Shampoo
102 Beer Wine Vodka
103 Beer Cheese Diaper
104 Ice Cream Diaper Beer

Phase 1: Find all frequent itemsets
• {Beer} (support = 80%)
• {Diaper} (60%)
• {Chocolate} (40%)
• {Beer, Diaper} (60%)

Phase 2:
• Beer → Diaper (conf. 60% ÷ 80% = 75%)
• Diaper → Beer (conf. 60% ÷ 60% = 100%)
• Only Diaper → Beer clears the 80% minimum confidence

51

Phase 1: Finding all frequent itemsets

How to perform an efficient search for all frequent itemsets?
Note: a frequent itemset of size n contains itemsets of size n−1 that must
also be frequent
Example: if {diaper, beer} is frequent then {diaper} and {beer} are each
frequent as well

This means that…
• If an itemset is not frequent (e.g., {wine}) then no itemset that includes
wine can be frequent either, such as {wine, beer}
• We therefore first find all itemsets of size 1 that are frequent, then try
to “expand” these by counting the frequency of all itemsets of size 2 that
include frequent itemsets of size 1
Example: if {wine} is not frequent we need not check whether {wine, beer}
is frequent. But if both {wine} and {beer} were frequent then it is possible
(though not guaranteed) that {wine, beer} is also frequent
• Then take only the itemsets of size 2 that are frequent, try to expand
those, and so on

52

Phase 2: Generating Association Rules

– Assume {Milk, Bread, Butter} is a frequent itemset
• Using the items contained in the itemset, list all possible rules
– {Milk} → {Bread, Butter}
– {Bread} → {Milk, Butter}
– {Butter} → {Milk, Bread}
– {Milk, Bread} → {Butter}
– {Milk, Butter} → {Bread}
– {Bread, Butter} → {Milk}
• Calculate the confidence of each rule
• Pick the rules with confidence above the minimum confidence

Confidence of {Milk} → {Bread, Butter}
= Support{Milk, Bread, Butter} / Support{Milk}
= (No. of transactions that contain {Milk, Bread, Butter})
/ (No. of transactions that contain {Milk})

53

Market Basket Analysis Applications

• Retail outlets
• Telecommunications
• Banks
• Insurance – link analysis for fraud
• Medical symptom analysis

54

Market Basket Analysis

• LIMITATIONS
– takes over 18 months to implement market basket analysis
– only identifies hypotheses, which need to be tested
– measurement of impact needed
– difficult to identify product groupings
– complexity grows exponentially

55

Market Basket Analysis – Exercise

Transaction No. Item 1 Item 2 Item 3 Item 4

100 Beer Diaper Chocolate
101 Milk Chocolate Shampoo
102 Beer Soap Vodka
103 Beer Cheese Wine
104 Milk Diaper Beer Chocolate

Given the above list of transactions, do the following:
1. Find all the frequent itemsets (minimum support 40%)
2. Find all the association rules (minimum confidence 70%)
3. For the discovered association rules, calculate the lift

56

Basket Analysis – Rattle Exercise

• Using the dvdtrans.csv or Phone_transactions_list.csv data set, apply
basket analysis
• Comment on the results

57
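As a wrap-up, a compact sketch of the two-phase procedure described above: frequent itemsets first, then rules filtered by confidence, with lift reported for each surviving rule. This is an illustration in plain Python, not the Rattle workflow used in the exercises; it runs on the 5-transaction "Example" slide with minimum support 40% and minimum confidence 80%:

```python
from itertools import combinations

transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.8

def support(items):
    return sum(items <= t for t in transactions) / len(transactions)

# Phase 1: frequent itemsets, grown level by level (an itemset can only be
# frequent if all of its subsets are, so only frequent sets are extended)
singletons = {frozenset([i]) for i in set().union(*transactions)}
frequent = {s for s in singletons if support(s) >= MIN_SUPPORT}
level, all_frequent = frequent, set(frequent)
while level:
    size = len(next(iter(level))) + 1
    candidates = {a | b for a in level for b in frequent if len(a | b) == size}
    level = {c for c in candidates if support(c) >= MIN_SUPPORT}
    all_frequent |= level

# Phase 2: from each frequent itemset Y, keep rules X -> (Y - X) whose
# confidence clears the minimum; report lift = confidence / support(RHS)
for itemset in all_frequent:
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= MIN_CONFIDENCE:
                print(set(lhs), "->", set(rhs),
                      "conf", round(conf, 2), "lift", round(conf / support(rhs), 2))
# {'Diaper'} -> {'Beer'} conf 1.0 lift 1.25
```

The level-by-level growth is the pruning idea described on the "Phase 1: Finding all frequent itemsets" slide: an itemset is only extended if it is already frequent.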