CS699
Lecture 9: Correlation Analysis and Other Frequent Pattern Mining
Association Rule Mining on Weka
Data preparation
When performing association rule mining on transactional data using Weka, the dataset must be converted to an appropriate form.
Each item becomes an attribute.
Each attribute takes on only a single value, e.g., {1} or {t}.
Only items are used (i.e., transaction id’s, customer id’s, etc. are removed, temporarily or permanently).
Association Rule Mining on Weka
Data preparation example:

CID  Items
C1   beer, bread, chip, egg
C2   beer, bread, chip, egg, popcorn
C3   bread, chip, egg
C4   beer, bread, chip, egg, milk, popcorn
C5   beer, bread, milk
C6   beer, bread, egg
C7   bread, chip, milk
C8   bread, butter, chip, egg, milk
C9   butter, chip, egg
Association Rule Mining on Weka
Running Apriori on Weka
Apriori starts with a min. support of 100% and decreases it in steps of 5% until there are at least 10 rules with the min. confidence of 90%, or until the support reaches a lower bound of 10%.
These default values can be changed.
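These defaults can also be set programmatically through Weka's Java API. A minimal sketch (the ARFF file name is hypothetical; the parameter values shown are just the defaults described above):

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the prepared dataset (hypothetical file name).
        Instances data = DataSource.read("transactions.arff");

        Apriori apriori = new Apriori();
        apriori.setUpperBoundMinSupport(1.0); // start with min. support of 100%
        apriori.setDelta(0.05);               // decrease in steps of 5%
        apriori.setLowerBoundMinSupport(0.1); // stop at a support of 10%
        apriori.setMinMetric(0.9);            // min. confidence of 90%
        apriori.setNumRules(10);              // stop once 10 rules are found

        apriori.buildAssociations(data);
        System.out.println(apriori);          // print the discovered rules
    }
}
```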
Interestingness Measure: Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
the overall % of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
Measure of dependent/correlated events: lift

lift(A, B) = P(A ∪ B) / (P(A) × P(B))

                Cereal   Not cereal   Sum
Basketball        2000         1000   3000
Not basketball    1750          250   2000
Sum               3750         1250   5000

lift(B, C)     = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, not C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

lift > 1: positively correlated; lift < 1: negatively correlated; lift = 1: independent
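To make the arithmetic concrete, here is a small sketch (class and method names are ours) that computes lift directly from the contingency table above:

```java
public class LiftDemo {
    // lift(A, B) = P(A and B) / (P(A) * P(B)), with probabilities
    // estimated from counts out of n transactions.
    static double lift(double countAB, double countA, double countB, double n) {
        return (countAB / n) / ((countA / n) * (countB / n));
    }

    public static void main(String[] args) {
        double n = 5000;
        // 3000 play basketball, 3750 eat cereal, 2000 do both.
        System.out.printf("lift(B, C)     = %.2f%n", lift(2000, 3000, 3750, n)); // 0.89
        // 1250 do not eat cereal; 1000 play basketball but eat no cereal.
        System.out.printf("lift(B, not C) = %.2f%n", lift(1000, 3000, 1250, n)); // 1.33
    }
}
```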
Chi-square Test

Observed counts, with expected values in parentheses:

                Cereal        Not cereal   Sum
Basketball      2000 (2250)   1000 (750)   3000
Not basketball  1750 (1500)    250 (500)   2000
Sum             3750          1250         5000

First, we compute the expected values (shown in the parentheses).
Example: for (cereal, basketball),
  Expected value = (3750 × 3000) / 5000 = 2250
Second, compute the chi-square test statistic:
  χ² = (2000 - 2250)²/2250 + (1750 - 1500)²/1500 + (1000 - 750)²/750 + (250 - 500)²/500 = 277.78
Chi-square Test

Third, look up the chi-square distribution table with degrees of freedom = (num_rows - 1) × (num_cols - 1) = 1 and α = 0.05:
  χ²(0.05, 1) = 3.84
Chi-square Test
Finally, compare the computed test statistic with the value from the distribution table and make a conclusion.
If the computed χ² statistic is greater than the χ² value from the distribution table, we reject the null hypothesis (here, that the two items are independent).
Otherwise, we cannot reject the null hypothesis.
In this example, the computed chi‐square value is greater than that from the chi‐square distribution table (i.e., it is in the rejection region).
So, we reject the null hypothesis and conclude that there is a correlation between the two.
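The computation above can be reproduced with a short sketch (names are ours); it derives the expected counts from the row and column sums and accumulates the χ² statistic:

```java
public class ChiSquareDemo {
    public static void main(String[] args) {
        // Observed counts: rows = {basketball, not basketball},
        // columns = {cereal, not cereal}.
        double[][] obs = {{2000, 1000}, {1750, 250}};
        double n = 5000;

        double[] rowSum = {obs[0][0] + obs[0][1], obs[1][0] + obs[1][1]};
        double[] colSum = {obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]};

        double chi2 = 0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                double expected = rowSum[i] * colSum[j] / n; // e.g., 2250
                chi2 += Math.pow(obs[i][j] - expected, 2) / expected;
            }
        }
        System.out.printf("chi-square = %.2f%n", chi2); // 277.78
        // Compare against the critical value chi-square(0.05, 1) = 3.84.
    }
}
```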
Null Transactions

            coffee   Not coffee   Sum
milk           100        1,100     1,200
Not milk     1,000      100,000   101,000
Sum          1,100      101,100   102,200

A null transaction is a transaction that contains none of the itemsets being examined (here, the 100,000 transactions with neither milk nor coffee). When the number of null transactions is large, these measures may generate misleading results.

lift(m, c) = (100/102200) / ((1200/102200) × (1100/102200)) = 7.74

The lift measure indicates they are positively correlated.
But the actual data says they are negatively correlated:
among the 1,100 people who bought coffee, only 100 (or only 9%) also bought milk. The situation is similar for those who bought milk.
all_confidence and cosine

            coffee   Not coffee   Sum
milk           100        1,100     1,200
Not milk     1,000      100,000   101,000
Sum          1,100      101,100   102,200

all_conf(A, B) = sup(A ∪ B) / max(sup(A), sup(B))
cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) × sup(B))

Both measures are between 0 and 1; greater than 0.5: positively correlated; smaller than 0.5: negatively correlated.

all_conf(m, c) = 100 / max(1200, 1100) = 0.08
cosine(m, c) = 100 / sqrt(1200 × 1100) = 0.09

These measures show they are negatively correlated.
all_confidence and cosine measures are null-invariant.
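A small sketch (names are ours) that computes both measures from the support counts in the table above:

```java
public class NullInvariantMeasures {
    // all_conf(A, B) = sup(A and B) / max(sup(A), sup(B))
    static double allConf(double supAB, double supA, double supB) {
        return supAB / Math.max(supA, supB);
    }

    // cosine(A, B) = sup(A and B) / sqrt(sup(A) * sup(B))
    static double cosine(double supAB, double supA, double supB) {
        return supAB / Math.sqrt(supA * supB);
    }

    public static void main(String[] args) {
        // From the milk/coffee table: sup(mc) = 100, sup(m) = 1200, sup(c) = 1100.
        System.out.printf("all_conf(m, c) = %.2f%n", allConf(100, 1200, 1100)); // 0.08
        System.out.printf("cosine(m, c)   = %.2f%n", cosine(100, 1200, 1100));  // 0.09
    }
}
```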
all_confidence, cosine: another example

            coffee   Not coffee   Sum
milk         1,000        1,000     2,000
Not milk     1,000      100,000   101,000
Sum          2,000      101,000   103,000

lift(m, c) = (1000/103000) / ((2000/103000) × (2000/103000)) = 25.75 (says positively correlated)
all_conf(m, c) = 1000 / max(2000, 2000) = 0.5 (says independent)
cosine(m, c) = 1000 / sqrt(2000 × 2000) = 0.5 (says independent)

Actual data: independent.
Other measures: max_confidence, Kulczynski measure
Kulczynski and Imbalance Ratio (IR)

            coffee   Not coffee   Sum
milk         1,000        1,000     2,000
Not milk     1,000      100,000   101,000
Sum          2,000      101,000   103,000

Kulc(A, B) = (P(A|B) + P(B|A)) / 2, i.e., the average of the two conditional probabilities.
Between 0 and 1; > 0.5: positive; < 0.5: negative; = 0.5: independent.

Kulc(m, c) = (P(m|c) + P(c|m)) / 2 = (1000/2000 + 1000/2000) / 2 = 0.5 (independent)

IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)), where 0 ≤ IR < 1

IR(m, c) = |2000 - 2000| / (2000 + 2000 - 1000) = 0 (balanced)
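A companion sketch (names are ours) for the two measures; the last call shows how IR flags an unbalanced pair that Kulczynski alone would call independent:

```java
public class KulcIrDemo {
    // Kulc(A, B) = (P(A|B) + P(B|A)) / 2 = (sup(AB)/sup(B) + sup(AB)/sup(A)) / 2
    static double kulc(double supAB, double supA, double supB) {
        return (supAB / supB + supAB / supA) / 2;
    }

    // IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(AB))
    static double ir(double supAB, double supA, double supB) {
        return Math.abs(supA - supB) / (supA + supB - supAB);
    }

    public static void main(String[] args) {
        // Table above: sup(mc) = 1000, sup(m) = 2000, sup(c) = 2000.
        System.out.printf("Kulc(m, c) = %.2f%n", kulc(1000, 2000, 2000)); // 0.50
        System.out.printf("IR(m, c)   = %.2f%n", ir(1000, 2000, 2000));   // 0.00 (balanced)
        // An unbalanced pair: sup(AB) = 1000, sup(A) = 11000, sup(B) = 1100.
        System.out.printf("IR         = %.2f%n", ir(1000, 11000, 1100)); // 0.89 (unbalanced)
    }
}
```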
Kulczynski and Imbalance Ratio (IR)

In the tables that follow, mc, m'c, mc', and m'c' represent the following entries in the contingency table:

            coffee   Not coffee
milk         mc       mc'
Not milk     m'c      m'c'
Kulczynski and Imbalance Ratio (IR)

Example: dataset D1 (P), with mc = 10,000, m'c = 1,000, mc' = 1,000, m'c' = 100,000:

            coffee          Not coffee        Sum
milk         10,000 (mc)      1,000 (mc')      11,000
Not milk      1,000 (m'c)   100,000 (m'c')    101,000
Sum          11,000         101,000           112,000
Comparison

P: positive, N: negative, I: independent, *: contradictory

        mc      m'c    mc'      m'c'      lift       all_conf  cosine    Kulc      IR
D1 (P)  10,000  1,000  1,000    100,000    9.26 (P)  0.91 (P)  0.91 (P)  0.91 (P)  0
D2 (P)  10,000  1,000  1,000    100        1.00 (I)  0.91 (P)  0.91 (P)  0.91 (P)  0
D3 (N)  100     1,000  1,000    100,000    8.44 (P)  0.09 (N)  0.09 (N)  0.09 (N)  0
D4 (I)  1,000   1,000  1,000    100,000   25.75 (P)  0.50 (I)  0.50 (I)  0.50 (I)  0
D5 (*)  1,000   100    10,000   100,000    9.18 (P)  0.09 (N)  0.29 (N)  0.50 (I)  0.89
D6 (*)  1,000   10     100,000  100,000    1.97 (P)  0.01 (N)  0.10 (N)  0.50 (I)  0.99

Both D1 and D2 have positively correlated data, but lift shows very different values.
D3 has negatively correlated data, but lift says positive.
D4 has independent data, but lift says positive. This is because lift is affected by null transactions.
all_conf, cosine, and Kulczynski are null-invariant.
Comparison (continued)

P: positive, N: negative, I: independent, *: contradictory

        mc     m'c   mc'      m'c'     lift      all_conf  cosine    Kulc      IR
D5 (*)  1,000  100   10,000   100,000  9.18 (P)  0.09 (N)  0.29 (N)  0.50 (I)  0.89
D6 (*)  1,000  10    100,000  100,000  1.97 (P)  0.01 (N)  0.10 (N)  0.50 (I)  0.99

D5: P(c|m) = 9.09% (negatively correlated), but P(m|c) = 90.9% (positively correlated).
D6: P(c|m) = 0.99% (negatively correlated), but P(m|c) = 99% (positively correlated).
Kulczynski says independent (which makes sense), and IR indicates that D5 and D6 are unbalanced.
Using Kulczynski along with IR is recommended.
Kulczynski and Imbalance Ratio (IR)

D4 (I): Kulc = 0.5, IR = 0

            coffee         Not coffee        Sum
milk         1,000 (mc)      1,000 (mc')       2,000
Not milk     1,000 (m'c)   100,000 (m'c')    101,000
Sum          2,000         101,000           103,000

D5 (*): Kulc = 0.5, IR = 0.89

            coffee         Not coffee        Sum
milk         1,000 (mc)     10,000 (mc')      11,000
Not milk       100 (m'c)   100,000 (m'c')    100,100
Sum          1,100         110,000           111,100
Multi-level Association Rule Mining
When there is a concept hierarchy in the database
Not many strong association rules at low levels
Different users are interested in association rules at different levels
Example database
TID Items
1 IBM desktop computer, Sony b/w printer
2 MS educational SW, MS financial management SW
3 Logitech mouse, Ergoway wrist pad
4 IBM desktop computer, MS financial management SW
5 IBM desktop computer
......
Multi-level Association Rule Mining

Concept hierarchy (a tree in the original figure):

all
  computer:  desktop, laptop, ...
  software:  educational, financial, ...
  printer:   color, b/w, ...
  accessory: mouse, wrist pad, ...

Brand names (IBM, MS, HP, SONY, Logitech, Ergoway, ...) form the next level down, e.g., IBM desktop computer, MS financial management SW, SONY b/w printer, Logitech mouse, Ergoway wrist pad.
Multi-level Association Rule Mining
Mining rules
In general, a top-down approach is employed.
Once all frequent itemsets at level 1 are identified, then those at level 2 are found, and so on.
For each level, any algorithm to find frequent itemsets can be used.
Variations
Using uniform support
Using reduced minimum support
Multi-level Association Rule Mining
Using uniform support for all levels
Same minimum support is used for all levels.
Level 1 (min_sup = 5%): computer (support = 10%)
Level 2 (min_sup = 5%): desktop (support = 6%), laptop (support = 4%)
Frequent itemsets: computer, desktop; laptop is discarded.
Multi-level Association Rule Mining
Using reduced minimum support at lower levels
Each level has its own min. support.
Lower levels have smaller min. supports.
Level 1 (min_sup = 5%): computer (support = 10%)
Level 2 (min_sup = 3%): desktop (support = 6%), laptop (support = 4%)
All of computer, desktop, and laptop are found as frequent itemsets.
Multi-level Association Rule Mining
If a node does not satisfy minimum support, its children don't need to be examined (this pruning is illustrated in the sketch below).
If min. support is set too high, some meaningful rules at low levels may be missed.
If min. support is set too low, too many uninteresting rules at high levels may be found.
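A minimal sketch of this level-wise process, assuming a toy hierarchy (the item names, hierarchy, and thresholds are invented for illustration; this is not Weka code). A node's support is the fraction of transactions containing any item that generalizes to it, and the children of an infrequent node are pruned:

```java
import java.util.*;

public class MultiLevelDemo {
    // Toy concept hierarchy: child -> parent ("all" is the root).
    static final Map<String, String> PARENT = Map.of(
            "IBM desktop", "desktop", "HP desktop", "desktop",
            "desktop", "computer", "laptop", "computer",
            "computer", "all", "software", "all");

    // True if 'node' lies on the path from 'item' up to the root.
    static boolean generalizesTo(String item, String node) {
        for (String cur = item; cur != null; cur = PARENT.get(cur))
            if (cur.equals(node)) return true;
        return false;
    }

    // Fraction of transactions containing an item that generalizes to 'node'.
    static double support(List<Set<String>> db, String node) {
        long hits = db.stream()
                .filter(t -> t.stream().anyMatch(i -> generalizesTo(i, node)))
                .count();
        return (double) hits / db.size();
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
                Set.of("IBM desktop"), Set.of("HP desktop"),
                Set.of("IBM desktop", "software"), Set.of("laptop"));

        double[] minSup = {0.5, 0.4}; // reduced min. support at the lower level
        // Level 1: examine "computer"; prune its children if it is infrequent.
        if (support(db, "computer") >= minSup[0]) {
            for (String child : List.of("desktop", "laptop"))
                System.out.printf("%s: support = %.2f, frequent = %b%n",
                        child, support(db, child), support(db, child) >= minSup[1]);
        }
    }
}
```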
Mining Sequential Patterns
Given a sequence of elements or events, find a sequential pattern that occurs frequently.
Applications:
Customer shopping sequences
Web click streams
Program execution sequences
Biological sequences
Sequence of events in natural or social development
Mining Sequential Patterns
We will discuss sequential pattern mining from a transactional database.
The approach described here is based on a sequential pattern mining algorithm called GSP (Generalized Sequential Patterns).
A transaction is a tuple (sid, ts, itemset), where
  sid is a sequence id, which typically is a customer id (or cid),
  ts is a timestamp, and
  itemset is a set of items (items within an itemset are ordered).
A data-sequence is an ordered list of transactions.
A (transactional) database is a set of data-sequences.
Mining Sequential Patterns

Example database:

CID  Time  Items
1    3/25  30
1    3/30  90
2    3/10  10
2    3/15  20, 30
2    3/20  40, 60, 70
3    3/25  30, 50, 70
4    3/25  30
4    3/30  40, 70
4    4/25  90
5    3/12  90

• There are five data-sequences, each corresponding to a customer.
• The first data-sequence has two transactions, the second data-sequence has three transactions, ...
• Each transaction represents a purchase by a customer of a set of items at a certain time.
Mining Sequential Patterns

(Using the example database on the previous slide.)

• A sequence is an ordered list of itemsets.
• Sequence examples: <{30}, {90}>, <{30}, {40, 70}>, <{10}, {40, 60}>
• A k-sequence is a sequence consisting of k items.
• The above sequences are a 2-sequence, a 3-sequence, and a 3-sequence, respectively.
Mining Sequential Patterns

A sequence A = <a1, a2, ..., an> is a subsequence of B = <b1, b2, ..., bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
Example: A = <{3}, {4,5}, {8}> and B = <{7}, {3,8}, {9}, {4,5,6}, {8}>
A is a subsequence of B.
Mining Sequential Patterns

More examples:
C = <{3}, {15}> and D = <{3,5}, {7,8}, {2,15}>: C is a subsequence of D.
E = <{3}, {5,8}> and F = <{2,3}, {5}, {7,8}>: E is not a subsequence of F.
G = <{3}, {8}> and H = <{3,8}>: G is not a subsequence of H.
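A minimal sketch of the subsequence test, with a support count built on top of it (the representation and names are ours; each sequence is a list of integer itemsets):

```java
import java.util.*;

public class SubsequenceDemo {
    // s is a subsequence of d if the itemsets of s are contained, in order,
    // in distinct itemsets of d. Greedy first-match is sufficient here.
    static boolean isSubsequence(List<Set<Integer>> s, List<Set<Integer>> d) {
        int i = 0;
        for (Set<Integer> element : d)
            if (i < s.size() && element.containsAll(s.get(i))) i++;
        return i == s.size();
    }

    // Support of s = number of data-sequences in db that contain s.
    static long support(List<List<Set<Integer>>> db, List<Set<Integer>> s) {
        return db.stream().filter(d -> isSubsequence(s, d)).count();
    }

    public static void main(String[] args) {
        List<Set<Integer>> B = List.of(Set.of(7), Set.of(3, 8), Set.of(9),
                Set.of(4, 5, 6), Set.of(8));
        List<Set<Integer>> A = List.of(Set.of(3), Set.of(4, 5), Set.of(8));
        System.out.println(isSubsequence(A, B)); // true

        List<Set<Integer>> G = List.of(Set.of(3), Set.of(8));
        List<Set<Integer>> H = List.of(Set.of(3, 8));
        System.out.println(isSubsequence(G, H)); // false: {8} needs a later itemset
    }
}
```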
Mining Sequential Patterns

The support of a sequence is the number of data-sequences that "contain" this sequence.
A data-sequence d contains a sequence s if s is a subsequence of d.

cid  ts  itemset
C1   1   cheese
C1   2   butter
C1   15  bread, milk
C2   1   butter, cheese
C2   20  egg, bread
C2   50  milk

support of <{cheese}> = 2 (100%)
support of <{egg}> = 1 (50%)
support of <{cheese}, {egg}> = 1 (50%)
support of <{cheese}, {milk}> = 2 (100%)
support of <{cheese}, {bread, milk}> = 1 (50%)
Mining Sequential Patterns

(Using the example database shown earlier.)

• Supports of the following sequences:
  <{30}, {90}>: 2 (or 40%)
  <{30}, {40, 70}>: 2 (or 40%)
  <{10}, {40, 60}>: 1 (or 20%)
• Goal: to discover all sequences with a user-specified minimum support.
Mining Sequential Patterns

Example: mine all frequent sequential patterns with minimum support = 40% (or 2 data-sequences).

CID  Time  Items
1    1     10, 30
1    4     80
2    3     10
2    7     30, 40
2    20    60, 70, 80
3    2     30, 50
3    8     70, 80
4    2     70
4    10    80
5    1     10, 20
5    10    30
5    28    70
5    31    80

L1 (frequent 1-sequences): <{10}>:3, <{30}>:4, <{70}>:4, <{80}>:5
L2 (frequent 2-sequences): <{10},{30}>:2, <{10},{70}>:2, <{10},{80}>:3, <{30},{70}>:3, <{30},{80}>:4, <{70},{80}>:2, <{70,80}>:2
L3 (frequent 3-sequences): <{10},{30},{70}>:2, <{10},{30},{80}>:2, <{30},{70,80}>:2
Mining Sequential Patterns
Sequential pattern mining on Weka
Weka implements GSP (with some limitations).
Transactional database needs to be converted to an appropriate format as an arff file.
One transaction per tuple.
All tuples (and hence all transactions) have the same number of attributes, and each item is a value of the corresponding attribute.
Mining Sequential Patterns

Example (the dataset is shown as a screenshot).
Mining Sequential Patterns

Result with min_sup = 90%:

1-sequences
[1] <{revolution}> (3)
[2] <{civil war}> (3)
[3] <{anderson}> (3)
[4] <{football}> (3)

2-sequences
[1] <{revolution}{civil war}> (3)
[2] <{revolution}{anderson}> (3)
[3] <{civil war,anderson}> (3)

3-sequences
[1] <{revolution}{civil war,anderson}> (3)
Mining Sequential Patterns

Result with min_sup = 60%:

1-sequences
[1] <{revolution}> (3)
[2] <{civil war}> (3)
[3] <{anderson}> (3)
[4] <{baseball}> (2)
[5] <{football}> (3)

2-sequences
[1] <{revolution}{civil war}> (3)
[2] <{revolution}{anderson}> (3)
[3] <{revolution}{football}> (2)
[4] <{civil war,anderson}> (3)
[5] <{revolution,baseball}> (2)
[6] <{baseball}{civil war}> (2)
[7] <{baseball}{anderson}> (2)
[8] <{baseball}{football}> (2)
[9] <{civil war,football}> (2)
[10] <{anderson,football}> (2)

3-sequences
[1] <{revolution}{civil war,anderson}> (3)
[2] <{revolution}{civil war,football}> (2)
[3] <{revolution}{anderson,football}> (2)
[4] <{civil war,anderson,football}> (2)
[5] <{revolution,baseball}{civil war}> (2)
[6] <{revolution,baseball}{anderson}> (2)
[7] <{revolution,baseball}{football}> (2)
[8] <{baseball}{civil war,anderson}> (2)
[9] <{baseball}{civil war,football}> (2)
[10] <{baseball}{anderson,football}> (2)

4-sequences
[1] <{revolution}{civil war,anderson,football}> (2)
[2] <{revolution,baseball}{civil war,anderson}> (2)
[3] <{revolution,baseball}{civil war,football}> (2)
[4] <{revolution,baseball}{anderson,football}> (2)
[5] <{baseball}{civil war,anderson,football}> (2)

5-sequences
[1] <{revolution,baseball}{civil war,anderson,football}> (2)
Associative Classification
Each sample (or tuple) is considered as a transaction.
An (attribute, value) pair is an item.
Frequent itemsets are mined using an association rule mining algorithm.
Strong rules are mined from the frequent itemsets, which satisfy the minimum support and minimum confidence thresholds.
We only use rules that have the class attribute in the consequent.
Rules are organized to form a rule-based classifier.
Associative Classification

Example dataset:

outlook   temperature  humidity  windy  play
sunny     hot          high      F      N
sunny     hot          high      T      N
overcast  hot          high      F      Y
rainy     mild         high      F      Y
rainy     cool         normal    F      Y
rainy     cool         normal    T      N
overcast  cool         normal    T      Y
sunny     mild         high      F      N
sunny     cool         normal    F      Y
rainy     mild         normal    F      Y
sunny     mild         normal    T      Y
overcast  mild         high      T      Y
overcast  hot          normal    F      Y
rainy     mild         high      T      N
Associative Classification
For association rule mining, each tuple is considered as a transaction and each (attribute, value) pair becomes an item, as follows:
TID items
1 outlook=sunny, temperature=hot, humidity=high, windy=F, play=N
2 outlook=sunny, temperature=hot, humidity=high, windy=T, play=N
3 outlook=overcast, temperature=hot, humidity=high, windy=F, play=Y
4 outlook=rainy, temperature=mild, humidity=high, windy=F, play=Y
5 outlook=rainy, temperature=cool, humidity=normal, windy=F, play=Y
6…
Associative Classification

On Weka:
Set car to True.
Specify the index of the class attribute.
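The same two settings can be made through Weka's Java API. A minimal sketch (the file name is hypothetical; classIndex = -1 tells Apriori to use the last attribute as the class):

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CarDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // hypothetical path

        Apriori apriori = new Apriori();
        apriori.setCar(true);       // mine class association rules only
        apriori.setClassIndex(-1);  // -1: the last attribute is the class
        apriori.setNumRules(10);    // report the top ten rules
        apriori.buildAssociations(data);
        System.out.println(apriori);
    }
}
```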
Associative Classification

Result: top ten rules by confidence (shown as a screenshot).
References

• Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012. http://www.cs.illinois.edu/~hanj/bk3/
• Multilevel association rule mining: Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012, pp. 283-287.
• Associative classification: Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012, pp. 416-419.
• Sequential pattern mining: R. Srikant and R. Agrawal, "Mining sequential patterns: generalizations and performance improvements," Proc. 5th Int'l Conf. on Extending Database Technology (EDBT), 1996, pp. 3-17.